L2: InternLM Reinforcement Learning (RL) in Practice

This post walks you through training InternLM on the GSM8K dataset with the GRPO algorithm: it introduces XTuner and its setup, and the hands-on GRPO run is done with ms-swift.

How it works: PPO, DPO, GRPO

RLHF is a key component of LLM training, consisting of supervised fine-tuning (SFT), reward model (RM) training, and reinforcement learning (RL). The RL / preference-optimization stage mainly relies on three algorithms: PPO, DPO, and GRPO.
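
Before the hands-on part, here is a brief sketch of the idea behind GRPO (following the original DeepSeekMath formulation; the notation below is illustrative, not taken from this post): for each prompt the policy samples a group of G completions, each completion receives a scalar reward, and the advantage is the group-normalized reward, so no value network is needed. The policy is then updated with a PPO-style clipped objective plus a KL penalty toward a reference model.

latex
% Group-relative advantage for completion i out of G samples for the same prompt
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}

% PPO-style clipped objective with a KL penalty toward the reference policy
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G}
  \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)} \hat{A}_i,\;
  \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_i \right)
  - \beta\, \mathbb{D}_{\text{KL}}\!\left( \pi_\theta \,\Vert\, \pi_\text{ref} \right) \right]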

XTuner

XTuner V1 is a new-generation LLM training engine built for very large MoE models. Compared with traditional 3D-parallel training architectures, XTuner V1 is deeply optimized for the MoE training scenarios that dominate current research.

Environment setup

Make sure the GPU driver is installed correctly. For example, on NVIDIA GPUs the Driver Version reported by nvidia-smi needs to be greater than 550.127.08.
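
A quick way to check the installed driver version (standard nvidia-smi query flags):

bash
# Print only the driver version, one line per GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader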

Installing XTuner

bash
git clone https://github.com/InternLM/xtuner.git
cd xtuner
pip install -e .

XTuner's dependencies vary by task; for example, training gpt-oss models requires forcing an install of torch 2.8.

For MoE model training it is recommended to additionally install GroupedGEMM:

bash
pip install git+https://github.com/InternLM/GroupedGEMM.git@main

To train FP8 MoE models, install AdaptiveGEMM in addition to the GroupedGEMM above:

bash
pip install git+https://github.com/InternLM/AdaptiveGEMM.git@main

In addition, XTuner recommends installing flash-attn (and flash-attn-3 for RL), which significantly speeds up training; see the official documentation for installation details.
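
For reference, a typical install looks roughly like the following; the exact steps depend on your CUDA and PyTorch versions, so treat this as a sketch and defer to the official docs. FlashAttention-3 is currently built from the hopper/ directory of the flash-attention repository:

bash
# flash-attn 2 from PyPI (needs a CUDA toolchain compatible with your PyTorch build)
pip install flash-attn --no-build-isolation

# flash-attn 3 (Hopper GPUs), built from source; paths and steps may change, check the repo
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install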

To try the RL features early, run the command below to install the RL dependencies. Beyond that, you also need to install the inference engine of your choice; taking LMDeploy as an example, see its official documentation for installation.

bash
pip install -r requirements/rl.txt
# or install directly
# pip install -e '.[rl]'
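
For the LMDeploy example mentioned above, a plain PyPI install is usually enough (check the official docs for the version that matches your CUDA/PyTorch setup):

bash
pip install lmdeploy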

GRPO in Practice

Machine environment

CUDA 12.6

A100 (50%)

Create a conda environment

If you already have an ms-swift environment, you can try running with it first and install any missing packages, or create a new ms-swift environment as follows.

bash
conda create -n ms-swift python=3.10 -y
conda activate ms-swift

Install dependencies

bash
pip install uv

uv pip install -U \
  ms-swift \
  torch==2.8.0 \
  torchvision \
  torchaudio \
  transformers==4.57.1 \
  "modelscope>=1.23" \
  "peft>=0.11,<0.19" \
  trl==0.23.1 \
  deepspeed==0.17.6 \
  vllm==0.11.0 \
  lmdeploy==0.10.2 \
  "evalscope>=1.0" \
  gradio==5.32.1 \
  math_verify==0.5.2 \
  -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
bash
mkdir gsm8k_rl
cd ./gsm8k_rl

GRPO

Dataset preparation

The GSM8K dataset is already available in the dev machine's share directory, so there is no need to download it again:

bash
/share/datasets/gsm8k_datas/

First convert the dataset into the format used for GRPO training with data_pre.py:

python
import re
from datasets import Dataset
import os
import json

SYSTEM_PROMPT = "You are a meticulous mathematical reasoning assistant."

def parse_gsm8k_final_number(raw_answer: str) -> str:
    s = "" if raw_answer is None else str(raw_answer).strip()

    try:
        tail = s.split("####")[-1].strip()
        m = re.search(r"(-?\d+(?:\.\d+)?(?:/\d+(?:\.\d+)?)?)", tail)
        return m.group(1) if m else tail
    except Exception:
        # Fall back to the raw string if parsing fails
        print(f"ERROR: could not parse answer: {raw_answer!r}")
        return s

def to_target_schema(ex):
    q = (ex.get("question") or "").strip()
    a = ex.get("answer")
    ans = parse_gsm8k_final_number(a)

    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content":"Please reason step by step, and put your final answer within \\boxed{}\n" + q},
        ],
        "solution": f"\\boxed{{{ans}}}",
    }

def load_split(split: str):
    path = f"/share/datasets/gsm8k_datas/main/{split}-00000-of-00001.parquet"
    ds = Dataset.from_parquet(path)
    out = ds.map(to_target_schema, remove_columns=ds.column_names)
    return out

train_ds = load_split("train")
test_ds = load_split("test")

def save_as_jsonl(dataset, save_path):
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    with open(save_path, "w", encoding="utf-8") as f:
        for item in dataset:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

train_out = "./data/train.jsonl"
test_out  = "./data/test.jsonl"

save_as_jsonl(train_ds, train_out)
save_as_jsonl(test_ds, test_out)

print(f"Saved train set to: {train_out}")
print(f"Saved test  set to: {test_out}")

Format of the processed dataset:

json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a meticulous mathematical reasoning assistant."
    },
    {
      "role": "user",
      "content": "Please reason step by step, and put your final answer within \\boxed{}\nNatalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
    }
  ],
  "solution": "\\boxed{72}"
}

Evaluation

Evaluate with vLLM, with temperature set to 0. Evaluation script eval.py:

python
import json
import os
from typing import List, Dict, Any
from tqdm import tqdm
from vllm import LLM, SamplingParams
from datasets import Dataset

class MathAccuracy:  
    """数学准确率评估器,使用math_verify包进行LaTeX解析和验证,参考swift accurary reward"""

    def __init__(self):
        import importlib.util
        assert importlib.util.find_spec('math_verify') is not None, (
            "The math_verify package is required but not installed. "
            "Please install it using 'pip install math_verify'.")

    def __call__(self, completions: List[str], solution: List[str], **kwargs) -> List[float]:
        from latex2sympy2_extended import NormalizationConfig
        from math_verify import LatexExtractionConfig, parse, verify

        rewards = []
        for content, sol in zip(completions, solution):
            gold_parsed = parse(sol, extraction_mode='first_match',
                              extraction_config=[LatexExtractionConfig()])

            if len(gold_parsed) != 0:
                # Parse the model-generated answer
                answer_parsed = parse(
                    content,
                    extraction_config=[
                        LatexExtractionConfig(
                            normalization_config=NormalizationConfig(
                                nits=False,
                                malformed_operators=False,
                                basic_latex=True,
                                boxed=True,
                                units=True,
                            ),
                            # Prefer matching \boxed{} content first
                            boxed_match_priority=0,
                            try_extract_without_anchor=False,
                        )
                    ],
                    extraction_mode='first_match',
                )
                # Reward 1 if the content matches the gold answer, else 0
                reward = float(verify(answer_parsed, gold_parsed))
            else:
                # If the gold answer cannot be parsed, skip the sample and give reward 1
                reward = 1.0

            rewards.append(reward)

        return rewards

def load_dataset(data_path: str) -> Dataset:
    if not os.path.exists(data_path):
        raise FileNotFoundError(f"Dataset file not found: {data_path}")

    # Read the JSONL file
    data = []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line.strip()))

    # Convert to a Dataset object
    dataset = Dataset.from_list(data)
    print(f"Loaded {len(dataset)} samples")

    return dataset

def format_prompt(messages: List[Dict[str, str]], tokenizer) -> str:
    # Check whether the tokenizer provides a chat template
    if hasattr(tokenizer, 'chat_template') and tokenizer.chat_template is not None:
        try:
            # Format the messages with the model's chat template
            prompt = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
            )
        except Exception as e:
            # Fall back to the manual format if the template fails
            print(f"Warning: applying chat template failed ({e}); using fallback format")
            prompt = _format_fallback(messages)
    else:
        # No chat template available; use the fallback message format
        prompt = _format_fallback(messages)

    return prompt

def _format_fallback(messages: List[Dict[str, str]]) -> str:
    """
    Fallback formatter used when the tokenizer has no chat template.
    Uses the standard <s><|im_start|>{role}\n{content}<|im_end|> format.
    """
    prompt = "<s>"
    for message in messages:
        role = message.get("role", "user")
        content = message.get("content", "")
        prompt += f"<|im_start|>{role}\n{content}<|im_end|>\n"

    # Append the assistant prefix so the model starts generating
    prompt += "<|im_start|>assistant\n"

    return prompt

def run_evaluation(
    model_path: str,
    data_path: str,
    output_path: str = None,
    tensor_parallel_size: int = 1,
    temperature: float = 0.0,
    max_tokens: int = 2048,
    batch_size: int = 32,
    seed: int = 42,
):
    dataset = load_dataset(data_path)

    prompts = []
    solutions = []

    for item in dataset:
        messages = item.get("messages", [])
        solution = item.get("solution", "")
        prompts.append(messages)  # keep the raw messages for now
        solutions.append(solution)


    # Initialize the vLLM model
    llm = LLM(
        model=model_path,
        tensor_parallel_size=tensor_parallel_size,
        seed=seed,
        dtype="half",  
        trust_remote_code=True,  
    )

    # Get the model tokenizer for applying the chat template
    tokenizer = llm.get_tokenizer()

    # Format the prompts
    formatted_prompts = []
    for messages in prompts:
        prompt = format_prompt(messages, tokenizer)
        formatted_prompts.append(prompt)


    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
        stop=["<|endoftext|>", "<|im_end|>"],
    )

    # Generate answers in batches
    print("inferring...")
    all_completions = []

    for i in tqdm(range(0, len(formatted_prompts), batch_size), desc="Generating"):
        batch_prompts = formatted_prompts[i:i + batch_size]

        outputs = llm.generate(
            batch_prompts,
            sampling_params=sampling_params,
            use_tqdm=False,
        )

        for output in outputs:
            # Collect the generated text
            generated_text = output.outputs[0].text
            all_completions.append(generated_text)

    # Score the answers
    print("evaluating...")
    evaluator = MathAccuracy()
    rewards = evaluator(all_completions, solutions)

    # Compute statistics
    correct_count = sum(rewards)
    total_count = len(rewards)
    accuracy = correct_count / total_count * 100

    print(f"\n========== Evaluation results ==========")
    print(f"Total samples: {total_count}")
    print(f"Correct: {correct_count}")
    print(f"Accuracy: {accuracy:.2f}%")
    print(f"=========================================\n")

    # Save detailed results
    if output_path:
        os.makedirs(os.path.dirname(output_path), exist_ok=True)

        results = {
            "model_path": model_path,
            "data_path": data_path,
            "total_samples": total_count,
            "correct_count": correct_count,
            "accuracy": accuracy,
            "individual_results": []
        }

        for i, (prompt, completion, solution, reward) in enumerate(zip(
            formatted_prompts, all_completions, solutions, rewards
        )):
            results["individual_results"].append({
                "index": i,
                "prompt": prompt,
                "completion": completion,
                "solution": solution,
                "reward": reward,
            })

        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, ensure_ascii=False, indent=2)

        print(f"详细结果已保存到: {output_path}")

    # Save a short summary for quick inspection
    summary_path = output_path.replace('.json', '_summary.json') if output_path else None
    if summary_path:
        summary = {
            "model_path": model_path,
            "total_samples": total_count,
            "correct_count": correct_count,
            "accuracy": accuracy,
        }

        with open(summary_path, 'w', encoding='utf-8') as f:
            json.dump(summary, f, ensure_ascii=False, indent=2)

        print(f"简洁结果已保存到: {summary_path}")

    return accuracy

def main():
    """主函数"""
    import argparse

    parser = argparse.ArgumentParser(description="GSM8K math evaluation")

    parser.add_argument("--model_path", type=str,
                       default="/root/gsm8k_rl/output/Qwen2.5-Math-1.5B/checkpoint-2000",
                       help="Model path or name")
    parser.add_argument("--data_path", type=str,
                       default="/root/gsm8k_rl/data/test.jsonl",
                       help="Dataset path")
    parser.add_argument("--output_path", type=str,
                       default="/root/gsm8k_rl/eval_results.json",
                       help="Output results path")
    parser.add_argument("--tensor_parallel_size", type=int, default=1,
                       help="Tensor parallel size")
    parser.add_argument("--temperature", type=float, default=0.0,
                       help="Sampling temperature (0 means greedy decoding)")
    parser.add_argument("--max_tokens", type=int, default=2048,
                       help="Maximum generation length")
    parser.add_argument("--batch_size", type=int, default=32,
                       help="Batch size")
    parser.add_argument("--seed", type=int, default=42,
                       help="Random seed")

    args = parser.parse_args()

    # Run the evaluation
    accuracy = run_evaluation(
        model_path=args.model_path,
        data_path=args.data_path,
        output_path=args.output_path,
        tensor_parallel_size=args.tensor_parallel_size,
        temperature=args.temperature,
        max_tokens=args.max_tokens,
        batch_size=args.batch_size,
        seed=args.seed,
    )

    return accuracy

if __name__ == "__main__":
    main()

Before training, you can evaluate the base model's accuracy first; it takes about ten minutes and yields 19.86% accuracy.

bash
python eval.py \
      --model_path  /share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b \
      --data_path /root/gsm8k_rl/data/test.jsonl \
      --output_path ./result/base.json \
      --batch_size 32 \
      --max_tokens 1024
json
{
  "model_path": "/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b",
  "total_samples": 1319,
  "correct_count": 262.0,
  "accuracy": 19.863532979529946
}

Reward functions

We use two reward functions here: accuracy and box_reward.

accuracy

Built into swift. This function compares the model's generation against the dataset's solution column and computes an accuracy score: 1.0 if the generation matches the reference answer, 0.0 otherwise.

box_reward

A custom reward: it returns 1 if the model's output wraps the solution in \boxed{}, and 0 otherwise.

box_reward.py

python
import re
from typing import List
from swift.plugin import ORM, orms

class BoxedReward(ORM):
    """Reward: check whether output contains \\boxed{...}"""

    def __call__(self, completions, **kwargs) -> List[float]:
        pattern = re.compile(r"\\boxed\s*\{.*?\}", re.DOTALL)
        return [1.0 if pattern.search(str(c)) else 0.0 for c in completions]

orms["box_reward"] = BoxedReward

Model training

bash
#!/bin/bash
set -e
# Create the log directory
LOG_DIR="./logs"
mkdir -p "$LOG_DIR"

# Timestamp for the log file name
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$LOG_DIR/[GRPO]internlm2_5-1_8b_${TIMESTAMP}.log"

export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0
export MASTER_PORT=$((10000 + RANDOM % 50000))

export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_LOGGING_LEVEL=INFO

{
  echo "===== Training start: $(date) ====="
  echo "Log file: $LOG_FILE"
  echo "Using port: $MASTER_PORT"
  echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
  echo "Enable vLLM: true"
} >> "$LOG_FILE"

nohup swift rlhf \
  --rlhf_type grpo \
  --model '/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b' \
  --dataset './data/train.jsonl#4000' \
  --external_plugins ./box_reward.py \
  --reward_funcs accuracy box_reward \
  --reward_weights 0.5 0.5 \
  --eval_steps 50 \
  --train_type lora \
  --target_modules all-linear \
  --max_completion_length 768 \
  --torch_dtype bfloat16 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 4 \
  --learning_rate 5e-6 \
  --warmup_ratio 0.05 \
  --gradient_accumulation_steps 4 \
  --save_steps 50 \
  --save_total_limit 5 \
  --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
  --logging_steps 5 \
  --max_length 2048 \
  --output_dir ./grpo_out \
  --dataset_num_proc 8 \
  --dataloader_num_workers 0 \
  --freeze_vit true \
  --log_completions true \
  --use_vllm true \
  --vllm_gpu_memory_utilization 0.50 \
  --vllm_max_model_len 2048 \
  --vllm_tensor_parallel_size 1 \
  --vllm_enforce_eager false \
  --vllm_mode colocate \
  > "$LOG_FILE" 2>&1 &

TRAIN_PID=$!
sleep 2

if kill -0 "$TRAIN_PID" 2>/dev/null; then
  echo "Training started successfully with PID $TRAIN_PID"
  echo "To view logs in real-time, use:"
  echo "tail -f $LOG_FILE"
  echo ""
  echo "To stop training, use:"
  echo "kill -9 $TRAIN_PID"
else
  echo "Failed to start training process"
  echo "Check log file for errors: $LOG_FILE"
fi

Tips:

  1. Time: this script ran for roughly 8 hours using 4000 of the 7473 training samples. You can cut the training cost in a few ways:

    1. Data volume: change #4000 in the --dataset argument to #2000 to halve the data; the final accuracy will likely drop somewhat.

    2. **Epochs:** the setting above is 2; you can lower it to 1.

  2. The two reward functions are weighted 0.5 / 0.5 here; feel free to explore better weight settings.

  3. --num_generations is not specified, so the default of 8 is used; it works well and does not need changing.

  4. Swift uses LoRA by default, and the default LoRA parameters are used here as well.

  5. ...

Merge the model

Note: replace the path with your own LoRA adapter checkpoint path.

bash
swift export --adapters "/root/gsm8k_rl/grpo_out/v0-20251230-214229/checkpoint-2000" --merge_lora True

If the merged checkpoint is produced without errors, this step succeeded.

Evaluate the trained model

Note: replace the path with your own merged model path.

bash
python eval.py \
      --model_path  /root/gsm8k_rl/grpo_out/v0-20251230-214229/checkpoint-2000-merged \
      --data_path /root/gsm8k_rl/data/test.jsonl \
      --output_path ./result/grpo.json \
      --batch_size 32 \
      --max_tokens 1024 
json
{
  "model_path": "/root/gsm8k_rl/grpo_out/v0-20251230-214229/checkpoint-2000-merged",
  "total_samples": 1319,
  "correct_count": 455.0,
  "accuracy": 34.495830174374525
}

A result like the one above, with accuracy rising from 19.86% to 34.50%, means the GRPO training worked.


L2: InternLM Reinforcement Learning (RL) in Practice (ms-swift)

Environment setup

The environment is based on the ms-swift11 environment from the L2G2 Intern-S1-mini fine-tuning lesson.

bash
# If the environment is not set up yet, it can be re-created with the commands below.
# Create the environment:
conda create -n ms-swift11 python=3.10 -y
conda activate ms-swift11

# Install the VLM dependencies:
conda activate ms-swift11
cd /root
git clone https://gh.llkk.cc/https://github.com/fak111/VLM-formula-recognition-dataset.git
cd VLM-formula-recognition-dataset
pip install -r requirements.txt
pip install transformers -U
# Install the ms-swift dependencies:
cd /root
git clone https://gh.llkk.cc/https://github.com/modelscope/ms-swift.git
cd ms-swift
git checkout cab4aa59
pip install -e .
pip install msgspec
# Then install the dependencies needed for GRPO fine-tuning:
conda activate ms-swift11
pip install math_verify==0.5.2 vllm

Model download

Skip the model download step and point directly to the local Intern-S1-mini path: /share/new_models/Intern-S1-mini, or /share/new_models/InternVL3.5/InternVL3_5-4B.

Dataset download

The dataset needs to follow a fixed format, with chain-of-thought (thinking) content included. Usable training records look like this:

json
{
    "data_source": "openai/gsm8k", 
    "ability": "math", 
    "reward_model": 
        {"ground_truth": "72", "style": "rule"}, 
    "extra_info": 
        {"answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72", "index": 0, "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "split": "train"}, 
    "messages": 
        [{"content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".", "role": "user", "type": "text"}], 
    "solution": "72"
}
{
    "data_source": "openai/gsm8k", 
    "ability": "math", 
    "reward_model": {"ground_truth": "10", "style": "rule"}, 
    "extra_info": {"answer": "Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.\nWorking 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.\n#### 10", "index": 1, "question": "Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?", "split": "train"}, 
    "messages": [{"content": "Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Let's think step by step and output the final answer after \"####\".", "role": "user", "type": "text"}], 
    "solution": "10"
}
...

2,000 examples filtered from the GSM8K dataset have already been prepared.

After downloading, the dataset is stored under /root/share/datasets.
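
If you want to regenerate such a file yourself, the sketch below converts GSM8K parquet files into the record format shown above. It is optional (the prepared camp6_l2g4_grpo_gsm8k_trainset.jsonl already exists) and assumes the same parquet layout used in the first half of this post; the output filename is hypothetical.

python
import json
from datasets import Dataset

# Prompt suffix used by the training records shown above.
SUFFIX = ' Let\'s think step by step and output the final answer after "####".'

def convert(split: str = "train", n: int = 2000,
            out_path: str = "./gsm8k_grpo_train.jsonl"):
    # Assumed parquet path, reusing the layout from the first half of this post.
    ds = Dataset.from_parquet(
        f"/share/datasets/gsm8k_datas/main/{split}-00000-of-00001.parquet")
    with open(out_path, "w", encoding="utf-8") as f:
        for i, ex in enumerate(ds.select(range(min(n, len(ds))))):
            # GSM8K stores the final answer after "####" in the answer field.
            final = ex["answer"].split("####")[-1].strip()
            record = {
                "data_source": "openai/gsm8k",
                "ability": "math",
                "reward_model": {"ground_truth": final, "style": "rule"},
                "extra_info": {"answer": ex["answer"], "index": i,
                               "question": ex["question"], "split": split},
                "messages": [{"content": ex["question"] + SUFFIX,
                              "role": "user", "type": "text"}],
                "solution": final,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    convert()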

Train directly

bash
conda activate ms-swift11
CUDA_VISIBLE_DEVICES=0 swift rlhf \
  --rlhf_type grpo \
  --model "/root/share/new_models/Intern-S1-mini" \
  --train_type lora \
  --lora_rank 8 \
  --lora_alpha 16 \
  --target_modules all-linear \
  --dataset '/root/share/datasets/camp6_l2g4_grpo_gsm8k_trainset.jsonl' \
  --torch_dtype bfloat16 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 6 \
  --gradient_accumulation_steps 4 \
  --learning_rate 5e-6 \
  --save_total_limit 2 \
  --logging_steps 10 \
  --save_strategy steps \
  --save_steps 500 \
  --output_dir '/root/Internlm/output/grpo-intern-s1-mini-gsm8k' \
  --max_completion_length 256 \
  --reward_funcs accuracy \
  --num_generations 2 \
  --generation_batch_size 6 \
  --use_vllm false \
  --offload_model true \
  --offload_optimizer true \
  --freeze_llm false \
  --freeze_vit true \
  --dataloader_num_workers 4 \
  --lazy_tokenize True

If the run reports errors, they need to be resolved before moving on.

Model merging

Merging uses the swift export command:

bash
conda activate ms-swift11
swift export --adapters <last_model_checkpoint_path> --merge_lora True

The actual checkpoint path is /root/Internlm/output/grpo-intern-s1-mini-gsm8k/v1-20251102-010146/checkpoint-167, so the merge command is:

bash
conda activate ms-swift11
swift export --adapters '/root/Internlm/output/grpo-intern-s1-mini-gsm8k/v1-20251102-010146/checkpoint-167' --merge_lora True

Model inference

Deploy the merged model on the server side with vLLM, then open a new terminal and call it at http://localhost:54257/v1/chat/completions.

Based on the vllm_py312 environment from the L1G1 lesson, quickly deploy the fine-tuned model:

bash
conda create -n vllm_py312 python=3.12 -y
conda activate vllm_py312
pip install vllm

Serve it with vLLM on the server side:

bash
conda activate vllm_py312
vllm serve /root/Internlm/output/grpo-intern-s1-mini-gsm8k/v1-20251102-010146/checkpoint-167-merged \
  --trust-remote-code \
  --port 54257 \
  --max-model-len 8000 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 8 \
  --swap-space 8 \
  --enforce-eager

The deployment is now ready and waiting for client requests.

On the client side, run:

bash
curl http://localhost:54257/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "//root/Internlm/output/grpo-intern-s1-mini-gsm8k/v1-20251102-010146/checkpoint-167-merged",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Lets think step by step and output the final answer after \"####\"."}
        ]
      }'
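
Equivalently, the same endpoint can be called from Python with the openai client (a minimal sketch; requires `pip install openai`, and the model name must match the path passed to vllm serve):

python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; any api_key value works for a local server.
client = OpenAI(base_url="http://localhost:54257/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/root/Internlm/output/grpo-intern-s1-mini-gsm8k/v1-20251102-010146/checkpoint-167-merged",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",
         "content": "Weng earns $12 an hour for babysitting. Yesterday, she just did "
                    "50 minutes of babysitting. How much did she earn? "
                    "Let's think step by step and output the final answer after \"####\"."},
    ],
)
print(resp.choices[0].message.content)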