DeepSeek-Math-V2 Explained: Bringing Dense Reward Signals Back to RLVR

Over the past six months I have been busy with work and research and have not posted for a while, but that does not mean I have stopped tracking the field; more research notes will follow.

Today I take a detailed look at the recently released DeepSeek-Math-V2. The original DeepSeek-Math introduced GRPO, the now widely adopted RLVR technique, back in 2024, so what does the V2 release bring after a year of incubation?

Most recent RL recipes rely on outcome reward models (ORM): as long as the final answer is correct, the whole reasoning process is treated as correct. This has proven effective on hard math benchmarks such as AIME and HMMT, but it still has problems:

  • the model may produce the correct final answer while the reasoning itself contains logical errors;
  • outcome-only feedback is hard to apply to tasks such as theorem proving, which depend critically on the correctness of the reasoning process.

Effectively verifying the reasoning process is therefore crucial for automated theorem proving (ATP) tasks. The main motivations are:

  • Even without a reference solution, humans can spot flaws in a proof, which is essential for open problems where the answer is unknown.
  • If large-scale verification still fails to identify any issue, the proof is more likely to be valid.
  • The effort needed to find a valid issue can serve as a measure of proof quality and can be used to optimize proof generation.

With a verifier in place, the proof generator and the verifier can be updated in tandem:

  • use verification feedback to optimize proof generation;
  • scale up verification compute to automatically label new, hard-to-verify proofs, creating training data that improves the verifier itself;
  • use the improved verifier to further optimize proof generation. Moreover, a reliable proof verifier can teach the proof generator to evaluate proofs the way the verifier does, allowing the generator to iteratively refine its proofs until it can no longer find or fix any issue.

In essence, the model is made explicitly aware of its reward function and can maximize it through deliberate reasoning rather than blind trial and error.

The paper presents DeepSeek-Math-V2, a natural-language theorem-proving model trained on top of DeepSeek-V3.2-Exp-Base. It reaches gold-medal level on IMO 2025 and CMO 2024, and scores 118/120 on Putnam 2024.

1. Method

Proof Verification

The verifier is trained with a cold-start + RL recipe. The verifier's task is defined as follows:

Given a problem and a proof, the verifier outputs one of three scores:

  • 1.0: the generated proof is completely correct;
  • 0.5: the overall logic is correct, but there are minor errors;
  • 0: the overall reasoning is logically wrong.

Cold-Start Data Construction

  • Crawl proof-related problems from AoPS, mathematical olympiads, team selection tests, and other post-2010 competition sources, for a total of 17,503 queries;
  • Generate proofs with DeepSeek-V3.2-Exp-Thinking. Since this model has not been specifically optimized for proof writing, its outputs tend to be short and to contain logical errors, so multiple rounds of iteration are used to improve them;
  • Randomly sample problems across domains such as algebra and number theory, sample proofs for them, and have human annotators score each proof on the 1.0 / 0.5 / 0 scale.

This yields an RL dataset for training the verifier, where each example contains a problem, a proof, and the corresponding human score.

The instruction format of the RL prompts is as follows:

## Instruction
Your task is to evaluate the quality of a solution to a problem. The problem may ask for a proof of a statement, or ask for an answer. If finding an answer is required, the solution should present the answer, and it should also be a rigorous proof of that answer being valid.

Please evaluate the solution and score it according to the following criteria:
- If the solution is completely correct, with all steps executed properly and clearly demonstrated, then the score is 1
- If the solution is generally correct, but with some details omitted or minor errors, then the score is 0.5
- If the solution does not actually address the required problem, contains fatal errors, or has severe omissions, then the score is 0
- Additionally, referencing anything from any paper does not save the need to prove the reference. It's okay IF AND ONLY IF the solution also presents a valid proof of the reference argument(s); otherwise, if the solution omits the proof or if the proof provided is not completely correct, the solution should be scored according to the criteria above, and definitely not with a score of 1

Please carefully reason out and analyze the quality of the solution below, and in your final response present a detailed evaluation of the solution's quality followed by your score. Therefore, your response should be in the following format:

Here is my evaluation of the solution: ... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution.

Based on my evaluation, the final overall score should be: \\boxed{{...}} // where ... should be the final overall score (0, 0.5, or 1, and nothing else) based on the above criteria

--

Here is your task input:

## Problem
{question}

## Solution
{proof}

RL Training

RL training starts from DeepSeek-V3.2-Exp-SFT (a model SFT'd on math, code, and other reasoning tasks). The RL objective is to make the model produce a proof analysis for a given proof. The reward signal has two components:

  • Format: constrains the verifier to output an evaluation summary and a proof score. If the output contains "Here is my evaluation of the solution:", "Based on my evaluation, the final overall score should be:", and a boxed score, the format reward is 1;
  • Score reward: let $s_i'$ be the boxed score produced by the verifier and $s_i$ the human-annotated score; the reward is $R_{score}(s_i', s_i) = 1 - |s_i' - s_i|$. If the predicted score matches the human score exactly, the reward is 1; the larger the gap, the smaller the reward, with a minimum of 0.

The final RL training objective is:

$$\max_{\pi_{\phi}} \mathbb{E}_{(X_i, Y_i, s_i)\sim\mathcal{D}_v,\,(V_i', s_i')\sim\pi_{\phi}(\cdot\mid X_i, Y_i)}\left[R_{format}(V_i')\cdot R_{score}(s_i', s_i)\right]$$
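To make the two reward terms concrete, here is a minimal Python sketch of how such a verifier reward could be computed. The phrase checks, the regex for the boxed score, and the function names are my own illustration, not the paper's released code.

```python
import re

def format_reward(verifier_output: str) -> float:
    """1.0 only if the output contains both required phrases and a boxed score."""
    has_eval = "Here is my evaluation of the solution:" in verifier_output
    has_verdict = "Based on my evaluation, the final overall score should be:" in verifier_output
    has_boxed = re.search(r"\\boxed\{(0\.5|0|1)\}", verifier_output) is not None
    return 1.0 if (has_eval and has_verdict and has_boxed) else 0.0

def extract_score(verifier_output: str) -> float | None:
    """Pull the boxed score (0, 0.5, or 1) out of the verifier's output, if present."""
    match = re.search(r"\\boxed\{(0\.5|0|1)\}", verifier_output)
    return float(match.group(1)) if match else None

def score_reward(predicted: float, human: float) -> float:
    """R_score(s', s) = 1 - |s' - s|: 1 for exact agreement, 0 for maximal disagreement."""
    return 1.0 - abs(predicted - human)

def verifier_reward(verifier_output: str, human_score: float) -> float:
    """R_format * R_score, matching the training objective above."""
    predicted = extract_score(verifier_output)
    if format_reward(verifier_output) == 0.0 or predicted is None:
        return 0.0
    return score_reward(predicted, human_score)
```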

Meta-Verification

One concern: the verifier may find it easy to predict the right score (i.e., agree with the human label) while its written proof analysis is still flawed, for example containing incorrect reasoning or hallucinating issues that do not exist.

Meta-verification is therefore introduced to double-check the verifier's proof analysis and make sure the issues it identifies are actually valid.

The meta-verifier prompt is shown below:

You are given a "problem", "solution", and "solution evaluation", and you need to assess whether this "solution evaluation" is reasonable.

First, the "solution evaluation" is generated to evaluate the quality of the "solution", by prompting a verifier with the rules below (these are not your rules):

'''
Please evaluate the solution and score it according to the following criteria:
- If the solution is completely correct, with all steps executed properly and clearly demonstrated, then the score is 1
- If the solution is generally correct, but with some details omitted or minor errors, then the score is 0.5
- If the solution does not actually address the required problem, contains fatal errors, or has severe omissions, then the score is 0

Additionally, referencing anything from any paper does not save the need to prove the reference. It's okay IF AND ONLY IF the solution also presents a valid proof of the reference argument(s); otherwise, if the solution omits the proof or if the proof provided is not completely correct, the solution should be scored according to the criteria above, and definitely not with a score of 1
'''

Next, I will introduce the rules for you to analyze the quality of the "solution evaluation":

1. Your task is to analyze the "solution evaluation". You do not need to solve the "problem", nor do you need to strictly assess whether the "solution" is accurate. Your only task is to strictly follow the rules below to evaluate whether the "solution evaluation" is reasonable.

2. You need to analyze the content of the "solution evaluation" from three aspects:

Step Restatement: In the "solution evaluation", certain behaviors of the "solution" may be restated. You need to return to the original text of the "solution" and check whether the "solution" actually has these behaviors mentioned in the "solution evaluation".

Defect Analysis: The "solution evaluation" may point out errors or defects in the "solution". You need to carefully analyze whether the mentioned errors and defects are indeed valid.

Expression Analysis: Whether the "solution evaluation"'s expressions are accurate.

Score Analysis: Whether the final score given by the "solution evaluation" matches the defects it found. You need to analyze according to the scoring rules given above.

3. The most important part is **defect analysis**: In this part, your core task is to check whether the errors or defects of the "solution" pointed out in the "solution evaluation" are reasonable. In other words, any positive components about the "solution" in the "solution evaluation", regardless of whether they are reasonable, are not within your evaluation scope.

- For example: If the "solution evaluation" says that a certain conclusion in the "solution" is correct, but actually this conclusion is incorrect, then you do not need to care about this point. All parts that the "solution evaluation" considers correct do not belong to your evaluation scope.

- Specifically: If the "solution evaluation" believes that the "solution" is completely accurate and has not found any errors or defects, then regardless of whether the "solution" itself is actually accurate, even if there are obvious errors, you should still consider its analysis of errors to be reasonable.

**Importantly**, for defects found by the "solution evaluation", you need to analyze two points simultaneously:

- whether this defect actually exists

- whether the "solution evaluation"'s analysis of this defect is accurate

These two aspects constitute the analysis of defects.

4. About **expression analysis**: if there are certain expression errors in the "solution evaluation", even minor errors in details, you need to identify them. However, please note that identifying incorrect steps in the "solution" as correct steps does not constitute an **expression error**. In practice, expression errors include but are not limited to:

- If the "solution evaluation" identifies some reasoning step(s) in the "solution" as incorrect, then it cannot further indicate that subsequent conclusion(s) depending on those reasoning step(s) are wrong, but can only indicate that subsequent conclusion(s) are "not rigorously demonstrated."

- Typos and calculation errors made by the "solution evaluation"

- Inaccurate restatement of content from the "solution"

5. Finally, you need to present your analysis of the "solution evaluation" in your output and also rate its quality based on the rules below:

First, if there is at least one unreasonable defect among the defects found by the "solution evaluation", then you only need to do **defect analysis**:

- If all defects found by the "solution evaluation" are unreasonable, then you should rate it with \(0\)

- If some defects found by the "solution evaluation" are reasonable and some are unreasonable, then your rating should be \(0.5\)

Next, if the "solution evaluation" points out no errors or defects, or all defects found by the evaluation are reasonable, then you should do the following things:

- Analyze whether "expression errors" exist in the "solution evaluation" (**expression analysis**) or whether the "solution evaluation" gives a wrong score according to the rules for "solution evaluation" (**score analysis**). If yes, you should rate the "solution evaluation" with \(0.5\); if no, your rating should be \(1\)

Your output should follow the format below:

Here is my analysis of the "solution evaluation":
... // Your analysis here.

Based on my analysis, I will rate the "solution evaluation" as: \\boxed{{...}} // where ... should be a numerical rating of the "solution evaluation" (0, 0.5, or 1, and nothing else) based on the criteria above.

--

Here is your task input:

## Problem
{question}

## Solution
{proof}

## Solution Evaluation
{proof analysis}

The meta-verifier is also trained with RL, as follows:

  • Given a problem and a proof, first obtain the verifier's proof analysis and score;
  • Have human experts rate that proof analysis, yielding an annotated dataset in which each example contains a problem, a proof, a proof analysis, a proof score, and the human rating of the analysis;
  • RL training mirrors the verifier's, with a format reward and a score reward.

Once the meta-verifier has been trained with RL, the verifier is trained with RL again. In this round, besides the format reward $R_{format}$ and the score reward $R_{score}$, the reward also incorporates the meta-verifier's feedback score $R_{meta}$, so the verifier's final RL reward is:

$$R_V = R_{format}\cdot R_{score}\cdot R_{meta}$$
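A small extension of the earlier sketch shows how the meta term could gate the reward; treating $R_{meta}$ as the meta-verifier's 0 / 0.5 / 1 rating of the proof analysis is my reading of the setup, not code from the paper.

```python
def enhanced_verifier_reward(r_format: float, r_score: float, r_meta: float) -> float:
    """R_V = R_format * R_score * R_meta.

    r_meta is the meta-verifier's rating (0, 0.5, or 1) of the proof analysis.
    An analysis that predicts the right score but hallucinates defects gets
    r_meta < 1, so the multiplicative form shrinks its reward accordingly."""
    return r_format * r_score * r_meta
```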

During this enhanced verifier training, meta-verifier data is mixed in as well, so the final verifier can not only analyze and score proofs but also score proof analyses.

The final verifier thus combines proof verification (analysis plus score) with meta-verification.

Proof Generator

Proof Generator RLVR

The verifier can be viewed as a generative reward model (GRM), on top of which the proof generator is trained. The generator's RL training amounts to maximizing the verifier's score for its proofs:

$$\max_{\pi_{\theta}} \mathbb{E}_{X_i\sim\mathcal{D}_p,\,Y_i\sim\pi_{\theta}(\cdot\mid X_i)}\left[R_Y\right]$$
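Since the paper trains with GRPO (see the training details below), the verifier score $R_Y$ enters the update through group-relative advantages. The following sketch shows the standard GRPO normalization over a group of sampled proofs; the group size and reward values are illustrative.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Standard GRPO-style advantages: center and scale each sampled proof's
    verifier reward by the mean and standard deviation of its group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in group_rewards]

# Four proofs sampled for the same problem, scored by the verifier (R_Y = s).
verifier_scores = [1.0, 0.5, 0.0, 0.5]
print(grpo_advantages(verifier_scores))  # above-average proofs get positive advantage
```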

Enhancing with Self-Verification

Hard problems such as those from the IMO and CMO are usually not solvable by the proof generator in a single attempt.

Following earlier self-verification work, the verifier is used as a tool: since the model may not succeed on the first try, it can keep revising its proof under the verifier's feedback.

In practice, however, a problem emerges: the RL-trained generator's one-shot proofs may still contain errors, and even when the verifier clearly points them out, the generator may keep insisting that its proof is fine.

In other words, the generator does revise its proof based on the verifier's feedback, but it does not rigorously evaluate its own output the way the verifier does, so it tends to dismiss criticism and stand by its answer.

The generator's RL training therefore also needs to adopt the strategy used in the verifier's RL training.

The proof generator's prompt is shown below:

Your task is to solve a given problem. The problem may ask you to prove a statement, or ask for an answer. If finding an answer is required, you should come up with the answer, and your final solution should also be a rigorous proof of that answer being valid.

Your final solution to the problem should be exceptionally comprehensive and easy-to-follow, which will be rated according to the following evaluation instruction:

'''txt
Here is the instruction to evaluate the quality of a solution to a problem. The problem may ask for a proof of a statement, or ask for an answer. If finding an answer is required, the solution should present the answer, and it should also be a rigorous proof of that answer being valid.

Please evaluate the solution and score it according to the following criteria:

- If the solution is completely correct, with all steps executed properly and clearly demonstrated, then the score is 1

- If the solution is generally correct, but with some details omitted or minor errors, then the score is 0.5

- If the solution does not actually address the required problem, contains fatal errors, or has severe omissions, then the score is 0

Additionally, referencing anything from any paper does not save the need to prove the reference. It's okay IF AND ONLY IF the solution also presents a valid proof of the reference argument(s); otherwise, if the solution omits the proof or if the proof provided is not completely correct, the solution should be scored according to the criteria above, and definitely not with a score of 1
'''

In fact, you already have the ability to rate your solution yourself, so you are expected to reason carefully about how to solve a given problem, evaluate your method according to the instruction, and refine your solution by fixing issues identified until you can make no further progress.

In your final response, you should present a detailed solution to the problem followed by your evaluation of that solution.

- To give a good final response, you should try your best to locate potential issues in your own (partial) solution according to the evaluation instruction above, and fix them as many as you can.

- A good final response should just faithfully present your progress, including the best solution you can give, as well as a faithful evaluation of that solution.

- Only when you fail to locate any issues in your solution should you score it with 1.

- If you do notice some issues in your solution but fail to resolve them with your best efforts, it's totally ok to faithfully present the issues in your final response.

- The worst final response would provide a wrong solution but lie that it's correct or claim that it's correct without careful error checking. A better version should faithfully identify errors in the solution. Remember! You CAN'T cheat! If you cheat, we will know, and you will be penalized!

Your final response should be in the following format:

## Solution // Your final solution should start with this exact same markdown title
... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.

## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title

Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point them out.

Based on my evaluation, the final overall score should be: \\boxed{{...}} // where ... should be the final overall score (0, 0.5, or 1, and nothing else) based on the evaluation instruction above. You should reach this score ONLY AFTER careful RE-examination of your own solution above

--

Here is your task input:

## Problem
{question}

As the prompt shows, the generator must not only produce a proof but also, like the verifier, provide its own self-analysis and score.

During self-verification-enhanced training, the generator first produces a proof together with a self-analysis and score; the self-analysis and score follow the same format and rubric as the verifier's.

  • Let $Y$ denote the proof produced by the generator, $Z$ its self-analysis, and $s'$ its self-assigned score;
  • The verifier also assigns the generated proof $Y$ a score, denoted $s$;
  • Meta-verification assigns the generator's self-analysis $Z$ a meta score, denoted $ms$.
The generator's reward can then be designed as:

$$R = R_{format}\cdot\left(\alpha\cdot R_Y + \beta\cdot R_Z\right)$$

where

● $R_Y = s$

● $R_Z = R_{score}(s', s)\cdot R_{meta}(Z)$

● $R_{meta} = ms$

● $R_{score}(s', s) = 1 - |s' - s|$

● $\alpha = 0.76$, $\beta = 0.24$.

From this reward design we can observe the following (a small reward-computation sketch follows the list):

  • the generator's output must first satisfy the format requirement (0 or 1);
  • the higher the verifier's score for the generated proof ($R_Y = s$), the higher the reward, and this term carries the largest share of the total reward (weight 0.76);
  • the self-analysis must be correct: this part of the reward ($R_Z$, weight 0.24) comes from two sources, namely that the generator's self-assigned score should match the verifier's score as closely as possible (the smaller the gap, the larger $R_{score}$), and that the self-analysis must also pass the meta-verification check ($R_{meta}$).
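Here is a minimal sketch of how these components could be combined, assuming the format reward gates a weighted sum of the proof term and the self-analysis term as described above; the exact combination is my reconstruction from the listed components, not the paper's released code.

```python
def generator_reward(
    r_format: float,        # 1 if the "## Solution" / "## Self Evaluation" format is respected, else 0
    verifier_score: float,  # s: the verifier's score for the generated proof Y
    self_score: float,      # s': the generator's own boxed score for its proof
    meta_score: float,      # ms: the meta-verification rating of the self-analysis Z
    alpha: float = 0.76,
    beta: float = 0.24,
) -> float:
    """Proof quality (R_Y = s) dominates with weight alpha; the self-analysis term
    R_Z rewards agreeing with the verifier and passing meta-verification."""
    r_y = verifier_score
    r_z = (1.0 - abs(self_score - verifier_score)) * meta_score
    return r_format * (alpha * r_y + beta * r_z)
```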

Joint Optimization of Proof Verification and Generation

The verifier can guide the generator toward high-quality proofs fairly well, but it sometimes runs into trouble: for some generated proofs, the verifier cannot reach a correct judgment in a single attempt. The most direct fix, repeatedly bringing in human annotators to label these proofs, is slow and labor-intensive.

A pipeline is therefore proposed to auto-label verification data (a code sketch follows the list):

  1. For each proof, use the verifier to generate n independent proof analyses.
  2. For each proof analysis that scores 0 or 0.5, generate m meta-verifications to check the issues it identified. If a majority of the meta-verifications confirm the analysis, it is considered valid.
  3. For each proof, examine the lowest-scoring proof analyses. If at least k of them are considered valid, label the proof with that lowest score. If no legitimate issue was found across all verification attempts, label the proof 1. Otherwise, the proof is discarded or sent to experts for annotation.
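Under stated assumptions, here is a sketch of this labeling rule in Python. The majority threshold for meta-confirmation and the default value of k are illustrative; the description above does not pin them down.

```python
def auto_label_proof(analyses: list[tuple[float, list[float]]], k: int = 3) -> float | None:
    """analyses: n pairs of (verifier score, m meta-verification ratings of that analysis).
    Returns an auto-label (0, 0.5, or 1), or None if the proof needs expert annotation."""
    # Step 2: an issue-finding analysis (score 0 or 0.5) is valid only if a majority
    # of its meta-verifications confirm the issues it identified.
    valid_issue_scores = [
        score for score, meta in analyses
        if score < 1.0 and sum(rating >= 0.5 for rating in meta) > len(meta) / 2
    ]
    # Step 3a: no legitimate issue found in any attempt -> label the proof as correct.
    if not valid_issue_scores:
        return 1.0
    # Step 3b: at least k valid analyses agreeing on the lowest score -> adopt that label.
    lowest = min(valid_issue_scores)
    if valid_issue_scores.count(lowest) >= k:
        return lowest
    # Otherwise discard the proof or escalate it to human experts.
    return None
```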

Proof verification and generation are trained in alternation, and in the final two rounds the human-annotation step is removed entirely.

2. Experiments

(1) Training Details

GRPO is used as the RL algorithm, with multiple rounds of iterative training.

  • In each round, the verifier is optimized first; the optimized verifier then serves as the initialization for the proof generator, which is optimized next;
  • From the second round onward, each round's verifier is initialized from the generator optimized in the previous round (by then the generator combines verification, meta-verification, and proof-generation abilities), and a rejection fine-tuning (RFT) stage is inserted before the verifier optimization.

(2) Evaluation Datasets

In-house evaluation set: 91 problems

External benchmarks:

● IMO 2025 (6 problems);

● CMO 2024 (6 problems);

● Putnam 2024 (12 problems);

● ISL 2024 (31 problems);

● IMO-ProofBench (60 problems).

(3) Evaluation Protocol

One-shot Generation

This evaluates the model's ability to produce a correct proof in a single attempt. For each problem, 8 proofs are generated and each is sent to the verifier; the verifier produces 8 analyses per proof, and a vote over these analyses decides whether the proof is correct.
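A sketch of this evaluation loop, assuming hypothetical generate and verify callables; the tie-breaking rule in the vote (falling back to the lower score) is my own conservative choice and is not specified in the paper.

```python
from collections import Counter

def vote_score(analysis_scores: list[float]) -> float:
    """Majority vote over repeated verifier analyses of one proof; ties resolve to the lower score."""
    counts = Counter(analysis_scores)
    top = max(counts.values())
    return min(score for score, count in counts.items() if count == top)

def evaluate_one_shot(problem, generate, verify, n_proofs: int = 8, n_analyses: int = 8):
    """Sample n_proofs proofs, verify each n_analyses times, and return every proof's voted score."""
    results = []
    for _ in range(n_proofs):
        proof = generate(problem)
        analyses = [verify(problem, proof) for _ in range(n_analyses)]
        results.append((proof, vote_score(analyses)))
    return results
```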

The one-shot metrics on the in-house evaluation set are as follows:

Self-Verification

The generation process with self-verification works as follows:

● First, generate a proof along with its self-analysis;

● Then, re-prompt the generator with the problem, the proof generated so far, and its self-analysis, and let the generator think again.

The re-prompt is shown below:

{proof_generation_prompt}

## Candidate Solution(s) to Refine
Here are some solution sample(s) along with their correctness evaluation(s). You should provide a better solution by solving issues mentioned in the evaluation(s), or by reusing promising ideas mentioned in the solution sample(s), or by doing both.

{proof}
{proof analyses}

## Final Instruction
Your final response should follow the format above, including a '## Solution' section followed by a '## Self Evaluation' section

At inference time, the model keeps performing self-verification until it assigns its own proof a score of 1 or the round limit is reached.
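A sketch of this self-verification loop, with hypothetical generate and re_prompt callables and an illustrative round limit; the stopping rule follows the description above.

```python
def refine_with_self_verification(problem, generate, re_prompt, max_rounds: int = 8):
    """Keep refining until the generator rates its own proof 1 or the round limit is hit.
    `generate(prompt)` is assumed to return (proof, self_analysis, self_score)."""
    prompt = problem
    proof, analysis, self_score = generate(prompt)
    for _ in range(max_rounds - 1):
        if self_score == 1.0:
            break  # the generator can no longer find issues in its own proof
        prompt = re_prompt(problem, proof, analysis)
        proof, analysis, self_score = generate(prompt)
    return proof, analysis, self_score
```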

For evaluation, 32 refinement trajectories are generated per problem. The final proof of each trajectory is evaluated 32 times by the final trained verifier, and the voting result is taken as the proof's score.

  • pass@1: the mean verifier score over a problem's 32 trajectories;
  • best@32: among a problem's 32 trajectories, the one whose final proof received the highest score from the generator's own self-analysis is selected, and its verifier score is reported (a minimal sketch of both metrics follows this list).

From the experimental results we can see:

  • as the number of refinement rounds grows, the overall metrics keep improving;
  • best@32 is consistently higher than pass@1, showing that proofs selected by the generator's own self-analysis scores are more likely to be correct, which in turn suggests that the generator's self-analysis is reliable.
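A minimal sketch of the two metrics over one problem's 32 trajectories; the function names are mine.

```python
def pass_at_1(verifier_scores: list[float]) -> float:
    """pass@1: mean voted verifier score over all trajectories of one problem."""
    return sum(verifier_scores) / len(verifier_scores)

def best_at_n(self_scores: list[float], verifier_scores: list[float]) -> float:
    """best@32: verifier score of the trajectory the generator itself rated highest."""
    best_idx = max(range(len(self_scores)), key=lambda i: self_scores[i])
    return verifier_scores[best_idx]
```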


High-Compute Search

This mode is aimed at topping the IMO and CMO leaderboards. Both verification and generation compute are scaled up: extensive verification is used to catch subtle issues, and parallel generation is used to explore different proof strategies.

Concretely:

  • A pool of candidate proofs is maintained for each problem, initialized with 64 proofs, each paired with 64 analyses. Multiple rounds of self-verification then follow, and the pool keeps growing as refinement proceeds;
  • In each refinement round, the 64 proofs with the highest average verification score are selected, and each is paired with 8 randomly chosen analyses, prioritizing analyses that identified issues (score 0 or 0.5). Each (proof, analysis) pair is used to generate a refined proof, which is then added back to the pool. The process runs for at most 16 iterations, or until some proof passes all 64 verification attempts, indicating high confidence in its correctness. All experiments use a single model, the final proof generator, which performs both proof generation and verification. A simplified sketch of this search loop follows.
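Under stated assumptions, here is a simplified sketch of the search loop. The generate, refine, and verify callables are hypothetical stand-ins for the single proof-generator model acting in its different roles, and the prioritization of issue-finding analyses is reduced to a simple filter.

```python
import random

def high_compute_search(problem, generate, refine, verify,
                        pool_size=64, n_analyses=64, pairs_per_proof=8, max_iters=16):
    """Maintain a pool of candidate proofs and keep refining the top-scoring ones
    until one passes every verification attempt or the iteration budget runs out."""
    def analyze(proof):
        # verify(problem, proof) is assumed to return (score, analysis_text).
        return [verify(problem, proof) for _ in range(n_analyses)]

    def mean_score(entry):
        _, analyses = entry
        return sum(score for score, _ in analyses) / len(analyses)

    pool = [(p, analyze(p)) for p in (generate(problem) for _ in range(pool_size))]

    for _ in range(max_iters):
        # Stop early if some proof passes all of its verification attempts.
        for proof, analyses in pool:
            if all(score == 1.0 for score, _ in analyses):
                return proof
        # Keep the pool_size proofs with the highest mean verification score.
        survivors = sorted(pool, key=mean_score, reverse=True)[:pool_size]
        new_entries = []
        for proof, analyses in survivors:
            # Prefer analyses that actually identified issues (score 0 or 0.5).
            candidates = [a for a in analyses if a[0] < 1.0] or analyses
            chosen = random.sample(candidates, min(pairs_per_proof, len(candidates)))
            for _, analysis in chosen:
                refined = refine(problem, proof, analysis)
                new_entries.append((refined, analyze(refined)))
        pool = survivors + new_entries
    # No proof fully convinced the verifier: return the best remaining candidate.
    return max(pool, key=mean_score)[0]
```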