DeepSeek-Math-V2 Explained: Bringing Dense Reward Signals Back to RLVR
The past six months have been busy with work and research, so I haven't blogged for a while. That doesn't mean I've stopped following the field, and I'll catch up on more research notes later.
Today I'll walk through the recently released DeepSeek-Math-V2. The original DeepSeek-Math, back in 2024, introduced GRPO, the now widely adopted RLVR technique. So after a year in the making, what does the V2 release bring?
A recent wave of RL algorithms focuses on outcome reward modeling (ORM): as long as the final answer is correct, the entire reasoning process is treated as correct. This approach has proven effective on hard math benchmarks such as AIME and HMMT, but it still has problems:
- The model may produce the correct final answer while its reasoning still contains logical errors;
- Outcome-only feedback is hard to apply to tasks such as theorem proving, which depend critically on the correctness of the reasoning process itself.
Effectively verifying the reasoning process is therefore essential for ATP-style (automated theorem proving) tasks. The main motivations are:
- Humans can spot flaws in a proof even without a reference solution, which is essential for open problems (tasks whose answers are unknown).
- If large-scale verification fails to identify any issue, the proof is more likely to be valid.
- The effort required to find a valid issue can serve as a proxy for proof quality, and can be used to optimize proof generation.
Building on the verifier, the proof generator and verifier are updated jointly:
- use verification feedback to improve proof generation;
- scale up verification compute to automatically label new proofs that are hard to verify, creating training data that improves the verifier itself;
- use the improved verifier to further improve proof generation.
Moreover, a reliable proof verifier can teach the proof generator to evaluate proofs the way the verifier does. The generator can then iteratively refine its proofs until it can no longer find or fix any issue.
In essence, the model is made explicitly aware of its reward function and can maximize that reward through deliberate reasoning rather than blind trial and error.
The paper introduces DeepSeek-Math-V2, a natural language theorem proving model trained from DeepSeek-V3.2-Exp-Base. It reaches gold-medal level on IMO 2025 and CMO 2024, and scores 118/120 on Putnam 2024.
1. Method
Proof Verification
The verifier is trained with a cold-start + RL recipe. The verifier's task is defined as follows:
Given a problem and a proof, the verifier outputs one of three scores:
- 1.0: the proof is completely correct;
- 0.5: the proof is logically sound overall but contains minor errors;
- 0: the proof's overall logic is wrong.
Cold-Start Data Construction
- Crawl AoPS, math olympiad problems, team selection tests, and post-2010 proof-related data, for a total of 17,503 queries;
- Use DeepSeek-V3.2-Exp-Thinking to generate proofs. Since that model was never specifically optimized for proof writing, its outputs tend to be short and to contain logical errors, so multiple rounds of iteration are used to improve them;
- Randomly sample problems by domain (algebra, number theory, etc.), sample some proofs, and have human annotators grade each proof on the 1.0 / 0.5 / 0 scale.
This yields an RL dataset for training the verifier, in which each example contains a problem, a proof, and the corresponding human score.
The RL prompt uses the following instruction format:
```text
## Instruction
Your task is to evaluate the quality of a solution to a problem. The problem may ask for a proof of a statement, or ask for an answer. If finding an answer is required, the solution should present the answer, and it should also be a rigorous proof of that answer being valid.
Please evaluate the solution and score it according to the following criteria:
- If the solution is completely correct, with all steps executed properly and clearly demonstrated, then the score is 1
- If the solution is generally correct, but with some details omitted or minor errors, then the score is 0.5
- If the solution does not actually address the required problem, contains fatal errors, or has severe omissions, then the score is 0
- Additionally, referencing anything from any paper does not save the need to prove the reference. It's okay IF AND ONLY IF the solution also presents a valid proof of the reference argument(s); otherwise, if the solution omits the proof or if the proof provided is not completely correct, the solution should be scored according to the criteria above, and definitely not with a score of 1
Please carefully reason out and analyze the quality of the solution below, and in your final response present a detailed evaluation of the solution's quality followed by your score. Therefore, your response should be in the following format:
Here is my evaluation of the solution: ... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution.
Based on my evaluation, the final overall score should be: \\boxed{{...}} // where ... should be the final overall score (0, 0.5, or 1, and nothing else) based on the above criteria
--
Here is your task input:
## Problem
{question}
## Solution
{proof}
```
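As an aside, here is a minimal sketch of how such a template might be instantiated (the helper name is an assumption). The doubled braces in `\\boxed {{...}}` read like Python format-string escapes, which suggests the template is filled via `str.format`:

```python
def build_verifier_input(template: str, question: str, proof: str) -> str:
    """Hypothetical helper: fill the verification prompt above with one
    (problem, proof) pair; {{...}} survives .format() as a literal {...}."""
    return template.format(question=question, proof=proof)
```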
RL Training
RL training starts from DeepSeek-V3.2-Exp-SFT (a model SFT-ed on math, code, and other reasoning tasks). The goal of RL is for the model to produce a proof analysis for each proof. The reward signal has two parts (see the sketch after the objective below):
- Format: constrains the verifier to output a summary and a proof score. The format reward is 1 if the output contains "Here is my evaluation of the solution:", "Based on my evaluation, the final overall score should be:", and a boxed score;
- Score reward: let $s_i'$ be the boxed score output by the verifier and $s_i$ the human label; then $R_{score}(s_i', s_i) = 1 - |s_i' - s_i|$. The reward is 1 when the predicted score exactly matches the human label and shrinks toward a minimum of 0 as the gap grows.
The overall RL objective is:
$$\max_{\pi_\phi}\ \mathbb{E}_{(X_i, Y_i, s_i)\sim\mathcal{D}_v,\ (V_i', s_i')\sim\pi_\phi(\cdot\mid X_i, Y_i)}\big[R_{format}(V_i')\cdot R_{score}(s_i', s_i)\big]$$
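A minimal sketch of the two reward terms as described above; the marker strings follow the prompt, but the regex and function names are my assumptions:

```python
import re

# Matches a boxed score of 0, 0.5, or 1 in the verifier output.
BOXED = re.compile(r"\\boxed\{(0(?:\.5)?|1)\}")

def format_reward(output: str) -> float:
    """1.0 iff the output contains both required markers and a boxed score."""
    ok = ("Here is my evaluation of the solution:" in output
          and "Based on my evaluation, the final overall score should be:" in output
          and BOXED.search(output) is not None)
    return 1.0 if ok else 0.0

def score_reward(pred: float, human: float) -> float:
    """R_score(s', s) = 1 - |s' - s|: 1 on exact agreement, down to 0."""
    return 1.0 - abs(pred - human)

def verifier_reward(output: str, human_score: float) -> float:
    """Overall reward R_format(V') * R_score(s', s) for one rollout."""
    if format_reward(output) == 0.0:
        return 0.0
    pred = float(BOXED.search(output).group(1))
    return score_reward(pred, human_score)
```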
Meta-Verification
One concern: when evaluating a proof, the verifier may find it easy to predict the right score (i.e., agree with the human label) while its proof analysis is still flawed, e.g., the analysis itself is wrong, or it hallucinates issues that do not exist.
Meta-verification is therefore introduced to double-check the verifier's proof analysis and make sure the issues it identifies are genuine.
The meta-verifier prompt is shown below:
```text
You are given a "problem", "solution", and "solution evaluation", and you need to assess whether this "solution evaluation" is reasonable.
First, "solution evaluation" is generated to evaluate the quality of the "solution", by prompting a verifier with the rules below (these are not your rules):
'''
Please evaluate the solution and score it according to the following criteria:
- If the solution is completely correct, with all steps executed properly and clearly demonstrated, then the score is 1
- If the solution is generally correct, but with some details omitted or minor errors, then the score is 0.5
- If the solution does not actually address the required problem, contains fatal errors, or has severe omissions, then the score is 0
Additionally, referencing anything from any paper does not save the need to prove the reference. It's okay IF AND ONLY IF the solution also presents a valid proof of the reference argument(s); otherwise, if the solution omits the proof or if the proof provided is not completely correct, the solution should be scored according to the criteria above, and definitely not with a score of 1
'''
Next, I will introduce the rules for you to analyze the quality of the "solution evaluation":
1. Your task is to analyze the "solution evaluation". You do not need to solve the "problem", nor do you need to strictly assess whether the "solution" is accurate. Your only task is to strictly follow the rules below to evaluate whether the "solution evaluation" is reasonable.
2. You need to analyze the content of the "solution evaluation" from these aspects:
Step Restatement: In the "solution evaluation", certain behaviors of the "solution" may be restated. You need to return to the original text of the "solution" and check whether the "solution" actually has these behaviors mentioned in the "solution evaluation".
Defect Analysis: "solution evaluation" may point out errors or defects in the "solution". You need to carefully analyze whether the mentioned errors and defects are indeed valid.
Expression Analysis: Whether the "solution evaluation"'s expressions are accurate.
Score Analysis: Whether the final score given by the "solution evaluation" matches the defects it found. You need to analyze according to the scoring rules given above.
3. The most important part is **defect analysis**: In this part, your core task is to check whether the errors or defects of the "solution" pointed out in the "solution evaluation" are reasonable. In other words, any positive components about the "solution" in the "solution evaluation", regardless of whether they are reasonable, are not within your evaluation scope.
- For example: If the "solution evaluation" says that a certain conclusion in the "solution" is correct, but actually this conclusion is incorrect, then you do not need to care about this point. All parts that the "solution evaluation" considers correct do not belong to your evaluation scope.
- Specifically: If the "solution evaluation" believes that the "solution" is completely accurate and has not found any errors or defects, then regardless of whether the "solution" itself is actually accurate, even if there are obvious errors, you should still consider its analysis of errors to be reasonable.
**Importantly**, for defects found by the "solution evaluation", you need to analyze two points simultaneously:
- whether this defect actually exists
- whether the "solution evaluation"'s analysis of this defect is accurate
These two aspects constitute the analysis of defects.
4. About **expression analysis**, if there are certain expression errors in the "solution evaluation", even minor errors in details, you need to identify them. However, please note that identifying incorrect steps in the "solution" as correct steps does not constitute an **expression error**. In practice, expression errors include but are not limited to:
- If the "solution evaluation" identifies some reasoning step(s) in the "solution" as incorrect, then it cannot further indicate that subsequent conclusion(s) depending on those reasoning step(s) are wrong, but can only indicate that subsequent conclusion(s) are "not rigorously demonstrated."
- Typos and calculation errors made by "solution evaluation"
- Inaccurate restatement of content from "solution"
5. Finally, you need to present your analysis of the "solution evaluation" in your output and also rate its quality based on the rules below:
First, if there is at least one unreasonable defect among the defects found by the "solution evaluation", then you only need to do **defect analysis**:
- If all defects found by the "solution evaluation" are unreasonable, then you should rate it with \(0\)
- If some defects found by the "solution evaluation" are reasonable and some are unreasonable, then your rating should be \(0.5\)
Next, if the "solution evaluation" points out no errors or defects, or all defects found by the evaluation are reasonable, then you should do the following things:
- Analyze whether "expression errors" exist in the "solution evaluation" (**expression analysis**) or whether "solution evaluation" gives a wrong score according to the rules for "solution evaluation" (**score analysis**). If yes, you should rate the "solution evaluation" with \(0.5\); if no, your rating should be \(1\)
Your output should follow the format below:
Here is my analysis of the "solution evaluation":
... // Your analysis here.
Based on my analysis, I will rate the "solution evaluation" as: \\boxed{{...}} // where ... should be a numerical rating of the "solution evaluation" (0, 0.5, or 1, and nothing else) based on the criteria above.
--
Here is your task input:
## Problem
{question}
## Solution
{proof}
## Solution Evaluation
{proof analysis}
```
The meta-verifier is also trained with RL, as follows:
- Given a problem and a proof, first obtain the verifier's proof analysis and score;
- Have human experts rate the proof analysis, yielding a labeled dataset in which each example contains the problem, the proof, the proof analysis, the proof score, and the human rating of the analysis;
- The RL setup mirrors the verifier's, with format and score rewards.
After the meta-verifier's RL training, the verifier is RL-trained again. This time, besides the format reward $R_{format}$ and the score reward $R_{score}$, the verifier also receives the meta-verifier's feedback score $R_{meta}$, giving the final verifier reward:
$$R_V = R_{format}\cdot R_{score}\cdot R_{meta}$$
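A one-function sketch of this combination (function name is illustrative): because the three terms are multiplied rather than summed, a zero on any one of them zeroes the entire reward.

```python
def verifier_reward_v2(r_format: float, r_score: float, r_meta: float) -> float:
    # R_V = R_format * R_score * R_meta: the analysis must be well-formatted,
    # agree with the human score, AND survive meta-verification to earn reward.
    return r_format * r_score * r_meta
```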
Meta-verification data is also mixed into this enhanced verifier training, so the final verifier can not only analyze and score proofs but also score proof analyses; a single model thus ends up with both proof-verification and meta-verification capabilities.
Proof Generator
Proof Generator RLVR
The verifier can be viewed as a generative reward model (GRM), on top of which the proof generator is trained. The generator's RL training amounts to maximizing the verifier's score of its proofs:
$$\max_{\pi_\theta}\ \mathbb{E}_{X_i\sim\mathcal{D}_p,\ Y_i\sim\pi_\theta(\cdot\mid X_i)}\big[R_Y\big]$$
Enhancing with Self-Verification
Hard problems like those from IMO or CMO are usually beyond the proof generator's one-shot ability.
Following prior self-verification work, the verifier is treated as a tool: since the first attempt may fail, the generator can keep revising its proof under the verifier's feedback.
In practice, however, a problem emerges: the RL-trained generator's one-shot proofs may still contain errors, and even when the verifier clearly points them out, the generator may keep insisting its proof is fine.
In other words, the generator does update its proof based on the verifier's feedback, but it never evaluates its own output as rigorously as the verifier does, so it tends to stubbornly dismiss valid criticism.
The generator's RL training therefore also needs the same strategy used in the verifier's RL training.
The proof generator's prompt is shown below:
```text
Your task is to solve a given problem. The problem may ask you to prove a statement, or ask for an answer. If finding an answer is required, you should come up with the answer, and your final solution should also be a rigorous proof of that answer being valid.
Your final solution to the problem should be exceptionally comprehensive and easy-to-follow, which will be rated according to the following evaluation instruction:
'''txt
Here is the instruction to evaluate the quality of a solution to a problem. The problem may ask for a proof of a statement, or ask for an answer. If finding an answer is required, the solution should present the answer, and it should also be a rigorous proof of that answer being valid.
Please evaluate the solution and score it according to the following criteria:
- If the solution is completely correct, with all steps executed properly and clearly demonstrated, then the score is 1
- If the solution is generally correct, but with some details omitted or minor errors, then the score is 0.5
- If the solution does not actually address the required problem, contains fatal errors, or has severe omissions, then the score is 0
Additionally, referencing anything from any paper does not save the need to prove the reference. It's okay IF AND ONLY IF the solution also presents a valid proof of the reference argument(s); otherwise, if the solution omits the proof or if the proof provided is not completely correct, the solution should be scored according to the criteria above, and definitely not with a score of 1
'''
In fact, you already have the ability to rate your solution yourself, so you are expected to reason carefully about how to solve a given problem, evaluate your method according to the instruction, and refine your solution by fixing issues identified until you can make no further progress.
In your final response, you should present a detailed solution to the problem followed by your evaluation of that solution.
- To give a good final response, you should try your best to locate potential issues in your own (partial) solution according to the evaluation instruction above, and fix as many of them as you can.
- A good final response should just faithfully present your progress, including the best solution you can give, as well as a faithful evaluation of that solution.
- Only when you fail to locate any issues in your solution should you score it with 1.
- If you do notice some issues in your solution but fail to resolve them with your best efforts, it's totally ok to faithfully present the issues in your final response.
- The worst final response would provide a wrong solution but lie that it's correct or claim that it's correct without careful error checking. A better version should faithfully identify errors in the solution. Remember! You CAN'T cheat! If you cheat, we will know, and you will be penalized!
Your final response should be in the following format:
## Solution // Your final solution should start with this exact same markdown title
... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.
## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title
Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point it out.
Based on my evaluation, the final overall score should be: \\boxed{{...}} // where ... should be the final overall score (0, 0.5, or 1, and nothing else) based on the evaluation instruction above. You should reach this score ONLY AFTER careful RE-examination of your own solution above
--
Here is your task input:
## Problem
{question}
```
As the prompt shows, the proof generator must not only produce a proof but also, like the verifier, give its own self-analysis and score.
In self-verification-enhanced training, the generator first produces a proof along with a self-analysis and score, where the self-analysis and score follow the same format and rubric as the verifier's.
- Let $Y$ denote the proof the generator produces, $z$ its self-analysis, and $s'$ its self-score.
- The verifier also scores the generator's proof $Y$; denote that score $s$.
- Meta-verification scores the generator's self-analysis $z$, giving a meta score $ms$.
The generator's reward can then be designed as:
$$R = R_{format}\cdot(\alpha R_Y + \beta R_z)$$
where
- $R_Y = s$
- $R_z = R_{score}(s', s)\cdot R_{meta}(z)$
- $R_{meta}(z) = ms$
- $R_{score}(s', s) = 1 - |s' - s|$
- $\alpha = 0.76$, $\beta = 0.24$.
This reward has the following properties (a code sketch follows the list):
- The generator's output must first satisfy the format requirement ($R_{format}$ is 0 or 1);
- The verifier's score of the proof itself ($R_Y$) dominates the total reward with weight $\alpha = 0.76$;
- The self-analysis is pushed to be correct through two terms: the generator's self-score should match the verifier's score as closely as possible (the smaller the gap, the larger the reward), and the self-analysis must also pass meta-verification's scoring.
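Putting the pieces together, a minimal sketch of the generator reward; the multiplicative gating by $R_{format}$ and the exact combination are my reading of the description above, not a verbatim formula from the paper:

```python
ALPHA, BETA = 0.76, 0.24  # weights reported in the paper

def generator_reward(r_format: float, s: float, s_prime: float, ms: float) -> float:
    """s: verifier's score of the proof; s_prime: generator's self-score;
    ms: meta-verification score of the self-analysis."""
    r_y = s                              # R_Y: proof quality per the verifier
    r_z = (1.0 - abs(s_prime - s)) * ms  # R_z: self-analysis quality
    return r_format * (ALPHA * r_y + BETA * r_z)
```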
Co-Optimizing Proof Verification and Generation
The verifier does a good job of steering the generator toward high-quality proofs, but it sometimes struggles: for some generated proofs, the verifier cannot reach a correct judgment in a single attempt. The most direct remedy is to keep having humans label such proofs, but that is slow and costly.
A pipeline is therefore proposed to auto-label verification data (a code sketch follows the list):
- For each proof, use the verifier to generate n independent proof analyses.
- For each analysis that assigns a score of 0 or 0.5, generate m meta-verifications to check the issues it identified. If a majority of the meta-verifications confirm the analysis, it is deemed valid.
- For each proof, examine its lowest-scoring analyses. If at least k of them are deemed valid, label the proof with that lowest score. If no legitimate issue is found across all verification attempts, label the proof 1. Otherwise, discard the proof or send it to experts for annotation.
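A minimal sketch of the pipeline; the `verifier`/`meta_verifier` call signatures and the n, m, k defaults are placeholders, since the post does not give their values:

```python
def auto_label_proof(problem, proof, verifier, meta_verifier, n=8, m=8, k=3):
    """Returns an auto-label (0, 0.5, or 1.0) for the proof, or None if it
    should be discarded or routed to human experts."""
    analyses = [verifier(problem, proof) for _ in range(n)]  # -> (text, score)
    scores = [s for _, s in analyses]
    if all(s == 1.0 for s in scores):
        return 1.0  # no verification attempt found any legitimate issue
    lowest = min(scores)
    confirmed = 0
    for text, score in analyses:
        if score == lowest:
            votes = [meta_verifier(problem, proof, text) for _ in range(m)]
            if sum(v >= 0.5 for v in votes) > m / 2:  # majority confirms issues
                confirmed += 1
    return lowest if confirmed >= k else None
```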
Proof verification and generation are trained alternately, and in the final two rounds human annotation is removed entirely.
2. Experiments
(1) Training details
GRPO is the RL algorithm, run for multiple rounds of iterative training.
- In each round, verification is optimized first; the resulting verifier then initializes the proof generator, which is optimized next;
- From the second round on, each round's verifier is initialized from the previous round's optimized generator (which by then has verification, meta-verification, and proof-generation capabilities), with a rejection fine-tuning (RFT) stage before verifier optimization.
(2) Evaluation datasets
Internal evaluation set: 91 problems
External sets:
- IMO 2025 (6 problems);
- CMO 2024 (6 problems);
- Putnam 2024 (12 problems);
- ISL 2024 (31 problems);
- IMO-ProofBench (60 problems).
(3) Evaluation methods
One-Shot Generation
This setting measures the model's ability to produce a proof in a single pass. For each problem, 8 proofs are generated; each proof is sent to the verifier, which produces 8 analyses per proof, and a majority vote over the analyses decides whether the proof is correct.
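A minimal sketch of the voting step (call signature assumed; ties broken arbitrarily):

```python
from collections import Counter

def vote_verify(problem, proof, verifier, n_analyses=8):
    """Majority vote over n independent verifier analyses of one proof."""
    scores = [verifier(problem, proof)[1] for _ in range(n_analyses)]
    return Counter(scores).most_common(1)[0][0]
```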
Metrics on the in-house evaluation set:
(Figure: one-shot generation results on the in-house evaluation set.)
Self-Verification
Generation with self-verification proceeds as follows:
- First generate a proof along with its self-analysis;
- Re-prompt with the problem, the generated proof, and the self-analysis, and feed that back into the generator for another round of thinking.
The re-prompt looks like this:
```text
{proof_generation_prompt}
## Candidate Solution(s) to Refine
Here are some solution sample(s) along with their correctness evaluation(s). You should provide a better solution by solving issues mentioned in the evaluation(s), or by reusing promising ideas mentioned in the solution sample(s), or by doing both.
{proof}
{proof analyses}
## Final Instruction
Your final response should follow the format above, including a '## Solution' section followed by a '## Self Evaluation' section
```
At inference time the model keeps self-verifying until its self-score reaches 1 or the round limit is hit.
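A minimal sketch of this inference loop (the prompt-building helpers are hypothetical):

```python
def self_refine(problem, generator, max_rounds=8):
    """Regenerate with the previous proof and self-analysis in context until
    the self-score reaches 1 or the round limit is hit."""
    prompt = build_generation_prompt(problem)  # hypothetical helper
    proof, analysis, self_score = None, None, 0.0
    for _ in range(max_rounds):
        proof, analysis, self_score = generator(prompt)
        if self_score == 1.0:
            break
        # the re-prompt above: problem + candidate proof + its evaluation
        prompt = build_refine_prompt(problem, proof, analysis)  # hypothetical
    return proof, analysis, self_score
```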
For evaluation, 32 refinement trajectories are generated per problem. The final proof of each trajectory is evaluated 32 times by the final trained verifier, and the majority-vote result is taken as the proof's score. Two metrics are reported (sketched after the list):
- pass@1: the mean verifier score over a problem's 32 trajectories;
- best@32: among a problem's 32 trajectories, take the one whose generator self-analysis score is highest, and report that trajectory's verifier score.
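A minimal sketch of the two metrics for one problem; each trajectory is represented as a (self_score, verifier_score) pair, with verifier_score already majority-voted over 32 analyses:

```python
def pass_and_best(trajectories):
    """trajectories: list of (self_score, verifier_score) pairs, one per trajectory."""
    pass_at_1 = sum(v for _, v in trajectories) / len(trajectories)
    # best@32: pick the trajectory the generator itself rates highest,
    # then report the verifier's score for that trajectory.
    best_at_32 = max(trajectories, key=lambda t: t[0])[1]
    return pass_at_1, best_at_32
```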
The results are shown in the figure below. Two observations:
- Metrics rise as the number of refinement rounds grows;
- Best@32 consistently beats Pass@1, showing that proofs selected via the generator's own self-analysis scores are better, which in turn suggests the generator's self-analysis is reliable.
(Figure: Pass@1 and Best@32 as a function of refinement rounds.)
High-Compute Search
This mode targets top results on the IMO and CMO leaderboards. It scales both verification and generation compute: extensive verification to catch subtle issues, and parallel generation to explore diverse proof strategies.
Concretely (a code sketch follows the list):
- For each problem, maintain a candidate proof pool, initialized with 64 proofs, each accompanied by 64 analyses. Multiple rounds of self-verification follow, and the pool keeps growing as refinement proceeds;
- In each refinement round, select the 64 proofs with the highest mean verification score and pair each with 8 randomly chosen analyses, prioritizing analyses that identified issues (score 0 or 0.5). Each (proof, analysis) pair is used to generate a refined proof, which is added back to the pool. The process runs for up to 16 iterations, or until some proof passes all 64 verification attempts, indicating high confidence in its correctness. All experiments use a single model, the final proof generator, which performs both proof generation and verification.
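A minimal sketch of the search loop; `generate`/`refine`/`verify` are hypothetical wrappers around the single final model, and `verify` is assumed to return an (analysis, score) pair:

```python
import random
from statistics import mean

def high_compute_search(problem, generate, refine, verify,
                        rounds=16, pool_size=64, n_analyses=64, pairs=8):
    """Maintain a proof pool; each round, refine the top proofs using
    analyses that flagged issues, until a proof passes every verification."""
    pool = []
    for _ in range(pool_size):
        proof = generate(problem)
        pool.append((proof, [verify(problem, proof) for _ in range(n_analyses)]))
    for _ in range(rounds):
        # keep the proofs with the highest mean verification score
        pool.sort(key=lambda entry: mean(s for _, s in entry[1]), reverse=True)
        for proof, analyses in pool[:pool_size]:
            if all(s == 1.0 for _, s in analyses):
                return proof  # passed all attempts: high-confidence correctness
            # prioritize analyses that identified issues (score 0 or 0.5)
            flagged = [a for a in analyses if a[1] < 1.0] or analyses
            for analysis, _ in random.sample(flagged, min(pairs, len(flagged))):
                new = refine(problem, proof, analysis)
                pool.append((new, [verify(problem, new) for _ in range(n_analyses)]))
    return max(pool, key=lambda entry: mean(s for _, s in entry[1]))[0]
```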