07 · Evaluation System: Rejection Sampling and Calibration

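Within the overview-then-detail structure of this series, this article is the detailed breakdown of the evaluation side. DSpark evaluation is not simply "sample K tokens and check the acceptance rate"; it is a complete system that combines a rejection-sampling verification loop, cross-rank metric aggregation, and calibration evaluation of the confidence head. This article covers BaseEvaluator, the math behind verify_draft_tokens, the generate_decoding_sample loop, the ConfidenceHeadRecorder calibration flow, and how the DSpark and Eagle3 evaluators differ in their _propose / _update hooks.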


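Overview

DeepSpec's evaluation system is abstracted in BaseEvaluator (base_evaluator.py:444-728(file:///workspace/deepspec/eval/base_evaluator.py#L444-728)), which defines one unified flow for all speculative-decoding evaluation: draft proposes → target verifies → rejection sampling → correction/bonus → advance the context. Subclasses implement only three hooks: _init_context (initialize algorithm-specific state), _propose (generate draft candidates), and _update (advance the context with the verified tokens).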
[Sequence diagram. Participants: BaseEvaluator, Target Model, Draft Model (subclass), verify_draft_tokens]

Flow: prefill the prompt (first token + hidden states) → _init_context(hidden) → loop generate_decoding_sample [ _propose() returns a DraftProposal (K tokens + probs) → verify_draft_tokens(proposal): the target forward over K+1 positions yields target_probs, then rejection sampling with accept_prob = min(1, p_t/p_d) and a cumprod prefix mask returns a VerificationResult (accepted prefix + next token) → _post_verify (optional confidence calibration) → _update(accepted + next) → record acceptance_length ].

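Figure note: the evaluation loop is the standard draft-verify-correct flow of speculative decoding. The target first prefills the prompt to obtain the first token and the initial hidden states. The draft initializes its state through _init_context (DSpark extracts the context feature, Eagle3 extends the draft cache). Inside the loop, _propose generates K candidates, verify_draft_tokens validates them with rejection sampling, _post_verify records calibration data, and _update advances the draft state with the verified tokens. Key constraint: only bsz=1 is supported (base_evaluator.py:331(file:///workspace/deepspec/eval/base_evaluator.py#L331)), because batched speculative decoding is complex to implement.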

Key files:

| File | Role |
| --- | --- |
| deepspec/eval/base_evaluator.py(file:///workspace/deepspec/eval/base_evaluator.py) | BaseEvaluator + rejection-sampling core |
| deepspec/eval/dspark/evaluator.py(file:///workspace/deepspec/eval/dspark/evaluator.py) | DSpark evaluator |
| deepspec/eval/dspark/draft_ops.py(file:///workspace/deepspec/eval/dspark/draft_ops.py) | DSpark inference forward + proposal construction |
| deepspec/eval/dspark/confidence_head.py(file:///workspace/deepspec/eval/dspark/confidence_head.py) | Calibration: ECE/AUROC/Brier |
| deepspec/eval/eagle3/evaluator.py(file:///workspace/deepspec/eval/eagle3/evaluator.py) | Eagle3 evaluator |

Detailed Sections

7.1 Why Rejection Sampling Is Lossless

The core math of `verify_draft_tokens`(file:///workspace/deepspec/eval/base_evaluator.py) (base_evaluator.py:186-304(file:///workspace/deepspec/eval/base_evaluator.py#L186-304)):

$$\text{accept\_prob}_k = \min\left(1, \frac{p^t_k(x_k)}{p^d_k(x_k)}\right)$$
[Flowchart: verify_draft_tokens flow]

  1. Run the target forward on verify_input_ids (K draft tokens + 1 bonus position).
  2. target_probs = softmax(logits / temp).
  3. For each draft token: accept_prob = min(1, p_t/p_d); accept_mask = rand < accept_prob; accept_prefix_mask = cumprod(accept_mask).
  4. Any rejection? If yes, sample the correction token from max(0, p_t - p_d); if no, sample the bonus token from the last target position (p_t[-1]).
  5. committed = accepted + next.
  6. Stop-token handling: terminate early if an EOS was accepted; otherwise continue.

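Figure note: verify_draft_tokens is the code implementation of the rejection sampling described in Section 2 of the paper. The target forward verifies K+1 tokens (K draft tokens + 1 bonus). For each position it computes accept_prob = min(1, p_t/p_d) and samples an accept/reject decision. The cumprod zeroes out every position after the first rejection (prefix semantics). If there is a rejection, the correction token is sampled from the normalized residual max(0, p_t - p_d); if everything is accepted, the bonus token is sampled from the target's last position. committed_tokens = cat([accepted_draft_tokens, next_token]) (base_evaluator.py:287-293(file:///workspace/deepspec/eval/base_evaluator.py#L287-293)).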
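The verification steps can be sketched on plain Python lists. This is a minimal illustration for a single sequence, not the repository's tensor implementation: the function name `verify_draft_tokens_sketch`, its argument layout, and the use of `random.Random` are assumptions made for the example.

```python
import random

def verify_draft_tokens_sketch(draft_tokens, draft_probs, target_probs, rng):
    """Toy verification for one sequence (hypothetical signature).

    draft_tokens: K proposed token ids
    draft_probs:  K draft distributions (lists over the vocabulary)
    target_probs: K+1 target distributions (the last one is the bonus slot)
    """
    k = len(draft_tokens)
    assert len(draft_probs) == k and len(target_probs) == k + 1

    accepted, alive = 0, 1
    for i, tok in enumerate(draft_tokens):
        accept_prob = min(1.0, target_probs[i][tok] / draft_probs[i][tok])
        # Running product of the accept mask: once 0, it stays 0 (prefix semantics).
        alive *= int(rng.random() < accept_prob)
        accepted += alive

    if accepted < k:
        # First rejection is at index `accepted`: sample from the residual.
        p_t, p_d = target_probs[accepted], draft_probs[accepted]
        weights = [max(0.0, t - d) for t, d in zip(p_t, p_d)]
    else:
        # Every draft token survived: draw the bonus from the last target row.
        weights = target_probs[k]
    next_token = rng.choices(range(len(weights)), weights=weights)[0]
    return draft_tokens[:accepted] + [next_token]

# Draft equals target, so every draft token is accepted and a bonus is added.
same = [[0.5, 0.5], [0.25, 0.75]]
all_accepted = verify_draft_tokens_sketch([0, 1], same, same + [[0.0, 1.0]], random.Random(0))

# Target gives zero mass to the draft token, so it is rejected and corrected.
corrected = verify_draft_tokens_sketch([0], [[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]], random.Random(1))
```

`random.choices` normalizes the weights internally, which is why the residual does not need an explicit division here.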

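Why it is lossless: the acceptance probability exactly offsets the gap between the draft and target distributions, so the output distribution equals the target distribution exactly. Concretely, a token x is emitted either by accepting the draft, with mass min(p_d(x), p_t(x)), or by resampling from the residual after a rejection, with mass max(0, p_t(x) - p_d(x)); the two always sum to p_t(x). This is the classic property of rejection sampling. The non-anticipating property emphasized in Section 3.2.2 of the paper (the truncation decision must not depend on future tokens) exists precisely to preserve this guarantee.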
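The losslessness claim can be checked exactly for a single position with rational arithmetic. This is a small sketch; `output_distribution` is a hypothetical helper written for this check and does not exist in the repository.

```python
from fractions import Fraction

def output_distribution(p_d, p_t):
    """Exact one-step output law of accept-or-resample (hypothetical helper)."""
    # Mass emitted through acceptance: p_d(x) * min(1, p_t(x)/p_d(x)) = min(p_d, p_t).
    accept_mass = [min(d, t) for d, t in zip(p_d, p_t)]
    reject_prob = 1 - sum(accept_mass)
    residual = [max(Fraction(0), t - d) for d, t in zip(p_d, p_t)]
    norm = sum(residual)
    if norm == 0:
        # Draft equals target: nothing is ever rejected.
        return accept_mass
    # Mass emitted through the correction path is reject_prob * residual / norm.
    return [a + reject_prob * r / norm for a, r in zip(accept_mass, residual)]

p_d = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
p_t = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]
result = output_distribution(p_d, p_t)
```

In this example the residual mass and the rejection probability are both 3/10, so the correction term reduces to max(0, p_t - p_d) and `result` equals `p_t` term by term.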

7.2 generate_decoding_sample: The Generic Speculative Decoding Loop

`generate_decoding_sample`(file:///workspace/deepspec/eval/base_evaluator.py) (base_evaluator.py:307-441(file:///workspace/deepspec/eval/base_evaluator.py#L307-441)):

  1. Prefill the target model to obtain the first token and the initial hidden states.
  2. _init_context builds the algorithm-specific context (implemented by the subclass).
  3. Loop: _propose → verify_draft_tokens → _post_verify (optional) → _update (advance the context), until max_new_tokens is reached or a stop token is produced.
  4. Record acceptance_lengths, proposal_lengths, and accepted_draft_lengths.

Key constraints:

  • Only bsz=1 is supported (base_evaluator.py:331(file:///workspace/deepspec/eval/base_evaluator.py#L331)).
  • max_proposal_tokens >= 1.
  • current_token_ids check: verify_input_ids[:, :1] must equal the last token accepted so far (base_evaluator.py:205-212(file:///workspace/deepspec/eval/base_evaluator.py#L205-212)).
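The division of labor between the shared loop and the subclass hooks can be illustrated with a toy evaluator. Everything below is hypothetical: `ToyEvaluator`, the greedy "target" function, and the dict-based context are stand-ins chosen to keep the example self-contained, and verification is simplified to exact matching (the greedy special case of rejection sampling).

```python
class ToyEvaluator:
    """Hypothetical miniature of the shared loop; not the real BaseEvaluator API."""

    def __init__(self, target_next, max_proposal_tokens, stop_token):
        assert max_proposal_tokens >= 1
        self.target_next = target_next          # greedy "target": token -> next token
        self.max_proposal_tokens = max_proposal_tokens
        self.stop_token = stop_token

    # Hooks a subclass would override.
    def _init_context(self, first_token):
        return {"last": first_token}

    def _propose(self, ctx):
        raise NotImplementedError

    def _update(self, ctx, committed):
        ctx["last"] = committed[-1]

    def generate(self, first_token, max_new_tokens):
        ctx = self._init_context(first_token)
        output, acceptance_lengths = [first_token], []
        # The sketch may overshoot max_new_tokens by a few tokens; it does not trim.
        while len(output) - 1 < max_new_tokens and output[-1] != self.stop_token:
            draft = self._propose(ctx)[: self.max_proposal_tokens]
            committed, current = [], output[-1]
            for tok in draft:
                # Greedy verification: accept while the draft matches the target.
                if current == self.stop_token or tok != self.target_next(current):
                    break
                committed.append(tok)
                current = tok
            if current != self.stop_token:
                # Correction after a rejection, or bonus when all were accepted.
                committed.append(self.target_next(current))
            self._update(ctx, committed)
            output.extend(committed)
            acceptance_lengths.append(len(committed))
        return output, acceptance_lengths


class TwoRightOneWrong(ToyEvaluator):
    def _propose(self, ctx):
        first = self.target_next(ctx["last"])
        return [first, self.target_next(first), -1]   # third guess is always wrong


evaluator = TwoRightOneWrong(lambda t: t + 1, max_proposal_tokens=3, stop_token=10)
output, acceptance_lengths = evaluator.generate(first_token=0, max_new_tokens=100)
```

With this draft, each round commits two accepted tokens plus one correction until the stop token is accepted, at which point the round commits only that token and the loop exits.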

7.3 Metric Computation

`build_metrics_row`(file:///workspace/deepspec/eval/base_evaluator.py) (base_evaluator.py:469-511(file:///workspace/deepspec/eval/base_evaluator.py#L469-511)):

  • draft_tokens_per_proposal: number of draft tokens per round
  • acceptance_length (τ): average number of tokens accepted per round (including the bonus)
  • verify_rate: = accepted / (proposed + bonus)
  • accept_rates_by_position: acceptance rate at each draft position (used to reproduce Figure 2)

`allreduce_response_metrics`(file:///workspace/deepspec/eval/base_evaluator.py) (base_evaluator.py:550-630(file:///workspace/deepspec/eval/base_evaluator.py#L550-630)): aggregates sample_count / proposal_count / acceptance_length_sum / proposal_length_sum / per-position accept counts across ranks.
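Aggregating raw sums across ranks and dividing only at the end keeps the ratios correct when ranks process different numbers of rounds. The sketch below simulates that with plain dictionaries; the counter layout, `summarize_round`, `sum_counters`, and `build_metrics` are assumptions for illustration, and the per-position rate is taken here as accepted count over proposed count at that position.

```python
def summarize_round(accepted_draft_len, proposal_len, max_k):
    """Counters for one verify round (hypothetical layout)."""
    return {
        "proposal_count": 1,
        "acceptance_length_sum": accepted_draft_len + 1,   # +1 for correction/bonus
        "proposal_length_sum": proposal_len,
        "accepted_by_pos": [int(i < accepted_draft_len) for i in range(max_k)],
        "proposed_by_pos": [int(i < proposal_len) for i in range(max_k)],
    }

def sum_counters(rows):
    """Element-wise sum; stands in for an all_reduce(SUM) across ranks."""
    total = None
    for row in rows:
        if total is None:
            total = {k: (list(v) if isinstance(v, list) else v) for k, v in row.items()}
            continue
        for key, value in row.items():
            if isinstance(value, list):
                total[key] = [a + b for a, b in zip(total[key], value)]
            else:
                total[key] += value
    return total

def build_metrics(total):
    n = total["proposal_count"]
    return {
        "acceptance_length": total["acceptance_length_sum"] / n,
        "draft_tokens_per_proposal": total["proposal_length_sum"] / n,
        "accept_rates_by_position": [
            a / p if p else None
            for a, p in zip(total["accepted_by_pos"], total["proposed_by_pos"])
        ],
    }

# Rank 0 ran two rounds, rank 1 ran one shorter round.
rank0 = sum_counters([summarize_round(3, 4, 4), summarize_round(1, 4, 4)])
rank1 = sum_counters([summarize_round(0, 2, 4)])
metrics = build_metrics(sum_counters([rank0, rank1]))
```

Averaging per-rank means instead would weight rank 1's single round as heavily as rank 0's two rounds, which is why the counters are summed first.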

7.4 DSpark Evaluator

`Qwen3DSparkEvaluator`(file:///workspace/deepspec/eval/dspark/evaluator.py) (dspark/evaluator.py:32-222(file:///workspace/deepspec/eval/dspark/evaluator.py#L32-222)):

  • EVAL_ATTN_IMPLEMENTATION = "sdpa" (evaluation uses sdpa rather than flex_attention)
  • max_proposal_tokens = draft_model.block_size (dspark/evaluator.py:40-42(file:///workspace/deepspec/eval/dspark/evaluator.py#L40-42))

[Sequence diagram. Participants: Qwen3DSparkEvaluator, Draft Model, draft_ops]

Flow: _init_context extracts the prefill hidden states as target_hidden → _propose: forward_dspark_draft_block (use_cache=True, attn_mask=None) returns proposal_hidden → compute_logits + sample (via markov_head.sample_block_tokens) returns sampled_tokens + corrected logits → predict_confidence_step (if a confidence_head exists) returns confidence_logits → _confident_prefix_length (truncate by threshold) → DSparkDraftProposal (tokens + probs + conf) → verify_draft_tokens(proposal) → _post_verify (if a confidence_head_recorder is set) → _update (update the context with accepted+1).

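Figure note: the key differences in the DSpark eval flow are: ① at eval time attention_mask=None (no block mask) and is_causal=False, because eval samples a single block serially and needs no block isolation; ② eval uses the KV cache (use_cache=True), unlike the parallel block forward used in training; ③ build_dspark_proposal calls the markov_head's sample_block_tokens rather than apply_block_logits: the former samples sequentially, the latter is teacher-forced.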

`build_dspark_proposal`(file:///workspace/deepspec/eval/dspark/draft_ops.py) (draft_ops.py:96-153(file:///workspace/deepspec/eval/dspark/draft_ops.py#L96-153)):

  1. compute_logits(proposal_hidden) produces the base logits.
  2. sample_draft_tokens (via markov_head.sample_block_tokens if present) produces sampled_tokens + corrected logits.
  3. If a confidence_head exists: _predict_confidence_logits builds prev_token_ids and calls model.predict_confidence_step; _confident_prefix_length then truncates by confidence_threshold, taking the first position whose sigmoid falls below the threshold as the draft length (draft_ops.py:82-93(file:///workspace/deepspec/eval/dspark/draft_ops.py#L82-93)).
  4. Truncate sampled_tokens and draft_logits to obtain proposal_draft_tokens.
  5. draft_probs = logits_to_probs(draft_logits, temperature)

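Key constraint: calibration metrics are recorded only when confidence_threshold == 0.0 (evaluator.py:46-48(file:///workspace/deepspec/eval/dspark/evaluator.py#L46-48)). With a nonzero threshold the proposals are truncated, so the recorded samples are no longer unbiased observations.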
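The truncation rule ("first position whose sigmoid falls below the threshold") can be sketched as follows. This is an illustration on plain floats; the function below is not the repository's `_confident_prefix_length`, and whether the real helper enforces a minimum draft length is not covered here.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def confident_prefix_length(confidence_logits, threshold):
    """Index of the first position with sigmoid(logit) < threshold, else full length."""
    for i, logit in enumerate(confidence_logits):
        if sigmoid(logit) < threshold:
            return i
    return len(confidence_logits)

logits = [2.0, 1.0, -1.0, 3.0]
truncated = confident_prefix_length(logits, threshold=0.5)   # stops at the third position
untruncated = confident_prefix_length(logits, threshold=0.0) # threshold 0.0 never truncates
```

The second call shows why threshold 0.0 is the setting under which every position is observed: no sigmoid value is below zero, so the full block is always proposed.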

7.5 Confidence Head Calibration

`PerPositionConfidenceMetrics`(file:///workspace/deepspec/eval/dspark/confidence_head.py) (confidence_head.py:30-172(file:///workspace/deepspec/eval/dspark/confidence_head.py#L30-172)):
[Flowchart: confidence head calibration pipeline]

  1. ConfidenceHeadRecorder.observe runs on every verify: it computes cumprod(sigmoid(confidence_logits)) as the predicted survival probability, uses accept_prefix_mask as the ground-truth label, and accumulates both into per-position histograms (coarse_bins=20 / fine_bins=1000).
  2. finish aggregates across ranks by all_reduce on the histograms, then computes ECE as sum(|avg_pred - avg_target| × weight) / total, AUROC approximated from the histograms, and the Brier score as sum((prob - target)^2) / total.
  3. report_dataset prints JSON, optionally writes metrics.json, and optionally plots a reliability PNG.

Figure note: the core of calibration evaluation is comparing the "predicted survival probability" against the "actual accepted prefix". After each verify, observe computes cumprod(sigmoid(confidence_logits)) over the positions within effective_length as the prediction and uses accept_prefix_mask as the ground-truth label. Three metrics are reported: ① ECE (Expected Calibration Error, confidence_head.py:140(file:///workspace/deepspec/eval/dspark/confidence_head.py#L140)): sum(|avg_pred - avg_target| × weight) / total; ② AUROC: approximated from the histograms (_auroc_from_hist, confidence_head.py:105-114(file:///workspace/deepspec/eval/dspark/confidence_head.py#L105-114)); ③ Brier score: sum((prob - target)^2) / total (confidence_head.py:142(file:///workspace/deepspec/eval/dspark/confidence_head.py#L142)).
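The prediction and two of the metrics can be sketched directly from samples. This is a simplified illustration: it bins raw samples in one pass rather than accumulating the coarse/fine histograms described above, it omits AUROC, and the helper names `survival_probs` and `ece_and_brier` are made up for the example.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def survival_probs(confidence_logits):
    """cumprod(sigmoid(logits)): predicted probability that the prefix survives."""
    out, running = [], 1.0
    for logit in confidence_logits:
        running *= sigmoid(logit)
        out.append(running)
    return out

def ece_and_brier(probs, labels, num_bins=20):
    """Binned ECE and Brier score for paired predictions and 0/1 labels."""
    bins = [[0, 0.0, 0.0] for _ in range(num_bins)]   # count, sum_pred, sum_label
    for p, y in zip(probs, labels):
        b = min(int(p * num_bins), num_bins - 1)      # clamp p == 1.0 into the last bin
        bins[b][0] += 1
        bins[b][1] += p
        bins[b][2] += y
    total = len(probs)
    # Each bin contributes |avg_pred - avg_label| weighted by its sample count.
    ece = sum(abs(sp / n - sy / n) * n for n, sp, sy in bins if n) / total
    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / total
    return ece, brier

pred = survival_probs([0.0, 0.0])                     # sigmoid(0) = 0.5 at each position
ece, brier = ece_and_brier([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

In the toy call, both occupied bins are off by 0.1 on average, so the ECE is 0.1 and the Brier score is 0.01.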

plot_reliability_diagram (confidence_head.py:232-304(file:///workspace/deepspec/eval/dspark/confidence_head.py#L232-304)): uses matplotlib to draw a reliability diagram for each position, corresponding to Figure 6 of the paper.

7.6 Eagle3 Evaluator

`Qwen3Eagle3Evaluator`(file:///workspace/deepspec/eval/eagle3/evaluator.py) (eagle3/evaluator.py:22-188(file:///workspace/deepspec/eval/eagle3/evaluator.py#L22-188)):

  • Key constraint: draft_num_hidden_layers == 1 (eagle3/evaluator.py:33-37(file:///workspace/deepspec/eval/eagle3/evaluator.py#L33-37)), because _update extends the draft cache with several committed tokens at once; with a single layer the KV cache is projected directly from the per-token inputs, so no causal mask is needed.
  • max_proposal_tokens = draft_model.ttt_length (eagle3/evaluator.py:39-41(file:///workspace/deepspec/eval/eagle3/evaluator.py#L39-41))

[Sequence diagram. Participants: Qwen3Eagle3Evaluator, Draft Model, Draft Cache]

Flow: _init_context prefills the draft cache with the prompt hidden states (note: shifted_prompt_ids, since in training hidden_i corresponds to token_{i+1}) → _propose loops ttt_length times: compute_logits(proposal_hidden) gives the logits, sample_tokens gives a candidate, and model(hidden, input_ids=next_token, use_cache=True) extends the cache and returns the next proposal_hidden → verify_draft_tokens → _update: draft_cache.crop(cache_len_before) rolls back to the pre-propose state, then re-extends with the committed hidden states + tokens.

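Figure note: the key differences in Eagle3 eval are: ① the propose phase loops ttt_length times, each time sampling one token, extending the draft cache, and advancing the hidden state; ② cache_len_before is recorded when propose starts, and _update calls draft_cache.crop(cache_len_before) to roll back to the pre-propose state before re-extending with the committed hidden states + tokens from verification. This is the standard SpecForge Eagle3 approach: roll back the cache after every propose and refill it with target-verified tokens, so the draft cache stays in sync with the target.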
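The crop-and-re-extend bookkeeping can be illustrated with a list-backed stand-in for the KV cache. `ToyDraftCache` is hypothetical and models only cache length, not keys, values, or hidden states.

```python
class ToyDraftCache:
    """List-backed stand-in for a KV cache; only length bookkeeping is modeled."""

    def __init__(self):
        self.entries = []

    def __len__(self):
        return len(self.entries)

    def extend(self, items):
        self.entries.extend(items)

    def crop(self, length):
        # Drop everything past `length`, mirroring a cache rollback.
        del self.entries[length:]


cache = ToyDraftCache()
cache.extend(["p0", "p1", "p2"])                  # _init_context: prompt positions

cache_len_before = len(cache)                     # recorded at the start of _propose
for step in range(4):                             # ttt_length = 4 speculative steps
    cache.extend([f"draft{step}"])

committed = ["verified0", "verified1", "verified2"]   # e.g. 2 accepted + 1 correction
cache.crop(cache_len_before)                      # _update: discard speculative entries
cache.extend(committed)                           # re-extend with verified inputs only
```

After the rollback the cache holds exactly the prompt plus the committed positions, regardless of how many speculative steps were taken in between.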

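7.7 DSpark vs Eagle3 Evaluator Comparison

| Dimension | DSpark | Eagle3 |
| --- | --- | --- |
| attn impl | sdpa (eval) | sdpa |
| max_proposal | block_size | ttt_length |
| propose style | one forward yields K tokens | loop K times, one token per step |
| cache usage | _forward_backbone(use_cache=True) | extend_draft_cache, extended in a loop |
| update style | update the context's target_hidden with accepted+1 | cache crop + re-extend |
| post_verify | confidence calibration (if a recorder is set) | - |
| single-layer constraint | - | draft_num_hidden_layers == 1 |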

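7.8 Key Constraint: target_layer_ids Must Not Include the Final Layer

`assert_no_final_target_layer`(file:///workspace/deepspec/eval/base_evaluator.py) (base_evaluator.py:100-112(file:///workspace/deepspec/eval/base_evaluator.py#L100-112)): target_layer_ids must not include the target model's final layer. The reason: the output_hidden_states returned by transformers stores the normalized final hidden state (that is, model.norm(last_decoder_output)), whereas the target cache stores the raw decoder output. If the final layer is included, the hidden states read at eval time do not match those in the cache, which shifts the draft's input distribution and sharply lowers the acceptance rate.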
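A guard of this kind might look like the sketch below. The signature, the 0-based layer indexing (final layer = num_hidden_layers - 1), and the error message are all assumptions for illustration; the repository's actual check may index layers differently.

```python
def assert_no_final_target_layer(target_layer_ids, num_hidden_layers):
    """Reject configs that tap the last decoder layer (assumed 0-based layer ids)."""
    final_layer = num_hidden_layers - 1
    if final_layer in target_layer_ids:
        raise ValueError(
            f"target_layer_ids={list(target_layer_ids)} includes the final layer "
            f"{final_layer}; its exported hidden state is normalized and would not "
            "match the raw decoder output kept in the target cache."
        )

# Passes silently: no tapped layer is the last one of a 36-layer target.
assert_no_final_target_layer([1, 17, 34], num_hidden_layers=36)
```

Failing fast at configuration time is cheaper than diagnosing an unexplained drop in acceptance rate after a full evaluation run.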

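7.9 STS (Sequential Temperature Scaling)

The STS described in Section 3.2.1 of the paper is a post-hoc calibration step for the confidence head:

  • For each position $k \in \{1, \dots, \gamma\}$, a 1D grid search finds the optimal temperature $T_k$.
  • Objective: minimize the ECE of the cumulative product $\prod_{i \leq k} \sigma(c_i / T_i)$.
  • The temperatures of the first $k-1$ positions stay fixed at their already-calibrated values.
  • Temperature scaling is an order-preserving transform, so it does not change the ranking of draft tokens.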

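Important: STS is implemented inside the production HAI-LLM system, and this repository contains no corresponding code. The ConfidenceHeadRecorder in this repository provides ECE/AUROC/Brier evaluation of the raw confidence, which can serve as input for offline STS calibration. Figure 6 of the paper shows a raw ECE of roughly 3%-8%, dropping to about 1% after STS.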
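Since the repository ships no STS code, the following is only a sketch of how the sequential grid search described above could be written offline. The function `fit_sts`, the equal-width ECE binning, the temperature grid, and the synthetic data (a head that is overconfident by a factor of 2) are all assumptions; this is not the production implementation.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binned_ece(probs, labels, num_bins=20):
    # Equal-width bins; |avg_pred - avg_label| * n_b simplifies to |sum_pred - sum_label|.
    stats = [[0, 0.0, 0.0] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * num_bins), num_bins - 1)
        stats[b][0] += 1
        stats[b][1] += p
        stats[b][2] += y
    return sum(abs(sp - sy) for n, sp, sy in stats if n) / len(probs)

def fit_sts(logits, survived, grid):
    """Greedy position-by-position temperature search (illustrative only)."""
    n, gamma = len(logits), len(logits[0])
    prefix = [1.0] * n                      # product over already-calibrated positions
    temps = []
    for k in range(gamma):
        labels = [row[k] for row in survived]

        def ece_at(t, k=k, labels=labels, prefix=prefix):
            probs = [prefix[i] * sigmoid(logits[i][k] / t) for i in range(n)]
            return binned_ece(probs, labels)

        best = min(grid, key=ece_at)        # 1D grid search for T_k
        temps.append(best)
        # Freeze T_k and fold it into the running product for later positions.
        prefix = [prefix[i] * sigmoid(logits[i][k] / best) for i in range(n)]
    return temps

# Synthetic data: the raw logit is twice the "true" acceptance logit.
rng = random.Random(0)
logits, survived = [], []
for _ in range(500):
    alive, row_c, row_y = 1, [], []
    for _k in range(3):
        z = rng.uniform(-1.0, 3.0)
        row_c.append(2.0 * z)
        alive *= int(rng.random() < sigmoid(z))
        row_y.append(alive)                 # 1 while the accepted prefix survives
    logits.append(row_c)
    survived.append(row_y)

grid = [j / 10 for j in range(5, 41)]       # temperatures 0.5 .. 4.0
temps = fit_sts(logits, survived, grid)
```

Because T = 1.0 is part of the grid, the ECE at the first position after the search can never exceed the raw ECE at that position under this sketch.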

7.10 The 9 Benchmark Tasks

The TASKS list in eval.py (eval.py:18-28(file:///workspace/eval.py#L18-28)):

```python
TASKS = [
    ("gsm8k", 500),         # math reasoning
    ("math500", 500),       # math reasoning
    ("aime25", 30),         # math reasoning
    ("humaneval", 164),     # code generation
    ("mbpp", 256),          # code generation
    ("livecodebench", 500), # code generation
    ("mt-bench", 80),       # everyday chat
    ("alpaca", 500),        # everyday chat
    ("arena-hard-v2", 500), # everyday chat
]
```

Each task corresponds to a file eval_datasets/<task>.jsonl. The three-domain split follows Section 4.1 of the paper: Math (gsm8k/math500/aime25), Code (humaneval/mbpp/livecodebench), and Chat (mt-bench/alpaca/arena-hard-v2).


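Summary

The essence of the DeepSpec evaluation system is that it abstracts the draft-verify-correct loop of speculative decoding into BaseEvaluator, so the three algorithm subclasses only need to care about propose and update. The math of rejection sampling guarantees losslessness, and the calibration evaluation of the confidence head makes DSpark's "smart truncation" ability measurable.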

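Design points recap:

  1. Rejection sampling = min(1, p_t/p_d) + cumprod prefix mask: standard lossless speculative decoding.
  2. bsz=1 constraint: keeps the implementation simple; batching is left to production inference engines.
  3. DSpark eval uses sdpa rather than flex_attention: a single block needs no block mask.
  4. Eagle3 cache rollback and refill: keeps the draft cache in sync with the target and avoids drift.
  5. draft_num_hidden_layers == 1 constraint: lets the cache extend work without a causal mask.
  6. Calibration is recorded only when confidence_threshold == 0.0: samples are no longer unbiased after truncation.
  7. STS lives on the production side: this repository provides only raw confidence, and its calibration evaluation can serve as input for offline STS.
  8. 9 benchmarks across three domains: fully aligned with Table 1 of the paper.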

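Common pitfalls:

  • target_layer_ids must not include the final layer: otherwise the eval-time hidden states do not match the cache.
  • confidence_threshold > 0 cannot be used for calibration evaluation: it must be set to 0 to record unbiased metrics.
  • The Eagle3 draft must be single-layer: with multiple layers the cache extend in _update breaks.
  • max_proposal_tokens must be ≥ 1: a boundary check.
  • When a stop token is generated: committed_tokens is truncated, so take care when computing acceptance_length.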

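Further reading: see 08 Experiment Reproduction for how this evaluator is used to reproduce Table 1 and Figures 2/5/6 of the paper; see 03 DSpark Modeling to revisit the training supervision of the confidence head. Sections 3.2 (Confidence-Scheduled Verification) and 3.2.1 (STS) of the paper are in DSpark_paper.pdf(file:///workspace/DSpark_paper.pdf).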