03_DeepSpec-DSpark-DSpark Modeling_Markov and Confidence

03 · DSpark Modeling: Markov and Confidence

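This article is the core walkthrough of the DSpark implementation, corresponding to Sections 3.1 (Semi-Autoregressive Generation), 3.2.1 (Confidence Head), and 3.3 (Training) of the paper. DSpark is the headline algorithm of DeepSpec. The article breaks it down along six dimensions: the 13-step forward pass, anchor sampling, the block attention mask, the three Markov heads, the construction of the confidence head input, and the three-term weighted loss. The companion article 04 covers Eagle3 as the autoregressive counterpart.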


Overview (summary)

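DSpark's design goal is to produce block_size draft tokens in a single parallel forward pass, inject prefix dependence through a very lightweight serial Markov head, and then use a confidence head to predict the survival probability of each position. The whole forward pass is implemented in Qwen3DSparkModel.forward, as 13 steps that run from anchor sampling to returning a DSparkForwardOutput.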
[Figure: Qwen3DSparkModel.forward, 13 steps]

Inputs:

  • input_ids: B×L
  • target_hidden_states: B×L×(5·H)
  • loss_mask: B×L
  • target_last_hidden_states: B×L×H (optional)

Steps:

  1. sample_anchor_positions: sample num_anchors anchors
  2. create_noise_embed: anchor token + γ-1 mask tokens
  3. Concatenate position_ids
  4. create_dspark_attention_mask: block mask
  5. _forward_backbone: 5-layer draft decoder
  6. Reshape to 4D
  7. Gather target_ids
  8. Align aligned_target_logits
  9. build_eval_mask (cumprod)
  10. Build prev_token_ids
  11. compute_logits
  12. markov_head.apply_block_logits
  13. confidence_head prediction

Output (DSparkForwardOutput):

  • draft_logits: B×A×γ×V
  • target_ids: B×A×γ
  • eval_mask: B×A×γ
  • block_keep_mask: B×A
  • confidence_pred: B×A×γ
  • aligned_target_logits (optional)

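Figure notes: The DSpark forward pass is a 13-step pipeline, corresponding to qwen3/modeling.py:389-526(file:///workspace/deepspec/modeling/dspark/qwen3/modeling.py#L389-526). It takes 4 input tensors (input_ids / target_hidden_states / loss_mask / target_last_hidden_states) and returns a DSparkForwardOutput containing 6 tensors. The key innovations are step 4 (the block attention mask, which keeps the context and the draft blocks from contaminating each other within one forward pass), step 12 (the Markov head, which injects prefix dependence), and step 13 (the confidence head, which predicts the acceptance rate).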

Key files:

File | Role
deepspec/modeling/dspark/common.py(file:///workspace/deepspec/modeling/dspark/common.py) | Shared components: anchor sampling / block mask / noise embed / eval_mask / AcceptRatePredictor
deepspec/modeling/dspark/markov_head.py(file:///workspace/deepspec/modeling/dspark/markov_head.py) | The three Markov heads: Vanilla / Gated / RNN
deepspec/modeling/dspark/loss.py(file:///workspace/deepspec/modeling/dspark/loss.py) | The three loss terms: CE / L1 / Confidence BCE
deepspec/modeling/dspark/qwen3/modeling.py(file:///workspace/deepspec/modeling/dspark/qwen3/modeling.py) | DSpark implementation for the Qwen3 backend
deepspec/modeling/dspark/qwen3/config.py(file:///workspace/deepspec/modeling/dspark/qwen3/config.py) | Construction of the Qwen3 draft config
deepspec/modeling/dspark/gemma4/modeling.py(file:///workspace/deepspec/modeling/dspark/gemma4/modeling.py) | Gemma4 backend (symmetric to Qwen3)

Section-by-section breakdown (details)

3.1 The DSparkForwardOutput data structure

`DSparkForwardOutput`(file:///workspace/deepspec/modeling/dspark/common.py) (common.py:12-40(file:///workspace/deepspec/modeling/dspark/common.py#L12-40)):
[Figure: fields of DSparkForwardOutput]

  • draft_logits (Tensor, B×A×γ×V): logits predicted by the draft model
  • target_ids (Tensor, B×A×γ): ground truth token ids
  • eval_mask (Tensor, B×A×γ): prefix-contiguous validity mask
  • block_keep_mask (Tensor, B×A): whether the anchor is valid
  • confidence_pred (Tensor, B×A×γ, optional): output of the confidence head
  • aligned_target_logits (Tensor, B×A×γ×V, optional): aligned target logits

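Figure notes: The shape notation is B = batch, A = num_anchors, γ = block_size, V = vocab_size, H = hidden_size. eval_mask applies cumprod(dim=-1) to enforce a contiguous prefix: once a position is invalid, every position after it is zeroed automatically (common.py:172-188(file:///workspace/deepspec/modeling/dspark/common.py#L172-188)). block_keep_mask marks whether an anchor was actually sampled (dummy anchors pad the list when fewer than num_anchors are available). aligned_target_logits is provided only during training (for the L1 loss and the accept-rate supervision) and is None at evaluation time.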
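As a quick way to keep the shape notation straight, the field list can be written down as a small container plus a shape table. This is only a sketch: ForwardOutputSketch and expected_shapes are hypothetical names, `Any` stands in for a tensor type, and γ = 8 and V = 151_936 are illustrative sizes that are not taken from the source (A = 512 is the num_anchors value mentioned in Section 3.2).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ForwardOutputSketch:
    """Hypothetical stand-in for DSparkForwardOutput; Any replaces Tensor."""
    draft_logits: Any                            # B x A x gamma x V
    target_ids: Any                              # B x A x gamma
    eval_mask: Any                               # B x A x gamma
    block_keep_mask: Any                         # B x A
    confidence_pred: Optional[Any] = None        # B x A x gamma, optional
    aligned_target_logits: Optional[Any] = None  # B x A x gamma x V, optional


def expected_shapes(B, A, gamma, V):
    """Map each field name to the shape listed in the figure notes."""
    return {
        "draft_logits": (B, A, gamma, V),
        "target_ids": (B, A, gamma),
        "eval_mask": (B, A, gamma),
        "block_keep_mask": (B, A),
        "confidence_pred": (B, A, gamma),
        "aligned_target_logits": (B, A, gamma, V),
    }


# Illustrative sizes only (gamma and V are assumptions).
for name, shape in expected_shapes(B=2, A=512, gamma=8, V=151_936).items():
    print(f"{name:24s} {shape}")
```

The two optional fields default to None, which mirrors the note that aligned_target_logits is absent at evaluation time.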

3.2 Anchor sampling: how the training data is sliced

`sample_anchor_positions`(file:///workspace/deepspec/modeling/dspark/common.py) (common.py:123-169(file:///workspace/deepspec/modeling/dspark/common.py#L123-169)):

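  • Candidate mask (build_anchor_candidate_mask, common.py:109-120(file:///workspace/deepspec/modeling/dspark/common.py#L109-120)): a position is an anchor candidate if both it and the next position lie inside loss_mask, so that the anchor is followed by a valid label. Block positions that run past loss_mask are zeroed by eval_mask (see 3.6).
  • Sampling: at most num_anchors=512 anchors are sampled per sample. Candidates are sorted by random values, the first N are kept, and the result is ordered by ascending position.
  • Padding when short: dummy anchors are masked with block_keep_mask=0 and do not contribute to the loss.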

This matches the description in Section 3.3 of the paper exactly: "we randomly sample multiple anchor positions from each target sequence to form γ-token blocks as training data".
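The three steps above (candidate mask, random-sort sampling, dummy padding) can be sketched in plain Python for a single sequence. This is a minimal illustration under stated assumptions, not the code in common.py: lists stand in for tensors, the function names are hypothetical, and filling dummy anchors with position 0 is an arbitrary choice made here.

```python
import random


def build_candidate_mask(loss_mask):
    """Position i is a candidate if both i and i+1 are inside loss_mask."""
    n = len(loss_mask)
    return [i + 1 < n and loss_mask[i] == 1 and loss_mask[i + 1] == 1
            for i in range(n)]


def sample_anchors(loss_mask, num_anchors, rng):
    """Return (anchor_positions, block_keep_mask), both of length num_anchors."""
    candidates = [i for i, ok in enumerate(build_candidate_mask(loss_mask)) if ok]
    # Sort candidates by a random key, keep the first num_anchors,
    # then restore ascending position order.
    shuffled = sorted(candidates, key=lambda _: rng.random())
    chosen = sorted(shuffled[:num_anchors])
    pad = num_anchors - len(chosen)
    positions = chosen + [0] * pad          # dummy anchors (filler position is an assumption)
    keep = [1] * len(chosen) + [0] * pad    # dummies are excluded from the loss
    return positions, keep


loss_mask = [0, 0, 1, 1, 1, 1, 0, 1]
positions, keep = sample_anchors(loss_mask, num_anchors=5, rng=random.Random(0))
print(positions, keep)  # [2, 3, 4, 0, 0] [1, 1, 1, 0, 0]
```

Only positions 2, 3 and 4 qualify in this example, so two of the five slots are dummy anchors with block_keep_mask=0.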

3.3 Noise Embedding: anchor + γ-1 mask tokens

`create_noise_embed`(file:///workspace/deepspec/modeling/dspark/common.py) (common.py:264-294(file:///workspace/deepspec/modeling/dspark/common.py#L264-294)):
[Figure: create_noise_embed flow]

  • anchor_pos p → position p: anchor token, real embed
  • mask_token_id → positions p+1..p+γ-1: mask token embed
  • Concatenate the γ embeds → feed into the draft backbone

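Figure notes: Each block's input is 1 anchor token (the real token at position p) plus γ-1 mask tokens (mask_token_id=151669 is Qwen3's <|mask|>). This is the code-level realization of the "Parallel stage" in Section 3.1 of the paper: "γ input tokens (anchor + γ-1 masks) yield γ draft logits". Note that DSpark also treats the anchor itself as the first prediction position, which saves one forward computation compared with the original DFlash (where the anchor is not a prediction position).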
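The block layout described above can be illustrated at the token-id level, before any embedding lookup. The sketch below is an assumption-labeled illustration rather than create_noise_embed itself: build_block_inputs is a hypothetical helper, lists stand in for tensors, and the position ids p..p+γ-1 follow the figure.

```python
MASK_TOKEN_ID = 151669  # value quoted in the text above


def build_block_inputs(input_ids, anchor_positions, block_size):
    """For each anchor p, build tokens [x_p, MASK, ..., MASK]
    and position ids [p, p+1, ..., p+block_size-1]."""
    block_tokens, block_positions = [], []
    for p in anchor_positions:
        block_tokens.append([input_ids[p]] + [MASK_TOKEN_ID] * (block_size - 1))
        block_positions.append(list(range(p, p + block_size)))
    return block_tokens, block_positions


input_ids = [11, 12, 13, 14, 15, 16]
tokens, positions = build_block_inputs(input_ids, anchor_positions=[1, 3], block_size=3)
print(tokens)     # [[12, 151669, 151669], [14, 151669, 151669]]
print(positions)  # [[1, 2, 3], [3, 4, 5]]
```

Each block therefore contributes exactly γ input slots, which is what lets γ draft logits come out of one parallel pass.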

3.4 Block Attention Mask: the isolation wall for the parallel forward pass

`create_dspark_attention_mask`(file:///workspace/deepspec/modeling/dspark/common.py) (common.py:78-106(file:///workspace/deepspec/modeling/dspark/common.py#L78-106)) uses torch.nn.attention.flex_attention.create_block_mask to build a hybrid mask:
[Figure: dspark_mask_mod rules and sequence layout]

Sequence layout (length = seq_len + num_blocks × γ):

  • Context: positions 0..seq_len-1
  • Block 0: positions seq_len..seq_len+γ-1
  • Block 1: positions seq_len+γ..seq_len+2γ-1
  • Block N-1

dspark_mask_mod rules:

  • Context queries: standard causal attention, restricted to positions before anchor_pos
  • Draft block queries can attend only to:
    1. context positions before anchor_pos
    2. draft KV inside the same block (blocks are isolated from one another)
  • When block_keep_mask=0 the entire block is invalid

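Figure notes: This is one of DSpark's core innovations. The full sequence is the context (target hidden states, length seq_len) concatenated with all draft blocks (length num_blocks × γ), processed together in a single forward pass. The rules guarantee that: ① attention inside the context is standard causal; ② each draft block sees only the context before its own anchor plus the KV inside the same block, and never sees the draft tokens of other blocks; ③ invalid blocks (block_keep_mask=0) are masked out entirely. This avoids wasted padding and lets all num_anchors blocks produce their logits in one forward pass. The code is defined in dspark_mask_mod (common.py:86-96(file:///workspace/deepspec/modeling/dspark/common.py#L86-96)).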
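To make the three rules concrete, here is a plain-Python predicate over (query index, key index) pairs in the concatenated layout. It is a sketch, not dspark_mask_mod: the function name and argument list are hypothetical, "before anchor_pos" is read as a strict inequality, and attention inside a block is assumed to be unrestricted because the text does not state a causal constraint there.

```python
def dspark_mask_sketch(q_idx, kv_idx, seq_len, block_size,
                       anchor_positions, block_keep_mask):
    """Return True if query q_idx may attend to key kv_idx."""
    if q_idx < seq_len:
        # Rule 1: context query, standard causal attention within the context.
        return kv_idx <= q_idx
    q_block = (q_idx - seq_len) // block_size
    if not block_keep_mask[q_block]:
        # Rule 3: dummy block, fully masked.
        return False
    if kv_idx < seq_len:
        # Rule 2a: only context strictly before this block's anchor (assumed strict).
        return kv_idx < anchor_positions[q_block]
    # Rule 2b: only draft keys from the same block.
    return (kv_idx - seq_len) // block_size == q_block


seq_len, block_size = 4, 2
anchors, keep = [1, 3], [1, 1]
total = seq_len + len(anchors) * block_size
for q in range(seq_len, total):
    row = [int(dspark_mask_sketch(q, kv, seq_len, block_size, anchors, keep))
           for kv in range(total)]
    print(q, row)
# 4 [1, 0, 0, 0, 1, 1, 0, 0]
# 5 [1, 0, 0, 0, 1, 1, 0, 0]
# 6 [1, 1, 1, 0, 0, 0, 1, 1]
# 7 [1, 1, 1, 0, 0, 0, 1, 1]
```

A predicate of this (q_idx, kv_idx) form is what a mask_mod callback expresses; the real callback additionally receives batch and head indices.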

3.5 Custom Attention: concatenating context K/V and draft K/V

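The key difference in Qwen3DSparkAttention (qwen3/modeling.py:44-152(file:///workspace/deepspec/modeling/dspark/qwen3/modeling.py#L44-152)) is that forward takes two inputs: target_hidden_states (the context) and hidden_states (the noise embedding, i.e. the draft). The K/V projections are applied to both inputs and the results are concatenated: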

```python
# Simplified excerpt of qwen3/modeling.py:108-113: project both inputs, then concatenate
k = torch.cat([k_proj(target), k_proj(noise)], dim=1)
v = torch.cat([v_proj(target), v_proj(noise)], dim=1)
```

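This implements a hybrid of "cross-attention to target + self-attention over draft". Q comes from the draft noise embedding, so the query is a draft representation built from the anchor and mask tokens, while K/V is the concatenation of the context part and the draft part.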

3.6 build_eval_mask: cumprod enforces a contiguous prefix

`build_eval_mask`(file:///workspace/deepspec/modeling/dspark/common.py) (common.py:172-188(file:///workspace/deepspec/modeling/dspark/common.py#L172-188)):
[Figure: build_eval_mask flowchart]

  1. For each draft position, check three conditions: is it inside the sequence range? Is it covered by loss_mask? Is its block valid?
  2. All three yes → 1; otherwise → 0
  3. cumprod(dim=-1) → prefix-contiguous mask

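Figure notes: The three conditions are ANDed and then passed through cumprod(dim=-1), so once a position is 0, every later position is zeroed automatically. This lines up exactly with the speculative-decoding semantics of "accept the longest correct prefix": during training, if position k has to be masked (for example because it falls outside loss_mask), positions k+1, k+2, ... are all zeroed and the loss is computed only over the contiguous valid prefix.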
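The AND-then-cumprod logic can be traced by hand on one block. The sketch below uses plain lists and a running product in place of cumprod(dim=-1); build_eval_mask_sketch is a hypothetical name, and mapping draft slot k to label position anchor_pos + 1 + k is an assumption made for the illustration.

```python
def build_eval_mask_sketch(anchor_pos, block_size, loss_mask, block_keep):
    """Prefix-contiguous validity mask (0/1 list of length block_size) for one block."""
    seq_len = len(loss_mask)
    mask, running = [], 1
    for k in range(block_size):
        label_pos = anchor_pos + 1 + k           # assumed label position for slot k
        in_range = label_pos < seq_len
        covered = in_range and loss_mask[label_pos] == 1
        valid = int(covered and block_keep == 1)  # AND of the three conditions
        running *= valid                          # running product == cumprod along the block
        mask.append(running)
    return mask


loss_mask = [0, 1, 1, 1, 0, 1, 1]
print(build_eval_mask_sketch(0, 5, loss_mask, block_keep=1))  # [1, 1, 1, 0, 0]
print(build_eval_mask_sketch(4, 5, loss_mask, block_keep=1))  # [1, 1, 0, 0, 0]
print(build_eval_mask_sketch(0, 5, loss_mask, block_keep=0))  # [0, 0, 0, 0, 0]
```

In the first call, label position 5 is itself valid, but the hole at position 4 zeroes it anyway, which is the prefix-contiguity behavior the cumprod is there to enforce.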

3.7 The three Markov heads

`build_markov_head`(file:///workspace/deepspec/modeling/dspark/markov_head.py) (markov_head.py:287-311(file:///workspace/deepspec/modeling/dspark/markov_head.py#L287-311)) is a factory function that returns one of the three heads according to the markov_head_type config. When markov_rank=0 it returns None (this is the path DFlash takes).
[Figure: class diagram of the three Markov heads]

VanillaMarkov:

  • markov_w1: nn.Embedding, V×r
  • markov_w2: nn.Linear, r→V
  • apply_block_logits(logits, token_ids, hidden)
  • sample_block_tokens(logits, prev_token)
  • get_prev_embeddings(token_ids)

GatedMarkovHead:

  • gate_proj: nn.Linear
  • gate = sigmoid(W_g · concat(hidden, W1[prev]))
  • bias = W2(gate · W1[prev])

RNNHead:

  • state: R^r recurrent state
  • prev_emb and hidden_state are concatenated
  • joint_proj outputs gate / candidate / output
  • new_state = gate · state + (1 - gate) · candidate
  • bias = W2(tanh(output_raw))

Figure note: the three heads correspond to the three formulations in Section 3.1 of the paper:

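VanillaMarkov (markov_head.py:8-90(file:///workspace/deepspec/modeling/dspark/markov_head.py#L8-90), paper Eq. 5): the parameters are W1 ∈ R^{V×r} and W2 ∈ R^{r×V}, with bias = W2(W1[prev_token]). The low-rank factorization with r=256 cuts the parameter count from $V^2$ (the Qwen3 vocabulary is ~150K, so about 22.5B parameters) to $2rV$ (~77M). apply_block_logits is teacher-forced during training, so the bias at each position is determined by that position's prev token; sample_block_tokens samples in order at inference time and updates prev_token_ids iteratively.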
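
The parameter saving quoted above can be checked with a few lines of arithmetic. The sketch below also builds a toy low-rank bias lookup in plain Python. The helper name `markov_bias` and the tiny matrices are illustrative assumptions, not the repository's code.

```python
# Parameter count of a dense V x V transition table vs. the rank-r factorization.
V = 150_000   # approximate Qwen3 vocabulary size used in the text
r = 256       # markov_rank

dense_params = V * V          # one bias per (prev_token, next_token) pair
low_rank_params = 2 * r * V   # W1 (V x r) plus W2 (r x V)


def markov_bias(w1, w2, prev_token):
    """Return the logit bias W2(W1[prev_token]) for a toy vocabulary.

    w1: list of V rows, each of length r (embedding table).
    w2: list of r rows, each of length V (projection back to the vocabulary).
    """
    emb = w1[prev_token]                       # row lookup, length r
    vocab = len(w2[0])
    return [sum(emb[j] * w2[j][v] for j in range(len(emb))) for v in range(vocab)]


# Toy example with V = 3 and r = 2 (hypothetical values chosen for readability).
toy_w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_w2 = [[0.5, -0.5, 0.0], [0.0, 2.0, -1.0]]
toy_bias = markov_bias(toy_w1, toy_w2, prev_token=2)
```

In teacher-forced training the lookup would run for every position at once using the ground-truth previous tokens, while sequential sampling would call it once per step with the token just drawn.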

GatedMarkovHead (markov_head.py:93-122(file:///workspace/deepspec/modeling/dspark/markov_head.py#L93-122)): adds a gate on top of the vanilla head so that the backbone hidden state modulates the Markov bias:

$$\text{gate} = \sigma(W_g [h; W_1[x_{k-1}]]),\quad B = W_2(\text{gate} \odot W_1[x_{k-1}])$$

RNNHead (markov_head.py:125-284(file:///workspace/deepspec/modeling/dspark/markov_head.py#L125-284), paper Eq. 6): a GRU-like recurrent state. The input is the concatenation $z_k = [s_{k-1}; W_1[x_{k-1}]; h_k] \in R^{2r+d}$, followed by a single-layer gated update:

$$s_k = \sigma(W_g z_k) \odot s_{k-1} + (1-\sigma(W_g z_k)) \odot \tanh(W_c z_k), \quad B_k = W_2^\top \tanh(W_o z_k)$$

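apply_block_logits is teacher-forced during training, but the state still accumulates within a block; at inference time, sample_block_tokens carries the state from one sampling step to the next. The measurements in Section 4.3.2 of the paper show that the RNN head gives only a marginal gain over the Markov head, and only at long proposal lengths (γ=12, 16), so the Markov head is the default.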
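
The gated update can be sketched on plain Python lists. This is a minimal illustration of the recurrence only: `rnn_head_step` is a hypothetical name, and the pre-activations standing in for $W_g z_k$ and $W_c z_k$ are passed in directly instead of being computed from weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_head_step(prev_state, gate_pre, cand_pre):
    """One gated update: s_k = g * s_{k-1} + (1 - g) * tanh(c), elementwise.

    gate_pre and cand_pre stand for the pre-activations W_g z_k and W_c z_k;
    the matrix products are omitted to keep the sketch short.
    """
    new_state = []
    for s, g_raw, c_raw in zip(prev_state, gate_pre, cand_pre):
        g = sigmoid(g_raw)
        new_state.append(g * s + (1.0 - g) * math.tanh(c_raw))
    return new_state


# The state is threaded through the positions of one block in order,
# both in teacher-forced training and in sequential sampling.
state = [0.0, 0.0]
steps = [([0.0, 0.0], [1.0, -1.0]), ([20.0, -20.0], [3.0, 3.0])]
for gate_pre, cand_pre in steps:
    state = rnn_head_step(state, gate_pre, cand_pre)
```

In the second step the first component has a saturated gate and keeps its old value, while the second component has a near-zero gate and is overwritten by the candidate.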

3.8 Confidence head: predicting the prefix acceptance rate

`AcceptRatePredictor`(file:///workspace/deepspec/modeling/dspark/common.py) (common.py:43-49(file:///workspace/deepspec/modeling/dspark/common.py#L43-49)): nn.Linear(input_dim, 1) followed by squeeze. The input dimension depends on confidence_head_with_markov: if it is True, input_dim = hidden_size + markov_rank (the backbone hidden state concatenated with the markov embedding); otherwise input_dim = hidden_size.
Flowchart: backbone hidden h_k and markov embedding W1[x_{k-1}] → concat → nn.Linear (output dim 1) → sigmoid → c_k ∈ [0,1]

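Figure note: this corresponds to paper Eq. 7. The confidence head outputs a "conditional acceptance rate" for each draft position: the probability that this position is accepted, given that the whole prefix before it was accepted. The supervision target is accept_rate_3d.detach() (the analytic acceptance rate, computed from the total variation distance between the draft and target distributions). In the code, the input construction at qwen3/modeling.py:505-517(file:///workspace/deepspec/modeling/dspark/qwen3/modeling.py#L505-517) reflects the "with_markov" switch: when it is enabled, the markov head's get_prev_embeddings(prev_token_ids) is concatenated with the backbone hidden state, so the confidence head sees information about the previous draft token.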
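
The input construction can be illustrated with a small stand-alone sketch. The function `confidence_head` and the toy sizes are assumptions for illustration; the repository uses nn.Linear on tensors, and this sketch applies the sigmoid inline as in the flowchart.

```python
import math


def confidence_head(hidden, markov_emb, weight, bias, with_markov=True):
    """Linear(input_dim, 1) followed by sigmoid, on plain Python lists.

    with_markov=True concatenates the Markov embedding to the backbone hidden
    state, so input_dim = hidden_size + markov_rank; otherwise only the hidden
    state is used and input_dim = hidden_size.
    """
    features = hidden + markov_emb if with_markov else hidden
    if len(features) != len(weight):
        raise ValueError("weight length must equal input_dim")
    logit = sum(f * w for f, w in zip(features, weight)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sizes (hypothetical): hidden_size = 3, markov_rank = 2.
h_k = [0.2, -0.1, 0.4]
prev_emb = [1.0, 0.5]
c_k = confidence_head(h_k, prev_emb, weight=[0.5, 0.5, 0.5, 1.0, -1.0], bias=0.0)
```

Passing a weight vector sized for the concatenated input while setting with_markov=False raises an error, which mirrors the dimension dependency described above.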

3.9 Three-term weighted loss

`compute_dspark_loss`(file:///workspace/deepspec/modeling/dspark/loss.py) (loss.py:255(file:///workspace/deepspec/modeling/dspark/loss.py#L255)) combines three terms:
Flowchart: draft_logits → Lce (CE loss); draft_logits and aligned_target_logits → L1 = Ltv loss and accept_rate_3d; confidence_pred and accept_rate_3d → Lconf (confidence BCE); position decay w_k = exp(-(k-1)/γ) on each term; L = 0.1·Lce + 0.9·Ltv + 1.0·Lconf; × world_size (gradient scaling)

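Figure note: this corresponds to paper Eq. 9-12. All three loss terms are weighted by the position weight w_k = exp(-(k-1)/γ) (loss_decay_gamma=4.0; see _build_loss_weight_mask at loss.py:25-37(file:///workspace/deepspec/modeling/dspark/loss.py#L25-37)), so earlier positions carry more weight. Key detail: backward_loss (loss.py:252(file:///workspace/deepspec/modeling/dspark/loss.py#L252)) is multiplied by world_size. Gradients accumulate only over the local batch, while the denominator is already global (_all_reduce_loss_denominators at loss.py:11-22(file:///workspace/deepspec/modeling/dspark/loss.py#L11-22) uses all_reduce SUM to aggregate the denominators of all ranks), so the loss must be scaled by world_size to match the global average.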
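
The position weights and the weighted sum are simple enough to write out. The following is a minimal sketch; `position_weights` and `total_loss` are hypothetical helper names, and the per-term loss values are made-up numbers used only to show the arithmetic.

```python
import math


def position_weights(block_size, gamma=4.0):
    """Return w_k = exp(-(k - 1) / gamma) for k = 1..block_size."""
    return [math.exp(-(k - 1) / gamma) for k in range(1, block_size + 1)]


def total_loss(l_ce, l_tv, l_conf, ce_alpha=0.1, l1_alpha=0.9, conf_alpha=1.0):
    """L = 0.1 * Lce + 0.9 * Ltv + 1.0 * Lconf with the default coefficients."""
    return ce_alpha * l_ce + l1_alpha * l_tv + conf_alpha * l_conf


weights = position_weights(block_size=4)              # decays from 1.0 downward
loss = total_loss(l_ce=2.0, l_tv=0.5, l_conf=0.3)     # toy per-term values
```

With γ=4 the weight falls to exp(-1) at the fifth position, so positions beyond the first few contribute noticeably less to every term.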

CE loss (loss.py:109-114(file:///workspace/deepspec/modeling/dspark/loss.py#L109-114)): standard F.cross_entropy; the per-position losses are weighted, summed, and divided by the global denominator.

L1 / TV loss (loss.py:73-87(file:///workspace/deepspec/modeling/dspark/loss.py#L73-87)): l1_dist = |softmax(draft) - softmax(target)|.sum(-1), corresponding to paper Eq. 10. This L1 distance equals 2 × the total variation distance and is directly dual to the acceptance rate (paper Section 3.3: "minimizing Ltv directly maximizes the expected acceptance rate").

Accept rate 3D (loss.py:60-70(file:///workspace/deepspec/modeling/dspark/loss.py#L60-70)): accept_rate = 1 - 0.5 * |draft_probs - target_probs|.sum(-1), clamped to [0, 1]. This corresponds to paper Eq. 8.
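
A small worked example makes the duality between the L1 distance and the acceptance rate concrete. The helper names and the two toy distributions below are assumptions for illustration; for valid probability vectors, 1 - 0.5·L1 equals the overlap Σ min(p, q).

```python
def l1_distance(draft_probs, target_probs):
    """Sum of absolute differences, i.e. 2 x total variation distance."""
    return sum(abs(p - q) for p, q in zip(draft_probs, target_probs))


def accept_rate(draft_probs, target_probs):
    """1 - 0.5 * L1 distance, clamped to [0, 1]."""
    rate = 1.0 - 0.5 * l1_distance(draft_probs, target_probs)
    return min(1.0, max(0.0, rate))


draft = [0.5, 0.3, 0.2]
target = [0.4, 0.4, 0.2]
rate = accept_rate(draft, target)
overlap = sum(min(p, q) for p, q in zip(draft, target))   # same quantity, other form
```

Here the L1 distance is 0.2, so the acceptance rate is 0.9: lowering the L1 term raises the acceptance rate one-for-one (up to the factor 0.5).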

Confidence BCE (loss.py:152-181(file:///workspace/deepspec/modeling/dspark/loss.py#L152-181)): F.binary_cross_entropy_with_logits(confidence_pred, accept_rate.detach()), with position weighting. It also records confidence_abs_error, confidence_bias, and confidence_cumprod_bias for calibration evaluation.
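
Because the target here is a soft value in [0, 1] and not a hard label, it may help to see the BCE-with-logits formula written out. This is a sketch of the standard numerically stable form; the function names are hypothetical, and normalizing by the sum of weights is an assumption (the repository divides by an all-reduced global denominator).

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for a soft target in [0, 1]."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def weighted_confidence_bce(logits, targets, weights):
    """Position-weighted BCE; targets play the role of accept_rate.detach()."""
    num = sum(w * bce_with_logits(x, t) for x, t, w in zip(logits, targets, weights))
    return num / sum(weights)   # assumption: local normalization for illustration


best_logit = math.log(0.9 / 0.1)          # sigmoid(best_logit) = 0.9
at_best = bce_with_logits(best_logit, 0.9)
```

The loss for a soft target t is minimized when sigmoid(logit) = t, which is why a well-trained head can be read as a calibrated acceptance probability.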

3.10 The τ metric: expected number of accepted draft tokens

`_compute_local_probabilistic_stats`(file:///workspace/deepspec/modeling/dspark/loss.py) (loss.py:40-57(file:///workspace/deepspec/modeling/dspark/loss.py#L40-57)):

```python
expected_draft_accepted = (accept_rate * eval_mask).cumprod(dim=-1).sum(dim=-1)
tau_prob_per_block = expected_draft_accepted + 1   # +1 accounts for the bonus token
```

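This is the code counterpart of the paper's $\tau \approx 1 + \sum_k \prod_{i \le k} c_i$. Training records two metrics, tau_greedy and tau_probabilistic: the former computes the acceptance rate from argmax tokens, the latter from the probability distributions. They are the core metrics in the training log and directly reflect how many training steps it takes before speculative decoding becomes faster.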
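
The same quantity can be reproduced without tensors to check the arithmetic by hand. The sketch below is a plain-Python analogue; `expected_tau` is a hypothetical name and the acceptance rates are toy values.

```python
from itertools import accumulate
from operator import mul


def expected_tau(accept_rate, eval_mask):
    """tau = 1 + sum_k prod_{i<=k} c_i, with masked positions contributing 0."""
    masked = [a * m for a, m in zip(accept_rate, eval_mask)]
    prefix_products = list(accumulate(masked, mul))   # cumprod along the block
    return sum(prefix_products) + 1.0                 # +1 for the bonus token


tau_full = expected_tau([0.9, 0.8, 0.5], [1, 1, 1])     # 1 + 0.9 + 0.72 + 0.36
tau_masked = expected_tau([0.9, 0.8, 0.5], [1, 1, 0])   # third position masked out
```

A masked position zeroes the running product, so every later position in the block also contributes nothing, which matches the prefix-acceptance semantics.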

3.11 Alignment strategy for aligned target logits

During training, if target_last_hidden_states is provided, the target's hidden states must be aligned to the positions the draft predicts (qwen3/modeling.py:448-466(file:///workspace/deepspec/modeling/dspark/qwen3/modeling.py#L448-466)):

  • target_pred_indices = (safe_label_indices - 1).clamp(min=0): the draft at position p predicts token p+1, so the hidden state needed is the target's hidden state at position p
  • torch.gather extracts the aligned hidden states
  • compute_logits produces aligned_target_logits, used for the L1 loss and for accept rate supervision
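
The index shift in the steps above can be sketched with lists in place of tensors. `align_target_hidden` is a hypothetical helper; list indexing stands in for torch.gather, and the one-element vectors are toy hidden states.

```python
def align_target_hidden(target_hidden, label_indices):
    """Gather target hidden states at (label_index - 1), clamped at 0.

    target_hidden: list of per-position hidden vectors from the target model.
    label_indices: positions of the tokens the draft is trained to predict.
    """
    pred_indices = [max(i - 1, 0) for i in label_indices]   # (idx - 1).clamp(min=0)
    return [target_hidden[i] for i in pred_indices]          # torch.gather analogue


hidden = [[0.0], [1.0], [2.0], [3.0]]          # toy hidden states, one per position
aligned = align_target_hidden(hidden, [0, 2, 3])
```

The clamp only matters for a label index of 0, which has no predecessor; in that case position 0 is reused instead of producing a negative index.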

3.12 Differences in the Gemma4 backend

Gemma4DSparkModel (gemma4/modeling.py(file:///workspace/deepspec/modeling/dspark/gemma4/modeling.py)) mirrors the Qwen3 version. The main differences are:

  • Gemma4DSparkAttention uses global_head_dim, num_global_key_value_heads, attention_k_eq_v (k=v reuse), v_norm, and scaling=1.0
  • It calls flex_attention directly (not through ALL_ATTENTION_FUNCTIONS)
  • compute_logits applies final_logit_softcapping: tanh(logits/softcap)*softcap (gemma4/modeling.py:340-349(file:///workspace/deepspec/modeling/dspark/gemma4/modeling.py#L340-349))
  • The embedding layer is Gemma4TextScaledWordEmbedding, with embed_scale = sqrt(hidden_size)
  • The decoder layer has an additional pre_feedforward_layernorm / post_feedforward_layernorm and layer_scalar
  • The config path has one extra level, text_config (gemma4/config.py:9-19(file:///workspace/deepspec/modeling/dspark/gemma4/config.py#L9-19))
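
Of the differences listed, logit softcapping is the one with a closed-form expression, so here is a minimal sketch of it. The function name `softcap` and the cap value of 30.0 are assumptions for illustration; the actual value comes from the model config.

```python
import math


def softcap(logits, cap):
    """Smoothly bound logits to [-cap, cap]: tanh(logit / cap) * cap."""
    return [math.tanh(x / cap) * cap for x in logits]


# Hypothetical cap; small logits pass through almost unchanged, large ones saturate.
capped = softcap([-1000.0, -1.0, 0.0, 1.0, 1000.0], cap=30.0)
```

Since the capped logits feed both the L1 loss and the acceptance rate, the draft and target logits would need the same capping for the two distributions to be comparable.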

The main forward logic (gemma4/modeling.py:451-598(file:///workspace/deepspec/modeling/dspark/gemma4/modeling.py#L451-598)) is identical to Qwen3's, which supports the claim that the DSpark algorithm is independent of the target model family.

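3.13 DSpark vs DFlash: the configuration difference is the algorithmic difference

Recall the finding from 01 Architecture: in this repository, DFlash is a degenerate configuration of DSpark: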

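| Field | DSpark | DFlash | Meaning |
| --- | --- | --- | --- |
| markov_rank | 256 | 0 | Disables the serial Markov head |
| markov_head_type | 'vanilla' | - | build_markov_head returns None |
| confidence_head_alpha | 1.0 | 0.0 | Disables confidence supervision |
| confidence_head_with_markov | True | - | - |
| ce_loss_alpha | 0.1 | 1.0 | Pure CE training |
| l1_loss_alpha | 0.9 | 0.0 | Disables TV alignment |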

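This supports the diagnosis in Section 4.3.1 of the paper: DFlash's suffix decay comes from parallel independence. DSpark addresses it by injecting prefix dependency through the Markov head, and uses the confidence head to make verification adaptive.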
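
The two configurations can be written side by side as data. The dataclass below is a hypothetical container, not the repository's config class; the field names and DSpark values come from the table, while the DFlash values shown as "-" in the table are filled with placeholders that are labeled as assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftConfig:
    """Hypothetical container for the fields compared in the table."""
    markov_rank: int = 256
    markov_head_type: Optional[str] = "vanilla"
    confidence_head_alpha: float = 1.0
    confidence_head_with_markov: bool = True
    ce_loss_alpha: float = 0.1
    l1_loss_alpha: float = 0.9

    def validate(self) -> None:
        # The confidence head can only concatenate a Markov embedding
        # when a Markov head exists, i.e. when markov_rank > 0.
        if self.confidence_head_with_markov and self.markov_rank <= 0:
            raise ValueError("confidence_head_with_markov requires markov_rank > 0")


dspark = DraftConfig()
dflash = DraftConfig(
    markov_rank=0,
    markov_head_type=None,              # assumption: the table leaves this blank
    confidence_head_alpha=0.0,
    confidence_head_with_markov=False,  # assumption: the table leaves this blank
    ce_loss_alpha=1.0,
    l1_loss_alpha=0.0,
)
dspark.validate()
dflash.validate()
```

Expressing DFlash as a set of overrides on the DSpark defaults keeps the ablation explicit: every algorithmic difference is one changed field.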


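Summary (wrap-up)

DSpark's engineering implementation can be condensed into three points: one forward pass produces γ draft tokens (the parallel backbone) + an extremely light Markov head injects prefix dependency (addressing suffix decay) + a confidence head predicts the acceptance rate (so verification can truncate intelligently). The 13-step forward flow does all three in one pass with a flex_attention block mask: no padding, no serial forward passes, and no online inference of the target model.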

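Design points in review:

  1. Anchor sampling lets one batch supervise many blocks at once: with num_anchors=512, a single batch has at most 512×7=3584 supervised positions.
  2. The block mask is the core innovation: the context and the draft blocks share one forward pass, while the draft blocks are guaranteed not to contaminate each other.
  3. Low-rank factorization of the Markov head: r=256 reduces the $V^2$ parameters to $2rV$, making the single-step serial correction extremely light (paper Figure 4, right panel: latency increases by only 1.3% at γ=16).
  4. The three loss terms are directly dual to the acceptance rate: CE/TV/BCE can all be related analytically to $\tau$, so the training objective is aligned with the evaluation objective.
  5. The confidence head input includes the markov embedding: the confidence sees the previous draft token and models a conditional acceptance rate.
  6. world_size scaling: the loss is multiplied by world_size to match the denominator aggregated by all_reduce SUM.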
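
The world_size scaling in the last point can be verified with scalars. The sketch below assumes that gradients are averaged across ranks, as DDP-style training does; the function names and the per-rank numbers are hypothetical, and plain sums stand in for all_reduce.

```python
def global_loss(local_numerators, local_denominators):
    """Reference value: sum of all numerators over the sum of all denominators."""
    return sum(local_numerators) / sum(local_denominators)


def rank_averaged_loss(local_numerators, local_denominators):
    """What cross-rank averaging yields when each rank scales by world_size."""
    world_size = len(local_numerators)
    global_den = sum(local_denominators)            # stands in for all_reduce(SUM)
    per_rank = [n / global_den * world_size for n in local_numerators]
    return sum(per_rank) / world_size               # assumed averaging across ranks


nums = [3.0, 1.0, 2.0, 6.0]   # weighted loss sums on 4 hypothetical ranks
dens = [4.0, 2.0, 2.0, 8.0]   # weight sums on the same ranks
```

Without the world_size factor, the averaged value would be too small by exactly that factor, because each rank already divides by the global denominator.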

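Common pitfalls:

  • aligned_target_logits must be computed from target_last_hidden_states: it uses the draft model's compute_logits (the shared lm_head), so the prepare_target_cache stage has to store target_last_hidden_states as well (one hidden state per token, not 5 layers).
  • The cumprod over eval_mask is done in fp32; it is numerically unstable in bf16.
  • An overly large num_anchors causes OOM; the default of 512 just fits on 8×A100 80G.
  • markov_head_type='rnn' has to maintain state at inference time and cannot be trivially parallelized; the default is vanilla.
  • confidence_head_with_markov=True requires markov_head to be non-None, i.e. markov_rank > 0, so it cannot be enabled under the DFlash configuration.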

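Further reading: see 04 Eagle3 Modeling for how an autoregressive drafter compares; see 07 Evaluation System for how Qwen3DSparkEvaluator calls forward_dspark_draft_block and build_dspark_proposal at inference time; see 08 Experiment Reproduction for how to tune block_size/markov_rank with --opts to reproduce paper Figures 3 and 4. Paper Sections 3.1 (Markov head, Eq. 5; RNN head, Eq. 6), 3.2.1 (confidence head, Eq. 7 and 8), and 3.3 (loss, Eq. 9-12) are in DSpark_paper.pdf(file:///workspace/DSpark_paper.pdf).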