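Qwen3.5-MoE in Depth: The Full Lifecycle of Mixture-of-Experts + Linear Attention + Multimodality
Using the Qwen3.5-MoE model as a running example, this document ties together the modules of the Transformers framework and traces the full lifecycle of a cutting-edge mixture-of-experts + multimodal + linear-attention model inside Transformers.
Source files: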
- configuration_qwen3_5_moe.py(file:///workspace/src/transformers/models/qwen3_5_moe/configuration_qwen3_5_moe.py)
- modeling_qwen3_5_moe.py(file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py)
- modular_qwen3_5_moe.py(file:///workspace/src/transformers/models/qwen3_5_moe/modular_qwen3_5_moe.py)
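Related articles:
A Panoramic Reading of the Hugging Face Transformers Source Code
01-Hugging Face Transformers: Core Infrastructure in Depth
02-Hugging Face Transformers: Configuration System in Depth
03-Hugging Face Transformers: Model System in Depth
04-Hugging Face Transformers: Attention and Masking System in Depth
05-Hugging Face Transformers: Cache System in Depth
06-Hugging Face Transformers: Generation System in Depth
07-Hugging Face Transformers: Tokenizer System in Depth
08-Hugging Face Transformers: Multimodal Processing System in Depth
09-Hugging Face Transformers: Training System in Depth
10-Hugging Face Transformers: Quantization System in Depth
11-Hugging Face Transformers: Distributed and Parallel Systems in Depth
12-Hugging Face Transformers: Pipeline Inference in Depth
13-Hugging Face Transformers: AutoModel Auto-Dispatch Mechanism in Depth
14-Hugging Face Transformers: Model Implementation Patterns in Depth
15-Hugging Face Transformers: CLI and Tooling Architecture Overview
16-Hugging Face Transformers: Test Suite Architecture Overview
17-Hugging Face Transformers: BERT Case Study, Connecting All Modules of the Transformers Framework
18-Hugging Face Transformers: GPT-2 Case Study, the Full Lifecycle of a Decoder-only Autoregressive Model
19-Hugging Face Transformers: Qwen3.5-MoE in Depth, the Full Lifecycle of Mixture-of-Experts + Linear Attention + Multimodality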
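1. Where Qwen3.5-MoE Sits in Transformers
Qwen3.5-MoE is the most advanced hybrid-architecture model in the Qwen series. It combines three innovations: hybrid attention layers (alternating full_attention and linear_attention), MoE expert routing (256 experts with top-8 routing plus a shared expert), and a multimodal vision encoder (Vision Transformer + PatchMerger).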
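To make the routing idea concrete, here is a minimal pure-Python sketch of top-k expert routing with a gated shared expert for a single token. It is an illustration under stated assumptions, not the library code: the helper names (softmax, route_top_k, moe_forward) are hypothetical, the renormalization of the selected probabilities and the sigmoid gate on the shared expert are assumptions, and 8 scalar experts with top-2 routing stand in for 256 experts with top-8.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k most probable experts.

    Assumption: the selected probabilities are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_forward(x, router_logits, experts, shared_expert, shared_gate_logit, top_k):
    """Weighted sum of the routed experts plus a sigmoid-gated shared expert (assumed)."""
    routed = sum(w * experts[i](x) for i, w in route_top_k(router_logits, top_k))
    gate = 1.0 / (1.0 + math.exp(-shared_gate_logit))
    return routed + gate * shared_expert(x)

# Toy setup: expert number s simply multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]

selection = route_top_k(router_logits, top_k=2)
output = moe_forward(1.0, router_logits, experts, lambda x: 10.0 * x, 0.0, top_k=2)
print(selection)
print(output)
```

Only the selected experts are evaluated, which is why the compute per token stays far below what the total parameter count would suggest.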
1.1 Architecture positioning diagram
Figure: the Qwen model family.
| Model | Characteristics |
|---|---|
| Qwen2 | Pure dense + standard attention |
| Qwen2-VL | Dense + multimodal + standard attention |
| Qwen2-MoE | MoE + standard attention |
| Qwen3 | Dense + standard attention + thinking mode |
| Qwen3-VL | Dense + multimodal + standard attention |
| Qwen3-MoE | MoE + standard attention |
| Qwen3-VL-MoE | MoE + multimodal + standard attention |
| Qwen3-Next | MoE + hybrid attention |
| Qwen3.5 | Dense + hybrid attention |
| Qwen3.5-MoE | 🔥 MoE + hybrid attention + multimodal |
1.2 Diagram of the three innovations
Figure: the three innovations and their components.
Innovation 1: hybrid attention layers (the two layer types alternate every 4 layers)
- full_attention layer: standard softmax attention, with QK Norm + Gate
- linear_attention layer: GatedDeltaNet, with causal convolution + gated delta rule
Innovation 2: MoE expert routing
- TopKRouter: 256 experts, top-8
- Qwen3_5MoeExperts: 3D parameter tensors
- SharedExpert, with SharedExpertGate
Innovation 3: multimodal vision encoder
- PatchEmbed: 3D convolution
- VisionBlocks × 27: with rotary position embeddings
- PatchMerger: spatial merge + projection
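The "alternate every 4 layers" pattern can be sketched as a small function that builds a layer_types list. This is a hedged illustration: the function name build_layer_types and the parameter full_attention_interval are hypothetical, and placing the full_attention layer at the last position of each group of four is an assumption rather than something the diagram states.

```python
def build_layer_types(num_hidden_layers, full_attention_interval=4):
    """Assumed pattern: every `full_attention_interval`-th layer is full_attention,
    all other layers are linear_attention."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_hidden_layers)
    ]

layer_types = build_layer_types(40)
print(layer_types[:8])
print(layer_types.count("full_attention"), layer_types.count("linear_attention"))
```

Under these assumptions a 40-layer stack contains 10 full_attention layers and 30 linear_attention layers, so most layers avoid the quadratic cost of softmax attention.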
1.3 Inheritance relationships
modular_qwen3_5_moe.py(file:///workspace/src/transformers/models/qwen3_5_moe/modular_qwen3_5_moe.py) shows that the class inheritance chain of Qwen3.5-MoE is straightforward:
| Qwen3.5-MoE class | Direct parent class | Source module |
|---|---|---|
| Qwen3_5MoeTextConfig | Qwen3NextConfig | qwen3_next |
| Qwen3_5MoeVisionConfig | Qwen3_5VisionConfig | qwen3_5 |
| Qwen3_5MoeConfig | Qwen3VLConfig | qwen3_vl |
| Qwen3_5MoeGatedDeltaNet | Qwen3_5GatedDeltaNet | qwen3_5 |
| Qwen3_5MoeAttention | Qwen3NextAttention | qwen3_next |
| Qwen3_5MoeExperts | Qwen3NextExperts | qwen3_next |
| Qwen3_5MoeTopKRouter | Qwen3VLMoeTextTopKRouter | qwen3_vl_moe |
| Qwen3_5MoeSparseMoeBlock | Qwen3NextSparseMoeBlock | qwen3_next |
| Qwen3_5MoeForConditionalGeneration | Qwen3VLMoeForConditionalGeneration | qwen3_vl_moe |
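2. Nested Config Design with Three Classes
Qwen3.5-MoE uses a nested design built from three Config classes: the top-level Qwen3_5MoeConfig contains two sub-configs, text_config and vision_config.
2.1 Config nesting class diagram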
Figure: Config class diagram. Qwen3_5MoeConfig references the other two classes through its text_config and vision_config fields.
PreTrainedConfig
- model_type: str
- post_init()
- to_dict()
Qwen3_5MoeTextConfig
- model_type = "qwen3_5_moe_text"
- base_config_key = "text_config"
- vocab_size: int = 248320
- hidden_size: int = 2048
- num_hidden_layers: int = 40
- num_attention_heads: int = 16
- num_key_value_heads: int = 2
- head_dim: int = 256
- num_experts: int = 256
- num_experts_per_tok: int = 8
- moe_intermediate_size: int = 512
- shared_expert_intermediate_size: int = 512
- layer_types: list[str] | None
- linear_conv_kernel_dim: int = 4
- linear_key_head_dim: int = 128
- linear_value_head_dim: int = 128
- linear_num_key_heads: int = 16
- linear_num_value_heads: int = 32
- base_model_tp_plan: dict
- base_model_pp_plan: dict
- post_init()
Qwen3_5MoeVisionConfig
- model_type = "qwen3_5_moe_vision"
- base_config_key = "vision_config"
- depth: int = 27
- hidden_size: int = 1152
- intermediate_size: int = 4304
- num_heads: int = 16
- in_channels: int = 3
- patch_size: int = 16
- spatial_merge_size: int = 2
- temporal_patch_size: int = 2
- out_hidden_size: int = 3584
- num_position_embeddings: int = 2304
Qwen3_5MoeConfig
- model_type = "qwen3_5_moe"
- sub_configs: dict
- text_config: Qwen3_5MoeTextConfig
- vision_config: Qwen3_5MoeVisionConfig
- image_token_id: int = 248056
- video_token_id: int = 248057
- vision_start_token_id: int = 248053
- vision_end_token_id: int = 248054
- post_init()
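A few derived widths follow directly from the defaults in this class diagram. The sketch below recomputes them with two hypothetical dataclasses (TextDefaults and VisionDefaults are stand-ins, not library classes). The formulas are assumptions based on how grouped-query attention and patch merging usually work: the widths are nominal, and a gated attention variant may project to additional channels.

```python
from dataclasses import dataclass

@dataclass
class TextDefaults:  # hypothetical holder for a subset of the text defaults
    hidden_size: int = 2048
    num_attention_heads: int = 16
    num_key_value_heads: int = 2
    head_dim: int = 256
    linear_key_head_dim: int = 128
    linear_value_head_dim: int = 128
    linear_num_key_heads: int = 16
    linear_num_value_heads: int = 32

@dataclass
class VisionDefaults:  # hypothetical holder for a subset of the vision defaults
    hidden_size: int = 1152
    in_channels: int = 3
    patch_size: int = 16
    temporal_patch_size: int = 2
    spatial_merge_size: int = 2

text, vision = TextDefaults(), VisionDefaults()

# Full attention: nominal grouped-query attention widths.
query_width = text.num_attention_heads * text.head_dim
kv_width = text.num_key_value_heads * text.head_dim
queries_per_kv_head = text.num_attention_heads // text.num_key_value_heads

# Linear attention: total key and value widths across heads.
linear_key_width = text.linear_num_key_heads * text.linear_key_head_dim
linear_value_width = text.linear_num_value_heads * text.linear_value_head_dim

# Vision: raw values in one spatio-temporal patch, and the feature width after
# concatenating spatial_merge_size x spatial_merge_size neighbouring patches (assumed).
values_per_patch = vision.in_channels * vision.temporal_patch_size * vision.patch_size ** 2
merged_width = vision.hidden_size * vision.spatial_merge_size ** 2

print(query_width, kv_width, queries_per_kv_head)
print(linear_key_width, linear_value_width)
print(values_per_patch, merged_width)
```

With 16 query heads sharing 2 key/value heads, each key/value head serves 8 query heads, which keeps the KV cache of the full_attention layers small.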
2.2 The sub_configs mechanism
sub_configs is the standard mechanism for multimodal models in Transformers. It is defined at configuration_qwen3_5_moe.py:171(file:///workspace/src/transformers/models/qwen3_5_moe/configuration_qwen3_5_moe.py#L171):
```python
class Qwen3_5MoeConfig(PreTrainedConfig):
    # Maps each sub-config field name to the class used to instantiate it.
    sub_configs = {"vision_config": Qwen3_5MoeVisionConfig, "text_config": Qwen3_5MoeTextConfig}
```
Figure: sub_configs resolution sequence (participants: config.json, Qwen3_5MoeConfig, Qwen3_5MoeTextConfig, Qwen3_5MoeVisionConfig).
- config.json → Qwen3_5MoeConfig: load the top-level config
- post_init() checks text_config: if it is a dict, call Qwen3_5MoeTextConfig(**text_config); if it is None, call Qwen3_5MoeTextConfig() with default values
- post_init() checks vision_config: if it is a dict, call Qwen3_5MoeVisionConfig(**vision_config); if it is None, call Qwen3_5MoeVisionConfig() with default values
- Result: self.text_config and self.vision_config are both instantiated Config objects
The key code is at configuration_qwen3_5_moe.py:183-194(file:///workspace/src/transformers/models/qwen3_5_moe/configuration_qwen3_5_moe.py#L183):
```python
def __post_init__(self, **kwargs):
    # A dict (for example, parsed from config.json) is promoted to a Config object;
    # a missing sub-config falls back to the class defaults.
    if isinstance(self.vision_config, dict):
        self.vision_config = self.sub_configs["vision_config"](**self.vision_config)
    elif self.vision_config is None:
        self.vision_config = self.sub_configs["vision_config"]()
    if isinstance(self.text_config, dict):
        self.text_config = self.sub_configs["text_config"](**self.text_config)
    elif self.text_config is None:
        self.text_config = self.sub_configs["text_config"]()
    super().__post_init__(**kwargs)
```
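The same dict-or-None promotion can be tried without installing Transformers. The sketch below mirrors the pattern with plain dataclasses; ToyTextConfig, ToyVisionConfig, and ToyConfig are hypothetical stand-ins, and only a couple of fields are modelled.

```python
from dataclasses import asdict, dataclass

@dataclass
class ToyTextConfig:  # hypothetical stand-in for Qwen3_5MoeTextConfig
    hidden_size: int = 2048
    num_experts: int = 256

@dataclass
class ToyVisionConfig:  # hypothetical stand-in for Qwen3_5MoeVisionConfig
    depth: int = 27
    hidden_size: int = 1152

@dataclass
class ToyConfig:  # hypothetical stand-in for Qwen3_5MoeConfig
    text_config: object = None
    vision_config: object = None

    # Unannotated, so the dataclass machinery treats it as a plain class attribute.
    sub_configs = {"text_config": ToyTextConfig, "vision_config": ToyVisionConfig}

    def __post_init__(self):
        for key, config_class in self.sub_configs.items():
            value = getattr(self, key)
            if isinstance(value, dict):
                setattr(self, key, config_class(**value))  # promote dict to object
            elif value is None:
                setattr(self, key, config_class())  # fall back to defaults

config = ToyConfig(text_config={"hidden_size": 4096})
restored = ToyConfig(**asdict(config))  # serialize to nested dicts and load again
print(config.text_config)
print(config.vision_config)
print(restored == config)
```

The round trip through asdict is the reason the promotion step matters: a serialized config contains only nested dicts, and the constructor has to turn them back into typed objects.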
2.3 Parallelism declarations: base_model_tp_plan / base_model_pp_plan
Qwen3_5MoeTextConfig declares its tensor-parallel (TP) and pipeline-parallel (PP) strategies at configuration_qwen3_5_moe.py:59-77(file:///workspace/src/transformers/models/qwen3_5_moe/configuration_qwen3_5_moe.py#L59):
PP strategy (base_model_pp_plan)
| Module | Inputs | Outputs |
|---|---|---|
| embed_tokens | input_ids | inputs_embeds |
| layers | hidden_states, attention_mask | hidden_states |
| norm | hidden_states | hidden_states |
TP strategy (base_model_tp_plan)
| Parameter | Strategy |
|---|---|
| q_proj | colwise |
| k_proj | colwise |
| v_proj | colwise |
| o_proj | rowwise |
| q_norm | replicated_with_grad_allreduce |
| k_norm | replicated_with_grad_allreduce |
| experts.gate_up_proj | packed_colwise |
| experts.down_proj | rowwise |
| experts | moe_tp_experts |
| shared_expert.gate_proj | colwise |
| shared_expert.up_proj | colwise |
| shared_expert.down_proj | rowwise |
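Why a colwise projection is paired with a rowwise one can be checked with a tiny pure-Python example. This is a sketch of the general technique, not of the Transformers implementation: the helper names are hypothetical, matrices use the x @ W layout with shape [in, out] (so "columns" are output features), two list slices stand in for two ranks, and a plain addition stands in for the all-reduce.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Colwise sharding: each part keeps a slice of the output features."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Rowwise sharding: each part keeps a slice of the input features."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

x = [[1, 2]]                                # one token, hidden size 2
w_up = [[1, 0, 2, 1], [0, 1, 1, 3]]         # 2 x 4, sharded colwise
w_down = [[1, 0], [0, 1], [1, 1], [2, 0]]   # 4 x 2, sharded rowwise

reference = matmul(matmul(x, w_up), w_down)

partials = []
for up_shard, down_shard in zip(split_columns(w_up, 2), split_rows(w_down, 2)):
    hidden_shard = matmul(x, up_shard)      # each rank sees only its hidden slice
    partials.append(matmul(hidden_shard, down_shard))

sharded = add(partials[0], partials[1])     # stands in for the all-reduce
print(reference, sharded)
```

Because the intermediate activation is already split along the dimension that the rowwise weight consumes, no communication is needed between the two projections; a single sum at the end recovers the full result. An elementwise activation between the two steps would not change this.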
3. The Complete from_pretrained Sequence
The complete flow from Qwen3_5MoeForConditionalGeneration.from_pretrained('Qwen/Qwen3.5-35B-A3B') to a ready-to-use model.
3.1 Sequence diagram
[Sequence diagram: from_pretrained loading flow]
Participants: user code, AutoModelForCausalLM, Qwen3_5MoeConfig, Qwen3_5MoeTextConfig, Qwen3_5MoeVisionConfig, PreTrainedModel, Qwen3_5MoeForConditionalGeneration, Qwen3_5MoeModel, Qwen3_5MoeVisionModel, Qwen3_5MoeTextModel.
Messages, in order:
- from_pretrained('Qwen/Qwen3.5-35B-A3B')
- Instantiate the Config from config.json
- Parse the text_config dict → Qwen3_5MoeTextConfig
- Parse the vision_config dict → Qwen3_5MoeVisionConfig
- Qwen3_5MoeForConditionalGeneration(config)
- Qwen3_5MoeModel(config)
- Qwen3_5MoeVisionModel._from_config(config.vision_config)
- Qwen3_5MoeTextModel._from_config(config.text_config)
- self.lm_head = Linear(2048, 248320)
- load_state_dict() loads the weights
- Weight assignment complete
- post_init() → _init_weights()
Notes:
- post_init() automatically converts dicts into Config objects
- Builds 40 DecoderLayers; each layer picks full_attention or linear_attention according to layer_types; every layer uses Qwen3_5MoeSparseMoeBlock (256 experts + shared expert)
- Special handling of MoE weights: experts.gate_up_proj shape: 256, 1024, 2048; experts.down_proj shape: 256, 2048, 512
- GatedDeltaNet: dt_bias=1, A_log~U(0,16); RMSNorm: weight=0 (1-centered); Experts: normal_(std=initializer_range)
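The dict-to-Config conversion mentioned in the sequence diagram can be illustrated with plain dataclasses. This is only a sketch of the idea, not the library's code: the classes `ToyConfig`, `ToyTextConfig` and `ToyVisionConfig`, and the `depth` field, are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTextConfig:
    hidden_size: int = 2048
    num_hidden_layers: int = 40


@dataclass
class ToyVisionConfig:
    depth: int = 2  # hypothetical field, for illustration only


@dataclass
class ToyConfig:
    text_config: object = field(default_factory=dict)
    vision_config: object = field(default_factory=dict)

    def __post_init__(self):
        # Sub-configs read from JSON arrive as plain dicts; promote them to typed objects.
        if isinstance(self.text_config, dict):
            self.text_config = ToyTextConfig(**self.text_config)
        if isinstance(self.vision_config, dict):
            self.vision_config = ToyVisionConfig(**self.vision_config)


raw = {"text_config": {"hidden_size": 2048, "num_hidden_layers": 40}, "vision_config": {}}
config = ToyConfig(**raw)
print(type(config.text_config).__name__, type(config.vision_config).__name__)
```

After construction, downstream code can rely on attribute access (`config.text_config.num_hidden_layers`) regardless of whether the caller passed a dict or an already-built object.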
3.2 Special handling when loading MoE weights
Qwen3.5-MoE stores its expert weights as 3D tensors, which is a key difference between MoE models and dense models. From modeling_qwen3_5_moe.py:736-772(file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L736):
python
@use_experts_implementation
class Qwen3_5MoeExperts(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_experts = config.num_experts  # 256
        self.hidden_dim = config.hidden_size  # 2048
        self.intermediate_dim = config.moe_intermediate_size  # 512
        # 3D parameter tensor: [num_experts, intermediate_dim * 2, hidden_dim]
        self.gate_up_proj = nn.Parameter(torch.empty(self.num_experts, 2 * self.intermediate_dim, self.hidden_dim))
        # 3D parameter tensor: [num_experts, hidden_dim, intermediate_dim]
        self.down_proj = nn.Parameter(torch.empty(self.num_experts, self.hidden_dim, self.intermediate_dim))
[Diagram: dense 2D weights vs. MoE 3D weights]
Dense model weights (2D):
- gate_proj: 2048, 512
- up_proj: 2048, 512
- down_proj: 512, 2048
MoE model weights (3D):
- gate_up_proj: 256, 1024, 2048 (gate_proj and up_proj fused into 3D; the 256 experts share one parameter tensor, with gate and up stored together)
- down_proj: 256, 2048, 512 (down_proj expanded to 3D; the 256 experts share one parameter tensor)
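The relationship between the per-expert 2D matrices and the fused 3D tensors can be sketched with NumPy standing in for PyTorch. The helpers `fuse_experts` and `expert_forward` are hypothetical, the toy sizes are far smaller than the real 256/2048/512, and both the [gate; up] ordering inside the fused dimension and the SiLU-gated MLP form are assumptions rather than something the excerpt above confirms.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def fuse_experts(gate_ws, up_ws, down_ws):
    """Stack per-expert 2D weights into fused 3D tensors.

    gate_ws / up_ws: lists of [intermediate, hidden] matrices
    down_ws:         list of [hidden, intermediate] matrices
    """
    gate_up = np.stack([np.concatenate([g, u], axis=0) for g, u in zip(gate_ws, up_ws)])
    down = np.stack(down_ws)
    return gate_up, down


def expert_forward(x, gate_up, down, expert_idx):
    """Run tokens x [tokens, hidden] through one expert of the fused tensors."""
    gate, up = np.split(x @ gate_up[expert_idx].T, 2, axis=-1)
    return (silu(gate) * up) @ down[expert_idx].T


rng = np.random.default_rng(0)
num_experts, hidden, inter = 4, 8, 3  # toy sizes; the article's are 256, 2048, 512
gate_ws = [rng.standard_normal((inter, hidden)) for _ in range(num_experts)]
up_ws = [rng.standard_normal((inter, hidden)) for _ in range(num_experts)]
down_ws = [rng.standard_normal((hidden, inter)) for _ in range(num_experts)]

gate_up_proj, down_proj = fuse_experts(gate_ws, up_ws, down_ws)
x = rng.standard_normal((5, hidden))
fused_out = expert_forward(x, gate_up_proj, down_proj, expert_idx=2)
# Reference: the same expert computed from its separate 2D matrices.
ref_out = (silu(x @ gate_ws[2].T) * (x @ up_ws[2].T)) @ down_ws[2].T
print(gate_up_proj.shape, down_proj.shape)
```

Keeping all experts in one tensor means an expert is selected by indexing the leading dimension, which is what makes batched or grouped expert kernels possible.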
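4. Hybrid attention layer architecture
The core innovation of Qwen3.5-MoE is interleaving full_attention layers with linear_attention layers, applying a hybrid attention architecture at large scale.
4.1 Layer type distribution
layer_types is generated automatically in configuration_qwen3_5_moe.py:112-119(file:///workspace/src/transformers/models/qwen3_5_moe/configuration_qwen3_5_moe.py#L112), with full_attention_interval defaulting to 4: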
python
def __post_init__(self, **kwargs):
    if self.layer_types is None:
        interval_pattern = kwargs.pop("full_attention_interval", 4)
        self.layer_types = [
            "linear_attention" if bool((i + 1) % interval_pattern) else "full_attention"
            for i in range(self.num_hidden_layers)
        ]
[Diagram: layer_types across the 40 DecoderLayers]
- Layers 0-2: linear_attention
- Layer 3: full_attention
- Layers 4-6: linear_attention
- Layer 7: full_attention
- ...
- Layer 39: full_attention
Pattern: within each group of 4 layers, layers 0-2 are linear_attention and layer 3 is full_attention. The 40 layers therefore contain 10 full_attention layers and 30 linear_attention layers.
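The 10/30 split can be checked with a standalone version of the same list comprehension. The function name `build_layer_types` is hypothetical; the logic mirrors the excerpt above.

```python
def build_layer_types(num_hidden_layers, full_attention_interval=4):
    """Every `full_attention_interval`-th layer (1-based) is full attention."""
    return [
        "linear_attention" if (i + 1) % full_attention_interval else "full_attention"
        for i in range(num_hidden_layers)
    ]


layer_types = build_layer_types(40)
full_layers = [i for i, t in enumerate(layer_types) if t == "full_attention"]
counts = {t: layer_types.count(t) for t in set(layer_types)}
print(full_layers)
print(counts)
```

Because the test is on `(i + 1)`, the first full_attention layer is index 3 rather than index 0, and the last layer (index 39) is always full attention when the layer count is a multiple of the interval.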
4.2 Internal structure of Qwen3_5MoeAttention
Defined in modeling_qwen3_5_moe.py:642-716(file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L642):
[Diagram: data flow inside Qwen3_5MoeAttention]
- hidden_states: bs, seq, 2048
- q_proj: 2048 → 16×256×2 = 8192 (output includes the gate)
- k_proj: 2048 → 2×256 = 512
- v_proj: 2048 → 2×256 = 512
- torch.chunk(dim=-1): splits the q_proj output into query and gate
- q_norm (RMSNorm), head_dim=256
- k_norm (RMSNorm), head_dim=256
- apply_rotary_pos_emb: M-RoPE position encoding
- KV cache update: past_key_values.update()
- Attention interface: FlashAttn/SDPA/Eager
- Sigmoid gate: attn_output *= σ(gate)
- o_proj: 16×256 → 2048
- attn_output: bs, seq, 2048
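Three key innovations:
- QK Norm: q_norm and k_norm apply RMSNorm to Q/K, which stabilizes training
- Gate mechanism: the q_proj output dimension is doubled (head_dim * 2); one half is used as the query and the other half passes through a sigmoid to form the gate
- M-RoPE: multimodal rotary position embedding, supporting 3D positions for text/images/video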
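The first two points can be sketched in NumPy for a single sequence. This is an illustration under stated assumptions, not the library implementation: `gated_attention` is a hypothetical function, the per-head layout (each head's 2 × head_dim slice holds the query followed by the gate) is assumed, and M-RoPE, the KV cache and the learned RMSNorm scale are all left out.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    # RMSNorm over the last (head_dim) axis; the learned scale is omitted here.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention(h, wq, wk, wv, wo, num_heads, num_kv_heads, head_dim):
    seq = h.shape[0]
    # q_proj emits 2 * head_dim per head: first half is the query, second half the gate.
    q_and_gate = (h @ wq).reshape(seq, num_heads, 2 * head_dim)
    q, gate = np.split(q_and_gate, 2, axis=-1)
    gate = gate.reshape(seq, num_heads * head_dim)
    k = (h @ wk).reshape(seq, num_kv_heads, head_dim)
    v = (h @ wv).reshape(seq, num_kv_heads, head_dim)
    q, k = rms_norm(q), rms_norm(k)  # QK Norm
    # Grouped-query attention: repeat each KV head for its group of query heads.
    rep = num_heads // num_kv_heads
    k, v = np.repeat(k, rep, axis=1), np.repeat(v, rep, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    attn = np.einsum("hqk,khd->qhd", probs, v).reshape(seq, num_heads * head_dim)
    attn = attn * sigmoid(gate)  # sigmoid output gate
    return attn @ wo


rng = np.random.default_rng(0)
hidden, num_heads, num_kv_heads, head_dim, seq = 8, 2, 1, 4, 5  # toy sizes
wq = rng.standard_normal((hidden, num_heads * head_dim * 2))
wk = rng.standard_normal((hidden, num_kv_heads * head_dim))
wv = rng.standard_normal((hidden, num_kv_heads * head_dim))
wo = rng.standard_normal((num_heads * head_dim, hidden))
h = rng.standard_normal((seq, hidden))
out = gated_attention(h, wq, wk, wv, wo, num_heads, num_kv_heads, head_dim)
print(out.shape)
```

Note that the gate is computed from the same token's hidden state as the query, so it modulates each position's attention output elementwise without mixing information across positions.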
4.3 Internal structure of Qwen3_5MoeGatedDeltaNet
Defined in modeling_qwen3_5_moe.py:367-555(file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L367):
[Diagram: data flow inside Qwen3_5MoeGatedDeltaNet]
- hidden_states: bs, seq, 2048
- in_proj_qkv: 2048 → 2×2048+4096 = 8192 (fused QKV projection)
- in_proj_z: 2048 → 4096 (gate z)
- in_proj_b: 2048 → 32 (beta projection)
- in_proj_a: 2048 → 32 (decay-rate projection)
- causal_conv1d: kernel_size=4, groups=conv_dim (causal convolution), followed by torch.split → Q, K, V
- β = σ(b): erasure gate
- g = -exp(A_log) × softplus(a + dt_bias): decay rate
- Gated Delta Rule: chunk mode (prefill), recurrent mode (decode)
- RMSNormGated: norm + silu(z) gating
- out_proj: 4096 → 2048
- output: bs, seq, 2048
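The two gate formulas in the diagram, β = σ(b) and g = -exp(A_log) × softplus(a + dt_bias), can be evaluated directly to see their ranges. This NumPy sketch uses a hypothetical helper `gate_params` and arbitrary per-head A_log values chosen only for the demo; it is meant to show that g is always negative, so exp(g) acts as a decay factor between 0 and 1.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + exp(x)), numerically stable


def gate_params(a, b, A_log, dt_bias):
    g = -np.exp(A_log) * softplus(a + dt_bias)  # log-decay, always negative
    beta = 1.0 / (1.0 + np.exp(-b))             # sigmoid, in (0, 1)
    return g, beta


rng = np.random.default_rng(0)
num_heads, seq = 4, 6  # toy sizes
a = rng.standard_normal((seq, num_heads))
b = rng.standard_normal((seq, num_heads))
A_log = np.log(rng.uniform(0.5, 16.0, size=num_heads))  # arbitrary demo values
dt_bias = np.ones(num_heads)
g, beta = gate_params(a, b, A_log, dt_bias)
decay = np.exp(g)  # multiplicative factor applied to the recurrent state
print(decay.min(), decay.max())
```

A larger exp(A_log) or a larger a pushes g further below zero, which makes the head forget its state faster.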
Core GatedDeltaNet formulas:
# Recurrent mode (single-token decoding):
S_t = S_{t-1} * exp(g_t) # decay the old state
kv_mem = (S_t * k_t).sum(dim=-2) # retrieve from memory
δ_t = (v_t - kv_mem) * β_t # compute the correction
S_t = S_t + k_t^T * δ_t # update the state
o_t = (S_t * q_t).sum(dim=-2) # query the output
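The recurrence can be written out as a runnable single-head loop. This is a NumPy sketch of the five steps listed above, with a hypothetical function name `recurrent_gated_delta_rule`; any normalisation or scaling of q and k that the library may apply is left out, and batch and head dimensions are dropped for readability.

```python
import numpy as np


def recurrent_gated_delta_rule(q, k, v, g, beta, state=None):
    """Single-head recurrence.

    q, k: [seq, d_k]   v: [seq, d_v]   g, beta: [seq]
    state: [d_k, d_v] memory matrix S (zeros if not given)
    """
    seq, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v)) if state is None else state.copy()
    out = np.zeros((seq, d_v))
    for t in range(seq):
        S = S * np.exp(g[t])                      # decay the old state (g <= 0)
        kv_mem = (S * k[t][:, None]).sum(axis=0)  # what S currently returns for k_t
        delta = (v[t] - kv_mem) * beta[t]         # correction toward v_t
        S = S + np.outer(k[t], delta)             # write the correction
        out[t] = (S * q[t][:, None]).sum(axis=0)  # read with the query
    return out, S


rng = np.random.default_rng(0)
seq, d_k, d_v = 6, 4, 3  # toy sizes
k = rng.standard_normal((seq, d_k))
k /= np.linalg.norm(k, axis=-1, keepdims=True)  # unit-norm keys for the demo
q = k.copy()
v = rng.standard_normal((seq, d_v))
g = -rng.uniform(0.0, 0.5, size=seq)
beta = rng.uniform(0.0, 1.0, size=seq)

full_out, full_state = recurrent_gated_delta_rule(q, k, v, g, beta)
# Processing the sequence in two pieces while carrying the state gives the same result.
first_out, mid_state = recurrent_gated_delta_rule(q[:3], k[:3], v[:3], g[:3], beta[:3])
second_out, end_state = recurrent_gated_delta_rule(
    q[3:], k[3:], v[3:], g[3:], beta[3:], state=mid_state
)
```

The state S has a fixed [d_k, d_v] shape no matter how many tokens have been seen, which is why decoding with this layer needs only the carried state rather than a growing cache.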
4.4 Data-flow comparison of the two attention layer types
[Diagram: data flow of the two layer types]
linear_attention layer: hidden_states → input_layernorm → in_proj_qkv + in_proj_z/b/a → causal_conv1d (kernel=4) → Gated Delta Rule (O(n) complexity) → RMSNormGated (norm + silu(z)) → residual + output
full_attention layer: hidden_states → input_layernorm → q_proj + k_proj + v_proj → q_norm + k_norm → M-RoPE → Softmax Attention (O(n²) complexity) → Sigmoid Gate → residual + output
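| Property | full_attention | linear_attention |
|---|---|---|
| Complexity | O(n²) | O(n) |
| Cache type | KV Cache | conv_state + recurrent_state |
| Position encoding | M-RoPE | None (encoded implicitly by the convolution) |
| QK Norm | ✅ | ❌ (uses L2 Norm) |
| Gate mechanism | Sigmoid gate from q_proj, applied to attn_output | RMSNormGated with z |
| Use case | Precise long-range dependencies | Efficient sequence modeling |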
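To make the cache-type row concrete, here is a rough per-layer element count for batch size 1. The full_attention numbers (2 KV heads, head_dim 256) come from the diagram in section 4.2. The GatedDeltaNet layout (32 value heads, key and value head dims of 128, conv_dim 8192) is inferred from the projection widths in section 4.3 rather than stated in the source, and the conv_state length is taken to be the kernel size, so treat those as assumptions. The function names are hypothetical.

```python
def kv_cache_elems(seq_len, num_kv_heads=2, head_dim=256):
    """Elements cached by one full_attention layer: K and V for every past token."""
    return 2 * num_kv_heads * head_dim * seq_len


def linear_state_elems(num_v_heads=32, head_k_dim=128, head_v_dim=128,
                       conv_dim=8192, conv_kernel=4):
    """Elements cached by one linear_attention layer: fixed-size states only."""
    recurrent_state = num_v_heads * head_k_dim * head_v_dim
    conv_state = conv_dim * conv_kernel
    return recurrent_state + conv_state


for n in (256, 4096, 65536):
    print(n, kv_cache_elems(n), linear_state_elems())
```

Under these assumptions the KV cache grows linearly with context length while the linear_attention state stays constant, so the fixed state is the larger of the two only for short contexts.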
5. MoE expert routing system
Qwen3.5-MoE uses a hybrid design of 256 experts with Top-8 routing plus a shared expert: every token passes through 8 routed experts and 1 shared expert.
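A minimal NumPy sketch of top-k routing with a shared expert follows. It is an illustration only: `moe_forward` is a hypothetical function, the experts are reduced to single linear maps, and both the renormalisation of the top-k weights and the ungated addition of the shared expert are assumptions rather than details confirmed by the text above.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def moe_forward(x, router_w, expert_ws, shared_w, top_k):
    """x: [tokens, hidden]; router_w: [hidden, num_experts];
    expert_ws: [num_experts, hidden, hidden] toy linear experts; shared_w: [hidden, hidden]."""
    probs = softmax(x @ router_w)                      # [tokens, num_experts]
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]   # indices of the k largest probabilities
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    top_p = top_p / top_p.sum(axis=-1, keepdims=True)  # renormalise over the selected experts
    out = x @ shared_w                                 # the shared expert sees every token
    for token in range(x.shape[0]):
        for weight, e in zip(top_p[token], top_idx[token]):
            out[token] += weight * (x[token] @ expert_ws[e])
    return out, top_idx, top_p


rng = np.random.default_rng(0)
tokens, hidden, num_experts, top_k = 3, 8, 16, 4  # toy sizes; the article's are 256 experts, Top-8
x = rng.standard_normal((tokens, hidden))
router_w = rng.standard_normal((hidden, num_experts))
expert_ws = rng.standard_normal((num_experts, hidden, hidden))
shared_w = rng.standard_normal((hidden, hidden))
out, top_idx, top_p = moe_forward(x, router_w, expert_ws, shared_w, top_k)
print(top_idx)
```

Only `top_k` of the expert matrices are touched per token, which is the source of the gap between total and active parameters in an MoE model.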
5.1 MoE routing flowchart
Figure (flowchart, reconstructed as text from the diagram's node labels): MoE routing flow.
- Input: hidden_states, shape [bs, seq, 2048].
- Routing decision: Qwen3_5MoeTopKRouter (weight: [256, 2048]) → Softmax → router_probs → Top-8 selection → indices + weights → weight normalization, w /= sum(w). The router hands top_k_index and top_k_weights to the expert computation.
- Expert computation: iterate over the 256 experts and compute only those that were selected: gate_up_proj[expert_idx] (F.linear → chunk → SiLU(gate) * up) → down_proj[expert_idx] (F.linear) → × routing_weights → accumulate with index_add_.
- Shared expert: SharedExpert (standard MLP: gate_proj + up_proj + down_proj), scaled by SharedExpertGate, σ(Linear(x)).
- Output: expert_output + shared_expert_output, shape [bs, seq, 2048].
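The routing and expert-computation steps in the flow above can be sketched in a few lines of PyTorch. This is a minimal sketch, not the library's code: the function names route_top_k and run_experts, the weight layouts, and the toy sizes (4 experts, Top-2, hidden size 16) are assumptions chosen for illustration, where the model itself uses 256 experts, Top-8, and hidden size 2048.

```python
import torch
import torch.nn.functional as F


def route_top_k(hidden, router_weight, top_k):
    """Softmax over all experts, keep the top_k, renormalize the kept weights."""
    router_logits = F.linear(hidden, router_weight)  # [tokens, num_experts]
    router_probs = F.softmax(router_logits, dim=-1, dtype=torch.float)
    top_k_weights, top_k_index = torch.topk(router_probs, top_k, dim=-1)
    top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
    return top_k_index, top_k_weights.to(hidden.dtype)


def run_experts(hidden, gate_up_proj, down_proj, top_k_index, top_k_weights):
    """Loop over experts, computing only those selected by at least one token."""
    output = torch.zeros_like(hidden)
    for expert_idx in range(gate_up_proj.shape[0]):
        # Which tokens picked this expert, and in which of their top_k slots.
        token_idx, slot_idx = torch.where(top_k_index == expert_idx)
        if token_idx.numel() == 0:
            continue  # expert not selected by any token
        x = hidden[token_idx]
        gate, up = F.linear(x, gate_up_proj[expert_idx]).chunk(2, dim=-1)
        y = F.linear(F.silu(gate) * up, down_proj[expert_idx])
        y = y * top_k_weights[token_idx, slot_idx, None]
        output.index_add_(0, token_idx, y)  # accumulate per-token contributions
    return output


# Toy sizes (assumed for illustration only).
torch.manual_seed(0)
tokens, hidden_size, inter_size, num_experts, top_k = 5, 16, 8, 4, 2
hidden = torch.randn(tokens, hidden_size)
router_weight = torch.randn(num_experts, hidden_size)
gate_up_proj = torch.randn(num_experts, 2 * inter_size, hidden_size)
down_proj = torch.randn(num_experts, hidden_size, inter_size)

top_k_index, top_k_weights = route_top_k(hidden, router_weight, top_k)
output = run_experts(hidden, gate_up_proj, down_proj, top_k_index, top_k_weights)
print(top_k_index.shape, output.shape)
```

Because the kept weights are renormalized, each token's routing weights sum to 1, so the routed output is a convex combination of the selected experts' outputs.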
5.2 Internal Structure of Qwen3_5MoeSparseMoeBlock
Defined in modeling_qwen3_5_moe.py:794-813 (file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L794):
Figure (flowchart, reconstructed as text from the diagram's node labels): internal structure of Qwen3_5MoeSparseMoeBlock.
- Input: hidden_states, shape [bs, seq, 2048] → reshape → [-1, 2048].
- Sparse routing path: gate (TopKRouter) → router_logits, routing_weights, selected_experts → experts (Qwen3_5MoeExperts) → expert_output.
- Shared expert path: shared_expert (MLP) with gate_proj [2048, 512], up_proj [2048, 512], down_proj [512, 2048]; shared_expert_gate is Linear(2048, 1); the gated result is σ(x) * shared_output.
- Merge: expert_output + gated_shared_output → reshape → [bs, seq, 2048] → output.
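Key code (abridged):
python
import torch
import torch.nn.functional as F
from torch import nn

class Qwen3_5MoeSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.gate = Qwen3_5MoeTopKRouter(config)  # defined in the same file
        self.experts = Qwen3_5MoeExperts(config)  # the 256 routed experts
        self.shared_expert = Qwen3_5MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size)
        self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)

    def forward(self, hidden_states):
        batch_size, seq_len, hidden_dim = hidden_states.shape
        # Flatten to [bs * seq, hidden] so that routing operates per token.
        hidden_states_reshaped = hidden_states.view(-1, hidden_dim)
        shared_expert_output = self.shared_expert(hidden_states_reshaped)
        _, routing_weights, selected_experts = self.gate(hidden_states_reshaped)
        expert_output = self.experts(hidden_states_reshaped, selected_experts, routing_weights)
        # Scale the shared expert by a per-token sigmoid gate, then add it to the routed output.
        shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output
        expert_output = expert_output + shared_expert_output
        return expert_output.reshape(batch_size, seq_len, hidden_dim)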
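To make the shared-expert path concrete, here is a minimal standalone sketch using the dimensions from the structure diagram (hidden size 2048, shared intermediate size 512). The class name GatedSharedExpert is hypothetical, and the bias-free SwiGLU form of the MLP is an assumption; only the sigmoid gate of shape Linear(2048, 1) and the projection sizes come from the text.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GatedSharedExpert(nn.Module):
    """Hypothetical stand-in for the shared-expert path: MLP output times a sigmoid gate."""

    def __init__(self, hidden_size=2048, intermediate_size=512):
        super().__init__()
        # Assumed: bias-free SwiGLU MLP (gate_proj, up_proj, down_proj).
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.shared_expert_gate = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states):
        bs, seq, dim = hidden_states.shape
        x = hidden_states.view(-1, dim)  # [bs * seq, hidden]
        shared = self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
        gate = torch.sigmoid(self.shared_expert_gate(x))  # [bs * seq, 1], one scalar per token
        # The [tokens, 1] gate broadcasts across the hidden dimension.
        return (gate * shared).reshape(bs, seq, dim), gate


torch.manual_seed(0)
block = GatedSharedExpert()
out, gate = block(torch.randn(2, 3, 2048))
num_params = sum(p.numel() for p in block.parameters())
print(out.shape, gate.shape, num_params)
```

With these sizes the shared path holds 3 × 2048 × 512 + 2048 = 3,147,776 parameters, and the gate contributes a single scalar in (0, 1) per token.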
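5.3 How the @use_experts_implementation Decorator Works
Defined in integrations/moe.py:523 (file:///workspace/src/transformers/integrations/moe.py), this decorator allows the default PyTorch expert implementation to be swapped for an optimized one (such as megablocks or grouped_gemm):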
Figure (flowchart, reconstructed as text from the diagram's node labels): dispatch through @use_experts_implementation.
- The original Qwen3_5MoeExperts.forward() (a per-expert loop) is wrapped by the @use_experts_implementation decorator.
- experts_interface.dispatch() selects an implementation at runtime.
- Default: PyTorch implementation, per-expert loop.
- megablocks: MegaBlocks implementation, block-sparse matrix multiplication.
- grouped_gemm: GroupedGEMM implementation, batched matrix multiplication.
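The dispatch pattern described here can be illustrated with a small registry-based class decorator. This is a generic sketch of the pattern only, not the Transformers implementation: EXPERT_IMPLS, register_impl, use_impl, ToyExperts, and the impl_name attribute are all hypothetical names, and the toy module applies every expert to every token to keep the comparison simple.

```python
import torch

# Hypothetical registry: implementation name -> function(module, x) -> tensor.
EXPERT_IMPLS = {}


def register_impl(name):
    """Register an alternative forward implementation under a name."""
    def wrap(fn):
        EXPERT_IMPLS[name] = fn
        return fn
    return wrap


def use_impl(cls):
    """Class decorator: route forward() through the registry, else use the original."""
    original_forward = cls.forward

    def forward(self, x):
        impl = EXPERT_IMPLS.get(getattr(self, "impl_name", "default"))
        if impl is None:
            return original_forward(self, x)  # default: the class's own loop
        return impl(self, x)

    cls.forward = forward
    return cls


@register_impl("batched")
def batched_forward(module, x):
    # One einsum over all experts instead of a Python loop.
    return torch.einsum("ehd,td->teh", module.weight, x).sum(dim=1)


@use_impl
class ToyExperts(torch.nn.Module):
    def __init__(self, num_experts=4, dim=8, impl_name="default"):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_experts, dim, dim))
        self.impl_name = impl_name

    def forward(self, x):
        # Default path: per-expert loop.
        out = torch.zeros_like(x)
        for w in self.weight:
            out = out + x @ w.T
        return out


torch.manual_seed(0)
experts = ToyExperts()
x = torch.randn(3, 8)
y_default = experts(x)          # no "default" entry in the registry, so the loop runs
experts.impl_name = "batched"
y_batched = experts(x)          # same module, now dispatched to the einsum version
print(torch.allclose(y_default, y_batched, atol=1e-4))
```

The point of the design is that the model class keeps a readable reference forward(), while faster kernels can be selected at runtime without changing the model code.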
5.4 Load-Balancing Loss Computation Diagram
Defined in modeling_qwen3_5_moe.py:1755-1834 (file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L1755):
Figure (flowchart, reconstructed as text from the diagram's node labels): load-balancing loss computation.
- Input: gate_logits, the router logits of each layer, shape [bs*seq, 256].
- torch.cat the logits of all layers → Softmax → routing_weights → Top-K → selected_experts → one_hot → expert_mask.
- Without attention_mask: tokens_per_expert = mean(expert_mask); router_prob_per_expert = mean(routing_weights).
- With attention_mask: tokens_per_expert = sum(mask * expert_mask) / sum(mask); router_prob_per_expert = sum(weights * mask) / sum(mask).
- overall_loss = sum(tokens_per_expert × router_prob_per_expert) × num_experts.
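Formula: L_aux = N × Σ_i (f_i × P_i), where N is the number of experts, f_i is the fraction of tokens assigned to expert i, and P_i is the mean routing probability for expert i.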
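The formula and both mask branches can be sketched as follows. This is a simplified sketch, not the library function: the name aux_load_balancing_loss is hypothetical, and the mask is assumed to be a flat [num_tokens] tensor that applies identically to every layer.

```python
import torch
import torch.nn.functional as F


def aux_load_balancing_loss(gate_logits, num_experts, top_k, token_mask=None):
    """Hypothetical helper computing L_aux = N * sum_i(f_i * P_i).

    gate_logits: tuple of per-layer router logits, each [num_tokens, num_experts].
    token_mask:  optional [num_tokens] tensor, 1 for real tokens and 0 for padding.
    """
    logits = torch.cat(list(gate_logits), dim=0)  # [layers * tokens, N]
    routing_weights = F.softmax(logits, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    expert_mask = F.one_hot(selected_experts, num_experts).float()  # [L*T, k, N]

    if token_mask is None:
        tokens_per_expert = expert_mask.mean(dim=0)            # f: [k, N]
        router_prob_per_expert = routing_weights.mean(dim=0)   # P: [N]
    else:
        # Assumed: the same token mask is reused for every layer.
        mask = token_mask.float().repeat(len(gate_logits))
        denom = mask.sum()
        tokens_per_expert = (expert_mask * mask[:, None, None]).sum(dim=0) / denom
        router_prob_per_expert = (routing_weights * mask[:, None]).sum(dim=0) / denom

    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))


# Uniform logits: every expert has probability 1/N, so the loss equals top_k.
uniform_logits = tuple(torch.zeros(6, 8) for _ in range(2))
loss_uniform = aux_load_balancing_loss(uniform_logits, num_experts=8, top_k=2)

torch.manual_seed(0)
random_logits = tuple(torch.randn(6, 8) for _ in range(2))
loss_unmasked = aux_load_balancing_loss(random_logits, num_experts=8, top_k=2)
loss_all_ones = aux_load_balancing_loss(random_logits, 8, 2, token_mask=torch.ones(6))
print(float(loss_uniform), float(loss_unmasked), float(loss_all_ones))
```

Note that f is tracked per Top-K slot here, so with perfectly uniform router probabilities the loss is top_k rather than 1; an all-ones mask reproduces the unmasked result.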
6. Multimodal Vision Encoder
6.1 Vision Encoding Flow Diagram
Figure (flowchart, reconstructed as text from the diagram's node labels): vision encoding flow.
- Input: pixel_values, shape [num_patches, 3, 2, 16, 16], i.e. (C, T, H, W) per patch.
- Qwen3_5MoeVisionPatchEmbed: Conv3d with kernel=[2, 16, 16] and stride=[2, 16, 16].
- Position embedding: bilinear interpolation + pos_embed.
- Rotary position encoding: rotary_pos_emb(position_ids).
- 27 VisionBlock layers (×27), each: norm1 (LayerNorm) → VisionAttention (qkv → RoPE → Attention → proj) → norm2 (LayerNorm) → VisionMLP (fc1 → GELU → fc2).
- PatchMerger: LayerNorm → fc1 → GELU → fc2, mapping [1152×4] to [3584].
- Output: image_embeds / video_embeds, shape [num_tokens, 3584].
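A quick shape check of the two ends of this pipeline helps show where the token count shrinks. The sketch below is an assumption-laden illustration: it uses the sizes from the diagram labels (1152, 3584, 2×2 merge), skips the 27 vision blocks and both position encodings entirely, and ToyPatchMerger is a hypothetical class that assumes LayerNorm is applied per patch before four consecutive patches are concatenated.

```python
import torch
from torch import nn

HIDDEN, OUT_HIDDEN, MERGE = 1152, 3584, 2  # values read off the diagram labels

# Kernel equals stride equals patch size, so each patch collapses to one vector.
patch_embed = nn.Conv3d(3, HIDDEN, kernel_size=(2, 16, 16), stride=(2, 16, 16))


class ToyPatchMerger(nn.Module):
    """Hypothetical merger: LayerNorm -> concat MERGE*MERGE patches -> fc1 -> GELU -> fc2."""

    def __init__(self, hidden, out_hidden, merge):
        super().__init__()
        self.merged = hidden * merge**2  # 1152 * 4 = 4608
        self.norm = nn.LayerNorm(hidden)
        self.fc1 = nn.Linear(self.merged, self.merged)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(self.merged, out_hidden)

    def forward(self, x):
        # Assumes every 4 consecutive rows form one 2x2 spatial neighbourhood.
        x = self.norm(x).view(-1, self.merged)
        return self.fc2(self.act(self.fc1(x)))


torch.manual_seed(0)
pixel_values = torch.randn(16, 3, 2, 16, 16)  # 16 patches
with torch.no_grad():
    patches = patch_embed(pixel_values).view(-1, HIDDEN)               # [16, 1152]
    image_embeds = ToyPatchMerger(HIDDEN, OUT_HIDDEN, MERGE)(patches)  # [4, 3584]
print(patches.shape, image_embeds.shape)
```

The merger reduces the sequence length by a factor of 4, which is why num_tokens in the output is smaller than num_patches in the input.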
6.2 3D Position Encoding (M-RoPE) Computation Diagram
The 3D position IDs of vision tokens are computed by the get_vision_position_ids method, defined in modeling_qwen3_5_moe.py:1394-1450 (file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L1394):
Figure (flowchart, reconstructed as text from the diagram's node labels): 3D position computation.
- Input: grid_thw = [T, H, W].
- Spatial merging: spatial_merge_size = 2, temporal_merge_size = 1.
- position_temporal: arange(T) × time_interval + start_position.
- position_height: arange(H//2) + start_position, then repeat_interleave(W//2), repeated T times.
- position_width: arange(W//2) + start_position, then repeat(H//2 × T).
- Output: torch.stack of (T, H, W) positions, shape [3, num_tokens].
Key code (abridged):
python
def get_vision_position_ids(self, start_position, grid_thw, ...):
    # Grid size after temporal and spatial merging
    llm_grid_t = grid_thw[0] // temp_merge_size
    llm_grid_h = grid_thw[1] // spatial_merge_size
    llm_grid_w = grid_thw[2] // spatial_merge_size
    # Base positions along each axis
    position_temporal = torch.arange(llm_grid_t) * time_interval
    position_width = torch.arange(llm_grid_w) + start_position
    position_height = torch.arange(llm_grid_h) + start_position
    # Expand each axis to one entry per token (row-major within each frame)
    position_width = position_width.repeat(llm_grid_h * llm_grid_t)
    position_height = position_height.repeat_interleave(llm_grid_w).repeat(llm_grid_t)
    position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
    # Shape [3, num_tokens], ordered (temporal, height, width)
    return torch.stack([position_temporal, position_height, position_width], dim=0)
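A standalone version of the same computation makes it easy to inspect the result on a small grid. This is a sketch for illustration: the free function vision_position_ids is hypothetical, grid_thw is passed as a plain tuple, and the default values (temp_merge_size=1, spatial_merge_size=2, time_interval=1) are assumptions based on the diagram labels.

```python
import torch


def vision_position_ids(start_position, grid_thw, temp_merge_size=1,
                        spatial_merge_size=2, time_interval=1):
    """Hypothetical standalone version of the excerpt above; returns [3, num_tokens]."""
    t, h, w = grid_thw
    llm_grid_t = t // temp_merge_size
    llm_grid_h = h // spatial_merge_size
    llm_grid_w = w // spatial_merge_size

    position_temporal = torch.arange(llm_grid_t) * time_interval
    position_width = torch.arange(llm_grid_w) + start_position
    position_height = torch.arange(llm_grid_h) + start_position

    # Width varies fastest, then height, then time (row-major per frame).
    position_width = position_width.repeat(llm_grid_h * llm_grid_t)
    position_height = position_height.repeat_interleave(llm_grid_w).repeat(llm_grid_t)
    position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
    return torch.stack([position_temporal, position_height, position_width], dim=0)


# One frame of 4 x 6 patches, merged 2 x 2, placed after 5 text tokens.
ids = vision_position_ids(start_position=5, grid_thw=(1, 4, 6))
print(ids)
# tensor([[5, 5, 5, 5, 5, 5],
#         [5, 5, 5, 6, 6, 6],
#         [5, 6, 7, 5, 6, 7]])
```

For a single image all tokens share one temporal position, while the height and width rows enumerate the 2 × 3 merged grid; every axis is offset by start_position so the image continues from the preceding text positions.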
6.3 Vision-Token and Text-Token Fusion Flow Diagram
In modeling_qwen3_5_moe.py:1707-1727 (file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L1707), masked_scatter is used to merge the vision embeddings into the text embeddings:
Flowchart: visual embedding fusion
- input_ids (bs, seq), containing image_token_id placeholders → embed_tokens(input_ids) → (bs, seq, 2048)
- pixel_values (image/video pixels) → VisionModel encoding → image_embeds / video_embeds
- get_placeholder_mask(): locates the image_token_id / video_token_id positions
- torch_compilable_check: verifies that the token count equals the feature count
- masked_scatter(mask, embeds): writes the visual embeddings into the placeholder positions
- inputs_embeds (bs, seq, 2048): fused text + vision embeddings
Key code:
python
if pixel_values is not None:
    # Encode the images, then locate the image placeholder tokens
    image_embeds = self.get_image_features(pixel_values, image_grid_thw)
    image_mask, _ = self.get_placeholder_mask(input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds)
    # Overwrite the placeholder positions with the image embeddings
    inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
if pixel_values_videos is not None:
    # Same procedure for video frames
    video_embeds = self.get_video_features(pixel_values_videos, video_grid_thw)
    _, video_mask = self.get_placeholder_mask(input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds)
    inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
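The fusion step can be illustrated without tensors. The following is a minimal sketch rather than the library code: masked_scatter_rows and IMAGE_TOKEN_ID = 99 are hypothetical names, plain Python lists stand in for tensors, and the batch dimension is omitted. It shows the two properties the real code relies on: the number of placeholders must equal the number of visual features, and features are consumed in order.

```python
def masked_scatter_rows(inputs_embeds, mask, source_rows):
    """Toy stand-in for Tensor.masked_scatter on whole embedding rows.

    Rows flagged by `mask` are replaced, in order, by consecutive rows of
    `source_rows`; all other rows are copied unchanged.
    """
    n_slots = sum(mask)
    if n_slots != len(source_rows):
        # Mirrors the "token count == feature count" check described above
        raise ValueError(
            f"placeholder tokens ({n_slots}) != visual features ({len(source_rows)})"
        )
    rows = iter(source_rows)
    return [list(next(rows)) if flag else list(row)
            for row, flag in zip(inputs_embeds, mask)]


IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, for illustration only
input_ids = [5, 99, 99, 7]

# Fake text embeddings: each token id repeated twice (hidden size 2)
text_embeds = [[float(token)] * 2 for token in input_ids]
# Fake vision encoder output: one row per placeholder token
image_embeds = [[0.1, 0.2], [0.3, 0.4]]

mask = [token == IMAGE_TOKEN_ID for token in input_ids]
fused = masked_scatter_rows(text_embeds, mask, image_embeds)
print(fused)
```

The text rows at positions 0 and 3 survive untouched, while the two placeholder rows are replaced by the first and second image feature.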
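7. RoPE and M-RoPE Positional Encoding
M-RoPE (Multimodal Rotary Position Embedding) is the core positional encoding scheme of the Qwen3.5 series. It supports 1D positions for text and 3D positions for images and videos.
7.1 M-RoPE Principle Diagram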
Diagram: position IDs per token type
- Text token (1D position): position_ids = 0, 1, 2, 3, ...; all three axes use the same position
  - T: 0, 1, 2, 3, ...
  - H: 0, 1, 2, 3, ...
  - W: 0, 1, 2, 3, ...
- Image token (3D position): grid_thw = (1, H, W)
  - T: 0, 0, 0, ..., 0 (single frame, all zeros)
  - H: 0, 0, 1, 1, 2, 2, ... (each row index repeated)
  - W: 0, 1, 0, 1, 0, 1, ... (column indices repeated)
- Video token (3D position): grid_thw = (T, H, W)
  - T: 0, 0, ..., 0, 1, 1, ..., 1, ... (increases from frame to frame)
  - H: 0, 0, 1, 1, ..., 0, 0, 1, 1, ... (rows repeat within each frame)
  - W: 0, 1, 0, 1, ..., 0, 1, 0, 1, ... (columns repeat within each frame)
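The three patterns in the diagram can be reproduced with a few nested loops. This is a minimal sketch under stated assumptions: text_position_ids and vision_position_ids are hypothetical helper names, the grid sizes are taken after spatial merging, and positions start at 0 with no preceding tokens.

```python
def text_position_ids(length, start=0):
    """Text tokens: the same 1D position is used on the T, H and W axes."""
    ids = list(range(start, start + length))
    return ids, list(ids), list(ids)


def vision_position_ids(grid_t, grid_h, grid_w, start=0):
    """Image/video tokens: one (t, h, w) coordinate per token, frame by frame,
    row by row within a frame, column by column within a row."""
    t_ids, h_ids, w_ids = [], [], []
    for t in range(grid_t):
        for h in range(grid_h):
            for w in range(grid_w):
                t_ids.append(start + t)
                h_ids.append(start + h)
                w_ids.append(start + w)
    return t_ids, h_ids, w_ids


# Image with a 3 x 2 grid (single frame)
img_t, img_h, img_w = vision_position_ids(1, 3, 2)
print(img_t, img_h, img_w)

# Video with 2 frames of 2 x 2 tokens each
vid_t, vid_h, vid_w = vision_position_ids(2, 2, 2)
print(vid_t, vid_h, vid_w)
```

For the image, T stays at zero while H and W trace the grid; for the video, T additionally steps once per frame.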
7.2 apply_interleaved_mrope Interleaving Diagram
In modeling_qwen3_5_moe.py:165-180(file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L165), M-RoPE interleaves the frequencies of the three axes:
Diagram: from chunked to interleaved frequencies
- Chunked frequencies (mrope_section = [11, 11, 10])
  - T frequencies: f0...f10 (11 dimensions)
  - H frequencies: f0...f10 (11 dimensions)
  - W frequencies: f0...f9 (10 dimensions)
- After interleaving: T0, H0, W0, T1, H1, W1, ..., T9, H9, W9, T10, H10
- Interleaving T/H/W keeps the frequencies contiguous
Key code:
python
def apply_interleaved_mrope(self, freqs, mrope_section):
    freqs_t = freqs[0]  # use the T axis as the base
    for dim, offset in enumerate((1, 2), start=1):  # H, W
        length = mrope_section[dim] * 3
        idx = slice(offset, length, 3)  # interleaved indices
        freqs_t[..., idx] = freqs[dim, ..., idx]
    return freqs_t
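To see which axis ends up owning each frequency slot, the slicing logic can be replayed on a list of labels. This is a minimal sketch: interleaved_axis_layout is a hypothetical helper that mirrors the loop above, with range(offset, length, 3) standing in for the tensor slice.

```python
def interleaved_axis_layout(mrope_section):
    """Return, for every frequency index, the axis ("T", "H" or "W") whose
    position is used after interleaving."""
    total = sum(mrope_section)
    layout = ["T"] * total  # start from the T axis, as freqs[0] does
    for dim, (axis, offset) in enumerate((("H", 1), ("W", 2)), start=1):
        length = mrope_section[dim] * 3
        for i in range(offset, length, 3):  # same indices as slice(offset, length, 3)
            layout[i] = axis
    return layout


layout = interleaved_axis_layout([11, 11, 10])
print("".join(layout))
print({axis: layout.count(axis) for axis in "THW"})
```

With mrope_section = [11, 11, 10], the first 30 slots cycle through T, H, W, and the last two slots fall to T and H because W has only 10 entries.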
7.3 get_rope_index Position Computation Flowchart
Defined in modeling_qwen3_5_moe.py:1452-1543(file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L1452):
Flowchart: get_rope_index
- Inputs: input_ids + mm_token_type_ids, plus image_grid_thw + video_grid_thw
- Group tokens by token_type with itertools.groupby
- Text group (type=0): arange(text_len) + current_pos; expand(3, -1) → T/H/W identical
- Image group (type=1): next(image_grid_thw_iter); get_vision_position_ids(current_pos, grid_thw); current_pos += max(H, W) // merge_size
- Video group (type=2): next(video_grid_thw_iter); get_vision_position_ids(current_pos, grid_thw); current_pos += max(H, W) // merge_size
- torch.cat the positions of all groups → shape (3, bs, seq_len)
- mrope_position_deltas = max(position) + 1 - seq_len
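The flow above can be traced end to end on a toy sequence. The sketch below is a simplified illustration, not the library implementation: toy_rope_index is a hypothetical function, it handles a single sequence without a batch dimension, it covers only text and image groups (video groups would follow the same path with their own grid iterator), and it assumes that text groups advance current_pos by their length.

```python
from itertools import groupby

TEXT, IMAGE = 0, 1  # token type ids used in this toy example


def toy_rope_index(token_types, image_grids, merge_size=2):
    """Compute [T, H, W] position lists and the position delta for one sequence."""
    grids = iter(image_grids)
    t_ids, h_ids, w_ids = [], [], []
    current_pos = 0
    for token_type, group in groupby(token_types):
        length = sum(1 for _ in group)
        if token_type == TEXT:
            # Text: one running position shared by all three axes
            ids = list(range(current_pos, current_pos + length))
            t_ids.extend(ids)
            h_ids.extend(ids)
            w_ids.extend(ids)
            current_pos += length
        else:
            _, h, w = next(grids)
            rows, cols = h // merge_size, w // merge_size
            if rows * cols != length:
                raise ValueError("placeholder count does not match the merged grid")
            for r in range(rows):
                for c in range(cols):
                    t_ids.append(current_pos)
                    h_ids.append(current_pos + r)
                    w_ids.append(current_pos + c)
            # The image advances the position by its larger merged side
            current_pos += max(h, w) // merge_size
    delta = max(t_ids + h_ids + w_ids) + 1 - len(token_types)
    return [t_ids, h_ids, w_ids], delta


token_types = [TEXT, TEXT, IMAGE, IMAGE, IMAGE, IMAGE, TEXT, TEXT]
positions, delta = toy_rope_index(token_types, image_grids=[(1, 4, 4)])
for axis, ids in zip("THW", positions):
    print(axis, ids)
print("delta:", delta)
```

The four image tokens occupy only two position steps, so the largest position (5) is smaller than seq_len - 1 (7) and the delta is negative.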
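Special handling for video: because Qwen3.5 separates video frames with timestamps, video_grid_thw has to be split per frame:
python
if video_grid_thw is not None:
    # Repeat each (T, H, W) row T times, so there is one row per frame
    video_grid_thw = torch.repeat_interleave(video_grid_thw, video_grid_thw[:, 0], dim=0)
    video_grid_thw[:, 0] = 1  # each frame is treated independently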
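The effect of this split is easier to see on concrete numbers. The following is a minimal pure-Python sketch of the same transformation; split_video_grid_per_frame is a hypothetical helper, and nested lists stand in for the grid tensor.

```python
def split_video_grid_per_frame(video_grid_thw):
    """Turn each (T, H, W) row into T rows of (1, H, W), preserving order."""
    per_frame = []
    for t, h, w in video_grid_thw:
        per_frame.extend([1, h, w] for _ in range(t))
    return per_frame


# Two videos: 3 frames of 4 x 6 patches, then 2 frames of 8 x 8 patches
frames = split_video_grid_per_frame([[3, 4, 6], [2, 8, 8]])
print(frames)
```

After the split, every row describes a single frame, so each frame can be consumed as its own vision group.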
8. Cache System
Qwen3.5-MoE's hybrid attention architecture requires a hybrid cache: full_attention layers use a KV cache, while linear_attention layers use conv_state + recurrent_state.
8.1 Hybrid Cache Architecture Diagram
Diagram: hybrid cache layout
- DynamicCache (unified management): config.layer_types determines the cache type of each layer (full_attention or linear_attention)
- full_attention layer cache: CacheLayer
  - key_cache: (bs, heads, seq, dim)
  - value_cache: (bs, heads, seq, dim)
- linear_attention layer cache: LinearAttentionCacheLayerMixin
  - conv_states: (bs, conv_dim, kernel_size), the causal convolution state
  - recurrent_states: (bs, heads, k_dim, v_dim), the DeltaNet recurrent state
At initialization, DynamicCache automatically determines each layer's cache type from config.layer_types. It is defined in cache_utils.py:1229(file:///workspace/src/transformers/cache_utils.py).
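The per-layer dispatch can be sketched with two toy layer classes. Everything below is hypothetical and only illustrates the idea: ToyKVLayer, ToyLinearAttentionLayer and build_toy_cache are not Transformers classes, strings stand in for tensors, and the layer_types pattern is illustrative rather than the model's real configuration.

```python
class ToyKVLayer:
    """Stand-in for a full_attention cache layer: grows by one entry per token."""

    def __init__(self):
        self.keys, self.values = [], []

    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def size(self):
        return len(self.keys)


class ToyLinearAttentionLayer:
    """Stand-in for a linear_attention cache layer: fixed-size state, overwritten."""

    def __init__(self):
        self.conv_state = None
        self.recurrent_state = None

    def update(self, conv_state, recurrent_state):
        self.conv_state = conv_state
        self.recurrent_state = recurrent_state

    def size(self):
        return 0 if self.recurrent_state is None else 1


LAYER_FACTORIES = {
    "full_attention": ToyKVLayer,
    "linear_attention": ToyLinearAttentionLayer,
}


def build_toy_cache(layer_types):
    """Pick one cache layer per entry of layer_types."""
    return [LAYER_FACTORIES[layer_type]() for layer_type in layer_types]


layer_types = ["linear_attention", "linear_attention", "full_attention"]  # illustrative only
cache = build_toy_cache(layer_types)
for step in range(5):  # pretend to process 5 tokens
    for layer in cache:
        if isinstance(layer, ToyKVLayer):
            layer.update(key=f"k{step}", value=f"v{step}")
        else:
            layer.update(conv_state=f"conv@{step}", recurrent_state=f"S@{step}")

sizes = [layer.size() for layer in cache]
print(sizes)
```

After five tokens the full_attention layer holds five entries, while each linear_attention layer still holds a single state; this is why the linear layers keep memory constant as the sequence grows.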
8.2 Cache Update Flow for linear_attention Layers
Sequence diagram (participants: Qwen3_5MoeGatedDeltaNet, DynamicCache, conv_state, recurrent_state)
- Prefill stage (seq_len > 1)
  - has_previous_state(layer_idx)? → False (first call)
  - in_proj_qkv → causal_conv1d
  - update_conv_state(new_conv_state, layer_idx): lazy initialization + copy
  - chunk_gated_delta_rule(Q, K, V, g, β)
  - update_recurrent_state(last_recurrent_state, layer_idx): copy
- Decode stage (seq_len == 1)
  - has_previous_state(layer_idx)? → True; returns conv_state, recurrent_state
  - causal_conv1d_update (single-step update); conv_state is updated in place
  - recurrent_gated_delta_rule (recurrence): S_t = S_{t-1} * g + k^T * δ
  - update_recurrent_state(new_state, layer_idx): copy
The key code is in modeling_qwen3_5_moe.py:449-546(file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L449):
python
use_precomputed_states = cache_params is not None and cache_params.has_previous_state(self.layer_idx)
if use_precomputed_states:
    conv_state = cache_params.layers[self.layer_idx].conv_states
    recurrent_state = cache_params.layers[self.layer_idx].recurrent_states
if not (use_precomputed_states and seq_len == 1):
    # Prefill: multiple tokens, use chunk mode
    if cache_params is not None:
        new_conv_state = F.pad(mixed_qkv, (self.conv_kernel_size - mixed_qkv.shape[-1], 0))
        cache_params.update_conv_state(new_conv_state, self.layer_idx)
    core_attn_out, last_recurrent_state = self.chunk_gated_delta_rule(...)
else:
    # Decode: single token, use recurrent mode
    mixed_qkv = self.causal_conv1d_update(mixed_qkv, conv_state, ...)
    core_attn_out, last_recurrent_state = self.recurrent_gated_delta_rule(...)
if cache_params is not None:
    cache_params.update_recurrent_state(last_recurrent_state, self.layer_idx)
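Why a single cached matrix is enough for decoding can be shown with a toy version of the recurrence S_t = S_{t-1} * g + k^T * δ. The sketch below is a simplified single-head illustration in pure Python, not the library kernels: gated_delta_step and run_tokens are hypothetical names, the correction term is assumed to be δ = β * (v - S^T k), and the same step function is used for both phases, whereas the real prefill path uses a chunked formulation.

```python
def gated_delta_step(state, q, k, v, g, beta):
    """One recurrent step on a k_dim x v_dim state matrix (list of rows)."""
    k_dim, v_dim = len(k), len(v)
    # Decay the previous state: S * g
    decayed = [[s * g for s in row] for row in state]
    # What the decayed state currently recalls for key k: S^T k
    recalled = [sum(decayed[i][j] * k[i] for i in range(k_dim)) for j in range(v_dim)]
    # Assumed delta-rule correction: beta * (v - S^T k)
    delta = [beta * (v[j] - recalled[j]) for j in range(v_dim)]
    # Rank-1 write: S + k^T * delta
    new_state = [[decayed[i][j] + k[i] * delta[j] for j in range(v_dim)]
                 for i in range(k_dim)]
    # Read out with the query: S^T q
    out = [sum(new_state[i][j] * q[i] for i in range(k_dim)) for j in range(v_dim)]
    return new_state, out


def run_tokens(tokens, state=None, k_dim=2, v_dim=2):
    """Process tokens in order, optionally resuming from a cached state."""
    if state is None:
        state = [[0.0] * v_dim for _ in range(k_dim)]
    outputs = []
    for q, k, v, g, beta in tokens:
        state, out = gated_delta_step(state, q, k, v, g, beta)
        outputs.append(out)
    return state, outputs


tokens = [  # (q, k, v, g, beta) per token, arbitrary toy values
    ([1.0, 0.0], [1.0, 0.0], [2.0, 3.0], 0.9, 1.0),
    ([0.0, 1.0], [0.0, 1.0], [1.0, -1.0], 0.9, 0.5),
    ([1.0, 1.0], [1.0, 0.0], [0.5, 0.5], 0.8, 1.0),
]

full_state, full_out = run_tokens(tokens)                 # whole sequence at once
cached_state, prefill_out = run_tokens(tokens[:2])        # "prefill" on the prompt
decode_state, decode_out = run_tokens(tokens[2:], state=cached_state)  # "decode"
print(decode_state == full_state, prefill_out + decode_out == full_out)
```

Resuming from the cached state reproduces the full-sequence result exactly, which is the property the has_previous_state branch depends on.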
9. The Full generate() Flow
9.1 Generation Loop Sequence Diagram
[Sequence diagram] Participants: User, AutoProcessor, Qwen3_5MoeForConditionalGeneration, Qwen3_5MoeVisionModel, Qwen3_5MoeTextModel, DynamicCache. Messages, in order:
- Process image + text
- input_ids + pixel_values + grid_thw + mm_token_type_ids
- Generation loop, first iteration (is_first_iteration=True):
  - prepare_inputs_for_generation()
  - get_image_features(pixel_values, grid_thw), which returns image_embeds
  - masked_scatter fuses the vision embeddings
  - _prepare_position_ids_for_generation() computes the 3D position_ids and rope_deltas
  - forward(inputs_embeds, position_ids, ...)
  - DynamicCache is initialized
  - hidden_states
  - lm_head → logits → sample next_token
- Generation loop, subsequent iterations (is_first_iteration=False):
  - prepare_inputs_for_generation() clears pixel_values/grid_thw
  - _prepare_position_ids_for_generation() derives positions from rope_deltas
  - forward(input_ids=next_token, position_ids, past_key_values)
  - The cache is read and updated
  - hidden_states
  - lm_head → logits → sample next_token
- generated_ids
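To make the two phases of the loop concrete, here is a minimal sketch with a stand-in model. The names toy_forward and toy_generate are hypothetical, the plain list only plays the role of DynamicCache, and the fake next-token rule has nothing to do with real sampling. The point is only that the first iteration feeds the whole prompt while every later iteration feeds a single token.

```python
def toy_forward(tokens, cache):
    """Stand-in for a model forward pass: extend the cache, return a fake next token."""
    cache.extend(tokens)
    return (sum(cache) + len(cache)) % 100


def toy_generate(prompt_ids, max_new_tokens):
    """Greedy-style loop: prefill on the first iteration, one token per step afterwards."""
    cache = []                      # plays the role of DynamicCache (assumption)
    generated = list(prompt_ids)
    fed_per_step = []               # how many tokens each forward call received
    for step in range(max_new_tokens):
        is_first_iteration = step == 0
        # First iteration: full prompt. Later iterations: only the newest token.
        step_input = list(generated) if is_first_iteration else generated[-1:]
        fed_per_step.append(len(step_input))
        generated.append(toy_forward(step_input, cache))
    return generated, fed_per_step


generated_ids, fed_per_step = toy_generate([5, 6, 7], max_new_tokens=3)
print(generated_ids, fed_per_step)
```

The fed_per_step list shows the prefill/decode split: the prompt length once, then 1 for every later step.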
9.2 Special Handling in prepare_inputs_for_generation
Defined in modeling_qwen3_5_moe.py:2106-2142 (file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L2106):
python
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, ...,
                                  pixel_values=None, pixel_values_videos=None,
                                  image_grid_thw=None, video_grid_thw=None,
                                  is_first_iteration=False, **kwargs):
    # Abridged excerpt: "..." marks parameters and arguments omitted here.
    model_inputs = super().prepare_inputs_for_generation(...)
    # After the first iteration, drop the vision inputs so they are not re-encoded.
    # (use_cache is not shown in the abridged signature above.)
    if not is_first_iteration and use_cache:
        model_inputs["pixel_values"] = None
        model_inputs["pixel_values_videos"] = None
    return model_inputs
[Diagram: vision inputs per iteration]
First iteration:
- pixel_values: set
- image_grid_thw: set
- mm_token_type_ids: set
- Result: vision encoding + masked_scatter
Subsequent iterations:
- pixel_values: None
- image_grid_thw: None
- mm_token_type_ids: None
- Result: text tokens only; positions are derived from rope_deltas
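The effect of this clearing step can be sketched without the model. The snippet below is an illustration under assumptions, not the library code: prepare_step_inputs and count_vision_encoder_calls are hypothetical helpers, and "encoding" is reduced to checking whether pixel_values is still present. It shows that with caching enabled the vision encoder would run once, and without caching it would run on every step.

```python
def prepare_step_inputs(model_inputs, is_first_iteration, use_cache=True):
    """Return a per-step copy of the inputs, dropping vision tensors after step 0."""
    step_inputs = dict(model_inputs)
    if not is_first_iteration and use_cache:
        step_inputs["pixel_values"] = None
        step_inputs["pixel_values_videos"] = None
    return step_inputs


def count_vision_encoder_calls(model_inputs, num_steps, use_cache=True):
    """Count the steps on which a vision encoder would have to run."""
    calls = 0
    for step in range(num_steps):
        step_inputs = prepare_step_inputs(model_inputs, step == 0, use_cache)
        if step_inputs["pixel_values"] is not None:
            calls += 1
    return calls


# Placeholder values stand in for real tensors.
inputs = {"input_ids": [1, 2, 3], "pixel_values": "image-tensor", "pixel_values_videos": None}
print(count_vision_encoder_calls(inputs, num_steps=5))
print(count_vision_encoder_calls(inputs, num_steps=5, use_cache=False))
```

Copying the dictionary keeps the caller's inputs untouched, which makes the helper safe to call repeatedly in a loop.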
9.3 3D Position Encoding in _prepare_position_ids_for_generation
Defined in modeling_qwen3_5_moe.py:2144-2180 (file:///workspace/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L2144):
[Flowchart: _prepare_position_ids_for_generation]
- Is past_length != 0 and does rope_deltas exist?
  - Yes: position_ids = text_positions + rope_deltas (reuse the cached delta directly)
  - No: continue with the next check
- Are input_ids, mm_token_type_ids, and image/video_grid_thw all present?
  - Yes: get_rope_index(input_ids, ...) computes the full 3D positions and stores rope_deltas
  - No: vision_positions = text_positions.expand(3, -1, -1) (pure text, so all three dimensions are identical) and rope_deltas = zeros (no multimodal offset)
- torch.cat(text_positions, vision_positions), shape [4, bs, seq]
- Result: position_ids with shape [4, bs, seq]
Key code:
python
def _prepare_position_ids_for_generation(self, inputs_tensor, model_kwargs):
    # Abridged excerpt: "..." marks omitted conditions and arguments.
    text_positions = super()._prepare_position_ids_for_generation(inputs_tensor, model_kwargs)
    # Incremental decoding: reuse the cached rope_deltas.
    past_length = 0
    if (cache := model_kwargs.get("past_key_values")) is not None:
        past_length = cache.get_seq_length()
    if past_length != 0 and self.model.rope_deltas is not None:
        position_ids = text_positions[None, ...] + self.model.rope_deltas
        return position_ids
    # First step: compute the 3D positions.
    if is_input_ids and model_kwargs.get("mm_token_type_ids") is not None and ...:
        vision_positions, rope_deltas = self.model.get_rope_index(inputs_tensor, **model_kwargs)
        self.model.rope_deltas = rope_deltas
    else:
        vision_positions = text_positions.unsqueeze(0).expand(3, -1, -1)
        self.model.rope_deltas = torch.zeros(...)
    # Concatenate [text, T, H, W] into shape [4, bs, seq].
    text_positions = text_positions[None, ...]
    position_ids = torch.cat([text_positions, vision_positions], dim=0)
    return position_ids
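Why a single cached rope_deltas value is enough for incremental decoding can be seen in a small, list-based sketch. It assumes the Qwen2-VL-style convention in which an image block of merged grid size (t, h, w) advances the running position by max(t, h, w); the function names mrope_positions and decode_position and the segment format are hypothetical, and the real get_rope_index works on batched tensors and also handles video.

```python
def mrope_positions(segments):
    """Build [T, H, W] position lists for one sequence.

    segments: list of ("text", n_tokens) or ("image", (t, h, w)), where
    (t, h, w) is the already-merged vision grid (assumption).
    Returns (positions, rope_delta).
    """
    pos_t, pos_h, pos_w = [], [], []
    next_start = 0
    for kind, spec in segments:
        if kind == "text":
            # Text tokens share the same index on all three axes.
            for i in range(spec):
                for axis in (pos_t, pos_h, pos_w):
                    axis.append(next_start + i)
            next_start += spec
        else:
            t, h, w = spec
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        pos_t.append(next_start + ti)
                        pos_h.append(next_start + hi)
                        pos_w.append(next_start + wi)
            # The block consumes max(t, h, w) positions, not t * h * w.
            next_start += max(t, h, w)
    rope_delta = next_start - len(pos_t)
    return [pos_t, pos_h, pos_w], rope_delta


def decode_position(past_length, rope_delta):
    """Position of the next generated token: plain text index plus the cached delta."""
    return past_length + rope_delta


positions, rope_delta = mrope_positions([("text", 3), ("image", (1, 2, 2)), ("text", 2)])
print(positions, rope_delta, decode_position(9, rope_delta))
```

In this example the 9-token prompt only spans positions 0 to 6, so the delta is negative and the first generated token lands on position 7 rather than 9.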
10. Distributed Parallelism
10.1 TP Strategy Map
Defined in configuration_qwen3_5_moe.py:59-72 (file:///workspace/src/transformers/models/qwen3_5_moe/configuration_qwen3_5_moe.py#L59):
[Diagram: TP plan]
Attention layer parallel strategy:
- q_proj → colwise (split by column; each GPU computes a subset of the heads)
- k_proj → colwise
- v_proj → colwise
- o_proj → rowwise (split by row; results are all-reduced)
- q_norm → replicated_with_grad_allreduce (replicated; gradients are all-reduced)
- k_norm → replicated_with_grad_allreduce
MoE expert parallel strategy:
- experts.gate_up_proj → packed_colwise (packed column split, shape [256, 1024, 2048])
- experts.down_proj → rowwise
- experts → moe_tp_experts (expert-level parallelism: each GPU holds a subset of the experts)
- shared_expert.gate_proj → colwise
- shared_expert.up_proj → colwise
- shared_expert.down_proj → rowwise
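The pairing of a colwise layer with a following rowwise layer can be checked on a tiny example. The sketch below is an illustration in plain Python, not the framework's sharding code: tp_mlp and the helper functions are hypothetical, ranks are simulated by a loop, the all-reduce is a plain sum, and weights are stored as [in, out] matrices (torch.nn.Linear stores them as [out, in]).

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(v, 0) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tp_mlp(x, w_up, w_down, tp_size):
    """Simulate colwise (w_up) + rowwise (w_down) sharding across tp_size ranks."""
    hidden = len(w_down)
    shard = hidden // tp_size
    total = None
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w_up_local = [row[lo:hi] for row in w_up]   # colwise: slice output columns
        w_down_local = w_down[lo:hi]                # rowwise: slice input rows
        partial = matmul(relu(matmul(x, w_up_local)), w_down_local)
        # Summing the partial outputs stands in for the all-reduce.
        total = partial if total is None else add(total, partial)
    return total


x = [[1, -2, 3]]
w_up = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1]]   # [3, 4]
w_down = [[1, 2], [0, -1], [3, 1], [-2, 0]]            # [4, 2]
reference = matmul(relu(matmul(x, w_up)), w_down)
sharded = tp_mlp(x, w_up, w_down, tp_size=2)
print(reference, sharded)
```

Because the activation is elementwise, each rank can apply it to its own slice, and only one communication step is needed after the rowwise layer.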
10.2 How MoE Expert Parallelism (moe_tp_experts) Works
[Diagram: 4-GPU tensor parallelism]
Expert placement:
- GPU 0: Expert 0-63, gate_up_proj[0:64], down_proj[0:64]
- GPU 1: Expert 64-127, gate_up_proj[64:128], down_proj[64:128]
- GPU 2: Expert 128-191, gate_up_proj[128:192], down_proj[128:192]
- GPU 3: Expert 192-255, gate_up_proj[192:256], down_proj[192:256]
Data flow:
- hidden_states [bs, seq, 2048]
- TopKRouter: every GPU computes the full routing
- All-to-All communication: tokens are sent to the GPU that hosts their expert
- Each GPU runs the forward pass of its local experts in parallel
- All-to-All communication: the results are collected
- expert_output [bs, seq, 2048]
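How moe_tp_experts differs from plain colwise/rowwise:
- colwise/rowwise: shard the weight matrix of a single linear layer
- moe_tp_experts: shard along the expert dimension, so each GPU holds 256/tp_size complete experts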
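The expert-dimension split can be sketched as a lookup from expert id to owning rank, followed by grouping routed tokens by destination. This is a minimal sketch assuming the contiguous block placement shown in the diagram (experts 0-63 on GPU 0, and so on); expert_owner and dispatch are hypothetical names, and no actual communication takes place.

```python
def expert_owner(expert_id, num_experts=256, tp_size=4):
    """Rank that holds a given expert under contiguous block placement."""
    per_rank = num_experts // tp_size
    return expert_id // per_rank


def dispatch(topk_ids, num_experts=256, tp_size=4):
    """Group (token_index, expert_id) pairs by the rank that must process them."""
    buckets = {rank: [] for rank in range(tp_size)}
    for token_index, expert_ids in enumerate(topk_ids):
        for expert_id in expert_ids:
            rank = expert_owner(expert_id, num_experts, tp_size)
            buckets[rank].append((token_index, expert_id))
    return buckets


# Two tokens, each routed to three experts (the text uses Top-8; 3 keeps it short).
routing = [[0, 64, 255], [63, 65, 130]]
buckets = dispatch(routing)
print(buckets)
```

Each bucket corresponds to what one rank would receive in the first All-to-All step; the second All-to-All returns the expert outputs to the token's original position.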
11. State and Lifecycle Summary
11.1 State Machine Diagram
[State diagram unavailable: the Mermaid source failed to render (parse error on line 41). The surviving fragment reads "Generation: full_attention layers: KV Cache".]
11.2 Summary of the Key Data Flow
[Diagram: key data flow]
Input:
- input_ids
- pixel_values
- grid_thw
- mm_token_type_ids
Vision encoding:
- VisionModel: PatchEmbed → 27 Blocks → Merger
- image_embeds [num_tokens, 3584]
Embedding fusion:
- embed_tokens
- masked_scatter: vision embeddings → placeholders
- inputs_embeds [bs, seq, 2048]
Position encoding:
- get_rope_index: 3D M-RoPE
- position_ids [4, bs, seq]
Text model:
- 40 DecoderLayer blocks; each layer runs input_layernorm → full_attention / linear_attention → post_attention_layernorm → SparseMoeBlock (256 experts, Top-8, plus a shared expert)
- RMSNorm
Output:
- lm_head [2048, 248320]
- logits
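The embedding fusion step can be illustrated at the token level. torch.Tensor.masked_scatter copies source elements, in order, into the positions where the mask is True; the sketch below mimics that behavior with plain lists. The name fuse_vision_embeds is hypothetical, and the 2-dimensional embeddings are toy values.

```python
def fuse_vision_embeds(text_embeds, is_vision_token, vision_embeds):
    """Replace placeholder token embeddings with vision embeddings, in order."""
    if sum(is_vision_token) != len(vision_embeds):
        raise ValueError("placeholder count does not match number of vision embeddings")
    source = iter(vision_embeds)
    return [
        next(source) if flag else embed
        for embed, flag in zip(text_embeds, is_vision_token)
    ]


text_embeds = [[0.0, 0.0], [9.0, 9.0], [9.0, 9.0], [1.0, 1.0]]   # 9.0 rows are placeholders
is_vision_token = [False, True, True, False]
vision_embeds = [[2.0, 3.0], [4.0, 5.0]]
inputs_embeds = fuse_vision_embeds(text_embeds, is_vision_token, vision_embeds)
print(inputs_embeds)
```

The explicit count check reflects a practical constraint: the number of placeholder tokens in the prompt must equal the number of vision tokens produced by the encoder.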
11.3 Core Design Philosophy
The Qwen3.5-MoE implementation in Transformers reflects the following design principles:
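- Modular inheritance: class inheritance in modular_qwen3_5_moe.py (Qwen3_5MoeGatedDeltaNet ← Qwen3_5GatedDeltaNet) maximizes code reuse and minimizes duplication
- Unified management of the hybrid architecture: DynamicCache dispatches to a different cache type per layer based on config.layer_types, so higher-level code does not need to know about the underlying differences
- Multimodal position encoding: M-RoPE places 1D text positions and 3D vision positions in a single framework, and rope_deltas makes position computation cheap during incremental generation
- MoE expert parallelism: the moe_tp_experts strategy distributes the 256 experts across GPUs, and the @use_experts_implementation decorator supports multiple optimized backends
- Generation efficiency: linear_attention layers have O(n) complexity and cache a recurrent_state, so incremental decoding does not need a full KV cache for those layers, which greatly reduces memory use when generating long sequences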
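The memory argument in the last two points can be made tangible with a toy cache that dispatches on the layer type. Everything here is an assumption for illustration: ToyHybridCache is a hypothetical class, the layer_types list is made up rather than taken from the real config, and the decayed running sum is only a stand-in for the actual recurrent state update of the linear attention layers.

```python
class ToyHybridCache:
    """Per-layer cache: growing list for full_attention, fixed state for linear_attention."""

    def __init__(self, layer_types):
        self.layer_types = list(layer_types)
        self.kv = {i: [] for i, t in enumerate(self.layer_types) if t == "full_attention"}
        self.state = {i: 0.0 for i, t in enumerate(self.layer_types) if t == "linear_attention"}

    def update(self, layer_idx, value):
        if self.layer_types[layer_idx] == "full_attention":
            # KV-style cache: one new entry per token, so it grows with sequence length.
            self.kv[layer_idx].append(value)
        else:
            # Recurrent-style cache: overwrite a fixed-size state (toy decayed sum).
            self.state[layer_idx] = 0.5 * self.state[layer_idx] + value

    def num_entries(self):
        return sum(len(entries) for entries in self.kv.values()) + len(self.state)


layer_types = ["linear_attention", "linear_attention", "linear_attention", "full_attention"]
cache = ToyHybridCache(layer_types)
sizes = []
for token in range(20):
    for layer_idx in range(len(layer_types)):
        cache.update(layer_idx, float(token))
    sizes.append(cache.num_entries())
print(sizes[0], sizes[9], sizes[19])
```

Only the full_attention layer contributes to growth: the entry count rises by one per token for that layer, while the three linear_attention layers stay at one state each.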