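GPT-2 Case Study: The Complete Lifecycle of a Decoder-only Autoregressive Model
This document uses GPT-2 as a worked example to tie together the modules of the Transformers framework. It follows a decoder-only autoregressive model through its complete lifecycle: config loading, tokenization, the forward pass, the attention mechanism, autoregressive generation, and training and inference.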
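Related articles:
Hugging Face Transformers Source Code: A Panoramic Walkthrough
01-Hugging Face Transformers: Core Infrastructure In Depth
02-Hugging Face Transformers: Configuration System In Depth
03-Hugging Face Transformers: Model System In Depth
04-Hugging Face Transformers: Attention and Masking System In Depth
05-Hugging Face Transformers: Cache System In Depth
06-Hugging Face Transformers: Generation System In Depth
07-Hugging Face Transformers: Tokenizer System In Depth
08-Hugging Face Transformers: Multimodal Processing System In Depth
09-Hugging Face Transformers: Training System In Depth
10-Hugging Face Transformers: Quantization System In Depth
11-Hugging Face Transformers: Distributed and Parallel Systems In Depth
12-Hugging Face Transformers: Pipeline Inference In Depth
13-Hugging Face Transformers: AutoModel Auto-Dispatch Mechanism In Depth
14-Hugging Face Transformers: Model Implementation Patterns In Depth
15-Hugging Face Transformers: CLI and Tooling Architecture Overview
16-Hugging Face Transformers: Test Suite Architecture Overview
17-Hugging Face Transformers: BERT Case Study, Tying Together All Modules of the Transformers Framework
18-Hugging Face Transformers: GPT-2 Case Study, The Complete Lifecycle of a Decoder-only Autoregressive Model
19-Hugging Face Transformers: Qwen3.5-MoE Series Explained, The Complete Lifecycle of Mixture-of-Experts + Linear Attention + Multimodality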
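1. Where GPT-2 Sits in Transformers
GPT-2 is a decoder-only causal language model released by OpenAI in 2019. It follows the autoregressive generation paradigm: at each step it predicts only the next token, and the tokens generated so far serve as context for later predictions. Unlike encoder models such as BERT, GPT-2 applies a **causal mask** so that each position can attend only to itself and to earlier tokens.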
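To make the causal mask concrete, here is a minimal sketch in plain Python. The function name `causal_mask` and the boolean-matrix representation are illustrative assumptions, not the library's implementation; entry `[i][j]` is true when position `i` is allowed to attend to position `j`.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len lower-triangular boolean mask.

    Entry [i][j] is True when position i may attend to position j,
    which for a causal model means j <= i.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # Print 1 for "visible" and 0 for "masked out".
    print("".join("1" if allowed else "0" for allowed in row))
# 1000
# 1100
# 1110
# 1111
```

Each row gains one more visible position than the row above it, which is exactly what "each position sees itself and earlier tokens" means.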
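A hallmark of GPT-2 is its use of Conv1D linear layers instead of the standard nn.Linear. This is a legacy of OpenAI's original implementation: the weight matrix has shape (in_features, out_features), the transpose of nn.Linear's (out_features, in_features).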
Architecture positioning diagram
[Diagram: the decoder-only causal language model family]
- GPT-2 (2019, OpenAI): Conv1D + LayerNorm in Pre-Norm position; hidden sizes 768/1024/1280/1600
- GPT-Neo / GPT-J (2021, EleutherAI): nn.Linear + Parallel Attention
- LLaMA (2023, Meta): RMSNorm + SwiGLU + RoPE
- Qwen series (2023 onward, Alibaba): RMSNorm + SwiGLU + RoPE
Changes along the lineage (edge labels): Conv1D → nn.Linear, LayerNorm → RMSNorm, absolute positions → RoPE, GELU → SwiGLU, architectural inheritance, scaling up. GPT-2 already applies LayerNorm before each sub-block (Pre-Norm), so the normalization change on the way to LLaMA is the switch to RMSNorm.
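Core Feature Summary
| Feature | GPT-2 | LLaMA (for comparison) |
|---|---|---|
| Linear layers | Conv1D (transposed weight) | nn.Linear |
| Normalization | LayerNorm, Pre-Norm | RMSNorm, Pre-Norm |
| Activation | gelu_new (approximate GELU) | silu (SwiGLU) |
| Position encoding | Learned absolute positions (wpe) | Rotary position embedding (RoPE) |
| Attention scaling | scale_attn_weights + optional scaling by inverse layer index (scale_attn_by_inverse_layer_idx) | Standard head_dim^-0.5 |
| Weight tying | lm_head.weight ↔ wte.weight | Not tied by default (tie_word_embeddings=False) |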
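The activation row can be made concrete with a few lines of arithmetic. The sketch below, in plain Python, writes out the tanh approximation commonly referred to as gelu_new next to the exact erf-based GELU and SiLU (the gate activation inside SwiGLU). The function names are illustrative, and scalar inputs are an assumption made for readability.

```python
import math


def gelu_new(x):
    """Tanh approximation of GELU, the form GPT-2 calls gelu_new."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU (x * sigmoid(x)), used as the gate activation in SwiGLU."""
    return x / (1.0 + math.exp(-x))


# Compare the approximation with the exact form on a small grid.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    gap = abs(gelu_new(x) - gelu_exact(x))
    print(f"x={x:+.1f}  gelu_new={gelu_new(x):+.4f}  |gap|={gap:.1e}  silu={silu(x):+.4f}")
```

On this grid the approximation stays very close to the exact GELU, which is why the two are often treated as interchangeable in practice.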
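2. Config Definition and Special Design
GPT2Config inherits from PreTrainedConfig and is defined in configuration_gpt2.py. It uses the @strict decorator (from huggingface_hub) to enforce strict type checking of its dataclass fields, and its attribute_map exposes the Transformers-wide standard names as aliases for GPT-2's original parameter names.
Key Parameters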
```python
# From configuration_gpt2.py L78-L103
vocab_size: int = 50257                 # vocabulary size
n_positions: int = 1024                 # maximum number of position embeddings
n_embd: int = 768                       # hidden size
n_layer: int = 12                       # number of Transformer layers
n_head: int = 12                        # number of attention heads
n_inner: int | None = None              # MLP intermediate size; defaults to 4 * n_embd
activation_function: str = "gelu_new"   # approximate GELU activation
scale_attn_weights: bool = True         # whether to scale attention weights by 1/sqrt(d_k)
reorder_and_upcast_attn: bool = False   # reorder the scaling and upcast attention under mixed precision
add_cross_attention: bool = False       # whether to add cross-attention (encoder-decoder use)
tie_word_embeddings: bool = True        # weight tying: lm_head <-> wte
```
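Several quantities used later follow directly from these defaults. The short calculation below is a sketch of that arithmetic; the variable names `head_dim`, `inner_dim`, and `attn_scale` are chosen here for illustration and are not config fields.

```python
import math

# Defaults taken from the config listing above.
n_embd, n_head, n_inner = 768, 12, None

# Each head works on an equal slice of the hidden dimension.
head_dim = n_embd // n_head

# When n_inner is None, the MLP intermediate size falls back to 4 * n_embd.
inner_dim = n_inner if n_inner is not None else 4 * n_embd

# With scale_attn_weights=True, attention scores are multiplied by 1/sqrt(d_k).
attn_scale = 1.0 / math.sqrt(head_dim)

print(head_dim, inner_dim, attn_scale)  # 64 3072 0.125
```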
Config class diagram
[Class diagram: GPT2Config inherits from PreTrainedConfig]
- PreTrainedConfig: model_type: str, is_decoder: bool, from_pretrained(), to_dict()
- GPT2Config: vocab_size: int = 50257, n_positions: int = 1024, n_embd: int = 768, n_layer: int = 12, n_head: int = 12, n_inner: int | None, activation_function: str, scale_attn_weights: bool, reorder_and_upcast_attn: bool, add_cross_attention: bool, tie_word_embeddings: bool, attribute_map: dict
Diagram note: attribute_map maps the Transformers-wide standard names onto GPT-2's original names:
- hidden_size → n_embd
- num_attention_heads → n_head
- num_hidden_layers → n_layer
- max_position_embeddings → n_positions
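One way such an alias table could work is sketched below. `ToyConfig` is a hypothetical stand-in written for this article, not the library's PreTrainedConfig; it only assumes that reads and writes of a standard name are redirected to the model-specific name.

```python
class ToyConfig:
    """Hypothetical config class illustrating attribute aliasing."""

    # Standard name -> model-specific name, mirroring the mapping above.
    attribute_map = {
        "hidden_size": "n_embd",
        "num_attention_heads": "n_head",
        "num_hidden_layers": "n_layer",
        "max_position_embeddings": "n_positions",
    }

    def __init__(self, n_embd=768, n_head=12, n_layer=12, n_positions=1024):
        self.n_embd = n_embd
        self.n_head = n_head
        self.n_layer = n_layer
        self.n_positions = n_positions

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real attributes are unaffected.
        aliases = type(self).attribute_map
        if name in aliases:
            return getattr(self, aliases[name])
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Writes through an alias land on the underlying attribute.
        name = type(self).attribute_map.get(name, name)
        super().__setattr__(name, value)


config = ToyConfig()
print(config.hidden_size)   # 768, read through the alias
config.hidden_size = 1024   # write through the alias
print(config.n_embd)        # 1024
```

The benefit of this design is that generic code can ask any config for `hidden_size` without knowing that GPT-2 calls it `n_embd`.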
Conv1D vs nn.Linear comparison
| | nn.Linear | Conv1D (used by GPT-2) |
|---|---|---|
| Input x | (batch, seq, in_features) | (batch, seq, nx) |
| Weight W | shape (out_features, in_features) | shape (nx, nf), i.e. transposed |
| Output | y = xW^T + b, shape (batch, seq, out_features) | y = xW + b, shape (batch, seq, nf) |
The two are mathematically equivalent: nn.Linear's xW^T equals Conv1D's x·W_transposed, where W_transposed is the same weight stored in transposed layout.
The core difference in Conv1D (defined in pytorch_utils.py L97-L123):
```python
import torch
from torch import nn


class Conv1D(nn.Module):
    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf  # number of output features
        self.nx = nx  # number of input features
        # Key point: the weight has shape (nx, nf), i.e. (in_features, out_features),
        # whereas nn.Linear stores its weight as (out_features, in_features)
        self.weight = nn.Parameter(torch.empty(nx, nf))
        self.bias = nn.Parameter(torch.zeros(nf))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        # torch.addmm(bias, input, weight) computes bias + input @ weight,
        # which matches F.linear(x, self.weight.T, self.bias) = x @ self.weight + bias
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
        x = x.view(size_out)
        return x
```
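The equivalence between the two layouts can be checked by hand on a tiny example. The sketch below uses plain Python lists instead of tensors so it runs without PyTorch; the helper names and the toy numbers are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add_bias(rows, bias):
    return [[v + b for v, b in zip(row, bias)] for row in rows]


def conv1d_forward(x, weight, bias):
    """Conv1D layout: weight has shape (nx, nf), so y = x @ W + b."""
    return add_bias(matmul(x, weight), bias)


def linear_forward(x, weight, bias):
    """nn.Linear layout: weight has shape (nf, nx), so y = x @ W^T + b."""
    return add_bias(matmul(x, transpose(weight)), bias)


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]            # 2 tokens, nx = 3
w_conv = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]    # (nx=3, nf=2)
bias = [0.5, -0.5]
w_linear = transpose(w_conv)                       # (nf=2, nx=3)

y_conv = conv1d_forward(x, w_conv, bias)
y_linear = linear_forward(x, w_linear, bias)
print(y_conv)              # [[7.5, -1.5], [16.5, -1.5]]
print(y_conv == y_linear)  # True
```

In other words, converting a Conv1D layer into an nn.Linear layer only requires transposing the stored weight; the bias is unchanged.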
3. The Complete from_pretrained Sequence
Going from GPT2LMHeadModel.from_pretrained('gpt2') to a ready-to-use model involves several key steps.
Sequence diagram
[Sequence diagram. Participants: user code, AutoModelForCausalLM, GPT2Config, GPT2LMHeadModel, PreTrainedModel, HuggingFace Hub]
1. User code calls from_pretrained('gpt2') on AutoModelForCausalLM.
2. config.json is downloaded from the Hub, and GPT2Config.from_pretrained() returns a config instance.
3. Based on model_type='gpt2', the call is dispatched to GPT2LMHeadModel.from_pretrained('gpt2'), which runs PreTrainedModel.from_pretrained().
4. PreTrainedModel resolves the model architecture and calls _load_pretrained_model().
5. The weight file model.safetensors is downloaded from the Hub and returned as a state_dict.
6. The weights are loaded into the model with model.load_state_dict(). Conv1D weights need no transposition because the checkpoint already stores them in (nx, nf) format.
7. Weight tying is handled by tie_weights(): lm_head.weight ← transformer.wte.weight, with _tied_weights_keys specifying the relationship.
8. Device placement and dtype conversion are applied, and the ready-to-use model instance is returned to user code.
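Weight Tying Mechanism
GPT2LMHeadModel declares its weight-tying relationship through _tied_weights_keys (modeling_gpt2.py L646):
```python
# Excerpt: GPT2PreTrainedModel, GPT2Model and GenerationMixin come from transformers
from torch import nn


class GPT2LMHeadModel(GPT2PreTrainedModel, GenerationMixin):
    _tied_weights_keys = {"lm_head.weight": "transformer.wte.weight"}

    def __init__(self, config):
        super().__init__(config)
        self.transformer = GPT2Model(config)
        # lm_head is an nn.Linear, so its weight has shape (vocab_size, n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.post_init()  # tie_weights() is triggered here
```
How tying works: lm_head.weight (shape (50257, 768)) and transformer.wte.weight (shape (50257, 768)) share the same storage, so a change to one is immediately visible through the other.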
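The idea of shared storage can be illustrated without PyTorch. In the sketch below, `TinyTiedLM` is a hypothetical toy class (not the library's implementation) in which the embedding table and the output projection are literally the same Python object, so the table is used row-wise for lookup and as a matrix for computing logits.

```python
class TinyTiedLM:
    """Toy model whose output projection reuses the embedding table."""

    def __init__(self, vocab_size, n_embd):
        # One table of shape (vocab_size, n_embd).
        self.wte = [[0.0] * n_embd for _ in range(vocab_size)]
        # Tying: the output projection is the same object, not a copy.
        self.lm_head_weight = self.wte

    def embed(self, token_id):
        # Input side: pick one row of the table.
        return self.wte[token_id]

    def logits(self, hidden):
        # Output side: dot the hidden state with every row of the same table.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.lm_head_weight]


model = TinyTiedLM(vocab_size=3, n_embd=2)
model.wte[1] = [1.0, 2.0]    # update the embedding table only
model.wte[2] = [0.5, -1.0]

print(model.lm_head_weight is model.wte)  # True
print(model.logits([1.0, 1.0]))           # [0.0, 3.0, -0.5]
```

Because both names refer to one table, the updates made through `wte` show up in the logits without any explicit synchronization, and the parameters are stored only once.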
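Special Handling When Loading Conv1D Weights
Because the GPT-2 checkpoint already stores Conv1D weights in (nx, nf) format, no extra transposition is needed at load time. The shapes to keep in mind are:
- c_attn.weight: shape (768, 2304), projects Q/K/V in a single pass
- c_proj.weight: shape (768, 768), attention output projection
- c_fc.weight: shape (768, 3072), MLP up-projection
- c_proj.weight (MLP): shape (3072, 768), MLP down-projection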
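These shapes are enough to estimate the parameter count of the smallest GPT-2. The sketch below is a back-of-the-envelope calculation that assumes every Conv1D layer has a bias, each block has two LayerNorms (weight and bias each), there is one final LayerNorm, and the tied lm_head is counted once through wte. The helper name `linear_params` is illustrative.

```python
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12
n_inner = 4 * n_embd  # 3072


def linear_params(n_in, n_out):
    """Weight of shape (n_in, n_out) plus a bias of length n_out."""
    return n_in * n_out + n_out


layer_norm = 2 * n_embd  # scale and shift vectors

per_block = (
    layer_norm                            # ln_1
    + linear_params(n_embd, 3 * n_embd)   # attn.c_attn: (768, 2304)
    + linear_params(n_embd, n_embd)       # attn.c_proj: (768, 768)
    + layer_norm                          # ln_2
    + linear_params(n_embd, n_inner)      # mlp.c_fc:    (768, 3072)
    + linear_params(n_inner, n_embd)      # mlp.c_proj:  (3072, 768)
)

total = (
    vocab_size * n_embd       # wte, shared with lm_head through weight tying
    + n_positions * n_embd    # wpe
    + n_layer * per_block
    + layer_norm              # final LayerNorm
)

print(per_block)  # 7087872
print(total)      # 124439808
```

The result is consistent with the roughly 124M parameters usually quoted for the smallest GPT-2, and it shows that the embedding table alone accounts for about 38.6M of them, which is why tying it to lm_head matters.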
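4. Tokenizer Encoding Flow
GPT2Tokenizer is defined in tokenization_gpt2.py, inherits from TokenizersBackend, and uses the byte-level BPE tokenization algorithm.
Core Design Features
- Byte-level preprocessing: every character is first converted to UTF-8 bytes, and each byte is then mapped to a printable Unicode character, so any text can be encoded (no <unk> problem)
- No CLS/SEP: GPT-2 has no BERT-style special separator tokens; <|endoftext|> is the only special token and serves as both BOS and EOS
- Space sensitivity: the presence or absence of a leading space yields different tokens (e.g. "Hello" vs " Hello")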
```python
# From tokenization_gpt2.py L94-L129
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE


class GPT2Tokenizer(TokenizersBackend):
    vocab_files_names = VOCAB_FILES_NAMES  # {"vocab_file": "vocab.json", "merges_file": "merges.txt"}
    model_input_names = ["input_ids", "attention_mask"]
    model = BPE  # uses the BPE model

    def __init__(self, vocab, merges, errors="replace",
                 unk_token="<|endoftext|>", bos_token="<|endoftext|>",
                 eos_token="<|endoftext|>", pad_token=None,
                 add_prefix_space=False, **kwargs):
        self.add_prefix_space = add_prefix_space
        self._vocab = vocab if vocab is not None else {}
        self._merges = merges or []
        # Build the BPE model from the tokenizers library
        self._tokenizer = Tokenizer(BPE(
            vocab=self._vocab, merges=self._merges,
            dropout=None, continuing_subword_prefix="",
            end_of_word_suffix="", fuse_unk=False,
        ))
        # ByteLevel pre-tokenizer: converts text into its byte-level representation
        self._tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
            add_prefix_space=add_prefix_space
        )
        # ByteLevel decoder: restores text from the byte-level representation
        self._tokenizer.decoder = decoders.ByteLevel()
```
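The byte-to-character mapping behind the ByteLevel step can be sketched in plain Python. The function below follows the mapping scheme published with the original GPT-2 code: printable bytes keep their own character, and the remaining bytes are shifted to code points starting at 256. It is shown here only to illustrate the idea; the tokenizers library performs this step natively, and the helper name `to_byte_level` is an assumption.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Whitespace and control bytes are shifted to 256, 257, ...
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def to_byte_level(text):
    """Encode text as UTF-8 and replace every byte by its mapped character."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


print(to_byte_level("Hello world"))  # HelloĠworld
print(to_byte_level("é"))            # Ã© (two bytes, two characters)
```

Because every possible byte has a character, any input string can be represented, which is the reason GPT-2 never needs an unknown token.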
Encoding flow diagram
[Flowchart: from raw text to model input]
1. Raw text: 'Hello world'
2. Byte-level preprocessing (pre_tokenizers.ByteLevel):
   - Split the text on Unicode-aware boundaries and identify word boundaries (a leading space becomes the Ġ prefix)
   - Byte mapping: each character → UTF-8 bytes → mapped Unicode characters, e.g. 'H'→'H', 'e'→'e', ' '→'Ġ'
3. BPE tokenization (Tokenizer(BPE)):
   - Initialization: each mapped byte starts as its own subword
   - Iterative merging: repeatedly merge the adjacent pair with the highest priority in merges.txt (pairs that were more frequent during training appear earlier in the file)
   - Output subword sequence: 'Hello', 'Ġworld'
4. Vocabulary lookup (vocab.json): input_ids: 15496, 995
5. attention_mask generation: all ones by default (when there is no padding), attention_mask: 1, 1
6. Model input: input_ids + attention_mask
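The iterative merging step can be sketched as a small greedy loop. The merge list below is a hypothetical toy table invented for this example, not an excerpt from the real merges.txt, and the function `bpe` is a simplified illustration rather than the tokenizers library's implementation.

```python
def bpe(word, merges):
    """Greedy BPE: repeatedly merge the adjacent pair with the best (lowest) rank."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters (mapped bytes)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table; earlier entries have higher priority.
toy_merges = [
    ("l", "l"), ("H", "e"), ("He", "ll"), ("Hell", "o"),
    ("Ġ", "w"), ("o", "r"), ("Ġw", "or"), ("Ġwor", "l"), ("Ġworl", "d"),
]

# The pre-tokenizer has already split "Hello world" into two words.
print(bpe("Hello", toy_merges))   # ['Hello']
print(bpe("Ġworld", toy_merges))  # ['Ġworld']
print(bpe("Hey", toy_merges))     # ['He', 'y']
```

The last call shows what happens when no merge covers the whole word: the result stays split into the largest pieces the table can build, and each piece is then looked up in the vocabulary.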
Space sensitivity example
```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
tokenizer("Hello world")["input_ids"]   # [15496, 995] → "Hello" + "Ġworld"
tokenizer(" Hello world")["input_ids"]  # [18435, 995] → "ĠHello" + "Ġworld"
# Note: "Hello" and " Hello" are encoded as different tokens!
```
5. The Full Forward Pass
The complete data flow from input_ids to logits is defined in GPT2LMHeadModel.forward() and GPT2Model.forward() in modeling_gpt2.py.
Data flow diagram
input_ids (batch, seq_len)
→ wte: nn.Embedding (token embedding) → (batch, seq_len, 768)
position_ids (generated automatically) (1, seq_len)
→ wpe: nn.Embedding (position embedding) → (1, seq_len, 768)
⊕ add: hidden = wte + wpe
→ drop: Dropout (embd_pdrop=0.1)
→ GPT2Block #0: ln_1 → Attn → + → ln_2 → MLP → +
→ GPT2Block #1: ln_1 → Attn → + → ln_2 → MLP → +
→ ...
→ GPT2Block #11: ln_1 → Attn → + → ln_2 → MLP → +
→ ln_f: LayerNorm (final layer normalization)
→ lm_head: nn.Linear(768, 50257, bias=False)
→ logits (batch, seq_len, 50257)
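The embedding and projection ends of this flow can be sketched without any framework. The snippet below is a minimal illustration under stated assumptions: toy sizes (vocabulary 10, hidden size 4) stand in for 50257 and 768, nested Python lists stand in for tensors, and `embed` and `lm_head` are hypothetical helpers rather than Transformers functions. The blocks and ln_f are skipped entirely, and the projection reuses the token table to mirror GPT-2's tied input and output embeddings.

```python
import random

random.seed(0)

VOCAB_SIZE, MAX_POSITIONS, HIDDEN = 10, 8, 4  # toy sizes, not GPT-2's real ones

# Two lookup tables playing the roles of wte (token) and wpe (position).
wte = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(VOCAB_SIZE)]
wpe = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(MAX_POSITIONS)]


def embed(input_ids):
    """Return hidden states of shape (batch, seq_len, HIDDEN) as nested lists."""
    hidden = []
    for sequence in input_ids:
        rows = []
        # position ids are simply 0..seq_len-1 and are shared by every sequence
        for position, token_id in enumerate(sequence):
            rows.append([t + p for t, p in zip(wte[token_id], wpe[position])])
        hidden.append(rows)
    return hidden


def lm_head(hidden):
    """Project (batch, seq_len, HIDDEN) to logits (batch, seq_len, VOCAB_SIZE)."""
    return [
        [[sum(h * w for h, w in zip(vec, wte[v])) for v in range(VOCAB_SIZE)] for vec in seq]
        for seq in hidden
    ]


input_ids = [[1, 5, 3], [2, 2, 7]]  # (batch=2, seq_len=3)
hidden = embed(input_ids)
logits = lm_head(hidden)            # the 12 blocks and ln_f are skipped in this sketch
print(len(hidden), len(hidden[0]), len(hidden[0][0]))  # 2 3 4
print(len(logits), len(logits[0]), len(logits[0][0]))  # 2 3 10
```

In the second sequence the token id 2 appears twice, yet its two hidden vectors differ because a different position embedding is added at each position.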
Internal structure of a single GPT2Block
hidden_states (batch, seq, 768)
→ save residual
→ ln_1: LayerNorm
→ GPT2Attention: c_attn → Q/K/V split → causal attention → c_proj
→ + residual connection: hidden = attn_out + residual
→ save residual
→ ln_2: LayerNorm
→ GPT2MLP: c_fc → gelu_new → c_proj
→ + residual connection: hidden = mlp_out + residual
→ output hidden_states (batch, seq, 768)
Corresponding source (modeling_gpt2.py L262-L309), abridged:
```python
class GPT2Block(GradientCheckpointingLayer):
    def forward(self, hidden_states, past_key_values=None,
                attention_mask=None, **kwargs):  # remaining arguments elided
        # Pre-Norm: normalize first, then attention, then add the residual
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_output, _ = self.attn(
            hidden_states,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            **kwargs,
        )
        hidden_states = attn_output + residual  # first residual connection

        # Pre-Norm: normalize first, then the MLP, then add the residual
        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        hidden_states = residual + feed_forward_hidden_states  # second residual connection
        return hidden_states
```
Pre-Norm vs Post-Norm architecture comparison
LLaMA: Pre-Norm
x
→ RMSNorm(x)
→ Attention(RMSNorm(x))
→ x + Attention(RMSNorm(x))
→ RMSNorm(x + Attn_out)
→ MLP(RMSNorm(x + Attn_out))
→ x + Attn_out + MLP(RMSNorm(...))
→ output
Post-Norm (original Transformer, BERT)
x
→ Attention(x)
→ x + Attention(x)
→ LayerNorm(x + Attention(x))
→ MLP(LN_out)
→ LN_out + MLP(LN_out)
→ LayerNorm(LN_out + MLP(LN_out))
→ output
Pre-Norm vs Post-Norm: as the GPT2Block source shows, GPT-2 is a Pre-Norm model. ln_1 and ln_2 are applied at the input of each sub-layer, inside the residual branch, and a final ln_f is applied after the last block. It follows the same order as the LLaMA column, with LayerNorm in place of RMSNorm. Post-Norm, used by the original Transformer and BERT, normalizes after the residual addition, so gradients can become unstable during training. Pre-Norm trains more stably and is the mainstream choice in modern LLMs.
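The difference between the two orderings can be checked numerically. The sketch below is a toy illustration, assuming plain Python lists in place of tensors and a LayerNorm without learned scale or shift; `pre_norm`, `post_norm`, and `zero_sublayer` are hypothetical names. With a sub-layer that outputs zeros, the Pre-Norm form returns its input unchanged, while the Post-Norm form still rescales it.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm(x, sublayer):
    """x + sublayer(LayerNorm(x)): the residual path carries x through untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm(x, sublayer):
    """LayerNorm(x + sublayer(x)): the residual sum itself is normalized."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def zero_sublayer(x):
    """Stand-in for an attention or MLP sub-layer that outputs all zeros."""
    return [0.0 for _ in x]


x = [1.0, 2.0, 4.0, 8.0]
print(pre_norm(x, zero_sublayer))   # identical to x
print(post_norm(x, zero_sublayer))  # rescaled even though the sub-layer did nothing
```

The untouched identity path is one common explanation for why Pre-Norm stacks are easier to train: the input of every block reaches the output through additions only.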
6. How the attention system works
GPT2Attention, defined at modeling_gpt2.py L75-L226, is the core computation module of GPT-2.
Attention forward flow diagram
hidden_states (batch, seq, 768)
→ c_attn: Conv1D(2304, 768), projects Q/K/V in a single pass
→ split(768, dim=2) → Q, K, V, each (batch, seq, 768)
→ Q.view + transpose → (batch, num_heads, seq, head_dim)
→ K.view + transpose → (batch, num_heads, seq, head_dim)
→ V.view + transpose → (batch, num_heads, seq, head_dim)
→ KV Cache update: past_key_values.update(K, V) → K_full, V_full (including history)
→ attn_weights = Q @ K_full^T × scaling, with scaling = 1/√head_dim
→ + causal mask from create_causal_mask()
→ Softmax(dim=-1)
→ attn_dropout
→ attn_output = weights @ V_full, (batch, num_heads, seq, head_dim)
→ reshape → (batch, seq, 768)
→ c_proj: Conv1D(768, 768), output projection
→ resid_dropout
→ attn_output (batch, seq, 768)
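The scoring steps in the middle of this flow (scaling, causal masking, softmax, weighted sum over V) can be sketched for a single head and a single sequence. This is an illustrative sketch only, assuming plain Python lists, no dropout, and no batch or head dimensions; `causal_attention` is a hypothetical helper, not the function used in modeling_gpt2.py.

```python
import math


def softmax(row):
    """Numerically stable softmax; -inf entries receive probability 0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head attention for one sequence.

    q: (q_len, head_dim); k, v: (kv_len, head_dim), where kv_len >= q_len
    because k and v may include cached history.
    """
    head_dim = len(q[0])
    scaling = 1.0 / math.sqrt(head_dim)
    offset = len(k) - len(q)  # number of cached positions before the first query
    output, weights = [], []
    for q_idx, q_vec in enumerate(q):
        scores = []
        for kv_idx, k_vec in enumerate(k):
            if kv_idx <= q_idx + offset:
                scores.append(sum(a * b for a, b in zip(q_vec, k_vec)) * scaling)
            else:
                scores.append(float("-inf"))  # future position: masked out
        probs = softmax(scores)
        weights.append(probs)
        output.append([sum(p * row[d] for p, row in zip(probs, v)) for d in range(head_dim)])
    return output, weights


q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = causal_attention(q, k, v)
print(w[0])    # [1.0, 0.0, 0.0]: the first token can only attend to itself
print(out[0])  # [1.0, 0.0]: so its output is exactly v[0]
```

Every row of `w` sums to 1, and all entries above the diagonal are exactly 0 because exp(-inf) evaluates to 0.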
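Causal mask generation diagram
The causal mask is generated by create_causal_mask() in masking_utils.py, which ensures that each position can attend only to itself and to earlier tokens: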
create_causal_mask() (masking_utils.py L894)
→ config.is_causal?
True (default) → causal_mask_function: kv_idx <= q_idx
False → create_bidirectional_mask (bidirectional mask)
→ attention implementation type?
eager → eager_mask(): builds a 4D float mask, 0 (visible) / -inf (masked)
sdpa → sdpa_mask(): builds a 4D bool mask, True (visible) / False (masked)
flash_attention_2 → flash_attention_mask(): returns a 2D mask or None
→ 4D mask (batch, 1, q_len, kv_len)
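The branch on the attention implementation mostly changes how the same causal rule is encoded. The sketch below illustrates that idea under simplifying assumptions: it ignores padding and the batch and head dimensions, and `causal_bool_mask` and `build_mask` are hypothetical helpers, not the functions in masking_utils.py.

```python
def causal_bool_mask(q_len, kv_len):
    """True where a query may attend: kv_idx <= q_idx, shifted by the cached length."""
    offset = kv_len - q_len
    return [[kv_idx <= q_idx + offset for kv_idx in range(kv_len)] for q_idx in range(q_len)]


def build_mask(q_len, kv_len, attn_implementation):
    """Hypothetical dispatcher mirroring the three branches of the diagram."""
    allowed = causal_bool_mask(q_len, kv_len)
    if attn_implementation == "eager":
        # additive float mask: 0.0 keeps a score, -inf removes it after softmax
        return [[0.0 if ok else float("-inf") for ok in row] for row in allowed]
    if attn_implementation == "sdpa":
        return allowed  # boolean mask: True = visible, False = masked
    if attn_implementation == "flash_attention_2":
        return None  # no dense mask here; causality is handled inside the kernel
    raise ValueError(f"unknown attention implementation: {attn_implementation}")


print(build_mask(2, 2, "eager"))              # [[0.0, -inf], [0.0, 0.0]]
print(build_mask(2, 2, "sdpa"))               # [[True, False], [True, True]]
print(build_mask(2, 2, "flash_attention_2"))  # None
```

The float form is added to the attention scores before softmax, while the boolean form is passed to the kernel as a selection mask; both express the same visibility pattern.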
Causal mask matrix (5×5; rows are query positions, columns are key positions):
Position 0 1 2 3 4
0 [ ■ ⬚ ⬚ ⬚ ⬚ ] ■ = visible (0)
1 [ ■ ■ ⬚ ⬚ ⬚ ] ⬚ = masked (-inf)
2 [ ■ ■ ■ ⬚ ⬚ ]
3 [ ■ ■ ■ ■ ⬚ ]
4 [ ■ ■ ■ ■ ■ ]
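The same rule, kv_idx <= q_idx, also explains what the mask looks like once a cache is involved. The sketch below renders the mask with the symbols used in the matrix above, first for a 5-token prefill and then for a single decode step whose query sits after 5 cached tokens. `render_causal_mask` is a hypothetical helper written for this illustration.

```python
def render_causal_mask(q_len, kv_len):
    """Return the mask as text rows, one row per query position."""
    offset = kv_len - q_len  # tokens already in the KV cache
    rows = []
    for q_idx in range(q_len):
        cells = ["■" if kv_idx <= q_idx + offset else "⬚" for kv_idx in range(kv_len)]
        rows.append(" ".join(cells))
    return rows


prefill = render_causal_mask(q_len=5, kv_len=5)
decode = render_causal_mask(q_len=1, kv_len=6)
print("\n".join(prefill))  # lower-triangular 5x5 pattern
print("\n".join(decode))   # a single row in which all 6 positions are visible
```

During decoding the single new query is the last position in the sequence, so nothing is masked for it; the triangular shape only matters when several query positions are processed at once.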
KV Cache incremental update sequence diagram
Participants: GPT2Model, GPT2Attention, DynamicCache
=== First forward pass (Prefill) ===
GPT2Model → GPT2Attention: hidden_states (batch, 5, 768)
GPT2Attention: c_attn → Q, K, V; K: (batch, 12, 5, 64), V: (batch, 12, 5, 64)
GPT2Attention → DynamicCache: update(K, V, layer_idx=0)
DynamicCache → GPT2Attention: K_full=(batch,12,5,64), V_full=(batch,12,5,64)
GPT2Attention: Q @ K_full^T → softmax → @ V_full
GPT2Attention → GPT2Model: attn_output (batch, 5, 768)
=== Decode step 1 ===
GPT2Model → GPT2Attention: hidden_states (batch, 1, 768)
GPT2Attention: c_attn → Q, K, V; K_new: (batch, 12, 1, 64), V_new: (batch, 12, 1, 64)
GPT2Attention → DynamicCache: update(K_new, V_new, layer_idx=0)
DynamicCache (concatenates history): K_full = cat(K_old, K_new) → (batch, 12, 6, 64)
DynamicCache → GPT2Attention: K_full=(batch,12,6,64), V_full=(batch,12,6,64)
GPT2Attention: Q @ K_full^T → softmax → @ V_full
GPT2Attention → GPT2Model: attn_output (batch, 1, 768)
=== Decode step 2 ===
GPT2Model → GPT2Attention: hidden_states (batch, 1, 768)
GPT2Attention → DynamicCache: update(K_new, V_new, layer_idx=0)
DynamicCache: K_full → (batch, 12, 7, 64)
DynamicCache → GPT2Attention: K_full=(batch,12,7,64), V_full=(batch,12,7,64)
GPT2Attention → GPT2Model: attn_output (batch, 1, 768)
Key source code for the KV Cache (modeling_gpt2.py L193-L199):
```python
# Inside GPT2Attention.forward()
if (past_key_values is not None and not is_cross_attention) or (
    past_key_values is not None and is_cross_attention and not is_updated
):
    # Concatenate the new K/V with the K/V history held in the cache
    key_states, value_states = curr_past_key_values.update(
        key_states, value_states, self.layer_idx
    )
```
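What `update()` does to the sequence dimension can be reproduced with a toy cache. The class below is a hypothetical stand-in written for illustration, not the DynamicCache API: it handles one head of one sequence, stores positions in Python lists, and only mimics the "append, then return the full history" contract visible in the excerpt.

```python
class ToyCache:
    """Hypothetical per-layer key/value cache for a single head and sequence."""

    def __init__(self):
        self.keys = {}    # layer_idx -> list of key vectors, one per cached position
        self.values = {}  # layer_idx -> list of value vectors

    def update(self, key_states, value_states, layer_idx):
        """Append the new positions and return the full history for this layer."""
        self.keys.setdefault(layer_idx, []).extend(key_states)
        self.values.setdefault(layer_idx, []).extend(value_states)
        return self.keys[layer_idx], self.values[layer_idx]

    def get_seq_length(self, layer_idx=0):
        """Number of positions currently cached for the given layer."""
        return len(self.keys.get(layer_idx, []))


HEAD_DIM = 64


def new_states(num_positions):
    """Placeholder key or value vectors; the contents are irrelevant for this demo."""
    return [[0.0] * HEAD_DIM for _ in range(num_positions)]


cache = ToyCache()

# Prefill: the 5 prompt tokens enter the cache at once.
k_full, v_full = cache.update(new_states(5), new_states(5), layer_idx=0)
lengths = [len(k_full)]

# Two decode steps: each one adds exactly one position.
for _ in range(2):
    k_full, v_full = cache.update(new_states(1), new_states(1), layer_idx=0)
    lengths.append(len(k_full))

print(lengths)  # [5, 6, 7]
```

The lengths 5, 6, 7 match the third dimension of K_full in the sequence diagram; the projection of old tokens is never recomputed, only the new position is appended.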
7. The full generate() flow
GPT2LMHeadModel inherits from GenerationMixin (modeling_gpt2.py L645); its generate() method is defined in generation/utils.py.
Generation loop sequence diagram
[Sequence diagram not rendered in the source: generation loop between the decoding loop and the cache, starting with the Prefill phase]
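The core of an autoregressive loop is small enough to sketch directly. The example below is a minimal greedy-decoding illustration, not the logic in generation/utils.py: `toy_next_token_logits` is a hypothetical stand-in for a model forward pass, and it re-reads the whole sequence on every step instead of using a KV cache.

```python
def toy_next_token_logits(token_ids, vocab_size=6):
    """Hypothetical stand-in for a model forward pass.

    Returns logits for the next token only; it favors (last_token + 1) % vocab_size.
    """
    target = (token_ids[-1] + 1) % vocab_size
    return [1.0 if v == target else 0.0 for v in range(vocab_size)]


def greedy_generate(prompt_ids, max_new_tokens, eos_token_id):
    """Greedy decoding: append the arg-max token until EOS or the token budget."""
    generated = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(generated)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_token)
        if next_token == eos_token_id:
            break  # stopping criterion: end-of-sequence token produced
    return generated


print(greedy_generate([0, 1], max_new_tokens=10, eos_token_id=5))  # [0, 1, 2, 3, 4, 5]
print(greedy_generate([0, 1], max_new_tokens=2, eos_token_id=5))   # [0, 1, 2, 3]
```

The two calls show the two ways the loop ends: the first stops on the EOS token, the second on the `max_new_tokens` budget. With a cache, only the newest token would be fed to the model after the first step.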
Assisted decoding flow diagram (Speculative Decoding)
When an assistant model is used for speculative decoding, the flow is as follows:
Flowchart: speculative decoding with a main model (GPT2LMHeadModel) and an assistant model (assistant_model)
Assistant model generates K candidate tokens → main model verifies the K candidate tokens in a single forward pass → verification result
Verification result, all accepted → accept the K tokens → continue generating
Verification result, j-th token rejected → accept the first j-1 tokens → sample the j-th token from the main model → continue with the next speculation round
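The accept/reject logic of the flowchart can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Transformers implementation: decoding is greedy (so "sample from the main model" becomes "take the main model's token"), and `speculative_step`, `generate`, `toy_main` and `toy_draft` are hypothetical names introduced here.

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to the next token id
# (greedy decoding). Both toy models below are hypothetical stand-ins.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], main_next: NextToken,
                     draft_next: NextToken, k: int) -> List[int]:
    """Run one speculation round and return the tokens to append to prefix."""
    # 1. The assistant drafts K candidate tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The main model checks the candidates in order. A real implementation
    #    scores all K positions in one batched forward pass; the loop below
    #    calls the main model once per position only to keep the sketch simple.
    accepted: List[int] = []
    for candidate in draft:
        target = main_next(prefix + accepted)
        if target != candidate:
            # 3a. Candidate j rejected: keep the first j-1 tokens and take
            #     the j-th token from the main model instead.
            accepted.append(target)
            return accepted
        accepted.append(candidate)
    # 3b. All K candidates accepted.
    return accepted


def generate(prefix: List[int], main_next: NextToken, draft_next: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculation rounds until max_new_tokens tokens were produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, main_next, draft_next, k))
    return out[: len(prefix) + max_new_tokens]


def toy_main(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10  # main model: count upwards modulo 10


def toy_draft(tokens: List[int]) -> int:
    # The assistant agrees with the main model except right after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step([0], toy_main, toy_draft, k=3))  # [1, 2, 3]
print(speculative_step([2], toy_main, toy_draft, k=3))  # [3, 4]
```

With greedy verification the output is identical to decoding with the main model alone; the assistant only changes how many main-model passes are needed.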
8. Training Flow
Training loop sequence diagram
[Diagram not available: Mermaid rendering failed in the source with a parse error on line 19. The surviving fragment shows the loss step passing a scalar loss to the optimizer.]
Weight-tying gradient flow diagram
Forward pass (shared storage): wte (nn.Embedding, weight: (50257, 768)) and lm_head (nn.Linear, weight: (50257, 768)) are the same parameter.
Backward pass (gradient accumulation): lm_head.weight.grad (gradient from the logits) and wte.weight.grad (gradient from the embedding layer) both flow into the shared weight, so gradient = lm_head gradient + wte gradient.
optimizer.step() → updates the shared weight once.
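The accumulation shown in the diagram is the multivariate chain rule applied to a parameter that is used twice. As a sketch, with symbols introduced here for illustration only: W is the shared (50257, 768) matrix, E denotes its use as the embedding table, U its use as the output projection, and η is a learning rate under the assumption of plain SGD (real training typically uses a different optimizer):

```latex
\frac{\partial \mathcal{L}}{\partial W}
  = \left.\frac{\partial \mathcal{L}}{\partial U}\right|_{U=W}
  + \left.\frac{\partial \mathcal{L}}{\partial E}\right|_{E=W},
\qquad
W \leftarrow W - \eta\,\frac{\partial \mathcal{L}}{\partial W}
```

Because both uses point at the same tensor, the two terms are added into a single gradient buffer, and the optimizer sees one parameter and applies one update.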
Key source for the loss computation (modeling_gpt2.py L708-L716):
python
# GPT2LMHeadModel.forward()
loss = None
if labels is not None:
    # labels are shifted automatically: logits use the first n-1 positions, labels the last n-1
    # i.e. the output at position i predicts the token at position i+1
    loss = self.loss_function(
        logits,
        labels,
        vocab_size=self.config.vocab_size,
        **kwargs,
    )
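To make the shift concrete, here is a minimal pure-Python sketch of a next-token cross-entropy for a single sequence. It is an illustration, not the library's loss_function: `causal_lm_loss` and `log_softmax` are hypothetical helpers, and -100 is assumed as the label value that excludes a position from the loss.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel: positions with this label are skipped


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def causal_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean next-token cross-entropy for one sequence.

    logits: list of n rows, each of length vocab_size
    labels: list of n token ids (typically the input ids themselves)
    """
    shift_logits = logits[:-1]  # outputs at positions 0 .. n-2
    shift_labels = labels[1:]   # targets are the tokens at positions 1 .. n-1
    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue
        total -= log_softmax(row)[target]
        count += 1
    return total / count if count else 0.0


# Uniform logits over a 4-token vocabulary give a loss of ln(4).
uniform = [[0.0] * 4 for _ in range(3)]
print(causal_lm_loss(uniform, [0, 1, 2]))
```

Note that the logits of the last position never contribute: there is no following token to compare them with.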
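GPT-2's special scaled initialization of residual projections
GPT-2 uses a dedicated initialization strategy for the residual path (modeling_gpt2.py L448-L458):
python
# GPT2PreTrainedModel._init_weights()
if isinstance(module, PreTrainedModel):
    for name, p in module.named_parameters():
        if name == "c_proj.weight":
            # Scale the residual projection weights by 1/√(2N)
            # N is the number of Blocks (n_layer); the factor 2 is there because each Block has 2 residual connections
            init.normal_(p, mean=0.0,
                         std=self.config.initializer_range / math.sqrt(2 * self.config.n_layer))
This strategy comes from the GPT-2 paper: as the model gets deeper, variance accumulates along the residual path, and shrinking the residual-layer weights at initialization offsets that accumulation.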
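The effect of the scaling can be checked with a few lines of arithmetic. The sketch below assumes the default initializer_range of 0.02 and the depths of the four released GPT-2 checkpoints; `residual_proj_std` is a hypothetical helper, and the "summed variance" line relies on the simplifying assumption that the 2 * n_layer residual branches contribute independently, each in proportion to its weight variance.

```python
import math


def residual_proj_std(initializer_range: float, n_layer: int) -> float:
    """Std used for c_proj.weight: initializer_range / sqrt(2 * n_layer)."""
    return initializer_range / math.sqrt(2 * n_layer)


# Depths of the four released GPT-2 checkpoints.
GPT2_DEPTHS = {"gpt2": 12, "gpt2-medium": 24, "gpt2-large": 36, "gpt2-xl": 48}

for name, n_layer in GPT2_DEPTHS.items():
    std = residual_proj_std(0.02, n_layer)
    # Under the independence assumption, the variances of all 2 * n_layer
    # residual branches add up to the same total regardless of depth.
    summed_var = 2 * n_layer * std ** 2
    print(f"{name:12s} n_layer={n_layer:2d} std={std:.5f} summed_var={summed_var:.6f}")
```

The per-layer std shrinks as the model gets deeper, while the summed variance stays at initializer_range squared, which is the intent of the 1/√(2N) factor.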
9. Pipeline Inference
pipeline("text-generation", model="gpt2") uses TextGenerationPipeline (text_generation.py), which wraps the complete flow of tokenization, generation and post-processing.
Pipeline sequence diagram
Participants: user code, TextGenerationPipeline, GPT2Tokenizer, GPT2LMHeadModel, post-processing
=== Initialization ===
pipeline("text-generation", model="gpt2")
1. Load the Tokenizer
2. Load the Model
3. Set padding_side="left" (Decoder-only batched generation requires left padding)
4. Default GenerationConfig: max_new_tokens=256, do_sample=True, temperature=0.7
=== preprocess ===
Call: ("Hello, I'm a language model", max_new_tokens=50)
tokenizer(prefix + prompt_text, return_tensors="pt") → input_ids, attention_mask
=== _forward ===
model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=50) → generated_sequence (batch, num_return, total_len)
=== postprocess ===
Slice off the newly generated tokens: sequence[prompt_len:]
tokenizer.decode(new_tokens, skip_special_tokens=True) → generated text
Result: {"generated_text": "Hello, I'm a language model, and I'm here to help..."}
Key initialization logic of the Pipeline (text_generation.py L99-L106):
python
class TextGenerationPipeline(Pipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.check_model_type(MODEL_FOR_CAUSAL_LM_MAPPING_NAMES)
        # Decoder-only models need left padding so that batched generation is correct
        if self.tokenizer is not None and self.tokenizer.padding_side == "right":
            self.tokenizer.padding_side = "left"
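Why left padding matters can be seen in a small sketch. Generation reads the logits of the last position of each row and appends new tokens on the right, so the last position must hold a real token for every sequence in the batch. `pad_batch` is a hypothetical helper, the token ids are arbitrary, and reusing GPT-2's EOS id 50256 as the pad id is an assumption (GPT-2 ships without a dedicated pad token).

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length; returns (input_ids, attention_mask)."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        if side == "left":
            input_ids.append([pad_id] * n_pad + seq)
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            input_ids.append(seq + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [[10, 11, 12], [20]]  # arbitrary token ids, two prompts of different length
PAD = 50256                   # assumption: reuse the EOS id as padding

left_ids, left_mask = pad_batch(batch, PAD, side="left")
right_ids, right_mask = pad_batch(batch, PAD, side="right")

# Last column = the position whose logits decide the next token.
print([row[-1] for row in left_ids])   # [12, 20]
print([row[-1] for row in right_ids])  # [12, 50256]
```

With right padding, the shorter prompt would continue from a pad token instead of from its own last real token.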
Default generation config of the Pipeline (text_generation.py L93-L97):
python
_default_generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=True,  # free-form text generation usually uses sampling
    temperature=0.7,
)
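What do_sample=True with temperature=0.7 computes can be sketched in a few lines: the logits are divided by the temperature before the softmax, which sharpens the distribution when the temperature is below 1, and the next token is then drawn from it. This is an illustration of the idea, not the Transformers code path; `softmax_with_temperature` and `sample_token` are hypothetical helpers.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stable)."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # arbitrary example scores for a 3-token vocabulary
print(softmax_with_temperature(logits, 1.0))  # baseline distribution
print(softmax_with_temperature(logits, 0.7))  # more mass on the top token
print(sample_token(logits, temperature=0.7, rng=random.Random(0)))
```

A lower temperature makes the output more deterministic; a higher one flattens the distribution and increases diversity.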
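10. State and Lifecycle Summary
The full lifecycle of a GPT-2 model in Transformers, from config creation to inference output, passes through the following state transitions:
State machine diagram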
[*] → ConfigCreated: GPT2Config()
ConfigCreated → ModelInstantiated: GPT2LMHeadModel(config); post_init() → _init_weights()
ModelInstantiated → WeightsLoaded: from_pretrained('gpt2'); download weights + load + tie weights
WeightsLoaded → Ready: model ready; .eval() mode
Ready (composite state): EvalMode → TrainMode on model.train(); TrainMode → EvalMode on model.eval()
Ready → Tokenized: tokenizer(text); ByteLevel BPE encoding
Tokenized → Prefilling: model.forward(input_ids); first forward pass + KV Cache fill
Prefilling → Decoding: generation loop; token-by-token decoding
Decoding → Decoding: next_token → forward → sample; incremental KV Cache update
Decoding → Generated: EOS / max_length; generation finished
Generated → PostProcessed: tokenizer.decode(); restore text
PostProcessed → [*]: output result
Ready → Training: model.train(); labels=input_ids
Training → Training: forward → loss → backward → step; tied-weight gradient accumulation
Training → Ready: model.eval()
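One way to read the state diagram is as a transition table. The sketch below is hypothetical and not code from Transformers: the `Stage` enum, the `TRANSITIONS` table and the helpers are illustrative names, and the allowed transitions are an interpretation of the diagram's state names and labels.

```python
from enum import Enum, auto


class Stage(Enum):
    CONFIG_CREATED = auto()
    MODEL_INSTANTIATED = auto()
    WEIGHTS_LOADED = auto()
    READY = auto()
    TOKENIZED = auto()
    PREFILLING = auto()
    DECODING = auto()
    GENERATED = auto()
    POST_PROCESSED = auto()
    TRAINING = auto()


# Allowed transitions, as one reading of the state diagram.
TRANSITIONS = {
    Stage.CONFIG_CREATED: {Stage.MODEL_INSTANTIATED},
    Stage.MODEL_INSTANTIATED: {Stage.WEIGHTS_LOADED},
    Stage.WEIGHTS_LOADED: {Stage.READY},
    Stage.READY: {Stage.TOKENIZED, Stage.TRAINING},
    Stage.TOKENIZED: {Stage.PREFILLING},
    Stage.PREFILLING: {Stage.DECODING},
    Stage.DECODING: {Stage.DECODING, Stage.GENERATED},  # self-loop: one token per step
    Stage.GENERATED: {Stage.POST_PROCESSED},
    Stage.POST_PROCESSED: set(),
    Stage.TRAINING: {Stage.TRAINING, Stage.READY},      # self-loop: one optimizer step
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def run(path):
    """Validate a whole path of stages and return its final stage."""
    stage = path[0]
    for target in path[1:]:
        stage = advance(stage, target)
    return stage


inference_path = [
    Stage.CONFIG_CREATED, Stage.MODEL_INSTANTIATED, Stage.WEIGHTS_LOADED,
    Stage.READY, Stage.TOKENIZED, Stage.PREFILLING, Stage.DECODING,
    Stage.DECODING, Stage.GENERATED, Stage.POST_PROCESSED,
]
print(run(inference_path).name)
```

The two self-loops capture the iterative parts of the lifecycle: one decoding step per generated token, and one forward/backward/step cycle per training batch.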
Summary of key lifecycle stages
| Stage | Key function/class | Source location |
|---|---|---|
| Config creation | `GPT2Config` | configuration_gpt2.py L25 |
| Model instantiation | `GPT2LMHeadModel.__init__` | modeling_gpt2.py L648 |
| Weight initialization | `GPT2PreTrainedModel._init_weights` | modeling_gpt2.py L433 |
| Pretrained loading | `PreTrainedModel.from_pretrained` | modeling_utils.py |
| Weight tying | `_tied_weights_keys` + `tie_weights()` | modeling_gpt2.py L646 |
| Tokenization | `GPT2Tokenizer.__call__` | tokenization_gpt2.py L94 |
| Forward pass | `GPT2Model.forward` | modeling_gpt2.py L522 |
| Attention computation | `GPT2Attention.forward` | modeling_gpt2.py L144 |
| Causal mask | `create_causal_mask` | masking_utils.py L894 |
| KV Cache | `DynamicCache.update` | cache_utils.py L1229 |
| Autoregressive generation | `GenerationMixin.generate` | generation/utils.py L339 |
| Loss computation | `GPT2LMHeadModel.loss_function` | modeling_gpt2.py L711 |
| Pipeline | `TextGenerationPipeline` | text_generation.py L23 |
| Conv1D linear layer | `Conv1D` | pytorch_utils.py L97 |
Model family overview
Transformers provides GPT-2 with several task heads:
| Class | Task | Head |
|---|---|---|
| `GPT2Model` | Base model (extracts hidden states) | None |
| `GPT2LMHeadModel` | Causal language modeling | lm_head (Linear, tied weights) |
| `GPT2DoubleHeadsModel` | Language modeling + multiple choice | lm_head + multiple_choice_head |
| `GPT2ForSequenceClassification` | Sequence classification | score (Linear) |
| `GPT2ForTokenClassification` | Token classification | classifier (Linear) |
| `GPT2ForQuestionAnswering` | Question answering | qa_outputs (Linear → 2) |
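Summary: As a foundational Decoder-only causal language model, GPT-2 has strongly influenced later LLMs. Successive generations refined its design, for example by moving from Conv1D to nn.Linear and from learned absolute position embeddings to RoPE, while largely keeping the Pre-Norm block layout that GPT-2 itself uses. Understanding GPT-2's full lifecycle means understanding the foundation of modern large language models.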