18 - Hugging Face Transformers: GPT-2 Case Study: The Complete Lifecycle of a Decoder-only Autoregressive Model

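Using GPT-2 as the example, this document ties together all modules of the Transformers framework. It focuses on the complete lifecycle of a Decoder-only autoregressive model: config loading, tokenization, the forward pass, the attention mechanism, autoregressive generation, and training and inference.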

Source files:


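Related articles:

Hugging Face Transformers: A Panoramic Reading of the Source Code
01 - Hugging Face Transformers: In-Depth Analysis of the Core Infrastructure
02 - Hugging Face Transformers: In-Depth Analysis of the Configuration System
03 - Hugging Face Transformers: In-Depth Analysis of the Model System
04 - Hugging Face Transformers: In-Depth Analysis of the Attention and Masking System
05 - Hugging Face Transformers: In-Depth Analysis of the Cache System
06 - Hugging Face Transformers: In-Depth Analysis of the Generation System
07 - Hugging Face Transformers: In-Depth Analysis of the Tokenizer System
08 - Hugging Face Transformers: In-Depth Analysis of the Multimodal Processing System
09 - Hugging Face Transformers: In-Depth Analysis of the Training System
10 - Hugging Face Transformers: In-Depth Analysis of the Quantization System
11 - Hugging Face Transformers: In-Depth Analysis of the Distributed and Parallel System
12 - Hugging Face Transformers: In-Depth Analysis of Inference Pipelines
13 - Hugging Face Transformers: In-Depth Analysis of the AutoModel Dispatch Mechanism
14 - Hugging Face Transformers: In-Depth Analysis of Model Implementation Patterns
15 - Hugging Face Transformers: Overview of the CLI and Tooling Architecture
16 - Hugging Face Transformers: Overview of the Testing Architecture
17 - Hugging Face Transformers: BERT Case Study: Tying Together Every Module of the Transformers Framework
18 - Hugging Face Transformers: GPT-2 Case Study: The Complete Lifecycle of a Decoder-only Autoregressive Model
19 - Hugging Face Transformers: Qwen3.5-MoE Series Explained: The Complete Lifecycle of Mixture-of-Experts + Linear Attention + Multimodality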

1. GPT-2's Place in Transformers

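GPT-2 is a Decoder-only causal language model released by OpenAI in 2019. It follows the autoregressive generation paradigm: each step predicts only the next token, and everything generated so far serves as context for later predictions. Unlike Encoder models such as BERT, GPT-2 applies a **causal mask** so that each position can attend only to itself and to earlier tokens.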
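The causal constraint can be pictured as a lower-triangular matrix. The following is a minimal, dependency-free sketch that uses plain Python lists rather than tensors; the helper name `causal_mask` is hypothetical and is not a Transformers API.

```python
def causal_mask(seq_len: int) -> list[list[int]]:
    """Return a seq_len x seq_len matrix whose entry [i][j] is 1 when
    position i may attend to position j (that is, j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Each row is one query position; the ones mark the keys it may look at.
for row in causal_mask(4):
    print(row)
```

Row i contains i + 1 ones, which is exactly the "itself and earlier tokens" rule: the last row sees the whole prefix, the first row sees only its own token.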

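A hallmark of GPT-2 is its use of Conv1D layers in place of the standard nn.Linear, a legacy of OpenAI's original implementation. The Conv1D weight matrix has shape (in_features, out_features), which is the transpose of the (out_features, in_features) weight stored by nn.Linear.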

Architecture positioning diagram

The Decoder-only causal language model family, linked by architectural inheritance and scale expansion:

  • GPT-2 (2019 · OpenAI): Conv1D + Pre-Norm LayerNorm; hidden sizes 768/1024/1280/1600
  • GPT-Neo / GPT-J (2021 · EleutherAI): nn.Linear + Parallel Attention
  • LLaMA (2023 · Meta): RMSNorm + SwiGLU + RoPE
  • Qwen series (2023- · Alibaba): RMSNorm + SwiGLU + RoPE

Changes along the lineage: Conv1D→nn.Linear, absolute positions→RoPE, GELU→SwiGLU.

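Summary of core characteristics

| Feature | GPT-2 | LLaMA (for comparison) |
|---|---|---|
| Linear layers | Conv1D (transposed weight) | nn.Linear |
| Normalization | LayerNorm, Pre-Norm | RMSNorm, Pre-Norm |
| Activation | gelu_new (tanh approximation of GELU) | silu (SwiGLU) |
| Position encoding | Learned absolute positions (wpe) | Rotary position embedding (RoPE) |
| Attention scaling | scale_attn_weights, plus optional inverse-layer-index scaling | Standard head_dim^-0.5 |
| Weight tying | lm_head.weight ↔ wte.weight | Not tied by default |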
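To make the activation comparison concrete, here is a small sketch using only the standard library. It assumes that gelu_new denotes the usual tanh approximation of GELU; the function names are local to this example.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_new(x: float) -> float:
    """Tanh approximation of GELU (assumed to be what "gelu_new" computes)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x), the gate activation inside SwiGLU."""
    return x / (1.0 + math.exp(-x))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.5f}  "
          f"gelu_new={gelu_new(x):+.5f}  silu={silu(x):+.5f}")
```

The two GELU variants agree to roughly three decimal places on this grid, while SiLU has a similar shape but a noticeably different curve near zero.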

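2. Config Definition and Special Design

GPT2Config inherits from PreTrainedConfig and is defined in configuration_gpt2.py. It uses the @strict decorator (from huggingface_hub) to enforce strict type checking of the dataclass fields, and its attribute_map exposes the unified Transformers names (such as hidden_size) as aliases of GPT-2's original names (such as n_embd).

Key parameters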

python
# From configuration_gpt2.py L78-L103
vocab_size: int = 50257               # vocabulary size
n_positions: int = 1024               # maximum number of position embeddings
n_embd: int = 768                     # hidden size
n_layer: int = 12                     # number of Transformer layers
n_head: int = 12                      # number of attention heads
n_inner: int | None = None            # MLP inner dimension; None means 4 * n_embd
activation_function: str = "gelu_new" # tanh approximation of GELU
scale_attn_weights: bool = True       # scale attention scores by 1/sqrt(d_k)
reorder_and_upcast_attn: bool = False # reorder and upcast the attention computation under mixed precision
add_cross_attention: bool = False     # add cross-attention (encoder-decoder setups)
tie_word_embeddings: bool = True      # weight tying: lm_head <-> wte
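Several quantities used later follow directly from these fields. The sketch below derives them with a stand-in dataclass; `TinyGPT2Config` and its properties are hypothetical names for illustration and are not part of Transformers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TinyGPT2Config:
    """Hypothetical stand-in mirroring a few GPT2Config fields."""
    n_embd: int = 768
    n_head: int = 12
    n_layer: int = 12
    n_inner: Optional[int] = None
    scale_attn_weights: bool = True

    @property
    def inner_dim(self) -> int:
        # The MLP width falls back to 4 * n_embd when n_inner is None.
        return self.n_inner if self.n_inner is not None else 4 * self.n_embd

    @property
    def head_dim(self) -> int:
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head

    @property
    def attn_scale(self) -> float:
        # 1/sqrt(d_k) when scaling is enabled, otherwise no scaling.
        return self.head_dim ** -0.5 if self.scale_attn_weights else 1.0


cfg = TinyGPT2Config()
print(cfg.inner_dim, cfg.head_dim, cfg.attn_scale)  # 3072 64 0.125
```

With the defaults, each of the 12 heads works on 64 dimensions, and the MLP expands 768 to 3072 before projecting back.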

Config class diagram

GPT2Config extends PreTrainedConfig.

  • PreTrainedConfig: model_type: str, is_decoder: bool, from_pretrained(), to_dict()
  • GPT2Config: vocab_size: int = 50257, n_positions: int = 1024, n_embd: int = 768, n_layer: int = 12, n_head: int = 12, n_inner: int | None, activation_function: str, scale_attn_weights: bool, reorder_and_upcast_attn: bool, add_cross_attention: bool, tie_word_embeddings: bool, attribute_map: dict

attribute_map lets the unified Transformers names resolve to GPT-2's original names:

  • hidden_size → n_embd
  • num_attention_heads → n_head
  • num_hidden_layers → n_layer
  • max_position_embeddings → n_positions
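One way to picture this aliasing is a class that redirects reads and writes of the unified names to the GPT-2 names. This is only an illustrative sketch under that assumption: `AliasedConfig` is a hypothetical class, and the real PreTrainedConfig logic may differ in its details.

```python
class AliasedConfig:
    """Hypothetical sketch of attribute aliasing; not the Transformers code."""

    # unified name -> model-specific name
    attribute_map = {
        "hidden_size": "n_embd",
        "num_attention_heads": "n_head",
        "num_hidden_layers": "n_layer",
        "max_position_embeddings": "n_positions",
    }

    def __init__(self, n_embd=768, n_head=12, n_layer=12, n_positions=1024):
        self.n_embd = n_embd
        self.n_head = n_head
        self.n_layer = n_layer
        self.n_positions = n_positions

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for alias names.
        target = type(self).attribute_map.get(name)
        if target is None:
            raise AttributeError(name)
        return getattr(self, target)

    def __setattr__(self, name, value):
        # Writes to an alias land on the underlying GPT-2 field.
        super().__setattr__(type(self).attribute_map.get(name, name), value)


cfg = AliasedConfig()
cfg.hidden_size = 1024
print(cfg.n_embd, cfg.hidden_size, cfg.num_hidden_layers)  # 1024 1024 12
```

The practical effect is that generic code can read `hidden_size` on any config without knowing that GPT-2 stores the value as `n_embd`.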

Conv1D vs nn.Linear comparison

| | nn.Linear | Conv1D (used by GPT-2) |
|---|---|---|
| Input x | (batch, seq, in_features) | (batch, seq, nx) |
| Weight W | shape (out_features, in_features) | shape (nx, nf), i.e. transposed |
| Output | y = xW^T + b, shape (batch, seq, out_features) | y = xW + b, shape (batch, seq, nf) |

The two are mathematically equivalent: the Conv1D weight is the transpose of the nn.Linear weight, so xW^T (nn.Linear) ≡ x·W_transposed (Conv1D).

The core difference in Conv1D (defined in pytorch_utils.py L97-L123):

python
import torch
from torch import nn


class Conv1D(nn.Module):
    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf  # number of output features
        self.nx = nx  # number of input features
        # Key point: the weight has shape (nx, nf), i.e. (in_features, out_features),
        # whereas nn.Linear stores its weight as (out_features, in_features).
        self.weight = nn.Parameter(torch.empty(nx, nf))
        self.bias = nn.Parameter(torch.zeros(nf))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        # torch.addmm(bias, input, weight) computes bias + input @ weight.
        # This equals F.linear(x, self.weight.T, self.bias), since F.linear(x, W, b) = x @ W.T + b.
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
        x = x.view(size_out)
        return x
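The equivalence can be checked by hand on a tiny example. The sketch below deliberately avoids torch and uses nested lists, so `matmul`, `conv1d_forward`, and `linear_forward` are illustrative helpers rather than library functions.

```python
def matmul(a, b):
    """Multiply an (m, k) matrix by a (k, n) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add_bias(m, bias):
    return [[v + b for v, b in zip(row, bias)] for row in m]


def conv1d_forward(x, weight, bias):
    """Conv1D convention: weight is (nx, nf), output is x @ W + b."""
    return add_bias(matmul(x, weight), bias)


def linear_forward(x, weight, bias):
    """nn.Linear convention: weight is (nf, nx), output is x @ W^T + b."""
    return add_bias(matmul(x, transpose(weight)), bias)


x = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]        # (tokens=2, nx=3)
w_conv = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]  # (nx=3, nf=2)
bias = [0.5, -0.5]

y_conv = conv1d_forward(x, w_conv, bias)
y_lin = linear_forward(x, transpose(w_conv), bias)  # same numbers, stored transposed
print(y_conv)           # [[5.5, 10.5], [-1.5, 10.5]]
print(y_conv == y_lin)  # True
```

The only difference between the two layers is the storage layout of the weight, which matters when converting a GPT-2 checkpoint to an nn.Linear-based implementation.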

3. The Complete from_pretrained Sequence

Several key steps take place between the call to GPT2LMHeadModel.from_pretrained('gpt2') and a ready-to-use model.

Sequence diagram

Participants: user code, AutoModelForCausalLM, GPT2Config, GPT2LMHeadModel, PreTrainedModel, HuggingFace Hub.

  • User code calls from_pretrained('gpt2')
  • config.json is downloaded from the Hub; GPT2Config.from_pretrained() returns a config instance
  • Based on model_type='gpt2', the call is dispatched to GPT2LMHeadModel: GPT2LMHeadModel.from_pretrained('gpt2') → PreTrainedModel.from_pretrained()
  • 1. Resolve the model architecture, then _load_pretrained_model()
  • 2. Download the weight file model.safetensors and obtain the weight state_dict
  • 3. Load the weights into the model with model.load_state_dict() (note: Conv1D weights need no transpose because the checkpoint is already in (nx, nf) format)
  • 4. Handle weight tying with tie_weights() (note: lm_head.weight ← transformer.wte.weight, as specified by _tied_weights_keys)
  • 5. Device placement and dtype conversion; the model is ready and a usable model instance is returned
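The dispatch step in this sequence, where the model_type field of config.json selects the model class, can be pictured as a registry lookup. The sketch below is a toy under that assumption: `MODEL_REGISTRY`, `register`, `ToyGPT2LMHeadModel`, and `toy_from_pretrained` are all hypothetical names, and no weights are loaded.

```python
import json

# Hypothetical registry; the real mapping lives inside the Auto classes.
MODEL_REGISTRY = {}


def register(model_type):
    """Class decorator that records which class handles a model_type."""
    def wrap(cls):
        MODEL_REGISTRY[model_type] = cls
        return cls
    return wrap


@register("gpt2")
class ToyGPT2LMHeadModel:
    def __init__(self, config):
        self.config = config


def toy_from_pretrained(config_json: str):
    """Parse a config.json payload and instantiate the class registered
    for its model_type (weight loading is out of scope here)."""
    config = json.loads(config_json)
    try:
        model_cls = MODEL_REGISTRY[config["model_type"]]
    except KeyError as err:
        raise ValueError(f"Unrecognized model_type: {config.get('model_type')!r}") from err
    return model_cls(config)


model = toy_from_pretrained('{"model_type": "gpt2", "n_embd": 768, "n_layer": 12}')
print(type(model).__name__, model.config["n_embd"])  # ToyGPT2LMHeadModel 768
```

Keeping the config download separate from class selection is what lets a single entry point serve many architectures.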

Weight tying mechanism

GPT2LMHeadModel declares the weight tie through _tied_weights_keys (modeling_gpt2.py L646):

python
class GPT2LMHeadModel(GPT2PreTrainedModel, GenerationMixin):
    _tied_weights_keys = {"lm_head.weight": "transformer.wte.weight"}

    def __init__(self, config):
        super().__init__(config)
        self.transformer = GPT2Model(config)
        # lm_head is an nn.Linear, so its weight has shape (vocab_size, n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.post_init()  # tie_weights() is triggered from here

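Tying: lm_head.weight (shape (50257, 768)) and transformer.wte.weight (shape (50257, 768)) share the same storage, so a change made through one is immediately visible through the other. The shapes match because lm_head is an nn.Linear, which stores its weight as (out_features, in_features).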
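Shared storage simply means that both names point at one object. A dependency-free sketch of the idea, with nested lists standing in for tensors and the hypothetical classes `ToyEmbedding` and `ToyLMHead`:

```python
class ToyEmbedding:
    def __init__(self, vocab_size, n_embd):
        # One row per token id, filled with a simple deterministic pattern.
        self.weight = [[0.01 * (t + d) for d in range(n_embd)] for t in range(vocab_size)]

    def __call__(self, token_id):
        return self.weight[token_id]


class ToyLMHead:
    def __init__(self, vocab_size, n_embd):
        self.weight = [[0.0] * n_embd for _ in range(vocab_size)]

    def __call__(self, hidden):
        # logits[t] = <hidden, weight[t]>, i.e. hidden @ weight^T
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.weight]


wte = ToyEmbedding(vocab_size=5, n_embd=3)
lm_head = ToyLMHead(vocab_size=5, n_embd=3)

lm_head.weight = wte.weight   # tie: both names now refer to one object
wte.weight[2][0] = 9.0        # an update made through the embedding ...
print(lm_head.weight[2][0])   # ... is visible through the head: 9.0
print(lm_head.weight is wte.weight)  # True
```

Besides keeping the two matrices in sync, tying means the vocabulary projection adds no parameters of its own.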

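Special handling when loading Conv1D weights

Because the GPT-2 checkpoint already stores Conv1D weights in (nx, nf) layout, no extra transpose is needed at load time. The resulting shapes for the base model (n_embd = 768) are worth keeping in mind:

  • c_attn.weight: shape (768, 2304), projects Q/K/V in a single matrix multiply
  • c_proj.weight: shape (768, 768), attention output projection
  • c_fc.weight: shape (768, 3072), MLP up-projection
  • c_proj.weight (MLP): shape (3072, 768), MLP down-projection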
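These shapes are enough to count the parameters of the base model. The sketch below assumes, beyond what is listed here, that every Conv1D and LayerNorm carries a bias, that each block has two LayerNorms, and that a final LayerNorm (ln_f) follows the last block; `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters from the weight shapes; lm_head is tied to wte
    and therefore adds nothing."""
    inner = 4 * n_embd
    embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
    layer_norm = 2 * n_embd                                   # weight + bias
    # c_attn (n_embd -> 3 * n_embd) and the attention c_proj, each with bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # c_fc (n_embd -> inner) and the MLP c_proj (inner -> n_embd), each with bias
    mlp = (n_embd * inner + inner) + (inner * n_embd + n_embd)
    per_layer = 2 * layer_norm + attn + mlp
    return embeddings + n_layer * per_layer + layer_norm      # + final ln_f


print(f"{gpt2_param_count():,}")  # 124,439,808
```

About 31% of the total sits in the token embedding matrix, which is one reason tying it to lm_head matters at this scale.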

4. Tokenizer Encoding Flow

GPT2Tokenizer is defined in tokenization_gpt2.py, inherits from TokenizersBackend, and uses the Byte-Level BPE tokenization algorithm.

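Core design features

  1. ByteLevel preprocessing: all characters are first converted to UTF-8 bytes, and each byte is then mapped to a Unicode character, so any text can be encoded (no <unk> problem)
  2. No CLS/SEP: GPT-2 has no BERT-style special separator tokens, only <|endoftext|>, which serves as both BOS and EOS
  3. Space-sensitive: a leading space changes the resulting token (for example "Hello" vs " Hello")
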
python
# From tokenization_gpt2.py L94-L129
class GPT2Tokenizer(TokenizersBackend):
    vocab_files_names = VOCAB_FILES_NAMES  # {"vocab_file": "vocab.json", "merges_file": "merges.txt"}
    model_input_names = ["input_ids", "attention_mask"]
    model = BPE  # uses the BPE model

    def __init__(self, vocab, merges, errors="replace",
                 unk_token="<|endoftext|>", bos_token="<|endoftext|>",
                 eos_token="<|endoftext|>", pad_token=None,
                 add_prefix_space=False, **kwargs):
        self.add_prefix_space = add_prefix_space
        self._vocab = vocab if vocab is not None else {}
        self._merges = merges or []
        # Build the BPE model from the tokenizers library
        self._tokenizer = Tokenizer(BPE(
            vocab=self._vocab, merges=self._merges,
            dropout=None, continuing_subword_prefix="",
            end_of_word_suffix="", fuse_unk=False,
        ))
        # ByteLevel pre-tokenizer: converts text to its byte-level representation
        self._tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
            add_prefix_space=add_prefix_space
        )
        # ByteLevel decoder: converts the byte-level representation back to text
        self._tokenizer.decoder = decoders.ByteLevel()
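The byte-to-character step behind ByteLevel can be written in a few lines. This sketch follows the mapping scheme of the original GPT-2 encoder as best understood here and needs only the standard library; `to_byte_level` is a hypothetical helper added for demonstration.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) move to 256 and up.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def to_byte_level(text: str) -> str:
    """Encode text as UTF-8 and replace every byte by its mapped character."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


print(to_byte_level("Hello world"))  # HelloĠworld
```

The space byte (0x20) is the 33rd non-printable byte, so it lands on code point 256 + 32 = U+0120, which is the Ġ seen in GPT-2 tokens.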

Encoding flowchart

  1. Raw text: 'Hello world'
  2. ByteLevel preprocessing (pre_tokenizers.ByteLevel): split the Unicode text at word boundaries (a leading space becomes the Ġ prefix), then map each character to its UTF-8 bytes and each byte to a Unicode character ('H'→'H', 'e'→'e', ' '→'Ġ')
  3. BPE tokenization (Tokenizer(BPE)): start with one symbol per byte, repeatedly merge the adjacent pair that has the highest priority in merges.txt, and output the subword sequence 'Hello', 'Ġworld'
  4. Vocabulary lookup (vocab.json): input_ids: 15496, 995
  5. attention_mask generation: all ones by default when there is no padding; attention_mask: 1, 1
  6. Model input: input_ids + attention_mask
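The merge loop in the BPE step can be sketched as follows. The merge table here is a made-up toy chosen so that the example words collapse to single tokens; the real priorities come from merges.txt, and `bpe_encode` is a hypothetical helper rather than the tokenizers implementation.

```python
def bpe_encode(symbols: list[str], merges: list[tuple[str, str]]) -> list[str]:
    """Greedy BPE: repeatedly merge the adjacent pair with the best
    (lowest) rank until no listed pair remains."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair appears in the merge table
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, highest priority first.
toy_merges = [("l", "l"), ("H", "e"), ("He", "ll"), ("Hell", "o"),
              ("Ġ", "w"), ("o", "r"), ("l", "d"), ("Ġw", "or"), ("Ġwor", "ld")]

print(bpe_encode(list("Hello"), toy_merges))   # ['Hello']
print(bpe_encode(list("Ġworld"), toy_merges))  # ['Ġworld']
```

At encoding time the choice of pair depends only on its rank in the merge table, not on frequencies in the input text; frequencies matter when the table is learned.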

Space sensitivity example

python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
tokenizer("Hello world")["input_ids"]     # [15496, 995]    → "Hello" + "Ġworld"
tokenizer(" Hello world")["input_ids"]    # [18435, 995]    → "ĠHello" + "Ġworld"
# Note: "Hello" and " Hello" are encoded as different tokens!

5. The Full Forward Pass

The complete data flow from input_ids to logits is defined in modeling_gpt2.py, in GPT2LMHeadModel.forward(), which calls GPT2Model.forward().

Data flow diagram

input_ids (batch, seq_len)
  → wte: nn.Embedding, token embedding → (batch, seq_len, 768)
position_ids, generated automatically, (1, seq_len)
  → wpe: nn.Embedding, position embedding → (1, seq_len, 768)
⊕ add: hidden = wte + wpe
  → drop: Dropout (embd_pdrop=0.1)
  → GPT2Block #0: ln_1 → Attn → + → ln_2 → MLP → +
  → GPT2Block #1: ln_1 → Attn → + → ln_2 → MLP → +
  → ...
  → GPT2Block #11: ln_1 → Attn → + → ln_2 → MLP → +
  → ln_f: LayerNorm (final layer normalization)
  → lm_head: nn.Linear(768, 50257, bias=False)
  → logits (batch, seq_len, 50257)
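The embedding step at the top of this data flow (hidden = wte + wpe) can be sketched without any framework. This is a minimal illustration using plain Python lists; the toy sizes and the helper name `embed` are assumptions made for the example, not part of modeling_gpt2.py, which uses nn.Embedding with a 50257-token vocabulary and a hidden size of 768.

```python
import random

def embed(input_ids, wte, wpe):
    """Return hidden[b][t] = wte[token] + wpe[t] for every batch row b."""
    hidden = []
    for row in input_ids:
        hidden.append([
            [tok_e + pos_e for tok_e, pos_e in zip(wte[token], wpe[position])]
            for position, token in enumerate(row)
        ])
    return hidden

# Toy sizes, chosen only for illustration.
VOCAB, MAX_POS, HIDDEN = 10, 8, 4
rng = random.Random(0)
wte = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
wpe = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(MAX_POS)]

input_ids = [[3, 1, 4], [1, 5, 9]]       # (batch=2, seq_len=3)
hidden = embed(input_ids, wte, wpe)       # (batch, seq_len, HIDDEN)
shape = (len(hidden), len(hidden[0]), len(hidden[0][0]))
print(shape)
```

The position table is indexed by position only, so it is shared across the batch; this is why the diagram shows position_ids with a leading dimension of 1 that broadcasts against the token embeddings.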

Internal structure of a single GPT2Block

hidden_states (batch, seq, 768)
  → save residual
  → ln_1: LayerNorm
  → GPT2Attention: c_attn → Q/K/V split → Causal Attention → c_proj
  → residual connection: hidden = attn_out + residual
  → save residual
  → ln_2: LayerNorm
  → GPT2MLP: c_fc → gelu_new → c_proj
  → residual connection: hidden = mlp_out + residual
  → output hidden_states (batch, seq, 768)
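The gelu_new activation named in the GPT2MLP step is, in Transformers, the tanh approximation of GELU. The sketch below writes that formula out for a single float and compares it with the exact GELU computed from the Gaussian CDF; the function names are chosen for this example and the code is scalar-only, so it illustrates the formula rather than the library implementation.

```python
import math

def gelu_new(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_exact(x: float) -> float:
    """Exact GELU via the Gaussian CDF, shown only for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  gelu_new={gelu_new(x):+.4f}  exact={gelu_exact(x):+.4f}")
```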

Corresponding source (modeling_gpt2.py L262-L309), abridged:

class GPT2Block(GradientCheckpointingLayer):
    def forward(self, hidden_states, past_key_values=None,
                attention_mask=None, ...):  # remaining parameters elided
        # Pre-Norm: normalize first, then attention, then add the residual
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_output, _ = self.attn(hidden_states, ...)
        hidden_states = attn_output + residual  # first residual connection

        # Pre-Norm: normalize first, then the MLP, then add the residual
        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        hidden_states = residual + feed_forward_hidden_states  # second residual connection

        return hidden_states

Pre-Norm vs. Post-Norm architecture comparison diagram

GPT-2: Pre-Norm (LayerNorm)
x
  → LayerNorm(x)
  → Attention(LayerNorm(x))
  → x + Attention(LayerNorm(x))
  → LayerNorm(x + Attn_out)
  → MLP(LayerNorm(x + Attn_out))
  → x + Attn_out + MLP(LayerNorm(...))
  → output

LLaMA: Pre-Norm (RMSNorm)
x
  → RMSNorm(x)
  → Attention(RMSNorm(x))
  → x + Attention(RMSNorm(x))
  → RMSNorm(x + Attn_out)
  → MLP(RMSNorm(x + Attn_out))
  → x + Attn_out + MLP(RMSNorm(...))
  → output

Post-Norm (original Transformer, BERT), for contrast
x
  → Attention(x)
  → x + Attention(x)
  → LayerNorm(x + Attention(x))
  → MLP(LN_out)
  → LN_out + MLP(LN_out)
  → LayerNorm(LN_out + MLP(LN_out))
  → output

Pre-Norm vs. Post-Norm: as GPT2Block.forward() shows, GPT-2 uses Pre-Norm. ln_1 and ln_2 are applied to the input of each sub-layer, and the residual path itself is never normalized, which is why a final ln_f is applied after the last block. LLaMA is also Pre-Norm but replaces LayerNorm with RMSNorm. Post-Norm, where normalization follows the residual addition as in the original Transformer and BERT, can suffer from unstable gradients in deep stacks; Pre-Norm trains more stably and is the mainstream choice in modern LLMs.
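The two orderings differ only in where the normalization sits relative to the residual addition. The following sketch contrasts them on a single vector: `pre_norm` follows the order used in the GPT2Block.forward() excerpt above (normalize, run the sub-layer, then add the residual), while `post_norm` normalizes the residual sum. The `toy_sublayer` function is a hypothetical stand-in for attention or the MLP, and the layer norm here has no learned scale or shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm(x, sublayer):
    """x + sublayer(norm(x)): the residual path carries x through untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm(x, sublayer):
    """norm(x + sublayer(x)): the residual sum itself is normalized."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(h):
    # Hypothetical stand-in for attention or the MLP.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 4.0, 9.0]
pre = pre_norm(x, toy_sublayer)
post = post_norm(x, toy_sublayer)
print("pre-norm :", [round(v, 3) for v in pre])
print("post-norm:", [round(v, 3) for v in post])
```

In the Pre-Norm output the mean of x survives, because the input passes through the skip connection unchanged; in the Post-Norm output it is normalized away at every layer.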


6. How the Attention System Works

GPT2Attention, defined in modeling_gpt2.py L75-L226, is GPT-2's core computation module.

Attention forward pass flowchart

hidden_states (batch, seq, 768)
  → c_attn: Conv1D(2304, 768), projects Q/K/V in a single pass
  → split(768, dim=2) → Q, K, V, each (batch, seq, 768)
  → Q.view + transpose → (batch, num_heads, seq, head_dim)
  → K.view + transpose → (batch, num_heads, seq, head_dim)
  → V.view + transpose → (batch, num_heads, seq, head_dim)
  → KV cache update: past_key_values.update(K, V) → K_full, V_full (including history)
  → attn_weights = Q @ K_full^T × scaling, where scaling = 1/√head_dim
  → add the causal mask from create_causal_mask()
  → Softmax(dim=-1)
  → attn_dropout
  → attn_output = weights @ V_full → (batch, num_heads, seq, head_dim)
  → reshape → (batch, seq, 768)
  → c_proj: Conv1D(768, 768), output projection
  → resid_dropout
  → attn_output (batch, seq, 768)
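The core of this flow (scaled scores, causal masking, softmax, weighted sum of V) can be sketched for a single head in plain Python. This is an illustrative, eager-style sketch under simplifying assumptions: one head, no batch dimension, no dropout, no KV cache, and inputs given as (seq, head_dim) nested lists. The function names are chosen for the example and are not taken from modeling_gpt2.py.

```python
import math

def softmax(row):
    """Numerically stable softmax; -inf entries receive weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head attention over (seq, head_dim) lists with a causal mask."""
    head_dim = len(q[0])
    scaling = 1.0 / math.sqrt(head_dim)
    outputs, all_weights = [], []
    for q_idx, q_vec in enumerate(q):
        scores = []
        for kv_idx, k_vec in enumerate(k):
            if kv_idx <= q_idx:
                scores.append(sum(a * b for a, b in zip(q_vec, k_vec)) * scaling)
            else:
                scores.append(float("-inf"))  # future position: masked out
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([
            sum(w * v_vec[d] for w, v_vec in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return outputs, all_weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(q, k, v)
print(out[0])      # position 0 can only see itself, so this equals v[0]
print(weights[1])  # the weight on position 2 is exactly 0.0
```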

Causal mask generation diagram

The causal mask is generated by create_causal_mask() in masking_utils.py and ensures that each position can attend only to itself and earlier tokens:
create_causal_mask() (masking_utils.py L894)
  → config.is_causal?
      True (default) → causal_mask_function: kv_idx <= q_idx
      False → create_bidirectional_mask: bidirectional mask
  → attention implementation type?
      eager → eager_mask(): builds a 4D float mask, 0 (visible) / -inf (masked)
      sdpa → sdpa_mask(): builds a 4D bool mask, True (visible) / False (masked)
      flash_attention_2 → flash_attention_mask(): returns a 2D mask or None
  → 4D mask (batch, 1, q_len, kv_len) from eager_mask() and sdpa_mask()

Causal mask matrix illustration (5×5; rows are query positions, columns are key positions):

pos   0  1  2  3  4
 0 [ ■  ⬚  ⬚  ⬚  ⬚ ]    ■ = visible (0)
 1 [ ■  ■  ⬚  ⬚  ⬚ ]    ⬚ = masked (-inf)
 2 [ ■  ■  ■  ⬚  ⬚ ]
 3 [ ■  ■  ■  ■  ⬚ ]
 4 [ ■  ■  ■  ■  ■ ]
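The matrix above follows directly from the rule kv_idx <= q_idx. The sketch below builds it in both the boolean form (True = visible) and the additive float form (0 = visible, -inf = masked). The helper names and the `offset` parameter are assumptions introduced for this example; `offset` models the number of tokens already in the cache, so that a single new query at decode time can see the whole history.

```python
def causal_bool_mask(q_len, kv_len, offset=0):
    """True where query q_idx (shifted by offset) may attend to key kv_idx."""
    return [[kv_idx <= q_idx + offset for kv_idx in range(kv_len)]
            for q_idx in range(q_len)]

def to_additive(bool_mask):
    """Convert a boolean mask to the additive form: 0.0 visible, -inf masked."""
    return [[0.0 if visible else float("-inf") for visible in row]
            for row in bool_mask]

mask = causal_bool_mask(5, 5)
for row in mask:
    print(" ".join("x" if visible else "." for visible in row))

# Decode step with 5 cached tokens: one query row that sees all 6 keys.
decode_mask = causal_bool_mask(1, 6, offset=5)
```

The additive form is added to the attention scores before the softmax, so masked positions end up with weight 0.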

KV cache incremental update sequence diagram

Participants: GPT2Model, GPT2Attention, DynamicCache

=== First forward pass (Prefill) ===
GPT2Model → GPT2Attention: hidden_states (batch, 5, 768)
GPT2Attention: c_attn → Q, K, V; K: (batch, 12, 5, 64), V: (batch, 12, 5, 64)
GPT2Attention → DynamicCache: update(K, V, layer_idx=0)
DynamicCache → GPT2Attention: K_full=(batch,12,5,64), V_full=(batch,12,5,64)
GPT2Attention: Q @ K_full^T → softmax → @ V_full
GPT2Attention → GPT2Model: attn_output (batch, 5, 768)

=== Generating token 1 ===
GPT2Model → GPT2Attention: hidden_states (batch, 1, 768)
GPT2Attention: c_attn → Q, K, V; K_new: (batch, 12, 1, 64), V_new: (batch, 12, 1, 64)
GPT2Attention → DynamicCache: update(K_new, V_new, layer_idx=0)
DynamicCache: concatenates the history, K_full = cat(K_old, K_new) → (batch, 12, 6, 64)
DynamicCache → GPT2Attention: K_full=(batch,12,6,64), V_full=(batch,12,6,64)
GPT2Attention: Q @ K_full^T → softmax → @ V_full
GPT2Attention → GPT2Model: attn_output (batch, 1, 768)

=== Generating token 2 ===
GPT2Model → GPT2Attention: hidden_states (batch, 1, 768)
GPT2Attention → DynamicCache: update(K_new, V_new, layer_idx=0)
DynamicCache: K_full → (batch, 12, 7, 64)
DynamicCache → GPT2Attention: K_full=(batch,12,7,64), V_full=(batch,12,7,64)
GPT2Attention → GPT2Model: attn_output (batch, 1, 768)

Key KV cache source (modeling_gpt2.py L193-L199):

# Inside GPT2Attention.forward(); is_cross_attention, is_updated and
# curr_past_key_values are set earlier in the same method.
if (past_key_values is not None and not is_cross_attention) or (
    past_key_values is not None and is_cross_attention and not is_updated
):
    # Concatenate the new K/V with the historical K/V held in the cache
    key_states, value_states = curr_past_key_values.update(
        key_states, value_states, self.layer_idx
    )
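The contract of update() in the excerpt above (pass in the new K/V for one layer, get back the full history) can be illustrated with a toy cache. `TinyCache` and `fake_kv` are hypothetical names written for this sketch; this is not the DynamicCache implementation, and it drops the batch and head dimensions so that only the sequence axis grows.

```python
class TinyCache:
    """Hypothetical per-layer KV cache; entries are lists shaped (seq, head_dim)."""

    def __init__(self):
        self.keys = {}
        self.values = {}

    def update(self, key_states, value_states, layer_idx):
        """Append the new K/V along the sequence axis and return the full history."""
        self.keys.setdefault(layer_idx, []).extend(key_states)
        self.values.setdefault(layer_idx, []).extend(value_states)
        return self.keys[layer_idx], self.values[layer_idx]

def fake_kv(seq_len, head_dim=64):
    # Placeholder K/V rows; a real model would compute these with c_attn.
    return [[0.0] * head_dim for _ in range(seq_len)]

cache = TinyCache()
seq_lens = []
for new_tokens in (5, 1, 1):  # prefill, then two decode steps
    k_full, v_full = cache.update(fake_kv(new_tokens), fake_kv(new_tokens), layer_idx=0)
    seq_lens.append(len(k_full))
print(seq_lens)  # [5, 6, 7]
```

The sequence lengths 5, 6 and 7 match the K_full shapes in the sequence diagram: each decode step computes K/V for one new token and reuses everything already stored.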

7. The Full generate() Flow

GPT2LMHeadModel inherits from GenerationMixin (modeling_gpt2.py L645); its generate() method is defined in generation/utils.py.

Generation loop sequence diagram

[Diagram unavailable: the Mermaid sequence diagram failed to render in the source. The surviving fragment names the participants Loop and Cache and a note beginning "=== Pref".]
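In its simplest, greedy form, an autoregressive generation loop repeats three steps: run a forward pass on the current context, pick the highest-scoring next token, and append it to the context. The sketch below shows that loop with a toy scoring function. `toy_logits` is a hypothetical stand-in for a model forward pass, and the sketch omits what the real generate() adds on top, such as sampling, beam search, logits processors, batching, and KV cache reuse.

```python
def toy_logits(tokens, vocab_size=6):
    """Hypothetical forward pass: favors (last_token + 1) % vocab_size."""
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if tok == target else 0.0 for tok in range(vocab_size)]

def greedy_generate(prompt, max_new_tokens, eos_token_id):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)  # forward pass on the current context
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_token)    # the new token becomes part of the context
        if next_token == eos_token_id:  # stopping criterion
            break
    return tokens

print(greedy_generate([0, 1], max_new_tokens=10, eos_token_id=5))  # [0, 1, 2, 3, 4, 5]
```

This version re-reads the whole context at every step; with a KV cache, only the newest token needs to pass through the model after the prefill step.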

Assisted decoding flowchart (Speculative Decoding)

When an assistant model is used for speculative decoding, the flow is as follows:
Figure (flowchart): speculative decoding between the main model (GPT2LMHeadModel) and the assistant model (assistant_model). The assistant model generates K candidate tokens, and the main model verifies all K candidates in a single forward pass. If all are accepted, the K tokens are kept and generation continues. If the j-th candidate is rejected, the first j-1 tokens are kept, the j-th token is sampled from the main model, and the next round of speculation begins.
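
The accept/reject step can be sketched in plain Python. This is a minimal greedy variant, not the library's implementation: the function name verify_candidates and the token ids are hypothetical, and the sketch assumes the main model's single forward pass yields K + 1 predictions, so one extra token is appended even when every candidate is accepted.

```python
def verify_candidates(candidates, main_model_tokens):
    """Return the tokens kept after one verification round (greedy variant).

    candidates: K tokens proposed by the assistant model.
    main_model_tokens: K + 1 tokens the main model would pick at the same
        positions, all obtained from one forward pass over the candidates.
    """
    if len(main_model_tokens) != len(candidates) + 1:
        raise ValueError("main_model_tokens must hold K + 1 entries")

    # Count how many leading candidates agree with the main model.
    n_matches = 0
    for proposed, verified in zip(candidates, main_model_tokens):
        if proposed != verified:
            break
        n_matches += 1

    # Keep the agreeing prefix, then take the main model's own token at the
    # first disagreement (or its extra token when every candidate matched).
    return list(candidates[:n_matches]) + [main_model_tokens[n_matches]]


all_accepted = verify_candidates([5, 7, 9], [5, 7, 9, 4])
third_rejected = verify_candidates([5, 7, 9], [5, 7, 2, 4])
```

With sampling instead of greedy selection, the comparison would be replaced by a probabilistic acceptance test; that case is left out of this sketch.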


8. Training Workflow

Training loop sequence diagram

(Diagram unavailable: it failed to render in the source because of a Mermaid parse error. The one recoverable message is the loss computation passing a scalar loss to the optimizer.)

Tied-weight gradient flow diagram

Figure (flowchart): wte (nn.Embedding, weight (50257, 768)) and lm_head (nn.Linear, weight (50257, 768)) share storage and are the same parameter. In the forward pass both modules read the shared weight. In the backward pass, lm_head.weight.grad (the gradient from the logits) and wte.weight.grad (the gradient from the embedding layer) accumulate into the shared weight: gradient = lm_head gradient + wte gradient. optimizer.step() then updates the shared weight once.
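
The gradient accumulation on a tied weight can be checked on a toy model. The sketch below is pure Python and rests on simplifying assumptions: a 3-token vocabulary, hidden size 2, and no transformer blocks between the embedding lookup and the output head. The helper names (loss_fn, tied_gradient, numeric_gradient) are illustrative, not library functions.

```python
import math


def loss_fn(embed_w, head_w, token, target):
    """Cross-entropy of a toy model: logits = head_w @ embed_w[token]."""
    hidden = embed_w[token]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in head_w]
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return log_norm - logits[target]


def tied_gradient(weight, token, target):
    """Analytic gradient w.r.t. one weight shared by embedding and head."""
    hidden = weight[token]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    total = sum(math.exp(z) for z in logits)
    # d(loss)/d(logits) = softmax(logits) - one_hot(target)
    d_logits = [math.exp(z) / total - (1.0 if v == target else 0.0)
                for v, z in enumerate(logits)]

    # Contribution through the output head: outer(d_logits, hidden).
    grad = [[d * h for h in hidden] for d in d_logits]
    # Contribution through the embedding lookup: only row `token` is touched.
    for k in range(len(hidden)):
        grad[token][k] += sum(d * row[k] for d, row in zip(d_logits, weight))
    return grad


def numeric_gradient(weight, token, target, eps=1e-6):
    """Central finite differences on the tied loss, used as a reference."""
    grad = [[0.0] * len(row) for row in weight]
    for i in range(len(weight)):
        for k in range(len(weight[i])):
            plus = [row[:] for row in weight]
            minus = [row[:] for row in weight]
            plus[i][k] += eps
            minus[i][k] -= eps
            grad[i][k] = (loss_fn(plus, plus, token, target)
                          - loss_fn(minus, minus, token, target)) / (2 * eps)
    return grad


weight = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]  # toy (vocab=3, hidden=2)
analytic = tied_gradient(weight, token=1, target=2)
numeric = numeric_gradient(weight, token=1, target=2)
max_error = max(abs(a - n) for ra, rn in zip(analytic, numeric)
                for a, n in zip(ra, rn))
```

The analytic gradient is the sum of the head contribution and the embedding contribution, and it agrees with the finite-difference gradient of the tied loss, which is the behaviour the diagram describes.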

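Key source code for the loss computation (modeling_gpt2.py L708-L716):

```python
# GPT2LMHeadModel.forward()
loss = None
if labels is not None:
    # Labels are shifted automatically: logits keep the first n-1 positions,
    # labels keep the last n-1 positions.
    # In other words, the output at position i predicts the token at position i+1.
    loss = self.loss_function(
        logits,
        labels,
        vocab_size=self.config.vocab_size,
        **kwargs,
    )
```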
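
The shift can be illustrated without any framework. The following is a pure-Python sketch under stated assumptions, not the library's loss function: causal_lm_loss is a hypothetical name, it handles a single sequence, and the ignore_index value of -100 mirrors the PyTorch cross-entropy default.

```python
import math


def causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for one sequence.

    logits: list of n rows, each a list of vocab_size scores.
    labels: list of n token ids, typically the input ids themselves.
    """
    # Shift: position i is scored against the token at position i + 1.
    shifted_logits = logits[:-1]
    shifted_labels = labels[1:]

    total, count = 0.0, 0
    for row, target in zip(shifted_logits, shifted_labels):
        if target == ignore_index:
            continue  # masked positions (e.g. padding) do not contribute
        log_norm = math.log(sum(math.exp(z) for z in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


# Uniform logits over a 4-token vocabulary give a loss of log(4).
uniform_logits = [[0.0] * 4 for _ in range(5)]
loss = causal_lm_loss(uniform_logits, labels=[0, 1, 2, 3, 1])
```

Because of the shift, a sequence of n tokens contributes at most n-1 loss terms: the last position has no next token to predict.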

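GPT-2's Scaled Residual Initialization

GPT-2 uses a special initialization scheme for the residual path (modeling_gpt2.py L448-L458):

```python
# GPT2PreTrainedModel._init_weights()
if isinstance(module, PreTrainedModel):
    for name, p in module.named_parameters():
        if name == "c_proj.weight":
            # Scale the residual projection weights by 1/sqrt(2N).
            # N is the number of Blocks (n_layer); the factor 2 accounts for
            # the 2 residual connections in each Block.
            init.normal_(p, mean=0.0,
                std=self.config.initializer_range / math.sqrt(2 * self.config.n_layer))
```

This scheme comes from the GPT-2 paper: as the model gets deeper, variance accumulates along the residual path, and shrinking the weights of the residual layers offsets that accumulation.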
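
The scaling factor is easy to work out by hand. The sketch below assumes initializer_range=0.02 and n_layer=12 for illustration, uses a hypothetical helper name (residual_proj_std), and treats the residual contributions as independent, which is a simplification.

```python
import math


def residual_proj_std(initializer_range, n_layer):
    """Std for c_proj.weight: initializer_range / sqrt(2 * n_layer)."""
    return initializer_range / math.sqrt(2 * n_layer)


base_std = 0.02   # assumed initializer_range
n_layer = 12      # assumed depth (12 Blocks)
scaled_std = residual_proj_std(base_std, n_layer)

# Summing 2 * n_layer independent residual contributions, each with variance
# scaled_std ** 2, gives back roughly the variance of one unscaled layer.
accumulated_variance = 2 * n_layer * scaled_std ** 2
```

Under these assumptions the 24 residual branches together contribute about the same variance as a single layer initialized with the unscaled standard deviation.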


9. Pipeline Inference

pipeline("text-generation", model="gpt2") uses TextGenerationPipeline (text_generation.py), which wraps the full flow of tokenization, generation, and post-processing.

Pipeline sequence diagram

Figure (sequence diagram): the participants are the user code, TextGenerationPipeline, GPT2Tokenizer, GPT2LMHeadModel, and post-processing.

Initialization: pipeline("text-generation", model="gpt2") (1) loads the tokenizer, (2) loads the model, (3) sets padding_side="left" (decoder-only batched generation needs left padding), and (4) applies the default GenerationConfig (max_new_tokens=256, do_sample=True, temperature=0.7).

preprocess: the user calls the pipeline with ("Hello, I'm a language model", max_new_tokens=50); the pipeline calls tokenizer(prefix + prompt_text, return_tensors="pt") and receives input_ids and attention_mask.

_forward: model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=50) returns generated_sequence with shape (batch, num_return, total_len).

postprocess: the newly generated tokens are sliced out as sequence[prompt_len:], decoded with tokenizer.decode(new_tokens, skip_special_tokens=True), and returned as {"generated_text": "Hello, I'm a language model, and I'm here to help..."}.

Key initialization logic of the pipeline (text_generation.py L99-L106):

```python
class TextGenerationPipeline(Pipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.check_model_type(MODEL_FOR_CAUSAL_LM_MAPPING_NAMES)
        # Decoder-only models need left padding so that batched generation is correct
        if self.tokenizer is not None and self.tokenizer.padding_side == "right":
            self.tokenizer.padding_side = "left"
```
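
Why left padding matters can be seen on a two-prompt batch. This is a plain-Python sketch: pad_batch is a hypothetical helper rather than a tokenizer method, and the pad id 0 is chosen only for the illustration.

```python
def pad_batch(sequences, pad_id, side):
    """Pad token-id lists to equal length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(padding)
        mask_real = [1] * len(seq)
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + mask_real)
        elif side == "right":
            input_ids.append(list(seq) + padding)
            attention_mask.append(mask_real + mask_pad)
        else:
            raise ValueError("side must be 'left' or 'right'")
    return input_ids, attention_mask


PAD = 0  # hypothetical pad id for the illustration
prompts = [[11, 12, 13], [21]]
left_ids, left_mask = pad_batch(prompts, PAD, side="left")
right_ids, right_mask = pad_batch(prompts, PAD, side="right")

# Generation continues from the last column of each row.
last_left = [row[-1] for row in left_ids]
last_right = [row[-1] for row in right_ids]
```

With left padding the last column holds a real token for every prompt, so the next-token prediction is conditioned on actual text. With right padding the shorter prompt ends in a pad token.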

Default generation config of the pipeline (text_generation.py L93-L97):

```python
_default_generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=True,       # free-form text generation usually uses sampling
    temperature=0.7,
)
```
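
The effect of do_sample=True with temperature=0.7 can be sketched as temperature-scaled softmax sampling. The functions below are illustrative (temperature_probs and sample_token are hypothetical names, and the logits are made up); they are not the library's sampling code, which applies further logits processors.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature (temperature must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token id from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]                     # hypothetical next-token scores
plain = temperature_probs(logits, 1.0)
sharper = temperature_probs(logits, 0.7)     # the temperature quoted above
token = sample_token(logits, 0.7, random.Random(0))
```

A temperature below 1 sharpens the distribution, so the most likely token is drawn more often than it would be at temperature 1.0 while the output still varies between runs.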

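10. State and Lifecycle Summary

Across its full lifecycle in Transformers, from config creation to inference output, a GPT-2 model goes through the following state transitions: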

State machine diagram

Figure (state diagram): the states are ConfigCreated, ModelInstantiated, WeightsLoaded, Ready (containing EvalMode and TrainMode), Tokenized, Prefilling, Decoding, Generated, PostProcessed, and Training. The transition labels recoverable from the diagram are:

Setup: GPT2Config(); GPT2LMHeadModel(config), which runs post_init() → _init_weights(); from_pretrained('gpt2'), which downloads the weights, loads them, and ties them; model ready, in .eval() mode. Inside Ready, model.train() and model.eval() switch between TrainMode and EvalMode.

Inference: tokenizer(text) (ByteLevel BPE encoding); model.forward(input_ids) (first forward pass, filling the KV Cache); generation loop (token-by-token decoding); next_token → forward → sample (incremental KV Cache update); EOS / max_length (generation finished); tokenizer.decode() (restores the text); output result.

Training: model.train() with labels=input_ids; forward → loss → backward → step (gradient accumulation on the tied weights); model.eval().

Key Lifecycle Stages

| Stage | Key function/class | Source location |
|---|---|---|
| Config creation | `GPT2Config` | configuration_gpt2.py L25 |
| Model instantiation | `GPT2LMHeadModel.__init__` | modeling_gpt2.py L648 |
| Weight initialization | `GPT2PreTrainedModel._init_weights` | modeling_gpt2.py L433 |
| Pretrained loading | `PreTrainedModel.from_pretrained` | modeling_utils.py |
| Weight tying | `_tied_weights_keys` + `tie_weights()` | modeling_gpt2.py L646 |
| Tokenization | `GPT2Tokenizer.__call__` | tokenization_gpt2.py L94 |
| Forward pass | `GPT2Model.forward` | modeling_gpt2.py L522 |
| Attention computation | `GPT2Attention.forward` | modeling_gpt2.py L144 |
| Causal mask | `create_causal_mask` | masking_utils.py L894 |
| KV Cache | `DynamicCache.update` | cache_utils.py L1229 |
| Autoregressive generation | `GenerationMixin.generate` | generation/utils.py L339 |
| Loss computation | `GPT2LMHeadModel.loss_function` | modeling_gpt2.py L711 |
| Pipeline | `TextGenerationPipeline` | text_generation.py L23 |
| Conv1D linear layer | `Conv1D` | pytorch_utils.py L97 |

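Model Family Overview

GPT-2 ships with several task heads in Transformers:

| Class | Task | Head |
|---|---|---|
| `GPT2Model` | Base model (extracts hidden states) | (none) |
| `GPT2LMHeadModel` | Causal language modeling | lm_head (Linear, tied weights) |
| `GPT2DoubleHeadsModel` | Language modeling + multiple choice | lm_head + multiple_choice_head |
| `GPT2ForSequenceClassification` | Sequence classification | score (Linear) |
| `GPT2ForTokenClassification` | Token classification | classifier (Linear) |
| `GPT2ForQuestionAnswering` | Question answering | qa_outputs (Linear → 2) |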
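Summary: GPT-2 is an early and highly influential decoder-only causal language model, and its design shaped the LLMs that followed. Later generations refined it step by step: Conv1D gave way to nn.Linear and absolute position embeddings gave way to RoPE, while the Pre-Norm layout that GPT-2 adopted in place of Post-Norm carried over. Understanding GPT-2's full lifecycle is a solid foundation for understanding modern large language models.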
