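BERT Case Study: Connecting Every Module of the Transformers Framework
This document uses the BERT model as a worked example to connect all the modules of the Transformers framework, showing the full life cycle from loading to inference.
All references point to real source files.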
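1. BERT's Place in Transformers
BERT (Bidirectional Encoder Representations from Transformers) is the canonical encoder-only model. Together with GPT (decoder-only) and T5 (encoder-decoder), it represents one of the three main Transformer architecture families.
BERT's defining feature is bidirectional attention: every token can attend to all other tokens in the sequence at once, rather than only to the context on its left. This makes BERT a natural fit for the following tasks: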
| Task | Model class | Source location |
|---|---|---|
| MLM (masked language modeling) | BertForMaskedLM | modeling_bert.py:913(file:///workspace/src/transformers/models/bert/modeling_bert.py#L913) |
| NSP (next sentence prediction) | BertForNextSentencePrediction | modeling_bert.py:994(file:///workspace/src/transformers/models/bert/modeling_bert.py#L994) |
| Sequence classification | BertForSequenceClassification | modeling_bert.py:1076(file:///workspace/src/transformers/models/bert/modeling_bert.py#L1076) |
| Question answering | BertForQuestionAnswering | modeling_bert.py:1315(file:///workspace/src/transformers/models/bert/modeling_bert.py#L1315) |
| Token classification | BertForTokenClassification | modeling_bert.py:1255(file:///workspace/src/transformers/models/bert/modeling_bert.py#L1255) |
| Multiple choice | BertForMultipleChoice | modeling_bert.py:1157(file:///workspace/src/transformers/models/bert/modeling_bert.py#L1157) |
| Pretraining (MLM + NSP) | BertForPreTraining | modeling_bert.py:731(file:///workspace/src/transformers/models/bert/modeling_bert.py#L731) |
Architecture positioning diagram
[Diagram: the Transformer family, split into three branches]
- Encoder-only: BERT (bidirectional attention, MLM + NSP pretraining), RoBERTa (dynamic masking), ALBERT (parameter sharing), DeBERTa (disentangled attention)
- Decoder-only: GPT series (causal, unidirectional attention; autoregressive generation), LLaMA (RoPE + SwiGLU), Qwen (large-capacity decoder)
- Encoder-decoder: T5 (bidirectional encoder + causal decoder; Seq2Seq), BART (denoising autoencoder), Whisper (speech recognition)
Key difference: at modeling_bert.py:708(file:///workspace/src/transformers/models/bert/modeling_bert.py#L708), BERT calls create_bidirectional_mask to build a bidirectional mask, whereas GPT calls create_causal_mask to build a causal mask.
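To make the difference concrete, here is a minimal pure-Python sketch of the two mask shapes. The helper names `bidirectional_mask` and `causal_mask` are hypothetical; they are not the library's `create_bidirectional_mask` or `create_causal_mask`, and they only illustrate which query-key pairs are allowed, assuming a padding mask made of 1s (real tokens) and 0s (padding).

```python
def bidirectional_mask(padding_mask):
    """Allow every query position to attend to every non-padded key position."""
    n = len(padding_mask)
    return [[bool(padding_mask[k]) for k in range(n)] for _ in range(n)]


def causal_mask(padding_mask):
    """Allow a query at position q to attend only to non-padded keys at k <= q."""
    n = len(padding_mask)
    return [[k <= q and bool(padding_mask[k]) for k in range(n)] for q in range(n)]


padding = [1, 1, 1, 0]  # three real tokens followed by one padding token
bi = bidirectional_mask(padding)
ca = causal_mask(padding)

# Row 0 is the first token: it sees all real tokens under the bidirectional
# mask, but only itself under the causal mask.
print(bi[0])  # [True, True, True, False]
print(ca[0])  # [True, False, False, False]
```

In both cases the padding position is never attended to; the only difference is whether positions to the right of the query are visible.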
2. The Config Definition, End to End
BertConfig as a @strict dataclass
BERT's configuration class is defined in configuration_bert.py(file:///workspace/src/transformers/models/bert/configuration_bert.py) and is decorated with both @strict and @auto_docstring:
```python
# configuration_bert.py:17-63
from huggingface_hub.dataclasses import strict

from ...configuration_utils import PreTrainedConfig
from ...utils import auto_docstring


@auto_docstring(checkpoint="google-bert/bert-base-uncased")
@strict
class BertConfig(PreTrainedConfig):
    model_type = "bert"  # registers the model-type identifier

    vocab_size: int = 30522
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    intermediate_size: int = 3072
    hidden_act: str = "gelu"
    hidden_dropout_prob: float | int = 0.1
    attention_probs_dropout_prob: float | int = 0.1
    max_position_embeddings: int = 512
    type_vocab_size: int = 2
    initializer_range: float = 0.02
    layer_norm_eps: float = 1e-12
    pad_token_id: int | None = 0
    use_cache: bool = True
    classifier_dropout: float | int | None = None
    is_decoder: bool = False
    add_cross_attention: bool = False
    tie_word_embeddings: bool = True
```
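Key design points:
- The @strict decorator (from huggingface_hub.dataclasses) enforces type checking, so configuration parameters must match their declared types and invalid values are rejected.
- model_type = "bert" is the core identifier for AutoConfig routing. The AutoConfig.register method at configuration_auto.py:424(file:///workspace/src/transformers/models/auto/configuration_auto.py#L424) uses it to register the mapping.
- attribute_map is inherited from PreTrainedConfig (configuration_utils.py:219(file:///workspace/src/transformers/configuration_utils.py#L219)). It provides attribute aliases, implemented transparently by intercepting __getattribute__ and __setattr__.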
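The alias mechanism in the last point can be sketched in a few lines of plain Python. This is an illustration of the interception pattern only, not the code in configuration_utils.py: the class `AliasedConfig` and the alias `n_layer` are hypothetical names chosen for the example.

```python
class AliasedConfig:
    """Toy config that redirects alias names to canonical attribute names."""

    # Hypothetical alias table: "n_layer" is another name for "num_hidden_layers".
    attribute_map = {"n_layer": "num_hidden_layers"}

    def __init__(self, num_hidden_layers=12):
        self.num_hidden_layers = num_hidden_layers

    def __setattr__(self, key, value):
        # Rewrite the alias to its canonical name before storing the value.
        key = type(self).attribute_map.get(key, key)
        super().__setattr__(key, value)

    def __getattribute__(self, key):
        # Read the alias table from the class so this lookup does not recurse.
        key = type(self).attribute_map.get(key, key)
        return super().__getattribute__(key)


config = AliasedConfig()
config.n_layer = 6  # written through the alias
print(config.num_hidden_layers)  # 6
print(config.n_layer)  # 6
```

Only the canonical name is ever stored on the instance, so serializing the object cannot produce both spellings of the same field.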
Config class diagram
[Class diagram: AutoConfig looks up "bert" -> BertConfig through CONFIG_MAPPING; BertConfig inherits from PreTrainedConfig]
- PreTrainedConfig: attributes model_type: str, attribute_map: dict, vocab_size: int, hidden_size: int, is_encoder_decoder: bool; methods from_pretrained(), from_dict(), to_dict(), save_pretrained(), __getattribute__(key), __setattr__(key, value)
- BertConfig: model_type = "bert", plus the fields and default values listed in the class definition above (vocab_size = 30522 through tie_word_embeddings = True)
- AutoConfig: from_pretrained(), for_model(), register()
- CONFIG_MAPPING: register(model_type, config), __getitem__(model_type)
Config serialization flow diagram
[Flowchart with three paths]
- Saving: to_dict() → Python dict {vocab_size: 30522, ...} → json.dumps() → JSON string → save to disk → config.json
- Loading: config.json file → json.loads() → Python dict → BertConfig.from_dict() → BertConfig instance
- Auto-routing: remote Hub (google-bert/bert-base-uncased) → download config.json → AutoConfig.from_pretrained() → read model_type → CONFIG_MAPPING lookup (model_type='bert') → locate BertConfig → from_dict() → BertConfig()
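Key serialization and deserialization paths:
- save_pretrained() → to_dict() → JSON file
- from_pretrained() → download or read the JSON → from_dict() → BertConfig instance
- AutoConfig.from_pretrained() uses the model_type field to look up the matching Config class in CONFIG_MAPPING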
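The three paths above can be sketched with the standard library alone. `ToyBertConfig`, `TOY_CONFIG_MAPPING`, `save_config`, and `auto_load_config` are hypothetical stand-ins for BertConfig, CONFIG_MAPPING, save_pretrained(), and AutoConfig.from_pretrained(); the sketch assumes the JSON holds only constructor fields and skips downloading and file I/O.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToyBertConfig:
    """Stand-in for BertConfig with just three fields."""

    model_type: str = "bert"
    vocab_size: int = 30522
    hidden_size: int = 768


# Hypothetical registry playing the role of CONFIG_MAPPING.
TOY_CONFIG_MAPPING = {"bert": ToyBertConfig}


def save_config(config):
    """Serialize: config object -> dict -> JSON string."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)


def auto_load_config(json_text):
    """Deserialize: JSON string -> dict -> class chosen by model_type -> instance."""
    data = json.loads(json_text)
    config_class = TOY_CONFIG_MAPPING[data["model_type"]]
    return config_class(**data)


original = ToyBertConfig(hidden_size=1024)
restored = auto_load_config(save_config(original))
print(type(restored).__name__, restored == original)  # ToyBertConfig True
```

The point of the registry is that the caller never names the class: the `model_type` string stored in the JSON is enough to pick it.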
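3. The Complete from_pretrained Sequence
When a user calls BertModel.from_pretrained('bert-base-uncased'), the framework goes through config loading, skeleton construction, weight loading and conversion, optional quantization, device placement, and weight tying before the pretrained model is ready.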
Sequence diagram
[Sequence diagram. Participants: user code, AutoModel, BertConfig, BertModel(BertPreTrainedModel), meta-device initialization, WeightConverter, quantizer, device placement, weight tying. Messages in order:]
- from_pretrained("bert-base-uncased")
- BertConfig(vocab_size=30522, ...)
- Check quantization_config; result: no quantization config
- _from_config(config), then __init__(config); note: a shell model (BertEmbeddings + BertEncoder + BertPooler) is initialized on the meta device
- post_init() → _init_weights()
- Download or load the weight file; note: model.safetensors or pytorch_model.bin
- Convert legacy key names (e.g. "gamma" → "weight")
- load_state_dict()
- Alternative branch, when a quantization config exists: AutoQuantizationConfig.from_pretrained(), then quantize the weights
- Place the model on the target device (cuda/cpu)
- tie_weights(); note: cls.predictions.decoder.weight ← bert.embeddings.word_embeddings.weight
- Return the ready BertModel
The Code Behind Each Step
Step 1: Config loading
- Entry point: PreTrainedModel.from_pretrained() at modeling_utils.py:3789(file:///workspace/src/transformers/modeling_utils.py#L3789)
- Config loading: the BertConfig is obtained first, through AutoConfig.from_pretrained()
Step 2: Meta-device initialization
- The framework builds the model skeleton on torch.device("meta"), which allocates no real memory
- It calls BertModel.__init__(config) (modeling_bert.py:601(file:///workspace/src/transformers/models/bert/modeling_bert.py#L601))
```python
# modeling_bert.py:601-616
def __init__(self, config, add_pooling_layer=True):
    super().__init__(config)
    self.config = config
    self.embeddings = BertEmbeddings(config)  # word / position / token-type embeddings
    self.encoder = BertEncoder(config)  # Transformer layers (12 by default)
    self.pooler = BertPooler(config) if add_pooling_layer else None  # pooling layer
    self.post_init()  # weight initialization + weight tying
```
Step 3: Weight loading and conversion
- WeightConverter handles the mapping of legacy key names (e.g. gamma → weight, beta → bias)
- The state_dict is loaded from a safetensors or bin file
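The key renaming in this step amounts to rewriting dictionary keys before the state_dict is applied. The following sketch shows the idea on a plain dict; `rename_legacy_keys` and its suffix table are hypothetical and do not reproduce the WeightConverter API.

```python
def rename_legacy_keys(state_dict):
    """Return a copy of state_dict with legacy LayerNorm suffixes renamed."""
    # Hypothetical suffix table covering the two renames mentioned above.
    suffix_renames = {".gamma": ".weight", ".beta": ".bias"}
    renamed = {}
    for key, tensor in state_dict.items():
        for old, new in suffix_renames.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break
        renamed[key] = tensor
    return renamed


# Plain lists stand in for tensors; only the key names matter here.
legacy = {
    "bert.embeddings.LayerNorm.gamma": [1.0, 1.0],
    "bert.embeddings.LayerNorm.beta": [0.0, 0.0],
    "bert.embeddings.word_embeddings.weight": [0.1, 0.2],
}
for name in rename_legacy_keys(legacy):
    print(name)
```

Keys that already use the current naming pass through unchanged, so the same function is safe to run on both old and new checkpoints.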
Step 4: Weight tying
BertForPreTraining defines the tying relationships (modeling_bert.py:732-735(file:///workspace/src/transformers/models/bert/modeling_bert.py#L732)):
```python
# modeling_bert.py:732-735
_tied_weights_keys = {
    "cls.predictions.decoder.weight": "bert.embeddings.word_embeddings.weight",
    "cls.predictions.decoder.bias": "cls.predictions.bias",
}
```
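This means the MLM head's output projection shares its weight matrix with the input word embeddings, so one vocab_size × hidden_size matrix is stored instead of two.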
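A small sketch shows both what tying means and how much it saves. `ToyLayer` and `tie` are hypothetical helpers using nested lists instead of tensors; the final figure assumes the BertConfig defaults quoted earlier (vocab_size = 30522, hidden_size = 768).

```python
class ToyLayer:
    """Holds a weight matrix as a nested list; stands in for an embedding or linear layer."""

    def __init__(self, rows, cols):
        self.weight = [[0.0] * cols for _ in range(rows)]


def tie(decoder, embedding):
    """Make the decoder reuse the embedding's weight object instead of its own copy."""
    decoder.weight = embedding.weight


word_embeddings = ToyLayer(rows=5, cols=3)  # toy vocabulary of 5, hidden size 3
decoder = ToyLayer(rows=5, cols=3)
tie(decoder, word_embeddings)

# After tying, an update made through one name is visible through the other.
word_embeddings.weight[2][0] = 0.5
print(decoder.weight[2][0])  # 0.5
print(decoder.weight is word_embeddings.weight)  # True

# Parameters saved at the default sizes: one vocab_size x hidden_size matrix.
vocab_size, hidden_size = 30522, 768
print(vocab_size * hidden_size)  # 23440896
```

At the default sizes, tying therefore avoids storing roughly 23.4 million duplicate parameters.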
4. The Tokenizer Encoding Flow
BertTokenizer is built on the WordPiece tokenization algorithm and is defined in tokenization_bert.py(file:///workspace/src/transformers/models/bert/tokenization_bert.py).
Core architecture
```python
# tokenization_bert.py:41-77
class BertTokenizer(TokenizersBackend):
    vocab_files_names = VOCAB_FILES_NAMES  # {"vocab_file": "vocab.txt", "tokenizer_file": "tokenizer.json"}
    model_input_names = ["input_ids", "token_type_ids", "attention_mask"]
    model = WordPiece
```
BertTokenizer inherits from TokenizersBackend and delegates the actual tokenization to Hugging Face's tokenizers library for performance.
Initialization flow
```python
# tokenization_bert.py:79-135 (excerpt; remaining parameters and setup elided)
def __init__(self, vocab=None, do_lower_case=True, unk_token="[UNK]", ...):
    self._tokenizer = Tokenizer(WordPiece(self._vocab, unk_token=str(unk_token)))
    # Text normalization
    self._tokenizer.normalizer = normalizers.BertNormalizer(
        clean_text=True, handle_chinese_chars=tokenize_chinese_chars,
        strip_accents=strip_accents, lowercase=do_lower_case,
    )
    self._tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()  # pre-tokenization
    self._tokenizer.decoder = decoders.WordPiece(prefix="##")  # decoder
    # Post-processor: adds [CLS] and [SEP]
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single="[CLS]:0 $A:0 [SEP]:0",
        pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
        special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
    )
```
编码流程图
#mermaid-svg-DfssETW5nqIwy6jD{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-DfssETW5nqIwy6jD .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-DfssETW5nqIwy6jD .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-DfssETW5nqIwy6jD .error-icon{fill:#552222;}#mermaid-svg-DfssETW5nqIwy6jD .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-DfssETW5nqIwy6jD .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-DfssETW5nqIwy6jD .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-DfssETW5nqIwy6jD .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-DfssETW5nqIwy6jD .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-DfssETW5nqIwy6jD .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-DfssETW5nqIwy6jD .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-DfssETW5nqIwy6jD .marker{fill:#333333;stroke:#333333;}#mermaid-svg-DfssETW5nqIwy6jD .marker.cross{stroke:#333333;}#mermaid-svg-DfssETW5nqIwy6jD svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-DfssETW5nqIwy6jD p{margin:0;}#mermaid-svg-DfssETW5nqIwy6jD .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-DfssETW5nqIwy6jD .cluster-label text{fill:#333;}#mermaid-svg-DfssETW5nqIwy6jD .cluster-label span{color:#333;}#mermaid-svg-DfssETW5nqIwy6jD .cluster-label span p{background-color:transparent;}#mermaid-svg-DfssETW5nqIwy6jD .label text,#mermaid-svg-DfssETW5nqIwy6jD span{fill:#333;color:#333;}#mermaid-svg-DfssETW5nqIwy6jD .node rect,#mermaid-svg-DfssETW5nqIwy6jD .node circle,#mermaid-svg-DfssETW5nqIwy6jD .node ellipse,#mermaid-svg-DfssETW5nqIwy6jD .node polygon,#mermaid-svg-DfssETW5nqIwy6jD .node 
Tokenization pipeline (diagram summarized as text), starting from the raw text 'Hello, my dog is cute':
- BertNormalizer: lowercasing, cleanup, and Chinese-character handling
- BertPreTokenizer: pre-tokenization on whitespace and punctuation
- WordPiece tokenization: subword splitting ('hello' → 'hello', 'cute' → 'cute')
- TemplateProcessing: adds the special tokens, giving [CLS] hello , my dog is cute [SEP]
- The three outputs are then produced:
  - input_ids: 101, 7592, 1010, 2026, ... (each token's index in the vocabulary)
  - attention_mask: 1, 1, 1, 1, ... (1 = real token, 0 = padding)
  - token_type_ids: 0, 0, 0, 0, ... (0 = sentence A, 1 = sentence B)
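The WordPiece step can be illustrated with a minimal sketch of greedy longest-match-first splitting. This is a simplified stand-in, not the tokenizer's actual implementation: the function name `wordpiece_tokenize` and the tiny vocabulary are invented for the example, and the `##` continuation prefix is assumed to follow the usual BERT convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first subword split of one pre-tokenized word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # nothing matches: the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example (not the bert-base-uncased vocab).
toy_vocab = {"hello", "cute", "dog", "play", "##ing", "##s"}

print(wordpiece_tokenize("hello", toy_vocab))    # ['hello']
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Words that are already in the vocabulary, such as 'hello' and 'cute' in the diagram, come back unchanged; only out-of-vocabulary words are split.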
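Special token management
| Token | Purpose | Default |
|---|---|---|
| [CLS] | Start-of-sequence marker, used for classification | cls_token_id = 2 |
| [SEP] | Sentence separator | sep_token_id = 3 |
| [PAD] | Padding marker | pad_token_id = 0 |
| [UNK] | Unknown-word marker | unk_token_id = 1 |
| [MASK] | Mask marker (MLM training) | mask_token_id = 4 |
A pretrained checkpoint's vocabulary can assign different IDs. In bert-base-uncased, for example, [CLS] is 101 and [SEP] is 102, which is why the input_ids shown above begin with 101.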
Sentence-pair encoding
When two sentences are passed in, the pair template of TemplateProcessing takes effect:
[CLS] sentence A [SEP] sentence B [SEP]
  0       0        0       1        1     ← token_type_ids
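The pair template can be sketched in a few lines. The helper names `build_pair_inputs` and `pad_inputs` are hypothetical and work on token strings rather than IDs; the sketch only mirrors the layout shown above and assumes `max_len` is at least the sequence length.

```python
def build_pair_inputs(tokens_a, tokens_b=None, cls="[CLS]", sep="[SEP]"):
    """Lay out [CLS] A [SEP] (B [SEP]) and the matching token_type_ids."""
    tokens = [cls] + list(tokens_a) + [sep]
    token_type_ids = [0] * len(tokens)          # segment A, including [CLS] and first [SEP]
    if tokens_b is not None:
        segment_b = list(tokens_b) + [sep]
        tokens += segment_b
        token_type_ids += [1] * len(segment_b)  # segment B, including the final [SEP]
    return tokens, token_type_ids


def pad_inputs(tokens, token_type_ids, max_len, pad="[PAD]"):
    """Right-pad to max_len and build the attention mask (1 = real token, 0 = padding)."""
    n_pad = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return tokens + [pad] * n_pad, token_type_ids + [0] * n_pad, attention_mask


tokens, type_ids = build_pair_inputs(["my", "dog"], ["is", "cute"])
print(tokens)    # ['[CLS]', 'my', 'dog', '[SEP]', 'is', 'cute', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1]

padded_tokens, padded_types, attention_mask = pad_inputs(tokens, type_ids, max_len=9)
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0]
```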
5. The full forward-pass path
The complete data flow from input_ids to the final outputs.
Data flow diagram
(Diagram summarized as text.)
- Inputs: input_ids (batch, seq_len) feeds BertEmbeddings; attention_mask (batch, seq_len) goes through _create_attention_masks → create_bidirectional_mask.
- BertEmbeddings: word_embeddings nn.Embedding(30522, 768), position_embeddings nn.Embedding(512, 768), and token_type_embeddings nn.Embedding(2, 768) are summed, followed by LayerNorm(768) and Dropout(0.1).
- BertEncoder: 12 × BertLayer (BertLayer 0, BertLayer 1, ..., BertLayer 11), producing sequence_output (batch, seq_len, 768).
- BertPooler: takes the [CLS] token, then Dense + Tanh, producing pooler_output (batch, 768).
- TaskHead: BertOnlyMLMHead (MLM prediction), BertOnlyNSPHead (NSP prediction), Classifier (sequence classification), QA Outputs (question answering), Classifier (token classification).
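The BertEmbeddings stage (three lookups, a sum, then LayerNorm) can be sketched in plain Python. This is an illustration under simplifying assumptions, not the module's code: the tables are tiny random lists, LayerNorm has no learned scale or bias, dropout is omitted, and the names `embed`, `make_table`, and `layer_norm` are invented here.

```python
import math
import random


def layer_norm(vector, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def make_table(rows, dim, rng):
    """Create a toy embedding table filled with small random values."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(input_ids, token_type_ids, word_emb, pos_emb, type_emb):
    """Sum word, position, and token-type embeddings per position, then normalize."""
    hidden_states = []
    for position, (token_id, segment_id) in enumerate(zip(input_ids, token_type_ids)):
        summed = [
            w + p + t
            for w, p, t in zip(word_emb[token_id], pos_emb[position], type_emb[segment_id])
        ]
        hidden_states.append(layer_norm(summed))
    return hidden_states


rng = random.Random(0)
VOCAB, MAX_POS, TYPES, HIDDEN = 10, 8, 2, 4  # toy sizes; the diagram lists 30522, 512, 2, 768
word_emb = make_table(VOCAB, HIDDEN, rng)
pos_emb = make_table(MAX_POS, HIDDEN, rng)
type_emb = make_table(TYPES, HIDDEN, rng)

hidden = embed([1, 5, 3, 2], [0, 0, 1, 1], word_emb, pos_emb, type_emb)
print(len(hidden), len(hidden[0]))  # 4 4  (seq_len, hidden)
```

The position index comes from the loop counter rather than from the input, which matches the idea that position embeddings depend only on where a token sits in the sequence.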
Internal structure of a single BertLayer
(Diagram summarized as text.)
- Input: hidden_states (batch, seq_len, 768)
- BertAttention
  - BertSelfAttention: query = Linear(768, 768), key = Linear(768, 768), value = Linear(768, 768); scaled dot-product attention (Q×K^T × V) with attention_mask (the bidirectional mask) applied
  - BertSelfOutput: dense = Linear(768, 768), dropout(0.1), LayerNorm + residual connection
- Feed-Forward Network (FFN)
  - BertIntermediate: dense = Linear(768, 3072), GELU activation
  - BertOutput: dense = Linear(3072, 768), dropout(0.1), LayerNorm + residual connection
- Output: layer_output (batch, seq_len, 768)
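The layer shapes listed in the two diagrams are enough for a back-of-the-envelope parameter count. The sketch below assumes every Linear layer has a bias, every LayerNorm has a weight and a bias, and the task heads are excluded; the helper names are invented for the example.

```python
def linear_params(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale plus shift of a LayerNorm."""
    return 2 * dim


hidden, intermediate = 768, 3072
vocab, max_pos, type_vocab, layers = 30522, 512, 2, 12

# Three embedding tables plus the embedding LayerNorm.
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm_params(hidden)

# query, key, value, and the BertSelfOutput dense layer, plus one LayerNorm.
attention = 4 * linear_params(hidden, hidden) + layer_norm_params(hidden)

# BertIntermediate and BertOutput dense layers, plus one LayerNorm.
ffn = (
    linear_params(hidden, intermediate)
    + linear_params(intermediate, hidden)
    + layer_norm_params(hidden)
)

per_layer = attention + ffn
pooler = linear_params(hidden, hidden)
total = embeddings + layers * per_layer + pooler

print(f"embeddings: {embeddings:,}")  # embeddings: 23,837,184
print(f"per layer:  {per_layer:,}")   # per layer:  7,087,872
print(f"total:      {total:,}")       # total:      109,482,240
```

Under these assumptions the encoder stack accounts for roughly 78% of the parameters and the embedding tables for most of the rest.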
Corresponding key code
BertSelfAttention ([modeling_bert.py:143-207](file:///workspace/src/transformers/models/bert/modeling_bert.py#L143)):
```python
# modeling_bert.py:168-207 (abridged)
def forward(self, hidden_states, attention_mask=None, past_key_values=None, **kwargs):
    # Project to Q/K/V and reshape into multi-head form.
    # hidden_shape and input_shape are defined earlier in the method (elided here);
    # input_shape is the leading (batch, seq_len) part of hidden_states.shape.
    query_layer = self.query(hidden_states).view(*hidden_shape).transpose(1, 2)
    key_layer = self.key(hidden_states).view(*hidden_shape).transpose(1, 2)
    value_layer = self.value(hidden_states).view(*hidden_shape).transpose(1, 2)

    # Dispatch to the concrete implementation through ALL_ATTENTION_FUNCTIONS.
    attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
        self.config._attn_implementation, eager_attention_forward
    )
    attn_output, attn_weights = attention_interface(
        self, query_layer, key_layer, value_layer,
        attention_mask, dropout=..., scaling=self.scaling, **kwargs,
    )

    # Merge the heads back into a single hidden dimension.
    attn_output = attn_output.reshape(*input_shape, -1).contiguous()
    return attn_output, attn_weights
```
BertLayer ([modeling_bert.py:358-420](file:///workspace/src/transformers/models/bert/modeling_bert.py#L358)):
```python
# modeling_bert.py:378-420 (abridged)
def forward(self, hidden_states, attention_mask=None, ...):
    self_attention_output, _ = self.attention(hidden_states, attention_mask, ...)
    attention_output = self_attention_output

    # If this is a decoder and encoder outputs are available, run cross-attention.
    if self.is_decoder and encoder_hidden_states is not None:
        cross_attention_output, _ = self.crossattention(...)
        attention_output = cross_attention_output

    # FFN (supports chunked processing to save memory).
    layer_output = apply_chunking_to_forward(
        self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
    )
    return layer_output
```
Bidirectional attention mask diagram
Mask visualization (⬜ = may attend, ⬛ = masked):
Bidirectional mask (BERT)
⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬛⬛
⬛⬛⬛⬛⬛⬛
⬛⬛⬛⬛⬛⬛
Causal mask (GPT)
⬜⬛⬛⬛⬛⬛
⬜⬜⬛⬛⬛⬛
⬜⬜⬜⬛⬛⬛
⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬜⬛
⬜⬜⬜⬜⬜⬜
How the mask is built (diagram summarized as text):
- attention_mask (2D), for example the rows [1,1,1,1,0,0] and [1,1,1,1,1,0], is passed to create_bidirectional_mask.
- create_bidirectional_mask combines two mask functions with and_masks(PM, BMF): padding_mask_function (PM) handles the padding positions, and bidirectional_mask_function (BMF) makes all tokens visible to each other (q_idx >= 0 → True).
- The result is attention_mask (4D) with shape (batch, 1, seq_len, seq_len), in which padding positions are -inf.
BERT's _create_attention_masks method ([modeling_bert.py:692-722](file:///workspace/src/transformers/models/bert/modeling_bert.py#L692)) picks the mask type from the is_decoder flag:
```python
# modeling_bert.py:700-712
if self.config.is_decoder:
    attention_mask = create_causal_mask(...)  # causal mask
else:
    attention_mask = create_bidirectional_mask(...)  # bidirectional mask (BERT default)
```
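The way mask functions compose can be shown with a toy re-implementation on nested lists. The function names mirror those mentioned in this section, but the bodies are written for illustration only and are not the library code. In particular, this sketch assumes padding is applied along the key axis only, so rows that belong to padded queries are not blanked out.

```python
NEG_INF = float("-inf")


def bidirectional_mask_function(batch_idx, head_idx, q_idx, kv_idx):
    """Every query may attend to every key."""
    return q_idx >= 0


def causal_mask_function(batch_idx, head_idx, q_idx, kv_idx):
    """A query may attend only to itself and earlier keys."""
    return kv_idx <= q_idx


def padding_mask_function(padding_mask):
    """Block keys whose padding-mask entry is 0."""
    def inner(batch_idx, head_idx, q_idx, kv_idx):
        return bool(padding_mask[batch_idx][kv_idx])
    return inner


def and_masks(*mask_functions):
    """A position is visible only if every mask function allows it."""
    def combined(batch_idx, head_idx, q_idx, kv_idx):
        return all(fn(batch_idx, head_idx, q_idx, kv_idx) for fn in mask_functions)
    return combined


def build_additive_mask(padding_mask, base_function):
    """Return nested lists shaped (batch, 1, seq_len, seq_len): 0.0 = attend, -inf = blocked."""
    mask_fn = and_masks(padding_mask_function(padding_mask), base_function)
    batch, seq_len = len(padding_mask), len(padding_mask[0])
    return [
        [[[0.0 if mask_fn(b, 0, q, kv) else NEG_INF for kv in range(seq_len)]
          for q in range(seq_len)]]
        for b in range(batch)
    ]


padding = [[1, 1, 1, 0]]  # one sequence, last position is padding
bidirectional = build_additive_mask(padding, bidirectional_mask_function)
causal = build_additive_mask(padding, causal_mask_function)

print(bidirectional[0][0][0])  # [0.0, 0.0, 0.0, -inf]
print(causal[0][0][0])         # [0.0, -inf, -inf, -inf]
```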
6. How the attention system works
BERT bidirectional attention vs GPT causal attention
| Property | BERT (bidirectional) | GPT (causal) |
|---|---|---|
| Mask function | bidirectional_mask_function | causal_mask_function |
| Mask logic | q_idx >= 0 (everything visible) | kv_idx <= q_idx (left context only) |
| Factory function | create_bidirectional_mask | create_causal_mask |
| Source location | [masking_utils.py:80](file:///workspace/src/transformers/masking_utils.py#L80) | [masking_utils.py:73](file:///workspace/src/transformers/masking_utils.py#L73) |
| Typical use | Understanding tasks | Generation tasks |
The ALL_ATTENTION_FUNCTIONS dispatch mechanism
ALL_ATTENTION_FUNCTIONS is a global registry of attention interfaces, defined at [modeling_utils.py:5070](file:///workspace/src/transformers/models/bert/.../.../modeling_utils.py#L5070):
```python
ALL_ATTENTION_FUNCTIONS: AttentionInterface = AttentionInterface()
```
Its class, AttentionInterface, inherits from GeneralInterface ([utils/generic.py:1054](file:///workspace/src/transformers/utils/generic.py#L1054)), which supports a global mapping together with local overrides.
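The "global mapping plus local overrides" idea can be sketched with a small stand-alone class. `ToyInterface` and its methods are hypothetical names written for this illustration; the real GeneralInterface may differ in its details, and the fallback behavior of `get_interface` here (return the default when the name is not registered) is an assumption.

```python
class ToyInterface:
    """Registry with one class-wide mapping and per-instance overrides."""

    _global_mapping = {}  # shared by every instance

    def __init__(self):
        self.local_mapping = {}  # overrides that affect this instance only

    @classmethod
    def register(cls, key, value):
        """Add or replace an entry for all instances."""
        cls._global_mapping[key] = value

    def __setitem__(self, key, value):
        """Override an entry on this instance without touching the global mapping."""
        self.local_mapping[key] = value

    def __getitem__(self, key):
        """Local overrides win; otherwise fall back to the global mapping."""
        if key in self.local_mapping:
            return self.local_mapping[key]
        return self._global_mapping[key]

    def get_interface(self, name, default):
        """Return the registered function, or the default if the name is unknown."""
        try:
            return self[name]
        except KeyError:
            return default


def eager_backend():
    return "eager"


def patched_backend():
    return "patched"


ToyInterface.register("eager", eager_backend)

shared = ToyInterface()
customized = ToyInterface()
customized["eager"] = patched_backend  # local override

print(shared["eager"]())                                  # eager
print(customized["eager"]())                              # patched
print(shared.get_interface("unknown", eager_backend)())   # eager
```

Keeping the override on the instance means one model can swap in an experimental backend without changing what every other model resolves.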
Attention dispatch flow
(Diagram summarized as text.)
- BertSelfAttention.forward() calls ALL_ATTENTION_FUNCTIONS.get_interface() with config._attn_implementation.
- The value of attn_implementation selects the backend:
  - eager: eager_attention_forward(), the standard PyTorch implementation (Q×K^T → softmax → ×V)
  - sdpa: sdpa_attention_forward(), built on torch.nn.functional.scaled_dot_product_attention, which automatically picks the Flash, memory-efficient, or math kernel
  - flash_attention_2: flash_attention_2_forward(), the Flash Attention 2 kernel with IO-aware optimization
  - flex_attention: flex_attention_forward(), PyTorch Flex Attention with custom mask functions
- Each backend returns attn_output, attn_weights; the output is reshaped and made contiguous, then handed to BertSelfOutput.
The core of eager_attention_forward ([modeling_bert.py:115-140](file:///workspace/src/transformers/models/bert/modeling_bert.py#L115)):
```python
# modeling_bert.py:115-140
def eager_attention_forward(module, query, key, value, attention_mask, scaling=None, dropout=0.0, **kwargs):
    if scaling is None:
        scaling = query.size(-1) ** -0.5

    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling  # QK^T / sqrt(d)
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask  # additive mask: -inf entries are blocked

    attn_weights = nn.functional.softmax(attn_weights, dim=-1)  # normalize over keys
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)

    attn_output = torch.matmul(attn_weights, value)  # weighted sum of values
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights
```
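To see why adding -inf to a score removes a position, the same arithmetic can be run by hand for a single head. The sketch below is pure Python on nested lists, skips dropout and the batch and head dimensions, and assumes every query has at least one unmasked key (a row that is entirely -inf would produce NaN).

```python
import math


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(query, key, value, additive_mask=None):
    """softmax(Q K^T * scaling + mask) V for one head, one sequence."""
    scaling = len(query[0]) ** -0.5
    outputs, all_weights = [], []
    for q_idx, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) * scaling for k in key]
        if additive_mask is not None:
            scores = [s + m for s, m in zip(scores, additive_mask[q_idx])]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, value)) for d in range(len(value[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights


NEG_INF = float("-inf")
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[0.0, 0.0, NEG_INF]] * 3  # third key is padding for every query

masked_out, masked_weights = single_head_attention(q, k, v, mask)
plain_out, plain_weights = single_head_attention(q, k, v)

print([w[2] for w in masked_weights])  # [0.0, 0.0, 0.0]
```

With the mask applied, the third value vector contributes nothing to any output row, while the remaining weights in each row still sum to 1.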
BertPreTrainedModel declares which attention implementations it supports ([modeling_bert.py:536-548](file:///workspace/src/transformers/models/bert/modeling_bert.py#L536)):
```python
# modeling_bert.py:536-548
class BertPreTrainedModel(PreTrainedModel):
    _supports_flash_attn = True
    _supports_sdpa = True
    _supports_flex_attn = True
    _supports_attention_backend = True
```
7. The role of the cache system in BERT
BERT does not need a KV cache
As an Encoder-only model, BERT runs inference non-autoregressively: it processes the whole sequence in one pass instead of generating token by token. By default, therefore, BERT does not use a KV cache.
This is visible in [modeling_bert.py:643-646](file:///workspace/src/transformers/models/bert/modeling_bert.py#L643):
```python
# modeling_bert.py:643-646
if self.config.is_decoder:
    use_cache = use_cache if use_cache is not None else self.config.use_cache
else:
    use_cache = False  # in encoder mode the cache is always off
```
The EncoderDecoderCache scenario
When BERT is configured as a decoder (is_decoder=True together with add_cross_attention=True), as with BertLMHeadModel, it can take part in a Seq2Seq architecture. In that case an EncoderDecoderCache is created when encoder hidden states are passed in or the config is encoder-decoder; otherwise a plain DynamicCache is used:
```python
# modeling_bert.py:648-653
if use_cache and past_key_values is None:
    past_key_values = (
        EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache(config=self.config))
        if encoder_hidden_states is not None or self.config.is_encoder_decoder
        else DynamicCache(config=self.config)
    )
```
EncoderDecoderCache ([cache_utils.py:1479](file:///workspace/src/transformers/cache_utils.py#L1479)) holds two independent caches:
- self_attention_cache: the KV cache for self-attention
- cross_attention_cache: the KV cache for cross-attention
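The two branches quoted in this section boil down to a small decision procedure, which the following sketch restates as plain functions. The function names and the string labels are invented for the example; only the branching conditions are taken from the excerpts above.

```python
def resolve_use_cache(is_decoder, use_cache, config_use_cache):
    """Encoder mode always disables the cache; decoder mode falls back to the config."""
    if is_decoder:
        return use_cache if use_cache is not None else config_use_cache
    return False


def pick_cache_kind(use_cache, past_provided, has_encoder_hidden_states, is_encoder_decoder):
    """Name the cache a forward pass would end up with (labels are illustrative)."""
    if not use_cache:
        return None                   # nothing is created
    if past_provided:
        return "caller-provided"      # an existing cache is kept as-is
    if has_encoder_hidden_states or is_encoder_decoder:
        return "EncoderDecoderCache"  # self-attention cache plus cross-attention cache
    return "DynamicCache"             # self-attention cache only


# Plain BERT encoder: the cache is off even if the caller asks for it.
encoder_flag = resolve_use_cache(is_decoder=False, use_cache=True, config_use_cache=True)
print(encoder_flag, pick_cache_kind(encoder_flag, False, False, False))  # False None

# BERT as a decoder with encoder outputs available.
decoder_flag = resolve_use_cache(is_decoder=True, use_cache=None, config_use_cache=True)
print(decoder_flag, pick_cache_kind(decoder_flag, False, True, False))   # True EncoderDecoderCache
```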
Cache comparison diagram
(Diagram summarized as text.)
- BERT Encoder (default): input sequence → single forward pass → no cache → output
- GPT Decoder: Token 1 → KV cache stored → Token 2 → KV cache updated → Token 3 → KV cache updated → ...
- BERT as Decoder (Seq2Seq): the encoder output fills cross_attention_cache (stored once, then unchanged); Decode Step 1 and Decode Step 2 extend self_attention_cache (grows step by step)
Key differences:
| Property | BERT Encoder | GPT Decoder | BERT as Decoder |
|---|---|---|---|
| KV cache | Not used | DynamicCache | EncoderDecoderCache |
| Inference mode | Single pass | Autoregressive | Autoregressive |
| Cross-attention | None | None | Yes (caches the encoder output) |
| use_cache | False | True | True |
8. 训练流程
BertForPreTraining 的 MLM + NSP 损失
BERT 的预训练包含两个任务,定义在 modeling_bert.py:731-820(file:///workspace/src/transformers/models/bert/modeling_bert.py#L731):
```python
# modeling_bert.py:731-820 (abridged excerpt)
class BertForPreTraining(BertPreTrainedModel):
    _tied_weights_keys = {
        "cls.predictions.decoder.weight": "bert.embeddings.word_embeddings.weight",
        "cls.predictions.decoder.bias": "cls.predictions.bias",
    }

    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.cls = BertPreTrainingHeads(config)  # MLM head + NSP head
        self.post_init()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None,
                labels=None, next_sentence_label=None, **kwargs):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            # ... remaining arguments elided in this excerpt
        )
        sequence_output, pooled_output = outputs[:2]
        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)

        total_loss = None
        if labels is not None and next_sentence_label is not None:
            loss_fct = CrossEntropyLoss()
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
            total_loss = masked_lm_loss + next_sentence_loss  # the two losses are simply added

        return BertForPreTrainingOutput(
            loss=total_loss,
            prediction_logits=prediction_scores,
            seq_relationship_logits=seq_relationship_score,
            # ... remaining fields elided in this excerpt
        )
```
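To make the loss arithmetic concrete, here is a minimal pure-Python sketch of how the two cross-entropy terms combine. It is an illustration only: `cross_entropy` is a hypothetical helper that mimics the mean reduction and the default `ignore_index=-100` of `torch.nn.CrossEntropyLoss`, and the logits are made-up toy values.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the rows whose target is not `ignore_index`.

    Assumes at least one target is not ignored (otherwise this divides by zero).
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # positions that were not masked contribute nothing
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)


# Toy vocabulary of 3 tokens, 3 positions; only position 1 was masked.
prediction_scores = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
labels = [-100, 1, -100]

# One NSP decision for the single sentence pair (0 = B follows A).
seq_relationship_score = [[1.5, -0.5]]
next_sentence_label = [0]

masked_lm_loss = cross_entropy(prediction_scores, labels)
next_sentence_loss = cross_entropy(seq_relationship_score, next_sentence_label)
total_loss = masked_lm_loss + next_sentence_loss  # plain sum, no weighting

print(masked_lm_loss, next_sentence_loss, total_loss)
```

Because the two terms are added without weights, their relative scale is decided entirely by the data: the MLM term averages over masked positions only, while the NSP term averages over sentence pairs.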
BertPreTrainingHeads (modeling_bert.py:523-532(file:///workspace/src/transformers/models/bert/modeling_bert.py#L523)) bundles the two heads:
```python
# modeling_bert.py:523-532
class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)  # MLM: Dense → GELU → LN → Linear(vocab_size)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)  # NSP: Linear(768, 2) for bert-base
```
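As a rough sense of scale, the parameter counts of the two heads can be worked out by hand. The sketch below assumes the bert-base sizes used in this article (hidden size 768, vocabulary size 30522) and the Dense → GELU → LayerNorm → Linear layout noted in the comment above; `linear_params` is a hypothetical helper.

```python
hidden_size, vocab_size = 768, 30522  # assumed bert-base sizes


def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# MLM head: Dense(768 -> 768) plus LayerNorm (weight and bias vectors).
mlm_transform = linear_params(hidden_size, hidden_size) + 2 * hidden_size
# The decoder weight is tied to the word embeddings, so it adds no new parameters.
mlm_decoder_weight = hidden_size * vocab_size
mlm_decoder_bias = vocab_size
# NSP head: Linear(768 -> 2).
nsp_head = linear_params(hidden_size, 2)

extra_params = mlm_transform + mlm_decoder_bias + nsp_head
print(mlm_transform, mlm_decoder_weight, mlm_decoder_bias, nsp_head, extra_params)
```

Under these assumptions the tied decoder matrix would be by far the largest tensor in the heads, which is why sharing it with the embedding layer matters.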
Training Loop Sequence Diagram
[Figure: sequence diagram with participants Trainer, BertForPreTraining, BertModel, BertPreTrainingHeads, and Loss Functions. Recovered message order:]
1. Trainer → BertForPreTraining: forward(input_ids, labels, next_sentence_label)
2. BertForPreTraining → BertModel: forward(input_ids, attention_mask, token_type_ids)
3. BertModel: Embeddings → Encoder (12 layers) → Pooler, returning sequence_output, pooled_output
4. BertForPreTraining → BertPreTrainingHeads: cls(sequence_output, pooled_output)
5. predictions = BertLMPredictionHead(sequence_output), i.e. Dense → GELU → LayerNorm → Linear(768→30522)
6. seq_relationship = Linear(pooled_output), i.e. Linear(768→2)
7. BertPreTrainingHeads returns prediction_scores, seq_relationship_score
8. CrossEntropyLoss(prediction_scores, labels) → masked_lm_loss
9. CrossEntropyLoss(seq_relationship_score, next_sentence_label) → next_sentence_loss
10. total_loss = masked_lm_loss + next_sentence_loss
11. BertForPreTraining → Trainer: BertForPreTrainingOutput(loss=total_loss)
12. Trainer: loss.backward(), optimizer.step(), scheduler.step()
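Trainer Integration Notes
- Data preparation: in the MLM labels, masked positions hold the true token ID and every other position holds -100 (ignored by the loss)
- NSP labels: 0 means sentence B is the continuation of sentence A, 1 means sentence B is a random sentence
- Weight tying: the MLM head's decoder weight is shared with the embedding layer, declared through _tied_weights_keys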
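The label convention in the first bullet can be sketched in a few lines. This is a simplified illustration, not the data collator shipped with the library: `build_mlm_example` is a hypothetical helper, the 0.5 masking rate is chosen only to make the toy output interesting, the selected tokens are always replaced by [MASK] (the original BERT recipe also keeps or randomizes some of them), and the input IDs are illustrative.

```python
import random

MASK_TOKEN_ID = 103  # [MASK] in the bert-base-uncased vocabulary
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_mlm_example(input_ids, special_ids, mask_prob=0.15, seed=0):
    """Return (masked_inputs, labels) following the -100 label convention."""
    rng = random.Random(seed)
    masked_inputs, labels = [], []
    for token_id in input_ids:
        if token_id not in special_ids and rng.random() < mask_prob:
            masked_inputs.append(MASK_TOKEN_ID)  # hide the token from the model
            labels.append(token_id)              # the true ID becomes the target
        else:
            masked_inputs.append(token_id)       # token stays visible
            labels.append(IGNORE_INDEX)          # and is excluded from the loss
    return masked_inputs, labels


original_ids = [101, 2023, 3185, 2003, 2307, 102]  # [CLS] ... [SEP], illustrative IDs
special_ids = {101, 102}                            # never mask [CLS] / [SEP]
masked_inputs, labels = build_mlm_example(original_ids, special_ids, mask_prob=0.5)

print(masked_inputs)
print(labels)
```

Every position ends up in exactly one of two states: masked in the input with its true ID as the label, or left unchanged with the label -100.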
9. Pipeline Inference
This section walks through the complete flow of pipeline("text-classification", model="bert-base-uncased").
Pipeline Sequence Diagram
[Figure: sequence diagram with participants User, pipeline(), TextClassificationPipeline, BertTokenizer, BertForSequenceClassification, and postprocess. Recovered message order:]
1. User → pipeline(): pipeline("text-classification", model="bert-base-uncased")
2. pipeline() resolves the task type → text-classification and instantiates TextClassificationPipeline
3. AutoTokenizer.from_pretrained("bert-base-uncased") → BertTokenizer instance
4. AutoModelForSequenceClassification.from_pretrained("bert-base-uncased") → BertForSequenceClassification instance
5. User calls the pipeline with ("This movie is great!")
6. _sanitize_parameters()
7. preprocess → tokenizer("This movie is great!", return_tensors="pt") → {input_ids, attention_mask, token_type_ids}
8. _forward → model(**inputs, use_cache=False)
9. BertEmbeddings → BertEncoder → BertPooler, then Dropout → Linear(768, num_labels)
10. SequenceClassifierOutput(logits=(batch, num_labels))
11. postprocess(logits, function_to_apply="sigmoid"): softmax/sigmoid → take top_k → map to label
12. Result: {"label": "POSITIVE", "score": 0.9998}
Corresponding Key Code
The core methods of TextClassificationPipeline (text_classification.py:43(file:///workspace/src/transformers/pipelines/text_classification.py#L43)):
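```python
import inspect  # imported at module level in text_classification.py

# text_classification.py:154-157 (abridged)
def preprocess(self, inputs, **tokenizer_kwargs):
    return_tensors = "pt"
    if isinstance(inputs, dict):
        # dict inputs such as {"text": ..., "text_pair": ...} are unpacked as keyword arguments
        return self.tokenizer(**inputs, return_tensors=return_tensors, **tokenizer_kwargs)
    # a plain string is passed positionally
    return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

# text_classification.py:171-176
def _forward(self, model_inputs):
    model_forward = self.model.forward
    if "use_cache" in inspect.signature(model_forward).parameters:
        model_inputs["use_cache"] = False  # classification does not need a cache
    return self.model(**model_inputs)
```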
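The postprocess step from the sequence diagram (softmax/sigmoid → top_k → label mapping) can be sketched without any framework. The function below is a hypothetical simplification, not the pipeline's actual method: it takes a single row of logits as a Python list, and the `id2label` mapping and logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def postprocess(logits, id2label, function_to_apply="softmax", top_k=1):
    """Turn raw logits into a ranked list of {"label", "score"} dicts."""
    if function_to_apply == "softmax":
        scores = softmax(logits)
    elif function_to_apply == "sigmoid":
        scores = [sigmoid(x) for x in logits]
    else:
        scores = list(logits)  # leave the logits untouched
    ranked = sorted(
        ({"label": id2label[i], "score": s} for i, s in enumerate(scores)),
        key=lambda item: item["score"],
        reverse=True,
    )
    return ranked[:top_k]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # hypothetical label mapping
result = postprocess([-4.0, 4.5], id2label)
print(result)
```

Softmax makes the scores compete (they sum to 1), which suits single-label classification; sigmoid scores each label independently, which suits multi-label or single-logit models.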
The forward pass of BertForSequenceClassification (modeling_bert.py:1076-1153(file:///workspace/src/transformers/models/bert/modeling_bert.py#L1076)):
```python
# modeling_bert.py:1110-1153 (abridged excerpt)
def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
    outputs = self.bert(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        # ... remaining arguments elided in this excerpt
    )
    pooled_output = outputs[1]  # pooled output of the [CLS] token
    pooled_output = self.dropout(pooled_output)
    logits = self.classifier(pooled_output)  # Linear(768, num_labels)

    loss = None
    if labels is not None:
        # Infer the problem type: regression / single-label / multi-label classification
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"
        # Select the loss function according to the problem type
        ...
```
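The problem-type inference above is a pure decision rule, so it can be isolated and exercised on its own. In this sketch `infer_problem_type` is a hypothetical helper, the `labels_are_integer` flag stands in for the `labels.dtype` check, and the loss names in the table are the ones the library pairs with each problem type.

```python
def infer_problem_type(num_labels, labels_are_integer):
    """Mirror the branching used when config.problem_type is None."""
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels_are_integer:
        return "single_label_classification"
    return "multi_label_classification"


# Loss class chosen for each problem type.
LOSS_BY_PROBLEM_TYPE = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}

for num_labels, is_int in [(1, False), (3, True), (3, False)]:
    problem_type = infer_problem_type(num_labels, is_int)
    print(num_labels, is_int, problem_type, LOSS_BY_PROBLEM_TYPE[problem_type])
```

Note that the inferred value is written back to the config in the excerpt, so the first batch with labels fixes the problem type for every later call.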
10. State and Lifecycle Summary
A BERT model in the Transformers framework goes through a complete lifecycle, from definition to use.
State Machine Diagram
[Figure: state machine of the BERT lifecycle. Transition labels recovered from the diagram, in order: developer writes code; model_type = "bert"; from_pretrained(); model.eval(); model.train(); save_pretrained() (twice); from_pretrained(). States and their notes:]
- Definition: BertConfig defines the parameters; BertModel defines the architecture; BertPreTrainedModel defines the base class; each task head defines task-specific behavior
- Registration: CONFIG_MAPPING["bert"] = BertConfig; MODEL_MAPPING["bert"] = BertModel; AutoConfig/AutoModel routing
- Loading: (1) download/read config.json, (2) BertConfig.from_dict(), (3) initialize an empty shell on the meta device, (4) download/read the weight files, (5) WeightConverter key renaming, (6) load_state_dict(), (7) quantization (optional), (8) device placement, (9) tie_weights()
- Inference: tokenizer encoding → Embeddings → Encoder (12 layers) → Pooler / task head → output
- Training: forward pass → compute loss → backward pass → optimizer update → learning-rate scheduling
- Saving: config.json, model.safetensors, tokenizer.json / vocab.txt
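The save → load edge of the state machine is, for the configuration part, a JSON round trip. The sketch below illustrates that idea with a hypothetical `ToyBertConfig` dataclass; it borrows the method names `save_pretrained` and `from_pretrained` for readability but implements none of the real library's download, validation, or weight handling.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyBertConfig:
    """Hypothetical miniature config; field defaults assume bert-base sizes."""

    model_type: str = "bert"
    hidden_size: int = 768
    num_hidden_layers: int = 12
    vocab_size: int = 30522

    def save_pretrained(self, directory):
        # Serialize every field to <directory>/config.json.
        path = Path(directory) / "config.json"
        path.write_text(json.dumps(asdict(self), indent=2))
        return path

    @classmethod
    def from_pretrained(cls, directory):
        # Read config.json back and rebuild the dataclass from the dict.
        data = json.loads((Path(directory) / "config.json").read_text())
        return cls(**data)


with tempfile.TemporaryDirectory() as tmp:
    original = ToyBertConfig(num_hidden_layers=6)
    saved_path = original.save_pretrained(tmp)
    restored = ToyBertConfig.from_pretrained(tmp)

print(saved_path.name, restored)
```

Since the restored object compares equal to the original, a saved directory is enough to re-enter the loading state, which is what the saving → loading transition in the diagram expresses.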
Mapping Lifecycle Stages to Source Code
| Stage | Key file | Key functions/classes |
|---|---|---|
| Definition | configuration_bert.py(file:///workspace/src/transformers/models/bert/configuration_bert.py) | BertConfig @strict dataclass |
| Definition | modeling_bert.py(file:///workspace/src/transformers/models/bert/modeling_bert.py) | BertModel, BertPreTrainedModel, task heads |
| Definition | tokenization_bert.py(file:///workspace/src/transformers/models/bert/tokenization_bert.py) | BertTokenizer |
| Registration | configuration_auto.py(file:///workspace/src/transformers/models/auto/configuration_auto.py) | CONFIG_MAPPING, AutoConfig.register() |
| Registration | __init__.py(file:///workspace/src/transformers/models/bert/__init__.py) | _LazyModule lazy import |
| Loading | modeling_utils.py(file:///workspace/src/transformers/modeling_utils.py) | PreTrainedModel.from_pretrained() |
| Loading | configuration_utils.py(file:///workspace/src/transformers/configuration_utils.py) | PreTrainedConfig.from_pretrained() |
| Inference | masking_utils.py(file:///workspace/src/transformers/masking_utils.py) | create_bidirectional_mask() |
| Inference | modeling_utils.py(file:///workspace/src/transformers/modeling_utils.py) | ALL_ATTENTION_FUNCTIONS.get_interface() |
| Training | modeling_bert.py(file:///workspace/src/transformers/models/bert/modeling_bert.py) | BertForPreTraining.forward(), CrossEntropyLoss |
| Cache | cache_utils.py(file:///workspace/src/transformers/cache_utils.py) | DynamicCache, EncoderDecoderCache |
| Pipeline | text_classification.py(file:///workspace/src/transformers/pipelines/text_classification.py) | TextClassificationPipeline |
| Saving | configuration_utils.py(file:///workspace/src/transformers/configuration_utils.py) | PreTrainedConfig.save_pretrained() |
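The registration stage boils down to a lookup keyed by `model_type`. The following toy registry sketches that routing idea; all class and function names are hypothetical, and both mappings are keyed by the `model_type` string for simplicity, which is a deliberate simplification of how the library's own mappings are organized.

```python
CONFIG_MAPPING = {}  # model_type string -> config class
MODEL_MAPPING = {}   # model_type string -> model class (simplified keying)


class ToyBertConfig:
    model_type = "bert"

    def __init__(self, hidden_size=768):
        self.hidden_size = hidden_size


class ToyBertModel:
    def __init__(self, config):
        self.config = config


def register(config_cls, model_cls):
    """Record both classes under the config's model_type."""
    CONFIG_MAPPING[config_cls.model_type] = config_cls
    MODEL_MAPPING[config_cls.model_type] = model_cls


def auto_model_from_config(config_dict):
    """Route a parsed config.json dict to the matching model class."""
    model_type = config_dict["model_type"]
    if model_type not in CONFIG_MAPPING:
        raise KeyError(f"Unrecognized model_type: {model_type!r}")
    config = CONFIG_MAPPING[model_type](hidden_size=config_dict.get("hidden_size", 768))
    return MODEL_MAPPING[model_type](config)


register(ToyBertConfig, ToyBertModel)
model = auto_model_from_config({"model_type": "bert", "hidden_size": 1024})
print(type(model).__name__, model.config.hidden_size)
```

The caller never names a BERT class: the `model_type` field stored in config.json is the only thing that selects the implementation.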
Module Collaboration Overview
[Figure: layered module diagram. Recovered layers and nodes:]
- Configuration layer: BertConfig (@strict dataclass, model_type='bert'); PreTrainedConfig (attribute_map, serialization/deserialization)
- Tokenization layer: BertTokenizer (WordPiece); BertNormalizer; TemplateProcessing
- Model layer: BertEmbeddings (word + position + token_type); BertLayer × 12 (SelfAttention → FFN); BertPooler (CLS → Dense → Tanh); task heads (MLM / NSP / CLS / QA / TC)
- Attention layer: eager_attention_forward; sdpa_attention_forward; flash_attention_2_forward; ALL_ATTENTION_FUNCTIONS (dispatch registry)
- Mask layer: create_bidirectional_mask (BERT default); create_causal_mask (decoder mode)
- Cache layer: DynamicCache (autoregressive KV cache); EncoderDecoderCache (self + cross cache)
- Infrastructure: PreTrainedModel (from_pretrained(), save_pretrained()); AutoModel / AutoConfig (automatic routing); Pipeline (end-to-end inference)
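The difference between the two mask types in the mask layer can be shown with nested boolean lists. This is a conceptual sketch only: the two functions below are hypothetical and do not reproduce the signatures or tensor outputs of `create_bidirectional_mask` and `create_causal_mask`. `True` means "query position q may attend to key position k".

```python
def bidirectional_mask(attention_mask):
    """Every query sees every non-padding key (encoder-style attention)."""
    n = len(attention_mask)
    return [[bool(attention_mask[k]) for k in range(n)] for _ in range(n)]


def causal_mask(attention_mask):
    """Each query sees only non-padding keys at or before its own position."""
    n = len(attention_mask)
    return [[bool(attention_mask[k]) and k <= q for k in range(n)] for q in range(n)]


attention_mask = [1, 1, 1, 0]  # three real tokens followed by one padding token
bidirectional = bidirectional_mask(attention_mask)
causal = causal_mask(attention_mask)

for name, mask in [("bidirectional", bidirectional), ("causal", causal)]:
    print(name)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In the bidirectional case only padding is hidden, so the first token already sees the third; in the causal case the allowed region is lower triangular, which is what decoder mode requires.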
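Summary
The complete lifecycle of BERT in the Transformers framework can be summarized as follows:
- Definition: BertConfig is defined as a @strict dataclass that declares model_type = "bert"; the model architecture is defined through BertPreTrainedModel → BertModel
- Registration: model_type is registered automatically in CONFIG_MAPPING and MODEL_MAPPING, which enables AutoConfig/AutoModel routing
- Loading: from_pretrained() performs config loading → meta-device initialization → weight download/conversion → quantization (optional) → device placement → weight tying
- Encoding: BertTokenizer converts text into input_ids + attention_mask + token_type_ids through BertNormalizer → BertPreTokenizer → WordPiece → TemplateProcessing
- Forward pass: Embeddings (sum of three embeddings) → Encoder (12 BertLayer blocks, each with SelfAttention + FFN) → Pooler → task head
- Attention: ALL_ATTENTION_FUNCTIONS dispatches to the eager/SDPA/Flash Attention/Flex Attention implementations; BERT uses the bidirectional mask from create_bidirectional_mask by default
- Cache: the BERT encoder does not use a KV cache; when used as a decoder it uses EncoderDecoderCache
- Training: BertForPreTraining computes the MLM loss and the NSP loss together and simply adds them to form the total loss
- Pipeline: TextClassificationPipeline wraps the end-to-end tokenize → forward → postprocess flow
- Saving: save_pretrained() persists config.json + model.safetensors + the tokenizer files to disk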
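The WordPiece step in the encoding stage is a greedy longest-match-first search over the vocabulary. The sketch below shows that algorithm for a single word under simplifying assumptions: `wordpiece_tokenize` is a hypothetical helper, the vocabulary is a tiny made-up set, and normalization, pre-tokenization, and the [CLS]/[SEP] template are left out.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right and retry
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}  # made-up vocabulary
pieces_unaffable = wordpiece_tokenize("unaffable", toy_vocab)
pieces_playing = wordpiece_tokenize("playing", toy_vocab)
pieces_unknown = wordpiece_tokenize("xyz", toy_vocab)

print(pieces_unaffable, pieces_playing, pieces_unknown)
```

Greedy matching keeps the tokenizer deterministic and fast, at the cost of sometimes choosing a split that a linguist would not; a word with any unmatched remainder collapses to a single unknown token.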