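17-Hugging Face Transformers: BERT Case Study, Connecting Every Module of the Transformers Framework

This document uses the BERT model as a running example to tie together all the modules of the Transformers framework, showing the complete lifecycle from loading to inference.

All references are based on the actual source files.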


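Related articles:

  • Hugging Face Transformers: A Panoramic Walkthrough of the Source Code
  • 01-Hugging Face Transformers: In-Depth Analysis of the Core Infrastructure
  • 02-Hugging Face Transformers: In-Depth Analysis of the Configuration System
  • 03-Hugging Face Transformers: In-Depth Analysis of the Model System
  • 04-Hugging Face Transformers: In-Depth Analysis of the Attention and Masking System
  • 05-Hugging Face Transformers: In-Depth Analysis of the Cache System
  • 06-Hugging Face Transformers: In-Depth Analysis of the Generation System
  • 07-Hugging Face Transformers: In-Depth Analysis of the Tokenizer System
  • 08-Hugging Face Transformers: In-Depth Analysis of the Multimodal Processing System
  • 09-Hugging Face Transformers: In-Depth Analysis of the Training System
  • 10-Hugging Face Transformers: In-Depth Analysis of the Quantization System
  • 11-Hugging Face Transformers: In-Depth Analysis of the Distributed and Parallel System
  • 12-Hugging Face Transformers: In-Depth Analysis of the Pipeline Inference API
  • 13-Hugging Face Transformers: In-Depth Analysis of the AutoModel Auto-Dispatch Mechanism
  • 14-Hugging Face Transformers: In-Depth Analysis of Model Implementation Patterns
  • 15-Hugging Face Transformers: Overview of the CLI and Tooling Architecture
  • 16-Hugging Face Transformers: Overview of the Testing Architecture
  • 17-Hugging Face Transformers: BERT Case Study, Connecting Every Module of the Transformers Framework
  • 18-Hugging Face Transformers: GPT-2 Case Study, the Complete Lifecycle of a Decoder-only Autoregressive Model
  • 19-Hugging Face Transformers: Qwen3.5-MoE Series Explained, the Complete Lifecycle of Mixture-of-Experts + Linear Attention + Multimodality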

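1. Where BERT Sits in Transformers

BERT (Bidirectional Encoder Representations from Transformers) is the canonical Encoder-only model. Together with GPT (Decoder-only) and T5 (Encoder-Decoder), it represents one of the three main families of Transformer architectures.

BERT's defining feature is bidirectional attention: every token can attend to all other tokens in the sequence at once, rather than only to the context on its left. This makes BERT a natural fit for the following tasks: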

| Task type | Model class | Source location |
| --- | --- | --- |
| MLM (masked language modeling) | BertForMaskedLM | modeling_bert.py:913 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L913) |
| NSP (next sentence prediction) | BertForNextSentencePrediction | modeling_bert.py:994 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L994) |
| Sequence classification | BertForSequenceClassification | modeling_bert.py:1076 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L1076) |
| Question answering | BertForQuestionAnswering | modeling_bert.py:1315 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L1315) |
| Token classification | BertForTokenClassification | modeling_bert.py:1255 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L1255) |
| Multiple choice | BertForMultipleChoice | modeling_bert.py:1157 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L1157) |
| Pretraining (MLM + NSP) | BertForPreTraining | modeling_bert.py:731 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L731) |

Architecture positioning diagram

Transformer family:

  • Encoder-Decoder
      • T5: bidirectional encoder + causal decoder, Seq2Seq
      • BART: denoising autoencoding
      • Whisper: speech recognition
  • Decoder-Only
      • GPT series: causal (unidirectional) attention, autoregressive generation
      • LLaMA: RoPE + SwiGLU
      • Qwen: large-capacity decoder
  • Encoder-Only
      • BERT: bidirectional attention, MLM + NSP pretraining
      • RoBERTa: dynamic masking
      • ALBERT: parameter sharing
      • DeBERTa: disentangled attention

Key difference: BERT calls create_bidirectional_mask at modeling_bert.py:708 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L708) to build a bidirectional mask, whereas GPT uses create_causal_mask to build a causal mask.
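The difference between the two mask types can be illustrated with a small, framework-free sketch. The helpers `bidirectional_mask` and `causal_mask` below are hypothetical names invented for this illustration; they are not the library's `create_bidirectional_mask` or `create_causal_mask`, and they only show which key positions each query position is allowed to attend to (`True` means allowed).

```python
def bidirectional_mask(attention_mask):
    """Every query position may attend to every non-padding key position."""
    n = len(attention_mask)
    return [[bool(attention_mask[k]) for k in range(n)] for _ in range(n)]


def causal_mask(attention_mask):
    """A query at position q may only attend to non-padding keys at k <= q."""
    n = len(attention_mask)
    return [[k <= q and bool(attention_mask[k]) for k in range(n)] for q in range(n)]


# Three real tokens followed by one padding token.
padding = [1, 1, 1, 0]
bi = bidirectional_mask(padding)
ca = causal_mask(padding)

# Print one row per query position: "x" = may attend, "." = masked out.
for name, mask in (("bidirectional", bi), ("causal", ca)):
    print(name)
    for row in mask:
        print("  " + " ".join("x" if allowed else "." for allowed in row))
```

In the bidirectional case the first token already sees the two tokens to its right; in the causal case it sees only itself. Padding is masked out in both.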


2. The Full Config Definition Flow

The @strict dataclass definition of BertConfig

BERT's configuration class is defined in configuration_bert.py (file:///workspace/src/transformers/models/bert/configuration_bert.py) and uses the @strict and @auto_docstring decorators:

```python
# configuration_bert.py:17-63
from huggingface_hub.dataclasses import strict
from ...configuration_utils import PreTrainedConfig
from ...utils import auto_docstring

@auto_docstring(checkpoint="google-bert/bert-base-uncased")
@strict
class BertConfig(PreTrainedConfig):
    model_type = "bert"  # model type identifier used for registration

    vocab_size: int = 30522
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    intermediate_size: int = 3072
    hidden_act: str = "gelu"
    hidden_dropout_prob: float | int = 0.1
    attention_probs_dropout_prob: float | int = 0.1
    max_position_embeddings: int = 512
    type_vocab_size: int = 2
    initializer_range: float = 0.02
    layer_norm_eps: float = 1e-12
    pad_token_id: int | None = 0
    use_cache: bool = True
    classifier_dropout: float | int | None = None
    is_decoder: bool = False
    add_cross_attention: bool = False
    tie_word_embeddings: bool = True
```

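Key design points

  1. The @strict decorator (from huggingface_hub.dataclasses): enforces type checking so that configuration parameters have the correct types and invalid values are rejected.
  2. model_type = "bert": the core identifier for AutoConfig's automatic routing; the AutoConfig.register method at configuration_auto.py:424 (file:///workspace/src/transformers/models/auto/configuration_auto.py#L424) uses it to register the mapping.
  3. attribute_map: inherited from PreTrainedConfig (configuration_utils.py:219 (file:///workspace/src/transformers/configuration_utils.py#L219)); it provides attribute alias mapping, implemented transparently by intercepting __getattribute__ and __setattr__.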
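The alias mechanism in point 3 can be sketched with a toy class. `AliasedConfig` and the two aliases in its `attribute_map` are hypothetical and chosen only for illustration; this is not the code of `PreTrainedConfig`, just a minimal instance of the same interception technique.

```python
class AliasedConfig:
    """Toy config showing alias redirection; not PreTrainedConfig itself."""

    # alias name -> canonical attribute name (illustrative entries only)
    attribute_map = {"n_layer": "num_hidden_layers", "n_embd": "hidden_size"}

    def __init__(self, hidden_size=768, num_hidden_layers=12):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers

    def __setattr__(self, key, value):
        # Writes to an alias are redirected to the canonical attribute.
        key = type(self).attribute_map.get(key, key)
        super().__setattr__(key, value)

    def __getattribute__(self, key):
        # Reads of an alias are redirected to the canonical attribute.
        key = type(self).attribute_map.get(key, key)
        return super().__getattribute__(key)


cfg = AliasedConfig()
cfg.n_layer = 6  # stored under num_hidden_layers
print(cfg.num_hidden_layers, cfg.n_embd)
```

Because both hooks translate the name before delegating, the alias never becomes a separate instance attribute, so the two names cannot drift apart.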

Config class diagram

  • PreTrainedConfig
      • Attributes: model_type: str, attribute_map: dict, vocab_size: int, hidden_size: int, is_encoder_decoder: bool
      • Methods: from_pretrained(), from_dict(), to_dict(), save_pretrained(), __getattribute__(key), __setattr__(key, value)
  • BertConfig (inherits from PreTrainedConfig)
      • model_type = "bert", vocab_size: int = 30522, hidden_size: int = 768, num_hidden_layers: int = 12, num_attention_heads: int = 12, intermediate_size: int = 3072, hidden_act: str = "gelu", hidden_dropout_prob: float = 0.1, attention_probs_dropout_prob: float = 0.1, max_position_embeddings: int = 512, type_vocab_size: int = 2, initializer_range: float = 0.02, layer_norm_eps: float = 1e-12, pad_token_id: int = 0, is_decoder: bool = False, add_cross_attention: bool = False, tie_word_embeddings: bool = True
  • AutoConfig (performs lookups in CONFIG_MAPPING)
      • Methods: from_pretrained(), for_model(), register()
  • CONFIG_MAPPING (maps "bert" -> BertConfig)
      • Methods: register(model_type, config), __getitem__(model_type)

Config serialization flow diagram

  • Save and reload: BertConfig instance → to_dict() → Python dict {vocab_size: 30522, ...} → json.dumps() → JSON string → saved to disk as the config.json file → json.loads() → Python dict → BertConfig.from_dict() → BertConfig instance
  • Load from the Hub: remote Hub (google-bert/bert-base-uncased) → download config.json → AutoConfig.from_pretrained() → read model_type → CONFIG_MAPPING lookup with model_type='bert' → BertConfig located → from_dict() → BertConfig() → BertConfig instance

Key serialization/deserialization paths

  • save_pretrained() → to_dict() → JSON file
  • from_pretrained() → download/read JSON → from_dict() → BertConfig instance
  • AutoConfig.from_pretrained() uses the model_type field to look up the corresponding Config class in CONFIG_MAPPING
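The three paths above can be sketched with the standard library alone. `ToyBertConfig`, `TOY_CONFIG_MAPPING` and `auto_from_pretrained` are hypothetical names for this illustration; the real classes handle many more fields, Hub downloads and validation, none of which is modeled here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyBertConfig:
    """Hypothetical miniature config with only a few of BertConfig's fields."""

    model_type: str = "bert"
    vocab_size: int = 30522
    hidden_size: int = 768

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        return cls(**data)

    def save_pretrained(self, directory):
        # save_pretrained() -> to_dict() -> JSON file
        path = Path(directory) / "config.json"
        path.write_text(json.dumps(self.to_dict(), indent=2), encoding="utf-8")


# Toy registry standing in for CONFIG_MAPPING: model_type -> config class.
TOY_CONFIG_MAPPING = {"bert": ToyBertConfig}


def auto_from_pretrained(directory):
    """Read config.json, then dispatch on its model_type field."""
    text = (Path(directory) / "config.json").read_text(encoding="utf-8")
    data = json.loads(text)
    config_class = TOY_CONFIG_MAPPING[data["model_type"]]
    return config_class.from_dict(data)


with tempfile.TemporaryDirectory() as tmp:
    ToyBertConfig(hidden_size=256).save_pretrained(tmp)
    restored = auto_from_pretrained(tmp)

print(type(restored).__name__, restored.hidden_size)
```

The caller of `auto_from_pretrained` never names the config class; the `model_type` string stored in the JSON file is what selects it.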

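3. The Complete from_pretrained Sequence

When a user calls BertModel.from_pretrained('bert-base-uncased'), the framework runs through a series of intricate steps to load the pretrained weights into the model.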

Sequence diagram

Participants: user code, AutoModel, BertConfig, BertModel (BertPreTrainedModel), meta-device initialization, WeightConverter, quantizer, device placement, weight tying.

  1. User code calls from_pretrained("bert-base-uncased")
  2. The call is forwarded as from_pretrained("bert-base-uncased")
  3. BertConfig(vocab_size=30522, ...)
  4. Check quantization_config: no quantization config
  5. _from_config(config)
  6. __init__(config): a shell model (BertEmbeddings + BertEncoder + BertPooler) is initialized on the meta device
  7. post_init() → _init_weights()
  8. Download/load the weight file (model.safetensors or pytorch_model.bin)
  9. Convert legacy key names (e.g. "gamma" → "weight")
  10. load_state_dict()
  11. Only if a quantization config exists: AutoQuantizationConfig.from_pretrained(), then quantize the weights
  12. Place on the target device (cuda/cpu)
  13. tie_weights(): cls.predictions.decoder.weight ← bert.embeddings.word_embeddings.weight
  14. The ready BertModel is returned

Code involved in each step

Step 1: Config loading

  • Entry point: PreTrainedModel.from_pretrained() at modeling_utils.py:3789 (file:///workspace/src/transformers/modeling_utils.py#L3789)
  • Config loading: the BertConfig is first obtained via AutoConfig.from_pretrained()

Step 2: meta-device initialization

  • The framework builds the model skeleton on torch.device("meta") without allocating real memory
  • It calls BertModel.__init__(config) (modeling_bert.py:601 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L601))
```python
# modeling_bert.py:601-616
def __init__(self, config, add_pooling_layer=True):
    super().__init__(config)
    self.config = config
    self.embeddings = BertEmbeddings(config)    # word/position/token-type embeddings
    self.encoder = BertEncoder(config)           # stack of Transformer layers (12 by default)
    self.pooler = BertPooler(config) if add_pooling_layer else None  # pooling layer
    self.post_init()  # weight initialization + weight tying
```

Step 3: Weight loading and conversion

  • WeightConverter handles legacy key-name mappings (e.g. gamma → weight, beta → bias)
  • The state_dict is loaded from a safetensors or bin file
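The legacy key-name conversion mentioned above can be sketched as a plain dictionary rewrite. `convert_legacy_keys` is a hypothetical helper written for this illustration; the library's WeightConverter is more general, and its actual interface is not reproduced here.

```python
def convert_legacy_keys(state_dict):
    """Return a copy in which legacy LayerNorm suffixes are renamed."""
    renames = {".gamma": ".weight", ".beta": ".bias"}
    converted = {}
    for key, value in state_dict.items():
        for old, new in renames.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break
        converted[key] = value
    return converted


# Plain lists stand in for tensors; only the key names matter here.
legacy = {
    "bert.embeddings.LayerNorm.gamma": [1.0, 1.0],
    "bert.embeddings.LayerNorm.beta": [0.0, 0.0],
    "bert.embeddings.word_embeddings.weight": [[0.1, 0.2]],
}
converted = convert_legacy_keys(legacy)
print(sorted(converted))
```

Keys that already use the modern names pass through unchanged, so running the conversion on an up-to-date checkpoint is harmless.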

Step 4: Weight tying

  • BertForPreTraining defines the tying relationships (modeling_bert.py:732-735 (file:///workspace/src/transformers/models/bert/modeling_bert.py#L732)):
```python
# modeling_bert.py:732-735
_tied_weights_keys = {
    "cls.predictions.decoder.weight": "bert.embeddings.word_embeddings.weight",
    "cls.predictions.decoder.bias": "cls.predictions.bias",
}
```

This means the MLM head's output projection shares its weight with the input word embeddings, which reduces the parameter count.
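To make the saving concrete, the sketch below computes the size of the shared matrix from the BertConfig defaults listed earlier and then mimics tying with plain Python objects. `Param` and `tie` are hypothetical stand-ins for tensors and for the library's tying logic; only the mapping itself is taken from the excerpt above.

```python
# Defaults taken from the BertConfig listing earlier in this document.
vocab_size, hidden_size = 30522, 768

# The decoder weight has the same shape as the word embedding matrix
# (vocab_size x hidden_size), so tying avoids storing that matrix twice.
saved = vocab_size * hidden_size
print(f"parameters saved by tying: {saved:,}")


class Param:
    """Hypothetical stand-in for a tensor; only object identity matters here."""

    def __init__(self, label):
        self.label = label


# Mapping copied from the excerpt above: tied target -> source.
tied_weights_keys = {
    "cls.predictions.decoder.weight": "bert.embeddings.word_embeddings.weight",
    "cls.predictions.decoder.bias": "cls.predictions.bias",
}

params = {
    "bert.embeddings.word_embeddings.weight": Param("word embeddings"),
    "cls.predictions.bias": Param("prediction bias"),
    "cls.predictions.decoder.weight": Param("separate decoder weight"),
    "cls.predictions.decoder.bias": Param("separate decoder bias"),
}


def tie(params, mapping):
    """Point every tied target at the same object as its source."""
    for target, source in mapping.items():
        params[target] = params[source]


tie(params, tied_weights_keys)
shared = params["cls.predictions.decoder.weight"]
print(shared is params["bert.embeddings.word_embeddings.weight"])
```

With the default sizes the shared matrix holds 23,440,896 values, so tying removes roughly 23 million parameters from a model that has an MLM head.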


4. The Tokenizer Encoding Flow

BertTokenizer is based on the WordPiece tokenization algorithm and is defined in tokenization_bert.py (file:///workspace/src/transformers/models/bert/tokenization_bert.py).

Core architecture

```python
# tokenization_bert.py:41-77
class BertTokenizer(TokenizersBackend):
    vocab_files_names = VOCAB_FILES_NAMES  # {"vocab_file": "vocab.txt", "tokenizer_file": "tokenizer.json"}
    model_input_names = ["input_ids", "token_type_ids", "attention_mask"]
    model = WordPiece
```

BertTokenizer inherits from TokenizersBackend and relies on Hugging Face's tokenizers library underneath for high-performance tokenization.

Initialization flow

```python
# tokenization_bert.py:79-135
def __init__(self, vocab=None, do_lower_case=True, unk_token="[UNK]", ...):
    self._tokenizer = Tokenizer(WordPiece(self._vocab, unk_token=str(unk_token)))
    self._tokenizer.normalizer = normalizers.BertNormalizer(  # text normalization
        clean_text=True, handle_chinese_chars=tokenize_chinese_chars,
        strip_accents=strip_accents, lowercase=do_lower_case,
    )
    self._tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()  # pre-tokenization
    self._tokenizer.decoder = decoders.WordPiece(prefix="##")  # decoder
    # Post-processor: adds [CLS] and [SEP]
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single=f"[CLS]:0 $A:0 [SEP]:0",
        pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
        special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
    )
```

编码流程图

#mermaid-svg-DfssETW5nqIwy6jD{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-DfssETW5nqIwy6jD .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-DfssETW5nqIwy6jD .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-DfssETW5nqIwy6jD .error-icon{fill:#552222;}#mermaid-svg-DfssETW5nqIwy6jD .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-DfssETW5nqIwy6jD .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-DfssETW5nqIwy6jD .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-DfssETW5nqIwy6jD .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-DfssETW5nqIwy6jD .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-DfssETW5nqIwy6jD .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-DfssETW5nqIwy6jD .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-DfssETW5nqIwy6jD .marker{fill:#333333;stroke:#333333;}#mermaid-svg-DfssETW5nqIwy6jD .marker.cross{stroke:#333333;}#mermaid-svg-DfssETW5nqIwy6jD svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-DfssETW5nqIwy6jD p{margin:0;}#mermaid-svg-DfssETW5nqIwy6jD .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-DfssETW5nqIwy6jD .cluster-label text{fill:#333;}#mermaid-svg-DfssETW5nqIwy6jD .cluster-label span{color:#333;}#mermaid-svg-DfssETW5nqIwy6jD .cluster-label span p{background-color:transparent;}#mermaid-svg-DfssETW5nqIwy6jD .label text,#mermaid-svg-DfssETW5nqIwy6jD span{fill:#333;color:#333;}#mermaid-svg-DfssETW5nqIwy6jD .node rect,#mermaid-svg-DfssETW5nqIwy6jD .node circle,#mermaid-svg-DfssETW5nqIwy6jD .node ellipse,#mermaid-svg-DfssETW5nqIwy6jD .node polygon,#mermaid-svg-DfssETW5nqIwy6jD .node 
  • Raw text: 'Hello, my dog is cute'
  • BertNormalizer: lowercasing + cleanup + Chinese character handling
  • BertPreTokenizer: pre-tokenization on whitespace and punctuation
  • WordPiece: subword splitting ('hello' → 'hello', 'cute' → 'cute')
  • TemplateProcessing: adds the special tokens, giving [CLS] hello , my dog is cute [SEP]

The pipeline produces three outputs:

  • input_ids: 101, 7592, 1010, 2026, ... (index of each token in the vocabulary)
  • attention_mask: 1, 1, 1, 1, ... (1 = real token, 0 = padding)
  • token_type_ids: 0, 0, 0, 0, ... (0 = sentence A, 1 = sentence B)
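The WordPiece step can be sketched as a greedy longest-match-first lookup. The snippet below is a minimal illustration, not the tokenizers library implementation: the tiny vocabulary and the function name wordpiece_split are made up for the example, and the "##" prefix marks a piece that continues a word.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword split (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No prefix matched: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real bert-base-uncased vocab.
toy_vocab = {"hello", "cute", "play", "##ing", "dog"}

print(wordpiece_split("hello", toy_vocab))    # ['hello']
print(wordpiece_split("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_split("zzz", toy_vocab))      # ['[UNK]']
```

Common words such as 'hello' and 'cute' are in the vocabulary as whole entries, which is why the example sentence is not split any further.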

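Special token management

| Token | Purpose | Default value |
| --- | --- | --- |
| [CLS] | Start-of-sequence marker, used for classification | cls_token_id = 2 |
| [SEP] | Sentence separator | sep_token_id = 3 |
| [PAD] | Padding marker | pad_token_id = 0 |
| [UNK] | Unknown-word marker | unk_token_id = 1 |
| [MASK] | Mask marker (MLM training) | mask_token_id = 4 |

A pretrained checkpoint defines its own IDs through its vocabulary file. In bert-base-uncased, for example, [CLS] is 101 and [SEP] is 102, which is why the input_ids example starts with 101.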

Sentence-pair encoding

When the input consists of two sentences, the pair template of TemplateProcessing takes effect:

```
[CLS] sentence A [SEP] sentence B [SEP]
  0       0        0       1        1    ← token_type_ids
```
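The pair template and the accompanying masks can be reproduced in a few lines. This is a sketch under assumptions: encode_pair is a hypothetical helper, tokens stay as strings instead of vocabulary IDs, and padding to a fixed max_len is added only to show where the zeros in attention_mask come from.

```python
def encode_pair(tokens_a, tokens_b, max_len):
    """Build [CLS] A [SEP] B [SEP] plus token_type_ids and attention_mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    # Right-pad to max_len; padded positions get attention_mask = 0.
    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("sequence is longer than max_len")
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, token_type_ids, attention_mask


tokens, type_ids, mask = encode_pair(["my", "dog"], ["is", "cute"], max_len=9)
print(tokens)    # ['[CLS]', 'my', 'dog', '[SEP]', 'is', 'cute', '[SEP]', '[PAD]', '[PAD]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 0, 0]
```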

5. The full forward-pass pipeline

This section traces the complete data flow from input_ids to the final outputs.

Data flow diagram

  • input_ids (batch, seq_len) → BertEmbeddings
  • attention_mask (batch, seq_len) → _create_attention_masks → create_bidirectional_mask → BertEncoder
  • BertEmbeddings: word_embeddings nn.Embedding(30522, 768) + position_embeddings nn.Embedding(512, 768) + token_type_embeddings nn.Embedding(2, 768), summed → LayerNorm(768) → Dropout(0.1)
  • BertEncoder: 12 × BertLayer (BertLayer 0, BertLayer 1, ..., BertLayer 11) → sequence_output (batch, seq_len, 768)
  • BertPooler: CLS token → Dense + Tanh → pooler_output (batch, 768)
  • Task heads: BertOnlyMLMHead (MLM prediction), BertOnlyNSPHead (NSP prediction), Classifier (sequence classification), QA Outputs (question answering), Classifier (token classification)
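The embedding stage sums three lookups and then normalizes the result. The sketch below mimics that with plain Python lists and a hidden size of 4 instead of 768. The table contents and helper names are invented for illustration, LayerNorm is shown without its learned scale and bias, and dropout is omitted.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def embed(input_ids, token_type_ids, word_emb, pos_emb, type_emb):
    """Sum word, position and token-type embeddings, then apply LayerNorm."""
    outputs = []
    for position, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        summed = [
            w + p + t
            for w, p, t in zip(word_emb[tok], pos_emb[position], type_emb[seg])
        ]
        outputs.append(layer_norm(summed))
    return outputs


# Toy tables: vocab of 3 tokens, 2 positions, 2 segment types, hidden size 4.
word_emb = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
pos_emb = [[0.0, 0.1, 0.0, 0.1], [0.2, 0.0, 0.2, 0.0]]
type_emb = [[0.0, 0.0, 0.0, 0.0], [0.3, 0.3, 0.3, 0.3]]

hidden = embed([1, 2], [0, 1], word_emb, pos_emb, type_emb)
print(len(hidden), len(hidden[0]))  # 2 4
```

Each output row has (approximately) zero mean and unit variance, which is the state the first BertLayer receives.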

Internal structure of a single BertLayer

  • Input: hidden_states (batch, seq_len, 768)
  • BertAttention
      • BertSelfAttention: query = Linear(768, 768), key = Linear(768, 768), value = Linear(768, 768); scaled dot-product attention (Q×K^T × V) with attention_mask (bidirectional mask) applied
      • BertSelfOutput: dense = Linear(768, 768) → dropout(0.1) → LayerNorm + residual connection
  • Feed-Forward Network (FFN)
      • BertIntermediate: dense = Linear(768, 3072) → GELU activation
      • BertOutput: dense = Linear(3072, 768) → dropout(0.1) → LayerNorm + residual connection
  • Output: layer_output (batch, seq_len, 768)
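The layer sizes listed so far are enough to work out the parameter count by hand. The arithmetic below assumes every Linear layer has a bias, every LayerNorm has a weight and a bias of the hidden size, and the pooler is a single Linear(768, 768). These choices match the sizes given in the text but are written out here as assumptions of the sketch.

```python
def linear_params(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features


HIDDEN, INTERMEDIATE, LAYERS = 768, 3072, 12
VOCAB, MAX_POS, TYPES = 30522, 512, 2
layer_norm_params = 2 * HIDDEN  # weight + bias

embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm_params

attention = (
    3 * linear_params(HIDDEN, HIDDEN)   # query, key, value
    + linear_params(HIDDEN, HIDDEN)     # BertSelfOutput.dense
    + layer_norm_params
)
feed_forward = (
    linear_params(HIDDEN, INTERMEDIATE)    # BertIntermediate.dense
    + linear_params(INTERMEDIATE, HIDDEN)  # BertOutput.dense
    + layer_norm_params
)
per_layer = attention + feed_forward
pooler = linear_params(HIDDEN, HIDDEN)

total = embeddings + LAYERS * per_layer + pooler
print(embeddings)  # 23837184
print(per_layer)   # 7087872
print(total)       # 109482240
```

The total of roughly 110M parameters is the size usually quoted for BERT-base, and the 12 encoder layers account for most of it.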

Corresponding key code

BertSelfAttention ([modeling_bert.py:143-207](file:///workspace/src/transformers/models/bert/modeling_bert.py#L143)):

```python
# modeling_bert.py:168-207 (abridged)
def forward(self, hidden_states, attention_mask=None, past_key_values=None, **kwargs):
    # input_shape is (batch, seq_len) and hidden_shape is
    # (batch, seq_len, num_heads, head_dim); both are set earlier in the method.
    # Project to Q/K/V and reshape into multi-head form
    query_layer = self.query(hidden_states).view(*hidden_shape).transpose(1, 2)
    key_layer = self.key(hidden_states).view(*hidden_shape).transpose(1, 2)
    value_layer = self.value(hidden_states).view(*hidden_shape).transpose(1, 2)

    # Dispatch to the concrete implementation through ALL_ATTENTION_FUNCTIONS
    attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
        self.config._attn_implementation, eager_attention_forward
    )
    attn_output, attn_weights = attention_interface(
        self, query_layer, key_layer, value_layer,
        attention_mask, dropout=..., scaling=self.scaling, **kwargs,
    )
    # Merge the heads back into (batch, seq_len, hidden)
    attn_output = attn_output.reshape(*input_shape, -1).contiguous()
    return attn_output, attn_weights
```
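The view(...).transpose(1, 2) calls turn a (batch, seq_len, hidden) tensor into (batch, num_heads, seq_len, head_dim). The pure-Python sketch below does the same rearrangement for one sequence using nested lists. split_heads and merge_heads are hypothetical helpers, and the 12 heads of size 64 mentioned in the comment are the usual BERT-base values rather than something taken from the excerpt.

```python
def split_heads(hidden_states, num_heads):
    """Rearrange [seq_len][hidden] into [num_heads][seq_len][head_dim]."""
    hidden = len(hidden_states[0])
    if hidden % num_heads != 0:
        raise ValueError("hidden size must be divisible by num_heads")
    head_dim = hidden // num_heads  # e.g. 768 // 12 = 64 for BERT-base
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in hidden_states]
        for h in range(num_heads)
    ]


def merge_heads(per_head):
    """Inverse of split_heads: back to [seq_len][hidden]."""
    seq_len = len(per_head[0])
    return [
        [x for head in per_head for x in head[pos]]
        for pos in range(seq_len)
    ]


# Two positions, hidden size 4, split into 2 heads of size 2.
states = [[1, 2, 3, 4], [5, 6, 7, 8]]
heads = split_heads(states, num_heads=2)
print(heads)               # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
print(merge_heads(heads))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Each head sees every position but only its own slice of the hidden dimension, and merging the heads restores the original layout.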

BertLayer ([modeling_bert.py:358-420](file:///workspace/src/transformers/models/bert/modeling_bert.py#L358)):

```python
# modeling_bert.py:378-420 (abridged)
def forward(self, hidden_states, attention_mask=None, ...):
    self_attention_output, _ = self.attention(hidden_states, attention_mask, ...)
    attention_output = self_attention_output

    # When acting as a decoder with encoder output available, run cross-attention
    if self.is_decoder and encoder_hidden_states is not None:
        cross_attention_output, _ = self.crossattention(...)
        attention_output = cross_attention_output

    # FFN (can be processed in chunks to save memory)
    layer_output = apply_chunking_to_forward(
        self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
    )
    return layer_output
```
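Chunking works because the feed-forward network is applied to each position independently, so splitting the sequence, processing the pieces, and concatenating them gives the same result with a smaller peak memory footprint. The sketch below illustrates the idea on plain lists. chunked_apply and toy_ffn are hypothetical stand-ins, and treating a chunk size of 0 as "no chunking" is an assumption of this sketch.

```python
def toy_ffn(position_vector):
    """Stand-in for the position-wise feed-forward block."""
    return [2 * x + 1 for x in position_vector]


def chunked_apply(forward_fn, chunk_size, sequence):
    """Apply forward_fn to the sequence in slices of chunk_size positions."""
    if chunk_size == 0:
        # No chunking: process the whole sequence at once.
        return [forward_fn(vec) for vec in sequence]
    if len(sequence) % chunk_size != 0:
        raise ValueError("sequence length must be a multiple of chunk_size")
    output = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        output.extend(forward_fn(vec) for vec in chunk)  # concatenate results
    return output


sequence = [[0, 1], [2, 3], [4, 5], [6, 7]]
full = chunked_apply(toy_ffn, 0, sequence)
chunked = chunked_apply(toy_ffn, 2, sequence)
print(full == chunked)  # True
```

The same trick would not work for self-attention, where every position depends on all the others.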

Bidirectional attention mask diagram

Mask visualization (⬜ = visible, ⬛ = masked)

Bidirectional mask (BERT):

⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬛⬛
⬛⬛⬛⬛⬛⬛
⬛⬛⬛⬛⬛⬛

Causal mask (GPT):

⬜⬛⬛⬛⬛⬛
⬜⬜⬛⬛⬛⬛
⬜⬜⬜⬛⬛⬛
⬜⬜⬜⬜⬛⬛
⬜⬜⬜⬜⬜⬛
⬜⬜⬜⬜⬜⬜

How create_bidirectional_mask builds the mask:

  • Input: attention_mask (2D), for example [[1,1,1,1,0,0], [1,1,1,1,1,0]]
  • padding_mask_function (PM): handles the padding positions
  • bidirectional_mask_function (BMF): all tokens can see each other (q_idx >= 0 → True)
  • and_masks(PM, BMF): combines the two functions
  • Output: attention_mask (4D) of shape (batch, 1, seq_len, seq_len), with padding positions set to -inf

BERT's _create_attention_masks method ([modeling_bert.py:692-722](file:///workspace/src/transformers/models/bert/modeling_bert.py#L692)) chooses the mask type based on the is_decoder flag:

```python
# modeling_bert.py:700-712
if self.config.is_decoder:
    attention_mask = create_causal_mask(...)     # causal mask
else:
    attention_mask = create_bidirectional_mask(...)  # bidirectional mask (BERT default)
```
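To make the two mask types concrete, the sketch below builds additive masks for one sequence from a 2D padding mask, composing small predicate functions in the spirit of and_masks. It is a simplified illustration, not the masking_utils code: the helper names are hypothetical, only key positions (columns) are masked for padding, and float("-inf") stands in for the blocked value.

```python
NEG_INF = float("-inf")


def padding_fn(padding_mask):
    """Allow attention only to key positions that are real tokens."""
    return lambda q_idx, kv_idx: padding_mask[kv_idx] == 1


def bidirectional_fn(q_idx, kv_idx):
    """Every query may look at every key."""
    return True


def causal_fn(q_idx, kv_idx):
    """A query may look only at itself and earlier keys."""
    return kv_idx <= q_idx


def and_fns(*fns):
    """Combine predicates: a position is visible only if all agree."""
    return lambda q_idx, kv_idx: all(fn(q_idx, kv_idx) for fn in fns)


def build_additive_mask(mask_fn, seq_len):
    """Return a seq_len x seq_len matrix with 0.0 (visible) or -inf (blocked)."""
    return [
        [0.0 if mask_fn(q, kv) else NEG_INF for kv in range(seq_len)]
        for q in range(seq_len)
    ]


padding_mask = [1, 1, 1, 0]  # last position is padding
bidirectional = build_additive_mask(and_fns(padding_fn(padding_mask), bidirectional_fn), 4)
causal = build_additive_mask(and_fns(padding_fn(padding_mask), causal_fn), 4)

print(bidirectional[0])  # [0.0, 0.0, 0.0, -inf]
print(causal[0])         # [0.0, -inf, -inf, -inf]
```

Because the mask is additive, it can be summed directly with the attention scores before the softmax.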

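6. How the attention system works

BERT bidirectional attention vs. GPT causal attention

| Property | BERT (bidirectional) | GPT (causal) |
| --- | --- | --- |
| Mask function | bidirectional_mask_function | causal_mask_function |
| Mask logic | q_idx >= 0 (everything visible) | kv_idx <= q_idx (left context only) |
| Creation function | create_bidirectional_mask | create_causal_mask |
| Source location | [masking_utils.py:80](file:///workspace/src/transformers/masking_utils.py#L80) | [masking_utils.py:73](file:///workspace/src/transformers/masking_utils.py#L73) |
| Use case | Understanding tasks | Generation tasks |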

The ALL_ATTENTION_FUNCTIONS dispatch mechanism

ALL_ATTENTION_FUNCTIONS is a global registry of attention interfaces, defined at [modeling_utils.py:5070](file:///workspace/src/transformers/models/bert/.../.../modeling_utils.py#L5070):

```python
ALL_ATTENTION_FUNCTIONS: AttentionInterface = AttentionInterface()
```

Its class, AttentionInterface, inherits from GeneralInterface ([utils/generic.py:1054](file:///workspace/src/transformers/utils/generic.py#L1054)), which supports both a global mapping and local overrides.
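A global mapping with local overrides is a registry pattern that is easy to show in isolation. The class below is a hypothetical miniature, not the GeneralInterface source: names registered on the class are visible to every instance, while an assignment on one instance shadows the global entry for that instance only.

```python
class MiniInterface:
    """Toy registry with a shared global mapping and per-instance overrides."""

    _global_mapping = {}

    def __init__(self):
        self._local_mapping = {}

    @classmethod
    def register(cls, name, fn):
        """Add an entry that every instance can see."""
        cls._global_mapping[name] = fn

    def __setitem__(self, name, fn):
        """Override an entry for this instance only."""
        self._local_mapping[name] = fn

    def get_interface(self, name, default):
        """Look up locally first, then globally, else fall back to default."""
        if name in self._local_mapping:
            return self._local_mapping[name]
        return self._global_mapping.get(name, default)


def eager(q):
    return f"eager({q})"


def sdpa(q):
    return f"sdpa({q})"


MiniInterface.register("sdpa", sdpa)

shared = MiniInterface()
custom = MiniInterface()
custom["sdpa"] = lambda q: f"custom_sdpa({q})"  # local override

print(shared.get_interface("sdpa", eager)("x"))   # sdpa(x)
print(custom.get_interface("sdpa", eager)("x"))   # custom_sdpa(x)
print(shared.get_interface("eager", eager)("x"))  # eager(x)
```

The last call shows the fallback path: a name that is not registered resolves to the default that the caller passes in, which mirrors how get_interface is called with eager_attention_forward as its second argument.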

Attention dispatch flow

  • BertSelfAttention.forward() calls ALL_ATTENTION_FUNCTIONS.get_interface() with config._attn_implementation, and the result depends on the attn_implementation value:
  • eager → eager_attention_forward(): standard PyTorch implementation, Q×K^T → softmax → ×V
  • sdpa → sdpa_attention_forward(): torch.nn.functional.scaled_dot_product_attention, which automatically selects the Flash, memory-efficient, or math kernel
  • flash_attention_2 → flash_attention_2_forward(): Flash Attention 2 kernel with IO-aware optimization
  • flex_attention → flex_attention_forward(): PyTorch Flex Attention with custom mask functions
  • Every branch yields attn_output, attn_weights → reshape → contiguous → returned to BertSelfOutput

The core implementation of eager_attention_forward ([modeling_bert.py:115-140](file:///workspace/src/transformers/models/bert/modeling_bert.py#L115)):

```python
# modeling_bert.py:115-140
def eager_attention_forward(module, query, key, value, attention_mask, scaling=None, dropout=0.0, **kwargs):
    if scaling is None:
        scaling = query.size(-1) ** -0.5
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling  # QK^T / sqrt(d)
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask  # add the mask (-inf entries are blocked)
    attn_weights = nn.functional.softmax(attn_weights, dim=-1)  # softmax normalization
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value)  # weighted sum of the values
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights
```
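The same computation can be followed by hand on a single attention head. The sketch below re-implements the steps with Python lists (scores, additive mask, softmax, weighted sum). It is an illustration that leaves out batching, multiple heads, and dropout, and toy_attention is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_attention(query, key, value, additive_mask=None):
    """Single-head scaled dot-product attention on [seq_len][head_dim] lists."""
    scaling = len(query[0]) ** -0.5  # 1 / sqrt(head_dim)
    head_dim = len(value[0])
    output, weights = [], []
    for q_idx, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) * scaling for k in key]
        if additive_mask is not None:
            # -inf entries become exactly 0 after softmax.
            scores = [s + m for s, m in zip(scores, additive_mask[q_idx])]
        row = softmax(scores)
        weights.append(row)
        output.append(
            [sum(w * v[d] for w, v in zip(row, value)) for d in range(head_dim)]
        )
    return output, weights


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[0.0, 0.0, float("-inf")]] * 3  # third key is padding

out, w = toy_attention(q, k, v, mask)
print([round(x, 3) for x in w[0]])
```

The weight on the padded key is exactly zero, so the output rows are mixtures of the first two value vectors only.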

BertPreTrainedModel declares which attention implementations it supports ([modeling_bert.py:536-548](file:///workspace/src/transformers/models/bert/modeling_bert.py#L536)):

```python
# modeling_bert.py:536-548
class BertPreTrainedModel(PreTrainedModel):
    _supports_flash_attn = True
    _supports_sdpa = True
    _supports_flex_attn = True
    _supports_attention_backend = True
```

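7. The role of the cache system in BERT

BERT does not need a KV cache

As an Encoder-only model, BERT runs inference non-autoregressively: it processes the whole sequence in a single pass instead of generating token by token. For that reason, BERT does not use a KV cache by default.

This is visible in [modeling_bert.py:643-646](file:///workspace/src/transformers/models/bert/modeling_bert.py#L643):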

```python
# modeling_bert.py:643-646
if self.config.is_decoder:
    use_cache = use_cache if use_cache is not None else self.config.use_cache
else:
    use_cache = False  # in encoder mode the cache is always off
```

The EncoderDecoderCache scenario

When BERT is configured as a decoder (is_decoder=True + add_cross_attention=True), as with BertLMHeadModel, it can take part in a Seq2Seq architecture. In that case an EncoderDecoderCache is used:

```python
# modeling_bert.py:648-653
if use_cache and past_key_values is None:
    past_key_values = (
        EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache(config=self.config))
        if encoder_hidden_states is not None or self.config.is_encoder_decoder
        else DynamicCache(config=self.config)
    )
```

EncoderDecoderCache ([cache_utils.py:1479](file:///workspace/src/transformers/cache_utils.py#L1479)) holds two independent caches:

  • self_attention_cache: the KV cache for self-attention
  • cross_attention_cache: the KV cache for cross-attention
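How the two caches behave during decoding can be sketched with a toy class. ToyEncoderDecoderCache below is hypothetical and far simpler than the class in cache_utils.py: it stores plain Python lists for one layer, appends to the self-attention cache at every decoding step, and fills the cross-attention cache only once because the encoder output does not change.

```python
class ToyEncoderDecoderCache:
    """Minimal single-layer illustration of a two-part KV cache."""

    def __init__(self):
        self.self_attention_cache = {"keys": [], "values": []}
        self.cross_attention_cache = None  # filled on the first step

    def update_self(self, key, value):
        """Append the new token's key/value and return the full history."""
        self.self_attention_cache["keys"].append(key)
        self.self_attention_cache["values"].append(value)
        return self.self_attention_cache["keys"], self.self_attention_cache["values"]

    def get_cross(self, compute_fn):
        """Compute the encoder keys/values once, then reuse them."""
        if self.cross_attention_cache is None:
            self.cross_attention_cache = compute_fn()
        return self.cross_attention_cache


projection_calls = []  # records how often the encoder output is projected


def project_encoder_output():
    """Stand-in for projecting encoder_hidden_states to keys and values."""
    projection_calls.append(1)
    return (["enc_k0", "enc_k1"], ["enc_v0", "enc_v1"])


cache = ToyEncoderDecoderCache()
for step in range(3):
    keys, values = cache.update_self(f"k{step}", f"v{step}")
    cross_keys, cross_values = cache.get_cross(project_encoder_output)

print(len(keys))              # 3
print(len(projection_calls))  # 1
```

After three decoding steps the self-attention cache holds three entries, while the encoder projection has run only once.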

Cache comparison

  • BERT Encoder (default): input sequence → single forward pass (no cache) → output
  • GPT Decoder: Token 1 → KV cache stored → Token 2 → KV cache updated → Token 3 → KV cache updated → ...
  • BERT as Decoder (Seq2Seq): encoder output → cross_attention_cache (stored once, never changes); Decode Step 1 → self_attention_cache (grows step by step) → Decode Step 2

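Key differences

| Feature | BERT Encoder | GPT Decoder | BERT as Decoder |
|---|---|---|---|
| KV cache | Not used | DynamicCache | EncoderDecoderCache |
| Inference mode | Single pass | Autoregressive | Autoregressive |
| Cross-attention | No | No | Yes (caches the encoder output) |
| use_cache | False | True | True |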
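The three cache behaviors in the table can be illustrated with a toy simulation. This is a dependency-free sketch in plain Python: `ToyKVCache`, `encoder_forward`, `decoder_generate`, and `seq2seq_generate` are hypothetical names invented here, not the `DynamicCache` or `EncoderDecoderCache` API, and the "keys" and "values" are just strings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a KV cache (not the transformers API)."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def update(self, key, value):
        # Append the new entry and return everything cached so far.
        self.keys.append(key)
        self.values.append(value)
        return self.keys, self.values

    def __len__(self):
        return len(self.keys)


def encoder_forward(tokens):
    """Encoder-style pass: every token is processed at once, nothing is cached."""
    return [f"h({t})" for t in tokens]


def decoder_generate(prompt, steps):
    """Decoder-style loop: the self-attention cache grows by one entry per token."""
    cache = ToyKVCache()
    for t in prompt:
        cache.update(f"k({t})", f"v({t})")
    sizes = []
    for step in range(steps):
        cache.update(f"k(gen{step})", f"v(gen{step})")
        sizes.append(len(cache))
    return sizes


def seq2seq_generate(encoder_states, steps):
    """Seq2seq decoding: cross cache is filled once, self cache grows each step."""
    cross_cache = ToyKVCache()
    for h in encoder_states:  # stored once, never updated afterwards
        cross_cache.update(f"k({h})", f"v({h})")
    self_cache = ToyKVCache()
    history = []
    for step in range(steps):
        self_cache.update(f"k(gen{step})", f"v(gen{step})")
        history.append((len(cross_cache), len(self_cache)))
    return history
```

Calling `seq2seq_generate` with three encoder states shows the cross-attention size staying fixed while the self-attention size increases by one per decoding step.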

8. Training flow

MLM + NSP loss in BertForPreTraining

BERT pre-training combines two tasks, both defined in modeling_bert.py:731-820(file:///workspace/src/transformers/models/bert/modeling_bert.py#L731):

```python
# modeling_bert.py:731-820 (abridged; "..." marks elided arguments)
class BertForPreTraining(BertPreTrainedModel):
    _tied_weights_keys = {
        "cls.predictions.decoder.weight": "bert.embeddings.word_embeddings.weight",
        "cls.predictions.decoder.bias": "cls.predictions.bias",
    }

    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.cls = BertPreTrainingHeads(config)  # MLM head + NSP head
        self.post_init()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None,
                labels=None, next_sentence_label=None, **kwargs):
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids, ...)
        sequence_output, pooled_output = outputs[:2]
        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)

        total_loss = None
        if labels is not None and next_sentence_label is not None:
            loss_fct = CrossEntropyLoss()
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
            total_loss = masked_lm_loss + next_sentence_loss  # the two losses are simply summed

        return BertForPreTrainingOutput(
            loss=total_loss,
            prediction_logits=prediction_scores,
            seq_relationship_logits=seq_relationship_score,
            ...
        )
```
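To make the loss arithmetic concrete, here is a minimal plain-Python sketch of the same computation. It is not PyTorch: `cross_entropy` and `pretraining_loss` are hypothetical helpers that mimic mean-reduced cross-entropy with `ignore_index=-100` (the `CrossEntropyLoss` defaults) on nested lists.

```python
import math


def cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over the rows whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # unmasked positions contribute nothing to the MLM loss
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count if count else float("nan")


def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Unweighted sum of the MLM loss and the NSP loss."""
    masked_lm_loss = cross_entropy(mlm_logits, mlm_labels)
    next_sentence_loss = cross_entropy(nsp_logits, nsp_labels)
    return masked_lm_loss + next_sentence_loss
```

With uniform logits over a 4-token vocabulary and a single masked position, the MLM term is log 4 and the NSP term is log 2, so the total is log 8. Because the sum is unweighted, neither task is scaled relative to the other.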

BertPreTrainingHeads (modeling_bert.py:523-532(file:///workspace/src/transformers/models/bert/modeling_bert.py#L523)) contains two heads:

```python
# modeling_bert.py:523-532
class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)  # MLM: Dense → GELU → LN → Linear(vocab_size)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)  # NSP: Linear(768, 2)
```
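A quick parameter count shows how lopsided the two heads are and why tying the decoder weight matters. The sketch below is plain arithmetic and assumes hidden_size=768 and vocab_size=30522 (the sizes shown for this model elsewhere in the article), biased dense layers, and a LayerNorm with weight and bias; `linear_params` and `pretraining_head_params` are hypothetical helper names.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a dense layer."""
    return in_features * out_features + (out_features if bias else 0)


def pretraining_head_params(hidden_size=768, vocab_size=30522):
    """Rough parameter counts for the MLM and NSP heads (assumed layer layout)."""
    transform = linear_params(hidden_size, hidden_size)  # Dense before GELU
    layer_norm = 2 * hidden_size                         # LayerNorm weight + bias
    decoder_weight = hidden_size * vocab_size            # shared with word embeddings
    decoder_bias = vocab_size
    nsp = linear_params(hidden_size, 2)
    return {
        "mlm_transform": transform + layer_norm,
        "mlm_decoder_weight_tied": decoder_weight,
        "mlm_decoder_bias": decoder_bias,
        "nsp": nsp,
    }
```

Under these assumptions the tied decoder matrix accounts for roughly 23.4 million values, while the NSP head has only 1,538 parameters; sharing the matrix with the word embeddings avoids storing it twice.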

Training loop sequence diagram

[Sequence diagram: one training step. Participants: Trainer, BertForPreTraining, BertModel, BertPreTrainingHeads, Loss Functions]

  1. Trainer → BertForPreTraining: forward(input_ids, labels, next_sentence_label)
  2. BertForPreTraining → BertModel: forward(input_ids, attention_mask, token_type_ids)
  3. BertModel: Embeddings → Encoder (12 layers) → Pooler
  4. BertModel → BertForPreTraining: sequence_output, pooled_output
  5. BertForPreTraining → BertPreTrainingHeads: cls(sequence_output, pooled_output)
  6. BertPreTrainingHeads: predictions = BertLMPredictionHead(sequence_output), i.e. Dense → GELU → LayerNorm → Linear(768→30522)
  7. BertPreTrainingHeads: seq_relationship = Linear(pooled_output), i.e. Linear(768→2)
  8. BertPreTrainingHeads → BertForPreTraining: prediction_scores, seq_relationship_score
  9. BertForPreTraining → Loss Functions: CrossEntropyLoss(prediction_scores, labels), returning masked_lm_loss
  10. BertForPreTraining → Loss Functions: CrossEntropyLoss(seq_relationship_score, next_sentence_label), returning next_sentence_loss
  11. BertForPreTraining: total_loss = masked_lm_loss + next_sentence_loss
  12. BertForPreTraining → Trainer: BertForPreTrainingOutput(loss=total_loss)
  13. Trainer: loss.backward(), optimizer.step(), scheduler.step()

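Trainer integration notes

  1. Data preparation: in the MLM labels, masked positions hold the original token ID and every other position is -100 (ignored by the loss)
  2. NSP labels: 0 means sentence B is the actual continuation of sentence A; 1 means sentence B is a random sentence
  3. Weight tying: the MLM head's decoder weight is shared with the embedding layer, as declared by _tied_weights_keys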
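The label conventions in the first two notes can be sketched in a few lines of plain Python. This is a simplified illustration: `build_mlm_example` is a hypothetical helper that always substitutes the mask token (real MLM collators typically also mix in random and unchanged tokens), and the mask id 103 assumes the bert-base-uncased vocabulary.

```python
IGNORE_INDEX = -100   # positions the loss should skip
MASK_TOKEN_ID = 103   # [MASK] id, assuming the bert-base-uncased vocabulary

NSP_IS_NEXT = 0       # sentence B really follows sentence A
NSP_RANDOM = 1        # sentence B was sampled at random


def build_mlm_example(input_ids, masked_positions, mask_token_id=MASK_TOKEN_ID):
    """Return (masked_input_ids, labels) for one sequence."""
    masked = set(masked_positions)
    # Replace the chosen positions with the mask token in the model input.
    masked_inputs = [
        mask_token_id if i in masked else token
        for i, token in enumerate(input_ids)
    ]
    # Keep the original id only where a token was masked; ignore the rest.
    labels = [
        token if i in masked else IGNORE_INDEX
        for i, token in enumerate(input_ids)
    ]
    return masked_inputs, labels
```

For example, masking position 2 of `[101, 7, 8, 9, 102]` yields inputs `[101, 7, 103, 9, 102]` and labels `[-100, -100, 8, -100, -100]`, so only the masked position contributes to the MLM loss.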

9. Pipeline inference

The complete flow of pipeline("text-classification", model="bert-base-uncased").

Pipeline sequence diagram

[Sequence diagram: text-classification pipeline. Participants: user, pipeline(), TextClassificationPipeline, BertTokenizer, BertForSequenceClassification, postprocess]

  1. User → pipeline(): pipeline("text-classification", model="bert-base-uncased")
  2. pipeline(): resolve the task type → text-classification, then instantiate TextClassificationPipeline
  3. AutoTokenizer.from_pretrained("bert-base-uncased") → BertTokenizer instance
  4. AutoModelForSequenceClassification.from_pretrained("bert-base-uncased") → BertForSequenceClassification instance
  5. User → TextClassificationPipeline: ("This movie is great!")
  6. _sanitize_parameters()
  7. preprocess → tokenizer("This movie is great!", return_tensors="pt") → {input_ids, attention_mask, token_type_ids}
  8. _forward → model(**inputs, use_cache=False)
  9. BertEmbeddings → BertEncoder → BertPooler, then Dropout → Linear(768, num_labels)
  10. SequenceClassifierOutput(logits=(batch, num_labels))
  11. postprocess(logits, function_to_apply="sigmoid"): softmax/sigmoid → take top_k → map to label
  12. Result: {"label": "POSITIVE", "score": 0.9998}
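The final postprocess step (softmax or sigmoid → top_k → label mapping) can be sketched without any framework. The function below is a hypothetical stand-in written for illustration, not the pipeline's actual `postprocess` method; its argument names and the `id2label` mapping are assumptions.

```python
import math


def softmax(scores):
    """Normalize a list of logits into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Numerically stable logistic function for a single logit."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def postprocess(logits, id2label, function_to_apply="softmax", top_k=1):
    """Turn one row of logits into [{'label', 'score'}, ...], best first."""
    if function_to_apply == "softmax":
        scores = softmax(logits)
    elif function_to_apply == "sigmoid":
        scores = [sigmoid(s) for s in logits]
    elif function_to_apply == "none":
        scores = list(logits)
    else:
        raise ValueError(f"Unknown function_to_apply: {function_to_apply!r}")
    ranked = sorted(
        ({"label": id2label[i], "score": s} for i, s in enumerate(scores)),
        key=lambda item: item["score"],
        reverse=True,
    )
    return ranked[:top_k]
```

Softmax makes the scores compete and sum to 1, which suits single-label classification; sigmoid scores each label independently, which suits multi-label outputs.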

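Corresponding key code

Core methods of TextClassificationPipeline (text_classification.py:43(file:///workspace/src/transformers/pipelines/text_classification.py#L43)):

```python
import inspect

# text_classification.py:154-157 (abridged; other input forms elided)
def preprocess(self, inputs, **tokenizer_kwargs):
    return_tensors = "pt"
    if isinstance(inputs, dict):
        # Dict input, e.g. {"text": ..., "text_pair": ...}
        return self.tokenizer(**inputs, return_tensors=return_tensors, **tokenizer_kwargs)
    # A plain string such as "This movie is great!" is passed positionally
    return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

# text_classification.py:171-176
def _forward(self, model_inputs):
    model_forward = self.model.forward
    if "use_cache" in inspect.signature(model_forward).parameters:
        model_inputs["use_cache"] = False  # classification does not need a cache
    return self.model(**model_inputs)
```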

Forward pass of BertForSequenceClassification (modeling_bert.py:1076-1153(file:///workspace/src/transformers/models/bert/modeling_bert.py#L1076)):

```python
# modeling_bert.py:1110-1153 (abridged; "..." marks elided code)
def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
    outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, ...)
    pooled_output = outputs[1]  # pooled output of the [CLS] token
    pooled_output = self.dropout(pooled_output)
    logits = self.classifier(pooled_output)  # Linear(768, num_labels)

    loss = None
    if labels is not None:
        # Infer the problem type: regression / single-label / multi-label classification
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"
        # Choose the loss function according to the problem type
        ...
```
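The branch above is easy to isolate as a pure function. The sketch below reproduces the inference rule and pairs each problem type with the loss class conventionally used for it in Transformers sequence-classification heads (MSELoss, CrossEntropyLoss, BCEWithLogitsLoss); the pairing is stated from general knowledge of the library rather than from the elided lines, so verify it against your installed version. `infer_problem_type` and `LOSS_BY_PROBLEM_TYPE` are hypothetical names.

```python
def infer_problem_type(num_labels, labels_are_integer):
    """Mirror the problem_type inference rule shown in the forward pass."""
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels_are_integer:
        return "single_label_classification"
    return "multi_label_classification"


# Loss class names as strings, so the sketch needs no PyTorch import.
LOSS_BY_PROBLEM_TYPE = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}


def select_loss(num_labels, labels_are_integer):
    """Return the name of the loss that matches the inferred problem type."""
    return LOSS_BY_PROBLEM_TYPE[infer_problem_type(num_labels, labels_are_integer)]
```

Note that the original code writes the inferred value back to `self.config.problem_type`, so the decision made on the first labeled batch persists for later calls.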

10. State and lifecycle summary

Within the Transformers framework, a BERT model goes through a complete lifecycle from definition to use.

State machine diagram

[State diagram: BERT model lifecycle]

Transition labels, in diagram order: developer writes code; model_type = "bert"; from_pretrained(); model.eval(); model.train(); save_pretrained() (twice); from_pretrained()

States:

Definition: BertConfig defines the parameters; BertModel defines the architecture; BertPreTrainedModel defines the base class; the task heads define task behavior
Registration: CONFIG_MAPPING["bert"] = BertConfig; MODEL_MAPPING["bert"] = BertModel; AutoConfig/AutoModel routing
Loading: 1. download/read config.json 2. BertConfig.from_dict() 3. initialize an empty shell on the meta device 4. download/read the weight files 5. WeightConverter key renaming 6. load_state_dict() 7. quantization (optional) 8. device placement 9. tie_weights()
Inference: tokenizer encoding → Embeddings → Encoder (12 layers) → Pooler / task head → output
Training: forward pass → compute loss → backward pass → optimizer update → learning-rate scheduling
Saving: config.json, model.safetensors, tokenizer.json / vocab.txt
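The saving and loading states are mirror images: what save_pretrained() writes is what from_pretrained() reads back. A stdlib-only sketch of that round trip for the configuration file alone is shown below; `save_config`, `load_config`, and `round_trip` are hypothetical helpers, and real checkpoints also include the weight and tokenizer files, which are not modeled here.

```python
import json
import os
import tempfile


def save_config(config, save_directory):
    """Write a config dict to <save_directory>/config.json."""
    os.makedirs(save_directory, exist_ok=True)
    path = os.path.join(save_directory, "config.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2, sort_keys=True)
    return path


def load_config(save_directory):
    """Read <save_directory>/config.json back into a dict."""
    path = os.path.join(save_directory, "config.json")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def round_trip(config):
    """Save then reload a config in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        save_config(config, tmp)
        return load_config(tmp)
```

A config such as `{"model_type": "bert", "hidden_size": 768, "num_hidden_layers": 12}` survives the round trip unchanged, which is what lets the model_type field drive routing on the next load.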

Lifecycle stages mapped to source code

| Stage | Key file | Key functions/classes |
|---|---|---|
| Definition | configuration_bert.py(file:///workspace/src/transformers/models/bert/configuration_bert.py) | BertConfig (@strict dataclass) |
| | modeling_bert.py(file:///workspace/src/transformers/models/bert/modeling_bert.py) | BertModel, BertPreTrainedModel, task heads |
| | tokenization_bert.py(file:///workspace/src/transformers/models/bert/tokenization_bert.py) | BertTokenizer |
| Registration | configuration_auto.py(file:///workspace/src/transformers/models/auto/configuration_auto.py) | CONFIG_MAPPING, AutoConfig.register() |
| | __init__.py(file:///workspace/src/transformers/models/bert/__init__.py) | _LazyModule lazy import |
| Loading | modeling_utils.py(file:///workspace/src/transformers/modeling_utils.py) | PreTrainedModel.from_pretrained() |
| | configuration_utils.py(file:///workspace/src/transformers/configuration_utils.py) | PreTrainedConfig.from_pretrained() |
| Inference | masking_utils.py(file:///workspace/src/transformers/masking_utils.py) | create_bidirectional_mask() |
| | modeling_utils.py(file:///workspace/src/transformers/modeling_utils.py) | ALL_ATTENTION_FUNCTIONS.get_interface() |
| Training | modeling_bert.py(file:///workspace/src/transformers/models/bert/modeling_bert.py) | BertForPreTraining.forward(), CrossEntropyLoss |
| Cache | cache_utils.py(file:///workspace/src/transformers/cache_utils.py) | DynamicCache, EncoderDecoderCache |
| Pipeline | text_classification.py(file:///workspace/src/transformers/pipelines/text_classification.py) | TextClassificationPipeline |
| Saving | configuration_utils.py(file:///workspace/src/transformers/configuration_utils.py) | PreTrainedConfig.save_pretrained() |
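The registration and loading rows boil down to two table lookups: a model_type string selects a config class, and the config class selects a model class. The toy registry below sketches that idea under stated assumptions: every name in it (`ToyBertConfig`, `ToyBertModel`, `CONFIG_REGISTRY`, `MODEL_REGISTRY`, `register`, `auto_config`, `auto_model`) is hypothetical, and keying the model table by config class is a design choice of this sketch, not a description of the library's internals.

```python
class ToyBertConfig:
    """Hypothetical config class standing in for BertConfig."""
    model_type = "toy-bert"

    def __init__(self, **kwargs):
        # Store every remaining config entry as an attribute.
        for key, value in kwargs.items():
            setattr(self, key, value)


class ToyBertModel:
    """Hypothetical model class standing in for BertModel."""

    def __init__(self, config):
        self.config = config


CONFIG_REGISTRY = {}  # model_type string -> config class
MODEL_REGISTRY = {}   # config class -> model class


def register(config_cls, model_cls):
    """Record both mappings for one architecture."""
    CONFIG_REGISTRY[config_cls.model_type] = config_cls
    MODEL_REGISTRY[config_cls] = model_cls


def auto_config(config_dict):
    """Pick the config class from the model_type field and instantiate it."""
    fields = dict(config_dict)
    model_type = fields.pop("model_type")
    if model_type not in CONFIG_REGISTRY:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return CONFIG_REGISTRY[model_type](**fields)


def auto_model(config):
    """Pick the model class from the config's type and instantiate it."""
    return MODEL_REGISTRY[type(config)](config)


register(ToyBertConfig, ToyBertModel)
```

With this in place, `auto_model(auto_config({"model_type": "toy-bert", "hidden_size": 8}))` returns a `ToyBertModel`, and adding a new architecture only requires another `register` call.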

Module collaboration overview

[Figure: module collaboration overview, grouped by layer]

Configuration layer: BertConfig (@strict dataclass, model_type='bert'); PreTrainedConfig (attribute_map, serialization/deserialization)
Tokenization layer: BertTokenizer (WordPiece, BertNormalizer, TemplateProcessing)
Model layer: BertEmbeddings (word + position + token_type); BertLayer × 12 (SelfAttention → FFN); BertPooler (CLS → Dense → Tanh); task heads (MLM / NSP / CLS / QA / TC)
Attention layer: eager_attention_forward; sdpa_attention_forward; flash_attention_2_forward; ALL_ATTENTION_FUNCTIONS (dispatch registry)
Masking layer: create_bidirectional_mask (BERT default); create_causal_mask (decoder mode)
Cache layer: DynamicCache (autoregressive KV cache); EncoderDecoderCache (self + cross cache)
Infrastructure: PreTrainedModel (from_pretrained(), save_pretrained()); AutoModel / AutoConfig (automatic routing); Pipeline (end-to-end inference)
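The difference between the two entries in the masking layer is easiest to see on a tiny example. The functions below are a plain-Python illustration with hypothetical names and signatures; they do not reproduce the arguments or return types of create_bidirectional_mask or create_causal_mask.

```python
def bidirectional_mask(attention_mask):
    """mask[q][k] is True when query q may attend to key k.

    Every query sees every non-padding key, regardless of position.
    """
    n = len(attention_mask)
    return [[bool(attention_mask[k]) for k in range(n)] for _ in range(n)]


def causal_mask(attention_mask):
    """Like the bidirectional mask, but key k must not come after query q."""
    n = len(attention_mask)
    return [
        [bool(attention_mask[k]) and k <= q for k in range(n)]
        for q in range(n)
    ]
```

For `attention_mask = [1, 1, 0]` the bidirectional mask has identical rows `[True, True, False]`, while the causal mask's first row is `[True, False, False]`: the first token cannot look ahead, which is exactly the restriction BERT's encoder mode does not impose.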


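Summary

The complete lifecycle of BERT in the Transformers framework can be summarized as follows:

  1. Definition: BertConfig is defined as a @strict dataclass and declares model_type = "bert"; BertPreTrainedModel and BertModel define the model architecture
  2. Registration: model_type is automatically registered in CONFIG_MAPPING and MODEL_MAPPING, which enables AutoConfig/AutoModel routing
  3. Loading: from_pretrained() runs config loading → meta-device initialization → weight download/conversion → quantization (optional) → device placement → weight tying
  4. Encoding: BertTokenizer turns text into input_ids + attention_mask + token_type_ids via BertNormalizer → BertPreTokenizer → WordPiece → TemplateProcessing
  5. Forward pass: Embeddings (sum of three embeddings) → Encoder (12 BertLayer blocks, each with SelfAttention + FFN) → Pooler → task head
  6. Attention: ALL_ATTENTION_FUNCTIONS dispatches to the eager/SDPA/Flash Attention/Flex Attention implementations; BERT uses the bidirectional mask from create_bidirectional_mask by default
  7. Cache: the BERT encoder does not use a KV cache; when used as a decoder it uses EncoderDecoderCache
  8. Training: BertForPreTraining computes the MLM loss and the NSP loss together and adds them, unweighted, to form the total loss
  9. Pipeline: TextClassificationPipeline wraps the end-to-end tokenize → forward → postprocess flow
  10. Saving: save_pretrained() persists config.json + model.safetensors + the tokenizer files to disk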
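Two pieces of the forward pass in step 5, the three-way embedding sum and the pooler, are small enough to write out by hand. The sketch below uses nested lists instead of tensors and deliberately omits the LayerNorm and dropout that follow the embedding sum in the real model; `embed` and `pool` are hypothetical helper names.

```python
import math


def embed(input_ids, token_type_ids, word_emb, position_emb, token_type_emb):
    """Sum word, position, and token-type embeddings for each position."""
    hidden = []
    for position, (token_id, type_id) in enumerate(zip(input_ids, token_type_ids)):
        vectors = (word_emb[token_id], position_emb[position], token_type_emb[type_id])
        hidden.append([w + p + t for w, p, t in zip(*vectors)])
    return hidden


def pool(hidden_states, weight, bias):
    """Pooler: take the first ([CLS]) position, apply a dense layer, then tanh."""
    cls_vector = hidden_states[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]
```

Because the pooler reads only position 0, the sequence-level heads (NSP, sequence classification) depend on the encoder having routed sentence-level information into the [CLS] position through bidirectional attention.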