Using BERT in Development: Common Methods Explained

1. Imports

from transformers import BertModel, BertTokenizer

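2. from_pretrained

2.1 How to call it

model_name = "bert-base-chinese"
bert = BertModel.from_pretrained(model_name)

On the first call, from_pretrained downloads the model weights and configuration from the Hugging Face Hub and saves them to the local cache directory (on Windows this is usually C:\Users\<username>\.cache\huggingface\hub\); later calls reuse the cached files. It then loads the weights into memory and returns a BertModel instance.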

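# After from_pretrained, the model exposes:
bert.config        # model configuration (hidden size, number of layers, ...)
bert.parameters()  # iterator over all parameters (all trainable unless frozen)
bert.state_dict()  # dictionary mapping parameter names to weight tensors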

When training a classification task on top of BERT:

classifier = nn.Linear(bert.config.hidden_size, num_labels)   # num_labels = number of classes

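2.2 Data flow

Input text: "今天天气很好"
        |
input_ids = [101, 791, 211, 211, 361, 2521, 1962, 102]   (IDs are illustrative; real values come from the tokenizer vocabulary)
        |
BERT model
        |
┌──────────────────────────────────────────────
│ Output last_hidden_state
│ shape: [batch, seq_len, 768]
│
│ Every token gets its own 768-dimensional vector
│ [CLS] 今 天 天 气 很 好 [SEP]
│   ↑
│ this vector is used to represent the whole sentence
└──────────────────────────────────────────────
        |
Take the vector at the [CLS] position: shape [batch, 768]
        |
Classifier Linear(768 → num_labels)
        |
Output logits: shape [batch, num_labels]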
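
The shape bookkeeping in this data flow can be checked with a short sketch. A random tensor stands in for BERT's last_hidden_state here (an assumption, so that nothing has to be downloaded), and the batch size 2, sequence length 8 and 3 labels are arbitrary example values.

```python
import torch
import torch.nn as nn

batch, seq_len, hidden_size, num_labels = 2, 8, 768, 3  # example values

# Stand-in for outputs.last_hidden_state; a real run would get it from BertModel.
last_hidden_state = torch.randn(batch, seq_len, hidden_size)

# Position 0 of every sequence is the [CLS] token.
cls_vector = last_hidden_state[:, 0, :]          # [batch, 768]

classifier = nn.Linear(hidden_size, num_labels)  # 768 -> num_labels
logits = classifier(cls_vector)                  # [batch, num_labels]

print(cls_vector.shape)  # torch.Size([2, 768])
print(logits.shape)      # torch.Size([2, 3])
```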

2.2.1 Complete example

from transformers import BertModel, BertTokenizer
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super().__init__()

        # 1. Load the pretrained BERT
        self.bert = BertModel.from_pretrained(model_name)

        # 2. Optionally freeze the BERT parameters (speeds up training)
        # for param in self.bert.parameters():
        #     param.requires_grad = False

        # 3. Classification head
        hidden_size = self.bert.config.hidden_size  # 768 for bert-base models
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # BERT forward pass
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Sentence-level feature: the [CLS] hidden state passed through
        # BERT's pooler (a linear layer followed by tanh)
        pooled_output = outputs.pooler_output  # shape: [batch, 768]
        # Alternative: the raw [CLS] vector, outputs.last_hidden_state[:, 0, :]

        # Classification
        logits = self.classifier(pooled_output)  # shape: [batch, num_labels]

        return logits

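3. Freezing parameters

The example above contains this (commented-out) snippet:

# Freezing = fixing the parameters so that training does not update them
for param in self.bert.parameters():
    param.requires_grad = False

Once frozen, these parameters get no gradients during backpropagation and are never updated by the optimizer.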
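
To see that frozen parameters really receive no gradient, here is a minimal sketch. The two small nn.Linear layers are hypothetical stand-ins for self.bert and the classification head, chosen only so the snippet runs without downloading a model.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(8, 8)      # hypothetical stand-in for self.bert
classifier = nn.Linear(8, 2)   # task head, stays trainable

# Freeze the stand-in encoder.
for param in encoder.parameters():
    param.requires_grad = False

x = torch.randn(4, 8)
loss = classifier(encoder(x)).sum()
loss.backward()

print(encoder.weight.grad)                 # None: no gradient was computed
print(classifier.weight.grad is not None)  # True: the head still learns
```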

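3.1 When to freeze and when not to

3.1.1 Little data -> freeze

# Scenario: only a few hundred training examples
# Problem: fine-tuning all of BERT on so little data tends to overwrite
# its pretrained knowledge and overfit

# Freeze BERT and train only the classifier
for param in self.bert.parameters():
    param.requires_grad = False
classifier = nn.Linear(768, 2)

# During training only the classifier's parameters are updated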

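| Training set size | Recommended approach (rule of thumb) |
|---------------|---------|
| < 1,000 examples | Freeze BERT |
| 1,000 to 10,000 | Train only the last few layers (freeze the rest) |
| > 10,000 examples | Full fine-tuning is feasible |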

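3.1.2 Not enough GPU memory -> freeze

# Problem: BERT-base has about 110 million parameters, so training needs a lot of GPU memory

# Freeze the BERT parameters to reduce memory use
for param in self.bert.parameters():
    param.requires_grad = False

# Only the classifier is trained: no gradients or optimizer state are kept
# for BERT, so memory needs drop sharply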
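
Part of the saving comes from the optimizer: AdamW keeps two extra state tensors per trainable weight, so handing it only the unfrozen parameters keeps that state small. The sketch below assumes a small hypothetical backbone in place of BERT; the layer sizes are example values.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a larger "backbone" and a small classification head.
backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
head = nn.Linear(768, 2)
model = nn.Sequential(backbone, head)

for param in backbone.parameters():
    param.requires_grad = False

total = sum(p.numel() for p in model.parameters())
trainable_params = [p for p in model.parameters() if p.requires_grad]
trainable = sum(p.numel() for p in trainable_params)

# Only trainable parameters go to the optimizer, so AdamW allocates its
# per-parameter state for the head alone.
optimizer = torch.optim.AdamW(trainable_params, lr=1e-3)

print(f"total={total}, trainable={trainable}")
# total=1182722, trainable=1538
```

Activations from the forward pass still occupy memory, so freezing reduces the footprint but does not remove it.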

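3.1.3 Faster training -> freeze

# After freezing, backpropagation has far less to compute
# Training can be several times faster (the BERT forward pass still runs,
# so the actual gain depends on the setup)

# Scenario: quickly validating an idea, iterating on experiments
for param in self.bert.parameters():
    param.requires_grad = False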

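3.1.4 Using the pretrained model as a feature extractor -> freeze

# Scenario: only BERT's semantic features are needed; the model itself is not tuned

# BERT acts as a fixed "feature extractor"
for param in self.bert.parameters():
    param.requires_grad = False

# Output the [CLS] vector as input for another task
# (pass attention_mask so that padding positions are ignored)
outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
cls_embedding = outputs.last_hidden_state[:, 0, :]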

3.2 Common strategies

3.2.1 Freeze everything

# No BERT parameter is trained
for param in self.bert.parameters():
    param.requires_grad = False

3.2.2 Train only the last few layers

# Freeze the earlier layers, unfreeze the last few
for name, param in self.bert.named_parameters():
    # Freeze the embeddings and encoder layers 0-7, unfreeze layers 8-11
    if "layer.11" in name or "layer.10" in name or "layer.9" in name or "layer.8" in name:
        param.requires_grad = True
    else:
        param.requires_grad = False
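
Selecting layers by name is easy to get subtly wrong, so here is a small sketch of a stricter check. The names follow the pattern that BertModel.named_parameters() produces; the helper is_trainable and its arguments are hypothetical, introduced only for illustration.

```python
# Parameter names in the style returned by BertModel.named_parameters().
names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.1.attention.self.query.weight",
    "encoder.layer.8.output.dense.weight",
    "encoder.layer.10.attention.self.query.weight",
    "encoder.layer.11.output.dense.bias",
    "pooler.dense.weight",
]

def is_trainable(name, first_trainable_layer=8, num_layers=12):
    """True if name belongs to encoder layers first_trainable_layer..num_layers-1."""
    prefixes = tuple(
        f"encoder.layer.{i}." for i in range(first_trainable_layer, num_layers)
    )
    return name.startswith(prefixes)

trainable = [n for n in names if is_trainable(n)]
print(trainable)  # the layer 8, 10 and 11 entries

# Pitfall of plain substring checks: "layer.1" also matches layers 10 and 11.
loose = [n for n in names if "layer.1" in n]
print(len(loose))  # 3
```

Under this rule the pooler stays frozen as well; if the classifier reads pooler_output, it may be worth unfreezing the pooler too.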

3.2.3 Gradual unfreezing

# Epoch 1: train only the classifier
# Epoch 2: unfreeze the last 2 layers
# Epoch 3: unfreeze the last 4 layers
# ...
# Finally: unfreeze everything
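
The schedule above can be written as a small function. This is only a sketch: the name layers_to_unfreeze, the 12-layer default and the step of 2 layers per epoch are assumptions that mirror the schedule in the comments.

```python
def layers_to_unfreeze(epoch, num_layers=12, step=2):
    """Indices of encoder layers that are trainable in a given 1-based epoch.

    Epoch 1 trains only the classifier; every later epoch unfreezes
    `step` more layers from the top until all layers are trainable.
    """
    count = min(max(epoch - 1, 0) * step, num_layers)
    return list(range(num_layers - count, num_layers))

for epoch in range(1, 5):
    print(epoch, layers_to_unfreeze(epoch))
# 1 []
# 2 [10, 11]
# 3 [8, 9, 10, 11]
# 4 [6, 7, 8, 9, 10, 11]
```

At the start of each epoch, the returned indices can be turned into name prefixes such as "encoder.layer.10." and used to set requires_grad on the matching parameters.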

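3.2.4 Adapter

# Insert a small number of new parameters into every Transformer layer
# The original parameters stay fully frozen; only the adapters are trained

# Result: very few trainable parameters, often with good task performance
# Good fit for: little data + limited GPU memory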
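
For a feel of what such an adapter looks like, here is a minimal sketch of a generic residual bottleneck module. It is not any particular library's implementation; the class name, the bottleneck size of 64 and the zero initialization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck: hidden -> bottleneck -> hidden, added to the input."""

    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_size, hidden_size)
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
x = torch.randn(2, 8, 768)   # [batch, seq_len, hidden]
y = adapter(x)
num_params = sum(p.numel() for p in adapter.parameters())
print(y.shape, num_params)   # torch.Size([2, 8, 768]) 99136
```

One such module adds roughly 0.1 million parameters, which is tiny next to the roughly 110 million of BERT-base.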

4. The parameters of forward

from typing import Optional
import torch

def forward(
    self,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    token_type_ids: Optional[torch.Tensor] = None
) -> torch.Tensor:
    ...  # the body is walked through in 4.3.2

4.1 input_ids

| Property | Description |
|------|--------------------------|
| Meaning | Index (ID) of each input token in the vocabulary |
| Shape | (batch_size, seq_length) |
| Data type | torch.long (integer) |
| Required | Yes |

4.1.1 Example

import torch

# Original text: "今天天气很好"
# After tokenization: ["今", "天", "天", "气", "很", "好"]

# Assume a toy vocabulary: {"今": 1, "天": 2, "气": 3, "很": 4, "好": 5, "[PAD]": 0}
# Convert to IDs:
input_ids = torch.tensor([[1, 2, 2, 3, 4, 5]])  # shape: (1, 6)

print("Input:", input_ids)
# Input: tensor([[1, 2, 2, 3, 4, 5]])

Original text: "今天天气很好"
        ↓
tokenizer
        ↓
input_ids: [101, 791, 211, 211, 361, 2521, 1962, 102]   (illustrative IDs)

token:     [CLS]  今    天    天    气    很     好     [SEP]
position:  0      1     2     3     4     5      6      7
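
The text-to-ID step can be imitated with the toy vocabulary from 4.1.1. This is a sketch only: the IDs chosen for [UNK], [CLS] and [SEP] are hypothetical, and the real BertTokenizer uses a much larger WordPiece vocabulary.

```python
# Toy vocabulary from the example above, extended with hypothetical
# IDs for the special tokens.
vocab = {"[PAD]": 0, "今": 1, "天": 2, "气": 3, "很": 4, "好": 5,
         "[UNK]": 6, "[CLS]": 7, "[SEP]": 8}

def encode(text):
    """Character-level encoding: [CLS] + one ID per character + [SEP]."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

input_ids = encode("今天天气很好")
print(input_ids)  # [7, 1, 2, 2, 3, 4, 5, 8]
```

Characters missing from the vocabulary map to [UNK], which is also how an unknown token is represented in a real vocabulary.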

4.2 attention_mask

| Property | Description |
|----|----------------------------|
| Meaning | Marks which positions hold real tokens and which are padding |
| Shape | (batch_size, seq_length) |
| Values | 1 = real token, 0 = padding |
| Required | Yes |

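4.2.1 Why is it needed?

All sequences in a batch must have the same length so they can be stacked into one tensor, but sentences differ in length.

Example:
Sentence 1: "今天天气很好" → 6 tokens
Sentence 2: "我很喜欢学习" → 6 tokens
Sentence 3: "好"           → 1 token

They are padded to the same length (toy IDs):
Sentence 1: [1, 2, 2, 3, 4, 5]
Sentence 2: [10, 4, 11, 12, 13, 14]
Sentence 3: [5, 0, 0, 0, 0, 0]  ← padded with 0

attention_mask tells the model which positions are real content and which are padding:
Sentence 1: [1, 1, 1, 1, 1, 1]  ← all real tokens
Sentence 2: [1, 1, 1, 1, 1, 1]
Sentence 3: [1, 0, 0, 0, 0, 0]  ← only the first one is real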

import torch

# input_ids
input_ids = torch.tensor([
    [1, 2, 3, 4, 5, 0],   # sentence 1: length 5, 1 padding position
    [1, 2, 3, 4, 5, 6]    # sentence 2: length 6, no padding
])

# attention_mask
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1, 0],   # the last position is padding
    [1, 1, 1, 1, 1, 1]    # all real tokens
])
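
In practice the tokenizer returns attention_mask for you, but the rule behind it is simple enough to sketch by hand. The helper name build_attention_mask is hypothetical, and the pad ID of 0 is the one assumed throughout these toy examples.

```python
PAD_ID = 0  # [PAD] has ID 0 in the toy vocabulary used here

def build_attention_mask(batch_input_ids, pad_id=PAD_ID):
    """1 for real tokens, 0 for padding positions."""
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in batch_input_ids]

batch_ids = [
    [1, 2, 3, 4, 5, 0],
    [1, 2, 3, 4, 5, 6],
]
mask = build_attention_mask(batch_ids)
print(mask)
# [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
```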

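4.3 token_type_ids

| Property | Description |
|----|-----------------------------------|
| Meaning | Marks whether each token belongs to sentence A or sentence B (introduced for BERT's NSP pretraining task) |
| Shape | (batch_size, seq_length) |
| Values | 0 = sentence A, 1 = sentence B |
| Required | Optional |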

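4.3.1 Use cases

# Case 1: single-sentence classification (not needed)
# "今天天气很好" → there is only one sentence, nothing to tell apart
token_type_ids = None  # or simply omit the argument

# Case 2: sentence-pair tasks (needed)
# "今天天气很好" + "适合出去玩" → the two sentences must be told apart
# [CLS] 今 天 天 气 很 好 [SEP] 适 合 出 去 玩 [SEP]
#   0   0  0  0  0  0  0   0   1  1  1  1  1   1

# BERT uses token_type_ids to distinguish them:
# 0 = tokens of the first sentence (including [CLS] and the first [SEP])
# 1 = tokens of the second sentence (including the final [SEP])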

Sentence pair: "今天天气很好" + "适合出去玩" (IDs are illustrative)

tokens:         [CLS] 今   天   天   气   很    好    [SEP] 适    合    出    去    玩   [SEP]
input_ids:      101   791  211  211  361  2521  1962  102   3922  3311  1168  1294  643  102
token_type_ids: 0     0    0    0    0    0     0     0     1     1     1     1     1    1
                └──────────── first sentence ─────────────┘ └────── second sentence ──────┘
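
The segment layout for a sentence pair follows a fixed rule, sketched below. The helper build_token_type_ids is hypothetical; a real tokenizer produces these values itself when it is given two texts.

```python
def build_token_type_ids(tokens_a, tokens_b):
    """Segment IDs for the layout [CLS] A [SEP] B [SEP]."""
    segment_a = [0] * (len(tokens_a) + 2)  # [CLS] + A + first [SEP]
    segment_b = [1] * (len(tokens_b) + 1)  # B + final [SEP]
    return segment_a + segment_b

ids = build_token_type_ids(list("今天天气很好"), list("适合出去玩"))
print(ids)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```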

4.3.2 What happens inside the method

def forward(
    self,
    input_ids: torch.Tensor,        # (batch_size, seq_length)
    attention_mask: torch.Tensor,   # (batch_size, seq_length)
    token_type_ids: Optional[torch.Tensor] = None  # (batch_size, seq_length)
) -> torch.Tensor:                  # (batch_size, num_labels)

    # Step 1: BERT forward pass
    outputs = self.bert(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids
    )
    # outputs.last_hidden_state shape: (batch_size, seq_length, hidden_size)
    # outputs.pooler_output shape: (batch_size, hidden_size)

    # Step 2: take the pooled [CLS] feature
    pooled_output = outputs.pooler_output  # (batch_size, 768)

    # Step 3: dropout against overfitting
    # (assumes self.dropout = nn.Dropout(...) was created in __init__)
    pooled_output = self.dropout(pooled_output)  # (batch_size, 768)

    # Step 4: classification
    logits = self.classifier(pooled_output)  # (batch_size, num_labels)

    return logits

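5. Tokenizer parameters in detail

from transformers import BertTokenizer

MAX_SEQ_LENGTH = 128                        # example value
texts = ["今天天气很好", "我喜欢学习"]        # example input

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
encoded_inputs = tokenizer(
    texts,                      # list of texts to process
    padding=True,               # whether and how to pad
    truncation=True,            # whether to truncate
    max_length=MAX_SEQ_LENGTH,  # maximum sequence length
    return_tensors='pt'         # type of the returned data
)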

5.1 texts

| Property | Description |
|----|-----------------------|
| Meaning | The raw text to process |
| Type | List of strings, List[str] |
| Example | ["今天天气很好", "我喜欢学习"] |

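5.2 padding

| Value | Meaning |
|--------------|----------------------|
| True | Pad to the longest sequence in the batch |
| False | No padding; every sequence keeps its own length |
| 'longest' | Pad to the longest sequence in the batch (same as True) |
| 'max_length' | Pad to the length given by max_length |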

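texts = ["今天天气很好", "我喜欢学习"]  # 6 vs 5 characters

# With padding=True the shorter sentence is automatically padded
# to the length of the longest one in the batch.

# The effect is the same as appending 0s ([PAD] has ID 0); IDs are illustrative
input_ids = [
    [101, 791, 211, 211, 361, 2521, 1962, 102],   # 8 tokens incl. [CLS]/[SEP], no padding
    [101, 2769, 3319, 3614, 2110, 4333, 102, 0]   # 7 tokens, one 0 appended to reach length 8
]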

# Without padding (special tokens included)
['[CLS]','今','天','天','气','很','好','[SEP]']      # length 8
['[CLS]','我','喜','欢','学','习','[SEP]']           # length 7

# padding=True
['[CLS]','今','天','天','气','很','好','[SEP]']      # length 8 (longest, unchanged)
['[CLS]','我','喜','欢','学','习','[SEP]','[PAD]']   # length 8
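
The difference between the 'longest' and 'max_length' settings can be sketched in a few lines of plain Python. The function pad_batch and its strategy argument are hypothetical and only mimic the behaviour described in the table; the real tokenizer does this internally.

```python
def pad_batch(batch, strategy="longest", max_length=None, pad_id=0):
    """Right-pad every sequence with pad_id.

    strategy="longest":    pad to the longest sequence in the batch
    strategy="max_length": pad to max_length
    """
    if strategy == "longest":
        target = max(len(seq) for seq in batch)
    elif strategy == "max_length":
        if max_length is None:
            raise ValueError("max_length is required for strategy='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]

batch = [[101, 1, 2, 3, 102], [101, 4, 102]]
print(pad_batch(batch))
# [[101, 1, 2, 3, 102], [101, 4, 102, 0, 0]]
print(pad_batch(batch, strategy="max_length", max_length=6))
# [[101, 1, 2, 3, 102, 0], [101, 4, 102, 0, 0, 0]]
```

Padding to the longest sequence wastes the least computation per batch, while a fixed max_length gives every batch the same shape.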

5.3 truncation

| Value | Meaning |
|-------|-----------------------|
| True | Tokens beyond max_length are cut off and discarded |
| False | No truncation; sequences may exceed the maximum length |

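MAX_SEQ_LENGTH = 10

texts = ["这是一个非常非常非常长的句子超过了最大长度限制"]

# With truncation=True and max_length=MAX_SEQ_LENGTH,
# only 10 tokens are kept (counting [CLS] and [SEP]); the rest is discarded

# Result (illustrative IDs)
input_ids = [101, 1, 2, 3, 4, 5, 6, 7, 8, 102]  # truncated to 10 tokens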
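
The truncated result keeps [CLS] at the front and [SEP] at the end, which the following sketch reproduces. The helper truncate is hypothetical, and 102 is the [SEP] ID used in the examples of this article.

```python
def truncate(input_ids, max_length, sep_id=102):
    """Keep at most max_length IDs; the last kept position becomes [SEP]."""
    if len(input_ids) <= max_length:
        return input_ids
    return input_ids[: max_length - 1] + [sep_id]

ids = [101] + list(range(1, 21)) + [102]   # [CLS] + 20 tokens + [SEP]
truncated = truncate(ids, 10)
print(truncated)
# [101, 1, 2, 3, 4, 5, 6, 7, 8, 102]
```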

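5.4 max_length

| Property | Description |
|-----|-------------------------------|
| Meaning | Sets the maximum length of the token sequence |
| Effect | 1. Longer sequences are truncated (with truncation=True) 2. Shorter sequences are padded up to it (with padding='max_length') |
| Common values | 128, 256, 512 |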

5.5 return_tensors

| Value | Meaning |
|------|---------------------------|
| 'pt' | Returns PyTorch tensors (torch.Tensor) |
| 'tf' | Returns TensorFlow tensors (tf.Tensor) |
| 'np' | Returns NumPy arrays (np.ndarray) |
| None | Returns Python lists |