[Deep Learning] [NLP] BERT: Theory and Code

Contents

I. BERT Theory
- BERT Model Formulas
  - 1. Input Representation
  - 2. Self-Attention Mechanism
  - 3. Transformer Layer
II. Code for Understanding BERT
- 1. Self-Attention
- 2. Transformer Layer
- 3. Positional Encoding
- 4. BERT Model
- Explanation of the Code
III. Multi-Class Text Classification with BERT
IV. What Is New in BERT

I. BERT Theory

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing model developed by Google. It achieves strong results on many NLP tasks, mainly because the representation of each token is conditioned on the context on both its left and its right. The sections below cover the model from two angles: formulas and code.

BERT Model Formulas

1. Input Representation

BERT's input is the sum of three embeddings:

Token Embeddings: one vector for each token in the input.
Segment Embeddings: mark which of the two input sentences a token belongs to.
Position Embeddings: encode the position of each token in the sequence.

The input vector is:
$$\text{Input} = \text{Token Embedding} + \text{Segment Embedding} + \text{Position Embedding}$$
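
As a minimal sketch of this sum, the snippet below adds three embedding lookups in PyTorch. The sizes (a 100-token vocabulary, hidden size 8) and the token ids are made up for illustration; the released BERT models additionally apply layer normalization and dropout after the sum.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only, far smaller than any real BERT configuration
vocab_size, num_segments, max_len, hidden = 100, 2, 16, 8

token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(num_segments, hidden)
position_emb = nn.Embedding(max_len, hidden)  # learned, one vector per position

# One sequence of 6 tokens: sentence A (segment 0) followed by sentence B (segment 1)
input_ids = torch.tensor([[5, 17, 42, 6, 23, 7]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3, 4, 5]]

# Element-wise sum of the three lookups
embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(embeddings.shape)  # torch.Size([1, 6, 8])
```

All three tables map to the same hidden size, which is what makes the element-wise sum possible.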

2. Self-Attention Mechanism

The core of BERT is the Transformer's multi-head self-attention. Scaled dot-product attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Here $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimension of the key vectors.
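
To make the formula concrete, here is a small worked example in plain Python (no libraries beyond `math`). The 2x2 matrices are toy values chosen for illustration, and the helper names `softmax` and `attention` are introduced here only for this sketch.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]
    # output_i = sum_j weights[i][j] * v_j
    out = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

out, weights = attention(Q, K, V)
print(weights)  # each row sums to 1; the matching key gets the larger weight
print(out)      # each output row is a weighted average of the rows of V
```

Because the first query matches the first key, the first output row is pulled toward the first row of V, and symmetrically for the second.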

Multi-head attention concatenates the results of several attention heads:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O$$

Each head is computed as:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
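
A quick arithmetic check of these shapes, assuming the BERT-base sizes used later in this article (hidden size 768, 12 heads). The count covers weight matrices only and ignores bias terms.

```python
d_model, num_heads = 768, 12          # BERT-base sizes
d_k = d_model // num_heads            # per-head dimension
assert d_k * num_heads == d_model     # the hidden size must split evenly across heads

# Each head has three projections W_i^Q, W_i^K, W_i^V of shape (d_model, d_k)
per_head = 3 * d_model * d_k
# W^O maps the concatenated heads (num_heads * d_k) back to d_model
w_o = (num_heads * d_k) * d_model

total = num_heads * per_head + w_o
print(d_k, total)  # 64 2359296
```

The total equals 4 x 768 x 768: splitting into heads changes how the projections are used, not how many weights they contain.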

3. Transformer Layer

Each Transformer layer consists of multi-head self-attention followed by a feed-forward network, each wrapped in a residual connection and layer normalization:
$$\text{Output} = \text{LayerNorm}(\text{MultiHead}(Q, K, V) + \text{Input})$$
$$\text{Output} = \text{LayerNorm}(\text{FFN}(\text{Output}) + \text{Output})$$

The feed-forward network is defined as follows (this is the ReLU form from the original Transformer; BERT itself uses GELU as the activation):
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
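
The second equation can be traced by hand on a tiny vector. The sketch below is plain Python with toy weights picked so the numbers are easy to follow; `layer_norm` here omits the learnable gain and bias that a real LayerNorm has, which is a simplification for illustration.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, with each matrix stored as a list of columns."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b) for col, b in zip(W1, b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b for col, b in zip(W2, b2)]

x = [1.0, -2.0]                                  # output of the attention sub-layer
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # 3 hidden units over 2 inputs
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]          # 2 outputs over 3 hidden units
b2 = [0.5, 0.5]

ffn_out = relu_ffn(x, W1, b1, W2, b2)            # hidden = [1, 0, 0] after ReLU
out = layer_norm([f + xi for f, xi in zip(ffn_out, x)])  # LayerNorm(FFN(x) + x)

print(ffn_out)  # [1.5, 0.5]
print(out)      # approximately [1.0, -1.0]
```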

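II. Code for Understanding BERT

Below is a simplified BERT implemented from scratch in PyTorch. It covers self-attention, multi-head attention, positional encoding, and the Transformer layer.

1. Self-Attention

First, implement the self-attention mechanism: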

import math

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert self.head_dim * heads == embed_size, "Embedding size needs to be divisible by heads"

        # Full-width projections: slicing their output into `heads` chunks gives
        # every head its own W_i^Q, W_i^K and W_i^V, as in the multi-head formula.
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)  # W^O

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Project, then split the embedding into self.heads pieces
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).reshape(N, query_len, self.heads, self.head_dim)

        # Dot product of every query with every key: (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            # Positions where mask == 0 get a huge negative score, so softmax gives them ~0 weight
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(d_k), the per-head key dimension, then normalize over the keys
        attention = torch.softmax(energy / math.sqrt(self.head_dim), dim=3)

        # Weighted sum of the values, then concatenate the heads
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )
        return self.fc_out(out)

2. Transformer Layer

Next, implement a complete Transformer layer, consisting of multi-head self-attention and a feed-forward network:

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        # Position-wise feed-forward network. ReLU matches the FFN formula in Part I;
        # the released BERT models use GELU instead.
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        # Residual connection + layer norm around each sub-layer.
        # Dropout is applied to the sub-layer output before it is added back.
        x = self.norm1(query + self.dropout(attention))
        forward = self.feed_forward(x)
        out = self.norm2(x + self.dropout(forward))
        return out

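3. Positional Encoding

Implement positional encoding, which tells the model where each token sits in the sequence. The class below uses the fixed sinusoidal encoding from the original Transformer; BERT itself learns its position embeddings.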

class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_length):
        super().__init__()
        encoding = torch.zeros(max_length, embed_size)

        pos = torch.arange(0, max_length).float().unsqueeze(1)  # (max_length, 1)
        _2i = torch.arange(0, embed_size, step=2).float()       # even dimension indices 0, 2, 4, ...

        # Even dimensions use sine, odd dimensions use cosine (assumes an even embed_size)
        encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / embed_size)))
        encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / embed_size)))

        # A buffer is not trained, but it moves with the module on .to(device)
        self.register_buffer("encoding", encoding)

    def forward(self, x):
        seq_len = x.size(1)
        return x + self.encoding[:seq_len, :]

4. BERT Model

Combine all the parts into the BERT model:

class BERT(nn.Module):
    def __init__(self, 
                 vocab_size, 
                 embed_size=768, 
                 num_layers=12, 
                 heads=12, 
                 forward_expansion=4, 
                 dropout=0.1, 
                 max_length=512):
        super().__init__()
        # Token embeddings only: this simplified model leaves out segment embeddings
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.position_encoding = PositionalEncoding(embed_size, max_length)
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_size, heads, dropout, forward_expansion) for _ in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask):
        out = self.embedding(x)
        out = self.position_encoding(out)
        out = self.dropout(out)
        
        # Self-attention: the same tensor serves as value, key, and query
        for layer in self.layers:
            out = layer(out, out, out, mask)
        
        return out

# Usage example
# Hyperparameters
vocab_size = 30522  # vocabulary size (that of bert-base-uncased)
embed_size = 768    # embedding dimension (as in BERT-base)
num_layers = 12     # number of Transformer layers
heads = 12          # number of attention heads
max_length = 512    # maximum sequence length

# Create the model
model = BERT(vocab_size, embed_size, num_layers, heads, max_length=max_length)
model.eval()  # disable dropout for a deterministic forward pass

# Random token ids standing in for a real tokenized sentence
input_ids = torch.randint(0, vocab_size, (1, max_length))

# No mask: every position may attend to every other position
mask = None

# Forward pass
with torch.no_grad():
    output = model(input_ids, mask)

print(output.shape)  # torch.Size([1, 512, 768])

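Explanation of the Code

Self-attention:
- The SelfAttention class implements multi-head self-attention.
- Its forward method computes the attention weights and applies them to the values.
Transformer layer:
- The TransformerBlock class combines multi-head self-attention with a feed-forward network.
- Its forward method runs the attention and feed-forward steps, applying a residual connection and layer normalization after each.
Positional encoding:
- The PositionalEncoding class adds position information to the input.
- Its forward method adds the positional encoding to the input embeddings.
BERT model:
- The BERT class combines the embedding layer, the positional encoding, and a stack of Transformer layers.
- Its forward method passes the input through the embedding, positional encoding, dropout, and the Transformer layers in turn.

This implementation shows BERT's core mechanics, but it is a simplified version meant for understanding how BERT works internally. In practice, an existing library such as transformers is more efficient and more reliable.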
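
The usage example passes `mask = None`. With padded batches, a padding mask is needed so that real tokens do not attend to padding. The sketch below shows one way to build a mask that broadcasts against the `(N, heads, query_len, key_len)` score tensor filled by `masked_fill(mask == 0, ...)` in `SelfAttention`. The token ids are illustrative, and `pad_id = 0` is an assumption that matches the `[PAD]` id of bert-base-uncased.

```python
import torch

pad_id = 0  # assumed padding id
input_ids = torch.tensor([[101, 7592, 102, 0, 0],
                          [101, 2129, 2024, 2017, 102]])

# Shape (N, 1, 1, key_len): broadcasts over heads and over query positions
mask = (input_ids != pad_id).unsqueeze(1).unsqueeze(2)

# Stand-in for the attention scores: all zeros, so unmasked keys share weight equally
heads = 2
seq_len = input_ids.size(1)
energy = torch.zeros(input_ids.size(0), heads, seq_len, seq_len)

energy = energy.masked_fill(mask == 0, float("-1e20"))
attention = torch.softmax(energy, dim=3)

print(mask.shape)          # torch.Size([2, 1, 1, 5])
print(attention[0, 0, 0])  # tensor([0.3333, 0.3333, 0.3333, 0.0000, 0.0000])
```

The resulting `mask` could be passed as the second argument of the simplified `BERT.forward` in place of `None`.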

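III. Multi-Class Text Classification with BERT

To build a BERT-based multi-class text classifier, we use the Transformers library and PyTorch. The complete code below covers data loading, model training, evaluation, and plotting the loss and accuracy curves.

The model is BertForSequenceClassification.

First, install the required libraries: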

pip install transformers torch scikit-learn matplotlib pandas

The complete code follows:

import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import numpy as np

# Load the data
df = pd.read_csv('data.csv')  # assumes a file named data.csv with a text column 'seq' and a 'Label' column
df['Label'] = df['Label'].astype('category').cat.codes  # encode each label as an integer class id

# Dataset class
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        # Calling the tokenizer adds [CLS]/[SEP], then pads or truncates to max_len
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'text': text,
            # flatten() drops the batch dimension of size 1 added by return_tensors='pt'
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Prepare the data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
MAX_LEN = 128
BATCH_SIZE = 16

train_texts, val_texts, train_labels, val_labels = train_test_split(df['seq'], df['Label'], test_size=0.2, random_state=42)

train_dataset = TextDataset(train_texts.tolist(), train_labels.tolist(), tokenizer, MAX_LEN)
val_dataset = TextDataset(val_texts.tolist(), val_labels.tolist(), tokenizer, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# Model definition
NUM_LABELS = df['Label'].nunique()  # number of classes found in the data
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=NUM_LABELS)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Training parameters
EPOCHS = 3
optimizer = optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# Training and evaluation functions
def train_epoch(model, data_loader, criterion, optimizer, device, scheduler, n_examples):
    model.train()
    losses = []
    correct_predictions = 0

    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        labels = d["label"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        loss = criterion(outputs.logits, labels)
        # .item() keeps the running count a plain Python number, so the
        # returned accuracy can be stored and plotted without tensor conversion
        correct_predictions += (outputs.logits.argmax(dim=1) == labels).sum().item()
        losses.append(loss.item())

        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()  # advance the learning-rate schedule once per batch
        optimizer.zero_grad()

    return correct_predictions / n_examples, np.mean(losses)

def eval_model(model, data_loader, criterion, device, n_examples):
    model.eval()
    losses = []
    correct_predictions = 0

    # No gradients are needed for evaluation
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            labels = d["label"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            loss = criterion(outputs.logits, labels)
            correct_predictions += (outputs.logits.argmax(dim=1) == labels).sum().item()
            losses.append(loss.item())

    return correct_predictions / n_examples, np.mean(losses)

# Train the model
history = {
    'train_acc': [],
    'train_loss': [],
    'val_acc': [],
    'val_loss': []
}

best_accuracy = 0

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch(
        model,
        train_loader,
        criterion,
        optimizer,
        device,
        None,
        len(train_dataset)
    )

    print(f'Train loss {train_loss} accuracy {train_acc}')

    val_acc, val_loss = eval_model(
        model,
        val_loader,
        criterion,
        device,
        len(val_dataset)
    )

    print(f'Val loss {val_loss} accuracy {val_acc}')
    print()

    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)

    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc

# Plot the loss and accuracy curves
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.plot(history['train_loss'], label='train loss')
plt.plot(history['val_loss'], label='val loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve')

plt.subplot(1, 2, 2)
plt.plot(history['train_acc'], label='train accuracy')
plt.plot(history['val_acc'], label='val accuracy')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Accuracy Curve')

plt.show()

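The code performs the following steps:

Load and preprocess the data.
Define a PyTorch Dataset class for loading the data.
Initialize the BERT tokenizer and model.
Split the dataset and create the data loaders.
Define the training and evaluation functions.
Train the model, record the loss and accuracy of each epoch, and save the best model.
Plot the training and validation loss and accuracy curves.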
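
The training loop above passes `None` for the `scheduler` argument of `train_epoch`, so the learning rate stays constant. A common choice when fine-tuning BERT is a linear warmup followed by linear decay. The sketch below computes that multiplier in plain Python; the function name and the step counts are made up for illustration. In practice, `get_linear_schedule_with_warmup` from the transformers library provides a ready-made scheduler of this shape.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5                        # same base learning rate as the training code
total_steps, warmup_steps = 300, 30   # hypothetical: 3 epochs of 100 batches, 10% warmup

for step in (0, 15, 30, 165, 300):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps, total_steps))
```

The multiplier reaches 1.0 exactly at the end of warmup and falls back to 0.0 at the last step.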

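In a text classification task, data flows through BERT in several steps. They are described below, each with its formula.

Input preprocessing

Suppose the input text is T. BERT's tokenizer converts it into ids from the vocabulary.

Input text: T = "Hello, how are you?"
Token ids after tokenization: input_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102] (101 and 102 are the special tokens [CLS] and [SEP]; the question mark is a token of its own)
Attention mask: attention_mask = [1, 1, 1, 1, 1, 1, 1, 1]

In formula form:

Input sequence: $X = [x_1, x_2, \ldots, x_n]$, where $x_i$ is the embedding vector of the $i$-th token.
Attention mask: $A = [a_1, a_2, \ldots, a_n]$, where $a_i$ is the mask value of the $i$-th token (1 for a real token, 0 for padding).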
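
When sequences are padded to a fixed length, the mask is what tells the model which positions are real. A minimal plain-Python sketch follows; the helper name `pad_to_max_length` and `pad_id=0` are assumptions for illustration. A real tokenizer also keeps the closing `[SEP]` when it truncates, which this naive slice does not.

```python
def pad_to_max_length(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# [CLS] hello [SEP], padded to length 6
input_ids, attention_mask = pad_to_max_length([101, 7592, 102], max_len=6)
print(input_ids)       # [101, 7592, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0, 0]
```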

BERT encoder

The BERT encoder is a stack of Transformer layers, each combining self-attention with a feed-forward network.

Output of each layer: $H_l = \text{TransformerLayer}(H_{l-1}, A)$, where $H_{l-1}$ is the output of the previous layer (the initial input is $X$) and $A$ is the attention mask.
Final hidden state: $H_L$, where $L$ is the number of layers in BERT.

In formula form:

$$H_0 = X$$
$$H_l = \text{TransformerLayer}(H_{l-1}, A) \quad \text{for} \quad l = 1, 2, \ldots, L$$
$$H_L = \text{BERT}(X, A)$$

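Classification

The vector of the first token (the [CLS] token) in BERT's last hidden state $H_L$ is used for classification.

Extract the [CLS] representation: $H_{CLS} = H_L[0]$
Classify with a fully connected layer: $\text{logits} = W \cdot H_{CLS} + b$, where $W$ is the weight matrix and $b$ is the bias vector. (In the transformers implementation, BertForSequenceClassification first passes the [CLS] vector through a pooler, a dense layer with tanh, and through dropout.)
Apply softmax to obtain the class probabilities: $P(y|X) = \text{softmax}(\text{logits})$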
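
These three steps can be traced on toy numbers. The sketch below is plain Python; the 3-dimensional `h_cls` vector and the 2-class weights are made-up values, and `classify` is a helper introduced only for this illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_cls, W, b):
    """logits = W . h_cls + b, followed by softmax; W has one row per class."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) + bias for row, bias in zip(W, b)]
    return logits, softmax(logits)

h_cls = [0.5, -1.0, 2.0]                    # toy [CLS] vector
W = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]      # 2 classes x 3 hidden dimensions
b = [0.0, 0.0]

logits, probs = classify(h_cls, W, b)
predicted = max(range(len(probs)), key=probs.__getitem__)

print(logits)     # [0.5, 2.0]
print(probs)      # roughly [0.18, 0.82]
print(predicted)  # 1
```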

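Using the [CLS] vector of the last hidden state as the representation for classification is a design choice of BERT, not something every Transformer model does. Other Transformer models may obtain the classification representation in different ways.

Some Transformer variants and other NLP models use strategies such as:

Mean pooling: average the representation vectors of all tokens to get a single sentence representation.
Max pooling: take the element-wise maximum over the representation vectors of all tokens as the sentence representation.
Self-attention pooling: use an attention mechanism to weight tokens dynamically by importance and form the sentence representation from the weighted sum.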
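
The first two strategies can be compared on a toy tensor. This is a minimal PyTorch sketch with made-up values; it masks out padding before pooling, which is an assumption about how one would typically apply these strategies to padded batches.

```python
import torch

# (batch=1, seq_len=3, hidden=2); the last position is padding
hidden = torch.tensor([[[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

mask = attention_mask.unsqueeze(-1).bool()  # (1, 3, 1), broadcasts over the hidden dimension

# Mean pooling: sum the real tokens and divide by how many there are
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Max pooling: padding is set to -inf so it can never win the maximum
max_pooled = hidden.masked_fill(~mask, float("-inf")).max(dim=1).values

print(mean_pooled)  # tensor([[2., 3.]])
print(max_pooled)   # tensor([[3., 4.]])
```

Without the mask, the padding row `[9, 9]` would dominate both results.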

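For example, to make a BERT sequence classifier use mean pooling, the output part of the model has to change so that the representation vectors of all tokens are averaged. The example below shows a classifier built on BertModel that does this in place of BertForSequenceClassification: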

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertForMeanPoolingSequenceClassification(nn.Module):
    def __init__(self, num_classes, bert_model_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(self.bert.config.hidden_dropout_prob)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state                   # (batch, seq_len, hidden_size)
        # Mean pooling over real tokens only: padding positions are zeroed
        # out and left out of the token count
        mask = attention_mask.unsqueeze(-1).type_as(hidden)  # (batch, seq_len, 1)
        pooled_output = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

# Usage example
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMeanPoolingSequenceClassification(num_classes=2, bert_model_name='bert-base-uncased')

input_text = ["Hello, how are you?", "Fine, thank you!"]
inputs = tokenizer(input_text, padding=True, truncation=True, return_tensors="pt")

# outputs holds the logits, one row per sentence and one column per class
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

In this example we define a new model, BertForMeanPoolingSequenceClassification, which uses mean pooling to obtain the sentence representation. Its forward method averages the token representation vectors and passes the result through a fully connected layer for classification.

In formula form, the default [CLS]-based classification is:

$$H_{CLS} = H_L[0]$$
$$\text{logits} = W \cdot H_{CLS} + b$$
$$P(y|X) = \text{softmax}(\text{logits})$$

Loss function

Classification tasks usually use the cross-entropy loss to measure the gap between the predicted probabilities and the true labels.

Cross-entropy loss: $L = -\sum_{i=1}^{C} y_i \log P(y_i|X)$, where $C$ is the number of classes, $y_i$ is the true label for class $i$ (one-hot encoded), and $P(y_i|X)$ is the predicted probability of class $i$.

In formula form:

$$L = -\sum_{i=1}^{C} y_i \log P(y_i|X)$$
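
A small numeric check of this formula in plain Python, using made-up probabilities. With a one-hot label the sum collapses to the negative log of the probability assigned to the true class. Note that `nn.CrossEntropyLoss`, used in the training code, takes raw logits and integer class ids and applies the log-softmax internally.

```python
import math

def cross_entropy(probs, one_hot):
    """L = -sum_i y_i * log(p_i); terms with y_i = 0 contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

probs = [0.1, 0.7, 0.2]   # predicted class probabilities (illustrative)
one_hot = [0, 1, 0]       # the true class is class 1

loss = cross_entropy(probs, one_hot)
print(round(loss, 4))  # 0.3567
```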

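Summary

The data flow of BERT in a text classification task can be summarized as follows:

The input text is encoded by the tokenizer into input_ids and attention_mask.
The BERT encoder computes the last hidden state $H_L$.
The [CLS] representation is extracted from $H_L$ and passed through a fully connected layer to compute the class logits.
Softmax is applied to obtain the class probabilities.
The cross-entropy loss is computed.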

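IV. What Is New in BERT

BERT (Bidirectional Encoder Representations from Transformers) is indeed built on the Transformer architecture, but its contribution goes beyond applying the Transformer to a particular task. Its key innovations relative to the original Transformer paper are:

Bidirectional encoding: One of BERT's core innovations is its use of a bidirectional Transformer encoder, in contrast to autoregressive language models such as the Transformer decoder or OpenAI's GPT. Traditional language models are usually unidirectional: when predicting a word they see only the words before it (left to right) or only the words after it (right to left). Through the Masked Language Model (MLM) objective, which hides about 15% of the input tokens and asks the model to recover them, BERT uses both left and right context during training, so the model learns bidirectional dependencies between words.
Pre-training and fine-tuning: BERT is first pre-trained on large amounts of unlabeled text and then fine-tuned on a specific task. This greatly reduces the need to design a task-specific architecture: adding one extra output layer on top of the pre-trained model is enough to adapt it to downstream tasks such as question answering, sentiment analysis, and named entity recognition, and it markedly improves performance on them.
Multi-task learning: Besides MLM, BERT is pre-trained on Next Sentence Prediction (NSP), which aims to capture the relationship between a pair of texts and to strengthen the model's grasp of discourse coherence. Later work such as RoBERTa suggests that NSP may not be a key factor in the performance gains, but it reflects the exploration of multi-task learning in BERT's design.
Large-scale data: BERT was pre-trained on a very large corpus, BooksCorpus (about 800M words) and English Wikipedia (about 2,500M words), which helps the model learn a broad range of language patterns and knowledge.
Training details: BERT was pre-trained with large batches (256 sequences of up to 512 tokens) and a learning rate that warms up and then decays linearly, along with other hyperparameter choices. These settings help the model learn high-quality representations more efficiently.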
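
The masking step of the MLM objective can be sketched in a few lines of plain Python. Following the procedure described in the BERT paper, about 15% of tokens are selected; of those, 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. The function name, the tiny `vocab` list, and the fixed seed are assumptions for illustration, and real implementations work on token ids instead of strings.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels[i] is the original token where a prediction is required."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; other tokens are selected with probability mask_prob
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok                      # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"             # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)    # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

tokens = ["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]
vocab = ["hello", "how", "are", "you", "fine", "thank"]
inputs, labels = mask_tokens(tokens, vocab, seed=0)
print(inputs)
print(labels)
```

Because the model cannot tell which unmasked tokens were selected or replaced, it has to build a contextual representation for every position, using both left and right context.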

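Compared with the original Transformer paper, BERT's contribution lies in showing the great potential of bidirectional Transformers for language understanding, and in showing that pre-training followed by fine-tuning can greatly improve generalization. This idea went on to shape the direction of the whole NLP field. These innovations not only brought a marked improvement in model performance but also laid the groundwork for later models such as XLNet, RoBERTa, and T5.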