Transformer for Document Classification
Before reading this, you should be familiar with how the Transformer works; see the companion post "Transformer详解" for the details.
Model structure:
- Input layer: `[batch_size, sentence_len, embed_size]`
- Word embedding layer: `[batch_size, sentence_len, embed_size]`
- position_embedding: `[batch_size, sentence_len, embed_size]`
- Encoder layer
- Fully connected layer
position_embedding
```py
class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        self.pe = torch.tensor(
            [[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = np.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = np.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = x + nn.Parameter(self.pe, requires_grad=False).to(self.device)  # torch.Size([128, 32, 300])
        out = self.dropout(out)
        return out
```
- Injecting order information: in the Transformer, the positional encoding represents where each element sits in the input sequence. Because the model is built entirely on attention and has no built-in notion of order, positional encodings are needed to supply that information; they are a way of injecting position into the input representation.
- Preserving position across layers: each position gets a unique encoding, and it is processed together with the input representation, so even after several attention layers the positional information, and with it the relative order of the sequence, is not lost.
- Periodic structure: the encoding uses sine and cosine functions, which gives it a periodic structure and lets it represent positions for sequences of varying length (in this implementation the table is precomputed up to `pad_size`).
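Written out, the encoding built by the code above is the standard sinusoidal form, with $pos$ the position, $i$ the dimension-pair index, and $d$ the embedding size (`embed`):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$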
Encoder layer
```py
class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):  # torch.Size([128, 32, 300])
        out = self.attention(x)       # torch.Size([128, 32, 300])
        out = self.feed_forward(out)  # torch.Size([128, 32, 300])
        return out
```
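Each Encoder block is therefore a post-norm Transformer encoder layer: self-attention, add & norm, feed-forward, add & norm. For comparison only (this is not the code used in this post), a roughly similar layer can be created with PyTorch's built-in module in recent versions; the hyperparameters below mirror the config used later:

```py
import torch.nn as nn

# Roughly comparable off-the-shelf encoder layer (post-norm, ReLU feed-forward).
layer = nn.TransformerEncoderLayer(d_model=300, nhead=5, dim_feedforward=1024,
                                   dropout=0.5, batch_first=True)
```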
Multi_Head_Attention
- Q, K and V in the attention mechanism
  - Query: poses the question, i.e. which information we want to retrieve.
  - Key: identifies the information; there is one key for each position in the input sequence.
  - Value: stores the information, the actual content associated with each position in the input sequence.
- Suppose the input tensor `x` has shape `[batch_size, seq_length, dim_model]`, where:
  - `batch_size`: the batch size (how many samples are processed at once).
  - `seq_length`: the sequence length (for example, the length of a sentence).
  - `dim_model`: the model dimension (for example, the embedding dimension).
- Through the linear transformation `self.fc_Q`, the input `x` is turned into the query tensor `Q`: `Q = self.fc_Q(x)`. At this point `Q` still has shape `[batch_size, seq_length, dim_model]`.
- Multi-head split
  To compute multi-head attention, the `dim_model` dimension is split into several heads (`num_head`), each of size `dim_head`. Here `dim_head = dim_model // num_head`, so `dim_model = num_head * dim_head`. The shape we want is `[batch_size, seq_length, num_head, dim_head]`, but for the computation that follows it is adjusted one step further.
- Reshaping
  `Q.view(batch_size * num_head, -1, dim_head)` converts `Q` from `[batch_size, seq_length, dim_model]` to `[batch_size * num_head, seq_length, dim_head]`. Concretely:
  - Shape adjustment:
    - `batch_size` and `num_head` are merged into a single dimension: `batch_size * num_head`.
    - `dim_model` is split into `num_head` and `dim_head`.
  - Meaning of the dimensions:
    - `batch_size * num_head`: each attention head is processed as if it were an independent sample.
    - `seq_length`: the sequence length is unchanged.
    - `dim_head`: the dimension of each head.

  The final shape `[batch_size * num_head, seq_length, dim_head]` suits the parallel multi-head computation: every head computes its attention weights independently, which improves both efficiency and the model's representational power. A quick shape check follows below.
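A minimal, self-contained sketch of this reshape, using random tensors with the same sizes as the shape comments in the code below (batch_size = 128, seq_length = 32, dim_model = 300, num_head = 5); the tensors are purely illustrative:

```py
import torch
import torch.nn as nn

batch_size, seq_length, dim_model, num_head = 128, 32, 300, 5
dim_head = dim_model // num_head                 # 60

x = torch.randn(batch_size, seq_length, dim_model)
fc_Q = nn.Linear(dim_model, num_head * dim_head)

Q = fc_Q(x)                                      # [128, 32, 300]
Q = Q.view(batch_size * num_head, -1, dim_head)
print(Q.shape)                                   # torch.Size([640, 32, 60])
```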
- Scaling factor: $scale = \frac{1}{\sqrt{dim\_head}}$. What the scaling factor does:
  - Prevents the dot products from becoming too large: the raw scores grow with the head dimension, and very large scores push the $softmax$ into saturation.
  - Stabilizes gradients: a saturated $softmax$ produces near-zero gradients, which slows or stalls learning.
  - Improves model performance: with scaling, the attention scores are distributed more evenly and the attention weights after the $softmax$ are smoother, so the model learns useful attention weights more stably during training, which improves performance.
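Putting the pieces together, what the code below computes is the standard scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{dim\_head}}\right)V$$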
```py
class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        # print("Multi_Head_Attention--x", x.shape)  # [128, 32, 300]
        batch_size = x.size(0)
        Q = self.fc_Q(x)  # [128, 32, 300]
        K = self.fc_K(x)  # [128, 32, 300]
        V = self.fc_V(x)  # [128, 32, 300]
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        K = K.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        V = V.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        scale = K.size(-1) ** -0.5  # scaling factor
        context = self.attention(Q, K, V, scale)  # [640, 32, 60]
        context = context.view(batch_size, -1, self.dim_head * self.num_head)  # [128, 32, 300]
        out = self.fc(context)      # [128, 32, 300]
        out = self.dropout(out)     # [128, 32, 300]
        out = out + x               # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)  # [128, 32, 300]
        return out
```
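A quick sanity check of the block above with a random input (this assumes the Scaled_Dot_Product_Attention class shown in the next snippet is already defined, since Multi_Head_Attention uses it internally):

```py
import torch

mha = Multi_Head_Attention(dim_model=300, num_head=5, dropout=0.1)
x = torch.randn(128, 32, 300)
out = mha(x)
print(out.shape)  # torch.Size([128, 32, 300]), same shape as the input
```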
Scaled_Dot_Product_Attention
```py
class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention'''
    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        # dot product of the query tensor Q and the key tensor K
        attention = torch.matmul(Q, K.permute(0, 2, 1))  # [640, 32, 32]
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        # softmax over the attention scores gives the normalized attention weights
        attention = F.softmax(attention, dim=-1)  # [640, 32, 32]
        # multiply the attention weights with the value tensor V to get the context vectors
        context = torch.matmul(attention, V)  # [640, 32, 60]
        return context
```
Dot-product attention scores:

$$attention\_scores = Q \cdot K^{T}$$

where $Q$ and $K$ are the query and key tensors produced above. The line

```py
context = torch.matmul(attention, V)  # [640, 32, 60]
```

then multiplies the attention weights with the value tensor `V` to obtain the context vectors: the attention weights have shape `[batch_size, len_Q, len_K]`, the value tensor `V` has shape `[batch_size, len_V, dim_V]`, and the resulting `context` tensor has shape `[batch_size, len_Q, dim_V]`.
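A minimal shape check for this module, using random tensors with the shapes from the comments above (it assumes the `Scaled_Dot_Product_Attention` class defined earlier is in scope):

```py
import torch

attn = Scaled_Dot_Product_Attention()
Q = torch.randn(640, 32, 60)
K = torch.randn(640, 32, 60)
V = torch.randn(640, 32, 60)
context = attn(Q, K, V, scale=60 ** -0.5)
print(context.shape)  # torch.Size([640, 32, 60])
```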
Position_wise_Feed_Forward
```py
class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):  # x.shape [128, 32, 300]
        out = self.fc1(x)        # [128, 32, 1024]
        out = F.relu(out)        # [128, 32, 1024]
        out = self.fc2(out)      # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x            # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)
        return out
```
What this block contributes:
- More expressive power: the nonlinear transformation lets the model learn more complex features.
- Gradient flow: the residual connection keeps gradients flowing and counteracts vanishing gradients.
- Training stability and speed: layer normalization keeps the output distribution of each layer stable during training.
- Better generalization: dropout reduces overfitting and improves performance on new data.
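Written out as formulas (this simply restates what the forward pass above does), each position in the sequence passes through:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2, \qquad \mathrm{out} = \mathrm{LayerNorm}\big(x + \mathrm{Dropout}(\mathrm{FFN}(x))\big)$$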
Full code:
```py
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import copy


class Config(object):
    """Configuration parameters"""

    def __init__(self, dataset, embedding):
        self.model_name = 'Transformer'
        self.train_path = dataset + '/data/train.txt'                               # training set
        self.dev_path = dataset + '/data/dev.txt'                                   # validation set
        self.test_path = dataset + '/data/test.txt'                                 # test set
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]             # class names
        self.vocab_path = dataset + '/data/vocab.pkl'                               # vocabulary
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'       # saved model checkpoint
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32')) \
            if embedding != 'random' else None                                       # pretrained word vectors
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # device

        self.dropout = 0.5                 # dropout rate
        self.require_improvement = 2000    # stop training early if no improvement for this many batches
        self.num_classes = len(self.class_list)  # number of classes
        self.n_vocab = 0                   # vocabulary size, assigned at runtime
        self.num_epochs = 20               # number of epochs
        self.batch_size = 128              # mini-batch size
        self.pad_size = 32                 # sentence length (shorter sentences padded, longer ones truncated)
        self.learning_rate = 5e-4          # learning rate
        self.embed = self.embedding_pretrained.size(1) \
            if self.embedding_pretrained is not None else 300  # word-vector dimension
        self.dim_model = 300
        self.hidden = 1024
        self.last_hidden = 512
        self.num_head = 5
        self.num_encoder = 2
'''Attention Is All You Need'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)

        self.postion_embedding = Positional_Encoding(config.embed, config.pad_size, config.dropout, config.device)
        self.encoder = Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
        self.encoders = nn.ModuleList([
            copy.deepcopy(self.encoder)
            # Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
            for _ in range(config.num_encoder)])

        self.fc1 = nn.Linear(config.pad_size * config.dim_model, config.num_classes)
        # self.fc2 = nn.Linear(config.last_hidden, config.num_classes)
        # self.fc1 = nn.Linear(config.dim_model, config.num_classes)

    def forward(self, x):
        out = self.embedding(x[0])         # torch.Size([128, 32, 300])
        out = self.postion_embedding(out)  # torch.Size([128, 32, 300])
        for encoder in self.encoders:
            out = encoder(out)
        out = out.view(out.size(0), -1)    # torch.Size([128, 9600])
        out = self.fc1(out)                # torch.Size([128, 10])
        return out
class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):  # torch.Size([128, 32, 300])
        out = self.attention(x)       # torch.Size([128, 32, 300])
        out = self.feed_forward(out)  # torch.Size([128, 32, 300])
        return out


class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        self.pe = torch.tensor(
            [[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = np.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = np.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = x + nn.Parameter(self.pe, requires_grad=False).to(self.device)  # torch.Size([128, 32, 300])
        out = self.dropout(out)
        return out
class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention'''
    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        # dot product of the query tensor Q and the key tensor K
        attention = torch.matmul(Q, K.permute(0, 2, 1))  # [640, 32, 32]
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        # softmax over the attention scores gives the normalized attention weights
        attention = F.softmax(attention, dim=-1)  # [640, 32, 32]
        # multiply the attention weights with the value tensor V to get the context vectors
        context = torch.matmul(attention, V)  # [640, 32, 60]
        return context
class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        # print("Multi_Head_Attention--x", x.shape)  # [128, 32, 300]
        batch_size = x.size(0)
        Q = self.fc_Q(x)  # [128, 32, 300]
        K = self.fc_K(x)  # [128, 32, 300]
        V = self.fc_V(x)  # [128, 32, 300]
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        K = K.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        V = V.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        # if mask:  # TODO
        #     # mask = mask.repeat(self.num_head, 1, 1)  # TODO change this
        scale = K.size(-1) ** -0.5  # scaling factor
        context = self.attention(Q, K, V, scale)  # [640, 32, 60]
        context = context.view(batch_size, -1, self.dim_head * self.num_head)  # [128, 32, 300]
        out = self.fc(context)      # [128, 32, 300]
        out = self.dropout(out)     # [128, 32, 300]
        out = out + x               # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)  # [128, 32, 300]
        return out
class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):  # x.shape [128, 32, 300]
        out = self.fc1(x)        # [128, 32, 1024]
        out = F.relu(out)        # [128, 32, 1024]
        out = self.fc2(out)      # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x            # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)
        return out
```
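A minimal smoke test of the model defined above. Instead of the real Config (which reads dataset files), it uses a hypothetical stand-in with a made-up vocabulary size and randomly initialized embeddings, purely to confirm the shapes:

```py
import torch


class DummyConfig:
    """Hypothetical stand-in for Config, only for a shape check."""
    embedding_pretrained = None
    n_vocab = 5000
    embed = 300
    dropout = 0.1
    pad_size = 32
    device = torch.device('cpu')
    dim_model = 300
    hidden = 1024
    last_hidden = 512
    num_head = 5
    num_encoder = 2
    num_classes = 10


model = Model(DummyConfig())
tokens = torch.randint(0, DummyConfig.n_vocab, (128, 32))  # [batch_size, pad_size] token ids
logits = model((tokens, None))                             # the model reads x[0]
print(logits.shape)                                        # torch.Size([128, 10])
```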