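Contents
[1. Data Processing](#1. Data Processing)
[2. Model Construction](#2. Model Construction)
[3. Model Training](#3. Model Training)
[4. Model Evaluation](#4. Model Evaluation)
[5. Model Prediction](#5. Model Prediction)
[6. Unidirectional LSTM with Torch](#6. Unidirectional LSTM with Torch)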
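1. Data Processing
Movie reviews can carry rich emotions, such as like and dislike. Sentiment analysis is a text classification problem: given a piece of text, decide whether the sentiment it expresses is positive or negative.
This exercise uses the IMDB movie review dataset and a bidirectional LSTM to perform sentiment analysis on movie reviews.
1.1 Dataset Download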
Bag of Words Meets Bags of Popcorn | Kaggle
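Note that this Kaggle dataset is not the same as the one used in the lab book: its test set has no labels. I recommend downloading the dataset from the PaddlePaddle (飞桨) community instead; register an account if you do not have one.
On the project page, click "Start Environment" in the upper-right corner.
Once the project opens on the notebook page, click dataset to download the dataset and the vocabulary.
1.2 Data Loading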
```python
import os


# Load the dataset
def load_imdb_data(path):
    """Load the train/dev/test splits; each line holds a label, a tab, then the sentence."""
    assert os.path.exists(path), f"Path {path} does not exist!"

    def read_split(filename):
        examples = []
        with open(os.path.join(path, filename), "r", encoding="utf-8") as fr:
            for line in fr:
                try:
                    # Lowercase the line, then split once on the first tab
                    sentence_label, sentence = line.strip().lower().split("\t", maxsplit=1)
                    examples.append((sentence, sentence_label))
                except ValueError:
                    # A line without a tab cannot be unpacked into (label, sentence)
                    print(f"Skipping invalid line: {line.strip()}")
        return examples

    # Read train.txt, dev.txt and test.txt
    trainset = read_split("train.txt")
    devset = read_split("dev.txt")
    testset = read_split("test.txt")
    return trainset, devset, testset


# Load the IMDB dataset
train_data, dev_data, test_data = load_imdb_data("./dataset/")
# # Print one loaded example to confirm the data was read correctly
# print(train_data[4])  # the fifth example
```
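To make the line format concrete, here is a minimal sketch of how a single raw line is parsed. The helper name `parse_line` and the sample strings are hypothetical and only illustrate the split on the first tab; they are not part of the original loader.

```python
def parse_line(line):
    """Split one raw line into (sentence, label); return None if it has no tab."""
    parts = line.strip().lower().split("\t", maxsplit=1)
    if len(parts) != 2:
        return None
    label, sentence = parts
    return sentence, label


# maxsplit=1 keeps any later tab inside the sentence itself
good = parse_line("1\tA Great Movie\twith a tab inside\n")
bad = parse_line("a line without any tab\n")
print(good)
print(bad)
```

Because only the first tab is used as the separator, a review that happens to contain a tab is still kept as one sentence.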
1.2.1 Reading the Data
```python
from torch.utils.data import Dataset


class IMDBDataset(Dataset):
    def __init__(self, examples, word2id_dict):
        super(IMDBDataset, self).__init__()
        # Vocabulary that maps each word to its integer ID
        self.word2id_dict = word2id_dict
        # The loaded examples, converted to ID sequences
        self.examples = self.words_to_id(examples)

    def words_to_id(self, examples):
        tmp_examples = []
        for idx, example in enumerate(examples):
            seq, label = example
            # Map each word to its ID; words missing from the vocabulary get the ID of [UNK]
            seq = [self.word2id_dict.get(word, self.word2id_dict['[UNK]']) for word in seq.split(" ")]
            label = int(label)
            tmp_examples.append([seq, label])
        return tmp_examples

    def __getitem__(self, idx):
        seq, label = self.examples[idx]
        return seq, label

    def __len__(self):
        return len(self.examples)
```
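The [UNK] fallback can be illustrated on its own. This is a small sketch with a hypothetical toy vocabulary (`toy_vocab`) and helper (`encode`); the real vocabulary is loaded from vocab.txt.

```python
# Hypothetical toy vocabulary; the real one comes from vocab.txt
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "this": 2, "movie": 3, "is": 4, "great": 5}


def encode(sentence, vocab):
    """Map whitespace-separated words to IDs, falling back to [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(word, unk_id) for word in sentence.split(" ")]


ids = encode("this movie is astonishing", toy_vocab)
print(ids)  # [2, 3, 4, 1]
```

The word "astonishing" is not in the toy vocabulary, so it is mapped to the ID of [UNK].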
1.2.2 Vocabulary Conversion
```python
def load_vocab(vocab_path):
    """Read one word per line and assign IDs in file order, starting from 0."""
    word2id = {}
    with open(vocab_path, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            word = line.strip()
            word2id[word] = idx  # each word gets the index of its line
    return word2id


word2id_dict = load_vocab("./dataset/vocab.txt")
# Instantiate the Datasets
train_set = IMDBDataset(train_data, word2id_dict)
dev_set = IMDBDataset(dev_data, word2id_dict)
test_set = IMDBDataset(test_data, word2id_dict)
# print('Number of training examples:', len(train_set))
# print('Sample example:', train_set[4])
```
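1.2.3 Batching the Data
After building the Dataset class, we construct the corresponding DataLoader for iterating over mini-batches. Unlike the DataLoader used in earlier chapters, this one needs two additional features:
- Length limit: sequences must be kept within a maximum length, so that a few very long examples do not hurt training as a whole.
- Length padding: neural network models usually require all sequences in a batch to have the same length, but batching typically puts sequences of different lengths together, so the shorter ones must be padded.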
```python
from functools import partial

import torch
from torch.utils.data import DataLoader


def collate_fn(batch_data, pad_val=0, max_seq_len=256):
    seqs, seq_lens, labels = [], [], []
    max_len = 0
    for example in batch_data:
        seq, label = example
        # Truncate the sequence to at most max_seq_len tokens
        seq = seq[:max_seq_len]
        # Store the truncated sequence in seqs
        seqs.append(seq)
        seq_lens.append(len(seq))
        labels.append(label)
        # Track the longest sequence in the batch
        max_len = max(max_len, len(seq))
    # Pad every sequence to the maximum length
    for i in range(len(seqs)):
        seqs[i] = seqs[i] + [pad_val] * (max_len - len(seqs[i]))
    # Return the data as tensors
    return (torch.tensor(seqs), torch.tensor(seq_lens)), torch.tensor(labels)


# # Test collate_fn
# max_seq_len = 5
# batch_data = [[[1, 2, 3, 4, 5, 6], 1], [[2, 4, 6], 0]]
# (seqs, seq_lens), labels = collate_fn(batch_data, pad_val=word2id_dict["[PAD]"], max_seq_len=max_seq_len)
# print("seqs: ", seqs)
# print("seq_lens: ", seq_lens)
# print("labels: ", labels)

max_seq_len = 256
batch_size = 128
# Use partial to fix the keyword arguments of collate_fn
collate_fn = partial(collate_fn, pad_val=word2id_dict["[PAD]"], max_seq_len=max_seq_len)
# Create the PyTorch DataLoaders
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=False, collate_fn=collate_fn)
dev_loader = DataLoader(dev_set, batch_size=batch_size, shuffle=False, drop_last=False, collate_fn=collate_fn)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, drop_last=False, collate_fn=collate_fn)
```
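Running the commented-out test with max_seq_len = 5 shows that the original sequence of length 6 is truncated to 5 and the original sequence of length 3 is padded to 5, while the returned sequence lengths count only the non-[PAD] tokens.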
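The truncation and padding behaviour described above can be checked with a self-contained sketch. The function `pad_and_truncate` is a hypothetical compact rewrite of the same logic, and a pad value of 0 is assumed here instead of looking up [PAD] in the vocabulary.

```python
import torch


def pad_and_truncate(batch_data, pad_val=0, max_seq_len=5):
    """Truncate each sequence to max_seq_len, then pad to the batch maximum."""
    seqs = [seq[:max_seq_len] for seq, _ in batch_data]
    labels = [label for _, label in batch_data]
    seq_lens = [len(seq) for seq in seqs]
    max_len = max(seq_lens)
    padded = [seq + [pad_val] * (max_len - len(seq)) for seq in seqs]
    return (torch.tensor(padded), torch.tensor(seq_lens)), torch.tensor(labels)


batch_data = [[[1, 2, 3, 4, 5, 6], 1], [[2, 4, 6], 0]]
(seqs, seq_lens), labels = pad_and_truncate(batch_data, pad_val=0, max_seq_len=5)
print(seqs.tolist())      # [[1, 2, 3, 4, 5], [2, 4, 6, 0, 0]]
print(seq_lens.tolist())  # [5, 3]
print(labels.tolist())    # [1, 0]
```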
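Next, we pass collate_fn to the DataLoader as a callback, so that each batch is processed by collate_fn before it is returned. Note that functools.partial is used to fix the keyword arguments of collate_fn and return a new function object, which is then used as collate_fn.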
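As a quick illustration of what `functools.partial` does, here is a minimal sketch on a hypothetical function (`scale_and_shift` is only a stand-in for a callback with keyword arguments, not something from this project).

```python
from functools import partial


def scale_and_shift(values, scale=1, shift=0):
    """Hypothetical stand-in for a callback that takes keyword arguments."""
    return [v * scale + shift for v in values]


# Fix the keyword arguments once; the result still takes one positional argument
fixed = partial(scale_and_shift, scale=2, shift=1)
result = fixed([1, 2, 3])
print(result)  # [3, 5, 7]
```

The DataLoader only ever calls its callback with a single argument (the list of examples), which is why the extra keyword arguments have to be bound in advance.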
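Because the X returned here is a tuple, moving it to the GPU takes one extra conversion step. To avoid large changes to the RunnerV3 class, the tensors inside X are moved to the GPU inside the model instead. The GPU-related code is in Section 2.2 (Model Summary).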
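2. Model Construction
The overall model structure is shown in the figure.
2.1 Pooling Layer Operator
The pooling operator averages the hidden states of the bidirectional LSTM layer over all positions and uses the result as the representation of the whole sentence. We implement an AveragePooling operator for this: it first builds a mask matrix from the sequence-length vector to mask out the vectors at [PAD] positions, then sums the remaining vectors of the sequence and takes their mean.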
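```python
import torch
import torch.nn as nn


class AveragePooling(nn.Module):
    def __init__(self):
        super(AveragePooling, self).__init__()

    def forward(self, sequence_output, sequence_length):
        # sequence_output: (batch_size, max_len, feature_size); sequence_length: (batch_size,)
        # Expand the lengths to (batch_size, 1) so they broadcast against positions
        sequence_length = sequence_length.unsqueeze(-1).float()
        max_len = sequence_output.shape[1]
        # mask[b, t] is True where position t holds a real token of sequence b
        mask = torch.arange(max_len, device=sequence_output.device) < sequence_length
        mask = mask.float().unsqueeze(-1)
        # Zero out the vectors at [PAD] positions
        sequence_output = sequence_output * mask
        # Sum over time and divide by the true length to get the mean
        batch_mean_hidden = torch.sum(sequence_output, dim=1) / sequence_length
        return batch_mean_hidden
```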
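The masked average can be verified on a tiny example. The sketch below re-implements the same computation as a standalone function (`masked_mean`, a hypothetical name) and uses made-up values, so the expected means are easy to compute by hand.

```python
import torch


def masked_mean(sequence_output, sequence_length):
    """Average over the time axis, ignoring positions beyond each true length."""
    lengths = sequence_length.unsqueeze(-1).float()      # (batch, 1)
    positions = torch.arange(sequence_output.shape[1])   # (max_len,)
    mask = (positions < lengths).float().unsqueeze(-1)   # (batch, max_len, 1)
    return (sequence_output * mask).sum(dim=1) / lengths


# Two sequences with true lengths 3 and 1, both padded to length 3
x = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                  [[2.0, 8.0], [9.0, 9.0], [9.0, 9.0]]])
lengths = torch.tensor([3, 1])
pooled = masked_mean(x, lengths)
print(pooled.tolist())  # [[3.0, 4.0], [2.0, 8.0]]
```

For the second sequence the two padded steps are ignored, so the result equals its first (and only valid) vector rather than the mean of all three rows.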
2.2 Model Summary
This version adds GPU acceleration.
```python
class Model_BiLSTM_FC(nn.Module):
    def __init__(self, num_embeddings, input_size, hidden_size, num_classes=2):
        super(Model_BiLSTM_FC, self).__init__()
        self.num_embeddings = num_embeddings
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        # Embedding layer (padding_idx=0 assumes [PAD] has ID 0 in the vocabulary)
        self.embedding_layer = nn.Embedding(num_embeddings, input_size, padding_idx=0)
        # Bidirectional LSTM layer; batch_first=True because the inputs are
        # (batch_size, seq_len, input_size), which AveragePooling also expects
        self.lstm_layer = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        # Pooling layer
        self.average_layer = AveragePooling()
        # Output layer; the two directions are concatenated, hence hidden_size * 2
        self.output_layer = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, inputs):
        input_ids, sequence_length = inputs
        # Move the data to the device the model currently lives on
        device = next(self.parameters()).device
        input_ids = input_ids.to(device)
        sequence_length = sequence_length.to(device)
        # Look up the word embeddings
        inputs_emb = self.embedding_layer(input_ids)
        # Run the LSTM
        sequence_output, _ = self.lstm_layer(inputs_emb)
        # Pool the LSTM outputs over time
        batch_mean_hidden = self.average_layer(sequence_output, sequence_length)
        # Classify with the output layer
        logits = self.output_layer(batch_mean_hidden)
        return logits
```
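The input width of the output layer follows from the shape that `nn.LSTM` returns. The following sketch, with hypothetical small sizes, checks that a bidirectional LSTM on batch-first input yields features of size `2 * hidden_size` at every time step.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 4, 7, 8, 16  # hypothetical sizes

lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
x = torch.randn(batch_size, seq_len, input_size)
output, (h_n, c_n) = lstm(x)

# Forward and backward hidden states are concatenated along the last axis
print(output.shape)  # torch.Size([4, 7, 32])
# h_n is always (num_directions * num_layers, batch, hidden), even with batch_first
print(h_n.shape)     # torch.Size([2, 4, 16])
```

With the default `batch_first=False`, the first axis of the input would be treated as time, so a batch-first tensor would be processed along the wrong dimension without raising an error.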
3. Model Training
3.1 Model Training
```python
import random
import time

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Accuracy and RunnerV3 are the helper classes from earlier chapters (not shown here)

# Set the random seeds
torch.manual_seed(0)
np.random.seed(0)
random.seed(0)
# Training hyperparameters
num_epochs = 3
learning_rate = 0.001
num_embeddings = len(word2id_dict)  # vocabulary size
input_size = 256  # embedding dimension
hidden_size = 256  # LSTM hidden size
# Instantiate the model
model = Model_BiLSTM_FC(num_embeddings, input_size, hidden_size)
# Optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999))
# Loss function
loss_fn = nn.CrossEntropyLoss()
# Evaluation metric
metric = Accuracy()
use_gpu = torch.cuda.is_available()  # check whether a GPU is available
# print(use_gpu)
if use_gpu:
    device = torch.device('cuda:0')  # use GPU 0
else:
    device = torch.device('cpu')  # use the CPU
# print(device)
if use_gpu:
    model = model.to(device)  # move the model to the GPU
# Instantiate the Runner
runner = RunnerV3(model, optimizer, loss_fn, metric, device)
# Train the model
start_time = time.time()
runner.train(train_loader, dev_loader, num_epochs=num_epochs, eval_steps=10, log_steps=10, save_path="./checkpoints/best.pdparams")
end_time = time.time()
print("time: ", (end_time - start_time))
```
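The three seeding calls at the top of the training script can be wrapped in one helper. This is a minimal sketch (the name `set_seed` is hypothetical) showing that re-seeding reproduces the same random draws from all three generators.

```python
import random

import numpy as np
import torch


def set_seed(seed):
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


set_seed(0)
first = (random.random(), np.random.rand(), torch.rand(1).item())
set_seed(0)
second = (random.random(), np.random.rand(), torch.rand(1).item())
print(first == second)  # True
```

Seeding makes weight initialization and data shuffling repeatable on the same setup; some GPU kernels may still introduce small run-to-run differences.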
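3.2 Plotting the Accuracy and Loss Curves
```python
# Plot the loss and accuracy curves
# plot_training_loss_acc is the plotting helper from earlier chapters (not shown here)
# Output file name
fig_name = "./images/6.16.pdf"
# sample_step: sampling interval for the training loss, i.e. plot one point out of every sample_step points
# loss_legend_loc: legend position in the loss plot
# acc_legend_loc: legend position in the accuracy plot
plot_training_loss_acc(runner, fig_name, fig_size=(16,6), sample_step=10, loss_legend_loc="lower left", acc_legend_loc="lower right")
```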
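The figure below shows the loss curves of the text classification model during training and its accuracy curve on the validation set. In the loss plot, the solid line is the training loss and the dashed line is the validation loss. As training proceeds, the training loss keeps decreasing, while the validation loss starts to rise after roughly 200 steps, which indicates overfitting. This can be addressed by keeping the model that performs best on the validation set during training. The accuracy curve tells the same story: validation accuracy first rises sharply, stops improving after about 200 steps, and then drops slightly because of overfitting.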
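Keeping the best model on the validation set can be sketched as follows. This is not the RunnerV3 implementation; the model is a hypothetical stand-in and the validation accuracies in `dev_scores` are made-up numbers used only to show the bookkeeping.

```python
import copy

import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in for the real model
# Hypothetical validation accuracies measured at successive evaluation steps
dev_scores = [0.61, 0.78, 0.84, 0.83, 0.80]

best_score, best_step, best_state = 0.0, -1, None
for step, score in enumerate(dev_scores):
    # ... a real loop would train the model here before evaluating ...
    if score > best_score:
        best_score, best_step = score, step
        # Deep-copy so later updates do not overwrite the saved weights
        best_state = copy.deepcopy(model.state_dict())

# Restore the weights from the best evaluation step
model.load_state_dict(best_state)
print(best_step, best_score)  # 2 0.84
```

Evaluation steps after the best one (where the score declines because of overfitting) no longer replace the stored weights.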
4. Model Evaluation
```python
# Model evaluation: load the best checkpoint and score it on the test set
model_path = "./checkpoints/best.pdparams"
runner.load_model(model_path)
accuracy, _ = runner.evaluate(test_loader)
print(f"Evaluate on test set, Accuracy: {accuracy:.5f}")
```
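The Accuracy metric itself is not shown in this article. A minimal sketch of how accuracy is typically computed from logits is given below; the function name and the sample tensors are hypothetical, and this is not claimed to be the metric class used by RunnerV3.

```python
import torch


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = torch.argmax(logits, dim=-1)
    return (preds == labels).float().mean().item()


logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [0.3, 0.2], [0.0, 3.0]])
labels = torch.tensor([0, 1, 1, 1])
acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75
```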
5. Model Prediction
```python
# Model prediction
# Mapping from label ID to label name
id2label = {0: "negative sentiment", 1: "positive sentiment"}
# Input text
text = "this movie is so great. I watched it three times already"
# Preprocess the single text; lowercase it to match how the training data was loaded
sentence = text.lower().split(" ")
words = [word2id_dict[word] if word in word2id_dict else word2id_dict['[UNK]'] for word in sentence]
words = words[:max_seq_len]
sequence_length = torch.tensor([len(words)], dtype=torch.long)  # record the sequence length
words = torch.tensor(words, dtype=torch.long).unsqueeze(0)  # add the batch dimension
logits = runner.predict((words, sequence_length))
# Index of the highest-scoring label
max_label_id = torch.argmax(logits, dim=-1).item()
# Show the predicted label
pred_label = id2label[max_label_id]
print("Label: ", pred_label)
```
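If a confidence score is wanted in addition to the label, the logits can be passed through a softmax. The sketch below uses hypothetical logits for one sentence rather than the output of the trained model.

```python
import torch

id2label = {0: "negative", 1: "positive"}
logits = torch.tensor([[-1.0, 2.0]])  # hypothetical logits for one sentence

# Softmax turns each row of logits into probabilities that sum to 1
probs = torch.softmax(logits, dim=-1)
confidence, label_id = torch.max(probs, dim=-1)
print(id2label[label_id.item()], round(confidence.item(), 3))
```

Softmax does not change which class has the highest score, so the predicted label is the same as the one obtained with `argmax` on the raw logits.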
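6. Unidirectional LSTM with Torch
First, modify the model definition: set `bidirectional=False` in `nn.LSTM` to use a unidirectional (forward) LSTM, and set the shape of the linear layer to `[hidden_size, num_classes]`.
6.1 Model Modification: Returning Only the Last Hidden State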
```python
class AveragePooling(nn.Module):
    def __init__(self):
        super(AveragePooling, self).__init__()

    def forward(self, sequence_output, sequence_length):
        # Expand sequence_length to shape (batch_size, 1)
        sequence_length = sequence_length.unsqueeze(-1).float()
        max_len = sequence_output.size(1)
        # Build the mask matrix from sequence_length
        mask = torch.arange(max_len, device=sequence_output.device).unsqueeze(0) < sequence_length
        mask = mask.float().unsqueeze(-1)
        # Mask out the padding positions
        sequence_output = sequence_output * mask
        # Average the vectors of the sequence
        batch_mean_hidden = sequence_output.sum(dim=1) / sequence_length
        return batch_mean_hidden


class Model_BiLSTM_FC(nn.Module):
    def __init__(self, num_embeddings, input_size, hidden_size, num_classes=2):
        super(Model_BiLSTM_FC, self).__init__()
        # Vocabulary size
        self.num_embeddings = num_embeddings
        # Word embedding dimension
        self.input_size = input_size
        # Number of LSTM hidden units
        self.hidden_size = hidden_size
        # Number of sentiment classes
        self.num_classes = num_classes
        # Embedding layer
        self.embedding_layer = nn.Embedding(num_embeddings, input_size, padding_idx=0)
        # LSTM layer (unidirectional)
        self.lstm_layer = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=False)
        # Pooling layer
        self.average_layer = AveragePooling()
        # Output layer
        self.output_layer = nn.Linear(hidden_size, num_classes)

    def forward(self, inputs):
        # Split the model input into token IDs and sequence lengths
        input_ids, sequence_length = inputs
        # Move the data to the device the model currently lives on
        device = next(self.parameters()).device
        input_ids = input_ids.to(device)
        sequence_length = sequence_length.to(device)
        # Look up the word embeddings
        inputs_emb = self.embedding_layer(input_ids)
        # Run the LSTM; sequence_length does not need to be passed explicitly
        sequence_output, _ = self.lstm_layer(inputs_emb)
        # Pool sequence_output over time
        batch_mean_hidden = self.average_layer(sequence_output, sequence_length)
        # Produce the classification logits
        logits = self.output_layer(batch_mean_hidden)
        return logits
```
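The model above still averages all hidden states. As an alternative that is not used in this article, the sketch below shows one way to select only the hidden state at the last non-padded step of each sequence. The function name `last_valid_state` and the sample values are hypothetical.

```python
import torch


def last_valid_state(sequence_output, sequence_length):
    """Pick, for each sequence, the hidden state at its last non-padded step."""
    batch_size, _, hidden_size = sequence_output.shape
    # Index of the last real token, expanded to (batch, 1, hidden) for gather
    last_idx = (sequence_length - 1).view(batch_size, 1, 1).expand(-1, 1, hidden_size)
    return sequence_output.gather(dim=1, index=last_idx).squeeze(1)


# Hypothetical hidden states: batch of 2, max length 3, hidden size 2
h = torch.tensor([[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
                  [[4.0, 4.0], [5.0, 5.0], [0.0, 0.0]]])
lengths = torch.tensor([3, 2])
last = last_valid_state(h, lengths)
print(last.tolist())  # [[3.0, 3.0], [5.0, 5.0]]
```

Indexing with the true length matters because `sequence_output[:, -1]` would return the state computed on [PAD] tokens for every sequence shorter than the batch maximum.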
6.1.1 Model Training
```python
# Set the random seeds
torch.manual_seed(0)
np.random.seed(0)
random.seed(0)
# Training hyperparameters
num_epochs = 3
learning_rate = 0.001
num_embeddings = len(word2id_dict)  # vocabulary size
input_size = 256  # embedding dimension
hidden_size = 256  # LSTM hidden size
# Instantiate the model
model = Model_BiLSTM_FC(num_embeddings, input_size, hidden_size)
# Optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999))
# Loss function
loss_fn = nn.CrossEntropyLoss()
# Evaluation metric
metric = Accuracy()
use_gpu = torch.cuda.is_available()  # check whether a GPU is available
# print(use_gpu)
if use_gpu:
    device = torch.device('cuda:0')  # use GPU 0
else:
    device = torch.device('cpu')  # use the CPU
# print(device)
if use_gpu:
    model = model.to(device)  # move the model to the GPU
# Instantiate the Runner
runner = RunnerV3(model, optimizer, loss_fn, metric, device)
# Train the model
start_time = time.time()
runner.train(train_loader, dev_loader, num_epochs=num_epochs, eval_steps=10, log_steps=10, save_path="./checkpoints/best_forward.pdparams")
end_time = time.time()
print("time: ", (end_time - start_time))
```
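6.1.2 Model Evaluation
```python
# Model evaluation: load the best unidirectional checkpoint and score it on the test set
model_path = "./checkpoints/best_forward.pdparams"
runner.load_model(model_path)
accuracy, _ = runner.evaluate(test_loader)
print(f"Evaluate on test set, Accuracy: {accuracy:.5f}")
```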
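6.2 Model Modification: Hidden States at All Time Steps
The LSTM implemented earlier returns only the hidden state of the last time step by default, but this experiment needs the hidden states of all time steps. The hand-written LSTM therefore has to be modified so that it returns the whole sequence of hidden states.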
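For reference, the per-step recurrence computed inside the forward loop of the hand-written LSTM can be written as follows. The notation is an assumption made here to match the code's parameter names and its row-vector convention: $x_t$ is the embedding at step $t$, $h_{t-1}$ and $c_{t-1}$ are the previous hidden and cell states, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product.

$$
\begin{aligned}
i_t &= \sigma(x_t W_i + h_{t-1} U_i + b_i) \\
f_t &= \sigma(x_t W_f + h_{t-1} U_f + b_f) \\
o_t &= \sigma(x_t W_o + h_{t-1} U_o + b_o) \\
\tilde{c}_t &= \tanh(x_t W_c + h_{t-1} U_c + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Returning all time steps means collecting every $h_t$ into a tensor of shape (batch_size, seq_len, hidden_size) instead of keeping only the final one.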
```python
import torch.nn.init as init


class AveragePooling(nn.Module):
    def __init__(self):
        super(AveragePooling, self).__init__()

    def forward(self, sequence_output, sequence_length):
        # Expand the lengths to (batch_size, 1) and build the padding mask
        sequence_length = sequence_length.unsqueeze(-1).float()
        max_len = sequence_output.shape[1]
        mask = torch.arange(max_len, device=sequence_output.device) < sequence_length
        mask = mask.float().unsqueeze(-1)
        # Zero out padded positions, then average over the true length
        sequence_output = sequence_output * mask
        batch_mean_hidden = torch.sum(sequence_output, dim=1) / sequence_length
        return batch_mean_hidden


class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # Model parameters: input weights W_*, recurrent weights U_*, biases b_*
        self.W_i = nn.Parameter(torch.empty(input_size, hidden_size))
        self.W_f = nn.Parameter(torch.empty(input_size, hidden_size))
        self.W_o = nn.Parameter(torch.empty(input_size, hidden_size))
        self.W_c = nn.Parameter(torch.empty(input_size, hidden_size))
        self.U_i = nn.Parameter(torch.empty(hidden_size, hidden_size))
        self.U_f = nn.Parameter(torch.empty(hidden_size, hidden_size))
        self.U_o = nn.Parameter(torch.empty(hidden_size, hidden_size))
        self.U_c = nn.Parameter(torch.empty(hidden_size, hidden_size))
        self.b_i = nn.Parameter(torch.empty(1, hidden_size))
        self.b_f = nn.Parameter(torch.empty(1, hidden_size))
        self.b_o = nn.Parameter(torch.empty(1, hidden_size))
        self.b_c = nn.Parameter(torch.empty(1, hidden_size))
        # Xavier uniform initialization for the weights, zeros for the biases
        init.xavier_uniform_(self.W_i)
        init.xavier_uniform_(self.W_f)
        init.xavier_uniform_(self.W_o)
        init.xavier_uniform_(self.W_c)
        init.xavier_uniform_(self.U_i)
        init.xavier_uniform_(self.U_f)
        init.xavier_uniform_(self.U_o)
        init.xavier_uniform_(self.U_c)
        init.zeros_(self.b_i)
        init.zeros_(self.b_f)
        init.zeros_(self.b_o)
        init.zeros_(self.b_c)

    def init_state(self, batch_size, device):
        hidden_state = torch.zeros(batch_size, self.hidden_size, dtype=torch.float32, device=device)
        cell_state = torch.zeros(batch_size, self.hidden_size, dtype=torch.float32, device=device)
        return hidden_state, cell_state

    def forward(self, inputs, states=None, sequence_length=None):
        batch_size, seq_len, input_size = inputs.shape  # inputs: batch_size x seq_len x input_size
        # Device of the input tensor
        device = inputs.device
        if states is None:
            states = self.init_state(batch_size, device)
        hidden_state, cell_state = states
        outputs = []
        # LSTM computation: input gate, forget gate, output gate, candidate cell state,
        # cell state and hidden state
        for step in range(seq_len):
            input_step = inputs[:, step, :]
            # The parameters and the inputs must be on the same device here
            I_gate = torch.sigmoid(torch.matmul(input_step, self.W_i) + torch.matmul(hidden_state, self.U_i) + self.b_i)
            F_gate = torch.sigmoid(torch.matmul(input_step, self.W_f) + torch.matmul(hidden_state, self.U_f) + self.b_f)
            O_gate = torch.sigmoid(torch.matmul(input_step, self.W_o) + torch.matmul(hidden_state, self.U_o) + self.b_o)
            C_tilde = torch.tanh(torch.matmul(input_step, self.W_c) + torch.matmul(hidden_state, self.U_c) + self.b_c)
            cell_state = F_gate * cell_state + I_gate * C_tilde
            hidden_state = O_gate * torch.tanh(cell_state)
            # Keep the hidden state of every time step
            outputs.append(hidden_state.unsqueeze(dim=1))
        outputs = torch.cat(outputs, dim=1)  # (batch_size, seq_len, hidden_size)
        return outputs


class Model_BiLSTM_FC(nn.Module):
    def __init__(self, num_embeddings, input_size, hidden_size, num_classes=2):
        super(Model_BiLSTM_FC, self).__init__()
        # Vocabulary size
        self.num_embeddings = num_embeddings
        # Word embedding dimension
        self.input_size = input_size
        # Number of LSTM hidden units
        self.hidden_size = hidden_size
        # Number of sentiment classes
        self.num_classes = num_classes
        # Embedding layer
        self.embedding_layer = nn.Embedding(num_embeddings, input_size, padding_idx=0)
        # LSTM layer (the hand-written implementation above)
        self.lstm_layer = LSTM(input_size, hidden_size)
        # Pooling layer
        self.average_layer = AveragePooling()
        # Output layer
        self.output_layer = nn.Linear(hidden_size, num_classes)

    def forward(self, inputs):
        # Split the model input into token IDs and sequence lengths
        input_ids, sequence_length = inputs
        # Move the data to the device the model currently lives on
        device = next(self.parameters()).device
        input_ids = input_ids.to(device)
        sequence_length = sequence_length.to(device)
        # Look up the word embeddings
        inputs_emb = self.embedding_layer(input_ids)
        # Run the LSTM; it returns the hidden states of all time steps
        sequence_output = self.lstm_layer(inputs_emb)
        # Pool sequence_output over time
        batch_mean_hidden = self.average_layer(sequence_output, sequence_length)
        # Produce the classification logits
        logits = self.output_layer(batch_mean_hidden)
        return logits
```
6.2.1 Model Training
```python
# Set the random seeds
torch.manual_seed(0)
np.random.seed(0)
random.seed(0)
# Training hyperparameters
num_epochs = 3
learning_rate = 0.001
num_embeddings = len(word2id_dict)  # vocabulary size
input_size = 256  # embedding dimension
hidden_size = 256  # LSTM hidden size
# Instantiate the model
model = Model_BiLSTM_FC(num_embeddings, input_size, hidden_size)
# Optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999))
# Loss function
loss_fn = nn.CrossEntropyLoss()
# Evaluation metric
metric = Accuracy()
use_gpu = torch.cuda.is_available()  # check whether a GPU is available
# print(use_gpu)
if use_gpu:
    device = torch.device('cuda:0')  # use GPU 0
else:
    device = torch.device('cpu')  # use the CPU
# print(device)
if use_gpu:
    model = model.to(device)  # move the model to the GPU
# Instantiate the Runner
runner = RunnerV3(model, optimizer, loss_fn, metric, device)
# Train the model
start_time = time.time()
runner.train(train_loader, dev_loader, num_epochs=num_epochs, eval_steps=10, log_steps=10, save_path="./checkpoints/best_self_forward.pdparams")
end_time = time.time()
print("time: ", (end_time - start_time))
```
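6.2.2 Model Evaluation
```python
# Model evaluation: load the best hand-written-LSTM checkpoint and score it on the test set
model_path = "./checkpoints/best_self_forward.pdparams"
runner.load_model(model_path)
accuracy, _ = runner.evaluate(test_loader)
print(f"Evaluate on test set, Accuracy: {accuracy:.5f}")
```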