[oneAPI] SWAG Question Answering with a Pretrained BERT Model
- Intel® Optimization for PyTorch on Intel® DevCloud for oneAPI
- SWAG question answering with a pretrained BERT model
- Results
- References
Competition: https://marketing.csdn.net/p/f3e44fbfe46c465f4d9d6c23e38e0517
Intel® DevCloud for oneAPI: https://devcloud.intel.com/oneapi/get_started/aiAnalyticsToolkitSamples/
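Intel® Optimization for PyTorch on Intel® DevCloud for oneAPI
We set up the experiment environment on Intel® DevCloud for oneAPI. Because the platform is fully virtualized, we could concentrate on model development and optimization instead of low-level configuration and maintenance. We then applied Intel® Optimization for PyTorch to our PyTorch model, which speeds up training and inference on Intel hardware and shortened the overall experiment cycle.
For the experiment we took a pretrained BERT model and applied it to the SWAG English multiple-choice question-answering task. After data preprocessing and fine-tuning, the model performed satisfactorily on the task, and Intel® Optimization for PyTorch further accelerated both training and inference.
Beyond the technical side, the approach has practical value: a faster model can be trained and deployed sooner and can return better answers, which improves the user experience and is useful for both research and applications in natural language processing.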
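SWAG question answering with a pretrained BERT model
SWAG is a common natural language processing task: given a sentence context, the model ranks several candidate endings and picks the most plausible continuation. The task tests how well a model understands context and everyday logic, so it is widely applicable. Our solution consists of the following steps:
- Data preparation: Load the contexts and candidate endings from the SWAG dataset. Each sample consists of one context sentence and four candidate endings, exactly one of which is correct. The text must be converted into the input format BERT expects.
- Pretrained model selection: Choose a suitable pretrained BERT model, such as BERT-base or BERT-large. These models are pretrained on large corpora and capture rich semantic information.
- Fine-tuning: Fine-tune the pretrained BERT model on the labeled SWAG training data, using the index of the correct ending as the label and optimizing the model to maximize the predicted probability of that ending.
- Inference: Use the fine-tuned model to predict, for each test sample, the most plausible ending among the candidates.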
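The fine-tuning and inference steps above reduce to a softmax over four scores. The sketch below uses made-up scores (the values in `scores` and the helper names are illustrative assumptions, not output of the model in this article) to show that maximizing the probability of the correct ending is the same as minimizing cross-entropy, and that inference is an argmax over the four endings.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log-probability of the correct ending."""
    return -math.log(softmax(scores)[label])


# Hypothetical scores for the four endings of one sample; ending 0 is correct.
scores = [2.0, 0.5, -1.0, 0.1]
label = 0

probs = softmax(scores)
prediction = max(range(len(probs)), key=probs.__getitem__)  # inference: argmax
loss = cross_entropy(scores, label)  # fine-tuning minimizes this value

print(prediction, round(loss, 3))
```

The loss shrinks as the score of the correct ending grows relative to the other three, which is exactly the behaviour the fine-tuning step relies on.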
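Dataset download and description
We use the SWAG (Situations With Adversarial Generations) dataset, which is also one of the benchmarks in the BERT paper. Given a situation (a question or a description), the model must predict the most plausible of four candidate endings.
A few rows of the raw data look like this: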
,video-id,fold-ind,startphrase,sent1,sent2,gold-source,ending0,ending1,ending2,ending3,label
0,anetv_NttjvRpSdsI,19391,The people are in robes. They,The people are in robes.,They,gold,are wearing colorful costumes.,are doing karate moves on the floor.,shake hands on their hips.,do a flip to the bag.,0
1,lsmdc3057_ROBIN_HOOD-27684,16344,She smirks at someone and rides off. He,She smirks at someone and rides off.,He,gold,smiles and falls heavily.,wears a bashful smile.,kneels down behind her.,gives him a playful glance.,1
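The excerpt above has 12 columns (counting the unnamed index column) and two samples. We only need six of them: startphrase (sent1 followed by sent2), ending0, ending1, ending2, ending3, and label. The first sample, for example, reads:
The people are in robes. They
A) are wearing colorful costumes.  # correct option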
B) are doing karate moves on the floor.
C) shake hands on their hips.
D) do a flip to the bag.
The dataset already comes split into training, validation, and test sets (the test set has no labels), so no manual split is needed.
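The column selection described above can be sketched with the standard `csv` module. The two sample rows shown above are embedded as a string so the snippet runs without the dataset files; `to_sample` is a hypothetical helper used only for illustration.

```python
import csv
import io

# The two sample rows shown above, embedded as a string instead of read from train.csv.
RAW = """,video-id,fold-ind,startphrase,sent1,sent2,gold-source,ending0,ending1,ending2,ending3,label
0,anetv_NttjvRpSdsI,19391,The people are in robes. They,The people are in robes.,They,gold,are wearing colorful costumes.,are doing karate moves on the floor.,shake hands on their hips.,do a flip to the bag.,0
1,lsmdc3057_ROBIN_HOOD-27684,16344,She smirks at someone and rides off. He,She smirks at someone and rides off.,He,gold,smiles and falls heavily.,wears a bashful smile.,kneels down behind her.,gives him a playful glance.,1
"""


def to_sample(row):
    """Keep only the six fields used for training."""
    endings = [row[f"ending{i}"] for i in range(4)]
    return row["startphrase"], endings, int(row["label"])


samples = [to_sample(row) for row in csv.DictReader(io.StringIO(RAW))]

for context, endings, label in samples:
    print(context)
    for i, ending in enumerate(endings):
        marker = "  <- correct" if i == label else ""
        print(f"  {'ABCD'[i]}) {ending}{marker}")
```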
Building the dataset
```python
import logging
import os

import pandas as pd
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
```
```python
class Vocab:
    """
    Build a vocabulary from a local vocab file (one token per line).

    vocab = Vocab(vocab_path)
    print(vocab.itos)        # list of every token in the vocabulary
    print(vocab.itos[2])     # token stored at a given index
    print(vocab.stoi)        # dict mapping every token to its index
    print(vocab.stoi['我'])  # index of a given token
    print(len(vocab))        # size of the vocabulary
    """
    UNK = '[UNK]'

    def __init__(self, vocab_path):
        self.stoi = {}
        self.itos = []
        with open(vocab_path, 'r', encoding='utf-8') as f:
            for i, word in enumerate(f):
                w = word.strip('\n')
                self.stoi[w] = i
                self.itos.append(w)

    def __getitem__(self, token):
        # Tokens missing from the vocabulary map to the index of [UNK].
        return self.stoi.get(token, self.stoi.get(Vocab.UNK))

    def __len__(self):
        return len(self.itos)


def build_vocab(vocab_path):
    """Return Vocab(vocab_path); see the Vocab docstring for usage."""
    return Vocab(vocab_path)
```
```python
def pad_sequence(sequences, batch_first=False, max_len=None, padding_value=0):
    """
    Pad a list of variable-length 1-D tensors with ``padding_value``.

    >>> a, b, c = torch.ones(25), torch.ones(22), torch.ones(15)
    >>> pad_sequence([a, b, c], max_len=None).size()
    torch.Size([25, 3])

    :param sequences: list of 1-D tensors
    :param batch_first: whether to put batch_size in the first dimension
    :param padding_value: value written into the padded positions
    :param max_len:
        an int (e.g. 50) pads every sample to that fixed length and truncates the rest;
        None pads every sample to the longest sample in the current batch
    :return: tensor of shape [max_len, batch_size], or [batch_size, max_len] if batch_first
    """
    if max_len is None:
        max_len = max(s.size(0) for s in sequences)
    out_tensors = []
    for tensor in sequences:
        if tensor.size(0) < max_len:
            # new_full keeps the dtype and device of the sequence being padded
            padding = tensor.new_full((max_len - tensor.size(0),), padding_value)
            tensor = torch.cat([tensor, padding], dim=0)
        else:
            tensor = tensor[:max_len]
        out_tensors.append(tensor)
    out_tensors = torch.stack(out_tensors, dim=1)
    if batch_first:
        return out_tensors.transpose(0, 1)
    return out_tensors
```
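```python
def cache(func):
    """
    Decorator that caches the result of a dataset's data_process() method on disk
    so that later runs can load it directly.
    The wrapped method must be called with the keyword arguments
    ``filepath`` and ``postfix``.
    :param func: the data_process method to wrap
    :return: the wrapped method
    """

    def wrapper(*args, **kwargs):
        filepath = kwargs['filepath']
        postfix = kwargs['postfix']
        # Strip only the file extension so that dots elsewhere in the path are kept.
        data_path = os.path.splitext(filepath)[0] + '_' + postfix + '.pt'
        if not os.path.exists(data_path):
            logging.info(f"Cache file {data_path} not found; processing the data and caching it.")
            data = func(*args, **kwargs)
            with open(data_path, 'wb') as f:
                torch.save(data, f)
        else:
            logging.info(f"Cache file {data_path} found; loading it directly.")
            with open(data_path, 'rb') as f:
                data = torch.load(f)
        return data

    return wrapper
```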
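How the cache file name is derived from `filepath` and `postfix` is easy to get wrong when the path itself contains dots. A minimal sketch, assuming `os.path.splitext` is used to strip the extension (`cache_path` is a hypothetical helper for illustration, not part of the project):

```python
import os


def cache_path(filepath, postfix):
    """Return '<filepath without extension>_<postfix>.pt'."""
    stem, _ext = os.path.splitext(filepath)
    return f"{stem}_{postfix}.pt"


relative = cache_path("./MultipleChoice/train.csv", "None")
dotted_dir = cache_path("/data/v1.0/val.csv", "same")
print(relative)
print(dotted_dir)

# For comparison: cutting at the first dot loses most of a relative path.
naive = "./MultipleChoice/train.csv".split(".")[0] + "_None.pt"
print(repr(naive))
```

Because the postfix is `str(self.max_sen_len)`, each padding setting gets its own cache file, so switching between `None`, `'same'`, and a fixed length does not reuse stale data.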
```python
class LoadSingleSentenceClassificationDataset:
    def __init__(self,
                 vocab_path='./vocab.txt',
                 tokenizer=None,
                 batch_size=32,
                 max_sen_len=None,
                 split_sep='\n',
                 max_position_embeddings=512,
                 pad_index=0,
                 is_sample_shuffle=True
                 ):
        """
        :param vocab_path: path to the local vocabulary file vocab.txt
        :param tokenizer: callable that splits a string into tokens
        :param batch_size: number of samples per batch
        :param max_sen_len: how each batch is padded:
            None pads to the longest sample in the batch;
            'same' pads to the longest sample in the whole dataset;
            an int (e.g. 50) pads to that fixed length and truncates the rest
        :param split_sep: separator between text and label on each line. The default
            is a newline, so pass the separator your file actually uses (e.g. a tab)
        :param max_position_embeddings: maximum sample length; anything beyond it is truncated
        :param pad_index: index of the padding token
        :param is_sample_shuffle: whether to shuffle the training samples (training set only).
            The validation and test DataLoaders are built with a fixed order. Do not
            shuffle them: with shuffle=True the sample order changes on every pass over
            data_iter, so predicted labels no longer line up with the original order.
        """
        self.tokenizer = tokenizer
        self.vocab = build_vocab(vocab_path)
        self.PAD_IDX = pad_index
        self.SEP_IDX = self.vocab['[SEP]']
        self.CLS_IDX = self.vocab['[CLS]']
        self.batch_size = batch_size
        self.split_sep = split_sep
        self.max_position_embeddings = max_position_embeddings
        if isinstance(max_sen_len, int) and max_sen_len > max_position_embeddings:
            max_sen_len = max_position_embeddings
        self.max_sen_len = max_sen_len
        self.is_sample_shuffle = is_sample_shuffle

    @cache
    def data_process(self, filepath, postfix='cache'):
        """
        Convert every token of every sentence to its vocabulary index and
        return the samples together with the length of the longest one.
        :param filepath: path to the dataset file
        :return: (data, max_len)
        """
        with open(filepath, encoding="utf8") as f:
            raw_iter = f.readlines()
        data = []
        max_len = 0
        for raw in tqdm(raw_iter, ncols=80):
            line = raw.rstrip("\n").split(self.split_sep)
            s, l = line[0], line[1]
            tmp = [self.CLS_IDX] + [self.vocab[token] for token in self.tokenizer(s)]
            if len(tmp) > self.max_position_embeddings - 1:
                # Leave room for the trailing [SEP]: BERT accepts at most
                # max_position_embeddings (512) tokens.
                tmp = tmp[:self.max_position_embeddings - 1]
            tmp += [self.SEP_IDX]
            tensor_ = torch.tensor(tmp, dtype=torch.long)
            l = torch.tensor(int(l), dtype=torch.long)
            max_len = max(max_len, tensor_.size(0))
            data.append((tensor_, l))
        return data, max_len

    def load_train_val_test_data(self, train_file_path=None,
                                 val_file_path=None,
                                 test_file_path=None,
                                 only_test=False):
        postfix = str(self.max_sen_len)
        test_data, _ = self.data_process(filepath=test_file_path, postfix=postfix)
        test_iter = DataLoader(test_data, batch_size=self.batch_size,
                               shuffle=False, collate_fn=self.generate_batch)
        if only_test:
            return test_iter
        # All processed training samples and the length of the longest one.
        train_data, max_sen_len = self.data_process(filepath=train_file_path,
                                                    postfix=postfix)
        if self.max_sen_len == 'same':
            self.max_sen_len = max_sen_len
        val_data, _ = self.data_process(filepath=val_file_path,
                                        postfix=postfix)
        train_iter = DataLoader(train_data, batch_size=self.batch_size,
                                shuffle=self.is_sample_shuffle, collate_fn=self.generate_batch)
        val_iter = DataLoader(val_data, batch_size=self.batch_size,
                              shuffle=False, collate_fn=self.generate_batch)
        return train_iter, test_iter, val_iter

    def generate_batch(self, data_batch):
        batch_sentence, batch_label = [], []
        for (sen, label) in data_batch:  # process each sample in the batch
            batch_sentence.append(sen)
            batch_label.append(label)
        batch_sentence = pad_sequence(batch_sentence,  # [max_len, batch_size]
                                      padding_value=self.PAD_IDX,
                                      batch_first=False,
                                      max_len=self.max_sen_len)
        batch_label = torch.tensor(batch_label, dtype=torch.long)
        return batch_sentence, batch_label
```
```python
class LoadMultipleChoiceDataset(LoadSingleSentenceClassificationDataset):
    def __init__(self, num_choice=4, **kwargs):
        super().__init__(**kwargs)
        self.num_choice = num_choice

    @cache
    def data_process(self, filepath, postfix='cache'):
        data = pd.read_csv(filepath)
        questions = data['startphrase']
        answers0, answers1 = data['ending0'], data['ending1']
        answers2, answers3 = data['ending2'], data['ending3']
        labels = [-1] * len(questions)
        if 'label' in data:  # the test set has no labels
            labels = data['label']
        all_data = []
        max_len = 0
        for i in tqdm(range(len(questions)), ncols=80):
            # Convert each word of the question to its token id.
            t_q = [self.vocab[token] for token in self.tokenizer(questions[i])]
            t_q = [self.CLS_IDX] + t_q + [self.SEP_IDX]
            # Convert each word of the four answers to its token id.
            t_a0 = [self.vocab[token] for token in self.tokenizer(answers0[i])]
            t_a1 = [self.vocab[token] for token in self.tokenizer(answers1[i])]
            t_a2 = [self.vocab[token] for token in self.tokenizer(answers2[i])]
            t_a3 = [self.vocab[token] for token in self.tokenizer(answers3[i])]
            # Length of the longest sequence; +1 for the final [SEP] that
            # generate_batch appends after the answer.
            max_len = max(max_len, len(t_q) + max(len(t_a0), len(t_a1), len(t_a2), len(t_a3)) + 1)
            seg_q = [0] * len(t_q)
            # +1 accounts for the final [SEP] after question and answer are joined.
            seg_a0 = [1] * (len(t_a0) + 1)
            seg_a1 = [1] * (len(t_a1) + 1)
            seg_a2 = [1] * (len(t_a2) + 1)
            seg_a3 = [1] * (len(t_a3) + 1)
            # int() turns the pandas/NumPy scalar into a plain Python int.
            all_data.append((t_q, t_a0, t_a1, t_a2, t_a3, seg_q,
                             seg_a0, seg_a1, seg_a2, seg_a3, int(labels[i])))
        return all_data, max_len

    def generate_batch(self, data_batch):
        batch_qa, batch_seg, batch_label = [], [], []

        def get_seq(q, a):
            seq = q + a
            if len(seq) > self.max_position_embeddings - 1:
                seq = seq[:self.max_position_embeddings - 1]
            return torch.tensor(seq + [self.SEP_IDX], dtype=torch.long)

        for item in data_batch:
            # input_ids of the question combined with each of the four answers
            tmp_qa = [get_seq(item[0], item[1]),
                      get_seq(item[0], item[2]),
                      get_seq(item[0], item[3]),
                      get_seq(item[0], item[4])]
            # token_type_ids of the question combined with each of the four answers
            seg0 = (item[5] + item[6])[:self.max_position_embeddings]
            seg1 = (item[5] + item[7])[:self.max_position_embeddings]
            seg2 = (item[5] + item[8])[:self.max_position_embeddings]
            seg3 = (item[5] + item[9])[:self.max_position_embeddings]
            tmp_seg = [torch.tensor(seg0, dtype=torch.long),
                       torch.tensor(seg1, dtype=torch.long),
                       torch.tensor(seg2, dtype=torch.long),
                       torch.tensor(seg3, dtype=torch.long)]
            batch_qa.extend(tmp_qa)
            batch_seg.extend(tmp_seg)
            batch_label.append(item[-1])
        batch_qa = pad_sequence(batch_qa,  # [batch_size*num_choice, max_len]
                                padding_value=self.PAD_IDX,
                                batch_first=True,
                                max_len=self.max_sen_len)
        # Padding mask (True marks padded positions), shaped [batch_size, num_choice, max_len]
        batch_mask = (batch_qa == self.PAD_IDX).view(
            [-1, self.num_choice, batch_qa.size(-1)])
        # reshape to [batch_size, num_choice, max_len]
        batch_qa = batch_qa.view([-1, self.num_choice, batch_qa.size(-1)])
        batch_seg = pad_sequence(batch_seg,  # [batch_size*num_choice, max_len]
                                 padding_value=self.PAD_IDX,
                                 batch_first=True,
                                 max_len=self.max_sen_len)
        # reshape to [batch_size, num_choice, max_len]
        batch_seg = batch_seg.view([-1, self.num_choice, batch_seg.size(-1)])
        batch_label = torch.tensor(batch_label, dtype=torch.long)
        return batch_qa, batch_seg, batch_mask, batch_label
```
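To make the layout produced by `data_process` and `generate_batch` concrete, the following sketch builds the inputs for one question with plain Python lists. The token ids (including `CLS`, `SEP`, and `PAD`) and the helpers `build_choice` and `pad_to` are illustrative assumptions; the real ids come from `vocab.txt`.

```python
CLS, SEP, PAD = 101, 102, 0  # illustrative ids; the real ones come from vocab.txt


def build_choice(question_ids, answer_ids):
    """Return (input_ids, token_type_ids) for one question/answer pair."""
    q = [CLS] + question_ids + [SEP]
    input_ids = q + answer_ids + [SEP]
    token_type_ids = [0] * len(q) + [1] * (len(answer_ids) + 1)
    return input_ids, token_type_ids


def pad_to(ids, length, value=PAD):
    """Right-pad a list to the given length."""
    return ids + [value] * (length - len(ids))


question = [7, 8, 9]  # hypothetical token ids of the question
answers = [[11, 12], [13], [14, 15, 16], [17, 18]]  # four candidate endings

pairs = [build_choice(question, a) for a in answers]
max_len = max(len(ids) for ids, _ in pairs)

input_ids = [pad_to(ids, max_len) for ids, _ in pairs]  # [num_choice, max_len]
token_type_ids = [pad_to(seg, max_len) for _, seg in pairs]  # [num_choice, max_len]
padding_mask = [[tok == PAD for tok in row] for row in input_ids]  # True = padding

for row in input_ids:
    print(row)
```

Each of the four rows starts with the same question and differs only in the ending, which is why a batch of `batch_size` questions becomes `batch_size * num_choice` sequences before it is reshaped.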
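Multiple-choice model
We only need to add a classification layer on top of the original BERT model, so this part of the code is fairly easy to follow.
First, define the class and its initialization function: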
```python
from model.Bert import BertModel
import torch.nn as nn


class BertForMultipleChoice(nn.Module):
    """
    For downstream tasks similar to the SWAG dataset.
    """

    def __init__(self, config, bert_pretrained_model_dir=None):
        super().__init__()
        self.num_choice = config.num_labels
        if bert_pretrained_model_dir is not None:
            self.bert = BertModel.from_pretrained(config, bert_pretrained_model_dir)
        else:
            self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # One score per (question, answer) sequence.
        self.classifier = nn.Linear(config.hidden_size, 1)

    def forward(self, input_ids,
                attention_mask=None,
                token_type_ids=None,
                position_ids=None,
                labels=None):
        """
        :param input_ids: [batch_size, num_choice, src_len]
        :param attention_mask: [batch_size, num_choice, src_len]
        :param token_type_ids: [batch_size, num_choice, src_len]
        :param position_ids: passed through to BertModel unchanged
        :param labels: [batch_size], index of the correct choice
        :return: (loss, logits) if labels is given, otherwise logits
                 of shape [batch_size, num_choice]
        """
        flat_input_ids = input_ids.view(-1, input_ids.size(-1)).transpose(0, 1)
        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)).transpose(0, 1)
        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1))
        pooled_output, _ = self.bert(
            input_ids=flat_input_ids,  # [src_len, batch_size*num_choice]
            attention_mask=flat_attention_mask,  # [batch_size*num_choice, src_len]
            token_type_ids=flat_token_type_ids,  # [src_len, batch_size*num_choice]
            position_ids=position_ids)
        pooled_output = self.dropout(pooled_output)  # [batch_size*num_choice, hidden_size]
        logits = self.classifier(pooled_output)  # [batch_size*num_choice, 1]
        shaped_logits = logits.view(-1, self.num_choice)  # [batch_size, num_choice]
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(shaped_logits, labels.view(-1))
            return loss, shaped_logits
        else:
            return shaped_logits
```
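The shape bookkeeping in `forward` can be checked without pretrained weights. The sketch below replaces BERT with a placeholder encoder (an embedding followed by mean pooling, which is purely an assumption for illustration and uses batch-first inputs, unlike the project's `BertModel`) and follows the tensors from `[batch_size, num_choice, src_len]` to one 4-way classification per sample.

```python
import torch
import torch.nn as nn

batch_size, num_choice, src_len, hidden_size = 2, 4, 6, 8

# Placeholder for BERT's pooled output: an embedding followed by mean pooling.
# It only exists so that the shapes can be checked without pretrained weights.
embedding = nn.Embedding(100, hidden_size)
classifier = nn.Linear(hidden_size, 1)

input_ids = torch.randint(0, 100, (batch_size, num_choice, src_len))
labels = torch.tensor([0, 3])  # hypothetical indices of the correct endings

flat_input_ids = input_ids.view(-1, src_len)  # [batch_size*num_choice, src_len]
pooled_output = embedding(flat_input_ids).mean(dim=1)  # [batch_size*num_choice, hidden_size]
logits = classifier(pooled_output)  # [batch_size*num_choice, 1]
shaped_logits = logits.view(-1, num_choice)  # [batch_size, num_choice]

loss = nn.CrossEntropyLoss()(shaped_logits, labels)  # one 4-way classification per sample
predictions = shaped_logits.argmax(dim=1)  # [batch_size]

print(shaped_logits.shape, predictions.shape)
```

Using a single output unit and reshaping afterwards means the classifier scores every (question, ending) pair independently, and the softmax inside the cross-entropy loss is what makes the four endings compete.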
Next, define a ModelConfig class that manages the hyperparameters and other variables of the model:
```python
import logging
import os

import torch

# BertConfig and logger_init are helpers of this project and must be imported
# from wherever they are defined; they are not shown in this article.


class ModelConfig:
    def __init__(self):
        self.project_dir = os.path.dirname(os.path.abspath(__file__))
        self.dataset_dir = os.path.join(self.project_dir, 'MultipleChoice')
        self.pretrained_model_dir = os.path.join(self.project_dir, "weight")
        self.vocab_path = os.path.join(self.pretrained_model_dir, 'vocab.txt')
        # 'xpu' is the Intel GPU backend; it is usually registered by importing
        # intel_extension_for_pytorch. Fall back to the CPU when it is unavailable.
        has_xpu = hasattr(torch, 'xpu') and torch.xpu.is_available()
        self.device = torch.device('xpu' if has_xpu else 'cpu')
        self.train_file_path = os.path.join(self.dataset_dir, 'train.csv')
        self.val_file_path = os.path.join(self.dataset_dir, 'val.csv')
        self.test_file_path = os.path.join(self.dataset_dir, 'test.csv')
        self.model_save_dir = os.path.join(self.project_dir, 'cache')
        self.logs_save_dir = os.path.join(self.project_dir, 'logs')
        self.is_sample_shuffle = True
        self.batch_size = 16
        self.max_sen_len = None
        self.num_labels = 4  # num_choice
        self.learning_rate = 2e-5
        self.epochs = 10
        self.model_val_per_epoch = 2
        logger_init(log_file_name='choice', log_level=logging.INFO,
                    log_dir=self.logs_save_dir)
        if not os.path.exists(self.model_save_dir):
            os.makedirs(self.model_save_dir)
        # Import the parameters of the original BERT configuration as well.
        bert_config_path = os.path.join(self.pretrained_model_dir, "config.json")
        bert_config = BertConfig.from_json_file(bert_config_path)
        for key, value in bert_config.__dict__.items():
            self.__dict__[key] = value
        # Write the current configuration to the log file.
        logging.info(" ### Current configuration ")
        for key, value in self.__dict__.items():
            logging.info(f"### {key} = {value}")
```
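The step that copies the BERT configuration into the config object is worth seeing in isolation, because it is what makes attributes such as `hidden_size`, `pad_token_id`, and `max_position_embeddings` available later. A minimal sketch, assuming a plain `config.json` read with the standard `json` module (`ToyConfig` is a hypothetical stand-in, and the three keys are just a sample of what a BERT config contains):

```python
import json
import os
import tempfile


class ToyConfig:
    """Hypothetical stand-in for ModelConfig that only shows the merge step."""

    def __init__(self, bert_config_path):
        self.batch_size = 16
        self.num_labels = 4
        with open(bert_config_path, encoding="utf-8") as f:
            bert_config = json.load(f)
        # Every key of config.json becomes an attribute; a key that already
        # exists on the object is overwritten by the value from the file.
        for key, value in bert_config.items():
            self.__dict__[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"hidden_size": 768, "pad_token_id": 0,
                   "max_position_embeddings": 512}, f)
    config = ToyConfig(path)

print(config.batch_size, config.hidden_size, config.max_position_embeddings)
```

Because the merge runs after the task-specific attributes are set, a key with the same name in `config.json` would silently replace the value chosen in the class, so task settings are safest under names the BERT configuration does not use.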
Training
Finally, the whole model can be fine-tuned as follows:
```python
import logging
import os
import time

import torch
import intel_extension_for_pytorch as ipex
from transformers import BertTokenizer  # assumed to be the Hugging Face tokenizer

# evaluate and show_result are helpers of this project (not shown in this article).


def train(config):
    model = BertForMultipleChoice(config,
                                  config.pretrained_model_dir)
    model_save_path = os.path.join(config.model_save_dir, 'model.pt')
    if os.path.exists(model_save_path):
        loaded_paras = torch.load(model_save_path, map_location='cpu')
        model.load_state_dict(loaded_paras)
        logging.info("## Loaded an existing model; continuing training ...")
    model = model.to(config.device)
    optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
    model.train()
    # Apply the Intel Extension for PyTorch optimizations to the model and the optimizer.
    model, optimizer = ipex.optimize(model, optimizer=optimizer)
    bert_tokenize = BertTokenizer.from_pretrained(config.pretrained_model_dir).tokenize
    data_loader = LoadMultipleChoiceDataset(
        vocab_path=config.vocab_path,
        tokenizer=bert_tokenize,
        batch_size=config.batch_size,
        max_sen_len=config.max_sen_len,
        max_position_embeddings=config.max_position_embeddings,
        pad_index=config.pad_token_id,
        is_sample_shuffle=config.is_sample_shuffle,
        num_choice=config.num_labels)
    train_iter, test_iter, val_iter = \
        data_loader.load_train_val_test_data(config.train_file_path,
                                             config.val_file_path,
                                             config.test_file_path)
    max_acc = 0
    for epoch in range(config.epochs):
        losses = 0
        start_time = time.time()
        for idx, (qa, seg, mask, label) in enumerate(train_iter):
            qa = qa.to(config.device)  # [batch_size, num_choice, max_len]
            label = label.to(config.device)
            seg = seg.to(config.device)
            mask = mask.to(config.device)
            loss, logits = model(input_ids=qa,
                                 attention_mask=mask,
                                 token_type_ids=seg,
                                 position_ids=None,
                                 labels=label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            losses += loss.item()
            acc = (logits.argmax(1) == label).float().mean()
            if idx % 10 == 0:
                logging.info(f"Epoch: {epoch}, Batch[{idx}/{len(train_iter)}], "
                             f"Train loss :{loss.item():.3f}, Train acc: {acc:.3f}")
            if idx % 100 == 0:
                y_pred = logits.argmax(1).cpu()
                show_result(qa, y_pred, data_loader.vocab.itos, num_show=1)
        end_time = time.time()
        train_loss = losses / len(train_iter)
        logging.info(f"Epoch: {epoch}, Train loss: "
                     f"{train_loss:.3f}, Epoch time = {(end_time - start_time):.3f}s")
        # Validate every model_val_per_epoch epochs and keep the best checkpoint.
        if (epoch + 1) % config.model_val_per_epoch == 0:
            acc, _ = evaluate(val_iter, model,
                              config.device, inference=False)
            logging.info(f"Accuracy on val {acc:.3f}")
            if acc > max_acc:
                max_acc = acc
                torch.save(model.state_dict(), model_save_path)
```
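The checkpoint logic at the end of the training loop (validate every `model_val_per_epoch` epochs, save only when validation accuracy improves) can be traced on its own. The sketch below uses made-up accuracies and a hypothetical helper `best_checkpoints`; it records when a save would happen instead of calling `torch.save`.

```python
def best_checkpoints(val_accuracies, val_per_epoch=2):
    """Return the (epoch, accuracy) pairs at which a checkpoint would be saved."""
    saved = []
    max_acc = 0.0
    for epoch, acc in enumerate(val_accuracies):
        if (epoch + 1) % val_per_epoch != 0:
            continue  # no validation in this epoch
        if acc > max_acc:
            max_acc = acc
            saved.append((epoch, acc))  # stands in for torch.save(...)
    return saved


# Hypothetical validation accuracies for six epochs (indices 0..5).
history = [0.40, 0.55, 0.60, 0.52, 0.70, 0.71]
saved = best_checkpoints(history)
print(saved)
```

With validation every second epoch, only epochs 1, 3, and 5 are evaluated, so an improvement that happens in a skipped epoch is never checkpointed; a smaller `model_val_per_epoch` trades extra evaluation time for finer-grained checkpoints.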
Results
References
SWAG question answering with a pretrained BERT model (in Chinese): https://mp.weixin.qq.com/s/GqsbMBNt9XcFIjmumR04Pg