Transformer实战——微调多语言Transformer模型

Transformer实战------微调多语言Transformer模型

    • [0. 前言](#0. 前言)
    • [1. 微调单语言模型](#1. 微调单语言模型)
    • [2. 微调多语言模型 mBERT](#2. 微调多语言模型 mBERT)
    • [3. 微调多语言模型 XLM-R](#3. 微调多语言模型 XLM-R)
    • 相关链接

0. 前言

我们已经学习了多语言和跨语言语言模型的预训练,在本节中,我们验证微调后的多语言模型的性能,是否确实比单语言模型差。以土耳其语文本分类(七个类别)为例,我们已经学习了如何微调了一个专门的土耳其语单语言模型,并取得了良好的结果,接下来,我们将重复相同的步骤,保持其他条件不变,仅将土耳其语单语言模型分别替换为 mBERTXLM-R 模型。

1. 微调单语言模型

首先,微调 dbmdz/bert-base-turkish-uncased 模型:

python 复制代码
import pandas as pd
data= pd.read_csv("TTC4900.csv")
data=data.sample(frac=1.0, random_state=42)
data.head(5)
labels=["teknoloji","ekonomi","saglik","siyaset","kultur","spor","dunya"]
NUM_LABELS= len(labels)
id2label={i:l for i,l in enumerate(labels)}
label2id={l:i for i,l in enumerate(labels)}
data["labels"]=data.category.map(lambda x: label2id[x.strip()])
data.category.value_counts().plot(kind='pie', figsize=(8,8))
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-turkish-uncased")
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("dbmdz/bert-base-turkish-uncased",
                        num_labels=NUM_LABELS,
                        id2label=id2label,
                        label2id=label2id)
SIZE= data.shape[0]

train_texts= list(data.text[:SIZE//2])
val_texts=   list(data.text[SIZE//2:(3*SIZE)//4 ])
test_texts=  list(data.text[(3*SIZE)//4:])

train_labels= list(data.labels[:SIZE//2])
val_labels=   list(data.labels[SIZE//2:(3*SIZE)//4])
test_labels=  list(data.labels[(3*SIZE)//4:])
len(train_texts), len(val_texts), len(test_texts)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings  = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
from torch.utils.data import Dataset
class MyDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)
train_dataset = MyDataset(train_encodings, train_labels)
val_dataset = MyDataset(val_encodings, val_labels)
test_dataset = MyDataset(test_encodings, test_labels)
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support 
def compute_metrics(pred): 
    labels = pred.label_ids 
    preds = pred.predictions.argmax(-1) 
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro') 
    acc = accuracy_score(labels, preds) 
    return { 
        'Accuracy': acc, 
        'F1': f1, 
        'Precision': precision, 
        'Recall': recall 
} 
training_args = TrainingArguments(
    # The output directory where the model predictions and checkpoints will be written
    output_dir='./TTC4900Model', 
    do_train=True,
    do_eval=True,
    #  The number of epochs, defaults to 3.0 
    num_train_epochs=3,              
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=8,
    # Number of steps used for a linear warmup
    warmup_steps=100,                
    weight_decay=0.01,
    logging_strategy='steps',
    # TensorBoard log directory                 
    logging_dir='./multi-class-logs',            
    logging_steps=50,
    evaluation_strategy="epoch",
    eval_steps=50,
    save_strategy="epoch", 
    fp16=True,
    load_best_model_at_end=True
)
trainer = Trainer(
    # the pre-trained model that will be fine-tuned 
    model=model,
    # training arguments that we defined above                        
    args=training_args,                 
    train_dataset=train_dataset,         
    eval_dataset=val_dataset,            
    compute_metrics= compute_metrics
)
import torch
trainer.train()
q=[trainer.evaluate(eval_dataset=data) for data in [train_dataset, val_dataset, test_dataset]]
pd.DataFrame(q, index=["train","val","test"]).iloc[:,:5]

使用单语言模型时,性能指标如下所示:

2. 微调多语言模型 mBERT

要使用 mBERT 进行微调,只需要替换模型初始化的代码,使用 bert-base-multilingual-uncased 多语言模型:

python 复制代码
from transformers import BertForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-uncased", num_labels=NUM_LABELS, id2label=id2label, label2id=label2id)
training_args = TrainingArguments(
    # The output directory where the model predictions and checkpoints will be written
    output_dir='./TTC4900Model', 
    do_train=True,
    do_eval=True,
    #  The number of epochs, defaults to 3.0 
    num_train_epochs=3,              
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=8,
    # Number of steps used for a linear warmup
    warmup_steps=100,                
    weight_decay=0.01,
    logging_strategy='steps',
    # TensorBoard log directory                 
    logging_dir='./multi-class-logs',            
    logging_steps=50,
    evaluation_strategy="epoch",
    eval_steps=50,
    save_strategy="epoch", 
    fp16=True,
    load_best_model_at_end=True
)
trainer = Trainer(
    # the pre-trained model that will be fine-tuned 
    model=model,
    # training arguments that we defined above                        
    args=training_args,                 
    train_dataset=train_dataset,         
    eval_dataset=val_dataset,            
    compute_metrics= compute_metrics
)
import torch
trainer.train()
q=[trainer.evaluate(eval_dataset=data) for data in [train_dataset, val_dataset, test_dataset]]
pd.DataFrame(q, index=["train","val","test"]).iloc[:,:5]

保持所有其他参数和设置不变,运行代码,性能指标如下所示:

与其单语言模型相比,多语言模型在所有指标上均表现较差,差距约为 2.2%

3. 微调多语言模型 XLM-R

XLM-R 模型 xlm-roberta-base 进行相同问题的微调。执行 XLM-R 模型初始化:

python 复制代码
from transformers import AutoTokenizer, XLMRobertaForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base",num_labels=NUM_LABELS, id2label=id2label, label2id=label2id)
training_args = TrainingArguments(
    # The output directory where the model predictions and checkpoints will be written
    output_dir='./TTC4900Model', 
    do_train=True,
    do_eval=True,
    #  The number of epochs, defaults to 3.0 
    num_train_epochs=3,              
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=8,
    # Number of steps used for a linear warmup
    warmup_steps=100,                
    weight_decay=0.01,
    logging_strategy='steps',
    # TensorBoard log directory                 
    logging_dir='./multi-class-logs',            
    logging_steps=50,
    evaluation_strategy="epoch",
    eval_steps=50,
    save_strategy="epoch", 
    fp16=True,
    load_best_model_at_end=True
)
trainer = Trainer(
    # the pre-trained model that will be fine-tuned 
    model=model,
    # training arguments that we defined above                        
    args=training_args,                 
    train_dataset=train_dataset,         
    eval_dataset=val_dataset,            
    compute_metrics= compute_metrics
)

trainer.train()
q=[trainer.evaluate(eval_dataset=data) for data in [train_dataset, val_dataset, test_dataset]]
pd.DataFrame(q, index=["train","val","test"]).iloc[:,:5]

同样,其他设置保持不变。使用 XLM-R 模型得到的性能指标如下:

可以看到,XLM 模型的表现与单语言模型相当,差距仅为 1.0% 左右。因此,尽管在某些任务中单语言模型的表现可能优于多语言模型,但多语言模型也能取得令人满意的结果。如果单纯为了提升 1% 的性能,我们可能不愿意为此花费数十天或更长时间训练一个单语言模型。对于这种微小的性能差异,我们或许可以忽略不计。

相关链接

Transformer实战(1)------词嵌入技术详解
Transformer实战(2)------循环神经网络详解
Transformer实战(3)------从词袋模型到Transformer:NLP技术演进
Transformer实战(4)------从零开始构建Transformer
Transformer实战(5)------Hugging Face环境配置与应用详解
Transformer实战(6)------Transformer模型性能评估
Transformer实战(7)------datasets库核心功能解析
Transformer实战(8)------BERT模型详解与实现
Transformer实战(9)------Transformer分词算法详解
Transformer实战(10)------生成式语言模型 (Generative Language Model, GLM)
Transformer实战(11)------从零开始构建GPT模型
Transformer实战(12)------基于Transformer的文本到文本模型
Transformer实战(13)------从零开始训练GPT-2语言模型
Transformer实战(14)------微调Transformer语言模型用于文本分类
Transformer实战(15)------使用PyTorch微调Transformer语言模型
Transformer实战(16)------微调Transformer语言模型用于多类别文本分类
Transformer实战(17)------微调Transformer语言模型进行多标签文本分类
Transformer实战(18)------微调Transformer语言模型进行回归分析
Transformer实战(19)------微调Transformer语言模型进行词元分类
Transformer实战(20)------微调Transformer语言模型进行问答任务
Transformer实战(21)------文本表示(Text Representation)
Transformer实战(22)------使用FLAIR进行语义相似性评估
Transformer实战(23)------使用SBERT进行文本聚类与语义搜索
Transformer实战(24)------通过数据增强提升Transformer模型性能
Transformer实战(25)------自动超参数优化提升Transformer模型性能
Transformer实战(26)------通过领域适应提升Transformer模型性能
Transformer实战(27)------参数高效微调(Parameter Efficient Fine-Tuning,PEFT)
Transformer实战(28)------使用 LoRA 高效微调 FLAN-T5
Transformer实战(29)------大语言模型(Large Language Model,LLM)
Transformer实战(30)------Transformer注意力机制可视化
Transformer实战(31)------解释Transformer模型决策
Transformer实战(32)------Transformer模型压缩
Transformer实战(33)------高效自注意力机制
Transformer实战(34)------多语言和跨语言Transformer模型
Transformer实战(35)------跨语言相似性任务

相关推荐
程序员打怪兽15 小时前
详解Visual Transformer (ViT)网络模型
深度学习
CoovallyAIHub3 天前
仿生学突破:SILD模型如何让无人机在电力线迷宫中发现“隐形威胁”
深度学习·算法·计算机视觉
CoovallyAIHub3 天前
从春晚机器人到零样本革命:YOLO26-Pose姿态估计实战指南
深度学习·算法·计算机视觉
CoovallyAIHub3 天前
Le-DETR:省80%预训练数据,这个实时检测Transformer刷新SOTA|Georgia Tech & 北交大
深度学习·算法·计算机视觉
CoovallyAIHub3 天前
强化学习凭什么比监督学习更聪明?RL的“聪明”并非来自算法,而是因为它学会了“挑食”
深度学习·算法·计算机视觉
CoovallyAIHub3 天前
YOLO-IOD深度解析:打破实时增量目标检测的三重知识冲突
深度学习·算法·计算机视觉
用户1474853079743 天前
AI-动手深度学习环境搭建-d2l
深度学习
OpenBayes贝式计算3 天前
解决视频模型痛点,TurboDiffusion 高效视频扩散生成系统;Google Streetview 涵盖多个国家的街景图像数据集
人工智能·深度学习·机器学习
OpenBayes贝式计算3 天前
OCR教程汇总丨DeepSeek/百度飞桨/华中科大等开源创新技术,实现OCR高精度、本地化部署
人工智能·深度学习·机器学习
warm3snow3 天前
AI 核心技能系列:12 篇文章带你系统掌握大模型岗位必备技能
ai·transformer·agent·skill·mcp·fine-tunning