[HuggingFace LLM] Classic NLP Fine-Tuning Tasks: Translation

Main Text

Data Preparation

python
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt")

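# split_datasets is assumed to be the train/validation split prepared earlier (not shown here)
en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]

# text_target tokenizes the French sentence as a target and stores it under "labels"
inputs = tokenizer(en_sentence, text_target=fr_sentence)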

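Note that the target sentence must be tokenized as a target (through text_target), because the source and target languages are tokenized differently.

For example, tokenizing the target sentence (fr) as if it were source text (en) gives: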

python
>>> tokenizer.convert_ids_to_tokens(tokenizer(fr_sentence)["input_ids"])
['▁Par', '▁dé', 'f', 'aut', ',', '▁dé', 've', 'lop', 'per', '▁les', '▁fil', 's', '▁de', '▁discussion', '</s>']
>>> tokenizer.convert_ids_to_tokens(inputs["labels"])
['▁Par', '▁défaut', ',', '▁développer', '▁les', '▁fils', '▁de', '▁discussion', '</s>']

The token count goes up because many French words never appear in the English vocabulary, so they are broken into smaller pieces.
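
To make the effect concrete, here is a minimal sketch of vocabulary-driven splitting. It uses a toy greedy longest-match rule, not the actual SentencePiece algorithm behind the Marian tokenizer, and both vocabularies are hypothetical sets chosen only to mirror the pieces shown in the output above.

```python
def greedy_segment(word, vocab):
    """Split `word` into the longest pieces found in `vocab` (toy rule)."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: emit the single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical vocabularies: the "source" one only knows short fragments,
# the "target" one knows the full French words.
source_vocab = {"dé", "f", "aut", "ve", "lop", "per"}
target_vocab = {"défaut", "développer"}

for word in ["défaut", "développer"]:
    print(word, greedy_segment(word, source_vocab), greedy_segment(word, target_vocab))
```

With the fragment-only vocabulary each word falls apart into three or four pieces, while the vocabulary that contains the whole word returns a single piece.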

!NOTE

With the text_target argument, the tokenized target sentence is stored in the inputs dictionary under the labels key.

Fine-Tuning with the Trainer API

python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

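The model is loaded with the AutoModelForSeq2SeqLM class, which suits Seq2Seq tasks such as translation and summarization.

Once the model is defined, the prepared data has to be padded. DataCollatorForSeq2Seq pads dynamically to the longest sequence in each batch, and it is needed here because the labels must be padded as well as input_ids (labels are padded with -100 so that those positions are ignored by the loss).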
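
The padding behaviour can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the function name pad_batch and the sample IDs are made up, and 59513 is the pad ID that shows up in the batch printed below.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_input - len(f["input_ids"])
        n_lab = max_label - len(f["labels"])
        # Inputs are padded with the pad token and masked out with 0s.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # Labels are padded with -100 so the loss skips those positions.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch


# Made-up token IDs of different lengths.
features = [
    {"input_ids": [10, 11, 0], "labels": [577, 5891, 0]},
    {"input_ids": [10, 11, 12, 13, 0], "labels": [1211, 3, 49, 9409, 1211, 0]},
]
batch = pad_batch(features, pad_token_id=59513)
print(batch["input_ids"])
print(batch["labels"])
```

Because the target length is the longest sequence in the batch rather than a global maximum, short batches stay short.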

python
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
>>> batch.keys()
dict_keys(['attention_mask', 'input_ids', 'labels', 'decoder_input_ids'])
>>> batch["labels"]
tensor([[  577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,  -100,
          -100,  -100,  -100,  -100,  -100,  -100],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
           550,  7032,  5821,  7907, 12649,     0]])
>>> batch["decoder_input_ids"]
tensor([[59513,   577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,
         59513, 59513, 59513, 59513, 59513, 59513],
        [59513,  1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,
           817,   550,  7032,  5821,  7907, 12649]])

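!NOTE decoder_input_ids is an important input

  1. Suppose the model processes the source sentence input_ids ["I", "love", "AI"]. At this point:
  • Input: the word embeddings corresponding to input_ids;
  • Output: the Encoder produces hidden states that carry the semantic information of the whole sentence.
  2. The decoder reads the complete decoder_input_ids ["<bos>", "我", "爱", "AI"]. The Decoder has to make 4 predictions at once, normally together with a causal (lower-triangular) mask that hides future tokens:
  • Input <bos>: given <bos> + the source sentence, the model outputs through Softmax a probability for every word in the vocabulary. The cross-entropy loss between the predicted distribution and the true label "我" is computed;
  • Input 我: given <bos>, 我 + the source sentence, the model predicts the next word...
  • Input AI: given <bos>, 我, 爱, AI + the source sentence, the model should predict <eos>.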

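Training works by feeding the complete decoder_input_ids (the correct answer) to the model and relying on the mask. This is why decoder_input_ids must be shifted right during training: at position $t$ the input is the $t$-th token of decoder_input_ids, and the model is expected to output the $(t+1)$-th token (that is, the $t$-th token of labels).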
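
The right shift can be reproduced on the first row of the batch printed earlier. This is a minimal sketch that assumes the decoder start token and the pad token are both 59513, which is what the printed decoder_input_ids suggest for this checkpoint. The function name is illustrative.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels: prepend the start token, drop the last label."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # -100 only makes sense in labels; as an input it becomes the pad token.
    return [pad_token_id if token == -100 else token for token in shifted]


# First row of batch["labels"] from the output above.
labels = [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0,
          -100, -100, -100, -100, -100, -100, -100]

decoder_input_ids = shift_tokens_right(
    labels, pad_token_id=59513, decoder_start_token_id=59513
)
print(decoder_input_ids)
```

The result matches the first row of batch["decoder_input_ids"]: the input at each position is the label of the previous position.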

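| Position | Input token | Pos 0 (<bos>) | Pos 1 (我) | Pos 2 (爱) | Pos 3 (AI) | What the model can see at this step |
| --- | --- | --- | --- | --- | --- | --- |
| Pos 0 | <bos> | ✅ visible | 🚫 masked | 🚫 masked | 🚫 masked | Only <bos> |
| Pos 1 | 我 | ✅ visible | ✅ visible | 🚫 masked | 🚫 masked | <bos> 我 |
| Pos 2 | 爱 | ✅ visible | ✅ visible | ✅ visible | 🚫 masked | <bos> 我 爱 |
| Pos 3 | AI | ✅ visible | ✅ visible | ✅ visible | ✅ visible | <bos> 我 爱 AI |
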
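
The visibility pattern is just a lower-triangular matrix. A small sketch in plain Python, using the same four decoder tokens as the example (real models build this mask as a tensor inside the attention layers):

```python
def causal_mask(size):
    """Row i may attend to column j only when j <= i (lower triangle)."""
    return [[col <= row for col in range(size)] for row in range(size)]


tokens = ["<bos>", "我", "爱", "AI"]
mask = causal_mask(len(tokens))

visible_per_position = []
for row, allowed in enumerate(mask):
    visible = [tok for tok, ok in zip(tokens, allowed) if ok]
    visible_per_position.append(visible)
    print(f"Pos {row}: sees {visible}")
```
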
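For the final evaluation we use the BLEU score. BLEU compares text, so the predicted token IDs must first be decoded back into translated sentences. The metric function is therefore:
python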
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
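
  1. tokenizer.batch_decode(): converts a batch of token ID sequences back into natural-language text (the reverse of tokenization). skip_special_tokens=True skips special tokens (BOS/EOS/PAD and so on) and keeps only the actual text.
    • Example: the predicted token IDs [577, 5891, 2, 3184, 0] decode to "Par défaut, développer" (EOS=0 is dropped).
  2. -100 in the labels marks padding positions that are ignored when the loss is computed (they were filled in earlier by the data collator). tokenizer.batch_decode() cannot decode -100 because no such ID exists in the vocabulary, so np.where replaces every -100 with tokenizer.pad_token_id (the tokenizer's padding token ID, 59513 for this Marian checkpoint) and the labels can then be decoded normally.
    • Example: the original labels [577, 5891, 2, -100, -100] become [577, 5891, 2, 59513, 59513].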
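
The two steps can be imitated without loading a tokenizer. The sketch below is a plain-Python stand-in: the ID-to-token table is copied from the outputs shown earlier, the "<pad>" string and the toy_decode helper are assumptions for illustration, and the list comprehension plays the role of np.where.

```python
# Token table taken from the labels / tokens printed earlier in this post.
ID_TO_TOKEN = {577: "▁Par", 5891: "▁défaut", 2: ",", 3184: "▁développer",
               0: "</s>", 59513: "<pad>"}
SPECIAL_IDS = {0, 59513}  # EOS and PAD
PAD_ID = 59513


def replace_ignore_index(labels, pad_id, ignore_index=-100):
    """Plain-Python equivalent of np.where(labels != -100, labels, pad_id)."""
    return [[pad_id if t == ignore_index else t for t in row] for row in labels]


def toy_decode(ids, skip_special_tokens=True):
    """Toy decoder: look up pieces, drop special tokens, turn '▁' into spaces."""
    pieces = [ID_TO_TOKEN[i] for i in ids
              if not (skip_special_tokens and i in SPECIAL_IDS)]
    return "".join(pieces).replace("▁", " ").strip()


labels = [[577, 5891, 2, 3184, 0, -100, -100]]
cleaned = replace_ignore_index(labels, PAD_ID)
print(cleaned)
print(toy_decode(cleaned[0]))
```

Looking up -100 directly in the table would raise a KeyError, which is the toy counterpart of the decoding failure described above.
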
python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    f"marian-finetuned-kde4-en-to-fr",
    eval_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

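max_length = 128  # generation length limit; assumed to match the 128 used elsewhere in this post

trainer.evaluate(max_length=max_length)  # BLEU before fine-tuning (baseline)
trainer.train()
trainer.evaluate(max_length=max_length)  # BLEU after fine-tuning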

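Training with Accelerate

The training part is similar, as the code below shows. Pay particular attention to the evaluation part.

python
## Dataset preparation (training and validation sets)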
from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], 
    collate_fn=data_collator, 
    batch_size=8
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

## Optimizer (AdamW)
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

## Wrap everything with accelerator.prepare
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

## Linear learning-rate schedule that decays the learning rate to 0
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

## Prepare the repository
from huggingface_hub import Repository, get_full_repo_name

model_name = "marian-finetuned-kde4-en-to-fr-accelerate"
repo_name = get_full_repo_name(model_name)
output_dir = "marian-finetuned-kde4-en-to-fr-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

## Metric post-processing, similar to compute_metrics above
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels

from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

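generate() is a method of the base model, not of the wrapped model created by accelerator.prepare(). We therefore call accelerator.unwrap_model() first to get the base model back.

Accelerate may run several processes, and each process can end up with predictions and labels of a different padded length. pad_across_processes brings them to the same shape, and only then is gather called.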
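
Why the padding has to come before the gather can be shown with a single-process simulation. The helpers below are illustrative stand-ins that work on nested lists. They are not the Accelerate API, and the token IDs are made up.

```python
def pad_across(per_process, pad_index):
    """Pad every row to the widest row found on any simulated process."""
    width = max(len(row) for batch in per_process for row in batch)
    return [[row + [pad_index] * (width - len(row)) for row in batch]
            for batch in per_process]


def gather(per_process):
    """Concatenate the per-process batches along the batch dimension."""
    return [row for batch in per_process for row in batch]


# Two simulated processes whose generated sequences have different lengths.
process_0 = [[577, 5891, 0], [1211, 3, 0]]               # shape (2, 3)
process_1 = [[49, 9409, 1211, 3, 0], [817, 550, 0, 59513, 59513]]  # shape (2, 5)

padded = pad_across([process_0, process_1], pad_index=59513)
gathered = gather(padded)
for row in gathered:
    print(row)
```

Without the padding step the concatenated rows would have lengths 3 and 5, which cannot form one rectangular tensor.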

Exercise

Train an en-zh model with Accelerate.

python
!pip install datasets==2.18.0
!pip install sacrebleu
!pip install evaluate
import evaluate

metric = evaluate.load("sacrebleu")
predictions = [
    "我爱人工"
]
references = [
    [
        "我爱人工智能"
    ]
]
metric.compute(predictions=predictions, references=references, tokenize='zh')

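datasets==2.18.0 is installed because the library version preinstalled on Colab is too new and load_dataset fails with it.

#Important In addition, because the text being compared is Chinese, tokenize='zh' must be set. The two results below show the difference:
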
// tokenize='zh'
{'score': 60.653065971263366,
 'counts': [4, 3, 2, 1],
 'totals': [4, 3, 2, 1],
 'precisions': [100.0, 100.0, 100.0, 100.0],
 'bp': 0.6065306597126334,
 'sys_len': 4,
 'ref_len': 6}
 
// without tokenize
 {'score': 0.0,
 'counts': [0, 0, 0, 0],
 'totals': [1, 0, 0, 0],
 'precisions': [0.0, 0.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 1,
 'ref_len': 1}
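
The numbers in the two results can be checked by hand. The sketch below recomputes the score from the printed counts, totals and lengths using the standard BLEU formula (brevity penalty times the geometric mean of the n-gram precisions). It is a simplified calculation that leaves out sacreBLEU's smoothing and simply returns 0 when an n-gram order has no match.

```python
import math


def bleu_from_stats(counts, totals, sys_len, ref_len):
    """BLEU from matched n-gram counts; simplified, no smoothing."""
    if any(c == 0 for c in counts):
        return 0.0  # simplification: a missing n-gram order gives score 0
    precisions = [c / t for c, t in zip(counts, totals)]
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
    return 100 * bp * geo_mean


# tokenize='zh': "我爱人工" is 4 characters, the reference is 6.
score_zh = bleu_from_stats([4, 3, 2, 1], [4, 3, 2, 1], sys_len=4, ref_len=6)
# No tokenize: each sentence is one token and the two tokens differ.
score_default = bleu_from_stats([0, 0, 0, 0], [1, 0, 0, 0], sys_len=1, ref_len=1)
print(score_zh, score_default)
```

With tokenize='zh' every n-gram of the prediction matches, so the whole 60.65 comes from the brevity penalty exp(1 - 6/4). Without it, sys_len and ref_len are both 1: each sentence is a single token, the two tokens differ, and the score is 0.
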
python
from datasets import load_dataset
raw_dataset = load_dataset("kde4", lang1="en", lang2="zh_CN")

split_dataset = raw_dataset['train'].train_test_split(train_size=0.9, seed=42)
train_valid_dataset = split_dataset['train'].train_test_split(train_size=0.9, seed=42)
split_dataset['train'] = train_valid_dataset['train']
split_dataset['valid'] = train_valid_dataset['test']

import re
pattern = re.compile(r'[a-zA-Z]')

def filter_english_in_zh(sample):
    # 1. Read the data safely to avoid a KeyError or None
    translation = sample.get('translation', {})
    if not translation:
        return False  # missing data: drop

    zh_sentence = translation.get('zh_CN')

    # 2. Check the type so that non-string values do not raise
    if not isinstance(zh_sentence, str):
        return False  # not a string: drop

    # 3. Regex search (a single Latin letter counts as containing English)
    if pattern.search(zh_sentence):
        return False  # contains English: drop (return False)

    # 4. Keep only pure Chinese (or other non-English symbols)
    return True  # keep (return True)

clean_dataset = split_dataset.filter(filter_english_in_zh)
clean_dataset

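A regular expression is used to filter out samples whose Chinese field contains English letters.

With the filter method, a sample is kept if the function returns True and dropped if it returns False.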
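
The keep/drop rule is easy to try on a few hand-written samples. The sketch below applies a compact version of the same predicate with an ordinary list comprehension instead of Dataset.filter. The sample sentences are made up.

```python
import re

LATIN = re.compile(r"[a-zA-Z]")


def keep_sample(sample):
    """True when the zh_CN field is a string without any Latin letters."""
    zh = (sample.get("translation") or {}).get("zh_CN")
    return isinstance(zh, str) and LATIN.search(zh) is None


# Made-up samples in the same shape as the KDE4 records.
samples = [
    {"translation": {"en": "Open file", "zh_CN": "打开文件"}},
    {"translation": {"en": "KDE settings", "zh_CN": "KDE 设置"}},  # Latin letters
    {"translation": {"en": "Broken", "zh_CN": None}},               # not a string
    {},                                                             # missing field
]

kept = [s for s in samples if keep_sample(s)]
print(len(kept), kept[0]["translation"]["zh_CN"])
```

Only the first sample survives. Note that the rule is strict: a correct translation that legitimately keeps a product name such as "KDE" is dropped as well.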

python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq

checkpoint = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, return_tensors="pt")

def batch_tokenizer(samples):
  en_sentences = [sample['en'] for sample in samples['translation']]
  zh_sentences = [sample['zh_CN'] for sample in samples['translation']]

  return tokenizer(en_sentences,
          text_target=zh_sentences,
          max_length=128,
          truncation=True)

tokenized_dataset = clean_dataset.map(batch_tokenizer, batched=True, remove_columns=clean_dataset["train"].column_names)

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

remove_columns=clean_dataset["train"].column_names removes the original dataset columns, leaving only input_ids, attention_mask, and labels, the fields that are fed to the model.
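
What a batched map call with remove_columns does to one batch can be imitated with dictionaries. This is a rough sketch of the semantics only: toy_tokenize and toy_batched_map are hypothetical helpers, and the fake IDs have nothing to do with the real tokenizer.

```python
def toy_tokenize(samples):
    """Stand-in for batch_tokenizer: samples['translation'] is a list of dicts."""
    en = [pair["en"] for pair in samples["translation"]]
    zh = [pair["zh_CN"] for pair in samples["translation"]]
    input_ids = [[len(word) for word in s.split()] for s in en]  # fake IDs
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
        "labels": [[ord(ch) for ch in s] for s in zh],           # fake IDs
    }


def toy_batched_map(batch, fn, remove_columns):
    """Add the columns returned by fn, then drop the listed columns."""
    merged = {**batch, **fn(batch)}
    return {name: col for name, col in merged.items() if name not in remove_columns}


# In batched mode a batch is a dict of columns, each a list of values.
batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Open file", "zh_CN": "打开文件"},
        {"en": "Save", "zh_CN": "保存"},
    ],
}
result = toy_batched_map(batch, toy_tokenize, remove_columns=list(batch))
print(sorted(result))
```

Leaving id and translation in place would hand the data collator string columns it cannot pad, which is why they are removed.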

python
from torch.utils.data import DataLoader, Dataset
import torch
import numpy as np

train_dataloader = DataLoader(
    tokenized_dataset["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_dataset["valid"],
    collate_fn=data_collator,
    batch_size=8
)

## Optimizer (AdamW)
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

## Wrap everything with accelerator.prepare
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

## Linear learning-rate schedule that decays the learning rate to 0
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
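
The "linear" schedule configured above can be written out as a small formula: an optional linear warmup followed by a straight line down to 0 at the last step. The sketch below is a plain-Python rendering of that rule. The function name and the step counts in the usage lines are illustrative.

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


# Same settings as above (lr=2e-5, no warmup), with a made-up total of 100 steps.
for step in (0, 25, 50, 75, 100):
    print(step, linear_lr(step, 2e-5, num_warmup_steps=0, num_training_steps=100))
```

With num_warmup_steps=0 the rate starts at the full 2e-5 and is halved exactly at the midpoint of training.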

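tokenized_dataset.set_format('torch') is not needed here: DataCollatorForSeq2Seq pads each batch and returns PyTorch tensors by default (its return_tensors argument defaults to "pt").

python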
from huggingface_hub import Repository, get_full_repo_name, notebook_login, HfApi
import os
import shutil

# Log in to Hugging Face
notebook_login()

# Model name
model_name = "marian-finetuned-kde4-en-to-zh-accelerate"
output_dir = model_name

# Full repository ID (username/model_name)
repo_id = get_full_repo_name(model_name)

# Configure the Git user identity
os.system("git config --global user.email 'your_email@qq.com'")
os.system("git config --global user.name 'your_name'")

# Get the token
api = HfApi()
huggingface_token = api.token

try:
    api.create_repo(repo_id=repo_id, exist_ok=True)
    print(f"Remote repository '{repo_id}' checked/created.")
except Exception as e:
    print(f"Note: Could not create repo or check existence (might be fine if it exists): {e}")

# Initialize the local repository (existing remote content is cloned automatically)
repo = Repository(
    local_dir=output_dir,
    clone_from=repo_id,
    token=huggingface_token
)

repo.push_to_hub(commit_message="Initial commit from Colab setup", blocking=False)

print(f"Repository initialized locally at {repo.local_dir}")
print(f"Remote repository ID: {repo_id}")

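#ManualRepoInit The repository is not initialized the way the tutorial does it. Points to note:

  1. Set the git email and user name;
  2. Get the token through the API, then create and initialize the repository;
  3. A model with the same name has to be created manually on HuggingFace.

python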
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
    
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    total_loss = 0.0  # running sum of the training loss
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

        total_loss += loss.item()

    # Print the average loss at the end of each epoch
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch} | Average Training Loss: {avg_train_loss:.4f}")

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute(tokenize="zh")
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

The same caveats as above apply.

Questions

Summary