Main text
Data preparation
python
from transformers import AutoTokenizer
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt")
en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]
inputs = tokenizer(en_sentence, text_target=fr_sentence)
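Note that the target sentence has to be tokenized as a target, by passing it through text_target, because the tokenizer handles the source and target languages differently.
For example, tokenizing the target (fr) sentence with the source (en) tokenizer gives the first result below, while tokenizing it as a target gives the second:
python
>>> tokenizer.convert_ids_to_tokens(tokenizer(fr_sentence)["input_ids"])
['▁Par', '▁dé', 'f', 'aut', ',', '▁dé', 've', 'lop', 'per', '▁les', '▁fil', 's', '▁de', '▁discussion', '</s>']
>>> tokenizer.convert_ids_to_tokens(inputs["labels"])
['▁Par', '▁défaut', ',', '▁développer', '▁les', '▁fils', '▁de', '▁discussion', '</s>']
With the source tokenizer the number of tokens grows, essentially because the en vocabulary has never seen many of the fr character combinations, so words are split into smaller pieces.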
!NOTE
When the text_target argument is used, the tokenized target sentence is stored in the inputs dictionary as labels.
Fine-tuning with the Trainer API
python
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
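The AutoModelForSeq2SeqLM class loads models suited to Seq2Seq tasks such as translation and summarization.
Once the model is defined, the prepared data has to be padded batch by batch (to the longest sequence in the batch). Both the inputs input_ids and the labels need padding, and the labels are padded with -100 rather than the pad token, so DataCollatorForSeq2Seq is used.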
python
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
>>> batch.keys()
dict_keys(['attention_mask', 'input_ids', 'labels', 'decoder_input_ids'])
>>> batch["labels"]
tensor([[  577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,  -100,
          -100,  -100,  -100,  -100,  -100,  -100],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
           550,  7032,  5821,  7907, 12649,     0]])
>>> batch["decoder_input_ids"]
tensor([[59513,   577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,
         59513, 59513, 59513, 59513, 59513, 59513],
        [59513,  1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,
           817,   550,  7032,  5821,  7907, 12649]])
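The padding behaviour shown in this batch can be imitated in plain Python. This is only a minimal sketch of the idea, not the DataCollatorForSeq2Seq implementation: the helper pad_batch and the short token ID lists are made up for illustration, while 59513 and -100 are the pad values that appear in the output above.

```python
# Illustrative only: a plain-Python imitation of dynamic padding, not library code.
PAD_TOKEN_ID = 59513   # Marian pad token ID, as seen in the batch above
LABEL_PAD_ID = -100    # label positions with this value are ignored by the loss

def pad_batch(features):
    """Pad every feature to the longest sequence in the batch (hypothetical helper)."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_inputs - len(f["input_ids"])
        n_lab = max_labels - len(f["labels"])
        # Inputs are padded with the pad token and masked out in attention_mask.
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # Labels are padded with -100 instead of the pad token.
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_lab)
    return batch

# Made-up token IDs; 0 plays the role of </s>.
features = [
    {"input_ids": [10, 11, 0], "labels": [577, 5891, 0]},
    {"input_ids": [12, 13, 14, 15, 0], "labels": [1211, 3, 49, 9409, 0]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
print(batch["labels"][0])
```

The shorter sample is extended to the length of the longer one, which is why the padded length changes from batch to batch.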
!NOTE
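decoder_input_ids is an important input.
- Suppose the model processes the source sentence input_ids: ["I", "love", "AI"]. In the encoder:
  - Input: the word embeddings corresponding to input_ids;
  - Output: the encoder produces hidden states that carry the complete semantic information of the sentence.
- The decoder reads the full decoder_input_ids: ["<bos>", "我", "爱", "AI"]. It has to perform 4 prediction tasks at the same time, normally together with a causal (lower-triangular) mask that prevents it from seeing future tokens:
  - Input <bos>: given <bos> + the source sentence, the model outputs through softmax a probability for every word in the vocabulary, and the cross-entropy between this prediction and the true label "我" is computed;
  - Input 我: given <bos>, 我 + the source sentence, the model predicts the next word...
  - Input AI: given <bos>, 我, 爱, AI + the source sentence, the model should predict <eos>.
Training is therefore done by feeding the complete decoder_input_ids (the correct answer) to the model and relying on the mask. This is why decoder_input_ids must be shifted right during training: at position $t$ the input is the $t$-th word, and the model is expected to output the $(t+1)$-th word (that is, the $t$-th word in labels).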
| Position | Input token | Pos 0 (<bos>) | Pos 1 (我) | Pos 2 (爱) | Pos 3 (AI) | What the model can see at this step |
|---|---|---|---|---|---|---|
| Pos 0 | <bos> | ✅ visible | 🚫 masked | 🚫 masked | 🚫 masked | Only <bos> |
| Pos 1 | 我 | ✅ visible | ✅ visible | 🚫 masked | 🚫 masked | <bos> and 我 |
| Pos 2 | 爱 | ✅ visible | ✅ visible | ✅ visible | 🚫 masked | <bos> 我 爱 |
| Pos 3 | AI | ✅ visible | ✅ visible | ✅ visible | ✅ visible | <bos> 我 爱 AI |
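
The right shift and the causal mask described above can be sketched in plain Python. This is a simplified illustration, not the model's own code: shift_right and causal_mask are hypothetical helpers, and the assumption that 59513 is both the pad token and the decoder start token is inferred from the batch printed earlier in this section.

```python
PAD_TOKEN_ID = 59513            # Marian pad ID from the batch shown earlier
DECODER_START_TOKEN_ID = 59513  # assumption: the same ID starts the decoder input

def shift_right(labels):
    """Build decoder_input_ids: prepend the start token, drop the last label, map -100 to pad."""
    shifted = [DECODER_START_TOKEN_ID] + labels[:-1]
    return [PAD_TOKEN_ID if tok == -100 else tok for tok in shifted]

def causal_mask(length):
    """Lower-triangular mask: row t may attend to columns 0..t only (1 = visible, 0 = masked)."""
    return [[1 if col <= row else 0 for col in range(length)] for row in range(length)]

# First row of batch["labels"] from the collator output above.
labels = [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0] + [-100] * 7
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)

# The 4x4 mask matches the visibility table for <bos>, 我, 爱, AI.
for row in causal_mask(4):
    print(row)
```

Applied to the first row of labels, the shift reproduces the first row of decoder_input_ids printed by the collator.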
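The final evaluation uses the BLEU score. BLEU compares text, so the predicted token IDs first have to be decoded into translated sentences. The metric is therefore computed as follows: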
python
import numpy as np
import evaluate
metric = evaluate.load("sacrebleu")
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
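- tokenizer.batch_decode(): converts a batch of token ID sequences back into natural-language text (the reverse of tokenization). skip_special_tokens=True skips special tokens (BOS, EOS, PAD and so on) and keeps only the actual text.
  - Example: the predicted token IDs [577, 5891, 2, 3184, 0] decode to "Par défaut, développer" (EOS=0 is dropped).
- -100 in the labels marks padding positions that the loss ignores (they were added earlier by the DataCollator). tokenizer.batch_decode() cannot decode -100 because the vocabulary has no such ID, so np.where replaces every -100 with tokenizer.pad_token_id (the tokenizer's padding token ID, for example 59513 for Marian) and the labels can then be decoded normally.
  - Example: the original labels [577, 5891, 2, -100, -100] become [577, 5891, 2, 59513, 59513].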
python
from transformers import Seq2SeqTrainingArguments
args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",
    eval_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)
from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
# Baseline BLEU before fine-tuning; max_length is assumed to be defined earlier (the generation code later uses 128)
trainer.evaluate(max_length=max_length)
trainer.train()
# BLEU after fine-tuning
trainer.evaluate(max_length=max_length)
Training with Accelerate
The training procedure is similar, as the code below shows. Pay particular attention to the evaluation part.
python
## Prepare the datasets (training set, validation set)
from torch.utils.data import DataLoader
tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"],
    collate_fn=data_collator,
    batch_size=8,
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
## Prepare the optimizer (AdamW)
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=2e-5)
## Wrap everything with accelerator.prepare
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
## Linear learning-rate schedule that decays the learning rate to 0
from transformers import get_scheduler
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
## Prepare the repository
from huggingface_hub import Repository, get_full_repo_name
model_name = "marian-finetuned-kde4-en-to-fr-accelerate"
repo_name = get_full_repo_name(model_name)
output_dir = "marian-finetuned-kde4-en-to-fr-accelerate"
repo = Repository(output_dir, clone_from=repo_name)
## Post-processing for the metric, similar to compute_metrics earlier
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
from tqdm.auto import tqdm
import torch
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]
        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)
        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)
    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")
    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
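generate() is a method of the base model, not of the wrapped model returned by accelerator.prepare(), so accelerator.unwrap_model() is used to recover the base model first.
Accelerate may run several processes, and each process can end up padding its batch to a different length. pad_across_processes therefore has to bring the predictions and the labels to the same shape on every process before gather is called.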
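Why the padding has to happen before gathering can be seen in a small single-process simulation. The two functions below are plain-Python stand-ins that only borrow the names of the Accelerate methods; they are not the library implementation, and the token IDs are made up.

```python
PAD_TOKEN_ID = 59513  # pad ID used for generated tokens in this document

def pad_across_processes(per_process, pad_index):
    """Stand-in: pad each process's batch along dim 1 to the longest length on any process."""
    target = max(len(row) for batch in per_process for row in batch)
    return [
        [row + [pad_index] * (target - len(row)) for row in batch]
        for batch in per_process
    ]

def gather(per_process):
    """Stand-in: concatenate the per-process batches along dim 0."""
    return [row for batch in per_process for row in batch]

# Two simulated processes whose generated sequences have different lengths.
generated = [
    [[59513, 577, 0], [59513, 12, 0]],                    # process 0: length 3
    [[59513, 1211, 3, 49, 0], [59513, 7, 8, 0, 59513]],   # process 1: length 5
]
padded = pad_across_processes(generated, PAD_TOKEN_ID)
gathered = gather(padded)
for row in gathered:
    print(row)
```

Without the padding step the rows coming from the two processes would have lengths 3 and 5 and could not be stacked into one rectangular tensor.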
Exercise
Train an en-zh model with Accelerate
python
!pip install datasets==2.18.0
!pip install sacrebleu
!pip install evaluate
import evaluate
metric = evaluate.load("sacrebleu")
predictions = ["我爱人工"]
references = [["我爱人工智能"]]
metric.compute(predictions=predictions, references=references, tokenize='zh')
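datasets==2.18.0 is installed because the library version preconfigured on Colab is too new and load_dataset cannot be used with it.
#important In addition, because the text being compared is Chinese, tokenize='zh' has to be set.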
// tokenize='zh'
{'score': 60.653065971263366,
'counts': [4, 3, 2, 1],
'totals': [4, 3, 2, 1],
'precisions': [100.0, 100.0, 100.0, 100.0],
'bp': 0.6065306597126334,
'sys_len': 4,
'ref_len': 6}
// without tokenize
{'score': 0.0,
'counts': [0, 0, 0, 0],
'totals': [1, 0, 0, 0],
'precisions': [0.0, 0.0, 0.0, 0.0],
'bp': 1.0,
'sys_len': 1,
'ref_len': 1}
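The two scores above can be reproduced by hand from the reported precisions and lengths. The sketch below is a simplified BLEU formula (geometric mean of the n-gram precisions times the brevity penalty, with no smoothing), not the sacrebleu implementation; the function names are chosen for illustration.

```python
import math

def brevity_penalty(sys_len, ref_len):
    """Penalize hypotheses that are shorter than the reference."""
    return 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)

def bleu(precisions, sys_len, ref_len):
    """Geometric mean of the n-gram precisions (in percent) times the brevity penalty."""
    if min(precisions) <= 0:
        return 0.0  # simplification: any zero precision gives a zero score
    log_mean = sum(math.log(p / 100) for p in precisions) / len(precisions)
    return 100 * brevity_penalty(sys_len, ref_len) * math.exp(log_mean)

# tokenize='zh': the hypothesis has 4 tokens, the reference 6, and every n-gram matches.
score_zh = bleu([100.0, 100.0, 100.0, 100.0], sys_len=4, ref_len=6)
# Without tokenize: each sentence counts as a single token and nothing matches.
score_default = bleu([0.0, 0.0, 0.0, 0.0], sys_len=1, ref_len=1)
print(round(score_zh, 2), score_default)  # 60.65 0.0
```

With tokenize='zh' all four precisions are 100, so the score is entirely the brevity penalty exp(1 - 6/4) ≈ 0.6065. Without it, sys_len and ref_len are both 1, which indicates that each unsegmented sentence was treated as one token, so the unigram does not match and the score is 0.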
python
from datasets import load_dataset
raw_dataset = load_dataset("kde4", lang1="en", lang2="zh_CN")
split_dataset = raw_dataset['train'].train_test_split(train_size=0.9, seed=42)
train_valid_dataset = split_dataset['train'].train_test_split(train_size=0.9, seed=42)
split_dataset['train'] = train_valid_dataset['train']
split_dataset['valid'] = train_valid_dataset['test']
import re
pattern = re.compile(r'[a-zA-Z]')
def filter_english_in_zh(sample):
    # 1. Fetch the data safely to avoid KeyError or None
    translation = sample.get('translation', {})
    if not translation:
        return False  # data missing, drop
    zh_sentence = translation.get('zh_CN')
    # 2. Check the type so that non-string values do not raise
    if not isinstance(zh_sentence, str):
        return False  # not a string, drop
    # 3. Regex search: a single Latin letter counts as containing English
    if pattern.search(zh_sentence):
        return False  # contains English, drop
    # 4. Keep only pure Chinese (or other non-English symbols)
    return True  # keep
clean_dataset = split_dataset.filter(filter_english_in_zh)
clean_dataset
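A regular expression is used to filter out samples whose Chinese field contains English letters.
With the filter method, a sample is kept when the function returns True and dropped when it returns False.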
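The keep/drop rule can be checked on a few hand-written samples without loading the dataset. This is a condensed sketch of the same predicate: keep_sample is a hypothetical name, and the three samples are invented for illustration rather than taken from kde4.

```python
import re

pattern = re.compile(r"[a-zA-Z]")

def keep_sample(sample):
    """Return True only when zh_CN is a string without any Latin letter."""
    zh = sample.get("translation", {}).get("zh_CN")
    return isinstance(zh, str) and pattern.search(zh) is None

# Invented samples in the same {"translation": {...}} layout.
samples = [
    {"translation": {"en": "Open file", "zh_CN": "打开文件"}},
    {"translation": {"en": "KDE settings", "zh_CN": "KDE 设置"}},
    {"translation": {"en": "Missing", "zh_CN": None}},
]
kept = [s for s in samples if keep_sample(s)]
print([s["translation"]["zh_CN"] for s in kept])
```

Only the first sample survives: the second contains Latin letters and the third has no string to check.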
python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq
checkpoint = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, return_tensors="pt")
def batch_tokenizer(samples):
    en_sentences = [sample['en'] for sample in samples['translation']]
    zh_sentences = [sample['zh_CN'] for sample in samples['translation']]
    return tokenizer(
        en_sentences,
        text_target=zh_sentences,
        max_length=128,
        truncation=True,
    )
tokenized_dataset = clean_dataset.map(batch_tokenizer, batched=True, remove_columns=clean_dataset["train"].column_names)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
remove_columns=clean_dataset["train"].column_names removes the columns of the original dataset, so that only the fields fed to the model remain: input_ids, attention_mask and labels.
python
from torch.utils.data import DataLoader, Dataset
import torch
import numpy as np
train_dataloader = DataLoader(
    tokenized_dataset["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_dataset["valid"],
    collate_fn=data_collator,
    batch_size=8,
)
## Prepare the optimizer (AdamW)
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=2e-5)
## Wrap everything with accelerator.prepare
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
## Linear learning-rate schedule that decays the learning rate to 0
from transformers import get_scheduler
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
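tokenized_dataset.set_format('torch') is not needed here, because DataCollatorForSeq2Seq pads each batch and returns PyTorch tensors by default. The return_tensors="pt" passed to from_pretrained earlier is not what performs this conversion.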
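The "linear" schedule configured above lowers the learning rate from its initial value to 0 over num_training_steps. The sketch below writes that rule out by hand as an illustration of the formula, not as the transformers implementation; linear_lr is a hypothetical helper and the 300 steps are an arbitrary example value.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: optional linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

base_lr = 2e-5              # same value as the AdamW learning rate above
num_training_steps = 300    # arbitrary example, e.g. 3 epochs of 100 steps
for step in (0, 150, 300):
    print(step, linear_lr(step, base_lr, num_training_steps))
```

With num_warmup_steps=0 the rate starts at 2e-5, is halved at the midpoint, and reaches 0 at the last step.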
python
from huggingface_hub import Repository, get_full_repo_name, notebook_login, HfApi
import os
import shutil
# Log in to Hugging Face
notebook_login()
# Define the model name
model_name = "marian-finetuned-kde4-en-to-zh-accelerate"
output_dir = model_name
# Get the full repository ID (username/model_name)
repo_id = get_full_repo_name(model_name)
# Configure the Git user information
os.system("git config --global user.email 'your_email@qq.com'")
os.system("git config --global user.name 'your_name'")
# Get the token
api = HfApi()
huggingface_token = api.token
try:
    api.create_repo(repo_id=repo_id, exist_ok=True)
    print(f"Remote repository '{repo_id}' checked/created.")
except Exception as e:
    print(f"Note: Could not create repo or check existence (might be fine if it exists): {e}")
# Initialize the local repository (if the remote has content, it is cloned automatically)
repo = Repository(
    local_dir=output_dir,
    clone_from=repo_id,
    token=huggingface_token,
)
repo.push_to_hub(commit_message="Initial commit from Colab setup", blocking=False)
print(f"Repository initialized locally at {repo.local_dir}")
print(f"Remote repository ID: {repo_id}")
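#manual-repo-initialization The repository is not initialized the way the tutorial does it. Points to note:
- Set the git email and user name;
- Obtain the token through the API, then create and initialize the repository;
- A model with the same name has to be created manually on HuggingFace;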
python
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
from tqdm.auto import tqdm
import torch
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_train_epochs):
    # Training
    model.train()
    total_loss = 0  # running sum of the training loss
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        total_loss += loss.item()
    # Print the average loss at the end of each epoch
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch} | Average Training Loss: {avg_train_loss:.4f}")
    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]
        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)
        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)
    results = metric.compute(tokenize="zh")
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")
    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
The same caveats as above apply.