#### 自然语言处理
##### 文本分类
文本分类是一种常见的自然语言处理任务,它将标签或类别分配给文本。一些最大的公司将文本分类应用于各种实际应用中。最流行的文本分类形式之一是情感分析,它将标签(例如:积极、消极或中性)分配给一段文本。
在开始之前,请确保你已安装所有必要的库:
pip install transformers datasets evaluate
###### 加载 IMDb 数据集
首先,从 Datasets 库中加载 IMDb 数据集:
from datasets import load_dataset
imdb = load_dataset("imdb")
然后查看一个示例:
imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}
其中:
* `text`:电影评论文本。
* `label`:取值为 0 表示消极评论,取值为 1 表示积极评论。
###### 预处理
下一步是加载 DistilBERT 的 tokenizer 来预处理 text 字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
创建一个预处理函数来对 text 进行分词和截断,以使其不超过 DistilBERT 的最大输入长度:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
使用 Datasets 的 [~datasets.Dataset.map] 函数将预处理函数应用到整个数据集。通过设置 batched=True 一次处理数据集中的多个元素,可以加快处理速度:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
现在使用 [DataCollatorWithPadding] 创建一个示例批次。在整理(collation)过程中,将句子动态填充到该批次中的最长长度,这比将整个数据集填充到最大长度更高效。
# framework = "pt"
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# framework = "tf"
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
###### 评估
在训练过程中包含一个评估指标通常对评估模型的性能很有帮助。你可以使用 Evaluate 库快速加载一个评估方法。对于这个任务,加载 accuracy 指标(请参阅 Evaluate 快速入门 以了解有关如何加载和计算指标的更多信息):
import evaluate
accuracy = evaluate.load("accuracy")
然后创建一个函数,将模型的预测结果和标签传递给 [~evaluate.EvaluationModule.compute] 来计算准确率:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
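作为快速的正确性检查,可以先用一小批伪造的 logits 和标签调用 `compute_metrics`,确认其返回的字典格式(以下代码为补充示意,非原文内容):
# 补充示意:伪造一批 logits(每行是两个类别的得分)及其标签
fake_logits = np.array([[0.1, 0.9], [0.8, 0.2]])
fake_labels = np.array([1, 0])
compute_metrics((fake_logits, fake_labels))
# {'accuracy': 1.0}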
现在 compute_metrics 函数已准备就绪,在设置训练时会再用到它。
###### 训练
在开始训练模型之前,使用 id2label 和 label2id 创建 id 到标签以及标签到 id 的映射:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
现在已经准备好开始训练模型了!使用 [AutoModelForSequenceClassification] 加载 DistilBERT 并指定预期的标签数和标签映射:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
到这一步,只剩下三个步骤:
* 在 [`TrainingArguments`] 中定义你的训练超参数。唯一必需的参数是 `output_dir`,用于指定保存模型的位置。通过设置 `push_to_hub=True`,你将把这个模型推送到 Hub(你需要登录到 Hugging Face 才能上传模型)。在每个 epoch 结束时,[`Trainer`] 将评估准确率并保存训练检查点。
* 将训练参数与模型、数据集、tokenizer、数据整理器和 `compute_metrics` 函数一起传递给 [`Trainer`]。
* 调用 [`Trainer.train`] 来微调模型。
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
[`Trainer`] 默认使用动态填充,所以当你将 `tokenizer` 传递给它时,默认会应用该机制。在此示例中,不需要显式地指定数据整理器。
训练完成后,使用 \[`~transformers.Trainer.push_to_hub`\] 方法将你的模型共享到 Hub,以供所有人使用:
trainer.push_to_hub()
要在 `TensorFlow` 中微调模型,请首先设置优化器函数、学习率计划和一些训练超参数:
from transformers import create_optimizer
import tensorflow as tf
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
然后,你可以使用 \[`TFAutoModelForSequenceClassification`\] 加载 `DistilBERT`,并指定预期的标签数和标签映射:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)
使用 \[\~transformers.TFPreTrainedModel.prepare_tf_dataset\] 将数据集转换为 tf.data.Dataset 格式:
tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
使用 `compile` 配置模型进行训练。请注意,`Transformer` 模型都有一个与任务相关的默认损失函数,所以除非你希望使用自定义的损失函数,否则不需要指定损失函数:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在你开始训练之前,设置好最后两个事项,即从预测结果中计算准确率,并提供一种将你的模型推送到 Hub 的方式,这两个都是使用 `Keras` 回调 完成的。
将你的 `compute_metrics` 函数传递给 \[`transformers.KerasMetricCallback`\]:
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
在 \[`transformers.PushToHubCallback`\] 中指定要将模型和 `tokenizer` 推送到的位置:
from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",
    tokenizer=tokenizer,
)
将你的回调函数捆绑在一起:
callbacks = [metric_callback, push_to_hub_callback]
最后,你可以开始训练你的模型了!使用你的训练和验证数据集、epoch 数和回调函数来调用 fit 以进行微调:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
训练完成后,你的模型会自动上传到 Hub,供所有人使用!
有关如何为文本分类微调模型的更深入示例,请查看相应的 `PyTorch notebook` 或`TensorFlow notebook`。
###### 推理
至此,已经微调了一个模型,可以用它进行推断了!
找到一些想要运行推断的文本:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
使用微调后的模型进行推理的最简单方法是在\[`pipeline`\]中使用它。实例化一个用于情感分析的 `pipeline`,并将文本传递给它:
from transformers import pipeline
classifier = pipeline("sentiment-analysis",model="stevhliu/my_awesome_model")
classifier(text)
# [{'label': 'POSITIVE', 'score': 0.9994940757751465}]
如果你愿意,你也可以手动复制 `pipeline` 的结果:
对文本进行分词并返回 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")
将输入传递给模型并返回 `logits`:
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
with torch.no_grad():
    logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的 `id2label` 映射将其转换为文本标签:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
# 'POSITIVE'
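如果除了预测标签之外还想查看各类别的概率,可以在取 argmax 之前先对 logits 做 softmax(以下代码为补充示意,非原文内容):
# 补充示意:将 logits 转换为概率,并映射到标签名称
probabilities = torch.softmax(logits, dim=-1)[0]
{model.config.id2label[i]: float(p) for i, p in enumerate(probabilities)}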
对文本进行分词并返回 `TensorFlow` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")
将输入传递给模型并返回logits:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的id2label映射将其转换为文本标签:
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
# 'POSITIVE'
##### Token分类
`Token` 分类为句子中的每个标记分配一个标签。最常见的 `Token` 分类任务之一是**命名实体识别** (`Named Entity Recognition`,`NER`)。`NER` 旨在为句子中的每个实体(如人、位置或组织)找到一个标签。
开始之前,请确保已安装所有必要的库:
pip install transformers datasets evaluate seqeval
###### 加载 `WNUT 17` 数据集
首先从 `Datasets` 库中加载 `WNUT 17` 数据集:
from datasets import load_dataset
wnut = load_dataset("wnut_17")
wnut["train"][0]
输出结果:
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
`ner_tags` 中的每个数字表示一个实体。将数字转换为其标签名称以了解实体是什么:
label_list = wnut["train"].features[f"ner_tags"].feature.names
label_list
[
"O",
"B-corporation",
"I-corporation",
"B-creative-work",
"I-creative-work",
"B-group",
"I-group",
"B-location",
"I-location",
"B-person",
"I-person",
"B-product",
"I-product",
]
`ner_tags` 中每个标签的前缀字母表示该 `token` 在实体中的位置:
* `B-`:表示实体的开始。
* `I-`:表示该 `token` 包含在同一个实体中(例如,`State` 是 `Empire State Building` 实体的一部分)。
* `O`:表示该 `token` 不对应任何实体。
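为了更直观地理解这些前缀,可以把前面示例中的 `ner_tags` 数字转换成标签名称看一下(以下代码为补充示意,非原文内容):
# 将第一个训练样本的 ner_tags 解码为标签名称
[label_list[i] for i in wnut["train"][0]["ner_tags"]]
# 其中 'Empire State Building' 对应 ['B-location', 'I-location', 'I-location'],'ESB' 对应 ['B-location']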
###### 预处理
下一步是加载 `DistilBERT` 分词器以预处理 `tokens` 字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
正如你在上面的示例 `tokens` 字段中看到的那样,它看起来像已经进行了标记化的输入。但是实际上输入尚未标记化,你需要设置 `is_split_into_words=True` 将单词标记化为子单词。例如:
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
#['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
但是,这会添加一些特殊标记 `CLS` 和 `SEP`,而子词标记会导致输入和标签之间的不匹配。现在,一个对应于单个标签的单个单词可能被分割为两个子词。你需要通过以下方式对齐标记和标签:
1. 使用 `word_ids` 方法将所有 `token` 映射到相应的单词。
2. 将特殊标记 `CLS` 和 `SEP` 的标签设置为 -100,这样它们会被 `PyTorch` 的损失函数忽略(请参见 `CrossEntropyLoss`)。
3. 仅标注每个单词的第一个 `token`,将同一单词的其他子词的标签设置为 -100。
以下函数用于对齐 `token` 和标签,并将序列截断为不超过 `DistilBERT` 的最大输入长度:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # 将 token 映射到它们对应的单词。
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # 将特殊标记设置为 -100。
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # 仅保留给定单词第一个 token 的标签。
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
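可以先在单个样本上(以 batch 形式传入)检查对齐结果,确认特殊标记与同一单词的后续子词都被设置为 -100(以下代码为补充示意,非原文内容):
sample_batch = tokenize_and_align_labels(wnut["train"][:1])
sample_batch["labels"][0]
# 开头的 [CLS] 位置应为 -100,被拆分为多个子词的单词只有第一个子词保留原标签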
要将预处理函数应用于整个数据集,请使用 `Datasets [~datasets.Dataset.map]`函数。通过设置 `batched=True` 可以加速map函数,以便一次处理数据集的多个元素:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
现在使用\[`DataCollatorForTokenClassification`\]创建一个示例批次。在整理期间将句子动态填充到批次中的最长长度,而不是将整个数据集填充到最大长度。
# framework = "pt"
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# framework = "tf"
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
###### 评估
在训练过程中包含度量标准通常有助于评估模型的性能。你可以使用评估库快速加载一个评估方法。对于本任务,请加载 `seqeval` 框架(请参阅 `Evaluate` 快速导览以了解有关如何加载和计算度量标准的更多信息)。`Seqeval` 实际上产生了几个分数:精确度(`precision`)、召回率(`recall`)、F1和准确度(`accuracy`)。
import evaluate
seqeval = evaluate.load("seqeval")
首先获取 `NER` 标签,然后创建一个函数,该函数将你的预测结果和真实标签传递给\[`evaluate.EvaluationModule.compute`\]以计算分数:
import numpy as np
labels = [label_list[i] for i in example[f"ner_tags"]]
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
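如果想快速确认 `seqeval` 返回的键,可以先用一组玩具标签调用它(以下代码为补充示意,非原文内容):
toy_results = seqeval.compute(
    predictions=[["O", "B-location", "I-location"]],
    references=[["O", "B-location", "I-location"]],
)
toy_results["overall_f1"]
# 预测与参考完全一致时,overall_f1 为 1.0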
现在你的 `compute_metrics` 函数已经准备好,设置训练时会再用到它。
###### 训练
在开始训练模型之前,先创建 `ID` 到标签以及标签到 `ID` 的映射 `id2label` 和 `label2id`:
id2label = {
0: "O",
1: "B-corporation",
2: "I-corporation",
3: "B-creative-work",
4: "I-creative-work",
5: "B-group",
6: "I-group",
7: "B-location",
8: "I-location",
9: "B-person",
10: "I-person",
11: "B-product",
12: "I-product",
}
label2id = {
"O": 0,
"B-corporation": 1,
"I-corporation": 2,
"B-creative-work": 3,
"I-creative-work": 4,
"B-group": 5,
"I-group": 6,
"B-location": 7,
"I-location": 8,
"B-person": 9,
"I-person": 10,
"B-product": 11,
"I-product": 12,
}
现在,可以开始训练模型了!使用\[`AutoModelForTokenClassification`\]加载 `DistilBERT`,同时指定期望的标签数量以及标签映射:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=13,
id2label=id2label,
label2id=label2id)
此时,仅剩下三个步骤:
* 在\[`TrainingArguments`\]中定义你的训练超参数。`output_dir` 是唯一需要的参数,它指定要保存模型的位置。你可以设置 `push_to_hub=True` 将模型推送到Hub(上传模型需要登录到`Hugging Face`)。在每个 epoch 结束时,\[`Trainer`\]将评估 `seqeval` 分数并保存训练检查点。
* 将训练参数与模型、数据集、分词器、数据整理器和 `compute_metrics` 函数一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]以微调你的模型。
training_args = TrainingArguments(
output_dir="my_awesome_wnut_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_wnut["train"],
eval_dataset=tokenized_wnut["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
训练完成后,使用\[`transformers.Trainer.push_to_hub`\]方法将你的模型分享到Hub,以便每个人都可以使用你的模型:
trainer.push_to_hub()
要在 `TensorFlow` 中微调模型,请首先设置一个优化器函数、学习率计划和一些训练超参数:
from transformers import create_optimizer
batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
init_lr=2e-5,
num_train_steps=num_train_steps,
weight_decay_rate=0.01,
num_warmup_steps=0,
)
然后,你可以使用\[`TFAutoModelForTokenClassification`\]加载 `DistilBERT`,同时指定期望的标签数量以及标签映射:
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=13,
id2label=id2label,
label2id=label2id
)
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
tokenized_wnut["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
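后面的\[`transformers.KerasMetricCallback`\]和 `fit` 都会用到 `tf_validation_set`,原文此处缺少对应的转换代码;下面按同样方式补充一个示意(这里假设使用 `WNUT 17` 的 `validation` 切分):
tf_validation_set = model.prepare_tf_dataset(
    tokenized_wnut["validation"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)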
将模型配置为使用 `compile` 进行训练。注意,`Transformers` 模型都有一个默认的与任务相关的损失函数,因此除非你想要指定一个,否则不需要再指定了:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在开始训练之前,还有最后两件事要做,即从预测中计算 `seqeval` 分数,并提供将模型上传到Hub的方法。这两件事都是通过使用 `Keras` 回调来完成的。
将你的 `compute_metrics` 函数传递给\[`transformers.KerasMetricCallback`\]:
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
在\[`transformers.PushToHubCallback`\]中指定将模型和分词处理器上传到哪:
from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(
output_dir="my_awesome_wnut_model",
tokenizer=tokenizer,
)
然后将回调捆绑在一起:
callbacks = [metric_callback, push_to_hub_callback]
最后,你可以开始训练模型了!使用训练和验证数据集、训练轮数以及回调函数调用 `fit` 来微调模型:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
一旦训练完成,你的模型将自动上传到Hub,这样每个人都可以使用它!
###### 推理
现在你已经微调了模型,可以用它进行推理了!首先选择一些你想要进行推理的文本:
text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
使用微调后的模型进行推理最简单的方法是在\[`pipeline`\]中使用它。用你的模型实例化一个 `NER` 的 `pipeline`,并将文本传递给它:
from transformers import pipeline
classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
classifier(text)
结果为:
[
{'entity': 'B-location',
'score': 0.42658573,
'index': 2,
'word': 'golden',
'start': 4,
'end': 10},
{'entity': 'I-location',
'score': 0.35856336,
'index': 3,
'word': 'state',
'start': 11,
'end': 16},
{'entity': 'B-group',
'score': 0.3064001,
'index': 4,
'word': 'warriors',
'start': 17,
'end': 25},
{'entity': 'B-location',
'score': 0.65523505,
'index': 13,
'word': 'san',
'start': 80,
'end': 83},
{'entity': 'B-location',
'score': 0.4668663,
'index': 14,
'word': 'francisco',
'start': 84,
'end': 93}
]
如果需要,也可以手动复制 `pipeline` 的结果:
对文本进行分词并返回 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
inputs = tokenizer(text, return_tensors="pt")
将输入传递给模型并返回 `logits`:
import torch
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
with torch.no_grad():
    logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的id2label映射将其转换为文本标签:
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class
结果为:
['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']
对文本进行分词并返回TensorFlow张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
inputs = tokenizer(text, return_tensors="tf")
将输入传递给模型并返回 `logits`:
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的 `id2label` 映射将其转换为文本标签:
predicted_token_class_ids = tf.math.argmax(logits, axis=-1)
predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()]
predicted_token_class
结果为:
['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']
##### 问答
问答任务根据问题得到一个答案。如果你曾经向 `Alexa`、`Siri` 或 `Google` 等虚拟助手询问天气情况,那么你之前使用过问答模型。常见的问答任务有两种类型:
* 抽取式:从给定的上下文中抽取答案。
* 生成式:从上下文生成一个正确回答问题的答案。
开始之前,请确保你已经安装了所有必需的库:
pip install transformers datasets evaluate
###### 加载 `SQuAD` 数据集
首先,通过 `Datasets` 库加载 `SQuAD` 数据集的一个较小子集。这将给你一个机会在使用完整数据集进行训练之前进行实验和确保一切工作正常。
from datasets import load_dataset
squad = load_dataset("squad", split="train[:5000]")
使用\[`datasets.Dataset.train_test_split`\]方法将数据集的 `train` 拆分为训练集和测试集:
squad = squad.train_test_split(test_size=0.2)
然后看一个例子:
squad["train"][0]
# {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
# 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
# 'id': '5733be284776f41900661182',
# 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
# 'title': 'University_of_Notre_Dame'
# }
这里有几个重要的字段:
* `answers`:答案标记的起始位置和答案文本。
* `context`:模型需要从中提取答案的背景信息。
* `question`:模型应该回答的问题。
###### 预处理
下一步是加载DistilBERT tokenizer以处理question和context字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
还有一些特定于问答任务的预处理步骤需要注意:
1. 数据集中的一些示例可能具有非常长的 `context`,超过了模型的最大输入长度。为了处理更长的序列,只截断 `context`,将 `truncation` 设置为 `only_second`。
2. 接下来,通过设置 `return_offsets_mapping=True`,将回答的开始和结束位置映射到原始的 `context`。
3. 有了映射后,可以找到答案的开始和结束标记。使用\[`tokenizers.Encoding.sequence_ids`\]方法找出哪部分偏移对应于 `question`,哪部分对应于 `context`。
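在编写完整的预处理函数之前,可以先在单个样本上直观地看一下 `sequence_ids`:`None` 对应特殊标记,`0` 对应 `question`,`1` 对应 `context`(以下代码为补充示意,非原文内容):
sample = tokenizer(
    squad["train"][0]["question"],
    squad["train"][0]["context"],
    max_length=384,
    truncation="only_second",
    return_offsets_mapping=True,
)
sample.sequence_ids()[:8]
# 开头的 None 对应 [CLS],随后的 0 对应 question 部分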
下面的函数会截断输入,并将 `answer` 的起始和结束位置映射到 `context` 中:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # 找到上下文的开始和结束位置
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # 如果答案没有完全在上下文内,则标记为 (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # 否则取答案开始和结束 token 的位置
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
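同样可以先在前两个样本上做一次快速检查,确认起止位置被正确写入(以下代码为补充示意,非原文内容):
sample_inputs = preprocess_function(squad["train"][:2])
sample_inputs["start_positions"], sample_inputs["end_positions"]
# 若答案没有完全落在截断后的 context 内,对应位置会是 (0, 0)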
要在整个数据集上应用预处理函数,使用 `Datasets` 的\[`datasets.Dataset.map`\]函数即可。你可以通过将 `batched=True` 设置为一次处理数据集的多个元素来加快map函数的速度。删除你不需要的任何列:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
然后使用\[`DefaultDataCollator`\]创建一批示例。与 `Transformers` 中的其他数据整理器不同,\[`DefaultDataCollator`\]不会应用任何额外的预处理,例如填充。
# pytorch 代码
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()
# tensorflow代码
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(return_tensors="tf")
###### 训练
1. **pytorch代码**
如果你不熟悉使用\[`Trainer`\]微调模型,请参阅此处的基础教程!
现在你可以开始训练模型了!使用\[`AutoModelForQuestionAnswering`\]加载 `DistilBERT`:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
在这一点上,只剩下三个步骤:
* 在\[`TrainingArguments`\]中定义你的训练超参数。唯一需要的参数是 `output_dir`,指定保存模型的位置。你可以通过设置 `push_to_hub=True` 将模型推送到Hub(你需要登录Hugging Face才能上传模型)。
* 将训练参数与模型、数据集、tokenizer和数据整理器一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]进行微调模型。
training_args = TrainingArguments(
output_dir="my_awesome_qa_model",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_squad["train"],
eval_dataset=tokenized_squad["test"],
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
训练完成后,使用\[`transformers.Trainer.push_to_hub`\]方法将模型分享给Hub,这样每个人都可以使用你的模型:
trainer.push_to_hub()
2. **`tensorflow`代码**
要在 `TensorFlow` 中微调模型,请首先设置优化器、学习率计划和一些训练超参数:
from transformers import create_optimizer
batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps,
)
然后使用\[`TFAutoModelForQuestionAnswering`\]加载 `DistilBERT`:
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
tokenized_squad["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = model.prepare_tf_dataset(
tokenized_squad["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
使用 `compile` 为训练配置模型:
import tensorflow as tf
model.compile(optimizer=optimizer)
在开始训练之前,你还需要提供一种将模型推送到Hub的方法。这可以通过在\[`transformers.PushToHubCallback`\]中指定要推送模型和 `tokenizer` 的位置来完成:
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
output_dir="my_awesome_qa_model",
tokenizer=tokenizer,
)
最后,你已经准备好开始训练模型了!使用训练集、验证集、epoch 数和回调函数调用 `fit` 来微调模型:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=[callback])
训练完成后,你的模型将自动上传到Hub,以便每个人都可以使用它!
###### 推理
至此,已经微调了一个模型,现在可以用它进行推理了!提出一个问题和一些你希望模型预测的上下文:
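原文此处省略了示例的 `question` 和 `context` 变量;结合下文 `pipeline` 输出的答案内容,可以补充如下示例定义(具体文本为示意,假设与原教程一致):
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."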
在使用微调模型进行推理时,最简单的方法是在\[ `pipeline`\]中使用它。使用你的模型实例化一个问题回答的 `pipeline`,并将你的文本传递给它:
from transformers import pipeline
question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)
# {'score': 0.2058267742395401,
# 'start': 10,
# 'end': 95,
# 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}
如果你愿意,你也可以手动复制 `pipeline` 的结果:
1. **`pytorch` 代码**
对文本进行标记化并返回 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="pt")
将你的输入传递给模型并返回 `logits`:
import torch
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
with torch.no_grad():
    outputs = model(**inputs)
从模型输出中获取开始和结束位置的最高概率:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
将预测的标记解码为答案:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
2. **`tensorflow` 代码**
对文本进行标记化并返回 `TensorFlow` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="tf")
将你的输入传递给模型并返回logits:
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
outputs = model(**inputs)
从模型输出中获取开始和结束位置的最高概率:
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
将预测的标记解码为答案:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
##### 因果语言模型
语言建模有两种类型,因果和掩码。本指南介绍因果语言建模。因果语言模型经常用于文本生成。你可以将这些模型用于创意应用,例如选择你自己的文字冒险或智能编码助手(如 `Copilot` 或 `CodeParrot`)。
因果语言建模预测标记序列中的下一个标记,并且模型只能关注左侧的标记。这意味着模型无法看到未来的标记。`GPT-2` 是因果语言模型的一个例子。
开始之前,请确保安装了所有必要的库:
pip install transformers datasets evaluate
###### 加载ELI5数据集
首先,从 `Datasets` 库中加载 `ELI5` 数据集中 `r/askscience` 子集的一个较小子集。这样可以让你先进行实验,确保一切正常,然后再花更多时间在完整数据集上训练。
from datasets import load_dataset
eli5 = load_dataset("eli5", split="train_asks[:5000]")
使用\[`datasets.Dataset.train_test_split`\]方法将数据集的 `train_asks` 拆分为训练集和测试集:
eli5 = eli5.train_test_split(test_size=0.2)
然后看一个示例:
eli5["train"][0]
得到的结果为:
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
'score': [6, 3],
'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
'answers_urls': {'url': []},
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls': {'url': []}}
虽然这看起来很多,但你实际上只对"text"字段感兴趣。语言建模任务的有趣之处在于你不需要标签(也称为无监督任务),因为下一个单词就是标签。
###### 预处理
下一步是加载 `DistilGPT2` 分词器以处理 `text` 子字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
你将从上面的示例中注意到, `text` 字段实际上是嵌套在 `answers` 内部的。这意味着你需要使用\[`datasets.Dataset.flatten`\]方法从其嵌套结构中提取 `text` 子字段:
eli5 = eli5.flatten()
eli5["train"][0]
输出的结果为:
{
'answers.a_id': ['c3d1aib', 'c3d4lya'],
'answers.score': [6, 3],
'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
'answers_urls.url': [],
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls.url': []
}
现在,每个子字段都是一个单独的列,如 `answers` 前缀所示,`text` 字段现在是一个列表。不是分别对每个句子进行标记化,而是将列表转换为字符串,以便可以联合对其进行标记化。
下面是用于连接示例中的字符串列表并对结果进行标记化的第一个预处理函数:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
使用数据集\[`datasets.Dataset.map`\]方法将该预处理函数应用于整个数据集。通过将`batched=True` 设置为同时处理数据集的多个元素,并使用 `num_proc` 增加进程的数量,可以加速map函数的处理速度。删除不需要的列:
tokenized_eli5 = eli5.map(
preprocess_function,
batched=True,
num_proc=4,
remove_columns=eli5["train"].column_names,
)
该数据集包含标记序列,但其中一些序列长度超过了模型的最大输入长度。
现在,可以使用第二个预处理函数来
* 连接所有序列
* 将连接的序列拆分为长度由 `block_size` 定义的较短的块,其长度应小于最大输入长度,并且足够短以适应 `GPU RAM`。
block_size = 128
def group_texts(examples):
    # 连接所有文本。
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # 我们丢弃不足一个 block_size 的剩余部分;如果模型支持填充,也可以在这里添加填充而不是丢弃,你可以根据需要自定义此部分。
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # 按 block_size 进行拆分。
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
在整个数据集上应用 `group_texts` 函数:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
现在,使用\[`DataCollatorForLanguageModeling`\]创建一批示例。在整理过程中,将句子动态填充到一批中的最长长度,比将整个数据集填充到最大长度更高效。
使用序列结束(end-of-sequence)`token` 作为填充 `token`,并设置 `mlm=False`。这会将输入向右移动一个位置作为标签:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
在 `TensorFlow` 中同样使用序列结束 `token` 作为填充 `token`,设置 `mlm=False`,并指定 `return_tensors="tf"`:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
###### 训练
现在已经准备好开始训练模型了!使用\[`AutoModelForCausalLM`\]加载 `DistilGPT2` 模型:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
现在只剩下三个步骤:
* 使用\[`TrainingArguments`\]定义训练超参数。唯一需要的参数是 `output_dir`,用于指定保存模型的位置。设置 `push_to_hub=True`将此模型推送到Hub(你需要登录Hugging Face以上传模型)。
* 将训练参数与模型、数据集和数据整理器一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]来微调模型。
training_args = TrainingArguments(
output_dir="my_awesome_eli5_clm-model",
evaluation_strategy="epoch",
learning_rate=2e-5,
weight_decay=0.01,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_dataset["train"],
eval_dataset=lm_dataset["test"],
data_collator=data_collator,
)
trainer.train()
训练完成后,使用\[\~transformers.Trainer.evaluate\]方法评估模型并获得困惑度:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
输出结果为:
Perplexity: 49.61
然后可以使用\[\~transformers.Trainer.push_to_hub\]方法将模型分享到Hub,以便每个人都可以使用你的模型:
trainer.push_to_hub()
要在TensorFlow中微调模型,请首先设置优化器函数、学习率调度和一些训练超参数:
from transformers import create_optimizer, AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
然后可以使用\[TFAutoModelForCausalLM\]加载DistilGPT2模型:
from transformers import TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
lm_dataset["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
lm_dataset["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
使用 `compile` 为训练配置模型。请注意,`Transformer` 模型都有一个默认的与任务相关的损失函数,所以除非你想要指定一个,否则不需要特别指定:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在开始训练之前,还需要提供一种将模型推送到 Hub 的方法。这可以通过在\[`transformers.PushToHubCallback`\]中指定要推送模型和分词器的位置来实现:
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
output_dir="my_awesome_eli5_clm-model",
tokenizer=tokenizer,
)
最后,你可以开始训练模型了!使用fit方法并传入训练和验证数据集、训练轮数,以及回调函数来微调模型:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
训练完成后,你的模型会自动上传到Hub,这样每个人都可以使用它!
###### 推断
现在已经微调了模型,可以用它进行推断了!先构造一个你想要生成文本的输入提示:
prompt = "Somatic hypermutation allows the immune system to"
尝试使用\[ `pipeline`\] 中的模型进行推断是最简单的方法。使用你的模型实例化一个文本生成的 `pipeline`,并将文本传递给它:
from transformers import pipeline
generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
generator(prompt)
输出的结果为:
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
将文本分词处理并将`input_ids`返回为 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="pt").input_ids
使用\[`transformers.generation_utils.GenerationMixin.generate`\]方法生成文本。有关不同的文本生成策略和控制生成的参数的更多细节,请查看文本生成策略页面。
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
将生成的token id解码回文本:
tokenizer.batch_decode(outputs, skip_special_tokens=True)
输出结果如下:
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
将文本分词处理并将`input_ids`返回为 `TensorFlow`张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="tf").input_ids
使用\[`transformers.generation_tf_utils.TFGenerationMixin.generate`\]方法生成文本。有关不同的文本生成策略和控制生成的参数的更多细节,请查看文本生成策略页面。
from transformers import TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
将生成的 `token id` 解码回文本:
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
##### 掩码语言模型
掩码语言建模是预测序列中掩码标记的任务,模型可以双向关注标记。这意味着模型可以完全访问左右两侧的标记。掩码语言建模适用于需要对整个序列具有良好上下文理解的任务。`BERT` 就是掩码语言模型的一个例子。
在开始之前,请确保已安装所有必要的库:
pip install transformers datasets evaluate
###### 加载ELI5数据集
首先,从 `Datasets`库中加载 `ELI5` 数据集中 `r/askscience` 子集的一个较小子集。这将给你一个机会进行实验,并确保一切正常,然后再花更多时间在完整数据集上进行训练。
from datasets import load_dataset
eli5 = load_dataset("eli5", split="train_asks[:5000]")
将数据集的 `train_asks` 拆分为训练集和测试集,使用\[`datasets.Dataset.train_test_split`\]方法:
eli5 = eli5.train_test_split(test_size=0.2)
然后查看一个示例:
eli5["train"][0]
输出结果为:
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
'score': [6, 3],
'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
'answers_urls': {'url': []},
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls': {'url': []}}
虽然这看起来可能有点多,但你实际上只对 `text` 字段感兴趣。关于语言建模任务的酷炫之处在于你不需要标签(也称为无监督任务),因为下一个词就是标签。
###### 预处理
对于掩码语言建模,下一步是加载一个 `DistilRoBERTa` 分词器来处理 `text` 子字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
你会注意到上面的示例中,`text` 字段实际上是嵌套在 `answers` 内的。这意味着你需要使用\[`flatten`\]方法从其嵌套结构中提取 `text`子字段:
eli5 = eli5.flatten()
eli5["train"][0]
输出结果为:
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
'answers.score': [6, 3],
'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
'answers_urls.url': [],
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls.url': []}
现在,每个子字段都是一个单独的列,如 `answers` 前缀所示,而 `text` 字段现在是一个列表。不需要将每个句子分别标记化,而是将列表转换为字符串,以便可以联合标记化它们。
下面是第一个预处理函数,该函数将每个示例的字符串列表连接起来,并对结果进行标记化:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
使用\[`datasets.Dataset.map`\]方法将此预处理函数应用于整个数据集。你可以通过设置 `batched=True` 以一次处理数据集的多个元素,并通过设置 `num_proc` 来增加进程数,以加快map函数的速度。删除你不需要的任何列:
tokenized_eli5 = eli5.map(
preprocess_function,
batched=True,
num_proc=4,
remove_columns=eli5["train"].column_names,
)
该数据集包含标记序列,但其中一些序列比模型的最大输入长度要长。
现在,使用第二个预处理函数:
* 连接所有序列
* 根据 `block_size` 将连接后的序列拆分为较短的块,其长度应小于最大输入长度,并且足够小,适应你的 `GPU RAM`。
block_size = 128
def group_texts(examples):
    # 合并所有文本。
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # 我们舍弃不足一个 block_size 的部分;如果模型支持填充,也可以在这里添加填充而不是舍弃,你可以根据需要自定义此部分。
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # 按 block_size 划分。
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result
在整个数据集上应用 `group_texts`函数:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
现在,使用\[`DataCollatorForLanguageModeling`\]创建一个示例批次。在整理(collation)过程中,将句子动态填充到批次中的最长长度,比将整个数据集填充到最大长度更高效。
将序列结束标记用作填充标记,并指定 `mlm_probability`,以便在每次迭代数据时随机掩码标记:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
在 `TensorFlow` 中同样将序列结束标记用作填充标记,并指定 `mlm_probability`,以便在每次迭代数据时随机掩码标记:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
###### 训练
如果你对使用\[ `Trainer` \]微调模型不太熟悉,请查看此处的基本教程。
你现在可以开始训练模型了!使用\[`AutoModelForMaskedLM`\]加载 `DistilRoBERTa`:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
只剩下三个步骤:
* 在\[`TrainingArguments`\]中定义你的训练超参数。仅需要 `output_dir` 参数,指定保存模型的位置。
* 将训练参数与模型、数据集和数据校验器一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]进行模型微调。
training_args = TrainingArguments(
output_dir="my_awesome_eli5_mlm_model",
evaluation_strategy="epoch",
learning_rate=2e-5,
num_train_epochs=3,
weight_decay=0.01,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_dataset["train"],
eval_dataset=lm_dataset["test"],
data_collator=data_collator,
)
trainer.train()
当训练完成后,使用\[`transformers.Trainer.evaluate`\]方法评估模型并获得模型的困惑度:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
将你的模型共享到Hub上,使用\[`transformers.Trainer.push_to_hub`\]方法,这样每个人都可以使用你的模型:
trainer.push_to_hub()
要在 `TensorFlow` 中微调模型,请首先设置优化器函数、学习率调度程序和一些训练超参数:
from transformers import create_optimizer, AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
然后可以使用\[`TFAutoModelForMaskedLM`\]加载 `DistilRoBERTa`:
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
lm_dataset["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
lm_dataset["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
使用 `compile` 为模型配置训练。请注意,`Transformers` 模型都有默认的与任务相关的损失函数,因此除非你想要指定一个,否则不需要指定:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在开始训练之前,还需要提供一种将模型推送到 Hub 的方法。这可以通过在\[\~transformers.PushToHubCallback\]中指定要推送模型和分词器的位置来实现:
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
output_dir="my_awesome_eli5_mlm_model",
tokenizer=tokenizer,
)
最后,你可以开始训练模型!使用你的训练和验证数据集、`epochs` 的数量以及你的回调函数来调用 `fit` 方法来微调模型:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
###### 推断
现在已经对模型进行了微调,可以用它进行推断了!先想出一些希望模型填补空白的文本,并使用特殊的 `<mask>` 标记来表示需要填补的位置:
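例如,可以构造如下带 `<mask>` 的文本(示例文本为示意,假设与原教程一致):
text = "The Milky Way is a <mask> galaxy."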