#### 自然语言处理
##### 文本分类
文本分类是一种常见的自然语言处理任务,它将标签或类别分配给文本。一些最大的公司将文本分类应用于各种实际应用中。最流行的文本分类形式之一是情感分析,它将标签(例如:积极、消极或中性)分配给一段文本。
在开始之前,请确保你已安装所有必要的库:
pip install transformers datasets evaluate
###### 加载 IMDb 数据集
首先,从 Datasets 库中加载 IMDb 数据集:
from datasets import load_dataset
imdb = load_dataset("imdb")
然后查看一个示例:
imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}
其中:
* `text`:电影评论文本。
* `label`:取值为 0 表示消极评论,取值为 1 表示积极评论。
###### 预处理
下一步是加载 DistilBERT 的 tokenizer 来预处理 text 字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
创建一个预处理函数来对 text 进行分词和截断,以使其不超过 DistilBERT 的最大输入长度:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
使用 Datasets 的 [~datasets.Dataset.map] 函数将预处理函数应用到整个数据集。通过设置 batched=True 一次处理数据集中的多个元素,可以加快处理速度:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
现在使用 [DataCollatorWithPadding] 创建一个示例批次。在整理(collation)过程中,将句子动态填充到该批次中的最长长度,这比将整个数据集填充到最大长度更高效。
# framework = "pt"
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# framework = "tf"
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
###### 评估
在训练过程中包含一个评估指标通常对评估模型的性能很有帮助。你可以使用 Evaluate 库快速加载一个评估方法。对于这个任务,加载 accuracy 指标(请参阅 Evaluate 快速入门 以了解有关如何加载和计算指标的更多信息):
import evaluate
accuracy = evaluate.load("accuracy")
然后创建一个函数,将模型的预测结果和标签传递给 [~evaluate.EvaluationModule.compute] 来计算准确率:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
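作为快速的正确性检查,可以先用一小批伪造的 logits 和标签调用 `compute_metrics`,确认其返回的字典格式(以下代码为补充示意,非原文内容):
# 补充示意:伪造一批 logits(每行是两个类别的得分)及其标签
fake_logits = np.array([[0.1, 0.9], [0.8, 0.2]])
fake_labels = np.array([1, 0])
compute_metrics((fake_logits, fake_labels))
# {'accuracy': 1.0}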
现在 compute_metrics 函数已准备就绪,在设置训练时会再用到它。
###### 训练
在开始训练模型之前,使用 id2label 和 label2id 创建 id 到标签以及标签到 id 的映射:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
现在已经准备好开始训练模型了!使用 [AutoModelForSequenceClassification] 加载 DistilBERT 并指定预期的标签数和标签映射:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
到这一步,只剩下三个步骤:
* 在 [`TrainingArguments`] 中定义你的训练超参数。唯一必需的参数是 `output_dir`,用于指定保存模型的位置。通过设置 `push_to_hub=True`,你将把这个模型推送到 Hub(你需要登录到 Hugging Face 才能上传模型)。在每个 epoch 结束时,[`Trainer`] 将评估准确率并保存训练检查点。
* 将训练参数与模型、数据集、tokenizer、数据整理器和 `compute_metrics` 函数一起传递给 [`Trainer`]。
* 调用 [`Trainer.train`] 来微调模型。
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
[`Trainer`] 默认使用动态填充,所以当你将 `tokenizer` 传递给它时,默认会应用该机制。在此示例中,不需要显式地指定数据整理器。
训练完成后,使用 \[`~transformers.Trainer.push_to_hub`\] 方法将你的模型共享到 Hub,以供所有人使用:
trainer.push_to_hub()
要在 `TensorFlow` 中微调模型,请首先设置优化器函数、学习率计划和一些训练超参数:
from transformers import create_optimizer
import tensorflow as tf
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
然后,你可以使用 \[`TFAutoModelForSequenceClassification`\] 加载 `DistilBERT`,并指定预期的标签数和标签映射:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)
使用 \[\~transformers.TFPreTrainedModel.prepare_tf_dataset\] 将数据集转换为 tf.data.Dataset 格式:
tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
使用 `compile` 配置模型进行训练。请注意,`Transformer` 模型都有一个与任务相关的默认损失函数,所以除非你希望使用自定义的损失函数,否则不需要指定损失函数:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在你开始训练之前,设置好最后两个事项,即从预测结果中计算准确率,并提供一种将你的模型推送到 Hub 的方式,这两个都是使用 `Keras` 回调 完成的。
将你的 `compute_metrics` 函数传递给 \[`transformers.KerasMetricCallback`\]:
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
在 \[`transformers.PushToHubCallback`\] 中指定要将模型和 `tokenizer` 推送到的位置:
from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",
    tokenizer=tokenizer,
)
将你的回调函数捆绑在一起:
callbacks = [metric_callback, push_to_hub_callback]
最后,你可以开始训练你的模型了!使用你的训练和验证数据集、epoch 数和回调函数来调用 fit 以进行微调:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
训练完成后,你的模型会自动上传到 Hub,供所有人使用!
有关如何为文本分类微调模型的更深入示例,请查看相应的 `PyTorch notebook` 或`TensorFlow notebook`。
###### 推理
至此,已经微调了一个模型,可以用它进行推断了!
找到一些想要运行推断的文本:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
使用微调后的模型进行推理的最简单方法是在\[`pipeline`\]中使用它。实例化一个用于情感分析的 `pipeline`,并将文本传递给它:
from transformers import pipeline
classifier = pipeline("sentiment-analysis",model="stevhliu/my_awesome_model")
classifier(text)
# [{'label': 'POSITIVE', 'score': 0.9994940757751465}]
如果你愿意,你也可以手动复制 `pipeline` 的结果:
对文本进行分词并返回 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")
将输入传递给模型并返回 `logits`:
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
with torch.no_grad():
    logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的 `id2label` 映射将其转换为文本标签:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
# 'POSITIVE'
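如果除了预测标签之外还想查看各类别的概率,可以在取 argmax 之前先对 logits 做 softmax(以下代码为补充示意,非原文内容):
# 补充示意:将 logits 转换为概率,并映射到标签名称
probabilities = torch.softmax(logits, dim=-1)[0]
{model.config.id2label[i]: float(p) for i, p in enumerate(probabilities)}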
对文本进行分词并返回 `TensorFlow` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")
将输入传递给模型并返回logits:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的id2label映射将其转换为文本标签:
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
# 'POSITIVE'
##### Token分类
`Token` 分类为句子中的每个标记分配一个标签。最常见的 `Token` 分类任务之一是**命名实体识别** (`Named Entity Recognition`,`NER`)。`NER` 旨在为句子中的每个实体(如人、位置或组织)找到一个标签。
开始之前,请确保已安装所有必要的库:
pip install transformers datasets evaluate seqeval
###### 加载 `WNUT 17` 数据集
首先从 `Datasets` 库中加载 `WNUT 17` 数据集:
from datasets import load_dataset
wnut = load_dataset("wnut_17")
wnut["train"][0]
输出结果:
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
`ner_tags` 中的每个数字表示一个实体。将数字转换为其标签名称以了解实体是什么:
label_list = wnut["train"].features[f"ner_tags"].feature.names
label_list
[
"O",
"B-corporation",
"I-corporation",
"B-creative-work",
"I-creative-work",
"B-group",
"I-group",
"B-location",
"I-location",
"B-person",
"I-person",
"B-product",
"I-product",
]
`ner_tags` 中每个标签的前缀字母表示该 `token` 在实体中的位置:
* `B-`:表示实体的开始。
* `I-`:表示该 `token` 包含在同一个实体中(例如,`State` 是 `Empire State Building` 实体的一部分)。
* `O`:表示该 `token` 不对应任何实体。
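为了更直观地理解这些前缀,可以把前面示例中的 `ner_tags` 数字转换成标签名称看一下(以下代码为补充示意,非原文内容):
# 将第一个训练样本的 ner_tags 解码为标签名称
[label_list[i] for i in wnut["train"][0]["ner_tags"]]
# 其中 'Empire State Building' 对应 ['B-location', 'I-location', 'I-location'],'ESB' 对应 ['B-location']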
###### 预处理
下一步是加载 `DistilBERT` 分词器以预处理 `tokens` 字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
正如你在上面的示例 `tokens` 字段中看到的那样,它看起来像已经进行了标记化的输入。但是实际上输入尚未标记化,你需要设置 `is_split_into_words=True` 将单词标记化为子单词。例如:
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
#['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
但是,这会添加一些特殊标记 `CLS` 和 `SEP`,而子词标记会导致输入和标签之间的不匹配。现在,一个对应于单个标签的单个单词可能被分割为两个子词。你需要通过以下方式对齐标记和标签:
1. 使用 `word_ids` 方法将所有 `token` 映射到相应的单词。
2. 将特殊标记 `CLS` 和 `SEP` 的标签设置为 -100,这样它们会被 `PyTorch` 的损失函数忽略(请参见 `CrossEntropyLoss`)。
3. 仅标注每个单词的第一个 `token`,将同一单词的其他子词的标签设置为 -100。
以下函数用于对齐 `token` 和标签,并将序列截断为不超过 `DistilBERT` 的最大输入长度:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # 将 token 映射到它们对应的单词。
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # 将特殊标记设置为 -100。
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # 仅保留给定单词第一个 token 的标签。
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
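可以先在单个样本上(以 batch 形式传入)检查对齐结果,确认特殊标记与同一单词的后续子词都被设置为 -100(以下代码为补充示意,非原文内容):
sample_batch = tokenize_and_align_labels(wnut["train"][:1])
sample_batch["labels"][0]
# 开头的 [CLS] 位置应为 -100,被拆分为多个子词的单词只有第一个子词保留原标签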
要将预处理函数应用于整个数据集,请使用 `Datasets [~datasets.Dataset.map]`函数。通过设置 `batched=True` 可以加速map函数,以便一次处理数据集的多个元素:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
现在使用\[`DataCollatorForTokenClassification`\]创建一个示例批次。在整理期间将句子动态填充到批次中的最长长度,而不是将整个数据集填充到最大长度。
# framework = "pt"
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# framework = "tf"
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
###### 评估
在训练过程中包含度量标准通常有助于评估模型的性能。你可以使用评估库快速加载一个评估方法。对于本任务,请加载 `seqeval` 框架(请参阅 `Evaluate` 快速导览以了解有关如何加载和计算度量标准的更多信息)。`Seqeval` 实际上产生了几个分数:精确度(`precision`)、召回率(`recall`)、F1和准确度(`accuracy`)。
import evaluate
seqeval = evaluate.load("seqeval")
首先获取 `NER` 标签,然后创建一个函数,该函数将你的预测结果和真实标签传递给\[`evaluate.EvaluationModule.compute`\]以计算分数:
import numpy as np
labels = [label_list[i] for i in example[f"ner_tags"]]
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
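如果想快速确认 `seqeval` 返回的键,可以先用一组玩具标签调用它(以下代码为补充示意,非原文内容):
toy_results = seqeval.compute(
    predictions=[["O", "B-location", "I-location"]],
    references=[["O", "B-location", "I-location"]],
)
toy_results["overall_f1"]
# 预测与参考完全一致时,overall_f1 为 1.0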
现在你的 `compute_metrics` 函数已经准备好,设置训练时会再用到它。
###### 训练
在开始训练模型之前,先创建 `ID` 到标签以及标签到 `ID` 的映射 `id2label` 和 `label2id`:
id2label = {
0: "O",
1: "B-corporation",
2: "I-corporation",
3: "B-creative-work",
4: "I-creative-work",
5: "B-group",
6: "I-group",
7: "B-location",
8: "I-location",
9: "B-person",
10: "I-person",
11: "B-product",
12: "I-product",
}
label2id = {
"O": 0,
"B-corporation": 1,
"I-corporation": 2,
"B-creative-work": 3,
"I-creative-work": 4,
"B-group": 5,
"I-group": 6,
"B-location": 7,
"I-location": 8,
"B-person": 9,
"I-person": 10,
"B-product": 11,
"I-product": 12,
}
现在,可以开始训练模型了!使用\[`AutoModelForTokenClassification`\]加载 `DistilBERT`,同时指定期望的标签数量以及标签映射:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=13,
id2label=id2label,
label2id=label2id)
此时,仅剩下三个步骤:
* 在\[`TrainingArguments`\]中定义你的训练超参数。`output_dir` 是唯一需要的参数,它指定要保存模型的位置。你可以设置 `push_to_hub=True` 将模型推送到Hub(上传模型需要登录到`Hugging Face`)。在每个 epoch 结束时,\[`Trainer`\]将评估 `seqeval` 分数并保存训练检查点。
* 将训练参数与模型、数据集、分词器、数据整理器和 `compute_metrics` 函数一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]以微调你的模型。
training_args = TrainingArguments(
output_dir="my_awesome_wnut_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_wnut["train"],
eval_dataset=tokenized_wnut["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
训练完成后,使用\[`transformers.Trainer.push_to_hub`\]方法将你的模型分享到Hub,以便每个人都可以使用你的模型:
trainer.push_to_hub()
要在 `TensorFlow` 中微调模型,请首先设置一个优化器函数、学习率计划和一些训练超参数:
from transformers import create_optimizer
batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
init_lr=2e-5,
num_train_steps=num_train_steps,
weight_decay_rate=0.01,
num_warmup_steps=0,
)
然后,你可以使用\[`TFAutoModelForTokenClassification`\]加载 `DistilBERT`,同时指定期望的标签数量以及标签映射:
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=13,
id2label=id2label,
label2id=label2id
)
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
tokenized_wnut["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
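后面的\[`transformers.KerasMetricCallback`\]和 `fit` 都会用到 `tf_validation_set`,原文此处缺少对应的转换代码;下面按同样方式补充一个示意(这里假设使用 `WNUT 17` 的 `validation` 切分):
tf_validation_set = model.prepare_tf_dataset(
    tokenized_wnut["validation"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)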
将模型配置为使用 `compile` 进行训练。注意,`Transformers` 模型都有一个默认的与任务相关的损失函数,因此除非你想要指定一个,否则不需要再指定了:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在开始训练之前,还有最后两件事要做,即从预测中计算 `seqeval` 分数,并提供将模型上传到Hub的方法。这两件事都是通过使用 `Keras` 回调来完成的。
将你的 `compute_metrics` 函数传递给\[`transformers.KerasMetricCallback`\]:
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
在\[`transformers.PushToHubCallback`\]中指定将模型和分词处理器上传到哪:
from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(
output_dir="my_awesome_wnut_model",
tokenizer=tokenizer,
)
然后将回调捆绑在一起:
callbacks = [metric_callback, push_to_hub_callback]
最后,你可以开始训练模型了!使用训练和验证数据集、训练轮数以及回调函数调用 `fit` 来微调模型:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
一旦训练完成,你的模型将自动上传到Hub,这样每个人都可以使用它!
###### 推理
现在你已经微调了模型,可以用它进行推理了!首先选择一些你想要进行推理的文本:
text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
使用微调后的模型进行推理最简单的方法是在\[`pipeline`\]中使用它。用你的模型实例化一个 `NER` 的 `pipeline`,并将文本传递给它:
from transformers import pipeline
classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
classifier(text)
结果为:
[
{'entity': 'B-location',
'score': 0.42658573,
'index': 2,
'word': 'golden',
'start': 4,
'end': 10},
{'entity': 'I-location',
'score': 0.35856336,
'index': 3,
'word': 'state',
'start': 11,
'end': 16},
{'entity': 'B-group',
'score': 0.3064001,
'index': 4,
'word': 'warriors',
'start': 17,
'end': 25},
{'entity': 'B-location',
'score': 0.65523505,
'index': 13,
'word': 'san',
'start': 80,
'end': 83},
{'entity': 'B-location',
'score': 0.4668663,
'index': 14,
'word': 'francisco',
'start': 84,
'end': 93}
]
如果需要,也可以手动复制 `pipeline` 的结果:
对文本进行分词并返回 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
inputs = tokenizer(text, return_tensors="pt")
将输入传递给模型并返回 `logits`:
import torch
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
with torch.no_grad():
    logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的id2label映射将其转换为文本标签:
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class
结果为:
['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']
对文本进行分词并返回TensorFlow张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
inputs = tokenizer(text, return_tensors="tf")
将输入传递给模型并返回 `logits`:
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
logits = model(**inputs).logits
获取具有最高概率的类别,并使用模型的 `id2label` 映射将其转换为文本标签:
predicted_token_class_ids = tf.math.argmax(logits, axis=-1)
predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()]
predicted_token_class
结果为:
['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']
##### 问答
问答任务根据问题得到一个答案。如果你曾经向 `Alexa`、`Siri` 或 `Google` 等虚拟助手询问天气情况,那么你之前使用过问答模型。常见的问答任务有两种类型:
* 抽取式:从给定的上下文中抽取答案。
* 生成式:从上下文生成一个正确回答问题的答案。
开始之前,请确保你已经安装了所有必需的库:
pip install transformers datasets evaluate
###### 加载 `SQuAD` 数据集
首先,通过 `Datasets` 库加载 `SQuAD` 数据集的一个较小子集。这将给你一个机会在使用完整数据集进行训练之前进行实验和确保一切工作正常。
from datasets import load_dataset
squad = load_dataset("squad", split="train[:5000]")
使用\[`datasets.Dataset.train_test_split`\]方法将数据集的 `train` 拆分为训练集和测试集:
squad = squad.train_test_split(test_size=0.2)
然后看一个例子:
squad["train"][0]
# {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
# 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
# 'id': '5733be284776f41900661182',
# 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
# 'title': 'University_of_Notre_Dame'
# }
这里有几个重要的字段:
* `answers`:答案标记的起始位置和答案文本。
* `context`:模型需要从中提取答案的背景信息。
* `question`:模型应该回答的问题。
###### 预处理
下一步是加载DistilBERT tokenizer以处理question和context字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
还有一些特定于问答任务的预处理步骤需要注意:
1. 数据集中的一些示例可能具有非常长的 `context`,超过了模型的最大输入长度。为了处理更长的序列,只截断 `context`,将 `truncation` 设置为 `only_second`。
2. 接下来,通过设置 `return_offsets_mapping=True`,将回答的开始和结束位置映射到原始的 `context`。
3. 有了映射后,可以找到答案的开始和结束标记。使用\[`tokenizers.Encoding.sequence_ids`\]方法找出哪部分偏移对应于 `question`,哪部分对应于 `context`。
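在编写完整的预处理函数之前,可以先在单个样本上直观地看一下 `sequence_ids`:`None` 对应特殊标记,`0` 对应 `question`,`1` 对应 `context`(以下代码为补充示意,非原文内容):
sample = tokenizer(
    squad["train"][0]["question"],
    squad["train"][0]["context"],
    max_length=384,
    truncation="only_second",
    return_offsets_mapping=True,
)
sample.sequence_ids()[:8]
# 开头的 None 对应 [CLS],随后的 0 对应 question 部分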
下面的函数会截断输入,并将 `answer` 的起始和结束位置映射到 `context` 中:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # 找到上下文的开始和结束位置
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # 如果答案没有完全在上下文内,则标记为 (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # 否则取答案开始和结束 token 的位置
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
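同样可以先在前两个样本上做一次快速检查,确认起止位置被正确写入(以下代码为补充示意,非原文内容):
sample_inputs = preprocess_function(squad["train"][:2])
sample_inputs["start_positions"], sample_inputs["end_positions"]
# 若答案没有完全落在截断后的 context 内,对应位置会是 (0, 0)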
要在整个数据集上应用预处理函数,使用 `Datasets` 的\[`datasets.Dataset.map`\]函数即可。你可以通过将 `batched=True` 设置为一次处理数据集的多个元素来加快map函数的速度。删除你不需要的任何列:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
然后使用\[`DefaultDataCollator`\]创建一批示例。与 `Transformers` 中的其他数据整理器不同,\[`DefaultDataCollator`\]不会应用任何额外的预处理,例如填充。
# pytorch 代码
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()
# tensorflow代码
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(return_tensors="tf")
###### 训练
1. **pytorch代码**
如果你不熟悉使用\[`Trainer`\]微调模型,请参阅此处的基础教程!
现在你可以开始训练模型了!使用\[`AutoModelForQuestionAnswering`\]加载 `DistilBERT`:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
在这一点上,只剩下三个步骤:
* 在\[`TrainingArguments`\]中定义你的训练超参数。唯一需要的参数是 `output_dir`,指定保存模型的位置。你可以通过设置 `push_to_hub=True` 将模型推送到Hub(你需要登录Hugging Face才能上传模型)。
* 将训练参数与模型、数据集、tokenizer和数据整理器一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]进行微调模型。
training_args = TrainingArguments(
output_dir="my_awesome_qa_model",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_squad["train"],
eval_dataset=tokenized_squad["test"],
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
训练完成后,使用\[`transformers.Trainer.push_to_hub`\]方法将模型分享给Hub,这样每个人都可以使用你的模型:
trainer.push_to_hub()
2. **`tensorflow`代码**
要在 `TensorFlow` 中微调模型,请首先设置优化器、学习率计划和一些训练超参数:
from transformers import create_optimizer
batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps,
)
然后使用\[`TFAutoModelForQuestionAnswering`\]加载 `DistilBERT`:
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
tokenized_squad["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = model.prepare_tf_dataset(
tokenized_squad["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
使用 `compile` 为训练配置模型:
import tensorflow as tf
model.compile(optimizer=optimizer)
在开始训练之前,你还需要提供一种将模型推送到Hub的方法。这可以通过在\[`transformers.PushToHubCallback`\]中指定要推送模型和 `tokenizer` 的位置来完成:
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
output_dir="my_awesome_qa_model",
tokenizer=tokenizer,
)
最后,你已经准备好开始训练模型了!使用训练集、验证集、epoch 数和回调函数调用 `fit` 来微调模型:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=[callback])
训练完成后,你的模型将自动上传到Hub,以便每个人都可以使用它!
###### 推理
至此,已经微调了一个模型,现在可以用它进行推理了!提出一个问题和一些你希望模型预测的上下文:
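原文此处省略了示例的 `question` 和 `context` 变量;结合下文 `pipeline` 输出的答案内容,可以补充如下示例定义(具体文本为示意,假设与原教程一致):
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."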
在使用微调模型进行推理时,最简单的方法是在\[ `pipeline`\]中使用它。使用你的模型实例化一个问题回答的 `pipeline`,并将你的文本传递给它:
from transformers import pipeline
question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)
# {'score': 0.2058267742395401,
# 'start': 10,
# 'end': 95,
# 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}
如果你愿意,你也可以手动复制 `pipeline` 的结果:
1. **`pytorch` 代码**
对文本进行标记化并返回 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="pt")
将你的输入传递给模型并返回 `logits`:
import torch
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
with torch.no_grad():
    outputs = model(**inputs)
从模型输出中获取开始和结束位置的最高概率:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
将预测的标记解码为答案:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
2. **`tensorflow` 代码**
对文本进行标记化并返回 `TensorFlow` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="tf")
将你的输入传递给模型并返回logits:
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
outputs = model(**inputs)
从模型输出中获取开始和结束位置的最高概率:
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
将预测的标记解码为答案:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
##### 因果语言模型
语言建模有两种类型,因果和掩码。本指南介绍因果语言建模。因果语言模型经常用于文本生成。你可以将这些模型用于创意应用,例如选择你自己的文字冒险或智能编码助手(如 `Copilot` 或 `CodeParrot`)。
因果语言建模预测标记序列中的下一个标记,并且模型只能关注左侧的标记。这意味着模型无法看到未来的标记。`GPT-2` 是因果语言模型的一个例子。
开始之前,请确保安装了所有必要的库:
pip install transformers datasets evaluate
###### 加载ELI5数据集
首先,从 `Datasets` 库中加载 `ELI5` 数据集中 `r/askscience` 子集的一个较小子集。这样可以让你先进行实验,确保一切正常,然后再花更多时间在完整数据集上训练。
from datasets import load_dataset
eli5 = load_dataset("eli5", split="train_asks[:5000]")
使用\[`datasets.Dataset.train_test_split`\]方法将数据集的 `train_asks` 拆分为训练集和测试集:
eli5 = eli5.train_test_split(test_size=0.2)
然后看一个示例:
eli5["train"][0]
得到的结果为:
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
'score': [6, 3],
'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
'answers_urls': {'url': []},
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls': {'url': []}}
虽然这看起来很多,但你实际上只对"text"字段感兴趣。语言建模任务的有趣之处在于你不需要标签(也称为无监督任务),因为下一个单词就是标签。
###### 预处理
下一步是加载 `DistilGPT2` 分词器以处理 `text` 子字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
你将从上面的示例中注意到, `text` 字段实际上是嵌套在 `answers` 内部的。这意味着你需要使用\[`datasets.Dataset.flatten`\]方法从其嵌套结构中提取 `text` 子字段:
eli5 = eli5.flatten()
eli5["train"][0]
输出的结果为:
{
'answers.a_id': ['c3d1aib', 'c3d4lya'],
'answers.score': [6, 3],
'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
'answers_urls.url': [],
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls.url': []
}
现在,每个子字段都是一个单独的列,如 `answers` 前缀所示,`text` 字段现在是一个列表。不是分别对每个句子进行标记化,而是将列表转换为字符串,以便可以联合对其进行标记化。
下面是用于连接示例中的字符串列表并对结果进行标记化的第一个预处理函数:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
使用数据集\[`datasets.Dataset.map`\]方法将该预处理函数应用于整个数据集。通过将`batched=True` 设置为同时处理数据集的多个元素,并使用 `num_proc` 增加进程的数量,可以加速map函数的处理速度。删除不需要的列:
tokenized_eli5 = eli5.map(
preprocess_function,
batched=True,
num_proc=4,
remove_columns=eli5["train"].column_names,
)
该数据集包含标记序列,但其中一些序列长度超过了模型的最大输入长度。
现在,可以使用第二个预处理函数来
* 连接所有序列
* 将连接的序列拆分为长度由 `block_size` 定义的较短的块,其长度应小于最大输入长度,并且足够短以适应 `GPU RAM`。
block_size = 128
def group_texts(examples):
    # 连接所有文本。
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # 我们丢弃不足一个 block_size 的剩余部分;如果模型支持填充,也可以在这里添加填充而不是丢弃,你可以根据需要自定义此部分。
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # 按 block_size 进行拆分。
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
在整个数据集上应用 `group_texts` 函数:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
现在,使用\[`DataCollatorForLanguageModeling`\]创建一批示例。在整理过程中,将句子动态填充到一批中的最长长度,比将整个数据集填充到最大长度更高效。
使用序列结束(end-of-sequence)`token` 作为填充 `token`,并设置 `mlm=False`。这会将输入向右移动一个位置作为标签:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
在 `TensorFlow` 中同样使用序列结束 `token` 作为填充 `token`,设置 `mlm=False`,并指定 `return_tensors="tf"`:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
###### 训练
现在已经准备好开始训练模型了!使用\[`AutoModelForCausalLM`\]加载 `DistilGPT2` 模型:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
现在只剩下三个步骤:
* 使用\[`TrainingArguments`\]定义训练超参数。唯一需要的参数是 `output_dir`,用于指定保存模型的位置。设置 `push_to_hub=True`将此模型推送到Hub(你需要登录Hugging Face以上传模型)。
* 将训练参数与模型、数据集和数据整理器一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]来微调模型。
training_args = TrainingArguments(
output_dir="my_awesome_eli5_clm-model",
evaluation_strategy="epoch",
learning_rate=2e-5,
weight_decay=0.01,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_dataset["train"],
eval_dataset=lm_dataset["test"],
data_collator=data_collator,
)
trainer.train()
训练完成后,使用\[\~transformers.Trainer.evaluate\]方法评估模型并获得困惑度:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
输出结果为:
Perplexity: 49.61
然后可以使用\[\~transformers.Trainer.push_to_hub\]方法将模型分享到Hub,以便每个人都可以使用你的模型:
trainer.push_to_hub()
要在TensorFlow中微调模型,请首先设置优化器函数、学习率调度和一些训练超参数:
from transformers import create_optimizer, AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
然后可以使用\[TFAutoModelForCausalLM\]加载DistilGPT2模型:
from transformers import TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
lm_dataset["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
lm_dataset["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
使用 `compile` 为训练配置模型。请注意,`Transformer` 模型都有一个默认的与任务相关的损失函数,所以除非你想要指定一个,否则不需要特别指定:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在开始训练之前,还需要提供一种将模型推送到 Hub 的方法。这可以通过在\[`transformers.PushToHubCallback`\]中指定要推送模型和分词器的位置来实现:
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
output_dir="my_awesome_eli5_clm-model",
tokenizer=tokenizer,
)
最后,你可以开始训练模型了!使用fit方法并传入训练和验证数据集、训练轮数,以及回调函数来微调模型:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
训练完成后,你的模型会自动上传到Hub,这样每个人都可以使用它!
###### 推断
现在已经微调了模型,可以用它进行推断了!先构造一个你想要生成文本的输入提示:
prompt = "Somatic hypermutation allows the immune system to"
尝试使用\[ `pipeline`\] 中的模型进行推断是最简单的方法。使用你的模型实例化一个文本生成的 `pipeline`,并将文本传递给它:
from transformers import pipeline
generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
generator(prompt)
输出的结果为:
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
将文本分词处理并将`input_ids`返回为 `PyTorch` 张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="pt").input_ids
使用\[`transformers.generation_utils.GenerationMixin.generate`\]方法生成文本。有关不同的文本生成策略和控制生成的参数的更多细节,请查看文本生成策略页面。
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
将生成的token id解码回文本:
tokenizer.batch_decode(outputs, skip_special_tokens=True)
输出结果如下:
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
将文本分词处理并将`input_ids`返回为 `TensorFlow`张量:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="tf").input_ids
使用\[`transformers.generation_tf_utils.TFGenerationMixin.generate`\]方法生成文本。有关不同的文本生成策略和控制生成的参数的更多细节,请查看文本生成策略页面。
from transformers import TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
将生成的 `token id` 解码回文本:
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
##### 掩码语言模型
掩码语言建模是预测序列中掩码标记的任务,模型可以双向关注标记。这意味着模型可以完全访问左右两侧的标记。掩码语言建模适用于需要对整个序列具有良好上下文理解的任务。`BERT` 就是掩码语言模型的一个例子。
在开始之前,请确保已安装所有必要的库:
pip install transformers datasets evaluate
###### 加载ELI5数据集
首先,从 `Datasets`库中加载 `ELI5` 数据集中 `r/askscience` 子集的一个较小子集。这将给你一个机会进行实验,并确保一切正常,然后再花更多时间在完整数据集上进行训练。
from datasets import load_dataset
eli5 = load_dataset("eli5", split="train_asks[:5000]")
将数据集的 `train_asks` 拆分为训练集和测试集,使用\[`datasets.Dataset.train_test_split`\]方法:
eli5 = eli5.train_test_split(test_size=0.2)
然后查看一个示例:
eli5["train"][0]
输出结果为:
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
'score': [6, 3],
'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
'answers_urls': {'url': []},
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls': {'url': []}}
虽然这看起来可能有点多,但你实际上只对 `text` 字段感兴趣。关于语言建模任务的酷炫之处在于你不需要标签(也称为无监督任务),因为下一个词就是标签。
###### 预处理
对于掩码语言建模,下一步是加载一个 `DistilRoBERTa` 分词器来处理 `text` 子字段:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
你会注意到上面的示例中,`text` 字段实际上是嵌套在 `answers` 内的。这意味着你需要使用\[`flatten`\]方法从其嵌套结构中提取 `text`子字段:
eli5 = eli5.flatten()
eli5["train"][0]
输出结果为:
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
'answers.score': [6, 3],
'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
'answers_urls.url': [],
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls.url': []}
现在,每个子字段都是一个单独的列,如 `answers` 前缀所示,而 `text` 字段现在是一个列表。不需要将每个句子分别标记化,而是将列表转换为字符串,以便可以联合标记化它们。
下面是第一个预处理函数,该函数将每个示例的字符串列表连接起来,并对结果进行标记化:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
使用\[`datasets.Dataset.map`\]方法将此预处理函数应用于整个数据集。你可以通过设置 `batched=True` 以一次处理数据集的多个元素,并通过设置 `num_proc` 来增加进程数,以加快map函数的速度。删除你不需要的任何列:
tokenized_eli5 = eli5.map(
preprocess_function,
batched=True,
num_proc=4,
remove_columns=eli5["train"].column_names,
)
该数据集包含标记序列,但其中一些序列比模型的最大输入长度要长。
现在,使用第二个预处理函数:
* 连接所有序列
* 根据 `block_size` 将连接后的序列拆分为较短的块,其长度应小于最大输入长度,并且足够小,适应你的 `GPU RAM`。
block_size = 128
def group_texts(examples):
    # 合并所有文本。
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # 我们舍弃不足一个 block_size 的部分;如果模型支持填充,也可以在这里添加填充而不是舍弃,你可以根据需要自定义此部分。
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # 按 block_size 划分。
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result
在整个数据集上应用 `group_texts`函数:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
现在,使用\[`DataCollatorForLanguageModeling`\]创建一个示例批次。在整理(collation)过程中,将句子动态填充到批次中的最长长度,比将整个数据集填充到最大长度更高效。
将序列结束标记用作填充标记,并指定 `mlm_probability`,以便在每次迭代数据时随机掩码标记:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
在 `TensorFlow` 中同样将序列结束标记用作填充标记,并指定 `mlm_probability`,以便在每次迭代数据时随机掩码标记:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
###### 训练
如果你对使用\[ `Trainer` \]微调模型不太熟悉,请查看此处的基本教程。
你现在可以开始训练模型了!使用\[`AutoModelForMaskedLM`\]加载 `DistilRoBERTa`:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
只剩下三个步骤:
* 在\[`TrainingArguments`\]中定义你的训练超参数。仅需要 `output_dir` 参数,指定保存模型的位置。
* 将训练参数与模型、数据集和数据校验器一起传递给\[`Trainer`\]。
* 调用\[`Trainer.train`\]进行模型微调。
training_args = TrainingArguments(
output_dir="my_awesome_eli5_mlm_model",
evaluation_strategy="epoch",
learning_rate=2e-5,
num_train_epochs=3,
weight_decay=0.01,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_dataset["train"],
eval_dataset=lm_dataset["test"],
data_collator=data_collator,
)
trainer.train()
当训练完成后,使用\[`transformers.Trainer.evaluate`\]方法评估模型并获得模型的困惑度:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
将你的模型共享到Hub上,使用\[`transformers.Trainer.push_to_hub`\]方法,这样每个人都可以使用你的模型:
trainer.push_to_hub()
要在 `TensorFlow` 中微调模型,请首先设置优化器函数、学习率调度程序和一些训练超参数:
from transformers import create_optimizer, AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
然后可以使用\[`TFAutoModelForMaskedLM`\]加载 `DistilRoBERTa`:
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")
使用\[`transformers.TFPreTrainedModel.prepare_tf_dataset`\]将数据集转换为 `tf.data.Dataset` 格式:
tf_train_set = model.prepare_tf_dataset(
lm_dataset["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
lm_dataset["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
使用 `compile` 为模型配置训练。请注意,`Transformers` 模型都有默认的与任务相关的损失函数,因此除非你想要指定一个,否则不需要指定:
import tensorflow as tf
model.compile(optimizer=optimizer) # 没有损失参数!
在开始训练之前,还需要提供一种将模型推送到 Hub 的方法。这可以通过在\[\~transformers.PushToHubCallback\]中指定要推送模型和分词器的位置来实现:
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
output_dir="my_awesome_eli5_mlm_model",
tokenizer=tokenizer,
)
最后,你可以开始训练模型!使用你的训练和验证数据集、`epochs` 的数量以及你的回调函数来调用 `fit` 方法来微调模型:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
###### 推断
现在已经对模型进行了微调,可以用它进行推断了!先想出一些希望模型填补空白的文本,并使用特殊的 `<mask>` 标记来表示需要填补的位置:
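例如,可以构造如下带 `<mask>` 的文本(示例文本为示意,假设与原教程一致):
text = "The Milky Way is a <mask> galaxy."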