13 Transformers - Using Pipelines for Natural Language Processing

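Table of contents

Natural language processing
- Text classification
- Zero-shot text classification
- Token classification
- Question answering
- Table question answering
- Summarization
- Translation
- Text generation
- Text-to-text generation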

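Transformers is an open-source library that provides state-of-the-art (SoTA) pretrained models and tools for inference and training across natural language processing, computer vision, audio, and multimodal tasks. It aims to be fast and easy to use so that anyone can start learning or building with transformer models. The library includes not only Transformer models but also non-Transformer models, such as modern convolutional networks for computer vision tasks.

Natural language processing

NLP tasks are among the most common types of task because text is such a natural way for us to communicate. To get text into a format a model can recognize, it has to be tokenized: the text is split into individual words or subwords (tokens), and those tokens are then converted into numbers. A piece of text can therefore be represented as a sequence of numbers, and once you have that sequence it can be fed into a model to solve all sorts of NLP tasks!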
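
The two tokenization steps described above can be sketched in plain Python. This is only a toy illustration: real tokenizers use learned subword vocabularies, and the regular expression and the on-the-fly `vocab` here are assumptions made for the example, not what the library does.

```python
import re

text = "Hugging Face is the best thing since sliced bread!"

# Step 1: split the text into tokens (here: lowercase words and punctuation marks).
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Step 2: map each token to an integer id.
# The vocabulary is hypothetical and built on the fly; real models ship a fixed one.
vocab = {}
for token in tokens:
    vocab.setdefault(token, len(vocab))
input_ids = [vocab[token] for token in tokens]

print(tokens)
print(input_ids)
```

The list of ids, not the raw string, is what a model actually consumes.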

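Text classification

Like classification tasks in any modality, text classification labels a piece of text (a sentence, a paragraph, or a whole document) with a label from a predefined set of classes. It has many practical applications, including:

Sentiment analysis: labels text according to some polarity, such as positive or negative, which can inform decision-making in fields such as politics, finance, and marketing.
Content classification: labels text according to some topic (weather, sports, finance, and so on) to help organize and filter information in news and social media feeds.

The task identifier for text classification is sentiment-analysis (an alias of text-classification).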

from transformers import pipeline

classifier = pipeline(task="sentiment-analysis")
preds = classifier("Hugging Face is the best thing since sliced bread!")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds

Result:

[{'score': 0.9991, 'label': 'POSITIVE'}]
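
To give a feel for where a label and score like the ones above come from, here is a minimal sketch that turns raw classifier outputs (logits) into a probability with softmax. The label order and the logit values are hypothetical, chosen only for illustration; they are not taken from the model used above.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]  # assumed label order
logits = [-3.5, 3.5]               # hypothetical raw model outputs

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
pred = {"label": labels[best], "score": round(probs[best], 4)}
print(pred)
```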

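Zero-shot text classification

Zero-shot classification is an NLP task that assigns the input text to one of a set of candidate labels supplied at inference time, rather than to classes fixed during training.

The task identifier for zero-shot text classification is zero-shot-classification.

First, instantiate the pipeline and pass in the model to use, facebook/bart-large-mnli: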

from transformers import pipeline

classify = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

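Because the specified model already declares its task in its default configuration, the task identifier does not strictly need to be passed; it is given here for clarity. Now use the classify instance to run inference: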

output = classify(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)
print(output)

Output:

{
"sequence": "I have a problem with my iphone that needs to be resolved asap!!",
"labels":["urgent", "not urgent", "phone", "tablet", "computer"],
"scores": [0.5036360025405884, 0.4787988066673279, 0.012600637972354889, 0.02655780641362071, 0.023087705485522747]
}
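
The labels and scores lists are parallel, so a common follow-up is to pair them and pick the best candidate. A minimal sketch in plain Python, using the output shown above as literal data:

```python
output = {
    "sequence": "I have a problem with my iphone that needs to be resolved asap!!",
    "labels": ["urgent", "not urgent", "phone", "tablet", "computer"],
    "scores": [0.5036360025405884, 0.4787988066673279, 0.012600637972354889,
               0.02655780641362071, 0.023087705485522747],
}

# Pair each label with its score and sort from most to least likely.
ranked = sorted(zip(output["labels"], output["scores"]),
                key=lambda pair: pair[1], reverse=True)
top_label, top_score = ranked[0]

print(top_label, round(top_score, 4))
```

Sorting explicitly, rather than relying on the order of the lists, keeps the code correct however the result happens to be ordered.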

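Token classification

In any NLP task, text is preprocessed by splitting the sequence into individual words or subwords, known as tokens. Token classification assigns each token a label from a predefined set of classes.

Two common types of token classification are:

Named entity recognition (NER): labels a token according to an entity category such as organization, person, location, or date. NER is especially popular in biomedical settings, where it can label genes, proteins, and drug names.
Part-of-speech tagging (POS): labels a token according to its part of speech, such as noun, verb, or adjective. POS is useful for helping translation systems understand how two identical words differ grammatically (for example, "bank" as a noun versus "bank" as a verb).

The pipeline example below performs named entity recognition, selected by passing the task identifier ner (an alias of token-classification).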

from transformers import pipeline

classifier = pipeline(task="ner")
preds = classifier("Hugging Face is a French company based in New York City.")
result = [
    {
        "entity": pred["entity"],
        "score": round(pred["score"], 4),
        "index": pred["index"],
        "word": pred["word"],
        "start": pred["start"],
        "end": pred["end"],
    }
    for pred in preds
]
print(*result, sep="\n")

Result:

{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24}
{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50}
{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55}
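
The result above is reported per token, so "Hugging Face" arrives as three pieces. A minimal sketch of how adjacent tokens with the same entity type could be merged back into whole entities using the start and end character offsets follows. The merging rule (same type, gap of at most one character) is an assumption made for this example; the pipeline itself also accepts an aggregation_strategy argument for grouping, which is not shown here.

```python
text = "Hugging Face is a French company based in New York City."

# Token-level predictions copied from the result above (scores omitted).
preds = [
    {"entity": "I-ORG", "start": 0, "end": 2},
    {"entity": "I-ORG", "start": 2, "end": 7},
    {"entity": "I-ORG", "start": 8, "end": 12},
    {"entity": "I-MISC", "start": 18, "end": 24},
    {"entity": "I-LOC", "start": 42, "end": 45},
    {"entity": "I-LOC", "start": 46, "end": 50},
    {"entity": "I-LOC", "start": 51, "end": 55},
]

def merge_entities(text, preds):
    """Merge neighbouring tokens that share an entity type into one span."""
    spans = []
    for pred in preds:
        label = pred["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        previous = spans[-1] if spans else None
        # Extend the previous span if the type matches and the gap is at most one character.
        if previous and previous["label"] == label and pred["start"] - previous["end"] <= 1:
            previous["end"] = pred["end"]
        else:
            spans.append({"label": label, "start": pred["start"], "end": pred["end"]})
    return [(text[s["start"]:s["end"]], s["label"]) for s in spans]

entities = merge_entities(text, preds)
print(entities)
```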

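Question answering

Question answering is another token-level task. It returns an answer to a question, sometimes with context (open-domain) and sometimes without context (closed-domain). This happens whenever we ask a virtual assistant something, such as whether a restaurant is open. It can also provide customer or technical support and help search engines retrieve the relevant information you are asking for.

There are two common ways of producing an answer:
- Extractive: given a question and some context, the model extracts a span of text from the context that answers the question.
- Abstractive: given a question and some context, the answer is generated from the question and the context. This approach is handled by the [Text2TextGenerationPipeline] rather than the [QuestionAnsweringPipeline] shown below.

The task identifier for question answering is question-answering.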

from transformers import pipeline

question_answerer = pipeline(task="question-answering")
preds = question_answerer(
    question="What is the name of the repository?",
    context="The name of the repository is huggingface/transformers",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

The result is:

score: 0.9327, start: 30, end: 54, answer: huggingface/transformers
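
Because this is extractive question answering, start and end are character offsets into the context. A short sketch, using the values from the result above as literal data, shows that slicing the context recovers the answer:

```python
context = "The name of the repository is huggingface/transformers"

# Values copied from the result above.
pred = {"score": 0.9327, "start": 30, "end": 54, "answer": "huggingface/transformers"}

# Slicing the context with the reported offsets recovers the answer span.
span = context[pred["start"]:pred["end"]]
print(span)
```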

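Table question answering

Table question answering answers a question using the information in a given table.

Application scenarios:

Automated customer service systems
Intelligent search engines
Data visualization tools
Enterprise knowledge graph construction
Automatic extraction from scientific literature

The task identifier for table question answering is table-question-answering.

First, instantiate the pipeline, specifying the task table-question-answering and the model google/tapas-base-finetuned-wtq: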

from transformers import pipeline
qa = pipeline(task="table-question-answering", model="google/tapas-base-finetuned-wtq")

Next, define a table to serve as the input to analyze:

table = {
    "Repository": ["Transformers", "Datasets", "Tokenizers"],
    "Stars": ["36542", "4512", "3934"],
    "Contributors": ["651", "77", "34"],
    "Programming language": ["Python", "Python", "Rust, Python and NodeJS"],
}

Finally, use the qa instance to run the query and print the output:

output = qa(query="How many stars does the transformers repository have?", table=table)
print(output)

Result:

{'answer': 'AVERAGE > 36542', 'coordinates': [(0, 1)], 'cells': ['36542'], 'aggregator': 'AVERAGE'}
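
To make the result easier to read, the sketch below looks up the selected cell by hand and applies the reported aggregation. It assumes that coordinates is a list of (row, column) pairs, with the column given as an index into the table's column order, and that the AVERAGE aggregator means the mean of the selected cells; the result is written out as literal data rather than produced by running the model.

```python
table = {
    "Repository": ["Transformers", "Datasets", "Tokenizers"],
    "Stars": ["36542", "4512", "3934"],
    "Contributors": ["651", "77", "34"],
    "Programming language": ["Python", "Python", "Rust, Python and NodeJS"],
}

# Result from above, written as literal data.
output = {"answer": "AVERAGE > 36542", "coordinates": [(0, 1)],
          "cells": ["36542"], "aggregator": "AVERAGE"}

columns = list(table)  # column names in insertion order
# Assumption: each coordinate is (row index, column index).
cells = [table[columns[col]][row] for row, col in output["coordinates"]]

# Apply the aggregation named in the result to the selected cells.
average = sum(float(cell) for cell in cells) / len(cells)

print(cells, average)
```

With a single selected cell, the average is simply that cell's value, which is why the answer reads "AVERAGE > 36542".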

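Summarization

Summarization creates a shorter version of a longer text while preserving as much of the original document's meaning as possible. It is a sequence-to-sequence task: the output is a shorter text sequence than the input. Many long-form documents can be summarized to help readers quickly grasp the main points. Bills, legal and financial documents, patents, and scientific papers are examples of documents that can be summarized to save readers time and serve as a reading aid.

Like question answering, summarization comes in two types:

Extractive: identifies and extracts the most important sentences from the original text.
Abstractive: generates the target summary from the original text, possibly including new words that do not appear in the input document. The [SummarizationPipeline] uses the abstractive approach.

The task identifier for summarization is summarization.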

from transformers import pipeline

summarizer = pipeline(task="summarization")
summarizer(
    "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles."
)

Result:

[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}]
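
For contrast with the abstractive pipeline above, the extractive approach can be sketched without any model at all. The following toy example scores each sentence by the average frequency of its words and keeps the highest-scoring ones. The scoring rule and the helper names are assumptions made for illustration; this is not how any particular library implements extractive summarization.

```python
import re
from collections import Counter

def split_words(text):
    """Lowercase the text and return its words."""
    return re.findall(r"[a-z']+", text.lower())

def extractive_summary(text, num_sentences=1):
    """Return the num_sentences highest-scoring sentences, in original order."""
    # Split into sentences at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(split_words(text))

    def score(sentence):
        # Average document-level frequency of the words in the sentence.
        words = split_words(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    return [s for s in sentences if s in top]

document = "Cats sleep. Cats eat fish. Dogs bark."
summary = extractive_summary(document)
print(summary)
```

Every sentence in the output is copied verbatim from the input, which is the defining property of the extractive type.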

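Translation

Translation converts a sequence of text in one language into another language. It helps people from different backgrounds communicate with each other, helps content reach a wider audience, and can even serve as a learning tool for people studying a new language. Like summarization, translation is a sequence-to-sequence task: the model receives an input sequence and returns a target output sequence.

The task identifier for translation is translation_xx_to_yy (where xx and yy stand for the source and target language codes, for example translation_en_to_fr) or translation.

In the early days, translation models were mostly monolingual, but recently there has been growing interest in multilingual models that can translate between many pairs of languages.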

from transformers import pipeline

text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning."
translator = pipeline(task="translation", model="google-t5/t5-small")  # the language pair comes from the prefix in text
translator(text)

The result is:

[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}]
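
In the example above, the language pair is stated in the input text itself, as the prefix "translate English to French: ". A small sketch of building such prefixed inputs follows. The helper name and the language table are hypothetical, introduced only for illustration, and which prefixes a given model understands depends on how it was trained.

```python
# Hypothetical mapping from language codes to the names used in the prefix.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German"}

def build_prompt(text, source, target):
    """Prepend a T5-style task prefix naming the source and target languages."""
    return f"translate {LANGUAGE_NAMES[source]} to {LANGUAGE_NAMES[target]}: {text}"

prompt = build_prompt(
    "Hugging Face is a community-based open-source platform for machine learning.",
    source="en",
    target="fr",
)
print(prompt)
```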

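Text generation

Text generation is a task that predicts words in a sequence of text. It has become a very popular NLP task because a pretrained language model can be fine-tuned for many other downstream tasks. Recently there has been a great deal of interest in large language models (LLMs), which demonstrate zero-shot or few-shot learning. This means the model can solve tasks it was not explicitly trained to do! Language models can generate fluent and convincing text, but care is needed because the text may not always be accurate.

There are two types of language modeling:

Causal: the model's objective is to predict the next token in a sequence, and future tokens are masked. The task identifier for this approach is text-generation.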

from transformers import pipeline

prompt = "Hugging Face is a community-based open-source platform for machine learning."
generator = pipeline(task="text-generation")
generator(prompt)

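Masked: the model's objective is to predict a masked token in a sequence while having full access to all the other tokens in the sequence. The task identifier for this approach is fill-mask.

from transformers import pipeline

text = "Hugging Face is a community-based open-source <mask> for machine learning."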
fill_mask = pipeline(task="fill-mask")
preds = fill_mask(text, top_k=1)
preds = [
    {
        "score": round(pred["score"], 4),
        "token": pred["token"],
        "token_str": pred["token_str"],
        "sequence": pred["sequence"],
    }
    for pred in preds
]
preds

Result:

[{'score': 0.224, 'token': 3944, 'token_str': ' tool', 'sequence': 'Hugging Face is a community-based open-source tool for machine learning.'}]
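
The difference between the two types comes down to which tokens each position is allowed to see. A minimal sketch in plain Python, using 1 for "visible" and 0 for "hidden", follows; the function names and the word-level tokens are assumptions for illustration, and real models apply such masks inside their attention layers.

```python
def causal_mask(n):
    """Position i may see positions 0..i only; later tokens are hidden."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def full_mask(n):
    """Every position may see every other position, as in masked language modeling."""
    return [[1] * n for _ in range(n)]

tokens = ["Hugging", "Face", "is", "open-source"]
causal = causal_mask(len(tokens))

for token, row in zip(tokens, causal):
    visible = [t for t, keep in zip(tokens, row) if keep]
    print(f"{token!r} can see {visible}")
```

With the causal mask each token sees only itself and the tokens to its left, which is what makes next-token prediction meaningful; with the full mask the model can use context on both sides of the masked position.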

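Text-to-text generation

Text-to-text generation and text generation are both subfields of natural language processing (NLP), but they differ in focus and application. Text generation refers to techniques that automatically produce text content, such as news reports, product descriptions, or chatbot dialogue; it typically trains a language model with deep learning so that the model can generate new text from an input condition or prompt. Text-to-text generation refers to techniques that transform one piece of text into another, such as machine translation, summarization, or style transfer; it typically uses sequence-to-sequence or Transformer models to generate new text from the input text. In short, text generation focuses on producing text content, while text-to-text generation focuses on converting one text into another.

The task identifier for text-to-text generation is text2text-generation.

First, instantiate the pipeline, specifying the task text2text-generation and the model google/flan-t5-small: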

from transformers import pipeline

generator = pipeline(task="text2text-generation", model="google/flan-t5-small")

Then run inference and print the result:

output = generator("Translate to German: My name is Arthur")
print(output)

Result:

[{"generated_text": "Meinen Name ist Arthur."}]
