Official link

Fast Transformer Inference with Better Transformer --- PyTorch Tutorials 2.0.1+cu117 documentation

FAST TRANSFORMER INFERENCE WITH BETTER TRANSFORMER

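This tutorial introduces Better Transformer (BT), which shipped as part of PyTorch 1.12. We show how to use Better Transformer for production inference with torchtext. Better Transformer is a production-ready fastpath that accelerates the deployment of Transformer models with high performance on CPU and GPU. The fastpath works transparently for models based either directly on PyTorch core nn.Module classes or on torchtext.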

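Models that use the PyTorch core torch.nn module classes TransformerEncoder, TransformerEncoderLayer, and MultiheadAttention can be accelerated by Better Transformer fastpath execution. In addition, torchtext has been updated to use the core library modules so that it benefits from fastpath acceleration. (Fastpath execution may be enabled for additional modules in the future.)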

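Better Transformer offers two types of acceleration:

A native multihead attention (MHA) implementation for CPU and GPU that improves overall execution efficiency.
Exploiting sparsity in NLP inference. Because input lengths vary, a batch may contain a large number of padding tokens; skipping their processing can deliver significant speedups.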
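To make the sparsity point concrete, the sketch below estimates how much of a padded batch consists of padding. It is a plain-Python illustration only; the function name `padding_fraction` and the token counts are hypothetical and do not come from PyTorch or torchtext.

```python
def padding_fraction(lengths):
    """Return the share of positions that are padding when every
    sequence is padded to the longest length in the batch."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    longest = max(lengths)
    total_slots = longest * len(lengths)   # size of the padded batch
    real_tokens = sum(lengths)             # positions holding actual tokens
    return (total_slots - real_tokens) / total_slots


# Hypothetical token counts: two short sentences and one long passage.
example_lengths = [4, 6, 200]
print(f"padding fraction: {padding_fraction(example_lengths):.2%}")
```

When one long sequence is batched with short ones, most positions are padding, which is the situation where skipping padding tokens is likely to help most.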

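Fastpath execution is subject to some criteria. Most importantly, the model must be executed in inference mode and must operate on input tensors that do not collect gradient information (for example, by running under torch.no_grad).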
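The two conditions named above can be sketched with a small, randomly initialized encoder. This is an illustration under assumptions: the layer sizes and input shape are arbitrary, and the snippet only sets up the conditions (eval mode, no gradient tracking); it does not verify which execution path PyTorch actually takes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in encoder with arbitrary sizes, used only for illustration.
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Input shape is (batch, sequence, features) because batch_first=True.
src = torch.rand(2, 5, 32)

encoder.eval()           # inference mode: dropout off, encoder.training is False
with torch.no_grad():    # no gradient information is recorded
    out = encoder(src)

print(encoder.training, out.requires_grad, tuple(out.shape))
```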

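Better Transformer Features in This Tutorial

Load pretrained models (created before PyTorch 1.12, without Better Transformer)
Run and benchmark inference on CPU with and without BT fastpath (native MHA only)
Run and benchmark inference on a (configurable) DEVICE with and without BT fastpath (native MHA only)
Enable sparsity support
Run and benchmark inference on a (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)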

Additional Information

Additional information about Better Transformer can be found in the PyTorch.Org blog post A Better Transformer for Fast Transformer Inference.

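Setup

Load pretrained models

We download the XLM-R model from the predefined torchtext models by following the instructions in torchtext.models. We also set DEVICE so that the tests can run on an accelerator. (Enable GPU execution as appropriate for your environment.)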

import torch
import torch.nn as nn
import torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor

print(f"torch version: {torch.__version__}")

# Use the GPU when one is available, otherwise fall back to the CPU.
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"torch cuda available: {torch.cuda.is_available()}")

# XLM-R large encoder with a 2-class classification head on top.
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
classifier_head = RobertaClassificationHead(num_classes=2, input_dim=1024)
model = xlmr_large.get_model(head=classifier_head)
# Text preprocessing transform that matches the pretrained encoder.
transform = xlmr_large.transform()

Dataset Setup

We set up two types of input: a small input batch and a big input batch with sparsity.

small_input_batch = [
               "Hello world",
               "How are you!"
]
big_input_batch = [
               "Hello world",
               "How are you!",
               """`Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.`

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite."""
]

Next, we select either the small or the big input batch, preprocess the inputs, and test the model.

input_batch = big_input_batch

# Tokenize the batch and pad it into one tensor; 1 is the padding token index used by XLM-R.
model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape
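The padding step performed by to_tensor can be pictured with a plain-Python stand-in. This is a sketch, not torchtext's implementation: `pad_batch` and `padding_mask` are hypothetical helper names, and the token ids are made up for illustration.

```python
def pad_batch(token_ids, padding_value=1):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in token_ids)
    return [seq + [padding_value] * (longest - len(seq)) for seq in token_ids]


def padding_mask(padded, padding_value=1):
    """Mark padded positions with True.

    Assumes padding_value never occurs as a real token id in the batch.
    """
    return [[tok == padding_value for tok in seq] for seq in padded]


# Hypothetical token ids for one short and one longer sequence.
batch = [[0, 5, 2], [0, 7, 8, 9, 2]]
padded = pad_batch(batch)
print(padded)
print(padding_mask(padded))
```

The positions marked True are the ones that sparsity support can skip.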

Finally, we set the benchmark iteration count:

ITERATIONS=10
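The tutorial relies on the PyTorch profiler for its measurements. As a complement, a simple wall-clock average over the same number of iterations can be sketched as follows; `average_seconds` is a hypothetical helper, not part of PyTorch or torchtext.

```python
import time


def average_seconds(fn, iterations=10):
    """Call fn() `iterations` times and return the mean wall-clock time per call."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


# Usage with a trivial stand-in workload.
elapsed = average_seconds(lambda: sum(range(10_000)), iterations=10)
print(f"average time per call: {elapsed:.6f} s")
```

When timing GPU work this way, call torch.cuda.synchronize() inside the timed function, because CUDA kernels are launched asynchronously and the clock would otherwise stop before the work finishes.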

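Execution

Run and benchmark inference on CPU with and without BT fastpath (native MHA only)

We run the model on CPU and collect profile information:

The first run uses traditional ("slow path") execution.
The second run enables BT fastpath execution by putting the model in inference mode with model.eval() and disabling gradient collection with torch.no_grad().

When the model executes on CPU, you can see an improvement (its magnitude depends on the CPU model). Note that the fastpath profile shows most of the execution time in the native TransformerEncoderLayer implementation, aten::_transformer_encoder_layer_fwd.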

print("slow path:")
print("==========")
# The model is still in training mode and gradients are enabled, so the fastpath is not used.
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

# Switch to inference mode; together with torch.no_grad() below this enables the fastpath.
model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)
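The printed profile can be long. A minimal sketch of summarizing it, assuming a small randomly initialized encoder as a stand-in for XLM-R, uses the profiler's key_averages() table sorted by self CPU time so that the dominant operators appear first. The layer sizes are arbitrary, and which operator names show up depends on the PyTorch version and the execution path taken.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in encoder (arbitrary sizes) so the sketch runs without a download.
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()

src = torch.rand(2, 5, 32)

with torch.autograd.profiler.profile() as prof:
    with torch.no_grad():
        for _ in range(3):
            encoder(src)

# Aggregate events by operator name and show the five most expensive ones.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```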

Run and benchmark inference on a (configurable) DEVICE with and without BT fastpath (native MHA only)

We check the BT sparsity setting:

model.encoder.transformer.layers.enable_nested_tensor

We disable BT sparsity:

model.encoder.transformer.layers.enable_nested_tensor=False

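We run the model on DEVICE and collect profile information for native MHA execution on DEVICE:

The first run uses traditional ("slow path") execution.
The second run enables BT fastpath execution by putting the model in inference mode with model.eval() and disabling gradient collection with torch.no_grad().

When executing on a GPU, you should see a significant speedup. Because sparsity support was disabled above, this run measures native MHA only; the big input batch, which includes sparsity, is the setting expected to gain the most once sparsity support is enabled: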

# Move the model and the input to the selected device.
model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
# model.eval() was already called in the CPU section, but gradient tracking is
# still enabled here, so this run does not use the fastpath.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)
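The feature list earlier also mentions enabling sparsity support and running with native MHA + sparsity. A minimal sketch of that configuration, assuming a small randomly initialized encoder instead of XLM-R, is shown below. The layer sizes, sequence lengths, and the choice to pass enable_nested_tensor in the constructor (rather than toggling the attribute as above) are assumptions made for illustration; the snippet sets up the conditions for sparsity support but does not measure any speedup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in encoder (arbitrary sizes) with sparsity support requested.
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=True)
encoder.eval()

# Three sequences padded to length 8; only the last one uses every position.
lengths = [2, 3, 8]
src = torch.rand(len(lengths), 8, 32)

# Boolean padding mask: True marks positions that hold padding.
pad_mask = torch.zeros(len(lengths), 8, dtype=torch.bool)
for row, n in enumerate(lengths):
    pad_mask[row, n:] = True

with torch.no_grad():
    out = encoder(src, src_key_padding_mask=pad_mask)

print(tuple(out.shape))
```

The padding mask is what tells the encoder which positions can be skipped, so sparsity support has no effect unless a mask is supplied.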

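Summary

In this tutorial, we introduced fast transformer inference with Better Transformer fastpath execution in torchtext, using the PyTorch core Better Transformer support for Transformer Encoder models. We demonstrated the use of Better Transformer with models trained before BT fastpath execution was available. We also demonstrated and benchmarked the BT fastpath execution mode, native MHA execution, and BT sparsity acceleration.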
