Hugging Face系列2：详细剖析Hugging Face网站资源——实战六类开源库

Hugging Face系列2：详细剖析Hugging Face网站资源------实战六类开源库

前言
本篇摘要
[2. Hugging Face开源库](#2. Hugging Face开源库)
- [2.1 transformers](#2.1 transformers)
- - [2.1.1 简介](#2.1.1 简介)
  - [2.1.2 实战](#2.1.2 实战)
  - - [1. 文本分类](#1. 文本分类)
    - [2. 图像识别](#2. 图像识别)
    - [3. 在Pytorch和TensorFlow中使用pipeline](#3. 在Pytorch和TensorFlow中使用pipeline)
- [2.2 diffusers](#2.2 diffusers)
- - [2.2.1 简介](#2.2.1 简介)
  - [2.2.2 实战](#2.2.2 实战)
  - - [1. 管线](#1. 管线)
    - [2. 模型和调度器](#2. 模型和调度器)
- [2.3 datasets](#2.3 datasets)
- - [2.3.1 简介](#2.3.1 简介)
  - [2.3.2 实战](#2.3.2 实战)
- [2.4 PEFT](#2.4 PEFT)
- - [2.4.1 简介](#2.4.1 简介)
  - [2.4.2 实战](#2.4.2 实战)
  - - [1. peft模型](#1. peft模型)
    - [2. 新旧模型结构对比](#2. 新旧模型结构对比)
    - [3. PEFT模型推理](#3. PEFT模型推理)
- [2.5 accelerate](#2.5 accelerate)
- - [2.5.1 简介](#2.5.1 简介)
  - [2.5.2 实战](#2.5.2 实战)
  - - [1. 与pytorch融合](#1. 与pytorch融合)
    - [2. 命令行accelerate config&launch](#2. 命令行accelerate config&launch)
    - [3. DeepSpeed](#3. DeepSpeed)
    - [4. notebook_launcher](#4. notebook_launcher)
- [2.6 optimum](#2.6 optimum)
- - [2.6.1 简介](#2.6.1 简介)
  - [2.6.2 实战](#2.6.2 实战)
  - - [1. ONNX&ONNX Runtime](#1. ONNX&ONNX Runtime)
    - [2. 加速推理](#2. 加速推理)
    - [3. 加速训练](#3. 加速训练)
参考文献

前言

本系列文章旨在全面系统的介绍Hugging Face，让小白也能熟练使用Hugging Face上的各种开源资源，并上手创建自己的第一个Space App，在本地加载Hugging Face管线训练自己的第一个模型，并使用模型生成采样数据，同时详细解决部署中出现的各种问题。后续文章会分别介绍采样器及其加速、显示分类器引导扩散模型、CLIP多模态图像引导生成、DDMI反转及控制类大模型ControlNet等，根据反馈情况可能再增加最底层的逻辑公式和从零开始训练LLM等，让您从原理到实践彻底搞懂扩散模型和大语言模型。欢迎点赞评论、收藏和关注。

本系列文章如下：

《详细剖析Hugging Face网站资源------models/datasets/spaces》：全面系统的介绍Hugging Face资源；
《详细剖析Hugging Face网站资源------实战六类开源库》
《从0到1：使用Hugging Face管线加载Diffusion模型生成第一张图像》：在本地加载Hugging Face管线训练自己的第一个模型，并使用模型生成采样数据，同时详细解决部署中出现的各种问题。

本篇摘要

本篇主要介绍Hugging Face。Hugging Face是一个人工智能的开源社区，是相关从业者协作和交流的平台。它的核心产品是Hugging Face Hub，这是一个基于Git进行版本管理的存储库，截至2024年5月，已托管了65万个模型、14.5万个数据集以及超过17万个Space应用。另外，Hugging Face还开源了一系列的机器学习库如Transformers、Datasets和Diffusers等，以及界面演示工具Gradio。此外，Hugging Face设计开发了很多学习资源，比如与NLP（大语言模型）、扩散模型及深度强化学习等相关课程。最后介绍一些供大家交流学习的平台。为了更有趣，本篇介绍了大量有趣的Spaces应用，比如换装IDM-VTON、灯光特效IC-Light、LLM性能排行Artificial Analysis LLM Performance Leaderboard和自己部署的文生图模型stable-diffusion-xl-base-1.0、对图片精细化的stable-diffusion-xl-refiner-1.0等。只要读者认真按着文章操作，上述操作都可自己实现。下面对以上内容逐一介绍。

2. Hugging Face开源库

除了Hugging Face Hub提供的models、datasets和spaces，Hugging Face还在GitHub上开源了一系列的机器学习库和工具。在GitHub的Hugging Face组织页面，其置顶了一些开源代码仓库，包括transformers、diffusers、datasets、peft、accelerate以及optimum，如下图所示。本篇逐一详细介绍并给出对应的实战用例，方便读者更直观的理解和应用。

2.1 transformers

2.1.1 简介

transformers 提供了数以千计的可执行不同任务的SOTA^[1](#1)预训练模型，这些任务包括（1）支持 100 多种语言的文本分类、信息抽取、问答、摘要、翻译、文本生成的文本任务；（2）用于图像分类，目标检测和分割等图像任务；（3）用于语音识别和音频分类等音频任务。Transformer模型还可以在几种模式的组合上执行任务，例如表格问答、光学字符识别、从扫描文档中提取信息、视频分类和视觉问答等。

Transformers 提供了便于快速下载和使用的API，让你可以将预训练模型在给定文本、数据集上微调后，通过 model hub 与社区共享；同时，每个定义的 Python 模块均完全独立，方便修改和快速研究实验。Transformers 支持三个最热门的深度学习库： Jax, PyTorch 以及 TensorFlow------并与之无缝整合。你可以直接使用一个框架训练你的模型然后用另一个加载和推理，以支持框架之间的互操作。模型还可以导出成ONNX和TorchScript等格式，以方便在生产环境中部署。

用户可以通过transformers直接测试和使用model hub中的大部分模型，以节省训练时间和资源，降低成本。

2.1.2 实战

Hugging Face提供pipeline API以便在给定的输入(文本、图像、音频等)上使用模型。模型训练期间，管道用预训练模型实现预处理。

在使用任何库之前，需要先安装：

python 复制代码

# upgrade参数保证安装最新版
pip install --upgrade XXX

下面看三个例子。

1. 文本分类

使用pipeline实现文本分类：

python 复制代码

>>> from transformers import pipeline

# Allocate a pipeline for sentiment-analysis. The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text.
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to introduce pipeline to the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9996980428695679}]

2. 图像识别

使用pipeline实现图像识别：

python 复制代码

>>> import requests
>>> from PIL import Image
>>> from transformers import pipeline

# Download an image with cute cats
>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
>>> image_data = requests.get(url, stream=True).raw
>>> image = Image.open(image_data)

# Allocate a pipeline for object detection
>>> object_detector = pipeline('object-detection')
>>> object_detector(image)
[{'score': 0.9982201457023621,
  'label': 'remote',
  'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},
 {'score': 0.9960021376609802,
  'label': 'remote',
  'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},
 {'score': 0.9954745173454285,
  'label': 'couch',
  'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},
 {'score': 0.9988006353378296,
  'label': 'cat',
  'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},
 {'score': 0.9986783862113953,
  'label': 'cat',
  'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]

3. 在Pytorch和TensorFlow中使用pipeline

代码如下：

python 复制代码

>>> from transformers import AutoTokenizer, AutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

# pytorch uses pt and tensorflow uses tf
>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)

代码解释：tokenizer负责为预训练模型预处理字符串，并且可指定返回张量格式，然后传入model作训练，产生最终输出。示例代码演示了如何在PyTorch或TensorFlow框架中使用模型，并在新数据集上使用训练API快速微调预处理模型。

2.2 diffusers

2.2.1 简介

diffusers是SOTA预训练扩散模型的首选库，可用于生成图像、音频，甚至分子的3D结构。diffusers是一个模块化工具箱，既可支持简单的推理解决方案，也支持训练自己的扩散模型。diffusers库设计注重可用性而非性能，简单而非简易，可定制化而非抽象。

diffusers提供三个核心组件：1）最先进的扩散管道diffusion pipelines，只需几行代码即可进行推理；2）可互换噪声调度器noise schedulers，可用于调节推理过程中模型生成中的扩散速度和输出质量；3）预训练模型pretrained models可用作构建块，并与调度器相结合，以创建自己的端到端扩散系统。

2.2.2 实战

实战包括管线、模型和调度器。

1. 管线

使用diffusers生成输出非常容易，如文生图，可使用from_pretrained方法加载预训练的扩散模型stable-diffusion：

python 复制代码

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline.to("cuda")
pipeline("An image of a squirrel in Picasso style").images[0]

2. 模型和调度器

还可以深入研究预训练模型和噪声调度器的工具箱，构建自己的扩散系统：

python 复制代码

from diffusers import DDPMScheduler, UNet2DModel
from PIL import Image
import torch

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
scheduler.set_timesteps(50)

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")
input = noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
        prev_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
        input = prev_noisy_sample

image = (input / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).round().astype("uint8"))
image

2.3 datasets

2.3.1 简介

datasets是一个轻量级库，旨在让社区轻松添加和共享新的数据集，它提供两个主要功能：

可用于众多公共数据集的dataloaders：可一行代码下载和预处理HuggingFace Datasets Hub提供的任意数据集，包括图像数据集、音频数据集、467种语言/方言的文本数据集等。示例如squad_dataset=load_dataset（"squad"）。
高效的数据预处理：可简单、快速和可复制的预处理众多公共数据集及本地的CSV、JSON、text、PNG、JPEG、WAV、MP3、Parquet等格式数据。示例如processed_dataset=dataset.map（process_example）。

此外，datasets还有许多其他有趣的功能：

适配于大型数据集：datasets将用户从RAM内存限制中解放出来，所有数据集都使用高效的零序列化成本后端（Apache Arrow）进行内存映射。
智能缓存：重复处理数据无需等待。
支持轻量快速透明的Python API，Python API还具有多处理器、缓存及内存映射等特性。
内置支持与NumPy、pandas、PyTorch、TensorFlow 2和JAX进行互操作。
原生支持音频和图像数据。
支持启用流模式以节省磁盘空间，可快速开启对数据集进行迭代。

需要指出的是，HuggingFace Datasets源于优秀的TensorFlow数据集的一个分支，两者差异主要是HuggingFace Datasets动态加载python脚本而不是在库中提供、后端序列化基于Apache Arrow而不是TF Records、面向用户的datasets对象（封装了一个内存映射的Arrow表缓存）基于与框架无关的由tf.data的方法激发的数据集类而不是基于tf.data.Dataset。

2.3.2 实战

datasets的API以函数datasets.load_dataset(dataset_name, **kwargs)为中心，该函数用于实例化数据集。

加载文本数据集的示例：

python 复制代码

from datasets import load_dataset

# Print all the available datasets
from huggingface_hub import list_datasets
print([dataset.id for dataset in list_datasets()])

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)

当数据集比磁盘空间大，或者不想下载数据时，可以使用流媒体，示例如下：

python 复制代码

# If you want to use the dataset immediately and efficiently stream the data as you iterate over the dataset
image_dataset = load_dataset('cifar100', streaming=True)
for example in image_dataset["train"]:
    break

2.4 PEFT

2.4.1 简介

PEFT（Parameter-Efficient Fine-Tuning）即SOTA高效参数微调方法，由于大型模型的参数规模，完全微调模型成本极高，而PEFT仅通过微调少量参数就可以达到完全微调的性能，使型预训练模型有效地适应各种下游应用，显著降低计算和存储成本。

PEFT与Transformers集成，可轻松进行模型训练和推理；与Diffuser集成，可方便地管理不同的适配器；与Accelerate集成，可用于真正大型模型的分布式训练和推理。

提示：（1）PEFT方法如何用于下游各类任务可参考PEFT 组织部分jupyter notebook的演示；（2）查看PEFT API Reference的 Adapters部分，可得到PEFT支持的方法列表。阅读CONCEPTUAL GUIDES部分学习如何使用。如下图：

2.4.2 实战

实战包括示例、新旧模型对比和模型推理三部分。

1. peft模型

在使用PEFT中方法（如LoRA）训练模型之前，需要通过get_peft_model()将基础模型和PEFT配置包装在一起。

比如下例的bigscience/mt0-large，只需训练完全模型0.19%的参数即可：

python 复制代码

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model_peft = get_peft_model(model, peft_config)
model.print_trainable_parameters()
"trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282"

2. 新旧模型结构对比

现在打印出新旧模型作对比，看PEFT到底做了什么。也可参考Spaces空间中使用PEFT的示例，如下：

python 复制代码

# 查看peft配置
model.peft_config

PrefixTuningConfig(peft_type=<PeftType.PREFIX_TUNING: 'PREFIX_TUNING'>, base_model_name_or_path='bigscience/mt0-large', 
task_type=<TaskType.SEQ_2_SEQ_LM: 'SEQ_2_SEQ_LM'>, inference_mode=False, num_virtual_tokens=30, token_dim=1024, num_transformer_submodules=1, 
num_attention_heads=16, num_layers=24, encoder_hidden_size=1024, prefix_projection=False)

# 打印旧模型
print(model)
MT5ForConditionalGeneration(
  (shared): Embedding(250112, 1024)
  (encoder): MT5Stack(
    (embed_tokens): Embedding(250112, 1024)
    (block): ModuleList(
      (0): MT5Block(
        (layer): ModuleList(
          (0): MT5LayerSelfAttention(
            (SelfAttention): MT5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): MT5LayerFF(
            (DenseReluDense): MT5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
              (wo): Linear(in_features=2816, out_features=1024, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1-23): 23 x MT5Block(...)
    (final_layer_norm): MT5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): MT5Stack(
    (embed_tokens): Embedding(250112, 1024)
    (block): ModuleList(
      (0): MT5Block(
        (layer): ModuleList(
          (0): MT5LayerSelfAttention(
            (SelfAttention): MT5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): MT5LayerCrossAttention(
            (EncDecAttention): MT5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): MT5LayerFF(
            (DenseReluDense): MT5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
              (wo): Linear(in_features=2816, out_features=1024, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1-23): 23 x MT5Block(
        (layer): ModuleList(...)
    (final_layer_norm): MT5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (lm_head): Linear(in_features=1024, out_features=250112, bias=False)
)

# 打印peft模型
print(model_peft)
PeftModelForSeq2SeqLM(
  (base_model): IA3Model(
    (model): MT5ForConditionalGeneration(
      (shared): Embedding(250112, 1024)
      (encoder): MT5Stack(
        (embed_tokens): Embedding(250112, 1024)
        (block): ModuleList(
          (0): MT5Block(
            (layer): ModuleList(
              (0): MT5LayerSelfAttention(
                (SelfAttention): MT5Attention(
                  (q): Linear(in_features=1024, out_features=1024, bias=False)
                  (k): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (v): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (o): Linear(in_features=1024, out_features=1024, bias=False)
                  (relative_attention_bias): Embedding(32, 16)
                )
                (layer_norm): MT5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (1): MT5LayerFF(
                (DenseReluDense): MT5DenseGatedActDense(
                  (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
                  (wi_1): Linear(
                    in_features=1024, out_features=2816, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 2816x1])
                  )
                  (wo): Linear(in_features=2816, out_features=1024, bias=False)
                  (dropout): Dropout(p=0.1, inplace=False)
                  (act): NewGELUActivation()
                )
                (layer_norm): MT5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (1-23): 23 x MT5Block(...)
        (final_layer_norm): MT5LayerNorm()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (decoder): MT5Stack(
        (embed_tokens): Embedding(250112, 1024)
        (block): ModuleList(
          (0): MT5Block(
            (layer): ModuleList(
              (0): MT5LayerSelfAttention(
                (SelfAttention): MT5Attention(
                  (q): Linear(in_features=1024, out_features=1024, bias=False)
                  (k): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (v): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (o): Linear(in_features=1024, out_features=1024, bias=False)
                  (relative_attention_bias): Embedding(32, 16)
                )
                (layer_norm): MT5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (1): MT5LayerCrossAttention(
                (EncDecAttention): MT5Attention(
                  (q): Linear(in_features=1024, out_features=1024, bias=False)
                  (k): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (v): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (o): Linear(in_features=1024, out_features=1024, bias=False)
                )
                (layer_norm): MT5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (2): MT5LayerFF(
                (DenseReluDense): MT5DenseGatedActDense(
                  (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
                  (wi_1): Linear(
                    in_features=1024, out_features=2816, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 2816x1])
                  )
                  (wo): Linear(in_features=2816, out_features=1024, bias=False)
                  (dropout): Dropout(p=0.1, inplace=False)
                  (act): NewGELUActivation()
                )
                (layer_norm): MT5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (1-23): 23 x MT5Block(...)
        (final_layer_norm): MT5LayerNorm()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (lm_head): Linear(in_features=1024, out_features=250112, bias=False)
    )
  )
)

对比以上输出，新旧模型主要的区别在于下面三点，可以看出peft主要通过减少输出特征参数量来缩小训练规模：

python 复制代码

model:
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
model_peft:
                  (k): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (v): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 1024x1])
                  )
                  (wi_1): Linear(
                    in_features=1024, out_features=2816, bias=False
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 2816x1])
                  )

3. PEFT模型推理

下面使用PEFT中模型AutoPeftModelForCausalLM进行推理：

python 复制代码

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

model.eval()
inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")

outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

"Preheat the oven to 350 degrees and place the cookie dough in the center of the oven. In a large bowl, combine the flour, baking powder, baking soda, salt, and cinnamon. In a separate bowl, combine the egg yolks, sugar, and vanilla."

2.5 accelerate

2.5.1 简介

accelerate可以在任何类型的设备上运行raw格式的PyTorch训练脚本。accelerate就是为PyTorch用户创建的，这些用户喜欢编写PyTorch模型的训练循环，但不愿意编写和维护使用多GPU/TPU/fp16所需的样板代码。accelerate替用户做了这部分工作，它可以准确地抽出与多GPU/TPU/fp16相关的样板代码，并保持代码的其余部分不变。

2.5.2 实战

accelerate除了支持pytorch，还支持多种其他应用方式，下面逐一详细介绍。

1. 与pytorch融合

下面来看一个用accelerate如何改造使用pytorch框架的训练脚本：

python 复制代码

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

在上例中，通过向任何标准PyTorch训练脚本添加5行，就可以在任何类型的单个或分布式节点（单CPU、单GPU、多GPU和TPU）、以及混合精度的情况下运行（fp8、fp16、bf16）。特别是，在本地机器上调试和训练环境中运行相同的代码而无需进行任何修改。

2. 命令行accelerate config&launch

accelerate提供了一个可选的CLI工具，可在启动脚本之前快速配置和测试培训环境。无需记住如何使用torch.distributed.run或编写TPU训练的特定启动器，只需运行命令：

python 复制代码

(pytorch) shoe@shoe-tp14:~$ accelerate config
In which compute environment are you running?
This machine                                                                                                                          
Which type of machine are you using?                                                                                                  
multi-GPU                                                                                                                             
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1                                            
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: yes    
Do you wish to optimize your script with torch dynamo?[yes/NO]:yes                                                                    
Which dynamo backend would you like to use?                                                                                           
tensorrt                                                                                                                              
Do you want to customize the defaults sent to torch.compile? [yes/NO]: yes                                                            
Which mode do you want to use?                                                                                                        
default                                                                                                                               
Do you want the fullgraph mode or it is ok to break model into several subgraphs? [yes/NO]: yes                                       
Do you want to enable dynamic shape tracing? [yes/NO]: yes                                                                            
Do you want to use DeepSpeed? [yes/NO]: NO                                                                                            
Do you want to use FullyShardedDataParallel? [yes/NO]: NO                                                                             
Do you want to use Megatron-LM ? [yes/NO]: yes                                                                                        
What is the Tensor Parallelism degree/size? [1]:1                                                                                     
What is the Pipeline Parallelism degree/size? [1]:1                                                                                   
Do you want to enable selective activation recomputation? [YES/no]: 1                                                                 
Please enter yes or no.
Do you want to enable selective activation recomputation? [YES/no]: yes
Do you want to use distributed optimizer which shards optimizer state and gradients across data parallel ranks? [YES/no]: yes
What is the gradient clipping value based on global L2 Norm (0 to disable)? [1.0]: 1
How many GPU(s) should be used for distributed training? [1]:1

Do you wish to use FP16 or BF16 (mixed precision)?
bf16                                                                                                                                  
accelerate configuration saved at /home/shoe/.cache/huggingface/accelerate/default_config.yaml

以下是您提供的配置的简要概述以及每个选项的含义：

计算环境：您正在使用本地机器，这可能意味着您将在单台物理服务器或工作站上使用多个GPU。
机器类型：您正在使用多GPU机器。
多机器训练：您只计划使用一台机器进行训练，这意味着您将在单节点上进行训练。
分布式操作检查：您希望在运行时检查分布式操作是否有错误，这样可以避免超时问题，但可能会使训练变慢。
使用torch dynamo优化：您希望使用torch dynamo来优化您的PyTorch代码，这可以提高性能。
dynamo后端：您选择使用tensorrt作为后端，这通常用于生产环境，可以提供优化的代码。
DeepSpeed：您不打算使用DeepSpeed，这是一个用于深度学习训练的优化库。
FullyShardedDataParallel：您不打算使用FullyShardedDataParallel，这是一个用于数据并行的PyTorch分布式训练的库。
Megatron-LM：您打算使用Megatron-LM，这是一个用于大规模语言模型训练的PyTorch扩展。
Tensor并行度：您设置为1，这意味着您不会使用Tensor并行。
流水线并行度：您设置为1，这意味着您不会使用流水线并行。
选择性激活重计算：您启用了选择性激活重计算，这可以提高效率。
分布式优化器：您启用了分布式优化器，这意味着优化器状态和梯度将在数据并行等级上分片。
梯度裁剪：您设置了一个基于全局L2范数的梯度裁剪值。
用于分布式训练的GPU数量：您指定了使用3个GPU进行分布式训练。
FP16或BF16（混合精度）：您选择了BF16，这是英伟达的混合精度之一，可以提高训练性能。

逐一根据您的具体需求和配置回答所问，就会生成配置文件default_config.yaml，当使用命令accelerate launch时，它将被自动设置到配置中，示例如下：

python 复制代码

!accelerate launch  --multi_gpu --num_processes 2 train_unconditional.py \
  --dataset_name="huggan/smithsonian_butterflies_subset" \
  --resolution=64 \
  --output_dir={model_name} \
  --train_batch_size=32 \
  --num_epochs=50 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision="no"

可以在accelerate config中配置多CPU，也可以直接使用命令mpirun启用多CPU，命令示例如下：

python 复制代码

# np参数设置CPU数量
mpirun -np 2 python examples/nlp_example.py

3. DeepSpeed

accelerate支持在单/多GPU上使用DeepSpeed。你即可以通过上面accelerate config中配置DeepSpeed，也可以将DeepSpeed相关参数通过DeepSpeedPlugin集成到python脚本中。其中DeepSpeedPlugin 是一个用于 PyTorch 模型的插件，可以帮助用户更快地训练深度学习模型。具体来说，DeepSpeedPlugin 可以实现以下功能：

自动混合精度训练：DeepSpeedPlugin 可以自动将模型参数和梯度转换为半精度浮点数，从而减少 GPU 存储器的使用量，加快训练速度。
分布式训练：DeepSpeedPlugin 可以自动启动多个进程，将训练任务分配到多个 GPU 或多台机器上进行分布式训练。
梯度累积：DeepSpeedPlugin 可以实现梯度累积，从而减少 GPU 存储器的使用量，允许使用更大的批量大小进行训练。

DeepSpeedPlugin 中的参数解释：

zero_stage：指定使用哪个阶段的 ZeRO 内存优化技术。ZeRO 是一种内存优化技术，可以将模型参数和梯度分成多个分片，从而减少 GPU 存储器的使用量。zero_stage 的取值范围为 0、1、2，分别表示不使用 ZeRO、使用 ZeRO 1 和使用 ZeRO 2。
gradient_accumulation_steps：指定梯度累积的步数，取值为正整数。梯度累积是一种训练技巧，可以将多个小批量数据的梯度累积起来，从而实现大批量数据的训练。

示例如下：

python 复制代码

from accelerate import Accelerator, DeepSpeedPlugin

# deepspeed needs to know your gradient accumulation steps beforehand, so don't forget to pass it
# Remember you still need to do gradient accumulation by yourself, just like you would have done without deepspeed
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(mixed_precision='fp16', deepspeed_plugin=deepspeed_plugin)

# How to save your 🤗 Transformer?
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(save_dir, save_function=accelerator.save, state_dict=accelerator.get_state_dict(model))

注意：目前DeepSpeed仍处于试验阶段，不支持手动编写的Config，将来可能支持，同时底层API可能也会有变化。

4. notebook_launcher

在Accelerate 0.3.0版本中引入了notebook_launcher()，它支持在jupyter notebook中启动分布式培训。这对于使用TPU后端训练的Colab或Kaggle型notebook尤其有用。只需在training_function中定义训练循环，然后在最后一个单元格中添加下列代码即可：

python 复制代码

from accelerate import notebook_launcher

notebook_launcher(training_function)

2.6 optimum

2.6.1 简介

optimum是Transformers和Diffusers的扩展，它提供了一组性能优化工具，保持易用性的同时，可以在特定的目标硬件上以最高效率训练和运行模型。optimum支持多种类型的加速器，比如ONNX Runtime、Intel Neural Compressor、OpenVINO、NVIDIA TensorRT-LLM、AMD Instinct GPUs and Ryzen AI NPU、AWS Trainum & Inferentia、Habana Gaudi Processor (HPU)、FuriosaAI等，具体可参照optimum官方文档。

2.6.2 实战

在使用optimum特定加速器的功能之前，需要安装对应依赖包，如下表所示：

Accelerator	Installation
ONNX Runtime	pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]
Intel Neural Compressor	pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]
OpenVINO	pip install --upgrade --upgrade-strategy eager optimum[openvino]
NVIDIA TensorRT-LLM	docker run -it --gpus all --ipc host huggingface/optimum-nvidia
AMD Instinct GPUs and Ryzen AI NPU	pip install --upgrade --upgrade-strategy eager optimum[amd]
AWS Trainum & Inferentia	pip install --upgrade --upgrade-strategy eager optimum[neuronx]
Habana Gaudi Processor (HPU)	pip install --upgrade --upgrade-strategy eager optimum[habana]
FuriosaAI	pip install --upgrade --upgrade-strategy eager optimum[furiosa]

optimum提供不同工具在不同框架中导出和运行优化后模型，导出和优化操作即可通过编程的方式，也可由命令行完成。可操作框架包括：

ONNX / ONNX Runtime
TensorFlow Lite
OpenVINO
Habana first-gen Gaudi / Gaudi2
AWS Inferentia 2 / Inferentia 1
NVIDIA TensorRT-LLM

其中ONNX Runtime、Neural Compressor、OpenVINO、TensorFlow Lite的特点总结如下：

Features	ONNX Runtime	Neural Compressor	OpenVINO	TensorFlow Lite
Graph optimization	✔️	N/A	✔️	N/A
Post-training dynamic quantization	✔️	✔️	N/A	✔️
Post-training static quantization	✔️	✔️	✔️	✔️
Quantization Aware Training (QAT)	N/A	✔️	✔️	N/A
FP16 (half precision)	✔️	N/A	✔️	✔️
Pruning	N/A	✔️	✔️	N/A
Knowledge Distillation	N/A	✔️	✔️	N/A

下面我们选择ONNX/ONNX Runtime作为实战对象。

1. ONNX&ONNX Runtime

ONNX(Open Neural Network Exchange)是一个由微软提出的一种开放的神经网络交换格式，用于在不同的深度学习框架之间进行模型的转换和交流。它的设计目标是提供一个中间表示格式，使得用户可以在不同的深度学习训练和推理框架之间无缝地转换模型。

使用ONNX，你可以使用不同的深度学习框架（如PyTorch、TensorFlow、Paddle等）进行模型的训练和定义，然后将模型导出为ONNX格式。导出后的ONNX模型包含了模型的结构和权重参数等信息，但不包含模型的硬件信息。

一旦模型导出为ONNX格式，你可以使用ONNX Runtime或其他支持ONNX的框架将模型转换为目标设备所支持的模型格式，如TensorRT、Core ML、OpenVINO等。这样，你可以在不同的设备上部署和运行模型，无需重新训练或重新实现模型。同时它还支持修改模型的Graph、Node和Tensor。

ONNX的优势在于它提供了跨框架和跨平台的灵活性，使得深度学习模型在不同的环境中更易于部署和集成。它还支持许多常见的深度学习操作和网络架构，使得模型转换过程更加方便和高效。

ONNX Runtime：ONNX Runtime是一个跨平台的高性能推理引擎，支持将ONNX模型转换为多种设备和框架所支持的格式，如TensorFlow、TensorRT、Core ML等。它提供了一套API和工具，使得在不同平台上部署和运行ONNX模型更加便捷。

2. 加速推理

transformers和diffusers格式的模型可导出为ONNX格式，并易于执行图优化和量化操作。先执行格式转换：

python 复制代码

optimum-cli export onnx -m deepset/roberta-base-squad2 --optimize O2 roberta_base_qa_onnx

然后使用onnxruntime进行量化：

python 复制代码

optimum-cli onnxruntime quantize \
  --avx512 \
  --onnx_model roberta_base_qa_onnx \
  -o quantized_roberta_base_qa_onnx

这些命令将deepset/roberta-base-squad2导出为roberta_base_qa_onnx，并对导出的模型进行O2图优化，最后使用avx512配置进行量化。

完成量化后，就可以在后端使用Python类ONNX Runtime以无缝衔接地运行ONNX模型，示例代码如下：

python 复制代码

- from transformers import AutoModelForQuestionAnswering
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
  from transformers import AutoTokenizer, pipeline

  model_id = "deepset/roberta-base-squad2"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForQuestionAnswering.from_pretrained(model_id)
+ model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
  qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
  question = "What's Optimum?"
  context = "Optimum is an awesome library everyone should use!"
  results = qa_pipe(question=question, context=context)

更多使用ORTModelForXXX类运行ONNX模型的细节可参考 Optimum Inference with ONNX Runtime

3. 加速训练

optimum为原生Transformers Trainer提供类型包装器，以便在强大硬件上进行训练。支持如下加速训练方式：

Habana's Gaudi processors
AWS Trainium instances
ONNX Runtime (optimized for GPUs)
以ONNX Runtime作为加速训练包装器为例：

python 复制代码

- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
      output_dir="path/to/save/folder/",
      optim="adamw_ort_fused",
      ...
  )

  # Create a ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use ONNX Runtime for training!
  trainer.train()

除了上面列举的开源库之外，其他常用的开源库还有timm、text-generation-inference、candle、trl、tokenizers、evaluate等。其中timm(PyTorch Image Models)是一个深度学习库，其中包含图像模型、优化器、调度器以及训练/验证脚本等内容。text-generation-inference(TGI)文本生成推理是一个用于部署和服务大型语言模型（LLM）的工具包，它为当前流行的开源LLM实现高性能文本生成，包括Llama、Falcon、StarCoder、BLOOM、GPT-NeoX等。candle是Rust的一个极简ML框架，专注于性能（支持GPU）和易用性。trl库是一个全栈工具，使用包括监督微调步骤（SFT）、奖励建模（RM）和近端策略优化（PPO）以及直接偏好优化（DPO）等方法对transformer结构和diffusion模型进行微调和对齐。Tokenizers是适用于研究和生产环境的高性能分词器，使用Rust语言实现，在20秒内就能使用CPU完成对1GB文本数据的分词。Evaluate则使用数十种流行的指标对数据集和模型进行评估。详细介绍可在GitHub Hugging Face组织页面查看。

参考文献

llama factory学习笔记
 DeepSpeedPlugin的作用
 ONNX 模型格式分析与使用

State of the arts ↩︎