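[ExecuTorch Series] 3. Exporting Custom Large Language Models

Introduction

If you have your own PyTorch model and it is a large language model (LLM), this article shows how to export it manually and lower it to ExecuTorch, applying many of the same optimizations covered in the earlier export_llm guide.

It walks through a practical example of bringing a custom LLM to ExecuTorch. The main goal is to serve as a guide for integrating ExecuTorch with your own LLM.

The same approach applies to other language models, because ExecuTorch is model-agnostic. Source: PyTorch - Exporting custom LLMs

Environment Setup

First, download the ExecuTorch repository and install its dependencies. ExecuTorch recommends Python 3.10 and using Conda to manage the environment.

See also another article in this series: [ExecuTorch Series] 1. Building ExecuTorch from Source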


# Create a directory for this example.
mkdir et-nanogpt
cd et-nanogpt

# Clone the ExecuTorch repository and submodules.
mkdir third-party && cd third-party
git clone -b release/0.7 https://github.com/pytorch/executorch.git
cd executorch
git submodule update --init

# Create a conda environment and install requirements.
conda create -yn executorch python=3.10.0
conda activate executorch
./install_requirements.sh

cd ../..

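Note that the Python version I actually installed is 3.12.12.

Running an LLM Locally

The documentation example uses Karpathy's nanoGPT. The tutorial applies equally to other large language models, because ExecuTorch is model-agnostic.

Running a model with ExecuTorch takes two steps:

Export: preprocess the model into the .pte format, which the ExecuTorch Runtime can execute.
Run: load the model file and execute it with the ExecuTorch Runtime.

Exporting to ExecuTorch (Basic)

First, download the nanoGPT model and the corresponding tokenizer vocabulary: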


# Option 1: curl. The URL is quoted so the shell does not interpret the "?".
curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O
curl -L "https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json?download=true" -o vocab.json

# Option 2: wget (equivalent to the curl commands above).
wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py
wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json

Then create a file named export_nanogpt.py with the following contents:


import torch

from executorch.exir import EdgeCompileConfig, to_edge
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export, export_for_training

from model import GPT

# Load the model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), )

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size].
# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
edge_manager = to_edge(traced_model, compile_config=edge_config)
et_program = edge_manager.to_executorch()

# Save the ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

Then run the file with python export_nanogpt.py. The exported model, nanogpt.pte, is written to the current directory.
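Before moving on, it can help to confirm that the export produced a plausible file. The sketch below is my own addition and not part of the ExecuTorch tutorial. `describe_pte` is a hypothetical helper, and it assumes that a .pte file is a FlatBuffers file whose 4-byte identifier sits at byte offset 4 (as far as I know the identifier is "ET12", but treat that as an assumption).

```python
import os


def describe_pte(path):
    """Return (size_in_mib, identifier) for an exported .pte file.

    Hypothetical helper for illustration only. It assumes the FlatBuffers
    layout, where bytes 4..8 hold the file identifier.
    """
    size_mib = os.path.getsize(path) / (1024 * 1024)
    with open(path, "rb") as f:
        header = f.read(8)
    identifier = header[4:8].decode("ascii", errors="replace")
    return size_mib, identifier


if __name__ == "__main__":
    path = "nanogpt.pte"
    if os.path.exists(path):
        size_mib, identifier = describe_pte(path)
        print(f"{path}: {size_mib:.1f} MiB, identifier={identifier!r}")
    else:
        print(f"{path} not found; run export_nanogpt.py first")
```

A file of only a few kilobytes, or an unexpected identifier, would suggest that the export step did not complete as intended.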

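Backend Delegation

ExecuTorch provides specialized backends for a number of targets. These include, but are not limited to, x86 and ARM CPU acceleration via the XNNPACK backend, Apple acceleration via the Core ML backend and the Metal Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend.

To delegate a model to a specific backend during export, ExecuTorch uses the to_edge_transform_and_lower() function. It takes the exported program from torch.export together with a backend-specific partitioner object. The partitioner identifies the parts of the computation graph that the target backend can optimize. Inside to_edge_transform_and_lower(), the exported program is first converted to an Edge dialect program. The partitioner then delegates the compatible parts of the graph to the backend for acceleration and optimization. Any part of the graph that is not delegated is executed by ExecuTorch's default operator implementations.

In other words, to delegate the exported model to a specific backend, we first import that backend's partitioner and edge compile config from the ExecuTorch codebase, and then call to_edge_transform_and_lower.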
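To make the partitioning idea concrete, here is a toy sketch in plain Python. It is not ExecuTorch's implementation: real partitioners operate on a graph rather than a flat list, and the function `partition`, the op names, and the supported set are all hypothetical names I chose for illustration. It only shows the principle that supported ops are grouped for the backend while everything else falls back to the default operators.

```python
def partition(ops, supported):
    """Group consecutive ops into (target, [ops]) segments.

    Toy illustration only: real partitioners work on a graph, not a flat list.
    """
    segments = []
    for op in ops:
        target = "delegate" if op in supported else "portable"
        if segments and segments[-1][0] == target:
            # Same target as the previous op: extend the current segment.
            segments[-1][1].append(op)
        else:
            # Target changed: start a new segment.
            segments.append((target, [op]))
    return segments


if __name__ == "__main__":
    # Hypothetical op names and support set, chosen for illustration.
    ops = ["embedding", "linear", "gelu", "linear", "softmax", "custom_op", "linear"]
    supported = {"linear", "gelu", "softmax"}
    for target, group in partition(ops, supported):
        print(f"{target:8s} <- {group}")
```

The fewer "portable" segments interrupt the delegated ones, the more of the model the backend can accelerate as a single unit.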

The following example shows how to delegate nanoGPT to XNNPACK:


# export_nanogpt.py

# Load partitioner for Xnnpack backend
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Model to be delegated to specific backend should use specific edge compile config
from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower

import torch
from torch.export import export
from torch.nn.attention import sdpa_kernel, SDPBackend

from model import GPT

# Load the nanoGPT model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
    torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
)

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size - 1].
# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
# To be lowered to the XNNPACK backend, `traced_model` needs the
# XNNPACK-specific edge compile config.
edge_config = get_xnnpack_edge_compile_config()
# Convert to an Edge program and delegate the supported parts of the graph to
# the XNNPACK backend by calling `to_edge_transform_and_lower` with the
# XNNPACK partitioner.
edge_manager = to_edge_transform_and_lower(
    traced_model,
    partitioner=[XnnpackPartitioner()],
    compile_config=edge_config,
)
et_program = edge_manager.to_executorch()

# Save the Xnnpack-delegated ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

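Quantization

For details, see: Quantization

For more, see: Quantization in ExecuTorch

Invoking the Runtime

ExecuTorch provides a set of Runtime APIs and types for loading and running models.

Create a file named main.cpp with the following contents: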


#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

#include "basic_sampler.h"
#include "basic_tokenizer.h"

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>
#include <executorch/runtime/core/evalue.h>
#include <executorch/runtime/core/exec_aten/exec_aten.h>
#include <executorch/runtime/core/result.h>

using executorch::aten::ScalarType;
using executorch::aten::Tensor;
using executorch::extension::from_blob;
using executorch::extension::Module;
using executorch::runtime::EValue;
using executorch::runtime::Result;

// The value of the gpt2 `<|endoftext|>` token.
#define ENDOFTEXT_TOKEN 50256

std::string generate(
    Module& llm_model,
    std::string& prompt,
    BasicTokenizer& tokenizer,
    BasicSampler& sampler,
    size_t max_input_length,
    size_t max_output_length) {
    // Convert the input text into a list of integers (tokens) that represents it,
    // using the string-to-token mapping that the model was trained on. Each token
    // is an integer that represents a word or part of a word.
    std::vector<int64_t> input_tokens = tokenizer.encode(prompt);
    std::vector<int64_t> output_tokens;

    for (auto i = 0u; i < max_output_length; i++) {
        // Wrap the input_tokens buffer in a tensor of shape {1, num_tokens}.
        // It is converted to an EValue, the unified data type in the
        // ExecuTorch runtime, when passed to forward().
        auto inputs = from_blob(
            input_tokens.data(),
            {1, static_cast<int>(input_tokens.size())},
            ScalarType::Long);

        // Run the model. It will return a tensor of logits (unnormalized
        // scores over the vocabulary).
        auto logits_evalue = llm_model.forward(inputs);

        // Convert the output logits from EValue to std::vector, which is what the
        // sampler expects.
        Tensor logits_tensor = logits_evalue.get()[0].toTensor();
        std::vector<float> logits(
            logits_tensor.data_ptr<float>(),
            logits_tensor.data_ptr<float>() + logits_tensor.numel());

        // Sample the next token from the logits.
        int64_t next_token = sampler.sample(logits);

        // Break if we reached the end of the text.
        if (next_token == ENDOFTEXT_TOKEN) {
            break;
        }

        // Add the next token to the output.
        output_tokens.push_back(next_token);

        std::cout << tokenizer.decode({next_token});
        std::cout.flush();

        // Update next input.
        input_tokens.push_back(next_token);
        if (input_tokens.size() > max_input_length) {
            input_tokens.erase(input_tokens.begin());
        }
    }

    std::cout << std::endl;

    // Convert the output tokens into a human-readable string.
    std::string output_string = tokenizer.decode(output_tokens);
    return output_string;
}

int main() {
    // Set up the prompt. This provides the seed text for the model to elaborate.
    std::cout << "Enter model prompt: ";
    std::string prompt;
    std::getline(std::cin, prompt);

    // The tokenizer is used to convert between tokens (used by the model) and
    // human-readable strings.
    BasicTokenizer tokenizer("vocab.json");

    // The sampler is used to sample the next token from the logits.
    BasicSampler sampler = BasicSampler();
  
    // Load the exported nanoGPT program, which was generated via the previous
    // steps.
    Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);
  
    const auto max_input_tokens = 1024;
    const auto max_output_tokens = 30;
    std::cout << prompt;
    generate(
        model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
}
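The token loop in generate() is easier to follow without the runtime types around it, so here is a plain-Python sketch of the same control flow. Everything in it is illustrative: `toy_forward` is a hypothetical stand-in for the model, and `greedy_sample` assumes the sampler simply picks the largest logit, which is my assumption about what a basic sampler does rather than something this article states.

```python
END_OF_TEXT = 50256  # gpt2 <|endoftext|> id, as in main.cpp


def greedy_sample(logits):
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def generate(forward, input_tokens, max_input_length, max_output_length):
    """Mirror the loop in main.cpp with a plain Python callable as the model."""
    input_tokens = list(input_tokens)
    output_tokens = []
    for _ in range(max_output_length):
        logits = forward(input_tokens)
        next_token = greedy_sample(logits)
        if next_token == END_OF_TEXT:
            break
        output_tokens.append(next_token)
        # Slide the window: append the new token, drop the oldest if too long.
        input_tokens.append(next_token)
        if len(input_tokens) > max_input_length:
            input_tokens.pop(0)
    return output_tokens


def toy_forward(tokens):
    """Hypothetical stand-in model: favors (last token + 1), over 8 ids."""
    logits = [0.0] * 8
    logits[(tokens[-1] + 1) % 8] = 1.0
    return logits


if __name__ == "__main__":
    print(generate(toy_forward, [0, 1, 2], max_input_length=4, max_output_length=5))
```

Note that the whole token window is fed back into the model on every step, which is one reason this simple loop gets slower as the context grows.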

Download the following files into the same directory as main.cpp:


curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_sampler.h
curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_tokenizer.h

Building ExecuTorch

To run the LLM with the ExecuTorch Runtime, you also need to build the ExecuTorch Runtime. For details, see [ExecuTorch Series] 1. Building ExecuTorch from Source. ExecuTorch uses the CMake build system.


cd ~/et-nanogpt/third-party/executorch
rm -rf cmake-out && mkdir cmake-out && cd cmake-out

# Configure with the default settings. For an optimized build, use instead:
#   cmake -DCMAKE_BUILD_TYPE=Release ..
cmake ..

cd ..

cmake --build cmake-out -j$(nproc)

Building the Example Code

Create a file named CMakeLists.txt with the following contents:


cmake_minimum_required(VERSION 3.19)
project(nanogpt_runner)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# Set options for executorch build.
option(EXECUTORCH_ENABLE_LOGGING "" ON)
option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)

option(EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR "" ON)

# Include the executorch subdirectory.
add_subdirectory(
  ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch
  ${CMAKE_BINARY_DIR}/executorch
)

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
  nanogpt_runner
  PRIVATE executorch
          extension_module_static # Provides the Module class
          extension_tensor # Provides the TensorPtr class
          optimized_native_cpu_ops_lib # Provides baseline cross-platform
                                       # kernels
)

After completing the steps above, the full directory should contain the following files:


et-nanogpt/
├── CMakeLists.txt
├── main.cpp
├── basic_tokenizer.h
├── basic_sampler.h
├── export_nanogpt.py
├── model.py
├── vocab.json
├── nanogpt.pte
└── third-party
    └── executorch
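If the build later fails with a missing-file error, a quick check of the layout can save some time. The sketch below is a small convenience of my own, not part of the tutorial. `missing_entries` is a hypothetical helper, and the file names are copied from the directory listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED = [
    "CMakeLists.txt", "main.cpp", "basic_tokenizer.h", "basic_sampler.h",
    "export_nanogpt.py", "model.py", "vocab.json", "nanogpt.pte",
    "third-party/executorch",
]


def missing_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    print("all files present" if not missing else f"missing: {missing}")
```

Run it from inside the et-nanogpt directory; an empty result means the layout matches the listing.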

Finally, all that remains is to build the example project:


cd et-nanogpt
mkdir -p build && cd build

cmake ..

make -j$(nproc)

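Running

After a successful build, the executable nanogpt_runner is generated in the et-nanogpt/build directory. Copy the exported model (nanogpt.pte) and the vocabulary file (vocab.json) into the build directory, because main.cpp loads both by relative path.

Run the test program: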


./nanogpt_runner

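On my machine the results were not great: generation was very slow, and the answers were inaccurate.

According to the documentation, this is expected at this stage. The model will probably run very slowly because ExecuTorch has not been told to optimize for a specific hardware backend (delegation), and because it performs all computation in 32-bit floating point (no quantization).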
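To get a feel for what running in 32-bit floating point costs, the back-of-the-envelope sketch below compares weight storage at different precisions. It is my own illustration: `weights_size_mib` is a hypothetical helper, the figure of roughly 124 million parameters for the gpt2 checkpoint is an approximation, and real quantized files carry extra metadata such as scales, so treat the numbers as rough estimates only.

```python
def weights_size_mib(num_params, bits_per_weight):
    """Approximate weight storage in MiB for a given precision."""
    return num_params * bits_per_weight / 8 / (1024 * 1024)


if __name__ == "__main__":
    # Assumption: roughly 124 million parameters for the gpt2 checkpoint.
    num_params = 124_000_000
    for name, bits in [("float32", 32), ("int8", 8), ("int4", 4)]:
        print(f"{name:8s} ~{weights_size_mib(num_params, bits):.0f} MiB")
```

Smaller weights mean less memory traffic per generated token, which is a large part of why quantization tends to help on CPU.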
