(1) A First Look at TensorRT-LLM (version: 1.0.0)

1. Installation

See https://www.codewithgpu.com/i/NVIDIA/TensorRT-LLM/tensorrt_llm

2. Python API

from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "What is AI?",
        "Why the color of sky is blue?"
    ]
    # tensorrt_llm/sampling_params.py:SamplingParams
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # tensorrt_llm/llmapi/llm.py:LLM
    llm = LLM(model="/data0/models/Qwen2.5-0.5B-Instruct", tensor_parallel_size=2)
    # List[tensorrt_llm/llmapi/llm.py:RequestOutput]
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
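
To make the two sampling parameters used above more concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It is not TensorRT-LLM's implementation; the function name `top_p_filter` and the toy logits are assumptions made for illustration.

```python
import math

def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} for the tokens kept by nucleus sampling."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept tokens; a sampler would draw from this.
    return {i: probs[i] / mass for i in kept}

if __name__ == "__main__":
    # Toy logits for a four-token vocabulary (hypothetical values).
    print(top_p_filter([2.0, 1.0, 0.1, -3.0]))
```

With these toy logits the least likely token falls outside the 0.95 probability mass and is never sampled, which is the effect top_p is meant to have.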

3. trtllm-*

3.1 `trtllm-build`

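trtllm-build is a core tool in the TensorRT-LLM project. It compiles a large language model (LLM) into an optimized TensorRT engine that delivers high-performance inference on NVIDIA GPUs. Its input is a TensorRT-LLM checkpoint, which is first converted from the weights of a training framework (here, a Hugging Face model).

Put simply, trtllm-build is a "model compiler" that optimizes LLMs specifically for GPU inference.

The process has two steps:

First, use examples/models/core/<model-name>/convert_checkpoint.py to convert the Hugging Face model into the TensorRT-LLM checkpoint format. This step mainly determines how the weights are split and computed (tensor and pipeline parallelism) and which quantization scheme is actually used.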

python examples/models/core/qwen/convert_checkpoint.py \
    --model_dir /data0/models/Qwen2.5-0.5B-Instruct \
    --output_dir ./Qwen2.5-0.5B-Instruct-tp_2_pp_2 \
    --tp_size 2 \
    --pp_size 2 \
    --use_weight_only \
    --weight_only_precision int8
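
As a rough illustration of what "splitting the weights" means for tensor parallelism, the sketch below splits a weight matrix column-wise across tp_size ranks and checks that concatenating the per-rank results reproduces the full matrix product. This is an assumption-laden toy (a column-parallel linear layer in pure Python), not the conversion script's actual logic; the real split depends on the layer type.

```python
def matmul(x, w):
    """Multiply a row vector x (length K) by a K x N matrix w (list of rows)."""
    return [sum(x[k] * w[k][n] for k in range(len(x))) for n in range(len(w[0]))]

def split_columns(w, tp_size):
    """Split a K x N matrix into tp_size shards along the column axis."""
    n = len(w[0])
    if n % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    step = n // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]

def column_parallel_matmul(x, w, tp_size):
    """Each shard yields a slice of the output; concatenation restores it."""
    out = []
    for shard in split_columns(w, tp_size):  # one shard per TP rank
        out.extend(matmul(x, shard))
    return out

if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(matmul(x, w), column_parallel_matmul(x, w, tp_size=2))
```

Each rank only needs to hold its own shard, which is why the converted checkpoint is written as separate per-rank pieces.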

Next, build the checkpoint into the engine files actually used for inference. The command below shows several common build options:

trtllm-build \
    --checkpoint_dir ./Qwen2.5-0.5B-Instruct-tp_2_pp_2 \
    --output_dir ./Qwen2.5-0.5B-Instruct-tp_2_pp_2-engines \
    --max_batch_size 128 \
    --max_input_len 8192 \
    --max_seq_len 10240 \
    --max_beam_width 2 \
    --max_num_tokens 65536 \
    --kv_cache_type paged
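
The length limits interact: max_seq_len bounds prompt plus generated tokens, so the room left for generation depends on the prompt length. The small helper below works through that arithmetic for the values in the command above. The function names are hypothetical, and the second helper is only a rough upper bound that assumes every prompt is full length and is processed in a single iteration.

```python
def max_new_tokens(max_input_len, max_seq_len, prompt_len=None):
    """Tokens that can still be generated for a prompt of prompt_len tokens."""
    if max_seq_len < max_input_len:
        raise ValueError("max_seq_len must be >= max_input_len")
    if prompt_len is None:
        prompt_len = max_input_len  # worst case: the longest allowed prompt
    if prompt_len > max_input_len:
        raise ValueError("prompt exceeds max_input_len")
    return max_seq_len - prompt_len

def full_prompts_per_iteration(max_num_tokens, max_input_len):
    """Rough bound on full-length prompts that fit in one batch of tokens."""
    return max_num_tokens // max_input_len

if __name__ == "__main__":
    print(max_new_tokens(8192, 10240))              # longest prompt
    print(max_new_tokens(8192, 10240, prompt_len=1000))
    print(full_prompts_per_iteration(65536, 8192))
```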

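After the build finishes, the number of engine files in the output directory equals tp_size * pp_size (one per rank). You can use run.py for a quick inference check:

Multi-GPU / distributed inference: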

mpirun -n 4 python examples/run.py \
    --input_text "What is AI?" \
    --engine_dir ./Qwen2.5-0.5B-Instruct-tp_2_pp_2-engines/ \
    --max_output_len 30
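
The process count passed to mpirun matches the world size, tp_size * pp_size = 4. The sketch below enumerates the ranks for a given configuration. Both the rank layout (tensor-parallel ranks adjacent within a pipeline stage) and the `rank{N}.engine` file-name pattern are assumptions made for illustration; check the actual engine directory for the real names.

```python
def rank_layout(tp_size, pp_size):
    """Map each global rank to (pp_rank, tp_rank), with TP ranks adjacent."""
    world_size = tp_size * pp_size
    return {rank: (rank // tp_size, rank % tp_size) for rank in range(world_size)}

def expected_engine_files(tp_size, pp_size):
    """One engine file per rank (file-name pattern is an assumption)."""
    return [f"rank{r}.engine" for r in range(tp_size * pp_size)]

if __name__ == "__main__":
    for rank, (pp_rank, tp_rank) in rank_layout(2, 2).items():
        print(f"rank {rank}: pipeline stage {pp_rank}, tensor shard {tp_rank}")
    print(expected_engine_files(2, 2))
```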

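INT8 weight-only quantization command (in the two-step flow above, these same flags were passed to convert_checkpoint.py; run trtllm-build --help to confirm that your version accepts them here):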

trtllm-build --checkpoint_dir ./checkpoints \
    --output_dir ./engines_int8 \
    --use_weight_only \
    --weight_only_precision int8
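
For intuition about what INT8 weight-only quantization stores, the following is a minimal sketch of symmetric per-row quantization: each row keeps one floating-point scale plus 8-bit integers, and the weights are dequantized before use. This is a generic textbook scheme written in pure Python, assumed here for illustration; the exact scheme and granularity TensorRT-LLM uses are not described in this article.

```python
def quantize_int8(weights):
    """Symmetric per-row INT8 quantization: returns (integer rows, scales)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from integers and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.03]]
    q, s = quantize_int8(w)
    print(q)
    print(dequantize(q, s))
```

The reconstruction error per weight is at most half a quantization step (scale / 2), which is the accuracy cost paid for halving the weight memory relative to FP16.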

3.2 trtllm-eval

Used for accuracy evaluation. To evaluate the engines built earlier on the MMLU dataset:

trtllm-eval \
    --model ./Qwen2.5-0.5B-Instruct-tp_2_pp_2-engines \
    --tokenizer ../../models/Qwen2.5-0.5B-Instruct/ \
    --backend tensorrt \
    --tp_size 2 \
    --pp_size 2 \
    mmlu \
    --dataset_path ./data/
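
MMLU is a multiple-choice benchmark, so the reported score is essentially the fraction of questions answered with the correct option letter. The sketch below shows only that metric; how trtllm-eval prompts the model and extracts the predicted letter is not covered here, and the function name is hypothetical.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of predictions whose option letter matches the reference."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

if __name__ == "__main__":
    # Toy predictions and reference answers (hypothetical data).
    print(multiple_choice_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
```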

3.3 trtllm-serve

Used to launch an online service. The basic command:

trtllm-serve serve ./Qwen2.5-0.5B-Instruct-tp_2_pp_2-engines \
    --tokenizer ../../models/Qwen2.5-0.5B-Instruct/ \
    --host 0.0.0.0 \
    --port 18001 \
    --tp_size 2 \
    --pp_size 2

Once the service starts successfully, it binds to and listens on 0.0.0.0:18001. TensorRT-LLM registers the following request endpoints:

def register_routes(self):
    self.app.add_api_route("/health", self.health, methods=["GET"])
    self.app.add_api_route("/version", self.version, methods=["GET"])
    self.app.add_api_route("/v1/models", self.get_model, methods=["GET"])
    self.app.add_api_route("/metrics", self.get_iteration_stats, methods=["GET"])
    self.app.add_api_route("/kv_cache_events", self.get_kv_cache_events, methods=["POST"])
    self.app.add_api_route("/v1/completions", self.openai_completion, methods=["POST"])
    self.app.add_api_route("/v1/chat/completions", self.openai_chat, methods=["POST"])

For example:

curl http://localhost:18001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
	   "model": "test",
	   "messages": [
	     {
	       "role": "system",
	       "content": "You are a helpful assistant."
	     },
	     {
	       "role": "user",
	       "content": "Where is New York?"
	     }
	   ],
	   "max_tokens": 16
	 }'
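
The same request can be sent from Python using only the standard library. The sketch below builds the POST request for /v1/chat/completions and extracts the assistant's text, assuming the response follows the OpenAI chat-completion layout (`choices[0].message.content`), which the handler name `openai_chat` suggests. The helper names are hypothetical.

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, max_tokens=16):
    """Build (but do not send) a POST request for the chat endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def first_reply(response_body):
    """Extract the assistant text from an OpenAI-style response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]

def chat(base_url, model, messages, max_tokens=16, timeout=30):
    """Send the request and return the first reply (needs a running server)."""
    request = build_chat_request(base_url, model, messages, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return first_reply(response.read())

# Usage, with the server from the command above running:
# chat("http://localhost:18001", "test",
#      [{"role": "user", "content": "Where is New York?"}])
```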

Key workflow example:

# End-to-end example: deploying LLaMA-7B
# 1. Download the model
git-lfs clone https://huggingface.co/meta-llama/Llama-2-7b-hf

# 2. Convert the weights
python3 convert_checkpoint.py \
    --model_dir ./Llama-2-7b-hf \
    --output_dir ./llama-checkpoints \
    --dtype float16

# 3. Build the engine
# max_seq_len covers input plus output: 1024 + 512 = 1536
trtllm-build \
    --checkpoint_dir ./llama-checkpoints \
    --output_dir ./llama-engine \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 1536

# 4. Run inference
python3 run.py \
    --engine_dir ./llama-engine \
    --tokenizer_dir ./Llama-2-7b-hf \
    --input_text "Hello, how are you?" \
    --max_output_len 50

So what actually happens during the build?

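# What trtllm-build does (conceptual pseudocode; the helper names are
# illustrative and are not real TensorRT-LLM APIs)
def build_process():
    # 1. Load the model definition (Python code)
    model = load_python_model("models/llama/model.py")

    # 2. Extract the computation graph
    computation_graph = extract_computation_graph(model)

    # 3. Optimize the computation graph (fusion, reordering)
    optimized_graph = optimize_graph(computation_graph)

    # 4. Compile to GPU kernels
    cuda_kernels = compile_to_cubin(optimized_graph)

    # 5. Serialize to an engine file
    engine_data = serialize_to_plan(optimized_graph, cuda_kernels)

    # Note: only the results of steps 3-5 are saved; the Python code from step 1 is not
    return engine_data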
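
To give a feel for the "fusion" part of step 3, here is a toy pass that replaces known sequences of operations with a single fused operation. It works on a flat list of op names rather than a real graph, and the op names and patterns are made up for illustration; TensorRT's actual optimizer is far more involved.

```python
def fuse_ops(ops, patterns):
    """Greedily replace known op sequences with a single fused op name."""
    fused, i = [], 0
    while i < len(ops):
        for pattern, name in patterns:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append(name)
                i += len(pattern)
                break
        else:
            # No pattern starts at this position: keep the op unchanged.
            fused.append(ops[i])
            i += 1
    return fused

# Longer patterns come first so they win over their own prefixes.
PATTERNS = [
    (("matmul", "bias_add", "gelu"), "fused_matmul_bias_gelu"),
    (("matmul", "bias_add"), "fused_matmul_bias"),
]

if __name__ == "__main__":
    layer = ["layernorm", "matmul", "bias_add", "gelu", "matmul", "bias_add"]
    print(fuse_ops(layer, PATTERNS))
```

Fewer, larger operations mean fewer kernel launches and fewer round trips to GPU memory, which is the main motivation for fusing.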

Examples of the CUBIN kernels contained in the engine:

// Examples of the kind of kernels an engine file contains.
// The engine stores them as compiled CUDA binaries (CUBIN); the
// signatures below are source-level illustrations only.

// gemm_kernel.cubin
__global__ void gemm_kernel_fp16(
    half* A, half* B, half* C,
    int M, int N, int K) {
    // Matrix multiplication implementation
}

// attention_kernel.cubin
__global__ void fused_attention_kernel(
    half* Q, half* K, half* V, half* Output,
    int batch_size, int seq_len, int num_heads) {
    // Flash Attention implementation
}

// layernorm_kernel.cubin
__global__ void layernorm_kernel(
    half* input, half* output,
    half* gamma, half* beta,
    int hidden_size) {
    // LayerNorm implementation
}
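
As a reference for what such a kernel computes, the LayerNorm operation can be written in a few lines of plain Python. This is the standard formula applied to a single vector, written for readability rather than speed; the `eps` default is an assumed value, and a GPU kernel would process many rows in parallel in half precision.

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    """Normalize x to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0]
    print(layernorm(hidden, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0]))
```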

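So the engine file contains: the optimized computation graph, the compiled CUDA kernels, the weight data, and the memory plan.

A complete inference application consists of:

Complete application = engine file + runtime environment + pre/post-processing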

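Responsibilities of each part:

Engine file (.plan): executes matrix multiplication, attention, and other computation

Python runtime: loads the engine, manages requests, calls the tokenizer

C++ runtime: low-level memory management and stream scheduling

Other components: logging, monitoring, API service, etc.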
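
The division of labor above can be sketched as a three-stage pipeline: tokenize, run the engine, detokenize. Everything in this example is a stand-in: `ToyTokenizer` is a hypothetical whitespace tokenizer and `toy_engine` simply repeats the last token, so the code shows only how the pieces connect, not how TensorRT-LLM implements them.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a real one."""

    def __init__(self):
        self.vocab, self.words = {}, []

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab[word] = len(self.words)
                self.words.append(word)
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids):
        return " ".join(self.words[i] for i in ids)

def toy_engine(token_ids, max_new_tokens):
    """Hypothetical stand-in for engine execution: repeats the last token."""
    if not token_ids:
        raise ValueError("prompt must contain at least one token")
    return [token_ids[-1]] * max_new_tokens

def generate(prompt, tokenizer, engine, max_new_tokens=2):
    input_ids = tokenizer.encode(prompt)            # pre-processing
    output_ids = engine(input_ids, max_new_tokens)  # engine + runtime
    return tokenizer.decode(output_ids)             # post-processing

if __name__ == "__main__":
    print(generate("hello world", ToyTokenizer(), toy_engine))
```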

Finally, the files actually needed for deployment:

deployment/ (deployment directory)

deployment/
├── model.plan              # engine file (compute only)
├── trtllm_runtime.so       # runtime library (shared library)
├── tokenizer/              # tokenizer files
│   ├── tokenizer.model
│   └── tokenizer_config.json
├── config.json             # configuration file
└── server.py               # Python service code
    └── contains:
        engine loading
        tokenizer calls
        request handling
        result return
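
Before starting a service, it can help to verify that the deployment directory is complete. The sketch below checks for the top-level entries in the tree above; the entry names are taken from that illustrative layout, and the function name is hypothetical, so adapt both to your actual deployment.

```python
import tempfile
from pathlib import Path

# Top-level entries from the illustrative deployment tree.
REQUIRED = ["model.plan", "trtllm_runtime.so", "tokenizer", "config.json", "server.py"]

def missing_entries(root, required=REQUIRED):
    """Return the required entries that do not exist under root."""
    root = Path(root)
    return [name for name in required if not (root / name).exists()]

if __name__ == "__main__":
    # Demo on a temporary directory that contains only two of the entries.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "model.plan").write_bytes(b"")
        (Path(tmp) / "tokenizer").mkdir()
        print(missing_entries(tmp))
```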
