Qwen LLM Fine-Tuning in the Trenches: From CUDA Alignment to TensorRT Acceleration
When actually deploying models like Qwen2.5 on a GPU server, the pitfalls in environment setup, disk management, and inference acceleration are far more numerous than you'd expect. This post records one complete journey through them, covering a full system disk, a vanishing nvcc, slow FlashAttention builds, LoRA-merge traps, and TensorRT engine building and inference, with reproducible scripts for each fix.
1. Environment Setup: Pitfalls of Space and Paths
Pitfall 1: Mixing the system disk and the data disk, and running out of space without warning
Many cloud servers (AutoDL, for example) split storage into a system disk and a data disk. The system disk is often only 30-50 GB, yet conda environments, the pip cache, and model weights all land on it by default, so it can fill up in minutes.
Solutions:
- Create conda environments on the target disk
Use --prefix, or configure envs_dirs to point at the data disk. If an environment already lives on the system disk, clone it over and delete the original (a verification snippet follows this list):
```bash
mkdir -p /root/autodl-tmp/conda/{envs,pkgs}
vim ~/.condarc  # append at the end of the file
"""
envs_dirs:
- /root/autodl-tmp/conda/envs
pkgs_dirs:
- /root/autodl-tmp/conda/pkgs
"""
# Clone the old environment; the new config takes effect here
conda create -n qwen2 --clone qwen --offline
# Outside an offline environment, drop --offline
conda create -n qwen2 --clone qwen
# Remove the original environment
conda env remove -n qwen
```
- Move the pip cache directory
pip's downloaded wheels and source packages are cached under ~/.cache/pip by default, which fills the system disk fast. Purge the old cache, then point pip at a new location:
```bash
pip cache purge
pip config set global.cache-dir /root/autodl-tmp/pip_cache
pip cache dir  # confirm it took effect
```
- Keep model weights and datasets on the data disk
When downloading, use --local-dir to target the data disk explicitly, for example:
```bash
huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir ./Qwen2.5-3B-Instruct
```
Subsequent LoRA merging and TensorRT engine builds also generate plenty of intermediate files, so keep the working directory on the data disk throughout.
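With all three relocations done, a quick audit confirms nothing heavy is left on the system disk (the paths assume the AutoDL layout used above):
```bash
df -h                                   # system disk vs data disk at a glance
conda env list                          # the cloned env should live under /root/autodl-tmp/conda/envs
pip cache dir                           # should print /root/autodl-tmp/pip_cache
du -sh ~/.cache/pip ~/.conda 2>/dev/null           # old system-disk offenders, should now be small
du -sh /root/autodl-tmp/* 2>/dev/null | sort -h    # what lives on the data disk
```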
2. Aligning CUDA with PyTorch, and the Case of the Missing nvcc
Pitfall 2: nvidia-smi and nvcc -V disagree, and nvcc is nowhere to be found
nvidia-smi shows the highest CUDA version the driver supports, while nvcc -V shows the installed CUDA toolkit version. PyTorch's prebuilt wheels must match the toolkit version, not the driver version.
Worse, in many environments nvcc isn't on PATH at all, so you can't even check its version.
Fix:
```bash
# 1. Locate nvcc
which nvcc || find / -name nvcc 2>/dev/null
# 2. If found but not callable, add it to PATH
export PATH=/usr/local/cuda/bin:$PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
nvcc -V  # should print something like "release 12.1, V12.1.xxx"
```
Based on the nvcc -V output (e.g. CUDA 12.1), pick the matching cu121 build from the PyTorch website:
```bash
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
```
If you're on CUDA 12.8, the match is cu128 with PyTorch 2.7+:
```bash
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
```
Rule of thumb: nvcc -V determines the cu suffix of your PyTorch build; never rely on nvidia-smi alone.
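Once torch is installed, verify that the wheel's CUDA build really lines up with the toolkit (a generic check, nothing Qwen-specific):
```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# e.g. "2.7.1+cu128 12.8 True"; the middle value should match nvcc -V
```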
3. The Required Environment
Use nvidia-smi and nvcc -V to settle on the right torch version, then install. One trick that helps: install the version-pinned packages first, then the ones that tend to break, and leave the unpinned, easy dependencies for last.
```bash
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
pip install transformers==4.45.2
pip install peft==0.13.1
pip install accelerate==1.0.0
pip install bitsandbytes
pip install datasets
pip install tiktoken
pip install jupyter notebook
pip install huggingface_hub
# unsloth can be awkward to install
pip install unsloth
pip install onnx onnxruntime-gpu  # for debugging
pip install coloredlogs
pip install flash-attn --no-build-isolation --verbose  # very slow; --verbose is the key here
pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com  # huge and slow; install last
```
4. FlashAttention Compiling Slowly? --verbose Is the Key
Installing flash-attn compiles it from source, which usually takes ages with no output at all; it's easy to assume the build has hung.
The right way: add --no-build-isolation --verbose. This avoids the extra dependency downloads of an isolated build environment and shows compile progress in real time, so you know it's still alive.
```bash
pip install flash-attn --no-build-isolation --verbose
```
Also make sure a matching PyTorch and CUDA toolchain are installed beforehand, otherwise the build will fail.
The verbose log also shows which file is currently downloading. If the network is slow, you can fetch the file yourself first and install locally: when flash-attn stalled on one particular wheel download, grabbing it with wget and then running pip install ./* on the local file got things moving again.
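For example (the URL below is a placeholder, not a real wheel address; copy the exact link from your own pip log):
```bash
# Hypothetical URL: take the real one from the verbose pip output
wget https://example.com/flash_attn-2.x-cp310-cp310-linux_x86_64.whl
pip install ./flash_attn-*.whl
```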
5. Model Downloads and the HuggingFace Mirror
For models and datasets, huggingface-cli or the huggingface_hub library is recommended. If the network is restricted, set the mirror environment variable:
```bash
export HF_ENDPOINT=https://hf-mirror.com
# Download a model
huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir ./Qwen2.5-3B-Instruct
# Download a dataset (requires --repo-type dataset)
huggingface-cli download unsloth/Radiology_mini --repo-type dataset --local-dir ./radiology_mini
```
If your huggingface_hub is newer than 0.23.0, the bundled hf command also works:
```bash
pip install huggingface_hub # 0.36.2+
hf download unsloth/Radiology_mini --repo-type dataset --local-dir ./radiology_mini
```
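The same downloads can also be scripted through the huggingface_hub Python API, which is handy inside provisioning scripts (a minimal sketch; local_dir is whatever data-disk path you prefer):
```python
import os
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")  # set the mirror before importing, if needed

from huggingface_hub import snapshot_download

# Model repo -> data disk
snapshot_download("Qwen/Qwen2.5-3B-Instruct", local_dir="./Qwen2.5-3B-Instruct")
# Dataset repos need repo_type="dataset"
snapshot_download("unsloth/Radiology_mini", repo_type="dataset", local_dir="./radiology_mini")
```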
6. LoRA Inference, Merging, and the Mistake of Forcing LoRA as a Base Model
Base-model testing & LoRA inference
Use a single test script (model_test.py, see the appendix) and switch modes via arguments:
```bash
# Base model only
python model_test.py --base_model ./Qwen2.5-3B-Instruct --adapter "" --questions_file 测试问题.txt --load_in_4bit
```
Run the fine-tuning (train.qwen.py and train.qwen.2.py, see the appendix):
```bash
python train.qwen.py  # or: python train.qwen.2.py
```
When training finishes you get checkpoints_self_cong, a directory with a checkpoint every 500 steps. Pick any checkpoint and use it together with the base model:
```bash
# Load the LoRA adapter
python model_test.py --base_model ./Qwen2.5-3B-Instruct --adapter "checkpoints_self_cong/checkpoint-1000" --load_in_4bit
# Interactive mode
python model_test.py --base_model ./Qwen2.5-3B-Instruct --adapter "checkpoints_self_cong/checkpoint-1000" --load_in_4bit --interactive
```
Pitfall 3: Forcing LoRA weights to act as the base model
Passing the adapter directory directly as base_model fails:
```bash
python model_test.py --base_model checkpoints_self_cong/checkpoint-1000 --adapter "" --interactive
# ValueError: model structure mismatch
```
Why: a LoRA checkpoint stores only the incremental (low-rank) weights; the model body still comes from the base model. Only after merging into full weights can the result be used as a standalone model.
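You can see why by looking inside the checkpoint directory: it contains only adapter files (adapter_config.json plus the adapter weights), not full model shards. The correct two-step load, which is what model_test.py does internally, looks like this (a minimal sketch):
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Full weights come from the base model...
base = AutoModelForCausalLM.from_pretrained("./Qwen2.5-3B-Instruct", trust_remote_code=True)
# ...and the checkpoint dir supplies only the low-rank deltas on top.
# AutoModelForCausalLM.from_pretrained("checkpoints_self_cong/checkpoint-1000") would fail.
model = PeftModel.from_pretrained(base, "checkpoints_self_cong/checkpoint-1000")
```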
Merging LoRA and inference afterwards (merge script merge.py, see the appendix)
```bash
python merge.py  # produces the merged model directory, e.g. qwen_merged_fp16
# The merged model can then be used directly as a base model
python model_test.py --base_model ./qwen_merged_fp16 --adapter "" --interactive
```
Disk-space warning for merging: the merge writes out a full set of fp16 weights (the same size as the base model), so make sure the data disk has enough headroom (see the estimate below).
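A quick back-of-the-envelope estimate of how much room that is (pure arithmetic, assuming fp16 stores 2 bytes per parameter and roughly 3.1B parameters for Qwen2.5-3B-Instruct):
```python
params = 3.1e9               # approximate parameter count
bytes_fp16 = params * 2      # 2 bytes per parameter in fp16
print(f"{bytes_fp16 / 1024**3:.1f} GiB")  # about 5.8 GiB for the merged weights alone
```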
7. TensorRT-LLM Deployment: Another Disk Devourer
Installing TensorRT-LLM
```bash
pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
```
The packages are enormous and slow to fetch; be patient. When done, check:
```bash
trtllm-build --help
```
Building the TensorRT engine
```bash
trtllm-build \
--checkpoint_dir ./qwen_merged_fp16 \
--output_dir ./qwen_FP16_engine \
--gemm_plugin auto \
--max_batch_size 8 \
--max_input_len 1024 \
--max_seq_len 2048
```
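If trtllm-build rejects the directory because it isn't a TensorRT-LLM checkpoint, your release expects an explicit conversion step first. A sketch based on the convert_checkpoint.py script shipped in the TensorRT-LLM repo's examples/qwen directory (flag names vary between releases, so check the script bundled with the version you installed):
```bash
# Hypothetical paths; convert_checkpoint.py comes from the TensorRT-LLM examples
python convert_checkpoint.py \
    --model_dir ./qwen_merged_fp16 \
    --output_dir ./qwen_trtllm_ckpt \
    --dtype float16
# ...then point trtllm-build's --checkpoint_dir at ./qwen_trtllm_ckpt instead
```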
Pitfall 4: A TensorRT engine can't be run with the original model code
An engine built by TensorRT-LLM must be loaded and executed through the dedicated runtime API; it can't be called like an ordinary transformers model. You therefore need to write, or reuse, an inference script built on tensorrt_llm, for example:
```python
from tensorrt_llm.runtime import ModelRunner
runner = ModelRunner.from_dir(engine_dir='./qwen_FP16_engine')
# then feed token ids etc. through the runner interface
```
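Fleshed out a little (a sketch, not a drop-in script: ModelRunner.generate's exact keyword set varies across tensorrt_llm releases, so treat the arguments below as indicative):
```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("./qwen_merged_fp16", trust_remote_code=True)
runner = ModelRunner.from_dir(engine_dir="./qwen_FP16_engine")

ids = tokenizer("你好,今天天气", return_tensors="pt").input_ids.int()
out = runner.generate(
    batch_input_ids=[ids[0]],          # list of 1-D token-id tensors
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0][0], skip_special_tokens=True))
```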
8. Generic TensorRT Deployment (the ONNX → engine Path)
For finer control, it's sometimes worth bypassing TensorRT-LLM, exporting ONNX directly, and building the engine with trtexec.
Additional Python packages
```bash
# Install unsloth
pip install unsloth
# TensorRT-related packages. Reference system: RTX 4090 24 GB; Driver Version: 580.105.08;
# CUDA Version: 13.0 (driver); Cuda compilation tools, release 12.8, V12.8.93; torch 2.10.0+cu128
pip install onnx onnxruntime-gpu  # for debugging
pip install coloredlogs
pip install "transformers==4.45.2"  # keep transformers below version 5
```
Installing the standalone TensorRT package (example: TensorRT 10.8.0 for CUDA 12.8)
```bash
wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.8.0/tars/TensorRT-10.8.0.43.Linux.x86_64-gnu.cuda-12.8.tar.gz
tar -xzf TensorRT-10.8.0.43.Linux.x86_64-gnu.cuda-12.8.tar.gz
export TENSORRT_DIR=$(pwd)/TensorRT-10.8.0.43
export LD_LIBRARY_PATH=$TENSORRT_DIR/lib:$LD_LIBRARY_PATH
export PATH=$TENSORRT_DIR/bin:$PATH
# Install the Python bindings
pip install $TENSORRT_DIR/python/tensorrt-10.8.0.43-cp310-none-linux_x86_64.whl
```
Verify:
```python
import tensorrt as trt
print(trt.__version__)  # 10.8.0
```
Alternatively:
```bash
pip freeze > requirements.txt
# the file shows tensorrt was installed from the local wheel:
# tensorrt @ file:///root/autodl-tmp/TensorRT-10.8.0.43/python/tensorrt-10.8.0.43-cp310-none-linux_x86_64.whl#sha256=cc02978d9f6a8c129c0ce0910b9e1f7be6d793c0c70e986cdc8895956d06a54e
```
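A slightly stronger check that the Python bindings can actually reach libnvinfer (a minimal probe; a failure here usually means LD_LIBRARY_PATH doesn't include the TensorRT lib/ directory):
```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)  # constructing a Builder loads libnvinfer
print("platform has fast fp16:", builder.platform_has_fast_fp16)
```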
Exporting ONNX (export script export_qwen.py, see the appendix)
```bash
python export_qwen.py  # produces qwen2.5_3b_fp16.test.onnx
# Quick smoke test of the ONNX model
python test_onnx_qwen.py --prompt "你好,今天天气" --onnx qwen2.5_3b_fp16.test.onnx --greedy-steps 30 --gpu
```
Building the engine with trtexec
You must specify the dynamic dimension ranges (min/opt/max), otherwise variable-length inference may not work:
```bash
trtexec \
--onnx=qwen2.5_3b_fp16.test.onnx \
--saveEngine=qwen_fp32.test.engine \
--minShapes=input_ids:1x1,attention_mask:1x1,position_ids:1x1 \
--optShapes=input_ids:1x128,attention_mask:1x128,position_ids:1x128 \
--maxShapes=input_ids:1x512,attention_mask:1x512,position_ids:1x512
```
Self-check that the engine was saved correctly and loads:
```bash
trtexec --loadEngine=qwen_fp32.test.engine \
--shapes=input_ids:1x1,attention_mask:1x1,position_ids:1x1
```
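If you'd rather look inside the engine from Python than via trtexec, a small inspection sketch (TensorRT 10 API):
```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("qwen_fp32.test.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    mode = engine.get_tensor_mode(name)               # INPUT or OUTPUT
    print(name, mode, engine.get_tensor_shape(name))  # -1 marks dynamic dims
```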
Loading the engine for inference via the Python API
infer_trt_qwen.py is the dedicated inference script; see the appendix.
```bash
python infer_trt_qwen.py \
--engine ./qwen_fp32.test.engine \
--tokenizer ./qwen_merged_fp16 \
--prompt "你好,今天天气" \
--greedy-steps 128
```
Key reminder: an engine is tightly coupled to the GPU architecture it was built on; moving to another machine or upgrading the driver may require a rebuild.
9. Multimodal Extensions and Other Small Pitfalls
Downloading a multimodal model
```bash
hf download unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit --local-dir ./Qwen2.5-VL-7B-Instruct-bnb-4bit
```
About "installing unsloth"
This means unsloth (an acceleration library, usable for 4-bit quantized training/inference); run:
```bash
pip install unsloth
```
The implicit "base" conda environment problem
Sometimes you have activated an environment, yet python still points at base's interpreter, so import transformers reports the wrong version. Force a check with:
```bash
conda run -n base python -c "import transformers; print(transformers.__version__)"
conda run -n qwen python -c "import transformers; print(transformers.__version__)"
```
Make sure the environment you activated matches the interpreter that actually runs; when in doubt, use which python or invoke the script via an absolute path.
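A minimal way to confirm the active environment and the running interpreter agree:
```bash
which python                                   # path of the interpreter the shell will run
python -c "import sys; print(sys.executable)"  # path the interpreter reports about itself
# both should point into the environment you activated, not into base
```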
10. Summary: A Checklist for Feeding an LLM to Your GPU
- Disk planning: keep every working directory, conda environment, pip cache, and model weight on the data disk.
- CUDA alignment: pick the PyTorch build from nvcc -V, and put nvcc on PATH.
- FlashAttention: compile flash-attn with --verbose and be patient.
- Model downloads: use the HF mirror and control paths with --local-dir.
- LoRA usage: inference needs base model + adapter path; deployment requires merging into full weights first.
- TensorRT deployment: whether via TensorRT-LLM or the ONNX path, prepare a dedicated inference script, and remember that engines are hardware-bound.
- Multimodal and third-party libraries: keep versions matched; if you run into "sloth", check whether unsloth is meant.
- Environment isolation: know which conda environment is active and avoid base pollution.
These are exactly the pitfalls you hit when deploying LLMs for real; hopefully this record saves you a few hours of debugging. If it helped, share it with whoever else is down in the same trenches.
Appendix
A record of the scripts used.
1. model_test.py
Script for testing the model.
```python
import argparse
import os
from typing import List, Dict
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Test questions (kept in Chinese: they are inputs to the Chinese medical bot)
DEFAULT_QUESTIONS = [
"你是谁?请介绍一下你自己。",
"孩子发烧到38.5度应该怎么办?",
"小孩咳嗽三天了,需要马上去医院吗?",
"儿童肥胖平时饮食和运动要注意什么?",
"宝宝腹泻时家长应该怎么护理?",
]
# System prompt for the bot (likewise kept in Chinese, since it is model input)
SYSTEM_PROMPT = (
"你是智能医生客服机器人小D。你的回答只能用于健康科普和就医建议,"
"不能替代医生诊断。遇到高烧不退、呼吸困难、意识异常、严重疼痛、"
"婴幼儿急症等情况,要建议用户及时线下就医。"
)
def build_quant_config(load_in_4bit: bool, load_in_8bit: bool):
if load_in_4bit:
return BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
if load_in_8bit:
return BitsAndBytesConfig(load_in_8bit=True)
return None
def load_model(args):
tokenizer = AutoTokenizer.from_pretrained(
args.base_model,
use_fast=False,
trust_remote_code=True,
padding_side="right",
)
quant_config = build_quant_config(args.load_in_4bit, args.load_in_8bit)
model_kwargs = {
"trust_remote_code": True,
"device_map": args.device_map,
"torch_dtype": torch.float16 if torch.cuda.is_available() else torch.float32,
}
if quant_config is not None:
model_kwargs["quantization_config"] = quant_config
base_model = AutoModelForCausalLM.from_pretrained(args.base_model, **model_kwargs)
if args.adapter:
model = PeftModel.from_pretrained(base_model, args.adapter)
if args.merge:
model = model.merge_and_unload()
else:
model = base_model
model.eval()
return tokenizer, model
def build_messages(question: str) -> List[Dict[str, str]]:
return [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": question},
]
@torch.inference_mode()
def generate_answer(tokenizer, model, question: str, args) -> str:
messages = build_messages(question)
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(
**inputs,
max_new_tokens=args.max_new_tokens,
do_sample=args.do_sample,
temperature=args.temperature,
top_p=args.top_p,
repetition_penalty=args.repetition_penalty,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
def run_batch_test(tokenizer, model, args):
questions = DEFAULT_QUESTIONS
if args.questions_file and os.path.exists(args.questions_file):
with open(args.questions_file, "r", encoding="utf-8") as f:
questions = [line.strip() for line in f if line.strip()]
print("\n=== 小D医生机器人 LoRA 测试 ===")
print(f"Base model : {args.base_model}")
print(f"Adapter : {args.adapter or '未加载,测试基础模型'}")
print(f"Questions : {len(questions)}")
for idx, question in enumerate(questions, 1):
print("\n" + "=" * 80)
print(f"[问题 {idx}] {question}")
answer = generate_answer(tokenizer, model, question, args)
print("\n[回答]")
print(answer)
def run_interactive(tokenizer, model, args):
print("\n进入交互测试模式,输入 exit 退出。")
while True:
question = input("\n用户:").strip()
if question.lower() in {"exit", "quit", "q"}:
break
answer = generate_answer(tokenizer, model, question, args)
print("\n小D:", answer)
def parse_args():
parser = argparse.ArgumentParser(description="Test Qwen2.5 medical LoRA adapter.")
    parser.add_argument(
        "--base_model",
        default=os.getenv("QWEN_BASE_MODEL", "Qwen2.5-3B-Instruct"),
        help="Base model path, or a HuggingFace/ModelScope model name",
    )
    parser.add_argument(
        "--adapter",
        default=os.getenv("QWEN_LORA_ADAPTER", "checkpoints_self_cong/checkpoint-1000"),
        help="Path to the trained LoRA checkpoint; leave empty to test only the base model",
    )
    parser.add_argument("--questions_file", default="", help="Test file with one question per line")
    parser.add_argument("--interactive", action="store_true", help="Start interactive testing")
    parser.add_argument("--merge", action="store_true", help="Merge adapter weights after loading, then run inference")
    parser.add_argument("--load_in_4bit", action="store_true", help="Load the base model with 4-bit quantization")
    parser.add_argument("--load_in_8bit", action="store_true", help="Load the base model with 8-bit quantization")
    parser.add_argument("--device_map", default="auto")
    parser.add_argument("--max_new_tokens", type=int, default=512)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=0.9)
    parser.add_argument("--repetition_penalty", type=float, default=1.05)
    parser.add_argument("--do_sample", action="store_true", help="Enable sampling; default is greedy/near-deterministic output")
return parser.parse_args()
def main():
args = parse_args()
if args.load_in_4bit and args.load_in_8bit:
        raise ValueError("--load_in_4bit and --load_in_8bit cannot be enabled at the same time")
tokenizer, model = load_model(args)
if args.interactive:
run_interactive(tokenizer, model, args)
else:
run_batch_test(tokenizer, model, args)
if __name__ == "__main__":
main()
```

2. train.qwen.py & train.qwen.2.py

Scripts that load the dataset and run the fine-tuning.
```python
#!/usr/bin/env python3
"""
使用 QLoRA 和 FlashAttention 微调 Qwen 模型
"""
import os
import time
import json
import random
import csv
import torch
from torch.utils.data import Dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling,
)
from transformers.trainer_pt_utils import LabelSmoother
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# ============================================================================
# Configuration
# ============================================================================
# GPU selection
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Model path
MODEL_PATH = "./Qwen2.5-3B-Instruct"
# Data file paths
DATA_FILE = "数据.csv"
SELF_COGNITION_FILE = "self_cognition.json"
# Output directory
OUTPUT_DIR = "checkpoints_self_cong/"
# Training hyperparameters
MAX_LEN = 1024
WARMUP_STEPS = 1
PER_DEVICE_TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 1
LEARNING_RATE = 2e-4
LOGGING_STEPS = 100
MAX_STEPS = 3000
SAVE_STEPS = 500
# LoRA hyperparameters
LORA_R = 32
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
# System prompt
SYSTEM_MESSAGE = "You are a helpful assistant."
# ============================================================================
# Data loading
# ============================================================================
def load_dataset(filename):
    """Load the CSV dataset."""
data_list = []
with open(filename, "r", encoding="gb18030") as f:
reader = csv.DictReader(f)
for row in reader:
data_list.append({
"department": row["department"],
"input": row["ask"],
"output": row["answer"],
})
return data_list
def replace_name(s):
"""替换名称占位符"""
s = s.replace("<NAME>", "智能医生客服机器人小D")
s = s.replace("<AUTHOR>", "Greedy AI")
return s
def load_self_cong_data(filename):
"""加载自我认知数据"""
data_list = []
idx = 0
for d in json.load(open(filename, "r", encoding="utf-8")):
d["instruction"] = replace_name(d["instruction"])
d["output"] = replace_name(d["output"])
data_list.append({
"id": idx,
"conversations": [
{"from": "user", "value": d["instruction"]},
{"from": "assistant", "value": d["output"]},
],
})
idx += 1
return data_list
def prepare_message(data_list):
"""将数据转换为 conversations 格式"""
new_list = []
for i, data in enumerate(data_list):
new_list.append({
"id": f"identity_{i}",
"conversations": [
{"from": "user", "value": data["input"]},
{"from": "assistant", "value": data["output"]},
],
})
return new_list
# ============================================================================
# Preprocessing
# ============================================================================
def preprocess(sources, tokenizer, max_len, system_message=SYSTEM_MESSAGE):
    """Convert conversations into model inputs (ChatML ids + masked labels)."""
roles = {"user": "<|im_start|>user", "assistant": "<|im_start|>assistant"}
im_start = tokenizer("<|im_start|>").input_ids[0]
im_end = tokenizer("<|im_end|>").input_ids[0]
nl_tokens = tokenizer("\n").input_ids
_system = tokenizer("system").input_ids + nl_tokens
_user = tokenizer("user").input_ids + nl_tokens
_assistant = tokenizer("assistant").input_ids + nl_tokens
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["user"]:
source = source[1:]
input_id, target = [], []
system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
input_id += system
        # Supervise only <|im_start|>, <|im_end|>, and the trailing newline of the system
        # segment; mask the rest (len(system) - 3 ids, assuming "\n" encodes to one token)
        target += [im_start] + [IGNORE_TOKEN_ID] * (len(system) - 3) + [im_end] + nl_tokens
assert len(input_id) == len(target)
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
_input_id = (
tokenizer(role).input_ids + nl_tokens +
tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
)
input_id += _input_id
if role == "<|im_start|>user":
_target = [im_start] + [IGNORE_TOKEN_ID] * (len(_input_id) - 3) + [im_end] + nl_tokens
elif role == "<|im_start|>assistant":
_target = (
[im_start] + [IGNORE_TOKEN_ID] * len(tokenizer(role).input_ids) +
_input_id[len(tokenizer(role).input_ids) + 1:-2] + [im_end] + nl_tokens
)
else:
raise NotImplementedError
target += _target
assert len(input_id) == len(target)
input_id += [tokenizer.pad_token_id] * (max_len - len(input_id))
target += [IGNORE_TOKEN_ID] * (max_len - len(target))
input_ids.append(input_id[:max_len])
targets.append(target[:max_len])
input_ids = torch.tensor(input_ids, dtype=torch.int)
targets = torch.tensor(targets, dtype=torch.int)
return dict(
input_ids=input_ids,
labels=targets,
attention_mask=input_ids.ne(tokenizer.pad_token_id),
)
class SupervisedDataset(Dataset):
"""监督微调数据集"""
def __init__(self, raw_data, tokenizer, max_len):
super().__init__()
print("Formatting inputs...")
sources = [example["conversations"] for example in raw_data]
data_dict = preprocess(sources, tokenizer, max_len)
self.input_ids = data_dict["input_ids"]
self.labels = data_dict["labels"]
self.attention_mask = data_dict["attention_mask"]
print("Formatting done...")
def __len__(self):
return len(self.input_ids)
def __getitem__(self, i):
return dict(
input_ids=self.input_ids[i],
labels=self.labels[i],
attention_mask=self.attention_mask[i],
)
# ============================================================================
# Main training flow
# ============================================================================
def main():
    # Load data
    print("Loading dataset...")
    dataset = load_dataset(DATA_FILE)
    print(f"Dataset size: {len(dataset)}")
    # Prepare training data
    self_cong_data = load_self_cong_data(SELF_COGNITION_FILE)
format_data_list = prepare_message(dataset[:1000])
format_data_list = self_cong_data + format_data_list
random.shuffle(format_data_list)
    # Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
use_fast=False,
trust_remote_code=True,
padding_side="right",
)
    # Build the training dataset
train_dataset = SupervisedDataset(format_data_list, tokenizer, max_len=MAX_LEN)
    # LoRA configuration
config = LoraConfig(
r=LORA_R,
lora_alpha=LORA_ALPHA,
target_modules=LORA_TARGET_MODULES,
bias="none",
lora_dropout=LORA_DROPOUT,
task_type="CAUSAL_LM",
)
    # 4-bit quantization configuration
    compute_dtype = torch.float16
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=True,
)
    # Training arguments
peft_training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
warmup_steps=WARMUP_STEPS,
per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
learning_rate=LEARNING_RATE,
optim="paged_adamw_8bit",
logging_steps=LOGGING_STEPS,
logging_dir="./logs",
save_strategy="steps",
max_steps=MAX_STEPS,
save_steps=SAVE_STEPS,
gradient_checkpointing=True,
report_to="none",
overwrite_output_dir=True,
group_by_length=True,
)
    # Load the pretrained model
    print("Loading model...")
original_model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=compute_dtype,
quantization_config=quant_config,
# attn_implementation="flash_attention_2",
)
    # Prepare the model for k-bit training
original_model.gradient_checkpointing_enable()
original_model = prepare_model_for_kbit_training(original_model)
peft_model = get_peft_model(original_model, config)
peft_model.config.use_cache = False
    # Build the Trainer
peft_trainer = Trainer(
model=peft_model,
train_dataset=train_dataset,
args=peft_training_args,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
    # Start training
    torch.cuda.empty_cache()
    print("Starting training...")
start_time = time.time()
peft_trainer.train()
end_time = time.time()
print(f"训练完成!耗时: {end_time - start_time:.2f} 秒")
if __name__ == "__main__":
IGNORE_TOKEN_ID = LabelSmoother.ignore_index
main()
&
python
#!/usr/bin/env python3
"""
Fine-tune a Qwen model with QLoRA and FlashAttention
(compatibility version - works around bitsandbytes issues).
"""
import os
import sys
import time
import json
import random
import csv
import warnings
import torch
from torch.utils.data import Dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling,
)
from transformers.trainer_pt_utils import LabelSmoother
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Silence warnings
warnings.filterwarnings("ignore")
# ============================================================================
# Configuration
# ============================================================================
# GPU selection
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Model path
MODEL_PATH = "./Qwen2.5-3B-Instruct"
# Data file paths
DATA_FILE = "数据.csv"
SELF_COGNITION_FILE = "self_cognition.json"
# Output directory
OUTPUT_DIR = "checkpoints_self_cong/"
# Training hyperparameters
MAX_LEN = 1024
WARMUP_STEPS = 1
PER_DEVICE_TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 1
LEARNING_RATE = 2e-4
LOGGING_STEPS = 100
MAX_STEPS = 1000
SAVE_STEPS = 500
# LoRA hyperparameters
LORA_R = 32
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
# System prompt
SYSTEM_MESSAGE = "You are a helpful assistant."
# Whether to use 4-bit quantization (falls back automatically if unsupported)
USE_4BIT = True
# ============================================================================
# Environment checks
# ============================================================================
def check_and_fix_bitsandbytes():
    """Check bitsandbytes compatibility and try to fix it."""
    global USE_4BIT
    try:
        import bitsandbytes as bnb
        print(f"✓ bitsandbytes version: {bnb.__version__}")
        # Allocate a CUDA tensor to confirm the CUDA runtime works
        test_tensor = torch.zeros(10).cuda()
        print("✓ bitsandbytes CUDA support OK")
    except Exception as e:
        print(f"⚠ bitsandbytes error: {e}")
        print("Trying a downgrade...")
        # Try reinstalling a known-compatible version
        import subprocess
        subprocess.check_call([
            sys.executable, "-m", "pip", "install",
            "bitsandbytes==0.41.3", "--quiet"
        ])
        try:
            import bitsandbytes as bnb
            print(f"✓ bitsandbytes version after downgrade: {bnb.__version__}")
        except Exception:
            print("✗ Could not fix bitsandbytes; falling back to 8-bit or full precision")
            USE_4BIT = False
# ============================================================================
# Data loading
# ============================================================================
def load_dataset(filename):
    """Load the CSV dataset."""
data_list = []
with open(filename, "r", encoding="gb18030") as f:
reader = csv.DictReader(f)
for row in reader:
data_list.append({
"department": row["department"],
"input": row["ask"],
"output": row["answer"],
})
return data_list
def replace_name(s):
"""替换名称占位符"""
s = s.replace("<NAME>", "智能医生客服机器人小D")
s = s.replace("<AUTHOR>", "Greedy AI")
return s
def load_self_cong_data(filename):
"""加载自我认知数据"""
data_list = []
idx = 0
for d in json.load(open(filename, "r", encoding="utf-8")):
d["instruction"] = replace_name(d["instruction"])
d["output"] = replace_name(d["output"])
data_list.append({
"id": idx,
"conversations": [
{"from": "user", "value": d["instruction"]},
{"from": "assistant", "value": d["output"]},
],
})
idx += 1
return data_list
def prepare_message(data_list):
"""将数据转换为 conversations 格式"""
new_list = []
for i, data in enumerate(data_list):
new_list.append({
"id": f"identity_{i}",
"conversations": [
{"from": "user", "value": data["input"]},
{"from": "assistant", "value": data["output"]},
],
})
return new_list
# ============================================================================
# Preprocessing
# ============================================================================
def preprocess(sources, tokenizer, max_len, system_message=SYSTEM_MESSAGE):
    """Convert conversations into model inputs (ChatML ids + masked labels)."""
roles = {"user": "<|im_start|>user", "assistant": "<|im_start|>assistant"}
im_start = tokenizer("<|im_start|>").input_ids[0]
im_end = tokenizer("<|im_end|>").input_ids[0]
nl_tokens = tokenizer("\n").input_ids
_system = tokenizer("system").input_ids + nl_tokens
_user = tokenizer("user").input_ids + nl_tokens
_assistant = tokenizer("assistant").input_ids + nl_tokens
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["user"]:
source = source[1:]
input_id, target = [], []
system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
input_id += system
target += [im_start] + [IGNORE_TOKEN_ID] * (len(system) - 3) + [im_end] + nl_tokens
assert len(input_id) == len(target)
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
_input_id = (
tokenizer(role).input_ids + nl_tokens +
tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
)
input_id += _input_id
if role == "<|im_start|>user":
_target = [im_start] + [IGNORE_TOKEN_ID] * (len(_input_id) - 3) + [im_end] + nl_tokens
elif role == "<|im_start|>assistant":
_target = (
[im_start] + [IGNORE_TOKEN_ID] * len(tokenizer(role).input_ids) +
_input_id[len(tokenizer(role).input_ids) + 1:-2] + [im_end] + nl_tokens
)
else:
raise NotImplementedError
target += _target
assert len(input_id) == len(target)
input_id += [tokenizer.pad_token_id] * (max_len - len(input_id))
target += [IGNORE_TOKEN_ID] * (max_len - len(target))
input_ids.append(input_id[:max_len])
targets.append(target[:max_len])
input_ids = torch.tensor(input_ids, dtype=torch.int)
targets = torch.tensor(targets, dtype=torch.int)
return dict(
input_ids=input_ids,
labels=targets,
attention_mask=input_ids.ne(tokenizer.pad_token_id),
)
class SupervisedDataset(Dataset):
"""监督微调数据集"""
def __init__(self, raw_data, tokenizer, max_len):
super().__init__()
print("Formatting inputs...")
sources = [example["conversations"] for example in raw_data]
data_dict = preprocess(sources, tokenizer, max_len)
self.input_ids = data_dict["input_ids"]
self.labels = data_dict["labels"]
self.attention_mask = data_dict["attention_mask"]
print("Formatting done...")
def __len__(self):
return len(self.input_ids)
def __getitem__(self, i):
return dict(
input_ids=self.input_ids[i],
labels=self.labels[i],
attention_mask=self.attention_mask[i],
)
# ============================================================================
# Main training flow
# ============================================================================
def main():
    # Inspect the environment
    print("Checking runtime environment...")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU count: {torch.cuda.device_count()}")
    # Check and fix bitsandbytes
    check_and_fix_bitsandbytes()
    # Load data
    print("\nLoading dataset...")
    dataset = load_dataset(DATA_FILE)
    print(f"Dataset size: {len(dataset)}")
    # Prepare training data
    self_cong_data = load_self_cong_data(SELF_COGNITION_FILE)
    format_data_list = prepare_message(dataset[:1000])
    format_data_list = self_cong_data + format_data_list
    random.shuffle(format_data_list)
    print(f"Total training samples: {len(format_data_list)}")
    # Load the tokenizer
    print("\nLoading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
use_fast=False,
trust_remote_code=True,
padding_side="right",
)
    # Build the training dataset
    print("Building dataset...")
train_dataset = SupervisedDataset(format_data_list, tokenizer, max_len=MAX_LEN)
    # LoRA configuration
config = LoraConfig(
r=LORA_R,
lora_alpha=LORA_ALPHA,
target_modules=LORA_TARGET_MODULES,
bias="none",
lora_dropout=LORA_DROPOUT,
task_type="CAUSAL_LM",
)
    # Training arguments
peft_training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
warmup_steps=WARMUP_STEPS,
per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
learning_rate=LEARNING_RATE,
optim="paged_adamw_8bit",
logging_steps=LOGGING_STEPS,
logging_dir="./logs",
save_strategy="steps",
max_steps=MAX_STEPS,
save_steps=SAVE_STEPS,
gradient_checkpointing=True,
report_to="none",
# overwrite_output_dir=True,
# group_by_length=True,
)
    # Load the model
    print("\nLoading model...")
    if USE_4BIT:
        # 4-bit quantization
compute_dtype = torch.float16
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=True,
)
try:
original_model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=compute_dtype,
quantization_config=quant_config,
# attn_implementation="flash_attention_2",
)
print("✓ 成功加载 4-bit 量化模型")
except Exception as e:
print(f"✗ 4-bit 量化加载失败: {e}")
print("降级到 8-bit 量化...")
# 降级到 8-bit
quant_config = BitsAndBytesConfig(load_in_8bit=True)
original_model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
quantization_config=quant_config,
attn_implementation="flash_attention_2",
)
print("✓ 成功加载 8-bit 量化模型")
else:
# 不使用量化(需要更多显存)
print("不使用量化,加载全精度模型...")
original_model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=torch.float16,
attn_implementation="flash_attention_2",
)
print("✓ 成功加载全精度模型")
# 准备模型
print("\n准备模型...")
original_model.gradient_checkpointing_enable()
original_model = prepare_model_for_kbit_training(original_model)
peft_model = get_peft_model(original_model, config)
peft_model.config.use_cache = False
    # Report trainable parameters
trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in peft_model.parameters())
print(f"可训练参数: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
print(f"总参数: {total_params:,}")
    # Build the Trainer
    print("\nBuilding Trainer...")
peft_trainer = Trainer(
model=peft_model,
train_dataset=train_dataset,
args=peft_training_args,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
    # Start training
    torch.cuda.empty_cache()
    print("\n" + "=" * 50)
    print("Starting training")
print("=" * 50)
start_time = time.time()
peft_trainer.train()
end_time = time.time()
print("\n" + "=" * 50)
print(f"训练完成!耗时: {end_time - start_time:.2f} 秒")
print("=" * 50)
    # Save the model
    print(f"\nSaving model to {OUTPUT_DIR}")
    peft_model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    print("✓ Model saved")
if __name__ == "__main__":
IGNORE_TOKEN_ID = LabelSmoother.ignore_index
main()
```

3. merge.py

Script that merges the fine-tuned LoRA weights into the base model.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model_id = "./Qwen2.5-3B-Instruct"
adapter_path = "./checkpoints_self_cong/checkpoint-1000"  # matches the training OUTPUT_DIR
merged_model_path = "./qwen_merged_fp16"  # the merged model is saved here
# Load the base model in float16 to avoid unnecessary higher-precision overhead
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # Qwen models usually need this
    device_map="auto"
)
# Attach the LoRA adapter and merge it in
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()
# Save the merged full model (.safetensors format)
model.save_pretrained(merged_model_path, safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.save_pretrained(merged_model_path)
print(f"Merged fp16 model saved to {merged_model_path}")
```

4. export_qwen.py

Script that converts the PyTorch weights to ONNX format.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, __version__ as _transformers_version
# transformers>=5 + legacy torch.onnx.export (JIT trace) breaks in masking_utils (q_length as 0-dim tensor).
# Match model card (config lists 4.45.2): pip install "transformers==4.45.2"
if int(_transformers_version.split(".", 1)[0]) >= 5:
raise RuntimeError(
f"export_qwen.py requires transformers<5 for ONNX export (got {_transformers_version}). "
'Install: pip install "transformers==4.45.2"'
)
model_path = "./qwen_merged_fp16"
device = "cuda" if torch.cuda.is_available() else "cpu"
# ================== Load model ==================
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.float16,
trust_remote_code=True,
).to(device)
model.eval()
model.config.use_cache = False  # fully disable the KV cache
# ================== Load tokenizer ==================
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# ================== Prepare inputs ==================
inputs = tokenizer("你好,今天天气", return_tensors="pt").to(device)
seq_len = inputs.input_ids.shape[1]
position_ids = torch.arange(0, seq_len, dtype=torch.long, device=device).unsqueeze(0)
# ================== Wrap the model ==================
# position_ids must be passed in explicitly; otherwise the internal q_length derivation breaks tracing
class ExportModel(torch.nn.Module):
def __init__(self, model):
super().__init__()
self.model = model
def forward(self, input_ids, attention_mask, position_ids):
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
use_cache=False,
            return_dict=False,  # return plain tensors only
)
return outputs[0] # logits
wrapped_model = ExportModel(model)
# ================== Export ONNX ==================
print("Exporting to ONNX...")
torch.onnx.export(
wrapped_model,
(inputs.input_ids, inputs.attention_mask, position_ids),
"qwen2.5_3b_fp16.test.onnx",
input_names=["input_ids", "attention_mask", "position_ids"],
output_names=["logits"],
dynamic_axes={
"input_ids": {0: "batch", 1: "seq"},
"attention_mask": {0: "batch", 1: "seq"},
"position_ids": {0: "batch", 1: "seq"},
"logits": {0: "batch", 1: "seq"},
},
    opset_version=14,  # opset 14 has good compatibility and avoids version-conversion bugs
    dynamo=False,  # use the legacy TorchScript exporter; stable and predictable
do_constant_folding=True,
export_params=True,
)
print("✅ ONNX 导出成功!文件:qwen2.5_3b_fp16.onnx")
```

5. test_onnx_qwen.py

Script for smoke-testing the exported ONNX model.
```python
#!/usr/bin/env python3
"""
Smoke test for exported ONNX (same I/O contract as export_qwen.py).
Requires: pip install onnxruntime # or onnxruntime-gpu for CUDA
pip install transformers # tokenizer only; any recent 4.x is fine
External weight shards must sit beside qwen2.5_3b_fp16.onnx (as produced by export).
"""
from __future__ import annotations
import argparse
from pathlib import Path
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
def build_session(onnx_path: Path, use_gpu: bool) -> ort.InferenceSession:
so = ort.SessionOptions()
so.log_severity_level = 3
providers = []
if use_gpu:
providers.append("CUDAExecutionProvider")
providers.append("CPUExecutionProvider")
return ort.InferenceSession(str(onnx_path), sess_options=so, providers=providers)
def run_once(
session: ort.InferenceSession,
input_ids: np.ndarray,
attention_mask: np.ndarray,
position_ids: np.ndarray,
) -> np.ndarray:
outs = session.run(
["logits"],
{
"input_ids": input_ids,
"attention_mask": attention_mask,
"position_ids": position_ids,
},
)
return outs[0]
def main() -> None:
parser = argparse.ArgumentParser(description="Test qwen2.5_3b_fp16.onnx forward + optional greedy decode.")
parser.add_argument("--onnx", type=Path, default=None, help="Path to .onnx (default: next to this script)")
parser.add_argument("--tokenizer", type=Path, default=None, help="Tokenizer dir (default: ./qwen_merged_fp16)")
parser.add_argument("--prompt", type=str, default="你好,今天天气")
parser.add_argument("--greedy-steps", type=int, default=0, help="Extra greedy decode steps after prompt (slow on CPU)")
parser.add_argument("--gpu", action="store_true", help="Use CUDAExecutionProvider if onnxruntime-gpu is installed")
args = parser.parse_args()
root = Path(__file__).resolve().parent
onnx_path = args.onnx or (root / "qwen2.5_3b_fp16.onnx")
tok_dir = args.tokenizer or (root / "qwen_merged_fp16")
if not onnx_path.is_file():
raise SystemExit(f"Missing ONNX: {onnx_path}")
if not tok_dir.is_dir():
raise SystemExit(f"Missing tokenizer dir: {tok_dir}")
available = ort.get_available_providers()
use_gpu = args.gpu and "CUDAExecutionProvider" in available
if args.gpu and not use_gpu:
print("Warning: --gpu requested but CUDAExecutionProvider not available; using CPU.")
print(f" Available: {available}")
print(f"Loading ONNX: {onnx_path}")
session = build_session(onnx_path, use_gpu=use_gpu)
print(f"Providers in use: {session.get_providers()}")
tokenizer = AutoTokenizer.from_pretrained(str(tok_dir), trust_remote_code=True)
enc = tokenizer(args.prompt, return_tensors="np")
input_ids = enc["input_ids"].astype(np.int64)
n_steps = 1 + max(0, args.greedy_steps)
for step in range(n_steps):
seq_len = int(input_ids.shape[1])
attention_mask = np.ones((1, seq_len), dtype=np.int64)
position_ids = np.arange(seq_len, dtype=np.int64)[np.newaxis, :]
logits = run_once(session, input_ids, attention_mask, position_ids)
last_logits = logits[0, -1]
next_id = int(np.argmax(last_logits))
next_piece = tokenizer.decode([next_id], skip_special_tokens=False)
print(f"[step {step}] seq_len={seq_len} logits.shape={logits.shape} next_id={next_id!r} next_token={next_piece!r}")
if step == n_steps - 1:
break
input_ids = np.concatenate([input_ids, [[next_id]]], axis=1)
print("OK: ONNX forward completed.")
if __name__ == "__main__":
main()
```

6. infer_trt_qwen.py

Script that runs the TensorRT engine.
```python
#!/usr/bin/env python3
"""
Run forward + optional greedy decode using a TensorRT engine (.engine) built from Qwen ONNX.
Depends on: TensorRT Python wheels matching the same release as `trtexec`, torch, transformers, numpy.
Runtime note: load `libnvinfer` from the same TensorRT tree as `trtexec` (prepend its `lib/`
to LD_LIBRARY_PATH) so engines deserialize reliably. Pip may install a newer `tensorrt_libs`
bundle that clashes with engines built via the tarball `trtexec`.
Precision note (important): engines built with `trtexec --fp16` for this ONNX can produce logits
that are entirely zero (silent wrong output); verify with `--dumpRawBindingsToFile`.
Use an FP32 engine (omit `--fp16`) for correct logits on this graph.
Inputs/outputs match export (see test_onnx_qwen.py): input_ids, attention_mask, position_ids -> logits.
Generation defaults stop when EOS is predicted, `--max-new-tokens` is reached, or the engine sequence
limit is hit; optionally cap displayed reply length with `--max-chars`. Use `--greedy-steps` for the
legacy fixed-step loop.
"""
from __future__ import annotations
import argparse
import os
from pathlib import Path
_TRT_TARBALL_LIB = Path("/root/autodl-tmp/TensorRT-10.8.0.43/lib")
if _TRT_TARBALL_LIB.is_dir():
os.environ["LD_LIBRARY_PATH"] = (
str(_TRT_TARBALL_LIB) + os.pathsep + os.environ.get("LD_LIBRARY_PATH", "")
)
import numpy as np
import tensorrt as trt
import torch
from transformers import AutoTokenizer
def load_engine(engine_path: Path, logger: trt.ILogger) -> trt.ICudaEngine:
with open(engine_path, "rb") as f:
blob = f.read()
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(blob)
if engine is None:
raise RuntimeError(f"Failed to deserialize TensorRT engine: {engine_path}")
return engine
def run_forward(
ctx: trt.IExecutionContext,
*,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
position_ids: torch.Tensor,
logits_out: torch.Tensor,
stream: torch.cuda.Stream,
) -> None:
assert input_ids.is_cuda and attention_mask.is_cuda and position_ids.is_cuda and logits_out.is_cuda
ctx.set_tensor_address("input_ids", input_ids.data_ptr())
ctx.set_tensor_address("attention_mask", attention_mask.data_ptr())
ctx.set_tensor_address("position_ids", position_ids.data_ptr())
ctx.set_tensor_address("logits", logits_out.data_ptr())
ok = ctx.execute_async_v3(stream.cuda_stream)
if not ok:
raise RuntimeError("TensorRT execute_async_v3 returned False")
def main() -> None:
parser = argparse.ArgumentParser(description="Infer with a Qwen TensorRT .engine (FP32 recommended)")
parser.add_argument(
"--engine",
type=Path,
default=None,
help="Path to .engine (default: ./qwen_fp32_test.engine next to this script)",
)
parser.add_argument(
"--tokenizer",
type=Path,
default=None,
help="Tokenizer directory (default: ./qwen_merged_fp16)",
)
parser.add_argument("--prompt", type=str, default="你好,今天天气")
parser.add_argument(
"--greedy-steps",
type=int,
default=None,
help="Legacy: exact number of extra forwards after the prompt (overrides max-new-tokens behavior if set)",
)
parser.add_argument(
"--max-new-tokens",
type=int,
default=256,
help="Stop after generating this many tokens past the prompt (unless EOS/engine limit hits first)",
)
parser.add_argument(
"--max-chars",
type=int,
default=0,
help="If > 0, truncate printed reply text (Unicode chars) after decode and append ellipsis",
)
parser.add_argument("--verbose-trt", action="store_true", help="TensorRT INFO logs")
args = parser.parse_args()
root = Path(__file__).resolve().parent
engine_path = args.engine or (root / "qwen_fp32_test.engine")
tok_dir = args.tokenizer or (root / "qwen_merged_fp16")
if not engine_path.is_file():
raise SystemExit(f"Missing engine: {engine_path}")
if not tok_dir.is_dir():
raise SystemExit(f"Missing tokenizer dir: {tok_dir}")
log_lvl = trt.Logger.INFO if args.verbose_trt else trt.Logger.WARNING
logger = trt.Logger(log_lvl)
engine = load_engine(engine_path, logger)
ctx = engine.create_execution_context()
device = torch.device("cuda:0")
tokenizer = AutoTokenizer.from_pretrained(str(tok_dir), trust_remote_code=True)
stream = torch.cuda.Stream()
enc = tokenizer(args.prompt, return_tensors="pt")
input_ids_cpu = enc["input_ids"].to(torch.int64)
if input_ids_cpu.dim() != 2 or input_ids_cpu.shape[0] != 1:
raise SystemExit("Only batch size 1 is supported (same as export).")
max_seq = int(engine.get_tensor_profile_shape("input_ids", 0)[2][1])
seq_len = int(input_ids_cpu.shape[1])
if seq_len > max_seq:
raise SystemExit(f"Prompt length {seq_len} exceeds engine max sequence {max_seq}")
prompt_token_count = seq_len
eos_id = tokenizer.eos_token_id
step = 0
if args.greedy_steps is not None:
n_steps = 1 + max(0, args.greedy_steps)
for _ in range(n_steps):
seq_len = int(input_ids_cpu.shape[1])
if seq_len > max_seq:
print(f"[stop] sequence length {seq_len} exceeds engine max {max_seq}")
break
ctx.set_input_shape("input_ids", (1, seq_len))
ctx.set_input_shape("attention_mask", (1, seq_len))
ctx.set_input_shape("position_ids", (1, seq_len))
out_shape = tuple(ctx.get_tensor_shape("logits"))
logits_gpu = torch.empty(out_shape, dtype=torch.float32, device=device)
with torch.cuda.stream(stream):
input_ids = input_ids_cpu.to(device, non_blocking=True)
attention_mask = torch.ones((1, seq_len), dtype=torch.int64, device=device)
position_ids = torch.arange(seq_len, dtype=torch.int64, device=device).unsqueeze(0)
run_forward(
ctx,
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
logits_out=logits_gpu,
stream=stream,
)
stream.synchronize()
logits_np = logits_gpu.detach().cpu().numpy()
last_logits = logits_np[0, -1]
next_id = int(np.argmax(last_logits))
next_piece = tokenizer.decode([next_id], skip_special_tokens=False)
print(
f"[step {step}] seq_len={seq_len} logits.shape={logits_np.shape} "
f"next_id={next_id!r} next_token={next_piece!r}"
)
step += 1
if step >= n_steps:
break
input_ids_cpu = torch.cat([input_ids_cpu, torch.tensor([[next_id]], dtype=torch.int64)], dim=1)
else:
generated = 0
while generated < args.max_new_tokens:
seq_len = int(input_ids_cpu.shape[1])
if seq_len >= max_seq:
print("[stop] engine max sequence length (cannot append further)")
break
ctx.set_input_shape("input_ids", (1, seq_len))
ctx.set_input_shape("attention_mask", (1, seq_len))
ctx.set_input_shape("position_ids", (1, seq_len))
out_shape = tuple(ctx.get_tensor_shape("logits"))
logits_gpu = torch.empty(out_shape, dtype=torch.float32, device=device)
with torch.cuda.stream(stream):
input_ids = input_ids_cpu.to(device, non_blocking=True)
attention_mask = torch.ones((1, seq_len), dtype=torch.int64, device=device)
position_ids = torch.arange(seq_len, dtype=torch.int64, device=device).unsqueeze(0)
run_forward(
ctx,
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
logits_out=logits_gpu,
stream=stream,
)
stream.synchronize()
logits_np = logits_gpu.detach().cpu().numpy()
last_logits = logits_np[0, -1]
next_id = int(np.argmax(last_logits))
next_piece = tokenizer.decode([next_id], skip_special_tokens=False)
print(
f"[step {step}] seq_len={seq_len} logits.shape={logits_np.shape} "
f"next_id={next_id!r} next_token={next_piece!r}"
)
step += 1
if eos_id is not None and next_id == eos_id:
print("[stop] EOS token")
break
input_ids_cpu = torch.cat([input_ids_cpu, torch.tensor([[next_id]], dtype=torch.int64)], dim=1)
generated += 1
full = tokenizer.decode(input_ids_cpu[0].tolist(), skip_special_tokens=False)
reply_ids = input_ids_cpu[0, prompt_token_count:].tolist()
reply = tokenizer.decode(reply_ids, skip_special_tokens=True)
if args.max_chars > 0 and len(reply) > args.max_chars:
reply_out = reply[: args.max_chars] + "..."
truncated_note = f" (--max-chars {args.max_chars}, original {len(reply)} chars)"
else:
reply_out = reply
truncated_note = ""
print("--- full context decode (skip_special_tokens=False may keep chat tokens) ---")
print(full)
print("--- reply only (skip_special_tokens=True)" + truncated_note + " ---")
print(reply_out)
if __name__ == "__main__":
    main()
```