昇腾Ascend环境微调部署Qwen3（LlamaFactory+vLLM-Ascend）

环境介绍：

GitCode：

实例：NPU basic · 1 * NPU 910B · 16v CPU · 64GB

容器镜像：ubuntu22-cann8.5-python3.11-jupyter:v1.0.1-notebook

模型：千问3-0.6B

数据集：helloworld0/Brain_teasers、identity

LlamaFactory：v0.9.5.dev0

vLLM-Ascend：v0.13.0

1. 准备环境

1.1 启动Notebook实例

1.2 创建模型存放路径和虚拟环境

复制代码

mkdir models
python -m venv ~/llamafactory_env
python -m venv ~/vllm_env

2. 拉取模型

2.1 切换modelscope虚拟环境

复制代码

source modelscope_env/bin/activate

2.2 拉取模型

复制代码

pip install modelscope
modelscope download --model Qwen/Qwen3-0.6B --local_dir ~/models/Qwen/Qwen3-0.6B

3. LlamaFactory微调模型

3.1 微调任务

3.1.1 拉取仓库

复制代码

git clone https://github.com/hiyouga/LlamaFactory.git

3.1.2 修改 requirements/npu.txt 文件（这一步主要是要 llamafactory 和 vllm-ascend 的 transformers 版本一致，注意需要提前确认两边支持同一版本的 transformers ）

llama-factory测：

vllm-ascend测：

复制代码

cd LlamaFactory
vim npu.txt

末尾加上 transformers==4.57.4

3.1.3 配置镜像加速

复制代码

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple/

3.1.4 安装依赖项

复制代码

pip install -e .
pip install -r requirements/metrics.txt
pip install -r requirements/npu.txt

验证 LlamaFactory 安装情况

复制代码

llamafactory-cli env

复制代码

llamafactory-cli version

3.1.5 获取数据集

复制代码

cd 
cd LlamaFactory/data
modelscope download --dataset helloworld0/Brain_teasers data.json --local_dir ./

调整数据集格式，数据集最前和最后加上 [] 括号

复制代码

vim data.json

修改 dataset_info.json 文件

复制代码

vim dataset_info.json

注册新数据集

复制代码

"naojin": {
  "file_name": "data.json"
},

3.1.6 准备训练脚本

复制代码

mkdir my_yaml
mkdir my_yaml/train_lora
cp examples/train_lora/qwen3_lora_sft.yaml my_yaml/train_lora/

修改脚本，修改以下内容

复制代码

model_name_or_path: /opt/atomgit/models/Qwen/Qwen3-0.6B

dataset: identity,naojin

output_dir: saves/Qwen/Qwen3-0.6B/lora/sft

3.1.7 开始微调

复制代码

llamafactory-cli train my_yaml/train_lora/qwen3_lora_sft.yaml

训练过程中看下NPU利用率

复制代码

npu-smi info

训练结束

3.1.8 查看训练loss，路径: /opt/atomgit/LlamaFactory/saves/Qwen/Qwen3-0.6B/lora/sft/training_loss.png

3.2 合并模型

3.2.1 准备合并配置文件

复制代码

cd ~/LlamaFactory

mkdir my_yaml/merge_lora

cp examples/merge_lora/qwen3_lora_sft.yaml my_yaml/merge_lora/

复制代码

vim my_yaml/merge_lora/qwen3_lora_sft.yaml

修改以下字段

复制代码

model_name_or_path: /opt/atomgit/models/Qwen/Qwen3-0.6B

adapter_name_or_path: saves/Qwen/Qwen3-0.6B/lora/sft

export_dir: /opt/atomgit/models/Qwen/Qwen3-0.6B_sft_merged

3.2.2 合并模型

复制代码

llamafactory-cli export my_yaml/merge_lora/qwen3_lora_sft.yaml

4. 部署/测试模型

4.1 切换vllm环境

复制代码

cd
source vllm_env/bin/activate

4.2 安装依赖

复制代码

sudo sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list

sudo apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

复制代码

pip install vllm==0.13.0

pip install vllm-ascend==0.13.0

4.3 代码测试

创建文件 test.py

复制代码

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
    "你是谁",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

运行代码

python 复制代码

python test.py

4.4 部署测试微调后模型

复制代码

vllm serve /opt/atomgit/models/Qwen/Qwen3-0.6B_sft_merged \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name Qwen3-0.6B_sft_merged

复制代码

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3-0.6B_sft_merged",
  "messages": [
    {"role": "user", "content": "你是谁？"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 4096
}'

复制代码

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3-0.6B_sft_merged",
  "messages": [
    {"role": "user", "content": "什么门永远关不上？"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 4096
}'

复制代码

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3-0.6B_sft_merged",
  "messages": [
    {"role": "user", "content": "小明晚上看文艺表演，为啥有一个演员总是背对观众？"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 4096
}'

4.5 部署测试微调前模型做对比

复制代码

vllm serve /opt/atomgit/models/Qwen/Qwen3-0.6B \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name Qwen3-0.6B_sft_merged

复制代码

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3-0.6B_sft_merged",
  "messages": [
    {"role": "user", "content": "你是谁？"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 4096
}'

复制代码

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3-0.6B_sft_merged",
  "messages": [
    {"role": "user", "content": "什么门永远关不上？"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 4096
}'