Author: Ascend PAE Technical Support Team
Introduction to the Ascend case library: https://agent.blog.csdn.net/article/details/155446713
Early preview of Ascend cases: https://gitcode.com/invite/link/8791cccc43cb4ee589e8
(If you have questions about this article, please file an issue in the case library; a dedicated person will answer.)
Preface
By scenario, vLLM inference deployment can be divided into: offline and online, single-node, multi-node, PD co-location (mixed deployment), and PD disaggregation (prefill/decode separation).
This page collects step-by-step notes and links for deploying vLLM-Ascend on Ascend clusters, and may be updated at any time.
| Resource | Link |
|---|---|
| Official deployment tutorials | https://vllm-ascend.readthedocs.io/en/latest/tutorials/index.html |
| Community packages for environment setup | https://www.hiascend.com/developer/download/community/result?module=ie+pt+cann |
Environment Setup
Prepare the image
- Use the vLLM-Ascend image; download it from the links below and load it with:
docker load < vllm_ascend-xxxx
- vLLM-Ascend image: image
- Community CANN package: community CANN package
Launch the container
Reference command for launching the container:
docker run --name ${container_name} -it -d --net=host --shm-size=500g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /mnt:/mnt \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
${image_id}
image_id: the image identifier (the IMAGE ID field in the output of the docker images command).
container_name: the name for the launched container; set it to whatever you like.
Install vLLM
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout main
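# VLLM_TARGET_DEVICE=empty installs vLLM without compiling any device-specific kernels; NPU support is provided separately by vllm-ascend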
VLLM_TARGET_DEVICE=empty pip install -e .
Install vllm-ascend
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -r requirements.txt
python setup.py develop  # or: pip install -e .
Modify PYTHONPATH (optional):
export PYTHONPATH=/home/vllm:/home/vllm-ascend/:${PYTHONPATH}
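To verify the installation, both packages should import cleanly inside the container (a minimal check; the printed version depends on the branch you checked out):

import vllm
import vllm_ascend  # should import without error on an Ascend host

print(vllm.__version__)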
vLLM Deployment by Scenario (using A3 as an example)
Offline Inference
Offline single-node deployment
from vllm import LLM, SamplingParams
import torch_npu  # noqa: F401  # makes NPU device support available in PyTorch

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/home/data/Qwen3-14B", tensor_parallel_size=1)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Online Inference
What a successful service launch looks like:
(screenshot omitted)
Launch the service with a script like the following:
Online single-node deployment
export HCCL_IF_IP=xxx # local ip
# export GLOO_SOCKET_IFNAME="enp194s0f0" # network card name
# export TP_SOCKET_IFNAME="enp194s0f0"
# export HCCL_SOCKET_IFNAME="enp194s0f0"
export HCCL_BUFFSIZE=256
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_DETERMINISTIC="true" # deterministic computation
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export MINDIE_LOG_TO_STDOUT="benchmark:1; client:1"
export VLLM_USE_V1=1
export VLLM_ENABLE_MC2=1
# export VLLM_VERSION=0.9.1
# export DUMP_GE_GRAPH=3
# export ASCEND_LAUNCH_BLOCKING=1 # run ops synchronously when debugging errors, to localize failures
export VLLM_ASCEND_TRACE_RECOMPILES=1 # trace recompilations; enable for detailed diagnostics
# export ASCEND_GLOBAL_LOG_LEVEL=0 # plog switch; enable for detailed operator logs
# export VLLM_LOGGING_LEVEL="DEBUG" # vLLM logging level
LOG_FILE="./mtp_tp16/mtp_log_$(date +%Y%m%d_%H%M).log" # log to disk (create ./mtp_tp16 first)
python -m vllm.entrypoints.openai.api_server \
--model="/home/data/DeepSeek-R1_w8a8/" \
--trust-remote-code \
--max-model-len 34500 \
--no-enable-prefix-caching \
--tensor-parallel-size 16 \
--data-parallel-size 1 \
--served-model-name deepseekr1 \
--max-num-seqs 16 \
--max-num-batched-tokens 1024 \
--enable-expert-parallel \
--quantization ascend \
--host 0.0.0.0 \
--port 1234 \
--additional-config '{"ascend_scheduler_config":{"enabled":false}, "torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}' \
--gpu-memory-utilization 0.90 > $LOG_FILE 2>&1
In a separate terminal, send a request:
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model": "deepseekr1","messages": [{"role": "user", "content": "请介绍一下杭州"}],"max_tokens": 60}' http://0.0.0.0:1234/v1/chat/completions
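The same endpoint can also be called from Python via the OpenAI-compatible client (a minimal sketch; assumes the openai package is installed and the server above is listening on port 1234):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:1234/v1", api_key="EMPTY")  # a key is required by the client but unused by vLLM
resp = client.chat.completions.create(
    model="deepseekr1",
    messages=[{"role": "user", "content": "请介绍一下杭州"}],
    max_tokens=60,
)
print(resp.choices[0].message.content)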
Online multi-node PD co-location
(using two A3 nodes as an example)
Parallelism is controlled by --data-parallel-size and --tensor-parallel-size. Note that --data-parallel-size here is the global DP count; the per-node local DP count is set with --data-parallel-size-local, and each non-master node must set --data-parallel-start-rank according to its position in the node order (see the illustration below).
Cross-node communication: all nodes must use the same --data-parallel-address and --data-parallel-rpc-port (using the IP of the DP rank 0 node is recommended).
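How these flags fit together in this two-node example (a toy calculation, not part of the deployment scripts; node0 keeps the default start rank 0, so the flag is omitted there):

data_parallel_size = 32        # global DP count across both nodes
data_parallel_size_local = 16  # DP ranks hosted on each node (A3: 16 dies)
for node, start in enumerate(range(0, data_parallel_size, data_parallel_size_local)):
    end = start + data_parallel_size_local - 1
    print(f"node{node}: --data-parallel-start-rank {start} -> DP ranks {start}..{end}")
# node0: --data-parallel-start-rank 0  -> DP ranks 0..15
# node1: --data-parallel-start-rank 16 -> DP ranks 16..31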
Master node node0:
export HCCL_IF_IP=xxx # local ip
export GLOO_SOCKET_IFNAME="xxx" # network card name; differs per machine, check with ifconfig
export TP_SOCKET_IFNAME="xxx"
export HCCL_SOCKET_IFNAME="xxx"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_DETERMINISTIC="true" # deterministic computation
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export MINDIE_LOG_TO_STDOUT="benchmark:1; client:1"
export VLLM_USE_V1=1
export VLLM_ENABLE_MC2=1
# export VLLM_VERSION=0.9.1
# export DUMP_GE_GRAPH=3
# export ASCEND_LAUNCH_BLOCKING=1 # run ops synchronously when debugging errors, to localize failures
export VLLM_ASCEND_TRACE_RECOMPILES=1 # trace recompilations; enable for detailed diagnostics
# export ASCEND_GLOBAL_LOG_LEVEL=0 # plog switch; enable for detailed operator logs
# export VLLM_LOGGING_LEVEL="DEBUG" # vLLM logging level
LOG_FILE="./mtp_tp16/mtp_log_$(date +%Y%m%d_%H%M).log" # log to disk (create ./mtp_tp16 first)
python -m vllm.entrypoints.openai.api_server \
--model="/home/data/DeepSeek-R1_w8a8/" \
--trust-remote-code \
--max-model-len 34500 \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 32 \
--data-parallel-size-local 16 \
--data-parallel-address ${node0 ip} \
--data-parallel-rpc-port 13345 \
--served-model-name deepseekr1 \
--max-num-seqs 16 \
--max-num-batched-tokens 1024 \
--enable-expert-parallel \
--quantization ascend \
--host 0.0.0.0 \
--port 1234 \
--additional-config '{"ascend_scheduler_config":{"enabled":false}, "torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}' \
--gpu-memory-utilization 0.90 > $LOG_FILE 2>&1
Worker node node1: adds two extra flags, --headless and --data-parallel-start-rank
export HCCL_IF_IP=xxx # local ip
export GLOO_SOCKET_IFNAME="xxx" # network card name; differs per machine, check with ifconfig
export TP_SOCKET_IFNAME="xxx"
export HCCL_SOCKET_IFNAME="xxx"
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_DETERMINISTIC="true" # deterministic computation
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export MINDIE_LOG_TO_STDOUT="benchmark:1; client:1"
export VLLM_USE_V1=1
export VLLM_ENABLE_MC2=1
# export VLLM_VERSION=0.9.1
# export DUMP_GE_GRAPH=3
# export ASCEND_LAUNCH_BLOCKING=1 # run ops synchronously when debugging errors, to localize failures
export VLLM_ASCEND_TRACE_RECOMPILES=1 # trace recompilations; enable for detailed diagnostics
# export ASCEND_GLOBAL_LOG_LEVEL=0 # plog switch; enable for detailed operator logs
# export VLLM_LOGGING_LEVEL="DEBUG" # vLLM logging level
LOG_FILE="./mtp_tp16/mtp_log_$(date +%Y%m%d_%H%M).log" # log to disk (create ./mtp_tp16 first)
python -m vllm.entrypoints.openai.api_server \
--model="/home/data/DeepSeek-R1_w8a8/" \
--trust-remote-code \
--max-model-len 34500 \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 32 \
--headless \
--data-parallel-size-local 16 \
--data-parallel-start-rank 16 \
--data-parallel-address ${node0 ip} \
--data-parallel-rpc-port 13345 \
--served-model-name deepseekr1 \
--max-num-seqs 16 \
--max-num-batched-tokens 1024 \
--enable-expert-parallel \
--quantization ascend \
--host 0.0.0.0 \
--port 1234 \
--additional-config '{"ascend_scheduler_config":{"enabled":false}, "torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}' \
--gpu-memory-utilization 0.90 > $LOG_FILE 2>&1
Launch the scripts on both machines at the same time.
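Since node1 runs headless, the OpenAI API server lives only on node0, so requests go to node0's port 1234 (a minimal smoke test; replace the placeholder with node0's actual IP, and it assumes the requests package is installed):

import requests

resp = requests.post(
    "http://<node0 ip>:1234/v1/chat/completions",
    json={
        "model": "deepseekr1",
        "messages": [{"role": "user", "content": "请介绍一下杭州"}],
        "max_tokens": 60,
    },
)
print(resp.json())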
Online multi-node PD disaggregation
For more details, see the vllm-ascend repo README: official PD-disaggregation guide
This example uses two A3 nodes: one P (prefill) node and one D (decode) node.
For 4- or 8-node setups with multiple P and D nodes, configure the master nodes as shown below and configure the worker nodes following the master/worker setup in the "online multi-node PD co-location" section above.
Generate the ranktable
Refer to the official PD-disaggregation guide; the commands are as follows.
cd vllm-ascend/examples/disaggregate_prefill_v1/
# The two *-device-cnt values are counted per die (an A3 node has 8 cards = 16 dies); the first IP is the master node. Run this command on both machines at the same time.
bash gen_ranktable.sh --ips x.x.x.101 x.x.x.136 \
--npus-per-node 16 --network-card-name enp48s3u1u1 --prefill-device-cnt 16 --decode-device-cnt 16
# This generates ranktable.json under vllm-ascend/examples/disaggregate_prefill_v1/
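A quick way to confirm the file was generated and parses as valid JSON (the schema itself is described in the official guide):

import json

with open("ranktable.json") as f:
    table = json.load(f)
print(list(table.keys()))  # top-level sections of the ranktable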
Confirming the machines are connected: if the gen_ranktable command line is correct but generating the ranktable still fails, troubleshoot as follows:
- First confirm the machines can reach each other (see the probe sketch after this list).
- Confirm the firewall is disabled; it can be disabled with:
systemctl stop firewalld
systemctl disable firewalld
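A minimal TCP reachability probe between the nodes (illustrative only; substitute the peer's IP and any port known to be open there, e.g. sshd's port 22):

import socket

PEER = ("x.x.x.136", 22)  # hypothetical peer IP and port; adjust to your cluster
try:
    socket.create_connection(PEER, timeout=3).close()
    print("peer reachable")
except OSError as exc:
    print(f"peer unreachable: {exc}")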
P node master node
export HCCL_IF_IP=xxx # P node host ip
export GLOO_SOCKET_IFNAME="xxx" # network card name
export TP_SOCKET_IFNAME="xxx"
export HCCL_SOCKET_IFNAME="xxx"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/home/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export HCCL_DETERMINISTIC=True
export VLLM_VERSION=0.9.1
LOG_FILE="./mtp_pd/mtp_p_log_$(date +%Y%m%d_%H%M).log"
vllm serve /home/data/DeepSeek-R1_w8a8/ \
--host 0.0.0.0 \
--port 1234 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address ${P node host ip} \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--api-server-count 2 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseekr1 \
--max-model-len 16000 \
--max-num-batched-tokens 4000 \
--trust-remote-code \
--quantization ascend \
--gpu-memory-utilization 0.9 \
--enforce-eager \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "enable_weight_nz_layout":true}' | tee $LOG_FILE 2>&1
D node master node
export HCCL_IF_IP=xxx # D node host ip
export GLOO_SOCKET_IFNAME="xxx"
export TP_SOCKET_IFNAME="xxx"
export HCCL_SOCKET_IFNAME="xxx"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/home/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export VLLM_ENABLE_MC2=1
export VLLM_ASCEND_TRACE_RECOMPILES=1
export HCCL_DETERMINISTIC=True
export VLLM_VERSION=0.9.1
LOG_FILE="./mtp_pd/mtp_d_log_$(date +%Y%m%d_%H%M).log"
vllm serve /home/data/DeepSeek-R1_w8a8/ \
--host 0.0.0.0 \
--port 1234 \
--data-parallel-size 2 \
--data-parallel-size-local 2 \
--data-parallel-address ${D node host ip} \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseekr1 \
--max-model-len 8192 \
--max-num-batched-tokens 100 \
--max-num-seqs 14 \
--trust-remote-code \
--quantization ascend \
--gpu-memory-utilization 0.88 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled":true, "enable_multistream_shared_expert":false, "graph_batch_sizes":[14]}, "enable_weight_nz_layout":true}' | tee $LOG_FILE 2>&1
Launch the toy proxy (can be started after the vLLM services are up)
The proxy distributes incoming requests across the P and D nodes.
Launch the proxy on the master P node:
cd vllm-ascend/examples/disaggregate_prefill_v1/
python load_balance_proxy_server_example.py \
    --host ${P node host ip} --port 1025 \
    --prefiller-hosts ${P node host ip} --prefiller-port 1234 \
    --decoder-hosts ${D node host ip} --decoder-ports 1234
Send a request
Difference: requests are now forwarded through the proxy, so use the port the proxy exposes (1025 here).
curl http://0.0.0.0:1025/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseekr1",
"prompt": "Who are you?",
"max_tokens": 10,
"temperature": 0
}'