This article covers two ways to do a basic vLLM inference deployment on Huawei Ascend NPUs.
Ascend open source: Ascend open-source documentation (Ascend Community)
Reference documentation: Ascend documentation (Ascend Community)
Environment: CANN 8.0
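Before starting with either method, it is worth confirming on the host that the Ascend driver and NPUs are visible (a quick sanity check; npu-smi is installed together with the driver):
# List NPUs and driver status
npu-smi info
# Driver version (the same file is mounted into the containers later in this article)
cat /usr/local/Ascend/driver/version.info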
Part 1: Build an NPU vLLM image
1. Install vLLM
git clone --depth 1 --branch v0.7.3 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install . --extra-index-url https://download.pytorch.org/whl/cpu/
2. Install vLLM Ascend
git clone --depth 1 --branch v0.7.3rc1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/
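A quick check that both packages landed (the versions should match the branches cloned above):
# Confirm that vllm and vllm-ascend are installed
pip list | grep vllm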
3. Install torch-npu
mkdir pta
cd pta
wget https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250320.3/pytorch_v2.5.1_py310.tar.gz
tar -xvf pytorch_v2.5.1_py310.tar.gz
pip install ./torch_npu-2.5.1.dev20250320-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
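Before moving on, a minimal import check helps catch environment problems early (a sketch, assuming the torch_npu wheel above; torch_npu registers the torch.npu backend):
# Verify that torch and torch_npu import cleanly and that NPUs are visible
python3 -c "import torch, torch_npu; print(torch.npu.is_available(), torch.npu.device_count())"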
- If you run into a torchvision version problem, install the matching torchvision version manually. After torchvision is installed, reinstall vLLM and vLLM-Ascend (re-run the final pip install command of steps 1 and 2) to avoid package conflicts; see the example below.
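A sketch of that fix, assuming torch 2.5.1 (the matching torchvision is typically 0.20.x; confirm against the official compatibility table):
# Install a torchvision build matching torch 2.5.1 (0.20.1 is an assumption; check the compatibility matrix)
pip install torchvision==0.20.1
# Then, from the vllm and vllm-ascend source directories, re-run the final install of steps 1 and 2:
#   VLLM_TARGET_DEVICE=empty pip install . --extra-index-url https://download.pytorch.org/whl/cpu/
#   pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/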
4. Code test
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
5. Serve the model with vLLM
Reference: Usage differences - vLLM - MindIE open-source third-party serving framework adaptation guide - MindIE 1.0.0 developer documentation (Ascend Community)
vllm serve /root/autodl-tmp/Qwen/Qwen2.5-0.5B-Instruct \
--served-model-name Qwen2.5-0.5B-Instruct \
--max-model-len 4096 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--trust-remote-code \
--api-key 123321
curl test:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 123321" \
-d '{
"model": "Qwen2.5-0.5B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "你好,介绍一下你自己"}
]
}'
Part 2: Use the open-source image
vLLM Ascend: https://vllm-ascend.readthedocs.io/en/latest/installation.html
1. Pull the code and build the image
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
Official startup example:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.2rc1
docker run --rm \
--name vllm-ascend-env \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
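Once inside the container, a quick check confirms the mapped NPU is usable (a sketch; npu-smi and the driver libraries are mounted from the host by the command above):
# Inside the container: confirm the mapped NPU is visible and the torch_npu backend works
npu-smi info
python3 -c "import torch, torch_npu; print(torch.npu.is_available())"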
Part 3: To simplify offline deployment, one-click deployment scripts based on the image downloaded above are provided next.
*** LLM model ***
3.1 Dockerfile
FROM vllm-ascend-dev-image:latest
WORKDIR /home
ENV ASCEND_RT_VISIBLE_DEVICES=2,3,4,5,6,7
COPY start_model.sh /home/start_model.sh
RUN chmod +x /home/start_model.sh
CMD ["/home/start_model.sh"]
3.2 start_model.sh
#!/bin/bash
nohup vllm serve /home/Qwen2.5-72B-Instruct \
--served-model-name Qwen2.5-72B-Instruct \
--max_model_len 20480 \
--host 0.0.0.0 \
--port 6006 \
--tensor-parallel-size 6 \
--dtype float16 \
--max-num-seqs 200 \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--api-key sc-e6NHaCv5TY43BbISCJ > /home/vllm.log 2>&1 &
tail -f /home/vllm.log
3.3 One-click deployment script: deploy.sh
#!/bin/bash
IMAGE_NAME="scxx_event_horizon:v1"
CONTAINER_NAME="event"
MODEL_HOST_PATH="/tmp/sc_bid/models_pkg/Qwen2.5-72B-Instruct"
MODEL_CONTAINER_PATH="/home/Qwen2.5-72B-Instruct"
API_PORT=40201
echo "=============================Building image...============================="
if [[ "$(docker images -f "name=${IMAGE_NAME}" --format "{{.Repository}}")" != "${IMAGE_NAME%%:*}" ]]; then
echo "***** image ${IMAGE_NAME} not exist, currently under construction..."
docker build -t ${IMAGE_NAME} .
else
echo "***** image ${IMAGE_NAME} Already exists, skip build。"
fi
if [ "$(docker ps -a -f "name=${CONTAINER_NAME}" --format "{{.Status}}")" ]; then
echo "***** Stop and delete old containers: ${CONTAINER_NAME}"
docker stop ${CONTAINER_NAME}
docker rm ${CONTAINER_NAME}
fi
echo "***** Start new container: ${CONTAINER_NAME}"
docker run -d \
--name ${CONTAINER_NAME} \
--restart always \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v ${MODEL_HOST_PATH}:${MODEL_CONTAINER_PATH} \
-p ${API_PORT}:6006 \
${IMAGE_NAME}
echo "Success!!!"
Test 1: api_test.py (single question)
from openai import OpenAI

model_name = "Qwen2.5-72B-Instruct"
api_key = "sc-e6NHaCv5TY43BbISCJ"
# 6006 is the in-container port; when calling from the host, use the mapped port (40201 in deploy.sh).
base_url = "http://127.0.0.1:6006/v1/"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)


def qwen_api(prompt_txt):
    completion = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt_txt}],
        response_format={"type": "json_object"},
    )
    ai_result = completion.choices[0].message.content
    print(ai_result)


prompt_ = "Please check the garbage exposure situation on Chang'an Street at 3 PM and then convert it into JSON format."
qwen_api(prompt_)
Parallel test:
import time
import platform
import asyncio
from openai import AsyncOpenAI

model_name = "Qwen2.5-72B-Instruct"
api_key = "sc-e6NHaCv5TY43BbISCJ"
# As above, use the host-mapped port (40201 in deploy.sh) when calling from outside the container.
base_url = "http://127.0.0.1:6006/v1/"


async def batch_api(query_dict, async_client):
    response = await async_client.chat.completions.create(
        messages=[{"role": "user", "content": query_dict["query"]}],
        model=model_name,
        response_format={"type": "json_object"},
        timeout=3 * 60,
    )
    res_content = response.choices[0].message.content
    return res_content


async def batch_main(query_list):
    async_client = AsyncOpenAI(
        api_key=api_key,
        base_url=base_url,
    )
    tasks = [batch_api(query_dict, async_client) for query_dict in query_list]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results


def batch_async_run(query_list):
    start_time = time.time()
    res_list = []
    if platform.system() == "Windows":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    # Send the queries in batches of 100 concurrent requests.
    for i in range(0, len(query_list), 100):
        batch = query_list[i : i + 100]
        results = loop.run_until_complete(batch_main(batch))
        for idx, result in enumerate(results):
            print(f"index : {idx} , res : {result}")
            res_list.append(result)
    end_time = time.time()
    print(
        f"******* async_run model parsing completed, elapsed: {end_time - start_time:.2f}s ******"
    )
    return res_list


query_list = [
    {"query": "Please check the garbage exposure situation on beijing Street at 1 PM and then convert it into JSON format."},
    {"query": "Please check the garbage exposure situation on Chang'an Street at 3 PM and then convert it into JSON format."},
    {"query": "Please check the garbage exposure situation on shanghai Street at 4 PM and then convert it into JSON format."},
]
batch_async_run(query_list)
*** Embedding model ***
Dockerfile
FROM vllm-ascend-dev-image:latest
WORKDIR /home
ENV ASCEND_RT_VISIBLE_DEVICES=0,1
COPY start_model.sh /home/start_model.sh
RUN chmod +x /home/start_model.sh
CMD ["/home/start_model.sh"]
start_model.sh
#!/bin/bash
nohup vllm serve /home/bge-m3 \
--served-model-name Bge_Emb \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--host 0.0.0.0 \
--port 9892 \
--api-key sc-16Nwadv5Tg43BbaSCs \
--max-num-seqs 200 \
--max_model_len 8192 \
--task embed > /home/vllm.log 2>&1 &
tail -f /home/vllm.log
deploy.sh
#!/bin/bash
IMAGE_NAME="scxx_quantum_foam:v1"
CONTAINER_NAME="quantum"
MODEL_HOST_PATH="/tmp/sc_bid/models_pkg/bge-m3"
MODEL_CONTAINER_PATH="/home/bge-m3"
API_PORT=40202
echo "=============================Building image...============================="
if [[ "$(docker images -f "name=${IMAGE_NAME}" --format "{{.Repository}}")" != "${IMAGE_NAME%%:*}" ]]; then
echo "***** image ${IMAGE_NAME} not exist, currently under construction..."
docker build -t ${IMAGE_NAME} .
else
echo "***** image ${IMAGE_NAME} Already exists, skip build。"
fi
if [ "$(docker ps -a -f "name=${CONTAINER_NAME}" --format "{{.Status}}")" ]; then
echo "***** Stop and delete old containers: ${CONTAINER_NAME}"
docker stop ${CONTAINER_NAME}
docker rm ${CONTAINER_NAME}
fi
echo "***** Start new container: ${CONTAINER_NAME}"
docker run -d \
--name ${CONTAINER_NAME} \
--restart always \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v ${MODEL_HOST_PATH}:${MODEL_CONTAINER_PATH} \
-p ${API_PORT}:9892 \
${IMAGE_NAME}
echo "Success!!!"
Test: emb_test.py
from openai import OpenAI

model_name = "Bge_Emb"
api_key = "sc-16Nwadv5Tg43BbaSCs"
# 9892 is the in-container port; when calling from the host, use the mapped port (40202 in deploy.sh).
base_url = "http://127.0.0.1:9892/v1/"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)


def bge_api(input_list):
    res = client.embeddings.create(
        model=model_name,
        input=input_list,
    )
    ai_result = res.model_dump()
    print(ai_result)


input_list = ["北京天气怎么样", "上海天气怎么样"]
bge_api(input_list)
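The same service can also be exercised with curl through the OpenAI-compatible /v1/embeddings route (a sketch, assuming the host-mapped port 40202 from deploy.sh and the API key above):
curl http://localhost:40202/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sc-16Nwadv5Tg43BbaSCs" \
    -d '{
        "model": "Bge_Emb",
        "input": ["北京天气怎么样", "上海天气怎么样"]
    }'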