Local environment:
GPU: 4090 (48GB) * 2
RAM: 512GB
Multiple model instances run in Docker, with the goal of reusing GPU memory.
Image used: vllm/vllm-openai:nightly
Run command:

```bash
docker run -d --name glm-4.7-flash-vllm \
  --gpus '"device=0,1"' \
  -v /home/ls/.cache/modelscope/hub/models/ZhipuAI/GLM-4.7-Flash:/app/models \
  --ipc=host \
  -p 8003:8000 \
  vllm/vllm-openai:nightly \
  --model /app/models \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --max-model-len=32768 \
  --max_num_seqs=8 \
  --served-model-name glm-4.7-flash
```
Unsurprisingly, this errors out:

```
(APIServer pid=1) You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=1) For further information visit https://errors.pydantic.dev/2.12/v/value_error
```

In other words: the transformers version shipped inside the nightly image is still too old.
Upgrading the transformers version
- Build a local Docker image
```bash
mkdir -p ~/vllm-glm47 && cd ~/vllm-glm47
cat > Dockerfile << 'EOF'
FROM vllm/vllm-openai:nightly
RUN pip install --upgrade transformers accelerate -q
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF
docker build -t vllm-glm47:latest .
```
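To confirm the upgrade actually took, you can compare the installed transformers version against a minimum. A minimal sketch; the 4.54.0 threshold is a placeholder assumption, substitute whatever version the GLM-4.7-Flash model card actually requires:

```python
import re

def version_tuple(v: str) -> tuple:
    """Parse a dotted version string (e.g. '4.55.0.dev0') into a
    comparable tuple of ints, ignoring non-numeric suffixes."""
    return tuple(int(re.sub(r"\D.*", "", p) or 0) for p in v.split(".")[:3])

def meets_minimum(installed: str, required: str = "4.54.0") -> bool:
    """True if the installed version is at least the required one.
    NOTE: '4.54.0' is a hypothetical minimum, not a confirmed requirement."""
    return version_tuple(installed) >= version_tuple(required)
```

Run it inside the container, e.g. `python -c "import transformers; print(transformers.__version__)"`, and feed the printed version to `meets_minimum`.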
- Change the run command to:
```bash
docker run -d --name glm-4.7-flash-vllm \
  --gpus '"device=0,1"' \
  -v /home/ls/.cache/modelscope/hub/models/ZhipuAI/GLM-4.7-Flash:/app/models \
  --ipc=host \
  -p 8003:8000 \
  vllm-glm47:latest \
  --model /app/models \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --max-model-len=32768 \
  --max_num_seqs=8 \
  --served-model-name glm-4.7-flash
```
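Once the container is up, a quick smoke test against the OpenAI-compatible endpoint helps. A standard-library-only sketch; the port (8003) and model name (glm-4.7-flash) follow the command above and must match your actual deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8003/v1"  # port mapped in the docker run command

def build_chat_request(prompt: str, model: str = "glm-4.7-flash") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call `chat("hello")` after `docker logs` shows the server is ready; it should return the model's reply.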
- Of course, you can also upgrade and test interactively in a container shell:
```bash
# 0. Open a bash shell in the container
docker run -it --rm \
  --gpus '"device=1"' \
  -v /home/ls/.cache/modelscope/hub/models/ZhipuAI/GLM-4.7-Flash:/app/models \
  --ipc=host \
  -p 8003:8000 \
  --entrypoint /bin/bash \
  vllm/vllm-openai:nightly
# 1. Upgrade dependencies
pip install --upgrade transformers accelerate
# 2. Verify the version
python -c "import transformers; print(transformers.__version__)"
# 3. Start the server manually
python -m vllm.entrypoints.openai.api_server \
  --model /app/models \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --max-model-len=32768 \
  --max_num_seqs=8 \
  --served-model-name glm-4.7-flash
```
If you run out of GPU memory, lower the relevant parameters: `--max-model-len=32768` and `--max_num_seqs=8`.
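Those two flags matter because worst-case KV-cache use grows linearly in both context length and concurrent sequences. A back-of-the-envelope sketch; the layer/head/dim numbers below are placeholder assumptions, not GLM-4.7-Flash's real config:

```python
def kv_cache_gib(max_model_len: int, max_num_seqs: int,
                 num_layers: int = 40, num_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Worst-case KV-cache size in GiB: 2 (K and V) * layers * kv_heads *
    head_dim * bytes per element, per token, times max tokens in flight.
    The default dims are illustrative placeholders."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return max_model_len * max_num_seqs * per_token_bytes / 1024**3

# Halving either flag halves the worst-case KV footprint:
print(f"{kv_cache_gib(32768, 8):.1f} GiB")   # current settings -> 40.0 GiB
print(f"{kv_cache_gib(16384, 8):.1f} GiB")   # halved context  -> 20.0 GiB
```

With tensor parallelism the cache is split across the two GPUs, so divide the result by 2 per card; the point is simply that `--max-model-len` and `--max_num_seqs` are the two knobs bounding it.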
