Source: official configurations from recipes.vllm.ai
DeepSeek-V4-Pro is the flagship model of the DeepSeek V4 preview series: an MoE architecture with 1.6T total parameters and 49B active parameters, whose checkpoint weighs in at roughly 960 GB. Based on the official vLLM Recipes configurations, this post walks through deployment on six mainstream GPU platforms.
1. Model Overview
| Metric | Value |
|---|---|
| Total parameters | 1.6 trillion (1600B) |
| Active parameters | 49 billion |
| Context length | up to 1,048,576 tokens (1M) |
| Precision | mixed FP4 + FP8 |
| Checkpoint size | ~960 GB |
| vLLM version | ≥ 0.20.1 |
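Because the ~960 GB checkpoint takes a long time to fetch on first start, it helps to pre-download it into the Hugging Face cache that every container below mounts. A minimal sketch using `huggingface-cli` (the repo id comes from the launch commands below; the default cache path matches the `-v ~/.cache/huggingface:...` mount):

```bash
# Pre-populate the Hugging Face cache so the container skips the initial download.
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro
```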
2. Docker Image Selection
| Image | CUDA Version | Target Platforms |
|---|---|---|
| `vllm/vllm-openai:deepseekv4-cu129` | CUDA 12.9 | H100, H200, B200 |
| `vllm/vllm-openai:deepseekv4-cu130` | CUDA 13 | GB200, B300, GB300 |
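Which image to pull follows from the driver installed on the host. A quick way to check before pulling (standard `nvidia-smi` queries, nothing vLLM-specific):

```bash
# Driver version per GPU; the nvidia-smi banner also shows
# the highest CUDA version the driver supports.
nvidia-smi --query-gpu=index,name,driver_version --format=csv
nvidia-smi | head -n 4
```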
3. H200 Deployment (Single Node, TP8 + EP)
Hardware: 1 node × 8× H200 (141 GB × 8 = 1128 GB)
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu129 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-model-len 800000 \
--gpu-memory-utilization 0.95 \
--compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
4. B200 Deployment (Single Node, TP8 + EP)
Hardware: 1 node × 8× B200 (180 GB × 8 = 1440 GB)
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
--attention_config.use_fp4_indexer_cache=True
```
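Note that unlike the H200 command, no `--max-model-len` cap is set here: 1440 GB of total VRAM leaves enough KV-cache headroom for the full context. Before launching, it is worth confirming that all eight GPUs are visible with the expected capacity (plain `nvidia-smi`):

```bash
# All 8 GPUs should report ~180 GB each.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```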
5. GB200 NVL4 Deployment (Multi-Node DEP)
Hardware: 2 trays × 4× GB200 = 8 GPUs
Note: a single tray provides 768 GB, less than the 960 GB checkpoint, so 2 trays are required.
Pre-Deployment Setup
```bash
export HEAD_IP="192.168.1.100"  # replace with Tray 0's actual IP
```
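Before launching, confirm the trays can reach each other over the interconnect. A minimal reachability check, run from Tray 1 and assuming ICMP is permitted:

```bash
# Tray 1 -> Tray 0 reachability
ping -c 3 "$HEAD_IP"
```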
Tray 0 (Head)
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-hybrid-lb \
--data-parallel-size 8 \
--data-parallel-size-local 4 \
--data-parallel-address "$HEAD_IP" \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True
```
Tray 1 (Worker)
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-hybrid-lb \
--data-parallel-size 8 \
--data-parallel-size-local 4 \
--data-parallel-address "$HEAD_IP" \
--data-parallel-start-rank 4 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True
```
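With `--data-parallel-hybrid-lb`, each tray runs its own API server on port 8000. Rather than watching logs, you can poll vLLM's `/health` endpoint until the local server is ready; a simple sketch:

```bash
# Poll until the local API server returns HTTP 200 on /health.
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for vLLM to become ready..."
  sleep 30
done
echo "vLLM is up"
```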
6. B300 Deployment (Single Node, TP8 + EP)
Hardware: 1 node × 8× B300 (268 GB × 8 = 2144 GB)
Note: this is the platform with the most single-node VRAM headroom; no multi-node setup is needed.
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
--attention_config.use_fp4_indexer_cache=True
```
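Since no `--max-model-len` is passed, the server should default to the model's full 1M context. One way to confirm what is actually served (vLLM includes a `max_model_len` field in its `/v1/models` model card; the `.get()` guards against versions that omit it):

```bash
# Print the served model id and its effective context length.
curl -s http://localhost:8000/v1/models \
  | python3 -c "import json,sys; m=json.load(sys.stdin)['data'][0]; print(m['id'], m.get('max_model_len'))"
```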
7. GB300 NVL4 Deployment (Multi-Node DEP)
Hardware: 2 trays × 4× GB300 = 8 GPUs
The pre-deployment setup is the same as for GB200: export HEAD_IP on both trays.
Tray 0 (Head)
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-hybrid-lb \
--data-parallel-size 8 \
--data-parallel-size-local 4 \
--data-parallel-address "$HEAD_IP" \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True
```
Tray 1 (Worker)
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-hybrid-lb \
--data-parallel-size 8 \
--data-parallel-size-local 4 \
--data-parallel-address "$HEAD_IP" \
--data-parallel-start-rank 4 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True
```
8. H100 Deployment (Multi-Node DEP)
Hardware: 2 nodes × 8× H100 = 16 GPUs
Note: a single node provides 640 GB, less than the 960 GB checkpoint, so multi-node is required.
Pre-Deployment Setup (run on both nodes)
```bash
# 1. Set the head node IP (Node 0's address)
export HEAD_IP="192.168.1.100"  # replace with Node 0's actual IP
# 2. Pull the Docker image
docker pull vllm/vllm-openai:deepseekv4-cu129
# 3. Prepare the model cache directory
mkdir -p ~/.cache/huggingface
# 4. NCCL networking environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0  # replace with the actual RDMA NIC name
```
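If you are unsure which interface name to put in `NCCL_SOCKET_IFNAME`, list the candidates first. On Mellanox/OFED hosts, `ibdev2netdev` (when installed) maps RDMA devices to their network interface names:

```bash
# List network interfaces, then (if available) the IB-device-to-netdev mapping.
ip -o link show | awk -F': ' '{print $2}'
command -v ibdev2netdev >/dev/null && ibdev2netdev
```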
Node 0 (Master / Head)
```bash
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu129 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-hybrid-lb \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-address "$HEAD_IP" \
--max-model-len 800000 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 512 \
--max-num-batched-tokens 512 \
--no-enable-flashinfer-autotune \
--compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
```
Node 1 (Worker)
```bash
docker run --gpus all \
--privileged --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu129 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-hybrid-lb \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-address "$HEAD_IP" \
--data-parallel-start-rank 8 \
--max-model-len 800000 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 512 \
--max-num-batched-tokens 512 \
--no-enable-flashinfer-autotune \
--compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
```
H100 Startup Sequence
```bash
# Step 1: Start Node 0 (the master) first
ssh node0
cd /path/to/scripts
./start_master.sh
# Step 2: Wait for Node 0 to finish starting (loading the 960 GB checkpoint takes roughly 5-10 minutes)
# Step 3: Start Node 1 (the worker)
ssh node1
cd /path/to/scripts
./start_worker.sh
# Step 4: Verify the service
ssh node0
curl http://localhost:8000/v1/models
```
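Beyond listing models, it helps to push a handful of concurrent requests so that more than one data-parallel rank sees traffic. A small sketch using the standard `/v1/completions` endpoint (the prompt and token counts are arbitrary):

```bash
# Step 5 (optional): fire 8 concurrent short completions from node0.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-V4-Pro", "prompt": "Hello", "max_tokens": 8}' &
done
wait
```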
9. Platform Configuration Comparison
| Platform | Nodes | GPUs/Node | Total GPUs | Total VRAM | Strategy | Docker |
|---|---|---|---|---|---|---|
| H200 | 1 | 8 | 8 | 1128 GB | TP8+EP | cu129 |
| B200 | 1 | 8 | 8 | 1440 GB | TP8+EP | cu130 |
| GB200 | 2 | 4 | 8 | 1536 GB | Multi-Node DEP | cu130 |
| B300 | 1 | 8 | 8 | 2144 GB | TP8+EP | cu130 |
| GB300 | 2 | 4 | 8 | 2304 GB | Multi-Node DEP | cu130 |
| H100 | 2 | 8 | 16 | 1280 GB | Multi-Node DEP | cu129 |
Key Parameter Notes
| Parameter | H200/B200/B300 | GB200/GB300 | H100 |
|---|---|---|---|
| `--tensor-parallel-size` | 8 | - | - |
| `--data-parallel-size` | - | 8 | 16 |
| `--data-parallel-size-local` | - | 4 | 8 |
| `--data-parallel-start-rank` (worker only) | - | 4 | 8 |
| `--max-model-len` | 800K (H200); unset on B200/B300 | unset (full 1M) | 800K |
| `--no-enable-flashinfer-autotune` | - | - | ✅ |