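NVIDIA DGX Spark (Blackwell GB10): A Complete Record of Deploying the 196B Step 3.5 Flash Model Across Two Nodes

1. Architecture Overview

1.1 Hardware and Network Topology

This deployment uses two NVIDIA DGX Spark hosts joined into a Ray cluster over a direct high-speed NIC link. Pipeline parallelism (PP=2) splits the 196B-parameter model across the two machines.

Hostname	Role	Unified memory (visible to vLLM)	High-speed IP	NIC
spark-7	Ray head (primary node)	121.695 GiB	169.254.72.234	enp1s0f1np1
spark-6	Ray worker (secondary node)	119.676 GiB	169.254.12.148	enp1s0f1np1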

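⚠️ Important architecture note: GB10 unified memory

GB10 has no discrete video memory (VRAM). The CPU and GPU are connected through NVLink-C2C (chip-to-chip) and share a single 128 GB LPDDR5X memory pool with 273 GB/s of bandwidth. The "available GPU memory" that vLLM reports (about 119 to 122 GiB) is the unified memory left after the operating system and system reservations are subtracted, not dedicated GPU memory. This differs fundamentally from traditional PCIe GPUs such as the H100 or A100: no CPU↔GPU data copies are needed, but memory bandwidth is far lower than HBM (273 GB/s versus roughly 3.35 TB/s on the H100 SXM).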
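To put the bandwidth gap in numbers, here is a minimal sketch that uses only the two figures quoted above. The helper name `stream_time_seconds` is made up for illustration, and the result is a lower bound on a single pass over the data, not a prediction of real inference speed.

```python
def stream_time_seconds(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read `num_bytes` once at the given bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bandwidth_bytes_per_s


GB10_BW = 273e9        # 273 GB/s, figure quoted in the text
H100_SXM_BW = 3.35e12  # ~3.35 TB/s, figure quoted in the text

ratio = H100_SXM_BW / GB10_BW
print(f"H100 SXM has about {ratio:.1f}x the memory bandwidth of GB10")

one_gib = 2**30
ms = stream_time_seconds(one_gib, GB10_BW) * 1e3
print(f"Reading 1 GiB once on GB10 takes at least {ms:.2f} ms")
```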

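Key design decisions:

Abandon the gigabit management network; both the Ray control plane and data plane run over the ConnectX-7 high-speed NIC (169.254.x.x subnet)
Use Pipeline Parallel = 2: spark-7 hosts the first 23 layers (PP rank 0) and spark-6 hosts the last 22 layers (PP rank 1)
Cross-node PP activation transfer is wrapped and managed by Ray Compiled DAG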

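1.2 ConnectX-7 Network Bandwidth

⚠️ How to read "200 Gbps" correctly (this differs from the marketing figure)

DGX Spark has 2 physical QSFP56 ports driven by a single ConnectX-7 SmartNIC. Each physical port appears in the operating system as 2 network interfaces, for 4 logical interfaces in total: enp1s0f0np0, enp1s0f0np1, enp1s0f1np0, enp1s0f1np1.

Item	Specification
NIC model	NVIDIA ConnectX-7 SmartNIC
Physical ports	2 × QSFP56
Aggregate rated bandwidth	200 Gbps (both ports combined)
PCIe link per port	PCIe Gen5 x4 (~100 Gbps ceiling)
Measured single-port TCP peak	~100 Gbps
Measured dual-port RDMA aggregate	~185-190 Gbps (requires a correctly configured RoCE topology)

Core constraint: each physical port is attached through PCIe Gen5 x4 (about 100 Gbps of usable bandwidth), so a single direct-attach cable can saturate only one physical port (about 100 Gbps) and cannot reach 200 Gbps. Reaching the 200 Gbps aggregate requires two cables, RDMA, and a correct port bonding configuration.

Impact on this deployment: with the single DAC cable used here, PP traffic over enp1s0f1np1 has a theoretical ceiling of about 100 Gbps. Because the inter-stage activation traffic is tiny (measured at only ~150-230 KB/s), network bandwidth is nowhere near a bottleneck.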
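The headroom can be checked with a few lines of arithmetic. This is only a sketch: the helper name `link_utilization` is made up, and it assumes 1 KB = 1000 bytes, which the original measurement does not specify.

```python
def link_utilization(observed_bytes_per_s: float, link_bits_per_s: float) -> float:
    """Fraction of the link consumed by the observed traffic."""
    return observed_bytes_per_s * 8 / link_bits_per_s


LINK_BPS = 100e9  # ~100 Gbps single-port ceiling from the table above

for kb_per_s in (150, 230):  # measured PP traffic range quoted above
    frac = link_utilization(kb_per_s * 1e3, LINK_BPS)  # assumes 1 KB = 1000 bytes
    print(f"{kb_per_s} KB/s uses {frac:.2e} of a 100 Gbps link")
```

Even at the upper end of the measured range, the traffic occupies roughly two hundred-thousandths of the link.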

1.3 Software Stack

Component	Version / notes
Operating system	DGX OS (Ubuntu 24.04 LTS, ARM64)
Python environment	conda: `vllm-spark`, Python 3.11
vLLM	v0.17.0rc1.dev105+g86e1060b1 (built from source, patched)
PyTorch	Nightly build required (the stable release lacks sm_121 / CUDA 13.0 / ARM64 support)
CUDA	13.0
GPU compute capability	sm_121 (Compute Capability 12.1)
Flash Attention	v2.8.3 (source modified to force sm_121 support)
NCCL	2.29.3
Memory	128 GB LPDDR5X, 273 GB/s, shared by CPU and GPU
CUDA cores	6,144
Model path	`/home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8`
vLLM source path	`/home/nvidia/workspaces/vllm`
Launch script path	`/home/nvidia/workspaces/ray/test_run_step_cluster.sh`

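2. Environment Setup (run on both the head and worker nodes)

⚠️ Every step below must be run in full on both spark-7 and spark-6 so that the two environments are identical.

2.1 Install a PyTorch Build That Supports GB10 (Nightly Required)

GB10's CUDA compute capability is sm_121 (Compute Capability 12.1, CUDA 13.0). At the time of writing, the official stable PyTorch release ships no prebuilt ARM64 + CUDA 13.0 wheel, so a nightly build is required.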

conda activate vllm-spark

# Remove old packages completely to avoid environment conflicts
pip uninstall -y torch torchvision torchaudio vllm-flash-attn flash-attn vllm

# Install the nightly build with Blackwell sm_121 support (CUDA 13.0)
pip install --pre torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/cu130 \
    --force-reinstall

# Verify the compute capability: the output must be (12, 1)
python -c "import torch; print('CUDA capability:', torch.cuda.get_device_capability())"
# Expected output: CUDA capability: (12, 1)

Why 12.1 and not 12.0? The RTX 5090 is sm_120 (12.0), while the GB10 in DGX Spark is sm_121 (12.1). Both are Blackwell, but their compute capability numbers differ, and the build must target the right one explicitly.
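The mapping from the capability tuple to the two spellings used in this guide (`sm_121` for compiler targets, `12.1` for `TORCH_CUDA_ARCH_LIST`) can be sketched as follows. The helper names are illustrative and belong to no library; the commented lines show how they might be fed from `torch.cuda.get_device_capability()` on the target machine.

```python
def arch_name(capability: tuple) -> str:
    """(12, 1) -> 'sm_121', the form used in nvcc -gencode flags."""
    major, minor = capability
    return f"sm_{major}{minor}"


def arch_list_entry(capability: tuple) -> str:
    """(12, 1) -> '12.1', the form used in TORCH_CUDA_ARCH_LIST."""
    major, minor = capability
    return f"{major}.{minor}"


for label, cap in (("RTX 5090", (12, 0)), ("GB10", (12, 1))):
    print(f"{label}: {arch_name(cap)} / TORCH_CUDA_ARCH_LIST={arch_list_entry(cap)}")

# On the target machine (requires torch and a CUDA device):
#   import torch
#   print(arch_name(torch.cuda.get_device_capability()))
```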

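2.2 Build vLLM (limit parallelism to avoid OOM during compilation)

DGX Spark has 128 GB of unified memory shared by the CPU and GPU, and too many parallel compile jobs will exhaust it. Build parallelism must be capped with MAX_JOBS.

cd /home/nvidia/workspaces/vllm
rm -rf build/ .deps/ *.egg-info/

export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="12.1"   # Key: target GB10's sm_121 explicitly
export MAX_JOBS=8                     # Key: cap compile parallelism so unified memory is not exhausted

# --no-build-isolation --no-deps keeps pip from silently downgrading PyTorch
pip install -e . --no-build-isolation --no-deps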

2.3 Build Flash-Attention from Source (inject sm_121 support)

The official prebuilt Flash-Attention wheels do not support sm_121 (confirmed in upstream issue #1969), so setup.py has to be patched to force the extra compile flags in.

cd /home/nvidia/workspaces
rm -rf flash-attention
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && git checkout v2.8.3

# Patch setup.py to inject the sm_121 build target
python -c "
with open('setup.py', 'r') as f:
    c = f.read()
anchor = '\"--use_fast_math\",'
# Fail loudly if the anchor is missing, instead of reporting success on a no-op
assert anchor in c, 'anchor not found in setup.py, nothing was patched'
c = c.replace(anchor,
    '\"--use_fast_math\", \"-gencode\", \"arch=compute_90a,code=sm_90a\",\
     \"-gencode\", \"arch=compute_121,code=sm_121\",\
     \"-gencode\", \"arch=compute_120,code=compute_120\",')
with open('setup.py', 'w') as f:
    f.write(c)
print('setup.py patched')
"

# Memory-frugal build: skip the backward pass (not needed for inference)
export MAX_JOBS=4
export FLASH_ATTENTION_FORCE_BUILD=TRUE
export FLASH_ATTENTION_SKIP_BWD=TRUE
pip install -e . --no-build-isolation --no-deps

# Verify: if this prints without an error, Flash-Attention is working
python -c "import vllm.vllm_flash_attn as fa; print('Flash-Attention OK')"

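3. vLLM Source Fixes (the core troubleshooting)

This is the most critical part of the deployment. In the cross-node Ray PP scenario with one GPU per node, vLLM v0.17.0rc1 has several serious bugs, and every one of them must be fixed before the server will start.

Bug 1: The PACK placement group strategy breaks cross-node GPU allocation

Symptom:

Within seconds of launch, vLLM raises a RuntimeError and the EngineCore process exits on SIGTERM:

WARNING Tensor parallel size (2) exceeds available GPUs (1).
This may result in Ray placement group allocation failures.

Root cause:

When vLLM creates the Ray placement group it defaults to the PACK strategy, which tries to place all bundles (GPUs) on a single node and spills across nodes only on overflow. In this setup each machine has a single GPU, while PP=2 needs 2 GPU bundles. In vLLM v0.17.0rc1, when one node does not have enough GPUs, placement group creation fails outright instead of spilling onto the second node.

Note: with Ray Serve LLM, the PACK strategy spills over to other nodes automatically when resources run short. The code path that vLLM invokes directly in this RC does not behave that way, so the strategy must be changed to SPREAD by hand.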

Fix:

# File: /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_utils.py, line 477
sed -i 's/strategy="PACK"/strategy="SPREAD"/' \
    /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_utils.py

# Verify the change
grep -n 'strategy=' /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_utils.py
# Expected output: 477:            placement_group_specs, strategy="SPREAD"

How it works: the SPREAD strategy distributes bundles across different nodes, so spark-7 and spark-6 each contribute one GPU, which is exactly what PP=2 needs.
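The difference between the two strategies can be illustrated with a toy placement model. This is a deliberately simplified sketch of the documented best-effort semantics (PACK prefers a node already in use, SPREAD prefers the least-used node); it is not Ray's scheduler, the function `place_bundles` is invented for illustration, and it does not reproduce the failure observed with this vLLM RC.

```python
def place_bundles(node_gpus: dict, num_bundles: int, strategy: str) -> list:
    """Return the node chosen for each 1-GPU bundle under a toy PACK/SPREAD model."""
    free = dict(node_gpus)
    placement = []
    for _ in range(num_bundles):
        candidates = [name for name, gpus in free.items() if gpus > 0]
        if not candidates:
            raise RuntimeError("not enough free GPUs for all bundles")
        if strategy == "PACK":
            # Prefer a node that already hosts a bundle; spill over only when forced.
            already_used = [name for name in candidates if name in placement]
            node = already_used[0] if already_used else candidates[0]
        elif strategy == "SPREAD":
            # Prefer the node that hosts the fewest bundles so far.
            node = min(candidates, key=placement.count)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        free[node] -= 1
        placement.append(node)
    return placement


two_gpu_nodes = {"node-a": 2, "node-b": 2}  # hypothetical cluster for contrast
sparks = {"spark-7": 1, "spark-6": 1}       # the cluster described in this guide

print("PACK   on 2-GPU nodes:", place_bundles(two_gpu_nodes, 2, "PACK"))
print("SPREAD on 2-GPU nodes:", place_bundles(two_gpu_nodes, 2, "SPREAD"))
print("SPREAD on the Sparks: ", place_bundles(sparks, 2, "SPREAD"))
```

With one GPU per node, SPREAD lands one bundle on each host, which is the layout PP=2 needs here.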

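Bug 2: current_ip in ray_utils.py uses the wrong IP

Symptom:

Even with Bug 1 fixed, placement group creation still fails. A manual check shows that get_ip() returns the wrong IP:

python3 -c "from vllm.utils.network_utils import get_ip; print(get_ip())"
# Wrong output: 192.168.1.228  ← the management-network IP, not the direct-link IP that Ray uses

Root cause:

When ray_utils.py creates the placement group, it uses the IP returned by get_ip() to build a node:192.168.1.228 constraint that pins bundle 0 to the node with that IP. But spark-7 is registered in the Ray cluster as 169.254.72.234, so the constraint matches nothing and placement group creation fails immediately.

The launch script does set VLLM_HOST_IP=169.254.72.234, but envs.VLLM_HOST_IP is cached at module import time, and a Ray worker subprocess may not see the variable when it starts. get_ip() then falls back to socket route probing; the probe toward 8.8.8.8 goes out over the management network, so the wrong IP is returned.

Fix:

# File: /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_utils.py, line 461
# The replacement references os.environ, so confirm the file imports os (add "import os" if it does not)
sed -i 's/current_ip = get_ip()/current_ip = os.environ.get("VLLM_HOST_IP") or get_ip()/' \
    /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_utils.py

# Verify
grep -n "current_ip" /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_utils.py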
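The patched expression and the fallback it guards against can be sketched in isolation. This is not vLLM's code: `probe_route_ip` is a stand-in for the route-probing fallback described above, and `resolve_host_ip` only mirrors the `os.environ.get("VLLM_HOST_IP") or get_ip()` pattern from the fix.

```python
import os
import socket


def probe_route_ip(target: str = "8.8.8.8", port: int = 80) -> str:
    """IP of the interface the kernel would route through to reach `target`.

    Connecting a UDP socket sends no packet; it only selects a route. On a host
    whose default route is the management network, this returns that network's
    IP, which is the failure mode described above.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((target, port))
        return s.getsockname()[0]


def resolve_host_ip(env=None, fallback=probe_route_ip) -> str:
    """Prefer an explicit VLLM_HOST_IP; probe only when it is unset or empty."""
    env = os.environ if env is None else env
    return env.get("VLLM_HOST_IP") or fallback()


# Simulated environments; the fallback is faked so no network access is needed.
fake_probe = lambda: "192.168.1.228"
print(resolve_host_ip({"VLLM_HOST_IP": "169.254.72.234"}, fake_probe))
print(resolve_host_ip({}, fake_probe))
```

The first call returns the pinned direct-link address, while the second falls back to whatever the probe reports.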

Bug 3: driver_ip in ray_executor.py uses the wrong IP

Symptom:

With Bugs 1 and 2 fixed, initialization still fails. The error is swallowed and the log shows only Shutting down Ray distributed executor. Wrapping the code in a try/except in the source revealed the real error:

RuntimeError: Every node should have a unique IP address.
Got 2 nodes with node ids [...] and 3 unique IP addresses
{'169.254.72.234', '192.168.1.128', '192.168.1.228'}

Root cause:

In the _init_workers_ray method of ray_executor.py, driver_ip = get_ip() likewise returns the management-network IP (192.168.1.228). The method collects the IPs of all workers (which should all be 169.254.x.x), finds that the number of unique IPs (3) does not match the number of nodes (2), and raises.

Fix:

# File: /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_executor.py, line 207
# The replacement references os.environ, so confirm the file imports os (add "import os" if it does not)
sed -i 's/driver_ip = get_ip()/driver_ip = os.environ.get("VLLM_HOST_IP") or get_ip()/' \
    /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_executor.py

# Verify
grep -n "driver_ip" /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_executor.py
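The consistency check behind this error can be paraphrased in a few lines. The sketch below is a reconstruction from the error message, not a copy of the vLLM source, and the function name and node ids are made up.

```python
def check_unique_node_ips(node_ids, worker_ips, driver_ip):
    """Raise if the set of reported IPs does not match the number of nodes."""
    all_ips = set(worker_ips) | {driver_ip}
    n_nodes = len(set(node_ids))
    if len(all_ips) != n_nodes:
        raise RuntimeError(
            "Every node should have a unique IP address. "
            f"Got {n_nodes} nodes and {len(all_ips)} unique IP addresses {sorted(all_ips)}"
        )
    return all_ips


nodes = ["node-head", "node-worker"]  # placeholder ids

# Before the fixes: the driver and one worker report management-network IPs.
try:
    check_unique_node_ips(nodes, ["169.254.72.234", "192.168.1.128"], "192.168.1.228")
except RuntimeError as err:
    print("rejected:", err)

# After the fixes: every component reports its direct-link IP.
print("accepted:", sorted(check_unique_node_ips(
    nodes, ["169.254.72.234", "169.254.12.148"], "169.254.72.234")))
```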

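Bug 4: Setting RAY_RUNTIME_ENV_HOOK to an empty string raises ValueError

Symptom:

One launch attempt failed with:

ValueError: You need to pass a valid path like mymodule.provider_class

Root cause:

An older version of the launch script contained the line export RAY_RUNTIME_ENV_HOOK="", which sets the variable to an empty string. Ray reads the empty string, tries to parse it as a module path, fails, and raises ValueError.

Fix:

Delete that line from the launch script entirely. The variable must not be set to an empty string: either leave it unset or set it to a valid module path.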
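The underlying distinction is between a variable that is unset and one that is set to an empty string. The loader below is a hypothetical illustration of how a "module.attribute" path might be resolved; it is not Ray's implementation, and `load_hook` is an invented name.

```python
import importlib


def load_hook(path):
    """Resolve a 'package.module.attr' path; None means no hook is configured."""
    if path is None:
        return None  # variable unset: nothing to load
    module_name, sep, attr = path.rpartition(".")
    if not sep or not module_name or not attr:
        # An empty string lands here: it is "set" but is not a valid path.
        raise ValueError(f"invalid hook path: {path!r}")
    return getattr(importlib.import_module(module_name), attr)


unset_env = {}                             # variable not exported at all
empty_env = {"RAY_RUNTIME_ENV_HOOK": ""}   # export RAY_RUNTIME_ENV_HOOK=""

print(load_hook(unset_env.get("RAY_RUNTIME_ENV_HOOK")))
try:
    load_hook(empty_env.get("RAY_RUNTIME_ENV_HOOK"))
except ValueError as err:
    print("rejected:", err)
```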

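Bug 5: get_node_ip() returns the wrong IP on spark-6

Symptom:

The "3 unique IPs for 2 nodes" error appears because the worker on spark-6 reports 192.168.1.128 (the management-network IP) instead of 169.254.12.148 (the high-speed direct-link IP).

Root cause:

The vLLM worker's get_node_ip() method determines its IP by network probing. On spark-6 the probe finds the management NIC's IP, which does not match the IP registered with the Ray cluster.

Fix:

Change get_node_ip() to obtain the node IP from Ray's native API, so that it always matches the IP registered in the Ray cluster:

# File: /home/nvidia/workspaces/vllm/vllm/v1/executor/ray_utils.py, line 77
# Replace the get_node_ip() method with:

def get_node_ip(self) -> str:
    import ray
    return ray.util.get_node_ip_address()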

Fix Summary

Bug #	File	Location	Fix
Bug 1	`vllm/v1/executor/ray_utils.py`	Line 477	`strategy="PACK"` → `strategy="SPREAD"`
Bug 2	`vllm/v1/executor/ray_utils.py`	Line 461	`current_ip = get_ip()` → `os.environ.get("VLLM_HOST_IP") or get_ip()`
Bug 3	`vllm/v1/executor/ray_executor.py`	Line 207	`driver_ip = get_ip()` → `os.environ.get("VLLM_HOST_IP") or get_ip()`
Bug 4	Launch script	Global	Delete the `RAY_RUNTIME_ENV_HOOK=""` assignment
Bug 5	`vllm/v1/executor/ray_utils.py`	Line 77	`get_node_ip()` now uses `ray.util.get_node_ip_address()`

Clear the .pyc bytecode cache after editing the source

⚠️ After a .py file is modified, Python may keep running the old code if the .pyc cache is not cleared.

find /home/nvidia/workspaces/vllm/vllm/v1/executor/ -name '*.pyc' -delete

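4. Networking and Ray Cluster Startup

4.1 About the 169.254.x.x Subnet

The two machines are connected directly through their ConnectX-7 QSFP56 ports, with IP addresses in the 169.254.x.x range (link-local addresses). This range has one known limitation: the underlying c10d/Gloo communication library cannot do a reverse DNS lookup on these addresses and emits a DNS warning with err=-3. The warning is harmless and can be ignored:

[W311 19:30:50.617528206 socket.cpp:207] [c10d] The hostname of the client
socket cannot be retrieved. err=-3   ← safe to ignore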

4.2 Starting the Ray Cluster

Step 1: Clean up (run before every restart, on both nodes)

# Warning: this kills every Python process on the host, not only vLLM and Ray
pkill -9 python
ray stop -f
sync; sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'

Step 2: Start the Ray head on spark-7 (head node)

conda activate vllm-spark
ray start --head \
    --node-ip-address=169.254.72.234 \
    --port=6379 \
    --num-gpus=1

Step 3: Join the cluster from spark-6 (worker node)

conda activate vllm-spark
ray start \
    --address=169.254.72.234:6379 \
    --node-ip-address=169.254.12.148 \
    --num-gpus=1

Step 4: Verify the cluster state

# Run on spark-7; expect 2 ALIVE nodes and 2.0 GPUs in total
ray status

# Expected output:
# Active:
#  1 node_xxxx  (169.254.12.148)
#  1 node_xxxx  (169.254.72.234)
# Total Usage: 0.0/2.0 GPU

ray list nodes  # Show detailed node information
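The same verification can be scripted. The sketch below assumes the node records have the shape returned by `ray.nodes()` (dicts with `Alive`, `NodeManagerAddress`, and `Resources` keys); the checker `check_cluster` is a made-up helper, and the sample records are fabricated stand-ins so that the example runs without a cluster.

```python
def check_cluster(nodes, expected_ips, expected_gpus):
    """Return a list of problems; an empty list means the cluster looks right."""
    alive = [n for n in nodes if n.get("Alive")]
    seen_ips = {n.get("NodeManagerAddress", "") for n in alive}
    total_gpus = sum(n.get("Resources", {}).get("GPU", 0) for n in alive)

    problems = []
    missing = set(expected_ips) - seen_ips
    if missing:
        problems.append(f"missing nodes: {sorted(missing)}")
    unexpected = seen_ips - set(expected_ips)
    if unexpected:
        problems.append(f"unexpected node IPs: {sorted(unexpected)}")
    if total_gpus != expected_gpus:
        problems.append(f"expected {expected_gpus} GPUs, found {total_gpus}")
    return problems


EXPECTED_IPS = ["169.254.72.234", "169.254.12.148"]

# Stand-in records; on a live cluster use:
#   import ray; ray.init(address="auto"); nodes = ray.nodes()
sample_nodes = [
    {"Alive": True, "NodeManagerAddress": "169.254.72.234", "Resources": {"GPU": 1.0}},
    {"Alive": True, "NodeManagerAddress": "169.254.12.148", "Resources": {"GPU": 1.0}},
]
print(check_cluster(sample_nodes, EXPECTED_IPS, expected_gpus=2) or "cluster OK")
```

A node that registered with its management-network address would show up as both a missing and an unexpected IP, which is the situation Bugs 2, 3, and 5 describe.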

5. Complete Launch Script

The following is the final working launch script, verified after all the bug fixes above (/home/nvidia/workspaces/ray/test_run_step_cluster.sh):

#!/bin/bash
source ~/miniconda3/bin/activate vllm-spark

export VLLM_HOST_IP=169.254.72.234
export RAY_NODE_IP_ADDRESS=169.254.72.234
export RAY_ADDRESS=169.254.72.234:6379
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export GLOO_SOCKET_IFNAME=enp1s0f1np1
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

python -m vllm.entrypoints.openai.api_server \
  --model /home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8 \
  --served-model-name step3p5-flash \
  --distributed-executor-backend ray \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --trust-remote-code \
  --disable-cascade-attn \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Running in the background:

chmod +x /home/nvidia/workspaces/ray/test_run_step_cluster.sh

# Run in the background and write output to a log file
nohup /home/nvidia/workspaces/ray/test_run_step_cluster.sh \
    > /home/nvidia/workspaces/ray/vllm_server.log 2>&1 &

# Follow the startup log in real time
tail -f /home/nvidia/workspaces/ray/vllm_server.log

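6. Startup Walkthrough and Expected Logs

6.1 Startup Sequence

Stage	Expected duration	Key log line
Ray cluster connection	< 5 s	`Connected to Ray cluster at 169.254.72.234:6379`
Placement group creation	< 10 s	`Creating a new placement group` (SPREAD strategy)
Weight loading on spark-7	~320 s	`Loading weights took 320.10 seconds (91.24 GiB)`
Weight loading on spark-6	~408 s	`Loading weights took 407.91 seconds (99.76 GiB)`
KV cache allocation and warm-up	< 30 s	`GPU KV cache size: 12,192 tokens`
Service ready	n/a	`INFO: Application startup complete.`

Total startup time is about 8 to 10 minutes, dominated by weight loading. The two nodes load their shards concurrently (loading one after the other would take about 12 minutes by these figures), and spark-6 finishes roughly 90 seconds after spark-7.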
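The load times translate into an effective loading rate per node. This is plain arithmetic on the figures in the table above; it treats the reported memory footprint as the amount loaded, which is an approximation, and the helper name is made up.

```python
def load_rate_gib_per_s(gib: float, seconds: float) -> float:
    """Effective weight-loading rate in GiB/s."""
    return gib / seconds


loads = {
    "spark-7": (91.24, 320.10),  # (GiB, seconds) from the table above
    "spark-6": (99.76, 407.91),
}

for host, (gib, seconds) in loads.items():
    print(f"{host}: {load_rate_gib_per_s(gib, seconds):.3f} GiB/s")

gap = loads["spark-6"][1] - loads["spark-7"][1]
print(f"spark-6 finishes about {gap:.0f} s after spark-7")
```

Both rates are a fraction of a GiB per second, far below both the memory bandwidth and the network link, which suggests that storage read speed sets the startup time.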

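6.2 Key Checks After a Successful Start

# Check 1: layer partition (45 layers cannot be split evenly across 2 stages)
# The log should contain:
# Hidden layers were unevenly partitioned: [23,22]
# spark-7 (PP rank 0) hosts layers 0-22 (23 layers)
# spark-6 (PP rank 1) hosts layers 23-44 (22 layers)
#
# Optional: override the split with an environment variable,
# for example to move the extra layer to spark-6:
# export VLLM_PP_LAYER_PARTITION='22,23'

# Check 2: KV cache size
# spark-7: Available KV cache memory: 7.4 GiB
# GPU KV cache size: 12,192 tokens

# Check 3: PP communication path
# The log should contain:
# Using RayPPCommunicator (wraps vLLM _PPGroupCoordinator)
# VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE = auto
# This confirms that PP communication is managed by Ray Compiled DAG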
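The layer split in Check 1 can be reproduced with a short sketch. The default rule below (extra layers go to the earliest stages) is an assumption chosen because it matches the logged [23,22] result for this case; vLLM's own rule may differ for other layer counts or stage counts. All function names are illustrative.

```python
def default_partition(num_layers: int, pp_size: int) -> list:
    """Split layers as evenly as possible, giving extras to the earliest stages."""
    base, extra = divmod(num_layers, pp_size)
    return [base + 1 if stage < extra else base for stage in range(pp_size)]


def parse_partition(spec: str, num_layers: int, pp_size: int) -> list:
    """Parse a '22,23'-style override and check that it covers every layer."""
    parts = [int(item) for item in spec.split(",")]
    if len(parts) != pp_size:
        raise ValueError(f"expected {pp_size} entries, got {len(parts)}")
    if sum(parts) != num_layers:
        raise ValueError(f"entries sum to {sum(parts)}, expected {num_layers}")
    return parts


def layer_ranges(parts: list) -> list:
    """Convert per-stage counts into inclusive (first, last) layer indices."""
    ranges, start = [], 0
    for count in parts:
        ranges.append((start, start + count - 1))
        start += count
    return ranges


print(default_partition(45, 2), layer_ranges(default_partition(45, 2)))
print(parse_partition("22,23", 45, 2), layer_ranges(parse_partition("22,23", 45, 2)))
```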

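6.3 Inference Speed

Metric	Measured value	Notes
Single-request generation speed	~10-11 tokens/s	enforce-eager mode, no CUDA Graphs
Available KV cache	12,192 tokens	The smaller of the two nodes' values
Maximum context length	8,192 tokens	Limited by --max-model-len

⚠️ --enforce-eager disables CUDA Graphs and torch.compile, which lowers inference speed. Removing it is expected to raise speed by 2 to 3 times (~25-35 tokens/s). The flag is currently kept because, on GB10 with the vLLM RC build, the Triton compiler has a known issue generating PTX for sm_121a in compiled mode (triton #9181), and stability has not yet been fully verified. Note that the launch script in section 5 does not pass --enforce-eager, and the log in section 7 reports enforce_eager=False.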
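The KV cache figure bounds how many requests can be resident at once. The estimate below is a rough sketch: it ignores block granularity and prefix caching, the 64-token prompt length is an assumed value rather than a measurement, and the helper name is made up.

```python
def max_resident_requests(kv_tokens: int, prompt_tokens: int, max_new_tokens: int) -> int:
    """How many requests fit in the KV cache if each grows to its full length."""
    return kv_tokens // (prompt_tokens + max_new_tokens)


KV_TOKENS = 12_192  # from the table above

# Full-length sequences at --max-model-len 8192
print("8192-token sequences:", max_resident_requests(KV_TOKENS, 0, 8192))

# Benchmark-style requests: assumed ~64-token prompt, 1024 generated tokens
print("1024-token generations:", max_resident_requests(KV_TOKENS, 64, 1024))
```

Under these assumptions only about 11 such requests fit simultaneously, so a 15-way concurrent test would leave some requests waiting or preempted near the end of generation.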

7. Acceptance Tests

7.1 Quick Health Check

# Health check
curl http://localhost:8000/health

# List the loaded models
curl http://localhost:8000/v1/models | python3 -m json.tool

7.2 Performance Test Request

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "step3p5-flash",
    "messages": [
      {"role": "system", "content": "You are a senior programmer."},
      {"role": "user", "content": "Write a complete multithreaded web crawler in Python, with exception handling and comments."}
    ],
    "max_tokens": 1024,
    "temperature": 0.3,
    "stream": true
  }'
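With "stream": true the server answers with server-sent events: each event is a `data: {json}` line, and the stream ends with `data: [DONE]`. A minimal sketch of extracting the generated text from such lines follows. The sample chunks are hand-written in the OpenAI-compatible chunk shape rather than captured from this server, and `iter_deltas` is an illustrative helper name.

```python
import json


def iter_deltas(lines):
    """Yield text fragments from OpenAI-style streaming chat completion lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample in the expected chunk shape (not real server output)
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(iter_deltas(sample)))
```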

7.3 Python Speed Test Script

import asyncio
import time

import aiohttp

# ================= Configuration =================
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "step3p5-flash"
CONCURRENT_REQUESTS = 15  # Number of simultaneous requests (raise it if memory headroom allows)
MAX_TOKENS = 1024         # Maximum number of tokens generated per request
TIMEOUT = aiohttp.ClientTimeout(total=600)  # Per-request time limit in seconds
# =================================================

PAYLOAD = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Write an extremely detailed science-fiction story about humanity's exploration of Mars, with technical detail, psychological description, and a dramatic plot, at least 1000 words long."}],
    "max_tokens": MAX_TOKENS,
    "temperature": 0.7,
    "stream": False
}

async def fetch(session, request_id):
    """Send one request and return its completion token count (0 on failure)."""
    start = time.perf_counter()
    try:
        async with session.post(URL, json=PAYLOAD, timeout=TIMEOUT) as response:
            response.raise_for_status()
            res = await response.json()
            latency = time.perf_counter() - start
            tokens = res.get('usage', {}).get('completion_tokens', 0)
            print(f"Request [{request_id}] done! Latency: {latency:.2f}s, generated tokens: {tokens}")
            return tokens
    except Exception as e:
        print(f"Request [{request_id}] failed: {e!r}")
        return 0

async def main():
    print(f"🚀 Starting stress test: launching {CONCURRENT_REQUESTS} concurrent requests...")
    start_time = time.perf_counter()

    # Size the connection pool to the concurrency so no request waits for a socket
    connector = aiohttp.TCPConnector(limit=CONCURRENT_REQUESTS)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, i) for i in range(CONCURRENT_REQUESTS)]
        results = await asyncio.gather(*tasks)

    total_time = time.perf_counter() - start_time
    total_tokens = sum(results)

    print("\n" + "="*40)
    print("🎯 Stress test report")
    print("="*40)
    print(f"Total time:     {total_time:.2f} s")
    print(f"Total tokens:   {total_tokens}")
    print(f"Throughput:     {total_tokens / total_time:.2f} tokens/s")
    print("="*40)

if __name__ == "__main__":
    asyncio.run(main())

Results

(vllm-spark) root@spark-7:/home/nvidia/workspaces/ray/benchmark#  python stress_test.py
🚀 Starting stress test: launching 15 concurrent requests...
Request [13] done! Latency: 263.38s, generated tokens: 1024
Request [3] done! Latency: 263.38s, generated tokens: 1024
Request [10] done! Latency: 263.57s, generated tokens: 1024
Request [1] done! Latency: 263.57s, generated tokens: 1024
Request [7] done! Latency: 263.57s, generated tokens: 1024
Request [14] done! Latency: 263.57s, generated tokens: 1024
Request [11] done! Latency: 263.57s, generated tokens: 1024
Request [8] done! Latency: 263.57s, generated tokens: 1024
Request [0] done! Latency: 263.57s, generated tokens: 1024
Request [9] done! Latency: 263.57s, generated tokens: 1024
Request [2] done! Latency: 263.57s, generated tokens: 1024
Request [4] done! Latency: 263.57s, generated tokens: 1024
Request [6] done! Latency: 263.57s, generated tokens: 1024
Request [5] done! Latency: 263.57s, generated tokens: 1024
Request [12] done! Latency: 263.57s, generated tokens: 1024

========================================
🎯 Stress test report
========================================
Total time:     263.57 s
Total tokens:   15360
Throughput:     58.28 tokens/s
========================================
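The report's aggregate figure and the per-request rate it implies follow directly from the numbers above. The sketch below is only that arithmetic; the helper name is made up.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation rate over a wall-clock interval."""
    return tokens / seconds


REQUESTS = 15
TOKENS_EACH = 1024
WALL_TIME_S = 263.57  # total time from the report above

aggregate = tokens_per_second(REQUESTS * TOKENS_EACH, WALL_TIME_S)
per_request = tokens_per_second(TOKENS_EACH, WALL_TIME_S)

print(f"aggregate:   {aggregate:.2f} tokens/s")
print(f"per request: {per_request:.2f} tokens/s")
```

Batching 15 requests raises total throughput to about 58 tokens/s, while each individual request progresses at under 4 tokens/s.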

[Figure: server load]

vLLM output

(vllm-spark) root@spark-7:/home/nvidia/workspaces/ray# # Run directly in the terminal so that no editor introduces invisible characters
cat > /home/nvidia/workspaces/ray/run_step_cluster_0312.sh << 'EOF'
#!/bin/bash
source ~/miniconda3/bin/activate vllm-spark

export VLLM_HOST_IP=169.254.72.234
export RAY_NODE_IP_ADDRESS=169.254.72.234
export RAY_ADDRESS=169.254.72.234:6379
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export GLOO_SOCKET_IFNAME=enp1s0f1np1
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

python -m vllm.entrypoints.openai.api_server \
  --model /home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8 \
  --served-model-name step3p5-flash \
  --distributed-executor-backend ray \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --trust-remote-code \
  --disable-cascade-attn \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
EOF

chmod +x /home/nvidia/workspaces/ray/run_step_cluster_0312.sh
(vllm-spark) root@spark-7:/home/nvidia/workspaces/ray# ./run_step_cluster_0312.sh
(APIServer pid=54127) INFO 03-12 15:14:05 [utils.py:292] 
(APIServer pid=54127) INFO 03-12 15:14:05 [utils.py:292]        █     █     █▄   ▄█
(APIServer pid=54127) INFO 03-12 15:14:05 [utils.py:292]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0rc1.dev105+g86e1060b1
(APIServer pid=54127) INFO 03-12 15:14:05 [utils.py:292]   █▄█▀ █     █     █     █  model   /home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8
(APIServer pid=54127) INFO 03-12 15:14:05 [utils.py:292]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=54127) INFO 03-12 15:14:05 [utils.py:292] 
(APIServer pid=54127) INFO 03-12 15:14:05 [utils.py:228] non-default args: {'host': '0.0.0.0', 'model': '/home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8', 'trust_remote_code': True, 'max_model_len': 8192, 'disable_cascade_attn': True, 'served_model_name': ['step3p5-flash'], 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 2, 'gpu_memory_utilization': 0.85}
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) INFO 03-12 15:14:05 [model.py:531] Resolved architecture: Step3p5ForCausalLM
(APIServer pid=54127) INFO 03-12 15:14:05 [model.py:1554] Using max model len 8192
(APIServer pid=54127) INFO 03-12 15:14:06 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=54127) WARNING 03-12 15:14:06 [vllm.py:742] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=54127) INFO 03-12 15:14:06 [vllm.py:753] Asynchronous scheduling is disabled.
(APIServer pid=54127) The tokenizer you are loading from '/home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(EngineCore_DP0 pid=54151) INFO 03-12 15:14:09 [core.py:104] Initializing a V1 LLM engine (v0.17.0rc1.dev105+g86e1060b1) with config: model='/home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8', speculative_config=None, tokenizer='/home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=step3p5-flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=54151) WARNING 03-12 15:14:09 [ray_utils.py:377] Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.
(EngineCore_DP0 pid=54151) 2026-03-12 15:14:09,622      INFO worker.py:1669 -- Using address 169.254.72.234:6379 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=54151) 2026-03-12 15:14:09,626      INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 169.254.72.234:6379...
(EngineCore_DP0 pid=54151) 2026-03-12 15:14:09,658      INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(EngineCore_DP0 pid=54151) /root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=54151)   warnings.warn(
(EngineCore_DP0 pid=54151) INFO 03-12 15:14:09 [ray_utils.py:442] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=54151) INFO 03-12 15:14:15 [ray_env.py:100] Env var prefixes to copy: ['HF_', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_']
(EngineCore_DP0 pid=54151) INFO 03-12 15:14:15 [ray_env.py:101] Copying the following environment variables to workers: ['CUDA_HOME', 'LD_LIBRARY_PATH', 'NCCL_SOCKET_IFNAME', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=54151) INFO 03-12 15:14:15 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) WARNING 03-12 15:14:15 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/cuda-13.0/lib64:' to '/root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/cv2/../../lib64:/root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/cuda-13.0/lib64:'
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) WARNING 03-12 15:14:16 [worker_base.py:291] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:14:17 [parallel_state.py:1395] world_size=2 rank=1 local_rank=0 distributed_init_method=tcp://169.254.72.234:42405 backend=nccl
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) [W312 15:14:17.022773829 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:18 [pynccl.py:111] vLLM is using nccl==2.29.3
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:14:18 [parallel_state.py:1717] rank 1 in world size 2 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:19 [gpu_model_runner.py:4258] Starting to load model /home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8...
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:14:19 [utils.py:129] Hidden layers were unevenly partitioned: [23,22]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:14:19 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:14:19 [flash_attn.py:593] Using FlashAttention version 2
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) WARNING 03-12 15:14:19 [step3p5.py:501] Disable custom fused all reduce...
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:14:19 [fp8.py:390] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_TRITON', 'XPU'].
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) WARNING 03-12 15:14:22 [compilation.py:1114] Op 'quant_fp8' not present in model, enabling with '+quant_fp8' has no effect
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) WARNING 03-12 15:14:15 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/cuda-13.0/lib64:' to '/root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/cv2/../../lib64:/root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/cuda-13.0/lib64:'
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) WARNING 03-12 15:14:16 [worker_base.py:291] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:17 [parallel_state.py:1395] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://169.254.72.234:42405 backend=nccl
Loading safetensors checkpoint shards:   0% Completed | 0/44 [00:00<?, ?it/s]
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) [W312 15:14:17.711220570 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
Loading safetensors checkpoint shards:   2% Completed | 1/44 [00:35<25:28, 35.55s/it]
Loading safetensors checkpoint shards:   5% Completed | 2/44 [00:36<10:28, 14.97s/it]
Loading safetensors checkpoint shards:   7% Completed | 3/44 [00:49<09:49, 14.37s/it]
Loading safetensors checkpoint shards:   9% Completed | 4/44 [01:03<09:29, 14.24s/it]
Loading safetensors checkpoint shards:  11% Completed | 5/44 [01:17<09:13, 14.18s/it]
Loading safetensors checkpoint shards:  14% Completed | 6/44 [01:31<08:56, 14.11s/it]
Loading safetensors checkpoint shards:  16% Completed | 7/44 [01:45<08:40, 14.05s/it]
Loading safetensors checkpoint shards:  18% Completed | 8/44 [01:59<08:27, 14.09s/it]
Loading safetensors checkpoint shards:  20% Completed | 9/44 [02:13<08:12, 14.06s/it]
Loading safetensors checkpoint shards:  23% Completed | 10/44 [02:27<07:57, 14.05s/it]
Loading safetensors checkpoint shards:  25% Completed | 11/44 [02:42<07:44, 14.08s/it]
Loading safetensors checkpoint shards:  27% Completed | 12/44 [02:56<07:30, 14.07s/it]
Loading safetensors checkpoint shards:  30% Completed | 13/44 [03:10<07:15, 14.06s/it]
Loading safetensors checkpoint shards:  32% Completed | 14/44 [03:24<07:02, 14.08s/it]
Loading safetensors checkpoint shards:  34% Completed | 15/44 [03:38<06:48, 14.08s/it]
Loading safetensors checkpoint shards:  36% Completed | 16/44 [03:52<06:34, 14.08s/it]
Loading safetensors checkpoint shards:  39% Completed | 17/44 [04:06<06:19, 14.06s/it]
Loading safetensors checkpoint shards:  41% Completed | 18/44 [04:20<06:06, 14.10s/it]
Loading safetensors checkpoint shards:  43% Completed | 19/44 [04:34<05:53, 14.13s/it]
Loading safetensors checkpoint shards:  45% Completed | 20/44 [04:49<05:39, 14.14s/it]
Loading safetensors checkpoint shards:  48% Completed | 21/44 [05:03<05:24, 14.10s/it]
Loading safetensors checkpoint shards:  50% Completed | 22/44 [05:16<05:07, 14.00s/it]
Loading safetensors checkpoint shards: 100% Completed | 44/44 [05:16<00:00,  7.20s/it]
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) 
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:19:39 [default_loader.py:293] Loading weights took 317.24 seconds
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:19:39 [fp8.py:539] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:18 [parallel_state.py:1717] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:19 [utils.py:129] Hidden layers were unevenly partitioned: [23,22]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:19 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:19 [flash_attn.py:593] Using FlashAttention version 2
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) WARNING 03-12 15:14:19 [step3p5.py:501] Disable custom fused all reduce...
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:14:19 [fp8.py:390] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_TRITON', 'XPU'].
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) WARNING 03-12 15:14:22 [compilation.py:1114] Op 'quant_fp8' not present in model, enabling with '+quant_fp8' has no effect
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:19:39 [gpu_model_runner.py:4341] Model loading took 91.24 GiB memory and 320.313217 seconds
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:52 [default_loader.py:293] Loading weights took 389.33 seconds
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:52 [fp8.py:539] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:52 [gpu_model_runner.py:4341] Model loading took 99.76 GiB memory and 392.842387 seconds
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:53 [decorators.py:465] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/b103951f49e6120f93226a38e8a98f37838e37e606b55d85052750121d19bbbe/rank_1_0/model
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:20:54 [backends.py:913] Using cache directory: /root/.cache/vllm/torch_compile_cache/4736cbb7bc/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:20:54 [backends.py:973] Dynamo bytecode transform time: 2.20 s
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:57 [backends.py:283] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 2.265 s
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) WARNING 03-12 15:20:58 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/nvidia/workspaces/vllm/vllm/model_executor/layers/fused_moe/configs/E=288,N=1280,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:20:53 [decorators.py:465] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/55964fae7296c9bf8ac71610b565b1611db1cc6ea07b34ae8785462e86a1e546/rank_0_0/model
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) /root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/torch/utils/_config_module.py:540: FutureWarning: torch._dynamo.config.skip_code_recursive_on_recompile_limit_hit is deprecated and does not do anything. It will be removed in a future version of PyTorch.
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148)   config[key] = copy.deepcopy(getattr(self, key))
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:59 [monitor.py:35] torch.compile and initial profiling run took 6.57 s in total
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:59 [gpu_worker.py:425] Available KV cache memory: 0.91 GiB
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:54 [backends.py:913] Using cache directory: /root/.cache/vllm/torch_compile_cache/5369359fc9/rank_1_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:20:54 [backends.py:973] Dynamo bytecode transform time: 2.21 s
(EngineCore_DP0 pid=54151) WARNING 03-12 15:20:59 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 9.09% KV cache memory
(EngineCore_DP0 pid=54151) INFO 03-12 15:20:59 [kv_cache_utils.py:1314] GPU KV cache size: 9,984 tokens
(EngineCore_DP0 pid=54151) INFO 03-12 15:20:59 [kv_cache_utils.py:1319] Maximum concurrency for 8,192 tokens per request: 2.51x
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) 2026-03-12 15:20:59,913 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) 2026-03-12 15:21:00,047 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/51 [00:00<?, ?it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   2%|▏         | 1/51 [00:00<00:10,  4.77it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   4%|▍         | 2/51 [00:00<00:08,  5.99it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   6%|▌         | 3/51 [00:00<00:07,  6.60it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   8%|▊         | 4/51 [00:00<00:06,  7.11it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  10%|▉         | 5/51 [00:00<00:06,  7.28it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  12%|█▏        | 6/51 [00:00<00:06,  7.18it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  14%|█▎        | 7/51 [00:01<00:06,  7.28it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  16%|█▌        | 8/51 [00:01<00:05,  7.27it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  18%|█▊        | 9/51 [00:01<00:05,  7.32it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  20%|█▉        | 10/51 [00:01<00:05,  7.81it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  22%|██▏       | 11/51 [00:01<00:04,  8.16it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  24%|██▎       | 12/51 [00:01<00:04,  8.49it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  25%|██▌       | 13/51 [00:01<00:04,  8.80it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  27%|██▋       | 14/51 [00:01<00:04,  9.08it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  29%|██▉       | 15/51 [00:01<00:03,  9.23it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  33%|███▎      | 17/51 [00:02<00:03,  9.29it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  37%|███▋      | 19/51 [00:02<00:03,  9.75it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  41%|████      | 21/51 [00:02<00:02, 10.11it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  45%|████▌     | 23/51 [00:02<00:02, 10.37it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  49%|████▉     | 25/51 [00:02<00:02, 10.52it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  53%|█████▎    | 27/51 [00:03<00:02, 10.70it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  57%|█████▋    | 29/51 [00:03<00:02, 10.93it/s]
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) /root/miniconda3/envs/vllm-spark/lib/python3.11/site-packages/torch/utils/_config_module.py:540: FutureWarning: torch._dynamo.config.skip_code_recursive_on_recompile_limit_hit is deprecated and does not do anything. It will be removed in a future version of PyTorch.
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244)   config[key] = copy.deepcopy(getattr(self, key))
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  61%|██████    | 31/51 [00:03<00:01, 11.22it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  65%|██████▍   | 33/51 [00:03<00:01, 11.46it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  69%|██████▊   | 35/51 [00:03<00:01, 11.87it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  73%|███████▎  | 37/51 [00:03<00:01, 12.23it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  76%|███████▋  | 39/51 [00:04<00:00, 12.44it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  80%|████████  | 41/51 [00:04<00:00, 12.40it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  84%|████████▍ | 43/51 [00:04<00:00, 12.89it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  88%|████████▊ | 45/51 [00:04<00:00, 13.33it/s]
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) 2026-03-12 15:20:59,884 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  92%|█████████▏| 47/51 [00:04<00:00, 13.30it/s]
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) 2026-03-12 15:21:00,084 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  96%|█████████▌| 49/51 [00:04<00:00, 14.14it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:04<00:00, 10.29it/s]
Capturing CUDA graphs (decode, FULL):   0%|          | 0/35 [00:00<?, ?it/s]
Capturing CUDA graphs (decode, FULL):   3%|▎         | 1/35 [00:00<00:08,  3.93it/s]
Capturing CUDA graphs (decode, FULL):   6%|▌         | 2/35 [00:00<00:05,  5.66it/s]
Capturing CUDA graphs (decode, FULL):   9%|▊         | 3/35 [00:00<00:04,  6.62it/s]
Capturing CUDA graphs (decode, FULL):  11%|█▏        | 4/35 [00:00<00:04,  7.24it/s]
Capturing CUDA graphs (decode, FULL):  14%|█▍        | 5/35 [00:00<00:03,  7.65it/s]
Capturing CUDA graphs (decode, FULL):  17%|█▋        | 6/35 [00:00<00:03,  7.90it/s]
Capturing CUDA graphs (decode, FULL):  20%|██        | 7/35 [00:00<00:03,  8.12it/s]
Capturing CUDA graphs (decode, FULL):  23%|██▎       | 8/35 [00:01<00:03,  8.36it/s]
Capturing CUDA graphs (decode, FULL):  26%|██▌       | 9/35 [00:01<00:03,  8.56it/s]
Capturing CUDA graphs (decode, FULL):  29%|██▊       | 10/35 [00:01<00:02,  8.69it/s]
Capturing CUDA graphs (decode, FULL):  31%|███▏      | 11/35 [00:01<00:02,  8.82it/s]
Capturing CUDA graphs (decode, FULL):  34%|███▍      | 12/35 [00:01<00:02,  8.96it/s]
Capturing CUDA graphs (decode, FULL):  37%|███▋      | 13/35 [00:01<00:02,  9.13it/s]
Capturing CUDA graphs (decode, FULL):  40%|████      | 14/35 [00:01<00:02,  9.33it/s]
Capturing CUDA graphs (decode, FULL):  46%|████▌     | 16/35 [00:01<00:01,  9.74it/s]
Capturing CUDA graphs (decode, FULL):  51%|█████▏    | 18/35 [00:02<00:01,  9.90it/s]
Capturing CUDA graphs (decode, FULL):  57%|█████▋    | 20/35 [00:02<00:01, 10.35it/s]
Capturing CUDA graphs (decode, FULL):  63%|██████▎   | 22/35 [00:02<00:01, 10.79it/s]
Capturing CUDA graphs (decode, FULL):  69%|██████▊   | 24/35 [00:02<00:00, 11.22it/s]
Capturing CUDA graphs (decode, FULL):  74%|███████▍  | 26/35 [00:02<00:00, 11.67it/s]
Capturing CUDA graphs (decode, FULL):  80%|████████  | 28/35 [00:02<00:00, 12.15it/s]
Capturing CUDA graphs (decode, FULL):  86%|████████▌ | 30/35 [00:03<00:00, 12.75it/s]
Capturing CUDA graphs (decode, FULL):  91%|█████████▏| 32/35 [00:03<00:00, 13.65it/s]
Capturing CUDA graphs (decode, FULL):  97%|█████████▋| 34/35 [00:03<00:00, 14.01it/s]
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=14526, ip=169.254.12.148) INFO 03-12 15:21:08 [gpu_model_runner.py:5363] Graph capturing finished in 9 secs, took 0.40 GiB
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:20:58 [backends.py:283] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 2.631 s
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) WARNING 03-12 15:20:59 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/nvidia/workspaces/vllm/vllm/model_executor/layers/fused_moe/configs/E=288,N=1280,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:20:59 [monitor.py:35] torch.compile and initial profiling run took 6.67 s in total
(EngineCore_DP0 pid=54151) (RayWorkerWrapper pid=54244) INFO 03-12 15:20:59 [gpu_worker.py:425] Available KV cache memory: 7.78 GiB
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00, 10.32it/s]
(EngineCore_DP0 pid=54151) INFO 03-12 15:21:08 [core.py:293] init engine (profile, create kv cache, warmup model) took 16.60 seconds
(EngineCore_DP0 pid=54151) The tokenizer you are loading from '/home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=54151) INFO 03-12 15:21:09 [vllm.py:753] Asynchronous scheduling is disabled.
(APIServer pid=54127) INFO 03-12 15:21:09 [api_server.py:496] Supported tasks: ['generate']
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) INFO 03-12 15:21:09 [serving.py:182] Warming up chat template processing...
(APIServer pid=54127) The tokenizer you are loading from '/home/nvidia/workspaces/models/stepfun-ai/Step-3.5-Flash-FP8' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=54127) INFO 03-12 15:21:09 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=54127) INFO 03-12 15:21:09 [serving.py:207] Chat template warmup completed in 257.9ms
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54127) INFO 03-12 15:21:09 [api_server.py:501] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:37] Available routes are:
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=54127) INFO 03-12 15:21:09 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=54127) INFO:     Started server process [54127]
(APIServer pid=54127) INFO:     Waiting for application startup.
(APIServer pid=54127) INFO:     Application startup complete.


(EngineCore_DP0 pid=54151) INFO 03-12 15:22:51 [ray_executor.py:567] RAY_CGRAPH_get_timeout is set to 300
(EngineCore_DP0 pid=54151) INFO 03-12 15:22:51 [ray_executor.py:571] VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE = auto
(EngineCore_DP0 pid=54151) INFO 03-12 15:22:51 [ray_executor.py:575] VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM = False
(EngineCore_DP0 pid=54151) INFO 03-12 15:22:51 [ray_executor.py:634] Using RayPPCommunicator (which wraps vLLM _PP GroupCoordinator) for Ray Compiled Graph communication.
(EngineCore_DP0 pid=54151) 2026-03-12 15:22:51,551      INFO torch_tensor_accelerator_channel.py:807 -- Creating communicator group f73064ed-6020-47c5-8a24-3b9250686e76 on actors: [Actor(RayWorkerWrapper, 7eab06c3fd3796d2c9c9bba602000000), Actor(RayWorkerWrapper, 065a10d5c095d20c25eefb8802000000)]
(EngineCore_DP0 pid=54151) 2026-03-12 15:22:51,863      INFO torch_tensor_accelerator_channel.py:833 -- Communicator group initialized.





(APIServer pid=54127) INFO 03-12 15:23:00 [loggers.py:259] Engine 000: Avg prompt throughput: 24.2 tokens/s, Avg generation throughput: 54.2 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.9%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:23:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 14.7%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:23:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 19.6%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:23:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.8 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 26.8%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:23:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 55.7 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 31.6%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:23:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 56.8 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 36.4%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:24:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.2 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:24:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 55.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.4%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:24:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 63.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 53.2%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:24:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.8 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 60.4%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:24:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.7 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 67.7%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:24:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 72.5%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:25:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 59.8 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 77.2%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:25:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 80.6%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:25:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 55.7 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 81.8%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:25:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 83.6%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:25:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 84.8%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:25:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 86.7%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:26:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 87.9%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:26:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 87.3%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:26:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.3 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 90.9%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:26:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 55.7 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 92.1%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:26:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 93.9%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:26:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.1%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:27:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 96.9%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:27:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 98.1%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO:     127.0.0.1:34894 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34904 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34910 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34934 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34956 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:34994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:35010 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO:     127.0.0.1:35018 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO 03-12 15:27:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO 03-12 15:27:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 64.9%
(APIServer pid=54127) INFO:     127.0.0.1:41294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO 03-12 15:31:30 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 62.9%



(APIServer pid=54127) INFO 03-12 15:31:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.8%, Prefix cache hit rate: 62.9%
(APIServer pid=54127) INFO 03-12 15:31:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 62.9%
(APIServer pid=54127) INFO 03-12 15:32:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 62.9%
(APIServer pid=54127) INFO 03-12 15:32:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 62.9%
(APIServer pid=54127) INFO 03-12 15:32:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 62.9%
(APIServer pid=54127) INFO 03-12 15:32:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 62.9%
(APIServer pid=54127) INFO:     127.0.0.1:42238 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=54127) INFO 03-12 15:39:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.6 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 63.2%


(APIServer pid=54127) INFO 03-12 15:39:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.8%, Prefix cache hit rate: 63.2%
(APIServer pid=54127) INFO 03-12 15:39:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 63.2%
(APIServer pid=54127) INFO 03-12 15:40:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 63.2%
(APIServer pid=54127) INFO 03-12 15:40:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 63.2%
(APIServer pid=54127) INFO 03-12 15:40:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 63.2%
(APIServer pid=54127) INFO 03-12 15:40:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 63.2%
(APIServer pid=54127) INFO 03-12 15:51:10 [loggers.py:259] Engine 000: Avg prompt throughput: 24.2 tokens/s, Avg generation throughput: 63.1 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.0%, Prefix cache hit rate: 64.0%
(APIServer pid=54127) INFO 03-12 15:51:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.2%, Prefix cache hit rate: 64.0%
(APIServer pid=54127) INFO 03-12 15:51:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 64.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 24.4%, Prefix cache hit rate: 64.0%
(APIServer pid=54127) INFO 03-12 15:51:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 67.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 31.7%, Prefix cache hit rate: 64.0%
(APIServer pid=54127) INFO 03-12 15:51:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 64.5 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 36.5%, Prefix cache hit rate: 64.0%
(APIServer pid=54127) INFO 03-12 15:52:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 66.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.7%, Prefix cache hit rate: 64.0%


(APIServer pid=54127) INFO 03-12 15:52:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.7%, Prefix cache hit rate: 64.0%

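8. Operations

8.1 Shutdown and cleanup

# Find and force-kill the api_server process (pgrep avoids matching the grep command itself)
pgrep -f api_server | xargs -r kill -9

# Clean up all leftover Python/Ray processes (note: this kills every Python process on the host)
pkill -9 python
ray stop -f
# Flush dirty pages and drop the page cache, which shares the unified memory pool with the GPU
sync; sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"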

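8.2 Real-time log monitoring

# Follow the vLLM server log (when the server runs in the background)
tail -f /home/nvidia/workspaces/ray/vllm_server.log

# Show only the throughput reports
tail -f /home/nvidia/workspaces/ray/vllm_server.log | grep 'throughput\|tokens/s'

# Monitor memory usage (GB10 uses unified memory; there is no dedicated VRAM)
nvidia-smi

# Monitor NIC traffic (PP communication bandwidth)
ifstat -i enp1s0f1np1 1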
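The throughput lines that the grep filter above selects can also be summarized offline. The following is a minimal sketch, assuming the stats line keeps the format shown in the startup log; `summarize_generation_throughput` is a hypothetical helper, not part of vLLM, and the sample lines are copied from the log above.

```python
import re
from statistics import mean

# Matches the periodic stats line printed by the vLLM API server.
STATS_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs"
)

# Three lines copied from the log: two under load, one idle.
SAMPLE_LINES = [
    "(APIServer pid=54127) INFO 03-12 15:23:00 [loggers.py:259] Engine 000: Avg prompt throughput: 24.2 tokens/s, Avg generation throughput: 54.2 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.9%, Prefix cache hit rate: 64.9%",
    "(APIServer pid=54127) INFO 03-12 15:23:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 15 reqs, Waiting: 0 reqs, GPU KV cache usage: 14.7%, Prefix cache hit rate: 64.9%",
    "(APIServer pid=54127) INFO 03-12 15:27:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 64.9%",
]


def summarize_generation_throughput(lines):
    """Return (sample count, mean tokens/s), skipping intervals with no running request."""
    samples = []
    for line in lines:
        match = STATS_RE.search(line)
        if match and int(match.group("running")) > 0:
            samples.append(float(match.group("gen")))
    return len(samples), (mean(samples) if samples else 0.0)


if __name__ == "__main__":
    count, avg = summarize_generation_throughput(SAMPLE_LINES)
    print(f"{count} samples, mean generation throughput {avg:.1f} tokens/s")
```

Skipping idle intervals keeps the trailing 0.0 tokens/s reports from dragging the average down. To run it on a real log, pass `open(path)` instead of `SAMPLE_LINES`.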

8.3 Ray Dashboard

# Local access (on spark-7):
# open http://127.0.0.1:8265 in a browser

# To reach it from another machine, add this flag when starting Ray on spark-7:
# ray start --head ... --dashboard-host=0.0.0.0
# then open http://169.254.72.234:8265

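9. Known Issues and Follow-up Optimizations

9.1 Known issues

Issue	Impact	Status
`--enforce-eager` limits speed	~11 tokens/s; 25~35 tokens/s expected once it is removed	Remove once Triton support for sm_121 is stable
Missing MoE config file	The default MoE config is used; performance is sub-optimal	A GB10-specific config has to be generated by manual tuning
Uneven layer split [23, 22]	spark-7 computes 1 extra layer, a slight load imbalance	Can be adjusted with `VLLM_PP_LAYER_PARTITION` (45 layers cannot split evenly across 2 stages)
PP communication goes through Ray Compiled DAG over TCP	Higher cross-node latency, but the traffic is tiny (~230 KB/s), so it is not a bottleneck	Could try `VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE=nccl`
Ray Compiled DAG may hang across nodes	Risk of a hang under sustained heavy load (Ray #58426)	Watch for an upstream Ray fix
Tokenizer regex warning	Tokenization may deviate slightly (treated as harmless here)	Can be fixed with `fix_mistral_regex=True`
SSH to spark-6 requires a password	Inconvenient for operations	Set up passwordless login with `ssh-copy-id`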
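For the layer-split row above, the candidate values for `VLLM_PP_LAYER_PARTITION` can be enumerated with a small helper. This is only a sketch: `balanced_partition` and `to_env_value` are hypothetical names, and the comma-separated value format (for example `23,22`) is an assumption about how vLLM parses the variable, so check it against the vLLM version in use.

```python
def balanced_partition(num_layers, num_stages, extra_on_first=True):
    """Split num_layers across num_stages as evenly as possible.

    Leftover layers go to the first stages by default, or to the last
    stages when extra_on_first is False.
    """
    base, extra = divmod(num_layers, num_stages)
    counts = [base] * num_stages
    order = range(num_stages) if extra_on_first else range(num_stages - 1, -1, -1)
    for stage in list(order)[:extra]:
        counts[stage] += 1
    return counts


def to_env_value(counts, num_layers):
    """Format per-stage layer counts as a comma-separated string (assumed format)."""
    if sum(counts) != num_layers:
        raise ValueError(f"partition {counts} does not cover {num_layers} layers")
    return ",".join(str(c) for c in counts)


if __name__ == "__main__":
    total = 23 + 22  # layer counts reported in the startup log
    for extra_on_first in (True, False):
        counts = balanced_partition(total, 2, extra_on_first)
        print("VLLM_PP_LAYER_PARTITION=" + to_env_value(counts, total))
```

With 45 layers and 2 stages the only near-even choices are 23/22 and 22/23, so the variable can move the extra layer to the other node but cannot remove the imbalance.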

9.2 MoE tuning config (to do)

At startup, vLLM warns that the GB10-specific MoE tuning config file is missing:

# Path of the missing config file (directory, then file name):
/home/nvidia/workspaces/vllm/vllm/model_executor/layers/fused_moe/configs/
E=288,N=1280,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
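The file name in the warning encodes the MoE shape, the device, and the quantization settings. The sketch below rebuilds that name from its parts and checks whether the file is present. `moe_config_file_name` is a hypothetical helper that only mirrors the naming pattern seen in the log; it is not vLLM's own function.

```python
from pathlib import Path

# Directory taken from the warning in the startup log.
CONFIG_DIR = Path(
    "/home/nvidia/workspaces/vllm/vllm/model_executor/layers/fused_moe/configs"
)


def moe_config_file_name(num_experts, n, device_name, dtype, block_shape):
    """Rebuild the config file name using the pattern seen in the log warning."""
    block = "[" + ",".join(str(b) for b in block_shape) + "]"
    return (
        f"E={num_experts},N={n},device_name={device_name},"
        f"dtype={dtype},block_shape={block}.json"
    )


if __name__ == "__main__":
    name = moe_config_file_name(288, 1280, "NVIDIA_GB10", "fp8_w8a8", [128, 128])
    path = CONFIG_DIR / name
    print(path, "exists" if path.exists() else "missing")
```

Running the check again after a tuned config has been generated gives a quick confirmation that the file landed where the warning expects it.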

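9.3 Removing enforce-eager (next optimization)

Once the --enforce-eager flag is removed, vLLM enables CUDA Graphs, which replay captured kernel sequences and reduce per-step launch overhead, so a clear speedup is expected.

⚠️ Known risk: on GB10 (sm_121a), Triton fails when invoking ptxas with the error "Value 'sm_121a' is not defined for option 'gpu-name'" (triton #9181), which makes the torch.compile / TorchInductor path fail and fall back. After removing enforce-eager, watch the startup log and runtime stability closely.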

# For testing, remove this line from the launch script:
#   --enforce-eager \

# vLLM will then try to enable CUDA Graphs; startup takes roughly 2~5 minutes longer

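10. Key technical insights

10.1 How PP communication relates to NCCL

A common misconception is that cross-node PP traffic goes through NCCL by default. The actual paths are: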

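Communication type	Technology used	Notes
PP inter-stage activation transfer	Ray Compiled DAG (default auto channel)	Controlled by `VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE`; defaults to auto
Intra-node TP communication (when TP>1)	NCCL	TP=1 in this deployment, so this path does not exist
Gloo all-reduce (some synchronization operations)	Gloo over TCP	Uses the NIC named by `GLOO_SOCKET_IFNAME`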

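Measured: during PP communication the TX rate on enp1s0f1np1 is only about 150~230 KB/s, far below the NIC's theoretical bandwidth. The cross-node activation traffic is tiny, so network bandwidth is not a bottleneck for inference performance.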
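
To put the measured rate in proportion, the short calculation below compares it with the ~100 Gbps single-port ceiling from section 1.2. It is a sketch that assumes the monitoring tool reports decimal kilobytes (1 KB = 1000 bytes); `link_utilization` is a hypothetical helper.

```python
def link_utilization(observed_kb_per_s, link_gbps):
    """Fraction of the link used, given a rate in KB/s and a link speed in Gbps."""
    observed_bits_per_s = observed_kb_per_s * 1000 * 8  # assumes 1 KB = 1000 bytes
    return observed_bits_per_s / (link_gbps * 1e9)


for rate_kb in (150, 230):
    share = link_utilization(rate_kb, 100)
    print(f"{rate_kb} KB/s is {share:.6%} of a 100 Gbps link")
```

Even the upper measurement is on the order of a few hundred-thousandths of the link, which is why the per-message latency of the channel matters far more here than its bandwidth.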

The Ray Compiled DAG channel type can be switched with an environment variable:

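# Default auto: Ray picks the channel automatically
export VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE=auto

# Force NCCL (lowest latency when RDMA is available; otherwise NCCL falls back to TCP sockets)
export VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE=nccl

# Force the shared-memory channel (effective for single-node setups)
export VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE=shm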

With the nccl channel type, NCCL_DEBUG=TRACE shows which transport is actually used:

[send] via NET/IB/GDRDMA → InfiniBand verbs / RDMA (RoCE also appears as NET/IB), efficient
[send] via NET/Socket → TCP sockets, suboptimal
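
When the debug log is long, the transport check can be automated. The following is a minimal sketch that classifies a log line by the `via NET/...` marker described above; `classify_transport` is a hypothetical helper and the sample lines are illustrative, not captured output.

```python
import re

# Matches the transport marker that follows "via" in NCCL debug lines.
VIA_RE = re.compile(r"via (NET/\S+)")


def classify_transport(line):
    """Return 'rdma', 'tcp', 'other', or None when the line names no network transport."""
    match = VIA_RE.search(line)
    if match is None:
        return None
    path = match.group(1)
    if path.startswith("NET/IB"):
        return "rdma"
    if path.startswith("NET/Socket"):
        return "tcp"
    return "other"


# Illustrative lines only; real NCCL output carries more fields.
examples = [
    "NCCL INFO Channel 00 : 0 -> 1 [send] via NET/IB/GDRDMA",
    "NCCL INFO Channel 00 : 0 -> 1 [send] via NET/Socket",
]
for line in examples:
    print(classify_transport(line), "<-", line)
```

Counting the 'tcp' results over a whole log gives a quick signal that NCCL fell back to sockets.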

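10.2 What the NCCL variables actually affect

Environment variable	Actual effect	Impact on the current setup (TP=1, PP=2)
`NCCL_SOCKET_IFNAME=enp1s0f1np1`	Binds NCCL socket communication to the given NIC	Effective (keep)
`GLOO_SOCKET_IFNAME=enp1s0f1np1`	Binds Gloo communication to the given NIC	Effective (keep)
`NCCL_IB_DISABLE=0`	Allows RDMA/RoCE transport	No direct effect on the PP DAG channel
`NCCL_NET_GDR_LEVEL=5`	Controls the GPUDirect RDMA level	No direct effect on the PP DAG channel
`RAY_CGRAPH_get_timeout`	Timeout for Ray Compiled DAG operations (default 300 s)	Can be raised for multi-node runs
`VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE`	Selects the underlying channel type for PP communication	Core variable; defaults to auto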
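
A small pre-launch check can catch a mistyped interface name or channel type before the cluster starts. This sketch validates only the variables from the table above; `check_env` and its message format are hypothetical, and the expected interface name is the one used in this deployment.

```python
import os

EXPECTED_IFACE = "enp1s0f1np1"  # high-speed interface used in this deployment
VALID_CHANNEL_TYPES = {"auto", "nccl", "shm"}


def check_env(env):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    for name in ("NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME"):
        value = env.get(name)
        if value != EXPECTED_IFACE:
            problems.append(f"{name}={value!r}, expected {EXPECTED_IFACE!r}")
    channel = env.get("VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE", "auto")
    if channel not in VALID_CHANNEL_TYPES:
        problems.append(f"unknown channel type {channel!r}")
    return problems


for problem in check_env(os.environ):
    print("WARN:", problem)
```

Calling it from the launch script on both nodes keeps the two environments from drifting apart.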

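10.3 Why the VLLM_HOST_IP fix matters so much

The problems fixed in this deployment all share one root cause: DGX Spark has several network interfaces at once (management network 192.168.x.x, high-speed direct link 169.254.x.x), and Python's usual socket-based route probe picks the interface that holds the default route (the management network), so it returns the wrong IP.

vLLM v0.17.0rc1 calls get_ip() in three different places, and each one can return the wrong IP. The fix is to read the VLLM_HOST_IP environment variable first at every relevant call site, bypassing auto-detection.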
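
The "environment variable first, probe second" pattern can be sketched as follows. This is an illustration of the idea, not the patched vLLM code: `resolve_host_ip` is a hypothetical name, and the probe target 8.8.8.8 is just a conventional routable address.

```python
import os
import socket


def resolve_host_ip(env=None):
    """Prefer an explicitly configured VLLM_HOST_IP; otherwise fall back to route probing."""
    env = os.environ if env is None else env
    configured = env.get("VLLM_HOST_IP")
    if configured:
        return configured
    # Fallback: a UDP connect() sends no packet. It only asks the kernel which
    # source address the default route would use, which is the management NIC
    # on a multi-homed host like DGX Spark.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        try:
            probe.connect(("8.8.8.8", 80))
            return probe.getsockname()[0]
        except OSError:
            return "127.0.0.1"


print(resolve_host_ip({"VLLM_HOST_IP": "169.254.72.234"}))
```

Because the configured value short-circuits the probe, exporting VLLM_HOST_IP on each node is enough to pin every caller to the high-speed link.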

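10.4 How the GB10 unified memory architecture affects inference performance

Property	Traditional discrete GPU (e.g. H100 SXM)	GB10 unified memory
CPU↔GPU data transfer	Goes over PCIe (~64 GB/s)	No copy needed; direct access
Memory capacity	Up to 80GB HBM3	128GB LPDDR5X
Memory bandwidth	~3.35 TB/s (HBM3)	273 GB/s
Inference bottleneck	Compute	Memory bandwidth (main bottleneck)
Fit for large models	Multiple GPUs needed to hold the model	One machine can hold a ~200B-parameter model at 4-bit precision; this 196B FP8 model (~196 GB of weights) needs two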

Memory bandwidth (273 GB/s) is the main performance ceiling for LLM inference on GB10, and its effect is especially visible at large batch sizes or high concurrency.
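
A rough way to see this ceiling is to divide the memory bandwidth by the bytes of weights read per decoded token. The sketch below makes strong simplifying assumptions: every active weight is read exactly once per token, and KV cache reads, activations and communication are ignored. The 10B active-parameter figure is a placeholder for illustration, not a published number for Step 3.5 Flash.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Placeholder: 10B active parameters at 1 byte each (FP8), read at 273 GB/s.
ceiling = decode_ceiling_tokens_per_s(273, 10, 1)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s per stream")
```

With PP=2 the two stages process a given token one after the other, so under these assumptions splitting the model across two nodes does not raise the per-stream bound; higher aggregate throughput comes from batching concurrent requests.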