LMDeploy in Production: Zero-Deployment Checklist, QPS–GPU Memory Estimation Table, and a Complete TurboMind vs vLLM Benchmark Script Guide

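A hands-on LMDeploy guide focused on three things: a zero-deployment workflow that needs no running service (offline evaluation and batch jobs in one pass), quick QPS-to-GPU-memory capacity estimation (a closed-form formula plus a small script), and a standardized TurboMind vs vLLM benchmark plan (oha/wrk/Locust runs, with results logged to tables so they can be reproduced). It provides checklists, scripts, architecture diagrams, and parameter templates so that you can validate on a development machine, then scale out and go live reliably in production.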

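1. Deployment / zero-deployment checklists

1.1 Zero deployment (offline, evaluation, and batch work only)

Use when: running evaluation scripts, generating datasets in bulk, offline cleaning and labeling, PoCs.
Approach: use only lmdeploy.pipeline; no resident service and no exposed ports.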

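Python 3.9–3.13 in a dedicated conda environment
Install lmdeploy (prebuilt wheels target CUDA 12+; on NVIDIA 50-series GPUs install the cu128 wheel)
Model hub: Hugging Face by default; in mainland China, ModelScope or openMind Hub is recommended (set LMDEPLOY_USE_MODELSCOPE / LMDEPLOY_USE_OPENMIND_HUB as needed)
Storage and cache: make sure the model, tokenizer, and KV warm-up directories are readable and writable (SSD preferred)
Typical script: batch inference with pipeline(), CSV/JSONL in and out
Resource usage: only weight memory plus (optionally) the KV cache; no backward pass or optimizer state
Monitoring: log throughput (samples/s, tok/s), peak GPU memory, and failed retries to CSV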
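
The checklist above can be sketched as a small batch script. This is a minimal sketch rather than a reference implementation: the file names `prompts.jsonl` and `outputs.jsonl`, the `prompt` field, and the batch size are assumptions for illustration, and `lmdeploy` is imported lazily inside `run()` so the helpers can be exercised on a machine without a GPU.

```python
import json
import time


def load_prompts(path):
    """Read one JSON object per line and return its 'prompt' field (assumed schema)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["prompt"] for line in f if line.strip()]


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run(model="internlm/internlm3-8b-instruct", src="prompts.jsonl",
        dst="outputs.jsonl", batch_size=32):
    # Imported here so the helpers above work without lmdeploy installed.
    from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

    pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
    gen = GenerationConfig(max_new_tokens=64)
    prompts = load_prompts(src)
    tokens, t0 = 0, time.perf_counter()
    with open(dst, "w", encoding="utf-8") as out:
        for chunk in batched(prompts, batch_size):
            # pipeline() accepts a list of prompts and returns one response each.
            for prompt, resp in zip(chunk, pipe(chunk, gen_config=gen)):
                tokens += resp.generate_token_len
                out.write(json.dumps({"prompt": prompt, "text": resp.text},
                                     ensure_ascii=False) + "\n")
    elapsed = time.perf_counter() - t0
    print(f"{len(prompts)} samples, {len(prompts) / elapsed:.2f} samples/s, "
          f"{tokens / elapsed:.1f} tok/s")
```

Calling `run()` on a GPU machine writes one JSON line per prompt and prints the two throughput figures the monitoring item asks for; peak memory and retry counts would still need to be logged separately.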

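1.2 Progressive deployment (serving plus concurrency)

Use when: you need an OpenAI-compatible API, load testing, A/B experiments, or small-scale production.
Approach: start with TurboMind api_server on a single multi-GPU machine, and add the Request Distributor (proxy service) when needed.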

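Driver/runtime: CUDA ≥ 12, the NVIDIA driver, and nvidia-container-toolkit (when containerized)
Engine: TurboMind as the first choice; the PyTorch engine suits custom development and trying out new features
Performance switches: continuous batching (on by default), online KV cache quantization (--quant-policy 4 for INT4, 8 for INT8), and a suitable --session-len
Parallelism: --tp/--pp; MoE/DeepSeek-family models can be combined with PD disaggregation (Prefill/Decode decoupling)
Routing and multiple models: use the Request Distributor for multi-model and multi-node routing and quotas
Resource sizing: plan --cache-max-entry-count (the share of free GPU memory given to the KV cache) around your QPS target (see the estimates in the next section)
Observability: Prometheus/Grafana (RPS, P50/P95 latency, generated tok/s, OOM/preemption)
Security: OpenAI-compatible API keys, IP allowlists, TLS termination (Nginx/Envoy), audit logs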

2. QPS–GPU memory estimation table and calculation script

2.1 Estimation formula

Total GPU memory ≈ weight memory + KV cache per session × concurrent sessions + fragmentation and reserve
KV cache per request (approximate):

$$\text{KV\_per\_req} \approx L \times S \times 2 \times H_{kv} \times B \quad (\text{bytes})$$

where:
- $L$: number of layers (for example, 32 for Llama3-8B)
- $S$: context length (session-len, e.g. 8192)
- 2: one tensor each for K and V
- $H_{kv}$: KV width per token per layer, i.e. num_kv_heads × head_dim; the estimates in this section take 4096, which with $L = 32$ gives 0.5 MB per token at FP16
- $B$: bytes per element (FP16 = 2, INT8 = 1, INT4 = 0.5), plus 5–10% extra overhead
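
A worked example may help calibrate the formula. The numbers below assume a model with 32 layers whose keys and values are each 4096 elements wide per token per layer (no grouped-query attention). That width is an assumption, chosen because it reproduces the roughly 0.5 MB per token and 4 GB per session used in the estimates that follow.

```latex
\begin{aligned}
\text{KV per token}   &= 32 \times 2 \times 4096 \times 2\ \text{B} = 524{,}288\ \text{B} = 0.5\ \text{MiB} \\
\text{KV per session} &= 0.5\ \text{MiB} \times 8192 = 4096\ \text{MiB} = 4\ \text{GiB}
\end{aligned}
```

With grouped-query attention the width shrinks to num_kv_heads × head_dim. For Llama-3-8B (8 KV heads of dimension 128, so a width of 1024) the same 8192-token session would need about 1 GiB at FP16, so it is worth reading the model's config before relying on a table value.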

2.2 Typical configuration estimates (H100 80GB example)

Model (weight quantization)	Estimated weight memory	`session-len`	KV precision	KV per session	Reserve for fragmentation	Usable KV total	Estimated concurrent sessions
InternLM3-8B (FP8 weights)	≈ 8–10 GB	8192	FP16	≈ 4 GB	6 GB	≈ 64–66 GB	16
InternLM3-8B (FP8 weights)	≈ 8–10 GB	8192	INT8	≈ 2 GB	6 GB	≈ 64–66 GB	32
InternLM3-8B (FP8 weights)	≈ 8–10 GB	8192	INT4	≈ 1 GB	6 GB	≈ 64–66 GB	64
Mixtral 8×7B (W4 weights)	≈ 36–40 GB	4096	INT4	≈ 0.5 GB	6 GB	≈ 34 GB	≈ 68 (affected by the number of active experts)

2.3 One-shot estimation script (Python)

def kv_per_req_mb(layers=32, session_len=8192, bytes_per_elem=2.0,
                  kv_width=4096, overhead=0.08):
    """KV cache per session in MiB; kv_width = num_kv_heads * head_dim."""
    # Per token: K and V (factor 2) for every layer. Defaults give 0.5 MiB at FP16.
    per_token_mb = layers * 2 * kv_width * bytes_per_elem / 1024**2
    base = per_token_mb * session_len
    return base * (1 + overhead)

def plan(total_vram_gb=80, weight_gb=10, session_len=8192,
         kv_bytes=0.5, layers=32, reserve_gb=6, kv_width=4096):
    """Return (concurrent sessions, KV MiB per session, usable KV GiB)."""
    kv_mb = kv_per_req_mb(layers, session_len, kv_bytes, kv_width)
    usable_gb = max(total_vram_gb - weight_gb - reserve_gb, 0)
    conc = int((usable_gb * 1024) // kv_mb) if kv_mb > 0 else 0
    return conc, kv_mb, usable_gb

if __name__ == "__main__":
    for kv_bytes, name in [(2, "FP16"), (1, "INT8"), (0.5, "INT4")]:
        conc, kv_mb, usable_gb = plan(kv_bytes=kv_bytes)
        print(f"{name}: concurrency≈{conc}, KV per session≈{kv_mb:.1f} MB, usable KV≈{usable_gb:.1f} GB")
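
The script stops at concurrent sessions; turning that into a QPS figure needs a latency assumption. One common approximation is Little's law (steady-state throughput ≈ in-flight requests / mean time per request). The sketch below is an assumption layered on top of the estimate, not something LMDeploy computes, and the 2-second latency is a placeholder to be replaced with a measured value.

```python
import math


def qps_from_concurrency(concurrent_sessions, mean_latency_s):
    """Little's law: steady-state throughput = in-flight requests / mean latency."""
    if mean_latency_s <= 0:
        raise ValueError("mean_latency_s must be positive")
    return concurrent_sessions / mean_latency_s


def sessions_for_qps(target_qps, mean_latency_s):
    """Inverse view: in-flight sessions needed to sustain target_qps (rounded up)."""
    return math.ceil(target_qps * mean_latency_s)


# Placeholder latency: assume each request takes 2 s end to end.
for sessions in (16, 32, 64):
    print(f"{sessions} sessions at 2.0 s/request -> {qps_from_concurrency(sessions, 2.0):.1f} QPS")
```

Because the estimate reserves a full session-len of KV cache per session, real traffic with shorter prompts will usually fit more sessions than this suggests, which is one reason to calibrate against a benchmark.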

3. TurboMind vs vLLM benchmark plan

3.1 Start both services (on different ports)

TurboMind

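# The model path is positional. --quant-policy 4 enables the INT4 KV cache;
# --cache-max-entry-count is the fraction of free GPU memory reserved for the KV cache.
lmdeploy serve api_server internlm/internlm3-8b-instruct \
  --tp 2 \
  --session-len 8192 \
  --quant-policy 4 \
  --cache-max-entry-count 0.8 \
  --server-port 23333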

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model internlm/internlm3-8b-instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000
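
Before benchmarking, it helps to confirm that both endpoints answer the same request. The following is a minimal sketch using only the standard library; the ports match the two commands above, while the `sk-TEST` key and the helper names are placeholders introduced for illustration.

```python
import json
import urllib.request

ENDPOINTS = {
    "LMDeploy-TurboMind": "http://127.0.0.1:23333",
    "vLLM": "http://127.0.0.1:8000",
}


def build_chat_request(base_url, prompt, model="internlm/internlm3-8b-instruct",
                       max_tokens=16, api_key="sk-TEST"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "stream": False,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def smoke_check(timeout_s=60):
    """Send one short request to each endpoint and print the reply and token usage."""
    for name, url in ENDPOINTS.items():
        req = build_chat_request(url, "Say hi.")
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            data = json.load(resp)
        print(name, data["choices"][0]["message"]["content"][:60], data.get("usage"))
```

Running `smoke_check()` with both servers up should print one short reply per engine; a connection error here is cheaper to debug than a benchmark run full of failures.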

3.2 oha benchmark script

bench_oha.sh:

#!/usr/bin/env bash
set -euo pipefail

PROMPT="Tell me a short story about Shanghai."
MAX_TOKENS=64
CONCURRENCY=64
DURATION=60s

body() {
cat <<EOF
{
  "model": "internlm/internlm3-8b-instruct",
  "messages": [{"role":"user","content":"$PROMPT"}],
  "max_tokens": $MAX_TOKENS,
  "temperature": 0.0,
  "stream": false
}
EOF
}

bench() {
  NAME="$1"
  URL="$2"
  echo "==== Benchmark: $NAME ($URL) ===="
  oha -z "$DURATION" -c "$CONCURRENCY" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-TEST" \
    -m POST "$URL/v1/chat/completions" \
    -d "$(body)"
}

bench "LMDeploy-TurboMind" "http://127.0.0.1:23333"
bench "vLLM"              "http://127.0.0.1:8000"

3.3 wrk benchmark script

bench_wrk.lua:

wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.headers["Authorization"] = "Bearer sk-TEST"
local prompt = "Tell me a short story about Shanghai."
local body = string.format('{"model":"internlm/internlm3-8b-instruct","messages":[{"role":"user","content":"%s"}],"max_tokens":64,"temperature":0.0,"stream":false}', prompt)

request = function()
  return wrk.format(nil, "/v1/chat/completions", nil, body)
end

Run:

wrk -t8 -c64 -d60s --latency -s bench_wrk.lua http://127.0.0.1:23333
wrk -t8 -c64 -d60s --latency -s bench_wrk.lua http://127.0.0.1:8000

3.4 Locust load-test script

Install:

pip install locust

locustfile.py:

from locust import HttpUser, task, between

PROMPT = "Tell me a short story about Shanghai."

class LLMUser(HttpUser):
    wait_time = between(0.5, 2)

    @task
    def chat_completion(self):
        body = {
            "model": "internlm/internlm3-8b-instruct",
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 64,
            "temperature": 0.0,
            "stream": False
        }
        headers = {"Authorization": "Bearer sk-TEST"}
        self.client.post("/v1/chat/completions", json=body, headers=headers)

Run:

locust -f locustfile.py --headless -u 100 -r 10 -t 5m --host http://127.0.0.1:23333

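3.5 Tool comparison table

Tool	Strengths	Weaknesses	Best for
oha	Built-in latency percentiles, lightweight, runs as a single command	Simple feature set, limited extensibility	Quick comparisons, one-off benchmarks
wrk	High concurrency, Lua scripting for complex requests, precise latency statistics	No graphical interface, scripts must be written by hand	Baseline performance tests, bottleneck hunting
Locust	Scripted in Python, models realistic user behavior, Web UI, can run distributed	Lower peak throughput than wrk, heavier resource usage	Long-running stability tests, complex scenario simulation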
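
By default these tools report HTTP-level numbers (requests per second, latency percentiles), whereas comparing LLM engines usually also calls for generated tokens per second. The sketch below summarizes per-request records into both. The record shape `(latency_s, completion_tokens)` is an assumption, with `completion_tokens` expected to come from the `usage` field of each OpenAI-style response.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0..100) of an ascending list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(records, wall_time_s):
    """records: iterable of (latency_s, completion_tokens); wall_time_s: test duration."""
    records = list(records)  # allow generators, since we iterate twice
    latencies = sorted(lat for lat, _ in records)
    tokens = sum(tok for _, tok in records)
    return {
        "requests": len(latencies),
        "rps": len(latencies) / wall_time_s,
        "tok_per_s": tokens / wall_time_s,
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
    }
```

Writing one such summary row per engine and per concurrency level to a CSV file is one way to keep the comparison reproducible.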

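4. Ready-to-use launch parameters

# The model path is positional. --quant-policy 4 enables the INT4 KV cache;
# --cache-max-entry-count is the fraction of free GPU memory reserved for the KV cache.
lmdeploy serve api_server internlm/internlm3-8b-instruct \
  --tp 2 \
  --session-len 8192 \
  --quant-policy 4 \
  --cache-max-entry-count 0.8 \
  --server-port 23333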

5. Architecture evolution diagram

flowchart TD
    A["Zero deployment<br/>pipeline()"] --> B["Serving<br/>api_server"]
    B --> C["Multi-model proxy<br/>Request Distributor"]

6. Full example flow diagram (with benchmark loop)

flowchart LR
    U[User request] --> P["Proxy layer<br/>Request Distributor"]
    P --> S[LMDeploy api_server]
    S --> M["TurboMind / PyTorch engine<br/>model loaded"]
    M -->|inference output| P
    P --> U
    %% load-testing tools
    L1[oha] -.-> P
    L2[wrk] -.-> P
    L3[Locust] -.-> P

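Summary

Zero deployment: use pipeline() for offline inference.
Serving: api_server provides an OpenAI-compatible API.
Multi-model proxy: the Request Distributor handles traffic distribution.
Capacity planning: the estimation table and script predict concurrency, and benchmarks calibrate against real behavior.
Benchmark tools: oha (fast), wrk (steady), Locust (realistic).
End-to-end flow: user request → proxy → api_server → model → benchmark loop, closing the loop for a full rollout.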