Deployment: Model Serving, Caching, and Cost

level: architect | 10y

domain: fintech-trade
version: v2.1
last-updated: 2026-05-21
topic: Deployment - Model Serving / Caching / Cost

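Style note: this piece is primarily operational, with mechanism in support. It runs from the physical KV ledger of vLLM / PagedAttention, through the TTFT/TPOT latency breakdown, quantization, and prefix cache, to LiteLLM FinOps and multi-model routing. It follows the "hard numbers + rollout checklist" style of 03-RAG / 14-Spring-AI.

Prerequisites: 01-LLM-基础 (LLM fundamentals: attention and KV); 06-评估 (evaluation: the eval gate before switching models).

Follow-ups: 08-架构-电商 (e-commerce architecture: customer-service / shopping-guide SLAs); 24-Gateway (routing and tenants); 25-可观测 (observability: trace + $/task).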

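L1 · What It Is

1.1 One-Sentence Definition

LLM Serving: turning trained weights into a predictable online inference service. It trades off GPU memory, batch scheduling, KV cache, and quantization so that TTFT/TPOT meet the business SLA while $/1K tokens stays within a range finance can accept.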

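1.2 The Four Layers of the Serving Stack

Layer	Components	Responsibility
Access	LiteLLM / Spring AI ChatClient / Gateway	Routing, authentication, billing tags, fallback
Inference engine	vLLM / TGI / TensorRT-LLM / SGLang	Continuous batching, PagedAttention, prefix cache
Model assets	HF weights + AWQ/GPTQ/FP8	Trade-off between memory footprint and quality
Observability	Prometheus + Langfuse/OTel	TTFT, TPOT, GPU utilization, $/task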

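1.3 Serving Profile of an E-commerce Transaction Scenario

```text
Typical mixed load of "smart customer service + order explanation" (as measured at a leading e-commerce platform):
  - Peak concurrent sessions: 8,000 (1 h before a major promotion)
  - Average prompt: 2,400 tokens (system prompt + product KB + multi-turn user history)
  - Average completion: 180 tokens
  - SLA: TTFT P99 < 800 ms, TPOT P99 < 45 ms/token
  - Budget: $0.018 / session (includes retrieval embedding, excludes human agents)

Calling the GPT-4o API directly (2025 price list):
  - Input $2.5/1M × 2,400 + output $10/1M × 180 = $0.0060 + $0.0018 ≈ $0.0078 / session
  - At 8,000 peak concurrency, fully synchronous calls with no batching leave API rate limits and tail latency uncontrollable

Self-hosted vLLM (Llama-3.1-70B AWQ, 2×A100 80GB):
  - Amortized hosting + power ≈ $1.2 / GPU-hour × 2 = $2.4/h
  - Steady-state throughput ~35 req/s (batch=64, avg gen 180 tok; optimistic relative to the TPOT figures in 2.3, confirm by load test)
  - $/session ≈ ($2.4 / 3600) / 35 ≈ $0.000019 at full utilization; throughput already reflects per-request latency, so scale by 1/utilization rather than by inference seconds
  - Engineering conclusion: QPS > 500 with long prompts → self-host; long tail / experiments → API
```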
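The comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes one request per session and uses the prices and throughput quoted in the text; the function names and the `utilization` parameter are illustrative, not part of any library.

```python
def api_cost_per_session(prompt_tokens, completion_tokens, price_in_per_m, price_out_per_m):
    """Token-priced API cost for one session, in dollars (prices are $ per 1M tokens)."""
    return (prompt_tokens * price_in_per_m + completion_tokens * price_out_per_m) / 1e6


def self_hosted_cost_per_session(gpu_hour_price, num_gpus, req_per_sec, utilization=1.0):
    """Amortized GPU cost per request at a sustained throughput.

    utilization < 1 models idle capacity: the GPUs are paid for either way.
    """
    dollars_per_second = gpu_hour_price * num_gpus / 3600
    return dollars_per_second / (req_per_sec * utilization)


api = api_cost_per_session(2400, 180, 2.5, 10.0)
hosted_full = self_hosted_cost_per_session(1.2, 2, 35)
hosted_30pct = self_hosted_cost_per_session(1.2, 2, 35, utilization=0.3)
print(f"API ${api:.4f} | self-hosted @100% ${hosted_full:.6f} | @30% ${hosted_30pct:.6f}")
```

The `utilization` knob is the point of the sketch: the self-hosted figure is only as good as the share of paid GPU time that actually serves traffic.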

[Figure: serving topology. Groups: access layer, inference engine, GPU pool. Nodes: Spring AI customer-service app, LiteLLM Proxy, vLLM Worker, Prefix / KV Cache, GPU0 A100 80GB, GPU1 A100 80GB, Prometheus / Langfuse. Edge labels: route smart/fast, metrics.]

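L2 · Principles and Implementation (Main Battleground)

2.1 vLLM and Continuous Batching

Problem: the naive implementation, where one request holds the GPU until generation finishes, leaves the GPU largely idle during decode (the phase is memory-bound, so compute is underused).

Continuous batching (iteration-level scheduling):

At every decoding step, sequences that finished in that step leave the batch and newly arrived requests are inserted for prefill;
Throughput improves 2-4× (per the vLLM paper and production figures; about 2.8× for a 70B model at batch=32).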

[Sequence diagram: participants Client, vLLM Scheduler, KV BlockManager, GPU Kernels. Messages: prefill reqA (2400 tok); alloc blocks for A; prefill kernel A; prefill reqB (800 tok); prefill B, join batch; loop over decode steps: decode step for all active seqs, logits, stream token A. Note: reqC arrives mid-batch; alloc blocks C.]
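To make the scheduling difference concrete, here is a toy simulation, not vLLM's scheduler. It assumes every active sequence emits exactly one token per step and ignores prefill cost; `simulate` and the request tuples are hypothetical names for illustration.

```python
from collections import deque


def simulate(requests, max_batch, continuous):
    """Count decode steps needed to finish all requests.

    requests: list of (arrival_step, tokens_to_generate).
    continuous=True admits new sequences at every step (iteration-level scheduling);
    continuous=False admits only after the whole batch has drained (static batching).
    """
    pending = deque(sorted(requests))
    active = []  # remaining tokens per running sequence
    step = 0
    while pending or active:
        may_admit = continuous or not active
        while may_admit and pending and pending[0][0] <= step and len(active) < max_batch:
            active.append(pending.popleft()[1])
        # One decode iteration: each active sequence emits a token; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
        step += 1
    return step


# One long request arrives first, then nine short ones trickle in.
reqs = [(0, 200)] + [(i, 20) for i in range(1, 10)]
print("continuous:", simulate(reqs, max_batch=4, continuous=True))
print("static:    ", simulate(reqs, max_batch=4, continuous=False))
```

With static batching the short requests wait behind the long one; with iteration-level scheduling they ride along in the free batch slots.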

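Key configuration (as of vLLM 0.4+; load-test per GPU model):

Flag	Meaning	Suggested starting point for e-commerce customer service
`--max-num-seqs`	Maximum concurrent sequences	64 (2×A100, 70B AWQ)
`--max-model-len`	Maximum context length	8192 (16k for long KBs; per-sequence KV can double, enable with care)
`--gpu-memory-utilization`	GPU memory usage cap	0.90 (leave 10% for fragmentation and CUDA graphs)
`--enable-prefix-caching`	Reuse KV for shared prefixes	Enabled (system prompt + store policy repeat heavily)
`--tensor-parallel-size`	Tensor-parallel degree	2 (70B does not fit on one card)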

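2.2 PagedAttention: "Virtual Memory" for the KV Cache

Naive KV: each sequence pre-allocates contiguous GPU memory sized max_model_len × hidden, which causes fragmentation and over-reservation; at concurrency 64 it hits OOM even though the real average length is only 3k.

PagedAttention:

Split the KV cache into fixed-size blocks (for example 16 tokens/block);
Use a block table to map logical addresses to physical blocks (similar to OS paging);
Blocks holding a shared prefix can be shared via reference counting (this stacks with prefix caching).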

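Memory ledger (Llama-3 70B, FP16 KV, simplified per-token estimate):

```text
hidden = 8192, layers = 80, heads = 64, kv_heads = 8 (GQA), head_dim = 128

MHA upper bound (as if all 64 heads kept their own K/V):
  KV per token ≈ 2 × layers × hidden × 2 B = 2 × 80 × 8192 × 2 ≈ 2.5 MiB / token / sequence
Llama-3 70B uses GQA with 8 KV heads, so the KV width is kv_heads × head_dim = 1024:
  KV per token ≈ 2 × 80 × 1024 × 2 B ≈ 0.31 MiB / token / sequence (8× smaller)

Sequence length 3,000 tok → KV ≈ 7.3 GiB (MHA bound) or ≈ 0.9 GiB (GQA); KV only, weights excluded
70B weights in FP16 ≈ 140 GB → AWQ and/or TP is mandatory

AWQ 4-bit weights ≈ 35-40 GB, plus per-sequence KV × concurrency
Concurrency 32 × average 2k tok → total KV ≈ 156 GiB (MHA bound) or ≈ 20 GiB (GQA) → on 2×A100 80GB, max-num-seqs and max-model-len still need to be capped
```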
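The ledger can be checked with a short calculator. This sketch assumes FP16 KV (2 bytes per element) and the Llama-3 70B shape quoted above; the helper names and the 100 GiB budget in the example are illustrative.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K plus V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_gib(tokens, **shape):
    """KV cache size in GiB for a given number of cached tokens."""
    return tokens * kv_bytes_per_token(**shape) / 2**30


def max_concurrent_seqs(kv_budget_gib, avg_tokens, **shape):
    """How many sequences of avg_tokens fit in a KV budget (ignores block rounding)."""
    return int(kv_budget_gib / kv_gib(avg_tokens, **shape))


mha = dict(layers=80, kv_heads=64, head_dim=128)  # upper bound: every head keeps K/V
gqa = dict(layers=80, kv_heads=8, head_dim=128)   # grouped-query attention, 8 KV heads

for name, shape in (("MHA bound", mha), ("GQA", gqa)):
    print(name, f"{kv_gib(3000, **shape):.2f} GiB per 3k-token sequence,",
          f"{kv_gib(32 * 2000, **shape):.1f} GiB for 32 x 2k tokens")
print("seqs in a 100 GiB budget at 2k tokens (GQA):", max_concurrent_seqs(100, 2000, **gqa))
```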

[Figure: block table mapping. Logical sequence: Block0 (tok0-15), Block1 (tok16-31), Block2 (...). Physical GPU block pool: physBlock7, physBlock2, physBlock19.]
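The paging idea can be sketched as a tiny allocator. This is a teaching model under simplifying assumptions (no copy-on-write, no eviction), not vLLM's block manager; `BlockPool` and its methods are hypothetical names.

```python
class BlockPool:
    """Toy paged KV allocator: fixed-size blocks, a free list, and reference counts."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self, num_tokens):
        """Return a block table (list of physical block ids) covering num_tokens."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("KV block pool exhausted")
        table = [self.free.pop() for _ in range(needed)]
        for block in table:
            self.refcount[block] = 1
        return table

    def fork(self, table, shared_tokens):
        """Share only the full blocks of a common prefix with a new sequence."""
        shared = table[: shared_tokens // self.block_size]
        for block in shared:
            self.refcount[block] += 1
        return list(shared)

    def release(self, table):
        """Drop one reference per block; a block returns to the pool at zero."""
        for block in table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)


pool = BlockPool(num_blocks=8)
seq_a = pool.allocate(40)                    # 40 tokens need 3 blocks of 16
seq_b = pool.fork(seq_a, shared_tokens=32)   # shares the first 2 blocks
print(seq_a, seq_b, "free blocks:", len(pool.free))
```

Because the logical order lives in the table, physical blocks need not be contiguous, which is what removes the over-reservation of the naive layout.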

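2.3 TTFT and TPOT: Latency Breakdown

Metric	Definition	Dominant factors
TTFT	Time to first token	Prefill compute, prompt length, queueing, cold start
TPOT	Time per output token (gap between adjacent tokens)	Decode memory bandwidth, batch depth, quantization kernels
E2E	Time to complete the full response	TTFT + TPOT × (out_tokens − 1)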

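Quantitative estimate (70B AWQ, 2×A100, prompt=2400, out=180):

```text
Prefill 2,400 tok @ ~1,800 tok/s per cluster → TTFT_compute ≈ 1.3 s (compute only)
Queueing + network + scheduling add 200-400 ms in production → uncached TTFT ≈ 1.5-1.7 s, so the 800 ms P99 target requires a prefix-cache hit, a shorter prompt, or lower concurrency

Decode TPOT @ batch=32 → ~35-45 ms/tok
E2E ≈ 0.8 s + 45 ms × 179 ≈ 8.9 s (in streaming, TPOT sets the "typing feel")
```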
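The estimate follows directly from the E2E formula in the table. A minimal sketch, assuming prefill time is linear in uncached prompt tokens (a simplification); `cached_tokens` and `overhead_s` are illustrative parameters.

```python
def ttft_seconds(prompt_tokens, prefill_tok_per_s, overhead_s=0.0, cached_tokens=0):
    """First-token latency: prefill of the uncached tokens plus fixed overhead."""
    return (prompt_tokens - cached_tokens) / prefill_tok_per_s + overhead_s


def e2e_seconds(ttft_s, tpot_s, out_tokens):
    """End-to-end latency: TTFT + TPOT x (out_tokens - 1)."""
    return ttft_s + tpot_s * (out_tokens - 1)


cold = ttft_seconds(2400, 1800, overhead_s=0.3)
warm = ttft_seconds(2400, 1800, overhead_s=0.3, cached_tokens=2000)
print(f"TTFT cold {cold:.2f}s, with a 2,000-token prefix hit {warm:.2f}s")
print(f"E2E at TTFT 0.8s, TPOT 45ms, 180 tokens: {e2e_seconds(0.8, 0.045, 180):.2f}s")
```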

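Levers for optimizing TTFT (ranked by benefit):

Prefix cache: a hit on the system prompt + store-ID template cuts prefill from 2,400 to 200-400 tokens;
Prompt trimming: RAG top-K from 20 to 8 (together with 03-RAG);
Speculative decoding: a small draft model plus large-model verification (TPOT down 30-50%; it is a TPOT lever rather than a TTFT one, and it is complex to implement);
Prefill/decode disaggregation (PD separation): scale the prefill pool elastically during major promotions.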

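2.4 Quantization: AWQ / GPTQ / FP8

Scheme	Weight size	Quality loss	Throughput	Notes
FP16	100%	Baseline	Baseline	70B does not fit on one 80 GB card
AWQ 4-bit	~25%	MMLU −0.5 to −1.5	+30-50%	Production default
GPTQ 4-bit	~25%	Similar to AWQ	Similar	Mature offline quantization flow
FP8 (H100)	~50%	Smaller	Native on H100	Requires a matching hardware generation

Selection guidance for e-commerce customer service:

High-value order explanation / compliance: FP16, or AWQ with sampled human review;
High-volume FAQ: an AWQ 8B/14B small model, with confidence-based routing to 70B;
Forbidden: rolling out INT4 everywhere without evaluation and with fallback disabled.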

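2.5 Prefix Cache and Semantic Cache

Prefix cache (native to vLLM):

Reuses KV blocks for an exactly identical token prefix;
On a hit, TTFT can drop 60-85% (customer-service scenarios where a 2k system prompt repeats in more than 90% of requests).

Semantic cache (Gateway / Redis + embeddings):

Reuses the complete answer for semantically similar questions;
Hit latency is under 50 ms, but price and stock information goes stale quickly, so a TTL plus a business key (sku_id + policy_version) is mandatory.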

[Figure: cache lookup flow. Nodes: user question; exact hash cache?; return cached answer; semantic similarity > 0.92?; return semantic cache; vLLM inference; streamed answer; write back to cache, TTL=120s. Edge labels: hit / miss on the exact cache; hit and not expired / miss on the semantic cache.]
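The lookup order in the figure (exact hash, then semantic match, then inference) can be sketched in memory. This assumes a caller-supplied `embed` function and a linear scan instead of Redis or a vector index; `TwoTierCache` and `business_key` (for example `sku_id:policy_version`) are hypothetical names.

```python
import hashlib
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, embed, ttl_s=120, threshold=0.92, clock=time.monotonic):
        self.embed, self.ttl_s, self.threshold, self.clock = embed, ttl_s, threshold, clock
        self.exact = {}     # hash key -> (answer, expires_at)
        self.semantic = []  # (business_key, vector, answer, expires_at)

    def _key(self, question, business_key):
        return hashlib.sha256(f"{business_key}|{question}".encode()).hexdigest()

    def get(self, question, business_key):
        """Return (answer, tier); tier is 'exact', 'semantic', or 'miss'."""
        now = self.clock()
        hit = self.exact.get(self._key(question, business_key))
        if hit and hit[1] > now:
            return hit[0], "exact"
        vector = self.embed(question)
        for key, cached_vector, answer, expires_at in self.semantic:
            # Semantic hits must share the business key and still be fresh.
            if key == business_key and expires_at > now \
                    and cosine(vector, cached_vector) > self.threshold:
                return answer, "semantic"
        return None, "miss"

    def put(self, question, business_key, answer):
        expires_at = self.clock() + self.ttl_s
        self.exact[self._key(question, business_key)] = (answer, expires_at)
        self.semantic.append((business_key, self.embed(question), answer, expires_at))
```

On a miss the caller would run inference and then call `put`. Scoping both tiers by the business key is what keeps a stale price or policy from leaking across SKU or policy versions.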

2.6 LiteLLM FinOps and Multi-Model Routing

LiteLLM serves as the OpenAI-compatible control plane (see 24-Gateway for details):

```yaml
# Excerpt: model_list + routing (illustrative)
model_list:
  - model_name: fast
    litellm_params:
      model: azure/gpt-4o-mini
      api_base: os.environ/AZURE_OPENAI_ENDPOINT
  - model_name: smart
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet
router_settings:
  routing_strategy: latency-based-routing
  enable_pre_call_checks: true
```

Required FinOps tags (written into metadata and reconciled in 25-可观测):

Tag	Purpose
`tenant_id`	Per-store budgets
`feature`	checkout_assist / cs_bot / ops_copilot
`model_alias`	fast / smart / embed
`experiment_id`	A/B tests and cost attribution

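$/task ledger:

```text
$/task = (prompt_tokens × price_in + completion_tokens × price_out) / 1e6
       + embed_cost + rerank_cost + gpu_amortization
  (price_in / price_out in $ per 1M tokens)

Target: $0.018 per customer-service session
Measured breakdown (after routing optimization):
  - fast handles 72% of sessions × $0.004 = $0.0029
  - smart handles 28% × $0.042 = $0.0118
  - embed + rerank = $0.002
  - Total ≈ $0.0167 ✓
```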
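The same ledger as code, using the shares and per-session costs from the breakdown above. A minimal sketch; `task_cost` and `blended_cost` are illustrative helper names, and prices are assumed to be in dollars per 1M tokens.

```python
def task_cost(prompt_tokens, completion_tokens, price_in, price_out,
              embed_cost=0.0, rerank_cost=0.0, gpu_amortization=0.0):
    """$/task for a single call path; price_in and price_out are $ per 1M tokens."""
    llm = (prompt_tokens * price_in + completion_tokens * price_out) / 1e6
    return llm + embed_cost + rerank_cost + gpu_amortization


def blended_cost(routes, fixed_cost=0.0):
    """Expected $/session over routes given as (traffic_share, cost_per_session)."""
    routes = list(routes)
    if abs(sum(share for share, _ in routes) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in routes) + fixed_cost


total = blended_cost([(0.72, 0.004), (0.28, 0.042)], fixed_cost=0.002)
print(f"blended $/session: {total:.4f} (target 0.018)")
```

Raising the smart share in the first argument shows how sensitive the budget is to routing: the smart path costs roughly ten times the fast path per session.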

Multi-model routing decision tree:
[Figure: decision tree. Nodes: request arrives; guardrail passed?; refuse; intent classification; fast: 4o-mini / 8B local; smart: Claude / 70B; vision route; confidence > 0.85?; return. Edge labels: no / yes on the guardrail check; FAQ/logistics, order dispute/compliance, contains image out of intent classification; no / yes on the confidence check.]
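One way to read the decision tree as code is shown below. The edges are an interpretation of the figure's labels: in particular, escalating to smart when the fast answer's confidence is at or below 0.85 is an assumption. All callables (`guardrail_ok`, `classify_intent`, `call_model`) are hypothetical stand-ins for real services.

```python
def handle(request, guardrail_ok, classify_intent, call_model, threshold=0.85):
    """Route one request; call_model(alias, request) returns (answer, confidence)."""
    if not guardrail_ok(request):
        return {"route": "refuse", "answer": None}

    intent = classify_intent(request)
    if intent == "vision":
        return {"route": "vision", "answer": call_model("vision", request)[0]}
    if intent in ("order_dispute", "compliance"):
        return {"route": "smart", "answer": call_model("smart", request)[0]}

    # FAQ / logistics: try the cheap model first, keep its answer only if confident.
    answer, confidence = call_model("fast", request)
    if confidence > threshold:
        return {"route": "fast", "answer": answer}
    # Assumed escalation path for low-confidence fast answers.
    return {"route": "fast->smart", "answer": call_model("smart", request)[0]}
```

Returning the route alongside the answer makes it easy to log a routing reason for cost attribution.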

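L3 · Edge Cases and Pitfalls

3.1 What Actually Triggers OOM

Misconception	Fact
"A larger batch only affects latency"	batch↑ → KV block usage↑ → OOM kills the whole pod
"Setting max-model-len to 32k is harmless"	Every admitted sequence may claim blocks up to the max; once concurrency rises, the block pool is exhausted
"After quantization there is always enough memory"	With AWQ the KV cache is still FP16/BF16, and long contexts consume KV

Safeguards:

Load-test --max-num-seqs and --max-model-len together to obtain a step curve;
Leave 15% headroom in the K8s memory limit;
On OOM, degrade in order: drop the long RAG context → lower max_tokens → switch to the fast model.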

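3.2 When the Prefix Cache Stops Hitting

A dynamic timestamp or random few-shot examples inserted into the system prompt change the prefix on every request, so the hit rate drops to zero;
Store policy snippets differ across tenants, so the cache namespace must be partitioned by tenant_id;
A hot model update without clearing the cache produces corrupted output (a model_revision version tag is required).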
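The three failure modes suggest two application-side habits: keep the static segment first and byte-identical, and key any application-level cache by tenant and model revision. The sketch below is a hypothetical naming scheme, not a vLLM API; vLLM's own prefix cache matches on token prefixes automatically.

```python
import hashlib


def cache_namespace(tenant_id, model_revision, static_prompt):
    """Namespace that changes whenever the tenant, model revision, or static text changes."""
    digest = hashlib.sha256(static_prompt.encode("utf-8")).hexdigest()[:16]
    return f"{tenant_id}:{model_revision}:{digest}"


def build_prompt(static_prompt, dynamic_context, user_message):
    """Static segment first, so the leading tokens stay identical across requests.

    Timestamps, stock levels, and sampled few-shot examples belong in dynamic_context.
    """
    return f"{static_prompt}\n\n{dynamic_context}\n\n{user_message}"


ns = cache_namespace("shop-42", "rev-2026-05", "You are a store assistant. Policy: ...")
prompt = build_prompt("You are a store assistant. Policy: ...",
                      "now=2026-05-21T21:00 stock=12", "Where is my order?")
print(ns)
```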

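3.3 Quantization and Tool Calling

AWQ models may show a format-error rate on JSON tool calls that is 2-5% higher, so agent scenarios need a separate eval (links to 19-Harness);
Small models can be falsely confident and cause routing failures, so a double gate of confidence plus rules is mandatory.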

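3.4 Where API and Self-Hosted Costs Cross

```text
Rough break-even (70B class):
  API: $0.0078/session × 1M sessions/month = $7,800/month
  Self-hosted: 2×A100 cloud rental at $2.5/h per GPU × 730 h × 2 = $3,650/month + 0.5 FTE of engineering

On hardware alone, break-even is $3,650 / $0.0078 ≈ 0.47M sessions/month
Above ~4M sessions/month with long prompts → self-hosting pays off even after the 0.5 FTE and spare capacity
With large swings (10× peak-to-trough during promotions) → hybrid: self-hosted baseline + API burst
```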
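The crossover can be computed directly. A minimal sketch using the figures above; `fixed_monthly` is an illustrative parameter for engineering and redundancy costs, which the text mentions (0.5 FTE) but does not price.

```python
def break_even_sessions(api_cost_per_session, gpu_hour_price, num_gpus,
                        hours_per_month=730, fixed_monthly=0.0):
    """Monthly session count at which self-hosting costs the same as the API."""
    monthly_self_hosted = gpu_hour_price * num_gpus * hours_per_month + fixed_monthly
    return monthly_self_hosted / api_cost_per_session


hardware_only = break_even_sessions(0.0078, 2.5, 2)
print(f"hardware-only break-even: {hardware_only:,.0f} sessions/month")
```

Passing a non-zero `fixed_monthly` moves the crossover to the right, which is why the rule of thumb sits well above the hardware-only number.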

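L4 · Architect's View (E-commerce Transactions)

4.1 Capacity Planning Template

Dimension	Formula	Example
GPU count	`peak_qps × avg_e2e_sec / batch_efficiency`	120 QPS × 9 s = 1,080 in-flight sequences; / 32 per replica ≈ 34 replicas, each a TP=2 pair of GPUs
Network	Long-lived streaming SSE connections	8k concurrent × 4 KB/s ≈ 32 MB/s egress
Storage	Model weights + LoRA	40 GB × 3 model versions = 120 GB PVC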
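The GPU-count row is Little's law (in-flight = arrival rate × time in system) divided by per-replica batch capacity. A small sketch with the table's numbers; the function names are illustrative.

```python
import math


def replicas_needed(peak_qps, avg_e2e_sec, seqs_per_replica):
    """Return (in-flight sequences, replicas) for a target peak load."""
    in_flight = peak_qps * avg_e2e_sec  # Little's law: L = lambda x W
    return in_flight, math.ceil(in_flight / seqs_per_replica)


def sse_egress_mb_per_s(concurrent_streams, kb_per_s_per_stream):
    """Aggregate streaming egress in MB/s (decimal units)."""
    return concurrent_streams * kb_per_s_per_stream / 1000


in_flight, replicas = replicas_needed(120, 9, 32)
print(in_flight, "in-flight sequences ->", replicas, "replicas of 32 sequences")
print(sse_egress_mb_per_s(8000, 4), "MB/s egress")
```

Shortening `avg_e2e_sec` (for example through prefix-cache hits) reduces the replica count proportionally, which ties the latency levers in 2.3 to capacity.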

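4.2 Release and Rollback

Weighted blue-green: send 5% of traffic to the new vLLM image and watch TTFT / TPOT / error rate for 30 min;
Model version pinning: bind model_revision to the eval report (06-评估);
One-click rollback: point the Gateway route back to the old alias and the old image;
FinOps gate: when hourly $/task rises 20% hour over hour, alert automatically and optionally force fast routing.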
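The FinOps gate reduces to an hour-over-hour comparison. A minimal sketch; the function name, return shape, and `force_fast_on_alert` switch are illustrative, and the 20% threshold comes from the text.

```python
def finops_gate(prev_hour_cost_per_task, curr_hour_cost_per_task,
                threshold=0.20, force_fast_on_alert=False):
    """Compare hourly $/task and decide whether to alert or force fast routing."""
    if prev_hour_cost_per_task <= 0:
        # No baseline yet: nothing to compare against.
        return {"alert": False, "force_fast": False, "change": None}
    change = curr_hour_cost_per_task / prev_hour_cost_per_task - 1
    alert = change > threshold
    return {"alert": alert, "force_fast": alert and force_fast_on_alert, "change": change}


print(finops_gate(0.016, 0.020, force_fast_on_alert=True))
```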

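4.3 Integration with Java / Spring AI

```java
// Spring AI: base-url points at LiteLLM; metadata carries the FinOps tags end to end.
// Assumes the OpenAI starter; whether OpenAiChatOptions exposes metadata(...) depends
// on your Spring AI version, so verify it or pass the tags as HTTP headers instead.
ChatResponse response = chatClient.prompt()
    .user(userMessage)
    .options(OpenAiChatOptions.builder()
        .model("smart")                 // LiteLLM alias, not a provider model id
        .metadata(Map.of(
            "tenant_id", shopId,
            "feature", "checkout_assist",
            "trace_id", traceId))
        .build())
    .stream()
    .chatResponse()                     // Flux<ChatResponse>
    .blockLast();                       // waits for the stream to end; returns the last chunk
```

See 14-Java-AI框架 §12 (LiteLLM gateway) for details.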

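8. Production Checklist (Serving / Cost)

Four groups of Grafana panels: TTFT / TPOT / OOM / batch depth
tenant_id + feature + model_alias tagged on 100% of requests
Investigate dynamic prompts when the prefix-cache hit rate falls below 50%
A separate tool-call eval gate for quantized models
Align max-model-len with the business max_prompt (no "configured for 32k, business sends 30k")
Promotion playbook: PD-separation scale-out + API burst quota
Bind the semantic-cache TTL to sku_version
Monthly $/task review by feature (FinOps monthly report)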

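9. Real Interview Questions (5, Tagged by Company Style)

9.1 🟦 ByteDance · OOM in the vLLM Cluster for Douyin Live-Commerce Customer Service

(1) Reference answer: the root cause is almost always KV block over-reservation plus uncontrolled concurrency steps. To stop the bleeding, lower max-num-seqs / max-model-len and switch to the fast model; for a permanent fix, use a load-tested PagedAttention parameter curve, prefix cache, and governance of dynamic prompts.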

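(2) Walkthrough:

```text
Symptom: at 21:00 on the promotion night, vLLM pods are OOMKilled repeatedly; customer-service TTFT P99 goes from 600 ms to 12 s

Investigation path:
  1) kubectl describe pod → OOM exit 137, working set 78GB/80GB
  2) vLLM metrics: gpu_cache_usage_perc 0.98, num_requests_waiting 240
  3) Release diff: max-model-len was raised 8192→16384 that morning; max-num-seqs=64 was left unchanged
  4) Sampled traces: average prompt 2.1k, but P99 prompt 14k (operations stuffed long campaign rules into the system prompt)

Ledger:
  block_size=16; a 16,384-token sequence needs up to 1,024 blocks
  64 concurrent × 1,024 blocks × 16 tok × ~0.31 MiB/tok (GQA; 2.5 MiB under the MHA bound) ≈ 320 GiB worst case → far beyond 80 GB
```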

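(3) Trade-offs and numbers:

Rolling max-model-len back to 8192 and setting max-num-seqs to 48 → GPU cache usage drops to 62% and OOMs go to zero;
TTFT P99 from 12 s to 720 ms; TPOT P99 42 ms;
Throughput from 28 req/s to 41 req/s (it actually rises, because fragmentation-driven retries decrease).

(4) Rollout checklist:

Alert: gpu_cache_usage_perc > 0.85 for 5 min;
Config: set max-model-len from the prompt P95, not from the business max;
Rollback: one-click revert of the Helm value maxModelLen;
Switch: force_truncate_system_prompt=6000 tokens.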

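(5) Follow-up questions:

Follow-up 1: why does OOM still happen after AWQ quantization?

AWQ compresses only the weights; the KV cache stays FP16/BF16. For 70B on 2×A100, weights take about 38 GB in total (about 19 GB per card under TP=2), leaving roughly 100 GB across the two cards at 0.90 utilization for KV and runtime overhead (the ~42 GB figure applies only if the whole model sat in a single 80 GB budget). With 64 concurrent sequences and long prompts, KV is the term that grows linearly. Look at weights + KV + CUDA graph overhead together rather than at the quantization ratio alone.
Follow-up 2: can prefix cache solve long prompts?

It only helps with repeated prefixes. If campaign rules differ per store or contain real-time stock, the prefix does not repeat and the hit rate is 0%. The application layer should template the prompt into a static segment and a dynamic segment, hash the static segment into the cache key, and keep the dynamic segment within 512 tokens.
Follow-up 3: how to choose between TP and PP?

For 70B on A100-80GB, TP=2 is the most common choice (each layer's weight matrices are sharded across the two GPUs, and the AllReduce latency is acceptable). PP suits very large models that span nodes, but pipeline bubbles are large. For 70B in e-commerce, prefer TP; consider PP + EP only at the 405B class.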

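9.2 🟧 Alibaba · TTFT and TPOT Optimization for the Taobao Shopping Guide

(1) Reference answer: in shopping-guide scenarios, what users perceive is TTFT (first character) plus TPOT (streaming). Prioritize prefix cache and prompt trimming, then PD separation; do not sacrifice TTFT for a larger batch.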

(2) 原理 walk：

Baseline: Claude API proxy, TTFT P99 1.4s, TPOT 55ms, session completion 9.2s
Target: TTFT P99 < 800ms, with answer-quality nDCG not dropping

Option A: prefix cache (vLLM):
  system + category template, 1.8k tok fixed → hit rate 88%
  TTFT 1.4s → 0.55s

Option B: RAG topK 20→8:
  prompt -900 tok, quality -0.3% nDCG (acceptable)
  TTFT a further -120ms

Option C: route 65% of traffic to the 8B fast model:
  $/session $0.016 → $0.009

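(3) Trade-offs and numbers:

Final TTFT P99 760ms, TPOT 38ms (22ms on the 8B path);
adoption rate +4.2% (a faster first token reduces bounce);
GPU cost -34%.

(4) Rollout checklist:

Per-model TTFT panels (fast/smart);
persist the routing_reason label;
A/B: prefix_cache=on/off.

(5) Follow-ups:

Follow-up 1: Which should be optimized first, TTFT or TPOT?

For interactive shopping guidance, TTFT comes first (users refresh or leave after 800ms without a response). Batch summarization can prioritize TPOT/throughput. Use session cancellation rate and first-token latency as the north-star metrics, rather than looking only at tokens/s.
Follow-up 2: When is PD separation worth it?

When prefill and decode contend for resources and degrade each other: decode TPOT spikes during a prefill burst. Rule-of-thumb threshold: prefill queue wait >200ms and decode P99 >60ms. PD separation doubles the operational surface, so small and mid-sized teams should start with prefix cache + routing.
Follow-up 3: How do you prove the optimization did not hurt quality?

An offline golden set of 2,000 cases plus online shadow traffic against smart. Metrics: nDCG, citation accuracy, and the low-price mis-recommendation rate. Roll back any TTFT optimization that raises the mis-recommendation rate by 0.1%.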
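
The rollback rule in follow-up 3 can be written as a one-function gate. This sketch assumes the "+0.1%" threshold is an absolute increase in percentage points; `should_rollback` is an illustrative name, not part of any framework.

```python
def should_rollback(baseline_rate_pct: float, candidate_rate_pct: float,
                    max_increase_pct: float = 0.1) -> bool:
    """True when the mis-recommendation rate rose by the threshold or more.

    Rates are percentages of sessions (0.25 means 0.25%). The threshold is
    interpreted as an absolute increase in percentage points (assumption).
    """
    # Round away float noise so that 0.35 - 0.25 compares equal to 0.1.
    increase = round(candidate_rate_pct - baseline_rate_pct, 6)
    return increase >= max_increase_pct
```

Running the same gate on the offline golden set and on shadow traffic keeps the two signals comparable.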

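9.3 🟪 Ant Group · Multi-model FinOps with LiteLLM for financial customer service

(1) Model answer: LiteLLM provides a unified OpenAI SDK contract + routing + budgets. In financial scenarios the share of the smart model must be controllable, using metadata.budget_id and a daily cap; on cost anomalies, downgrade to fast automatically.

(2) Mechanism walkthrough: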

Architecture:
  App → LiteLLM → {azure gpt-4o-mini, bedrock claude, private vLLM}
  spend_logs → ClickHouse → daily bill per tenant

Incident-style problem:
  A tenant misconfigured model=smart as the default → daily cost $12k → about $360k/month if extrapolated over 30 days
  Detection: spend_logs hour-over-hour +180%
  Action: the router forces tenant X onto fast + notifies the BM

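(3) Trade-offs and numbers:

Budget gate: tenant daily budget $500; alert above 90%; hard-reject smart above 100%;
routing rules: compliance requests 100% smart; balance inquiries 100% fast;
reconciliation error <0.5% (against the cloud vendor bill).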
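
The budget gate above can be sketched as a pure function that a gateway hook might call before routing. `BudgetDecision` and `check_budget` are hypothetical names for illustration and are not LiteLLM APIs; the thresholds are the ones listed in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetDecision:
    allowed: bool  # False means the request is refused
    alert: bool    # True once spend reaches the alert threshold
    reason: str


def check_budget(model_alias: str, spent_today_usd: float,
                 daily_budget_usd: float = 500.0,
                 alert_ratio: float = 0.9) -> BudgetDecision:
    """Alert at 90% of the daily budget; refuse smart once it is used up."""
    used = spent_today_usd / daily_budget_usd
    alert = used >= alert_ratio
    if used >= 1.0 and model_alias == "smart":
        return BudgetDecision(False, True, "daily budget exhausted: smart refused")
    return BudgetDecision(True, alert, "ok")
```

Only the smart alias is refused, so a tenant that hits the cap can still be served by the fast model.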

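(4) Rollout checklist:

LiteLLM max_budget + budget_duration;
cost report by feature;
rollback: configmap default_model=fast.

(5) Follow-ups:

Follow-up 1: Where is the boundary between LiteLLM and an in-house Gateway?

LiteLLM is strong at multi-provider adaptation and spend logs; enterprise-grade tenant isolation, WAF, and mTLS usually sit in a Kong/APISIX layer in front of it (see 24). Recommended: Kong for auth + LiteLLM for routing, without re-implementing provider SDKs.
Follow-up 2: How do you bill a mix of private vLLM and cloud APIs?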

Private capacity is booked as GPU amortization ($/GPU-hour ÷ throughput); cloud APIs are booked at token list prices. The FinOps dashboard unifies both as $/task on top of a dual ledger, which avoids the misleading comparison of token prices alone (self-hosting has a high fixed cost and a low marginal cost).
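Follow-up 3: How do you do chargeback?

Aggregate along three dimensions, tenant_id × feature × model, and export the result to business owners every month. Inside Ant this is commonly wired to cost-center codes that map one-to-one to K8s namespaces.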
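
The dual ledger and the three-dimensional chargeback can be sketched together. The function names and the row layout (`tenant_id`, `feature`, `model`, `cost_usd`) are assumptions for illustration; the example figures in the usage note come from section 1.3 of this chapter.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def self_hosted_cost_per_task(gpu_hour_usd: float, num_gpus: int,
                              tasks_per_second: float) -> float:
    """GPU amortization ledger: $/GPU-hour divided by sustained throughput."""
    return gpu_hour_usd * num_gpus / 3600.0 / tasks_per_second


def api_cost_per_task(prompt_tokens: int, completion_tokens: int,
                      usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token ledger: list price per 1M tokens applied to one task."""
    return (prompt_tokens * usd_per_m_input
            + completion_tokens * usd_per_m_output) / 1_000_000


def chargeback(rows: Iterable[Mapping[str, object]]
               ) -> Dict[Tuple[str, str, str], float]:
    """Sum cost_usd per (tenant_id, feature, model)."""
    totals: Dict[Tuple[str, str, str], float] = defaultdict(float)
    for row in rows:
        key = (str(row["tenant_id"]), str(row["feature"]), str(row["model"]))
        totals[key] += float(row["cost_usd"])
    return dict(totals)
```

With $1.2/GPU-hour × 2 GPUs at 35 req/s the self-hosted ledger gives roughly $0.000019 per task, while 2,400 input and 180 output tokens at $2.5 / $10 per 1M give $0.0078. The self-hosted number assumes the GPUs stay saturated at that throughput; idle capacity raises it.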

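9.4 🟢 Tencent · Semantic cache and consistency for the WeChat Pay assistant

(1) Model answer: a semantic cache must carry business version keys; payment status and promotion rules get TTL ≤ 60s; a hit still has to pass lightweight rule validation (amount / state machine).

(2) Mechanism walkthrough: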

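Cache key: hash(embed(query)) + order_status_version + sku_price_version
Hit condition: cosine > 0.93 AND versions match

Counterexample (must be avoided):
  A user asks "why has my order not shipped" → the cache answers "already shipped" (polluted by a previous user's similar question) → customer complaint

Fix:
  The cache value stores {answer, valid_states:[SHIPPED,PAID]}
  After a hit, verify that the current order state ∈ valid_states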
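
The hit condition and the post-hit state check can be sketched as below. This is an in-memory illustration under stated assumptions: `CacheEntry` and `lookup` are hypothetical names, the linear scan stands in for a vector index, and embeddings are plain tuples of floats.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Sequence


@dataclass(frozen=True)
class CacheEntry:
    embedding: Sequence[float]
    order_status_version: int
    sku_price_version: int
    answer: str
    valid_states: FrozenSet[str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(entries: Iterable[CacheEntry], query_embedding: Sequence[float],
           order_status_version: int, sku_price_version: int,
           current_state: str, threshold: float = 0.93) -> Optional[str]:
    """Return a cached answer only if versions, similarity and state all pass."""
    for entry in entries:
        if (entry.order_status_version, entry.sku_price_version) != (
                order_status_version, sku_price_version):
            continue  # business data changed since the answer was cached
        if cosine(entry.embedding, query_embedding) <= threshold:
            continue  # not similar enough
        if current_state not in entry.valid_states:
            continue  # answer does not apply to this order's state
        return entry.answer
    return None  # miss: fall through to the model
```

The state check runs after the similarity match, so a near-identical question from a user whose order is in a different state falls through to the model instead of receiving someone else's answer.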

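(3) Trade-offs and numbers:

Hit rate 34% (FAQ-type questions);
median TTFT 40ms (cache path);
error rate <0.02% (after adding validation).

(4) Rollout checklist:

Redis cluster + version numbers sourced from the order service's MQ;
semantic_cache_enabled feature flag;
rollback: turn off the semantic cache and keep the exact-hash cache.

(5) Follow-ups:

Follow-up 1: How does it differ from vLLM prefix cache?

Prefix cache reuses KV computation; the semantic cache reuses the final text. The former still uses the GPU for the uncached part of the prompt and for decode; the latter uses no GPU at all. For state-sensitive answers in payment scenarios, use the semantic cache only with strong validation, or not at all.
Follow-up 2: What if the embedding model changes?

Same as 03-RAG: dual-write the index + gradual rollout, with embed_rev in the cache namespace.
Follow-up 3: What about a cache-miss stampede?

When a hot question ("Double 11 refund rules") misses, every concurrent request goes straight to the GPU. Use singleflight + warm-up with a short TTL + a local Caffeine L1.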
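
The singleflight idea from follow-up 3 can be sketched with the standard library. This is a minimal thread-based illustration, not a port of any specific library; `SingleFlight` and `_Call` are names chosen here. Concurrent callers with the same key wait for the first caller's result instead of each triggering a generation.

```python
import threading
from typing import Any, Callable, Dict, Hashable, Optional


class _Call:
    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Dict[Hashable, _Call] = {}

    def do(self, key: Hashable, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
        if leader:
            try:
                call.result = fn()          # only the leader hits the model
            except Exception as exc:
                call.error = exc            # followers see the same failure
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()
        else:
            call.done.wait()                # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

The entry is removed as soon as the leader finishes, so this only deduplicates in-flight work; keeping the answer afterwards is the job of the short-TTL cache.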

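9.5 🔵 Google · TPU vs GPU Serving and SLO modeling

(1) Model answer: the interview tests whether you can turn SLOs into math: given TTFT/TPOT SLOs, back out the compute and replica count required. TPUs suit very large-batch, training-style inference workloads; the GPU + vLLM ecosystem is more mature; abstract across clouds with a Gateway.

(2) Mechanism walkthrough: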

SLO: TTFT_p99 < 1s, 200 tok completed in < 10s
=> TPOT_mean < (10 - 1) / 200 = 45ms

If the measured TPOT is 60ms → either:
  - reduce batch contention (lower max-num-seqs)
  - or add 1 GPU replica (scale decode horizontally)

Replica count N:
  peak_qps × e2e_sec / (batch × step_throughput)
  = 80 × 9 / (32 × 0.9) ≈ 25 effective sequences → 2 nodes × 2 GPUs (aligned with load tests)
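
The SLO arithmetic above can be reproduced in code. The function names are illustrative; the inputs are the figures from the walkthrough, and the mapping from the final ratio to a node count still has to come from load tests, as the text says.

```python
def tpot_budget_ms(e2e_slo_s: float, ttft_slo_s: float, output_tokens: int) -> float:
    """Mean time per output token allowed once TTFT is subtracted from the SLO."""
    return (e2e_slo_s - ttft_slo_s) * 1000.0 / output_tokens


def in_flight_sequences(peak_qps: float, e2e_sec: float) -> float:
    """Little's law: average concurrency = arrival rate × time in system."""
    return peak_qps * e2e_sec


def capacity_ratio(peak_qps: float, e2e_sec: float,
                   batch: int, step_throughput: float) -> float:
    """In-flight sequences divided by (batch × step_throughput), as in the text."""
    return in_flight_sequences(peak_qps, e2e_sec) / (batch * step_throughput)
```

`tpot_budget_ms(10, 1, 200)` gives the 45 ms budget, and `capacity_ratio(80, 9, 32, 0.9)` gives the value of about 25 used in the walkthrough; a measured TPOT of 60 ms exceeds the budget by a third.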

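(3) Trade-offs and numbers:

GPU vLLM: the most complete ecosystem (AWQ, prefix cache, speculative decoding);
TPU: better throughput per watt, but a high cost of adapting the toolchain and HF models;
hybrid-cloud burst: +15% cost buys 3× peak capacity.

(4) Rollout checklist:

Weekly SLO error-budget dashboard;
autoscaling with HPA on num_requests_waiting;
multi-cloud fallback order written into the runbook.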
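
For the HPA item, the scaling rule Kubernetes documents is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies it to the waiting-queue metric; the target of 20 waiting requests per pod and the replica bounds in the usage note are hypothetical values, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    """HPA core formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas averaging 30 waiting requests against a target of 20, the rule asks for 6 replicas; at 240 waiting it would ask for 48 and is clamped to the maximum.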

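(5) Follow-ups:

Follow-up 1: How do you explain TTFT/TPOT to a non-technical audience?

TTFT = "the wait before typing starts"; TPOT = "the gap between each character appearing". When users complain that it is "stuck", it is mostly TTFT; when they complain that it is "slow", it may be TPOT × the number of characters.
Follow-up 2: How production-ready is speculative decoding?

In 2025-26 it is gradually rolling out for 70B+ models at high QPS. The draft model's quality has to match, otherwise a high rejection rate wastes compute. Run it in shadow mode for two weeks before shifting traffic.
Follow-up 3: Carbon vs Cost?

Extend FinOps to $/task × kgCO2/task (accounting for data-center PUE). Schedule the evening peak onto replicas in regions with renewable power (if multi-cloud).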

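10. Real Incident Postmortem (e-commerce trading) · LLM Serving cost and OOM

10.1 Cascading vLLM cluster OOMs and a token-bill spike on the big-promotion night

S (Situation)

Business: cross-border e-commerce intelligent customer service + order interpretation, peak of 8,200 concurrent sessions;
Architecture: 2 nodes × 2×A100 80GB, vLLM Llama-3.1-70B-AWQ, LiteLLM routing fast/smart;
Baseline: TTFT P99 620ms, $/session $0.017, GPU cache usage 58%.

T (Trigger)

2025-11-11 20:40: monitoring shows vllm_pod_oom_total at 12 per 10min; at the same time FinOps alerts on hourly cost +220%;
20:48: customer-service queue >3min, negative-review rate ×3 within 15min;
21:05: on-call escalates to P0.

A (Approach)

Step 1: Separate "it died" from "it got expensive"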

OOM thread: kubelet OOMKilled → describe → memory at limit
Cost thread: LiteLLM spend_logs → smart share 71% (26% on a normal day)

Step 2: OOM evidence chain

# Prometheus
vllm_gpu_cache_usage_perc{pod="vllm-7"} > 0.97
vllm_num_preemptions_total  spike

# Release log
Helm upgrade 18:00: max-model-len 8192 → 16384
ConfigMap 19:30: marketing injected full_rules.md into system (+6k tok)

Step 3: Cost evidence chain

SELECT model, count(*) FROM spend_logs
WHERE hour='2025-11-11 20:00' AND tenant='global'
GROUP BY 1;
-- smart 71%, avg prompt 5200 tokens (2400 on a normal day)

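Two root causes:

Config + operations changes stacked together → KV reservation blow-up → OOM → retry storm;
Routing bug: the canary intent_classifier model labeled FAQ as dispute → everything went to smart → the bill spiked.

R (Resolution)

Stop the bleeding (25 minutes):

Roll max-model-len back to 8192, max-num-seqs 48;
Gateway forces FAQ intents onto fast (rules override ML);
scale out by +2 GPU nodes (cloud burst API as the second-level fallback, rate-limited to 30% of traffic).

Permanent fix (72 hours):

System prompt split into static/dynamic parts + prefix cache namespace;
routing canary runs in shadow and ships only with offline consistency ≥99%;
GPU cache >85% automatically triggers prompt truncation and fast fallback on queue timeout;
FinOps: tenant hourly budget + alert when the smart share >40%.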
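
The last two permanent fixes amount to a small decision function evaluated on each metrics tick. This sketch uses the thresholds from the list; the 2-second queue timeout and the action names are hypothetical, since the text does not specify them.

```python
from typing import Dict, List


def smart_share(request_counts: Dict[str, int]) -> float:
    """Fraction of requests served by the smart alias in the window."""
    total = sum(request_counts.values())
    return request_counts.get("smart", 0) / total if total else 0.0


def guardrail_actions(gpu_cache_usage: float, request_counts: Dict[str, int],
                      queue_wait_s: float, queue_timeout_s: float = 2.0) -> List[str]:
    """Return the actions to take for the current metrics snapshot."""
    actions: List[str] = []
    if gpu_cache_usage > 0.85:
        actions.append("truncate_prompt")
        if queue_wait_s > queue_timeout_s:       # hypothetical timeout value
            actions.append("fallback_to_fast")
    if smart_share(request_counts) > 0.40:
        actions.append("alert_smart_share")
    return actions
```

Fed with the incident-hour numbers (cache 0.97, smart 71%), the function returns all three actions; with the baseline (cache 0.58, smart 26%) it returns none.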

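M (Metrics)

Metric	Incident peak	After recovery
OOM count / 10min	12	0
TTFT P99	11.8s	680ms
Queue timeout rate	18%	0.6%
$/session	$0.052	$0.018
Estimated loss avoided	n/a	~3.8 million CNY (4h conversion-loss basis)

P (Prevention)

Any max-model-len change must ship with a KV load-test report
Prompt injection by operations goes through PR review + token-count CI
Two gates for the routing model: ML + rules
Quarterly OOM drills (deliberately raise concurrency and watch the cache curve)
Bind the 25-可观测 three-in-one dashboard to on-call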
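
The token-count CI check from the prevention list could look like the sketch below. The 6,000-token budget reuses the truncation cap from 9.1 as an assumption, and `rough_token_count` is a crude placeholder (about 4 characters per token, which undercounts CJK text); a real pipeline would pass the serving model's tokenizer as `count_tokens`.

```python
from typing import Callable, Tuple

SYSTEM_PROMPT_BUDGET = 6000  # tokens; assumed, borrowed from the 9.1 truncation cap


def rough_token_count(text: str) -> int:
    """Placeholder estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def check_prompt_budget(prompt: str, budget: int = SYSTEM_PROMPT_BUDGET,
                        count_tokens: Callable[[str], int] = rough_token_count
                        ) -> Tuple[bool, int]:
    """Return (within_budget, token_count) for a system prompt file's content."""
    tokens = count_tokens(prompt)
    return tokens <= budget, tokens
```

A CI job would read the changed prompt file, call `check_prompt_budget`, and fail the PR when the first element is False, which would have caught the +6k tok injection before it reached production.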

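Related files + one-line mnemonics

File	Mnemonic
06-评估	Offline eval gate before changing quantization/model
08-架构-电商	Customer-service/shopping-guide SLAs drive the GPU spec
14-Spring-AI	Java injects metadata via LiteLLM
24-Gateway	Routing / tenant / semantic-cache policy
25-可观测	TTFT / TPOT / $/task triple panel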

🧭 Chapter navigation

#	File	Style
00	00-README.md	Index
06	02-评估-Eval-Hallucination与质量度量.md	Mechanism
07	This chapter · Serving / Caching / Cost	Operational
08	01-电商AI辅助交易场景.md	Design
24	07-AI-Gateway-LLM网关与多模型路由.md	Design
25	08-AI可观测性-Trace-Cost-质量三合一.md	Operational
98	98-面试高频题满分答与Checklist.md	Overview

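Official documentation and source code (primary basis)

AI Engineering · the mechanisms described in the body should come from the official documentation (L1) and the official source repositories (L2) listed below;

do not use tutorial sites or blogs as the basis for mechanisms. The QPS / latency / STAR figures in this chapter are illustrative, for interview purposes.

Writing conventions: docs/official-sources-registry.md §0

L1 · Official documentation

L2 · Official source code

L3 · Papers / open specifications

L3 MCP Specification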