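vLLM Architecture Overview
Scope: this document analyzes the vLLM source code at the architecture level to build a global picture of the system. It covers the six-layer architecture, the v0 → v1 evolution, the core data structures, and the design principles.
Overall architecture diagram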
The diagram shows six layers. Its edges are labeled, in order: HTTP/gRPC, EngineCoreRequest, SchedulerOutput, collective_rpc, CUDA Kernel.
- Layer 1: service layer (entrypoints/): OpenAI API Server, Anthropic API, Chat Completion, Embedding/Pooling, Batch Serving
- Layer 2: engine layer (engine/ + v1/engine/): EngineCore (core engine loop), LLMEngine (compatibility wrapper), InputProcessor (input processing), OutputProcessor (output processing), Detokenizer (token-to-text decoding)
- Layer 3: scheduling core (v1/core/sched/): Scheduler (scheduling algorithm), AsyncScheduler (asynchronous scheduling), RequestQueue (request queue), BlockPool (KV cache block pool)
- Layer 4: executor layer (v1/executor/): UniProcExecutor (single process), MultiprocExecutor (multi-process), RayExecutor (Ray distributed), WorkerBase (worker abstraction)
- Layer 5: model executor (model_executor/): Model Registry (150+ model architectures), ModelRunner (GPU/CPU/XPU/TPU), quantization backends (FP8/GPTQ/AWQ/NVFP4)
- Layer 6: kernel layer (csrc/): CUDA Kernels (cutlass/flashinfer/triton), Custom Ops (vllm._custom_ops)
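I. Layered Architecture in Detail
1.1 Layer 1: Service Layer (entrypoints/)
Responsibility: expose the external API and handle protocol conversion, request routing, and load balancing.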
| Submodule | Responsibility | Key files |
|---|---|---|
| `openai/` | OpenAI-compatible API (chat/completion/embedding) | api_server.py, serving.py |
| `anthropic/` | Anthropic Messages API compatibility | api_router.py |
| `pooling/` | Embedding / Classification / Scoring | embed/io_processor.py |
| `cli/` | CLI entry points (serve/benchmark) | serve.py |
| `grpc_server.py` | gRPC service entry point | grpc_server.py |
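Interface definition: the service layer converts HTTP requests into calls on LLMEngine through llm.py, and protocol.py defines the data contract.
1.2 Layer 2: Engine Layer (engine/ + v1/engine/)
Responsibility: orchestrate the full pipeline of input preprocessing, scheduling and execution, and output postprocessing.
Core components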
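- LLMEngine (v1): the user-facing engine interface, responsible for:
  - Input conversion (`InputProcessor`): EngineInput → EngineCoreRequest
  - Output conversion (`OutputProcessor`): EngineCoreOutputs → RequestOutput
  - Statistics logging (`StatLoggerManager`)
  - LoRA management
- EngineCore: the decoupled core engine, which contains:
  - Model executor management (`model_executor`)
  - Scheduler management (`scheduler`)
  - KV cache initialization and management
  - The core step loop (`step()` / `step_with_batch_queue()`)
  - Multimodal cache management (`mm_receiver_cache`)
  - Structured output management (`structured_output_manager`)
- EngineCoreProc: a ZMQ-based inter-process communication wrapper, which supports:
  - Running EngineCore in a background process
  - Socket IO threads (input/output)
  - A handshake protocol
  - Data Parallel coordination (DPEngineCoreProc)
  - Elastic EP scale-up and scale-down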
Inter-layer data flow
The diagram groups its nodes into three clusters (frontend / service layer, engine layer, EngineCore) and traces one request through the following chain:
- HTTP Request → `render` → EngineInput (TokensInput/EmbedsInput)
- EngineInput → `InputProcessor.process_inputs()` → EngineCoreRequest (msgspec Struct)
- EngineCoreRequest → `add_request()` → Request (internal representation)
- Request → `schedule()` → SchedulerOutput
- SchedulerOutput → `execute_model()` → ModelRunnerOutput
- ModelRunnerOutput → `update_from_output()` → EngineCoreOutputs (msgpack-serialized)
- EngineCoreOutputs → `OutputProcessor` → RequestOutput (user-visible)
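The chain of representations in this data flow can be sketched with plain dataclasses. This is a toy model under stated assumptions: the class names mirror the ones in the diagram, but every field, the three-word vocabulary, and the helper functions (`render`, `process_inputs`, `fake_core`, `make_request_output`) are hypothetical placeholders rather than vLLM's definitions.

```python
from dataclasses import dataclass

# Toy stand-ins for the types named in the data flow. All fields are
# assumptions chosen for illustration, not vLLM's actual definitions.

@dataclass
class EngineInput:          # frontend side, after rendering
    prompt_token_ids: list[int]

@dataclass
class EngineCoreRequest:    # what crosses the frontend/core boundary
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int

@dataclass
class EngineCoreOutput:     # per-request result coming back from the core
    request_id: str
    new_token_ids: list[int]
    finished: bool

@dataclass
class RequestOutput:        # user-visible result
    request_id: str
    text: str
    finished: bool

VOCAB = ["hello", "world", "!"]  # hypothetical vocabulary

def render(prompt: str) -> EngineInput:
    """Stand-in for rendering and tokenization: map known words to ids."""
    return EngineInput([VOCAB.index(word) for word in prompt.split()])

def process_inputs(request_id: str, inp: EngineInput, max_tokens: int) -> EngineCoreRequest:
    return EngineCoreRequest(request_id, inp.prompt_token_ids, max_tokens)

def fake_core(req: EngineCoreRequest) -> EngineCoreOutput:
    """Stand-in for schedule/execute/update: echo the prompt reversed."""
    tokens = list(reversed(req.prompt_token_ids))[: req.max_tokens]
    return EngineCoreOutput(req.request_id, tokens, finished=True)

def make_request_output(out: EngineCoreOutput) -> RequestOutput:
    """Stand-in for the output processor: detokenize into text."""
    text = " ".join(VOCAB[token] for token in out.new_token_ids)
    return RequestOutput(out.request_id, text, out.finished)

core_request = process_inputs("req-0", render("hello world"), max_tokens=8)
result = make_request_output(fake_core(core_request))
print(result)
```

The point of the sketch is the boundary: only `EngineCoreRequest` and `EngineCoreOutput` need to be serializable, because they are the only types that cross between the frontend and the core.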
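1.3 Layer 3: Scheduling Core (v1/core/sched/)
Responsibility: decide how many tokens each request processes in every step, and manage KV cache allocation.
| Component | File | Responsibility |
|---|---|---|
| SchedulerInterface | interface.py | Abstract scheduler base class; defines core methods such as schedule/add_request/update_from_output |
| Scheduler | scheduler.py | Default scheduler implementation: FCFS + continuous batching |
| AsyncScheduler | async_scheduler.py | Asynchronous scheduling implementation that lets scheduling overlap with execution |
| RequestQueue | request_queue.py | Priority queue ordered by priority/arrival_time |
| BlockPool | block_pool.py | Allocator for physical KV cache blocks |
| KVCacheManager | kv_cache_manager.py | Logical KV cache management, including prefix caching |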
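A queue ordered by priority and then by arrival time, as in the RequestQueue row, can be sketched with the standard-library `heapq`. The class name `ToyRequestQueue`, the convention that a lower priority value is served first, and the tie-breaking counter are assumptions made for this illustration; they are not taken from request_queue.py.

```python
import heapq
import itertools

class ToyRequestQueue:
    """Min-heap keyed on (priority, arrival_time); lower values are served first."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, float, int, str]] = []
        # Monotonic counter so that fully equal keys still pop in insertion order.
        self._tie_breaker = itertools.count()

    def add(self, request_id: str, priority: int, arrival_time: float) -> None:
        entry = (priority, arrival_time, next(self._tie_breaker), request_id)
        heapq.heappush(self._heap, entry)

    def pop(self) -> str:
        return heapq.heappop(self._heap)[-1]

    def __len__(self) -> int:
        return len(self._heap)

queue = ToyRequestQueue()
queue.add("a", priority=1, arrival_time=0.0)
queue.add("b", priority=0, arrival_time=2.0)
queue.add("c", priority=0, arrival_time=1.0)
order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['c', 'b', 'a']
```

With every request given the same priority, the same structure degenerates to FCFS, since arrival time becomes the only key that differs.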
Key scheduling decision flow:
python
# From interface.py L52-L75
def schedule(self) -> "SchedulerOutput":
    """Schedule the requests to process in this scheduling step.

    The scheduler produces a dictionary of {req_id: num_tokens}
    that specifies how many tokens to process for each request.
    num_tokens can be:
    - prompt token count for new requests (prefill)
    - 1 for auto-regressive decoding
    - somewhere between for chunked prefills / speculative decoding
    """
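1.4 Layer 4: Executor Layer (v1/executor/)
Responsibility: manage distributed worker processes and hide the differences between single-node, multi-node, and Ray deployments.
Executor class hierarchy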
The class diagram shows an abstract Executor and five concrete executor classes:
- Executor (<<abstract>>)
  - +get_class(vllm_config) : Executor
  - +execute_model(scheduler_output) : ModelRunnerOutput
  - +collective_rpc(method, args, kwargs) : list
  - +initialize_from_config(kv_cache_configs)
  - +determine_available_memory() : list<int>
  - +shutdown()
- UniProcExecutor: +_init_executor(), +collective_rpc()
- MultiprocExecutor: +_init_executor(), +collective_rpc()
- RayDistributedExecutor: +_init_executor(), +collective_rpc()
- RayExecutorV2: +_init_executor(), +collective_rpc()
- ExecutorWithExternalLauncher: +_init_executor()
Factory method: Executor.get_class() selects the concrete implementation based on the distributed_executor_backend setting:
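python
# abstract.py L48-L92 (condensed)
@staticmethod
def get_class(vllm_config: VllmConfig) -> type["Executor"]:
    parallel_config = vllm_config.parallel_config
    distributed_executor_backend = parallel_config.distributed_executor_backend
    if isinstance(distributed_executor_backend, type):
        executor_class = distributed_executor_backend  # user-defined executor class
    elif distributed_executor_backend == "ray":
        # Either RayExecutorV2 or RayDistributedExecutor; the condition that
        # picks between them is elided in this excerpt.
        executor_class = RayExecutorV2
    elif distributed_executor_backend == "mp":
        executor_class = MultiprocExecutor
    elif distributed_executor_backend == "uni":
        executor_class = UniProcExecutor
    elif distributed_executor_backend == "external_launcher":
        executor_class = ExecutorWithExternalLauncher
    else:
        # Without this branch, executor_class would be unbound at the return.
        raise ValueError(
            f"Unknown distributed executor backend: {distributed_executor_backend}"
        )
    return executor_class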
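The same dispatch can be reproduced in a self-contained form to show the two paths: a string backend looked up in a table, and a user-supplied class passed through. All class names below (`ToyExecutor` and its subclasses) and the `_BACKENDS` table are hypothetical; the sketch only mirrors the shape of the factory, not vLLM's implementation.

```python
class ToyExecutor:
    """Toy base class; stands in for the abstract executor."""

    @staticmethod
    def get_class(backend) -> type:
        # A class object is treated as a user-defined executor and returned as-is.
        if isinstance(backend, type):
            if not issubclass(backend, ToyExecutor):
                raise TypeError(f"{backend.__name__} is not a ToyExecutor subclass")
            return backend
        try:
            return _BACKENDS[backend]
        except KeyError:
            raise ValueError(f"Unknown executor backend: {backend!r}") from None

class ToyUniProcExecutor(ToyExecutor):
    pass

class ToyMultiprocExecutor(ToyExecutor):
    pass

# Lookup table used for string backends (defined after the subclasses exist).
_BACKENDS = {"uni": ToyUniProcExecutor, "mp": ToyMultiprocExecutor}

class MyCustomExecutor(ToyExecutor):
    pass

print(ToyExecutor.get_class("mp").__name__)
print(ToyExecutor.get_class(MyCustomExecutor).__name__)
```

A table keeps the string-to-class mapping in one place; an if/elif chain, as in the excerpt, has the advantage that each branch can import its executor lazily.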
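1.5 Layer 5: Model Executor (model_executor/)
Responsibility: load model weights, build the computation graph, and run the forward pass.
| Submodule | Responsibility |
|---|---|
| models/ | Implementations of 150+ model architectures (Llama/Qwen/Gemma/Mistral, etc.), registered through registry.py |
| models/interfaces.py | Model capability interfaces (supports_multimodal / supports_pp / is_attention_free, etc.) |
| kernels/ | Wrappers for custom CUDA/Triton kernels |
| layers/ | Common operator layers (Linear / Attention / RMSNorm, etc.) |
| warmup/ | Model warmup logic |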
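The idea of registering architectures by name and querying capability flags can be sketched as follows. This is an assumption-laden illustration: the decorator `register_model`, the function `resolve_model`, the `_MODEL_REGISTRY` dict, and the toy class are all hypothetical, and registry.py may be organized quite differently. Only the flag names `supports_multimodal` and `supports_pp` come from the table above.

```python
from typing import Callable

# Hypothetical registry: architecture name -> model class.
_MODEL_REGISTRY: dict[str, type] = {}

def register_model(arch: str) -> Callable[[type], type]:
    """Class decorator that records a model class under an architecture name."""
    def decorator(cls: type) -> type:
        if arch in _MODEL_REGISTRY:
            raise ValueError(f"Architecture {arch!r} is already registered")
        _MODEL_REGISTRY[arch] = cls
        return cls
    return decorator

@register_model("ToyLlamaForCausalLM")
class ToyLlama:
    # Capability flags, named after the interfaces listed in the table.
    supports_multimodal = False
    supports_pp = True

def resolve_model(architectures: list[str]) -> type:
    """Return the first registered class among the candidate architecture names."""
    for arch in architectures:
        if arch in _MODEL_REGISTRY:
            return _MODEL_REGISTRY[arch]
    raise ValueError(f"No registered model for {architectures}")

model_cls = resolve_model(["UnknownArch", "ToyLlamaForCausalLM"])
print(model_cls.__name__, model_cls.supports_pp)
```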
Worker types:
| Worker class | Purpose | File location |
|---|---|---|
| `GPUModelRunner` | Main model execution logic on GPU | gpu/model_runner.py |
| `CPUModelRunner` | CPU inference | cpu_model_runner.py |
| `XPUModelRunner` | Intel XPU | xpu_model_runner.py |
| `TPUModelRunner` | Google TPU | tpu_input_batch.py |
1.6 Layer 6: Kernel Layer (csrc/ + kernels/)
Responsibility: high-performance GPU kernel implementations.
| Component | Technology | Purpose |
|---|---|---|
| FlashAttention | flash-attn | Efficient attention computation |
| FlashInfer | flashinfer | Paged Attention / decode optimization |
| Cutlass MLA | CUTLASS | DeepSeek MLA attention |
| Triton Kernels | Triton | Custom fused kernels (attention / rms_norm / silu_mul) |
| Custom Ops | C++/CUDA | FP8 quantization, fused AllReduce, etc. |
II. v0 → v1 Architecture Evolution
2.1 Evolution overview
The diagram contrasts the v0 architecture (deprecated) with the v1 architecture (current).
v0 components:
- LLMEngine: monolithic class
- Scheduler: synchronous scheduling
- Worker: direct RPC
v1 components:
- LLMEngine: compatibility facade (llm_engine.py)
- EngineCoreClient: IPC abstraction
- EngineCore: decoupled core (core.py)
- EngineCoreProc: ZMQ process wrapper
- SchedulerInterface: pluggable scheduling
- Executor: distributed abstraction
Edge labels in the diagram: refactored, abstracted into an interface, unified as Executor, inproc / zmq, inherits.
2.2 v0 LLMEngine → v1 wrapping strategy
Key finding: the current vllm/engine/llm_engine.py is nothing more than a re-export alias:
python
# vllm/engine/llm_engine.py (entire file, only 6 lines)
from vllm.v1.engine.llm_engine import LLMEngine as V1LLMEngine

LLMEngine = V1LLMEngine  # type: ignore
"""The `LLMEngine` class is an alias of vllm.v1.engine.llm_engine.LLMEngine."""
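This means the v0 implementation has been removed entirely: the old LLMEngine no longer exists in the codebase, and every caller that uses LLMEngine actually gets the v1 version.
2.3 EngineCore decoupling design
EngineCore is the central innovation of the v1 architecture. It achieves the decoupling described below.
Initialization flow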
The sequence diagram has five participants: LLMEngine, EngineCoreClient, EngineCore, Executor, and Scheduler. Its messages, in the order they appear in the diagram:
- make_client(multiprocess_mode)
- init(vllm_config, executor_class)
- executor_class(vllm_config)
- determine_available_memory()
- get_kv_cache_specs()
- _initialize_kv_caches()
- Scheduler(vllm_config, kv_cache_config)
- initialize_from_config(kv_cache_configs)
The diagram also carries the note "EngineCore ready".
Core step loop
EngineCore.step() is the heartbeat of the engine:
python
# core.py L402-L431
def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    """Schedule, execute, and make output.

    Returns tuple of outputs and a flag indicating whether the model
    was executed.
    """
    # 1. Return early if there are no pending requests.
    if not self.scheduler.has_requests():
        return {}, False
    # 2. Scheduling decision: how many tokens each request processes.
    scheduler_output = self.scheduler.schedule()
    # 3. Launch the model forward pass asynchronously.
    future = self.model_executor.execute_model(scheduler_output, non_block=True)
    # 4. Fetch the grammar bitmask for structured output.
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
    # 5. Wait for the model output, then sample if needed.
    with self.log_error_detail(scheduler_output), \
            self.log_iteration_details(scheduler_output):
        model_output = future.result()
        if model_output is None:
            model_output = self.model_executor.sample_tokens(grammar_output)
    # 6. Handle aborts that arrived asynchronously.
    self._process_aborts_queue()
    # 7. Update scheduler state and build the outputs.
    engine_core_outputs = self.scheduler.update_from_output(
        scheduler_output, model_output
    )
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0
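The schedule → execute → update cycle can be run end to end with toy components. The sketch below is an assumption-heavy stand-in: `ToyScheduler`, `ToyModelExecutor`, the constant token id 42, and the dict-based outputs are invented for illustration, and the steps for grammar bitmasks, sampling, and aborts are left out.

```python
from concurrent.futures import Future, ThreadPoolExecutor

class ToyScheduler:
    def __init__(self, tokens_to_generate: dict[str, int]) -> None:
        # request id -> number of tokens still to generate
        self.pending = dict(tokens_to_generate)

    def has_requests(self) -> bool:
        return bool(self.pending)

    def schedule(self) -> dict[str, int]:
        # Every pending request decodes one token per step.
        return {req_id: 1 for req_id in self.pending}

    def update_from_output(self, scheduled: dict[str, int], model_output: dict[str, int]) -> dict:
        finished = []
        for req_id in scheduled:
            self.pending[req_id] -= 1
            if self.pending[req_id] == 0:
                del self.pending[req_id]
                finished.append(req_id)
        return {"tokens": model_output, "finished": finished}

class ToyModelExecutor:
    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)

    def execute_model(self, scheduled: dict[str, int]) -> Future:
        # Non-blocking: the "forward pass" runs on a worker thread.
        return self._pool.submit(lambda: {req_id: 42 for req_id in scheduled})

    def shutdown(self) -> None:
        self._pool.shutdown()

def step(scheduler: ToyScheduler, executor: ToyModelExecutor) -> tuple[dict, bool]:
    if not scheduler.has_requests():
        return {}, False
    scheduled = scheduler.schedule()
    future = executor.execute_model(scheduled)
    # CPU-side work could overlap with the forward pass here.
    model_output = future.result()
    outputs = scheduler.update_from_output(scheduled, model_output)
    return outputs, sum(scheduled.values()) > 0

scheduler = ToyScheduler({"a": 1, "b": 2})
executor = ToyModelExecutor()
history = []
while True:
    outputs, executed = step(scheduler, executor)
    if not executed:
        break
    history.append(outputs["finished"])
executor.shutdown()
print(history)  # [['a'], ['b']]
```

Returning a future from `execute_model` is what leaves room between launch and `result()` for work such as preparing the grammar bitmask.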
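Pipeline Parallelism optimization: batch_queue
When max_concurrent_batches > 1 (typically when PP > 1), EngineCore uses step_with_batch_queue() instead of step():
- Buffers multiple batches in a double-ended queue (`deque`)
- Prioritizes filling the batch queue, which eliminates pipeline bubbles
- Blocks until the oldest queued batch has finished
- Supports deferred sampling (for structured output combined with speculative decoding)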
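The fill-first, then block-on-oldest policy in the list above can be sketched with a `deque` of futures. The function `run_with_batch_queue`, its parameters, and the use of a thread pool with `sum` as the "model" are assumptions made for this example; deferred sampling is not modeled.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def run_with_batch_queue(batches, execute, max_concurrent_batches):
    """Keep up to max_concurrent_batches in flight; collect results oldest-first."""
    pending = deque(batches)
    batch_queue = deque()      # entries are (batch, future); newest on the left
    results = []
    peak_in_flight = 0
    while pending or batch_queue:
        # Prefer launching new work while the queue has room.
        if pending and len(batch_queue) < max_concurrent_batches:
            batch = pending.popleft()
            batch_queue.appendleft((batch, execute(batch)))
            peak_in_flight = max(peak_in_flight, len(batch_queue))
            continue
        # Queue is full or nothing is left to launch: block on the oldest batch.
        batch, future = batch_queue.pop()
        results.append((batch, future.result()))
    return results, peak_in_flight

with ThreadPoolExecutor(max_workers=2) as pool:
    results, peak = run_with_batch_queue(
        batches=[[1, 2], [3], [4, 5, 6]],
        execute=lambda batch: pool.submit(sum, batch),
        max_concurrent_batches=2,
    )
print(results, peak)
```

With `max_concurrent_batches=1` the loop degenerates to launch-then-wait, which corresponds to the plain `step()` behavior.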
2.4 v1 LLMEngine: a backward-compatible facade
The v1 LLMEngine serves as the public API facade. Its responsibilities are outlined below.
Composition pattern
python
# llm_engine.py L90-L111 (condensed)
class LLMEngine:
    def __init__(self, vllm_config, executor_class, log_stats, ...):
        # 1. Renderer: handles chat templates / multimodal inputs
        self.renderer = renderer_from_config(self.vllm_config)
        # 2. InputProcessor: EngineInput → EngineCoreRequest
        self.input_processor = InputProcessor(self.vllm_config, self.renderer)
        # 3. OutputProcessor: EngineCoreOutputs → RequestOutput
        self.output_processor = OutputProcessor(
            self.renderer.tokenizer,
            log_stats=self.log_stats,
            stream_interval=...,
        )
        # 4. EngineCoreClient: reaches EngineCore over IPC
        self.engine_core = EngineCoreClient.make_client(
            multiprocess_mode=multiprocess_mode,
            asyncio_mode=False,
            ...
        )
add_request pipeline
```python
# llm_engine.py L209-L285 (condensed)
def add_request(self, request_id, prompt, params, ...):
    # 1. Input preprocessing: EngineInput → EngineCoreRequest
    request = self.input_processor.process_inputs(
        request_id, prompt, params, ...
    )
    # 2. Fan out child requests when n > 1 (parallel sampling).
    #    n and child_params are derived from params; that code is elided here.
    if n > 1:
        for idx in range(n):
            child_request = copy(request)
            child_request.sampling_params = child_params
            self.output_processor.add_request(child_request, ...)
            self.engine_core.add_request(child_request)
    else:
        self.output_processor.add_request(request, ...)
        self.engine_core.add_request(request)
```
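To make the fan-out step concrete, here is a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `ToyParams`, `ToyRequest`, `fan_out`, the child id scheme, and the seed offset are hypothetical and are not taken from llm_engine.py.

```python
from __future__ import annotations

from copy import copy
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyParams:
    n: int = 1
    seed: int | None = None


@dataclass
class ToyRequest:
    request_id: str
    prompt_token_ids: list[int]
    params: ToyParams


def fan_out(request: ToyRequest) -> list[ToyRequest]:
    """Split an n > 1 request into n single-sample child requests."""
    n = request.params.n
    if n == 1:
        return [request]
    children = []
    for idx in range(n):
        # Shallow copy: the prompt token list is shared, not duplicated.
        child = copy(request)
        child.request_id = f"{idx}_{request.request_id}"  # hypothetical id scheme
        # Each child samples once; offset the seed so children can differ.
        seed = None if request.params.seed is None else request.params.seed + idx
        child.params = replace(request.params, n=1, seed=seed)
        children.append(child)
    return children
```

The shallow copy is the point of the design: a long prompt is referenced by every child rather than copied n times.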
step pipeline
```python
# llm_engine.py L287-L325 (condensed)
def step(self) -> list[RequestOutput | PoolingRequestOutput]:
    # 1. Fetch raw outputs from EngineCore
    outputs = self.engine_core.get_output()
    # 2. Post-processing: EngineCoreOutputs → RequestOutput
    processed_outputs = self.output_processor.process_outputs(
        outputs.outputs, engine_core_timestamp=outputs.timestamp, ...
    )
    # 3. Abort requests that were finished by a stop string
    self.engine_core.abort_requests(processed_outputs.reqs_to_abort)
    # 4. Record statistics
    if self.logger_manager is not None:
        self.logger_manager.record(...)
    return processed_outputs.request_outputs
```
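2.5 Summary of v1 core improvements
| Dimension | v0 | v1 |
|---|---|---|
| Core loop | Monolithic LLMEngine.step() | Decoupled into EngineCore.step() |
| Process model | Synchronous multi-process | Selectable: inproc / multiproc / ZMQ |
| Scheduler | Hard-coded Scheduler | Pluggable SchedulerInterface |
| Executor | Fixed RayWorkerWrapper | Executor abstraction + factory method |
| Sequence representation | Sequence / SequenceGroup | Simplified to Request (no group concept) |
| Data serialization | pickle | msgspec Struct (zero-copy friendly) |
| Async scheduling | Not supported | AsyncScheduler (scheduling ∥ execution) |
| Data Parallel | Not supported | DPEngineCoreProc + DP Coordinator |
| Elastic EP | Not supported | Full scale-up and scale-down support |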
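2.6 Backward-compatibility strategy
Although v0 has been removed, vLLM keeps its API compatible through the following mechanisms:
- Module-level alias: vllm.engine.llm_engine.LLMEngine points directly to the v1 implementation
- Attribute passthrough: self.model_executor = self.engine_core.engine_core.model_executor (the v0-compatible access path; see llm_engine.py:L124)
- Preserved interface signatures: the signatures of add_request() / step() / abort_request() are unchanged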
3. Core data structures
3.1 Request lifecycle data structures
[State diagram: request lifecycle]
States: WAITING, WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR, WAITING_FOR_REMOTE_KVS, RUNNING, PREEMPTED, FINISHED_STOPPED, FINISHED_LENGTH_CAPPED, FINISHED_ABORTED, FINISHED_ERROR, FINISHED_REPETITION, FINISHED_IGNORED.
Transition labels: add_request(); use_structured_output; grammar compiled; need remote KV; KV received; scheduled; preempted (low priority / OOM); rescheduled; stop token / stop string; max_tokens reached; client abort; internal error; repetition detected; prompt too long.
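One compact way to encode a lifecycle like this is an ordered integer enum in which every terminal state sorts after the last live state, so a single comparison answers "is this request done?". The sketch below assumes that ordering. `ToyRequestStatus` and its numeric values are illustrative and are not copied from request.py.

```python
import enum


class ToyRequestStatus(enum.IntEnum):
    """Toy mirror of the state names above; the numeric values are assumptions."""

    WAITING = enum.auto()
    WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR = enum.auto()
    WAITING_FOR_REMOTE_KVS = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Every state declared after PREEMPTED is terminal.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    FINISHED_ERROR = enum.auto()
    FINISHED_REPETITION = enum.auto()
    FINISHED_IGNORED = enum.auto()

    @staticmethod
    def is_finished(status: "ToyRequestStatus") -> bool:
        # Relies on declaration order: terminal states have the larger values.
        return status > ToyRequestStatus.PREEMPTED


finished = [s.name for s in ToyRequestStatus if ToyRequestStatus.is_finished(s)]
```

The trade-off is that the declaration order becomes part of the contract: inserting a new live state after PREEMPTED would silently mark it as finished.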
3.2 Request (internal request representation)
```python
# request.py L59-L100 (condensed)
class Request:
    def __init__(
        self,
        request_id: str,
        prompt_token_ids: list[int] | None,
        sampling_params: SamplingParams | None,
        pooling_params: PoolingParams | None,
        client_index: int = 0,
        arrival_time: float | None = None,
        prompt_embeds: torch.Tensor | None = None,
        mm_features: list[MultiModalFeatureSpec] | None = None,
        lora_request: LoRARequest | None = None,
        block_hasher: Callable | None = None,
        ...
    ):
        self.request_id = request_id
        self.status = RequestStatus.WAITING       # initial state
        self.sampling_params = sampling_params
        self.prompt_token_ids = prompt_token_ids
        self._output_token_ids: list[int] = []    # generated tokens
        self.num_computed_tokens = 0              # number of tokens computed so far
        self.block_hashes: list[BlockHash] = []   # prefix caching hashes
        self.events: list[EngineCoreEvent] = []   # event tracing
        self.num_preemptions = 0                  # number of times preempted
```
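Semantics of key fields:
| Field | Type | Description |
|---|---|---|
| `status` | `RequestStatus` | Request state machine (see the diagram above) |
| `output_token_ids` | `ConstantList[int]` | Read-only view that prevents callers from appending directly |
| `all_token_ids` | `ConstantList[int]` | Prompt tokens plus all generated tokens |
| `num_output_tokens` | `int` | Number of tokens generated so far |
| `block_hashes` | `list[BlockHash]` | Hash of each full block, used for prefix caching |
| `is_prefill_chunk` | `bool` | Whether the request is in a non-final prefill chunk |
| `num_nans_in_logits` | `int` | Number of NaNs in the logits (used for anomaly detection) |
| `streaming_queue` | `deque` | Continuation queue for a streaming session |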
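The read-only view idea behind output_token_ids and all_token_ids can be illustrated with a small wrapper. This is a sketch of the technique only: `ReadOnlyView` and `ToyRequest` are hypothetical names, and the real ConstantList may be implemented differently.

```python
from collections.abc import Sequence


class ReadOnlyView(Sequence):
    """Hypothetical read-only wrapper: reads pass through, mutators are absent."""

    def __init__(self, data: list) -> None:
        self._data = data

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self) -> int:
        return len(self._data)


class ToyRequest:
    def __init__(self, prompt_token_ids: list) -> None:
        self._output_token_ids: list = []
        self._all_token_ids: list = list(prompt_token_ids)
        # Public attributes expose views over the private lists.
        self.output_token_ids = ReadOnlyView(self._output_token_ids)
        self.all_token_ids = ReadOnlyView(self._all_token_ids)

    def append_output_token_id(self, token_id: int) -> None:
        # The single mutation path keeps both lists in sync.
        self._output_token_ids.append(token_id)
        self._all_token_ids.append(token_id)

    @property
    def num_output_tokens(self) -> int:
        return len(self._output_token_ids)
```

Because the view wraps the underlying list instead of copying it, new tokens become visible through the view immediately, while callers have no `append` to call by mistake.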
3.3 EngineInput → EngineCoreRequest → Request conversion chain
[Flowchart: request conversion chain]
- User API layer: prompt='Hello', prompt_token_ids=1,2,3, prompt_embeds=tensor(...)
- Renderer.render_chat() produces EngineInput (TypedDict): TokensInput (type='token', prompt_token_ids=[]) or EmbedsInput (type='embeds', prompt_embeds=tensor)
- InputProcessor.process_inputs() produces EngineCoreRequest (msgspec Struct): request_id, prompt_token_ids, sampling_params, mm_features, lora_request, priority, ...
- Request.from_engine_core_request() produces Request (internal): status, _output_token_ids, block_hashes, events, num_computed_tokens, prefill_stats
EngineCoreRequest definition (engine/__init__.py):
```python
# __init__.py L80-L131
class EngineCoreRequest(
    msgspec.Struct,
    array_like=True,
    omit_defaults=True,
    gc=False,  # disable GC tracking (performance-critical)
):
    request_id: str
    prompt_token_ids: list[int] | None
    mm_features: list[MultiModalFeatureSpec] | None
    sampling_params: SamplingParams | None
    pooling_params: PoolingParams | None
    arrival_time: float
    lora_request: LoRARequest | None
    cache_salt: str | None
    data_parallel_rank: int | None
    prompt_embeds: torch.Tensor | None = None
    prompt_is_token_ids: list[bool] | None = None
    client_index: int = 0
    current_wave: int = 0
    priority: int = 0
    trace_headers: Mapping[str, str] | None = None
    resumable: bool = False
    external_req_id: str | None = None
    reasoning_ended: bool | None = None
    reasoning_parser_kwargs: dict[str, Any] | None = None
```
Conversion function (request.py L186-L209):
```python
@classmethod
def from_engine_core_request(cls, request: EngineCoreRequest, block_hasher) -> "Request":
    return cls(
        request_id=request.request_id,
        client_index=request.client_index,
        prompt_token_ids=request.prompt_token_ids,
        prompt_embeds=request.prompt_embeds,
        mm_features=request.mm_features,
        sampling_params=request.sampling_params,
        pooling_params=request.pooling_params,
        arrival_time=request.arrival_time,
        lora_request=request.lora_request,
        block_hasher=block_hasher,
        resumable=request.resumable,
        ...
    )
```
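The split between a wire-level request and an internal request can be shown in miniature. The sketch below uses standard-library dataclasses instead of msgspec, and every name in it (`WireRequest`, `InternalRequest`, `from_wire`) is hypothetical. It only illustrates the idea that the serialized object carries immutable inputs while the scheduler-side object adds mutable bookkeeping.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class WireRequest:
    """Immutable fields that cross the process boundary (toy subset)."""

    request_id: str
    prompt_token_ids: list[int]
    priority: int = 0


@dataclass
class InternalRequest:
    """Scheduler-side object: wire fields plus mutable bookkeeping."""

    request_id: str
    prompt_token_ids: list[int]
    priority: int
    status: str = "WAITING"
    num_computed_tokens: int = 0
    output_token_ids: list[int] = field(default_factory=list)

    @classmethod
    def from_wire(cls, wire: WireRequest) -> InternalRequest:
        # Copy only the inputs; all bookkeeping starts from its defaults.
        return cls(
            request_id=wire.request_id,
            prompt_token_ids=wire.prompt_token_ids,
            priority=wire.priority,
        )
```

Keeping scheduling state off the wire type means the serialized payload stays small and the scheduler is free to change its bookkeeping without touching the IPC format.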
3.4 Output data structure hierarchy
Source: v1/outputs.py
[Class diagram: output data structures. Relationship labels: sample_tokens(), scheduler.update_from_output(), aggregation]
- SamplerOutput: sampled_token_ids: Tensor; logprobs_tensors: LogprobsTensors
- ModelRunnerOutput: req_ids: list[str]; req_id_to_index: dict; sampled_token_ids: list[list[int]]; logprobs: LogprobsLists; prompt_logprobs_dict: dict; pooler_output: list; kv_connector_output: KVConnectorOutput; num_nans_in_logits: dict
- EngineCoreOutput: request_id: str; new_token_ids: list[int]; new_logprobs: LogprobsLists; finish_reason: FinishReason; events: list[EngineCoreEvent]; prefill_stats: PrefillStats
- EngineCoreOutputs: outputs: list[EngineCoreOutput]; scheduler_stats: SchedulerStats; timestamp: float; finished_requests: set[str]; utility_output: UtilityOutput
Key data structures in detail:
SamplerOutput (outputs.py L118-L124):
```python
@dataclass
class SamplerOutput:
    sampled_token_ids: torch.Tensor            # [num_reqs, max_num_generated_tokens]
    logprobs_tensors: LogprobsTensors | None   # logprob information
```
ModelRunnerOutput (outputs.py L166-L206):
```python
@dataclass
class ModelRunnerOutput:
    req_ids: list[str]                              # list of request IDs
    req_id_to_index: dict[str, int]                 # request ID → index mapping
    sampled_token_ids: list[list[int]]              # token IDs generated for each request
    logprobs: LogprobsLists | None                  # log probabilities
    prompt_logprobs_dict: dict                      # logprobs for the prompt phase
    pooler_output: list[Tensor | None]              # pooling model output
    kv_connector_output: KVConnectorOutput | None   # KV transfer information
    num_nans_in_logits: dict | None                 # NaN detection
```
EngineCoreOutput (engine/__init__.py L161-L191):
```python
class EngineCoreOutput(msgspec.Struct, array_like=True, omit_defaults=True, gc=False):
    request_id: str
    new_token_ids: list[int]
    new_logprobs: LogprobsLists | None
    finish_reason: FinishReason | None  # STOP/LENGTH/ABORT/ERROR/REPETITION
    events: list[EngineCoreEvent] | None
    prefill_stats: PrefillStats | None
```
3.5 FinishReason state machine
```python
# engine/__init__.py L42-L64
class FinishReason(enum.IntEnum):
    STOP = 0        # stop string / stop token emitted
    LENGTH = 1      # max_tokens consumed or max_model_len reached
    ABORT = 2       # aborted by client
    ERROR = 3       # retryable internal error (→ 500)
    REPETITION = 4  # repetitive pattern (hallucination)
```
4. Design principles
4.1 Configuration-driven
vLLM drives all of its behavior through 28+ configuration modules; the core configuration is aggregated in VllmConfig:
[Diagram: VllmConfig is the aggregate root of the following sub-configurations]
- ModelConfig: model configuration
- CacheConfig: KV cache configuration
- ParallelConfig: parallelism configuration
- SchedulerConfig: scheduling configuration
- DeviceConfig: device configuration
- LoadConfig: loading configuration
- AttentionConfig: attention configuration
- KernelConfig: kernel configuration
- CompilationConfig: compilation configuration
- SpeculativeConfig: speculative decoding configuration
- ObservabilityConfig: observability
- QuantizationConfig: quantization configuration
- ReasoningConfig: reasoning configuration
- StructuredOutputsConfig: structured outputs
- KVTransferConfig: KV transfer
- LoRAConfig: LoRA configuration
- OffloadConfig: offload configuration
- ProfilerConfig: profiler
- MambaConfig: Mamba configuration
- MultimodalConfig: multimodal configuration
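VllmConfig.__post_init__() (vllm.py L758-L1401) performs extensive cross-validation and default derivation at initialization time:
- Applying the optimization level (O0/O1/O2/O3)
- Deciding whether to enable async scheduling automatically
- Computing cudagraph capture sizes
- Deriving the SP (Sequence Parallelism) threshold
- Applying platform-specific defaults
- Checking KV transfer compatibility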
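The pattern of validating and deriving defaults in a dataclass `__post_init__` hook can be sketched as follows. All class names and every rule in this example are hypothetical: the condition for enabling async scheduling and the power-of-two capture sizes are placeholders chosen to make the sketch runnable, not vLLM's actual logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyParallelConfig:
    pipeline_parallel_size: int = 1


@dataclass
class ToySchedulerConfig:
    max_num_seqs: int = 256
    async_scheduling: bool | None = None  # None means "decide automatically"


@dataclass
class ToyConfig:
    parallel: ToyParallelConfig = field(default_factory=ToyParallelConfig)
    scheduler: ToySchedulerConfig = field(default_factory=ToySchedulerConfig)
    capture_sizes: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Validation: reject values that no sub-config could check alone.
        if self.scheduler.max_num_seqs < 1:
            raise ValueError("max_num_seqs must be positive")
        # Derived default (placeholder rule): depends on another sub-config.
        if self.scheduler.async_scheduling is None:
            self.scheduler.async_scheduling = (
                self.parallel.pipeline_parallel_size == 1
            )
        # Derived default (placeholder rule): powers of two up to max_num_seqs.
        if not self.capture_sizes:
            size = 1
            while size <= self.scheduler.max_num_seqs:
                self.capture_sizes.append(size)
                size *= 2
```

Doing this in the aggregate root, rather than in each sub-config, is what allows one setting to depend on another: the root is the only object that sees all of them.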
Optimization level system:
```python
# vllm.py L68-L265
class OptimizationLevel(IntEnum):
    O0 = 0  # no optimization, fastest startup
    O1 = 1  # Dynamo+Inductor + Piecewise CUDAGraph
    O2 = 2  # Full + Piecewise CUDAGraph (default)
    O3 = 3  # O2 + FlashInfer autotune
```
4.2 Registry pattern
Model registry
model_executor/models/registry.py maintains a registry of 150+ model architectures:
```python
# registry.py L70-L221 (condensed)
_TEXT_GENERATION_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    "Qwen3ForCausalLM": ("qwen3", "Qwen3ForCausalLM"),
    "DeepseekV3ForCausalLM": ("deepseek_v2", "DeepseekV3ForCausalLM"),
    "Gemma4ForCausalLM": ("gemma4", "Gemma4ForCausalLM"),
    # ... 200+ entries in total
}
_EMBEDDING_MODELS = { ... }              # embedding models
_MULTIMODAL_MODELS = { ... }             # multimodal models
_SPECULATIVE_DECODING_MODELS = { ... }   # speculative decoding models
_VLLM_MODELS = {
    **_TEXT_GENERATION_MODELS,
    **_EMBEDDING_MODELS,
    **_MULTIMODAL_MODELS,
    **_SPECULATIVE_DECODING_MODELS,
    # ...
}
ModelRegistry = _ModelRegistry({
    # Lazy loading: avoids initializing CUDA at import time
    model_arch: _LazyRegisteredModel(
        module_name=f"vllm.model_executor.models.{mod_relname}",
        class_name=cls_name,
    )
    for model_arch, (mod_relname, cls_name) in _VLLM_MODELS.items()
})
```
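Key design point: _LazyRegisteredModel defers the import, which avoids initializing a CUDA context in non-GPU processes.
Multimodal processor registry
multimodal/registry.py registers processors through the decorator pattern: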
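The lazy-import technique itself needs nothing beyond importlib. The sketch below registers (module, class) name pairs and imports the module only when an entry is resolved. `LazyEntry`, `TOY_REGISTRY`, and `resolve` are hypothetical names, and the registered targets are standard-library classes chosen so the example runs anywhere.

```python
import importlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyEntry:
    module_name: str
    class_name: str

    def load(self) -> type:
        # The import happens here, on first use, not at registration time.
        module = importlib.import_module(self.module_name)
        return getattr(module, self.class_name)


# Stand-in targets from the standard library (not real model classes).
_TOY_MODELS = {
    "Fraction": ("fractions", "Fraction"),
    "Template": ("string", "Template"),
}

TOY_REGISTRY = {
    arch: LazyEntry(module_name=mod, class_name=cls)
    for arch, (mod, cls) in _TOY_MODELS.items()
}


def resolve(arch: str) -> type:
    """Look up an architecture name and import its class on demand."""
    if arch not in TOY_REGISTRY:
        raise ValueError(f"unsupported architecture: {arch}")
    return TOY_REGISTRY[arch].load()
```

Building the registry costs only a dictionary of strings, so a process that merely needs to list or validate architecture names never pays for importing the heavy modules behind them.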
```python
# registry.py L142-L174 (condensed)
class MultiModalRegistry:
    def register_processor(
        self,
        processor: MultiModalFactory[_I],
        *,
        info: ProcessingInfoFactory[_I],
        dummy_inputs: DummyInputsBuilderFactory[_I],
    ):
        def wrapper(model_cls: N) -> N:
            model_cls._processor_factory = _ProcessorFactories(
                info=info, dummy_inputs=dummy_inputs, processor=processor,
            )
            return model_cls
        return wrapper

# Usage example (in a model file):
# @MULTIMODAL_REGISTRY.register_processor(MyProcessor.factory, ...)
# class MyModel(nn.Module): ...
```
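A stripped-down, runnable version of this decorator-based registration is sketched below. `ToyRegistry`, `TOY_MM_REGISTRY`, `ToyModel`, and the lookup helper `processor_for` are hypothetical, and a plain dict stands in for the factory bundle. The point is only that the decorator stores the factories on the class and returns the class unchanged.

```python
from typing import Callable, TypeVar

T = TypeVar("T", bound=type)


class ToyRegistry:
    def register_processor(
        self, processor: Callable[[str], str], *, info: str
    ) -> Callable[[T], T]:
        def wrapper(model_cls: T) -> T:
            # Attach the factories to the class; the registry stores nothing.
            model_cls._processor_factory = {"processor": processor, "info": info}
            return model_cls

        return wrapper

    def processor_for(self, model_cls: type) -> Callable[[str], str]:
        """Hypothetical lookup helper: read back what the decorator attached."""
        factories = getattr(model_cls, "_processor_factory", None)
        if factories is None:
            raise ValueError(f"{model_cls.__name__} has no registered processor")
        return factories["processor"]


TOY_MM_REGISTRY = ToyRegistry()


@TOY_MM_REGISTRY.register_processor(str.upper, info="toy")
class ToyModel:
    pass


class Unregistered:
    pass
```

Storing the factories on the model class keeps registration next to the model definition and means no central table has to be edited when a new model is added.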
4.3 Strategy pattern
Attention backend selector
v1/attention/selector.py selects the best attention backend based on runtime parameters:
Selection flow (diagram): get_attn_backend() (selector.py L53-L103) → build AttentionSelectorConfig → _cached_get_attn_backend() (decorated with @cache) → current_platform.get_attn_backend_cls() → an AttentionBackend subclass, one of:
- FlashAttention
- FlashInfer
- TritonAttention
- ROCm AITER
- MLA Backends (DeepSeek)
- Mamba Attention
Selection parameter space (selector.py L22-L50):
```python
# Excerpt: NamedTuple, torch, CacheDType and AttentionType are imported in selector.py.
class AttentionSelectorConfig(NamedTuple):
    head_size: int
    dtype: torch.dtype
    kv_cache_dtype: CacheDType | None
    block_size: int | None
    use_mla: bool = False  # DeepSeek MLA
    has_sink: bool = False  # Sink attention
    use_sparse: bool = False  # Sparse attention
    use_mm_prefix: bool = False  # Multimodal prefix
    use_per_head_quant_scales: bool = False
    attn_type: str = AttentionType.DECODER
    use_non_causal: bool = False
    use_batch_invariant: bool = False  # Batch-invariant mode
```
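The caching step in this flow can be sketched as follows. A NamedTuple is hashable, so it can be used directly as the argument of a `functools.cache`-decorated function, and backend resolution then runs once per distinct configuration. `ToySelectorConfig`, `choose_backend`, and the selection rules are hypothetical illustrations, not vLLM's actual logic.

```python
from functools import cache
from typing import NamedTuple


class ToySelectorConfig(NamedTuple):
    """Hypothetical, trimmed-down stand-in for AttentionSelectorConfig."""
    head_size: int
    dtype: str
    use_mla: bool = False


CALLS = {"count": 0}  # counts how often the uncached body actually runs


@cache
def choose_backend(config: ToySelectorConfig) -> str:
    """Resolve a backend name; the rules below are illustrative only."""
    CALLS["count"] += 1
    if config.use_mla:
        return "MLA"
    if config.dtype in ("float16", "bfloat16"):
        return "FlashAttention"
    return "TritonAttention"


# Two equal configs hash to the same key, so the second call is a cache hit.
first = choose_backend(ToySelectorConfig(head_size=128, dtype="bfloat16"))
second = choose_backend(ToySelectorConfig(head_size=128, dtype="bfloat16"))
```

Packing every selection input into one immutable tuple is what makes the cache safe: any parameter that can change the answer is part of the key.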
Quantization backend selection
QuantizationConfig and its subclasses drive the choice of quantization backend:
- FP8: Fp8Config → Fp8Linear
- GPTQ: GptqConfig → GptqMarlinLinear
- AWQ: AwqConfig → AwqLinear
- NVFP4: Nvfp4Config → Nvfp4Linear
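A minimal sketch of this config-driven dispatch, assuming each config subclass knows which linear implementation it pairs with. All class names and the `get_linear_method` hook are hypothetical toys, chosen only to mirror the mapping above.

```python
class ToyLinearMethod:
    """Hypothetical base class for a linear-layer implementation."""
    name = "unquantized"


class ToyFp8Linear(ToyLinearMethod):
    name = "fp8"


class ToyAwqLinear(ToyLinearMethod):
    name = "awq"


class ToyQuantConfig:
    """Hypothetical base config; subclasses choose their own linear method."""

    def get_linear_method(self) -> ToyLinearMethod:
        return ToyLinearMethod()


class ToyFp8Config(ToyQuantConfig):
    def get_linear_method(self) -> ToyLinearMethod:
        return ToyFp8Linear()


class ToyAwqConfig(ToyQuantConfig):
    def get_linear_method(self) -> ToyLinearMethod:
        return ToyAwqLinear()


def build_linear(config: ToyQuantConfig) -> ToyLinearMethod:
    # The caller never branches on the quantization scheme;
    # the config object carries that decision.
    return config.get_linear_method()
```

Supporting a new scheme in this sketch means adding one config subclass and one linear class, with no change to `build_linear`.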
4.4 Factory Pattern
Executor factory
As noted in [Section 4.1](#4.1 节), Executor.get_class() is a classic factory method:
```python
# abstract.py L48-L92 (condensed; imports of the executor classes are omitted)
@staticmethod
def get_class(vllm_config: VllmConfig) -> type["Executor"]:
    backend = vllm_config.parallel_config.distributed_executor_backend
    match backend:
        case "ray":
            return RayExecutorV2  # or RayDistributedExecutor
        case "mp": return MultiprocExecutor
        case "uni": return UniProcExecutor
        case "external_launcher": return ExecutorWithExternalLauncher
        case type(): return backend  # user-defined subclass
        case str(): return resolve_obj_by_qualname(backend)  # load by qualified name
```
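The last two branches (a class passed directly, or a string resolved by qualified name) can be sketched with the standard library alone. `resolve_by_qualname` here is a hypothetical stand-in written with `importlib`, not vLLM's `resolve_obj_by_qualname`, and the executor classes are toys.

```python
import importlib


def resolve_by_qualname(qualname: str) -> object:
    """Hypothetical helper: import 'pkg.module.Name' and return the attribute."""
    module_name, _, attr = qualname.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Name', got {qualname!r}")
    return getattr(importlib.import_module(module_name), attr)


class ToyExecutor:
    """Hypothetical base class."""


class ToyUniProcExecutor(ToyExecutor):
    pass


class ToyMultiprocExecutor(ToyExecutor):
    pass


_BUILTIN = {"uni": ToyUniProcExecutor, "mp": ToyMultiprocExecutor}


def get_executor_class(backend: object) -> type:
    """Return an executor class for a short name, a class, or a qualified name."""
    if isinstance(backend, type):
        return backend  # caller supplied a class directly
    if isinstance(backend, str):
        if backend in _BUILTIN:
            return _BUILTIN[backend]  # well-known short name
        return resolve_by_qualname(backend)  # e.g. "my_pkg.executors.Custom"
    raise TypeError(f"unsupported backend: {backend!r}")
```

Checking the short names before the generic string branch matters in both versions: every short name is also a string, so the more specific case has to come first.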
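EngineCoreClient factory
make_client() in core_client.py builds a different client implementation depending on multiprocess_mode and asyncio_mode:
- inproc mode: holds a direct reference to EngineCore
- multiproc mode: talks to EngineCoreProc over ZMQ sockets
- asyncio mode: the asynchronous variant of the client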
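A minimal sketch of a two-flag factory in this style. The three client classes are hypothetical toys, and rejecting asyncio mode without multiprocess mode is an assumption made for this sketch.

```python
class ToyInprocClient:
    """Hypothetical: would hold the engine core and call it directly."""


class ToySyncMPClient:
    """Hypothetical: would exchange messages with a core process, blocking."""


class ToyAsyncMPClient:
    """Hypothetical: would exchange messages with a core process via asyncio."""


def make_client(multiprocess_mode: bool, asyncio_mode: bool) -> object:
    """Pick a client implementation from the two mode flags."""
    if asyncio_mode and not multiprocess_mode:
        # Assumption for this sketch: asyncio is only paired with a core process.
        raise NotImplementedError("asyncio mode requires multiprocess mode here")
    if multiprocess_mode and asyncio_mode:
        return ToyAsyncMPClient()
    if multiprocess_mode:
        return ToySyncMPClient()
    return ToyInprocClient()
```

Callers then depend only on the shared client interface, so switching between in-process and multi-process deployment is a configuration change.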
KV Cache Offload factory
kv_offload/factory.py picks a KV offload backend based on kv_transfer_config.kv_connector:
- "OffloadingConnector": CPU offloading
- "LMCacheConnectorV1": LMCache integration
- NIXL Connector: RDMA-based transfer
4.5 Other notable design patterns
| Pattern | Where it applies | Location |
|---|---|---|
| Observer | Stats logging / metrics collection | metrics/ |
| Adapter | Platform differences (CUDA/ROCm/XPU/TPU) | platforms/ |
| Decorator | Compilation pass injection / tracing | compilation/ |
| Prototype | Dummy input construction | multimodal/processing/ |
| Command | Remote invocation of utility methods | EngineCoreProc._handle_client_request() |
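The Command row can be illustrated with a small sketch: the caller sends a method name plus arguments as data, and the receiving side looks the method up and invokes it. `UtilityCall`, `ToyEngineCore`, and `handle_utility` are hypothetical names, and the reply format is an assumption, not the shape used by `_handle_client_request()`.

```python
from dataclasses import dataclass


@dataclass
class UtilityCall:
    """Hypothetical command object: which method to run, and with what."""
    call_id: int
    method_name: str
    args: tuple = ()


class ToyEngineCore:
    """Hypothetical receiver exposing a couple of utility methods."""

    def __init__(self) -> None:
        self.cache_resets = 0

    def reset_prefix_cache(self) -> bool:
        self.cache_resets += 1
        return True

    def add(self, a: int, b: int) -> int:
        return a + b


def handle_utility(engine: ToyEngineCore, call: UtilityCall) -> tuple:
    """Run the named method and return (call_id, result, error)."""
    if call.method_name.startswith("_"):
        return call.call_id, None, f"unknown method: {call.method_name}"
    method = getattr(engine, call.method_name, None)
    if not callable(method):
        return call.call_id, None, f"unknown method: {call.method_name}"
    try:
        return call.call_id, method(*call.args), None
    except Exception as exc:  # report the failure instead of crashing the loop
        return call.call_id, None, str(exc)
```

Carrying a `call_id` in both request and reply lets the caller match responses to outstanding calls even when they arrive out of order.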
5. Key Interaction Sequences
5.1 Full lifecycle of an inference request
Sequence diagram. Participants: Client, OpenAI API, LLMEngine, InputProcessor, EngineCore, Scheduler, Executor, ModelRunner, OutputProcessor.
Messages, in diagram order (part of the sequence sits inside a "loop: each iteration" block):
- POST /v1/chat/completions
- add_request(prompt, params)
- process_inputs(prompt, params)
- EngineCoreRequest
- add_request(request)
- add_request(Request)
- schedule()
- SchedulerOutput
- execute_model(SchedulerOutput)
- forward(batch)
- hidden_states
- sample_tokens(grammar_output)
- sampled_token_ids
- ModelRunnerOutput
- update_from_output(output)
- EngineCoreOutputs
- EngineCoreOutputs
- process_outputs(outputs)
- RequestOutput[]
- stream chunks / final response
- JSON response
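The core of this sequence (schedule, execute, update, repeated every iteration) can be sketched as a toy loop. Everything here is hypothetical: `ToyScheduler` admits requests up to a fixed batch size, and `execute_model` is a stand-in that emits a fake token per request instead of running a forward pass and sampling.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    max_tokens: int
    output: list = field(default_factory=list)


class ToyScheduler:
    """Hypothetical scheduler: FIFO admission with a fixed batch-size cap."""

    def __init__(self, max_batch: int) -> None:
        self.max_batch = max_batch
        self.waiting = deque()
        self.running = []

    def add_request(self, request: ToyRequest) -> None:
        self.waiting.append(request)

    def schedule(self) -> list:
        # Admit waiting requests until the batch is full.
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def update_from_output(self, batch: list, token_ids: list) -> list:
        # Append one token per request and retire requests that hit their limit.
        finished = []
        for request, token in zip(batch, token_ids):
            request.output.append(token)
            if len(request.output) >= request.max_tokens:
                self.running.remove(request)
                finished.append(request)
        return finished

    def has_requests(self) -> bool:
        return bool(self.waiting or self.running)


def execute_model(batch: list) -> list:
    # Stand-in for forward + sampling: the "token" is the current output length.
    return [len(request.output) for request in batch]


def step(scheduler: ToyScheduler) -> list:
    batch = scheduler.schedule()
    token_ids = execute_model(batch)
    return scheduler.update_from_output(batch, token_ids)


scheduler = ToyScheduler(max_batch=2)
for rid, limit in [("a", 1), ("b", 3), ("c", 2)]:
    scheduler.add_request(ToyRequest(rid, limit))

finished_order = []
while scheduler.has_requests():
    finished_order.extend(r.request_id for r in step(scheduler))
```

Request "c" joins the batch as soon as "a" finishes, while "b" is still generating; that slot reuse between iterations is the essence of continuous batching.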
5.2 IPC flow in the multi-process architecture
Sequence diagram. Participants: frontend process (API Server), EngineCoreClient, InputSocket Thread, input_queue, EngineCore (Busy Loop), OutputSocket Thread, output_queue.
Messages, in diagram order:
- add_request(req)
- ZMQ SEND (msgpack)
- put((ADD, Request))
- get() ← busy-loop polling
- step() → schedule → execute → update
- put((client_idx, EngineCoreOutputs))
- ZMQ SEND (msgpack)
- recv() → get_output()
- EngineCoreOutputs
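The queue hand-off around the busy loop can be sketched with in-process queues and one thread. This deliberately replaces the ZMQ sockets and msgpack serialization with `queue.Queue`, so it shows only the decoupling between I/O and the core loop; the message tuples and the `SHUTDOWN` sentinel are hypothetical.

```python
import queue
import threading

input_queue = queue.Queue()
output_queue = queue.Queue()
SHUTDOWN = object()  # hypothetical sentinel that stops the loop


def engine_core_busy_loop() -> None:
    """Drain input_queue, do one unit of work per item, publish the result."""
    while True:
        item = input_queue.get()  # blocks until work arrives
        if item is SHUTDOWN:
            break
        request_type, client_idx, payload = item
        if request_type == "ADD":
            # Stand-in for step(): schedule -> execute -> update.
            output_queue.put((client_idx, payload.upper()))


core = threading.Thread(target=engine_core_busy_loop)
core.start()

# The main thread plays the role of the socket threads on both sides.
input_queue.put(("ADD", 0, "hello"))
input_queue.put(("ADD", 1, "world"))
results = [output_queue.get(timeout=5) for _ in range(2)]

input_queue.put(SHUTDOWN)
core.join(timeout=5)
```

Tagging each output with `client_idx` is what lets a single core loop serve several frontends and route every result back to the right one.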
6. Directory Quick Reference
| Directory | Layer | Key files | Summary |
|---|---|---|---|
| entrypoints/ | L1 | openai/api_server.py | HTTP API entry point |
| engine/ | L2 | llm_engine.py | Alias re-exporting the v1 engine |
| v1/engine/ | L2 | llm_engine.py, core.py | Engine facade + core loop |
| v1/core/sched/ | L3 | scheduler.py, interface.py | Scheduling algorithm |
| v1/executor/ | L4 | abstract.py, uniproc_executor.py | Distributed execution |
| v1/worker/ | L4-L5 | gpu/model_runner.py, worker_base.py | Worker abstraction and GPU execution |
| model_executor/ | L5 | models/registry.py, models/llama.py | Model loading and execution |
| v1/attention/ | L5-L6 | selector.py, backends/ | Attention backend selection |
| kernels/ | L6 | vllm_c.py | Custom ops entry point |
| config/ | Cross-cutting | vllm.py | Global configuration aggregation |
| multimodal/ | Cross-cutting | registry.py | Multimodal processor registry |
| distributed/ | Cross-cutting | parallel_state.py | Distributed communication primitives |
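7. Further Reading
After this document, the following order is suggested for going deeper into the source:
- EngineCore.step(): the complete flow of a single execution step
- Scheduler.schedule(): the details of the scheduling algorithm (chunked prefill / continuous batching / preemption)
- GPUModelRunner.execute_model(): the full path of the model forward pass
- InputProcessor.process_inputs(): the full conversion from HTTP request to EngineCoreRequest
- AttentionSelectorConfig: how the attention backend selection strategy works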