01 - vLLM Architecture Overview

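Scope: this document analyzes the vLLM source code at the architecture level to build a global picture. It covers the six-layer architecture, the v0 → v1 evolution, the core data structures, and the design principles.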

Overall architecture diagram

[Figure: six-layer architecture, top to bottom. Edge labels between layers, in order: HTTP/gRPC, EngineCoreRequest, SchedulerOutput, collective_rpc, CUDA Kernel.]

  • Layer 1: Service layer (entrypoints/): OpenAI API Server, Anthropic API, Chat Completion, Embedding/Pooling, Batch Serving
  • Layer 2: Engine layer (engine/ + v1/engine/): EngineCore (core engine loop), LLMEngine (compatibility wrapper), InputProcessor (input processing), OutputProcessor (output processing), Detokenizer (detokenization)
  • Layer 3: Scheduling core (v1/core/sched/): Scheduler (scheduling algorithm), AsyncScheduler (asynchronous scheduling), RequestQueue (request queue), BlockPool (KV cache block pool)
  • Layer 4: Executor layer (v1/executor/): UniProcExecutor (single process), MultiprocExecutor (multi-process), RayExecutor (Ray distributed), WorkerBase (worker abstraction)
  • Layer 5: Model executor (model_executor/): Model Registry (150+ model architectures), ModelRunner (GPU/CPU/XPU/TPU), quantization backends (FP8/GPTQ/AWQ/NVFP4)
  • Layer 6: Kernel layer (csrc/): CUDA Kernels (cutlass/flashinfer/triton), Custom Ops (vllm._custom_ops)


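1. Layered Architecture in Detail

1.1 Layer 1: Service Layer (entrypoints/)

Responsibility: expose the external API and handle protocol translation, request routing, and load balancing.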

| Submodule | Responsibility | Key files |
| --- | --- | --- |
| openai/ | OpenAI-compatible API (chat/completion/embedding) | api_server.py, serving.py |
| anthropic/ | Anthropic Messages API compatibility | api_router.py |
| pooling/ | Embedding / Classification / Scoring | embed/io_processor.py |
| cli/ | CLI entry points (serve/benchmark) | serve.py |
| grpc_server.py | gRPC service entry point | grpc_server.py |

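Interface definition: llm.py, which provides the LLM class for offline inference, turns requests into calls on LLMEngine; the HTTP-facing servers rely on protocol.py to define their data contracts.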

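1.2 Layer 2: Engine Layer (engine/ + v1/engine/)

Responsibility: orchestrate the full pipeline of input preprocessing, scheduling and execution, and output postprocessing.

Core components
  • LLMEngine (v1): the user-facing engine interface, responsible for:

    • Input conversion (InputProcessor): EngineInput → EngineCoreRequest
    • Output conversion (OutputProcessor): EngineCoreOutputs → RequestOutput
    • Stats logging (StatLoggerManager)
    • LoRA management
  • EngineCore: the decoupled core engine, which contains:

    • Model executor management (model_executor)
    • Scheduler management (scheduler)
    • KV cache initialization and management
    • The core step loop (step() / step_with_batch_queue())
    • Multimodal cache management (mm_receiver_cache)
    • Structured output management (structured_output_manager)
  • EngineCoreProc: a ZMQ-based wrapper for inter-process communication, which supports:

    • Running EngineCore in a background process
    • Socket IO threads (input/output)
    • A handshake protocol
    • Data Parallel coordination (DPEngineCoreProc)
    • Elastic EP scale-up and scale-down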
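
The background-loop arrangement described for EngineCoreProc can be pictured with a small sketch. It rests on loud assumptions: a thread and `queue.Queue` objects stand in for the separate process and the ZMQ sockets, the "model" just upper-cases the prompt, and `ToyCoreProc` with all of its methods is a hypothetical name rather than a vLLM API.

```python
import queue
import threading


class ToyCoreProc:
    """Hypothetical stand-in for a core loop behind input/output channels."""

    _STOP = object()  # sentinel that tells the loop to exit

    def __init__(self):
        self.input_queue = queue.Queue()   # stands in for the input socket
        self.output_queue = queue.Queue()  # stands in for the output socket
        self._thread = threading.Thread(target=self._run_busy_loop, daemon=True)
        self._thread.start()

    def _run_busy_loop(self):
        # Drain requests one at a time; a real core would schedule and run a model.
        while True:
            item = self.input_queue.get()
            if item is self._STOP:
                break
            req_id, prompt = item
            self.output_queue.put((req_id, prompt.upper()))

    def add_request(self, req_id, prompt):
        self.input_queue.put((req_id, prompt))

    def get_output(self, timeout=5.0):
        return self.output_queue.get(timeout=timeout)

    def shutdown(self):
        self.input_queue.put(self._STOP)
        self._thread.join(timeout=5.0)


proc = ToyCoreProc()
proc.add_request("req-0", "hello")
result = proc.get_output()
proc.shutdown()
print(result)  # ('req-0', 'HELLO')
```

The caller never touches the loop directly; it only enqueues requests and dequeues outputs, which is the property that lets the core move to another process without changing the frontend.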

Data flow between layers

[Figure: data flow across the frontend (service layer), the engine layer, and EngineCore.]

HTTP Request
  → render → EngineInput (TokensInput/EmbedsInput)
  → InputProcessor.process_inputs() → EngineCoreRequest (msgspec Struct)
  → add_request() → Request (internal representation)
  → schedule() → SchedulerOutput
  → execute_model() → ModelRunnerOutput
  → update_from_output() → EngineCoreOutputs (msgpack-serialized)
  → OutputProcessor → RequestOutput (user-visible)

1.3 Layer 3: Scheduling Core (v1/core/sched/)

Responsibility: decide how many tokens each request processes in every step, and manage KV cache allocation.

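| Component | File | Responsibility |
| --- | --- | --- |
| SchedulerInterface | interface.py | Abstract base class for schedulers; defines core methods such as schedule / add_request / update_from_output |
| Scheduler | scheduler.py | Default scheduler implementation: FCFS + continuous batching |
| AsyncScheduler | async_scheduler.py | Asynchronous scheduling; lets scheduling and execution overlap |
| RequestQueue | request_queue.py | Priority queue; supports ordering by priority/arrival_time |
| BlockPool | block_pool.py | Physical block allocator for the KV cache |
| KVCacheManager | kv_cache_manager.py | Logical KV cache management, including prefix caching |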
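
To make the "physical block allocator" role concrete, here is a minimal sketch of a free-list pool with reference counts. It is an illustration under assumptions, not the BlockPool implementation: `ToyBlockPool`, `blocks_needed`, and the sharing rule are hypothetical, and real prefix caching also involves hashing block contents, which is omitted here.

```python
from collections import deque


class ToyBlockPool:
    """Hypothetical free-list allocator for fixed-size KV cache blocks."""

    def __init__(self, num_blocks):
        self.free_blocks = deque(range(num_blocks))  # block ids available for reuse
        self.ref_counts = [0] * num_blocks

    def num_free(self):
        return len(self.free_blocks)

    def allocate(self, n):
        if n > len(self.free_blocks):
            raise RuntimeError("not enough free KV cache blocks")
        blocks = [self.free_blocks.popleft() for _ in range(n)]
        for block_id in blocks:
            self.ref_counts[block_id] = 1
        return blocks

    def share(self, blocks):
        # A second request reuses the same blocks (for example a shared prefix).
        for block_id in blocks:
            self.ref_counts[block_id] += 1

    def free(self, blocks):
        # A block returns to the free list only when nobody references it.
        for block_id in blocks:
            self.ref_counts[block_id] -= 1
            if self.ref_counts[block_id] == 0:
                self.free_blocks.append(block_id)


def blocks_needed(num_tokens, block_size):
    # Ceiling division: a partially filled block still occupies a whole block.
    return -(-num_tokens // block_size)


pool = ToyBlockPool(num_blocks=8)
req_a = pool.allocate(blocks_needed(num_tokens=40, block_size=16))  # 3 blocks
pool.share(req_a)   # request B reuses A's blocks
pool.free(req_a)    # A finishes; blocks stay alive for B
free_while_shared = pool.num_free()
pool.free(req_a)    # B finishes; blocks return to the pool
print(free_while_shared, pool.num_free())  # 5 8
```

Reference counting is what allows two requests with a common prefix to point at the same blocks without either one freeing memory the other still needs.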

Key scheduling decision flow

```python
# From interface.py L52-L75
def schedule(self) -> "SchedulerOutput":
    """Schedule the requests to process in this scheduling step.

    The scheduler produces a dictionary of {req_id: num_tokens}
    that specifies how many tokens to process for each request.
    num_tokens can be:
    - prompt token count for new requests (prefill)
    - 1 for auto-regressive decoding
    - somewhere between for chunked prefills / speculative decoding
    """
```

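The `{req_id: num_tokens}` contract in the docstring can be illustrated with a toy scheduler that works under a per-step token budget. This is a sketch under stated assumptions: `ToyRequest`, `toy_schedule`, and the "running first, then waiting in arrival order" policy are hypothetical simplifications, and KV cache limits, preemption, and speculative decoding are ignored.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    req_id: str
    num_prompt_tokens: int
    num_computed_tokens: int = 0  # tokens whose KV entries already exist


def toy_schedule(running, waiting, token_budget):
    """Return {req_id: num_tokens} for one step, never exceeding token_budget."""
    scheduled = {}
    # Running requests go first, in their existing order.
    for req in running:
        if token_budget == 0:
            break
        remaining_prompt = req.num_prompt_tokens - req.num_computed_tokens
        # Decode needs 1 token; an unfinished prefill needs the rest of the prompt.
        wanted = max(remaining_prompt, 1)
        num_tokens = min(wanted, token_budget)
        scheduled[req.req_id] = num_tokens
        token_budget -= num_tokens
    # Then admit waiting requests in arrival order while budget remains.
    while waiting and token_budget > 0:
        req = waiting.pop(0)
        num_tokens = min(req.num_prompt_tokens, token_budget)  # may be a chunk
        scheduled[req.req_id] = num_tokens
        token_budget -= num_tokens
        running.append(req)
    return scheduled


running = [ToyRequest("decode-0", num_prompt_tokens=5, num_computed_tokens=5)]
waiting = [
    ToyRequest("new-0", num_prompt_tokens=6),
    ToyRequest("new-1", num_prompt_tokens=10),
]
plan = toy_schedule(running, waiting, token_budget=10)
print(plan)  # {'decode-0': 1, 'new-0': 6, 'new-1': 3}
```

The three entries correspond to the three cases in the docstring: one token for decoding, a full prompt for a new request, and a partial chunk when the budget runs out mid-prompt.
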
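1.4 Layer 4: Executor Layer (v1/executor/)

Responsibility: manage distributed worker processes and hide the differences between single-node, multi-node, and Ray deployments.

Executor class hierarchy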

[Figure: Executor class diagram.]

Executor (abstract)
  • get_class(vllm_config) : Executor
  • execute_model(scheduler_output) : ModelRunnerOutput
  • collective_rpc(method, args, kwargs) : list
  • initialize_from_config(kv_cache_configs)
  • determine_available_memory() : list<int>
  • shutdown()

Concrete executors shown in the diagram
  • UniProcExecutor: _init_executor(), collective_rpc()
  • MultiprocExecutor: _init_executor(), collective_rpc()
  • RayDistributedExecutor: _init_executor(), collective_rpc()
  • RayExecutorV2: _init_executor(), collective_rpc()
  • ExecutorWithExternalLauncher: _init_executor()

Factory method: Executor.get_class() selects the concrete implementation based on the distributed_executor_backend setting:

```python
# abstract.py L48-L92 (condensed)
@staticmethod
def get_class(vllm_config: VllmConfig) -> type["Executor"]:
    parallel_config = vllm_config.parallel_config
    distributed_executor_backend = parallel_config.distributed_executor_backend
    if isinstance(distributed_executor_backend, type):
        executor_class = distributed_executor_backend  # user-supplied class
    elif distributed_executor_backend == "ray":
        # Shorthand, not literal code: one of these two classes is selected.
        executor_class = RayExecutorV2 or RayDistributedExecutor
    elif distributed_executor_backend == "mp":
        executor_class = MultiprocExecutor
    elif distributed_executor_backend == "uni":
        executor_class = UniProcExecutor
    elif distributed_executor_backend == "external_launcher":
        executor_class = ExecutorWithExternalLauncher
    else:
        raise ValueError(
            f"Unknown distributed executor backend: {distributed_executor_backend}"
        )
    return executor_class
```
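
The same dispatch idea, together with `collective_rpc`, can be tried out in isolation. The sketch below is hypothetical throughout: `ToyExecutor`, `ToyWorker`, and the `_BACKENDS` registry are illustrative names, the workers live in the same process, and a dictionary lookup replaces the if/elif chain purely for brevity.

```python
from abc import ABC, abstractmethod


class ToyExecutor(ABC):
    """Hypothetical executor base class; not the vLLM Executor."""

    @abstractmethod
    def collective_rpc(self, method, args=()):
        """Call `method` on every worker and return one result per worker."""

    @staticmethod
    def get_class(backend):
        # A class is accepted as-is; a string is looked up in the registry.
        if isinstance(backend, type):
            if not issubclass(backend, ToyExecutor):
                raise TypeError(f"{backend!r} is not a ToyExecutor subclass")
            return backend
        try:
            return _BACKENDS[backend]
        except KeyError:
            raise ValueError(f"unknown executor backend: {backend!r}") from None


class ToyWorker:
    def __init__(self, rank):
        self.rank = rank

    def ping(self, payload):
        return f"rank {self.rank}: {payload}"


class ToyUniExecutor(ToyExecutor):
    def __init__(self):
        self.workers = [ToyWorker(0)]

    def collective_rpc(self, method, args=()):
        return [getattr(w, method)(*args) for w in self.workers]


class ToyMultiExecutor(ToyExecutor):
    def __init__(self, world_size=2):
        self.workers = [ToyWorker(r) for r in range(world_size)]

    def collective_rpc(self, method, args=()):
        return [getattr(w, method)(*args) for w in self.workers]


_BACKENDS = {"uni": ToyUniExecutor, "mp": ToyMultiExecutor}

executor = ToyExecutor.get_class("mp")()
replies = executor.collective_rpc("ping", args=("hello",))
print(replies)  # ['rank 0: hello', 'rank 1: hello']
```

Callers only see `collective_rpc`, so swapping the backend string changes how many workers answer without changing the calling code.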

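1.5 Layer 5: Model Executor (model_executor/)

Responsibility: load model weights, build the computation graph, and run the forward pass.

| Submodule | Responsibility |
| --- | --- |
| models/ | 150+ model architecture implementations (Llama/Qwen/Gemma/Mistral, etc.), registered through registry.py |
| models/interfaces.py | Model capability interfaces (supports_multimodal / supports_pp / is_attention_free, etc.) |
| kernels/ | Wrappers for custom CUDA/Triton kernels |
| layers/ | Common operator layers (Linear / Attention / RMSNorm, etc.) |
| warmup/ | Model warmup logic |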

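Worker-side model runner types

| Class | Purpose | File location |
| --- | --- | --- |
| GPUModelRunner | Main model execution logic on GPU | gpu/model_runner.py |
| CPUModelRunner | CPU inference | cpu_model_runner.py |
| XPUModelRunner | Intel XPU | xpu_model_runner.py |
| TPUModelRunner | Google TPU | tpu_input_batch.py |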

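1.6 Layer 6: Kernel Layer (csrc/ + kernels/)

Responsibility: high-performance GPU kernel implementations.

| Component | Technology | Purpose |
| --- | --- | --- |
| FlashAttention | flash-attn | Efficient attention computation |
| FlashInfer | flashinfer | Paged Attention / decode optimization |
| Cutlass MLA | CUTLASS | DeepSeek MLA attention |
| Triton Kernels | Triton | Custom fused kernels (attention / rms_norm / silu_mul) |
| Custom Ops | C++/CUDA | FP8 quantization, fused AllReduce, etc. |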
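
For readers unfamiliar with what the fused rms_norm and silu_mul kernels compute, the following pure-Python reference shows the arithmetic on a single vector. It assumes the usual definitions (RMS normalization with a learned weight, and a gated activation where the first half of the input is passed through SiLU and multiplied by the second half); the function names and the split-in-half layout are assumptions for illustration and say nothing about how the real kernels are written.

```python
import math


def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def silu_mul_reference(x):
    """Split x in half, apply SiLU to the first half, multiply by the second."""
    d = len(x) // 2
    return [silu(gate) * up for gate, up in zip(x[:d], x[d:])]


def rms_norm_reference(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


gated = silu_mul_reference([0.0, 1.0, 2.0, 3.0])
normed = rms_norm_reference([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
print(gated)
print(normed)
```

A fused kernel produces the same numbers in one pass over memory instead of materializing each intermediate list, which is where the speedup comes from.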

2. v0 → v1 Architecture Evolution

2.1 Evolution overview

[Figure: v0 → v1 architecture evolution.]

  • v0 architecture (deprecated): LLMEngine (monolithic class), Scheduler (synchronous scheduling), Worker (direct RPC)
  • v1 architecture (current): LLMEngine (compatibility facade, llm_engine.py), EngineCoreClient (IPC abstraction), EngineCore (decoupled core, core.py), EngineCoreProc (ZMQ process wrapper), SchedulerInterface (pluggable scheduling), Executor (distributed abstraction)
  • Edge labels: refactored; abstracted into an interface; unified as Executor; inproc / zmq; inherits

2.2 v0 LLMEngine → v1 wrapping strategy

Key finding: the current vllm/engine/llm_engine.py is nothing more than a re-export alias:

```python
# vllm/engine/llm_engine.py (entire file, only 6 lines)
from vllm.v1.engine.llm_engine import LLMEngine as V1LLMEngine

LLMEngine = V1LLMEngine  # type: ignore
"""The `LLMEngine` class is an alias of vllm.v1.engine.llm_engine.LLMEngine."""
```

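This means the v0 implementation has been removed entirely: the codebase no longer contains the old LLMEngine. Every caller that uses LLMEngine is in fact using the v1 version.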
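
What a re-export alias implies for callers can be checked with two throwaway in-memory modules. Everything here is hypothetical scaffolding: `toy_pkg_engine` and `toy_pkg_v1_engine` are made-up module names that only mimic the old-path and new-path layout.

```python
import sys
import types

# Build two throwaway modules in memory (hypothetical names, not vLLM modules).
new_mod = types.ModuleType("toy_pkg_v1_engine")
exec("class LLMEngine:\n    version = 'v1'\n", new_mod.__dict__)
sys.modules["toy_pkg_v1_engine"] = new_mod

old_mod = types.ModuleType("toy_pkg_engine")
exec(
    "from toy_pkg_v1_engine import LLMEngine as V1LLMEngine\n"
    "LLMEngine = V1LLMEngine\n",
    old_mod.__dict__,
)
sys.modules["toy_pkg_engine"] = old_mod

# Importing through either path yields the very same class object.
from toy_pkg_engine import LLMEngine as LegacyName
from toy_pkg_v1_engine import LLMEngine as NewName

same_object = LegacyName is NewName
print(same_object, LegacyName.__module__)  # True toy_pkg_v1_engine
```

Because both names are bound to one class object, `isinstance` checks and subclassing behave identically regardless of which import path a caller uses.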

2.3 EngineCore decoupled design

EngineCore is the core innovation of the v1 architecture. It decouples the engine as follows:

Initialization flow

[Figure: EngineCore initialization sequence. Participants: LLMEngine, EngineCoreClient, EngineCore, Executor, Scheduler.]

Messages, in the order they appear in the diagram:
  • make_client(multiprocess_mode)
  • init(vllm_config, executor_class)
  • executor_class(vllm_config)
  • determine_available_memory()
  • get_kv_cache_specs()
  • _initialize_kv_caches()
  • Scheduler(vllm_config, kv_cache_config)
  • initialize_from_config(kv_cache_configs)

Note in the diagram: EngineCore ready.

Core step loop

EngineCore.step() is the heartbeat of the engine:

```python
# core.py L402-L431
def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    """Schedule, execute, and make output.

    Returns tuple of outputs and a flag indicating whether the model
    was executed.
    """
    # 1. Return early if there are no requests to process
    if not self.scheduler.has_requests():
        return {}, False

    # 2. Scheduling decision: how many tokens each request processes
    scheduler_output = self.scheduler.schedule()

    # 3. Launch the model forward pass without blocking
    future = self.model_executor.execute_model(scheduler_output, non_block=True)

    # 4. Fetch the grammar bitmask for structured output
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)

    # 5. Wait for the model output and sample
    with self.log_error_detail(scheduler_output), \
         self.log_iteration_details(scheduler_output):
        model_output = future.result()
        if model_output is None:
            model_output = self.model_executor.sample_tokens(grammar_output)

    # 6. Process asynchronous aborts
    self._process_aborts_queue()

    # 7. Update scheduler state and build the outputs
    engine_core_outputs = self.scheduler.update_from_output(
        scheduler_output, model_output
    )

    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0
```
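
The schedule → execute → update cycle can be exercised end to end with toy components. This is a sketch under assumptions: `ToyScheduler`, `ToyModelExecutor`, and `toy_step` are hypothetical, a thread pool plays the role of the non-blocking executor, every request decodes exactly one token per step, and grammar bitmasks and abort handling are left out.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyScheduler:
    def __init__(self, requests):
        # req_id -> number of tokens still to generate
        self.remaining = dict(requests)

    def has_requests(self):
        return bool(self.remaining)

    def schedule(self):
        # One decode token per unfinished request.
        return {req_id: 1 for req_id in self.remaining}

    def update_from_output(self, scheduler_output, model_output):
        finished = []
        for req_id in scheduler_output:
            self.remaining[req_id] -= 1
            if self.remaining[req_id] == 0:
                del self.remaining[req_id]
                finished.append(req_id)
        return {"tokens": model_output, "finished": finished}


class ToyModelExecutor:
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def execute_model(self, scheduler_output):
        # Returns immediately with a Future, like a non-blocking call.
        return self._pool.submit(
            lambda: {req_id: f"tok-{req_id}" for req_id in scheduler_output}
        )

    def shutdown(self):
        self._pool.shutdown()


def toy_step(scheduler, model_executor):
    if not scheduler.has_requests():
        return {}, False
    scheduler_output = scheduler.schedule()
    future = model_executor.execute_model(scheduler_output)
    # CPU-side work could overlap with the forward pass here.
    model_output = future.result()
    outputs = scheduler.update_from_output(scheduler_output, model_output)
    return outputs, sum(scheduler_output.values()) > 0


scheduler = ToyScheduler({"a": 1, "b": 2})
model_executor = ToyModelExecutor()
history = []
while scheduler.has_requests():
    outputs, executed = toy_step(scheduler, model_executor)
    history.append(outputs["finished"])
idle = toy_step(scheduler, model_executor)
model_executor.shutdown()
print(history, idle)  # [['a'], ['b']] ({}, False)
```

Request "a" leaves the batch after one step while "b" keeps running, which is the continuous-batching behavior the loop is built around.
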
Pipeline Parallelism optimization: batch_queue

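When max_concurrent_batches > 1 (for example, with pipeline parallelism PP > 1), EngineCore uses step_with_batch_queue() instead of step():

  • A double-ended queue (deque) buffers multiple batches
  • Filling the batch queue takes priority, which removes pipeline bubbles
  • The loop blocks on the result of the oldest in-flight batch
  • Deferred sampling is supported (for structured output combined with speculative decoding)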
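
The "fill first, then block on the oldest" policy can be sketched with a deque of futures. The code is a simplified illustration, not the step_with_batch_queue() implementation: `toy_step_with_batch_queue` is hypothetical, a thread pool stands in for pipeline stages, and "executing" a batch just doubles its numbers.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def toy_step_with_batch_queue(pending, batch_queue, pool, max_concurrent_batches):
    """Submit one batch if there is room; otherwise wait for the oldest one.

    Returns the finished batch result, or None if a batch was only submitted.
    """
    if pending and len(batch_queue) < max_concurrent_batches:
        batch = pending.popleft()
        future = pool.submit(lambda b=batch: [x * 2 for x in b])
        batch_queue.appendleft((batch, future))  # newest on the left
        return None  # prefer filling the queue over blocking
    if not batch_queue:
        return None
    # Queue is full or nothing is left to submit: block on the oldest batch.
    batch, future = batch_queue.pop()  # oldest on the right
    return future.result()


pending = deque([[1], [2], [3]])
batch_queue = deque()
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    while pending or batch_queue:
        out = toy_step_with_batch_queue(pending, batch_queue, pool, 2)
        if out is not None:
            results.append(out)
print(results)  # [[2], [4], [6]]
```

With two slots, the second batch is submitted before the first result is awaited, so a pipeline stage that finished batch 1 always has batch 2 ready to pick up.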

2.4 v1 LLMEngine: a backward-compatible facade

The v1 LLMEngine serves as the public API facade. Its responsibilities include:

Composition pattern
```python
# llm_engine.py L90-L111 (condensed; "# ..." marks elided arguments)
class LLMEngine:
    def __init__(
        self, vllm_config, executor_class, log_stats,
        # ... remaining parameters elided
    ):
        # 1. Renderer: handles chat templates / multimodal inputs
        self.renderer = renderer_from_config(self.vllm_config)

        # 2. InputProcessor: EngineInput → EngineCoreRequest
        self.input_processor = InputProcessor(self.vllm_config, self.renderer)

        # 3. OutputProcessor: EngineCoreOutputs → RequestOutput
        self.output_processor = OutputProcessor(
            self.renderer.tokenizer,
            log_stats=self.log_stats,
            stream_interval=...,  # value elided
        )

        # 4. EngineCoreClient: reaches EngineCore in-process or over IPC,
        #    depending on multiprocess_mode
        self.engine_core = EngineCoreClient.make_client(
            multiprocess_mode=multiprocess_mode,
            asyncio_mode=False,
            # ... remaining arguments elided
        )
```
add_request pipeline
```python
# llm_engine.py L209-L285 (abridged)
def add_request(self, request_id, prompt, params, ...):
    # 1. Input preprocessing: EngineInput → EngineCoreRequest
    request = self.input_processor.process_inputs(
        request_id, prompt, params, ...
    )

    # 2. When n > 1, fan out into child requests (parallel sampling).
    #    n and child_params are derived from params; that code is elided here.
    if n > 1:
        for idx in range(n):
            child_request = copy(request)
            child_request.sampling_params = child_params
            self.output_processor.add_request(child_request, ...)
            self.engine_core.add_request(child_request)
    else:
        self.output_processor.add_request(request, ...)
        self.engine_core.add_request(request)
```
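The fan-out step can be sketched in isolation. ToyParams, ToyRequest, fan_out, and the child ID scheme are hypothetical and chosen only for illustration; the sketch assumes that each child is a shallow copy of the parent with its own ID and with n reset to 1.

```python
from copy import copy
from dataclasses import dataclass, replace


@dataclass
class ToyParams:
    n: int = 1              # number of samples requested for one prompt
    temperature: float = 1.0


@dataclass
class ToyRequest:
    request_id: str
    prompt_token_ids: list[int]
    params: ToyParams


def fan_out(request: ToyRequest) -> list[ToyRequest]:
    """Split a request with n > 1 into n single-sample child requests."""
    n = request.params.n
    if n == 1:
        return [request]
    children = []
    for idx in range(n):
        child = copy(request)  # shallow copy: the prompt token list is shared
        child.request_id = f"{idx}_{request.request_id}"  # hypothetical ID scheme
        child.params = replace(request.params, n=1)       # each child samples once
        children.append(child)
    return children
```

Because the copy is shallow, all children share one prompt token list, which keeps the fan-out cheap even for long prompts.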
step pipeline
```python
# llm_engine.py L287-L325 (abridged)
def step(self) -> list[RequestOutput | PoolingRequestOutput]:
    # 1. Fetch raw outputs from EngineCore
    outputs = self.engine_core.get_output()

    # 2. Post-processing: EngineCoreOutputs → RequestOutput
    processed_outputs = self.output_processor.process_outputs(
        outputs.outputs, engine_core_timestamp=outputs.timestamp, ...
    )

    # 3. Abort requests that finished on a stop string
    self.engine_core.abort_requests(processed_outputs.reqs_to_abort)

    # 4. Record statistics
    if self.logger_manager is not None:
        self.logger_manager.record(...)

    return processed_outputs.request_outputs
```

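2.5 Summary of v1 core improvements

| Dimension | v0 | v1 |
| --- | --- | --- |
| Core loop | Monolithic LLMEngine.step() | Decoupled into EngineCore.step() |
| Process model | Synchronous multiprocess | Selectable: inproc / multiproc / ZMQ |
| Scheduler | Hard-coded Scheduler | Pluggable SchedulerInterface |
| Executor | Fixed RayWorkerWrapper | Executor abstraction + factory method |
| Sequence representation | Sequence / SequenceGroup | Simplified to Request (no group concept) |
| Data serialization | pickle | msgspec Struct (zero-copy friendly) |
| Async scheduling | Not supported | AsyncScheduler (scheduling ∥ execution) |
| Data Parallel | Not supported | DPEngineCoreProc + DP Coordinator |
| Elastic EP | Not supported | Full scale-up and scale-down support |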

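2.6 Backward compatibility strategy

Although v0 has been removed, vLLM keeps its API compatible through the following mechanisms:

  1. Module-level alias: vllm.engine.llm_engine.LLMEngine points directly at the v1 implementation.
  2. Attribute passthrough: self.model_executor = self.engine_core.engine_core.model_executor (the v0-compatible access path; see llm_engine.py:L124).
  3. Preserved interface signatures: the signatures of add_request() / step() / abort_request() are unchanged.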
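The first two mechanisms are easy to picture with a toy example. The classes below are hypothetical stand-ins, not the real vLLM types; the sketch only shows how a module-level alias and a passthrough attribute keep old call sites working after the internals are restructured.

```python
class ToyCore:
    """Hypothetical stand-in for the inner engine core."""

    def __init__(self):
        self.model_executor = "executor"


class ToyCoreClient:
    """Hypothetical stand-in for the client that wraps the core."""

    def __init__(self):
        self.engine_core = ToyCore()


class V1LLMEngine:
    def __init__(self):
        self.engine_core = ToyCoreClient()
        # Attribute passthrough: old code that reads engine.model_executor
        # keeps working even though the executor now lives two levels down.
        self.model_executor = self.engine_core.engine_core.model_executor

    # Signatures are kept stable so existing callers do not need changes.
    def add_request(self, request_id: str, prompt: str) -> None:
        pass

    def step(self) -> list:
        return []

    def abort_request(self, request_ids: list[str]) -> None:
        pass


# Module-level alias: the legacy name resolves to the new implementation.
LLMEngine = V1LLMEngine
```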

3. Core Data Structures

3.1 Request lifecycle data structures

[State diagram: request lifecycle]
States: WAITING, WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR, WAITING_FOR_REMOTE_KVS, RUNNING, PREEMPTED, FINISHED_STOPPED, FINISHED_LENGTH_CAPPED, FINISHED_ABORTED, FINISHED_ERROR, FINISHED_REPETITION, FINISHED_IGNORED.
Transition labels: add_request(); use_structured_output; grammar compiled; need remote KV; KV received; scheduled; preempted (low priority / OOM); rescheduled; stop token / stop string; max_tokens reached; client abort; internal error; repetition detected; prompt too long.
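One compact way to encode such a lifecycle is an integer enum whose declaration order separates live states from terminal ones. The sketch below is an assumption about how this could be done, not a copy of vLLM's RequestStatus: ToyRequestStatus is a hypothetical name, and the member order is chosen so that every FINISHED_* state compares greater than PREEMPTED.

```python
import enum


class ToyRequestStatus(enum.IntEnum):
    # Live states: the request still holds or is waiting for resources.
    WAITING = enum.auto()
    WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR = enum.auto()
    WAITING_FOR_REMOTE_KVS = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Terminal states: everything declared after PREEMPTED is finished.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    FINISHED_ERROR = enum.auto()
    FINISHED_REPETITION = enum.auto()
    FINISHED_IGNORED = enum.auto()

    @staticmethod
    def is_finished(status: "ToyRequestStatus") -> bool:
        # A single integer comparison replaces a set-membership test.
        return status > ToyRequestStatus.PREEMPTED
```

The trade-off is that the declaration order becomes part of the contract: inserting a new live state after PREEMPTED would silently mark it as finished.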

3.2 Request (internal request representation)

Source: v1/request.py

```python
# request.py L59-L100 (abridged)
class Request:
    def __init__(
        self,
        request_id: str,
        prompt_token_ids: list[int] | None,
        sampling_params: SamplingParams | None,
        pooling_params: PoolingParams | None,
        client_index: int = 0,
        arrival_time: float | None = None,
        prompt_embeds: torch.Tensor | None = None,
        mm_features: list[MultiModalFeatureSpec] | None = None,
        lora_request: LoRARequest | None = None,
        block_hasher: Callable | None = None,
        ...
    ):
        self.request_id = request_id
        self.status = RequestStatus.WAITING          # initial state
        self.sampling_params = sampling_params
        self.prompt_token_ids = prompt_token_ids
        self._output_token_ids: list[int] = []       # generated tokens
        self.num_computed_tokens = 0                  # tokens computed so far
        self.block_hashes: list[BlockHash] = []       # prefix caching hashes
        self.events: list[EngineCoreEvent] = []      # event tracing
        self.num_preemptions = 0                      # times preempted
```

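Key field semantics

| Field | Type | Description |
| --- | --- | --- |
| status | RequestStatus | Request state machine (see the diagram above) |
| output_token_ids | ConstantList[int] | Read-only view; prevents external code from appending directly |
| all_token_ids | ConstantList[int] | Prompt tokens plus all generated tokens |
| num_output_tokens | int | Number of tokens generated so far |
| block_hashes | list[BlockHash] | Hash of each full block, used for prefix caching |
| is_prefill_chunk | bool | Whether the request is in a non-final prefill chunk |
| num_nans_in_logits | int | Number of NaNs in the logits (used to detect anomalies) |
| streaming_queue | deque | Continuation queue for streaming sessions |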
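The read-only view idea behind output_token_ids can be sketched with the standard library. ReadOnlyList and ToyRequest below are hypothetical and only similar in spirit to ConstantList: the view wraps the owner's list without copying it and exposes no mutating methods, so writes have to go through the owner.

```python
from collections.abc import Sequence


class ReadOnlyList(Sequence):
    """Hypothetical read-only view over a list owned by someone else."""

    def __init__(self, data: list[int]):
        self._data = data  # no copy: the view always reflects the owner's list

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self) -> int:
        return len(self._data)


class ToyRequest:
    def __init__(self, prompt_token_ids: list[int]):
        self.prompt_token_ids = prompt_token_ids
        self._output_token_ids: list[int] = []
        # Callers see the view; only the request itself touches the list.
        self.output_token_ids = ReadOnlyList(self._output_token_ids)

    def append_output_token(self, token_id: int) -> None:
        # The single sanctioned write path, where bookkeeping can be kept in sync.
        self._output_token_ids.append(token_id)

    @property
    def num_output_tokens(self) -> int:
        return len(self._output_token_ids)
```

Subclassing Sequence supplies iteration, containment checks, index(), and count() for free, while append() and friends are simply absent from the view.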

3.3 EngineInput → EngineCoreRequest → Request conversion chain

[Flowchart: input conversion chain]
User API layer: prompt='Hello', prompt_token_ids=1,2,3, or prompt_embeds=tensor(...)
→ Renderer.render_chat() → EngineInput (TypedDict): TokensInput (type='token', prompt_token_ids=[]) or EmbedsInput (type='embeds', prompt_embeds=tensor)
→ InputProcessor.process_inputs() → EngineCoreRequest (msgspec Struct): request_id, prompt_token_ids, sampling_params, mm_features, lora_request, priority, ...
→ Request.from_engine_core_request() → Request (internal), which adds status, _output_token_ids, block_hashes, events, num_computed_tokens, prefill_stats

EngineCoreRequest definition (engine/__init__.py):

```python
# __init__.py L80-L131
class EngineCoreRequest(
    msgspec.Struct,
    array_like=True,
    omit_defaults=True,
    gc=False,           # disable GC tracking (performance critical)
):
    request_id: str
    prompt_token_ids: list[int] | None
    mm_features: list[MultiModalFeatureSpec] | None
    sampling_params: SamplingParams | None
    pooling_params: PoolingParams | None
    arrival_time: float
    lora_request: LoRARequest | None
    cache_salt: str | None
    data_parallel_rank: int | None
    prompt_embeds: torch.Tensor | None = None
    prompt_is_token_ids: list[bool] | None = None
    client_index: int = 0
    current_wave: int = 0
    priority: int = 0
    trace_headers: Mapping[str, str] | None = None
    resumable: bool = False
    external_req_id: str | None = None
    reasoning_ended: bool | None = None
    reasoning_parser_kwargs: dict[str, Any] | None = None
```

Conversion function (request.py L186-L209):

```python
@classmethod
def from_engine_core_request(cls, request: EngineCoreRequest, block_hasher) -> "Request":
    return cls(
        request_id=request.request_id,
        client_index=request.client_index,
        prompt_token_ids=request.prompt_token_ids,
        prompt_embeds=request.prompt_embeds,
        mm_features=request.mm_features,
        sampling_params=request.sampling_params,
        pooling_params=request.pooling_params,
        arrival_time=request.arrival_time,
        lora_request=request.lora_request,
        block_hasher=block_hasher,
        resumable=request.resumable,
        ...
    )
```
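The split between a wire-level request and a scheduler-side request can be sketched with plain dataclasses. WireRequest and SchedulerRequest are hypothetical names and use the standard library rather than msgspec; the sketch assumes only what the conversion above shows, namely that immutable inputs are copied across while runtime state starts from defaults.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WireRequest:
    """Hypothetical stand-in for the serializable request sent across processes."""
    request_id: str
    prompt_token_ids: list[int]
    arrival_time: float
    priority: int = 0


@dataclass
class SchedulerRequest:
    """Hypothetical stand-in for the scheduler-side request with runtime state."""
    request_id: str
    prompt_token_ids: list[int]
    arrival_time: float
    priority: int = 0
    # Runtime-only fields: never serialized, always start from defaults.
    status: str = "WAITING"
    num_computed_tokens: int = 0
    output_token_ids: list[int] = field(default_factory=list)

    @classmethod
    def from_wire(cls, wire: WireRequest) -> "SchedulerRequest":
        # Copy the inputs field by field; runtime fields keep their defaults.
        return cls(
            request_id=wire.request_id,
            prompt_token_ids=wire.prompt_token_ids,
            arrival_time=wire.arrival_time,
            priority=wire.priority,
        )
```

Keeping the wire type free of runtime state means the serialized payload stays small and the scheduler is free to add bookkeeping fields without changing the wire format.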

3.4 Output data structures

Source: v1/outputs.py
[Class diagram: output data structures]
Classes: SamplerOutput, ModelRunnerOutput, EngineCoreOutput, EngineCoreOutputs, connected by edges labeled sample_tokens(), scheduler.update_from_output(), and "aggregates".
EngineCoreOutputs fields: outputs: list[EngineCoreOutput], scheduler_stats: SchedulerStats, timestamp: float, finished_requests: set[str], utility_output: UtilityOutput.

Key data structures in detail

SamplerOutput (outputs.py L118-L124):

```python
@dataclass
class SamplerOutput:
    sampled_token_ids: torch.Tensor     # [num_reqs, max_num_generated_tokens]
    logprobs_tensors: LogprobsTensors | None  # logprob information
```

ModelRunnerOutput (outputs.py L166-L206):

```python
@dataclass
class ModelRunnerOutput:
    req_ids: list[str]                    # request IDs
    req_id_to_index: dict[str, int]       # ID → index mapping
    sampled_token_ids: list[list[int]]     # token IDs generated for each request
    logprobs: LogprobsLists | None         # log probabilities
    prompt_logprobs_dict: dict             # logprobs from the prompt phase
    pooler_output: list[Tensor | None]     # pooling model output
    kv_connector_output: KVConnectorOutput | None  # KV transfer information
    num_nans_in_logits: dict | None        # NaN detection
```

EngineCoreOutput (engine/__init__.py L161-L191):

```python
class EngineCoreOutput(msgspec.Struct, array_like=True, omit_defaults=True, gc=False):
    request_id: str
    new_token_ids: list[int]
    new_logprobs: LogprobsLists | None
    finish_reason: FinishReason | None   # STOP/LENGTH/ABORT/ERROR/REPETITION
    events: list[EngineCoreEvent] | None
    prefill_stats: PrefillStats | None
```
3.5 FinishReason state machine

```python
# engine/__init__.py L42-L64
class FinishReason(enum.IntEnum):
    STOP = 0        # stop string / stop token emitted
    LENGTH = 1      # max_tokens consumed or max_model_len reached
    ABORT = 2       # aborted by client
    ERROR = 3       # retryable internal error (→ 500)
    REPETITION = 4  # repetitive pattern (hallucination)
```

4. Design Principles

4.1 Configuration-Driven

vLLM drives all of its behavior through 28+ configuration modules, with the core configuration aggregated in VllmConfig:
[Diagram: VllmConfig as the aggregate root]
VllmConfig aggregates the following sub-configs:

  • ModelConfig: model
  • CacheConfig: KV cache
  • ParallelConfig: parallelism
  • SchedulerConfig: scheduling
  • DeviceConfig: device
  • LoadConfig: loading
  • AttentionConfig: attention
  • KernelConfig: kernels
  • CompilationConfig: compilation
  • SpeculativeConfig: speculative decoding
  • ObservabilityConfig: observability
  • QuantizationConfig: quantization
  • ReasoningConfig: reasoning
  • StructuredOutputsConfig: structured output
  • KVTransferConfig: KV transfer
  • LoRAConfig: LoRA
  • OffloadConfig: offload
  • ProfilerConfig: profiler
  • MambaConfig: Mamba
  • MultimodalConfig: multimodal

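VllmConfig.__post_init__() (vllm.py L758-L1401) performs extensive cross-validation and default derivation at initialization:

  • Applying the optimization level (O0/O1/O2/O3)
  • Deciding whether to enable async scheduling automatically
  • Computing the cudagraph capture sizes
  • Deriving the SP (Sequence Parallelism) threshold
  • Applying platform-specific defaults
  • Checking KV transfer compatibility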

Optimization level system

```python
# vllm.py L68-L265
class OptimizationLevel(IntEnum):
    O0 = 0   # No optimization; fastest startup
    O1 = 1   # Dynamo+Inductor + Piecewise CUDAGraph
    O2 = 2   # Full + Piecewise CUDAGraph (default)
    O3 = 3   # O2 + FlashInfer autotune
```

4.2 Registry Pattern

Model registry

model_executor/models/registry.py maintains a registry of 150+ model architectures:

```python
# registry.py L70-L221 (abridged)
_TEXT_GENERATION_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    "Qwen3ForCausalLM": ("qwen3", "Qwen3ForCausalLM"),
    "DeepseekV3ForCausalLM": ("deepseek_v2", "DeepseekV3ForCausalLM"),
    "Gemma4ForCausalLM": ("gemma4", "Gemma4ForCausalLM"),
    # ... many more entries
}

_EMBEDDING_MODELS = { ... }       # Embedding models
_MULTIMODAL_MODELS = { ... }      # Multimodal models
_SPECULATIVE_DECODING_MODELS = {...}  # Speculative decoding models

_VLLM_MODELS = {
    **_TEXT_GENERATION_MODELS,
    **_EMBEDDING_MODELS,
    **_MULTIMODAL_MODELS,
    **_SPECULATIVE_DECODING_MODELS,
    ...
}

ModelRegistry = _ModelRegistry({
    model_arch: _LazyRegisteredModel(   # Lazy loading: avoids initializing CUDA at import time
        module_name=f"vllm.model_executor.models.{mod_relname}",
        class_name=cls_name,
    )
    for model_arch, (mod_relname, cls_name) in _VLLM_MODELS.items()
})
```

Key design: _LazyRegisteredModel defers the import until the model is actually needed, which avoids initializing a CUDA context in processes that do not use the GPU.
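The lazy-registration idea can be reproduced with importlib alone. In this sketch LazyEntry and REGISTRY are hypothetical names, and two standard-library classes stand in for model classes; the point is that building the registry stores only strings, and the import runs when load() is called.

```python
import importlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyEntry:
    """Holds the location of a class without importing its module."""
    module_name: str
    class_name: str

    def load(self) -> type:
        # The import happens here, on first use, not at registration time.
        module = importlib.import_module(self.module_name)
        return getattr(module, self.class_name)


# Hypothetical registry contents: name -> (module, class).
_ENTRIES = {
    "Fraction": ("fractions", "Fraction"),
    "Decimal": ("decimal", "Decimal"),
}

REGISTRY = {
    name: LazyEntry(module_name=module_name, class_name=class_name)
    for name, (module_name, class_name) in _ENTRIES.items()
}
```

Calling REGISTRY["Fraction"].load() returns the class itself, which can then be instantiated as usual. Repeated calls are cheap because importlib.import_module() returns the already-loaded module from its cache.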

Multimodal processor registry

multimodal/registry.py registers processors through a decorator:

```python
# registry.py L142-L174 (abridged)
class MultiModalRegistry:
    def register_processor(
        self,
        processor: MultiModalFactory[_I],
        *,
        info: ProcessingInfoFactory[_I],
        dummy_inputs: DummyInputsBuilderFactory[_I],
    ):
        def wrapper(model_cls: N) -> N:
            model_cls._processor_factory = _ProcessorFactories(
                info=info, dummy_inputs=dummy_inputs, processor=processor,
            )
            return model_cls
        return wrapper

# Usage example (in a model file):
# @MULTIMODAL_REGISTRY.register_processor(MyProcessor.factory, ...)
# class MyModel(nn.Module): ...
```
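A runnable miniature of the same decorator pattern is shown below. ToyRegistry, create_processor, and the use of str.upper as a "processor" are hypothetical; the sketch assumes only what the excerpt shows, namely that the decorator attaches the factories to the model class and returns the class unchanged.

```python
from typing import Callable, TypeVar

T = TypeVar("T", bound=type)


class ToyRegistry:
    def register_processor(
        self, processor: Callable[[str], str], *, info: str
    ) -> Callable[[T], T]:
        def wrapper(model_cls: T) -> T:
            # Attach the factories to the class itself rather than to a
            # global dict, so lookup needs nothing but the class.
            model_cls._processor_factory = {"processor": processor, "info": info}
            return model_cls

        return wrapper

    def create_processor(self, model_cls: type) -> Callable[[str], str]:
        # Hypothetical lookup helper: read back what the decorator stored.
        return model_cls._processor_factory["processor"]


TOY_REGISTRY = ToyRegistry()


@TOY_REGISTRY.register_processor(str.upper, info="uppercases the prompt")
class ToyModel:
    pass
```

Because the registration travels with the class, a subclass of ToyModel inherits the processor unless it is decorated again with its own.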

4.3 Strategy Pattern

Attention backend selector

v1/attention/selector.py picks the best attention backend based on runtime parameters:
Backend selection flow:

get_attn_backend() (selector.py L53-L103) → build AttentionSelectorConfig → @cache decorator → _cached_get_attn_backend() → current_platform.get_attn_backend_cls() → AttentionBackend subclass

AttentionBackend subclasses shown in the diagram:

  • FlashAttention
  • FlashInfer
  • TritonAttention
  • ROCm AITER
  • MLA Backends (DeepSeek)
  • Mamba Attention

Selection parameter space (selector.py L22-L50):

class AttentionSelectorConfig(NamedTuple):
    head_size: int
    dtype: torch.dtype
    kv_cache_dtype: CacheDType | None
    block_size: int | None
    use_mla: bool = False              # DeepSeek MLA
    has_sink: bool = False              # Sink attention
    use_sparse: bool = False            # Sparse attention
    use_mm_prefix: bool = False         # Multimodal prefix
    use_per_head_quant_scales: bool = False
    attn_type: str = AttentionType.DECODER
    use_non_causal: bool = False
    use_batch_invariant: bool = False   # Batch invariant mode

Quantization backend selection

Quantization configuration, through QuantizationConfig and its subclasses, drives the choice among quantization backends:

  • FP8: Fp8Config → Fp8Linear
  • GPTQ: GptqConfig → GptqMarlinLinear
  • AWQ: AwqConfig → AwqLinear
  • NVFP4: Nvfp4Config → Nvfp4Linear

4.4 Factory Pattern

Executor factory

As described in Section 4.1, Executor.get_class() is a classic factory method:

# abstract.py L48-L92
@staticmethod
def get_class(vllm_config: VllmConfig) -> type["Executor"]:
    backend = vllm_config.parallel_config.distributed_executor_backend
    match backend:
        # Shorthand: one of the two Ray executors is returned.
        case "ray": return RayExecutorV2 or RayDistributedExecutor
        case "mp": return MultiprocExecutor
        case "uni": return UniProcExecutor
        case "external_launcher": return ExecutorWithExternalLauncher
        case type(): return backend  # user-defined Executor subclass
        case str(): return resolve_obj_by_qualname(backend)  # load by qualified name

EngineCoreClient factory

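make_client() in core_client.py creates different client implementations depending on multiprocess_mode and asyncio_mode:

  • inproc mode: holds a direct reference to EngineCore
  • multiproc mode: communicates with EngineCoreProc over a ZMQ socket
  • asyncio mode: the asynchronous version of the client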
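
A two-flag factory of this kind might look like the sketch below. The client classes and `make_toy_client` are hypothetical. In particular, rejecting asyncio mode without multiprocessing is an assumption made for this example, so check core_client.py for how that combination is actually handled.

```python
class ToyInprocClient:
    """Would hold a direct reference to the engine core."""
    mode = "inproc"


class ToySyncMPClient:
    """Would talk to a separate engine process over a socket."""
    mode = "multiproc-sync"


class ToyAsyncMPClient:
    """Asynchronous variant of the multiprocess client."""
    mode = "multiproc-async"


def make_toy_client(multiprocess_mode: bool, asyncio_mode: bool):
    """Choose a client implementation from the two mode flags."""
    if asyncio_mode and not multiprocess_mode:
        # Assumption of this sketch: this combination is not supported.
        raise NotImplementedError("asyncio mode requires multiprocess mode here")
    if multiprocess_mode:
        return ToyAsyncMPClient() if asyncio_mode else ToySyncMPClient()
    return ToyInprocClient()


print(make_toy_client(False, False).mode, make_toy_client(True, True).mode)
```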

KV Cache Offload factory

kv_offload/factory.py selects a KV offload backend based on kv_transfer_config.kv_connector:

  • "OffloadingConnector": CPU offloading
  • "LMCacheConnectorV1": LMCache integration
  • NIXL Connector: RDMA-based transfer

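4.5 Other Important Design Patterns

| Pattern | Use case | Location |
| --- | --- | --- |
| Observer | Stats logging / metrics collection | metrics/ |
| Adapter | Differences between platforms (CUDA/ROCm/XPU/TPU) | platforms/ |
| Decorator | Compilation pass injection / tracing | compilation/ |
| Prototype | Dummy input construction | multimodal/processing/ |
| Command | Remote invocation of utility methods | EngineCoreProc._handle_client_request() |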

5. Key Interaction Sequences

5.1 Full Inference Request Lifecycle

Sequence diagram. Participants, left to right: Client, OpenAI API, LLMEngine, InputProcessor, EngineCore, Scheduler, Executor, ModelRunner, OutputProcessor. The diagram contains a loop block labeled "every iteration". Messages, in order:

  1. POST /v1/chat/completions
  2. add_request(prompt, params)
  3. process_inputs(prompt, params)
  4. EngineCoreRequest
  5. add_request(request)
  6. add_request(Request)
  7. schedule()
  8. SchedulerOutput
  9. execute_model(SchedulerOutput)
  10. forward(batch)
  11. hidden_states
  12. sample_tokens(grammar_output)
  13. sampled_token_ids
  14. ModelRunnerOutput
  15. update_from_output(output)
  16. EngineCoreOutputs
  17. EngineCoreOutputs
  18. process_outputs(outputs)
  19. RequestOutput[]
  20. stream chunks / final response
  21. JSON response
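
The core of this lifecycle is the repeated schedule → execute → update cycle. The following toy loop sketches that cycle under heavy simplification: `ToyRequest`, `ToyScheduler`, `toy_execute_model`, and `step` are hypothetical, the "model" just emits the current sequence length as the next token, and there is no KV cache, chunked prefill, or preemption.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    prompt_len: int
    max_new_tokens: int
    output_tokens: list[int] = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_tokens) >= self.max_new_tokens


class ToyScheduler:
    def __init__(self, max_running: int) -> None:
        self.max_running = max_running
        self.waiting: deque[ToyRequest] = deque()
        self.running: list[ToyRequest] = []

    def add_request(self, request: ToyRequest) -> None:
        self.waiting.append(request)

    def schedule(self) -> list[ToyRequest]:
        # Admit waiting requests whenever a slot is free (continuous batching).
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def update_from_output(self, sampled: dict[str, int]) -> list[str]:
        # Append one sampled token per running request, then retire finished ones.
        finished_ids = []
        for request in self.running:
            request.output_tokens.append(sampled[request.request_id])
            if request.finished:
                finished_ids.append(request.request_id)
        self.running = [r for r in self.running if not r.finished]
        return finished_ids


def toy_execute_model(batch: list[ToyRequest]) -> dict[str, int]:
    # Stand-in for forward + sampling: the "token" is the current sequence length.
    return {r.request_id: r.prompt_len + len(r.output_tokens) for r in batch}


def step(scheduler: ToyScheduler) -> list[str]:
    """One iteration: schedule, execute, update. Returns ids finished this step."""
    batch = scheduler.schedule()
    if not batch:
        return []
    return scheduler.update_from_output(toy_execute_model(batch))


scheduler = ToyScheduler(max_running=2)
scheduler.add_request(ToyRequest("a", prompt_len=3, max_new_tokens=1))
scheduler.add_request(ToyRequest("b", prompt_len=5, max_new_tokens=2))
scheduler.add_request(ToyRequest("c", prompt_len=2, max_new_tokens=1))

finish_order = []
while scheduler.waiting or scheduler.running:
    finish_order.append(step(scheduler))
print(finish_order)  # [['a'], ['b', 'c']]
```

Request "c" joins the batch in the second step, as soon as "a" frees a slot, without waiting for "b" to finish. That is the behavior the per-iteration loop in the diagram makes possible.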

5.2 IPC Flow in the Multi-Process Architecture

Sequence diagram. Participants, left to right: frontend process (API Server), EngineCoreClient, InputSocket Thread, input_queue, EngineCore (Busy Loop), OutputSocket Thread, output_queue. Messages, in order:

  1. add_request(req)
  2. ZMQ SEND (msgpack)
  3. put((ADD, Request))
  4. get() ← polled by the busy loop
  5. step() → schedule → execute → update
  6. put((client_idx, EngineCoreOutputs))
  7. ZMQ SEND (msgpack)
  8. recv() → get_output()
  9. EngineCoreOutputs
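
The thread-and-queue structure of this flow can be mimicked with the standard library alone. The sketch below swaps in substitutes and says so plainly: `queue.Queue` objects stand in for the ZMQ sockets, JSON stands in for msgpack, and the engine loop uses a blocking `get()` where the diagram shows busy-loop polling. All function names and the `STOP` sentinel are hypothetical.

```python
import json
import queue
import threading

STOP = object()  # hypothetical shutdown sentinel, forwarded through every stage


def input_socket_thread(wire_in: queue.Queue, input_queue: queue.Queue) -> None:
    """Decode frames from the 'socket' and enqueue (type, request) tuples."""
    while (frame := wire_in.get()) is not STOP:
        input_queue.put(("ADD", json.loads(frame)))
    input_queue.put(STOP)


def engine_core_loop(input_queue: queue.Queue, output_queue: queue.Queue) -> None:
    """Take requests, run a stand-in for step(), and emit outputs."""
    while (item := input_queue.get()) is not STOP:
        _request_type, request = item
        outputs = {
            "request_id": request["request_id"],
            "num_prompt_chars": len(request["prompt"]),  # fake "result"
        }
        output_queue.put((request["client_idx"], outputs))
    output_queue.put(STOP)


def output_socket_thread(output_queue: queue.Queue, wire_out: queue.Queue) -> None:
    """Encode outputs and send them back over the 'socket'."""
    while (item := output_queue.get()) is not STOP:
        client_idx, outputs = item
        frame = json.dumps({"client_idx": client_idx, "outputs": outputs})
        wire_out.put(frame.encode())
    wire_out.put(STOP)


def run_pipeline(requests: list[dict]) -> list[dict]:
    """Play the client role: send all requests, then collect all outputs."""
    wire_in, input_queue, output_queue, wire_out = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=input_socket_thread, args=(wire_in, input_queue)),
        threading.Thread(target=engine_core_loop, args=(input_queue, output_queue)),
        threading.Thread(target=output_socket_thread, args=(output_queue, wire_out)),
    ]
    for thread in threads:
        thread.start()
    for request in requests:
        wire_in.put(json.dumps(request).encode())
    wire_in.put(STOP)
    results = []
    while (frame := wire_out.get()) is not STOP:
        results.append(json.loads(frame))
    for thread in threads:
        thread.join()
    return results


results = run_pipeline([{"request_id": "r0", "client_idx": 0, "prompt": "hello"}])
print(results)
```

Keeping serialization in the socket threads means the engine loop only ever touches in-memory queues, so encoding and decoding do not sit inside the compute loop.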


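6. Directory Quick Reference

| Directory | Layer | Core files | One-line description |
| --- | --- | --- | --- |
| entrypoints/ | L1 | openai/api_server.py | HTTP API entry point |
| engine/ | L2 | llm_engine.py | Alias re-export to v1 |
| v1/engine/ | L2 | llm_engine.py, core.py | Engine facade + core loop |
| v1/core/sched/ | L3 | scheduler.py, interface.py | Scheduling algorithm |
| v1/executor/ | L4 | abstract.py, uniproc_executor.py | Distributed execution |
| v1/worker/ | L4-L5 | gpu/model_runner.py, worker_base.py | Worker abstraction and GPU execution |
| model_executor/ | L5 | models/registry.py, models/llama.py | Model loading and execution |
| v1/attention/ | L5-L6 | selector.py, backends/ | Attention backend selection |
| kernels/ | L6 | vllm_c.py | Custom ops entry point |
| config/ | Cross-cutting | vllm.py | Global configuration aggregation |
| multimodal/ | Cross-cutting | registry.py | Multimodal processor registration |
| distributed/ | Cross-cutting | parallel_state.py | Distributed communication primitives |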

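7. Further Reading

After reading this document, it is recommended to dig into the source in the following order:

  1. EngineCore.step(): understand the complete flow of a single execution step
  2. Scheduler.schedule(): understand the details of the scheduling algorithm (chunked prefill / continuous batching / preempt)
  3. GPUModelRunner.execute_model(): understand the full path of the model forward pass
  4. InputProcessor.process_inputs(): understand the complete conversion from an HTTP request to an EngineCoreRequest
  5. AttentionSelectorConfig: understand the strategy-based selection mechanism for attention backends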