09-vLLM KV Cache 系统完整分析

vLLM KV Cache 系统完整分析

定位:本文档系统性地分析 vLLM V1 架构下的 KV Cache 全栈实现,涵盖缓存管理层次、多 GPU 协调、接口定义、CPU 卸载、跨节点传输、Block Table 管理及前缀缓存机制。
#mermaid-svg-fZLh2Ca3p4D4jSXu{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-fZLh2Ca3p4D4jSXu .error-icon{fill:#552222;}#mermaid-svg-fZLh2Ca3p4D4jSXu .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-fZLh2Ca3p4D4jSXu .marker{fill:#333333;stroke:#333333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .marker.cross{stroke:#333333;}#mermaid-svg-fZLh2Ca3p4D4jSXu svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-fZLh2Ca3p4D4jSXu p{margin:0;}#mermaid-svg-fZLh2Ca3p4D4jSXu .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .cluster-label text{fill:#333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .cluster-label span{color:#333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .cluster-label span p{background-color:transparent;}#mermaid-svg-fZLh2Ca3p4D4jSXu .label text,#mermaid-svg-fZLh2Ca3p4D4jSXu span{fill:#333;color:#333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .node rect,#mermaid-svg-fZLh2Ca3p4D4jSXu .node circle,#mermaid-svg-fZLh2Ca3p4D4jSXu .node ellipse,#mermaid-svg-fZLh2Ca3p4D4jSXu .node polygon,#mermaid-svg-fZLh2Ca3p4D4jSXu .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-fZLh2Ca3p4D4jSXu .rough-node .label text,#mermaid-svg-fZLh2Ca3p4D4jSXu .node .label text,#mermaid-svg-fZLh2Ca3p4D4jSXu .image-shape .label,#mermaid-svg-fZLh2Ca3p4D4jSXu .icon-shape .label{text-anchor:middle;}#mermaid-svg-fZLh2Ca3p4D4jSXu .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-fZLh2Ca3p4D4jSXu .rough-node .label,#mermaid-svg-fZLh2Ca3p4D4jSXu .node .label,#mermaid-svg-fZLh2Ca3p4D4jSXu .image-shape .label,#mermaid-svg-fZLh2Ca3p4D4jSXu .icon-shape .label{text-align:center;}#mermaid-svg-fZLh2Ca3p4D4jSXu .node.clickable{cursor:pointer;}#mermaid-svg-fZLh2Ca3p4D4jSXu .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .arrowheadPath{fill:#333333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-fZLh2Ca3p4D4jSXu .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-fZLh2Ca3p4D4jSXu .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-fZLh2Ca3p4D4jSXu .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-fZLh2Ca3p4D4jSXu .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-fZLh2Ca3p4D4jSXu .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-fZLh2Ca3p4D4jSXu .cluster text{fill:#333;}#mermaid-svg-fZLh2Ca3p4D4jSXu .cluster span{color:#333;}#mermaid-svg-fZLh2Ca3p4D4jSXu div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-fZLh2Ca3p4D4jSXu .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-fZLh2Ca3p4D4jSXu rect.text{fill:none;stroke-width:0;}#mermaid-svg-fZLh2Ca3p4D4jSXu .icon-shape,#mermaid-svg-fZLh2Ca3p4D4jSXu .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-fZLh2Ca3p4D4jSXu .icon-shape p,#mermaid-svg-fZLh2Ca3p4D4jSXu .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-fZLh2Ca3p4D4jSXu .icon-shape .label rect,#mermaid-svg-fZLh2Ca3p4D4jSXu .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-fZLh2Ca3p4D4jSXu .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-fZLh2Ca3p4D4jSXu .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-fZLh2Ca3p4D4jSXu :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 扩展能力
协调与接口
缓存管理器层次
KVCacheManager

顶层抽象管理器
KVCacheCoordinator

多 Group 协调器
SingleTypeKVCacheManager

单类型基类
FullAttentionManager
SlidingWindowManager
ChunkedLocalAttentionManager
MambaManager
CrossAttentionManager
SinkFullAttentionManager
UnitaryKVCacheCoordinator

(单 Group)
HybridKVCacheCoordinator

(混合模型)
NoPrefixCacheCoordinator

(无前缀缓存)
KVCacheInterface

统一 API 定义
KVCacheConfig / KVCacheSpec

数据结构
KV Offload CPU 卸载

OffloadingManager
Simple KV Offload

轻量级 CPU 卸载
KV Transfer 跨节点传输

KVConnectorBase_V1
Block Table 管理

MultiGroupBlockTable
KV Cache Metrics

统计与监控
EncoderCacheManager

编码器输出缓存


目录

  • [一、KV Cache Manager 层次](#一、KV Cache Manager 层次)
    • [1.1 KVCacheManager --- 顶层抽象](#1.1 KVCacheManager — 顶层抽象)
    • [1.2 SingleTypeKVCacheManager --- 单类型基类](#1.2 SingleTypeKVCacheManager — 单类型基类)
    • [1.3 具体管理器子类](#1.3 具体管理器子类)
    • [1.4 EncoderCacheManager --- 编码器缓存](#1.4 EncoderCacheManager — 编码器缓存)
    • [1.5 层次关系总览](#1.5 层次关系总览)
  • [二、KV Cache Coordinator 多 GPU 缓存协调](#二、KV Cache Coordinator 多 GPU 缓存协调)
  • [三、KV Cache 接口定义](#三、KV Cache 接口定义)
  • [四、KV Cache Metrics 缓存使用统计](#四、KV Cache Metrics 缓存使用统计)
  • [五、KV Offload CPU 卸载](#五、KV Offload CPU 卸载)
  • [六、Simple KV Offload](#六、Simple KV Offload)
  • [七、KV Transfer 跨节点传输](#七、KV Transfer 跨节点传输)
  • [八、Block Table 管理](#八、Block Table 管理)
  • [九、前缀缓存 Prefix Caching](#九、前缀缓存 Prefix Caching)
  • 十、数据流全景图

一、KV Cache Manager 层次

1.1 KVCacheManager --- 顶层抽象

KVCacheManager(file:///workspace/vllm/v1/core/kv_cache_manager.py#L106-L541) 是调度器 (Scheduler) 与底层缓存系统的唯一交互入口。它隐藏了内部数据结构细节,通过 KVCacheBlocks 数据类与 Scheduler 通信。

核心职责:

方法 功能 关键逻辑
get_computed_blocks() 查询前缀缓存命中块 委托给 coordinator.find_longest_cache_hit()
allocate_slots() 为请求分配新 slot 三阶段:释放跳过块 → 分配已计算块 → 分配新块
free() 释放请求的所有块 反序释放,尾部优先
cache_blocks() 将计算完成的块标记为可复用 仅缓存 num_tokens 以内的"最终确定"token

allocate_slots() 的 Blocks 布局(源码注释原文):

复制代码
----------------------------------------------------------------------
| < comp > | < new_comp > | < ext_comp >  | < new >  | < lookahead > |
----------------------------------------------------------------------
                                                  |   < to be computed >     |
----------------------------------------------------------------------
                                  |            < to be allocated >           |
----------------------------------------------------------------------
                                  | < to be cached (roughly, |
                                  | details below)>          |
----------------------------------------------------------------------

初始化流程 (kv_cache_manager.py:107-160(file:///workspace/vllm/v1/core/kv_cache_manager.py#L107-L160)):

python 复制代码
class KVCacheManager:
    def __init__(self, kv_cache_config, max_model_len, hash_block_size, ...):
        self.coordinator = get_kv_cache_coordinator(
            kv_cache_config=kv_cache_config,
            max_model_len=self.max_model_len,
            ...
        )
        self.block_pool = self.coordinator.block_pool
        self.num_kv_cache_groups = len(kv_cache_config.kv_cache_groups)
        # 预构建空 KVCacheBlocks,避免 GC 开销
        self.empty_kv_cache_blocks = KVCacheBlocks(
            tuple(() for _ in range(self.num_kv_cache_groups))
        )

KVCacheBlocks 数据结构 (kv_cache_manager.py:22-103(file:///workspace/vllm/v1/core/kv_cache_manager.py#L22-L103)):

python 复制代码
@dataclass
class KVCacheBlocks:
    """分配结果,作为 Scheduler 和 KVCacheManager 的接口。"""
    blocks: tuple[Sequence[KVCacheBlock], ...]
    # blocks[i][j] 表示第 i 个 kv_cache_group 的第 j 个 block

    def get_block_ids(self, allow_none=False):
        """转换为 block_id 列表"""

1.2 SingleTypeKVCacheManager --- 单类型基类

SingleTypeKVCacheManager(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L30-L443) 是一个抽象基类,处理单一注意力类型的 KV cache 管理逻辑。

核心状态:

python 复制代码
class SingleTypeKVCacheManager(ABC):
    def __init__(self, kv_cache_spec, block_pool, enable_caching, ...):
        self.block_size = kv_cache_spec.block_size
        self.kv_cache_spec = kv_cache_spec
        self.block_pool = block_pool
        # request_id -> 已分配的 blocks 列表
        self.req_to_blocks: defaultdict[str, list[KVCacheBlock]] = defaultdict(list)
        # request_id -> 已缓存的 block 数量
        self.num_cached_block: dict[str, int] = {}

关键方法签名:

方法 说明
get_num_blocks_to_allocate() 计算需要分配的 block 数量(考虑 admission cap)
allocate_new_computed_blocks() 将前缀缓存命中的块追加到请求
allocate_new_blocks() 为待计算 token 分配新 block
cache_blocks() 将满 block 标记为可复用
free() 释放请求的 blocks(反序)
find_longest_cache_hit() (抽象) 查找最长公共前缀缓存命中
remove_skipped_blocks() 移除注意力窗口外的块
get_num_skipped_tokens() (虚方法) 计算需跳过的 token 数

admission cap 机制 (single_type_kv_cache_manager.py:88-167(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L88-L167)):

对于 SlidingWindow 和 ChunkedLocalAttention 这类回收型 spec ,运行时会对每个请求的 block 预留数量施加上限 (_max_admission_blocks_per_request),防止 chunked prefill 期间过度分配导致死锁:

python 复制代码
def get_num_blocks_to_allocate(self, ..., apply_admission_cap=False):
    num_required_blocks = cdiv(num_tokens, self.block_size)
    if apply_admission_cap and self._max_admission_blocks_per_request is not None:
        num_required_blocks = min(
            num_required_blocks, self._max_admission_blocks_per_request
        )

1.3 具体管理器子类

FullAttentionManager

FullAttentionManager(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L446-L504) 用于全注意力层(含 ChunkedLocalAttention)。

find_longest_cache_hit 实现 --- 从左到右顺序扫描 block hash 链:

python 复制代码
@classmethod
def find_longest_cache_hit(cls, block_hashes, max_length, ...):
    computed_blocks = tuple([] for _ in range(len(kv_cache_group_ids)))
    for block_hash in itertools.islice(block_hashes, max_num_blocks):
        if cached_block := block_pool.get_cached_block(block_hash, kv_cache_group_ids):
            for computed, cached in zip(computed_blocks, cached_block):
                computed.append(cached)
        else:
            break  # hash 不匹配则终止
    if use_eagle and computed_blocks[0]:
        for computed in computed_blocks:
            computed.pop()  # EAGLE 需要丢弃最后一块以重算

get_num_common_prefix_blocks --- 通过比较 ref_cnt 与请求数判断公共前缀:

python 复制代码
def get_num_common_prefix_blocks(self, running_request_id):
    blocks = self.req_to_blocks[running_request_id]
    num_common_blocks = 0
    for block in blocks:
        if block.ref_cnt == len(self.req_to_blocks):
            num_common_blocks += 1
        else:
            break
    return num_common_blocks
SlidingWindowManager

SlidingWindowManager(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L507-L641) 处理滑动窗口注意力。

核心特点 --- 从右向左反向搜索,找到满足滑动窗口连续性要求的最大匹配:

python 复制代码
@classmethod
def find_longest_cache_hit(cls, block_hashes, max_length, ...):
    sliding_window_contiguous_blocks = cdiv(kv_cache_spec.sliding_window - 1, ...)
    # 从右向左搜索,找到连续匹配后提前退出
    for i in range(max_num_blocks - 1, -1, -1):
        if cached_block := block_pool.get_cached_block(block_hashes[i], ...):
            computed[i] = cached_block
            num_contiguous_blocks += 1
            if num_contiguous_blocks >= sliding_window_contiguous_blocks:
                match_found = True
                break
        else:
            num_contiguous_blocks = 0  # 重置连续计数

跳过 token 计算 (single_type_kv_cache_manager.py:606-632(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L606-L632)):

复制代码
sliding_window=4, num_computed_tokens=7
Tokens: [ 0  1  2  3  4  5  6  7 ]
         | ---- computed -----|
                                ^ next token
                     |-----------| sliding window
        |--skipped---|
→ get_num_skipped_tokens(7) == 4
python 复制代码
def get_num_skipped_tokens(self, num_computed_tokens):
    return max(0, num_computed_tokens - self.sliding_window + 1)
ChunkedLocalAttentionManager

ChunkedLocalAttentionManager(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L644-L791) 处理分块局部注意力。

策略:窗口外的 block 直接标记为 null(视为已计算),窗口内按前缀缓存查找:

python 复制代码
@classmethod
def find_longest_cache_hit(cls, block_hashes, max_length, ...):
    local_attention_start_idx = (
        max_length // kv_cache_spec.attention_chunk_size * kv_cache_spec.attention_chunk_size
    )
    # 窗口外标记为 null
    computed_blocks = tuple(
        [block_pool.null_block] * local_attention_start_block_idx
        for _ in range(len(kv_cache_group_ids))
    )
    # 窗口内按 hash 查找
    for i in range(local_attention_start_block_idx, max_num_blocks):
        if cached_block := block_pool.get_cached_block(block_hashes[i], ...):
            computed.append(cached_block)
        else:
            break
MambaManager

MambaManager(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L794-L1066) 处理 Mamba/SSM 线性注意力层。

特殊之处

  • Align 模式 下只保留最近 2 个 block 的状态(last_state_block_idx 追踪)
  • 不支持同步骤内复用其他请求产生的 blockcached_blocks_this_step 过滤)
  • 投机解码时分配额外的 speculative blocks
CrossAttentionManager

CrossAttentionManager(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L1069-L1115) 处理 encoder-decoder 模型的交叉注意力。

不参与前缀缓存:编码器状态是每个请求唯一的,不存在可共享的前缀。

SinkFullAttentionManager

SinkFullAttentionManager(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L1118-L1139) 继承自 FullAttentionManager,额外预分配 sink token 的常驻 block。

Spec → Manager 映射表 (single_type_kv_cache_manager.py:1142-1152(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L1142-L1152)):

python 复制代码
spec_manager_map: dict[type[KVCacheSpec], type[SingleTypeKVCacheManager]] = {
    FullAttentionSpec: FullAttentionManager,
    TQFullAttentionSpec: FullAttentionManager,
    MLAAttentionSpec: FullAttentionManager,
    SlidingWindowSpec: SlidingWindowManager,
    SlidingWindowMLASpec: SlidingWindowManager,
    ChunkedLocalAttentionSpec: ChunkedLocalAttentionManager,
    MambaSpec: MambaManager,
    CrossAttentionSpec: CrossAttentionManager,
    SinkFullAttentionSpec: SinkFullAttentionManager,
}

1.4 EncoderCacheManager --- 编码器缓存

EncoderCacheManager(file:///workspace/vllm/v1/core/encoder_cache_manager.py#L17-L266) 管理多模态模型中编码器输出的缓存。

设计目标:避免对相同的多模态输入(如图像)重复执行昂贵的编码器计算。

核心数据结构:

字段 类型 用途
cached dict[str, set[str]] mm_hash → 引用该 entry 的 request ID 集合
freeable OrderedDict[str, int] 无引用的 mm_hash → 编码器 embedding 数量
freed list[str] 自上次调用后被实际驱逐的 mm_hash 列表
num_free_slots int 当前可用容量
num_freeable_slots int 可回收容量(含无引用条目)

关键操作流程:

  1. check_and_update_cache() --- 检查是否已缓存,更新引用计数
  2. can_allocate() --- 容量检查,不足时从 freeable 中 LRU 驱逐
  3. allocate() --- 预留空间,更新 bookkeeping
  4. free_encoder_input() --- 释放引用,无引用时移入 freeable

还有 EncoderDecoderCacheManager (encoder_cache_manager.py:323-381(file:///workspace/vllm/v1/core/encoder_cache_manager.py#L323-L381)) --- encoder-decoder 模型的临时简化版本,仅用于调度目的。

1.5 层次关系总览

#mermaid-svg-ri5cduuvElmo6fIk{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-ri5cduuvElmo6fIk .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-ri5cduuvElmo6fIk .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-ri5cduuvElmo6fIk .error-icon{fill:#552222;}#mermaid-svg-ri5cduuvElmo6fIk .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-ri5cduuvElmo6fIk .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-ri5cduuvElmo6fIk .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-ri5cduuvElmo6fIk .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-ri5cduuvElmo6fIk .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-ri5cduuvElmo6fIk .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-ri5cduuvElmo6fIk .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-ri5cduuvElmo6fIk .marker{fill:#333333;stroke:#333333;}#mermaid-svg-ri5cduuvElmo6fIk .marker.cross{stroke:#333333;}#mermaid-svg-ri5cduuvElmo6fIk svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-ri5cduuvElmo6fIk p{margin:0;}#mermaid-svg-ri5cduuvElmo6fIk g.classGroup text{fill:#9370DB;stroke:none;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:10px;}#mermaid-svg-ri5cduuvElmo6fIk g.classGroup text .title{font-weight:bolder;}#mermaid-svg-ri5cduuvElmo6fIk .cluster-label text{fill:#333;}#mermaid-svg-ri5cduuvElmo6fIk .cluster-label span{color:#333;}#mermaid-svg-ri5cduuvElmo6fIk .cluster-label span p{background-color:transparent;}#mermaid-svg-ri5cduuvElmo6fIk .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-ri5cduuvElmo6fIk .cluster text{fill:#333;}#mermaid-svg-ri5cduuvElmo6fIk .cluster span{color:#333;}#mermaid-svg-ri5cduuvElmo6fIk .nodeLabel,#mermaid-svg-ri5cduuvElmo6fIk .edgeLabel{color:#131300;}#mermaid-svg-ri5cduuvElmo6fIk .edgeLabel .label rect{fill:#ECECFF;}#mermaid-svg-ri5cduuvElmo6fIk .label text{fill:#131300;}#mermaid-svg-ri5cduuvElmo6fIk .labelBkg{background:#ECECFF;}#mermaid-svg-ri5cduuvElmo6fIk .edgeLabel .label span{background:#ECECFF;}#mermaid-svg-ri5cduuvElmo6fIk .classTitle{font-weight:bolder;}#mermaid-svg-ri5cduuvElmo6fIk .node rect,#mermaid-svg-ri5cduuvElmo6fIk .node circle,#mermaid-svg-ri5cduuvElmo6fIk .node ellipse,#mermaid-svg-ri5cduuvElmo6fIk .node polygon,#mermaid-svg-ri5cduuvElmo6fIk .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-ri5cduuvElmo6fIk .divider{stroke:#9370DB;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk g.clickable{cursor:pointer;}#mermaid-svg-ri5cduuvElmo6fIk g.classGroup rect{fill:#ECECFF;stroke:#9370DB;}#mermaid-svg-ri5cduuvElmo6fIk g.classGroup line{stroke:#9370DB;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk .classLabel .box{stroke:none;stroke-width:0;fill:#ECECFF;opacity:0.5;}#mermaid-svg-ri5cduuvElmo6fIk .classLabel .label{fill:#9370DB;font-size:10px;}#mermaid-svg-ri5cduuvElmo6fIk .relation{stroke:#333333;stroke-width:1;fill:none;}#mermaid-svg-ri5cduuvElmo6fIk .dashed-line{stroke-dasharray:3;}#mermaid-svg-ri5cduuvElmo6fIk .dotted-line{stroke-dasharray:1 2;}#mermaid-svg-ri5cduuvElmo6fIk #compositionStart,#mermaid-svg-ri5cduuvElmo6fIk .composition{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #compositionEnd,#mermaid-svg-ri5cduuvElmo6fIk .composition{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #dependencyStart,#mermaid-svg-ri5cduuvElmo6fIk .dependency{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #dependencyStart,#mermaid-svg-ri5cduuvElmo6fIk .dependency{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #extensionStart,#mermaid-svg-ri5cduuvElmo6fIk .extension{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #extensionEnd,#mermaid-svg-ri5cduuvElmo6fIk .extension{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #aggregationStart,#mermaid-svg-ri5cduuvElmo6fIk .aggregation{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #aggregationEnd,#mermaid-svg-ri5cduuvElmo6fIk .aggregation{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #lollipopStart,#mermaid-svg-ri5cduuvElmo6fIk .lollipop{fill:#ECECFF!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk #lollipopEnd,#mermaid-svg-ri5cduuvElmo6fIk .lollipop{fill:#ECECFF!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-ri5cduuvElmo6fIk .edgeTerminals{font-size:11px;line-height:initial;}#mermaid-svg-ri5cduuvElmo6fIk .classTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-ri5cduuvElmo6fIk .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-ri5cduuvElmo6fIk .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-ri5cduuvElmo6fIk :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} KVCacheManager
+coordinator: KVCacheCoordinator
+block_pool: BlockPool
+get_computed_blocks()
+allocate_slots()
+free()
+cache_blocks()
KVCacheCoordinator
+abstract
+single_type_managers: tuple
+find_longest_cache_hit()
UnitaryKVCacheCoordinator
HybridKVCacheCoordinator
NoPrefixCacheCoordinator
SingleTypeKVCacheManager
+abstract
+req_to_blocks: dict
+find_longest_cache_hit()
+get_num_skipped_tokens()
FullAttentionManager
SlidingWindowManager
ChunkedLocalAttentionManager
MambaManager
CrossAttentionManager
EncoderCacheManager
+cached: dict
+freeable: OrderedDict
+check_and_update_cache()
+can_allocate()


二、KV Cache Coordinator 多 GPU 缓存协调

KVCacheCoordinator(file:///workspace/vllm/v1/core/kv_cache_coordinator.py#L28-L274) 是连接 KVCacheManager 与各 SingleTypeKVCacheManager 的中间协调层,负责在多个 KV cache group 之间统一调度。

2.1 三种 Coordinator 变体

类名 适用场景 特点
NoPrefixCache(file:///workspace/vllm/v1/core/kv_cache_coordinator.py#L276-L321) 前缀缓存禁用或不支持 find_longest_cache_hit 始终返回空;支持任意数量 group
Unitary(file:///workspace/vllm/v1/core/kv_cache_coordinator.py#L324-L389) 单一 KV cache group 直接委托给唯一的 manager;hash_block_size == block_size
Hybrid(file:///workspace/vllm/v1/core/kv_cache_coordinator.py#L392-L591) 多种注意力类型的混合模型 迭代固定点算法协调多组 cache hit 长度

2.2 HybridKVCacheCoordinator --- 固点算法

混合模型(如 Attention + Mamba、Attention + SlidingWindow)需要协调不同 attention type 对 cache hit 长度的约束。

算法核心 (kv_cache_coordinator.py:487-591(file:///workspace/vllm/v1/core/kv_cache_coordinator.py#L487-L591)):

python 复制代码
def find_longest_cache_hit(self, block_hashes, max_cache_hit_length):
    hit_length = max_cache_hit_length
    while True:
        curr_hit_length = hit_length
        for idx, (spec, group_ids, manager_cls) in enumerate(self.attention_groups):
            if isinstance(spec, FullAttentionSpec) and cached_blocks is not None:
                # FullAttn 是向下封闭的:只需查找一次,后续仅 trim
                curr_hit_length = curr_hit_length // spec.block_size * spec.block_size
                continue

            hit_blocks = manager_cls.find_longest_cache_hit(
                ..., max_length=_max_length, alignment_tokens=self.lcm_block_size
            )
            _new_hit_length = len(hit_blocks[0]) * spec.block_size
            curr_hit_length = _new_hit_length

        if curr_hit_length >= hit_length:
            break  # 收敛:长度不再缩减
        hit_length = curr_hit_length
        if is_simple_hybrid:
            break  # 简单混合(1 FullAttn + 1 其他)一次迭代即够

关键设计决策:

  1. Full Attention 优先 :将 FullAttentionSpec 组排在 attention_groups 首位,因其高效的左到右扫描可提供更紧的初始上界
  2. LCM 对齐lcm_block_size = lcm(*all_block_sizes) 确保 hit length 是所有 group block size 的公倍数
  3. EAGLE 处理 :每个 eagle group 最多执行一次 last-block-drop,通过 eagle_verified 集合去重

Group 分组与验证 (kv_cache_coordinator.py:436-470(file:///workspace/vllm/v1/core/kv_cache_coordinator.py#L436-L470)):

python 复制代码
def verify_and_split_kv_cache_groups(self):
    attention_groups = []
    for i, g in enumerate(self.kv_cache_config.kv_cache_groups):
        manager_cls = self.single_type_managers[i].__class__
        spec = g.kv_cache_spec
        # 按 (spec_type, manager_cls) 合并相同类型的 group
        for existing_spec, group_ids, existing_cls in attention_groups:
            if existing_spec == spec:
                group_ids.append(i)
                break
        else:
            attention_groups.append((spec, [i], manager_cls))
    # FullAttn 排第一
    self.attention_groups = sorted(attention_groups, key=lambda x: not isinstance(x[0], FullAttentionSpec))

2.3 Coordinator 工厂函数

get_kv_cache_coordinator()(file:///workspace/vllm/v1/core/kv_cache_coordinator.py#L594-L642) 根据配置自动选择 Coordinator 类型:

python 复制代码
def get_kv_cache_coordinator(...):
    if not enable_caching:
        return KVCacheCoordinatorNoPrefixCache(...)
    if len(kv_cache_config.kv_cache_groups) == 1:
        return UnitaryKVCacheCoordinator(...)
    return HybridKVCacheCoordinator(...)

三、KV Cache 接口定义

3.1 KVCacheConfig --- 全局配置

KVCacheConfig(file:///workspace/vllm/v1/kv_cache_interface.py#L759-L783) 描述模型的整体 KV cache 配置:

python 复制代码
@dataclass
class KVCacheConfig:
    num_blocks: int                    # KV cache block 总数
    kv_cache_tensors: list[KVCacheTensor]  # 每个 tensor 的初始化信息
    kv_cache_groups: list[KVCacheGroupSpec]  # KV cache 分组

    @property
    def has_mamba_layers(self) -> bool: ...
    @property
    def needs_kv_cache_zeroing(self) -> bool: ...

3.2 KVCacheSpec 层次体系

KVCacheSpec(file:///workspace/vllm/v1/kv_cache_interface.py#L81-L127) 是所有 cache spec 的基类:
#mermaid-svg-rXxurYGkufkCv6Y7{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-rXxurYGkufkCv6Y7 .error-icon{fill:#552222;}#mermaid-svg-rXxurYGkufkCv6Y7 .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-rXxurYGkufkCv6Y7 .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-rXxurYGkufkCv6Y7 .marker{fill:#333333;stroke:#333333;}#mermaid-svg-rXxurYGkufkCv6Y7 .marker.cross{stroke:#333333;}#mermaid-svg-rXxurYGkufkCv6Y7 svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-rXxurYGkufkCv6Y7 p{margin:0;}#mermaid-svg-rXxurYGkufkCv6Y7 g.classGroup text{fill:#9370DB;stroke:none;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:10px;}#mermaid-svg-rXxurYGkufkCv6Y7 g.classGroup text .title{font-weight:bolder;}#mermaid-svg-rXxurYGkufkCv6Y7 .cluster-label text{fill:#333;}#mermaid-svg-rXxurYGkufkCv6Y7 .cluster-label span{color:#333;}#mermaid-svg-rXxurYGkufkCv6Y7 .cluster-label span p{background-color:transparent;}#mermaid-svg-rXxurYGkufkCv6Y7 .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-rXxurYGkufkCv6Y7 .cluster text{fill:#333;}#mermaid-svg-rXxurYGkufkCv6Y7 .cluster span{color:#333;}#mermaid-svg-rXxurYGkufkCv6Y7 .nodeLabel,#mermaid-svg-rXxurYGkufkCv6Y7 .edgeLabel{color:#131300;}#mermaid-svg-rXxurYGkufkCv6Y7 .edgeLabel .label rect{fill:#ECECFF;}#mermaid-svg-rXxurYGkufkCv6Y7 .label text{fill:#131300;}#mermaid-svg-rXxurYGkufkCv6Y7 .labelBkg{background:#ECECFF;}#mermaid-svg-rXxurYGkufkCv6Y7 .edgeLabel .label span{background:#ECECFF;}#mermaid-svg-rXxurYGkufkCv6Y7 .classTitle{font-weight:bolder;}#mermaid-svg-rXxurYGkufkCv6Y7 .node rect,#mermaid-svg-rXxurYGkufkCv6Y7 .node circle,#mermaid-svg-rXxurYGkufkCv6Y7 .node ellipse,#mermaid-svg-rXxurYGkufkCv6Y7 .node polygon,#mermaid-svg-rXxurYGkufkCv6Y7 .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-rXxurYGkufkCv6Y7 .divider{stroke:#9370DB;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 g.clickable{cursor:pointer;}#mermaid-svg-rXxurYGkufkCv6Y7 g.classGroup rect{fill:#ECECFF;stroke:#9370DB;}#mermaid-svg-rXxurYGkufkCv6Y7 g.classGroup line{stroke:#9370DB;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 .classLabel .box{stroke:none;stroke-width:0;fill:#ECECFF;opacity:0.5;}#mermaid-svg-rXxurYGkufkCv6Y7 .classLabel .label{fill:#9370DB;font-size:10px;}#mermaid-svg-rXxurYGkufkCv6Y7 .relation{stroke:#333333;stroke-width:1;fill:none;}#mermaid-svg-rXxurYGkufkCv6Y7 .dashed-line{stroke-dasharray:3;}#mermaid-svg-rXxurYGkufkCv6Y7 .dotted-line{stroke-dasharray:1 2;}#mermaid-svg-rXxurYGkufkCv6Y7 #compositionStart,#mermaid-svg-rXxurYGkufkCv6Y7 .composition{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #compositionEnd,#mermaid-svg-rXxurYGkufkCv6Y7 .composition{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #dependencyStart,#mermaid-svg-rXxurYGkufkCv6Y7 .dependency{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #dependencyStart,#mermaid-svg-rXxurYGkufkCv6Y7 .dependency{fill:#333333!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #extensionStart,#mermaid-svg-rXxurYGkufkCv6Y7 .extension{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #extensionEnd,#mermaid-svg-rXxurYGkufkCv6Y7 .extension{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #aggregationStart,#mermaid-svg-rXxurYGkufkCv6Y7 .aggregation{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #aggregationEnd,#mermaid-svg-rXxurYGkufkCv6Y7 .aggregation{fill:transparent!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #lollipopStart,#mermaid-svg-rXxurYGkufkCv6Y7 .lollipop{fill:#ECECFF!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 #lollipopEnd,#mermaid-svg-rXxurYGkufkCv6Y7 .lollipop{fill:#ECECFF!important;stroke:#333333!important;stroke-width:1;}#mermaid-svg-rXxurYGkufkCv6Y7 .edgeTerminals{font-size:11px;line-height:initial;}#mermaid-svg-rXxurYGkufkCv6Y7 .classTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-rXxurYGkufkCv6Y7 .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-rXxurYGkufkCv6Y7 .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-rXxurYGkufkCv6Y7 :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} KVCacheSpec
+block_size: int
+page_size_bytes
+max_memory_usage_bytes()
AttentionSpec
+num_kv_heads: int
+head_size: int
+dtype: torch.dtype
+kv_quant_mode: KVQuantMode
FullAttentionSpec
+sliding_window: int|None
+attention_chunk_size: int|None
TQFullAttentionSpec
+tq_slot_size: int
MLAAttentionSpec
+cache_dtype_str: str|None
+compress_ratio: int
+alignment: int|None
SlidingWindowSpec
+sliding_window: int
+max_admission_blocks_per_request()
ChunkedLocalAttentionSpec
+attention_chunk_size: int
MambaSpec
+shapes: tuple
+mamba_cache_mode: str
+num_speculative_blocks: int
CrossAttentionSpec
SinkFullAttentionSpec
+sink_len: int|None
EncoderOnlyAttentionSpec

量化模式 (kv_cache_interface.py:31-72(file:///workspace/vllm/v1/kv_cache_interface.py#L31-L72)):

python 复制代码
class KVQuantMode(IntEnum):
    NONE = 0                          # 无量化
    FP8_PER_TENSOR = 1               # per-tensor FP8
    INT8_PER_TOKEN_HEAD = 2          # per-token-head INT8
    FP8_PER_TOKEN_HEAD = 3           # per-token-head FP8
    NVFP4 = 4                         # NVFP4 打包量化

KVCacheGroupSpec (kv_cache_interface.py#L744-756(file:///workspace/vllm/v1/kv_cache_interface.py#L744-L756)):

python 复制代码
@dataclass
class KVCacheGroupSpec:
    layer_names: list[str]      # 该组包含的层名列表
    kv_cache_spec: KVCacheSpec   # 该组的 cache spec
    is_eagle_group: bool = False # 是否包含 EAGLE/MTP 草稿层

3.3 UniformTypeKVCacheSpecs

UniformTypeKVCacheSpecs(file:///workspace/vllm/v1/kv_cache_interface.py#L632-731) 用于处理同一类型但参数不同的多层合并(如 DeepSeekV4 的不同 head_size 层)。


四、KV Cache Metrics 缓存使用统计

KVCacheMetricsCollector(file:///workspace/vllm/v1/core/kv_cache_metrics.py#L46-L96) 提供采样式的 Block 级别生命周期指标收集。

4.1 架构设计

#mermaid-svg-TixNr2Ojna13xPYW{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-TixNr2Ojna13xPYW .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-TixNr2Ojna13xPYW .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-TixNr2Ojna13xPYW .error-icon{fill:#552222;}#mermaid-svg-TixNr2Ojna13xPYW .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-TixNr2Ojna13xPYW .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-TixNr2Ojna13xPYW .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-TixNr2Ojna13xPYW .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-TixNr2Ojna13xPYW .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-TixNr2Ojna13xPYW .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-TixNr2Ojna13xPYW .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-TixNr2Ojna13xPYW .marker{fill:#333333;stroke:#333333;}#mermaid-svg-TixNr2Ojna13xPYW .marker.cross{stroke:#333333;}#mermaid-svg-TixNr2Ojna13xPYW svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-TixNr2Ojna13xPYW p{margin:0;}#mermaid-svg-TixNr2Ojna13xPYW .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-TixNr2Ojna13xPYW .cluster-label text{fill:#333;}#mermaid-svg-TixNr2Ojna13xPYW .cluster-label span{color:#333;}#mermaid-svg-TixNr2Ojna13xPYW .cluster-label span p{background-color:transparent;}#mermaid-svg-TixNr2Ojna13xPYW .label text,#mermaid-svg-TixNr2Ojna13xPYW span{fill:#333;color:#333;}#mermaid-svg-TixNr2Ojna13xPYW .node rect,#mermaid-svg-TixNr2Ojna13xPYW .node circle,#mermaid-svg-TixNr2Ojna13xPYW .node ellipse,#mermaid-svg-TixNr2Ojna13xPYW .node polygon,#mermaid-svg-TixNr2Ojna13xPYW .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-TixNr2Ojna13xPYW .rough-node .label text,#mermaid-svg-TixNr2Ojna13xPYW .node .label text,#mermaid-svg-TixNr2Ojna13xPYW .image-shape .label,#mermaid-svg-TixNr2Ojna13xPYW .icon-shape .label{text-anchor:middle;}#mermaid-svg-TixNr2Ojna13xPYW .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-TixNr2Ojna13xPYW .rough-node .label,#mermaid-svg-TixNr2Ojna13xPYW .node .label,#mermaid-svg-TixNr2Ojna13xPYW .image-shape .label,#mermaid-svg-TixNr2Ojna13xPYW .icon-shape .label{text-align:center;}#mermaid-svg-TixNr2Ojna13xPYW .node.clickable{cursor:pointer;}#mermaid-svg-TixNr2Ojna13xPYW .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-TixNr2Ojna13xPYW .arrowheadPath{fill:#333333;}#mermaid-svg-TixNr2Ojna13xPYW .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-TixNr2Ojna13xPYW .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-TixNr2Ojna13xPYW .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-TixNr2Ojna13xPYW .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-TixNr2Ojna13xPYW .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-TixNr2Ojna13xPYW .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-TixNr2Ojna13xPYW .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-TixNr2Ojna13xPYW .cluster text{fill:#333;}#mermaid-svg-TixNr2Ojna13xPYW .cluster span{color:#333;}#mermaid-svg-TixNr2Ojna13xPYW div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-TixNr2Ojna13xPYW .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-TixNr2Ojna13xPYW rect.text{fill:none;stroke-width:0;}#mermaid-svg-TixNr2Ojna13xPYW .icon-shape,#mermaid-svg-TixNr2Ojna13xPYW .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-TixNr2Ojna13xPYW .icon-shape p,#mermaid-svg-TixNr2Ojna13xPYW .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-TixNr2Ojna13xPYW .icon-shape .label rect,#mermaid-svg-TixNr2Ojna13xPYW .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-TixNr2Ojna13xPYW .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-TixNr2Ojna13xPYW .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-TixNr2Ojna13xPYW :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 输出指标
BlockMetricsState
Block 生命周期事件
on_block_allocated
BlockMetricsState
on_block_accessed
on_block_evicted
KVCacheEvictionEvent
birth_time_ns
last_access_ns
access_history deque~4
lifetime_seconds
idle_seconds
reuse_gaps_seconds

4.2 核心实现

python 复制代码
class KVCacheMetricsCollector:
    def __init__(self, sample_rate=0.01):
        self.sample_rate = sample_rate          # 默认 1% 采样率
        self.block_metrics: dict[int, BlockMetricsState] = {}
        self._eviction_events: list[KVCacheEvictionEvent] = []

    def should_sample_block(self) -> bool:
        return random.random() < self.sample_rate

    def on_block_allocated(self, block): ...  # 记录创建时间
    def on_block_accessed(self, block): ...   # 更新访问时间和历史
    def on_block_evicted(self, block): ...    # 生成驱逐事件并移除追踪
    def drain_events(self) -> list: ...       # 取出并清空事件队列

BlockMetricsState (kv_cache_metrics.py:16-43(file:///workspace/vllm/v1/core/kv_cache_metrics.py#L16-43)):

  • access_history 使用 deque(maxlen=4) 有界记录,防止高频访问导致无限增长
  • get_reuse_gaps_seconds() 返回相邻两次访问的时间间隔列表

五、KV Offload CPU 卸载

KV Offload 子系统提供将 GPU KV cache 块卸载到 CPU 内存的能力,支持可插拔的缓存策略。

5.1 整体架构

渲染错误: Mermaid 渲染失败: Parse error on line 24: ...> POLICY POLICY <|-- LRU POLICY ----------------------^ Expecting 'SEMI', 'NEWLINE', 'SPACE', 'EOF', 'AMP', 'COLON', 'START_LINK', 'LINK', 'LINK_ID', 'DOWN', 'DEFAULT', 'NUM', 'COMMA', 'NODE_STRING', 'BRKT', 'MINUS', 'MULT', 'UNICODE_TEXT', got 'TAGSTART'

5.2 OffloadingManager 抽象接口

OffloadingManager(file:///workspace/vllm/v1/kv_offload/base.py#L110-L216) 定义了 Scheduler 侧的核心原语:

方法 功能 说明
lookup(key) 检查 block 是否已卸载且就绪 返回 True/False/None(None 表示需重试)
prepare_load(keys) 准备读取 保护不被驱逐;返回 LoadStoreSpec
touch(keys) 标记最近使用 更新 LRU 位置
complete_load(keys) 标记加载完成 解除保护
prepare_store(keys) 准备卸载写入 可能触发驱逐;返回 PrepareStoreOutput
complete_store(keys) 标记存储完成 block 变为可加载

关键数据结构

  • OffloadKey = block_hash(字节) + group_idx(4字节大端) --- 唯一标识一个卸载块
  • GPULoadStoreSpec --- GPU 端 load/store 规格,含 group_sizesblock_indices
  • CPULoadStoreSpec --- CPU 端 load/store 规格

5.3 CPUOffloadingManager 实现

CPUOffloadingManager(file:///workspace/vllm/v1/kv_offload/cpu/manager.py#L25-L204) 是 CPU 卸载的具体管理器。

Block Pool 管理

python 复制代码
class CPUOffloadingManager(OffloadingManager):
    def __init__(self, num_blocks, cache_policy="lru"):
        self._num_blocks = num_blocks
        self._num_allocated_blocks = 0     # 已分配总数
        self._free_list: list[int] = []    # 可重用 block ID 列表
        self._policy: CachePolicy = policy_cls(cache_capacity=num_blocks)

prepare_store() 流程 (cpu/manager.py:119-172(file:///workspace/vllm/v1/kv_offload/cpu/manager.py#L119-L172)):

  1. 过滤已存储的 blocks
  2. 计算需要驱逐的数量 (len(keys_to_store) - free_blocks)
  3. 调用 policy.evict() 执行驱逐(排除 protected set)
  4. 从 block pool 分配新 block(优先重用 free_list)
  5. (key, block) 插入 policy

5.4 缓存策略

CachePolicy 抽象

CachePolicy(file:///workspace/vllm/v1/kv_offload/cpu/policies/base.py#L36-76) 定义策略接口:

python 复制代码
class CachePolicy(ABC):
    @abstractmethod
    def get(self, key) -> BlockStatus | None: ...
    @abstractmethod
    def insert(self, key, block) -> None: ...
    @abstractmethod
    def remove(self, key) -> None: ...
    @abstractmethod
    def touch(self, keys) -> None: ...
    @abstractmethod
    def evict(self, n, protected) -> list | None: ...

BlockStatus 使用 ctypes Structure,含 ref_cnt(引用计数,-1 表示未就绪)和 block_id

LRU 策略

LRUCachePolicy(file:///workspace/vllm/v1/kv_offload/cpu/policies/lru.py#L10-46) 基于 OrderedDict

  • touch(): move_to_end(key) 移至末尾
  • evict(): 从头部扫描 ref_cnt == 0 且非 protected 的候选者
ARC 策略

ARCCachePolicy(file:///workspace/vllm/v1/kv_offload/cpu/policies/arc.py) 实现 Adaptive Replacement Cache,维护 T1(最近访问)、T2(频繁访问)和 ghost 列表。

5.5 FilterReusedOffloadingManager --- 复用门控

FilterReusedOffloadingManager(file:///workspace/vllm/v1/kv_offload/reuse_manager.py#L23-120) 是一个装饰器 ,包装任意 OffloadingManager,在 prepare_store() 前过滤低频 block。

核心逻辑

python 复制代码
class FilterReusedOffloadingManager(OffloadingManager):
    def lookup(self, key, req_context):
        # 在 LRU tracker 中记录访问次数
        if key in self.counts:
            self.counts.move_to_end(key)
            self.counts[key] += 1
        else:
            self.counts[key] = 1
        return self._backing.lookup(key, req_context)

    def prepare_store(self, keys, req_context):
        # 只传递达到阈值的 keys
        eligible = [k for k in keys if self.counts.get(k, 0) >= self.store_threshold]
        return self._backing.prepare_store(eligible, req_context)

参数说明

  • store_threshold: 必须 ≥ 2(1 等于不过滤)
  • max_tracker_size: LRU tracker 最大条目数(默认 64,000)

5.6 Worker 侧传输引擎

SingleDirectionOffloadingHandler

SingleDirectionOffloadingHandler(file:///workspace/vllm/v1/kv_offload/cpu/gpu_worker.py#L111-373) 处理单向异步传输。

核心机制

  • 每个 transfer 使用独立 CUDA stream
  • Transfer 严格按提交顺序串行执行(前一个完成后再开始下一个)
  • 支持 sub-block 映射:当 offloaded block size > GPU block size 时自动拆分

compute_sub_block_ptrs() (gpu_worker.py:39-86(file:///workspace/vllm/v1/kv_offload/cpu/gpu_worker.py#L39-86)):

python 复制代码
# 例: gpu_block_size=16, cpu_block_size=32 (block_size_factor=2)
# kv_manager_block_id 0 → kernel block [0, 1]
# kv_manager_block_id 1 → kernel block [2, 3]
kernel_block_ids = (
    kv_manager_block_ids.reshape(-1, 1) * blocks_per_kv_block
    + kernel_block_arange  # [0, 1, ..., factor-1]
).reshape(-1)
SharedOffloadRegion --- mmap 共享内存

SharedOffloadRegion(file:///workspace/vllm/v1/kv_offload/cpu/shared_offload_region.py#L27-192) 实现多 worker 间的共享 CPU 缓存区域。

布局设计

复制代码
文件路径: /dev/shm/vllm_offload_{instance_id}.mmap

物理布局 (interleaved):
worker0_block0 | worker1_block0 | ... | worker{M-1}_block0
worker0_block1 | worker1_block1 | ... | worker{M-1}_block1
...

row_stride = cpu_page_size * num_workers
每个 worker 通过 stride 视图访问自己的区域

关键特性

  • 第一个打开文件的 worker 成为 creator 并调用 ftruncate
  • 其他 worker 等待文件大小达标后 mmap
  • 使用 MADV_POPULATE_WRITE 预分配物理页
  • 支持通过 cudaHostRegister 注册为 pinned memory 以加速 DMA
CPUOffloadingSpec --- 规格工厂

CPUOffloadingSpec(file:///workspace/vllm/v1/kv_offload/cpu/spec.py#L22-102) 将所有组件组装在一起:

python 复制代码
class CPUOffloadingSpec(OffloadingSpec):
    def get_manager(self) -> OffloadingManager:
        # 创建 CPUOffloadingManager (+ 可选 FilterReused 包装)
        ...

    def get_handlers(self, kv_caches):
        # 创建 CpuGpuOffloadingHandlers (GPU↔CPU 双向)
        yield GPULoadStoreSpec, CPULoadStoreSpec, handlers.gpu_to_cpu_handler
        yield CPULoadStoreSpec, GPULoadStoreSpec, handlers.cpu_to_gpu_handler

六、Simple KV Offload

Simple KV Offload(file:///workspace/vllm/v1/simple_kv_offload/) 是一个轻量级的 CPU 卸载方案,不依赖完整的 OffloadingManager 抽象框架,直接基于 KVCacheCoordinator 的 CPU 副本工作。

6.1 架构对比

特性 KV Offload (完整版) Simple KV Offload
复杂度 高(多抽象层) 低(直接操作 BlockPool)
缓存策略 LRU / ARC 继承 GPU prefix cache 行为
Store 模式 eager / lazy eager + lazy 双模式
多节点支持 ✅ (via connector) ❌ 单机
适用场景 生产级部署 快速验证、原型

6.2 SimpleCPUOffloadScheduler

SimpleCPUOffloadScheduler(file:///workspace/vllm/v1/simple_kv_offload/manager.py#L67-L741) 是调度器侧的管理器。

初始化 --- 从 GPU 配置派生 CPU 配置 (simple_kv_offload/manager.py:160-187(file:///workspace/vllm/v1/simple_kv_offload/manager.py#L160-187)):

python 复制代码
@staticmethod
def _derive_cpu_config(gpu_config, cpu_capacity_bytes):
    gpu_total_bytes = sum(t.size for t in gpu_config.kv_cache_tensors)
    num_gpu_blocks = gpu_config.num_blocks
    num_cpu_blocks = max(1, num_gpu_blocks * cpu_capacity_bytes // gpu_total_bytes)
    # 保持相同的 kv_cache_groups,仅缩放 num_blocks
    return KVCacheConfig(num_blocks=num_cpu_blocks, ...)

两种 Store 模式

Lazy Mode --- 游标扫描

_prepare_lazy_store_specs()(file:///workspace/vllm/v1/simple_kv_offload/manager.py#L378-444):使用单遍游标扫描 GPU free queue,选择接近 eviction 边缘的 cached block 进行卸载。

python 复制代码
# 游标沿 free_queue 前进,覆盖 target_free 个位置
while covered < self._target_free and len(gpu_ids) < num_cpu_free:
    bhash = node.block_hash
    if bhash is not None and not node.is_null and cpu_pool 未缓存该 hash:
        gpu_ids.append(node.block_id)
        block_hashes.append(bhash)
    covered += 1
    node = node.next_free_block
Eager Mode --- 请求驱动

_prepare_eager_store_specs()(file:///workspace/vllm/v1/simple_kv_offload/manager.py#L446-582):跟踪每个请求的新计算 block,确认 KV 数据已写回后批量卸载。

两阶段扫描:

  1. Phase 1: 分类 block 为 cached vs to-store(跳过 null 和已缓存的)
  2. Phase 2: 批量分配 CPU block 并 stamp hash

6.3 SimpleCPUOffloadWorker

SimpleCPUOffloadWorker(file:///workspace/vllm/v1/simple_kv_offload/worker.py#L25-305) 是 Worker 侧处理器。

KV Cache 注册 (worker.py:66-182(file:///workspace/vllm/v1/simple_kv_offload/worker.py#L66-182)):

  • 自动去重共享同一底层 storage 的多个层
  • 推导物理 layout(FlashAttn 的 (2, num_blocks, ...) vs FlashInfer 的 (num_blocks, ...)
  • 分配 pinned CPU tensor(使用 cudaHostRegister 避免 PyTorch 的 2 幂向上取整)

DmaCopyBackend (copy_backend.py(file:///workspace/vllm/v1/simple_kv_offload/copy_backend.py#L23-97)):

  • 后台线程执行 cuMemcpyBatchAsync
  • 低优先级 CUDA stream(不干扰模型计算)
  • Event 驱动的完成通知

6.4 Metadata 定义

SimpleCPUOffloadMetadata(file:///workspace/vllm/v1/simple_kv_offload/metadata.py#L16-38) (Scheduler → Worker):

python 复制代码
@dataclass
class SimpleCPUOffloadMetadata(KVConnectorMetadata):
    load_event: int              # 加载任务 ID (-1 = 无)
    load_gpu_blocks: list[int]
    load_cpu_blocks: list[int]
    store_event: int             # 存储任务 ID
    store_gpu_blocks: list[int]
    store_cpu_blocks: list[int]
    need_flush: bool             # 是否有请求被抢占

SimpleCPUOffloadWorkerMetadata(file:///workspace/vllm/v1/simple_kv_offload/metadata.py#L41-60) (Worker → Scheduler):

python 复制代码
@dataclass
class SimpleCPUOffloadWorkerMetadata(KVConnectorWorkerMetadata):
    completed_store_events: dict[int, int]  # event_idx → 完成计数
    # aggregate() 跨 worker 汇总

七、KV Transfer 跨节点传输

KV Transfer 子系统支持在** disaggregated 架构**中跨节点传输 KV cache 数据(如 Prefill 节点 → Decode 节点)。

7.1 KVConnectorBase_V1 核心接口

KVConnectorBase_V1(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/base.py#L170-654) 定义了双向通信协议:
Remote (KV Cache) Worker (KVConnector) Scheduler (KVConnector) Remote (KV Cache) Worker (KVConnector) Scheduler (KVConnector) #mermaid-svg-naEuRdBGu7Weyxzr{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-naEuRdBGu7Weyxzr .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-naEuRdBGu7Weyxzr .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-naEuRdBGu7Weyxzr .error-icon{fill:#552222;}#mermaid-svg-naEuRdBGu7Weyxzr .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-naEuRdBGu7Weyxzr .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-naEuRdBGu7Weyxzr .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-naEuRdBGu7Weyxzr .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-naEuRdBGu7Weyxzr .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-naEuRdBGu7Weyxzr .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-naEuRdBGu7Weyxzr .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-naEuRdBGu7Weyxzr .marker{fill:#333333;stroke:#333333;}#mermaid-svg-naEuRdBGu7Weyxzr .marker.cross{stroke:#333333;}#mermaid-svg-naEuRdBGu7Weyxzr svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-naEuRdBGu7Weyxzr p{margin:0;}#mermaid-svg-naEuRdBGu7Weyxzr .actor{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-naEuRdBGu7Weyxzr text.actor>tspan{fill:black;stroke:none;}#mermaid-svg-naEuRdBGu7Weyxzr .actor-line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);}#mermaid-svg-naEuRdBGu7Weyxzr .innerArc{stroke-width:1.5;stroke-dasharray:none;}#mermaid-svg-naEuRdBGu7Weyxzr .messageLine0{stroke-width:1.5;stroke-dasharray:none;stroke:#333;}#mermaid-svg-naEuRdBGu7Weyxzr .messageLine1{stroke-width:1.5;stroke-dasharray:2,2;stroke:#333;}#mermaid-svg-naEuRdBGu7Weyxzr #arrowhead path{fill:#333;stroke:#333;}#mermaid-svg-naEuRdBGu7Weyxzr .sequenceNumber{fill:white;}#mermaid-svg-naEuRdBGu7Weyxzr #sequencenumber{fill:#333;}#mermaid-svg-naEuRdBGu7Weyxzr #crosshead path{fill:#333;stroke:#333;}#mermaid-svg-naEuRdBGu7Weyxzr .messageText{fill:#333;stroke:none;}#mermaid-svg-naEuRdBGu7Weyxzr .labelBox{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-naEuRdBGu7Weyxzr .labelText,#mermaid-svg-naEuRdBGu7Weyxzr .labelText>tspan{fill:black;stroke:none;}#mermaid-svg-naEuRdBGu7Weyxzr .loopText,#mermaid-svg-naEuRdBGu7Weyxzr .loopText>tspan{fill:black;stroke:none;}#mermaid-svg-naEuRdBGu7Weyxzr .loopLine{stroke-width:2px;stroke-dasharray:2,2;stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);}#mermaid-svg-naEuRdBGu7Weyxzr .note{stroke:#aaaa33;fill:#fff5ad;}#mermaid-svg-naEuRdBGu7Weyxzr .noteText,#mermaid-svg-naEuRdBGu7Weyxzr .noteText>tspan{fill:black;stroke:none;}#mermaid-svg-naEuRdBGu7Weyxzr .activation0{fill:#f4f4f4;stroke:#666;}#mermaid-svg-naEuRdBGu7Weyxzr .activation1{fill:#f4f4f4;stroke:#666;}#mermaid-svg-naEuRdBGu7Weyxzr .activation2{fill:#f4f4f4;stroke:#666;}#mermaid-svg-naEuRdBGu7Weyxzr .actorPopupMenu{position:absolute;}#mermaid-svg-naEuRdBGu7Weyxzr .actorPopupMenuPanel{position:absolute;fill:#ECECFF;box-shadow:0px 8px 16px 0px rgba(0,0,0,0.2);filter:drop-shadow(3px 5px 2px rgb(0 0 0 / 0.4));}#mermaid-svg-naEuRdBGu7Weyxzr .actor-man line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-naEuRdBGu7Weyxzr .actor-man circle,#mermaid-svg-naEuRdBGu7Weyxzr line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;stroke-width:2px;}#mermaid-svg-naEuRdBGu7Weyxzr :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} get_num_new_matched_tokens() build_connector_meta() start_load_kv() save_kv_layer() get_finished() update_connector_output() 查询远程 cache 命中数 (hit_count, is_async) KVConnectorMetadata (load/store specs) 异步加载 KV 数据 模型前向计算 (与 load 重叠) 异步保存 KV 数据 finished_recving/sending req_ids 处理完成事件

Scheduler 侧方法:

方法 说明
get_num_new_matched_tokens() 查询远程 cache 可匹配的新 token 数
update_state_after_alloc() 分配后更新状态
build_connector_meta() 构建 metadata 发送给 worker
update_connector_output() 处理 worker 回传的完成事件
request_finished() 请求结束时的清理,返回是否延迟释放
take_events() 收集 KV cache 事件

Worker 侧方法:

方法 说明
register_kv_caches() 注册本地 KV cache tensors
bind_connector_metadata() 接收 scheduler 的 metadata
start_load_kv() 启动异步加载
wait_for_layer_load() 等待指定层加载完成
save_kv_layer() 启动异步保存某层
wait_for_save() 等待所有保存完成
get_finished() 上报完成的异步传输
handle_preemptions() 抢占前的同步处理

7.2 连接器注册与发现

KVConnectorFactory(file:///workspace/vllm/distributed/kv_transfer/kv_connector/factory.py) 通过名称注册制管理连接器类型:

python 复制代码
# 内置注册
OffloadingSpecFactory.register_spec(
    "CPUOffloadingSpec",
    "vllm.v1.kv_offload.cpu.spec",
    "CPUOffloadingSpec"
)

kv_transfer_state.py(file:///workspace/vllm/distributed/kv_transfer/kv_transfer_state.py#L19-78) 管理全局连接器实例:

python 复制代码
_KV_CONNECTOR_AGENT: KVConnectorBaseType | None = None

def ensure_kv_transfer_initialized(vllm_config, kv_cache_config):
    global _KV_CONNECTOR_AGENT
    _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
        config=vllm_config, role=KVConnectorRole.WORKER, ...
    )

7.3 支持的连接器类型

kv_connector/v1/ 目录下实现了多种连接器后端:

连接器 文件 用途
LMCache lmcache_connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py) LMCache 集成
NIXL nixl/connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/nixl/connector.py) NVIDIA NIXL 传输
P2P NCCL p2p/p2p_nccl_connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py) 点对点 NCCL 传输
Mooncake mooncake/mooncake_connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py) Mooncake 分离架构
HF3FS hf3fs/hf3fs_connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/hffs/hf3fs_connector.py) HDFS/Federated FS
Moriio moriio/moriio_connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.py) Moriio 分布式缓存
Offloading offloading_connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py) 通用卸载连接器
Simple CPU simple_cpu_offload_connector.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/simple_cpu_offload_connector.py) 简单 CPU 卸载

7.4 Disaggregated Prefill 工作流

在分离式架构中,Prefill 节点和 Decode 节点分别运行,KV cache 通过 Connector 传输:

复制代码
┌──────────────┐     KV Transfer      ┌──────────────┐
│  Prefill 节点 │◄──────────────────►│  Decode 节点  │
│              │                     │              │
│ • Prefill    │  ──store─────►     │ • Load       │
│ • Cache Hit  │  ◄──load──────     │ • Decode     │
│ • Save KV    │                     │ • Evict      │
└──────────────┘                     └──────────────┘

HMA (Hybrid Memory Allocator) 支持 :通过 SupportsHMA 接口标记支持混合内存分配的连接器,额外实现 request_finished_all_groups() 方法处理跨 group 的异步释放。


八、Block Table 管理

8.1 BlockTable

BlockTable(file:///workspace/vllm/v1/worker/block_table.py#L18-220) 管理单个 KV cache group 的 block table,是 attention kernel 直接消费的数据结构。

核心字段:

python 复制代码
class BlockTable:
    def __init__(self, block_size, max_num_reqs, max_num_blocks_per_req,
                 max_num_batched_tokens, pin_memory, device,
                 kernel_block_size, cp_kv_cache_interleave_size):
        # hybrid block 支持:allocation block_size ≠ kernel block_size
        if kernel_block_size == block_size:
            self.blocks_per_kv_block = 1
            self.use_hybrid_blocks = False
        else:
            self.blocks_per_kv_block = block_size // kernel_block_size
            self.use_hybrid_blocks = True

        self.block_table = self._make_buffer(max_num_reqs, max_num_blocks_per_req, int32)
        self.slot_mapping = self._make_buffer(max_num_batched_tokens, int64)
        self.num_blocks_per_row = np.zeros(max_num_reqs, dtype=np.int32)

Hybrid Block 映射 (block_table.py:173-201(file:///workspace/vllm/v1/worker/block_table.py#L173-201)):

当 allocation block size 与 kernel block size 不同时(如混合模型),需要映射:

python 复制代码
@staticmethod
def map_to_kernel_blocks(kv_manager_block_ids, blocks_per_kv_block, kernel_block_arange):
    """
    例: kv_manager block size=32, kernel block size=16
    input:  [0, 1, 2]
    output: [0, 1, 2, 3, 4, 5]
    # 每个 manager block → blocks_per_kv_block 个 kernel blocks
    """
    kernel_block_ids = (
        kv_manager_block_ids.reshape(-1, 1) * blocks_per_kv_block
        + kernel_block_arange
    ).reshape(-1)

主要操作:

方法 功能
append_row(block_ids, row_idx) 向请求行追加 block IDs
add_row(block_ids, row_idx) 设置请求行的 block IDs
clear_row(row_idx) 清空请求行
move_row(src, tgt) 移动行(用于 swap)
swap_row(src, tgt) 交换两行
compute_slot_mapping() Triton kernel 计算 slot mapping
commit_block_table(num_reqs) 将 CPU 数据复制到 GPU

8.2 MultiGroupBlockTable

MultiGroupBlockTable(file:///workspace/vllm/v1/worker/block_table.py#L223-322) 为每个 KV cache group 维护一个独立的 BlockTable

python 复制代码
class MultiGroupBlockTable:
    def __init__(self, max_num_reqs, max_model_len, max_num_batched_tokens,
                 pin_memory, device, block_sizes, kernel_block_sizes, ...):
        self.block_tables = [
            BlockTable(block_size, ..., kernel_block_size, ...)
            for block_size, kernel_block_size in zip(block_sizes, kernel_block_sizes)
        ]

    def append_row(self, block_ids, row_idx):
        for i, block_table in enumerate(self.block_tables):
            block_table.append_row(block_ids[i], row_idx)

TRTLLM 对齐max_num_blocks 会向上对齐到 128 / block_size 的倍数以满足某些 attention 后端的要求。

8.3 Slot Mapping Kernel

_compute_slot_mapping_kernel(file:///workspace/vllm/v1/worker/block_table.py#L325-380) 是一个 Triton kernel,将 token position 映射到物理 slot ID:

python 复制代码
@triton.jit
def _compute_slot_mapping_kernel(
    num_tokens, query_start_loc_ptr, positions_ptr,
    block_table_ptr, block_table_stride, block_size,
    slot_mapping_ptr,
    TOTAL_CP_WORLD_SIZE, TOTAL_CP_RANK,
    CP_KV_CACHE_INTERLEAVE_SIZE, PAD_ID, BLOCK_SIZE):
    # 对每个 request:
    #   virtual_block_size = block_size * TOTAL_CP_WORLD_SIZE
    #   block_index = pos // virtual_block_size
    #   判断是否属于当前 rank 的 local slot
    #   local_slot = block_number * block_size + local_offset
    #   不属于当前 rank 则填 PAD_SLOT_ID

Context Parallel 支持 :通过 TOTAL_CP_WORLD_SIZETOTAL_CP_RANK 参数支持 Decode Context Parallel 和 Prefill Context Parallel。CP_KV_CACHE_INTERLEAVE_SIZE 控制 KV cache 在 rank 间的交错粒度。


九、前缀缓存 Prefix Caching

Prefix Caching 是 vLLM 的核心性能优化特性,通过检测并复用公共 prompt 前缀来避免重复计算。

9.1 查找最长缓存命中

入口KVCacheManager.get_computed_blocks()(file:///workspace/vllm/v1/core/kv_cache_manager.py#183-L223)

python 复制代码
def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
    if not self.enable_caching or request.skip_reading_prefix_cache:
        return self.empty_kv_cache_blocks, 0

    # 必须至少重算最后一个 token 以获取 logits
    max_cache_hit_length = request.num_tokens - 1
    computed_blocks, num_new_computed_tokens = (
        self.coordinator.find_longest_cache_hit(
            request.block_hashes, max_cache_hit_length
        )
    )

    if self.log_stats:
        self.prefix_cache_stats.record(
            num_tokens=request.num_tokens,
            num_hits=num_new_computed_tokens,
            preempted=request.num_preemptions > 0,
        )

    return self.create_kv_cache_blocks(computed_blocks), num_new_computed_tokens

9.2 各 Manager 的查找策略对比

Manager 搜索方向 匹配策略 特殊处理
FullAttention 左→右 顺序匹配直到 miss EAGLE pop 最后一块
SlidingWindow 右→左 反向找连续 window 内匹配 null padding; EAGLE
ChunkedLocal 分区+顺序 窗口外=null, 窗口内=hash 查找 不支持 EAGLE
Mamba 右→左 只取最后一个匹配块 align mode 特殊处理

9.3 FullAttentionManager 详细流程

find_longest_cache_hit(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L448-L494):

复制代码
输入: block_hashes=[H0, H1, H2, H3, H4, ...], max_length=20, block_size=16

Step 1: 检查 H0 → cache hit? → 是 → computed=[B0]
Step 2: 检查 H1 → cache hit? → 是 → computed=[B0, B1]
Step 3: 检查 H2 → cache miss! → break

返回: ([B0, B1], 32)  # 2 blocks × 16 tokens = 32 tokens

9.4 SlidingWindowManager 详细流程

find_longest_cache_hit(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py#L512-L604):

复制代码
输入: sliding_window=8, block_size=4, max_length=20

需要连续匹配: ceil((8-1)/4) = 1 个 block (实际上需要更多以覆盖窗口)

从右向左搜索:
  i=4: H4 → hit? → [_, _, _, _, B4], contiguous=1
  i=3: H3 → hit? → [_, _, _, B3, B4], contiguous=2 ✓ 达标!
  → trim trailing → [_, _, _, B3, B4]

返回: ([NULL, NULL, NULL, B3, B4], 16)  # 4 blocks, 但前3个是null

9.5 缓存命中后的处理路径

allocate_slots() 中的三阶段分配 (kv_cache_manager.py:225-416(file:///workspace/vllm/v1/core/kv_cache_manager.py#L225-L416)):

复制代码
阶段 1: remove_skipped_blocks()
  → 释放滑动窗口外的旧 block(替换为 null_block)

阶段 2: allocate_new_computed_blocks()
  → touch 前缀缓存命中的 block(防止被驱逐)
  → 追加到请求的 block 列表
  → 为 external computed token 分配新 block

阶段 3: allocate_new_blocks()
  → 为待计算的新 token 分配 block
  → 返回新分配的 block 列表

最后: cache_blocks()
  → 将满 block 标记为可复用(写入 block_hash → block 映射)

9.6 BlockPool 中的缓存数据结构

BlockHashToBlockMap(file:///workspace/vllm/v1/core/block_pool.py#L34-) 维护 hash→block 的映射:

python 复制代码
class BlockHashToBlockMap:
    """block_hash → KVCacheBlock | dict[int, KVCacheBlock]"""
    def get_one_block(self, key) -> KVCacheBlock | None: ...
    def insert(self, key, block) -> None: ...

配合 FreeKVCacheBlockQueue(free list)实现:

  • 新分配的 block 来自 free list
  • 被 cache 的 block 进入 cached_block_hash_to_block 映射
  • 当 ref_cnt 降为 0 时,cached block 可被重新利用或驱逐

9.7 与调度器的集成

Prefix Cache 与调度器的交互点:

  1. 调度前查询get_computed_blocks() 告知调度器有多少 token 可以跳过
  2. 全序列准入检查full_sequence_must_fit=True 时确保整个序列能放入
  3. Cascade Attention 优化get_num_common_prefix_blocks() 让调度器知道公共前缀长度,用于 cascade attention 的 KV fuse 优化
  4. 统计报告make_prefix_cache_stats() 每 interval 辞出命中率等指标

十、数据流全景图

#mermaid-svg-GOsqgfOPfE8wnhJG{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-GOsqgfOPfE8wnhJG .error-icon{fill:#552222;}#mermaid-svg-GOsqgfOPfE8wnhJG .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-GOsqgfOPfE8wnhJG .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-GOsqgfOPfE8wnhJG .marker{fill:#333333;stroke:#333333;}#mermaid-svg-GOsqgfOPfE8wnhJG .marker.cross{stroke:#333333;}#mermaid-svg-GOsqgfOPfE8wnhJG svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-GOsqgfOPfE8wnhJG p{margin:0;}#mermaid-svg-GOsqgfOPfE8wnhJG .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-GOsqgfOPfE8wnhJG .cluster-label text{fill:#333;}#mermaid-svg-GOsqgfOPfE8wnhJG .cluster-label span{color:#333;}#mermaid-svg-GOsqgfOPfE8wnhJG .cluster-label span p{background-color:transparent;}#mermaid-svg-GOsqgfOPfE8wnhJG .label text,#mermaid-svg-GOsqgfOPfE8wnhJG span{fill:#333;color:#333;}#mermaid-svg-GOsqgfOPfE8wnhJG .node rect,#mermaid-svg-GOsqgfOPfE8wnhJG .node circle,#mermaid-svg-GOsqgfOPfE8wnhJG .node ellipse,#mermaid-svg-GOsqgfOPfE8wnhJG .node polygon,#mermaid-svg-GOsqgfOPfE8wnhJG .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-GOsqgfOPfE8wnhJG .rough-node .label text,#mermaid-svg-GOsqgfOPfE8wnhJG .node .label text,#mermaid-svg-GOsqgfOPfE8wnhJG .image-shape .label,#mermaid-svg-GOsqgfOPfE8wnhJG .icon-shape .label{text-anchor:middle;}#mermaid-svg-GOsqgfOPfE8wnhJG .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-GOsqgfOPfE8wnhJG .rough-node .label,#mermaid-svg-GOsqgfOPfE8wnhJG .node .label,#mermaid-svg-GOsqgfOPfE8wnhJG .image-shape .label,#mermaid-svg-GOsqgfOPfE8wnhJG .icon-shape .label{text-align:center;}#mermaid-svg-GOsqgfOPfE8wnhJG .node.clickable{cursor:pointer;}#mermaid-svg-GOsqgfOPfE8wnhJG .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-GOsqgfOPfE8wnhJG .arrowheadPath{fill:#333333;}#mermaid-svg-GOsqgfOPfE8wnhJG .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-GOsqgfOPfE8wnhJG .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-GOsqgfOPfE8wnhJG .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-GOsqgfOPfE8wnhJG .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-GOsqgfOPfE8wnhJG .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-GOsqgfOPfE8wnhJG .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-GOsqgfOPfE8wnhJG .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-GOsqgfOPfE8wnhJG .cluster text{fill:#333;}#mermaid-svg-GOsqgfOPfE8wnhJG .cluster span{color:#333;}#mermaid-svg-GOsqgfOPfE8wnhJG div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-GOsqgfOPfE8wnhJG .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-GOsqgfOPfE8wnhJG rect.text{fill:none;stroke-width:0;}#mermaid-svg-GOsqgfOPfE8wnhJG .icon-shape,#mermaid-svg-GOsqgfOPfE8wnhJG .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-GOsqgfOPfE8wnhJG .icon-shape p,#mermaid-svg-GOsqgfOPfE8wnhJG .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-GOsqgfOPfE8wnhJG .icon-shape .label rect,#mermaid-svg-GOsqgfOPfE8wnhJG .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-GOsqgfOPfE8wnhJG .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-GOsqgfOPfE8wnhJG .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-GOsqgfOPfE8wnhJG :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 可选: Offload Path
BlockTable
BlockPool
SingleType Managers
Coordinator 层
Scheduler 侧
Request 生命周期
可选
新 Request 到达
KVCacheManager.get_computed_blocks()

→ find_longest_cache_hit()
KVCacheManager.allocate_slots()

→ 三阶段分配
KVCacheManager.free()
KVCacheManager.cache_blocks()
Coordinator.find_longest_cache_hit()

固定点算法(Hybrid)
Coordinator.allocate_new_blocks()

分发到各 SingleTypeMgr
FullAttentionManager

左→右顺序匹配
SlidingWindowManager

右→左反向匹配
ChunkedLocalAttnMgr

分区+窗口内匹配
MambaManager

末尾单块匹配
get_new_blocks()
free_blocks()
cache_full_blocks()
get_cached_block()
evict_blocks()
append_row()
compute_slot_mapping()

Triton Kernel
commit_block_table()
prepare_store()
prepare_load()
GPU ↔ CPU Async Transfer


附录:关键文件索引

文件路径 核心内容
kv_cache_manager.py(file:///workspace/vllm/v1/core/kv_cache_manager.py) KVCacheManager 顶层抽象、KVCacheBlocks 数据类
single_type_kv_cache_manager.py(file:///workspace/vllm/v1/core/single_type_kv_cache_manager.py) 单类型基类及 6 种具体 Manager 实现
encoder_cache_manager.py(file:///workspace/vllm/v1/core/encoder_cache_manager.py) 编码器输出缓存管理
kv_cache_coordinator.py(file:///workspace/vllm/v1/core/kv_cache_coordinator.py) 三种 Coordinator (NoPrefix/Unitary/Hybrid)
kv_cache_interface.py(file:///workspace/vllm/v1/kv_cache_interface.py) KVCacheSpec 层次体系、KVCacheConfig、量化模式
kv_cache_metrics.py(file:///workspace/vllm/v1/core/kv_cache_metrics.py) 采样式 Block 生命周期指标收集
kv_offload/base.py(file:///workspace/vllm/v1/kv_offload/base.py) OffloadingManager 抽象接口、LoadStoreSpec
kv_offload/reuse_manager.py(file:///workspace/vllm/v1/kv_offload/reuse_manager.py) FilterReusedOffloadingManager 复用门控
kv_offload/cpu/manager.py(file:///workspace/vllm/v1/kv_offload/cpu/manager.py) CPUOffloadingManager 具体实现
kv_offload/cpu/policies/(file:///workspace/vllm/v1/kv_offload/cpu/policies/) LRU / ARC 缓存替换策略
kv_offload/cpu/gpu_worker.py(file:///workspace/vllm/v1/kv_offload/cpu/gpu_worker.py) GPU↔CPU 异步传输 Handler
kv_offload/cpu/shared_offload_region.py(file:///workspace/vllm/v1/kv_offload/cpu/shared_offload_region.py) mmap 共享内存区域
kv_offload/cpu/spec.py(file:///workspace/vllm/v1/kv_offload/cpu/spec.py) CPUOffloadingSpec 规格工厂
simple_kv_offload/manager.py(file:///workspace/vllm/v1/simple_kv_offload/manager.py) SimpleCPUOffloadScheduler (lazy/eager)
simple_kv_offload/worker.py(file:///workspace/vllm/v1/simple_kv_offload/worker.py) SimpleCPUOffloadWorker
simple_kv_offload/copy_backend.py(file:///workspace/vllm/v1/simple_kv_offload/copy_backend.py) DMA Copy 后台线程
kv_connector/v1/base.py(file:///workspace/vllm/distributed/kv_transfer/kv_connector/v1/base.py) KVConnectorBase_V1 完整接口定义
kv_transfer_state.py(file:///workspace/vllm/distributed/kv_transfer/kv_transfer_state.py) 全局连接器实例管理
block_table.py(file:///workspace/vllm/v1/worker/block_table.py) BlockTable / MultiGroupBlockTable + Triton slot mapping
block_pool.py(file:///workspace/vllm/v1/core/block_pool.py) BlockPool、BlockHashToBlockMap、FreeKVCacheBlockQueue
相关推荐
掉鱼的猫1 小时前
ReActAgent 使用指南:构建会思考、能行动的 AI Agent
java·llm·agent
pp起床1 小时前
黑马点评 - 短信验证码登录实现
java·开发语言·tomcat
CodeStats2 小时前
《源纹天书》第121-125章:源匠归来——全栈重构与归元圣域的2.0时代
java·开发语言·源纹天书
AI人工智能+电脑小能手2 小时前
【大白话说Java面试题 第154题】【06_Spring篇】第14题:Spring 支持的 Bean 作用域
java·开发语言·spring·面试
wuminyu2 小时前
markword在高并发场景下变化剖析
java·linux·c语言·jvm·c++
组合缺一2 小时前
用 ChatModel 构建 LLM 驱动的 Java 应用
java·开发语言·ai·llm·solon·rag
zzz_23682 小时前
【Java实习面试算法冲刺】哈希!
java·算法·面试
带刺的坐椅2 小时前
ReActAgent 使用指南:构建会思考、能行动的 AI Agent
java·ai·llm·solon·loop·react-agent
happyprince2 小时前
12-vLLM 量化方案全面分析
前端·javascript·vllm