[vLLM] (8) vLLM v1 Simple KV Offload --- System-Level Architecture Deep Dive, Part 2

3.5 manager.py --- Line-by-Line Deep Dive

Lines 70-155: SimpleCPUOffloadScheduler.__init__

```python
def __init__(self, vllm_config, kv_cache_config, cpu_capacity_bytes, lazy_offload=False):
    self.block_size = vllm_config.cache_config.block_size
    self.cpu_kv_cache_config = self._derive_cpu_config(kv_cache_config, cpu_capacity_bytes)
    self.num_cpu_blocks = self.cpu_kv_cache_config.num_blocks

    # Find full attention group
    self.fa_gidx = -1
    for g_idx, g in enumerate(self.cpu_kv_cache_config.kv_cache_groups):
        if isinstance(g.kv_cache_spec, FullAttentionSpec):
            self.fa_gidx = g_idx
            break
    assert 0 <= self.fa_gidx < len(self.cpu_kv_cache_config.kv_cache_groups)
```

Line-by-Line Analysis

  • Same block_size: unlike kv_offload, this module uses the same block_size on GPU and CPU
  • fa_gidx: index of the FullAttention group
    • Used in update_state_after_alloc to locate prefix cache matches
    • The assert requires it to exist --- models without FullAttention do not support CPU offload
  • DCP/PCP restriction: `assert dcp_world_size == 1 and pcp_world_size == 1`
    • Neither Decode Context Parallel nor Prefill Context Parallel is supported
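The group lookup above can be sketched with simplified stand-in types (FullAttentionSpec, SlidingWindowSpec, and KVCacheGroup here are toy placeholders, not the real vLLM classes):

```python
from dataclasses import dataclass

class FullAttentionSpec:      # stand-in for vLLM's FullAttentionSpec
    pass

class SlidingWindowSpec:      # stand-in for a non-full-attention spec
    pass

@dataclass
class KVCacheGroup:
    kv_cache_spec: object

def find_full_attention_group(groups):
    """Return the index of the first FullAttention group, or -1 if absent."""
    for g_idx, g in enumerate(groups):
        if isinstance(g.kv_cache_spec, FullAttentionSpec):
            return g_idx
    return -1

# A hybrid model: one sliding-window group, one full-attention group.
groups = [KVCacheGroup(SlidingWindowSpec()), KVCacheGroup(FullAttentionSpec())]
assert find_full_attention_group(groups) == 1
```

A return value of -1 corresponds to the case the real code rejects with its assert.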

Lines 156-184: _derive_cpu_config

```python
@staticmethod
def _derive_cpu_config(gpu_config, cpu_capacity_bytes):
    gpu_total_bytes = sum(t.size for t in gpu_config.kv_cache_tensors)
    num_gpu_blocks = gpu_config.num_blocks
    num_cpu_blocks = max(1, num_gpu_blocks * cpu_capacity_bytes // gpu_total_bytes)

    cpu_tensors = [
        KVCacheTensor(
            size=t.size // num_gpu_blocks * num_cpu_blocks,
            shared_by=list(t.shared_by),
        )
        for t in gpu_config.kv_cache_tensors
    ]

    return KVCacheConfigCls(
        num_blocks=num_cpu_blocks,
        kv_cache_tensors=cpu_tensors,
        kv_cache_groups=gpu_config.kv_cache_groups,  # Same groups!
    )
```

Line-by-Line Analysis

  • CPU block count: num_gpu_blocks * cpu_capacity_bytes // gpu_total_bytes
    • Proportional scaling: if CPU capacity is 2x the GPU's, the CPU gets 2x the blocks
  • CPU tensor size: t.size // num_gpu_blocks * num_cpu_blocks
    • First compute the per-block size: t.size / num_gpu_blocks
    • Then multiply by the CPU block count to get the total CPU tensor size
  • kv_cache_groups reused as-is: the CPU side uses the same KV cache group structure as the GPU
    • This is the core design of simple_kv_offload: reuse vLLM's native BlockPool + Coordinator
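The scaling arithmetic can be checked with concrete numbers (the sizes below are made up for illustration):

```python
mib = 1 << 20
block_bytes = 2 * mib                             # hypothetical bytes per KV block
num_gpu_blocks = 1024
gpu_total = num_gpu_blocks * block_bytes          # 2 GiB of GPU KV cache
cpu_capacity = 2 * gpu_total                      # 4 GiB CPU budget

# Same formula as _derive_cpu_config: scale block count by byte ratio.
num_cpu_blocks = max(1, num_gpu_blocks * cpu_capacity // gpu_total)
assert num_cpu_blocks == 2048                     # exactly 2x the GPU blocks

# CPU tensor size: per-block size times the CPU block count.
cpu_tensor_size = gpu_total // num_gpu_blocks * num_cpu_blocks
assert cpu_tensor_size == cpu_capacity
```

Note the `// num_gpu_blocks` happens before the multiplication, so the per-block size stays exact as long as tensor sizes are multiples of the block count.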

Lines 208-228: get_num_new_matched_tokens

```python
def get_num_new_matched_tokens(self, request, num_computed_tokens):
    skipped = num_computed_tokens // self.block_size
    remaining_hashes = request.block_hashes[skipped:]

    if not remaining_hashes:
        return 0, False

    max_hit_len = request.num_tokens - 1 - num_computed_tokens
    if max_hit_len <= 0:
        return 0, False

    _, hit_length = self.cpu_coordinator.find_longest_cache_hit(
        remaining_hashes, max_hit_len
    )

    if hit_length > 0:
        return hit_length, True
    return 0, False
```

Line-by-Line Analysis

  • Skipping computed blocks: skipped = num_computed_tokens // block_size --- blocks already computed are excluded from the query
  • remaining_hashes: the list of block hashes starting after the computed portion
  • max_hit_len cap: at least the last token must always be recomputed (a vLLM requirement)
  • is_async=True: returning True marks this as an asynchronous load (the transfer must be awaited)
  • cpu_coordinator.find_longest_cache_hit: uses vLLM's native prefix-cache matching algorithm
    • Searches for a run of consecutive hits in the CPU block_pool's cached_block_hash_to_block
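The skip/cap logic above can be isolated into a small sketch; `cpu_hit_len_fn` below is a hypothetical stand-in for the coordinator's lookup, not the real API:

```python
def new_matched_tokens(block_hashes, num_computed_tokens, num_tokens,
                       block_size, cpu_hit_len_fn):
    """Skip GPU-computed blocks, cap the hit so at least one token is
    recomputed, then ask the CPU cache for the longest hit."""
    skipped = num_computed_tokens // block_size
    remaining = block_hashes[skipped:]
    if not remaining:
        return 0, False
    max_hit_len = num_tokens - 1 - num_computed_tokens
    if max_hit_len <= 0:
        return 0, False
    hit = cpu_hit_len_fn(remaining, max_hit_len)
    return (hit, True) if hit > 0 else (0, False)

# 3 hashed blocks of 16 tokens; the first block is already computed on GPU.
hashes = ["h0", "h1", "h2"]
# Pretend the CPU cache holds one more full block (16 tokens).
hit, is_async = new_matched_tokens(hashes, 16, 48, 16,
                                   lambda h, cap: min(16, cap))
assert (hit, is_async) == (16, True)
```

The cap also explains why a fully cached prompt still returns at most num_tokens - 1 matched tokens.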

Lines 232-313: update_state_after_alloc --- Core of the Load Path

```python
def update_state_after_alloc(self, request, blocks, num_external_tokens):
    # Condensed excerpt: some local setup is elided in the original source.
    req_id = request.request_id
    block_ids_by_group = blocks.get_block_ids()
    num_groups = len(block_ids_by_group)
    kv_cache_groups = self.cpu_kv_cache_config.kv_cache_groups

    # Eager store tracking
    if not self._lazy_mode and req_id not in self._reqs_to_store:
        self._reqs_to_store[req_id] = StoreRequestState(
            request=request,
            block_ids=tuple([] for _ in range(num_groups)),
            num_stored_blocks=[0] * num_groups,
        )

    if num_external_tokens == 0:
        return

    num_blocks_to_load = num_external_tokens // self.block_size
    assert num_blocks_to_load > 0

    skipped = sum(blk.block_hash is not None for blk in blocks.blocks[self.fa_gidx])
    num_computed_tokens = skipped * self.block_size
    total_computed_tokens = num_computed_tokens + num_external_tokens
    hashes_to_load = request.block_hashes[skipped : skipped + num_blocks_to_load]

    # Find CPU cached blocks across all groups (max_hit_len computation elided)
    cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(
        hashes_to_load, max_hit_len
    )
    assert hit_length == num_external_tokens

    # Build transfer pairs across all groups
    gpu_block_ids, cpu_block_ids, cpu_blocks_to_touch = [], [], []
    for g in range(num_groups):
        cpu_blocks_g = cpu_hit_blocks[g]
        n_ext_g = len(cpu_blocks_g)
        if n_ext_g == 0:
            continue

        group_gpu_ids = block_ids_by_group[g]
        g_block_size = kv_cache_groups[g].kv_cache_spec.block_size
        n_computed_g = cdiv(total_computed_tokens, g_block_size)
        gpu_ext_start = n_computed_g - n_ext_g

        for i, cpu_blk in enumerate(cpu_blocks_g):
            if cpu_blk.is_null:
                continue
            gpu_block_ids.append(group_gpu_ids[gpu_ext_start + i])
            cpu_block_ids.append(cpu_blk.block_id)
            cpu_blocks_to_touch.append(cpu_blk)

    # Touch to prevent eviction/freeing during async load
    self.cpu_block_pool.touch(cpu_blocks_to_touch)
    self._gpu_block_pool.touch([self._gpu_block_pool.blocks[bid] for bid in gpu_block_ids])

    self._reqs_to_load[req_id] = LoadRequestState(
        request=request, transfer_meta=TransferMeta(gpu_block_ids, cpu_block_ids)
    )
```

Line-by-Line Deep Analysis

Counting already-computed blocks:

```python
skipped = sum(blk.block_hash is not None for blk in blocks.blocks[self.fa_gidx])
```

  • blocks.blocks[fa_gidx]: the block list of the FullAttention group
  • block_hash is not None: blocks that have a hash are computed and cached
  • skipped: the number of blocks to skip (the GPU prefix cache hit portion)

CPU cache lookup:

```python
hashes_to_load = request.block_hashes[skipped : skipped + num_blocks_to_load]
cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(...)
assert hit_length == num_external_tokens
```

  • Looks up hash matches for the remaining blocks on the CPU side
  • The assert requires a complete hit: all num_external_tokens must come from the CPU cache

Locating GPU block IDs (the key algorithm):

```python
n_computed_g = cdiv(total_computed_tokens, g_block_size)
gpu_ext_start = n_computed_g - n_ext_g
```

  • total_computed_tokens = num_computed_tokens + num_external_tokens
  • n_computed_g: total number of computed blocks in this group
  • n_ext_g: number of externally loaded blocks
  • gpu_ext_start: starting position of the external blocks within the GPU block list
    • Derivation: the external blocks sit at the tail of the computed range
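A worked example of the tail-position derivation, with made-up token counts:

```python
def cdiv(a, b):
    """Ceiling division."""
    return -(-a // b)

# A request with 32 tokens already computed on GPU loads 32 more from CPU.
num_computed_tokens = 32
num_external_tokens = 32
total_computed_tokens = num_computed_tokens + num_external_tokens  # 64

g_block_size = 16
n_ext_g = num_external_tokens // g_block_size             # 2 blocks from CPU
n_computed_g = cdiv(total_computed_tokens, g_block_size)  # 4 blocks in total
gpu_ext_start = n_computed_g - n_ext_g

# The external blocks occupy the tail of the computed range: indices 2 and 3.
assert (gpu_ext_start, n_computed_g) == (2, 4)
```

cdiv handles groups whose block_size does not evenly divide the token count, e.g. a partially filled last block.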

Touch operations:

  • cpu_block_pool.touch(cpu_blocks_to_touch): increments CPU block refcounts so they cannot be evicted during the load
  • gpu_block_pool.touch(...): increments GPU block refcounts so they cannot be freed during the load

Lines 315-364: build_connector_meta

```python
def build_connector_meta(self, scheduler_output):
    # --- Stores ---
    store_event = -1
    store_gpu, store_cpu, store_req_ids = self.prepare_store_specs(scheduler_output)
    if store_gpu:
        store_event = self._store_event_counter
        self._store_event_counter += 1
        self._store_event_to_blocks[store_event] = TransferMeta(store_gpu, store_cpu)
        if store_req_ids:
            self._store_event_to_reqs[store_event] = store_req_ids
            for req_id in store_req_ids:
                store_state = self._reqs_to_store.get(req_id)
                if store_state is not None:
                    store_state.store_events.add(store_event)

    # --- Loads ---
    load_event = -1
    load_gpu, load_cpu, load_req_ids = [], [], []
    for req_id, load_state in self._reqs_to_load.items():
        if load_state.load_event is not None:
            continue
        load_gpu.extend(load_state.transfer_meta.gpu_block_ids)
        load_cpu.extend(load_state.transfer_meta.cpu_block_ids)
        load_req_ids.append(req_id)
    if load_req_ids:
        load_event = self._load_event_counter
        self._load_event_counter += 1
        for req_id in load_req_ids:
            self._reqs_to_load[req_id].load_event = load_event
        self._load_event_to_reqs[load_event] = load_req_ids

    return SimpleCPUOffloadMetadata(
        load_event=load_event, load_gpu_blocks=load_gpu, load_cpu_blocks=load_cpu,
        load_event_to_reqs=self._load_event_to_reqs,
        store_event=store_event, store_gpu_blocks=store_gpu, store_cpu_blocks=store_cpu,
        need_flush=bool(scheduler_output.preempted_req_ids),
    )
```

Line-by-Line Analysis

Store side:

  • Calls prepare_store_specs to obtain the block ID lists
  • If any blocks need to be stored, assigns a monotonically increasing store_event index
  • store_event_to_blocks: records the event→blocks mapping, used when the event completes
  • store_event_to_reqs: used only in Eager mode, records the event→requests mapping

Load side:

  • Iterates over _reqs_to_load, collecting every request not yet assigned an event
  • Merges all requests' block IDs into flat lists
  • Assigns a single load_event to all of these requests
    • Note: every Load in the same step shares one event

need_flush: bool(scheduler_output.preempted_req_ids) --- set to True whenever a preemption occurred
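The flat one-event-per-step model for loads can be sketched as a small bookkeeping class (a toy illustration, not the real scheduler state):

```python
class LoadEventBook:
    """At most one load event per step, shared by all pending requests."""
    def __init__(self):
        self._counter = 0
        self.event_to_reqs = {}

    def assign(self, pending_req_ids):
        if not pending_req_ids:
            return -1               # -1 signals "no loads this step"
        event = self._counter
        self._counter += 1
        self.event_to_reqs[event] = list(pending_req_ids)
        return event

book = LoadEventBook()
assert book.assign([]) == 0 - 1                   # empty step: no event
assert book.assign(["req-a", "req-b"]) == 0       # both share event 0
assert book.event_to_reqs[0] == ["req-a", "req-b"]
```

Sharing one event keeps the worker side simple: a single CUDA event query answers "are all of this step's loads done?".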

Lines 375-441: _prepare_lazy_store_specs

```python
def _prepare_lazy_store_specs(self):
    gpu_pool = self._gpu_block_pool
    if gpu_pool is None or self._target_free <= 0:
        return [], [], []

    free_queue = gpu_pool.free_block_queue
    cpu_pool = self.cpu_block_pool
    num_cpu_free = cpu_pool.get_num_free_blocks()

    # Validate cursor
    if self._cursor is not None and self._cursor.ref_cnt > 0:
        self._cursor = None

    # Determine start node
    if self._cursor is None:
        node = free_queue.fake_free_list_head.next_free_block
    else:
        node = self._cursor.next_free_block

    tail = free_queue.fake_free_list_tail
    gpu_ids, block_hashes, covered, last_visited = [], [], 0, self._cursor

    while (
        node is not None and node is not tail
        and covered < self._target_free
        and len(gpu_ids) < num_cpu_free
    ):
        last_visited = node
        bhash = node.block_hash

        if (
            bhash is not None
            and not node.is_null
            and cpu_pool.cached_block_hash_to_block.get_one_block(bhash) is None
        ):
            gpu_ids.append(node.block_id)
            block_hashes.append(bhash)

        covered += 1
        node = node.next_free_block

    self._cursor = last_visited

    # Batch-allocate CPU blocks and stamp hashes
    if gpu_ids:
        cpu_blocks = cpu_pool.get_new_blocks(len(gpu_ids))
        cpu_ids = [blk.block_id for blk in cpu_blocks]
        for cpu_blk, bhash in zip(cpu_blocks, block_hashes):
            cpu_blk._block_hash = bhash
        gpu_pool.touch([gpu_pool.blocks[bid] for bid in gpu_ids])
    else:
        cpu_ids = []

    return gpu_ids, cpu_ids, []
```

Line-by-Line Deep Analysis

Cursor validation:

```python
if self._cursor is not None and self._cursor.ref_cnt > 0:
    self._cursor = None
```

  • ref_cnt > 0 means the block has been re-allocated (it is no longer in the free queue)
  • The cursor is then stale and is reset to None

Cursor resumption:

```python
if self._cursor is None:
    node = free_queue.fake_free_list_head.next_free_block
else:
    node = self._cursor.next_free_block
```

  • Start from the head, or continue from the last position --- an incremental scan

Filter conditions (triple filter):

  1. bhash is not None: the block has a hash (computed, not an empty block)
  2. not node.is_null: not a null block (no sliding-window/Mamba padding)
  3. cpu_pool.cached_block_hash_to_block.get_one_block(bhash) is None: not yet cached on the CPU side

Termination conditions (four of them):

  1. node is None: end of the linked list
  2. node is tail: the sentinel node was reached
  3. covered >= target_free: enough free blocks have been covered
  4. len(gpu_ids) >= num_cpu_free: the CPU block pool is full

CPU block stamping:

```python
cpu_blk._block_hash = bhash
```

  • Sets the private attribute _block_hash directly instead of going through a public API
  • Reason: the block was just allocated and holds no data yet; the hash is "pre-stamped" from the GPU block

Touching GPU blocks:

  • Increments the refcount so the blocks cannot be evicted or re-allocated during the DMA transfer
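The cursor walk over the free queue can be condensed into a runnable sketch; the Node type and the dict-based CPU cache below are simplifications of the real free-queue nodes and cached_block_hash_to_block:

```python
class Node:
    """Toy free-queue node."""
    def __init__(self, block_id, block_hash=None):
        self.block_id = block_id
        self.block_hash = block_hash
        self.ref_cnt = 0
        self.next_free_block = None

def scan_for_offload(head, cursor, target_free, cpu_cached):
    """Walk the free queue from the cursor (or head), picking hashed blocks
    not yet cached on CPU. Returns (picked_ids, new_cursor)."""
    if cursor is not None and cursor.ref_cnt > 0:
        cursor = None                      # cursor block was re-allocated
    node = head if cursor is None else cursor.next_free_block
    picked, covered, last = [], 0, cursor
    while node is not None and covered < target_free:
        last = node
        if node.block_hash is not None and node.block_hash not in cpu_cached:
            picked.append(node.block_id)
        covered += 1
        node = node.next_free_block
    return picked, last

# Free queue: block 0 (hash "a", already on CPU) → 1 (hash "b") → 2 (no hash).
n0, n1, n2 = Node(0, "a"), Node(1, "b"), Node(2)
n0.next_free_block, n1.next_free_block = n1, n2
picked, cursor = scan_for_offload(n0, None, target_free=3, cpu_cached={"a"})
assert picked == [1]         # only the uncached, hashed block is selected
assert cursor is n2          # next call resumes after the last visited node
```

Resuming from the returned cursor is what makes the lazy mode an incremental scan rather than a full-queue sweep every step.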

Lines 443-579: _prepare_eager_store_specs

```python
def _prepare_eager_store_specs(self, scheduler_output):
    merged_gpu_block_ids, merged_cpu_block_ids, req_ids = [], [], []
    gpu_block_pool = self._gpu_block_pool
    cpu_block_pool = self.cpu_block_pool
    num_free = cpu_block_pool.get_num_free_blocks()
    kv_cache_groups = self.cpu_kv_cache_config.kv_cache_groups
    num_groups = len(kv_cache_groups)
    gpu_blocks_this_step: set[int] = set()

    for req_id, new_block_id_groups, preempted in yield_req_data(scheduler_output):
        state = self._reqs_to_store.get(req_id)
        if state is None or state.finished:
            continue

        # Accumulate new block IDs
        if preempted:
            state.block_ids = tuple([] for _ in range(num_groups))
            state.num_stored_blocks = [0] * num_groups
        if new_block_id_groups:
            for g in range(min(num_groups, len(new_block_id_groups))):
                if new_block_id_groups[g] is not None:
                    state.block_ids[g].extend(new_block_id_groups[g])
```

Line-by-Line Analysis

  • yield_req_data: extracts each request's new block IDs and preemption flag from scheduler_output
  • Preemption reset: after a request is preempted, its accumulated block IDs and counters are cleared
    • Reason: preemption frees the blocks, so the previous accumulation is no longer valid
  • Block ID accumulation: block IDs are accumulated per request, per group, across steps
    • state.block_ids[g] is a list; newly allocated block IDs are appended each step

Phase 1: scan and classify (lines 492-545):

```python
        confirmed_tokens = req.num_computed_tokens - req.num_output_placeholders

        for g in range(num_groups):
            already_stored_g = state.num_stored_blocks[g]
            group_gpu_ids = block_ids_by_group[g]
            g_block_size = kv_cache_groups[g].kv_cache_spec.block_size
            ready_blocks_g = confirmed_tokens // g_block_size
            scannable = group_gpu_ids[already_stored_g:ready_blocks_g]
```

Line-by-Line Analysis

  • confirmed_tokens: num_computed_tokens - num_output_placeholders
    • num_output_placeholders: the number of speculative-decoding placeholders
    • Only blocks whose computation is confirmed are stored, to avoid persisting unfinished data
  • scannable range: [already_stored : ready_blocks]
    • already_stored_g: number of blocks already processed
    • ready_blocks_g: number of blocks confirmed computed so far
    • Only this incremental range is scanned
```python
            for gpu_block_id in scannable:
                gpu_block = gpu_block_pool.blocks[gpu_block_id]
                if gpu_block.is_null:
                    advanced_per_group[g] += 1
                    continue
                bhash_with_group = gpu_block.block_hash
                if bhash_with_group is None:
                    break
                if (
                    gpu_block_id in gpu_blocks_this_step
                    or cpu_block_pool.cached_block_hash_to_block.get_one_block(bhash_with_group) is not None
                ):
                    advanced_per_group[g] += 1
                    continue
                if num_free <= 0:
                    out_of_space = True
                    break
                num_free -= 1
                gpu_block_ids.append(gpu_block_id)
                block_hashes_to_store.append(bhash_with_group)
                advanced_per_group[g] += 1
```

Line-by-Line Analysis

  • Null blocks skipped: sliding-window/Mamba padding blocks are not stored
  • Missing hash terminates: bhash is None → break (not continue)
    • Reason: a missing hash means the block is newly computed but not yet hashed
    • Subsequent blocks cannot have hashes either (hashing is sequential), so terminate immediately
  • Duplicate checks:
    • gpu_blocks_this_step: blocks already scheduled for a Store in this same step
    • cached_block_hash_to_block: blocks already cached on the CPU
  • Out of space: num_free <= 0 → terminate
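The per-block decisions above (skip, store, or stop) can be captured in a standalone sketch; the dict-based block records are a simplification of the real block objects:

```python
def classify_scannable(scannable, blocks, planned, cpu_cached, num_free):
    """Walk the incremental block range and decide store / skip / stop."""
    to_store, advanced = [], 0
    for bid in scannable:
        blk = blocks[bid]
        if blk["is_null"]:
            advanced += 1                  # padding block: skip, keep going
            continue
        bhash = blk["hash"]
        if bhash is None:
            break                          # later blocks can't be hashed either
        if bid in planned or bhash in cpu_cached:
            advanced += 1                  # duplicate: skip, keep going
            continue
        if num_free <= 0:
            break                          # CPU pool exhausted
        num_free -= 1
        to_store.append((bid, bhash))
        advanced += 1
    return to_store, advanced

blocks = {
    1: {"is_null": False, "hash": "h1"},   # already cached on CPU
    2: {"is_null": True,  "hash": None},   # sliding-window padding block
    3: {"is_null": False, "hash": "h3"},   # new: gets stored
    4: {"is_null": False, "hash": None},   # not yet hashed -> stop here
    5: {"is_null": False, "hash": "h5"},   # never reached
}
stored, adv = classify_scannable([1, 2, 3, 4, 5], blocks,
                                 planned=set(), cpu_cached={"h1"}, num_free=8)
assert stored == [(3, "h3")]
assert adv == 3                            # blocks 1, 2, 3 advanced; stopped at 4
```

The distinction between continue and break is the crux: skips still advance the stored-block cursor, while a missing hash or exhausted pool must not.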

Lines 581-648: update_connector_output --- Completion Handling

```python
def update_connector_output(self, connector_output):
    # --- Load completions ---
    for req_id in list(connector_output.finished_recving or []):
        self._cleanup_load_request(req_id)

    # --- Store completions ---
    meta = connector_output.kv_connector_worker_meta
    if not isinstance(meta, SimpleCPUOffloadWorkerMetadata):
        return
    for event_idx, count in meta.completed_store_events.items():
        total = self._store_event_pending_counts.get(event_idx, 0) + count
        if total >= self._expected_worker_count:
            self._store_event_pending_counts.pop(event_idx, None)
            self._process_store_event(event_idx)
        else:
            self._store_event_pending_counts[event_idx] = total
```

Line-by-Line Analysis

  • Load completion: calls _cleanup_load_request directly --- no TP accumulation needed
    • Reason: each worker loads independently, so any worker's completion can notify the scheduler
  • Store completion: TP accumulation scheme
    • total = pending_count + new_count
    • total >= world_size → all workers have finished → process the event
    • Otherwise stash the count and wait for the remaining workers to report
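The accumulate-until-world-size protocol can be modeled as a tiny tracker (a toy stand-in for the scheduler's _store_event_pending_counts bookkeeping):

```python
class StoreCompletionTracker:
    """Count per-event completions until every TP worker has reported."""
    def __init__(self, world_size):
        self.world_size = world_size
        self.pending = {}       # event_idx -> completions seen so far
        self.done = []          # events with all workers reported

    def report(self, event_idx, count=1):
        total = self.pending.get(event_idx, 0) + count
        if total >= self.world_size:
            self.pending.pop(event_idx, None)
            self.done.append(event_idx)     # all workers finished this event
        else:
            self.pending[event_idx] = total

t = StoreCompletionTracker(world_size=2)
t.report(7)                # first worker reports: event still pending
assert t.pending == {7: 1} and t.done == []
t.report(7)                # second worker reports: event complete
assert t.pending == {} and t.done == [7]
```

Processing a store only after all TP workers report guarantees that every shard of the KV data has actually landed in CPU memory before the blocks are registered as prefix cache.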

Lines 625-648: _process_store_completion

```python
def _process_store_completion(self, gpu_block_ids, cpu_block_ids):
    cpu_blocks = [self.cpu_block_pool.blocks[bid] for bid in cpu_block_ids]

    for cpu_block in cpu_blocks:
        bhash = cpu_block.block_hash
        assert bhash is not None
        self.cpu_block_pool.cached_block_hash_to_block.insert(bhash, cpu_block)

    # Free CPU and GPU blocks to turn them into prefix cache
    self.cpu_block_pool.free_blocks(cpu_blocks)
    self._gpu_block_pool.free_blocks(
        self._gpu_block_pool.blocks[bid] for bid in gpu_block_ids
    )
```

Line-by-Line Analysis

  • Register in the cache: cached_block_hash_to_block.insert(hash, block)
    • Registers the CPU block in the prefix-cache mapping
    • From then on, other requests can find this block by hash and Load it
  • Release references: free_blocks decrements refcounts
    • The CPU block becomes prefix cache (its reference is held by the cache mapping)
    • The GPU block returns to the free pool (available for new requests)
  • Design point: once the Store completes, the GPU block no longer needs to retain the KV data
    • The CPU already holds a copy, which can be loaded back whenever needed
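The register-then-release sequence can be sketched with toy stand-ins (PrefixCache and the callback-based free functions below are illustrations, not the real block-pool API):

```python
class PrefixCache:
    """Toy hash -> block map, standing in for cached_block_hash_to_block."""
    def __init__(self):
        self._map = {}
    def insert(self, bhash, block):
        self._map[bhash] = block
    def get_one_block(self, bhash):
        return self._map.get(bhash)

def process_store_completion(cpu_blocks, cache, free_cpu, free_gpu, gpu_ids):
    # Register each CPU block under its hash, then drop the transfer refs:
    # the CPU copies live on as prefix cache, the GPU blocks become reusable.
    for blk in cpu_blocks:
        assert blk["hash"] is not None
        cache.insert(blk["hash"], blk)
    free_cpu(cpu_blocks)
    free_gpu(gpu_ids)

cache = PrefixCache()
freed = {"cpu": 0, "gpu": 0}
process_store_completion(
    [{"hash": "h9", "id": 42}], cache,
    lambda blks: freed.__setitem__("cpu", len(blks)),
    lambda ids: freed.__setitem__("gpu", len(ids)),
    [17])
assert cache.get_one_block("h9")["id"] == 42    # now discoverable by hash
assert freed == {"cpu": 1, "gpu": 1}            # both sides released
```

The ordering matters: the hash must be registered before the CPU refcount drops, otherwise the block could be evicted before any request can find it.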

Lines 654-680: request_finished --- End-of-Request Handling

```python
def request_finished(self, request, block_ids):
    req_id = request.request_id

    # Handle load: defer if in-flight
    load_state = self._reqs_to_load.get(req_id)
    if load_state is not None:
        if load_state.load_event is not None:
            load_state.finished = True  # Defer: load in-flight
        else:
            self._cleanup_load_request(req_id)

    # Handle store (eager only): defer if stores in-flight
    if not self._lazy_mode:
        store_state = self._reqs_to_store.get(req_id)
        if store_state is not None:
            if store_state.store_events:
                store_state.finished = True  # Defer: stores in-flight
            else:
                self._cleanup_store_request(req_id)

    return False, None
```

Line-by-Line Analysis

  • Deferred cleanup: if a Load/Store is still in flight, mark finished=True but do not clean up yet
    • Cleanup happens after the transfer completes, in update_connector_output / _process_store_event
  • Immediate cleanup: if no transfer is in flight, clean up right away
  • Returns (False, None): always False --- the GPU blocks are protected by refcounts, so the Scheduler may free them immediately
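The defer-or-cleanup decision is a small pattern worth isolating; the dict-based state below is a simplified stand-in for LoadRequestState:

```python
def on_request_finished(load_state, cleanup):
    """If a load is still in flight, defer cleanup to transfer completion."""
    if load_state is None:
        return
    if load_state.get("load_event") is not None:
        load_state["finished"] = True       # cleaned up later, on completion
    else:
        cleanup()                           # nothing in flight: clean up now

cleaned = []
# Case 1: load event assigned -> transfer in flight -> defer.
state = {"load_event": 3, "finished": False}
on_request_finished(state, lambda: cleaned.append(True))
assert state["finished"] is True and cleaned == []

# Case 2: no event assigned -> clean up immediately.
state2 = {"load_event": None, "finished": False}
on_request_finished(state2, lambda: cleaned.append(True))
assert cleaned == [True]
```

The finished flag is the handoff: whoever observes the transfer completing checks it and performs the cleanup the finish handler skipped.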

4. Complete Execution Flow Summary

4.1 Initialization Flow

```text
1. vLLM startup → configure kv_connector_extra_config
2. Scheduler creation:
   SimpleCPUOffloadScheduler(vllm_config, kv_cache_config, cpu_capacity_bytes)
   → _derive_cpu_config() → CPU KVCacheConfig
   → get_kv_cache_coordinator() → cpu_coordinator + cpu_block_pool
   → bind_gpu_block_pool() → reference to the GPU block pool

3. Worker creation:
   SimpleCPUOffloadWorker(vllm_config, kv_cache_config, cpu_capacity_bytes)
   → register_kv_caches(kv_caches)
     → _repr_tensor + dedup by data_ptr
     → Classify layout (FlashAttn K/V split)
     → Allocate CPU tensors + pin_tensor
     → Create low-priority CUDA streams
     → DmaCopyBackend.init() → build_params + _copy_loop thread
```

4.2 Store Flow (GPU→CPU Offload)

```text
Step N:
  Scheduler:
    1. build_connector_meta()
       ├── prepare_store_specs()
       │   ├── [Lazy] cursor walk → select GPU blocks near eviction
       │   └── [Eager] yield_req_data → scan confirmed blocks per request
       └── Allocate CPU blocks + stamp hashes + touch GPU blocks

    2. Send SimpleCPUOffloadMetadata to Worker

  Worker:
    3. bind_connector_metadata(metadata)
    4. [Model Execution --- KV data written to GPU]
    5. get_finished()
       ├── launch_copy(store_gpu→store_cpu, is_store=True)
       │   └── DmaCopyBackend → _queue.put → _copy_loop → cuMemcpyBatchAsync
       ├── _poll_stream_events(store) → _completed_store_events
       └── return (None, finished_recving)

    6. build_connector_worker_meta()
       └── SimpleCPUOffloadWorkerMetadata({event_idx: 1})

Step N+1:
  Scheduler:
    7. update_connector_output()
       ├── accumulate store event counts from all workers
       ├── if count >= world_size:
       │   └── _process_store_completion()
       │       ├── cached_block_hash_to_block.insert → prefix cache
       │       ├── cpu_pool.free_blocks → release CPU refs
       │       └── gpu_pool.free_blocks → release GPU refs
       └── [Eager] cleanup StoreRequestState
```

4.3 Load Flow (CPU→GPU Restore)

```text
Step N:
  Scheduler:
    1. get_num_new_matched_tokens(request, num_computed)
       └── cpu_coordinator.find_longest_cache_hit → (hit_length, is_async=True)

    2. update_state_after_alloc(request, blocks, num_external_tokens)
       ├── find CPU blocks by hash across all groups
       ├── Build GPU↔CPU block ID pairs
       ├── Touch CPU + GPU blocks (prevent eviction)
       └── Register in _reqs_to_load

    3. build_connector_meta()
       └── Collect all pending loads → assign load_event

  Worker:
    4. bind_connector_metadata(metadata)
    5. [Model Execution]
    6. get_finished()
       ├── launch_copy(load_cpu→load_gpu, is_store=False)
       ├── _poll_stream_events(load) → finished_recving = req_ids
       └── return (None, finished_recving)

Step N+1:
  Scheduler:
    7. update_connector_output()
       └── For each req_id in finished_recving:
           └── _cleanup_load_request() → free CPU/GPU touch refs
```

4.4 Preemption Handling Flow

```text
1. Scheduler detects preemption
2. build_connector_meta() → need_flush=True
3. Worker: handle_preemptions()
   └── _flush_and_sync_all()
       ├── Synchronize all load events → update _load_hwm
       ├── Synchronize all store events → update _store_hwm
       ├── Clear event lists
       └── Now all blocks are safe to reuse
4. Scheduler: request_finished(request)
   └── Defer cleanup if transfers in-flight
```

5. Design Patterns and Architecture Summary

5.1 Design Patterns

| Pattern | Where Applied | Notes |
| --- | --- | --- |
| Strategy | lazy_offload flag | Lazy vs Eager store policies are switchable |
| Producer-consumer | DmaCopyBackend | SimpleQueue + background thread |
| Observer | Event + HWM mechanism | Asynchronous, event-driven completion notification |
| Lazy initialization | _batch_memcpy_fn | CUDA Driver API resolved on first use |
| Refcount protection | touch() / free_blocks() | Prevents in-flight blocks from being evicted |
| TP consistency protocol | Store event count accumulation | Cross-worker accumulate-and-confirm |

5.2 Key Design Decisions

  1. cuMemcpyBatchAsync --- uses the CUDA Driver API instead of a custom kernel, exploiting the hardware DMA engines
  2. Background thread --- DMA transfers run on a dedicated thread, keeping the worker's main thread free
  3. Same block_size --- identical GPU/CPU block sizes simplify address computation and block management
  4. Reuse of the native BlockPool/Coordinator --- no bespoke caching policy; vLLM's existing prefix cache is leveraged
  5. Flat event model --- at most 1 load_event + 1 store_event per step
  6. Deferred transfer submission --- submitted in get_finished() (after model execution) to hide CPU-side overhead
  7. Low-priority CUDA streams --- KV transfers yield to model computation
  8. cudaHostRegister workaround --- avoids PyTorch's power-of-two allocation waste

5.3 Performance Considerations

| Optimization | Technique | Effect |
| --- | --- | --- |
| Batched DMA transfer | cuMemcpyBatchAsync | Moves many blocks in a single API call |
| Vectorized address computation | numpy broadcasting | Computes all layers × all blocks addresses at once |
| Background thread | DmaCopyBackend._copy_loop | CPU-side overhead fully asynchronous |
| Deferred transfer submission | get_finished() | Hides ~5 ms of block-copy overhead |
| Low-priority stream | Stream(priority=low_pri) | KV I/O yields to inference compute |
| Pinned memory | cudaHostRegister | Direct GPU DMA into CPU memory |
| Incremental cursor scan | Lazy-mode _cursor | Never rescans already-processed blocks |
| Touch protection | touch() + free_blocks() | Prevents in-flight blocks from being reused |

6. Configuration Parameters

| Parameter | Config Path | Default | Description |
| --- | --- | --- | --- |
| cpu_capacity_bytes | constructor argument | required | CPU memory budget for offload, in bytes |
| lazy_offload | constructor argument | False | Store policy: True = Lazy, False = Eager |
| block_size | cache_config | 16 | Block size shared by GPU and CPU |
| enable_kv_cache_events | kv_events_config | False | Whether KV cache event monitoring is enabled |