16-vLLM CUDA_C++ 内核层深度分析

vLLM CUDA/C++ 内核层深度分析

定位: 本文档深入分析 vLLM 的 CUDA/C++ 内核层架构,涵盖构建系统、注意力机制、缓存管理、激活函数、量化、MoE、通信、采样以及平台特定优化等核心内核实现。
#mermaid-svg-RUDXAIIDhscOeAhM{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-RUDXAIIDhscOeAhM .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-RUDXAIIDhscOeAhM .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-RUDXAIIDhscOeAhM .error-icon{fill:#552222;}#mermaid-svg-RUDXAIIDhscOeAhM .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-RUDXAIIDhscOeAhM .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-RUDXAIIDhscOeAhM .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-RUDXAIIDhscOeAhM .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-RUDXAIIDhscOeAhM .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-RUDXAIIDhscOeAhM .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-RUDXAIIDhscOeAhM .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-RUDXAIIDhscOeAhM .marker{fill:#333333;stroke:#333333;}#mermaid-svg-RUDXAIIDhscOeAhM .marker.cross{stroke:#333333;}#mermaid-svg-RUDXAIIDhscOeAhM svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-RUDXAIIDhscOeAhM p{margin:0;}#mermaid-svg-RUDXAIIDhscOeAhM .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-RUDXAIIDhscOeAhM .cluster-label text{fill:#333;}#mermaid-svg-RUDXAIIDhscOeAhM .cluster-label span{color:#333;}#mermaid-svg-RUDXAIIDhscOeAhM .cluster-label span p{background-color:transparent;}#mermaid-svg-RUDXAIIDhscOeAhM .label text,#mermaid-svg-RUDXAIIDhscOeAhM span{fill:#333;color:#333;}#mermaid-svg-RUDXAIIDhscOeAhM .node rect,#mermaid-svg-RUDXAIIDhscOeAhM .node circle,#mermaid-svg-RUDXAIIDhscOeAhM .node ellipse,#mermaid-svg-RUDXAIIDhscOeAhM .node polygon,#mermaid-svg-RUDXAIIDhscOeAhM .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-RUDXAIIDhscOeAhM .rough-node .label text,#mermaid-svg-RUDXAIIDhscOeAhM .node .label text,#mermaid-svg-RUDXAIIDhscOeAhM .image-shape .label,#mermaid-svg-RUDXAIIDhscOeAhM .icon-shape .label{text-anchor:middle;}#mermaid-svg-RUDXAIIDhscOeAhM .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-RUDXAIIDhscOeAhM .rough-node .label,#mermaid-svg-RUDXAIIDhscOeAhM .node .label,#mermaid-svg-RUDXAIIDhscOeAhM .image-shape .label,#mermaid-svg-RUDXAIIDhscOeAhM .icon-shape .label{text-align:center;}#mermaid-svg-RUDXAIIDhscOeAhM .node.clickable{cursor:pointer;}#mermaid-svg-RUDXAIIDhscOeAhM .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-RUDXAIIDhscOeAhM .arrowheadPath{fill:#333333;}#mermaid-svg-RUDXAIIDhscOeAhM .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-RUDXAIIDhscOeAhM .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-RUDXAIIDhscOeAhM .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-RUDXAIIDhscOeAhM .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-RUDXAIIDhscOeAhM .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-RUDXAIIDhscOeAhM .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-RUDXAIIDhscOeAhM .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-RUDXAIIDhscOeAhM .cluster text{fill:#333;}#mermaid-svg-RUDXAIIDhscOeAhM .cluster span{color:#333;}#mermaid-svg-RUDXAIIDhscOeAhM div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-RUDXAIIDhscOeAhM .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-RUDXAIIDhscOeAhM rect.text{fill:none;stroke-width:0;}#mermaid-svg-RUDXAIIDhscOeAhM .icon-shape,#mermaid-svg-RUDXAIIDhscOeAhM .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-RUDXAIIDhscOeAhM .icon-shape p,#mermaid-svg-RUDXAIIDhscOeAhM .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-RUDXAIIDhscOeAhM .icon-shape .label rect,#mermaid-svg-RUDXAIIDhscOeAhM .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-RUDXAIIDhscOeAhM .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-RUDXAIIDhscOeAhM .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-RUDXAIIDhscOeAhM :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 🖥️ 平台特定
NVIDIA CUDA

SM 7.5 - 10.0
AMD ROCm

MI300X/MI325X
CPU x86/ARM

AMX/NEON/VSX
Intel XPU

oneAPI
⚡ 内核类别
注意力内核

Attention Kernels
缓存内核

Cache Kernels
激活函数内核

Activation Kernels
量化内核

Quantization Kernels
MoE 内核

MoE Kernels
通信内核

Communication Kernels
采样内核

Sampling Kernels
🔧 构建系统
Python 绑定

vllm/kernels/
torch.utils.ffi

Extension 编译
第三方库集成

FlashInfer/Marlin/CUTLASS
Triton Kernel

JIT 编译


目录


一、构建系统

1.1 整体架构概述

vLLM 采用混合内核架构 ,与传统深度学习框架(如 PyTorch 的 csrc/ 目录)不同,vLLM 没有单一的 CMakeLists.txt 构建文件。其内核层由以下几部分组成:

组件 位置 说明
Python 绑定 vllm/kernels/ 通过 torch.utils.cpp_extension 加载 C++/CUDA 扩展
Triton Kernels 分散在各模块 JIT 编译的 Triton GPU kernel
第三方库 集成在 model_executor/ FlashInfer, Marlin, CUTLASS 等
自定义 ops vllm/_custom_ops.py 统一的 custom op 注册入口

1.2 Python 绑定层 (vllm/kernels/)

核心文件结构
复制代码
vllm/kernels/
├── __init__.py          # 导出所有内核模块
├── vllm_c.py            # CUDA C++ 扩展绑定 (RMSNorm 等)
├── aiter_ops.py         # AMD ROCm AITER 操作绑定
├── oink_ops.py          # OINK 优化操作绑定
├── xpu_ops.py           # Intel XPU 操作绑定
├── helion/              # Helion 硬件特定配置
│   ├── configs/         # 硬件配置 JSON (H100/H200)
│   └── ops/             # SiLU*FP8 融合操作
└── triton/              # Triton kernel 实现
    └── qkv_padded_fp8_quant.py  # QKV FP8 量化
vllm_c.py - 核心 C++ 扩展绑定

vllm_c.py(file:///workspace/vllm/kernels/vllm_c.py) 是 vLLM 最核心的 Python-C++ 桥接文件,注册了底层的 RMSNorm 和 FusedAddRMSNorm 操作:

python 复制代码
# 文件: /workspace/vllm/kernels/vllm_c.py
# 行号: 21-33

@ir.ops.rms_norm.register_impl(
    "vllm_c", supports_args=rms_no_var_size, supported=CUDA_ALIKE
)
def rms_norm(
    x: Tensor, weight: Tensor | None, epsilon: float,
    variance_size: int | None = None
) -> Tensor:
    if weight is None:
        weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    assert variance_size is None
    output = torch.empty(x.shape, device=x.device, dtype=x.dtype)
    torch.ops._C.rms_norm(output, x, weight, epsilon)
    return output

关键点

  • 使用 ir.ops.rms_norm.register_impl 注册到 IR op 系统
  • 底层调用 torch.ops._C.rms_norm,这是编译好的 C++ 扩展
  • 支持 CUDA_ALIKE 平台(CUDA、ROCm 等)
融合 Add-RMSNorm 实现
python 复制代码
# 文件: /workspace/vllm/kernels/vllm_c.py
# 行号: 44-62

@ir.ops.fused_add_rms_norm.register_impl(
    "vllm_c",
    supports_args=rms_add_no_var_size,
    supported=CUDA_ALIKE,
    inplace=True,
)
def fused_add_rms_norm(
    x: Tensor, x_residual: Tensor, weight: Tensor | None,
    epsilon: float, variance_size: int | None = None,
) -> tuple[Tensor, Tensor]:
    if weight is None:
        weight = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
    assert variance_size is None
    torch.ops._C.fused_add_rms_norm(x, x_residual, weight, epsilon)
    return x, x_residual

1.3 torch_utils 扩展编译方式

torch_utils.py(file:///workspace/vllm/utils/torch_utils.py) 提供了核心的工具函数和 custom op 注册机制:

python 复制代码
# 文件: /workspace/vllm/utils/torch_utils.py
# 行号: 928-967

# 创建 vLLM 自定义库
vllm_lib = Library("vllm", "FRAGMENT")

def direct_register_custom_op(
    op_name: str,
    op_func: Callable,
    mutates_args: list[str] | None = None,
    fake_impl: Callable | None = None,
    target_lib: Library | None = None,
    dispatch_key: str | None = None,
    tags: tuple[torch.Tag, ...] = (),
):
    """
    直接注册自定义 op 并分发到 CUDA 后端。
    绕过 torch.library.custom_op 的复杂调度逻辑。
    """
    if mutates_args is None:
        mutates_args = []

    if dispatch_key is None:
        from vllm.platforms import current_platform
        dispatch_key = current_platform.dispatch_key

    schema_str = infer_schema(op_func, mutates_args=mutates_args)
    my_lib = target_lib or vllm_lib
    my_lib.define(op_name + schema_str, tags=tags)
    my_lib.impl(op_name, op_func, dispatch_key=dispatch_key)
    if fake_impl is not None:
        my_lib._register_fake(op_name, fake_impl)

设计亮点

  • 性能优化:直接注册绕过 PyTorch 复杂的调度逻辑
  • 平台感知 :自动使用当前平台的 dispatch_key
  • FRAGMENT 库:使用轻量级 FRAGMENT 类型避免完整 Library 开销

1.4 平台导入机制

各平台通过 import_kernels() 方法动态加载对应的内核:

python 复制代码
# CUDA 平台 (platforms/cuda.py)
import vllm._C          # noqa: 触发 C++ 扩展加载
import vllm._C_stable_libtorch

# ROCm 平台 (platforms/rocm.py)
import vllm._C          # noqa
import vllm._rocm_C     # noqa: ROCm 特定扩展

二、注意力内核

2.1 注意力后端架构

vLLM 实现了多后端可插拔架构,支持多种注意力计算实现:
#mermaid-svg-ScCFni3jBBdHYIKG{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-ScCFni3jBBdHYIKG .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-ScCFni3jBBdHYIKG .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-ScCFni3jBBdHYIKG .error-icon{fill:#552222;}#mermaid-svg-ScCFni3jBBdHYIKG .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-ScCFni3jBBdHYIKG .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-ScCFni3jBBdHYIKG .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-ScCFni3jBBdHYIKG .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-ScCFni3jBBdHYIKG .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-ScCFni3jBBdHYIKG .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-ScCFni3jBBdHYIKG .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-ScCFni3jBBdHYIKG .marker{fill:#333333;stroke:#333333;}#mermaid-svg-ScCFni3jBBdHYIKG .marker.cross{stroke:#333333;}#mermaid-svg-ScCFni3jBBdHYIKG svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-ScCFni3jBBdHYIKG p{margin:0;}#mermaid-svg-ScCFni3jBBdHYIKG .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-ScCFni3jBBdHYIKG .cluster-label text{fill:#333;}#mermaid-svg-ScCFni3jBBdHYIKG .cluster-label span{color:#333;}#mermaid-svg-ScCFni3jBBdHYIKG .cluster-label span p{background-color:transparent;}#mermaid-svg-ScCFni3jBBdHYIKG .label text,#mermaid-svg-ScCFni3jBBdHYIKG span{fill:#333;color:#333;}#mermaid-svg-ScCFni3jBBdHYIKG .node rect,#mermaid-svg-ScCFni3jBBdHYIKG .node circle,#mermaid-svg-ScCFni3jBBdHYIKG .node ellipse,#mermaid-svg-ScCFni3jBBdHYIKG .node polygon,#mermaid-svg-ScCFni3jBBdHYIKG .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-ScCFni3jBBdHYIKG .rough-node .label text,#mermaid-svg-ScCFni3jBBdHYIKG .node .label text,#mermaid-svg-ScCFni3jBBdHYIKG .image-shape .label,#mermaid-svg-ScCFni3jBBdHYIKG .icon-shape .label{text-anchor:middle;}#mermaid-svg-ScCFni3jBBdHYIKG .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-ScCFni3jBBdHYIKG .rough-node .label,#mermaid-svg-ScCFni3jBBdHYIKG .node .label,#mermaid-svg-ScCFni3jBBdHYIKG .image-shape .label,#mermaid-svg-ScCFni3jBBdHYIKG .icon-shape .label{text-align:center;}#mermaid-svg-ScCFni3jBBdHYIKG .node.clickable{cursor:pointer;}#mermaid-svg-ScCFni3jBBdHYIKG .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-ScCFni3jBBdHYIKG .arrowheadPath{fill:#333333;}#mermaid-svg-ScCFni3jBBdHYIKG .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-ScCFni3jBBdHYIKG .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-ScCFni3jBBdHYIKG .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-ScCFni3jBBdHYIKG .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-ScCFni3jBBdHYIKG .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-ScCFni3jBBdHYIKG .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-ScCFni3jBBdHYIKG .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-ScCFni3jBBdHYIKG .cluster text{fill:#333;}#mermaid-svg-ScCFni3jBBdHYIKG .cluster span{color:#333;}#mermaid-svg-ScCFni3jBBdHYIKG div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-ScCFni3jBBdHYIKG .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-ScCFni3jBBdHYIKG rect.text{fill:none;stroke-width:0;}#mermaid-svg-ScCFni3jBBdHYIKG .icon-shape,#mermaid-svg-ScCFni3jBBdHYIKG .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-ScCFni3jBBdHYIKG .icon-shape p,#mermaid-svg-ScCFni3jBBdHYIKG .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-ScCFni3jBBdHYIKG .icon-shape .label rect,#mermaid-svg-ScCFni3jBBdHYIKG .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-ScCFni3jBBdHYIKG .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-ScCFni3jBBdHYIKG .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-ScCFni3jBBdHYIKG :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 🖥️ 平台特定
ROCM_ATTN
CPU_ATTN
ROCM_AITER_MLA
📚 MLA 专用后端
FLASHINFER_MLA
FLASHMLA
CUTLASS_MLA
TRITON_MLA
FLASH_ATTN_MLA
🎯 AttentionBackendEnum
FLASH_ATTN
FLASHINFER
TRITON_ATTN
FLEX_ATTENTION
TURBOQUANT

后端注册表 (registry.py(file:///workspace/vllm/v1/attention/backends/registry.py))
python 复制代码
# 文件: /workspace/vllm/v1/attention/backends/registry.py
# 行号: 34-87

class AttentionBackendEnum(Enum, metaclass=_AttentionBackendEnumMeta):
    """所有支持的注意力后端枚举"""

    FLASH_ATTN = "vllm.v1.attention.backends.flash_attn.FlashAttentionBackend"
    FLASH_ATTN_DIFFKV = (
        "vllm.v1.attention.backends.flash_attn_diffkv.FlashAttentionDiffKVBackend"
    )
    TRITON_ATTN = "vllm.v1.attention.backends.triton_attn.TritonAttentionBackend"
    ROCM_ATTN = "vllm.v1.attention.backends.rocm_attn.RocmAttentionBackend"
    TORCH_SDPA = ""  # 仅用于 ViT
    FLASHINFER = "vllm.v1.attention.backends.flashinfer.FlashInferBackend"
    # ... 更多后端
    CPU_ATTN = "vllm.v1.attention.backends.cpu_attn.CPUAttentionBackend"
    TURBOQUANT = "vllm.v1.attention.backends.turboquant_attn.TurboQuantAttentionBackend"

2.2 FlashInfer 后端详解

flashinfer.py(file:///workspace/vllm/v1/attention/backends/flashinfer.py) 是 vLLM 默认且最高性能的注意力后端:

python 复制代码
# 文件: /workspace/vllm/v1/attention/backends/flashinfer.py
# 行号: 1-50

from flashinfer import (
    BatchDecodeWithPagedKVCacheWrapper,
    BatchPrefillWithPagedKVCacheWrapper,
    BatchPrefillWithRaggedKVCacheWrapper,
    MultiLevelCascadeAttentionWrapper,
)
from flashinfer.decode import fast_decode_plan, trtllm_batch_decode_with_kv_cache
from flashinfer.prefill import trtllm_batch_context_with_kv_cache
from flashinfer.utils import FP4Tensor

核心特性

  • Paged KV Cache:支持分页式 KV 缓存的高效访问
  • FP8/FP4 量化支持:原生支持低精度 KV cache
  • TRT-LLM 集成:可直接调用 TensorRT-LLM 优化的 kernel
  • Multi-Level Cascade:支持 DeepSeek V3/V4 的多级级联注意力
FP8 KV Cache 反量化 Kernel
python 复制代码
# 文件: /workspace/vllm/v1/attention/backends/flashinfer.py
# 行号: 98-149

@triton.jit
def _trtllm_prefill_attn_kvfp8_dequant(
    kv_cache_ptr,
    block_tables_prefill_ptr,
    block_table_stride,
    mock_kv_cache_ptr,
    k_scale_ptr,
    v_scale_ptr,
    src_stride_page,
    src_stride_kv,
    src_stride_head,
    DST_K_CACHE_STRIDE: tl.constexpr,
    DST_KV_CACHE_STRIDE: tl.constexpr,
    HEAD_STRIDE: tl.constexpr,
    NUM_KV_HEADS: tl.constexpr,
):
    """将 FP8 量化的 KV cache 反量化为 BF16/FP16"""
    batch_idx = tl.program_id(0).to(tl.int64)
    mock_block_table_idx = tl.program_id(1).to(tl.int64)
    orig_page_num = tl.load(
        block_tables_prefill_ptr + batch_idx * block_table_stride + mock_block_table_idx
    ).to(tl.int64)
    if orig_page_num <= 0:
        return

    k_scale_val = tl.load(k_scale_ptr)
    v_scale_val = tl.load(v_scale_ptr)

    for h in range(NUM_KV_HEADS):
        h_off = tl.cast(h, tl.int64)
        # 读取 FP8 K 并反量化
        src_k = orig_page_num * src_stride_page + h_off * src_stride_head + head_offsets
        fp8_k = tl.load(kv_cache_ptr + src_k)
        dequant_k = (fp8_k.to(tl.float32) * k_scale_val).to(dequant_dtype)

        # 读取 FP8 V 并反量化
        src_v = (orig_page_num * src_stride_page + src_stride_kv +
                 h_off * src_stride_head + head_offsets)
        fp8_v = tl.load(kv_cache_ptr + src_v)
        dequant_v = (fp8_v.to(tl.float32) * v_scale_val).to(dequant_dtype)

2.3 MLA (Multi-Head Latent Attention) 内核

MLA 是 DeepSeek V3/V4 引入的新型注意力机制,vLLM 提供了完整的 MLA 内核生态

MLA 后端列表 (mla/ 目录(file:///workspace/vllm/v1/attention/backends/mla/))
后端文件 适用场景 性能特点
flashinfer_mla.py(file:///workspace/vllm/v1/attention/backends/mla/flashinfer_mla.py) NVIDIA GPU (通用) 推荐首选,平衡性能与兼容性
flashmla.py(file:///workspace/vllm/v1/attention/backends/mla/flashmla.py) NVIDIA Blackwell (SM 10.x) 最高吞吐量,利用 TMA
cutlass_mla.py(file:///workspace/vllm/v1/attention/backends/mla/cutlass_mla.py) NVIDIA Hopper+ CUTLASS 实现,高精度
triton_mla.py(file:///workspace/vllm/v1/attention/backends/mla/triton_mla.py) 通用 CUDA 纯 Triton,易调试
flashattn_mla.py(file:///workspace/vllm/v1/attention/backends/mla/flashattn_mla.py) 基于 FlashAttn 兼容性好
rocm_aiter_mla.py(file:///workspace/vllm/v1/attention/backends/mla/rocm_aiter_mla.py) AMD MI300X ROCm AITER 加速
Sparse MLA 变体

针对长序列场景,vLLM 实现了 Sparse MLA(稀疏注意力):

  • flashinfer_mla_sparse.py(file:///workspace/vllm/v1/attention/backends/mla/flashinfer_mla_sparse.py):FlashInfer 稀疏实现
  • flashmla_sparse.py(file:///workspace/vllm/v1/attention/backends/mla/flashmla_sparse.py):FlashMLA 稀疏实现
  • rocm_aiter_mla_sparse.py(file:///workspace/vllm/v1/attention/backends/mla/rocm_aiter_mla_sparse.py):ROCm 稀疏实现
  • xpu_mla_sparse.py(file:///workspace/vllm/v1/attention/backends/mla/xpu_mla_sparse.py):Intel XPU 稀疏实现
MLA Prefill 子系统

#mermaid-svg-mkmZtxHyK3nm1dTT{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-mkmZtxHyK3nm1dTT .error-icon{fill:#552222;}#mermaid-svg-mkmZtxHyK3nm1dTT .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-mkmZtxHyK3nm1dTT .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-mkmZtxHyK3nm1dTT .marker{fill:#333333;stroke:#333333;}#mermaid-svg-mkmZtxHyK3nm1dTT .marker.cross{stroke:#333333;}#mermaid-svg-mkmZtxHyK3nm1dTT svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-mkmZtxHyK3nm1dTT p{margin:0;}#mermaid-svg-mkmZtxHyK3nm1dTT .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-mkmZtxHyK3nm1dTT .cluster-label text{fill:#333;}#mermaid-svg-mkmZtxHyK3nm1dTT .cluster-label span{color:#333;}#mermaid-svg-mkmZtxHyK3nm1dTT .cluster-label span p{background-color:transparent;}#mermaid-svg-mkmZtxHyK3nm1dTT .label text,#mermaid-svg-mkmZtxHyK3nm1dTT span{fill:#333;color:#333;}#mermaid-svg-mkmZtxHyK3nm1dTT .node rect,#mermaid-svg-mkmZtxHyK3nm1dTT .node circle,#mermaid-svg-mkmZtxHyK3nm1dTT .node ellipse,#mermaid-svg-mkmZtxHyK3nm1dTT .node polygon,#mermaid-svg-mkmZtxHyK3nm1dTT .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-mkmZtxHyK3nm1dTT .rough-node .label text,#mermaid-svg-mkmZtxHyK3nm1dTT .node .label text,#mermaid-svg-mkmZtxHyK3nm1dTT .image-shape .label,#mermaid-svg-mkmZtxHyK3nm1dTT .icon-shape .label{text-anchor:middle;}#mermaid-svg-mkmZtxHyK3nm1dTT .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-mkmZtxHyK3nm1dTT .rough-node .label,#mermaid-svg-mkmZtxHyK3nm1dTT .node .label,#mermaid-svg-mkmZtxHyK3nm1dTT .image-shape .label,#mermaid-svg-mkmZtxHyK3nm1dTT .icon-shape .label{text-align:center;}#mermaid-svg-mkmZtxHyK3nm1dTT .node.clickable{cursor:pointer;}#mermaid-svg-mkmZtxHyK3nm1dTT .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-mkmZtxHyK3nm1dTT .arrowheadPath{fill:#333333;}#mermaid-svg-mkmZtxHyK3nm1dTT .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-mkmZtxHyK3nm1dTT .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-mkmZtxHyK3nm1dTT .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-mkmZtxHyK3nm1dTT .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-mkmZtxHyK3nm1dTT .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-mkmZtxHyK3nm1dTT .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-mkmZtxHyK3nm1dTT .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-mkmZtxHyK3nm1dTT .cluster text{fill:#333;}#mermaid-svg-mkmZtxHyK3nm1dTT .cluster span{color:#333;}#mermaid-svg-mkmZtxHyK3nm1dTT div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-mkmZtxHyK3nm1dTT .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-mkmZtxHyK3nm1dTT rect.text{fill:none;stroke-width:0;}#mermaid-svg-mkmZtxHyK3nm1dTT .icon-shape,#mermaid-svg-mkmZtxHyK3nm1dTT .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-mkmZtxHyK3nm1dTT .icon-shape p,#mermaid-svg-mkmZtxHyK3nm1dTT .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-mkmZtxHyK3nm1dTT .icon-shape .label rect,#mermaid-svg-mkmZtxHyK3nm1dTT .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-mkmZtxHyK3nm1dTT .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-mkmZtxHyK3nm1dTT .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-mkmZtxHyK3nm1dTT :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 🚀 MLA Prefill 子系统
selector.py

路由选择
base.py

基类定义
flashinfer.py

FlashInfer 实现
trtllm_ragged.py

TRT-LLM Ragged

2.4 注意力后端选择策略

cuda.py(file:///workspace/vllm/platforms/cuda.py) 中的智能选择逻辑:

python 复制代码
# 文件: /workspace/vllm/platforms/cuda.py
# 行号: 79-143

@lru_cache(maxsize=8)
def _get_backend_priorities(
    use_mla: bool,
    device_capability: DeviceCapability,
    num_heads: int | None = None,
    kv_cache_dtype: CacheDType | None = None,
) -> list[AttentionBackendEnum]:
    """根据硬件和配置获取后端优先级列表"""
    if use_mla:
        if device_capability.major == 10:
            # Blackwell GPU: 优先使用稀疏 MLA
            if kv_cache_dtype is not None and is_quantized_kv_cache(kv_cache_dtype):
                sparse_backends = [
                    AttentionBackendEnum.FLASHINFER_MLA_SPARSE,
                    AttentionBackendEnum.FLASHMLA_SPARSE,
                ]
            else:
                if num_heads is not None and num_heads <= 16:
                    sparse_backends = [
                        AttentionBackendEnum.FLASHINFER_MLA_SPARSE,
                        AttentionBackendEnum.FLASHMLA_SPARSE,
                    ]
                else:
                    sparse_backends = [
                        AttentionBackendEnum.FLASHMLA_SPARSE,
                        AttentionBackendEnum.FLASHINFER_MLA_SPARSE,
                    ]

            return [
                AttentionBackendEnum.FLASHINFER_MLA,
                AttentionBackendEnum.CUTLASS_MLA,
                AttentionBackendEnum.FLASH_ATTN_MLA,
                AttentionBackendEnum.FLASHMLA,
                AttentionBackendEnum.TRITON_MLA,
                *sparse_backends,
            ]
    else:
        if device_capability.major == 10:
            return [
                AttentionBackendEnum.FLASHINFER,
                AttentionBackendEnum.FLASH_ATTN,
                AttentionBackendEnum.TRITON_ATTN,
                AttentionBackendEnum.FLEX_ATTENTION,
                AttentionBackendEnum.TURBOQUANT,
            ]
        else:
            return [
                AttentionBackendEnum.FLASH_ATTN,
                AttentionBackendEnum.FLASHINFER,
                AttentionBackendEnum.TRITON_ATTN,
                AttentionBackendEnum.FLEX_ATTENTION,
                AttentionBackendEnum.TURBOQUANT,
            ]

选择策略总结

  1. Blackwell (SM 10.x):优先 FlashInfer → FlashAttn → Triton → Flex → TurboQuant
  2. Hopper/Ampere (SM 9.x/8.x):优先 FlashAttn → FlashInfer → Triton
  3. MLA 场景:优先 FlashInfer_MLA → CUTLASS_MLA → FlashMLA
  4. 量化 KV Cache:自动切换到稀疏变体

三、缓存内核

3.1 reshape_and_cache_kernel - KV 缓存写入核心

triton_reshape_and_cache_flash.py(file:///workspace/vllm/v1/attention/ops/triton_reshape_and_cache_flash.py) 实现了高性能的 KV 缓存写入 kernel:

python 复制代码
# 文件: /workspace/vllm/v1/attention/ops/triton_reshape_and_cache_flash.py
# 行号: 17-100

@triton.jit
def reshape_and_cache_kernel_flash(
    key_ptr,              # [num_tokens, num_heads, head_size]
    value_ptr,            # [num_tokens, num_heads, head_size]
    key_cache_ptr,        # [num_blocks, block_size, num_heads, head_size]
    value_cache_ptr,      # [num_blocks, block_size, num_heads, head_size]
    slot_mapping_ptr,     # [num_tokens] - token 到 cache slot 的映射
    k_scale,              # float32 - K 的量化 scale
    v_scale,              # float32 - V 的量化 scale
    # strides...
    num_heads: tl.constexpr,
    head_size: tl.constexpr,
    block_size: tl.constexpr,
    x: tl.constexpr,      # FP8 打包因子 (通常为 8)
    USE_HEAD_MAJOR_LAYOUT: tl.constexpr,
    FP8_KV_CACHE: tl.constexpr,   # 是否启用 FP8 缓存
    TILE_SIZE: tl.constexpr,
):
    """将新的 K/V 写入 paged KV cache"""
    token_idx = tl.program_id(axis=0)
    slot_idx = tl.load(slot_mapping_ptr + token_idx).to(tl.int64)
    if slot_idx < 0:
        return  # 忽略 padding token

    block_idx = slot_idx // block_size
    block_offset = slot_idx % block_size

    tile_i = tl.program_id(axis=1)
    tile_offs = tl.arange(0, TILE_SIZE)
    tile_pos = tile_i * TILE_SIZE + tile_offs

    if USE_HEAD_MAJOR_LAYOUT:
        cur_head = tile_pos // head_size
        cur_dim = tile_pos % head_size
        # Head-Major 布局寻址
        tgt_idx_v = (
            block_idx * block_stride
            + cur_head * head_stride
            + cur_dim * dim_stride_v
            + block_offset
        )
        tgt_idx_k = (
            block_idx * block_stride
            + cur_head * head_stride
            + (cur_dim // x) * dim_stride_k
            + block_offset * x
            + (cur_dim % x)
        )

    # 加载并写入 Key
    key_load = tl.load(
        key_ptr + src_key_idx + tile_pos,
        mask=tile_pos < (num_heads * head_size)
    )
    if FP8_KV_CACHE:
        key_tile = (key_load if key_load.dtype.is_fp8()
                   else key_load / tl.load(k_scale))
    else:
        key_tile = key_load

    tl.store(key_cache_ptr + tgt_idx_k, key_tile,
             mask=tile_pos < (num_heads * head_size))

关键特性

  • 双布局支持:Head-Major 和 Token-Major 两种 KV cache 布局
  • FP8 量化:支持在线量化写入 FP8 KV cache
  • Tiled 内存访问:使用 TILE_SIZE 优化内存合并访问
  • Padding 跳过:通过 slot_mapping < 0 跳过无效 token

3.2 NVFP4 KV Cache 支持

torch_utils.py(file:///workspace/vllm/utils/torch_utils.py) 中实现了 NVFP4 (NVIDIA 4-bit Floating Point) 格式的 KV cache:

python 复制代码
# 文件: /workspace/vllm/utils/torch_utils.py
# 行号: 415-469

def nvfp4_kv_cache_full_dim(head_size: int) -> int:
    """NVFP4 KV cache 打包维度: fp4 数据 + fp8 block scales"""
    return head_size // 2 + head_size // 16

def _nvfp4_split_data_scale(
    kv_side: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """拆分 NVFP4 buffer 为数据和 scale 视图"""
    num_pages = kv_side.shape[0]
    dim_1, dim_2 = kv_side.shape[1], kv_side.shape[2]
    full_dim = kv_side.shape[3]
    data_dim = full_dim * 8 // 9       # 8/9 是数据比例
    scale_dim = full_dim - data_dim     # 1/9 是 scale 比例

    data_per_kv = dim_1 * dim_2 * data_dim
    page_bytes = kv_side.stride(0)

    # 保持原始布局 (NHD vs HND)
    s1 = kv_side.stride(1) * data_dim // full_dim
    s2 = kv_side.stride(2) * data_dim // full_dim
    data_shape = (num_pages, dim_1, dim_2, data_dim)
    data_strides = (page_bytes, s1, s2, 1)

    base = kv_side.storage_offset()
    data = torch.as_strided(kv_side, data_shape, data_strides,
                            storage_offset=base)
    scale = torch.as_strided(
        kv_site, scale_shape, scale_strides,
        storage_offset=base + data_per_kv
    ).view(torch.float8_e4m3fn)

    return data, scale

NVFP4 格式说明

  • 数据压缩比 :相比 FP16 节省 75% 显存 (4-bit vs 16-bit)
  • Block Scale:每 16 个元素共享一个 FP8 scale
  • 零拷贝视图 :通过 torch.as_strided 实现无开销的数据/Scale 拆分

3.3 KV Cache 量化格式支持

vLLM 支持多种 KV Cache 量化格式:

格式 数据类型 压缩比 适用场景
FP16/BF16 半精度 1x 默认,最佳精度
FP8 (E4M3) 8-bit 浮点 2x Hopper+ GPU 推荐
NVFP4 4-bit 浮点 4x Blackwell GPU 长序列
INT8 8-bit 整数 2x 特定量化方案
TurboQuant 混合精度 4-8x TurboQuant 专用

四、激活函数内核

4.1 RMSNorm 和 FusedAddRMSNorm

通过 vllm_c.py(file:///workspace/vllm/kernels/vllm_c.py) 绑定的底层 C++ 实现:

python 复制代码
# 文件: /workspace/vllm/kernels/vllm_c.py
# 行号: 14-33

rms_no_var_size = (
    lambda x, weight, epsilon, variance_size=None:
    variance_size is None
    and (weight is None or weight.dtype == x.dtype)
)
"""vLLM kernel 要求: 无 variance_size 参数且 input/weight dtype 匹配"""

RMSNorm 公式

RMSNorm ( x ) = x 1 n ∑ i = 1 n x i 2 + ϵ ⋅ γ \text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2 + \epsilon}} \cdot \gamma RMSNorm(x)=n1∑i=1nxi2+ϵ x⋅γ

FusedAddRMSNorm 优势

  • 融合操作:在一个 kernel 中完成 residual add + normalization
  • 减少内存带宽:避免中间结果的读写
  • Inplace 执行:直接修改输入 tensor,节省显存分配

4.2 RoPE (Rotary Position Embedding) 内核

vLLM 实现了丰富的 RoPE 变体以支持不同模型:

RoPE 变体列表 (rotary_embedding/(file:///workspace/vllm/model_executor/layers/rotary_embedding/))
文件名 支持的模型 特点
linear_scaling_rope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/linear_scaling_rope.py) LLaMA 标准 线性位置缩放
ntk_scaling_rope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/ntk_scaling_rope.py) NTK-aware 外推能力增强
dynamic_ntk_scaling_rope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/dynamic_ntk_scaling_rope.py) Dynamic NTK 动态 NTK 缩放
yarn_scaling_rope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/yarn_scaling_rope.py) YaRN 长序列外推
deepseek_scaling_rope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/deepseek_scaling_rope.py) DeepSeek DeepSeek 特定优化
llama3_rope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/llama3_rope.py) LLaMA 3 LLaMA 3 标准
mrope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/mrope.py) Qwen2-VL 多模态 RoPE
xdrope.py(file:///workspace/vllm/model_executor/layers/rotary_embedding/xdrope.py) Qwen2.5 XRoPE 扩展
融合 RoPE + FP8 量化 Kernel
python 复制代码
# 文件: /workspace/vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py

class FusedInvRoPEFp8Quant:
    """融合逆 RoPE + FP8 量化操作"""
    def __init__(self, head_size: int, num_heads: int, ...):
        self.head_size = head_size
        self.num_heads = num_heads

    def forward(self, q, k, positions, ...):
        # 1. 应用逆 RoPE
        q = apply_rotary_emb(q, cos, sin)
        k = apply_rotary_emb(k, cos, sin)
        # 2. 在同一 kernel 中量化为 FP8
        q_fp8 = quantize_to_fp8(q, scale_q)
        k_fp8 = quantize_to_fp8(k, scale_k)
        return q_fp8, k_fp8, scale_q, scale_k

4.3 SiLU * FP8 融合操作

helion/ops/silu_mul_fp8.py(file:///workspace/vllm/kernels/helion/ops/silu_mul_fp8.py) 实现了针对 NVIDIA H100/H200 的 SiLU + MatMul 融合:

python 复制代码
# 文件: /workspace/vllm/kernels/helion/configs/silu_mul_fp8/nvidia_h100.json
{
    "kernel_name": "silu_mul_fp8",
    "device": "NVIDIA_H100",
    "optimizations": ["tensor_core", "pipeline"],
    "block_size": [128, 128, 32],
    "stages": 4,
    "wave_size": 128,
    "num_warps": 8
}

Helion 配置系统

  • 硬件特定调优:每个 GPU 型号有独立的最优配置
  • Kernel auto-tuning:运行时选择最优参数组合
  • Wavefront 调度:充分利用 GPU 波前并行

五、量化内核

5.1 量化内核架构总览

#mermaid-svg-iqa14NN6fqupm4Nw{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-iqa14NN6fqupm4Nw .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-iqa14NN6fqupm4Nw .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-iqa14NN6fqupm4Nw .error-icon{fill:#552222;}#mermaid-svg-iqa14NN6fqupm4Nw .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-iqa14NN6fqupm4Nw .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-iqa14NN6fqupm4Nw .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-iqa14NN6fqupm4Nw .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-iqa14NN6fqupm4Nw .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-iqa14NN6fqupm4Nw .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-iqa14NN6fqupm4Nw .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-iqa14NN6fqupm4Nw .marker{fill:#333333;stroke:#333333;}#mermaid-svg-iqa14NN6fqupm4Nw .marker.cross{stroke:#333333;}#mermaid-svg-iqa14NN6fqupm4Nw svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-iqa14NN6fqupm4Nw p{margin:0;}#mermaid-svg-iqa14NN6fqupm4Nw .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-iqa14NN6fqupm4Nw .cluster-label text{fill:#333;}#mermaid-svg-iqa14NN6fqupm4Nw .cluster-label span{color:#333;}#mermaid-svg-iqa14NN6fqupm4Nw .cluster-label span p{background-color:transparent;}#mermaid-svg-iqa14NN6fqupm4Nw .label text,#mermaid-svg-iqa14NN6fqupm4Nw span{fill:#333;color:#333;}#mermaid-svg-iqa14NN6fqupm4Nw .node rect,#mermaid-svg-iqa14NN6fqupm4Nw .node circle,#mermaid-svg-iqa14NN6fqupm4Nw .node ellipse,#mermaid-svg-iqa14NN6fqupm4Nw .node polygon,#mermaid-svg-iqa14NN6fqupm4Nw .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-iqa14NN6fqupm4Nw .rough-node .label text,#mermaid-svg-iqa14NN6fqupm4Nw .node .label text,#mermaid-svg-iqa14NN6fqupm4Nw .image-shape .label,#mermaid-svg-iqa14NN6fqupm4Nw .icon-shape .label{text-anchor:middle;}#mermaid-svg-iqa14NN6fqupm4Nw .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-iqa14NN6fqupm4Nw .rough-node .label,#mermaid-svg-iqa14NN6fqupm4Nw .node .label,#mermaid-svg-iqa14NN6fqupm4Nw .image-shape .label,#mermaid-svg-iqa14NN6fqupm4Nw .icon-shape .label{text-align:center;}#mermaid-svg-iqa14NN6fqupm4Nw .node.clickable{cursor:pointer;}#mermaid-svg-iqa14NN6fqupm4Nw .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-iqa14NN6fqupm4Nw .arrowheadPath{fill:#333333;}#mermaid-svg-iqa14NN6fqupm4Nw .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-iqa14NN6fqupm4Nw .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-iqa14NN6fqupm4Nw .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-iqa14NN6fqupm4Nw .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-iqa14NN6fqupm4Nw .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-iqa14NN6fqupm4Nw .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-iqa14NN6fqupm4Nw .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-iqa14NN6fqupm4Nw .cluster text{fill:#333;}#mermaid-svg-iqa14NN6fqupm4Nw .cluster span{color:#333;}#mermaid-svg-iqa14NN6fqupm4Nw div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-iqa14NN6fqupm4Nw .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-iqa14NN6fqupm4Nw rect.text{fill:none;stroke-width:0;}#mermaid-svg-iqa14NN6fqupm4Nw .icon-shape,#mermaid-svg-iqa14NN6fqupm4Nw .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-iqa14NN6fqupm4Nw .icon-shape p,#mermaid-svg-iqa14NN6fqupm4Nw .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-iqa14NN6fqupm4Nw .icon-shape .label rect,#mermaid-svg-iqa14NN6fqupm4Nw .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-iqa14NN6fqupm4Nw .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-iqa14NN6fqupm4Nw .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-iqa14NN6fqupm4Nw :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 📐 Linear 层抽象
MixedPrecision

混合精度线性层
ScaledMM

缩放矩阵乘法
NVFP4

4-bit 浮点线性层
MXFP8

8-bit 浮点 MoE 线性层
🎯 量化内核体系
AWQ

Activation-aware Weight Quantization
GPTQ

GPT Quantization
GGUF

llama.cpp 格式
Marlin

4-bit GEMM Engine
Machete

新一代 CUTLASS GEMM
CUTLASS

W8A8/FP4/NVFP4

5.2 AWQ (Activation-aware Weight Quantization)

awq_marlin.py(file:///workspace/vllm/model_executor/layers/quantization/awq_marlin.py) 实现 AWQ 量化方案的 Marlin 后端:

python 复制代码
# 文件: /workspace/vllm/model_executor/layers/quantization/awq_marlin.py
# 行号: 67-100

_REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]
"""AWQ 使用非标准打包顺序,需要反转"""

def _convert_awq_to_standard_format(
    layer: torch.nn.Module,
    w_q_name: str,
    w_zp_name: str,
    size_bits: int,
) -> None:
    """将 AWQ 权重和零点转换为标准 GPTQ 格式"""
    pack_factor = 32 // size_bits
    mask = (1 << size_bits) - 1
    device = getattr(layer, w_q_name).device

    # 反转 AWQ 打包顺序
    reverse_order = torch.tensor(
        _REVERSE_AWQ_PACK_ORDER, dtype=torch.long, device=device
    )
    shifts = torch.arange(0, 32, size_bits, dtype=torch.int32, device=device)

    # 解包 int32 → 单独值,修正 AWQ 顺序
    qw = getattr(layer, w_q_name).data
    K, N_packed = qw.shape
    unpacked = (qw.unsqueeze(-1) >> shifts) & mask

AWQ 特性

  • 激活感知:基于激活分布优化量化
  • Marlin 加速:通过 Marlin kernel 实现高效推理
  • Group-wise 量化:支持 64/128 group size
  • Zero-point 支持:支持对称和非对称量化

5.3 Marlin - 高性能 4-bit GEMM Engine

marlin.py(file:///workspace/vllm/model_executor/kernels/linear/mixed_precision/marlin.py) 是 vLLM 的核心量化计算引擎

python 复制代码
# 文件: /workspace/vllm/model_executor/kernels/linear/mixed_precision/marlin.py
# 行号: 30-100

class MarlinLinearKernel(MPLinearKernel):
    @classmethod
    def get_min_capability(cls) -> int:
        return 75  # 要求 Turing (SM 7.5) 及以上

    @classmethod
    def can_implement(cls, c: MPLinearLayerConfig) -> tuple[bool, str | None]:
        # Marlin 使用 inline PTX,仅兼容 NVIDIA
        if not current_platform.is_cuda():
            return False, "Marlin only supported on CUDA"

        quant_types = query_marlin_supported_quant_types(c.zero_points)
        if c.weight_type not in quant_types:
            return False, (
                f"Quant type ({c.weight_type}) not supported by Marlin, "
                f"supported types are: {quant_types}"
            )

        if c.group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
            return False, (
                f"Group size ({c.group_size}) not supported by Marlin"
            )

        return check_marlin_supports_shape(...)

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        device = getattr(layer, self.w_q_name).device
        c = self.config
        is_a_8bit = c.act_type is not None and c.act_type.itemsize == 1

        if is_a_8bit:
            assert c.weight_type == scalar_types.uint4b8
            ops.marlin_int4_fp8_preprocess(getattr(layer, self.w_q_name), inplace=True)
            getattr(layer, self.w_s_name).data *= 512

        # 分配 Marlin workspace
        self.workspace = marlin_make_workspace_new(device)

        def transform_w_q(x):
            permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0)
            x.data = ops.gptq_marlin_repack(x.data.contiguous(), ...)

Marlin 核心优势

  • PTX 内联汇编:直接使用 PTX 指令,最大化硬件利用率
  • 权重预打包gptq_marlin_repack 在加载时重新排列权重
  • Workspace 复用:预分配 workspace 避免运行时分配
  • FP8 Activation:支持 FP8 输入 + INT4 权重的混合精度

5.4 Machete - 新一代 CUTLASS GEMM

machete.py(file:///workspace/vllm/model_executor/kernels/linear/mixed_precision/machete.py) 是基于 CUTLASS 3.x 的新一代量化 GEMM:

python 复制代码
# 文件: /workspace/vllm/model_executor/kernels/linear/mixed_precision/machete.py
# 行号: 24-100

class MacheteLinearKernel(MPLinearKernel):
    @classmethod
    def get_min_capability(cls) -> int:
        return 90  # **仅限 Hopper (SM 9.0) 及以上**

    @classmethod
    def can_implement(cls, c: MPLinearLayerConfig) -> tuple[bool, str | None]:
        # Machete 使用 CUTLASS,仅兼容 NVIDIA
        if not current_platform.is_cuda():
            return False, "Machete only supported on CUDA"

        if not current_platform.is_device_capability(90):
            return False, "Machete requires compute capability of 90 (Hopper)"

        if c.has_g_idx and c.partition_weight_shape[0] != c.full_weight_shape[0]:
            return False, "Act reordering currently not supported by Machete..."

        if c.weight_type not in query_machete_supported_quant_types(c.zero_points):
            return False, (...)

        if c.group_size not in query_machete_supported_group_sizes(c.act_type):
            return False, (...)

        return check_machete_supports_shape(...)

    def process_weights_after_loading(self, layer: torch.nn.Module):
        c = self.config
        if c.has_g_idx:
            # 对 activation 进行重排序以匹配 g_idx
            perm = torch.argsort(getattr(layer, self.w_gidx_name)).to(torch.int)
            self.act_perm = lambda x: x[:, perm]
            # 优先使用优化的 permute_cols op
            if (c.act_type in [torch.float16, torch.bfloat16]
                and c.partition_weight_shape[0] % 8 == 0):
                self.act_perm = partial(ops.permute_cols, perm=perm)

        def transform_w_q(x):
            permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0)
            if c.has_g_idx:
                x_unpacked = unpack_quantized_values_into_int32(
                    x.data, c.weight_type, packed_dim=0
                )
                x_perm = x_unpacked[perm, :]
                x.data = pack_quantized_values_into_int32(
                    x_perm, c.weight_type, packed_dim=0
                )
            # 使用 Machete 的预打包函数
            x.data = ops.machete_prepack_B(
                x.data.t().contiguous().t(),
                a_type=c.act_type,
                b_type=c.weight_type,
                group_scales_type=c.act_type,
            )

Machete vs Marlin 对比

特性 Marlin Machete
最低 GPU 架构 Turing (SM 7.5) Hopper (SM 90)
实现方式 Inline PTX CUTLASS 3.x
GPTQ Act Order ✅ 支持 ❌ 不支持 (EP 场景)
性能峰值 更高 (Hopper 优化)
维护成本 中等 低 (依赖 CUTLASS)

5.5 GGUF 反量化内核

gguf.py(file:///workspace/vllm/model_executor/layers/quantization/gguf.py) 实现 llama.cpp GGUF 格式支持:

python 复制代码
# 文件: /workspace/vllm/model_executor/layers/quantization/gguf.py
# 行号: 52-100

class GGUFConfig(QuantizationConfig):
    """GGUF 量化配置"""

    def get_supported_act_dtypes(self) -> list[torch.dtype]:
        # GGUF 反量化 kernel 内部使用 half precision
        # BF16 在 Blackwell 上有精度问题
        if current_platform.has_device_capability(100):
            logger.warning_once("GGUF has precision issues with bfloat16 on Blackwell.")
            return [torch.half, torch.float32]
        return [torch.half, torch.bfloat16, torch.float32]

    @classmethod
    def get_min_capability(cls) -> int:
        return 60  # 兼容 Pascal 及以上

GGUF 支持的量化类型 (来自 gguf.GGMLQuantizationType):

  • Q4_0, Q4_1 (4-bit)
  • Q5_0, Q5_1 (5-bit)
  • Q8_0 (8-bit)
  • Q2_K, Q3_K, Q4_K, Q5_K, Q6_K (K-quantizations)
  • IQS (IQ quantization)

5.6 CUTLASS 扩展内核

vLLM 通过多个模块集成 CUTLASS 功能:

W8A8 (8-bit Weight, 8-bit Activation)

cutlass.py(file:///workspace/vllm/model_executor/kernels/linear/scaled_mm/cutlass.py):

python 复制代码
# 位置: /workspace/vllm/model_executor/kernels/linear/scaled_mm/cutlass.py
class CutlassScaledMMKernel(ScaledMMLinearKernel):
    """CUTLASS 实现的缩放矩阵乘法"""
    def forward(self, x, weight, scale, ...):
        # 调用 CUTLASS W8A8 GEMM
        output = cutlass_scaled_mm(x, weight, scale_x, scale_y, ...)
        return output
NVFP4 (NVIDIA 4-bit Floating Point)

nvfp4/cutlass.py(file:///workspace/vllm/model_executor/kernels/linear/nvfp4/cutlass.py):

python 复制代码
# 位置: /workspace/vllm/model_executor/kernels/linear/nvfp4/cutlass.py
class CutlassNVFP4Kernel(NVFP4LinearKernel):
    """CUTLASS 实现的 NVFP4 线性层"""
    def can_implement(self, config):
        return check_cutlass_nvfp4_support(...)
FP4 (4-bit Floating Point)

nvfp4/base.py(file:///workspace/vllm/model_executor/kernels/linear/nvfp4/base.py):

python 复制代码
# 位置: /workspace/vllm/model_executor/kernels/linear/nvfp4/base.py
class NVFP4LinearKernel(BaseLinearKernel):
    """NVFP4 格式的线性层基类"""
    SUPPORTED_DTYPES = [torch.uint8]  # 打包存储

5.7 ScaledMM - 缩放矩阵乘法框架

ScaledMMLinearKernel(file:///workspace/vllm/model_executor/kernels/linear/scaled_mm/ScaledMMLinearKernel.py) 提供统一的缩放矩阵乘法接口:
#mermaid-svg-p850BIvbITWs60ay{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-p850BIvbITWs60ay .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-p850BIvbITWs60ay .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-p850BIvbITWs60ay .error-icon{fill:#552222;}#mermaid-svg-p850BIvbITWs60ay .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-p850BIvbITWs60ay .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-p850BIvbITWs60ay .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-p850BIvbITWs60ay .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-p850BIvbITWs60ay .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-p850BIvbITWs60ay .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-p850BIvbITWs60ay .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-p850BIvbITWs60ay .marker{fill:#333333;stroke:#333333;}#mermaid-svg-p850BIvbITWs60ay .marker.cross{stroke:#333333;}#mermaid-svg-p850BIvbITWs60ay svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-p850BIvbITWs60ay p{margin:0;}#mermaid-svg-p850BIvbITWs60ay .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-p850BIvbITWs60ay .cluster-label text{fill:#333;}#mermaid-svg-p850BIvbITWs60ay .cluster-label span{color:#333;}#mermaid-svg-p850BIvbITWs60ay .cluster-label span p{background-color:transparent;}#mermaid-svg-p850BIvbITWs60ay .label text,#mermaid-svg-p850BIvbITWs60ay span{fill:#333;color:#333;}#mermaid-svg-p850BIvbITWs60ay .node rect,#mermaid-svg-p850BIvbITWs60ay .node circle,#mermaid-svg-p850BIvbITWs60ay .node ellipse,#mermaid-svg-p850BIvbITWs60ay .node polygon,#mermaid-svg-p850BIvbITWs60ay .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-p850BIvbITWs60ay .rough-node .label text,#mermaid-svg-p850BIvbITWs60ay .node .label text,#mermaid-svg-p850BIvbITWs60ay .image-shape .label,#mermaid-svg-p850BIvbITWs60ay .icon-shape .label{text-anchor:middle;}#mermaid-svg-p850BIvbITWs60ay .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-p850BIvbITWs60ay .rough-node .label,#mermaid-svg-p850BIvbITWs60ay .node .label,#mermaid-svg-p850BIvbITWs60ay .image-shape .label,#mermaid-svg-p850BIvbITWs60ay .icon-shape .label{text-align:center;}#mermaid-svg-p850BIvbITWs60ay .node.clickable{cursor:pointer;}#mermaid-svg-p850BIvbITWs60ay .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-p850BIvbITWs60ay .arrowheadPath{fill:#333333;}#mermaid-svg-p850BIvbITWs60ay .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-p850BIvbITWs60ay .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-p850BIvbITWs60ay .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-p850BIvbITWs60ay .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-p850BIvbITWs60ay .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-p850BIvbITWs60ay .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-p850BIvbITWs60ay .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-p850BIvbITWs60ay .cluster text{fill:#333;}#mermaid-svg-p850BIvbITWs60ay .cluster span{color:#333;}#mermaid-svg-p850BIvbITWs60ay div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-p850BIvbITWs60ay .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-p850BIvbITWs60ay rect.text{fill:none;stroke-width:0;}#mermaid-svg-p850BIvbITWs60ay .icon-shape,#mermaid-svg-p850BIvbITWs60ay .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-p850BIvbITWs60ay .icon-shape p,#mermaid-svg-p850BIvbITWs60ay .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-p850BIvbITWs60ay .icon-shape .label rect,#mermaid-svg-p850BIvbITWs60ay .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-p850BIvbITWs60ay .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-p850BIvbITWs60ay .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-p850BIvbITWs60ay :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 📊 ScaledMM 后端
Triton

triton.py
CUTLASS

cutlass.py
Marlin

marlin.py
FlashInfer

flashinfer.py
DeepGEMM

deep_gemm.py
PyTorch

pytorch.py
ROCm

rocm.py
XPU

xpu.py
CPU

cpu.py
AITER

aiter.py

各后端自动选择逻辑在 __init__.py 中实现。


六、MoE 内核

6.1 MoE 架构概述

vLLM 实现了模块化 MoE 架构,将 FusedMoE 分解为可组合的组件:
#mermaid-svg-ZSLpa1Bq1KP1exaB{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-ZSLpa1Bq1KP1exaB .error-icon{fill:#552222;}#mermaid-svg-ZSLpa1Bq1KP1exaB .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-ZSLpa1Bq1KP1exaB .marker{fill:#333333;stroke:#333333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .marker.cross{stroke:#333333;}#mermaid-svg-ZSLpa1Bq1KP1exaB svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-ZSLpa1Bq1KP1exaB p{margin:0;}#mermaid-svg-ZSLpa1Bq1KP1exaB .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .cluster-label text{fill:#333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .cluster-label span{color:#333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .cluster-label span p{background-color:transparent;}#mermaid-svg-ZSLpa1Bq1KP1exaB .label text,#mermaid-svg-ZSLpa1Bq1KP1exaB span{fill:#333;color:#333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .node rect,#mermaid-svg-ZSLpa1Bq1KP1exaB .node circle,#mermaid-svg-ZSLpa1Bq1KP1exaB .node ellipse,#mermaid-svg-ZSLpa1Bq1KP1exaB .node polygon,#mermaid-svg-ZSLpa1Bq1KP1exaB .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-ZSLpa1Bq1KP1exaB .rough-node .label text,#mermaid-svg-ZSLpa1Bq1KP1exaB .node .label text,#mermaid-svg-ZSLpa1Bq1KP1exaB .image-shape .label,#mermaid-svg-ZSLpa1Bq1KP1exaB .icon-shape .label{text-anchor:middle;}#mermaid-svg-ZSLpa1Bq1KP1exaB .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-ZSLpa1Bq1KP1exaB .rough-node .label,#mermaid-svg-ZSLpa1Bq1KP1exaB .node .label,#mermaid-svg-ZSLpa1Bq1KP1exaB .image-shape .label,#mermaid-svg-ZSLpa1Bq1KP1exaB .icon-shape .label{text-align:center;}#mermaid-svg-ZSLpa1Bq1KP1exaB .node.clickable{cursor:pointer;}#mermaid-svg-ZSLpa1Bq1KP1exaB .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .arrowheadPath{fill:#333333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-ZSLpa1Bq1KP1exaB .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-ZSLpa1Bq1KP1exaB .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-ZSLpa1Bq1KP1exaB .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-ZSLpa1Bq1KP1exaB .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-ZSLpa1Bq1KP1exaB .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-ZSLpa1Bq1KP1exaB .cluster text{fill:#333;}#mermaid-svg-ZSLpa1Bq1KP1exaB .cluster span{color:#333;}#mermaid-svg-ZSLpa1Bq1KP1exaB div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-ZSLpa1Bq1KP1exaB .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-ZSLpa1Bq1KP1exaB rect.text{fill:none;stroke-width:0;}#mermaid-svg-ZSLpa1Bq1KP1exaB .icon-shape,#mermaid-svg-ZSLpa1Bq1KP1exaB .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-ZSLpa1Bq1KP1exaB .icon-shape p,#mermaid-svg-ZSLpa1Bq1KP1exaB .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-ZSLpa1Bq1KP1exaB .icon-shape .label rect,#mermaid-svg-ZSLpa1Bq1KP1exaB .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-ZSLpa1Bq1KP1exaB .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-ZSLpa1Bq1KP1exaB .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-ZSLpa1Bq1KP1exaB :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 🔀 模块化 MoE 架构
🎯 路由器类型
FusedTopKRouter

融合 Top-K
GroupedTopKRouter

分组 Top-K
TopKBiasRouter

带偏置 Top-K
Router

路由器
Quantize-Dispatch

量化分发
Permute-Experts-Unpermute

排列-专家计算-逆排列
Combine

合并输出

核心抽象类 (modular_kernel.py(file:///workspace/vllm/model_executor/layers/fused_moe/modular_kernel.py))
python 复制代码
# 文件: /workspace/vllm/model_executor/layers/fused_moe/modular_kernel.py
# 行号: 46-150

class FusedMoEActivationFormat(Enum):
    """标准激活格式"""
    Standard = ("standard",)
    BatchedExperts = ("batched_experts",)  # (num_experts, max_tokens, hidden_dim)


@dataclass
class ExpertTokensMetadata:
    """Expert-Token 路由元数据"""
    expert_num_tokens: torch.Tensor
    expert_num_tokens_cpu: torch.Tensor | None


class TopKWeightAndReduce(ABC):
    """权重应用与归约的抽象基类"""
    @abstractmethod
    def apply(self, output, fused_expert_output, topk_weights, topk_ids,
              apply_router_weight_on_input) -> torch.Tensor:
        raise NotImplementedError


# MoE 计算流水线:
# [Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]

6.2 Mixtral-style MoE 路由器

fused_topk_router.py(file:///workspace/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py) 实现高性能的融合 Top-K 路由:

python 复制代码
# 文件: /workspace/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py
# 行号: 17-100

def vllm_topk_softmax(
    topk_weights: torch.Tensor,
    topk_indices: torch.Tensor,
    token_expert_indices: torch.Tensor,
    gating_output: torch.Tensor,
    renormalize: bool = False,
) -> tuple[torch.Tensor, ...]:
    """融合 Top-K + Softmax 操作"""
    ops.topk_softmax(
        topk_weights, topk_indices,
        token_expert_indices, gating_output, renormalize,
    )
    return topk_weights, topk_indices


def fused_topk(
    hidden_states: torch.Tensor,
    gating_output: torch.Tensor,
    topk: int,
    renormalize: bool,
    indices_type: torch.dtype | None = None,
    scoring_func: str = "softmax",
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    执行 MoE Top-K 路由:
    1. 计算 gating logits
    2. 选择 Top-K experts
    3. 应用 softmax/sigmoid 归一化
    4. 返回 weights, ids, token-expert 映射
    """
    M, _ = hidden_states.size()

    topk_weights = torch.empty(M, topk, dtype=torch.float32, device=hidden_states.device)
    topk_ids = torch.empty(M, topk, dtype=torch.int32 if indices_type is None else indices_type,
                           device=hidden_states.device)
    token_expert_indices = torch.empty(M, topk, dtype=torch.int32, device=hidden_states.device)

    if scoring_func == "softmax":
        topk_func = dispatch_topk_softmax_func(use_rocm_aiter=...)
        topk_weights, topk_ids = topk_func(
            topk_weights, topk_ids, token_expert_indices, gating_output, renormalize
        )
    elif scoring_func == "sigmoid":
        # ... sigmoid 变体

关键优化

  • Fused Kernel:Top-K + Softmax 在单个 CUDA kernel 中完成
  • ROCm AITER 支持:AMD GPU 使用专用 AITER 实现
  • 多评分函数:支持 softmax 和 sigmoid 两种路由方式

6.3 WNA16 (4-bit MoE) 量化

moe_wna16.py(file:///workspace/vllm/model_executor/layers/quantization/moe_wna16.py) 实现 MoE 专家权重的 4-bit 量化:

python 复制代码
# 文件: /workspace/vllm/model_executor/layers/quantization/moe_wna16.py

class MoEWNA16Config(QuantizationConfig):
    """WNA16 (4-bit) MoE 量化配置"""
    def get_supported_act_dtypes(self) -> list[torch.dtype]:
        return [torch.bfloat16, torch.float16]

    def get_quant_method(self, layer, prefix):
        if isinstance(layer, FusedMoE):
            return MoEWNA16Method()
        return None


class MoEWNA16Method(FusedMoEMethodBase):
    """WNA16 量化的 MoE 方法"""
    def __init__(self):
        # 使用 Marlin 或 Machete 作为 4-bit GEMM 后端
        self.linear_kernel = choose_mp_linear_kernel(...)

6.4 MXFP8 MoE 线性层

Mxfp8LinearKernel(file:///workspace/vllm/model_executor/kernels/linear/mxfp8/Mxfp8LinearKernel.py) 提供 MXFP8 (Microscaling FP8) 格式的 MoE 线性层:

python 复制代码
# 位置: /workspace/vllm/model_executor/kernels/linear/mxfp8/Mxfp8LinearKernel.py

class Mxfp8LinearKernel(BaseLinearKernel):
    """MXFP8 格式的线性层,专为 MoE 设计"""
    BACKENDS = {
        "marlin": MarlinMxfp8Impl,      # NVIDIA Marlin
        "flashinfer": FlashInferMxfp8Impl,  # FlashInfer
        "emulation": EmulationImpl,      # CPU 回退仿真
        "xpu": XPUMxfp8Impl,             # Intel XPU
    }

MXFP8 特性

  • Microscaling Format:每个 block 共享一个 exponent
  • 高压缩比:相比 FP16 节省 ~50% 显存(对大 MoE 模型至关重要)
  • 多后端支持:Marlin、FlashInfer、XPU 等

七、通信内核

7.1 Custom AllReduce - 自定义全归约

custom_all_reduce.py(file:///workspace/vllm/distributed/device_communicators/custom_all_reduce.py) 实现节点内高性能 AllReduce

python 复制代码
# 文件: /workspace/vllm/distributed/device_communicators/custom_all_reduce.py
# 行号: 50-100

class CustomAllreduce:
    _SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]

    def __init__(
        self,
        group: ProcessGroup,
        device: int | str | torch.device,
        max_size=8192 * 1024,  # 默认 8MB
        symm_mem_enabled=False,
    ) -> None:
        """
        初始化 Custom AllReduce:

        Args:
            group: 进程组 (必须是非 NCCL 组)
            device: 绑定的设备 ID
            max_size: 支持的最大 allreduce 大小
            symm_mem_enabled: 是否启用对称内存
        """
        self._IS_CAPTURING = False
        self.disabled = True

        if not custom_ar:
            logger.info("Custom allreduce is disabled because of missing library")
            return

        self.group = group
        assert dist.get_backend(group) != dist.Backend.NCCL, (
            "CustomAllreduce should be attached to a non-NCCL group."
        )

        if not all(in_the_same_node_as(group, source_rank=0)):
            logger.warning("Custom allreduce is disabled for multi-node case.")
            return

Custom AllReduce 优势

  • P2P 通信:使用 NVLink/P2P 直接 GPU 间通信
  • 低延迟 :比 NCCL AllReduce 延迟降低 3-5x
  • CUDA Graph 兼容:可在 CUDA Graph 中捕获
  • 内存高效:使用对称内存避免额外拷贝
P2P 可达性检查
python 复制代码
# 文件: /workspace/vllm/distributed/device_communicators/custom_all_reduce.py
# 行号: 31-40

def _can_p2p(rank: int, world_size: int) -> bool:
    for i in range(world_size):
        if i == rank:
            continue
        if envs.VLLM_SKIP_P2P_CHECK:
            return torch.cuda.can_device_access_peer(rank, i)
        if not gpu_p2p_access_check(rank, i):
            return False
    return True

7.2 CU Memory Allocator - CUDA 内存管理器

cumem.py(file:///workspace/vllm/device_allocator/cumem.py) 实现基于 cuMem API 的可插拔内存分配器:

python 复制代码
# 文件: /workspace/vllm/device_allocator/cumem.py
# 行号: 27-88

cumem_available = False
try:
    from vllm.cumem_allocator import (
        init_module,
        python_create_and_map,
        python_unmap_and_release,
    )
    from vllm.distributed.device_communicators.cuda_wrapper import CudaRTLibrary

    lib_name = find_loaded_library("cumem_allocator")
    libcudart = CudaRTLibrary()
    cumem_available = True
except ModuleNotFoundError:
    init_module = None
    python_create_and_map = None
    python_unmap_and_release = None
    lib_name = None


def get_pluggable_allocator(
    python_malloc_fn: Callable[[HandleType], None],
    python_free_func: Callable[[int], HandleType],
) -> torch.cuda.memory.CUDAPluggableAllocator:
    """创建 PyTorch 可插拔分配器"""
    init_module(python_malloc_fn, python_free_func)
    new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
        lib_name, "my_malloc", "my_free"
    )
    return new_alloc


@dataclasses.dataclass
class AllocationData:
    handle: HandleType
    tag: str
    cpu_backup_tensor: torch.Tensor | None = None


class CuMemAllocator:
    """
    基于 cuMem 的单例内存池管理器。
    支持 sleep 模式:可将标记的 tensor 卸载或丢弃。
    """

CuMemAllocator 核心功能

  • 虚拟内存管理 :使用 cuMemCreate/cuMemMap 管理虚拟地址空间
  • Sleep Mode:支持将内存池内容卸载到 CPU 或直接丢弃
  • Tag-based 管理:按 tag 分组管理不同用途的内存
  • Pluggable Interface :通过 PyTorch 的 CUDAPluggableAllocator 接口集成

八、采样内核

8.1 Top-k / Top-p 采样器

sampler.py(file:///workspace/vllm/v1/sample/sampler.py) 实现完整的采样流程:

python 复制代码
# 文件: /workspace/vllm/v1/sample/sampler.py
# 行号: 21-143

class Sampler(nn.Module):
    """
    采样层执行以下步骤:
    1. 如果需要 logprobs:计算原始 logprobs
    2. 将 logits 转换为 float32
    3. 应用允许的 token id 白名单
    4. 应用 bad words 排除
    5. 应用 logit processors (非 argmax-invariant)
    6. 应用 penalties (重复/频率/存在惩罚)
    7. 采样下一个 token:
       a) 如果不是 all_random:贪婪采样
       b) 应用 temperature
       c) 应用 argmax-invariant logit processors (min_p)
       d) 应用 top_k 和/或 top_p
       e) 从概率分布采样
    8. 收集 logprobs
    9. 返回 SamplerOutput
    """

    def __init__(self, logprobs_mode: LogprobsMode = "raw_logprobs"):
        super().__init__()
        self.topk_topp_sampler = TopKTopPSampler(logprobs_mode)
        self.pin_memory = is_pin_memory_available()
        self.logprobs_mode = logprobs_mode

    def forward(self, logits, sampling_metadata, predict_bonus_token=False,
                logprobs_mode_override=None) -> SamplerOutput:
        # Step 1-6: Preprocessing
        logits = logits.to(torch.float32)
        logits = self.apply_logits_processors(logits, sampling_metadata, ...)

        # Step 7: Sampling
        sampled, processed_logprobs = self.sample(logits, sampling_metadata)

        # Step 8-9: Post-processing
        sampled = sampled.long()
        ...

8.2 TopKTopPSampler - 核心采样 Kernel

topk_topp_sampler.py(file:///workspace/vllm/v1/sample/ops/topk_topp_sampler.py) 和 topk_topp_triton.py(file:///workspace/vllm/v1/sample/ops/topk_topp_triton.py) 实现 GPU 加速的 Top-k/Top-p 采样:
#mermaid-svg-iOcfLGhVNPHTtsyY{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-iOcfLGhVNPHTtsyY .error-icon{fill:#552222;}#mermaid-svg-iOcfLGhVNPHTtsyY .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-iOcfLGhVNPHTtsyY .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-iOcfLGhVNPHTtsyY .marker{fill:#333333;stroke:#333333;}#mermaid-svg-iOcfLGhVNPHTtsyY .marker.cross{stroke:#333333;}#mermaid-svg-iOcfLGhVNPHTtsyY svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-iOcfLGhVNPHTtsyY p{margin:0;}#mermaid-svg-iOcfLGhVNPHTtsyY .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-iOcfLGhVNPHTtsyY .cluster-label text{fill:#333;}#mermaid-svg-iOcfLGhVNPHTtsyY .cluster-label span{color:#333;}#mermaid-svg-iOcfLGhVNPHTtsyY .cluster-label span p{background-color:transparent;}#mermaid-svg-iOcfLGhVNPHTtsyY .label text,#mermaid-svg-iOcfLGhVNPHTtsyY span{fill:#333;color:#333;}#mermaid-svg-iOcfLGhVNPHTtsyY .node rect,#mermaid-svg-iOcfLGhVNPHTtsyY .node circle,#mermaid-svg-iOcfLGhVNPHTtsyY .node ellipse,#mermaid-svg-iOcfLGhVNPHTtsyY .node polygon,#mermaid-svg-iOcfLGhVNPHTtsyY .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-iOcfLGhVNPHTtsyY .rough-node .label text,#mermaid-svg-iOcfLGhVNPHTtsyY .node .label text,#mermaid-svg-iOcfLGhVNPHTtsyY .image-shape .label,#mermaid-svg-iOcfLGhVNPHTtsyY .icon-shape .label{text-anchor:middle;}#mermaid-svg-iOcfLGhVNPHTtsyY .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-iOcfLGhVNPHTtsyY .rough-node .label,#mermaid-svg-iOcfLGhVNPHTtsyY .node .label,#mermaid-svg-iOcfLGhVNPHTtsyY .image-shape .label,#mermaid-svg-iOcfLGhVNPHTtsyY .icon-shape .label{text-align:center;}#mermaid-svg-iOcfLGhVNPHTtsyY .node.clickable{cursor:pointer;}#mermaid-svg-iOcfLGhVNPHTtsyY .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-iOcfLGhVNPHTtsyY .arrowheadPath{fill:#333333;}#mermaid-svg-iOcfLGhVNPHTtsyY .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-iOcfLGhVNPHTtsyY .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-iOcfLGhVNPHTtsyY .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-iOcfLGhVNPHTtsyY .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-iOcfLGhVNPHTtsyY .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-iOcfLGhVNPHTtsyY .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-iOcfLGhVNPHTtsyY .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-iOcfLGhVNPHTtsyY .cluster text{fill:#333;}#mermaid-svg-iOcfLGhVNPHTtsyY .cluster span{color:#333;}#mermaid-svg-iOcfLGhVNPHTtsyY div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-iOcfLGhVNPHTtsyY .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-iOcfLGhVNPHTtsyY rect.text{fill:none;stroke-width:0;}#mermaid-svg-iOcfLGhVNPHTtsyY .icon-shape,#mermaid-svg-iOcfLGhVNPHTtsyY .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-iOcfLGhVNPHTtsyY .icon-shape p,#mermaid-svg-iOcfLGhVNPHTtsyY .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-iOcfLGhVNPHTtsyY .icon-shape .label rect,#mermaid-svg-iOcfLGhVNPHTtsyY .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-iOcfLGhVNPHTtsyY .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-iOcfLGhVNPHTtsyY .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-iOcfLGhVNPHTtsyY :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} ⚡ 优化技术
Persistent TopK

持久化 Top-K
Fused Kernel

融合操作
Batched

批处理
🎲 采样流程
Logits

模型输出
Temperature

温度缩放
Top-K

截断
Top-P

核采样
Normalize

归一化
Sample

采样

采样算法说明

  • Top-k:保留概率最高的 k 个 token,其余置零
  • Top-p (nucleus):保留累积概率达到 p 的最小 token 集
  • Temperature:控制分布的锐度 (T→0: 贪婪, T→∞: 均匀)
  • Min-p:基于最大概率的阈值过滤

九、平台特定内核

9.1 ROCm (AMD GPU) 内核

rocm.py(file:///workspace/vllm/platforms/rocm.py) 定义 AMD ROCm 平台支持:

python 复制代码
# 文件: /workspace/vllm/platforms/rocm.py
# 行号: 28-74

try:
    from amdsmi import (
        AmdSmiException,
        amdsmi_get_gpu_asic_info,
        amdsmi_get_gpu_device_uuid,
        amdsmi_get_processor_handles,
        amdsmi_init,
        amdsmi_shut_down,
        amdsmi_topo_get_link_type,
        amdsmi_topo_get_numa_node_number,
    )
except ImportError as e:
    logger.warning("Failed to import from amdsmi with %r", e)

# import custom ops, trigger op registration
try:
    import vllm._C  # noqa: F401
except ImportError as e:
    logger.warning("Failed to import from vllm._C with %r", e)

try:
    import vllm._rocm_C  # noqa: ROCm 特定扩展
except ImportError as e:
    logger.warning("Failed to import from vllm._rocm_C with %r", e)

_ROCM_DEVICE_ID_NAME_MAP: dict[str, str] = {
    "0x74a0": "AMD_Instinct_MI300A",
    "0x74a1": "AMD_Instinct_MI300X",
    "0x74b5": "AMD_Instinct_MI300X",   # MI300X VF
    "0x74a5": "AMD_Instinct_MI325X",
    "0x74b9": "AMD_Instinct_MI325X",   # MI325X VF
    "0x7550": "AMD_Radeon_RX9070XT",   # RDNA 4 (Navi 48)
    "0x7551": "AMD_Radeon_R9700",      # RDNA 4
}

ROCm 特定组件

组件 说明
_rocm_C ROCm 编译的 C++ 扩展
AITER AMD AI Tensor 扩展库
amdsmi AMD 系统管理接口 (类似 NVML)
RCCL ROCm Communication Collectives Library
ROCm 注意力后端
  • rocm_attn.py(file:///workspace/vllm/v1/attention/backends/rocm_attn.py):通用 ROCm 注意力
  • rocm_aiter_fa.py(file:///workspace/vllm/v1/attention/backends/rocm_aiter_fa.py):AITER Flash Attention
  • rocm_aiter_mla.py(file:///workspace/vllm/v1/attention/backends/mla/rocm_aiter_mla.py):AITER MLA
  • rocm_aiter_unified_attn.py(file:///workspace/vllm/v1/attention/backends/rocm_aiter_unified_attn.py):统一 AITER 注意力

9.2 CPU 内核

cpu.py(file:///workspace/vllm/platforms/cpu.py) 定义 CPU 平台支持:

python 复制代码
# 文件: /workspace/vllm/platforms/cpu.py
# 行号: 41-99

class CpuPlatform(Platform):
    _enum = PlatformEnum.CPU
    device_name: str = "cpu"
    device_type: str = "cpu"
    dispatch_key: str = "CPU"
    dist_backend: str = "gloo"

    @property
    def supported_dtypes(self) -> list[torch.dtype]:
        if self.get_cpu_architecture() == CpuArchEnum.POWERPC:
            return [torch.bfloat16, torch.float32]
        elif self.get_cpu_architecture() == CpuArchEnum.ARM:
            if sys.platform.startswith("darwin"):
                # Apple Silicon with BF16
                return [torch.bfloat16, torch.float16, torch.float32]
            return [torch.float16, torch.float32]
        elif self.get_cpu_architecture() == CpuArchEnum.RISCV:
            return [torch.bfloat16, torch.float16, torch.float32]
        # x86/aarch64: 全部支持
        return [torch.bfloat16, torch.float16, torch.float32]

    @classmethod
    def get_attn_backend_cls(cls, selected_backend, attn_selector_config, num_heads=None):
        if attn_selector_config.use_mla:
            raise NotImplementedError("MLA is not supported on CPU.")
        if attn_selector_config.use_sparse:
            raise NotImplementedError("Sparse Attention is not supported on CPU.")
        return AttentionBackendEnum.CPU_ATTN.get_path()

CPU 架构支持

CPU 架构 SIMD 扩展 特殊指令
x86-64 AVX2/AVX-512 AMX (Intel)
ARM (AArch64) NEON SVE (可选)
PowerPC VSX/Altivec -
RISC-V RVV (V extension) -
Apple Silicon AMX (Apple) BF16 硬件加速
CPU 注意力后端

cpu_attn.py(file:///workspace/vllm/v1/attention/backends/cpu_attn.py):

  • 使用 PyTorch 原生 SDPA 或手动实现
  • 针对 CPU 缓存层次结构优化
  • 支持多线程并行化

9.3 Intel XPU 内核

xpu_ops.py(file:///workspace/vllm/kernels/xpu_ops.py) 和 xpu.py(file:///workspace/vllm/platforms/xpu.py):

  • oneAPI 支持:Intel Data Parallel C++ 后端
  • XMX 引擎:利用 Intel X Matrix Extensions
  • XPU MLAxpu_mla_sparse.py(file:///workspace/vllm/v1/attention/backends/mla/xpu_mla_sparse.py)
  • XPU Linear Kernelsxpu.py(file:///workspace/vllm/model_executor/kernels/linear/scaled_mm/xpu.py), mixed_precision/xpu.py(file:///workspace/vllm/model_executor/kernels/linear/mixed_precision/xpu.py)

十、内核分类与调用链总览

10.1 完整内核分类图

#mermaid-svg-LaZhdgeaWMVJvLuN{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-LaZhdgeaWMVJvLuN .error-icon{fill:#552222;}#mermaid-svg-LaZhdgeaWMVJvLuN .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-LaZhdgeaWMVJvLuN .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-LaZhdgeaWMVJvLuN .marker{fill:#333333;stroke:#333333;}#mermaid-svg-LaZhdgeaWMVJvLuN .marker.cross{stroke:#333333;}#mermaid-svg-LaZhdgeaWMVJvLuN svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-LaZhdgeaWMVJvLuN p{margin:0;}#mermaid-svg-LaZhdgeaWMVJvLuN .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-LaZhdgeaWMVJvLuN .cluster-label text{fill:#333;}#mermaid-svg-LaZhdgeaWMVJvLuN .cluster-label span{color:#333;}#mermaid-svg-LaZhdgeaWMVJvLuN .cluster-label span p{background-color:transparent;}#mermaid-svg-LaZhdgeaWMVJvLuN .label text,#mermaid-svg-LaZhdgeaWMVJvLuN span{fill:#333;color:#333;}#mermaid-svg-LaZhdgeaWMVJvLuN .node rect,#mermaid-svg-LaZhdgeaWMVJvLuN .node circle,#mermaid-svg-LaZhdgeaWMVJvLuN .node ellipse,#mermaid-svg-LaZhdgeaWMVJvLuN .node polygon,#mermaid-svg-LaZhdgeaWMVJvLuN .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-LaZhdgeaWMVJvLuN .rough-node .label text,#mermaid-svg-LaZhdgeaWMVJvLuN .node .label text,#mermaid-svg-LaZhdgeaWMVJvLuN .image-shape .label,#mermaid-svg-LaZhdgeaWMVJvLuN .icon-shape .label{text-anchor:middle;}#mermaid-svg-LaZhdgeaWMVJvLuN .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-LaZhdgeaWMVJvLuN .rough-node .label,#mermaid-svg-LaZhdgeaWMVJvLuN .node .label,#mermaid-svg-LaZhdgeaWMVJvLuN .image-shape .label,#mermaid-svg-LaZhdgeaWMVJvLuN .icon-shape .label{text-align:center;}#mermaid-svg-LaZhdgeaWMVJvLuN .node.clickable{cursor:pointer;}#mermaid-svg-LaZhdgeaWMVJvLuN .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-LaZhdgeaWMVJvLuN .arrowheadPath{fill:#333333;}#mermaid-svg-LaZhdgeaWMVJvLuN .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-LaZhdgeaWMVJvLuN .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-LaZhdgeaWMVJvLuN .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-LaZhdgeaWMVJvLuN .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-LaZhdgeaWMVJvLuN .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-LaZhdgeaWMVJvLuN .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-LaZhdgeaWMVJvLuN .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-LaZhdgeaWMVJvLuN .cluster text{fill:#333;}#mermaid-svg-LaZhdgeaWMVJvLuN .cluster span{color:#333;}#mermaid-svg-LaZhdgeaWMVJvLuN div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-LaZhdgeaWMVJvLuN .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-LaZhdgeaWMVJvLuN rect.text{fill:none;stroke-width:0;}#mermaid-svg-LaZhdgeaWMVJvLuN .icon-shape,#mermaid-svg-LaZhdgeaWMVJvLuN .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-LaZhdgeaWMVJvLuN .icon-shape p,#mermaid-svg-LaZhdgeaWMVJvLuN .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-LaZhdgeaWMVJvLuN .icon-shape .label rect,#mermaid-svg-LaZhdgeaWMVJvLuN .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-LaZhdgeaWMVJvLuN .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-LaZhdgeaWMVJvLuN .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-LaZhdgeaWMVJvLuN :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 🎲 采样系统
sampler.py

采样器
topk_topp_

TopK/TopP Kernel
📡 通信系统
custom_all_reduce.py

自定义 AllReduce
quick_all_reduce.py

快速 Reduce
cumem.py

CU Mem Alloc
🔀 MoE 系统
modular_kernel.py

模块化架构
router/

路由器
moe_wna16.py

4-bit MoE
🎯 量化系统
mixed_precision/

混合精度
scaled_mm/

缩放 MM
nvfp4/

NVFP4 线性层
mxfp8/

MXFP8 线性层
⚡ 激活函数系统
vllm_c.py

RMSNorm
rotary_embedding/

RoPE 变体
helion/

SiLU*FP8
💾 缓存系统
reshape_and_cache_flash.py

KV 写入
nvfp4_utils.py

NVFP4 格式
fp8_utils.py

FP8 量化
👁️ 注意力系统
registry.py

后端注册表
flashinfer.py

FlashInfer 后端
flash_attn.py

FlashAttn 后端
triton_attn.py

Triton 后端
mla/

MLA 子系统
cpu_attn.py

CPU 后端
rocm_attn.py

ROCm 后端
🚀 入口点
vllm/kernels/init.py
vllm/_custom_ops.py
platforms/*.py

10.2 典型调用链示例

推理时注意力计算调用链

Triton/CUDA Kernel KV Cache Ops FlashInfer Attention Backend Model Forward Triton/CUDA Kernel KV Cache Ops FlashInfer Attention Backend Model Forward #mermaid-svg-fS6bvloRtxIZ5NjJ{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-fS6bvloRtxIZ5NjJ .error-icon{fill:#552222;}#mermaid-svg-fS6bvloRtxIZ5NjJ .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-fS6bvloRtxIZ5NjJ .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-fS6bvloRtxIZ5NjJ .marker{fill:#333333;stroke:#333333;}#mermaid-svg-fS6bvloRtxIZ5NjJ .marker.cross{stroke:#333333;}#mermaid-svg-fS6bvloRtxIZ5NjJ svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-fS6bvloRtxIZ5NjJ p{margin:0;}#mermaid-svg-fS6bvloRtxIZ5NjJ .actor{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-fS6bvloRtxIZ5NjJ text.actor>tspan{fill:black;stroke:none;}#mermaid-svg-fS6bvloRtxIZ5NjJ .actor-line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);}#mermaid-svg-fS6bvloRtxIZ5NjJ .innerArc{stroke-width:1.5;stroke-dasharray:none;}#mermaid-svg-fS6bvloRtxIZ5NjJ .messageLine0{stroke-width:1.5;stroke-dasharray:none;stroke:#333;}#mermaid-svg-fS6bvloRtxIZ5NjJ .messageLine1{stroke-width:1.5;stroke-dasharray:2,2;stroke:#333;}#mermaid-svg-fS6bvloRtxIZ5NjJ #arrowhead path{fill:#333;stroke:#333;}#mermaid-svg-fS6bvloRtxIZ5NjJ .sequenceNumber{fill:white;}#mermaid-svg-fS6bvloRtxIZ5NjJ #sequencenumber{fill:#333;}#mermaid-svg-fS6bvloRtxIZ5NjJ #crosshead path{fill:#333;stroke:#333;}#mermaid-svg-fS6bvloRtxIZ5NjJ .messageText{fill:#333;stroke:none;}#mermaid-svg-fS6bvloRtxIZ5NjJ .labelBox{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-fS6bvloRtxIZ5NjJ .labelText,#mermaid-svg-fS6bvloRtxIZ5NjJ .labelText>tspan{fill:black;stroke:none;}#mermaid-svg-fS6bvloRtxIZ5NjJ .loopText,#mermaid-svg-fS6bvloRtxIZ5NjJ .loopText>tspan{fill:black;stroke:none;}#mermaid-svg-fS6bvloRtxIZ5NjJ .loopLine{stroke-width:2px;stroke-dasharray:2,2;stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);}#mermaid-svg-fS6bvloRtxIZ5NjJ .note{stroke:#aaaa33;fill:#fff5ad;}#mermaid-svg-fS6bvloRtxIZ5NjJ .noteText,#mermaid-svg-fS6bvloRtxIZ5NjJ .noteText>tspan{fill:black;stroke:none;}#mermaid-svg-fS6bvloRtxIZ5NjJ .activation0{fill:#f4f4f4;stroke:#666;}#mermaid-svg-fS6bvloRtxIZ5NjJ .activation1{fill:#f4f4f4;stroke:#666;}#mermaid-svg-fS6bvloRtxIZ5NjJ .activation2{fill:#f4f4f4;stroke:#666;}#mermaid-svg-fS6bvloRtxIZ5NjJ .actorPopupMenu{position:absolute;}#mermaid-svg-fS6bvloRtxIZ5NjJ .actorPopupMenuPanel{position:absolute;fill:#ECECFF;box-shadow:0px 8px 16px 0px rgba(0,0,0,0.2);filter:drop-shadow(3px 5px 2px rgb(0 0 0 / 0.4));}#mermaid-svg-fS6bvloRtxIZ5NjJ .actor-man line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-fS6bvloRtxIZ5NjJ .actor-man circle,#mermaid-svg-fS6bvloRtxIZ5NjJ line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;stroke-width:2px;}#mermaid-svg-fS6bvloRtxIZ5NjJ :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} forward(hidden_states, kv_cache, metadata) batch_prefill_with_paged_kv(q, k_v_cache, metadata) reshape_and_cache(new_k, new_v, cache, slot_mapping) launch reshape_and_cache_kernel_flash() KV written launch flashinfer attention kernel attention output output tensor context output

量化线性层调用链

_custom_ops Marlin Kernel MPLinearKernel Linear Layer _custom_ops Marlin Kernel MPLinearKernel Linear Layer #mermaid-svg-rsNkBVMgXv2FLLVK{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-rsNkBVMgXv2FLLVK .error-icon{fill:#552222;}#mermaid-svg-rsNkBVMgXv2FLLVK .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-rsNkBVMgXv2FLLVK .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-rsNkBVMgXv2FLLVK .marker{fill:#333333;stroke:#333333;}#mermaid-svg-rsNkBVMgXv2FLLVK .marker.cross{stroke:#333333;}#mermaid-svg-rsNkBVMgXv2FLLVK svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-rsNkBVMgXv2FLLVK p{margin:0;}#mermaid-svg-rsNkBVMgXv2FLLVK .actor{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-rsNkBVMgXv2FLLVK text.actor>tspan{fill:black;stroke:none;}#mermaid-svg-rsNkBVMgXv2FLLVK .actor-line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);}#mermaid-svg-rsNkBVMgXv2FLLVK .innerArc{stroke-width:1.5;stroke-dasharray:none;}#mermaid-svg-rsNkBVMgXv2FLLVK .messageLine0{stroke-width:1.5;stroke-dasharray:none;stroke:#333;}#mermaid-svg-rsNkBVMgXv2FLLVK .messageLine1{stroke-width:1.5;stroke-dasharray:2,2;stroke:#333;}#mermaid-svg-rsNkBVMgXv2FLLVK #arrowhead path{fill:#333;stroke:#333;}#mermaid-svg-rsNkBVMgXv2FLLVK .sequenceNumber{fill:white;}#mermaid-svg-rsNkBVMgXv2FLLVK #sequencenumber{fill:#333;}#mermaid-svg-rsNkBVMgXv2FLLVK #crosshead path{fill:#333;stroke:#333;}#mermaid-svg-rsNkBVMgXv2FLLVK .messageText{fill:#333;stroke:none;}#mermaid-svg-rsNkBVMgXv2FLLVK .labelBox{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-rsNkBVMgXv2FLLVK .labelText,#mermaid-svg-rsNkBVMgXv2FLLVK .labelText>tspan{fill:black;stroke:none;}#mermaid-svg-rsNkBVMgXv2FLLVK .loopText,#mermaid-svg-rsNkBVMgXv2FLLVK .loopText>tspan{fill:black;stroke:none;}#mermaid-svg-rsNkBVMgXv2FLLVK .loopLine{stroke-width:2px;stroke-dasharray:2,2;stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);}#mermaid-svg-rsNkBVMgXv2FLLVK .note{stroke:#aaaa33;fill:#fff5ad;}#mermaid-svg-rsNkBVMgXv2FLLVK .noteText,#mermaid-svg-rsNkBVMgXv2FLLVK .noteText>tspan{fill:black;stroke:none;}#mermaid-svg-rsNkBVMgXv2FLLVK .activation0{fill:#f4f4f4;stroke:#666;}#mermaid-svg-rsNkBVMgXv2FLLVK .activation1{fill:#f4f4f4;stroke:#666;}#mermaid-svg-rsNkBVMgXv2FLLVK .activation2{fill:#f4f4f4;stroke:#666;}#mermaid-svg-rsNkBVMgXv2FLLVK .actorPopupMenu{position:absolute;}#mermaid-svg-rsNkBVMgXv2FLLVK .actorPopupMenuPanel{position:absolute;fill:#ECECFF;box-shadow:0px 8px 16px 0px rgba(0,0,0,0.2);filter:drop-shadow(3px 5px 2px rgb(0 0 0 / 0.4));}#mermaid-svg-rsNkBVMgXv2FLLVK .actor-man line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;}#mermaid-svg-rsNkBVMgXv2FLLVK .actor-man circle,#mermaid-svg-rsNkBVMgXv2FLLVK line{stroke:hsl(259.6261682243, 59.7765363128%, 87.9019607843%);fill:#ECECFF;stroke-width:2px;}#mermaid-svg-rsNkBVMgXv2FLLVK :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} forward(x) marlin_gemm(x, q_weight, scales, workspace) ops.gptq_marlin_gemm(...) result (CUDA kernel executed) output tensor linear output

10.3 性能优化策略总结

优化维度 技术手段 适用场景
Kernel Fusion FusedAddRMSNorm, FusedInvRoPE+FP8 减少内存带宽压力
Memory Layout Head-Major, PagedAttention 优化缓存利用率
Quantization INT4/FP8/NVFP4 减少显存占用
Communication CustomAllReduce, P2P 降低通信延迟
Batching Continuous Batching, uBatching 提高 GPU 利用率
Auto-tuning Helion configs, Backend selection 自动选择最优路径
Platform-specific SM-specific kernels, ISA extensions 充分利用硬件特性

10.4 关键文件索引

类别 核心文件 行号范围
构建系统 vllm/kernels/vllm_c.py(file:///workspace/vllm/kernels/vllm_c.py) 1-63
构建系统 vllm/utils/torch_utils.py(file:///workspace/vllm/utils/torch_utils.py) 928-967
注意力 vllm/v1/attention/backends/registry.py(file:///workspace/vllm/v1/attention/backends/registry.py) 34-87
注意力 vllm/v1/attention/backends/flashinfer.py(file:///workspace/vllm/v1/attention/backends/flashinfer.py) 1-149
缓存 vllm/v1/attention/ops/triton_reshape_and_cache_flash.py(file:///workspace/vllm/v1/attention/ops/triton_reshape_and_cache_flash.py) 17-100
缓存 vllm/utils/torch_utils.py (NVFP4)(file:///workspace/vllm/utils/torch_utils.py) 415-469
量化 vllm/model_executor/kernels/linear/mixed_precision/marlin.py(file:///workspace/vllm/model_executor/kernels/linear/mixed_precision/marlin.py) 30-100
量化 vllm/model_executor/kernels/linear/mixed_precision/machete.py(file:///workspace/vllm/model_executor/kernels/linear/mixed_precision/machete.py) 24-100
量化 vllm/model_executor/layers/quantization/awq_marlin.py(file:///workspace/vllm/model_executor/layers/quantization/awq_marlin.py) 67-100
量化 vllm/model_executor/layers/quantization/gguf.py(file:///workspace/vllm/model_executor/layers/quantization/gguf.py) 52-100
MoE vllm/model_executor/layers/fused_moe/modular_kernel.py(file:///workspace/vllm/model_executor/layers/fused_moe/modular_kernel.py) 46-150
MoE vllm/model_executor/layers/fused_moe/router/fused_topk_router.py(file:///workspace/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py) 17-100
通信 vllm/distributed/device_communicators/custom_all_reduce.py(file:///workspace/vllm/distributed/device_communicators/custom_all_reduce.py) 50-100
通信 vllm/device_allocator/cumem.py(file:///workspace/vllm/device_allocator/cumem.py) 27-88
采样 vllm/v1/sample/sampler.py(file:///workspace/vllm/v1/sample/sampler.py) 21-143
平台 vllm/platforms/cuda.py(file:///workspace/vllm/platforms/cuda.py) 79-143
平台 vllm/platforms/rocm.py(file:///workspace/vllm/platforms/rocm.py) 28-74
平台 vllm/platforms/cpu.py(file:///workspace/vllm/platforms/cpu.py) 41-99

总结

本文档深入分析了 vLLM 的 CUDA/C++ 内核层架构,揭示了其作为现代 LLM 推理引擎的核心设计理念:

🏗️ 架构特点

  1. 混合内核生态:Python 绑定 + Triton JIT + C++/CUDA 扩展 + 第三方库(FlashInfer、Marlin、CUTLASS)
  2. 平台抽象层 :通过 Platform 接口屏蔽 CUDA/ROCm/CPU/XPU 差异
  3. 可插拔后端:注意力、线性层、MoE 等均支持多后端动态选择
  4. IR Op 系统 :通过 ir.ops 实现算子级别的优化和替换

⚡ 性能优化核心

  • PagedAttention:创新的分页式注意力,解决显存碎片问题
  • 量化支持:从 INT4 到 FP8 的完整量化栈(AWQ/GPTQ/GGUF/Marlin/Machete)
  • 通信优化:Custom AllReduce + P2P 显著降低张量并行延迟
  • MoE 加速:模块化架构支持多种路由和专家计算方案

🔮 未来方向

  • Blackwell (SM 10.x):充分利用 TMA (Tensor Memory Accelerator) 和 FP4
  • 稀疏注意力:Sparse MLA 等长序列优化
  • 更多量化格式:MXFP4、OC-MX 等新微缩放格式
  • 跨平台统一:通过 CUTLASS 和 Triton 实现更好的可移植性

文档版本 : v1.0

基于源码 : /workspace/vllm

生成日期 : 2026-05-10

总行数: ~1200 行

相关推荐
happyprince3 小时前
09-vLLM KV Cache 系统完整分析
java·spring·vllm
happyprince4 小时前
12-vLLM 量化方案全面分析
前端·javascript·vllm
happyprince4 小时前
18-vLLM 结构化输出约束分析文档
网络·vllm
happyprince4 小时前
14-vLLM LoRA 适配器系统深度解析
vllm
AI小百科18 天前
llama.cpp vs vLLM:深度解析与选型指南
llama·vllm
IRevers18 天前
【大模型】Gemma4在ROCm和vLLM部署
人工智能·pytorch·深度学习·大模型·datawhale·vllm·amdev
下班走回家19 天前
本地部署大模型的三种方式:Ollama vs vLLM vs llama.cpp
人工智能·llama·vllm
花间相见19 天前
【大模型部署01】—— vLLM 部署大模型服务实操:从 0 到 1 搭建 OpenAI 兼容 API
vllm
有来有去952719 天前
【训推框架】Vime-大规模 LLM/VLM 强化学习训练框架
人工智能·深度学习·语言模型·gpu算力·vllm