12-vLLM 量化方案全面分析

vLLM 量化方案全面分析

定位:vLLM 量化(Quantization)子系统的全面架构分析,涵盖从配置层到 CUDA 内核的完整技术栈。
#mermaid-svg-NNXdknHGJ4LMg56n{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-NNXdknHGJ4LMg56n .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-NNXdknHGJ4LMg56n .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-NNXdknHGJ4LMg56n .error-icon{fill:#552222;}#mermaid-svg-NNXdknHGJ4LMg56n .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-NNXdknHGJ4LMg56n .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-NNXdknHGJ4LMg56n .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-NNXdknHGJ4LMg56n .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-NNXdknHGJ4LMg56n .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-NNXdknHGJ4LMg56n .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-NNXdknHGJ4LMg56n .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-NNXdknHGJ4LMg56n .marker{fill:#333333;stroke:#333333;}#mermaid-svg-NNXdknHGJ4LMg56n .marker.cross{stroke:#333333;}#mermaid-svg-NNXdknHGJ4LMg56n svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-NNXdknHGJ4LMg56n p{margin:0;}#mermaid-svg-NNXdknHGJ4LMg56n .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-NNXdknHGJ4LMg56n .cluster-label text{fill:#333;}#mermaid-svg-NNXdknHGJ4LMg56n .cluster-label span{color:#333;}#mermaid-svg-NNXdknHGJ4LMg56n .cluster-label span p{background-color:transparent;}#mermaid-svg-NNXdknHGJ4LMg56n .label text,#mermaid-svg-NNXdknHGJ4LMg56n span{fill:#333;color:#333;}#mermaid-svg-NNXdknHGJ4LMg56n .node rect,#mermaid-svg-NNXdknHGJ4LMg56n .node circle,#mermaid-svg-NNXdknHGJ4LMg56n .node ellipse,#mermaid-svg-NNXdknHGJ4LMg56n .node polygon,#mermaid-svg-NNXdknHGJ4LMg56n .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-NNXdknHGJ4LMg56n .rough-node .label text,#mermaid-svg-NNXdknHGJ4LMg56n .node .label text,#mermaid-svg-NNXdknHGJ4LMg56n .image-shape .label,#mermaid-svg-NNXdknHGJ4LMg56n .icon-shape .label{text-anchor:middle;}#mermaid-svg-NNXdknHGJ4LMg56n .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-NNXdknHGJ4LMg56n .rough-node .label,#mermaid-svg-NNXdknHGJ4LMg56n .node .label,#mermaid-svg-NNXdknHGJ4LMg56n .image-shape .label,#mermaid-svg-NNXdknHGJ4LMg56n .icon-shape .label{text-align:center;}#mermaid-svg-NNXdknHGJ4LMg56n .node.clickable{cursor:pointer;}#mermaid-svg-NNXdknHGJ4LMg56n .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-NNXdknHGJ4LMg56n .arrowheadPath{fill:#333333;}#mermaid-svg-NNXdknHGJ4LMg56n .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-NNXdknHGJ4LMg56n .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-NNXdknHGJ4LMg56n .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-NNXdknHGJ4LMg56n .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-NNXdknHGJ4LMg56n .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-NNXdknHGJ4LMg56n .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-NNXdknHGJ4LMg56n .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-NNXdknHGJ4LMg56n .cluster text{fill:#333;}#mermaid-svg-NNXdknHGJ4LMg56n .cluster span{color:#333;}#mermaid-svg-NNXdknHGJ4LMg56n div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-NNXdknHGJ4LMg56n .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-NNXdknHGJ4LMg56n rect.text{fill:none;stroke-width:0;}#mermaid-svg-NNXdknHGJ4LMg56n .icon-shape,#mermaid-svg-NNXdknHGJ4LMg56n .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-NNXdknHGJ4LMg56n .icon-shape p,#mermaid-svg-NNXdknHGJ4LMg56n .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-NNXdknHGJ4LMg56n .icon-shape .label rect,#mermaid-svg-NNXdknHGJ4LMg56n .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-NNXdknHGJ4LMg56n .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-NNXdknHGJ4LMg56n .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-NNXdknHGJ4LMg56n :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} ⚡ 内核类型
🏗️ 实现层次
📊 量化方案矩阵
FP8 E4M3
INT8 W8A8
INT4 W4A16
GPTQ 2-8bit
AWQ 4bit
GGUF 多精度
NVFP4 Blackwell
MXFP4 Microscaling
MXFP8 Microscaling
Marlin 4-bit
Machete Hopper+
QuantizationConfig

配置层
LinearKernel

内核选择层
CUDA Kernel

算子实现层
ScaledMM

缩放矩阵乘法
MPLinearKernel

混合精度内核
Mxfp8LinearKernel

MXFP8 专用内核
NvFp4LinearKernel

NVFP4 专用内核


目录

  • 一、量化方案矩阵
  • [二、FP8 量化](#二、FP8 量化)
  • [三、INT8 / INT4 量化(GPTQ / AWQ)](#三、INT8 / INT4 量化(GPTQ / AWQ))
  • [四、GGUF 格式](#四、GGUF 格式)
  • [五、NVFP4 Blackwell 4-bit 浮点](#五、NVFP4 Blackwell 4-bit 浮点)
  • [六、MXFP4 / MXFP8 块缩放浮点](#六、MXFP4 / MXFP8 块缩放浮点)
  • [七、Marlin 高性能 4-bit 内核](#七、Marlin 高性能 4-bit 内核)
  • [八、Machete 新一代量化内核](#八、Machete 新一代量化内核)
  • [九、量化线性层与 GEMM 操作](#九、量化线性层与 GEMM 操作)
  • 十、量化配置体系
  • 十一、量化方案选择决策树
  • 十二、量化数据流全景

一、量化方案矩阵

vLLM 支持的量化方案覆盖了从浮点到整数的全谱系,下表为各方案的横向对比:

方案 权重精度 激活精度 用途/特点 最小 GPU 能力 配置类 关键文件
FP8 float8_e4m3fn BF16/FP16 (动态) H100/H200 原生支持,在线/离线量化 SM 75 (Turing) Fp8Config fp8.py
FP8 Per-Block float8_e4m3fn 动态 128-block DeepSeek 风格块级量化 SM 75 Fp8Config fp8.py:292-311
MXFP8 float8 + uint8 scale BF16 MicroScaling FP8,32 元素块缩放 SM 80 ModelOptMxFp8Config modelopt.py
GPTQ INT4/INT8 (packed int32) FP16 训练后量化,支持 act-order SM 60 GPTQConfig gptq.py
GPTQ-Marlin UINT4B8/UINT8B128 FP16/BF16 Marlin 加速的 GPTQ SM 75 GPTQMarlinConfig gptq_marlin.py
AWQ UINT4 (packed int32) FP16 激活感知量化 SM 75 AWQConfig awq.py
AWQ-Marlin UINT4 FP16/BF16 Marlin 加速的 AWQ SM 75 AWQMarlinConfig awq_marlin.py
GGUF Q2~Q8 (多精度) FP16/BF16 llama.cpp 兼容格式 SM 60 GGUFConfig gguf.py
NVFP4 float4_e2m1fn (uint8 packed) FP8-E4M3 Blackwell 架构原生 4-bit 浮点 SM 100+ ModelOptNvFp4Config modelopt.py, nvfp4/base.py
MXFP4 uint8 (2×FP4/byte) BF16 MicroScaling FP4,MoE 专用 SM 80 Mxfp4Config mxfp4.py
Marlin UINT4/UINT8/FP8 FP16/INT8/FP8 高性能 4/8-bit GEMM 内核 SM 75+ --- marlin_utils.py
Machete UINT4/UINT8 FP16/BF16 Hopper (SM90) CUTLASS 内核 SM 90 --- machete.py, machete_utils.py

在线量化(Online Quantization)

除上述离线量化方案外,vLLM 还支持在线量化------加载全精度 checkpoint 后在推理时动态量化:

python 复制代码
# config/quantization.py 中定义
class OnlineQuantScheme(Enum):
    FP8_PER_TENSOR = "fp8_per_tensor"           # FP8 per-tensor 缩放
    FP8_PER_BLOCK = "fp8_per_block"              # FP8 128×128 块缩放(DeepSeek 风格)
    INT8_PER_CHANNEL_WEIGHT_ONLY = "int8_per_channel_weight_only"  # MoE 专家权重 INT8
    MXFP8 = "mxfp8"                              # MicroScaling FP8

源码位置:config/quantization.py:12-27

所有已注册量化方法

init.py:12-47 中通过 QuantizationMethods Literal 类型统一注册:

python 复制代码
QuantizationMethods = Literal[
    "awq", "fp8", "fbgemm_fp8", "fp_quant", "modelopt",
    "modelopt_fp4", "modelopt_mxfp8", "modelopt_mixed",
    "gguf", "gptq_marlin", "awq_marlin", "gptq",
    "humming", "compressed-tensors", "bitsandbytes",
    "experts_int8", "quark", "moe_wna16", "torchao",
    "inc", "mxfp4", "gpt_oss_mxfp4", "deepseek_v4_fp8",
    "cpu_awq", "online",
    "fp8_per_tensor", "fp8_per_block",
    "int8_per_channel_weight_only", "mxfp8",
]

二、FP8 量化

2.1 架构概述

FP8 是 vLLM 中最重要的量化方案之一,充分利用 NVIDIA H100/H200 的原生 FP8 Tensor Core 支持。其核心设计围绕两个维度展开:

  1. 离线 vs 在线:是否以 FP8 格式存储 checkpoint
  2. 激活策略:static(静态)vs dynamic(动态)缩放

#mermaid-svg-cvE5iu2KvQwWgkAw{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-cvE5iu2KvQwWgkAw .error-icon{fill:#552222;}#mermaid-svg-cvE5iu2KvQwWgkAw .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-cvE5iu2KvQwWgkAw .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-cvE5iu2KvQwWgkAw .marker{fill:#333333;stroke:#333333;}#mermaid-svg-cvE5iu2KvQwWgkAw .marker.cross{stroke:#333333;}#mermaid-svg-cvE5iu2KvQwWgkAw svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-cvE5iu2KvQwWgkAw p{margin:0;}#mermaid-svg-cvE5iu2KvQwWgkAw .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-cvE5iu2KvQwWgkAw .cluster-label text{fill:#333;}#mermaid-svg-cvE5iu2KvQwWgkAw .cluster-label span{color:#333;}#mermaid-svg-cvE5iu2KvQwWgkAw .cluster-label span p{background-color:transparent;}#mermaid-svg-cvE5iu2KvQwWgkAw .label text,#mermaid-svg-cvE5iu2KvQwWgkAw span{fill:#333;color:#333;}#mermaid-svg-cvE5iu2KvQwWgkAw .node rect,#mermaid-svg-cvE5iu2KvQwWgkAw .node circle,#mermaid-svg-cvE5iu2KvQwWgkAw .node ellipse,#mermaid-svg-cvE5iu2KvQwWgkAw .node polygon,#mermaid-svg-cvE5iu2KvQwWgkAw .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-cvE5iu2KvQwWgkAw .rough-node .label text,#mermaid-svg-cvE5iu2KvQwWgkAw .node .label text,#mermaid-svg-cvE5iu2KvQwWgkAw .image-shape .label,#mermaid-svg-cvE5iu2KvQwWgkAw .icon-shape .label{text-anchor:middle;}#mermaid-svg-cvE5iu2KvQwWgkAw .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-cvE5iu2KvQwWgkAw .rough-node .label,#mermaid-svg-cvE5iu2KvQwWgkAw .node .label,#mermaid-svg-cvE5iu2KvQwWgkAw .image-shape .label,#mermaid-svg-cvE5iu2KvQwWgkAw .icon-shape .label{text-align:center;}#mermaid-svg-cvE5iu2KvQwWgkAw .node.clickable{cursor:pointer;}#mermaid-svg-cvE5iu2KvQwWgkAw .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-cvE5iu2KvQwWgkAw .arrowheadPath{fill:#333333;}#mermaid-svg-cvE5iu2KvQwWgkAw .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-cvE5iu2KvQwWgkAw .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-cvE5iu2KvQwWgkAw .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-cvE5iu2KvQwWgkAw .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-cvE5iu2KvQwWgkAw .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-cvE5iu2KvQwWgkAw .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-cvE5iu2KvQwWgkAw .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-cvE5iu2KvQwWgkAw .cluster text{fill:#333;}#mermaid-svg-cvE5iu2KvQwWgkAw .cluster span{color:#333;}#mermaid-svg-cvE5iu2KvQwWgkAw div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-cvE5iu2KvQwWgkAw .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-cvE5iu2KvQwWgkAw rect.text{fill:none;stroke-width:0;}#mermaid-svg-cvE5iu2KvQwWgkAw .icon-shape,#mermaid-svg-cvE5iu2KvQwWgkAw .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-cvE5iu2KvQwWgkAw .icon-shape p,#mermaid-svg-cvE5iu2KvQwWgkAw .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-cvE5iu2KvQwWgkAw .icon-shape .label rect,#mermaid-svg-cvE5iu2KvQwWgkAw .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-cvE5iu2KvQwWgkAw .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-cvE5iu2KvQwWgkAw .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-cvE5iu2KvQwWgkAw :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} Fp8Config
Offline 离线

is_checkpoint_fp8_serialized=True
Online 在线

is_checkpoint_fp8_serialized=False
Fp8LinearMethod
Fp8OnlineLinearMethod
Fp8MoEMethod
Fp8OnlineMoEMethod
Fp8KVCacheMethod
init_fp8_linear_kernel()

内核选择
torch._scaled_mm
MarlinFP8ScaledMM
DeepGEMM
BlockScaledMM
FlashInfer

2.2 Fp8Config 核心配置

定义于 fp8.py:97-225

python 复制代码
class Fp8Config(QuantizationConfig):
    def __init__(
        self,
        is_checkpoint_fp8_serialized: bool = False,  # 是否为 FP8 序列化 checkpoint
        activation_scheme: str = "dynamic",            # "static" 或 "dynamic"
        ignored_layers: list[str] | None = None,       # 跳过量化的层
        weight_block_size: list[int] | None = None,     # 块级量化尺寸 [N, K]
    ) -> None:

关键属性说明:

  • is_checkpoint_fp8_serialized:决定走 Fp8LinearMethod(离线)还是 Fp8OnlineLinearMethod(在线)
  • activation_scheme:dynamic 模式下激活值每 token 动态计算缩放因子;static 模式使用预计算的固定缩放因子
  • weight_block_size:当设置为 [128, 128] 时启用 DeepSeek V3/R1 风格的 128×128 块级量化

2.3 量化 Key 体系

FP8 的量化行为由 QuantKey 精确控制,定义于 quant_utils.py:100-182

QuantKey 含义 使用场景
kFp8StaticTensorSym FP8 + 静态 per-tensor scale 静态激活量化
kFp8DynamicTensorSym FP8 + 动态 per-tensor scale 标准 dynamic 模式
kFp8DynamicTokenSym FP8 + 动态 per-token scale Cutlass 支持时的高效模式
kFp8Dynamic128Sym FP8 + 动态 128-block scale DeepSeek 块级量化(激活侧)
kFp8Static128BlockSym FP8 + 静态 128×128 block scale DeepSeek 块级量化(权重侧)

2.4 离线 FP8 Linear 方法

Fp8LinearMethod 的核心流程:

python 复制代码
class Fp8LinearMethod(LinearMethodBase):
    def create_weights(self, layer, ...):
        # 1. 创建 FP8 权重参数
        weight = create_fp8_weight_parameter(...)
        # 2. 创建权重量级缩放因子
        scale = create_fp8_scale_parameter(...)
        # 3. 初始化线性内核(自动选择最优后端)
        self.fp8_linear = init_fp8_linear_kernel(
            activation_quant_key=self.activation_quant_key,
            weight_quant_key=self.weight_quant_key,
            ...
        )

    def process_weights_after_loading(self, layer):
        # 处理融合模块(如 QKV)的多 shard 权重
        weight, weight_scale, input_scale = process_fp8_weight_tensor_strategy(
            weight, weight_scale, layer.logical_widths, ...
        )
        # 转置为 [K, N] 布局(scaled_mm 要求)
        weight = weight.t()
        # 交给选定的内核做后处理(如 Marlin 重排)
        self.fp8_linear.process_weights_after_loading(layer)

    def apply(self, layer, x, bias=None):
        if envs.VLLM_BATCH_INVARIANT:
            # batch invariant 模式:反量化到 BF16 再计算
            return self._bf16_fallback(layer, x, bias)
        if self.use_marlin:
            return self.fp8_linear.apply_weights(layer, x, bias)
        return self.fp8_linear.apply_weights(layer, x, bias)

2.5 在线 FP8 量化

Fp8OnlineLinearMethod 的关键区别在于使用 meta device 延迟分配:

python 复制代码
class Fp8OnlineLinearMethod(Fp8LinearMethod):
    uses_meta_device: bool = True  # 标记:在 meta device 上创建权重

    def create_weights(self, layer, ...):
        # 权重在 meta device 上创建(不占实际显存)
        weight = ModelWeightParameter(
            data=torch.empty(..., device="meta", dtype=params_dtype),
            ...
        )
        initialize_online_processing(layer)

    def process_weights_after_loading(self, layer):
        # 加载完成后在线量化:BF16 → FP8
        qweight, weight_scale = ops.scaled_fp8_quant(layer.weight, scale=None)
        replace_parameter(layer, "weight", qweight.data)
        replace_parameter(layer, "weight_scale", weight_scale.data)

核心量化操作 ops.scaled_fp8_quant 将全精度权重动态转换为 FP8。

2.6 FP8 MoE 方法

Fp8MoEMethodFp8OnlineMoEMethod 为 MoE 层提供 FP8 量化支持:

  • 权重格式:w13_weight(gate_up 融合)和 w2_weight(down_proj),dtype 为 float8_e4m3fn
  • 缩放因子:per-tensor(w13_scale, w2_scale)或 per-block(w13_weight_scale_inv
  • 后端选择:通过 select_fp8_moe_backend() 自动选择 FlashInfer/AITER/CUTLASS 等后端
  • 权重重排:convert_to_fp8_moe_kernel_format() 将权重转换为各后端的运行时格式

2.7 MXFP8(MicroScaling FP8)

MXFP8 是 NVIDIA 推出的微缩放浮点格式,每 32 个元素共享一个 uint8 缩放因子。相关实现在 modelopt.pymxfp8/ 目录中。

支持的内核后端:

  • MarlinMxfp8LinearKernel --- Marlin 加速的 MXFP8
  • FlashInferCutlassMxfp8LinearKernel --- FlashInfer + CUTLASS
  • EmulationMxfp8LinearKernel --- 仿真模式(非 Blackwell 设备)

三、INT8 / INT4 量化(GPTQ / AWQ)

3.1 GPTQ 量化

GPTQ 是基于近似二阶信息的训练后量化方法。

GPTQConfig

定义于 gptq.py:44-223

python 复制代码
class GPTQConfig(QuantizationConfig):
    def __init__(
        self,
        weight_bits: int,          # 支持 2/3/4/8 bit
        group_size: int,           # 分组大小,-1 表示 per-channel
        desc_act: bool,            # 是否启用 activation ordering
        lm_head_quantized: bool,   # 是否量化 lm_head
        dynamic: dict,             # GPTQModel 的逐模块动态配置
        autoround_version: str="", # AutoRound 版本标识
        modules_in_block_to_quantize: list[str] | None = None,
        checkpoint_format: str="", # "gptq_v2" 或 ""
    ):
        self.pack_factor = Fraction(32, self.weight_bits)  # 打包因子
GPTQLinearMethod

gptq.py:231-399

python 复制代码
class GPTQLinearMethod(LinearMethodBase):
    def create_weights(self, layer, ...):
        # 量化权重:按行打包进 int32
        qweight = PackedvLLMParameter(
            data=torch.empty(
                input_size_per_partition // self.quant_config.pack_factor,
                output_size_per_partition,
                dtype=torch.int32,
            ),
            input_dim=0, output_dim=1,
            packed_dim=0,  # 沿输入维度打包
            packed_factor=self.quant_config.pack_factor,
        )
        # Activation order 索引
        g_idx = RowvLLMParameter(...)
        # 零点(打包)
        qzeros = PackedColumnParameter(...)
        # 缩放因子
        scales = ChannelQuantScaleParameter(...)  # or GroupQuantScaleParameter

    def process_weights_after_loading(self, layer):
        # Exllama shuffle:按 g_idx 重排权重
        if layer.exllama_state == ExllamaState.UNINITIALIZED:
            if self.quant_config.desc_act:
                layer.g_idx.data = torch.argsort(layer.g_idx).to(torch.int)
            ops.gptq_shuffle(layer.qweight, layer.g_idx, self.quant_config.weight_bits)

    def apply(self, layer, x, bias=None):
        output = ops.gptq_gemm(
            reshaped_x, layer.qweight, layer.qzeros,
            layer.scales, layer.g_idx,
            layer.exllama_state == ExllamaState.READY,
            self.use_v2_format,      # GPTQ v1/v2 格式差异
            self.quant_config.weight_bits,
        )

GPTQ v1 vs v2 差异:v2 格式对零点的处理方式不同,需要不同的 GEMM 内核。

GPTQ 动态配置

GPTQModel 引入的动态配置允许对模型的不同层使用不同量化参数:

python 复制代码
# gptq.py:61-83 中的示例
dynamic = {
    r"+:.*\.(?:1[0-5])\..*": {"bits": 8},         # 第 10-15 层用 8bit
    r"+:.*\.(?:1[6-9]|20|21)\..*": {"bits": 8, "group_size": 64},
    r"-:.*\.moe\..*": {},                            # 跳过所有 MoE 层
}

3.2 AWQ 量化

AWQ (Activation-aware Weight Quantization) 通过分析激活分布来保护重要权重。

AWQConfig

awq.py:34-95

python 复制代码
class AWQConfig(QuantizationConfiguration):
    def __init__(self, weight_bits=4, group_size=128, zero_point=True,
                 modules_to_not_convert=None):
        # AWQ 固定为 4-bit
        assert self.weight_bits == 4
        self.pack_factor = 32 // self.weight_bits  # = 8
AWQLinearMethod

awq.py:172-286

python 复制代码
class AWQLinearMethod(LinearMethodBase):
    def create_weights(self, layer, ...):
        # AWQ 沿输出维度打包(与 GPTQ 不同!)
        qweight = PackedvLLMParameter(
            data=torch.empty(
                input_size_per_partition,
                output_size_per_partition // self.quant_config.pack_factor,
                dtype=torch.int32,
            ),
            input_dim=0, output_dim=1,
            packed_dim=1,  # 沿输出维度打包(AWQ 特有)
            packed_factor=self.quant_config.pack_factor,
        )
        scales = GroupQuantScaleParameter(...)
        qzeros = PackedvLLMParameter(...)

    def apply(self, layer, x, bias=None):
        # 小 batch 启发式:直接反量化 + matmul
        if x.shape[:-1].numel() >= 256 or envs.VLLM_BATCH_INVARIANT:
            out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
            out = torch.matmul(reshaped_x, out)
        else:
            # 大 batch:使用优化的 AWQ GEMM kernel
            out = ops.awq_gemm(reshaped_x, qweight, scales, qzeros, pack_factor)

AWQ vs GPTQ 打包方向对比

  • GPTQ:qweight 形状 [K//pack, N],沿输入维(dim 0)打包
  • AWQ:qweight 形状 [K, N//pack],沿输出维(dim 1)打包

3.3 AWQ → Marlin 格式转换

由于 AWQ 使用非标准的 bit 打包顺序,转换为 Marlin 格式需要特殊处理。

awq_marlin.py:67-149 定义了转换逻辑:

python 复制代码
_REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]  # AWQ 特有的位排列

def _convert_awq_to_standard_format(layer, w_q_name, w_zp_name, size_bits):
    """将 AWQ 的非标准格式转换为 GPTQ-like 标准格式"""
    pack_factor = 32 // size_bits
    # 1. 解包 AWQ qweight,修复位顺序
    unpacked = (qw.unsqueeze(-1) >> shifts) & mask
    unpacked = unpacked[:, :, reverse_order]  # 修正 AWQ 位序
    # 2. 从沿输出维打包转为沿输入维打包
    new_qw = repack_along_input_dim(unpacked)
    # 3. 同样处理 qzeros
    new_qz = convert_and_repack_zeros(qz)

四、GGUF 格式

4.1 概述

GGUF 是 llama.cpp 定义的单文件模型格式,支持多种量化精度。vLLM 通过 gguf.py 提供完整的 GGUF 推理支持。

4.2 支持的量化类型

gguf.py:166-198

python 复制代码
UNQUANTIZED_TYPES = {WeightType.F32, WeightType.F16, WeightType.BF16}
STANDARD_QUANT_TYPES = {
    WeightType.Q4_0, WeightType.Q4_1,       # 标准 4-bit
    WeightType.Q5_0, WeightType.Q5_1,       # 标准 5-bit
    WeightType.Q8_0, WeightType.Q8_1,       # 标准 8-bit
}
KQUANT_TYPES = {                              # K-quantization(改进版)
    WeightType.Q2_K, WeightType.Q3_K,
    WeightType.Q4_K, WeightType.Q5_K, WeightType.Q6_K,
}
IMATRIX_QUANT_TYPES = {                        # I-Matrix 量化(重要性矩阵)
    WeightType.IQ1_M, WeightType.IQ1_S,
    WeightType.IQ2_XXS, WeightType.IQ2_XS, ...
}

4.3 GGUF GEMM 操作

gguf.py:201-233 中的 _fused_mul_mat_gguf 实现了三级内核选择:

python 复制代码
def _fused_mul_mat_gguf(x, qweight, qweight_type):
    # 1. MMVQ(向量-矩阵量化乘法):适合小 batch(batch_size <= 2~16)
    if x.shape[0] <= mmvq_safe and qweight_type in MMVQ_QUANT_TYPES:
        y = ops.ggml_mul_mat_vec_a8(qweight, x, qweight_type, ...)
    # 2. MMQ(矩阵-矩阵量化乘法):标准批量推理
    elif qweight_type in MMQ_QUANT_TYPES:
        y = ops.ggml_mul_mat_a8(qweight, x, qweight_type, ...)
    # 3. 反量化回退:无专用 kernel 时先反量化再 matmul
    elif qweight_type in DEQUANT_TYPES:
        weight = ops.ggml_dequantize(qweight, qweight_type, ...)
        y = x @ weight.T

4.4 GGUF MoE 支持

gguf.py:564-669GGUFMoEMethod

  • 当两个权重都支持 MMQ 且 x.shape[0] > 64 时使用 fused MoE kernel(ops.ggml_moe_a8
  • 否则使用逐 expert 的向量 kernel(ops.ggml_moe_a8_vec
  • 最终回退到逐 token 逐 expert 的慢速路径

4.5 GGUF 工具函数

gguf_utils.py 提供:

  • check_gguf_file() --- 检测文件是否为 GGUF 格式
  • is_remote_gguf() --- 检测远程 GGUF 模型(如 repo_id:Q4_K_M 格式)
  • is_valid_gguf_quant_type() --- 校验量化类型名称合法性

五、NVFP4 Blackwell 4-bit 浮点

5.1 概述

NVFP4 是 NVIDIA Blackwell 架构(SM100+,如 B200 GPU)引入的原生 4-bit 浮点格式 (float4_e2m1fn),具有以下特征:

  • 数据类型float4_e2m1fn,2 个 FP4 值打包在一个 uint8
  • 缩放方式:FP8-E4M3FN 格式的 block scale,默认 group size = 16
  • 全局缩放:额外的标量全局缩放因子用于权重和激活

5.2 数据结构

nvfp4/base.py:10-19

python 复制代码
@dataclass
class NvFp4LinearLayerConfig:
    """所有 NVFP4 层共享相同结构:
    - packed uint8 权重(每字节 2 个 FP4 值)
    - FP8-E4M3 per-block 权重缩放(group size 16)
    - 权重和激活的全局标量缩放
    """
    pass

5.3 量化 Key 定义

quant_utils.py:138-146

python 复制代码
# 动态 NVFP4:激活侧动态计算 FP8 block scale
kNvfp4DynamicGroupScale = ScaleDesc(FP8_DTYPE, False, GroupShape(1, 16))
kNvfp4Dynamic = QuantKey(FP4_DTYPE, scale=kNvfp4DynamicGroupScale,
                         scale2=kStaticTensorScale)

# 静态 NVFP4:预计算的 FP8 block scale
kNvfp4StaticGroupScale = ScaleDesc(FP8_DTYPE, True, GroupShape(1, 16))
kNvfp4Static = QuantKey(FP4_DTYPE, scale=kNvfp4StaticGroupScale,
                        scale2=kStaticTensorScale)

5.4 NVFP4 内核后端

linear/init.py:77-80 可见注册的后端:

内核类 文件 说明
CutlassNvFp4LinearKernel nvfp4/cutlass.py CUTLASS 实现(Blackwell 原生)
MarlinNvFp4LinearKernel nvfp4/marlin.py Marlin 内核适配
FlashInferNvFp4LinearKernel nvfp4/flashinfer.py FlashInfer 后端
FBGemmNvFp4LinearKernel nvfp4/fbgemm.py FBGEMM(ROCm)
EmulationNvFp4LinearKernel nvfp4/emulation.py 仿真模式(非 Blackwell)

5.5 NVFP4 工具函数

nvfp4_utils.py 提供关键工具:

python 复制代码
def swizzle_blockscale(scale: torch.Tensor) -> torch.Tensor:
    """Pad 并 block-interleave FP4 block-scale 以匹配 CUTLASS/FlashInfer 布局"""
    # reshape → permute → contiguos 匹配内核期望的数据排布

def pad_nvfp4_weight_for_cutlass(weight, alignment=32):
    """填充 NVFP4 权重以满足 CUTLASS 的 32 对齐要求"""

def cutlass_fp4_supported() -> bool:
    """检测当前设备是否支持 CUTLASS FP4"""

六、MXFP4 / MXFP8 块缩放浮点

6.1 MXFP4(MicroScaling FP4)

MXFP4 是面向 MoE 模型的 4-bit 微缩放浮点格式,主要服务于 GPT-OSS 和 DeepSeek-V4 等模型。

mxfp4.py:42-101

python 复制代码
class Mxfp4Config(QuantizationConfig):
    @classmethod
    def get_min_capability(cls) -> int:
        return 80  # 需要 Ampere (SM80) 及以上

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.bfloat16]

    def get_quant_method(self, layer, prefix):
        if isinstance(layer, LinearBase):
            # Linear 层暂未实现 MXFP4,回退到 Unquantized
            return UnquantizedLinearMethod()
        elif isinstance(layer, FusedMoE):
            return GptOssMxfp4MoEMethod(layer.moe_config)

MXFP4 权重布局

  • w13_weight:形状 [num_experts, 2*intermediate, hidden//2],dtype uint8
  • 每 2 个 FP4 值打包在 1 个字节中
  • w13_weight_scale:形状 [num_experts, 2*intermediate, hidden//32],block size = 32

MXFP4 后端选择 (通过 select_mxfp4_moe_backend()):

  • AITER_MXFP4_BF16 --- AITER 后端
  • FLASHINFER_TRTLLM_MXFP4_MXFP8 --- FlashInfer TRTLLM(支持 padding skip)
  • TRITON_MXFP4_BF16 --- Triton kernel

6.2 MXFP8(MicroScaling FP8)

MXFP8 每 32 个元素共享一个 uint8 缩放因子,是 DeepSeek-V3 等模型的关键量化技术。

量化 Key 定义于 quant_utils.py:154-158

python 复制代码
kMxfp8StaticScale = ScaleDesc(torch.uint8, True, GroupShape(1, 32))
kMxfp8Static = QuantKey(FP8_DTYPE, kMxfp8StaticScale, symmetric=True)
kMxfp8DynamicScale = ScaleDesc(torch.uint8, False, GroupShape(1, 32))
kMxfp8Dynamic = QuantKey(FP8_DTYPE, kMxfp8DynamicScale, symmetric=True)

MXFP8 专用内核位于 kernels/linear/mxfp8/ 目录:

  • MarlinMxfp8LinearKernel --- Marlin 加速
  • FlashInferCutlassMxfp8LinearKernel --- FlashInfer + CUTLASS
  • EmulationMxfp8LinearKernel --- 仿真回退

七、Marlin 高性能 4-bit 内核

7.1 概述

Marlin 是专为 4-bit 权重量化设计的超高性能 GEMM 内核,vLLM 将其作为 GPTQ/AWQ/FP8 的首选加速后端。

7.2 支持的量化类型

marlin_utils.py:41-79

python 复制代码
def query_marlin_supported_quant_types(has_zp=None, include_fp_type=True, ...):
    """
    has_zp=True  (AWQ 风格):   [scalar_types.uint4]
    has_zp=False (GPTQ 风格):  [uint4b8, uint8b128] (+ FP8 types if include_fp_type)
    """

完整支持列表

  • 整数型:uint4 (AWQ)、uint4b8 (GPTQ 4-bit)、uint8b128 (GPTQ 8-bit)
  • 浮点型:float8_e4m3fn (FP8 权重)、float4_e2m1f (FP4 权重)

7.3 Marlin 约束条件

marlin_utils.py:25-35

python 复制代码
GPTQ_MARLIN_TILE = 16           # Marlin tile 尺寸
GPTQ_MARLIN_MIN_THREAD_N = 64    # 输出维度最小线程数
GPTQ_MARLIN_MIN_THREAD_K = 128   # 输入维度最小线程数
MARLIN_SUPPORTED_GROUP_SIZES = [-1, 32, 64, 128]  # 支持的分组大小

形状约束验证函数 verify_marlin_supports_shape() 检查:

  • output_size % 64 == 0
  • input_size % 128 == 0
  • 如果 group_size < input_sizeinput_size % group_size == 0

7.4 Marlin 权重预处理

Marlin 内核要求权重经过特殊的重排(reordering)和置换(permutation):

python 复制代码
# marlin_utils.py:292-312 --- scale permutation
def marlin_permute_scales(s, size_k, size_n, group_size, is_a_8bit=False):
    scale_perm, scale_perm_single = get_scale_perms()
    # group quant: 使用 8×8 full permutation
    # channel quant: 使用 4×8 single permutation
    s = s.reshape((-1, len(scale_perm)))[:, scale_perm]
    return s.reshape((-1, size_n)).contiguous()

# marlin_utils.py:344-365 --- zero-point permutation
def marlin_zero_points(zp, size_k, size_n, num_bits, is_a_8bit=False):
    # 1. 应用 scale permutation
    # 2. interleave 列维度
    # 3. pack 到 int32

7.5 Marlin Workspace

marlin_utils.py:257-265

python 复制代码
def marlin_make_workspace_new(device, max_blocks_per_sm=1):
    """创建 Marlin workspace tensor
    大小 = SM 数量 × max_blocks_per_sm
    用于原子操作的同步
    """
    sms = num_compute_units(device.index)
    return torch.zeros(sms * max_blocks_per_sm, dtype=torch.int, device=device)

7.6 Marlin GEMM 调用

marlin_utils.py:506-570

python 复制代码
def apply_gptq_marlin_linear(input, weight, weight_scale, weight_zp,
                               g_idx, g_idx_sort_indices, workspace,
                               wtype, ..., input_dtype=None):
    # 可选:W4A8 输入量化
    if input_dtype == torch.int8:
        reshaped_x, a_scales = marlin_quant_input(reshaped_x, torch.int8)
    elif input_dtype == torch.float8_e4m3fn:
        reshaped_x, a_scales = marlin_quant_input(reshaped_x, torch.float8_e4m3fn)

    output = ops.marlin_gemm(
        reshaped_x, None, weight, bias,
        weight_scale, a_scales, None,
        weight_zp, g_idx, g_idx_sort_indices, workspace,
        wtype, size_m=..., size_n=..., size_k=...,
        is_k_full=..., use_atomic_add=..., use_fp32_reduce=...,
    )

7.7 W4A8 扩展:INT8 / FP8 激活量化

Marlin 不仅支持 W4A16,还扩展支持低精度激活:

python 复制代码
# marlin_utils.py:474-493
def get_marlin_input_dtype(prefix=None):
    """通过环境变量 VLLM_MARLIN_INPUT_DTYPE 控制:
    - None: 标准 W4A16
    - "int8": W4A8-INT8(需 SM75+)
    - "fp8": W4A8-FP8(仅 SM89 H20 或 SM120 Blackwell)
    """

八、Machete 新一代量化内核

8.1 概述

Machete 是基于 CUTLASS 的新一代量化 GEMM 内核,专门针对 NVIDIA Hopper (SM90) 架构优化,提供比 Marlin 更高的吞吐量。

8.2 Machete 约束

machete_utils.py:9-56

python 复制代码
MACHETE_PREPACKED_BLOCK_SHAPE = [64, 128]  # 预打包块形状

def query_machete_supported_quant_types(zero_points):
    if zero_points:
        return [scalar_types.uint4, scalar_types.uint8]
    else:
        return [scalar_types.uint4b8, scalar_types.uint8b128]

def query_machete_supported_group_sizes(act_type):
    if act_type in [torch.float16, torch.bfloat16]:
        return [-1, 64, 128]   # 支持 channel-wise / 64 / 128
    else:
        return [-1, 128]

def check_machete_supports_shape(in_features, out_features):
    # in_features % 64 == 0
    # out_features % 128 == 0

8.3 MacheteLinearKernel

mixed_precision/machete.py:24-80

python 复制代码
class MacheteLinearKernel(MPLinearKernel):
    @classmethod
    def get_min_capability(cls) -> int:
        return 90  # 仅支持 Hopper (SM90)

    @classmethod
    def can_implement(cls, c: MPLinearLayerConfig):
        # 1. 必须是 CUDA 平台
        # 2. 必须是 SM90
        # 3. 不支持 act_order + TP 分区
        # 4. 量化类型必须在支持列表中
        # 5. group_size 必须在支持列表中
        # 6. 形状必须满足 64/128 对齐

8.4 Marlin vs Machete 对比

特性 Marlin Machete
最低架构 SM 75 (Turing) SM 90 (Hopper)
底层实现 自定义 CUDA CUTLASS
对齐要求 N%64, K%128 N%128, K%64
支持 ZP ✅ (uint4) ✅ (uint4, uint8)
Act Order ❌ (TP 分区时)
FP8 权重
W4A8 ✅ (INT8/FP8)
典型场景 通用 4-bit 推理 Hopper 高吞吐

九、量化线性层与 GEMM 操作

9.1 内核选择体系

vLLM 的量化 GEMM 内核采用分层选择架构,定义于 kernels/linear/ 目录:
#mermaid-svg-ZeUMocae7GE4HUNQ{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-ZeUMocae7GE4HUNQ .error-icon{fill:#552222;}#mermaid-svg-ZeUMocae7GE4HUNQ .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-ZeUMocae7GE4HUNQ .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-ZeUMocae7GE4HUNQ .marker{fill:#333333;stroke:#333333;}#mermaid-svg-ZeUMocae7GE4HUNQ .marker.cross{stroke:#333333;}#mermaid-svg-ZeUMocae7GE4HUNQ svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-ZeUMocae7GE4HUNQ p{margin:0;}#mermaid-svg-ZeUMocae7GE4HUNQ .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-ZeUMocae7GE4HUNQ .cluster-label text{fill:#333;}#mermaid-svg-ZeUMocae7GE4HUNQ .cluster-label span{color:#333;}#mermaid-svg-ZeUMocae7GE4HUNQ .cluster-label span p{background-color:transparent;}#mermaid-svg-ZeUMocae7GE4HUNQ .label text,#mermaid-svg-ZeUMocae7GE4HUNQ span{fill:#333;color:#333;}#mermaid-svg-ZeUMocae7GE4HUNQ .node rect,#mermaid-svg-ZeUMocae7GE4HUNQ .node circle,#mermaid-svg-ZeUMocae7GE4HUNQ .node ellipse,#mermaid-svg-ZeUMocae7GE4HUNQ .node polygon,#mermaid-svg-ZeUMocae7GE4HUNQ .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-ZeUMocae7GE4HUNQ .rough-node .label text,#mermaid-svg-ZeUMocae7GE4HUNQ .node .label text,#mermaid-svg-ZeUMocae7GE4HUNQ .image-shape .label,#mermaid-svg-ZeUMocae7GE4HUNQ .icon-shape .label{text-anchor:middle;}#mermaid-svg-ZeUMocae7GE4HUNQ .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-ZeUMocae7GE4HUNQ .rough-node .label,#mermaid-svg-ZeUMocae7GE4HUNQ .node .label,#mermaid-svg-ZeUMocae7GE4HUNQ .image-shape .label,#mermaid-svg-ZeUMocae7GE4HUNQ .icon-shape .label{text-align:center;}#mermaid-svg-ZeUMocae7GE4HUNQ .node.clickable{cursor:pointer;}#mermaid-svg-ZeUMocae7GE4HUNQ .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-ZeUMocae7GE4HUNQ .arrowheadPath{fill:#333333;}#mermaid-svg-ZeUMocae7GE4HUNQ .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-ZeUMocae7GE4HUNQ .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-ZeUMocae7GE4HUNQ .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-ZeUMocae7GE4HUNQ .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-ZeUMocae7GE4HUNQ .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-ZeUMocae7GE4HUNQ .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-ZeUMocae7GE4HUNQ .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-ZeUMocae7GE4HUNQ .cluster text{fill:#333;}#mermaid-svg-ZeUMocae7GE4HUNQ .cluster span{color:#333;}#mermaid-svg-ZeUMocae7GE4HUNQ div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-ZeUMocae7GE4HUNQ .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-ZeUMocae7GE4HUNQ rect.text{fill:none;stroke-width:0;}#mermaid-svg-ZeUMocae7GE4HUNQ .icon-shape,#mermaid-svg-ZeUMocae7GE4HUNQ .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-ZeUMocae7GE4HUNQ .icon-shape p,#mermaid-svg-ZeUMocae7GE4HUNQ .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-ZeUMocae7GE4HUNQ .icon-shape .label rect,#mermaid-svg-ZeUMocae7GE4HUNQ .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-ZeUMocae7GE4HUNQ .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-ZeUMocae7GE4HUNQ .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-ZeUMocae7GE4HUNQ :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} init_fp8_linear_kernel()
ScaledMM 子系统
init_nvfp4_linear_kernel()
init_mxfp8_linear_kernel()
PyTorch _scaled_mm
CutlassFP8ScaledMM
FlashInferFP8
MarlinFP8ScaledMM
Fp8BlockScaledMM
DeepGEMM
AiterInt8
TritonInt8
ROCmFP8
CPUFp8BlockScaled
choose_mp_linear_kernel()
MarlinLinearKernel
MacheteLinearKernel
ExllamaLinearKernel
ConchLinearKernel
AllSparkLinearKernel
CutlassW4A8LinearKernel
Dynamic4bitLinearKernel
TritonW4A16LinearKernel
CPUWNA16LinearKernel

9.2 ScaledMM 内核层次

scaled_mm/init.py 注册了以下 FP8/INT8 缩放矩阵乘法内核:

内核类 精度 后端 条件
PerTensorTorchFP8ScaledMMLinearKernel W8A8 PyTorch native 通用 fallback
ChannelWiseTorchFP8ScaledMMLinearKernel W8A8 PyTorch native per-channel
RowWiseTorchFP8ScaledMMLinearKernel W8A8 PyTorch native per-row
CutlassFP8ScaledMMLinearKernel W8A8 CUTLASS Ampere+ CUDA
CutlassInt8ScaledMMLinearKernel W8A8 CUTLASS Ampere+ CUDA
FlashInferFP8ScaledMMLinearKernel W8A8 FlashInfer Hopper+
MarlinFP8ScaledMMLinearKernel W8A8 Marlin Turing+, 非 FP8 原生设备
Fp8BlockScaledMMLinearKernel W8A8 Block CUTLASS 块级缩放
ROCmFP8ScaledMMLinearKernel W8A8 ROCm AMD GPU
AiterInt8ScaledMMLinearKernel W8A8 AITER 华为昇腾
TritonInt8ScaledMMLinearKernel W8A8 Triton 通用
CPUFp8BlockScaledKernel W8A8 CPU CPU 推理

9.3 Mixed Precision 内核层次

mixed_precision/ 目录包含整数量化内核:

内核类 精度 后端 说明
MarlinLinearKernel W4A16/W8A16 Marlin CUDA 主要 4/8-bit 内核
MacheteLinearKernel W4A16/W8A16 CUTLASS SM90 优化
ExllamaLinearKernel W4A16 Exllama GPTQ 原生格式
ConchLinearKernel W4A16/W8A16 Conch DeepSeek 内核
AllSparkLinearKernel W4A16 AllSpark 通义千问
CutlassW4A8LinearKernel W4A8 CUTLASS W4A8 精度
Dynamic4bitLinearKernel W4A16 Dynamic 动态 4-bit
TritonW4A16LinearKernel W4A16 Triton 可移植
CPUWNA16LinearKernel W4A16 CPU CPU 推理

十、量化配置体系

10.1 基类层次

复制代码
QuantizationConfig (ABC)          # base_config.py:70
├── get_name() → QuantizationMethods
├── get_supported_act_dtypes() → list[dtype]
├── get_min_capability() → int     # 最低 GPU compute capability
├── get_config_filenames() → list[str]
├── from_config(dict) → Self
└── get_quant_method(layer, prefix) → QuantizeMethodBase | None

QuantizeMethodBase (ABC)          # base_config.py:19
├── create_weights(layer, ...)    # 创建量化参数
├── apply(layer, x, bias) → Tensor # 前向计算
├── embedding(layer, x) → Tensor  # Embedding 查找(可选)
└── process_weights_after_loading(layer)  # 权重后处理(可选)

源码位置:base_config.py

10.2 配置类清单

配置类 方法名 权重位数 最低能力 文件
Fp8Config fp8 8-bit float SM 75 fp8.py
GPTQConfig gptq 2/3/4/8-bit int SM 60 gptq.py
GPTQMarlinConfig gptq_marlin 4/8-bit int SM 75 gptq_marlin.py
AWQConfig awq 4-bit int SM 75 awq.py
AWQMarlinConfig awq_marlin 4-bit int SM 75 awq_marlin.py
GGUFConfig gguf 多精度 SM 60 gguf.py
ModelOptFp8Config modelopt 8-bit float SM 75 modelopt.py
ModelOptNvFp4Config modelopt_fp4 4-bit float SM 100 modelopt.py
ModelOptMxFp8Config modelopt_mxfp8 8-bit float + uint8 scale SM 80 modelopt.py
Mxfp4Config mxfp4 4-bit float (uint8) SM 80 mxfp4.py
BitsAndBytesConfig bitsandbytes 多精度 --- bitsandbytes.py
CompressedTensorsConfig compressed-tensors 多精度 --- compressed_tensors/
INCConfig inc / auto-round 多精度 --- inc.py
TorchAOConfig torchao 多精度 --- torchao.py
HummingConfig humming 多精度 --- humming.py
ExpertsInt8Config experts_int8 8-bit int --- experts_int8.py
OnlineQuantizationConfig online 动态 --- online/

10.3 参数类型体系

vLLM 定义了丰富的参数类型来描述不同量化格式的权重布局:

参数类型 用途 来源
ModelWeightParameter 全精度权重 parameter.py
PerTensorScaleParameter Per-tensor 缩放因子 parameter.py
BlockQuantScaleParameter Block 量化缩放因子 parameter.py
ChannelQuantScaleParameter Per-channel 缩放因子 parameter.py
GroupQuantScaleParameter Per-group 缩放因子 parameter.py
PackedvLLMParameter 打包的量化权重(int32) parameter.py
PackedColumnParameter 沿列打包的零点 parameter.py
RowvLLMParameter 行索引参数(g_idx) parameter.py
GGUFUninitializedParameter GGUF 延迟初始化参数 gguf.py:689-691

10.4 自动格式检测与覆盖

部分配置类实现了 override_quantization_method() 来自动检测 checkpoint 格式并推荐最优内核:

python 复制代码
# gptq_marlin.py:220-245 --- 自动升级 GPTQ → GPTQ-Marlin
@classmethod
def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
    can_convert = cls.is_gptq_marlin_compatible(hf_quant_cfg)
    if can_convert and user_quant in (None, "marlin", "gptq_marlin"):
        return "gptq_marlin"  # 自动切换到更快的 Marlin 内核

# awq_marlin.py:233-262 --- 自动升级 AWQ → AWQ-Marlin
@classmethod
def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
    can_convert = cls.is_awq_marlin_compatible(hf_quant_cfg)
    if can_convert and user_quant in (None, "marlin", "awq_marlin"):
        return "awq_marlin"

# gguf.py:86-93 --- 强制 GGUF 格式
@classmethod
def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
    if user_quant == "gguf":
        return "gguf"  # 覆盖 HF config 中的其他量化设置

十一、量化方案选择决策树

#mermaid-svg-2Tl6VKeaPWPXScAZ{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-2Tl6VKeaPWPXScAZ .error-icon{fill:#552222;}#mermaid-svg-2Tl6VKeaPWPXScAZ .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-2Tl6VKeaPWPXScAZ .marker{fill:#333333;stroke:#333333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .marker.cross{stroke:#333333;}#mermaid-svg-2Tl6VKeaPWPXScAZ svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-2Tl6VKeaPWPXScAZ p{margin:0;}#mermaid-svg-2Tl6VKeaPWPXScAZ .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .cluster-label text{fill:#333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .cluster-label span{color:#333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .cluster-label span p{background-color:transparent;}#mermaid-svg-2Tl6VKeaPWPXScAZ .label text,#mermaid-svg-2Tl6VKeaPWPXScAZ span{fill:#333;color:#333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .node rect,#mermaid-svg-2Tl6VKeaPWPXScAZ .node circle,#mermaid-svg-2Tl6VKeaPWPXScAZ .node ellipse,#mermaid-svg-2Tl6VKeaPWPXScAZ .node polygon,#mermaid-svg-2Tl6VKeaPWPXScAZ .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-2Tl6VKeaPWPXScAZ .rough-node .label text,#mermaid-svg-2Tl6VKeaPWPXScAZ .node .label text,#mermaid-svg-2Tl6VKeaPWPXScAZ .image-shape .label,#mermaid-svg-2Tl6VKeaPWPXScAZ .icon-shape .label{text-anchor:middle;}#mermaid-svg-2Tl6VKeaPWPXScAZ .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-2Tl6VKeaPWPXScAZ .rough-node .label,#mermaid-svg-2Tl6VKeaPWPXScAZ .node .label,#mermaid-svg-2Tl6VKeaPWPXScAZ .image-shape .label,#mermaid-svg-2Tl6VKeaPWPXScAZ .icon-shape .label{text-align:center;}#mermaid-svg-2Tl6VKeaPWPXScAZ .node.clickable{cursor:pointer;}#mermaid-svg-2Tl6VKeaPWPXScAZ .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .arrowheadPath{fill:#333333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-2Tl6VKeaPWPXScAZ .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-2Tl6VKeaPWPXScAZ .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-2Tl6VKeaPWPXScAZ .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-2Tl6VKeaPWPXScAZ .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-2Tl6VKeaPWPXScAZ .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-2Tl6VKeaPWPXScAZ .cluster text{fill:#333;}#mermaid-svg-2Tl6VKeaPWPXScAZ .cluster span{color:#333;}#mermaid-svg-2Tl6VKeaPWPXScAZ div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-2Tl6VKeaPWPXScAZ .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-2Tl6VKeaPWPXScAZ rect.text{fill:none;stroke-width:0;}#mermaid-svg-2Tl6VKeaPWPXScAZ .icon-shape,#mermaid-svg-2Tl6VKeaPWPXScAZ .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-2Tl6VKeaPWPXScAZ .icon-shape p,#mermaid-svg-2Tl6VKeaPWPXScAZ .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-2Tl6VKeaPWPXScAZ .icon-shape .label rect,#mermaid-svg-2Tl6VKeaPWPXScAZ .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-2Tl6VKeaPWPXScAZ .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-2Tl6VKeaPWPXScAZ .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-2Tl6VKeaPWPXScAZ :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} FP8 序列化
FP16/BF16 全精度
GPTQ 格式
AWQ 格式
GGUF 格式
MXFP4/MXFP8
NVFP4
H100/H200 (SM90)
其他 GPU

是,用户指定
是 (SM75+)

是 (SM75+)

H100 原生 FP8
非原生 FP8
DeepGEMM 可用
块级量化
SM90 Hopper
SM75-SM89
开始选择量化方案
Checkpoint 格式?
Fp8Config

Fp8LinearMethod / Fp8MoEMethod
目标 GPU?
可用 Marlin?
可用 Marlin?
GGUFConfig

GGUFLinearMethod / GGUFMoEMethod
Mxfp4Config / ModelOptMxFp8Config
ModelOptNvFp4Config

NvFp4LinearKernel
Fp8Config (online)

Fp8OnlineLinearMethod
需要量化?
UnquantizedLinearMethod
用户指定的量化方案
GPTQMarlinConfig

GPTQMarlinLinearMethod
GPTQConfig

GPTQLinearMethod (gptq_gemm)
AWQMarlinConfig

AWQMarlinLinearMethod
AWQConfig

AWQLinearMethod (awq_gemm)
选择 FP8 内核后端
选择 Marlin 变体
_scaled_mm / CutlassFP8
MarlinFP8ScaledMM
DeepGEMM kernel
Fp8BlockScaledMM
MacheteLinearKernel (优先)
MarlinLinearKernel


十二、量化数据流全景

#mermaid-svg-vywasSeHgL7nTtbo{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-vywasSeHgL7nTtbo .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-vywasSeHgL7nTtbo .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-vywasSeHgL7nTtbo .error-icon{fill:#552222;}#mermaid-svg-vywasSeHgL7nTtbo .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-vywasSeHgL7nTtbo .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-vywasSeHgL7nTtbo .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-vywasSeHgL7nTtbo .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-vywasSeHgL7nTtbo .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-vywasSeHgL7nTtbo .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-vywasSeHgL7nTtbo .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-vywasSeHgL7nTtbo .marker{fill:#333333;stroke:#333333;}#mermaid-svg-vywasSeHgL7nTtbo .marker.cross{stroke:#333333;}#mermaid-svg-vywasSeHgL7nTtbo svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-vywasSeHgL7nTtbo p{margin:0;}#mermaid-svg-vywasSeHgL7nTtbo .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-vywasSeHgL7nTtbo .cluster-label text{fill:#333;}#mermaid-svg-vywasSeHgL7nTtbo .cluster-label span{color:#333;}#mermaid-svg-vywasSeHgL7nTtbo .cluster-label span p{background-color:transparent;}#mermaid-svg-vywasSeHgL7nTtbo .label text,#mermaid-svg-vywasSeHgL7nTtbo span{fill:#333;color:#333;}#mermaid-svg-vywasSeHgL7nTtbo .node rect,#mermaid-svg-vywasSeHgL7nTtbo .node circle,#mermaid-svg-vywasSeHgL7nTtbo .node ellipse,#mermaid-svg-vywasSeHgL7nTtbo .node polygon,#mermaid-svg-vywasSeHgL7nTtbo .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-vywasSeHgL7nTtbo .rough-node .label text,#mermaid-svg-vywasSeHgL7nTtbo .node .label text,#mermaid-svg-vywasSeHgL7nTtbo .image-shape .label,#mermaid-svg-vywasSeHgL7nTtbo .icon-shape .label{text-anchor:middle;}#mermaid-svg-vywasSeHgL7nTtbo .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-vywasSeHgL7nTtbo .rough-node .label,#mermaid-svg-vywasSeHgL7nTtbo .node .label,#mermaid-svg-vywasSeHgL7nTtbo .image-shape .label,#mermaid-svg-vywasSeHgL7nTtbo .icon-shape .label{text-align:center;}#mermaid-svg-vywasSeHgL7nTtbo .node.clickable{cursor:pointer;}#mermaid-svg-vywasSeHgL7nTtbo .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-vywasSeHgL7nTtbo .arrowheadPath{fill:#333333;}#mermaid-svg-vywasSeHgL7nTtbo .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-vywasSeHgL7nTtbo .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-vywasSeHgL7nTtbo .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-vywasSeHgL7nTtbo .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-vywasSeHgL7nTtbo .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-vywasSeHgL7nTtbo .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-vywasSeHgL7nTtbo .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-vywasSeHgL7nTtbo .cluster text{fill:#333;}#mermaid-svg-vywasSeHgL7nTtbo .cluster span{color:#333;}#mermaid-svg-vywasSeHgL7nTtbo div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-vywasSeHgL7nTtbo .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-vywasSeHgL7nTtbo rect.text{fill:none;stroke-width:0;}#mermaid-svg-vywasSeHgL7nTtbo .icon-shape,#mermaid-svg-vywasSeHgL7nTtbo .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-vywasSeHgL7nTtbo .icon-shape p,#mermaid-svg-vywasSeHgL7nTtbo .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-vywasSeHgL7nTtbo .icon-shape .label rect,#mermaid-svg-vywasSeHgL7nTtbo .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-vywasSeHgL7nTtbo .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-vywasSeHgL7nTtbo .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-vywasSeHgL7nTtbo :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} ⚡ CUDA 内核
🚀 推理执行
⚙️ 权重后处理
🔧 创建参数
📥 权重加载
Checkpoint 文件

(SafeTensors/GGUF/HF)
WeightLoader

分片 & 加载
create_weights()

分配量化参数
process_weights_after_loading()

• 格式转换

• 权重重排

• 缩放因子计算
激活量化 (动态方案)

scaled_fp8_quant()
量化 GEMM

• ops.marlin_gemm

• ops.gptq_gemm

• ops.awq_gemm

• torch._scaled_mm
反量化 / 输出
marlin_gemm.cu
gptq_gemm.cu
awq_gemm.cu / awq_dequantize
ggml_dequantize / ggml_mul_mat_a8
cutlass_fp4 GEMM

典型数据流示例

示例 1:FP8 在线量化(DeepSeek 风格)
复制代码
BF16 Checkpoint
    ↓ [WeightLoader: 加载到 meta device]
    ↓ [Fp8OnlineLinearMethod.create_weights(): 创建 FP8 参数]
    ↓ [Fp8OnlineLinearMethod.process_weights_after_loading():
        ops.scaled_fp8_quant(weight) → (qweight: FP8, scale: FP32)
        weight.t() → [K, N] 布局
    ]
    ↓ [推理时每个 token:]
    ↓ [动态计算激活 scale]
    ↓ [torch._scaled_mm(x_fp8, w_fp8, x_scale, w_scale) → output]
示例 2:GPTQ-Marlin 4-bit 量化
复制代码
GPTQ Checkpoint (int32 packed weights)
    ↓ [GPTQMarlinLinearMethod.create_weights():
        qweight: [K//8, N] int32 (packed uint4)
        scales: [K//gs, N] fp16
        qzeros: [K//gs, N//8] int32 (packed)
        g_idx: [K] int32
    ]
    ↓ [process_weights_after_loading():
        g_idx sorting & argsort
        ops.gptq_shuffle(qweight, g_idx, 4)
        marlin_permute_scales(scales)
        marlin_zero_points(qzeros)
        marlin_make_workspace_new(device)
    ]
    ↓ [推理时:]
    ↓ [ops.marlin_gemm(x, qweight, scales, zeros, g_idx, ...)]
示例 3:GGUF Q4_K_M 量化
复制代码
GGUF File (.gguf)
    ↓ [gguf_utils.py: 加载元数据 + 权重]
    ↓ [GGUFLinearMethod.create_weights():
        qweight: GGUFUninitializedParameter (延迟初始化)
        qweight_type: WeightType.Q4_K (uint8 标识)
    ]
    ↓ [process_weights_after_loading():
        检查量化类型有效性
        为融合层 (QKV/gate_up) 创建 padded weight
    ]
    ↓ [推理时 (_fused_mul_mat_gguf):]
    ↓ [batch_size ≤ threshold?]
    ↓   Yes → ops.ggml_mul_mat_vec_a8()  [MMVQ 向量 kernel]
    ↓   No  → ops.ggml_mul_mat_a8()     [MMQ 矩阵 kernel]
    ↓   No kernel → ops.ggml_dequantize() + matmul  [反量化回退]

附录:关键源码路径索引

类别 路径 说明
量化配置入口 config/quantization.py OnlineQuantScheme, resolve_online_quant_config
量化方法注册 layers/quantization/init.py QuantizationMethods, get_quantization_config
基类定义 layers/quantization/base_config.py QuantizationConfig, QuantizeMethodBase
Schema 定义 layers/quantization/schema.py KVCacheQuantSchema
FP8 量化 layers/quantization/fp8.py Fp8Config, Fp8LinearMethod, Fp8MoEMethod
GPTQ 量化 layers/quantization/gptq.py GPTQConfig, GPTQLinearMethod
GPTQ-Marlin layers/quantization/gptq_marlin.py GPTQMarlinConfig, GPTQMarlinLinearMethod
AWQ 量化 layers/quantization/awq.py AWQConfig, AWQLinearMethod
AWQ-Marlin layers/quantization/awq_marlin.py AWQMarlinConfig, AWQMarlinLinearMethod
GGUF 量化 layers/quantization/gguf.py GGUFConfig, GGUFLinearMethod, GGUFMoEMethod
GGUF 工具 transformers_utils/gguf_utils.py GGUF 文件检测与解析
MXFP4 量化 layers/quantization/mxfp4.py Mxfp4Config, GptOssMxfp4MoEMethod
ModelOpt 系列 layers/quantization/modelopt.py ModelOptFp8/NvFp4/MxFp8/Mixed Configs
Marlin 工具 layers/quantization/utils/marlin_utils.py Marlin 格式转换、shape 验证、workspace
Machete 工具 layers/quantization/utils/machete_utils.py Machete 约束查询
NVFP4 工具 layers/quantization/utils/nvfp4_utils.py NVFP4 swizzle/pad 工具
量化工具集 layers/quantization/utils/quant_utils.py QuantKey, GroupShape, scaled_quantize, pack/unpack
FP8 工具 layers/quantization/utils/fp8_utils.py FP8 参数创建与处理
Linear 内核总入口 kernels/linear/init.py 所有 linear kernel 的统一导出
ScaledMM 内核 kernels/linear/scaled_mm/ FP8/INT8 缩放矩阵乘法内核族
Mixed Precision 内核 kernels/linear/mixed_precision/ 整数量化内核族 (Marlin/Machete/Exllama...)
MXFP8 内核 kernels/linear/mxfp8/ MXFP8 专用内核
NVFP4 内核 kernels/linear/nvfp4/ NVFP4 专用内核
在线量化 layers/quantization/online/ Online Quantization 实现
相关推荐
大圣编程1 小时前
python break语句
开发语言·前端·python
happyprince1 小时前
18-vLLM 结构化输出约束分析文档
网络·vllm
EntyIU2 小时前
Vue History 模式配置文档
前端·javascript·vue.js
随风一样自由2 小时前
【AI全栈+前端代理】前端代理配置中最常用的参数及说明
前端·前端代理
happyprince2 小时前
14-vLLM LoRA 适配器系统深度解析
vllm
Lorin 洛林3 小时前
一文读懂 Agent Skills
前端·网络
newbe365244 小时前
我们如何使用 impeccable 优化前端界面设计与实现稳定性
前端·人工智能·分布式·github·aigc·wpf
KaMeidebaby11 小时前
卡梅德生物技术快报|蛋白 N 端测序在重组贻贝融合蛋白表征中的应用,解决原核表达序列偏移工艺难题
前端·人工智能·物联网·算法·百度
kyriewen12 小时前
我筛了 1400 个 Claude Code Skills,留下 5 个天天在用的
前端·ai编程·claude