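A Comprehensive Analysis of vLLM Quantization Schemes
Scope: an architectural analysis of the vLLM quantization subsystem, covering the full stack from the configuration layer down to the CUDA kernels.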
[Diagram: overview of the vLLM quantization stack, in three groups]
- Quantization scheme matrix: FP8 E4M3, INT8 W8A8, INT4 W4A16, GPTQ (2-8 bit), AWQ (4-bit), GGUF (multiple precisions), NVFP4 (Blackwell), MXFP4 (Microscaling), MXFP8 (Microscaling), Marlin (4-bit), Machete (Hopper+)
- Implementation layers: QuantizationConfig (configuration layer), LinearKernel (kernel selection layer), CUDA Kernel (operator implementation layer)
- Kernel types: ScaledMM (scaled matrix multiplication), MPLinearKernel (mixed-precision kernels), Mxfp8LinearKernel (MXFP8-specific kernel), NvFp4LinearKernel (NVFP4-specific kernel)
Contents
- 1. Quantization Scheme Matrix
- 2. FP8 Quantization
- 3. INT8 / INT4 Quantization (GPTQ / AWQ)
- 4. GGUF Format
- 5. NVFP4: Blackwell 4-bit Floating Point
- 6. MXFP4 / MXFP8 Block-Scaled Floating Point
- 7. Marlin: High-Performance 4-bit Kernel
- 8. Machete: Next-Generation Quantization Kernel
- 9. Quantized Linear Layers and GEMM Operations
- 10. Quantization Configuration System
- 11. Decision Tree for Choosing a Quantization Scheme
- 12. End-to-End Quantization Data Flow
1. Quantization Scheme Matrix
The quantization schemes supported by vLLM span the full range from floating-point to integer formats. The table below compares them side by side:
| Scheme | Weight precision | Activation precision | Purpose / notes | Min. GPU capability | Config class | Key files |
|---|---|---|---|---|---|---|
| FP8 | float8_e4m3fn | BF16/FP16 (dynamic) | Native support on H100/H200; online or offline quantization | SM 75 (Turing) | Fp8Config | fp8.py |
| FP8 Per-Block | float8_e4m3fn | Dynamic 128-block | DeepSeek-style block-wise quantization | SM 75 | Fp8Config | fp8.py:292-311 |
| MXFP8 | float8 + uint8 scale | BF16 | Microscaling FP8, one scale per 32-element block | SM 80 | ModelOptMxFp8Config | modelopt.py |
| GPTQ | INT4/INT8 (packed int32) | FP16 | Post-training quantization, supports act-order | SM 60 | GPTQConfig | gptq.py |
| GPTQ-Marlin | UINT4B8/UINT8B128 | FP16/BF16 | Marlin-accelerated GPTQ | SM 75 | GPTQMarlinConfig | gptq_marlin.py |
| AWQ | UINT4 (packed int32) | FP16 | Activation-aware quantization | SM 75 | AWQConfig | awq.py |
| AWQ-Marlin | UINT4 | FP16/BF16 | Marlin-accelerated AWQ | SM 75 | AWQMarlinConfig | awq_marlin.py |
| GGUF | Q2~Q8 (multiple precisions) | FP16/BF16 | llama.cpp-compatible format | SM 60 | GGUFConfig | gguf.py |
| NVFP4 | float4_e2m1fn (packed in uint8) | FP8-E4M3 | Native 4-bit floating point on Blackwell | SM 100+ | ModelOptNvFp4Config | modelopt.py, nvfp4/base.py |
| MXFP4 | uint8 (2×FP4 per byte) | BF16 | Microscaling FP4, MoE only | SM 80 | Mxfp4Config | mxfp4.py |
| Marlin | UINT4/UINT8/FP8 | FP16/INT8/FP8 | High-performance 4/8-bit GEMM kernel | SM 75+ | n/a | marlin_utils.py |
| Machete | UINT4/UINT8 | FP16/BF16 | CUTLASS kernel for Hopper (SM90) | SM 90 | n/a | machete.py, machete_utils.py |
Online Quantization
Besides the offline schemes above, vLLM also supports online quantization: a full-precision checkpoint is loaded and then quantized on the fly at runtime:
python
# Defined in config/quantization.py
from enum import Enum

class OnlineQuantScheme(Enum):
    FP8_PER_TENSOR = "fp8_per_tensor"  # FP8 with per-tensor scaling
    FP8_PER_BLOCK = "fp8_per_block"  # FP8 with 128×128 block scaling (DeepSeek style)
    INT8_PER_CHANNEL_WEIGHT_ONLY = "int8_per_channel_weight_only"  # INT8 for MoE expert weights
    MXFP8 = "mxfp8"  # Microscaling FP8
Source location: config/quantization.py:12-27
All Registered Quantization Methods
All methods are registered in one place through the QuantizationMethods Literal type in __init__.py:12-47:
python
from typing import Literal

QuantizationMethods = Literal[
    "awq", "fp8", "fbgemm_fp8", "fp_quant", "modelopt",
    "modelopt_fp4", "modelopt_mxfp8", "modelopt_mixed",
    "gguf", "gptq_marlin", "awq_marlin", "gptq",
    "humming", "compressed-tensors", "bitsandbytes",
    "experts_int8", "quark", "moe_wna16", "torchao",
    "inc", "mxfp4", "gpt_oss_mxfp4", "deepseek_v4_fp8",
    "cpu_awq", "online",
    "fp8_per_tensor", "fp8_per_block",
    "int8_per_channel_weight_only", "mxfp8",
]
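2. FP8 Quantization
2.1 Architecture Overview
FP8 is one of the most important quantization schemes in vLLM; it takes advantage of the native FP8 Tensor Cores on NVIDIA H100/H200. Its design is organized along two axes:
- Offline vs. online: whether the checkpoint is stored in FP8 format
- Activation scheme: static vs. dynamic scaling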
[Diagram: FP8 dispatch starting from Fp8Config]
- Fp8Config modes: Offline (is_checkpoint_fp8_serialized=True), Online (is_checkpoint_fp8_serialized=False)
- Quantization methods: Fp8LinearMethod, Fp8OnlineLinearMethod, Fp8MoEMethod, Fp8OnlineMoEMethod, Fp8KVCacheMethod
- Kernel selection via init_fp8_linear_kernel(): torch._scaled_mm, MarlinFP8ScaledMM, DeepGEMM, BlockScaledMM, FlashInfer
2.2 Fp8Config Core Configuration
Defined in fp8.py:97-225:
python
class Fp8Config(QuantizationConfig):
    def __init__(
        self,
        is_checkpoint_fp8_serialized: bool = False,  # whether the checkpoint is serialized in FP8
        activation_scheme: str = "dynamic",  # "static" or "dynamic"
        ignored_layers: list[str] | None = None,  # layers excluded from quantization
        weight_block_size: list[int] | None = None,  # block-wise quantization size [N, K]
    ) -> None:
        ...
Key attributes:
- is_checkpoint_fp8_serialized: selects Fp8LinearMethod (offline) or Fp8OnlineLinearMethod (online)
- activation_scheme: in dynamic mode the activation scale is computed on the fly for each token; in static mode a precomputed, fixed scale is used
- weight_block_size: setting it to [128, 128] enables DeepSeek V3/R1-style 128×128 block-wise quantization
2.3 QuantKey System
FP8 quantization behavior is controlled precisely by QuantKey, defined in quant_utils.py:100-182:
| QuantKey | Meaning | Use case |
|---|---|---|
| kFp8StaticTensorSym | FP8 + static per-tensor scale | Static activation quantization |
| kFp8DynamicTensorSym | FP8 + dynamic per-tensor scale | Standard dynamic mode |
| kFp8DynamicTokenSym | FP8 + dynamic per-token scale | Efficient mode when CUTLASS supports it |
| kFp8Dynamic128Sym | FP8 + dynamic 128-block scale | DeepSeek block-wise quantization (activation side) |
| kFp8Static128BlockSym | FP8 + static 128×128 block scale | DeepSeek block-wise quantization (weight side) |
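To make the scale granularities in the table concrete, here is a minimal sketch in plain Python lists. It is not vLLM code: the helper names are hypothetical, the group size is a parameter so that a tiny example fits on screen, and 448.0 is taken as the largest finite float8_e4m3fn value.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def amax(values):
    """Largest absolute value in a flat list (0.0 for an empty list)."""
    return max((abs(v) for v in values), default=0.0)


def per_tensor_scale(x):
    """One symmetric scale for the whole 2-D matrix."""
    return amax([v for row in x for v in row]) / FP8_E4M3_MAX


def per_token_scales(x):
    """One symmetric scale per row (token)."""
    return [amax(row) / FP8_E4M3_MAX for row in x]


def per_group_scales(x, group=128):
    """One symmetric scale per contiguous `group` elements of each row."""
    return [
        [amax(row[i:i + group]) / FP8_E4M3_MAX for i in range(0, len(row), group)]
        for row in x
    ]


x = [[1.0, -2.0, 4.0, 8.0], [0.5, 0.25, -448.0, 16.0]]
print(per_tensor_scale(x))     # a single outlier (-448) sets the scale for everything
print(per_token_scales(x))     # the first row keeps a much finer scale
print(per_group_scales(x, 2))  # group size 2 only to keep the example small
```

The finer the granularity, the less a single outlier degrades the resolution of unrelated values, at the cost of storing and applying more scales.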
2.4 Offline FP8 Linear Method
The core flow of Fp8LinearMethod:
python
class Fp8LinearMethod(LinearMethodBase):
    def create_weights(self, layer, ...):
        # 1. Create the FP8 weight parameter
        weight = create_fp8_weight_parameter(...)
        # 2. Create the weight scale parameter
        scale = create_fp8_scale_parameter(...)
        # 3. Initialize the linear kernel (the best backend is chosen automatically)
        self.fp8_linear = init_fp8_linear_kernel(
            activation_quant_key=self.activation_quant_key,
            weight_quant_key=self.weight_quant_key,
            ...
        )

    def process_weights_after_loading(self, layer):
        # Handle the multi-shard weights of fused modules (such as QKV)
        weight, weight_scale, input_scale = process_fp8_weight_tensor_strategy(
            weight, weight_scale, layer.logical_widths, ...
        )
        # Transpose to [K, N] layout (required by scaled_mm)
        weight = weight.t()
        # Let the selected kernel post-process the weights (e.g. Marlin repacking)
        self.fp8_linear.process_weights_after_loading(layer)

    def apply(self, layer, x, bias=None):
        if envs.VLLM_BATCH_INVARIANT:
            # Batch-invariant mode: dequantize to BF16, then compute
            return self._bf16_fallback(layer, x, bias)
        # Both the Marlin path (self.use_marlin) and the default path
        # go through the kernel object selected in create_weights
        return self.fp8_linear.apply_weights(layer, x, bias)
2.5 Online FP8 Quantization
The key difference in Fp8OnlineLinearMethod is that weights are allocated lazily on the meta device:
python
class Fp8OnlineLinearMethod(Fp8LinearMethod):
    uses_meta_device: bool = True  # flag: create weights on the meta device

    def create_weights(self, layer, ...):
        # Weights are created on the meta device (no real GPU memory is used)
        weight = ModelWeightParameter(
            data=torch.empty(..., device="meta", dtype=params_dtype),
            ...
        )
        initialize_online_processing(layer)

    def process_weights_after_loading(self, layer):
        # Quantize online once loading has finished: BF16 → FP8
        qweight, weight_scale = ops.scaled_fp8_quant(layer.weight, scale=None)
        replace_parameter(layer, "weight", qweight.data)
        replace_parameter(layer, "weight_scale", weight_scale.data)
The core quantization op, ops.scaled_fp8_quant, converts the full-precision weights to FP8; because scale=None is passed, the scale is computed from the weights themselves.
2.6 FP8 MoE Methods
Fp8MoEMethod and Fp8OnlineMoEMethod provide FP8 quantization for MoE layers:
- Weight format: w13_weight (fused gate_up) and w2_weight (down_proj), with dtype float8_e4m3fn
- Scales: per-tensor (w13_scale, w2_scale) or per-block (w13_weight_scale_inv)
- Backend selection: select_fp8_moe_backend() automatically picks a backend such as FlashInfer, AITER, or CUTLASS
- Weight reordering: convert_to_fp8_moe_kernel_format() converts the weights into each backend's runtime format
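2.7 MXFP8 (Microscaling FP8)
MXFP8 is a microscaling floating-point format from the OCP Microscaling (MX) specification, in which every 32 elements share one uint8 scale. The implementation lives in modelopt.py and the mxfp8/ directory.
Supported kernel backends:
- MarlinMxfp8LinearKernel: Marlin-accelerated MXFP8
- FlashInferCutlassMxfp8LinearKernel: FlashInfer + CUTLASS
- EmulationMxfp8LinearKernel: emulation mode (non-Blackwell devices)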
3. INT8 / INT4 Quantization (GPTQ / AWQ)
3.1 GPTQ Quantization
GPTQ is a post-training quantization method based on approximate second-order information.
GPTQConfig
Defined in gptq.py:44-223:
python
class GPTQConfig(QuantizationConfig):
    def __init__(
        self,
        weight_bits: int,  # 2, 3, 4, or 8 bits
        group_size: int,  # group size; -1 means per-channel
        desc_act: bool,  # whether activation ordering is enabled
        lm_head_quantized: bool,  # whether lm_head is quantized
        dynamic: dict,  # per-module dynamic config from GPTQModel
        autoround_version: str = "",  # AutoRound version tag
        modules_in_block_to_quantize: list[str] | None = None,
        checkpoint_format: str = "",  # "gptq_v2" or ""
    ):
        self.weight_bits = weight_bits
        self.pack_factor = Fraction(32, self.weight_bits)  # packing factor
GPTQLinearMethod
python
class GPTQLinearMethod(LinearMethodBase):
    def create_weights(self, layer, ...):
        # Quantized weights: packed row-wise into int32
        qweight = PackedvLLMParameter(
            data=torch.empty(
                input_size_per_partition // self.quant_config.pack_factor,
                output_size_per_partition,
                dtype=torch.int32,
            ),
            input_dim=0, output_dim=1,
            packed_dim=0,  # packed along the input dimension
            packed_factor=self.quant_config.pack_factor,
        )
        # Activation-order indices
        g_idx = RowvLLMParameter(...)
        # Zero points (packed)
        qzeros = PackedColumnParameter(...)
        # Scales
        scales = ChannelQuantScaleParameter(...)  # or GroupQuantScaleParameter

    def process_weights_after_loading(self, layer):
        # Exllama shuffle: reorder the weights according to g_idx
        if layer.exllama_state == ExllamaState.UNINITIALIZED:
            if self.quant_config.desc_act:
                layer.g_idx.data = torch.argsort(layer.g_idx).to(torch.int)
            ops.gptq_shuffle(layer.qweight, layer.g_idx, self.quant_config.weight_bits)

    def apply(self, layer, x, bias=None):
        output = ops.gptq_gemm(
            reshaped_x, layer.qweight, layer.qzeros,
            layer.scales, layer.g_idx,
            layer.exllama_state == ExllamaState.READY,
            self.use_v2_format,  # GPTQ v1/v2 format difference
            self.quant_config.weight_bits,
        )
GPTQ v1 vs. v2: the v2 format handles zero points differently, so the GEMM path has to be told which format it is given (the use_v2_format argument above).
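GPTQ Dynamic Configuration
The dynamic configuration introduced by GPTQModel lets different layers of a model use different quantization parameters:
python
# Example from gptq.py:61-83
dynamic = {
    r"+:.*\.(?:1[0-5])\..*": {"bits": 8},  # layers 10-15 use 8 bits
    r"+:.*\.(?:1[6-9]|20|21)\..*": {"bits": 8, "group_size": 64},  # layers 16-21: 8 bits, group size 64
    r"-:.*\.moe\..*": {},  # skip all MoE layers
}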
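The following sketch shows one way such rules could be resolved for a given module name. It is an illustration only: resolve_overrides is a hypothetical helper, and the matching semantics (a "-:" rule excludes the module, a "+:" rule or an unprefixed rule overrides the base values, first match wins) are assumptions rather than a description of vLLM's or GPTQModel's actual logic.

```python
import re

dynamic = {
    r"+:.*\.(?:1[0-5])\..*": {"bits": 8},
    r"+:.*\.(?:1[6-9]|20|21)\..*": {"bits": 8, "group_size": 64},
    r"-:.*\.moe\..*": {},
}


def resolve_overrides(module_name, rules, base):
    """Return the effective config for a module, or None if it is skipped.

    Assumed semantics: "-:" excludes matching modules, "+:" (or no prefix)
    overrides base values, and the first matching rule wins.
    """
    for pattern, overrides in rules.items():
        if pattern.startswith("-:"):
            if re.match(pattern[2:], module_name):
                return None
        else:
            regex = pattern[2:] if pattern.startswith("+:") else pattern
            if re.match(regex, module_name):
                return {**base, **overrides}
    return dict(base)


base = {"bits": 4, "group_size": 128}
for name in (
    "model.layers.12.self_attn.q_proj",
    "model.layers.18.mlp.down_proj",
    "model.layers.3.moe.experts.0.w1",
    "model.layers.3.self_attn.q_proj",
):
    print(name, "->", resolve_overrides(name, dynamic, base))
```

With these assumptions, layer 12 is promoted to 8 bits, layer 18 additionally gets group size 64, the MoE module is skipped, and every other module keeps the base configuration.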
3.2 AWQ Quantization
AWQ (Activation-aware Weight Quantization) analyzes the activation distribution in order to protect the most important weights.
AWQConfig
python
class AWQConfig(QuantizationConfig):
    def __init__(self, weight_bits=4, group_size=128, zero_point=True,
                 modules_to_not_convert=None):
        self.weight_bits = weight_bits
        # AWQ is fixed at 4 bits
        assert self.weight_bits == 4
        self.pack_factor = 32 // self.weight_bits  # = 8
AWQLinearMethod
python
class AWQLinearMethod(LinearMethodBase):
    def create_weights(self, layer, ...):
        # AWQ packs along the output dimension (unlike GPTQ)
        qweight = PackedvLLMParameter(
            data=torch.empty(
                input_size_per_partition,
                output_size_per_partition // self.quant_config.pack_factor,
                dtype=torch.int32,
            ),
            input_dim=0, output_dim=1,
            packed_dim=1,  # packed along the output dimension (AWQ-specific)
            packed_factor=self.quant_config.pack_factor,
        )
        scales = GroupQuantScaleParameter(...)
        qzeros = PackedvLLMParameter(...)

    def apply(self, layer, x, bias=None):
        # Large-batch heuristic (>= 256 tokens): dequantize, then use a regular matmul
        if x.shape[:-1].numel() >= 256 or envs.VLLM_BATCH_INVARIANT:
            out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
            out = torch.matmul(reshaped_x, out)
        else:
            # Small batch: use the fused AWQ GEMM kernel
            out = ops.awq_gemm(reshaped_x, qweight, scales, qzeros, pack_factor)
AWQ vs. GPTQ packing direction:
- GPTQ: qweight has shape [K//pack, N], packed along the input dimension (dim 0)
- AWQ: qweight has shape [K, N//pack], packed along the output dimension (dim 1)
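The difference in packing direction can be sketched with plain Python integers. Both helpers below are hypothetical and pack nibbles in simple sequential order; the AWQ-specific nibble interleave is deliberately left out so that only the direction differs.

```python
def pack_rows_along_input(w, bits=4):
    """GPTQ-style direction: pack consecutive rows (input dim K) into one word.

    w is a K x N list of unsigned ints; the result has shape (K // pack) x N.
    """
    pack = 32 // bits
    k, n = len(w), len(w[0])
    out = [[0] * n for _ in range(k // pack)]
    for row in range(k):
        for col in range(n):
            out[row // pack][col] |= w[row][col] << (bits * (row % pack))
    return out


def pack_cols_along_output(w, bits=4):
    """AWQ-style direction: pack consecutive columns (output dim N) into one word.

    The result has shape K x (N // pack). Sequential nibble order is used here.
    """
    pack = 32 // bits
    k, n = len(w), len(w[0])
    out = [[0] * (n // pack) for _ in range(k)]
    for row in range(k):
        for col in range(n):
            out[row][col // pack] |= w[row][col] << (bits * (col % pack))
    return out


K, N = 16, 8
w = [[(r + c) % 16 for c in range(N)] for r in range(K)]
gptq_like = pack_rows_along_input(w)
awq_like = pack_cols_along_output(w)
print(len(gptq_like), len(gptq_like[0]))  # 2 8
print(len(awq_like), len(awq_like[0]))    # 16 1
```

The same 16×8 matrix of 4-bit values therefore ends up as a 2×8 array in one layout and a 16×1 array in the other, which is why a conversion between the two has to unpack and repack rather than simply reinterpret the buffer.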
3.3 AWQ → Marlin Format Conversion
Because AWQ uses a non-standard bit-packing order, converting to the Marlin format needs special handling.
awq_marlin.py:67-149 defines the conversion logic:
python
_REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]  # undoes AWQ's nibble interleave

def _convert_awq_to_standard_format(layer, w_q_name, w_zp_name, size_bits):
    """Convert AWQ's non-standard format into the GPTQ-like standard format."""
    pack_factor = 32 // size_bits
    # 1. Unpack the AWQ qweight and fix the bit order
    unpacked = (qw.unsqueeze(-1) >> shifts) & mask
    unpacked = unpacked[:, :, reverse_order]  # undo the AWQ bit order
    # 2. Repack from output-dimension packing to input-dimension packing
    new_qw = repack_along_input_dim(unpacked)
    # 3. Apply the same treatment to qzeros
    new_qz = convert_and_repack_zeros(qz)
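The role of the reverse order list can be checked on a single 32-bit word. This is a sketch, not the vLLM implementation: the forward interleave [0, 2, 4, 6, 1, 3, 5, 7] is assumed to be the order AWQ checkpoints are packed with, and both function names are hypothetical.

```python
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]          # assumed interleave used when packing
REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]  # same values as in the excerpt above


def awq_pack_word(values):
    """Pack eight 4-bit values into one 32-bit word using the assumed AWQ interleave."""
    word = 0
    for slot, src in enumerate(AWQ_PACK_ORDER):
        word |= (values[src] & 0xF) << (4 * slot)
    return word


def awq_unpack_word(word):
    """Extract the eight nibbles, then restore their logical order."""
    nibbles = [(word >> (4 * slot)) & 0xF for slot in range(8)]
    return [nibbles[i] for i in REVERSE_AWQ_PACK_ORDER]


vals = [1, 2, 3, 4, 5, 6, 7, 8]
word = awq_pack_word(vals)
print(hex(word))
print(awq_unpack_word(word))
```

Indexing the extracted nibbles with the reverse list plays the same role as the unpacked[:, :, reverse_order] step: nibble slot i holds logical element AWQ_PACK_ORDER[i], so the reverse list is simply the inverse permutation.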
4. GGUF Format
4.1 Overview
GGUF is the single-file model format defined by llama.cpp, and it supports many quantization precisions. vLLM provides GGUF inference support through gguf.py.
4.2 Supported Quantization Types
python
UNQUANTIZED_TYPES = {WeightType.F32, WeightType.F16, WeightType.BF16}
STANDARD_QUANT_TYPES = {
    WeightType.Q4_0, WeightType.Q4_1,  # standard 4-bit
    WeightType.Q5_0, WeightType.Q5_1,  # standard 5-bit
    WeightType.Q8_0, WeightType.Q8_1,  # standard 8-bit
}
KQUANT_TYPES = {  # K-quants (improved variants)
    WeightType.Q2_K, WeightType.Q3_K,
    WeightType.Q4_K, WeightType.Q5_K, WeightType.Q6_K,
}
IMATRIX_QUANT_TYPES = {  # I-Matrix quantization (importance matrix)
    WeightType.IQ1_M, WeightType.IQ1_S,
    WeightType.IQ2_XXS, WeightType.IQ2_XS, ...
}
4.3 GGUF GEMM Operations
_fused_mul_mat_gguf in gguf.py:201-233 implements a three-tier kernel selection:
python
def _fused_mul_mat_gguf(x, qweight, qweight_type):
    # 1. MMVQ (quantized matrix-vector multiply): for small batches
    #    (the threshold mmvq_safe lies between 2 and 16 rows)
    if x.shape[0] <= mmvq_safe and qweight_type in MMVQ_QUANT_TYPES:
        y = ops.ggml_mul_mat_vec_a8(qweight, x, qweight_type, ...)
    # 2. MMQ (quantized matrix-matrix multiply): standard batched inference
    elif qweight_type in MMQ_QUANT_TYPES:
        y = ops.ggml_mul_mat_a8(qweight, x, qweight_type, ...)
    # 3. Dequantization fallback: without a dedicated kernel, dequantize first, then matmul
    elif qweight_type in DEQUANT_TYPES:
        weight = ops.ggml_dequantize(qweight, qweight_type, ...)
        y = x @ weight.T
    return y
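4.4 GGUF MoE Support
GGUFMoEMethod in gguf.py:564-669:
- When both weights support MMQ and x.shape[0] > 64, it uses the fused MoE kernel (ops.ggml_moe_a8)
- Otherwise it uses the per-expert vector kernel (ops.ggml_moe_a8_vec)
- As a last resort it falls back to a slow per-token, per-expert path
4.5 GGUF Utility Functions
gguf_utils.py provides:
- check_gguf_file(): checks whether a file is in GGUF format
- is_remote_gguf(): detects a remote GGUF model (for example the repo_id:Q4_K_M form)
- is_valid_gguf_quant_type(): validates a quantization type name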
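5. NVFP4: Blackwell 4-bit Floating Point
5.1 Overview
NVFP4 is the native 4-bit floating-point format (float4_e2m1fn) introduced with the NVIDIA Blackwell architecture (SM100+, e.g. the B200 GPU). Its characteristics:
- Data type: float4_e2m1fn, with two FP4 values packed into one uint8
- Scaling: block scales in FP8-E4M3FN format, default group size = 16
- Global scaling: an additional scalar global scale for both weights and activations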
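The block-scaled part of this scheme can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function names are hypothetical, the block scale is kept as an ordinary Python float instead of being rounded to FP8-E4M3, and the global scalar scale is omitted. The grid lists the magnitudes an E2M1 value can represent.

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 E2M1
BLOCK = 16  # NVFP4 group size


def quantize_block(block):
    """Quantize one block to (scale, values on the E2M1 grid)."""
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    q = []
    for v in block:
        # Round the scaled magnitude to the nearest representable value
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        q.append(mag if v >= 0 else -mag)
    return scale, q


def dequantize_block(scale, q):
    """Map grid values back to real values with the block scale."""
    return [scale * v for v in q]


block = [0.1 * i - 0.8 for i in range(BLOCK)]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
print(scale)
print(q)
print(max(abs(a - b) for a, b in zip(block, restored)))
```

Because the grid is much coarser near its upper end (the step from 4 to 6), a small group size such as 16 keeps the per-block scale close to the local value range, which limits the absolute rounding error.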
5.2 Data Structure
python
from dataclasses import dataclass

@dataclass
class NvFp4LinearLayerConfig:
    """All NVFP4 layers share the same structure:
    - packed uint8 weights (two FP4 values per byte)
    - FP8-E4M3 per-block weight scales (group size 16)
    - global scalar scales for weights and activations
    """
    pass
5.3 Quantization Key Definitions
python
# Dynamic NVFP4: FP8 block scales are computed on the fly on the activation side
kNvfp4DynamicGroupScale = ScaleDesc(FP8_DTYPE, False, GroupShape(1, 16))
kNvfp4Dynamic = QuantKey(FP4_DTYPE, scale=kNvfp4DynamicGroupScale,
                         scale2=kStaticTensorScale)
# Static NVFP4: precomputed FP8 block scales
kNvfp4StaticGroupScale = ScaleDesc(FP8_DTYPE, True, GroupShape(1, 16))
kNvfp4Static = QuantKey(FP4_DTYPE, scale=kNvfp4StaticGroupScale,
                        scale2=kStaticTensorScale)
5.4 NVFP4 Kernel Backends
The registered backends can be seen in linear/__init__.py:77-80:
| Kernel class | File | Notes |
|---|---|---|
| CutlassNvFp4LinearKernel | nvfp4/cutlass.py | CUTLASS implementation (native on Blackwell) |
| MarlinNvFp4LinearKernel | nvfp4/marlin.py | Marlin kernel adaptation |
| FlashInferNvFp4LinearKernel | nvfp4/flashinfer.py | FlashInfer backend |
| FBGemmNvFp4LinearKernel | nvfp4/fbgemm.py | FBGEMM (ROCm) |
| EmulationNvFp4LinearKernel | nvfp4/emulation.py | Emulation mode (non-Blackwell) |
5.5 NVFP4 Utility Functions
nvfp4_utils.py provides the key helpers:
python
def swizzle_blockscale(scale: torch.Tensor) -> torch.Tensor:
    """Pad and block-interleave the FP4 block scales to match the CUTLASS/FlashInfer layout."""
    # reshape → permute → contiguous, matching the data layout the kernel expects

def pad_nvfp4_weight_for_cutlass(weight, alignment=32):
    """Pad NVFP4 weights to satisfy CUTLASS's alignment requirement of 32."""

def cutlass_fp4_supported() -> bool:
    """Check whether the current device supports CUTLASS FP4."""
6. MXFP4 / MXFP8 Block-Scaled Floating Point
6.1 MXFP4 (Microscaling FP4)
MXFP4 is a 4-bit microscaling floating-point format aimed at MoE models; it mainly serves models such as GPT-OSS and DeepSeek-V4.
python
class Mxfp4Config(QuantizationConfig):
    @classmethod
    def get_min_capability(cls) -> int:
        return 80  # requires Ampere (SM80) or newer

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.bfloat16]

    def get_quant_method(self, layer, prefix):
        if isinstance(layer, LinearBase):
            # MXFP4 is not implemented for Linear layers yet; fall back to unquantized
            return UnquantizedLinearMethod()
        elif isinstance(layer, FusedMoE):
            return GptOssMxfp4MoEMethod(layer.moe_config)
MXFP4 weight layout:
- w13_weight: shape [num_experts, 2*intermediate, hidden//2], dtype uint8
- Two FP4 values are packed into each byte
- w13_weight_scale: shape [num_experts, 2*intermediate, hidden//32], block size = 32
MXFP4 backend selection (through select_mxfp4_moe_backend()):
- AITER_MXFP4_BF16: AITER backend
- FLASHINFER_TRTLLM_MXFP4_MXFP8: FlashInfer TRTLLM (supports padding skip)
- TRITON_MXFP4_BF16: Triton kernel
6.2 MXFP8 (Microscaling FP8)
In MXFP8 every 32 elements share one uint8 scale. This is distinct from the 128×128 block-wise FP8 scheme that Section 2 associates with DeepSeek-V3/R1.
The quantization keys are defined in quant_utils.py:154-158:
python
kMxfp8StaticScale = ScaleDesc(torch.uint8, True, GroupShape(1, 32))
kMxfp8Static = QuantKey(FP8_DTYPE, kMxfp8StaticScale, symmetric=True)
kMxfp8DynamicScale = ScaleDesc(torch.uint8, False, GroupShape(1, 32))
kMxfp8Dynamic = QuantKey(FP8_DTYPE, kMxfp8DynamicScale, symmetric=True)
The MXFP8-specific kernels live in the kernels/linear/mxfp8/ directory:
- MarlinMxfp8LinearKernel: Marlin-accelerated
- FlashInferCutlassMxfp8LinearKernel: FlashInfer + CUTLASS
- EmulationMxfp8LinearKernel: emulation fallback
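The shared uint8 scale in the MXFP8 format of Section 6.2 encodes a power of two. The sketch below assumes the usual microscaling convention (an E8M0 code with bias 127, shared exponent = floor(log2(amax)) minus the largest exponent of the element format); the helper names are hypothetical and vLLM's kernels may compute or round the scale differently.

```python
import math

E4M3_EMAX = 8    # 448 = 1.75 * 2**8, so the largest E4M3 exponent is 8
E8M0_BIAS = 127  # the uint8 code c represents the scale 2**(c - 127)
BLOCK = 32       # elements sharing one scale


def e8m0_scale_code(block):
    """Return the uint8 shared-exponent code for one block of values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return E8M0_BIAS  # scale 1.0 for an all-zero block
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E4M3_EMAX
    # Code 255 is reserved (NaN) in E8M0, so clamp to [0, 254]
    return max(0, min(254, shared_exp + E8M0_BIAS))


def decode_scale(code):
    """Turn the uint8 code back into a power-of-two scale."""
    return 2.0 ** (code - E8M0_BIAS)


block = [1.0] * BLOCK
code = e8m0_scale_code(block)
print(code, decode_scale(code))
print(max(abs(v) for v in block) / decode_scale(code))  # scaled amax fits in E4M3
```

Restricting the scale to powers of two means that applying it only shifts exponents, so it never introduces rounding error of its own; the trade-off is that up to one bit of dynamic range in the element format can go unused.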
7. Marlin: High-Performance 4-bit Kernel
7.1 Overview
Marlin is a high-performance GEMM kernel originally designed for 4-bit weight quantization; vLLM uses it as the preferred acceleration backend for GPTQ, AWQ, and FP8.
7.2 Supported Quantization Types
python
def query_marlin_supported_quant_types(has_zp=None, include_fp_type=True, ...):
    """
    has_zp=True  (AWQ style):  [scalar_types.uint4]
    has_zp=False (GPTQ style): [uint4b8, uint8b128] (+ FP8 types if include_fp_type)
    """
Full support list:
- Integer types: uint4 (AWQ), uint4b8 (GPTQ 4-bit), uint8b128 (GPTQ 8-bit)
- Floating-point types: float8_e4m3fn (FP8 weights), float4_e2m1f (FP4 weights)
7.3 Marlin Constraints
python
GPTQ_MARLIN_TILE = 16  # Marlin tile size
GPTQ_MARLIN_MIN_THREAD_N = 64  # minimum thread-tile width along the output dimension (N)
GPTQ_MARLIN_MIN_THREAD_K = 128  # minimum thread-tile depth along the input dimension (K)
MARLIN_SUPPORTED_GROUP_SIZES = [-1, 32, 64, 128]  # supported group sizes
The shape validation function verify_marlin_supports_shape() checks:
- output_size % 64 == 0
- input_size % 128 == 0
- if group_size < input_size: input_size % group_size == 0
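The three checks listed above, together with the supported group sizes, can be collected into a small standalone validator. The function below is a hypothetical sketch that mirrors the listed conditions; it is not the body of verify_marlin_supports_shape().

```python
MIN_THREAD_N = 64    # mirrors GPTQ_MARLIN_MIN_THREAD_N
MIN_THREAD_K = 128   # mirrors GPTQ_MARLIN_MIN_THREAD_K
SUPPORTED_GROUP_SIZES = [-1, 32, 64, 128]


def marlin_shape_problems(output_size, input_size, group_size):
    """Return a list of violated constraints (an empty list means the shape is fine)."""
    problems = []
    if group_size not in SUPPORTED_GROUP_SIZES:
        problems.append(f"group_size {group_size} is not supported")
    if output_size % MIN_THREAD_N != 0:
        problems.append(f"output_size {output_size} is not a multiple of {MIN_THREAD_N}")
    if input_size % MIN_THREAD_K != 0:
        problems.append(f"input_size {input_size} is not a multiple of {MIN_THREAD_K}")
    # group_size == -1 means one group spanning the whole input dimension
    if 0 < group_size < input_size and input_size % group_size != 0:
        problems.append(f"input_size {input_size} is not a multiple of group_size {group_size}")
    return problems


print(marlin_shape_problems(4096, 4096, 128))  # []
print(marlin_shape_problems(4000, 4096, 128))
print(marlin_shape_problems(4096, 4160, 128))
```

Returning every violation rather than stopping at the first one makes it easier to see why a given layer cannot use the kernel, for example after tensor-parallel sharding has changed its per-partition sizes.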
7.4 Marlin Weight Preprocessing
The Marlin kernel requires the weights to go through a specific reordering and permutation:
python
# marlin_utils.py:292-312: scale permutation
def marlin_permute_scales(s, size_k, size_n, group_size, is_a_8bit=False):
    scale_perm, scale_perm_single = get_scale_perms()
    # group quant: uses the full 8×8 permutation
    # channel quant: uses the 4×8 single permutation
    s = s.reshape((-1, len(scale_perm)))[:, scale_perm]
    return s.reshape((-1, size_n)).contiguous()

# marlin_utils.py:344-365: zero-point permutation
def marlin_zero_points(zp, size_k, size_n, num_bits, is_a_8bit=False):
    # 1. Apply the scale permutation
    # 2. Interleave along the column dimension
    # 3. Pack into int32
    ...
7.5 Marlin Workspace
python
def marlin_make_workspace_new(device, max_blocks_per_sm=1):
    """Create the Marlin workspace tensor.

    Size = number of SMs × max_blocks_per_sm.
    Used for synchronization via atomic operations.
    """
    sms = num_compute_units(device.index)
    return torch.zeros(sms * max_blocks_per_sm, dtype=torch.int, device=device)
7.6 Marlin GEMM Call
python
def apply_gptq_marlin_linear(input, weight, weight_scale, weight_zp,
                             g_idx, g_idx_sort_indices, workspace,
                             wtype, ..., input_dtype=None):
    # Optional: W4A8 input quantization
    if input_dtype == torch.int8:
        reshaped_x, a_scales = marlin_quant_input(reshaped_x, torch.int8)
    elif input_dtype == torch.float8_e4m3fn:
        reshaped_x, a_scales = marlin_quant_input(reshaped_x, torch.float8_e4m3fn)
    output = ops.marlin_gemm(
        reshaped_x, None, weight, bias,
        weight_scale, a_scales, None,
        weight_zp, g_idx, g_idx_sort_indices, workspace,
        wtype, size_m=..., size_n=..., size_k=...,
        is_k_full=..., use_atomic_add=..., use_fp32_reduce=...,
    )
7.7 W4A8 Extension: INT8 / FP8 Activation Quantization
Marlin supports not only W4A16 but also low-precision activations:
python
# marlin_utils.py:474-493
def get_marlin_input_dtype(prefix=None):
    """Controlled by the VLLM_MARLIN_INPUT_DTYPE environment variable:
    - None: standard W4A16
    - "int8": W4A8-INT8 (requires SM75+)
    - "fp8": W4A8-FP8 (SM89 Ada or SM120 Blackwell only)
    """
8. Machete: Next-Generation Quantization Kernel
8.1 Overview
Machete is a newer CUTLASS-based quantized GEMM kernel optimized specifically for the NVIDIA Hopper (SM90) architecture, where it targets higher throughput than Marlin.
8.2 Machete Constraints
python
MACHETE_PREPACKED_BLOCK_SHAPE = [64, 128]  # prepacked block shape

def query_machete_supported_quant_types(zero_points):
    if zero_points:
        return [scalar_types.uint4, scalar_types.uint8]
    else:
        return [scalar_types.uint4b8, scalar_types.uint8b128]

def query_machete_supported_group_sizes(act_type):
    if act_type in [torch.float16, torch.bfloat16]:
        return [-1, 64, 128]  # channel-wise / 64 / 128
    else:
        return [-1, 128]

def check_machete_supports_shape(in_features, out_features):
    # in_features % 64 == 0
    # out_features % 128 == 0
    ...
8.3 MacheteLinearKernel
mixed_precision/machete.py:24-80:
python
class MacheteLinearKernel(MPLinearKernel):
    @classmethod
    def get_min_capability(cls) -> int:
        return 90  # Hopper (SM90) only

    @classmethod
    def can_implement(cls, c: MPLinearLayerConfig):
        # 1. The platform must be CUDA
        # 2. The device must be SM90
        # 3. act_order combined with TP partitioning is not supported
        # 4. The quantization type must be in the supported list
        # 5. group_size must be in the supported list
        # 6. The shape must satisfy the 64/128 alignment
        ...
8.4 Marlin vs. Machete
| Feature | Marlin | Machete |
|---|---|---|
| Minimum architecture | SM 75 (Turing) | SM 90 (Hopper) |
| Implementation | Custom CUDA | CUTLASS |
| Alignment requirements | N%64, K%128 | N%128, K%64 |
| Zero-point support | ✅ (uint4) | ✅ (uint4, uint8) |
| Act order | ✅ | ❌ (under TP partitioning) |
| FP8 weights | ✅ | ❌ |
| W4A8 | ✅ (INT8/FP8) | ❌ |
| Typical use | General-purpose 4-bit inference | High throughput on Hopper |
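The comparison table can be read as a set of eligibility rules. The sketch below encodes only what the table states; candidate_kernels is a hypothetical helper, and vLLM's real selection logic considers more conditions (quantization type, group size, platform) than are modeled here.

```python
def candidate_kernels(capability, n, k, *, fp8_weights=False,
                      low_precision_act=False, act_order_with_tp=False):
    """Return the kernels whose table constraints are all satisfied.

    capability is the compute capability as an int (75, 80, 90, ...);
    n and k are the output and input sizes of the layer.
    """
    names = []
    # Marlin: SM75 or newer, N % 64 == 0, K % 128 == 0
    if capability >= 75 and n % 64 == 0 and k % 128 == 0:
        names.append("marlin")
    # Machete: SM90 only, N % 128 == 0, K % 64 == 0, and none of the
    # features the table marks as unsupported
    if (capability == 90 and n % 128 == 0 and k % 64 == 0
            and not fp8_weights and not low_precision_act
            and not act_order_with_tp):
        names.append("machete")
    return names


print(candidate_kernels(90, 4096, 4096))                    # ['marlin', 'machete']
print(candidate_kernels(80, 4096, 4096))                    # ['marlin']
print(candidate_kernels(90, 4096, 4096, fp8_weights=True))  # ['marlin']
```

Note that the two kernels swap their alignment requirements on N and K, so a layer with N = 64 can qualify for Marlin but not for Machete even on Hopper.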
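9. Quantized Linear Layers and GEMM Operations
9.1 Kernel Selection System
vLLM's quantized GEMM kernels use a layered selection architecture, defined in the kernels/linear/ directory: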
[Diagram: kernel selection hierarchy]
- Entry points: init_fp8_linear_kernel(), init_nvfp4_linear_kernel(), init_mxfp8_linear_kernel(), choose_mp_linear_kernel()
- ScaledMM subsystem: PyTorch _scaled_mm, CutlassFP8ScaledMM, FlashInferFP8, MarlinFP8ScaledMM, Fp8BlockScaledMM, DeepGEMM, AiterInt8, TritonInt8, ROCmFP8, CPUFp8BlockScaled
- Mixed-precision kernels (choose_mp_linear_kernel()): MarlinLinearKernel, MacheteLinearKernel, ExllamaLinearKernel, ConchLinearKernel, AllSparkLinearKernel, CutlassW4A8LinearKernel, Dynamic4bitLinearKernel, TritonW4A16LinearKernel, CPUWNA16LinearKernel
9.2 ScaledMM Kernel Hierarchy
scaled_mm/__init__.py registers the following FP8/INT8 scaled matrix multiplication kernels:
| Kernel class | Precision | Backend | Condition |
|---|---|---|---|
| PerTensorTorchFP8ScaledMMLinearKernel | W8A8 | PyTorch native | Generic fallback |
| ChannelWiseTorchFP8ScaledMMLinearKernel | W8A8 | PyTorch native | Per-channel scales |
| RowWiseTorchFP8ScaledMMLinearKernel | W8A8 | PyTorch native | Per-row scales |
| CutlassFP8ScaledMMLinearKernel | W8A8 | CUTLASS | Ampere+ CUDA |
| CutlassInt8ScaledMMLinearKernel | W8A8 | CUTLASS | Ampere+ CUDA |
| FlashInferFP8ScaledMMLinearKernel | W8A8 | FlashInfer | Hopper+ |
| MarlinFP8ScaledMMLinearKernel | W8A8 | Marlin | Turing+, devices without native FP8 |
| Fp8BlockScaledMMLinearKernel | W8A8 Block | CUTLASS | Block-wise scaling |
| ROCmFP8ScaledMMLinearKernel | W8A8 | ROCm | AMD GPU |
| AiterInt8ScaledMMLinearKernel | W8A8 | AITER | AMD GPU (ROCm) |
| TritonInt8ScaledMMLinearKernel | W8A8 | Triton | Generic |
| CPUFp8BlockScaledKernel | W8A8 | CPU | CPU inference |
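To make the idea of a "scaled" matrix multiplication concrete, here is a minimal pure-Python sketch of a W8A8 INT8 scaled GEMM. It is not vLLM code: `quantize_int8` and `scaled_mm` are hypothetical helper names, symmetric quantization is assumed, plain lists stand in for tensors, and the weights are quantized on the fly only to keep the example short. It contrasts a single per-tensor weight scale with per-channel scales.

```python
def quantize_int8(values, scale):
    """Symmetric INT8 quantization: q = clamp(round(v / scale), -127, 127)."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def scaled_mm(x, w_rows, w_scales):
    """Compute y[j] = sum_k x[k] * w[j][k] using INT8 operands.

    x        : activation vector (floats), quantized with one dynamic scale
    w_rows   : weight matrix as a list of output-channel rows (floats)
    w_scales : one scale per output channel; repeat a single value to get
               per-tensor behaviour
    """
    x_scale = max(abs(v) for v in x) / 127
    x_q = quantize_int8(x, x_scale)
    out = []
    for row, w_scale in zip(w_rows, w_scales):
        w_q = quantize_int8(row, w_scale)
        acc = sum(a * b for a, b in zip(x_q, w_q))   # integer accumulation
        out.append(acc * x_scale * w_scale)          # rescale back to float
    return out


x = [0.5, -1.0, 0.25]
w = [[0.01, -0.02, 0.015],   # small-magnitude output channel
     [1.0, -2.0, 1.5]]       # large-magnitude output channel
reference = [sum(a * b for a, b in zip(x, row)) for row in w]

per_tensor = [max(abs(v) for row in w for v in row) / 127] * len(w)
per_channel = [max(abs(v) for v in row) / 127 for row in w]

y_tensor = scaled_mm(x, w, per_tensor)
y_channel = scaled_mm(x, w, per_channel)

for name, y in (("per-tensor", y_tensor), ("per-channel", y_channel)):
    rel = [abs(a - b) / abs(b) for a, b in zip(y, reference)]
    print(name, [f"{r:.4%}" for r in rel])
```

With a per-tensor scale, the small-magnitude channel collapses onto a handful of integer levels, so its relative error is much larger than with per-channel scales. This is the usual motivation for offering per-channel and per-row kernel variants alongside the per-tensor one.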
9.3 Mixed Precision Kernel Hierarchy
The mixed_precision/ directory contains the integer weight-quantization kernels:
| Kernel class | Precision | Backend | Notes |
|---|---|---|---|
| MarlinLinearKernel | W4A16/W8A16 | Marlin CUDA | Primary 4/8-bit kernel |
| MacheteLinearKernel | W4A16/W8A16 | CUTLASS | Optimized for SM90 |
| ExllamaLinearKernel | W4A16 | Exllama | Native GPTQ format |
| ConchLinearKernel | W4A16/W8A16 | Conch | Triton-based Conch kernels |
| AllSparkLinearKernel | W4A16 | AllSpark | Tongyi Qianwen (Qwen) |
| CutlassW4A8LinearKernel | W4A8 | CUTLASS | W4A8 precision |
| Dynamic4bitLinearKernel | W4A16 | Dynamic | Dynamic 4-bit |
| TritonW4A16LinearKernel | W4A16 | Triton | Portable |
| CPUWNA16LinearKernel | W4A16 | CPU | CPU inference |
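A selector over such a kernel family typically walks a priority-ordered list and returns the first kernel whose constraints hold. The sketch below illustrates that pattern only. `KernelSpec`, `CANDIDATES`, and `choose_kernel` are hypothetical names, and the capability and bit-width constraints are illustrative values, not the ones enforced by `choose_mp_linear_kernel()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelSpec:
    name: str
    min_capability: int       # e.g. 90 means SM90
    weight_bits: frozenset    # supported weight bit widths


# Hypothetical priority order and constraints, for illustration only.
CANDIDATES = [
    KernelSpec("MacheteLinearKernel", 90, frozenset({4, 8})),
    KernelSpec("MarlinLinearKernel", 75, frozenset({4, 8})),
    KernelSpec("ExllamaLinearKernel", 60, frozenset({4})),
]


def choose_kernel(capability: int, weight_bits: int) -> Optional[str]:
    """Return the first candidate whose constraints are all satisfied."""
    for spec in CANDIDATES:
        if capability >= spec.min_capability and weight_bits in spec.weight_bits:
            return spec.name
    return None  # the caller must fall back or raise


print(choose_kernel(90, 4))   # highest-priority kernel that fits SM90
print(choose_kernel(80, 4))   # SM80 is below the first entry's minimum
print(choose_kernel(70, 8))   # nothing in the toy list supports this
```

Ordering the list by preference keeps the selection logic trivial: adding a new backend means inserting one entry at the right priority rather than editing a chain of conditionals.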
10. Quantization Configuration System
10.1 Base Class Hierarchy
QuantizationConfig (ABC)                                # base_config.py:70
├── get_name() → QuantizationMethods
├── get_supported_act_dtypes() → list[dtype]
├── get_min_capability() → int                          # minimum GPU compute capability
├── get_config_filenames() → list[str]
├── from_config(dict) → Self
└── get_quant_method(layer, prefix) → QuantizeMethodBase | None
QuantizeMethodBase (ABC)                                # base_config.py:19
├── create_weights(layer, ...)                          # create the quantized parameters
├── apply(layer, x, bias) → Tensor                      # forward computation
├── embedding(layer, x) → Tensor                        # embedding lookup (optional)
└── process_weights_after_loading(layer)                # weight post-processing (optional)
Source: base_config.py
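The division of labour between the two base classes can be sketched with a toy implementation. The classes below mirror only the method names listed above. Signatures are simplified, `ToyInt8Config` and `ToyInt8Method` are hypothetical, and a plain dict stands in for the layer module, so this is an illustration of the pattern rather than vLLM's actual interface.

```python
from abc import ABC, abstractmethod


class QuantizeMethodBase(ABC):
    @abstractmethod
    def create_weights(self, layer, in_features, out_features): ...

    @abstractmethod
    def apply(self, layer, x, bias=None): ...

    def process_weights_after_loading(self, layer):
        return None  # optional hook, no-op by default


class QuantizationConfig(ABC):
    @abstractmethod
    def get_name(self): ...

    @abstractmethod
    def get_min_capability(self): ...

    @abstractmethod
    def get_quant_method(self, layer, prefix): ...


class ToyInt8Method(QuantizeMethodBase):
    def create_weights(self, layer, in_features, out_features):
        # Allocate placeholders that the weight loader later fills in.
        layer["weight"] = [[0.0] * in_features for _ in range(out_features)]

    def process_weights_after_loading(self, layer):
        # Quantize once after loading, using one symmetric per-tensor scale.
        amax = max(abs(v) for row in layer["weight"] for v in row)
        scale = amax / 127 if amax > 0 else 1.0
        layer["qweight"] = [[round(v / scale) for v in row] for row in layer["weight"]]
        layer["scale"] = scale

    def apply(self, layer, x, bias=None):
        out = [sum(a * q for a, q in zip(x, row)) * layer["scale"]
               for row in layer["qweight"]]
        return out if bias is None else [o + b for o, b in zip(out, bias)]


class ToyInt8Config(QuantizationConfig):
    def get_name(self):
        return "toy_int8"

    def get_min_capability(self):
        return 75

    def get_quant_method(self, layer, prefix):
        # Returning None leaves a layer unquantized (here: the output head).
        return None if prefix.endswith("lm_head") else ToyInt8Method()


config = ToyInt8Config()
layer = {}
method = config.get_quant_method(layer, prefix="model.layers.0.mlp.down_proj")
method.create_weights(layer, in_features=2, out_features=1)
layer["weight"] = [[1.27, -0.64]]           # stands in for the weight loader
method.process_weights_after_loading(layer)
y = method.apply(layer, [1.0, 2.0])
print(y)
```

The config object decides per layer which method applies, while the method object owns the three lifecycle steps: allocate, post-process, and compute.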
10.2 Config Class Inventory
| Config class | Method name | Weight bits | Min. capability | File |
|---|---|---|---|---|
| Fp8Config | fp8 | 8-bit float | SM 75 | fp8.py |
| GPTQConfig | gptq | 2/3/4/8-bit int | SM 60 | gptq.py |
| GPTQMarlinConfig | gptq_marlin | 4/8-bit int | SM 75 | gptq_marlin.py |
| AWQConfig | awq | 4-bit int | SM 75 | awq.py |
| AWQMarlinConfig | awq_marlin | 4-bit int | SM 75 | awq_marlin.py |
| GGUFConfig | gguf | Multiple precisions | SM 60 | gguf.py |
| ModelOptFp8Config | modelopt | 8-bit float | SM 75 | modelopt.py |
| ModelOptNvFp4Config | modelopt_fp4 | 4-bit float | SM 100 | modelopt.py |
| ModelOptMxFp8Config | modelopt_mxfp8 | 8-bit float + uint8 scale | SM 80 | modelopt.py |
| Mxfp4Config | mxfp4 | 4-bit float (uint8) | SM 80 | mxfp4.py |
| BitsAndBytesConfig | bitsandbytes | Multiple precisions | - | bitsandbytes.py |
| CompressedTensorsConfig | compressed-tensors | Multiple precisions | - | compressed_tensors/ |
| INCConfig | inc / auto-round | Multiple precisions | - | inc.py |
| TorchAOConfig | torchao | Multiple precisions | - | torchao.py |
| HummingConfig | humming | Multiple precisions | - | humming.py |
| ExpertsInt8Config | experts_int8 | 8-bit int | - | experts_int8.py |
| OnlineQuantizationConfig | online | Dynamic | - | online/ |
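The minimum capability column is compared against the device's compute capability encoded as a single integer, so SM 8.9 becomes 89. A small sketch of that check follows, using the values from the table above. `MIN_CAPABILITY`, `to_int_capability`, and `supported_methods` are hypothetical names for illustration.

```python
# Minimum capabilities copied from the table above (methods without a
# listed value are omitted).
MIN_CAPABILITY = {
    "fp8": 75,
    "gptq": 60,
    "gptq_marlin": 75,
    "awq": 75,
    "awq_marlin": 75,
    "gguf": 60,
    "modelopt": 75,
    "modelopt_fp4": 100,
    "modelopt_mxfp8": 80,
    "mxfp4": 80,
}


def to_int_capability(major: int, minor: int) -> int:
    """Encode a (major, minor) compute capability as major * 10 + minor."""
    return major * 10 + minor


def supported_methods(major: int, minor: int) -> list:
    """List the methods whose minimum capability the device satisfies."""
    cap = to_int_capability(major, minor)
    return sorted(name for name, need in MIN_CAPABILITY.items() if cap >= need)


print(supported_methods(7, 0))    # a Volta-class device
print(supported_methods(8, 9))    # an Ada-class device
```

On a machine with PyTorch and a CUDA device, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()`.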
10.3 Parameter Type System
vLLM defines a set of parameter types that describe the weight layout of each quantization format:
| Parameter type | Purpose | Source |
|---|---|---|
| ModelWeightParameter | Full-precision weights | parameter.py |
| PerTensorScaleParameter | Per-tensor scale factor | parameter.py |
| BlockQuantScaleParameter | Block-quantization scale factors | parameter.py |
| ChannelQuantScaleParameter | Per-channel scale factors | parameter.py |
| GroupQuantScaleParameter | Per-group scale factors | parameter.py |
| PackedvLLMParameter | Packed quantized weights (int32) | parameter.py |
| PackedColumnParameter | Zero points packed along columns | parameter.py |
| RowvLLMParameter | Row-indexed parameter (g_idx) | parameter.py |
| GGUFUninitializedParameter | Lazily initialized GGUF parameter | gguf.py:689-691 |
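The four scale parameter types differ mainly in how many scale values accompany a weight matrix. The sketch below computes the scale tensor shape for each granularity. It assumes a weight of shape [N, K] (output by input features) and a hypothetical `scale_shape` helper. Real formats may store the scales transposed, for example [K // group_size, N] for GPTQ.

```python
import math


def scale_shape(granularity, n, k, group_size=128, block=(128, 128)):
    """Shape of the scale tensor for an [n, k] weight (assumed layout)."""
    if granularity == "tensor":       # cf. PerTensorScaleParameter
        return (1,)
    if granularity == "channel":      # cf. ChannelQuantScaleParameter
        return (n, 1)
    if granularity == "group":        # cf. GroupQuantScaleParameter
        if k % group_size:
            raise ValueError("k must be a multiple of group_size")
        return (n, k // group_size)
    if granularity == "block":        # cf. BlockQuantScaleParameter
        block_n, block_k = block
        return (math.ceil(n / block_n), math.ceil(k / block_k))
    raise ValueError(f"unknown granularity: {granularity}")


for g in ("tensor", "channel", "group", "block"):
    print(g, scale_shape(g, n=4096, k=11008))
```

Finer granularity costs more scale storage but tracks local weight magnitudes more closely, which is why low-bit formats tend to use per-group scales while 8-bit formats often get by with per-tensor or per-channel scales.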
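10.4 Automatic Format Detection and Override
Some config classes implement override_quantization_method() to detect the checkpoint format automatically and select the preferred kernel: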
```python
# Excerpts from three config classes; all other members are omitted.

class GPTQMarlinConfig(QuantizationConfig):
    # gptq_marlin.py:220-245: automatically upgrade GPTQ to GPTQ-Marlin
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
        can_convert = cls.is_gptq_marlin_compatible(hf_quant_cfg)
        if can_convert and user_quant in (None, "marlin", "gptq_marlin"):
            return "gptq_marlin"  # switch to the faster Marlin kernel
        return None  # no override


class AWQMarlinConfig(QuantizationConfig):
    # awq_marlin.py:233-262: automatically upgrade AWQ to AWQ-Marlin
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
        can_convert = cls.is_awq_marlin_compatible(hf_quant_cfg)
        if can_convert and user_quant in (None, "marlin", "awq_marlin"):
            return "awq_marlin"
        return None  # no override


class GGUFConfig(QuantizationConfig):
    # gguf.py:86-93: force the GGUF format
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant, hf_config=None):
        if user_quant == "gguf":
            return "gguf"  # overrides any other quantization setting in the HF config
        return None  # no override
```
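How such overrides might be combined into a final decision can be sketched as a loop in which the first non-None answer wins. Everything below is illustrative: `GPTQMarlinLike`, `GGUFLike`, and `resolve_quant_method` are hypothetical, and the compatibility test is a simplified stand-in for `is_gptq_marlin_compatible()`, whose real checks are not shown in this section.

```python
class GPTQMarlinLike:
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant):
        # Stand-in compatibility check (assumption, not the real logic).
        compatible = (hf_quant_cfg.get("quant_method") == "gptq"
                      and hf_quant_cfg.get("bits") in (4, 8))
        if compatible and user_quant in (None, "marlin", "gptq_marlin"):
            return "gptq_marlin"
        return None


class GGUFLike:
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant):
        return "gguf" if user_quant == "gguf" else None


OVERRIDE_CANDIDATES = [GPTQMarlinLike, GGUFLike]


def resolve_quant_method(hf_quant_cfg, user_quant=None):
    """Ask each candidate for an override; fall back to the explicit choice."""
    for candidate in OVERRIDE_CANDIDATES:
        override = candidate.override_quantization_method(hf_quant_cfg, user_quant)
        if override is not None:
            return override
    return user_quant or hf_quant_cfg.get("quant_method")


print(resolve_quant_method({"quant_method": "gptq", "bits": 4}))          # upgraded
print(resolve_quant_method({"quant_method": "gptq", "bits": 4}, "gptq"))  # user insists
print(resolve_quant_method({"quant_method": "gptq", "bits": 3}))          # not compatible
```

Note that an explicit `user_quant="gptq"` suppresses the upgrade in this sketch, matching the condition in the excerpts: the override fires only when the user left the method unset or asked for the Marlin variant.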
11. Decision Tree for Choosing a Quantization Scheme
[Figure: decision tree for choosing a quantization scheme. Recovered labels:
Decision nodes: "Checkpoint format?" (FP8 serialized, FP16/BF16 full precision, GPTQ, AWQ, GGUF, MXFP4/MXFP8, NVFP4); "Quantization needed?" (no; yes, user-specified); "Target GPU?" (H100/H200 (SM90); other GPUs); "Marlin available?" for GPTQ and for AWQ (yes (SM75+); no).
Outcome nodes: Fp8Config with Fp8LinearMethod / Fp8MoEMethod; Fp8Config (online) with Fp8OnlineLinearMethod; UnquantizedLinearMethod; the user-specified scheme; GPTQMarlinConfig with GPTQMarlinLinearMethod; GPTQConfig with GPTQLinearMethod (gptq_gemm); AWQMarlinConfig with AWQMarlinLinearMethod; AWQConfig with AWQLinearMethod (awq_gemm); GGUFConfig with GGUFLinearMethod / GGUFMoEMethod; Mxfp4Config / ModelOptMxFp8Config; ModelOptNvFp4Config with NvFp4LinearKernel.
FP8 backend selection: native FP8 on H100 → _scaled_mm / CutlassFP8; no native FP8 → MarlinFP8ScaledMM; DeepGEMM available → DeepGEMM kernel; block-wise quantization → Fp8BlockScaledMM.
Marlin variant selection: SM90 Hopper → MacheteLinearKernel (preferred); SM75-SM89 → MarlinLinearKernel.]
12. End-to-End Quantization Data Flow
[Figure: end-to-end quantization data flow in five stages. Recovered labels:
Weight loading: checkpoint files (SafeTensors/GGUF/HF); WeightLoader (shard and load).
Parameter creation: create_weights() allocates the quantized parameters.
Weight post-processing: process_weights_after_loading() performs format conversion, weight reordering, and scale computation.
Inference execution: activation quantization for dynamic schemes (scaled_fp8_quant()); quantized GEMM (ops.marlin_gemm, ops.gptq_gemm, ops.awq_gemm, torch._scaled_mm); dequantization / output.
CUDA kernels: marlin_gemm.cu; gptq_gemm.cu; awq_gemm.cu / awq_dequantize; ggml_dequantize / ggml_mul_mat_a8; cutlass_fp4 GEMM.]
Typical Data Flow Examples
Example 1: FP8 online quantization (DeepSeek style)
BF16 Checkpoint
↓ [WeightLoader: load onto the meta device]
↓ [Fp8OnlineLinearMethod.create_weights(): create the FP8 parameters]
↓ [Fp8OnlineLinearMethod.process_weights_after_loading():
    ops.scaled_fp8_quant(weight) → (qweight: FP8, scale: FP32)
    weight.t() → [K, N] layout
  ]
↓ [At inference, for each token:]
↓ [compute the activation scale dynamically]
↓ [torch._scaled_mm(x_fp8, w_fp8, x_scale, w_scale) → output]
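The arithmetic behind this flow can be simulated without a GPU. The sketch below is a pure-Python approximation under stated assumptions: `round_to_e4m3`, `quantize_fp8`, and `fp8_linear` are hypothetical helpers, FP8 is modelled as the e4m3 format with a maximum finite value of 448, and one per-tensor scale is used for the weight and one dynamic scale per activation vector. It is not the implementation of `ops.scaled_fp8_quant` or `torch._scaled_mm`.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3 format


def round_to_e4m3(x):
    """Round to the nearest e4m3-representable value, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)
    _, exp = math.frexp(a)          # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)            # -6 is the smallest normal exponent
    quantum = 2.0 ** (e - 3)        # spacing given 3 mantissa bits
    return sign * min(round(a / quantum) * quantum, FP8_E4M3_MAX)


def quantize_fp8(values):
    """Scale so the largest magnitude maps to 448, then round each value."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def fp8_linear(x, w_q_rows, w_scale):
    """Dynamic activation quantization followed by a scaled matmul."""
    x_q, x_scale = quantize_fp8(x)  # recomputed for every token
    return [sum(a * b for a, b in zip(x_q, row)) * x_scale * w_scale
            for row in w_q_rows]


# "process_weights_after_loading": quantize the weight once.
weight = [[0.2, -1.5, 0.7],
          [3.0, 0.1, -0.4]]
flat_q, w_scale = quantize_fp8([v for row in weight for v in row])
w_q = [flat_q[i * 3:(i + 1) * 3] for i in range(len(weight))]

# Inference: one token's activations.
x = [1.0, -0.5, 2.0]
y = fp8_linear(x, w_q, w_scale)
reference = [sum(a * b for a, b in zip(x, row)) for row in weight]
print("fp8      :", y)
print("reference:", reference)
```

With three mantissa bits, each operand carries a relative rounding error of at most 1/16, so every product in the dot product is accurate to roughly 13 percent in the worst case. In practice the errors partially cancel across the reduction.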
Example 2: GPTQ-Marlin 4-bit quantization
GPTQ Checkpoint (int32 packed weights)
↓ [GPTQMarlinLinearMethod.create_weights():
    qweight: [K//8, N] int32 (packed uint4)
    scales: [K//gs, N] fp16
    qzeros: [K//gs, N//8] int32 (packed)
    g_idx: [K] int32
  ]
↓ [process_weights_after_loading():
    g_idx sorting & argsort
    ops.gptq_shuffle(qweight, g_idx, 4)
    marlin_permute_scales(scales)
    marlin_zero_points(qzeros)
    marlin_make_workspace_new(device)
  ]
↓ [At inference:]
↓ [ops.marlin_gemm(x, qweight, scales, zeros, g_idx, ...)]
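The `[K//8, N]` shape follows from packing eight 4-bit values into each 32-bit word along the K dimension. The sketch below shows that packing and the resulting parameter shapes. The helper names are hypothetical, the nibble order is assumed to be sequential (value i in bits 4i to 4i+3), and Python integers are used as unsigned 32-bit words, whereas the real tensors are int32.

```python
PACK_FACTOR = 8  # eight 4-bit values per 32-bit word


def pack_uint4_along_k(w):
    """Pack a K x N matrix of ints in [0, 15] into (K // 8) x N words."""
    k, n = len(w), len(w[0])
    if k % PACK_FACTOR:
        raise ValueError("K must be a multiple of 8")
    packed = [[0] * n for _ in range(k // PACK_FACTOR)]
    for row in range(k):
        for col in range(n):
            value = w[row][col]
            if not 0 <= value <= 15:
                raise ValueError("values must fit in 4 bits")
            shift = 4 * (row % PACK_FACTOR)
            packed[row // PACK_FACTOR][col] |= value << shift
    return packed


def unpack_uint4_along_k(packed):
    """Inverse of pack_uint4_along_k."""
    return [[(word >> (4 * i)) & 0xF for word in packed_row]
            for packed_row in packed
            for i in range(PACK_FACTOR)]


def gptq_param_shapes(k, n, group_size):
    """Parameter shapes as listed in the example above."""
    return {
        "qweight": (k // PACK_FACTOR, n),
        "scales": (k // group_size, n),
        "qzeros": (k // group_size, n // PACK_FACTOR),
        "g_idx": (k,),
    }


column = [[i + 1] for i in range(8)]          # K = 8, N = 1
word = pack_uint4_along_k(column)[0][0]
print(hex(word))
print(gptq_param_shapes(k=4096, n=4096, group_size=128))
```

Packing cuts weight storage to one eighth of an int32-per-value layout, and `process_weights_after_loading()` then reorders these words into the layout the Marlin kernel expects.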
Example 3: GGUF Q4_K_M quantization
GGUF File (.gguf)
↓ [gguf_utils.py: load metadata + weights]
↓ [GGUFLinearMethod.create_weights():
    qweight: GGUFUninitializedParameter (lazily initialized)
    qweight_type: WeightType.Q4_K (uint8 identifier)
  ]
↓ [process_weights_after_loading():
    validate the quantization type
    create padded weights for fused layers (QKV/gate_up)
  ]
↓ [At inference (_fused_mul_mat_gguf):]
↓ [batch_size ≤ threshold?]
↓ Yes → ops.ggml_mul_mat_vec_a8() [MMVQ vector kernel]
↓ No → ops.ggml_mul_mat_a8() [MMQ matrix kernel]
↓ No kernel → ops.ggml_dequantize() + matmul [dequantization fallback]
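The three-way dispatch at the end of this flow can be sketched as a small function. The threshold value and the sets of supported types below are hypothetical placeholders chosen for illustration, and `choose_gguf_path` is not a vLLM function. Only the op names are taken from the flow above.

```python
# Hypothetical values, for illustration only.
MMVQ_MAX_BATCH = 8
MMVQ_TYPES = {"Q4_0", "Q4_K", "Q8_0"}
MMQ_TYPES = {"Q4_K", "Q8_0"}


def choose_gguf_path(batch_size: int, qweight_type: str) -> str:
    """Pick the op used for a quantized matmul, mirroring the flow above."""
    if batch_size <= MMVQ_MAX_BATCH and qweight_type in MMVQ_TYPES:
        return "ggml_mul_mat_vec_a8"       # MMVQ: vector kernel, small batches
    if qweight_type in MMQ_TYPES:
        return "ggml_mul_mat_a8"           # MMQ: matrix kernel, large batches
    return "ggml_dequantize + matmul"      # fallback: dequantize, then matmul


print(choose_gguf_path(1, "Q4_K"))
print(choose_gguf_path(64, "Q4_K"))
print(choose_gguf_path(64, "Q4_0"))
```

The vector kernel suits decode-time batches of a few tokens, the matrix kernel suits prefill-sized batches, and the fallback trades speed for coverage of any type the fused kernels do not handle.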
Appendix: Index of Key Source Paths
| Category | Path | Description |
|---|---|---|
| Quantization config entry point | config/quantization.py | OnlineQuantScheme, resolve_online_quant_config |
| Quantization method registry | layers/quantization/__init__.py | QuantizationMethods, get_quantization_config |
| Base class definitions | layers/quantization/base_config.py | QuantizationConfig, QuantizeMethodBase |
| Schema definitions | layers/quantization/schema.py | KVCacheQuantSchema |
| FP8 quantization | layers/quantization/fp8.py | Fp8Config, Fp8LinearMethod, Fp8MoEMethod |
| GPTQ quantization | layers/quantization/gptq.py | GPTQConfig, GPTQLinearMethod |
| GPTQ-Marlin | layers/quantization/gptq_marlin.py | GPTQMarlinConfig, GPTQMarlinLinearMethod |
| AWQ quantization | layers/quantization/awq.py | AWQConfig, AWQLinearMethod |
| AWQ-Marlin | layers/quantization/awq_marlin.py | AWQMarlinConfig, AWQMarlinLinearMethod |
| GGUF quantization | layers/quantization/gguf.py | GGUFConfig, GGUFLinearMethod, GGUFMoEMethod |
| GGUF utilities | transformers_utils/gguf_utils.py | GGUF file detection and parsing |
| MXFP4 quantization | layers/quantization/mxfp4.py | Mxfp4Config, GptOssMxfp4MoEMethod |
| ModelOpt family | layers/quantization/modelopt.py | ModelOptFp8/NvFp4/MxFp8/Mixed Configs |
| Marlin utilities | layers/quantization/utils/marlin_utils.py | Marlin format conversion, shape validation, workspace |
| Machete utilities | layers/quantization/utils/machete_utils.py | Machete constraint queries |
| NVFP4 utilities | layers/quantization/utils/nvfp4_utils.py | NVFP4 swizzle/pad utilities |
| Quantization utilities | layers/quantization/utils/quant_utils.py | QuantKey, GroupShape, scaled_quantize, pack/unpack |
| FP8 utilities | layers/quantization/utils/fp8_utils.py | FP8 parameter creation and processing |
| Linear kernel entry point | kernels/linear/__init__.py | Unified export of all linear kernels |
| ScaledMM kernels | kernels/linear/scaled_mm/ | FP8/INT8 scaled matrix multiplication kernel family |
| Mixed Precision kernels | kernels/linear/mixed_precision/ | Integer quantization kernel family (Marlin/Machete/Exllama...) |
| MXFP8 kernels | kernels/linear/mxfp8/ | MXFP8-specific kernels |
| NVFP4 kernels | kernels/linear/nvfp4/ | NVFP4-specific kernels |
| Online quantization | layers/quantization/online/ | Online quantization implementation |