Adding a new quantization strategy to llm-compressor -- the unorthodox shortcut

For the conventional steps, see:

Custom INT4 Block Quantization: A Complete Walkthrough from llm-compressor to vLLM (CSDN blog)

This article covers a different route for adding a quantization strategy: unconventional, but remarkably simple.

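From the previous article we know that adding a custom quantization strategy means subclassing Modifier and implementing the whole pipeline yourself, including the scale computation. Broadly speaking, though, quantization granularity comes in only a handful of flavors: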

group (per-group)
block (per-block)
per-channel
per-token
per-tensor
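As a rough illustration of how these granularities differ, the sketch below computes how many scale values each one would need for a 2-D tensor. The function name `scale_shape` and the string keys are made up for this example; the shapes follow the usual convention and are not taken from the library's internal layout.

```python
import math


def scale_shape(strategy, rows, cols, group_size=None, block_structure=None):
    """Return the shape of the scale tensor for a (rows, cols) matrix.

    Hypothetical helper for illustration only, not part of compressed-tensors.
    """
    if strategy == "tensor":
        # One scale shared by the whole tensor.
        return (1,)
    if strategy in ("channel", "token"):
        # One scale per row: an output channel for weights,
        # or a token for activations.
        return (rows, 1)
    if strategy == "group":
        # One scale per run of `group_size` consecutive columns in each row.
        return (rows, math.ceil(cols / group_size))
    if strategy == "block":
        # One scale per 2-D tile of block_rows x block_cols elements.
        block_rows, block_cols = block_structure
        return (math.ceil(rows / block_rows), math.ceil(cols / block_cols))
    raise ValueError(f"unknown strategy: {strategy}")


for name, kwargs in [
    ("tensor", {}),
    ("channel", {}),
    ("group", {"group_size": 128}),
    ("block", {"block_structure": [16, 16]}),
]:
    print(name, scale_shape(name, 4096, 4096, **kwargs))
```

The finer the granularity, the more scales are stored, which is the trade-off each preset is choosing.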

All of these already exist in llm-compressor; they are just defined in a separate dependency, compressed-tensors:

compressed-tensors\src\compressed_tensors\quantization\quant_scheme.py

The file is too long to paste in full, so here are a few examples:


# 4 bit integer weights only quantization
W4A16 = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.GROUP,
        group_size=128,
        symmetric=True,
        dynamic=False,
    ),
)

# 4 bit integer weights only asymmetric quantization
W4A16_ASYM = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.GROUP,
        group_size=128,
        symmetric=False,
        dynamic=False,
    ),
)

FP8_BLOCK = dict(
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.BLOCK,
        symmetric=True,
        dynamic=False,
        block_structure=[128, 128],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        group_size=128,
    ),
)

So what happens if we simply assemble a new QuantizationArgs ourselves? Does that work?

QuantizationArgs(
    num_bits=4,
    type=QuantizationType.INT,
    strategy=QuantizationStrategy.GROUP,
    group_size=128,
    symmetric=True,
    dynamic=False,
)

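The answer: it really does.

The rest of this article shows how to use this approach to implement the int4_block quantization from the previous article.

1. Define a new QuantizationArgs

There are two ways to do this; pick whichever you prefer.

Option 1: copy every field of W4A16 and change only the grouping strategy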


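from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)
# W4A16 is the preset dict defined in quant_scheme.py (shown above)
from compressed_tensors.quantization.quant_scheme import W4A16

W4A16_BLOCK = dict(
    weights=QuantizationArgs(
        **{**W4A16["weights"].model_dump(),  # inherit every field from W4A16
           "strategy": QuantizationStrategy.BLOCK,
           "group_size": None,                 # clear the group setting
           "block_structure": [16, 16]},       # add the block setting
    ),
)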
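The `{**base, "key": value}` idiom in Option 1 relies on later keys overriding earlier ones when a dict is built. A minimal sketch with plain dicts (the field values stand in for what `model_dump()` might return and are assumptions for illustration):

```python
# Stand-in for the dumped W4A16 weight arguments (illustrative values only).
base = {
    "num_bits": 4,
    "strategy": "group",
    "group_size": 128,
    "symmetric": True,
}

# Later keys win, so only the listed fields change; the rest are inherited.
merged = {
    **base,
    "strategy": "block",
    "group_size": None,
    "block_structure": [16, 16],
}

print(merged)
# The original dict is left untouched because a new dict is built.
print(base["strategy"])
```

Because the merge creates a new dict, the existing W4A16 preset is not modified by this pattern.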

Option 2: write the new configuration out directly (clearer)


from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)

W4A16_BLOCK = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[16, 16],
        symmetric=True,
        dynamic=False,
    ),
)

Much simpler, isn't it? It sidesteps the whole business of subclassing Modifier.
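To make concrete what a symmetric, static INT4 block configuration asks for, here is a pure-Python sketch that computes one scale per tile and rounds the weights into the signed 4-bit range. It assumes the common convention scale = max|w| / q_max with q_max = 7; the library's own observer may use a slightly different formula, and the helper names are hypothetical.

```python
def block_scales(weight, block_rows, block_cols, num_bits=4):
    """One symmetric scale per (block_rows x block_cols) tile of a 2-D list."""
    q_max = 2 ** (num_bits - 1) - 1  # 7 for INT4
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_scales = []
        for c0 in range(0, n_cols, block_cols):
            block_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            # Avoid a zero scale for all-zero tiles.
            row_scales.append(block_max / q_max if block_max > 0 else 1.0)
        scales.append(row_scales)
    return scales


def quantize_blockwise(weight, scales, block_rows, block_cols, num_bits=4):
    """Divide each element by its tile's scale, round, and clamp to INT range."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [
        [
            max(q_min, min(q_max, round(w / scales[r // block_rows][c // block_cols])))
            for c, w in enumerate(row)
        ]
        for r, row in enumerate(weight)
    ]


# Tiny 4x4 example with 2x2 tiles (the article's scheme uses 16x16).
weight = [
    [7.0, -3.0, 3.5, 0.5],
    [1.0, 0.0, -1.0, 1.5],
    [0.0, 0.0, 14.0, -14.0],
    [0.0, 0.0, 6.0, 2.0],
]
scales = block_scales(weight, 2, 2)
quantized = quantize_blockwise(weight, scales, 2, 2)
print(scales)
print(quantized)
```

Each tile gets its own scale, so a large outlier in one tile does not cost precision in its neighbors.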

2. Register the new scheme

This is the most important step: register the custom scheme in quant_scheme.


from compressed_tensors.quantization import quant_scheme

if "W4A16_BLOCK" not in quant_scheme.PRESET_SCHEMES:
    quant_scheme.PRESET_SCHEMES["W4A16_BLOCK"] = W4A16_BLOCK

    print("[register_block_scheme] W4A16_BLOCK registered")
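The registration works because presets are looked up by name in a module-level dict. The sketch below mimics that pattern with a stand-in registry; `PRESETS`, `register_scheme`, and `resolve_scheme` are hypothetical names for illustration and are not the compressed-tensors API.

```python
# Stand-in for a module-level preset table (illustrative contents only).
PRESETS = {"W4A16": {"strategy": "group", "group_size": 128}}


def register_scheme(name, scheme, overwrite=False):
    """Add a preset under an upper-cased name; return True if it was added."""
    key = name.upper()
    if key in PRESETS and not overwrite:
        return False  # keep the existing entry, as the guard above does
    PRESETS[key] = scheme
    return True


def resolve_scheme(name):
    """Look a preset up by name, as a recipe would when given scheme='...'."""
    key = name.upper()
    if key not in PRESETS:
        raise KeyError(f"unknown preset scheme: {name}")
    return PRESETS[key]


added = register_scheme("W4A16_BLOCK", {"strategy": "block", "block_structure": [16, 16]})
added_again = register_scheme("W4A16_BLOCK", {"strategy": "tensor"})
print(added, added_again)
print(resolve_scheme("w4a16_block"))
```

Two practical consequences follow from this pattern: the registration has to run in the same Python process as the quantization, and it has to run before the scheme name is resolved.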

Full code, register_custom_scheme.py:


"""Import this module to register W4A16_BLOCK preset scheme."""

from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)
from compressed_tensors.quantization import quant_scheme


W4A16_BLOCK = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[16, 16],
        symmetric=True,
        dynamic=False,
    ),
)


if "W4A16_BLOCK" not in quant_scheme.PRESET_SCHEMES:
    quant_scheme.PRESET_SCHEMES["W4A16_BLOCK"] = W4A16_BLOCK

    print("[register_block_scheme] W4A16_BLOCK registered")

At quantization time, just import register_custom_scheme.py so the registration runs, and the scheme is ready to use:


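from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

import register_custom_scheme  # noqa: F401  (importing registers W4A16_BLOCK)

# `model` is assumed to have been loaded earlier
recipe = QuantizationModifier(targets="Linear", scheme="W4A16_BLOCK", ignore=["lm_head"])
oneshot(model=model, recipe=recipe, pipeline="datafree")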

That really is a lot simpler.

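Note, however, that this only covers composing the quantization scheme on the llm-compressor side. To run inference in vLLM, you still need to add the matching scheme dispatch route on the vLLM side; see the previous article, Custom INT4 Block Quantization: A Complete Walkthrough from llm-compressor to vLLM (CSDN blog).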