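For the standard procedure, see the previous article.
This article covers a different route for adding a quantization strategy: unconventional, but very simple.
As the previous article showed, adding a custom quantization strategy normally means subclassing Modifier and implementing the whole flow yourself, including the scale computation. Overall, though, quantization granularity comes down to just a few kinds:
- group
- block
- per-channel
- per-token
- per-tensor
All of these already exist in llm-compressor; they are just defined in a separate dependency package, compressed-tensors:
compressed-tensors\src\compressed_tensors\quantization\quant_scheme.py
The file is too long to paste in full, so here are three examples: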
# 4 bit integer weights only quantization
W4A16 = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.GROUP,
        group_size=128,
        symmetric=True,
        dynamic=False,
    ),
)
# 4 bit integer weights only asymmetric quantization
W4A16_ASYM = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.GROUP,
        group_size=128,
        symmetric=False,
        dynamic=False,
    ),
)
FP8_BLOCK = dict(
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.BLOCK,
        symmetric=True,
        dynamic=False,
        block_structure=[128, 128],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        group_size=128,
    ),
)
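To make the difference between the granularities concrete, here is a minimal pure-Python sketch of how many scales each weight strategy implies for a 2-D weight. The function name `weight_scale_shape` and the string strategy names are illustrative assumptions, not part of compressed-tensors, and the exact scale layout the library uses may differ. Per-token is left out because it applies to activations (one scale per token) rather than to weights.

```python
import math


def weight_scale_shape(strategy, rows, cols, group_size=None, block=None):
    """Return the assumed shape of the scale tensor for a (rows, cols) weight.

    Hypothetical helper for illustration only; not a compressed-tensors API.
    """
    if strategy == "tensor":
        # One scale shared by the whole tensor.
        return (1,)
    if strategy == "channel":
        # One scale per output channel (row).
        return (rows, 1)
    if strategy == "group":
        # One scale per run of `group_size` consecutive input columns in each row.
        return (rows, math.ceil(cols / group_size))
    if strategy == "block":
        # One scale per (block_height, block_width) tile.
        block_h, block_w = block
        return (math.ceil(rows / block_h), math.ceil(cols / block_w))
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    rows, cols = 4096, 4096
    print("tensor  ", weight_scale_shape("tensor", rows, cols))
    print("channel ", weight_scale_shape("channel", rows, cols))
    print("group   ", weight_scale_shape("group", rows, cols, group_size=128))
    print("block   ", weight_scale_shape("block", rows, cols, block=(128, 128)))
```

The finer the granularity, the more scales have to be stored and applied at inference time, which is the trade-off behind choosing between group and block layouts.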
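So what happens if we simply put together a new QuantizationArgs ourselves?
QuantizationArgs(
    num_bits=4,
    type=QuantizationType.INT,
    strategy=QuantizationStrategy.GROUP,
    group_size=128,
    symmetric=True,
    dynamic=False,
)
It turns out that this works.
The rest of this article walks through how to use this approach to implement the int4_block quantization from the previous article.
1. Define the new QuantizationArgs
There are two ways to do this; pick whichever you prefer.
Option 1: copy every field from W4A16 and change only the grouping strategy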
python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)
from compressed_tensors.quantization.quant_scheme import W4A16

W4A16_BLOCK = dict(
    weights=QuantizationArgs(
        **{
            **W4A16["weights"].model_dump(),  # inherit every field from W4A16
            "strategy": QuantizationStrategy.BLOCK,
            "group_size": None,  # clear the group setting
            "block_structure": [16, 16],  # add the block setting
        },
    ),
)
Option 2: write the new configuration out directly (clearer)
python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)

W4A16_BLOCK = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[16, 16],
        symmetric=True,
        dynamic=False,
    ),
)
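This is far simpler than subclassing Modifier and implementing the whole flow by hand.
2. Register the new scheme
This is the most important step: register the custom scheme in quant_scheme.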
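To illustrate what `block_structure=[16, 16]` with symmetric int4 means numerically, here is a simplified pure-Python sketch that computes one scale per 16x16 tile and rounds each weight to the int4 range. It is only an illustration under stated assumptions: the function `quantize_block_int4` is hypothetical, the scale is taken as `absmax / 7`, and the formula compressed-tensors actually uses (as well as its observers and rounding details) may differ.

```python
def quantize_block_int4(weight, block=(16, 16)):
    """Symmetric int4 quantization with one scale per tile (illustrative only).

    weight: list of rows, each a list of floats.
    Returns (q, scales): integer codes in [-8, 7] and a 2-D list of tile scales.
    """
    block_h, block_w = block
    rows, cols = len(weight), len(weight[0])
    q = [[0] * cols for _ in range(rows)]
    scales = []
    for r0 in range(0, rows, block_h):
        scale_row = []
        r1 = min(r0 + block_h, rows)
        for c0 in range(0, cols, block_w):
            c1 = min(c0 + block_w, cols)
            # Largest magnitude inside this tile decides the tile's scale.
            absmax = max(
                abs(weight[r][c]) for r in range(r0, r1) for c in range(c0, c1)
            )
            # Assumption: map absmax to 7; an all-zero tile gets a dummy scale.
            scale = absmax / 7 if absmax > 0 else 1.0
            for r in range(r0, r1):
                for c in range(c0, c1):
                    code = round(weight[r][c] / scale)
                    q[r][c] = max(-8, min(7, code))  # clamp to the int4 range
            scale_row.append(scale)
        scales.append(scale_row)
    return q, scales


if __name__ == "__main__":
    # A 32x32 toy weight, so a [16, 16] block layout yields a 2x2 grid of scales.
    toy = [[(r - c) / 10.0 for c in range(32)] for r in range(32)]
    codes, tile_scales = quantize_block_int4(toy)
    print(len(tile_scales), "x", len(tile_scales[0]), "scales")
```

Dequantization in this sketch is simply `codes[r][c] * tile_scales[r // 16][c // 16]`, so the rounding error of each weight is bounded by half of its tile's scale.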
python
from compressed_tensors.quantization import quant_scheme

# Register only once, so importing the module again does not overwrite the entry
if "W4A16_BLOCK" not in quant_scheme.PRESET_SCHEMES:
    quant_scheme.PRESET_SCHEMES["W4A16_BLOCK"] = W4A16_BLOCK
    print("[register_block_scheme] W4A16_BLOCK registered")
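The registration pattern itself can be sketched without the library. In the snippet below a plain dict stands in for `quant_scheme.PRESET_SCHEMES`, and `register_scheme` and `lookup_scheme` are hypothetical helpers. The sketch assumes that lookups normalize the scheme name to upper case and hand back a copy of the stored entry, which is my understanding of the library's behavior and worth verifying against the installed version.

```python
from copy import deepcopy

# Stand-in for quant_scheme.PRESET_SCHEMES (assumption: a dict keyed by name).
PRESET_SCHEMES = {
    "W4A16": {"weights": {"num_bits": 4, "strategy": "group", "group_size": 128}},
}


def register_scheme(name, scheme, registry=PRESET_SCHEMES):
    """Add a scheme under an upper-case key; return False if it already exists."""
    key = name.upper()
    if key in registry:
        return False
    registry[key] = scheme
    return True


def lookup_scheme(name, registry=PRESET_SCHEMES):
    """Case-insensitive lookup that returns a copy, so callers cannot mutate the preset."""
    key = name.upper()
    if key not in registry:
        raise KeyError(f"unknown preset scheme: {name}")
    return deepcopy(registry[key])


if __name__ == "__main__":
    block_scheme = {
        "weights": {"num_bits": 4, "strategy": "block", "block_structure": [16, 16]}
    }
    print(register_scheme("W4A16_BLOCK", block_scheme))  # first registration
    print(register_scheme("W4A16_BLOCK", block_scheme))  # second one is a no-op
    print(lookup_scheme("w4a16_block")["weights"]["strategy"])
```

Under that assumption, it is safest to register the key in upper case, as the article's `"W4A16_BLOCK"` already does.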
Complete code, register_custom_scheme.py:
python
"""Import this module to register W4A16_BLOCK preset scheme."""
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)
from compressed_tensors.quantization import quant_scheme

W4A16_BLOCK = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[16, 16],
        symmetric=True,
        dynamic=False,
    ),
)

if "W4A16_BLOCK" not in quant_scheme.PRESET_SCHEMES:
    quant_scheme.PRESET_SCHEMES["W4A16_BLOCK"] = W4A16_BLOCK
    print("[register_block_scheme] W4A16_BLOCK registered")
At quantization time, just import register_custom_scheme.py so that the registration runs; the scheme can then be used by name:
python
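from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import register_custom_scheme  # registers the custom W4A16_BLOCK scheme on import
recipe = QuantizationModifier(targets="Linear", scheme="W4A16_BLOCK", ignore=["lm_head"])
oneshot(model=model, recipe=recipe, pipeline="datafree")  # model: an already loaded model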
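Much simpler, isn't it?
Note, however, that this only covers composing the quantization scheme on the llm-compressor side. To run inference in vLLM, you still need to add the matching scheme dispatch route on the vLLM side; see the previous article, "Custom INT4 Block Quantization: A Complete Walkthrough from llm-compressor to vLLM" (CSDN blog).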
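As a rough picture of what such a dispatch route does, the following self-contained sketch picks an inference path by inspecting the weight quantization arguments. Everything in it is hypothetical: `WeightQuantArgs`, `select_inference_scheme`, and the returned kernel names are made up for illustration and are not vLLM APIs. The real routing code is described in the previous article.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WeightQuantArgs:
    """Hypothetical mirror of the fields a checkpoint's quantization config carries."""

    num_bits: int
    type: str  # "int" or "float"
    strategy: str  # "group", "block", "channel", ...
    group_size: Optional[int] = None
    block_structure: Optional[Sequence[int]] = None


def select_inference_scheme(weights: WeightQuantArgs) -> str:
    """Return the name of a (made-up) kernel path for the given weight args."""
    if weights.type == "int" and weights.num_bits == 4:
        if weights.strategy == "group":
            return "int4_group_kernel"
        if weights.strategy == "block":
            # This is the branch a new W4A16_BLOCK scheme would need.
            return "int4_block_kernel"
    if weights.type == "float" and weights.num_bits == 8 and weights.strategy == "block":
        return "fp8_block_kernel"
    raise NotImplementedError(f"no inference scheme for {weights}")


if __name__ == "__main__":
    args = WeightQuantArgs(num_bits=4, type="int", strategy="block", block_structure=(16, 16))
    print(select_inference_scheme(args))
```

The point of the sketch is that a combination the inference engine has never seen falls through to the error branch, which is why registering the scheme in llm-compressor alone is not enough.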