1 SpQR Quantization in Practice
Packages: install the pinned versions by following the GitHub instructions:

```bash
pip install -r requirements.txt
```
Environment: torch 1.13, CUDA 11.7. Check the CUDA version with `import torch; print(torch.version.cuda)`.
Datasets and tokenizer: the SpQR scripts download the tokenizer and datasets they need automatically and store them in the Hugging Face cache.
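If the default cache location is inconvenient, the standard Hugging Face `HF_HOME` environment variable relocates it; a minimal sketch (the path is illustrative):

```python
import os

# Must be set before importing transformers/datasets; the path is illustrative.
os.environ["HF_HOME"] = "/data/hf_cache"
```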
Error 1: `ValueError: Invalid pattern: '**' can only be an entire path component`. Root cause: an incompatibility between the installed versions of datasets, huggingface-hub, and fsspec. datasets 2.19.1 fixed this by raising the minimum requirement to huggingface-hub >= 0.21.2 ("Bump huggingface-hub lower version to 0.21.2", #6713). Upgrading datasets resolves it.
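A quick way to confirm that the installed versions satisfy the fixed constraints:

```python
import datasets
import huggingface_hub

print(datasets.__version__)         # want >= 2.19.1
print(huggingface_hub.__version__)  # want >= 0.21.2
```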
Error 2: `[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed`. Workaround: disable certificate verification in requests by patching the installed package:
- python3.8/site-packages/requests/adapters.py: `verify=True` => `verify=False`
- python3.8/site-packages/requests/sessions.py: `self.verify = True` => `self.verify = False`, `verify=True` => `verify=False`
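Editing site-packages works but is easy to forget about; an equivalent runtime monkey-patch (a sketch that relies only on the public `verify` keyword of `requests.Session.request`) avoids modifying installed files:

```python
import requests
import urllib3

# Silence the InsecureRequestWarning lines shown in the log below.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

_orig_request = requests.Session.request

def _unverified_request(self, *args, **kwargs):
    kwargs["verify"] = False  # force-disable TLS certificate checks
    return _orig_request(self, *args, **kwargs)

requests.Session.request = _unverified_request
```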
With certificate verification disabled, the run completes normally; the warning logs can be ignored:
```bash
============ Evaluating perplexity... ============
/root/anaconda3/envs/ly_spqr_p38/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'proxyhk.huawei.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
```
Model: the LLaMA, Falcon, and OPT families are supported.
Data: use a dataset matching the model's training data; for LLaMA, use RedPajama. Calibration files already ship with the repo (a quick inspection sketch follows the list):
- data/red_pajama_n=1024.pth
- data/refined_web_n=128.pth
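These `.pth` files hold pre-tokenized calibration samples; a quick way to inspect one (a sketch, assuming standard torch serialization):

```python
import torch

# Inspect the bundled RedPajama calibration file (path relative to the repo root).
data = torch.load("data/red_pajama_n=1024.pth")
print(type(data), len(data))
```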
Run SpQR quantization:
```bash
export MODEL_PATH=/home/liyan/llm_datas/models/llama-7b
export DATASET=pajama
python main.py $MODEL_PATH $DATASET \
--wbits 4 \
--groupsize 16 \
--perchannel \
--qq_scale_bits 3 \
--qq_zero_bits 3 \
--qq_groupsize 16 \
--outlier_threshold=0.2 \
--permutation_order act_order \
--percdamp 1e0 \
--nsamples 128 \
--save /home/liyan/LLM/spqr/SpQR/output/
```
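As a rough sanity check on the memory budget of this configuration (a back-of-the-envelope sketch: outlier storage is ignored and the second-level statistics are assumed to be stored in fp16):

```python
# First level: 4-bit weights, one quantized scale + zero per group of 16.
wbits, groupsize = 4, 16
# Second level: scales/zeros themselves quantized to 3 bits in groups of 16.
qq_bits, qq_groupsize = 3, 16

bits_per_weight = (
    wbits
    + 2 * qq_bits / groupsize                # 3-bit scale + 3-bit zero per group
    + 2 * 16 / (groupsize * qq_groupsize)    # assumed fp16 second-level scale + zero
)
print(f"~{bits_per_weight:.3f} bits per weight")  # ~4.5 before outliers
```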
The results are shown below: quantization increases wikitext2 perplexity by only ~0.05 and PTB perplexity by ~0.25.
```text
base:
wikitext2 perplexity = 5.6771
ptb perplexity = 27.3401
quantization:
wikitext2 perplexity = 5.7282
ptb perplexity = 27.5875
```
LM Evaluation Harness benchmark
The SpQR repo bundles the Language Model Evaluation Harness; install it per the instructions below. The entry point is lmeval.py, which currently supports quantized LLaMA/Falcon models only.
```bash
pip install -r lm-evaluation-harness/requirements.txt
```
Run the evaluation as follows:
```bash
export CUDA_VISIBLE_DEVICES=2
export MODEL_PATH=/home/liyan/llm_datas/models/llama-7b
export DATASET=pajama
python lmeval.py \
--model hf-causal \
--model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \
--quantization_args dataset=$DATASET,wbits=4,groupsize=16,perchannel=True,qq_scale_bits=3,qq_zero_bits=3,qq_groupsize=16,percdamp=1.0,outlier_threshold=0.2,simplified_outliers=False,nsamples=128,offload_activations=True \
--tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \
--batch_size 1
```
2 Source Code Walkthrough
Let's walk through the SpQR implementation, starting from the entry point main.py:quantize_model:
```python
def quantize_model(model, args, device):
    """main entry point to functions for model quantization"""
    tick = time.time()
    if args.wbits == 16:
        print("not quantizing the model with args.wbits=16", flush=True)
        results = None, args.wbits
    elif args.nearest:
        results = quantize_nearest(model, args, device)  # round-to-nearest (RTN) baseline
    else:
        print("Loading data ...")
        ...
        results = quantize_spqr(model, dataloader, args, device)  # SpQR quantization
    print(f"quantization time: {time.time() - tick:.1f}")
    return results
```
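For contrast with SpQR, round-to-nearest (RTN) quantizes each weight independently, with no Hessian-based error compensation. A minimal per-channel sketch (a hypothetical helper, not the repo's quantize_nearest):

```python
import torch

def rtn_quantize(weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Asymmetric per-output-channel round-to-nearest quantize + dequantize."""
    maxq = 2**bits - 1
    wmin = weight.min(dim=1, keepdim=True).values.clamp(max=0.0)
    wmax = weight.max(dim=1, keepdim=True).values.clamp(min=0.0)
    scale = (wmax - wmin).clamp(min=1e-9) / maxq
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(weight / scale) + zero, 0, maxq)
    return scale * (q - zero)

w = torch.randn(8, 16)
print((w - rtn_quantize(w)).abs().mean())  # mean round-off error
```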
The core of SpQR lives in the quantize method:
```python
def quantize(
    self,
    *,
    bits: int = 2,
    blocksize: int = 128,
    percdamp: float = 1e-2,
    groupsize: Optional[int] = None,
    keep_last_columns: int = 0,
    outlier_relative_threshold: float = float("inf"),
    permutation_order: Union[str, torch.Tensor] = "identity",
    keep_H: bool = True,
    simplified_outliers: bool = False,
    verbose=True,
    perchannel: bool = True,
    sym: bool = False,
    save_quantization: bool = False,
    **kwargs,
) -> QuantizationResult:
    ...  # setup (permutation, Hessian inverse Cholesky, buffers) elided in this excerpt
    for block_start in block_start_iter:  ## iterate over column blocks
        block_end = min(block_start + blocksize, in_dim)
        for column_index in range(block_start, block_end):
            if column_index % groupsize == 0:
                # fit weight quantizer on the upcoming group of weight columns (inputs), across all rows (outputs)
                in_group_index += 1
                group_weight = weight[:, column_index : column_index + groupsize]
                if simplified_outliers or (unstructured_outlier_threshold == float("inf")):
                    quantizer.find_params(group_weight, weight=True)
                else:
                    # objective: detect which weights will be designated as outliers, fit quantizer *without* these weights
                    # step 1: fit quantizer on a leave-one-out version of weights, i.e. in each group, drop one weight at a time
                    assert perchannel, "refitting quantizer is only implemented for perchannel=True"
                    group_diag_hessian_inv_cho = H_inv_cho_diag[column_index : column_index + groupsize]
                    loo_quantization_error_sq = get_leave_one_out_error(
                        group_weight, group_diag_hessian_inv_cho, bits=bits, sym=sym
                    )
                    # ^-- dequantized(quantized(group_weight)) using a quantizer trained on all weights except the reconstructed one
                    likely_unstructured_outlier_mask = (
                        loo_quantization_error_sq > unstructured_outlier_threshold
                    ).float()  ## provisional (likely) outlier mask
                    non_outlier_mask = 1 - likely_unstructured_outlier_mask
                    mean_over_non_outliers = torch.sum(
                        group_weight * non_outlier_mask, dim=1, keepdim=True
                    ) / torch.sum(non_outlier_mask, dim=1, keepdim=True).clamp_min(1)
                    group_weight_without_outliers = group_weight * non_outlier_mask + mean_over_non_outliers * (
                        1 - non_outlier_mask
                    )
                    quantizer.find_params(group_weight_without_outliers, weight=True)  ## refit quantization params with outliers excluded
                    del group_diag_hessian_inv_cho, loo_quantization_error_sq
                    del mean_over_non_outliers, group_weight_without_outliers, non_outlier_mask
            weight_quant_i = quantize(
                weight[:, column_index].unsqueeze(1), quantizer.scale, quantizer.zero, quantizer.maxq
            )
            weight_i_quantized = dequantize(weight_quant_i, quantizer.scale, quantizer.zero).reshape_as(
                weight[:, column_index]
            )
            delta_weight_i = weight[:, column_index] - weight_i_quantized  # [out_dim]
            quantization_errors[:, column_index] = (
                delta_weight_i / H_inv_cho[column_index, column_index]
            )  # [out_dim]
            if unstructured_outlier_threshold != float("inf"):
                unstructured_outlier_mask[:, column_index] = (
                    quantization_errors[:, column_index].square() > unstructured_outlier_threshold
                )  ## final unstructured outlier mask
                # re-quantize without outliers
                is_outlier = unstructured_outlier_mask[:, column_index].float()
                weight_quant_i = quantize(
                    (weight[:, column_index] * (1 - is_outlier)).unsqueeze(1),
                    quantizer.scale,
                    quantizer.zero,
                    quantizer.maxq,
                )
                weight_i_quantized_wo_outliers = dequantize(
                    weight_quant_i, quantizer.scale, quantizer.zero
                ).reshape_as(weight[:, column_index])
                weight_i_quantized = (
                    weight_i_quantized_wo_outliers * (1 - is_outlier) + weight[:, column_index] * is_outlier
                )  # [out_dim]
                delta_weight_i = weight[:, column_index] - weight_i_quantized  # [out_dim]
                quantization_errors[:, column_index] = (
                    delta_weight_i / H_inv_cho[column_index, column_index]
                )  # [out_dim]
            weight[:, column_index] = weight_i_quantized
            weight[:, column_index + 1 : block_end].addr_(
                quantization_errors[:, column_index],
                H_inv_cho[column_index, column_index + 1 : block_end],
                alpha=-1,
            )
        ## propagate the block's quantization errors to the remaining columns weight[:, block_end:]
        weight[:, block_end:].addmm_(
            quantization_errors[:, block_start:block_end],
            H_inv_cho[block_start:block_end, block_end:],
            alpha=-1,
        )
```
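The two in-place updates at the end implement GPTQ-style error feedback: `addr_` subtracts the rank-1 outer product of the current column's scaled error with the matching Cholesky row from the not-yet-quantized columns inside the block, and `addmm_` then pushes the accumulated block errors onto all columns after the block. A toy sketch of the rank-1 update semantics (shapes and values are illustrative):

```python
import torch

out_dim, block = 4, 6
weight = torch.randn(out_dim, block)
err = torch.randn(out_dim)      # plays the role of quantization_errors[:, j]
h_row = torch.randn(block - 1)  # plays the role of H_inv_cho[j, j+1:block_end]

# Equivalent formulations of the compensation step:
expected = weight[:, 1:] - torch.outer(err, h_row)
weight[:, 1:].addr_(err, h_row, alpha=-1)  # in-place rank-1 update
assert torch.allclose(weight[:, 1:], expected)
```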