LLM Series: SpQR in Practice and Source Code Walkthrough

1 SpQR Quantization in Practice

Packages: install the pinned package versions by following the instructions in the GitHub repository:

pip install -r requirements.txt

Environment used here: torch 1.13, CUDA 11.7. To check the CUDA version: import torch; print(torch.version.cuda)
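Before running the scripts it can help to confirm which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `report_versions` and the choice of packages to inspect are assumptions made for illustration, not part of the SpQR repository.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages relevant to this setup; adjust the list as needed.
    wanted = ["torch", "datasets", "huggingface-hub", "fsspec"]
    for name, version in report_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Reading the metadata avoids importing torch just to print a version, so the check also works in an environment where the install is broken.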

Datasets and tokenizer: the SpQR scripts download the required tokenizer and datasets and cache them locally (Hugging Face cache).

Error 1: ValueError: Invalid pattern: '**' can only be an entire path component. Root cause: an incompatibility between the installed versions of datasets, huggingface-hub, and fsspec. datasets 2.19.1 fixed the minimum requirement at huggingface-hub >= 0.21.2 ("Bump huggingface-hub lower version to 0.21.2", #6713). Upgrading datasets resolves the error.
Error 2: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed. Fix: turn off certificate verification in the installed requests package. In python3.8/site-packages/requests/adapters.py, change verify=True to verify=False. In python3.8/site-packages/requests/sessions.py, change self.verify = True to self.verify = False and verify=True to verify=False.
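A different technique from the one described above, which avoids editing files under site-packages, is to point requests at a CA bundle that contains the proxy's root certificate. requests reads the `REQUESTS_CA_BUNDLE` environment variable when it trusts the environment. The sketch below assumes the download path goes through requests and that such a bundle exists; the helper `use_ca_bundle` and the file path are hypothetical.

```python
import os


def use_ca_bundle(bundle_path):
    """Point requests at a custom CA bundle; return True if it was applied."""
    if not os.path.isfile(bundle_path):
        return False
    os.environ["REQUESTS_CA_BUNDLE"] = bundle_path
    return True


# Hypothetical path: replace with the bundle that contains your proxy's root CA.
applied = use_ca_bundle("/etc/ssl/certs/corporate-ca-bundle.pem")
print("CA bundle applied" if applied else "CA bundle not found; nothing changed")
```

The variable must be set before the first HTTPS request is made, for example at the top of the launch script or with `export` in the shell.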

With certificate verification disabled, the run completes normally. The warnings in the log can be ignored:

============ Evaluating perplexity... ============
/root/anaconda3/envs/ly_spqr_p38/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'proxyhk.huawei.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  warnings.warn(

Model: the LLaMA, Falcon, and OPT families are supported.

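Data: calibrating on the dataset the model was trained on is recommended. For LLaMA models use RedPajama, which is already included in the repository: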

  • data/red_pajama_n=1024.pth
  • data/refined_web_n=128.pth

Running SpQR quantization

export MODEL_PATH=/home/liyan/llm_datas/models/llama-7b
export DATASET=pajama

python main.py $MODEL_PATH $DATASET \
    --wbits 4 \
    --groupsize 16 \
    --perchannel \
    --qq_scale_bits 3 \
    --qq_zero_bits 3 \
    --qq_groupsize 16 \
    --outlier_threshold=0.2 \
    --permutation_order act_order \
    --percdamp 1e0 \
    --nsamples 128 \
    --save /home/liyan/LLM/spqr/SpQR/output/
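To get a feel for what the flags above cost in storage, here is a back-of-the-envelope estimate of the average bits per weight. It is a rough sketch under stated assumptions, not the repository's own accounting: each group of `groupsize` weights carries one scale and one zero quantized to `qq_scale_bits` and `qq_zero_bits`; those statistics are themselves grouped by `qq_groupsize`, each second-level group storing a 16-bit scale and zero for both quantities; and each outlier costs a fixed number of bits (32 is assumed here). The function name and the `outlier_rate` input are hypothetical.

```python
def estimate_avg_bits(wbits, groupsize, qq_scale_bits, qq_zero_bits, qq_groupsize,
                      outlier_rate=0.0, bits_per_outlier=32.0, stat_bits=16):
    """Rough average storage cost per weight, in bits (assumption-laden estimate)."""
    # First level: one quantized scale and one quantized zero per weight group.
    first_level = (qq_scale_bits + qq_zero_bits) / groupsize
    # Second level: scales and zeros are each quantized in groups of qq_groupsize,
    # and each such group keeps its own scale and zero in stat_bits precision.
    second_level = 2 * 2 * stat_bits / (groupsize * qq_groupsize)
    # Outliers: assumed flat cost per outlier weight.
    outliers = bits_per_outlier * outlier_rate
    return wbits + first_level + second_level + outliers


# Settings from the command above, ignoring outliers.
print(estimate_avg_bits(wbits=4, groupsize=16, qq_scale_bits=3, qq_zero_bits=3,
                        qq_groupsize=16))
```

The actual outlier rate depends on `--outlier_threshold` and on the model, so it has to be measured from a run rather than guessed.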

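The results are below. Perplexity rises only slightly after quantization: by about 0.05 on wikitext2 and about 0.25 on ptb.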

base:
wikitext2 perplexity = 5.6771
ptb perplexity = 27.3401

quantization:
wikitext2 perplexity = 5.7282
ptb perplexity = 27.5875
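The change in perplexity can be computed directly from the numbers above. The snippet below is a small illustrative helper (the name `perplexity_change` is hypothetical) that reports the absolute and relative increase per dataset.

```python
base = {"wikitext2": 5.6771, "ptb": 27.3401}
quantized = {"wikitext2": 5.7282, "ptb": 27.5875}


def perplexity_change(base, quantized):
    """Return {dataset: (absolute increase, relative increase in percent)}."""
    changes = {}
    for name, base_ppl in base.items():
        delta = quantized[name] - base_ppl
        changes[name] = (delta, 100.0 * delta / base_ppl)
    return changes


for name, (delta, percent) in perplexity_change(base, quantized).items():
    print(f"{name}: +{delta:.4f} ({percent:.2f}%)")
```

In relative terms the degradation is about 0.9% on both datasets, which is a fairer comparison than the absolute gap because the ptb baseline is much higher.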

LM Evaluation Harness benchmark

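The SpQR repository bundles the Language Model Evaluation Harness as its evaluation framework. Install it by following the instructions. The entry point is lmeval.py, which currently supports only LLaMA/Falcon quantization.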

pip install -r lm-evaluation-harness/requirements.txt

Run the evaluation as follows:

export CUDA_VISIBLE_DEVICES=2

export MODEL_PATH=/home/liyan/llm_datas/models/llama-7b
export DATASET=pajama

python lmeval.py \
    --model hf-causal \
    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \
    --quantization_args dataset=$DATASET,wbits=4,groupsize=16,perchannel=True,qq_scale_bits=3,qq_zero_bits=3,qq_groupsize=16,percdamp=1.0,outlier_threshold=0.2,simplified_outliers=False,nsamples=128,offload_activations=True \
    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \
    --batch_size 1
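The `--quantization_args` value above is a single comma-separated `key=value` string, which is easy to mistype. A minimal sketch for generating it from a dictionary follows; the helper `format_quantization_args` is hypothetical and only reproduces the string format seen in the command.

```python
def format_quantization_args(options):
    """Join options into the comma-separated key=value form used on the command line."""
    return ",".join(f"{key}={value}" for key, value in options.items())


options = {
    "dataset": "pajama",
    "wbits": 4,
    "groupsize": 16,
    "perchannel": True,
    "qq_scale_bits": 3,
    "qq_zero_bits": 3,
    "qq_groupsize": 16,
    "percdamp": 1.0,
    "outlier_threshold": 0.2,
    "simplified_outliers": False,
    "nsamples": 128,
    "offload_activations": True,
}
print(format_quantization_args(options))
```

Keeping the options in one dictionary also makes it straightforward to reuse the same settings for the quantization run and the evaluation run.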

2 Source Code Walkthrough

This section reads through the SpQR implementation, starting from the entry point main.py:quantize_model:

def quantize_model(model, args, device):
    """main entry point to functions for model quantization"""
    tick = time.time()
    if args.wbits == 16:
        print("not quantizing the model with args.wbits=16", flush=True)
        results = None, args.wbits
    elif args.nearest:
        results = quantize_nearest(model, args, device)  # RTN (round-to-nearest) baseline
    else:
        print("Loading data ...")
        ...
        results = quantize_spqr(model, dataloader, args, device)  # SpQR quantization
    print(f"quantization time: {time.time() - tick:.1f}")
    return results
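For orientation, the round-to-nearest (RTN) baseline mentioned in the `args.nearest` branch can be sketched as plain min-max quantization of a group of weights. This is a generic NumPy illustration under the usual asymmetric min-max convention, not the repository's `quantize_nearest`; the function names are hypothetical.

```python
import numpy as np


def rtn_quantize(weights, bits=4):
    """Asymmetric min-max round-to-nearest quantization of one weight group."""
    maxq = 2**bits - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / maxq if w_max > w_min else 1.0
    zero = float(np.round(-w_min / scale))
    codes = np.clip(np.round(weights / scale) + zero, 0, maxq).astype(np.int64)
    return codes, scale, zero


def rtn_dequantize(codes, scale, zero):
    """Map integer codes back to floating-point weights."""
    return scale * (codes - zero)


group = np.array([[-1.0, -0.2, 0.3, 1.0]])
codes, scale, zero = rtn_quantize(group, bits=4)
print(codes, rtn_dequantize(codes, scale, zero))
```

RTN treats every weight independently and uses no calibration data, which is why it serves as the baseline that the data-aware SpQR path is compared against.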

Core of the SpQR quantize implementation (excerpt):

def quantize(
        self,
        *,
        bits: int = 2,
        blocksize: int = 128,
        percdamp: float = 1e-2,
        groupsize: Optional[int] = None,
        keep_last_columns: int = 0,
        outlier_relative_threshold: float = float("inf"),
        permutation_order: Union[str, torch.Tensor] = "identity",
        keep_H: bool = True,
        simplified_outliers: bool = False,
        verbose=True,
        perchannel: bool = True,
        sym: bool = False,
        save_quantization: bool = False,
        **kwargs,
    ) -> QuantizationResult:
      # (setup code omitted in this excerpt)
      for block_start in block_start_iter:  # iterate over blocks of columns
          block_end = min(block_start + blocksize, in_dim)
          for column_index in range(block_start, block_end):
              if column_index % groupsize == 0:
                  # fit weight quantizer on the upcoming group of weight columns (inputs), across all rows (outputs)
                  in_group_index += 1
                  group_weight = weight[:, column_index : column_index + groupsize]

                  if simplified_outliers or (unstructured_outlier_threshold == float("inf")):
                      quantizer.find_params(group_weight, weight=True)

                  else:
                      # objective: detect which weights will be designated as outliers, fit quantizer *without* these weights
                      # step 1: fit quantizer on a leave-one-out version of weights, i.e. in each group, drop one weight at a time
                      assert perchannel, "refitting quantizer is only implemented for perchannel=True"
                      group_diag_hessian_inv_cho = H_inv_cho_diag[column_index : column_index + groupsize]
                      loo_quantization_error_sq = get_leave_one_out_error(
                          group_weight, group_diag_hessian_inv_cho, bits=bits, sym=sym
                      )
                      # ^-- dequantized(quantized(group_weight)) using a quantizer trained on all weights except the reconstructed one

                      likely_unstructured_outlier_mask = (
                          loo_quantization_error_sq > unstructured_outlier_threshold
                      ).float()  # likely outliers

                      non_outlier_mask = 1 - likely_unstructured_outlier_mask
                      mean_over_non_outliers = torch.sum(
                          group_weight * non_outlier_mask, dim=1, keepdim=True
                      ) / torch.sum(non_outlier_mask, dim=1, keepdim=True).clamp_min(1)
                      group_weight_without_outliers = group_weight * non_outlier_mask + mean_over_non_outliers * (
                          1 - non_outlier_mask
                      )
                      quantizer.find_params(group_weight_without_outliers, weight=True)  # refit quantization parameters with outliers excluded
                      del group_diag_hessian_inv_cho, loo_quantization_error_sq
                      del mean_over_non_outliers, group_weight_without_outliers, non_outlier_mask


              weight_quant_i = quantize(
                  weight[:, column_index].unsqueeze(1), quantizer.scale, quantizer.zero, quantizer.maxq
              )
              weight_i_quantized = dequantize(weight_quant_i, quantizer.scale, quantizer.zero).reshape_as(
                  weight[:, column_index]
              )

              delta_weight_i = weight[:, column_index] - weight_i_quantized  # [out_dim]
              quantization_errors[:, column_index] = (
                  delta_weight_i / H_inv_cho[column_index, column_index]
              )  # [out_dim]

              if unstructured_outlier_threshold != float("inf"):
                  unstructured_outlier_mask[:, column_index] = (
                      quantization_errors[:, column_index].square() > unstructured_outlier_threshold
                  )  # unstructured_outlier_mask marks the outliers
                  # re-quantize without outliers
                  is_outlier = unstructured_outlier_mask[:, column_index].float()

                  weight_quant_i = quantize(
                      (weight[:, column_index] * (1 - is_outlier)).unsqueeze(1),
                      quantizer.scale,
                      quantizer.zero,
                      quantizer.maxq,
                  )
                  weight_i_quantized_wo_outliers = dequantize(
                      weight_quant_i, quantizer.scale, quantizer.zero
                  ).reshape_as(weight[:, column_index])
                  weight_i_quantized = (
                      weight_i_quantized_wo_outliers * (1 - is_outlier) + weight[:, column_index] * is_outlier
                  )  # [out_dim]

                  delta_weight_i = weight[:, column_index] - weight_i_quantized  # [out_dim]
                  quantization_errors[:, column_index] = (
                      delta_weight_i / H_inv_cho[column_index, column_index]
                  )  # [out_dim]

              weight[:, column_index] = weight_i_quantized
              weight[:, column_index + 1 : block_end].addr_(
                  quantization_errors[:, column_index],
                  H_inv_cho[column_index, column_index + 1 : block_end],
                  alpha=-1,
              )
          # propagate this block's quantization error to the remaining columns, weight[:, block_end:]
          weight[:, block_end:].addmm_(
              quantization_errors[:, block_start:block_end],
              H_inv_cho[block_start:block_end, block_end:],
              alpha=-1,
          )    
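The column loop above can be condensed into a small NumPy sketch: quantize one column, scale its error by the diagonal of the upper Cholesky factor of the inverse Hessian, keep weights whose squared scaled error exceeds the threshold in full precision, and subtract the error's effect from the columns not yet quantized. This is a simplified illustration, not the repository code. It assumes a fixed, caller-supplied `quantize_column` function (so no per-group refit and no leave-one-out step), it updates all later columns immediately instead of block by block, and it assumes the Hessian is damped by `percdamp` times its mean diagonal. All names are hypothetical.

```python
import numpy as np


def quantize_with_compensation(weight, hessian, quantize_column,
                               outlier_threshold=float("inf"), percdamp=0.01):
    """Quantize columns left to right, pushing each column's error onto later columns."""
    W = np.array(weight, dtype=np.float64)  # working copy, shape [out_dim, in_dim]
    in_dim = W.shape[1]
    H = np.array(hessian, dtype=np.float64)
    H[np.diag_indices(in_dim)] += percdamp * np.mean(np.diag(H))  # assumed damping
    # Upper-triangular U with inv(H) = U.T @ U (plays the role of H_inv_cho).
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    outlier_mask = np.zeros(W.shape, dtype=bool)

    for i in range(in_dim):
        column = W[:, i].copy()
        quantized = quantize_column(column)
        error = (column - quantized) / U[i, i]
        if outlier_threshold != float("inf"):
            # Weights with a large scaled error stay in full precision.
            outlier_mask[:, i] = error**2 > outlier_threshold
            quantized = np.where(outlier_mask[:, i], column, quantized)
            error = (column - quantized) / U[i, i]
        W[:, i] = quantized
        # Rank-1 update of the not-yet-quantized columns (the addr_ call above).
        W[:, i + 1:] -= np.outer(error, U[i, i + 1:])
    return W, outlier_mask


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6))   # toy calibration inputs
W0 = rng.normal(size=(3, 6))   # toy weight matrix
Wq, mask = quantize_with_compensation(W0, X.T @ X, lambda c: np.round(c * 2) / 2,
                                      outlier_threshold=1.0)
print(Wq.shape, int(mask.sum()))
```

The blocked version in the source applies the same updates but defers the part that touches columns beyond the current block to a single matrix multiply (`addmm_`), which gives the same result with better memory locality.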