[LLMs in Practice] Quantizing a Fine-Tuned QwQ-32B Reasoning Model with GPTQ

1. Quantization Background

The point of quantization is to squeeze more performance out of the hardware you already have. Quantization converts model weights from high precision (e.g., FP32) to low precision (e.g., INT8/FP16), cutting memory usage by 50%–75%. Low-precision arithmetic (e.g., INT8) is also more efficient on GPUs and similar hardware, so inference speed can improve by roughly 2–4x.

Our task is to take the fine-tuned QwQ-32B reasoning model, stored in bf16, and compress it to int4 via quantization. For the QwQ-32B fine-tuning itself, see "Fine-Tuning the QwQ-32B Reasoning Model with the ms-swift Framework"; for throughput comparisons across reasoning models, see "Comparing the Throughput of Different Reasoning Models, Including QwQ-32B".
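
As a sanity check on those numbers, here is a back-of-the-envelope estimate of the weight memory of a 32B-parameter model (a rough sketch: real checkpoints also carry embeddings, norms, and, for GPTQ, group scales and zero-points):

python
# Approximate weight-only memory footprint of a 32B-parameter model.
params = 32e9
bytes_per_weight = {"bf16": 2, "int8": 1, "int4": 0.5}
for fmt, b in bytes_per_weight.items():
    print(f"{fmt}: ~{params * b / 1024**3:.0f} GB")
# bf16: ~60 GB, int8: ~30 GB, int4: ~15 GB -- roughly in line with the
# 62 GB -> 19 GB change measured later, once quantization metadata and
# unquantized modules are counted.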

2. Quantization Procedure

QwQ-32B still uses the Qwen2 architecture, so it can be quantized with GPTQ; an earlier attempt with AWQ raised errors. The steps below use AutoGPTQ.

First, install the latest version of the package from source:

bash
git clone https://github.com/AutoGPTQ/AutoGPTQ
cd AutoGPTQ
pip install -e .

Suppose we have fine-tuned QwQ-32B on our own dataset with reasoning chains and named the resulting model QwQ-32B-finetuned. To build a GPTQ-quantized model, we also need training data for calibration; a sample record is sketched below.
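
One line of that calibration JSONL might look like the following (a sketch: the path and the "messages" field match what the quantization script below reads; the content itself is illustrative):

python
import json

# Illustrative calibration record; real records should come from the
# fine-tuning dataset with reasoning chains.
record = {
    "messages": [
        {"role": "user", "content": "推荐一款防水耳机"},
        {"role": "assistant", "content": "<think>...reasoning chain...</think>final answer"},
    ]
}
with open("/data/jz_v0303.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")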

For the calibration setup, a commonly recommended fix is to set damp_percent=0.1; I used 128 calibration samples. Otherwise you may hit the following error [1]:

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite

In my case I kept damp_percent=0.01 and cleared the error by adjusting the number of calibration samples instead; the sketch below shows what damp_percent actually does.
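
For intuition: GPTQ has to Cholesky-factorize each layer's input Hessian, which can come out numerically non-positive-definite when the calibration data is too small or too uniform, so a diagonal damping term scaled by damp_percent is added before factorizing. A minimal sketch (the helper name is hypothetical; the damping formula follows AutoGPTQ's implementation):

python
import torch

def damped_cholesky(H: torch.Tensor, damp_percent: float = 0.01) -> torch.Tensor:
    """Hypothetical helper mirroring GPTQ's damping step."""
    # Add damp_percent * mean(diag(H)) to the diagonal so the factorization
    # succeeds even when H is near-singular.
    damp = damp_percent * torch.mean(torch.diag(H))
    H = H + damp * torch.eye(H.shape[0], device=H.device, dtype=H.dtype)
    return torch.linalg.cholesky(H)

Raising damp_percent makes the factorization more robust at the cost of a slightly less faithful Hessian; adding more (or more diverse) calibration samples attacks the same problem from the data side, which is why either fix can work.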

We quantize on two GPUs, with the following script:

python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
import json

# Paths
model_path = "/data/QwQ-32B-finetuned"
quant_path = "/data/quantized_model"

# Quantization configuration
quantize_config = BaseQuantizeConfig(
    bits=4,  # 4-bit or 8-bit quantization
    group_size=128,
    damp_percent=0.01,
    desc_act=False,  # False speeds up inference, but perplexity may be slightly worse
    static_groups=False,
    sym=True,
    true_sequential=True,
    model_name_or_path=None,
    model_file_base_name="model"
)

max_len = 8192  # maximum calibration sequence length, in tokens

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the model, pinned to GPU 2 and GPU 5
model = AutoGPTQForCausalLM.from_pretrained(
    model_path,
    quantize_config,
    max_memory={2: "80GB", 5: "80GB"}  # use GPU 2 and GPU 5, 80GB each
)

# Prepare the calibration dataset
data = []
with open("/data/jz_v0303.jsonl", "r") as f:
    for line in f:
        msg = json.loads(line)
        text = tokenizer.apply_chat_template(msg["messages"], tokenize=False, add_generation_prompt=False)
        model_inputs = tokenizer([text])
        # Truncate each sequence to max_len tokens. Note the [0]: slicing
        # model_inputs.input_ids[:max_len] directly would slice the batch
        # dimension (always of size 1 here) rather than the token dimension.
        input_ids = torch.tensor([model_inputs.input_ids[0][:max_len]], dtype=torch.int)
        data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))

# Run the quantization with INFO-level logging
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
model.quantize(data, cache_examples_on_gpu=False)

# Save the quantized model
model.save_quantized(quant_path, use_safetensors=True)
tokenizer.save_pretrained(quant_path)

Quantization log:

QwQ has 64 transformer layers, and the whole quantization took about 110 minutes. Note in the excerpts below how the average quantization loss climbs steeply in the later layers.

Loading checkpoint shards: 100%|██████████| 14/14 [00:24<00:00, 1.74s/it]

INFO - Start quantizing layer 1/64

INFO - Quantizing self_attn.k_proj in layer 1/64...

2025-03-16 13:08:27 INFO [auto_gptq.quantization.gptq] duration: 4.176503658294678

2025-03-16 13:08:27 INFO [auto_gptq.quantization.gptq] avg loss: 4.942690849304199

INFO - Quantizing self_attn.v_proj in layer 1/64...

2025-03-16 13:08:28 INFO [auto_gptq.quantization.gptq] duration: 1.400636911392212

2025-03-16 13:08:28 INFO [auto_gptq.quantization.gptq] avg loss: 1.4266357421875

INFO - Quantizing self_attn.q_proj in layer 1/64...

2025-03-16 13:08:30 INFO [auto_gptq.quantization.gptq] duration: 1.4035542011260986

2025-03-16 13:08:30 INFO [auto_gptq.quantization.gptq] avg loss: 14.252044677734375

INFO - Quantizing self_attn.o_proj in layer 1/64...

2025-03-16 13:08:35 INFO [auto_gptq.quantization.gptq] duration: 1.4259772300720215

2025-03-16 13:08:35 INFO [auto_gptq.quantization.gptq] avg loss: 21.492481231689453

INFO - Quantizing mlp.up_proj in layer 1/64...

2025-03-16 13:08:42 INFO [auto_gptq.quantization.gptq] duration: 1.4980144500732422

2025-03-16 13:08:42 INFO [auto_gptq.quantization.gptq] avg loss: 11.520009994506836

INFO - Quantizing mlp.gate_proj in layer 1/64...

2025-03-16 13:08:43 INFO [auto_gptq.quantization.gptq] duration: 1.4689013957977295

2025-03-16 13:08:43 INFO [auto_gptq.quantization.gptq] avg loss: 13.158416748046875

INFO - Quantizing mlp.down_proj in layer 1/64...

2025-03-16 13:09:36 INFO [auto_gptq.quantization.gptq] duration: 11.233691692352295

2025-03-16 13:09:36 INFO [auto_gptq.quantization.gptq] avg loss: 5.198782444000244

INFO - Start quantizing layer 2/64

INFO - Quantizing self_attn.k_proj in layer 2/64...

2025-03-16 13:09:50 INFO [auto_gptq.quantization.gptq] duration: 1.4270472526550293

2025-03-16 13:09:50 INFO [auto_gptq.quantization.gptq] avg loss: 0.25423723459243774

INFO - Quantizing self_attn.v_proj in layer 2/64...

2025-03-16 13:09:51 INFO [auto_gptq.quantization.gptq] duration: 1.377784252166748

2025-03-16 13:09:51 INFO [auto_gptq.quantization.gptq] avg loss: 0.12605950236320496

INFO - Quantizing self_attn.q_proj in layer 2/64...

2025-03-16 13:09:53 INFO [auto_gptq.quantization.gptq] duration: 1.3954062461853027

2025-03-16 13:09:53 INFO [auto_gptq.quantization.gptq] avg loss: 0.6923567056655884

INFO - Quantizing self_attn.o_proj in layer 2/64...

2025-03-16 13:09:58 INFO [auto_gptq.quantization.gptq] duration: 1.4187729358673096

2025-03-16 13:09:58 INFO [auto_gptq.quantization.gptq] avg loss: 0.21527329087257385

INFO - Quantizing mlp.up_proj in layer 2/64...

2025-03-16 13:10:05 INFO [auto_gptq.quantization.gptq] duration: 1.4918739795684814

2025-03-16 13:10:05 INFO [auto_gptq.quantization.gptq] avg loss: 42.98908615112305

INFO - Quantizing mlp.gate_proj in layer 2/64...

2025-03-16 13:10:07 INFO [auto_gptq.quantization.gptq] duration: 1.4632303714752197

2025-03-16 13:10:07 INFO [auto_gptq.quantization.gptq] avg loss: 254.09523010253906

INFO - Quantizing mlp.down_proj in layer 2/64...

2025-03-16 13:10:59 INFO [auto_gptq.quantization.gptq] duration: 11.405533790588379

2025-03-16 13:10:59 INFO [auto_gptq.quantization.gptq] avg loss: 1.4062278270721436

INFO - Start quantizing layer 3/64

.......

2025-03-16 14:33:08 INFO [auto_gptq.quantization.gptq] duration: 11.416744709014893

2025-03-16 14:33:08 INFO [auto_gptq.quantization.gptq] avg loss: 10015.05078125

INFO - Start quantizing layer 62/64

INFO - Quantizing self_attn.k_proj in layer 62/64...

2025-03-16 14:33:22 INFO [auto_gptq.quantization.gptq] duration: 1.4608099460601807

2025-03-16 14:33:22 INFO [auto_gptq.quantization.gptq] avg loss: 129.20584106445312

INFO - Quantizing self_attn.v_proj in layer 62/64...

2025-03-16 14:33:23 INFO [auto_gptq.quantization.gptq] duration: 1.417314052581787

2025-03-16 14:33:23 INFO [auto_gptq.quantization.gptq] avg loss: 834.720947265625

INFO - Quantizing self_attn.q_proj in layer 62/64...

2025-03-16 14:33:25 INFO [auto_gptq.quantization.gptq] duration: 1.4364099502563477

2025-03-16 14:33:25 INFO [auto_gptq.quantization.gptq] avg loss: 770.3301391601562

INFO - Quantizing self_attn.o_proj in layer 62/64...

2025-03-16 14:33:30 INFO [auto_gptq.quantization.gptq] duration: 1.4644238948822021

2025-03-16 14:33:30 INFO [auto_gptq.quantization.gptq] avg loss: 1413.948486328125

INFO - Quantizing mlp.up_proj in layer 62/64...

2025-03-16 14:33:38 INFO [auto_gptq.quantization.gptq] duration: 1.5320115089416504

2025-03-16 14:33:38 INFO [auto_gptq.quantization.gptq] avg loss: 7386.39453125

INFO - Quantizing mlp.gate_proj in layer 62/64...

2025-03-16 14:33:39 INFO [auto_gptq.quantization.gptq] duration: 1.5006358623504639

2025-03-16 14:33:39 INFO [auto_gptq.quantization.gptq] avg loss: 6787.9912109375

INFO - Quantizing mlp.down_proj in layer 62/64...

2025-03-16 14:34:32 INFO [auto_gptq.quantization.gptq] duration: 11.412427186965942

2025-03-16 14:34:32 INFO [auto_gptq.quantization.gptq] avg loss: 11235.9814453125

INFO - Start quantizing layer 63/64

INFO - Quantizing self_attn.k_proj in layer 63/64...

2025-03-16 14:34:46 INFO [auto_gptq.quantization.gptq] duration: 1.4546654224395752

2025-03-16 14:34:46 INFO [auto_gptq.quantization.gptq] avg loss: 130.98355102539062

INFO - Quantizing self_attn.v_proj in layer 63/64...

2025-03-16 14:34:48 INFO [auto_gptq.quantization.gptq] duration: 1.4156157970428467

2025-03-16 14:34:48 INFO [auto_gptq.quantization.gptq] avg loss: 958.8649291992188

INFO - Quantizing self_attn.q_proj in layer 63/64...

2025-03-16 14:34:49 INFO [auto_gptq.quantization.gptq] duration: 1.4323241710662842

2025-03-16 14:34:49 INFO [auto_gptq.quantization.gptq] avg loss: 780.7476196289062

INFO - Quantizing self_attn.o_proj in layer 63/64...

2025-03-16 14:34:55 INFO [auto_gptq.quantization.gptq] duration: 1.4556679725646973

2025-03-16 14:34:55 INFO [auto_gptq.quantization.gptq] avg loss: 2276.7041015625

INFO - Quantizing mlp.up_proj in layer 63/64...

2025-03-16 14:35:01 INFO [auto_gptq.quantization.gptq] duration: 1.533803939819336

2025-03-16 14:35:01 INFO [auto_gptq.quantization.gptq] avg loss: 7764.6142578125

INFO - Quantizing mlp.gate_proj in layer 63/64...

2025-03-16 14:35:03 INFO [auto_gptq.quantization.gptq] duration: 1.4962470531463623

2025-03-16 14:35:03 INFO [auto_gptq.quantization.gptq] avg loss: 7304.74365234375

INFO - Quantizing mlp.down_proj in layer 63/64...

2025-03-16 14:35:56 INFO [auto_gptq.quantization.gptq] duration: 11.429993629455566

2025-03-16 14:35:56 INFO [auto_gptq.quantization.gptq] avg loss: 17015.2734375

INFO - Start quantizing layer 64/64

INFO - Quantizing self_attn.k_proj in layer 64/64...

2025-03-16 14:36:10 INFO [auto_gptq.quantization.gptq] duration: 1.453392744064331

2025-03-16 14:36:10 INFO [auto_gptq.quantization.gptq] avg loss: 112.55108642578125

INFO - Quantizing self_attn.v_proj in layer 64/64...

2025-03-16 14:36:11 INFO [auto_gptq.quantization.gptq] duration: 1.4028844833374023

2025-03-16 14:36:11 INFO [auto_gptq.quantization.gptq] avg loss: 509.4556884765625

INFO - Quantizing self_attn.q_proj in layer 64/64...

2025-03-16 14:36:12 INFO [auto_gptq.quantization.gptq] duration: 1.434821605682373

2025-03-16 14:36:12 INFO [auto_gptq.quantization.gptq] avg loss: 685.0777587890625

INFO - Quantizing self_attn.o_proj in layer 64/64...

2025-03-16 14:36:18 INFO [auto_gptq.quantization.gptq] duration: 1.4707720279693604

2025-03-16 14:36:18 INFO [auto_gptq.quantization.gptq] avg loss: 990.3109130859375

INFO - Quantizing mlp.up_proj in layer 64/64...

2025-03-16 14:36:25 INFO [auto_gptq.quantization.gptq] duration: 1.572035312652588

2025-03-16 14:36:25 INFO [auto_gptq.quantization.gptq] avg loss: 8309.283203125

INFO - Quantizing mlp.gate_proj in layer 64/64...

2025-03-16 14:36:27 INFO [auto_gptq.quantization.gptq] duration: 1.8046717643737793

2025-03-16 14:36:27 INFO [auto_gptq.quantization.gptq] avg loss: 7995.7509765625

INFO - Quantizing mlp.down_proj in layer 64/64...

2025-03-16 14:37:20 INFO [auto_gptq.quantization.gptq] duration: 11.410486698150635

2025-03-16 14:37:20 INFO [auto_gptq.quantization.gptq] avg loss: 27875.2734375

INFO - Packing model...

2025-03-16 14:37:25 INFO [auto_gptq.modeling._utils] Packing model...

Packing model.layers.63.mlp.down_proj...: 100%|██████████| 448/448 [20:01<00:00, 2.68s/it]

INFO - Model packed.

2025-03-16 14:57:31 INFO [auto_gptq.modeling._utils] Model packed.

Before quantization the model is 62 GB:

total 62G

-rw-r--r-- 1 research research 707 Mar 12 10:19 added_tokens.json

-rw-r--r-- 1 research research 16K Mar 12 10:19 args.json

-rw-r--r-- 1 research research 785 Mar 12 10:15 config.json

-rw-r--r-- 1 research research 214 Mar 12 10:15 generation_config.json

-rw-r--r-- 1 research research 1.6M Mar 12 10:19 merges.txt

-rw-r--r-- 1 research research 4.6G Mar 12 10:15 model-00001-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00002-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00003-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00004-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00005-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:17 model-00006-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:17 model-00007-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:17 model-00008-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00009-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00010-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00011-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00012-of-00014.safetensors

-rw-r--r-- 1 research research 4.6G Mar 12 10:19 model-00013-of-00014.safetensors

-rw-r--r-- 1 research research 2.0G Mar 12 10:19 model-00014-of-00014.safetensors

-rw-r--r-- 1 research research 62K Mar 12 10:19 model.safetensors.index.json

-rw-r--r-- 1 research research 613 Mar 12 10:19 special_tokens_map.json

-rw-r--r-- 1 research research 8.0K Mar 12 10:19 tokenizer_config.json

-rw-r--r-- 1 research research 11M Mar 12 10:19 tokenizer.json

-rw-r--r-- 1 research research 2.7M Mar 12 10:19 vocab.json

After quantization it is 19 GB:

total 19G

-rw-r--r-- 1 research research 707 Mar 16 14:58 added_tokens.json

-rw-r--r-- 1 research research 1.2K Mar 16 14:58 config.json

-rw-r--r-- 1 research research 1.6M Mar 16 14:58 merges.txt

-rw-r--r-- 1 research research 19G Mar 16 14:58 model.safetensors

-rw-r--r-- 1 research research 271 Mar 16 14:58 quantize_config.json

-rw-r--r-- 1 research research 613 Mar 16 14:58 special_tokens_map.json

-rw-r--r-- 1 research research 8.0K Mar 16 14:58 tokenizer_config.json

-rw-r--r-- 1 research research 11M Mar 16 14:58 tokenizer.json

-rw-r--r-- 1 research research 2.7M Mar 16 14:58 vocab.json
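
Before deploying, it is worth a quick sanity check that the packed model loads and generates (a minimal sketch using AutoGPTQ's from_quantized; adjust the device to taste):

python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

quant_path = "/data/quantized_model"
tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoGPTQForCausalLM.from_quantized(quant_path, device="cuda:0")

# Build a chat-formatted prompt and generate a short reply.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))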

3. Deploying the Quantized Model

vLLM supports GPTQ, so it can serve AutoGPTQ-quantized models directly, with the same basic usage as any other model:

bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /data/quantized_model \
--tensor-parallel-size 4 \
--port 8001

Also, for the model id used in API calls, you can set an alias via --served-model-name rather than exposing the full path:

bash
vllm serve my_model --served-model-name my_alias

You can then call the API like so (the "model" field below assumes the server was started with --served-model-name quantized_model; without an alias, pass the full path /data/quantized_model instead):

bash
curl http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "quantized_model",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "推荐一款防水耳机."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'

You can also use the API client from the openai Python package:

python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="/data/quantized_model",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "推荐一款防水耳机"},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)

In my test, generation throughput reached about 92 tokens/s, which is quite good; a rough way to measure this is sketched below.
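
A crude way to reproduce that measurement through the OpenAI-compatible endpoint (a sketch: time one long completion and divide completion tokens by wall-clock time; this is single-request decoding speed, so it understates batched throughput):

python
import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1")

start = time.time()
resp = client.chat.completions.create(
    model="/data/quantized_model",
    messages=[{"role": "user", "content": "介绍一下模型量化"}],
    max_tokens=512,
)
elapsed = time.time() - start
# vLLM reports token usage in the response, so tokens/s is just:
print(f"{resp.usage.completion_tokens / elapsed:.1f} tokens/s")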

Note: for models in the tens-of-billions-parameter range, int8 quantization is usually the more appropriate choice; int4 is better suited to hundred-billion-parameter models, since at this scale the accuracy loss from int4 can be noticeable.

For comparison, here is the avg-loss behavior under int8 quantization: [figure omitted]

One more caveat: when the quantized model is served with vLLM, the chat template by default appends <|im_start|>assistant\n<think> to the prompt, which forces the model to emit its reasoning chain first; this is essentially instruction following. So if the generated output appears to be missing the <think> special token, it is not lost: it was already included in the prompt. The sketch below makes this visible.
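
You can see this directly by rendering the chat template yourself (a sketch; the exact tail depends on the chat template shipped with the model):

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data/quantized_model")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
# For QwQ's template the rendered prompt ends with
# "<|im_start|>assistant\n<think>\n", so generation starts inside the
# reasoning block and the sampled text never contains the opening <think>.
print(repr(prompt[-40:]))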

4. References

[1] https://github.com/AutoGPTQ/AutoGPTQ/issues/196

[2] GPTQ - Qwen
