【vLLM Model Adaptation】GLM4.5 / GLM4.6 w8a8 Weight Quantization

Author: 昇腾实战派

0 Background

To optimize inference performance for Zhipu's GLM4.5 and GLM4.6 models, this article walks through exporting w8a8-quantized weights using the msmodelslim tool.

1 Environment Setup

1.1 Obtain the Floating-Point Weights

Model weight links:
https://modelscope.cn/models/ZhipuAI/GLM-4.5
https://modelscope.cn/models/ZhipuAI/GLM-4.6

1.2 Version Matrix

1.2.1 Hardware

Component             Version
Hardware environment  910B (8 cards)

1.2.2 Software

Component     Version
HDK           Ascend HDK 25.0.rc1.1
CANN          8.2.RC1
msmodelslim   commit 803a9b266
Model         GLM4.5/GLM4.6

1.2.3 Image Preparation

This setup uses CANN 8.2.RC1. Pull the v0.10.2rc1 A2 image from quay.io, and confirm the CANN version after downloading (see the check after the docker command below).

bash
# replace 9bthinking:latest below with the name of your image
docker run -it -d --net=host --shm-size=1g \
    --privileged \
    --name glm4_5_360b_w8a8 \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /home:/home \
    9bthinking:latest \
    bash
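
Once the container is up, confirm the CANN version as noted above. A minimal check, assuming the toolkit is installed under the default /usr/local/Ascend path (the platform directory may be x86_64-linux on x86 hosts):

bash
docker exec -it glm4_5_360b_w8a8 bash
cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info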

1.3 Install Python Dependencies

bash
pip3 install attrs cython 'numpy>=1.19.2,<=1.24.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf==3.20.0 scipy requests absl-py

1.4 Upgrade transformers

bash
pip install transformers==4.54.0

1.5 Install msmodelslim

bash
git clone https://gitee.com/ascend/msit.git
cd msit
git checkout 803a9b266

# enter the msit/msmodelslim directory and run the install script
cd msmodelslim
bash install.sh
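
You can verify the installation afterwards (assuming install.sh registers the package with pip under the name msmodelslim):

bash
pip show msmodelslim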

2 Adapting msmodelslim

The current quantization toolchain needs extra adaptation for this model. Following the same approach used for Qwen3, the changes involve two __init__.py files and one glm4moe.py file.

Go to msit/msmodelslim/msmodelslim/pytorch/llm_ptq/model/ and create a new glm4moe directory there.

Inside the glm4moe folder, create __init__.py and glm4moe.py, modeled on the Qwen3 adapter code, with the contents given in the next two subsections (the commands below sketch the layout).
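
For reference, the new files can be created like this:

bash
cd msit/msmodelslim/msmodelslim/pytorch/llm_ptq/model
mkdir glm4moe
touch glm4moe/__init__.py glm4moe/glm4moe.py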

2.1 Create __init__.py

python
#  -*- coding: utf-8 -*-
#  Copyright (c) 2025-2025 Huawei Technologies Co., Ltd.
#  #
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#  #
#  http://www.apache.org/licenses/LICENSE-2.0
#  #
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

__all__ = ['Glm4moeAdapter']

from .glm4moe import Glm4moeAdapter

2.2 Create glm4moe.py

python
# Copyright Huawei Technologies Co., Ltd. 2025-2025. All rights reserved.

from typing import List, Any, Dict, Optional, Type, TYPE_CHECKING

import torch
import torch.nn as nn
from transformers import PreTrainedModel

from msmodelslim.pytorch.llm_ptq.model.base import ModelAdapter, ModelAdapterRegistry

if TYPE_CHECKING:
    from msmodelslim.pytorch.llm_ptq.anti_outlier.config import AntiOutlierConfig

@ModelAdapterRegistry.register("glm4_moe")
@ModelAdapterRegistry.register("glm4moe")
class Glm4moeAdapter(ModelAdapter):

    def __init__(self, model: PreTrainedModel):
        super().__init__(model)
        self.is_moe = "moe" in self.model.config.model_type
        self.num_attention_heads, self.num_key_value_heads = self._init_num_attention_heads()

    def get_norm_linear_subgraph(self,
                                 cfg: 'AntiOutlierConfig',
                                 dummy_input: Optional[torch.Tensor] = None,
                                 norm_class: Optional[List[Type[nn.Module]]] = None):
        """获取Norm->Linear子图"""
        norm_linear = {}
        layer_num = self.model.config.num_hidden_layers

        # Validate that layer_num is neither too large nor too small
        if layer_num < 1 or layer_num > 999:
            raise ValueError(f"The number of hidden layers({layer_num}) is invalid. It must be between 1 and 999.")

        for layer in range(layer_num):
            input_layernorm = 'model.layers.' + str(layer) + '.input_layernorm'
            q_proj = 'model.layers.' + str(layer) + '.self_attn.q_proj'
            k_proj = 'model.layers.' + str(layer) + '.self_attn.k_proj'
            v_proj = 'model.layers.' + str(layer) + '.self_attn.v_proj'
            o_proj = 'model.layers.' + str(layer) + '.self_attn.o_proj'

            norm_linear[v_proj] = [o_proj]
            norm_linear[input_layernorm] = [q_proj, k_proj, v_proj]

            if not self.is_moe:
                post_layernorm = 'model.layers.' + str(layer) + '.post_attention_layernorm'
                gate_proj = 'model.layers.' + str(layer) + '.mlp.gate_proj'
                up_proj = 'model.layers.' + str(layer) + '.mlp.up_proj'
                down_proj = 'model.layers.' + str(layer) + '.mlp.down_proj'

                norm_linear[up_proj] = [down_proj]
                norm_linear[post_layernorm] = [gate_proj, up_proj]

        return norm_linear

    def modify_smooth_args(self,
                           cfg: 'AntiOutlierConfig',
                           norm_name: str,
                           linear_names: str,
                           args: List[Any],
                           kwargs: Dict[str, Any]):
        # When applying m4 quantization to this model, enable shift for specific layers
        if cfg.anti_method == 'm4':
            is_shift = False
            if 'norm' in norm_name:
                is_shift = True

            kwargs['is_shift'] = is_shift
            kwargs['alpha'] = cfg.alpha

        # As in the Qwen3 adapter, num_attention_heads and num_key_value_heads must be overridden
        if cfg.anti_method == 'm4' and 'num_attention_heads' in kwargs:
            kwargs['num_attention_heads'] = [self.num_attention_heads, self.num_key_value_heads]

        return args, kwargs

    def _init_num_attention_heads(self):
        num_attention_heads = None
        num_key_value_heads = None

        attention_heads_keys = ["num_attention_heads", "n_head", "num_heads"]
        key_value_heads_keys = ["num_key_value_heads"]

        for key in attention_heads_keys:
            if hasattr(self.model.config, key):
                num_attention_heads = getattr(self.model.config, key)

        for key in key_value_heads_keys:
            if hasattr(self.model.config, key):
                num_key_value_heads = getattr(self.model.config, key)

        if not num_attention_heads:
            raise ValueError(
                "the model config must have num_attention_heads, n_head or num_heads, "
                "please check or modify the config file"
            )
        return num_attention_heads, num_key_value_heads

2.3 Modify __init__.py

Modify msit/msmodelslim/msmodelslim/pytorch/llm_ptq/model/__init__.py. The import is what matters here: the @ModelAdapterRegistry.register decorators run when the module is imported, which is how the new adapter gets registered.

python
__all__ = ['ModelAdapter', 'ModelAdapterRegistry']

from .base import ModelAdapter, ModelAdapterRegistry
from .deepseek_v2 import DeepseekV2Adapter
from .hunyuan import HunyuanLargeAdapter, HunyuanVideoAdapter
from .qwen3 import Qwen3Adapter
from .flux import FluxAdapter
from .glm4moe import Glm4moeAdapter

2.4 Reinstall msmodelslim

Enter the msit/msmodelslim directory and run the install script again:

bash
cd msit/msmodelslim
bash install.sh
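
As a quick smoke test that the adapter is wired in (the import below is exactly what section 2.3 added to the package __init__.py):

bash
python -c "from msmodelslim.pytorch.llm_ptq.model import Glm4moeAdapter; print('glm4moe adapter OK')"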

3 Quantization

3.1 Quantization Scheme

  • MoE layers use dynamic quantization; all other layers use static quantization.
  • For accuracy, the down_proj and o_proj layers are rolled back to floating point (see the sketch below for how this maps onto the tool's configuration).
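
These two choices map directly onto the Calibrator arguments used in the full script of section 3.2:

python
# Mixed-precision routing: MoE (mlp) modules use dynamic w8a8, everything else static w8a8.
mix_cfg = {"*.mlp.*": "w8a8_dynamic", "*": "w8a8"}

# Accuracy rollback: o_proj in every layer, plus down_proj in the first three layers.
# num_hidden_layers comes from the model config.
disable_names = []
for i in range(num_hidden_layers):
    if i in (0, 1, 2):
        disable_names.append(f"model.layers.{i}.mlp.down_proj")
    disable_names.append(f"model.layers.{i}.self_attn.o_proj")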

3.2 Create glm4_moe_w8a8.py

Create a glm4_moe_w8a8.py file in msit/msmodelslim/example/Qwen3-MOE with the following contents:

python
# Copyright (c) Huawei Technologies Co., Ltd. 2025-2025. All rights reserved.
import os
import sys
import argparse
import functools
import json

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu

current_directory = os.path.dirname(os.path.abspath(__file__))
parent_directory = os.path.abspath(os.path.join(current_directory, '..', ".."))
sys.path.append(parent_directory)

from example.common.security.path import get_valid_read_path, get_write_directory
from example.common.security.type import check_number
from example.common.utils import SafeGenerator, cmd_bool
from msmodelslim.tools.copy_config_files import copy_config_files, modify_config_json
from msmodelslim.pytorch.llm_ptq.anti_outlier import AntiOutlierConfig, AntiOutlier
from msmodelslim.pytorch.llm_ptq.llm_ptq_tools import Calibrator, QuantConfig
from msmodelslim.tools.logger import set_logger_level


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_path', type=str, help="The path of float model and tokenizer")
    parser.add_argument('--save_path', type=str, help="The path to save quant model")
    parser.add_argument('--layer_count', type=int, default=0, help="Layer count when loading model")
    parser.add_argument('--anti_dataset', type=str, default="./anti_prompt_50.json",
                        help="The calib data for anti outlier")
    parser.add_argument('--calib_dataset', type=str, default="./calib_prompt_50.json",
                        help="The calib data for calibration")
    parser.add_argument('--batch_size', type=int, default=4, help="Batch size for anti and calibration")
    parser.add_argument('--mindie_format', action="store_true", help="Enable only mindie config save")
    parser.add_argument('--trust_remote_code', type=cmd_bool, default=False)
    return parser.parse_args()


def custom_hook(model_config):
    model_config["quantize"] = "w8a8_dynamic"


def get_calib_dataset_batch(model_tokenizer, calib_list, batch_size, device="npu"):
    calib_dataset = []
    calib_list = [calib_list[i:i + batch_size] for i in range(0, len(calib_list), batch_size)]
    for calib_data in calib_list:
        inputs = model_tokenizer(calib_data, return_tensors='pt', padding=True).to(device)
        calib_dataset.append(
            [value.to(device) for key, value in inputs.data.items() if isinstance(value, torch.Tensor)])
    return calib_dataset


def main():
    args = parse_args()
    set_logger_level("info")

    model_path = args.model_path
    batch_size = args.batch_size

    save_path = get_write_directory(args.save_path, write_mode=0o750)
    check_number(batch_size, int, 1, 16, "batch_size")

    safe_generator = SafeGenerator()

    config = safe_generator.get_config_from_pretrained(model_path=model_path, 
                                                       trust_remote_code=args.trust_remote_code)
    num_layer = config.num_hidden_layers
    if args.layer_count < 0 or args.layer_count > num_layer:
        raise ValueError(
            f"Invalid value for parameter layer_count: {args.layer_count}."
            f"Must be between 0 and {num_layer}."
        )
    # Setting layer_count to 0 means use all layers; otherwise only the first layer_count layers are loaded
    config.num_hidden_layers = args.layer_count if args.layer_count != 0 else config.num_hidden_layers
    # Disable the KV cache: it is not needed here, and it would consume too much device memory and cause OOM
    config.use_cache = False

    tokenizer = safe_generator.get_tokenizer_from_pretrained(model_path=model_path,
                                                             config=config,
                                                             trust_remote_code=args.trust_remote_code,
                                                             use_fast=True,
                                                             add_eos_token=True)

    model = safe_generator.get_model_from_pretrained(model_path=model_path,
                                                     config=config,
                                                     trust_remote_code=args.trust_remote_code,
                                                     device_map={
                                                         "model.embed_tokens": 0,
                                                         "model.layers": "cpu",
                                                         "model.norm": "cpu",
                                                         "lm_head": 0,
                                                     },
                                                     torch_dtype="auto",
                                                     attn_implementation='eager')

    anti_dataset_path = get_valid_read_path(args.anti_dataset, "json", is_dir=False)
    calib_dataset_path = get_valid_read_path(args.calib_dataset, "json", is_dir=False)
    with open(anti_dataset_path, "r") as file:
        anti_prompt = json.load(file)
    with open(calib_dataset_path, "r") as file:
        calib_prompt = json.load(file)
    anti_dataset = get_calib_dataset_batch(tokenizer, anti_prompt, batch_size, model.device)
    dataset_calib = get_calib_dataset_batch(tokenizer, calib_prompt, batch_size, model.device)

    with torch.no_grad():
        anti_config = AntiOutlierConfig(w_bit=8,
                                        a_bit=8,
                                        anti_method='m4',
                                        dev_type='npu',
                                        dev_id=model.device.index)
        anti_outlier = AntiOutlier(model, calib_data=anti_dataset, cfg=anti_config)
        anti_outlier.process()

    disable_names = []
    for ids in range(config.num_hidden_layers):
        if ids in [0, 1, 2]:
            disable_names.append(f"model.layers.{ids}.mlp.down_proj")
        disable_names.append(f"model.layers.{ids}.self_attn.o_proj")

    quant_config = QuantConfig(
        a_bit=8,
        w_bit=8,
        disable_names=disable_names,
        dev_type='npu',
        dev_id=model.device.index,
        act_method=1,
        pr=1.0,
        w_sym=True,
        mm_tensor=False,
    )

    calibrator = Calibrator(model, 
                            quant_config, 
                            calib_data=dataset_calib, 
                            disable_level="L0", 
                            mix_cfg={"*.mlp.*": "w8a8_dynamic", "*": "w8a8"})
    calibrator.run()

    if args.mindie_format:
        quant_model_description_json_name = "quant_model_description_w8a8_dynamic.json"
    else:
        quant_model_description_json_name = "quant_model_description.json"

    save_type = "safe_tensor" if args.mindie_format else "ascendV1"
    calibrator.save(save_path,
                    json_name=quant_model_description_json_name,
                    safetensors_name="quant_model_weight_w8a8_dynamic.safetensors",
                    save_type=[save_type],
                    part_file_size=4)

    custom_hooks = {
        'config.json': functools.partial(modify_config_json, custom_hook=custom_hook)
    }
    copy_config_files(input_path=model_path, output_path=save_path, quant_config=quant_config,
                      mindie_format=args.mindie_format, custom_hooks=custom_hooks)


if __name__ == "__main__":
    # torch_npu forks a new process during init; its lazy init would fail
    # after we load a big model, so we initialize it explicitly here
    torch_npu.npu.init()
    # Invoke main process
    main()

3.3 Modify transformers Code

The quantization step above fails with an error about missing score parameter weights; fix it by modifying the transformers source. Locate the installed package with:

pip show transformers

File path: transformers/models/glm4_moe/modeling_glm4_moe.py

In the Glm4MoeTopkRouter class, change the attribute definition to self.e_score_correction_bias = nn.Parameter(torch.zeros((self.n_routed_experts), dtype=torch.float32)), as sketched below.
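
A sketch of the edit, assuming the upstream __init__ registers e_score_correction_bias as a buffer (the surrounding lines may differ across transformers versions); the point is to turn it into an nn.Parameter so the weight is tracked during quantization:

python
class Glm4MoeTopkRouter(nn.Module):
    def __init__(self, config):
        super().__init__()
        ...  # other attributes unchanged
        # Before (assumed upstream form):
        # self.register_buffer("e_score_correction_bias",
        #                      torch.zeros((self.n_routed_experts)))
        # After: register it as a Parameter so msmodelslim finds the weight.
        self.e_score_correction_bias = nn.Parameter(
            torch.zeros((self.n_routed_experts), dtype=torch.float32)
        )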

3.4 Run the Script to Produce Quantized Weights

bash
python glm4_moe_w8a8.py --model_path "/home/weight" --save_path /home/glm4.5_w8a8_dynamic_v2/ --trust_remote_code True

Weight quantization is now complete.

3.5 Post-Processing the Quantized Weights

3.5.1 Merge config.json

Merge the quantization_config contents of quant_model_description.json into config.json.

Create a new configuration merge script, configmerge.py:

python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import json
import sys

# Paths are specified directly here
INPUT_DIR = "/GLM360B/glm4.5_w8a8_dynamic_v2"  # change to your model directory
OUTPUT_FILE = "/GLM360B/glm4.5_w8a8_dynamic_v2/config.json"  # change to your desired output path

def merge_configs():
    """
    合并配置文件:
    1. 从quant_model_description.json读取内容
    2. 将内容合并到config.json的quantization_config部分
    3. 保存文件config.json
    """
    # 构建文件路径
    config_path = os.path.join(INPUT_DIR, "config.json")
    quant_desc_path = os.path.join(INPUT_DIR, "quant_model_description_w8a8.json")

    # Check that both files exist
    if not os.path.exists(config_path):
        print(f"Error: config file does not exist: {config_path}")
        return False

    if not os.path.exists(quant_desc_path):
        print(f"Error: quantization description file does not exist: {quant_desc_path}")
        return False

    try:
        # Read config.json
        with open(config_path, 'r', encoding='utf-8') as f:
            config_data = json.load(f)

        # Read the quantization description file
        with open(quant_desc_path, 'r', encoding='utf-8') as f:
            quant_desc_data = json.load(f)

        # Make sure config.json has a quantization_config field
        if "quantization_config" not in config_data:
            config_data["quantization_config"] = {}

        # Merge the configurations
        config_data["quantization_config"].update(quant_desc_data)

        # Make sure the required fields are present
        if "moe_quantize" not in config_data:
            config_data["moe_quantize"] = "w8a8_dynamic"

        # Save the new config file
        with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
            json.dump(config_data, f, indent=4)

        print(f"Success: configuration merged and saved to {OUTPUT_FILE}")
        return True

    except Exception as e:
        print(f"Error: exception while processing config files: {str(e)}")
        return False

def main():
    # Run the merge
    success = merge_configs()
    if not success:
        sys.exit(1)

if __name__ == "__main__":
    main()

Run the script: python configmerge.py

3.5.2 Copy the Template File

Copy the chat template file from the original model directory into the quantized weight directory:

bash
cp chat_template.jinja /GLM360B/glm4.5_w8a8_dynamic_v2/

4 MTP Weight Copy (Optional)

Referring to add_safetensors.py in the msit/msmodelslim/example/DeepSeek/ folder, copy the MTP weights into the quantized weight directory. Once done, config.json must also be updated with the new quantization_config (which includes MTP). For example:

python
from add_safetensors import add_safetensors

add_safetensors(org_paths="/home/z00694760/GLM-4.5",
                target_dir="/home/z00694760/glm4.5_w8a8_with_float_mtp",
                safetensors_prefix="mtp_float",
                max_file_size_gb=5,
                prefix="model.layers.92.")