Journey 6 | Toolchain Quantization: Introduction and Hands-On Code

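1. Quantization Overview

1.1 Definition

Quantization is the process of mapping network parameters from 32-bit floating-point data to data of a lower bit width (int16, int8, int4, and so on). The reverse process is called dequantization.

Quantization is essentially a rescaling of the numerical range and can be "roughly" understood as a linear mapping. The word "roughly" is needed because some papers use nonlinear schemes such as logarithmic quantization, but what is deployed in industry today is still linear quantization (symmetric quantization, asymmetric quantization, binarization, and so on). Horizon mainly uses symmetric quantization.

Dequantization generally loses no information, whereas quantization generally loses precision. float32 can represent far more distinct values than uint8, which has only 256 levels, so most values have no exact uint8 representation and must be rounded to the nearest representable level. That rounding is the source of quantization error.

Why quantization is feasible: neural networks are fairly robust. Quantizing a high-precision model to a low-precision one can be viewed as injecting noise, and models are relatively insensitive to noise, so the quantized model can still keep good accuracy.

The purpose of quantization: lower computational complexity, faster inference, a smaller storage footprint, and lower energy consumption. In scenarios with tight power and latency budgets, quantization is a necessity.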

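1.2 Floating-Point / Fixed-Point Conversion Formulas

Let $r$ denote a floating-point real number and $q$ the fixed-point integer obtained after quantization. The two are converted by:

$$r = S(q - Z) \tag{1}$$

$$q = \mathrm{round}\left(\frac{r}{S} + Z\right) \tag{2}$$

where $S$ is the scale, the ratio between real values and integers, and $Z$ is the zero point, the integer that the real value 0 maps to after quantization. They are computed as:

$$S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}} \tag{3}$$

$$Z = \mathrm{round}\left(q_{max} - \frac{r_{max}}{S}\right) \tag{4}$$

where $r_{min}$ and $r_{max}$ are the minimum and maximum of the floating-point real number $r$, and $q_{min}$ and $q_{max}$ are the minimum and maximum of the fixed-point integer $q$.

Key point: the fixed-point zero point represents the floating-point value 0, and converting between the two loses no precision. This follows from Eq. (2): substituting $r = 0$ gives $q = Z$, because $Z$ is already an integer. The purpose is to make the floating-point 0 and the fixed-point zero point exactly equivalent when padding, so that the fixed-point and floating-point representations stay consistent.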
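
As a minimal sketch of the scale and zero-point formulas above, the NumPy snippet below computes $S$ and $Z$ for a made-up range and checks that the real value 0 maps to $Z$ and back without error. The helper names (`compute_qparams`, `quantize`, `dequantize`) and the range [-1.0, 3.0] are illustrative assumptions, not part of the toolchain.

```python
import numpy as np

def compute_qparams(r_min, r_max, q_min, q_max):
    """Scale S and zero point Z from the value range (illustrative helper)."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_max - r_max / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min, q_max):
    """q = round(r / S + Z), saturated to [q_min, q_max]."""
    q = np.rint(r / scale + zero_point)
    return np.clip(q, q_min, q_max).astype(np.int32)

def dequantize(q, scale, zero_point):
    """r = S * (q - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

# Hypothetical range: values in [-1.0, 3.0] mapped to uint8 [0, 255].
scale, zero_point = compute_qparams(-1.0, 3.0, 0, 255)
r = np.array([-1.0, 0.0, 0.7, 3.0], dtype=np.float32)
q = quantize(r, scale, zero_point, 0, 255)
r_hat = dequantize(q, scale, zero_point)

print("scale:", scale, "zero_point:", zero_point)
print("q:", q)
print("round-trip error:", np.abs(r - r_hat))
```

The entry for 0.0 quantizes to the zero point and dequantizes back to exactly 0.0, while the other entries carry an error of at most half a quantization step.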

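Symmetric vs. asymmetric quantization: when the real value 0 maps to the integer $Z = 0$, the scheme is called symmetric quantization; otherwise it is asymmetric quantization. Symmetric quantization may be somewhat less accurate than asymmetric quantization, but it is faster. The reason can be seen in Eq. (7): with $Z$ set to zero, the zero-point subtractions disappear.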
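
To illustrate the accuracy side of this trade-off, here is a small sketch that quantizes one-sided (non-negative) data both ways and compares the mean round-trip error. The data, the helper `quant_dequant`, and the rule "symmetric scale = max |r| / 127" are assumptions made for the example; the toolchain's actual calibration may differ.

```python
import numpy as np

def quant_dequant(r, scale, zero_point, q_min, q_max):
    """Quantize with q = round(r / S + Z), then map back with r = S * (q - Z)."""
    q = np.clip(np.rint(r / scale + zero_point), q_min, q_max)
    return scale * (q - zero_point)

# Hypothetical non-negative data, e.g. activations after a ReLU.
r = np.linspace(0.0, 6.0, 1000)

# Symmetric int8: Z = 0, scale taken from the largest magnitude (assumed rule).
sym_scale = np.max(np.abs(r)) / 127
sym_err = np.mean(np.abs(r - quant_dequant(r, sym_scale, 0, -128, 127)))

# Asymmetric int8: S and Z derived from the actual [r_min, r_max].
asym_scale = (r.max() - r.min()) / (127 - (-128))
asym_zp = int(round(127 - r.max() / asym_scale))
asym_err = np.mean(np.abs(r - quant_dequant(r, asym_scale, asym_zp, -128, 127)))

print("symmetric  mean abs error:", sym_err)
print("asymmetric mean abs error:", asym_err)
```

On this one-sided data the symmetric scheme leaves the negative half of the int8 range unused, so its step size is roughly twice as large. For data centered around zero the gap largely disappears.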

1.3 Quantizing Matrix Operations

The convolution and fully connected layers of a convolutional network are, at their core, a series of matrix multiplications. Let us look at how the floating-point operations in a matrix multiplication are converted to fixed-point operations.

Let $r_1$ and $r_2$ be two $N \times N$ matrices of floating-point real numbers, and let $r_3$ be their product. The matrix multiplication can be written as:

$$r_3^{i,k} = \sum_{j=1}^{N} r_1^{i,j} \, r_2^{j,k} \tag{5}$$

Let $S_1$ and $Z_1$ be the scale and zero point of matrix $r_1$, and likewise $S_2$, $Z_2$ for $r_2$ and $S_3$, $Z_3$ for $r_3$. Substituting Eq. (1) into Eq. (5) gives:

$$S_3 \left(q_3^{i,k} - Z_3\right) = \sum_{j=1}^{N} S_1 \left(q_1^{i,j} - Z_1\right) S_2 \left(q_2^{j,k} - Z_2\right) \tag{6}$$

Rearranging gives:

$$q_3^{i,k} = \frac{S_1 S_2}{S_3} \sum_{j=1}^{N} \left(q_1^{i,j} - Z_1\right)\left(q_2^{j,k} - Z_2\right) + Z_3 \tag{7}$$

In Eq. (7), everything except $S_1 S_2 / S_3$ is fixed-point integer arithmetic.
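
The following NumPy sketch walks through this derivation on random 4×4 matrices: it quantizes both inputs, accumulates $\sum_j (q_1 - Z_1)(q_2 - Z_2)$ in integers, rescales by $S_1 S_2 / S_3$, and compares the dequantized result with the floating-point product. The helpers `qparams` and `quantize`, the int8 range, and the min/max calibration are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

def qparams(r, q_min=-128, q_max=127):
    """Asymmetric int8 parameters from the tensor's own min/max (illustrative)."""
    r_min, r_max = min(r.min(), 0.0), max(r.max(), 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_max - r_max / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    return np.clip(np.rint(r / scale + zero_point), q_min, q_max).astype(np.int32)

r1 = rng.uniform(-1.0, 1.0, size=(N, N))
r2 = rng.uniform(-1.0, 1.0, size=(N, N))
r3 = r1 @ r2                      # floating-point reference

S1, Z1 = qparams(r1)
S2, Z2 = qparams(r2)
S3, Z3 = qparams(r3)
q1, q2 = quantize(r1, S1, Z1), quantize(r2, S2, Z2)

# The sum is pure integer arithmetic (integer accumulator);
# only the factor S1 * S2 / S3 is still a floating-point number here.
acc = (q1 - Z1) @ (q2 - Z2)
q3 = np.clip(np.rint(S1 * S2 / S3 * acc) + Z3, -128, 127).astype(np.int32)

r3_hat = S3 * (q3 - Z3)           # dequantize to compare with the reference
print("max abs error vs. float matmul:", np.max(np.abs(r3 - r3_hat)))
```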

So how can $S_1 S_2 / S_3$ be turned into fixed-point arithmetic as well?

Let $M = S_1 S_2 / S_3$. Because $M$ is usually a real number in (0, 1) (an empirical observation from a large number of experiments), it can be written as $M = 2^{-n} M_0$, where $M_0$ is a fixed-point real number. Note that a fixed-point number is not necessarily an integer: "fixed-point" means that the number of fractional bits is fixed.

Therefore, given $M = 2^{-n} M_0$, multiplying by $M$ becomes a fixed-point multiplication by $M_0$ followed by a right shift of $n$ bits, so the entire matrix computation runs in fixed-point arithmetic.
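
A minimal sketch of this idea in plain Python is shown below. It assumes that $M_0$ is normalized to [0.5, 1) and stored with 31 fractional bits, a common convention that the text above does not specify; the function names and the sample values of `M` and `acc` are likewise hypothetical. Python integers have arbitrary precision, so a real implementation would need a 64-bit intermediate for the product.

```python
def quantize_multiplier(m, bits=31):
    """Split 0 < m < 1 into m ~= m0_int * 2**-(bits + n), with M0 = m0_int / 2**bits."""
    n = 0
    while m < 0.5:       # normalize so that M0 lies in [0.5, 1)
        m *= 2.0
        n += 1
    m0_int = round(m * (1 << bits))   # M0 stored as a fixed-point integer
    return m0_int, n

def apply_multiplier(acc, m0_int, n, bits=31):
    """Approximate acc * M with one integer multiply and a rounding right shift."""
    shift = bits + n
    return (acc * m0_int + (1 << (shift - 1))) >> shift

M = 0.0003712            # hypothetical value of S1 * S2 / S3
m0_int, n = quantize_multiplier(M)
acc = 123456             # hypothetical integer accumulator value

print("integer-only:   ", apply_multiplier(acc, m0_int, n))
print("float reference:", round(acc * M))
```

Both lines print the same integer for this input. The rounding in the shift is a simple round-half-up and is not meant to reproduce the toolchain's rounding behavior.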

Main reference for this section: zhuanlan.zhihu.com/p/149659607

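2. Quantization in the Journey 6 Toolchain

Starting with Journey 6 OE3.0.31, the quantization rounding changed, mainly affecting the QAT algorithm side and the UCP software side. The change: QAT rounding went from floor(x_data + 0.5) to nearbyint(x_data), that is, ties are rounded to the nearest even integer ("round-half-to-even").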
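
The difference between the two rounding rules only shows up on exact ties. The short NumPy comparison below uses `np.floor(x + 0.5)` for the old rule and `np.rint` as a stand-in for nearbyint; the sample values are chosen to hit the ties.

```python
import numpy as np

x = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])

old_rounding = np.floor(x + 0.5)   # round half up (toward +infinity)
new_rounding = np.rint(x)          # round half to even, like nearbyint

print("floor(x + 0.5):", old_rounding)
print("rint(x):       ", new_rounding)
print("differs at:    ", x[old_rounding != new_rounding])
```

For these inputs the two rules disagree at -1.5, 0.5, and 2.5, so a model quantized under one rule can differ by one LSB from the other on such values.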

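The rest of this article covers the quantization formula as implemented in the Journey 6 toolchain from OE3.0.31 onward, and briefly describes how the dequantization node is implemented.

The Quantize node converts float data in the model to an int type. The formula and code are introduced below.

2.1 NumPy Implementation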

import numpy as np

input_float = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])
scale = 1.0    # quantization scale (scaling factor)

# np.round rounds ties to the nearest even integer
data_round_div = np.round(input_float / scale)
# int8 quantization: saturate to [-128, 127]
data_clip = np.clip(data_round_div, -128, 127).astype(np.int8)
# int16 quantization: saturate to [-32768, 32767]
data_clip_int16 = np.clip(data_round_div, -32768, 32767).astype(np.int16)

print("Original input_float:", input_float)
print("np.round(input_float / scale):", data_round_div)
print("np.clip(data_round_div, -128, 127):", data_clip)

Output:

Original input_float: [-2.5 -1.5 -0.5  0.5  1.5  2.5]
np.round(input_float / scale): [-2. -2. -0.  0.  2.  2.]
np.clip(data_round_div, -128, 127): [-2 -2  0  0  2  2]

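Note: for values ending in .5, np.round rounds to the nearest even integer (2.5 → 2, 1.5 → 2). Any other implementation must round the same way.

Explanation: NumPy uses "round half to even" (banker's rounding) by default.

2.2 PyTorch Implementation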

import torch

input_float = torch.tensor([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])
scale = 1.0    # quantization scale (scaling factor)

# torch.round rounds ties to the nearest even integer
data_round_div = torch.round(input_float / scale)
# int8 quantization: saturate to [-128, 127]
data_clamp = torch.clamp(data_round_div, -128, 127).to(torch.int8)
# int16 quantization: saturate to [-32768, 32767]
data_clamp_int16 = torch.clamp(data_round_div, -32768, 32767).to(torch.int16)

print("Original:", input_float)
print("torch.round(input_float / scale):", data_round_div)
print("torch.clamp(data_round_div, -128, 127):", data_clamp)

Output:

Original: tensor([-2.5000, -1.5000, -0.5000,  0.5000,  1.5000,  2.5000])
torch.round(input_float / scale): tensor([-2., -2., -0.,  0.,  2.,  2.])
torch.clamp(data_round_div, -128, 127): tensor([-2, -2,  0,  0,  2,  2], dtype=torch.int8)

Note: torch.round also rounds half to even.

2.3 C++ Implementation

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// int8 quantization; pass zero_point = 0 for symmetric quantization.
// std::nearbyint follows the current rounding mode (FE_TONEAREST by default),
// so ties are rounded to the nearest even integer.
int8_t quantize_core(float input, float scale, int32_t zero_point) {
    int quantized_data = static_cast<int>(std::nearbyint(input / scale + zero_point));
    return static_cast<int8_t>(std::min(std::max(-128, quantized_data), 127));
}

// int16 quantization; pass zero_point = 0 for symmetric quantization.
int16_t quantize_core_int16(float input, float scale, int32_t zero_point) {
    int quantized_data = static_cast<int>(std::nearbyint(input / scale + zero_point));
    return static_cast<int16_t>(std::min(std::max(-32768, quantized_data), 32767));
}

// Batch quantization: symmetric int8
std::vector<int8_t> quantize_tensor(const std::vector<float>& data, float scale) {
    std::vector<int8_t> result;
    result.reserve(data.size());
    for (float val : data) {
        result.push_back(quantize_core(val, scale, 0));
    }
    return result;
}

// Test driver
int main() {
    std::vector<float> input_data = {-2.5f, -1.5f, -0.5f, 0.5f, 1.5f, 2.5f};
    float scale = 1.0f;

    // Symmetric int8 quantization
    auto quantized = quantize_tensor(input_data, scale);
    std::cout << "Quantized (int8): ";
    for (auto q : quantized) std::cout << static_cast<int>(q) << " ";
    std::cout << std::endl;

    return 0;
}

Compile and run:

g++ -std=c++11 2.cpp -o quant
./quant

Output:

Quantized (int8): -2 -2 0 0 2 2

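Note: do not use std::round here, because it always rounds halfway cases away from zero (2.5 → 3). Use std::nearbyint instead, which follows the current floating-point rounding mode; the default mode, FE_TONEAREST, rounds ties to even (banker's rounding).

Supplement: the following form also works: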

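#include <cfenv>  // std::fesetround, FE_TONEAREST
#include <cmath>  // std::nearbyintf

// Set the floating-point rounding mode of the current thread to
// round-to-nearest (ties go to even). This is already the default mode.
std::fesetround(FE_TONEAREST);
// Takes a float and returns a float; `res` is the value to round.
float rounded = std::nearbyintf(res);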

3. Dequantization in the Journey 6 Toolchain

The Dequantize node converts int8 or int32 data in the model back to float. It is computed as:

dequantize_x = (x - zero_point) * scale

The C++ implementation of the Dequantize node is as follows:

// Simple version
static_cast<float>(input - zero_point) * scale
// Standard version
template <typename T>
float dequantize_core(T input, float scale, int32_t zero_point) {
  return static_cast<float>(input - zero_point) * scale;
}
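
To see quantization and dequantization working together, here is a small NumPy round-trip sketch using the formula dequantize_x = (x - zero_point) * scale. The helper names, the scale of 0.05, and the sample values are illustrative assumptions; the last value is deliberately out of range to show saturation.

```python
import numpy as np

def quantize(x, scale, zero_point=0, q_min=-128, q_max=127):
    """Round half to even, then saturate to the int8 range."""
    return np.clip(np.rint(x / scale + zero_point), q_min, q_max).astype(np.int8)

def dequantize(q, scale, zero_point=0):
    """dequantize_x = (x - zero_point) * scale, computed in float32."""
    return (q.astype(np.float32) - zero_point) * scale

scale = 0.05                                   # hypothetical scale
x = np.array([-1.0, -0.33, 0.0, 0.42, 7.0], dtype=np.float32)
q = quantize(x, scale)
x_hat = dequantize(q, scale)

print("q:    ", q)
print("x_hat:", x_hat)
```

Values inside the representable range come back within half a step (scale / 2) of the original, while 7.0 saturates at 127 and comes back as roughly 6.35.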