📉 LLM Quantization, End to End: From Philosophy to Compute

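Quantization is more than a "compression technique"; it is a trade-off between computational efficiency and numerical precision. The core idea: use a coarser but cheaper number system to approximate the behavior of the full-precision model.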

I. The mathematics of quantization: mapping onto a grid

At its core, quantization maps the (near-)continuous floating-point values in a neural network onto a discrete integer space.

Core formulas:

$$Q = \text{clamp}\left(\text{round}\left(\frac{R}{S} + Z\right);\ Q_{min}, Q_{max}\right)$$

$$R_{approx} = (Q - Z) \times S$$

- $R$ (Real): the original floating-point value.
- $Q$ (Quantized): the integer after quantization.
- $S$ (Scale): the scaling factor, i.e. the step size of the grid.
- $Z$ (Zero-point): the offset that makes floating-point 0 land exactly on an integer grid point.
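The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample values of `scale` and `zero_point` are assumptions chosen for readability, not values from any real model.

```python
def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """Q = clamp(round(R / S + Z); Qmin, Qmax)."""
    q = round(r / scale + zero_point)  # Python rounds exact ties to the even integer
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """R_approx = (Q - Z) * S."""
    return (q - zero_point) * scale


# Illustrative parameters (assumed, not taken from a real layer).
scale, zero_point = 0.1, 3

q = quantize(0.52, scale, zero_point)           # round(5.2 + 3) = 8
r_approx = dequantize(q, scale, zero_point)     # (8 - 3) * 0.1, about 0.5
q_clamped = quantize(100.0, scale, zero_point)  # 1003 is clamped to Qmax = 127
```

For values inside the representable range, the round trip is off by at most half a step ($S/2$); values outside it are clamped and can be off by much more.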

II. How are the key parameters $S$ and $Z$ obtained?

The process of determining these two parameters is called calibration, and it largely decides how much accuracy survives quantization.

1. Finding the value range (dynamic range)

To compute $S$ and $Z$, first determine the minimum ($\alpha$) and maximum ($\beta$) of the original data:

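Weight calibration: weights are static, so a single pass over the layer's weight matrix yields the range.
Activation calibration: activations vary with the input, so a calibration set of real samples (typically 128 to 512) is run through the model once and the output distribution of each layer is recorded.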
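As a rough sketch of the difference between the two cases, the snippet below tracks a running min/max. The `MinMaxObserver` class, the toy `relu_layer`, and the three-sample calibration set are all hypothetical stand-ins; a real calibration run would use the actual model and 128 to 512 real samples.

```python
class MinMaxObserver:
    """Tracks the running minimum and maximum of every value it is shown."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values):
        for v in values:
            self.lo = min(self.lo, v)
            self.hi = max(self.hi, v)


def relu_layer(x, weights):
    """Toy layer: one dot product per weight row, followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


weights = [[0.5, -1.0], [2.0, 0.25]]
calibration_set = [[1.0, 2.0], [-3.0, 0.5], [0.0, 4.0]]  # stand-in for real samples

# Weights are static: one pass over the matrix is enough.
weight_obs = MinMaxObserver()
for row in weights:
    weight_obs.observe(row)

# Activations depend on the input: run every calibration sample through the layer.
act_obs = MinMaxObserver()
for x in calibration_set:
    act_obs.observe(relu_layer(x, weights))

weight_range = (weight_obs.lo, weight_obs.hi)
activation_range = (act_obs.lo, act_obs.hi)
```

The activation range is only as representative as the calibration set, which is why real data is used rather than random inputs.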

2. Strategies for choosing the clipping threshold

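Min-Max (full range): use $[\min, \max]$ directly. Nothing is clipped, but the range is highly sensitive to outliers: a single extreme value stretches the step size and leaves the bulk of the values with very coarse resolution.
Entropy / KL divergence: search for a clipping threshold that minimizes the information lost between the distributions before and after quantization (measured by KL divergence), at the cost of clipping a few extreme outliers.
Percentile: ignore the most extreme 0.1% of points and use the value at the 99.9th percentile as the boundary.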
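The effect of a single outlier on Min-Max versus Percentile can be seen in a small experiment. This is a sketch under stated assumptions: the data is synthetic (999 values spread evenly over $[-1, 1]$ plus one outlier at 50), and `percentile_range` uses a simple index-based rule that trims the same fraction from each tail, which is one of several reasonable ways to define the percentile bound.

```python
def minmax_range(values):
    """Full range: nothing is clipped."""
    return min(values), max(values)


def percentile_range(values, keep=0.999):
    """Trim a (1 - keep) fraction of the samples from each tail (index-based rule)."""
    ordered = sorted(values)
    hi_idx = int(keep * (len(ordered) - 1))
    lo_idx = len(ordered) - 1 - hi_idx
    return ordered[lo_idx], ordered[hi_idx]


def step_size(lo, hi, levels=255):
    """Step between adjacent grid points for an 8-bit grid (255 intervals)."""
    return (hi - lo) / levels


# Synthetic data: 999 well-behaved values in [-1, 1] plus one outlier.
values = [i / 499 - 1 for i in range(999)] + [50.0]

mm_lo, mm_hi = minmax_range(values)
pc_lo, pc_hi = percentile_range(values)

mm_step = step_size(mm_lo, mm_hi)  # stretched by the outlier
pc_step = step_size(pc_lo, pc_hi)  # outlier is clipped, step stays small
```

On this data the Min-Max step is more than 20 times coarser than the percentile step, so almost all values share a handful of grid points; the price of the percentile rule is that the outlier itself is clamped.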

3. Computing the parameters

Once $[\alpha, \beta]$ is fixed, $S$ and $Z$ follow from the target bit width (for example, INT8 covers $[-128, 127]$):

$$S = \frac{\beta - \alpha}{Q_{max} - Q_{min}}$$

$$Z = \text{round}\left(Q_{min} - \frac{\alpha}{S}\right)$$
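A worked example of these two formulas, as a minimal sketch. The function name `compute_qparams` and the sample range $[-1, 3]$ are assumptions made for illustration.

```python
def compute_qparams(alpha, beta, qmin=-128, qmax=127):
    """Derive scale S and zero-point Z from the calibrated range [alpha, beta]."""
    scale = (beta - alpha) / (qmax - qmin)
    zero_point = round(qmin - alpha / scale)
    return scale, zero_point


# Assumed calibrated range for illustration: alpha = -1.0, beta = 3.0.
alpha, beta = -1.0, 3.0
scale, zero_point = compute_qparams(alpha, beta)
# scale = 4 / 255, roughly 0.0157
# zero_point = round(-128 + 63.75) = round(-64.25) = -64

# Sanity check: the ends of the real range land on the ends of the integer range.
q_alpha = round(alpha / scale + zero_point)  # -128
q_beta = round(beta / scale + zero_point)    # 127
```

Real value 0 maps to the integer `zero_point` and dequantizes back to exactly 0, which is the property the zero-point is meant to guarantee. This only works when 0 lies inside $[\alpha, \beta]$, so implementations commonly widen the range to include 0 first.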

III. The computation: working directly in the integer domain

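Correcting a common misconception: quantized inference does not have to "restore everything to floating point and then compute". With integer kernels, the heavy arithmetic runs directly in the integer domain. (Weight-only schemes that keep activations in FP16 are the exception: they dequantize weights on the fly and gain mainly from reduced memory traffic.)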

Bandwidth dividend: INT4 weights occupy a quarter of the bytes of FP16 weights, so loading them from GPU memory takes roughly a quarter of the traffic, which substantially eases the "memory wall".
Compute dividend (integer arithmetic):
- Native integer kernels: on GPUs whose Tensor Cores support it, INT4 × INT4 products are executed directly.
- Hardware advantage: integer units are simpler circuits than floating-point units, so they generally deliver higher throughput per cycle.
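Dequantization timing:
- The bulk of the multiply-accumulate work is done in the integer domain, in a wider integer accumulator.
- Deferred restoration: only when the layer's computation is finished, before the result is passed to the next layer, is one dequantization multiply applied, $R = (Q - Z) \times S$. For a product of quantized weights and quantized activations, the accumulator is rescaled by the product of the two scales.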
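The deferred-dequantization idea can be sketched for a single dot product. Assumptions in this sketch: one scale and zero-point per tensor, hand-picked integer codes, and hypothetical function names; a real kernel would vectorize this and typically fold the zero-point terms into precomputed corrections.

```python
def int_dot(qw, zw, qx, zx):
    """Accumulate sum((qw_i - zw) * (qx_i - zx)) using integer arithmetic only."""
    acc = 0  # stands in for a wide (e.g. 32-bit) integer accumulator
    for a, b in zip(qw, qx):
        acc += (a - zw) * (b - zx)
    return acc


def quantized_dot(qw, sw, zw, qx, sx, zx):
    """Integer accumulation first, then a single floating-point rescale."""
    return int_dot(qw, zw, qx, zx) * (sw * sx)


# Real weights [0.5, -0.25, 1.0] with scale 0.01 and zero-point 0.
qw, sw, zw = [50, -25, 100], 0.01, 0
# Real activations [2.0, 4.0, -1.0] with scale 0.05 and zero-point 10.
qx, sx, zx = [50, 90, -10], 0.05, 10

acc = int_dot(qw, zw, qx, zx)                 # 2000 - 2000 - 2000 = -2000
result = quantized_dot(qw, sw, zw, qx, sx, zx)  # -2000 * 0.0005, about -1.0
```

The floating-point dot product of the original values is $0.5 \cdot 2 - 0.25 \cdot 4 + 1 \cdot (-1) = -1$, which the integer path reproduces with only one floating-point multiply at the end.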

IV. Going further: how to make the parameters more accurate (GPTQ and AWQ)

Plain linear mapping can cost a noticeable amount of model quality, so more advanced algorithms add compensation mechanisms:

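GPTQ (error compensation): when quantizing a weight introduces an error, the remaining not-yet-quantized weights in the same layer are adjusted to cancel it, guided by second-order (Hessian) information.
AWQ (protecting salient channels): a small fraction of weight channels (around 1%), identified by the magnitude of their activations, dominates accuracy. Those weights are "pre-scaled" before quantization so that they land on better positions of the quantization grid.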
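The pre-scaling idea in the AWQ item can be illustrated with a toy experiment. This is not the AWQ algorithm itself (which searches for per-channel scales using activation statistics); it only shows why multiplying a weight by $s$ before quantization and folding $1/s$ back in afterwards tends to shrink its rounding error. The assumptions are a symmetric 4-bit grid, a group whose largest weight magnitude is 1.0 and is not changed by the scaling, and hypothetical function names.

```python
def fake_quant(w, step, qmax=7):
    """Round w to the nearest grid point of a symmetric 4-bit grid."""
    q = max(-qmax, min(qmax, round(w / step)))
    return q * step


def effective_weight(w, s, step):
    """Quantize the pre-scaled weight w * s, then fold the factor 1 / s back in."""
    return fake_quant(w * s, step) / s


# Assumption: the group's largest |weight| is 1.0 and scaling does not change it,
# so the step size is the same with and without pre-scaling.
step = 1.0 / 7

# Hypothetical "salient" weights, small enough that doubling them stays below 1.0.
salient = [k / 100 for k in range(1, 50)]


def mean_abs_error(s):
    """Average rounding error of the effective weights for pre-scale factor s."""
    return sum(abs(effective_weight(w, s, step) - w) for w in salient) / len(salient)


plain_err = mean_abs_error(1.0)   # no pre-scaling
scaled_err = mean_abs_error(2.0)  # pre-scale by 2, roughly halves the error
```

For an individual weight the error may stay the same, but on average it drops by about a factor of $s$. The compensating $1/s$ has to be absorbed somewhere, for example into the corresponding input activations, so that the layer output is unchanged.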

V. Summary: the three payoffs of quantization

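Benefit	Physical effect	Result
Storage	16-bit $\to$ 4-bit weights	Model size shrinks by 75%; large models fit on inexpensive GPUs.
Bandwidth	About 4x fewer weight bytes read from GPU memory (16-bit $\to$ 4-bit)	Less stalling during generation; higher tokens/s throughput.
Compute	Integer units replace floating-point units	Higher computational efficiency; supports higher concurrency.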
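The storage row is simple arithmetic, sketched below for a hypothetical 7-billion-parameter model. The estimate counts weight bits only and ignores the per-group scales and zero-points that real quantized formats also store, so actual savings are somewhat smaller.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone (no scales, no zero-points)."""
    return num_params * bits_per_weight / 8


params = 7_000_000_000  # hypothetical 7B-parameter model

fp16_bytes = weight_bytes(params, 16)  # 14e9 bytes, about 14 GB
int4_bytes = weight_bytes(params, 4)   # 3.5e9 bytes, about 3.5 GB

saving = 1 - int4_bytes / fp16_bytes   # 0.75, i.e. 75% smaller
```

The same ratio bounds the bandwidth gain: if generation is limited by reading weights from memory, reading a quarter of the bytes per token is where the speedup comes from.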

In one sentence: quantization trades local rounding error for a several-fold gain in overall storage, bandwidth, and compute efficiency.
