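Problem Statement

In NLP, fine-tuning large pretrained language models for downstream tasks has become standard practice. Typically, the pretrained model is adapted to the downstream task by full fine-tuning of all its parameters. For large models, however, fine-tuning consumes a great deal of memory and compute, which raises the barrier to entry.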

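Key Takeaways

💡✂️ QLoRA: quantizing the base model to 4 bits reduces memory use while preserving performance, which greatly lowers the barrier to fine-tuning.

a. The core method is a new data type for quantization, NormalFloat.

b. The core idea is to shrink the base model's GPU memory footprint through quantization, so that a 65B model can be trained on a single GPU.

🆕🚀 Guanaco: a newly released model family based on LLaMA that outperforms all previously openly released models on the Vicuna benchmark.

a. Under the same storage budget, a 4-bit 33B model outperforms an 8-bit 13B model.

🔍📊 Misc: current evaluation methods have problems, and training data for a specific downstream task needs to be of higher quality.

a. Automatic evaluation with GPT-4 is biased toward first impressions and cannot accurately assess chatbot performance.

b. For a specific task, the suitability and quality of the training data matter more than its quantity.

📈🎯 Outlier Matters: outliers in the weights of large language models are concentrated and have a large effect on model performance.

a. The larger the model, the greater the effect of outliers on performance and the more the model depends on them.

b. Outliers are few, concentrated in a fixed handful of columns, and located at the prefix of the model output; they may store context-independent information.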

Solution

Overview

LoRA
$$\mathbf{Y}^{\mathrm{BF16}}=\mathbf{X}^{\mathrm{BF16}}\mathbf{W}^{\mathrm{BF16}}+\mathbf{X}^{\mathrm{BF16}}\mathbf{L}_1^{\mathrm{BF16}}\mathbf{L}_2^{\mathrm{BF16}}$$
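
To make the shapes in the LoRA formula concrete, here is a minimal pure-Python sketch of the forward pass. The helper names `matmul`, `add`, and `lora_forward` are illustrative and do not come from any library, and ordinary Python floats stand in for BF16.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, w, l1, l2):
    """Y = X W + X L1 L2: frozen base weight W plus the low-rank update L1 L2."""
    return add(matmul(x, w), matmul(matmul(x, l1), l2))


# Toy shapes (assumed for illustration): batch 1, hidden size 2, LoRA rank 1.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, h x o
l1 = [[0.5], [0.5]]            # trainable adapter, h x r
l2 = [[1.0, -1.0]]             # trainable adapter, r x o
y = lora_forward(x, w, l1, l2)
```

Only `l1` and `l2` would receive gradients during training; `w` stays fixed, which is why it can be stored in a compressed form.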

QLoRA
$$\mathbf{Y}^{\mathrm{BF16}}=\mathbf{X}^{\mathrm{BF16}}\,\text{doubleDequant}\left(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}, \mathbf{W}^{\mathrm{NF4}}\right)+\mathbf{X}^{\mathrm{BF16}}\mathbf{L}_1^{\mathrm{BF16}}\mathbf{L}_2^{\mathrm{BF16}}$$
$$\text{doubleDequant}\left(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}, \mathbf{W}^{k\text{-bit}}\right)=\operatorname{dequant}\left(\operatorname{dequant}\left(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}\right), \mathbf{W}^{4\text{bit}}\right)=\mathbf{W}^{\mathrm{BF16}}$$
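
The nested `dequant` calls can be read as two lookups in sequence. The sketch below is a simplified illustration rather than the bitsandbytes implementation: the function names are hypothetical, a plain linear 8-bit code stands in for FP8, the mean-centering of the constants is omitted, and the NF4 table is a rounded copy of the values listed in this article.

```python
# Rounded copy of the 16 NF4 code book values (assumption: 4 decimals are enough here).
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def dequant_constants(c1, c2_codes):
    """Inner dequant: recover the per-block absmax constants from 8-bit codes.

    Assumption: a linear code (code / 255 * c1) replaces the FP8 format.
    """
    return [c1 * code / 255 for code in c2_codes]


def dequant_weights(absmax, w_codes, block_size):
    """Outer dequant: look up each 4-bit code and rescale by its block's absmax."""
    return [absmax[i // block_size] * NF4[code] for i, code in enumerate(w_codes)]


def double_dequant(c1, c2_codes, w_codes, block_size=64):
    """doubleDequant(c1, c2, W) = dequant(dequant(c1, c2), W)."""
    return dequant_weights(dequant_constants(c1, c2_codes), w_codes, block_size)


# Two blocks of two weights each; the blocks have absmax 2.0 and 0.4.
weights = double_dequant(c1=2.0, c2_codes=[255, 51], w_codes=[15, 0, 7, 15], block_size=2)
```

In this toy case the first block decodes to the extremes of its range and the second block to zero and its own maximum, which shows that each block is rescaled independently.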

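QLoRA reduces memory use enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving the task performance of full 16-bit fine-tuning. The newest model, Guanaco, outperforms all previously openly released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level while requiring only 24 hours of fine-tuning on a single GPU.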

4-bit NormalFloat Quantization

Quant

SimpleQuant-Int4

quantiles = [
    -1., -0.86666667, -0.73333333, -0.6, -0.46666667,
    -0.33333333, -0.2, -0.06666667, 0.06666667, 0.2,
    0.33333333, 0.46666667, 0.6, 0.73333333, 0.86666667, 1.
]
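
The list above is just 16 evenly spaced points on [-1, 1]. As a rough sketch of how such a code book is used, the snippet below regenerates it and performs absmax quantization of one block by nearest-neighbor lookup. The function names are illustrative and not from any library.

```python
def uniform_quantiles(bits=4):
    """Evenly spaced code book on [-1, 1] with 2**bits entries."""
    n = 2 ** bits
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]


def quantize(block, codebook):
    """Scale the block by its absolute maximum, then pick the nearest code per value."""
    absmax = max(abs(v) for v in block) or 1.0   # avoid dividing by zero for an all-zero block
    codes = [min(range(len(codebook)), key=lambda j: abs(codebook[j] - v / absmax))
             for v in block]
    return absmax, codes


def dequantize(absmax, codes, codebook):
    """Map codes back to values and undo the absmax scaling."""
    return [absmax * codebook[c] for c in codes]


codebook = uniform_quantiles(4)
absmax, codes = quantize([0.3, -1.5, 0.75, 1.5], codebook)
restored = dequantize(absmax, codes, codebook)
```

Here 0.75 falls between two code book entries and is restored as roughly 0.7, which is the quantization error this section is concerned with.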

FP4 Quant

huggingface.co/blog/4bit-t...

github.com/TimDettmers...

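# Each 4-bit code holds 1 sign bit, 2 exponent bits and 1 mantissa bit:
#   v = sign * (2 ** exp) * (1 + man)
# Raw representable values:
spliter = [
    -12, -8, -6, -4, -3, -2, -0.0625,
    0, 0.0625, 2, 3, 4, 6, 8, 12]
# The same values divided by the largest magnitude (12):
quantiles = [
    -1.0, -0.6667, -0.5, -0.3333, -0.25, -0.1667, -0.0052, 0.0,
    0.0052, 0.1667, 0.25, 0.3333, 0.5, 0.6667, 1.0]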

__device__ float d2DequantizeFP4(unsigned char val)
{
  float sign = (val & 0b1000) == 8 ? -1.0f : 1.0f;
  if((val & 0b0110) == 0)
  {
    // subnormal
    if((val & 0b0001) == 0)
      return 0.0f;
    else
      return sign*0.0625f;
  }
  else
  {
    // normal
    float exponent = ((val & 0b0100) == 4 ? 2.0f : 8.0f) + ((val & 0b0010) == 2 ? 0.0f : 2.0f);
    float fraction = (val & 0b0001) == 1 ? 1.5f : 1.0f;

    return sign*exponent*fraction;
  }
}
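
As a cross-check, the CUDA routine above can be ported line by line to Python and run over all 16 codes. This port is only an illustration (the name `dequantize_fp4` is made up here); it shows that the kernel produces exactly the raw values and normalized quantiles listed for FP4.

```python
def dequantize_fp4(val):
    """Decode a 4-bit code laid out as [sign | 2 exponent bits | 1 mantissa bit]."""
    sign = -1.0 if val & 0b1000 else 1.0
    if (val & 0b0110) == 0:
        # Subnormal: both exponent bits are zero.
        return 0.0 if (val & 0b0001) == 0 else sign * 0.0625
    # Normal: same arithmetic as the kernel, which yields a magnitude of 2, 4 or 8.
    exponent = (2.0 if val & 0b0100 else 8.0) + (0.0 if val & 0b0010 else 2.0)
    fraction = 1.5 if val & 0b0001 else 1.0
    return sign * exponent * fraction


# Both zero codes decode to 0.0, so 16 codes give 15 distinct values.
values = sorted({dequantize_fp4(code) for code in range(16)})
normalized = [v / 12.0 for v in values]
```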

NF4 Quant

4-bit NormalFloat is a data type that preserves an exact zero point during quantization and uses all $2^k$ values of a $k$-bit data type. It builds an asymmetric data type by estimating the quantiles $q_i$ of two ranges: $2^{k-1}$ quantiles for the negative part $[-1, 0]$ and $2^{k-1}+1$ for the positive part $[0, 1]$. The two sets of $q_i$ are then merged, and one of the two zeros that appear in both sets is removed. The resulting data type has an equal expected number of values in each quantization bin, so it is called $k$-bit NormalFloat ($\text{NF}_k$); it is information-theoretically optimal for zero-centered, normally distributed data.

The quantiles are computed with the following formula:
$$q_i=\frac{1}{2}\left(Q_X\left(\frac{i}{2^k+1}\right)+Q_X\left(\frac{i+1}{2^k+1}\right)\right)$$

where $Q_X(\cdot)$ is the quantile function of the standard normal distribution $N(0,1)$.

For NF4, the quantile function of the standard normal distribution splits $[-1, 0]$ into 7 intervals, giving 8 quantiles $[-1, \ldots, 0]$, and splits $[0, 1]$ into 8 intervals, giving 9 quantiles $[0, \ldots, 1]$. Merging the two sets and dropping one of the zeros yields all 16 quantiles.

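Algorithm Explanation

In brief, the create_normal_map function in bitsandbytes creates the 16 values of the NF4 data type and pads them with zeros so that they can be used by the 8-bit quantization functions (256 values, of which 256-16 are zeros). In this way the library "simulates" NF4 with its 8-bit quantization methods. The algorithm may look cryptic, so here is the intuition:

The goal is to find quantiles that have equal area to their left and right. This means we do not start from the 0 or 1 quantile of the normal distribution but from an offset quantile. The code calls this start position "offset", and its value is 1-1/(2*15). For an asymmetric data type, one side has spacing equivalent to 16 "halves" around each quantile, while the other side has only 15 "halves". The offset is therefore, on average, (1-1/(2*15) + 1-1/(2*16))/2 = 0.9677083.

We use the norm.ppf function to obtain the quantiles of the standard normal distribution N(0, 1), then rescale them by dividing by the absolute maximum value.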

Original Reply

Hi Xinyu,

You can find a code snippet in the bitsandbytes library: github.com/TimDettmers...

To give you a short explanation of this function: It creates the 16 values for the NF4 data type and then pads it with zeros so it can be used in 8-bit quantization functions (256 values, and 256-16 zeros). This "simulates" NF4 with the 8-bit quantization methods in the bitsandbytes library. The algorithm might be a little cryptic, but here is more intuition:

We want to find the quantiles which have equal area to the left and the right side of the quantile. This means, we do not start with the 0 or 1 quantile for the normal distribution, but with an offset quantile. This start position is called offset in the code snippet and is 1-1/(2*15). If we have an asymmetric data type, we have one side with spacing equivalent to 16 "halves" around each quantile and the other side with 15 halves. As such, the offset is on average (1-1/(2*15) + 1-1/(2*16))/2 = 0.9677083.

We use the norm.ppf function which gives the quantiles for the standard normal distribution (N(0, 1))

We rescale the quantile values by dividing by the absolute maximum value

Let me know if the details are still unclear or if you have any more questions.

Best regards,

Tim

Code

github.com/TimDettmers...

import torch
from scipy.stats import norm


def create_normal_map(offset=0.9677083, use_extra_value=True):
    # Build the NormalFloat code book, padded with zeros to 256 entries
    # so that it can be used by the 8-bit quantization functions.
    if use_extra_value:
        # one more positive value, this is an asymmetric type
        v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
        v2 = [0]*(256-15) ## we have 15 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
        v = v1 + v2 + v3
    else:
        v1 = norm.ppf(torch.linspace(offset, 0.5, 8)[:-1]).tolist()
        v2 = [0]*(256-14) ## we have 14 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
        v = v1 + v2 + v3

    # Sort, then rescale so that the largest value is exactly 1.
    values = torch.Tensor(v)
    values = values.sort().values
    values /= values.max()
    assert values.numel() == 256
    return values

github.com/TimDettmers...

quantiles =  [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0
]
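
The 16 values above can be reproduced without torch or scipy. The sketch below is a standard-library re-implementation of the asymmetric case, with `statistics.NormalDist().inv_cdf` standing in for `norm.ppf`; the helper names `linspace` and `nf4_values` are illustrative, and the zero padding to 256 entries is left out.

```python
from statistics import NormalDist


def linspace(start, stop, num):
    """Evenly spaced points from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def nf4_values():
    """Return the 16 NF4 code book values, sorted and scaled to [-1, 1]."""
    # Offset from the explanation: average of 1 - 1/(2*15) and 1 - 1/(2*16).
    offset = 0.5 * ((1 - 1 / (2 * 15)) + (1 - 1 / (2 * 16)))
    ppf = NormalDist().inv_cdf                                     # quantile function of N(0, 1)
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]    # 8 positive values
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]   # 7 negative values
    values = sorted(negative + [0.0] + positive)                  # a single shared zero
    scale = max(values)
    return [v / scale for v in values]


nf4 = nf4_values()
```

Dropping the last point of each `linspace` removes the 0.5 quantile, which would map to zero; the single explicit `0.0` then plays the role of the shared zero point.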

Compare

quant_compare.ipynb

Layer error

Comparing the input/output quantization error when the weights of a single layer (opt-125m, layer 6) are quantized, we find:

When no dataset data is used for quantization, quality ranks NF4 > GPTQ-0 > FP4.
When dataset data is used for quantization, NF4 performs on par with GPTQ calibrated on a $10^{-5}$ fraction of the test data.
q_proj, k_proj, and fc2 are more sensitive to quantization: their quantization error is clearly higher than that of v_proj, out_proj, and fc1.
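
A much simpler experiment than the notebook's can be sketched in a few lines: the weight reconstruction error of each code book on synthetic Gaussian weights. This is an illustrative setup under stated assumptions (random weights with standard deviation 0.02, block size 64, rounded code books, a hypothetical `blockwise_mse` helper); it measures weight error rather than layer input/output error and does not reproduce the opt-125m results above.

```python
import random

INT4 = [-1.0 + 2.0 * i / 15 for i in range(16)]
FP4 = [-1.0, -0.6667, -0.5, -0.3333, -0.25, -0.1667, -0.0052, 0.0,
       0.0052, 0.1667, 0.25, 0.3333, 0.5, 0.6667, 1.0]
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def blockwise_mse(weights, codebook, block_size=64):
    """Mean squared error after absmax quantization of each block to the code book."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0   # guard against an all-zero block
        for v in block:
            nearest = min(codebook, key=lambda q: abs(q - v / absmax))
            total += (v - absmax * nearest) ** 2
    return total / len(weights)


rng = random.Random(0)                                # fixed seed for repeatability
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
errors = {name: blockwise_mse(weights, book)
          for name, book in [("Int4", INT4), ("FP4", FP4), ("NF4", NF4)]}
```

Printing `errors` gives one number per data type; swapping in real layer weights is the natural next step.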

c4(ppl)

github.com/qwopqwop200...

Infer latency

From Author of AutoGPTQ: PanQiWei - Overview

Double Quantization

During quantization, the weights are quantized block by block to limit the influence of outliers.

Specifically, every 64 parameters share one quantization constant (absmax, 32 bits), which adds $32/64 = 0.5$ bits of overhead per parameter. This overhead is fairly large, so to reduce it further the quantization constants are themselves quantized, a step called Double Quantization. The constants are quantized to FP8 with a block size of 256, which lowers the per-parameter overhead to:
$$8/64 + 32/(64 \cdot 256) = 0.127 \text{ bits}$$
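
The overhead arithmetic is easy to check numerically. The short sketch below recomputes both figures and, assuming a round 65 billion parameters, estimates the memory saved by the second quantization step; the variable names are illustrative.

```python
BLOCK = 64     # weights that share one first-level quantization constant
BLOCK2 = 256   # first-level constants that share one second-level constant

# Bits of overhead per parameter with a single FP32 constant per block.
single = 32 / BLOCK

# Bits per parameter with FP8 constants plus FP32 second-level constants.
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)

saved_bits = single - double
# Assumption: exactly 65e9 parameters; converts bits to gigabytes (10**9 bytes).
saved_gb_65b = 65e9 * saved_bits / 8 / 1e9
```

Under these assumptions the saving is about 0.373 bits per parameter, or roughly 3 GB for a 65B model.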

Paged Optimizers

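When the GPU occasionally runs out of memory, pages are transferred automatically between the CPU and the GPU to avoid a GPU OOM error. The feature works like regular memory paging between CPU RAM and disk. QLoRA uses it to allocate paged memory for the optimizer states: they are automatically evicted to CPU RAM when GPU memory runs short, and paged back into GPU memory when the optimizer update step needs them.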
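
The eviction and page-in behavior can be pictured with a toy model. The class below is purely hypothetical and is not how the feature is implemented: real paging is handled by the GPU driver at memory-page granularity, whereas this sketch keeps a least-recently-used set of named states on a pretend GPU and moves the rest to a pretend CPU.

```python
from collections import OrderedDict


class PagedStateStore:
    """Toy model of paged optimizer state with a fixed number of GPU-resident pages."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity   # max pages resident on the pretend GPU
        self.gpu = OrderedDict()           # name -> state, least recently used first
        self.cpu = {}                      # pages evicted to pretend CPU RAM

    def access(self, name, default=0.0):
        """Return the state for `name`, paging it onto the GPU if necessary."""
        if name in self.gpu:
            self.gpu.move_to_end(name)     # mark as most recently used
            return self.gpu[name]
        state = self.cpu.pop(name, default)            # page in, or allocate fresh
        if len(self.gpu) >= self.gpu_capacity:
            victim, victim_state = self.gpu.popitem(last=False)
            self.cpu[victim] = victim_state            # evict the least recently used page
        self.gpu[name] = state
        return state

    def write(self, name, value):
        """Update a state; the page is made GPU-resident first."""
        self.access(name)
        self.gpu[name] = value


store = PagedStateStore(gpu_capacity=2)
store.write("layer0.exp_avg", 0.1)
store.write("layer1.exp_avg", 0.2)
store.write("layer2.exp_avg", 0.3)            # GPU is full, so layer0 is evicted
restored = store.access("layer0.exp_avg")     # paged back in; layer1 is evicted instead
```

The point of the sketch is that evicted state is never lost: it comes back unchanged when the update step touches it again, at the cost of a transfer.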

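Experiments

Model Comparison

Guanaco scores well, although the automatic evaluation system has some problems and the comparison is not entirely fair.

Elo Rating

The comparison shows that Guanaco scores well while using less storage. The automatic evaluation system is also clearly biased: it gives higher scores to whichever result appears first, and GPT-4 rates its own outputs higher than human raters do.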
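
For readers unfamiliar with Elo ratings, the standard update rule fits in a few lines. The sketch below uses the textbook formula; the starting rating of 1000 and K = 32 are assumptions chosen for illustration, and the function names are not from any evaluation library.

```python
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, score_a, k=32):
    """Return the new ratings after one match; score_a is 1 (win), 0.5 (tie) or 0 (loss)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b


# Two models start level; model A wins one head-to-head comparison.
rating_a, rating_b = elo_update(1000, 1000, score_a=1.0)
```

Because each update depends on match order, ratings of this kind are usually averaged over many random orderings of the same set of comparisons.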

Harmless

Future Works

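How does QLoRA compare with full fine-tuning on larger models?

Only LoRA has been used as the training method so far; how well do other PEFT methods work?

Evaluation across different datasets does not determine performance on a specific task. Can better evaluation criteria be developed?

The benchmark also contains prompts in different languages; to what extent does multilingual training improve performance on non-English instructions?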
