"Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks." 以GPT-3 175B为例,部署独立的、经过微调的模型实例,每个实例都拥有1750亿参数,这在成本上是不可承受的。我们提出了低秩适应方法(Low-Rank Adaptation,简称LoRA),该方法冻结预训练模型的权重,并在Transformer架构的每一层中注入可训练的低秩分解矩阵,从而大大减少了下游任务中需要训练的参数数量。
"Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times." 与使用Adam优化器微调的GPT-3 175B相比,LoRA可以将可训练参数的数量减少至原来的1/10,000,并将GPU内存需求降低3倍。
"LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency." 尽管LoRA的可训练参数较少、训练吞吐量更高,并且与适配器不同的是不会增加额外的推理延迟,但其在RoBERTa、DeBERTa、GPT-2和GPT-3上的模型质量与微调相当或更优。
原文描述:"As larger models are trained every few months, this changes from a mere 'inconvenience' for GPT-2 (Radford et al., b) or RoBERTa large (Liu et al., 2019) to a critical deployment challenge for GPT-3 (Brown et al., 2020) with 175 billion trainable parameters."
原文描述:"However, existing techniques often introduce inference latency (Houlsby et al., 2019; Rebuffi et al., 2017) by extending model depth or reduce the model's usable sequence length (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these method often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality." 然而,现有技术通常通过增加模型深度而引入推理延迟(Houlsby等人,2019;Rebuffi等人,2017),或通过减少模型的可用序列长度来达到效果(Li & Liang, 2021;Lester等人,2021;Hambardzumyan等人,2020;Liu等人,2021)(参见第3节)。更重要的是,这些方法往往无法达到微调基线的效果,从而在效率与模型质量之间造成权衡。
原文描述:"We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low 'intrinsic rank', leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers' change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1." 我们从Li等人(2018a)和Aghajanyan等人(2020)的研究中获得启发,这些研究表明,学习到的过参数化模型实际上存在于一个低本质维度中。我们假设模型适应过程中权重的变化也具有低"本质秩",由此提出了低秩适应(Low-Rank Adaptation,LoRA)方法。LoRA使我们能够通过优化神经网络中密集层在适应过程中的变化的秩分解矩阵来间接训练某些密集层,同时保持预训练权重不变,如图1所示。
The language modeling problem and the challenge of full fine-tuning: adapting to a downstream task (e.g., text summarization, machine reading comprehension, or natural-language-to-SQL generation) usually means fine-tuning a pre-trained autoregressive language model $P_\Phi(y|x)$ with parameters $\Phi$. Each downstream task comes with a training dataset $Z = \{(x_i, y_i)\}$, where both $x_i$ and $y_i$ are token sequences. For example, in the NL2SQL task, $x_i$ is a natural language query and $y_i$ is the corresponding SQL command.
原文描述: "Suppose we are given a pre-trained autoregressive language model P Φ ( y ∣ x ) P_\Phi(y|x) PΦ(y∣x)parametrized by Φ \Phi Φ. For instance, P Φ ( y ∣ x ) P_\Phi(y|x) PΦ(y∣x)can be a generic multi-task learner such as GPT"
"Consider adapting this pre-trained model to downstream conditional text generation tasks"
"Each downstream task is represented by a training dataset of context-target pairs: Z = ( x i , y i ) i = 1 , . . . , N Z = {(x_i, y_i)}{i=1,...,N} Z=(xi,yi)i=1,...,N..." 每个下游任务都由一组上下文-目标对组成的数据集表示 。数据集记为 Z = ( x i , y i ) i = 1 , . . . , N Z = {(x_i, y_i)}{i=1,...,N} Z=(xi,yi)i=1,...,N,其中 x i x_i xi和 y i y_i yi是由标记序列组成的。例如,在NL2SQL任务中, x i x_i xi是自然语言查询, y i y_i yi是其对应的SQL指令。
Full fine-tuning and its problem: during full fine-tuning, the model is initialized to the pre-trained weights $\Phi_0$ and updated to $\Phi_0 + \Delta\Phi$ by repeatedly following the gradient that maximizes the conditional language modeling objective: $$\max_\Phi \sum_{(x, y) \in Z} \sum_{t=1}^{|y|} \log P_\Phi(y_t \mid x, y_{<t})$$
Under the parameter-efficient reformulation, the same objective is instead optimized over a much smaller parameter set $\Theta$: $$\max_\Theta \sum_{(x, y) \in Z} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})$$
原文描述:"In this paper, we adopt a more parameter-efficient approach, where the task-specific parameter increment Δ Φ = Δ Φ ( Θ ) \Delta \Phi = \Delta \Phi(\Theta) ΔΦ=ΔΦ(Θ)is further encoded by a much smaller-sized set of parameters Θ \Theta Θwith ∣ Θ ∣ ≪ ∣ Φ 0 ∣ |\Theta| \ll |\Phi_0| ∣Θ∣≪∣Φ0∣. The task of finding Δ Φ \Delta \Phi ΔΦthus becomes optimizing over Θ \Theta Θ."
原文描述: "In the subsequent sections, we propose to use a low-rank representation to encode Δ Φ \Delta \Phi ΔΦthat is both compute- and memory-efficient. When the pre-trained model is GPT-3 175B, the number of trainable parameters ∣ Θ ∣ |\Theta| ∣Θ∣can be as small as 0.01% of ∣ Φ 0 ∣ |\Phi_0| ∣Φ0∣."
原文描述:"There are many variants of adapters. We focus on the original design by Houlsby et al. (2019) which has two adapter layers per Transformer block and a more recent one by Lin et al. (2020) which has only one per block but with an additional LayerNorm (Ba et al., 2016)."
原文描述:"While one can reduce the overall latency by pruning layers or exploiting multi-task settings (Rücklé et al., 2020; Pfeiffer et al., 2021), there is no direct ways to bypass the extra compute in adapter layers."
原文描述:"We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters, confirming similar observations in the original paper."
Concretely, for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA represents its update as $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. This low-rank representation drastically reduces the number of trainable parameters.
原文描述: "For a pre-trained weight matrix W 0 ∈ R d × k W_0 \in \mathbb{R}^{d \times k} W0∈Rd×k, we constrain its update by representing the latter with a low-rank decomposition W 0 + Δ W = W 0 + B A W_0 + \Delta W = W_0 + BA W0+ΔW=W0+BA, where B ∈ R d × r B \in \mathbb{R}^{d \times r} B∈Rd×r, A ∈ R r × k A \in \mathbb{R}^{r \times k} A∈Rr×k, and the rank r ≪ min ( d , k ) r \ll \min(d, k) r≪min(d,k)."
Training procedure and the update equation:
In LoRA, the pre-trained weights $W_0$ are **frozen** and never updated during training; only $A$ and $B$ are trained.
An input vector $x$ passes through both $W_0$ and the low-rank branch, and the outputs are summed: $h = W_0 x + \Delta W x = W_0 x + BAx$
原文描述: "During training, W 0 W_0 W0is frozen and does not receive gradient updates, while A A Aand B B Bcontain trainable parameters. Note both W 0 W_0 W0and Δ W = B A \Delta W = BA ΔW=BAare multiplied with the same input, and their respective output vectors are summed coordinate-wise."
Initialization and scaling:
In LoRA, $A$ is initialized with random Gaussian values and $B$ with zeros, so $\Delta W = BA$ is zero at the start of training.
To make it easier to move between different rank values $r$, LoRA introduces a scaling factor $\alpha$, which reduces the need to retune hyperparameters when $r$ changes.
原文描述:"We use a random Gaussian initialization for A A Aand zero for B B B, so Δ W = B A \Delta W = BA ΔW=BAis zero at the beginning of training. We then scale Δ W x \Delta Wx ΔWxby α / r \alpha / r α/r, where α \alpha αis a constant in r r r."
原文描述:"LoRA takes a step further and does not require the accumulated gradient update to weight matrices to have full-rank during adaptation. This means that when applying LoRA to all weight matrices and training all biases, we roughly recover the expressiveness of full fine-tuning by setting the LoRA rank r r rto the rank of the pre-trained weight matrices."
No additional inference latency:
By design, LoRA lets us compute and store $W = W_0 + BA$ directly at deployment time, so inference incurs no extra latency compared to a fully fine-tuned model. For scenarios that require frequent task switching, LoRA can swap the adaptation weights quickly, making task switching cheap.
原文描述:"Critically, this guarantees that we do not introduce any additional latency during inference compared to a fine-tuned model by construction."
Applying LoRA to Transformer
在"4.2 Applying LoRA to Transformer"的这部分,论文详细说明了如何将LoRA方法应用于Transformer架构中。
原文描述:"In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module ( W q W_q Wq, W k W_k Wk, W v W_v Wv, W o W_o Wo) and two in the MLP module."
原文描述:"We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity and parameter-efficiency."
"The most significant benefit comes from the reduction in memory and storage usage" 最大的好处来自于内存和存储使用的减少。
"On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to 350GB." 在GPT-3 175B上,我们将训练期间的VRAM消耗从1.2TB减少到350GB。
Limitations: LoRA also has its limits in some scenarios. For example, when different tasks require different $A$ and $B$ matrices, merging $A$ and $B$ into the original weights makes it difficult to batch inputs from those tasks in a single forward pass.
原文描述 :"LoRA also has its limitations. For example, it is not straightforward to batch inputs to different tasks with different A A Aand B B Bin a single forward pass, if one chooses to absorb A A Aand B B Binto W W Wto eliminate additional inference latency."
In random Gaussian initialization, the initial values of a weight matrix are drawn from a Gaussian distribution. Concretely, each element $w$ of the weight matrix $W$ is sampled as $w \sim \mathcal{N}(\mu, \sigma^2)$.
论文原文:"We evaluate the downstream task performance of LoRA on RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT-2 (Radford et al., b), before scaling up to GPT-3 175B (Brown et al., 2020). Our experiments cover a wide range of tasks, from natural language understanding (NLU) to generation (NLG)."
原文:"Fine-Tuning (FT) is a common approach for adaptation. During fine-tuning, the model is initialized to the pre-trained weights and biases, and all model parameters undergo gradient updates. A simple variant is to update only some layers while freezing others. We include one such baseline reported in prior work (Li & Liang, 2021) on GPT-2, which adapts just the last two layers (FTTop2)" 微调(FT)是一种常见的适应方法。在微调过程中,模型被初始化为预训练的权重和偏置,且所有模型参数都进行梯度更新。一个简单的变体是只更新部分层,而冻结其他层。我们包含了先前研究中报告的一个基线(Li & Liang, 2021)在GPT-2上的方法,该方法仅适应最后两层(FTTop2)。
Formula: the number of trainable parameters for prefix-embedding tuning is $|\Theta| = d_{model} \times (l_p + l_i)$, where $d_{model}$ is the model's hidden dimension and $l_p$ and $l_i$ are the numbers of prefix and infix tokens, respectively.
原文:"Prefix-embedding tuning (PreEmbed) inserts special tokens among the input tokens. These special tokens have trainable word embeddings and are generally not in the model's vocabulary"
Formula: for adapter tuning, the number of trainable parameters is $|\Theta| = L_{Adpt} \times (2 \times d_{model} \times r + r + d_{model}) + 2 \times L_{LN} \times d_{model}$, where $L_{Adpt}$ is the number of adapter layers and $L_{LN}$ the number of trainable LayerNorms.
原文:"Adapter tuning as proposed in Houlsby et al. (2019) inserts adapter layers between the self-attention module (and the MLP module) and the subsequent residual connection. There are two fully connected layers with biases in an adapter layer with a nonlinearity in between"
原文观点:"Transformer (Vaswani et al., 2017) is a sequence-to-sequence architecture that makes heavy use of self-attention. Radford et al. (a) applied it to autoregressive language modeling by using a stack of Transformer decoders. Since then, Transformer-based language models have dominated NLP, achieving the state-of-the-art in many tasks."
原文观点:"While GPT-3 175B can adapt its behavior with just a few additional training examples, the result depends heavily on the input prompt (Brown et al., 2020). This necessitates an empirical art of composing and formatting the prompt to maximize a model's performance on a desired task, which is known as prompt engineering or prompt hacking."
原文观点:"Many have proposed inserting adapter layers between existing layers in a neural network (Houlsby et al., 2019; Rebuffi et al., 2017; Lin et al., 2020). Our method uses a similar bottleneck structure to impose a low-rank constraint on the weight updates."
From the paper: "Given the empirical advantage of LoRA, we hope to further explain the properties of the low-rank adaptation learned from downstream tasks. Note that the low-rank structure not only lowers the hardware barrier to entry which allows us to run multiple experiments in parallel, but also gives better interpretability of how the update weights are correlated with the pre-trained weights. We focus our study on GPT-3 175B, where we achieved the largest reduction of trainable parameters (up to 10,000×) without adversely affecting task performances."
From the paper: "We draw several conclusions from Table 7. First, ∆W has a stronger correlation with W compared to a random matrix, indicating that ∆W amplifies some features that are already in W. Second, instead of repeating the top singular directions of W, ∆W only amplifies directions that are not emphasized in W. This suggests that the low-rank adaptation matrix potentially amplifies the important features for specific downstream tasks that were learned but not emphasized in the general pre-training model."
Best weight combination: experiments show that applying LoRA to both $W_q$ and $W_v$ (i.e., low-rank adaptation of the two matrices together) yields the best overall performance: 73.7% accuracy on WikiSQL and 91.7% on MultiNLI, better than adapting a single weight type ($W_q$ or $W_v$) alone.
Why the combination matters: compared with putting the entire parameter budget into $\Delta W_q$ or $\Delta W_k$, adapting both $W_q$ and $W_v$ improves performance markedly. Even at a small rank, the combination captures enough information to outperform adapting a single weight type with a larger rank.
The results therefore suggest that, under a fixed parameter budget, one should prefer adapting multiple weight types (such as $W_q$ and $W_v$) to make the most of LoRA's parameter efficiency, which matters especially when adapting very large models to downstream tasks.
⭐ What is the Optimal Rank $r$ for LoRA?
Analysis:
Several points follow from this part:
The effect of rank on performance: LoRA's performance varies noticeably across rank values. On WikiSQL and MultiNLI, adapting different combinations of self-attention weight matrices gives different accuracies. With a small rank (e.g., $r=1$), adapting both $W_q$ and $W_v$ still works well, whereas adapting $W_q$ alone needs a larger rank to reach similar performance.
A small rank is already enough: even a very small rank (e.g., $r=1$) lets LoRA perform satisfactorily on the $W_q$, $W_v$ combination. For these datasets, the adaptation matrix $\Delta W$ likely has a very small "intrinsic rank". Follow-up experiments examine the overlap of the subspaces learned with different ranks and random seeds to support this finding.
Subspace similarity across ranks: given adaptation matrices $A_{r=8}$ and $A_{r=64}$ learned with ranks $r=8$ and $r=64$, singular value decomposition yields their right-singular-vector matrices $U_{A_{r=8}}$ and $U_{A_{r=64}}$. To measure how much the two subspaces overlap, the paper computes the normalized subspace similarity $\phi(A_{r=8}, A_{r=64}, i, j)$, based on the Grassmann distance, between the subspace spanned by the first $i$ singular vectors and the one spanned by the first $j$ singular vectors.
The formula is: $$\phi(A_{r=8}, A_{r=64}, i, j) = \frac{\|(U_{A_{r=8}}^i)^T U_{A_{r=64}}^j\|_F^2}{\min(i, j)}$$
where $U_{A_{r=8}}^i$ denotes the first $i$ columns of $U_{A_{r=8}}$. The Frobenius norm is the square root of the sum of the squares of all matrix entries, commonly used to measure the size of a matrix or the distance between matrices.
This analysis provides a basis for understanding how the adaptation subspaces overlap across different ranks; a small sketch follows.
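A minimal NumPy sketch of the measurement, using random stand-ins for the learned $A_{r=8}$ and $A_{r=64}$ (which are not available here):

```python
import numpy as np

def subspace_similarity(A1, A2, i, j):
    """phi(A1, A2, i, j) = ||(U1_i)^T U2_j||_F^2 / min(i, j),
    where U1, U2 hold the top right-singular vectors of A1, A2."""
    U1 = np.linalg.svd(A1)[2].T   # columns = right-singular vectors of A1
    U2 = np.linalg.svd(A2)[2].T
    M = U1[:, :i].T @ U2[:, :j]
    return np.linalg.norm(M, "fro") ** 2 / min(i, j)

rng = np.random.default_rng(0)
d = 128
A_r8  = rng.normal(size=(8, d))
A_r64 = rng.normal(size=(64, d))

# phi is bounded in [0, 1]; comparing a subspace with itself gives 1.
print(subspace_similarity(A_r8, A_r64, 4, 32))  # low for independent random matrices
print(subspace_similarity(A_r8, A_r8, 8, 8))    # 1.0
```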
Figure 3 shows the subspace similarity between the column vectors of $\Delta W_q$ and $\Delta W_v$ for $A_{r=8}$ and $A_{r=64}$.
The first two panels show the full subspace-similarity maps for $\Delta W_q$ and $\Delta W_v$, while the last two zoom in on the lower-left corner of the first two.
One can observe that the top directions in $r=8$ are contained in $r=64$, and vice versa.
The figure thus shows that even at a small rank (e.g., $r=8$), the dominant singular-vector directions are also present in higher-rank settings (e.g., $r=64$), supporting the stability of the subspace similarity across ranks.
This passage gives an important interpretation of the observations in Figure 3:
Singular-vector overlap: under the $r=8$ and $r=64$ settings, the top singular-vector directions of $\Delta W_v$ (and $\Delta W_q$) overlap significantly, while the remaining directions do not.
This means the dominant information directions in the two matrices agree, which explains why LoRA performs well on GPT-3's downstream tasks even at very small ranks (e.g., $r=1$).
Random noise in the subspace: since the $r=8$ and $r=64$ adaptation matrices are learned from the same pre-trained model, Figure 3 indicates that the top singular-vector directions of $A_{r=8}$ and $A_{r=64}$ carry the most useful information, while the other directions may mostly contain random noise accumulated during training.
This suggests the effective rank of the adaptation matrices can be very low, enabling efficient model adaptation without added complexity.
"顶级奇异向量方向"是指在矩阵的奇异值分解(SVD)中,对应于最大奇异值的奇异向量方向。
具体来说,假设我们对一个矩阵 M M M进行奇异值分解,可以得到: M = U Σ V T M=U \Sigma V^T M=UΣVT
其中:
U U U和 V V V是正交矩阵,包含了 M M M的左奇异向量和右奇异向量。
Σ \Sigma Σ是一个对角矩阵,对角线上包含了矩阵 M M M的奇异值,按从大到小的顺序排列。
顶级奇异向量方向通常是指在 U U U和 V V V中对应最大奇异值的奇异向量。
这个方向是矩阵数据中最主要的成分或模式,它捕捉了矩阵数据中方差最大的方向。
这些顶级奇异向量在矩阵的低秩近似和特征提取中非常重要,因为它们代表了矩阵的主要信息。
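A tiny NumPy illustration of extracting the top singular direction and forming the rank-1 approximation it defines:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, S, Vh = np.linalg.svd(M)
# S is sorted in descending order, so index 0 is the largest singular value.
u_top, sigma_top, v_top = U[:, 0], S[0], Vh[0, :]

# Rank-1 approximation from the top singular direction: the best
# rank-1 approximation of M in Frobenius norm (Eckart-Young).
M1 = sigma_top * np.outer(u_top, v_top)
print(np.linalg.norm(M - M1))   # residual = sqrt(sum of remaining sigma_k^2)
```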
This passage analyzes subspace similarity under different random seeds:
① Effect of random seeds: comparing the subspace similarity across two random seeds (at rank $r=64$), $\Delta W_q$ has a higher "intrinsic rank" than $\Delta W_v$, meaning the singular-vector directions learned for $\Delta W_q$ overlap more across seeds.
② Comparison with random Gaussian matrices: as a control, the authors also plot the similarity of two random Gaussian matrices, which share no common singular-vector directions. This further confirms that the subspace similarity of $\Delta W_q$ and $\Delta W_v$ is meaningful rather than an overlap caused by random noise.
The three panels of Figure 4 show heat maps of subspace similarity in different settings, to further test whether the subspace similarity of $\Delta W_q$ and $\Delta W_v$ is meaningful or merely caused by random noise.
Left and middle panels: the normalized subspace similarity between the column vectors of $\Delta W_q$, and of $\Delta W_v$, under different random seeds. The closer the color is to white (the higher the similarity), the more singular-value components the two matrices share along a given direction. The subspaces of $\Delta W_q$ and $\Delta W_v$ show clear similarity across random seeds, with $\Delta W_q$ noticeably more similar, indicating that its subspace is more stable and less sensitive to random noise.
Right panel: the similarity heat map for two random Gaussian matrices. Since random Gaussian matrices share no common singular-vector directions, their subspace similarity is very low and the map is close to black. If the similarity between the $\Delta W_q$ and $\Delta W_v$ subspaces were only noise, it would resemble this random-Gaussian case; instead, they show substantial subspace overlap, indicating that the similarity is structured and meaningful rather than a noise artifact.
By contrasting $\Delta W_q$ and $\Delta W_v$ with random Gaussian matrices, the authors thus confirm that the observed subspace similarity is task-related, reflecting specific information structure captured during learning rather than mere randomness.
The analysis in Table 7 quantifies how strongly $\Delta W_q$ correlates with $W_q$ along particular directions, focusing on how $\Delta W_q$ amplifies specific features when adapting to a downstream task.
These features exist in $W_q$ but are not prominently emphasized there.
The table entries are explained below:
$\|U^T W_q V^T\|_F$: the Frobenius norm of $W_q$ projected onto the $r$-dimensional subspace of $\Delta W_q$, where $U$ and $V$ are the left and right singular-vector matrices of $\Delta W_q$. This norm shows how much $W_q$ overlaps with the specific directions of $\Delta W_q$, i.e., which directions of $W_q$ are amplified by $\Delta W_q$.
Comparison: at $r=4$, the projection norm for $\Delta W_q$ is 0.32, versus 21.67 when $W_q$ is projected onto its own top directions and only 0.02 for a random matrix. This shows that the directions emphasized by $\Delta W_q$ are not prominent in $W_q$, yet are far more structured than random noise. At $r=64$, the projection norm for $\Delta W_q$ is 1.90, versus 37.71 for $W_q$ and 0.33 for a random matrix. This further indicates that, while the update directions of $\Delta W_q$ contain some information already present in $W_q$, they mainly amplify directions that were not prominently emphasized.
Interpretation: $\Delta W_q$ does not simply copy the dominant directions of $W_q$; it selectively amplifies features that are not prominent in $W_q$. This selective amplification suggests that a low-rank adaptation matrix such as $\Delta W_q$ can help the model exploit key task-relevant features in a downstream task without retraining the whole weight matrix, i.e., the adaptation is task-related rather than a random perturbation.
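A hedged NumPy sketch of the Table 7 quantity $\|U^T W_q V^T\|_F$ (the matrices here are random placeholders, so the numbers will not reproduce the paper's):

```python
import numpy as np

def projection_norm(W, DeltaW, r):
    """||U^T W V^T||_F: project W onto the top-r singular directions of DeltaW."""
    U, _, Vh = np.linalg.svd(DeltaW)
    return np.linalg.norm(U[:, :r].T @ W @ Vh[:r, :].T, "fro")

rng = np.random.default_rng(0)
d, r = 128, 4
W_q = rng.normal(size=(d, d))
DeltaW_q = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # rank-r update

print(projection_norm(W_q, DeltaW_q, r))  # overlap with DeltaW_q's directions
print(projection_norm(W_q, W_q, r))       # W_q onto its own top directions: large
print(projection_norm(W_q, rng.normal(size=(d, d)), r))  # random baseline: small
```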
"Few-shot learning, or prompt engineering, is very advantageous when we only have a handful of training samples. However, in practice, we can often afford to curate a few thousand or more training examples for performance-sensitive applications. As shown in Table 8, fine-tuning improves the model performance drastically compared to few-shot learning on datasets large and small." 当我们只有少量训练样本时,少样本学习或提示工程非常有利。然而,在实际应用中,对于对性能敏感的应用,我们通常可以收集数千个或更多的训练样本。正如表 8 所示,与少样本学习相比,微调在大小数据集上都能显著提升模型性能。
论文原文:"Adapter layers are external modules added to a pre-trained model in a sequential manner, whereas our proposal, LoRA, can be seen as external modules added in a parallel manner. Consequently, adapter layers must be computed in addition to the base model, inevitably introducing additional latency."
Experimental setup and adapter designs: the authors use two adapter designs, the original design of Houlsby et al. (2019) (denoted $\text{Adapter}^H$) and the more efficient variant of Lin et al. (2020) (denoted $\text{Adapter}^L$), to measure latency under different batch sizes and sequence lengths. As shown in Figure 5, the latency varies with the adapter design and the input configuration.
Latency analysis and Figure 5: Figure 5 plots, for different batch sizes and sequence lengths, the percentage increase in latency of the adapter layers ($\text{Adapter}^H$ and $\text{Adapter}^L$) relative to a no-adapter baseline. The latency overhead shrinks as batch size grows, but it is substantial for small batches and short sequences; for example, at batch size 1 and sequence length 128, the slow-down can exceed 30%. The figure provides detailed latency data for understanding how the overhead varies with these parameters.
论文原文: "We measure the latency of a single forward pass on an NVIDIA Quadro RTX8000 by averaging over 100 trials. We vary the input batch size, sequence length, and the adapter bottleneck dimension r. We plot the slow-down in percentage compared to the no-adapter baseline in Figure 5." 我们通过对 100 次试验取平均值来测量在 NVIDIA Quadro RTX8000 上单次前向传播的延迟。我们改变输入的批量大小、序列长度和适配器瓶颈维度 r r r。我们在图 5 中绘制了相对于无适配器基线的延迟百分比减速。
论文原文 :"GLUE Benchmark is a wide-ranging collection of natural language understanding tasks. It includes MNLI (inference, Williams et al. (2018)), SST-2 (sentiment analysis, Socher et al. (2013)), MRPC (paraphrase detection, Dolan & Brockett (2005)), CoLA (linguistic acceptability, Warstadt et al. (2018)), QNLI (inference, Rajpurkar et al. (2018)), QQP (question-answering), RTE (inference), and STS-B (textual similarity, Cer et al. (2017))."
论文原文 :"WikiSQL is introduced in Zhong et al. (2017) and contains 56,355/8,421 training/validation examples. The task is to generate SQL queries from natural language questions and table schemata."
SAMSum:
Overview: SAMSum is a chat-summarization dataset; the task is to generate a brief abstractive summary of a dialogue.
论文原文 :"SAMSum is introduced in Gliwa et al. (2019) and contains 14,732/819 training/test examples. It consists of staged chat conversations between two people and corresponding abstractive summaries written by linguists."
论文原文 :"E2E NLG Challenge was first introduced in Novikova et al. (2017) as a dataset for training end-to-end, data-driven natural language generation systems and is commonly used for data-to-text evaluation."
论文原文 :"DART is an open-domain data-to-text dataset described in Nan et al. (2020). DART inputs are structured as sequences of ENTITY --- RELATION --- ENTITY triples."
论文原文 :"WebNLG is another commonly used dataset for data-to-text evaluation (Gardent et al., 2017). With 22K examples in total WebNLG comprises 14 distinct categories, nine of which are seen during training."
Summary:
The "C Dataset Details" appendix gives background on the datasets used across the natural language tasks. The differing task types and structures of these datasets underpin LoRA's evaluation in different settings, and the descriptions are meant to give the reader a complete picture of where LoRA is applied.
论文原文 :"We train using AdamW with a linear learning rate decay schedule. We sweep learning rate, number of training epochs, and batch size for LoRA. Following Liu et al. (2019), we initialize the LoRA modules to our best MNLI checkpoint when adapting to MRPC, RTE, and STS-B, instead of the usual initialization; the pre-trained model stays frozen for all tasks."
论文原文 :"We again train using AdamW with a linear learning rate decay schedule. Following He et al. (2021), we tune learning rate, dropout probability, warm-up steps, and batch size."
论文原文 :"We train all of our GPT-2 models using AdamW (Loshchilov & Hutter, 2017) with a linear learning rate schedule for 5 epochs. We use the batch size, learning rate, and beam search beam size described in Li & Liang (2021)."
论文原文 :"For all GPT-3 experiments, we train using AdamW (Loshchilov & Hutter, 2017) for 2 epochs with a batch size of 128 samples and a weight decay factor of 0.1."
论文原文 :"LoRA+PrefixEmbed (LoRA+PE) combines LoRA with prefix-embedding tuning, where we insert lp + li special tokens whose embeddings are treated as trainable parameters."
论文原文 :"LoRA+PrefixLayer (LoRA+PL) combines LoRA with prefix-layer tuning. We also insert lp + li special tokens; however, instead of letting the hidden representations of these tokens evolve naturally, we replace them after every Transformer block with an input agnostic vector."
"First of all, LoRA+PE significantly outperforms both LoRA and prefix-embedding tuning on WikiSQL, which indicates that LoRA is somewhat orthogonal to prefix-embedding tuning."
"On MultiNLI, the combination of LoRA+PE doesn't perform better than LoRA, possibly because LoRA on its own already achieves performance comparable to the human baseline."
"Secondly, we notice that LoRA+PL performs slightly worse than LoRA even with more trainable parameters."
论文原文 :"Similar to our result on E2E NLG Challenge, reported in Section 5, LoRA performs better than or at least on-par with prefix-based approaches given the same number of trainable parameters."
论文原文 :"We present additional runs on GPT-3 with different adaptation methods in Table 15. The focus is on identifying the trade-off between performance and the number of trainable parameters."
论文原文 :"PrefixEmbed and PrefixLayer performs very poorly on MNLI-100 dataset, with PrefixEmbed performing only slightly better than random chance (37.6% vs. 33.3%). PrefixLayer performs better than PrefixEmbed but is still significantly worse than Fine-Tune or LoRA on MNLI-100."
Summary:
The results in "F Additional Empirical Experiments" show that LoRA adapts well across models and tasks, particularly in low-data regimes and on generation tasks.
G Measuring Similarity Between Subspaces
This appendix defines the method used to measure the similarity between the subspaces of two low-rank matrices.
The authors use a subspace similarity measure $\phi(A, B, i, j)$ that compares two column-orthonormal matrices $U_A^i$ and $U_B^j$.
The measure is based on the projection metric on the Grassmann manifold, oriented in the opposite direction (larger means more similar).
The subspace similarity is defined as: $$\phi(A, B, i, j) = \psi(U_A^i, U_B^j) = \frac{\|(U_A^i)^T U_B^j\|_F^2}{\min\{i, j\}}$$
where:
$U_A^i \in \mathbb{R}^{d \times i}$ and $U_B^j \in \mathbb{R}^{d \times j}$ are the first $i$ and $j$ columns of the left-singular matrices of $A$ and $B$;
$\psi(U_A^i, U_B^j)$ measures the similarity of the two subspaces; its value lies between 0 and 1, where 1 means the subspaces fully coincide and 0 means they are completely orthogonal;
Configuration and performance: as the trainable parameter counts $l_p$ and $l_i$ grow (i.e., more prefix tokens are inserted), PrefixEmbed improves on WikiSQL and MNLI, but performance starts to drop once the parameter count grows too large. For the largest configuration ($l_p=512$, $l_i=8$), accuracy reaches 63.1% on WikiSQL and 85.2% on MNLI.
PrefixLayer:
Configuration and performance: PrefixLayer has more trainable parameters than PrefixEmbed and maintains stable, higher performance. Its top configuration ($l_p=64$, $l_i=0$) reaches 64.9% accuracy on WikiSQL and 87.9% on MNLI.
Adapter:
Configuration and performance: the Adapter method varies considerably across bottleneck sizes $r$. At $r=64$, with 304.4M trainable parameters, it reaches 72.6% accuracy on WikiSQL and 91.5% on MNLI.
LoRA:
Configuration and performance: LoRA is stable across configurations. Even with very few trainable parameters (e.g., $r_q=r_v=2$, 4.7M parameters), it reaches 73.4% accuracy on WikiSQL and 91.7% on MNLI, and performance stays high as the parameter count grows to 301.9M ($r_q=r_k=r_v=r_o=64$).
LoRA+PE and LoRA+PL:
Configuration and performance: among the combinations of LoRA with PrefixEmbed or PrefixLayer, LoRA+PE peaks at $r_q=r_v=64$, $l_p=8$, $l_i=4$, reaching 76.2% on WikiSQL and 91.6% on MNLI. The largest LoRA+PL configuration ($r_q=r_v=8$, $l_p=8$, $l_i=4$) gives a more balanced profile.
The main hyperparameters are the optimizer (AdamW), the learning rate schedule, the batch size, the learning rate, and method-specific parameters such as $l_p$, $l_i$, and LoRA's $r_q=r_v=8$.
Projection metric and the definition of subspace similarity:
Projection metric:
To quantify the similarity of two subspaces, one first defines a projection metric $d(U_A^i, U_B^j)$.
Let $\sigma_1, \sigma_2, \dots, \sigma_p$ be the singular values of $(U_A^i)^T U_B^j$, where $p = \min\{i, j\}$. The projection metric is $$d(U_A^i, U_B^j) = \sqrt{p - \sum_{k=1}^{p} \sigma_k^2} \in [0, \sqrt{p}]$$
This quantity measures the distance between the two subspaces: a larger $d$ means a larger difference between them.
Subspace similarity:
Based on the projection metric $d(U_A^i, U_B^j)$, the subspace similarity $\phi(A, B, i, j)$ is defined as
$$\phi(A, B, i, j) = \psi(U_A^i, U_B^j) = \frac{\sum_{k=1}^{p} \sigma_k^2}{p} = \frac{1}{p}\left(p - d(U_A^i, U_B^j)^2\right)$$
This definition ensures that $\phi(A, B, i, j) = 1$ when $U_A^i$ and $U_B^j$ share the same column span,
and $\phi(A, B, i, j) = 0$ when they are completely orthogonal; otherwise $\phi(A, B, i, j) \in (0, 1)$.
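A quick numerical check of the relation between $d$ and $\phi$ (NumPy, with random orthonormal bases standing in for $U_A^i$ and $U_B^j$):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, i, j = 64, 8, 12
p = min(i, j)

# Random column-orthonormal bases via QR decomposition.
UA = np.linalg.qr(rng.normal(size=(dim, i)))[0]
UB = np.linalg.qr(rng.normal(size=(dim, j)))[0]

sigma = np.linalg.svd(UA.T @ UB, compute_uv=False)

d_proj = np.sqrt(p - np.sum(sigma**2))    # projection metric, in [0, sqrt(p)]
phi = np.sum(sigma**2) / p                # subspace similarity, in [0, 1]

assert np.isclose(phi, 1 - d_proj**2 / p) # phi = (1/p)(p - d^2)
print(d_proj, phi)
```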