LLMs / LoLCATs: Translation and Interpretation of "LoLCATs: On Low-Rank Linearizing of Large Language Models"
Overview: The core of this paper is LoLCATs (Low-rank Linear Conversion via Attention Transfer), a method for efficiently linearizing large language models (LLMs).
>> Background and pain points:
● Poor quality from existing linearizing methods: Replacing the quadratic attention of Transformer-based LLMs with subquadratic attention (e.g., linear attention) improves efficiency but usually degrades model quality significantly.
● High training cost: Linearizing an LLM typically still requires training on billions to tens of billions of tokens, at enormous compute and memory cost.
● Poor scalability: Existing methods are largely limited to smaller 1.3B to 7B parameter LLMs and are hard to scale to 70B or even 405B parameter models.
>> The proposed solution: LoLCATs is a two-stage method that addresses these problems through "attention transfer" and "low-rank linearizing":
● Stage 1: Attention Transfer: Train linear attentions to approximate the original LLM's softmax attentions by minimizing the mean-squared error (MSE) between the linear-attention outputs and the softmax-attention outputs. The key is learning a suitable feature map ϕ; the paper combines sliding window attention with linear attention to improve how well this feature map is learned (see the sketch after this list).
● Stage 2: Low-rank Linearizing: Swap the original LLM's softmax attentions for the linear attentions trained in Stage 1. To compensate for approximation errors and recover LLM quality, only the linear attentions' projection matrices are fine-tuned with low-rank adaptation (LoRA) instead of updating all model parameters. To further improve scalability, the paper proposes block-by-block training: the LLM is split into blocks of k layers, and each block's attentions are trained separately.
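To make Stage 1 concrete, here is a minimal PyTorch sketch of attention transfer for a single head. It uses simplified shapes, omits causal masking and multi-head handling for brevity, and the names (`FeatureMap`, `linear_attention`) are illustrative rather than the authors' implementation; in LoLCATs the queries, keys, and values come from the frozen LLM's own projections run over real text, and only the feature-map parameters are trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMap(nn.Module):
    """Learnable feature map phi(.) applied to queries and keys (illustrative)."""
    def __init__(self, head_dim):
        super().__init__()
        self.proj = nn.Linear(head_dim, head_dim)

    def forward(self, x):
        # Non-negative features keep the linear-attention denominator positive.
        return F.relu(self.proj(x))

def linear_attention(q, k, v, phi):
    # q, k, v: (batch, seq_len, head_dim); non-causal for brevity.
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                  # sum_j phi(k_j) v_j^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def softmax_attention(q, k, v):
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Stage 1: train only the feature map to match the frozen softmax attention's outputs.
phi = FeatureMap(head_dim=64)
opt = torch.optim.AdamW(phi.parameters(), lr=1e-3)
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
with torch.no_grad():
    target = softmax_attention(q, k, v)                      # "ground-truth" outputs
loss = F.mse_loss(linear_attention(q, k, v, phi), target)    # attention-transfer MSE
loss.backward()
opt.step()
```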
>> Core steps:
● Design an efficient linear attention: The paper improves linear attention by combining it with sliding window attention, balancing efficiency and accuracy.
● Attention transfer: Train the linear attention to match the softmax attention outputs by minimizing an MSE loss.
● Low-rank fine-tuning: Use LoRA to fine-tune the linear-attention projection matrices, updating only a small number of parameters (see the sketch after this list).
● Block-by-block training (for large models): Split the LLM into blocks and train them separately to reduce memory requirements.
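Stage 2 then touches only a small set of weights. Below is a minimal, hypothetical sketch of applying LoRA to the attention projections after the swap; `LoRALinear` and the `q_proj`/`k_proj`/`v_proj`/`o_proj` attribute names follow common Llama-style conventions and are not taken from the LoLCATs codebase.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha / r) * B A x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def add_lora_to_attention(attn_module, rank: int = 8):
    """Wrap only the attention projections; everything else in the LLM stays frozen."""
    for name in ("q_proj", "k_proj", "v_proj", "o_proj"):  # Llama-style names (assumed)
        setattr(attn_module, name, LoRALinear(getattr(attn_module, name), rank=rank))
```

Because `B` is initialized to zero, the swapped-in model starts exactly at the Stage 1 solution, and fine-tuning on the usual next-token loss then only has to correct the remaining approximation error.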
>> Advantages:
● Significantly better linearizing quality: Across multiple benchmarks, LoLCATs' linearized LLMs clearly outperform prior methods, narrowing the performance gap between linearized models and the original Transformers.
● Extreme parameter and token efficiency: Only a tiny fraction of parameters is trained (under 0.2% of the original model's) using few training tokens (under 0.4% of prior methods'), greatly reducing training cost.
● Strong scalability: The method successfully linearizes 70B and 405B parameter LLMs, which previous approaches could not reach.
● Higher generation efficiency: The linearized models deliver significantly higher generation speed and throughput than the original Transformers running FlashAttention-2.
>> Conclusions and takeaways:
● LoLCATs is an efficient, high-quality LLM linearizing method that preserves model quality while greatly reducing training cost and memory requirements.
● LoLCATs scales linearizing to an unprecedented size, linearizing 70B and 405B parameter LLMs.
● The combination of attention transfer and low-rank adaptation is the key to efficient linearizing.
● The block-by-block training strategy effectively solves the memory problems of linearizing very large models (see the sketch after this list).
● The paper also examines how attention-transfer quality relates to the final linearized model's performance, and how attention entropy and layer depth affect attention transfer.
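As a rough illustration of the block-by-block strategy referenced above, the sketch below trains the linear attentions of one k-layer block against the frozen softmax attentions, using hidden states cached from the preceding block as inputs. All attribute names (`softmax_attention`, `linear_attention`, calling `layer(x)` for the frozen forward pass) are placeholders for whatever the actual model exposes, not the paper's code.

```python
import torch
import torch.nn.functional as F

def attention_transfer_for_block(block_layers, cached_hidden_states, optimizer):
    """Jointly train the linear attentions inside one k-layer block (schematic)."""
    for x in cached_hidden_states:            # outputs of the previous block, precomputed
        loss = 0.0
        for layer in block_layers:
            with torch.no_grad():
                target = layer.softmax_attention(x)   # frozen "teacher" attention output
            pred = layer.linear_attention(x)          # only feature maps are trainable
            loss = loss + F.mse_loss(pred, target)
            with torch.no_grad():
                x = layer(x)                          # propagate with the original layer
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Choosing k trades off parallelism against storage: smaller blocks can be trained in parallel on separate devices, but each extra block boundary requires materializing and saving another set of hidden states to feed the next block.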
In short, LoLCATs is a novel and efficient approach to linearizing LLMs that opens a path toward larger, more efficient models. Its core idea is to learn a high-quality linear-attention approximation via attention transfer, and to combine it with low-rank adaptation and block-by-block training to cut training cost and memory, ultimately enabling efficient linearizing of very large LLMs.
Table of Contents
"LoLCATs: On Low-Rank Linearizing of Large Language Models": Translation and Interpretation
1 Introduction
Figure 1: LoLCATs framework
Figure 2: Linearizing comparison
Limitations and Future Work
"LoLCATs: On Low-Rank Linearizing of Large Language Models": Translation and Interpretation
|------------|--------------------------------------------------------------------------------------------------------------|
| Link | Paper: https://arxiv.org/abs/2410.10254 |
| Date | October 14, 2024 |
| Authors | Stanford University |
Abstract
Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. We base these steps on two findings. First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer"). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
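To spell out what "swapping the quadratic attentions ... with subquadratic analogs, such as linear attention" means, a standard (causal) formulation of the two attention forms is shown below; this is textbook notation rather than the paper's exact Eq. 2.

```latex
% Softmax attention: cost grows quadratically with sequence length n
y_i = \frac{\sum_{j \le i} \exp\!\big(q_i^\top k_j / \sqrt{d}\big)\, v_j}
           {\sum_{j \le i} \exp\!\big(q_i^\top k_j / \sqrt{d}\big)}

% Linear attention with feature map \phi: the sums over j can be kept as running
% states, giving linear-time processing and constant-memory generation
y_i = \frac{\phi(q_i)^\top \sum_{j \le i} \phi(k_j)\, v_j^\top}
           {\phi(q_i)^\top \sum_{j \le i} \phi(k_j)}
```

Attention transfer trains ϕ so that the second expression matches the first on the LLM's own queries, keys, and values.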
1 Introduction
"Linearizing" large language models (LLMs)---or converting existing Transformer-based LLMs into attention-free or subquadratic alternatives---has shown promise for scaling up efficient architectures. While many such architectures offer complexity-level efficiency gains, like linear-time and constant-memory generation, they are often limited to smaller models pretrained on academic budgets [4, 5, 24, 39, 63]. In a complementary direction, linearizing aims to start with openly available LLMs---e.g., those with 7B+ parameters trained on trillions of tokens [2, 28]---and (i) swap their softmax attentions with subquadratic analogs, before (ii) further finetuning to recover quality. This holds exciting promise for quickly scaling up subquadratic capabilities. However, to better realize this promise and allow anyone to convert LLMs into subquadratic models, we desire methods that are (1) quality-preserving, e.g., to recover the zero-shot abilities of modern LLMs; (2) parameter and token efficient, to linearize LLMs on widely accessible compute; and (3) highly scalable, to support linearizing the various 70B+ LLMs available today [56, 57].
Existing methods present opportunities to improve all three criteria. On quality, despite using motivated subquadratic analogs such as RetNet-inspired linear attentions [35, 54] or state-space model (SSM)-based Mamba layers [24, 60, 64], prior works significantly reduce performance on popular LM Evaluation Harness tasks (LM Eval) [21] (up to 23.4-28.2 pts on 5-shot MMLU [26]). On parameter and token efficiency, to adjust for architectural differences, prior methods update all model parameters in at least one stage of training [35, 60, 64], and use 20-100B tokens to linearize 7B LLMs. On scalability, these training costs make linearizing larger models on academic compute more difficult; existing works only linearize up to 8B LLMs. This makes it unclear how to support linearizing 70B to 405B LLMs [20].

In this work, we thus propose LoLCATs (Low-rank Linear Conversion with Attention Transfer), a simple approach to improve the quality, efficiency, and scalability of linearizing LLMs. As guiding motivation, we ask if we can linearize LLMs by simply reducing architectural differences, i.e.,

1. Starting with simple softmax attention analogs such as linear attention (Eq. 2), and training their parameterizations explicitly to approximate softmax attention ("attention transfer").
2. Subsequently only training with low-cost finetuning to adjust for any approximation errors, e.g., with low-rank adaptation (LoRA) [27] ("low-rank linearizing").
In evaluating this hypothesis, we make several contributions. First, to better understand linearizing feasibility, we empirically study attention transfer and low-rank linearizing with existing linear attentions. While intuitive---by swapping in perfect subquadratic softmax attention approximators, we could get subquadratic LLMs with no additional training---prior works suggest linear attentions struggle to match softmax expressivity [31, 44] or need full-model updates to recover linearizing quality [29, 35]. In contrast, we find that while either attention transfer or LoRA alone is insufficient, we can rapidly recover quality by simply doing both (Figure 3, Table 1). At the same time, we do uncover quality issues related to attention-matching architecture and training. With prior linear attentions, the best low-rank linearized LLMs still significantly degrade in quality vs. original Transformers (up to 42.4 pts on 5-shot MMLU). With prior approaches that train all attentions jointly [66], we also find that later layers can result in 200× the MSE of earlier ones (Figure 7). We later find this issue aggravated by larger LLMs; jointly training all of Llama 3.1 405B's 126 attention layers fails to viably linearize the LLM.

Next, to resolve these issues and improve upon our original criteria, we detail LoLCATs' method components. For quality, we generalize prior notions of learnable linear attentions to sliding window + linear attention variants. These remain subquadratic to compute yet consistently yield better attention transfer via lower mean-squared error (MSE) on attention outputs. For parameter and token efficiency, we maintain our simple 2-step framework of (1) training subquadratic attentions to match softmax attentions, before (2) adjusting for any errors via only LoRA. For scalability, we use finer-grained "block-by-block" training. We split LLMs into blocks of k layers before jointly training attentions only within each block to improve layer-wise attention matching. We pick k to balance the speed of training blocks in parallel with the memory of saving hidden state outputs of prior blocks (as inputs for later ones). We provide a simple cost model to navigate these tradeoffs.
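The sliding window + linear attention hybrid mentioned above can be pictured as exact softmax attention over the most recent w tokens plus a linear-attention contribution from all older tokens, sharing one normalization. The sketch below is a slow reference loop under that reading (real implementations would use chunked or fused kernels); the shapes, the `phi` argument, and the shared-denominator arrangement are assumptions for illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def hybrid_attention(q, k, v, phi, window=64):
    """Sliding-window softmax + linear attention over older tokens (reference loop)."""
    b, n, d = q.shape
    out = torch.zeros_like(v)
    kv_state = torch.zeros(b, d, v.shape[-1])     # sum of phi(k_j) v_j^T, j outside window
    k_state = torch.zeros(b, d)                   # sum of phi(k_j), j outside window
    for i in range(n):
        lo = max(0, i - window + 1)
        # exact (unnormalized) softmax scores over the in-window tokens
        w = torch.exp(q[:, i:i + 1] @ k[:, lo:i + 1].transpose(-1, -2) / d ** 0.5)
        num = (w @ v[:, lo:i + 1]).squeeze(1)     # (batch, value_dim)
        den = w.sum(-1)                           # (batch, 1)
        # linear-attention contribution from tokens that have left the window
        qi = phi(q[:, i])
        num = num + torch.einsum("bd,bde->be", qi, kv_state)
        den = den + (qi * k_state).sum(-1, keepdim=True)
        out[:, i] = num / (den + 1e-6)
        # the token falling out of the window moves into the recurrent state
        j = i - window + 1
        if j >= 0:
            kj = phi(k[:, j])
            kv_state = kv_state + torch.einsum("bd,be->bde", kj, v[:, j])
            k_state = k_state + kj
    return out

# Example usage with a simple ReLU feature map standing in for the learned phi.
q, k, v = (torch.randn(1, 32, 16) for _ in range(3))
out = hybrid_attention(q, k, v, phi=F.relu, window=8)   # (1, 32, 16)
```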
Finally, in experiments, we validate that LoLCATs improves on each of our desired criteria.

>> On quality, when linearizing popular LLMs such as Mistral-7B and Llama 3 8B, LoLCATs significantly improves past linearizing methods (by 1.1−8.6 points (pts) on zero-shot LM Eval tasks; +17.2 pts on 5-shot MMLU). With Llama 3 8B, LoLCATs for the first time closes the zero-shot LM Eval gap between linearized and Transformer models (73.1 vs 74.2 pts), while supporting 3× higher throughput and 64× larger batch sizes vs. popular FlashAttention-2 [15] implementations (generating 4096 token samples on an 80GB H100). We further validate LoLCATs as a high-quality training method, outperforming strong 7B subquadratic LLMs (RWKV-v6 [40], Mamba [24], Griffin [18]) and hybrids (StripedHyena [42], Zamba [23]) trained from scratch by 1.2 to 9.9 pts on average over popular LM Eval tasks.

>> On parameter and token efficiency, by only training linear attention feature maps in Stage 1, while only using LoRA on linear attention projections in Stage 2, LoLCATs enables these gains while updating only <0.2% of past linearizing methods' model parameters (doable on a single 40GB GPU). This also only takes 40M tokens, i.e., 0.003% and 0.04% of prior pretraining and linearizing methods' token counts.

>> On scalability, with LoLCATs we scale up linearizing to support Llama 3.1 70B and 405B LLMs [20]. LoLCATs presents the first viable approach to linearizing larger LLMs. We create the first linearized 70B LLM, taking only 18 hours on one 8×80GB H100 node, and the first linearized 405B LLM with a combination of 5 hours on 14 80GB H100 GPUs (attention transfer) + 16 hours on three 8×80GB H100 nodes (LoRA finetuning) for Llama 3.1 405B. For both models, this amounts to under half the total GPU hours that prior methods reported to linearize 8B models (5 days on 8×80GB A100s) [60]. Furthermore, under these computational constraints, LoLCATs significantly improves quality versus prior linearizing approaches without attention transfer. With Llama 3.1 70B and 405B, we close 77.8% and 78.1% of the 5-shot MMLU gap between Transformers and linearized variants respectively.

Our code is available at: https://github.com/HazyResearch/lolcats.
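A back-of-the-envelope calculation helps explain the 64× batch-size headroom: with softmax attention, the KV cache grows with sequence length, while the linearized layers keep a fixed-size state. The numbers below assume Llama 3 8B-style dimensions (32 layers, 8 KV heads of dimension 128, bf16) and are our own illustration, not figures from the paper.

```python
# KV-cache memory for one 4096-token sequence under softmax attention (assumed dims).
layers, kv_heads, head_dim, bytes_per_value, seq_len = 32, 8, 128, 2, 4096
kv_bytes = layers * kv_heads * head_dim * 2 * bytes_per_value * seq_len  # keys and values
print(f"{kv_bytes / 2**30:.2f} GiB per sequence")  # ~0.50 GiB
# A batch of 64 such sequences needs ~32 GiB for the KV cache alone on an 80GB GPU,
# whereas a linearized layer's recurrent state does not grow with sequence length.
```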
Figure 1: LoLCATs framework. We linearize LLMs by (1) training attention analogs to approximate softmax attentions (attention transfer), before swapping attentions and (2) minimally adjusting (with LoRA).
Figure 2: Linearizing comparison. LoLCATs significantly improves LLM linearizing quality and training efficiency. Multiple ✓ or ✗ denote relatively better or worse support.
Conclusion
We propose LoLCATs, an efficient LLM linearizing method that (1) trains attention analogs---such as linear attentions and linear attention + sliding window hybrids---to approximate an LLM's self-attentions, before (2) swapping the attentions and only finetuning the replacing attentions with LoRA. We exploit the fidelity between these attention analogs and softmax attention, where we reduce the problem of linearizing LLMs to learning to approximate softmax attention in a subquadratic analog. Furthermore, we demonstrate that via an MSE-based attention output-matching loss, we are able to train such attention analogs to approximate the "ground-truth" softmax attentions in practice. On popular zero-shot LM Evaluation Harness benchmarks and 5-shot MMLU, we find this enables producing high-quality, high-inference-efficiency LLMs that outperform prior Transformer alternatives while only updating 0.2% of model parameters and requiring 0.003% of the training tokens to achieve similar quality with LLM pretraining. Our findings significantly improve linearizing quality and accessibility, allowing us to create the first linearized 70B and 405B LLMs.
Limitations and Future Work
While we focus on studying how to enable high quality yet highly efficient LLM linearizing with simple linear attentions, we note several areas for additional evaluation in both subquadratic capabilities and architectures. On subquadratic capabilities, by replacing each attention layer with a subquadratic alternative, we eliminate the need to manage growing key-value (KV) caches and their associated memory overheads. However, it remains to be seen what kinds of capabilities we can enable with this cheaper inference, e.g., if linearized models can exploit quality-improving inference scaling laws suggested by recent works [8, 51]. Under a different motivation, while layers like linear attention achieve greater efficiency gains over softmax attention when processing longer contexts, we leave studying how low-rank linearizing applies to such long-context scenarios as a motivated direction for future work. Finally, while we stick to "vanilla" linear + sliding window attentions in LoLCATs, many more recent subquadratic architectures improve linear attention quality with additional factors such as decay terms [54] and additional gating [63]. Studying whether attention transfer and low-rank linearizing can help scale up these additional attention analogs is an interesting line of future work.