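Extrapolation: The Achilles' Heel of Positional Encoding

"Extrapolation" is the hidden thread running through the whole evolution of positional encoding, and it is the touchstone for telling good schemes from bad ones.

The Context "Cliff": Deconstructing the Extrapolation Crisis of LLM Positional Encodings

Why does a model trained at 4k fall apart the moment it reaches 4097?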

1. What is extrapolation?

In the LLM context, "extrapolation" refers to whether a model's performance (for example, perplexity) stays stable on sequence lengths it never saw during training.

Interpolation: inference at positions $[0, N-1]$, within the training length $N$ (for example, 4096).
Extrapolation: inference beyond the training range, at sequence lengths $M > N$ (for example, 4097).

"Poor extrapolation" means that as soon as the model goes past its training length, performance drops sharply or collapses outright. This performance "cliff" is one of the core challenges every LLM architect has to face.

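Why does this happen? The answer almost always points at the same culprit: positional encoding (PE). Self-attention on its own is permutation-equivariant: it has no built-in notion of token order, so the model's sense of position and length comes from the PE. When the PE stops behaving beyond $N$, the model breaks down with it.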

2. Failure modes: why do different PEs extrapolate poorly?

Different PE schemes fail at extrapolation in different ways, and understanding the distinction matters.

Case one: the "catastrophic" extrapolation of absolute positional encoding (APE)

Technique (as in the original Transformer): APE is a position lookup. Whether it is a learnable embedding or a fixed $\sin/\cos$ function, it is essentially a map $PE(pos)$ from a position to a vector, which is added to the token embedding:
$$\mathbf{x}' = \text{TokenEmbedding}(\mathbf{x}) + \text{APE}(pos)$$

Why does it extrapolate poorly?

This is the most severe kind of failure: an out-of-distribution (OOD) crisis.

Learnable APE: suppose the model is trained on $[0, 4095]$. When $pos=4096$ shows up, the embedding table either has no row for it at all, or (if the table was allocated larger than the training length) the row $PE(4096)$ was never updated and still holds its initialization values. Adding this "garbage" vector to the token embedding immediately throws the $Q, K, V$ projections off.
$\sin/\cos$ APE: even with a fixed $\sin/\cos$ encoding, where $PE(4096)$ is perfectly well defined, the model has never seen this particular $\sin/\cos$ vector combined with a token embedding. The weights of the $Q, K, V$ projections never learned how to handle this novel input.

Result: performance collapses right away and perplexity spikes. This is the "cliff".
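The two APE failure modes can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the dimension `d_model = 8`, the placeholder values in `learned_table`, and the helper names `sinusoidal_pe` and `lookup_learned` are hypothetical and do not come from any particular library.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Fixed sin/cos encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe

N = 4096      # training length: positions 0 .. N-1
d_model = 8   # toy embedding size

# Learnable APE: a table with exactly N rows. The values are placeholders
# standing in for trained weights.
learned_table = [[0.01 * p] * d_model for p in range(N)]

def lookup_learned(pos):
    """Look up a learned position vector; there is no row for pos >= N."""
    return learned_table[pos]

try:
    lookup_learned(N)
    learned_has_row = True
except IndexError:
    learned_has_row = False

# sin/cos APE: position N is always computable, but the resulting vector
# was never paired with token embeddings during training.
pe_new = sinusoidal_pe(N, d_model)
```

The learned table simply has nothing to return at position 4096, while the sin/cos variant returns a valid vector that the downstream weights have never been trained on.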

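Case two: the "imperfect" extrapolation of RoPE (rotary position embedding)

Technique: RoPE is a relative positional encoding that is implemented through absolute positions (rotations). In complex notation, each two-dimensional component of a query at position $m$ is rotated by the angle $m\theta_i$, where $\theta_i$ is the frequency of the $i$-th dimension pair and $\mathrm{i}$ is the imaginary unit:
$$f(\mathbf{q}, m) = \mathbf{q}\, e^{\mathrm{i} m \theta_i}$$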
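To make "relative through absolute" concrete, here is a minimal sketch that rotates a toy query and key with complex numbers and checks that the attention score depends only on the offset between the two positions. The frequency schedule $\theta_i = 10000^{-2i/d}$ is the usual RoPE choice; the vectors, the dimension `d = 8`, and the helper names are assumptions made for illustration.

```python
import cmath

def rope_rotate(vec, pos, thetas):
    """Rotate each complex component vec[i] by the angle pos * thetas[i]."""
    return [v * cmath.exp(1j * pos * t) for v, t in zip(vec, thetas)]

def score(q, k):
    """Real inner product of two complex vectors: Re(sum q_i * conj(k_i))."""
    return sum(a * b.conjugate() for a, b in zip(q, k)).real

d = 8  # toy head dimension, i.e. d // 2 complex components
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]

q = [complex(0.3, -1.2), complex(1.0, 0.5), complex(-0.7, 0.2), complex(0.1, 0.9)]
k = [complex(-0.4, 0.8), complex(0.6, -0.3), complex(1.1, 0.4), complex(-0.2, -0.5)]

# Same offset (m - n = 7) at two very different absolute positions.
s_near = score(rope_rotate(q, 10, thetas), rope_rotate(k, 3, thetas))
s_far = score(rope_rotate(q, 1007, thetas), rope_rotate(k, 1000, thetas))
```

The two scores agree up to floating-point error, because the rotation of the query and the conjugated rotation of the key combine into a single factor $e^{\mathrm{i}(m-n)\theta_i}$.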

Why does it extrapolate poorly?

RoPE's failure is much subtler. The problem is not "garbage input" but the math breaking down.

Periodic aliasing: RoPE relies on $\sin(m\theta)$ and $\cos(m\theta)$, which are periodic. Within any single dimension, the rotation angle at $pos=4096$ can be identical (or very close) to the angle at a much earlier position such as $pos=0$, so that dimension alone cannot tell the two apart; only the combination of frequencies disambiguates them. The low-frequency dimensions, whose wavelength exceeds the training length, never complete a full turn during training, so beyond the training length they reach angles the model has never seen, and the model can start to confuse positions.
Relative distance "breaks down": the core of RoPE is that $\langle \tilde{q}_m, \tilde{k}_n \rangle$ depends only on $(m-n)$. During training the model learned how to interpret relative distances $(m-n)$ within $[-4095, 4095]$. When it meets a relative distance of $(m-n) = 6000$, it has no idea "how far" that is. The rotation factor $e^{\mathrm{i}(6000)\theta}$ is entirely new and carries no learned meaning.

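Result: performance does not collapse instantly as with APE, but it degrades quickly because of position confusion and the inability to interpret very long relative distances.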
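One way to see which RoPE dimensions are at risk is to compare each frequency's wavelength with the training length. The sketch below assumes the common settings `base = 10000` and a head dimension of 128; both values and the helper name `rope_wavelengths` are illustrative assumptions, not taken from the text above.

```python
import math

def rope_wavelengths(d, base=10000.0):
    """Wavelength in tokens of each RoPE frequency: 2*pi / theta_i."""
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

train_len = 4096
d = 128  # assumed head dimension

waves = rope_wavelengths(d)

# Dimensions that complete at least one full turn inside the training window.
full_turn = [w for w in waves if w <= train_len]
# Dimensions that only ever see part of the circle during training.
partial_turn = [w for w in waves if w > train_len]
```

Under these assumptions the shortest wavelength is about 6.28 tokens while the longest is tens of thousands of tokens, so `partial_turn` is non-empty: those are the dimensions that meet unseen angles once the sequence grows past `train_len`.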

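3. A special case: why does ALiBi extrapolate well?

To understand "bad", we have to look at what "good" looks like. ALiBi was designed for extrapolation.

Technique: ALiBi (Attention with Linear Biases)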

Mechanism: ALiBi does not operate on $Q$ or $K$. It adds a linear "penalty" bias directly to the $QK^T$ logits, where $m > 0$ is a fixed per-head slope and $j \le i$ under causal attention:
$$A_{i,j} = \mathbf{q}_i^T \mathbf{k}_j - m \cdot (i-j)$$

Why does it extrapolate well?

ALiBi's mechanism is an extremely simple, continuous, non-periodic inductive bias: "the farther away, the larger the penalty".

During training, the model gets used to this linear rule over distances in $[0, 4095]$.
At inference time with $pos=6000$, it meets distances up to $(i-j)=6000$. The model needs no extra information: the same linear rule applies directly to the new distance, and the penalty is simply $m \cdot 6000$.

The bias itself has no "OOD" problem, because the rule it encodes extends mathematically to any distance.
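The linear rule is short enough to sketch directly. The code below uses the sign convention in which the slope is positive and the bias added to the logit is negative, together with the geometric slope schedule from the ALiBi paper for a power-of-two head count; the helper names and the choice of 8 heads are assumptions for illustration.

```python
def alibi_slopes(n_heads):
    """Geometric per-head slopes; for 8 heads this gives 1/2, 1/4, ..., 1/256."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(i, j, slope):
    """Bias added to the logit of query position i attending to key j (j <= i)."""
    return -slope * (i - j)

slopes = alibi_slopes(8)

# A distance seen in training and one far beyond it use the same formula.
bias_trained = alibi_bias(4095, 0, slopes[0])
bias_beyond = alibi_bias(6000, 0, slopes[0])
```

Nothing in `alibi_bias` depends on the training length, which is the sense in which the rule carries over to unseen distances.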

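4. The way out: how do we "cheat" extrapolation?

Given that RoPE (the basis of current SOTA models) extrapolates imperfectly, and that ALiBi is not compatible with RoPE-based architectures (such as Llama), how do we get to a 128k context?

The answer: we do not solve the extrapolation problem, we sidestep it.

Instead of extrapolating, we find a way to interpolate. This is the core idea behind Position Interpolation (PI) and its successors (NTK, YaRN).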

Step 1: naive interpolation (PI)

Idea: we have a 4k model. When an 8k sequence comes in, we "pretend" it is a 4k sequence.
Operation: we linearly compress every $pos \in [0, 8191]$ into the trained range by multiplying it by $4096/8192 = 1/2$. For example, $pos=8191$ is rotated as if it were $pos=4095.5$.
Problem: this uniform compression is equivalent to halving every RoPE frequency $\theta_i$. It badly damages the high-frequency dimensions that encode local detail, which makes the model "dumber".
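A minimal sketch of the PI rescaling, assuming a 4096-token model stretched to 8192 tokens and the usual RoPE frequencies $\theta_i = 10000^{-2i/d}$. The helper names and the toy dimension `d = 8` are hypothetical.

```python
def pi_position(pos, train_len, target_len):
    """Linearly rescale a position from the longer window into the trained range."""
    return pos * train_len / target_len

def rope_thetas(d, base=10000.0):
    """Standard RoPE frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

train_len, target_len = 4096, 8192
thetas = rope_thetas(8)

scaled_pos = pi_position(8191, train_len, target_len)

# Rotating the rescaled position by theta gives the same angle as rotating
# the original position by theta / 2: every frequency is halved.
angles_scaled_pos = [scaled_pos * t for t in thetas]
angles_halved_freq = [8191 * (t * train_len / target_len) for t in thetas]
```

Here `scaled_pos` is 4095.5, and the two angle lists match, which is the equivalence between "compress the positions" and "lower every frequency" described above.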

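Step 2: smart interpolation (NTK-Aware / YaRN)

The NTK-Aware insight: PI is headed in the right direction ("compress back into interpolation") but uses the wrong method. We should not compress the high-frequency dimensions (they carry local detail and should stay as they are); the compression should fall mainly on the low-frequency dimensions (they carry global distance).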
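The YaRN insight: NTK fixes the frequency problem, but it (like PI) introduces a new one: interpolation and the longer context change the distribution of the $QK^T$ logits, so the entropy of the softmax drifts away from what the model saw in training (attention tends to spread out over the larger number of tokens).
YaRN's solution:
1. Use NTK-style "selective interpolation" to handle the frequencies.
2. Additionally introduce a temperature $t$ (as in $\text{softmax}(QK^T/t)$, on top of the usual $1/\sqrt{d}$ scaling) to recalibrate the magnitude, so that the softmax entropy returns to roughly its training-time level. For a scale factor $s > 1$ the YaRN paper suggests $\sqrt{1/t} = 0.1 \ln s + 1$ for Llama models, so $t < 1$ and the logits are sharpened.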
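The two ingredients above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not YaRN itself: the frequency part uses the base-rescaling form of NTK-aware scaling (new base $= \text{base} \cdot s^{d/(d-2)}$), whereas YaRN uses a piecewise ramp over the frequencies that is not reproduced here. The temperature uses the fit $\sqrt{1/t} = 0.1 \ln s + 1$ reported in the YaRN paper for Llama models. Function names, `d = 128`, and `s = 4` are illustrative choices.

```python
import math

def rope_thetas(d, base=10000.0):
    """Standard RoPE frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def ntk_aware_thetas(d, scale, base=10000.0):
    """NTK-aware scaling: enlarge the base instead of shrinking positions."""
    new_base = base * scale ** (d / (d - 2))
    return rope_thetas(d, new_base)

def yarn_logit_multiplier(scale):
    """Factor 1/t applied to the attention logits, from sqrt(1/t) = 0.1*ln(s) + 1."""
    return (0.1 * math.log(scale) + 1.0) ** 2

d, s = 128, 4.0
original = rope_thetas(d)
rescaled = ntk_aware_thetas(d, s)

ratio_highest_freq = rescaled[0] / original[0]    # fastest dimension
ratio_lowest_freq = rescaled[-1] / original[-1]   # slowest dimension
logit_multiplier = yarn_logit_multiplier(s)
```

With this base change the fastest dimension keeps its frequency (ratio 1) while the slowest one is slowed down by the full factor $1/s$, with a smooth transition in between; PI, by contrast, would apply $1/s$ to every dimension. The logit multiplier is greater than 1 for $s > 1$, which corresponds to $t < 1$.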

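Conclusion: the "holy grail" of extrapolation

"Poor extrapolation" is the Achilles' heel of positional encoding. The path of evolution shows this clearly:

APE collapses under extrapolation because of OOD inputs.
ALiBi extrapolates very well thanks to its linear inductive bias, but because it modifies the logits, it cannot be combined with linear attention.
RoPE extrapolates imperfectly because of its periodicity, but its decoupled design (preprocessing Q/K) is compatible with linear attention.

The current SOTA recipe (Llama + YaRN) takes a pragmatic route: adopt RoPE for its architectural advantages, then use "smart interpolation" (YaRN) to sidestep its extrapolation weakness.

The real "holy grail", a PE scheme that combines ALiBi's continuous extrapolation with RoPE's decoupled architecture, may still be waiting to be discovered.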