From the Fourier Transform to RoPE: Deconstructing the Mathematical Soul of Positional Encoding

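The brilliance of Rotary Position Embedding (RoPE) is not merely that it uses sin and cos functions. What makes it genuinely new is that it takes the Fourier transform's time-shift theorem and turns it from an analysis tool into a built-in part of the Transformer attention architecture.

Where the original Transformer adds positional information, RoPE injects position by rotation (that is, complex multiplication). This design guarantees mathematically that the attention score depends on position only through the relative offset $(m-n)$ between tokens, which addresses several of the main drawbacks of absolute positional encoding at the root.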

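1. The Root of the Problem: Attention Is "Position-Blind"

Self-attention, the core of the Transformer, is inherently permutation-equivariant (often loosely called "permutation-invariant"): shuffle the input tokens and each token's output is unchanged, merely shuffled along with it. To the model, then, ["我", "打", "你"] ("I hit you") and ["你", "打", "我"] ("you hit me") are equivalent when attention is computed, because it cares only about how the tokens' content interacts, not about their order.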
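The position blindness described above can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and no positional information at all; the function name `self_attention` and all variable names are illustrative, not taken from any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))  # three token embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])   # swap the first and last token
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Every token receives the same output vector wherever it sits in the sequence.
print(np.allclose(out[perm], out_perm))
```

Reordering the inputs only reorders the outputs, so nothing in the computation can tell the two sentences apart.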

To cure this position blindness, we have to inject the information "position $m$" and "position $n$" into the model somehow.

1.1 The Original Scheme: Additive Absolute Positional Encoding (APE)

The original paper, "Attention Is All You Need", proposed an elegant scheme:

For each absolute position $pos$, compute a vector $PE_{pos}$ built from sin/cos at multiple frequencies.
**Add** this position vector to the word embedding: $X_{final} = X_{word} + PE_{pos}$.
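The two steps above can be sketched in a few lines. This assumes the usual sinusoidal layout (sin in even dimensions, cos in odd dimensions, base 10000) and an even `d_model`; the helper name `sinusoidal_pe` and the random stand-in for word embeddings are illustrative only.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Sinusoidal absolute position vectors, one row per position."""
    pos = np.arange(num_positions)[:, None]       # shape (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]          # shape (1, d_model / 2)
    angles = pos / base ** (2 * i / d_model)      # one frequency per dimension pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 6, 16
rng = np.random.default_rng(0)
x_word = rng.normal(size=(seq_len, d_model))      # stand-in for word embeddings
pe = sinusoidal_pe(seq_len, d_model)
x_final = x_word + pe                             # the additive step

print(x_final.shape)
```

After the addition, content and position share the same coordinates of `x_final`; nothing marks which part of each number came from which source.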

This scheme (we will call it APE) works, but it leaves the model with a hard problem:

The model sees a "mixed" vector $X_{final}$.
It must **learn on its own** how to disentangle the word meaning $X_{word}$ and the position $PE_{pos}$ from $X_{final}$.
More importantly, it must **learn** to extract the relative position $(m-n)$ buried in the complicated dot product $\langle X_m + PE_m, X_n + PE_n \rangle$. This is an **indirect and inefficient** route.
If training only ever showed it $PE_0$ through $PE_{4095}$, then when $PE_{5000}$ appears the model tends to break down, because it has never seen the encoding for that position. Extrapolation is therefore very poor.
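To see why the dot product is hard to untangle, it helps to expand it. The sketch below assumes sinusoidal position vectors with frequencies $\omega_i$ (a symbol introduced here for illustration) and, like the text, ignores the learned query and key projections.

```latex
\begin{aligned}
\langle X_m + PE_m,\; X_n + PE_n \rangle
  &= \langle X_m, X_n \rangle + \langle X_m, PE_n \rangle
   + \langle PE_m, X_n \rangle + \langle PE_m, PE_n \rangle \\
\langle PE_m, PE_n \rangle
  &= \sum_i \big[ \sin(m\omega_i)\sin(n\omega_i) + \cos(m\omega_i)\cos(n\omega_i) \big]
   = \sum_i \cos\big((m-n)\,\omega_i\big)
\end{aligned}
```

Only the last term is a clean function of $(m-n)$. The two cross terms tie content to the absolute positions $n$ and $m$, and that is the part the model is left to sort out by itself.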

2. What the Fourier Transform Teaches Us: A Time Shift Is a Phase Rotation

To find a better scheme, we go back to the foundation of signal processing: the Fourier transform.

The core idea of the Fourier transform is that essentially any time-domain signal $f(t)$ can be decomposed into a superposition of sin and cos waves at different frequencies $\omega$.

Of all its properties, the time-shift theorem (Shift Theorem) is the one that matters here:

Time-shift theorem (simplified):

Shifting a signal in the time domain, $f(t+n)$, corresponds to a phase rotation in the frequency domain.

Let us put this in its purest mathematical form, using complex numbers:

A wave with a single frequency $\omega$ has the "encoding" $f(t) = e^{i\omega t}$ at time $t$.
What, then, is its encoding at time $t+n$ (that is, shifted by $n$)?

$$f(t+n) = e^{i\omega(t+n)} = e^{i\omega t} \cdot e^{i\omega n}$$
This gives a striking formula: $f(t+n) = f(t) \cdot f(n)$ (where $f(n) = e^{i\omega n}$).

The formula says: the shifted encoding = the original encoding multiplied by the rotation factor for the shift.

The relationship is multiplicative, not additive.
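Both forms of the statement can be verified numerically. The following sketch first checks the single-frequency identity $f(t+n) = f(t) \cdot f(n)$, then checks the discrete analogue with NumPy's FFT, where the shift is circular. The specific values of `omega`, `t0`, `n`, `N`, and `shift` are arbitrary choices made for illustration.

```python
import numpy as np

omega = 0.3
t0, n = 5.0, 2.0

def f(t):
    """Encoding of time t by a single complex wave of frequency omega."""
    return np.exp(1j * omega * t)

# Shifting by n is the same as multiplying by the rotation factor f(n).
print(np.isclose(f(t0 + n), f(t0) * f(n)))

# Discrete version: a circular shift in time rotates the phase of every DFT bin.
N = 64
rng = np.random.default_rng(0)
x = rng.normal(size=N)
shift = 3
x_shifted = np.roll(x, -shift)                 # x_shifted[k] == x[(k + shift) % N]
bins = np.arange(N)
phase = np.exp(2j * np.pi * bins * shift / N)  # one rotation factor per frequency bin
print(np.allclose(np.fft.fft(x_shifted), np.fft.fft(x) * phase))
```

In both cases the shift never touches the magnitude of a frequency component; it only turns its phase.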

3. RoPE: Injecting the Time-Shift Theorem into Attention

Jianlin Su (苏剑林), the designer of RoPE, seized on exactly this point. We want the attention $\text{Attention}(m, n)$ to depend only on the relative position $(m-n)$. How can we exploit the multiplicative relationship above?

3.1 Restating the Goal

We no longer add $PE$ to $X$. Instead we define a new Query and Key that are functions of position.

We want $\langle q_m, k_n \rangle$ to be determined by $q$, $k$, and $(m-n)$ alone.

3.2 Implementing the Fourier "Rotation"

RoPE is a direct application of the Fourier time-shift theorem. To keep things simple we work with complex numbers for now (in practice RoPE is implemented as 2D rotations, which are equivalent to complex multiplication).
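The equivalence between a 2D rotation and complex multiplication is easy to confirm. In this small sketch the angle and the two-element vector are arbitrary illustrative values; the vector stands for one pair of dimensions of a query or key.

```python
import numpy as np

angle = 0.7
x = np.array([1.5, -0.4])  # one 2-D slice of a query or key

# Real-valued form: multiply by a 2x2 rotation matrix.
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
rotated_real = rotation @ x

# Complex form: treat the pair as x[0] + i*x[1] and multiply by e^{i*angle}.
z = complex(x[0], x[1]) * np.exp(1j * angle)
rotated_complex = np.array([z.real, z.imag])

print(np.allclose(rotated_real, rotated_complex))
```

The complex form is more compact on paper, while the real-valued form is what runs on hardware that has no native complex arithmetic.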

RoPE splits the $d$-dimensional vector $q$ into $d/2$ complex numbers $q_j$. For each complex component (which stands for one particular frequency $\theta_j$):

Define $q_m$ (the Query at position $m$): $q_m = q_j \cdot e^{im\theta_j}$ (the content $q_j$ multiplied by the rotation factor $f(m)$, with $\omega = \theta_j$)
Define $k_n$ (the Key at position $n$): $k_n = k_j \cdot e^{in\theta_j}$ (the content $k_j$ multiplied by the rotation factor $f(n)$, with $\omega = \theta_j$)

Now compute the (complex) dot product of $q_m$ and $k_n$; its real part, summed over all $j$, gives the unnormalized attention score:
$$
\begin{aligned}
\text{Attention}(m, n) &\propto q_m \cdot k_n^* \quad (\text{$*$ denotes the complex conjugate}) \\
&= (q_j \cdot e^{i m \theta_j}) \cdot (k_j \cdot e^{i n \theta_j})^* \\
&= (q_j \cdot e^{i m \theta_j}) \cdot (k_j^* \cdot e^{-i n \theta_j}) \\
&= (q_j \cdot k_j^*) \cdot (e^{i m \theta_j} \cdot e^{-i n \theta_j}) \\
&= (q_j \cdot k_j^*) \cdot e^{i (m - n) \theta_j}
\end{aligned}
$$

This is where RoPE's magic lies!

$(q_j \cdot k_j^*)$: this part depends only on content.
$e^{i(m-n)\theta_j}$: this part depends only on the relative position $(m-n)$.

Through a multiplicative (rotational) operation, RoPE mathematically decouples content from relative position.
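The derivation can be checked end to end with a small complex-number sketch. It assumes that adjacent dimensions `(x[2j], x[2j+1])` are paired into one complex component; other implementations pair dimensions differently, and the helper names `rope` and `score` are illustrative only.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate a real d-dim vector, viewed as d/2 complex numbers, by its position."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per complex component
    z = x[0::2] + 1j * x[1::2]                      # assumed pairing: (x[2j], x[2j+1])
    return z * np.exp(1j * position * theta)

def score(q, k, m, n):
    """Real part of sum_j q_m * conj(k_n), the unnormalized attention score."""
    return np.real(np.sum(rope(q, m) * np.conj(rope(k, n))))

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

s_near = score(q, k, m=5, n=2)       # offset 3, early in the sequence
s_far = score(q, k, m=105, n=102)    # offset 3, a hundred positions later
s_other = score(q, k, m=5, n=3)      # offset 2

print(np.isclose(s_near, s_far))     # same offset, same score
print(s_near, s_other)               # a different offset generally changes the score
```

Moving both tokens by the same amount leaves the score untouched, which is exactly the relative-position property the derivation promises.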

3.3 Multi-Scale Frequencies (an Analogy with Positional Number Systems)

In RoPE the value of $\theta_j$ (the frequency) is not a single number either. It follows the same kind of multi-scale design as APE:
$$\theta_j = 10000^{-2j/d}, \quad j = 0, 1, \dots, d/2 - 1$$

Low-$j$ dimensions: $\theta_j$ is large and the frequency high (fast rotation, like a clock's second hand), capturing fine-grained, short-range relative position.
High-$j$ dimensions: $\theta_j$ is small and the frequency low (slow rotation, like a clock's hour hand), capturing coarse-grained, long-range relative position.

Different "digits", or frequencies, capture information at different scales.
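To get a feel for the range of scales, the frequencies and their wavelengths (the number of positions needed for one full rotation) can be tabulated. The head dimension `d = 64` is an assumed example value, not something fixed by the text.

```python
import numpy as np

d = 64                                   # assumed head dimension, for illustration
j = np.arange(d // 2)
theta = 10000.0 ** (-2.0 * j / d)        # rotation angle per position step
wavelength = 2 * np.pi / theta           # positions per full rotation

print("fastest pair:", theta[0], wavelength[0])
print("slowest pair:", theta[-1], wavelength[-1])
```

The fastest pair completes a turn in about six positions, while the slowest needs tens of thousands, so between them the pairs cover short-range and long-range offsets much as the digits of a number cover ones, tens, and thousands.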

4. Conclusion: Addition vs. Multiplication

Property	Original APE (addition)	RoPE (multiplication/rotation)
Integration	$X + PE_{abs}$	$R_m \cdot q$ and $R_n \cdot k$
Mathematical principle	Linear superposition	Fourier time-shift theorem (complex multiplication)
Burden on the model	Must learn relative position	Relative position is built into the mathematical structure
What is encoded	Absolute position $m$	Absolute positions $m$ and $n$
Resulting score	$\langle q_m, k_n \rangle$ mixes absolute and relative information	$\langle q_m, k_n \rangle$ contains only the relative position $(m-n)$
Extrapolation	Very poor	Better (provides the basis for Position Interpolation, PI)

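Summary:

The link between RoPE and the Fourier transform goes well beyond "both use sin/cos".

RoPE is an architectural leap: rather than settling for handing the model clues about absolute position (additive APE), it uses the Fourier transform's most central property, that a time shift is a phase rotation, to make computing the relative position $(m-n)$ an intrinsic part of the attention mechanism. It is a far more elegant and mathematically principled solution than addition.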