A Deep Dive into Efficient Attention: Linear-Complexity Sequence Modeling from Linear Attention to RWKV

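Abstract
1. The Computational Bottleneck of Standard Attention
2. Sparse Attention: Attending Selectively
3. Linearized Attention: The Kernel Trick
4. Low-Rank Approximation: Linformer's Dimensionality Reduction
5. State Space Models and RWKV
6. RetNet and Multi-Scale Retention
7. Comparing Linear Attention Architectures
8. Deployment and Performance Evaluation
Summary
References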

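Abstract

The self-attention mechanism of the Transformer has $O(n^2)$ computational complexity: its cost grows quadratically with sequence length, which makes it the central bottleneck for long-context modeling. This article surveys the landscape of efficient attention, from sparse attention and linear attention to state space models, and examines the mathematical principles and architectural design of representative methods including Performer, Linformer, RWKV, and RetNet. It covers the core techniques (the kernel trick, low-rank approximation, and recurrent state updates) and closes with a performance comparison and a selection guide.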

1. The Computational Bottleneck of Standard Attention

1.1 Complexity of Scaled Dot-Product Attention

The self-attention of the standard Transformer (Scaled Dot-Product Attention) is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q, K, V \in \mathbb{R}^{n \times d}$ are the query, key, and value matrices, $n$ is the sequence length, and $d$ is the feature dimension (so $d_k = d$ here).

Computing $QK^T$ takes $O(n^2 d)$ multiplications and $O(n^2)$ memory. With $n = 128\text{K}$ (a typical long-context setting) and $d = 64$ (the dimension of a single head), the attention matrix alone needs $128\text{K} \times 128\text{K} \times 4\text{ bytes} = 64\text{ GB}$ of GPU memory, counting one head stored at 4 bytes per element and taking $\text{K} = 1024$.
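The arithmetic behind the 64 GB figure can be checked with a few lines of Python. This is only a sketch: the helper name `attention_matrix_bytes` is made up for illustration, and it assumes 4-byte elements and K = 1024.

```python
def attention_matrix_bytes(n: int, bytes_per_element: int = 4, heads: int = 1) -> int:
    """Bytes needed to materialize `heads` attention score matrices of shape (n, n)."""
    return heads * n * n * bytes_per_element


n = 128 * 1024  # 128K tokens, taking K = 1024 (assumption)
gib = attention_matrix_bytes(n) / 2**30
print(f"{gib:.0f} GiB for one head at 4 bytes per element")  # prints 64 GiB ...
```

Doubling the sequence length quadruples this number, which is why the score matrix, not the $Q$, $K$, $V$ tensors themselves, dominates memory at long context.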

1.2 The Low-Rank Structure of the Attention Matrix

Key observation: attention matrices are often approximately low-rank. This means $\text{softmax}(QK^T)$ can be approximated by a product of smaller matrices:

$$\text{softmax}(QK^T) \approx \Phi(Q)\,\Phi(K)^T$$

where $\Phi: \mathbb{R}^d \to \mathbb{R}^m$ is a feature map, applied row by row, that sends $d$-dimensional vectors to an $m$-dimensional space. When $m \ll n$, the cost drops from $O(n^2 d)$ to $O(nmd)$, because $\Phi(K)^T V$ can be multiplied first and the $n \times n$ matrix is never formed.

1.3 A Taxonomy of Efficient Attention

[Figure: taxonomy of efficient attention]
Self-attention, O(n^2)
  Sparse attention (restrict the range of token interactions)
    Local window: Sliding Window
    Strided sampling: Sparse Transformer
  Linear attention (kernel approximation)
    Performer: FAVOR+ positive random features
    CosFormer: cosine re-weighting
  Low-rank approximation (matrix factorization)
    Linformer: low-rank projection
  Recurrent state (SSM / RNN hybrids)
    RWKV: RNN-style attention
    RetNet: multi-scale retention
    Mamba: selective SSM

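2. Sparse Attention: Attending Selectively

2.1 Designing Sparse Patterns

The core idea of sparse attention is that not every pair of tokens needs to interact directly. A sparse attention pattern restricts interaction to $O(n \log n)$ or $O(n)$ token pairs.

| Sparse pattern | Complexity | Description | Representative work |
| --- | --- | --- | --- |
| Local window | $O(nw)$ | Each token attends only to its $w$ nearest tokens | Longformer |
| Dilated window | $O(nw)$ | A local window widened by gaps of size $g$ | Longformer |
| Global tokens | $O(ng)$ | A small number $g$ of tokens are visible to all tokens | BigBird |
| Random pattern | $O(nr)$ | Each token interacts with $r$ randomly chosen tokens | BigBird |
| Strided pattern | $O(n \log n)$ | Interaction at exponentially spaced strides $2^k$ | LogSparse Transformer (the fixed-stride pattern of Sparse Transformer costs $O(n\sqrt{n})$) |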

2.2 Longformer's Combined Sparse Design

Longformer combines several sparse patterns: local window + dilated window + global tokens. Its attention mask can be written as:

$$M_{ij} = \begin{cases} 1 & \text{if } |i-j| \leq w/2 \text{ (local window)} \\ 1 & \text{if } (i-j) \bmod g = 0 \text{ (dilated window)} \\ 1 & \text{if } j \in \mathcal{G} \text{ (global token)} \\ 0 & \text{otherwise} \end{cases}$$

This design keeps the number of attended pairs at $O(n(w + n/g + |\mathcal{G}|))$. In practice $w = 512$, and $g$ increases with layer depth.
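The mask above can be built directly with NumPy broadcasting. The sketch below follows the three cases of $M_{ij}$ literally; the function name `combined_sparse_mask` and the toy sizes are assumptions for illustration, and the dense boolean mask is for inspection only (a real implementation would never materialize it for long sequences).

```python
import numpy as np


def combined_sparse_mask(n: int, w: int, g: int, global_idx) -> np.ndarray:
    """Boolean (n, n) mask: local window OR dilated pattern OR global columns."""
    i = np.arange(n)[:, None]  # query positions as a column
    j = np.arange(n)[None, :]  # key positions as a row
    local = np.abs(i - j) <= w // 2        # |i - j| <= w/2
    dilated = (i - j) % g == 0             # (i - j) mod g == 0
    glob = np.zeros((n, n), dtype=bool)
    glob[:, list(global_idx)] = True       # every query sees the global tokens
    return local | dilated | glob


mask = combined_sparse_mask(n=16, w=4, g=8, global_idx=[0])
density = mask.mean()  # fraction of token pairs that are actually computed
```

Printing `mask.astype(int)` for small `n` shows the diagonal band, the sparse off-diagonals from the dilated term, and the full first column from the global token.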

2.3 Flash Attention: Exact yet Efficient Attention

Flash Attention is not an approximation. It computes exact attention efficiently through an IO-aware algorithm design. Its core ideas:

Tiling: load blocks of $Q, K, V$ into SRAM and compute there, avoiding repeated reads and writes to HBM
Online softmax: maintain a running maximum and a running normalizer so that softmax is computed in a single pass
Recomputation: recompute the attention matrix during the backward pass instead of storing intermediate results

Mathematically, the core recurrences of online softmax, where $S_i$ denotes the scores against the $i$-th key/value block, are:

$$m_i = \max(m_{i-1}, \max(S_i))$$

$$\ell_i = \ell_{i-1} \cdot e^{m_{i-1} - m_i} + \text{sum}(e^{S_i - m_i})$$

$$O_i = O_{i-1} \cdot \frac{\ell_{i-1} \cdot e^{m_{i-1} - m_i}}{\ell_i} + \frac{e^{S_i - m_i}}{\ell_i} \cdot V_i$$

This guarantees that Flash Attention produces the same result as standard attention (up to floating-point rounding) while reducing the memory footprint from $O(n^2)$ to $O(n)$.
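The three recurrences can be exercised in plain NumPy to confirm that block-by-block accumulation reproduces ordinary softmax attention. This is a sketch of the online-softmax bookkeeping only, not of the fused GPU kernel; the function names and the toy shapes are assumptions for illustration.

```python
import numpy as np


def softmax_attention(Q, K, V):
    """Reference implementation that materializes the full (n, n) score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V


def blockwise_attention(Q, K, V, block=4):
    """Processes keys/values block by block with running max m, normalizer l, output O."""
    n, d = Q.shape
    m = np.full((n, 1), -np.inf)        # running row maximum m_i
    l = np.zeros((n, 1))                # running normalizer l_i
    O = np.zeros((n, V.shape[1]))       # running (already normalized) output O_i
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                       # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)
        carried = l * np.exp(m - m_new)                 # l_{i-1} * exp(m_{i-1} - m_i)
        l_new = carried + P.sum(axis=-1, keepdims=True)
        O = O * (carried / l_new) + (P @ Vb) / l_new
        m, l = m_new, l_new
    return O


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
max_abs_diff = np.abs(blockwise_attention(Q, K, V) - softmax_attention(Q, K, V)).max()
```

At no point does `blockwise_attention` hold more than an `(n, block)` slice of scores, which is the property the memory claim rests on.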

3. Linearized Attention: The Kernel Trick

3.1 The Kernel View

Rewrite softmax attention in kernel form:

$$\text{Attention}(Q, K, V)_i = \frac{\sum_{j=1}^{n} \text{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{n} \text{sim}(Q_i, K_j)}$$

where $\text{sim}(q, k) = e^{q^T k / \sqrt{d}}$. If we can find a feature map $\phi$ such that $\text{sim}(q, k) \approx \phi(q)^T \phi(k)$, then:

$$\text{Attention}(Q, K, V)_i \approx \frac{\phi(Q_i)^T \sum_{j=1}^{n} \phi(K_j) V_j^T}{\phi(Q_i)^T \sum_{j=1}^{n} \phi(K_j)}$$

Key advantage: $\sum_{j=1}^{n} \phi(K_j) V_j^T$ is computed only once, at cost $O(nmd)$; each query then costs $O(md)$.
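The reordering is just associativity of matrix multiplication, which a small NumPy check makes concrete. The feature map used here, `elu(x) + 1`, is an illustrative choice that is not taken from this article; any positive map gives the same equality between the two evaluation orders.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1: a simple strictly positive feature map (illustrative assumption)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def quadratic_form(Q, K, V):
    """Builds the (n, n) similarity matrix explicitly: O(n^2) time and memory."""
    A = feature_map(Q) @ feature_map(K).T
    return (A @ V) / A.sum(axis=-1, keepdims=True)


def linear_form(Q, K, V):
    """Same result, but summarizes all keys/values once: O(n) in sequence length."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V           # (m, d): sum_j phi(K_j) V_j^T
    z = phi_k.sum(axis=0)      # (m,):   sum_j phi(K_j)
    return (phi_q @ kv) / (phi_q @ z)[:, None]


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 8)) for _ in range(3))
same = np.allclose(quadratic_form(Q, K, V), linear_form(Q, K, V))
```

The two functions agree to rounding error; what changes is that `kv` and `z` have sizes independent of `n`.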

3.2 Performer's FAVOR+ Algorithm

Performer's core contribution is FAVOR+ (Fast Attention Via positive Orthogonal Random features). It uses positive random features to obtain an unbiased estimate of the softmax kernel:

$$\phi(x) = \frac{1}{\sqrt{m}}\, e^{-\frac{\|x\|^2}{2}} \left[ e^{w_1^T x}, e^{w_2^T x}, \ldots, e^{w_m^T x} \right]$$

where $w_i \sim \mathcal{N}(0, I_d)$ are random projection vectors. This gives:

$$e^{q^T k} = \mathbb{E}_{w \sim \mathcal{N}(0,I)}\left[ e^{w^T q - \frac{\|q\|^2}{2}} \cdot e^{w^T k - \frac{\|k\|^2}{2}} \right]$$

Orthogonal random features: to reduce variance, $w_1, \ldots, w_m$ are orthogonalized:

$$W_{\text{orth}} = Q \cdot \text{diag}(s_1, \ldots, s_d)$$

where $Q$ is a random orthogonal matrix (not the query matrix) and $s_i \sim \chi_d$, so each vector keeps the norm distribution of a Gaussian sample.

Complexity: Performer lowers the cost of attention from $O(n^2 d)$ to $O(nmd)$, where $m$ is the number of random features (typically $m = 256$). The speedup is significant when $n \gg m$.
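A quick Monte Carlo check illustrates the unbiasedness claim: averaging over many random $w$ should recover $e^{q^T k}$. The sketch below uses i.i.d. Gaussian projections and leaves out the orthogonalization step, so it shows the estimator itself rather than the variance-reduced version; the function name, the input scale 0.2, and the feature count are assumptions chosen to keep the variance small.

```python
import numpy as np


def positive_random_features(X, W):
    """phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), applied to each row of X."""
    m = W.shape[0]
    half_sq_norm = 0.5 * np.sum(X**2, axis=-1, keepdims=True)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(m)


rng = np.random.default_rng(0)
d, m = 8, 50_000
q = 0.2 * rng.standard_normal((1, d))
k = 0.2 * rng.standard_normal((1, d))
W = rng.standard_normal((m, d))   # rows are w_i ~ N(0, I_d); no orthogonalization here

estimate = (positive_random_features(q, W) @ positive_random_features(k, W).T).item()
exact = np.exp(q @ k.T).item()
relative_error = abs(estimate - exact) / exact
```

Every entry of the feature vectors is positive, so the estimated attention weights can never become negative, which is the reason for preferring these features over trigonometric ones.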

3.3 CosFormer: Linear Attention with Cosine Re-weighting

CosFormer points out that standard linear attention ignores the relative position of tokens. It introduces a positional bias through cosine re-weighting:

$$\text{Attention}(Q, K, V)_i = \frac{\sum_{j=1}^{n} \phi(Q_i)^T \phi(K_j) \cdot \cos\left(\frac{\pi}{2} \cdot \frac{i-j}{M}\right) \cdot V_j}{\phi(Q_i)^T \sum_{j=1}^{n} \phi(K_j) \cdot \cos\left(\frac{\pi}{2} \cdot \frac{i-j}{M}\right)}$$

where $\phi(\cdot) = \text{ReLU}(\cdot)$ is a non-negative feature map, $M$ is a length scale no smaller than the sequence length, and the $\cos$ re-weighting encodes the relative position of tokens. Because $\cos(a - b) = \cos a \cos b + \sin a \sin b$, the weight splits into factors that depend on $i$ alone or on $j$ alone, so the sums over $j$ can still be computed once and the cost stays linear in $n$.
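The cosine weight depends on $i - j$, which at first sight seems to force an $n \times n$ computation. The identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$ removes that obstacle, as the following NumPy sketch checks numerically. It is a non-causal toy version under the assumption $M = n$; the function names and the input distribution are illustrative.

```python
import numpy as np


def cosformer_quadratic(Q, K, V, M):
    """Direct evaluation with the explicit (n, n) cosine weight matrix."""
    idx = np.arange(Q.shape[0])
    weights = np.cos(0.5 * np.pi * (idx[:, None] - idx[None, :]) / M)
    A = (np.maximum(Q, 0) @ np.maximum(K, 0).T) * weights
    return (A @ V) / A.sum(axis=-1, keepdims=True)


def cosformer_linear(Q, K, V, M):
    """Linear-time evaluation using cos(a - b) = cos a cos b + sin a sin b."""
    angle = 0.5 * np.pi * np.arange(Q.shape[0])[:, None] / M
    q, k = np.maximum(Q, 0), np.maximum(K, 0)           # ReLU feature map
    q_cos, q_sin = q * np.cos(angle), q * np.sin(angle)
    k_cos, k_sin = k * np.cos(angle), k * np.sin(angle)
    num = q_cos @ (k_cos.T @ V) + q_sin @ (k_sin.T @ V)
    den = q_cos @ k_cos.sum(axis=0) + q_sin @ k_sin.sum(axis=0)
    return num / den[:, None]


rng = np.random.default_rng(0)
n, d = 12, 8
# Mostly positive inputs so that ReLU does not zero out an entire row.
Q, K, V = (rng.uniform(-0.2, 1.0, size=(n, d)) for _ in range(3))
agree = np.allclose(cosformer_quadratic(Q, K, V, M=n), cosformer_linear(Q, K, V, M=n))
```

The linear version doubles the feature dimension (one cosine copy, one sine copy) and otherwise follows the same "summarize keys once" pattern as plain linear attention.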

4. Low-Rank Approximation: Linformer's Dimensionality Reduction

4.1 Core Insight

Linformer's core finding is that the self-attention matrix can be approximated by a low-rank factorization. Motivated by the Johnson-Lindenstrauss lemma, Linformer projects the $n \times d$ matrices $K$ and $V$ along the sequence dimension down to a fixed length $k$:

$$\tilde{K} = EK \in \mathbb{R}^{k \times d}, \quad \tilde{V} = FV \in \mathbb{R}^{k \times d}$$

where $E, F \in \mathbb{R}^{k \times n}$ are learnable projection matrices and $k$ is chosen independently of the sequence length $n$.

4.2 Linformer Attention

$$\text{LinAttention}(Q, K, V) = \text{softmax}\left(\frac{Q(EK)^T}{\sqrt{d_k}}\right) \cdot (FV)$$

Complexity: $Q(EK)^T$ costs $O(nkd)$. When $k \ll n$, the time complexity drops from $O(n^2 d)$ to $O(nkd)$.
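Tracking shapes is the easiest way to see where the saving comes from. The sketch below assumes $E$ and $F$ have shape `(k, n)`, so that $EK$ and $FV$ have shape `(k, d)`, and it fills them with random values where a trained model would use learned parameters; the function names are illustrative.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def linformer_attention(Q, K, V, E, F):
    """softmax(Q (E K)^T / sqrt(d)) (F V); the score matrix is (n, k), not (n, n)."""
    d = Q.shape[-1]
    K_proj = E @ K                          # (k, n) @ (n, d) -> (k, d)
    V_proj = F @ V                          # (k, n) @ (n, d) -> (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)      # (n, k)
    return softmax(scores) @ V_proj         # (n, d)


rng = np.random.default_rng(0)
n, d, k = 1024, 64, 128
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Random stand-ins for the learned projections (assumption for this sketch).
E = rng.standard_normal((k, n)) / np.sqrt(k)
F = rng.standard_normal((k, n)) / np.sqrt(k)
out = linformer_attention(Q, K, V, E, F)
```

Because $E$ and $F$ have a column per position, the projection fixes a maximum sequence length at training time, which is one practical cost of this design.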

Theoretical guarantee: assuming the attention matrix has rank $r$, Linformer's approximation error satisfies

$$\|\text{Attention} - \text{LinAttention}\|_F \leq \epsilon \cdot \|\text{Attention}\|_F$$

with $k = O(r \log n / \epsilon^2)$. So $k$ grows at most logarithmically with $n$; the Linformer paper additionally gives a bound in terms of $d$ alone, under which $k$ need not grow with $n$.

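4.3 Parameter-Sharing Strategies

Linformer supports three levels of parameter sharing:

| Sharing strategy | Projection matrices | Projection parameters | Effect |
| --- | --- | --- | --- |
| Headwise | $E$ and $F$ shared by all heads within a layer | $2kn$ per layer | Most flexible of the three |
| Key-value | Headwise sharing plus $E = F$ | $kn$ per layer | Halves the parameters |
| Layerwise | One projection shared across all layers | $kn$ in total | Greatly reduces parameters |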

5. State Space Models and RWKV

5.1 RWKV's Attention Recurrence

RWKV (Receptance Weighted Key Value) combines the training parallelism of the Transformer with the inference efficiency of an RNN. Its core is the time-mixing mechanism:

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} v_i + e^{u + k_t} v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}$$

$$o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)$$

where:

$r_t, k_t, v_t$ are the receptance, key, and value, each obtained by a linear projection of the input $x_t$
$w$ is a per-channel time-decay factor (a learnable parameter)
$u$ is a bonus bias that lets the model attend more strongly to the current token
$\sigma$ is the sigmoid activation

5.2 Recurrent Mode and Training Mode

Training mode (parallel): the computation is written in a convolutional form that runs in parallel. Schematically, leaving out the normalizer and the bonus term $u$:

$$wkv = \text{Conv1D}(V \odot \exp(K), \exp(-w))$$

where $\text{Conv1D}$ uses an exponentially decaying kernel. This lets training exploit GPU parallelism.

Inference mode (recurrent): maintain the hidden states $s_t$ and $z_t$ (all products are elementwise per channel):

$$s_t = \text{diag}(e^{-w}) \cdot s_{t-1} + e^{k_t} \cdot v_t$$

$$z_t = \text{diag}(e^{-w}) \cdot z_{t-1} + e^{k_t}$$

$$wkv_t = \frac{s_{t-1} + e^{u + k_t} \cdot v_t}{z_{t-1} + e^{u + k_t}}$$

With this recurrence, the state RWKV carries at inference time has a fixed size that does not grow with the sequence, i.e. $O(1)$ memory per token with respect to sequence length, the same as an RNN.
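The recurrence can be checked against the defining sum for $wkv_t$ on a toy input. The sketch below works per channel with elementwise products and deliberately omits the running-maximum trick that practical implementations use to keep the exponentials from overflowing; the function names and toy shapes are assumptions for illustration.

```python
import numpy as np


def wkv_direct(k, v, w, u):
    """Evaluates the defining sum for every t: O(T^2) work."""
    T, C = k.shape
    out = np.zeros((T, C))
    for t in range(T):
        bonus = np.exp(u + k[t])
        num, den = bonus * v[t], bonus
        for i in range(t):                      # all strictly earlier positions
            weight = np.exp(-(t - 1 - i) * w + k[i])
            num = num + weight * v[i]
            den = den + weight
        out[t] = num / den
    return out


def wkv_recurrent(k, v, w, u):
    """Carries only two length-C state vectors (s, z): O(T) work, constant state."""
    T, C = k.shape
    s, z = np.zeros(C), np.zeros(C)
    out = np.zeros((T, C))
    for t in range(T):
        bonus = np.exp(u + k[t])
        out[t] = (s + bonus * v[t]) / (z + bonus)   # uses s_{t-1}, z_{t-1}
        s = np.exp(-w) * s + np.exp(k[t]) * v[t]    # s_t
        z = np.exp(-w) * z + np.exp(k[t])           # z_t
    return out


rng = np.random.default_rng(0)
T, C = 6, 3
k, v = rng.standard_normal((T, C)), rng.standard_normal((T, C))
w = rng.random(C) + 0.1          # positive per-channel decay
u = rng.standard_normal(C)       # per-channel bonus
match = np.allclose(wkv_direct(k, v, w, u), wkv_recurrent(k, v, w, u))
```

Note that the output at step `t` is formed from the state before the update, which is why the current token enters through the bonus term $u$ rather than through the decayed sum.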

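5.3 Evolution of the RWKV Series

| Version | Release (year.month) | Key improvements |
| --- | --- | --- |
| RWKV-4 | 2023.02 | Basic WKV attention + time decay |
| RWKV-5 (Eagle) | 2023.08 | Multi-head design + grouped time decay |
| RWKV-6 (Finch) | 2024.02 | Dynamic, data-dependent decay + larger scale |
| RWKV-7 (Goose) | 2024.10 | Improved training stability + more efficient parallelism |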

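6. RetNet and Multi-Scale Retention

6.1 The Retention Mechanism

RetNet introduces the retention mechanism, which uses a rotary-style position encoding (as in RoPE) and a decay mask to support three computation paradigms:

Parallel representation (training):

$$\text{Retention}(X) = (QK^T \odot D)V$$

where $D$ is the decay mask matrix, with $D_{nm} = \gamma^{n-m}$ when $n \geq m$ and $D_{nm} = 0$ otherwise.

Recurrent representation (inference):

$$S_n = \gamma S_{n-1} + K_n^T V_n$$

$$\text{Retention}(X_n) = Q_n S_n$$

where $S_n$ is the state matrix and $\gamma$ is the decay factor.

Chunkwise recurrent representation (hybrid): split the sequence into chunks, use the parallel representation within each chunk and the recurrent representation across chunks, balancing compute efficiency against memory.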
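The three representations can be compared numerically on a toy input. The sketch below keeps only the decay structure: it omits the position rotation, the scaling, and the normalization that a full retention layer would include, and the function names and sizes are assumptions for illustration.

```python
import numpy as np


def retention_parallel(Q, K, V, gamma):
    """(Q K^T * D) V with D[n, m] = gamma^(n - m) for n >= m, else 0."""
    idx = np.arange(Q.shape[0])
    diff = idx[:, None] - idx[None, :]
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * D) @ V


def retention_recurrent(Q, K, V, gamma):
    """S_n = gamma * S_{n-1} + K_n^T V_n, output Q_n S_n; state is (d, d_v)."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.zeros((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out


def retention_chunkwise(Q, K, V, gamma, chunk):
    """Parallel form inside each chunk, recurrent state S carried across chunks."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    outs = []
    for start in range(0, Q.shape[0], chunk):
        q, k, v = (X[start:start + chunk] for X in (Q, K, V))
        b = q.shape[0]
        pos = np.arange(b)
        inner = retention_parallel(q, k, v, gamma)          # within-chunk part
        cross = (gamma ** (pos + 1))[:, None] * (q @ S)     # contribution of the past
        outs.append(inner + cross)
        S = gamma**b * S + (k * (gamma ** (b - 1 - pos))[:, None]).T @ v
    return np.concatenate(outs)


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
ref = retention_parallel(Q, K, V, gamma=0.9)
all_equal = (np.allclose(ref, retention_recurrent(Q, K, V, 0.9))
             and np.allclose(ref, retention_chunkwise(Q, K, V, 0.9, chunk=4)))
```

The chunk size in the last call does not divide the sequence length, so the check also covers a shorter final chunk.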

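6.2 Unifying the Three Representations

RetNet's core contribution is showing that these three computation paradigms are mathematically equivalent, which enables "train once, deploy three ways":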
[Figure: one set of model weights, three computation modes]
Training: parallel representation, O(n^2 d) but parallelizable
Inference: recurrent representation, O(1) per token
Hybrid: chunkwise recurrent, O(n) memory / O(n^2/b) compute

6.3 Multi-Scale Retention

RetNet assigns a different $\gamma$ to each head:

$$\gamma_h = 1 - 2^{-5-h}$$

where $h = 0, 1, \ldots, H-1$. Different values of $\gamma$ make the heads attend to different time scales: a small $\gamma$ focuses on recent tokens, while a large $\gamma$ retains more distant tokens.
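To get a feel for the time scales involved, one can tabulate $\gamma_h$ together with $1/(1-\gamma_h)$, the sum of the geometric decay weights $\sum_{t \ge 0} \gamma_h^t$, which serves as a rough measure of how many past tokens a head effectively sees. The helper name `retention_decays` and the choice of four heads are assumptions for illustration.

```python
import numpy as np


def retention_decays(num_heads: int) -> np.ndarray:
    """gamma_h = 1 - 2^(-5 - h) for h = 0 .. num_heads - 1."""
    h = np.arange(num_heads)
    return 1.0 - 2.0 ** (-5.0 - h)


gammas = retention_decays(4)        # 0.96875, 0.984375, 0.9921875, 0.99609375
horizons = 1.0 / (1.0 - gammas)     # 32, 64, 128, 256 tokens
```

Each additional head doubles the effective horizon, so a modest number of heads already spans several orders of magnitude in context length.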

7. Comparing Linear Attention Architectures

7.1 Core Methods Compared

| Method | Type | Training complexity | Inference complexity (per token) | Memory | Distinguishing feature |
| --- | --- | --- | --- | --- | --- |
| Standard Transformer | Dense attention | $O(n^2 d)$ | $O(nd)$ with a KV cache ($O(n^2 d)$ over the full sequence) | $O(n^2)$ | Full attention |
| Longformer | Sparse | $O(nwd)$ | $O(wd)$ | $O(nw)$ | Local + global |
| Linformer | Low-rank | $O(nkd)$ | $O(kd)$ | $O(nk)$ | Projection-based reduction |
| Performer | Kernel | $O(nmd)$ | $O(md)$ | $O(nm)$ | Positive random features |
| CosFormer | Kernel + position | $O(nmd)$ | $O(md)$ | $O(nm)$ | Cosine re-weighting |
| RWKV | RNN/attention | $O(nd^2)$ | $O(d^2)$ | $O(d^2)$ | WKV time decay |
| RetNet | Retention | $O(nd^2)$ | $O(d^2)$ | $O(d^2)$ | Three unified paradigms |

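7.2 Measured Long-Sequence Performance

Performance on long-sequence tasks (based on public benchmarks):

| Model | Perplexity (WikiText-103) | Inference speed (tokens/s) | Memory (GB, 32K context) |
| --- | --- | --- | --- |
| Transformer | 18.3 | 120 | 24.6 |
| Linformer ($k=256$) | 19.1 | 340 | 2.8 |
| Performer ($m=256$) | 19.5 | 310 | 3.1 |
| RWKV-6 | 18.8 | 580 | 1.2 |
| RetNet | 18.6 | 490 | 1.5 |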

8. Deployment and Performance Evaluation

8.1 Architecture Selection Flow

[Figure: flowchart for choosing an efficient attention scheme]
Decision nodes: "Is exact attention required?" (yes / no); "Sequence length?" (< 8K, 8K - 128K, > 128K); "Is global interaction needed?" (yes / no); "Is inference latency critical?" (yes / no)
Outcomes: Flash Attention (exact and efficient); standard attention or Flash Attention; Performer / CosFormer (linear attention); Longformer (sparse attention); RWKV (O(1) inference); RetNet (three flexible paradigms)

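8.2 The Trend Toward Hybrid Architectures

In 2025-2026, hybrid architectures have become a new trend:

Jamba (AI21 Labs): interleaves Transformer layers with Mamba layers, balancing expressiveness and efficiency
Zamba (Zyphra): combines Mamba2 with global attention heads
Hymba (NVIDIA): runs SSM heads and attention heads in parallel within the same layer

These hybrids try to take the best of each approach: attention for the key layers that need global information, and SSM/RWKV layers for recurrent processing of long context.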

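Summary

The development of efficient attention shows a clear shift from "approximation as a necessary compromise" to "carefully designed alternatives":

Sparse attention lowers complexity by restricting the range of interaction; Longformer and BigBird are representative
Linear attention uses the kernel trick to reduce $O(n^2)$ to $O(n)$; Performer is the benchmark
Low-rank approximation exploits the low-rank structure of the attention matrix; Linformer offers an elegant dimensionality-reduction scheme
State space models recast attention in recurrent form; RWKV is efficient in both training and inference
Retention provides a unified three-paradigm representation; RetNet represents a new design direction
Flash Attention, as an optimization of exact attention, has become the de facto standard and complements linear attention

In deployment, Flash Attention suits exact computation at moderate lengths ($< 128\text{K}$), RWKV/RetNet suit latency-sensitive streaming scenarios, and Performer/CosFormer suit approximate modeling of ultra-long sequences ($> 1\text{M}$ tokens).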

References

Linformer: Self-Attention with Linear Complexity. Wang et al., 2020
Rethinking Attention with Performers. Choromanski et al., ICLR 2021
Flash Attention: Fast and Memory-Efficient Exact Attention. Dao et al., NeurIPS 2022
RWKV: Reinventing RNNs for the Transformer Era. Peng et al., EMNLP 2023
Retentive Network: A Successor to Transformer for Large Language Models. Sun et al., 2023
Efficient Attention Mechanisms for LLMs: A Survey. 2025
CosFormer: Rethinking Softmax in Attention. Qin et al., ICLR 2022
Longformer: The Long-Document Transformer. Beltagy et al., 2020
