Transformer Principles Explained, with Operator-Level Visualization

1. Introduction
2. The core idea of the Transformer
3. Overall network structure
4. Input representation: Embedding and positional encoding
5. How Scaled Dot-Product Attention works
6. How Multi-Head Attention works
7. Encoder structure breakdown
8. Decoder structure breakdown
9. Operator-level visualization and tensor dimensions
10. A small numerical example
11. Engineering implementation and operator mapping
12. Strengths and limitations
13. Step-by-step NumPy visualization code for the Transformer
14. Conclusion
References

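1. Introduction

The Transformer is a sequence-modeling architecture built entirely on the attention mechanism. Unlike traditional RNNs/LSTMs, it does not rely on recurrence; unlike traditional CNNs, it does not rely on convolution to propagate long-range dependencies. Its core ideas are:

Let every position in the sequence relate directly to every other position;
Aggregate information from the whole sequence, weighted by relevance;
Improve training efficiency through highly parallel matrix operations.

The Transformer was first proposed by Vaswani et al. in 2017 and underlies modern large language models, Vision Transformers, multimodal models, and related architectures.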

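2. The core idea of the Transformer

The Transformer in one sentence:

Each token can retrieve, from the whole sequence, the information most useful to itself, based on its relevance to the other tokens.

Unlike the step-by-step propagation of an RNN, the Transformer uses attention to connect any two positions directly in a single attention operation. This makes it possible to:

Capture long-range dependencies more efficiently;
Fully parallelize training across sequence positions;
Express all sequence interactions uniformly as matrix multiplications.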

3. Overall network structure

The Transformer consists of two parts, an Encoder and a Decoder. The base model in the original paper uses:

6 Encoder layers
6 Decoder layers

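3.1 Overall structure diagram

Encoder side: Input Tokens -> Embedding -> Positional Encoding -> Encoder × N -> Encoder output (contextual representation)
Decoder side: Previously generated output Tokens -> Embedding -> Positional Encoding -> Decoder × N (attending to the Encoder output) -> Linear -> Softmax -> Output probabilities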

3.2 Encoder-Decoder structure diagram

Encoder Layer: Multi-Head Self-Attention -> Add & Norm -> Feed Forward -> Add & Norm
Decoder Layer: Masked Multi-Head Self-Attention -> Add & Norm -> Encoder-Decoder Attention -> Add & Norm -> Feed Forward -> Add & Norm

4. Input representation: Embedding and positional encoding

The Transformer's input must first be converted into continuous vector representations.

4.1 Token Embedding

Let:

the batch size be $B$
the sequence length be $L$
the model dimension be $d_{model}$

Then the input embeddings are:

$$X \in \mathbb{R}^{B \times L \times d_{model}}$$
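As a minimal sketch of the embedding lookup, the snippet below indexes a randomly initialized table with integer token IDs. The sizes (`vocab_size`, `d_model`, `B`, `L`) and the name `embedding_table` are hypothetical values chosen for illustration; a trained model would learn the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, d_model = 10, 8
B, L = 2, 4

# Embedding table: one d_model-dimensional row per vocabulary entry.
embedding_table = rng.normal(size=(vocab_size, d_model))

# Integer token IDs with shape (B, L).
token_ids = rng.integers(0, vocab_size, size=(B, L))

# Embedding lookup is plain integer indexing into the table.
X = embedding_table[token_ids]

print(X.shape)  # (2, 4, 8), i.e. (B, L, d_model)
```

Each position of `X` is simply the table row selected by the corresponding token ID, so the lookup involves no arithmetic.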

4.2 Positional encoding

Because the Transformer has no recurrent structure, it has no built-in sense of token order, so positional encoding must be added explicitly.

The original Transformer uses sinusoidal positional encoding:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

The final input representation is:

$$X_{pos} = X + PE$$
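The two formulas can be sketched in NumPy as follows. The function name `sinusoidal_positional_encoding` is a hypothetical helper, the sketch assumes an even `d_model`, and the zero-valued `X` is only a placeholder for real embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; d_model is assumed to be even."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2), the values 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cos
    return pe

B, L, d_model = 2, 4, 8
X = np.zeros((B, L, d_model))      # placeholder embeddings
PE = sinusoidal_positional_encoding(L, d_model)
X_pos = X + PE                     # PE broadcasts over the batch dimension

print(PE.shape)     # (4, 8)
print(X_pos.shape)  # (2, 4, 8)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for an implementation.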

4.3 Input representation diagram

Input Token IDs -> Embedding Lookup -> Token Embeddings
Token Embeddings + Positional Encoding -> Add -> Transformer input representation

5. How Scaled Dot-Product Attention works

The core operator of the Transformer is Scaled Dot-Product Attention.

It is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where:

$Q$: Query, "what I am looking for right now"
$K$: Key, "what features the others have"
$V$: Value, "the information the others actually provide"
$d_k$: the dimension of the Keys

5.1 Computation steps

Compute the similarity between the Queries and the Keys:
$$S = \frac{QK^\top}{\sqrt{d_k}}$$
Normalize with softmax, applied row by row over the Key dimension:
$$A = \mathrm{softmax}(S)$$
Take the weighted sum of the Values:
$$C = AV$$
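The three steps map directly onto a few lines of NumPy. This is a minimal single-head sketch without masking; the function names and the toy sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    S = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # scores
    A = softmax(S, axis=-1)                        # attention weights
    C = A @ V                                      # weighted sum of Values
    return C, A

rng = np.random.default_rng(0)
L, d_k = 3, 4
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
V = rng.normal(size=(L, d_k))

C, A = scaled_dot_product_attention(Q, K, V)
print(A.shape, C.shape)  # (3, 3) (3, 4)
print(A.sum(axis=-1))    # every row of A sums to 1
```

Because the transpose is taken over the last two axes only, the same function also works on batched inputs with leading dimensions.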

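5.2 Why divide by $\sqrt{d_k}$

Without scaling, the dot products can become very large in magnitude when $d_k$ is large: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Large scores make the softmax overly peaked, which leads to small gradients and unstable training. The factor

$$\frac{1}{\sqrt{d_k}}$$

is therefore introduced to keep the scores in a numerically stable range.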
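A small experiment can make the effect of the scaling factor visible. The sketch below assumes queries and keys with independent standard-normal components, which is an idealization rather than a property of trained models, and the sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Variance of the dot product before and after scaling.
variances = {}
for d_k in (4, 64, 1024):
    q = rng.normal(size=(2000, d_k))
    k = rng.normal(size=(2000, d_k))
    dots = np.sum(q * k, axis=-1)          # 2000 sample dot products
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    print(d_k, variances[d_k])             # roughly (d_k, 1)

# Effect on softmax: one query scored against 8 keys with d_k = 1024.
d_k = 1024
q = rng.normal(size=d_k)
keys = rng.normal(size=(8, d_k))
scores = keys @ q
peak_unscaled = softmax(scores).max()
peak_scaled = softmax(scores / np.sqrt(d_k)).max()
print(peak_unscaled, peak_scaled)
```

The unscaled variance grows roughly linearly with `d_k`, while the scaled variance stays near 1. The largest softmax weight is also higher without scaling, meaning the distribution is closer to one-hot.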

6. How Multi-Head Attention works

Single-head attention can only model relationships in one subspace, whereas multi-head attention lets the model capture relationships in several different subspaces in parallel.

6.1 Linear projections

Apply three linear transformations to the input $X_{pos}$:

$$Q = X_{pos}W^Q,\quad K = X_{pos}W^K,\quad V = X_{pos}W^V$$

where:

$$Q, K, V \in \mathbb{R}^{B \times L \times d_{model}}$$
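The projections are ordinary matrix multiplications applied to the last axis. In this sketch the weights are random stand-ins for learned parameters, bias terms are omitted, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, d_model = 2, 4, 8

X_pos = rng.normal(size=(B, L, d_model))

# Random stand-ins for the learned projection matrices W^Q, W^K, W^V.
W_Q = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

# The (d_model, d_model) matrices broadcast over the batch dimension.
Q = X_pos @ W_Q
K = X_pos @ W_K
V = X_pos @ W_V

print(Q.shape, K.shape, V.shape)  # (2, 4, 8) three times
```

All three projections read the same input, so frameworks often fuse them into a single larger matrix multiplication.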

6.2 Splitting into multiple heads

Let the number of heads be $h$. The dimension of each head is then:

$$d_k = \frac{d_{model}}{h}$$

After splitting, the tensors have shape:

$$Q, K, V \in \mathbb{R}^{B \times h \times L \times d_k}$$
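Splitting heads is a reshape followed by a transpose, with no arithmetic involved. The helper name `split_heads` is hypothetical, and `Q` is filled with consecutive integers only so the data movement is easy to inspect.

```python
import numpy as np

def split_heads(x, h):
    """(B, L, d_model) -> (B, h, L, d_k), where d_k = d_model // h."""
    B, L, d_model = x.shape
    assert d_model % h == 0, "d_model must be divisible by h"
    d_k = d_model // h
    # Cut the last axis into h chunks, then move the head axis forward.
    return x.reshape(B, L, h, d_k).transpose(0, 2, 1, 3)

B, L, d_model, h = 2, 4, 8, 2
Q = np.arange(B * L * d_model, dtype=float).reshape(B, L, d_model)
Q_heads = split_heads(Q, h)

print(Q_heads.shape)     # (2, 2, 4, 4)
print(Q_heads[0, 1, 3])  # same values as Q[0, 3, 4:8]
```

Head `i` receives the contiguous slice `i*d_k:(i+1)*d_k` of the model dimension at every position.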

6.3 Computing attention separately for each head

For the $i$-th head:

$$head_i = \mathrm{Attention}(Q_i, K_i, V_i)$$

6.4 Concatenation and output projection

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(head_1, \dots, head_h)W^O$$
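Putting projection, splitting, per-head attention, concatenation, and the output projection together gives the sketch below. It covers self-attention without masking or biases; the function name, the random weights, and the sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    B, L, d_model = X.shape
    d_k = d_model // h

    def split(t):
        # (B, L, d_model) -> (B, h, L, d_k)
        return t.reshape(B, L, h, d_k).transpose(0, 2, 1, 3)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    S = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)   # (B, h, L, L)
    heads = softmax(S) @ V                           # (B, h, L, d_k)
    # Concat: move the head axis back and merge it into the model axis.
    concat = heads.transpose(0, 2, 1, 3).reshape(B, L, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
B, L, d_model, h = 2, 4, 8, 2
X = rng.normal(size=(B, L, d_model))
W_Q, W_K, W_V, W_O = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (2, 4, 8)
```

All heads are computed in one batched matrix multiplication rather than a Python loop, which is how the parallelism across heads is usually realized.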

6.5 Multi-Head visualization

Input representation X
 -> Linear -> Q, Linear -> K, Linear -> V
 -> Split Heads (applied to each of Q, K, V)
 -> Head 1 Attention, Head 2 Attention, ..., Head h Attention (in parallel)
 -> Concat
 -> Linear Projection

7. Encoder structure breakdown

Each Encoder layer contains two core sub-layers:

Multi-Head Self-Attention
Position-wise Feed Forward Network

Each sub-layer is additionally wrapped with:

Residual Connection
LayerNorm

7.1 Structure formulas

Let the input be $x$. Then:

Self-Attention sub-layer

$$z_1 = \mathrm{LayerNorm}(x + \mathrm{MHA}(x))$$

Feed Forward sub-layer

$$z_2 = \mathrm{LayerNorm}(z_1 + \mathrm{FFN}(z_1))$$
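Both formulas share the same "residual add, then LayerNorm" wrapper, which can be sketched once and reused. In this simplified version LayerNorm has no learnable gain or bias, and the sub-layer is a hypothetical stand-in (a single matrix multiplication) rather than real attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position over its feature dimension (the last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by LayerNorm.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))          # (B, L, d_model)
W = rng.normal(size=(8, 8)) * 0.1       # stand-in sub-layer weights

z1 = add_and_norm(x, lambda t: t @ W)
print(z1.shape)                         # (2, 4, 8)
print(z1.mean(axis=-1).round(6))        # approximately 0 at every position
```

Normalization is computed per position, so it does not mix information across tokens or across the batch.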

7.2 Feed-forward network

The feed-forward network is applied to each position independently:

$$\mathrm{FFN}(x)=\max(0, xW_1 + b_1)W_2 + b_2$$
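The formula translates to two matrix multiplications with a ReLU in between. The hidden width `d_ff` and the random parameters below are hypothetical toy values.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # max(0, x W1 + b1) W2 + b2, applied along the last axis.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
B, L, d_model, d_ff = 2, 4, 8, 32       # d_ff is an assumed hidden width

x = rng.normal(size=(B, L, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (2, 4, 8)

# Position-wise: one position processed alone gives the same result.
single = ffn(x[0, 1], W1, b1, W2, b2)
print(np.allclose(out[0, 1], single))  # True
```

The final check illustrates what "position-wise" means: the same weights are applied to every token, and no token influences another inside the FFN.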

7.3 Encoder structure diagram

Encoder input x
 -> Multi-Head Self-Attention -> Add -> LayerNorm
 -> Feed Forward -> Add -> LayerNorm
 -> Encoder output

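8. Decoder structure breakdown

Each Decoder layer has one more sub-layer than an Encoder layer, for three in total:

Masked Multi-Head Self-Attention
Encoder-Decoder Attention
Feed Forward Network

8.1 Why a mask is needed

During sequence generation, the current position must not see future tokens, so the Decoder's self-attention has to use a mask.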
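A causal mask can be sketched as an additive matrix that sets the scores of future positions to negative infinity before the softmax. The random scores below are placeholders for real $QK^\top/\sqrt{d_k}$ values, and using `-np.inf` is one common choice (a large negative constant is another).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

L = 4
rng = np.random.default_rng(0)
S = rng.normal(size=(L, L))             # placeholder attention scores

# True above the diagonal marks "future" positions (column j > row i).
future = np.triu(np.ones((L, L), dtype=bool), k=1)
additive_mask = np.where(future, -np.inf, 0.0)

A = softmax(S + additive_mask, axis=-1)
print(np.round(A, 3))  # lower-triangular: zeros above the diagonal
```

Row `i` of `A` distributes its weight over positions `0..i` only, so the first row puts all of its weight on the first token.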

8.2 Decoder structure formulas

Let the input be $y$ and the Encoder output be $E$:

Masked Self-Attention

$$u_1 = \mathrm{LayerNorm}(y + \mathrm{MaskedMHA}(y))$$

Encoder-Decoder Attention

$$u_2 = \mathrm{LayerNorm}(u_1 + \mathrm{MHA}(Q=u_1, K=E, V=E))$$

Feed Forward

$$u_3 = \mathrm{LayerNorm}(u_2 + \mathrm{FFN}(u_2))$$
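The Encoder-Decoder Attention step differs from self-attention only in where Q, K, and V come from. The single-head, unbatched sketch below uses hypothetical lengths `T` (Decoder positions) and `L` (Encoder positions) and random stand-ins for `u1`, `E`, and the projection weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V, A

rng = np.random.default_rng(0)
T, L, d = 2, 5, 8                       # T decoder positions, L encoder positions

u1 = rng.normal(size=(T, d))            # output of the masked self-attention sub-layer
E = rng.normal(size=(L, d))             # Encoder output
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Queries come from the Decoder; Keys and Values come from the Encoder.
C, A = attention(u1 @ W_Q, E @ W_K, E @ W_V)
print(A.shape)  # (2, 5): one row per Decoder position, one column per Encoder position
print(C.shape)  # (2, 8)
```

The attention matrix is rectangular here, and no causal mask is applied because every Decoder position is allowed to see the full source sequence.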

8.3 Decoder structure diagram

Decoder input y
 -> Masked Multi-Head Self-Attention -> Add -> LayerNorm
 -> Encoder-Decoder Attention (K and V from the Encoder output) -> Add -> LayerNorm
 -> Feed Forward -> Add -> LayerNorm
 -> Decoder output

9. Operator-level visualization and tensor dimensions

This section traces the core tensor flow of the Transformer.

Let:

batch size: $B$
sequence length: $L$
number of heads: $h$
model dimension: $d_{model}$
per-head dimension: $d_k = d_{model}/h$

9.1 Input tensor

$$X \in \mathbb{R}^{B \times L \times d_{model}}$$

9.2 Generating Q/K/V

$$Q,K,V \in \mathbb{R}^{B \times L \times d_{model}}$$

9.3 After splitting heads

$$Q,K,V \in \mathbb{R}^{B \times h \times L \times d_k}$$

9.4 Computing the score matrix

$$S = \frac{QK^\top}{\sqrt{d_k}}$$

with shape:

$$S \in \mathbb{R}^{B \times h \times L \times L}$$

9.5 Normalizing into attention weights

$$A = \mathrm{softmax}(S)$$

The shape is unchanged:

$$A \in \mathbb{R}^{B \times h \times L \times L}$$

9.6 Weighting the Values

$$C = AV$$

The result is:

$$C \in \mathbb{R}^{B \times h \times L \times d_k}$$

9.7 Concatenating heads

$$\mathrm{Concat}(C_1,\dots,C_h)\in \mathbb{R}^{B \times L \times d_{model}}$$
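The shapes listed in 9.1 through 9.7 can be checked by recording them while running the operators. This sketch uses `np.einsum` with explicit index letters so each axis is named; the sizes and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, h, d_model = 2, 5, 4, 16
d_k = d_model // h

X = rng.normal(size=(B, L, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3))

shapes = {"X": X.shape}

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
shapes["Q/K/V"] = Q.shape

# Split heads: (B, L, d_model) -> (B, h, L, d_k)
Qh, Kh, Vh = (t.reshape(B, L, h, d_k).transpose(0, 2, 1, 3) for t in (Q, K, V))
shapes["split"] = Qh.shape

# b = batch, h = head, q = query position, k = key position, d = feature
S = np.einsum("bhqd,bhkd->bhqk", Qh, Kh) / np.sqrt(d_k)
shapes["S"] = S.shape

A = np.exp(S - S.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)
shapes["A"] = A.shape

C = np.einsum("bhqk,bhkd->bhqd", A, Vh)
shapes["C"] = C.shape

concat = C.transpose(0, 2, 1, 3).reshape(B, L, d_model)
shapes["concat"] = concat.shape

for name, shape in shapes.items():
    print(f"{name:8s} {shape}")
```

With these sizes the trace reads `(2, 5, 16)` for the input and projections, `(2, 4, 5, 4)` after splitting, `(2, 4, 5, 5)` for the scores and weights, `(2, 4, 5, 4)` for the per-head output, and `(2, 5, 16)` after concatenation.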

9.8 Tensor flow diagram

X: B x L x d_model
 -> Linear to Q, Linear to K, Linear to V
 -> Split Heads: B x h x L x d_k (for each of Q, K, V)
 -> Score MatMul and Scale
 -> Softmax to Attention Weights
 -> Weights MatMul V
 -> Concat Heads
 -> Linear Projection

10. A small numerical example

Suppose that, for one attention head, the sequence length is $L=3$ and the Value matrix is:

$$V = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 3 & 1 \end{bmatrix}$$

The attention weight matrix is:

$$A = \begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.3 & 0.3 & 0.4 \end{bmatrix}$$

The output is then:

$$C = AV$$

First row:

$$0.7\,[1, 0] + 0.2\,[0, 2] + 0.1\,[3, 1] = [1.0, 0.5]$$

Second row:

$$0.1\,[1, 0] + 0.8\,[0, 2] + 0.1\,[3, 1] = [0.4, 1.7]$$

Third row:

$$0.3\,[1, 0] + 0.3\,[0, 2] + 0.4\,[3, 1] = [1.5, 1.0]$$

This shows that each token's output is no longer just its own features but a weighted aggregate of the whole sequence.
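The hand calculation can be confirmed with a single matrix multiplication, using exactly the `V` and `A` given above.

```python
import numpy as np

V = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0]])

A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

C = A @ V
print(np.round(C, 2))  # rows: [1.0, 0.5], [0.4, 1.7], [1.5, 1.0]
```

Each row of `A` sums to 1, so every output row is a convex combination of the rows of `V`.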

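11. Engineering implementation and operator mapping

In deep learning frameworks, the core of the Transformer usually maps to the following operators:

11.1 Input representation

Embedding Lookup
Add

11.2 Attention

Linear
Reshape / Transpose
MatMul
Scale
Mask Add
Softmax
MatMul

11.3 Residual and normalization

Add
LayerNorm

11.4 FFN

Linear
Activation (ReLU / GELU)
Linear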

11.5 The core operator chain from an engineering perspective

Embedding
 -> Add Positional Encoding
 -> Linear(Q/K/V)
 -> Reshape/Transpose
 -> MatMul(Q, K^T)
 -> Scale
 -> Mask
 -> Softmax
 -> MatMul(Attn, V)
 -> Concat
 -> Linear
 -> Add & Norm
 -> FFN
 -> Add & Norm

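12. Strengths and limitations

12.1 Strengths

Strong parallelism: no recurrence over time steps.
Strong long-range dependency modeling: any two tokens can interact directly.
Uniform structure: the core consists of matrix multiplications and linear layers, which suits hardware acceleration.

12.2 Limitations

High attention cost: standard self-attention has $O(L^2)$ complexity in the sequence length $L$.
High memory usage on long sequences.
Position modeling depends on an additional mechanism.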
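To give a rough sense of the quadratic growth, the sketch below counts the bytes needed to store only the attention weight tensor of shape $B \times h \times L \times L$. The helper name, the head count of 8, and the 4-byte (float32) element size are assumptions for illustration; real memory use depends on the implementation.

```python
def attention_matrix_bytes(B, h, L, bytes_per_element=4):
    """Bytes needed to store attention weights of shape (B, h, L, L)."""
    return B * h * L * L * bytes_per_element

for L in (512, 2048, 8192):
    mib = attention_matrix_bytes(B=1, h=8, L=L) / 2**20
    print(f"L={L:5d}: {mib:8.1f} MiB")
# L=  512:      8.0 MiB
# L= 2048:    128.0 MiB
# L= 8192:   2048.0 MiB
```

Multiplying the sequence length by 4 multiplies the storage by 16, which is the practical meaning of the $O(L^2)$ term.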

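13. Step-by-step NumPy visualization code for the Transformer

Features:

Depends only on numpy
Expands step by step through Embedding, positional encoding, Q/K/V, attention, multi-head, and FFN
Prints the shape and key intermediate results at every step
Uses a very small toy example so the operator flow is easy to follow
Download link: Visualized Transformer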

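14. Conclusion

In essence, the Transformer:

Replaces recurrence with a parallelizable attention mechanism, so that every position can extract information from the whole sequence according to relevance.

There are four key handles for understanding the Transformer:

The meaning of Q / K / V
The Scaled Dot-Product Attention formula
The tensor splitting and concatenation in Multi-Head attention
The layer structure of the Encoder / Decoder

Once the tensor flow and the matrix multiplications in these steps are clear, the core principles of the Transformer are understood.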

References:

[1]: https://arxiv.org/abs/1706.03762?utm_source=chatgpt.com "Attention Is All You Need"

[2]: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf?utm_source=chatgpt.com "Attention is All you Need"

[3]: https://docs.pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html?utm_source=chatgpt.com "MultiheadAttention --- PyTorch 2.11 documentation"

[4]: https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html?utm_source=chatgpt.com "torch.nn.functional.scaled_dot_product_attention"