从零实现Transformer:第 4 部分 - 残差连接、层归一化与前馈网络(Add & Norm, Feed-Forward)

从零实现Transformer:第 4 部分 - 残差连接、层归一化与前馈网络(Add & Norm, Feed-Forward)

flyfish

本部分的完整代码在文末

完整的图

主要用于和其他的图做参考

还有两个组件要实现

多头注意力机制(Multi-Head Attention)已经实现了还有Add & Norm和 Feed-forward networ,这里的norm是Layer normalization.

Add & Norm 部分

Feed-Forward 部分

组件一:层归一化(Layer Normalization)即Norm

公式

对于输入向量 x ∈ R d m o d e l x \in \mathbb{R}^{d_{model}} x∈Rdmodel:

μ = 1 d ∑ i = 1 d x i (均值) σ 2 = 1 d ∑ i = 1 d ( x i − μ ) 2 (方差) x ^ = x − μ σ 2 + ϵ (标准化) y = γ ⋅ x ^ + β (可学习缩放平移) \begin{aligned} \mu &= \frac{1}{d}\sum_{i=1}^{d} x_i \quad &\text{(均值)}\\ \sigma^2 &= \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2 \quad &\text{(方差)}\\ \hat{x} &= \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \quad &\text{(标准化)}\\ y &= \gamma \cdot \hat{x} + \beta \quad &\text{(可学习缩放平移)} \end{aligned} μσ2x^y=d1i=1∑dxi=d1i=1∑d(xi−μ)2=σ2+ϵ x−μ=γ⋅x^+β(均值)(方差)(标准化)(可学习缩放平移)

ϵ \epsilon ϵ:数值稳定项(通常 1e-6)
γ , β \gamma, \beta γ,β:可学习参数,初始化为 1 和 0

PyTorch 实现

python 复制代码
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    """层归一化实现"""
    def __init__(self, features: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # 可学习的缩放和平移参数
        self.gamma = nn.Parameter(torch.ones(features))   # γ
        self.beta = nn.Parameter(torch.zeros(features))    # β

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: [batch_size, seq_len, features]
        Returns:
            normalized: [batch_size, seq_len, features]
        """
        # 沿特征维度计算均值和标准差
        mean = x.mean(dim=-1, keepdim=True)      # [batch, seq, 1]
        std = x.std(dim=-1, keepdim=True)        # [batch, seq, 1]
        
        # 标准化 + 可学习变换
        x_norm = (x - mean) / torch.sqrt(std**2 + self.eps)
        return self.gamma * x_norm + self.beta

细节

要点 说明
dim=-1 对最后一个维度(特征维)计算统计量
keepdim=True 保持维度便于广播计算
eps 防止除零,保证数值稳定
nn.Parameter 使 γ/β 成为可训练参数

组件二:位置前馈网络(Position-wise Feed-Forward Network)

FFN (前馈神经网络 Feedforward Neural Network)是最大的概念,只要数据单向传播即属于 FFN;

MLP(Multi-Layer Perceptron) 是 FFN 的子集,限定为全连接层组成的网络;

Transformer 的 FFN 模块是 MLP 的特例,结构固定为 "升维→激活→降维"

FFN是数据流向层面的概念;

MLP强调全连接层的堆叠;

Transformer 的 FFN是 MLP 在 Transformer 中的标准化实现

《Attention Is All You Need》论文原文

用论文里的字母表示

结构

复制代码
输入: [batch, seq_len, d_model]
              │
              ▼
     ┌─────────────────┐
     │ Linear: d_model → d_ff │  (d_ff = 4×d_model)
     └─────────────────┘
              │
              ▼
     ┌─────────────────┐
     │      ReLU       │
     └─────────────────┘
              │
              ▼
     ┌─────────────────┐
     │   Dropout(p)    │
     └─────────────────┘
              │
              ▼
     ┌─────────────────┐
     │ Linear: d_ff → d_model │
     └─────────────────┘
              │
              ▼
输出: [batch, seq_len, d_model] ✓

PyTorch 实现

python 复制代码
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """位置前馈网络实现"""
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)    # 升维
        self.activation = nn.ReLU()                  # 激活函数
        self.dropout = nn.Dropout(dropout)           # 正则化
        self.linear_2 = nn.Linear(d_ff, d_model)     # 降维

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: [batch_size, seq_len, d_model]
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        x = self.linear_1(x)        # [batch, seq, d_ff]
        x = self.activation(x)      # ReLU
        x = self.dropout(x)         # Dropout
        x = self.linear_2(x)        # [batch, seq, d_model]
        return x

组件三:Add & Norm结构

Add 在这里就是残差连接
残差连接(Residual Connection)配合层归一化形成标准子层结构:

复制代码
输出 = LayerNorm( 输入 + Dropout(子层(输入)) )

Transformer这里的点子来自这里
Deep Residual Learning for Image Recognition

这几人的大作 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun


(左)有 残差连接,(右)无残差连接。

数据流向

复制代码
输入 x
  │
  ├─────────────────┐
  │                 │
  ▼                 │
[子层: MHA 或 FFN]   │
  │                 │
  ▼                 │
[Dropout]           │
  │                 │
  ▼                 │
[+ 残差相加] ◄──────┘
  │
  ▼
[LayerNorm]
  │
  ▼
输出 (形状与输入相同)

PyTorch 实现

python 复制代码
import torch
import torch.nn as nn

# LayerNormalization 已定义


class ResidualConnection(nn.Module):
    """残差连接 + 层归一化封装"""
    def __init__(self, features: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization(features)

    def forward(self, x: torch.Tensor, sublayer: nn.Module):
        """
        Args:
            x: 输入张量 [batch, seq_len, features]
            sublayer: 子层模块(如 MultiHeadAttention 或 PositionwiseFeedForward)
        Returns:
            output: [batch, seq_len, features]
        """
        # 标准流程: LayerNorm(x + Dropout(sublayer(x)))
        sublayer_out = sublayer(x)                    # 子层计算
        return self.norm(x + self.dropout(sublayer_out))  # 残差 + 归一化

使用示例

python 复制代码
# 在 Encoder 块中使用
mha = MultiHeadAttention(d_model=512, num_heads=8, dropout=0.1)
residual_mha = ResidualConnection(features=512, dropout=0.1)

# 前向传播
x = residual_mha(x, sublayer=mha)  # x 形状保持不变

测试代码

python 复制代码
import torch
import torch.nn as nn


## LayerNormalization
## PositionwiseFeedForward
## ResidualConnection


print("\n--- 测试支撑组件 ---")

# 参数设置
batch_size, seq_len, d_model = 4, 10, 512
d_ff, dropout = 2048, 0.1
dummy_input = torch.randn(batch_size, seq_len, d_model)

# 测试 LayerNormalization
ln = LayerNormalization(d_model)
ln_out = ln(dummy_input)
assert ln_out.shape == dummy_input.shape
print("LayerNormalization 形状验证通过")

# 测试 PositionwiseFeedForward
ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
ffn_out = ffn(dummy_input)
assert ffn_out.shape == dummy_input.shape
print("PositionwiseFeedForward 形状验证通过")

# 测试 ResidualConnection(用 Linear 模拟子层)
dummy_sublayer = nn.Linear(d_model, d_model)
residual = ResidualConnection(d_model, dropout)
res_out = residual(dummy_input, dummy_sublayer)
assert res_out.shape == dummy_input.shape
print("ResidualConnection 形状验证通过")

print("\n所有支撑组件测试通过!")

组件组合预览:Encoder 块结构

复制代码
输入
  │
  ▼
┌─────────────────────┐
│  Multi-Head Attention │
│  (Part 3)            │
└─────────────────────┘
  │
  ▼
┌─────────────────────┐
│  Add & Norm          │ ← ResidualConnection
│  (x + Dropout(MHA))  │
│  → LayerNorm         │
└─────────────────────┘
  │
  ▼
┌─────────────────────┐
│  Position-wise FFN   │ ← PositionwiseFeedForward
│  (Linear→ReLU→Linear)│
└─────────────────────┘
  │
  ▼
┌─────────────────────┐
│  Add & Norm          │ ← ResidualConnection
│  (x + Dropout(FFN))  │
│  → LayerNorm         │
└─────────────────────┘
  │
  ▼
输出 → 下一个 Encoder 块

每个子层都遵循:子层 → Dropout → 残差相加 → LayerNorm 的统一模式

完整代码

cpp 复制代码
import torch
import torch.nn as nn
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """位置前馈网络实现"""
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)    # 升维
        self.activation = nn.ReLU()                  # 激活函数
        self.dropout = nn.Dropout(dropout)           # 正则化
        self.linear_2 = nn.Linear(d_ff, d_model)     # 降维

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: [batch_size, seq_len, d_model]
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        x = self.linear_1(x)        # [batch, seq, d_ff]
        x = self.activation(x)      # ReLU
        x = self.dropout(x)         # Dropout
        x = self.linear_2(x)        # [batch, seq, d_model]
        return x

class LayerNormalization(nn.Module):
    """层归一化实现"""
    def __init__(self, features: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # 可学习的缩放和平移参数
        self.gamma = nn.Parameter(torch.ones(features))   # γ
        self.beta = nn.Parameter(torch.zeros(features))    # β

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: [batch_size, seq_len, features]
        Returns:
            normalized: [batch_size, seq_len, features]
        """
        # 沿特征维度计算均值和标准差
        mean = x.mean(dim=-1, keepdim=True)      # [batch, seq, 1]
        std = x.std(dim=-1, keepdim=True)        # [batch, seq, 1]
        
        # 标准化 + 可学习变换
        x_norm = (x - mean) / torch.sqrt(std**2 + self.eps)
        return self.gamma * x_norm + self.beta
class ResidualConnection(nn.Module):
    """残差连接 + 层归一化封装"""
    def __init__(self, features: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization(features)

    def forward(self, x: torch.Tensor, sublayer: nn.Module):
        """
        Args:
            x: 输入张量 [batch, seq_len, features]
            sublayer: 子层模块(如 MultiHeadAttention 或 PositionwiseFeedForward)
        Returns:
            output: [batch, seq_len, features]
        """
        # 标准流程: LayerNorm(x + Dropout(sublayer(x)))
        sublayer_out = sublayer(x)                    # 子层计算
        return self.norm(x + self.dropout(sublayer_out))  # 残差 + 归一化
print("\n--- 测试支撑组件 ---")

# 参数设置
batch_size, seq_len, d_model = 4, 10, 512
d_ff, dropout = 2048, 0.1
dummy_input = torch.randn(batch_size, seq_len, d_model)

# 测试 LayerNormalization
ln = LayerNormalization(d_model)
ln_out = ln(dummy_input)
assert ln_out.shape == dummy_input.shape
print("LayerNormalization 形状验证通过")

# 测试 PositionwiseFeedForward
ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
ffn_out = ffn(dummy_input)
assert ffn_out.shape == dummy_input.shape
print("PositionwiseFeedForward 形状验证通过")

# 测试 ResidualConnection(用 Linear 模拟子层)
dummy_sublayer = nn.Linear(d_model, d_model)
residual = ResidualConnection(d_model, dropout)
res_out = residual(dummy_input, dummy_sublayer)
assert res_out.shape == dummy_input.shape
print("ResidualConnection 形状验证通过")

print("\n所有支撑组件测试通过!")

输出

cpp 复制代码
--- 测试支撑组件 ---
LayerNormalization 形状验证通过
PositionwiseFeedForward 形状验证通过
ResidualConnection 形状验证通过

所有支撑组件测试通过!

下一步预告

从零实现Transformer:第 5 部分 - 构建编码器(The Encoder)

将把本部分的三个组件与多头注意力组装成完整的 Encoder Block,并堆叠多个块形成 Transformer 编码器!

python 复制代码
# 伪代码预览
class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        self.mha = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.residual1 = ResidualConnection(d_model, dropout)
        self.residual2 = ResidualConnection(d_model, dropout)
    
    def forward(self, x, mask):
        x = self.residual1(x, lambda t: self.mha(t, t, t, mask))
        x = self.residual2(x, self.ffn)
        return x

已经准备好组装第一个 Transformer 模块了

相关推荐
解局易否结局8 小时前
从架构视角看 ops-transformer:一个解决分层系统设计问题的算子仓库
深度学习·架构·transformer
莞凰9 小时前
昇腾CANN的“传音入密“:hccl仓库探秘
flutter·ui·transformer
AI技术控12 小时前
KV Cache 缓存机制的原理和应用:从 Transformer 推理到大模型服务优化
人工智能·python·深度学习·缓存·自然语言处理·transformer
徐安安ye16 小时前
FlashAttention在昇腾NPU上的性能实测:数据、瓶颈与优化上限
python·transformer
解局易否结局18 小时前
从零上手 ops-transformer:一个有清晰路径感的学习计划
深度学习·学习·transformer
灰灰勇闯IT21 小时前
ops-softmax:Transformer 推理中的概率归一化引擎
人工智能·深度学习·transformer
徐安安ye21 小时前
FlashAttention 算子深度解析:让大模型在昇腾NPU上跑得更快
python·transformer
Mem0rin21 小时前
[LLM初步]Transformer 模型分类(从架构出发)
深度学习·分类·transformer
解局易否结局21 小时前
ops-transformer 仓库核心能力解析:FlashAttention 在昇腾 NPU 上的融合实现
人工智能·深度学习·transformer
皮肤科大白1 天前
ViT革命:Transformer如何重塑计算机视觉
深度学习·计算机视觉·transformer