[HuggingFace Transformers] The Core of OpenAIGPTModel: Block Source Code Walkthrough


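[1. Introduction to Block](#1. Introduction to Block)
[2. Block Class Source Code](#2. Block Class Source Code)
[3. Attention Class Source Code](#3. Attention Class Source Code)
[4. MLP Class Source Code](#4. MLP Class Source Code)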

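1. Introduction to Block

In the GPT model, the Block is the core component of the Transformer architecture. Each Block is built from three kinds of parts: an Attention layer, an MLP, and two LayerNorm layers. First, the Attention layer computes attention weights between the positions of the input and produces a weighted representation. The Attention output is then added to the input through a residual connection and passed through the first LayerNorm, giving an intermediate state. Next, the MLP processes this intermediate state and introduces a non-linear transformation through its activation function. Finally, the MLP output is added to the MLP input (the intermediate state) through a second residual connection and normalized by the second LayerNorm, which yields the output of the Block. In this way, a Block can extract and transform complex features of the sequence and supports the training and inference of deep models. The structure of a Block is shown below:

Image source: Improving Language Understanding by Generative Pre-Training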

2. Block Class Source Code

Source: transformers/src/transformers/models/openai/modeling_openai.py


# -*- coding: utf-8 -*-
# @time: 2024/9/3 20:42
from torch import nn
from transformers.models.openai.modeling_openai import Attention, MLP


class Block(nn.Module):
    def __init__(self, n_positions, config, scale=False):
        super().__init__()
        nx = config.n_embd
        self.attn = Attention(nx, n_positions, config, scale)  # Attention layer
        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)  # first LayerNorm
        self.mlp = MLP(4 * nx, config)  # MLP layer (inner size is 4 * nx)
        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)  # second LayerNorm

    def forward(self, x, attention_mask=None, head_mask=None, output_attentions=False):
        # Self-attention
        attn_outputs = self.attn(
            x,
            attention_mask=attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
        )
        a = attn_outputs[0]  # attention output a

        n = self.ln_1(x + a)  # residual connection, then the first layer normalization
        m = self.mlp(n)  # feed-forward network
        h = self.ln_2(n + m)  # residual connection, then the second layer normalization

        # Output: hidden state h, plus the attention weights if they were requested
        outputs = [h] + attn_outputs[1:]
        return outputs
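
The order of operations in `forward` (add the residual first, then normalize) can be seen in isolation with stand-in sublayers. The sketch below is a toy under stated assumptions, not the library's code: `TinyPostLNBlock` is a hypothetical name, and plain `nn.Linear` layers stand in for the real `Attention` and `MLP`. It only illustrates the residual/LayerNorm ordering and that the output keeps the shape of the input.

```python
import torch
from torch import nn


class TinyPostLNBlock(nn.Module):
    """Hypothetical toy block: same residual/LayerNorm ordering as Block,
    with nn.Linear layers standing in for Attention and MLP."""

    def __init__(self, n_embd):
        super().__init__()
        self.attn = nn.Linear(n_embd, n_embd)  # stand-in for Attention
        self.ln_1 = nn.LayerNorm(n_embd)
        self.mlp = nn.Linear(n_embd, n_embd)   # stand-in for MLP
        self.ln_2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        a = self.attn(x)
        n = self.ln_1(x + a)     # add the residual first, then normalize
        m = self.mlp(n)
        return self.ln_2(n + m)  # add the residual first, then normalize


torch.manual_seed(0)
block = TinyPostLNBlock(n_embd=8)
x = torch.randn(2, 5, 8)  # (batch, seq_len, n_embd)
h = block(x)
print(h.shape)
```

Because the last operation is a LayerNorm with its default affine parameters, each position of `h` has approximately zero mean over the feature dimension at initialization.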

3. Attention Class Source Code

Source: transformers/src/transformers/models/openai/modeling_openai.py


# -*- coding: utf-8 -*-
# @time: 2024/9/3 20:44

import math
import torch

from torch import nn
from transformers.pytorch_utils import Conv1D, find_pruneable_heads_and_indices, prune_conv1d_layer


class Attention(nn.Module):
    def __init__(self, nx, n_positions, config, scale=False):
        super().__init__()
        # The hidden size n_state equals the embedding size nx
        n_state = nx  # in Attention: n_state=768 (nx=n_embd)
        # [switch nx => n_state from Block to Attention to keep identical to TF implementation]
        # n_state must be divisible by the number of attention heads; otherwise raise an error
        if n_state % config.n_head != 0:
            raise ValueError(f"Attention n_state shape: {n_state} must be divisible by config.n_head {config.n_head}")
        # Register a non-persistent buffer named "bias" that holds a lower-triangular matrix;
        # it prevents positions from attending to future tokens (causal self-attention)
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(n_positions, n_positions)).view(1, 1, n_positions, n_positions),
            persistent=False,
        )
        self.n_head = config.n_head  # number of attention heads
        self.split_size = n_state  # size used later to split the c_attn output into Q, K and V
        self.scale = scale  # whether to scale the attention scores

        self.c_attn = Conv1D(n_state * 3, nx)  # Conv1D (in effect a linear layer) producing Q, K and V together; output size is 3 * n_state
        self.c_proj = Conv1D(n_state, nx)  # Conv1D that projects the merged attention output
        self.attn_dropout = nn.Dropout(config.attn_pdrop)  # dropout applied to the attention weights
        self.resid_dropout = nn.Dropout(config.resid_pdrop)  # dropout applied to the output before the residual connection
        self.pruned_heads = set()  # set of attention heads that have already been pruned

    # Prune the given attention heads (optional helper)
    def prune_heads(self, heads):
        # Nothing to do if no heads were given
        if len(heads) == 0:
            return
        # Find which of the requested heads can be pruned, and the corresponding indices
        heads, index = find_pruneable_heads_and_indices(
            heads, self.n_head, self.split_size // self.n_head, self.pruned_heads
        )
        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])  # indices for the Q, K and V parts of the c_attn weight
        # Prune conv1d layers
        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)  # prune c_attn
        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)  # prune c_proj
        # Update hyper params
        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))  # update split_size
        self.n_head = self.n_head - len(heads)  # update n_head
        self.pruned_heads = self.pruned_heads.union(heads)  # record the newly pruned heads

    # Compute the attention
    def _attn(self, q, k, v, attention_mask=None, head_mask=None, output_attentions=False):
        w = torch.matmul(q, k)  # dot product of queries (Q) and keys (K): the raw attention scores
        # Scale the scores if self.scale is set
        if self.scale:
            w = w / math.sqrt(v.size(-1))
        # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implementation method: mask_attn_weights
        # XD: self.b may be larger than w, so we need to crop it
        b = self.bias[:, :, : w.size(-2), : w.size(-1)]  # crop the lower-triangular mask to the size of the score matrix
        w = w * b + -1e4 * (1 - b)  # causal mask: future positions are set to a large negative value

        # If an attention mask is provided, add it to the score matrix w
        if attention_mask is not None:
            # Apply the attention mask
            w = w + attention_mask

        # Normalize w with softmax, then apply dropout
        w = nn.functional.softmax(w, dim=-1)
        w = self.attn_dropout(w)

        # Mask heads if we want to
        # If a head mask is provided, use it to disable specific heads
        if head_mask is not None:
            w = w * head_mask

        outputs = [torch.matmul(w, v)]  # multiply the attention weights w by v; this is the attention output
        # If attention weights are requested, append them to the output list
        if output_attentions:
            outputs.append(w)
        return outputs

    # Merge the outputs of all heads back into the original dimension
    def merge_heads(self, x):
        """Reorder the dimensions of x, then merge its last two dimensions into one."""
        x = x.permute(0, 2, 1, 3).contiguous()
        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)
        return x.view(*new_x_shape)  # in Tensorflow implementation: fct merge_states

    # Split the input into multiple attention heads
    def split_heads(self, x, k=False):
        """Split the last dimension of x into (n_head, head_dim), then reorder the dimensions depending on k."""
        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)
        x = x.view(*new_x_shape)  # in Tensorflow implementation: fct split_states
        if k:
            return x.permute(0, 2, 3, 1)
        else:
            return x.permute(0, 2, 1, 3)

    def forward(self, x, attention_mask=None, head_mask=None, output_attentions=False):
        # Compute query, key and value with c_attn, then split them into heads
        x = self.c_attn(x)
        query, key, value = x.split(self.split_size, dim=2)
        query = self.split_heads(query)
        key = self.split_heads(key, k=True)
        value = self.split_heads(value)

        # Compute the attention and take the attention output
        attn_outputs = self._attn(query, key, value, attention_mask, head_mask, output_attentions)
        a = attn_outputs[0]

        a = self.merge_heads(a)  # merge the outputs of all heads
        a = self.c_proj(a)  # project the merged output with c_proj
        a = self.resid_dropout(a)  # then apply dropout

        # Pack the attention output together with the other outputs (the attention weights, if requested) into a list
        outputs = [a] + attn_outputs[1:]
        return outputs  # a, (attentions)
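
The steps in `forward` and `_attn` (project, split into heads, mask, softmax, merge) can be traced on a tiny tensor. The following is a minimal sketch under assumptions, not the library's implementation: `causal_attention` is a hypothetical helper, the weights are random rather than trained, and dropout, the output projection, `attention_mask` and `head_mask` are left out. It checks the causal property: changing the last token should not change the outputs at earlier positions.

```python
import math
import torch


def causal_attention(x, w_qkv, n_head, scale=True):
    """Toy version of the attention math (no dropout, no output projection)."""
    batch, seq_len, n_embd = x.shape
    qkv = x @ w_qkv                       # (batch, seq, 3 * n_embd)
    q, k, v = qkv.split(n_embd, dim=2)

    def split_heads(t, is_key=False):
        t = t.view(batch, seq_len, n_head, n_embd // n_head)
        # Keys are laid out as (batch, head, head_dim, seq) so q @ k needs no transpose
        return t.permute(0, 2, 3, 1) if is_key else t.permute(0, 2, 1, 3)

    q, k, v = split_heads(q), split_heads(k, is_key=True), split_heads(v)
    w = torch.matmul(q, k)                # (batch, head, seq, seq)
    if scale:
        w = w / math.sqrt(v.size(-1))
    b = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
    w = w * b + -1e4 * (1 - b)            # future positions get a large negative score
    w = torch.softmax(w, dim=-1)
    a = torch.matmul(w, v)                # (batch, head, seq, head_dim)
    a = a.permute(0, 2, 1, 3).contiguous().view(batch, seq_len, n_embd)  # merge heads
    return a, w


torch.manual_seed(0)
x = torch.randn(1, 4, 8)
w_qkv = torch.randn(8, 24)  # random weights for illustration only
out, weights = causal_attention(x, w_qkv, n_head=2)

# Perturb only the last token; earlier positions should be unaffected.
x_changed = x.clone()
x_changed[:, -1] += 1.0
out_changed, _ = causal_attention(x_changed, w_qkv, n_head=2)
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))
```

The key tensor is permuted to `(batch, head, head_dim, seq)` so that `torch.matmul(q, k)` directly yields the `(seq, seq)` score matrix; this mirrors the `k=True` branch of `split_heads`.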

4. MLP Class Source Code

Source: transformers/src/transformers/models/openai/modeling_openai.py


# -*- coding: utf-8 -*-
# @time: 2024/9/3 20:45

from torch import nn
from transformers.pytorch_utils import Conv1D
from transformers.activations import silu, gelu_new

# ACT_FNS: dictionary of activation functions
ACT_FNS = {"relu": nn.ReLU(), "silu": silu, "gelu": gelu_new, "swish": silu}


class MLP(nn.Module):
    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)
        super().__init__()
        nx = config.n_embd
        self.c_fc = Conv1D(n_state, nx)  # Conv1D c_fc: nx inputs -> n_state outputs
        self.c_proj = Conv1D(nx, n_state)  # Conv1D c_proj: n_state inputs -> nx outputs
        self.act = ACT_FNS[config.afn]  # pick the activation function named by config.afn from ACT_FNS
        self.dropout = nn.Dropout(config.resid_pdrop)  # dropout layer to reduce overfitting

    def forward(self, x):
        h = self.act(self.c_fc(x))  # expand x from nx to n_state dimensions, then apply the activation function
        h2 = self.c_proj(h)  # project the activated h from n_state back to nx dimensions
        return self.dropout(h2)  # apply dropout to h2 and return the result
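
`ACT_FNS` maps the name "gelu" to `gelu_new`. To my knowledge, `gelu_new` is the tanh approximation of GELU; treat that as an assumption and check `transformers.activations` for the version you use. The sketch below writes that formula in plain Python (the function names `gelu_tanh` and `gelu_exact` are mine) and compares it with the exact erf-based GELU.

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU (the formula assumed here for gelu_new)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Compare the two on a grid over [-4, 4]
xs = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in xs)
print(f"max |tanh approx - exact| on [-4, 4]: {max_gap:.2e}")
```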

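In this MLP module, the first transformation expands the input from nx to 4 * nx dimensions, and the second projects it from 4 * nx back to nx. Why is it designed this way, and what does it achieve? Feel free to discuss in the comments and share your answer!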
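
The shape flow through the two MLP layers can be checked with stand-in layers. This is a sketch under assumptions: `nn.Linear` replaces `Conv1D` (same input and output sizes), ReLU replaces the configured activation to keep the example short, and dropout is omitted. The sizes 768 and 3072 are the ones given in the source comments.

```python
import torch
from torch import nn

n_embd = 768            # nx in the source comments
n_state = 4 * n_embd    # 3072

# nn.Linear stand-ins for the two Conv1D layers (same input/output sizes)
c_fc = nn.Linear(n_embd, n_state)     # expand: 768 -> 3072
c_proj = nn.Linear(n_state, n_embd)   # project back: 3072 -> 768

x = torch.randn(2, 5, n_embd)         # (batch, seq_len, n_embd)
h = torch.relu(c_fc(x))               # ReLU used only to keep the sketch short
out = c_proj(h)

# Count weights plus biases in each layer
fc_params = sum(p.numel() for p in c_fc.parameters())
proj_params = sum(p.numel() for p in c_proj.parameters())
print(h.shape, out.shape, fc_params, proj_params)
```

The hidden activation `h` is four times wider than the input, and the output returns to `n_embd` so it can be added back to the MLP input in the residual connection.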

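Note:
Conv1D is in effect a linear transformation layer, very similar to a standard fully connected layer (nn.Linear). The main difference is that its weight is stored transposed, with shape (nx, nf). It performs the matrix multiplication and the bias addition in a single torch.addmm call, and this layout keeps it compatible with the model's weights.

Conv1D class source code:

Source: transformers/src/transformers/pytorch_utils.py
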

# -*- coding: utf-8 -*-
# @time: 2024/9/4 11:31

import torch

from torch import nn

class Conv1D(nn.Module):
    """
    1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

    Basically works like a linear layer but the weights are transposed.

    Args:
        nf (`int`): The number of output features.
        nx (`int`): The number of input features.
    """

    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf  # number of output features
        self.nx = nx  # number of input features
        self.weight = nn.Parameter(torch.empty(nx, nf))  # weight parameter with shape (nx, nf)
        self.bias = nn.Parameter(torch.zeros(nf))  # bias parameter with shape (nf,), initialized to zeros
        nn.init.normal_(self.weight, std=0.02)  # initialize the weight from a normal distribution with mean 0 and std 0.02

    def __repr__(self) -> str:
        return "Conv1D(nf={nf}, nx={nx})".format(**self.__dict__)  # string representation showing the output and input feature counts

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)  # output shape: keep every dimension of x except the last, which becomes self.nf
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)  # flatten x to 2D, multiply by the weight and add the bias
        x = x.view(size_out)  # reshape x back to size_out
        return x
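
The relation between Conv1D and `nn.Linear` can be checked numerically. The sketch below is illustrative only: it reproduces the `torch.addmm` computation from `forward` on small random tensors (the sizes 4 and 6 are arbitrary) and compares it with an `nn.Linear` whose weight is set to the transpose.

```python
import torch
from torch import nn

torch.manual_seed(0)
nx, nf = 4, 6  # input and output feature sizes (small values for illustration)

# Parameters laid out as in Conv1D: weight is (nx, nf), bias is (nf,)
weight = torch.randn(nx, nf) * 0.02
bias = torch.randn(nf)

x = torch.randn(2, 3, nx)  # (batch, seq_len, nx)

# Conv1D-style forward: flatten leading dims, addmm, restore the shape
y_conv1d = torch.addmm(bias, x.view(-1, nx), weight).view(2, 3, nf)

# Equivalent nn.Linear: its weight has shape (nf, nx), i.e. the transpose
linear = nn.Linear(nx, nf)
with torch.no_grad():
    linear.weight.copy_(weight.t())
    linear.bias.copy_(bias)
y_linear = linear(x)

print(torch.allclose(y_conv1d, y_linear, atol=1e-6))
```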
