(Deep Learning Log) Week TR4: Reproducing the Transformer in PyTorch

🏡 My environment:

  • Language: Python 3.11.4
  • Editor: Jupyter Notebook
  • torch version: 2.0.1
python
import torch
import torch.nn as nn
 
class MultiHeadAttention(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout):
        super().__init__()
 
        self.hid_dim = hid_dim
        self.n_heads = n_heads
 
        # hid_dim must be evenly divisible by the number of heads
        assert hid_dim % n_heads == 0
        # Define W_q
        self.w_q = nn.Linear(hid_dim, hid_dim)
        # Define W_k
        self.w_k = nn.Linear(hid_dim, hid_dim)
        # Define W_v
        self.w_v = nn.Linear(hid_dim, hid_dim)
 
        self.fc = nn.Linear(hid_dim, hid_dim)
        self.do = nn.Dropout(dropout)
 
        # Scaling factor sqrt(d_k), where d_k = hid_dim // n_heads
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim // n_heads]))
 
    def forward(self, query, key, value, mask=None):
        # Q and K/V may have different lengths along the sequence dimension
        bsz = query.shape[0]
        Q = self.w_q(query)
        K = self.w_k(key)
        V = self.w_v(value)
 
        # Split Q, K, V into multiple heads by slicing the feature dimension
        # Q: (64, 12, 300) -> (64, 12, 6, 50) -> (64, 6, 12, 50)
        # K: (64, 10, 300) -> (64, 10, 6, 50) -> (64, 6, 10, 50)
        # V: (64, 10, 300) -> (64, 10, 6, 50) -> (64, 6, 10, 50)
        Q = Q.view(bsz, -1, self.n_heads, self.hid_dim//self.n_heads).permute(0, 2, 1, 3)
        K = K.view(bsz, -1, self.n_heads, self.hid_dim//self.n_heads).permute(0, 2, 1, 3)
        V = V.view(bsz, -1, self.n_heads, self.hid_dim//self.n_heads).permute(0, 2, 1, 3)
 
        # Step 1: Q x K^T / scale
        # (64, 6, 12, 50) x (64, 6, 50, 10) -> (64, 6, 12, 10)
        attention = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
 
        # Positions to be masked out get a very large negative score, so softmax gives them ~0 weight
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
 
        # Step 2: apply softmax, then dropout, to obtain the attention weights
        attention = self.do(torch.softmax(attention, dim=-1))
 
 
        # Step 3: multiply the attention weights with V to get the per-head outputs
        # (64, 6, 12, 10) x (64, 6, 10, 50) -> (64, 6, 12, 50)
        x = torch.matmul(attention, V)
 
        # Transpose back: put the sequence dimension before the heads
        # (64, 6, 12, 50) -> (64, 12, 6, 50)
        x = x.permute(0, 2, 1, 3).contiguous()
 
        # Merge the heads back into a single feature dimension
        # (64, 12, 6, 50) -> (64, 12, 300)
        x = x.view(bsz, -1, self.n_heads * (self.hid_dim // self.n_heads))
        x = self.fc(x)
        return x
 
# The query length (12) intentionally differs from the key/value length (10)
query = torch.rand(64, 12, 300)
key = torch.rand(64, 10, 300)
value = torch.rand(64, 10, 300)
attention = MultiHeadAttention(hid_dim=300, n_heads=6, dropout=0.1)
output = attention(query, key, value)
print(output.shape)  # torch.Size([64, 12, 300])
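
The forward method above also accepts an optional mask. Below is a minimal sketch of how a padding mask could be passed; the mask shape (batch, 1, 1, key_len) is an assumption chosen so it broadcasts against the (batch, n_heads, query_len, key_len) attention scores, and the padding pattern is made up purely for illustration.

python
# Minimal sketch: mask out padded key positions (illustrative values only)
# mask == 0 marks positions that should receive ~0 attention weight
key_padding = torch.ones(64, 10)        # 1 = real token, 0 = padding (hypothetical)
key_padding[:, 7:] = 0                  # pretend the last 3 key positions are padding
mask = key_padding.view(64, 1, 1, 10)   # broadcasts over heads and query positions
masked_output = attention(query, key, value, mask=mask)
print(masked_output.shape)              # torch.Size([64, 12, 300])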

Multi-head attention extends the model's ability to attend to information at different positions, giving the attention layer multiple "representation subspaces".
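
For comparison, PyTorch also ships a built-in nn.MultiheadAttention layer that implements the same idea. This is a small sketch (not the code from this post) showing that it produces the same output shape for the tensors above; the parameter values simply mirror the hand-written module.

python
import torch
import torch.nn as nn

# PyTorch's built-in multi-head attention, for comparison with the hand-written version
mha = nn.MultiheadAttention(embed_dim=300, num_heads=6, dropout=0.1, batch_first=True)
out, attn_weights = mha(query, key, value)  # query: (64, 12, 300), key/value: (64, 10, 300)
print(out.shape)           # torch.Size([64, 12, 300])
print(attn_weights.shape)  # torch.Size([64, 12, 10]), averaged over heads by default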
