【Transformer Series (2)】Multi-Head Self-Attention

1. Multi-Head Self-Attention

The difference between multi-head self-attention and single-head self-attention is that the Q, K, and V vectors are split into num_heads chunks, and attention is computed independently within each chunk (head).

Implementation steps (a consolidated sketch follows this list):

(1) Split the single head into multiple heads according to the num_heads parameter, obtaining each head's Q, K, V values

(2) Matrix-multiply Q by the transpose of K, then apply softmax to obtain the attention weights

(3) Matrix-multiply the attention weights by V to get the output
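
Putting the three steps together: the sketch below is my own consolidation, not code from this post. It wraps the whole flow into one NumPy function and also divides the scores by sqrt(head_dim), as in the standard Transformer formulation; the step-by-step code in the next section leaves that scaling out.

```python
import numpy as np

def multi_head_self_attention(Q, K, V, num_heads):
    """Minimal multi-head self-attention over already-projected Q, K, V
    of shape [seq_len, hidden_size]. Names here are illustrative, not
    taken from the walkthrough below."""
    seq_len, hidden_size = Q.shape
    head_dim = hidden_size // num_heads

    # (1) split into heads: [seq_len, hidden] -> [heads, seq_len, head_dim]
    def split(x):
        return np.transpose(np.reshape(x, [seq_len, num_heads, head_dim]), [1, 0, 2])
    Q, K, V = split(Q), split(K), split(V)

    # (2) scaled dot-product scores + softmax: [heads, seq_len, seq_len]
    scores = Q @ np.transpose(K, [0, 2, 1]) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # (3) weighted sum of V, then merge the heads back: [seq_len, hidden]
    out = weights @ V                      # [heads, seq_len, head_dim]
    return np.reshape(np.transpose(out, [1, 0, 2]), [seq_len, hidden_size])
```

For example, `multi_head_self_attention(np.random.rand(33, 768), np.random.rand(33, 768), np.random.rand(33, 768), 8)` returns an array of shape (33, 768).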

2. Code Implementation

(1) Split the single head into multiple heads according to the num_heads parameter, obtaining each head's Q, K, V values

```python
import numpy as np

# Number of tokens (sequence length); each token has a Q, K, V vector
values_length = 33
# Original single-head hidden size
hidden_size = 768
# Single-head Q, K, V
# [33,768]
Query = np.random.rand(values_length, hidden_size)
Key = np.random.rand(values_length, hidden_size)
Value = np.random.rand(values_length, hidden_size)

# Single head -> split into 8 heads
# [33,768] -> [33,8,96]
# 8 heads
num_attention_heads = 8
# Per-head size after splitting the original hidden size across the heads
attention_head_size = hidden_size // num_attention_heads
Query = np.reshape(Query, [values_length, num_attention_heads, attention_head_size])
Key = np.reshape(Key, [values_length, num_attention_heads, attention_head_size])
Value = np.reshape(Value, [values_length, num_attention_heads, attention_head_size])

# [33,8,96] -> [8,33,96]: move the head dimension to the front (heads, H*W, C)
Query = np.transpose(Query, [1, 0, 2])
Key = np.transpose(Key, [1, 0, 2])
Value = np.transpose(Value, [1, 0, 2])
```
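
One caveat on the block above: Query, Key and Value are filled with random numbers purely for demonstration. In an actual Transformer layer they are linear projections of the same input sequence, along the lines of the sketch below (the input X and the weights W_q, W_k, W_v are placeholders I am introducing here, not variables from this post):

```python
import numpy as np

values_length, hidden_size = 33, 768
X = np.random.rand(values_length, hidden_size)     # input token embeddings, [33,768]
# Learned projection weights (random placeholders for illustration)
W_q = np.random.rand(hidden_size, hidden_size)
W_k = np.random.rand(hidden_size, hidden_size)
W_v = np.random.rand(hidden_size, hidden_size)
# Each projection keeps the [33,768] shape and is then split into heads as above
Q_proj, K_proj, V_proj = X @ W_q, X @ W_k, X @ W_v
```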

(2) Matrix-multiply Q by the transpose of K, then apply softmax to obtain the attention weights

```python
# Q @ K^T -> raw attention scores
# [8,33,96] @ [8,96,33] -> [8,33,33]   ([m1,n] @ [n,m2] -> [m1,m2])
scores = Query @ np.transpose(Key, [0, 2, 1])
print(np.shape(scores))
# softmax over the last axis -> attention weights
# (soft_max is defined in the complete code in section 3)
scores = soft_max(scores)
print(np.shape(scores))
```
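
A note on this step: the standard Transformer divides the scores by sqrt(d_head) before the softmax so that larger head sizes do not push the softmax into saturation; the walkthrough above skips that factor. A scaled variant, reusing the names from the block above:

```python
# Optional scaled variant: divide by the square root of the per-head depth
scaled_scores = Query @ np.transpose(Key, [0, 2, 1]) / np.sqrt(attention_head_size)
scaled_scores = soft_max(scaled_scores)     # still [8,33,33]
```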

(3) Matrix-multiply the attention weights by V to get the output

```python
# attention weights @ V -> output
# [8,33,33] @ [8,33,96] -> [8,33,96]   ([m1,n] @ [n,m2] -> [m1,m2])
out = scores @ Value
print(np.shape(out))
# [8,33,96] -> [33,8,96]
out = np.transpose(out, [1, 0, 2])
print(np.shape(out))
# [33,8,96] -> [33,768]: merge the heads back into a single hidden dimension
out = np.reshape(out, [values_length, hidden_size])
print(np.shape(out))
```
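
In a full multi-head attention module there is usually one more learned linear projection applied to the concatenated heads (often written W_O in the Transformer paper); the walkthrough stops at the concatenation. A sketch with a placeholder weight:

```python
# Output projection over the concatenated heads (W_o is a placeholder here)
W_o = np.random.rand(hidden_size, hidden_size)      # [768,768]
final_out = out @ W_o                               # [33,768] -> [33,768]
```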

3. Complete Code

```python
# multi-head self-attention #
# by liushuai               #
# 2024/2/6                  #

import numpy as np

def soft_max(z):
    # Softmax over the last axis; subtracting the max improves numerical stability
    t = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return t / np.sum(t, axis=-1, keepdims=True)

# Number of tokens (sequence length); analogous to H*W in an image
values_length = 33
# Original single-head depth; analogous to the channel dimension
hidden_size = 768
# Single-head Q, K, V
# [33,768]
Query = np.random.rand(values_length, hidden_size)
Key = np.random.rand(values_length, hidden_size)
Value = np.random.rand(values_length, hidden_size)

# Single head -> split into 8 heads
# [33,768] -> [33,8,96]
# 8 heads
num_attention_heads = 8
# Per-head depth after splitting the original hidden size across the heads
attention_head_size = hidden_size // num_attention_heads
Query = np.reshape(Query, [values_length, num_attention_heads, attention_head_size])
Key = np.reshape(Key, [values_length, num_attention_heads, attention_head_size])
Value = np.reshape(Value, [values_length, num_attention_heads, attention_head_size])

# [33,8,96] -> [8,33,96]: move the head dimension to the front (heads, H*W, C)
Query = np.transpose(Query, [1, 0, 2])
Key = np.transpose(Key, [1, 0, 2])
Value = np.transpose(Value, [1, 0, 2])

# Q @ K^T -> raw attention scores
# [8,33,96] @ [8,96,33] -> [8,33,33]   ([m1,n] @ [n,m2] -> [m1,m2])
scores = Query @ np.transpose(Key, [0, 2, 1])
print(np.shape(scores))
# softmax over the last axis -> attention weights
scores = soft_max(scores)
print(np.shape(scores))

# attention weights @ V -> output
# [8,33,33] @ [8,33,96] -> [8,33,96]   ([m1,n] @ [n,m2] -> [m1,m2])
out = scores @ Value
print(np.shape(out))
# [8,33,96] -> [33,8,96]
out = np.transpose(out, [1, 0, 2])
print(np.shape(out))
# [33,8,96] -> [33,768]: merge the heads back into a single hidden dimension
out = np.reshape(out, [values_length, hidden_size])
print(np.shape(out))
```
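
As a quick sanity check (my addition, run after the script above): multi-head self-attention is just num_attention_heads independent single-head attentions whose outputs are concatenated along the hidden dimension, so each slice of the final output should match a per-head computation.

```python
# Verify that each 96-wide slice of `out` equals the attention of one head
for h in range(num_attention_heads):
    ref = soft_max(Query[h] @ Key[h].T) @ Value[h]        # [33,96]
    start = h * attention_head_size
    assert np.allclose(ref, out[:, start:start + attention_head_size])
print("per-head results match the concatenated output")
```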