[Transformer Series (2)] Multi-head self-attention

1. Multi-head self-attention

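Multi-head self-attention differs from single-head self-attention in that the Q, K, and V vectors are each split into num_heads parts along the feature dimension, and attention is computed independently for each part.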

Implementation steps

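(1) Use the num_heads parameter to split the single head into multiple heads, obtaining the Q, K, and V values for each head.

(2) Matrix-multiply Q by the transpose of K, then apply softmax to obtain the attention weights.

(3) Matrix-multiply the attention weights by V to obtain the output.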
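
The three steps can be written per head as the following sketch. The symbols n (number of tokens), d (hidden size), h (number of heads), and d_k (per-head size) are notation assumed here for illustration; in the code of this post they correspond to 33, 768, 8, and 96.

```latex
\begin{aligned}
Q_i,\, K_i,\, V_i &\in \mathbb{R}^{n \times d_k}, \qquad d_k = d / h, \quad i = 1, \dots, h \\
A_i &= \operatorname{softmax}\!\left(Q_i K_i^{\top}\right) \in \mathbb{R}^{n \times n} \\
O_i &= A_i V_i \in \mathbb{R}^{n \times d_k} \\
O &= \left[\, O_1 \;\; O_2 \;\; \cdots \;\; O_h \,\right] \in \mathbb{R}^{n \times d}
\end{aligned}
```

The softmax is taken row by row, so each row of A_i sums to 1. The standard scaled dot-product formulation additionally divides Q_i K_i^T by the square root of d_k before the softmax; the steps in this post omit that factor.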

2. Code implementation

(1) Use the num_heads parameter to split the single head into multiple heads, obtaining the Q, K, and V values for each head.

import numpy as np

# Number of tokens (sequence length) shared by Q, K, and V
values_length = 33
# Feature length of the original single head
hidden_size = 768
# Single-head Q, K, V
# [33,768]
Query = np.random.rand(values_length, hidden_size)
Key = np.random.rand(values_length, hidden_size)
Value = np.random.rand(values_length, hidden_size)

# Single head -> grouped into 8 heads
# [33,768] -> [33,8,96]
# 8 heads
num_attention_heads = 8
# Length of each head after the original single head is split
attention_head_size = hidden_size // num_attention_heads
Query = np.reshape(Query, [values_length, num_attention_heads, attention_head_size])
Key = np.reshape(Key, [values_length, num_attention_heads, attention_head_size])
Value = np.reshape(Value, [values_length, num_attention_heads, attention_head_size])

# [33,8,96] -> [8,33,96]  heads first: [M, H*W, C]
Query = np.transpose(Query, [1, 0, 2])
Key = np.transpose(Key, [1, 0, 2])
Value = np.transpose(Value, [1, 0, 2])
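
To make the effect of the reshape and transpose concrete, the following minimal sketch checks that head h ends up holding columns h*96 through (h+1)*96 - 1 of the original [33,768] matrix. The name `x`, the variable `slices_match`, and the fixed random seed are illustrative assumptions, not part of the original code.

```python
import numpy as np

values_length, hidden_size, num_attention_heads = 33, 768, 8
attention_head_size = hidden_size // num_attention_heads  # 96

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.random((values_length, hidden_size))

# Same split as above: [33,768] -> [33,8,96] -> [8,33,96]
heads = np.transpose(
    np.reshape(x, [values_length, num_attention_heads, attention_head_size]),
    [1, 0, 2],
)

# Head h should equal the h-th block of 96 consecutive columns of x
slices_match = all(
    np.array_equal(
        heads[h],
        x[:, h * attention_head_size:(h + 1) * attention_head_size],
    )
    for h in range(num_attention_heads)
)
print(heads.shape, slices_match)
```

In other words, the split does not mix features: each head simply sees a contiguous 96-wide slice of every token's feature vector.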

(2) Matrix-multiply Q by the transpose of K, then apply softmax to obtain the attention weights.

# Q and K -> attention scores
# [8,33,96] @ [8,96,33] -> [8,33,33]   per head: [m1,n] @ [n,m2] -> [m1,m2]
scores = Query @ np.transpose(Key, [0, 2, 1])
print(np.shape(scores))
# scores + softmax -> attention weights
# (soft_max is defined in the complete code in section 3)
scores = soft_max(scores)
print(np.shape(scores))
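
The snippet above applies softmax to the raw Q·K^T scores. The standard scaled dot-product attention additionally divides the scores by the square root of the per-head size before the softmax. The sketch below shows that variant under the same shapes; `stable_softmax`, `scaled_scores`, and the fixed seed are names and choices assumed here for illustration, not the author's code.

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the row maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large scores
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

num_attention_heads, values_length, attention_head_size = 8, 33, 96
rng = np.random.default_rng(0)
Query = rng.random((num_attention_heads, values_length, attention_head_size))
Key = rng.random((num_attention_heads, values_length, attention_head_size))

# Scaled variant: divide the raw scores by sqrt(attention_head_size)
scaled_scores = Query @ np.transpose(Key, [0, 2, 1]) / np.sqrt(attention_head_size)
attention = stable_softmax(scaled_scores)

print(attention.shape)
# Each row of each head should be a probability distribution
print(np.allclose(attention.sum(axis=-1), 1.0))
```

Without the scaling, dot products over 96 dimensions tend to be large, which pushes the softmax toward a near one-hot distribution; the division keeps the weights smoother.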

(3) Matrix-multiply the attention weights by V to obtain the output.

# attention weights and V -> output
# [8,33,33] @ [8,33,96] -> [8,33,96]   per head: [m1,n] @ [n,m2] -> [m1,m2]
out = scores @ Value
print(np.shape(out))
# [8,33,96] -> [33,8,96]
out = np.transpose(out, [1, 0, 2])
print(np.shape(out))
# [33,8,96] -> [33,768]  merge the heads back into a single feature axis
out = np.reshape(out, [values_length, hidden_size])
print(np.shape(out))
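
The transpose before the final reshape matters. As a small sanity check, the sketch below splits a matrix into heads and merges it back, and also shows what happens if the transpose is skipped. The names `x`, `split`, `merged`, and `naive` are hypothetical and used only for this illustration.

```python
import numpy as np

values_length, hidden_size, num_attention_heads = 33, 768, 8
attention_head_size = hidden_size // num_attention_heads

# Distinct values make any misplaced element easy to detect
x = np.arange(values_length * hidden_size, dtype=np.float64).reshape(
    values_length, hidden_size
)

# Split: [33,768] -> [33,8,96] -> [8,33,96]
split = np.transpose(
    np.reshape(x, [values_length, num_attention_heads, attention_head_size]),
    [1, 0, 2],
)
# Merge: [8,33,96] -> [33,8,96] -> [33,768]
merged = np.reshape(np.transpose(split, [1, 0, 2]), [values_length, hidden_size])
round_trip_ok = np.array_equal(merged, x)

# Reshaping [8,33,96] directly, without the transpose, scrambles the layout
naive = np.reshape(split, [values_length, hidden_size])
naive_ok = np.array_equal(naive, x)

print(round_trip_ok, naive_ok)
```

The merge step is therefore the exact inverse of the split step only when the head axis is moved back behind the token axis first.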

3. Complete code

# multi-head self-attention #
# by liushuai #
# 2024/2/6 #

import numpy as np

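def soft_max(z):
    # Subtract the row-wise maximum before exponentiating for numerical stability
    t = np.exp(z - np.max(z, axis=-1, keepdims=True))
    # Normalize over the last axis so that each row sums to 1
    a = t / np.sum(t, axis=-1, keepdims=True)
    return a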

# Number of tokens (sequence length) shared by Q, K, and V
# Corresponds to H*W of an image feature map
values_length = 33
# Feature depth of the original single head
# Corresponds to the number of channels
hidden_size = 768
# Single-head Q, K, V
# [33,768]
Query = np.random.rand(values_length, hidden_size)
Key = np.random.rand(values_length, hidden_size)
Value = np.random.rand(values_length, hidden_size)

# Single head -> grouped into 8 heads
# [33,768] -> [33,8,96]
# 8 heads
num_attention_heads = 8
# Depth of each head after the original single head is split
attention_head_size = hidden_size // num_attention_heads
Query = np.reshape(Query, [values_length, num_attention_heads, attention_head_size])
Key = np.reshape(Key, [values_length, num_attention_heads, attention_head_size])
Value = np.reshape(Value, [values_length, num_attention_heads, attention_head_size])

# [33,8,96] -> [8,33,96]  heads first: [M, H*W, C]
Query = np.transpose(Query, [1, 0, 2])
Key = np.transpose(Key, [1, 0, 2])
Value = np.transpose(Value, [1, 0, 2])

# Q and K -> attention scores
# [8,33,96] @ [8,96,33] -> [8,33,33]   per head: [m1,n] @ [n,m2] -> [m1,m2]
scores = Query @ np.transpose(Key, [0, 2, 1])
print(np.shape(scores))
# scores + softmax -> attention weights
scores = soft_max(scores)
print(np.shape(scores))

# attention weights and V -> output
# [8,33,33] @ [8,33,96] -> [8,33,96]   per head: [m1,n] @ [n,m2] -> [m1,m2]
out = scores @ Value
print(np.shape(out))
# [8,33,96] -> [33,8,96]
out = np.transpose(out, [1, 0, 2])
print(np.shape(out))
# [33,8,96] -> [33,768]  merge the heads back into a single feature axis
out = np.reshape(out, [values_length, hidden_size])
print(np.shape(out))
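
For reuse, the script can be wrapped in a function. The following is a minimal sketch, not the author's implementation: the function name `multi_head_attention`, its signature, and the helper `split` are assumptions made for illustration, and it assumes the hidden size is divisible by the number of heads. Like the script above it omits the scaling factor, and it cross-checks the batched result against a plain per-head loop over column slices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_head_attention(query, key, value, num_heads):
    """Unscaled multi-head attention on [tokens, hidden] arrays."""
    tokens, hidden = query.shape
    head_size = hidden // num_heads  # assumes hidden % num_heads == 0

    def split(x):
        # [tokens, hidden] -> [heads, tokens, head_size]
        return np.transpose(np.reshape(x, [tokens, num_heads, head_size]), [1, 0, 2])

    q, k, v = split(query), split(key), split(value)
    attention = softmax(q @ np.transpose(k, [0, 2, 1]))  # [heads, tokens, tokens]
    out = attention @ v                                  # [heads, tokens, head_size]
    # [heads, tokens, head_size] -> [tokens, hidden]
    return np.reshape(np.transpose(out, [1, 0, 2]), [tokens, hidden])

rng = np.random.default_rng(0)
Q, K, V = (rng.random((33, 768)) for _ in range(3))
out = multi_head_attention(Q, K, V, num_heads=8)

# Cross-check: run each head separately on its 96-column slice
reference = np.concatenate(
    [
        softmax(Q[:, s:s + 96] @ K[:, s:s + 96].T) @ V[:, s:s + 96]
        for s in range(0, 768, 96)
    ],
    axis=-1,
)
matches_reference = np.allclose(out, reference)
print(out.shape, matches_reference)
```

The batched `@` on [8,33,96] arrays performs the same eight independent matrix products as the loop, which is why the two results agree.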