1 Masks
- Our Transformer model involves two kinds of masks: the padding mask and the sequence mask.
- The padding mask is needed in every scaled dot-product attention, while the sequence mask is used only in the decoder's self-attention.
2 Padding mask in the first encoder layer
Suppose the Transformer we are training translates German into English, and the training data consists of two German sentences:
- ich mochte ein bier P
- ich mochte ein P P
P is a padding token that makes all sentences in the same batch the same length, which simplifies the later computation. Assume:
- batch_size = 2
- seq_len = 5 (including P)
- d_model = d_k = d_v (assumed equal for simplicity)
Following the earlier posts, Q has shape [batch_size, len_q, d_k], K has shape [batch_size, len_k, d_k], and V has shape [batch_size, len_v, d_v]. The raw scores are then computed with the following formula:

$$
\text{scores} = \frac{QK^{\top}}{\sqrt{d_k}}
$$
In the encoder's self-attention, len_q = len_k = len_v = 5 (the sentence length), so for the first sentence the scores matrix has shape [5, 5]. The padding token P is not a real word: we do not want the model to attend to it, nor do we want it to influence other tokens. Without masking, the dot product of Q and K may assign P a score, and after the softmax P would receive some weight, so invalid information would leak into the result. We therefore use a mask to set those positions to -inf before the softmax.
```python
import torch
import torch.nn.functional as F

# ----------------------------
# 1. Simulated tokenized sentences (0 = PAD)
# ----------------------------
# batch_size = 2, seq_len = 5
seq_k = torch.tensor([
    [1, 2, 3, 4, 0],  # the last token of sentence 1 is PAD
    [1, 2, 3, 0, 0]   # the last two tokens of sentence 2 are PAD
])
batch_size, seq_len = seq_k.size()
d_model = 4  # embedding dimension

# Randomly generate embeddings (used directly as Q, K, V here)
torch.manual_seed(0)
inputs = torch.randn(batch_size, seq_len, d_model)
Q = K = V = inputs
print("inputs / embeddings:")
print(inputs)
```
```text
inputs / embeddings:
tensor([[[-1.1258, -1.1524, -0.2506, -0.4339],
[ 0.8487, 0.6920, -0.3160, -2.1152],
[ 0.3223, -1.2633, 0.3500, 0.3081],
[ 0.1198, 1.2377, 1.1168, -0.2473],
[-1.3527, -1.6959, 0.5667, 0.7935]],
[[ 0.5988, -1.5551, -0.3414, 1.8530],
[-0.2159, -0.7425, 0.5627, 0.2596],
[-0.1740, -0.6787, 0.9383, 0.4889],
[ 1.2032, 0.0845, -1.2001, -0.0048],
[-0.5181, -0.3067, -1.5810, 1.7066]]])
```
```python
# 2. Compute the raw attention scores QK^T / sqrt(d_k)
d_k = d_model
scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
print("Raw scores (QK^T / sqrt(d_k)):")
print(scores)
```
```text
Raw scores (QK^T / sqrt(d_k)):
tensor([[[ 1.4232, -0.3780, 0.4358, -0.8669, 1.4955],
[-0.3780, 2.8866, -0.6815, 0.5642, -2.0896],
[ 0.4358, -0.6815, 0.9587, -0.6051, 1.0747],
[-0.8669, 0.5642, -0.6051, 1.4272, -0.9122],
[ 1.4955, -2.0896, 1.0747, -0.9122, 2.8283]],
[[ 3.1635, 0.6572, 0.7685, 0.4949, 1.9344],
[ 0.6572, 0.4910, 0.5982, -0.4995, -0.0535],
[ 0.7685, 0.5982, 0.8051, -0.6975, -0.1754],
[ 0.4949, -0.4995, -0.6975, 1.4476, 0.6200],
[ 1.9344, -0.0535, -0.1754, 0.6200, 2.8873]]])
```
```python
# 3. Build the key_padding_mask automatically
# ----------------------------
# True marks PAD positions, which must be ignored
key_padding_mask = seq_k.eq(0)  # [batch_size, seq_len]
print("key_padding_mask (True = PAD):")
print(key_padding_mask)
print("-" * 50)
# Expand to [batch_size, len_q, len_k]
mask_matrix = key_padding_mask.unsqueeze(1).expand(-1, seq_len, -1)
print("mask_matrix (broadcast to len_q dimension):")
print(mask_matrix)
```
```text
key_padding_mask (True = PAD):
tensor([[False, False, False, False, True],
[False, False, False, True, True]])
--------------------------------------------------
mask_matrix (broadcast to len_q dimension):
tensor([[[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, False, True]],
[[False, False, False, True, True],
[False, False, False, True, True],
[False, False, False, True, True],
[False, False, False, True, True],
[False, False, False, True, True]]])
```
```python
# 4. Apply the mask: fill PAD positions with -inf
scores_masked = scores.masked_fill(mask_matrix, float('-inf'))
print("Scores after applying mask (-inf on PAD positions):")
print(scores_masked)
```
```text
Scores after applying mask (-inf on PAD positions):
tensor([[[ 1.4232, -0.3780, 0.4358, -0.8669, -inf],
[-0.3780, 2.8866, -0.6815, 0.5642, -inf],
[ 0.4358, -0.6815, 0.9587, -0.6051, -inf],
[-0.8669, 0.5642, -0.6051, 1.4272, -inf],
[ 1.4955, -2.0896, 1.0747, -0.9122, -inf]],
[[ 3.1635, 0.6572, 0.7685, -inf, -inf],
[ 0.6572, 0.4910, 0.5982, -inf, -inf],
[ 0.7685, 0.5982, 0.8051, -inf, -inf],
[ 0.4949, -0.4995, -0.6975, -inf, -inf],
[ 1.9344, -0.0535, -0.1754, -inf, -inf]]])
```
```python
# 5. Softmax gives the attention weights (PAD columns become 0)
attn_weights = F.softmax(scores_masked, dim=-1)
print("Attention weights after softmax (PAD positions become 0):")
print(attn_weights)
print("-" * 50)
# 6. Output: weighted sum over V
output = torch.matmul(attn_weights, V)
print("Output after attention (weighted sum of V, PAD ignored):")
print(output)
```
```text
Attention weights after softmax (PAD positions become 0):
tensor([[[0.6102, 0.1007, 0.2273, 0.0618, 0.0000],
[0.0328, 0.8588, 0.0242, 0.0842, 0.0000],
[0.2970, 0.0972, 0.5010, 0.1049, 0.0000],
[0.0610, 0.2551, 0.0792, 0.6047, 0.0000],
[0.5636, 0.0156, 0.3700, 0.0507, 0.0000]],
[[0.8527, 0.0696, 0.0777, 0.0000, 0.0000],
[0.3585, 0.3036, 0.3379, 0.0000, 0.0000],
[0.3471, 0.2928, 0.3601, 0.0000, 0.0000],
[0.5976, 0.2211, 0.1813, 0.0000, 0.0000],
[0.7948, 0.1089, 0.0964, 0.0000, 0.0000]]])
--------------------------------------------------
Output after attention (weighted sum of V, PAD ignored):
tensor([[[-0.5208, -0.8441, -0.0362, -0.4231],
[ 0.7098, 0.6301, -0.1771, -1.8441],
[-0.0779, -0.7781, 0.1873, -0.2059],
[ 0.2458, 0.7546, 0.6071, -0.6912],
[-0.4959, -1.0433, 0.0400, -0.1761]],
[[ 0.4821, -1.4305, -0.1790, 1.6361],
[ 0.0904, -1.0123, 0.3655, 0.9083],
[ 0.0820, -1.0016, 0.3841, 0.8953],
[ 0.2786, -1.2166, 0.0906, 1.2534],
[ 0.4357, -1.3822, -0.1196, 1.5481]]])
```
The self-attention in every encoder layer uses the same padding mask (a padding token never carries real information), so all encoder layers share a single mask.
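Because the mask depends only on the token ids, it can be built once and reused by every layer. Below is a minimal sketch of that idea; the helper name `get_attn_pad_mask` and the `encoder_layers` loop are illustrative assumptions, not code from the example above.
```python
import torch

def get_attn_pad_mask(seq_q, seq_k, pad_idx=0):
    # Hypothetical helper: returns a [batch_size, len_q, len_k] boolean mask,
    # True wherever the *key* position is a PAD token.
    batch_size, len_q = seq_q.size()
    _, len_k = seq_k.size()
    pad_mask = seq_k.eq(pad_idx).unsqueeze(1)         # [batch_size, 1, len_k]
    return pad_mask.expand(batch_size, len_q, len_k)  # [batch_size, len_q, len_k]

seq_k = torch.tensor([[1, 2, 3, 4, 0],
                      [1, 2, 3, 0, 0]])
enc_self_attn_mask = get_attn_pad_mask(seq_k, seq_k)  # built once from the source ids

# for layer in encoder_layers:                        # every layer reuses the same mask
#     x = layer(x, attn_mask=enc_self_attn_mask)
```
PyTorch's `nn.MultiheadAttention` exposes the same idea through its `key_padding_mask` argument, which takes the [batch_size, seq_len] boolean mask directly.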
3 Masks in the decoder
The decoder contains two attention sublayers, each with its own mask:
- Masked self-attention (sequence mask, which prevents future information from leaking)
  - Each position may only attend to itself and earlier tokens.
  - Tokens after the current position are blocked, which yields an upper-triangular mask.
  - It can be combined with the padding mask: mask the padding positions and the future positions together (see the sketch after this list).
  - The final self-attention mask has shape [batch_size, tgt_len, tgt_len].
- Encoder-decoder attention
  - Needs to mask the encoder's padding (reusing the encoder's padding mask).
  - This keeps the decoder from attending to padding tokens in the encoder output.
  - Its shape is [batch_size, tgt_len, src_len].
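The sketch below builds both decoder masks for a toy batch, as a minimal illustration of the shapes listed above; the tensors `tgt_seq`/`src_seq` and the use of `|` to OR the two boolean masks together are assumptions made for this example, not a fixed API.
```python
import torch

# Toy target and source batches: 0 = PAD
tgt_seq = torch.tensor([[1, 2, 3, 4, 0],
                        [1, 2, 0, 0, 0]])
src_seq = torch.tensor([[1, 2, 3, 4, 0],
                        [1, 2, 3, 0, 0]])
batch_size, tgt_len = tgt_seq.size()
_, src_len = src_seq.size()

# 1) Sequence mask: positions above the diagonal are future tokens (True = blocked)
seq_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
seq_mask = seq_mask.unsqueeze(0).expand(batch_size, -1, -1)         # [batch_size, tgt_len, tgt_len]

# 2) Target padding mask, broadcast over the query dimension
tgt_pad_mask = tgt_seq.eq(0).unsqueeze(1).expand(-1, tgt_len, -1)   # [batch_size, tgt_len, tgt_len]

# 3) Decoder self-attention mask: blocked if the key is a future token OR padding
dec_self_attn_mask = seq_mask | tgt_pad_mask                        # [batch_size, tgt_len, tgt_len]

# 4) Encoder-decoder attention mask: queries come from the decoder (tgt_len),
#    keys come from the encoder (src_len), so only encoder padding is masked
dec_enc_attn_mask = src_seq.eq(0).unsqueeze(1).expand(-1, tgt_len, -1)  # [batch_size, tgt_len, src_len]

print(dec_self_attn_mask[0])
print(dec_enc_attn_mask[0])
```
Both masks are then applied exactly like the encoder padding mask above: `scores.masked_fill(mask, float('-inf'))` before the softmax.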