1 Masks
- Our Transformer model involves two kinds of masks: the padding mask and the sequence mask.
- The padding mask is needed in every scaled dot-product attention, while the sequence mask is used only in the decoder's self-attention.
2 Padding mask in the first encoder layer
Suppose the Transformer we are training translates German into English, and the training data consists of two German sentences:
- ich mochte ein bier P
- ich mochte ein P P
P is a padding token that makes all sentences in the same batch the same length, which simplifies the later computation. Assume:
- batch_size = 2
- seq_len = 5 (including P)
- d_model = d_k = d_v (assumed equal for simplicity)
Following the earlier posts, Q has shape [batch_size, len_q, d_k], K has shape [batch_size, len_k, d_k], and V has shape [batch_size, len_v, d_v]. The raw scores are then computed with the following formula:

$$
\text{scores} = \frac{QK^{\top}}{\sqrt{d_k}}
$$
In the encoder's self-attention, len_q = len_k = len_v = 5 (the sentence length), so for the first sentence the scores matrix has shape [5, 5]. The padding token P is not a real word: we do not want the model to attend to it, nor do we want it to influence other tokens. Without masking, the dot product of Q and K may assign P a score, and after the softmax P would receive some weight, so invalid information would leak into the result. We therefore use a mask to set those positions to -inf before the softmax.
```python
import torch
import torch.nn.functional as F

# ----------------------------
# 1. Simulated tokenized sentences (0 = PAD)
# ----------------------------
# batch_size = 2, seq_len = 5
seq_k = torch.tensor([
    [1, 2, 3, 4, 0],  # the last token of sentence 1 is PAD
    [1, 2, 3, 0, 0]   # the last two tokens of sentence 2 are PAD
])
batch_size, seq_len = seq_k.size()
d_model = 4  # embedding dimension

# Randomly generate embeddings (used directly as Q, K, V here)
torch.manual_seed(0)
inputs = torch.randn(batch_size, seq_len, d_model)
Q = K = V = inputs
print("inputs / embeddings:")
print(inputs)
```
```text
inputs / embeddings:
tensor([[[-1.1258, -1.1524, -0.2506, -0.4339],
[ 0.8487, 0.6920, -0.3160, -2.1152],
[ 0.3223, -1.2633, 0.3500, 0.3081],
[ 0.1198, 1.2377, 1.1168, -0.2473],
[-1.3527, -1.6959, 0.5667, 0.7935]],
[[ 0.5988, -1.5551, -0.3414, 1.8530],
[-0.2159, -0.7425, 0.5627, 0.2596],
[-0.1740, -0.6787, 0.9383, 0.4889],
[ 1.2032, 0.0845, -1.2001, -0.0048],
[-0.5181, -0.3067, -1.5810, 1.7066]]])
```
```python
# 2. Compute the raw attention scores QK^T / sqrt(d_k)
d_k = d_model
scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
print("Raw scores (QK^T / sqrt(d_k)):")
print(scores)
```
```text
Raw scores (QK^T / sqrt(d_k)):
tensor([[[ 1.4232, -0.3780, 0.4358, -0.8669, 1.4955],
[-0.3780, 2.8866, -0.6815, 0.5642, -2.0896],
[ 0.4358, -0.6815, 0.9587, -0.6051, 1.0747],
[-0.8669, 0.5642, -0.6051, 1.4272, -0.9122],
[ 1.4955, -2.0896, 1.0747, -0.9122, 2.8283]],
[[ 3.1635, 0.6572, 0.7685, 0.4949, 1.9344],
[ 0.6572, 0.4910, 0.5982, -0.4995, -0.0535],
[ 0.7685, 0.5982, 0.8051, -0.6975, -0.1754],
[ 0.4949, -0.4995, -0.6975, 1.4476, 0.6200],
[ 1.9344, -0.0535, -0.1754, 0.6200, 2.8873]]])
```
```python
# 3. Build the key_padding_mask automatically
# ----------------------------
# True marks PAD positions, which must be ignored
key_padding_mask = seq_k.eq(0)  # [batch_size, seq_len]
print("key_padding_mask (True = PAD):")
print(key_padding_mask)
print("-" * 50)
# Expand to [batch_size, len_q, len_k]
mask_matrix = key_padding_mask.unsqueeze(1).expand(-1, seq_len, -1)
print("mask_matrix (broadcast to len_q dimension):")
print(mask_matrix)
```
```text
key_padding_mask (True = PAD):
tensor([[False, False, False, False, True],
[False, False, False, True, True]])
--------------------------------------------------
mask_matrix (broadcast to len_q dimension):
tensor([[[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, False, True]],
[[False, False, False, True, True],
[False, False, False, True, True],
[False, False, False, True, True],
[False, False, False, True, True],
[False, False, False, True, True]]])
```
```python
# 4. Apply the mask: fill PAD positions with -inf
scores_masked = scores.masked_fill(mask_matrix, float('-inf'))
print("Scores after applying mask (-inf on PAD positions):")
print(scores_masked)
```
```text
Scores after applying mask (-inf on PAD positions):
tensor([[[ 1.4232, -0.3780, 0.4358, -0.8669, -inf],
[-0.3780, 2.8866, -0.6815, 0.5642, -inf],
[ 0.4358, -0.6815, 0.9587, -0.6051, -inf],
[-0.8669, 0.5642, -0.6051, 1.4272, -inf],
[ 1.4955, -2.0896, 1.0747, -0.9122, -inf]],
[[ 3.1635, 0.6572, 0.7685, -inf, -inf],
[ 0.6572, 0.4910, 0.5982, -inf, -inf],
[ 0.7685, 0.5982, 0.8051, -inf, -inf],
[ 0.4949, -0.4995, -0.6975, -inf, -inf],
[ 1.9344, -0.0535, -0.1754, -inf, -inf]]])
```
```python
# 5. Softmax gives the attention weights (PAD columns become 0)
attn_weights = F.softmax(scores_masked, dim=-1)
print("Attention weights after softmax (PAD positions become 0):")
print(attn_weights)
print("-" * 50)
# 6. Output: weighted sum over V
output = torch.matmul(attn_weights, V)
print("Output after attention (weighted sum of V, PAD ignored):")
print(output)
```
```text
Attention weights after softmax (PAD positions become 0):
tensor([[[0.6102, 0.1007, 0.2273, 0.0618, 0.0000],
[0.0328, 0.8588, 0.0242, 0.0842, 0.0000],
[0.2970, 0.0972, 0.5010, 0.1049, 0.0000],
[0.0610, 0.2551, 0.0792, 0.6047, 0.0000],
[0.5636, 0.0156, 0.3700, 0.0507, 0.0000]],
[[0.8527, 0.0696, 0.0777, 0.0000, 0.0000],
[0.3585, 0.3036, 0.3379, 0.0000, 0.0000],
[0.3471, 0.2928, 0.3601, 0.0000, 0.0000],
[0.5976, 0.2211, 0.1813, 0.0000, 0.0000],
[0.7948, 0.1089, 0.0964, 0.0000, 0.0000]]])
--------------------------------------------------
Output after attention (weighted sum of V, PAD ignored):
tensor([[[-0.5208, -0.8441, -0.0362, -0.4231],
[ 0.7098, 0.6301, -0.1771, -1.8441],
[-0.0779, -0.7781, 0.1873, -0.2059],
[ 0.2458, 0.7546, 0.6071, -0.6912],
[-0.4959, -1.0433, 0.0400, -0.1761]],
[[ 0.4821, -1.4305, -0.1790, 1.6361],
[ 0.0904, -1.0123, 0.3655, 0.9083],
[ 0.0820, -1.0016, 0.3841, 0.8953],
[ 0.2786, -1.2166, 0.0906, 1.2534],
[ 0.4357, -1.3822, -0.1196, 1.5481]]])
```
The self-attention in every encoder layer uses the same padding mask (a padding token never carries real information), so all encoder layers share a single mask.
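Because the mask depends only on the token ids, it can be built once and reused by every layer. Below is a minimal sketch of that idea; the helper name `get_attn_pad_mask` and the `encoder_layers` loop are illustrative assumptions, not code from the example above.
```python
import torch

def get_attn_pad_mask(seq_q, seq_k, pad_idx=0):
    # Hypothetical helper: returns a [batch_size, len_q, len_k] boolean mask,
    # True wherever the *key* position is a PAD token.
    batch_size, len_q = seq_q.size()
    _, len_k = seq_k.size()
    pad_mask = seq_k.eq(pad_idx).unsqueeze(1)         # [batch_size, 1, len_k]
    return pad_mask.expand(batch_size, len_q, len_k)  # [batch_size, len_q, len_k]

seq_k = torch.tensor([[1, 2, 3, 4, 0],
                      [1, 2, 3, 0, 0]])
enc_self_attn_mask = get_attn_pad_mask(seq_k, seq_k)  # built once from the source ids

# for layer in encoder_layers:                        # every layer reuses the same mask
#     x = layer(x, attn_mask=enc_self_attn_mask)
```
PyTorch's `nn.MultiheadAttention` exposes the same idea through its `key_padding_mask` argument, which takes the [batch_size, seq_len] boolean mask directly.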
3 Masks in the decoder
The decoder contains two attention sublayers, each with its own mask:
- Masked self-attention (sequence mask, which prevents future information from leaking)
  - Each position may only attend to itself and earlier tokens.
  - Tokens after the current position are blocked, which yields an upper-triangular mask.
  - It can be combined with the padding mask: mask the padding positions and the future positions together (see the sketch after this list).
  - The final self-attention mask has shape [batch_size, tgt_len, tgt_len].
- Encoder-decoder attention
  - Needs to mask the encoder's padding (reusing the encoder's padding mask).
  - This keeps the decoder from attending to padding tokens in the encoder output.
  - Its shape is [batch_size, tgt_len, src_len].
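The sketch below builds both decoder masks for a toy batch, as a minimal illustration of the shapes listed above; the tensors `tgt_seq`/`src_seq` and the use of `|` to OR the two boolean masks together are assumptions made for this example, not a fixed API.
```python
import torch

# Toy target and source batches: 0 = PAD
tgt_seq = torch.tensor([[1, 2, 3, 4, 0],
                        [1, 2, 0, 0, 0]])
src_seq = torch.tensor([[1, 2, 3, 4, 0],
                        [1, 2, 3, 0, 0]])
batch_size, tgt_len = tgt_seq.size()
_, src_len = src_seq.size()

# 1) Sequence mask: positions above the diagonal are future tokens (True = blocked)
seq_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
seq_mask = seq_mask.unsqueeze(0).expand(batch_size, -1, -1)         # [batch_size, tgt_len, tgt_len]

# 2) Target padding mask, broadcast over the query dimension
tgt_pad_mask = tgt_seq.eq(0).unsqueeze(1).expand(-1, tgt_len, -1)   # [batch_size, tgt_len, tgt_len]

# 3) Decoder self-attention mask: blocked if the key is a future token OR padding
dec_self_attn_mask = seq_mask | tgt_pad_mask                        # [batch_size, tgt_len, tgt_len]

# 4) Encoder-decoder attention mask: queries come from the decoder (tgt_len),
#    keys come from the encoder (src_len), so only encoder padding is masked
dec_enc_attn_mask = src_seq.eq(0).unsqueeze(1).expand(-1, tgt_len, -1)  # [batch_size, tgt_len, src_len]

print(dec_self_attn_mask[0])
print(dec_enc_attn_mask[0])
```
Both masks are then applied exactly like the encoder padding mask above: `scores.masked_fill(mask, float('-inf'))` before the softmax.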