Chinese Version
How Layer Norm Handles Sentences of Different Lengths (with Padding)
In NLP tasks, sentences usually differ in length. To enable batch processing, sentences are typically padded to the same length. In this setting, a key question is how Layer Normalization (Layer Norm) treats padding tokens while still normalizing the valid part of each sequence correctly.
This article covers the following questions:
- How does Layer Norm compute the mean and standard deviation?
- How does padding affect Layer Norm?
- How can Layer Norm be made robust to padding?
- A detailed walkthrough with examples and code.
1. Basics of Layer Norm
Layer Norm normalizes each sample (sentence) across its feature dimensions. The formula is:

$$\text{LN}(x_i) = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$

where:
- $x_i$: each feature value of the input sample.
- $\mu$: the mean of all features in the sample.
- $\sigma^2$: the variance of all features in the sample.
- $\gamma$, $\beta$: learnable parameters that rescale and shift the normalized output.
- $\epsilon$: a small constant that prevents division by zero.
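As a quick sanity check, the sketch below (a minimal example with arbitrary toy values, not part of the original article) computes this formula by hand and compares it against PyTorch's built-in `torch.nn.functional.layer_norm`:
```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.5, 2.0, 1.8])           # one sample's features (toy values)
mu = x.mean()
var = x.var(unbiased=False)                  # Layer Norm uses the biased (population) variance
manual = (x - mu) / torch.sqrt(var + 1e-5)   # gamma = 1, beta = 0
builtin = F.layer_norm(x, (3,), eps=1e-5)

print(manual)   # ≈ tensor([-1.2976,  1.1354,  0.1622])
print(builtin)  # matches the manual result
```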
2. The Effect of Padding on Layer Norm
In NLP tasks, sentences are usually padded to the same length so that they can form a batch. For example, consider the following two sentences:
- Sentence 1: [Today, is, great], length 3.
- Sentence 2: [How, is, it, going, ?], length 5.
After padding both to length 5:
- Sentence 1: [Today, is, great, [PAD], [PAD]]
- Sentence 2: [How, is, it, going, ?]
The padding tokens introduce extra, meaningless values (usually 0), which distort the result if the mean and variance are computed directly over the whole padded sequence.
Problem:
Layer Norm computes the mean and variance over all dimensions of a sample. Without special handling, the zeros at the padding positions are included in the statistics, pulling the mean and variance away from those of the valid tokens.
Solution:
- Use a valid-token mask to ignore the padding positions, so that the mean and variance are computed only over the valid part of the sequence (see the sketch below).
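The following minimal sketch (toy values, the same ones used in the example later) contrasts the statistics computed over the whole padded sequence with the statistics computed over the valid tokens only:
```python
import torch

x = torch.tensor([1.5, 2.0, 1.8, 0.0, 0.0])  # padded sample
mask = torch.tensor([1., 1., 1., 0., 0.])     # 1 = valid token, 0 = padding

# Statistics over all positions (padding included) -- what plain Layer Norm would see
print(x.mean().item(), x.var(unbiased=False).item())   # ≈ 1.06, 0.7744

# Statistics over valid positions only
n_valid = mask.sum()
mean = (x * mask).sum() / n_valid
var = (((x - mean) ** 2) * mask).sum() / n_valid
print(mean.item(), var.item())                          # ≈ 1.7667, 0.0422
```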
3. Robustness and Benefits of Layer Norm
The advantages of Layer Norm include:
- Adaptation to variable-length sequences: with a mask, only the valid part of each sequence is normalized and the padding tokens are ignored.
- More stable training: Layer Norm normalizes the feature distribution of each sample, reducing sensitivity to weight initialization; this is especially helpful for long sequences and in the presence of vanishing gradients.
- Faster convergence: normalized features have a more balanced distribution, which helps the optimizer converge more quickly.
- Robustness to sequence features: by learning $\gamma$ and $\beta$, Layer Norm can flexibly adapt to changes in the feature distribution.
4. Example: [Today is great] and [How is it going?]
Input data
```python
import torch
import torch.nn as nn

# Input sequences (one scalar feature per token in this toy example)
seqs = [
    [1.5, 2.0, 1.8, 0.0, 0.0],  # [Today is great [PAD] [PAD]]
    [2.1, 1.9, 2.2, 1.8, 2.0]   # [How is it going ?]
]
input_data = torch.tensor(seqs)

# Valid sequence lengths
lengths = [3, 5]  # effective length of each sentence
mask = torch.tensor([[1, 1, 1, 0, 0],   # valid positions of sentence 1
                     [1, 1, 1, 1, 1]])  # valid positions of sentence 2
```
Manual Layer Norm computation for sentence 1
For sentence 1, $[1.5, 2.0, 1.8, 0.0, 0.0]$, the valid part is $[1.5, 2.0, 1.8]$.
- Mean over valid tokens:
  $$\mu = \frac{1.5 + 2.0 + 1.8}{3} = 1.7667$$
- Variance over valid tokens:
  $$\sigma^2 = \frac{(1.5 - 1.7667)^2 + (2.0 - 1.7667)^2 + (1.8 - 1.7667)^2}{3} = 0.0422$$
- Normalization (applied to each valid element):
  $$x_{\text{norm}} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
  $$[1.5, 2.0, 1.8] \to [-1.2976, 1.1354, 0.1622]$$
- The padding positions are not normalized and stay at zero:
  $$[1.5, 2.0, 1.8, 0.0, 0.0] \to [-1.2976, 1.1354, 0.1622, 0.0, 0.0]$$
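A few lines of PyTorch (a minimal check, not part of the original example) confirm these numbers:
```python
import torch

valid = torch.tensor([1.5, 2.0, 1.8])
mean = valid.mean()                      # ≈ 1.7667
var = valid.var(unbiased=False)          # ≈ 0.0422 (population variance, as Layer Norm uses)
print((valid - mean) / torch.sqrt(var + 1e-5))
# ≈ tensor([-1.2976,  1.1354,  0.1622])
```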
Code implementation
Below is a complete PyTorch implementation of a Layer Norm that handles padding:
```python
class MaskedLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(normalized_shape))
        self.beta = nn.Parameter(torch.zeros(normalized_shape))

    def forward(self, x, mask):
        # Mean and variance over the valid positions only
        mask = mask.to(x.dtype)                    # ensure the mask participates in float arithmetic
        mask_sum = mask.sum(dim=1, keepdim=True)   # number of valid tokens per sentence
        mean = (x * mask).sum(dim=1, keepdim=True) / mask_sum
        var = ((x - mean) ** 2 * mask).sum(dim=1, keepdim=True) / mask_sum
        # Normalize, then zero out padding positions so they stay at 0
        normed = (x - mean) / torch.sqrt(var + self.eps)
        return (normed * self.gamma + self.beta) * mask

# Apply the masked Layer Norm
layer_norm = MaskedLayerNorm(5)
output = layer_norm(input_data, mask)
print("Masked Layer Norm Output:\n", output)
```
5. Summary
- What Layer Norm provides:
  In NLP, Layer Norm standardizes the feature distribution of each sample, reduces sensitivity to weight initialization, and speeds up convergence.
- Handling padding:
  Masking ignores padding tokens, so the mean and variance are computed only over the valid sequence, making training more robust.
- Why it suits NLP:
  Layer Norm normalizes each sample's feature distribution independently, so it handles variable-length inputs well and helps mitigate vanishing gradients on long sequences.
English Version
Handling Layer Normalization with Variable Sentence Lengths (Including Padding)
In NLP tasks, sentences often have variable lengths. To enable batch processing, sentences are typically padded to the same length. In this context, a crucial question arises: How does Layer Normalization handle padding tokens, and what is its significance for NLP tasks?
This article will cover:
- How Layer Norm computes mean and variance.
- The impact of padding on Layer Norm.
- How to ensure robustness in Layer Norm for padded sequences.
- Detailed explanation with examples and code.
1. Recap: Layer Norm Basics
Layer Normalization (Layer Norm) normalizes each sample independently across its feature dimensions. The formula is:
$$\text{LN}(x_i) = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$

Where:
- $x_i$: the input features of the sample.
- $\mu$: mean of the features in the sample.
- $\sigma^2$: variance of the features in the sample.
- $\gamma, \beta$: learnable parameters for scaling and shifting.
- $\epsilon$: a small constant to prevent division by zero.
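The sketch below (a minimal illustration with arbitrary values, not taken from the article) verifies this formula against PyTorch's `nn.LayerNorm` module:
```python
import torch
import torch.nn as nn

x = torch.tensor([[0.2, 1.0, -0.5, 0.3]])          # one sample, four features
ln = nn.LayerNorm(4)                                # gamma initialized to 1, beta to 0

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # population variance
manual = (x - mu) / torch.sqrt(var + ln.eps)

print(ln(x))     # matches the manual computation
print(manual)
```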
2. Padding and Its Impact on Layer Norm
In NLP tasks, sequences of different lengths are padded to the same length, as shown below:
- Sentence 1: [Today, is, great] → [Today, is, great, [PAD], [PAD]]
- Sentence 2: [How, is, it, going, ?] → [How, is, it, going, ?]
Padding tokens are typically represented as zeros or other placeholders. However, including these padding tokens in Layer Norm's computations (mean and variance) can lead to incorrect normalization because they distort the statistical properties of the actual features.
Problem:
Layer Norm, by default, computes mean and variance across all features, including padding. This can cause:
- Skewed mean and variance due to the presence of zeros.
- Reduced effectiveness in normalizing meaningful tokens.
Solution:
To handle this, masking can be applied. The mask ensures that only the valid tokens (non-padding) are considered during mean and variance calculation.
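As a quick illustration (toy numbers, the same ones used in the example below), this snippet contrasts per-sentence statistics computed with and without the mask:
```python
import torch

x = torch.tensor([[1.5, 2.0, 1.8, 0.0, 0.0],
                  [2.1, 1.9, 2.2, 1.8, 2.0]])
mask = torch.tensor([[1., 1., 1., 0., 0.],
                     [1., 1., 1., 1., 1.]])

# Unmasked per-sentence mean: the padding zeros drag sentence 1's mean down
print(x.mean(dim=1))                            # ≈ [1.0600, 2.0000]

# Masked per-sentence mean: only valid tokens contribute
print((x * mask).sum(dim=1) / mask.sum(dim=1))  # ≈ [1.7667, 2.0000]
```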
3. Why Is Layer Norm Robust for NLP Tasks?
- Dynamic Adaptation to Sequence Features: Layer Norm computes statistics at the sample level, making it suitable for sequences of varying lengths.
- Padding-Aware Normalization: by masking padding tokens, Layer Norm focuses solely on meaningful tokens, preserving the semantic integrity of the sequence.
- Improved Stability and Convergence: normalizing features across dimensions reduces variance in gradient magnitudes, preventing exploding/vanishing gradients.
- Acceleration of Training: normalized inputs lead to smoother optimization landscapes, helping the model converge faster.
4. Example: [Today is great] and [How is it going?]
Input Data with Padding
```python
import torch

# Input sequences with padding
seqs = [
    [1.5, 2.0, 1.8, 0.0, 0.0],  # [Today, is, great, [PAD], [PAD]]
    [2.1, 1.9, 2.2, 1.8, 2.0]   # [How, is, it, going, ?]
]
input_data = torch.tensor(seqs, dtype=torch.float32)

# Mask indicating valid tokens (1 for valid, 0 for padding)
mask = torch.tensor([[1, 1, 1, 0, 0],   # Sentence 1
                     [1, 1, 1, 1, 1]])  # Sentence 2
```
Manual Computation of Layer Norm
Take Sentence 1: [1.5, 2.0, 1.8, 0.0, 0.0].
- Valid tokens: [1.5, 2.0, 1.8].
- Padding tokens: [0.0, 0.0] (to be ignored).
- Mean:
  $$\mu = \frac{1.5 + 2.0 + 1.8}{3} = 1.7667$$
- Variance:
  $$\sigma^2 = \frac{(1.5 - 1.7667)^2 + (2.0 - 1.7667)^2 + (1.8 - 1.7667)^2}{3} = 0.0422$$
- Normalize Valid Tokens:
  $$x_{\text{norm}} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
  $$[1.5, 2.0, 1.8] \to [-1.2976, 1.1354, 0.1622]$$
- Retain Padding as Zero. Final normalized output:
  $$[-1.2976, 1.1354, 0.1622, 0.0, 0.0]$$
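These numbers can be reproduced (a minimal check, not part of the original example) by applying PyTorch's functional layer norm to the valid slice only:
```python
import torch
import torch.nn.functional as F

padded = torch.tensor([1.5, 2.0, 1.8, 0.0, 0.0])
# Normalizing only the valid slice reproduces the values above
print(F.layer_norm(padded[:3], (3,)))  # ≈ tensor([-1.2976,  1.1354,  0.1622])
```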
Implementing Masked Layer Norm in PyTorch
Below is a custom implementation of Layer Norm that ignores padding:
```python
import torch
import torch.nn as nn

class MaskedLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(normalized_shape))
        self.beta = nn.Parameter(torch.zeros(normalized_shape))

    def forward(self, x, mask):
        # Masked mean and variance over valid positions only
        mask = mask.to(x.dtype)                    # ensure the mask participates in float arithmetic
        mask_sum = mask.sum(dim=1, keepdim=True)   # number of valid tokens per sentence
        mean = (x * mask).sum(dim=1, keepdim=True) / mask_sum
        var = ((x - mean) ** 2 * mask).sum(dim=1, keepdim=True) / mask_sum
        # Normalize, then zero out padding positions so they stay at 0
        normed = (x - mean) / torch.sqrt(var + self.eps)
        return (normed * self.gamma + self.beta) * mask

# Instantiate and apply Masked Layer Norm
layer_norm = MaskedLayerNorm(5)
output = layer_norm(input_data, mask)
print("Masked Layer Norm Output:\n", output)
5. Insights and Implications
- Handling Padding with Masking: masking ensures that only valid tokens contribute to the mean and variance, keeping Layer Norm focused on meaningful sequence features.
- Relevance in NLP: by normalizing each sample's feature dimensions, Layer Norm effectively handles long-term dependencies and ensures better gradient flow.
- Stability and Faster Convergence: masked Layer Norm improves the model's robustness during training, especially in transformer-based architectures where padding is prevalent.
- Practical Usage: in standard Transformer implementations (e.g., Hugging Face Transformers), Layer Norm is applied over each token's hidden dimension, so padding tokens do not affect the statistics of valid tokens; padding is handled separately through attention masks. Explicit masked normalization, as shown above, matters when the statistics are computed along the sequence dimension.
6. Conclusion
Layer Norm is indispensable for NLP tasks, especially in transformer architectures. Its ability to normalize individual samples' feature dimensions while accounting for padding ensures robust and stable training. By applying masking techniques, it adapts seamlessly to variable-length sequences, enabling better handling of real-world textual data.
Postscript
Written in Shanghai at 17:56 on December 14, 2024, with the assistance of GPT-4o.