Hands-On AI - Multilayer Perceptrons 6 - Dropout: A Powerful Regularizer for Deep Neural Networks

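In deep learning, overfitting is a common challenge, especially when a complex model is trained on relatively little data. To keep a model from memorizing too many details of the training set and then performing poorly on new data, researchers have proposed a variety of regularization methods. Dropout is one of the most effective. During training it randomly drops the outputs of some neurons, which pushes the model toward more robust features and improves generalization.

This article walks through the principle, implementation, and practical use of dropout so that you can apply the technique with confidence.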

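1. Revisiting Overfitting

Overfitting occurs when a model performs very well on the training data but poorly on test data or other new data. To understand how dropout counters overfitting, we first review where overfitting comes from.

Because deep neural networks are such flexible models, they can easily fit the noise and incidental details in the training data. For example, even when there is no clear relationship between features and labels, a network can still "memorize" the training labels, and whatever it memorized usually does not hold on real test data. More data helps, but it does not remove the risk: a sufficiently flexible network can overfit even when there are far more examples than features.

This is the bias-variance tradeoff at work. Compared with linear models, neural networks have lower bias (they can learn complex relationships among features) but tend to have higher variance (they adapt too closely to the training data). We therefore need measures that improve generalization.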

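2. Robustness to Perturbations

A good predictive model should cope with variation in unseen data. Generalization means that the model not only performs well on the training data but also makes sensible predictions on data it has never seen.

Dropout improves robustness by injecting noise during training. Specifically, during the forward pass it randomly drops some of the neurons in each layer it is applied to, so the model becomes less sensitive to small changes in its inputs; this acts as a regularizer on the training process.

Why inject noise?

Injecting noise works like a form of regularization. A very complex model may overfit the training data and even learn its random noise. By injecting noise, dropout keeps the network from relying too heavily on any particular feature, which improves generalization.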

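3. How Dropout Works

The core idea of dropout is to randomly select some neurons in a layer's output and "drop" them (set them to zero) before computing the outputs of the subsequent layers. This can be viewed as training many different "sub-networks": each training iteration uses a different combination of active neurons, so the network cannot rely too heavily on any particular neuron.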
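To make the sub-network picture concrete, the sketch below samples a few keep/drop masks for a small hidden layer using only the standard library. The helper name `sample_mask`, the layer size of 4, and the rate of 0.5 are illustrative assumptions, not part of the original article.

```python
import random


def sample_mask(num_units, p, rng):
    """Return a list of 0/1 flags; each unit is dropped (0) with probability p."""
    return [0 if rng.random() < p else 1 for _ in range(num_units)]


rng = random.Random(0)  # fixed seed so the run is repeatable
num_units, p = 4, 0.5   # hypothetical layer size and dropout rate

# Each training step draws a fresh mask, i.e. uses a different sub-network.
masks = [sample_mask(num_units, p, rng) for _ in range(5)]
for step, mask in enumerate(masks, start=1):
    print(f"step {step}: mask = {mask}")

# A layer with num_units maskable units admits 2 ** num_units distinct masks.
num_subnetworks = 2 ** num_units
print("possible sub-networks:", num_subnetworks)
```

Even this tiny layer admits 16 different masks, and the count doubles with every additional unit, which is why a network trained with dropout is often described as an implicit ensemble of sub-networks.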

Mathematical formulation

Suppose a hidden layer produces an activation vector $\mathbf{h}$. We introduce a random mask $\mathbf{r}$ that decides whether each neuron is kept. Each activation $h_i$ is kept with probability $1-p$ and dropped with probability $p$, where $p$ is the dropout rate. Ignoring rescaling for the moment, the dropping step is:

$$
h_i^{\text{drop}} = \begin{cases} h_i & \text{with probability } 1 - p, \\ 0 & \text{with probability } p. \end{cases}
$$

To keep the expected output of each layer unchanged, we also rescale the surviving activations. Given the layer's activation vector $\mathbf{h}$, the rescaled output after dropout, which is what is actually used, is:

$$
\mathbf{h}^{\text{drop}} = \frac{1}{1 - p} \cdot \mathbf{h} \odot \mathbf{r}
$$

Here $\odot$ denotes element-wise multiplication and $\mathbf{r}$ is a binary mask with the same shape as $\mathbf{h}$: its entries are drawn independently, each equal to $1$ (keep) with probability $1-p$ and $0$ (drop) with probability $p$. In practice the mask is obtained by drawing uniform random numbers on $[0, 1)$ and comparing them with $p$.
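The claim that rescaling preserves the expectation can be checked in one line. Assuming each mask entry $r_i$ equals $1$ with probability $1-p$ and $0$ with probability $p$, so that $E[r_i] = 1-p$, and writing the rescaled output element-wise as $h_i^{\text{drop}} = h_i r_i / (1-p)$:

```latex
E\left[h_i^{\text{drop}}\right]
  = \frac{h_i}{1-p}\, E[r_i]
  = \frac{h_i}{1-p}\,(1-p)
  = h_i .
```

As a quick numeric illustration with $p = 0.5$ and $h_i = 2$: the output is either $0$ or $4$, each with probability $0.5$, so its mean is $2$, the original activation.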

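4. Dropout in Practice

Using dropout in a model

In practice, dropout is usually added after the hidden layers of a network (after the activation function), for example after each hidden layer of a multilayer perceptron (MLP). Each layer can have its own dropout rate, which controls the fraction of neurons dropped there; layers closer to the input usually get a lower dropout rate.

Example: adding dropout to a multilayer perceptron

Suppose we have a multilayer perceptron with two hidden layers, each with its own dropout rate. Here is a simple implementation in PyTorch: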

import torch
from torch import nn
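# Helper module that provides load_data_fashion_mnist and train_ch3.
# Assumption: if you use the pip-installed d2l package instead, its PyTorch
# helpers are exposed via `from d2l import torch as d2l`.
import d2l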


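# Dropout function
def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # Drop every element
    if dropout == 1:
        return torch.zeros_like(X)
    # Keep every element
    if dropout == 0:
        return X
    # 1.0 where the element is kept (probability 1 - dropout), 0.0 where dropped
    mask = (torch.rand(X.shape) > dropout).float()
    # Rescale the kept elements so the expected value is unchanged
    return mask * X / (1.0 - dropout)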


# Define the network
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2,
                 dropout1, dropout2, is_training=True):
        super(Net, self).__init__()
        self.num_inputs = num_inputs
        self.training = is_training
        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
        self.relu = nn.ReLU()
        self.dropout1 = dropout1
        self.dropout2 = dropout2

    def forward(self, X):
        X = X.reshape((-1, self.num_inputs))  # Flatten each example to a vector of length num_inputs
        H1 = self.relu(self.lin1(X))
        # Apply dropout only while training the model
        if self.training is True:
            H1 = dropout_layer(H1, self.dropout1)
        H2 = self.relu(self.lin2(H1))
        if self.training is True:
            H2 = dropout_layer(H2, self.dropout2)
        return self.lin3(H2)


# Create the network
net = Net(num_inputs=784, num_outputs=10, num_hiddens1=256, num_hiddens2=256, dropout1=0.2, dropout2=0.5)

Code walkthrough:

In the code above, the two hidden layers use different dropout rates: $0.2$ for the first hidden layer and $0.5$ for the second.
The dropout function dropout_layer
- In dropout_layer, torch.rand(X.shape) generates a tensor with the same shape as the input $\mathbf{X}$ whose elements are uniformly distributed on [0, 1). The comparison (torch.rand(X.shape) > dropout) then returns a Boolean tensor of the same shape, indicating whether each element exceeds dropout, which happens with probability 1 - dropout. Finally, .float() converts the Booleans to floating-point numbers (True becomes $1.0$, False becomes $0.0$). Every element of mask is therefore $0$ or $1$: $1$ means the corresponding neuron is kept and $0$ means it is dropped.
- mask * X computes the output after dropping, with the dropped neurons set to $0$. To keep the expected value unchanged, the kept neurons are rescaled by dividing by (1.0 - dropout). Because some neurons are dropped, the remaining ones must carry proportionally more weight. With this rescaling, the expected output during training matches the output at test time, when dropout is not applied.
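The two points above can be checked empirically. The sketch below repeats the article's dropout_layer so that it runs on its own, applies it to a tensor of ones, and inspects the result; the tensor size, the rate of 0.5, and the fixed seed are illustrative assumptions.

```python
import torch


def dropout_layer(X, dropout):
    """Inverted dropout, same logic as the function in the article."""
    assert 0 <= dropout <= 1
    if dropout == 1:
        return torch.zeros_like(X)
    if dropout == 0:
        return X
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)


torch.manual_seed(0)        # repeatable run
X = torch.ones(1000, 100)   # every activation equals 1
p = 0.5
out = dropout_layer(X, p)

# Kept entries are rescaled to 1 / (1 - p); dropped entries are 0.
print("distinct values:", torch.unique(out))
# The empirical mean should stay close to the original mean of 1.
print("mean after dropout:", out.mean().item())
```

With 100,000 entries the sample mean lands very close to 1, which is the expectation-preserving property the division by (1.0 - dropout) is meant to provide.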
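The forward-propagation method forward
- X.reshape((-1, self.num_inputs)): if each input example is two-dimensional (such as an image), reshape flattens it. For example, a 28x28 image is flattened into a 784-dimensional vector.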
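A minimal sketch of that flattening step, assuming a hypothetical mini-batch of two 28x28 single-channel images:

```python
import torch

# Hypothetical mini-batch: (batch, channel, height, width)
X = torch.zeros(2, 1, 28, 28)
num_inputs = 28 * 28  # 784 features per image

# -1 lets PyTorch infer the batch dimension from the total number of elements.
flat = X.reshape((-1, num_inputs))
print(flat.shape)
```

Each row of `flat` is one image laid out as a 784-dimensional vector, which is the shape the first nn.Linear layer expects.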

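Training and testing

Dropout is active only during training. At test time we use the complete network without dropping any neurons, which keeps the predictions stable.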
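The difference between the two modes can be seen directly on PyTorch's built-in nn.Dropout module. This is a small sketch; the input of ones and the rate of 0.5 are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
X = torch.ones(2, 8)

drop.train()            # training mode: entries are zeroed at random
train_out = drop(X)     # kept entries are scaled by 1 / (1 - p)

drop.eval()             # evaluation mode: dropout acts as the identity
eval_out = drop(X)

print("train:", train_out)
print("eval :", eval_out)
print("training flag after eval():", drop.training)
```

Calling net.eval() on the Net class above sets self.training to False in the same way, which is exactly the flag its forward method checks before applying dropout_layer.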

num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss(reduction='none')
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer, batch_size)

Output:

epoch 1, train loss 0.859, train acc 0.680, test acc 0.803
epoch 2, train loss 0.516, train acc 0.813, test acc 0.827
epoch 3, train loss 0.457, train acc 0.833, test acc 0.843
epoch 4, train loss 0.419, train acc 0.846, test acc 0.851
epoch 5, train loss 0.399, train acc 0.853, test acc 0.849
epoch 6, train loss 0.380, train acc 0.861, test acc 0.829
epoch 7, train loss 0.361, train acc 0.867, test acc 0.831
epoch 8, train loss 0.352, train acc 0.869, test acc 0.855
epoch 9, train loss 0.339, train acc 0.875, test acc 0.861
epoch 10, train loss 0.334, train acc 0.877, test acc 0.853

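Concise implementation

With the high-level API of a deep learning framework, all we need to do is add a Dropout layer after each fully connected layer (here, after its activation), passing the dropout probability as the only argument to its constructor. During training, the Dropout layer randomly drops outputs of the previous layer (equivalently, inputs to the next layer) according to the specified dropout probability. At test time, the Dropout layer simply passes the data through.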

dropout1 = 0.2
dropout2 = 0.5
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    # Add a dropout layer after the first fully connected layer
                    nn.Dropout(dropout1),
                    nn.Linear(256, 256),
                    nn.ReLU(),
                    # Add a dropout layer after the second fully connected layer
                    nn.Dropout(dropout2),
                    nn.Linear(256, 10))


def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)


net.apply(init_weights)

num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss(reduction='none')
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer, batch_size)

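5. Summary

Dropout is an effective regularization technique that counters overfitting by dropping some neurons during training.
During training, dropout reduces the model's reliance on any particular neuron and improves generalization.
Dropout is not applied at test time; it strengthens the network's robustness by injecting noise during the training phase only.

Dropout is widely used in deep neural networks, notably in multilayer perceptrons and convolutional neural networks (CNNs), where it helps build models that are more robust and generalize better.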