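Abstract: Regularization is a key tool in deep learning for preventing overfitting and improving generalization. This article covers the principles, formulas, and implementation details of the mainstream regularization methods (L1/L2 regularization, Dropout, Batch Normalization, Early Stopping, and data augmentation). It uses PyTorch examples on an image classification task to compare them in practice, so that readers can choose and combine regularization strategies sensibly in real projects.
Keywords: regularization, overfitting, L1 regularization, L2 regularization, Dropout, Batch Normalization, Early Stopping, data augmentation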
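1. Introduction
Deep neural networks are highly expressive and can fit complex nonlinear functions. That same capacity brings the risk of overfitting: the model performs very well on the training set but generalizes poorly to the test set or to real-world use. Regularization is the core family of techniques developed to address this problem.
Starting from the overfitting problem, this article walks through the mainstream regularization techniques in deep learning:
- L1/L2 regularization
- Dropout
- Batch Normalization
- Early Stopping
- Data augmentation
Each technique is accompanied by its mathematical formulation and a PyTorch implementation, so that readers can follow it from theory to practice.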
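2. The overfitting problem
2.1 Training error vs test error
In machine learning the goal is always performance on **unseen data (the test set)**, not on the training set. When a model does well on the training set but noticeably worse on the test set, it is overfitting.
The typical signature is that training error keeps falling as training proceeds, while validation/test error falls first and then rises:
Training loss: decreases steadily
Validation loss: U-shaped (falls, then rises); the turning point marks the onset of overfitting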
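2.2 Bias-variance tradeoff
Overfitting is closely tied to a model's bias and variance:
- Bias: the systematic error between the model's predictions and the true values, reflecting how well the model can fit. High bias means underfitting: the model cannot capture the basic structure of the data.
- Variance: how much the model's predictions change across different training sets, reflecting its sensitivity to the training data. High variance means overfitting: the model has learned the noise in the training data.
An ideal model balances the two:
Expected error = bias² + variance + irreducible noise
- Underfitting: high bias, low variance (the model is too simple)
- Overfitting: low bias, high variance (the model is too complex)
- Good fit: both bias and variance are in a reasonable range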
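The decomposition above can be estimated empirically. The following is a minimal sketch, not part of the original experiments: it assumes a synthetic target `sin(2πx)`, Gaussian label noise, and polynomial models fitted with NumPy. The function name `bias_variance` and all of its parameters are illustrative choices.

```python
import numpy as np

def bias_variance(degree, n_trials=200, n_train=20, noise_std=0.3, seed=0):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit.

    Each trial draws a fresh noisy training set from y = sin(2*pi*x) + noise,
    fits a polynomial, and predicts on a fixed test grid.
    """
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 1.0, 50)
    f_test = np.sin(2 * np.pi * x_test)          # noise-free target
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_test) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))         # sensitivity to the training set
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Under these assumptions the straight line (degree 1) should show the largest bias, and the degree-9 polynomial the largest variance, mirroring the underfitting and overfitting regimes listed above.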
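2.3 What high variance looks like in practice
In deep learning, high variance (overfitting) typically shows up as:
- Training accuracy close to 100% while validation accuracy is clearly lower (a gap of 10 to 30 percentage points is common)
- Validation loss rising steadily after some epoch while training loss keeps falling
- Weights that often grow large in magnitude, with a widely spread parameter distribution
- Correct classification even of noisy training samples, to the point of memorizing mislabeled examples
- An overly complex decision boundary, full of "kinks" that exist only to fit a handful of training samples
The root cause is that the model's capacity exceeds what the training data can support: during training the model learns not only the true structure of the data but also the noise and coincidences of the training set.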
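3. L1/L2 regularization
3.1 Principle and formulas
L1 and L2 regularization are the most classic and most widely used regularization methods. Both add a penalty term to the loss function to constrain the size of the model's parameters.
Original loss function:
$$L(\theta) = \text{TaskLoss}(\theta)$$
With L1 regularization:
$$L_{\text{L1}}(\theta) = \text{TaskLoss}(\theta) + \lambda \sum_{i} |\theta_i|$$
With L2 regularization:
$$L_{\text{L2}}(\theta) = \text{TaskLoss}(\theta) + \lambda \sum_{i} \theta_i^2$$
where:
- $\theta = (\theta_1, \theta_2, \dots, \theta_n)$ are all the parameters of the model
- $\lambda > 0$ is the regularization strength, a hyperparameter set by the user
- In L1 regularization the penalty is the sum of the absolute values of the parameters
- In L2 regularization the penalty is the sum of the squares of the parameters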
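3.2 L1 regularization (Lasso)
L1 regularization tends to produce **sparse** weight matrices: many parameters are driven to exactly 0 and only a few key parameters stay nonzero. This gives L1 regularization a built-in **feature selection** effect.
Why does L1 produce sparse solutions?
Geometrically, the constraint region of the L1 penalty is a **diamond**, while the contours of the loss function are ellipses. An expanding ellipse is more likely to first touch the diamond at one of its corners, and the corners lie on the coordinate axes (where some parameters are 0), so L1 regularization tends to produce sparse solutions.
[Figure: geometric interpretation of L1. In the (θ₁, θ₂) plane the diamond-shaped L1 constraint region meets the loss contours at a vertex on a coordinate axis, where the corresponding parameter is exactly 0.]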
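The sparsity effect can also be seen algebraically. The sketch below is an illustration rather than something taken from the article's experiments: it compares the proximal (one-step shrinkage) operators of the two penalties on a hand-picked parameter vector. The function names `soft_threshold` and `l2_shrink` and the threshold `t` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal operator of t * |theta|: shift toward zero, clip at exactly zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def l2_shrink(theta, t):
    """Proximal operator of t * theta**2: multiplicative shrinkage, never exactly zero."""
    return theta / (1.0 + 2.0 * t)

theta = np.array([-1.5, -0.05, 0.0, 0.08, 0.4, 2.0])
t = 0.1

l1_result = soft_threshold(theta, t)
l2_result = l2_shrink(theta, t)

print("original:", theta)
print("L1 step :", l1_result)   # small entries become exactly 0
print("L2 step :", l2_result)   # every nonzero entry stays nonzero, just smaller
print("nonzeros after L1:", np.count_nonzero(l1_result))
print("nonzeros after L2:", np.count_nonzero(l2_result))
```

The L1 step removes every entry whose magnitude is below the threshold, while the L2 step only rescales, which is the algebraic counterpart of the diamond-versus-circle picture.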
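3.3 L2 regularization (Ridge / Weight Decay)
L2 regularization tends to make all parameters small without forcing them to zero, which is why it is also called weight decay. By penalizing large weights it keeps the model from relying too heavily on any single feature.
Why does L2 regularization work?
- It limits weight growth: the gradient of the penalty pulls every weight toward zero in proportion to its size at each step, so weights shrink steadily but rarely reach exactly zero
- It improves conditioning: the penalty adds $2\lambda I$ to the Hessian of the loss, which improves the condition number of the optimization problem
- It is equivalent to a Gaussian prior: from a Bayesian point of view, L2 regularization corresponds to assuming the parameters follow a Gaussian (normal) prior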
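The name "weight decay" follows directly from one step of plain gradient descent on $L_{\text{L2}}$. The derivation below is a short worked sketch; the learning rate $\eta$ is a symbol introduced here for illustration and does not appear in the formulas above.

```latex
\begin{aligned}
\nabla_\theta L_{\text{L2}}(\theta)
  &= \nabla_\theta \text{TaskLoss}(\theta) + 2\lambda\theta \\
\theta_{t+1}
  &= \theta_t - \eta\,\nabla_\theta L_{\text{L2}}(\theta_t) \\
  &= (1 - 2\eta\lambda)\,\theta_t - \eta\,\nabla_\theta \text{TaskLoss}(\theta_t)
\end{aligned}
```

Each update first multiplies the weights by the factor $(1 - 2\eta\lambda) < 1$ and then applies the usual task-loss gradient step. This exact equivalence holds for plain SGD; with adaptive optimizers the penalty gradient is rescaled along with the rest of the gradient.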
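3.4 L1 vs L2 comparison
| Property | L1 regularization | L2 regularization |
|---|---|---|
| Penalty term | $\lambda \sum_i \lvert\theta_i\rvert$ | $\lambda \sum_i \theta_i^2$ |
| Sparse solutions | Yes | No (parameters shrink toward zero but stay nonzero) |
| Feature selection | Yes | No |
| Gradient of the penalty | $\lambda\,\text{sign}(\theta_i)$ (constant magnitude) | $2\lambda\theta_i$ |
| Equivalent prior | Laplace distribution | Gaussian distribution |
| Optimization | Non-smooth at zero, requires a subgradient | Smooth, easy to optimize |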
3.5 PyTorch implementation
import torch
import torch.nn as nn
import torch.optim as optim


class L1L2RegularizedModel(nn.Module):
    """
    Example model showing explicit L1 and L2 penalty terms,
    built on a simple multilayer perceptron (MLP).
    """
    def __init__(self, input_size=784, hidden_sizes=(256, 128), output_size=10,
                 l1_lambda=0.001, l2_lambda=0.01):
        super().__init__()
        # Regularization strengths
        self.l1_lambda = l1_lambda
        self.l2_lambda = l2_lambda
        # Build the MLP
        layers = []
        prev_size = input_size
        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(0.2)  # combined with Dropout
            ])
            prev_size = hidden_size
        layers.append(nn.Linear(prev_size, output_size))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten the input
        return self.network(x)

    def l1_regularization(self):
        """L1 term: lambda times the sum of absolute values of all parameters."""
        # Note: this loop also penalizes biases, not only weights
        l1_loss = 0.0
        for param in self.parameters():
            l1_loss = l1_loss + torch.sum(torch.abs(param))
        return self.l1_lambda * l1_loss

    def l2_regularization(self):
        """L2 term: lambda times the sum of squares of all parameters."""
        l2_loss = 0.0
        for param in self.parameters():
            l2_loss = l2_loss + torch.sum(param ** 2)
        return self.l2_lambda * l2_loss

    def compute_loss(self, outputs, targets):
        """Total loss = task loss + L1 term + L2 term."""
        task_loss = nn.functional.cross_entropy(outputs, targets)
        return task_loss + self.l1_regularization() + self.l2_regularization()


def demonstrate_l1_l2():
    """Show how the L1 and L2 terms contribute to the total loss."""
    print("=" * 60)
    print("L1/L2 regularization demo")
    print("=" * 60)
    # Create the model
    model = L1L2RegularizedModel(
        input_size=784,
        hidden_sizes=(512, 256),
        output_size=10,
        l1_lambda=0.0001,  # L1 coefficient
        l2_lambda=0.001    # L2 coefficient
    )
    # Parameter counts
    print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
    # Simulated forward pass
    dummy_input = torch.randn(32, 1, 28, 28)
    dummy_output = model(dummy_input)
    dummy_targets = torch.randint(0, 10, (32,))
    # Individual loss terms
    total_loss = model.compute_loss(dummy_output, dummy_targets)
    task_loss = nn.functional.cross_entropy(dummy_output, dummy_targets)
    l1_loss = model.l1_regularization()
    l2_loss = model.l2_regularization()
    print("\nLoss breakdown:")
    print(f"  Task loss (CrossEntropy): {task_loss.item():.6f}")
    print(f"  L1 term: {l1_loss.item():.6f}")
    print(f"  L2 term: {l2_loss.item():.6f}")
    print(f"  Total loss: {total_loss.item():.6f}")
    # PyTorch's built-in weight decay
    print("\n--- Using the optimizer's built-in weight_decay ---")
    model2 = L1L2RegularizedModel(l1_lambda=0, l2_lambda=0)
    optimizer = optim.Adam(model2.parameters(), lr=0.001, weight_decay=0.01)
    # Note: for Adam, weight_decay adds weight_decay * w to the gradient,
    # i.e. the gradient of an L2 penalty (weight_decay / 2) * ||w||^2.
    # AdamW applies a decoupled decay instead, which is not the same thing.
    return model


# Run the demo
if __name__ == "__main__":
    demonstrate_l1_l2()
Sample output. The parameter count is exact. The loss values are approximate figures expected under PyTorch's default nn.Linear initialization and will differ slightly from run to run:
============================================================
L1/L2 regularization demo
============================================================
Total parameters: 535,818
Trainable parameters: 535,818
Loss breakdown:
  Task loss (CrossEntropy): ~2.30
  L1 term: ~1.02
  L2 term: ~0.26
  Total loss: ~3.58
--- Using the optimizer's built-in weight_decay ---
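4. Dropout
4.1 Overview
Dropout is one of the most important regularization techniques in deep learning, proposed by Srivastava et al. in 2014. The core idea: during training, randomly "switch off" (zero out) each neuron with probability p, so that the model cannot depend too heavily on any single neuron and the network is forced to learn more robust feature representations.
4.2 Dropout during training
For the output $\mathbf{z}$ of a layer, Dropout computes:
$$\mathbf{y} = \frac{1}{1-p} \cdot \mathbf{m} \odot \mathbf{z}$$
where:
- $\mathbf{m}$ is a mask vector whose elements are 0 (dropped) with probability $p$ and 1 (kept) with probability $1-p$
- $\odot$ denotes element-wise multiplication (the Hadamard product)
- The scaling by $\frac{1}{1-p}$ is what is known as Inverted Dropout; it keeps the expected output the same in training and testing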
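4.3 Why Inverted Dropout?
- Training: each neuron is dropped with probability $p$, so without rescaling the expected activation passed to the next layer is only $(1-p)$ times its full value
- Testing: all neurons take part in the computation, so without any adjustment the activations at test time would be $\frac{1}{1-p}$ times larger in expectation than those seen during training
- Solution: multiply by $\frac{1}{1-p}$ during training, so that each neuron's expected output is the same in training and testing and no rescaling is needed at inference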
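The expectation argument can be written out in one line. This is a short worked step using only the symbols defined in section 4.2, with $m_i$, $z_i$, $y_i$ denoting single components of the vectors:

```latex
\mathbb{E}[m_i] = 1 - p
\quad\Longrightarrow\quad
\mathbb{E}[y_i] = \frac{1}{1-p}\,\mathbb{E}[m_i]\,z_i
               = \frac{1}{1-p}\,(1-p)\,z_i = z_i
```

So the scaled training-time output matches, in expectation, the unmasked output used at test time.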
4.4 Dropout as ensemble learning
Dropout can also be understood from the perspective of ensemble learning:
- In each training iteration, randomly dropping neurons amounts to training one sub-network
- Over the whole training run, an exponentially large number of different sub-networks (which share weights) are trained
- At test time, running the full network approximates averaging the predictions of all these sub-networks
[Figure: a standard three-layer network next to the same network with Dropout. Each iteration switches off a different combination of neurons, so a different sub-network is trained each time.]
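4.5 Choosing the dropout rate
- p too small (below about 0.1): little regularization effect, the model may still overfit
- p too large (above about 0.5): the model underfits and training converges slowly
- Recommended range: 0.1 to 0.5
  - Shallow networks: 0.2 to 0.3
  - Deep networks: 0.3 to 0.5
- Rule of thumb: deeper, larger networks can usually tolerate higher dropout rates. Treat these values as starting points and tune p on a validation set.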
4.6 PyTorch implementation
import torch
import torch.nn as nn
import torch.nn.functional as F


class DropoutModel(nn.Module):
    """A small CNN for 32x32 RGB inputs that applies Dropout after every block."""
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.dropout_rate = dropout_rate
        # Convolutional layers extract image features
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(128)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(256)
        # Fully connected layers
        self.fc1 = nn.Linear(256 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)
        # Dropout layer (stateless, so one instance can be reused)
        self.dropout = nn.Dropout(p=dropout_rate)
        # Weight initialization
        self._initialize_weights()

    def _initialize_weights(self):
        """He initialization, suited to ReLU activations."""
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Conv block 1: Conv -> BN -> ReLU -> MaxPool -> Dropout
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)  # 32x32 -> 16x16
        x = self.dropout(x)
        # Conv block 2
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)  # 16x16 -> 8x8
        x = self.dropout(x)
        # Conv block 3
        x = self.conv3(x)
        x = self.bn3(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)  # 8x8 -> 4x4
        x = self.dropout(x)
        # Flatten
        x = x.view(x.size(0), -1)
        # Fully connected layers + Dropout
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.fc3(x)  # the output layer normally has no Dropout
        return x
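class AlphaDropout(nn.Module):
    """
    Alpha Dropout, designed to pair with the SELU activation.
    Dropped units are set to SELU's negative saturation value alpha', and the
    result is rescaled so that zero mean and unit variance are preserved.
    (PyTorch also ships a built-in nn.AlphaDropout.)
    """
    def __init__(self, p=0.5):
        super().__init__()
        if not 0.0 <= p < 1.0:
            raise ValueError("p must be in [0, 1)")
        self.p = p
        # alpha' = -scale * alpha of SELU, approximately -1.7581
        self.alpha_prime = -1.7580993408473766
        keep = 1.0 - p
        # Affine correction a * (...) + b that restores zero mean and unit variance
        self.a = (keep + self.alpha_prime ** 2 * keep * p) ** -0.5
        self.b = -self.a * self.alpha_prime * p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # Mask: 1 = keep, 0 = replace with alpha'
        mask = torch.bernoulli(torch.full_like(x, 1.0 - self.p))
        return self.a * (mask * x + (1 - mask) * self.alpha_prime) + self.b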
def demonstrate_dropout():
    """Compare training-mode and evaluation-mode behavior of Dropout."""
    print("=" * 60)
    print("Dropout demo")
    print("=" * 60)
    # Try several dropout rates on the same input
    x = torch.randn(8, 3, 32, 32)
    for dropout_rate in [0.0, 0.3, 0.5, 0.7]:
        model = DropoutModel(dropout_rate=dropout_rate)
        # Training mode (Dropout active) vs evaluation mode (Dropout disabled)
        model.train()
        train_output = model(x)
        model.eval()
        with torch.no_grad():
            eval_output = model(x)
        print(f"\nDropout rate: {dropout_rate}")
        print(f"  train-mode output mean: {train_output.mean().item():.4f}, std: {train_output.std().item():.4f}")
        print(f"  eval-mode output mean: {eval_output.mean().item():.4f}, std: {eval_output.std().item():.4f}")
        # Logits are rarely exactly zero; this is reported only for completeness
        print(f"  fraction of exactly-zero train outputs: {(train_output == 0).float().mean().item():.2%}")

    # Manual Inverted Dropout
    print("\n--- Manual Inverted Dropout ---")

    def inverted_dropout(x, drop_rate, training=True):
        """
        Inverted Dropout.
        Training: random mask plus rescaling.
        Testing: return the input unchanged.
        """
        if not training:
            return x
        # Random mask (1 = keep, 0 = drop)
        mask = torch.bernoulli(torch.ones_like(x) * (1 - drop_rate))
        # Rescale so the expected value is unchanged
        return mask * x / (1 - drop_rate)

    test_input = torch.randn(1, 10)
    train_out = inverted_dropout(test_input, 0.5, training=True)
    eval_out = inverted_dropout(test_input, 0.5, training=False)
    print(f"input: {test_input.squeeze().numpy()}")
    print(f"train output (dropout=0.5): {train_out.squeeze().numpy()}")
    print(f"eval output (no dropout): {eval_out.squeeze().numpy()}")


# Run the demo
if __name__ == "__main__":
    demonstrate_dropout()
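5. Batch Normalization
5.1 The Internal Covariate Shift problem
In a deep neural network, the input distribution of each layer keeps changing during training, because the parameters of the preceding layers are continually updated. This phenomenon is called Internal Covariate Shift (ICS).
Specifically:
- The input of layer $l$ is the output of layer $l-1$
- When the parameters of layer $l-1$ are updated, the distribution of its output changes too
- Layer $l$ therefore has to keep adapting to a moving input distribution, which makes training harder
5.2 How Batch Normalization works
Batch Normalization (BatchNorm for short) was proposed by Ioffe and Szegedy in 2015 to mitigate ICS by normalizing the inputs of each layer.
Normalization:
$$\hat{x} = \frac{x - \mathbb{E}[x]}{\sqrt{\text{Var}[x] + \epsilon}}$$
Learnable affine transform:
$$y = \gamma \hat{x} + \beta$$
where:
- $\mathbb{E}[x]$ and $\text{Var}[x]$ are the mean and variance of the current batch
- $\gamma$ and $\beta$ are learnable parameters that restore the model's expressive power
- $\epsilon$ (typically $10^{-5}$) prevents division by zero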
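5.3 Training vs inference behavior
BatchNorm behaves differently in training and inference:
| Phase | Mean | Variance | $\gamma$, $\beta$ |
|---|---|---|---|
| Training | Batch statistics | Batch statistics | Learnable |
| Inference | Exponential moving average (EMA) | Exponential moving average (EMA) | Fixed at their final trained values |
Moving average update:
# Update running mean and running var during training.
# PyTorch convention: momentum weights the NEW batch statistic (default 0.1).
running_mean = (1 - momentum) * running_mean + momentum * batch_mean
running_var = (1 - momentum) * running_var + momentum * batch_var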
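5.4 The effect of batch size
The effectiveness of Batch Normalization depends heavily on the batch size:
- Batch size too small: batch statistics are noisy and estimate the global mean and variance poorly
- Batch size too large: GPU memory pressure grows, and the mild regularizing noise from batch statistics fades, which can hurt generalization in some settings
- Suggestion: batch sizes of 16 to 64 are often a reasonable starting point, though the best value depends on the model, the dataset, and the learning rate
How batch size affects BN:
BS=1 → mean and variance estimates are extremely unstable, each step fluctuates strongly
BS=4 → statistics begin to stabilize but are still noisy
BS=16 → statistics are fairly stable, a sensible starting value
BS=32 → one of the better trade-offs
BS=64 → good, though generalization may start to be affected
BS=128+ → estimates are very stable, but the regularizing noise from batch statistics largely disappears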
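The instability at small batch sizes can be quantified with a small simulation. The sketch below is an illustration under simple assumptions (activations drawn from a standard normal distribution, 2000 simulated batches per setting); the function name `batch_mean_spread` is hypothetical. It measures how much the per-batch mean, which BN uses for normalization, fluctuates from batch to batch.

```python
import numpy as np

def batch_mean_spread(batch_size, n_batches=2000, seed=0):
    """Standard deviation of the per-batch mean across many simulated batches."""
    rng = np.random.default_rng(seed)
    # Each row is one batch of activations drawn from N(0, 1)
    batches = rng.normal(loc=0.0, scale=1.0, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

spreads = {bs: batch_mean_spread(bs) for bs in (1, 4, 16, 64, 256)}
for bs, s in spreads.items():
    print(f"batch size {bs:>3}: std of batch mean = {s:.3f}")
```

Under these assumptions the spread falls roughly as one over the square root of the batch size, so going from 4 to 64 samples cuts the noise in the estimated mean by about a factor of four.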
5.5 PyTorch implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim  # used by demonstrate_batchnorm below


class BatchNorm2dLayer(nn.Module):
    """
    Hand-written Batch Normalization, to show how BN works.
    Follows the nn.BatchNorm2d convention: momentum weights the new batch statistic.
    """
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum
        # Learnable parameters
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        # Global statistics used at inference
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))

    def forward(self, x):
        """x: (N, C, H, W), where N is the batch size and C the number of channels."""
        if self.training:
            # Training mode: use per-channel batch statistics, shape (C,)
            batch_mean = x.mean(dim=(0, 2, 3))
            batch_var = x.var(dim=(0, 2, 3), unbiased=False)
            # Update running statistics in place, outside the autograd graph
            with torch.no_grad():
                n = x.numel() / x.size(1)
                # The running variance tracks the unbiased estimate
                unbiased_var = batch_var * n / max(n - 1, 1)
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * unbiased_var)
                self.num_batches_tracked += 1
            mean, var = batch_mean, batch_var
        else:
            # Inference mode: use the running statistics
            mean, var = self.running_mean, self.running_var
        # Normalize
        x_norm = (x - mean.view(1, -1, 1, 1)) / torch.sqrt(var.view(1, -1, 1, 1) + self.eps)
        # Affine transform
        return self.gamma.view(1, -1, 1, 1) * x_norm + self.beta.view(1, -1, 1, 1)
class ModernResNet(nn.Module):
    """
    A ResNet-style network using Batch Normalization,
    showing the typical placement of BN in a CNN.
    """
    def __init__(self, num_classes=10):
        super().__init__()
        # Initial convolution
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        # Residual stages
        self.layer1 = self._make_layer(64, 64, 2, stride=1)
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        self.layer3 = self._make_layer(128, 256, 2, stride=2)
        self.layer4 = self._make_layer(256, 512, 2, stride=2)
        # Global average pooling
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        # Dropout + classifier
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(512, num_classes)
        # Weight initialization
        self._initialize_weights()

    def _make_layer(self, in_channels, out_channels, blocks, stride=1):
        """Build one stage of residual blocks."""
        layers = []
        # The first block may need to downsample the identity branch
        downsample = None
        if stride != 1 or in_channels != out_channels:
            downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
        layers.append(ResBlock(in_channels, out_channels, stride, downsample))
        # Remaining blocks
        for _ in range(1, blocks):
            layers.append(ResBlock(out_channels, out_channels))
        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.dropout(x)
        x = self.fc(x)
        return x


class ResBlock(nn.Module):
    """Residual block with two convolutions, each followed by BatchNorm."""
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        # Convolutions omit the bias because BN's beta plays that role
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = F.relu(out)
        return out
def demonstrate_batchnorm():
    """Demonstrate the behavior of Batch Normalization."""
    import torch.optim as optim  # local import keeps this function self-contained
    print("=" * 60)
    print("Batch Normalization demo")
    print("=" * 60)

    # 1. Train two small models on inputs whose mean drifts over time
    print("\n1. Training under a drifting input distribution:")
    torch.manual_seed(42)
    model_no_bn = nn.Sequential(nn.Linear(10, 10))
    model_bn = nn.Sequential(nn.BatchNorm1d(10), nn.Linear(10, 10))
    opt_no_bn = optim.SGD(model_no_bn.parameters(), lr=0.1)
    opt_bn = optim.SGD(model_bn.parameters(), lr=0.1)
    losses_bn, losses_no_bn = [], []
    target = torch.zeros(32, 10)
    for step in range(5):
        # Drifting input: the mean grows at every step
        x = torch.randn(32, 10) + step * 0.5
        for model, opt, log in ((model_bn, opt_bn, losses_bn),
                                (model_no_bn, opt_no_bn, losses_no_bn)):
            model.train()
            opt.zero_grad()
            loss = F.mse_loss(model(x), target)
            loss.backward()
            opt.step()
            log.append(loss.item())
    print(f"   loss per step with BN:    {[f'{l:.4f}' for l in losses_bn]}")
    print(f"   loss per step without BN: {[f'{l:.4f}' for l in losses_no_bn]}")

    # 2. Running statistics of a BN layer
    print("\n2. BatchNorm running statistics:")
    bn_layer = nn.BatchNorm2d(num_features=4, momentum=0.1)
    print(f"   initial running_mean: {bn_layer.running_mean.numpy()}")
    print(f"   initial running_var: {bn_layer.running_var.numpy()}")
    # Feed one batch
    x = torch.randn(8, 4, 16, 16)
    output = bn_layer(x)
    print(f"   batch mean: {x.mean(dim=(0, 2, 3)).numpy()}")
    print(f"   batch variance: {x.var(dim=(0, 2, 3)).numpy()}")
    print(f"   running_mean after update: {bn_layer.running_mean.numpy()}")

    # 3. Training mode vs evaluation mode
    print("\n3. Training mode vs evaluation mode:")
    bn = nn.BatchNorm2d(3)
    bn.train()
    train_out1 = bn(torch.randn(4, 3, 8, 8))
    bn.eval()
    with torch.no_grad():
        eval_out = bn(torch.randn(4, 3, 8, 8))
    print(f"   train-mode output mean: {train_out1.mean().item():.6f} (close to 0)")
    print(f"   eval-mode output mean: {eval_out.mean().item():.6f} (uses running statistics)")
    return ModernResNet(num_classes=10)


if __name__ == "__main__":
    demonstrate_batchnorm()
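6. Early Stopping
6.1 Principle
Early Stopping is a simple and effective regularization strategy. The core idea: when validation performance has not improved for a given number of consecutive epochs, stop training and restore the best model state.
It rests on one observation: a model's generalization usually peaks just before the validation loss starts to rise, and training beyond that point leads to overfitting.
6.2 Key parameters
- patience: the maximum number of epochs the validation loss is allowed to go without improving. For example, patience=10 means training stops if the validation loss has not improved for 10 epochs.
- min_delta: the minimum decrease in loss that counts as an "improvement". Setting it avoids mistaking statistical noise for progress.
- best_model_state: the model parameters at the lowest validation loss must be saved during training and restored when training stops.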
6.3 PyTorch implementation
import os
import copy
from collections import defaultdict

import torch
import torch.nn as nn


class EarlyStopping:
    """
    Early stopping helper.

    Usage:
        early_stopping = EarlyStopping(patience=10, min_delta=0.001)
        for epoch in range(num_epochs):
            train_loss = train_one_epoch(model, train_loader)
            val_loss = validate(model, val_loader)
            if early_stopping(val_loss, model):
                break

    Attributes:
        patience: number of epochs to wait without improvement
        min_delta: minimum change that counts as an improvement
        verbose: whether to print progress
    """
    def __init__(self, patience=10, min_delta=0.0, verbose=True,
                 mode='min', save_path='best_model.pth'):
        self.patience = patience
        self.min_delta = min_delta
        self.verbose = verbose
        self.mode = mode
        self.save_path = save_path
        # State
        self.counter = 0  # consecutive epochs without improvement
        self.best_score = None
        self.early_stop = False
        self.best_epoch = 0
        self.best_model_state = None
        # Comparison function, applied to the raw metric
        if mode == 'min':
            self.is_better = lambda current, best: current < best - min_delta
            self.score_name = 'loss'
        else:  # mode == 'max' (e.g. accuracy)
            self.is_better = lambda current, best: current > best + min_delta
            self.score_name = 'accuracy'

    def __call__(self, metric, model, epoch=None):
        """
        Decide whether training should stop.

        Args:
            metric: current validation metric (loss or accuracy)
            model: current model
            epoch: current epoch number (optional, used for logging)
        Returns:
            bool: True if early stopping is triggered
        """
        if self.best_score is None or self.is_better(metric, self.best_score):
            # First call, or a significant improvement: record the new best state
            is_first = self.best_score is None
            self.best_score = metric
            if epoch is not None:
                self.best_epoch = epoch
            self.best_model_state = copy.deepcopy(model.state_dict())
            self.counter = 0
            if self.verbose:
                tag = "best" if is_first else "improved, best"
                print(f"Epoch {epoch}: {self.score_name}={metric:.6f} ({tag})")
            return False
        # No significant improvement: increase the counter
        self.counter += 1
        if self.verbose:
            print(f"Epoch {epoch}: {self.score_name}={metric:.6f} "
                  f"(no improvement, counter: {self.counter}/{self.patience})")
        # Check whether early stopping is triggered
        if self.counter >= self.patience:
            self.early_stop = True
            if self.verbose:
                print(f"\nEarly stopping triggered after {self.counter} "
                      f"epochs without improvement")
                print(f"Best {self.score_name} was {self.best_score:.6f} "
                      f"at epoch {self.best_epoch}")
            return True
        return False

    def load_best_model(self, model):
        """Restore the best model state."""
        if self.best_model_state is not None:
            model.load_state_dict(self.best_model_state)
            if self.verbose:
                print(f"Loaded best model from epoch {self.best_epoch}")
        else:
            print("Warning: No best model state saved")

    def save_checkpoint(self, path=None):
        """Save a checkpoint."""
        save_path = path or self.save_path
        os.makedirs(os.path.dirname(save_path) or '.', exist_ok=True)
        torch.save({
            'best_score': self.best_score,
            'best_epoch': self.best_epoch,
            'best_model_state': self.best_model_state,
            'counter': self.counter,
            'early_stop': self.early_stop
        }, save_path)
        if self.verbose:
            print(f"Checkpoint saved to {save_path}")

    def load_checkpoint(self, path):
        """Load a checkpoint."""
        checkpoint = torch.load(path)
        self.best_score = checkpoint['best_score']
        self.best_epoch = checkpoint['best_epoch']
        self.best_model_state = checkpoint['best_model_state']
        self.counter = checkpoint['counter']
        self.early_stop = checkpoint['early_stop']
class TrainingHistory:
    """Training history recorder, for visualizing and analyzing a training run."""
    def __init__(self):
        self.history = defaultdict(list)

    def update(self, **kwargs):
        """Append the given metric values to the history."""
        for key, value in kwargs.items():
            self.history[key].append(value)

    def get_best_epoch(self, metric='val_loss', mode='min'):
        """Return the 0-based index of the best recorded value, or None."""
        values = self.history.get(metric, [])
        if not values:
            return None
        if mode == 'min':
            return values.index(min(values))
        return values.index(max(values))

    def summary(self):
        """Print a training summary."""
        print("\n" + "=" * 60)
        print("Training Summary")
        print("=" * 60)
        for key, values in self.history.items():
            if len(values) > 0:
                print(f"{key}: min={min(values):.6f}, "
                      f"max={max(values):.6f}, "
                      f"final={values[-1]:.6f}")
        best_val_loss_epoch = self.get_best_epoch('val_loss', 'min')
        if best_val_loss_epoch is not None:
            # Index into the history list, so 0 is the first recorded epoch
            print(f"\nBest model at history index {best_val_loss_epoch} "
                  f"(val_loss={self.history['val_loss'][best_val_loss_epoch]:.6f})")
def demonstrate_early_stopping():
    """Demonstrate Early Stopping on simulated loss curves."""
    print("=" * 60)
    print("Early Stopping demo")
    print("=" * 60)
    # Create the model
    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, 10)
    )
    early_stopping = EarlyStopping(patience=5, min_delta=0.01, verbose=True)
    history = TrainingHistory()
    # Simulate up to 20 epochs of training
    torch.manual_seed(42)
    print("\nSimulated training:\n")
    for epoch in range(1, 21):
        # Simulated losses: both fall at first, then validation loss rises (overfitting)
        if epoch <= 8:
            train_loss = 2.0 - epoch * 0.2 + torch.randn(1).item() * 0.05
            val_loss = 2.1 - epoch * 0.18 + torch.randn(1).item() * 0.08
        else:
            train_loss = 0.4 - epoch * 0.005 + torch.randn(1).item() * 0.02
            val_loss = 0.6 + (epoch - 8) * 0.03 + torch.randn(1).item() * 0.05  # overfitting begins
        history.update(train_loss=train_loss, val_loss=val_loss)
        # Check for early stopping
        if early_stopping(val_loss, model, epoch):
            print(f"\nTraining stopped at epoch {epoch}\n")
            break
    # Restore the best model
    early_stopping.load_best_model(model)
    history.summary()


if __name__ == "__main__":
    demonstrate_early_stopping()
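7. Data Augmentation
7.1 Principle
Data augmentation is one of the most direct ways to prevent overfitting: it increases the diversity of samples at the data level, so that the model learns features that generalize better. It assumes that a reasonable transformation of a training sample yields a "virtual" sample that still belongs to the same class distribution.
Data augmentation comes in two forms:
- Offline augmentation: transform the dataset before training to generate new training samples
- Online augmentation: apply transformations on the fly in each training batch
7.2 Image augmentation methods
7.2.1 Geometric transforms
- Flip: horizontal or vertical flipping
- Rotation: rotation by a random angle (typically within ±15°)
- Translation: random shifts
- Crop: random cropping (usually combined with padding)
- Scale: random rescaling
- Affine: a general combination of geometric transforms
7.2.2 Color transforms
- Color jitter: random changes to brightness, contrast, saturation, and hue
- Grayscale: random conversion to grayscale
- Color space transforms: transformations in HSV or other color spaces
7.3 Advanced augmentation techniques
7.3.1 Mixup
Mixup generates new training samples by linearly interpolating two samples and their labels:
$$\tilde{x} = \lambda x_i + (1-\lambda) x_j$$
$$\tilde{y} = \lambda y_i + (1-\lambda) y_j$$
where $\lambda \sim \text{Beta}(\alpha, \alpha)$ and $\alpha$ is a hyperparameter (commonly 0.2 to 0.4).
7.3.2 Cutout
Cutout masks a random rectangular region of the image (usually filled with 0 or gray), forcing the model to rely on features spread across the whole image.
7.3.3 CutMix
CutMix combines the ideas of Mixup and Cutout: a rectangular region from one sample is pasted onto another, and the corresponding labels are mixed in proportion to the pasted area.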
7.4 PyTorch implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
import numpy as np
from PIL import Image, ImageEnhance


class RandomAugmentation:
    """Apply one randomly chosen augmentation with probability p."""
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, img):
        """Pick one augmentation at random and apply it with probability p."""
        augmentations = [
            self.horizontal_flip,
            self.random_crop,
            self.random_rotation,
            self.color_jitter,
            self.random_erasing
        ]
        aug = augmentations[np.random.randint(len(augmentations))]
        if np.random.random() < self.p:
            img = aug(img)
        return img

    @staticmethod
    def horizontal_flip(img):
        """Horizontal flip."""
        if isinstance(img, torch.Tensor):
            return torch.flip(img, dims=[-1])
        return img.transpose(Image.FLIP_LEFT_RIGHT)

    @staticmethod
    def random_crop(img, padding=4):
        """Pad the image, then crop a random window of the original size."""
        if isinstance(img, torch.Tensor):
            _, h, w = img.shape
            pad = nn.ReflectionPad2d(padding)
            img_padded = pad(img.unsqueeze(0)).squeeze(0)
            top = np.random.randint(0, padding * 2 + 1)
            left = np.random.randint(0, padding * 2 + 1)
            return img_padded[:, top:top+h, left:left+w]
        else:
            w, h = img.size
            pad_img = Image.new(img.mode, (w + padding*2, h + padding*2))
            pad_img.paste(img, (padding, padding))
            top = np.random.randint(0, padding * 2 + 1)
            left = np.random.randint(0, padding * 2 + 1)
            return pad_img.crop((left, top, left + w, top + h))

    @staticmethod
    def random_rotation(img, max_angle=15):
        """Rotate by a random angle in [-max_angle, max_angle] degrees."""
        angle = np.random.uniform(-max_angle, max_angle)
        if isinstance(img, torch.Tensor):
            # Rotate with affine_grid + grid_sample
            angle_rad = angle * np.pi / 180
            theta = torch.tensor([
                [np.cos(angle_rad), -np.sin(angle_rad), 0],
                [np.sin(angle_rad), np.cos(angle_rad), 0]
            ], dtype=torch.float32)
            grid = F.affine_grid(theta.unsqueeze(0), img.unsqueeze(0).size(), align_corners=False)
            return F.grid_sample(img.unsqueeze(0), grid, align_corners=False).squeeze(0)
        return img.rotate(angle)

    @staticmethod
    def color_jitter(img, brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1):
        """
        Color jitter. This simplified version applies brightness and contrast
        only; saturation and hue are accepted but not used.
        Tensor inputs are assumed to be scaled to [0, 1].
        """
        if isinstance(img, torch.Tensor):
            # Brightness
            factor = 1.0 + np.random.uniform(-brightness, brightness)
            img = img * factor
            # Contrast
            factor = 1.0 + np.random.uniform(-contrast, contrast)
            mean = img.mean(dim=(1, 2), keepdim=True)
            img = (img - mean) * factor + mean
            return torch.clamp(img, 0, 1)
        else:
            enhancer = ImageEnhance.Brightness(img)
            img = enhancer.enhance(1.0 + np.random.uniform(-brightness, brightness))
            enhancer = ImageEnhance.Contrast(img)
            img = enhancer.enhance(1.0 + np.random.uniform(-contrast, contrast))
            return img

    @staticmethod
    def random_erasing(img, p=0.5, scale=(0.02, 0.33), ratio=(0.3, 3.3)):
        """Random erasing (Cutout). Only tensor inputs are modified."""
        if not isinstance(img, torch.Tensor) or np.random.random() > p:
            return img
        _, h, w = img.shape
        area = h * w
        for _ in range(100):
            target_area = np.random.uniform(scale[0], scale[1]) * area
            aspect_ratio = np.random.uniform(ratio[0], ratio[1])
            erase_h = int(np.sqrt(target_area * aspect_ratio))
            erase_w = int(np.sqrt(target_area / aspect_ratio))
            if 0 < erase_h < h and 0 < erase_w < w:
                top = np.random.randint(0, h - erase_h)
                left = np.random.randint(0, w - erase_w)
                img = img.clone()  # do not modify the caller's tensor in place
                img[:, top:top+erase_h, left:left+erase_w] = 0
                return img
        return img
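class Mixup:
    """
    Mixup data augmentation.
    Creates new training samples by mixing pairs of samples and their labels.
    It helps to:
    - reduce overfitting
    - improve robustness to adversarial examples
    - smooth the decision boundary
    """
    def __init__(self, alpha=0.2):
        self.alpha = alpha

    def __call__(self, batch_x, batch_y):
        """
        Args:
            batch_x: (N, C, H, W) input batch
            batch_y: (N,) label batch
        Returns:
            mixed_x: mixed inputs
            y_a: original labels (used in the mixed loss)
            y_b: labels of the samples mixed in
            lam: mixing coefficient
        """
        if self.alpha <= 0:
            return batch_x, batch_y, batch_y, 1.0
        # Sample the mixing coefficient from a Beta distribution
        lam = float(np.random.beta(self.alpha, self.alpha))
        # Pair every sample with a randomly chosen partner from the same batch
        indices = torch.randperm(batch_x.size(0), device=batch_x.device)
        # Mix inputs; the same permutation selects the partner labels
        mixed_x = lam * batch_x + (1 - lam) * batch_x[indices]
        return mixed_x, batch_y, batch_y[indices], lam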
class CutMix:
    """
    CutMix data augmentation.
    Pastes a rectangular region from one sample onto another.
    Uses pixel information more fully than Mixup.
    """
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def __call__(self, batch_x, batch_y):
        """
        Returns:
            mixed_x: mixed inputs
            y_a, y_b: labels of the base samples and of the pasted samples
            lam: fraction of each image that still comes from the base sample
        """
        # Sample the mixing coefficient
        lam = np.random.beta(self.alpha, self.alpha)
        # Random pairing within the batch
        batch_size = batch_x.size(0)
        indices = torch.randperm(batch_size, device=batch_x.device)
        # Size of the region to cut
        _, _, h, w = batch_x.shape
        cut_rat = np.sqrt(1.0 - lam)
        cut_w = int(w * cut_rat)
        cut_h = int(h * cut_rat)
        # Random center
        cx = np.random.randint(w)
        cy = np.random.randint(h)
        # Box boundaries, clipped to the image
        bbx1 = int(np.clip(cx - cut_w // 2, 0, w))
        bby1 = int(np.clip(cy - cut_h // 2, 0, h))
        bbx2 = int(np.clip(cx + cut_w // 2, 0, w))
        bby2 = int(np.clip(cy + cut_h // 2, 0, h))
        # Apply CutMix: paste from the SAME permutation that selects y_b
        mixed_x = batch_x.clone()
        mixed_x[:, :, bby1:bby2, bbx1:bbx2] = batch_x[indices][:, :, bby1:bby2, bbx1:bbx2]
        # Recompute lam from the actual (clipped) box area
        lam = 1.0 - ((bbx2 - bbx1) * (bby2 - bby1) / (w * h))
        y_a, y_b = batch_y, batch_y[indices]
        return mixed_x, y_a, y_b, lam


def mixup_criterion(criterion, pred, y_a, y_b, lam):
    """
    Mixed loss for Mixup/CutMix.

    Args:
        criterion: loss function (e.g. CrossEntropyLoss)
        pred: model predictions
        y_a, y_b: labels of the two mixed samples
        lam: mixing coefficient
    """
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)
def demonstrate_augmentation():
    """Demonstrate the augmentation utilities."""
    print("=" * 60)
    print("Data augmentation demo")
    print("=" * 60)
    # Mixup
    print("\n1. Mixup:")
    mixup = Mixup(alpha=0.3)
    # Simulated batch of images and labels
    batch_x = torch.randn(8, 3, 32, 32)
    batch_y = torch.randint(0, 10, (8,))
    mixed_x, y_a, y_b, lam = mixup(batch_x, batch_y)
    print(f"   original batch shape: {batch_x.shape}")
    print(f"   mixed batch shape: {mixed_x.shape}")
    print(f"   mixing coefficient lambda: {lam:.4f}")
    print(f"   labels A: {y_a.tolist()}")
    print(f"   labels B: {y_b.tolist()}")
    # CutMix
    print("\n2. CutMix:")
    cutmix = CutMix(alpha=1.0)
    mixed_x, y_a, y_b, lam = cutmix(batch_x, batch_y)
    print(f"   mixing coefficient lambda (area fraction): {lam:.4f}")
    # RandomAugmentation
    print("\n3. RandomAugmentation:")
    aug = RandomAugmentation(p=1.0)
    # Simulated image tensor with values in [0, 1]
    sample_img = torch.rand(3, 32, 32)
    augmented = aug(sample_img)
    print(f"   original image shape: {sample_img.shape}")
    print(f"   augmented image shape: {augmented.shape}")
    # Change in pixel value range
    print(f"   pixel range: [{sample_img.min():.2f}, {sample_img.max():.2f}] "
          f"-> [{augmented.min():.2f}, {augmented.max():.2f}]")
    return mixup, cutmix


if __name__ == "__main__":
    demonstrate_augmentation()
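8. Putting it together: a comparison experiment
8.1 Experiment design
This section compares the regularization methods in a complete image classification experiment on the CIFAR-10 dataset. The following configurations are evaluated:
- Baseline: a standard network with no regularization
- L2 regularization: weight_decay=0.01
- Dropout: p=0.5
- BatchNorm: BN added after the convolutional layers
- Data augmentation: Mixup and RandomAugmentation
- Combined: L2 + Dropout + BatchNorm + data augmentation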
8.2 完整训练代码
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import torchvision
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy
# 设置随机种子
def set_seed(seed=42):
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
set_seed(42)
# 检查设备
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"使用设备: {device}")
class BaselineCNN(nn.Module):
"""基础CNN模型(无正则化)"""
def __init__(self, num_classes=10):
super(BaselineCNN, self).__init__()
self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
self.pool = nn.MaxPool2d(2, 2)
self.fc1 = nn.Linear(128 * 4 * 4, 256)
self.fc2 = nn.Linear(256, num_classes)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x))) # 32->16
x = self.pool(F.relu(self.conv2(x))) # 16->8
x = self.pool(F.relu(self.conv3(x))) # 8->4
x = x.view(x.size(0), -1)
x = F.relu(self.fc1(x))
x = self.fc2(x)
return x
class RegularizedCNN(nn.Module):
"""带正则化的CNN模型"""
def __init__(self, use_bn=True, use_dropout=True, dropout_rate=0.5, num_classes=10):
super(RegularizedCNN, self).__init__()
self.use_bn = use_bn
self.use_dropout = use_dropout
self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
self.bn1 = nn.BatchNorm2d(32) if use_bn else nn.Identity()
self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
self.bn2 = nn.BatchNorm2d(64) if use_bn else nn.Identity()
self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
self.bn3 = nn.BatchNorm2d(128) if use_bn else nn.Identity()
self.pool = nn.MaxPool2d(2, 2)
self.dropout = nn.Dropout(dropout_rate) if use_dropout else nn.Identity()
self.fc1 = nn.Linear(128 * 4 * 4, 256)
self.bn_fc = nn.BatchNorm1d(256) if use_bn else nn.Identity()
self.fc2 = nn.Linear(256, num_classes)
def forward(self, x):
x = self.pool(F.relu(self.bn1(self.conv1(x))))
x = self.pool(F.relu(self.bn2(self.conv2(x))))
x = self.pool(F.relu(self.bn3(self.conv3(x))))
x = x.view(x.size(0), -1)
x = self.dropout(F.relu(self.bn_fc(self.fc1(x))))
x = self.fc2(x)
return x


def get_cifar10_loaders(batch_size=128, use_augmentation=False):
    """Build the CIFAR-10 training and test data loaders."""
    normalize = transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
    if use_augmentation:
        # Standard augmentation: random crop with padding plus horizontal flip
        train_transform = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize
        ])
    else:
        train_transform = transforms.Compose([
            transforms.ToTensor(),
            normalize
        ])
    # The test set is never augmented
    test_transform = transforms.Compose([
        transforms.ToTensor(),
        normalize
    ])
    train_set = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=train_transform
    )
    test_set = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=test_transform
    )
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=2)
    return train_loader, test_loader


def train_epoch(model, train_loader, optimizer, criterion, device, use_mixup=False, mixup_alpha=0.2):
    """Train for one epoch and return (mean loss, accuracy in %)."""
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0
    mixup = Mixup(alpha=mixup_alpha) if use_mixup else None
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        if mixup is not None:
            # Mix pairs of samples and combine the two losses with weight lam
            inputs, targets_a, targets_b, lam = mixup(inputs, targets)
            outputs = model(inputs)
            loss = mixup_criterion(criterion, outputs, targets_a, targets_b, lam)
        else:
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * inputs.size(0)
        # Note: with Mixup the predictions come from mixed inputs but are compared
        # with the original labels, so training accuracy is only approximate.
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    return total_loss / total, 100. * correct / total


def evaluate(model, test_loader, device):
    """Evaluate the model and return (mean loss, accuracy in %)."""
    model.eval()  # disable Dropout and use BatchNorm running statistics
    total_loss = 0.0
    correct = 0
    total = 0
    criterion = nn.CrossEntropyLoss()
    with torch.no_grad():
        for inputs, targets in test_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            total_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    return total_loss / total, 100. * correct / total


def run_experiment():
    """Run the regularization comparison experiment."""
    print("=" * 60)
    print("Regularization Method Comparison")
    print("=" * 60)
    # Experiment settings
    epochs = 30
    batch_size = 128
    lr = 0.001
    # Configurations to compare
    configs = {
        'Baseline (no regularization)': {
            'use_bn': False,
            'use_dropout': False,
            'weight_decay': 0.0,
            'use_augmentation': False
        },
        'L2 regularization': {
            'use_bn': False,
            'use_dropout': False,
            'weight_decay': 0.01,
            'use_augmentation': False
        },
        'Dropout': {
            'use_bn': False,
            'use_dropout': True,
            'dropout_rate': 0.5,
            'weight_decay': 0.0,
            'use_augmentation': False
        },
        'BatchNorm': {
            'use_bn': True,
            'use_dropout': False,
            'weight_decay': 0.0,
            'use_augmentation': False
        },
        # use_augmentation enables both the crop/flip transforms and Mixup
        'Data augmentation (Mixup)': {
            'use_bn': False,
            'use_dropout': False,
            'weight_decay': 0.0,
            'use_augmentation': True,
            'mixup_alpha': 0.3
        },
        'Full combination': {
            'use_bn': True,
            'use_dropout': True,
            'dropout_rate': 0.3,
            'weight_decay': 0.001,
            'use_augmentation': True,
            'mixup_alpha': 0.2
        }
    }
    results = {}
    for config_name, config in configs.items():
        print(f"\n{'='*50}")
        print(f"Training configuration: {config_name}")
        print(f"{'='*50}")
        # Build the model
        model = RegularizedCNN(
            use_bn=config.get('use_bn', False),
            use_dropout=config.get('use_dropout', False),
            dropout_rate=config.get('dropout_rate', 0.5)
        ).to(device)
        # Optimizer and loss function (Adam's weight_decay acts as an L2 penalty)
        optimizer = optim.Adam(
            model.parameters(),
            lr=lr,
            weight_decay=config.get('weight_decay', 0.0)
        )
        criterion = nn.CrossEntropyLoss()
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
        # Load the data
        train_loader, test_loader = get_cifar10_loaders(
            batch_size=batch_size,
            use_augmentation=config.get('use_augmentation', False)
        )
        # Early stopping
        early_stopping = EarlyStopping(patience=7, min_delta=0.1, verbose=False)
        best_acc = 0.0
        best_model_state = None
        history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
        for epoch in range(1, epochs + 1):
            train_loss, train_acc = train_epoch(
                model, train_loader, optimizer, criterion, device,
                use_mixup=config.get('use_augmentation', False),
                mixup_alpha=config.get('mixup_alpha', 0.2)
            )
            # NOTE: the CIFAR-10 test split doubles as the validation set here,
            # so model selection sees the test data and the reported accuracy
            # is optimistic. Hold out part of the training set for a stricter setup.
            val_loss, val_acc = evaluate(model, test_loader, device)
            scheduler.step()
            history['train_loss'].append(train_loss)
            history['train_acc'].append(train_acc)
            history['val_loss'].append(val_loss)
            history['val_acc'].append(val_acc)
            # Keep a copy of the best weights seen so far
            if val_acc > best_acc:
                best_acc = val_acc
                best_model_state = deepcopy(model.state_dict())
            # Early stopping check
            if early_stopping(val_loss, model, epoch):
                break
            if epoch % 5 == 0 or epoch == epochs:
                print(f"Epoch {epoch:2d}: Train Loss={train_loss:.4f}, "
                      f"Train Acc={train_acc:.2f}%, Val Acc={val_acc:.2f}%")
        # Restore the best model
        if best_model_state is not None:
            model.load_state_dict(best_model_state)
        final_loss, final_acc = evaluate(model, test_loader, device)
        results[config_name] = {
            'best_val_acc': best_acc,
            'final_val_acc': final_acc,
            'final_val_loss': final_loss,
            'history': history
        }
        print(f"\nBest validation accuracy: {best_acc:.2f}%")
        print(f"Final test accuracy: {final_acc:.2f}%")
    # Summarize the results
    print("\n" + "=" * 60)
    print("Summary of Results")
    print("=" * 60)
    print(f"\n{'Configuration':<30} {'Best val acc':<15} {'Final test acc':<15}")
    print("-" * 62)
    for name, result in results.items():
        print(f"{name:<30} {result['best_val_acc']:.2f}%{'':<8} {result['final_val_acc']:.2f}%")
    # Pick the best configuration
    best_config = max(results.items(), key=lambda x: x[1]['best_val_acc'])
    print(f"\nBest configuration: {best_config[0]} (accuracy: {best_config[1]['best_val_acc']:.2f}%)")
    return results


if __name__ == "__main__":
    results = run_experiment()
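
The per-epoch `history` lists make it possible to quantify the overfitting symptoms described in Section 2.3. The following is a minimal sketch, not part of the original experiment: `summarize_history` is a hypothetical helper that assumes the `history` dictionary layout used in `run_experiment`, and the numbers in `toy_history` are made up for illustration rather than measured results.

```python
def summarize_history(history):
    """Summarize one training run (hypothetical helper, not from the original code).

    Assumes `history` has the keys used in run_experiment:
    'train_acc', 'val_acc' and 'val_loss', each a list with one value per epoch.
    """
    val_loss = history['val_loss']
    if not val_loss:
        raise ValueError("history is empty")
    # Epochs are 1-based, matching the training loop.
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    # Gap between training and validation accuracy at the last recorded epoch.
    final_gap = history['train_acc'][-1] - history['val_acc'][-1]
    return {
        'best_epoch': best_epoch,
        'final_gap': final_gap,
        # Validation loss rising after its minimum is the classic overfitting sign.
        'val_loss_rose_after_best': val_loss[-1] > val_loss[best_epoch - 1],
    }


# Toy numbers for illustration only (not experimental results).
toy_history = {
    'train_acc': [60.0, 75.0, 88.0, 96.0],
    'val_acc': [58.0, 68.0, 70.0, 69.0],
    'val_loss': [1.20, 0.95, 0.90, 1.05],
}
summary = summarize_history(toy_history)
print(summary)
```

When Mixup is enabled, training accuracy is measured on mixed inputs, so the gap can be small or even negative and should be read with care.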
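9. Summary of Regularization Methods and Selection Guide
9.1 Comparison of Methods
| Method | Overfitting prevention | Impact on training speed | Implementation complexity | Notes |
|---|---|---|---|---|
| L2 regularization | Moderate | Negligible | Minimal | Tune the `weight_decay` parameter |
| L1 regularization | Moderate | Negligible | Minimal | Produces sparse solutions; suited to feature selection |
| Dropout | Strong | Slightly slower | Simple | Switch between training and evaluation modes |
| BatchNorm | Moderate to strong | Slightly slower | Moderate | Batch size must not be too small |
| Early Stopping | Strong | Saves time | Simple | Tune the `patience` parameter |
| Data augmentation | Very strong | Slightly slower | Moderate | The most fundamental regularization technique |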
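
As a small worked example of the `weight_decay` note in the table: in plain SGD, PyTorch-style weight decay adds `weight_decay * w` to the gradient, which is the gradient of the penalty `(weight_decay / 2) * w**2`. The article's L2 penalty from Section 3.1 is written as `lambda * w**2`, so the two match when `lambda = weight_decay / 2`. The sketch below checks this for a single scalar weight in pure Python; the function names and numeric values are illustrative assumptions.

```python
def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One plain SGD step where the decay term is added to the gradient."""
    return w - lr * (grad + weight_decay * w)


def sgd_step_with_l2_penalty(w, grad, lr, lam):
    """One plain SGD step on task loss + lam * w**2 (penalty gradient is 2 * lam * w)."""
    return w - lr * (grad + 2.0 * lam * w)


w, grad, lr = 2.0, 0.5, 0.1   # illustrative values
weight_decay = 0.01
# The penalty lam * w**2 matches weight_decay when lam = weight_decay / 2.
a = sgd_step_with_weight_decay(w, grad, lr, weight_decay)
b = sgd_step_with_l2_penalty(w, grad, lr, lam=weight_decay / 2.0)
print(a, b)
```

This equivalence holds for plain SGD. With adaptive optimizers such as Adam, adding the decay term to the gradient is not the same as decoupled weight decay (as in AdamW), so the same `weight_decay` value can behave quite differently across optimizers.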
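9.2 Practical Recommendations
-
Small datasets: start with a combination of data augmentation, L2 regularization, and Dropout
-
Large datasets: data augmentation is still the first choice; the strength of the other regularizers can be reduced
-
Deep networks: BatchNorm is almost mandatory, combined with a moderate amount of Dropout
-
Time-constrained settings: use Early Stopping to avoid training longer than necessary
-
Resource-constrained settings: prefer L2 regularization, which has the lowest computational overhead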
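9.3 Principles for Combining Methods
-
Start simple: try a single regularization method first
-
Add gradually: introduce further methods step by step, based on the measured effect
-
Avoid over-regularization: too much regularization leads to underfitting
-
Mind the order of BatchNorm and Dropout: usually apply BN first, then Dropout
Recommended priority of combination strategies:
Priority 1 (must do): data augmentation + Early Stopping
Priority 2 (recommended): choose by network type
- CNN: BatchNorm
- Fully connected/DNN: Dropout
- Feature selection needed: L1
- General case: L2
Priority 3 (optional): add depending on results
- Advanced augmentation such as Mixup/CutMix
- Multi-scale training/testing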
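
The priority list can be read as a small decision procedure. The sketch below is one possible encoding under stated assumptions: `suggest_regularizers`, its parameters, and the string labels are hypothetical, and treating every network type other than `'cnn'` and `'dnn'` as the "general case" is an interpretation rather than something the list specifies.

```python
def suggest_regularizers(network_type, needs_feature_selection=False,
                         add_optional_extras=False):
    """Encode the priority list as a lookup (hypothetical helper).

    network_type: 'cnn', 'dnn' (fully connected), or anything else for the general case.
    """
    # Priority 1: always applied.
    plan = ['data augmentation', 'early stopping']
    # Priority 2: pick by network type and needs.
    if network_type == 'cnn':
        plan.append('batchnorm')
    elif network_type == 'dnn':
        plan.append('dropout')
    else:
        plan.append('l2')
    if needs_feature_selection:
        plan.append('l1')
    # Priority 3: optional extras, added only when results call for them.
    if add_optional_extras:
        plan.extend(['mixup/cutmix', 'multi-scale training/testing'])
    return plan


print(suggest_regularizers('cnn'))
print(suggest_regularizers('dnn', needs_feature_selection=True))
```

The returned list is only a starting point; each added method should still be validated against a held-out set, following the "add gradually" principle above.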
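10. Conclusion
Regularization is an indispensable part of training deep learning models. This article covered the nature of overfitting and the principles and implementation of the mainstream regularization methods: L1/L2 regularization, Dropout, Batch Normalization, Early Stopping, and data augmentation.
In practice, data augmentation is the most fundamental defense against overfitting, because it improves generalization at the data level. The other methods constrain model complexity at the model level, and combining the two approaches usually gives the best results.
The choice of regularization method should take dataset size, network architecture, and available compute into account. Remember: the goal of regularization is good performance on unseen data, not a perfect fit to the training set.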