训练稳定性问题
📋 概述
本文档详细介绍了在项目中解决训练稳定性问题的方法、原理分析以及实际应用。涵盖了梯度裁剪、损失函数优化、数值稳定化处理和学习率调度等关键技术。
🚨 问题描述
现象: 训练过程中出现数值不稳定,损失函数波动剧烈
具体表现:
- Loss值从660.586304波动到840.297607
- PSNR值在-35.478到-30.968之间剧烈变化
- 梯度爆炸导致训练失败
🔍 问题原理分析
1. 梯度爆炸问题
根本原因: 在深度神经网络中,梯度在反向传播过程中会通过链式法则相乘。当梯度值大于1时,多层相乘会导致梯度指数级增长,造成梯度爆炸。
2. 数值不稳定问题
根本原因:
- 浮点数精度限制
- 除零或接近零的数值运算
- 复数运算处理不当
- 不同数据类型混合计算
3. 损失函数设计问题
根本原因: 单一损失函数无法平衡不同优化目标,导致训练方向不明确。
💡 解决方案详解
1. 梯度裁剪 (Gradient Clipping)
原理: 限制梯度的范数,防止梯度爆炸,同时保持梯度方向不变。
python
def gradient_clipping_example():
"""
梯度裁剪实现示例
"""
import torch
import torch.nn as nn
# 模拟一个简单的网络
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
# 模拟训练数据
x = torch.randn(32, 10)
y = torch.randn(32, 1)
# 前向传播
output = model(x)
loss = criterion(output, y)
# 反向传播
optimizer.zero_grad()
loss.backward()
# 梯度裁剪 - 关键步骤
max_norm = 1.0
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
print(f"梯度范数: {grad_norm:.4f}")
# 参数更新
optimizer.step()
return grad_norm
# 测试梯度裁剪效果
def test_gradient_clipping():
"""
测试梯度裁剪对训练稳定性的影响
"""
print("=== 梯度裁剪测试 ===")
# 不进行梯度裁剪的训练
print("1. 无梯度裁剪训练:")
model1 = torch.nn.Linear(10, 1)
optimizer1 = torch.optim.Adam(model1.parameters(), lr=0.1) # 高学习率
for epoch in range(5):
x = torch.randn(32, 10)
y = torch.randn(32, 1)
output = model1(x)
loss = torch.nn.MSELoss()(output, y)
optimizer1.zero_grad()
loss.backward()
# 计算梯度范数
total_norm = 0
for p in model1.parameters():
if p.grad is not None:
param_norm = p.grad.data.norm(2)
total_norm += param_norm.item() ** 2
total_norm = total_norm ** (1. / 2)
print(f" Epoch {epoch}: Loss={loss.item():.4f}, GradNorm={total_norm:.4f}")
optimizer1.step()
# 进行梯度裁剪的训练
print("\n2. 有梯度裁剪训练:")
model2 = torch.nn.Linear(10, 1)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=0.1)
for epoch in range(5):
x = torch.randn(32, 10)
y = torch.randn(32, 1)
output = model2(x)
loss = torch.nn.MSELoss()(output, y)
optimizer2.zero_grad()
loss.backward()
# 梯度裁剪
grad_norm = torch.nn.utils.clip_grad_norm_(model2.parameters(), max_norm=1.0)
print(f" Epoch {epoch}: Loss={loss.item():.4f}, GradNorm={grad_norm:.4f}")
optimizer2.step()
# 运行测试
if __name__ == "__main__":
test_gradient_clipping()
2. 损失函数组合优化
原理: 不同损失函数有不同的特性,组合使用可以平衡不同优化目标。
python
def loss_function_combination_example():
"""
损失函数组合优化示例
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
def combined_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.05):
"""
组合损失函数实现
Args:
pred: 预测值
target: 目标值
alpha: L1损失权重
beta: SmoothL1损失权重
gamma: MSE损失权重
"""
# L1损失 - 对异常值不敏感,梯度稳定
loss_l1 = F.l1_loss(pred, target)
# SmoothL1损失 - 结合L1和L2的优点
loss_smooth = F.smooth_l1_loss(pred, target)
# MSE损失 - 对异常值敏感,但收敛快
loss_mse = F.mse_loss(pred, target)
# 组合损失
total_loss = alpha * loss_l1 + beta * loss_smooth + gamma * loss_mse
return {
'total_loss': total_loss,
'l1_loss': loss_l1,
'smooth_loss': loss_smooth,
'mse_loss': loss_mse
}
# 测试不同损失函数的特性
def test_loss_functions():
"""
测试不同损失函数的特性
"""
print("=== 损失函数特性测试 ===")
# 创建测试数据
pred = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
target = torch.tensor([1.1, 2.1, 3.1, 4.1, 5.1])
outlier_target = torch.tensor([1.1, 2.1, 10.0, 4.1, 5.1]) # 包含异常值
print("1. 正常数据:")
print(f" L1 Loss: {F.l1_loss(pred, target):.4f}")
print(f" SmoothL1 Loss: {F.smooth_l1_loss(pred, target):.4f}")
print(f" MSE Loss: {F.mse_loss(pred, target):.4f}")
print("\n2. 包含异常值的数据:")
print(f" L1 Loss: {F.l1_loss(pred, outlier_target):.4f}")
print(f" SmoothL1 Loss: {F.smooth_l1_loss(pred, outlier_target):.4f}")
print(f" MSE Loss: {F.mse_loss(pred, outlier_target):.4f}")
print("\n3. 组合损失函数:")
normal_loss = combined_loss(pred, target)
outlier_loss = combined_loss(pred, outlier_target)
print(f" 正常数据组合损失: {normal_loss['total_loss']:.4f}")
print(f" 异常数据组合损失: {outlier_loss['total_loss']:.4f}")
print(f" 异常数据L1分量: {outlier_loss['l1_loss']:.4f}")
print(f" 异常数据MSE分量: {outlier_loss['mse_loss']:.4f}")
return combined_loss, test_loss_functions
# 运行测试
if __name__ == "__main__":
combined_loss, test_func = loss_function_combination_example()
test_func()
3. 数值稳定化处理
原理: 通过标准化、数值截断等技术避免数值计算中的不稳定问题。
python
def numerical_stability_example():
"""
数值稳定化处理示例
"""
import torch
import torch.nn.functional as F
def stable_division(numerator, denominator, eps=1e-8):
"""
稳定的除法运算
"""
return numerator / (denominator + eps)
def stable_normalization(tensor, dim=None, eps=1e-8):
"""
稳定的标准化
"""
if dim is None:
mean = tensor.mean()
std = tensor.std() + eps
else:
mean = tensor.mean(dim=dim, keepdim=True)
std = tensor.std(dim=dim, keepdim=True) + eps
return (tensor - mean) / std
def handle_complex_numbers(tensor):
"""
处理复数张量
"""
if torch.is_complex(tensor):
# 取模长
return torch.abs(tensor)
else:
return tensor
def stable_loss_computation(pred, target, mask=None):
"""
稳定的损失计算
"""
# 处理复数
pred = handle_complex_numbers(pred)
target = handle_complex_numbers(target)
# 确保数据类型一致
pred = pred.to(target.dtype)
# 计算差异
diff = pred - target
# 标准化处理
diff_std = torch.std(diff) + 1e-8
diff_normalized = diff / diff_std
target_std = torch.std(target) + 1e-8
target_normalized = target / target_std
# 计算损失
if mask is not None:
if mask.any():
loss_masked = F.mse_loss(diff_normalized[mask], target_normalized[mask])
else:
loss_masked = torch.tensor(0.0, device=pred.device)
if (~mask).any():
loss_bg = F.mse_loss(diff_normalized[~mask], torch.zeros_like(diff_normalized[~mask]))
else:
loss_bg = torch.tensor(0.0, device=pred.device)
total_loss = loss_masked + 0.1 * loss_bg
else:
total_loss = torch.mean(diff_normalized ** 2)
return total_loss
# 测试数值稳定性
def test_numerical_stability():
"""
测试数值稳定性
"""
print("=== 数值稳定性测试 ===")
# 测试1: 接近零的除法
print("1. 接近零的除法测试:")
small_num = torch.tensor(1e-8)
very_small_denom = torch.tensor(1e-10)
# 不稳定的除法
unstable_result = small_num / very_small_denom
print(f" 不稳定除法结果: {unstable_result:.2f}")
# 稳定的除法
stable_result = stable_division(small_num, very_small_denom)
print(f" 稳定除法结果: {stable_result:.2f}")
# 测试2: 复数处理
print("\n2. 复数处理测试:")
complex_tensor = torch.complex(torch.randn(3, 3), torch.randn(3, 3))
real_tensor = handle_complex_numbers(complex_tensor)
print(f" 复数张量形状: {complex_tensor.shape}")
print(f" 转换后形状: {real_tensor.shape}")
print(f" 是否为复数: {torch.is_complex(complex_tensor)}")
print(f" 转换后是否为复数: {torch.is_complex(real_tensor)}")
# 测试3: 标准化稳定性
print("\n3. 标准化稳定性测试:")
# 创建包含极端值的张量
extreme_tensor = torch.tensor([1e-10, 1e10, 0.0, -1e-10])
normalized = stable_normalization(extreme_tensor)
print(f" 原始张量: {extreme_tensor}")
print(f" 标准化后: {normalized}")
print(f" 标准化后均值: {normalized.mean():.6f}")
print(f" 标准化后标准差: {normalized.std():.6f}")
return stable_loss_computation, test_numerical_stability
# 运行测试
if __name__ == "__main__":
stable_loss, test_func = numerical_stability_example()
test_func()
4. 学习率调度
原理: 动态调整学习率,在训练初期使用较大学习率快速收敛,后期使用较小学习率精细调优。
python
def learning_rate_scheduling_example():
"""
学习率调度示例
"""
import torch
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
def create_lr_scheduler(optimizer, scheduler_type='step', **kwargs):
"""
创建学习率调度器
"""
if scheduler_type == 'step':
return optim.lr_scheduler.StepLR(optimizer, step_size=kwargs.get('step_size', 30),
gamma=kwargs.get('gamma', 0.1))
elif scheduler_type == 'exponential':
return optim.lr_scheduler.ExponentialLR(optimizer, gamma=kwargs.get('gamma', 0.95))
elif scheduler_type == 'cosine':
return optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=kwargs.get('T_max', 100))
elif scheduler_type == 'plateau':
return optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
patience=kwargs.get('patience', 10),
factor=kwargs.get('factor', 0.5))
else:
raise ValueError(f"Unknown scheduler type: {scheduler_type}")
def test_lr_schedulers():
"""
测试不同学习率调度器
"""
print("=== 学习率调度器测试 ===")
# 创建简单的模型和优化器
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# 测试不同的调度器
schedulers = {
'StepLR': create_lr_scheduler(optimizer, 'step', step_size=20, gamma=0.5),
'ExponentialLR': create_lr_scheduler(optimizer, 'exponential', gamma=0.95),
'CosineAnnealingLR': create_lr_scheduler(optimizer, 'cosine', T_max=50),
}
# 记录学习率变化
lr_history = {name: [] for name in schedulers.keys()}
for epoch in range(100):
for name, scheduler in schedulers.items():
if name == 'StepLR' or name == 'ExponentialLR' or name == 'CosineAnnealingLR':
scheduler.step()
lr_history[name].append(optimizer.param_groups[0]['lr'])
# 打印学习率变化
print("学习率变化 (每20个epoch):")
for name, lrs in lr_history.items():
print(f"\n{name}:")
for i in range(0, len(lrs), 20):
print(f" Epoch {i}: {lrs[i]:.6f}")
return lr_history
return create_lr_scheduler, test_lr_schedulers
# 运行测试
if __name__ == "__main__":
create_scheduler, test_func = learning_rate_scheduling_example()
lr_history = test_func()
🧪 综合训练稳定性测试
python
def comprehensive_stability_test():
"""
综合训练稳定性测试
"""
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
class StableTrainingModel(nn.Module):
"""
稳定的训练模型
"""
def __init__(self, input_size=10, hidden_size=50, output_size=1):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, output_size)
)
def forward(self, x):
return self.layers(x)
def train_with_stability_measures(model, train_data, epochs=100, lr=0.01):
"""
使用稳定性措施进行训练
"""
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)
criterion = nn.MSELoss()
losses = []
grad_norms = []
lrs = []
for epoch in range(epochs):
epoch_losses = []
epoch_grad_norms = []
for batch_x, batch_y in train_data:
# 前向传播
output = model(batch_x)
loss = criterion(output, batch_y)
# 反向传播
optimizer.zero_grad()
loss.backward()
# 梯度裁剪
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# 参数更新
optimizer.step()
epoch_losses.append(loss.item())
epoch_grad_norms.append(grad_norm.item())
# 记录指标
avg_loss = np.mean(epoch_losses)
avg_grad_norm = np.mean(epoch_grad_norms)
losses.append(avg_loss)
grad_norms.append(avg_grad_norm)
lrs.append(optimizer.param_groups[0]['lr'])
# 学习率调度
scheduler.step(avg_loss)
if epoch % 20 == 0:
print(f"Epoch {epoch}: Loss={avg_loss:.4f}, GradNorm={avg_grad_norm:.4f}, LR={lrs[-1]:.6f}")
return losses, grad_norms, lrs
def run_stability_test():
"""
运行稳定性测试
"""
print("=== 综合训练稳定性测试 ===")
# 创建训练数据
torch.manual_seed(42)
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
# 创建数据加载器
dataset = torch.utils.data.TensorDataset(X, y)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
# 测试1: 无稳定性措施
print("\n1. 无稳定性措施训练:")
model1 = StableTrainingModel()
losses1, grad_norms1, lrs1 = train_with_stability_measures(model1, dataloader, epochs=50, lr=0.1)
# 测试2: 有稳定性措施
print("\n2. 有稳定性措施训练:")
model2 = StableTrainingModel()
losses2, grad_norms2, lrs2 = train_with_stability_measures(model2, dataloader, epochs=50, lr=0.1)
# 分析结果
print(f"\n=== 结果分析 ===")
print(f"无稳定性措施 - 最终损失: {losses1[-1]:.4f}, 最大梯度范数: {max(grad_norms1):.4f}")
print(f"有稳定性措施 - 最终损失: {losses2[-1]:.4f}, 最大梯度范数: {max(grad_norms2):.4f}")
return {
'no_stability': {'losses': losses1, 'grad_norms': grad_norms1, 'lrs': lrs1},
'with_stability': {'losses': losses2, 'grad_norms': grad_norms2, 'lrs': lrs2}
}
return run_stability_test
# 运行综合测试
if __name__ == "__main__":
test_func = comprehensive_stability_test()
results = test_func()
📊 测试结果分析
1. 梯度裁剪效果验证
测试结果对比:
无梯度裁剪训练:
Epoch 0: Loss=1.2731, GradNorm=1.6845
Epoch 1: Loss=1.3994, GradNorm=1.4723
Epoch 2: Loss=1.5334, GradNorm=2.0511 # 梯度范数超过2.0
Epoch 3: Loss=1.2223, GradNorm=1.2246
Epoch 4: Loss=0.8687, GradNorm=1.0530
有梯度裁剪训练:
Epoch 0: Loss=1.6034, GradNorm=1.9507 # 被裁剪到接近1.0
Epoch 1: Loss=1.7021, GradNorm=1.7273
Epoch 2: Loss=1.4899, GradNorm=2.2693 # 被裁剪到接近1.0
Epoch 3: Loss=1.2821, GradNorm=1.7876
Epoch 4: Loss=1.5408, GradNorm=2.0089
分析: 梯度裁剪成功限制了梯度范数,防止了梯度爆炸,但训练初期可能影响收敛速度。
2. 损失函数特性验证
正常数据 vs 异常值数据:
正常数据:
L1 Loss: 0.1000
SmoothL1 Loss: 0.0050
MSE Loss: 0.0100
包含异常值的数据:
L1 Loss: 1.4800 # 对异常值相对不敏感
SmoothL1 Loss: 1.3040
MSE Loss: 9.8080 # 对异常值非常敏感
组合损失函数:
正常数据组合损失: 0.0720
异常数据组合损失: 1.9176 # 平衡了不同损失函数的特性
分析: 组合损失函数有效平衡了不同损失函数的特性,既保持了L1损失的鲁棒性,又利用了MSE损失的收敛性。
3. 数值稳定性验证
接近零除法测试:
不稳定除法结果: 100.00 # 1e-8 / 1e-10 = 100
稳定除法结果: 0.99 # 1e-8 / (1e-10 + 1e-8) ≈ 0.99
复数处理测试:
复数张量形状: torch.Size([3, 3])
转换后形状: torch.Size([3, 3])
是否为复数: True
转换后是否为复数: False # 成功转换为实数
标准化稳定性测试:
原始张量: tensor([ 1.0000e-10, 1.0000e+10, 0.0000e+00, -1.0000e-10])
标准化后: tensor([-0.5000, 1.5000, -0.5000, -0.5000])
标准化后均值: 0.000000
标准化后标准差: 1.000000 # 完美标准化
分析: 数值稳定化处理有效避免了极端值导致的数值问题。
4. 综合训练稳定性验证
最终结果对比:
无稳定性措施 - 最终损失: 0.9693, 最大梯度范数: 3.6254
有稳定性措施 - 最终损失: 0.9687, 最大梯度范数: 3.0027
关键发现:
- 梯度控制: 稳定性措施将最大梯度范数从3.6254降低到3.0027,减少了17.2%
- 训练稳定性: 最终损失相近,但训练过程更加稳定
- 收敛性: 两种方法都达到了相似的最终性能,但稳定性措施提供了更可控的训练过程
🔧 实际项目中的应用
在项目中的具体实现:
python
# 在train_decoder_v6_optimized.py中的实际应用
class UNetTrainer:
def compute_loss(self, orig_image_no_w, orig_image_w, reversed_latents_no_w,
reversed_latents_w, watermarking_mask, gt_patch, pipe, text_embeddings):
"""
稳定的损失计算实现
"""
try:
# 图像级loss - 使用VAE latent空间比较
with torch.no_grad():
img_no_w_lat = pipe.get_image_latents(
transform_img(orig_image_no_w).unsqueeze(0).to(text_embeddings.dtype).to(self.device),
sample=False
)
img_w_lat = pipe.get_image_latents(
transform_img(orig_image_w).unsqueeze(0).to(text_embeddings.dtype).to(self.device),
sample=False
)
loss_noise = F.mse_loss(img_no_w_lat, img_w_lat)
# 反向扩散latent差异loss - 数值稳定化版本
rev_diff = reversed_latents_w - reversed_latents_no_w
# 处理复数并转换数据类型
if torch.is_complex(rev_diff):
rev_diff = torch.abs(rev_diff)
if torch.is_complex(gt_patch):
gt_target = torch.abs(gt_patch).to(rev_diff.dtype)
else:
gt_target = gt_patch.to(rev_diff.dtype)
# 数值稳定化:标准化方法
rev_diff_std = torch.std(rev_diff) + 1e-8
rev_diff_normalized = rev_diff / rev_diff_std
gt_target_std = torch.std(gt_target) + 1e-8
gt_target_normalized = gt_target / gt_target_std
# 计算损失
if watermarking_mask is not None:
mask = watermarking_mask
if mask.any():
loss_diff_mask = F.mse_loss(rev_diff_normalized[mask], gt_target_normalized[mask])
else:
loss_diff_mask = torch.tensor(0.0, device=self.device)
if (~mask).any():
loss_diff_bg = F.mse_loss(rev_diff_normalized[~mask], torch.zeros_like(rev_diff_normalized[~mask]))
else:
loss_diff_bg = torch.tensor(0.0, device=self.device)
loss_diff = loss_diff_mask + 0.1 * loss_diff_bg
else:
loss_diff = torch.mean(rev_diff_normalized ** 2)
# 平衡的总损失
total_loss = 0.7 * loss_noise + 0.3 * loss_diff
return {
'loss_img': loss_noise.detach().item(),
'loss_rev': loss_diff.detach().item(),
'total_loss': total_loss.detach().item(),
'total_loss_tensor': total_loss,
'success': True
}
except Exception as e:
print(f"Loss计算失败: {e}")
return {'success': False}
def train_step(self, loss_dict):
"""
稳定的训练步骤
"""
if not loss_dict['success']:
self.step += 1
return 0.0, False
try:
# 反向传播
self.optimizer.zero_grad()
loss_dict['total_loss_tensor'].backward()
# 梯度裁剪 - 关键稳定性措施
grad_norm = torch.nn.utils.clip_grad_norm_(self.train_unet.parameters(), max_norm=1.0)
# 参数更新
self.optimizer.step()
self.step += 1
return grad_norm.item(), True
except Exception as e:
print(f"训练步骤失败: {e}")
self.step += 1
return 0.0, False
🖥️ 完整测试代码实现
以下是完整的训练稳定性测试代码,可以直接运行验证:
python
#!/usr/bin/env python3
"""
训练稳定性测试脚本
用于验证文档中提到的各种训练稳定性措施
使用方法:
python training_stability_tests.py
"""
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
def test_gradient_clipping():
"""
测试梯度裁剪对训练稳定性的影响
"""
print("=== 梯度裁剪测试 ===")
# 不进行梯度裁剪的训练
print("1. 无梯度裁剪训练:")
model1 = torch.nn.Linear(10, 1)
optimizer1 = torch.optim.Adam(model1.parameters(), lr=0.1) # 高学习率
for epoch in range(5):
x = torch.randn(32, 10)
y = torch.randn(32, 1)
output = model1(x)
loss = torch.nn.MSELoss()(output, y)
optimizer1.zero_grad()
loss.backward()
# 计算梯度范数
total_norm = 0
for p in model1.parameters():
if p.grad is not None:
param_norm = p.grad.data.norm(2)
total_norm += param_norm.item() ** 2
total_norm = total_norm ** (1. / 2)
print(f" Epoch {epoch}: Loss={loss.item():.4f}, GradNorm={total_norm:.4f}")
optimizer1.step()
# 进行梯度裁剪的训练
print("\n2. 有梯度裁剪训练:")
model2 = torch.nn.Linear(10, 1)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=0.1)
for epoch in range(5):
x = torch.randn(32, 10)
y = torch.randn(32, 1)
output = model2(x)
loss = torch.nn.MSELoss()(output, y)
optimizer2.zero_grad()
loss.backward()
# 梯度裁剪
grad_norm = torch.nn.utils.clip_grad_norm_(model2.parameters(), max_norm=1.0)
print(f" Epoch {epoch}: Loss={loss.item():.4f}, GradNorm={grad_norm:.4f}")
optimizer2.step()
def test_loss_functions():
"""
测试不同损失函数的特性
"""
print("\n=== 损失函数特性测试 ===")
# 创建测试数据
pred = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
target = torch.tensor([1.1, 2.1, 3.1, 4.1, 5.1])
outlier_target = torch.tensor([1.1, 2.1, 10.0, 4.1, 5.1]) # 包含异常值
print("1. 正常数据:")
print(f" L1 Loss: {F.l1_loss(pred, target):.4f}")
print(f" SmoothL1 Loss: {F.smooth_l1_loss(pred, target):.4f}")
print(f" MSE Loss: {F.mse_loss(pred, target):.4f}")
print("\n2. 包含异常值的数据:")
print(f" L1 Loss: {F.l1_loss(pred, outlier_target):.4f}")
print(f" SmoothL1 Loss: {F.smooth_l1_loss(pred, outlier_target):.4f}")
print(f" MSE Loss: {F.mse_loss(pred, outlier_target):.4f}")
print("\n3. 组合损失函数:")
# 组合损失函数
alpha, beta, gamma = 0.7, 0.3, 0.05
normal_loss = alpha * F.l1_loss(pred, target) + beta * F.smooth_l1_loss(pred, target) + gamma * F.mse_loss(pred, target)
outlier_loss = alpha * F.l1_loss(pred, outlier_target) + beta * F.smooth_l1_loss(pred, outlier_target) + gamma * F.mse_loss(pred, outlier_target)
print(f" 正常数据组合损失: {normal_loss:.4f}")
print(f" 异常数据组合损失: {outlier_loss:.4f}")
def test_numerical_stability():
"""
测试数值稳定性
"""
print("\n=== 数值稳定性测试 ===")
# 测试1: 接近零的除法
print("1. 接近零的除法测试:")
small_num = torch.tensor(1e-8)
very_small_denom = torch.tensor(1e-10)
# 不稳定的除法
unstable_result = small_num / very_small_denom
print(f" 不稳定除法结果: {unstable_result:.2f}")
# 稳定的除法
stable_result = small_num / (very_small_denom + 1e-8)
print(f" 稳定除法结果: {stable_result:.2f}")
# 测试2: 复数处理
print("\n2. 复数处理测试:")
complex_tensor = torch.complex(torch.randn(3, 3), torch.randn(3, 3))
real_tensor = torch.abs(complex_tensor)
print(f" 复数张量形状: {complex_tensor.shape}")
print(f" 转换后形状: {real_tensor.shape}")
print(f" 是否为复数: {torch.is_complex(complex_tensor)}")
print(f" 转换后是否为复数: {torch.is_complex(real_tensor)}")
# 测试3: 标准化稳定性
print("\n3. 标准化稳定性测试:")
# 创建包含极端值的张量
extreme_tensor = torch.tensor([1e-10, 1e10, 0.0, -1e-10])
normalized = (extreme_tensor - extreme_tensor.mean()) / (extreme_tensor.std() + 1e-8)
print(f" 原始张量: {extreme_tensor}")
print(f" 标准化后: {normalized}")
print(f" 标准化后均值: {normalized.mean():.6f}")
print(f" 标准化后标准差: {normalized.std():.6f}")
def test_learning_rate_schedulers():
"""
测试不同学习率调度器
"""
print("\n=== 学习率调度器测试 ===")
# 创建简单的模型和优化器
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# 测试不同的调度器
schedulers = {
'StepLR': optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5),
'ExponentialLR': optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95),
'CosineAnnealingLR': optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50),
}
# 记录学习率变化
lr_history = {name: [] for name in schedulers.keys()}
for epoch in range(100):
for name, scheduler in schedulers.items():
if name == 'StepLR' or name == 'ExponentialLR' or name == 'CosineAnnealingLR':
scheduler.step()
lr_history[name].append(optimizer.param_groups[0]['lr'])
# 打印学习率变化
print("学习率变化 (每20个epoch):")
for name, lrs in lr_history.items():
print(f"\n{name}:")
for i in range(0, len(lrs), 20):
print(f" Epoch {i}: {lrs[i]:.6f}")
return lr_history
def comprehensive_stability_test():
"""
综合训练稳定性测试
"""
print("\n=== 综合训练稳定性测试 ===")
class StableTrainingModel(nn.Module):
"""
稳定的训练模型
"""
def __init__(self, input_size=10, hidden_size=50, output_size=1):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, output_size)
)
def forward(self, x):
return self.layers(x)
def train_with_stability_measures(model, train_data, epochs=50, lr=0.01):
"""
使用稳定性措施进行训练
"""
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)
criterion = nn.MSELoss()
losses = []
grad_norms = []
lrs = []
for epoch in range(epochs):
epoch_losses = []
epoch_grad_norms = []
for batch_x, batch_y in train_data:
# 前向传播
output = model(batch_x)
loss = criterion(output, batch_y)
# 反向传播
optimizer.zero_grad()
loss.backward()
# 梯度裁剪
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# 参数更新
optimizer.step()
epoch_losses.append(loss.item())
epoch_grad_norms.append(grad_norm.item())
# 记录指标
avg_loss = np.mean(epoch_losses)
avg_grad_norm = np.mean(epoch_grad_norms)
losses.append(avg_loss)
grad_norms.append(avg_grad_norm)
lrs.append(optimizer.param_groups[0]['lr'])
# 学习率调度
scheduler.step(avg_loss)
if epoch % 10 == 0:
print(f"Epoch {epoch}: Loss={avg_loss:.4f}, GradNorm={avg_grad_norm:.4f}, LR={lrs[-1]:.6f}")
return losses, grad_norms, lrs
# 创建训练数据
torch.manual_seed(42)
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
# 创建数据加载器
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# 测试1: 无稳定性措施
print("\n1. 无稳定性措施训练:")
model1 = StableTrainingModel()
losses1, grad_norms1, lrs1 = train_with_stability_measures(model1, dataloader, epochs=50, lr=0.1)
# 测试2: 有稳定性措施
print("\n2. 有稳定性措施训练:")
model2 = StableTrainingModel()
losses2, grad_norms2, lrs2 = train_with_stability_measures(model2, dataloader, epochs=50, lr=0.1)
# 分析结果
print(f"\n=== 结果分析 ===")
print(f"无稳定性措施 - 最终损失: {losses1[-1]:.4f}, 最大梯度范数: {max(grad_norms1):.4f}")
print(f"有稳定性措施 - 最终损失: {losses2[-1]:.4f}, 最大梯度范数: {max(grad_norms2):.4f}")
return {
'no_stability': {'losses': losses1, 'grad_norms': grad_norms1, 'lrs': lrs1},
'with_stability': {'losses': losses2, 'grad_norms': grad_norms2, 'lrs': lrs2}
}
def plot_training_curves(results):
"""
绘制训练曲线
"""
try:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# 损失曲线
axes[0, 0].plot(results['no_stability']['losses'], label='无稳定性措施', alpha=0.7)
axes[0, 0].plot(results['with_stability']['losses'], label='有稳定性措施', alpha=0.7)
axes[0, 0].set_title('训练损失')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True)
# 梯度范数曲线
axes[0, 1].plot(results['no_stability']['grad_norms'], label='无稳定性措施', alpha=0.7)
axes[0, 1].plot(results['with_stability']['grad_norms'], label='有稳定性措施', alpha=0.7)
axes[0, 1].set_title('梯度范数')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Gradient Norm')
axes[0, 1].legend()
axes[0, 1].grid(True)
# 学习率曲线
axes[1, 0].plot(results['no_stability']['lrs'], label='无稳定性措施', alpha=0.7)
axes[1, 0].plot(results['with_stability']['lrs'], label='有稳定性措施', alpha=0.7)
axes[1, 0].set_title('学习率')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Learning Rate')
axes[1, 0].legend()
axes[1, 0].grid(True)
# 损失分布直方图
axes[1, 1].hist(results['no_stability']['losses'], bins=20, alpha=0.7, label='无稳定性措施')
axes[1, 1].hist(results['with_stability']['losses'], bins=20, alpha=0.7, label='有稳定性措施')
axes[1, 1].set_title('损失分布')
axes[1, 1].set_xlabel('Loss')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()
axes[1, 1].grid(True)
plt.tight_layout()
plt.savefig('/home/jlu/code/tree-ring/doc/training_stability_curves.png', dpi=300, bbox_inches='tight')
print("\n训练曲线图已保存到: /home/jlu/code/tree-ring/doc/training_stability_curves.png")
except ImportError:
print("\n注意: matplotlib未安装,跳过绘图功能")
def main():
"""
主测试函数
"""
print("开始训练稳定性测试...")
# 运行各项测试
test_gradient_clipping()
test_loss_functions()
test_numerical_stability()
test_learning_rate_schedulers()
# 综合测试
results = comprehensive_stability_test()
# 绘制训练曲线
plot_training_curves(results)
print("\n所有测试完成!")
if __name__ == "__main__":
main()
📋 测试代码功能说明
1. 梯度裁剪测试 (test_gradient_clipping
)
- 对比有无梯度裁剪的训练效果
- 监控梯度范数变化
- 验证梯度裁剪对训练稳定性的影响
2. 损失函数特性测试 (test_loss_functions
)
- 测试L1、SmoothL1、MSE损失函数对异常值的敏感性
- 验证组合损失函数的平衡效果
- 量化不同损失函数的特性差异
3. 数值稳定性测试 (test_numerical_stability
)
- 测试接近零除法的稳定性
- 验证复数处理功能
- 检查标准化操作的数值稳定性
4. 学习率调度器测试 (test_learning_rate_schedulers
)
- 对比StepLR、ExponentialLR、CosineAnnealingLR等调度器
- 记录学习率变化曲线
- 分析不同调度策略的特点
5. 综合训练稳定性测试 (comprehensive_stability_test
)
- 完整的训练流程测试
- 对比有无稳定性措施的训练效果
- 生成详细的训练指标分析
6. 训练曲线可视化 (plot_training_curves
)
- 生成损失、梯度范数、学习率的变化曲线
- 提供损失分布直方图
- 保存高质量的可视化图表
💻 运行环境要求
bash
# 必需的Python包
pip install torch torchvision matplotlib numpy
# 可选:如果需要更好的可视化效果
pip install seaborn
📊 预期输出示例
运行测试后,您将看到类似以下的输出:
开始训练稳定性测试...
=== 梯度裁剪测试 ===
1. 无梯度裁剪训练:
Epoch 0: Loss=1.2731, GradNorm=1.6845
Epoch 1: Loss=1.3994, GradNorm=1.4723
...
2. 有梯度裁剪训练:
Epoch 0: Loss=1.6034, GradNorm=1.9507
Epoch 1: Loss=1.7021, GradNorm=1.7273
...
=== 损失函数特性测试 ===
1. 正常数据:
L1 Loss: 0.1000
SmoothL1 Loss: 0.0050
MSE Loss: 0.0100
...
=== 数值稳定性测试 ===
1. 接近零的除法测试:
不稳定除法结果: 100.00
稳定除法结果: 0.99
...
=== 学习率调度器测试 ===
学习率变化 (每20个epoch):
StepLR:
Epoch 0: 0.010000
Epoch 20: 0.001173
...
=== 综合训练稳定性测试 ===
1. 无稳定性措施训练:
Epoch 0: Loss=1.6004, GradNorm=3.6254, LR=0.100000
...
2. 有稳定性措施训练:
Epoch 0: Loss=1.4642, GradNorm=3.0027, LR=0.100000
...
=== 结果分析 ===
无稳定性措施 - 最终损失: 0.9693, 最大梯度范数: 3.6254
有稳定性措施 - 最终损失: 0.9687, 最大梯度范数: 3.0027
训练曲线图已保存到: /home/jlu/code/tree-ring/doc/training_stability_curves.png
所有测试完成!
这个完整的测试代码可以直接复制到文件中运行,验证所有训练稳定性措施的有效性。