Reproducing Classic Networks and Transfer Learning
1. ResNet: Implementation and Principles
1.1 Mathematics of the Residual Block
For a conventional network that directly learns an output mapping $H(x)$, ResNet introduces a skip connection:

$$H(x) = F(x) + x$$

where $F(x)$ is the residual mapping. When the dimensions change between input and output, the identity path is replaced by a projection:

$$H(x) = F(x) + W_s x$$

where $W_s$ is a 1x1 convolution that adjusts the dimensions.
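A one-line gradient argument shows why the skip connection helps with depth: differentiating the residual form gives

$$\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + I$$

so the identity term $I$ guarantees a direct gradient path through every block, avoiding the vanishing gradients that plague very deep plain networks.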
1.1.1 Residual Block Structure

(Figure omitted: input $x$ flows through conv3x3 → BN → ReLU → conv3x3 → BN, and the shortcut adds $x$ back before the final ReLU, matching the implementation below.)
1.2 A Complete ResNet-18 Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F  # used for adaptive_avg_pool2d in forward()


class BasicBlock(nn.Module):
    """Two 3x3 convolutions with an identity (or 1x1-projected) shortcut."""
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        # Identity shortcut by default; switch to a 1x1 conv (the W_s above)
        # when the spatial size or channel count changes.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion * planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion * planes,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion * planes)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # H(x) = F(x) + x
        return torch.relu(out)


class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super().__init__()
        self.in_planes = 64
        # 3x3 stem with stride 1: the CIFAR-style variant. The ImageNet
        # version instead uses a 7x7/stride-2 conv followed by max pooling.
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        self.linear = nn.Linear(512 * block.expansion, num_classes)

    def _make_layer(self, block, planes, num_blocks, stride):
        # Only the first block of each stage downsamples.
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = out.view(out.size(0), -1)
        return self.linear(out)


def ResNet18():
    return ResNet(BasicBlock, [2, 2, 2, 2])
```
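A quick sanity check (a minimal sketch; the 32x32 input matches the CIFAR-style stem used above):

```python
model = ResNet18()
x = torch.randn(2, 3, 32, 32)   # batch of two CIFAR-sized images
logits = model(x)
print(logits.shape)             # expected: torch.Size([2, 10])
```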
2. Implementing a Transformer
2.1 The Self-Attention Formula
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
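To make the formula concrete, here is a minimal single-head computation in plain PyTorch (a sketch with toy shapes, separate from the model code below):

```python
import torch
import torch.nn.functional as F

B, L, d_k = 2, 5, 16                             # batch, sequence length, key dim
Q = torch.randn(B, L, d_k)
K = torch.randn(B, L, d_k)
V = torch.randn(B, L, d_k)

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (B, L, L) scaled dot products
weights = F.softmax(scores, dim=-1)              # each row sums to 1
out = weights @ V                                # (B, L, d_k)
```

The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into saturation.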
2.1.1 Multi-Head Attention Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One fused projection produces Q, K and V in a single matmul.
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, L, _ = x.shape
        # (B, L, 3*E) -> (B, L, 3, H, D) -> (3, B, H, L, D)
        qkv = self.qkv(x).reshape(B, L, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Scaled dot-product attention, computed per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = F.softmax(attn, dim=-1)
        # Merge heads back: (B, H, L, D) -> (B, L, H*D).
        out = (attn @ v).transpose(1, 2).reshape(B, L, self.embed_dim)
        return self.proj(out)
```
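A quick shape check (a minimal sketch):

```python
mha = MultiHeadAttention(embed_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)   # (batch, sequence length, embed_dim)
print(mha(x).shape)          # expected: torch.Size([2, 10, 64])
```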
2.2 Transformer Encoder Structure
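An encoder layer wraps the multi-head attention above and a position-wise feed-forward network in residual connections with layer normalization. A minimal post-norm sketch (the 4x feed-forward expansion follows the original paper's convention; dropout is omitted for brevity):

```python
class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ffn_dim=None):
        super().__init__()
        ffn_dim = ffn_dim or 4 * embed_dim       # conventional 4x expansion
        self.attn = MultiHeadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.ReLU(),                           # as in the original paper
            nn.Linear(ffn_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))   # attention sublayer + residual
        x = self.norm2(x + self.ffn(x))    # feed-forward sublayer + residual
        return x
```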
3. Using and Fine-Tuning Pretrained Models
3.1 Loading a Pretrained Model
```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained model
model = models.resnet50(weights='IMAGENET1K_V2')

# Replace the final layer to match the new task's class count
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 10)  # assume the new task has 10 classes

# Freeze everything, then unfreeze only the top of the network:
# the last residual stage and the new classification head.
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True  # without this, the new head would stay frozen
```
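It is easy to verify which parameters will actually be updated (a quick check):

```python
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(len(trainable), trainable[:3])   # only layer4.* and fc.* remain trainable
```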
3.2 Fine-Tuning Best Practices
3.2.1 Layer-Wise Learning Rates
```python
from torch import optim

# Lower LR for pretrained layers, higher LR for the newly initialized head
optimizer = optim.Adam([
    {'params': model.layer4.parameters(), 'lr': 1e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3}
])
```
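For completeness, a minimal training step wired to this optimizer (a sketch; `train_loader` is an assumed DataLoader yielding `(images, labels)` batches):

```python
criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in train_loader:       # assumed DataLoader
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                       # gradients reach only layer4 and fc
    optimizer.step()
```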
3.2.2 Data Augmentation Strategy
```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),           # match the pretrained input size
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],  # ImageNet channel means
                         [0.229, 0.224, 0.225])  # ImageNet channel stds
])
```
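At evaluation time the random augmentations should give way to deterministic preprocessing; a common companion transform (an assumed addition, not in the original text):

```python
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
```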
3.3 Transfer Learning Results
| Method | Accuracy (CIFAR-10) | Training epochs |
| --- | --- | --- |
| ResNet-50 trained from scratch | 76.3% | 100 |
| ImageNet-pretrained, fine-tuned | 94.7% | 20 |
Appendix: Mathematical Foundations of Transfer Learning
Domain Adaptation Loss
To minimize the distribution gap between the source and target domains, the task loss is augmented with a domain alignment term:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \mathcal{L}_{\text{domain}}$$

where $\mathcal{L}_{\text{domain}}$ can be computed via MMD (maximum mean discrepancy) or adversarial learning:

$$\text{MMD}^2 = \left\|\frac{1}{n}\sum_{i=1}^n\phi(x_i^s) - \frac{1}{m}\sum_{j=1}^m\phi(x_j^t)\right\|^2$$

Here $x_i^s$ and $x_j^t$ are source- and target-domain samples and $\phi$ is a feature map into a reproducing kernel Hilbert space.
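A minimal biased MMD estimate with a Gaussian kernel, using the standard kernel expansion of the squared norm above (a sketch; the bandwidth `sigma` is a free hyperparameter):

```python
import torch

def mmd_rbf(xs, xt, sigma=1.0):
    """Biased MMD^2 estimate between sample batches of shape (n, d) and (m, d)."""
    def kernel(a, b):
        # Gaussian kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))
        dist2 = torch.cdist(a, b) ** 2
        return torch.exp(-dist2 / (2 * sigma ** 2))
    return kernel(xs, xs).mean() + kernel(xt, xt).mean() - 2 * kernel(xs, xt).mean()

xs = torch.randn(64, 128)          # source-domain features
xt = torch.randn(64, 128) + 0.5    # shifted target-domain features
print(mmd_rbf(xs, xt))             # larger shift -> larger MMD
```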
Gradient Analysis for Fine-Tuning
For pretrained parameters $\theta_p$ and newly added parameters $\theta_n$, the updates use separate learning rates:

$$\theta_p^{t+1} = \theta_p^t - \eta_p \frac{\partial \mathcal{L}}{\partial \theta_p}, \qquad \theta_n^{t+1} = \theta_n^t - \eta_n \frac{\partial \mathcal{L}}{\partial \theta_n}$$

Setting $\eta_p < \eta_n$ protects the pretrained features, which is exactly what the per-group learning rates in Section 3.2.1 implement.
Visualization Example: Attention Weight Heatmaps
```python
import matplotlib.pyplot as plt

# Visualize Transformer attention weights.
# NOTE: get_attention_maps is model-specific, not a built-in PyTorch API;
# it is assumed to return a tensor of shape (batch, num_heads, L, L).
attn_weights = model.get_attention_maps(inputs)
plt.figure(figsize=(12, 8))
for i in range(8):
    plt.subplot(2, 4, i + 1)
    plt.imshow(attn_weights[0, i].detach().cpu())
    plt.title(f'Head {i+1}')
plt.tight_layout()
plt.show()
```
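Since `get_attention_maps` is not built in, one way to provide something like it is a forward hook that recomputes the per-head softmax from the layer input. A hedged sketch against the `MultiHeadAttention` class from Section 2.1.1 (`mha` and `x` are the instances from its usage example):

```python
attention_maps = []

def capture_attention(module, inputs, output):
    # Recompute the attention matrix, since the module only returns
    # the projected output, not the softmax weights.
    x_in, = inputs
    B, L, _ = x_in.shape
    qkv = module.qkv(x_in).reshape(B, L, 3, module.num_heads, module.head_dim)
    q, k, v = qkv.permute(2, 0, 3, 1, 4)
    attn = F.softmax((q @ k.transpose(-2, -1)) / module.head_dim ** 0.5, dim=-1)
    attention_maps.append(attn.detach().cpu())   # (B, num_heads, L, L)

hook = mha.register_forward_hook(capture_attention)
_ = mha(x)        # attention_maps[0] now holds the weights
hook.remove()
```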
Note: the code in this article has been tested against PyTorch 2.1. When downloading pretrained weights, it is recommended to configure a proxy for `torch.hub.load_state_dict_from_url` if your network requires one. The next chapter dives into hands-on generative adversarial networks! 🚀