Easy-to-Learn Deep Learning Basics, Commands and Steps (5): Using a Pre-trained Model

Contents

- Using a pre-trained model

Using a pre-trained model

Loading the pre-trained model

from torchvision.models import vgg16  # definition of the VGG16 architecture
from torchvision.models import VGG16_Weights  # pre-trained weight configurations for VGG16

# Load the VGG16 network pre-trained on the ImageNet dataset
weights = VGG16_Weights.DEFAULT  # default pre-trained weights (for VGG16, the ImageNet-trained IMAGENET1K_V1 set)
model = vgg16(weights=weights)  # build the VGG16 model and load the chosen weights

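When weights=VGG16_Weights.DEFAULT is specified, torchvision will:

- Download the matching pre-trained weight file (a .pth checkpoint) on first use and cache it locally.
- Load those weights into the architecture defined by vgg16, matching each layer's parameters by name and shape.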
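
One step the loading snippet does not show is preparing the model for inference. A freshly built torchvision model is in training mode, so the Dropout layers in its classifier stay active until model.eval() is called, and the model must sit on the same device as its inputs. The sketch below illustrates both points on a tiny stand-in network so that it runs without downloading any weights; prepare_for_inference and toy are hypothetical names, not part of the original tutorial.

```python
import torch
from torch import nn


def prepare_for_inference(model: nn.Module, device: torch.device) -> nn.Module:
    """Move a model to `device` and switch it to evaluation mode."""
    model = model.to(device)
    model.eval()  # disables Dropout (and freezes BatchNorm statistics)
    return model


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for VGG16: a linear layer followed by Dropout, like the classifier head.
toy = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
toy = prepare_for_inference(toy, device)

x = torch.ones(1, 4, device=device)
with torch.no_grad():
    # In eval mode two forward passes on the same input give the same output.
    same = torch.equal(toy(x), toy(x))
print(toy.training, same)
```

With the real network the same helper would be called as model = prepare_for_inference(model, device) right after vgg16(weights=weights).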

The VGG16 model structure is as follows:

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace=True)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace=True)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace=True)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace=True)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace=True)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

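VGG16 is a convolutional neural network used mainly for image classification (here, recognizing the 1000 ImageNet object classes). Its structure is very regular: layers are stacked like building blocks.

VGG16 model structure, broken down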

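1. Feature extraction (features): capturing image detail layer by layer

Core operation: repeated "convolution + ReLU activation" pairs, with a max-pooling layer after every 2 to 3 convolutions to shrink the spatial size.
Step by step:
- First 2 layers: the 3-channel input image is mapped to 64 channels, capturing basic edges and textures.
- Pooling: halves the height and width (e.g. 224x224 → 112x112) while keeping the strongest responses.
- Later layers: the channel count grows step by step (128 → 256 → 512) to extract more complex patterns such as shapes and object parts.
- Pooling between stages: each stage ends with a pooling layer that compresses spatial information and reduces computation.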
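
To make the shrinking spatial size concrete, the following sketch traces the tensor shape through the five stages using only the layer settings visible in the printed structure (3x3 convolutions with padding 1, 2x2 max pooling with stride 2). STAGES and trace_feature_shapes are illustrative names introduced here, and a 224x224 input is assumed.

```python
# (number of conv layers, output channels) for the five VGG16 stages,
# read off the printed `features` module.
STAGES = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_feature_shapes(height=224, width=224):
    """Return the (channels, height, width) shape after each stage's pooling."""
    shapes = []
    for _num_convs, channels in STAGES:
        # 3x3 conv with stride 1 and padding 1 keeps the spatial size unchanged;
        # the 2x2 max pool with stride 2 then floor-divides it by 2.
        height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes


shapes = trace_feature_shapes()
for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape}")

channels, h, w = shapes[-1]
flattened = channels * h * w
print("flattened size:", flattened)  # 512 * 7 * 7 = 25088
```

The final 512 x 7 x 7 = 25088 values are exactly the in_features of the first Linear layer in the classifier.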

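2. Classification (classifier): combining features to make a decision

Adaptive average pooling (avgpool): resamples the feature map to a fixed 7x7 size, so the classifier's input dimension (512 x 7 x 7 = 25088) does not depend on the input image size. In torchvision this is a separate module that sits between features and classifier.
Fully connected layers:
- First two layers: 25088 → 4096 and 4096 → 4096, each followed by ReLU and Dropout (p=0.5), which randomly zeroes activations during training so the network cannot simply memorize the training data.
- Last layer: 4096 → 1000, producing one raw score (logit) per class; applying softmax turns these scores into probabilities.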

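VGG16 design strengths

1. Uniform small convolution kernels (3x3)

Advantage: a stack of small kernels covers the same receptive field as one large kernel, with fewer parameters and more non-linearities.
- Example: two stacked 3x3 convolutions see the same 5x5 region as one 5x5 convolution, but use fewer weights per input/output channel pair (2 x 3 x 3 = 18 vs 5 x 5 = 25).
- Effect: spatial features are captured more efficiently, which makes a deeper model affordable.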
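
The 18 vs 25 comparison counts weights for a single input/output channel pair. The short calculation below extends it to a full layer and also checks the receptive-field claim. The channel width of 64 is an arbitrary value chosen for illustration, biases are ignored, and both helper functions are hypothetical names.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # illustrative layer width, same for input and output

two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

print("receptive field of two 3x3 layers:", stacked_receptive_field(2))
print("weights, two 3x3 layers:", two_3x3)
print("weights, one 5x5 layer: ", one_5x5)
```

The ratio stays at 18/25 whatever the channel width, and three stacked 3x3 layers reach a 7x7 receptive field by the same formula.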

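2. Depth

The 16 weight layers (13 convolutional + 3 fully connected) extract features at several levels:
- Shallow layers: edges and textures → middle layers: shapes and parts → deep layers: whole objects.

3. Modular design

Each stage repeats the same pattern (convolution → activation → pooling), so the code is easy to write and the design is easy to extend (e.g. VGG19).

4. Good generalization

After pre-training on ImageNet, the model can serve as a base model for other tasks (transfer learning) and adapts well to them.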

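VGG16 limitations

- Large parameter count: the fully connected layers hold most of the parameters (the 25088 → 4096 layer alone has over 100 million weights), so memory and compute costs are high.
- Modern alternatives: later models such as ResNet use residual connections to ease gradient problems in deep networks and generally perform better.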
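
The parameter-count limitation can be checked by hand from the layer shapes in the printed structure. The sketch below sums weights and biases for every Conv2d and Linear layer. The shape lists are copied from that printout; the helper names are introduced here for illustration.

```python
# (in_channels, out_channels) of the 13 Conv2d layers, all with 3x3 kernels.
CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (in_features, out_features) of the 3 Linear layers.
LINEAR_SIZES = [(25088, 4096), (4096, 4096), (4096, 1000)]


def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of one Conv2d layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def linear_params(in_f, out_f):
    """Weights plus biases of one Linear layer."""
    return in_f * out_f + out_f


features_total = sum(conv_params(i, o) for i, o in CONV_CHANNELS)
classifier_total = sum(linear_params(i, o) for i, o in LINEAR_SIZES)
total = features_total + classifier_total

print(f"features:   {features_total:,}")
print(f"classifier: {classifier_total:,}")
print(f"total:      {total:,}")
print(f"classifier share: {classifier_total / total:.1%}")
```

Roughly nine out of ten parameters sit in the three fully connected layers, which is why later architectures replaced large dense heads with global pooling.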

Image loading and preprocessing

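pre_trans = weights.transforms()  # the standard preprocessing pipeline bundled with the weights
pre_trans
>>> ImageClassification(
   # 1. Resize and crop
>>>     crop_size=[224]  # second step: center-crop a 224x224 region from the resized image as the model input
>>>     resize_size=[256]    # first step: resize the image, keeping its aspect ratio, so the shorter side is 256 pixels
   # 2. Normalization
   # normalized pixel = (pixel - mean) / std, applied after pixel values are scaled to [0, 1]
   # This gives each channel roughly zero mean and unit variance, which helps training converge faster and more stably. (The values are ImageNet statistics; reusing them keeps inputs compatible with the pre-trained model.)
>>>     mean=[0.485, 0.456, 0.406]  # per-channel RGB means, subtracted from the pixel values (centering)
>>>     std=[0.229, 0.224, 0.225]   # per-channel RGB standard deviations, used to divide the centered values (scaling)
>>>     interpolation=InterpolationMode.BILINEAR # bilinear interpolation when resizing, which smooths transitions between pixels and reduces jagged edges
>>> )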
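
As a worked example of the normalization formula, the snippet below applies it to a single RGB pixel in plain Python. It assumes, as in the pipeline above, that 0-255 values are first scaled to [0, 1]; normalize_pixel is a hypothetical helper written for this illustration.

```python
MEAN = [0.485, 0.456, 0.406]  # ImageNet per-channel means (R, G, B)
STD = [0.229, 0.224, 0.225]   # ImageNet per-channel standard deviations


def normalize_pixel(rgb_uint8):
    """Scale 0-255 RGB values to [0, 1], then apply (value - mean) / std per channel."""
    return [(value / 255.0 - m) / s for value, m, s in zip(rgb_uint8, MEAN, STD)]


mid_gray = normalize_pixel([128, 128, 128])
black = normalize_pixel([0, 0, 0])

print([round(v, 3) for v in mid_gray])
print([round(v, 3) for v in black])
```

A mid-gray pixel lands close to zero in every channel, while a black pixel maps to roughly -2, which shows the range of values the model actually sees.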

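The pipeline above is roughly equivalent to the following code (this version resizes straight to 224x224 instead of resizing the shorter side to 256 and then center-cropping, so results can differ slightly):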

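import torch
import torchvision.transforms.v2 as transforms  # ToDtype(..., scale=True) is part of the v2 API

IMG_WIDTH, IMG_HEIGHT = (224, 224)

pre_trans = transforms.Compose([
    transforms.ToDtype(torch.float32, scale=True),  # converts uint8 [0, 255] to float [0, 1]
    transforms.Resize((IMG_HEIGHT, IMG_WIDTH)),  # Resize expects (height, width); non-square images get stretched
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
    transforms.CenterCrop(224)  # has no effect here, because the image is already 224x224
])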

Preprocess the image so that it can be fed to the model in the expected shape, (1, 3, 224, 224):

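import matplotlib.image as mpimg
import torch
import torchvision.io as tv_io

# Assumption: `device` was defined earlier in the original notebook; the usual idiom is shown here.
# The model must be moved to the same device, e.g. model.to(device).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_and_process_image(file_path):
    # Print image's original shape, for reference
    print('Original image shape: ', mpimg.imread(file_path).shape)

    image = tv_io.read_image(file_path).to(device)  # uint8 tensor of shape (C, H, W)
    image = pre_trans(image)  # weights.transforms()
    image = image.unsqueeze(0)  # Turn into a batch of one: (1, C, H, W)
    return image

processed_image = load_and_process_image("data/doggy_door_images/happy_dog.jpg")
print("Processed image shape: ", processed_image.shape)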

>>> Original image shape:  (1200, 1800, 3)
>>> Processed image shape:  torch.Size([1, 3, 224, 224])

Prediction

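import json

# Assumption: show_image is a display helper defined earlier in the original notebook; it is not shown in this article.
with open("data/imagenet_class_index.json") as f:
    vgg_classes = json.load(f)  # maps "index" -> [WordNet id, class name]

def readable_prediction(image_path):
    # Show image
    show_image(image_path)
    # Load and pre-process the image
    image = load_and_process_image(image_path)
    # Run the model and take the output of the first (and only) sample
    output = model(image)[0]  # Unbatch
    predictions = torch.topk(output, 3)  # the 3 highest-scoring classes (raw logits, not probabilities)
    indices = predictions.indices.tolist()  # convert to a Python list
    # Print predictions in readable form
    out_str = "Top results: "
    # Map each index to its class name by looking it up in the dictionary
    pred_classes = [vgg_classes[str(idx)][1] for idx in indices]
    out_str += ", ".join(pred_classes)
    print(out_str)

    return predictions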

readable_prediction("data/doggy_door_images/happy_dog.jpg")
>>> Original image shape:  (1200, 1800, 3)
>>> Top results: Staffordshire_bullterrier, American_Staffordshire_terrier, Labrador_retriever
>>> torch.return_types.topk( values=tensor([19.6133, 15.8125, 14.4607], device='cuda:0', grad_fn=<TopkBackward0>), indices=tensor([179, 180, 208], device='cuda:0'))

readable_prediction("data/doggy_door_images/brown_bear.jpg")
>>> Original image shape:  (2592, 3456, 3)
>>> Top results: brown_bear, American_black_bear, sloth_bear
>>> torch.return_types.topk(
values=tensor([33.0100, 27.1086, 22.9985], device='cuda:0', grad_fn=<TopkBackward0>),
indices=tensor([294, 295, 297], device='cuda:0'))

readable_prediction("data/doggy_door_images/sleepy_cat.jpg")
>>> Original image shape:  (1200, 1800, 3)
>>> Top results: tiger_cat, tabby, Egyptian_cat
>>> torch.return_types.topk(
values=tensor([16.7054, 13.8567, 12.5219], device='cuda:0', grad_fn=<TopkBackward0>),
indices=tensor([282, 281, 285], device='cuda:0'))
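
The values printed in these results (19.6, 33.0, 16.7, ...) are raw logits rather than probabilities, and the grad_fn entries show that gradients were being tracked during inference. A minimal sketch of how both could be addressed follows: softmax converts logits to probabilities, and torch.no_grad() turns off gradient tracking. top_k_probabilities is a hypothetical helper and the logits are made-up numbers for a 5-class toy case, not model outputs.

```python
import torch


def top_k_probabilities(logits: torch.Tensor, k: int = 3):
    """Convert a 1-D tensor of logits to probabilities and return the top k."""
    probs = torch.softmax(logits, dim=0)
    return torch.topk(probs, k)


# Made-up logits for a 5-class problem, only to exercise the helper.
fake_logits = torch.tensor([2.0, 0.5, -1.0, 4.0, 0.0])

with torch.no_grad():  # in real use, the model(image) call would go inside this block
    top = top_k_probabilities(fake_logits, k=3)

print(top.indices.tolist())
print([round(p, 3) for p in top.values.tolist()])
```

Inside readable_prediction the same idea would mean wrapping output = model(image)[0] in torch.no_grad() and passing output to the helper instead of calling torch.topk on the logits directly. The class ranking is unchanged, since softmax preserves order.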