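[PyTorch] Experiment Tracking

Table of contents

[1. Get the data](#1. Get the data)
[2. Create Datasets and DataLoaders](#2. Create Datasets and DataLoaders)
[3. Get and customize a pretrained model](#3. Get and customize a pretrained model)
[4. Train the model and track results](#4. Train the model and track results)
[5. View the model's results in TensorBoard](#5. View the model's results in TensorBoard)
[6. Create a helper function to build SummaryWriter() instances](#6. Create a helper function to build SummaryWriter() instances)
[7. Set up a series of modelling experiments](#7. Set up a series of modelling experiments)
[8. View the experiments in TensorBoard](#8. View the experiments in TensorBoard)
[9. Load the best model and make predictions with it](#9. Load the best model and make predictions with it)
Addendum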

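Machine learning and deep learning are highly experimental: you need to track the results of many combinations of data, model architectures and training regimes.

When you run a lot of different experiments, experiment tracking helps you work out what works and what doesn't.

If you only run a handful of models, tracking their results in printouts and a few dictionaries may be enough. As the number of experiments grows, though, this naive approach can get out of hand.

There are about as many ways to track machine learning experiments as there are experiments to run.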

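Method	Setup	Pros	Cons	Cost
Python dictionaries, CSV files, printouts	None	Easy to set up, runs in pure Python	Hard to keep track of large numbers of experiments	Free
TensorBoard	Minimal, install tensorboard	Extension built into PyTorch, widely recognized and used, scales easily	User experience is not as polished as other options	Free
Weights & Biases Experiment Tracking	Minimal, install wandb, make an account	Great user experience, experiments can be made public, tracks almost anything	Requires an external resource outside of PyTorch	Free for personal use
MLflow	Minimal, install mlflow and start tracking	Fully open-source MLOps lifecycle management, many integrations	Setting up a remote tracking server is a little harder than with other services	Free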
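
As a rough illustration of the first row of the table (dictionaries, CSV files and printouts), here is a minimal sketch in pure Python. The helper names log_result() and to_csv() and the loss values are made up for this example; they are not part of the original scripts.

```python
import csv
import io

FIELDS = ["experiment", "epoch", "train_loss", "test_loss"]

def log_result(results: list, name: str, epoch: int,
               train_loss: float, test_loss: float) -> None:
    """Append one row per (experiment, epoch) to a plain list of dicts."""
    results.append({"experiment": name, "epoch": epoch,
                    "train_loss": train_loss, "test_loss": test_loss})

def to_csv(results: list) -> str:
    """Serialise the tracked rows to CSV text (write it to a file to keep it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()

results = []
# Placeholder numbers for illustration only, not measured results
log_result(results, "baseline", epoch=0, train_loss=1.10, test_loss=1.05)
log_result(results, "baseline", epoch=1, train_loss=0.90, test_loss=0.95)

csv_text = to_csv(results)
print(csv_text)
```

This works for a couple of runs, but every new hyperparameter means another column to maintain by hand, which is where a dedicated tracker starts to pay off.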

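This post focuses on using TensorBoard to track our experiments.

Before starting, set up the following:

(1) Reuse the Python scripts data_setup.py and engine.py from the earlier post on PyTorch Going Modular. data_setup creates the Datasets and DataLoaders, and engine holds the functions that train the model. Copy both over from that post.

(2) Import some base libraries.

(3) Set up device-agnostic code.

(4) Create a helper function to set the seeds.

python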

# Continue with regular imports

import matplotlib.pyplot as plt
import torch
import torchvision

from torch import nn
from torchvision import transforms
from torchinfo import summary

from going_modular.going_modular import data_setup, engine

python

device = "cuda" if torch.cuda.is_available() else "cpu"
device

python

# Set seeds
def set_seeds(seed: int=42):
    """Sets random seeds for torch operations.

    Args:
        seed (int, optional): Random seed to set. Defaults to 42.
    """
    # Set the seed for general torch operations
    torch.manual_seed(seed)
    # Set the seed for CUDA torch operations (ones that happen on the GPU)
    torch.cuda.manual_seed(seed)
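
To see what seeding buys us, here is a small sketch, assuming only that torch is installed. It redefines set_seeds() so the snippet runs on its own, then checks that two random tensors drawn after the same seed are identical.

```python
import torch

def set_seeds(seed: int = 42) -> None:
    """Seed the CPU and CUDA random number generators."""
    torch.manual_seed(seed)
    # Silently ignored when CUDA is not available
    torch.cuda.manual_seed(seed)

set_seeds(42)
first = torch.rand(3)

set_seeds(42)
second = torch.rand(3)

# Same seed, same sequence of random numbers
print(torch.equal(first, second))
```

This is why the later code calls set_seeds() just before creating a new nn.Linear layer: the randomly initialised weights then start from the same values on every run.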

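With the setup done, the rest of the experiment-tracking workflow is:

Get the data: FoodVision Mini, the pizza, steak and sushi image classification dataset.
Create Datasets and DataLoaders: using the imported data_setup script.
Get and customize a pretrained model: download a pretrained model from torchvision.models and adapt it to our own problem.
Train the model and track results
View the model's results in TensorBoard
Create a helper function to track experiments: write a function that helps us save our modelling experiment results.
Set up a series of modelling experiments: write code that runs several experiments in one go, varying the model, the amount of data and the training time.
View the modelling experiments in TensorBoard
Load the best model and make predictions with it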

1. Get the data

Download the dataset with the code below:

python

import os
import zipfile

from pathlib import Path

import requests

def download_data(source: str,
                  destination: str,
                  remove_source: bool = True) -> Path:
    """Downloads a zipped dataset from source and unzips to destination.

    Args:
        source (str): A link to a zipped file containing data.
        destination (str): A target directory to unzip data to.
        remove_source (bool): Whether to remove the source after downloading and extracting.

    Returns:
        pathlib.Path to downloaded data.

    Example usage:
        download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                      destination="pizza_steak_sushi")
    """
    # Setup path to data folder
    data_path = Path("data/")
    image_path = data_path / destination

    # If the image folder doesn't exist, download it and prepare it...
    if image_path.is_dir():
        print(f"[INFO] {image_path} directory exists, skipping download.")
    else:
        print(f"[INFO] Did not find {image_path} directory, creating one...")
        image_path.mkdir(parents=True, exist_ok=True)

        # Download pizza, steak, sushi data
        target_file = Path(source).name
        with open(data_path / target_file, "wb") as f:
            request = requests.get(source)
            print(f"[INFO] Downloading {target_file} from {source}...")
            f.write(request.content)

        # Unzip pizza, steak, sushi data
        with zipfile.ZipFile(data_path / target_file, "r") as zip_ref:
            print(f"[INFO] Unzipping {target_file} data...")
            zip_ref.extractall(image_path)

        # Remove .zip file
        if remove_source:
            os.remove(data_path / target_file)

    return image_path

image_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")
image_path
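
The unzip-and-clean-up half of download_data() can be tried without any network access. The sketch below is a stand-in under stated assumptions: extract_and_cleanup() is a hypothetical helper, and the archive is a tiny made-up zip built in a temporary directory rather than the real dataset.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_and_cleanup(zip_path: Path, destination: Path,
                        remove_source: bool = True) -> Path:
    """Unzip zip_path into destination and optionally delete the archive."""
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    if remove_source:
        zip_path.unlink()
    return destination

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a tiny stand-in archive (made-up file name, not the real data)
    archive = tmp_path / "toy_data.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("train/pizza/example.txt", "placeholder")

    out_dir = extract_and_cleanup(archive, tmp_path / "toy_data")
    extracted_ok = (out_dir / "train" / "pizza" / "example.txt").is_file()
    archive_removed = not archive.exists()

print(extracted_ok, archive_removed)
```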

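2. Create Datasets and DataLoaders

The create_dataloaders() function in data_setup.py creates the Datasets and DataLoaders directly.

Before calling it, we need a transform to pass in, and the transformed data must match the input format the model expects.

Because we will use transfer learning with a pretrained model from torchvision.models, we create a transform that prepares our images accordingly.

The post [PyTorch] Transfer Learning describes two ways to build this transform. The manual approach is used below:

python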

# Setup directories
train_dir = image_path / "train"
test_dir = image_path / "test"

# Setup ImageNet normalization levels (turns all images into similar distribution as ImageNet)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    normalize
])
print(f"Manually created transforms: {manual_transforms}")

# Create data loaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms, # use manually created transforms
    batch_size=32,
    num_workers=1
)

train_dataloader, test_dataloader, class_names

The automatic approach is shown below for reference. Use one or the other:

python

# Setup dirs
train_dir = image_path / "train"
test_dir = image_path / "test"

# Setup pretrained weights (plenty of these available in torchvision.models)
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT

# Get transforms from weights (these are the transforms that were used to obtain the weights)
automatic_transforms = weights.transforms() 
print(f"Automatically created transforms: {automatic_transforms}")

# Create data loaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=automatic_transforms, # use automatic created transforms
    batch_size=32,
    num_workers=1
)

train_dataloader, test_dataloader, class_names

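3. Get and customize a pretrained model

Get a pretrained model, freeze the base layers and change the classifier head.

Download the pretrained weights for torchvision.models.efficientnet_b0() and prepare the model for use with our own data.

python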

# Note: This is how a pretrained model was created before torchvision 0.13; the pretrained argument is deprecated in newer versions.
# model = torchvision.models.efficientnet_b0(pretrained=True).to(device) # OLD

# Download the pretrained weights for EfficientNet_B0
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT # NEW in torchvision 0.13, "DEFAULT" means "best weights available"

# Setup the model with the pretrained weights and send it to the target device
model = torchvision.models.efficientnet_b0(weights=weights).to(device)

# View the output of the model
# model

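We freeze the model's base layers (we use them to extract features from the input images) and change the classifier head (the output layer) to match the number of classes we are working with (three: pizza, steak, sushi).

python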

# Freeze all base layers by setting requires_grad attribute to False
for param in model.features.parameters():
    param.requires_grad = False
    
# Since we're creating a new layer with random weights (torch.nn.Linear), 
# let's set the seeds
set_seeds() 

# Update the classifier head to suit our problem
model.classifier = torch.nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=1280, 
              out_features=len(class_names),
              bias=True).to(device))
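
A quick way to confirm that freezing worked is to count trainable and frozen parameters. The sketch below uses a tiny hypothetical ToyModel (not EfficientNet) with a features part and a classifier part, so the numbers can be checked by hand.

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    """Hypothetical stand-in: 'features' plays the pretrained backbone."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 4)    # 8*4 weights + 4 biases = 36
        self.classifier = nn.Linear(4, 3)  # 4*3 weights + 3 biases = 15

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

toy_model = ToyModel()

# Freeze the "backbone" exactly as done for model.features above
for param in toy_model.features.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_model.parameters() if not p.requires_grad)
print(f"trainable: {trainable}, frozen: {frozen}")
```

Only the classifier's parameters remain trainable, so the optimizer updates nothing in the frozen part even though it receives all of model.parameters().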

With the base layers frozen and the classifier head changed, use torchinfo.summary() to get a summary of the model:

python

from torchinfo import summary

# Print a summary using torchinfo
# input_size is (batch_size, color_channels, height, width)
summary(model=model,
        input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
        # col_names=["input_size"], # uncomment for smaller output
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
)

4. Train the model and track results

Create the loss function and optimizer:

python

# Define loss and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

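Adjust the train() function (from engine.py) to track results with SummaryWriter():

PyTorch's torch.utils.tensorboard.SummaryWriter() class can save various parts of a model's training progress to file.
By default, SummaryWriter() saves information about the model to the directory set by the log_dir parameter.
The default log_dir is runs/CURRENT_DATETIME_HOSTNAME, where HOSTNAME is the name of your computer. You can change where experiments are tracked (the directory name can be customized as needed).
SummaryWriter() output is saved in TensorBoard format.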
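
To make the runs/CURRENT_DATETIME_HOSTNAME pattern concrete, here is a pure-Python sketch that builds a directory name in that style. The exact timestamp format used here ("%b%d_%H-%M-%S") is an assumption about how the default name is put together, and default_style_log_dir() is a hypothetical helper, not a PyTorch function.

```python
import os
import socket
from datetime import datetime

def default_style_log_dir(now: datetime, hostname: str, comment: str = "") -> str:
    """Approximate the runs/CURRENT_DATETIME_HOSTNAME naming scheme.

    The timestamp format is an assumption for illustration.
    """
    current_time = now.strftime("%b%d_%H-%M-%S")
    return os.path.join("runs", f"{current_time}_{hostname}{comment}")

# Fixed inputs so the result is easy to inspect
example = default_style_log_dir(datetime(2022, 6, 4, 9, 30, 0), "my-laptop")
print(example)

# The same scheme with the current time and this machine's name
print(default_style_log_dir(datetime.now(), socket.gethostname()))
```

Because the name includes the time down to the second, each default SummaryWriter() normally gets its own directory, which keeps runs separate but says nothing about what each run was.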

Create a SummaryWriter() instance with the default settings:

python

from torch.utils.tensorboard import SummaryWriter

# Create a writer with all default settings
writer = SummaryWriter()

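Take the train() function from engine.py and adapt it to use writer, so that it logs the model's training and test loss and accuracy values.

We can do this with writer.add_scalars(main_tag, tag_scalar_dict), where:

main_tag (string) - the name of the scalars being tracked (e.g. "Accuracy")
tag_scalar_dict (dict) - a dictionary of the values being tracked (e.g. {"train_loss": 0.3454})

The method is called add_scalars() because our loss and accuracy values are generally scalars (single values).

Once we have finished tracking values, we call writer.close() to tell the writer to stop looking for values to track.

To start modifying train(), we also import train_step() and test_step() from engine.py:

python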

from typing import Dict, List
from tqdm.auto import tqdm

from going_modular.going_modular.engine import train_step, test_step

# Import train() function from: 
# https://github.com/mrdbourke/pytorch-deep-learning/blob/main/going_modular/going_modular/engine.py
def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch models through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Args:
      model: A PyTorch model to be trained and tested.
      train_dataloader: A DataLoader instance for the model to be trained on.
      test_dataloader: A DataLoader instance for the model to be tested on.
      optimizer: A PyTorch optimizer to help minimize the loss function.
      loss_fn: A PyTorch loss function to calculate loss on both datasets.
      epochs: An integer indicating how many epochs to train for.
      device: A target device to compute on (e.g. "cuda" or "cpu").
      
    Returns:
      A dictionary of training and testing loss as well as training and
      testing accuracy metrics. Each metric has a value in a list for 
      each epoch.
      In the form: {train_loss: [...],
                train_acc: [...],
                test_loss: [...],
                test_acc: [...]} 
      For example if training for epochs=2: 
              {train_loss: [2.0616, 1.0537],
                train_acc: [0.3945, 0.3945],
                test_loss: [1.2641, 1.5706],
                test_acc: [0.3400, 0.2973]} 
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }

    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                           dataloader=train_dataloader,
                                           loss_fn=loss_fn,
                                           optimizer=optimizer,
                                           device=device)
        test_loss, test_acc = test_step(model=model,
                                        dataloader=test_dataloader,
                                        loss_fn=loss_fn,
                                        device=device)

        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)

        ### New: Experiment tracking ###
        # Add loss results to SummaryWriter
        writer.add_scalars(main_tag="Loss", 
                           tag_scalar_dict={"train_loss": train_loss,
                                            "test_loss": test_loss},
                           global_step=epoch)

        # Add accuracy results to SummaryWriter
        writer.add_scalars(main_tag="Accuracy", 
                           tag_scalar_dict={"train_acc": train_acc,
                                            "test_acc": test_acc}, 
                           global_step=epoch)
        
        # Track the PyTorch model architecture
        writer.add_graph(model=model, 
                         # Pass in an example input
                         input_to_model=torch.randn(32, 3, 224, 224).to(device))
    
    # Close the writer
    writer.close()
    
    ### End new ###

    # Return the filled results at the end of the epochs
    return results

Train for 5 epochs to test it:

python

# Train model
# Note: Not using engine.train() since the original script isn't updated to use writer
set_seeds()
results = train(model=model,
                train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                optimizer=optimizer,
                loss_fn=loss_fn,
                epochs=5,
                device=device)

The model's results are also tracked in a dictionary:

python

# Check out the model results
results
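
The results dictionary can be queried with plain Python. The sketch below uses the example values from the train() docstring (epochs=2) rather than real training output, and best_epoch() is a hypothetical helper added for illustration.

```python
# Values copied from the train() docstring example (epochs=2)
example_results = {"train_loss": [2.0616, 1.0537],
                   "train_acc": [0.3945, 0.3945],
                   "test_loss": [1.2641, 1.5706],
                   "test_acc": [0.3400, 0.2973]}

def best_epoch(results: dict, metric: str = "test_loss",
               lower_is_better: bool = True) -> int:
    """Return the 1-based epoch with the best value of the given metric."""
    values = results[metric]
    pick = min if lower_is_better else max
    return values.index(pick(values)) + 1

lowest_test_loss_epoch = best_epoch(example_results)
highest_test_acc_epoch = best_epoch(example_results, "test_acc",
                                    lower_is_better=False)
print(lowest_test_loss_epoch, highest_test_acc_epoch)
```

This is manageable for one run, but comparing many runs this way quickly becomes tedious, which is the gap TensorBoard fills in the next section.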

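5. View the model's results in TensorBoard

By default, the SummaryWriter() class stores the model's results in TensorBoard format in a directory called runs/.

TensorBoard can be viewed in several ways:

In a Jupyter notebook, do the following:

Make sure TensorBoard is installed, load it with %load_ext tensorboard, then view the results with %tensorboard --logdir DIR_WITH_LOGS.

python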

# Example code to run in a Jupyter or Google Colab notebook
%load_ext tensorboard
%tensorboard --logdir runs

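In Colab, the TensorBoard dashboard appears inline below the cell.

6. Create a helper function to build SummaryWriter() instances

The SummaryWriter() class logs information to the directory given by the log_dir parameter, so we create a helper function that builds a custom directory for each experiment.

Each experiment gets its own log directory, which tracks:

Experiment date/timestamp - when did the experiment take place?
Experiment name - is there something we'd like to call the experiment?
Model name - what model was used?
Extra - should anything else be tracked?

We start with a helper function called create_writer() that returns a SummaryWriter() instance logging to a custom log_dir.

Ideally, log_dir looks like: runs/YYYY-MM-DD/experiment_name/model_name/extra

python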

def create_writer(experiment_name: str, 
                  model_name: str, 
                  extra: str=None) -> torch.utils.tensorboard.writer.SummaryWriter:
    """Creates a torch.utils.tensorboard.writer.SummaryWriter() instance saving to a specific log_dir.

    log_dir is a combination of runs/timestamp/experiment_name/model_name/extra.

    Where timestamp is the current date in YYYY-MM-DD format.

    Args:
        experiment_name (str): Name of experiment.
        model_name (str): Name of model.
        extra (str, optional): Anything extra to add to the directory. Defaults to None.

    Returns:
        torch.utils.tensorboard.writer.SummaryWriter(): Instance of a writer saving to log_dir.

    Example usage:
        # Create a writer saving to "runs/2022-06-04/data_10_percent/effnetb2/5_epochs/"
        writer = create_writer(experiment_name="data_10_percent",
                               model_name="effnetb2",
                               extra="5_epochs")
        # The above is the same as:
        writer = SummaryWriter(log_dir="runs/2022-06-04/data_10_percent/effnetb2/5_epochs/")
    """
    from datetime import datetime
    import os

    # Get timestamp of current date (all experiments on certain day live in same folder)
    timestamp = datetime.now().strftime("%Y-%m-%d") # returns current date in YYYY-MM-DD format

    if extra:
        # Create log directory path
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name, extra)
    else:
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name)
        
    print(f"[INFO] Created SummaryWriter, saving to: {log_dir}...")
    return SummaryWriter(log_dir=log_dir)

python

# Create an example writer
example_writer = create_writer(experiment_name="data_10_percent",
                               model_name="effnetb0",
                               extra="5_epochs")
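
The path logic inside create_writer() can be checked on its own, without creating a writer. The sketch below is one possible variant: build_log_dir() is a hypothetical helper, and the today parameter is an addition (not in the original function) that lets the date be fixed for testing.

```python
from datetime import date
from pathlib import Path
from typing import Optional

def build_log_dir(experiment_name: str,
                  model_name: str,
                  extra: Optional[str] = None,
                  today: Optional[date] = None) -> Path:
    """Build runs/YYYY-MM-DD/experiment_name/model_name[/extra]."""
    today = today or date.today()
    parts = ["runs", today.strftime("%Y-%m-%d"), experiment_name, model_name]
    if extra:
        parts.append(extra)
    return Path(*parts)

log_dir = build_log_dir("data_10_percent", "effnetb0", "5_epochs",
                        today=date(2022, 6, 4))
print(log_dir.as_posix())

short_dir = build_log_dir("data_10_percent", "effnetb0",
                          today=date(2022, 6, 4))
print(short_dir.as_posix())
```

Since the timestamp only goes down to the day, rerunning the same experiment on the same date writes into the same directory. If that matters, one option is to add a time or run counter to extra.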

Update the train() function to take a writer parameter

To adapt train(), we add a writer parameter to the function, then add code that checks whether a writer was passed in and, if so, logs our information to it.

python

from typing import Dict, List
from tqdm.auto import tqdm

# Add writer parameter to train()
def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device, 
          writer: torch.utils.tensorboard.writer.SummaryWriter # new parameter to take in a writer
          ) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch models through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Stores metrics to specified writer log_dir if present.

    Args:
      model: A PyTorch model to be trained and tested.
      train_dataloader: A DataLoader instance for the model to be trained on.
      test_dataloader: A DataLoader instance for the model to be tested on.
      optimizer: A PyTorch optimizer to help minimize the loss function.
      loss_fn: A PyTorch loss function to calculate loss on both datasets.
      epochs: An integer indicating how many epochs to train for.
      device: A target device to compute on (e.g. "cuda" or "cpu").
      writer: A SummaryWriter() instance to log model results to.

    Returns:
      A dictionary of training and testing loss as well as training and
      testing accuracy metrics. Each metric has a value in a list for 
      each epoch.
      In the form: {train_loss: [...],
                train_acc: [...],
                test_loss: [...],
                test_acc: [...]} 
      For example if training for epochs=2: 
              {train_loss: [2.0616, 1.0537],
                train_acc: [0.3945, 0.3945],
                test_loss: [1.2641, 1.5706],
                test_acc: [0.3400, 0.2973]} 
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }

    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                          dataloader=train_dataloader,
                                          loss_fn=loss_fn,
                                          optimizer=optimizer,
                                          device=device)
        test_loss, test_acc = test_step(model=model,
          dataloader=test_dataloader,
          loss_fn=loss_fn,
          device=device)

        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)


        ### New: Use the writer parameter to track experiments ###
        # See if there's a writer, if so, log to it
        if writer:
            # Add results to SummaryWriter
            writer.add_scalars(main_tag="Loss", 
                               tag_scalar_dict={"train_loss": train_loss,
                                                "test_loss": test_loss},
                               global_step=epoch)
            writer.add_scalars(main_tag="Accuracy", 
                               tag_scalar_dict={"train_acc": train_acc,
                                                "test_acc": test_acc}, 
                               global_step=epoch)

    # Close the writer once, after the final epoch
    # (closing it inside the loop would close it after every epoch)
    if writer:
        writer.close()
    ### End new ###

    # Return the filled results at the end of the epochs
    return results
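
The optional-writer pattern can be exercised without TensorBoard by swapping in a test double. In the sketch below, RecordingWriter and run_epochs() are hypothetical stand-ins: the first mimics only the two SummaryWriter methods used here, the second keeps just the logging part of the loop. The loss values come from the train() docstring example.

```python
from typing import List, Optional, Tuple

class RecordingWriter:
    """Hypothetical stand-in for SummaryWriter that only records calls."""
    def __init__(self) -> None:
        self.scalars = []
        self.closed = False

    def add_scalars(self, main_tag, tag_scalar_dict, global_step=None):
        self.scalars.append((main_tag, dict(tag_scalar_dict), global_step))

    def close(self) -> None:
        self.closed = True

def run_epochs(losses: List[Tuple[float, float]],
               writer: Optional[RecordingWriter] = None) -> None:
    """Log one (train_loss, test_loss) pair per epoch if a writer is given."""
    for epoch, (train_loss, test_loss) in enumerate(losses):
        if writer is not None:
            writer.add_scalars(main_tag="Loss",
                               tag_scalar_dict={"train_loss": train_loss,
                                                "test_loss": test_loss},
                               global_step=epoch)
    if writer is not None:
        writer.close()  # close once, after the last epoch

fake_writer = RecordingWriter()
run_epochs([(2.0616, 1.2641), (1.0537, 1.5706)], writer=fake_writer)
run_epochs([(2.0616, 1.2641)])  # no writer: nothing logged, nothing breaks

print(len(fake_writer.scalars), fake_writer.closed)
```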

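7. Set up a series of modelling experiments

Which experiments could we run?

Every hyperparameter is a starting point for a different experiment:

Change the number of epochs
Change the number of layers/hidden units
Change the amount of data
Change the learning rate
Try different kinds of data augmentation
Choose a different model architecture

Generally, the bigger your model (more learnable parameters) and the more data you have (more opportunities to learn), the better the performance.

However, start small and scale up only if something works.

Which experiments will we run?

The goal is to improve the model powering FoodVision Mini without it getting too big.

In other words, the ideal model reaches a high test set accuracy (90%+) without taking too long to train or to perform inference (make predictions).

We try combinations of the following:

A different amount of data (10% vs. 20% of pizza, steak, sushi)
A different model (torchvision.models.efficientnet_b0 vs. torchvision.models.efficientnet_b2)
A different training time (5 epochs vs. 10 epochs)

This gives the following experiments: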

Experiment number	Training Dataset	Model (pretrained on ImageNet)	Number of epochs
1	Pizza, Steak, Sushi 10%	EfficientNetB0	5
2	Pizza, Steak, Sushi 10%	EfficientNetB2	5
3	Pizza, Steak, Sushi 10%	EfficientNetB0	10
4	Pizza, Steak, Sushi 10%	EfficientNetB2	10
5	Pizza, Steak, Sushi 20%	EfficientNetB0	5
6	Pizza, Steak, Sushi 20%	EfficientNetB2	5
7	Pizza, Steak, Sushi 20%	EfficientNetB0	10
8	Pizza, Steak, Sushi 20%	EfficientNetB2	10

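Notice that the experiments scale up gradually: across them we slowly increase the amount of data, the model size and the training time. By the end, experiment 8 uses double the data, double the model size and double the training length of experiment 1.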
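
The eight rows of the table are the Cartesian product of two datasets, two epoch counts and two models. A minimal sketch with itertools.product reproduces the same numbering, assuming the option names used later in this post ("data_10_percent", "effnetb0" and so on):

```python
from itertools import product

data_options = ["data_10_percent", "data_20_percent"]
epoch_options = [5, 10]
model_options = ["effnetb0", "effnetb2"]

# product() varies the last option fastest, which matches the table order:
# data changes slowest, then epochs, then model
experiments = [
    {"number": number, "data": data, "epochs": epochs, "model": model}
    for number, (data, epochs, model) in enumerate(
        product(data_options, epoch_options, model_options), start=1)
]

for experiment in experiments:
    print(experiment)
```

Adding a third value to any of the lists grows the grid multiplicatively (for example 2 x 3 x 2 = 12 runs), which is one reason to start with a small set of options.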

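These are only a small subset of the possible options. Since we cannot test everything, it is best to try a few things first and then follow the ones that work best.

The full dataset the subsets are drawn from: Food101

Download the datasets at different proportions
Two training sets are needed: the 10% split and the 20% split.
One test set is needed: every experiment is evaluated on the test set of the 10% dataset, for consistency.

Download code:

python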

# Download 10 percent and 20 percent training data (if necessary)
data_10_percent_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                                     destination="pizza_steak_sushi")

data_20_percent_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi_20_percent.zip",
                                     destination="pizza_steak_sushi_20_percent")

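Create the different training directory paths. Only one test directory path is needed, because all experiments use the same test dataset (the one from pizza, steak, sushi 10%):

python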

# Setup training directory paths
train_dir_10_percent = data_10_percent_path / "train"
train_dir_20_percent = data_20_percent_path / "train"

# Setup testing directory paths (note: use the same test dataset for both to compare the results)
test_dir = data_10_percent_path / "test"

# Check the directories
print(f"Training directory 10%: {train_dir_10_percent}")
print(f"Training directory 20%: {train_dir_20_percent}")
print(f"Testing directory: {test_dir}")

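Transform the datasets and create DataLoaders

We create a series of transforms to prepare the images for the models. To keep things consistent, we create one transform manually and use it across all datasets.

(1) Resize all images (we start with 224, 224, but this can be changed).

(2) Turn them into tensors with values between 0 and 1.

(3) Normalize them so their distribution is in line with the ImageNet dataset (we do this because our models from torchvision.models were pretrained on ImageNet).

python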

from torchvision import transforms

# Create a transform to normalize data distribution to be inline with ImageNet
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], # values per colour channel [red, green, blue]
                                 std=[0.229, 0.224, 0.225]) # values per colour channel [red, green, blue]

# Compose transforms into a pipeline
simple_transform = transforms.Compose([
    transforms.Resize((224, 224)), # 1. Resize the images
    transforms.ToTensor(), # 2. Turn the images into tensors with values between 0 & 1
    normalize # 3. Normalize the images so their distributions match the ImageNet dataset 
])
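
The normalization step applies (value - mean) / std to each colour channel. The pure-Python sketch below works this out for single pixels using the same mean and std values as above; normalize_pixel() is a hypothetical helper for illustration, not part of torchvision.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]  # per channel: red, green, blue
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel with values in [0, 1]."""
    return [(value - mean) / std
            for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]

# A pixel equal to the ImageNet mean maps to zero in every channel
mean_pixel = normalize_pixel(IMAGENET_MEAN)
print(mean_pixel)

# A mid-grey pixel ends up slightly above zero in every channel
mid_grey = normalize_pixel([0.5, 0.5, 0.5])
print([round(value, 3) for value in mid_grey])
```

After this step the values are no longer confined to the 0 to 1 range, which is expected: the pretrained weights were fitted on inputs scaled this way.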

Use the create_dataloaders() function in data_setup.py to create the DataLoaders, with the same test_dataloader throughout (to keep comparisons consistent):

python

BATCH_SIZE = 32

# Create 10% training and test DataLoaders
train_dataloader_10_percent, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir_10_percent,
    test_dir=test_dir, 
    transform=simple_transform,
    batch_size=BATCH_SIZE
)

# Create 20% training and test DataLoaders
train_dataloader_20_percent, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir_20_percent,
    test_dir=test_dir,
    transform=simple_transform,
    batch_size=BATCH_SIZE
)

# Find the number of samples/batches per dataloader (using the same test_dataloader for both experiments)
print(f"Number of batches of size {BATCH_SIZE} in 10 percent training data: {len(train_dataloader_10_percent)}")
print(f"Number of batches of size {BATCH_SIZE} in 20 percent training data: {len(train_dataloader_20_percent)}")
print(f"Number of batches of size {BATCH_SIZE} in testing data: {len(test_dataloader)} (all experiments will use the same test set)")
print(f"Number of classes: {len(class_names)}, class names: {class_names}")
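
The value len(dataloader) reports is the number of batches, not the number of images. As a sketch of the arithmetic, assuming a hypothetical training set of 225 images (an assumed count for illustration, not read from the data):

```python
import math

def num_batches(num_samples: int, batch_size: int,
                drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields for a dataset of num_samples."""
    if drop_last:
        # The incomplete final batch is discarded
        return num_samples // batch_size
    # The final batch may be smaller than batch_size
    return math.ceil(num_samples / batch_size)

batches_default = num_batches(225, 32)
batches_drop_last = num_batches(225, 32, drop_last=True)
print(batches_default, batches_drop_last)
```

With the default drop_last=False the last batch here holds only 225 - 7 * 32 = 1 image, so metrics averaged per batch give that single image the same weight as a full batch.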

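Create feature extractor models
We create two feature extractor models:

torchvision.models.efficientnet_b0() pretrained backbone + custom classifier head (EffNetB0 for short).
torchvision.models.efficientnet_b2() pretrained backbone + custom classifier head (EffNetB2 for short).

To do so, we freeze the base layers (the feature layers) and update each model's classifier head (output layer) to suit our problem.

The in_features parameter of the EffNetB0 classifier head is 1280 (the backbone turns the input image into a feature vector of size 1280).

Since EffNetB2 has a different number of layers and parameters, we need to adapt to it accordingly.

We can find EffNetB2's input and output shapes by calling torchinfo.summary() with input_size=(32, 3, 224, 224), where (32, 3, 224, 224) corresponds to (batch_size, color_channels, height, width), i.e. an example of what a single batch of data passed to the model looks like.

To find the input shape required by EffNetB2's final layer:

(1) Create an instance of torchvision.models.efficientnet_b2() with pretrained weights (weights=EfficientNet_B2_Weights.DEFAULT).

(2) Inspect the input and output shapes by running torchinfo.summary().

(3) Print the number of in_features by inspecting the state_dict() of EffNetB2's classifier and taking the length of one row of the weight matrix. (Alternatively, just inspect the output of effnetb2.classifier.)

Why this works: thanks to the torch.nn.AdaptiveAvgPool2d() layer, many modern models can handle input images of different sizes. The layer adaptively produces the given output_size whatever the input size. You can try this by passing input images of different sizes to torchinfo.summary() or to your own model that uses the layer.

python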

import torchvision
from torchinfo import summary

# 1. Create an instance of EffNetB2 with pretrained weights
effnetb2_weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT # "DEFAULT" means best available weights
effnetb2 = torchvision.models.efficientnet_b2(weights=effnetb2_weights)

# # 2. Get a summary of standard EffNetB2 from torchvision.models (uncomment for full output)
# summary(model=effnetb2, 
#         input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
#         # col_names=["input_size"], # uncomment for smaller output
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# ) 

# 3. Get the number of in_features of the EfficientNetB2 classifier layer
print(f"Number of in_features to final layer of EfficientNetB2: {len(effnetb2.classifier.state_dict()['1.weight'][0])}")
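
The state_dict() trick can be tried on a small stand-in head without downloading any weights. The sketch below builds a hypothetical classifier shaped like the default EffNetB2 head (Dropout followed by Linear, with 1408 inputs as found above and an assumed 1000 outputs for the ImageNet classes) and reads in_features in two ways.

```python
import torch
from torch import nn

# Hypothetical stand-in for effnetb2.classifier (no pretrained weights needed)
toy_classifier = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(in_features=1408, out_features=1000),
)

# nn.Linear stores its weight as (out_features, in_features),
# so the length of one row is the number of input features
weight = toy_classifier.state_dict()["1.weight"]
in_features_from_weights = len(weight[0])

# The same number is available directly on the layer
in_features_from_layer = toy_classifier[1].in_features

print(tuple(weight.shape), in_features_from_weights, in_features_from_layer)
```

The key "1.weight" refers to the layer at index 1 of the nn.Sequential. Index 0 is the Dropout layer, which has no parameters and so does not appear in the state_dict().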

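Model summary of the EffNetB2 feature extractor model with all layers unfrozen (trainable) and the default classifier head from ImageNet pretraining.

Now that we know the number of in_features required for EffNetB2, we create a couple of helper functions to set up the EffNetB0 and EffNetB2 feature extractor models.

The functions should:

(1) Get the base model from torchvision.models

(2) Freeze the base layers of the model (set requires_grad=False)

(3) Set the random seeds

(4) Change the classifier head (to suit our problem)

(5) Give the model a name (e.g. "effnetb0" for EffNetB0)

python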

import torchvision
from torch import nn

# Get num out features (one for each class pizza, steak, sushi)
OUT_FEATURES = len(class_names)

# Create an EffNetB0 feature extractor
def create_effnetb0():
    # 1. Get the base model with pretrained weights and send to target device
    weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
    model = torchvision.models.efficientnet_b0(weights=weights).to(device)

    # 2. Freeze the base model layers
    for param in model.features.parameters():
        param.requires_grad = False

    # 3. Set the seeds
    set_seeds()

    # 4. Change the classifier head
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.2),
        nn.Linear(in_features=1280, out_features=OUT_FEATURES)
    ).to(device)

    # 5. Give the model a name
    model.name = "effnetb0"
    print(f"[INFO] Created new {model.name} model.")
    return model

# Create an EffNetB2 feature extractor
def create_effnetb2():
    # 1. Get the base model with pretrained weights and send to target device
    weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT
    model = torchvision.models.efficientnet_b2(weights=weights).to(device)

    # 2. Freeze the base model layers
    for param in model.features.parameters():
        param.requires_grad = False

    # 3. Set the seeds
    set_seeds()

    # 4. Change the classifier head
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.3),
        nn.Linear(in_features=1408, out_features=OUT_FEATURES)
    ).to(device)

    # 5. Give the model a name
    model.name = "effnetb2"
    print(f"[INFO] Created new {model.name} model.")
    return model

Create instances of EffNetB0 and EffNetB2 and check their summary() to test them:

python

effnetb0 = create_effnetb0() 

# Get an output summary of the layers in our EffNetB0 feature extractor model (uncomment to view full output)
summary(model=effnetb0, 
        input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
        # col_names=["input_size"], # uncomment for smaller output
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
)

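Model summary of the EffNetB0 model with the base layers frozen (not trainable) and an updated classifier head (suited to pizza, steak, sushi image classification).

python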

effnetb2 = create_effnetb2()

# Get an output summary of the layers in our EffNetB2 feature extractor model (uncomment to view full output)
summary(model=effnetb2, 
        input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
        # col_names=["input_size"], # uncomment for smaller output
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
)

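Model summary of the EffNetB2 model with the base layers frozen (not trainable) and an updated classifier head (suited to pizza, steak, sushi image classification).

Judging by the summaries, the EffNetB2 backbone has nearly twice as many parameters as EffNetB0.

Create the experiments and set up the training code
First, create two lists and a dictionary:

A list of epoch counts ([5, 10])
A list of the models to test (["effnetb0", "effnetb2"])
A dictionary of the different training DataLoaders

python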

# 1. Create epochs list
num_epochs = [5, 10]

# 2. Create models list (need to create a new model for each experiment)
models = ["effnetb0", "effnetb2"]

# 3. Create dataloaders dictionary for various dataloaders
train_dataloaders = {"data_10_percent": train_dataloader_10_percent,
                     "data_20_percent": train_dataloader_20_percent}

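Now write code that iterates over the options and tries every combination, saving the model at the end of each experiment so the best one can be loaded back later for predictions.

(1) Set the random seeds.

(2) Keep track of the experiment number (handy for the printouts).

(3) Loop through the items of the train_dataloaders dictionary, one per training DataLoader.

(4) Loop through the list of epoch counts.

(5) Loop through the list of model names.

(6) Print information about the experiment currently running, so we know what is happening.

(7) Check which model is the target and create a new EffNetB0 or EffNetB2 instance (a new model instance per experiment, so every model starts from the same point).

(8) Create a new loss function (torch.nn.CrossEntropyLoss()) and optimizer (torch.optim.Adam(params=model.parameters(), lr=0.001)) for each new experiment.

(9) Train the model with the modified train() function, passing the appropriate details to the writer parameter.

(10) Save the trained model to file under an appropriate filename with save_model() from utils.py.

python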

%%time
from going_modular.going_modular.utils import save_model

# 1. Set the random seeds
set_seeds(seed=42)

# 2. Keep track of experiment numbers
experiment_number = 0

# 3. Loop through each DataLoader
for dataloader_name, train_dataloader in train_dataloaders.items():

    # 4. Loop through each number of epochs
    for epochs in num_epochs: 

        # 5. Loop through each model name and create a new model based on the name
        for model_name in models:

            # 6. Create information print outs
            experiment_number += 1
            print(f"[INFO] Experiment number: {experiment_number}")
            print(f"[INFO] Model: {model_name}")
            print(f"[INFO] DataLoader: {dataloader_name}")
            print(f"[INFO] Number of epochs: {epochs}")  

            # 7. Select the model
            if model_name == "effnetb0":
                model = create_effnetb0() # creates a new model each time (important because we want each experiment to start from scratch)
            else:
                model = create_effnetb2() # creates a new model each time (important because we want each experiment to start from scratch)
            
            # 8. Create a new loss and optimizer for every model
            loss_fn = nn.CrossEntropyLoss()
            optimizer = torch.optim.Adam(params=model.parameters(), lr=0.001)

            # 9. Train target model with target dataloaders and track experiments
            train(model=model,
                  train_dataloader=train_dataloader,
                  test_dataloader=test_dataloader, 
                  optimizer=optimizer,
                  loss_fn=loss_fn,
                  epochs=epochs,
                  device=device,
                  writer=create_writer(experiment_name=dataloader_name,
                                       model_name=model_name,
                                       extra=f"{epochs}_epochs"))
            
            # 10. Save the model to file so we can get back the best model
            save_filepath = f"07_{model_name}_{dataloader_name}_{epochs}_epochs.pth"
            save_model(model=model,
                       target_dir="models",
                       model_name=save_filepath)
            print("-"*50 + "\n")
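
After a long run it can be useful to confirm that every experiment actually left a checkpoint behind. The sketch below rebuilds the filenames from the same pattern as save_filepath and reports any that are missing. The helper names expected_checkpoints() and missing_checkpoints() are hypothetical, and the sketch assumes save_model() writes each file directly into target_dir.

```python
from pathlib import Path
from typing import List


def expected_checkpoints(models_dir: str = "models") -> List[Path]:
    """Builds the checkpoint paths the experiment loop is expected to write."""
    paths = []
    for dataloader_name in ["data_10_percent", "data_20_percent"]:
        for epochs in [5, 10]:
            for model_name in ["effnetb0", "effnetb2"]:
                # Same filename pattern as save_filepath in the loop above
                filename = f"07_{model_name}_{dataloader_name}_{epochs}_epochs.pth"
                paths.append(Path(models_dir) / filename)
    return paths


def missing_checkpoints(models_dir: str = "models") -> List[Path]:
    """Returns the expected checkpoint paths that do not exist on disk."""
    return [path for path in expected_checkpoints(models_dir) if not path.is_file()]


missing = missing_checkpoints("models")
print(f"{len(missing)} of {len(expected_checkpoints())} checkpoints missing")
```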

8. Viewing experiments in TensorBoard

python

# View TensorBoard inside a Jupyter or Google Colab notebook
%load_ext tensorboard
%tensorboard --logdir runs

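What matters most is the trend, that is, where your numbers are heading. If they are off by a lot, something has probably gone wrong and it is worth going back to check the code. If they differ only slightly (say, by a few decimal places), that is fine.

Visualizing the test loss of the different modelling experiments in TensorBoard shows that the EffNetB0 model trained for 10 epochs on 20% of the data achieved the lowest loss. This fits the overall trend of the experiments: more data, a larger model and longer training are generally better.

9. Loading the best model and making predictions with it

The largest model achieved the best results, so we can import the best saved model by creating a new instance of EffNetB2 with the create_effnetb2() function and then loading the saved state_dict() with torch.load().

python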

# Setup the best model filepath
best_model_path = "models/07_effnetb2_data_20_percent_10_epochs.pth"

# Instantiate a new instance of EffNetB2 (to load the saved state_dict() to)
best_model = create_effnetb2()

# Load the saved best model state_dict()
best_model.load_state_dict(torch.load(best_model_path))
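
A checkpoint saved from a GPU may fail to load on a CPU-only machine unless torch.load() is told where to place the tensors. The following is a minimal round-trip sketch of the save and load pattern with map_location. It uses a tiny nn.Linear as a stand-in (an assumption made to keep the example small, not the EffNetB2 model above) and a temporary file instead of the models directory.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in model; the article's model would come from create_effnetb2()
tiny_model = nn.Linear(in_features=4, out_features=3)

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "tiny_model.pth")

    # Save only the learned parameters (the state_dict), not the whole model object
    torch.save(tiny_model.state_dict(), checkpoint_path)

    # Create a fresh instance with the same architecture, then load the parameters into it
    loaded_model = nn.Linear(in_features=4, out_features=3)
    state_dict = torch.load(checkpoint_path, map_location="cpu")  # remap tensors to the CPU
    loaded_model.load_state_dict(state_dict)

print(torch.equal(tiny_model.weight, loaded_model.weight))
```

The model can still be moved to a GPU afterwards with .to(device); pred_and_plot_image() does this itself before predicting.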

Check the size of the model file, since a model that is too large is hard to deploy:

python

# Check the model file size
from pathlib import Path

# Get the model size in bytes then convert to megabytes
effnetb2_model_size = Path(best_model_path).stat().st_size // (1024*1024)
print(f"EfficientNetB2 feature extractor model size: {effnetb2_model_size} MB")
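
Integer division with // truncates the result to whole megabytes, which hides differences between models of similar size. The sketch below keeps the fractional part instead. The helper name file_size_mb() is hypothetical.

```python
from pathlib import Path


def file_size_mb(path, ndigits: int = 2) -> float:
    """Returns the size of a file in megabytes (1 MB = 1024 * 1024 bytes here)."""
    size_bytes = Path(path).stat().st_size
    return round(size_bytes / (1024 * 1024), ndigits)
```

Calling file_size_mb(best_model_path) would then report the EffNetB2 checkpoint size with two decimal places.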

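Make some predictions and visualize them:

(A pred_and_plot_image() function was created earlier to make predictions on images with a trained model.)
The pred_and_plot_image() function lives in predictions.py and can be called directly. The code of predictions.py is included here for completeness:

python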

"""
Utility functions to make predictions.

Main reference for code creation: https://www.learnpytorch.io/06_pytorch_transfer_learning/#6-make-predictions-on-images-from-the-test-set 
"""
import torch
import torchvision
from torchvision import transforms
import matplotlib.pyplot as plt

from typing import List, Tuple

from PIL import Image

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Predict on a target image with a target model
# Function created in: https://www.learnpytorch.io/06_pytorch_transfer_learning/#6-make-predictions-on-images-from-the-test-set
def pred_and_plot_image(
    model: torch.nn.Module,
    class_names: List[str],
    image_path: str,
    image_size: Tuple[int, int] = (224, 224),
    transform: torchvision.transforms = None,
    device: torch.device = device,
):
    """Predicts on a target image with a target model.

    Args:
        model (torch.nn.Module): A trained (or untrained) PyTorch model to predict on an image.
        class_names (List[str]): A list of target classes to map predictions to.
        image_path (str): Filepath to target image to predict on.
        image_size (Tuple[int, int], optional): Size to transform target image to. Defaults to (224, 224).
        transform (torchvision.transforms, optional): Transform to perform on image. Defaults to None which uses ImageNet normalization.
        device (torch.device, optional): Target device to perform prediction on. Defaults to device.
    """

    # Open image
    img = Image.open(image_path)

    # Create transformation for image (if one doesn't exist)
    if transform is not None:
        image_transform = transform
    else:
        image_transform = transforms.Compose(
            [
                transforms.Resize(image_size),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )

    ### Predict on image ###

    # Make sure the model is on the target device
    model.to(device)

    # Turn on model evaluation mode and inference mode
    model.eval()
    with torch.inference_mode():
        # Transform and add an extra dimension to image (model requires samples in [batch_size, color_channels, height, width])
        transformed_image = image_transform(img).unsqueeze(dim=0)

        # Make a prediction on image with an extra dimension and send it to the target device
        target_image_pred = model(transformed_image.to(device))

    # Convert logits -> prediction probabilities (using torch.softmax() for multi-class classification)
    target_image_pred_probs = torch.softmax(target_image_pred, dim=1)

    # Convert prediction probabilities -> prediction labels
    target_image_pred_label = torch.argmax(target_image_pred_probs, dim=1)

    # Plot image with predicted label and probability
    plt.figure()
    plt.imshow(img)
    plt.title(
        f"Pred: {class_names[target_image_pred_label]} | Prob: {target_image_pred_probs.max():.3f}"
    )
    plt.axis(False)
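
The last steps of pred_and_plot_image() turn raw model outputs into a class name. The following is a minimal sketch of that conversion in isolation. The logits are made-up numbers used only for illustration, and the class order is an assumption (alphabetical, which is how ImageFolder sorts class folders).

```python
import torch

# Assumed class order for the pizza, steak, sushi dataset
class_names = ["pizza", "steak", "sushi"]

# Made-up logits for one image, shape [batch_size=1, num_classes=3]
logits = torch.tensor([[2.0, 0.5, -1.0]])

# Logits -> prediction probabilities (each row sums to 1)
pred_probs = torch.softmax(logits, dim=1)

# Prediction probabilities -> index of the most likely class
pred_label = torch.argmax(pred_probs, dim=1)

print(f"Probabilities sum to: {pred_probs.sum(dim=1).item():.1f}")
print(f"Predicted class: {class_names[pred_label.item()]}")  # pizza
```

Softmax does not change which class has the largest value, so torch.argmax() on the logits would return the same label; the probabilities are only needed for the plot title.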

Now make predictions on a random selection of test images:

python

# Import function to make predictions on images and plot them 
# See the function previously created in section: https://www.learnpytorch.io/06_pytorch_transfer_learning/#6-make-predictions-on-images-from-the-test-set
from going_modular.going_modular.predictions import pred_and_plot_image

# Get a random list of 3 images from 20% test set
import random
num_images_to_plot = 3
test_image_path_list = list(Path(data_20_percent_path / "test").glob("*/*.jpg")) # get all test image paths from 20% dataset
test_image_path_sample = random.sample(population=test_image_path_list,
                                       k=num_images_to_plot) # randomly select k number of images

# Iterate through random test image paths, make predictions on them and plot them
for image_path in test_image_path_sample:
    pred_and_plot_image(model=best_model,
                        image_path=image_path,
                        class_names=class_names,
                        image_size=(224, 224))
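
The set_seeds() helper defined earlier only seeds torch, so random.sample() will pick different images on every run. If the same images are wanted each time, Python's random module can be seeded separately. This is a minimal sketch; the list of strings is a stand-in for test_image_path_list.

```python
import random

# Stand-in for test_image_path_list (the real list holds Path objects)
image_paths = [f"image_{i}.jpg" for i in range(10)]

# Seeding right before sampling makes the selection repeatable
random.seed(42)
first_sample = random.sample(population=image_paths, k=3)

random.seed(42)
second_sample = random.sample(population=image_paths, k=3)

print(first_sample == second_sample)  # True
```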

Finally, use the best model to predict on a custom image:

python

# Download custom image
import requests

# Setup custom image path
custom_image_path = Path("data/04-pizza-dad.jpeg")

# Download the image if it doesn't already exist
if not custom_image_path.is_file():
    with open(custom_image_path, "wb") as f:
        # When downloading from GitHub, need to use the "raw" file link
        request = requests.get("https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/04-pizza-dad.jpeg")
        print(f"Downloading {custom_image_path}...")
        f.write(request.content)
else:
    print(f"{custom_image_path} already exists, skipping download.")

# Predict on custom image
pred_and_plot_image(model=best_model,
                    image_path=custom_image_path,
                    class_names=class_names)

Extras

Add data augmentation to the list of experiments, using the 20% pizza, steak and sushi training and test datasets:

python

# Note: Data augmentation transform like this should only be performed on training data
train_transform_data_aug = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    normalize
])

# Helper function to view images in a DataLoader (works with data augmentation transforms or not) 
def view_dataloader_images(dataloader, n=10):
    if n > 10:
        print("Having n higher than 10 will create messy plots, lowering to 10.")
        n = 10
    imgs, labels = next(iter(dataloader))
    plt.figure(figsize=(16, 8))
    for i in range(n):
        # Min max scale the image for display purposes
        targ_image = imgs[i]
        sample_min, sample_max = targ_image.min(), targ_image.max()
        sample_scaled = (targ_image - sample_min)/(sample_max - sample_min)

        # Plot images with appropriate axes information
        plt.subplot(1, 10, i+1)
        plt.imshow(sample_scaled.permute(1, 2, 0)) # rearrange [C, H, W] -> [H, W, C] for Matplotlib
        plt.title(class_names[labels[i]])
        plt.axis(False)

# Have to update `create_dataloaders()` to handle different augmentations
import os
from torch.utils.data import DataLoader
from torchvision import datasets

NUM_WORKERS = os.cpu_count() # use maximum number of CPUs for workers to load data 

# Note: this is an updated version of data_setup.create_dataloaders to handle
# different train and test transforms.
def create_dataloaders(
    train_dir, 
    test_dir, 
    train_transform, # add parameter for train transform (transforms on train dataset)
    test_transform,  # add parameter for test transform (transforms on test dataset)
    batch_size=32, num_workers=NUM_WORKERS
):
    # Use ImageFolder to create dataset(s)
    train_data = datasets.ImageFolder(train_dir, transform=train_transform)
    test_data = datasets.ImageFolder(test_dir, transform=test_transform)

    # Get class names
    class_names = train_data.classes

    # Turn images into data loaders
    train_dataloader = DataLoader(
        train_data,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,
    )
    test_dataloader = DataLoader(
        test_data,
        batch_size=batch_size,
        shuffle=False, # no need to shuffle test data
        num_workers=num_workers,
        pin_memory=True,
    )

    return train_dataloader, test_dataloader, class_names