Contents
- 1. Get the data
- 2. Create Datasets and DataLoaders
- 3. Get and customize a pretrained model
- 4. Train the model and track the results
- 5. View the model's results in TensorBoard
- 6. Create a helper function to build SummaryWriter() instances
- 7. Set up a series of modeling experiments
- 8. View the experiments in TensorBoard
- 9. Load the best model and make predictions with it
- Supplement
 
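Machine learning and deep learning are highly experimental: you need to track the results of many combinations of data, model architectures and training regimes.
When you run many different experiments, experiment tracking helps you figure out what works and what doesn't.
If you only run a handful of models, tracking their results in printouts and a few dictionaries may be enough. As the number of experiments grows, though, this simple approach can quickly get out of hand.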
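As a minimal sketch of the simple approach just described, results can be kept in a list of dictionaries and written out as CSV. The experiment names and metric values below are made up purely for illustration, and the helpers `results_to_csv()` and `best_experiment()` are hypothetical names, not part of any library.

```python
import csv
import io

# Hypothetical results for two small experiments (made-up numbers, for illustration only)
experiments = [
    {"name": "effnetb0_5_epochs", "test_loss": 0.62, "test_acc": 0.85},
    {"name": "effnetb2_5_epochs", "test_loss": 0.58, "test_acc": 0.88},
]

def results_to_csv(rows):
    """Serialize a list of result dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=["name", "test_loss", "test_acc"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def best_experiment(rows):
    """Return the result row with the highest test accuracy."""
    return max(rows, key=lambda row: row["test_acc"])

csv_text = results_to_csv(experiments)
print(csv_text)
print(f"Best so far: {best_experiment(experiments)['name']}")
```

This is fine for two rows; with dozens of runs, keeping such a table up to date by hand is exactly the part that becomes error-prone.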
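There are about as many ways to track machine learning experiments as there are experiments to run.

| Method | Setup | Pros | Cons | Cost |
|---|---|---|---|---|
| Python dictionaries, CSV files, printouts | None | Easy to set up, runs in pure Python | Hard to keep track of large numbers of experiments | Free |
| TensorBoard | Minimal, install tensorboard | Extension built into PyTorch, widely recognized and used, scales easily | User experience is not as polished as other options | Free |
| Weights & Biases Experiment Tracking | Minimal, install wandb, make an account | Great user experience, experiments can be made public, tracks almost anything | Requires an external resource outside of PyTorch | Free for personal use |
| MLflow | Minimal, install mlflow and start tracking | Fully open-source MLOps lifecycle management, many integrations | Setting up a remote tracking server is a little harder than with other services | Free |

This post focuses on using TensorBoard to track our experiments.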
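Before starting the experiments, some setup is needed:
(1) Reuse the Python scripts data_setup.py and engine.py from the earlier post on PyTorch modularization. data_setup creates the Datasets and DataLoaders, and engine holds the functions that drive model training; both are copied over unchanged from that post.
(2) Import some basic libraries.
(3) Set up device-agnostic code.
(4) Create a helper function to set the seeds.

```python
# Continue with regular imports
import matplotlib.pyplot as plt
import torch
import torchvision
from torch import nn
from torchvision import transforms
from torchinfo import summary
from going_modular.going_modular import data_setup, engine
```

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
device
```

```python
# Set seeds
def set_seeds(seed: int=42):
    """Sets random seeds for torch operations.
    Args:
        seed (int, optional): Random seed to set. Defaults to 42.
    """
    # Set the seed for general torch operations
    torch.manual_seed(seed)
    # Set the seed for CUDA torch operations (ones that happen on the GPU)
    torch.cuda.manual_seed(seed)
```

With the setup done, the rest of the experiment-tracking workflow is:
- Get the data: FoodVision Mini, the pizza, steak and sushi image classification dataset.
- Create Datasets and DataLoaders: using the imported data_setup script.
- Get and customize a pretrained model: download a pretrained model from torchvision.models and customize it for our own problem.
- Train the model and track the results
- View the model's results in TensorBoard
- Create a helper function to track experiments: write a function that helps us save modeling experiment results.
- Set up a series of modeling experiments: write code that runs multiple experiments in one go, with different models, different amounts of data and different training times.
- View the modeling experiments in TensorBoard
- Load the best model and make predictions with it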
 
1. Get the data
Download the dataset with the code below:

```python
import os
import zipfile
from pathlib import Path
import requests

def download_data(source: str,
                  destination: str,
                  remove_source: bool = True) -> Path:
    """Downloads a zipped dataset from source and unzips to destination.
    Args:
        source (str): A link to a zipped file containing data.
        destination (str): A target directory to unzip data to.
        remove_source (bool): Whether to remove the source after downloading and extracting.
    Returns:
        pathlib.Path to downloaded data.
    Example usage:
        download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                      destination="pizza_steak_sushi")
    """
    # Setup path to data folder
    data_path = Path("data/")
    image_path = data_path / destination
    # If the image folder already exists, skip the download; otherwise download and prepare it
    if image_path.is_dir():
        print(f"[INFO] {image_path} directory exists, skipping download.")
    else:
        print(f"[INFO] Did not find {image_path} directory, creating one...")
        image_path.mkdir(parents=True, exist_ok=True)
        # Download pizza, steak, sushi data
        target_file = Path(source).name
        print(f"[INFO] Downloading {target_file} from {source}...")
        request = requests.get(source)
        request.raise_for_status()  # fail early on a bad HTTP response
        with open(data_path / target_file, "wb") as f:
            f.write(request.content)
        # Unzip pizza, steak, sushi data
        with zipfile.ZipFile(data_path / target_file, "r") as zip_ref:
            print(f"[INFO] Unzipping {target_file} data...")
            zip_ref.extractall(image_path)
        # Remove .zip file
        if remove_source:
            os.remove(data_path / target_file)
    return image_path

image_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")
image_path
```
2. Create Datasets and DataLoaders
The create_dataloaders() function in data_setup.py creates the Datasets and DataLoaders directly.
Before calling it, we need a transform to pass in; the transformed images must match the input format the model expects.
Since we'll be using transfer learning with pretrained models from torchvision.models, we create a transform that prepares our images accordingly.
For how to build this transform, see the two approaches described in the post "[PyTorch] Transfer Learning"; the code below builds it manually:

```python
# Setup directories
train_dir = image_path / "train"
test_dir = image_path / "test"
# Setup ImageNet normalization levels (turns all images into similar distribution as ImageNet)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    normalize
])
print(f"Manually created transforms: {manual_transforms}")
# Create data loaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms, # use manually created transforms
    batch_size=32,
    num_workers=1
)
train_dataloader, test_dataloader, class_names
```
For completeness, here is the automatic approach; use one or the other:

```python
# Setup dirs
train_dir = image_path / "train"
test_dir = image_path / "test"
# Setup pretrained weights (plenty of these available in torchvision.models)
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
# Get transforms from weights (these are the transforms that were used to obtain the weights)
automatic_transforms = weights.transforms()
print(f"Automatically created transforms: {automatic_transforms}")
# Create data loaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=automatic_transforms, # use automatically created transforms
    batch_size=32,
    num_workers=1
)
train_dataloader, test_dataloader, class_names
```
3. Get and customize a pretrained model
Get a pretrained model, freeze the base layers and change the classifier head.
Download the pretrained weights for the torchvision.models.efficientnet_b0() model and prepare it for use with our own data.

```python
# Note: This is how a pretrained model was created in torchvision < 0.13; it is deprecated in newer versions.
# model = torchvision.models.efficientnet_b0(pretrained=True).to(device) # OLD
# Download the pretrained weights for EfficientNet_B0
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT # NEW in torchvision 0.13, "DEFAULT" means "best weights available"
# Setup the model with the pretrained weights and send it to the target device
model = torchvision.models.efficientnet_b0(weights=weights).to(device)
# View the output of the model
# model
```
We freeze the base layers of the model (we use them to extract features from the input images) and change the classifier head (the output layer) to match the number of classes we are working with (we have 3: pizza, steak, sushi).

```python
# Freeze all base layers by setting requires_grad attribute to False
for param in model.features.parameters():
    param.requires_grad = False

# Since we're creating a new layer with random weights (torch.nn.Linear),
# let's set the seeds
set_seeds()
# Update the classifier head to suit our problem and send the whole head to the target device
model.classifier = torch.nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=1280,
              out_features=len(class_names),
              bias=True)
).to(device)
```
With the base layers frozen and the classifier head changed, use torchinfo.summary() to get a summary of our model:

```python
from torchinfo import summary

# Print a summary of the model using torchinfo
summary(model=model,
        input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape" (batch_size, color_channels, height, width)
        # col_names=["input_size"], # uncomment for smaller output
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
)
```
4. Train the model and track the results
Create the loss function and optimizer:

```python
# Define loss and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```
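Adjust the train() function (from engine.py) to track results with SummaryWriter():
PyTorch's torch.utils.tensorboard.SummaryWriter() class can save various parts of a model's training progress to file.
By default, SummaryWriter() saves information about the model to the directory set by the log_dir parameter.
The default log_dir is runs/CURRENT_DATETIME_HOSTNAME, where HOSTNAME is the name of your computer. You can change where experiments are tracked (the directory name can be customized as you like).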
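To make the runs/CURRENT_DATETIME_HOSTNAME pattern concrete, here is a small standard-library sketch that builds a directory name of that shape. The helper `default_style_log_dir()` is a hypothetical name used only for illustration, and the exact timestamp format is an assumption that may differ from what SummaryWriter() produces.

```python
import os
import socket
from datetime import datetime

def default_style_log_dir(root: str = "runs") -> str:
    """Build a directory name in the runs/CURRENT_DATETIME_HOSTNAME style.

    Illustrative helper only (not part of PyTorch); the timestamp
    format is an assumption.
    """
    current_time = datetime.now().strftime("%b%d_%H-%M-%S")
    return os.path.join(root, f"{current_time}_{socket.gethostname()}")

log_dir = default_style_log_dir()
print(log_dir)
```

A custom location is chosen the same way: build any path string you like and pass it as SummaryWriter(log_dir=...).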
The output of SummaryWriter() is saved in TensorBoard format.
Create a default SummaryWriter() instance:

```python
from torch.utils.tensorboard import SummaryWriter
# Create a writer with all default settings
writer = SummaryWriter()
```
Take the train() function from engine.py and adjust it to use writer, adding the ability to log the model's training and test loss and accuracy values.
We can do this with writer.add_scalars(main_tag, tag_scalar_dict), where:
- main_tag (string): the name of the scalars being tracked (e.g. "Accuracy")
- tag_scalar_dict (dict): a dictionary of the values being tracked (e.g. {"train_loss": 0.3454})

The method is called add_scalars() because our loss and accuracy values are usually scalars (single values).
Once we've finished tracking values, we call writer.close() to tell the writer to stop looking for values to track.
To start modifying train(), we also import train_step() and test_step() from engine.py.

```python
from typing import Dict, List
from tqdm.auto import tqdm
from going_modular.going_modular.engine import train_step, test_step

# Import train() function from:
# https://github.com/mrdbourke/pytorch-deep-learning/blob/main/going_modular/going_modular/engine.py
def train(model: torch.nn.Module,
          train_dataloader: torch.utils.data.DataLoader,
          test_dataloader: torch.utils.data.DataLoader,
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device) -> Dict[str, List]:
    """Trains and tests a PyTorch model.
    Passes a target PyTorch model through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.
    Calculates, prints and stores evaluation metrics throughout.
    Note: logs to the module-level `writer` created above.
    Args:
      model: A PyTorch model to be trained and tested.
      train_dataloader: A DataLoader instance for the model to be trained on.
      test_dataloader: A DataLoader instance for the model to be tested on.
      optimizer: A PyTorch optimizer to help minimize the loss function.
      loss_fn: A PyTorch loss function to calculate loss on both datasets.
      epochs: An integer indicating how many epochs to train for.
      device: A target device to compute on (e.g. "cuda" or "cpu").

    Returns:
      A dictionary of training and testing loss as well as training and
      testing accuracy metrics. Each metric has a value in a list for
      each epoch.
      In the form: {train_loss: [...],
                train_acc: [...],
                test_loss: [...],
                test_acc: [...]}
      For example if training for epochs=2:
              {train_loss: [2.0616, 1.0537],
                train_acc: [0.3945, 0.3945],
                test_loss: [1.2641, 1.5706],
                test_acc: [0.3400, 0.2973]}
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }
    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                           dataloader=train_dataloader,
                                           loss_fn=loss_fn,
                                           optimizer=optimizer,
                                           device=device)
        test_loss, test_acc = test_step(model=model,
                                        dataloader=test_dataloader,
                                        loss_fn=loss_fn,
                                        device=device)
        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )
        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)
        ### New: Experiment tracking ###
        # Add loss results to SummaryWriter
        writer.add_scalars(main_tag="Loss",
                           tag_scalar_dict={"train_loss": train_loss,
                                            "test_loss": test_loss},
                           global_step=epoch)
        # Add accuracy results to SummaryWriter
        writer.add_scalars(main_tag="Accuracy",
                           tag_scalar_dict={"train_acc": train_acc,
                                            "test_acc": test_acc},
                           global_step=epoch)

        # Track the PyTorch model architecture
        writer.add_graph(model=model,
                         # Pass in an example input
                         input_to_model=torch.randn(32, 3, 224, 224).to(device))

    # Close the writer
    writer.close()

    ### End new ###
    # Return the filled results at the end of the epochs
    return results
```
Try it out for 5 epochs:

```python
# Train model
# Note: Not using engine.train() since the original script isn't updated to use writer
set_seeds()
results = train(model=model,
                train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                optimizer=optimizer,
                loss_fn=loss_fn,
                epochs=5,
                device=device)
```
        
The model's results are also tracked in the returned dictionary:

```python
# Check out the model results
results
```
5. View the model's results in TensorBoard
By default, the SummaryWriter() class stores the model's results in TensorBoard format in a directory called runs/.
There are several ways to view TensorBoard.
In a Jupyter notebook, make sure TensorBoard is installed, load it with %load_ext tensorboard, then view the results with %tensorboard --logdir DIR_WITH_LOGS.

```python
# Example code to run in a Jupyter or Google Colab notebook
%load_ext tensorboard
%tensorboard --logdir runs
```
In Colab, TensorBoard is displayed inline in the cell output.

6. Create a helper function to build SummaryWriter() instances
The SummaryWriter() class logs information to the directory given by the log_dir parameter, so we create a helper function that builds a custom directory for each experiment.
Each experiment gets its own log directory, tracking the following:
- Experiment date/timestamp - when did the experiment take place?
- Experiment name - is there something we'd like to call the experiment?
- Model name - what model was used?
- Extra - should anything else be tracked?

Let's create a helper function called create_writer() that returns a SummaryWriter() instance logging to a custom log_dir.
Ideally, we'd like log_dir to look like: runs/YYYY-MM-DD/experiment_name/model_name/extra

```python
import os
from datetime import datetime
from typing import Optional
from torch.utils.tensorboard import SummaryWriter

# Note: the return annotation is the class itself; writing SummaryWriter() there
# would create a writer (and a runs/ directory) when the function is defined.
def create_writer(experiment_name: str,
                  model_name: str,
                  extra: Optional[str] = None) -> SummaryWriter:
    """Creates a torch.utils.tensorboard.writer.SummaryWriter() instance saving to a specific log_dir.
    log_dir is a combination of runs/timestamp/experiment_name/model_name/extra.
    Where timestamp is the current date in YYYY-MM-DD format.
    Args:
        experiment_name (str): Name of experiment.
        model_name (str): Name of model.
        extra (str, optional): Anything extra to add to the directory. Defaults to None.
    Returns:
        torch.utils.tensorboard.writer.SummaryWriter(): Instance of a writer saving to log_dir.
    Example usage:
        # Create a writer saving to "runs/2022-06-04/data_10_percent/effnetb2/5_epochs/"
        writer = create_writer(experiment_name="data_10_percent",
                               model_name="effnetb2",
                               extra="5_epochs")
        # The above is the same as:
        writer = SummaryWriter(log_dir="runs/2022-06-04/data_10_percent/effnetb2/5_epochs/")
    """
    # Get timestamp of current date (all experiments on certain day live in same folder)
    timestamp = datetime.now().strftime("%Y-%m-%d") # returns current date in YYYY-MM-DD format
    if extra:
        # Create log directory path
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name, extra)
    else:
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name)

    print(f"[INFO] Created SummaryWriter, saving to: {log_dir}...")
    return SummaryWriter(log_dir=log_dir)
```

```python
# Create an example writer
example_writer = create_writer(experiment_name="data_10_percent",
                               model_name="effnetb0",
                               extra="5_epochs")
```
        
- Update the train() function to include a writer parameter

To adjust train(), we add a writer parameter to the function, then add some code that checks whether a writer was passed in and, if so, logs our information to it.

```python
from typing import Dict, List, Optional
from tqdm.auto import tqdm

# Add writer parameter to train()
def train(model: torch.nn.Module,
          train_dataloader: torch.utils.data.DataLoader,
          test_dataloader: torch.utils.data.DataLoader,
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device,
          writer: Optional[torch.utils.tensorboard.writer.SummaryWriter] = None # new parameter to take in a writer
          ) -> Dict[str, List]:
    """Trains and tests a PyTorch model.
    Passes a target PyTorch model through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.
    Calculates, prints and stores evaluation metrics throughout.
    Stores metrics to specified writer log_dir if present.
    Args:
      model: A PyTorch model to be trained and tested.
      train_dataloader: A DataLoader instance for the model to be trained on.
      test_dataloader: A DataLoader instance for the model to be tested on.
      optimizer: A PyTorch optimizer to help minimize the loss function.
      loss_fn: A PyTorch loss function to calculate loss on both datasets.
      epochs: An integer indicating how many epochs to train for.
      device: A target device to compute on (e.g. "cuda" or "cpu").
      writer: An optional SummaryWriter() instance to log model results to.
    Returns:
      A dictionary of training and testing loss as well as training and
      testing accuracy metrics. Each metric has a value in a list for
      each epoch.
      In the form: {train_loss: [...],
                train_acc: [...],
                test_loss: [...],
                test_acc: [...]}
      For example if training for epochs=2:
              {train_loss: [2.0616, 1.0537],
                train_acc: [0.3945, 0.3945],
                test_loss: [1.2641, 1.5706],
                test_acc: [0.3400, 0.2973]}
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }
    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                           dataloader=train_dataloader,
                                           loss_fn=loss_fn,
                                           optimizer=optimizer,
                                           device=device)
        test_loss, test_acc = test_step(model=model,
                                        dataloader=test_dataloader,
                                        loss_fn=loss_fn,
                                        device=device)
        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )
        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)
        ### New: Use the writer parameter to track experiments ###
        # See if there's a writer, if so, log to it
        if writer:
            # Add results to SummaryWriter
            writer.add_scalars(main_tag="Loss",
                               tag_scalar_dict={"train_loss": train_loss,
                                                "test_loss": test_loss},
                               global_step=epoch)
            writer.add_scalars(main_tag="Accuracy",
                               tag_scalar_dict={"train_acc": train_acc,
                                                "test_acc": test_acc},
                               global_step=epoch)
    # Close the writer once, after the last epoch has been logged
    if writer:
        writer.close()
    ### End new ###
    # Return the filled results at the end of the epochs
    return results
```
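7. Set up a series of modeling experiments
- What kind of experiments to run:

Every hyperparameter is a starting point for a different experiment:
- Change the number of epochs
- Change the number of layers/hidden units
- Change the amount of data
- Change the learning rate
- Try different kinds of data augmentation
- Choose a different model architecture

Generally, the bigger your model (more learnable parameters) and the more data you have (more opportunities to learn), the better the performance.
However, start small, and scale up only once something works.
- Which experiments to run:

The goal is to improve the model powering FoodVision Mini without it getting too big.
In other words, the ideal model reaches a high test-set accuracy (90%+) without taking too long to train or to perform inference (make predictions).
We'll try combinations of:
- Different amounts of data (10% vs 20% of pizza, steak, sushi)
- Different models (torchvision.models.efficientnet_b0 vs torchvision.models.efficientnet_b2)
- Different training times (5 epochs vs 10 epochs)

This gives the following experiments:

| Experiment number | Training Dataset | Model (pretrained on ImageNet) | Number of epochs |
|---|---|---|---|
| 1 | Pizza, Steak, Sushi 10% | EfficientNetB0 | 5 |
| 2 | Pizza, Steak, Sushi 10% | EfficientNetB2 | 5 |
| 3 | Pizza, Steak, Sushi 10% | EfficientNetB0 | 10 |
| 4 | Pizza, Steak, Sushi 10% | EfficientNetB2 | 10 |
| 5 | Pizza, Steak, Sushi 20% | EfficientNetB0 | 5 |
| 6 | Pizza, Steak, Sushi 20% | EfficientNetB2 | 5 |
| 7 | Pizza, Steak, Sushi 20% | EfficientNetB0 | 10 |
| 8 | Pizza, Steak, Sushi 20% | EfficientNetB2 | 10 |

Note that the experiments scale up gradually: across them we slowly increase the amount of data, the model size and the training time. By the end, compared with experiment 1, experiment 8 uses double the data, double the model size and double the training length.
These are only a small subset of the possible options. Since we can't test everything, it's best to try a few things first and then follow the ones that work best.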
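The eight rows of the experiment table can be enumerated programmatically. The sketch below uses only the standard library; the dictionary keys ("number", "dataset", "model", "epochs") are illustrative choices, not names from any library.

```python
from itertools import product

# The three experiment dimensions from the table above
datasets = ["data_10_percent", "data_20_percent"]
epoch_options = [5, 10]
model_names = ["effnetb0", "effnetb2"]

# product() varies the last argument fastest, so the dataset changes slowest
# and the model changes fastest, matching the order of the table
experiments = [
    {"number": number, "dataset": dataset, "model": model_name, "epochs": epochs}
    for number, (dataset, epochs, model_name) in enumerate(
        product(datasets, epoch_options, model_names), start=1
    )
]

for experiment in experiments:
    print(experiment)
```

Adding a fourth dimension (say, a learning rate) doubles the grid again, which is why it pays to start with a small number of options per dimension.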
The full dataset these subsets come from: Food101
- Download the different-sized datasets

We need two versions of the training set: 10% and 20%.
For the test set, every experiment uses the test split of the 10% dataset, to keep the comparison consistent.
Download code:

```python
# Download 10 percent and 20 percent training data (if necessary)
data_10_percent_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                                     destination="pizza_steak_sushi")
data_20_percent_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi_20_percent.zip",
                                     destination="pizza_steak_sushi_20_percent")
```
Create separate training directory paths but only one test directory path, since all experiments use the same test dataset (the test split of the pizza, steak, sushi 10% data):

```python
# Setup training directory paths
train_dir_10_percent = data_10_percent_path / "train"
train_dir_20_percent = data_20_percent_path / "train"
# Setup testing directory paths (note: use the same test dataset for both to compare the results)
test_dir = data_10_percent_path / "test"
# Check the directories
print(f"Training directory 10%: {train_dir_10_percent}")
print(f"Training directory 20%: {train_dir_20_percent}")
print(f"Testing directory: {test_dir}")
```
- Transform the datasets and create DataLoaders

We create a series of transforms to prepare the images for the models. To stay consistent, we build one transform manually and use it for all of the datasets.
(1) Resize all the images (we start with 224, 224, but this can be changed).
(2) Turn them into tensors with values between 0 and 1.
(3) Normalize them so that their distribution is in line with the ImageNet dataset (we do this because our models from torchvision.models were pretrained on ImageNet).

```python
from torchvision import transforms
# Create a transform to normalize data distribution to be inline with ImageNet
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], # values per colour channel [red, green, blue]
                                 std=[0.229, 0.224, 0.225]) # values per colour channel [red, green, blue]
# Compose transforms into a pipeline
simple_transform = transforms.Compose([
    transforms.Resize((224, 224)), # 1. Resize the images
    transforms.ToTensor(), # 2. Turn the images into tensors with values between 0 & 1
    normalize # 3. Normalize the images so their distributions match the ImageNet dataset
])
```
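The normalization step in the pipeline above is plain per-channel arithmetic, (value - mean) / std, and can be checked by hand. The sketch below works on a single pixel in pure Python; `normalize_pixel()` is a hypothetical helper for illustration and is not how torchvision implements the transform internally.

```python
# ImageNet per-channel statistics used in the transform above
IMAGENET_MEAN = [0.485, 0.456, 0.406]  # red, green, blue
IMAGENET_STD = [0.229, 0.224, 0.225]   # red, green, blue

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel whose values are in [0, 1]."""
    return [(value - mean) / std
            for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]

# A mid-gray pixel after ToTensor() scaling: 0.5 in every channel
normalized = normalize_pixel([0.5, 0.5, 0.5])
print([round(value, 3) for value in normalized])
```

A pixel exactly equal to the channel means maps to zero in every channel, which is the sense in which the data ends up centered like ImageNet.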
Use the create_dataloaders() function in data_setup.py to create the DataLoaders, sharing the same test_dataloader (to keep comparisons consistent):

```python
BATCH_SIZE = 32
# Create 10% training and test DataLoaders
train_dataloader_10_percent, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir_10_percent,
    test_dir=test_dir,
    transform=simple_transform,
    batch_size=BATCH_SIZE
)
# Create 20% training and test DataLoaders
train_dataloader_20_percent, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir_20_percent,
    test_dir=test_dir,
    transform=simple_transform,
    batch_size=BATCH_SIZE
)
# Find the number of samples/batches per dataloader (using the same test_dataloader for both experiments)
print(f"Number of batches of size {BATCH_SIZE} in 10 percent training data: {len(train_dataloader_10_percent)}")
print(f"Number of batches of size {BATCH_SIZE} in 20 percent training data: {len(train_dataloader_20_percent)}")
# Fixed: report the test DataLoader's length here, not the 10% training DataLoader's
print(f"Number of batches of size {BATCH_SIZE} in testing data: {len(test_dataloader)} (all experiments will use the same test set)")
print(f"Number of classes: {len(class_names)}, class names: {class_names}")
```
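- Create feature extractor models

We create two feature extractor models:
- torchvision.models.efficientnet_b0() pretrained backbone + custom classifier head (EffNetB0 for short).
- torchvision.models.efficientnet_b2() pretrained backbone + custom classifier head (EffNetB2 for short).

To do so, we freeze the base layers (the feature layers) and update the model's classifier head (the output layer) to suit our problem.
The in_features parameter of the EffNetB0 classifier head is 1280 (the backbone turns an input image into a feature vector of size 1280).
Since EffNetB2 has a different number of layers and parameters, we need to adapt to it accordingly.
We can find the input and output shapes of EffNetB2 by calling torchinfo.summary() with input_size=(32, 3, 224, 224), where (32, 3, 224, 224) means (batch_size, color_channels, height, width), i.e. an example of what a single batch of data passed to the model looks like.
To find the input shape required by the final layer of EffNetB2:
(1) Create an instance of torchvision.models.efficientnet_b2() with pretrained weights.
(2) Run torchinfo.summary() to see the various input and output shapes.
(3) Print the number of in_features by inspecting the state_dict() of the classifier portion of EffNetB2 and printing the length of one row of the weight matrix. (Alternatively, just inspect the output of effnetb2.classifier.)

Why this works: thanks to the torch.nn.AdaptiveAvgPool2d() layer, many modern models can handle input images of different sizes; the layer adaptively pools whatever input it is given down to a fixed output_size. You can try this out by passing input images of different sizes to torchinfo.summary(), or to your own model that uses this layer.

```python
import torchvision
from torchinfo import summary
# 1. Create an instance of EffNetB2 with pretrained weights
effnetb2_weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT # "DEFAULT" means best available weights
effnetb2 = torchvision.models.efficientnet_b2(weights=effnetb2_weights)
# # 2. Get a summary of standard EffNetB2 from torchvision.models (uncomment for full output)
# summary(model=effnetb2,
#         input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
#         # col_names=["input_size"], # uncomment for smaller output
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# )
# 3. Get the number of in_features of the EfficientNetB2 classifier layer
#    (each row of the Linear layer's weight matrix has in_features entries)
print(f"Number of in_features to final layer of EfficientNetB2: {len(effnetb2.classifier.state_dict()['1.weight'][0])}")
```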
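To see why a backbone ending in torch.nn.AdaptiveAvgPool2d() tolerates different input sizes, the small sketch below pools two feature maps with different spatial sizes. The tensor shapes are hypothetical stand-ins chosen for illustration, not values read from EfficientNet.

```python
import torch
from torch import nn

# Pool any spatial size down to 1x1 per channel
pool = nn.AdaptiveAvgPool2d(output_size=1)

# Two hypothetical feature maps: same channel count, different spatial sizes
small_features = torch.randn(1, 1408, 7, 7)
large_features = torch.randn(1, 1408, 9, 9)

small_out = pool(small_features)
large_out = pool(large_features)

# Both outputs have the same shape, so the classifier head that follows
# always receives a feature vector of the same length
print(small_out.shape, large_out.shape)
```

With output_size=1 the layer is simply a per-channel mean over height and width, which is why the number of channels alone determines the classifier's in_features.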
        
The model summary for EffNetB2 at this point shows all layers unfrozen (trainable) and the default classifier head from ImageNet pretraining.
Now that we know the number of in_features required for EffNetB2, let's create a couple of helper functions to set up the EffNetB0 and EffNetB2 feature extractor models.
Each function should:
(1) Get the base model from torchvision.models
(2) Freeze the base layers in the model (set requires_grad=False)
(3) Set the random seeds
(4) Change the classifier head (to suit our problem)
(5) Give the model a name (e.g. "effnetb0" for EffNetB0)

```python
import torchvision
from torch import nn
# Get num out features (one for each class pizza, steak, sushi)
OUT_FEATURES = len(class_names)

# Create an EffNetB0 feature extractor
def create_effnetb0():
    # 1. Get the base model with pretrained weights and send to target device
    weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
    model = torchvision.models.efficientnet_b0(weights=weights).to(device)
    # 2. Freeze the base model layers
    for param in model.features.parameters():
        param.requires_grad = False
    # 3. Set the seeds
    set_seeds()
    # 4. Change the classifier head
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.2),
        nn.Linear(in_features=1280, out_features=OUT_FEATURES)
    ).to(device)
    # 5. Give the model a name
    model.name = "effnetb0"
    print(f"[INFO] Created new {model.name} model.")
    return model

# Create an EffNetB2 feature extractor
def create_effnetb2():
    # 1. Get the base model with pretrained weights and send to target device
    weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT
    model = torchvision.models.efficientnet_b2(weights=weights).to(device)
    # 2. Freeze the base model layers
    for param in model.features.parameters():
        param.requires_grad = False
    # 3. Set the seeds
    set_seeds()
    # 4. Change the classifier head
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.3),
        nn.Linear(in_features=1408, out_features=OUT_FEATURES)
    ).to(device)
    # 5. Give the model a name
    model.name = "effnetb2"
    print(f"[INFO] Created new {model.name} model.")
    return model
```
Create instances of EffNetB0 and EffNetB2 and check their summary() output to test them:

```python
effnetb0 = create_effnetb0()
# Get an output summary of the layers in our EffNetB0 feature extractor model
summary(model=effnetb0,
        input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
        # col_names=["input_size"], # uncomment for smaller output
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
)
```
        
Model summary of EffNetB0 with the base layers frozen (untrainable) and the classifier head updated for pizza, steak and sushi image classification.
            
            
              python
              
              
            
          
          effnetb2 = create_effnetb2()
# Get an output summary of the layers in our EffNetB2 feature extractor model (uncomment to view full output)
summary(model=effnetb2, 
        input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
        # col_names=["input_size"], # uncomment for smaller output
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
) 
        
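Model summary of EffNetB2 with the base layers frozen (untrainable) and the classifier head updated for pizza, steak and sushi image classification.
Judging from the summary output, the EffNetB2 backbone has nearly twice as many parameters as the EffNetB0 backbone.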
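Because the backbone is frozen, the only trainable parameters in each model live in the new nn.Linear classifier head (nn.Dropout has none). As a rough check on the "trainable" column of summary(), the head sizes can be worked out by hand. The sketch below assumes three classes (pizza, steak, sushi); linear_params is a hypothetical helper, not part of PyTorch.

```python
# Parameter count of a fully connected layer: one weight per (input, output)
# pair plus one bias per output. `linear_params` is an illustrative helper.
def linear_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features

OUT_FEATURES = 3  # assumption: pizza, steak, sushi

effnetb0_head = linear_params(1280, OUT_FEATURES)
effnetb2_head = linear_params(1408, OUT_FEATURES)

print(f"EffNetB0 trainable head parameters: {effnetb0_head}")  # 3843
print(f"EffNetB2 trainable head parameters: {effnetb2_head}")  # 4227
```

If the trainable totals reported by summary() differ from these figures, some backbone layers are probably still unfrozen.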
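- Create experiments and set up training code
Start by creating two lists and a dictionary:
- A list of epoch counts to test ([5, 10])
- A list of the models to test (["effnetb0", "effnetb2"])
- A dictionary of the different training DataLoaders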
 
            
            
              python
              
              
            
          
          # 1. Create epochs list
num_epochs = [5, 10]
# 2. Create models list (need to create a new model for each experiment)
models = ["effnetb0", "effnetb2"]
# 3. Create dataloaders dictionary for various dataloaders
train_dataloaders = {"data_10_percent": train_dataloader_10_percent,
                     "data_20_percent": train_dataloader_20_percent}
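Now write code that iterates over every option and tries each combination, saving the model at the end of each experiment so the best model can be loaded back later and used for predictions.
(1) Set the random seeds.
(2) Keep track of the experiment number (handy for the printouts).
(3) Loop through the items of the train_dataloaders dictionary, one per training DataLoader.
(4) Loop through the list of epoch counts.
(5) Loop through the list of model names.
(6) Print information about the experiment currently running so we know what is happening.
(7) Check which model is the target and create a new EffNetB0 or EffNetB2 instance (a new model instance per experiment, so every model starts from the same starting point).
(8) Create a new loss function (torch.nn.CrossEntropyLoss()) and optimizer (torch.optim.Adam(params=model.parameters(), lr=0.001)) for each new experiment.
(9) Train the model with the modified train() function, passing the appropriate details to the writer parameter.
(10) Save the trained model to file under an appropriate filename with save_model() from utils.py.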
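Before committing to a long training run, the loop structure described in the steps above can be dry-run without any models or DataLoaders. This is a minimal sketch that only enumerates the combinations and the checkpoint names they would produce; dataloader_names stands in for the keys of the train_dataloaders dictionary, and the filename pattern mirrors the one used by the training code.

```python
from itertools import product

num_epochs = [5, 10]
models = ["effnetb0", "effnetb2"]
# Only the dictionary keys are needed for a dry run (no DataLoaders required)
dataloader_names = ["data_10_percent", "data_20_percent"]

# Same nesting order as the steps above: DataLoader -> epochs -> model
experiments = list(product(dataloader_names, num_epochs, models))

for experiment_number, (dataloader_name, epochs, model_name) in enumerate(experiments, start=1):
    save_name = f"07_{model_name}_{dataloader_name}_{epochs}_epochs.pth"
    print(f"[INFO] Experiment {experiment_number}: {save_name}")

print(f"Total experiments: {len(experiments)}")
```

Two DataLoaders, two epoch settings and two models give eight experiments in total, so adding one more option to any list increases the training time noticeably.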
            
            
              python
              
              
            
          
          %%time
from going_modular.going_modular.utils import save_model
# 1. Set the random seeds
set_seeds(seed=42)
# 2. Keep track of experiment numbers
experiment_number = 0
# 3. Loop through each DataLoader
for dataloader_name, train_dataloader in train_dataloaders.items():
    # 4. Loop through each number of epochs
    for epochs in num_epochs: 
        # 5. Loop through each model name and create a new model based on the name
        for model_name in models:
            # 6. Create information print outs
            experiment_number += 1
            print(f"[INFO] Experiment number: {experiment_number}")
            print(f"[INFO] Model: {model_name}")
            print(f"[INFO] DataLoader: {dataloader_name}")
            print(f"[INFO] Number of epochs: {epochs}")  
            # 7. Select the model
            if model_name == "effnetb0":
                model = create_effnetb0() # creates a new model each time (important because we want each experiment to start from scratch)
            else:
                model = create_effnetb2() # creates a new model each time (important because we want each experiment to start from scratch)
            
            # 8. Create a new loss and optimizer for every model
            loss_fn = nn.CrossEntropyLoss()
            optimizer = torch.optim.Adam(params=model.parameters(), lr=0.001)
            # 9. Train target model with target dataloaders and track experiments
            train(model=model,
                  train_dataloader=train_dataloader,
                  test_dataloader=test_dataloader, 
                  optimizer=optimizer,
                  loss_fn=loss_fn,
                  epochs=epochs,
                  device=device,
                  writer=create_writer(experiment_name=dataloader_name,
                                       model_name=model_name,
                                       extra=f"{epochs}_epochs"))
            
            # 10. Save the model to file so we can get back the best model
            save_filepath = f"07_{model_name}_{dataloader_name}_{epochs}_epochs.pth"
            save_model(model=model,
                       target_dir="models",
                       model_name=save_filepath)
            print("-"*50 + "\n")
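One thing to note about step 7: the else branch builds EffNetB2 for any name that is not "effnetb0", so a typo in the models list would go unnoticed. A possible alternative, sketched here with stand-in factory functions (the real create_effnetb0() and create_effnetb2() would take their place), is a dictionary lookup that fails loudly on unknown names. MODEL_FACTORIES and create_model are hypothetical names.

```python
from typing import Callable, Dict

# Stand-ins for the real factories so the sketch runs without PyTorch
def create_effnetb0() -> str:
    return "effnetb0 instance"

def create_effnetb2() -> str:
    return "effnetb2 instance"

# Hypothetical registry: model name -> function that builds a fresh model
MODEL_FACTORIES: Dict[str, Callable[[], str]] = {
    "effnetb0": create_effnetb0,
    "effnetb2": create_effnetb2,
}

def create_model(model_name: str) -> str:
    """Build a fresh model by name, raising on names that are not registered."""
    if model_name not in MODEL_FACTORIES:
        raise ValueError(f"Unknown model name: {model_name!r}")
    return MODEL_FACTORIES[model_name]()

print(create_model("effnetb2"))
```

Adding a third architecture then only requires one new dictionary entry rather than another branch in the loop.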
        

8. View experiments in TensorBoard
            
            
              python
              
              
            
          
          # Viewing TensorBoard in Jupyter and Google Colab Notebooks (uncomment to view full TensorBoard instance)
%load_ext tensorboard
%tensorboard --logdir runs
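What matters most is the trend: where your numbers are heading. If your numbers are far off from the ones shown here, something has probably gone wrong and it is worth going back to check the code. If they differ only slightly (say, by a few decimal places), that is fine.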


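Visualizing the test loss of the different modelling experiments in TensorBoard shows that the EffNetB0 model trained for 10 epochs on 20% of the data achieved the lowest loss. This fits the overall trend of the experiments: more data, a larger model and longer training generally give better results.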
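TensorBoard labels each run by its log directory, so the names passed to create_writer() decide how easy the runs are to tell apart. The sketch below assumes the helper from section 6 nests directories as runs/DATE/EXPERIMENT_NAME/MODEL_NAME/EXTRA; that layout and the build_log_dir name are assumptions for illustration, so check them against your own helper.

```python
import os
from datetime import datetime
from typing import Optional

def build_log_dir(experiment_name: str, model_name: str, extra: Optional[str] = None) -> str:
    """Illustrative only: compose a nested log directory path for one run."""
    timestamp = datetime.now().strftime("%Y-%m-%d")  # group runs by day
    parts = ["runs", timestamp, experiment_name, model_name]
    if extra:
        parts.append(extra)
    return os.path.join(*parts)

log_dir = build_log_dir("data_20_percent", "effnetb2", extra="10_epochs")
print(log_dir)
```

With a layout like this, each of the experiments appears as its own entry in the TensorBoard run list, and the path alone says which data, model and epoch count produced it.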


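9. Load the best model and make predictions with it
The largest model achieved the best overall results. To import the best saved model, create a new EffNetB2 instance with create_effnetb2(), then load the saved state_dict() with torch.load().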
            
            
              python
              
              
            
          
          # Setup the best model filepath
best_model_path = "models/07_effnetb2_data_20_percent_10_epochs.pth"
# Instantiate a new instance of EffNetB2 (to load the saved state_dict() to)
best_model = create_effnetb2()
# Load the saved best model state_dict()
best_model.load_state_dict(torch.load(best_model_path, map_location=device)) # map_location lets a GPU-saved checkpoint load on a CPU-only machine
Check the model file size; a model that is too large is hard to deploy:
            
            
              python
              
              
            
          
          # Check the model file size
from pathlib import Path
# Get the model size in bytes then convert to megabytes
effnetb2_model_size = Path(best_model_path).stat().st_size // (1024*1024)
print(f"EfficientNetB2 feature extractor model size: {effnetb2_model_size} MB")
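Note that the floor division (//) above rounds the size down to a whole number of megabytes. A small sketch with a throwaway file (written to a temporary directory, not a real model) shows the difference between floor and true division:

```python
import tempfile
from pathlib import Path

BYTES_PER_MB = 1024 * 1024

with tempfile.TemporaryDirectory() as tmp_dir:
    # Dummy 1.5 MB file standing in for a saved model
    dummy_path = Path(tmp_dir) / "dummy_model.pth"
    dummy_path.write_bytes(b"\0" * (BYTES_PER_MB * 3 // 2))

    size_bytes = dummy_path.stat().st_size
    size_mb_floor = size_bytes // BYTES_PER_MB  # whole megabytes only
    size_mb_exact = size_bytes / BYTES_PER_MB   # keeps the fractional part

print(f"Floor division: {size_mb_floor} MB")
print(f"True division: {size_mb_exact:.2f} MB")
```

For comparing models of similar size, true division with a couple of decimal places is usually the more informative figure.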
        
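Make some predictions and visualize them:
The pred_and_plot_image() function uses a trained model to make a prediction on an image.
It lives in predictions.py and can be imported directly; the code of predictions.py is included here for completeness: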
            
            
              python
              
              
            
          
          """
Utility functions to make predictions.
Main reference for code creation: https://www.learnpytorch.io/06_pytorch_transfer_learning/#6-make-predictions-on-images-from-the-test-set 
"""
import torch
import torchvision
from torchvision import transforms
import matplotlib.pyplot as plt
from typing import List, Tuple
from PIL import Image
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
# Predict on a target image with a target model
# Function created in: https://www.learnpytorch.io/06_pytorch_transfer_learning/#6-make-predictions-on-images-from-the-test-set
def pred_and_plot_image(
    model: torch.nn.Module,
    class_names: List[str],
    image_path: str,
    image_size: Tuple[int, int] = (224, 224),
    transform: torchvision.transforms = None,
    device: torch.device = device,
):
    """Predicts on a target image with a target model.
    Args:
        model (torch.nn.Module): A trained (or untrained) PyTorch model to predict on an image.
        class_names (List[str]): A list of target classes to map predictions to.
        image_path (str): Filepath to target image to predict on.
        image_size (Tuple[int, int], optional): Size to transform target image to. Defaults to (224, 224).
        transform (torchvision.transforms, optional): Transform to perform on image. Defaults to None which uses ImageNet normalization.
        device (torch.device, optional): Target device to perform prediction on. Defaults to device.
    """
    # Open image
    img = Image.open(image_path)
    # Create transformation for image (if one doesn't exist)
    if transform is not None:
        image_transform = transform
    else:
        image_transform = transforms.Compose(
            [
                transforms.Resize(image_size),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ### Predict on image ###
    # Make sure the model is on the target device
    model.to(device)
    # Turn on model evaluation mode and inference mode
    model.eval()
    with torch.inference_mode():
        # Transform and add an extra dimension to image (model requires samples in [batch_size, color_channels, height, width])
        transformed_image = image_transform(img).unsqueeze(dim=0)
        # Make a prediction on image with an extra dimension and send it to the target device
        target_image_pred = model(transformed_image.to(device))
    # Convert logits -> prediction probabilities (using torch.softmax() for multi-class classification)
    target_image_pred_probs = torch.softmax(target_image_pred, dim=1)
    # Convert prediction probabilities -> prediction labels
    target_image_pred_label = torch.argmax(target_image_pred_probs, dim=1)
    # Plot image with predicted label and probability
    plt.figure()
    plt.imshow(img)
    plt.title(
        f"Pred: {class_names[target_image_pred_label]} | Prob: {target_image_pred_probs.max():.3f}"
    )
    plt.axis(False)
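The logits -> probabilities -> label step inside pred_and_plot_image() can be traced by hand. This sketch reimplements softmax and argmax in plain Python on made-up logits (the values are illustrative, not model output); in the function itself torch.softmax() and torch.argmax() do this work.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Plain-Python softmax; subtracting the max keeps exp() numerically stable."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["pizza", "steak", "sushi"]
logits = [2.0, 0.5, -1.0]  # made-up raw model outputs for one image

probs = softmax(logits)
pred_index = max(range(len(probs)), key=probs.__getitem__)  # argmax

print(f"Pred: {class_names[pred_index]} | Prob: {probs[pred_index]:.3f}")
```

Softmax preserves the ordering of the logits, so the predicted label is the same whether argmax is taken before or after it; the probabilities are only needed for the plot title.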
Predict on a random sample of test images:
            
            
              python
              
              
            
          
          # Import function to make predictions on images and plot them 
# See the function previously created in section: https://www.learnpytorch.io/06_pytorch_transfer_learning/#6-make-predictions-on-images-from-the-test-set
from going_modular.going_modular.predictions import pred_and_plot_image
# Get a random list of 3 images from 20% test set
import random
num_images_to_plot = 3
test_image_path_list = list(Path(data_20_percent_path / "test").glob("*/*.jpg")) # get all test image paths from 20% dataset
test_image_path_sample = random.sample(population=test_image_path_list,
                                       k=num_images_to_plot) # randomly select k number of images
# Iterate through random test image paths, make predictions on them and plot them
for image_path in test_image_path_sample:
    pred_and_plot_image(model=best_model,
                        image_path=image_path,
                        class_names=class_names,
                        image_size=(224, 224))
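Two details of random.sample() are worth keeping in mind: it raises ValueError when k is larger than the population, and it returns a different selection on every run unless Python's random module is seeded. A minimal sketch with made-up file names (the paths are placeholders, not the real dataset):

```python
import random

# Placeholder paths standing in for the globbed test images
test_image_path_list = [f"test/pizza/{i}.jpg" for i in range(5)]
num_images_to_plot = 3

# Never ask for more images than are available
k = min(num_images_to_plot, len(test_image_path_list))

random.seed(42)  # seed Python's RNG so the same images are picked each run
first_sample = random.sample(population=test_image_path_list, k=k)

random.seed(42)
second_sample = random.sample(population=test_image_path_list, k=k)

print(first_sample == second_sample)  # True: same seed, same selection
```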
Finally, use the best model to predict on a custom image:
            
            
              python
              
              
            
          
          # Download custom image
import requests
# Setup custom image path
custom_image_path = Path("data/04-pizza-dad.jpeg")
# Download the image if it doesn't already exist
if not custom_image_path.is_file():
    print(f"Downloading {custom_image_path}...")
    # When downloading from GitHub, need to use the "raw" file link
    request = requests.get("https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/04-pizza-dad.jpeg")
    request.raise_for_status() # stop on a failed download instead of writing an error page to disk
    custom_image_path.parent.mkdir(parents=True, exist_ok=True) # make sure the data/ directory exists
    with open(custom_image_path, "wb") as f:
        f.write(request.content)
else:
    print(f"{custom_image_path} already exists, skipping download.")
# Predict on custom image
pred_and_plot_image(model=best_model,
                    image_path=custom_image_path,
                    class_names=class_names)
        
Supplement
Add data augmentation to the list of experiments, using the 20% pizza, steak, sushi training and test datasets:
            
            
              python
              
              
            
          
          # Note: Data augmentation transform like this should only be performed on training data
train_transform_data_aug = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    normalize
])
# Helper function to view images in a DataLoader (works with data augmentation transforms or not) 
def view_dataloader_images(dataloader, n=10):
    if n > 10:
        print("Having n higher than 10 will create messy plots, lowering to 10.")
        n = 10
    imgs, labels = next(iter(dataloader))
    plt.figure(figsize=(16, 8))
    for i in range(n):
        # Min max scale the image for display purposes
        targ_image = imgs[i]
        sample_min, sample_max = targ_image.min(), targ_image.max()
        sample_scaled = (targ_image - sample_min)/(sample_max - sample_min)
        # Plot images with appropriate axes information
        plt.subplot(1, 10, i+1)
        plt.imshow(sample_scaled.permute(1, 2, 0)) # rearrange [C, H, W] -> [H, W, C] for Matplotlib
        plt.title(class_names[labels[i]])
        plt.axis(False)
# Have to update `create_dataloaders()` to handle different augmentations
import os
from torch.utils.data import DataLoader
from torchvision import datasets
NUM_WORKERS = os.cpu_count() # use maximum number of CPUs for workers to load data 
# Note: this is an updated version of data_setup.create_dataloaders to handle
# different train and test transforms.
def create_dataloaders(
    train_dir, 
    test_dir, 
    train_transform, # add parameter for train transform (transforms on train dataset)
    test_transform,  # add parameter for test transform (transforms on test dataset)
    batch_size=32, num_workers=NUM_WORKERS
):
    # Use ImageFolder to create dataset(s)
    train_data = datasets.ImageFolder(train_dir, transform=train_transform)
    test_data = datasets.ImageFolder(test_dir, transform=test_transform)
    # Get class names
    class_names = train_data.classes
    # Turn images into data loaders
    train_dataloader = DataLoader(
        train_data,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,
    )
    test_dataloader = DataLoader(
        test_data,
        batch_size=batch_size,
        shuffle=False, # no need to shuffle the test data
        num_workers=num_workers,
        pin_memory=True,
    )
    return train_dataloader, test_dataloader, class_names
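The min-max scaling in view_dataloader_images() exists because normalized image tensors contain values outside the [0, 1] range that Matplotlib expects for float images. The arithmetic can be sketched on a plain list of made-up pixel values; min_max_scale is a hypothetical helper, and the guard for a constant image (where max equals min) is an addition that the helper above does not have.

```python
from typing import List

def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant input: avoid dividing by zero, map everything to 0
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

normalized_pixels = [-2.0, -0.5, 0.0, 1.0, 2.0]  # made-up normalized values
scaled = min_max_scale(normalized_pixels)
print(scaled)
```

The scaling is applied per image and only for display, so it does not affect the values the model is trained on.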