PyTorch Study Notes (4) -- TorchVision Object Detection Finetuning Tutorial

Series

PyTorch Study Notes (1) -- Basics of the PyTorch Deep Learning Framework

PyTorch Study Notes (2) -- PyTorch Model Development Steps in Detail

PyTorch Study Notes (3) -- Introduction to TensorBoard

PyTorch Study Notes (4) -- TorchVision Object Detection Finetuning Tutorial

Table of Contents


Preface

1. Defining the Dataset

1.1 Dataset Requirements

1.2 Writing a Custom Dataset for PennFudan

2. Defining a Model

3. An Object Detection and Instance Segmentation Model for the PennFudan Dataset

4. Putting Everything Together

Summary


Preface

In this chapter we will finetune a pre-trained Mask R-CNN model on the Penn-Fudan database for pedestrian detection and segmentation. It contains 170 images with 345 pedestrian instances, and we will use it to illustrate how to use the new features in torchvision to train an object detection and instance segmentation model on a custom dataset.

1. Defining the Dataset

1.1 Dataset Requirements

The reference scripts for training object detection, instance segmentation, and person keypoint detection make it easy to add new custom datasets. The dataset should inherit from torch.utils.data.Dataset and implement __len__ and __getitem__. The only specific requirement is that __getitem__ should return a tuple: (image, target).

image: a torchvision.tv_tensors.Image of shape [3, H, W], a pure tensor, or a PIL Image of size (H, W). Its constructor takes the following parameters:

  • data -- the data, anything that can be converted to a tensor via torch.as_tensor(), including PIL Images

  • dtype (optional, the desired data type); if omitted, the dtype is inherited from data

  • device (optional, the device the tensor should live on); defaults to the CPU if omitted

  • requires_grad (optional, whether gradient computation is enabled)

target: a dict containing all of the following fields (a small construction sketch follows this list):

  • boxes: a torchvision.tv_tensors.BoundingBoxes of shape [N, 4], the coordinates of the N bounding boxes in [x0, y0, x1, y1] format, with x ranging from 0 to W and y from 0 to H

  • labels: an integer torch.Tensor of shape [N], the label for each bounding box; 0 represents the background class

  • image_id: an image identifier; it should be unique across the images in the dataset and is used during evaluation

  • area: a float torch.Tensor of shape [N], the area of each bounding box; this is used during evaluation with the COCO metric to separate the metric scores between small, medium, and large boxes

  • iscrowd: a uint8 torch.Tensor of shape [N]; instances with iscrowd=True will be ignored during evaluation

  • masks: optional, the segmentation masks for each object
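
To make these fields concrete, here is a minimal sketch (with made-up shapes and values; the actual construction for PennFudan appears in section 1.2) of wrapping an image tensor as a tv_tensors.Image and assembling a target dict:

import torch
from torchvision import tv_tensors
from torchvision.transforms.v2 import functional as F

# A dummy 3-channel image; dtype and device are the optional
# tv_tensors.Image arguments listed above.
img = tv_tensors.Image(torch.rand(3, 240, 320), dtype=torch.float32, device="cpu")

# Two made-up boxes in [x0, y0, x1, y1] format, both labelled 1 ("person").
boxes = tv_tensors.BoundingBoxes(
    torch.tensor([[10., 20., 110., 220.], [150., 30., 250., 230.]]),
    format="XYXY",
    canvas_size=F.get_size(img),
)
target = {
    "boxes": boxes,
    "labels": torch.tensor([1, 1], dtype=torch.int64),
    "image_id": 0,
    "area": (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]),
    "iscrowd": torch.zeros((2,), dtype=torch.int64),
}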

Note on labels: the model treats class 0 as the background. If your dataset does not contain a background class, there should be no 0 in your labels. For example, if you only have two classes, cat and dog, you can define 1 (not 0) to represent cat and 2 to represent dog. So if one of the images contains both classes, your label tensor should look like [1, 2].

In addition, if you want to use aspect ratio grouping during training (so that each batch only contains images with similar aspect ratios), it is recommended to also implement a get_height_and_width method that returns the height and width of an image. If this method is not provided, all elements of the dataset are queried through __getitem__, which loads the images into memory and is slower than providing a custom method.
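
If you do implement it, a minimal sketch of such a method for PennFudan could look like the following (an assumption, not part of the original tutorial; it would be added to the PennFudanDataset class defined in the next section, and relies on PIL's Image.open only reading the file header, so the image is not fully decoded):

import os
from PIL import Image

def get_height_and_width(self, idx):
    # Sketch of a method to add to PennFudanDataset (defined in section 1.2):
    # open the image lazily and report its size without decoding the pixels.
    img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
    with Image.open(img_path) as img:
        width, height = img.size  # PIL reports (width, height)
    return height, width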

1.2 Writing a Custom Dataset for PennFudan

Downloading PennFudanPed:

wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip -P data
cd data && unzip PennFudanPed.zip

Defining the PennFudanPed dataset:

import os
import torch

from torchvision.io import read_image
from torchvision.ops.boxes import masks_to_boxes
from torchvision import tv_tensors
from torchvision.transforms.v2 import functional as F

class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = read_image(img_path)
        mask = read_image(mask_path)
        # instances are encoded as different colors
        obj_ids = torch.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        num_objs = len(obj_ids)

        # split the color-encoded mask into a set
        # of binary masks
        masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8)

        # get bounding box coordinates for each mask
        boxes = masks_to_boxes(masks)

        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)

        image_id = idx
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        # Wrap sample and targets into torchvision tv_tensors:
        img = tv_tensors.Image(img)

        target = {}
        target["boxes"] = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=F.get_size(img))
        target["masks"] = tv_tensors.Mask(masks)
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

Analysis of the processing steps in __getitem__ (a small worked example follows the list):

  • Deduplicate the mask values with torch.unique
  • Drop the first deduplicated element, since it is the background
  • Count the remaining instance ids as num_objs
  • masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8) splits the color-encoded mask into one binary mask per instance
  • boxes = masks_to_boxes(masks) derives a bounding box from each binary mask
  • labels = torch.ones((num_objs,), dtype=torch.int64) assigns label 1 to every instance, since person is the only foreground class
  • The box areas are computed from the boxes: area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
  • image_id is set to the incoming index idx
  • iscrowd is assumed to be False for every instance
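
A small worked example (with a made-up 5x5 mask; the real mask loaded by read_image has an extra leading channel dimension, but the broadcasting works the same way) illustrating the unique / compare / masks_to_boxes steps above:

import torch
from torchvision.ops.boxes import masks_to_boxes

# toy color-encoded mask: background is 0, instance ids are 1 and 2
mask = torch.tensor([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 2, 2],
])
obj_ids = torch.unique(mask)      # tensor([0, 1, 2])
obj_ids = obj_ids[1:]             # drop the background id -> tensor([1, 2])
masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8)  # shape [2, 5, 5]
boxes = masks_to_boxes(masks)     # tensor([[1., 1., 2., 2.], [3., 3., 4., 4.]])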

2. Defining a Model

In this tutorial we will use Mask R-CNN, which is based on Faster R-CNN. Faster R-CNN is a model that predicts both bounding boxes and class scores for potential objects in an image.

A brief introduction to Mask R-CNN (from the paper's abstract): We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask alongside the existing branch for bounding box recognition. Mask R-CNN is simple to train, runs at 5 fps, and adds only a small overhead to Faster R-CNN. Moreover, Mask R-CNN is easy to generalize to other tasks, for example allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.

There are two common situations where one might want to modify one of the available models in the TorchVision Model Zoo. The first is when we want to start from a pre-trained model and just finetune the last layer. The other is when we want to replace the backbone of the model with a different one.

  • Finetuning from a pre-trained model

Suppose you want to start from a model pre-trained on COCO and finetune it for your particular classes. Here is one possible way of doing it:

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# replace the classifier with a new one, that has num_classes which is user-defined
num_classes = 2  # 1 class (person) + background
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

  • Modifying the model to add a different backbone

import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# load a pre-trained model for classification and return
# only the features
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
# FasterRCNN needs to know the number of
# output channels in a backbone. For mobilenet_v2, it's 1280
# so we need to add it here
backbone.out_channels = 1280

# let's make the RPN generate 5 x 3 anchors per spatial
# location, with 5 different sizes and 3 different aspect
# ratios. We have a Tuple[Tuple[int]] because each feature
# map could potentially have different sizes and
# aspect ratios
anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),),
    aspect_ratios=((0.5, 1.0, 2.0),)
)

# let's define what are the feature maps that we will
# use to perform the region of interest cropping, as well as
# the size of the crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to
# be [0]. More generally, the backbone should return an
# OrderedDict[Tensor], and in featmap_names you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(
    featmap_names=['0'],
    output_size=7,
    sampling_ratio=2
)

# put the pieces together inside a Faster-RCNN model
model = FasterRCNN(
    backbone,
    num_classes=2,
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=roi_pooler
)
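
As a quick sanity check (a sketch, not part of the original tutorial), the assembled model can be run on a couple of dummy images in eval mode:

import torch

model.eval()
with torch.no_grad():
    # two dummy images of different sizes, as the model accepts a list of tensors
    outputs = model([torch.rand(3, 300, 400), torch.rand(3, 500, 400)])
print(outputs[0].keys())  # expected: dict_keys(['boxes', 'labels', 'scores'])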

3. An Object Detection and Instance Segmentation Model for the PennFudan Dataset

In our case we want to finetune from a pre-trained model, given that our dataset is very small, so we follow approach 1. Here we also want to compute instance segmentation masks, so we will use Mask R-CNN and build the model as follows:

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask,
        hidden_layer,
        num_classes
    )

    return model

4. Putting Everything Together

OK, we have now covered how the dataset and the model are built. Next we write the whole pipeline, from constructing the model through training to testing.

In references/detection/ there are a number of helper functions that simplify training and evaluating detection models. The download commands are as follows:

os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/engine.py")
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/utils.py")
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_utils.py")
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_eval.py")
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/transforms.py")

With these helpers in place we can write functions for data augmentation and transformation. Below is a get_transform function that we will use with our dataset loading interface:

import torch
from torchvision.transforms import v2 as T

def get_transform(train):
    transforms = []
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    transforms.append(T.ToDtype(torch.float, scale=True))
    transforms.append(T.ToPureTensor())
    return T.Compose(transforms)
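
Because these are torchvision.transforms.v2 transforms, they are applied jointly to the image and the target, so RandomHorizontalFlip also flips the boxes and masks. A quick usage sketch (assuming the PennFudanPed data has been downloaded as above):

dataset = PennFudanDataset('data/PennFudanPed', get_transform(train=True))
img, target = dataset[0]
print(img.dtype, img.shape)       # torch.float32, [3, H, W]
print(target["boxes"].shape)      # [N, 4]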

Now for the main body:

import torch
import torchvision
import utils

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
dataset = PennFudanDataset('data/PennFudanPed', get_transform(train=True))
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=utils.collate_fn
)

# For Training
images, targets = next(iter(data_loader))
images = list(image for image in images)
targets = [{k: v for k, v in t.items()} for t in targets]
output = model(images, targets)  # Returns losses and detections
print(output)

# For inference
model.eval()
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
predictions = model(x)  # Returns predictions
print(predictions[0])

# The main training script is as follows
from engine import train_one_epoch, evaluate

# train on the GPU or on the CPU, if a GPU is not available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has two classes only - background and person
num_classes = 2
# use our dataset and defined transformations
dataset = PennFudanDataset('data/PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('data/PennFudanPed', get_transform(train=False))

# split the dataset in train and test set
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=utils.collate_fn
)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test,
    batch_size=1,
    shuffle=False,
    collate_fn=utils.collate_fn
)

# get the model using our helper function
model = get_model_instance_segmentation(num_classes)

# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    params,
    lr=0.005,
    momentum=0.9,
    weight_decay=0.0005
)

# and a learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=3,
    gamma=0.1
)

# let's train it just for 2 epochs
num_epochs = 2

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)

print("That's it!")

Validating the model:

import matplotlib.pyplot as plt
from torchvision.utils import draw_bounding_boxes, draw_segmentation_masks
image = read_image("data/PennFudanPed/PNGImages/FudanPed00046.png")
eval_transform = get_transform(train=False)

model.eval()
with torch.no_grad():
    x = eval_transform(image)
    # convert RGBA -> RGB and move to device
    x = x[:3, ...].to(device)
    predictions = model([x, ])
    pred = predictions[0]

image = (255.0 * (image - image.min()) / (image.max() - image.min())).to(torch.uint8)
image = image[:3, ...]
pred_labels = [f"pedestrian: {score:.3f}" for label, score in zip(pred["labels"], pred["scores"])]
pred_boxes = pred["boxes"].long()
output_image = draw_bounding_boxes(image, pred_boxes, pred_labels, colors="red")

masks = (pred["masks"] > 0.7).squeeze(1)
output_image = draw_segmentation_masks(output_image, masks, alpha=0.5, colors="blue")

plt.figure(figsize=(12, 12))
plt.imshow(output_image.permute(1, 2, 0))
plt.show()

Summary

In this tutorial, you learned how to create your own training pipeline for object detection models on a custom dataset. To do so, you wrote a torch.utils.data.Dataset class that returns the images, the ground-truth boxes, and the segmentation masks. You also leveraged a Mask R-CNN model pre-trained on COCO train2017 in order to perform transfer learning on this new dataset.

For a more complete example, including multi-machine / multi-GPU training, check the references/detection/train.py file in the torchvision repository.