3.7. Object Detection Algorithms

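1. R-CNN

R-CNN first uses a heuristic search algorithm (selective search) to select anchor boxes, then uses a pretrained model to extract features for each anchor box, trains an SVM to classify the category, and finally trains a linear regression model to predict the bounding box offsets.

R-CNN is a relatively early method, which is why it uses an SVM as the classifier.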

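1.1 Region of Interest (RoI) Pooling Layer

Given an anchor box, split it evenly into $n \times m$ blocks and output the maximum value in each block. No matter how large the anchor box is, the output always has $nm$ values.

A $3 \times 3$ region cannot be split evenly into $2 \times 2$ blocks, so the block boundaries are rounded.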
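
The rounding described above can be illustrated with a small pure-Python sketch. The function name `roi_max_pool` and the floor/ceil rule for the block boundaries are assumptions made for illustration; real implementations may round differently.

```python
import math


def roi_max_pool(fmap, top, left, height, width, n, m):
    """Max-pool the height x width region starting at (top, left) into n x m values."""
    out = []
    for i in range(n):
        # Block boundaries are rounded outward (floor/ceil), so neighbouring
        # blocks may share a row or column when the region does not divide evenly
        r0 = top + math.floor(i * height / n)
        r1 = top + math.ceil((i + 1) * height / n)
        row = []
        for j in range(m):
            c0 = left + math.floor(j * width / m)
            c1 = left + math.ceil((j + 1) * width / m)
            row.append(max(fmap[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out


fmap = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
pooled_3x3 = roi_max_pool(fmap, 0, 0, 3, 3, 2, 2)  # 3x3 region, uneven split
pooled_4x4 = roi_max_pool(fmap, 0, 0, 4, 4, 2, 2)  # 4x4 region, even split
print(pooled_3x3)  # [[5, 6], [9, 10]]
print(pooled_4x4)  # [[5, 7], [13, 15]]
```

Both regions produce exactly $2 \times 2 = 4$ values, regardless of their size.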

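1.2 Fast R-CNN

R-CNN extracts features separately for every anchor box, so an image with many anchor boxes requires many CNN forward passes, which is expensive. Fast R-CNN speeds this up as follows:

Use a CNN to extract features from the whole image, then use the RoI pooling layer to produce a fixed-length feature for each anchor box.

In other words, after the features are extracted, each anchor box on the original image is mapped proportionally to its location on the feature map, and RoI pooling is applied there. The CNN no longer extracts features per anchor box but once for the whole image, so regions where anchor boxes overlap are computed only once, which is much faster.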
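
A minimal sketch of the proportional mapping, assuming a hypothetical CNN whose feature map is 8 times smaller than the input image (the `stride` value and the helper `project_box` are illustrative, not taken from any library):

```python
def project_box(box, stride):
    """Map an (x1, y1, x2, y2) box from image pixels to feature-map cells."""
    return tuple(round(v / stride) for v in box)


image_boxes = [(0, 0, 160, 160), (80, 0, 240, 240)]  # two overlapping anchor boxes
stride = 8  # assumed down-sampling factor of the CNN

fmap_boxes = [project_box(b, stride) for b in image_boxes]
print(fmap_boxes)  # [(0, 0, 20, 20), (10, 0, 30, 30)]

# Number of CNN forward passes needed for this image
cnn_passes_rcnn = len(image_boxes)  # R-CNN: one pass per anchor box
cnn_passes_fast_rcnn = 1            # Fast R-CNN: one pass for the whole image
print(cnn_passes_rcnn, cnn_passes_fast_rcnn)  # 2 1
```

The overlapping part of the two boxes is covered by the same feature-map cells, so it is computed once and reused.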

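1.3 Faster R-CNN

Faster R-CNN replaces the heuristic search with a region proposal network (RPN) to obtain better anchor boxes.

Roughly, a neural network is trained to decide whether each anchor box contains an object (a binary classification problem) and, if it does, to predict its offset from the ground-truth bounding box. Once trained, the network outputs good anchor boxes. The model thus makes a coarse prediction first and a precise prediction afterwards.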
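
The coarse first stage can be sketched as follows. This is only an illustration: the scores and offsets are made-up numbers standing in for network outputs, the 0.5 threshold is an assumption, and the offsets use the plain center/size parameterization without the extra scaling factors some implementations apply.

```python
import math


def apply_offsets(anchor, offsets):
    """Refine a (cx, cy, w, h) anchor with predicted (dx, dy, dw, dh) offsets."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def propose(anchors, objectness, offsets, threshold=0.5):
    """Keep anchors classified as 'contains an object' and refine them."""
    return [apply_offsets(a, o)
            for a, s, o in zip(anchors, objectness, offsets) if s > threshold]


anchors = [(0.5, 0.5, 0.2, 0.2), (0.3, 0.7, 0.4, 0.2)]
objectness = [0.9, 0.1]                                 # hypothetical binary scores
offsets = [(0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]  # hypothetical offsets

proposals = propose(anchors, objectness, offsets)
print(proposals)  # one proposal: the first anchor shifted right by half its width
```

The surviving proposals are then passed to the second, more precise prediction stage.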

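1.4 Mask R-CNN

The rest is the same as Faster R-CNN, with an added network branch that makes per-pixel predictions, assuming a label is available for every pixel. In addition, RoI pooling is replaced by RoI align, which makes pixel-level classification more accurate: each output value is a weighted (interpolated) combination of nearby feature values rather than the result of a hard, rounded split.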
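
The weighting can be illustrated with bilinear interpolation, the technique RoI align uses to read a feature map at fractional positions. The helper `bilinear_sample` is a hypothetical name, and the sketch assumes non-negative coordinates inside the feature map.

```python
def bilinear_sample(fmap, y, x):
    """Sample fmap at fractional (y, x) as a weighted mean of its 4 neighbours."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    wy, wx = y - y0, x - x0  # distance from the top-left neighbour
    top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
    bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


fmap = [[0.0, 1.0],
        [2.0, 3.0]]
center_value = bilinear_sample(fmap, 0.5, 0.5)
print(center_value)  # 1.5, the mean of all four cells
```

RoI pooling would instead round the position to a single cell and lose the sub-pixel alignment.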

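R-CNN is the earliest and best-known family of object detection algorithms based on anchor boxes and CNNs. Faster R-CNN and Mask R-CNN are commonly used when the highest accuracy is the goal. Mask R-CNN requires a label for every pixel, which limits where it can be applied.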

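2. Single Shot Multibox Detection (SSD)

For each pixel, generate multiple anchor boxes centered on it (using the anchor box generation method from the previous section).

First, a base network block extracts features; then several convolutional blocks each halve the height and width. Anchor boxes are generated at every stage: the bottom stages fit small objects and the top stages fit large objects. For every anchor box, the model predicts a class and the offset to the ground-truth bounding box.

Multiscale feature maps closer to the top are smaller but have larger receptive fields, so they are suited to detecting fewer but larger objects. In short, through its multiscale feature blocks, SSD generates anchor boxes of different sizes and detects objects of different sizes by predicting the classes and offsets of bounding boxes, so it is a multiscale object detection model.

SSD performs detection with a single neural network: it generates multiple anchor boxes centered on each pixel and performs multiscale detection on the outputs of multiple stages.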

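2.1 Multiscale Object Detection

The motivation is to reduce the number of anchor boxes on an image: we can uniformly sample a small portion of the pixels of the input image and generate anchor boxes centered on them. At different scales, we can generate different numbers of anchor boxes with different sizes.

In general, smaller objects can appear in an image in more ways. For example, objects of size $1 \times 1$, $1 \times 2$, and $2 \times 2$ can appear on a $2 \times 2$ image in 4, 2, and 1 possible ways, respectively. So when detecting smaller objects we can sample more regions, and for larger objects we can sample fewer regions.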
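
The counts 4, 2, and 1 can be checked with a short calculation: an object of height $h$ and width $w$ fits into an $H \times W$ image at $(H - h + 1)(W - w + 1)$ positions. The helper name below is illustrative.

```python
def num_positions(obj_h, obj_w, img_h, img_w):
    """Number of distinct placements of an obj_h x obj_w object in the image."""
    return (img_h - obj_h + 1) * (img_w - obj_w + 1)


counts = [num_positions(h, w, 2, 2) for h, w in [(1, 1), (1, 2), (2, 2)]]
print(counts)  # [4, 2, 1]
```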

import torch
from d2l import torch as d2l

img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]
print(h, w)


def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # The values of the first two dimensions (batch, channels) do not affect the output
    fmap = torch.zeros((1, 10, fmap_h, fmap_w))

    # multibox_prior expects a 4-D input
    # Generate the anchor boxes; the result has shape (1, num_anchors, 4)
    anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = torch.tensor((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)

# Small anchor boxes for detecting small objects
display_anchors(fmap_w=4, fmap_h=4, s=[0.15])  # split into 4 * 4 regions
d2l.plt.show()
# Large anchor boxes for detecting large objects
display_anchors(fmap_w=2, fmap_h=2, s=[0.4])  # split into 2 * 2 regions
d2l.plt.show()
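
To make the sampling concrete, here is a pure-Python sketch of where the anchor centers fall and how many anchors each call above produces. It assumes centers are placed at the middle of each feature-map cell in normalized coordinates, and uses the per-cell count `len(sizes) + len(ratios) - 1` that appears later in these notes; the helper names are illustrative.

```python
def anchor_centers(fmap_h, fmap_w):
    """Normalized (cx, cy) centers, one per feature-map cell."""
    return [((j + 0.5) / fmap_w, (i + 0.5) / fmap_h)
            for i in range(fmap_h) for j in range(fmap_w)]


def num_anchors_total(fmap_h, fmap_w, sizes, ratios):
    """Total anchors: cells times (number of sizes + number of ratios - 1)."""
    per_cell = len(sizes) + len(ratios) - 1
    return fmap_h * fmap_w * per_cell


centers = anchor_centers(2, 2)
print(centers)  # [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
print(num_anchors_total(4, 4, [0.15], [1, 2, 0.5]))  # 48 small anchors
print(num_anchors_total(2, 2, [0.4], [1, 2, 0.5]))   # 12 large anchors
```

The finer 4 * 4 grid yields four times as many anchors as the 2 * 2 grid, matching the idea of sampling more regions for small objects.
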

2.2 SSD

See the code comments for details:

import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

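# Class prediction layer: the channel with index i(q+1)+j holds the prediction
# of class index j for the anchor box with index i (q is the number of classes)


def cls_predictor(num_inputs, num_anchors, num_classes):
    # num_inputs is the number of input channels; a prediction is made at every pixel
    # The +1 accounts for the background class. Every anchor box is classified,
    # hence num_anchors * (num_classes + 1) output channels
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)


# Bounding box prediction layer: predicts 4 offsets (for x, y, w, h) per anchor box


def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)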


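# Concatenating predictions from multiple scales


def forward(x, block):
    return block(x)


# Example
# The outputs have 5 * (10 + 1) = 55 and 3 * (10 + 1) = 33 channels respectively;
# the output shape is (batch size, channels, height, width)
Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
print(Y1.shape, Y2.shape)


# Convert the 4-D predictions to 2-D
# permute moves the channel dimension to the end, then flatten starts at dim=1,
# i.e. the last three dimensions are flattened
# Putting channels last keeps all predictions for the same pixel contiguous.
# Without the permute, flattening would go channel plane by channel plane
# (each plane being height x width) and scatter the predictions of one pixel.
def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)


# Concatenate after flattening: 20 * 20 * 55 + 10 * 10 * 33 = 25300
def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)


print(concat_preds([Y1, Y2]).shape)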

# Down-sampling block that halves the height and width


def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1))  # height and width unchanged
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    blk.append(nn.MaxPool2d(2))  # height and width halved
    return nn.Sequential(*blk)


# Example: 20 * 20 is halved to 10 * 10
print('down-sampling block:', forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape)

# Base network block


# For a 256 * 256 input image, this block outputs a 32 * 32 feature map
def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]  # 3 input channels, then 16, 32, 64
    for i in range(len(num_filters) - 1):
        # Each block halves height and width; three blocks give 8x down-sampling
        blk.append(down_sample_blk(num_filters[i], num_filters[i + 1]))
    return nn.Sequential(*blk)


print('base network block:', forward(torch.zeros((2, 3, 256, 256)), base_net()).shape)

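# Building blocks of the complete model


# Five blocks; each is used both to generate anchor boxes and to predict their classes and offsets
# The first is the base network block, the second to fourth are down-sampling blocks,
# and the last uses global max pooling to reduce the height and width to 1
# Other networks could presumably be substituted, e.g. ResNet blocks in place of down_sample_blk
def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)  # the first down-sampling block increases the output channels
    elif i == 4:
        blk = nn.AdaptiveMaxPool2d((1, 1))
    else:
        blk = down_sample_blk(128, 128)  # later blocks keep the number of channels
    return blk


# Forward pass of one block


# Unlike image classification, the outputs here are: the CNN feature map Y, the anchor boxes
# generated from Y at the current scale, and the predicted classes and offsets of those anchor boxes (based on Y)
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)  # generate anchor boxes
    # Class prediction: the anchors themselves are not passed in; only the number
    # of anchors and classes matters
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)  # bounding box prediction
    return (Y, anchors, cls_preds, bbox_preds)


# Hyperparameters: five stages with increasing sizes; actual area = s^2 * image area;
# the second value in each pair is a geometric mean
# 0.272 = sqrt(0.2 * 0.37)
sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5  # a commonly used combination
num_anchors = len(sizes[0]) + len(ratios[0]) - 1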

# The complete model
class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]  # output channels of each block
        for i in range(5):
            # Equivalent to the assignment self.blk_i = get_blk(i)
            # setattr creates the attributes blk_0, cls_0, bbox_0, and so on
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                    num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                      num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # getattr(self, f'blk_{i}') accesses self.blk_i
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        # Each entry has shape (1, num_anchors_i, 4); concatenate along the anchor dimension
        anchors = torch.cat(anchors, dim=1)
        # Concatenate the predictions from all scales
        cls_preds = concat_preds(cls_preds)
        # Reshape to (batch size, number of anchors, num_classes + 1); -1 is inferred
        # from the other dimensions. This layout makes per-anchor class predictions easy to read
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds


net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
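
As a sanity check, the three shapes printed above can be derived by hand. The sketch below assumes the architecture described in the comments (a base network that down-samples 8 times, three halving blocks, then global max pooling) and 4 anchors per pixel; `feature_map_sides` is an illustrative helper, not part of d2l.

```python
def feature_map_sides(input_side=256, num_blocks=5):
    """Side length of the (square) feature map after each of the five blocks."""
    side = input_side // 8            # base network halves height/width three times
    sides = [side]
    for _ in range(num_blocks - 2):   # three down-sampling blocks
        side //= 2
        sides.append(side)
    sides.append(1)                   # global max pooling
    return sides


anchors_per_pixel = 2 + 3 - 1         # len(sizes[0]) + len(ratios[0]) - 1
batch_size, num_classes = 32, 1

sides = feature_map_sides()
total_anchors = sum(s * s for s in sides) * anchors_per_pixel
print(sides)                                         # [32, 16, 8, 4, 1]
print((1, total_anchors, 4))                         # (1, 5444, 4)
print((batch_size, total_anchors, num_classes + 1))  # (32, 5444, 2)
print((batch_size, total_anchors * 4))               # (32, 21776)
```
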

2.3 Training the Model

See the code comments for details:

# Training the model

# DirectML backend for AMD GPUs; on CUDA or CPU, d2l.try_gpu() can be used instead
import torch_directml

# Load the data and initialize the model
batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)

device, net = torch_directml.device(), TinySSD(num_classes=1)
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)

# Loss and evaluation functions

cls_loss = nn.CrossEntropyLoss(reduction='none')
# L1 stays moderate even for very poor predictions, whereas L2 could become very large
bbox_loss = nn.L1Loss(reduction='none')

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    # The mask is 0 for background anchors, so they contribute no bbox loss
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox  # total loss = anchor class loss + offset loss

def cls_eval(cls_preds, cls_labels):
    # The class predictions are in the last dimension, so argmax is taken over it
    return float((cls_preds.argmax(dim=-1).type(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())

num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Number of correct class predictions, number of class predictions,
    # sum of absolute bbox errors, number of bbox values
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
        X, Y = features.to(device), target.to(device)
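        # Generate multiscale anchor boxes and predict a class and offsets for each
        anchors, cls_preds, bbox_preds = net(X)
        # Label each anchor box with a class and offsets; Y holds the ground-truth labels
        bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
        # Compute the loss from the predicted and labeled classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        # The accumulator records (correct class predictions, total class predictions,
        # sum of absolute bbox errors, total bbox values)
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())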
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')

d2l.plt.show()
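
Two points from the loss comments can be checked with plain numbers: the mask removes background anchors from the bounding box loss, and L1 grows much more slowly than L2 for a badly wrong prediction. The values below are made up for illustration, and `masked_l1` is a hypothetical pure-Python stand-in for the tensor computation in calc_loss.

```python
def masked_l1(preds, labels, masks):
    """Mean absolute error; entries with mask 0 (background) contribute 0."""
    total = sum(abs(p * m - l * m) for p, l, m in zip(preds, labels, masks))
    return total / len(preds)


preds = [0.5, 2.0, 10.0, -3.0]
labels = [1.0, 1.0, 0.0, 0.0]
masks = [1.0, 1.0, 0.0, 0.0]   # the last two offsets belong to a background anchor

loss = masked_l1(preds, labels, masks)
print(loss)  # 0.375: only the first two entries count

error = 10.0                   # a very poor prediction
l1_penalty, l2_penalty = abs(error), error ** 2
print(l1_penalty, l2_penalty)  # 10.0 100.0
```
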

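Training on the CPU (I have an AMD GPU; the reason for using the CPU is explained below):

Using torch_directml is problematic: some operators appear to be unsupported (repeat_interleave and AdaptiveMaxPool2d) and can only run on the CPU. The training results are also very poor, although I would expect a CPU fallback to affect only the training time. I do not fully understand why: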

UserWarning: The operator 'aten::repeat_interleave.Tensor' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)

out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],

3. YOLO

You Only Look Once

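SSD produces many heavily overlapping anchor boxes (a consequence of how the anchor boxes are generated), which wastes a lot of computation. YOLO instead divides the image evenly into $S \times S$ anchor boxes, and each anchor box predicts $B$ bounding boxes (predicting only one could miss objects, since several objects may fall in the same anchor box).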
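
A small sketch of the grid idea, assuming normalized object centers in [0, 1]. The values S = 7 and B = 2 are used only as an example, and `responsible_cell` is an illustrative helper name.

```python
def responsible_cell(cx, cy, S):
    """Grid cell (row, col) that contains a normalized object center."""
    # min(...) keeps a center lying exactly on the right/bottom edge inside the grid
    return min(int(cy * S), S - 1), min(int(cx * S), S - 1)


S, B = 7, 2                    # example grid size and boxes per cell
total_boxes = S * S * B
print(total_boxes)             # 98 predicted boxes, none of them overlapping by construction

cell = responsible_cell(0.5, 0.2, S)
print(cell)                    # (1, 3)
```

Because the cells tile the image without overlap, the number of predictions is fixed at $S \times S \times B$ regardless of image content.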