Advanced Object Detection: 2. The Faster R-CNN Algorithm

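This chapter introduces a classic two-stage object detection model. Faster R-CNN (Faster Regions with Convolutional Neural Networks) is an efficient deep neural network architecture for object detection. It improves on Fast R-CNN by introducing a Region Proposal Network (RPN), which makes detection faster.

R-CNN, Fast R-CNN, and Faster R-CNN are all classic two-stage algorithms. Faster R-CNN was developed from Fast R-CNN, which in turn was developed from R-CNN. Understanding Faster R-CNN therefore means understanding the core ideas and techniques of two-stage models, and it remains the foundation of many current object detection algorithms.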

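Network structure of Faster R-CNN

I. Stage One of Faster R-CNN

Note: stage one only finds the approximate location of objects. Its classification step does not decide a specific class; it only separates foreground from background. Stage one consists of the backbone feature extraction network and the region proposal network. The stage-one loss is not used on its own to update the model parameters; it is added to the stage-two loss, and the sum is used to update the model weights.

In the Faster R-CNN architecture, stage one comprises the backbone feature extraction network and the region proposal network. It outputs the feature map and the proposals required by stage two, and it also computes the stage-one loss.

1. Backbone feature extraction network

The backbone of Faster R-CNN is a key part of the network. It extracts a feature map from the input image, and this feature map is then used by the subsequent region proposal network (RPN) and fully connected layers.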


from torchvision.models import resnet50
from torch import nn

def Resnet50(pretrained=False):
    model = resnet50(pretrained)

    # ----------------------------------------------------------------------------#
    #   Feature extraction part: conv1 through model.layer3. For an input image of shape [3, 640, 640], this yields a [1024, 40, 40] feature map
    # ----------------------------------------------------------------------------#
    features = list([model.conv1, model.bn1, model.relu, model.maxpool, model.layer1, model.layer2, model.layer3])
    # ----------------------------------------------------------------------------#
    #   Classification part: model.layer4 through model.avgpool
    # ----------------------------------------------------------------------------#
    classifier = list([model.layer4, model.avgpool])

    features = nn.Sequential(*features)
    classifier = nn.Sequential(*classifier)
    return features, classifier

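Common backbones include VGG and ResNet, and ConvNeXt can also be used. Taking ResNet50 as an example: when Faster R-CNN uses a ResNet backbone, only the layers from conv1 through layer3 serve as the backbone, while layer4 and avgpool are used in the stage-two classification part. When the input image has shape [3,640,640], the backbone outputs a feature map of shape [1024,40,40]. This feature map is used both by the region proposal network and by stage two.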
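The 640 to 40 reduction can be checked by hand with the usual convolution output-size formula. The sketch below is only an arithmetic check, not part of the model; the helper names `conv_out` and `resnet50_layer3_size` are made up for illustration, and the kernel, stride, and padding values are those of the standard ResNet50 layout (layer1 keeps the resolution, layer2 and layer3 each halve it).

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet50_layer3_size(size):
    """Spatial size after conv1, maxpool, layer1, layer2 and layer3."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
    size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
    # layer1 does not change the resolution
    size = conv_out(size, kernel=3, stride=2, padding=1)  # layer2 (stride-2 conv)
    size = conv_out(size, kernel=3, stride=2, padding=1)  # layer3 (stride-2 conv)
    return size


feature_size = resnet50_layer3_size(640)
print(feature_size)           # spatial size of the layer3 feature map
print(640 // feature_size)    # overall stride of the backbone
```

The overall stride of 16 is the same value that appears as feat_stride when the anchors are laid out on the feature map.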

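2. Region proposal network

The Region Proposal Network (RPN) is a core component of the Faster R-CNN detection algorithm. It generates candidate object regions, known as region proposals.

Anchor boxes

A prior box, also called an anchor box, is an important concept in object detection. It refers to a set of predefined rectangles with fixed sizes and aspect ratios, introduced to improve the accuracy and stability of the detector. These boxes are spread evenly over the image at several scales and aspect ratios and are matched against the objects in the image, which guides the model in localizing and classifying them.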


import numpy as np

#--------------------------------------------#
#   Generate the base anchor boxes
#--------------------------------------------#
def generate_anchor_base(base_size=16, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32]):
    anchor_base = np.zeros((len(ratios) * len(anchor_scales), 4), dtype=np.float32) # 9 anchors per feature point
    for i in range(len(ratios)):
        for j in range(len(anchor_scales)):
            h = base_size * anchor_scales[j] * np.sqrt(ratios[i])
            w = base_size * anchor_scales[j] * np.sqrt(1. / ratios[i])
            # Convert (h, w) into corner coordinates centered on the origin.
            # Note: columns 0/2 are later treated as x and columns 1/3 as y, so h
            # actually spans the x axis here. Because the ratios are symmetric
            # (0.5, 1, 2), the resulting set of 9 anchors is the same either way.
            index = i * len(anchor_scales) + j
            anchor_base[index, 0] = - h / 2.
            anchor_base[index, 1] = - w / 2.
            anchor_base[index, 2] = h / 2.
            anchor_base[index, 3] = w / 2.
    return anchor_base

#--------------------------------------------#
#   Replicate the base anchors at every feature point
#--------------------------------------------#
def _enumerate_shifted_anchor(anchor_base, feat_stride, height, width):
    #---------------------------------#
    #   Compute the grid points (in image coordinates)
    #---------------------------------#
    shift_x             = np.arange(0, width * feat_stride, feat_stride)
    shift_y             = np.arange(0, height * feat_stride, feat_stride)
    shift_x, shift_y    = np.meshgrid(shift_x, shift_y)
    shift               = np.stack((shift_x.ravel(), shift_y.ravel(), shift_x.ravel(), shift_y.ravel(),), axis=1)

    #---------------------------------#
    #   The 9 anchors at each grid point
    #---------------------------------#
    A       = anchor_base.shape[0]
    K       = shift.shape[0]
    anchor  = anchor_base.reshape((1, A, 4)) + shift.reshape((K, 1, 4))
    #---------------------------------#
    #   All anchors, shape [K * A, 4]
    #---------------------------------#
    anchor  = anchor.reshape((K * A, 4)).astype(np.float32)
    return anchor
    
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    nine_anchors = generate_anchor_base()
    print(nine_anchors)

    height, width, feat_stride  = 40,40,16
    anchors_all                 = _enumerate_shifted_anchor(nine_anchors, feat_stride, height, width)
    print(np.shape(anchors_all)) # (14400, 4), where 14400 = 40*40*9
    
    fig     = plt.figure()
    ax      = fig.add_subplot(111)
    plt.ylim(-300,900)
    plt.xlim(-300,900)
    shift_x = np.arange(0, width * feat_stride, feat_stride)
    shift_y = np.arange(0, height * feat_stride, feat_stride)
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    plt.scatter(shift_x,shift_y)
    box_widths  = anchors_all[:,2]-anchors_all[:,0]
    box_heights = anchors_all[:,3]-anchors_all[:,1]
    for i in [0,1,2,3,4,5,6,7,8]:  # the 9 anchors of the first feature point (lower-left corner of the plot)
        rect = plt.Rectangle([anchors_all[i, 0],anchors_all[i, 1]],box_widths[i],box_heights[i],color="r",fill=False)
        ax.add_patch(rect)
    plt.show()

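Example of generated anchor boxes

The code above generates the anchor boxes, and the figure shows its output. If anchors are generated on a 40x40 feature map with 9 anchors per feature point, the result is 40*40*9=14400 anchors. The example figure maps the anchors generated on the feature map back onto the original image (assumed to be 640x640); the 9 boxes drawn are the anchors of the first feature point.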
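To make the nine shapes concrete, the following sketch lists the two side lengths produced by the same formulas as generate_anchor_base, using the default base_size, ratios, and anchor_scales from the code above. The function name `anchor_sizes` is made up for this illustration.

```python
import math


def anchor_sizes(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Return the (h, w) pair of every base anchor, in the same order as generate_anchor_base."""
    sizes = []
    for ratio in ratios:
        for scale in scales:
            h = base_size * scale * math.sqrt(ratio)
            w = base_size * scale * math.sqrt(1.0 / ratio)
            sizes.append((h, w))
    return sizes


sizes = anchor_sizes()
for h, w in sizes:
    # Every ratio keeps the area equal to (base_size * scale) ** 2
    print(f"h={h:7.1f}  w={w:7.1f}  area={h * w:9.1f}")
```

Each scale fixes the area of the anchor (128^2, 256^2, or 512^2 pixels), and the ratio only redistributes that area between the two sides.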

RPN outputs

RPN network structure


        #   First apply a 3x3 convolution, which can be seen as feature integration
        #-----------------------------------------#
        self.conv1  = nn.Conv2d(in_channels, mid_channels, 3, 1, 1)
        #-----------------------------------------#
        #   Classification: predict whether each anchor contains an object
        #-----------------------------------------#
        self.score  = nn.Conv2d(mid_channels, n_anchor * 2, 1, 1, 0)
        #-----------------------------------------#
        #   Regression: predict the adjustment applied to each anchor
        #-----------------------------------------#
        self.loc    = nn.Conv2d(mid_channels, n_anchor * 4, 1, 1, 0)

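As the figure above shows, the feature map produced by the backbone is the input to the region proposal network. A small sliding window (a 3x3 convolution) is first passed over the feature map, and the result then goes through two separate 1x1 convolutions: one for classification (object or background) and one for regression (adjusting the position and size of the box so that it encloses the object more accurately). If the feature map fed into the RPN has shape [1024,40,40] and n_anchor=9, the classification output has shape [9*2,40,40] and the regression output has shape [9*4,40,40].

In the classification output [9*2,40,40], the 9 stands for the 9 anchors at each feature point (pixel) of the feature map, and the 2 stands for the two-way classification of each anchor: object or no object. In the regression output [9*4,40,40], the 9 again stands for the 9 anchors at each feature point, and the 4 stands for the four regression values used to adjust the anchor's coordinates (x1, y1, x2, y2).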
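Before these two outputs can be paired with the 14400 anchors, they are usually rearranged so that each row corresponds to one anchor. The sketch below shows one way to do this with NumPy on random stand-in arrays. The variable names are illustrative, and treating channel index 1 as the foreground class is an assumption, since the text does not state the channel order.

```python
import numpy as np

n, n_anchor, h, w = 1, 9, 40, 40
rng = np.random.default_rng(0)

# Stand-ins for the two RPN outputs described above
rpn_locs = rng.normal(size=(n, n_anchor * 4, h, w)).astype(np.float32)
rpn_scores = rng.normal(size=(n, n_anchor * 2, h, w)).astype(np.float32)

# Move channels last so the 9 anchors of one feature point stay together,
# then flatten to one row per anchor.
rpn_locs = rpn_locs.transpose(0, 2, 3, 1).reshape(n, -1, 4)
rpn_scores = rpn_scores.transpose(0, 2, 3, 1).reshape(n, -1, 2)

# Softmax over the two classes; index 1 is ASSUMED to be "foreground"
shifted = rpn_scores - rpn_scores.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
fg_scores = probs[..., 1]

print(rpn_locs.shape, rpn_scores.shape, fg_scores.shape)
```

After this step, row k of rpn_locs and entry k of fg_scores describe the same anchor, provided the anchors were enumerated in the matching order (feature point first, then the 9 anchors of that point).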

Proposals


import numpy as np
import torch
# The "official" NMS mentioned below is assumed to be torchvision.ops.nms,
# whose (boxes, scores, iou_threshold) signature matches the call in __call__.
from torchvision.ops import nms

# loc2bbox is a box-decoding helper defined elsewhere in the project (not shown here).

class ProposalCreator():
    def __init__(
        self, 
        mode, 
        nms_iou             = 0.7,
        n_train_pre_nms     = 12000,
        n_train_post_nms    = 600,
        n_test_pre_nms      = 3000,
        n_test_post_nms     = 300,
        min_size            = 16
    ):
        #-----------------------------------#
        #   Prediction or training mode
        #-----------------------------------#
        self.mode               = mode
        #-----------------------------------#
        #   IoU threshold used by NMS on the proposals
        #-----------------------------------#
        self.nms_iou            = nms_iou
        #-----------------------------------#
        #   Number of proposals kept during training
        #-----------------------------------#
        self.n_train_pre_nms    = n_train_pre_nms
        self.n_train_post_nms   = n_train_post_nms
        #-----------------------------------#
        #   Number of proposals kept during prediction
        #-----------------------------------#
        self.n_test_pre_nms     = n_test_pre_nms
        self.n_test_post_nms    = n_test_post_nms
        self.min_size           = min_size

    def __call__(self, loc, score, anchor, img_size, scale=1.):
        if self.mode == "training":
            n_pre_nms   = self.n_train_pre_nms
            n_post_nms  = self.n_train_post_nms
        else:
            n_pre_nms   = self.n_test_pre_nms
            n_post_nms  = self.n_test_post_nms

        #-----------------------------------#
        #   Convert the anchors to a tensor
        #-----------------------------------#
        anchor = torch.from_numpy(anchor).type_as(loc)
        #-----------------------------------#
        #   Decode the RPN predictions into proposals
        #-----------------------------------#
        roi = loc2bbox(anchor, loc)
        #-----------------------------------#
        #   Keep the proposals inside the image
        #-----------------------------------#
        roi[:, [0, 2]] = torch.clamp(roi[:, [0, 2]], min = 0, max = img_size[1])
        roi[:, [1, 3]] = torch.clamp(roi[:, [1, 3]], min = 0, max = img_size[0])
        
        #-----------------------------------#
        #   Width and height of a proposal must be at least min_size (16 by default)
        #-----------------------------------#
        min_size    = self.min_size * scale
        keep        = torch.where(((roi[:, 2] - roi[:, 0]) >= min_size) & ((roi[:, 3] - roi[:, 1]) >= min_size))[0]
        #-----------------------------------#
        #   Keep only the proposals that pass the size check
        #-----------------------------------#
        roi         = roi[keep, :]
        score       = score[keep]

        #-----------------------------------#
        #   Sort by score and take the top n_pre_nms proposals
        #-----------------------------------#
        order       = torch.argsort(score, descending=True)
        if n_pre_nms > 0:
            order   = order[:n_pre_nms]
        roi     = roi[order, :]
        score   = score[order]

        #-----------------------------------#
        #   Apply non-maximum suppression to the proposals
        #   The official NMS implementation is much faster
        #-----------------------------------#
        keep    = nms(roi, score, self.nms_iou)
        #   If too few proposals survive, pad by resampling the kept ones
        if len(keep) < n_post_nms:
            index_extra = np.random.choice(range(len(keep)), size=(n_post_nms - len(keep)), replace=True)
            keep        = torch.cat([keep, keep[index_extra]])
        keep    = keep[:n_post_nms]
        roi     = roi[keep]
        return roi

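The code above turns the RPN predictions into proposals: the predicted boxes that are most likely to contain an object, according to their classification scores, are selected, and these are the proposals.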
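The helper loc2bbox is called in the code above but not shown. The following is a minimal NumPy sketch of what such a decoding step commonly looks like, assuming the standard (dx, dy, dw, dh) parameterization and boxes stored as (x1, y1, x2, y2). The name `decode_boxes` is made up here, and the author's actual helper may differ in detail.

```python
import numpy as np


def decode_boxes(anchors, locs):
    """Apply (dx, dy, dw, dh) offsets to anchors given as (x1, y1, x2, y2)."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h

    dx, dy, dw, dh = locs[:, 0], locs[:, 1], locs[:, 2], locs[:, 3]
    new_cx = dx * w + cx        # shift the center by a fraction of the anchor size
    new_cy = dy * h + cy
    new_w = np.exp(dw) * w      # scale width and height in log space
    new_h = np.exp(dh) * h

    return np.stack([new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                     new_cx + 0.5 * new_w, new_cy + 0.5 * new_h], axis=1)


anchors = np.array([[0.0, 0.0, 10.0, 10.0]])
unchanged = decode_boxes(anchors, np.zeros((1, 4)))
shifted = decode_boxes(anchors, np.array([[0.5, 0.0, 0.0, 0.0]]))
widened = decode_boxes(anchors, np.array([[0.0, 0.0, np.log(2.0), 0.0]]))
print(unchanged, shifted, widened)
```

An all-zero offset returns the anchor itself, which is why a freshly initialized regression branch starts out predicting boxes close to the anchors.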

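3. Computing the stage-one loss

Computing the stage-one loss requires both the RPN predictions and the ground truth. The classification and regression feature maps output by the RPN are already available, so what remains is to prepare the ground-truth targets.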


import numpy as np

# bbox_iou and bbox2loc are helpers defined elsewhere in the project (not shown here).

class AnchorTargetCreator(object):
    def __init__(self, n_sample=256, pos_iou_thresh=0.7, neg_iou_thresh=0.3, pos_ratio=0.5):
        self.n_sample       = n_sample
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh = neg_iou_thresh
        self.pos_ratio      = pos_ratio

    def __call__(self, bbox, anchor):
        argmax_ious, label = self._create_label(anchor, bbox)
        if (label > 0).any():
            loc = bbox2loc(anchor, bbox[argmax_ious])
            return loc, label
        else:
            return np.zeros_like(anchor), label

    def _calc_ious(self, anchor, bbox):
        #----------------------------------------------#
        #   No ground-truth boxes: return all-zero results
        #----------------------------------------------#
        if len(bbox)==0:
            return np.zeros(len(anchor), np.int32), np.zeros(len(anchor)), np.zeros(len(bbox))
        #----------------------------------------------#
        #   IoU between anchors and ground-truth boxes
        #   ious has shape [num_anchors, num_gt]
        #----------------------------------------------#
        ious = bbox_iou(anchor, bbox)
        #---------------------------------------------------------#
        #   Best-matching ground-truth box for each anchor  [num_anchors, ]
        #---------------------------------------------------------#
        argmax_ious = ious.argmax(axis=1)
        #---------------------------------------------------------#
        #   IoU with that best-matching ground-truth box  [num_anchors, ]
        #---------------------------------------------------------#
        max_ious = np.max(ious, axis=1)
        #---------------------------------------------------------#
        #   Best-matching anchor for each ground-truth box  [num_gt, ]
        #---------------------------------------------------------#
        gt_argmax_ious = ious.argmax(axis=0)
        #---------------------------------------------------------#
        #   Make sure every ground-truth box has a matching anchor
        #---------------------------------------------------------#
        for i in range(len(gt_argmax_ious)):
            argmax_ious[gt_argmax_ious[i]] = i

        return argmax_ious, max_ious, gt_argmax_ious
        
    def _create_label(self, anchor, bbox):
        # ------------------------------------------ #
        #   1 = positive, 0 = negative, -1 = ignored
        #   Everything is initialized to -1
        # ------------------------------------------ #
        label = np.empty((len(anchor),), dtype=np.int32)
        label.fill(-1)

        # ------------------------------------------------------------------------ #
        #   argmax_ious: index of the best ground-truth box for each anchor  [num_anchors, ]
        #   max_ious: IoU with the best ground-truth box for each anchor     [num_anchors, ]
        #   gt_argmax_ious: index of the best anchor for each ground-truth box  [num_gt, ]
        # ------------------------------------------------------------------------ #
        argmax_ious, max_ious, gt_argmax_ious = self._calc_ious(anchor, bbox)
        
        # ----------------------------------------------------- #
        #   Below neg_iou_thresh: negative sample
        #   At or above pos_iou_thresh: positive sample
        #   Every ground-truth box gets at least one anchor
        # ----------------------------------------------------- #
        label[max_ious < self.neg_iou_thresh] = 0
        label[max_ious >= self.pos_iou_thresh] = 1
        if len(gt_argmax_ious)>0:
            label[gt_argmax_ious] = 1

        # ----------------------------------------------------- #
        #   Cap the number of positives at pos_ratio * n_sample (128 by default)
        # ----------------------------------------------------- #
        n_pos = int(self.pos_ratio * self.n_sample)
        pos_index = np.where(label == 1)[0]
        if len(pos_index) > n_pos:
            disable_index = np.random.choice(pos_index, size=(len(pos_index) - n_pos), replace=False)
            label[disable_index] = -1

        # ----------------------------------------------------- #
        #   Balance positives and negatives; total stays at n_sample (256 by default)
        # ----------------------------------------------------- #
        n_neg = self.n_sample - np.sum(label == 1)
        neg_index = np.where(label == 0)[0]
        if len(neg_index) > n_neg:
            disable_index = np.random.choice(neg_index, size=(len(neg_index) - n_neg), replace=False)
            label[disable_index] = -1

        return argmax_ious, label

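Each anchor now needs a label, which is assigned from its IoU (intersection over union) with the ground-truth boxes. In the code above, an anchor whose highest IoU is at least pos_iou_thresh (0.7) is a positive sample, one whose highest IoU is below neg_iou_thresh (0.3) is a negative sample, and anchors in between are ignored. In addition, every ground-truth box is guaranteed at least one matching anchor. The code also caps the number of positive samples and the total number of samples. The reason is that there are a great many anchors: with a 40x40 backbone feature map, a single image already has 40*40*9=14400 of them, and computing the loss on every one would be costly, while an image with more than 50 real objects is rare. Limiting the number of labeled anchors is therefore not expected to hurt detection ability, and it keeps the loss computation cheap. Positive samples are labeled 1, negative samples 0, and all remaining anchors -1. Finally, the generated labels and the RPN predictions are used to compute the bounding-box regression loss and the classification loss.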
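The labeling above depends on an IoU matrix between anchors and ground-truth boxes, produced by a helper (bbox_iou) that is not shown. The sketch below is one common way to compute such a matrix with NumPy broadcasting, assuming boxes are stored as (x1, y1, x2, y2). The name `pairwise_iou` is made up for this illustration and is not the author's function.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """IoU between every box in boxes_a [N, 4] and every box in boxes_b [M, 4]."""
    # Top-left and bottom-right corners of each pairwise intersection, shape [N, M, 2]
    top_left = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    bottom_right = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])

    # Negative extents mean no overlap, so clip them to zero
    inter = np.prod(np.clip(bottom_right - top_left, 0, None), axis=2)
    area_a = np.prod(boxes_a[:, 2:] - boxes_a[:, :2], axis=1)
    area_b = np.prod(boxes_b[:, 2:] - boxes_b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)


anchors = np.array([[0.0, 0.0, 10.0, 10.0]])
gt_boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                     [5.0, 5.0, 15.0, 15.0],
                     [20.0, 20.0, 30.0, 30.0]])
ious = pairwise_iou(anchors, gt_boxes)
print(ious)  # one row per anchor, one column per ground-truth box
```

With the thresholds used above, this single anchor would be positive for the first box (IoU 1), and the second box alone (IoU 25/175) would leave it in the negative range below 0.3.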

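II. Stage Two of Faster R-CNN

The structure of stage two is shown in the figure. It mainly consists of region of interest pooling (RoI Pooling) and the classification and regression layers.

1. RoI Pooling (Region of Interest Pooling)

RoI Pooling is a technique commonly used in object detection, particularly in the region-based convolutional neural network (R-CNN) family, such as Fast R-CNN and Faster R-CNN. Its main purpose is to convert regions of interest (RoIs) of different sizes into feature maps of a fixed size so that classification and bounding-box regression can follow.

Stage two uses the feature map from the backbone together with the proposals output by the RPN. Each proposal is used to crop the corresponding features from the feature map according to its coordinates, which yields N small feature maps. These feature maps do not all have the same size, whereas the fully connected layers that follow require inputs of identical size. RoI Pooling is therefore used to bring them to a common size in preparation for the fully connected layers.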
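The idea of cropping a region and pooling it to a fixed size can be sketched in a few lines of NumPy. This is a simplified illustration only: it assumes the RoI is already given in integer feature-map coordinates with exclusive x2/y2, and its bin boundaries are not guaranteed to match the exact quantization of any library implementation. The name `roi_max_pool` is made up.

```python
import numpy as np


def roi_max_pool(feature, roi, out_size=2):
    """Max-pool the region roi=(x1, y1, x2, y2) of feature [C, H, W] to [C, out_size, out_size]."""
    x1, y1, x2, y2 = roi
    region = feature[:, y1:y2, x1:x2]
    _, h, w = region.shape

    # Bin edges that split the region into out_size x out_size cells
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)

    out = np.zeros((feature.shape[0], out_size, out_size), dtype=feature.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Each bin covers at least one cell, even for very small regions
            y_hi = max(ys[i + 1], ys[i] + 1)
            x_hi = max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, ys[i]:y_hi, xs[j]:x_hi].max(axis=(1, 2))
    return out


feature = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
pooled_full = roi_max_pool(feature, (0, 0, 4, 4))    # 4x4 region
pooled_small = roi_max_pool(feature, (1, 0, 4, 2))   # 2x3 region
print(pooled_full.shape, pooled_small.shape)
```

Both regions end up with the same output shape even though they start at different sizes, which is exactly the property the fully connected layers need.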

2. Classification and regression


import torch
from torch import nn
from torchvision.ops import RoIPool

# normal_init is a weight-initialization helper defined elsewhere in the project (not shown here).

class Resnet50RoIHead(nn.Module):
    def __init__(self, n_class, roi_size, spatial_scale, classifier):
        super(Resnet50RoIHead, self).__init__()
        self.classifier = classifier
        #--------------------------------------#
        #   Box regression on the RoI-pooled features
        #--------------------------------------#
        self.cls_loc = nn.Linear(2048, n_class * 4)
        #-----------------------------------#
        #   Classification on the RoI-pooled features
        #-----------------------------------#
        self.score = nn.Linear(2048, n_class)
        #-----------------------------------#
        #   Weight initialization
        #-----------------------------------#
        normal_init(self.cls_loc, 0, 0.001)
        normal_init(self.score, 0, 0.01)

        self.roi = RoIPool((roi_size, roi_size), spatial_scale)

    def forward(self, x, rois, roi_indices, img_size):
        n, _, _, _ = x.shape
        if x.is_cuda:
            roi_indices = roi_indices.cuda()
            rois = rois.cuda()
        rois        = torch.flatten(rois, 0, 1)
        roi_indices = torch.flatten(roi_indices, 0, 1)
        
        #-----------------------------------#
        #   Map the proposals from image coordinates to feature-map coordinates
        #-----------------------------------#
        rois_feature_map = torch.zeros_like(rois)
        rois_feature_map[:, [0,2]] = rois[:, [0,2]] / img_size[1] * x.size()[3]
        rois_feature_map[:, [1,3]] = rois[:, [1,3]] / img_size[0] * x.size()[2]

        indices_and_rois = torch.cat([roi_indices[:, None], rois_feature_map], dim=1)
        #-----------------------------------#
        #   Crop the shared feature map with the proposals
        #-----------------------------------#
        pool = self.roi(x, indices_and_rois)
        #-----------------------------------#
        #   Extract features with the classifier network
        #-----------------------------------#
        fc7 = self.classifier(pool)
        #--------------------------------------------------------------#
        #   For a single input image, fc7 has shape [300, 2048] here
        #--------------------------------------------------------------#
        fc7 = fc7.view(fc7.size(0), -1)

        roi_cls_locs    = self.cls_loc(fc7)
        roi_scores      = self.score(fc7)
        roi_cls_locs    = roi_cls_locs.view(n, -1, roi_cls_locs.size(1))
        roi_scores      = roi_scores.view(n, -1, roi_scores.size(1))
        return roi_cls_locs, roi_scores

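Finally, all the equally sized feature maps produced by RoI Pooling are fed into the classification and regression network. They first pass through convolutional layers (if the backbone is ResNet, these are ResNet's layer4), then through fully connected layers, which output the coordinates and the class of each predicted box.

3. Computing the stage-two loss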


import numpy as np

# bbox_iou and bbox2loc are helpers defined elsewhere in the project (not shown here).

class ProposalTargetCreator(object):
    def __init__(self, n_sample=128, pos_ratio=0.5, pos_iou_thresh=0.5, neg_iou_thresh_high=0.5, neg_iou_thresh_low=0):
        self.n_sample = n_sample
        self.pos_ratio = pos_ratio
        self.pos_roi_per_image = np.round(self.n_sample * self.pos_ratio)
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh_high = neg_iou_thresh_high
        self.neg_iou_thresh_low = neg_iou_thresh_low

    def __call__(self, roi, bbox, label, loc_normalize_std=(0.1, 0.1, 0.2, 0.2)):
        #   The ground-truth boxes are appended to the proposals as extra candidates
        roi = np.concatenate((roi.detach().cpu().numpy(), bbox), axis=0)
        
        if len(bbox)==0:
            gt_assignment = np.zeros(len(roi), np.int32)
            max_iou = np.zeros(len(roi))
            gt_roi_label = np.zeros(len(roi))
        else:
            # ----------------------------------------------------- #
            #   Overlap between proposals and ground-truth boxes
            # ----------------------------------------------------- #
            iou = bbox_iou(roi, bbox)
            #---------------------------------------------------------#
            #   Best-matching ground-truth box for each proposal  [num_roi, ]
            #---------------------------------------------------------#
            gt_assignment = iou.argmax(axis=1)
            #---------------------------------------------------------#
            #   IoU with that best-matching ground-truth box  [num_roi, ]
            #---------------------------------------------------------#
            max_iou = iou.max(axis=1)
            #---------------------------------------------------------#
            #   Ground-truth labels are shifted by +1 because 0 is background
            #---------------------------------------------------------#
            gt_roi_label = label[gt_assignment] + 1

        #----------------------------------------------------------------#
        #   Proposals whose IoU is at least pos_iou_thresh are positives
        #   The number of positives is capped at self.pos_roi_per_image
        #----------------------------------------------------------------#
        pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]
        pos_roi_per_this_image = int(min(self.pos_roi_per_image, pos_index.size))
        if pos_index.size > 0:
            pos_index = np.random.choice(pos_index, size=pos_roi_per_this_image, replace=False)

        #-----------------------------------------------------------------------------------------------------#
        #   Proposals with neg_iou_thresh_low <= IoU < neg_iou_thresh_high are negatives
        #   Positives plus negatives add up to at most self.n_sample
        #-----------------------------------------------------------------------------------------------------#
        neg_index = np.where((max_iou < self.neg_iou_thresh_high) & (max_iou >= self.neg_iou_thresh_low))[0]
        neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image
        neg_roi_per_this_image = int(min(neg_roi_per_this_image, neg_index.size))
        if neg_index.size > 0:
            neg_index = np.random.choice(neg_index, size=neg_roi_per_this_image, replace=False)
            
        #---------------------------------------------------------#
        #   sample_roi      [n_sample, ]
        #   gt_roi_loc      [n_sample, 4]
        #   gt_roi_label    [n_sample, ]
        #---------------------------------------------------------#
        keep_index = np.append(pos_index, neg_index)

        sample_roi = roi[keep_index]
        if len(bbox)==0:
            return sample_roi, np.zeros_like(sample_roi), gt_roi_label[keep_index]

        gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]])
        gt_roi_loc = (gt_roi_loc / np.array(loc_normalize_std, np.float32))

        #   Positives come first in keep_index; everything after them is background (label 0)
        gt_roi_label = gt_roi_label[keep_index]
        gt_roi_label[pos_roi_per_this_image:] = 0
        return sample_roi, gt_roi_loc, gt_roi_label

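The code above generates the ground-truth targets for stage two. The number of targets equals the number of proposals fed into the stage-two model: if n_sample proposals are fed in, there are n_sample stage-two targets. For each proposal, the ground-truth box with the highest IoU determines which box (and therefore which class) the proposal is matched to. Proposals whose IoU reaches the threshold are then treated as positive samples and those below it as negative samples.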
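The regression targets are produced by bbox2loc, which is called in the code above but not shown, followed by a division by loc_normalize_std. The sketch below shows a typical encoding under the standard (dx, dy, dw, dh) parameterization, with boxes stored as (x1, y1, x2, y2). The name `encode_boxes` is made up, and the author's helper may differ in detail.

```python
import numpy as np


def encode_boxes(src, dst, std=(0.1, 0.1, 0.2, 0.2)):
    """Offsets that move each src box onto its dst box, divided by std."""
    eps = np.finfo(np.float32).eps

    # Clamp sizes to avoid division by zero and log(0)
    w = np.maximum(src[:, 2] - src[:, 0], eps)
    h = np.maximum(src[:, 3] - src[:, 1], eps)
    cx = src[:, 0] + 0.5 * w
    cy = src[:, 1] + 0.5 * h

    dst_w = np.maximum(dst[:, 2] - dst[:, 0], eps)
    dst_h = np.maximum(dst[:, 3] - dst[:, 1], eps)
    dst_cx = dst[:, 0] + 0.5 * dst_w
    dst_cy = dst[:, 1] + 0.5 * dst_h

    loc = np.stack([(dst_cx - cx) / w, (dst_cy - cy) / h,
                    np.log(dst_w / w), np.log(dst_h / h)], axis=1)
    return loc / np.array(std, dtype=np.float32)


proposal = np.array([[0.0, 0.0, 10.0, 10.0]])
same = encode_boxes(proposal, proposal)
wider = encode_boxes(proposal, np.array([[0.0, 0.0, 20.0, 10.0]]))
print(same, wider)
```

Dividing by the standard deviations rescales the targets so that all four components have a comparable magnitude; the same values must be multiplied back in when the stage-two predictions are decoded.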

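The generated box targets and class targets are compared with the stage-two predictions to compute the regression loss and the classification loss. Finally, the stage-one and stage-two losses are added together, and the total is used to update all the network weights of both stages.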
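The text does not say which loss functions are used. A common choice for Faster R-CNN is smooth L1 for box regression and cross-entropy for classification, and the NumPy sketch below assumes exactly that. The function names, the `sigma` parameter, and the placeholder loss values at the end are illustrative, not taken from the author's code.

```python
import numpy as np


def smooth_l1(pred, target, sigma=1.0):
    """Element-wise smooth L1: quadratic near zero, linear for large errors."""
    beta = 1.0 / sigma ** 2
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)


def cross_entropy(logits, labels, ignore_index=-1):
    """Mean cross-entropy over samples whose label is not ignore_index."""
    keep = labels != ignore_index
    logits, labels = logits[keep], labels[keep]
    if len(labels) == 0:
        return 0.0
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())


loc_terms = smooth_l1(np.array([0.5, 2.0]), np.zeros(2))
cls_term = cross_entropy(np.zeros((3, 2)), np.array([1, 0, -1]))

# The four terms are simply summed (values here are placeholders)
rpn_loc_loss, rpn_cls_loss, roi_loc_loss, roi_cls_loss = 0.2, 0.7, 0.3, 1.1
total_loss = rpn_loc_loss + rpn_cls_loss + roi_loc_loss + roi_cls_loss
print(loc_terms, cls_term, total_loss)
```

The ignore_index argument mirrors the -1 labels produced for stage one, so ignored anchors contribute nothing to the classification loss.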

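NMS (Non-Maximum Suppression)

NMS is a post-processing technique widely used in computer vision and image processing, especially in object detection. Its main purpose is to resolve duplicate detections, that is, cases where the algorithm predicts several overlapping or similar bounding boxes for the same object.

NMS works in the following steps: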

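Select the highest-confidence box: from all candidate boxes, pick the one with the highest confidence (detection probability).
Compute the IoU: compute the IoU between this box and every other box. IoU measures how much two boxes overlap and is defined as $\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$.
Suppress overlapping boxes: if the IoU exceeds a threshold (commonly set to 0.5), the two boxes are considered to detect the same object, so the box with the lower confidence is suppressed (deleted or ignored).
Iterate: repeat the process on the remaining boxes until all boxes have been processed.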
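The steps above can be sketched directly in NumPy. This is a plain greedy, single-class version written for readability, assuming boxes are stored as (x1, y1, x2, y2); the name `nms_sketch` is made up to avoid confusion with library implementations, which are much faster.

```python
import numpy as np


def nms_sketch(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: return indices of the boxes that are kept, highest score first."""
    order = np.argsort(-scores)          # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]                  # step 1: highest-confidence remaining box
        keep.append(int(best))
        rest = order[1:]

        # step 2: IoU between the selected box and all remaining boxes
        top_left = np.maximum(boxes[best, :2], boxes[rest, :2])
        bottom_right = np.minimum(boxes[best, 2:], boxes[rest, 2:])
        inter = np.prod(np.clip(bottom_right - top_left, 0, None), axis=1)
        area_best = np.prod(boxes[best, 2:] - boxes[best, :2])
        area_rest = np.prod(boxes[rest, 2:] - boxes[rest, :2], axis=1)
        iou = inter / (area_best + area_rest - inter)

        # step 3: drop boxes that overlap too much; step 4: repeat on what is left
        order = rest[iou <= iou_thresh]
    return keep


boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                  [1.0, 1.0, 11.0, 11.0],
                  [20.0, 20.0, 30.0, 30.0]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms_sketch(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of 81/119, which is above the threshold, so it is suppressed, while the third box does not overlap anything and is kept.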

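When Faster R-CNN generates its predicted boxes, several boxes may be predicted for the same object. The redundant ones need to be removed so that only the best box remains as the final result, and this is where NMS is applied.
Example

As the figure above shows, three predicted boxes are generated. Each of them contains the object and is classified correctly. The boxes are sorted by classification score, the lower-scoring ones are removed, and only the highest-scoring box is kept as the final result. This is only a simple example; for multi-class results the computation is slightly more involved, but the principle is the same.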