[Object Detection] Two-Stage Methods: A Brief Look at Mask R-CNN (2018)

every blog every motto: There's only one corner of the universe you can be sure of improving, and that's your own self.
https://blog.csdn.net/weixin_39190382?spm=1010.2135.3001.5343

0. Preface

An overview of Mask R-CNN.

1. Main Content

Mask R-CNN makes two main improvements on top of Faster R-CNN:

  1. It introduces ROIAlign, which removes the quantization error introduced by ROI Pooling.
  2. It adds a mask branch that predicts a segmentation mask for each detected object.

1.1 backbone

Summary: the input is the original image; the output is the five feature maps P2-P6 produced by ResNet plus the FPN.

The backbone uses ResNet-50 or ResNet-101 for feature extraction. As depth increases, the feature map size shrinks to (H/2, W/2), (H/4, W/4), (H/8, W/8), (H/16, W/16), (H/32, W/32). Low-level features carry more detail (color, contours, texture) but also more noise and irrelevant information, while high-level features carry rich semantics (category, attributes) at a much lower spatial resolution, so fine spatial information is largely lost there. Mask R-CNN therefore adds an FPN (Feature Pyramid Network) on top of the backbone to fuse features across levels.
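To make the sizes concrete, here is a minimal sketch (my own illustration, assuming a 1024x1024 input as in the common Mask R-CNN configuration; not code from the repository) that lists the spatial sizes of C2-C5 and of the FPN outputs P2-P6, where P6 is just a stride-2 subsampling of P5:

python
# Minimal sketch (assumption: 1024x1024 input). C2-C5 have strides 4, 8, 16, 32
# relative to the image; P2-P5 keep those sizes and P6 halves P5 once more.
img_h = img_w = 1024
backbone_strides = [4, 8, 16, 32]

c_shapes = [(img_h // s, img_w // s) for s in backbone_strides]
print("C2-C5:", c_shapes)   # [(256, 256), (128, 128), (64, 64), (32, 32)]

p_shapes = c_shapes + [(c_shapes[-1][0] // 2, c_shapes[-1][1] // 2)]
print("P2-P6:", p_shapes)   # [..., (16, 16)]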

1.1.1 Invocation

python
resnet = ResNet("resnet101", stage5=True)
C1, C2, C3, C4, C5 = resnet.stages()

# Top-down Layers
# TODO: add assert to verify feature map sizes match what's in config
self.fpn = FPN(C1, C2, C3, C4, C5, out_channels=256)

1.1.2 ResNet Network Code

python
class ResNetBackbone(nn.Module):
    """ResNet backbone stripped to C1..C5 feature maps."""

    def __init__(self, layers: Tuple[int, int, int, int]) -> None:
        super().__init__()
        self.inplanes = 64
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=True)
        self.bn1 = FrozenBatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(64, layers[0], stride=1)
        self.layer2 = self._make_layer(128, layers[1], stride=2)
        self.layer3 = self._make_layer(256, layers[2], stride=2)
        self.layer4 = self._make_layer(512, layers[3], stride=2)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")

    def _make_layer(self, planes: int, blocks: int, stride: int) -> nn.Sequential:
        downsample = None
        if stride != 1 or self.inplanes != planes * Bottleneck.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(
                    self.inplanes,
                    planes * Bottleneck.expansion,
                    kernel_size=1,
                    stride=stride,
                    bias=True,
                ),
                FrozenBatchNorm2d(planes * Bottleneck.expansion),
            )

        layers = [Bottleneck(self.inplanes, planes, stride=stride, downsample=downsample)]
        self.inplanes = planes * Bottleneck.expansion
        for _ in range(1, blocks):
            layers.append(Bottleneck(self.inplanes, planes))
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, ...]:  # noqa: D401
        c1 = self.relu(self.bn1(self.conv1(x)))
        c1 = self.maxpool(c1)

        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        return c1, c2, c3, c4, c5

Taking the intermediate ResNet feature maps C2-C5 as input, the FPN produces P2-P6.

1.1.3 FPN Network Structure

python
class ResNetFPN(nn.Module):
    """ResNet backbone followed by a top-down Feature Pyramid Network."""

    def __init__(
        self,
        architecture: str = "resnet50",
        out_channels: int = 256,
    ) -> None:
        super().__init__()
        layers = {"resnet50": (3, 4, 6, 3), "resnet101": (3, 4, 23, 3)}[architecture]
        self.backbone = ResNetBackbone(layers)
        c2_channels = 256
        c3_channels = 512
        c4_channels = 1024
        c5_channels = 2048

        self.lateral_c5 = nn.Conv2d(c5_channels, out_channels, kernel_size=1)
        self.lateral_c4 = nn.Conv2d(c4_channels, out_channels, kernel_size=1)
        self.lateral_c3 = nn.Conv2d(c3_channels, out_channels, kernel_size=1)
        self.lateral_c2 = nn.Conv2d(c2_channels, out_channels, kernel_size=1)

        self.out_p5 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.out_p4 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.out_p3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.out_p2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, ...]:  # noqa: D401
        _, c2, c3, c4, c5 = self.backbone(x)

        p5 = self.lateral_c5(c5)
        p4 = self.lateral_c4(c4) + F.interpolate(p5, scale_factor=2.0, mode="nearest")
        p3 = self.lateral_c3(c3) + F.interpolate(p4, scale_factor=2.0, mode="nearest")
        p2 = self.lateral_c2(c2) + F.interpolate(p3, scale_factor=2.0, mode="nearest")

        p5 = self.out_p5(p5)
        p4 = self.out_p4(p4)
        p3 = self.out_p3(p3)
        p2 = self.out_p2(p2)
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)
        return p2, p3, p4, p5, p6

1.2 Anchor Generation

Summary: anchors are generated centered on every pixel of a feature map (their coordinates live on the original image, obtained via the known scaling relation between image and feature map).

Going from the original image to a feature map involves a scaling step. Suppose this scaling is feature_stride = 16 and the resulting feature map has size (2, 3).

  1. Every point of the feature map is the center of the boxes we generate.
  2. The boxes must be mapped back to original-image coordinates.

There are three box sizes (32, 64, 128) and three aspect ratios (0.5, 1, 2).

Without multi-scale features (i.e. using only the backbone's last feature map), every feature-map point generates 3 * 3 = 9 boxes (the commonly cited k = 9 anchors per location), so a 2 * 3 = 6-point feature map yields 6 * 9 = 54 boxes.

With multi-scale features, the boxes are distributed to different feature maps by size: small boxes (32) go to the high-resolution map (256x256) and large boxes (512) go to the low-resolution map (16x16) (an example follows below). Each feature-map point then carries only the 3 aspect ratios, i.e. k = 3.


The anchor area is $S$, with $S = wh = \text{scale}^2$ and $\text{ratio} = w/h$, so

$h = \sqrt{S/\text{ratio}} = \text{scale}/\sqrt{\text{ratio}}$;

$w = \text{scale} \cdot \sqrt{\text{ratio}}$

Note: using the area as the reference

  1. keeps anchors of different aspect ratios comparable in detection capability;
  2. avoids issues caused by anchors that end up far too large or too small;
  3. better matches what object detection actually needs.
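A tiny numeric check of these formulas (my own illustration, not code from the repository): with scale = 32 and the three ratios, all three anchors keep the same area of 32 * 32 = 1024.

python
import numpy as np

# Sketch: anchor sides from the area-based parameterization (scale = 32 assumed)
scale = 32
ratios = np.array([0.5, 1.0, 2.0])          # ratio = w / h
heights = scale / np.sqrt(ratios)           # [45.25, 32.00, 22.63]
widths = scale * np.sqrt(ratios)            # [22.63, 32.00, 45.25]
print(np.round(heights, 2), np.round(widths, 2))
print(np.round(heights * widths))           # all areas equal 1024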

Invocation

python
# Generate Anchors
self.anchors = Variable(torch.from_numpy(utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,
                                                                        config.RPN_ANCHOR_RATIOS,
                                                                        config.BACKBONE_SHAPES,
                                                                        config.BACKBONE_STRIDES,
                                                                        config.RPN_ANCHOR_STRIDE)).float(), requires_grad=False)

Implementation

Here scales is [32, 64, 128, 256, 512] and the corresponding feature shapes are [256, 128, 64, 32, 16], i.e.:

boxes of scale 32x32 are generated on the 256x256 feature map;

boxes of scale 64x64 are generated on the 128x128 feature map;

...

boxes of scale 512x512 are generated on the 16x16 feature map.
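To get a feel for the numbers, here is a small sketch (my own illustration; it assumes 3 ratios per location and anchor_stride = 1) that counts the anchors contributed by each level:

python
# Sketch: anchors per pyramid level (assumes 3 ratios per location, anchor_stride = 1)
scales = [32, 64, 128, 256, 512]
feature_shapes = [(256, 256), (128, 128), (64, 64), (32, 32), (16, 16)]
ratios_per_location = 3

total = 0
for scale, (h, w) in zip(scales, feature_shapes):
    n = h * w * ratios_per_location
    total += n
    print(f"scale {scale:3d} on {h}x{w}: {n} anchors")
print("total:", total)   # 261888 anchors in all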

python
def generate_pyramid_anchors(scales, ratios, feature_shapes, feature_strides,
                             anchor_stride):
    """Generate anchors at different levels of a feature pyramid. Each scale
    is associated with a level of the pyramid, but each ratio is used in
    all levels of the pyramid.

    Returns:
    anchors: [N, (y1, x1, y2, x2)]. All generated anchors in one array. Sorted
        with the same order of the given scales. So, anchors of scale[0] come
        first, then anchors of scale[1], and so on.
    """
    # Anchors
    # [anchor_count, (y1, x1, y2, x2)]
    anchors = []
    for i in range(len(scales)): # [32, 64, 128, 256, 512]
        anchors.append(generate_anchors(scales[i], ratios, feature_shapes[i],
                                        feature_strides[i], anchor_stride))
    return np.concatenate(anchors, axis=0)
python
def generate_anchors(scales, ratios, shape, feature_stride, anchor_stride):
    """
    scales: 1D array of anchor sizes in pixels. Example: [32, 64, 128]
    ratios: 1D array of anchor ratios of width/height. Example: [0.5, 1, 2]
    shape: [height, width] spatial shape of the feature map over which
            to generate anchors.
    feature_stride: Stride of the feature map relative to the image in pixels.
    anchor_stride: Stride of anchors on the feature map. For example, if the
        value is 2 then generate anchors for every other feature map pixel.
    """
    # Enumerate all combinations of scales and aspect ratios
    scales, ratios = np.meshgrid(np.array(scales), np.array(ratios))
    # scales: [[32,64,128], [32,64,128], [32,64,128]] 
    # ratios: [[0.5,0.5,0.5], [1,1,1], [2,2,2]]
    
    scales = scales.flatten() # [32, 64, 128, 32, 64, 128, 32, 64, 128]
    ratios = ratios.flatten() # [0.5, 0.5, 0.5, 1, 1, 1, 2, 2, 2]

    # Compute heights and widths from the scales and ratios
    heights = scales / np.sqrt(ratios)  # [45.25, 90.51, 181.02, 32, 64, 128, 22.63, 45.25, 90.51]
    widths = scales * np.sqrt(ratios)   # [22.63, 45.25, 90.51, 32, 64, 128, 45.25, 90.51, 181.02]

    # Generate shifts (anchor center positions) in image coordinates
    shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride  # [0, 16] (assuming shape[0]=2, feature_stride=16)
    shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride  # [0, 16, 32] (assuming shape[1]=3, feature_stride=16)
    shifts_x, shifts_y = np.meshgrid(shifts_x, shifts_y)
    # shifts_x: [[0,16,32], [0,16,32]]
    # shifts_y: [[0,0,0], [16,16,16]]

    # Enumerate all combinations of shifts, widths, and heights
    box_widths, box_centers_x = np.meshgrid(widths, shifts_x)
    # box_widths: the 9 widths repeated 6 times, shape (6, 9)
    # box_centers_x: the 6 x-centers repeated 9 times, shape (6, 9)

    box_heights, box_centers_y = np.meshgrid(heights, shifts_y)
    # box_heights: the 9 heights repeated 6 times, shape (6, 9)
    # box_centers_y: the 6 y-centers repeated 9 times, shape (6, 9)

    # Reshape into a list of (y, x) centers and a list of (h, w) sizes
    box_centers = np.stack([box_centers_y, box_centers_x], axis=2).reshape([-1, 2])
    # box_centers: [[0,0], [0,0], ..., [16,32]], 54 rows (6 positions x 9 sizes)

    box_sizes = np.stack([box_heights, box_widths], axis=2).reshape([-1, 2])
    # box_sizes: [[45.25,22.63], [90.51,45.25], ..., [90.51,181.02]], 54 rows

    # Convert to corner coordinates (y1, x1, y2, x2)
    boxes = np.concatenate([box_centers - 0.5 * box_sizes,    # top-left corner
                            box_centers + 0.5 * box_sizes], axis=1)  # bottom-right corner
    # boxes: [[-22.63,-11.31,22.63,11.31], ..., 54 boxes in total]
    return boxes

1.3 RPN

Summary: for every anchor, the RPN returns a coarse classification (foreground vs. background) and box regression values (deltas).

The RPN (Region Proposal Network) generates candidate regions; it was introduced in the earlier Faster R-CNN post (see that post for details).

The RPN is run on every pyramid feature map.

Instantiation

python
# RPN
self.rpn = RPN(len(config.RPN_ANCHOR_RATIOS), config.RPN_ANCHOR_STRIDE, 256)

Invoking the RPN layer

python
# Feature extraction
[p2_out, p3_out, p4_out, p5_out, p6_out] = self.fpn(molded_images)

# Note that P6 is used in RPN, but not in the classifier heads.
rpn_feature_maps = [p2_out, p3_out, p4_out, p5_out, p6_out]
mrcnn_feature_maps = [p2_out, p3_out, p4_out, p5_out]

# Loop through pyramid layers
layer_outputs = []  # list of lists
for p in rpn_feature_maps:
    layer_outputs.append(self.rpn(p))


# Concatenate layer outputs
# Convert from list of lists of level outputs to list of lists
# of outputs across levels.
# e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]]
# Merge the outputs produced for each feature map
outputs = list(zip(*layer_outputs))
outputs = [torch.cat(list(o), dim=1) for o in outputs]
rpn_class_logits, rpn_class, rpn_bbox = outputs    

RPN implementation:

Each FPN feature map is passed through the RPN, which has two branches: a classification branch that predicts foreground vs. background, and a regression branch that predicts the box deltas.

Note:

anchors_per_location = 3 here because, with the multi-scale FPN, anchor scales are split across pyramid levels, so every feature-map pixel only carries the three aspect ratios (1:2, 1:1, 2:1).

python
class RPN(nn.Module):
    """Builds the model of Region Proposal Network.

    anchors_per_location: number of anchors per pixel in the feature map
    anchor_stride: Controls the density of anchors. Typically 1 (anchors for
                   every pixel in the feature map), or 2 (every other pixel).

    Returns:
        rpn_class_logits: [batch, anchors, 2] Anchor classifier logits (before softmax)
        rpn_probs: [batch, anchors, 2] Anchor classifier probabilities.
        rpn_bbox: [batch, anchors, (dy, dx, log(dh), log(dw))] Deltas to be
                  applied to anchors.
    """

    def __init__(self, anchors_per_location, anchor_stride, depth):
        super(RPN, self).__init__()
        self.anchors_per_location = anchors_per_location
        self.anchor_stride = anchor_stride
        self.depth = depth

        self.padding = SamePad2d(kernel_size=3, stride=self.anchor_stride)
        self.conv_shared = nn.Conv2d(self.depth, 512, kernel_size=3, stride=self.anchor_stride)
        self.relu = nn.ReLU(inplace=True)
        self.conv_class = nn.Conv2d(512, 2 * anchors_per_location, kernel_size=1, stride=1)
        self.softmax = nn.Softmax(dim=2)
        self.conv_bbox = nn.Conv2d(512, 4 * anchors_per_location, kernel_size=1, stride=1)

    def forward(self, x):
        # Shared convolutional base of the RPN
        x = self.relu(self.conv_shared(self.padding(x)))

        # Anchor Score. [batch, anchors per location * 2, height, width].
        rpn_class_logits = self.conv_class(x)

        # Reshape to [batch, 2, anchors]
        rpn_class_logits = rpn_class_logits.permute(0,2,3,1)
        rpn_class_logits = rpn_class_logits.contiguous()
        rpn_class_logits = rpn_class_logits.view(x.size()[0], -1, 2)

        # Softmax on last dimension of BG/FG.
        rpn_probs = self.softmax(rpn_class_logits)

        # Bounding box refinement. [batch, anchors per location * 4, height, width]
        # where the 4 values are (dy, dx, log(dh), log(dw))
        rpn_bbox = self.conv_bbox(x)

        # Reshape to [batch, 4, anchors]
        rpn_bbox = rpn_bbox.permute(0,2,3,1)
        rpn_bbox = rpn_bbox.contiguous()
        rpn_bbox = rpn_bbox.view(x.size()[0], -1, 4)

        return [rpn_class_logits, rpn_probs, rpn_bbox]

1.4 Proposal Layer

Summary: adjusts the anchors using the RPN outputs (classification scores and box deltas; proposal = anchor box + deltas), filters them, and returns the proposals.
Note: up to 2000 proposals are kept during training and only 1000 during inference.

python
# Generate proposals
# Proposals are [batch, N, (y1, x1, y2, x2)] in normalized coordinates
# and zero padded.

# How many proposals to keep
proposal_count = self.config.POST_NMS_ROIS_TRAINING if mode == "training" \
    else self.config.POST_NMS_ROIS_INFERENCE
rpn_rois = proposal_layer([rpn_class, rpn_bbox],
                            proposal_count=proposal_count,
                            nms_threshold=self.config.RPN_NMS_THRESHOLD,
                            anchors=self.anchors,
                            config=self.config)

At this point we have two pieces of information: the generated anchors, and the RPN outputs (foreground/background classification and box regression deltas).

This step uses the classification and regression results to select a subset of foreground anchors and adjust their coordinates (by applying the deltas).

Steps:

  1. Take the 6000 highest-scoring anchors.
  2. Adjust the anchors with the deltas.
  3. Clip boxes that extend beyond the image boundary.
  4. Apply non-maximum suppression to remove overlapping boxes.
  5. Normalize the coordinates.

Implementation

python
def apply_box_deltas(boxes, deltas):
    """Applies the given deltas to the given boxes.
    boxes: [N, 4] where each row is y1, x1, y2, x2
    deltas: [N, 4] where each row is [dy, dx, log(dh), log(dw)]
    """
    # 1. Convert anchors from (y1, x1, y2, x2) to (center_y, center_x, h, w)
    # Convert to y, x, h, w
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    
    # 2. Apply the deltas to the anchors
    # Apply deltas
    center_y += deltas[:, 0] * height
    center_x += deltas[:, 1] * width
    height *= torch.exp(deltas[:, 2])
    width *= torch.exp(deltas[:, 3])
    
    # 3. Convert the adjusted anchors back to (y1, x1, y2, x2)
    # Convert back to y1, x1, y2, x2
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    y2 = y1 + height
    x2 = x1 + width
    result = torch.stack([y1, x1, y2, x2], dim=1)
    return result

def clip_boxes(boxes, window):
    """
    boxes: [N, 4] each col is y1, x1, y2, x2
    window: [4] in the form y1, x1, y2, x2
    """
    boxes = torch.stack( \
        [boxes[:, 0].clamp(float(window[0]), float(window[2])),
         boxes[:, 1].clamp(float(window[1]), float(window[3])),
         boxes[:, 2].clamp(float(window[0]), float(window[2])),
         boxes[:, 3].clamp(float(window[1]), float(window[3]))], 1)
    return boxes

def proposal_layer(inputs, proposal_count, nms_threshold, anchors, config=None):
    """Receives anchor scores and selects a subset to pass as proposals
    to the second stage. Filtering is done based on anchor scores and
    non-max suppression to remove overlaps. It also applies bounding
    box refinement deltas to anchors.

    Inputs:
        rpn_probs: [batch, anchors, (bg prob, fg prob)]
        rpn_bbox: [batch, anchors, (dy, dx, log(dh), log(dw))]

    Returns:
        Proposals in normalized coordinates [batch, rois, (y1, x1, y2, x2)]
    """
    # 1. input: [rpn_class, rpn_bbox]
    
    # Currently only supports batchsize 1
    inputs[0] = inputs[0].squeeze(0) # rpn_class
    inputs[1] = inputs[1].squeeze(0) # rpn_bbox

    # Box Scores. Use the foreground class confidence. [Batch, num_rois, 1]
    # Use the foreground class score
    scores = inputs[0][:, 1]

    # Box deltas [batch, num_rois, 4]
    deltas = inputs[1] # box deltas (dy, dx, log(dh), log(dw))
    std_dev = Variable(torch.from_numpy(np.reshape(config.RPN_BBOX_STD_DEV, [1, 4])).float(), requires_grad=False)
    if config.GPU_COUNT:
        std_dev = std_dev.cuda()
    deltas = deltas * std_dev # multiply by the std dev to recover the actual deltas

    # 2. Take the 6000 highest-scoring anchors
    # Improve performance by trimming to top anchors by score
    # and doing the rest on the smaller subset.
    pre_nms_limit = min(6000, anchors.size()[0]) # cap: only the top 6000 anchors are processed
    scores, order = scores.sort(descending=True) # scores in descending order
    order = order[:pre_nms_limit] # indices of the highest scores
    scores = scores[:pre_nms_limit]
    deltas = deltas[order.data, :] # TODO: Support batch size > 1 ff. deltas (dy, dx, log(dh), log(dw)) of the top 6000 anchors
    anchors = anchors[order.data, :] # the top 6000 anchors

    # 3. Apply the deltas to the anchors
    # Apply deltas to anchors to get refined anchors.
    # [batch, N, (y1, x1, y2, x2)]
    boxes = apply_box_deltas(anchors, deltas)

    # 4. Clip boxes that extend beyond the image boundary
    # Clip to image boundaries. [batch, N, (y1, x1, y2, x2)]
    height, width = config.IMAGE_SHAPE[:2]
    window = np.array([0, 0, height, width]).astype(np.float32)
    boxes = clip_boxes(boxes, window)

    # Filter out small boxes
    # According to Xinlei Chen's paper, this reduces detection accuracy
    # for small objects, so we're skipping it.

    # 5. Non-max suppression
    keep = nms(torch.cat((boxes, scores.unsqueeze(1)), 1).data, nms_threshold)
    keep = keep[:proposal_count]
    boxes = boxes[keep, :]

    # 6. Normalize coordinates
    # Normalize dimensions to range of 0 to 1.
    norm = Variable(torch.from_numpy(np.array([height, width, height, width])).float(), requires_grad=False)
    if config.GPU_COUNT:
        norm = norm.cuda()
    normalized_boxes = boxes / norm

    # Add back batch dimension
    normalized_boxes = normalized_boxes.unsqueeze(0)

    return normalized_boxes

1.5 dataset

Let's look at how the training data is prepared. Two sets of targets are needed: the per-image ground truth (image, gt_class_ids, gt_boxes, gt_masks), used for the final (second-stage) classification and regression;

and the RPN targets (rpn_match, rpn_bbox), used for the coarse classification of anchors in the RPN.

1.5.1 Building rpn_targets

Positive samples:

  1. IoU > 0.7 with some GT box
  2. for each GT box, the anchor with the highest IoU

Negative samples:

IoU < 0.3

Neither positives nor negatives may exceed half of the sampled anchors.

Concretely,

  1. coarsely label anchors as positive / negative / neutral (by the rules above) -> rpn_match
  2. compute the deltas between each positive anchor and its matched GT box -> rpn_bbox (the parameterization is written out below)
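For reference, the deltas written into rpn_bbox use the standard box-regression parameterization (this is exactly what the loop at the end of build_rpn_targets below computes):

$$dy = \frac{y_c^{gt} - y_c^{a}}{h_a}, \qquad dx = \frac{x_c^{gt} - x_c^{a}}{w_a}, \qquad dh = \log\frac{h^{gt}}{h_a}, \qquad dw = \log\frac{w^{gt}}{w_a}$$

where $(y_c^{a}, x_c^{a}, h_a, w_a)$ is the anchor in center/size form and $(y_c^{gt}, x_c^{gt}, h^{gt}, w^{gt})$ is its matched GT box; the result is then divided by RPN_BBOX_STD_DEV.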
python
def build_rpn_targets(image_shape, anchors, gt_class_ids, gt_boxes, config):
    """Given the anchors and GT boxes, compute overlaps and identify positive
    anchors and deltas to refine them to match their corresponding GT boxes.

    anchors: [num_anchors, (y1, x1, y2, x2)]
    gt_class_ids: [num_gt_boxes] Integer class IDs.
    gt_boxes: [num_gt_boxes, (y1, x1, y2, x2)]

    Returns:
    rpn_match: [N] (int32) matches between anchors and GT boxes.
               1 = positive anchor, -1 = negative anchor, 0 = neutral
    rpn_bbox: [N, (dy, dx, log(dh), log(dw))] Anchor bbox deltas.
    """
    # Positive/negative label for every anchor
    # RPN Match: 1 = positive anchor, -1 = negative anchor, 0 = neutral
    rpn_match = np.zeros([anchors.shape[0]], dtype=np.int32)
    # Maximum number of anchors used for training per image
    # RPN bounding boxes: [max anchors per image, (dy, dx, log(dh), log(dw))]
    rpn_bbox = np.zeros((config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4)) # (256,4)

    # ----------------------------------------------------------------------------------------------
    # Find anchors that do not lie on crowd (multi-instance) boxes
    # Handle COCO crowds
    # A crowd box in COCO is a bounding box around several instances. Exclude
    # them from training. A crowd box is given a negative class ID.
    crowd_ix = np.where(gt_class_ids < 0)[0]
    if crowd_ix.shape[0] > 0:
        # Filter out crowds from ground truth class IDs and boxes
        non_crowd_ix = np.where(gt_class_ids > 0)[0]
        crowd_boxes = gt_boxes[crowd_ix]
        gt_class_ids = gt_class_ids[non_crowd_ix]
        gt_boxes = gt_boxes[non_crowd_ix]
        # Compute overlaps with crowd boxes [anchors, crowds]
        crowd_overlaps = utils.compute_overlaps(anchors, crowd_boxes)
        crowd_iou_max = np.amax(crowd_overlaps, axis=1)
        no_crowd_bool = (crowd_iou_max < 0.001)
    else:
        # All anchors don't intersect a crowd
        no_crowd_bool = np.ones([anchors.shape[0]], dtype=bool)
    
    # ----------------------------------------------------------------------------------------------
    # Compute IoU between anchors and gt_boxes
    # Compute overlaps [num_anchors, num_gt_boxes]
    overlaps = utils.compute_overlaps(anchors, gt_boxes)

    # Match anchors to GT Boxes
    # If an anchor overlaps a GT box with IoU >= 0.7 then it's positive.
    # If an anchor overlaps a GT box with IoU < 0.3 then it's negative.
    # Neutral anchors are those that don't match the conditions above,
    # and they don't influence the loss function.
    # However, don't keep any GT box unmatched (rare, but happens). Instead,
    # match it to the closest anchor (even if its max IoU is < 0.3).
    #
    # ----------------------------------------------------------------------------------------------
    # Assign positive and negative samples
    # 1. Set negative anchors first. They get overwritten below if a GT box is
    # matched to them. Skip boxes in crowd areas.
    # Negative samples
    anchor_iou_argmax = np.argmax(overlaps, axis=1) # for each anchor, the gt_box with the highest IoU
    anchor_iou_max = overlaps[np.arange(overlaps.shape[0]), anchor_iou_argmax] # that highest IoU value
    rpn_match[(anchor_iou_max < 0.3) & (no_crowd_bool)] = -1 # anchors with IoU < 0.3 become negatives
    
    # 2. Set an anchor for each GT box (regardless of IoU value).
    # TODO: If multiple anchors have the same IoU match all of them
    gt_iou_argmax = np.argmax(overlaps, axis=0) # for each gt_box, the anchor with the highest IoU
    rpn_match[gt_iou_argmax] = 1 # rule 1: each gt_box gets one positive anchor
    # 3. Set anchors with high overlap as positive.
    rpn_match[anchor_iou_max >= 0.7] = 1 # rule 2: anchors with IoU >= 0.7 are positives

    # ----------------------------------------------------------------------------------------------
    # Limit the number of positives and negatives: positives at most half
    # Subsample to balance positive and negative anchors
    # Don't let positives be more than half the anchors
    ids = np.where(rpn_match == 1)[0] # indices of positive anchors
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2) # how many positives exceed the half budget
    if extra > 0:
        # Reset the extra ones to neutral
        ids = np.random.choice(ids, extra, replace=False) 
        rpn_match[ids] = 0 # reset the excess positives to neutral
    
    # Same for negative proposals
    # Likewise limit the number of negatives
    ids = np.where(rpn_match == -1)[0] # indices of negative anchors
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE -
                        np.sum(rpn_match == 1)) # how many negatives exceed the remaining budget
    if extra > 0:
        # Rest the extra ones to neutral
        ids = np.random.choice(ids, extra, replace=False)
        rpn_match[ids] = 0 # reset the excess negatives to neutral

    # ----------------------------------------------------------------------------------------------
    # Compute deltas only for positive anchors; negatives need no deltas
    # For positive anchors, compute shift and scale needed to transform them
    # to match the corresponding GT boxes.
    ids = np.where(rpn_match == 1)[0] # indices of positive anchors
    ix = 0  # index into rpn_bbox
    # TODO: use box_refinment() rather than duplicating the code here
    for i, a in zip(ids, anchors[ids]):
        # Closest gt box (it might have IoU < 0.7)
        gt = gt_boxes[anchor_iou_argmax[i]] # the GT box this anchor matches

        # Convert coordinates to center plus width/height.
        # GT Box,(y1, x1, y2, x2) -> (center_y, center_x, height, width)
        gt_h = gt[2] - gt[0]
        gt_w = gt[3] - gt[1]
        gt_center_y = gt[0] + 0.5 * gt_h
        gt_center_x = gt[1] + 0.5 * gt_w
        # Anchor, (y1, x1, y2, x2) -> (center_y, center_x, height, width)
        a_h = a[2] - a[0]
        a_w = a[3] - a[1]
        a_center_y = a[0] + 0.5 * a_h
        a_center_x = a[1] + 0.5 * a_w

        # Compute the bbox refinement that the RPN should predict.
        # (dy, dx, log(dh), log(dw)): deltas of the GT box relative to the anchor
        rpn_bbox[ix] = [
            (gt_center_y - a_center_y) / a_h,
            (gt_center_x - a_center_x) / a_w,
            np.log(gt_h / a_h),
            np.log(gt_w / a_w),
        ]
        # Normalize
        rpn_bbox[ix] /= config.RPN_BBOX_STD_DEV # normalize; must be un-normalized later
        ix += 1

    return rpn_match, rpn_bbox

1.5.2 dataset

Builds both the RPN targets and the final ground-truth labels.

python
class Dataset(torch.utils.data.Dataset):
    def __init__(self, dataset, config, augment=True):
        """A generator that returns images and corresponding target class ids,
            bounding box deltas, and masks.

            dataset: The Dataset object to pick data from
            config: The model config object
            shuffle: If True, shuffles the samples before every epoch
            augment: If True, applies image augmentation to images (currently only
                     horizontal flips are supported)

            Returns a Python generator. Upon calling next() on it, the
            generator returns two lists, inputs and outputs. The contents
            of the lists differ depending on the received arguments:
            inputs list:
            - images: [batch, H, W, C]
            - image_metas: [batch, size of image meta]
            - rpn_match: [batch, N] Integer (1=positive anchor, -1=negative, 0=neutral)
            - rpn_bbox: [batch, N, (dy, dx, log(dh), log(dw))] Anchor bbox deltas.
            - gt_class_ids: [batch, MAX_GT_INSTANCES] Integer class IDs
            - gt_boxes: [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)]
            - gt_masks: [batch, height, width, MAX_GT_INSTANCES]. The height and width
                        are those of the image unless use_mini_mask is True, in which
                        case they are defined in MINI_MASK_SHAPE.

            outputs list: Usually empty in regular training. But if detection_targets
                is True then the outputs list contains target class_ids, bbox deltas,
                and masks.
            """
        self.b = 0  # batch item index
        self.image_index = -1
        self.image_ids = np.copy(dataset.image_ids)
        self.error_count = 0

        self.dataset = dataset
        self.config = config
        self.augment = augment

        # Anchors
        # [anchor_count, (y1, x1, y2, x2)]
        self.anchors = utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,
                                                 config.RPN_ANCHOR_RATIOS,
                                                 config.BACKBONE_SHAPES,
                                                 config.BACKBONE_STRIDES,
                                                 config.RPN_ANCHOR_STRIDE)

    def __getitem__(self, image_index):
        # Get GT bounding boxes and masks for image.
        image_id = self.image_ids[image_index]
        image, image_metas, gt_class_ids, gt_boxes, gt_masks = \
            load_image_gt(self.dataset, self.config, image_id, augment=self.augment,
                          use_mini_mask=self.config.USE_MINI_MASK)

        # Skip images that have no instances. This can happen in cases
        # where we train on a subset of classes and the image doesn't
        # have any of the classes we care about.
        if not np.any(gt_class_ids > 0):
            return None

        # RPN Targets
        rpn_match, rpn_bbox = build_rpn_targets(image.shape, self.anchors,
                                                gt_class_ids, gt_boxes, self.config)

        # If there are more instances than fit in the array, sub-sample from them.
        if gt_boxes.shape[0] > self.config.MAX_GT_INSTANCES:
            ids = np.random.choice(
                np.arange(gt_boxes.shape[0]), self.config.MAX_GT_INSTANCES, replace=False)
            gt_class_ids = gt_class_ids[ids]
            gt_boxes = gt_boxes[ids]
            gt_masks = gt_masks[:, :, ids]

        # Add to batch
        rpn_match = rpn_match[:, np.newaxis]
        images = mold_image(image.astype(np.float32), self.config)

        # Convert
        images = torch.from_numpy(images.transpose(2, 0, 1)).float()
        image_metas = torch.from_numpy(image_metas)
        rpn_match = torch.from_numpy(rpn_match)
        rpn_bbox = torch.from_numpy(rpn_bbox).float()
        gt_class_ids = torch.from_numpy(gt_class_ids)
        gt_boxes = torch.from_numpy(gt_boxes).float()
        gt_masks = torch.from_numpy(gt_masks.astype(int).transpose(2, 0, 1)).float()

        return images, image_metas, rpn_match, rpn_bbox, gt_class_ids, gt_boxes, gt_masks

    def __len__(self):
        return self.image_ids.shape[0]

1.6 detection_target_layer

Summary: further filters the proposals from the proposal layer (2000 in training, 1000 in inference), splits them into positive and negative samples, attaches the matching GT information to the positives, and finally returns 200 ROIs (positive:negative about 1:3; the GT targets for negatives are zero-filled).

Concretely,

  1. Sample positive and negative ROIs from the proposals.
  2. Generate the corresponding class IDs, bounding-box regression targets (deltas), and segmentation masks for each sample.
  3. Keep the positive/negative ratio balanced.

Invocation:

python
rois, target_class_ids, target_deltas, target_mask = \
    detection_target_layer(rpn_rois, gt_class_ids, gt_boxes, gt_masks, self.config)

Implementation:

python
def detection_target_layer(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    """Subsamples proposals and generates target box refinment, class_ids,
    and masks for each.

    Inputs:
    proposals: [batch, N, (y1, x1, y2, x2)] in normalized coordinates. Might
               be zero padded if there are not enough proposals.
    gt_class_ids: [batch, MAX_GT_INSTANCES] Integer class IDs.
    gt_boxes: [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized
              coordinates.
    gt_masks: [batch, height, width, MAX_GT_INSTANCES] of boolean type

    Returns: Target ROIs and corresponding class IDs, bounding box shifts,
    and masks.
    rois: [batch, TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized
          coordinates
    target_class_ids: [batch, TRAIN_ROIS_PER_IMAGE]. Integer class IDs.
    target_deltas: [batch, TRAIN_ROIS_PER_IMAGE, NUM_CLASSES,
                    (dy, dx, log(dh), log(dw), class_id)]
                   Class-specific bbox refinements.
    target_mask: [batch, TRAIN_ROIS_PER_IMAGE, height, width]
                 Masks cropped to bbox boundaries and resized to neural
                 network output size.
    """

    # Currently only supports batchsize 1
    proposals = proposals.squeeze(0)
    gt_class_ids = gt_class_ids.squeeze(0)
    gt_boxes = gt_boxes.squeeze(0)
    gt_masks = gt_masks.squeeze(0)
    # ----------------------------------------------------------------------------------------------
    # 1. Separate crowd (multi-instance) boxes from regular boxes
    # Handle COCO crowds
    # A crowd box in COCO is a bounding box around several instances. Exclude
    # them from training. A crowd box is given a negative class ID.
    if torch.nonzero(gt_class_ids < 0).size():
        crowd_ix = torch.nonzero(gt_class_ids < 0)[:, 0] # crowd indices
        non_crowd_ix = torch.nonzero(gt_class_ids > 0)[:, 0] # non-crowd indices
        crowd_boxes = gt_boxes[crowd_ix.data, :]
        crowd_masks = gt_masks[crowd_ix.data, :, :]

        # Keep only the non-crowd gt_class_ids, gt_boxes, gt_masks
        gt_class_ids = gt_class_ids[non_crowd_ix.data]
        gt_boxes = gt_boxes[non_crowd_ix.data, :]
        gt_masks = gt_masks[non_crowd_ix.data, :]

        # Compute overlaps with crowd boxes [anchors, crowds]
        crowd_overlaps = bbox_overlaps(proposals, crowd_boxes)
        crowd_iou_max = torch.max(crowd_overlaps, dim=1)[0]
        no_crowd_bool = crowd_iou_max < 0.001
    else:
        no_crowd_bool =  Variable(torch.ByteTensor(proposals.size()[0]*[True]), requires_grad=False)
        if config.GPU_COUNT:
            no_crowd_bool = no_crowd_bool.cuda()

    # ----------------------------------------------------------------------------------------------
    # 2. Select positive samples based on IoU
    # Compute overlaps matrix [proposals, gt_boxes]
    overlaps = bbox_overlaps(proposals, gt_boxes) # IoU between proposals and GT boxes

    # Determine positive and negative ROIs
    roi_iou_max = torch.max(overlaps, dim=1)[0] # for each proposal, its highest IoU with any GT box

    # 1. Positive ROIs are those with >= 0.5 IoU with a GT box
    positive_roi_bool = roi_iou_max >= 0.5 # proposals with IoU >= 0.5 are positives

    # Subsample ROIs. Aim for 33% positive
    # Positive ROIs
    if torch.nonzero(positive_roi_bool).size():
        positive_indices = torch.nonzero(positive_roi_bool)[:, 0] # indices of positive proposals

        positive_count = int(config.TRAIN_ROIS_PER_IMAGE * # number of positives; total ROIs: 200
                             config.ROI_POSITIVE_RATIO) # positive ratio: 0.33
        rand_idx = torch.randperm(positive_indices.size()[0]) # shuffle the positive indices
        rand_idx = rand_idx[:positive_count] # take the first positive_count of them
        if config.GPU_COUNT:
            rand_idx = rand_idx.cuda()
        positive_indices = positive_indices[rand_idx] # sampled positive indices
        positive_count = positive_indices.size()[0] # actual number of positives
        positive_rois = proposals[positive_indices.data,:] # the positive proposals

        # Assign positive ROIs to GT boxes.
        positive_overlaps = overlaps[positive_indices.data,:] # IoU of the positives with the GT boxes
        roi_gt_box_assignment = torch.max(positive_overlaps, dim=1)[1] # index of the best-matching GT box
        roi_gt_boxes = gt_boxes[roi_gt_box_assignment.data,:] # the matched GT box of each positive
        roi_gt_class_ids = gt_class_ids[roi_gt_box_assignment.data] # the class of the matched GT box

        # Compute the deltas and normalize them
        # Compute bbox refinement for positive ROIs
        deltas = Variable(utils.box_refinement(positive_rois.data, roi_gt_boxes.data), requires_grad=False)
        std_dev = Variable(torch.from_numpy(config.BBOX_STD_DEV).float(), requires_grad=False)
        if config.GPU_COUNT:
            std_dev = std_dev.cuda()
        deltas /= std_dev

        # Assign positive ROIs to GT masks
        roi_masks = gt_masks[roi_gt_box_assignment.data,:,:] # GT masks of the matched boxes

        # Compute mask targets
        boxes = positive_rois
        if config.USE_MINI_MASK:
            # Transform ROI corrdinates from normalized image space
            # to normalized mini-mask space.
            y1, x1, y2, x2 = positive_rois.chunk(4, dim=1) # proposal coordinates
            gt_y1, gt_x1, gt_y2, gt_x2 = roi_gt_boxes.chunk(4, dim=1) # GT box coordinates
            gt_h = gt_y2 - gt_y1 # GT box height
            gt_w = gt_x2 - gt_x1 # GT box width
            y1 = (y1 - gt_y1) / gt_h # proposal coordinates normalized to the GT box
            x1 = (x1 - gt_x1) / gt_w
            y2 = (y2 - gt_y1) / gt_h
            x2 = (x2 - gt_x1) / gt_w
        box_ids = Variable(torch.arange(roi_masks.size()[0]), requires_grad=False).int()
        if config.GPU_COUNT:
            box_ids = box_ids.cuda()
        masks = Variable(CropAndResizeFunction(config.MASK_SHAPE[0], config.MASK_SHAPE[1], 0)(roi_masks.unsqueeze(1), boxes, box_ids).data, requires_grad=False)
        masks = masks.squeeze(1)

        # Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with
        # binary cross entropy loss.
        masks = torch.round(masks)
    else:
        positive_count = 0

    # ----------------------------------------------------------------------------------------------
    # Negative samples
    # 2. Negative ROIs are those with < 0.5 with every GT box. Skip crowds.
    negative_roi_bool = roi_iou_max < 0.5 # proposals with IoU < 0.5 are negatives
    negative_roi_bool = negative_roi_bool & no_crowd_bool # exclude crowd regions
    # Negative ROIs. Add enough to maintain positive:negative ratio.
    if torch.nonzero(negative_roi_bool).size() and positive_count>0:
        negative_indices = torch.nonzero(negative_roi_bool)[:, 0] # indices of negative proposals
        r = 1.0 / config.ROI_POSITIVE_RATIO # inverse of the positive ratio
        negative_count = int(r * positive_count - positive_count) # number of negatives to sample
        rand_idx = torch.randperm(negative_indices.size()[0]) # shuffle the negative indices
        rand_idx = rand_idx[:negative_count] # take the first negative_count of them
        if config.GPU_COUNT:
            rand_idx = rand_idx.cuda()
        negative_indices = negative_indices[rand_idx] # sampled negative indices
        negative_count = negative_indices.size()[0] # actual number of negatives
        negative_rois = proposals[negative_indices.data, :]  # the negative proposals
    else:
        negative_count = 0

    # -----------------------------------------------------------------------------------------------
    # Append negative ROIs and pad bbox deltas and masks that
    # are not used for negative ROIs with zeros.
    # Both positives and negatives exist
    if positive_count > 0 and negative_count > 0:
        # 1. concatenate the ROIs; class IDs of the negatives are padded with zeros
        rois = torch.cat((positive_rois, negative_rois), dim=0) # all sampled ROIs
        zeros = Variable(torch.zeros(negative_count), requires_grad=False).int()
        if config.GPU_COUNT:
            zeros = zeros.cuda()
        roi_gt_class_ids = torch.cat([roi_gt_class_ids, zeros], dim=0)
        # 2. zero deltas for the negatives
        zeros = Variable(torch.zeros(negative_count,4), requires_grad=False)
        if config.GPU_COUNT:
            zeros = zeros.cuda()
        deltas = torch.cat([deltas, zeros], dim=0)
        # 3. zero masks for the negatives
        zeros = Variable(torch.zeros(negative_count,config.MASK_SHAPE[0],config.MASK_SHAPE[1]), requires_grad=False)
        if config.GPU_COUNT:
            zeros = zeros.cuda()
        masks = torch.cat([masks, zeros], dim=0)
    # Only positives exist
    elif positive_count > 0:
        rois = positive_rois
    # Only negatives exist
    elif negative_count > 0:
        # 1. the negative proposals; their class IDs are all zero (background)
        rois = negative_rois
        zeros = Variable(torch.zeros(negative_count), requires_grad=False)
        if config.GPU_COUNT:
            zeros = zeros.cuda()
        roi_gt_class_ids = zeros
        # 2. zero deltas for the negatives
        zeros = Variable(torch.zeros(negative_count,4), requires_grad=False).int()
        if config.GPU_COUNT:
            zeros = zeros.cuda()
        deltas = zeros
        # 3. zero masks for the negatives
        
        zeros = Variable(torch.zeros(negative_count,config.MASK_SHAPE[0],config.MASK_SHAPE[1]), requires_grad=False)
        if config.GPU_COUNT:
            zeros = zeros.cuda()
        masks = zeros
    # Neither positives nor negatives
    else:
        rois = Variable(torch.FloatTensor(), requires_grad=False)
        roi_gt_class_ids = Variable(torch.IntTensor(), requires_grad=False)
        deltas = Variable(torch.FloatTensor(), requires_grad=False)
        masks = Variable(torch.FloatTensor(), requires_grad=False)
        if config.GPU_COUNT:
            rois = rois.cuda()
            roi_gt_class_ids = roi_gt_class_ids.cuda()
            deltas = deltas.cuda()
            masks = masks.cuda()

    return rois, roi_gt_class_ids, deltas, masks

1.7 classifier

Summary: this step performs the second-stage classification (into specific classes) and regression on the ROIs.

We now have a set of ROIs; each is run through a classifier head and a regressor head. The classifier assigns each ROI a specific class (or background), and the regressor predicts the refined bounding box of the object in the ROI.

Invocation:

python
# Network Heads
# Proposal classifier and BBox regressor heads
mrcnn_class_logits, mrcnn_class, mrcnn_bbox = self.classifier(mrcnn_feature_maps, rois)

Note: the input feature maps here come from P2-P5; P6 is not included.

python
# Note that P6 is used in RPN, but not in the classifier heads.
rpn_feature_maps = [p2_out, p3_out, p4_out, p5_out, p6_out]
mrcnn_feature_maps = [p2_out, p3_out, p4_out, p5_out]

Implementation:

python
class Classifier(nn.Module):
    def __init__(self, depth, pool_size, image_shape, num_classes):
        super(Classifier, self).__init__()
        self.depth = depth
        self.pool_size = pool_size
        self.image_shape = image_shape
        self.num_classes = num_classes
        self.conv1 = nn.Conv2d(self.depth, 1024, kernel_size=self.pool_size, stride=1)
        self.bn1 = nn.BatchNorm2d(1024, eps=0.001, momentum=0.01)
        self.conv2 = nn.Conv2d(1024, 1024, kernel_size=1, stride=1)
        self.bn2 = nn.BatchNorm2d(1024, eps=0.001, momentum=0.01)
        self.relu = nn.ReLU(inplace=True)

        self.linear_class = nn.Linear(1024, num_classes)
        self.softmax = nn.Softmax(dim=1)

        self.linear_bbox = nn.Linear(1024, num_classes * 4)

    def forward(self, x, rois):
        x = pyramid_roi_align([rois]+x, self.pool_size, self.image_shape)
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)

        x = x.view(-1,1024)
        mrcnn_class_logits = self.linear_class(x) # classification logits
        mrcnn_probs = self.softmax(mrcnn_class_logits)

        mrcnn_bbox = self.linear_bbox(x)
        mrcnn_bbox = mrcnn_bbox.view(mrcnn_bbox.size()[0], -1, 4)

        return [mrcnn_class_logits, mrcnn_probs, mrcnn_bbox]

1.7.1 ROIAlign

First, ROI Pooling: we use the proposals for classification and regression, which concretely means cropping the local features inside each proposal and feeding them to the heads. The cropped feature patches have different sizes, so they are resized to a common size; that is what ROI Pooling used to do.

Next, the cropping itself: previously it was single-scale, i.e. performed on one backbone feature map. Now it is multi-scale; one option would be to crop every proposal on every feature map, the other is to assign proposals to feature maps by size, cropping small proposals on high-resolution maps and large proposals on low-resolution maps.

Finally, ROIAlign: there is a scale factor between the image and the feature map, so mapping a proposal from the image onto the feature map involves the same scaling. With a factor of 4 and a 45x45 proposal, 45/4 does not divide evenly, so a rounding step is needed and precision is lost (the proposal becomes 11x11 on the feature map). The crop is then pooled to 7x7, and 11/7 again does not divide evenly, losing more information.

This is why Mask R-CNN proposes ROIAlign.
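To make the difference concrete, here is a minimal NumPy sketch (my own illustration, using the simplified single-sample-per-bin variant mentioned in the code comments further below; the feature map and box values are made up): the box keeps its fractional coordinates on the feature map, and each of the 7x7 output bins is filled by bilinear interpolation instead of by rounding and pooling.

python
import numpy as np

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at real-valued coordinates (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feature.shape[0] - 1), min(x0 + 1, feature.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feature[y0, x0] * (1 - dy) * (1 - dx) +
            feature[y0, x1] * (1 - dy) * dx +
            feature[y1, x0] * dy * (1 - dx) +
            feature[y1, x1] * dy * dx)

# A 45x45 box starting at image pixel (10, 10) maps to 45 / 4 = 11.25 cells on a
# stride-4 feature map. ROI Pooling would round 11.25 down to 11; ROI Align keeps
# 11.25 and samples each output bin at fractional coordinates.
feat = np.random.rand(64, 64).astype(np.float32)   # a single-channel feature map
box_y1 = box_x1 = 10.0 / 4                          # box corner on the feature map, no rounding
box_h = box_w = 45.0 / 4
pooled = np.empty((7, 7), dtype=np.float32)
for i in range(7):
    for j in range(7):
        # one sample at the center of each bin (single-sample variant)
        pooled[i, j] = bilinear(feat,
                                box_y1 + (i + 0.5) * box_h / 7,
                                box_x1 + (j + 0.5) * box_w / 7)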

1.7.1.1 ROIAlign

(Figure) Original ROI Pooling

(Figure) ROIAlign

1.7.1.2 Assignment Rule

The assignment rule follows a formula.

Equation (1) comes from Section 4.2 of the paper "Feature Pyramid Networks for Object Detection".

It is used when the FPN (Feature Pyramid Network) is plugged into a region-based detector such as Fast R-CNN for object detection.

Equation (1) and its role

The purpose of Equation (1) is to decide which pyramid level $P_k$ of the FPN a Region of Interest (RoI) should extract its features from.

Equation (1):

$$k = \left\lfloor k_0 + \log_2\!\left(\frac{\sqrt{wh}}{224}\right) \right\rfloor \qquad (1)$$


Breakdown of the terms in the formula:

    • $k$: the target pyramid level index, i.e. the FPN level $P_k$ the RoI is assigned to (e.g. $P_2, P_3, P_4, P_5$).
    • $w$, $h$: the RoI's width and height on the input image; $\sqrt{wh}$ is the RoI scale (the square root of its area).
    • 224: the canonical size, i.e. the standard ImageNet pre-training input size.
    • $k_0$: the base level that a canonically sized RoI maps to; an RoI with area $224^2$ is mapped to $P_{k_0}$.
    • $\lfloor \cdot \rfloor$: the floor function, ensuring $k$ is an integer level index.
    • Setting of $k_0$: in the ResNet-based Faster R-CNN system, $C_4$ serves as the single-scale baseline, so $k_0 = 4$.

The mechanism and intuition behind the formula

The formula treats the FPN as if it were a featurized image pyramid and adapts the assignment strategy of region-based detectors (such as Fast R-CNN) accordingly.

The core idea: map small RoIs to higher-resolution FPN levels (smaller index $k$), and large RoIs to lower-resolution but semantically stronger levels (larger index $k$).

Intuition:

  1. Ratio of RoI scale to the canonical size: $\frac{\sqrt{wh}}{224}$ compares the RoI scale to the canonical scale of 224.
  2. Logarithm: $\log_2(\cdot)$ converts this ratio to base 2, i.e. how many "octaves" (powers of two) the RoI scale is above or below the 224 baseline.
  3. Level selection:
    • If the RoI is exactly $224 \times 224$, then $\log_2(\cdot) = 0$ and $k = k_0 = 4$.
    • If the RoI is smaller, say half of 224, then $\log_2(\tfrac{1}{2}) = -1$ and $k = k_0 - 1 = 3$, so the RoI is assigned to the higher-resolution $P_3$.
    • This ensures that RoIs of different sizes extract features from the pyramid level best matched to their scale, which is crucial for multi-scale detection (small objects in particular).

This assignment strategy lets the FPN exploit the pyramid structure inherent in convolutional networks and provide semantically rich features for RoIs of every scale.


A summarizing analogy

You can picture Equation (1) as the automatic sorting system of a logistics center.

The FPN is a multi-level shelf (the pyramid levels $P_k$), each shelf corresponding to one resolution. Equation (1) acts like a barcode scanner: it reads the size of a parcel (the RoI) and immediately decides which shelf ($P_k$) it should go to, so small parcels land on the fine-grained lower shelves (high-resolution feature maps) and large parcels land on the coarser upper shelves (low-resolution, semantically rich feature maps). Every parcel gets the handling that suits it; for detection, every RoI extracts features at the scale that fits it best.
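As a quick illustration of Equation (1) (my own sketch; note that the repository code below works in normalized coordinates and uses rounding plus clamping rather than a plain floor), here is how a few RoI sizes map to pyramid levels:

python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Hypothetical helper: map an RoI of (w, h) pixels to an FPN level per Eq. (1),
    clamped to the available levels P2-P5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k_max, k))

for size in [32, 64, 112, 224, 448, 896]:
    print(f"{size}x{size} RoI -> P{fpn_level(size, size)}")
# 32 -> P2 (clamped), 112 -> P3, 224 -> P4, 448 -> P5, 896 -> P5 (clamped)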

python
def pyramid_roi_align(inputs, pool_size, image_shape):
    """Implements ROI Pooling on multiple levels of the feature pyramid.

    Params:
    - pool_size: [height, width] of the output pooled regions. Usually [7, 7]
    - image_shape: [height, width, channels]. Shape of input image in pixels

    Inputs:
    - boxes: [batch, num_boxes, (y1, x1, y2, x2)] in normalized
             coordinates.
    - Feature maps: List of feature maps from different levels of the pyramid.
                    Each is [batch, channels, height, width]

    Output:
    Pooled regions in the shape: [num_boxes, height, width, channels].
    The width and height are those specific in the pool_shape in the layer
    constructor.
    """

    # Currently only supports batchsize 1 input : [rois,feature maps]
    for i in range(len(inputs)):
        inputs[i] = inputs[i].squeeze(0)

    # Crop boxes [batch, num_boxes, (y1, x1, y2, x2)] in normalized coords
    boxes = inputs[0] # boxes in normalized coordinates

    # Feature Maps. List of feature maps from different level of the
    # feature pyramid. Each is [batch, height, width, channels]
    feature_maps = inputs[1:] # the pyramid feature maps P2-P5

    # ----------------------------------------------------------------------------------------------
    # Assign each ROI to a feature map by area and apply ROI Align there
    # Assign each ROI to a level in the pyramid based on the ROI area.
    y1, x1, y2, x2 = boxes.chunk(4, dim=1)
    h = y2 - y1
    w = x2 - x1

    # Equation 1 in the Feature Pyramid Networks paper. Account for
    # the fact that our coordinates are normalized here.
    # e.g. a 224x224 ROI (in pixels) maps to P4
    image_area = Variable(torch.FloatTensor([float(image_shape[0]*image_shape[1])]), requires_grad=False)
    if boxes.is_cuda:
        image_area = image_area.cuda()
    roi_level = 4 + log2(torch.sqrt(h*w)/(224.0/torch.sqrt(image_area)))
    roi_level = roi_level.round().int()
    roi_level = roi_level.clamp(2,5)


    # Loop through levels and apply ROI pooling to each. P2 to P5.
    pooled = []
    box_to_level = []
    for i, level in enumerate(range(2, 6)):
        ix  = roi_level==level
        if not ix.any():
            continue
        ix = torch.nonzero(ix)[:,0]
        level_boxes = boxes[ix.data, :] # boxes assigned to the current level

        # Keep track of which box is mapped to which level
        box_to_level.append(ix.data)

        # Stop gradient propogation to ROI proposals
        level_boxes = level_boxes.detach()

        # Crop and Resize
        # From Mask R-CNN paper: "We sample four regular locations, so
        # that we can evaluate either max or average pooling. In fact,
        # interpolating only a single value at each bin center (without
        # pooling) is nearly as effective."
        #
        # Here we use the simplified approach of a single value per bin,
        # which is how it's done in tf.crop_and_resize()
        # Result: [batch * num_boxes, pool_height, pool_width, channels]
        ind = Variable(torch.zeros(level_boxes.size()[0]),requires_grad=False).int()
        if level_boxes.is_cuda:
            ind = ind.cuda()
        # Feature map of the current level
        feature_maps[i] = feature_maps[i].unsqueeze(0)  #CropAndResizeFunction needs batch dimension
        # Crop the ROIs of different sizes and resize them all to pool_size x pool_size
        pooled_features = CropAndResizeFunction(pool_size, pool_size, 0)(feature_maps[i], level_boxes, ind)
        pooled.append(pooled_features)

    # Pack pooled features into one tensor
    pooled = torch.cat(pooled, dim=0)

    # Pack box_to_level mapping into one array and add another
    # column representing the order of pooled boxes
    box_to_level = torch.cat(box_to_level, dim=0) 

    # Rearrange pooled features to match the order of the original boxes
    _, box_to_level = torch.sort(box_to_level) # restore the original box order
    pooled = pooled[box_to_level, :, :]

    return pooled

1.8 mask

Invocation

The inputs are the same as for the classifier above: P2-P5.

python
# Create masks for detections
mrcnn_mask = self.mask(mrcnn_feature_maps, rois)

Note that the final activation is a sigmoid, so each class gets an independent binary mask prediction.
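A small sketch of how the per-class sigmoid output is typically consumed at inference (my own illustration; the 81-class / 28x28 sizes and the class index are assumptions matching a COCO-style configuration): only the channel of the class predicted by the classifier head is kept, then thresholded at 0.5.

python
import numpy as np

# Sketch: pick the mask channel of the predicted class and binarize it
num_classes, mask_size = 81, 28
mrcnn_mask = np.random.rand(num_classes, mask_size, mask_size)  # mask head output for one ROI
pred_class_id = 17                                              # class from the classifier head (assumed)
binary_mask = (mrcnn_mask[pred_class_id] >= 0.5).astype(np.uint8)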

Implementation

python
# Mask branch
class Mask(nn.Module):
    def __init__(self, depth, pool_size, image_shape, num_classes):
        super(Mask, self).__init__()
        self.depth = depth
        self.pool_size = pool_size
        self.image_shape = image_shape
        self.num_classes = num_classes
        self.padding = SamePad2d(kernel_size=3, stride=1)
        self.conv1 = nn.Conv2d(self.depth, 256, kernel_size=3, stride=1)
        self.bn1 = nn.BatchNorm2d(256, eps=0.001)
        self.conv2 = nn.Conv2d(256, 256, kernel_size=3, stride=1)
        self.bn2 = nn.BatchNorm2d(256, eps=0.001)
        self.conv3 = nn.Conv2d(256, 256, kernel_size=3, stride=1)
        self.bn3 = nn.BatchNorm2d(256, eps=0.001)
        self.conv4 = nn.Conv2d(256, 256, kernel_size=3, stride=1)
        self.bn4 = nn.BatchNorm2d(256, eps=0.001)
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.conv5 = nn.Conv2d(256, num_classes, kernel_size=1, stride=1)
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, rois):
        x = pyramid_roi_align([rois] + x, self.pool_size, self.image_shape)
        x = self.conv1(self.padding(x))
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(self.padding(x))
        x = self.bn2(x)
        x = self.relu(x)
        x = self.conv3(self.padding(x))
        x = self.bn3(x)
        x = self.relu(x)
        x = self.conv4(self.padding(x))
        x = self.bn4(x)
        x = self.relu(x)
        x = self.deconv(x)
        x = self.relu(x)
        x = self.conv5(x)
        x = self.sigmoid(x)

        return x

1.9 Losses

Invocation

python
# Run object detection
rpn_class_logits, rpn_pred_bbox, target_class_ids, mrcnn_class_logits, target_deltas, mrcnn_bbox, target_mask, mrcnn_mask = \
    self.predict([images, image_metas, gt_class_ids, gt_boxes, gt_masks], mode='training')

rpn_class_loss, rpn_bbox_loss, mrcnn_class_loss, mrcnn_bbox_loss, mrcnn_mask_loss = compute_losses(rpn_match, rpn_bbox, rpn_class_logits, rpn_pred_bbox, target_class_ids, mrcnn_class_logits, target_deltas, mrcnn_bbox, target_mask, mrcnn_mask)
loss = rpn_class_loss + rpn_bbox_loss + mrcnn_class_loss + mrcnn_bbox_loss + mrcnn_mask_loss

1.9.1 Overview

The first stage uses rpn_match, i.e. the notion of positive and negative samples.

For classification, both positives and negatives contribute; neutral samples do not.

For regression, only positives contribute.

python
def compute_losses(rpn_match, rpn_bbox, rpn_class_logits, rpn_pred_bbox, target_class_ids, mrcnn_class_logits, target_deltas, mrcnn_bbox, target_mask, mrcnn_mask):

    rpn_class_loss = compute_rpn_class_loss(rpn_match, rpn_class_logits) # rpn class
    rpn_bbox_loss = compute_rpn_bbox_loss(rpn_bbox, rpn_match, rpn_pred_bbox) # rpn box
    mrcnn_class_loss = compute_mrcnn_class_loss(target_class_ids, mrcnn_class_logits) # class
    mrcnn_bbox_loss = compute_mrcnn_bbox_loss(target_deltas, target_class_ids, mrcnn_bbox) # box
    mrcnn_mask_loss = compute_mrcnn_mask_loss(target_mask, target_class_ids, mrcnn_mask) # mask

    return [rpn_class_loss, rpn_bbox_loss, mrcnn_class_loss, mrcnn_bbox_loss, mrcnn_mask_loss]
1. Stage-one (RPN) classification loss
python
def compute_rpn_class_loss(rpn_match, rpn_class_logits):
    """RPN anchor classifier loss.

    rpn_match: [batch, anchors, 1]. Anchor match type. 1=positive,
               -1=negative, 0=neutral anchor.
    rpn_class_logits: [batch, anchors, 2]. RPN classifier logits for FG/BG.
    """

    # Squeeze last dim to simplify
    rpn_match = rpn_match.squeeze(2)

    # Get anchor classes. Convert the -1/+1 match to 0/1 values.
    anchor_class = (rpn_match == 1).long()

    # Positive and Negative anchors contribute to the loss,
    # but neutral anchors (match value = 0) don't.
    indices = torch.nonzero(rpn_match != 0)

    # Pick rows that contribute to the loss and filter out the rest.
    rpn_class_logits = rpn_class_logits[indices.data[:,0],indices.data[:,1],:]
    anchor_class = anchor_class[indices.data[:,0],indices.data[:,1]]

    # Crossentropy loss
    loss = F.cross_entropy(rpn_class_logits, anchor_class)

    return loss
2. Stage-one (RPN) box loss
python
def compute_rpn_bbox_loss(target_bbox, rpn_match, rpn_bbox):
    """Return the RPN bounding box loss graph.

    target_bbox: [batch, max positive anchors, (dy, dx, log(dh), log(dw))].
        Uses 0 padding to fill in unused bbox deltas.
    rpn_match: [batch, anchors, 1]. Anchor match type. 1=positive,
               -1=negative, 0=neutral anchor.
    rpn_bbox: [batch, anchors, (dy, dx, log(dh), log(dw))]
    """

    # Squeeze last dim to simplify
    rpn_match = rpn_match.squeeze(2)

    # Positive anchors contribute to the loss, but negative and
    # neutral anchors (match value of 0 or -1) don't.
    indices = torch.nonzero(rpn_match==1)

    # Pick bbox deltas that contribute to the loss
    rpn_bbox = rpn_bbox[indices.data[:,0],indices.data[:,1]]

    # Trim target bounding box deltas to the same length as rpn_bbox.
    target_bbox = target_bbox[0,:rpn_bbox.size()[0],:]

    # Smooth L1 loss
    loss = F.smooth_l1_loss(rpn_bbox, target_bbox)

    return loss
3. Stage-two classification loss
python
def compute_mrcnn_class_loss(target_class_ids, pred_class_logits):
    """Loss for the classifier head of Mask RCNN.

    target_class_ids: [batch, num_rois]. Integer class IDs. Uses zero
        padding to fill in the array.
    pred_class_logits: [batch, num_rois, num_classes]
    """

    # Loss
    if target_class_ids.size():
        loss = F.cross_entropy(pred_class_logits,target_class_ids.long())
    else:
        loss = Variable(torch.FloatTensor([0]), requires_grad=False)
        if target_class_ids.is_cuda:
            loss = loss.cuda()

    return loss
4. Stage-two box loss

Only positive ROIs contribute to the box loss.

python
def compute_mrcnn_bbox_loss(target_bbox, target_class_ids, pred_bbox):
    """Loss for Mask R-CNN bounding box refinement.

    target_bbox: [batch, num_rois, (dy, dx, log(dh), log(dw))]
    target_class_ids: [batch, num_rois]. Integer class IDs.
    pred_bbox: [batch, num_rois, num_classes, (dy, dx, log(dh), log(dw))]
    """

    if target_class_ids.size():
        # Only positive ROIs contribute to the loss. And only
        # the right class_id of each ROI. Get their indices.
        positive_roi_ix = torch.nonzero(target_class_ids > 0)[:, 0]
        positive_roi_class_ids = target_class_ids[positive_roi_ix.data].long()
        indices = torch.stack((positive_roi_ix,positive_roi_class_ids), dim=1)

        # Gather the deltas (predicted and true) that contribute to loss
        target_bbox = target_bbox[indices[:,0].data,:]
        pred_bbox = pred_bbox[indices[:,0].data,indices[:,1].data,:]

        # Smooth L1 loss
        loss = F.smooth_l1_loss(pred_bbox, target_bbox)
    else:
        loss = Variable(torch.FloatTensor([0]), requires_grad=False)
        if target_class_ids.is_cuda:
            loss = loss.cuda()

    return loss
5. Mask loss

Only the masks of positive ROIs contribute to the loss.

python
def compute_mrcnn_mask_loss(target_masks, target_class_ids, pred_masks):
    """Mask binary cross-entropy loss for the masks head.

    target_masks: [batch, num_rois, height, width].
        A float32 tensor of values 0 or 1. Uses zero padding to fill array.
    target_class_ids: [batch, num_rois]. Integer class IDs. Zero padded.
    pred_masks: [batch, proposals, height, width, num_classes] float32 tensor
                with values from 0 to 1.
    """
    if target_class_ids.size():
        # Only positive ROIs contribute to the loss. And only
        # the class specific mask of each ROI.
        positive_ix = torch.nonzero(target_class_ids > 0)[:, 0] # indices of positive ROIs
        positive_class_ids = target_class_ids[positive_ix.data].long() # their class IDs
        indices = torch.stack((positive_ix, positive_class_ids), dim=1)

        # Gather the masks (predicted and true) that contribute to loss
        y_true = target_masks[indices[:,0].data,:,:] # ground-truth masks
        y_pred = pred_masks[indices[:,0].data,indices[:,1].data,:,:] # predicted masks for the GT class

        # Binary cross entropy
        loss = F.binary_cross_entropy(y_pred, y_true)
    else:
        loss = Variable(torch.FloatTensor([0]), requires_grad=False)
        if target_class_ids.is_cuda:
            loss = loss.cuda()

    return loss

1.10 Inference

1.10.1 Classification and Regression

Classification and regression operate on the proposal-layer output. As mentioned earlier, the proposal layer outputs up to 2000 proposals during training (of which detection_target_layer keeps only 200 for training) and 1000 during inference.

python
# Network Heads
# Proposal classifier and BBox regressor heads
mrcnn_class_logits, mrcnn_class, mrcnn_bbox = self.classifier(mrcnn_feature_maps, rpn_rois)

# Detections
# output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in image coordinates
detections = detection_layer(self.config, rpn_rois, mrcnn_class, mrcnn_bbox, image_metas)
python
def detection_layer(config, rois, mrcnn_class, mrcnn_bbox, image_meta):
    """Takes classified proposal boxes and their bounding box deltas and
    returns the final detection boxes.

    Returns:
    [batch, num_detections, (y1, x1, y2, x2, class_score)] in pixels
    """

    # Currently only supports batchsize 1
    rois = rois.squeeze(0)

    _, _, window, _ = parse_image_meta(image_meta)
    window = window[0]
    detections = refine_detections(rois, mrcnn_class, mrcnn_bbox, window, config)

    return detections

The ROIs are then filtered: from the 1000 proposals, at most 100 detections are kept.

Concretely,

  1. For each ROI, take its predicted class, the class score, and the class-specific (dy, dx, log(dh), log(dw)).
  2. Apply the predicted deltas to the proposal boxes.
  3. Convert to image coordinates and clip boxes that fall outside the image.
  4. Discard background boxes and boxes with score below 0.7, then run per-class NMS.
  5. Keep the top N detections.
python
def refine_detections(rois, probs, deltas, window, config):
    """Refine classified proposals and filter overlaps and return final
    detections.

    Inputs:
        rois: [N, (y1, x1, y2, x2)] in normalized coordinates
        probs: [N, num_classes]. Class probabilities.
        deltas: [N, num_classes, (dy, dx, log(dh), log(dw))]. Class-specific
                bounding box deltas.
        window: (y1, x1, y2, x2) in image coordinates. The part of the image
            that contains the image excluding the padding.

    Returns detections shaped: [N, (y1, x1, y2, x2, class_id, score)]
    """
    # ----------------------------------------------------------------------------------------------
    # For each ROI: its predicted class, the class score, and the class-specific (dy, dx, log(dh), log(dw))
    # Class IDs per ROI
    _, class_ids = torch.max(probs, dim=1) # index of the top class

    # Class probability of the top class of each ROI
    # Class-specific bounding box deltas
    idx = torch.arange(class_ids.size()[0]).long()
    if config.GPU_COUNT:
        idx = idx.cuda()
    class_scores = probs[idx, class_ids.data] # score of the top class
    deltas_specific = deltas[idx, class_ids.data] # class-specific (dy, dx, log(dh), log(dw))

    # ----------------------------------------------------------------------------------------------
    # Apply the predicted deltas to refine the candidate boxes
    # Apply bounding box deltas
    # Shape: [boxes, (y1, x1, y2, x2)] in normalized coordinates
    std_dev = Variable(torch.from_numpy(np.reshape(config.RPN_BBOX_STD_DEV, [1, 4])).float(), requires_grad=False)
    if config.GPU_COUNT:
        std_dev = std_dev.cuda()
    refined_rois = apply_box_deltas(rois, deltas_specific * std_dev) # refined boxes

    # ----------------------------------------------------------------------------------------------
    # Convert to image coordinates and clip boxes that extend beyond the window
    # Convert coordinates to image domain
    height, width = config.IMAGE_SHAPE[:2]
    scale = Variable(torch.from_numpy(np.array([height, width, height, width])).float(), requires_grad=False)
    if config.GPU_COUNT:
        scale = scale.cuda()
    refined_rois *= scale # convert to the image domain

    # Clip boxes to image window
    refined_rois = clip_to_window(window, refined_rois) # clip to the image window

    # Round and cast to int since we're dealing with pixels now
    refined_rois = torch.round(refined_rois) # round to whole pixels

    # TODO: Filter out boxes with zero area
    # ----------------------------------------------------------------------------------------------
    # Remove background, drop boxes scoring below the confidence threshold, then apply per-class NMS
    # Filter out background boxes
    keep_bool = class_ids>0 # drop background (class 0) boxes

    # Filter out low-confidence boxes (score < DETECTION_MIN_CONFIDENCE)
    if config.DETECTION_MIN_CONFIDENCE:
        keep_bool = keep_bool & (class_scores >= config.DETECTION_MIN_CONFIDENCE)
    keep = torch.nonzero(keep_bool)[:,0]

    # Apply per-class NMS
    pre_nms_class_ids = class_ids[keep.data]
    pre_nms_scores = class_scores[keep.data]
    pre_nms_rois = refined_rois[keep.data]

    for i, class_id in enumerate(unique1d(pre_nms_class_ids)):
        # Pick detections of this class
        ixs = torch.nonzero(pre_nms_class_ids == class_id)[:,0]

        # Sort
        ix_rois = pre_nms_rois[ixs.data]
        ix_scores = pre_nms_scores[ixs]
        ix_scores, order = ix_scores.sort(descending=True)
        ix_rois = ix_rois[order.data,:]

        class_keep = nms(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1).data, config.DETECTION_NMS_THRESHOLD)

        # Map indices
        class_keep = keep[ixs[order[class_keep].data].data]

        if i==0:
            nms_keep = class_keep
        else:
            nms_keep = unique1d(torch.cat((nms_keep, class_keep)))
    keep = intersect1d(keep, nms_keep)
    # ----------------------------------------------------------------------------------------------
    # Keep the top DETECTION_MAX_INSTANCES detections
    roi_count = config.DETECTION_MAX_INSTANCES
    top_ids = class_scores[keep.data].sort(descending=True)[1][:roi_count]
    keep = keep[top_ids.data]

    # Arrange output as [N, (y1, x1, y2, x2, class_id, score)]
    # Coordinates are in image domain.
    result = torch.cat((refined_rois[keep.data],
                        class_ids[keep.data].unsqueeze(1).float(),
                        class_scores[keep.data].unsqueeze(1)), dim=1)

    return result

1.10.2 Mask prediction

python 复制代码
# Convert boxes to normalized coordinates
# TODO: let DetectionLayer return normalized coordinates to avoid
#       unnecessary conversions
h, w = self.config.IMAGE_SHAPE[:2]
scale = Variable(torch.from_numpy(np.array([h, w, h, w])).float(), requires_grad=False)
if self.config.GPU_COUNT:
    scale = scale.cuda()
detection_boxes = detections[:, :4] / scale

# Add back batch dimension
detection_boxes = detection_boxes.unsqueeze(0)

# Create masks for detections
mrcnn_mask = self.mask(mrcnn_feature_maps, detection_boxes)

1.10.3 Final output

The final outputs are the detections and the masks; each detection is [y1, x1, y2, x2, class_id, score].

python 复制代码
if mode == 'inference':
    # Network Heads
    # Proposal classifier and BBox regressor heads
    mrcnn_class_logits, mrcnn_class, mrcnn_bbox = self.classifier(mrcnn_feature_maps, rpn_rois)

    # Detections
    # output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in image coordinates
    detections = detection_layer(self.config, rpn_rois, mrcnn_class, mrcnn_bbox, image_metas)

    # --------------------------------------------------------------------------------------
    # Normalize the detection boxes
    # Convert boxes to normalized coordinates
    # TODO: let DetectionLayer return normalized coordinates to avoid
    #       unnecessary conversions
    h, w = self.config.IMAGE_SHAPE[:2]
    scale = Variable(torch.from_numpy(np.array([h, w, h, w])).float(), requires_grad=False)
    if self.config.GPU_COUNT:
        scale = scale.cuda()
    detection_boxes = detections[:, :4] / scale # normalize to [0, 1]

    # Add back batch dimension
    detection_boxes = detection_boxes.unsqueeze(0)

    # Create masks for detections
    mrcnn_mask = self.mask(mrcnn_feature_maps, detection_boxes)

    # Add back batch dimension
    detections = detections.unsqueeze(0)
    mrcnn_mask = mrcnn_mask.unsqueeze(0)

    return [detections, mrcnn_mask]

1.11 Summary

The first stage (RPN) takes P2-P6 as input;

the second-stage classification and mask heads take P2-P5.

Classification pools each ROI to 7×7, the mask head to 14×14 (a minimal pooling sketch follows).
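
To make the 7×7 vs. 14×14 pooling concrete, here is a minimal sketch using torchvision.ops.roi_align on a single P2-level feature map; the feature map, boxes and stride are made-up examples, and this is not the repo's pyramid_roi_align helper.

python 复制代码
import torch
from torchvision.ops import roi_align

# Assumed: one P2 feature map at stride 4, and two boxes in image pixels (x1, y1, x2, y2).
feat_p2 = torch.randn(1, 256, 200, 256)
boxes = [torch.tensor([[ 32.,  48., 160., 200.],
                       [400., 100., 560., 300.]])]

cls_feat  = roi_align(feat_p2, boxes, output_size=(7, 7),   spatial_scale=1 / 4, sampling_ratio=2)
mask_feat = roi_align(feat_p2, boxes, output_size=(14, 14), spatial_scale=1 / 4, sampling_ratio=2)
print(cls_feat.shape)   # torch.Size([2, 256, 7, 7])   -> classification / box head
print(mask_feat.shape)  # torch.Size([2, 256, 14, 14]) -> mask head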

1.11.1 rpn

Sample preparation in the RPN stage first excludes "crowd" boxes (boxes covering multiple objects).
1. Positive/negative assignment and delta computation: an anchor is negative if its IoU with every GT box is < 0.3; it is positive if its IoU is > 0.7, or if it is the highest-IoU anchor for some GT box. Positives are labeled 1, negatives -1, everything else 0. See 1.5.1 for details.
Neither positives nor negatives exceed half of the sampled anchors.

Deltas are computed only for positive anchors; negatives need none.

The delta is the offset of the GT box relative to its matched anchor, e.g. (a worked numeric example follows the snippet):

python 复制代码
rpn_bbox[ix] = [
    (gt_center_y - a_center_y) / a_h,
    (gt_center_x - a_center_x) / a_w,
    np.log(gt_h / a_h),
    np.log(gt_w / a_w),
]
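
For intuition, a small worked example with a made-up anchor/GT pair (the numbers are arbitrary):

python 复制代码
import numpy as np

# Hypothetical anchor and matched GT box, both (y1, x1, y2, x2) in pixels.
anchor = np.array([ 64.,  64., 192., 192.])   # 128 x 128
gt_box = np.array([ 80.,  72., 208., 216.])   # 128 x 144

a_h, a_w = anchor[2] - anchor[0], anchor[3] - anchor[1]
gt_h, gt_w = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
a_cy, a_cx = anchor[0] + a_h / 2, anchor[1] + a_w / 2
gt_cy, gt_cx = gt_box[0] + gt_h / 2, gt_box[1] + gt_w / 2

delta = np.array([(gt_cy - a_cy) / a_h,       # 0.125
                  (gt_cx - a_cx) / a_w,       # 0.125
                  np.log(gt_h / a_h),         # 0.0
                  np.log(gt_w / a_w)])        # ~0.118
print(delta)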

RPN classification distinguishes positive from negative anchors, so neutral anchors (label 0) are ignored, e.g.:

(see the stage-one class loss computation)

python 复制代码
indices = torch.nonzero(rpn_match != 0)

The box regression loss is computed only over positive anchors (see the stage-one box loss), e.g.:

python 复制代码
# Positive anchors contribute to the loss, but negative and
# neutral anchors (match value of 0 or -1) don't.
indices = torch.nonzero(rpn_match==1)

At this point the anchors have been coarsely classified and coarsely regressed.

1.11.2 proposals

The coarse results above are filtered, keeping a subset of candidate boxes (2k for training, 1k for inference). The main steps are:

  1. Sort anchors by foreground score.
  2. Keep the top 6k anchors.
  3. Apply the regressed deltas to the anchors to generate candidate boxes.
  4. Clip boxes that extend beyond the image boundary.
  5. Apply NMS and keep 2k boxes.
  6. Normalize the resulting boxes.

At this point we have 2k normalized candidate boxes (a simplified sketch of these steps follows).
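
The sketch below re-implements these six steps in simplified form. It is not the repo's proposal_layer: the (y1, x1, y2, x2) box convention, thresholds and the omission of RPN_BBOX_STD_DEV scaling are assumptions, and torchvision's NMS stands in for the repo's custom NMS.

python 复制代码
import torch
from torchvision.ops import nms

def proposal_sketch(scores, deltas, anchors, pre_nms_top_n=6000,
                    post_nms_top_n=2000, nms_thresh=0.7, image_size=(1024, 1024)):
    """scores: [A] foreground scores, deltas: [A, 4] (dy, dx, log(dh), log(dw)),
    anchors: [A, (y1, x1, y2, x2)] in pixels."""
    # Steps 1-2: keep the top-k anchors by foreground score
    order = scores.argsort(descending=True)[:pre_nms_top_n]
    scores, deltas, anchors = scores[order], deltas[order], anchors[order]

    # Step 3: apply the regression deltas to the anchors
    h = anchors[:, 2] - anchors[:, 0]
    w = anchors[:, 3] - anchors[:, 1]
    cy = anchors[:, 0] + 0.5 * h + deltas[:, 0] * h
    cx = anchors[:, 1] + 0.5 * w + deltas[:, 1] * w
    h = h * torch.exp(deltas[:, 2])
    w = w * torch.exp(deltas[:, 3])
    boxes = torch.stack([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w], dim=1)

    # Step 4: clip to the image window
    H, W = image_size
    boxes = torch.stack([boxes[:, 0].clamp(0, H), boxes[:, 1].clamp(0, W),
                         boxes[:, 2].clamp(0, H), boxes[:, 3].clamp(0, W)], dim=1)

    # Step 5: NMS (torchvision expects (x1, y1, x2, y2)), keep post_nms_top_n boxes
    keep = nms(boxes[:, [1, 0, 3, 2]], scores, nms_thresh)[:post_nms_top_n]

    # Step 6: normalize to [0, 1]
    scale = torch.tensor([H, W, H, W], dtype=boxes.dtype)
    return boxes[keep] / scale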

1.11.3 detection_target (training)

The 2k proposals are further sampled down to 200 ROIs: positives have IoU > 0.5 with a GT box, negatives IoU < 0.5, with a positive:negative ratio of 1:3.

Each sampled ROI is matched to its GT box to obtain the target deltas and the corresponding gt_mask.

See 1.6 for details; a stripped-down sampling sketch follows the call below.

python 复制代码
rois, target_class_ids, target_deltas, target_mask = \
    detection_target_layer(rpn_rois, gt_class_ids, gt_boxes, gt_masks, self.config)
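
A stripped-down sketch of the positive/negative sampling idea. This is not the repo's detection_target_layer: the precomputed `overlaps` IoU matrix, the 1:3 bookkeeping via a 0.25 positive fraction, and the absence of crowd handling are all simplifying assumptions.

python 复制代码
import torch

def sample_rois_sketch(overlaps, rois_per_image=200, positive_fraction=0.25):
    """overlaps: [num_proposals, num_gt] IoU matrix (assumed precomputed)."""
    roi_iou_max = overlaps.max(dim=1)[0]
    positive_ix = torch.nonzero(roi_iou_max >= 0.5)[:, 0]   # IoU > 0.5 -> positive
    negative_ix = torch.nonzero(roi_iou_max < 0.5)[:, 0]    # IoU < 0.5 -> negative

    # Keep roughly 1 positive for every 3 negatives, 200 ROIs in total
    num_pos = min(int(rois_per_image * positive_fraction), positive_ix.numel())
    positive_ix = positive_ix[torch.randperm(positive_ix.numel())[:num_pos]]
    num_neg = min(rois_per_image - num_pos, negative_ix.numel())
    negative_ix = negative_ix[torch.randperm(negative_ix.numel())[:num_neg]]
    return positive_ix, negative_ix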

1.11.4 Classification, regression and mask (training)

The 200 ROIs obtained above are fed to the classification and mask heads.

We now have 200 candidate boxes of different sizes. Each box crops a local region from the feature maps, and that region is fed to the classification network. Because the boxes differ in size, the cropped features must be resized to a common size. Faster R-CNN does this with ROI Pooling; Mask R-CNN improves on it with ROIAlign.

Notes:

  1. Because the features are multi-scale, boxes of different sizes are cropped from different pyramid levels (see the level-assignment sketch after the code block below).
  2. The first stage (RPN) uses P2-P6; the second stage uses P2-P5.
python 复制代码
# Proposal classifier and BBox regressor heads
mrcnn_class_logits, mrcnn_class, mrcnn_bbox = self.classifier(mrcnn_feature_maps, rois)

# Create masks for detections
mrcnn_mask = self.mask(mrcnn_feature_maps, rois)
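
Regarding note 1: which pyramid level an ROI is cropped from depends on its size. Below is a minimal, FPN-paper-style assignment sketch; the pixel-coordinate boxes and the 224 canonical size are assumptions, and the repo computes an equivalent roi_level inside its ROI Align helper rather than as a standalone function.

python 复制代码
import torch

def fpn_roi_level(boxes, canonical_size=224.0, canonical_level=4, min_level=2, max_level=5):
    """boxes: [N, (y1, x1, y2, x2)] in pixels. Returns the pyramid level (P2..P5) per box."""
    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    level = torch.floor(canonical_level + torch.log2(torch.sqrt(h * w) / canonical_size))
    return level.clamp(min_level, max_level).long()

# e.g. a 56x56 box maps to P2, a 448x448 box maps to P5
print(fpn_roi_level(torch.tensor([[0., 0., 56., 56.], [0., 0., 448., 448.]])))  # tensor([2, 5])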

1.11.5 Losses (training)

The losses split into first-stage (RPN) and second-stage terms; see 1.9 for details.

python 复制代码
def compute_losses(rpn_match, rpn_bbox, rpn_class_logits, rpn_pred_bbox, target_class_ids, mrcnn_class_logits, target_deltas, mrcnn_bbox, target_mask, mrcnn_mask):
    # Stage one (RPN)
    rpn_class_loss = compute_rpn_class_loss(rpn_match, rpn_class_logits) # RPN class loss
    rpn_bbox_loss = compute_rpn_bbox_loss(rpn_bbox, rpn_match, rpn_pred_bbox) # RPN box loss
    # Stage two (detection heads)
    mrcnn_class_loss = compute_mrcnn_class_loss(target_class_ids, mrcnn_class_logits) # class loss
    mrcnn_bbox_loss = compute_mrcnn_bbox_loss(target_deltas, target_class_ids, mrcnn_bbox) # box loss
    mrcnn_mask_loss = compute_mrcnn_mask_loss(target_mask, target_class_ids, mrcnn_mask) # mask loss

    return [rpn_class_loss, rpn_bbox_loss, mrcnn_class_loss, mrcnn_bbox_loss, mrcnn_mask_loss]

1.11.6 Classification and regression (inference)

As in training, the second stage uses the P2-P5 feature maps.

At inference time the proposal layer produces only 1k candidate boxes, which are then classified and regressed.

python 复制代码
# Proposal classifier and BBox regressor heads
mrcnn_class_logits, mrcnn_class, mrcnn_bbox = self.classifier(mrcnn_feature_maps, rpn_rois)

1.11.7 Applying classification and regression (inference)

The ROIs are filtered: of the 1k proposals, at most 100 detections are kept.

The classification results and regression deltas above are applied to refine these 1k proposals.

python 复制代码
# Detections
# output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in image coordinates
detections = detection_layer(self.config, rpn_rois, mrcnn_class, mrcnn_bbox, image_metas)

See 1.10.1 for details:

  1. For each ROI, take its top class, the class score, and the class-specific (dy, dx, log(dh), log(dw)) deltas.
  2. Apply those deltas to refine the candidate boxes.
  3. Convert to image coordinates and clip boxes that extend beyond the image window.
  4. Remove background boxes, filter out boxes scoring below 0.7, and apply per-class NMS.
  5. Keep the top N detections.

1.11.8 Mask (inference)

The refined boxes above are used to crop features from the feature maps, which are then fed to the mask head.

python 复制代码
# --------------------------------------------------------------------------------------
# Normalize the detection boxes
# Convert boxes to normalized coordinates
# TODO: let DetectionLayer return normalized coordinates to avoid
#       unnecessary conversions
h, w = self.config.IMAGE_SHAPE[:2]
scale = Variable(torch.from_numpy(np.array([h, w, h, w])).float(), requires_grad=False)
if self.config.GPU_COUNT:
    scale = scale.cuda()
detection_boxes = detections[:, :4] / scale # normalize to [0, 1]

# Add back batch dimension
detection_boxes = detection_boxes.unsqueeze(0)

# Create masks for detections
mrcnn_mask = self.mask(mrcnn_feature_maps, detection_boxes)

1.11.9 Recap

  1. The RPN classifies and regresses every anchor;
    positive/negative anchors are assigned by IoU (> 0.7 positive, < 0.3 negative), and neither group exceeds half of the sampled anchors;
    regression applies only to positive anchors.

Training:

  1. The proposal layer keeps 2k boxes;

  2. detection_target_layer further samples the proposals down to 200, with a positive:negative ratio of 1:3 (IoU > 0.5 positive, < 0.5 negative);

  3. the 200 ROIs go through classification, regression and mask prediction;

Inference:

  1. The proposal layer keeps 1k boxes;

  2. the 1k boxes are classified and regressed first (the order differs slightly from training);

  3. the detections are further filtered based on those results, keeping 100;

  4. masks are computed for these 100 boxes.

1.11.10 Flowcharts

1. Inference flow
复制代码
Input image (H×W×3)
        │
        ▼
Image preprocessing
  - resize
  - normalize
  - pad to a fixed size
        │
        ▼
Backbone (ResNet): feature extraction → C2, C3, C4, C5
        │
        ▼
FPN: multi-scale feature fusion → P2, P3, P4, P5, P6
        │
        ├───────────────────────────────┐
        ▼                               ▼
RPN (classification + regression)   Anchor generation (5 scales × 3 ratios = 15 anchor shapes)
        │                               │
        └───────────────┬───────────────┘
                        ▼
Proposal layer: NMS + top-K selection → 1000 ROIs
                        │
        ┌───────────────┴───────────────┐
        ▼                               ▼
ROI Align 7×7 (classification / box)   ROI Align 14×14 (mask)
        │                               │
   ┌────┴────────────┐                  │
   ▼                 ▼                  ▼
Class head        Box head          Mask head
(FC → class)      (FC → 4 deltas)   (Conv → 28×28)
   │                 │                  │
   └────────┬────────┴──────────────────┘
            ▼
Detection layer: class selection, box refinement, NMS, score thresholding
            │
            ▼
Post-processing: coordinate mapping, mask upsampling, cropping
            │
            ▼
Final output
  - rois: [N, 4]
  - class_ids: [N]
  - scores: [N]
  - masks: [H, W, N]
2. Training flow
复制代码
Input image + GT annotations
    │
    ├─→ RPN branch
    │   ├─→ RPN classification loss (foreground / background)
    │   └─→ RPN regression loss (positive anchors only)
    │
    ├─→ Detection Target Layer
    │   └─→ ROI sampling (balanced positives / negatives)
    │
    └─→ ROI branch
        ├─→ classification loss
        ├─→ regression loss (positive ROIs only)
        └─→ mask loss (positive ROIs only, per-pixel)
3. Data-flow tensor shapes
复制代码
Input: [H, W, 3]
    ↓
Backbone:
    C2: [H/4, W/4, 256]
    C3: [H/8, W/8, 512]
    C4: [H/16, W/16, 1024]
    C5: [H/32, W/32, 2048]
    ↓
FPN:
    P2: [H/4, W/4, 256]
    P3: [H/8, W/8, 256]
    P4: [H/16, W/16, 256]
    P5: [H/32, W/32, 256]
    P6: [H/64, W/64, 256]
    ↓
RPN:
    output: [num_anchors, 2] (classification)
            [num_anchors, 4] (regression)
    ↓
Proposals:
    [1000, 4] (inference)
    [2000, 4] (training)
    ↓
ROI Align:
    classification/regression: [1000, 7, 7, 256]
    mask: [1000, 14, 14, 256]
    ↓
Classification head:
    [1000, num_classes]
    ↓
Box head:
    [1000, num_classes, 4]
    ↓
Mask head:
    [1000, num_classes, 28, 28]
    ↓
Detection Layer:
    [max_detections, 6] (y1, x1, y2, x2, class, score)
    ↓
Final output:
    rois: [N, 4]
    class_ids: [N]
    scores: [N]
    masks: [H, W, N]
4. Key modules
4.1 Backbone (ResNet)
  • Role: extract image features
  • Output: 5 feature maps at different scales
  • Choice: ResNet50 or ResNet101
4.2 FPN (Feature Pyramid Network)
  • Role: fuse multi-scale features, improving small-object detection
  • Method: top-down pathway + lateral connections
  • Benefit: a single feature map → a multi-scale feature pyramid
4.3 RPN (Region Proposal Network)
  • Role: generate candidate object regions
  • Input: FPN feature maps + anchors
  • Output: proposals
  • Property: shares the backbone features, so it is computationally efficient
4.4 ROI Align
  • Role: extract fixed-size ROI features from the feature maps
  • Advantage: avoids the quantization error of ROI Pooling
  • Output: 7×7 (classification/regression) or 14×14 (mask)
4.5 Detection heads
  • Classification head: predicts the class
  • Box head: refines the bounding box
  • Mask head: produces a pixel-level mask
4.6 Post-processing
  • NMS: remove duplicate detections
  • Score threshold: drop low-confidence detections
  • Coordinate mapping: map back to the original image size
  • Mask processing: upsample, crop, threshold
5. Training strategy (a minimal freezing sketch follows this list)
5.1 Stage 1: train the RPN only
  • Freeze the backbone
  • Train the RPN to generate high-quality proposals
5.2 Stage 2: train the detection heads only
  • Freeze the backbone and the RPN
  • Train the heads on fixed proposals
5.3 Stage 3: end-to-end fine-tuning
  • Unfreeze all layers
  • Jointly optimize all modules
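A minimal sketch of how such stage-wise freezing can be expressed in PyTorch; the toy ModuleDict and its "backbone"/"rpn" names are hypothetical stand-ins, not the repo's API.

python 复制代码
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Toy stand-in for the real network; the attribute names are hypothetical.
model = nn.ModuleDict({"backbone": nn.Conv2d(3, 8, 3), "rpn": nn.Conv2d(8, 2, 1)})

# Stage 1: train the RPN only
set_trainable(model["backbone"], False)
set_trainable(model["rpn"], True)

# Stage 3: end-to-end fine-tuning
set_trainable(model, True)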
6. Metrics
  • Detection accuracy: mAP (mean Average Precision)
  • Segmentation accuracy: mask mAP
  • Speed: FPS (frames per second)
7. Applications
  1. Instance segmentation: detect and segment each object instance
  2. Object detection: use only the boxes (ignore the masks)
  3. Keypoint detection: extend the mask head to keypoint prediction
  4. Pose estimation: combine with human keypoint detection