[Object Detection] FPN Feature Pyramid: A Complete Walkthrough

Study video: 1.1.2 FPN Structure Explained

I. FPN compared with other structures

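Compared with the other structures, FPN is a hierarchical structure that combines a bottom-up pathway with a top-down pathway and fuses features across multiple scales.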

II. FPN structure in detail

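1x1 conv: adjusts the number of channels. Feature maps of different sizes have different channel counts, and the higher the level, the more channels the map has. The paper uses 256 1x1 kernels, so every feature map ends up with 256 channels.
2x up: uses interpolation (nearest-neighbor in the code at the end of this post) to double the height and width of a higher-level feature map, so that it matches the size of the map one level below.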
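
To make the "2x up" step concrete, here is a minimal pure-Python sketch of nearest-neighbor upsampling on a single-channel map. The function name `upsample_nearest_2x` is made up for illustration, and nearest-neighbor is chosen only because the listing at the end of this post calls `F.interpolate` with `mode="nearest"`; other interpolation modes would work in the same place.

```python
def upsample_nearest_2x(grid):
    """Double the height and width of a 2-D list by repeating each value."""
    out = []
    for row in grid:
        # Repeat every value twice along the width.
        wide = [v for v in row for _ in range(2)]
        # Repeat the widened row twice along the height.
        out.append(wide)
        out.append(list(wide))
    return out


small = [[1, 2],
         [3, 4]]
big = upsample_nearest_2x(small)
for row in big:
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

Each input value becomes a 2x2 block in the output, so a 20x20 map becomes 40x40 without introducing any new values.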

III. Overall pipeline

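1. Generate the C2-C5 feature layers

The backbone (e.g. ResNet) applies successive convolution stages that downsample the feature map step by step, producing four feature layers of different sizes. The sizes below are consistent with, for example, a 640x640 input and strides of 4, 8, 16 and 32.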

C2: 160x160x256 (HxWxC)
C3: 80x80x512
C4: 40x40x1024
C5: 20x20x2048
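
The spatial sizes above follow directly from the stride of each stage. The sketch below assumes a 640x640 input and strides 4, 8, 16, 32 (neither is stated explicitly in this post, but both are consistent with the listed sizes); `backbone_sizes` is a hypothetical helper name.

```python
import math


def backbone_sizes(height, width, strides=(4, 8, 16, 32)):
    """Return {level_name: (H, W)} for each backbone stage."""
    sizes = {}
    for stride in strides:
        # The level index is log2 of the stride: stride 4 -> C2, stride 32 -> C5.
        level = int(math.log2(stride))
        sizes[f"C{level}"] = (height // stride, width // stride)
    return sizes


sizes = backbone_sizes(640, 640)
for name, hw in sizes.items():
    print(name, hw)
# C2 (160, 160)
# C3 (80, 80)
# C4 (40, 40)
# C5 (20, 20)
```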

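2. Unify channels with a 1x1 conv, upsample 2x, then add to fuse multi-scale features

The four feature maps differ in both size and channel count, so two maps can only be added after they have been brought to the same number of channels and the same size.

Channel adjustment: a 1x1 conv with 256 kernels maps each of the four feature maps to 256 channels.
Size adjustment: going top-down, the smaller map is upsampled 2x so that it matches the size of the level below. E.g. a 20x20 feature map becomes 40x40 after 2x upsampling.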
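
A 1x1 conv is simply a linear map applied to the channel vector of every pixel, which is why it can change the channel count without touching the spatial size. The following is a minimal sketch under simplifying assumptions: nested lists in HxWxC layout, no bias, and toy sizes (3 channels reduced to 2 instead of 2048 reduced to 256). The name `conv1x1` and the weights are illustrative only.

```python
def conv1x1(feature, weight):
    """Apply a bias-free 1x1 convolution.

    feature: H x W x C_in nested lists.
    weight:  C_out x C_in nested lists (one row per output channel).
    Returns an H x W x C_out nested list.
    """
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) for w_row in weight]
            for pixel in row
        ]
        for row in feature
    ]


# A 2x2 map with 3 channels per pixel.
feature = [[[1, 2, 3], [4, 5, 6]],
           [[7, 8, 9], [1, 0, 1]]]
weight = [[1, 0, 0],   # output channel 0 copies input channel 0
          [0, 1, 1]]   # output channel 1 sums input channels 1 and 2
out = conv1x1(feature, weight)
print(out)
# [[[1, 5], [4, 11]], [[7, 17], [1, 1]]]
```

The output keeps the 2x2 spatial layout and only the innermost (channel) dimension changes, from 3 to 2.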

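3. Further fusion with a 3x3 conv

After the addition, a 3x3 conv is applied to each summed map to fuse the features further.

In the code at the end of this post, the top-down sum is passed down before the 3x3 conv is applied. Writing M for that summed map:

P5: M5 = 1x1 conv(C5), then M5 ---> 3x3 conv
P4: M4 = 2x up(M5) + 1x1 conv(C4), then M4 ---> 3x3 conv
P3: M3 = 2x up(M4) + 1x1 conv(C3), then M3 ---> 3x3 conv
P2: M2 = 2x up(M3) + 1x1 conv(C2), then M2 ---> 3x3 conv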
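
The top-down accumulation can be sketched in a few lines of pure Python. This is only an illustration under loud assumptions: single-channel maps, the 1x1 lateral convs are taken as already applied, the 3x3 output convs are left out, and the helper names (`upsample2x`, `add`, `top_down_merge`) are invented for the example.

```python
def upsample2x(grid):
    """Nearest-neighbor 2x upsampling of a 2-D list."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_merge(laterals):
    """Merge lateral maps ordered from highest to lowest resolution.

    Returns the summed maps in the same order as the input.
    """
    merged = [laterals[-1]]  # the coarsest level starts the pathway
    for lateral in reversed(laterals[:-1]):
        # Upsample the level above and add it to the current lateral map.
        merged.insert(0, add(lateral, upsample2x(merged[0])))
    return merged


l3 = [[100] * 4 for _ in range(4)]  # finest level, 4x4
l4 = [[10] * 2 for _ in range(2)]   # middle level, 2x2
l5 = [[1]]                          # coarsest level, 1x1
m3, m4, m5 = top_down_merge([l3, l4, l5])
print(m5)     # [[1]]
print(m4)     # [[11, 11], [11, 11]]
print(m3[0])  # [111, 111, 111, 111]
```

The value 1 from the coarsest level shows up in every finer level, which is the point of the top-down pathway: high-level information is propagated into the high-resolution maps.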

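4. Obtain the P2-P5 feature layers

The result is four feature maps with the same number of channels, each half the height and width of the one before it.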

P2: 160x160x256 (HxWxC)
P3: 80x80x256
P4: 40x40x256
P5: 20x20x256

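5. Downsample P5 to form P6

P6 is obtained by downsampling P5 with stride 2 (in the code at the end of this post, a max pool with kernel size 1 and stride 2).

P6: 10x10x256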
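
A max pool with kernel size 1 and stride 2 looks at one value per window, so it amounts to keeping every second row and column. The sketch below shows that on a single 20x20 channel; `subsample_stride2` is a hypothetical name and the cell values are just position markers.

```python
def subsample_stride2(grid):
    """Keep every second row and column, starting at index 0."""
    return [row[::2] for row in grid[::2]]


# Stand-in for one 20x20 channel of P5; each cell stores row * 20 + col.
p5 = [[r * 20 + c for c in range(20)] for r in range(20)]
p6 = subsample_stride2(p5)
print(len(p6), len(p6[0]))  # 10 10
print(p6[0][:3])            # [0, 2, 4]
print(p6[1][:3])            # [40, 42, 44]
```

No values are mixed or averaged, so P6 is a strided copy of P5 with the same 256 channels.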

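Note: P6 is used only in the RPN when generating proposals. That is, the five levels P2-P6 are used to generate candidate regions, while the prediction part of Faster R-CNN still uses only the four levels P2-P5.

Proposals are generated on P2-P6 and are then mapped onto P2-P5 to predict the final results.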
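
This post does not spell out how a proposal is mapped to a level. The FPN paper assigns a proposal of width w and height h to level k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4, and the sketch below follows that formula. The clamping to the range 2..5 and the function name `assign_fpn_level` are assumptions made for this example, and library implementations may differ in details.

```python
import math


def assign_fpn_level(width, height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pick the pyramid level (2..5) for a proposal of the given size."""
    scale = math.sqrt(width * height)
    k = math.floor(k0 + math.log2(scale / canonical_size))
    # Clamp so that very small or very large boxes still land on P2..P5.
    return max(k_min, min(k_max, k))


for w, h in [(32, 32), (112, 112), (224, 224), (1000, 1000)]:
    print((w, h), "-> P%d" % assign_fpn_level(w, h))
# (32, 32) -> P2
# (112, 112) -> P3
# (224, 224) -> P4
# (1000, 1000) -> P5
```

Small proposals are read from the high-resolution levels and large proposals from the coarse ones, which matches the idea of using P2-P5 for prediction.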

IV. PyTorch implementation and walkthrough

# Copyright (c) Facebook, Inc. and its affiliates.
import math
import fvcore.nn.weight_init as weight_init
import torch.nn.functional as F
from torch import nn

from detectron2.layers import Conv2d, ShapeSpec, get_norm

from .backbone import Backbone
from .build import BACKBONE_REGISTRY
from .resnet import build_resnet_backbone

__all__ = ["build_resnet_fpn_backbone", "build_retinanet_resnet_fpn_backbone", "FPN"]


class FPN(Backbone):
    """
    This module implements :paper:`FPN`.
    It creates pyramid features built on top of some input feature maps.
    """

    def __init__(
        self, bottom_up, in_features, out_channels, norm="", top_block=None, fuse_type="sum"
    ):
        """
        Args:
            bottom_up (Backbone): module representing the bottom up subnetwork.
                Must be a subclass of :class:`Backbone`. The multi-scale feature
                maps generated by the bottom up network, and listed in `in_features`,
                are used to generate FPN levels.
            in_features (list[str]): names of the input feature maps coming
                from the backbone to which FPN is attached. For example, if the
                backbone produces ["res2", "res3", "res4"], any *contiguous* sublist
                of these may be used; order must be from high to low resolution.
            out_channels (int): number of channels in the output feature maps.
            norm (str): the normalization to use.
            top_block (nn.Module or None): if provided, an extra operation will
                be performed on the output of the last (smallest resolution)
                FPN output, and the result will extend the result list. The top_block
                further downsamples the feature map. It must have an attribute
                "num_levels", meaning the number of extra FPN levels added by
                this block, and "in_feature", which is a string representing
                its input feature (e.g., p5).
            fuse_type (str): types for fusing the top down features and the lateral
                ones. It can be "sum" (default), which sums up element-wise; or "avg",
                which takes the element-wise mean of the two.

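        Summary of the arguments:
        bottom_up: the bottom-up network module that extracts multi-scale features.
        in_features: names of the bottom-up feature maps that feed the FPN.
        out_channels: number of channels of every output feature map.
        norm: type of normalization.
        top_block: optional module that performs an extra operation on the last FPN output.
        fuse_type: how features are fused, either "sum" or "avg".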
        """
        super(FPN, self).__init__()
        assert isinstance(bottom_up, Backbone)
        assert in_features, in_features

        # Feature map strides and channels from the bottom up network (e.g. ResNet)
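        # Shape info of the bottom-up outputs; input_shapes is a dict: name -> ShapeSpec
        input_shapes = bottom_up.output_shape()
        # Stride of each input feature map
        strides = [input_shapes[f].stride for f in in_features]
        # Channel count of each input feature map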
        in_channels_per_feature = [input_shapes[f].channels for f in in_features]

        # Use the helper _assert_strides_are_log2_contiguous to check that each stride is 2x the previous one.
        _assert_strides_are_log2_contiguous(strides)

        # Lists for the lateral (1x1) and output (3x3) convs; the convs get a bias only when no normalization is used.
        lateral_convs = []
        output_convs = []
        use_bias = norm == ""

        # Create one lateral conv and one output conv per input feature map.
        for idx, in_channels in enumerate(in_channels_per_feature):
            lateral_norm = get_norm(norm, out_channels)
            output_norm = get_norm(norm, out_channels)

            lateral_conv = Conv2d(
                in_channels, out_channels, kernel_size=1, bias=use_bias, norm=lateral_norm
            )
            output_conv = Conv2d(
                out_channels,
                out_channels,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=use_bias,
                norm=output_norm,
            )
            # Initialize the weights of both convs with c2_xavier_fill (Xavier-style initialization).
            weight_init.c2_xavier_fill(lateral_conv)
            weight_init.c2_xavier_fill(output_conv)
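            # Pyramid stage of this feature map: stage = log2(stride), so a larger stride
            # means a higher stage number (stride 4 -> stage 2, stride 32 -> stage 5).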
            stage = int(math.log2(strides[idx]))
            # Register both convs as submodules; the module names carry the pyramid stage.
            self.add_module("fpn_lateral{}".format(stage), lateral_conv)
            self.add_module("fpn_output{}".format(stage), output_conv)
            # Also keep both convs in the lists (still in bottom-up order here).
            lateral_convs.append(lateral_conv)
            output_convs.append(output_conv)

        # Place convs into top-down order (from low to high resolution)
        # to make the top-down computation in forward clearer.
        # Reverse both lists so that forward() can process the feature maps top-down.
        self.lateral_convs = lateral_convs[::-1]
        self.output_convs = output_convs[::-1]
        # Store the top block, the input feature names and the bottom-up network.
        self.top_block = top_block
        self.in_features = in_features
        self.bottom_up = bottom_up
        # Return feature names are "p<stage>", like ["p2", "p3", ..., "p6"]
        self._out_feature_strides = {"p{}".format(int(math.log2(s))): s for s in strides}

        # top block output feature maps.
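        # Record the strides of the extra feature maps produced by the top block.
        # Each extra level doubles the stride, so its feature map is half the size
        # of the level before it.
        #
        # If top_block is set, loop over its num_levels extra levels and register a
        # stride for each one, continuing from the last stage computed in the loop above.
        # E.g. if the last stage is 5 and two extra levels are added, "p6" and "p7"
        # get strides 2 ** 6 = 64 and 2 ** 7 = 128.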
        if self.top_block is not None:
            for s in range(stage, stage + self.top_block.num_levels):
                self._out_feature_strides["p{}".format(s + 1)] = 2 ** (s + 1)

        self._out_features = list(self._out_feature_strides.keys())
        self._out_feature_channels = {k: out_channels for k in self._out_features}
        # The largest input stride is the required size divisibility; also validate and store the fusion type.
        self._size_divisibility = strides[-1]
        assert fuse_type in {"avg", "sum"}
        self._fuse_type = fuse_type

        # Scripting does not support this: https://github.com/pytorch/pytorch/issues/47334
        # have to do it in __init__ instead.
        self.rev_in_features = tuple(in_features[::-1])

    # size_divisibility property: the value the input height and width should be divisible by (here the largest input stride).
    @property
    def size_divisibility(self):
        return self._size_divisibility

    def forward(self, x):
        """
        Args:
            input (dict[str->Tensor]): mapping feature map name (e.g., "res5") to
                feature map tensor for each feature level in high to low resolution order.
        Note: in this code `x` is handed straight to `self.bottom_up` (first line of the body);
        the dict of feature maps described above is what `bottom_up` returns.
        Returns:
            dict[str->Tensor]:
                mapping from feature map name to FPN feature map tensor
                in high to low resolution order. Returned feature names follow the FPN
                paper convention: "p<stage>", where stage has stride = 2 ** stage e.g.,
                ["p2", "p3", ..., "p6"].
                Example: with input strides 4, 8, 16, 32 and LastLevelMaxPool as top_block,
                the returned keys are "p2", "p3", "p4", "p5", "p6".
        """

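        # x is passed to the bottom-up network, which returns a dict mapping feature-map
        # names (e.g. "res2", "res3", "res4") to tensors, ordered from low level to high level.
        # The goal is to build the pyramid from these maps and return a dict mapping
        # pyramid feature names to the corresponding tensors.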
        bottom_up_features = self.bottom_up(x)
        results = []
        # Take the lowest-resolution (top-most) bottom-up map, in_features[-1], and pass it
        # through the first lateral conv; this starts the top-down pathway.
        prev_features = self.lateral_convs[0](bottom_up_features[self.in_features[-1]])
        # Apply the matching output conv to get the first (coarsest) pyramid level and append it to the results.
        results.append(self.output_convs[0](prev_features))

        # Reverse feature maps into top-down order (from low to high resolution)
        for features, lateral_conv, output_conv in zip(
            self.rev_in_features[1:], self.lateral_convs[1:], self.output_convs[1:]
        ):
            # Fetch the bottom-up feature map of the next lower (higher-resolution) level.
            features = bottom_up_features[features]
            # Upsample the previous level's map 2x so that it can be fused with the current level.
            top_down_features = F.interpolate(prev_features, scale_factor=2.0, mode="nearest")
            # Has to use explicit forward due to https://github.com/pytorch/pytorch/issues/47336
            # Pass the current level's map through its lateral (1x1) conv.
            lateral_features = lateral_conv.forward(features)
            # Add the lateral map and the upsampled map to get this level's merged map.
            prev_features = lateral_features + top_down_features
            # If fuse_type is "avg", divide the sum (prev_features) by 2, i.e. take the
            # element-wise mean of the top-down and lateral features.
            if self._fuse_type == "avg":
                prev_features /= 2
            # Apply the output (3x3) conv to the merged map and insert the result at the front of the list.
            results.insert(0, output_conv.forward(prev_features))

        # If a top block exists, check whether its input feature is one of the bottom-up feature maps.
        if self.top_block is not None:
            # If so (e.g. "res5"), use that bottom-up feature map as the top block's input.
            if self.top_block.in_feature in bottom_up_features:
                top_block_in_feature = bottom_up_features[self.top_block.in_feature]
            # Otherwise (e.g. "p5"), use the already computed FPN output with that name.
            else:
                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]
            # Run the top block on that input and append its outputs to the results.
            results.extend(self.top_block(top_block_in_feature))
        # The number of generated feature maps must match the number of expected feature names.
        assert len(self._out_features) == len(results)
        # Return the results as a dict: feature-map name -> feature map.
        return dict(list(zip(self._out_features, results)))

    def output_shape(self):
        return {
            name: ShapeSpec(
                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]
            )
            for name in self._out_features
        }


def _assert_strides_are_log2_contiguous(strides):
    """
    Assert that each stride is 2x times its preceding stride, i.e. "contiguous in log2".
    """
    for i, stride in enumerate(strides[1:], 1):
        assert stride == 2 * strides[i - 1], "Strides {} {} are not log2 contiguous".format(
            stride, strides[i - 1]
        )


class LastLevelMaxPool(nn.Module):
    """
    This module is used in the original FPN to generate a downsampled
    P6 feature from P5.
    """

    def __init__(self):
        super().__init__()
        self.num_levels = 1
        self.in_feature = "p5"

    def forward(self, x):
        return [F.max_pool2d(x, kernel_size=1, stride=2, padding=0)]


class LastLevelP6P7(nn.Module):
    """
    This module is used in RetinaNet to generate extra layers, P6 and P7 from
    C5 feature.
    """

    def __init__(self, in_channels, out_channels, in_feature="res5"):
        super().__init__()
        self.num_levels = 2
        self.in_feature = in_feature
        self.p6 = nn.Conv2d(in_channels, out_channels, 3, 2, 1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, 2, 1)
        for module in [self.p6, self.p7]:
            weight_init.c2_xavier_fill(module)

    def forward(self, c5):
        p6 = self.p6(c5)
        p7 = self.p7(F.relu(p6))
        return [p6, p7]


@BACKBONE_REGISTRY.register()
def build_resnet_fpn_backbone(cfg, input_shape: ShapeSpec):
    """
    Args:
        cfg: a detectron2 CfgNode

    Returns:
        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
    """
    bottom_up = build_resnet_backbone(cfg, input_shape)
    in_features = cfg.MODEL.FPN.IN_FEATURES
    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
    backbone = FPN(
        bottom_up=bottom_up,
        in_features=in_features,
        out_channels=out_channels,
        norm=cfg.MODEL.FPN.NORM,
        top_block=LastLevelMaxPool(),
        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
    )
    return backbone


@BACKBONE_REGISTRY.register()
def build_retinanet_resnet_fpn_backbone(cfg, input_shape: ShapeSpec):
    """
    Args:
        cfg: a detectron2 CfgNode

    Returns:
        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
    """
    bottom_up = build_resnet_backbone(cfg, input_shape)
    in_features = cfg.MODEL.FPN.IN_FEATURES
    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
    in_channels_p6p7 = bottom_up.output_shape()["res5"].channels
    backbone = FPN(
        bottom_up=bottom_up,
        in_features=in_features,
        out_channels=out_channels,
        norm=cfg.MODEL.FPN.NORM,
        top_block=LastLevelP6P7(in_channels_p6p7, out_channels),
        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
    )
    return backbone
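
To see the stride bookkeeping of `FPN.__init__` in isolation, the sketch below reproduces only that logic with plain Python, without detectron2 or PyTorch. `fpn_output_strides` is a hypothetical helper written for this post, not part of detectron2; it mirrors the log2-contiguity check, the `"p<stage>"` naming and the extra levels added by a top block.

```python
import math


def fpn_output_strides(in_strides, num_extra_levels=0):
    """Return {"p<stage>": stride} for the FPN outputs.

    in_strides: strides of the input feature maps, from high to low resolution.
    num_extra_levels: levels added by a top block (1 for P6, 2 for P6 and P7).
    """
    # Each stride must be exactly twice the previous one.
    for prev, cur in zip(in_strides, in_strides[1:]):
        assert cur == 2 * prev, f"Strides {cur} {prev} are not log2 contiguous"

    out = {f"p{int(math.log2(s))}": s for s in in_strides}

    # Every extra level continues from the last stage and doubles the stride.
    last_stage = int(math.log2(in_strides[-1]))
    for s in range(last_stage, last_stage + num_extra_levels):
        out[f"p{s + 1}"] = 2 ** (s + 1)
    return out


print(fpn_output_strides([4, 8, 16, 32], num_extra_levels=1))
# {'p2': 4, 'p3': 8, 'p4': 16, 'p5': 32, 'p6': 64}
print(fpn_output_strides([8, 16, 32], num_extra_levels=2))
# {'p3': 8, 'p4': 16, 'p5': 32, 'p6': 64, 'p7': 128}
```

The first call corresponds to the setup with `LastLevelMaxPool` (one extra level), the second to a RetinaNet-style setup with `LastLevelP6P7` (two extra levels), assuming the input features listed in each call.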