【ASFF】《Learning Spatial Fusion for Single-Shot Object Detection》

arXiv-2019

https://github.com/GOATmessi7/ASFF


Table of Contents

  • [1 Background and Motivation](#1 Background and Motivation)
  • [2 Related Work](#2 Related Work)
  • [3 Advantages / Contributions](#3 Advantages / Contributions)
  • [4 Method](#4 Method)
    • [4.1 Strong Baseline](#4.1 Strong Baseline)
    • [4.2 Adaptively Spatial Feature Fusion](#4.2 Adaptively Spatial Feature Fusion)
      • [4.2.1 Feature Resizing](#4.2.1 Feature Resizing)
      • [4.2.2 Adaptive Fusion](#4.2.2 Adaptive Fusion)
    • [4.3 Consistency Property](#4.3 Consistency Property)
  • [5 Experiments](#5 Experiments)
    • [5.1 Datasets and Metrics](#5.1 Datasets and Metrics)
    • [5.2 Ablation Study](#5.2 Ablation Study)
    • [5.3 Evaluation on Other Single-Shot Detector](#5.3 Evaluation on Other Single-Shot Detector)
    • [5.4 Comparison to State of the Art](#5.4 Comparison to State of the Art)
  • [6 Conclusion(own) / Future work](#6 Conclusion(own) / Future work)

1 Background and Motivation

In object detection, feature pyramids alleviate scale variation across objects (instances of the same class can appear at very different sizes).

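the inconsistency across different feature scales is a primary limitation for the single-shot detectors based on feature pyramid (objects of the same class should ideally have similar features, but because of their size they land on different pyramid levels; nothing forces the features at different levels to agree, which can hurt performance)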

if an image contains both small and large objects, the conflict among features at different levels tends to occupy the major part of the feature pyramid

The authors propose adaptively spatial feature fusion (ASFF), which improves the scale-invariance of features with nearly free inference overhead.

2 Related Work

Feature pyramid representations or multi-level feature

still suffer from the inconsistency across different scales

The authors' method

adaptively learns the import degrees (i.e., the importance weights) for different levels of features on each location to avoid spatial contradiction

3 Advantages / Contributions

Proposes the ASFF module, which is plug-and-play and nearly cost-free; it strengthens the feature pyramid to address the inconsistency in feature pyramids of single-shot detectors.

Validates its effectiveness on the COCO dataset.

4 Method

4.1 Strong Baseline

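Starting from an open-source YOLOv3 implementation, the authors add several effective tricks, which improve the results noticeably.

BoF stands for Bag of Freebies: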

Zhang Z, He T, Zhang H, et al. Bag of freebies for training object detection neural networks[J]. arXiv preprint arXiv:1902.04103, 2019.

GA stands for guided anchoring:

Wang J, Chen K, Yang S, et al. Region proposal by guided anchoring[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 2965-2974.

For details, see the summary section at the end of this post.

IoU means that an additional IoU loss was introduced.

4.2 Adaptively Spatial Feature Fusion

adaptively learn the spatial weight of fusion for feature maps at each scale

This is the core of the paper.

4.2.1 Feature Resizing

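For 2x downsampling: a 3x3 convolution with stride 2, which changes the channel count and the resolution at the same time (the code uses `add_conv(256, self.inter_dim, 3, 2)`).

For 4x downsampling: add a 2-stride max pooling layer before the 2-stride convolution.

Upsampling uses interpolation, applied after a 1x1 convolution that compresses the channels to those of the target level.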
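The resolution arithmetic behind these three cases can be checked with a small sketch. It assumes the kernel, stride and padding values used by the code later in this post (3x3 kernels, stride 2, padding 1) and hypothetical feature-map sizes of 13, 26 and 52, which would correspond to a 416x416 input; the helper names are made up for illustration.

```python
def out_size(size, kernel, stride, pad):
    """Spatial output size of a conv or pooling layer (floor mode, dilation 1)."""
    return (size + 2 * pad - kernel) // stride + 1


def downsample_2x(size):
    # 3x3 convolution, stride 2, padding (3 - 1) // 2 = 1
    return out_size(size, kernel=3, stride=2, pad=1)


def downsample_4x(size):
    # 3x3 max pooling (stride 2, padding 1), then the stride-2 convolution
    pooled = out_size(size, kernel=3, stride=2, pad=1)
    return downsample_2x(pooled)


def upsample(size, factor):
    # interpolation with an integer scale factor multiplies the size
    return size * factor


# assumed sizes for a 416x416 input: level 0 = 13, level 1 = 26, level 2 = 52
print(downsample_2x(52), downsample_4x(52), upsample(13, 4))
```

With these assumed sizes, every level can be brought to the resolution of any other level, which is what the fusion step requires.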

4.2.2 Adaptive Fusion

Core formula

Let $x_{ij}^{n \to l}$ denote the feature vector at the position $(i, j)$ on the feature maps resized from level $n$ to level $l$.
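The formula image itself did not survive in this post. Based on the definition above and on the weighted sum in the code later in this post, the fusion at level $l$ can be reconstructed as follows, where $y_{ij}^{l}$ is the fused output vector and $\alpha$, $\beta$, $\gamma$ are the per-position weights for levels 1, 2 and 3 (a reconstruction, not a verbatim copy):

```latex
y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l}
           + \beta_{ij}^{l}  \cdot x_{ij}^{2 \to l}
           + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l}
```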

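The pyramid features are resized to the same size and then combined by a weighted sum; the difference from plain fusion is that the weights are learned. The weights are shared across all the channels, which makes this somewhat similar to spatial attention.

$\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$

The weights are constrained to sum to 1; in the implementation this is a softmax.

$\lambda$ are the control parameters. They do not appear under that name in the code; the pre-softmax outputs of `self.weight_levels` appear to play that role.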
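Written out, and assuming $\lambda_{\alpha}$, $\lambda_{\beta}$, $\lambda_{\gamma}$ denote the three per-position scalars fed into the softmax at level $l$, the constraint follows from a definition of this form (shown for $\alpha$; $\beta$ and $\gamma$ are analogous):

```latex
\alpha_{ij}^{l} =
  \frac{e^{\lambda_{\alpha_{ij}}^{l}}}
       {e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}
```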
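To make the softmax weighting concrete, here is a minimal pure-Python sketch for a single position $(i, j)$. The feature values and the three scalars are made up, and `fuse_position` is a hypothetical helper, not a function from the repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_position(features, lambdas):
    """Fuse the feature vectors of one position (i, j).

    features: three equal-length lists, one per resized level.
    lambdas:  three scalars (one per level) for this position.
    """
    alpha, beta, gamma = softmax(lambdas)
    # the same three weights are applied to every channel
    fused = [alpha * a + beta * b + gamma * c for a, b, c in zip(*features)]
    return fused, (alpha, beta, gamma)


# made-up values: one position, 4 channels, level 1 strongly preferred
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [4.0, 3.0, 2.0, 1.0]]
fused, weights = fuse_position(features, lambdas=[2.0, 0.0, 0.0])
print(weights, fused)
```

Because the weights come from a softmax, they are positive and sum to 1, so the fused vector is a convex combination of the three inputs at that position.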

Let's look at the code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASFF(nn.Module):
    def __init__(self, level, rfb=False, vis=False):
        super(ASFF, self).__init__()
        self.level = level
        self.dim = [512, 256, 256]
        self.inter_dim = self.dim[self.level]
        if level==0:
            self.stride_level_1 = add_conv(256, self.inter_dim, 3, 2)
            self.stride_level_2 = add_conv(256, self.inter_dim, 3, 2)
            self.expand = add_conv(self.inter_dim, 1024, 3, 1)
        elif level==1:
            self.compress_level_0 = add_conv(512, self.inter_dim, 1, 1)
            self.stride_level_2 = add_conv(256, self.inter_dim, 3, 2)
            self.expand = add_conv(self.inter_dim, 512, 3, 1)
        elif level==2:
            self.compress_level_0 = add_conv(512, self.inter_dim, 1, 1)
            self.expand = add_conv(self.inter_dim, 256, 3, 1)

        compress_c = 8 if rfb else 16  #when adding rfb, we use half number of channels to save memory

        self.weight_level_0 = add_conv(self.inter_dim, compress_c, 1, 1)
        self.weight_level_1 = add_conv(self.inter_dim, compress_c, 1, 1)
        self.weight_level_2 = add_conv(self.inter_dim, compress_c, 1, 1)

        self.weight_levels = nn.Conv2d(compress_c*3, 3, kernel_size=1, stride=1, padding=0)
        self.vis= vis


    def forward(self, x_level_0, x_level_1, x_level_2):
        if self.level==0:
            level_0_resized = x_level_0
            level_1_resized = self.stride_level_1(x_level_1)

            level_2_downsampled_inter =F.max_pool2d(x_level_2, 3, stride=2, padding=1)
            level_2_resized = self.stride_level_2(level_2_downsampled_inter)

        elif self.level==1:
            level_0_compressed = self.compress_level_0(x_level_0)
            level_0_resized =F.interpolate(level_0_compressed, scale_factor=2, mode='nearest')
            level_1_resized =x_level_1
            level_2_resized =self.stride_level_2(x_level_2)
        elif self.level==2:
            level_0_compressed = self.compress_level_0(x_level_0)
            level_0_resized =F.interpolate(level_0_compressed, scale_factor=4, mode='nearest')
            level_1_resized =F.interpolate(x_level_1, scale_factor=2, mode='nearest')
            level_2_resized =x_level_2

        # compress each resized feature map to compress_c channels (16, or 8 with rfb)
        level_0_weight_v = self.weight_level_0(level_0_resized)
        level_1_weight_v = self.weight_level_1(level_1_resized)
        level_2_weight_v = self.weight_level_2(level_2_resized)
        levels_weight_v = torch.cat((level_0_weight_v, level_1_weight_v, level_2_weight_v),1) # concatenate along channels
        levels_weight = self.weight_levels(levels_weight_v) # reduce to 3 channels, one per level
        levels_weight = F.softmax(levels_weight, dim=1) # softmax over the channel dimension

        # weighted sum of the resized feature maps; each weight map is broadcast over channels
        fused_out_reduced = level_0_resized * levels_weight[:,0:1,:,:]+\
                            level_1_resized * levels_weight[:,1:2,:,:]+\
                            level_2_resized * levels_weight[:,2:,:,:]

        out = self.expand(fused_out_reduced) # expand the channel count

        if self.vis:
            return out, levels_weight, fused_out_reduced.sum(dim=1)
        else:
            return out
```

where add_conv is defined as follows:

```python
import torch.nn as nn


def add_conv(in_ch, out_ch, ksize, stride, leaky=True):
    """
    Add a conv2d / batchnorm / leaky ReLU block.
    Args:
        in_ch (int): number of input channels of the convolution layer.
        out_ch (int): number of output channels of the convolution layer.
        ksize (int): kernel size of the convolution layer.
        stride (int): stride of the convolution layer.
        leaky (bool): use LeakyReLU(0.1) if True, otherwise ReLU6.
    Returns:
        stage (Sequential) : Sequential layers composing a convolution block.
    """
    stage = nn.Sequential()
    pad = (ksize - 1) // 2  # "same"-style padding for odd kernel sizes
    stage.add_module('conv', nn.Conv2d(in_channels=in_ch,
                                       out_channels=out_ch, kernel_size=ksize, stride=stride,
                                       padding=pad, bias=False))
    stage.add_module('batch_norm', nn.BatchNorm2d(out_ch))
    if leaky:
        stage.add_module('leaky', nn.LeakyReLU(0.1))
    else:
        stage.add_module('relu6', nn.ReLU6(inplace=True))
    return stage
```

level = 0

level = 1

level = 2
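The three cases can also be traced as shapes. The sketch below follows the branches of `ASFF.forward` without running any network; the channel counts are read off the `add_conv` calls in the class, while the spatial sizes 13, 26 and 52 are an assumption (a 416x416 input), and `resized_shapes` is a hypothetical helper.

```python
# (channels, spatial size) of x_level_0, x_level_1, x_level_2; sizes are assumed
INPUTS = {0: (512, 13), 1: (256, 26), 2: (256, 52)}
INTER_DIM = [512, 256, 256]  # self.dim in the ASFF class


def resized_shapes(level):
    """Shapes of level_0/1/2_resized inside ASFF(level).forward."""
    target_c = INTER_DIM[level]
    shapes = []
    for src, (c, s) in INPUTS.items():
        if src == level:
            # identity branch: the feature map is used as-is
            shapes.append((c, s))
        elif src > level:
            # finer level: stride-2 conv (preceded by max pooling for 4x)
            shapes.append((target_c, s // 2 ** (src - level)))
        else:
            # coarser level: channels brought to target_c, then interpolation
            shapes.append((target_c, s * 2 ** (level - src)))
    return shapes


for level in range(3):
    print(level, resized_shapes(level))
```

For every level, the three resized maps end up with the same channel count and resolution, which is what allows the element-wise weighted sum.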

4.3 Consistency Property

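Backpropagation derivation:

Simplifying:

If the resize step involves conv + activation, I suspect it cannot really be simplified this way, haha.

Simplifying further, when the feature maps are fused by add or concat,

the result is

The backpropagation formula for the authors' method is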
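The formula images of this subsection did not survive. The chain of steps described above can be reconstructed as follows for the gradient at position $(i, j)$ of the level-1 feature map, assuming $y_{ij}^{l}$ is the fused output at level $l$ and $\alpha_{ij}^{l}$ is the weight given to the level-1 features in that output (a reconstruction from the text, not a verbatim copy):

```latex
% chain rule over the three outputs that x^1 contributes to
\frac{\partial L}{\partial x_{ij}^{1}}
  = \frac{\partial y_{ij}^{1}}{\partial x_{ij}^{1}} \cdot \frac{\partial L}{\partial y_{ij}^{1}}
  + \frac{\partial x_{ij}^{1 \to 2}}{\partial x_{ij}^{1}} \cdot
    \frac{\partial y_{ij}^{2}}{\partial x_{ij}^{1 \to 2}} \cdot \frac{\partial L}{\partial y_{ij}^{2}}
  + \frac{\partial x_{ij}^{1 \to 3}}{\partial x_{ij}^{1}} \cdot
    \frac{\partial y_{ij}^{3}}{\partial x_{ij}^{1 \to 3}} \cdot \frac{\partial L}{\partial y_{ij}^{3}}

% simplification: treat resizing as the identity, \partial x^{1 \to l} / \partial x^{1} \approx 1
\frac{\partial L}{\partial x_{ij}^{1}}
  = \frac{\partial y_{ij}^{1}}{\partial x_{ij}^{1}} \cdot \frac{\partial L}{\partial y_{ij}^{1}}
  + \frac{\partial y_{ij}^{2}}{\partial x_{ij}^{1 \to 2}} \cdot \frac{\partial L}{\partial y_{ij}^{2}}
  + \frac{\partial y_{ij}^{3}}{\partial x_{ij}^{1 \to 3}} \cdot \frac{\partial L}{\partial y_{ij}^{3}}

% add or concat fusion: every remaining partial derivative equals 1
\frac{\partial L}{\partial x_{ij}^{1}}
  = \frac{\partial L}{\partial y_{ij}^{1}}
  + \frac{\partial L}{\partial y_{ij}^{2}}
  + \frac{\partial L}{\partial y_{ij}^{3}}

% ASFF: the partial derivatives are the learned fusion weights
\frac{\partial L}{\partial x_{ij}^{1}}
  = \alpha_{ij}^{1} \cdot \frac{\partial L}{\partial y_{ij}^{1}}
  + \alpha_{ij}^{2} \cdot \frac{\partial L}{\partial y_{ij}^{2}}
  + \alpha_{ij}^{3} \cdot \frac{\partial L}{\partial y_{ij}^{3}}
```

Note that $\alpha_{ij}^{1}$, $\alpha_{ij}^{2}$ and $\alpha_{ij}^{3}$ here come from three different fusion modules (one per output level), so they are not constrained to sum to 1 among themselves.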

In this way, choosing $\alpha$ appropriately keeps the gradients of the other levels from interfering.

For example, if the object is predicted by level 1: $\alpha_{ij}^1 = 1$, $\alpha_{ij}^2 = 0$, $\alpha_{ij}^3 = 0$
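A small numeric sketch of this example, using made-up scalar gradients and the simplification that resizing has derivative 1. The function names are hypothetical; `a1`, `a2`, `a3` stand for the weight of the level-1 features in the outputs of levels 1, 2 and 3.

```python
def grad_sum_fusion(g1, g2, g3):
    """Gradient reaching x^1 when levels are fused by add or concat."""
    return g1 + g2 + g3


def grad_asff(g1, g2, g3, a1, a2, a3):
    """Gradient reaching x^1 under ASFF, weighted by the fusion weights."""
    return a1 * g1 + a2 * g2 + a3 * g3


# made-up gradients: level 1 treats the position as a positive sample,
# levels 2 and 3 treat the same position as background
g1, g2, g3 = 1.0, -0.6, -0.5

mixed = grad_sum_fusion(g1, g2, g3)            # conflicting signals nearly cancel
filtered = grad_asff(g1, g2, g3, 1.0, 0.0, 0.0)  # only the level-1 signal remains
print(mixed, filtered)
```

With plain summation the positive and negative signals largely cancel, while the weights from the example let only the level-1 gradient through.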

5 Experiments

5.1 Datasets and Metrics

MS COCO 2017

AP

5.2 Ablation Study

(1)Solid Baseline

Table 1; already covered in Section 4 above.

(2) Effectiveness of Adjacent Ignore Regions

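The gradient discussion earlier argued that ignoring is undesirable, yet ignore areas are used again here. I may not have grasped the point yet; I need to read the references and the code for a deeper understanding.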

(3)Adaptively Spatial Feature Fusion

exhibit the images that have several objects of different sizes



5.3 Evaluation on Other Single-Shot Detector

This demonstrates that the module is plug-and-play.

5.4 Comparison to State of the Art

6 Conclusion(own) / Future work

  • trained to find the optimal fusion
  • fusion is differentiable (so it can be trained with backpropagation)

Wang J, Chen K, Yang S, et al. Region proposal by guided anchoring[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 2965-2974.

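See also: 《目标检测正负样本区分策略和平衡策略总结(三)》 (a summary of positive/negative sample assignment and balancing strategies in object detection, part 3)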


