Revisiting Proposal-based Object Detection阅读笔记

论文地址：link

Abstract

For any object detector, the obtained box proposals or queries need to be classified and regressed towards ground truth boxes.

对于任何物体检测器来说，获得的 box proposals或queries需要被分类并回归到真实框上。(也就是对于每个预测框与要找到相对应的GT)

The common solution for the final predictions is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression.

常见的最终预测解决方案是直接最大化每个proposal与真实框之间的重叠，然后通过winner-takes-all ranking或非最大值抑制进行分类。

本文提出了一个简单有效的替代方案。

对于Proposals回归，即回归到Proposals和真实边界框之间的交集区域。

交集区域就是Proposal中包含GT也就是目标的区域。

通过这种方式，每个Proposal只指定包含物体的部分，避免了一个盲目的修复问题，即需要将proposal回归到它们的视觉范围之外。反过来，我们替换了胜者通吃策略，并通过取一个围绕物体的proposal group的回归交集的并集来获得最终预测。

Introduction

文中提到目前物体检测常见设置的两个问题，并提出了一个简单的解决方案。

首先，任何物体检测器的目标，例如 $6, 34, 35$ ，无论其架构如何，都是要准确地学习独立代表图像中真实物体的提议框 。正如图1中所强调的，这通常是一个提出不当的问题。在检测器的前向传播过程中，生成的提议框通常只捕获其范围内的部分真实物体。学习进行完美的真实物体对齐导致了一个盲目的内部绘制挑战。
其次，在所有检测器的前向传播过程中，一个常见的现象是，总是有多个提议或查询与一个真实物体相关 。==尽管它们为真实物体提供了互补的视角，但非最大值抑制或排名只是为了排除重复的检测。==这样做的同时，这些方法忽视了被丢弃提议中的宝贵信息。

本文decompose the problems of proposal-to-ground truth regression and proposal candidate selection into easier to solve intersection and union problems.

proposal-to-GT: intersection

proposal candidate: union

Intersection-based Regression

我们为提议的回归设定了一个新目标：不是预测与真实物体的重叠，而是仅预测交集区域。因此，我们只在proposal的视觉范围内向真实物体回归。

Intersection-based Groupin

给定一组预测了真实物体交集的提议，我们通过取交集区域的并集形成最终预测。换句话说，我们不仅选择一个区域中最有信心的提议，而是we use the wisdom of the crowd to form our final predictions.

这两个阶段对现有物体检测流程的改变很小。我们只改变了回归头目标为交集，并修改了winner-takes-all的后处理过程，加入了一个分组过程。因此，我们的方法可以直接插入到任何物体检测器中。尽管技术上简单，提出的分解直接提高了检测性能。我们展示了我们重新审视的方法如何提高多个数据集上的典型检测和实例分割方法，特别是在评估时的高重叠阈值

Method

Problem statement:

传统regression存在问题：the proposal boxes P are commonly badly aligned, with only a part of the ground truth G visible in the scope of the proposal. Consequently, the function f is compelled to extend P to regions beyond its initial coverage, essentially requiring it to predict beyond its visible scope. This scenario leads to an ill-posed problem

传统的proposal regression会将proposal拓展，像GT去回归，这本质上要求它预测超出其可见范围的部分。这种情况导致了一个不恰当的问题，因为 f 被迫推断出 P 内部不包含的 G 的部分。

记录Problem statement写作样本：

Problem statement

In traditional object detection, bounding box regression involves learning a mapping function f that transforms a proposal box P=(P_{x1}, P_{y1}, P_{x2}, P_{y2}) to closely approximate a ground truth box G=(G_{x1}, G_{y1}, G_{x2}, G_{y2}), where (x1, y1, x2, y2) refers to the top-left x and y coordinates and bottom-right x and y coordinates respectively. Typically, this is achieved by minimizing the L1 or L2 loss between the predicted box and the ground truth. However, this approach often poses a challenge: the proposal boxes P are commonly badly aligned, with only a part of the ground truth G visible in the scope of the proposal. Consequently, the function f is compelled to extend P to regions beyond its initial coverage, essentially requiring it to predict beyond its visible scope. This scenario leads to an ill-posed problem, as f is forced to infer parts of G not contained within P. Then, the set of all proposals for a ground truth object P={P_1, P_2, ..., P_n}, providing complementary views, are sorted and only one candidate is picked via non-maximum suppression or ranking. This assumes that the intersection-over-union problem is already solved and only serves to exclude duplicate detections in the same region.

To address these issues, first, we redefine the task as an intersection learning problem in Section 3.1. For individual proposal boxes, we only require solving the intersection alignment with ground truth boxes, i.e. we only need to regress to the part of the proposal that is shared with the ground truth. Then, instead of discarding complementary information from proposals, our method strategically learns to leverage these fragments in Section 3.2. Given a collection of proposal boxes and their identified intersection, we can replace the non-maximum suppression with a union operator, where we identify the final object as the union-over-intersection of all the proposals in the same region. Figure 2 shows how these two simpler tasks lead to a final object detection.

问题陈述

在传统的目标检测中，边界框回归涉及学习一个映射函数 f，该函数将一个提议框 P=(P_{x1}, P_{y1}, P_{x2}, P_{y2}) 转换为与真实框 G=(G_{x1}, G_{y1}, G_{x2}, G_{y2}) 非常接近，其中 (x1, y1, x2, y2) 分别指的是左上角的 x 和 y 坐标以及右下角的 x 和 y 坐标。通常，这是通过最小化预测框和真实框之间的 L1 或 L2 损失来实现的。然而，这种方法经常带来挑战：提议框 P 通常对齐得很差，只有部分真实物体 G 在提议的范围内可见。因此，函数 f 被迫将 P 扩展到其最初覆盖范围之外的区域，本质上要求它预测超出其可见范围的部分。这种情况导致了一个不恰当的问题，因为 f 被迫推断出 P 内部不包含的 G 的部分。然后，为一个真实物体 G 的所有提议 P={P_1, P_2, ..., P_n} 提供补充视角，它们被排序，并且通过非最大值抑制或排名只选择一个候选。这假设交并比问题已经解决，只是为了排除同一区域内的重复检测。

为了解决这些问题，首先，我们在第3.1节中将任务重新定义为一个交集学习问题。对于单个提议框，我们只需要解决与真实框的交集对齐问题，即我们只需要回归到proposals与真实物体共享的部分。然后，我们的方法不是丢弃提议中的补充信息，而是在第3.2节中策略性地学习利用这些碎片。鉴于一系列提议框及其识别的交集，我们可以用并集操作符替换非最大值抑制，在这个操作符中，我们将最终物体识别为同一区域内所有提议的交集上的并集。图2展示了这两个更简单的任务是如何导致最终的目标检测的。

3.1 基于交集的回归

我们将回归任务重新定义为一个交集学习问题。与其回归到整个真实框，每个提议框的任务是仅回归到真实物体可见的部分，即提议框与真实框之间的交集。这使得映射函数 f 更加明确定义，从而更容易学习这样的转换。

设 I=(I_{x1}, I_{y1}, I_{x2}, I_{y2}) 为提议框 P 与真实框 G 的交集。新的任务是学习一个映射函数 f'，使得 f'(P) ≈ I。为了监督目的，交集按以下方式计算：

I_{x1} = max(P_{x1}, G_{x1}), (1)

I_{y1} = max(P_{y1}, G_{y1}), (2)

I_{x2} = min(P_{x2}, G_{x2}), (3)

I_{y2} = min(P_{y2}, G_{y2}). (4)

这个任务的损失函数定义为：

L_t = ∑|f'(P_i) - I_{t_i}|, (5)

其中 t 取值1或2，分别对应于回归训练中应用 L1 或 L2 损失。

我们方法的第二个方面涉及策略性地利用为单一真实物体生成的多个提议框中包含的部分信息。

3.2 基于交集的分组

在传统的目标检测方法中，通常选择一个提议来代表最终检测，而其余的提议则被丢弃。我们的方法不是丢弃这些包含有价值但碎片化信息的提议，而是学会有效地合并这些片段。因此，我们的方法产生了一个更全面准确的真实物体的表征。

公式化

让我们将所有为一个真实物体提出的提议集合表示为 P={P_1, P_2, ..., P_n}，与真实物体相交的对应交集表示为 T={I_1, I_2, ..., I_n}。我们的目标是找到这些交集的组合，最好地代表真实物体。我们定义一个组合函数 c : T → R^4，它接受一组交集并输出一个边界框。任务是学习这个函数，以最小化组合框和真实框 G 之间的损失：

L = ∑ |c_i(T) - G_i| . (6)

我们的模型旨在通过交集回归理解提议的部分性质。交集的组合分为两个阶段：分组和回归，如下所述。

分组

为了执行分组，我们引入一个分组函数 g : P → G，将proposals P 映射到一组组 G={g_1, g_2, ..., g_m}。每个组 g_j 是 P 的一个子集，并代表图像中的潜在物体。分组函数旨在将可能属于同一物体的proposals分组在一起。这是通过考虑提议之间的空间重叠和语义相似性来实现的。一旦提议被分组，我们结合每个组的提议对应的交集，得到一组组合框 B={B_1, B_2, ..., B_m}。每个组合框 B_j 是由组 g_j 代表的物体的候选边界框。

精炼

最后，我们执行一个回归步骤来精炼组合框。我们定义一个回归函数 r : B → R，将每个组合框映射到最终的目标框。回归函数学习最小化目标框和真实框之间的损失：

L = ∑ ∑ |r(B_{ji}) - T_{ji}| , (7)

其中 B_{ji} 是组合框 B_j 的第 i 个坐标，T_{ji} 是对应于 B_j 的真实框的第 i 个坐标。这种方法使得可以将多个提议中的有价值信息整合到一个增强的提议中。我们的方法不是选择一个最优候选并丢弃其他所有候选，而是提取并合并每个提议中最相关的方面，从而构建一个更准确代表目标真实物体的优越候选。

将这种分解整合到现有的目标检测器中只需要做少量的改动。对于基于交集的回归，我们只需要改变回归头中的目标坐标，从真实框改为提议和真实物体之间的交集区域。为了通过基于交集的分组获得最终的目标检测输出，我们对提议进行排序和聚类，类似于非最大值抑制。我们不仅保留最上面的框，还取同一簇中回归交集的并集作为输出。尽管这种方法的简单性，我们展示了在目标检测中分解交并比对齐直接影响性能。