We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr .
The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [ 37 , 5 ], anchors [ 23 ], or window centers [ 53 , 46 ]. Their performances are significantly influenced by postprocessing steps to collapse near-duplicate predictions, by the design of the anchor sets and by the heuristics that assign target boxes to anchors [ 52 ]. To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [ 43 , 16 , 4 , 39 ] either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks. This paper aims to bridge this gap.
Fig. 1: DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions with ground truth boxes. Predictions with no match should yield a "no object" ($\varnothing$) class prediction.
We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers [ 47 ], a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.
Our DEtection TRansformer (DETR, see Figure 1 ) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression. Unlike most existing detection methods, DETR doesn't require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.
Compared to most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [ 29 , 12 , 10 , 8 ]. In contrast, previous work focused on autoregressive decoding with RNNs [ 43 , 41 , 30 , 36 , 42 ]. Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.
We evaluate DETR on one of the most popular object detection datasets, COCO [ 24 ], against a very competitive Faster R-CNN baseline [ 37 ]. Faster RCNN has undergone many design iterations and its performance was greatly improved since the original publication. Our experiments show that our new model achieves comparable performances. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performances on small objects. We expect that future work will improve this aspect in the same way the development of FPN [ 22 ] did for Faster R-CNN.
Training settings for DETR differ from standard object detectors in multiple ways. The new model requires an extra-long training schedule and benefits from auxiliary decoding losses in the transformer. We thoroughly explore what components are crucial for the demonstrated performance.
The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pretrained DETR outperforms competitive baselines on Panoptic Segmentation [ 19 ], a challenging pixel-level recognition task that has recently gained popularity.
Our work builds on prior work in several domains: bipartite matching losses for set prediction, encoder-decoder architectures based on the transformer, parallel decoding, and object detection methods.
There is no canonical deep learning model to directly predict sets. The basic set prediction task is multilabel classification (see e.g., [ 40 , 33 ] for references in the context of computer vision) for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). The first difficulty in these tasks is to avoid near-duplicates. Most current detectors use postprocessings such as non-maximal suppression to address this issue, but direct set prediction is postprocessing-free. It needs global inference schemes that model interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [ 9 ] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [ 48 ]. In all cases, the loss function should be invariant to a permutation of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [ 20 ], to find a bipartite matching between ground-truth and prediction. This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.
【Analysis】Set prediction is hard because conventional models are designed for sequences or structured outputs, whereas a set is unordered. In computer vision, the simplest set prediction task is multilabel classification, e.g. deciding which object categories appear in an image. The classical one-vs-rest approach trains a binary classifier per category, but it cannot handle detection, where position and class must be predicted jointly and the same category may have several instances. The core difficulty in detection is avoiding duplicate detections of the same object: if a model predicts ten near-identical boxes around the same person, the output is clearly unreasonable. Traditional detectors resolve this with post-processing such as non-maximum suppression, at the cost of extra system complexity. Direct set prediction instead aims to avoid duplicates at prediction time, which requires global reasoning: every prediction must be "aware" of all the others so that redundancy is suppressed automatically. Autoregressive models can do this, but they generate predictions serially and are therefore slow. Permutation invariance is the central requirement of set prediction, since a set has no order: the outputs [A, B, C] and [C, A, B] must be treated as equivalent. The Hungarian algorithm offers an elegant solution by finding the optimal one-to-one correspondence between the predicted set and the ground-truth set, so that every true object has a unique matching prediction. DETR's contribution is to combine this bipartite matching loss with the parallel decoding of transformers, guaranteeing duplicate-free predictions while keeping computation parallel and efficient.
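As a small illustration of why Hungarian matching makes a set loss permutation-invariant, here is a toy sketch using SciPy's assignment solver (the cost values are made up for illustration):

```python
# Toy illustration: optimal bipartite matching over a cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.1, 0.9, 0.8],   # cost[i, j]: cost of matching prediction i
                 [0.7, 0.2, 0.9],   # to ground-truth j (lower is better)
                 [0.8, 0.9, 0.3]])
rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
print(list(zip(rows, cols)))               # optimal pairs: (0,0), (1,1), (2,2)
print(cost[rows, cols].sum())              # total matching cost: 0.6
# Permuting the rows (predictions) permutes the returned indices,
# but the total matching cost stays the same.
```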
2.2 Transformers and Parallel Decoding
Transformers were introduced by Vaswani et al . [ 47 ] as a new attention-based building block for machine translation. Attention mechanisms [ 2 ] are neural network layers that aggregate information from the entire input sequence. Transformers introduced self-attention layers, which, similarly to Non-Local Neural Networks [ 49 ], scan through each element of a sequence and update it by aggregating information from the whole sequence. One of the main advantages of attention-based models is their global computations and perfect memory, which makes them more suitable than RNNs on long sequences. Transformers are now replacing RNNs in many problems in natural language processing, speech processing and computer vision [ 8 , 27 , 45 , 34 , 31 ].
Transformers were first used in auto-regressive models, following early sequence-to-sequence models [ 44 ], generating output tokens one by one. However, the prohibitive inference cost (proportional to output length, and hard to batch) led to the development of parallel sequence generation, in the domains of audio [ 29 ], machine translation [ 12 , 10 ], word representation learning [ 8 ], and more recently speech recognition [ 6 ]. We also combine transformers and parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.
Most modern object detection methods make predictions relative to some initial guesses. Two-stage detectors [ 37 , 5 ] predict boxes w.r.t. proposals, whereas single-stage methods make predictions w.r.t. anchors [ 23 ] or a grid of possible object centers [ 53 , 46 ]. Recent work [ 52 ] demonstrates that the final performance of these systems heavily depends on the exact way these initial guesses are set. In our model we are able to remove this hand-crafted process and streamline the detection process by directly predicting the set of detections with absolute box prediction w.r.t. the input image rather than an anchor.
Set-based loss. Several object detectors [ 9 , 25 , 35 ] used the bipartite matching loss. However, in these early deep learning models, the relations between different predictions were modeled with convolutional or fully-connected layers only, and a hand-designed NMS post-processing can improve their performance. More recent detectors [ 37 , 23 , 53 ] use non-unique assignment rules between ground truth and predictions together with an NMS.
Learnable NMS methods [ 16 , 4 ] and relation networks [ 17 ] explicitly model relations between different predictions with attention. Using direct set losses, they do not require any post-processing steps. However, these methods employ additional hand-crafted context features like proposal box coordinates to model relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.
Recurrent detectors. Closest to our approach are end-to-end set predictions for object detection [ 43 ] and instance segmentation [ 41 , 30 , 36 , 42 ]. Similarly to us, they use bipartite-matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage the recent transformers with parallel decoding.
Two ingredients are essential for direct set predictions in detection: (1) a set prediction loss that forces unique matching between predicted and ground truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relation. We describe our architecture in detail in Figure 2 .
DETR infers a fixed-size set of $N$ predictions, in a single pass through the decoder, where $N$ is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground truth objects, and then optimizes object-specific (bounding box) losses.
Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, we consider $y$ also as a set of size $N$ padded with $\varnothing$ (no object). To find a bipartite matching between these two sets we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_{N}$ with the lowest cost:
【Analysis】The ground-truth set $y$ contains all real objects in the image, while the prediction set $\hat{y}$ contains the $N$ outputs of the model. Because $N$ is usually larger than the number of real objects, the ground-truth set is padded with the special symbol $\varnothing$, meaning "no object" (background), until it also has $N$ elements. With both sets of size $N$, a one-to-one matching becomes possible. The key is to find a permutation $\sigma$ specifying which ground-truth element each prediction is paired with. $\mathfrak{S}_{N}$ denotes the set of all permutations of $N$ elements, of which there are $N!$; the goal is the permutation with the lowest matching cost, which is exactly an optimal assignment problem.
$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_{N}}{\arg\min} \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\big(y_{i}, \hat{y}_{\sigma(i)}\big), \tag{1}$$
where $\mathcal{L}_{\mathrm{match}}\big(y_{i}, \hat{y}_{\sigma(i)}\big)$ is a pair-wise matching cost between ground truth $y_{i}$ and a prediction with index $\sigma(i)$. This optimal assignment is computed efficiently with the Hungarian algorithm, following prior work (e.g. [ 43 ]).
【Analysis】The heart of this optimization is the matching cost $\mathcal{L}_{\mathrm{match}}$, which quantifies how well a prediction fits a ground-truth object. For every candidate permutation $\sigma$, the formula sums the matching cost over all pairs, and the $\arg\min$ selects the permutation $\hat{\sigma}$ that minimizes this sum. The Hungarian algorithm is the classical solver for this bipartite matching problem: it finds the optimum in polynomial time, $O(N^3)$, instead of enumerating all $N!$ permutations, which is entirely practical for the values of $N$ used here (on the order of a hundred). The matching cost typically combines a classification term and a localization term, so both the predicted class and the predicted box matter. In this way DETR learns, at every training step, how to pair predictions with ground-truth objects optimally, without hand-designed heuristic assignment rules.
The matching cost takes into account both the class prediction and the similarity of predicted and ground truth boxes. Each element $i$ of the ground truth set can be seen as $y_{i} = (c_{i}, b_{i})$ where $c_{i}$ is the target class label (which may be $\varnothing$) and $b_{i} \in [0,1]^{4}$ is a vector that defines the ground truth box center coordinates and its height and width relative to the image size. For the prediction with index $\sigma(i)$ we define the probability of class $c_{i}$ as $\hat{p}_{\sigma(i)}(c_{i})$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations we define $\mathcal{L}_{\mathrm{match}}\big(y_{i}, \hat{y}_{\sigma(i)}\big)$ as $-\mathbb{1}_{\{c_{i} \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_{i}) + \mathbb{1}_{\{c_{i} \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_{i}, \hat{b}_{\sigma(i)}\big)$.
【Analysis】The matching cost reflects the two requirements of detection: correct classification and accurate localization. Each ground-truth object $y_i$ carries a class label $c_i$ and a box $b_i$. The box uses a normalized representation with all coordinates in $[0,1]$, which removes the influence of image size and stabilizes the loss; the 4-dimensional vector holds the center $x$, $y$, width and height. Correspondingly, each prediction provides the class probability $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box $\hat{b}_{\sigma(i)}$. The indicator $\mathbb{1}_{\{c_i \neq \varnothing\}}$ controls both terms: it is 1 when the ground-truth slot is a real object and 0 when it is padding, so the (negative) class probability and the box cost are only counted for real objects. Using the negative probability for the class term encourages a high probability for the correct class, in the same spirit as cross-entropy, while $\mathcal{L}_{\mathrm{box}}$ handles the accuracy of position and size; the two terms are simply added, balancing classification and localization in the cost.
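To make the matching step concrete, below is a minimal sketch of how the cost matrix could be built and solved; the shapes, the box format (normalized cxcywh), and the omission of the GIoU term are simplifying assumptions for illustration, not the reference implementation:

```python
# Sketch of DETR-style matching: cost = -p_hat(c_i) + L1(b_i, b_hat), solved
# with the Hungarian algorithm (GIoU cost term omitted here for brevity).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    # pred_logits: (N, num_classes+1), pred_boxes: (N, 4) normalized cxcywh
    # tgt_labels: (K,) int64, tgt_boxes: (K, 4)
    prob = pred_logits.softmax(-1)                      # class probabilities
    cost_class = -prob[:, tgt_labels]                   # (N, K) classification cost
    cost_l1 = torch.cdist(pred_boxes, tgt_boxes, p=1)   # (N, K) pairwise L1 box cost
    cost = cost_class + cost_l1
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, tgt_idx                            # matched prediction / target indices
```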
This procedure of finding matching plays the same role as the heuristic assignment rules used to match proposal [ 37 ] or anchors [ 22 ] to ground truth objects in modern detectors. The main difference is that we need to find one-to-one matching for direct set prediction without duplicates.
The second step is to compute the loss function, the Hungarian loss for all pairs matched in the previous step. We define the loss similarly to the losses of common object detectors, i.e. a linear combination of a negative log-likelihood for class prediction and a box loss defined later:
$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}(c_{i}) + \mathbb{1}_{\{c_{i} \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_{i}, \hat{b}_{\hat{\sigma}(i)}\big)\right], \tag{2}$$
【Analysis】This formula spells out how the Hungarian loss is computed. $\hat{\sigma}$ is the optimal assignment obtained in the first step; it says that the $i$-th ground-truth object is paired with prediction $\hat{\sigma}(i)$. The loss sums over all $N$ positions, each contributing two terms. The first term, $-\log \hat{p}_{\hat{\sigma}(i)}(c_{i})$, is the classification loss: it pushes the matched prediction to assign a high probability to the true class $c_{i}$. The second term is the box loss $\mathcal{L}_{\mathrm{box}}(b_{i}, \hat{b}_{\hat{\sigma}(i)})$, which penalizes the discrepancy between the predicted and the ground-truth box. The indicator $\mathbb{1}_{\{c_{i} \neq \varnothing\}}$ ensures the box loss is computed only when the ground-truth slot is a real object; padded "no object" slots carry no box information and need no regression. This design lets a single loss handle both the object and no-object cases correctly.
where $\hat{\sigma}$ is the optimal assignment computed in the first step (1). In practice, we down-weight the log-probability term when $c_{i} = \varnothing$ by a factor 10 to account for class imbalance. This is analogous to how the Faster R-CNN training procedure balances positive/negative proposals by subsampling [ 37 ]. Notice that the matching cost between an object and $\varnothing$ doesn't depend on the prediction, which means that in that case the cost is a constant. In the matching cost we use probabilities $\hat{p}_{\hat{\sigma}(i)}(c_{i})$ instead of log-probabilities. This makes the class prediction term commensurable to $\mathcal{L}_{\mathrm{box}}(\cdot,\cdot)$ (described below), and we observed better empirical performances.
【Analysis】This passage covers two practical details of training. First, class imbalance: most of the $N$ slots correspond to background ($c_i = \varnothing$), and without countermeasures the background term would dominate the loss and hinder learning on real objects. DETR therefore down-weights the log-probability term of the no-object class by a factor of 10, in the same spirit as Faster R-CNN's subsampling of positive and negative proposals. Second, the matching cost and the final loss differ in one respect: the matching stage uses raw probabilities rather than log-probabilities, so that the classification term and the box term live on comparable scales and neither dominates the other. Finally, the remark that the cost of matching $\varnothing$ does not depend on the prediction simply means that for padded slots the cost is a constant, which is sensible since there is nothing concrete to evaluate.
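A minimal sketch of the classification part of this loss with the down-weighted no-object class is shown below; the class count and the placement of the no-object index at the end are assumptions for illustration:

```python
# Classification loss with the "no object" class down-weighted by a factor 10.
import torch
import torch.nn.functional as F

num_classes = 91                       # illustrative foreground class count
eos_coef = 0.1                         # weight of the no-object class (last index)
class_weights = torch.ones(num_classes + 1)
class_weights[-1] = eos_coef

def classification_loss(pred_logits, target_classes):
    # pred_logits: (N, num_classes+1); target_classes: (N,) holding the matched
    # ground-truth label for matched slots and the no-object index otherwise.
    return F.cross_entropy(pred_logits, target_classes, weight=class_weights)
```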
Bounding box loss. The second part of the matching cost and the Hungarian loss is $\mathcal{L}_{\mathrm{box}}(\cdot)$ that scores the bounding boxes. Unlike many detectors that make box predictions as a $\Delta$ w.r.t. some initial guesses, we make box predictions directly. While such an approach simplifies the implementation, it poses an issue with the relative scaling of the loss. The most commonly-used $\ell_{1}$ loss will have different scales for small and large boxes even if their relative errors are similar. To mitigate this issue we use a linear combination of the $\ell_{1}$ loss and the generalized IoU loss [ 38 ] $\mathcal{L}_{\mathrm{iou}}(\cdot,\cdot)$ that is scale-invariant. Overall, our box loss $\mathcal{L}_{\mathrm{box}}\big(b_{i}, \hat{b}_{\sigma(i)}\big)$ is defined as $\lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\big(b_{i}, \hat{b}_{\sigma(i)}\big) + \lambda_{\mathrm{L1}}\, \|b_{i} - \hat{b}_{\sigma(i)}\|_{1}$, where $\lambda_{\mathrm{iou}}, \lambda_{\mathrm{L1}} \in \mathbb{R}$ are hyperparameters. These two losses are normalized by the number of objects inside the batch.
【Analysis】DETR's box loss follows a direct regression strategy. Traditional detectors usually predict indirectly: they define candidate boxes or anchors and regress offsets $\Delta$ relative to them, which injects prior knowledge but adds design complexity. DETR predicts absolute box coordinates directly, which is simpler and avoids anchor design and offset encoding/decoding, but it exposes a classic scale-sensitivity problem of box regression: the $\ell_1$ loss reacts differently to boxes of different sizes. For a large object a few pixels of error already produces a large $\ell_1$ value, while for a small object the same pixel error gives a small loss even though the relative error is more severe; left unchecked, this biases training towards large objects. The generalized IoU loss is introduced precisely because IoU is a ratio: it measures the relative overlap between predicted and ground-truth boxes rather than absolute pixel differences, and is therefore scale-invariant. Combining $\ell_1$ and GIoU linearly keeps the precision of $\ell_1$ for exact localization while adding the scale invariance of IoU, so performance stays more consistent across object sizes. The weights $\lambda_{\mathrm{iou}}$ and $\lambda_{\mathrm{L1}}$ balance the two terms and are set experimentally. The final normalization by the number of objects in the batch stabilizes training, since the raw sum would otherwise fluctuate with how many objects each batch happens to contain.
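A minimal sketch of this combined box loss is given below, assuming the boxes are already matched pairs converted to absolute xyxy format and a recent torchvision is available; the weight values are illustrative:

```python
# Box loss: lambda_iou * GIoU loss + lambda_L1 * L1, normalized by object count.
import torch
from torchvision.ops import generalized_box_iou

def box_loss(pred_xyxy, tgt_xyxy, lambda_iou=2.0, lambda_l1=5.0):
    num_boxes = max(len(tgt_xyxy), 1)
    l1 = torch.nn.functional.l1_loss(pred_xyxy, tgt_xyxy, reduction="sum")
    giou = torch.diag(generalized_box_iou(pred_xyxy, tgt_xyxy))  # matched (diagonal) pairs
    loss_giou = (1 - giou).sum()
    return (lambda_iou * loss_giou + lambda_l1 * l1) / num_boxes
```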
3.2 DETR architecture
The overall DETR architecture is surprisingly simple and depicted in Figure 2 . It contains three main components, which we describe below: a CNN backbone to extract a compact feature representation, an encoder-decoder transformer, and a simple feed forward network (FFN) that makes the final detection prediction.
Unlike many modern detectors, DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer architecture implementation with just a few hundred lines. Inference code for DETR can be implemented in less than 50 lines in PyTorch [ 32 ]. We hope that the simplicity of our method will attract new researchers to the detection community.
Backbone. Starting from the initial image $x_{\mathrm{img}} \in \mathbb{R}^{3 \times H_{0} \times W_{0}}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. Typical values we use are $C = 2048$ and $H, W = \frac{H_{0}}{32}, \frac{W_{0}}{32}$.
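A minimal sketch of such a backbone (a torchvision ResNet-50 with the pooling and classification head removed; the paper uses ImageNet-pretrained weights, omitted here) illustrates the stride-32, 2048-channel feature map:

```python
# ResNet-50 backbone producing a C=2048, H0/32 x W0/32 activation map.
import torch
from torch import nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])  # drop avgpool + fc
x_img = torch.randn(1, 3, 800, 1066)     # one image, 3 color channels
f = backbone(x_img)                      # (1, 2048, 25, 34): stride-32 feature map
print(f.shape)
```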
Transformer encoder. First, a 1x1 convolution reduces the channel dimension of the high-level activation map $f$ from $C$ to a smaller dimension $d$, creating a new feature map $z_{0} \in \mathbb{R}^{d \times H \times W}$. The encoder expects a sequence as input, so we collapse the spatial dimensions of $z_{0}$ into one dimension, resulting in a $d \times HW$ feature map. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [ 31 , 3 ] that are added to the input of each attention layer. We defer to the supplementary material the detailed definition of the architecture, which follows the one described in [ 47 ].
【Analysis】This step adapts the CNN features to the transformer. The $1\times1$ convolution reduces the 2048-dimensional features to a smaller dimension $d$ (typically 256), which cuts computation and memory and also acts as a learned feature selection. Reshaping the spatial dimensions is necessary because the transformer expects a sequence: the $H \times W$ grid is flattened into a sequence of length $HW$, where each position's $d$-dimensional feature vector becomes one token. The encoder then follows the standard design: multi-head self-attention captures relations between any two positions, which matters in detection because features at distant locations can be strongly related, and the feed-forward network adds non-linear capacity. Positional encodings compensate for the permutation invariance of self-attention: spatial location is essential in images, and the fixed encodings let the model distinguish tokens at different positions and preserve the spatial structure.
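The following sketch shows the projection, flattening and encoding steps with PyTorch's standard transformer encoder; for brevity it uses a learned positional embedding added once at the input, whereas the paper uses fixed sine encodings added at every attention layer:

```python
# 1x1 projection to d=256, flatten HxW into a sequence, add positional encoding,
# then run a standard 6-layer transformer encoder.
import torch
from torch import nn

C, d, H, W = 2048, 256, 25, 34
input_proj = nn.Conv2d(C, d, kernel_size=1)
pos_embed = nn.Parameter(torch.rand(H * W, 1, d))   # learned here; fixed sine in the paper
encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

f = torch.randn(1, C, H, W)                  # backbone output
z0 = input_proj(f)                           # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)         # (HW, batch, d) token sequence
memory = encoder(src + pos_embed)            # (HW, 1, d) encoder output
```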
Fig. 2: DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries , and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a " no object " class.
Transformer decoder. The decoder follows the standard architecture of the transformer, transforming $N$ embeddings of size $d$ using multi-headed self- and encoder-decoder attention mechanisms. The difference with the original transformer is that our model decodes the $N$ objects in parallel at each decoder layer, while Vaswani et al. [ 47 ] use an autoregressive model that predicts the output sequence one element at a time. We refer the reader unfamiliar with the concepts to the supplementary material. Since the decoder is also permutation-invariant, the $N$ input embeddings must be different to produce different results. These input embeddings are learnt positional encodings that we refer to as object queries, and similarly to the encoder, we add them to the input of each attention layer. The $N$ object queries are transformed into an output embedding by the decoder. They are then independently decoded into box coordinates and class labels by a feed forward network (described in the next subsection), resulting in $N$ final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.
【Analysis】The decoder design marks the shift from sequence generation to set prediction. A standard transformer decoder is autoregressive, which is natural for text where each token depends on the previous ones, but the objects in an image have no natural order and are better treated as an unordered set. DETR therefore decodes all $N$ objects in parallel, which is not only faster but also avoids imposing an artificial ordering. The object queries can be viewed as $N$ learned "probes", each responsible for attending to and describing one object; during training they specialize to different regions, scales, or kinds of objects. Because the decoder itself is permutation-invariant, identical queries would yield identical outputs, so each query must have a distinct representation, which emerges naturally from random initialization and gradient updates. The attention mechanisms play complementary roles: self-attention lets the queries exchange information, which helps handle inter-object relations such as suppressing duplicates or reasoning about occlusion, while encoder-decoder attention lets each query focus on the relevant regions of the image features, grounding it in the global context. Finally, each query's output embedding is independently decoded by the shared feed-forward network into a box and a class distribution.
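A minimal sketch of parallel decoding with learned object queries is shown below, using PyTorch's standard decoder; the paper's decoder additionally re-adds the query embeddings inside every attention layer, which this sketch skips:

```python
# Parallel decoding of N object queries against the encoder output ("memory").
import torch
from torch import nn

d, N = 256, 100
decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
object_queries = nn.Parameter(torch.rand(N, 1, d))   # N learned query embeddings

memory = torch.randn(25 * 34, 1, d)                  # encoder output (HW, batch, d)
tgt = torch.zeros(N, 1, d) + object_queries          # decoder input
hs = decoder(tgt, memory)                            # (N, 1, d): one embedding per query
```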
Prediction feed-forward networks (FFNs). The final prediction is computed by a 3-layer perceptron with ReLU activation function and hidden dimension $d$, and a linear projection layer. The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function. Since we predict a fixed-size set of $N$ bounding boxes, where $N$ is usually much larger than the actual number of objects of interest in an image, an additional special class label $\varnothing$ is used to represent that no object is detected within a slot. This class plays a similar role to the "background" class in the standard object detection approaches.
【Analysis】The FFNs are DETR's output heads, converting the decoder's abstract embeddings into concrete detections. The 3-layer MLP provides enough non-linearity to extract box geometry from the high-dimensional features. A key choice is coordinate normalization: center coordinates, height and width are all mapped to $[0,1]$, which eases optimization and makes the prediction independent of image size. The class head uses a softmax so that the class probabilities sum to 1, the standard choice for multi-class prediction. A distinctive property of DETR is the fixed output size: regardless of how many objects an image contains, the network always emits $N$ predictions, which simplifies the architecture and removes proposal generation and filtering. The price is that when the image contains fewer than $N$ objects, the surplus slots must be handled somehow; DETR does so with the special "no object" class, which plays the same role as the background class in conventional detectors.
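A minimal sketch of these heads, with the last class index standing in for "no object" (the class count here is illustrative):

```python
# Prediction heads: 3-layer MLP for normalized cxcywh boxes, linear class head.
import torch
from torch import nn

d, N, num_classes = 256, 100, 91
bbox_head = nn.Sequential(
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)
class_head = nn.Linear(d, num_classes + 1)    # +1 logit for the "no object" class

hs = torch.randn(N, 1, d)                     # decoder output embeddings
boxes = bbox_head(hs).sigmoid()               # (N, 1, 4) boxes in [0, 1]
probs = class_head(hs).softmax(-1)            # (N, 1, num_classes+1) class probabilities
```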
Auxiliary decoding losses. We found it helpful to use auxiliary losses [ 1 ] in the decoder during training, especially to help the model output the correct number of objects of each class. We add prediction FFNs and the Hungarian loss after each decoder layer. All prediction FFNs share their parameters. We use an additional shared layer-norm to normalize the input to the prediction FFNs from different decoder layers.
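A rough sketch of this idea, reusing names assumed from the sketches above: the shared heads are applied to the normalized output of every decoder layer, and a Hungarian loss would then be computed on each per-layer prediction and summed.

```python
# Collect predictions from every decoder layer for auxiliary supervision.
def per_layer_predictions(tgt, memory, decoder_layers, shared_norm, class_head, bbox_head):
    out, preds = tgt, []
    for layer in decoder_layers:
        out = layer(out, memory)                         # one decoder layer
        h = shared_norm(out)                             # shared layer-norm before the heads
        preds.append((class_head(h), bbox_head(h).sigmoid()))
    return preds                                         # one (logits, boxes) pair per layer
```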
We show that DETR achieves competitive results compared to Faster R-CNN in quantitative evaluation on COCO. Then, we provide a detailed ablation study of the architecture and loss, with insights and qualitative results. Finally, to show that DETR is a versatile and extensible model, we present results on panoptic segmentation, training only a small extension on a fixed DETR model. We provide code and pretrained models to reproduce our experiments at https://github.com/facebookresearch/detr .
Dataset. We perform experiments on COCO 2017 detection and panoptic segmentation datasets [ 24 , 18 ], containing 118k training images and 5k validation images. Each image is annotated with bounding boxes and panoptic segmentation. There are 7 instances per image on average, up to 63 instances in a single image in training set, ranging from small to large on the same images. If not specified, we report AP as bbox AP, the integral metric over multiple thresholds. For comparison with Faster R-CNN we report validation AP at the last training epoch, for ablations we report median over validation results from the last 10 epochs.
Technical details. We train DETR with AdamW [ 26 ], setting the initial transformer's learning rate to $10^{-4}$, the backbone's to $10^{-5}$, and weight decay to $10^{-4}$. All transformer weights are initialized with Xavier init [ 11 ], and the backbone is initialized with an ImageNet-pretrained ResNet model [ 15 ] from torchvision with frozen batchnorm layers. We report results with two different backbones: a ResNet-50 and a ResNet-101. The corresponding models are called respectively DETR and DETR-R101. Following [ 21 ], we also increase the feature resolution by adding a dilation to the last stage of the backbone and removing a stride from the first convolution of this stage. The corresponding models are called respectively DETR-DC5 and DETR-DC5-R101 (dilated C5 stage). This modification increases the resolution by a factor of two, thus improving performance for small objects, at the cost of a 16x higher cost in the self-attentions of the encoder, leading to an overall 2x increase in computational cost. A full comparison of FLOPs of these models and Faster R-CNN is given in Table 1.
We use scale augmentation, resizing the input images such that the shortest side is at least 480 and at most 800 pixels while the longest at most 1333 [ 50 ]. To help learning global relationships through the self-attention of the encoder, we also apply random crop augmentations during training, improving the performance by approximately 1 AP. Specifically, a train image is cropped with probability 0.5 to a random rectangular patch which is then resized again to 800-1333. The transformer is trained with default dropout of 0.1. At inference time, some slots predict empty class. To optimize for AP, we override the prediction of these slots with the second highest scoring class, using the corresponding confidence. This improves AP by 2 points compared to filtering out empty slots. Other training hyperparameters can be found in section A.4 . For our ablation experiments we use training schedule of 300 epochs with a learning rate drop by a factor of 10 after 200 epochs, where a single epoch is a pass over all training images once. Training the baseline model for 300 epochs on 16 V100 GPUs takes 3 days, with 4 images per GPU (hence a total batch size of 64). For the longer schedule used to compare with Faster R-CNN we train for 500 epochs with learning rate drop after 400 epochs. This schedule adds 1.5 AP compared to the shorter schedule.
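As a small illustration of the optimizer setup described above, here is a minimal sketch with a toy two-part model standing in for the real backbone and transformer:

```python
# AdamW with a lower learning rate on the backbone and a step LR drop.
import torch
from torch import nn

# Toy stand-in for a model with a `backbone` submodule and the rest of DETR.
model = nn.ModuleDict({"backbone": nn.Conv2d(3, 8, 1), "transformer": nn.Linear(8, 8)})

backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]
optimizer = torch.optim.AdamW(
    [{"params": backbone_params, "lr": 1e-5}, {"params": other_params, "lr": 1e-4}],
    weight_decay=1e-4,
)
# Drop the learning rate by a factor of 10 after 200 epochs (ablation schedule).
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200)
```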
Table 1: Comparison with Faster R-CNN with a ResNet-50 and ResNet-101 backbones on the COCO validation set. The top section shows results for Faster R-CNN models in Detectron2 [ 50 ], the middle section shows results for Faster R-CNN models with GIoU [ 38 ], random crops train-time augmentation, and the long 9x training schedule. DETR models achieve comparable results to heavily tuned Faster R-CNN baselines, having lower AP$_S$ but greatly improved AP$_L$. We use torchscript Faster R-CNN and DETR models to measure FLOPS and FPS. Results without R101 in the name correspond to ResNet-50.
Transformers are typically trained with Adam or Adagrad optimizers with very long training schedules and dropout, and this is true for DETR as well. Faster R-CNN, however, is trained with SGD with minimal data augmentation and we are not aware of successful applications of Adam or dropout. Despite these differences we attempt to make a Faster R-CNN baseline stronger. To align it with DETR, we add generalized IoU [ 38 ] to the box loss, the same random crop augmentation and long training known to improve results [ 13 ]. Results are presented in Table 1. In the top section we show Faster R-CNN results from Detectron2 Model Zoo [ 50 ] for models trained with the 3x schedule. In the middle section we show results (with a "+") for the same models but trained with the 9x schedule (109 epochs) and the described enhancements, which in total adds 1-2 AP. In the last section of Table 1 we show the results for multiple DETR models. To be comparable in the number of parameters we choose a model with 6 transformer and 6 decoder layers of width 256 with 8 attention heads. Like Faster R-CNN with FPN this model has 41.3M parameters, out of which 23.5M are in ResNet-50, and 17.8M are in the transformer. Even though both Faster R-CNN and DETR are still likely to further improve with longer training, we can conclude that DETR can be competitive with Faster R-CNN with the same number of parameters, achieving 42 AP on the COCO val subset. The way DETR achieves this is by improving AP$_L$ (+7.8), however note that the model is still lagging behind in AP$_S$ (-5.5). DETR-DC5 with the same number of parameters and similar FLOP count has higher AP, but is still significantly behind in AP$_S$ too. Faster R-CNN and DETR with ResNet-101 backbone show comparable results as well.
Table 2: Effect of encoder size. Each row corresponds to a model with varied number of encoder layers and fixed number of decoder layers. Performance gradually improves with more encoder layers.
Attention mechanisms in the transformer decoder are the key components which model relations between feature representations of different detections. In our ablation analysis, we explore how other components of our architecture and loss influence the final performance. For the study we choose ResNet-50-based DETR model with 6 encoder, 6 decoder layers and width 256. The model has 41.3M parameters, achieves 40.6 and 42.0 AP on short and long schedules respectively, and runs at 28 FPS, similarly to Faster R-CNN-FPN with the same backbone.
Number of encoder layers. We evaluate the importance of global image-level self-attention by changing the number of encoder layers (Table 2). Without encoder layers, overall AP drops by 3.9 points, with a more significant drop of 6.0 AP on large objects. We hypothesize that, by using global scene reasoning, the encoder is important for disentangling objects. In Figure 3, we visualize the attention maps of the last encoder layer of a trained model, focusing on a few points in the image. The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.
Number of decoder layers. We apply auxiliary losses after each decoding layer (see Section 3.2), hence, the prediction FFNs are trained by design to predict objects out of the outputs of every decoder layer. We analyze the importance of each decoder layer by evaluating the objects that would be predicted at each stage of the decoding (Fig. 4). Both AP and AP$_{50}$ improve after every layer, totalling into a very significant +8.2/9.5 AP improvement between the first and the last layer. With its set-based loss, DETR does not need NMS by design. To verify this we run a standard NMS procedure with default parameters [ 50 ] for the outputs after each decoder. NMS improves performance for the predictions from the first decoder. This can be explained by the fact that a single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object. In the second and subsequent layers, the self-attention mechanism over the activations allows the model to inhibit duplicate predictions. We observe that the improvement brought by NMS diminishes as depth increases. At the last layers, we observe a small loss in AP as NMS incorrectly removes true positive predictions.
Fig. 3: Encoder self-attention for a set of reference points. The encoder is able to separate individual instances. Predictions are made with baseline DETR model on a validation set image.
Similarly to visualizing encoder attention, we visualize decoder attentions in Fig. 6 , coloring attention maps for each predicted object in different colors. We observe that decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs. We hypothesise that after the encoder has separated instances via global attention, the decoder only needs to attend to the extremities to extract the class and object boundaries.
Importance of FFN. The FFN inside transformers can be seen as $1\times1$ convolutional layers, making the encoder similar to attention augmented convolutional networks [ 3 ]. We attempt to remove it completely, leaving only attention in the transformer layers. By reducing the number of network parameters from 41.3M to 28.7M, leaving only 10.8M in the transformer, performance drops by 2.3 AP; we thus conclude that the FFN is important for achieving good results.
Importance of positional encodings. There are two kinds of positional encodings in our model: spatial positional encodings and output positional encodings (object queries). We experiment with various combinations of fixed and learned encodings, results can be found in table 3 . Output positional encodings are required and cannot be removed, so we experiment with either passing them once at decoder input or adding to queries at every decoder attention layer. In the first experiment we completely remove spatial positional encodings and pass output positional encodings at input and, interestingly, the model still achieves more than 32 AP, losing 7.8 AP to the baseline. Then, we pass fixed sine spatial positional encodings and the output encodings at input once, as in the original transformer [ 47 ], and find that this leads to 1.4 AP drop compared to passing the positional encodings directly in attention. Learned spatial encodings passed to the attentions give similar results. Surprisingly, we find that not passing any spatial encodings in the encoder only leads to a minor AP drop of 1.3 AP. When we pass the encodings to the attentions, they are shared across all layers, and the output encodings (object queries) are always learned.
Fig. 4: AP and AP$_{50}$ performance after each decoder layer. A single long schedule baseline model is evaluated. DETR does not need NMS by design, which is validated by this figure. NMS lowers AP in the final layers, removing TP predictions, but improves AP in the first decoder layers, removing double predictions, as there is no communication in the first layer, and slightly improves AP$_{50}$.
Fig. 5: Out of distribution generalization for rare classes. Even though no image in the training set has more than 13 giraffes, DETR has no difficulty generalizing to 24 and more instances of the same class.
Given these ablations, we conclude that transformer components: the global self-attention in encoder, FFN, multiple decoder layers, and positional encodings, all significantly contribute to the final object detection performance.
Loss ablations. To evaluate the importance of different components of the matching cost and the loss, we train several models turning them on and off. There are three components to the loss: classification loss, $\ell_{1}$ bounding box distance loss, and GIoU [ 38 ] loss. The classification loss is essential for training and cannot be turned off, so we train a model without bounding box distance loss, and a model without the GIoU loss, and compare with the baseline, trained with all three losses. Results are presented in Table 4. GIoU loss on its own accounts for most of the model performance, losing only 0.7 AP to the baseline with combined losses. Using $\ell_{1}$ without GIoU shows poor results. We only studied simple ablations of different losses (using the same weighting every time), but other means of combining them may achieve different results.
Table 3: Results for different positional encodings compared to the baseline (last row), which has fixed sine pos. encodings passed at every attention layer in both the encoder and the decoder. Learned embeddings are shared between all layers. Not using spatial positional encodings leads to a significant drop in AP. Interestingly, passing them in decoder only leads to a minor AP drop. All these models use learned output positional encodings.
Fig. 6: Visualizing decoder attention for every predicted object (images from COCO val set). Predictions are made with DETR-DC5 model. Attention scores are coded with different colors for different objects. Decoder typically attends to object extremities, such as legs and heads. Best viewed in color.
Table 4: Effect of loss components on AP. We train two models turning off $\ell_{1}$ loss, and GIoU loss, and observe that $\ell_{1}$ gives poor results on its own, but when combined with GIoU improves AP$_M$ and AP$_L$. Our baseline (last row) combines both losses.
Fig. 7: Visualization of all box predictions on all images from COCO 2017 val set for 20 out of total $N=100$ prediction slots in the DETR decoder. Each box prediction is represented as a point with the coordinates of its center in the 1-by-1 square normalized by each image size. The points are color-coded so that green color corresponds to small boxes, red to large horizontal boxes and blue to large vertical boxes. We observe that each slot learns to specialize on certain areas and box sizes with several operating modes. We note that almost all slots have a mode of predicting large image-wide boxes that are common in the COCO dataset.
4.3 Analysis
Decoder output slot analysis In Fig. 7 we visualize the boxes predicted by different slots for all images in COCO 2017 val set. DETR learns different specialization for each query slot. We observe that each slot has several modes of operation focusing on different areas and box sizes. In particular, all slots have the mode for predicting image-wide boxes (visible as the red dots aligned in the middle of the plot). We hypothesize that this is related to the distribution of objects in COCO.
Generalization to unseen numbers of instances. Some classes in COCO are not well represented with many instances of the same class in the same image. For example, there is no image with more than 13 giraffes in the training set. We create a synthetic image to verify the generalization ability of DETR (see Figure 5). Our model is able to find all 24 giraffes on the image, which is clearly out of distribution. This experiment confirms that there is no strong class-specialization in each object query.
Panoptic segmentation [ 19 ] has recently attracted a lot of attention from the computer vision community. Similarly to the extension of Faster R-CNN [ 37 ] to Mask R-CNN [ 14 ], DETR can be naturally extended by adding a mask head on top of the decoder outputs. In this section we demonstrate that such a head can be used to produce panoptic segmentation [ 19 ] by treating stuff and thing classes in a unified way. We perform our experiments on the panoptic annotations of the COCO dataset that has 53 stuff categories in addition to 80 things categories.
Fig. 8: Illustration of the panoptic head. A binary mask is generated in parallel for each detected object, then the masks are merged using pixel-wise argmax.
Fig. 9: Qualitative results for panoptic segmentation generated by DETR-R101. DETR produces aligned mask predictions in a unified manner for things and stuff.
We train DETR to predict boxes around both stuff and things classes on COCO, using the same recipe. Predicting boxes is required for the training to be possible, since the Hungarian matching is computed using distances between boxes. We also add a mask head which predicts a binary mask for each of the predicted boxes, see Figure 8. It takes as input the output of the transformer decoder for each object and computes multi-head (with $M$ heads) attention scores of this embedding over the output of the encoder, generating $M$ attention heatmaps per object at a small resolution. To make the final prediction and increase the resolution, an FPN-like architecture is used. We describe the architecture in more detail in the supplement. The final resolution of the masks has stride 4 and each mask is supervised independently using the DICE/F-1 loss [ 28 ] and Focal loss [ 23 ].
The mask head can be trained either jointly, or in a two-step process, where we train DETR for boxes only, then freeze all the weights and train only the mask head for 25 epochs. Experimentally, these two approaches give similar results; we report results using the latter method since it results in a shorter total wall-clock training time.
Table 5: Comparison with the state-of-the-art methods UPSNet [ 51 ] and Panoptic FPN [ 18 ] on the COCO val dataset. We retrained PanopticFPN with the same data augmentation as DETR, on an 18x schedule for a fair comparison. UPSNet uses the 1x schedule, UPSNet-M is the version with multiscale test-time augmentations.
To predict the final panoptic segmentation we simply use an argmax over the mask scores at each pixel, and assign the corresponding categories to the resulting masks. This procedure guarantees that the final masks have no overlaps and, therefore, DETR does not require a heuristic [ 19 ] that is often used to align different masks.
Training details. We train DETR, DETR-DC5 and DETR-R101 models following the recipe for bounding box detection to predict boxes around stuff and things classes in the COCO dataset. The new mask head is trained for 25 epochs (see supplementary for details). During inference we first filter out the detections with a confidence below 85%, then compute the per-pixel argmax to determine to which mask each pixel belongs. We then collapse different mask predictions of the same stuff category into one, and filter out the empty ones (less than 4 pixels).
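A minimal sketch of this merging step is shown below; the tensor names and shapes are assumptions for illustration, but the logic (confidence filtering, then a pixel-wise argmax so the final masks cannot overlap) follows the procedure described above:

```python
# Panoptic merging: keep confident detections, then per-pixel argmax over masks.
import torch

def merge_panoptic(mask_logits, scores, labels, conf_thresh=0.85):
    # mask_logits: (N, H, W) per-detection mask scores; scores/labels: (N,)
    keep = scores > conf_thresh
    mask_logits, labels = mask_logits[keep], labels[keep]
    assignment = mask_logits.argmax(dim=0)      # (H, W): winning detection per pixel
    seg = labels[assignment]                    # (H, W): per-pixel category, no overlaps
    return seg, assignment
```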
Main results. Qualitative results are shown in Figure 9. In Table 5 we compare our unified panoptic segmentation approach with several established methods that treat things and stuff differently. We report the Panoptic Quality (PQ) and the break-down on things (PQ$^{\mathrm{th}}$) and stuff (PQ$^{\mathrm{st}}$). We also report the mask AP (computed on the things classes), before any panoptic post-treatment (in our case, before taking the pixel-wise argmax). We show that DETR outperforms published results on COCO-val 2017, as well as our strong PanopticFPN baseline (trained with the same data-augmentation as DETR, for a fair comparison). The result break-down shows that DETR is especially dominant on stuff classes, and we hypothesize that the global reasoning allowed by the encoder attention is the key element to this result. For the things class, despite a severe deficit of up to 8 mAP compared to the baselines on the mask AP computation, DETR obtains competitive PQ$^{\mathrm{th}}$. We also evaluated our method on the test set of the COCO dataset, and obtained 46 PQ. We hope that our approach will inspire the exploration of fully unified models for panoptic segmentation in future work.
5 Conclusion
We presented DETR, a new design for object detection systems based on transformers and bipartite matching loss for direct set prediction. The approach achieves comparable results to an optimized Faster R-CNN baseline on the challenging COCO dataset. DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. In addition, it achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the processing of global information performed by the self-attention.
This new design for detectors also comes with new challenges, in particular regarding training, optimization and performances on small objects. Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR.
Since our model is based on the Transformer architecture, we remind here the general form of attention mechanisms we use for exhaustivity. The attention mechanism follows [ 47 ], except for the details of positional encodings (see Equation 8 ) that follows [ 7 ].
【Analysis】The two references here are [47], the original Transformer paper "Attention is All You Need", and [7], which concerns the positional encoding variant used. DETR keeps the standard attention computation of the Transformer but handles positional encodings in its own way, which is necessary for two-dimensional image data.
Multi-head The general form of multi-head attention with $M$ heads of dimension $d$ is a function with the following signature (using $d' = \frac{d}{M}$, and giving matrix/tensor sizes in underbrace):
$$\mathrm{mh\text{-}attn}:\ \underbrace{X_{\mathrm{q}}}_{d \times N_{\mathrm{q}}},\ \underbrace{X_{\mathrm{kv}}}_{d \times N_{\mathrm{kv}}},\ \underbrace{T}_{M \times 3 \times d' \times d},\ \underbrace{L}_{d \times d}\ \mapsto\ \underbrace{\tilde{X}_{\mathrm{q}}}_{d \times N_{\mathrm{q}}}$$
where $X_{\mathrm{q}}$ is the query sequence of length $N_{\mathrm{q}}$, $X_{\mathrm{kv}}$ is the key-value sequence of length $N_{\mathrm{kv}}$ (with the same number of channels $d$ for simplicity of exposition), $T$ is the weight tensor to compute the so-called query, key and value embeddings, and $L$ is a projection matrix. The output is the same size as the query sequence. To fix the vocabulary before giving details, multi-head self-attention (mh-s-attn) is the special case $X_{\mathrm{q}} = X_{\mathrm{kv}}$, i.e.
$$\mathrm{mh\text{-}s\text{-}attn}(X, T, L) = \mathrm{mh\text{-}attn}(X, X, T, L).$$
【Analysis】This defines the general form of multi-head attention. $X_{\mathrm{q}}$ is the query sequence (in DETR, either the object queries or the image features) and $X_{\mathrm{kv}}$ the key-value sequence (the image feature representation). The shape $M \times 3 \times d' \times d$ of the weight tensor $T$ reflects its structure: $M$ heads, each with 3 sets of weights (for Query, Key and Value), each mapping the $d$-dimensional input to $d'$ dimensions. Since $d' = d/M$, the total parameter count stays constant; one large attention head is simply split into several smaller ones. Self-attention is the special case where the same sequence serves as both queries and keys/values, so every element can interact with every element, including itself.
The multi-head attention is simply the concatenation of $M$ single attention heads followed by a projection with $L$. The common practice [ 47 ] is to use residual connections, dropout and layer normalization. In other words, denoting $\tilde{X}_{\mathrm{q}} = \mathrm{mh\text{-}attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T, L)$ and $X'_{\mathrm{q}}$ the concatenation of attention heads, we have
$$\begin{aligned} X'_{\mathrm{q}} &= [\mathrm{attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T_{1}); \ldots; \mathrm{attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T_{M})] \\ \tilde{X}_{\mathrm{q}} &= \mathrm{layernorm}\big(X_{\mathrm{q}} + \mathrm{dropout}(L X'_{\mathrm{q}})\big), \end{aligned}$$
where $[;]$ denotes concatenation on the channel axis.
[Analysis] This is the multi-head attention pipeline. First, the $M$ heads are computed in parallel, each with its own weights $T_i$. Second, the head outputs are concatenated along the channel dimension to form $X'_{\mathrm{q}}$. Third, the linear projection $L$ maps the concatenated features back to the original dimension. The last step is the standard Transformer post-processing: dropout to reduce overfitting, a residual connection to keep gradients flowing, and layer normalization to stabilize training. This combination of parallel heads, concatenation, projection, residual connection and normalization gives the model expressive power while keeping optimization stable.
Single head An attention head with weight tensor $T' \in \mathbb{R}^{3 \times d' \times d}$, denoted by $\mathrm{attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T')$, depends on additional positional encodings $P_{\mathrm{q}} \in \mathbb{R}^{d \times N_{\mathrm{q}}}$ and $P_{\mathrm{kv}} \in \mathbb{R}^{d \times N_{\mathrm{kv}}}$. It starts by computing the so-called query, key and value embeddings after adding the query and key positional encodings [ 7 ]:
[Analysis] This introduces the computation inside a single attention head. The weight tensor $T'$ has shape $3 \times d' \times d$ because it holds three linear maps, one each for producing the query, key and value embeddings. The positional encodings $P_{\mathrm{q}}$ and $P_{\mathrm{kv}}$ are how DETR adapts attention to image data: the standard Transformer was designed for sequences, whereas image features have a 2D structure, so positional information must be injected into the representations before attention is computed in order for the model to reason about spatial relations. The way the encodings are added mirrors the original Transformer's treatment of text sequences, adapted to two spatial dimensions.
$$[Q; K; V] = [T'_1 (X_{\mathrm{q}} + P_{\mathrm{q}});\ T'_2 (X_{\mathrm{kv}} + P_{\mathrm{kv}});\ T'_3 X_{\mathrm{kv}}]$$
where $T'$ is the concatenation of $T'_1$, $T'_2$ and $T'_3$. The attention weights $\alpha$ are then computed based on the softmax of dot products between queries and keys, so that each element of the query sequence attends to all elements of the key-value sequence ($i$ is a query index and $j$ a key-value index):
[Analysis] This shows how Q, K and V are computed. Note that the value embedding does not receive a positional encoding; this is deliberate: positional information is used to decide where attention should look (through Q and K), not to modify the content being attended to (V). $T'_1$, $T'_2$ and $T'_3$ project the input features into query, key and value spaces in which similarities are easier to compute and useful information easier to extract. Intuitively, the query encodes "what I am looking for", the key "what I can offer", and the value "what I actually contain".
$$\alpha_{i,j} = \frac{e^{\frac{1}{\sqrt{d'}} Q_i^{T} K_j}}{Z_i} \quad \text{where} \quad Z_i = \sum_{j=1}^{N_{\mathrm{kv}}} e^{\frac{1}{\sqrt{d'}} Q_i^{T} K_j}.$$
[Analysis] This is the standard scaled dot-product attention. The factor $\frac{1}{\sqrt{d'}}$ keeps the dot products from becoming so large that the softmax saturates: when $d'$ is large, the variance of the dot product of two random vectors grows, which can make the softmax output extremely peaked and the gradients vanish; dividing by $\sqrt{d'}$ keeps the logits in a reasonable range. $Z_i$ is the normalization constant ensuring that, for each query position $i$, the attention weights $\alpha_{i,j}$ sum to 1 over all key-value positions $j$. The softmax implements a soft selection: rather than picking a single position, it assigns each position a weight reflecting its relevance to the current query.
In our case, the positional encodings may be learnt or fixed, but are shared across all attention layers for a given query/key-value sequence, so we do not explicitly write them as parameters of the attention. We give more details on their exact value when describing the encoder and the decoder. The final output is the aggregation of values weighted by attention weights: the $i$-th row is given by $\mathrm{attn}_i(X_{\mathrm{q}}, X_{\mathrm{kv}}, T') = \sum_{j=1}^{N_{\mathrm{kv}}} \alpha_{i,j} V_j$.
[Analysis] Sharing the positional encodings across layers simplifies the architecture: each layer does not need to learn its own encoding. The final attention output is a weighted average: for each query position $i$, the value vectors $V_j$ are summed with weights $\alpha_{i,j}$. This can be read as information aggregation: according to what the query asks for, the model extracts and combines information from all available positions, with the most relevant positions contributing most. Soft attention lets the model attend to several relevant positions at once instead of committing to a single one.
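To make the equations above concrete, here is a minimal PyTorch sketch of a single attention head (positional encodings added to queries and keys but not values, scaled dot-product softmax, weighted sum of values) and of the multi-head wrapper (concatenation of the $M$ heads, projection $L$, dropout, residual connection and layer normalization). The sequence-first (N, d) layout, the random initialization and the function names are assumptions of this sketch; DETR itself relies on PyTorch's standard transformer modules.

import math
import torch
from torch import nn
import torch.nn.functional as F

def single_head_attn(X_q, X_kv, T_prime, P_q, P_kv):
    """One attention head. X_q: (N_q, d), X_kv: (N_kv, d), T_prime: (3, d', d),
    P_q/P_kv: positional encodings with the same shapes as X_q/X_kv."""
    T1, T2, T3 = T_prime                        # each (d', d)
    Q = (X_q + P_q) @ T1.t()                    # (N_q, d')  queries, with positions
    K = (X_kv + P_kv) @ T2.t()                  # (N_kv, d') keys, with positions
    V = X_kv @ T3.t()                           # (N_kv, d') values, no positions
    alpha = F.softmax(Q @ K.t() / math.sqrt(Q.shape[-1]), dim=-1)  # (N_q, N_kv)
    return alpha @ V                            # (N_q, d')

class MultiHeadAttention(nn.Module):
    """Concatenate M heads, project with L, then dropout + residual + layernorm."""
    def __init__(self, d, M, dropout=0.1):
        super().__init__()
        assert d % M == 0
        self.T = nn.Parameter(torch.randn(M, 3, d // M, d) / math.sqrt(d))
        self.L = nn.Linear(d, d)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d)

    def forward(self, X_q, X_kv, P_q, P_kv):
        heads = [single_head_attn(X_q, X_kv, T_m, P_q, P_kv) for T_m in self.T]
        X_q_prime = torch.cat(heads, dim=-1)    # (N_q, d)
        return self.norm(X_q + self.dropout(self.L(X_q_prime)))

# Toy shapes: 100 object queries (initialized to zero, like DETR's decoder input)
# attending to 850 image features of dimension d = 256 with M = 8 heads.
d, M, N_q, N_kv = 256, 8, 100, 850
mha = MultiHeadAttention(d, M)
X_q, X_kv = torch.zeros(N_q, d), torch.randn(N_kv, d)
P_q, P_kv = torch.randn(N_q, d), torch.randn(N_kv, d)
out = mha(X_q, X_kv, P_q, P_kv)                 # (100, 256)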
Feed-forward network (FFN) layers The original transformer alternates multi-head attention and so-called FFN layers [ 47 ], which are effectively multilayer 1x1 convolutions, with $Md$ input and output channels in our case. The FFN we consider is composed of two layers of 1x1 convolutions with ReLU activations. There is also a residual connection/dropout/layernorm after the two layers, similarly to equation 6.
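As a sketch of this FFN block: two per-position linear layers (equivalent to 1x1 convolutions over the sequence) with a ReLU in between, followed by dropout, a residual connection and layer normalization. The hidden width and the dropout rate are illustrative assumptions of this sketch.

import torch
from torch import nn

class FFN(nn.Module):
    """Two per-position linear layers with ReLU, then dropout + residual + layernorm."""
    def __init__(self, d, d_hidden=2048, dropout=0.1):  # d_hidden is an assumed width
        super().__init__()
        self.linear1 = nn.Linear(d, d_hidden)
        self.linear2 = nn.Linear(d_hidden, d)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                        # x: (N, d)
        y = self.linear2(torch.relu(self.linear1(x)))
        return self.norm(x + self.dropout(y))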
For completeness, we present in detail the losses used in our approach. All losses are normalized by the number of objects inside the batch. Extra care must be taken for distributed training: since each GPU receives a sub-batch, it is not sufficient to normalize by the number of objects in the local batch, because the sub-batches are in general not balanced across GPUs. Instead, it is important to normalize by the total number of objects in all sub-batches.
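A minimal sketch of this normalization under torch.distributed, where the object counts of all sub-batches are summed with an all_reduce before dividing; the helper name and its usage are illustrative, not the reference implementation.

import torch
import torch.distributed as dist

def global_num_objects(num_local_objects):
    """Total number of ground-truth objects across all sub-batches.
    Each GPU only sees its own sub-batch, so the local count is not enough."""
    n = torch.as_tensor([float(num_local_objects)])
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(n)                       # sum the counts of every worker
    return torch.clamp(n, min=1).item()

# loss = per_object_losses.sum() / global_num_objects(num_local_objects)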
Box loss Similarly to [ 41 , 36 ], we use a soft version of Intersection over Union in our loss, together with an $\ell_1$ loss on $\hat{b}$:
$$\mathcal{L}_{\mathrm{box}}(b_{\sigma(i)}, \hat{b}_i) = \lambda_{\mathrm{iou}} \mathcal{L}_{\mathrm{iou}}(b_{\sigma(i)}, \hat{b}_i) + \lambda_{\mathrm{L1}} \|b_{\sigma(i)} - \hat{b}_i\|_1,$$
where $\lambda_{\mathrm{iou}}, \lambda_{\mathrm{L1}} \in \mathbb{R}$ are hyperparameters and $\mathcal{L}_{\mathrm{iou}}(\cdot)$ is the generalized IoU [ 38 ]:
$$\mathcal{L}_{\mathrm{iou}}(b_{\sigma(i)}, \hat{b}_i) = 1 - \left( \frac{|b_{\sigma(i)} \cap \hat{b}_i|}{|b_{\sigma(i)} \cup \hat{b}_i|} - \frac{|B(b_{\sigma(i)}, \hat{b}_i) \setminus b_{\sigma(i)} \cup \hat{b}_i|}{|B(b_{\sigma(i)}, \hat{b}_i)|} \right).$$
$|.|$ means "area", and the union and intersection of box coordinates are used as shorthands for the boxes themselves. The areas of unions or intersections are computed by min/max of the linear functions of $b_{\sigma(i)}$ and $\hat{b}_i$, which makes the loss sufficiently well-behaved for stochastic gradients. $B(b_{\sigma(i)}, \hat{b}_i)$ means the largest box containing $b_{\sigma(i)}, \hat{b}_i$ (the areas involving $B$ are also computed based on min/max of linear functions of the box coordinates).
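As a sketch of this box loss, using torchvision's generalized_box_iou for the GIoU term; the (x1, y1, x2, y2) box format and the reduction over matched pairs are assumptions of this sketch (the default weights are the values $\lambda_{\mathrm{L1}} = 5$ and $\lambda_{\mathrm{iou}} = 2$ reported in the training details below).

import torch
from torchvision.ops import generalized_box_iou

def box_loss(src_boxes, tgt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """Box loss for matched pairs, both of shape (num_matched, 4) in
    (x1, y1, x2, y2) format; normalization by the number of objects is done outside."""
    loss_l1 = (src_boxes - tgt_boxes).abs().sum(dim=-1)            # l1 term per pair
    giou = torch.diag(generalized_box_iou(src_boxes, tgt_boxes))   # keep matched pairs only
    return (lambda_iou * (1.0 - giou) + lambda_l1 * loss_l1).sum()

pred = torch.tensor([[0.10, 0.10, 0.50, 0.50]])
gt = torch.tensor([[0.00, 0.00, 0.40, 0.40]])
loss = box_loss(pred, gt)                                          # scalar tensor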
DICE/F-1 loss [ 28 ] The DICE coefficient is closely related to the Intersection over Union. If we denote by $\hat{m}$ the raw mask logits prediction of the model, and $m$ the binary target mask, the loss is defined as:
$$\mathcal{L}_{\mathrm{DICE}}(m, \hat{m}) = 1 - \frac{2 m \sigma(\hat{m}) + 1}{\sigma(\hat{m}) + m + 1}$$
where $\sigma$ is the sigmoid function. This loss is normalized by the number of objects.
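The DICE loss translates directly into a few lines; summing the numerator and denominator over the mask pixels is the usual reduction and is assumed here, as are the tensor shapes.

import torch

def dice_loss(mask_logits, target_masks, num_objects):
    """mask_logits, target_masks: (num_masks, H, W); target masks are binary.
    Returns the DICE loss normalized by the number of objects."""
    probs = mask_logits.sigmoid().flatten(1)        # (num_masks, H*W)
    target = target_masks.flatten(1).float()
    numerator = 2 * (probs * target).sum(dim=1) + 1
    denominator = probs.sum(dim=1) + target.sum(dim=1) + 1
    return (1 - numerator / denominator).sum() / num_objects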
A.3 Detailed architecture
The detailed description of the transformer used in DETR, with positional encodings passed at every attention layer, is given in Fig. 10 . Image features from the CNN backbone are passed through the transformer encoder, together with spatial positional encodings that are added to queries and keys at every multi-head self-attention layer. Then, the decoder receives queries (initially set to zero), output positional encodings (object queries), and encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multi-head self-attention and decoder-encoder attention layers. The first self-attention layer in the first decoder layer can be skipped.
Fig. 10: Architecture of DETR's transformer. See Section A.3 for details.
Computational complexity Every self-attention in the encoder has complexity $\mathcal{O}(d^2 HW + d(HW)^2)$: $\mathcal{O}(d'd)$ is the cost of computing a single query/key/value embedding (and $Md' = d$), while $\mathcal{O}(d'(HW)^2)$ is the cost of computing the attention weights for one head. Other computations are negligible. In the decoder, each self-attention is in $\mathcal{O}(d^2 N + dN^2)$, and cross-attention between encoder and decoder is in $\mathcal{O}(d^2(N + HW) + dNHW)$, which is much lower than the encoder since $N \ll HW$ in practice.
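As a back-of-the-envelope check of these complexities, with illustrative values $d = 256$, a $25 \times 34$ feature map (so $HW = 850$) and $N = 100$ queries; these particular sizes are assumptions, not taken from the text.

d, HW, N = 256, 25 * 34, 100              # illustrative sizes, not from the paper

enc_self = d**2 * HW + d * HW**2           # per encoder self-attention layer
dec_self = d**2 * N + d * N**2             # per decoder self-attention layer
dec_cross = d**2 * (N + HW) + d * N * HW   # per decoder cross-attention layer
print(f"encoder self-attn ~{enc_self / 1e6:.0f}M, "
      f"decoder self-attn ~{dec_self / 1e6:.0f}M, "
      f"decoder cross-attn ~{dec_cross / 1e6:.0f}M operations")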
FLOPS computation Since the FLOPS for Faster R-CNN depend on the number of proposals in the image, we report the average number of FLOPS for the first 100 images in the COCO 2017 validation set. We compute the FLOPS with the flop_count_operators tool from Detectron2 [ 50 ]. We use it without modifications for Detectron2 models, and extend it to take batch matrix multiply ( bmm ) into account for DETR models.
We train DETR using AdamW [ 26 ] with improved weight decay handling, set to $10^{-4}$. We also apply gradient clipping, with a maximal gradient norm of 0.1. The backbone and the transformer are treated slightly differently; we now discuss the details for both.
Backbone The ImageNet-pretrained ResNet-50 backbone is imported from Torchvision, discarding the last classification layer. Backbone batch normalization weights and statistics are frozen during training, following widely adopted practice in object detection. We fine-tune the backbone with a learning rate of $10^{-5}$. We observe that having the backbone learning rate roughly an order of magnitude smaller than the rest of the network is important to stabilize training, especially in the first few epochs.
Transformer We train the transformer with a learning rate of $10^{-4}$. Additive dropout of 0.1 is applied after every multi-head attention and FFN before layer normalization. The weights are randomly initialized with Xavier initialization.
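The optimizer settings above can be summarized as two parameter groups plus gradient clipping; the name-based test used to single out the backbone parameters is an assumption of this sketch.

import torch

def build_optimizer(model, lr=1e-4, lr_backbone=1e-5, weight_decay=1e-4):
    """AdamW with a 10x smaller learning rate for the backbone parameters."""
    backbone_params = [p for n, p in model.named_parameters()
                       if "backbone" in n and p.requires_grad]
    other_params = [p for n, p in model.named_parameters()
                    if "backbone" not in n and p.requires_grad]
    return torch.optim.AdamW(
        [{"params": other_params, "lr": lr},
         {"params": backbone_params, "lr": lr_backbone}],
        weight_decay=weight_decay)

# Inside the training loop, clip gradients before the optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)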
Losses We use a linear combination of $\ell_1$ and GIoU losses for bounding box regression, with weights $\lambda_{\mathrm{L1}} = 5$ and $\lambda_{\mathrm{iou}} = 2$ respectively. All models were trained with $N = 100$ decoder query slots.
Baseline Our enhanced Faster-RCNN$^+$ baselines use the GIoU [ 38 ] loss along with the standard $\ell_1$ loss for bounding box regression. We performed a grid search to find the best weights for the losses; the final models use only the GIoU loss, with weights 20 and 1 for the box and proposal regression tasks respectively. For the baselines we adopt the same data augmentation as used in DETR and train with the 9$\times$ schedule (approximately 109 epochs). All other settings are identical to the corresponding models in the Detectron2 model zoo [ 50 ].
Spatial positional encoding Encoder activations are associated with the corresponding spatial positions of the image features. In our model we use a fixed absolute encoding to represent these spatial positions. We adopt a generalization of the original Transformer [ 47 ] encoding to the 2D case [ 31 ]. Specifically, for both spatial coordinates of each embedding we independently use $\frac{d}{2}$ sine and cosine functions with different frequencies. We then concatenate them to get the final $d$-channel positional encoding.
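A sketch of such a fixed 2D encoding: half of the $d$ channels encode the row coordinate and half the column coordinate, each with sines and cosines at different frequencies, and the two halves are concatenated. The temperature of 10000 and the exact frequency schedule are assumptions of this sketch, borrowed from the original Transformer encoding.

import torch

def sine_positional_encoding_2d(H, W, d, temperature=10000.0):
    """Fixed 2D positional encoding of shape (H, W, d): d/2 channels for the
    row coordinate and d/2 for the column coordinate (illustrative sketch)."""
    half = d // 2                                 # channels per spatial coordinate
    dim_t = temperature ** (2 * (torch.arange(half) // 2) / half)     # frequencies
    y = torch.arange(H, dtype=torch.float32)[:, None, None] / dim_t   # (H, 1, half)
    x = torch.arange(W, dtype=torch.float32)[None, :, None] / dim_t   # (1, W, half)
    pos_y = torch.stack((y[..., 0::2].sin(), y[..., 1::2].cos()), dim=-1).flatten(-2)
    pos_x = torch.stack((x[..., 0::2].sin(), x[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat([pos_y.expand(H, W, half), pos_x.expand(H, W, half)], dim=-1)

pos = sine_positional_encoding_2d(H=25, W=34, d=256)   # matches a 256-channel feature map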
A.5 Additional results
Some extra qualitative results for the panoptic prediction of the DETR-R101 model are shown in Fig. 11 .
Fig. 11 (a): Failure case with overlapping objects. PanopticFPN misses one plane entirely, while DETR fails to accurately segment 3 of them.
Increasing the number of instances By design, DETR cannot predict more objects than it has query slots, i.e. 100 in our experiments. In this section, we analyze the behavior of DETR when approaching this limit. We select a canonical square image of a given class, repeat it on a $10\times10$ grid, and compute the percentage of instances that are missed by the model. To test the model with fewer than 100 instances, we randomly mask some of the cells. This ensures that the absolute size of the objects is the same no matter how many are visible. To account for the randomness in the masking, we repeat the experiment 100 times with different masks. The results are shown in Fig. 12 . The behavior is similar across classes, and while the model detects all instances when up to 50 are visible, it then starts saturating and misses more and more instances. Notably, when the image contains all 100 instances, the model only detects 30 on average, which is fewer than if the image contains only 50 instances that are all detected. The counter-intuitive behavior of the model is likely because the images and the detections are far from the training distribution.
Note that this test is, by design, a test of out-of-distribution generalization, since there are very few example images with a lot of instances of a single class. It is difficult to disentangle, from the experiment, two types of out-of-domain generalization: the image itself vs. the number of objects per class. But since few, if any, COCO images contain only many objects of the same class, this type of experiment represents our best effort to understand whether query objects overfit the label and position distribution of the dataset. Overall, the experiment suggests that the model does not overfit on these distributions, since it yields near-perfect detections up to 50 objects.
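For concreteness, here is a sketch of how such a synthetic grid image can be assembled; the instance crop, its resolution and the use of a black fill for masked cells are illustrative assumptions, since the text does not specify these details.

import torch

def make_grid_image(instance, num_visible, grid=10):
    """Tile a square instance crop on a grid x grid canvas and blank out random
    cells so that exactly num_visible instances remain."""
    c, s, _ = instance.shape                     # instance: (3, s, s) image crop
    canvas = instance.repeat(1, grid, grid)      # (3, grid*s, grid*s), all cells filled
    keep = torch.randperm(grid * grid)[:num_visible]
    visible = torch.zeros(grid * grid, dtype=torch.bool)
    visible[keep] = True
    for idx in range(grid * grid):
        if not visible[idx]:
            r, col = divmod(idx, grid)
            canvas[:, r * s:(r + 1) * s, col * s:(col + 1) * s] = 0  # blank masked cell
    return canvas

img = make_grid_image(torch.rand(3, 64, 64), num_visible=50)   # 50 visible instances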
Fig. 12: Analysis of the number of instances of various classes missed by DETR depending on how many are present in the image. We report the mean and the standard deviation. As the number of instances gets close to 100, DETR starts saturating and misses more and more objects.
To demonstrate the simplicity of the approach, we include inference code with PyTorch and Torchvision libraries in Listing 1 . The code runs with Python 3.6+, PyTorch 1.4 and Torchvision 0.5. Note that it does not support batching, hence it is suitable only for inference or training with DistributedDataParallel with one image per GPU. Also note that for clarity, this code uses learnt positional encodings in the encoder instead of fixed, and positional encodings are added to the input only instead of at each transformer layer. Making these changes requires going beyond the PyTorch implementation of transformers, which hampers readability. The entire code to reproduce the experiments will be made available before the conference.
import torch
from torch import nn
from torchvision.models import resnet50

class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim, nheads,
                 num_encoder_layers, num_decoder_layers):
        super().__init__()
        # We take only convolutional layers from ResNet-50 model
        self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
        # 1x1 convolution reducing the 2048 backbone channels to the transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # Prediction heads: class logits (+1 for the "no object" class) and box coordinates
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # 100 learnt object queries and learnt row/column positional embeddings
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)
        h = self.conv(x)
        H, W = h.shape[-2:]
        # Build a (H*W, 1, hidden_dim) positional encoding from the row/column embeddings
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # Flatten the feature map into a sequence and run the transformer
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1))
        return self.linear_class(h), self.linear_bbox(h).sigmoid()

detr = DETR(num_classes=91, hidden_dim=256, nheads=8,
            num_encoder_layers=6, num_decoder_layers=6)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
Listing 1: DETR PyTorch inference code. For clarity it uses learnt positional encodings in the encoder instead of fixed, and positional encodings are added to the input only instead of at each transformer layer. Making these changes requires going beyond PyTorch implementation of transformers, which hampers readability. The entire code to reproduce the experiments will be made available before the conference.