We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr .
The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [ 37 , 5 ], anchors [ 23 ], or window centers [ 53 , 46 ]. Their performances are significantly influenced by postprocessing steps to collapse near-duplicate predictions, by the design of the anchor sets and by the heuristics that assign target boxes to anchors [ 52 ]. To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [ 43 , 16 , 4 , 39 ] either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks. This paper aims to bridge this gap.
Fig. 1: DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions with ground truth boxes. Predictions with no match should yield a "no object" ($\varnothing$) class prediction.
We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers [ 47 ], a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.
Our DEtection TRansformer (DETR, see Figure 1 ) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression. Unlike most existing detection methods, DETR doesn't require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.
Compared to most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [ 29 , 12 , 10 , 8 ]. In contrast, previous work focused on autoregressive decoding with RNNs [ 43 , 41 , 30 , 36 , 42 ]. Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.
We evaluate DETR on one of the most popular object detection datasets, COCO [ 24 ], against a very competitive Faster R-CNN baseline [ 37 ]. Faster RCNN has undergone many design iterations and its performance was greatly improved since the original publication. Our experiments show that our new model achieves comparable performances. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performances on small objects. We expect that future work will improve this aspect in the same way the development of FPN [ 22 ] did for Faster R-CNN.
Training settings for DETR differ from standard object detectors in multiple ways. The new model requires an extra-long training schedule and benefits from auxiliary decoding losses in the transformer. We thoroughly explore what components are crucial for the demonstrated performance.
The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pretrained DETR outperforms competitive baselines on Panoptic Segmentation [ 19 ], a challenging pixel-level recognition task that has recently gained popularity.
Our work builds on prior work in several domains: bipartite matching losses for set prediction, encoder-decoder architectures based on the transformer, parallel decoding, and object detection methods.
There is no canonical deep learning model to directly predict sets. The basic set prediction task is multilabel classification (see e.g., [ 40 , 33 ] for references in the context of computer vision) for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). The first difficulty in these tasks is to avoid near-duplicates. Most current detectors use postprocessings such as non-maximal suppression to address this issue, but direct set prediction is postprocessing-free. It needs global inference schemes that model interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [ 9 ] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [ 48 ]. In all cases, the loss function should be invariant to a permutation of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [ 20 ], to find a bipartite matching between ground-truth and prediction. This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.
【Analysis】Set prediction is hard because conventional models are designed for sequences or structured outputs, whereas a set is unordered. In computer vision, the simplest set prediction task is multilabel classification, e.g. deciding which object categories appear in an image. The classical one-vs-rest approach trains a binary classifier per category, but it cannot handle detection, where position and class must be predicted jointly and the same category may have several instances. The core difficulty in detection is avoiding duplicate detections of the same object: if a model predicts ten near-identical boxes around the same person, the output is clearly unreasonable. Traditional detectors resolve this with post-processing such as non-maximum suppression, at the cost of extra system complexity. Direct set prediction instead aims to avoid duplicates at prediction time, which requires global reasoning: every prediction must be "aware" of all the others so that redundancy is suppressed automatically. Autoregressive models can do this, but they generate predictions serially and are therefore slow. Permutation invariance is the central requirement of set prediction, since a set has no order: the outputs [A, B, C] and [C, A, B] must be treated as equivalent. The Hungarian algorithm offers an elegant solution by finding the optimal one-to-one correspondence between the predicted set and the ground-truth set, so that every true object has a unique matching prediction. DETR's contribution is to combine this bipartite matching loss with the parallel decoding of transformers, guaranteeing duplicate-free predictions while keeping computation parallel and efficient.
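As a small illustration of why Hungarian matching makes a set loss permutation-invariant, here is a toy sketch using SciPy's assignment solver (the cost values are made up for illustration):

```python
# Toy illustration: optimal bipartite matching over a cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.1, 0.9, 0.8],   # cost[i, j]: cost of matching prediction i
                 [0.7, 0.2, 0.9],   # to ground-truth j (lower is better)
                 [0.8, 0.9, 0.3]])
rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
print(list(zip(rows, cols)))               # optimal pairs: (0,0), (1,1), (2,2)
print(cost[rows, cols].sum())              # total matching cost: 0.6
# Permuting the rows (predictions) permutes the returned indices,
# but the total matching cost stays the same.
```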
2.2 Transformers and Parallel Decoding
Transformers were introduced by Vaswani et al . [ 47 ] as a new attention-based building block for machine translation. Attention mechanisms [ 2 ] are neural network layers that aggregate information from the entire input sequence. Transformers introduced self-attention layers, which, similarly to Non-Local Neural Networks [ 49 ], scan through each element of a sequence and update it by aggregating information from the whole sequence. One of the main advantages of attention-based models is their global computations and perfect memory, which makes them more suitable than RNNs on long sequences. Transformers are now replacing RNNs in many problems in natural language processing, speech processing and computer vision [ 8 , 27 , 45 , 34 , 31 ].
Transformers were first used in auto-regressive models, following early sequence-to-sequence models [ 44 ], generating output tokens one by one. However, the prohibitive inference cost (proportional to output length, and hard to batch) led to the development of parallel sequence generation, in the domains of audio [ 29 ], machine translation [ 12 , 10 ], word representation learning [ 8 ], and more recently speech recognition [ 6 ]. We also combine transformers and parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.
Most modern object detection methods make predictions relative to some initial guesses. Two-stage detectors [ 37 , 5 ] predict boxes w.r.t. proposals, whereas single-stage methods make predictions w.r.t. anchors [ 23 ] or a grid of possible object centers [ 53 , 46 ]. Recent work [ 52 ] demonstrates that the final performance of these systems heavily depends on the exact way these initial guesses are set. In our model we are able to remove this hand-crafted process and streamline the detection process by directly predicting the set of detections with absolute box prediction w.r.t. the input image rather than an anchor.
Set-based loss. Several object detectors [ 9 , 25 , 35 ] used the bipartite matching loss. However, in these early deep learning models, the relations between different predictions were modeled with convolutional or fully-connected layers only, and a hand-designed NMS post-processing can improve their performance. More recent detectors [ 37 , 23 , 53 ] use non-unique assignment rules between ground truth and predictions together with an NMS.
Learnable NMS methods [ 16 , 4 ] and relation networks [ 17 ] explicitly model relations between different predictions with attention. Using direct set losses, they do not require any post-processing steps. However, these methods employ additional hand-crafted context features like proposal box coordinates to model relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.
Recurrent detectors. Closest to our approach are end-to-end set predictions for object detection [ 43 ] and instance segmentation [ 41 , 30 , 36 , 42 ]. Similarly to us, they use bipartite-matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage the recent transformers with parallel decoding.
Two ingredients are essential for direct set predictions in detection: (1) a set prediction loss that forces unique matching between predicted and ground truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relation. We describe our architecture in detail in Figure 2 .
DETR infers a fixed-size set of $N$ predictions, in a single pass through the decoder, where $N$ is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground truth objects, and then optimizes object-specific (bounding box) losses.
Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, we consider $y$ also as a set of size $N$ padded with $\varnothing$ (no object). To find a bipartite matching between these two sets we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_{N}$ with the lowest cost:
【Analysis】The ground-truth set $y$ contains all real objects in the image, while the prediction set $\hat{y}$ contains the $N$ outputs of the model. Because $N$ is usually larger than the number of real objects, the ground-truth set is padded with the special symbol $\varnothing$, meaning "no object" (background), until it also has $N$ elements. With both sets of size $N$, a one-to-one matching becomes possible. The key is to find a permutation $\sigma$ specifying which ground-truth element each prediction is paired with. $\mathfrak{S}_{N}$ denotes the set of all permutations of $N$ elements, of which there are $N!$; the goal is the permutation with the lowest matching cost, which is exactly an optimal assignment problem.
$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_{N}}{\arg\min} \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\big(y_{i}, \hat{y}_{\sigma(i)}\big), \tag{1}$$
where $\mathcal{L}_{\mathrm{match}}\big(y_{i}, \hat{y}_{\sigma(i)}\big)$ is a pair-wise matching cost between ground truth $y_{i}$ and a prediction with index $\sigma(i)$. This optimal assignment is computed efficiently with the Hungarian algorithm, following prior work (e.g. [ 43 ]).
【Analysis】The heart of this optimization is the matching cost $\mathcal{L}_{\mathrm{match}}$, which quantifies how well a prediction fits a ground-truth object. For every candidate permutation $\sigma$, the formula sums the matching cost over all pairs, and the $\arg\min$ selects the permutation $\hat{\sigma}$ that minimizes this sum. The Hungarian algorithm is the classical solver for this bipartite matching problem: it finds the optimum in polynomial time, $O(N^3)$, instead of enumerating all $N!$ permutations, which is entirely practical for the values of $N$ used here (on the order of a hundred). The matching cost typically combines a classification term and a localization term, so both the predicted class and the predicted box matter. In this way DETR learns, at every training step, how to pair predictions with ground-truth objects optimally, without hand-designed heuristic assignment rules.
The matching cost takes into account both the class prediction and the similarity of predicted and ground truth boxes. Each element $i$ of the ground truth set can be seen as $y_{i} = (c_{i}, b_{i})$ where $c_{i}$ is the target class label (which may be $\varnothing$) and $b_{i} \in [0,1]^{4}$ is a vector that defines the ground truth box center coordinates and its height and width relative to the image size. For the prediction with index $\sigma(i)$ we define the probability of class $c_{i}$ as $\hat{p}_{\sigma(i)}(c_{i})$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations we define $\mathcal{L}_{\mathrm{match}}\big(y_{i}, \hat{y}_{\sigma(i)}\big)$ as $-\mathbb{1}_{\{c_{i} \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_{i}) + \mathbb{1}_{\{c_{i} \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_{i}, \hat{b}_{\sigma(i)}\big)$.
【Analysis】The matching cost reflects the two requirements of detection: correct classification and accurate localization. Each ground-truth object $y_i$ carries a class label $c_i$ and a box $b_i$. The box uses a normalized representation with all coordinates in $[0,1]$, which removes the influence of image size and stabilizes the loss; the 4-dimensional vector holds the center $x$, $y$, width and height. Correspondingly, each prediction provides the class probability $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box $\hat{b}_{\sigma(i)}$. The indicator $\mathbb{1}_{\{c_i \neq \varnothing\}}$ controls both terms: it is 1 when the ground-truth slot is a real object and 0 when it is padding, so the (negative) class probability and the box cost are only counted for real objects. Using the negative probability for the class term encourages a high probability for the correct class, in the same spirit as cross-entropy, while $\mathcal{L}_{\mathrm{box}}$ handles the accuracy of position and size; the two terms are simply added, balancing classification and localization in the cost.
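To make the matching step concrete, below is a minimal sketch of how the cost matrix could be built and solved; the shapes, the box format (normalized cxcywh), and the omission of the GIoU term are simplifying assumptions for illustration, not the reference implementation:

```python
# Sketch of DETR-style matching: cost = -p_hat(c_i) + L1(b_i, b_hat), solved
# with the Hungarian algorithm (GIoU cost term omitted here for brevity).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    # pred_logits: (N, num_classes+1), pred_boxes: (N, 4) normalized cxcywh
    # tgt_labels: (K,) int64, tgt_boxes: (K, 4)
    prob = pred_logits.softmax(-1)                      # class probabilities
    cost_class = -prob[:, tgt_labels]                   # (N, K) classification cost
    cost_l1 = torch.cdist(pred_boxes, tgt_boxes, p=1)   # (N, K) pairwise L1 box cost
    cost = cost_class + cost_l1
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, tgt_idx                            # matched prediction / target indices
```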
This procedure of finding matching plays the same role as the heuristic assignment rules used to match proposal [ 37 ] or anchors [ 22 ] to ground truth objects in modern detectors. The main difference is that we need to find one-to-one matching for direct set prediction without duplicates.
The second step is to compute the loss function, the Hungarian loss for all pairs matched in the previous step. We define the loss similarly to the losses of common object detectors, i.e. a linear combination of a negative log-likelihood for class prediction and a box loss defined later:
$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}(c_{i}) + \mathbb{1}_{\{c_{i} \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_{i}, \hat{b}_{\hat{\sigma}(i)}\big)\right], \tag{2}$$
【Analysis】This formula spells out how the Hungarian loss is computed. $\hat{\sigma}$ is the optimal assignment obtained in the first step; it says that the $i$-th ground-truth object is paired with prediction $\hat{\sigma}(i)$. The loss sums over all $N$ positions, each contributing two terms. The first term, $-\log \hat{p}_{\hat{\sigma}(i)}(c_{i})$, is the classification loss: it pushes the matched prediction to assign a high probability to the true class $c_{i}$. The second term is the box loss $\mathcal{L}_{\mathrm{box}}(b_{i}, \hat{b}_{\hat{\sigma}(i)})$, which penalizes the discrepancy between the predicted and the ground-truth box. The indicator $\mathbb{1}_{\{c_{i} \neq \varnothing\}}$ ensures the box loss is computed only when the ground-truth slot is a real object; padded "no object" slots carry no box information and need no regression. This design lets a single loss handle both the object and no-object cases correctly.
where $\hat{\sigma}$ is the optimal assignment computed in the first step (1). In practice, we down-weight the log-probability term when $c_{i} = \varnothing$ by a factor 10 to account for class imbalance. This is analogous to how the Faster R-CNN training procedure balances positive/negative proposals by subsampling [ 37 ]. Notice that the matching cost between an object and $\varnothing$ doesn't depend on the prediction, which means that in that case the cost is a constant. In the matching cost we use probabilities $\hat{p}_{\hat{\sigma}(i)}(c_{i})$ instead of log-probabilities. This makes the class prediction term commensurable to $\mathcal{L}_{\mathrm{box}}(\cdot,\cdot)$ (described below), and we observed better empirical performances.
【Analysis】This passage covers two practical details of training. First, class imbalance: most of the $N$ slots correspond to background ($c_i = \varnothing$), and without countermeasures the background term would dominate the loss and hinder learning on real objects. DETR therefore down-weights the log-probability term of the no-object class by a factor of 10, in the same spirit as Faster R-CNN's subsampling of positive and negative proposals. Second, the matching cost and the final loss differ in one respect: the matching stage uses raw probabilities rather than log-probabilities, so that the classification term and the box term live on comparable scales and neither dominates the other. Finally, the remark that the cost of matching $\varnothing$ does not depend on the prediction simply means that for padded slots the cost is a constant, which is sensible since there is nothing concrete to evaluate.
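A minimal sketch of the classification part of this loss with the down-weighted no-object class is shown below; the class count and the placement of the no-object index at the end are assumptions for illustration:

```python
# Classification loss with the "no object" class down-weighted by a factor 10.
import torch
import torch.nn.functional as F

num_classes = 91                       # illustrative foreground class count
eos_coef = 0.1                         # weight of the no-object class (last index)
class_weights = torch.ones(num_classes + 1)
class_weights[-1] = eos_coef

def classification_loss(pred_logits, target_classes):
    # pred_logits: (N, num_classes+1); target_classes: (N,) holding the matched
    # ground-truth label for matched slots and the no-object index otherwise.
    return F.cross_entropy(pred_logits, target_classes, weight=class_weights)
```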
Bounding box loss. The second part of the matching cost and the Hungarian loss is $\mathcal{L}_{\mathrm{box}}(\cdot)$ that scores the bounding boxes. Unlike many detectors that make box predictions as a $\Delta$ w.r.t. some initial guesses, we make box predictions directly. While such an approach simplifies the implementation, it poses an issue with the relative scaling of the loss. The most commonly-used $\ell_{1}$ loss will have different scales for small and large boxes even if their relative errors are similar. To mitigate this issue we use a linear combination of the $\ell_{1}$ loss and the generalized IoU loss [ 38 ] $\mathcal{L}_{\mathrm{iou}}(\cdot,\cdot)$ that is scale-invariant. Overall, our box loss $\mathcal{L}_{\mathrm{box}}\big(b_{i}, \hat{b}_{\sigma(i)}\big)$ is defined as $\lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\big(b_{i}, \hat{b}_{\sigma(i)}\big) + \lambda_{\mathrm{L1}}\, \|b_{i} - \hat{b}_{\sigma(i)}\|_{1}$, where $\lambda_{\mathrm{iou}}, \lambda_{\mathrm{L1}} \in \mathbb{R}$ are hyperparameters. These two losses are normalized by the number of objects inside the batch.
【Analysis】DETR's box loss follows a direct regression strategy. Traditional detectors usually predict indirectly: they define candidate boxes or anchors and regress offsets $\Delta$ relative to them, which injects prior knowledge but adds design complexity. DETR predicts absolute box coordinates directly, which is simpler and avoids anchor design and offset encoding/decoding, but it exposes a classic scale-sensitivity problem of box regression: the $\ell_1$ loss reacts differently to boxes of different sizes. For a large object a few pixels of error already produces a large $\ell_1$ value, while for a small object the same pixel error gives a small loss even though the relative error is more severe; left unchecked, this biases training towards large objects. The generalized IoU loss is introduced precisely because IoU is a ratio: it measures the relative overlap between predicted and ground-truth boxes rather than absolute pixel differences, and is therefore scale-invariant. Combining $\ell_1$ and GIoU linearly keeps the precision of $\ell_1$ for exact localization while adding the scale invariance of IoU, so performance stays more consistent across object sizes. The weights $\lambda_{\mathrm{iou}}$ and $\lambda_{\mathrm{L1}}$ balance the two terms and are set experimentally. The final normalization by the number of objects in the batch stabilizes training, since the raw sum would otherwise fluctuate with how many objects each batch happens to contain.
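A minimal sketch of this combined box loss is given below, assuming the boxes are already matched pairs converted to absolute xyxy format and a recent torchvision is available; the weight values are illustrative:

```python
# Box loss: lambda_iou * GIoU loss + lambda_L1 * L1, normalized by object count.
import torch
from torchvision.ops import generalized_box_iou

def box_loss(pred_xyxy, tgt_xyxy, lambda_iou=2.0, lambda_l1=5.0):
    num_boxes = max(len(tgt_xyxy), 1)
    l1 = torch.nn.functional.l1_loss(pred_xyxy, tgt_xyxy, reduction="sum")
    giou = torch.diag(generalized_box_iou(pred_xyxy, tgt_xyxy))  # matched (diagonal) pairs
    loss_giou = (1 - giou).sum()
    return (lambda_iou * loss_giou + lambda_l1 * l1) / num_boxes
```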
3.2 DETR architecture
The overall DETR architecture is surprisingly simple and depicted in Figure 2 . It contains three main components, which we describe below: a CNN backbone to extract a compact feature representation, an encoder-decoder transformer, and a simple feed forward network (FFN) that makes the final detection prediction.
Unlike many modern detectors, DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer architecture implementation with just a few hundred lines. Inference code for DETR can be implemented in less than 50 lines in PyTorch [ 32 ]. We hope that the simplicity of our method will attract new researchers to the detection community.
Backbone. Starting from the initial image $x_{\mathrm{img}} \in \mathbb{R}^{3 \times H_{0} \times W_{0}}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. Typical values we use are $C = 2048$ and $H, W = \frac{H_{0}}{32}, \frac{W_{0}}{32}$.
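A minimal sketch of such a backbone (a torchvision ResNet-50 with the pooling and classification head removed; the paper uses ImageNet-pretrained weights, omitted here) illustrates the stride-32, 2048-channel feature map:

```python
# ResNet-50 backbone producing a C=2048, H0/32 x W0/32 activation map.
import torch
from torch import nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])  # drop avgpool + fc
x_img = torch.randn(1, 3, 800, 1066)     # one image, 3 color channels
f = backbone(x_img)                      # (1, 2048, 25, 34): stride-32 feature map
print(f.shape)
```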
Transformer encoder. First, a 1x1 convolution reduces the channel dimension of the high-level activation map $f$ from $C$ to a smaller dimension $d$, creating a new feature map $z_{0} \in \mathbb{R}^{d \times H \times W}$. The encoder expects a sequence as input, so we collapse the spatial dimensions of $z_{0}$ into one dimension, resulting in a $d \times HW$ feature map. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [ 31 , 3 ] that are added to the input of each attention layer. We defer to the supplementary material the detailed definition of the architecture, which follows the one described in [ 47 ].
【Analysis】This step adapts the CNN features to the transformer. The $1\times1$ convolution reduces the 2048-dimensional features to a smaller dimension $d$ (typically 256), which cuts computation and memory and also acts as a learned feature selection. Reshaping the spatial dimensions is necessary because the transformer expects a sequence: the $H \times W$ grid is flattened into a sequence of length $HW$, where each position's $d$-dimensional feature vector becomes one token. The encoder then follows the standard design: multi-head self-attention captures relations between any two positions, which matters in detection because features at distant locations can be strongly related, and the feed-forward network adds non-linear capacity. Positional encodings compensate for the permutation invariance of self-attention: spatial location is essential in images, and the fixed encodings let the model distinguish tokens at different positions and preserve the spatial structure.
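The following sketch shows the projection, flattening and encoding steps with PyTorch's standard transformer encoder; for brevity it uses a learned positional embedding added once at the input, whereas the paper uses fixed sine encodings added at every attention layer:

```python
# 1x1 projection to d=256, flatten HxW into a sequence, add positional encoding,
# then run a standard 6-layer transformer encoder.
import torch
from torch import nn

C, d, H, W = 2048, 256, 25, 34
input_proj = nn.Conv2d(C, d, kernel_size=1)
pos_embed = nn.Parameter(torch.rand(H * W, 1, d))   # learned here; fixed sine in the paper
encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

f = torch.randn(1, C, H, W)                  # backbone output
z0 = input_proj(f)                           # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)         # (HW, batch, d) token sequence
memory = encoder(src + pos_embed)            # (HW, 1, d) encoder output
```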
Fig. 2: DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries , and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a " no object " class.
Transformer decoder. The decoder follows the standard architecture of the transformer, transforming $N$ embeddings of size $d$ using multi-headed self- and encoder-decoder attention mechanisms. The difference with the original transformer is that our model decodes the $N$ objects in parallel at each decoder layer, while Vaswani et al. [ 47 ] use an autoregressive model that predicts the output sequence one element at a time. We refer the reader unfamiliar with the concepts to the supplementary material. Since the decoder is also permutation-invariant, the $N$ input embeddings must be different to produce different results. These input embeddings are learnt positional encodings that we refer to as object queries, and similarly to the encoder, we add them to the input of each attention layer. The $N$ object queries are transformed into an output embedding by the decoder. They are then independently decoded into box coordinates and class labels by a feed forward network (described in the next subsection), resulting in $N$ final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.
【Analysis】The decoder design marks the shift from sequence generation to set prediction. A standard transformer decoder is autoregressive, which is natural for text where each token depends on the previous ones, but the objects in an image have no natural order and are better treated as an unordered set. DETR therefore decodes all $N$ objects in parallel, which is not only faster but also avoids imposing an artificial ordering. The object queries can be viewed as $N$ learned "probes", each responsible for attending to and describing one object; during training they specialize to different regions, scales, or kinds of objects. Because the decoder itself is permutation-invariant, identical queries would yield identical outputs, so each query must have a distinct representation, which emerges naturally from random initialization and gradient updates. The attention mechanisms play complementary roles: self-attention lets the queries exchange information, which helps handle inter-object relations such as suppressing duplicates or reasoning about occlusion, while encoder-decoder attention lets each query focus on the relevant regions of the image features, grounding it in the global context. Finally, each query's output embedding is independently decoded by the shared feed-forward network into a box and a class distribution.
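A minimal sketch of parallel decoding with learned object queries is shown below, using PyTorch's standard decoder; the paper's decoder additionally re-adds the query embeddings inside every attention layer, which this sketch skips:

```python
# Parallel decoding of N object queries against the encoder output ("memory").
import torch
from torch import nn

d, N = 256, 100
decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
object_queries = nn.Parameter(torch.rand(N, 1, d))   # N learned query embeddings

memory = torch.randn(25 * 34, 1, d)                  # encoder output (HW, batch, d)
tgt = torch.zeros(N, 1, d) + object_queries          # decoder input
hs = decoder(tgt, memory)                            # (N, 1, d): one embedding per query
```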
Prediction feed-forward networks (FFNs). The final prediction is computed by a 3-layer perceptron with ReLU activation function and hidden dimension $d$, and a linear projection layer. The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function. Since we predict a fixed-size set of $N$ bounding boxes, where $N$ is usually much larger than the actual number of objects of interest in an image, an additional special class label $\varnothing$ is used to represent that no object is detected within a slot. This class plays a similar role to the "background" class in the standard object detection approaches.
【Analysis】The FFNs are DETR's output heads, converting the decoder's abstract embeddings into concrete detections. The 3-layer MLP provides enough non-linearity to extract box geometry from the high-dimensional features. A key choice is coordinate normalization: center coordinates, height and width are all mapped to $[0,1]$, which eases optimization and makes the prediction independent of image size. The class head uses a softmax so that the class probabilities sum to 1, the standard choice for multi-class prediction. A distinctive property of DETR is the fixed output size: regardless of how many objects an image contains, the network always emits $N$ predictions, which simplifies the architecture and removes proposal generation and filtering. The price is that when the image contains fewer than $N$ objects, the surplus slots must be handled somehow; DETR does so with the special "no object" class, which plays the same role as the background class in conventional detectors.
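A minimal sketch of these heads, with the last class index standing in for "no object" (the class count here is illustrative):

```python
# Prediction heads: 3-layer MLP for normalized cxcywh boxes, linear class head.
import torch
from torch import nn

d, N, num_classes = 256, 100, 91
bbox_head = nn.Sequential(
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)
class_head = nn.Linear(d, num_classes + 1)    # +1 logit for the "no object" class

hs = torch.randn(N, 1, d)                     # decoder output embeddings
boxes = bbox_head(hs).sigmoid()               # (N, 1, 4) boxes in [0, 1]
probs = class_head(hs).softmax(-1)            # (N, 1, num_classes+1) class probabilities
```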
Auxiliary decoding losses. We found it helpful to use auxiliary losses [ 1 ] in the decoder during training, especially to help the model output the correct number of objects of each class. We add prediction FFNs and the Hungarian loss after each decoder layer. All prediction FFNs share their parameters. We use an additional shared layer-norm to normalize the input to the prediction FFNs from different decoder layers.
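A rough sketch of this idea, reusing names assumed from the sketches above: the shared heads are applied to the normalized output of every decoder layer, and a Hungarian loss would then be computed on each per-layer prediction and summed.

```python
# Collect predictions from every decoder layer for auxiliary supervision.
def per_layer_predictions(tgt, memory, decoder_layers, shared_norm, class_head, bbox_head):
    out, preds = tgt, []
    for layer in decoder_layers:
        out = layer(out, memory)                         # one decoder layer
        h = shared_norm(out)                             # shared layer-norm before the heads
        preds.append((class_head(h), bbox_head(h).sigmoid()))
    return preds                                         # one (logits, boxes) pair per layer
```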
We show that DETR achieves competitive results compared to Faster R-CNN in quantitative evaluation on COCO. Then, we provide a detailed ablation study of the architecture and loss, with insights and qualitative results. Finally, to show that DETR is a versatile and extensible model, we present results on panoptic segmentation, training only a small extension on a fixed DETR model. We provide code and pretrained models to reproduce our experiments at https://github.com/facebookresearch/detr .
Dataset. We perform experiments on COCO 2017 detection and panoptic segmentation datasets [ 24 , 18 ], containing 118k training images and 5k validation images. Each image is annotated with bounding boxes and panoptic segmentation. There are 7 instances per image on average, up to 63 instances in a single image in training set, ranging from small to large on the same images. If not specified, we report AP as bbox AP, the integral metric over multiple thresholds. For comparison with Faster R-CNN we report validation AP at the last training epoch, for ablations we report median over validation results from the last 10 epochs.
Technical details. We train DETR with AdamW [ 26 ], setting the initial transformer's learning rate to $10^{-4}$, the backbone's to $10^{-5}$, and weight decay to $10^{-4}$. All transformer weights are initialized with Xavier init [ 11 ], and the backbone is initialized with an ImageNet-pretrained ResNet model [ 15 ] from torchvision with frozen batchnorm layers. We report results with two different backbones: a ResNet-50 and a ResNet-101. The corresponding models are called respectively DETR and DETR-R101. Following [ 21 ], we also increase the feature resolution by adding a dilation to the last stage of the backbone and removing a stride from the first convolution of this stage. The corresponding models are called respectively DETR-DC5 and DETR-DC5-R101 (dilated C5 stage). This modification increases the resolution by a factor of two, thus improving performance for small objects, at the cost of a 16x higher cost in the self-attentions of the encoder, leading to an overall 2x increase in computational cost. A full comparison of FLOPs of these models and Faster R-CNN is given in Table 1.
We use scale augmentation, resizing the input images such that the shortest side is at least 480 and at most 800 pixels while the longest at most 1333 [ 50 ]. To help learning global relationships through the self-attention of the encoder, we also apply random crop augmentations during training, improving the performance by approximately 1 AP. Specifically, a train image is cropped with probability 0.5 to a random rectangular patch which is then resized again to 800-1333. The transformer is trained with default dropout of 0.1. At inference time, some slots predict empty class. To optimize for AP, we override the prediction of these slots with the second highest scoring class, using the corresponding confidence. This improves AP by 2 points compared to filtering out empty slots. Other training hyperparameters can be found in section A.4 . For our ablation experiments we use training schedule of 300 epochs with a learning rate drop by a factor of 10 after 200 epochs, where a single epoch is a pass over all training images once. Training the baseline model for 300 epochs on 16 V100 GPUs takes 3 days, with 4 images per GPU (hence a total batch size of 64). For the longer schedule used to compare with Faster R-CNN we train for 500 epochs with learning rate drop after 400 epochs. This schedule adds 1.5 AP compared to the shorter schedule.
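As a small illustration of the optimizer setup described above, here is a minimal sketch with a toy two-part model standing in for the real backbone and transformer:

```python
# AdamW with a lower learning rate on the backbone and a step LR drop.
import torch
from torch import nn

# Toy stand-in for a model with a `backbone` submodule and the rest of DETR.
model = nn.ModuleDict({"backbone": nn.Conv2d(3, 8, 1), "transformer": nn.Linear(8, 8)})

backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]
optimizer = torch.optim.AdamW(
    [{"params": backbone_params, "lr": 1e-5}, {"params": other_params, "lr": 1e-4}],
    weight_decay=1e-4,
)
# Drop the learning rate by a factor of 10 after 200 epochs (ablation schedule).
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200)
```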
Table 1: Comparison with Faster R-CNN with a ResNet-50 and ResNet-101 backbones on the COCO validation set. The top section shows results for Faster R-CNN models in Detectron2 [ 50 ], the middle section shows results for Faster R-CNN models with GIoU [ 38 ], random crops train-time augmentation, and the long 9x training schedule. DETR models achieve comparable results to heavily tuned Faster R-CNN baselines, having lower AP$_S$ but greatly improved AP$_L$. We use torchscript Faster R-CNN and DETR models to measure FLOPS and FPS. Results without R101 in the name correspond to ResNet-50.
Transformers are typically trained with Adam or Adagrad optimizers with very long training schedules and dropout, and this is true for DETR as well. Faster R-CNN, however, is trained with SGD with minimal data augmentation and we are not aware of successful applications of Adam or dropout. Despite these differences we attempt to make a Faster R-CNN baseline stronger. To align it with DETR, we add generalized IoU [ 38 ] to the box loss, the same random crop augmentation and long training known to improve results [ 13 ]. Results are presented in Table 1. In the top section we show Faster R-CNN results from Detectron2 Model Zoo [ 50 ] for models trained with the 3x schedule. In the middle section we show results (with a "+") for the same models but trained with the 9x schedule (109 epochs) and the described enhancements, which in total adds 1-2 AP. In the last section of Table 1 we show the results for multiple DETR models. To be comparable in the number of parameters we choose a model with 6 transformer and 6 decoder layers of width 256 with 8 attention heads. Like Faster R-CNN with FPN this model has 41.3M parameters, out of which 23.5M are in ResNet-50, and 17.8M are in the transformer. Even though both Faster R-CNN and DETR are still likely to further improve with longer training, we can conclude that DETR can be competitive with Faster R-CNN with the same number of parameters, achieving 42 AP on the COCO val subset. The way DETR achieves this is by improving AP$_L$ (+7.8), however note that the model is still lagging behind in AP$_S$ (-5.5). DETR-DC5 with the same number of parameters and similar FLOP count has higher AP, but is still significantly behind in AP$_S$ too. Faster R-CNN and DETR with ResNet-101 backbone show comparable results as well.
Table 2: Effect of encoder size. Each row corresponds to a model with varied number of encoder layers and fixed number of decoder layers. Performance gradually improves with more encoder layers.
Attention mechanisms in the transformer decoder are the key components which model relations between feature representations of different detections. In our ablation analysis, we explore how other components of our architecture and loss influence the final performance. For the study we choose ResNet-50-based DETR model with 6 encoder, 6 decoder layers and width 256. The model has 41.3M parameters, achieves 40.6 and 42.0 AP on short and long schedules respectively, and runs at 28 FPS, similarly to Faster R-CNN-FPN with the same backbone.
Number of encoder layers. We evaluate the importance of global image-level self-attention by changing the number of encoder layers (Table 2). Without encoder layers, overall AP drops by 3.9 points, with a more significant drop of 6.0 AP on large objects. We hypothesize that, by using global scene reasoning, the encoder is important for disentangling objects. In Figure 3, we visualize the attention maps of the last encoder layer of a trained model, focusing on a few points in the image. The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.
Number of decoder layers. We apply auxiliary losses after each decoding layer (see Section 3.2), hence, the prediction FFNs are trained by design to predict objects out of the outputs of every decoder layer. We analyze the importance of each decoder layer by evaluating the objects that would be predicted at each stage of the decoding (Fig. 4). Both AP and AP$_{50}$ improve after every layer, totalling into a very significant +8.2/9.5 AP improvement between the first and the last layer. With its set-based loss, DETR does not need NMS by design. To verify this we run a standard NMS procedure with default parameters [ 50 ] for the outputs after each decoder. NMS improves performance for the predictions from the first decoder. This can be explained by the fact that a single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object. In the second and subsequent layers, the self-attention mechanism over the activations allows the model to inhibit duplicate predictions. We observe that the improvement brought by NMS diminishes as depth increases. At the last layers, we observe a small loss in AP as NMS incorrectly removes true positive predictions.
Fig. 3: Encoder self-attention for a set of reference points. The encoder is able to separate individual instances. Predictions are made with baseline DETR model on a validation set image.
Similarly to visualizing encoder attention, we visualize decoder attentions in Fig. 6 , coloring attention maps for each predicted object in different colors. We observe that decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs. We hypothesise that after the encoder has separated instances via global attention, the decoder only needs to attend to the extremities to extract the class and object boundaries.
Importance of FFN. The FFN inside transformers can be seen as $1\times1$ convolutional layers, making the encoder similar to attention augmented convolutional networks [ 3 ]. We attempt to remove it completely, leaving only attention in the transformer layers. By reducing the number of network parameters from 41.3M to 28.7M, leaving only 10.8M in the transformer, performance drops by 2.3 AP; we thus conclude that the FFN is important for achieving good results.
Importance of positional encodings. There are two kinds of positional encodings in our model: spatial positional encodings and output positional encodings (object queries). We experiment with various combinations of fixed and learned encodings, results can be found in table 3 . Output positional encodings are required and cannot be removed, so we experiment with either passing them once at decoder input or adding to queries at every decoder attention layer. In the first experiment we completely remove spatial positional encodings and pass output positional encodings at input and, interestingly, the model still achieves more than 32 AP, losing 7.8 AP to the baseline. Then, we pass fixed sine spatial positional encodings and the output encodings at input once, as in the original transformer [ 47 ], and find that this leads to 1.4 AP drop compared to passing the positional encodings directly in attention. Learned spatial encodings passed to the attentions give similar results. Surprisingly, we find that not passing any spatial encodings in the encoder only leads to a minor AP drop of 1.3 AP. When we pass the encodings to the attentions, they are shared across all layers, and the output encodings (object queries) are always learned.
Fig. 4: AP and AP$_{50}$ performance after each decoder layer. A single long schedule baseline model is evaluated. DETR does not need NMS by design, which is validated by this figure. NMS lowers AP in the final layers, removing TP predictions, but improves AP in the first decoder layers, removing double predictions, as there is no communication in the first layer, and slightly improves AP$_{50}$.
Fig. 5: Out of distribution generalization for rare classes. Even though no image in the training set has more than 13 giraffes, DETR has no difficulty generalizing to 24 and more instances of the same class.
Given these ablations, we conclude that transformer components: the global self-attention in encoder, FFN, multiple decoder layers, and positional encodings, all significantly contribute to the final object detection performance.
Loss ablations. To evaluate the importance of different components of the matching cost and the loss, we train several models turning them on and off. There are three components to the loss: classification loss, $\ell_{1}$ bounding box distance loss, and GIoU [ 38 ] loss. The classification loss is essential for training and cannot be turned off, so we train a model without bounding box distance loss, and a model without the GIoU loss, and compare with the baseline, trained with all three losses. Results are presented in Table 4. GIoU loss on its own accounts for most of the model performance, losing only 0.7 AP to the baseline with combined losses. Using $\ell_{1}$ without GIoU shows poor results. We only studied simple ablations of different losses (using the same weighting every time), but other means of combining them may achieve different results.
Table 3: Results for different positional encodings compared to the baseline (last row), which has fixed sine pos. encodings passed at every attention layer in both the encoder and the decoder. Learned embeddings are shared between all layers. Not using spatial positional encodings leads to a significant drop in AP. Interestingly, passing them in decoder only leads to a minor AP drop. All these models use learned output positional encodings.
Fig. 6: Visualizing decoder attention for every predicted object (images from COCO val set). Predictions are made with DETR-DC5 model. Attention scores are coded with different colors for different objects. Decoder typically attends to object extremities, such as legs and heads. Best viewed in color.
Table 4: Effect of loss components on AP. We train two models turning off $\ell_{1}$ loss, and GIoU loss, and observe that $\ell_{1}$ gives poor results on its own, but when combined with GIoU improves AP$_M$ and AP$_L$. Our baseline (last row) combines both losses.
Fig. 7: Visualization of all box predictions on all images from COCO 2017 val set for 20 out of total $N=100$ prediction slots in the DETR decoder. Each box prediction is represented as a point with the coordinates of its center in the 1-by-1 square normalized by each image size. The points are color-coded so that green color corresponds to small boxes, red to large horizontal boxes and blue to large vertical boxes. We observe that each slot learns to specialize on certain areas and box sizes with several operating modes. We note that almost all slots have a mode of predicting large image-wide boxes that are common in the COCO dataset.
4.3 Analysis
Decoder output slot analysis In Fig. 7 we visualize the boxes predicted by different slots for all images in COCO 2017 val set. DETR learns different specialization for each query slot. We observe that each slot has several modes of operation focusing on different areas and box sizes. In particular, all slots have the mode for predicting image-wide boxes (visible as the red dots aligned in the middle of the plot). We hypothesize that this is related to the distribution of objects in COCO.
Generalization to unseen numbers of instances. Some classes in COCO are not well represented with many instances of the same class in the same image. For example, there is no image with more than 13 giraffes in the training set. We create a synthetic image to verify the generalization ability of DETR (see Figure 5). Our model is able to find all 24 giraffes on the image, which is clearly out of distribution. This experiment confirms that there is no strong class-specialization in each object query.
Panoptic segmentation [ 19 ] has recently attracted a lot of attention from the computer vision community. Similarly to the extension of Faster R-CNN [ 37 ] to Mask R-CNN [ 14 ], DETR can be naturally extended by adding a mask head on top of the decoder outputs. In this section we demonstrate that such a head can be used to produce panoptic segmentation [ 19 ] by treating stuff and thing classes in a unified way. We perform our experiments on the panoptic annotations of the COCO dataset that has 53 stuff categories in addition to 80 things categories.
Fig. 8: Illustration of the panoptic head. A binary mask is generated in parallel for each detected object, then the masks are merged using pixel-wise argmax.
Fig. 9: Qualitative results for panoptic segmentation generated by DETR-R101. DETR produces aligned mask predictions in a unified manner for things and stuff.
We train DETR to predict boxes around both stuff and things classes on COCO, using the same recipe. Predicting boxes is required for the training to be possible, since the Hungarian matching is computed using distances between boxes. We also add a mask head which predicts a binary mask for each of the predicted boxes, see Figure 8. It takes as input the output of the transformer decoder for each object and computes multi-head (with $M$ heads) attention scores of this embedding over the output of the encoder, generating $M$ attention heatmaps per object at a small resolution. To make the final prediction and increase the resolution, an FPN-like architecture is used. We describe the architecture in more detail in the supplement. The final resolution of the masks has stride 4 and each mask is supervised independently using the DICE/F-1 loss [ 28 ] and Focal loss [ 23 ].
The mask head can be trained either jointly, or in a two-step process, where we train DETR for boxes only, then freeze all the weights and train only the mask head for 25 epochs. Experimentally, these two approaches give similar results; we report results using the latter method since it results in a shorter total wall-clock training time.
Table 5: Comparison with the state-of-the-art methods UPSNet [ 51 ] and Panoptic FPN [ 18 ] on the COCO val dataset. We retrained PanopticFPN with the same data augmentation as DETR, on an 18x schedule for a fair comparison. UPSNet uses the 1x schedule, UPSNet-M is the version with multiscale test-time augmentations.
To predict the final panoptic segmentation we simply use an argmax over the mask scores at each pixel, and assign the corresponding categories to the resulting masks. This procedure guarantees that the final masks have no overlaps and, therefore, DETR does not require a heuristic [ 19 ] that is often used to align different masks.
Training details. We train DETR, DETR-DC5 and DETR-R101 models following the recipe for bounding box detection to predict boxes around stuff and things classes in the COCO dataset. The new mask head is trained for 25 epochs (see supplementary for details). During inference we first filter out the detections with a confidence below 85%, then compute the per-pixel argmax to determine to which mask each pixel belongs. We then collapse different mask predictions of the same stuff category into one, and filter out the empty ones (less than 4 pixels).
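A minimal sketch of this merging step is shown below; the tensor names and shapes are assumptions for illustration, but the logic (confidence filtering, then a pixel-wise argmax so the final masks cannot overlap) follows the procedure described above:

```python
# Panoptic merging: keep confident detections, then per-pixel argmax over masks.
import torch

def merge_panoptic(mask_logits, scores, labels, conf_thresh=0.85):
    # mask_logits: (N, H, W) per-detection mask scores; scores/labels: (N,)
    keep = scores > conf_thresh
    mask_logits, labels = mask_logits[keep], labels[keep]
    assignment = mask_logits.argmax(dim=0)      # (H, W): winning detection per pixel
    seg = labels[assignment]                    # (H, W): per-pixel category, no overlaps
    return seg, assignment
```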
Main results. Qualitative results are shown in Figure 9. In Table 5 we compare our unified panoptic segmentation approach with several established methods that treat things and stuff differently. We report the Panoptic Quality (PQ) and the break-down on things (PQ$^{\mathrm{th}}$) and stuff (PQ$^{\mathrm{st}}$). We also report the mask AP (computed on the things classes), before any panoptic post-treatment (in our case, before taking the pixel-wise argmax). We show that DETR outperforms published results on COCO-val 2017, as well as our strong PanopticFPN baseline (trained with the same data-augmentation as DETR, for a fair comparison). The result break-down shows that DETR is especially dominant on stuff classes, and we hypothesize that the global reasoning allowed by the encoder attention is the key element to this result. For the things class, despite a severe deficit of up to 8 mAP compared to the baselines on the mask AP computation, DETR obtains competitive PQ$^{\mathrm{th}}$. We also evaluated our method on the test set of the COCO dataset, and obtained 46 PQ. We hope that our approach will inspire the exploration of fully unified models for panoptic segmentation in future work.
5 Conclusion
We presented DETR, a new design for object detection systems based on transformers and bipartite matching loss for direct set prediction. The approach achieves comparable results to an optimized Faster R-CNN baseline on the challenging COCO dataset. DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. In addition, it achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the processing of global information performed by the self-attention.
This new design for detectors also comes with new challenges, in particular regarding training, optimization and performances on small objects. Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR.
Since our model is based on the Transformer architecture, we remind here the general form of attention mechanisms we use for exhaustivity. The attention mechanism follows [ 47 ], except for the details of positional encodings (see Equation 8 ) that follows [ 7 ].
【Analysis】The two references here are [47], the original Transformer paper "Attention is All You Need", and [7], which concerns the positional encoding variant used. DETR keeps the standard attention computation of the Transformer but handles positional encodings in its own way, which is necessary for two-dimensional image data.
Multi-head The general form of multi-head attention with $M$ heads of dimension $d$ is a function with the following signature (using $d' = \frac{d}{M}$, and giving matrix/tensor sizes in underbrace):
$$\mathrm{mh\text{-}attn}:\ \underbrace{X_{\mathrm{q}}}_{d \times N_{\mathrm{q}}},\ \underbrace{X_{\mathrm{kv}}}_{d \times N_{\mathrm{kv}}},\ \underbrace{T}_{M \times 3 \times d' \times d},\ \underbrace{L}_{d \times d}\ \mapsto\ \underbrace{\tilde{X}_{\mathrm{q}}}_{d \times N_{\mathrm{q}}}$$
where $X_{\mathrm{q}}$ is the query sequence of length $N_{\mathrm{q}}$, $X_{\mathrm{kv}}$ is the key-value sequence of length $N_{\mathrm{kv}}$ (with the same number of channels $d$ for simplicity of exposition), $T$ is the weight tensor to compute the so-called query, key and value embeddings, and $L$ is a projection matrix. The output is the same size as the query sequence. To fix the vocabulary before giving details, multi-head self-attention (mh-s-attn) is the special case $X_{\mathrm{q}} = X_{\mathrm{kv}}$, i.e.
$$\mathrm{mh\text{-}s\text{-}attn}(X, T, L) = \mathrm{mh\text{-}attn}(X, X, T, L).$$
【Analysis】This defines the general form of multi-head attention. $X_{\mathrm{q}}$ is the query sequence (in DETR, either the object queries or the image features) and $X_{\mathrm{kv}}$ the key-value sequence (the image feature representation). The shape $M \times 3 \times d' \times d$ of the weight tensor $T$ reflects its structure: $M$ heads, each with 3 sets of weights (for Query, Key and Value), each mapping the $d$-dimensional input to $d'$ dimensions. Since $d' = d/M$, the total parameter count stays constant; one large attention head is simply split into several smaller ones. Self-attention is the special case where the same sequence serves as both queries and keys/values, so every element can interact with every element, including itself.
The multi-head attention is simply the concatenation of $M$ single attention heads followed by a projection with $L$. The common practice [ 47 ] is to use residual connections, dropout and layer normalization. In other words, denoting $\tilde{X}_{\mathrm{q}} = \mathrm{mh\text{-}attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T, L)$ and $X'_{\mathrm{q}}$ the concatenation of attention heads, we have
$$\begin{aligned} X'_{\mathrm{q}} &= [\mathrm{attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T_{1}); \ldots; \mathrm{attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T_{M})] \\ \tilde{X}_{\mathrm{q}} &= \mathrm{layernorm}\big(X_{\mathrm{q}} + \mathrm{dropout}(L X'_{\mathrm{q}})\big), \end{aligned}$$
where $[;]$ denotes concatenation on the channel axis.
[Analysis] This is the multi-head attention pipeline. First, the $M$ heads are computed in parallel, each with its own weights $T_i$. Second, the head outputs are concatenated along the channel dimension to form $X'_{\mathrm{q}}$. Third, the linear projection $L$ maps the concatenated features back to the original dimension. The last step is the standard Transformer post-processing: dropout to reduce overfitting, a residual connection to keep gradients flowing, and layer normalization to stabilize training. This combination of parallel heads, concatenation, projection, residual connection and normalization gives the model expressive power while keeping optimization stable.
Single head An attention head with weight tensor $T' \in \mathbb{R}^{3 \times d' \times d}$, denoted by $\mathrm{attn}(X_{\mathrm{q}}, X_{\mathrm{kv}}, T')$, depends on additional positional encodings $P_{\mathrm{q}} \in \mathbb{R}^{d \times N_{\mathrm{q}}}$ and $P_{\mathrm{kv}} \in \mathbb{R}^{d \times N_{\mathrm{kv}}}$. It starts by computing the so-called query, key and value embeddings after adding the query and key positional encodings [ 7 ]:
[Analysis] This introduces the computation inside a single attention head. The weight tensor $T'$ has shape $3 \times d' \times d$ because it holds three linear maps, one each for producing the query, key and value embeddings. The positional encodings $P_{\mathrm{q}}$ and $P_{\mathrm{kv}}$ are how DETR adapts attention to image data: the standard Transformer was designed for sequences, whereas image features have a 2D structure, so positional information must be injected into the representations before attention is computed in order for the model to reason about spatial relations. The way the encodings are added mirrors the original Transformer's treatment of text sequences, adapted to two spatial dimensions.
$$[Q; K; V] = [T'_1 (X_{\mathrm{q}} + P_{\mathrm{q}});\ T'_2 (X_{\mathrm{kv}} + P_{\mathrm{kv}});\ T'_3 X_{\mathrm{kv}}]$$
where $T'$ is the concatenation of $T'_1$, $T'_2$ and $T'_3$. The attention weights $\alpha$ are then computed based on the softmax of dot products between queries and keys, so that each element of the query sequence attends to all elements of the key-value sequence ($i$ is a query index and $j$ a key-value index):
[Analysis] This shows how Q, K and V are computed. Note that the value embedding does not receive a positional encoding; this is deliberate: positional information is used to decide where attention should look (through Q and K), not to modify the content being attended to (V). $T'_1$, $T'_2$ and $T'_3$ project the input features into query, key and value spaces in which similarities are easier to compute and useful information easier to extract. Intuitively, the query encodes "what I am looking for", the key "what I can offer", and the value "what I actually contain".
$$\alpha_{i,j} = \frac{e^{\frac{1}{\sqrt{d'}} Q_i^{T} K_j}}{Z_i} \quad \text{where} \quad Z_i = \sum_{j=1}^{N_{\mathrm{kv}}} e^{\frac{1}{\sqrt{d'}} Q_i^{T} K_j}.$$
[Analysis] This is the standard scaled dot-product attention. The factor $\frac{1}{\sqrt{d'}}$ keeps the dot products from becoming so large that the softmax saturates: when $d'$ is large, the variance of the dot product of two random vectors grows, which can make the softmax output extremely peaked and the gradients vanish; dividing by $\sqrt{d'}$ keeps the logits in a reasonable range. $Z_i$ is the normalization constant ensuring that, for each query position $i$, the attention weights $\alpha_{i,j}$ sum to 1 over all key-value positions $j$. The softmax implements a soft selection: rather than picking a single position, it assigns each position a weight reflecting its relevance to the current query.
In our case, the positional encodings may be learnt or fixed, but are shared across all attention layers for a given query/key-value sequence, so we do not explicitly write them as parameters of the attention. We give more details on their exact value when describing the encoder and the decoder. The final output is the aggregation of values weighted by attention weights: the $i$-th row is given by $\mathrm{attn}_i(X_{\mathrm{q}}, X_{\mathrm{kv}}, T') = \sum_{j=1}^{N_{\mathrm{kv}}} \alpha_{i,j} V_j$.
[Analysis] Sharing the positional encodings across layers simplifies the architecture: each layer does not need to learn its own encoding. The final attention output is a weighted average: for each query position $i$, the value vectors $V_j$ are summed with weights $\alpha_{i,j}$. This can be read as information aggregation: according to what the query asks for, the model extracts and combines information from all available positions, with the most relevant positions contributing most. Soft attention lets the model attend to several relevant positions at once instead of committing to a single one.
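To make the equations above concrete, here is a minimal PyTorch sketch of a single attention head (positional encodings added to queries and keys but not values, scaled dot-product softmax, weighted sum of values) and of the multi-head wrapper (concatenation of the $M$ heads, projection $L$, dropout, residual connection and layer normalization). The sequence-first (N, d) layout, the random initialization and the function names are assumptions of this sketch; DETR itself relies on PyTorch's standard transformer modules.

import math
import torch
from torch import nn
import torch.nn.functional as F

def single_head_attn(X_q, X_kv, T_prime, P_q, P_kv):
    """One attention head. X_q: (N_q, d), X_kv: (N_kv, d), T_prime: (3, d', d),
    P_q/P_kv: positional encodings with the same shapes as X_q/X_kv."""
    T1, T2, T3 = T_prime                        # each (d', d)
    Q = (X_q + P_q) @ T1.t()                    # (N_q, d')  queries, with positions
    K = (X_kv + P_kv) @ T2.t()                  # (N_kv, d') keys, with positions
    V = X_kv @ T3.t()                           # (N_kv, d') values, no positions
    alpha = F.softmax(Q @ K.t() / math.sqrt(Q.shape[-1]), dim=-1)  # (N_q, N_kv)
    return alpha @ V                            # (N_q, d')

class MultiHeadAttention(nn.Module):
    """Concatenate M heads, project with L, then dropout + residual + layernorm."""
    def __init__(self, d, M, dropout=0.1):
        super().__init__()
        assert d % M == 0
        self.T = nn.Parameter(torch.randn(M, 3, d // M, d) / math.sqrt(d))
        self.L = nn.Linear(d, d)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d)

    def forward(self, X_q, X_kv, P_q, P_kv):
        heads = [single_head_attn(X_q, X_kv, T_m, P_q, P_kv) for T_m in self.T]
        X_q_prime = torch.cat(heads, dim=-1)    # (N_q, d)
        return self.norm(X_q + self.dropout(self.L(X_q_prime)))

# Toy shapes: 100 object queries (initialized to zero, like DETR's decoder input)
# attending to 850 image features of dimension d = 256 with M = 8 heads.
d, M, N_q, N_kv = 256, 8, 100, 850
mha = MultiHeadAttention(d, M)
X_q, X_kv = torch.zeros(N_q, d), torch.randn(N_kv, d)
P_q, P_kv = torch.randn(N_q, d), torch.randn(N_kv, d)
out = mha(X_q, X_kv, P_q, P_kv)                 # (100, 256)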
Feed-forward network (FFN) layers The original transformer alternates multi-head attention and so-called FFN layers [ 47 ], which are effectively multilayer 1x1 convolutions, with $Md$ input and output channels in our case. The FFN we consider is composed of two layers of 1x1 convolutions with ReLU activations. There is also a residual connection/dropout/layernorm after the two layers, similarly to equation 6.
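As a sketch of this FFN block: two per-position linear layers (equivalent to 1x1 convolutions over the sequence) with a ReLU in between, followed by dropout, a residual connection and layer normalization. The hidden width and the dropout rate are illustrative assumptions of this sketch.

import torch
from torch import nn

class FFN(nn.Module):
    """Two per-position linear layers with ReLU, then dropout + residual + layernorm."""
    def __init__(self, d, d_hidden=2048, dropout=0.1):  # d_hidden is an assumed width
        super().__init__()
        self.linear1 = nn.Linear(d, d_hidden)
        self.linear2 = nn.Linear(d_hidden, d)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                        # x: (N, d)
        y = self.linear2(torch.relu(self.linear1(x)))
        return self.norm(x + self.dropout(y))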
For completeness, we present in detail the losses used in our approach. All losses are normalized by the number of objects inside the batch. Extra care must be taken for distributed training: since each GPU receives a sub-batch, it is not sufficient to normalize by the number of objects in the local batch, because the sub-batches are in general not balanced across GPUs. Instead, it is important to normalize by the total number of objects in all sub-batches.
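A minimal sketch of this normalization under torch.distributed, where the object counts of all sub-batches are summed with an all_reduce before dividing; the helper name and its usage are illustrative, not the reference implementation.

import torch
import torch.distributed as dist

def global_num_objects(num_local_objects):
    """Total number of ground-truth objects across all sub-batches.
    Each GPU only sees its own sub-batch, so the local count is not enough."""
    n = torch.as_tensor([float(num_local_objects)])
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(n)                       # sum the counts of every worker
    return torch.clamp(n, min=1).item()

# loss = per_object_losses.sum() / global_num_objects(num_local_objects)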
Box loss Similarly to [ 41 , 36 ], we use a soft version of Intersection over Union in our loss, together with an $\ell_1$ loss on $\hat{b}$:
$$\mathcal{L}_{\mathrm{box}}(b_{\sigma(i)}, \hat{b}_i) = \lambda_{\mathrm{iou}} \mathcal{L}_{\mathrm{iou}}(b_{\sigma(i)}, \hat{b}_i) + \lambda_{\mathrm{L1}} \|b_{\sigma(i)} - \hat{b}_i\|_1,$$
where $\lambda_{\mathrm{iou}}, \lambda_{\mathrm{L1}} \in \mathbb{R}$ are hyperparameters and $\mathcal{L}_{\mathrm{iou}}(\cdot)$ is the generalized IoU [ 38 ]:
$$\mathcal{L}_{\mathrm{iou}}(b_{\sigma(i)}, \hat{b}_i) = 1 - \left( \frac{|b_{\sigma(i)} \cap \hat{b}_i|}{|b_{\sigma(i)} \cup \hat{b}_i|} - \frac{|B(b_{\sigma(i)}, \hat{b}_i) \setminus b_{\sigma(i)} \cup \hat{b}_i|}{|B(b_{\sigma(i)}, \hat{b}_i)|} \right).$$
$|.|$ means "area", and the union and intersection of box coordinates are used as shorthands for the boxes themselves. The areas of unions or intersections are computed by min/max of the linear functions of $b_{\sigma(i)}$ and $\hat{b}_i$, which makes the loss sufficiently well-behaved for stochastic gradients. $B(b_{\sigma(i)}, \hat{b}_i)$ means the largest box containing $b_{\sigma(i)}, \hat{b}_i$ (the areas involving $B$ are also computed based on min/max of linear functions of the box coordinates).
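As a sketch of this box loss, using torchvision's generalized_box_iou for the GIoU term; the (x1, y1, x2, y2) box format and the reduction over matched pairs are assumptions of this sketch (the default weights are the values $\lambda_{\mathrm{L1}} = 5$ and $\lambda_{\mathrm{iou}} = 2$ reported in the training details below).

import torch
from torchvision.ops import generalized_box_iou

def box_loss(src_boxes, tgt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """Box loss for matched pairs, both of shape (num_matched, 4) in
    (x1, y1, x2, y2) format; normalization by the number of objects is done outside."""
    loss_l1 = (src_boxes - tgt_boxes).abs().sum(dim=-1)            # l1 term per pair
    giou = torch.diag(generalized_box_iou(src_boxes, tgt_boxes))   # keep matched pairs only
    return (lambda_iou * (1.0 - giou) + lambda_l1 * loss_l1).sum()

pred = torch.tensor([[0.10, 0.10, 0.50, 0.50]])
gt = torch.tensor([[0.00, 0.00, 0.40, 0.40]])
loss = box_loss(pred, gt)                                          # scalar tensor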
DICE/F-1 loss [ 28 ] The DICE coefficient is closely related to the Intersection over Union. If we denote by $\hat{m}$ the raw mask logits prediction of the model, and $m$ the binary target mask, the loss is defined as:
$$\mathcal{L}_{\mathrm{DICE}}(m, \hat{m}) = 1 - \frac{2 m \sigma(\hat{m}) + 1}{\sigma(\hat{m}) + m + 1}$$
where $\sigma$ is the sigmoid function. This loss is normalized by the number of objects.
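The DICE loss translates directly into a few lines; summing the numerator and denominator over the mask pixels is the usual reduction and is assumed here, as are the tensor shapes.

import torch

def dice_loss(mask_logits, target_masks, num_objects):
    """mask_logits, target_masks: (num_masks, H, W); target masks are binary.
    Returns the DICE loss normalized by the number of objects."""
    probs = mask_logits.sigmoid().flatten(1)        # (num_masks, H*W)
    target = target_masks.flatten(1).float()
    numerator = 2 * (probs * target).sum(dim=1) + 1
    denominator = probs.sum(dim=1) + target.sum(dim=1) + 1
    return (1 - numerator / denominator).sum() / num_objects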
A.3 Detailed architecture
The detailed description of the transformer used in DETR, with positional encodings passed at every attention layer, is given in Fig. 10 . Image features from the CNN backbone are passed through the transformer encoder, together with spatial positional encodings that are added to queries and keys at every multi-head self-attention layer. Then, the decoder receives queries (initially set to zero), output positional encodings (object queries), and encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multi-head self-attention and decoder-encoder attention layers. The first self-attention layer in the first decoder layer can be skipped.
Fig. 10: Architecture of DETR's transformer. See Section A.3 for details.
Computational complexity Every self-attention in the encoder has complexity $\mathcal{O}(d^2 HW + d(HW)^2)$: $\mathcal{O}(d'd)$ is the cost of computing a single query/key/value embedding (and $Md' = d$), while $\mathcal{O}(d'(HW)^2)$ is the cost of computing the attention weights for one head. Other computations are negligible. In the decoder, each self-attention is in $\mathcal{O}(d^2 N + dN^2)$, and cross-attention between encoder and decoder is in $\mathcal{O}(d^2(N + HW) + dNHW)$, which is much lower than the encoder since $N \ll HW$ in practice.
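As a back-of-the-envelope check of these complexities, with illustrative values $d = 256$, a $25 \times 34$ feature map (so $HW = 850$) and $N = 100$ queries; these particular sizes are assumptions, not taken from the text.

d, HW, N = 256, 25 * 34, 100              # illustrative sizes, not from the paper

enc_self = d**2 * HW + d * HW**2           # per encoder self-attention layer
dec_self = d**2 * N + d * N**2             # per decoder self-attention layer
dec_cross = d**2 * (N + HW) + d * N * HW   # per decoder cross-attention layer
print(f"encoder self-attn ~{enc_self / 1e6:.0f}M, "
      f"decoder self-attn ~{dec_self / 1e6:.0f}M, "
      f"decoder cross-attn ~{dec_cross / 1e6:.0f}M operations")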
FLOPS computation Since the FLOPS for Faster R-CNN depend on the number of proposals in the image, we report the average number of FLOPS for the first 100 images in the COCO 2017 validation set. We compute the FLOPS with the flop_count_operators tool from Detectron2 [ 50 ]. We use it without modifications for Detectron2 models, and extend it to take batch matrix multiply ( bmm ) into account for DETR models.
We train DETR using AdamW [ 26 ] with improved weight decay handling, set to $10^{-4}$. We also apply gradient clipping, with a maximal gradient norm of 0.1. The backbone and the transformer are treated slightly differently; we now discuss the details for both.
Backbone The ImageNet-pretrained ResNet-50 backbone is imported from Torchvision, discarding the last classification layer. Backbone batch normalization weights and statistics are frozen during training, following widely adopted practice in object detection. We fine-tune the backbone with a learning rate of $10^{-5}$. We observe that having the backbone learning rate roughly an order of magnitude smaller than the rest of the network is important to stabilize training, especially in the first few epochs.
Transformer We train the transformer with a learning rate of $10^{-4}$. Additive dropout of 0.1 is applied after every multi-head attention and FFN before layer normalization. The weights are randomly initialized with Xavier initialization.
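The optimizer settings above can be summarized as two parameter groups plus gradient clipping; the name-based test used to single out the backbone parameters is an assumption of this sketch.

import torch

def build_optimizer(model, lr=1e-4, lr_backbone=1e-5, weight_decay=1e-4):
    """AdamW with a 10x smaller learning rate for the backbone parameters."""
    backbone_params = [p for n, p in model.named_parameters()
                       if "backbone" in n and p.requires_grad]
    other_params = [p for n, p in model.named_parameters()
                    if "backbone" not in n and p.requires_grad]
    return torch.optim.AdamW(
        [{"params": other_params, "lr": lr},
         {"params": backbone_params, "lr": lr_backbone}],
        weight_decay=weight_decay)

# Inside the training loop, clip gradients before the optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)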
Losses We use a linear combination of $\ell_1$ and GIoU losses for bounding box regression, with weights $\lambda_{\mathrm{L1}} = 5$ and $\lambda_{\mathrm{iou}} = 2$ respectively. All models were trained with $N = 100$ decoder query slots.
Baseline Our enhanced Faster-RCNN$^+$ baselines use the GIoU [ 38 ] loss along with the standard $\ell_1$ loss for bounding box regression. We performed a grid search to find the best weights for the losses; the final models use only the GIoU loss, with weights 20 and 1 for the box and proposal regression tasks respectively. For the baselines we adopt the same data augmentation as used in DETR and train with the 9$\times$ schedule (approximately 109 epochs). All other settings are identical to the corresponding models in the Detectron2 model zoo [ 50 ].
Spatial positional encoding Encoder activations are associated with the corresponding spatial positions of the image features. In our model we use a fixed absolute encoding to represent these spatial positions. We adopt a generalization of the original Transformer [ 47 ] encoding to the 2D case [ 31 ]. Specifically, for both spatial coordinates of each embedding we independently use $\frac{d}{2}$ sine and cosine functions with different frequencies. We then concatenate them to get the final $d$-channel positional encoding.
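A sketch of such a fixed 2D encoding: half of the $d$ channels encode the row coordinate and half the column coordinate, each with sines and cosines at different frequencies, and the two halves are concatenated. The temperature of 10000 and the exact frequency schedule are assumptions of this sketch, borrowed from the original Transformer encoding.

import torch

def sine_positional_encoding_2d(H, W, d, temperature=10000.0):
    """Fixed 2D positional encoding of shape (H, W, d): d/2 channels for the
    row coordinate and d/2 for the column coordinate (illustrative sketch)."""
    half = d // 2                                 # channels per spatial coordinate
    dim_t = temperature ** (2 * (torch.arange(half) // 2) / half)     # frequencies
    y = torch.arange(H, dtype=torch.float32)[:, None, None] / dim_t   # (H, 1, half)
    x = torch.arange(W, dtype=torch.float32)[None, :, None] / dim_t   # (1, W, half)
    pos_y = torch.stack((y[..., 0::2].sin(), y[..., 1::2].cos()), dim=-1).flatten(-2)
    pos_x = torch.stack((x[..., 0::2].sin(), x[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat([pos_y.expand(H, W, half), pos_x.expand(H, W, half)], dim=-1)

pos = sine_positional_encoding_2d(H=25, W=34, d=256)   # matches a 256-channel feature map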
A.5 Additional results
Some extra qualitative results for the panoptic prediction of the DETR-R101 model are shown in Fig. 11 .
Fig. 11 (a): Failure case with overlapping objects. PanopticFPN misses one plane entirely, while DETR fails to accurately segment 3 of them.
Increasing the number of instances By design, DETR cannot predict more objects than it has query slots, i.e. 100 in our experiments. In this section, we analyze the behavior of DETR when approaching this limit. We select a canonical square image of a given class, repeat it on a $10\times10$ grid, and compute the percentage of instances that are missed by the model. To test the model with fewer than 100 instances, we randomly mask some of the cells. This ensures that the absolute size of the objects is the same no matter how many are visible. To account for the randomness in the masking, we repeat the experiment 100 times with different masks. The results are shown in Fig. 12 . The behavior is similar across classes, and while the model detects all instances when up to 50 are visible, it then starts saturating and misses more and more instances. Notably, when the image contains all 100 instances, the model only detects 30 on average, which is fewer than if the image contains only 50 instances that are all detected. The counter-intuitive behavior of the model is likely because the images and the detections are far from the training distribution.
Note that this test is, by design, a test of out-of-distribution generalization, since there are very few example images with a lot of instances of a single class. It is difficult to disentangle, from the experiment, two types of out-of-domain generalization: the image itself vs. the number of objects per class. But since few, if any, COCO images contain only many objects of the same class, this type of experiment represents our best effort to understand whether query objects overfit the label and position distribution of the dataset. Overall, the experiment suggests that the model does not overfit on these distributions, since it yields near-perfect detections up to 50 objects.
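For concreteness, here is a sketch of how such a synthetic grid image can be assembled; the instance crop, its resolution and the use of a black fill for masked cells are illustrative assumptions, since the text does not specify these details.

import torch

def make_grid_image(instance, num_visible, grid=10):
    """Tile a square instance crop on a grid x grid canvas and blank out random
    cells so that exactly num_visible instances remain."""
    c, s, _ = instance.shape                     # instance: (3, s, s) image crop
    canvas = instance.repeat(1, grid, grid)      # (3, grid*s, grid*s), all cells filled
    keep = torch.randperm(grid * grid)[:num_visible]
    visible = torch.zeros(grid * grid, dtype=torch.bool)
    visible[keep] = True
    for idx in range(grid * grid):
        if not visible[idx]:
            r, col = divmod(idx, grid)
            canvas[:, r * s:(r + 1) * s, col * s:(col + 1) * s] = 0  # blank masked cell
    return canvas

img = make_grid_image(torch.rand(3, 64, 64), num_visible=50)   # 50 visible instances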
Fig. 12: Analysis of the number of instances of various classes missed by DETR depending on how many are present in the image. We report the mean and the standard deviation. As the number of instances gets close to 100, DETR starts saturating and misses more and more objects.
To demonstrate the simplicity of the approach, we include inference code with PyTorch and Torchvision libraries in Listing 1 . The code runs with Python 3.6+, PyTorch 1.4 and Torchvision 0.5. Note that it does not support batching, hence it is suitable only for inference or training with DistributedDataParallel with one image per GPU. Also note that for clarity, this code uses learnt positional encodings in the encoder instead of fixed, and positional encodings are added to the input only instead of at each transformer layer. Making these changes requires going beyond the PyTorch implementation of transformers, which hampers readability. The entire code to reproduce the experiments will be made available before the conference.
import torch
from torch import nn
from torchvision.models import resnet50

class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim, nheads,
                 num_encoder_layers, num_decoder_layers):
        super().__init__()
        # We take only convolutional layers from ResNet-50 model
        self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
        # 1x1 convolution reducing the 2048 backbone channels to the transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # Prediction heads: class logits (+1 for the "no object" class) and box coordinates
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # 100 learnt object queries and learnt row/column positional embeddings
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)
        h = self.conv(x)
        H, W = h.shape[-2:]
        # Build a (H*W, 1, hidden_dim) positional encoding from the row/column embeddings
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # Flatten the feature map into a sequence and run the transformer
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1))
        return self.linear_class(h), self.linear_bbox(h).sigmoid()

detr = DETR(num_classes=91, hidden_dim=256, nheads=8,
            num_encoder_layers=6, num_decoder_layers=6)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
Listing 1: DETR PyTorch inference code. For clarity it uses learnt positional encodings in the encoder instead of fixed, and positional encodings are added to the input only instead of at each transformer layer. Making these changes requires going beyond PyTorch implementation of transformers, which hampers readability. The entire code to reproduce the experiments will be made available before the conference.