(RT-DETR) DETRs Beat YOLOs on Real-time Object Detection: A Close Reading (Paragraph-by-Paragraph Analysis)

1 Baidu Inc., Beijing, China

2 School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China

2024

Abstract

The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps, drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy. Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR .

[Analysis] The abstract opens with the current state of real-time object detection and its main problem. The YOLO series performs well in real-time detection but has a key bottleneck: the non-maximum suppression (NMS) post-processing step slows down overall inference and affects accuracy. Conventional detectors produce many overlapping candidate boxes that NMS must filter into the final result, a process that costs time and requires manually tuned thresholds.

To address this, researchers turned to Transformer-based end-to-end detectors (DETRs), which output the final detections directly and need no NMS. However, DETRs are too computationally expensive to meet real-time speed requirements. This creates a contradiction: in principle they avoid the drawbacks of NMS, but in practice they are too slow to be used.

The core of RT-DETR is a two-step optimization strategy. The first step targets speed: a hybrid encoder processes multi-scale features efficiently by handling feature interaction within a scale and feature fusion across scales separately, which substantially reduces computation. The second step targets accuracy: uncertainty-minimal query selection supplies higher-quality initial queries to the decoder. Speed is improved first while accuracy is held, and accuracy is then improved while speed is held.

RT-DETR also supports flexible speed tuning. Adjusting the number of decoder layers moves the model between speed-accuracy trade-off points without retraining, so a single model can serve different application scenarios. The reported results show RT-DETR surpassing earlier YOLO detectors in both accuracy and speed, which makes end-to-end real-time detection practical.

1. Introduction

Real-time object detection is an important area of research and has a wide range of applications, such as object tracking [ 43 ], video surveillance [ 28 ], and autonomous driving [ 2 ], etc. Existing real-time detectors generally adopt the CNN-based architecture, the most famous of which is the YOLO detectors [ 1 , 10 -- 12 , 15 , 16 , 25 , 30 , 38 , 40 ] due to their reasonable trade-off between speed and accuracy. However, these detectors typically require Non-Maximum Suppression (NMS) for post-processing, which not only slows down the inference speed but also introduces hyperparameters that cause instability in both the speed and accuracy. Moreover, considering that different scenarios place different emphasis on recall and accuracy, it is necessary to carefully select the appropriate NMS thresholds, which hinders the development of real-time detectors.

[Analysis] This paragraph first establishes why real-time detection matters and where it is used. Real-time detection requires recognizing and localizing objects within a very short time budget, and the YOLO series became its representative because it reaches high throughput while keeping accuracy. The paragraph then exposes the core problem: NMS post-processing is a performance bottleneck. NMS filters the final result out of many overlapping candidate boxes, but it costs extra computation and requires hand-set hyperparameters such as the confidence threshold and the IoU threshold. These choices directly affect detection quality, and different application scenarios may need different settings, which adds complexity and instability to the system.

Figure 1. Compared to previously advanced real-time object detectors, our RT-DETR achieves state-of-the-art performance.

Recently, the end-to-end Transformer-based detectors (DETRs) [ 4 , 17 , 23 , 27 , 36 , 39 , 44 , 45 ] have received extensive attention from the academia due to their streamlined architecture and elimination of hand-crafted components. However, their high computational cost prevents them from meeting real-time detection requirements, so the NMS-free architecture does not demonstrate an inference speed advantage. This inspires us to explore whether DETRs can be extended to real-time scenarios and outperform the advanced YOLO detectors in both speed and accuracy, eliminating the delay caused by NMS for real-time object detection.

[Analysis] This paragraph introduces the second line of work: Transformer-based end-to-end detectors. The main strength of the DETR family is its end-to-end design, which predicts the final detections directly from the input image with no complex post-processing. It removes hand-designed components such as anchor generation and NMS, so the detection pipeline is simpler. The problem is computational cost. Self-attention scales quadratically with sequence length, so DETRs run slowly in practice and cannot meet real-time requirements. As a result, although DETR has the cleaner architecture, its NMS-free design yields no speed advantage, and in practical speed it falls behind the YOLO series.

To achieve the above goal, we rethink DETRs and conduct detailed analysis of key components to reduce unnecessary computational redundancy and further improve accuracy. For the former, we observe that although the introduction of multi-scale features is beneficial in accelerating the training convergence [ 45 ], it leads to a significant increase in the length of the sequence feed into the encoder. The high computational cost caused by the interaction of multi-scale features makes the Transformer encoder the computational bottleneck. Therefore, implementing the real-time DETR requires a redesign of the encoder. And for the latter, previous works [ 42 , 44 , 45 ] show that the hard-to-optimize object queries hinder the performance of DETRs and propose the query selection schemes to replace the vanilla learnable embeddings with encoder features. However, we observe that the current query selection directly adopt classification scores for selection, ignoring the fact that the detector are required to simultaneously model the category and location of objects, both of which determine the quality of the features. This inevitably results in encoder features with low localization confidence being selected as initial queries, thus leading to a considerable level of uncertainty and hurting the performance of DETRs. We view query initialization as a breakthrough to further improve performance.

[Analysis] This paragraph analyzes two core problems of DETRs and outlines how to address them. The first is computational efficiency. Multi-scale features help the network detect objects of different sizes, but they greatly lengthen the input sequence. Because the cost of self-attention grows with the square of the sequence length, a longer sequence sharply increases computation, which makes the encoder the bottleneck of the whole network and hurts inference speed. The second is query initialization. In DETR, object queries are the key component for localizing and recognizing objects. The original approach uses learnable embeddings as queries, which are hard to optimize. Later work initializes queries from encoder features, but existing methods select them by classification confidence alone and ignore localization quality. Features with high classification confidence but poor localization can therefore be selected as initial queries, which introduces considerable uncertainty into the subsequent optimization and hurts final detection performance.
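The effect of multi-scale features on sequence length can be made concrete with a little arithmetic. The sketch below is an illustration only and rests on assumptions that are not stated in this part of the paper: a 640×640 input, feature strides of 8, 16 and 32 for the three levels, and dense (vanilla) self-attention whose score matrix has one entry per token pair. Deformable attention is cheaper than this, so the ratio is an upper-bound style illustration, not a measured cost. The function name `tokens_per_level` is hypothetical.

```python
def tokens_per_level(image_size, strides=(8, 16, 32)):
    """Number of tokens (spatial positions) at each feature level.

    Assumes a square input and the given strides (an assumption for
    illustration, not a value taken from the paper).
    """
    return [(image_size // s) ** 2 for s in strides]


levels = tokens_per_level(640)            # tokens at the three levels
multi_scale_len = sum(levels)             # length of the concatenated sequence
top_level_len = levels[-1]                # tokens at the coarsest level only

# Dense self-attention compares every token with every token,
# so the score matrix has length**2 entries.
multi_scale_pairs = multi_scale_len ** 2
top_level_pairs = top_level_len ** 2

print(levels, multi_scale_len, multi_scale_pairs // top_level_pairs)
```

Under these assumptions the concatenated sequence has 8400 tokens against 400 for the coarsest level alone, and the dense attention matrix is 441 times larger, which is why the encoder dominates the cost once all scales are fed into it together.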

In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge. To expeditiously process multi-scale features, we design an efficient hybrid encoder to replace the vanilla Transformer encoder, which significantly improves inference speed by decoupling the intra-scale interaction and cross-scale fusion of features with different scales. To avoid encoder features with low localization confidence being selected as object queries, we propose the uncertainty-minimal query selection, which provides high-quality initial queries to the decoder by explicitly optimizing the uncertainty, thereby increasing the accuracy. Furthermore, RT-DETR supports flexible speed tuning to accommodate various real-time scenarios without retraining, thanks to the multi-layer decoder architecture of DETR.

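[Analysis] This paragraph summarizes the technical contributions. RT-DETR is presented as the first end-to-end detector that actually runs in real time: earlier DETRs have the cleaner architecture but are too expensive, while YOLO is fast but depends on NMS. RT-DETR resolves this with three ideas. The first is the hybrid encoder. A conventional multi-scale Transformer encoder concatenates the features of all scales into one long sequence and runs attention over it, which sharply increases cost. RT-DETR splits this into two separate steps, feature interaction within a scale followed by fusion across scales, which greatly reduces computation. The second is query selection. Earlier methods choose initial queries by classification confidence only, but detection requires accurate localization and classification at the same time, so RT-DETR selects queries while accounting for the uncertainty between the two, which yields higher-quality initial queries. The third is flexible speed tuning: adjusting the number of decoder layers moves the model between speed-accuracy trade-off points, which matters in deployment.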

RT-DETR achieves an ideal trade-off between the speed and accuracy. Specifically, RT-DETR-R50 achieves 53.1% AP on COCO val2017 and 108 FPS on T4 GPU, while RT-DETR-R101 achieves 54.3% AP and 74 FPS, outperforming $L$ and $X$ models of previously advanced YOLO detectors in both speed and accuracy, Figure 1 . We also develop scaled RT-DETRs by scaling the encoder and decoder with smaller backbones, which outperform the lighter YOLO detectors ( $S$ and $M$ models). Furthermore, RT-DETR-R50 outperforms DINO-Deformable-DETR-R50 by 2.2% AP (53.1% AP vs 50.9% AP) in accuracy and by about 21 times in FPS (108 FPS vs 5 FPS), significantly improves accuracy and speed of DETRs. After pre-training with Objects365 [ 35 ], RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP, resulting in surprising performance improvements. More experimental results are provided in the Appendix.

[Analysis] The numbers show where RT-DETR stands. On accuracy, RT-DETR-R50 reaches 53.1% AP, above the 50.9% AP of DINO-Deformable-DETR-R50. On speed, 108 FPS and 74 FPS are far beyond earlier DETRs (about 5 FPS for DINO-Deformable-DETR-R50 in the same comparison), which is what makes real-time use possible. Scaled-down variants also remain competitive with the lighter YOLO models.

The main contributions are summarized as: (i). We propose the first real-time end-to-end object detector called RT-DETR, which not only outperforms the previously advanced YOLO detectors in both speed and accuracy but also eliminates the negative impact caused by NMS post-processing on real-time object detection; (ii). We quantitatively analyze the impact of NMS on the speed and accuracy of YOLO detectors, and establish an end-to-end speed benchmark to test the end-to-end inference speed of real-time detectors; (iii). The proposed RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to accommodate various scenarios without retraining.

[Analysis] This paragraph states the three contributions of the paper. The first is the detector itself: RT-DETR is the first end-to-end object detector to run in real time. The second is the evaluation protocol. Earlier work often ignored the cost of NMS and reported only the inference time of the model, which made speed comparisons unfair. The third is a deployment-oriented design: tuning speed through the number of decoder layers is practical because different scenarios place different demands on speed and accuracy, and this lets one model cover several of them.

2.1. Real-time Object Detectors

YOLOv1 [ 31 ] is the first CNN-based one-stage object detector to achieve true real-time object detection. Through years of continuous development, the YOLO detectors have outperformed other one-stage object detectors [ 21 , 24 ] and become the synonymous with the real-time object detector. YOLO detectors can be classified into two categories: anchor-based [ 1 , 11 , 15 , 25 , 29 , 30 , 37 , 38 ] and anchor-free [ 10 , 12 , 16 , 40 ], which achieve a reasonable trade-off between speed and accuracy and are widely used in various practical scenarios. These advanced real-time detectors produce numerous overlapping boxes and require NMS post-processing, which slows down their speed.

[Analysis] Despite their speed advantage, YOLO detectors share one problem: the network produces several predictions around each possible location, and these boxes overlap heavily, so NMS is needed to pick the final detections. NMS removes redundant boxes effectively, but it is an extra post-processing step whose cost becomes a bottleneck for overall detection speed.

2.2. End-to-end Object Detectors

End-to-end object detectors are well-known for their streamlined pipelines. Carion et al. [ 4 ] first propose the end-to-end detector based on Transformer called DETR, which has attracted extensive attention due to its distinctive features. Particularly, DETR eliminates the hand-crafted anchor and NMS components. Instead, it employs bipartite matching and directly predicts the one-to-one object set. Despite its obvious advantages, DETR suffers from several problems: slow training convergence, high computational cost, and hard-to-optimize queries. Many DETR variants have been proposed to address these issues. Accelerating convergence. Deformable-DETR [ 45 ] accelerates training convergence with multi-scale features by enhancing the efficiency of the attention mechanism. DAB-DETR [ 23 ] and DN-DETR [ 17 ] further improve performance by introducing the iterative refinement scheme and denoising training. Group-DETR [ 5 ] introduces group-wise one-to-many assignment. Reducing computational cost. Efficient DETR [ 42 ] and Sparse DETR [ 33 ] reduce the computational cost by reducing the number of encoder and decoder layers or the number of updated queries. Lite DETR [ 18 ] enhances the efficiency of encoder by reducing the update frequency of low-level features in an interleaved way. Optimizing query initialization. Conditional DETR [ 27 ] and Anchor DETR [ 39 ] decrease the optimization difficulty of the queries. Zhu et al. [ 45 ] propose the query selection for two-stage DETR, and DINO [ 44 ] suggests the mixed query selection to help better initialize queries. Current DETRs are still computationally intensive and are not designed to detect in real time. Our RT-DETR vigorously explores computational cost reduction and attempts to optimize query initialization, outperforming state-of-the-art real-time detectors.

[Analysis] This paragraph reviews how the DETR family developed and what it still struggles with. DETR recasts object detection as a set prediction problem: bipartite matching builds a one-to-one correspondence between predictions and ground-truth objects, which removes the anchor design and NMS post-processing of conventional detectors. In practice DETR exposes three problems. The first is convergence speed: DETR needs a much longer training schedule to converge. The second is computational cost, which rises sharply with the number of image features the attention has to process. The third is query optimization: object queries that start without any prior information make optimization difficult and unstable. The community has proposed remedies for each. For convergence, Deformable-DETR reduces the number of feature points each query attends to through deformable attention, and DAB-DETR and DN-DETR add iterative refinement and denoising training. For cost, lighter DETR variants reduce the number of layers or updated queries, or change how often features are updated. For query initialization, several methods initialize queries from more meaningful features. These improvements help, but existing DETR variants still do not reach real-time speed, which is the problem RT-DETR sets out to solve.

Figure 2. The number of boxes at different confidence thresholds.

3. End-to-end Speed of Detectors

3.1. Analysis of NMS

NMS is a widely used post-processing algorithm in object detection, employed to eliminate overlapping output boxes. Two thresholds are required in NMS: confidence threshold and IoU threshold. Specifically, the boxes with scores below the confidence threshold are directly filtered out, and whenever the IoU of any two boxes exceeds the IoU threshold, the box with the lower score will be discarded. This process is performed iteratively until all boxes of every category have been processed. Thus, the execution time of NMS primarily depends on the number of boxes and two thresholds. To verify this observation, we leverage YOLOv5 [ 11 ] (anchor-based) and YOLOv8 [ 12 ] (anchor-free) for analysis.

[Analysis] NMS filters the final result with two parameters: the confidence threshold discards unlikely predictions, and the IoU threshold decides which of several overlapping boxes survive. Its cost depends directly on the number of boxes: the more boxes remain, the more IoU comparisons are needed and the longer it runs. Different detector architectures emit different numbers of boxes, which directly changes the cost of NMS.
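The procedure described above can be sketched in a few lines. This is a minimal, single-class greedy NMS in pure Python, written for illustration only: it is not the TensorRT plugin the paper benchmarks, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and default thresholds are hypothetical. A multi-class detector would run this once per category.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thr=0.25, iou_thr=0.7):
    """Greedy NMS for one class; returns indices of the kept boxes."""
    # Step 1: drop boxes whose score is below the confidence threshold.
    candidates = [i for i, s in enumerate(scores) if s >= conf_thr]
    # Step 2: visit the rest from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in candidates:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.001]
print(nms(boxes, scores, conf_thr=0.25, iou_thr=0.5))
print(nms(boxes, scores, conf_thr=0.25, iou_thr=0.7))
```

The inner `all(...)` is where the time goes: every surviving candidate is compared against the boxes already kept, so a lower confidence threshold (more candidates) or a higher IoU threshold (fewer suppressions, hence more kept boxes to compare against) both increase the number of IoU evaluations.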

We first count the number of boxes remaining after filtering the output boxes with different confidence thresholds on the same input. We sample values from 0.001 to 0.25 as confidence thresholds to count the number of remaining boxes of the two detectors and plot them on a bar graph, which intuitively reflects that NMS is sensitive to its hyperparameters, Figure 2 . As the confidence threshold increases, more prediction boxes are filtered out, and the number of remaining boxes that need to calculate IoU decreases, thus reducing the execution time of NMS.

[Analysis] With a low confidence threshold, many low-quality predictions survive and take part in the IoU computation, which greatly increases the work. A higher threshold removes most predictions early and sharply reduces the later IoU computation. This sensitivity means the hyperparameters have to be tuned carefully in practice to balance detection accuracy against computational cost.
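The counting experiment can be mimicked on synthetic data. The sketch below does not reproduce the paper's Figure 2: the scores are random numbers skewed toward low values, the total of 8400 boxes and the intermediate thresholds are assumptions chosen for illustration, and `remaining_boxes` is a hypothetical helper. It only shows the mechanism, namely that the number of boxes handed to the IoU stage shrinks as the confidence threshold rises.

```python
import random


def remaining_boxes(scores, thresholds):
    """For each threshold, count the boxes whose score is at least that threshold."""
    return [sum(1 for s in scores if s >= t) for t in thresholds]


random.seed(0)
# Synthetic confidence scores: raising to a power skews them toward zero,
# loosely imitating a detector where most raw predictions are low-confidence.
scores = [random.random() ** 4 for _ in range(8400)]

# Endpoints follow the range quoted in the text; the values in between are assumed.
thresholds = [0.001, 0.01, 0.05, 0.1, 0.25]
counts = remaining_boxes(scores, thresholds)

for t, c in zip(thresholds, counts):
    print(f"conf >= {t:<5}: {c} boxes remain")
```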

Furthermore, we use YOLOv8 to evaluate the accuracy on the COCO val2017 and test the execution time of the NMS operation under different hyperparameters. Note that the NMS operation we adopt refers to the TensorRT efficientNMSPlugin , which involves multiple kernels, including EfficientNMSFilter , RadixSort , EfficientNMS , etc., and we only report the execution time of the EfficientNMS kernel. We test the speed on T4 GPU with TensorRT FP16, and the input and preprocessing remain consistent. The hyperparameters and the corresponding results are shown in Table 1 . From the results, we can conclude that the execution time of the EfficientNMS kernel increases as the confidence threshold decreases or the IoU threshold increases. The reason is that the high confidence threshold directly filters out more prediction boxes, whereas the high IoU threshold filters out fewer prediction boxes in each round of screening. We also visualize the predictions of YOLOv8 with different NMS thresholds in Appendix. The results show that inappropriate confidence thresholds lead to significant false positives or false negatives by the detector. With a confidence threshold of 0.001 and an IoU threshold of 0.7, YOLOv8 achieves the best AP results, but the corresponding NMS time is at a higher level. Considering that YOLO detectors typically report the model speed and exclude the NMS time, thus an end-to-end speed benchmark needs to be established.

[Analysis] The best detection accuracy tends to require a low confidence threshold and a high IoU threshold, and that setting noticeably increases the cost of NMS. The TensorRT EfficientNMS implementation is optimized, but it cannot remove this cost entirely. More importantly, the experiment exposes a gap in how detectors are evaluated: most work reports only network inference time and leaves out post-processing such as NMS. Such incomplete measurements can mislead, because in deployment the end-to-end detection speed is the metric that matters.

Table 1. The effect of IoU threshold and confidence threshold on accuracy and NMS execution time.

3.2. End-to-end Speed Benchmark

To enable a fair comparison of the end-to-end speed of various real-time detectors, we establish an end-to-end speed benchmark. Considering that the execution time of NMS is influenced by the input, it is necessary to choose a benchmark dataset and calculate the average execution time across multiple images. We choose COCO val2017 [ 20 ] as the benchmark dataset and append the NMS post-processing plugin of TensorRT for YOLO detectors as mentioned above. Specifically, we test the average inference time of the detector according to the NMS thresholds of the corresponding accuracy taken on the benchmark dataset, excluding I/O and MemoryCopy operations. We utilize the benchmark to test the end-to-end speed of anchor-based detectors YOLOv5 [ 11 ] and YOLOv7 [ 38 ], as well as anchor-free detectors PP-YOLOE [ 40 ], YOLOv6 [ 16 ] and YOLOv8 [ 12 ] on T4 GPU with TensorRT FP16. According to the results ( cf. Table 2 ), we conclude that anchor-free detectors outperform anchor-based detectors with equivalent accuracy for YOLO detectors because the former require less NMS time than the latter . The reason is that anchor-based detectors produce more prediction boxes than anchor-free detectors (three times more in our tested detectors).

[Analysis] This paragraph sets up a fairer and more complete way to measure detector speed. Because NMS time varies with the input image, the authors fix COCO val2017 as the common benchmark and average over many images to obtain a stable number. On the implementation side they use the TensorRT NMS plugin, so the measurements reflect a deployment-grade setup. The main finding is that anchor-free detectors are faster end to end than anchor-based detectors of equivalent accuracy. The cause is how the two designs generate predictions. An anchor-based detector places several anchors of different sizes and aspect ratios at every location and makes a classification and regression prediction for each, which yields a large number of output boxes. An anchor-free detector avoids dense anchors and so sends far fewer boxes to post-processing. The broader lesson, which RT-DETR takes further by removing NMS altogether, is that reducing the number of boxes that need post-processing is one of the keys to end-to-end speed.
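The "three times more" figure is consistent with simple counting. The sketch below is a back-of-the-envelope illustration under assumed settings that the text does not state: a 640×640 input, three detection levels with strides 8, 16 and 32, three anchors per location for the anchor-based head, and one prediction per location for the anchor-free head. The helper `num_predictions` is hypothetical.

```python
def num_predictions(image_size, strides=(8, 16, 32), anchors_per_location=1):
    """Raw boxes emitted before NMS, assuming a square input.

    Each level contributes (image_size // stride)**2 locations, and every
    location emits `anchors_per_location` boxes.
    """
    locations = sum((image_size // s) ** 2 for s in strides)
    return locations * anchors_per_location


anchor_free = num_predictions(640, anchors_per_location=1)
anchor_based = num_predictions(640, anchors_per_location=3)

print(anchor_free, anchor_based, anchor_based // anchor_free)
```

With these assumed settings the anchor-free head emits 8400 boxes and the anchor-based head 25200, so NMS starts from three times as many candidates in the anchor-based case.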

Figure 3. The encoder structure for each variant. SSE represents the single-scale Transformer encoder, MSE represents the multi-scale Transformer encoder, and CSF represents cross-scale fusion. AIFI and CCFF are the two modules designed into our hybrid encoder.

4. The Real-time DETR

4.1. Model Overview

RT-DETR consists of a backbone, an efficient hybrid encoder, and a Transformer decoder with auxiliary prediction heads. The overview of RT-DETR is illustrated in Figure 4 . Specifically, we feed the features from the last three stages of the backbone $\{\pmb{\mathcal{S}}_3, \pmb{\mathcal{S}}_4, \pmb{\mathcal{S}}_5\}$ into the encoder. The efficient hybrid encoder transforms multi-scale features into a sequence of image features through intra-scale feature interaction and cross-scale feature fusion ( cf. Sec. 4.2 ). Subsequently, the uncertainty-minimal query selection is employed to select a fixed number of encoder features to serve as initial object queries for the decoder ( cf. Sec. 4.3 ). Finally, the decoder with auxiliary prediction heads iteratively optimizes object queries to generate categories and boxes.

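[Analysis] RT-DETR detects objects through three cooperating components. The backbone extracts a hierarchy of features from the input image, ranging from low-level edges and textures to high-level semantic concepts. The efficient hybrid encoder is the central design: it targets the computational bottleneck of multi-scale feature processing. Earlier DETR variants pay a heavy cost for multi-scale features, whereas RT-DETR separates intra-scale interaction from cross-scale fusion and so reduces computation substantially. Intra-scale interaction models relations among positions within one scale, and cross-scale fusion integrates information across resolutions. Uncertainty-minimal query selection is the second design: it addresses the difficulty of initializing object queries by modeling the uncertainty of features explicitly, so that higher-quality initial queries are selected and the decoder starts from a better point. The decoder carries auxiliary prediction heads, which, as in other DETR-style detectors, supply additional supervision during training. The overall design principle is to keep detection accuracy while reaching real-time speed through changes to the algorithm and the structure.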

4.2. Efficient Hybrid Encoder

Computational bottleneck analysis. The introduction of multi-scale features accelerates training convergence and improves performance [ 45 ]. However, although the deformable attention reduces the computational cost, the sharply increased sequence length still causes the encoder to become the computational bottleneck. As reported in Lin et al. [ 19 ], the encoder accounts for 49% of the GFLOPs but contributes only 11% of the AP in Deformable-DETR. To overcome this bottleneck, we first analyze the computational redundancy present in the multi-scale Transformer encoder. Intuitively, high-level features that contain rich semantic information about objects are extracted from low-level features, making it redundant to perform feature interaction on the concatenated multi-scale features. Therefore, we design a set of variants with different types of the encoder to prove that the simultaneous intra-scale and cross-scale feature interaction is inefficient, Figure 3 . Specially, we use DINO-Deformable-R50 with the smaller size data reader and lighter decoder used in RT-DETR for experiments and first remove the multi-scale Transformer encoder in DINO-Deformable-R50 as variant A. Then, different types of the encoder are inserted to produce a series of variants based on A, elaborated as follows (Detailed indicators of each variant are referred to in Table 3 ):

[Analysis] This paragraph identifies a key weakness of existing DETR architectures: the encoder is the computational bottleneck of the whole system. Multi-scale features do improve detection, but they are expensive. The cited figures make the imbalance concrete: in Deformable-DETR the encoder consumes 49% of the GFLOPs yet contributes only 11% of the AP. The authors' observation is that high-level features are themselves abstracted step by step from low-level features and already carry their key information, so running full feature interaction over the concatenated multi-scale features largely repeats work. To test this hypothesis they design a series of controlled variants, adding different encoder components one at a time and measuring the change, so that the contribution of each component is visible.

• A → B: Variant B inserts a single-scale Transformer encoder into A, which uses one layer of Transformer block. The multi-scale features share the encoder for intra-scale feature interaction and then concatenate as output.

• B → C: Variant C introduces cross-scale feature fusion based on B and feeds the concatenated features into the multi-scale Transformer encoder to perform simultaneous intra-scale and cross-scale feature interaction.

• C → D: Variant D decouples intra-scale interaction and cross-scale fusion by utilizing the single-scale Transformer encoder for the former and a PANet-style [ 22 ] structure for the latter.

• D → E: Variant E enhances the intra-scale interaction and cross-scale fusion based on D, adopting an efficient hybrid encoder designed by us.

[Analysis] The four variants form a step-by-step progression. Variant B is the simplest: all scales share one single-scale encoder for intra-scale interaction, and nothing fuses information across scales. Variant C adds cross-scale fusion, but it does so by feeding the concatenated features into a multi-scale Transformer encoder that handles intra-scale and cross-scale interaction at once, which is exactly the inefficient pattern the authors want to replace. Variant D is the turning point: it handles intra-scale interaction and cross-scale fusion separately, using a different tool for each. Self-attention suits intra-scale interaction because it captures long-range dependencies, while a CNN-style hierarchical structure suits cross-scale fusion. A PANet-style structure adds a bottom-up path on top of a feature pyramid, which helps propagate the localization information held in low-level features. Variant E is the final version: it strengthens both modules of D and is the hybrid encoder described next.

Hybrid design. Based on the above analysis, we rethink the structure of the encoder and propose an efficient hybrid encoder , consisting of two modules, namely the Attention-based Intra-scale Feature Interaction (AIFI) and the CNN-based Cross-scale Feature Fusion (CCFF). Specifically, AIFI further reduces the computational cost based on variant D by performing the intra-scale interaction only on $\pmb{\mathcal{S}}_5$ with the single-scale Transformer encoder. The reason is that applying the self-attention operation to high-level features with richer semantic concepts captures the connection between conceptual entities, which facilitates the localization and recognition of objects by subsequent modules. However, the intra-scale interactions of lower-level features are unnecessary due to the lack of semantic concepts and the risk of duplication and confusion with high-level feature interactions. To verify this opinion, we perform the intra-scale interaction only on $\pmb{\mathcal{S}}_5$ in variant D, and the experimental results are reported in Table 3 (see row $\mathrm{D}_{\pmb{\mathcal{S}}_5}$ ). Compared to D, $\mathrm{D}_{\pmb{\mathcal{S}}_5}$ not only significantly reduces latency (35% faster), but also improves accuracy (0.4% AP higher). CCFF is optimized based on the cross-scale fusion module, which inserts several fusion blocks consisting of convolutional layers into the fusion path. The role of the fusion block is to fuse two adjacent scale features into a new feature, and its structure is illustrated in Figure 5 . The fusion block contains two $1\times1$ convolutions to adjust the number of channels, $N$ RepBlocks composed of RepConv [ 8 ] are used for feature fusion, and the two-path outputs are fused by element-wise add. We formulate the calculation of the hybrid encoder as:

$$
\begin{aligned}
\pmb{\mathcal{Q}} &= \pmb{\mathcal{K}} = \pmb{\mathcal{V}} = \mathrm{Flatten}(\pmb{\mathcal{S}}_5), \\
\pmb{\mathcal{F}}_5 &= \mathrm{Reshape}(\mathrm{AIFI}(\pmb{\mathcal{Q}}, \pmb{\mathcal{K}}, \pmb{\mathcal{V}})), \\
\pmb{\mathcal{O}} &= \mathrm{CCFF}(\{\pmb{\mathcal{S}}_3, \pmb{\mathcal{S}}_4, \pmb{\mathcal{F}}_5\}),
\end{aligned}
$$

where Reshape represents restoring the shape of the flattened feature to the same shape as $\mathcal{S}_5$.

【翻译】其中Reshape表示将扁平化特征的形状恢复为与 $\mathcal{S}_5$ 相同的形状。

【解析】RT-DETR的混合编码器设计体现了一种精细化的计算资源分配策略。传统的多尺度Transformer编码器对所有尺度的特征都进行相同强度的处理，这种"一刀切"的方式忽略了不同尺度特征的本质差异。作者通过深入分析发现，高级特征（如 $\mathcal{S}_5$）包含了丰富的语义信息，这些信息对于理解图像中的目标类别和概念关系至关重要。自注意力机制在这个层次上能够发挥最大价值，因为它能够建立不同语义概念之间的长距离依赖关系，比如识别出图像中的"人"和"自行车"之间的交互关系。相比之下，低级特征主要包含边缘、纹理等基础视觉信息，这些信息的空间关系相对简单，不需要复杂的自注意力机制来处理。更重要的是，如果对低级特征也应用自注意力，可能会产生与高级特征处理过程的信息冗余，甚至可能引入噪声干扰。AIFI模块的设计正是基于这种认识，它只对最高级的特征 $\mathcal{S}_5$ 应用Transformer编码器，大幅减少了计算量。CCFF模块则专门负责跨尺度的特征融合，它使用CNN的层次化结构来有效整合不同分辨率的信息。融合块的设计很巧妙：首先用 $1\times1$ 卷积调整通道维度以确保特征兼容性，然后用RepBlock进行深度特征融合，最后通过逐元素相加实现信息整合。这种设计既保持了特征融合的有效性，又避免了Transformer在多尺度处理中的计算瓶颈。公式中的处理流程展示了这种分工合作的机制：首先将 $\mathcal{S}_5$ 扁平化为序列形式供AIFI处理，然后将处理结果重新整形，最后与其他尺度特征一起送入CCFF进行跨尺度融合。
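To make the saving from restricting self-attention to $\mathcal{S}_5$ concrete, the sketch below counts tokens per scale and the number of query-key pairs that self-attention has to score. It assumes a 640×640 input and backbone strides of 8, 16 and 32 for $\mathcal{S}_3$, $\mathcal{S}_4$ and $\mathcal{S}_5$; the strides are an assumption typical of ResNet-style backbones, not something stated in this section. It counts attention pairs only, so it ignores feed-forward layers and convolutions and does not predict measured latency.

```python
# Assumed setup (not from the paper text above): square 640x640 input,
# strides 8/16/32 for the last three backbone stages S3/S4/S5.
IMAGE_SIZE = 640
STRIDES = {"S3": 8, "S4": 16, "S5": 32}


def token_count(image_size: int, stride: int) -> int:
    """Number of spatial positions (tokens) in a square feature map."""
    side = image_size // stride
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Self-attention scores every token against every token: N * N pairs."""
    return num_tokens * num_tokens


tokens = {name: token_count(IMAGE_SIZE, s) for name, s in STRIDES.items()}

# Multi-scale encoder: all three scales are flattened and concatenated.
multi_scale_tokens = sum(tokens.values())
# AIFI: only S5 is flattened and fed to the Transformer encoder.
s5_only_tokens = tokens["S5"]

ratio = attention_pairs(multi_scale_tokens) / attention_pairs(s5_only_tokens)

print(tokens)
print(multi_scale_tokens, s5_only_tokens, ratio)
```

Under these assumptions the attention map over the concatenated scales is several hundred times larger than the one over $\mathcal{S}_5$ alone, which is the intuition behind leaving cross-scale fusion to the convolutional CCFF path.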

Figure 4. Overview of RT-DETR. We feed the features from the last three stages of the backbone into the encoder. The efficient hybrid encoder transforms multi-scale features into a sequence of image features through the Attention-based Intra-scale Feature Interaction (AIFI) and the CNN-based Cross-scale Feature Fusion (CCFF). Then, the uncertainty-minimal query selection selects a fixed number of encoder features to serve as initial object queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes object queries to generate categories and boxes.

【翻译】图4. RT-DETR概述。我们将骨干网络最后三个阶段的特征输入编码器。高效混合编码器通过基于注意力的尺度内特征交互（AIFI）和基于CNN的跨尺度特征融合（CCFF）将多尺度特征转换为图像特征序列。然后，不确定性最小查询选择选择固定数量的编码器特征作为解码器的初始目标查询。最后，带有辅助预测头的解码器迭代优化目标查询以生成类别和边界框。

Figure 5. The fusion block in CCFF.

【翻译】图5. CCFF中的融合块。

4.3. Uncertainty-minimal Query Selection（不确定性最小查询选择）

To reduce the difficulty of optimizing object queries in DETR, several subsequent works [42, 44, 45] propose query selection schemes, which have in common that they use the confidence score to select the top $K$ features from the encoder to initialize object queries (or just position queries).

【翻译】为了降低DETR中优化目标查询的难度，几个后续工作[42, 44, 45]提出了查询选择方案，它们的共同点是使用置信度分数从编码器中选择前 $K$ 个特征来初始化目标查询（或仅位置查询）。

【解析】DETR的一个核心挑战是目标查询的初始化问题。在原始DETR中，目标查询是随机初始化的，这给解码器的优化带来了很大困难，因为解码器需要从完全随机的状态开始学习如何定位和分类目标。为了解决这个问题，研究者们开发了查询选择策略，其基本思路是从编码器输出的特征中挑选出最有希望包含目标的特征作为查询的初始值。这些方法通常依赖置信度分数作为选择标准，置信度分数反映了某个特征位置包含前景目标的可能性。通过这种方式，解码器可以从一个更好的起点开始优化，而不是从随机状态开始。
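The confidence-based selection described above can be sketched in a few lines. This is a minimal illustration, assuming that the confidence of an encoder feature is its maximum class score; the function name `select_top_k` and the toy scores are hypothetical and not taken from any released implementation.

```python
from typing import List


def select_top_k(class_scores: List[List[float]], k: int) -> List[int]:
    """Return the indices of the k encoder features with the highest confidence.

    Assumption: the confidence of a feature is the maximum of its class scores.
    """
    confidence = [max(scores) for scores in class_scores]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    return ranked[:k]


# Toy scores for 4 encoder features and 2 classes (made-up values).
scores = [
    [0.10, 0.20],
    [0.90, 0.05],
    [0.30, 0.60],
    [0.05, 0.02],
]

selected = select_top_k(scores, k=2)
print(selected)
```

Only the classification confidence enters the ranking here, so a feature with a confident class but a poorly localized box is selected just as readily as a well-localized one.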

The confidence score represents the likelihood that the feature includes foreground objects. Nevertheless, the detector is required to simultaneously model the category and location of objects, both of which determine the quality of the features. Hence, the performance score of the feature is a latent variable that is jointly correlated with both classification and localization. Based on this analysis, the current query selection leads to a considerable level of uncertainty in the selected features, resulting in sub-optimal initialization for the decoder and hindering the performance of the detector.

【翻译】置信度分数表示特征包含前景目标的可能性。然而，检测器需要同时建模目标的类别和位置，这两者都决定了特征的质量。因此，特征的性能分数是一个与分类和定位都联合相关的潜在变量。基于这一分析，当前的查询选择导致所选特征存在相当程度的不确定性，导致解码器的次优初始化并阻碍检测器的性能。

【解析】这段话揭示了现有查询选择方法的缺陷。传统方法只考虑置信度这一个维度，但目标检测本质上是一个多任务问题，需要同时解决"这里有什么"（分类）和"它在哪里"（定位）两个问题。一个特征可能在分类上表现很好（能准确识别目标类别），但在定位上表现较差（边界框不准确），或者相反。如果仅仅基于置信度分数来选择查询，就可能选到那些在某一个任务上表现好但在另一个任务上表现差的特征。这种不匹配会产生不确定性，因为我们无法确定所选择的特征在两个任务上的综合表现如何。这种不确定性会传播到解码器的初始化过程中，使得解码器需要花费更多的计算资源来纠正这种不匹配，从而影响整体检测性能。

To address this problem, we propose the uncertainty-minimal query selection scheme, which explicitly constructs and optimizes the epistemic uncertainty to model the joint latent variable of encoder features, thereby providing high-quality queries for the decoder. Specifically, the feature uncertainty $\mathcal{U}$ is defined as the discrepancy between the predicted distributions of localization $\mathcal{P}$ and classification $\mathcal{C}$ in Eq. (2). To minimize the uncertainty of the queries, we integrate the uncertainty into the loss function for the gradient-based optimization in Eq. (3).

【翻译】为了解决这个问题，我们提出了不确定性最小查询选择方案，该方案显式构建和优化认知不确定性来建模编码器特征的联合潜在变量，从而为解码器提供高质量的查询。具体来说，特征不确定性 $\mathcal{U}$ 被定义为定位 $\mathcal{P}$ 和分类 $\mathcal{C}$ 的预测分布之间的差异，如公式(2)所示。为了最小化查询的不确定性，我们将不确定性集成到损失函数中进行基于梯度的优化，如公式(3)所示。

【解析】RT-DETR提出的不确定性最小查询选择是一个创新性的解决方案，它从不确定性建模的角度重新审视了查询选择问题。认知不确定性（epistemic uncertainty）指由于知识不足而产生的不确定性，在这里具体表现为模型对于某个特征在分类和定位任务上的表现不一致性的不确定程度。该方法的核心思想是将分类和定位看作两个相关但独立的预测分布，通过测量这两个分布之间的差异来量化特征的不确定性。当一个特征在分类和定位上的预测结果高度一致时，说明这个特征是高质量的；反之，如果两个预测结果差异很大，说明这个特征存在较高的不确定性。通过将这种不确定性显式地建模并集成到损失函数中，模型可以在训练过程中学会识别和选择那些在两个任务上都表现良好的特征。这种方法不仅提高了查询初始化的质量，还为后续的解码器优化提供了更好的起点。

$$
\mathcal{U}(\hat{\mathcal{X}}) = \left\| \mathcal{P}(\hat{\mathcal{X}}) - \mathcal{C}(\hat{\mathcal{X}}) \right\|, \quad \hat{\mathcal{X}} \in \mathbb{R}^{D}
\tag{2}
$$

$$
\mathcal{L}(\hat{\mathcal{X}}, \hat{\mathcal{Y}}, \mathcal{Y}) = \mathcal{L}_{box}(\hat{\mathbf{b}}, \mathbf{b}) + \mathcal{L}_{cls}(\mathcal{U}(\hat{\mathcal{X}}), \hat{\mathbf{c}}, \mathbf{c})
\tag{3}
$$

where $\hat{\mathcal{Y}}$ and $\mathcal{Y}$ denote the prediction and ground truth, $\hat{\mathcal{Y}} = \{\hat{\mathbf{c}}, \hat{\mathbf{b}}\}$, $\hat{\mathbf{c}}$ and $\hat{\mathbf{b}}$ represent the category and bounding box respectively, and $\hat{\mathcal{X}}$ represents the encoder feature.

【翻译】其中 $\hat{\mathcal{Y}}$ 和 $\mathcal{Y}$ 分别表示预测值和真实值，$\hat{\mathcal{Y}} = \{\hat{\mathbf{c}}, \hat{\mathbf{b}}\}$，$\hat{\mathbf{c}}$ 和 $\hat{\mathbf{b}}$ 分别表示类别和边界框，$\hat{\mathcal{X}}$ 表示编码器特征。

【解析】公式定义了损失函数中各个符号的含义，建立了预测输出与真实标签之间的对应关系。在目标检测任务中，模型需要同时预测目标的类别和位置信息，因此预测输出 $\hat{\mathcal{Y}}$ 被分解为两个组成部分：类别预测 $\hat{\mathbf{c}}$ 和边界框预测 $\hat{\mathbf{b}}$。编码器特征 $\hat{\mathcal{X}}$ 是整个预测过程的基础，它包含了从输入图像中提取的语义和空间信息。通过将这些特征输入到分类和定位分支中，模型可以生成相应的预测结果。
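As a rough numerical reading of Eq. (2), the sketch below replaces the two predicted distributions with scalar stand-ins: a classification confidence for $\mathcal{C}$ and a localization quality such as the IoU with the matched ground-truth box for $\mathcal{P}$. This scalar simplification, the `FeaturePrediction` class and the example values are assumptions made for illustration; the section does not specify how the distributions or the norm are implemented.

```python
from dataclasses import dataclass


@dataclass
class FeaturePrediction:
    # Scalar stand-in for C(x): classification confidence in [0, 1] (assumption).
    cls_score: float
    # Scalar stand-in for P(x): localization quality in [0, 1], e.g. IoU (assumption).
    loc_score: float


def uncertainty(pred: FeaturePrediction) -> float:
    """U(x) = |P(x) - C(x)| for the scalar stand-ins above."""
    return abs(pred.loc_score - pred.cls_score)


# Both features look equally confident to a classifier-only selector.
consistent = FeaturePrediction(cls_score=0.9, loc_score=0.85)
inconsistent = FeaturePrediction(cls_score=0.9, loc_score=0.3)

print(uncertainty(consistent), uncertainty(inconsistent))
```

The two toy features have the same classification score, yet the second has a much larger discrepancy. Feeding this discrepancy into the classification loss, as Eq. (3) does, lets gradient descent pull the classification score and the localization quality of a feature toward agreement.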

Effectiveness analysis. To analyze the effectiveness of the uncertainty-minimal query selection, we visualize the classification scores and IoU scores of the selected features on COCO val2017 (Figure 6). We draw the scatterplot with classification scores greater than 0.5. The purple and green dots represent the selected features from the model trained with uncertainty-minimal query selection and vanilla query selection, respectively. The closer the dot is to the top right of the figure, the higher the quality of the corresponding feature, i.e., the more likely the predicted category and box are to describe the true object. The top and right density curves reflect the number of dots for two types.

【翻译】有效性分析。为了分析不确定性最小查询选择的有效性，我们在COCO val2017上可视化了所选特征的分类分数和IoU分数，如图6所示。我们绘制了分类分数大于0.5的散点图。紫色和绿色点分别表示使用不确定性最小查询选择和传统查询选择训练的模型所选择的特征。点越接近图的右上角，对应特征的质量越高，即预测的类别和边界框越有可能描述真实目标。顶部和右侧的密度曲线反映了两种类型点的数量分布。

【解析】这段分析通过可视化实验验证了不确定性最小查询选择方法的优越性。实验通过二维散点图同时展示了特征在分类和定位两个任务上的表现。横轴代表分类性能，纵轴代表定位性能（通过IoU衡量），这样就能直观地看出每个选中特征的综合质量。图的右上角区域代表高质量特征区域，因为这些特征在两个任务上都表现优秀。通过设置分类分数阈值0.5，实验过滤掉了明显的低质量特征，专注于分析中等到高质量特征的分布差异。紫色点和绿色点的对比清晰地展示了两种方法的差异：如果不确定性最小方法有效，紫色点应该更多地聚集在右上角区域。密度曲线提供了定量的统计信息，能够更精确地比较两种方法选择的特征质量分布。

Figure 6. Classification and IoU scores of the selected encoder features. Purple and Green dots represent the selected features from model trained with uncertainty-minimal query selection and vanilla query selection, respectively.

【翻译】图6. 所选编码器特征的分类和IoU分数。紫色和绿色点分别表示使用不确定性最小查询选择和传统查询选择训练的模型所选择的特征。

The most striking feature of the scatterplot is that the purple dots are concentrated in the top right of the figure, while the green dots are concentrated in the bottom right. This shows that uncertainty-minimal query selection produces more high-quality encoder features. Furthermore, we perform quantitative analysis on two query selection schemes. There are 138% more purple dots than green dots, i.e., more green dots with a classification score less than or equal to 0.5, which can be considered low-quality features. And there are 120% more purple dots than green dots with both scores greater than 0.5. The same conclusion can be drawn from the density curves, where the gap between purple and green is most evident in the top right of the figure. Quantitative results further demonstrate that the uncertainty-minimal query selection provides more features with accurate classification and precise location for queries, thereby improving the accuracy of the detector (cf. Sec. 5.3).

【翻译】散点图最显著的特征是紫色点集中在图的右上角，而绿色点集中在右下角。这表明不确定性最小查询选择产生了更多高质量的编码器特征。此外，我们对两种查询选择方案进行了定量分析。紫色点比绿色点多138%，即更多绿色点的分类分数小于或等于0.5，这些可以被认为是低质量特征。在两个分数都大于0.5的情况下，紫色点比绿色点多120%。从密度曲线也可以得出相同的结论，紫色和绿色之间的差距在图的右上角最为明显。定量结果进一步证明，不确定性最小查询选择为查询提供了更多具有准确分类和精确定位的特征，从而提高了检测器的准确性（参见第5.3节）。

【解析】作者通过详细的统计数据验证了不确定性最小查询选择方法的优越性。散点图的分布模式揭示了两种方法的本质差异：传统方法选择的特征主要集中在右下角，说明这些特征虽然在分类上表现不错，但在定位精度上存在明显不足；而新方法选择的特征更多地分布在右上角，说明它们在两个任务上都达到了较高的性能水平。138%说明传统方法选择了大量低质量特征（分类分数≤0.5），这些特征对模型训练是有害的。更重要的是，120%的数字表明在高质量特征区域（两个分数都>0.5），新方法选择的特征数量显著超过传统方法。密度曲线提供了额外的统计验证，它们显示了特征分布的概率密度，右上角区域的显著差异进一步确认了新方法的优势。这种改进直接转化为检测器性能的提升，因为解码器从更好的初始查询开始优化，能够更快地收敛到更优的解。

4.4. Scaled RT-DETR（可扩展的RT-DETR）

Since real-time detectors typically provide models at different scales to accommodate different scenarios, RT-DETR also supports flexible scaling. Specifically, for the hybrid encoder, we control the width by adjusting the embedding dimension and the number of channels, and the depth by adjusting the number of Transformer layers and RepBlocks.

【翻译】由于实时检测器通常提供不同规模的模型以适应不同的场景，RT-DETR也支持灵活的缩放。具体来说，对于混合编码器，我们通过调整嵌入维度和通道数来控制宽度，通过调整Transformer层数和RepBlock数来控制深度。

【解析】这段话介绍了RT-DETR的模型缩放策略，在实际应用中，不同的硬件平台和应用场景对模型的计算资源需求差异很大，比如移动设备需要轻量级模型，而服务器端可以使用更大的模型来追求更高的精度。RT-DETR通过两个维度来实现模型缩放：宽度和深度。宽度控制涉及特征表示的丰富程度，嵌入维度决定了特征向量的长度，通道数决定了特征图的深度，这两个参数直接影响模型的表达能力和计算复杂度。深度控制则涉及网络的层次结构，Transformer层数决定了自注意力机制的迭代次数，RepBlock数量影响特征提取的精细程度。通过合理调整这些参数，可以在保持架构一致性的前提下，生成适合不同计算预算的模型变体。

The width and depth of the decoder can be controlled by manipulating the number of object queries and decoder layers. Furthermore, the speed of RT-DETR supports flexible adjustment by adjusting the number of decoder layers. We observe that removing a few decoder layers at the end has minimal effect on accuracy, but greatly enhances inference speed (cf. Sec. 5.4). We compare the RT-DETR equipped with ResNet50 and ResNet101 [13, 14] to the L and X models of YOLO detectors. Lighter RT-DETRs can be designed by applying other smaller (e.g., ResNet18/34) or scalable (e.g., CSPResNet [40]) backbones with scaled encoder and decoder. We compare the scaled RT-DETRs with the lighter (S and M) YOLO detectors in Appendix, which outperform all S and M models in both speed and accuracy.

【翻译】解码器的宽度和深度可以通过操控目标查询的数量和解码器层数来控制。此外，RT-DETR的速度支持通过调整解码器层数进行灵活调整。我们观察到移除末尾的几个解码器层对精度的影响很小，但大大提高了推理速度（参见第5.4节）。我们将配备ResNet50和ResNet101 [13, 14]的RT-DETR与YOLO检测器的L和X模型进行比较。通过应用其他更小的（如ResNet18/34）或可扩展的（如CSPResNet [40]）骨干网络以及相应缩放的编码器和解码器，可以设计出更轻量的RT-DETR。我们在附录中将缩放的RT-DETR与更轻量的（S和M）YOLO检测器进行比较，结果显示在速度和精度方面都优于所有S和M模型。

【解析】解码器的缩放涉及两个关键参数：目标查询数量和解码器层数。目标查询数量直接决定了模型能够同时检测的最大目标数量，这个参数的调整会影响模型的检测能力和计算开销。解码器层数则控制着查询优化的迭代深度，每一层都会对查询进行一次精化。作者发现了一个重要的观察结果：解码器的后几层对精度提升的贡献相对较小，但会显著增加计算时间。这个发现具有重要的实用价值，因为它允许在部署时根据实际需求动态调整模型复杂度，而无需重新训练。通过与不同规模的YOLO模型进行对比，作者展示了RT-DETR在各个规模级别上的竞争力。特别是通过使用不同的骨干网络（从轻量级的ResNet18/34到更强大的ResNet101），RT-DETR可以构建出覆盖广泛性能范围的模型家族，这种灵活性使其能够适应从边缘设备到高性能服务器的各种部署场景。
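The idea of dropping trailing decoder layers at inference time can be sketched with a toy decoder. The helper names `run_decoder` and `make_toy_layer` are hypothetical, and the layer is a stand-in that simply moves each query halfway toward a target value; it illustrates diminishing per-layer refinement and says nothing about real AP. The sketch also assumes every layer has its own prediction head, in line with the auxiliary prediction heads mentioned in the Figure 4 caption, so that the output of an intermediate layer can be used directly.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def run_decoder(queries: List[float], layers: List[Layer], num_layers_used: int) -> List[float]:
    """Refine the queries with only the first `num_layers_used` decoder layers."""
    if not 1 <= num_layers_used <= len(layers):
        raise ValueError("num_layers_used must be between 1 and len(layers)")
    for layer in layers[:num_layers_used]:
        queries = layer(queries)
    return queries


def make_toy_layer(step: float) -> Layer:
    """Toy refinement: move each query a fraction `step` of the way toward 1.0."""
    return lambda qs: [q + step * (1.0 - q) for q in qs]


# A decoder trained with 6 layers; weights are reused as-is when truncating.
layers = [make_toy_layer(0.5) for _ in range(6)]

full = run_decoder([0.0], layers, num_layers_used=6)
truncated = run_decoder([0.0], layers, num_layers_used=5)

print(full, truncated)
```

In this toy setting the sixth layer changes the result far less than the first one did, which mirrors the observation that the last layers contribute little accuracy while still costing latency.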

5. Experiments

5.1. Comparison with SOTA

Table 2 compares RT-DETR with current real-time (YOLOs) and end-to-end (DETRs) detectors, where only the L and X models of the YOLO detector are compared, and the S and M models are compared in Appendix. Our RT-DETR and YOLO detectors share a common input size of (640, 640), and other DETRs use an input size of (800, 1333). The FPS is reported on T4 GPU with TensorRT FP16, and for YOLO detectors using official pre-trained models according to the end-to-end speed benchmark proposed in Sec. 3.2. Our RT-DETR-R50 achieves 53.1% AP and 108 FPS, while RT-DETR-R101 achieves 54.3% AP and 74 FPS, outperforming state-of-the-art YOLO detectors of similar scale and DETRs with the same backbone in both speed and accuracy. The experimental settings are shown in Appendix.

【翻译】表2将RT-DETR与当前的实时检测器（YOLOs）和端到端检测器（DETRs）进行了比较，其中只比较了YOLO检测器的L和X模型，S和M模型的比较在附录中。我们的RT-DETR和YOLO检测器共享相同的输入尺寸(640, 640)，其他DETR使用输入尺寸(800, 1333)。FPS在T4 GPU上使用TensorRT FP16报告，YOLO检测器使用官方预训练模型，根据第3.2节提出的端到端速度基准测试。我们的RT-DETR-R50达到了53.1% AP和108 FPS，而RT-DETR-R101达到了54.3% AP和74 FPS，在速度和精度方面都优于相似规模的最先进YOLO检测器和使用相同骨干网络的DETR。实验设置在附录中显示。

Comparison with real-time detectors. We compare the end-to-end speed (cf. Sec. 3.2) and accuracy of RT-DETR with YOLO detectors. We compare RT-DETR with YOLOv5 [11], PP-YOLOE [40], YOLOv6v3.0 [16] (hereinafter referred to as YOLOv6), YOLOv7 [38] and YOLOv8 [12]. Compared to YOLOv5-L / PP-YOLOE-L / YOLOv6-L, RT-DETR-R50 improves accuracy by 4.1% / 1.7% / 0.3% AP, increases FPS by 100.0% / 14.9% / 9.1%, and reduces the number of parameters by 8.7% / 19.2% / 28.8%. Compared to YOLOv5-X / PP-YOLOE-X, RT-DETR-R101 improves accuracy by 3.6% / 2.0%, increases FPS by 72.1% / 23.3%, and reduces the number of parameters by 11.6% / 22.4%. Compared to YOLOv7-L / YOLOv8-L, RT-DETR-R50 improves accuracy by 1.9% / 0.2% AP and increases FPS by 96.4% / 52.1%. Compared to YOLOv7-X / YOLOv8-X, RT-DETR-R101 improves accuracy by 1.4% / 0.4% AP and increases FPS by 64.4% / 48.0%. This shows that our RT-DETR achieves state-of-the-art real-time detection performance.

【翻译】与实时检测器的比较。我们比较了RT-DETR与YOLO检测器的端到端速度（参见第3.2节）和精度。我们将RT-DETR与YOLOv5 [11]、PP-YOLOE [40]、YOLOv6v3.0 [16]（以下简称YOLOv6）、YOLOv7 [38]和YOLOv8 [12]进行比较。与YOLOv5-L / PP-YOLOE-L / YOLOv6-L相比，RT-DETR-R50在精度上提高了4.1% / 1.7% / 0.3% AP，FPS提高了100.0% / 14.9% / 9.1%，参数数量减少了8.7% / 19.2% / 28.8%。与YOLOv5-X / PP-YOLOE-X相比，RT-DETR-R101在精度上提高了3.6% / 2.0%，FPS提高了72.1% / 23.3%，参数数量减少了11.6% / 22.4%。与YOLOv7-L / YOLOv8-L相比，RT-DETR-R50在精度上提高了1.9% / 0.2% AP，FPS提高了96.4% / 52.1%。与YOLOv7-X / YOLOv8-X相比，RT-DETR-R101在精度上提高了1.4% / 0.4% AP，FPS提高了64.4% / 48.0%。这表明我们的RT-DETR达到了最先进的实时检测性能。

Comparison with end-to-end detectors. We also compare RT-DETR with existing DETRs using the same backbone.

【翻译】与端到端检测器的比较。我们还将RT-DETR与使用相同骨干网络的现有DETR进行比较。

Table 2. Comparison with SOTA (only L and X models of YOLO detectors, see Appendix for the comparison with S and M models). We do not test the speed of other DETRs, except for DINO-Deformable-DETR [44] for comparison, as they are not real-time detectors. Our RT-DETR outperforms the state-of-the-art YOLO detectors and DETRs in both speed and accuracy.

【翻译】表2. 与SOTA的比较（仅YOLO检测器的L和X模型，与S和M模型的比较见附录）。除了DINO-Deformable-DETR [44]用于比较外，我们没有测试其他DETR的速度，因为它们不是实时检测器。我们的RT-DETR在速度和精度方面都优于最先进的YOLO检测器和DETR。

We test the speed of DINO-Deformable-DETR [44] according to the settings of the corresponding accuracy taken on COCO val2017 for comparison, i.e., the speed is tested with TensorRT FP16 and the input size is (800, 1333). Table 2 shows that RT-DETR outperforms all DETRs with the same backbone in both speed and accuracy. Compared to DINO-Deformable-DETR-R50, RT-DETR-R50 improves the accuracy by 2.2% AP and the speed by 21 times (108 FPS vs 5 FPS), both of which are significantly improved.

【翻译】我们根据在COCO val2017上获得的相应精度设置测试DINO-Deformable-DETR [44]的速度进行比较，即使用TensorRT FP16测试速度，输入尺寸为(800, 1333)。表2显示RT-DETR在速度和精度方面都优于所有使用相同骨干网络的DETR。与DINO-Deformable-DETR-R50相比，RT-DETR-R50在精度上提高了2.2% AP，速度提高了21倍（108 FPS vs 5 FPS），两者都有显著改善。

5.2. Ablation Study on Hybrid Encoder

We evaluate the indicators of the variants designed in Sec. 4.2, including AP (trained with $1\times$ configuration), the number of parameters, and the latency (Table 3). Compared to baseline A, variant B improves accuracy by 1.9% AP and increases the latency by 54%. This proves that the intra-scale feature interaction is significant, but the single-scale Transformer encoder is computationally expensive. Variant C delivers a 0.7% AP improvement over B and increases the latency by 20%. This shows that the cross-scale feature fusion is also necessary but the multi-scale Transformer encoder requires higher computational cost. Variant D delivers a 0.8% AP improvement over C, but reduces latency by 8%, suggesting that decoupling intra-scale interaction and cross-scale fusion not only reduces computational cost but also improves accuracy. Compared to variant D, $\mathrm{D}_{\mathcal{S}_5}$ reduces the latency by 35% while also delivering a 0.4% AP improvement, demonstrating that intra-scale interactions of lower-level features are not required. Finally, variant E delivers a 1.5% AP improvement over D. Despite a 20% increase in the number of parameters, the latency is reduced by 24%, making the encoder more efficient. This shows that our hybrid encoder achieves a better trade-off between speed and accuracy.

【翻译】我们评估了第4.2节中设计的变体的指标，包括AP（使用 $1\times$ 配置训练）、参数数量和延迟，见表3。与基线A相比，变体B将精度提高了1.9% AP，但延迟增加了54%。这证明了尺度内特征交互是重要的，但单尺度Transformer编码器在计算上是昂贵的。变体C相比B提供了0.7% AP的改进，延迟增加了20%。这表明跨尺度特征融合也是必要的，但多尺度Transformer编码器需要更高的计算成本。变体D相比C提供了0.8% AP的改进，但延迟减少了8%，表明解耦尺度内交互和跨尺度融合不仅降低了计算成本，还提高了精度。与变体D相比，$\mathrm{D}_{\mathcal{S}_5}$ 将延迟减少了35%，同时提供了0.4% AP的改进，证明了低级特征的尺度内交互是不需要的。最后，变体E相比D提供了1.5% AP的改进。尽管参数数量增加了20%，但延迟减少了24%，使编码器更加高效。这表明我们的混合编码器在速度和精度之间实现了更好的权衡。
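The relative latency changes quoted in this ablation are each stated against a different reference variant, which makes the overall picture hard to read. The sketch below chains them so that every variant is expressed relative to baseline A. Only the percentages from the paragraph above are used; since they are rounded in the text, the results are approximate, and the absolute latencies in Table 3 are not reproduced here.

```python
from typing import Dict, List, Tuple

# (variant, reference variant, relative latency change) as quoted in Sec. 5.2.
CHANGES: List[Tuple[str, str, float]] = [
    ("B", "A", +0.54),
    ("C", "B", +0.20),
    ("D", "C", -0.08),
    ("D_S5", "D", -0.35),
    ("E", "D", -0.24),
]


def relative_latencies(changes: List[Tuple[str, str, float]], baseline: str = "A") -> Dict[str, float]:
    """Express every variant's latency as a multiple of the baseline latency."""
    latency = {baseline: 1.0}
    for variant, reference, delta in changes:
        latency[variant] = latency[reference] * (1.0 + delta)
    return latency


latencies = relative_latencies(CHANGES)
for name, value in latencies.items():
    print(f"{name}: {value:.2f}x baseline latency")
```

Read this way, the final hybrid encoder E remains slower than the encoder-free baseline A but is clearly faster than the variants that run a Transformer encoder on every scale (B, C and D).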

5.3. Ablation Study on Query Selection

We conduct an ablation study on uncertainty-minimal query selection, and the results are reported on RT-DETR-R50 with $1\times$ configuration. The query selection in RT-DETR selects the top $K$ ($K=300$) encoder features according to the classification scores as the content queries, and the prediction boxes corresponding to the selected features are used as initial position queries. We compare the encoder features selected by the two query selection schemes on COCO val2017 and calculate the proportions of classification scores greater than 0.5 and both classification and IoU scores greater than 0.5, respectively. The results show that the encoder features selected by uncertainty-minimal query selection not only increase the proportion of high classification scores (0.82% vs 0.35%) but also provide more high-quality features (0.67% vs 0.30%). We also evaluate the accuracy of the detectors trained with the two query selection schemes on COCO val2017, where the uncertainty-minimal query selection achieves an improvement of 0.8% AP (48.7% AP vs 47.9% AP).

【翻译】我们对不确定性最小查询选择进行了消融研究，结果在使用 $1\times$ 配置的RT-DETR-R50上报告。RT-DETR中的查询选择根据分类分数选择前 $K$ 个（$K=300$）编码器特征作为内容查询，与所选特征对应的预测框用作初始位置查询。我们在COCO val2017上比较了两种查询选择方案选择的编码器特征，并分别计算分类分数大于0.5和分类与IoU分数都大于0.5的比例。结果表明，不确定性最小查询选择选择的编码器特征不仅增加了高分类分数的比例（0.82% vs 0.35%），还提供了更多高质量特征（0.67% vs 0.30%）。我们还在COCO val2017上评估了使用两种查询选择方案训练的检测器的精度，其中不确定性最小查询选择实现了0.8% AP的改进（48.7% AP vs 47.9% AP）。
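The two proportions reported in this ablation can be computed from per-feature score pairs as sketched below. The function name `proportions`, the choice of all selected features as the denominator, and the toy score pairs are assumptions for illustration; they are not the paper's data or evaluation code.

```python
from typing import List, Tuple


def proportions(features: List[Tuple[float, float]], threshold: float = 0.5) -> Tuple[float, float]:
    """Compute (Prop_cls, Prop_both) over (classification score, IoU score) pairs.

    Prop_cls:  share of features whose classification score exceeds the threshold.
    Prop_both: share of features whose classification AND IoU scores exceed it.
    Assumption: the denominator is the number of all selected features.
    """
    if not features:
        raise ValueError("features must not be empty")
    total = len(features)
    n_cls = sum(1 for cls, _ in features if cls > threshold)
    n_both = sum(1 for cls, iou in features if cls > threshold and iou > threshold)
    return n_cls / total, n_both / total


# Made-up (classification score, IoU score) pairs for four selected features.
toy_features = [(0.9, 0.8), (0.7, 0.3), (0.2, 0.9), (0.1, 0.1)]

prop_cls, prop_both = proportions(toy_features)
print(prop_cls, prop_both)
```

By construction Prop_both can never exceed Prop_cls, and the gap between them counts the features that are confidently classified but poorly localized, which is the case that uncertainty-minimal query selection is meant to reduce.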

Table 3. The indicators of the set of variants illustrated in Figure 3 .

【翻译】表3. 图3中所示变体集合的指标。

Table 4. Results of the ablation study on uncertainty-minimal query selection. $\mathbf{Prop}_{cls}$ and $\mathbf{Prop}_{both}$ represent the proportion of classification score and both scores greater than 0.5 respectively.

【翻译】表4. 不确定性最小查询选择消融研究的结果。$\mathbf{Prop}_{cls}$ 和 $\mathbf{Prop}_{both}$ 分别表示分类分数和两个分数都大于0.5的比例。

5.4. Ablation Study on Decoder

Table 5 shows the inference latency and accuracy of each decoder layer of RT-DETR-R50 trained with different numbers of decoder layers. When the number of decoder layers is set to 6, the RT-DETR-R50 achieves the best accuracy of 53.1% AP. Furthermore, we observe that the difference in accuracy between adjacent decoder layers gradually decreases as the index of the decoder layer increases. Taking the column RT-DETR-R50-$\mathrm{Det}^{6}$ as an example, using the 5th decoder layer for inference only loses 0.1% AP (53.1% AP vs 53.0% AP) in accuracy, while reducing latency by 0.5 ms (9.3 ms vs 8.8 ms). Therefore, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers without retraining, thus improving its practicality.

【翻译】表5显示了使用不同解码器层数训练的RT-DETR-R50每个解码器层的推理延迟和精度。当解码器层数设置为6时，RT-DETR-R50达到了最佳精度53.1% AP。此外，我们观察到随着解码器层索引的增加，相邻解码器层之间的精度差异逐渐减小。以RT-DETR-R50-$\mathrm{Det}^{6}$ 列为例，使用第5个解码器层进行推理仅损失0.1% AP（53.1% AP vs 53.0% AP）的精度，同时将延迟减少0.5 ms（9.3 ms vs 8.8 ms）。因此，RT-DETR支持通过调整解码器层数进行灵活的速度调优而无需重新训练，从而提高了其实用性。
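The latency figures in this paragraph can be related to the FPS numbers quoted elsewhere with a small conversion. The sketch assumes one image per forward pass, so that FPS is simply 1000 divided by the latency in milliseconds; that relation is an assumption about the benchmark setup, not something stated in this section.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Frames per second for a given per-image latency (assumes batch size 1)."""
    return 1000.0 / latency_ms


def relative_saving(full_ms: float, reduced_ms: float) -> float:
    """Fraction of latency saved by the reduced configuration."""
    return (full_ms - reduced_ms) / full_ms


FULL_MS = 9.3        # all 6 decoder layers (from the paragraph above)
FIVE_LAYER_MS = 8.8  # inference with the first 5 decoder layers

print(round(latency_ms_to_fps(FULL_MS), 1))
print(round(latency_ms_to_fps(FIVE_LAYER_MS), 1))
print(round(100 * relative_saving(FULL_MS, FIVE_LAYER_MS), 1))
```

Under this assumption 9.3 ms corresponds to roughly 108 FPS, which agrees with the RT-DETR-R50 speed reported in Sec. 5.1, and dropping the last decoder layer saves about 5% of the latency for a 0.1% AP loss.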

6. Limitation and Discussion

Limitation. Although the proposed RT-DETR outperforms the state-of-the-art real-time detectors and end-to-end detectors with similar size in both speed and accuracy, it shares the same limitation as the other DETRs, i.e., the performance on small objects is still inferior to that of the strong real-time detectors. According to Table 2, RT-DETR-R50 is 0.5% AP lower than the highest $\mathbf{AP}_{S}^{val}$ in the L model (YOLOv8-L) and RT-DETR-R101 is 0.9% AP lower than the highest $\mathbf{AP}_{S}^{val}$ in the X model (YOLOv7-X). We hope that this problem will be addressed in future work.

【翻译】局限性。尽管所提出的RT-DETR在速度和精度方面都优于最先进的实时检测器和相似规模的端到端检测器，但它与其他DETR共享相同的局限性，即在小物体上的性能仍然不如强大的实时检测器。根据表2，RT-DETR-R50比L模型（YOLOv8-L）中最高的 $\mathbf{AP}_{S}^{val}$ 低0.5% AP，RT-DETR-R101比X模型（YOLOv7-X）中最高的 $\mathbf{AP}_{S}^{val}$ 低0.9% AP。我们希望这个问题能在未来的工作中得到解决。

Table 5. Results of the ablation study on decoder. ID indicates decoder layer index. $\mathbf{Det}^{k}$ represents the detector with $k$ decoder layers. All results are reported on RT-DETR-R50 with $6\times$ configuration.

【翻译】表5. 解码器消融研究的结果。ID表示解码器层索引。$\mathbf{Det}^{k}$ 表示具有 $k$ 个解码器层的检测器。所有结果都在使用 $6\times$ 配置的RT-DETR-R50上报告。

Discussion. Existing large DETR models [3, 6, 32, 41, 44, 46] have demonstrated impressive performance on the COCO test-dev [20] leaderboard. The proposed RT-DETR at different scales preserves decoders homogeneous to other DETRs, which makes it possible to distill our lightweight detector with high-accuracy pre-trained large DETR models. We believe that this is one of the advantages of RT-DETR over other real-time detectors and could be an interesting direction for future exploration.

【翻译】讨论。现有的大型DETR模型[3, 6, 32, 41, 44, 46]在COCO test-dev [20]排行榜上表现出了令人印象深刻的性能。所提出的不同尺度的RT-DETR保持了与其他DETR同质的解码器，这使得用高精度预训练的大型DETR模型蒸馏我们的轻量级检测器成为可能。我们认为这是RT-DETR相对于其他实时检测器的优势之一，可能是未来探索的一个有趣方向。

7. Conclusion

In this work, we propose a real-time end-to-end detector, called RT-DETR, which successfully extends DETR to the real-time detection scenario and achieves state-of-the-art performance. RT-DETR includes two key enhancements: an efficient hybrid encoder that expeditiously processes multi-scale features, and the uncertainty-minimal query selection that improves the quality of initial object queries. Furthermore, RT-DETR supports flexible speed tuning without retraining and eliminates the inconvenience caused by two NMS thresholds, facilitating its practical application. RT-DETR, along with its model scaling strategy, broadens the technical approach to real-time object detection, offering new possibilities beyond YOLO for diverse real-time scenarios. We hope that RT-DETR can be put into practice.

【翻译】在这项工作中，我们提出了一个实时端到端检测器，称为RT-DETR，它成功地将DETR扩展到实时检测场景并实现了最先进的性能。RT-DETR包含两个关键增强：一个高效的混合编码器，可以快速处理多尺度特征，以及不确定性最小查询选择，可以提高初始对象查询的质量。此外，RT-DETR支持无需重新训练的灵活速度调优，并消除了两个NMS阈值造成的不便，促进了其实际应用。RT-DETR及其模型缩放策略拓宽了实时目标检测的技术方法，为多样化的实时场景提供了超越YOLO的新可能性。我们希望RT-DETR能够投入实践。