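[Paper Reading/Reproduction] RT-DETR: Network Architecture, Training, Inference, Validation, and Model Export

This post uses the ultralytics repository to reproduce the official RT-DETR experimental setup.

It uses the RT-DETR variants built on ResNet50 and ResNet101.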

Contents

I. RT-DETR network architecture
  1 Encoder structure
  2 RT-DETR
  3 Fusion block in CCFF
  4 Experimental results
II. Installing, training, inference, validation, and model export with RT-DETR
  1 Installation
  2 Configuration files
  3 Training
  4 Inference
  5 Validation
  6 Exporting the model



I. RT-DETR network architecture

Paper title: DETRs Beat YOLOs on Real-time Object Detection

Paper URL: http://arxiv.org/pdf/2304.08069

Code: DETRs Beat YOLOs on Real-time Object Detection

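**[Abstract]** The YOLO series has become the most popular framework for real-time object detection because of its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative that eliminates NMS. Nevertheless, their high computational cost limits their practicality and prevents them from fully exploiting the advantage of excluding NMS. This paper proposes the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector, to address this problem. Drawing on advanced DETRs, we build RT-DETR in two steps: first improving speed while maintaining accuracy, then improving accuracy while maintaining speed. Specifically, we design an efficient hybrid encoder that processes multi-scale features quickly by decoupling intra-scale interaction and cross-scale fusion, which improves speed. We then propose uncertainty-minimal query selection to provide high-quality initial queries to the decoder, which improves accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers, adapting to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on a T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. RT-DETR-R50 also outperforms DINO-R50 by 2.2% AP in accuracy and by about 21 times in FPS. After pre-training on Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP.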

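In summary, RT-DETR rests on two key innovations:

① Efficient hybrid encoder: processes multi-scale features by decoupling intra-scale interaction from cross-scale fusion. This design significantly reduces the computational burden while maintaining high performance, enabling real-time object detection.

② Uncertainty-minimal query selection: provides high-quality initial queries to the decoder, which improves accuracy.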

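1 Encoder structure

The figure below shows the encoder structure of each variant. SSE denotes the single-scale Transformer encoder, MSE the multi-scale Transformer encoder, and CSF cross-scale fusion. AIFI and CCFF are the two modules of the proposed hybrid encoder.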
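
To see why restricting attention to a single scale saves computation, a rough count helps. The sketch below assumes a 640×640 input and backbone strides of 8, 16, and 32 (the P3 to P5 outputs named in the configuration files later in this post), and it assumes that AIFI attends only within the coarsest scale. The function names are illustrative, and the number of token pairs is only a proxy for self-attention cost, not a FLOP measurement.

```python
def tokens_per_scale(imgsz: int, strides=(8, 16, 32)) -> dict:
    """Number of feature-map positions (tokens) at each stride."""
    return {s: (imgsz // s) ** 2 for s in strides}


def attention_pairs(num_tokens: int) -> int:
    """Self-attention compares every token with every token: n * n pairs."""
    return num_tokens * num_tokens


tokens = tokens_per_scale(640)
# Multi-scale encoder (MSE): attend over the tokens of all scales concatenated.
mse_pairs = attention_pairs(sum(tokens.values()))
# AIFI-style encoder (assumed): attend only within the stride-32 feature map.
aifi_pairs = attention_pairs(tokens[32])

print(tokens)                   # token count per stride
print(mse_pairs // aifi_pairs)  # how many times more pairs the MSE handles
```

Under these assumptions the coarsest map holds 400 of the 8400 tokens, so single-scale attention touches far fewer pairs; the cross-scale mixing is then left to the convolutional CCFF module.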

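2 RT-DETR

The figure below gives an overview of RT-DETR. Features from the last three backbone stages are fed into the encoder. The efficient hybrid encoder transforms the multi-scale features into a sequence of image features through attention-based intra-scale feature interaction (AIFI) and CNN-based cross-scale feature fusion (CCFF). Uncertainty-minimal query selection then picks a fixed number of encoder features as the initial object queries for the decoder. Finally, the decoder, equipped with auxiliary prediction heads, iteratively refines the object queries to produce classes and boxes.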
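
The selection step can be pictured as a top-K over encoder tokens. The sketch below is a simplification under a stated assumption: tokens are ranked by their highest class score and the best K are kept. The uncertainty term described in the paper shapes how those scores are trained and is not modelled here; `select_initial_queries` and the toy scores are hypothetical.

```python
from typing import List


def select_initial_queries(class_scores: List[List[float]], k: int) -> List[int]:
    """Return indices of the k encoder tokens with the highest best-class score.

    class_scores[i][c] is the (hypothetical) score of class c at encoder token i.
    """
    best = [max(row) for row in class_scores]
    # Sort token indices by confidence, highest first; ties keep original order.
    ranked = sorted(range(len(best)), key=lambda i: best[i], reverse=True)
    return ranked[:k]


scores = [
    [0.10, 0.20],  # token 0: low confidence
    [0.05, 0.90],  # token 1: confident about class 1
    [0.70, 0.10],  # token 2: confident about class 0
    [0.30, 0.30],  # token 3: ambiguous
]
selected = select_initial_queries(scores, k=2)
print(selected)  # indices of the tokens that would seed the decoder
```

In a real model K is fixed in advance, which is what makes the number of decoder queries, and therefore the number of output boxes, constant.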

3 Fusion block in CCFF

The figure below shows the fusion block in CCFF.

4 Experimental results

II. Installing, training, inference, validation, and model export with RT-DETR

1 Installation

git clone https://github.com/sunsmarterjie/yolov12

cd yolov12

conda create -n yolov12 python=3.11

# Enter y to continue, then activate the environment.

conda activate yolov12

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install -e .

2 Configuration files

  • rtdetr-l.yaml

    # Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

    # Ultralytics RT-DETR-l hybrid object detection model with P3/8 - P5/32 outputs
    # Model docs: https://docs.ultralytics.com/models/rtdetr
    # Task docs: https://docs.ultralytics.com/tasks/detect

    # Parameters
    nc: 80 # number of classes
    scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
      # [depth, width, max_channels]
      l: [1.00, 1.00, 1024]

    backbone:
    # [from, repeats, module, args]
    - [-1, 1, HGStem, [32, 48]] # 0-P2/4
    - [-1, 6, HGBlock, [48, 128, 3]] # stage 1

    - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
    - [-1, 6, HGBlock, [96, 512, 3]] # stage 2
    
    - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
    - [-1, 6, HGBlock, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
    - [-1, 6, HGBlock, [192, 1024, 5, True, True]]
    - [-1, 6, HGBlock, [192, 1024, 5, True, True]] # stage 3
    
    - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
    - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

    head:
    - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 10 input_proj.2
    - [-1, 1, AIFI, [1024, 8]]
    - [-1, 1, Conv, [256, 1, 1]] # 12, Y5, lateral_convs.0

    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [7, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.1
    - [[-2, -1], 1, Concat, [1]]
    - [-1, 3, RepC3, [256]] # 16, fpn_blocks.0
    - [-1, 1, Conv, [256, 1, 1]] # 17, Y4, lateral_convs.1
    
    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 19 input_proj.0
    - [[-2, -1], 1, Concat, [1]] # cat backbone P4
    - [-1, 3, RepC3, [256]] # X3 (21), fpn_blocks.1
    
    - [-1, 1, Conv, [256, 3, 2]] # 22, downsample_convs.0
    - [[-1, 17], 1, Concat, [1]] # cat Y4
    - [-1, 3, RepC3, [256]] # F4 (24), pan_blocks.0
    
    - [-1, 1, Conv, [256, 3, 2]] # 25, downsample_convs.1
    - [[-1, 12], 1, Concat, [1]] # cat Y5
    - [-1, 3, RepC3, [256]] # F5 (27), pan_blocks.1
    
    - [[21, 24, 27], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
  • rtdetr-resnet50.yaml

    # Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

    # Ultralytics RT-DETR-ResNet50 hybrid object detection model with P3/8 - P5/32 outputs
    # Model docs: https://docs.ultralytics.com/models/rtdetr
    # Task docs: https://docs.ultralytics.com/tasks/detect

    # Parameters
    nc: 80 # number of classes
    scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
      # [depth, width, max_channels]
      l: [1.00, 1.00, 1024]

    backbone:
    # [from, repeats, module, args]
    - [-1, 1, ResNetLayer, [3, 64, 1, True, 1]] # 0
    - [-1, 1, ResNetLayer, [64, 64, 1, False, 3]] # 1
    - [-1, 1, ResNetLayer, [256, 128, 2, False, 4]] # 2
    - [-1, 1, ResNetLayer, [512, 256, 2, False, 6]] # 3
    - [-1, 1, ResNetLayer, [1024, 512, 2, False, 3]] # 4

    head:
    - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5
    - [-1, 1, AIFI, [1024, 8]]
    - [-1, 1, Conv, [256, 1, 1]] # 7

    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9
    - [[-2, -1], 1, Concat, [1]]
    - [-1, 3, RepC3, [256]] # 11
    - [-1, 1, Conv, [256, 1, 1]] # 12
    
    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14
    - [[-2, -1], 1, Concat, [1]] # cat backbone P4
    - [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1
    
    - [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
    - [[-1, 12], 1, Concat, [1]] # cat Y4
    - [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0
    
    - [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
    - [[-1, 7], 1, Concat, [1]] # cat Y5
    - [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1
    
    - [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
  • rtdetr-resnet101.yaml

    # Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

    # Ultralytics RT-DETR-ResNet101 hybrid object detection model with P3/8 - P5/32 outputs
    # Model docs: https://docs.ultralytics.com/models/rtdetr
    # Task docs: https://docs.ultralytics.com/tasks/detect

    # Parameters
    nc: 80 # number of classes
    scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
      # [depth, width, max_channels]
      l: [1.00, 1.00, 1024]

    backbone:
    # [from, repeats, module, args]
    - [-1, 1, ResNetLayer, [3, 64, 1, True, 1]] # 0
    - [-1, 1, ResNetLayer, [64, 64, 1, False, 3]] # 1
    - [-1, 1, ResNetLayer, [256, 128, 2, False, 4]] # 2
    - [-1, 1, ResNetLayer, [512, 256, 2, False, 23]] # 3
    - [-1, 1, ResNetLayer, [1024, 512, 2, False, 3]] # 4

    head:
    - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5
    - [-1, 1, AIFI, [1024, 8]]
    - [-1, 1, Conv, [256, 1, 1]] # 7

    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9
    - [[-2, -1], 1, Concat, [1]]
    - [-1, 3, RepC3, [256]] # 11
    - [-1, 1, Conv, [256, 1, 1]] # 12
    
    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14
    - [[-2, -1], 1, Concat, [1]] # cat backbone P4
    - [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1
    
    - [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
    - [[-1, 12], 1, Concat, [1]] # cat Y4
    - [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0
    
    - [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
    - [[-1, 7], 1, Concat, [1]] # cat Y5
    - [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1
    
    - [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
  • rtdetr-x.yaml

    # Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

    # Ultralytics RT-DETR-x hybrid object detection model with P3/8 - P5/32 outputs
    # Model docs: https://docs.ultralytics.com/models/rtdetr
    # Task docs: https://docs.ultralytics.com/tasks/detect

    # Parameters
    nc: 80 # number of classes
    scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
      # [depth, width, max_channels]
      x: [1.00, 1.00, 2048]

    backbone:
    # [from, repeats, module, args]
    - [-1, 1, HGStem, [32, 64]] # 0-P2/4
    - [-1, 6, HGBlock, [64, 128, 3]] # stage 1

    - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
    - [-1, 6, HGBlock, [128, 512, 3]]
    - [-1, 6, HGBlock, [128, 512, 3, False, True]] # 4-stage 2
    
    - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 5-P4/16
    - [-1, 6, HGBlock, [256, 1024, 5, True, False]] # cm, c2, k, light, shortcut
    - [-1, 6, HGBlock, [256, 1024, 5, True, True]]
    - [-1, 6, HGBlock, [256, 1024, 5, True, True]]
    - [-1, 6, HGBlock, [256, 1024, 5, True, True]]
    - [-1, 6, HGBlock, [256, 1024, 5, True, True]] # 10-stage 3
    
    - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 11-P5/32
    - [-1, 6, HGBlock, [512, 2048, 5, True, False]]
    - [-1, 6, HGBlock, [512, 2048, 5, True, True]] # 13-stage 4

    head:
    - [-1, 1, Conv, [384, 1, 1, None, 1, 1, False]] # 14 input_proj.2
    - [-1, 1, AIFI, [2048, 8]]
    - [-1, 1, Conv, [384, 1, 1]] # 16, Y5, lateral_convs.0

    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [10, 1, Conv, [384, 1, 1, None, 1, 1, False]] # 18 input_proj.1
    - [[-2, -1], 1, Concat, [1]]
    - [-1, 3, RepC3, [384]] # 20, fpn_blocks.0
    - [-1, 1, Conv, [384, 1, 1]] # 21, Y4, lateral_convs.1
    
    - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
    - [4, 1, Conv, [384, 1, 1, None, 1, 1, False]] # 23 input_proj.0
    - [[-2, -1], 1, Concat, [1]] # cat backbone P4
    - [-1, 3, RepC3, [384]] # X3 (25), fpn_blocks.1
    
    - [-1, 1, Conv, [384, 3, 2]] # 26, downsample_convs.0
    - [[-1, 21], 1, Concat, [1]] # cat Y4
    - [-1, 3, RepC3, [384]] # F4 (28), pan_blocks.0
    
    - [-1, 1, Conv, [384, 3, 2]] # 29, downsample_convs.1
    - [[-1, 16], 1, Concat, [1]] # cat Y5
    - [-1, 3, RepC3, [384]] # F5 (31), pan_blocks.1
    
    - [[25, 28, 31], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
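
The two ResNet configuration files can be sanity-checked by tracing channels and strides through their backbone rows. The sketch below copies the `ResNetLayer` arguments from the files above and reads them as `[c1, c2, stride, is_first, n_blocks]`. It assumes that the stem layer downsamples by 4 and outputs `c2` channels, and that every other layer is a bottleneck stage outputting `c2 * 4` channels. `trace_backbone` and `depth` are illustrative helpers, not part of ultralytics.

```python
# Rows copied from the backbone sections above: [c1, c2, stride, is_first, n_blocks]
RESNET50 = [
    [3, 64, 1, True, 1],
    [64, 64, 1, False, 3],
    [256, 128, 2, False, 4],
    [512, 256, 2, False, 6],
    [1024, 512, 2, False, 3],
]
RESNET101 = [row[:] for row in RESNET50]
RESNET101[3][4] = 23  # the only row that differs between the two files


def trace_backbone(rows, expansion: int = 4):
    """Return (out_channels, total_stride) after each ResNetLayer row.

    Assumptions: the stem (is_first) downsamples by 4 and outputs c2 channels;
    every other layer outputs c2 * expansion channels and applies its own stride.
    """
    out, total_stride, prev_c = [], 1, None
    for c1, c2, s, is_first, _n in rows:
        if prev_c is not None:
            assert c1 == prev_c, f"channel mismatch: expected {prev_c}, got {c1}"
        if is_first:
            prev_c, total_stride = c2, total_stride * 4
        else:
            prev_c, total_stride = c2 * expansion, total_stride * s
        out.append((prev_c, total_stride))
    return out


def depth(rows) -> int:
    """Conventional ResNet depth: stem conv + 3 convs per bottleneck + the
    classifier layer that the naming convention counts (absent in a detector)."""
    return 1 + 3 * sum(n for *_, is_first, n in rows if not is_first) + 1


for name, rows in (("ResNet50", RESNET50), ("ResNet101", RESNET101)):
    print(name, depth(rows), trace_backbone(rows))
```

Under these assumptions, layers 2, 3, and 4 come out at strides 8, 16, and 32, which matches the three backbone indices the head reads from (`2`, `3`, and the last layer via `-1`).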

3 Training

yolo detect train data=coco128.yaml model=cfg/models/rt-detr/rtdetr-resnet50.yaml epochs=100 batch=16 imgsz=640 device=cpu 

4 Inference

yolo task=detect mode=predict model=best.pt source=test.jpg device=cpu
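
Because the decoder emits a fixed set of queries, turning raw outputs into detections needs only a score threshold and a box conversion, with no NMS pass. The sketch below illustrates that idea on toy data. It assumes boxes arrive as normalized `(cx, cy, w, h)` and scores as one list per query; `decode_queries` and this layout are assumptions for illustration, not the ultralytics post-processing code.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, score, class_index) in pixel coordinates
Detection = Tuple[float, float, float, float, float, int]


def decode_queries(boxes, scores, img_w: int, img_h: int, conf: float = 0.5) -> List[Detection]:
    """Turn per-query outputs into detections with a score threshold only (no NMS).

    boxes[i]  = (cx, cy, w, h), normalized to [0, 1] (assumed layout)
    scores[i] = per-class scores for query i
    """
    detections = []
    for (cx, cy, w, h), cls_scores in zip(boxes, scores):
        score = max(cls_scores)
        if score < conf:
            continue  # low-confidence queries are simply dropped
        cls = cls_scores.index(score)
        detections.append((
            (cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h,
            score, cls,
        ))
    return detections


boxes = [(0.5, 0.5, 0.5, 0.5), (0.25, 0.25, 0.5, 0.5)]
scores = [[0.1, 0.9], [0.2, 0.3]]
dets = decode_queries(boxes, scores, img_w=640, img_h=480)
print(dets)  # only the first query passes the 0.5 threshold
```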

5 Validation

yolo task=detect mode=val model=best.pt data=coco128.yaml device=cpu

6 Exporting the model

yolo task=detect mode=export model=best.pt format=onnx  
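
The four commands in this part share one shape: `yolo`, a task, a mode, and a list of `key=value` overrides. When scripting the whole workflow it can be convenient to build them from one helper. The sketch below only assembles and prints the commands; `yolo_command` is a hypothetical helper, and it writes the training call in the `task=... mode=...` form used by the other three commands rather than the `yolo detect train` form shown earlier.

```python
import shlex


def yolo_command(mode: str, **overrides) -> list:
    """Build the argv list for one `yolo` CLI call in key=value form."""
    args = ["yolo", "task=detect", f"mode={mode}"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


commands = [
    yolo_command("train", data="coco128.yaml",
                 model="cfg/models/rt-detr/rtdetr-resnet50.yaml",
                 epochs=100, batch=16, imgsz=640, device="cpu"),
    yolo_command("predict", model="best.pt", source="test.jpg", device="cpu"),
    yolo_command("val", model="best.pt", data="coco128.yaml", device="cpu"),
    yolo_command("export", model="best.pt", format="onnx"),
]

for argv in commands:
    # Printed only; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Keeping the commands as argument lists avoids shell-quoting problems if a path later contains spaces.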

That concludes this post.
