YOLO26 Paper Close Reading (Paragraph-by-Paragraph Analysis)

YOLO26: KEY ARCHITECTURAL ENHANCEMENTS AND PERFORMANCE BENCHMARKING FOR REAL-TIME OBJECT DETECTION

YOLO26:实时目标检测的关键架构改进与性能基准测试

Paper: https://arxiv.org/abs/2509.25164

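Ranjan Sapkota¹, Rahul Harsha Cheppally², Ajay Sharda², Manoj Karkee¹

¹ Cornell University, Department of Biological and Environmental Engineering, Ithaca, NY, USA

² Kansas State University, Department of Biological and Agricultural Engineering, Manhattan, KS, USA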

2025


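【Paper Summary】
YOLO26, released in September 2025, is the latest real-time object detection model in the YOLO family. It makes four key technical changes. First, it removes Distribution Focal Loss (DFL) to cut computational complexity and inference latency. Second, it adopts an end-to-end, NMS-free inference architecture that outputs final detections directly. Third, it integrates Progressive Loss Balancing (ProgLoss) and Small-Target-Aware Label Assignment (STAL) to stabilize training and improve small-object detection. Fourth, it introduces the MuSGD optimizer, which combines the generalization of stochastic gradient descent (SGD) with the convergence behavior of the Muon optimizer. Together these changes keep accuracy high while making deployment on edge and low-power platforms markedly more efficient. The model supports object detection, instance segmentation, pose estimation, and other tasks, exports to ONNX, TensorRT, CoreML, and other formats with INT8/FP16 quantization, and targets real-time AI applications in robotics, smart manufacturing, and IoT.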

ABSTRACT

This study presents a comprehensive analysis of Ultralytics YOLO26, highlighting its key architectural enhancements and performance benchmarking for real-time edge object detection. YOLO26, released in September 2025, stands as the newest and most advanced member of the YOLO family, purpose-built to deliver efficiency, accuracy, and deployment readiness on edge and low-power devices. The paper sequentially details YOLO26's architectural innovations, including the removal of Distribution Focal Loss (DFL), adoption of end-to-end NMS-free inference, integration of ProgLoss and Small-Target-Aware Label Assignment (STAL), and the introduction of the MuSGD optimizer for stable convergence. Beyond architecture, the study positions YOLO26 as a multi-task framework, supporting object detection, instance segmentation, pose/keypoints estimation, oriented detection, and classification. We present performance benchmarks of YOLO26 on edge devices such as NVIDIA Jetson Nano and Orin, comparing its results with YOLOv8, YOLOv11, YOLOv12, YOLOv13, and transformer-based detectors. This paper further explores real-time deployment pathways, flexible export options (ONNX, TensorRT, CoreML, TFLite), and quantization for INT8/FP16. Practical use cases of YOLO26 across robotics, manufacturing, and IoT are highlighted to demonstrate cross-industry adaptability. Finally, insights on deployment efficiency and broader implications are discussed, with future directions for YOLO26 and the YOLO lineage outlined.

【翻译】本研究对Ultralytics YOLO26进行了全面分析,重点介绍了其关键架构改进和实时边缘目标检测的性能基准测试。YOLO26于2025年9月发布,是YOLO家族中最新、最先进的成员,专门为在边缘和低功耗设备上提供效率、准确性和部署就绪性而构建。本文依次详述了YOLO26的架构创新,包括移除分布焦点损失(DFL)、采用端到端无NMS推理、集成ProgLoss和小目标感知标签分配(STAL),以及引入MuSGD优化器以实现稳定收敛。除了架构之外,本研究将YOLO26定位为多任务框架,支持目标检测、实例分割、姿态/关键点估计、定向检测和分类。我们展示了YOLO26在NVIDIA Jetson Nano和Orin等边缘设备上的性能基准,并将其结果与YOLOv8、YOLOv11、YOLOv12、YOLOv13和基于transformer的检测器进行比较。本文进一步探讨了实时部署路径、灵活的导出选项(ONNX、TensorRT、CoreML、TFLite)以及INT8/FP16量化。突出展示了YOLO26在机器人技术、制造业和物联网领域的实际用例,以展示跨行业的适应性。最后,讨论了部署效率和更广泛影响的见解,并概述了YOLO26和YOLO系列的未来发展方向。

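【Analysis】YOLO26 makes four key changes. The first is the removal of DFL: although DFL can improve bounding-box regression accuracy, it adds computation and inference latency, which matters most on resource-constrained edge devices. The second is end-to-end NMS-free inference: conventional non-maximum suppression is an extra post-processing step with its own cost, whereas YOLO26 is designed to output final detections directly, reducing inference time and memory use. The third is the integration of ProgLoss and STAL: the former progressively rebalances the weights of the loss terms during training, and the latter adapts label assignment to small objects; together they improve detection in complex scenes. The fourth is the MuSGD optimizer, which combines the generalization of SGD with the convergence stability of Muon to give a more reliable optimization path for large-scale training.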

Keywords: YOLO26 · Edge AI · Multi-task object detection · NMS-free inference · Small-target recognition · You Only Look Once · Object detection · MuSGD optimizer

1 Introduction

Object detection has emerged as one of the most critical tasks in computer vision, enabling machines to localize and classify multiple objects within an image or video stream [1, 2]. From autonomous driving and robotics to surveillance, medical imaging, agriculture, and smart manufacturing, real-time object detection algorithms serve as the backbone of artificial intelligence (AI) applications [3, 4]. Among these algorithms, the You Only Look Once (YOLO) family has established itself as the most influential series of models for real-time object detection, combining accuracy with unprecedented inference speed [5, 6, 7, 7]. Since its introduction in 2016, YOLO has evolved through numerous architectural revisions, each addressing limitations of its predecessors while integrating cutting-edge advances in neural network design, loss functions, and deployment efficiency [5]. The release of YOLO26 in September 2025 represents the latest milestone in this evolutionary trajectory, introducing architectural simplifications, novel optimizers, and enhanced edge deployment capabilities designed for low-power devices.

【翻译】目标检测已成为计算机视觉中最关键的任务之一,使机器能够在图像或视频流中定位和分类多个目标[1, 2]。从自动驾驶和机器人技术到监控、医学成像、农业和智能制造,实时目标检测算法作为人工智能(AI)应用的支柱[3, 4]。在这些算法中,You Only Look Once (YOLO)系列已确立自己作为实时目标检测最具影响力的模型系列,将准确性与前所未有的推理速度相结合[5, 6, 7, 7]。自2016年推出以来,YOLO经历了众多架构修订,每次都解决了前代的局限性,同时整合了神经网络设计、损失函数和部署效率方面的前沿进展[5]。YOLO26于2025年9月发布,代表了这一演进轨迹中的最新里程碑,引入了架构简化、新颖优化器和专为低功耗设备设计的增强边缘部署能力。

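【Analysis】Each YOLO generation builds on its predecessor with new network architectures, loss designs, and training strategies, progressively tackling small-object detection, multi-scale fusion, and training stability. YOLO26 marks a major step for the series in edge computing and low-power deployment: through architectural simplification and algorithmic optimization, it aims to run high-accuracy detection in real time on resource-constrained devices.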

Table 1 provides a detailed comparison of YOLO models from version YOLOv1 to YOLOv13 and YOLO26, highlighting their release years, key architectural innovations, performance enhancements, and development frameworks.

【翻译】表1提供了从YOLOv1版本到YOLOv13和YOLO26的YOLO模型的详细比较,突出了它们的发布年份、关键架构创新、性能增强和开发框架。

Table 1: Summary of YOLOv1 to YOLOv13 and YOLO26 models: release year, architecture, innovations, frameworks.

【翻译】表1:YOLOv1到YOLOv13和YOLO26模型总结:发布年份、架构、创新点、框架。

The YOLO framework was first proposed by Joseph Redmon and colleagues in 2016, introducing a paradigm shift in object detection [8]. Unlike traditional two-stage detectors such as R-CNN [18] and Faster R-CNN [19], which separated region proposal from classification, YOLO formulated detection as a single regression problem [20]. By directly predicting bounding boxes and class probabilities in one forward pass through a convolutional neural network (CNN), YOLO achieved real-time speeds while maintaining competitive accuracy [21, 20]. This efficiency made YOLOv1 highly attractive for applications where latency was a critical factor, including robotics, autonomous navigation, and live video analytics. Subsequent versions YOLOv2 (2017) [9]and YOLOv3 (2018) [10] significantly improved accuracy while retaining real-time performance. YOLOv2 introduced batch normalization, anchor boxes, and multi-scale training, which increased robustness across varying object sizes. YOLOv3 leveraged a deeper architecture based on Darknet-53, along with multi-scale feature maps for better small-object detection. These enhancements made YOLOv3 the de facto standard for academic and industrial applications for several years [22, 5, 23].

【翻译】YOLO框架最初由Joseph Redmon及其同事于2016年提出,在目标检测领域引入了范式转变[8]。与传统的两阶段检测器如R-CNN [18]和Faster R-CNN [19]不同,后者将区域提议与分类分离,YOLO将检测表述为单一回归问题[20]。通过在卷积神经网络(CNN)的一次前向传播中直接预测边界框和类别概率,YOLO在保持竞争性准确度的同时实现了实时速度[21, 20]。这种效率使YOLOv1在延迟是关键因素的应用中极具吸引力,包括机器人技术、自主导航和实时视频分析。后续版本YOLOv2 (2017) [9]和YOLOv3 (2018) [10]在保持实时性能的同时显著提高了准确性。YOLOv2引入了批量归一化、锚框和多尺度训练,增强了对不同目标尺寸的鲁棒性。YOLOv3利用基于Darknet-53的更深架构,以及多尺度特征图来更好地检测小目标。这些增强使YOLOv3在数年内成为学术和工业应用的事实标准[22, 5, 23]。

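【Analysis】YOLO's innovation was to collapse object detection from a multi-stage pipeline into a single regression task. Traditional two-stage detectors first generate candidate regions and then classify and regress a box for each region, which is accurate but computationally expensive. YOLO takes the whole image as input and predicts the positions and classes of all objects simultaneously at the output layer, so features are extracted once rather than per region and the proposal stage disappears; duplicate boxes are still removed by NMS afterwards. In YOLOv2, batch normalization stabilized and accelerated the training of deeper networks, anchor boxes gave box prediction a more flexible starting point, and multi-scale training exposed the model to several input resolutions so that it adapts better to varying object sizes. YOLOv3's Darknet-53 backbone uses residual connections, which mitigate the degradation problem of deeper networks, and its predictions at multiple feature-map scales capture both fine- and coarse-grained features. This matters most for small objects, whose detail is best preserved in the shallower, higher-resolution feature maps.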

As the demand for higher accuracy grew, especially in challenging domains such as aerial imagery, agriculture, and medical analysis, YOLO models diversified into more advanced architectures. YOLOv4 (2020) [11] introduced Cross-Stage Partial Networks (CSPNet), improved activation functions like Mish, and advanced training strategies including mosaic data augmentation and CIoU loss. YOLOv5 (Ultralytics, 2020), though unofficial, gained immense popularity due to its PyTorch implementation, extensive community support, and simplified deployment across diverse platforms. YOLOv5 also brought modularity, making it easier to adapt for segmentation, classification, and edge applications. Further developments included YOLOv6[12] and YOLOv7 [13] (2022), which integrated advanced optimization techniques, parameter-efficient modules, and transformer-inspired blocks. These iterations pushed YOLO closer to state-of-the-art (SoTA) accuracy benchmarks while retaining a focus on real-time inference. The YOLO ecosystem, by this point, had firmly established itself as the leading family of models in object detection research and deployment.

【翻译】随着对更高准确性需求的增长,特别是在航空图像、农业和医学分析等具有挑战性的领域,YOLO模型发展出更先进的架构。YOLOv4 (2020) [11]引入了跨阶段部分网络(CSPNet)、改进的激活函数如Mish,以及包括马赛克数据增强和CIoU损失在内的先进训练策略。YOLOv5 (Ultralytics, 2020)虽然非官方,但由于其PyTorch实现、广泛的社区支持和跨多样化平台的简化部署而获得了巨大的流行度。YOLOv5还带来了模块化特性,使其更容易适应分割、分类和边缘应用。进一步的发展包括YOLOv6[12]和YOLOv7 [13] (2022),它们集成了先进的优化技术、参数高效模块和受transformer启发的块。这些迭代将YOLO推向更接近最先进(SoTA)准确性基准,同时保持对实时推理的关注。到此时,YOLO生态系统已牢固确立了自己作为目标检测研究和部署中领先模型家族的地位。

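【Analysis】YOLOv4's CSPNet design splits the feature map into two parts that follow different paths, reducing computation while preserving representational power. The Mish activation is smoother than ReLU, which helps training stability and convergence. Mosaic augmentation stitches four images into one training image, which increases data diversity and forces the model to recognize objects against cluttered backgrounds. CIoU loss extends plain IoU with the distance between box centers and the consistency of aspect ratios, giving a more informative regression signal. YOLOv5's modular, configurable network lets users pick a model variant sized for their application. YOLOv6 and YOLOv7 pursued parameter-efficient designs (for example, re-parameterized convolutions and efficient layer aggregation) to cut parameters and computation without giving up accuracy. The transformer-inspired blocks mentioned in the paper bring attention-style mechanisms into detection so that the model can capture longer-range dependencies and global context.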
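To make the CIoU description concrete, here is a minimal sketch of the standard CIoU formula in plain Python. It is not taken from the paper or from any YOLO codebase; the function name `ciou` and the `(x1, y1, x2, y2)` box layout are assumptions chosen for illustration. The training loss is usually taken as `1 - ciou(...)`.

```python
import math

def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU: intersection area over union area.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    iou = inter / (wa * ha + wb * hb - inter + eps)

    # Centre-distance penalty: squared distance between box centres,
    # normalised by the squared diagonal of the smallest enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0

    # Aspect-ratio consistency penalty.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

print(ciou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes: close to 1
print(ciou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: negative
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, the centre-distance term still gives a gradient that pulls a far-off prediction toward the target.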

Ultralytics, the primary maintainer of modern YOLO releases, redefined the framework with YOLOv8 (2023) [24]. YOLOv8 featured a decoupled detection head, anchor-free predictions, and refined training strategies, resulting in substantial improvements in both accuracy and deployment versatility [25]. It was widely adopted in industry due to its clean Python API, compatibility with TensorRT, CoreML, and ONNX, and availability of variants optimized for speed versus accuracy trade-offs (nano, small, medium, large, and extra-large). YOLOv9 [14], YOLOv10 [15], and YOLO11 followed in rapid succession, each iteration pushing the boundaries of architecture and performance. YOLOv9 introduced GELAN (Generalized Efficient Layer Aggregation Network) and Progressive Distillation, combining efficiency with higher representational capacity. YOLOv10 focused on balancing accuracy and inference latency with hybrid task-aligned assignments. YOLOv11 further refined Ultralytics' vision, offering higher efficiency on GPUs while maintaining strong small-object performance [5]. Together, these models cemented Ultralytics' reputation for producing production-ready YOLO releases tailored to modern deployment pipelines.

【翻译】Ultralytics作为现代YOLO版本的主要维护者,通过YOLOv8 (2023) [24]重新定义了该框架。YOLOv8具有解耦检测头、无锚框预测和精细化训练策略,在准确性和部署多样性方面都有显著改进[25]。由于其简洁的Python API、与TensorRT、CoreML和ONNX的兼容性,以及针对速度与准确性权衡优化的变体(nano、small、medium、large和extra-large),它在工业界得到了广泛采用。YOLOv9 [14]、YOLOv10 [15]和YOLO11紧随其后,每次迭代都推动了架构和性能的边界。YOLOv9引入了GELAN(广义高效层聚合网络)和渐进蒸馏,将效率与更高的表征能力相结合。YOLOv10专注于通过混合任务对齐分配来平衡准确性和推理延迟。YOLOv11进一步完善了Ultralytics的愿景,在GPU上提供更高效率的同时保持强大的小目标性能[5]。这些模型共同巩固了Ultralytics在生产就绪的YOLO版本方面的声誉,这些版本专为现代部署管道量身定制。

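【Analysis】YOLOv8's decoupled head places classification and regression in separate branches, which avoids interference between the two tasks and lets each learn more specialized features. Anchor-free prediction drops the predefined anchor boxes and predicts object centers and sizes directly, which simplifies the network and reduces hyperparameter tuning. The refined training strategies cover improved data augmentation, learning-rate scheduling, and loss design, which together improve generalization and convergence stability. GELAN fuses multi-scale features through efficient layer aggregation, strengthening representation learning at modest computational cost. Progressive distillation, as the paper describes it, follows the teacher-student paradigm, transferring knowledge step by step to train a more compact student network of comparable accuracy. YOLOv10 removes the NMS dependency through consistent dual assignments and applies a holistic efficiency-accuracy-driven design strategy to the architecture, raising inference speed while keeping accuracy high. YOLOv11's GPU-side gains come mainly from better memory-access patterns and fuller use of parallel computation, which let the model exploit modern GPUs more effectively.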

Following YOLO11, alternative versions YOLOv12[16] and YOLOv13 [17] introduced attention-centric designs and advanced architectural components that sought to maximize accuracy across diverse datasets. These models explored multi-head self-attention, improved multi-scale fusion, and stronger training regularization strategies. While they offered strong benchmarks, they retained reliance on Non-Maximum Suppression (NMS) and Distribution Focal Loss (DFL), which introduced latency overhead and export challenges, especially for low-power devices. The limitations of NMS-based post-processing and complex loss formulations motivated the development of YOLO26 (Ultralytics YOLO26 Official Source). By September 2025, at the YOLO Vision 2025 event in London, Ultralytics unveiled YOLO26 as a next-generation model optimized for edge computing, robotics, and mobile AI.

【翻译】在YOLO11之后,替代版本YOLOv12[16]和YOLOv13 [17]引入了以注意力为中心的设计和先进的架构组件,旨在最大化跨多样化数据集的准确性。这些模型探索了多头自注意力、改进的多尺度融合和更强的训练正则化策略。虽然它们提供了强大的基准测试结果,但它们仍然依赖于非极大值抑制(NMS)和分布焦点损失(DFL),这引入了延迟开销和导出挑战,特别是对于低功耗设备。基于NMS的后处理和复杂损失公式的局限性促使了YOLO26的开发(Ultralytics YOLO26官方来源)。到2025年9月,在伦敦举行的YOLO Vision 2025活动上,Ultralytics发布了YOLO26,作为针对边缘计算、机器人技术和移动AI优化的下一代模型。

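【Analysis】Multi-head self-attention is borrowed from the Transformer architecture. It captures long-range dependencies by relating every position in a feature map to every other position, which helps the network understand spatial relationships between objects and their context in complex scenes. The multi-scale fusion improvements lie mainly in the feature-pyramid design, where finer-grained fusion handles objects of different sizes. NMS post-processing must sort and filter all predicted boxes, which incurs noticeable cost when executed on a CPU, especially when many objects are detected. DFL improves box-regression accuracy by modeling a probability distribution, but its extra operations add computational burden on resource-constrained devices. YOLO26 is the new generation designed specifically for edge deployment.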

YOLO26 is engineered around three guiding principles (simplicity, efficiency, and innovation), and the overview in Figure 1 situates these choices alongside its five supported tasks: object detection, instance segmentation, pose/keypoints detection, oriented detection, and classification. On the inference path, YOLO26 eliminates NMS, producing native end-to-end predictions that remove a major post-processing bottleneck, reduce latency variance, and simplify threshold tuning across deployments. On the regression side, it removes DFL, turning distributional box decoding into a lighter, hardware-friendly formulation that exports cleanly to ONNX, TensorRT, CoreML, and TFLite, a practical win for edge and mobile pipelines. Together, these changes yield a leaner graph, faster cold-start, and fewer runtime dependencies, which is particularly beneficial for CPU-bound and embedded scenarios. Training stability and small-object fidelity are addressed through ProgLoss (progressive loss balancing) and STAL (small-target-aware label assignment). ProgLoss adaptively reweights objectives to prevent domination by easy examples late in training, while STAL prioritizes assignment for tiny or occluded instances, improving recall under clutter, foliage, or motion blur conditions common in aerial, robotics, and smart-camera feeds. Optimization is driven by MuSGD, a hybrid that blends the generalization of SGD with momentum/curvature behaviors inspired by Muon-style methods, enabling faster, smoother convergence and more reliable plateaus across scales.

【翻译】YOLO26围绕三个指导原则设计:简洁性、效率性和创新性,图1中的概述将这些选择与其支持的五个任务并列展示:目标检测、实例分割、姿态/关键点检测、定向检测和分类。在推理路径上,YOLO26消除了NMS,产生原生的端到端预测,移除了主要的后处理瓶颈,减少了延迟方差,并简化了跨部署的阈值调优。在回归方面,它移除了DFL,将分布式边界框解码转换为更轻量、硬件友好的表述,能够干净地导出到ONNX、TensorRT、CoreML和TFLite,这对边缘和移动管道来说是实用的胜利。这些变化共同产生了更精简的图结构、更快的冷启动和更少的运行时依赖,这对CPU受限和嵌入式场景特别有益。训练稳定性和小目标保真度通过ProgLoss(渐进损失平衡)和STAL(小目标感知标签分配)来解决。ProgLoss自适应地重新加权目标,防止训练后期简单样本的主导,而STAL优先为微小或被遮挡的实例分配标签,改善在杂乱、叶片或运动模糊条件下的召回率,这些条件在航空、机器人和智能摄像头馈送中很常见。优化由MuSGD驱动,这是一个混合方法,将SGD的泛化能力与受Muon风格方法启发的动量/曲率行为相结合,实现更快、更平滑的收敛和跨尺度更可靠的平台期。

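【Analysis】YOLO26 simplifies its architecture by removing two long-standing components, NMS and DFL. NMS can be dropped because the model learns to suppress duplicate detections inside the network instead of relying on an external post-processing algorithm. Going end-to-end reduces inference time and also removes part of the threshold-tuning burden, since conventional NMS needs a hand-set IoU threshold in addition to a confidence threshold. Removing DFL reduces box regression from predicting a probability distribution to regressing coordinates directly, which lowers computation and improves compatibility across hardware platforms. ProgLoss dynamically adjusts the weights of the loss terms to counter imbalance during training, in particular to keep easy examples from dominating gradient updates late in training, so that learning progresses steadily throughout. STAL adapts label assignment to small objects, giving small and occluded targets a more accurate positive/negative sample assignment so the model learns these hard cases better. MuSGD combines the generalization advantage of SGD with properties of the Muon optimizer, accelerating convergence while keeping training stable.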

Functionally, as highlighted again in Figure 1, YOLO26's five capabilities share a unified backbone/neck and streamlined heads:

• Object Detection: Anchor-free, NMS-free boxes and scores

• Instance Segmentation: Lightweight mask branches coupled to shared features

• Pose/Keypoints Detection: Compact keypoint heads for human or part landmarks

• Oriented Detection: Rotated boxes for oblique objects and elongated targets

• Classification: Single-label logits for pure recognition tasks.

【翻译】在功能上,如图1再次强调的,YOLO26的五种能力共享统一的骨干网络/颈部和精简的检测头:

• 目标检测:无锚框、无NMS的边界框和分数

• 实例分割:与共享特征耦合的轻量级掩码分支

• 姿态/关键点检测:用于人体或部件标志点的紧凑关键点头

• 定向检测:用于倾斜目标和细长目标的旋转边界框

• 分类:用于纯识别任务的单标签逻辑回归

Figure 1: YOLO26 unified architecture supports five key vision tasks: object detection, instance segmentation, pose/keypoints detection, oriented detection, and classification.

【翻译】图1:YOLO26统一架构支持五个关键视觉任务:目标检测、实例分割、姿态/关键点检测、定向检测和分类。

This consolidated design allows multi-task training or task-specific fine-tuning without architectural rework, while the simplified exports preserve portability across accelerators. In sum, YOLO26 advances the YOLO lineage by pairing end-to-end inference and DFL-free regression with ProgLoss, STAL, and MuSGD, yielding a model that is faster to deploy, steadier to train, and broader in capability as visually summarized in Figure 1.

【翻译】这种整合设计允许多任务训练或任务特定的微调,而无需重新设计架构,同时简化的导出保持了跨加速器的可移植性。总之,YOLO26通过将端到端推理和无DFL回归与ProgLoss、STAL和MuSGD相结合,推进了YOLO系列的发展,产生了一个部署更快、训练更稳定、功能更广泛的模型,如图1所示。

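【Analysis】End-to-end inference removes the traditional post-processing step, making the detection pipeline smoother and more efficient. DFL-free regression simplifies box prediction and lowers computational cost. ProgLoss improves training stability by dynamically balancing the loss terms, STAL targets small-object performance, and MuSGD combines the strengths of two optimizers to speed up convergence. Working together, these techniques preserve the traditional strengths of the YOLO series while improving deployment efficiency, training stability, and functional breadth.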

2 Architectural Enhancements in YOLO26

The architecture of YOLO26 follows a streamlined and efficient pipeline that has been purpose-built for real-time object detection across edge and server platforms. As illustrated in Figure 2, the process begins with the ingestion of input data in the form of images or video streams, which are first passed through preprocessing operations including resizing and normalization to standard dimensions suitable for model inference. The data is then fed into the backbone feature extraction stage, where a compact yet powerful convolutional network captures hierarchical representations of visual patterns. To enhance robustness across scales, the architecture generates multi-scale feature maps (Figure 2) that preserve semantic richness for both large and small objects. These feature maps are then merged within a lightweight feature fusion neck, where information is integrated in a computationally efficient manner. Detection-specific processing occurs in the direct regression head, which, unlike prior YOLO versions, outputs bounding boxes and class probabilities without relying on Non-Maximum Suppression (NMS). This end-to-end NMS-free inference (Figure 2) eliminates post-processing overhead and accelerates deployment. Training stability and accuracy are reinforced by ProgLoss balancing and STAL assignment modules, which ensure equitable weighting of loss terms and improved detection of small targets. Model optimization is guided by the MuSGD optimizer, combining the strengths of SGD and Muon for faster and more reliable convergence. Deployment efficiency is further enhanced through quantization, with support for FP16 and INT8 precision, enabling acceleration on CPUs, NPUs, and GPUs with minimal accuracy degradation. Finally, the pipeline culminates in the generation of output predictions, including bounding boxes and class assignments that can be visualized overlaid on the input image. 
Overall, the architecture of YOLO26 demonstrates a carefully balanced design philosophy that simultaneously advances accuracy, stability, and deployment simplicity.

【翻译】YOLO26的架构遵循一个精简高效的流水线,专门为边缘和服务器平台的实时目标检测而构建。如图2所示,该过程从摄取图像或视频流形式的输入数据开始,首先通过预处理操作,包括调整大小和归一化到适合模型推理的标准尺寸。然后将数据输入到骨干特征提取阶段,其中紧凑而强大的卷积网络捕获视觉模式的层次表示。为了增强跨尺度的鲁棒性,架构生成多尺度特征图(图2),为大小目标保持语义丰富性。然后在轻量级特征融合颈部合并这些特征图,以计算高效的方式集成信息。检测特定处理发生在直接回归头中,与之前的YOLO版本不同,它输出边界框和类别概率而不依赖于非极大值抑制(NMS)。这种端到端的无NMS推理(图2)消除了后处理开销并加速了部署。训练稳定性和准确性通过ProgLoss平衡和STAL分配模块得到加强,确保损失项的公平加权和小目标检测的改进。模型优化由MuSGD优化器指导,结合SGD和Muon的优势,实现更快更可靠的收敛。部署效率通过量化进一步增强,支持FP16和INT8精度,在CPU、NPU和GPU上实现加速,同时最小化精度下降。最后,流水线在生成输出预测中达到高潮,包括可以可视化叠加在输入图像上的边界框和类别分配。总体而言,YOLO26的架构展示了一个精心平衡的设计理念,同时推进了准确性、稳定性和部署简单性。
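The INT8 quantization mentioned in the pipeline can be illustrated with a toy example. The sketch below shows symmetric per-tensor quantization in plain Python; it is a generic illustration, not the scheme used by any particular export backend, and the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes, scale)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # at most scale / 2
```

Each weight is stored in one byte instead of four, and the rounding error per value is bounded by half a quantization step, which is why a well-calibrated INT8 model can stay close to its FP32 accuracy.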

YOLO26 introduces several key architectural innovations that differentiate it from prior generations of YOLO models. These enhancements not only improve training stability and inference efficiency but also fundamentally reshape the deployment pipeline for real-time edge devices. In this section, we describe four major contributions of YOLO26: (i) the removal of Distribution Focal Loss (DFL), (ii) the introduction of end-to-end Non-Maximum Suppression (NMS)-free inference, (iii) novel loss function strategies including Progressive Loss Balancing (ProgLoss) and Small-Target-Aware Label Assignment (STAL), and (iv) the development of the MuSGD optimizer for stable and efficient convergence. Each of these architectural enhancements is discussed in detail, with comparative insights highlighting their advantages over earlier YOLO versions such as YOLOv8, YOLOv11, YOLOv12, and YOLOv13.

【翻译】YOLO26引入了几个关键的架构创新,使其与之前几代YOLO模型区别开来。这些增强不仅改善了训练稳定性和推理效率,还从根本上重塑了实时边缘设备的部署流水线。在本节中,我们描述了YOLO26的四个主要贡献:(i) 移除分布焦点损失(DFL),(ii) 引入端到端无非极大值抑制(NMS)推理,(iii) 新颖的损失函数策略,包括渐进损失平衡(ProgLoss)和小目标感知标签分配(STAL),以及 (iv) 开发MuSGD优化器以实现稳定高效的收敛。每个架构增强都将详细讨论,并提供比较见解,突出它们相对于早期YOLO版本(如YOLOv8、YOLOv11、YOLOv12和YOLOv13)的优势。

Figure 2: Simplified Architecture diagram of Ultralytics YOLO26

【翻译】图2:Ultralytics YOLO26的简化架构图

2.1 Removal of Distribution Focal Loss (DFL)

One of the most significant architectural simplifications in YOLO26 is the removal of the Distribution Focal Loss (DFL) module (Figure 3a), which had been present in prior YOLO releases such as YOLOv8 and YOLOv11. DFL was originally designed to improve bounding box regression by predicting probability distributions for box coordinates, thereby allowing more precise localization of objects. While this strategy demonstrated accuracy gains in earlier models, it also introduced non-trivial computational overhead and export difficulties. In practice, DFL required specialized handling during inference and model export, which complicated deployment pipelines targeting hardware accelerators such as ONNX, CoreML, TensorRT, or TFLite.

【翻译】YOLO26最重要的架构简化之一是移除了分布焦点损失(DFL)模块(图3a),该模块在之前的YOLO版本(如YOLOv8和YOLOv11)中一直存在。DFL最初设计用于通过预测边界框坐标的概率分布来改善边界框回归,从而实现更精确的目标定位。虽然这种策略在早期模型中展示了准确性提升,但它也引入了不可忽视的计算开销和导出困难。在实践中,DFL在推理和模型导出过程中需要专门处理,这使得针对硬件加速器(如ONNX、CoreML、TensorRT或TFLite)的部署流水线变得复杂。

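【Analysis】Distribution Focal Loss (DFL) is a more elaborate approach to box regression: rather than predicting each box coordinate directly, it predicts a probability distribution over that coordinate. The core idea is to turn continuous coordinate regression into a discrete classification problem, learning a distribution over candidate positions and reading the coordinate off it. In theory this yields finer-grained localization, but it complicates deployment. First, computing the distribution costs extra resources and adds inference time. Second, when the model is exported to other hardware platforms, the distribution decoding may need operator support that some lightweight inference engines lack, which can cause conversion failures or degraded performance.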
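As a rough illustration of what distribution-based decoding involves, the sketch below decodes one box side from logits over discrete bins by taking the expectation of a softmax, and compares the number of regression outputs per box with and without DFL. The value `REG_MAX = 16` matches the YOLOv8 default but is an assumption here, and the helper names are hypothetical; the paper does not publish YOLO26's head code.

```python
import math

REG_MAX = 16  # bins per box side; assumed for illustration (YOLOv8 default)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dfl_decode(bin_logits):
    """DFL-style decoding: expected bin index under the predicted distribution."""
    probs = softmax(bin_logits)
    return sum(i * p for i, p in enumerate(probs))

def outputs_per_box(use_dfl, reg_max=REG_MAX):
    """Regression channels per box: 4 sides x reg_max bins with DFL, else 4 scalars."""
    return 4 * reg_max if use_dfl else 4

peaked = [0.0] * REG_MAX
peaked[2] = 100.0                    # distribution concentrated on bin 2
print(dfl_decode(peaked))            # close to 2.0
print(dfl_decode([0.0] * REG_MAX))   # uniform distribution: 7.5
print(outputs_per_box(True), outputs_per_box(False))  # 64 4
```

A DFL-free head skips the softmax and weighted sum entirely and emits the four offsets directly, which is the simplification the section describes.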

By eliminating DFL, YOLO26 simplifies the model's architecture, making bounding box prediction a more straightforward regression task without sacrificing performance. Comparative analysis indicates that YOLO26 achieves comparable or superior accuracy to DFL-based YOLO models, particularly when combined with other innovations such as ProgLoss and STAL. Moreover, the removal of DFL significantly reduces inference latency and improves cross-platform compatibility. This makes YOLO26 more suitable for edge AI scenarios, where lightweight and hardware-friendly models are paramount.

【翻译】通过消除DFL,YOLO26简化了模型架构,使边界框预测成为更直接的回归任务,而不会牺牲性能。比较分析表明,YOLO26实现了与基于DFL的YOLO模型相当或更优的准确性,特别是与其他创新(如ProgLoss和STAL)结合使用时。此外,移除DFL显著减少了推理延迟并改善了跨平台兼容性。这使得YOLO26更适合边缘AI场景,在这些场景中,轻量级和硬件友好的模型至关重要。

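【Analysis】With DFL removed, YOLO26 returns to the more traditional but more efficient direct coordinate regression. In principle this could cost some accuracy, which YOLO26 offsets with techniques such as ProgLoss and STAL: ProgLoss dynamically reweights the loss terms during training, and STAL targets small-object detection. The combination preserves detection accuracy while improving deployment efficiency. The latency reduction comes mainly from lower computational complexity, and the better cross-platform compatibility comes from the simpler model structure, which runs on a wide range of hardware without special operator support.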

In contrast, models such as YOLOv12 and YOLOv13 retained DFL in their architectures, which limited their applicability on constrained devices despite strong accuracy benchmarks on GPU-rich environments. YOLO26 therefore marks a decisive step toward aligning state-of-the-art object detection performance with the realities of mobile, embedded, and industrial applications.

【翻译】相比之下,YOLOv12和YOLOv13等模型在其架构中保留了DFL,尽管在GPU资源丰富的环境中具有强大的准确性基准,但这限制了它们在受限设备上的适用性。因此,YOLO26标志着将最先进的目标检测性能与移动、嵌入式和工业应用的现实需求相结合的决定性步骤。

Figure 3: Key architectural enhancements in YOLO26: (a) Removal of Distribution Focal Loss (DFL) streamlines bounding box regression, boosting efficiency and export compatibility. (b) End-to-end NMS-free inference eliminates post-processing bottlenecks, enabling faster and simpler deployment. (c) ProgLoss and STAL enhance training stability and significantly improve small-object detection accuracy. (d) The MuSGD optimizer combines SGD and Muon strengths, achieving faster, more stable convergence in training.

【翻译】图3:YOLO26的关键架构增强:(a) 移除分布焦点损失(DFL)简化了边界框回归,提升了效率和导出兼容性。(b) 端到端无NMS推理消除了后处理瓶颈,实现了更快更简单的部署。(c) ProgLoss和STAL增强了训练稳定性并显著改善了小目标检测精度。(d) MuSGD优化器结合了SGD和Muon的优势,在训练中实现了更快更稳定的收敛。

2.2 End-to-End NMS-Free Inference

Another groundbreaking feature of YOLO26 is its native support for end-to-end inference without Non-Maximum Suppression (NMS) (Refer to Figure 3b). Traditional YOLO models, including YOLOv8 through YOLOv13, rely heavily on NMS as a post-processing step to filter out duplicate predictions by retaining only the bounding boxes with the highest confidence scores. While effective, NMS adds additional latency to the pipeline and requires manually tuned hyperparameters such as the Intersection-over-Union (IoU) threshold. This dependence on a handcrafted post-processing step introduces fragility in deployment pipelines, especially for edge devices and latency-sensitive applications.

【翻译】YOLO26的另一个突破性特征是其对端到端推理的原生支持,无需非极大值抑制(NMS)(参见图3b)。传统的YOLO模型,包括YOLOv8到YOLOv13,严重依赖NMS作为后处理步骤,通过仅保留具有最高置信度分数的边界框来过滤重复预测。虽然有效,但NMS为流水线增加了额外的延迟,并且需要手动调整超参数,如交并比(IoU)阈值。这种对手工制作的后处理步骤的依赖在部署流水线中引入了脆弱性,特别是对于边缘设备和延迟敏感的应用。

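【Analysis】Non-maximum suppression (NMS) is the key post-processing step in conventional detectors; it resolves duplicate detections in the model output. A detector typically produces several overlapping boxes, with different confidence scores, for the same object. NMS uses the intersection-over-union (IoU) between boxes to decide whether they cover the same object. The procedure is: sort all boxes by confidence; keep the highest-scoring box; compute the IoU between it and every remaining box; suppress (delete) any box whose IoU exceeds a preset threshold (0.5 is a common choice), on the assumption that it detects the same object; repeat until no boxes remain. The drawbacks are that these comparisons take extra time and that the IoU threshold usually has to be tuned by hand for each application. As a result, model behavior can differ between deployment environments, and on compute-limited edge devices the post-processing overhead can hurt real-time performance.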
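The greedy NMS procedure that earlier YOLO versions depend on can be sketched in a few lines of plain Python. This is a textbook single-class version for illustration, not the implementation used by any specific framework; the `(x1, y1, x2, y2)` box layout and the default threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The loop is sequential and data-dependent (its running time depends on how many boxes survive each round), which is part of why it is awkward to fuse into an exported inference graph and why its latency varies from image to image.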

YOLO26 fundamentally redesigns the prediction head to produce direct, non-redundant bounding box predictions without the need for NMS. This end-to-end design not only reduces inference complexity but also eliminates the dependency on hand-tuned thresholds, thereby simplifying integration into production systems. Comparative benchmarks demonstrate that YOLO26 achieves faster inference speeds than YOLOv11 and YOLOv12, with CPU inference times reduced by up to 43% for the nano model. This makes YOLO26 particularly advantageous for mobile devices, UAVs, and embedded robotics platforms where milliseconds of latency can have substantial operational impacts.

【翻译】YOLO26从根本上重新设计了预测头,以产生直接的、非冗余的边界框预测,而无需NMS。这种端到端设计不仅降低了推理复杂性,还消除了对手动调整阈值的依赖,从而简化了与生产系统的集成。比较基准测试表明,YOLO26实现了比YOLOv11和YOLOv12更快的推理速度,nano模型的CPU推理时间减少了高达43%。这使得YOLO26在移动设备、无人机和嵌入式机器人平台上特别有优势,在这些平台上,毫秒级的延迟可能产生重大的操作影响。

【Analysis】YOLO26 reworks the prediction head so that it directly outputs a set of non-redundant bounding-box predictions.

Beyond speed, the NMS-free approach improves reproducibility and deployment portability, as models no longer require extensive post-processing code. While other advanced detectors such as RT-DETR and Sparse R-CNN have experimented with NMS-free inference, YOLO26 represents the first YOLO release to adopt this paradigm while maintaining YOLO's hallmark balance between speed and accuracy. Compared to YOLOv13, which still depends on NMS, YOLO26's end-to-end pipeline stands out as a forward-looking architecture for real-time detection.

【翻译】除了速度之外,无NMS方法还改善了可重现性和部署可移植性,因为模型不再需要大量的后处理代码。虽然其他先进的检测器如RT-DETR和Sparse R-CNN已经尝试了无NMS推理,但YOLO26代表了第一个采用这种范式的YOLO版本,同时保持了YOLO在速度和准确性之间的标志性平衡。与仍然依赖NMS的YOLOv13相比,YOLO26的端到端流水线作为实时检测的前瞻性架构脱颖而出。

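【Analysis】From a historical perspective, detectors such as RT-DETR (Transformer-based) and Sparse R-CNN (built on learnable proposals) already achieve NMS-free inference through attention mechanisms and query-based interaction, but these approaches typically carry higher computational cost, which makes true real-time detection difficult on resource-constrained devices.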

2.3 ProgLoss and STAL: Enhanced Training Stability and Small-Object Detection

Training stability and small-object recognition remain persistent challenges in object detection. YOLO26 addresses these through the integration of two novel strategies: Progressive Loss Balancing (ProgLoss) and Small-Target-Aware Label Assignment (STAL), as depicted in Figure 3c.

【翻译】训练稳定性和小目标识别仍然是目标检测中的持续挑战。YOLO26通过集成两种新颖策略来解决这些问题:渐进损失平衡(ProgLoss)和小目标感知标签分配(STAL),如图3c所示。

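【Analysis】On training stability: during training the model must learn several tasks at once, such as classification, localization, and confidence prediction. If the loss weights across these tasks are poorly set, training can oscillate or struggle to converge. Small objects are hard because they occupy few pixels, carry little feature information, are easily swamped by background noise, and tend to lose key information during feature extraction. ProgLoss and STAL address these two problems from the loss-weighting side and the label-assignment side respectively.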

ProgLoss dynamically adjusts the weighting of different loss components during training, ensuring that the model does not overfit to dominant object categories while underperforming on rare or small classes. This progressive rebalancing improves generalization and prevents instability during later epochs of training. STAL, on the other hand, explicitly prioritizes label assignments for small objects, which are particularly difficult to detect due to their limited pixel representation and susceptibility to occlusion. Together, ProgLoss and STAL provide YOLO26 with a substantial accuracy boost on datasets with small or occluded objects, such as COCO and UAV imagery benchmarks.

【翻译】ProgLoss在训练过程中动态调整不同损失组件的权重,确保模型不会过度拟合主导目标类别,同时在稀有或小类别上表现不佳。这种渐进式重新平衡改善了泛化能力,并防止了训练后期的不稳定性。另一方面,STAL明确优先考虑小目标的标签分配,这些小目标由于像素表示有限和易受遮挡而特别难以检测。ProgLoss和STAL共同为YOLO26在包含小目标或遮挡目标的数据集(如COCO和无人机图像基准)上提供了显著的准确性提升。

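【Analysis】ProgLoss adjusts the weight coefficients of the individual loss terms at different stages of training. Conventional detectors usually use fixed loss weights, which can let some terms dominate early in training while other important terms are neglected. As described here, ProgLoss tracks training progress and adaptively rebalances the loss terms (for example, the classification and box-regression losses) so that the model learns common objects without neglecting rare classes. STAL targets label assignment for small objects. Under conventional assignment strategies, small objects often receive too few positive samples because their boxes overlap poorly with the candidate anchors or points. STAL uses a dedicated matching strategy that assigns more positives to small objects while taking their spatial distribution and occlusion into account, making the model more sensitive to them. The combination raises small-object recall and improves overall detection accuracy.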
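The paper does not give the ProgLoss formula, so the following is only a sketch of the general idea of progressively changing loss weights over training. The linear schedule, the term names `cls` and `box`, and the start and end values are all assumptions made for illustration; the actual ProgLoss rule may differ substantially.

```python
def progressive_weights(epoch, total_epochs, start, end):
    """Linearly interpolate each loss weight from its start to its end value.

    Hypothetical schedule: t runs from 0 at the first epoch to 1 at the last.
    """
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return {name: (1.0 - t) * start[name] + t * end[name] for name in start}

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * losses[name] for name in losses)

# Assumed example values: shift emphasis from classification toward box regression.
start = {"cls": 1.0, "box": 5.0}
end = {"cls": 0.5, "box": 7.5}
for epoch in (0, 50, 99):
    w = progressive_weights(epoch, 100, start, end)
    print(epoch, w, total_loss({"cls": 0.8, "box": 0.2}, w))
```

The point of the sketch is only that the balance between terms is a function of training progress rather than a fixed constant; an adaptive variant would replace `t` with a signal computed from the running losses.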

Comparatively, earlier models such as YOLOv8 and YOLOv11 did not incorporate such targeted mechanisms, often requiring dataset-specific augmentations or external training tricks to achieve acceptable small-object performance. YOLOv12 and YOLOv13 attempted to address this gap through attention-based modules and enhanced multi-scale feature fusion; however, these solutions increased architectural complexity and inference costs. YOLO26 achieves similar or superior improvements with a more lightweight approach, reinforcing its suitability for edge AI applications. By integrating ProgLoss and STAL, YOLO26 establishes itself as a robust small-object detector while maintaining the efficiency and portability of the YOLO family.

【翻译】相比之下,YOLOv8和YOLOv11等早期模型没有集成这样的针对性机制,通常需要特定于数据集的增强或外部训练技巧来实现可接受的小目标性能。YOLOv12和YOLOv13试图通过基于注意力的模块和增强的多尺度特征融合来解决这一差距;然而,这些解决方案增加了架构复杂性和推理成本。YOLO26通过更轻量级的方法实现了类似或更优的改进,强化了其对边缘AI应用的适用性。通过集成ProgLoss和STAL,YOLO26确立了自己作为强大小目标检测器的地位,同时保持了YOLO家族的效率和可移植性。

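【Analysis】Earlier YOLO versions handled small objects mainly through external means such as data augmentation: multi-scale training, image cropping, mosaic augmentation, and so on. These help to a degree but need substantial hyperparameter tuning and compute, and their effect is often unstable. YOLOv12 and YOLOv13 introduced attention mechanisms and more complex multi-scale fusion networks to attack the problem at the architecture level. Attention lets the model focus on important feature regions, and multi-scale fusion combines feature maps of different resolutions to retain small-object detail. However, these additions increase compute and memory cost, which makes edge deployment harder. YOLO26's contribution is to address the problem through training-side optimization (ProgLoss and STAL) rather than added architectural complexity, improving performance while keeping the model lightweight.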

2.4 MuSGD Optimizer for Stable Convergence

A final innovation in YOLO26 is the introduction of the MuSGD optimizer (Figure 3d), which combines the strengths of Stochastic Gradient Descent (SGD) with the recently proposed Muon optimizer, a technique inspired by optimization strategies used in large language model (LLM) training. MuSGD leverages the robustness and generalization capacity of SGD while incorporating adaptive properties from Muon, enabling faster convergence and more stable optimization across diverse datasets.

【翻译】YOLO26的最终创新是引入了MuSGD优化器(图3d),它结合了随机梯度下降(SGD)的优势和最近提出的Muon优化器,这是一种受大语言模型(LLM)训练中使用的优化策略启发的技术。MuSGD利用了SGD的鲁棒性和泛化能力,同时融合了Muon的自适应特性,实现了更快的收敛和在不同数据集上更稳定的优化。

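【Analysis】SGD is the most basic gradient-descent method, with good generalization and well-understood theory, but it applies the same learning rate to every parameter and can converge slowly, especially on complex loss landscapes. Muon is a recent optimizer from large-language-model training that approximately orthogonalizes the momentum update of each weight matrix (using a Newton-Schulz iteration) before applying it, which evens out the scale of the update across directions. The core idea of MuSGD is to keep the stability of SGD while adding Muon-style updates, so that the optimizer draws on both the current gradient and the accumulated update history. According to the paper, this hybrid finds good update directions quickly early in training and stays stable late in training, avoiding the oscillation that purely adaptive optimizers can show.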
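The paper does not spell out the MuSGD update rule, so the sketch below only illustrates the two ingredients it names. `newton_schulz` follows the quintic Newton-Schulz iteration used by the public Muon optimizer to approximately orthogonalize a 2-D update; `musgd_like_step` is a hypothetical blend of that orthogonalized direction with a plain SGD-momentum direction. The blend, the `mix` parameter, and all function names are assumptions, not the Ultralytics implementation.

```python
def matmul(a, b):
    """Matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize matrix g (Muon-style quintic iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + eps) for v in row] for row in g]   # scale so singular values <= 1
    tall = len(x) > len(x[0])
    if tall:                                             # iterate on the wide orientation
        x = transpose(x)
    for _ in range(steps):
        A = matmul(x, transpose(x))
        AA = matmul(A, A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(len(A))] for i in range(len(A))]
        BX = matmul(B, x)
        x = [[a * x[i][j] + BX[i][j] for j in range(len(x[0]))] for i in range(len(x))]
    return transpose(x) if tall else x

def musgd_like_step(w, grad, buf, lr=0.01, momentum=0.9, mix=0.5):
    """Hypothetical hybrid step: blend SGD-momentum and orthogonalized updates."""
    rows, cols = len(w), len(w[0])
    buf = [[momentum * buf[i][j] + grad[i][j] for j in range(cols)] for i in range(rows)]
    ortho = newton_schulz(buf)
    new_w = [[w[i][j] - lr * (mix * ortho[i][j] + (1.0 - mix) * buf[i][j])
              for j in range(cols)] for i in range(rows)]
    return new_w, buf

w = [[1.0, 0.0], [0.0, 1.0]]
g = [[2.0, 0.0], [0.0, 2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(newton_schulz(g))
print(musgd_like_step(w, g, zeros)[0])
```

The iteration pushes every singular value of the update toward roughly 1 without an explicit SVD, so the step size no longer depends on how large the raw gradient is in each direction; the SGD-momentum part retains the magnitude information that pure orthogonalization discards.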

This hybrid optimizer reflects an important trend in modern deep learning: the cross-pollination of advances between natural language processing (NLP) and computer vision. By borrowing from LLM training practices (e.g., Kimi K2 by Moonshot AI), YOLO26 benefits from stability enhancements that were previously unexplored in the YOLO lineage. Empirical results show that MuSGD enables YOLO26 to reach competitive accuracy with fewer training epochs, reducing both training time and computational cost.

【翻译】这种混合优化器反映了现代深度学习的一个重要趋势:自然语言处理(NLP)和计算机视觉之间进展的交叉融合。通过借鉴LLM训练实践(例如月之暗面的Kimi K2),YOLO26受益于之前在YOLO系列中未曾探索的稳定性增强。实证结果表明,MuSGD使YOLO26能够用更少的训练轮次达到竞争性准确度,减少了训练时间和计算成本。

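【解析】According to the paper, YOLO26 trained with MuSGD reaches competitive detection accuracy in fewer training epochs than conventional optimizers require, which matters for industrial applications that need fast iteration and deployment.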

Previous YOLO versions, including YOLOv8 through YOLOv13, relied on standard SGD or AdamW variants. While effective, these optimizers required extensive hyperparameter tuning and sometimes exhibited unstable convergence, particularly on datasets with high variability. In comparison, MuSGD improves reliability while preserving YOLO's lightweight training ethos. For practitioners, this translates into shorter development cycles, fewer training restarts, and more predictable performance across deployment scenarios. By integrating MuSGD, YOLO26 positions itself as not only an inference-optimized model but also a training-friendly architecture for researchers and industry practitioners alike.

【翻译】之前的YOLO版本,包括YOLOv8到YOLOv13,依赖于标准的SGD或AdamW变体。虽然有效,但这些优化器需要大量的超参数调优,有时会表现出不稳定的收敛,特别是在具有高变异性的数据集上。相比之下,MuSGD提高了可靠性,同时保持了YOLO的轻量级训练理念。对于实践者来说,这转化为更短的开发周期、更少的训练重启和在部署场景中更可预测的性能。通过集成MuSGD,YOLO26不仅将自己定位为推理优化模型,还成为对研究人员和行业实践者都友好的训练架构。

3 Benchmarking and Comparative Analysis

In the case of YOLO26, a series of rigorous benchmarks were conducted to assess its performance in comparison to both its YOLO predecessors and alternative state-of-the-art architectures. Figure 4 presents a consolidated view of this evaluation, plotting COCO mAP(50-95) against latency (ms per image) on an NVIDIA T4 GPU with TensorRT FP16 optimization. The inclusion of competing architectures such as YOLOv10, RT-DETR, RT-DETRv2, RT-DETRv3, and DEIM provides a comprehensive landscape of recent advancements in real-time detection. From the figure, YOLO26 demonstrates a distinctive positioning: it maintains high accuracy levels that rival transformer-based models like RT-DETRv3, while significantly outperforming them in terms of inference speed. For instance, YOLO26-m and YOLO26-l achieve competitive mAP scores above 51% and 53%, respectively, but at a substantially reduced latency, underscoring the benefits of its NMS-free architecture and lightweight regression head.

【翻译】对于YOLO26,进行了一系列严格的基准测试来评估其性能,与YOLO前代版本和其他最先进架构进行比较。图4展示了这一评估的综合视图,在NVIDIA T4 GPU上使用TensorRT FP16优化绘制了COCO mAP(50-95)与延迟(每张图像毫秒数)的关系图。包含YOLOv10、RT-DETR、RT-DETRv2、RT-DETRv3和DEIM等竞争架构,提供了实时检测领域最新进展的全面景观。从图中可以看出,YOLO26展现了独特的定位:它保持了与RT-DETRv3等基于Transformer的模型相媲美的高精度水平,同时在推理速度方面显著优于它们。例如,YOLO26-m和YOLO26-l分别实现了超过51%和53%的竞争性mAP分数,但延迟大幅降低,突出了其无NMS架构和轻量级回归头的优势。

This balance between accuracy and speed is particularly relevant for edge deployments, where maintaining real-time throughput is as important as ensuring reliable detection quality. Compared with YOLOv10, YOLO26 consistently achieves lower latency across model scales, with speedups of up to 43% observed for CPU-bound inference, while preserving or improving accuracy through its ProgLoss and STAL mechanisms. When compared to DEIM and the RT-DETR series, which rely heavily on transformer encoders and decoders, YOLO26's simplified backbone and MuSGD-driven training pipeline enable faster convergence and leaner inference without compromising small-object recognition. The plot in Figure 4 clearly illustrates these distinctions: while RT-DETRv3 excels in large-scale accuracy benchmarks, its latency profile remains less favorable than YOLO26, reinforcing YOLO26's edge-centric design philosophy. Furthermore, the benchmarking analysis highlights YOLO26's robustness in balancing the accuracy-latency curve, situating it as a versatile detector suitable for both high-throughput server applications and resource-constrained devices. This comparative evidence substantiates the claim that YOLO26 is not merely an incremental update but a paradigm shift in the YOLO lineage, successfully bridging the gap between the efficiency-first philosophy of earlier YOLO models and the accuracy-driven orientation of transformer-based detectors. Ultimately, the benchmarking results demonstrate that YOLO26 offers a compelling deployment advantage, particularly in real-world environments requiring reliable performance under stringent latency constraints.

【翻译】这种准确性和速度之间的平衡对于边缘部署特别重要,在边缘部署中,保持实时吞吐量与确保可靠的检测质量同样重要。与YOLOv10相比,YOLO26在各种模型规模上都能持续实现更低的延迟,在CPU推理中观察到高达43%的加速,同时通过其ProgLoss和STAL机制保持或提高准确性。与严重依赖transformer编码器和解码器的DEIM和RT-DETR系列相比,YOLO26的简化骨干网络和MuSGD驱动的训练流水线能够实现更快的收敛和更精简的推理,而不会影响小目标识别。图4中的图表清楚地说明了这些区别:虽然RT-DETRv3在大规模准确性基准测试中表现出色,但其延迟特性仍然不如YOLO26有利,这强化了YOLO26以边缘为中心的设计理念。此外,基准测试分析突出了YOLO26在平衡准确性-延迟曲线方面的鲁棒性,将其定位为适用于高吞吐量服务器应用和资源受限设备的多功能检测器。这一比较证据证实了YOLO26不仅仅是一个增量更新,而是YOLO系列的范式转变,成功地弥合了早期YOLO模型的效率优先理念与基于transformer的检测器的准确性导向之间的差距。最终,基准测试结果表明,YOLO26提供了令人信服的部署优势,特别是在需要在严格延迟约束下可靠性能的真实世界环境中。
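
【解析】"Up to 43% faster" can be read in two ways, and the two readings imply different latencies. The sketch below works through both with a hypothetical 100 ms baseline; that number is purely illustrative and is not a measurement from the paper, and the function names are made up for this example.

```python
def latency_after_reduction(baseline_ms, pct):
    """Reading 1: the latency itself drops by pct percent."""
    return baseline_ms * (1.0 - pct / 100.0)


def latency_after_throughput_gain(baseline_ms, pct):
    """Reading 2: throughput (images per second) rises by pct percent."""
    return baseline_ms / (1.0 + pct / 100.0)


if __name__ == "__main__":
    baseline_ms = 100.0  # hypothetical baseline, not a paper result
    print(latency_after_reduction(baseline_ms, 43))        # about 57 ms
    print(latency_after_throughput_gain(baseline_ms, 43))  # about 69.9 ms
```

The paper does not say which reading it intends, so it is worth checking the definition before comparing the 43% figure against numbers from other sources.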

Figure 4: Performance benchmarking of YOLO26 compared with YOLOv10, RT-DETR, RT-DETRv2, RT-DETRv3, and DEIM on the COCO dataset. The plot shows COCO mAP(50-95) versus latency (ms per image) measured on an NVIDIA T4 GPU using TensorRT FP16 inference. YOLO26 demonstrates superior balance between accuracy and efficiency, achieving competitive detection performance while significantly reducing latency, thereby highlighting its suitability for real-time edge and resource-constrained deployments.

【翻译】图4:YOLO26与YOLOv10、RT-DETR、RT-DETRv2、RT-DETRv3和DEIM在COCO数据集上的性能基准测试比较。该图显示了在NVIDIA T4 GPU上使用TensorRT FP16推理测量的COCO mAP(50-95)与延迟(每张图像毫秒数)的关系。YOLO26展现了准确性和效率之间的卓越平衡,在显著降低延迟的同时实现了具有竞争力的检测性能,从而突出了其对实时边缘和资源受限部署的适用性。
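
【解析】The latency axis in Figure 4 is milliseconds per image. A minimal way to take such a measurement for any inference callable is sketched below. `measure_latency_ms`, the warm-up count, and the run count are assumptions for illustration and are not the paper's benchmarking protocol.

```python
import statistics
import time


def measure_latency_ms(infer, sample, warmup=10, runs=50):
    """Return the median wall-clock latency of infer(sample) in milliseconds.

    Note: on a GPU, `infer` must synchronize the device before returning,
    otherwise only the asynchronous kernel launch is timed.
    """
    for _ in range(warmup):  # warm-up runs absorb lazy initialization and caching
        infer(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)  # median is less sensitive to outliers than the mean
```

Reporting the median together with the hardware, precision (for example FP16), and batch size keeps a latency number comparable with a plot like Figure 4.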

4 Real-Time Deployment with Ultralytics YOLO26

Over the past decade, the evolution of object detection models has been marked not only by increases in accuracy but also by growing complexity in deployment [26, 27, 28]. Early detectors such as R-CNN and its faster variants (Fast R-CNN, Faster R-CNN) achieved impressive detection quality but were computationally expensive, requiring multiple stages for region proposal and classification [29, 30, 31]. This limited their use in real-time and embedded applications. The arrival of the YOLO family transformed this landscape by reframing detection as a single regression problem, enabling real-time performance on commodity GPUs [32]. However, as the YOLO lineage progressed from YOLOv1 through YOLOv13, accuracy improvements often came at the cost of additional architectural components such as Distribution Focal Loss (DFL), complex post-processing steps like Non-Maximum Suppression (NMS), and increasingly heavy backbones that introduced friction during deployment. YOLO26 addresses this longstanding challenge directly by streamlining both architecture and export pathways, thereby reducing deployment barriers across diverse hardware and software ecosystems.

【翻译】在过去十年中,目标检测模型的演进不仅以准确性的提升为标志,还伴随着部署复杂性的不断增长[26, 27, 28]。早期的检测器如R-CNN及其更快的变体(Fast R-CNN、Faster R-CNN)实现了令人印象深刻的检测质量,但计算成本昂贵,需要多个阶段进行区域提议和分类[29, 30, 31]。这限制了它们在实时和嵌入式应用中的使用。YOLO家族的出现通过将检测重新定义为单一回归问题来改变了这一格局,在商用GPU上实现了实时性能[32]。然而,随着YOLO系列从YOLOv1发展到YOLOv13,准确性的改进往往以增加额外的架构组件为代价,如分布焦点损失(DFL)、复杂的后处理步骤如非极大值抑制(NMS),以及日益沉重的骨干网络,这些都在部署过程中引入了阻力。YOLO26通过简化架构和导出路径直接解决了这一长期挑战,从而减少了在不同硬件和软件生态系统中的部署障碍。

4.1 Flexible Export and Integration Pathways

A key advantage of YOLO26 is its seamless integration into existing production pipelines. Ultralytics maintains an actively developed Python package that provides unified support for training, validation, and export, lowering the technical barrier for practitioners seeking to adopt YOLO26. Unlike earlier YOLO models, which required extensive custom conversion scripts for hardware acceleration [33, 34, 35], YOLO26 natively supports a wide range of export formats. These include TensorRT for maximum GPU acceleration, ONNX for broad cross-platform compatibility, CoreML for native iOS integration, TFLite for Android and edge devices, and OpenVINO for optimized performance on Intel hardware. The breadth of these export options enables researchers, engineers, and developers to move models from prototyping to production without encountering the compatibility bottlenecks common in earlier generations.

【翻译】YOLO26的一个关键优势是它能够无缝集成到现有的生产流水线中。Ultralytics维护着一个积极开发的Python包,提供训练、验证和导出的统一支持,降低了寻求采用YOLO26的实践者的技术门槛。与需要大量自定义转换脚本进行硬件加速的早期YOLO模型不同[33, 34, 35],YOLO26原生支持广泛的导出格式。这些格式包括用于最大GPU加速的TensorRT、用于广泛跨平台兼容性的ONNX、用于原生iOS集成的CoreML、用于Android和边缘设备的TFLite,以及用于在Intel硬件上优化性能的OpenVINO。这些导出选项的广度使研究人员、工程师和开发者能够将模型从原型设计转移到生产环境,而不会遇到早期版本中常见的兼容性瓶颈。
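
【解析】With the Ultralytics Python package, export is a single `model.export(...)` call. The sketch below maps the deployment targets named above to export arguments. The format strings (`engine`, `onnx`, `coreml`, `tflite`, `openvino`) and the `half` / `int8` flags come from the Ultralytics export API; the helper names, the per-target flag choices, and the weight file name `yolo26n.pt` in the usage note are assumptions.

```python
# Export arguments per deployment target.
# Assumption: the flag choice per target is illustrative, not a recommendation from the paper.
EXPORT_TARGETS = {
    "tensorrt": {"format": "engine", "half": True},   # NVIDIA GPUs, FP16
    "onnx": {"format": "onnx"},                       # cross-platform runtimes
    "coreml": {"format": "coreml"},                   # iOS / macOS
    "tflite": {"format": "tflite", "int8": True},     # Android and edge devices, INT8
    "openvino": {"format": "openvino", "int8": True}, # Intel hardware, INT8
}


def export_kwargs(target):
    """Return a copy of the export arguments for a named deployment target."""
    try:
        return dict(EXPORT_TARGETS[target])
    except KeyError:
        raise ValueError(f"unknown export target: {target!r}") from None


def export_model(weights, target):
    """Load a model and export it; requires the `ultralytics` package to be installed."""
    from ultralytics import YOLO  # imported lazily so the helpers above work without it

    model = YOLO(weights)
    return model.export(**export_kwargs(target))
```

A call such as `export_model("yolo26n.pt", "onnx")` would then produce an ONNX file, assuming that weight name is available in the installed package version. INT8 exports run a calibration pass, so in practice a representative dataset should be supplied through the export call's `data` argument.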

Historically, YOLOv3 through YOLOv7 often required manual intervention during export, particularly when targeting specialized inference engines such as NVIDIA TensorRT or Apple CoreML [36, 37]. Similarly, transformer-based detectors like DETR and its successors faced challenges when converted outside PyTorch environments due to their reliance on dynamic attention mechanisms. By comparison, YOLO26's architecture, simplified through the removal of DFL and the adoption of an NMS-free prediction head, ensures compatibility across platforms without sacrificing accuracy. This makes YOLO26 one of the most deployment-friendly detectors released to date, reinforcing its identity as an edge-first model.

【翻译】从历史上看,YOLOv3到YOLOv7在导出过程中经常需要手动干预,特别是在针对NVIDIA TensorRT或Apple CoreML等专用推理引擎时[36, 37]。同样,基于transformer的检测器如DETR及其后续版本在PyTorch环境之外转换时面临挑战,这是由于它们依赖动态注意力机制。相比之下,YOLO26的架构通过移除DFL和采用无NMS预测头进行简化,确保了跨平台兼容性而不牺牲准确性。这使得YOLO26成为迄今为止发布的最部署友好的检测器之一,强化了其作为边缘优先模型的身份。
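
【解析】To make concrete what an NMS-free head removes from the deployment pipeline, here is a minimal NumPy sketch of classic greedy NMS, the post-processing step that earlier YOLO exports had to run, or re-implement, outside the network. The function name and the 0.5 IoU threshold are illustrative choices.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression on [x1, y1, x2, y2] boxes; returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap too much
    return keep
```

For example, `nms([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], [0.9, 0.8, 0.7])` returns `[0, 2]`: the second box overlaps the first with an IoU of about 0.68 and is suppressed. The loop is sequential and data-dependent, which is why it is awkward to express inside a static inference graph.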

4.2 Quantization and Resource-Constrained Devices

Beyond export flexibility, the true challenge in real-world deployment lies in ensuring efficiency on devices with limited computational resources [27, 38]. Edge devices such as smartphones, drones, and embedded vision systems often lack discrete GPUs and must balance memory, power, and latency constraints [39, 40]. Quantization is a widely adopted strategy to reduce model size and computational load, yet many complex detectors experience significant accuracy degradation under aggressive quantization. YOLO26 has been designed with this limitation in mind.

【翻译】除了导出灵活性之外,现实世界部署的真正挑战在于确保在计算资源有限的设备上的效率[27, 38]。边缘设备如智能手机、无人机和嵌入式视觉系统通常缺乏独立GPU,必须平衡内存、功耗和延迟约束[39, 40]。量化是减少模型大小和计算负载的广泛采用策略,然而许多复杂检测器在激进量化下会经历显著的准确性下降。YOLO26在设计时就考虑了这一限制。

Owing to its streamlined architecture and simplified bounding box regression pipeline, YOLO26 demonstrates consistent accuracy under both half-precision (FP16) and integer (INT8) quantization schemes. FP16 quantization leverages native GPU support for mixed-precision arithmetic, enabling faster inference with reduced memory footprint. INT8 quantization compresses model weights to 8-bit integers, delivering dramatic reductions in model size and energy consumption while maintaining competitive accuracy. Benchmark experiments confirm that YOLO26 maintains stability across these quantization levels, outperforming YOLOv11 and YOLOv12 under identical conditions. This makes YOLO26 particularly well-suited for deployment on compact hardware such as NVIDIA Jetson Orin, Qualcomm Snapdragon AI accelerators, or even ARM-based CPUs powering smart cameras.

【翻译】由于其精简的架构和简化的边界框回归流水线,YOLO26在半精度(FP16)和整数(INT8)量化方案下都表现出一致的准确性。FP16量化利用GPU对混合精度运算的原生支持,在减少内存占用的同时实现更快的推理。INT8量化将模型权重压缩为8位整数,在保持竞争性准确度的同时显著减少模型大小和能耗。基准实验证实,YOLO26在这些量化级别上保持稳定性,在相同条件下优于YOLOv11和YOLOv12。这使得YOLO26特别适合在紧凑硬件上部署,如NVIDIA Jetson Orin、高通骁龙AI加速器,甚至是为智能相机提供动力的基于ARM的CPU。
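
【解析】The effect of INT8 weight quantization can be illustrated with a symmetric per-tensor scheme. This is a generic textbook sketch, not the calibration procedure of any particular export backend; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats to int8 with a single scale."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
    q, scale = quantize_int8(w)
    err = np.max(np.abs(dequantize(q, scale) - w))
    print(q.nbytes, w.nbytes)  # int8 storage is a quarter of float32 storage
    print(err <= scale / 2 + 1e-5)  # rounding error is bounded by half a quantization step
```

The single shared scale is what makes outliers costly: one large weight stretches the grid for every other value in the tensor.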

In contrast, transformer-based detectors such as RT-DETRv3 exhibit sharp drops in performance under INT8 quantization [41], primarily due to the sensitivity of attention mechanisms to reduced precision. Similarly, YOLOv12 and YOLOv13, while delivering strong accuracy on GPU servers, struggle to retain competitive performance on low-power devices once quantized. YOLO26 therefore establishes a new benchmark for quantization-aware design in object detection, demonstrating that architectural simplicity can directly translate into deployment robustness.

【翻译】相比之下,基于transformer的检测器如RT-DETRv3在INT8量化下表现出急剧的性能下降[41],主要是由于注意力机制对精度降低的敏感性。同样,YOLOv12和YOLOv13虽然在GPU服务器上提供强大的准确性,但一旦量化,在低功耗设备上很难保持竞争性能。因此,YOLO26为目标检测中的量化感知设计建立了新的基准,证明了架构简单性可以直接转化为部署鲁棒性。

4.3 Cross-Industry Applications: From Robotics to Manufacturing

The practical impact of these deployment enhancements is best illustrated through cross-industry applications. In robotics, real-time perception is crucial for navigation, manipulation, and safe human-robot collaboration [42, 43]. By offering NMS-free predictions and consistent low-latency inference, YOLO26 allows robotic systems to interpret their environments faster and more reliably. For example, robotic arms equipped with YOLO26 can identify and grasp objects with higher precision under dynamic conditions, while mobile robots benefit from improved obstacle recognition in cluttered spaces. Compared with YOLOv8 or YOLOv11, YOLO26 offers reduced inference delay, which can be the difference between a safe maneuver and a collision in high-speed scenarios.

【翻译】这些部署增强的实际影响最好通过跨行业应用来说明。在机器人技术中,实时感知对于导航、操作和安全的人机协作至关重要[42, 43]。通过提供无NMS预测和一致的低延迟推理,YOLO26使机器人系统能够更快、更可靠地解释其环境。例如,配备YOLO26的机械臂可以在动态条件下以更高精度识别和抓取物体,而移动机器人则受益于在杂乱空间中改进的障碍物识别。与YOLOv8或YOLOv11相比,YOLO26提供了减少的推理延迟,这在高速场景中可能是安全机动和碰撞之间的区别。

In manufacturing, YOLO26 has significant implications for automated defect detection and quality assurance. Traditional manual inspection is not only labor-intensive but also prone to human error. Previous YOLO releases, particularly YOLOv8, were already deployed in smart factories; however, the complexity of export and the latency overhead of NMS sometimes constrained large-scale rollout. YOLO26 mitigates these barriers by offering lightweight deployment options through OpenVINO or TensorRT, allowing manufacturers to integrate real-time defect detection systems directly on production lines. Early benchmarks suggest that YOLO26-based defect detection pipelines achieve higher throughput and lower operational costs compared to both YOLOv12 and transformer-based alternatives such as DEIM.

【翻译】在制造业中,YOLO26对自动化缺陷检测和质量保证具有重要意义。传统的人工检查不仅劳动密集,而且容易出现人为错误。以前的YOLO版本,特别是YOLOv8,已经在智能工厂中部署;然而,导出的复杂性和NMS的延迟开销有时会限制大规模推广。YOLO26通过OpenVINO或TensorRT提供轻量级部署选项来缓解这些障碍,允许制造商直接在生产线上集成实时缺陷检测系统。早期基准测试表明,与YOLOv12和基于transformer的替代方案(如DEIM)相比,基于YOLO26的缺陷检测流水线实现了更高的吞吐量和更低的运营成本。

4.4 Broader Insights on YOLO26 Deployment

Taken together, the deployment features of YOLO26 underscore a central theme in the evolution of object detection: architectural efficiency is just as critical as accuracy. While the past five years have seen the rise of increasingly sophisticated models, ranging from convolution-based YOLO variants to transformer-based detectors like DETR and RT-DETR, the gap between laboratory performance and production readiness has often limited their impact. YOLO26 bridges this gap by simplifying architecture, expanding export compatibility, and ensuring resilience under quantization, thereby aligning cutting-edge accuracy with practical deployment needs.

【翻译】综合来看,YOLO26的部署特性强调了目标检测演进中的一个核心主题:架构效率与准确性同样重要。虽然过去五年见证了从基于卷积的YOLO变体到基于transformer的检测器(如DETR和RT-DETR)等日益复杂模型的兴起,但实验室性能与生产就绪性之间的差距往往限制了它们的影响。YOLO26通过简化架构、扩展导出兼容性和确保量化下的鲁棒性来弥合这一差距,从而将前沿准确性与实际部署需求相结合。

For developers building mobile applications, YOLO26 enables seamless integration through CoreML and TFLite, ensuring that models run natively on iOS and Android platforms. For enterprises deploying vision AI in cloud or on-premise servers, TensorRT and ONNX exports provide scalable acceleration options. For industrial and edge users, OpenVINO and INT8 quantization guarantee that performance remains consistent even under tight resource constraints. In this sense, YOLO26 is not only a step forward in object detection research but also a major milestone in democratizing deployment.

【翻译】对于构建移动应用程序的开发者,YOLO26通过CoreML和TFLite实现无缝集成,确保模型在iOS和Android平台上原生运行。对于在云端或本地服务器上部署视觉AI的企业,TensorRT和ONNX导出提供可扩展的加速选项。对于工业和边缘用户,OpenVINO和INT8量化保证即使在严格的资源约束下性能也保持一致。从这个意义上说,YOLO26不仅是目标检测研究的一个进步,也是部署民主化的一个重要里程碑。

5 Conclusion and Future Directions

In conclusion, YOLO26 represents a significant leap in the YOLO object detection series, blending architectural innovation with a pragmatic focus on deployment. The model simplifies its design by removing the Distribution Focal Loss (DFL) module and eliminating the need for non-maximum suppression. By removing DFL, YOLO26 streamlines bounding box regression and avoids export complications, which broadens compatibility with various hardware. Likewise, its end-to-end, NMS-free inference enables the network to output final detections directly without a post-processing step. This not only reduces latency but also simplifies the deployment pipeline, making YOLO26 a natural evolution of earlier YOLO concepts. In training, YOLO26 introduces Progressive Loss Balancing (ProgLoss) and Small-Target-Aware Label Assignment (STAL), which together stabilize learning and boost accuracy on challenging small objects. Additionally, a novel MuSGD optimizer, combining properties of SGD and Muon, accelerates convergence and improves training stability. These enhancements work in concert to deliver a detector that is not only more accurate and robust but also markedly faster and lighter in practice.

【翻译】总之,YOLO26代表了YOLO目标检测系列的重大飞跃,将架构创新与对部署的实用关注相结合。该模型通过移除分布焦点损失(DFL)模块和消除非极大值抑制的需求来简化其设计。通过移除DFL,YOLO26简化了边界框回归并避免了导出复杂性,这扩大了与各种硬件的兼容性。同样,其端到端、无NMS推理使网络能够直接输出最终检测结果而无需后处理步骤。这不仅减少了延迟,还简化了部署流水线,使YOLO26成为早期YOLO概念的自然演进。在训练中,YOLO26引入了渐进损失平衡(ProgLoss)和小目标感知标签分配(STAL),它们共同稳定学习并提高对具有挑战性的小目标的准确性。此外,一个新颖的MuSGD优化器结合了SGD和Muon的特性,加速收敛并改善训练稳定性。这些增强功能协同工作,提供了一个不仅更准确、更鲁棒,而且在实践中明显更快、更轻的检测器。

Benchmark comparisons underscore YOLO26's strong performance relative to both its YOLO predecessors and contemporary models. Prior YOLO versions such as YOLO11 surpassed earlier releases with greater efficiency, and YOLO12 extended accuracy further through the integration of attention mechanisms. YOLO13 added hypergraph-based refinements to achieve additional improvements. Against transformer-based rivals, YOLO26 closes much of the gap. Its native NMS-free design mirrors the end-to-end approach of transformer-inspired detectors, but with YOLO's hallmark efficiency. YOLO26 delivers competitive accuracy while dramatically boosting throughput on common hardware and minimizing complexity. In fact, YOLO26's design yields up to 43% faster inference on CPU than previous YOLO versions, making it one of the most practical real-time detectors for resource-constrained environments. This harmonious balance of performance and efficiency allows YOLO26 to excel not just on benchmark leaderboards but also in actual field deployments where speed, memory, and energy are at a premium.

【翻译】基准比较强调了YOLO26相对于其YOLO前辈和当代模型的强劲性能。之前的YOLO版本如YOLO11以更高的效率超越了早期版本,YOLO12通过集成注意力机制进一步扩展了准确性。YOLO13添加了基于超图的改进以实现额外的提升。与基于transformer的竞争对手相比,YOLO26缩小了大部分差距。其原生的无NMS设计反映了受transformer启发的检测器的端到端方法,但具有YOLO标志性的效率。YOLO26在显著提升通用硬件吞吐量和最小化复杂性的同时提供了竞争性的准确性。实际上,YOLO26的设计在CPU上的推理速度比以前的YOLO版本快达43%,使其成为资源受限环境中最实用的实时检测器之一。这种性能和效率的和谐平衡使YOLO26不仅在基准排行榜上表现出色,而且在速度、内存和能耗至关重要的实际现场部署中也表现优异。

A major contribution of YOLO26 is its emphasis on deployment advantages. The model's architecture was deliberately optimized for real-world use: by omitting DFL and NMS, YOLO26 avoids operations that are difficult to implement on specialized hardware accelerators, thereby improving compatibility across devices. The network is exportable to a wide array of formats, including ONNX, TensorRT, CoreML, TFLite, and OpenVINO, ensuring that developers can integrate it into mobile apps, embedded systems, or cloud services with equal ease. Crucially, YOLO26 also supports robust quantization: it can be deployed with INT8 quantization or half-precision FP16 with minimal impact on accuracy, thanks to its simplified architecture that tolerates low-bitwidth inference. This means models can be compressed and accelerated while still delivering reliable detection performance. Such features translate to real edge performance gains: from drones to smart cameras, YOLO26 can run in real time on CPUs and small devices where previous YOLO models struggled. All these improvements demonstrate an overarching theme: YOLO26 bridges the gap between cutting-edge research ideas and deployable AI solutions. This approach underscores YOLO26's role as a bridge between academic innovation and industry application, bringing the latest vision advancements directly into the hands of practitioners.

【翻译】YOLO26的一个主要贡献是其对部署优势的重视。该模型的架构被有意优化用于现实世界使用:通过省略DFL和NMS,YOLO26避免了在专用硬件加速器上难以实现的操作,从而改善了跨设备的兼容性。该网络可导出为多种格式,包括ONNX、TensorRT、CoreML、TFLite和OpenVINO,确保开发者可以同样轻松地将其集成到移动应用、嵌入式系统或云服务中。至关重要的是,YOLO26还支持鲁棒的量化:由于其简化的架构能够容忍低位宽推理,它可以通过INT8量化或半精度FP16部署,对准确性的影响最小。这意味着模型可以被压缩和加速,同时仍然提供可靠的检测性能。这些特性转化为真正的边缘性能提升:从无人机到智能相机,YOLO26可以在CPU和小型设备上实时运行,而以前的YOLO模型在这些设备上表现困难。所有这些改进都展示了一个总体主题:YOLO26弥合了前沿研究思想与可部署AI解决方案之间的差距。这种方法强调了YOLO26作为学术创新与工业应用之间桥梁的作用,将最新的视觉进展直接带到实践者手中。

5.1 Future Directions

Looking ahead, the trajectory of YOLO and object detection research suggests several promising directions. One clear avenue is the unification of multiple vision tasks into even more holistic models. YOLO26 already supports object detection, instance segmentation, pose estimation, oriented bounding boxes, and classification in one framework, reflecting a trend toward multi-task versatility. Future YOLO iterations might push this further by incorporating open-vocabulary and foundation-model capabilities. This could mean leveraging powerful vision-language models so that detectors can recognize arbitrary object categories in a zero-shot manner, without being limited to a fixed label set. By building on foundation models and large-scale pretraining, the next generation of YOLO could serve as a general-purpose vision AI that seamlessly handles detection, segmentation, and even description of novel objects in context.

【翻译】展望未来,YOLO和目标检测研究的发展轨迹表明了几个有前景的方向。一个明确的途径是将多个视觉任务统一到更加整体的模型中。YOLO26已经在一个框架中支持目标检测、实例分割、姿态估计、定向边界框和分类,反映了向多任务通用性发展的趋势。未来的YOLO迭代可能通过整合开放词汇和基础模型能力来进一步推进这一点。这可能意味着利用强大的视觉-语言模型,使检测器能够以零样本方式识别任意目标类别,而不受固定标签集的限制。通过构建在基础模型和大规模预训练之上,下一代YOLO可以作为通用视觉AI,无缝处理检测、分割,甚至在上下文中描述新颖目标。

Another key evolution is likely in the realm of semi-supervised and self-supervised learning for object detection [44, 45, 46, 47]. State-of-the-art detectors still rely heavily on large labeled datasets, but research is rapidly advancing methods to train on unlabeled or partially labeled data. Techniques such as teacher-student training [48, 49, 50], pseudo-labeling [51, 52], and self-supervised feature learning [53] could be integrated into the YOLO training pipeline to reduce the need for extensive manual annotations. A future YOLO might automatically leverage vast amounts of unannotated images or videos to improve recognition robustness. By doing so, the model can continue to improve its detection capabilities without proportional increases in labeled data, making it more adaptable to new domains or rare object categories.

【翻译】另一个关键演进可能在目标检测的半监督和自监督学习领域[44, 45, 46, 47]。最先进的检测器仍然严重依赖大型标注数据集,但研究正在快速推进在未标注或部分标注数据上训练的方法。诸如教师-学生训练[48, 49, 50]、伪标签[51, 52]和自监督特征学习[53]等技术可以集成到YOLO训练流水线中,以减少对大量人工标注的需求。未来的YOLO可能会自动利用大量未标注的图像或视频来提高识别鲁棒性。通过这样做,模型可以在不按比例增加标注数据的情况下继续改善其检测能力,使其更适应新领域或稀有目标类别。
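
【解析】Teacher-student training and pseudo-labeling usually come down to two small operations: the teacher's weights track an exponential moving average (EMA) of the student's, and only confident teacher predictions on unlabeled images are kept as training targets. The sketch below shows both in plain Python. The function names, the dictionary layout, the decay value, and the confidence threshold are assumptions for illustration and do not describe any YOLO release.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def filter_pseudo_labels(detections, min_conf=0.7):
    """Keep only teacher detections confident enough to serve as pseudo-labels."""
    return [d for d in detections if d["conf"] >= min_conf]


if __name__ == "__main__":
    teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
    print(teacher)  # the teacher weight moves 10% of the way toward the student
    preds = [{"cls": "apple", "conf": 0.92}, {"cls": "leaf", "conf": 0.31}]
    print(filter_pseudo_labels(preds))  # only the high-confidence detection remains
```

The threshold trades label noise against coverage: a higher value yields cleaner but fewer pseudo-labels, which matters most for the rare categories the paragraph above mentions.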

Architecturally, we anticipate a continued blending of transformer and CNN design principles in object detectors. The success of recent YOLO models has shown that injecting attention and global reasoning into YOLO-like architectures can yield accuracy gains [54, 55]. Future YOLO architectures may adopt hybrid designs that combine convolutional backbones (for efficient local feature extraction) with transformer-based modules or decoders (for capturing long-range dependencies and context). Such hybrid approaches can improve how the model understands complex scenes, for example in crowded or highly contextual environments, by modeling relationships that pure CNNs or naive self-attention might miss. We expect next-generation detectors to intelligently fuse these techniques, achieving both rich feature representation and low latency. In short, the line between "CNN-based" and "transformer-based" detectors will continue to blur, taking the best of both worlds to handle diverse detection challenges.

【翻译】在架构方面,我们预期目标检测器中transformer和CNN设计原则的持续融合。最近YOLO模型的成功表明,将注意力和全局推理注入类YOLO架构可以产生准确性提升[54, 55]。未来的YOLO架构可能采用混合设计,将卷积骨干网络(用于高效的局部特征提取)与基于transformer的模块或解码器(用于捕获长程依赖和上下文)相结合。这种混合方法可以改善模型对复杂场景的理解,例如在拥挤或高度上下文化的环境中,通过建模纯CNN或朴素自注意力可能遗漏的关系。我们期望下一代检测器能够智能地融合这些技术,实现丰富的特征表示和低延迟。简而言之,"基于CNN"和"基于transformer"的检测器之间的界限将继续模糊,取两者之长来处理多样化的检测挑战。

Lastly, as deployment remains a paramount concern, future research will likely emphasize edge-aware training and optimization. This means that model development will increasingly account for hardware constraints from the training phase onward, not just as an afterthought. Techniques such as quantization-aware training, where the model is trained with simulated low-precision arithmetic, can ensure the network remains accurate even after being quantized to INT8 for fast inference. We may also see neural architecture search and automated model compression become standard in crafting YOLO models, so that each new version is co-designed with specific target platforms in mind. In addition, incorporating feedback from deployment, such as latency measurements or energy usage on device, into the training loop is an emerging idea. An edge-optimized YOLO could, for example, learn to dynamically adjust its depth or resolution based on runtime constraints, or be distilled from a larger model to a smaller one with minimal performance loss. By training with these considerations, the resulting detectors would achieve a superior trade-off between accuracy and efficiency in practice. This focus on efficient AI is crucial as object detectors move into IoT, AR/VR, and autonomous systems where real-time performance on limited hardware is non-negotiable.

【翻译】最后,由于部署仍然是一个至关重要的问题,未来的研究可能会强调边缘感知训练和优化。这意味着模型开发将从训练阶段开始就越来越多地考虑硬件约束,而不仅仅是事后考虑。诸如量化感知训练等技术,其中模型使用模拟的低精度算术进行训练,可以确保网络即使在量化为INT8以进行快速推理后仍保持准确性。我们也可能看到神经架构搜索和自动化模型压缩成为制作YOLO模型的标准,使得每个新版本都与特定目标平台共同设计。此外,将部署反馈(如延迟测量或设备上的能耗)纳入训练循环是一个新兴想法。例如,边缘优化的YOLO可以学会根据运行时约束动态调整其深度或分辨率,或从较大模型蒸馏到较小模型,性能损失最小。通过考虑这些因素进行训练,所得到的检测器将在实践中实现准确性和效率之间的卓越权衡。这种对高效AI的关注至关重要,因为目标检测器正在进入物联网、AR/VR和自主系统,在这些系统中,有限硬件上的实时性能是不可妥协的。
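
【解析】"Simulated low-precision arithmetic" in quantization-aware training is commonly implemented as fake quantization: values are rounded to the INT8 grid in the forward pass but kept as floats, and the backward pass uses a straight-through estimator (STE) that treats the rounding as the identity. The NumPy sketch below shows the two pieces in isolation. It is a generic illustration with hypothetical function names, not a description of how any YOLO version is trained.

```python
import numpy as np


def fake_quant(x, scale, qmin=-127, qmax=127):
    """Forward pass: quantize then dequantize, so outputs are floats on the INT8 grid."""
    return np.clip(np.round(x / scale), qmin, qmax) * scale


def fake_quant_grad(x, upstream, scale, qmin=-127, qmax=127):
    """Backward pass (STE): pass the gradient through where x was not clipped, zero elsewhere."""
    inside = (x / scale >= qmin) & (x / scale <= qmax)
    return upstream * inside


if __name__ == "__main__":
    x = np.array([0.26, -0.26, 20.0])
    print(fake_quant(x, scale=0.1))                        # values snapped to multiples of 0.1, last one clipped
    print(fake_quant_grad(x, np.ones_like(x), scale=0.1))  # gradient is blocked for the clipped value
```

Because the network sees the rounding error during training, its weights can settle where that error does little harm, which is the mechanism behind the accuracy retention the paragraph describes.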

Note: This study will experimentally evaluate YOLO26 by benchmarking its performance against YOLOv13, YOLOv12, and YOLOv11 in the near future. A custom dataset will be collected in agricultural environments using a machine vision camera, with more than 10,000 manually labeled objects of interest. Models will be trained under identical conditions, and results will be reported in terms of precision, recall, accuracy, F1 score, mAP, inference speed, and pre/post-processing times. Additionally, edge computing experiments on NVIDIA Jetson will assess real-time detection capacity, providing insights into YOLO26's practical deployment in resource-constrained agricultural applications.

【翻译】注:本研究将在不久的将来通过将YOLO26的性能与YOLOv13、YOLOv12和YOLOv11进行基准比较来实验性评估YOLO26。将使用机器视觉相机在农业环境中收集自定义数据集,包含10,000多个手动标注的感兴趣目标。模型将在相同条件下训练,结果将以精确度、召回率、准确性、F1分数、mAP、推理速度和前/后处理时间的形式报告。此外,在NVIDIA Jetson上的边缘计算实验将评估实时检测能力,为YOLO26在资源受限的农业应用中的实际部署提供见解。
