This study presents an architectural analysis of YOLOv11, the latest iteration in the YOLO (You Only Look Once) series of object detection models. We examine the model's architectural innovations, including the introduction of the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid Pooling - Fast), and C2PSA (Convolutional block with Parallel Spatial Attention) components, which contribute to improving the model's performance in several ways, such as enhanced feature extraction. The paper explores YOLOv11's expanded capabilities across various computer vision tasks, including object detection, instance segmentation, pose estimation, and oriented object detection (OBB). We review the model's performance improvements in terms of mean Average Precision (mAP) and computational efficiency compared to its predecessors, with a focus on the trade-off between parameter count and accuracy. Additionally, the study discusses YOLOv11's versatility across different model sizes, from nano to extra-large, catering to diverse application needs from edge devices to high-performance computing environments. Our research provides insights into YOLOv11's position within the broader landscape of object detection and its potential impact on real-time computer vision applications.
Computer vision, a rapidly advancing field, enables machines to interpret and understand visual data [1]. A crucial aspect of this domain is object detection [2], which involves the precise identification and localization of objects within images or video streams [3]. Recent years have witnessed remarkable progress in algorithmic approaches to address this challenge [4].
A pivotal breakthrough in object detection came with the introduction of the You Only Look Once (YOLO) algorithm by Redmon et al. in 2015 [5]. This innovative approach, as its name suggests, processes the entire image in a single pass to detect objects and their locations. YOLO's methodology diverges from traditional two-stage detection processes by framing object detection as a regression problem [5]. It employs a single convolutional neural network to simultaneously predict bounding boxes and class probabilities across the entire image [6], streamlining the detection pipeline compared to more complex traditional methods.
YOLOv11 is the latest iteration in the YOLO series, building upon the foundation established by YOLOv1. Unveiled at the YOLO Vision 2024 (YV24) conference, YOLOv11 represents a significant leap forward in real-time object detection technology. This new version introduces substantial enhancements in both architecture and training methodologies, pushing the boundaries of accuracy, speed, and efficiency.
YOLOv11's innovative design incorporates advanced feature extraction techniques, allowing for more nuanced detail capture while maintaining a lean parameter count. This results in improved accuracy across a diverse range of computer vision (CV) tasks, from object detection to classification. Furthermore, YOLOv11 achieves remarkable gains in processing speed, substantially enhancing real-time performance capabilities.
In the following sections, this paper will provide a comprehensive analysis of YOLOv11's architecture, exploring its key components and innovations. We will examine the evolution of YOLO models, leading up to the development of YOLOv11. The study will delve into the model's expanded capabilities across various CV tasks, including object detection, instance segmentation, pose estimation, and oriented object detection. We will also review YOLOv11's performance improvements in terms of accuracy and computational efficiency compared to its predecessors, with a particular focus on its versatility across different model sizes. Finally, we will discuss the potential impact of YOLOv11 on real-time CV applications and its position within the broader landscape of object detection technologies.
Table 1 illustrates the progression of YOLO models from their inception to the most recent versions. Each iteration has brought significant improvements in object detection capabilities, computational efficiency, and versatility in handling various CV tasks.
Table 1: YOLO: Evolution of models
This evolution showcases the rapid advancement in object detection technologies, with each version introducing novel features and expanding the range of supported tasks. From the original YOLO's groundbreaking single-stage detection to YOLOv10's NMS-free training, the series has consistently pushed the boundaries of real-time object detection.
The latest iteration, YOLO11, builds upon this legacy with further enhancements in feature extraction, efficiency, and multi-task capabilities. Our subsequent analysis will delve into YOLO11's architectural innovations, including its improved backbone and neck structures, and its performance across various computer vision tasks such as object detection, instance segmentation, and pose estimation.
The evolution of the YOLO algorithm reaches new heights with the introduction of YOLOv11 [16], representing a significant advancement in real-time object detection technology. This latest iteration builds upon the strengths of its predecessors while introducing novel capabilities that expand its utility across diverse CV applications.
YOLOv11 distinguishes itself through its enhanced adaptability, supporting an expanded range of CV tasks beyond traditional object detection. Notable among these are pose estimation and instance segmentation, broadening the model's applicability in various domains. YOLOv11's design focuses on balancing power and practicality, aiming to address specific challenges across various industries with increased accuracy and efficiency.
This latest model demonstrates the ongoing evolution of real-time object detection technology, pushing the boundaries of what's possible in CV applications. Its versatility and performance improvements position YOLOv11 as a significant advancement in the field, potentially opening new avenues for real-world implementation across diverse sectors.
The YOLO framework revolutionized object detection by introducing a unified neural network architecture that simultaneously handles both bounding box regression and object classification tasks [17]. This integrated approach marked a significant departure from traditional two-stage detection methods, offering end-to-end training capabilities through its fully differentiable design.
At its core, the YOLO architecture consists of three fundamental components. First, the backbone serves as the primary feature extractor, utilizing convolutional neural networks to transform raw image data into multi-scale feature maps. Second, the neck component acts as an intermediate processing stage, employing specialized layers to aggregate and enhance feature representations across different scales. Third, the head component functions as the prediction mechanism, generating the final outputs for object localization and classification based on the refined feature maps.
Building on this established architecture, YOLO11 extends and enhances the foundation laid by YOLOv8, introducing architectural innovations and parameter optimizations to achieve superior detection performance as illustrated in Figure 1. The following sections detail the key architectural modifications implemented in YOLO11:
The backbone is a crucial component of the YOLO architecture, responsible for extracting features from the input image at multiple scales. This process involves stacking convolutional layers and specialized blocks to generate feature maps at various resolutions.
YOLOv11 maintains a structure similar to its predecessors, utilizing initial convolutional layers to downsample the image. These layers form the foundation of the feature extraction process, gradually reducing spatial dimensions while increasing the number of channels. A significant improvement in YOLO11 is the introduction of the C3k2 block, which replaces the C2f block used in previous versions [18]. The C3k2 block is a more computationally efficient implementation of the Cross Stage Partial (CSP) Bottleneck. It employs two smaller convolutions instead of one large convolution, as seen in YOLOv8 [13]. The "k2" in C3k2 indicates a smaller kernel size, which contributes to faster processing while maintaining performance.
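To make this structure concrete, the following is a minimal PyTorch sketch of a C3k2-style CSP block, assuming standard Conv-BatchNorm-SiLU (CBS) building blocks; it is simplified for illustration and is not the exact Ultralytics implementation (the class and helper names here are our own).

import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1, s=1):
    # Conv-BatchNorm-SiLU (CBS) helper block.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class Bottleneck(nn.Module):
    # Two small 3x3 convolutions with a residual connection, standing in
    # for the single larger convolution used by earlier designs.
    def __init__(self, c):
        super().__init__()
        self.cv1 = cbs(c, c, k=3)
        self.cv2 = cbs(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3k2Sketch(nn.Module):
    # CSP-style split-transform-merge: the input is projected and split,
    # half the channels pass through a chain of bottlenecks, and all
    # intermediate outputs are concatenated and fused by a 1x1 convolution.
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = cbs(c_in, 2 * self.c, k=1)         # split projection
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = cbs((2 + n) * self.c, c_out, k=1)  # merge/fuse

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.m:
            y.append(block(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

x = torch.randn(1, 64, 80, 80)
print(C3k2Sketch(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])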
YOLO11 retains the Spatial Pyramid Pooling - Fast (SPPF) block from previous versions but introduces a new Cross Stage Partial with Spatial Attention (C2PSA) block after it [18]. The C2PSA block is a notable addition that enhances spatial attention in the feature maps. This spatial attention mechanism allows the model to focus more effectively on important regions within the image. By pooling features spatially, the C2PSA block enables YOLO11 to concentrate on specific areas of interest, potentially improving detection accuracy for objects of varying sizes and positions.
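The SPPF design is well documented from earlier YOLO versions; below is a minimal PyTorch sketch of it following that published design, in which three chained 5x5 max-pools emulate 5x5, 9x9, and 13x13 pooling windows. The C2PSA attention block is omitted for brevity, and the names are illustrative rather than taken from the YOLOv11 source.

import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    # Conv-BatchNorm-SiLU helper, as in the previous sketch.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class SPPFSketch(nn.Module):
    # Spatial Pyramid Pooling - Fast: three sequential 5x5 max-pools
    # reuse each other's output, which is cheaper than running the
    # equivalent 5x5, 9x9, and 13x13 pools in parallel.
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = cbs(c_in, c_hidden, k=1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = cbs(4 * c_hidden, c_out, k=1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

x = torch.randn(1, 256, 20, 20)
print(SPPFSketch(256, 256)(x).shape)  # torch.Size([1, 256, 20, 20])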
The neck combines features at different scales and transmits them to the head for prediction. This process typically involves upsampling and concatenation of feature maps from different levels, enabling the model to capture multi-scale information effectively.
YOLO11 introduces a significant change by replacing the C2f block in the neck with the C3k2 block. The C3k2 block is designed to be faster and more efficient, enhancing the overall performance of the feature aggregation process. After upsampling and concatenation, the neck in YOLO11 incorporates this improved block, resulting in enhanced speed and performance [18].
A notable addition to YOLO11 is its increased focus on spatial attention through the C2PSA module. This attention mechanism enables the model to concentrate on key regions within the image, potentially leading to more accurate detection, especially for smaller or partially occluded objects. The inclusion of C2PSA sets YOLO11 apart from its predecessor, YOLOv8, which lacks this specific attention mechanism [18].
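A single neck aggregation step can be sketched as follows; the channel counts and resolutions are illustrative, and a plain 1x1 convolution stands in for the C3k2 block that YOLO11 actually applies after concatenation.

import torch
import torch.nn as nn

# Feature maps from two pyramid levels (batch, channels, height, width).
p4 = torch.randn(1, 256, 40, 40)  # deeper, lower-resolution features
p3 = torch.randn(1, 128, 80, 80)  # shallower, higher-resolution features

upsample = nn.Upsample(scale_factor=2, mode="nearest")
fuse = nn.Conv2d(256 + 128, 128, kernel_size=1)  # stand-in for C3k2

merged = torch.cat([upsample(p4), p3], dim=1)  # align scales, then concatenate
out = fuse(merged)
print(out.shape)  # torch.Size([1, 128, 80, 80])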
The head of YOLOv11 is responsible for generating the final predictions in terms of object detection and classification. It processes the feature maps passed from the neck, ultimately outputting bounding boxes and class labels for objects within the image.
In the head section, YOLOv11 utilizes multiple C3k2 blocks to efficiently process and refine the feature maps. The C3k2 blocks are placed in several pathways within the head, functioning to process multi-scale features at different depths. The C3k2 block exhibits flexibility depending on the value of the c3k parameter (see the sketch after this list):
• When c3k = False, the C3k2 module behaves similarly to the C2f block, utilizing a standard bottleneck structure.
• When c3k = True, the bottleneck structure is replaced by the C3 module, which allows for deeper and more complex feature extraction.
Key characteristics of the C3k2 block:
• Faster processing: The use of two smaller convolutions reduces the computational overhead compared to a single large convolution, leading to quicker feature extraction.
• Parameter efficiency: C3k2 is a more compact version of the CSP bottleneck, making the architecture more efficient in terms of the number of trainable parameters.
Another notable addition is the C3k block, which offers enhanced flexibility by allowing customizable kernel sizes. The adaptability of C3k is particularly useful for extracting more detailed features from images, contributing to improved detection accuracy.
The head of YOLOv11 includes several CBS (Convolution-BatchNorm-SiLU) [19] layers after the C3k2 blocks. These layers further refine the feature maps by:
• Extracting relevant features for accurate object detection.
• Stabilizing and normalizing the data flow through batch normalization.
• Utilizing the Sigmoid Linear Unit (SiLU) activation function for non-linearity, which improves model performance.
CBS blocks serve as foundational components in both feature extraction and the detection process, ensuring that the refined feature maps are passed to the subsequent layers for bounding box and classification predictions.
Each detection branch ends with a set of Conv2D layers, which reduce the features to the required number of outputs for bounding box coordinates and class predictions. The final Detect layer consolidates these predictions (see the usage sketch after this list), which include:
• Bounding box coordinates for localizing objects in the image.
• Objectness scores that indicate the presence of objects.
• Class scores for determining the class of the detected object.
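To make these outputs concrete, here is a brief usage sketch with the Ultralytics Python API; "yolo11n.pt" and "image.jpg" are placeholder names for a checkpoint and a test image. Note that in the anchor-free heads of recent YOLO versions, the per-detection confidence plays the role of the objectness score.

from ultralytics import YOLO

model = YOLO("yolo11n.pt")    # nano detection variant (placeholder name)
results = model("image.jpg")  # run inference on a single image

for box in results[0].boxes:
    print(box.xyxy)  # bounding box coordinates (x1, y1, x2, y2)
    print(box.conf)  # confidence score for the detection
    print(box.cls)   # predicted class index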
Object Detection: YOLO11 excels in identifying and localizing objects within images or video frames, providing bounding boxes for each detected item [20]. This capability finds applications in surveillance systems, autonomous vehicles, and retail analytics, where precise object identification is crucial [21].
Instance Segmentation: Going beyond simple detection, YOLO11 can identify and separate individual objects within an image down to the pixel level [20]. This fine-grained segmentation is particularly valuable in medical imaging for precise organ or tumor delineation, and in manufacturing for detailed defect detection [21].
Image Classification: YOLOv11 is capable of classifying entire images into predetermined categories, making it ideal for applications like product categorization in e-commerce platforms or wildlife monitoring in ecological studies [21].
Pose Estimation: The model can detect specific key points within images or video frames to track movements or poses. This capability is beneficial for fitness tracking applications, sports performance analysis, and various healthcare applications requiring motion assessment [21].
Oriented Object Detection (OBB): YOLO11 introduces the ability to detect objects with an orientation angle, allowing for more precise localization of rotated objects. This feature is especially valuable in aerial imagery analysis, robotics, and warehouse automation tasks where object orientation is crucial [21].
Object Tracking: It identifies and traces the path of objects in a sequence of images or video frames [21]. This real-time tracking capability is essential for applications such as traffic monitoring, sports analysis, and security systems.
Table 2 outlines the YOLOv11 model variants and their corresponding tasks. Each variant is designed for specific use cases, from object detection to pose estimation. Moreover, all variants support core functionalities like inference, validation, training, and export, making YOLOv11 a versatile tool for various CV applications.
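As a brief illustration, the snippet below loads each task variant through the Ultralytics API; the checkpoint suffixes follow Ultralytics' published naming convention, and "image.jpg" is a placeholder.

from ultralytics import YOLO

detect = YOLO("yolo11n.pt")        # object detection
segment = YOLO("yolo11n-seg.pt")   # instance segmentation
classify = YOLO("yolo11n-cls.pt")  # image classification
pose = YOLO("yolo11n-pose.pt")     # pose estimation
obb = YOLO("yolo11n-obb.pt")       # oriented object detection

# Every variant exposes the same core workflow:
results = segment("image.jpg")                    # inference
# metrics = segment.val()                         # validation
# segment.train(data="coco8-seg.yaml", epochs=3)  # training
# segment.export(format="onnx")                   # export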
YOLOv11 represents a significant advancement in object detection technology, building upon the foundations laid by its predecessors, YOLOv9 and YOLOv10, which were introduced earlier in 2024. This latest iteration from Ultralytics showcases enhanced architectural designs, more sophisticated feature extraction techniques, and refined training methodologies. The synergy of YOLOv11's rapid processing, high accuracy, and computational efficiency positions it as one of the most formidable models in Ultralytics' portfolio to date [22]. A key strength of YOLOv11 lies in its refined architecture, which facilitates the detection of subtle details even in challenging scenarios. The model's improved feature extraction capabilities allow it to identify and process a broader range of patterns and intricate elements within images. Compared to earlier versions, YOLOv11 introduces several notable enhancements:
Enhanced precision with reduced complexity: The YOLOv11m variant achieves superior mean Average Precision (mAP) scores on the COCO dataset while utilizing 22% fewer parameters than its YOLOv8m counterpart, demonstrating improved computational efficiency without compromising accuracy [23].
Versatility in CV tasks: YOLOv11 exhibits proficiency across a diverse array of CV applications, including pose estimation, object recognition, image classification, instance segmentation, and oriented bounding box (OBB) detection [23].
Optimized speed and performance: Through refined architectural designs and streamlined training pipelines, YOLOv11 achieves faster processing speeds while maintaining a balance between accuracy and computational efficiency [23].
Streamlined parameter count: The reduction in parameters contributes to faster model performance without significantly impacting the overall accuracy of YOLOv11 [22].
Advanced feature extraction: YOLOv11 incorporates improvements in both its backbone and neck architectures, resulting in enhanced feature extraction capabilities and, consequently, more precise object detection [23].
Contextual adaptability: YOLOv11 demonstrates versatility across various deployment scenarios, including cloud platforms, edge devices, and systems optimized for NVIDIA GPUs [23].
The YOLOv11 model demonstrates significant advancements in both inference speed and accuracy compared to its predecessors. In the benchmark analysis, YOLOv11 was compared against several of its predecessors, from YOLOv5 [24] through more recent versions such as YOLOv10. As presented in Figure 2, YOLOv11 consistently outperforms these models, achieving superior mAP on the COCO dataset while maintaining a faster inference rate [25].
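One way to reproduce this kind of comparison is sketched below using the Ultralytics validation API; "coco.yaml" assumes the standard COCO data configuration, and the reported numbers will vary with hardware.

from ultralytics import YOLO

for name in ["yolov8n.pt", "yolo11n.pt"]:
    model = YOLO(name)
    metrics = model.val(data="coco.yaml")       # evaluate on COCO
    print(name, "mAP 50-95:", metrics.box.map)  # mean AP over IoU 0.50-0.95
    print(name, "ms/image:", metrics.speed["inference"])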
The performance comparison graph depicted in Figure 2 offers several key insights. The YOLOv11 variants (11n, 11s, 11m, and 11x) form a distinct performance frontier, with each model achieving higher COCO mAP 50-95 scores at their respective latency points. Notably, YOLOv11x achieves approximately 54.5% mAP 50-95 at 13 ms latency, surpassing all previous YOLO iterations. The intermediate variants, particularly YOLOv11m, demonstrate exceptional efficiency by achieving comparable accuracy to larger models from previous generations while requiring significantly less processing time.
A particularly noteworthy observation is the performance leap in the low-latency regime (2-6 ms), where YOLOv11s maintains high accuracy (approximately 47% mAP 50-95) while operating at speeds previously associated with much less accurate models. This represents a crucial advancement for real-time applications where both speed and accuracy are critical. The improvement curve of YOLOv11 also shows better scaling characteristics across its model variants, suggesting more efficient utilization of additional computational resources compared to previous generations.
Figure 2: Benchmarking YOLOv11 Against Previous Versions [23]
7 Discussion
YOLO11 marks a significant leap forward in object detection technology, building upon its predecessors while introducing innovative enhancements. This latest iteration demonstrates remarkable versatility and efficiency across various CV tasks.
Efficiency and Scalability: YOLO11 introduces a range of model sizes, from nano to extra-large, catering to diverse application needs. This scalability allows for deployment in scenarios ranging from resource-constrained edge devices to high-performance computing environments. The nano variant, in particular, showcases impressive speed and efficiency improvements over its predecessor, making it ideal for real-time applications.
Architectural Innovations: The incorporation of novel architectural elements such as the C3k2 block, SPPF, and C2PSA contributes to more effective feature extraction and processing. These enhancements allow the model to better analyze and interpret complex visual information, potentially leading to improved detection accuracy across various scenarios.
Multi-Task Proficiency: YOLO11's versatility extends beyond object detection, encompassing tasks such as instance segmentation, image classification, pose estimation, and oriented object detection. This multi-faceted approach positions YOLO11 as a comprehensive solution for diverse CV challenges.
Enhanced Attention Mechanisms: A key advancement in YOLO11 is the integration of sophisticated spatial attention mechanisms, particularly the C2PSA component. This feature enables the model to focus more effectively on critical regions within an image, enhancing its ability to detect and analyze objects. The improved attention capability is especially beneficial for identifying complex or partially occluded objects, addressing a common challenge in object detection tasks. This refinement in spatial awareness contributes to YOLO11's overall performance improvements, particularly in challenging visual environments.
Performance Benchmarks: Comparative analyses reveal YOLO11's superior performance, particularly in its smaller variants. The nano model, despite a slight increase in parameters, demonstrates enhanced inference speed and frames per second (FPS) compared to its predecessor. This improvement suggests that YOLO11 achieves a favorable balance between computational efficiency and detection accuracy.
Implications for Real-World Applications: The advancements in YOLO11 have significant implications for various industries. Its improved efficiency and multi-task capabilities make it particularly suitable for applications in autonomous vehicles, surveillance systems, and industrial automation. The model's ability to perform well across different scales also opens up new possibilities for deployment in resource-constrained environments without compromising on performance.
YOLOv11 represents a significant advancement in the field of CV, offering a compelling combination of enhanced performance and versatility. This latest iteration of the YOLO architecture demonstrates marked improvements in accuracy and processing speed, while simultaneously reducing the number of parameters required. Such optimizations make YOLOv11 particularly well-suited for a wide range of applications, from edge computing to cloud-based analysis.
The model's adaptability across various tasks, including object detection, instance segmentation, and pose estimation, positions it as a valuable tool for diverse applications such as emotion detection [26] and healthcare [27], among various other industries [17]. Its seamless integration capabilities and improved efficiency make it an attractive option for businesses seeking to implement or upgrade their CV systems. In summary, YOLOv11's blend of enhanced feature extraction, optimized performance, and broad task support establishes it as a formidable solution for addressing complex visual recognition challenges in both research and practical applications.