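I. Overview of the Improvement
Small-object detection remains one of the hardest core problems in computer vision. Small objects occupy few pixels, carry little feature information, and are easily swamped by background noise. In cluttered scenes, under low light, or under occlusion, a single visual modality combined with a conventional network structure is prone to missed and false detections. HCANet (Hybrid Convolution Attention Network) is a high-precision feature-enhancement network proposed for exactly these pain points: it combines the local detail extraction of convolution with the global relationship modeling of attention to strengthen, purify, and sharpen small-object features, attacking the small-object bottleneck at the architecture level.
The improvement in this post is built on the YOLOv11 framework. It uses ContextAggregation, a context-enhancement and feature-refinement module, to modify the YOLOv11 backbone, neck, and head, which can improve infrared small-object detection. It also combines ContextAggregation with a P2 small-object detection head to push small-object performance further.
ContextAggregation enhances context information and refines features, so it can capture the weak signatures of infrared small targets, enlarge the receptive field, aggregate global and local context, suppress background interference, and strengthen small-object feature representation. Inserted into the YOLOv11 backbone, neck, or head and combined with the P2 small-object detection head, it strengthens the localization and recognition of small objects and can raise both precision and recall in infrared scenes.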
Contents
[2.1 What is ContextAggregation?](#2.1 What is ContextAggregation?)
[2.2 Why does ContextAggregation suit small-object and infrared detection?](#2.2 Why does ContextAggregation suit small-object and infrared detection?)
[2.3 Core principle: how does ContextAggregation work?](#2.3 Core principle: how does ContextAggregation work?)
[3.1 Change 1: add the core code](#3.1 Change 1: add the core code)
[3.2 Change 2: register the module in tasks.py](#3.2 Change 2: register the module in tasks.py)
[3.3 Change 3: YAML files](#3.3 Change 3: YAML files)
[3.3.1 yolo11-ContextAggregation-head-p2.yaml](#3.3.1 yolo11-ContextAggregation-head-p2.yaml)
[3.3.2 yolo11-ContextAggregation-backbone-p2.yaml](#3.3.2 yolo11-ContextAggregation-backbone-p2.yaml)
[3.3.3 yolo11-ContextAggregation-neck-p2.yaml](#3.3.3 yolo11-ContextAggregation-neck-p2.yaml)
II. How It Works
Official paper: https://arxiv.org/abs/2106.01401

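2.1 What is ContextAggregation?
ContextAggregation, literally "context aggregation", is at heart a lightweight network module dedicated to feature enhancement and context fusion. Its job is to address three weaknesses of conventional CNNs in object detection: insufficient receptive field, coarse feature extraction, and heavy background interference. Put simply, it acts as a feature filter and enhancer: it captures the local detail of a target (for example the faint pixel signature of a small object) while also aggregating global context (for example the surroundings that help separate target from background), so the model can identify targets more accurately, especially small or blurry targets in infrared scenes.
One point to note: ContextAggregation is not a module owned by a single paper but a general technical term. Different papers (for example at ICLR 2016 and NeurIPS 2021) implement the idea with different emphases while sharing the same core logic. The implementation best suited to YOLOv11 small-object work is the one from the NeurIPS 2021 paper "Container: Context Aggregation Network".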
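2.2 Why does ContextAggregation suit small-object and infrared detection?
Before getting to the mechanism, one question needs answering: why add ContextAggregation when adapting YOLOv11 to infrared small objects? The reason is that it targets three pain points of small-object and infrared scenes:
- Weak small-object features: infrared small targets usually cover very few pixels (a few dozen, say) and have a weak signal, so a conventional CNN can easily filter them out as background noise;
- Receptive-field mismatch: as the YOLO backbone (for example its C3k2 stages) downsamples, the receptive field grows but small-object detail is lost, which hurts localization;
- Heavy background interference: infrared images are typically low-contrast, and the gray-level difference between background (sky, ground) and target is small, so the model struggles to tell them apart.
ContextAggregation addresses each of these in turn: context enhancement enlarges the receptive field, feature refinement preserves small-object detail, and attention suppresses background interference. That is why it is a popular choice for YOLOv11 small-object improvements.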
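2.3 Core principle: how does ContextAggregation work?
The principle fits in one phrase: global context guidance + local feature refinement + attention-weighted fusion. A multi-branch structure captures global context and local detail at the same time, an attention mechanism then selects the useful features, and the module outputs an enhanced feature map. Following the NeurIPS 2021 Container paper (the version that fits YOLOv11 best), the principle breaks down into three steps:
Step 1: feature input and branch split
A feature map produced by the YOLOv11 backbone or neck (for example the output of a C3k2 block) is fed into the ContextAggregation module, which first splits it into two branches with distinct roles:
- Local branch: small kernels (1×1, 3×3) extract local detail, focusing on the edges and contours of small objects, to counter the loss of small-object detail;
- Context branch: dilated convolution extracts wider context. Its advantage is that it enlarges the receptive field considerably without adding parameters, so it can pick up the surroundings of a target and help the model separate target from background, to counter the insufficient receptive field.
A side note: dilated convolution (also called atrous convolution) is the main building block here. Its use for context aggregation was popularized by the ICLR 2016 paper "Multi-Scale Context Aggregation by Dilated Convolutions". By inserting gaps between kernel taps, the kernel covers a larger area: a 3×3 convolution with dilation rate 2 spans the same 5×5 window as an ordinary 5×5 convolution (k + (k − 1)(d − 1) = 3 + 2·1 = 5) while keeping the 9 weights of a 3×3 kernel, which makes it well suited to lightweight models.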
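The receptive-field arithmetic above can be checked with a small sketch. The helper names effective_kernel_size and conv2d_weight_count are made up for this example, and the 64-channel layer is an arbitrary choice; only the formula k + (k − 1)(d − 1) comes from the standard definition of dilation.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in pixels, along one axis) covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv2d_weight_count(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Number of weights in a square 2-D convolution (bias excluded).

    The dilation rate does not appear here: it only changes where the
    taps are placed, not how many there are.
    """
    return in_channels * out_channels * kernel_size * kernel_size


# Same 3x3 kernel, growing dilation: the covered window grows, the weights do not.
for d in (1, 2, 4):
    span = effective_kernel_size(3, d)
    weights = conv2d_weight_count(64, 64, 3)
    print(f"3x3 conv, dilation={d}: covers {span}x{span}, {weights} weights")
```

The weight count is identical for every dilation rate, which is the "larger receptive field at no parameter cost" property the text relies on.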
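Step 2: context aggregation and feature interaction
The two branches do not emit their features independently. They go through context aggregation, in which local features and global context features are fused and guide each other:
- Global context guides the local features: it indicates which regions are target and which are background, reducing misjudgments by the local branch;
- Local features supplement the global context: they add detail so the global features are not too coarse to localize the target.
The process resembles detective work: local features are the clues (the outline of the target), global context is the case background (the scene in which the target appears), and only together do they pin down the target. Some variants of ContextAggregation also use multi-scale dilated convolution (several dilation rates) to capture context at multiple scales and adapt to small objects of different sizes.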
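As a rough illustration of the multi-scale idea, the sketch below runs one small kernel at several dilation rates over a 1-D row and averages the branches. It is a toy under stated assumptions: the 1-D setting, the shared all-ones kernel, the averaging fusion, and the names dilated_conv1d and multi_scale_context are all chosen for the example and are not taken from any of the cited papers.

```python
def dilated_conv1d(signal, weights, dilation):
    """'Same'-padded 1-D convolution (cross-correlation) with a dilated kernel."""
    half = len(weights) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = i + (j - half) * dilation
            if 0 <= idx < len(signal):  # zero padding outside the signal
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_scale_context(signal, weights, dilations=(1, 2, 4)):
    """Average parallel branches that share a kernel size but differ in dilation."""
    branches = [dilated_conv1d(signal, weights, d) for d in dilations]
    return [sum(vals) / len(branches) for vals in zip(*branches)]


# A single bright pixel at position 8 in an otherwise empty row.
row = [0.0] * 17
row[8] = 1.0
kernel = [1.0, 1.0, 1.0]

for d in (1, 2, 4):
    response = dilated_conv1d(row, kernel, d)
    touched = [i for i, v in enumerate(response) if v != 0.0]
    print(f"dilation={d}: positions that see the pixel -> {touched}")

fused = multi_scale_context(row, kernel)
print([round(v, 2) for v in fused])
```

Each branch uses the same three weights, yet the branch with the larger dilation lets positions farther away "see" the bright pixel, so the fused response carries context at several ranges at once.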
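Step 3: attention weighting and residual output
The fused features then pass through a feature reweighting block (FRB), essentially a simplified channel-attention mechanism inspired by SENet and adapted to context aggregation. Its role is to keep useful features and suppress useless ones.
In infrared scenes, for instance, the gray-level features of background and target are close. The reweighting block raises the weights of channels that respond to the target and lowers the weights of channels that respond to background, further suppressing background interference and strengthening target features.
Finally, a residual connection (adding the input features to the output features) mitigates vanishing gradients, keeps training stable, and preserves the original feature information. The result is an enhanced feature map with rich detail and a clean background, which is passed on to the YOLOv11 neck or head to improve detection accuracy.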
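The reweighting-plus-residual step can be sketched in a few lines of plain Python. This is a minimal illustration of SE-style channel gating in general, not the FRB of any particular paper: the single linear gating layer, the function name reweight_channels, and the hand-picked gate weights are assumptions made for the example (in a real network the gate weights are learned).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reweight_channels(features, gate_weights, gate_bias):
    """SE-style channel gating followed by a residual connection.

    features:     list of C channels, each a flat list of H*W values
    gate_weights: C x C matrix mapping pooled descriptors to gate logits
    gate_bias:    C biases
    """
    # Squeeze: one global-average descriptor per channel.
    pooled = [sum(ch) / len(ch) for ch in features]
    # Excite: one linear layer + sigmoid gives a gate in (0, 1) per channel.
    gates = [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(gate_weights, gate_bias)
    ]
    # Scale each channel by its gate, then add the input back (residual).
    out = [[v + g * v for v in ch] for ch, g in zip(features, gates)]
    return out, gates


features = [
    [0.0, 0.0, 1.0, 0.0],  # channel 0: responds to a small bright target
    [0.5, 0.5, 0.5, 0.5],  # channel 1: responds to uniform background
]
gate_weights = [[8.0, 0.0], [0.0, -8.0]]  # hypothetical values; learned in practice
gate_bias = [0.0, 0.0]

out, gates = reweight_channels(features, gate_weights, gate_bias)
print([round(g, 3) for g in gates])
```

With these weights the target channel gets a gate close to 1 and the background channel a gate close to 0, so after the residual addition the target channel is amplified far more than the background channel, while the original signal is never removed.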
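III. Adding the Code
3.1 Change 1: add the core code
Create an attention directory under ultralytics/nn/, create attention.py inside it, and add the following code to attention.py. The code imports from mmcv and mmengine, so both packages must be installed.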
python
###################### ContextAggregation #### START by AI&CV ###############################
import torch
import torch.nn as nn
from mmcv.cnn import ConvModule
from mmengine.model import caffe2_xavier_init, constant_init

from ultralytics.nn.modules.conv import Conv  # not used by ContextAggregation itself


class ContextAggregation(nn.Module):
    """Context Aggregation Block.

    Args:
        in_channels (int): Number of input channels.
        reduction (int, optional): Channel reduction ratio. Default: 1.
    """

    def __init__(self, in_channels, reduction=1):
        super().__init__()
        self.in_channels = in_channels
        self.reduction = reduction
        self.inter_channels = max(in_channels // reduction, 1)

        # All four projections are 1x1 convolutions without activation.
        conv_params = dict(kernel_size=1, act_cfg=None)
        self.a = ConvModule(in_channels, 1, **conv_params)  # spatial gate
        self.k = ConvModule(in_channels, 1, **conv_params)  # attention over positions
        self.v = ConvModule(in_channels, self.inter_channels, **conv_params)  # values
        self.m = ConvModule(self.inter_channels, in_channels, **conv_params)  # output projection

        self.init_weights()

    def init_weights(self):
        for m in (self.a, self.k, self.v):
            caffe2_xavier_init(m.conv)
        # Zero-init the output projection so the block starts as an identity mapping.
        constant_init(self.m.conv, 0)

    def forward(self, x):
        n = x.size(0)
        c = self.inter_channels

        # a: [N, 1, H, W], per-position gate in (0, 1)
        a = self.a(x).sigmoid()

        # k: [N, 1, HW, 1], softmax over all spatial positions
        k = self.k(x).view(n, 1, -1, 1).softmax(2)

        # v: [N, 1, C, HW], with C = inter_channels
        v = self.v(x).view(n, 1, c, -1)

        # y: [N, C, 1, 1], attention-weighted sum of v over positions
        y = torch.matmul(v, k).view(n, c, 1, 1)
        # Project back to in_channels, broadcast over H x W, and gate by a.
        y = self.m(y) * a

        return x + y
###################### ContextAggregation #### END by AI&CV ###############################
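To make the tensor operations in forward easier to follow, here is a dependency-free sketch of the same arithmetic for a single sample, with the feature map flattened to C channels × P positions. It is an illustration under simplifying assumptions rather than a replacement for the module: the 1×1 convolutions are written as plain weight vectors and matrices, biases are omitted, and the function and argument names (context_aggregation, w_a, w_k, w_v, w_m) are made up for this example.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_aggregation(x, w_a, w_k, w_v, w_m):
    """x: C x P feature map (C channels, P = H*W flattened positions).

    w_a, w_k: length-C weight vectors (1x1 convs with one output channel)
    w_v:      C_inter x C matrix; w_m: C x C_inter matrix
    """
    channels, positions = len(x), len(x[0])
    pixels = [[x[c][p] for c in range(channels)] for p in range(positions)]

    # a: per-position sigmoid gate, length P
    a = [1.0 / (1.0 + math.exp(-dot(w_a, px))) for px in pixels]

    # k: softmax over positions, length P (sums to 1)
    logits = [dot(w_k, px) for px in pixels]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    k = [e / total for e in exps]

    # v: projected value vector at every position; y pools them with weights k
    v = [matvec(w_v, px) for px in pixels]
    y = [sum(k[p] * v[p][i] for p in range(positions)) for i in range(len(w_v))]

    # m: project the pooled context back to C channels, gate it, add the residual
    ctx = matvec(w_m, y)
    out = [[x[c][p] + ctx[c] * a[p] for p in range(positions)] for c in range(channels)]
    return out, k


x = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]             # C=2 channels, P=4 positions
w_a, w_k = [0.5, -0.5], [1.0, 0.0]
w_v = [[1.0, 0.0], [0.0, 1.0]]         # C_inter = 2 (reduction = 1)

zero_m = [[0.0, 0.0], [0.0, 0.0]]      # mirrors constant_init(self.m.conv, 0)
out, k = context_aggregation(x, w_a, w_k, w_v, zero_m)
print(out == x)                         # zero-initialized m leaves x unchanged

identity_m = [[1.0, 0.0], [0.0, 1.0]]
out, _ = context_aggregation(x, w_a, w_k, w_v, identity_m)
print([round(val, 3) for val in out[0]])  # channel 0 after adding the gated context
```

Seen this way, the block computes one global context vector per sample (a softmax-weighted average of the projected features), then adds it back at every position, scaled by a per-position gate. Because init_weights zero-initializes the m projection, the module starts training as an identity mapping, which the zero_m case reproduces.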
3.2 Change 2: register the module in tasks.py
Edit ultralytics/nn/tasks.py.
(1) Add the import at the top of the file
python
from ultralytics.nn.attention.attention import ContextAggregation
(2) In def parse_model(d, ch, verbose=True), add ContextAggregation to the first module set
python
        # Inside the layer loop of parse_model
        n = n_ = max(round(n * depth), 1) if n > 1 else n  # depth gain
        if m in {
            Classify,
            Conv,
            ConvTranspose,
            GhostConv,
            Bottleneck,
            GhostBottleneck,
            SPP,
            SPPF,
            C2fPSA,
            C2PSA,
            DWConv,
            Focus,
            BottleneckCSP,
            C1,
            C2,
            C2f,
            C3k2,
            RepNCSPELAN4,
            ELAN1,
            ADown,
            AConv,
            SPPELAN,
            C2fAttn,
            C3,
            C3TR,
            C3Ghost,
            nn.ConvTranspose2d,
            DWConvTranspose2d,
            C3x,
            RepC3,
            PSA,
            SCDown,
            C2fCIB,
            ContextAggregation,  # added here
        }:
            c1, c2 = ch[f], args[0]
            if c2 != nc:  # if c2 not equal to number of classes (i.e. for Classify() output)
                c2 = make_divisible(min(c2, max_channels) * width, 8)
            if m is C2fAttn:
                args[1] = make_divisible(min(args[1], max_channels // 2) * width, 8)  # embed channels
                args[2] = int(
                    max(round(min(args[2], max_channels // 2 // 32)) * width, 1) if args[2] > 1 else args[2]
                )  # num heads

            args = [c1, c2, *args[1:]]
            if m in {
                BottleneckCSP,
                C1,
                C2,
                C2f,
                C3k2,
                C2fAttn,
                C3,
                C3TR,
                C3Ghost,
                C3x,
                RepC3,
                C2fPSA,
                C2fCIB,
                C2PSA,
            }:
                args.insert(2, n)  # number of repeats
                n = 1
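One consequence of this registration is worth checking before training. The branch above builds args = [c1, c2, ...], while the constructor in section 3.1 is ContextAggregation(in_channels, reduction=1), so the scaled channel count c2 is received as reduction. The sketch below replays that arithmetic in plain Python; make_divisible here is a local stand-in assumed to round up to a multiple of the divisor like the Ultralytics helper, and build_args and inter_channels are names invented for this example.

```python
import math


def make_divisible(x, divisor):
    """Round x up to the nearest multiple of divisor (local stand-in)."""
    return math.ceil(x / divisor) * divisor


def build_args(c1, yaml_args, width, max_channels, nc=80):
    """Replay the channel-handling branch shown above for one layer."""
    c2 = yaml_args[0]
    if c2 != nc:
        c2 = make_divisible(min(c2, max_channels) * width, 8)
    return [c1, c2, *yaml_args[1:]]


def inter_channels(in_channels, reduction=1):
    """Same expression as in ContextAggregation.__init__."""
    return max(in_channels // reduction, 1)


# Scale 'n' (width=0.25, max_channels=1024). A preceding C3k2 with 128 nominal
# channels outputs 128 * 0.25 = 32 channels, so c1 = 32.
args = build_args(c1=32, yaml_args=[128], width=0.25, max_channels=1024)
print(args)                   # positional arguments passed to ContextAggregation
print(inter_channels(*args))  # width of the value branch inside the module
```

Under these assumptions the module is built as ContextAggregation(32, 32), so inter_channels collapses to 1 and the context vector is squeezed through a single channel. The model still builds and runs, because the output keeps in_channels channels. If reduction=1 is what is intended, one possible option (an untested suggestion, not part of the original tutorial) is to give the module its own branch in parse_model, for example `elif m is ContextAggregation: c2 = ch[f]; args = [c2]`, instead of adding it to the shared set.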
3.3 Change 3: YAML files
ContextAggregation + P2 small-object detection head: three variants are provided, each placing ContextAggregation at a different point in the network.
3.3.1 yolo11-ContextAggregation-head-p2.yaml
yaml
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P2-P5 outputs and ContextAggregation in front of the Detect head. For usage examples see https://docs.ultralytics.com/tasks/detect
# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  # The summaries below are carried over from the stock yolo11.yaml and do not account for the added layers.
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs
# YOLO11n backbone
backbone:
# [from, repeats, module, args]
- [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
- [-1, 2, C3k2, [256, False, 0.25]]
- [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
- [-1, 2, C3k2, [512, False, 0.25]]
- [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
- [-1, 2, C3k2, [512, True]]
- [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
- [-1, 2, C3k2, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 9
- [-1, 2, C2PSA, [1024]] # 10
# YOLO11n head
head:
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 6], 1, Concat, [1]] # cat backbone P4
- [-1, 2, C3k2, [512, False]] # 13
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 4], 1, Concat, [1]] # cat backbone P3
- [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 2], 1, Concat, [1]] # cat backbone P2
- [-1, 2, C3k2, [128, False]] # 19 (P2/4-xsmall)
- [-1, 1, Conv, [128, 3, 2]]
- [[-1, 16], 1, Concat, [1]] # cat head P3
- [-1, 2, C3k2, [256, False]] # 22 (P3/8-medium)
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 13], 1, Concat, [1]] # cat head P4
- [-1, 2, C3k2, [512, False]] # 25 (P4/16-medium)
- [-1, 1, Conv, [512, 3, 2]]
- [[-1, 10], 1, Concat, [1]] # cat head P5
- [-1, 2, C3k2, [1024, True]] # 28 (P5/32-large)
- [19, 1, ContextAggregation, [128]] # 29
- [22, 1, ContextAggregation, [256]] # 30
- [25, 1, ContextAggregation, [512]] # 31
- [28, 1, ContextAggregation, [1024]] # 32
- [[29, 30, 31, 32], 1, Detect, [nc]] # Detect(P2, P3, P4, P5)
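Absolute layer indices are the easiest thing to get wrong when editing these files, so a quick bookkeeping check can help. The sketch below transcribes the layer list of the file above into Python tuples and reports what each Detect input actually points at. The (from, module, channels) tuple format and the name trace_detect_inputs are assumptions made for this example; this is not how Ultralytics parses the YAML.

```python
# (from, module, nominal out_channels) for each layer of
# yolo11-ContextAggregation-head-p2.yaml, in file order.
BACKBONE = [
    (-1, "Conv", 64), (-1, "Conv", 128), (-1, "C3k2", 256), (-1, "Conv", 256),
    (-1, "C3k2", 512), (-1, "Conv", 512), (-1, "C3k2", 512), (-1, "Conv", 1024),
    (-1, "C3k2", 1024), (-1, "SPPF", 1024), (-1, "C2PSA", 1024),
]
HEAD = [
    (-1, "Upsample", None), ([-1, 6], "Concat", None), (-1, "C3k2", 512),
    (-1, "Upsample", None), ([-1, 4], "Concat", None), (-1, "C3k2", 256),
    (-1, "Upsample", None), ([-1, 2], "Concat", None), (-1, "C3k2", 128),
    (-1, "Conv", 128), ([-1, 16], "Concat", None), (-1, "C3k2", 256),
    (-1, "Conv", 256), ([-1, 13], "Concat", None), (-1, "C3k2", 512),
    (-1, "Conv", 512), ([-1, 10], "Concat", None), (-1, "C3k2", 1024),
    (19, "ContextAggregation", 128), (22, "ContextAggregation", 256),
    (25, "ContextAggregation", 512), (28, "ContextAggregation", 1024),
    ([29, 30, 31, 32], "Detect", None),
]


def trace_detect_inputs(backbone, head):
    """For each Detect input return
    (layer index, module, index it reads from, module at that index, channels).

    Assumes every Detect input reads from a single absolute index, which
    holds for the ContextAggregation layers in this file.
    """
    layers = backbone + head  # running index = position in this list
    sources, module, _ = layers[-1]
    if module != "Detect":
        raise ValueError("last layer is expected to be Detect")
    report = []
    for idx in sources:
        src, name, channels = layers[idx]
        report.append((idx, name, src, layers[src][1], channels))
    return report


for row in trace_detect_inputs(BACKBONE, HEAD):
    print(row)
```

Each row should show a ContextAggregation layer fed by the C3k2 block of the matching scale (128, 256, 512, 1024 nominal channels for P2 through P5). The same check can be repeated for the other two variants after transcribing their layer lists.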
3.3.2 yolo11-ContextAggregation-backbone-p2.yaml
yaml
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P2-P5 outputs and ContextAggregation at the end of the backbone. For usage examples see https://docs.ultralytics.com/tasks/detect
# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  # The summaries below are carried over from the stock yolo11.yaml and do not account for the added layers.
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs
# YOLO11n backbone
backbone:
# [from, repeats, module, args]
- [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
- [-1, 2, C3k2, [256, False, 0.25]]
- [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
- [-1, 2, C3k2, [512, False, 0.25]]
- [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
- [-1, 2, C3k2, [512, True]]
- [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
- [-1, 2, C3k2, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 9
- [-1, 2, C2PSA, [1024]] # 10
- [-1, 1, ContextAggregation, [1024]] # 11
# YOLO11n head
head:
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 6], 1, Concat, [1]] # cat backbone P4
- [-1, 2, C3k2, [512, False]] # 14
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 4], 1, Concat, [1]] # cat backbone P3
- [-1, 2, C3k2, [256, False]] # 17 (P3/8-small)
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 2], 1, Concat, [1]] # cat backbone P2
- [-1, 2, C3k2, [128, False]] # 20 (P2/4-xsmall)
- [-1, 1, Conv, [128, 3, 2]]
- [[-1, 17], 1, Concat, [1]] # cat head P3
- [-1, 2, C3k2, [256, False]] # 23 (P3/8-medium)
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 14], 1, Concat, [1]] # cat head P4
- [-1, 2, C3k2, [512, False]] # 26 (P4/16-medium)
- [-1, 1, Conv, [512, 3, 2]]
- [[-1, 11], 1, Concat, [1]] # cat head P5
- [-1, 2, C3k2, [1024, True]] # 29 (P5/32-large)
- [[20, 23, 26, 29], 1, Detect, [nc]] # Detect(P2, P3, P4, P5)
3.3.3 yolo11-ContextAggregation-neck-p2.yaml
yaml
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P2-P5 outputs and ContextAggregation after each neck output. For usage examples see https://docs.ultralytics.com/tasks/detect
# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  # The summaries below are carried over from the stock yolo11.yaml and do not account for the added layers.
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs
# YOLO11n backbone
backbone:
# [from, repeats, module, args]
- [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
- [-1, 2, C3k2, [256, False, 0.25]]
- [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
- [-1, 2, C3k2, [512, False, 0.25]]
- [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
- [-1, 2, C3k2, [512, True]]
- [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
- [-1, 2, C3k2, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 9
- [-1, 2, C2PSA, [1024]] # 10
# YOLO11n head
head:
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 6], 1, Concat, [1]] # cat backbone P4
- [-1, 2, C3k2, [512, False]] # 13
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 4], 1, Concat, [1]] # cat backbone P3
- [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)
# - [-1, 1, ContextAggregation, [256]] # 17
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 2], 1, Concat, [1]] # cat backbone P2
- [-1, 2, C3k2, [128, False]] # 19 (P2/4-xsmall)
- [-1, 1, ContextAggregation, [128]] # 20
- [-1, 1, Conv, [128, 3, 2]]
- [[-1, 16], 1, Concat, [1]] # cat head P3
- [-1, 2, C3k2, [256, False]] # 23 (P3/8-medium)
- [-1, 1, ContextAggregation, [256]] # 24
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 13], 1, Concat, [1]] # cat head P4
- [-1, 2, C3k2, [512, False]] # 27 (P4/16-medium)
- [-1, 1, ContextAggregation, [512]] # 28
- [-1, 1, Conv, [512, 3, 2]]
- [[-1, 10], 1, Concat, [1]] # cat head P5
- [-1, 2, C3k2, [1024, True]] # 31 (P5/32-large)
- [-1, 1, ContextAggregation, [1024]] # 32
- [[20, 24, 28, 32], 1, Detect, [nc]] # Detect(P2, P3, P4, P5)
IV. Summary
Gains can differ from dataset to dataset, so it is worth trying the module at different positions in the network.