Following on from the previous post: a regular convolution on CUDA first rearranges memory into a contiguous layout and then runs a series of high-throughput multiply-accumulate operations on the CUDA cores or Tensor cores. The pinwheel convolution, however, is not a regular convolution; although its parameter count is also small, there is no corresponding high-performance kernel for it on Jetson, so its memory accesses end up non-contiguous, which drags down inference speed. This post removes the pinwheel convolution and switches back to SPDConv, drops the DFL structure (which is unfriendly to edge devices), changes the activation function from SiLU to ReLU, and retrains, all to improve inference performance on edge devices.
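To make the contiguity point concrete: SPDConv begins with a space-to-depth rearrangement, where each 2×2 spatial block is folded into the channel dimension, and an ordinary, well-optimized convolution then runs on the result. A minimal pure-Python sketch of that rearrangement on an H×W×C nested list (illustrative only, not the actual Ultralytics implementation):

```python
def space_to_depth(x, block=2):
    # x: H x W x C nested lists. Moves each block x block spatial patch
    # into the channel dimension: output is (H/block) x (W/block) x (C*block*block).
    H, W = len(x), len(x[0])
    out = []
    for i in range(0, H, block):
        row = []
        for j in range(0, W, block):
            cell = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(x[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

# A 2x2 single-channel image collapses into one spatial position with 4 channels.
print(space_to_depth([[[1], [2]], [[3], [4]]]))  # [[[1, 2, 3, 4]]]
```

Because the downsampling happens by reshaping rather than by a strided exotic kernel, the follow-up convolution sees a standard contiguous tensor layout that existing high-performance kernels handle well.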
I. Model Information
Model structure diagram

YAML file
```yaml
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.50, 1024]
  # s: [1.00, 1.00, 1024]
  # m: [1.00, 2.00, 512]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, SPDConv, [32]]
  - [-1, 1, SPDConv, [64]]
  - [-1, 2, C3k2, [64, True, 0.25]] # 2 P2
  - [-1, 1, Conv, [64, 3, 2]]
  - [-1, 2, C3k2, [128, True, 0.25]] # 4 P3
  - [-1, 1, Conv, [128, 3, 2]]
  - [-1, 2, C3k2, [256, False]] # 6 P4
  - [-1, 1, SPPF, [256, 5]]
  - [-1, 2, C2PSA, [256]] # 8

head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [128, False]] # 11
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P2
  - [-1, 2, C3k2, [64, False]] # 14
  - [-1, 1, Conv, [64, 3, 2]]
  - [[-1, 11], 1, Concat, [1]]
  - [-1, 2, C3k2, [128, False]] # 17
  # upward branch, fusing the original features
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P2
  - [-1, 2, MicroC3, [64]] # 20
  - [-1, 1, HDC, [64]]
  - [-1, 1, ART, [64]] # 22
  - [17, 1, Conv, [128, 3, 2]]
  - [[-1, 8], 1, Concat, [1]] # 24
  - [-1, 2, C3k2, [256, True]] # 25
  - [[22, 17, 25], 1, Detect, [nc]] # Detect(P2, P3, P4)
  # - [[21, 17, 24], 1, Detect, [nc]] # one fewer Concat
```
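For readers tracing the `# 17`, `# 22`, `# 24` comments: Ultralytics indexes backbone and head layers as one flat list, and the `from` fields and the final `Detect` inputs refer to those flat indices. A small bookkeeping sketch (module names copied from the YAML above) confirms which layers feed the detection head:

```python
# Flat layer index = position in backbone + head, counted together from 0.
modules = (
    # backbone: indices 0-8
    ["SPDConv", "SPDConv", "C3k2", "Conv", "C3k2", "Conv", "C3k2", "SPPF", "C2PSA"]
    # head: indices 9-25 (Detect itself would be 26)
    + ["Upsample", "Concat", "C3k2", "Upsample", "Concat", "C3k2", "Conv", "Concat",
       "C3k2", "Upsample", "Concat", "MicroC3", "HDC", "ART", "Conv", "Concat", "C3k2"]
)

# Detect takes layers [22, 17, 25]: the ART output (P2), and two C3k2 outputs (P3, P4).
print(modules[22], modules[17], modules[25])  # ART C3k2 C3k2
```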
Model parameter analysis

The n-model's total FLOPs are very small, only 4.78 GFLOPs, with just over 500K parameters.
The s-model's FLOPs are also only 21.554 GFLOPs.
II. Detailed Changes
1. Disabling DFL
ultralytics/nn/modules/head.py
```python
class Detect(nn.Module):
    ...

    def __init__(self, nc: int = 80, ch: tuple = ()):
        """
        Initialize the YOLO detection layer with specified number of classes and channels.

        Args:
            nc (int): Number of classes.
            ch (tuple): Tuple of channel sizes from backbone feature maps.
        """
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        # self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.reg_max = 1  # !!! comment out the line above and replace it with this
```
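Setting `reg_max = 1` collapses the DFL box branch from a 16-bin distribution per coordinate to a single direct regression value, so the per-anchor output width of the head shrinks accordingly. A quick sanity check (simple arithmetic, with `nc = 1` as in the YAML above):

```python
def detect_output_channels(nc: int, reg_max: int) -> int:
    # The Detect head predicts reg_max bins for each of the 4 box
    # coordinates, plus nc class scores, per anchor location.
    return 4 * reg_max + nc

print(detect_output_channels(nc=1, reg_max=16))  # DFL on:  65 channels
print(detect_output_channels(nc=1, reg_max=1))   # DFL off:  5 channels
```

With `reg_max = 1` the DFL decoding module also reduces to an identity at export time, so the softmax-over-bins step disappears from the deployed graph, which is what makes the head friendlier to edge runtimes.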
2. Changing the module activation function
ultralytics/nn/modules/conv.py
```python
class Conv(nn.Module):
    """
    Standard convolution module with batch normalization and activation.

    Attributes:
        conv (nn.Conv2d): Convolutional layer.
        bn (nn.BatchNorm2d): Batch normalization layer.
        act (nn.Module): Activation function layer.
        default_act (nn.Module): Default activation function (SiLU).
    """

    # default_act = nn.SiLU()  # default activation
    default_act = nn.ReLU()  # !!! change it here
```
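The motivation for this swap: SiLU needs a sigmoid (an `exp`) per element, while ReLU is a single compare/select that edge inference engines can fuse cheaply into the preceding convolution. The two functions side by side, in plain Python for reference:

```python
import math

def silu(x: float) -> float:
    # SiLU / Swish: x * sigmoid(x) -- smooth, but costs an exp per element
    return x / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    # ReLU: max(0, x) -- a single compare, trivially fused on edge hardware
    return max(0.0, x)

print(silu(1.0), relu(1.0))  # SiLU slightly damps positive inputs; ReLU passes them through
```

Note that retraining is required after the swap, since weights trained under SiLU do not transfer cleanly to a ReLU network.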
All other modules in use should also be checked to confirm that their activation function is ReLU.
III. Experimental Results
Confusion matrix on the test set

The network achieves high recall and precision on our self-built test set.
Inference performance
The n-model reaches 90 FPS on a Jetson NX board!
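As a quick sanity check on what that throughput implies per frame (arithmetic only, not a new benchmark):

```python
fps = 90  # measured on Jetson NX, from the result above
frame_budget_ms = 1000.0 / fps
print(f"{frame_budget_ms:.1f} ms per frame")  # 11.1 ms per frame
```

That budget leaves comfortable headroom for pre/post-processing when driving a typical 30 FPS camera stream.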
IV. Next Steps
- Share the inference code