yolov8源码解读Detect层

Detect层解读
网络各层解读及detect层后的处理

关于网络的backbone,head，以及detect层后处理，可以参考文章结尾博主的文章。

Detect层解读

先贴一下全部代码,下面一一解读。

python 复制代码

class Detect(nn.Module):
    """YOLOv8 Detect head for detection models."""
    dynamic = False  # force grid reconstruction
    export = False  # export mode
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):
        """Initializes the YOLOv8 detection layer with specified number of classes and channels."""
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch)
        self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        shape = x[0].shape  # BCHW
        # print(">>>>", x[0].shape)
        # print(">>>>", x[1].shape)
        # print(">>>>", x[2].shape)
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)

        if self.training:
            return x
        elif self.dynamic or self.shape != shape:
            self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
            self.shape = shape

        x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
        if self.export and self.format in ('saved_model', 'pb', 'tflite', 'edgetpu', 'tfjs'):  # avoid TF FlexSplitV ops
            box = x_cat[:, :self.reg_max * 4]
            cls = x_cat[:, self.reg_max * 4:]
        else:
            box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)

        dbox = dist2bbox(self.dfl(box), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides

        if self.export and self.format in ('tflite', 'edgetpu'):
            # Normalize xywh with image size to mitigate quantization error of TFLite integer models as done in YOLOv5:
            # https://github.com/ultralytics/yolov5/blob/0c8de3fca4a702f8ff5c435e67f378d1fce70243/models/tf.py#L307-L309
            # See this PR for details: https://github.com/ultralytics/ultralytics/pull/1695
            img_h = shape[2] * self.stride[0]
            img_w = shape[3] * self.stride[0]
            img_size = torch.tensor([img_w, img_h, img_w, img_h], device=dbox.device).reshape(1, 4, 1)
            dbox /= img_size
        # print(cls.shape)
        y = torch.cat((dbox, cls.sigmoid()), 1)
        # print(y.shape)
        return y if self.export else (y, x)

javascript 复制代码

	dynamic = False #这个属性指示网格（通常是特征图上的锚框网格）是否需要动态地重建
    export = False  #这个属性用于指示模型是否处于导出模式。
    shape = None # 用于存储输入图像或特征图的尺寸。
    anchors = torch.empty(0)  # 创建了一个空的PyTorch张量
    strides = torch.empty(0)

步长（strides）是卷积神经网络中特征图相对于输入图像的缩小比例。
例如，如果步长是32，那么一个32x32像素的区域在特征图上就对应一个单元。
和anchors一样，这里的torch.empty(0)表示步长尚未初始化。

javascript 复制代码

    def __init__(self, nc=80, ch=()):
        """Initializes the YOLOv8 detection layer with specified number of classes and channels."""
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch)
        self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

nc:类别数
nl:检测层的数量，目标检测中为3。
ch:传入的图片通道尺寸，在yolov8n,图片大小为640*640时。这里的ch为（256,128,64）
no:两个卷积再拼接后输出通道数，为4×reg_max+nc
c2,c3:计算卷积层的通道数。
cv2,cv3:定义的卷积操作，以输出有关类别和选框的特征图。
dfl:通过将分布式的概率分布转化为单一的预测值

javascript 复制代码

class DFL(nn.Module):

    def __init__(self, c1=16):
        """Initialize a convolutional layer with a given number of input channels."""
        super().__init__()
        self.conv = nn.Conv2d(c1, 1, 1, bias=False).requires_grad_(False)
        x = torch.arange(c1, dtype=torch.float)
        self.conv.weight.data[:] = nn.Parameter(x.view(1, c1, 1, 1))
        self.c1 = c1

    def forward(self, x):
        """Applies a transformer layer on input tensor 'x' and returns a tensor."""
        b, c, a = x.shape  # batch, channels, anchors
        return self.conv(x.view(b, 4, self.c1, a).transpose(2, 1).softmax(1)).view(b, 4, a)

self.conv：创建了一个输入通道为16，输出为1，没有偏置项，不需要进行梯度更新的卷积层。
这样的权重设置实际上模拟了一个积分过程，将卷积操作变成了加权求和的形式。
x:1到15的整数。
self.conv.weight.data[:] = nn.Parameter(x.view(1, c1, 1, 1))：
这里使用nn.Parameter将重塑后的张量设置为模型的参数，并且参数不会被更新。
假设前向传播中，x的形状为(1, 64, 8400)，下面解释下forword中的变化。
1，x.view(b, 4, self.c1, a): 这个操作是对x的形状进行重塑。self.c1是16（因为输入通道数是64，即4*self.c1），那么a是8400（代表了所有锚点的数量）。b是批次大小，这里为1。所以x.view(b, 4, self.c1, a)将x从(1, 64, 8400)重塑为(1, 4, 16, 8400)。在这个形状中，我们得到了每个锚点的每个坐标轴（x, y, 宽度, 高度）上的16个预测值（可能代表某种概率分布）。
2，transpose(2, 1): 这个操作交换第二维和第三维。在应用transpose之后，张量的形状变为(1, 16, 4, 8400)。这样做的目的是让每组概率分布的16个预测值连续地排列在一起，为后面的softmax运算做准备。
3，softmax(1): softmax函数应用于第一维（现在是16个预测值的这一维）。softmax确保了这16个值之和为1，转换为一个有效的概率分布，表示每个预测值的可能性。
4，self.conv(...): 这个操作将配置好的卷积层应用在进行了softmax操作的张量上。由于卷积层的权重已被设置为从0到15的整数，并且不更新权重（不进行梯度下降优化），这个步骤实际上是在计算期望值。卷积层将每个离散的概率值乘以其相应的索引（也就是权重），然后对结果进行求和，得到该坐标的预测值。
5，view(b, 4, a): 最后一步是将张量的形状从卷积操作后的(1, 1, 4, 8400)转换回(1, 4, 8400)。这样确保了最终的输出张量与每个坐标轴的预测值（x, y, 宽度, 高度）和所有锚点的数量对齐。
总的来说，dfl层就是对预测的坐标求加权期望值。将（1,64，8400）先变为（1,16,4，8400），然后对这16个通道求加权期望，变为（1,4,8400）即这8400个锚点中的每一个锚点，x,y,width,hight的加权平均值。
接下来是前向传播的过程。打印传入的x形状，发现通道数是64,128,256。

原因：Detect层接受15,18，21层的输入。原本通道数是1024,512,256。但是yolov8n还需要乘0.25。

经过cv2，通道数变为64，经过cv3通道数变为nc，我这里nc为2（二分类）。在经过cat拼接，在通道维度上拼接，所以x[i]的通道数变为66。

如果处于训练模式，就直接返回x。
否则执行下面的代码，将特征图列表x（1×66×40×40,1×66×80×80,1×66×20×20）传递给make_anchors()函数。make_anchors函数用于生成锚点（anchors）,它通常用在目标检测网络中。每个锚点代表了特征图上的一个点,可以用来预测相对于该点的边界框。strides是这些特征图相对于原始图像的下采样步长。简单来说，生成了8400个锚点（40×40+80×80+20×20），变量为anchors，形状为1×2×8400）。同时生成了8400个步长，变量为strides,形状为1×8400。参数0.5表示每个锚点处于每个像素块的中央。

javascript 复制代码

        if self.training:
            return x
        elif self.dynamic or self.shape != shape:
            self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
            self.shape = shape

将xi按照2维度进行拼接，xi分别为1×66×40×40,1×66×80×80,1×66×20×20。拼接后的x_cat为1×66×8400

javascript 复制代码

x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)

这段代码就是把x_cat进行拆分。box形状为1×64×8400，包含每个边界框的回归参数。cls形状为1×2×8400，会包含类别预测，2是因为我这里类别为2。

javascript 复制代码

        if self.export and self.format in ('saved_model', 'pb', 'tflite', 'edgetpu', 'tfjs'):  # avoid TF FlexSplitV ops
            box = x_cat[:, :self.reg_max * 4]
            cls = x_cat[:, self.reg_max * 4:]
        else:
            box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)

dfl层就是对预测的坐标求加权期望值。将（1,64，8400）先变为（1,16,4，8400），然后对这16个通道求加权期望，变为（1,4,8400）即这8400个锚点中的每一个锚点，x1,y1,x2,y2的加权平均值。dist2bbox()函数的作用是将锚点x1,y1,x2,y2转换为x,y,width,hight的形式。最后在乘以步长，还原到原图的大小比例。

javascript 复制代码

dbox = dist2bbox(self.dfl(box), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides

此代码片段的作用是在模型导出为 Tensorflow Lite (tflite) 或 Edge TPU 兼容格式时，对预测框 (dbox) 进行归一化处理。

bash 复制代码

 if self.export and self.format in ('tflite', 'edgetpu'):
            # Normalize xywh with image size to mitigate quantization error of TFLite integer models as done in YOLOv5:
            # https://github.com/ultralytics/yolov5/blob/0c8de3fca4a702f8ff5c435e67f378d1fce70243/models/tf.py#L307-L309
            # See this PR for details: https://github.com/ultralytics/ultralytics/pull/1695
            img_h = shape[2] * self.stride[0]
            img_w = shape[3] * self.stride[0]
            img_size = torch.tensor([img_w, img_h, img_w, img_h], device=dbox.device).reshape(1, 4, 1)
            dbox /= img_size

此时y的形状为1×66×8400。代表有8400个锚点，每个锚点包含坐标框的x,y,width,hight，以及类别得分信息。

bash 复制代码

y = torch.cat((dbox, cls.sigmoid()), 1)

返回值，至此detect层结束。完整的预测，后续还需要进行一些处理。如进行非极大抑制，对这8400个锚点进行筛选

bash 复制代码

return y if self.export else (y, x)

另外，最后的bias_init（）函数用于初始化一个目标检测模型中的Detect层的偏置。确保在训练开始时偏置值是基于合理假设的。这种方法的目标是为模型提供一个好的起点，并有助于加速训练过程中的收敛。

bash 复制代码

    def bias_init(self):
        """Initialize Detect() biases, WARNING: requires stride availability."""
        m = self  # self.model[-1]  # Detect() module
        # cf = torch.bincount(torch.tensor(np.concatenate(dataset.labels, 0)[:, 0]).long(), minlength=nc) + 1
        # ncf = math.log(0.6 / (m.nc - 0.999999)) if cf is None else torch.log(cf / cf.sum())  # nominal class frequency
        for a, b, s in zip(m.cv2, m.cv3, m.stride):  # from
            a[-1].bias.data[:] = 1.0  # box
            b[-1].bias.data[:m.nc] = math.log(5 / m.nc / (640 / s) ** 2)  # cls (.01 objects, 80 classes, 640 img)

网络各层解读及detect层后的处理

关于backbone,head层,以及detect层参考下面博主的文章，讲的非常好。

链接: Yolov 8源码超详细逐行解读+ 网络结构细讲(自我用的小白笔记)

链接: 最细致讲解yolov8模型推理完整代码--（前处理，后处理）