YOLOv3-tiny Network Architecture: An In-Depth Analysis

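1. Overview

YOLOv3-tiny is the lightweight variant of YOLOv3, released by Joseph Redmon and Ali Farhadi in 2018. It targets resource-constrained edge devices: it keeps real-time detection while cutting compute and parameter count substantially.

YOLOv3-tiny at a glance
═══════════════════════════════════════════════════════════════
Parameters:        ~8.7M (about 1/7 of YOLOv3)
Compute:           ~5.6 GFLOPs at 416×416 (about 1/12 of YOLOv3)
Speed:             220+ FPS (on a high-end GPU)
Detection scales:  2 (YOLOv3 has 3)
Depth:             13 convolutional layers in total (YOLOv3's backbone alone has 53)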

2. Overall Architecture

YOLOv3-tiny can be split into three parts: the Backbone (feature extraction), the Neck (feature fusion), and the Head (detection output).

YOLOv3-tiny overall data flow
═══════════════════════════════════════════════════════════════

Input image 416×416×3
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│                      Backbone                               │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  Conv1 → Pool → Conv2 → Pool → Conv3 → Pool →       │   │
│   │  Conv4 → Pool → Conv5 → Pool → Conv6 → Conv7 → Conv8│   │
│   └─────────────────────────────────────────────────────┘   │
│                           │                                 │
│              ┌────────────┴────────────┐                    │
│              ▼                         ▼                    │
│       Feature map P4            Feature map P5              │
│          26×26×256                  13×13×512               │
└─────────────────────────────────────────────────────────────┘
         │                               │
         │                               ▼
         │                    ┌──────────────────┐
         │                    │   Conv + Upsample │
         │                    └────────┬─────────┘
         │                             │
         ▼                             ▼
┌─────────────────────────────────────────────────────────────┐
│                    Neck (simplified FPN)                    │
│                                                             │
│         Concat(P4, Upsampled_P5)                            │
│                    │                                        │
│                    ▼                                        │
│            Fused feature map                                │
│                26×26×384                                    │
└─────────────────────────────────────────────────────────────┘
         │                               │
         ▼                               ▼
┌─────────────────────────────────────────────────────────────┐
│                            Head                             │
│                                                             │
│    Head 1 (26×26)                Head 2 (13×13)             │
│    small objects                 large objects              │
│         │                              │                    │
│         ▼                              ▼                    │
│  26×26×(3×(5+C))               13×13×(3×(5+C))              │
└─────────────────────────────────────────────────────────────┘

3. Backbone in Detail

3.1 Layer-by-layer structure

The listing below contains 8 convolutional layers and 5 max-pooling layers. It is a simplified variant: the reference Darknet configuration (yolov3-tiny.cfg) has a 7-convolution backbone (16, 32, 64, 128, 256, 512, 1024 filters) with six max-pooling layers, the last of which uses stride 1. The stride-1 pool and the 1024-filter convolution are omitted here, and the 1×1 and 3×3 convolutions that follow them are counted as part of the backbone.

Backbone layer by layer
═══════════════════════════════════════════════════════════════

Layer 0: Input
         Size: 416×416×3

Layer 1: Conv + BN + LeakyReLU
         Kernel: 3×3, 16 filters, stride=1, padding=1
         Output: 416×416×16

Layer 2: MaxPool
         Pool: 2×2, stride=2
         Output: 208×208×16

Layer 3: Conv + BN + LeakyReLU
         Kernel: 3×3, 32 filters, stride=1, padding=1
         Output: 208×208×32

Layer 4: MaxPool
         Pool: 2×2, stride=2
         Output: 104×104×32

Layer 5: Conv + BN + LeakyReLU
         Kernel: 3×3, 64 filters, stride=1, padding=1
         Output: 104×104×64

Layer 6: MaxPool
         Pool: 2×2, stride=2
         Output: 52×52×64

Layer 7: Conv + BN + LeakyReLU
         Kernel: 3×3, 128 filters, stride=1, padding=1
         Output: 52×52×128

Layer 8: MaxPool
         Pool: 2×2, stride=2
         Output: 26×26×128

Layer 9: Conv + BN + LeakyReLU
         Kernel: 3×3, 256 filters, stride=1, padding=1
         Output: 26×26×256
         ──────────────────────────────────
         ↑ P4 feature map is taken here (route connection point)

Layer 10: MaxPool
          Pool: 2×2, stride=2
          Output: 13×13×256

Layer 11: Conv + BN + LeakyReLU
          Kernel: 3×3, 512 filters, stride=1, padding=1
          Output: 13×13×512

Layer 12: Conv + BN + LeakyReLU
          Kernel: 1×1, 256 filters, stride=1
          Output: 13×13×256

Layer 13: Conv + BN + LeakyReLU
          Kernel: 3×3, 512 filters, stride=1, padding=1
          Output: 13×13×512
          ──────────────────────────────────
          ↑ P5 feature map is taken here

3.2 Channel count progression

Channel count by layer
═══════════════════════════════════════════════════════════════

     512 ─────────────────────────────────────■■■■■ Layer11-13
         │                                    │
     256 ─────────────────────────────■■■■■──┘     Layer9-10,12
         │                            │
     128 ────────────────────■■■■■───┘             Layer7-8
         │                   │
      64 ───────────■■■■■───┘                      Layer5-6
         │          │
      32 ────■■■■──┘                               Layer3-4
         │   │
      16 ■■─┘                                      Layer1-2
         │
       3 ■                                         Input
         └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──→
            1  2  3  4  5  6  7  8  9 10 11 12 13  Layer

3.3 Spatial resolution

Spatial resolution by layer (416×416 input)
═══════════════════════════════════════════════════════════════

416×416 ──■ Input
           │
           ├─ Conv1 → 416×416
           │
           ├─ Pool1 → 208×208  (÷2)
           │
           ├─ Conv2 → 208×208
           │
           ├─ Pool2 → 104×104  (÷4)
           │
           ├─ Conv3 → 104×104
           │
           ├─ Pool3 → 52×52    (÷8)
           │
           ├─ Conv4 → 52×52
           │
           ├─ Pool4 → 26×26    (÷16)
           │
           ├─ Conv5 → 26×26    ← P4 tap, stride=16
           │
           ├─ Pool5 → 13×13    (÷32)
           │
           └─ Conv6, Conv7, Conv8 → 13×13  ← P5 tap, stride=32
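
The resolution trace above follows directly from the standard output-size formula for convolution and pooling. The sketch below is a minimal illustration of that arithmetic; the `LAYERS` table and the helper names are made up for this example and simply mirror the layer listing in this section.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding), mirroring the layer listing in this section
LAYERS = [
    ("conv1", 3, 1, 1), ("pool1", 2, 2, 0),
    ("conv2", 3, 1, 1), ("pool2", 2, 2, 0),
    ("conv3", 3, 1, 1), ("pool3", 2, 2, 0),
    ("conv4", 3, 1, 1), ("pool4", 2, 2, 0),
    ("conv5", 3, 1, 1), ("pool5", 2, 2, 0),
    ("conv6", 3, 1, 1), ("conv7", 1, 1, 0), ("conv8", 3, 1, 1),
]

def trace_sizes(input_size=416):
    """Return {layer name: output side length} for a square input."""
    sizes = {}
    size = input_size
    for name, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes

sizes = trace_sizes(416)
print("P4 side:", sizes["conv5"], "P5 side:", sizes["conv8"])
```

Because only the five pooling layers change the resolution, any input size divisible by 32 yields clean P4 and P5 grids.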

4. Convolution Block Design

4.1 The standard convolution block

The basic unit in YOLOv3-tiny is the combination Conv + BatchNorm + LeakyReLU.

Standard convolution block (ConvBNLeaky)
═══════════════════════════════════════════════════════════════

        Input feature map
             │
             ▼
    ┌─────────────────┐
    │   Conv2D        │  no bias (BN follows and has its own shift)
    │   kernel: 3×3   │
    │   stride: 1     │
    │   padding: 1    │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │  BatchNorm2D    │  normalization, stabilizes training
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │  LeakyReLU      │  negative_slope = 0.1
    │                 │  f(x) = x if x>0 else 0.1x
    └────────┬────────┘
             │
             ▼
        Output feature map

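4.2 Why LeakyReLU instead of ReLU

ReLU vs LeakyReLU
═══════════════════════════════════════════════════════════════

ReLU:        f(x) = max(0, x)
             │
             │      ╱
             │     ╱
         ────┼────╱────
             │   ╱
             │  ╱ (all negative inputs map to 0, which can "kill" neurons)
             │ ╱

LeakyReLU:   f(x) = x if x>0 else 0.1x
             │
             │      ╱
             │     ╱
         ────┼────╱────
             │  ╱
             │╱   (negative inputs keep 10% of their value, so neurons do not die)
            ╱│

Advantages:
├── Avoids the dying-neuron problem
├── Keeps information from negative activations
└── Training is more stable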
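
A minimal numeric sketch of the difference, in plain Python rather than PyTorch. The function names are illustrative only. The key point is the gradient for negative inputs: zero for ReLU, `negative_slope` for LeakyReLU.

```python
def relu(x):
    """ReLU: negative inputs are clamped to zero."""
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.1):
    """LeakyReLU: negative inputs are scaled by negative_slope."""
    return x if x > 0 else negative_slope * x

def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x == 0)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, negative_slope=0.1):
    """Derivative of LeakyReLU (taken as negative_slope at x == 0)."""
    return 1.0 if x > 0 else negative_slope

for x in (-2.0, 0.5):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
```

With ReLU, a unit whose pre-activation stays negative receives no gradient at all. With LeakyReLU it still receives a small one and can recover.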

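4.3 The role of the 1×1 convolution

What the 1×1 convolution does
═══════════════════════════════════════════════════════════════

Layer 12 uses a 1×1 convolution: 512 → 256 channels

Role 1: Dimensionality reduction
        Fewer channels means less computation in the layers that follow

Role 2: Cross-channel mixing
        Each output channel is a linear combination of all input channels

Role 3: Extra non-linearity
        Together with the activation function it increases expressive power

Cost comparison (13×13 feature map, counted as multiply-accumulates):

3×3 conv 512→256:  13×13 × 512 × 256 × 3 × 3 ≈ 199M MACs
1×1 conv 512→256:  13×13 × 512 × 256 × 1 × 1 ≈ 22M MACs

The 1×1 convolution needs 9× fewer operations.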
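
The arithmetic behind this comparison can be checked with a few lines of Python. This is only a sketch: `conv_macs` is a helper defined for this example, and it counts multiply-accumulates for a bias-free convolution, ignoring BN and the activation.

```python
def conv_macs(out_h, out_w, c_in, c_out, kernel):
    """Multiply-accumulate count of a bias-free conv layer."""
    return out_h * out_w * c_in * c_out * kernel * kernel

macs_3x3 = conv_macs(13, 13, 512, 256, 3)
macs_1x1 = conv_macs(13, 13, 512, 256, 1)

print(f"3x3: {macs_3x3 / 1e6:.1f}M MACs")
print(f"1x1: {macs_1x1 / 1e6:.1f}M MACs")
print(f"ratio: {macs_3x3 / macs_1x1:.0f}x")
```

The ratio is exactly the kernel area (3 × 3 = 9), independent of the feature map size and channel counts.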

5. Neck: a Simplified FPN

5.1 The Feature Pyramid Network (FPN) idea

FPN core idea
═══════════════════════════════════════════════════════════════

Deep features:    rich semantics, weak spatial detail (knows "what", not "where")
Shallow features: rich spatial detail, weak semantics (knows "where", not "what")

Solution:
Upsample the deep features and fuse them with the shallow ones to get both.

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│    Shallow features      Deep features                       │
│    26×26×256             13×13×128                           │
│    (good localization)   (good semantics)                    │
│        │                     │                               │
│        │                     ▼                               │
│        │              ┌─────────────┐                        │
│        │              │  Upsample   │  13×13 → 26×26         │
│        │              │    ×2       │                        │
│        │              └──────┬──────┘                        │
│        │                     │                               │
│        │                     ▼                               │
│        │               26×26×128                             │
│        │                     │                               │
│        └──────────┬──────────┘                               │
│                   │                                          │
│                   ▼                                          │
│            ┌─────────────┐                                   │
│            │   Concat    │                                   │
│            └──────┬──────┘                                   │
│                   │                                          │
│                   ▼                                          │
│             26×26×384                                        │
│     (good localization and semantics)                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘

5.2 The neck in YOLOv3-tiny

YOLOv3-tiny neck structure
═══════════════════════════════════════════════════════════════

Two feature maps are taken from the backbone:
├── P4: 26×26×256 (Layer 9 output)
└── P5: 13×13×512 (Layer 13 output)

Processing flow:

P5 (13×13×512)
      │
      ▼
┌─────────────┐
│ Conv 1×1    │  512 → 256
└──────┬──────┘
       │
       ├────────────────────────┐
       │                        │
       ▼                        ▼
┌─────────────┐          ┌─────────────┐
│ Conv 3×3    │          │ Conv 1×1    │
│ 256 → 512   │          │ 256 → 128   │
└──────┬──────┘          └──────┬──────┘
       │                        │
       ▼                        ▼
  13×13×512              ┌─────────────┐
  (large objects)        │  Upsample   │
       │                 │    ×2       │
       │                 └──────┬──────┘
       │                        │
       │                   26×26×128
       │                        │
       │                        ▼
       │                 ┌─────────────┐
       │                 │   Concat    │ ← concatenated with P4 (26×26×256)
       │                 └──────┬──────┘
       │                        │
       │                   26×26×384
       │                        │
       │                        ▼
       │                 ┌─────────────┐
       │                 │ Conv 3×3    │
       │                 │ 384 → 256   │
       │                 └──────┬──────┘
       │                        │
       │                   26×26×256
       │                   (small objects)
       │                        │
       ▼                        ▼
    P5 head                  P4 head

6. The Detection Head in Detail

6.1 Head structure

Head output format
═══════════════════════════════════════════════════════════════

Each head outputs: H × W × (num_anchors × (5 + num_classes))

where:
├── H × W: spatial size of the feature map
├── num_anchors: anchors per grid cell (3 per scale in YOLOv3-tiny)
├── 5: tx, ty, tw, th, confidence
└── num_classes: number of classes (80 for COCO)

Concretely:
├── Large-object head: 13×13×(3×(5+80)) = 13×13×255
└── Small-object head: 26×26×(3×(5+80)) = 26×26×255
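
The output sizes above are easy to recompute for other class counts. The helpers below are a small sketch written for this example (the names are not taken from any library), assuming a square input divisible by the strides.

```python
def head_channels(num_classes, num_anchors=3):
    """Channels of one detection head: anchors × (4 box + 1 objectness + classes)."""
    return num_anchors * (5 + num_classes)

def total_predictions(input_size=416, strides=(16, 32), num_anchors=3):
    """Total number of candidate boxes produced over all detection scales."""
    return sum((input_size // s) ** 2 * num_anchors for s in strides)

print("COCO head channels:", head_channels(80))
print("20-class head channels:", head_channels(20))
print("candidate boxes at 416×416:", total_predictions())
```

For a custom dataset only the last 1×1 convolution of each head changes, since its output channel count depends on the number of classes.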

6.2 Meaning of the predicted values

The 5+C values predicted per anchor
═══════════════════════════════════════════════════════════════

┌────────────────────────────────────────────────────────────┐
│  tx   │  ty   │  tw   │  th   │ conf  │ c1 │ c2 │...│ c80 │
└───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴────┴────┴─┬─┴─────┘
    │       │       │       │       │               │
    │       │       │       │       │               └── class probabilities (c1..c80)
    │       │       │       │       │
    │       │       │       │       └── objectness confidence:
    │       │       │       │           whether this box contains an object
    │       │       │       │
    │       │       │       └── box height offset
    │       │       │
    │       │       └── box width offset
    │       │
    │       └── center y offset
    │
    └── center x offset

Total: 5 + 80 = 85 values (COCO)
       5 + C values (for a C-class task)

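6.3 Bounding box decoding

Decoding network outputs into box coordinates
═══════════════════════════════════════════════════════════════

Network output: tx, ty, tw, th (all relative values)
Decoded into:   bx, by, bw, bh (the actual box)

Decoding formulas:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   bx = σ(tx) + cx                                           │
│   by = σ(ty) + cy                                           │
│   bw = pw × e^tw                                            │
│   bh = ph × e^th                                            │
│                                                             │
│   where:                                                    │
│   ├── σ() is the sigmoid, mapping values to (0,1)           │
│   ├── cx, cy: top-left corner of the current cell           │
│   ├── pw, ph: preset anchor width and height                │
│   └── e^tw, e^th: scale factors                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Note on units: with cx, cy given as grid indices, bx and by come out in grid-cell units, so multiply them by the stride of the feature map to get pixel coordinates. The anchors pw, ph are specified in pixels, so bw and bh are already in pixels.

Illustration:

      Current cell
    ┌─────────┐
    │ (cx,cy) │
    │    ●────┼──→ bx = σ(tx) + cx
    │    │    │    (the center can only move within the current cell)
    │    ▼    │
    └─────────┘
         by = σ(ty) + cy

    Predicted box size:
    ┌─────────────────┐
    │                 │
    │    bw = pw×e^tw │  pw is the anchor width
    │                 │  e^tw is the scale factor
    │─────────────────│  
    │       bh        │
    │                 │
    └─────────────────┘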
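
To make the decoding formulas concrete, here is a minimal scalar sketch for a single prediction. It assumes the cell is given as integer grid indices, the anchor in pixels, and the center is converted to pixels by multiplying with the stride. `decode_box` is a name chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, anchor, stride):
    """Decode (tx, ty, tw, th) into a pixel-space box (bx, by, bw, bh).

    t:      raw network outputs (tx, ty, tw, th)
    cell:   (cx, cy) grid indices of the responsible cell
    anchor: (pw, ph) anchor size in pixels
    stride: downsampling factor of this feature map (16 or 32)
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = (sigmoid(tx) + cx) * stride   # center x in pixels
    by = (sigmoid(ty) + cy) * stride   # center y in pixels
    bw = pw * math.exp(tw)             # width in pixels
    bh = ph * math.exp(th)             # height in pixels
    return bx, by, bw, bh

# All-zero outputs: center sits in the middle of cell (6, 6), size equals the anchor.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(81, 82), stride=32)
print(box)
```

All-zero outputs place the center in the middle of the cell and reproduce the anchor size exactly, which is why a freshly initialized head starts from anchor-shaped boxes.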

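6.4 Why it is designed this way

Design rationale
═══════════════════════════════════════════════════════════════

1. Center: sigmoid + cell offset

   Why not predict absolute coordinates directly?
   ├── Absolute coordinates span a wide range (0~416) and are hard to learn
   ├── The sigmoid output lies in (0,1), which makes learning more stable
   └── The center is guaranteed to stay inside the current cell, which simplifies positive-sample assignment

2. Width and height: anchor × exp(offset)

   Why not predict width and height directly?
   ├── Object sizes vary widely (tens to hundreds of pixels)
   ├── exp can produce any positive number
   ├── The anchor acts as a prior, so the network only learns an adjustment
   └── This amounts to predicting in log space, which works for both small and large objects

3. The problem with exp

   exp(tw) can blow up numerically
   ├── If tw=10, exp(10)≈22026
   ├── YOLOv5 addresses this with (2×sigmoid(tw))² × pw, which bounds the scale factor to (0, 4)
   └── YOLOv8 is anchor-free and does not use this exp-based parameterization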
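
The two width parameterizations can be compared numerically. The sketch below contrasts the exp form used by YOLOv3 with the bounded form `(2·σ(t))²` used by YOLOv5; the function names are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scale_exp(t):
    """YOLOv3-style scale factor: unbounded above."""
    return math.exp(t)

def scale_bounded(t):
    """YOLOv5-style scale factor: always inside (0, 4)."""
    return (2.0 * sigmoid(t)) ** 2

for t in (-10.0, 0.0, 2.0, 10.0):
    print(f"t={t:6.1f}  exp: {scale_exp(t):12.4f}  bounded: {scale_bounded(t):.4f}")
```

Both forms give a factor of 1 at t = 0, so an untrained head predicts the anchor size either way. They differ only in how far a single large logit can push the box.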

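7. The Anchor Mechanism

7.1 What is an anchor

The anchor concept
═══════════════════════════════════════════════════════════════

Anchors are preset reference boxes that represent the shapes of common objects.

How they are obtained:
├── Collect the width and height of every object box in the training set
├── Run K-means clustering to find the most common sizes
├── These sizes become the anchors
└── The network predicts offsets relative to an anchor, not absolute sizes

Benefits:
├── The network only has to learn a fine adjustment instead of predicting sizes from scratch
├── Learning becomes much easier
└── Convergence speed and accuracy improve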

7.2 Anchor settings in YOLOv3-tiny

YOLOv3-tiny anchor configuration (COCO)
═══════════════════════════════════════════════════════════════

Six anchors in total, split across the two detection scales (sizes in pixels at 416×416 input):

13×13 head (large objects):
├── Anchor 4: (81, 82)
├── Anchor 5: (135, 169)
└── Anchor 6: (344, 319)

26×26 head (small objects):
├── Anchor 1: (10, 14)
├── Anchor 2: (23, 27)
└── Anchor 3: (37, 58)

Visualization (schematic, not to scale):

Small-object anchors:  Large-object anchors:
┌──┐                   ┌─────────────────┐
│10│                   │                 │
│14│                   │      344        │
└──┘                   │      319        │
  ┌────┐               │                 │
  │ 23 │               └─────────────────┘
  │ 27 │                 ┌──────────┐
  └────┘                 │   135    │
    ┌──────┐             │   169    │
    │  37  │             └──────────┘
    │  58  │               ┌──────┐
    └──────┘               │ 81   │
                           │ 82   │
                           └──────┘

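7.3 Matching anchors to detection heads

Why are the large anchors assigned to the small feature map?
═══════════════════════════════════════════════════════════════

It looks counter-intuitive, but the logic is as follows:

13×13 feature map:
├── Each cell covers a large region of the input (32×32 pixels)
├── Large receptive field, sees a wide context
├── Well suited to large objects
└── So it gets the large anchors

26×26 feature map:
├── Each cell covers a small region of the input (16×16 pixels)
├── Higher spatial resolution, finer localization
├── Well suited to small objects
└── So it gets the small anchors

Matching principle:
Anchor sizes are given in input pixels and should be close to the real object sizes. The larger the anchor, the larger the stride of the feature map it is assigned to.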
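
As a rough sketch of how a ground-truth box ends up on one of the two scales, the snippet below picks the anchor whose shape overlaps the box best, using an IoU computed on width and height only (both boxes imagined as sharing a corner). The anchor list follows this section; `wh_iou` and `best_anchor` are names invented for the example, and real training code may use a different matching rule.

```python
# Anchors in pixels (416×416 input) and the stride of the head each one belongs to
ANCHORS = [(10, 14), (23, 27), (37, 58), (81, 82), (135, 169), (344, 319)]
STRIDES = [16, 16, 16, 32, 32, 32]

def wh_iou(wh1, wh2):
    """IoU of two boxes that share the same top-left corner."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def best_anchor(gt_wh):
    """Return (anchor index, stride) of the best-matching anchor."""
    ious = [wh_iou(gt_wh, anchor) for anchor in ANCHORS]
    idx = max(range(len(ANCHORS)), key=lambda i: ious[i])
    return idx, STRIDES[idx]

print(best_anchor((20, 30)))     # a small object
print(best_anchor((150, 150)))   # a large object
```

A 20×30 box matches the (23, 27) anchor on the stride-16 head, while a 150×150 box matches (135, 169) on the stride-32 head.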

7.4 Computing custom anchors

# Compute anchors for your own dataset with K-means
# ═══════════════════════════════════════════════════════════════

import numpy as np
from sklearn.cluster import KMeans

def compute_anchors(bbox_wh, num_anchors=6):
    """
    bbox_wh: numpy array of shape (N, 2), width and height of every ground-truth box
    num_anchors: number of anchors to compute
    """
    # K-means clustering with Euclidean distance on (w, h).
    # Note: the YOLO papers cluster with a 1 - IoU distance instead,
    # so treat this as a simple approximation.
    kmeans = KMeans(n_clusters=num_anchors, n_init=10, random_state=42)
    kmeans.fit(bbox_wh)
    
    # The cluster centers are the anchors
    anchors = kmeans.cluster_centers_
    
    # Sort by area, smallest first
    anchors = anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
    
    return anchors

# Usage example
# gt_boxes: all ground-truth boxes in [x1, y1, x2, y2] format
# bbox_wh = gt_boxes[:, 2:4] - gt_boxes[:, 0:2]  # width and height
# anchors = compute_anchors(bbox_wh, num_anchors=6)
# print("Computed anchors:", anchors)

8. Loss Function

8.1 Components of the loss

YOLOv3-tiny loss function
═══════════════════════════════════════════════════════════════

Total loss = weighted sum of localization loss, objectness loss (positive and negative samples), and classification loss

┌────────────────────────────────────────────────────────────┐
│                                                            │
│  L_total = λ_coord × L_box                                 │
│          + λ_obj × L_obj                                   │
│          + λ_noobj × L_noobj                               │
│          + λ_cls × L_cls                                   │
│                                                            │
└────────────────────────────────────────────────────────────┘

Terms:
├── L_box: bounding box regression loss (MSE or GIoU)
├── L_obj: objectness loss on positive samples (BCE)
├── L_noobj: objectness loss on negative samples (BCE)
└── L_cls: classification loss (BCE, multi-label)

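8.2 Localization loss

Computing the localization loss
═══════════════════════════════════════════════════════════════

The original YOLOv3 uses a squared-error loss, where primed symbols denote the targets:

L_box = Σ[(tx - tx')² + (ty - ty')²]      # center
      + Σ[(tw - tw')² + (th - th')²]      # width and height

Problems:
├── Large and small objects are penalized unevenly
├── A small offset on a small object causes a large drop in IoU
└── MSE is not fully correlated with IoU

Improvement (IoU-based losses; GIoU is shown here, and YOLOv5/v8 use the closely related CIoU):

L_box = 1 - GIoU(pred_box, gt_box)

GIoU = IoU - (C - Union) / C

where C is the area of the smallest box enclosing both pred and gt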
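
The GIoU formula above translates almost line by line into code. The sketch below works on plain `(x1, y1, x2, y2)` tuples and assumes both boxes have positive area. `giou` is a helper written for this example, not a library function.

```python
def giou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Area C of the smallest box enclosing both
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return iou - (c_area - union) / c_area

pred = (0.0, 0.0, 1.0, 1.0)
gt = (2.0, 0.0, 3.0, 1.0)
print("GIoU:", giou(pred, gt), "loss:", 1.0 - giou(pred, gt))
```

Unlike plain IoU, which is 0 for any pair of disjoint boxes, GIoU becomes more negative the further apart they are, so the loss still provides a useful gradient.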

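8.3 Objectness loss

Objectness (confidence) loss
═══════════════════════════════════════════════════════════════

Binary cross-entropy (BCE):

L_obj = -Σ[y × log(p) + (1-y) × log(1-p)]

Positive samples (object present):
├── y = 1 (or y = IoU as a soft label)
├── The network should predict p close to 1
└── Computed only for anchors matched to a ground-truth box

Negative samples (no object):
├── y = 0
├── The network should predict p close to 0
└── Positive and negative samples need to be balanced

Class imbalance between positives and negatives:
├── Positives are far fewer than negatives
├── Remedy: λ_noobj < λ_obj
└── Or use Focal Loss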
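
A minimal sketch of the weighted objectness loss described above, in plain Python. The weights `lambda_obj` and `lambda_noobj` are illustrative values, and the clamping constant `eps` is an assumption added here to keep `log` finite.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def objectness_loss(probs, labels, lambda_obj=1.0, lambda_noobj=0.5):
    """Sum of BCE terms, with negatives down-weighted by lambda_noobj."""
    total = 0.0
    for p, y in zip(probs, labels):
        weight = lambda_obj if y > 0 else lambda_noobj
        total += weight * bce(p, y)
    return total

# One positive and three negatives, all predicted at p = 0.5
loss = objectness_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0])
print(loss)
```

Down-weighting the negatives keeps the many empty cells from dominating the gradient of the few cells that actually contain an object.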

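8.4 Classification loss

Classification loss
═══════════════════════════════════════════════════════════════

YOLOv3 uses multi-label classification (not softmax):

L_cls = -Σ[y_c × log(p_c) + (1-y_c) × log(1-p_c)]

A separate binary cross-entropy is computed for every class.

Why not softmax?
├── Softmax assumes the classes are mutually exclusive
├── Some datasets have overlapping labels (e.g. person and woman)
├── BCE is more flexible
└── For strictly single-label datasets, softmax may still converge faster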
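
The practical difference shows up when two overlapping labels both apply. In the sketch below the logits are made-up values for three hypothetical classes (person, woman, car): independent sigmoids can score both person and woman highly, while softmax forces them to share probability mass.

```python
import math

def sigmoid_scores(logits):
    """Independent per-class probabilities (multi-label)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax_scores(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 3.0, -3.0]   # hypothetical logits: person, woman, car
sig = sigmoid_scores(logits)
soft = softmax_scores(logits)
print("sigmoid:", [round(p, 3) for p in sig])
print("softmax:", [round(p, 3) for p in soft])
```

With sigmoid outputs both overlapping classes can pass a typical score threshold at the same time, which softmax rules out by construction.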

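8.5 Positive and negative sample assignment

Sample assignment strategy
═══════════════════════════════════════════════════════════════

For every ground-truth box:

1. Find the cell that contains the gt center
   ├── That cell is responsible for predicting this gt
   └── The other cells are negatives

2. Compute the IoU between the gt and every anchor
   ├── Take the anchor with the highest IoU as the positive sample
   └── Or treat every anchor above an IoU threshold as positive

3. Ignored samples
   ├── Overlap with a gt is high, but the sample is not the best match
   ├── Counted neither as positive nor as negative
   └── Avoids penalizing borderline cases

In YOLOv3-tiny:
├── Each gt is matched to exactly one best anchor
├── Other predictions overlapping a gt above the ignore threshold are ignored
├── Predictions below the ignore threshold are negatives
└── The YOLOv3 paper uses 0.5 as the threshold, while the Darknet configuration files typically set ignore_thresh = 0.7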
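
Once a ground-truth box has been assigned to a cell and an anchor, the regression targets are the inverse of the decoding formulas from section 6.3. The sketch below assumes the gt box is given as `(cx, cy, w, h)` in pixels; `encode_target` is a name chosen for this example.

```python
import math

def encode_target(gt_box, anchor, stride):
    """Return the responsible cell and the regression targets for one gt box.

    gt_box: (cx, cy, w, h) in pixels
    anchor: (pw, ph) in pixels, the anchor matched to this gt
    stride: stride of the feature map the anchor belongs to
    """
    bx, by, bw, bh = gt_box
    gx, gy = bx / stride, by / stride      # center in grid units
    col, row = int(gx), int(gy)            # cell that contains the center
    target_x = gx - col                    # target for sigmoid(tx), in [0, 1)
    target_y = gy - row                    # target for sigmoid(ty), in [0, 1)
    target_w = math.log(bw / anchor[0])    # target for tw
    target_h = math.log(bh / anchor[1])    # target for th
    return (col, row), (target_x, target_y, target_w, target_h)

cell, target = encode_target((208.0, 208.0, 81.0, 82.0), anchor=(81, 82), stride=32)
print(cell, target)
```

A box centered at (208, 208) falls into cell (6, 6) of the 13×13 grid, and because its size equals the anchor the width and height targets are both zero.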

9. Comparison with Other YOLO Versions

9.1 YOLOv3 vs YOLOv3-tiny

YOLOv3 vs YOLOv3-tiny
═══════════════════════════════════════════════════════════════

                    YOLOv3              YOLOv3-tiny
────────────────────────────────────────────────────────────────
Backbone            Darknet-53          Darknet-tiny (7 conv)
Backbone depth      53 layers           7 layers
Detection scales    3                   2
                    (52×52, 26×26,      (26×26, 13×13)
                     13×13)
Anchors             9                   6
Parameters          61.5M               8.7M
FLOPs               65.9G               5.6G
COCO mAP            33.0                16.6
Inference speed     ~30 FPS             ~220 FPS
────────────────────────────────────────────────────────────────

Trade-offs made by YOLOv3-tiny:
├── Drops the highest-resolution (52×52) detection scale → weaker small-object detection
├── Much shallower backbone → weaker feature representation
├── Fewer channels → further reduction in compute
└── In exchange, roughly 7× higher speed

9.2 YOLOv3-tiny vs YOLOv5n

YOLOv3-tiny vs YOLOv5-nano
═══════════════════════════════════════════════════════════════

                    YOLOv3-tiny         YOLOv5n
────────────────────────────────────────────────────────────────
Released            2018                2020 (nano variant: 2021)
Backbone            Darknet-tiny        CSPDarknet-nano
Activation          LeakyReLU           SiLU
Neck                Simplified FPN      PANet
Detection scales    2                   3
Parameters          8.7M                1.9M
FLOPs               5.6G                4.5G
COCO mAP            16.6                28.0
────────────────────────────────────────────────────────────────

Improvements in YOLOv5n:
├── CSP structure: less computation at similar expressive power
├── PANet: bidirectional feature fusion
├── SiLU activation: generally works better than LeakyReLU
├── Mosaic data augmentation
├── Automatic anchor computation
└── Better training recipe

10. Code Implementation

10.1 Complete PyTorch implementation

import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Standard convolution block: Conv + BN + LeakyReLU"""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, 
                              stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.leaky = nn.LeakyReLU(0.1, inplace=True)
    
    def forward(self, x):
        return self.leaky(self.bn(self.conv(x)))


class YOLOv3TinyBackbone(nn.Module):
    """YOLOv3-tiny Backbone"""
    def __init__(self):
        super().__init__()
        
        # Layer 1-2
        self.conv1 = ConvBNLeaky(3, 16, 3, 1, 1)
        self.pool1 = nn.MaxPool2d(2, 2)
        
        # Layer 3-4
        self.conv2 = ConvBNLeaky(16, 32, 3, 1, 1)
        self.pool2 = nn.MaxPool2d(2, 2)
        
        # Layer 5-6
        self.conv3 = ConvBNLeaky(32, 64, 3, 1, 1)
        self.pool3 = nn.MaxPool2d(2, 2)
        
        # Layer 7-8
        self.conv4 = ConvBNLeaky(64, 128, 3, 1, 1)
        self.pool4 = nn.MaxPool2d(2, 2)
        
        # Layer 9-10
        self.conv5 = ConvBNLeaky(128, 256, 3, 1, 1)
        self.pool5 = nn.MaxPool2d(2, 2)
        
        # Layer 11-13
        self.conv6 = ConvBNLeaky(256, 512, 3, 1, 1)
        self.conv7 = ConvBNLeaky(512, 256, 1, 1, 0)
        self.conv8 = ConvBNLeaky(256, 512, 3, 1, 1)
    
    def forward(self, x):
        x = self.pool1(self.conv1(x))    # 208×208×16
        x = self.pool2(self.conv2(x))    # 104×104×32
        x = self.pool3(self.conv3(x))    # 52×52×64
        x = self.pool4(self.conv4(x))    # 26×26×128
        
        p4 = self.conv5(x)               # 26×26×256 (P4 route)
        
        x = self.pool5(p4)               # 13×13×256
        x = self.conv6(x)                # 13×13×512
        x = self.conv7(x)                # 13×13×256
        p5 = self.conv8(x)               # 13×13×512 (P5)
        
        return p4, p5


class YOLOv3TinyNeck(nn.Module):
    """YOLOv3-tiny neck (simplified FPN)"""
    def __init__(self):
        super().__init__()
        
        # P5 branch: 1×1 channel reduction, 512 → 256
        self.conv_p5 = ConvBNLeaky(512, 256, 1, 1, 0)
        
        # Upsampling branch: 1×1 reduction to 128 channels, then 2× nearest upsampling
        self.conv_up = ConvBNLeaky(256, 128, 1, 1, 0)
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
        
        # P4 branch after fusion: 256 (P4) + 128 (upsampled P5) = 384 input channels
        self.conv_p4 = ConvBNLeaky(256 + 128, 256, 3, 1, 1)
    
    def forward(self, p4, p5):
        # P5 branch
        p5_out = self.conv_p5(p5)                      # 13×13×256
        
        # Upsample and fuse with P4
        p5_up = self.upsample(self.conv_up(p5_out))    # 26×26×128
        p4_cat = torch.cat([p4, p5_up], dim=1)         # 26×26×384
        p4_out = self.conv_p4(p4_cat)                  # 26×26×256
        
        return p4_out, p5_out


class YOLOv3TinyHead(nn.Module):
    """YOLOv3-tiny detection head"""
    def __init__(self, num_classes, num_anchors=3):
        super().__init__()
        self.num_classes = num_classes
        self.num_anchors = num_anchors
        self.output_channels = num_anchors * (5 + num_classes)
        
        # P4 head (26×26): the final 1×1 conv is linear, with bias and no BN
        self.head_p4 = nn.Sequential(
            ConvBNLeaky(256, 512, 3, 1, 1),
            nn.Conv2d(512, self.output_channels, 1, 1, 0)
        )
        
        # P5 head (13×13)
        self.head_p5 = nn.Sequential(
            ConvBNLeaky(256, 512, 3, 1, 1),
            nn.Conv2d(512, self.output_channels, 1, 1, 0)
        )
    
    def forward(self, p4, p5):
        out_p4 = self.head_p4(p4)  # 26×26×(3×(5+C))
        out_p5 = self.head_p5(p5)  # 13×13×(3×(5+C))
        return out_p4, out_p5


class YOLOv3Tiny(nn.Module):
    """Complete YOLOv3-tiny model"""
    def __init__(self, num_classes=80):
        super().__init__()
        self.num_classes = num_classes
        
        self.backbone = YOLOv3TinyBackbone()
        self.neck = YOLOv3TinyNeck()
        self.head = YOLOv3TinyHead(num_classes)
        
        # Anchor configuration (in pixels, for a 416×416 input)
        self.anchors = {
            'p4': [(10, 14), (23, 27), (37, 58)],      # small objects
            'p5': [(81, 82), (135, 169), (344, 319)]   # large objects
        }
        self.strides = {'p4': 16, 'p5': 32}
    
    def forward(self, x):
        # Backbone
        p4, p5 = self.backbone(x)
        
        # Neck
        p4, p5 = self.neck(p4, p5)
        
        # Head
        out_p4, out_p5 = self.head(p4, p5)
        
        return out_p4, out_p5


# Quick test of the model
if __name__ == "__main__":
    model = YOLOv3Tiny(num_classes=80)
    x = torch.randn(1, 3, 416, 416)
    out_p4, out_p5 = model(x)
    
    print(f"Input shape: {x.shape}")
    print(f"P4 output shape: {out_p4.shape}")  # [1, 255, 26, 26]
    print(f"P5 output shape: {out_p5.shape}")  # [1, 255, 13, 13]
    
    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Total parameters: {total_params / 1e6:.2f}M")

10.2 Bounding box decoding

def decode_predictions(output, anchors, stride, num_classes, conf_thresh=0.5):
    """
    Decode raw network output into bounding boxes.
    
    Args:
        output: network output [B, num_anchors*(5+C), H, W]
        anchors: anchors of this scale in pixels [(w1,h1), (w2,h2), (w3,h3)]
        stride: downsampling factor of this scale
        num_classes: number of classes
        conf_thresh: score threshold
    
    Returns:
        boxes: [N, 6] - x1, y1, x2, y2, score, class_id
               (boxes of all images in the batch are flattened together,
                so call this per image when B > 1)
    """
    batch_size, _, h, w = output.shape
    num_anchors = len(anchors)
    
    # Reshape to [B, num_anchors, 5+C, H, W]
    output = output.view(batch_size, num_anchors, 5 + num_classes, h, w)
    output = output.permute(0, 1, 3, 4, 2).contiguous()  # [B, A, H, W, 5+C]
    
    # Build the cell-index grid
    grid_y, grid_x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid_x = grid_x.float().to(output.device)
    grid_y = grid_y.float().to(output.device)
    
    # Split the predictions
    tx = output[..., 0]
    ty = output[..., 1]
    tw = output[..., 2]
    th = output[..., 3]
    conf = output[..., 4]
    cls_pred = output[..., 5:]
    
    # Decode the center (grid units → pixels)
    bx = (torch.sigmoid(tx) + grid_x) * stride
    by = (torch.sigmoid(ty) + grid_y) * stride
    
    # Decode width and height (anchors are already in pixels)
    anchor_w = torch.tensor([a[0] for a in anchors]).view(1, -1, 1, 1).to(output.device)
    anchor_h = torch.tensor([a[1] for a in anchors]).view(1, -1, 1, 1).to(output.device)
    bw = anchor_w * torch.exp(tw)
    bh = anchor_h * torch.exp(th)
    
    # Convert to x1, y1, x2, y2
    x1 = bx - bw / 2
    y1 = by - bh / 2
    x2 = bx + bw / 2
    y2 = by + bh / 2
    
    # Objectness and class probabilities
    confidence = torch.sigmoid(conf)
    class_prob = torch.sigmoid(cls_pred)
    class_score, class_id = class_prob.max(dim=-1)
    
    # Final score = objectness × best class probability
    final_score = confidence * class_score
    
    # Drop low-scoring boxes
    mask = final_score > conf_thresh
    
    # Collect the results
    boxes = torch.stack([x1, y1, x2, y2, final_score, class_id.float()], dim=-1)
    boxes = boxes[mask]
    
    return boxes

10.3 NMS post-processing

def non_max_suppression(boxes, iou_thresh=0.5):
    """
    Class-wise non-maximum suppression.
    
    Args:
        boxes: [N, 6] - x1, y1, x2, y2, score, class_id
        iou_thresh: IoU threshold
    
    Returns:
        keep_boxes: the boxes that survive, [M, 6]
    """
    if len(boxes) == 0:
        return boxes
    
    # Process each class separately
    unique_classes = boxes[:, 5].unique()
    keep_boxes = []
    
    for cls in unique_classes:
        cls_mask = boxes[:, 5] == cls
        cls_boxes = boxes[cls_mask]
        
        # Sort by score, highest first
        scores = cls_boxes[:, 4]
        order = scores.argsort(descending=True)
        cls_boxes = cls_boxes[order]
        
        keep = []
        while len(cls_boxes) > 0:
            # Keep the highest-scoring box
            keep.append(cls_boxes[0])
            
            if len(cls_boxes) == 1:
                break
            
            # IoU of the kept box with all remaining boxes
            ious = box_iou(cls_boxes[0:1, :4], cls_boxes[1:, :4])
            
            # Drop boxes that overlap the kept box too much
            mask = ious[0] < iou_thresh
            cls_boxes = cls_boxes[1:][mask]
        
        keep_boxes.extend(keep)
    
    if keep_boxes:
        return torch.stack(keep_boxes)
    # Keep the [0, 6] shape so callers can index columns safely
    return boxes.new_zeros((0, 6))


def box_iou(box1, box2):
    """Pairwise IoU between box1 [N, 4] and box2 [M, 4], returns [N, M]."""
    # Intersection
    x1 = torch.max(box1[:, 0:1], box2[:, 0])
    y1 = torch.max(box1[:, 1:2], box2[:, 1])
    x2 = torch.min(box1[:, 2:3], box2[:, 2])
    y2 = torch.min(box1[:, 3:4], box2[:, 3])
    
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    
    # Areas
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    
    # IoU (small epsilon avoids division by zero)
    union = area1[:, None] + area2 - inter
    iou = inter / (union + 1e-6)
    
    return iou

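11. Training Tips

11.1 Data augmentation

Common data augmentation methods
═══════════════════════════════════════════════════════════════

1. Geometric transforms
   ├── Random scaling (0.5~1.5×)
   ├── Random cropping
   ├── Random horizontal flip
   └── Random rotation (small angles)

2. Color transforms
   ├── Hue ±0.1
   ├── Saturation 0.5~1.5
   ├── Value/Brightness 0.5~1.5
   └── Contrast adjustment

3. Mosaic (introduced with YOLOv4)
   ├── Stitches 4 images into 1
   ├── Increases the number of small-object samples
   └── Reduces the dependence on large batch sizes

4. Mixup
   ├── Blends two images
   ├── Labels are blended in the same proportion
   └── Acts as a regularizer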

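11.2 Training strategy

Reference training hyperparameters
═══════════════════════════════════════════════════════════════

Learning rate:
├── Initial: 0.001 (with pretrained weights) or 0.01 (from scratch)
├── Schedule: cosine annealing or step decay
├── Warmup: linear increase over the first 1-3 epochs
└── Final: 1/100 of the initial value

Optimizer:
├── SGD + Momentum(0.937) + Weight Decay(0.0005)
└── Or AdamW

Batch size:
├── Adjust to the available GPU memory
├── 16~64 recommended
└── Use gradient accumulation for small batches

Epochs:
├── Small datasets: 100~300 epochs
├── Large datasets: 300+ epochs
└── Decide by watching the validation mAP

Loss weights:
├── λ_box = 0.05
├── λ_obj = 1.0 (or adjusted per scale)
├── λ_noobj = 0.5
└── λ_cls = 0.5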
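
The learning-rate recipe above (linear warmup, cosine annealing, final value at 1/100 of the initial one) can be sketched as a plain function of the step index. The function name and its parameters are illustrative. In a real PyTorch training loop the same curve would typically be wired in through a scheduler rather than computed by hand.

```python
import math

def learning_rate(step, total_steps, warmup_steps, base_lr=0.01, final_ratio=0.01):
    """Linear warmup followed by cosine annealing down to base_lr * final_ratio."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    final_lr = base_lr * final_ratio
    return final_lr + (base_lr - final_lr) * cosine

for step in (0, 99, 100, 550, 1000):
    print(step, round(learning_rate(step, total_steps=1000, warmup_steps=100), 6))
```

The warmup avoids large, noisy updates while the BatchNorm statistics and the randomly initialized heads are still settling.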

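11.3 Troubleshooting

Common problems and remedies
═══════════════════════════════════════════════════════════════

1. Loss does not decrease
   ├── Check the learning rate (too high or too low)
   ├── Check data loading (are the labels correct?)
   ├── Check the anchors (do they match the dataset?)
   └── Check the gradients (any NaN?)

2. High mAP50 but low mAP50-95
   ├── Box localization is imprecise
   ├── Check the anchor sizes
   ├── Check the decoding formulas
   └── Check the stride settings

3. Poor small-object detection
   ├── Increase the input resolution
   ├── Add smaller anchors
   ├── Use Mosaic augmentation
   └── Add a 52×52 detection head

4. Poor large-object detection
   ├── Check the large anchor settings
   ├── Larger anchors may be needed
   └── Check that the 13×13 head works correctly

5. Overfitting
   ├── Add more data augmentation
   ├── Add dropout (before the detection heads)
   ├── Reduce model capacity
   └── Use early stopping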

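12. Summary

YOLOv3-tiny key takeaways
═══════════════════════════════════════════════════════════════

Architecture:
├── Backbone: 8 conv + 5 pooling layers as listed here (7 conv + 6 pooling in the reference Darknet cfg), 32× total downsampling
├── Neck: simplified FPN with one-way upsampling fusion
├── Head: 2 detection scales (13×13, 26×26)
└── Output: 3 anchors per cell, each predicting (5+C) values

Key mechanisms:
├── Anchors: preset reference boxes that make learning easier
├── Multi-scale detection: large and small objects handled separately
├── FPN: fuses deep semantic and shallow spatial information
└── Box decoding: sigmoid + cell offset for the center, anchor × exp for the size

Strengths:
├── Fast: 220+ FPS
├── Small: 8.7M parameters
├── Easy to deploy: simple structure, easy to convert
└── Well suited to edge devices

Limitations:
├── Limited accuracy: COCO mAP of only 16.6%
├── Weak on small objects: only 2 detection scales
├── Weak feature representation: the backbone is very shallow
└── Surpassed by newer models such as YOLOv5n

Where it fits:
├── Strict real-time requirements
├── Limited compute
├── Relatively large objects and simple classes
└── Edge device deployment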