YOLOv6 End to End: A Breakthrough in Industrial-Grade Real-Time Object Detection

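Introduction

YOLOv6 is an industrial-grade real-time object detector released in 2022 by the Vision AI Department at Meituan. It targets the balance between detection accuracy and inference efficiency: compared with earlier YOLO models, it raises inference speed while keeping accuracy high, which makes it well suited to industrial deployment.

Why YOLOv6?

Conventional object detectors face three core challenges:

Accuracy versus speed: high-accuracy models tend to be slow at inference and struggle to meet real-time requirements
Poor hardware fit: architectures designed without regard to hardware characteristics cannot fully use the available GPU/CPU compute
High deployment complexity: complicated model structures are hard to optimize and deploy

YOLOv6 addresses these with hardware-aware network design plus several algorithmic changes, improving both accuracy and speed.

YOLOv6 Overview

Model Family

YOLOv6 ships in several sizes to cover different application scenarios:

Model	Params	FLOPs	Typical use
YOLOv6-N	~4.3M	~11.1G	Mobile and edge devices
YOLOv6-T	~15.0M	~37.3G	Lightweight applications
YOLOv6-S	~17.2M	~44.2G	Balanced accuracy and speed
YOLOv6-M	~34.3M	~82.2G	High-accuracy applications
YOLOv6-L	~59.6M	~144.0G	Maximum accuracy

Key Innovations

Re-parameterizable architecture: built on RepVGG; multiple branches at training time improve accuracy, and they are fused into a single branch at inference time for speed
Efficient decoupled head: a slimmed-down version of the YOLOX decoupled head with lower latency
Anchor-free detection: no preset anchor boxes, which simplifies training
Dynamic label assignment: TaskAlignedAssigner gives better positive/negative sample matching

Overall Architecture

Architecture Overview

YOLOv6 follows the classic Backbone-Neck-Head layout, with changes in every component: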

┌─────────────────────────────────────────────────────────────────┐
│                     Input image (640×640×3)                     │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Backbone (feature extraction)                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  N/T/S models: EfficientRep                               │  │
│  │  M/L models: CSPBep                                       │  │
│  │                                                           │  │
│  │  Stem Layer → Stage1 → Stage2 → Stage3 → Stage4           │  │
│  │     ↓          ↓         ↓         ↓         ↓            │  │
│  │   P1/2       P1/4      P1/8     P1/16     P1/32           │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Neck (feature fusion)                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  N/T/S models: Rep-PAN                                    │  │
│  │  M/L models: CSPRepPAFPN                                  │  │
│  │                                                           │  │
│  │  Top-Down Path:  P4 ← P5, P3 ← P4                         │  │
│  │  Bottom-Up Path: P3 → P4, P4 → P5                         │  │
│  │                                                           │  │
│  │  Outputs: P3 (80×80), P4 (40×40), P5 (20×20)              │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Head (detection head)                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Efficient Decoupled Head                                 │  │
│  │                                                           │  │
│  │  ┌──────────────┐      ┌──────────────┐                   │  │
│  │  │ Cls branch   │      │ Reg branch   │                   │  │
│  │  │  Conv 3×3    │      │  Conv 3×3    │                   │  │
│  │  │  Conv 1×1    │      │  Conv 1×1    │                   │  │
│  │  │ Out:(B,C,H,W)│      │ Out:(B,4,H,W)│                   │  │
│  │  └──────────────┘      └──────────────┘                   │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
                   Detections (BBox + Class)

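Architecture Comparison: N/T/S vs M/L

Models of different sizes use different architectures.

N/T/S models:

Backbone: EfficientRep (lightweight, built from RepVGG blocks)
Neck: Rep-PAN (simplified feature pyramid)
Trait: fewer parameters, faster inference

M/L models:

Backbone: CSPBep (CSP-based)
Neck: CSPRepPAFPN (full feature pyramid)
Trait: more parameters, higher accuracy

Core Components in Detail

1. Backbone Design

1.1 RepVGG Block: The Core of Re-parameterization

The RepVGG block is the central building block that YOLOv6 adopts. It resolves the tension that multi-branch structures train to higher accuracy but run slower at inference.

Multi-branch structure at training time: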

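Input
  ├─ 3×3 Conv → BN ──┐
  ├─ 1×1 Conv → BN ──┼─→ Add → ReLU → Output
  └─ Identity → BN ──┘

The identity branch is present only when the block keeps the channel count and uses stride 1.

Single-branch structure at inference time:

Input → 3×3 Conv (with bias) → ReLU → Output

Re-parameterization process:

RepVGG uses structural re-parameterization to fuse the training-time multi-branch structure into a single inference-time branch:

BN fusion: fold each branch's BN parameters into that branch's convolution weight and bias
1×1 conv fusion: zero-pad the 1×1 kernel to 3×3
Identity fusion: express the identity branch as a 1×1 convolution with an identity kernel, then zero-pad it to 3×3
Branch merging: sum the weights and biases of the resulting 3×3 convolutions

In formulas:

Training time:

y = ReLU(BN(Conv3x3(x)) + BN(Conv1x1(x)) + BN(x))

Inference time (after fusion):

y = ReLU(Conv3x3_fused(x))

where the weight and bias of Conv3x3_fused are computed by the re-parameterization steps above.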
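The fusion steps can be checked numerically. The following is a minimal single-channel sketch in plain Python, not YOLOv6's actual implementation: all kernel values and BN statistics are made-up toy numbers, and the helper names (`conv3x3`, `fold_bn`, `pad_1x1`) are hypothetical. It compares the three-branch training-time output with the output of the single fused 3×3 convolution.

```python
def conv3x3(x, kernel, bias=0.0):
    """Single-channel 3x3 cross-correlation, stride 1, zero padding 1."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = bias
            for di in range(3):
                for dj in range(3):
                    r, c = i + di - 1, j + dj - 1
                    if 0 <= r < h and 0 <= c < w:
                        acc += kernel[di][dj] * x[r][c]
            row.append(acc)
        out.append(row)
    return out


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN for one channel, using fixed running statistics."""
    scale = gamma / (var + eps) ** 0.5
    return [[(v - mean) * scale + beta for v in row] for row in y]


def fold_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding bias-free conv; returns (kernel, bias)."""
    scale = gamma / (var + eps) ** 0.5
    return [[v * scale for v in row] for row in kernel], beta - mean * scale


def pad_1x1(weight):
    """Place a 1x1 weight at the center of an otherwise zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, weight, 0.0], [0.0, 0.0, 0.0]]


def relu(y):
    return [[max(v, 0.0) for v in row] for row in y]


# Toy parameters (assumed): one kernel plus (gamma, beta, mean, var) per branch.
k3 = [[0.2, -0.1, 0.0], [0.5, 1.0, -0.3], [0.1, 0.0, 0.4]]
branches = [
    (k3, (1.2, 0.1, 0.3, 0.9)),             # 3x3 conv + BN
    (pad_1x1(0.7), (0.8, -0.2, 0.1, 1.5)),  # 1x1 conv + BN (written as padded 3x3)
    (pad_1x1(1.0), (1.1, 0.05, 0.2, 0.6)),  # identity + BN (identity as 1x1 conv)
]
x = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.5, -1.0, 2.0],
     [2.5, 0.3, -0.7, 1.0], [-1.2, 0.8, 0.0, 0.4]]

# Training-time forward: each branch has its own BN, then the outputs are summed.
branch_outs = [batch_norm(conv3x3(x, k), *bn) for k, bn in branches]
y_train = relu([[sum(vals) for vals in zip(*rows)] for rows in zip(*branch_outs)])

# Inference-time forward: one fused 3x3 kernel and one bias.
folded = [fold_bn(k, *bn) for k, bn in branches]
fused_kernel = [[sum(f[0][i][j] for f in folded) for j in range(3)] for i in range(3)]
fused_bias = sum(f[1] for f in folded)
y_fused = relu(conv3x3(x, fused_kernel, fused_bias))

max_diff = max(abs(a - b) for ra, rb in zip(y_train, y_fused) for a, b in zip(ra, rb))
print(f"max difference: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding because convolution and inference-mode BN are both linear, so the sum of the branches can be collapsed into one kernel and one bias.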

1.2 EfficientRep Backbone (N/T/S models)

EfficientRep is a lightweight backbone designed for the small models. The channel counts below are base widths, before the per-model width multiplier is applied:

Input (640×640×3)
    │
    ▼
Stem Layer (RepVGG Block, stride=2)
    │ Output: 320×320×64
    ▼
Stage 1
    ├─ RepVGG Block (stride=2) → 160×160×128
    └─ RepStageBlock (n=1) → 160×160×128
    │
    ▼
Stage 2
    ├─ RepVGG Block (stride=2) → 80×80×256
    └─ RepStageBlock (n=2) → 80×80×256
    │
    ▼
Stage 3
    ├─ RepVGG Block (stride=2) → 40×40×512
    └─ RepStageBlock (n=4) → 40×40×512
    │
    ▼
Stage 4
    ├─ RepVGG Block (stride=2) → 20×20×1024
    ├─ RepStageBlock (n=3) → 20×20×1024
    └─ SPPF → 20×20×1024
    │
    ▼
Output feature maps: P3 (80×80), P4 (40×40), P5 (20×20)

RepStageBlock structure:

Input
  │
  ├─ RepVGG Block
  ├─ RepVGG Block
  ├─ ...
  └─ RepVGG Block (n blocks in total)
  │
Output

1.3 CSPBep Backbone (M/L models)

CSPBep is the backbone designed for the larger models. It uses a CSP (Cross Stage Partial) structure:

Input (640×640×3)
    │
    ▼
Stem Layer
    │
    ▼
Stage 1 → Stage 2 → Stage 3 → Stage 4
    │        │        │        │
    ▼        ▼        ▼        ▼
  P1/4    P1/8    P1/16    P1/32

BepC3StageBlock structure:

Input
  │
  ├─ 1×1 Conv → RepVGG Block × n   (main branch)
  ├─ 1×1 Conv                      (shortcut branch)
  │
  ├─ Concat (main, shortcut)
  └─ 1×1 Conv (output)
  │
Output

Advantages of the CSP structure:

Less computation: only one of the two branches passes through the stacked blocks
Stronger feature fusion: the concat merges features from the two paths
Better gradient flow: the shortcut branch helps gradients propagate backward

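1.4 SPPF Module

SPPF (Spatial Pyramid Pooling - Fast) is the faster variant of the SPP module, introduced in YOLOv5:

Input feature map
    │
    ▼
1×1 Conv → x
    │
    ├─ p1 = MaxPool 5×5 (x)    (stride=1, padding=2)
    ├─ p2 = MaxPool 5×5 (p1)   (chained)
    └─ p3 = MaxPool 5×5 (p2)   (chained)
    │
    ▼
Concat (x, p1, p2, p3)
    │
    ▼
1×1 Conv
    │
    ▼
Output feature map

SPP applies 5×5, 9×9 and 13×13 max pooling in parallel. SPPF reaches the same receptive fields by chaining three 5×5 poolings (two chained 5×5 poolings cover 9×9, three cover 13×13), which is cheaper to compute.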
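The equivalence between chained small pools and one large pool can be verified directly. The sketch below is plain Python on a random single-channel grid and is only an illustration: `maxpool_same` is a hypothetical helper, and clipping the window at the border is assumed to stand in for the implicit negative-infinity padding of a framework max-pool.

```python
import random


def maxpool_same(x, k):
    """k x k max pooling with stride 1 and output size equal to input size.

    The window is clipped at the borders, which gives the same result as
    padding the input with -inf.
    """
    h, w, r = len(x), len(x[0]), k // 2
    return [
        [
            max(
                x[a][b]
                for a in range(max(i - r, 0), min(i + r + 1, h))
                for b in range(max(j - r, 0), min(j + r + 1, w))
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


random.seed(0)
feat = [[random.random() for _ in range(20)] for _ in range(20)]

# SPPF-style chain of three 5x5 poolings
p1 = maxpool_same(feat, 5)
p2 = maxpool_same(p1, 5)
p3 = maxpool_same(p2, 5)

# SPP-style parallel poolings with larger kernels
same_as_9 = p2 == maxpool_same(feat, 9)
same_as_13 = p3 == maxpool_same(feat, 13)
print(same_as_9, same_as_13)
```

Each chained 5×5 pooling touches 25 values per output position, so three of them cost far less than one 9×9 plus one 13×13 pooling (81 and 169 values respectively) while producing identical maps.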

2. Neck Design

2.1 Rep-PAN (N/T/S models)

Rep-PAN is a PANet variant in which the plain convolution blocks are replaced with RepVGG blocks:

P5 (20×20×1024) ──────────────┐
                                │
                                ▼
                          Top-Down Path
                                │
                                ▼
P4 (40×40×512) ←───────────────┘
    │
    ├─ Upsample (2×)
    │
    ▼
                          Top-Down Path
                                │
                                ▼
P3 (80×80×256) ←───────────────┘
    │
    ├─ Downsample (stride=2)
    │
    ▼
                          Bottom-Up Path
                                │
                                ▼
P4 (40×40×512) ────────────────┘
    │
    ├─ Downsample (stride=2)
    │
    ▼
                          Bottom-Up Path
                                │
                                ▼
P5 (20×20×1024) ───────────────┘

Top-Down Path in detail:

High-level feature (P5)
    │
    ├─ RepVGG Block
    ├─ Upsample (2×)
    │
    ▼
Low-level feature (P4)
    │
    ├─ Concat
    │
    ▼
Fused feature
    │
    ├─ RepStageBlock
    │
    ▼
Output (P4)

Bottom-Up Path in detail:

Low-level feature (P3)
    │
    ├─ RepVGG Block
    ├─ Downsample (stride=2, 3×3 Conv)
    │
    ▼
High-level feature (P4)
    │
    ├─ Concat
    │
    ▼
Fused feature
    │
    ├─ RepStageBlock
    │
    ▼
Output (P4)

2.2 CSPRepPAFPN (M/L models)

CSPRepPAFPN is the full feature pyramid network. It uses CSP-style blocks to strengthen feature fusion:

P5 ──→ CSPRepBlock ──→ Upsample ──┐
                                  │
                                  ▼
                            Concat + CSPRepBlock
                                  │
                                  ▼
P4 ──→ CSPRepBlock ──→ Upsample ──┘
                                  │
                                  ▼
                            Concat + CSPRepBlock
                                  │
                                  ▼
P3 ──→ CSPRepBlock ──→ Downsample ──┐
                                     │
                                     ▼
                               Concat + CSPRepBlock
                                     │
                                     ▼
P4 ──→ CSPRepBlock ──→ Downsample ──┘
                                     │
                                     ▼
                               Concat + CSPRepBlock
                                     │
                                     ▼
P5 ──→ CSPRepBlock ────────────────┘

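3. Head Design

3.1 Efficient Decoupled Head

YOLOv6 uses a decoupled detection head that separates the classification and regression tasks:

Input feature map (P3/P4/P5)
    │
    ├─ Stem Conv 1×1 (shared)
    │
    ├─ Cls branch               ├─ Reg branch
    │   ├─ Conv 3×3             │   ├─ Conv 3×3
    │   ├─ Conv 1×1             │   ├─ Conv 1×1
    │   └─ Output: (B,C,H,W)    │   └─ Output: (B,4,H,W)
    │                           │
    └───────────────────────────┘

Comparison with the YOLOX decoupled head:

YOLOX decoupled head:

Input
  ├─ Stem Conv 1×1 (shared)
  │
  ├─ Cls branch            ├─ Reg branch
  │   ├─ Conv 3×3          │   ├─ Conv 3×3
  │   ├─ Conv 3×3          │   ├─ Conv 3×3
  │   ├─ Conv 1×1          │   ├─ Conv 1×1
  │   └─ Output            │   └─ Output

YOLOv6 Efficient Decoupled Head:

Input
  ├─ Stem Conv 1×1 (shared)
  │
  ├─ Cls branch            ├─ Reg branch
  │   ├─ Conv 3×3          │   ├─ Conv 3×3
  │   ├─ Conv 1×1          │   ├─ Conv 1×1
  │   └─ Output            │   └─ Output

Optimizations:

One fewer 3×3 convolution in each branch, which lowers computation
A hybrid-channel strategy that balances accuracy and speed
ReLU in the backbone and neck, SiLU in the head

3.2 Output Format

For the COCO dataset (80 classes) with a 640×640 input:

Level	Feature map size	Classification output	Regression output
P3	80×80	(B, 80, 80, 80)	(B, 4, 80, 80)
P4	40×40	(B, 80, 40, 40)	(B, 4, 40, 40)
P5	20×20	(B, 80, 20, 20)	(B, 4, 20, 20)

Regression output format:

Each prediction holds the distances from the anchor point to the four sides of the box: (left, top, right, bottom)
These distances are decoded into corner format (x1, y1, x2, y2)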

Training Strategy and Loss Functions

1. Anchor-free Detection Paradigm

YOLOv6 is anchor-free, so no preset anchor boxes are needed.

Anchor point generation:

# Generate anchor points with MlvlPointGenerator
prior_generator = dict(
    type='mmdet.MlvlPointGenerator',
    offset=0.5,  # center of each grid cell
    strides=[8, 16, 32]  # downsampling factors of P3, P4, P5
)

# Format of the generated anchor points (with strides): (x, y, stride_w, stride_h)
# P3: (80×80, 4) - 6400 points
# P4: (40×40, 4) - 1600 points
# P5: (20×20, 4) - 400 points
# Total: 8400 anchor points
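To make the point layout concrete, here is a small plain-Python sketch that reproduces the counts above. It is an illustration under stated assumptions, not the MlvlPointGenerator code: `make_anchor_points` is a hypothetical helper, and the row-major ordering within each level is assumed.

```python
def make_anchor_points(img_size=640, strides=(8, 16, 32), offset=0.5):
    """Return one (x, y, stride_w, stride_h) tuple per grid cell, level by level."""
    points = []
    for stride in strides:
        cells = img_size // stride  # grid cells per side at this level
        for row in range(cells):
            for col in range(cells):
                # offset=0.5 places the point at the center of the cell
                x = (col + offset) * stride
                y = (row + offset) * stride
                points.append((x, y, stride, stride))
    return points


anchor_points = make_anchor_points()
per_level = [(640 // s) ** 2 for s in (8, 16, 32)]
print(per_level, len(anchor_points))
print(anchor_points[0], anchor_points[6400])
```

The first point of P3 sits at (4.0, 4.0), the center of the top-left 8×8 cell, and the first point of P4 at (8.0, 8.0).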

Advantages:

No hand-designed anchor sizes and aspect ratios
Fewer hyperparameters to tune
Better generalization
Simpler decoding logic

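2. Label Assignment Strategy

YOLOv6 assigns labels in two stages.

2.1 Warm-up Stage (Epochs 0-3)

Labels are assigned with ATSSAssigner.

ATSS algorithm:

1. Anchor generation
   anchor_point → anchor_box (size = 5×stride)
   
2. IoU computation
   For each GT, compute the IoU with every anchor
   
3. Center distance computation
   Compute the distance between each anchor center and the GT center
   
4. Candidate pre-selection
   On every FPN level, pick the topK anchors closest to the GT center
   
5. Adaptive threshold
   threshold = mean(IoU) + std(IoU), computed over the candidates of each GT
   
6. Positive sample selection
   positive = (IoU > threshold) AND (center in GT)

Code example:

# Fragment in the style of MMYOLO's BatchATSSAssigner; the helper functions are defined there.
# priors: (N, 4) as (x, y, stride_w, stride_h)

# 1. Turn anchor points into square anchors of side 5×stride
cell_half_size = priors[:, 2:] * 2.5  # 5×stride / 2
priors_gen = torch.zeros_like(priors)
priors_gen[:, :2] = priors[:, :2] - cell_half_size  # x1, y1
priors_gen[:, 2:] = priors[:, :2] + cell_half_size  # x2, y2

# 2. Compute IoU
overlaps = iou_calculator(gt_bboxes, priors_gen)

# 3. Compute center distances
distances = bbox_center_distance(gt_bboxes, priors_gen)

# 4. Select the topK candidates per level
is_in_candidate, candidate_idxs = select_topk_candidates(
    distances, num_level_priors, topk=9)

# 5. Compute the adaptive threshold per GT
overlaps_thr_per_gt = threshold_calculator(
    is_in_candidate, candidate_idxs, overlaps)

# 6. Keep candidates above the threshold whose center lies inside the GT box
is_in_gts = select_candidates_in_gts(priors, gt_bboxes)
is_pos = (overlaps > overlaps_thr_per_gt) & is_in_candidate & is_in_gts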
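Steps 5 and 6 are easy to trace by hand for a single GT box. The sketch below uses plain Python and four made-up candidate IoUs; `atss_positive_mask` is a hypothetical helper, and the sample standard deviation (n-1 denominator) is an assumption chosen to mirror the default of `torch.std`.

```python
import statistics


def atss_positive_mask(candidate_ious, center_in_gt):
    """Adaptive ATSS threshold for one GT: mean + std of its candidate IoUs."""
    threshold = statistics.fmean(candidate_ious) + statistics.stdev(candidate_ious)
    mask = [
        iou > threshold and inside
        for iou, inside in zip(candidate_ious, center_in_gt)
    ]
    return threshold, mask


# Toy candidates for one GT box (values are illustrative only)
ious = [0.1, 0.2, 0.6, 0.7]
inside = [True, True, True, True]

threshold, mask = atss_positive_mask(ious, inside)
print(round(threshold, 4), mask)
```

With a wide spread of IoUs the threshold lands near the top of the range, so only the best-overlapping candidate survives; with tightly clustered IoUs the standard deviation shrinks and more candidates pass.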

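2.2 Main Training Stage (Epoch >= 4)

Labels are assigned dynamically with TaskAlignedAssigner.

TaskAlignedAssigner algorithm:

1. Compute alignment metrics
   alignment_metrics = (cls_score)^α × (IoU)^β
   
2. Filter candidates
   candidates = (center in GT) AND (not padding)
   
3. Select the TopK positives
   positive = topK(alignment_metrics × candidates)

What the alignment metric means:

It accounts for both the classification score and the regression quality
It selects samples that are good for classification and regression at the same time
It changes as training progresses, which works better than a static assignment

Code example:

# 1. Compute alignment metrics
cls_scores = pred_cls.sigmoid()  # classification scores
overlaps = iou_calculator(pred_bbox, gt_bbox)  # IoU
alignment_metrics = cls_scores.pow(alpha) * overlaps.pow(beta)

# 2. Filter candidates whose center lies inside a GT box
is_in_gts = select_candidates_in_gts(priors, gt_bboxes)

# 3. Select the TopK positives
topk_metric = select_topk_candidates(
    alignment_metrics * is_in_gts,
    topk=13)

is_pos = topk_metric > 0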
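A short worked example shows how the exponents shape the ranking. This plain-Python sketch uses α = 1 and β = 6 (the values that appear in the configuration later in this article) and three made-up candidates; `alignment_metric` is a hypothetical helper name.

```python
def alignment_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """Task-alignment metric: cls_score^alpha * IoU^beta."""
    return cls_score ** alpha * iou ** beta


# (classification score, IoU) for three toy candidates of one GT box
candidates = [(0.9, 0.5), (0.6, 0.8), (0.3, 0.9)]
metrics = [alignment_metric(score, iou) for score, iou in candidates]

# Keep the top-k candidates by metric (k=2 here, purely for illustration)
topk = 2
positive_idx = sorted(range(len(metrics)), key=lambda i: metrics[i], reverse=True)[:topk]
print([round(m, 4) for m in metrics], positive_idx)
```

With β much larger than α, the confident but poorly localized first candidate (score 0.9, IoU 0.5) ranks last, while the well-localized third candidate ranks first despite its low score.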

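3. Loss Function Design

The YOLOv6 loss has two parts.

3.1 Classification Loss: Varifocal Loss

YOLOv6 uses Varifocal Loss (VFL) as its classification loss.

Original VFL formula:

VFL(p, q) = {
    -q(q·log(p) + (1-q)·log(1-p)),  if q > 0  (positive sample)
    -α·p^γ·log(1-p),                 if q = 0  (negative sample)
}

where:

p: predicted classification score in [0, 1]
q: target value (the IoU for positives, 0 for negatives)
α: weighting factor for negatives, default 0.75
γ: focal exponent, default 2.0

VFL as used in YOLOv6:

YOLOv6 uses Task Alignment Learning (TAL) to normalize the target value, separately for each GT instance:

t_hat = AlignmentMetrics / max(AlignmentMetrics) × max(IoU)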
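The normalization can be traced with a few numbers. This is a plain-Python sketch for the positives of a single GT box; the metric and IoU values are made up, and `normalize_targets` is a hypothetical helper.

```python
def normalize_targets(metrics, ious):
    """Rescale alignment metrics so that the largest one equals the largest IoU."""
    max_metric, max_iou = max(metrics), max(ious)
    if max_metric <= 0:
        return [0.0 for _ in metrics]
    return [m / max_metric * max_iou for m in metrics]


# Alignment metrics and IoUs of the positives assigned to one GT box (toy values)
metrics = [0.0140625, 0.1572864, 0.1594323]
ious = [0.5, 0.8, 0.9]

t_hat = normalize_targets(metrics, ious)
print([round(t, 4) for t in t_hat])
```

The ranking of the samples is preserved, the best-aligned sample receives the largest IoU as its soft label, and every target stays in [0, 1].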

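The resulting VFL:

VFL(p, t_hat) = {
    -t_hat(t_hat·log(p) + (1-t_hat)·log(1-p)),  if t_hat > 0
    -α·p^γ·log(1-p),                              if t_hat = 0
}

Advantages of VFL:

Positives are weighted by IoU, so high-quality samples get more attention
Negatives are weighted with a focal term, so hard samples get more attention
TAL normalization improves the consistency between classification and regression

Implementation:

import torch
import torch.nn.functional as F


def varifocal_loss(pred, target, alpha=0.75, gamma=2.0):
    """
    pred: predicted logits (B, N, C)
    target: target values (B, N, C); normalized IoU for positives, 0 for negatives
    """
    pred_sigmoid = pred.sigmoid()
    
    # Weight: the target itself for positives, a focal term for negatives
    focal_weight = target * (target > 0.0).float() + \
                   alpha * (pred_sigmoid - target).abs().pow(gamma) * \
                   (target <= 0.0).float()
    
    # Binary cross-entropy, weighted element-wise
    loss = F.binary_cross_entropy_with_logits(
        pred, target, reduction='none') * focal_weight
    
    return loss.mean()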

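3.2 Regression Loss: GIoU Loss / SIoU Loss

YOLOv6 picks the regression loss according to model size:

Model	Regression loss
N/T	SIoU Loss
S/M/L	GIoU Loss

GIoU Loss:

GIoU (Generalized IoU) extends IoU so that non-overlapping boxes still receive a useful signal:

GIoU = IoU - |C \ (A ∪ B)| / |C|

where C is the smallest axis-aligned box that encloses both A and B.

GIoU Loss = 1 - GIoU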
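As a quick check of the formula, here is a plain-Python sketch for two axis-aligned boxes in (x1, y1, x2, y2) format. The function name `giou` and the example boxes are illustrative choices, and the boxes are assumed to have positive area.

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Intersection A ∩ B
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union A ∪ B
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter

    # Smallest enclosing box C
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return inter / union - (enclose - union) / enclose


overlapping = giou((0, 0, 2, 2), (1, 1, 3, 3))  # IoU = 1/7, |C| = 9
disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))     # IoU = 0,   |C| = 3
identical = giou((0, 0, 2, 2), (0, 0, 2, 2))
print(round(overlapping, 4), round(disjoint, 4), identical)
```

The disjoint pair has IoU 0 but GIoU -1/3, so the loss 1 - GIoU still changes as the boxes move closer together, which plain IoU cannot provide.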

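SIoU Loss:

SIoU (SCYLLA-IoU) takes angle, distance and shape into account.

The four components of SIoU:

IoU cost:

IoU_cost = 1 - IoU

Angle cost, where cw and ch are the horizontal and vertical offsets between the two box centers and σ = sqrt(cw² + ch²) is the distance between the centers:

sin(α) = |ch| / σ
angle_cost = 1 - 2 × sin²(arcsin(sin(α)) - π/4)
           = cos(2α - π/2)

Distance cost, where cw_enclose and ch_enclose are the width and height of the smallest enclosing box:

ρ_x = (cw / cw_enclose)²
ρ_y = (ch / ch_enclose)²
γ = 2 - angle_cost
distance_cost = Σ(1 - e^(-γ·ρ_t)), t ∈ {x, y}

Shape cost, where the exponent θ controls how strongly a shape mismatch is penalized:

ω_w = |w1 - w2| / max(w1, w2)
ω_h = |h1 - h2| / max(h1, h2)
shape_cost = Σ(1 - e^(-ω_t))^θ, t ∈ {w, h}

SIoU formula:

SIoU = IoU - (distance_cost + shape_cost) / 2
SIoU Loss = 1 - SIoU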
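The angle cost is the least intuitive term, so a small numeric check may help. The plain-Python sketch below evaluates the formula for center offsets at a few angles and compares it with sin(2α), which equals cos(2α - π/2); `angle_cost` is a hypothetical helper and the angles are arbitrary sample points.

```python
import math


def angle_cost(dx, dy):
    """SIoU angle cost for a center offset (dx, dy); assumes a non-zero offset."""
    sigma = math.hypot(dx, dy)
    # Clamp to guard against rounding slightly above 1.0
    sin_alpha = min(1.0, abs(dy) / sigma)
    return 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2


costs = {}
for deg in (0, 30, 45, 90):
    rad = math.radians(deg)
    costs[deg] = angle_cost(math.cos(rad), math.sin(rad))

print({deg: round(cost, 4) for deg, cost in costs.items()})
```

The cost is zero when the centers are aligned horizontally or vertically and peaks at 45°. Because γ = 2 - angle_cost, a diagonal offset lowers γ in the distance cost, while an axis-aligned offset is penalized at full strength.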

Advantages of SIoU:

It accounts for the regression angle, which reduces ambiguity in the direction of the update
It optimizes distance and shape together
It works better for small models

Implementation:

import math

import torch


def siou_loss(pred_bbox, target_bbox, eps=1e-7, theta=4):
    """
    pred_bbox: (N, 4) - (x1, y1, x2, y2)
    target_bbox: (N, 4) - (x1, y1, x2, y2), matched one-to-one with pred_bbox
    """
    # Widths and heights of both boxes
    w1, h1 = pred_bbox[:, 2] - pred_bbox[:, 0], \
             pred_bbox[:, 3] - pred_bbox[:, 1]
    w2, h2 = target_bbox[:, 2] - target_bbox[:, 0], \
             target_bbox[:, 3] - target_bbox[:, 1]
    
    # Element-wise IoU of the matched pairs
    inter_w = (torch.min(pred_bbox[:, 2], target_bbox[:, 2]) -
               torch.max(pred_bbox[:, 0], target_bbox[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred_bbox[:, 3], target_bbox[:, 3]) -
               torch.max(pred_bbox[:, 1], target_bbox[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    ious = inter / (w1 * h1 + w2 * h2 - inter + eps)
    
    # Angle cost: offsets between the two box centers
    sigma_cw = (target_bbox[:, 0] + target_bbox[:, 2]) / 2 - \
               (pred_bbox[:, 0] + pred_bbox[:, 2]) / 2
    sigma_ch = (target_bbox[:, 1] + target_bbox[:, 3]) / 2 - \
               (pred_bbox[:, 1] + pred_bbox[:, 3]) / 2
    sigma = torch.sqrt(sigma_cw**2 + sigma_ch**2 + eps)
    
    sin_alpha = torch.abs(sigma_ch) / sigma
    sin_alpha = torch.where(sin_alpha <= math.sin(math.pi/4), 
                            sin_alpha, 
                            torch.abs(sigma_cw) / sigma)
    angle_cost = torch.cos(torch.arcsin(sin_alpha) * 2 - math.pi / 2)
    
    # Distance cost: offsets relative to the enclosing box
    enclose_w = torch.max(pred_bbox[:, 2], target_bbox[:, 2]) - \
                torch.min(pred_bbox[:, 0], target_bbox[:, 0])
    enclose_h = torch.max(pred_bbox[:, 3], target_bbox[:, 3]) - \
                torch.min(pred_bbox[:, 1], target_bbox[:, 1])
    
    rho_x = (sigma_cw / (enclose_w + eps))**2
    rho_y = (sigma_ch / (enclose_h + eps))**2
    gamma = 2 - angle_cost
    distance_cost = (1 - torch.exp(-gamma * rho_x)) + \
                    (1 - torch.exp(-gamma * rho_y))
    
    # Shape cost: relative width and height mismatch, raised to theta
    omega_w = torch.abs(w1 - w2) / (torch.max(w1, w2) + eps)
    omega_h = torch.abs(h1 - h2) / (torch.max(h1, h2) + eps)
    shape_cost = torch.pow(1 - torch.exp(-omega_w), theta) + \
                 torch.pow(1 - torch.exp(-omega_h), theta)
    
    # SIoU and the loss
    siou = ious - (distance_cost + shape_cost) / 2
    loss = 1 - siou
    
    return loss.mean()

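3.3 Loss Weights

YOLOv6 combines the two losses as:

total loss = loss_cls × 1.0 + loss_bbox × 2.5

The classification and regression losses are weighted 1:2.5, which emphasizes the regression task.

4. Optimizer and Learning Rate Strategy

4.1 Optimizer Parameter Groups

YOLOv6 splits the parameters into groups so that weight decay is applied only to convolution and linear weights, while BatchNorm weights and all biases are left undecayed:

import torch
import torch.nn as nn

# Three groups: BN weights and biases (no decay), all other weights (with decay)
bn_weights, weights, biases = [], [], []
for m in model.modules():
    if isinstance(getattr(m, 'bias', None), nn.Parameter):
        biases.append(m.bias)
    if isinstance(m, nn.BatchNorm2d):
        bn_weights.append(m.weight)
    elif isinstance(getattr(m, 'weight', None), nn.Parameter):
        weights.append(m.weight)

optimizer = torch.optim.SGD(bn_weights, lr=base_lr, momentum=0.937, nesterov=True)
optimizer.add_param_group({'params': weights, 'weight_decay': 0.0005})
optimizer.add_param_group({'params': biases})

4.2 Learning Rate Schedule

YOLOv6 uses cosine annealing:

lr(epoch) = lr_min + (lr_max - lr_min) × (1 + cos(π × epoch / total_epochs)) / 2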
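The schedule is straightforward to evaluate. The plain-Python sketch below plugs in the values used in the configuration section of this article (base_lr = 0.01, lr_factor = 0.01, max_epochs = 400), taking lr_min = base_lr × lr_factor as an assumption; warm-up is ignored and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(epoch, total_epochs, lr_max, lr_min):
    """Cosine annealing from lr_max at epoch 0 down to lr_min at total_epochs."""
    cosine = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lr_min + (lr_max - lr_min) * cosine


base_lr, lr_factor, max_epochs = 0.01, 0.01, 400
lr_min = base_lr * lr_factor

schedule = {
    epoch: cosine_lr(epoch, max_epochs, base_lr, lr_min)
    for epoch in (0, 100, 200, 300, 400)
}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:3d}: lr = {lr:.6f}")
```

The learning rate starts at 0.01, passes the midpoint (0.00505) at epoch 200, and ends at 0.0001, changing slowly near both ends and fastest in the middle.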

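4.3 Weight Decay Scaling

Weight decay is scaled with the total batch size relative to a base batch size of 64, which is what the YOLOv5OptimizerConstructor in the configuration does with batch_size_per_gpu:

def get_weight_decay(base_weight_decay, total_batch_size, base_total_batch_size=64):
    """Scale weight decay by the effective batch size relative to the base batch size"""
    accumulate = max(round(base_total_batch_size / total_batch_size), 1)
    scale = total_batch_size * accumulate / base_total_batch_size
    return base_weight_decay * scale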

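Data Augmentation

YOLOv6 largely follows the YOLOv5 augmentation recipe, with a few adjustments.

Training-time Augmentation

All but the last 15 epochs:

Mosaic: stitch 4 images together at random
RandomAffine: random affine transform (rotation, scaling, translation)
MixUp: blend two images (optional)
HSV augmentation: jitter hue, saturation and value
Random horizontal flip

Last 15 epochs:

Mosaic is removed
YOLOv5KeepRatioResize + LetterResize are used instead
The goal is to match the real inference setting

Augmentation Pipeline

Original image
    │
    ├─ Mosaic (4 images stitched together)
    │
    ├─ RandomAffine (rotation, scaling, translation)
    │
    ├─ MixUp (optional)
    │
    ├─ HSV augmentation
    │
    ├─ Random horizontal flip
    │
    └─ Resize to 640×640

Validation

Validation applies only:

KeepRatioResize: resize while keeping the aspect ratio
LetterResize: pad to 640×640 (aspect ratio preserved)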
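The two validation steps amount to computing one scale factor and the padding on each side. The following plain-Python sketch shows that arithmetic under simplifying assumptions: it always pads to the full square with the padding split evenly, and `letterbox_params` is a hypothetical helper rather than the MMYOLO transform itself.

```python
def letterbox_params(width, height, target=640, allow_scale_up=True):
    """Return (ratio, (new_w, new_h), (left, top, right, bottom)) for a square target."""
    # Keep-ratio resize: the longer side ends up at `target`
    ratio = min(target / width, target / height)
    if not allow_scale_up:
        ratio = min(ratio, 1.0)  # only shrink, never enlarge
    new_w, new_h = round(width * ratio), round(height * ratio)

    # Letterbox: pad the shorter side up to the target size
    pad_w, pad_h = target - new_w, target - new_h
    left, top = pad_w // 2, pad_h // 2
    return ratio, (new_w, new_h), (left, top, pad_w - left, pad_h - top)


hd = letterbox_params(1280, 720)
small = letterbox_params(320, 240, allow_scale_up=False)
print(hd)
print(small)
```

A 1280×720 frame is scaled by 0.5 to 640×360 and padded with 140 rows above and below. With `allow_scale_up=False`, a 320×240 image keeps its size and is padded on all four sides.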

Inference and Post-processing

1. Forward Pass

Input image (640×640×3)
    │
    ├─ Preprocessing: normalization, BGR→RGB
    │
    ├─ Backbone forward
    │   └─ Outputs: P3, P4, P5
    │
    ├─ Neck forward
    │   └─ Outputs: fused P3, P4, P5
    │
    ├─ Head forward
    │   ├─ Classification output: (B, 80, H, W)
    │   └─ Regression output: (B, 4, H, W)
    │
    └─ Post-processing

2. Decoding

BBox decoding:

YOLOv6 decodes boxes with DistancePointBBoxCoder:

import torch


def decode(points, pred_bboxes, stride):
    """
    points: anchor points (N, 2) - (x, y), in input-image pixels
    pred_bboxes: predicted distances (N, 4) - (left, top, right, bottom)
    stride: downsampling factor of this level
    """
    # Scale the predictions to input-image pixels
    distance = pred_bboxes * stride
    
    # Convert distances to corner coordinates
    x1 = points[:, 0] - distance[:, 0]  # left
    y1 = points[:, 1] - distance[:, 1]  # top
    x2 = points[:, 0] + distance[:, 2]  # right
    y2 = points[:, 1] + distance[:, 3]  # bottom
    
    bboxes = torch.stack([x1, y1, x2, y2], dim=-1)
    return bboxes

Classification score decoding:

# Apply sigmoid to the classification output
cls_scores = torch.sigmoid(pred_cls)  # (B, C, H, W)

# Reshape to (B, H*W, C)
cls_scores = cls_scores.permute(0, 2, 3, 1).reshape(B, -1, C)

3. Post-processing

YOLOv6 post-processing has three steps.

3.1 Merging Multi-scale Predictions

# predictions: one (bbox_pred, cls_pred) pair per scale,
# with bbox_pred of shape (B, 4, H, W) and cls_pred of shape (B, C, H, W)
all_bboxes = []
all_scores = []

for scale_idx, (bbox_pred, cls_pred) in enumerate(predictions):
    B, C = cls_pred.shape[:2]
    # Flatten the spatial dimensions: (B, ch, H, W) -> (B, H*W, ch)
    bbox_pred = bbox_pred.permute(0, 2, 3, 1).reshape(B, -1, 4)
    cls_pred = cls_pred.permute(0, 2, 3, 1).reshape(B, -1, C)
    
    # Decode the boxes of every image in the batch
    bboxes = torch.stack([
        decode(anchor_points[scale_idx], pred, strides[scale_idx])
        for pred in bbox_pred])
    
    # Classification scores
    scores = cls_pred.sigmoid()
    
    all_bboxes.append(bboxes)
    all_scores.append(scores)

# Concatenate the results of all scales
all_bboxes = torch.cat(all_bboxes, dim=1)  # (B, N, 4)
all_scores = torch.cat(all_scores, dim=1)  # (B, N, C)

3.2 Score Filtering

# Confidence threshold
conf_threshold = 0.25

# Drop low-confidence predictions
max_scores, _ = all_scores.max(dim=-1)  # (B, N)
valid_mask = max_scores > conf_threshold

# Boolean indexing flattens the batch dimension, so this assumes B == 1
filtered_bboxes = all_bboxes[valid_mask]
filtered_scores = all_scores[valid_mask]

3.3 NMS (Non-Maximum Suppression)

import torch
from torchvision.ops import box_iou


def nms(bboxes, scores, iou_threshold=0.45):
    """
    bboxes: (N, 4) - (x1, y1, x2, y2)
    scores: (N,)
    Returns the indices of the kept boxes as a list of ints.
    """
    # Sort by score, highest first
    order = scores.argsort(descending=True)
    
    keep = []
    while order.numel() > 0:
        # Keep the highest-scoring remaining box
        i = order[0].item()
        keep.append(i)
        
        if order.numel() == 1:
            break
        
        # IoU between the kept box and the rest: (1, M) -> (M,)
        ious = box_iou(bboxes[i:i+1], bboxes[order[1:]]).squeeze(0)
        
        # Drop boxes whose IoU exceeds the threshold
        inds = torch.where(ious <= iou_threshold)[0]
        order = order[inds + 1]
    
    return keep

Multi-class NMS:

def multiclass_nms(bboxes, scores, nms_threshold=0.45, score_threshold=0.25):
    """
    bboxes: (N, 4)
    scores: (N, C) - C classes
    """
    results = []
    
    # Run NMS separately for every class
    for cls_idx in range(scores.shape[1]):
        cls_scores = scores[:, cls_idx]
        
        # Drop low scores
        valid_mask = cls_scores > score_threshold
        if not valid_mask.any():
            continue
        
        cls_bboxes = bboxes[valid_mask]
        cls_scores = cls_scores[valid_mask]
        
        # NMS
        keep = nms(cls_bboxes, cls_scores, nms_threshold)
        
        if len(keep) > 0:
            results.append({
                'bboxes': cls_bboxes[keep],
                'scores': cls_scores[keep],
                'labels': torch.full((len(keep),), cls_idx)
            })
    
    return results

4. Complete Inference Example

import cv2
import torch


class YOLOv6Inference:
    def __init__(self, model, anchor_points, strides,
                 conf_threshold=0.25, nms_threshold=0.45):
        self.model = model.eval()
        # Per-scale anchor points, each of shape (H*W, 2), and the matching strides
        self.anchor_points = anchor_points
        self.strides = strides
        self.conf_threshold = conf_threshold
        self.nms_threshold = nms_threshold
    
    def preprocess(self, image):
        """Preprocess a BGR uint8 image, e.g. one returned by cv2.imread"""
        # Resize to 640×640 (plain resize, the aspect ratio is not preserved)
        image = cv2.resize(image, (640, 640))
        # BGR -> RGB, HWC -> CHW, scale to [0, 1] (mean 0, std 255)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        tensor = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
        return tensor.unsqueeze(0)
    
    def decode_bbox(self, points, pred_bboxes, stride):
        """Decode (left, top, right, bottom) distances of shape (B, N, 4) into boxes"""
        distance = pred_bboxes * stride
        x1 = points[..., 0] - distance[..., 0]
        y1 = points[..., 1] - distance[..., 1]
        x2 = points[..., 0] + distance[..., 2]
        y2 = points[..., 1] + distance[..., 3]
        return torch.stack([x1, y1, x2, y2], dim=-1)
    
    def nms(self, bboxes, scores, iou_threshold):
        """NMS; returns the indices of the kept boxes as a list of ints"""
        order = scores.argsort(descending=True)
        keep = []
        
        while order.numel() > 0:
            i = order[0].item()
            keep.append(i)
            
            if order.numel() == 1:
                break
            
            ious = self.bbox_iou(bboxes[i:i+1], bboxes[order[1:]])
            inds = torch.where(ious <= iou_threshold)[0]
            order = order[inds + 1]
        
        return keep
    
    def bbox_iou(self, box1, box2):
        """IoU between one box (1, 4) and M boxes (M, 4); returns (M,)"""
        # Intersection
        inter_x1 = torch.max(box1[:, 0], box2[:, 0])
        inter_y1 = torch.max(box1[:, 1], box2[:, 1])
        inter_x2 = torch.min(box1[:, 2], box2[:, 2])
        inter_y2 = torch.min(box1[:, 3], box2[:, 3])
        
        inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * \
                     torch.clamp(inter_y2 - inter_y1, min=0)
        
        # Union
        box1_area = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
        box2_area = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
        union_area = box1_area + box2_area - inter_area
        
        iou = inter_area / (union_area + 1e-6)
        return iou
    
    def postprocess(self, predictions, anchor_points, strides):
        """Post-process per-scale (bbox_pred, cls_pred) maps of shape (B, ch, H, W)"""
        all_bboxes = []
        all_scores = []
        
        # Merge the multi-scale predictions
        for scale_idx, (bbox_pred, cls_pred) in enumerate(predictions):
            B, C = cls_pred.shape[:2]
            # Flatten the spatial dimensions: (B, ch, H, W) -> (B, H*W, ch)
            bbox_pred = bbox_pred.permute(0, 2, 3, 1).reshape(B, -1, 4)
            cls_pred = cls_pred.permute(0, 2, 3, 1).reshape(B, -1, C)
            
            # Decode
            bboxes = self.decode_bbox(
                anchor_points[scale_idx], 
                bbox_pred, 
                strides[scale_idx]
            )
            
            # Classification scores
            scores = torch.sigmoid(cls_pred)
            
            all_bboxes.append(bboxes)
            all_scores.append(scores)
        
        # Concatenate
        all_bboxes = torch.cat(all_bboxes, dim=1)  # (B, N, 4)
        all_scores = torch.cat(all_scores, dim=1)  # (B, N, C)
        
        # Multi-class NMS on the first (only) image of the batch
        results = []
        for cls_idx in range(all_scores.shape[2]):
            cls_scores = all_scores[0, :, cls_idx]
            
            # Drop low scores
            valid_mask = cls_scores > self.conf_threshold
            if not valid_mask.any():
                continue
            
            cls_bboxes = all_bboxes[0][valid_mask]
            cls_scores = cls_scores[valid_mask]
            
            # NMS
            keep = self.nms(cls_bboxes, cls_scores, self.nms_threshold)
            
            if len(keep) > 0:
                # Boxes are in the coordinates of the 640×640 network input
                results.append({
                    'bboxes': cls_bboxes[keep].cpu().numpy(),
                    'scores': cls_scores[keep].cpu().numpy(),
                    'labels': cls_idx
                })
        
        return results
    
    def __call__(self, image):
        """Run inference on one image"""
        # Preprocess
        tensor = self.preprocess(image)
        
        # Forward pass; the model is expected to return one
        # (bbox_pred, cls_pred) pair per scale
        with torch.no_grad():
            outputs = self.model(tensor)
        
        # Post-process
        results = self.postprocess(outputs, self.anchor_points, self.strides)
        
        return results

Performance Evaluation and Comparison

1. COCO Results

YOLOv6 on the COCO 2017 val set:

Model	mAP@0.5:0.95	mAP@0.5	Params	FLOPs	Speed (V100)
YOLOv6-N	35.9%	50.5%	4.3M	11.1G	1243 FPS
YOLOv6-T	40.3%	54.8%	15.0M	37.3G	1022 FPS
YOLOv6-S	43.1%	58.5%	17.2M	44.2G	484 FPS
YOLOv6-M	49.5%	63.3%	34.3M	82.2G	226 FPS
YOLOv6-L	52.5%	66.8%	59.6M	144.0G	119 FPS

2. Comparison with Other YOLO Models

Accuracy (COCO val):

YOLOv6-L: 52.5% mAP
YOLOv5-L: 48.2% mAP
YOLOX-L:  50.0% mAP

Speed (V100, TensorRT):

YOLOv6-S: 484 FPS
YOLOv5-S: 333 FPS
YOLOX-S:  208 FPS

3. Accuracy-Speed Trade-off

[Figure: mAP (%) versus FPS for YOLOv6-N, T, S, M and L]

Across the series, accuracy rises as throughput falls: from YOLOv6-N (35.9% mAP at 1243 FPS) to YOLOv6-L (52.5% mAP at 119 FPS), following the table in section 1.

4. Ablation Studies

Effect of the RepVGG Block:

Component	mAP	FPS
Standard Conv	42.1%	350
RepVGG Block	43.1%	484

Effect of the Efficient Decoupled Head:

Head type	mAP	FPS
YOLOX Head	42.8%	420
Efficient Head	43.1%	484

Effect of the label assignment strategy:

Strategy	mAP
ATSS	41.5%
TaskAlignedAssigner	43.1%

Configuration File Walkthrough

1. Full Configuration File Structure

The YOLOv6 configuration follows MMYOLO's modular design. A complete example:

# ======================= Base configs =====================
_base_ = ['../_base_/default_runtime.py', '../_base_/det_p5_tta.py']

# ======================= Data parameters =====================
data_root = 'data/coco/'  # dataset root directory
train_ann_file = 'annotations/instances_train2017.json'  # training annotation file
train_data_prefix = 'train2017/'  # training image path prefix
val_ann_file = 'annotations/instances_val2017.json'  # validation annotation file
val_data_prefix = 'val2017/'  # validation image path prefix

num_classes = 80  # number of classes (COCO)
img_scale = (640, 640)  # input image size (width, height)
dataset_type = 'YOLOv5CocoDataset'  # dataset type

# ======================= Training parameters =====================
train_batch_size_per_gpu = 32  # batch size per GPU
train_num_workers = 8  # number of data-loading workers
val_batch_size_per_gpu = 1  # batch size per GPU for validation
val_num_workers = 2  # number of data-loading workers for validation
persistent_workers = True  # keep worker processes alive between epochs

# ======================= Optimizer parameters =====================
base_lr = 0.01  # base learning rate
max_epochs = 400  # maximum number of training epochs
num_last_epochs = 15  # switch the training pipeline for the last N epochs
lr_factor = 0.01  # learning rate factor (final lr = base_lr * lr_factor)
weight_decay = 0.0005  # weight decay coefficient
momentum = 0.937  # SGD momentum

# ======================= Model structure parameters =====================
deepen_factor = 0.33  # depth multiplier (controls network depth)
widen_factor = 0.5  # width multiplier (controls network width)

# ======================= Augmentation parameters =====================
affine_scale = 0.5  # scaling range of RandomAffine

# ======================= Model =====================
model = dict(
    type='YOLODetector',
    data_preprocessor=dict(
        type='YOLOv5DetDataPreprocessor',
        mean=[0., 0., 0.],  # per-channel mean subtracted from the input
        std=[255., 255., 255.],  # per-channel std (scales pixels to [0, 1])
        bgr_to_rgb=True),  # convert BGR to RGB
    backbone=dict(
        type='YOLOv6EfficientRep',
        deepen_factor=deepen_factor,
        widen_factor=widen_factor,
        norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),  # BatchNorm settings
        act_cfg=dict(type='ReLU', inplace=True)),  # activation function
    neck=dict(
        type='YOLOv6RepPAFPN',
        deepen_factor=deepen_factor,
        widen_factor=widen_factor,
        in_channels=[256, 512, 1024],  # input channels (P3, P4, P5)
        out_channels=[128, 256, 512],  # output channels
        num_csp_blocks=12,  # number of stacked blocks
        norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
        act_cfg=dict(type='ReLU', inplace=True)),
    bbox_head=dict(
        type='YOLOv6Head',
        head_module=dict(
            type='YOLOv6HeadModule',
            num_classes=num_classes,
            in_channels=[128, 256, 512],  # input channels
            widen_factor=widen_factor,
            norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg=dict(type='SiLU', inplace=True),  # the head uses SiLU
            featmap_strides=[8, 16, 32]),  # downsampling factor of each feature map
        loss_bbox=dict(
            type='IoULoss',
            iou_mode='giou',  # GIoU loss
            bbox_format='xyxy',
            reduction='mean',
            loss_weight=2.5,  # regression loss weight
            return_iou=False)),
    train_cfg=dict(
        initial_epoch=4,  # number of warm-up epochs that use initial_assigner
        initial_assigner=dict(
            type='BatchATSSAssigner',
            num_classes=num_classes,
            topk=9,  # top-k candidates per level
            iou_calculator=dict(type='mmdet.BboxOverlaps2D')),
        assigner=dict(
            type='BatchTaskAlignedAssigner',
            num_classes=num_classes,
            topk=13,  # top-k positives per GT
            alpha=1,  # exponent on the classification score
            beta=6)),  # exponent on the IoU
    test_cfg=dict(
        multi_label=True,  # multi-label classification
        nms_pre=30000,  # number of boxes kept before NMS
        score_thr=0.001,  # score threshold
        nms=dict(type='nms', iou_threshold=0.65),  # NMS settings
        max_per_img=300))  # maximum number of detections kept per image

# ======================= Augmentation pipelines =====================
pre_transform = [
    dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='LoadAnnotations', with_bbox=True)
]

# Training pipeline (all but the last 15 epochs)
train_pipeline = [
    *pre_transform,
    dict(
        type='Mosaic',
        img_scale=img_scale,
        pad_val=114.0,  # padding value
        pre_transform=pre_transform),
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,  # maximum rotation angle
        max_translate_ratio=0.1,  # maximum translation ratio
        scaling_ratio_range=(1 - affine_scale, 1 + affine_scale),  # scaling range
        border=(-img_scale[0] // 2, -img_scale[1] // 2),  # border
        border_val=(114, 114, 114),  # border fill value
        max_shear_degree=0.0),  # maximum shear angle
    dict(type='YOLOv5HSVRandomAug'),  # HSV color-space augmentation
    dict(type='mmdet.RandomFlip', prob=0.5),  # random horizontal flip
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 
                   'flip', 'flip_direction'))
]

# Training pipeline, stage 2 (last 15 epochs)
train_pipeline_stage2 = [
    *pre_transform,
    dict(type='YOLOv5KeepRatioResize', scale=img_scale),  # keep-ratio resize
    dict(
        type='LetterResize',
        scale=img_scale,
        allow_scale_up=True,  # allow upscaling
        pad_val=dict(img=114)),  # padding value
    dict(
        type='YOLOv5RandomAffine',
        max_rotate_degree=0.0,
        max_translate_ratio=0.1,
        scaling_ratio_range=(1 - affine_scale, 1 + affine_scale),
        max_shear_degree=0.0),
    dict(type='YOLOv5HSVRandomAug'),
    dict(type='mmdet.RandomFlip', prob=0.5),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 
                   'flip', 'flip_direction'))
]

# ======================= Dataloaders =====================
train_dataloader = dict(
    batch_size=train_batch_size_per_gpu,
    num_workers=train_num_workers,
    collate_fn=dict(type='yolov5_collate'),  # batch collation function
    persistent_workers=persistent_workers,
    pin_memory=True,  # use pinned memory
    sampler=dict(type='DefaultSampler', shuffle=True),  # sampler
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file=train_ann_file,
        data_prefix=dict(img=train_data_prefix),
        filter_cfg=dict(filter_empty_gt=False, min_size=32),  # filtering settings
        pipeline=train_pipeline))

# ======================= Validation pipeline =====================
test_pipeline = [
    dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='YOLOv5KeepRatioResize', scale=img_scale),
    dict(
        type='LetterResize',
        scale=img_scale,
        allow_scale_up=False,  # do not upscale
        pad_val=dict(img=114)),
    dict(type='LoadAnnotations', with_bbox=True, _scope_='mmdet'),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'pad_param'))
]

val_dataloader = dict(
    batch_size=val_batch_size_per_gpu,
    num_workers=val_num_workers,
    persistent_workers=persistent_workers,
    pin_memory=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        test_mode=True,
        data_prefix=dict(img=val_data_prefix),
        ann_file=val_ann_file,
        pipeline=test_pipeline,
        batch_shapes_cfg=dict(
            type='BatchShapePolicy',
            batch_size=val_batch_size_per_gpu,
            img_size=img_scale[0],
            size_divisor=32,
            extra_pad_ratio=0.5)))

test_dataloader = val_dataloader

# ======================= Optimizer config =====================
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(
        type='SGD',
        lr=base_lr,
        momentum=momentum,
        weight_decay=weight_decay,
        nesterov=True,  # use Nesterov momentum
        batch_size_per_gpu=train_batch_size_per_gpu),
    constructor='YOLOv5OptimizerConstructor')  # optimizer constructor

# ======================= Default hooks (LR scheduler and checkpointing) =====================
default_hooks = dict(
    param_scheduler=dict(
        type='YOLOv5ParamSchedulerHook',
        scheduler_type='cosine',  # cosine annealing
        lr_factor=lr_factor,
        max_epochs=max_epochs),
    checkpoint=dict(
        type='CheckpointHook',
        interval=10,  # save interval (epochs)
        max_keep_ckpts=3,  # maximum number of checkpoints to keep
        save_best='auto'))  # automatically keep the best checkpoint

# ======================= Custom hooks =====================
custom_hooks = [
    dict(
        type='EMAHook',  # exponential moving average of the weights
        ema_type='ExpMomentumEMA',
        momentum=0.0001,  # EMA momentum
        update_buffers=True,
        strict_load=False,
        priority=49),
    dict(
        type='mmdet.PipelineSwitchHook',  # hook that swaps the training pipeline
        switch_epoch=max_epochs - num_last_epochs,  # epoch at which to switch
        switch_pipeline=train_pipeline_stage2)  # switch to the stage 2 pipeline
]

# ======================= Evaluator config =====================
val_evaluator = dict(
    type='mmdet.CocoMetric',
    proposal_nums=(100, 1, 10),  # numbers of proposals used during evaluation
    ann_file=data_root + val_ann_file,
    metric='bbox')  # evaluation metric

test_evaluator = val_evaluator

# ======================= Training loop config =====================
train_cfg = dict(
    type='EpochBasedTrainLoop',
    max_epochs=max_epochs,
    val_interval=10,  # validation interval (epochs)
    dynamic_intervals=[(max_epochs - num_last_epochs, 1)])  # validate every epoch after the switch point

val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
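
With `val_interval=10` and `dynamic_intervals=[(max_epochs - num_last_epochs, 1)]`, validation runs every 10 epochs for most of training and every epoch near the end. The sketch below reproduces that arithmetic in plain Python. The function name `validation_epochs` is hypothetical, epochs are counted from 1, and this is a simplified model of the config rather than MMEngine's implementation, so the exact boundary epoch may differ by one.

```python
def validation_epochs(max_epochs=400, num_last_epochs=15, val_interval=10):
    """Return the 1-based epochs after which validation runs (simplified model)."""
    switch = max_epochs - num_last_epochs  # epoch at which the interval drops to 1
    epochs = []
    for epoch in range(1, max_epochs + 1):
        interval = 1 if epoch > switch else val_interval
        if epoch % interval == 0:
            epochs.append(epoch)
    return epochs


if __name__ == '__main__':
    epochs = validation_epochs()
    print(f"{len(epochs)} validation runs, first: {epochs[:3]}, last: {epochs[-3:]}")
```

Under these defaults most of the validation cost falls in the last 15 epochs, which is also where the best checkpoint is usually selected.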

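2. Key parameters in detail

2.1 Model structure parameters

Parameter	Description	Default	Effect
`deepen_factor`	Depth scaling factor	0.33 (S)	Controls network depth; larger values give a deeper network
`widen_factor`	Width scaling factor	0.5 (S)	Controls network width; larger values give more channels
`num_csp_blocks`	Number of CSP blocks	12	Controls the number of CSP blocks in the Neck
`in_channels`	Input channels	[256, 512, 1024]	Channel counts output by the Backbone
`out_channels`	Output channels	[128, 256, 512]	Channel counts output by the Neck

Settings for each model variant:

Model	deepen_factor	widen_factor	num_csp_blocks
YOLOv6-N	0.33	0.25	6
YOLOv6-T	0.33	0.375	6
YOLOv6-S	0.33	0.5	12
YOLOv6-M	0.6	0.75	16
YOLOv6-L	1.0	1.0	18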
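
To make the two scaling factors concrete, the sketch below applies them to a base channel list and a base block count. The helpers `make_divisible` and `make_round` are written here for illustration; the rounding rules (round channels up to a multiple of 8, keep at least one block) are assumptions about how such factors are commonly applied, not a statement of the exact library code.

```python
import math


def make_divisible(channels, widen_factor=1.0, divisor=8):
    """Scale a channel count and round it up to a multiple of `divisor`."""
    return math.ceil(channels * widen_factor / divisor) * divisor


def make_round(num_blocks, deepen_factor=1.0):
    """Scale a block count, keeping at least one block."""
    if num_blocks <= 1:
        return num_blocks
    return max(round(num_blocks * deepen_factor), 1)


if __name__ == '__main__':
    base_channels = [256, 512, 1024]
    for name, deepen, widen in [('N', 0.33, 0.25), ('S', 0.33, 0.5), ('L', 1.0, 1.0)]:
        channels = [make_divisible(c, widen) for c in base_channels]
        print(f"YOLOv6-{name}: channels={channels}, blocks(base 12)={make_round(12, deepen)}")
```

For the S variant this turns base channels [256, 512, 1024] into [128, 256, 512], which is why the same config template can describe every model size.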

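2.2 Training hyperparameters

Parameter	Description	Default	Tuning advice
`base_lr`	Base learning rate	0.01	Scale with batch size: lr = base_lr * batch_size / 64
`max_epochs`	Maximum number of training epochs	400	Can be reduced for small datasets and increased for large ones
`num_last_epochs`	Last N epochs	15	Number of final epochs trained with Mosaic disabled
`lr_factor`	Learning rate scaling factor	0.01	Final learning rate = base_lr * lr_factor
`weight_decay`	Weight decay	0.0005	Guards against overfitting; can be adjusted to model size
`momentum`	SGD momentum	0.937	Usually needs no tuning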
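Learning rate adjustment:

import math

# Scale the learning rate linearly with the effective batch size
train_batch_size_per_gpu, num_gpus = 32, 8  # example values (8 GPUs x 32 images)
effective_batch_size = train_batch_size_per_gpu * num_gpus
base_lr = 0.01 * (effective_batch_size / 64)

# Cosine annealing from lr_max = base_lr down to lr_min = base_lr * lr_factor
def cosine_lr(epoch, max_epochs=400, lr_factor=0.01):
    lr_min = base_lr * lr_factor
    return lr_min + (base_lr - lr_min) * (1 + math.cos(math.pi * epoch / max_epochs)) / 2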

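2.3 Data augmentation parameters

Parameter	Description	Default	Effect
`affine_scale`	Scaling range of the affine transform	0.5	Larger values widen the scaling range
`max_rotate_degree`	Maximum rotation angle	0.0	Usually set to 0 to avoid rotation
`max_translate_ratio`	Maximum translation ratio	0.1	Larger values widen the translation range
`max_shear_degree`	Maximum shear angle	0.0	Usually set to 0
`pad_val`	Padding value	114	Pixel value used to pad the image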
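
The role of `pad_val` and `allow_scale_up` is easiest to see in the letterbox geometry: the image is resized with its aspect ratio preserved, and the remaining area is filled with the padding value. The sketch below computes only the geometry (scale ratio, resized shape, padding on each side). The function name `letterbox_geometry` and the choice to split padding evenly between both sides are assumptions made for illustration.

```python
def letterbox_geometry(img_h, img_w, target=(640, 640), allow_scale_up=True):
    """Compute resize ratio and padding for an aspect-preserving letterbox."""
    target_h, target_w = target
    ratio = min(target_h / img_h, target_w / img_w)
    if not allow_scale_up:
        ratio = min(ratio, 1.0)  # never enlarge small images

    new_h, new_w = round(img_h * ratio), round(img_w * ratio)
    pad_h, pad_w = target_h - new_h, target_w - new_w
    top, left = pad_h // 2, pad_w // 2
    return {
        'ratio': ratio,
        'resized': (new_h, new_w),
        # (top, bottom, left, right), filled with pad_val (114 in the config)
        'padding': (top, pad_h - top, left, pad_w - left),
    }


if __name__ == '__main__':
    print(letterbox_geometry(480, 640))
    print(letterbox_geometry(300, 400, allow_scale_up=False))
```

With `allow_scale_up=False`, as in the validation pipeline, a 300x400 image keeps its size and is only padded, so validation images are never enlarged.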

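2.4 Loss function parameters

Parameter	Description	Default	Effect
`loss_weight` (cls)	Classification loss weight	1.0	Controls the importance of the classification loss
`loss_weight` (bbox)	Regression loss weight	2.5	Controls the importance of the regression loss
`alpha` (VFL)	VFL alpha parameter	0.75	Weighting factor for negative samples
`gamma` (VFL)	VFL gamma parameter	2.0	Focal (focusing) factor
`topk` (ATSS)	ATSS top-k	9	Number of candidate samples
`topk` (TAL)	TAL top-k	13	Number of top-ranked candidates kept as positives
`alpha` (TAL)	TAL alpha	1	Exponent on the classification score
`beta` (TAL)	TAL beta	6	Exponent on the IoU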
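
The TAL parameters combine into a single alignment metric, t = s^alpha * u^beta, where s is the predicted classification score and u is the IoU with the ground-truth box; the top-k candidates by t become positives. The sketch below illustrates that ranking for one ground-truth box. It is a minimal sketch: the helper names are hypothetical, and the real assigner also restricts candidates to anchors inside the box and works on batched tensors.

```python
def alignment_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """Task-alignment metric t = s^alpha * u^beta."""
    return (cls_score ** alpha) * (iou ** beta)


def select_topk(candidates, topk=13, alpha=1.0, beta=6.0):
    """Rank (anchor_id, cls_score, iou) tuples and keep the top-k anchor ids."""
    scored = [(alignment_metric(s, u, alpha, beta), anchor_id)
              for anchor_id, s, u in candidates]
    scored.sort(reverse=True)
    return [anchor_id for metric, anchor_id in scored[:topk] if metric > 0]


if __name__ == '__main__':
    # (anchor_id, classification score, IoU with the ground-truth box)
    candidates = [(0, 0.9, 0.5), (1, 0.5, 0.9), (2, 0.2, 0.95)]
    print(select_topk(candidates, topk=2))
```

Because beta is 6, a confident prediction with a mediocre IoU (anchor 0) ranks below a less confident but well-localized one (anchor 1), which is the intended bias toward localization quality.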

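Complete training workflow in detail

1. Environment setup

1.1 Install dependencies

# Install PyTorch (pick the wheel index that matches your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install the OpenMMLab dependencies of MMYOLO
pip install -U openmim
mim install "mmengine>=0.6.0"
mim install "mmcv>=2.0.0rc4,<2.1.0"
mim install "mmdet>=3.0.0,<4.0.0"

# Install MMYOLO from source; the tools/ and configs/ paths used below live in this checkout
git clone https://github.com/open-mmlab/mmyolo.git
cd mmyolo
pip install -v -e .

# Install other dependencies
pip install opencv-python pillow numpy matplotlib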
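
After installation it helps to confirm which versions actually ended up in the environment, since version mismatches between these packages are a common source of import errors. The sketch below uses only the standard library. The package list is an assumption (for example, mmcv may be installed under a different distribution name such as `mmcv-lite`), and the function name is hypothetical.

```python
from importlib import metadata


def installed_versions(packages=('torch', 'mmengine', 'mmcv', 'mmdet', 'mmyolo')):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == '__main__':
    for name, version in installed_versions().items():
        print(f"{name:10s} {version or 'NOT INSTALLED'}")
```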

1.2 Prepare the dataset

# COCO dataset directory layout
data/coco/
├── annotations/
│   ├── instances_train2017.json
│   └── instances_val2017.json
├── train2017/
│   ├── 000000000009.jpg
│   ├── 000000000025.jpg
│   └── ...
└── val2017/
    ├── 000000000139.jpg
    ├── 000000000285.jpg
    └── ...
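
A quick way to confirm that a local copy matches this layout is to test for the four expected entries. The sketch below is a minimal check under the assumption that the dataset follows exactly the tree shown above; the function name `missing_coco_paths` is hypothetical.

```python
from pathlib import Path

# Entries expected under the dataset root, relative to it
EXPECTED_PATHS = (
    'annotations/instances_train2017.json',
    'annotations/instances_val2017.json',
    'train2017',
    'val2017',
)


def missing_coco_paths(data_root, expected=EXPECTED_PATHS):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == '__main__':
    missing = missing_coco_paths('data/coco')
    print('Layout OK' if not missing else f'Missing: {missing}')
```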

2. Complete training scripts

2.1 Training with MMYOLO

# Single-GPU training
python tools/train.py configs/yolov6/yolov6_s_syncbn_fast_8xb32-400e_coco.py

# Multi-GPU training (8 GPUs)
bash tools/dist_train.sh \
    configs/yolov6/yolov6_s_syncbn_fast_8xb32-400e_coco.py \
    8

# Set the work directory and select GPUs via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1,2,3 bash tools/dist_train.sh \
    configs/yolov6/yolov6_s_syncbn_fast_8xb32-400e_coco.py \
    4 \
    --work-dir work_dirs/yolov6_s

# Resume training from a checkpoint
python tools/train.py \
    configs/yolov6/yolov6_s_syncbn_fast_8xb32-400e_coco.py \
    --resume work_dirs/yolov6_s/epoch_100.pth

2.2 Custom training script

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Complete YOLOv6 training script.
Supports single-GPU and multi-GPU training.
"""

import argparse
import os
from pathlib import Path

from mmengine.config import Config, DictAction
from mmengine.runner import Runner
from mmengine.utils import digit_version
from mmengine.utils.dl_utils import TORCH_VERSION


def parse_args():
    parser = argparse.ArgumentParser(description='Train YOLOv6 detector')
    parser.add_argument('config', help='train config file path')
    parser.add_argument('--work-dir', help='the dir to save logs and models')
    parser.add_argument(
        '--resume',
        action='store_true',
        help='resume from the latest checkpoint in the work_dir automatically')
    parser.add_argument(
        '--amp',
        action='store_true',
        help='enable automatic-mixed-precision training')
    parser.add_argument(
        '--auto-scale-lr',
        action='store_true',
        help='enable automatically scaling LR.')
    parser.add_argument(
        '--seed',
        type=int,
        default=None,
        help='random seed')
    parser.add_argument(
        '--cfg-options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file. If the value to '
        'be overwritten is a list, it should be like key="[a,b]" or key=a,b '
        'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
        'Note that the quotation marks are necessary and that no white space '
        'is allowed.')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)

    return args


def main():
    args = parse_args()

    # Load the config file
    cfg = Config.fromfile(args.config)
    cfg.launcher = args.launcher
    if args.cfg_options is not None:
        cfg.merge_from_dict(args.cfg_options)

    # Set the work directory (defaults to ./work_dirs/<config name>)
    work_dir = args.work_dir
    if work_dir is None:
        work_dir = os.path.join('./work_dirs',
                                Path(args.config).stem)
    cfg.work_dir = work_dir

    # Resume from the latest checkpoint in work_dir
    if args.resume:
        cfg.resume = True
        cfg.load_from = None

    # Automatic mixed precision training
    if args.amp:
        if digit_version(TORCH_VERSION) < digit_version('1.6.0'):
            raise RuntimeError('AMP training needs PyTorch>=1.6')
        cfg.optim_wrapper.type = 'AmpOptimWrapper'
        cfg.optim_wrapper.setdefault('loss_scale', 'dynamic')

    # Automatically scale the learning rate
    if args.auto_scale_lr:
        if 'auto_scale_lr' in cfg:
            cfg.auto_scale_lr.enable = True
        else:
            raise RuntimeError('Can not find `auto_scale_lr` in the config.')

    # Set the random seed
    if args.seed is not None:
        cfg['randomness'] = dict(seed=args.seed)

    # Build the Runner
    runner = Runner.from_cfg(cfg)

    # Start training
    runner.train()

    # Run a final validation pass after training
    runner.val()


if __name__ == '__main__':
    main()

2.3 Training monitoring script

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Training monitoring script.
Records and plots training progress, loss values, learning rate, and validation mAP.
"""

import json
import time
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np


class TrainingMonitor:
    """Training monitor."""
    
    def __init__(self, log_dir='work_dirs/yolov6_s'):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)  # make sure the directory exists
        self.log_file = self.log_dir / 'training_log.json'
        self.logs = []
        
        # Create an empty log file on first use
        if not self.log_file.exists():
            with open(self.log_file, 'w') as f:
                json.dump([], f)
    
    def load_logs(self):
        """Load the saved log entries."""
        if self.log_file.exists():
            with open(self.log_file, 'r') as f:
                self.logs = json.load(f)
        return self.logs
    
    def save_log(self, epoch, loss_cls, loss_bbox, lr, mAP=None):
        """Append one log entry and write the log file."""
        log_entry = {
            'epoch': epoch,
            'loss_cls': float(loss_cls),
            'loss_bbox': float(loss_bbox),
            'loss_total': float(loss_cls + loss_bbox),
            'lr': float(lr),
            'mAP': float(mAP) if mAP is not None else None,
            'timestamp': time.time()
        }
        
        self.logs.append(log_entry)
        
        with open(self.log_file, 'w') as f:
            json.dump(self.logs, f, indent=2)
    
    def plot_training_curves(self, save_path=None):
        """Plot the training curves."""
        if not self.logs:
            print("No logs found!")
            return
        
        epochs = [log['epoch'] for log in self.logs]
        loss_cls = [log['loss_cls'] for log in self.logs]
        loss_bbox = [log['loss_bbox'] for log in self.logs]
        loss_total = [log['loss_total'] for log in self.logs]
        lrs = [log['lr'] for log in self.logs]
        maps = [log['mAP'] for log in self.logs if log['mAP'] is not None]
        map_epochs = [log['epoch'] for log in self.logs if log['mAP'] is not None]
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Loss curves
        axes[0, 0].plot(epochs, loss_cls, label='Loss Cls', alpha=0.7)
        axes[0, 0].plot(epochs, loss_bbox, label='Loss Bbox', alpha=0.7)
        axes[0, 0].plot(epochs, loss_total, label='Loss Total', alpha=0.7)
        axes[0, 0].set_xlabel('Epoch')
        axes[0, 0].set_ylabel('Loss')
        axes[0, 0].set_title('Training Loss')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # Learning rate curve
        axes[0, 1].plot(epochs, lrs, label='Learning Rate', color='green')
        axes[0, 1].set_xlabel('Epoch')
        axes[0, 1].set_ylabel('Learning Rate')
        axes[0, 1].set_title('Learning Rate Schedule')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
        axes[0, 1].set_yscale('log')
        
        # mAP curve
        if maps:
            axes[1, 0].plot(map_epochs, maps, label='mAP@0.5:0.95', 
                           marker='o', color='red')
            axes[1, 0].set_xlabel('Epoch')
            axes[1, 0].set_ylabel('mAP')
            axes[1, 0].set_title('Validation mAP')
            axes[1, 0].legend()
            axes[1, 0].grid(True, alpha=0.3)
        
        # Loss distribution
        axes[1, 1].hist(loss_total, bins=50, alpha=0.7, edgecolor='black')
        axes[1, 1].set_xlabel('Loss Value')
        axes[1, 1].set_ylabel('Frequency')
        axes[1, 1].set_title('Loss Distribution')
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')
        else:
            plt.savefig(self.log_dir / 'training_curves.png', 
                       dpi=300, bbox_inches='tight')
        
        plt.close()
        print(f"Training curves saved to {save_path or self.log_dir / 'training_curves.png'}")


# Usage example
if __name__ == '__main__':
    monitor = TrainingMonitor('work_dirs/yolov6_s')
    monitor.load_logs()
    monitor.plot_training_curves()
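
The monitor above only plots what is passed to `save_log`, so something has to supply per-epoch numbers. One option is to aggregate the JSON-lines scalar log that the training run writes. The sketch below averages per-iteration records into per-epoch values. The record keys (`epoch`, `loss`, `loss_cls`, `loss_bbox`, `lr`) and the idea that each line is one JSON object are assumptions about the log format; check a real log file before relying on them. The function name `epoch_means` is hypothetical.

```python
import json
from collections import defaultdict


def epoch_means(lines, keys=('loss', 'loss_cls', 'loss_bbox', 'lr')):
    """Average per-iteration training records into one entry per epoch."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip records without training fields (for example validation results)
        if 'epoch' not in record or 'loss' not in record:
            continue
        epoch = record['epoch']
        counts[epoch] += 1
        for key in keys:
            sums[epoch][key] += float(record.get(key, 0.0))
    return {
        epoch: {key: sums[epoch][key] / counts[epoch] for key in keys}
        for epoch in sorted(counts)
    }


if __name__ == '__main__':
    sample = [
        '{"epoch": 1, "loss": 2.0, "loss_cls": 1.0, "loss_bbox": 1.0, "lr": 0.01}',
        '{"epoch": 1, "loss": 4.0, "loss_cls": 2.0, "loss_bbox": 2.0, "lr": 0.01}',
    ]
    print(epoch_means(sample))
```

Each resulting entry can then be forwarded to `TrainingMonitor.save_log(epoch, loss_cls, loss_bbox, lr)`.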

3. Training steps in detail

3.1 Pre-training checks

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Pre-training check script.
Checks the dataset, the config file, and the GPUs.
"""

import os
import json
import torch
from pathlib import Path


def check_dataset(data_root, ann_file, img_prefix='train2017'):
    """Check the annotation file and the image directory."""
    print("=" * 50)
    print("Checking Dataset...")
    print("=" * 50)
    
    ann_path = Path(data_root) / ann_file
    if not ann_path.exists():
        raise FileNotFoundError(f"Annotation file not found: {ann_path}")
    
    with open(ann_path, 'r') as f:
        ann_data = json.load(f)
    
    print(f"✓ Dataset: {ann_data.get('info', {}).get('description', 'Unknown')}")
    print(f"✓ Images: {len(ann_data.get('images', []))}")
    print(f"✓ Annotations: {len(ann_data.get('annotations', []))}")
    print(f"✓ Categories: {len(ann_data.get('categories', []))}")
    
    # Check the image directory. COCO `file_name` entries carry no directory
    # component, so the image folder is passed in explicitly.
    if not ann_data.get('images'):
        raise ValueError(f"No images listed in {ann_path}")
    img_dir = Path(data_root) / img_prefix
    if not img_dir.is_dir():
        raise FileNotFoundError(f"Image directory not found: {img_dir}")
    first_image = img_dir / ann_data['images'][0]['file_name']
    if not first_image.exists():
        raise FileNotFoundError(f"Image file not found: {first_image}")
    
    print(f"✓ Image directory exists: {img_dir}")
    print()


def check_gpu():
    """Check GPU availability; return True if CUDA is usable."""
    print("=" * 50)
    print("Checking GPU...")
    print("=" * 50)
    
    if not torch.cuda.is_available():
        print("✗ CUDA is not available!")
        return False
    
    print(f"✓ CUDA available: {torch.cuda.is_available()}")
    print(f"✓ CUDA version: {torch.version.cuda}")
    print(f"✓ GPU count: {torch.cuda.device_count()}")
    
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"    Memory: {torch.cuda.get_device_properties(i).total_memory / 1024**3:.2f} GB")
    
    print()
    return True


def check_config(config_path):
    """Check the config file."""
    print("=" * 50)
    print("Checking Config...")
    print("=" * 50)
    
    if not Path(config_path).exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    
    from mmengine.config import Config
    cfg = Config.fromfile(config_path)
    
    print(f"✓ Config loaded: {config_path}")
    print(f"✓ Model: {cfg.model.type}")
    print(f"✓ Backbone: {cfg.model.backbone.type}")
    print(f"✓ Neck: {cfg.model.neck.type}")
    print(f"✓ Head: {cfg.model.bbox_head.type}")
    print(f"✓ Batch size: {cfg.train_batch_size_per_gpu}")
    print(f"✓ Learning rate: {cfg.base_lr}")
    print(f"✓ Max epochs: {cfg.max_epochs}")
    print()


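def main():
    # Check the dataset
    check_dataset('data/coco', 'annotations/instances_train2017.json')
    
    # Check the GPU (check_gpu returns False when CUDA is unavailable)
    gpu_ok = check_gpu() is not False
    
    # Check the config file
    check_config('configs/yolov6/yolov6_s_syncbn_fast_8xb32-400e_coco.py')
    
    print("=" * 50)
    if gpu_ok:
        print("All checks passed! Ready to train.")
    else:
        print("No usable GPU found; fix the CUDA setup before training.")
    print("=" * 50)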


if __name__ == '__main__':
    main()

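3.2 The training process in detail

Training flow:

1. Initialization
   ├─ Load the config file
   ├─ Build the model
   ├─ Initialize the optimizer
   ├─ Load the dataset
   └─ Set the training parameters

2. Warm-up stage (Epoch 0-3)
   ├─ Label assignment with ATSSAssigner
   ├─ Learning rate increases linearly
   └─ Model parameters warm up

3. Main training stage (Epoch 4 to N-15)
   ├─ Label assignment with TaskAlignedAssigner
   ├─ Mosaic data augmentation
   ├─ Cosine-annealing learning rate schedule
   └─ EMA update of the model parameters

4. Fine-tuning stage (last 15 epochs)
   ├─ Mosaic augmentation disabled
   ├─ LetterResize is used
   └─ More frequent validation

5. Validation stage
   ├─ Evaluate on the validation set
   ├─ Compute the mAP metrics
   └─ Save the best model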
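
The stage boundaries above follow from three config values: `initial_epoch` (4), `max_epochs` (400), and `num_last_epochs` (15). The sketch below maps an epoch index to the label assigner and the training pipeline that would be active. It assumes 0-based epoch indices and uses the hypothetical labels `stage1` and `stage2` for the two pipelines; it describes the schedule, not the framework's internal logic.

```python
def training_phase(epoch, max_epochs=400, num_last_epochs=15, initial_epoch=4):
    """Return (assigner, pipeline) for a 0-based epoch index."""
    if not 0 <= epoch < max_epochs:
        raise ValueError(f"epoch must be in [0, {max_epochs}), got {epoch}")
    # The warm-up epochs use ATSS; afterwards the task-aligned assigner takes over
    assigner = ('BatchATSSAssigner' if epoch < initial_epoch
                else 'BatchTaskAlignedAssigner')
    # The last num_last_epochs epochs switch to the pipeline without Mosaic
    pipeline = 'stage2' if epoch >= max_epochs - num_last_epochs else 'stage1'
    return assigner, pipeline


if __name__ == '__main__':
    for epoch in (0, 3, 4, 384, 385, 399):
        print(epoch, training_phase(epoch))
```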

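3.3 Post-training processing

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Post-training script.
Model conversion, performance analysis, and so on.
"""

import torch
from pathlib import Path
from mmengine.config import Config
# MMYOLO reuses MMDetection's high-level APIs; init_detector is aliased to
# init_model so the calls below stay unchanged.
from mmdet.apis import init_detector as init_model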


def convert_to_onnx(config_path, checkpoint_path, output_path, 
                    input_shape=(640, 640)):
    """Convert the model to ONNX format."""
    print(f"Converting {checkpoint_path} to ONNX...")
    
    # Load the model
    model = init_model(config_path, checkpoint_path, device='cpu')
    model.eval()
    
    # Create a dummy input
    dummy_input = torch.randn(1, 3, *input_shape)
    
    # Export to ONNX. The detector's default forward mode returns raw head
    # outputs, so the exported graph contains no box decoding or NMS.
    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={'input': {0: 'batch_size'},
                     'output': {0: 'batch_size'}},
        opset_version=11,
        do_constant_folding=True)
    
    print(f"✓ ONNX model saved to {output_path}")


def analyze_model(config_path, checkpoint_path):
    """Report parameter count, FLOPs, and checkpoint size."""
    print("=" * 50)
    print("Model Analysis")
    print("=" * 50)
    
    # Load the model
    model = init_model(config_path, checkpoint_path, device='cpu')
    model.eval()
    
    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"Total parameters: {total_params / 1e6:.2f}M")
    print(f"Trainable parameters: {trainable_params / 1e6:.2f}M")
    
    # Estimate FLOPs (requires fvcore)
    try:
        from fvcore.nn import FlopCountAnalysis
        dummy_input = torch.randn(1, 3, 640, 640)
        # total() returns the raw operation count for this input size
        total_flops = FlopCountAnalysis(model, dummy_input).total()
        print(f"FLOPs: {total_flops / 1e9:.2f}G")
    except ImportError:
        print("Install fvcore to calculate FLOPs: pip install fvcore")
    
    # Checkpoint size on disk
    checkpoint_size = Path(checkpoint_path).stat().st_size / 1024 / 1024
    print(f"Checkpoint size: {checkpoint_size:.2f}MB")


if __name__ == '__main__':
    config_path = 'configs/yolov6/yolov6_s_syncbn_fast_8xb32-400e_coco.py'
    checkpoint_path = 'work_dirs/yolov6_s/best.pth'
    
    # Analyze the model
    analyze_model(config_path, checkpoint_path)
    
    # Convert to ONNX
    convert_to_onnx(config_path, checkpoint_path, 'yolov6_s.onnx')

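Complete code examples

1. Model definition example

import torch
import torch.nn as nn
from mmyolo.models import YOLODetector
from mmyolo.models.backbones import YOLOv6EfficientRep
from mmyolo.models.necks import YOLOv6RepPAFPN
from mmyolo.models.dense_heads import YOLOv6Head
from mmyolo.utils import register_all_modules

# Register MMYOLO modules and set the default scope so the `type` strings below resolve
register_all_modules()

# Build the YOLOv6-S model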
model = YOLODetector(
    data_preprocessor=dict(
        type='YOLOv5DetDataPreprocessor',
        mean=[0., 0., 0.],
        std=[255., 255., 255.],
        bgr_to_rgb=True),
    backbone=dict(
        type='YOLOv6EfficientRep',
        deepen_factor=0.33,
        widen_factor=0.5,
        norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
        act_cfg=dict(type='ReLU', inplace=True)),
    neck=dict(
        type='YOLOv6RepPAFPN',
        deepen_factor=0.33,
        widen_factor=0.5,
        in_channels=[256, 512, 1024],
        out_channels=[128, 256, 512],
        num_csp_blocks=12,
        norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
        act_cfg=dict(type='ReLU', inplace=True)),
    bbox_head=dict(
        type='YOLOv6Head',
        head_module=dict(
            type='YOLOv6HeadModule',
            num_classes=80,
            in_channels=[128, 256, 512],
            widen_factor=0.5,
            norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg=dict(type='SiLU', inplace=True),
            featmap_strides=[8, 16, 32]),
        loss_cls=dict(
            type='mmdet.VarifocalLoss',
            use_sigmoid=True,
            alpha=0.75,
            gamma=2.0,
            iou_weighted=True,
            reduction='sum',
            loss_weight=1.0),
        loss_bbox=dict(
            type='IoULoss',
            iou_mode='giou',
            bbox_format='xyxy',
            reduction='mean',
            loss_weight=2.5,
            return_iou=False)),
    # train_cfg and test_cfg are detector-level arguments; the detector forwards them to the head
    train_cfg=dict(
        initial_epoch=4,
        initial_assigner=dict(
            type='BatchATSSAssigner',
            num_classes=80,
            topk=9,
            iou_calculator=dict(type='mmdet.BboxOverlaps2D')),
        assigner=dict(
            type='BatchTaskAlignedAssigner',
            num_classes=80,
            topk=13,
            alpha=1,
            beta=6)),
    test_cfg=dict(
        multi_label=True,
        nms_pre=30000,
        score_thr=0.001,
        nms=dict(type='nms', iou_threshold=0.65),
        max_per_img=300))
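
The `test_cfg` values (`score_thr`, the NMS `iou_threshold`, and `max_per_img`) control post-processing of the raw predictions. The sketch below shows how those three values interact in a plain greedy NMS. It is a simplified, class-agnostic illustration in pure Python: it ignores `multi_label` and `nms_pre`, and the function names are hypothetical rather than the library's implementation.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.65, score_thr=0.001, max_per_img=300):
    """Return indices of kept boxes, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_per_img:
                break
    return keep


if __name__ == '__main__':
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

The low default `score_thr` of 0.001 is typical for mAP evaluation; for visualization or deployment a higher score threshold is normally applied afterwards.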

2. Inference example

import torch
import cv2
import numpy as np
from mmdet.apis import inference_detector, init_detector
from mmdet.visualization import DetLocalVisualizer

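def inference_image(model, image_path, score_threshold=0.3):
    """Run inference on a single image."""
    # Read the image (OpenCV loads BGR)
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    
    # Inference
    results = inference_detector(model, image)
    
    # Visualization: the visualizer draws on RGB images
    visualizer = DetLocalVisualizer()
    visualizer.dataset_meta = model.dataset_meta
    
    visualizer.add_datasample(
        'result',
        cv2.cvtColor(image, cv2.COLOR_BGR2RGB),
        data_sample=results,
        draw_gt=False,
        show=False,
        pred_score_thr=score_threshold)
    # Convert back to BGR so cv2.imwrite saves the right colors
    vis_image = cv2.cvtColor(visualizer.get_image(), cv2.COLOR_RGB2BGR)
    
    return vis_image, results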


def inference_video(model, video_path, output_path, score_threshold=0.3):
    """Run inference on a video."""
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))
    
    visualizer = DetLocalVisualizer()
    visualizer.dataset_meta = model.dataset_meta
    
    frame_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        
        # Inference
        results = inference_detector(model, frame)
        
        # Visualization: draw on an RGB copy, then convert back to BGR for OpenCV
        visualizer.add_datasample(
            'result',
            cv2.cvtColor(frame, cv2.COLOR_BGR2RGB),
            data_sample=results,
            draw_gt=False,
            show=False,
            pred_score_thr=score_threshold)
        vis_frame = cv2.cvtColor(visualizer.get_image(), cv2.COLOR_RGB2BGR)
        
        out.write(vis_frame)
        frame_count += 1
        
        if frame_count % 100 == 0:
            print(f"Processed {frame_count} frames")
    
    cap.release()
    out.release()
    print(f"Video saved to {output_path}")


# Usage example
if __name__ == '__main__':
    # Initialize the model
    config_path = 'configs/yolov6/yolov6_s_syncbn_fast_8xb32-400e_coco.py'
    checkpoint_path = 'checkpoints/yolov6_s.pth'
    
    model = init_detector(
        config_path,
        checkpoint_path,
        device='cuda:0')
    
    # Single-image inference
    vis_image, results = inference_image(
        model, 
        'test_image.jpg',
        score_threshold=0.3)
    
    cv2.imwrite('result.jpg', vis_image)
    
    # Video inference
    inference_video(
        model,
        'input_video.mp4',
        'output_video.mp4',
        score_threshold=0.3)

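Summary and outlook

Core contributions of YOLOv6

Hardware-aware design: RepVGG-style re-parameterization keeps the accuracy benefit of multi-branch training while speeding up single-branch inference
Efficient architecture: EfficientRep and Rep-PAN substantially reduce computation while preserving accuracy
Optimized detection head: the Efficient Decoupled Head balances accuracy and speed
Advanced training strategy: TaskAlignedAssigner and Varifocal Loss improve detection accuracy

Use cases

Industrial inspection: high accuracy under real-time constraints
Edge devices: YOLOv6-N/T suit mobile deployment
Server deployment: YOLOv6-M/L suit accuracy-critical workloads
Real-time video analytics: high FPS meets real-time processing needs

Future directions

Model compression: further reduce model size and latency
Multi-task learning: extend to instance segmentation, keypoint detection, and related tasks
Self-supervised learning: reduce the dependence on labeled data
Transformer integration: combine with Vision Transformers to improve performance

Learning suggestions

Understand RepVGG: master the principle and implementation of re-parameterization
Get familiar with anchor-free detection: understand the design ideas behind anchor-free detectors
Master label assignment: understand the advantages of dynamic label assignment
Practice training: deepen your understanding by training models yourself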

References

Li, C., et al. "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications." arXiv preprint arXiv:2209.02976 (2022).
Ding, X., et al. "RepVGG: Making VGG-style ConvNets Great Again." CVPR 2021.
Ge, Z., et al. "YOLOX: Exceeding YOLO Series in 2021." arXiv preprint arXiv:2107.08430 (2021).
Feng, C., et al. "TOOD: Task-aligned One-stage Object Detection." ICCV 2021.
Zhang, H., et al. "VarifocalNet: An IoU-aware Dense Object Detector." CVPR 2021.
Gevorgyan, Z. "SIoU Loss: More Powerful Learning for Bounding Box Regression." arXiv preprint arXiv:2205.12740 (2022).