Evaluation Metrics for Object Detection (AABB)

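For object detection with AABBs (axis-aligned bounding boxes), the evaluation metrics are essentially the same as for generic object detection. They answer three questions:

Was the object detected at all?
Was its class predicted correctly?
Does the predicted box overlap the ground-truth box well enough?

The sections below follow the evaluation pipeline in order.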

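1. IoU: box overlap

The most basic metric is IoU (Intersection over Union):

IoU = \frac{\text{area}(\text{predicted box} \cap \text{ground-truth box})}{\text{area}(\text{predicted box} \cup \text{ground-truth box})}

Here both the predicted box and the GT (ground truth) box are AABBs.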
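When many predictions have to be compared against many GT boxes, the same formula can be evaluated for all pairs at once. The following is a minimal NumPy sketch, assuming boxes in `[x1, y1, x2, y2]` format; the function name `pairwise_iou_xyxy` is illustrative and not part of any library.

```python
import numpy as np


def pairwise_iou_xyxy(boxes_a, boxes_b):
    """IoU matrix between two sets of [x1, y1, x2, y2] boxes.

    Returns an array of shape (len(boxes_a), len(boxes_b)).
    """
    a = np.asarray(boxes_a, dtype=np.float64).reshape(-1, 4)
    b = np.asarray(boxes_b, dtype=np.float64).reshape(-1, 4)

    # Intersection corners via broadcasting: (N, 1, 2) against (1, M, 2)
    top_left = np.maximum(a[:, None, :2], b[None, :, :2])
    bottom_right = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]

    # Box areas, clipped so that degenerate boxes contribute zero area
    area_a = np.clip(a[:, 2] - a[:, 0], 0.0, None) * np.clip(a[:, 3] - a[:, 1], 0.0, None)
    area_b = np.clip(b[:, 2] - b[:, 0], 0.0, None) * np.clip(b[:, 3] - b[:, 1], 0.0, None)
    union = area_a[:, None] + area_b[None, :] - inter

    # Pairs with an empty union get IoU 0 instead of a division by zero
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)


iou = pairwise_iou_xyxy([[10, 10, 50, 50]], [[12, 12, 48, 48], [100, 100, 120, 120]])
print(iou)  # IoU is 0.81 with the first box and 0 with the second
```

The matrix form is convenient for matching, because each row holds the IoU of one prediction against every GT box in the image.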

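Purpose

IoU decides whether a predicted box counts as a TP (true positive).

For example, with a threshold of 0.5:

IoU ≥ 0.5: the box is considered matched
IoU < 0.5: the box is considered a miss

Common notation:

AP50: AP at an IoU threshold of 0.5
AP75: AP at an IoU threshold of 0.75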

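2. TP / FP / FN

In object detection, a TP usually requires both a correct class and a sufficient IoU.

TP (True Positive)

A prediction is a TP when:

its predicted class is correct
its IoU with some not-yet-matched GT is ≥ the threshold

FP (False Positive)

A prediction is an FP in any of these cases:

the box is poorly localized, so its IoU is below the threshold
the predicted class is wrong
it is a duplicate detection (when several predictions hit the same GT, only one counts as a TP and the rest are FPs)
it fires on background where there is no object

FN (False Negative)

A GT that no prediction was successfully matched to.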
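The rules above can be sketched as a greedy matcher for a single image. This is a minimal illustration under stated assumptions: predictions are `(box, class_id, score)` tuples, GTs are `(box, class_id)` tuples, and each prediction takes the best still-unmatched GT of its own class. The helper names are hypothetical, and real evaluation tools differ in details such as tie-breaking.

```python
def iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(preds, gts, iou_thresh=0.5):
    """Count TP / FP / FN for one image.

    preds: list of (box, class_id, score); gts: list of (box, class_id).
    """
    matched = [False] * len(gts)
    tp = fp = 0
    # Higher-scoring predictions get first pick of the GTs
    for box, cls, _score in sorted(preds, key=lambda p: p[2], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, (gt_box, gt_cls) in enumerate(gts):
            if matched[j] or gt_cls != cls:
                continue  # only unmatched GTs of the same class are candidates
            iou = iou_xyxy(box, gt_box)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thresh:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1  # poor localization, wrong class, duplicate, or background
    fn = matched.count(False)  # GTs that no prediction claimed
    return tp, fp, fn


gts = [([10, 10, 50, 50], 0), ([60, 60, 100, 100], 1)]
preds = [
    ([12, 12, 48, 48], 0, 0.9),      # good match for the first GT
    ([11, 11, 49, 49], 0, 0.6),      # duplicate of the same GT
    ([200, 200, 260, 260], 0, 0.3),  # fires on background
]
print(count_tp_fp_fn(preds, gts))  # (1, 2, 1)
```

The second GT has no prediction of its class, so it ends up as the single FN.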

3. Precision / Recall

Precision

Precision = \frac{TP}{TP+FP}

Of all predicted objects, the fraction that are real.

Recall

Recall = \frac{TP}{TP+FN}

Of all real objects, the fraction that were found.
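Both formulas translate directly into code. The sketch below assumes the counts are already available; returning 0.0 when a denominator is empty is one possible convention, chosen here only to keep the function total.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts.

    Returns 0.0 for a metric whose denominator is zero (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


p, r = precision_recall(tp=8, fp=4, fn=2)
print(f"precision={p:.3f}, recall={r:.3f}")  # precision=0.667, recall=0.800
```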

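4. PR curve

A detector normally attaches a confidence score to every predicted box.

Sweeping the confidence threshold from high to low yields a different pair of values at each threshold:

Precision
Recall

Plotting these pairs gives the PR curve (Precision-Recall curve).

It is more informative than a single Precision/Recall operating point.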
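In practice the sweep does not need an explicit loop over thresholds: sorting detections by score and taking cumulative sums gives one PR point per detection. A minimal sketch, assuming each detection has already been labeled as TP (1) or FP (0) by the matching step; `pr_curve` is an illustrative name.

```python
import numpy as np


def pr_curve(scores, is_tp, num_gt):
    """Precision/recall at every cut-off of the score-sorted detection list.

    scores: confidence of each detection
    is_tp:  1 if that detection was matched to a GT, else 0
    num_gt: total number of ground-truth objects
    """
    order = np.argsort(-np.asarray(scores, dtype=np.float64), kind="stable")
    tp = np.asarray(is_tp, dtype=np.float64)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    # After k detections, cum_tp + cum_fp == k, so the denominator is never zero
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / max(num_gt, 1)
    return precision, recall


precision, recall = pr_curve(
    scores=[0.6, 0.9, 0.7, 0.8], is_tp=[1, 1, 1, 0], num_gt=4
)
print(np.round(precision, 3))  # precision after 1, 2, 3, 4 detections
print(np.round(recall, 3))     # recall after 1, 2, 3, 4 detections
```

Lowering the threshold can only add detections, so recall never decreases along the curve, while precision can move in either direction.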

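5. AP: per-class average precision

AP (Average Precision) is the area under the PR curve of a single class.

It can be read as:

a single number summarizing the model's precision-recall behavior across all confidence thresholds

Common AP variants

AP50: AP at IoU = 0.5
AP75: AP at IoU = 0.75
AP@[0.5:0.95]: AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05

The last one is stricter and is the primary metric commonly used for COCO.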
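There is more than one way to turn the PR curve into an area. One option, in the style of COCO-type evaluation, samples the interpolated precision at 101 evenly spaced recall levels and averages the samples. The sketch below is an approximation under that assumption and is not a reimplementation of any specific tool; `ap_101_point` is a hypothetical name, and the input is assumed to be a PR curve ordered by descending score (so recall is non-decreasing).

```python
import numpy as np


def ap_101_point(recall, precision):
    """101-point interpolated AP from a PR curve sorted by descending score."""
    recall = np.asarray(recall, dtype=np.float64)
    precision = np.asarray(precision, dtype=np.float64).copy()
    if recall.size == 0:
        return 0.0

    # Precision envelope: precision at recall r becomes the best precision
    # achieved at any recall >= r
    for i in range(len(precision) - 1, 0, -1):
        precision[i - 1] = max(precision[i - 1], precision[i])

    # Sample the envelope at recall = 0.00, 0.01, ..., 1.00
    thresholds = np.linspace(0.0, 1.0, 101)
    idx = np.searchsorted(recall, thresholds, side="left")

    sampled = np.zeros_like(thresholds)
    reachable = idx < len(precision)  # unreachable recall levels keep precision 0
    sampled[reachable] = precision[idx[reachable]]
    return float(sampled.mean())


print(ap_101_point(recall=[0.5, 1.0], precision=[1.0, 1.0]))  # 1.0 for a perfect detector
```

Recall levels the detector never reaches contribute zero precision, which is why a model that finds only half of the objects cannot score much above 0.5 even with perfect precision.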

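6. mAP: mean average precision over classes

mAP (mean Average Precision) is the mean of the AP values over all classes.

With C classes:

mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i

Common notation

mAP@0.5
mAP@0.75
mAP@[0.5:0.95]

Dataset conventions

VOC style

Typically reports:

mAP@0.5

COCO style

Typically reports:

mAP@[0.5:0.95]
Also reported alongside it:
- AP50
- AP75
- APs / APm / APl (small / medium / large objects)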

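7. AR: average recall

COCO evaluation also commonly reports AR (Average Recall).

It measures the model's recall when the number of detections per image is capped.

For example:

AR@1
AR@10
AR@100

These are the average recall when at most 1 / 10 / 100 boxes are kept per image.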
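The capping step can be sketched on its own: keep the K highest-scoring predictions of each image, then compute recall on what remains. This is a minimal illustration that assumes predictions are dicts with `image_id` and `score` keys and applies the cap per image as described above; whether a given evaluation tool caps per image or per image and class is something to check in that tool. `keep_top_k_per_image` is a hypothetical helper.

```python
from collections import defaultdict


def keep_top_k_per_image(preds, k):
    """Keep at most k highest-scoring predictions for each image."""
    by_image = defaultdict(list)
    for p in preds:
        by_image[p["image_id"]].append(p)

    kept = []
    for image_preds in by_image.values():
        image_preds.sort(key=lambda p: p["score"], reverse=True)
        kept.extend(image_preds[:k])
    return kept


preds = [
    {"image_id": "a", "score": 0.2},
    {"image_id": "a", "score": 0.9},
    {"image_id": "a", "score": 0.5},
    {"image_id": "b", "score": 0.4},
]
top2 = keep_top_k_per_image(preds, k=2)
print(sorted((p["image_id"], p["score"]) for p in top2))
# [('a', 0.5), ('a', 0.9), ('b', 0.4)]
```

Recall computed on the filtered list then corresponds to the "at most K boxes per image" setting.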

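8. Small / medium / large object metrics

For AABB detection, and COCO evaluation in particular, AP is often broken down by object size:

APs: small objects
APm: medium objects
APl: large objects

This shows how well the model handles objects of different sizes.

It matters a lot for retail shelves, small merchandise, and distant objects.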
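To compute size-specific metrics, each GT first has to be assigned to a size bucket. The sketch below uses the COCO area boundaries of 32² and 96² pixels, with two assumptions of its own: it measures the area of the box itself (COCO uses the annotated object area), and the handling of values exactly on a boundary is a choice made here.

```python
SMALL_MAX_AREA = 32 ** 2   # below this area (in pixels) an object is "small"
MEDIUM_MAX_AREA = 96 ** 2  # below this area it is "medium", otherwise "large"


def size_bucket(box):
    """Classify an [x1, y1, x2, y2] box as 'small', 'medium' or 'large' by its area."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


for box in ([0, 0, 20, 20], [0, 0, 50, 50], [0, 0, 100, 100]):
    print(box, "->", size_bucket(box))
# [0, 0, 20, 20] -> small
# [0, 0, 50, 50] -> medium
# [0, 0, 100, 100] -> large
```

Evaluating AP separately on the GTs of each bucket then gives the three size-specific numbers.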

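9. Speed metrics

Besides accuracy, inference speed is usually reported as well:

Latency: inference time for a single image
FPS: frames per second
Throughput: images processed per second in batched inference

In engineering practice these are typically considered together:

mAP
Recall
FPS / Latency
model size
GPU memory / RAM usage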
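Latency and FPS can be measured with a simple timing loop. The sketch below is a rough illustration: `dummy_infer` is a hypothetical stand-in for a real detector, the warm-up count is an arbitrary choice, and the loop times one input per call on the CPU. Frameworks that run work asynchronously on a GPU generally need an explicit synchronization before the clock is read, which is not shown here.

```python
import time


def benchmark(infer_fn, inputs, warmup=3):
    """Measure mean latency (ms) and FPS of infer_fn, calling it once per input."""
    if not inputs:
        raise ValueError("inputs must not be empty")

    for x in inputs[:warmup]:
        infer_fn(x)  # warm-up calls are not timed

    start = time.perf_counter()
    for x in inputs:
        infer_fn(x)
    elapsed = time.perf_counter() - start

    latency_ms = 1000.0 * elapsed / len(inputs)
    fps = len(inputs) / elapsed if elapsed > 0 else float("inf")
    return latency_ms, fps


def dummy_infer(image):
    """Hypothetical stand-in for a detector; it only burns a little CPU time."""
    return sum(image)


latency_ms, fps = benchmark(dummy_infer, [list(range(1000))] * 50)
print(f"latency = {latency_ms:.4f} ms/image, FPS = {fps:.1f}")
```

With batch size 1, FPS is simply the reciprocal of the mean latency; throughput for larger batches has to be measured separately.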

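10. The core metrics for AABB detection

If you only want the essentials, these are usually enough:

Academic / public benchmarks

mAP@[0.5:0.95]
AP50
AP75

Production deployment

Precision
Recall
F1
mAP@0.5
miss rate / false-detection rate
inference latency

where F1 is:

F1 = \frac{2PR}{P+R}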
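A common practical use of F1 is choosing the confidence threshold to deploy with. The sketch below assumes aligned arrays sorted by descending score, where `precision[i]` and `recall[i]` are the values obtained when keeping every detection with a score of at least `scores[i]`; `best_f1_threshold` is an illustrative name.

```python
import numpy as np


def best_f1_threshold(scores, precision, recall):
    """Return (threshold, F1) for the PR point that maximizes F1."""
    p = np.asarray(precision, dtype=np.float64)
    r = np.asarray(recall, dtype=np.float64)
    denom = p + r
    # F1 is defined as 0 where both precision and recall are 0
    f1 = np.where(denom > 0, 2 * p * r / np.maximum(denom, 1e-12), 0.0)
    best = int(np.argmax(f1))
    return float(np.asarray(scores)[best]), float(f1[best])


threshold, f1 = best_f1_threshold(
    scores=[0.9, 0.8, 0.7, 0.6],
    precision=[1.0, 0.5, 2 / 3, 0.75],
    recall=[0.25, 0.25, 0.5, 0.75],
)
print(threshold, round(f1, 3))  # 0.6 0.75
```

Maximizing F1 weights misses and false detections equally; when one kind of error is more costly, a different operating point on the PR curve may be preferable.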

11. A worked example

Suppose an image contains 10 real objects:

the model predicts 12 boxes
8 of them have the correct class and IoU ≥ 0.5 → TP = 8
4 are spurious, duplicate, or misclassified boxes → FP = 4
2 real objects are not detected → FN = 2

Then:

Precision = \frac{8}{12} \approx 0.667

Recall = \frac{8}{10} = 0.8

F1 = \frac{2 \times 0.667 \times 0.8}{0.667+0.8} \approx 0.727

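12. Things to watch in AABB evaluation

1) Matching rule

Each GT can normally match only one prediction, and predictions are matched in descending score order.

2) Class matters

A high IoU alone does not make a TP; the class must also be correct.

3) NMS affects the evaluation

An NMS IoU threshold that is too low can suppress valid boxes and cause misses; one that is too high leaves more duplicate boxes.

4) Small objects are harder

For a small object, even a slight shift of the AABB can lower the IoU noticeably.

5) Class imbalance

When a class has very few samples, its AP, and therefore mAP, can be unstable.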
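Point 4 can be made concrete with a little arithmetic. For a square box of side s shifted by d pixels along one axis, the intersection is (s - d) * s and the union is 2 * s * s - (s - d) * s, so IoU = (s - d) / (s + d). The sketch below evaluates this for a few box sizes; the square shape and the single-axis shift are simplifying assumptions.

```python
def iou_after_shift(side, shift):
    """IoU between a side x side box and the same box shifted by `shift` pixels along x."""
    overlap_w = max(0.0, side - shift)
    inter = overlap_w * side
    union = 2 * side * side - inter
    return inter / union


for side in (10, 30, 100):
    print(f"side={side:>3}px, shift=2px -> IoU={iou_after_shift(side, 2):.3f}")
# side= 10px, shift=2px -> IoU=0.667
# side= 30px, shift=2px -> IoU=0.875
# side=100px, shift=2px -> IoU=0.961
```

The same 2-pixel error leaves a 100-pixel box above an IoU threshold of 0.95, but pushes a 10-pixel box below 0.75, which is why the stricter thresholds in AP@[0.5:0.95] weigh so heavily on small objects.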

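13. Minimal Python implementation

A directly runnable evaluation example for AABB object detection, covering:

IoU
per-class TP / FP matching
Precision / Recall
AP
mAP

It is a minimal working implementation in pure Python + NumPy, in the VOC style (mAP at a fixed IoU threshold), meant as a base to plug your own data structures into and extend.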


import numpy as np
from collections import defaultdict


def box_iou_xyxy(box1, box2):
    """
    AABB IoU, box format: [x1, y1, x2, y2]
    """
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    inter_w = max(0.0, x2 - x1)
    inter_h = max(0.0, y2 - y1)
    inter = inter_w * inter_h

    area1 = max(0.0, box1[2] - box1[0]) * max(0.0, box1[3] - box1[1])
    area2 = max(0.0, box2[2] - box2[0]) * max(0.0, box2[3] - box2[1])

    union = area1 + area2 - inter
    if union <= 0:
        return 0.0
    return inter / union


def compute_ap_voc_style(recall, precision, use_07_metric=False):
    """
    Compute AP.
    - use_07_metric=True: VOC2007 11-point AP
    - use_07_metric=False: the more common area-under-curve (integral) AP
    """
    recall = np.asarray(recall, dtype=np.float64)
    precision = np.asarray(precision, dtype=np.float64)

    if use_07_metric:
        ap = 0.0
        for t in np.arange(0.0, 1.1, 0.1):
            if np.sum(recall >= t) == 0:
                p = 0.0
            else:
                p = np.max(precision[recall >= t])
            ap += p / 11.0
        return ap

    # Integral (area-under-curve) AP
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(mpre) - 1, 0, -1):
        mpre[i - 1] = max(mpre[i - 1], mpre[i])

    # Integrate only where recall changes
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    ap = np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1])
    return ap


def evaluate_class(
    preds,
    gts,
    class_id,
    iou_thresh=0.5,
    use_07_metric=False
):
    """
    Evaluate a single class.

    Parameters
    ----------
    preds: list of dict
        Each element:
        {
            "image_id": str/int,
            "category_id": int,
            "bbox": [x1, y1, x2, y2],
            "score": float
        }

    gts: list of dict
        Each element:
        {
            "image_id": str/int,
            "category_id": int,
            "bbox": [x1, y1, x2, y2]
        }

    Returns
    -------
    result: dict
        {
            "ap": float,
            "precision": np.ndarray,
            "recall": np.ndarray,
            "tp": np.ndarray,
            "fp": np.ndarray,
            "num_gt": int,
            "sorted_preds": list of dict
        }
    """
    # Keep only this class
    cls_preds = [p for p in preds if p["category_id"] == class_id]
    cls_gts = [g for g in gts if g["category_id"] == class_id]

    # Group GT boxes by image_id
    gt_by_image = defaultdict(list)
    for g in cls_gts:
        gt_by_image[g["image_id"]].append(g["bbox"])

    # Track whether each GT in each image has already been matched
    matched = {
        img_id: np.zeros(len(boxes), dtype=bool)
        for img_id, boxes in gt_by_image.items()
    }

    num_gt = len(cls_gts)

    # Sort predictions by score, highest first
    cls_preds = sorted(cls_preds, key=lambda x: x["score"], reverse=True)

    tp = np.zeros(len(cls_preds), dtype=np.float64)
    fp = np.zeros(len(cls_preds), dtype=np.float64)

    for i, pred in enumerate(cls_preds):
        image_id = pred["image_id"]
        pred_box = pred["bbox"]

        gt_boxes = gt_by_image.get(image_id, [])

        if len(gt_boxes) == 0:
            fp[i] = 1.0
            continue

        ious = np.array([box_iou_xyxy(pred_box, gt_box) for gt_box in gt_boxes], dtype=np.float64)
        max_iou_idx = int(np.argmax(ious))
        max_iou = ious[max_iou_idx]

        if max_iou >= iou_thresh:
            if not matched[image_id][max_iou_idx]:
                tp[i] = 1.0
                matched[image_id][max_iou_idx] = True
            else:
                # Duplicate detection: each GT can be matched only once
                fp[i] = 1.0
        else:
            fp[i] = 1.0

    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)

    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    recall = cum_tp / max(num_gt, 1)

    ap = 0.0 if num_gt == 0 else compute_ap_voc_style(
        recall, precision, use_07_metric=use_07_metric
    )

    return {
        "ap": ap,
        "precision": precision,
        "recall": recall,
        "tp": tp,
        "fp": fp,
        "num_gt": num_gt,
        "sorted_preds": cls_preds,
    }


def evaluate_map(
    preds,
    gts,
    class_ids=None,
    iou_thresh=0.5,
    use_07_metric=False
):
    """
    Evaluate mAP@iou_thresh.
    """
    if class_ids is None:
        pred_classes = {p["category_id"] for p in preds}
        gt_classes = {g["category_id"] for g in gts}
        class_ids = sorted(pred_classes | gt_classes)

    per_class = {}
    aps = []

    for cid in class_ids:
        result = evaluate_class(
            preds=preds,
            gts=gts,
            class_id=cid,
            iou_thresh=iou_thresh,
            use_07_metric=use_07_metric
        )
        per_class[cid] = result

        # Common practice: classes with no GT are skipped and left out of mAP
        if result["num_gt"] > 0:
            aps.append(result["ap"])

    mAP = float(np.mean(aps)) if len(aps) > 0 else 0.0

    return {
        "mAP": mAP,
        "per_class": per_class,
        "class_ids": class_ids,
        "iou_thresh": iou_thresh,
    }


def evaluate_map_range(
    preds,
    gts,
    class_ids=None,
    iou_thresholds=None,
    use_07_metric=False
):
    """
    COCO-like evaluation:
    average mAP over several IoU thresholds,
    e.g. iou_thresholds=np.arange(0.5, 1.0, 0.05)
    """
    if iou_thresholds is None:
        iou_thresholds = np.arange(0.5, 1.0, 0.05)

    results = {}
    map_list = []

    for thr in iou_thresholds:
        r = evaluate_map(
            preds=preds,
            gts=gts,
            class_ids=class_ids,
            iou_thresh=float(thr),
            use_07_metric=use_07_metric
        )
        results[float(thr)] = r
        map_list.append(r["mAP"])

    mean_map = float(np.mean(map_list)) if len(map_list) > 0 else 0.0

    return {
        "mAP_range": mean_map,
        "per_iou": results,
        "iou_thresholds": list(map(float, iou_thresholds))
    }


if __name__ == "__main__":
    # -----------------------------
    # Example GT
    # -----------------------------
    gts = [
        {"image_id": "img1", "category_id": 0, "bbox": [10, 10, 50, 50]},
        {"image_id": "img1", "category_id": 1, "bbox": [60, 60, 100, 100]},
        {"image_id": "img2", "category_id": 0, "bbox": [15, 15, 40, 40]},
        {"image_id": "img2", "category_id": 1, "bbox": [100, 100, 160, 160]},
    ]

    # -----------------------------
    # Example predictions
    # -----------------------------
    preds = [
        # img1, class0: one correct detection
        {"image_id": "img1", "category_id": 0, "bbox": [12, 12, 48, 48], "score": 0.95},

        # img1, class1: one correct detection
        {"image_id": "img1", "category_id": 1, "bbox": [58, 58, 102, 102], "score": 0.90},

        # img2, class0: one correct detection
        {"image_id": "img2", "category_id": 0, "bbox": [14, 14, 39, 39], "score": 0.88},

        # img2, class1: a prediction with a large offset
        {"image_id": "img2", "category_id": 1, "bbox": [110, 110, 170, 170], "score": 0.80},

        # img2, class1: a duplicate / competing box
        {"image_id": "img2", "category_id": 1, "bbox": [101, 101, 159, 159], "score": 0.70},

        # img1, class0: duplicate box, should be an FP
        {"image_id": "img1", "category_id": 0, "bbox": [11, 11, 49, 49], "score": 0.60},

        # False detection on background
        {"image_id": "img1", "category_id": 0, "bbox": [200, 200, 260, 260], "score": 0.30},
    ]

    print("==== IoU demo ====")
    iou_demo = box_iou_xyxy([10, 10, 50, 50], [12, 12, 48, 48])
    print("IoU =", round(iou_demo, 4))

    print("\n==== mAP@0.5 ====")
    result_05 = evaluate_map(preds, gts, class_ids=[0, 1], iou_thresh=0.5, use_07_metric=False)
    print("mAP@0.5 =", round(result_05["mAP"], 4))

    for cid, r in result_05["per_class"].items():
        print(f"class {cid}: AP={r['ap']:.4f}, num_gt={r['num_gt']}")
        print("  precision:", np.round(r["precision"], 4))
        print("  recall   :", np.round(r["recall"], 4))
        print("  tp       :", r["tp"].astype(int))
        print("  fp       :", r["fp"].astype(int))

    print("\n==== mAP@[0.5:0.95] approx ====")
    result_range = evaluate_map_range(
        preds,
        gts,
        class_ids=[0, 1],
        iou_thresholds=np.arange(0.5, 1.0, 0.05),
        use_07_metric=False
    )
    print("mAP@[0.5:0.95] =", round(result_range["mAP_range"], 4))

    print("\nPer IoU mAP:")
    for thr, r in result_range["per_iou"].items():
        print(f"  IoU={thr:.2f}: mAP={r['mAP']:.4f}")

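14. Summary

Evaluating AABB object detection comes down to:

using IoU to decide whether a box is aligned with the ground truth
using Precision / Recall to measure detection quality
using AP / mAP as the overall score
and, in production, adding F1, miss rate, false-detection rate, and latency/FPS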