Locating and Cleaning Up Corrupt JPEG Images in an OpenCV Detection Pipeline

When running detection over a batch of images, the console may show log lines like these:

Corrupt JPEG data: 53 extraneous bytes before marker 0xd9
Corrupt JPEG data: premature end of data segment

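These messages are usually not exceptions raised by Python. They are warnings that the JPEG decoding library underneath OpenCV writes to stderr. In many cases cv2.imread() still returns an image, but the image data is already truncated, contains stray bytes, or was written incompletely. For object detection, such images can cause abnormal bounding boxes, missed detections, and fluctuating confidence scores, and can even affect the stability of batch processing.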
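Because the message is written to stderr rather than raised, a try/except around the read call never sees it. One way to observe it is to run the work in a child process and capture that process's stderr stream. The following is a minimal sketch of the mechanism only: `CHILD_CODE` and `run_and_capture` are hypothetical names, and the child merely simulates a decoder by writing the warning text from the log above to file descriptor 2. No image is decoded.

```python
import subprocess
import sys

# Stand-in for the decoder (an assumption for illustration): the child writes
# a warning straight to file descriptor 2 and still exits with code 0.
CHILD_CODE = r"""
import os
os.write(2, b"Corrupt JPEG data: premature end of data segment\n")
"""


def run_and_capture(code: str):
    """Run `code` in a fresh interpreter and return (return code, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.returncode, proc.stderr.decode("utf-8", errors="replace")


if __name__ == "__main__":
    returncode, stderr_text = run_and_capture(CHILD_CODE)
    # The child reports success, yet the warning is visible to the parent.
    print(returncode, repr(stderr_text))
```

The point of the sketch is that a zero return code and a non-empty stderr can occur together, so both need to be inspected.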

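Causes

Common causes include:

1. Image capture or network transfer did not finish, so only part of the file was saved.
2. The program exited, the disk failed, or the process was killed while the file was being written.
3. The JPEG stream contains unexpected extra bytes ahead of the end-of-image marker, which triggers extraneous bytes before marker 0xd9.
4. A JPEG data segment ends early, which triggers premature end of data segment.
5. On Windows, reading a path that contains Chinese characters directly with cv2.imread(path) can make a healthy image look like a read failure.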

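The fifth point matters: if the image path contains Chinese characters, the result of cv2.imread() alone cannot tell you whether the image is corrupt. A more robust approach is: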

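import cv2
import numpy as np

# `path` is the image file path; it may contain non-ASCII characters.
data = np.fromfile(path, dtype=np.uint8)
img = cv2.imdecode(data, cv2.IMREAD_COLOR)  # None if the bytes cannot be decoded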

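With this approach Python (through NumPy) opens the file, so a Chinese path is handled correctly, and OpenCV is only asked to decode the image content.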
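Plain Python file I/O copes with such paths as well, so a cheap structural pre-check can run before any decoding. The sketch below is not part of the original script. `jpeg_marker_check` is a hypothetical helper that only tests whether a file starts with the JPEG start-of-image marker (FF D8) and ends with the end-of-image marker (FF D9). Treat its result as a hint rather than a verdict: a file can pass and still be damaged in the middle, and some valid files carry extra bytes after the final marker.

```python
from pathlib import Path

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_marker_check(path):
    """Return (has_soi, has_eoi) from the first and last two bytes of the file."""
    with Path(path).open("rb") as f:
        head = f.read(2)
        f.seek(0, 2)  # move to the end of the file to learn its size
        size = f.tell()
        if size < 4:
            # Too short to hold both markers.
            return head == SOI, False
        f.seek(-2, 2)  # two bytes before the end
        tail = f.read(2)
    return head == SOI, tail == EOI
```

A result of (True, False) is typical of a file that was cut off while being written, which makes this a quick way to shortlist candidates before the slower decode-based scan.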

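Approach

Work through the following steps in order:

1. Scan the image directory first and only generate the bad-image report, without modifying any files.
2. Manually review bad_images.csv to confirm that the flagged files really are bad.
3. On the first pass, move the files to a quarantine directory rather than deleting them.
4. Once everything checks out, decide whether to delete.
5. If an image has a same-name LabelMe JSON file next to it, use --with-json to move or delete the two together.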
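Moving instead of deleting keeps the first pass reversible. The script in the next section preserves each file's path relative to the scanned directory when it moves a file, so undoing a move amounts to walking the quarantine directory and moving everything back. The helper below is a hypothetical sketch of that undo step, not something the original script provides. It assumes the quarantine directory contains only files put there by the move step.

```python
import shutil
from pathlib import Path


def restore_from_quarantine(quarantine_root, source_root):
    """Move every quarantined file back to the same relative path under source_root."""
    quarantine_root = Path(quarantine_root)
    source_root = Path(source_root)
    restored = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(quarantine_root.rglob("*")):
        if not path.is_file():
            continue
        target = source_root / path.relative_to(quarantine_root)
        if target.exists():
            # Never overwrite: a newer file already sits at the original location.
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))
        restored.append(target)
    return restored
```

Files whose original location is already occupied are left in quarantine, so the returned list shows exactly what was put back.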

Full code

Save the following code as find_bad_images.py:

import argparse
import csv
import shutil
import subprocess
import sys
from pathlib import Path


IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".webp"}
# Lowercase fragments of decoder messages; matched against lowercased stderr.
JPEG_WARNING_PATTERNS = (
    "corrupt jpeg data",
    "premature end of data segment",
    "extraneous bytes before marker",
    "invalid sos parameters",
    "bad huffman code",
    "unsupported marker type",
)

# Source run in a child interpreter; exit code 2 means the file could not be decoded.
CHECK_CODE = r"""
import sys
import cv2
import numpy as np

data = np.fromfile(sys.argv[1], dtype=np.uint8)
if data.size == 0:
    print("image file is empty or unreadable", file=sys.stderr)
    sys.exit(2)

img = cv2.imdecode(data, cv2.IMREAD_COLOR)
if img is None:
    print("cv2.imdecode returned None", file=sys.stderr)
    sys.exit(2)
sys.exit(0)
"""


def parse_args():
    parser = argparse.ArgumentParser(description="Find JPEG/images that trigger OpenCV decode warnings.")
    parser.add_argument("--image-dir", default="../Car", help="Directory to scan.")
    parser.add_argument("--recursive", action="store_true", help="Scan image-dir recursively.")
    parser.add_argument("--report", default="bad_images.csv", help="CSV report path.")
    parser.add_argument("--move-bad", default=None, help="Move bad images to this directory instead of deleting.")
    parser.add_argument("--delete", action="store_true", help="Delete bad images. Use only after checking the report.")
    parser.add_argument("--with-json", action="store_true", help="Also move/delete same-stem LabelMe json files.")
    parser.add_argument("--any-stderr", action="store_true", help="Treat any decoder stderr output as bad.")
    return parser.parse_args()


def iter_images(image_dir: Path, recursive: bool):
    pattern = "**/*" if recursive else "*"
    for path in sorted(image_dir.glob(pattern)):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            yield path


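def check_image(path: Path, any_stderr: bool):
    """Decode one image in a child process and return (ok, reason)."""
    # A separate interpreter is used so that anything the decoder writes to
    # stderr lands in a pipe this script can read.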
    proc = subprocess.run(
        [sys.executable, "-c", CHECK_CODE, str(path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    stderr = proc.stderr.strip()
    stderr_lower = stderr.lower()
    has_known_warning = any(pattern in stderr_lower for pattern in JPEG_WARNING_PATTERNS)

    if proc.returncode != 0:
        return False, stderr or f"decode failed with return code {proc.returncode}"
    if has_known_warning:
        return False, stderr
    if any_stderr and stderr:
        return False, stderr
    return True, ""


def related_files(image_path: Path, include_json: bool):
    paths = [image_path]
    if include_json:
        json_path = image_path.with_suffix(".json")
        if json_path.exists():
            paths.append(json_path)
    return paths


def move_files(paths, source_root: Path, target_root: Path):
    """Move files under target_root, keeping their path relative to source_root."""
    moved = []
    for path in paths:
        relative_path = path.relative_to(source_root)
        target_path = target_root / relative_path
        target_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_path))
        moved.append(str(target_path))
    return moved


def delete_files(paths):
    deleted = []
    for path in paths:
        path.unlink()
        deleted.append(str(path))
    return deleted


def main():
    args = parse_args()
    image_dir = Path(args.image_dir).resolve()
    if not image_dir.is_dir():
        raise FileNotFoundError(f"image-dir does not exist or is not a directory: {image_dir}")
    if args.delete and args.move_bad:
        raise ValueError("--delete and --move-bad cannot be used together")

    report_path = Path(args.report)
    bad_rows = []
    total = 0

    for image_path in iter_images(image_dir, args.recursive):
        total += 1
        ok, reason = check_image(image_path, args.any_stderr)
        if ok:
            continue

        action = "report"
        affected = []
        files = related_files(image_path, args.with_json)
        if args.move_bad:
            action = "move"
            affected = move_files(files, image_dir, Path(args.move_bad).resolve())
        elif args.delete:
            action = "delete"
            affected = delete_files(files)

        bad_rows.append([str(image_path), reason.replace("\n", " | "), action, ";".join(affected)])
        print(f"BAD: {image_path}")
        print(f"  reason: {reason}")
        if affected:
            print(f"  {action}: {affected}")

    with report_path.open("w", newline="", encoding="utf-8") as report_file:
        writer = csv.writer(report_file)
        writer.writerow(["image_path", "reason", "action", "affected_files"])
        writer.writerows(bad_rows)

    print("")
    print(f"Scanned images: {total}")
    print(f"Bad images: {len(bad_rows)}")
    print(f"Report: {report_path.resolve()}")
    if bad_rows and not args.delete and not args.move_bad:
        print("No files were changed. Re-run with --move-bad bad_images or --delete after checking the report.")


if __name__ == "__main__":
    main()

Usage

Scan only, without modifying any images:

python find_bad_images.py --image-dir "D:\数据集\车辆图片" --report bad_images.csv

Scan subdirectories recursively:

python find_bad_images.py --image-dir "D:\数据集\车辆图片" --recursive --report bad_images.csv

Recommended first pass: move the bad images into a quarantine directory:

python find_bad_images.py --image-dir "D:\数据集\车辆图片" --move-bad bad_images --with-json

Once you have confirmed that the bad images have no value, delete them directly:

python find_bad_images.py --image-dir "D:\数据集\车辆图片" --delete --with-json

To treat any stderr output from the OpenCV decode step as a problem, add --any-stderr:

python find_bad_images.py --image-dir "D:\数据集\车辆图片" --any-stderr --report bad_images.csv

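Report fields

bad_images.csv contains the following fields:

Field	Description
`image_path`	Path of the image flagged as bad
`reason`	Problem reported by the OpenCV/JPEG decoder
`action`	Action taken in this run: `report`, `move`, or `delete`
`affected_files`	Paths of the files that were moved (their new location) or deleted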
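When the report is long, grouping rows by `reason` makes the manual review faster, because files with the same decoder message can often be judged together. The function below is a small sketch for that purpose. `summarize_report` is a hypothetical name, and the code assumes only the four column headers listed above and UTF-8 encoding, which is what the script writes.

```python
import csv
from collections import Counter


def summarize_report(report_path):
    """Count report rows per reason string, most frequent first."""
    with open(report_path, newline="", encoding="utf-8") as f:
        reasons = Counter(row["reason"] for row in csv.DictReader(f))
    return reasons.most_common()


if __name__ == "__main__":
    import os

    # Print a summary only if a report exists in the current directory.
    if os.path.exists("bad_images.csv"):
        for reason, count in summarize_report("bad_images.csv"):
            print(f"{count:6d}  {reason}")
```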

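Notes

If the path contains Chinese characters, use the np.fromfile + cv2.imdecode combination from the code in this article; do not rely on cv2.imread(path) to decide whether an image is bad.
On the first pass, use --move-bad rather than going straight to --delete.
If you have a report generated by an older version of the script, delete the old bad_images.csv and rescan.
--with-json suits LabelMe datasets: it handles the same-name .json annotation file together with the image.
If the images come from a camera, a network request, or an asynchronous disk-writing flow, check the upstream file-writing logic as well, so that a file does not enter the detection pipeline before it has been fully written.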
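One common way to keep half-written files out of the pipeline is to write to a temporary name first and rename only when the data is complete. The sketch below illustrates that pattern under two assumptions: the producer is Python code you control, and the temporary file lives in the same directory (and therefore the same filesystem) as the final file. `write_image_atomically` is a hypothetical helper. The `.part` suffix is chosen because it is not one of the extensions the scanner looks for.

```python
import os
import tempfile
from pathlib import Path


def write_image_atomically(data: bytes, final_path):
    """Write bytes to a temporary file, then rename it onto final_path."""
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file sits next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(final_path.parent), suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, final_path)  # readers see the old file or the new one
    except BaseException:
        # Clean up the partial file if anything went wrong before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return final_path
```

With this pattern a consumer that lists only image extensions never picks up the partial file, and an interrupted write leaves at most a stray `.part` file instead of a truncated image.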