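When running detection over a batch of images, the console may print log lines like these:
```text
Corrupt JPEG data: 53 extraneous bytes before marker 0xd9
Corrupt JPEG data: premature end of data segment
```
These lines are usually not exceptions raised by Python. They are warnings that the JPEG decoding library underneath OpenCV writes to stderr. In many cases `cv2.imread()` still returns an image, but the underlying data is truncated, contains unexpected extra bytes near the end, or was never fully written. For object detection, such images can produce abnormal boxes, missed detections, and fluctuating confidence scores, and can even make batch processing unstable.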
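Causes
Common causes include:
- Image capture or network transfer did not finish, so only part of the file was saved.
- The program exited, the disk failed, or the process was killed while the file was being written.
- Extra bytes are present near the end of the JPEG file, before the end-of-image marker, which triggers `extraneous bytes before marker 0xd9`.
- The JPEG data segment ends early, which triggers `premature end of data segment`.
- On Windows, reading a path that contains Chinese characters directly with `cv2.imread(path)` can make a healthy image look like a read failure.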
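The fifth cause matters: if the image path contains Chinese characters, the result of `cv2.imread()` alone cannot tell you whether the image is corrupt. A more robust approach is:
```python
import cv2
import numpy as np

data = np.fromfile(path, dtype=np.uint8)
img = cv2.imdecode(data, cv2.IMREAD_COLOR)
```
Here `path` is the image path. The file is opened and read on the Python side by `np.fromfile`, which handles Chinese paths correctly, and OpenCV only decodes the bytes it is handed.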
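Truncated or partially written JPEG files can often be spotted before any decoding. As a rough sketch, the code below only checks that the data starts with the JPEG start-of-image marker `FF D8` and ends with the end-of-image marker `FF D9`. The names `jpeg_marker_check` and `check_jpeg_file` are hypothetical helpers, not part of the script in this article. The check is a heuristic: a file can pass it and still be damaged inside, and a file with extra bytes after `FF D9` is flagged even if it decodes fine, so treat it as a cheap pre-filter rather than a replacement for decoding.

```python
from pathlib import Path

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_marker_check(data: bytes) -> str:
    """Return "ok", "not_jpeg", or "missing_eoi" based on the first and last two bytes."""
    if not data.startswith(SOI):
        return "not_jpeg"
    if not data.endswith(EOI):
        # Typical for a file that was cut off, or that has bytes appended after the image.
        return "missing_eoi"
    return "ok"


def check_jpeg_file(path) -> str:
    """Read the file as raw bytes through Python (non-ASCII paths are fine) and check it."""
    return jpeg_marker_check(Path(path).read_bytes())
```

Because the file is read through `pathlib`, this pre-check is not affected by the Chinese-path problem described above.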
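Handling approach
The recommended order is:
- Scan the image directory first and only generate a bad-image report, without modifying any files.
- Review `bad_images.csv` by hand to confirm that the listed images really are bad.
- On the first pass, move bad images to a quarantine directory instead of deleting them.
- Delete only after you have confirmed the results.
- If an image has a same-name LabelMe JSON file next to it, use `--with-json` to move or delete both together.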
Complete code
Save the following code as `find_bad_images.py`:
```python
import argparse
import csv
import shutil
import subprocess
import sys
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".webp"}

# Lower-case fragments of decoder messages that mark an image as bad.
JPEG_WARNING_PATTERNS = (
    "corrupt jpeg data",
    "premature end of data segment",
    "extraneous bytes before marker",
    "invalid sos parameters",
    "bad huffman code",
    "unsupported marker type",
)

# Executed in a child process so that its stderr can be captured per image.
CHECK_CODE = r"""
import sys
import cv2
import numpy as np

data = np.fromfile(sys.argv[1], dtype=np.uint8)
if data.size == 0:
    print("image file is empty or unreadable", file=sys.stderr)
    sys.exit(2)

img = cv2.imdecode(data, cv2.IMREAD_COLOR)
if img is None:
    print("cv2.imdecode returned None", file=sys.stderr)
    sys.exit(2)

sys.exit(0)
"""


def parse_args():
    parser = argparse.ArgumentParser(description="Find JPEG/images that trigger OpenCV decode warnings.")
    parser.add_argument("--image-dir", default="../Car", help="Directory to scan.")
    parser.add_argument("--recursive", action="store_true", help="Scan image-dir recursively.")
    parser.add_argument("--report", default="bad_images.csv", help="CSV report path.")
    parser.add_argument("--move-bad", default=None, help="Move bad images to this directory instead of deleting.")
    parser.add_argument("--delete", action="store_true", help="Delete bad images. Use only after checking the report.")
    parser.add_argument("--with-json", action="store_true", help="Also move/delete same-stem LabelMe json files.")
    parser.add_argument("--any-stderr", action="store_true", help="Treat any decoder stderr output as bad.")
    return parser.parse_args()


def iter_images(image_dir: Path, recursive: bool):
    pattern = "**/*" if recursive else "*"
    # sorted() materializes the listing, so files moved later are not revisited in this run.
    for path in sorted(image_dir.glob(pattern)):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            yield path


def check_image(path: Path, any_stderr: bool):
    """Decode one image in a child process; return (ok, reason)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHECK_CODE, str(path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    stderr = proc.stderr.strip()
    stderr_lower = stderr.lower()
    has_known_warning = any(pattern in stderr_lower for pattern in JPEG_WARNING_PATTERNS)

    if proc.returncode != 0:
        return False, stderr or f"decode failed with return code {proc.returncode}"
    if has_known_warning:
        return False, stderr
    if any_stderr and stderr:
        return False, stderr
    return True, ""


def related_files(image_path: Path, include_json: bool):
    """Return the image plus, optionally, its same-stem .json file if it exists."""
    paths = [image_path]
    if include_json:
        json_path = image_path.with_suffix(".json")
        if json_path.exists():
            paths.append(json_path)
    return paths


def move_files(paths, source_root: Path, target_root: Path):
    """Move files under target_root, preserving their path relative to source_root."""
    moved = []
    for path in paths:
        relative_path = path.relative_to(source_root)
        target_path = target_root / relative_path
        target_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_path))
        moved.append(str(target_path))
    return moved


def delete_files(paths):
    deleted = []
    for path in paths:
        path.unlink()
        deleted.append(str(path))
    return deleted


def main():
    args = parse_args()
    image_dir = Path(args.image_dir).resolve()
    if not image_dir.exists():
        raise FileNotFoundError(f"image-dir does not exist: {image_dir}")
    if args.delete and args.move_bad:
        raise ValueError("--delete and --move-bad cannot be used together")

    report_path = Path(args.report)
    bad_rows = []
    total = 0

    for image_path in iter_images(image_dir, args.recursive):
        total += 1
        ok, reason = check_image(image_path, args.any_stderr)
        if ok:
            continue

        action = "report"
        affected = []
        files = related_files(image_path, args.with_json)
        if args.move_bad:
            action = "move"
            affected = move_files(files, image_dir, Path(args.move_bad).resolve())
        elif args.delete:
            action = "delete"
            affected = delete_files(files)

        bad_rows.append([str(image_path), reason.replace("\n", " | "), action, ";".join(affected)])
        print(f"BAD: {image_path}")
        print(f"  reason: {reason}")
        if affected:
            print(f"  {action}: {affected}")

    # Write the report once, after the whole scan has finished.
    with report_path.open("w", newline="", encoding="utf-8") as report_file:
        writer = csv.writer(report_file)
        writer.writerow(["image_path", "reason", "action", "affected_files"])
        writer.writerows(bad_rows)

    print("")
    print(f"Scanned images: {total}")
    print(f"Bad images: {len(bad_rows)}")
    print(f"Report: {report_path.resolve()}")
    if bad_rows and not args.delete and not args.move_bad:
        print("No files were changed. Re-run with --move-bad bad_images or --delete after checking the report.")


if __name__ == "__main__":
    main()
```
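The script starts one Python process per image and captures that process's stderr, since the decoder warnings are printed to stderr instead of being raised as Python exceptions. On a large dataset this can be slow. One possible alternative, sketched below, is to redirect file descriptor 2 temporarily inside a single process. `capture_fd_stderr` is a hypothetical helper, not part of the script above. The sketch assumes the decoder writes to the same file descriptor 2 that Python sees, which may not hold on every platform or OpenCV build (Windows builds in particular), so verify it on a known-bad image before relying on it.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_fd_stderr():
    """Redirect file descriptor 2 to a temp file and expose what was written to it."""
    captured = {"text": ""}
    sys.stderr.flush()
    saved_fd = os.dup(2)  # keep the real stderr so it can be restored
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)  # fd 2 now points at the temp file
        try:
            yield captured
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the original stderr
            os.close(saved_fd)
            tmp.seek(0)
            captured["text"] = tmp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    with capture_fd_stderr() as cap:
        # Stand-in for a decoder warning written below the Python level.
        os.write(2, b"Corrupt JPEG data: premature end of data segment\n")
    print(repr(cap["text"]))
```

In real use, the `cv2.imdecode(...)` call would go inside the `with` block, and `cap["text"]` would then be matched against `JPEG_WARNING_PATTERNS` in the same way the script matches the child process's stderr.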
Usage
Scan only, without modifying any images:
```bash
python find_bad_images.py --image-dir "D:\数据集\车辆图片" --report bad_images.csv
```
Scan subdirectories recursively:
```bash
python find_bad_images.py --image-dir "D:\数据集\车辆图片" --recursive --report bad_images.csv
```
Recommended first step: move bad images to a quarantine directory:
```bash
python find_bad_images.py --image-dir "D:\数据集\车辆图片" --move-bad bad_images --with-json
```
Once you have confirmed the bad images are of no value, delete them directly:
```bash
python find_bad_images.py --image-dir "D:\数据集\车辆图片" --delete --with-json
```
To treat any stderr output from the OpenCV decoder as a problem, add `--any-stderr`:
```bash
python find_bad_images.py --image-dir "D:\数据集\车辆图片" --any-stderr --report bad_images.csv
```
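Report fields
`bad_images.csv` contains the following fields:
| Field | Description |
|---|---|
| `image_path` | Path of the image flagged as bad |
| `reason` | Reason reported by the OpenCV/JPEG decoder |
| `action` | Action taken in this run: `report`, `move`, or `delete` |
| `affected_files` | Paths of the files that were moved or deleted, separated by `;` |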
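When reviewing a long report by hand, a quick tally can help. The sketch below reads the report with the standard `csv` module and counts rows per `(action, reason)` pair. `summarize_report` is a hypothetical helper, and it assumes the report has the four columns listed above and was written as UTF-8.

```python
import csv
from collections import Counter


def summarize_report(report_path: str) -> Counter:
    """Count report rows by (action, reason)."""
    counts = Counter()
    with open(report_path, newline="", encoding="utf-8") as report_file:
        for row in csv.DictReader(report_file):
            counts[(row["action"], row["reason"])] += 1
    return counts


if __name__ == "__main__":
    for (action, reason), count in summarize_report("bad_images.csv").most_common():
        print(f"{count:5d}  {action:7s}  {reason}")
```

Messages that embed a number, such as the byte count in `extraneous bytes before marker`, end up in separate buckets, so the tally is only a starting point for the manual check.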
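Notes
- If paths contain Chinese characters, use `np.fromfile` + `cv2.imdecode` as in the code in this article; do not rely on `cv2.imread(path)` to decide whether an image is bad.
- On the first pass, use `--move-bad` rather than `--delete`.
- If a report from an older version already exists, delete the old `bad_images.csv` and scan again.
- `--with-json` suits LabelMe datasets; it also processes the same-name `.json` annotation file.
- If images come from a camera, network requests, or an asynchronous write-to-disk flow, also check the upstream file-writing logic so that files do not enter detection before they are fully written.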