[Image Processing] Cleaning Image Datasets with the CleanVision Library

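CleanVision is an open-source Python library that automatically detects common issues in image datasets that can affect machine learning projects. It is designed as a first step for computer vision projects, so that problems in a dataset can be found and fixed before any machine learning is applied. Its core functionality is detecting images that are exact duplicates, near duplicates, blurry, low in information, too dark, too bright, or grayscale, or that have an odd aspect ratio or an odd size. The open-source repository is CleanVision, and the official documentation is CleanVision-docs.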

Install the basic version of CleanVision with:

pip install cleanvision

Install the full version with:

pip install "cleanvision[all]"

Check the installed CleanVision version:

# Check the version
import cleanvision
cleanvision.__version__
'0.3.6'

Library versions required by the code in this article:

# Used for displaying tables
import tabulate
# tabulate must be version 0.8.10 or later
tabulate.__version__
'0.9.0'
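
A version requirement such as "0.8.10 or later" is easy to get wrong if the version strings are compared as plain text. The following is a minimal sketch of a numeric comparison, assuming purely numeric dotted versions; the helper names parse_version and meets_minimum are made up for this illustration and are not part of tabulate or CleanVision.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '0.8.10' into (0, 8, 10)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


# Comparing the raw strings would wrongly rank "0.8.9" above "0.8.10",
# because the character "9" sorts after the character "1".
print(meets_minimum("0.9.0", "0.8.10"))  # the tabulate version shown above
print(meets_minimum("0.8.9", "0.8.10"))  # an older release
```

Versions with suffixes such as "rc1" would need a real version parser; this sketch does not handle them.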

Contents

  • [1 Usage](#1 Usage)
    • [1.1 CleanVision Features](#1.1 CleanVision Features)
    • [1.2 Basic Usage](#1.2 Basic Usage)
    • [1.3 Custom Detection](#1.3 Custom Detection)
    • [1.4 Running CleanVision on Torchvision Datasets](#1.4 Running CleanVision on Torchvision Datasets)
    • [1.5 Running CleanVision on Hugging Face Datasets](#1.5 Running CleanVision on Hugging Face Datasets)
  • [2 References](#2 References)

1 Usage

1.1 CleanVision Features

CleanVision supports image files in many formats and can detect the following types of data issues:

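| Issue type | Description | Keyword |
|:---|:---|:---|
| Exact duplicates | Images that are identical | exact_duplicates |
| Near duplicates | Images that are visually almost identical | near_duplicates |
| Blurry | Images whose details are blurred (out of focus) | blurry |
| Low information | Images lacking content (the entropy of the pixel values is small) | low_information |
| Dark | Abnormally dark images (underexposed) | dark |
| Light | Abnormally bright images (overexposed) | light |
| Grayscale | Images lacking color | grayscale |
| Odd aspect ratio | Images with an unusual aspect ratio | odd_aspect_ratio |
| Odd size | Images whose size is unusual compared with the other images in the dataset | odd_size |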

CleanVision detects these issues mainly with statistical methods. The Keyword column gives the name used to refer to each issue type in CleanVision code. CleanVision runs on Linux, macOS, and Windows with Python 3.7 or later.
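
To make the "entropy of the pixel values" in the table concrete, here is a minimal sketch that computes the Shannon entropy of a list of grayscale intensities. It only illustrates the statistic itself; how CleanVision turns entropy into a low_information score is not shown here, and the function name pixel_entropy is chosen for this example.

```python
import math
from collections import Counter


def pixel_entropy(pixels):
    """Shannon entropy, in bits, of a flat sequence of pixel intensities."""
    counts = Counter(pixels)
    total = len(pixels)
    # Each distinct intensity contributes p * log2(1 / p), with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


flat_image = [128] * 64          # every pixel identical: carries no information
varied_image = list(range(256))  # every 8-bit intensity appears exactly once

print(pixel_entropy(flat_image))
print(pixel_entropy(varied_image))
```

A uniformly colored image has zero entropy, while an image that uses all 256 intensities equally often reaches the 8-bit maximum.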

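1.2 Basic Usage

This section shows how to read images from a folder and check them for issues. The example below runs a quality check on a folder of 607 images. CleanVision automatically uses multiple processes to speed up the work: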

Basic usage

from cleanvision import Imagelab

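# Sample data: https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip
# Path to the sample images
dataset_path = "./image_files/"

# Instantiate Imagelab for the subsequent processing
imagelab = Imagelab(data_path=dataset_path)

# find_issues uses multiprocessing; n_jobs sets the number of processes
# n_jobs defaults to None, which lets CleanVision choose the number automatically
# Each image's image_property checks (image quality) run first;
# the duplicate checks run after all images have been processed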
imagelab.find_issues(verbose=False, n_jobs=2)
Reading images from D:/cleanvision/image_files

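When running CleanVision on Windows, place the code inside a main guard (if __name__ == "__main__":) so that the multiprocessing module can start its worker processes correctly. Alternatively, set n_jobs to 1 to use a single process: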

from cleanvision import Imagelab

if __name__ == "__main__":
    # Sample data: https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip
    # Path to the sample images
    dataset_path = "./image_files/"
    imagelab = Imagelab(data_path=dataset_path)
    imagelab.find_issues(verbose=False)
Reading images from D:/cleanvision/image_files

The report method lists the number of images with each issue type and shows the most severe examples of each type:

imagelab.report()
Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | odd_size         |          109 |
|  1 | grayscale        |           20 |
|  2 | near_duplicates  |           20 |
|  3 | exact_duplicates |           19 |
|  4 | odd_aspect_ratio |           11 |
|  5 | dark             |           10 |
|  6 | blurry           |            6 |
|  7 | light            |            5 |
|  8 | low_information  |            5 | 

--------------------- odd_size images ----------------------

Number of examples with this issue: 109
Examples representing most severe instances of this issue:
--------------------- grayscale images ---------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:
------------------ near_duplicates images ------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

Set: 0
Set: 1
Set: 2
Set: 3
----------------- exact_duplicates images ------------------

Number of examples with this issue: 19
Examples representing most severe instances of this issue:

Set: 0
Set: 1
Set: 2
Set: 3
----------------- odd_aspect_ratio images ------------------

Number of examples with this issue: 11
Examples representing most severe instances of this issue:
----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:
---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:
----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:
------------------ low_information images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

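To create a custom issue type, see: custom_issue_manager

The main way to interact with the results is through the Imagelab class. It can be used to understand the issues in a dataset at a macro level (a global overview) and at a micro level (the issues and quality scores of each image). It has three main attributes:

  • Imagelab.issue_summary: a summary of the issues
  • Imagelab.issues: the issues found for each image
  • Imagelab.info: information about the dataset, including information on similar images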

Analyzing the results

The issue_summary attribute shows the number of images in each issue category:

# The result is a pandas DataFrame
res = imagelab.issue_summary
type(res)
pandas.core.frame.DataFrame

View the summary:

res

| | issue_type | num_images |
|---:|:---|---:|
| 0 | odd_size | 109 |
| 1 | grayscale | 20 |
| 2 | near_duplicates | 20 |
| 3 | exact_duplicates | 19 |
| 4 | odd_aspect_ratio | 11 |
| 5 | dark | 10 |
| 6 | blurry | 6 |
| 7 | light | 5 |
| 8 | low_information | 5 |

The issues attribute shows, for every image, a quality score for each issue type and whether the issue is present. Scores range from 0 to 1, and a lower score means a more severe issue:

imagelab.issues.head()

| | odd_size_score | is_odd_size_issue | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D:/cleanvision/image_files/image_0.png | 1.0 | False | 1.0 | False | 0.806332 | False | 0.925490 | False | 1 | False | 1.000000 | False | 0.980373 | False | 1.0 | False | 1.0 | False |
| D:/cleanvision/image_files/image_1.png | 1.0 | False | 1.0 | False | 0.923116 | False | 0.906609 | False | 1 | False | 0.990676 | False | 0.472314 | False | 1.0 | False | 1.0 | False |
| D:/cleanvision/image_files/image_10.png | 1.0 | False | 1.0 | False | 0.875129 | False | 0.995127 | False | 1 | False | 0.795937 | False | 0.470706 | False | 1.0 | False | 1.0 | False |
| D:/cleanvision/image_files/image_100.png | 1.0 | False | 1.0 | False | 0.916140 | False | 0.889762 | False | 1 | False | 0.827587 | False | 0.441195 | False | 1.0 | False | 1.0 | False |
| D:/cleanvision/image_files/image_101.png | 1.0 | False | 1.0 | False | 0.779338 | False | 0.960784 | False | 0 | True | 0.992157 | False | 0.507767 | False | 1.0 | False | 1.0 | False |
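
The per-image table and the summary are two views of the same data: counting the True flags in each is_..._issue column gives the per-type totals, and the lowest score identifies the most severe image. The sketch below shows this with plain dictionaries, using the grayscale and blurry values of the five rows above. The helpers summarize and most_severe are illustrative names, not CleanVision functions.

```python
# Values copied from the five rows above, reduced to two issue types.
issues = {
    "image_0.png": {"grayscale_score": 1, "is_grayscale_issue": False,
                    "blurry_score": 0.980373, "is_blurry_issue": False},
    "image_1.png": {"grayscale_score": 1, "is_grayscale_issue": False,
                    "blurry_score": 0.472314, "is_blurry_issue": False},
    "image_10.png": {"grayscale_score": 1, "is_grayscale_issue": False,
                     "blurry_score": 0.470706, "is_blurry_issue": False},
    "image_100.png": {"grayscale_score": 1, "is_grayscale_issue": False,
                      "blurry_score": 0.441195, "is_blurry_issue": False},
    "image_101.png": {"grayscale_score": 0, "is_grayscale_issue": True,
                      "blurry_score": 0.507767, "is_blurry_issue": False},
}


def summarize(issues):
    """Count, per issue type, how many images have is_<type>_issue set."""
    counts = {}
    for columns in issues.values():
        for name, value in columns.items():
            if name.startswith("is_") and name.endswith("_issue"):
                issue_type = name[len("is_"):-len("_issue")]
                counts[issue_type] = counts.get(issue_type, 0) + int(value)
    return counts


def most_severe(issues, issue_type):
    """Return the image with the lowest (most severe) score for an issue type."""
    return min(issues, key=lambda image: issues[image][f"{issue_type}_score"])


print(summarize(issues))
print(most_severe(issues, "blurry"))
```

Note that the lowest blurry score in this sample still is not flagged as an issue: the flag depends on a threshold, not on the ranking alone.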

Because imagelab.issues is a pandas DataFrame, it can be filtered for a specific issue type:

# The lower the score, the more severe the issue
dark_images = imagelab.issues[imagelab.issues["is_dark_issue"]].sort_values(
    by=["dark_score"]
)
dark_images_files = dark_images.index.tolist()
dark_images_files
['D:/cleanvision/image_files/image_417.png',
 'D:/cleanvision/image_files/image_350.png',
 'D:/cleanvision/image_files/image_605.png',
 'D:/cleanvision/image_files/image_177.png',
 'D:/cleanvision/image_files/image_346.png',
 'D:/cleanvision/image_files/image_198.png',
 'D:/cleanvision/image_files/image_204.png',
 'D:/cleanvision/image_files/image_485.png',
 'D:/cleanvision/image_files/image_457.png',
 'D:/cleanvision/image_files/image_576.png']

Visualize some of these images:

imagelab.visualize(image_files=dark_images_files[:4])

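A more concise way to do the same is to pass the issue_types parameter to imagelab.visualize, which displays the images that have a given issue, ordered by severity:

# issue_types: issue types to show; num_images: number of images to display; cell_size: size of each image in the grid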
imagelab.visualize(issue_types=["low_information"], num_images=3, cell_size=(3, 3))

Viewing image information and similar images

The info attribute exposes information about the dataset:

# List the available entries
imagelab.info.keys()
dict_keys(['statistics', 'dark', 'light', 'odd_aspect_ratio', 'low_information', 'blurry', 'grayscale', 'odd_size', 'exact_duplicates', 'near_duplicates'])

# List the available statistics
imagelab.info["statistics"].keys()
dict_keys(['brightness', 'aspect_ratio', 'entropy', 'blurriness', 'color_space', 'size'])

# View the size statistics of the dataset
imagelab.info["statistics"]["size"]
count     607.000000
mean      280.830152
std       215.001908
min        32.000000
25%       256.000000
50%       256.000000
75%       256.000000
max      4666.050578
Name: size, dtype: float64

View the number of sets of exact duplicates in the dataset:

imagelab.info["exact_duplicates"]["num_sets"]
9

View the sets of near-duplicate images in the dataset:

imagelab.info["near_duplicates"]["sets"]
[['D:/cleanvision/image_files/image_103.png',
  'D:/cleanvision/image_files/image_408.png'],
 ['D:/cleanvision/image_files/image_109.png',
  'D:/cleanvision/image_files/image_329.png'],
 ['D:/cleanvision/image_files/image_119.png',
  'D:/cleanvision/image_files/image_250.png'],
 ['D:/cleanvision/image_files/image_140.png',
  'D:/cleanvision/image_files/image_538.png'],
 ['D:/cleanvision/image_files/image_25.png',
  'D:/cleanvision/image_files/image_357.png'],
 ['D:/cleanvision/image_files/image_255.png',
  'D:/cleanvision/image_files/image_43.png'],
 ['D:/cleanvision/image_files/image_263.png',
  'D:/cleanvision/image_files/image_486.png'],
 ['D:/cleanvision/image_files/image_3.png',
  'D:/cleanvision/image_files/image_64.png'],
 ['D:/cleanvision/image_files/image_389.png',
  'D:/cleanvision/image_files/image_426.png'],
 ['D:/cleanvision/image_files/image_52.png',
  'D:/cleanvision/image_files/image_66.png']]
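
Once the duplicate sets are known, a typical cleaning step is to keep one image per set and drop the rest. The sketch below shows one possible policy (keep the first path in sorted order) on two of the sets listed above. The policy and the helper name files_to_remove are assumptions made for this example; CleanVision itself only reports the sets.

```python
def files_to_remove(duplicate_sets):
    """Keep the first file (in sorted order) of each set and list the rest."""
    redundant = []
    for group in duplicate_sets:
        _keep, *extra = sorted(group)
        redundant.extend(extra)
    return redundant


# Two of the near-duplicate sets listed above.
near_duplicate_sets = [
    ["D:/cleanvision/image_files/image_103.png",
     "D:/cleanvision/image_files/image_408.png"],
    ["D:/cleanvision/image_files/image_255.png",
     "D:/cleanvision/image_files/image_43.png"],
]

for path in files_to_remove(near_duplicate_sets):
    # Only print here; delete files yourself after reviewing the list.
    print("candidate for removal:", path)
```

For near duplicates it is usually worth inspecting the images before deleting anything, since "almost identical" pictures can still carry different labels.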

1.3 Custom Detection

Specifying issue types

from cleanvision import Imagelab

# Sample data: https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip
dataset_path = "./image_files/"

# Issue types to check
issue_types = {"blurry": {}, "dark": {}}

imagelab = Imagelab(data_path=dataset_path)

imagelab.find_issues(issue_types=issue_types, verbose=False)
imagelab.report()
Reading images from D:/cleanvision/image_files


Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:
---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

If find_issues has already been run, calling it again with additional issue types merges the new results with the previous ones:

issue_types = {"light": {}}
imagelab.find_issues(issue_types)
# The report now covers all three issue types
imagelab.report()
Checking for light images ...
Issue checks completed. 21 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 |
|  2 | light        |            5 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:
---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:
----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:
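
The merge described above can be pictured as combining two per-issue count tables. The sketch below does this with plain dictionaries and the counts from the two runs. It is only a model of the observable result, not CleanVision's internal logic, and it assumes that an issue type checked again would replace its earlier count.

```python
def merge_summaries(previous, new):
    """Combine per-issue counts; a re-checked issue type replaces its old count."""
    merged = dict(previous)
    merged.update(new)
    # Sort by number of affected images, largest first, as in the report above.
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))


first_run = {"blurry": 6, "dark": 10}  # counts from the first find_issues call
second_run = {"light": 5}              # counts from the second call

merged = merge_summaries(first_run, second_run)
print(merged)
print(sum(merged.values()))  # compare with the "21 issues found" message above
```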

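Saving results

The following code saves and loads results. When loading, the data path and the dataset must be the same as when the results were saved:

save_path = "./results"
# Save the results
# force controls whether existing files are overwritten
imagelab.save(save_path, force=True)

# Load the results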
imagelab = Imagelab.load(save_path, dataset_path)
Successfully loaded Imagelab

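Setting thresholds

CleanVision uses thresholds to decide the outcome of each check. exact_duplicates and near_duplicates are detected with image hashes (provided by the imagehash library), while the other checks use a threshold between 0 and 1. An image is considered to have an issue when its score for that issue type falls below the threshold, so a higher threshold flags more images. The hyperparameters are listed below: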

| | Keyword | Hyperparameters |
|---:|:---|:---|
| 1 | light | threshold |
| 2 | dark | threshold |
| 3 | odd_aspect_ratio | threshold |
| 4 | exact_duplicates | N/A |
| 5 | near_duplicates | hash_size (int), hash_types (whash, phash, ahash, dhash, chash) |
| 6 | blurry | threshold |
| 7 | grayscale | threshold |
| 8 | low_information | threshold |
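
To give an intuition for the hash_size and hash_types hyperparameters, here is a heavily simplified average hash ("ahash") in pure Python. It is a sketch of the general idea only and not the imagehash implementation: the image is a small list of grayscale rows, it is reduced to hash_size x hash_size blocks by averaging, and each block yields one bit. The function names are chosen for this example.

```python
def average_hash(gray, hash_size=2):
    """Average hash of a square grayscale image given as a list of rows.

    Each of the hash_size x hash_size blocks contributes one bit:
    1 if the block is brighter than the mean of all blocks, else 0.
    """
    block = len(gray) // hash_size
    cells = []
    for by in range(hash_size):
        for bx in range(hash_size):
            values = [gray[y][x]
                      for y in range(by * block, (by + 1) * block)
                      for x in range(bx * block, (bx + 1) * block)]
            cells.append(sum(values) / len(values))
    mean = sum(cells) / len(cells)
    return tuple(int(cell > mean) for cell in cells)


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))


image = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [200, 200, 10, 10],
         [200, 200, 10, 10]]
brighter = [[v + 20 for v in row] for row in image]   # same picture, slightly brighter
inverted = [[255 - v for v in row] for row in image]  # a visibly different picture

print(hamming(average_hash(image), average_hash(brighter)))
print(hamming(average_hash(image), average_hash(inverted)))
```

A small change in brightness leaves the hash unchanged, which is why hash-based checks can catch near duplicates, while a larger hash_size keeps more detail and makes the comparison stricter.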

To set the threshold for a single issue type:

imagelab = Imagelab(data_path=dataset_path)
issue_types = {"dark": {"threshold": 0.5}}
imagelab.find_issues(issue_types)

imagelab.report()
Reading images from D:/cleanvision/image_files
Checking for dark images ...

Issue checks completed. 20 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           20 | 

----------------------- dark images ------------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:
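
The effect seen above, where raising the dark threshold to 0.5 flags 20 images instead of 10, follows directly from the rule that an image is flagged when its score is below the threshold. A minimal sketch of that rule, using invented scores and file names purely for illustration:

```python
def flag_issues(scores, threshold):
    """Return the images whose score falls below the threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


# Hypothetical dark scores, invented for this example only.
dark_scores = {"a.png": 0.12, "b.png": 0.28, "c.png": 0.45, "d.png": 0.93}

print(flag_issues(dark_scores, threshold=0.3))
print(flag_issues(dark_scores, threshold=0.5))
```

Every image flagged at the lower threshold is also flagged at the higher one, so raising a threshold can only add images to the result.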

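If an issue is expected in a dataset, for example dark images in an astronomy dataset, you can set a maximum prevalence (max_prevalence). Any issue type whose share of images exceeds max_prevalence is treated as normal for the dataset and is left out of the report. With the default threshold, 10 of the 607 images were flagged as dark, a share of about 0.016 (the threshold of 0.5 used just above flags 20 images, about 0.033). Either share exceeds a max_prevalence of 0.015, so with that setting the dark images are not reported as an issue:

imagelab.report(max_prevalence=0.015)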
Removing dark from potential issues in the dataset as it exceeds max_prevalence=0.015 
Please specify some issue_types to check for in imagelab.find_issues().
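
The arithmetic behind this filter is a simple ratio test. The sketch below reproduces it for the dark count from the default-threshold report. The function name reportable_issues is made up for this example, and whether CleanVision treats a share exactly equal to max_prevalence as reportable is an assumption here.

```python
def reportable_issues(summary, num_images, max_prevalence):
    """Drop issue types whose share of the dataset exceeds max_prevalence."""
    return {issue: count for issue, count in summary.items()
            if count / num_images <= max_prevalence}


summary = {"dark": 10}  # dark count with the default threshold
num_images = 607

print(round(summary["dark"] / num_images, 4))
print(reportable_issues(summary, num_images, max_prevalence=0.015))
print(reportable_issues(summary, num_images, max_prevalence=0.02))
```

With max_prevalence set to 0.015 nothing is left to report, which matches the message above; a slightly higher limit such as 0.02 keeps the dark issue in the report.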

1.4 Running CleanVision on Torchvision Datasets

CleanVision can detect issues in Torchvision datasets, as shown below:

Preparing the dataset

from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab

# Prepare the CIFAR10 dataset from torchvision
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
Files already downloaded and verified
Files already downloaded and verified

# Check the number of samples in the training and test sets
len(train_set), len(test_set)
(50000, 10000)

To process the training and test sets together, concatenate them:

dataset = ConcatDataset([train_set, test_set])
len(dataset)
60000
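
After concatenation, an image is addressed by a single position between 0 and 59999. If results refer to images by that position (an assumption about how they are indexed for a Torchvision dataset), it helps to map a position back to its original split. The sketch below does this with cumulative sizes; it is a standalone illustration that does not import torch, and locate is a name chosen for this example.

```python
from bisect import bisect_right


def locate(index, split_sizes):
    """Map an index in the concatenated dataset to (split name, local index)."""
    names = list(split_sizes)
    boundaries = []
    total = 0
    for name in names:
        total += split_sizes[name]
        boundaries.append(total)  # cumulative end position of each split
    if not 0 <= index < total:
        raise IndexError(index)
    position = bisect_right(boundaries, index)
    start = boundaries[position - 1] if position else 0
    return names[position], index - start


sizes = {"train": 50000, "test": 10000}  # the lengths printed above
print(locate(49999, sizes))
print(locate(50000, sizes))
```

The last training image sits at position 49999, and position 50000 is the first test image.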

View an image:

dataset[0][0]

Running CleanVision

Pass the torchvision_dataset parameter when creating the Imagelab instance to work on a Torchvision dataset. The remaining steps are the same as for images read from a folder:

imagelab = Imagelab(torchvision_dataset=dataset)
imagelab.find_issues()
# View the results
# imagelab.report()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 173 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

# Summary of results
imagelab.issue_summary

| | issue_type | num_images |
|---:|:---|---:|
| 0 | blurry | 118 |
| 1 | near_duplicates | 40 |
| 2 | dark | 11 |
| 3 | light | 3 |
| 4 | low_information | 1 |
| 5 | grayscale | 0 |
| 6 | odd_aspect_ratio | 0 |
| 7 | odd_size | 0 |
| 8 | exact_duplicates | 0 |

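1.5 Running CleanVision on Hugging Face Datasets

CleanVision can also detect issues in Hugging Face datasets (provided the dataset can be accessed), as follows:

# datasets is the library for downloading Hugging Face datasets
from datasets import load_dataset
from cleanvision import Imagelab
# Using https://huggingface.co/datasets/mah91/cat as an example
# To download a Hugging Face dataset, set path to the part of its URL that follows datasets/
# split selects the train or test portion; if the dataset does not define separate splits, the full data is returned
dataset = load_dataset(path="mah91/cat", split="train")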
Repo card metadata block was not found. Setting CardData to empty.

# Inspect the dataset: it has 800 images and provides only images, with no annotations
dataset
Dataset({
    features: ['image'],
    num_rows: 800
})

# dataset.features describes each column in the dataset and its type, for example image or audio
dataset.features
{'image': Image(mode=None, decode=True, id=None)}

Pass the hf_dataset parameter to load the Hugging Face dataset:

# Load the data into CleanVision; image_key names the column that holds the images
imagelab = Imagelab(hf_dataset=dataset, image_key="image")

Run the detection as follows:

imagelab.find_issues()
# Summary of results
imagelab.issue_summary
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 4 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

| | issue_type | num_images |
|---:|:---|---:|
| 0 | blurry | 3 |
| 1 | odd_size | 1 |
| 2 | dark | 0 |
| 3 | grayscale | 0 |
| 4 | light | 0 |
| 5 | low_information | 0 |
| 6 | odd_aspect_ratio | 0 |
| 7 | exact_duplicates | 0 |
| 8 | near_duplicates | 0 |

2 References