1. Installation

Install with pip:

pip install ydata-profiling

Or install with conda:

conda install -c conda-forge ydata-profiling
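
After installing, it can be handy to confirm which version actually ended up in the active environment. The snippet below is a minimal sketch that only uses the standard library; `installed_version` is a hypothetical helper name, not part of ydata-profiling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "ydata-profiling"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None
print(installed_version())
```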

2. Profiling a DataFrame

First, build a random dataset that contains both duplicate rows and missing values.

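import numpy as np
import pandas as pd

# The private pd._testing.makeDataFrame() / makeMissingDataframe() helpers are
# no longer available in recent pandas releases, so build the frames by hand.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))
df2 = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))
df2 = df2.mask(rng.random(df2.shape) < 0.1)  # blank out roughly 10% of the cells

# df1 appears twice (duplicate rows); df2 contributes the missing values
df = pd.concat([df1, df1, df2], axis=0)
df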

Now let's see what ydata-profiling makes of it:

from ydata_profiling import ProfileReport

# In a Jupyter notebook, leaving `report` as the last expression renders it inline
report = ProfileReport(df, title="Random dataset")
report

The default report contains the following sections:

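Overview: a summary of the dataset, such as the number of rows, the share of duplicate rows, and the share of missing values
Variables: a per-column analysis of the dataset
Interactions: how the values of different columns are distributed against each other
Correlations: the statistical correlations between columns
Missing values: missing-value statistics for each column
Sample: the first 10 and last 10 rows of the dataset
Duplicate rows: statistics on the duplicated rows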

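Doing the same analysis with pandas' built-in functions would take a lot of code, and it still would not produce a report with text and charts this polished.
ydata-profiling saves a great deal of time before the actual data processing starts.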
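
To make the comparison concrete, here is a rough sketch of how just the headline numbers from the Overview section could be computed by hand with pandas. The dataset is rebuilt with public numpy/pandas calls so the snippet is self-contained. The dictionary keys are my own naming, and the figures are not guaranteed to match the report exactly, since the library may count duplicates or missing cells differently.

```python
import numpy as np
import pandas as pd

# Rebuild a small dataset with duplicate rows and missing values
rng = np.random.default_rng(0)
cols = list("ABCD")
base = pd.DataFrame(rng.standard_normal((30, 4)), columns=cols)
holes = pd.DataFrame(rng.standard_normal((30, 4)), columns=cols)
holes = holes.mask(rng.random(holes.shape) < 0.1)  # roughly 10% missing cells
df = pd.concat([base, base, holes], axis=0)

# Headline numbers similar in spirit to the Overview section
overview = {
    "rows": len(df),
    "columns": df.shape[1],
    "duplicate_rows": int(df.duplicated().sum()),
    "duplicate_ratio": float(df.duplicated().mean()),
    "missing_cells": int(df.isna().sum().sum()),
    "missing_ratio": float(df.isna().to_numpy().mean()),
}
print(overview)

# Per-column statistics and correlations still need separate calls
print(df.describe())
print(df.corr())
```

Even this covers only the summary numbers. The histograms, interaction plots, and missing-value charts would each need additional plotting code.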

3. Other useful features

Beyond profiling a single dataset, ydata-profiling offers a number of additional features. Below are a few that I have used recently.

(See the official documentation for the full feature list.)

3.1. Comparing datasets

As an example, take the randomly generated datasets df1 and df2 from above.

report1 = ProfileReport(df1, title="df1")
report2 = ProfileReport(df2, title="df2")

compare_report = report1.compare(report2)
compare_report

The comparison report has the same sections as the report in the previous section, except that each part shows both datasets.
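
To get a feel for what "both datasets in each part" means, the following pandas-only sketch places a few summary statistics of two frames next to each other. It is an illustration of the idea, not how the library builds its comparison, and the two frames are regenerated here under the assumption that they share the same columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cols = list("ABCD")
df1 = pd.DataFrame(rng.standard_normal((30, 4)), columns=cols)
df2 = pd.DataFrame(rng.standard_normal((30, 4)), columns=cols)
df2 = df2.mask(rng.random(df2.shape) < 0.1)  # df2 has missing values

# Put the same statistics of both frames next to each other
stats = ["count", "mean", "std"]
side_by_side = pd.concat(
    {"df1": df1.describe().loc[stats], "df2": df2.describe().loc[stats]},
    axis=1,
)
print(side_by_side.round(3))
```

The `count` row already shows the difference in missing values between the two frames, which is the kind of contrast the comparison report presents with charts.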

3.2. Protecting sensitive data

Datasets sometimes contain sensitive values, such as phone numbers or ID card numbers, that should not show up in the report.

In that case, set sensitive=True, and the generated report will no longer display sample values from the dataset.

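import numpy as np

# pd._testing.makeTimeDataFrame() is a private helper missing from recent pandas
df_time = pd.DataFrame(np.random.default_rng(0).standard_normal((30, 4)), columns=list("ABCD"),
                       index=pd.date_range("2000-01-03", periods=30, freq="B"))
df_time

report = ProfileReport(df_time, sensitive=True, title="Sensitive data hidden")
report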
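
As a complementary precaution, outside of ydata-profiling itself, sensitive columns can also be pseudonymized before the DataFrame is profiled at all. The sketch below is one possible approach under my own assumptions: `pseudonymize` is a hypothetical helper, the salt is a placeholder, and the phone numbers are made-up sample values.

```python
import hashlib

import pandas as pd


def pseudonymize(series: pd.Series, salt: str = "change-me") -> pd.Series:
    """Replace each value with a short salted SHA-256 digest; missing values stay missing."""

    def _digest(value):
        return hashlib.sha256(f"{salt}{value}".encode("utf-8")).hexdigest()[:12]

    return series.map(_digest, na_action="ignore")


# Made-up sample data for illustration only
people = pd.DataFrame(
    {"phone": ["13800000000", "13900000000", None], "age": [31, 45, 27]}
)
people["phone"] = pseudonymize(people["phone"])
print(people)
```

Equal inputs map to equal digests, so counts of distinct and duplicate values in the column are preserved while the raw values never reach the report.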

3.3. Adjusting the report style

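The default ydata-profiling report already looks quite good, and the library also exposes some parameters for adjusting its style (there are not many style-related parameters at the moment):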

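Parameter	Type	Default	Description
html.minify_html	bool	True	If `True`, the output `HTML` is minified with the `htmlmin` package.
html.use_local_assets	bool	True	If `True`, all assets (stylesheets, scripts, images) are stored locally. If `False`, a CDN is used for some stylesheets and scripts.
html.inline	bool	True	If `True`, all assets are embedded in the report. If `False`, a web export is created with all assets stored in the "`[REPORT_NAME]_assets/`" directory.
html.navbar_show	bool	True	Whether the report includes a navigation bar.
html.style.theme	string	None	Selects a theme. Available options: flatly (dark) and united (orange).
html.style.logo	string	nan	A `Base64`-encoded logo, shown in the navigation bar.
html.style.primary_color	string	#337ab7	The primary color used in the report.
html.style.full_width	bool	False	By default the report has a fixed width. If set to `True`, the full width of the screen is used.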

These properties are set as follows:

report = ProfileReport(df_time, sensitive=True, title="Sensitive data hidden")

# For example, hide the navigation bar
report.config.html.navbar_show = False
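
The same options can, as far as I know, also be passed when the report is constructed, as a nested dictionary under the `html` keyword. The sketch below assumes that form works in the installed version; `html_options` and `df_demo` are names I made up for the example, and the import is guarded so the snippet still runs where the library is missing.

```python
import pandas as pd

# Nested keys mirror the dotted parameter names from the table
html_options = {
    "navbar_show": False,           # html.navbar_show
    "style": {"full_width": True},  # html.style.full_width
}

df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

try:
    from ydata_profiling import ProfileReport
except ImportError:  # library not installed in this environment
    ProfileReport = None

if ProfileReport is not None:
    # Assumption: keyword arguments such as `html` are merged into the report config
    report = ProfileReport(df_demo, title="Custom style", html=html_options)
```

Passing everything at construction keeps the style settings in one place, which is convenient when several reports should share the same look.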

4. Exporting the report

The last step is exporting the report, usually as HTML:

# An explicit encoding keeps non-ASCII text in the report intact on every platform
with open("output.html", "w", encoding="utf-8") as f:
    f.write(report.to_html())

Or, more directly:

report.to_file("output.html")
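
To my knowledge, `to_file` chooses the output format from the file extension, so a `.json` path yields a JSON export instead of HTML. Building on that assumption, the sketch below wraps both exports in one call; `export_report` is a hypothetical helper of mine, not a library function.

```python
from pathlib import Path


def export_report(report, stem: str, out_dir: str = "."):
    """Write an HTML and a JSON version of a report and return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = [out / f"{stem}.html", out / f"{stem}.json"]
    for path in paths:
        # Assumption: the format is inferred from the file extension
        report.to_file(path)
    return paths


# Usage with a ProfileReport instance named `report`:
# export_report(report, "output", out_dir="reports")
```

The HTML file is meant for reading, while the JSON file is easier to consume from other scripts.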
