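Introduction

When starting a data analysis project, the first step is usually exploratory data analysis (EDA). EDA uses summary statistics and visualization to understand a dataset's features, structure, and underlying patterns.

The goal of EDA is to surface relationships, outliers, missing values, and other potential problems in the dataset, and to lay the groundwork for further data processing and modeling. A first pass over the data shows its overall shape, reveals trends and patterns, and suggests where to take the analysis next.

Common steps in EDA include: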

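Data collection: gather the required data and check that it is complete and accurate.
Dataset overview: look at the overall shape and contents of the dataset.
Data cleaning: handle missing values, outliers, and duplicates. This step matters for all later analysis and modeling, because accurate data is the basis of any valid analysis.
Descriptive statistics: compute basic statistics (mean, median, standard deviation, and so on) and examine the correlations and relationships between variables.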
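
As a rough illustration of the cleaning and descriptive-statistics steps, here is a minimal sketch on a tiny made-up table. The column names `rooms` and `price` and all values are assumptions for the example only, and filling with the median is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row and one missing value
raw = pd.DataFrame({
    "rooms": [5.0, 6.0, 6.0, 7.0, np.nan, 8.0],
    "price": [20.0, 24.0, 24.0, 30.0, 22.0, 35.0],
})

# Data cleaning: drop exact duplicate rows, then fill gaps with the column median
deduped = raw.drop_duplicates()
cleaned = deduped.fillna(deduped.median())

# Descriptive statistics: basic summary and pairwise correlations
stats = cleaned.describe()
corr = cleaned.corr()
print(stats)
print(corr)
```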

1 Inspect the dataset

We use the Boston Housing price-prediction dataset as the example. First, import the packages and load the data:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Boston Housing data from a local CSV file
df = pd.read_csv("../sklearn特征工程和EDA/boston_housing_data.csv")
df  # in a notebook, this displays the DataFrame
```

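Use describe() to summarize the data. Here the summary is styled with the pandas Styler, using a color from the plotly palette, to make it easier to read: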

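```python
import plotly.express as px

# Transpose the summary so each feature is a row, then style it:
# a bar for the mean, color gradients for the std and the median (50%)
(
    df.describe().T
    .style.bar(subset=['mean'], color=px.colors.qualitative.G10[2])
    .background_gradient(subset=['std'], cmap='Blues')
    .background_gradient(subset=['50%'], cmap='BuGn')
)
```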

CRIM: per-capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 square feet
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX: nitric oxides concentration, in parts per 10 million (an environmental indicator)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distance to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of the population with lower socioeconomic status
MEDV: median value of owner-occupied homes, in units of $1,000 (the target)

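2 Check for missing values

The summary shows that the target (MEDV) has missing values.
Whether to drop or keep those rows depends on the task. One option is to pseudo-label them to enlarge the training set.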

```python
def summary(df):
    # One row per column: dtype, missing count, missing fraction (0 to 1),
    # number of unique values, and non-null count
    out = pd.DataFrame(df.dtypes, columns=['dtypes'])
    out['missing#'] = df.isna().sum()
    out['missing%'] = df.isna().sum() / len(df)
    out['uniques'] = df.nunique().values
    out['count'] = df.count().values
    return out

summary(df)
```
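
The pseudo-labeling idea mentioned above could look roughly like the sketch below. It runs on synthetic data, not the real CSV: the frame `toy`, the choice of `LinearRegression`, and the `is_pseudo` flag column are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"RM": rng.normal(6, 1, 100), "LSTAT": rng.normal(12, 4, 100)})
toy["MEDV"] = 5 * toy["RM"] - 0.5 * toy["LSTAT"] + rng.normal(0, 1, 100)
toy.loc[toy.index[:10], "MEDV"] = np.nan  # hide 10 targets to mimic missing labels

features = ["RM", "LSTAT"]
labeled = toy[toy["MEDV"].notna()]
unlabeled = toy[toy["MEDV"].isna()]

# Fit on the labeled rows, then predict the missing targets
model = LinearRegression().fit(labeled[features], labeled["MEDV"])
toy["is_pseudo"] = toy["MEDV"].isna()  # remember which labels are predicted
toy.loc[unlabeled.index, "MEDV"] = model.predict(unlabeled[features])
```

Keeping a flag such as `is_pseudo` makes it possible to exclude or down-weight the predicted labels later, since they are only as good as the model that produced them.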

3 Outlier analysis

Use IsolationForest to flag outlier rows:

```python
from sklearn.ensemble import IsolationForest

def count_outliers(df, select_features):
    # Fit an isolation forest on the selected features
    df_subset = df[select_features]
    clf = IsolationForest(contamination='auto')
    predictions = clf.fit_predict(df_subset)  # -1 = outlier, 1 = inlier

    # Flag each row: 1 if it was predicted to be an outlier, otherwise 0
    df['Outlier_Count'] = (predictions == -1).astype(int)

    # Report the total number of flagged rows
    total_outliers = int(df['Outlier_Count'].sum())
    print(f"Total outliers: {total_outliers}")
    return df
```
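
To get a feel for what IsolationForest flags, here is a small self-contained sketch on synthetic two-dimensional data with a few far-away points appended. The frame `toy`, its columns `x1` and `x2`, and the injected points are assumptions for the example; `random_state` is fixed only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# 200 points around the origin plus 5 hand-placed extreme points
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(200, 2))
extremes = np.array([[15, 14], [-14, 16], [16, -15], [-15, -14], [1, 20]], dtype=float)
toy = pd.DataFrame(np.vstack([inliers, extremes]), columns=["x1", "x2"])

# fit_predict returns -1 for outliers and 1 for inliers
clf = IsolationForest(contamination="auto", random_state=0)
toy["is_outlier"] = clf.fit_predict(toy[["x1", "x2"]]) == -1

flagged_extremes = toy["is_outlier"].iloc[-5:]  # the 5 injected points
print(toy["is_outlier"].sum(), "rows flagged in total")
```

With `contamination="auto"` some ordinary points near the edge of the cloud may be flagged as well, so the flag is better treated as a prompt for inspection than as a deletion list.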

4 Correlation matrix plot

Plot the correlation matrix of the features:

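```python
# Numeric columns to include in the correlation matrix
num_var = df.select_dtypes(include='number').columns
corr_matrix = df[num_var].corr()
# Hide the upper triangle and the diagonal, which mirror the lower triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='Blues', fmt='.2f',
            linewidths=1, square=True, annot_kws={"size": 9})
plt.title('Correlation Matrix', fontsize=15)
plt.show()
```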

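Scatter plots of highly correlated variables:

```python
# Pairwise scatter plots; corner=True draws only the lower triangle
sns.pairplot(data=df[['NOX', 'RAD', 'RM', 'DIS']], corner=True)
```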
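
Rather than reading the strongly correlated pairs off the heatmap by eye, they can also be listed programmatically. The sketch below uses synthetic columns `a`, `b`, and `c` and a cutoff of 0.8, both of which are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is a noisy copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> correlation and keep the strong pairs
pairs = upper.stack()
strong = pairs[pairs > 0.8].sort_values(ascending=False)
print(strong)
```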

5 Target distribution, and train/test distribution differences when a test set is available

Distribution of the target:

```python
# Histogram of the target with a kernel density estimate overlaid
sns.displot(df.MEDV, kde=True)
```

After a log transform, the target distribution is closer to normal:

```python
# Same plot on the log-transformed target
sns.displot(np.log(df.MEDV), kde=True)
```
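
The effect of the log transform can also be checked numerically by comparing skewness before and after. The sketch below uses a synthetic right-skewed series as a stand-in for the target; the lognormal parameters are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (strictly positive, so np.log is safe)
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = target.skew()
skew_log = np.log(target).skew()
print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

If a target can contain zeros, `np.log1p` is a common alternative, since `np.log(0)` is negative infinity.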

Violin plots:

```python
import math

# Lay the plots out in a two-column grid
num_cols = 2
num_rows = math.ceil(len(df.columns) / num_cols)
plt.figure(figsize=(14, num_rows * 2.5))

for idx, col in enumerate(df.columns):
    plt.subplot(num_rows, num_cols, idx + 1)
    # MEDV is continuous, so draw one violin per column
    # instead of one violin per distinct target value
    sns.violinplot(y=df[col], color="skyblue")
    plt.title(f"{col} distribution")

plt.tight_layout()
plt.show()
```

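Distribution of numeric variables in the train and test sets.
The Boston data has no test set, so this demonstration uses a different dataset (train, test, and num_var come from that dataset):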

```python
# Assumes `train`, `test`, and `num_var` (numeric column names) come from another dataset.
# Stack train and test, tagging each row with its source
combined = pd.concat([train[num_var].assign(Source='Train'),
                      test[num_var].assign(Source='Test')],
                     axis=0, ignore_index=True)

# One row of plots per variable: KDE comparison, train box plot, test box plot
fig, axes = plt.subplots(len(num_var), 3, figsize=(16, len(num_var) * 4.2), squeeze=False,
                         gridspec_kw={'hspace': 0.35, 'wspace': 0.3,
                                      'width_ratios': [0.80, 0.20, 0.20]})

for i, col in enumerate(num_var):
    # Left: overlaid KDE curves for train and test
    ax = axes[i, 0]
    sns.kdeplot(data=combined[[col, 'Source']], x=col, hue='Source', ax=ax, linewidth=2.1)
    ax.set_title(f"\n{col}", fontsize=9, fontweight='bold')
    ax.grid(visible=True, which='both', linestyle='--', color='lightgrey', linewidth=0.75)
    ax.set(xlabel='', ylabel='')

    # Middle: box plot of the train values
    ax = axes[i, 1]
    sns.boxplot(data=combined.loc[combined.Source == 'Train', [col]], y=col, width=0.25,
                saturation=0.90, linewidth=0.90, fliersize=2.25, color='#037d97', ax=ax)
    ax.set(xlabel='', ylabel='')
    ax.set_title("Train", fontsize=9, fontweight='bold')

    # Right: box plot of the test values
    ax = axes[i, 2]
    sns.boxplot(data=combined.loc[combined.Source == 'Test', [col]], y=col, width=0.25,
                saturation=0.6, linewidth=0.90, fliersize=2.25, color='#E4591E', ax=ax)
    ax.set(xlabel='', ylabel='')
    ax.set_title("Test", fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()
```
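
The visual comparison of train and test can be complemented with a numeric check. One option, not used in the original post, is the two-sample Kolmogorov-Smirnov test from SciPy. The sketch below runs it on synthetic columns; the arrays and the size of the shift are assumptions for the example.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic columns: one test column matches train, the other is shifted
rng = np.random.default_rng(0)
train_col = rng.normal(0, 1, 500)
test_same = rng.normal(0, 1, 500)
test_shifted = rng.normal(1, 1, 500)

# A larger statistic (and smaller p-value) suggests the distributions differ
same = ks_2samp(train_col, test_same)
shifted = ks_2samp(train_col, test_shifted)
print(f"same:    statistic={same.statistic:.3f}, p={same.pvalue:.3g}")
print(f"shifted: statistic={shifted.statistic:.3f}, p={shifted.pvalue:.3g}")
```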

6 Feature selection

Reference

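Filter methods: independent of the downstream learner. Each feature is scored by its variance (spread) or by its correlation with the target, and features are selected either by a score threshold or by a fixed number of features to keep.
- The example below keeps the features whose variance exceeds the threshold; on this dataset it selected 10 features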

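```python
from sklearn.feature_selection import VarianceThreshold

# Feature matrix: every column except the target
X = df.drop('MEDV', axis=1)

# Variance threshold selection: keep features whose variance exceeds `threshold`
var = VarianceThreshold(threshold=3)
var.fit_transform(X)  # returns the data with only the selected features
features = var.get_feature_names_out()
df[features]
```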

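Wrapper methods: take the downstream learner into account. Guided by an objective function (usually a predictive performance score), they add or remove several features at each step.
- The example below uses SelectKBest with chi2. Strictly speaking this is a univariate filter rather than a wrapper, because no learner is fitted. A true wrapper, such as recursive feature elimination, refits a model on candidate feature subsets.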

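```python
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# chi2 cannot handle missing values, so drop incomplete rows first
clean = df.dropna()
X = clean.drop('MEDV', axis=1)
Y = clean.MEDV

# Select the K best features. chi2 expects non-negative features and class labels,
# so the continuous target is cast to int here
skb = SelectKBest(chi2, k=10)
skb.fit_transform(X, Y.astype('int'))  # returns the data with only the selected features
features = skb.get_feature_names_out()
clean[features]
```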

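Embedded methods: a combination of the filter and wrapper ideas. A model is trained first, which yields a weight or coefficient for each feature, and features are then selected in descending order of those coefficients.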

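```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Drop incomplete rows, then split features and target
clean = df.dropna()
X = clean.drop('MEDV', axis=1)
Y = clean.MEDV

# Feature selection with L1-penalized logistic regression as the base model.
# Casting the target to int treats each integer price as a class
sfm = SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver='liblinear'))
sfm.fit_transform(X, Y.astype(int))
features = sfm.get_feature_names_out()
clean[features]
```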

Reference: EDA + Modeling📈(Ensemle+NN)
