公众号:尤而小屋
作者:Peter
编辑:Peter
大家好,我是Peter~
本文是UCI数据集建模的第3篇,第一篇是数据的探索性分析EDA部分,第二篇是基于LightGBM模型的baseline。
本文是第3篇,主要是对LightGBM模型的优化,最终准确率提升2%+
导入库
导入建模所需要的各种库:
In [1]:
python
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 100)
from IPython.display import display_html
import plotly_express as px
import plotly.graph_objects as go
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"]=["SimHei"] # 设置字体
plt.rcParams["axes.unicode_minus"]=False # 解决"-"负号的乱码问题
import seaborn as sns
%matplotlib inline
import missingno as ms
import gc
from datetime import datetime
from sklearn.model_selection import train_test_split,StratifiedKFold,GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import KMeansSMOTE, SMOTE
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, auc
from sklearn.metrics import roc_auc_score,precision_recall_curve, confusion_matrix,classification_report
# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from pydotplus import graph_from_dot_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
2 导入数据
In [2]:
ini
df = pd.read_csv("UCI.csv")
df.head()
Out[2]:
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | -2 | 3913.0 | 3102.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | 2 | 2682.0 | 1725.0 | 2682.0 | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239.0 | 14027.0 | 13559.0 | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990.0 | 48233.0 | 49291.0 | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | 0 | 8617.0 | 5670.0 | 35835.0 | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
3 数据基本信息
1、整体数据量
整理的数据量大小:30000条记录,25个字段信息
In [3]:
df.shape
Out[3]:
scss
(30000, 25)
2、数据字段信息
In [4]:
bash
df.columns # 全部的字段名
Out[4]:
css
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default.payment.next.month'],
dtype='object')
不同的字段类型统计:
In [5]:
df.dtypes
Out[5]:
go
ID int64
LIMIT_BAL float64
SEX int64
EDUCATION int64
MARRIAGE int64
AGE int64
PAY_0 int64
PAY_2 int64
PAY_3 int64
PAY_4 int64
PAY_5 int64
PAY_6 int64
BILL_AMT1 float64
BILL_AMT2 float64
BILL_AMT3 float64
BILL_AMT4 float64
BILL_AMT5 float64
BILL_AMT6 float64
PAY_AMT1 float64
PAY_AMT2 float64
PAY_AMT3 float64
PAY_AMT4 float64
PAY_AMT5 float64
PAY_AMT6 float64
default.payment.next.month int64
dtype: object
In [6]:
bash
pd.value_counts(df.dtypes) # 统计不同类型的个数
Out[6]:
go
float64 13
int64 12
Name: count, dtype: int64
从结果中能够看到全部是数值型字段,几乎各占一半。最后一个字段default.payment.next.month
是我们最终的目标字段。
字段名称的具体解释:
- ID:ID唯一值
- LIMIT_BAL:可透支金额(新台币计算,包含个人或者家庭)
- SEX:性别:1-男, 2-女
- EDUCATION:1-研究生;2-本科;3-高中;4-其他;0/5/6-未知
- MARRIAGE:婚姻状态;1-已婚,2-单身;3-其他
- AGE:年龄
- PAY_0:2005年9月的还款状态(-2-未消费,-1-按时还款, 1-延迟一个月还款, 2-延迟两个月还款,...,8-延迟8个月还款, 9-延迟9个月还款)
- PAY_2:2005年8月的还款状态(同上)
- PAY_3:2005年7月的还款状态(同上)
- PAY_4:2005年6月的还款状态(同上)
- PAY_5:2005年5月的还款状态(同上)
- PAY_6:2005年4月的还款状态(同上)
- BILL_AMT1:2005年9月的账单金额
- BILL_AMT2:2005年8月的账单金额
- BILL_AMT3:2005年7月的账单金额
- BILL_AMT4:2005年6月的账单金额
- BILL_AMT5:2005年5月的账单金额
- BILL_AMT6:2005年4月的账单金额
- PAY_AMT1:2005年9月之前的付款金额;
- PAY_AMT2:2005年8月之前的付款金额
- PAY_AMT3:2005年7月之前的付款金额
- PAY_AMT4:2005年6月之前的付款金额
- PAY_AMT5:2005年5月之前的付款金额
- PAY_AMT6:2005年4月之前的付款金额
- default.payment.next.month:最终目标变量,下个月还款违约情况(1-是,逾期;0-否,未逾期)
说明内容:
- PAY_ATM如果低于银行规定的最低还款额,则视为违约;
- PAY_ATM如果大于上月账单金额BILL_AMT,则视为及时还;
- PAY_AMT如果大于最低还款额但低于上月账单金额,则视为延迟还款。
3、数据的描述统计信息
In [7]:
bash
df.describe().T # 字段较多,转置后显示更直观
Out[7]:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
ID | 30000.0 | 15000.500000 | 8660.398374 | 1.0 | 7500.75 | 15000.5 | 22500.25 | 30000.0 |
LIMIT_BAL | 30000.0 | 167484.322667 | 129747.661567 | 10000.0 | 50000.00 | 140000.0 | 240000.00 | 1000000.0 |
SEX | 30000.0 | 1.603733 | 0.489129 | 1.0 | 1.00 | 2.0 | 2.00 | 2.0 |
EDUCATION | 30000.0 | 1.853133 | 0.790349 | 0.0 | 1.00 | 2.0 | 2.00 | 6.0 |
MARRIAGE | 30000.0 | 1.551867 | 0.521970 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
AGE | 30000.0 | 35.485500 | 9.217904 | 21.0 | 28.00 | 34.0 | 41.00 | 79.0 |
PAY_0 | 30000.0 | -0.016700 | 1.123802 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_2 | 30000.0 | -0.133767 | 1.197186 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_3 | 30000.0 | -0.166200 | 1.196868 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_4 | 30000.0 | -0.220667 | 1.169139 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_5 | 30000.0 | -0.266200 | 1.133187 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_6 | 30000.0 | -0.291100 | 1.149988 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
BILL_AMT1 | 30000.0 | 51223.330900 | 73635.860576 | -165580.0 | 3558.75 | 22381.5 | 67091.00 | 964511.0 |
BILL_AMT2 | 30000.0 | 49179.075167 | 71173.768783 | -69777.0 | 2984.75 | 21200.0 | 64006.25 | 983931.0 |
BILL_AMT3 | 30000.0 | 47013.154800 | 69349.387427 | -157264.0 | 2666.25 | 20088.5 | 60164.75 | 1664089.0 |
BILL_AMT4 | 30000.0 | 43262.948967 | 64332.856134 | -170000.0 | 2326.75 | 19052.0 | 54506.00 | 891586.0 |
BILL_AMT5 | 30000.0 | 40311.400967 | 60797.155770 | -81334.0 | 1763.00 | 18104.5 | 50190.50 | 927171.0 |
BILL_AMT6 | 30000.0 | 38871.760400 | 59554.107537 | -339603.0 | 1256.00 | 17071.0 | 49198.25 | 961664.0 |
PAY_AMT1 | 30000.0 | 5663.580500 | 16563.280354 | 0.0 | 1000.00 | 2100.0 | 5006.00 | 873552.0 |
PAY_AMT2 | 30000.0 | 5921.163500 | 23040.870402 | 0.0 | 833.00 | 2009.0 | 5000.00 | 1684259.0 |
PAY_AMT3 | 30000.0 | 5225.681500 | 17606.961470 | 0.0 | 390.00 | 1800.0 | 4505.00 | 896040.0 |
PAY_AMT4 | 30000.0 | 4826.076867 | 15666.159744 | 0.0 | 296.00 | 1500.0 | 4013.25 | 621000.0 |
PAY_AMT5 | 30000.0 | 4799.387633 | 15278.305679 | 0.0 | 252.50 | 1500.0 | 4031.50 | 426529.0 |
PAY_AMT6 | 30000.0 | 5215.502567 | 17777.465775 | 0.0 | 117.75 | 1500.0 | 4000.00 | 528666.0 |
default.payment.next.month | 30000.0 | 0.221200 | 0.415062 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
4、字段整体信息
In [8]:
go
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 30000 non-null int64
1 LIMIT_BAL 30000 non-null float64
2 SEX 30000 non-null int64
3 EDUCATION 30000 non-null int64
4 MARRIAGE 30000 non-null int64
5 AGE 30000 non-null int64
6 PAY_0 30000 non-null int64
7 PAY_2 30000 non-null int64
8 PAY_3 30000 non-null int64
9 PAY_4 30000 non-null int64
10 PAY_5 30000 non-null int64
11 PAY_6 30000 non-null int64
12 BILL_AMT1 30000 non-null float64
13 BILL_AMT2 30000 non-null float64
14 BILL_AMT3 30000 non-null float64
15 BILL_AMT4 30000 non-null float64
16 BILL_AMT5 30000 non-null float64
17 BILL_AMT6 30000 non-null float64
18 PAY_AMT1 30000 non-null float64
19 PAY_AMT2 30000 non-null float64
20 PAY_AMT3 30000 non-null float64
21 PAY_AMT4 30000 non-null float64
22 PAY_AMT5 30000 non-null float64
23 PAY_AMT6 30000 non-null float64
24 default.payment.next.month 30000 non-null int64
dtypes: float64(13), int64(12)
memory usage: 5.7 MB
为了数据处理方便,将原始的default.payment.next.month字段重新命名成Label:
In [9]:
ini
df.rename(columns={"default.payment.next.month":"Label"},inplace=True)
4 缺失值
4.1 缺失值统计
统计每个字段的缺失值:
In [10]:
ini
df.isnull().sum().sort_values(ascending=False)
Out[10]:
css
ID 0
BILL_AMT2 0
PAY_AMT6 0
PAY_AMT5 0
PAY_AMT4 0
PAY_AMT3 0
PAY_AMT2 0
PAY_AMT1 0
BILL_AMT6 0
BILL_AMT5 0
BILL_AMT4 0
BILL_AMT3 0
BILL_AMT1 0
LIMIT_BAL 0
PAY_6 0
PAY_5 0
PAY_4 0
PAY_3 0
PAY_2 0
PAY_0 0
AGE 0
MARRIAGE 0
EDUCATION 0
SEX 0
Label 0
dtype: int64
In [11]:
ini
# 缺失值个数
total = df.isnull().sum().sort_values(ascending=False)
In [12]:
scss
# 缺失值比例
percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False)
percent
Out[12]:
css
ID 0.0
BILL_AMT2 0.0
PAY_AMT6 0.0
PAY_AMT5 0.0
PAY_AMT4 0.0
PAY_AMT3 0.0
PAY_AMT2 0.0
PAY_AMT1 0.0
BILL_AMT6 0.0
BILL_AMT5 0.0
BILL_AMT4 0.0
BILL_AMT3 0.0
BILL_AMT1 0.0
LIMIT_BAL 0.0
PAY_6 0.0
PAY_5 0.0
PAY_4 0.0
PAY_3 0.0
PAY_2 0.0
PAY_0 0.0
AGE 0.0
MARRIAGE 0.0
EDUCATION 0.0
SEX 0.0
Label 0.0
dtype: float64
将个数和比例的合并,显示完整的缺失值信息:
In [13]:
ini
pd.concat([total, percent],axis=1,keys=["Total","Percent"]).T
Out[13]:
4.2 缺失值可视化
In [14]:
ini
ms.bar(df,color="blue")
plt.show()
另一种写法:
In [15]:
shell
# ms.matrix(df, labels=True,label_rotation=45)
# plt.show()
下面进行不同字段的详细数据探索过程:
In [16]:
df.columns
Out[16]:
css
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
dtype='object')
ID字段对建模无效,直接删除:
In [17]:
ini
df.drop("ID",inplace=True,axis=1)
5 统计信息
5.1 Personal Information
查看用户的信用额度、学历、婚姻状态、年龄等字段的统计信息:
In [18]:
scss
df[['LIMIT_BAL', 'EDUCATION', 'MARRIAGE', 'AGE']].describe()
Out[18]:
LIMIT_BAL | EDUCATION | MARRIAGE | AGE | |
---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 167484.322667 | 1.853133 | 1.551867 | 35.485500 |
std | 129747.661567 | 0.790349 | 0.521970 | 9.217904 |
min | 10000.000000 | 0.000000 | 0.000000 | 21.000000 |
25% | 50000.000000 | 1.000000 | 1.000000 | 28.000000 |
50% | 140000.000000 | 2.000000 | 2.000000 | 34.000000 |
75% | 240000.000000 | 2.000000 | 2.000000 | 41.000000 |
max | 1000000.000000 | 6.000000 | 3.000000 | 79.000000 |
In [19]:
ini
df["EDUCATION"].value_counts().sort_values(ascending=False)
Out[19]:
yaml
EDUCATION
2 14030
1 10585
3 4917
5 280
4 123
6 51
0 14
Name: count, dtype: int64
用户的学历中,出现最多的是:本科生EDUCATION=2
In [20]:
ini
df["MARRIAGE"].value_counts().sort_values(ascending=False)
Out[20]:
yaml
MARRIAGE
2 15964
1 13659
3 323
0 54
Name: count, dtype: int64
用户的婚姻状态中,出现最多的是MARRIAGE=2,已婚人群。
5.2 LIMIT_BAL
LIMIT_BAL的分布
In [21]:
ini
df["LIMIT_BAL"].value_counts().sort_values(ascending=False)
Out[21]:
yaml
LIMIT_BAL
50000.0 3365
20000.0 1976
30000.0 1610
80000.0 1567
200000.0 1528
...
800000.0 2
1000000.0 1
327680.0 1
760000.0 1
690000.0 1
Name: count, Length: 81, dtype: int64
可以看到信用额度最为频繁的是50,000
In [22]:
ini
plt.figure(figsize = (14,6))
plt.title('Density Plot of LIMIT_BAL')
sns.set_color_codes("pastel")
sns.distplot(df['LIMIT_BAL'],kde=True,bins=200)
plt.show()
5.3 PAY0-PAY6
每月之前的对应还款状态:
In [23]:
scss
df[["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]].describe()
Out[23]:
PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | |
---|---|---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | -0.016700 | -0.133767 | -0.166200 | -0.220667 | -0.266200 | -0.291100 |
std | 1.123802 | 1.197186 | 1.196868 | 1.169139 | 1.133187 | 1.149988 |
min | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 |
25% | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 |
不同还款状态的对比:
In [24]:
ini
repay = df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'Label']]
repay = pd.melt(repay,
id_vars="Label",
var_name="Payment Status",
value_name="Delay(Month)"
)
repay.head()
Out[24]:
Label | Payment Status | Delay(Month) | |
---|---|---|---|
0 | 1 | PAY_0 | 2 |
1 | 1 | PAY_0 | -1 |
2 | 0 | PAY_0 | 0 |
3 | 0 | PAY_0 | 0 |
4 | 0 | PAY_0 | -1 |
In [25]:
ini
fig = px.box(repay, x="Payment Status", y="Delay(Month)",color="Label")
fig.show()
5.4 BILL_AMT1-BILL_AMT6
当月的账单金额
In [26]:
scss
df[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].describe()
Out[26]:
BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | |
---|---|---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 51223.330900 | 49179.075167 | 4.701315e+04 | 43262.948967 | 40311.400967 | 38871.760400 |
std | 73635.860576 | 71173.768783 | 6.934939e+04 | 64332.856134 | 60797.155770 | 59554.107537 |
min | -165580.000000 | -69777.000000 | -1.572640e+05 | -170000.000000 | -81334.000000 | -339603.000000 |
25% | 3558.750000 | 2984.750000 | 2.666250e+03 | 2326.750000 | 1763.000000 | 1256.000000 |
50% | 22381.500000 | 21200.000000 | 2.008850e+04 | 19052.000000 | 18104.500000 | 17071.000000 |
75% | 67091.000000 | 64006.250000 | 6.016475e+04 | 54506.000000 | 50190.500000 | 49198.250000 |
max | 964511.000000 | 983931.000000 | 1.664089e+06 | 891586.000000 | 927171.000000 | 961664.000000 |
是否违约客户的对比:
In [27]:
df.columns
Out[27]:
css
Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
dtype='object')
In [28]:
scss
BILL_AMTS = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
plt.figure(figsize=(12,6))
for i, col in enumerate(BILL_AMTS):
plt.subplot(2,3,i+1)
sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red",shade=True)
sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue",shade=True)
plt.xlim(-40000, 200000)
plt.ylabel("")
plt.xlabel(col, fontsize=12)
plt.legend()
plt.tight_layout()
plt.show()
5.5 PAY_AMT1-PAY_AMT6
每月之前的对应付款金额
In [29]:
scss
df[['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']].describe()
Out[29]:
PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | |
---|---|---|---|---|---|---|
count | 30000.000000 | 3.000000e+04 | 30000.00000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 5663.580500 | 5.921163e+03 | 5225.68150 | 4826.076867 | 4799.387633 | 5215.502567 |
std | 16563.280354 | 2.304087e+04 | 17606.96147 | 15666.159744 | 15278.305679 | 17777.465775 |
min | 0.000000 | 0.000000e+00 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1000.000000 | 8.330000e+02 | 390.00000 | 296.000000 | 252.500000 | 117.750000 |
50% | 2100.000000 | 2.009000e+03 | 1800.00000 | 1500.000000 | 1500.000000 | 1500.000000 |
75% | 5006.000000 | 5.000000e+03 | 4505.00000 | 4013.250000 | 4031.500000 | 4000.000000 |
max | 873552.000000 | 1.684259e+06 | 896040.00000 | 621000.000000 | 426529.000000 | 528666.000000 |
In [30]:
scss
PAY_AMTS = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
plt.figure(figsize=(12,6))
for i, col in enumerate(PAY_AMTS):
plt.subplot(2,3,i+1)
sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red", shade=True)
sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue", shade=True)
plt.xlim(-10000, 70000)
plt.ylabel("")
plt.xlabel(col, fontsize=12)
plt.legend()
plt.tight_layout()
plt.show()
6 Label
是否发生违约(default.payment.next.month重命名为Label)的人数进行对比:
In [31]:
scss
df["Label"].value_counts()
Out[31]:
yaml
Label
0 23364
1 6636
Name: count, dtype: int64
In [32]:
ini
label = df["Label"].value_counts()
df_label = pd.DataFrame(label).reset_index()
df_label
Out[32]:
Label | count | |
---|---|---|
0 | 0 | 23364 |
1 | 1 | 6636 |
In [33]:
arduino
# plt.figure(figsize = (6,6))
# plt.title('Default = 0 & Not Default = 1')
# sns.set_color_codes("pastel")
# sns.barplot(x = 'Label', y="count", data=df_label)
# locs, labels = plt.xticks()
# plt.show()
In [34]:
css
plt.figure(figsize = (5,5))
graph = sns.countplot(x="Label", data=df, palette=["red","blue"])
i = 0
for p in graph.patches:
print(type(p))
h = p.get_height()
percentage = round( 100 * df["Label"].value_counts()[i] / len(df),2)
str_percentage = f"{percentage} %"
graph.text(p.get_x()+p.get_width()/2., h - 100, str_percentage, ha="center")
i += 1
plt.title("class distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")
plt.show()
可以看到二者是很不均衡的。
In [35]:
shell
# value_counts = df['Label'].value_counts()
# # 计算每个值的百分比
# percentages = value_counts / len(df)
# # 使用matplotlib绘制柱状图
# plt.bar(value_counts.index, value_counts.values)
# # 在柱状图上添加百分比标签
# for i, v in enumerate(percentages.values):
# plt.text(i, v + 1, f'{v*100:.2f}%', ha='center',va="bottom")
# # 设置xy轴标签、标题
# plt.title("Class Distribution")
# plt.xticks([0,1], ["Non-Default","Default"])
# plt.xlabel("Default Payment Next Month",fontsize=12)
# plt.ylabel("Number of Clients")
# plt.show()
In [36]:
scss
value_counts = df['Label'].value_counts()
# 计算每个值的百分比
percentages = value_counts / len(df)
# 使用matplotlib绘制柱状图
plt.bar(value_counts.index, value_counts.values)
# 在柱状图上添加百分比标签
for i, v in enumerate(percentages.values):
plt.text(i, v + 1, f'{v*100:.2f}%', ha='center',va="bottom")
# 设置xy轴标签、标题
plt.title("Class Distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")
plt.show()
7 相关性分析
7.1 相关性热力图
In [37]:
ini
numeric = ['LIMIT_BAL','AGE','PAY_0','PAY_2',
'PAY_3','PAY_4','PAY_5','PAY_6',
'BILL_AMT1','BILL_AMT2','BILL_AMT3',
'BILL_AMT4','BILL_AMT5','BILL_AMT6'] # 全部数值型字段
numeric
Out[37]:
css
['LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
In [38]:
scss
corr = df[numeric].corr()
corr.head()
Out[38]:
LIMIT_BAL | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LIMIT_BAL | 1.000000 | 0.144713 | -0.271214 | -0.296382 | -0.286123 | -0.267460 | -0.249411 | -0.235195 | 0.285430 | 0.278314 | 0.283236 | 0.293988 | 0.295562 | 0.290389 |
AGE | 0.144713 | 1.000000 | -0.039447 | -0.050148 | -0.053048 | -0.049722 | -0.053826 | -0.048773 | 0.056239 | 0.054283 | 0.053710 | 0.051353 | 0.049345 | 0.047613 |
PAY_0 | -0.271214 | -0.039447 | 1.000000 | 0.672164 | 0.574245 | 0.538841 | 0.509426 | 0.474553 | 0.187068 | 0.189859 | 0.179785 | 0.179125 | 0.180635 | 0.176980 |
PAY_2 | -0.296382 | -0.050148 | 0.672164 | 1.000000 | 0.766552 | 0.662067 | 0.622780 | 0.575501 | 0.234887 | 0.235257 | 0.224146 | 0.222237 | 0.221348 | 0.219403 |
PAY_3 | -0.286123 | -0.053048 | 0.574245 | 0.766552 | 1.000000 | 0.777359 | 0.686775 | 0.632684 | 0.208473 | 0.237295 | 0.227494 | 0.227202 | 0.225145 | 0.222327 |
In [39]:
ini
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(12,10))
sns.heatmap(corr,
mask=mask,
vmin=-1,
vmax=1,
center=0,
square=True,
cbar_kws={'shrink': .5},
annot=True,
annot_kws={'size': 10},
cmap="Blues")
plt.show()
7.2 变量两两关系
In [40]:
ini
plt.figure(figsize=(12,10))
pair_plot = sns.pairplot(df[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','Label']],
hue='Label',
diag_kind='kde',
corner=True)
pair_plot._legend.remove()
8 正态检验-QQ图
为了检查我们的数据是否为高斯分布,我们使用一种称为分位数-分位数(QQ图)图的图形方法进行定性评估。
在QQ图中,独立变量的分位数与正态分布的预期分位数相对应。如果变量是正态分布的,QQ图中的点应该沿着45度对角线排列。
In [41]:
ini
sns.set_color_codes('pastel') # 设置样式
fig, axs = plt.subplots(5, 3, figsize=(18,18)) # 图像大小和子图设置
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
i, j = 0, 0
for f in numeric:
if j == 3:
j = 0
i = i + 1
stats.probplot(df[f], # 绘图数据:某个字段的全部取值
dist='norm', # 标准化
sparams=(df[f].mean(), df[f].std()),
plot=axs[i,j]) # 子图位置
axs[i,j].get_lines()[0].set_marker('.')
axs[i,j].grid()
axs[i,j].get_lines()[1].set_linewidth(3.0)
j = j+1
fig.tight_layout()
axs[4,2].set_visible(False)
plt.show()
9 数据预处理
9.1 分类型数据处理
针对分类型数据的处理:
In [42]:
scss
df["EDUCATION"].value_counts()
Out[42]:
yaml
EDUCATION
2 14030
1 10585
3 4917
5 280
4 123
6 51
0 14
Name: count, dtype: int64
In [43]:
bash
df["GRAD_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")
df["UNIVERSITY"] = (df["EDUCATION"] == 2).astype("category")
df["HIGH_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")
df.drop("EDUCATION",axis=1,inplace=True)
In [44]:
bash
df['MALE'] = (df['SEX'] == 1).astype('category')
df.drop('SEX', axis=1, inplace=True)
In [45]:
bash
df['MARRIED'] = (df['MARRIAGE'] == 1).astype('category')
df.drop('MARRIAGE', axis=1, inplace=True)
9.2 数据切分
In [46]:
ini
# 划分数据
y = df['Label']
X = df.drop('Label', axis=1, inplace=False)
根据y中的类别比例进行切分:
In [47]:
ini
# 切分数据
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=24, stratify=y)
9.3 特征归一化/标准化
最值归一化:
In [48]:
ini
mm = MinMaxScaler()
X_train_norm = X_train_raw.copy()
X_test_norm = X_test_raw.copy()
In [49]:
css
# LIMIT_BAL + AGE
X_train_norm['LIMIT_BAL'] = mm.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_norm['LIMIT_BAL'] = mm.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_norm['AGE'] = mm.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_norm['AGE'] = mm.transform(X_test_raw['AGE'].values.reshape(-1, 1))
In [50]:
ini
pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]
for pay in pay_list:
X_train_norm[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
X_test_norm[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))
In [51]:
scss
for i in range(1,7):
X_train_norm['BILL_AMT' + str(i)] = mm.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_test_norm['BILL_AMT' + str(i)] = mm.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_train_norm['PAY_AMT' + str(i)] = mm.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
X_test_norm['PAY_AMT' + str(i)] = mm.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
标准化过程:
In [52]:
css
ss = StandardScaler()
X_train_std = X_train_raw.copy()
X_test_std = X_test_raw.copy()
X_train_std['LIMIT_BAL'] = ss.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_std['LIMIT_BAL'] = ss.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_std['AGE'] = ss.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_std['AGE'] = ss.transform(X_test_raw['AGE'].values.reshape(-1, 1))
In [53]:
ini
pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]
for pay in pay_list:
X_train_std[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
X_test_std[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))
In [54]:
scss
for i in range(1,7):
X_train_std['BILL_AMT' + str(i)] = ss.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_test_std['BILL_AMT' + str(i)] = ss.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_train_std['PAY_AMT' + str(i)] = ss.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
X_test_std['PAY_AMT' + str(i)] = ss.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
In [55]:
scss
sns.set_color_codes('deep')
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5', 'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
fig, axs = plt.subplots(1, 2, figsize=(24,6))
sns.boxplot(data=X_train_norm[numeric], ax=axs[0])
axs[0].set_title('Boxplot of normalized numeric features')
axs[0].set_xticklabels(labels=numeric, rotation=25)
axs[0].set_xlabel(' ')
sns.boxplot(data=X_train_std[numeric], ax=axs[1])
axs[1].set_title('Boxplot of standardized numeric features')
axs[1].set_xticklabels(labels=numeric, rotation=25)
axs[1].set_xlabel(' ')
fig.tight_layout()
plt.show()
9.4 数据降维
In [56]:
ini
pc = len(X_train_norm.columns.values) # 25
pca = PCA(n_components=pc) # 指定主成分个数
pca.fit(X_train_norm)
sns.reset_orig()
sns.set_color_codes('pastel') # 设置绘图颜色
plt.figure(figsize = (8,4)) # 图的大小
plt.grid() # 网格设置
plt.title('Explained Variance of Principal Components') # 标题设置
plt.plot(pca.explained_variance_ratio_, marker='o') # 绘制单个主成分的方差解释比例
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o') # 累计解释方差
plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"]) # 图例设置
plt.xlabel('Principal Component Indexes') # x-y轴标题
plt.ylabel('Explained Variance Ratio')
plt.tight_layout() # 调整布局,更紧凑
plt.axvline(12, 0, ls='--') # 设置虚线x=12
plt.show() # 显示图像
代码的各部分含义如下:
pc = len(X_train_norm.columns.values) # 25
:计算训练集的特征数量,这里的结果是25。pca = PCA(n_components=pc) # 指定主成分个数
:创建一个PCA对象,指定主成分的数量为pc
,即25。pca.fit(X_train_norm)
:对训练集X_train_norm
进行PCA拟合。sns.reset_orig()
和sns.set_color_codes('pastel')
:这两行代码是使用seaborn库来设置绘图的颜色。reset_orig()
会重置颜色到默认设置,set_color_codes('pastel')
会将颜色设置为柔和色调。plt.figure(figsize = (8,4))
:创建一个新的图形,设置其大小为8x4。plt.grid()
:在图形上显示网格。plt.title('Explained Variance of Principal Components')
:设置图形的标题为"主成分的方差解释"。plt.plot(pca.explained_variance_ratio_, marker='o')
:绘制单个主成分的方差解释比例。plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
:绘制累积方差解释比例。plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"])
:为图形添加图例,分别表示单个主成分的方差解释和累积方差解释。plt.xlabel('Principal Component Indexes')
:设置x轴的标签为"主成分索引"。plt.ylabel('Explained Variance Ratio')
:设置y轴的标签为"方差解释比例"。plt.tight_layout()
:自动调整图形布局,使其看起来紧凑。plt.axvline(12, 0, ls='--')
:在x=12的位置画一条从y=0到y=1的虚线。这可能是为了标示某个特定的主成分。plt.show()
:显示图形。
根据PCA的定义,主成分的顺序是不重要的,它们只按照其方差大小进行排序。
9.4.1 计算累计解释方差
In [57]:
ini
cumsum = np.cumsum(pca.explained_variance_ratio_) # 计算累计解释性方差
cumsum
Out[57]:
scss
array([0.44924877, 0.6321187 , 0.8046163 , 0.87590932, 0.92253799,
0.95438576, 0.96762706, 0.97773098, 0.9842774 , 0.98824928,
0.99088299, 0.99280785, 0.99444757, 0.99576128, 0.99690533,
0.99781622, 0.99844676, 0.99890236, 0.99924315, 0.99955744,
0.9997182 , 0.99983861, 0.99992993, 1. , 1. ])
In [58]:
ini
indexes = ['PC' + str(i) for i in range(1, pc+1)]
cumsum_df = pd.DataFrame(data=cumsum, index=indexes, columns=['var1'])
cumsum_df.head()
Out[58]:
var1 | |
---|---|
PC1 | 0.449249 |
PC2 | 0.632119 |
PC3 | 0.804616 |
PC4 | 0.875909 |
PC5 | 0.922538 |
In [59]:
scss
# 保留4位小数
cumsum_df['var2'] = pd.Series([round(val, 4) for val in cumsum_df['var1']],
index = cumsum_df.index)
# 转成百分比
cumsum_df['Cumulative Explained Variance'] = pd.Series(["{0:.2f}%".format(val * 100) for val in cumsum_df['var2']],
index = cumsum_df.index)
cumsum_df.head()
Out[59]:
var1 | var2 | Cumulative Explained Variance | |
---|---|---|---|
PC1 | 0.449249 | 0.4492 | 44.92% |
PC2 | 0.632119 | 0.6321 | 63.21% |
PC3 | 0.804616 | 0.8046 | 80.46% |
PC4 | 0.875909 | 0.8759 | 87.59% |
PC5 | 0.922538 | 0.9225 | 92.25% |
In [60]:
ini
cumsum_df = cumsum_df.drop(['var1','var2'], axis=1, inplace=False)
cumsum_df.T.iloc[:,:15]
Out[60]:
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cumulative Explained Variance | 44.92% | 63.21% | 80.46% | 87.59% | 92.25% | 95.44% | 96.76% | 97.77% | 98.43% | 98.82% | 99.09% | 99.28% | 99.44% | 99.58% | 99.69% |
9.4.2 指定主成分个数12
In [61]:
ini
pc = 12
pca = PCA(n_components=pc)
pca.fit(X_train_norm)
X_train = pd.DataFrame(pca.transform(X_train_norm))
X_test = pd.DataFrame(pca.transform(X_test_norm))
# 列名设置
X_train.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_test.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_train.head()
Out[61]:
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.234536 | -0.310556 | 0.812443 | 0.583386 | 0.086486 | 0.193288 | -0.045393 | -0.059547 | 0.031720 | -0.001745 | -0.004745 | -0.003148 |
1 | -0.781139 | -0.520069 | -0.198721 | -0.239243 | -0.055078 | -0.059366 | -0.090988 | 0.049630 | -0.070282 | 0.059528 | 0.033893 | 0.003430 |
2 | -0.787315 | -0.131143 | 0.747751 | -0.187888 | 0.166084 | -0.272372 | 0.157680 | -0.008314 | 0.252000 | -0.074637 | 0.029909 | 0.058873 |
3 | -0.636174 | 0.390267 | -0.599050 | -0.132501 | -0.213672 | -0.049675 | -0.114476 | -0.006438 | 0.058377 | 0.035740 | 0.052377 | 0.030388 |
4 | -0.790242 | -0.497498 | -0.205812 | -0.227087 | 0.045253 | -0.137781 | -0.179086 | -0.010123 | 0.019700 | 0.008193 | 0.001996 | 0.011253 |
10 数据不均衡处理
10.1 目标变量类别数统计
In [62]:
ini
count = pd.value_counts(y_train)
count
Out[62]:
yaml
Label
0 17523
1 4977
Name: count, dtype: int64
In [63]:
ini
percentage = pd.value_counts(y_train, normalize=True)
percentage
Out[63]:
yaml
Label
0 0.7788
1 0.2212
Name: proportion, dtype: float64
In [64]:
ini
class_count_df = pd.DataFrame(data=count.values,
index=['Non-defaulters', 'Defaulters'],
columns=['Number']
)
class_count_df
Out[64]:
Number | |
---|---|
Non-defaulters | 17523 |
Defaulters | 4977 |
In [65]:
css
class_count_df["Percentage"] = percentage.values
class_count_df
Out[65]:
Number | Percentage | |
---|---|---|
Non-defaulters | 17523 | 0.7788 |
Defaulters | 4977 | 0.2212 |
In [66]:
less
class_count_df["Percentage"] = class_count_df["Percentage"].apply(lambda x: "{:.2%}".format(x))
class_count_df
Out[66]:
Number | Percentage | |
---|---|---|
Non-defaulters | 17523 | 77.88% |
Defaulters | 4977 | 22.12% |
基于自定义函数的实现的小数转成百分比:
python
def to_percent(x):
return "{:.2%}".format(x)
df[col] = df[col].apply(to_percent)
10.2 方法1:基于聚类中心的欠采样Cluster Centroid Undersampling
具体实施过程:
In [67]:
ini
oversample = ClusterCentroids(random_state=24) # 设置对象
# 针对X_train和y_train 的欠采样
X_train_cc, y_train_cc = oversample.fit_resample(X_train, y_train)
In [68]:
ini
count_cc = pd.value_counts(y_train_cc) # 换成采样后的数据y_train_cc
percentage_cc = pd.value_counts(y_train_cc, normalize=True)
class_count_df_cc = pd.DataFrame(data=count_cc.values,
index=['Non-defaulters', 'Defaulters'],
columns=['Number']
)
class_count_df_cc["Percentage"] = percentage_cc.values
class_count_df_cc["Percentage"] = class_count_df_cc["Percentage"].apply(lambda x: "{:.2%}".format(x))
class_count_df_cc
Out[68]:
Number | Percentage | |
---|---|---|
Non-defaulters | 4977 | 50.00% |
Defaulters | 4977 | 50.00% |
此时我们发现y=0和y=1是均衡的,保证数据和少数类样本相同。
10.3 方法2:合成少数累过采样技术Synthetic Minority Oversampling Technique(SMOTE)
SMOTE(Synthetic Minority Oversampling Technique)是一种过采样方法,旨在解决数据集不平衡问题。它通过对少数类样本进行插值生成合成样本,从而增加少数类样本的数量。SMOTE的主要步骤包括:
- 对于每一个少数类样本,计算其与所有其他少数类样本之间的距离,并找到其K个最近邻居。
- 从这K个最近邻居中随机选择一个样本,并计算该样本与当前样本的差异。
- 根据差异比例,生成一个新的合成样本,该样本位于两个样本之间的连线上。
- 重复上述步骤,生成指定数量的合成样本。
SMOTE算法的关键是通过插值生成合成样本,从而使得少数类样本的特征空间得到扩展。这有助于模型更好地探索和学习少数类的特征,提高模型的性能。
具体实施过程:
In [69]:
ini
oversample = SMOTE(random_state=24)
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)
In [70]:
ini
count_smote = pd.value_counts(y_train_smote) # y_train_smote
percentage_smote = pd.value_counts(y_train_smote, normalize=True)
class_count_df_smote = pd.DataFrame(data=count_smote.values,
index=['Non-defaulters', 'Defaulters'],
columns=['Number']
)
class_count_df_smote["Percentage"] = percentage_smote.values
class_count_df_smote["Percentage"] = class_count_df_smote["Percentage"].apply(lambda x: "{:.2%}".format(x))
class_count_df_smote
Out[70]:
Number | Percentage | |
---|---|---|
Non-defaulters | 17523 | 50.00% |
Defaulters | 17523 | 50.00% |
此时我们发现,少数类的样本经过过采样变得和多数类样本数相同。
10.4 方法3:结合K-Means聚类 + SMOTE
具体实施过程:
In [71]:
ini
oversample = KMeansSMOTE(cluster_balance_threshold=0.00001, random_state=24)
X_train_ksmote, y_train_ksmote = oversample.fit_resample(X_train, y_train)
In [72]:
ini
count_ksmote = pd.value_counts(y_train_ksmote) # y_train_ksmote
percentage_ksmote = pd.value_counts(y_train_ksmote, normalize=True)
class_count_df_ksmote = pd.DataFrame(data=count_ksmote.values,
index=['Non-defaulters', 'Defaulters'],
columns=['Number']
)
class_count_df_ksmote["Percentage"] = percentage_ksmote.values
class_count_df_ksmote["Percentage"] = class_count_df_ksmote["Percentage"].apply(lambda x: "{:.2%}".format(x))
class_count_df_ksmote
Out[72]:
Number | Percentage | |
---|---|---|
Non-defaulters | 17528 | 50.01% |
Defaulters | 17523 | 49.99% |
10.5 对比三种方法
In [73]:
scss
display(class_count_df)
display(class_count_df_cc)
display(class_count_df_smote)
display(class_count_df_ksmote)
Number | Percentage | |
---|---|---|
Non-defaulters | 17523 | 77.88% |
Defaulters | 4977 | 22.12% |
Number | Percentage | |
---|---|---|
Non-defaulters | 4977 | 50.00% |
Defaulters | 4977 | 50.00% |
Number | Percentage | |
---|---|---|
Non-defaulters | 17523 | 50.00% |
Defaulters | 17523 | 50.00% |
Number | Percentage | |
---|---|---|
Non-defaulters | 17528 | 50.01% |
Defaulters | 17523 | 49.99% |
原始数据中类别是极不均衡;经过3种采样方法处理后,基于聚类中心和SMOTE采样的方法能够类别数相同。
但是如果使用K-Means SMOTE方法采样后,两个类别的比例稍有差别。
11 模型评估
11.1 交叉验证
基于 k-fold cross-validation的交叉验证:将数据分为k折,前面k-1用于训练,剩下1折用于验证。
分类模型评价指标
1、混淆矩阵
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> Predicted Negative Predicted Positive Actual Negative TN FP Actual Positive FN TP \begin{array}{ccc} & \text { Predicted Negative } & \text { Predicted Positive } \\ \hline \text { Actual Negative } & \text { TN } & \text { FP } \\ \text { Actual Positive } & \text { FN } & \text { TP } \end{array} </math> Actual Negative Actual Positive Predicted Negative TN FN Predicted Positive FP TP
2、准确率
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> A c c u r a c y = T P + T N T P + F P + T N + F N { Accuracy }=\frac{T P+T N}{T P+F P+T N+F N} </math>Accuracy=TP+FP+TN+FNTP+TN
3、精确率
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> Precision = T P T P + F P \text { Precision} =\frac{T P}{T P+F P} </math> Precision=TP+FPTP
4、召回率
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> Recall = T P T P + F N \text { Recall} =\frac{T P}{T P+F N} </math> Recall=TP+FNTP
5、F1_score
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> F 1 s c o r e = 2 1 r + 1 p = 2 r p r + p { F1_{score} }=\frac{2}{\frac{1}{r}+\frac{1}{p}}=\frac{2 r p}{r+p} </math>F1score=r1+p12=r+p2rp
12 基于LightGBM建立二分类模型(使用非均衡数据)
使用不同的训练集的标签数据进行模型训练:
python
# pca降维后的数据y_train,
# 基于聚类中心的欠采样y_train_cc
# 基于SMOTE的过采样y_train_smote
# 基于聚类+SMOTE的采样y_train_smote
# y_train,y_train_cc,y_train_smote,y_train_ksmote
12.1 baseline-基础模型
In [74]:
bash
X_train # 降维与归一化后的特征数据
训练集中的目标值:
In [75]:
y_train
Out[75]:
yaml
24832 0
969 0
20833 1
21670 0
25380 0
..
20828 1
897 1
16452 0
3888 0
5743 0
Name: Label, Length: 22500, dtype: int64
模型训练:
In [76]:
ini
# 模型训练
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
[LightGBM] [Info] Number of positive: 4977, number of negative: 17523
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000607 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3060
[LightGBM] [Info] Number of data points in the train set: 22500, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.221200 -> initscore=-1.258687
[LightGBM] [Info] Start training from score -1.258687
模型预测:
In [77]:
ini
# 模型预测
y_pred = lgb_clf.predict(X_test)
y_pred
Out[77]:
ini
array([1, 0, 0, ..., 0, 0, 0], dtype=int64)
基于baseline的准确率acc:
In [78]:
scss
acc = accuracy_score(y_test, y_pred)
print(acc)
0.8130666666666667
12.2 优化模型(交叉验证+超参数调优)
In [79]:
ini
# 定义LightGBM分类器
lgb_clf_new = lgb.LGBMClassifier()
12.2.1 超参数范围
LightGBM算法一般对以下超参数进行调优:
- num_leaves(叶子节点数):控制树的深度,影响模型的复杂度和训练速度。较小的值会导致更深的树,更大的值会减少树的深度。
- learning_rate(学习率):控制每次迭代时的权重更新步长,影响模型的收敛速度和泛化能力。较小的值会导致更慢的收敛速度,较大的值可能导致过拟合。
- n_estimators(树的数量):控制模型的复杂度,影响模型的拟合能力和训练时间。较大的值会增加模型的复杂度,但也可能导致过拟合。
- max_depth(最大深度):控制树的最大深度,影响模型的复杂度和训练速度。较小的值会导致更深的树,更大的值会减少树的深度。
- min_child_samples(最小叶子节点样本数):控制一个叶子节点在分裂前所需的最小样本数,影响模型的复杂度和过拟合程度。较小的值会导致更多的叶子节点,更大的值会减少叶子节点的数量。
- subsample(随机采样比例):控制每个子节点上随机选择的特征比例,影响模型的训练速度和泛化能力。较小的值会导致更多的特征被选择,较大的值会减少特征的选择数量。
- colsample_bytree(列采样比例):控制每棵树在分裂时随机选择的特征比例,影响模型的训练速度和泛化能力。较小的值会导致更多的特征被选择,较大的值会减少特征的选择数量。
- reg_alpha(L1正则化系数):控制L1正则化的强度,影响模型的稀疏性和泛化能力。较小的值会导致更强的正则化,较大的值会减少正则化的强度。
- reg_lambda(L2正则化系数):控制L2正则化的强度,影响模型的稀疏性和泛化能力。较小的值会导致更强的正则化,较大的值会减少正则化的强度。
In [80]:
ini
# 设置超参数网格搜索范围
param_grid = {
'num_leaves': [31, 63, 127],
'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05],
'n_estimators': [100, 200, 300],
'max_depth': [4,5,6,7]
}
12.2.2 使用K折交叉验证
In [81]:
ini
# 使用k折交叉验证和网格搜索进行超参数调优
# 5折交叉验证实例对象
# cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv = 5
# 网格搜索
grid_search = GridSearchCV(lgb_clf_new, # lgb模型
param_grid, # 参数
scoring='accuracy', # 评估指标
cv=cv, # 5折交叉验证
n_jobs=-1
)
In [82]:
bash
# 网格搜索对象的训练
grid_search.fit(X_train, y_train)
确定最佳参数组合:
In [83]:
grid_search.best_params_
Out[83]:
arduino
{'learning_rate': 0.02, 'max_depth': 5, 'n_estimators': 300, 'num_leaves': 63}
12.2.3 建立新模型
基于网格搜索得到的最佳参数组合建立新的模型:
In [84]:
ini
new_model = lgb.LGBMClassifier(learning_rate=0.02,
max_depth=5,
n_estimators=300,
num_leaves=63)
new_model.fit(X_train,y_train)
12.2.4 新模型评估
In [85]:
ini
y_pred_new = new_model.predict(X_test)
y_pred_new
Out[85]:
ini
array([1, 0, 0, ..., 0, 0, 0], dtype=int64)
模型的准确率:
In [86]:
scss
acc_new = accuracy_score(y_test, y_pred_new)
print(acc_new)
0.8330666666666667