Case study: chi-square binning
The dataset used is germancredit.
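- Toad is a Python toolkit designed for industrial model development, with a particular focus on scorecard development.
- Toad covers the whole modeling workflow, from EDA, feature engineering, and feature selection to model validation and scorecard conversion.
- Toad's main features greatly simplify the most important and most time-consuming steps in modeling: feature selection and binning.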
Before using toad for the first time, install it:
pip install toad==0.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
Case code:
import pandas as pd
import numpy as np
import toad
data = pd.read_csv('../data/germancredit.csv')
data.replace({'good':0,'bad':1},inplace=True)
print(data.shape) # 1000 rows, 21 columns (20 features + the label)
data.head()
Output:
(1000, 21)
index | status.of.existing.checking.account | duration.in.month | credit.history | purpose | credit.amount | savings.account.and.bonds | present.employment.since | installment.rate.in.percentage.of.disposable.income | personal.status.and.sex | other.debtors.or.guarantors | ... | property | age.in.years | other.installment.plans | housing | number.of.existing.credits.at.this.bank | job | number.of.people.being.liable.to.provide.maintenance.for | telephone | foreign.worker | creditability |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ... < 0 DM | 6 | critical account/ other credits existing (not ... | radio/television | 1169 | unknown/ no savings account | ... >= 7 years | 4 | male : single | none | ... | real estate | 67 | none | own | 2 | skilled employee / official | 1 | yes, registered under the customers name | yes | 0 |
1 | 0 <= ... < 200 DM | 48 | existing credits paid back duly till now | radio/television | 5951 | ... < 100 DM | 1 <= ... < 4 years | 2 | female : divorced/separated/married | none | ... | real estate | 22 | none | own | 1 | skilled employee / official | 1 | none | yes | 1 |
2 | no checking account | 12 | critical account/ other credits existing (not ... | education | 2096 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : single | none | ... | real estate | 49 | none | own | 1 | unskilled - resident | 2 | none | yes | 0 |
3 | ... < 0 DM | 42 | existing credits paid back duly till now | furniture/equipment | 7882 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : single | guarantor | ... | building society savings agreement/ life insur... | 45 | none | for free | 1 | skilled employee / official | 2 | none | yes | 0 |
4 | ... < 0 DM | 24 | delay in paying off in the past | car (new) | 4870 | ... < 100 DM | 1 <= ... < 4 years | 3 | male : single | none | ... | unknown / no property | 53 | none | for free | 2 | skilled employee / official | 2 | none | yes | 1 |
Data field descriptions
- Status of existing checking account (balance status of the existing checking account)
- Duration in month (duration in months)
- Credit history
- Purpose (purpose of the application)
- Credit amount
- Savings account/bonds (amount held in savings accounts or bonds)
- Present employment since (years in current employment)
- Installment rate in percentage of disposable income (installment payment as a share of disposable income)
- Personal status and gender (marital status and gender)
- Other debtors / guarantors
- Present residence since (years at current residence)
- Property
- Age in years
- Other installment plans
- Housing (housing situation)
- Number of existing credits at this bank
- Job (type of job)
- Number of people being liable to provide maintenance for (number of dependants)
- Telephone (whether a telephone number is registered)
- foreign worker (whether the applicant is a foreign worker)
- creditability (the data label)
Data preprocessing
import pandas as pd
import numpy as np
import toad
from toad.plot import bin_plot
import os
os.chdir(r'D:\CodeProject\05JRFK_Project\my_project\day04-分箱和编码')
os.getcwd()
data = pd.read_csv('../data/germancredit.csv')
# Replace the label values: 0 for good, 1 for bad
data.replace({'good':0,'bad':1},inplace=True)
print(data.shape) # 1000 rows, 21 columns (20 features + the label)
data.head()
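Getting started with binning
# Initialize a Combiner object
combiner = toad.transform.Combiner()
# Fit on the data and specify the binning method; the other parameters are optional
# min_samples: minimum number of samples per bin, given as a count or as a proportion
combiner.fit(data,y='creditability',method='chi',min_samples = 0.05)
# Export the binning result as a dictionary
bins = combiner.export()
# Ways to adjust the bins:
# 1. n_bins: specify the number of bins
# 2. the update/set_rules API
bins # a dict covering all columns
# Inspect the binning result
print('duration.in.month:', bins['duration.in.month'])
# duration.in.month: [9, 12, 13, 16, 36, 45]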
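For intuition about what method='chi' is driven by, here is a minimal sketch of the chi-square statistic that ChiMerge-style binning computes for two adjacent bins. A low value means the two bins have a similar mix of good and bad samples, so they are candidates for merging. This is a plain-Python illustration rather than toad's implementation; the helper name `chi2_adjacent` and the counts are made up.

```python
def chi2_adjacent(bin_a, bin_b):
    """Chi-square statistic for two adjacent bins.

    Each bin is a (good_count, bad_count) tuple.
    Illustrative helper, not part of toad.
    """
    rows = [bin_a, bin_b]
    total = sum(sum(row) for row in rows)
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            # Expected count if both bins shared the same good/bad mix
            expected = row_total * col_total / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up counts, for illustration only
similar = chi2_adjacent((30, 10), (60, 20))    # same bad rate in both bins
different = chi2_adjacent((40, 10), (10, 40))  # very different bad rates
print(similar, different)
```

A merge-based binning procedure would repeatedly merge the adjacent pair with the smallest statistic until a stopping rule such as a target number of bins or a minimum bin size is met.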
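Adjusting bins and plotting
bin_plot is used to check whether a binning result is workable and whether the number of bins is reasonable.
Criterion: check whether the bad rate across bins in the bin_plot chart is monotonic; only a monotonic trend is readily interpretable.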
from toad.plot import bin_plot
c2 = toad.transform.Combiner()
c2.fit(data[['duration.in.month', 'creditability']], y='creditability', method='chi', n_bins=5) # adjust the binning result by specifying the number of bins
transformed = c2.transform(data[['duration.in.month', 'creditability']], labels=True)
bin_plot(transformed, x='duration.in.month', target='creditability')
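The monotonicity criterion can also be checked numerically instead of by eye. The sketch below computes the bad rate per bin and tests whether the sequence is monotonic. It is plain Python with made-up bin indices and labels; the helper names are hypothetical and not part of toad.

```python
from collections import defaultdict

def bad_rate_by_bin(bins, labels):
    """Return [(bin, bad_rate)] sorted by bin; labels are 0 (good) or 1 (bad)."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [n_samples, n_bad]
    for b, y in zip(bins, labels):
        counts[b][0] += 1
        counts[b][1] += y
    return [(b, bad / n) for b, (n, bad) in sorted(counts.items())]

def is_monotonic(values):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Made-up bin indices and labels, for illustration only
bins = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
rates = bad_rate_by_bin(bins, labels)
print(rates)
print(is_monotonic([rate for _, rate in rates]))
```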
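Other binning methods
Overview of binning methods
chi: chi-square binning
dt: decision-tree binning
quantile: equal-frequency binning
step: equal-width binning
kmeans: KMeans binning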
for method in ['chi', 'dt', 'quantile', 'step', 'kmeans']:
    c2 = toad.transform.Combiner()
    c2.fit(data[['duration.in.month', 'creditability']],
           y='creditability', method=method, n_bins=5)
    bin_plot(c2.transform(data[['duration.in.month', 'creditability']], labels=True),
             x='duration.in.month', target='creditability')
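To make the difference between equal-width (step) and equal-frequency (quantile) binning concrete, the sketch below derives split points for both from a small list. It is a simplified plain-Python illustration with made-up durations; toad's own split points may differ in detail, and both helper names are hypothetical.

```python
def equal_width_splits(values, n_bins):
    """Split points that cut [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def equal_frequency_splits(values, n_bins):
    """Split points that put roughly the same number of samples in each bin."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

# Made-up durations in months, for illustration only
durations = [6, 9, 12, 15, 18, 24, 30, 36, 48, 72]
print(equal_width_splits(durations, 5))
print(equal_frequency_splits(durations, 5))
```

With skewed data like this, equal-width bins leave the upper intervals nearly empty, while equal-frequency bins keep two samples in each interval.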
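Encoding
Encoding methods:
- one-hot: one-hot encoding
- label: label encoding
- WOE (Weight of Evidence): common in risk control; its calculation uses the proportions of positive and negative samples
WOE Encoding
WOE (Weight of Evidence) measures how well a single feature separates good users from bad ones. WOE encoding is an encoding method for binary classification problems: it computes a weight of evidence for each category to express that category's relationship with the target variable.
Advantages: after the WOE transform, a feature value no longer just denotes a category; it also carries that category's weight. The sign depends on the convention: with the good-over-bad ratio defined below, a larger WOE means a lower bad rate, whereas toad's WOETransformer takes the ratio the other way around (bad over good), so in toad's output a larger WOE means a higher bad rate. WOE can turn a feature whose relationship with the bad rate is nonlinear into a linear one, and it is insensitive to fluctuations, so it stays stable even in the presence of outliers.
Use cases: the method is typically used for feature engineering in classification models, especially in credit risk assessment, marketing models, and fraud detection. Its goal is to convert categorical variables into numeric variables so they can be used in statistical modeling.
Formula:
$$\mathrm{WOE}_i = \ln\left(\frac{G_i}{B_i}\right)$$
where $G_i$ is the share of all good users that fall in group $i$ and $B_i$ is the share of all bad users that fall in group $i$.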
Marital status | Good | Bad | G-B | ln(G/B) | WOE |
---|---|---|---|---|---|
Unmarried | 30% | 20% | 10% | 0.405 | 0.405 |
Married | 40% | 10% | 30% | 1.386 | 1.386 |
Divorced | 10% | 40% | -30% | -1.386 | -1.386 |
Widowed | 20% | 30% | -10% | -0.405 | -0.405 |
Total | 100% | 100% | | | |
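The WOE column of the table can be reproduced with a few lines of Python. The sketch below uses the good-over-bad convention of the table; the dictionary keys are English labels for the four marital-status groups and the function name is hypothetical.

```python
import math

# Share of all good users and of all bad users in each group (from the table)
distribution = {
    "unmarried": (0.30, 0.20),
    "married": (0.40, 0.10),
    "divorced": (0.10, 0.40),
    "widowed": (0.20, 0.30),
}

def woe_good_over_bad(good_share, bad_share):
    """WOE = ln(good share / bad share), the convention used in the table."""
    return math.log(good_share / bad_share)

woe = {group: woe_good_over_bad(g, b) for group, (g, b) in distribution.items()}
for group, value in woe.items():
    print(f"{group}: {value:.3f}")
```

Swapping the two shares flips the sign of every value, which gives the bad-over-good convention.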
Computing WOE with toad
Data preparation
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(data.drop('creditability', axis=1), data['creditability'], test_size=0.25, random_state=450)
#%%
# Re-attach the label to the features for the training set and for the test set
data_train = pd.concat([x_train, y_train], axis=1)
data_test = pd.concat([x_test, y_test], axis=1)
data_train['type'] = 'train'
data_test['type'] = 'test'
#%%
data_train
Binning (before adjustment)
c2 = toad.transform.Combiner()
adj_bins = {'duration.in.month': [9, 12, 18, 33]}
c2.set_rules(adj_bins)
#%%
data_ = pd.concat([data_train, data_test])
temp_data = c2.transform(data_[['duration.in.month', 'creditability', 'type']], labels=True)
#%%
from toad.plot import badrate_plot
from toad.plot import proportion_plot
badrate_plot(temp_data, by='duration.in.month', target='creditability', x='type')
proportion_plot(temp_data['duration.in.month'])
Binning (after adjustment)
# Suppose we merge the bins [9, 12) and [12, 18) by removing the split point 12
adj_bin = {'duration.in.month': [9,18,33]}
c2.set_rules(adj_bin)
temp_data = c2.transform(data_[['duration.in.month','creditability','type']])
badrate_plot(temp_data, target = 'creditability', x = 'type', by = 'duration.in.month')
proportion_plot(temp_data['duration.in.month'])
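proportion_plot shows each bin's share of the samples. The same shares can be computed directly to flag bins that fall below a minimum share, such as the 0.05 passed as min_samples earlier. The sketch below is plain Python over a made-up list of bin indices; the helper names are hypothetical and not part of toad.

```python
from collections import Counter

def bin_shares(bin_labels):
    """Share of samples in each bin, keyed by bin label."""
    counts = Counter(bin_labels)
    total = len(bin_labels)
    return {b: n / total for b, n in sorted(counts.items())}

def small_bins(bin_labels, min_share=0.05):
    """Bins whose share of samples is below min_share."""
    return [b for b, share in bin_shares(bin_labels).items() if share < min_share]

# Made-up bin indices for 100 samples, for illustration only
binned = [0] * 3 + [1] * 37 + [2] * 40 + [3] * 20
print(bin_shares(binned))
print(small_bins(binned))
```

A bin flagged this way is a natural candidate for merging with a neighbor through set_rules.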
Computing WOE
binned_data = c2.transform(data_train)
transfer = toad.transform.WOETransformer()
data_woe_result = transfer.fit_transform(binned_data, binned_data['creditability'], exclude=['creditability', 'type'])
data_woe_result.head()
Output:
index | status.of.existing.checking.account | duration.in.month | credit.history | purpose | credit.amount | savings.account.and.bonds | present.employment.since | installment.rate.in.percentage.of.disposable.income | personal.status.and.sex | other.debtors.or.guarantors | ... | age.in.years | other.installment.plans | housing | number.of.existing.credits.at.this.bank | job | number.of.people.being.liable.to.provide.maintenance.for | telephone | foreign.worker | creditability | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
569 | 0.786313 | 0.786622 | 0.069322 | -0.384125 | 0.333152 | 0.244802 | 0.002898 | -0.056341 | 0.355058 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | 0.039485 | 0.002648 | 0.012822 | 0.001722 | 0.043742 | 1 | train |
574 | 0.363027 | -0.279729 | 0.069322 | -0.384125 | -0.159408 | 0.244802 | -0.173326 | 0.154169 | -0.212356 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | -0.071350 | -0.298467 | 0.012822 | -0.001130 | 0.043742 | 0 | train |
993 | 0.786313 | 0.786622 | 0.069322 | 0.141484 | 0.333152 | 0.244802 | 0.534527 | 0.154169 | -0.212356 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | 0.039485 | 0.311383 | 0.012822 | 0.001722 | 0.043742 | 0 | train |
355 | 0.363027 | 0.099812 | 0.069322 | 0.272947 | -0.159408 | 0.244802 | 0.399313 | 0.154169 | -0.212356 | 0.0 | ... | 0.546949 | 0.605057 | -0.174441 | 0.039485 | -0.298467 | 0.012822 | -0.001130 | 0.043742 | 1 | train |
508 | -1.072960 | 0.099812 | 0.069322 | -0.384125 | -0.159408 | 0.244802 | 0.002898 | 0.154169 | -0.302447 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | 0.039485 | 0.002648 | 0.012822 | -0.001130 | 0.043742 | 0 | train |
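- Understanding WOE: it compares the ratio of good to bad users in the current group with the same ratio over all samples, and expresses the difference as the logarithm of the quotient of the two ratios.
  With the good-over-bad definition, the larger the WOE, the more likely a user in the group is good;
  the smaller (more negative) the WOE, the less likely a user in the group is good. A WOE near 0 means the group's ratio is close to the overall ratio.
- The binning result directly affects the WOE result: different bins can produce very different WOE mappings.
  Use about 5 to 10 bins in total (this can be adjusted, but usually does not exceed 10).
  The basic principle for merging bins is to make the difference in negative-sample share between bins as large as possible.
  No bin should hold less than 5% of all samples, so that the frequency in each bin is statistically meaningful.
- Pros and cons of the three encodings
Encoding | Advantages | Disadvantages |
---|---|---|
Onehot Encoding | Simple to handle, stable, no normalization needed, does not depend on historical data | Data becomes too sparse |
Label Encoding | Good discrimination, low dimensionality | Requires statistics from historical data, unstable, needs normalization |
WOE Encoding | Good discrimination, low dimensionality, no normalization needed | Requires statistics from historical data, unstable |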
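The dimensionality difference in the table is easy to see on a small example. The sketch below encodes the four marital-status groups used earlier with label encoding and with one-hot encoding. It is plain Python for illustration, both helper names are hypothetical, and a real project would more likely use a library encoder.

```python
def label_encode(values):
    """Map each distinct category to an integer (categories in sorted order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Map each value to a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    vectors = [[1 if v == cat else 0 for cat in categories] for v in values]
    return vectors, categories

status = ["married", "unmarried", "divorced", "widowed", "married"]
codes, mapping = label_encode(status)
vectors, categories = one_hot_encode(status)
print(codes)    # one integer column
print(vectors)  # one column per category
```

Label encoding and WOE encoding both keep a single column per feature, while one-hot encoding needs one column per category, which is where the sparsity comes from.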
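Case summary
Binning:
Create the object
c = toad.transform.Combiner()
Fit the bins; n_bins sets the number of bins
c.fit(data, y='creditability', method='chi', n_bins=xxx)
Apply the bins
c.transform(data)
Adjust the bins
c.set_rules()/ c.update()
Encoding
Create the object
transfer = toad.transform.WOETransformer()
WOE encoding
transfer.fit_transform()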
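Interpretability
Feature requirements in the credit business
- Simple logic
- Strong business interpretability
- Easy to construct
- Easy to troubleshoot
Model interpretability has no precise definition: any method that helps people understand a model's decision process and results can be counted as interpretability.
When we call a model a "black box", we mean that it has poor interpretability: neither its builders nor its users can clearly trace the basis for its decisions.
If a model is interpretable, we can understand, through some method, the logic by which its results are produced.
Generally, the more complex the algorithm, the better its accuracy tends to be and the worse its interpretability.
Ranked from most to least interpretable, algorithms fall into the following tiers:
- Tier 1: linear regression, logistic regression
- Tier 2: ensemble learning (the result is decided jointly by multiple trees)
- Tier 3: support vector machines (the data is mapped into a higher-dimensional space, which distorts it)
- Tier 4: deep learning
Conclusion: for this project, the candidate algorithms are logistic regression (which outputs the probability of default) and ensemble learning.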
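To illustrate why logistic regression sits in the most interpretable tier, the sketch below shows how it turns a weighted sum of features into a default probability. The feature values, weights, and bias are hypothetical numbers chosen for illustration, not fitted to the germancredit data.

```python
import math

def default_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical WOE-encoded features and coefficients, for illustration only
woe_features = [0.8, -0.3, 0.1]
weights = [0.9, 0.7, 0.5]
bias = -0.85
p = default_probability(woe_features, weights, bias)
print(round(p, 3))
```

Each weight can be read directly: its sign says whether the feature pushes the probability up or down, and its size says by how much.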