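Case 3: Personal Credit Assessment on a P2P Platform
Case Background
This case uses the classic prosperLoanData.csv dataset and a naive Bayes model to predict whether a borrower will repay a loan normally. The data comes from Prosper, a U.S. P2P lending platform on which people who need to borrow and lenders with idle funds match themselves directly. It currently has more than 980,000 members and more than US$200 million in loans, and is described as the world's largest P2P lending platform. This case relates the income and credit-line fields in the dataset to loan status (whether the loan is repaid normally), with the aim of assessing a borrower's personal credit from borrower information. Specifically, it uses this information to predict whether the borrower can repay normally.
Data Loading and Splitting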
python
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
base = pd.read_csv('./prosperLoanData.csv')
python
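# Keep listings created after 2008-1-1. The column holds zero-padded strings, so the
# comparison is lexicographic: January to September 2008 are dropped, October 2008 onward is kept.
# For a chronological filter, parse the column with pd.to_datetime first.
df_all = base[base['ListingCreationDate'] > '2008-1-1']
# Display a re-indexed copy; df_all itself keeps its original index.
df_all.reset_index(drop=True)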
| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 1 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 2 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| 3 | 0F05359734824199381F61D | 1074836 | 2013-12-14 08:26:37.093000000 | NaN | 60 | Current | NaN | 0.15425 | 0.1314 | 0.1214 | ... | -25.33 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 4 | 0F0A3576754255009D63151 | 750899 | 2013-04-12 09:52:56.147000000 | NaN | 36 | Current | NaN | 0.31032 | 0.2712 | 0.2612 | ... | -22.95 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85056 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | ... | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 85057 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | ... | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 85058 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | ... | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 85059 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 85060 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | ... | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
85061 rows × 81 columns
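The date filter above compares `ListingCreationDate` as text, not as dates. The following is a minimal sketch on made-up timestamps (the `toy` frame is an assumption, not part of the dataset) showing how a lexicographic comparison against `'2008-1-1'` differs from a chronological one after parsing with `pd.to_datetime`.

```python
import pandas as pd

# Toy timestamps in the same zero-padded format as ListingCreationDate (illustrative only).
toy = pd.DataFrame({
    "ListingCreationDate": [
        "2007-12-31 23:59:59.000000000",
        "2008-03-15 10:00:00.000000000",
        "2008-10-02 09:30:00.000000000",
        "2013-04-12 09:52:56.147000000",
    ]
})

# Lexicographic comparison: "2008-03..." sorts before "2008-1-1" because '0' < '1'.
as_text = toy["ListingCreationDate"] > "2008-1-1"

# Chronological comparison after parsing the strings into timestamps.
as_date = pd.to_datetime(toy["ListingCreationDate"]) > pd.Timestamp("2008-01-01")

print(as_text.tolist())  # [False, False, True, True]
print(as_date.tolist())  # [False, True, True, True]
```

The two filters disagree only for January through September 2008, so the choice matters if those listings are meant to be included.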
python
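columns_new=[
'ProsperRating (numeric)',  # Prosper rating (numeric)
'Term',  # loan term in months
'BorrowerRate',  # borrower's interest rate
'LoanStatus',  # loan status
'EmploymentStatus',  # employment status
'EmploymentStatusDuration',  # duration of the employment status
'IsBorrowerHomeowner',  # whether the borrower owns a home
'CreditScoreRangeLower',  # lower bound of the credit score range
'CreditScoreRangeUpper',  # upper bound of the credit score range
'CurrentCreditLines',  # number of current credit lines
'OpenCreditLines',  # number of open credit lines
'TotalCreditLinespast7years',  # number of credit lines in the past 7 years
'OpenRevolvingAccounts',  # number of open revolving accounts
'OpenRevolvingMonthlyPayment',  # monthly payment on revolving accounts
'InquiriesLast6Months',  # credit inquiries in the last 6 months
'TotalInquiries',  # total number of credit inquiries
'CurrentDelinquencies',  # number of current delinquencies
'AmountDelinquent',  # amount delinquent
'LoanOriginalAmount',  # original loan amount
'RevolvingCreditBalance',  # revolving credit balance
'BankcardUtilization',  # bankcard utilization
'TradesNeverDelinquent (percentage)',  # share of trades never delinquent
'DebtToIncomeRatio',  # borrower's debt-to-income ratio
'IncomeRange',  # borrower's annual income range
'IncomeVerifiable',  # whether the income is verifiable
'StatedMonthlyIncome',  # stated monthly income
'MonthlyLoanPayment'  # scheduled monthly loan payment
]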
df = pd.DataFrame(df_all,columns = columns_new)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85061 entries, 1 to 113936
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ProsperRating (numeric) 84853 non-null float64
1 Term 85061 non-null int64
2 BorrowerRate 85061 non-null float64
3 LoanStatus 85061 non-null object
4 EmploymentStatus 85061 non-null object
5 EmploymentStatusDuration 85041 non-null float64
6 IsBorrowerHomeowner 85061 non-null bool
7 CreditScoreRangeLower 85061 non-null float64
8 CreditScoreRangeUpper 85061 non-null float64
9 CurrentCreditLines 85061 non-null float64
10 OpenCreditLines 85061 non-null float64
11 TotalCreditLinespast7years 85061 non-null float64
12 OpenRevolvingAccounts 85061 non-null int64
13 OpenRevolvingMonthlyPayment 85061 non-null float64
14 InquiriesLast6Months 85061 non-null float64
15 TotalInquiries 85061 non-null float64
16 CurrentDelinquencies 85061 non-null float64
17 AmountDelinquent 85061 non-null float64
18 LoanOriginalAmount 85061 non-null int64
19 RevolvingCreditBalance 85061 non-null float64
20 BankcardUtilization 85061 non-null float64
21 TradesNeverDelinquent (percentage) 85061 non-null float64
22 DebtToIncomeRatio 77740 non-null float64
23 IncomeRange 85061 non-null object
24 IncomeVerifiable 85061 non-null bool
25 StatedMonthlyIncome 85061 non-null float64
26 MonthlyLoanPayment 85061 non-null float64
dtypes: bool(2), float64(19), int64(3), object(3)
memory usage: 17.0+ MB
Fill missing values in object-typed columns with 'unknown'
python
categorical=df.select_dtypes(include=['object']).columns.values
df[categorical]=df[categorical].fillna('unknown')
df.select_dtypes(exclude=[np.number]).isnull().sum()
LoanStatus 0
EmploymentStatus 0
IsBorrowerHomeowner 0
IncomeRange 0
IncomeVerifiable 0
dtype: int64
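The output above lists the two boolean columns alongside the object columns because `select_dtypes(exclude=[np.number])` keeps everything that is not numeric, while `include=['object']` selects only object columns. A small sketch on a made-up frame (the `toy` data is an assumption) illustrates the difference and the `'unknown'` fill:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one object column, one bool column, one float column.
toy = pd.DataFrame({
    "EmploymentStatus": pd.Series(["Employed", None, "Retired"], dtype=object),
    "IsBorrowerHomeowner": [True, False, True],
    "DebtToIncomeRatio": [0.17, np.nan, 0.06],
})

# Only object columns are filled with the placeholder label.
object_cols = toy.select_dtypes(include=["object"]).columns
toy[object_cols] = toy[object_cols].fillna("unknown")

# Excluding numeric dtypes leaves both the object and the bool column.
non_numeric = toy.select_dtypes(exclude=[np.number]).columns.tolist()

print(toy["EmploymentStatus"].tolist())
print(non_numeric)
```

The numeric column still holds its missing value at this point; it is handled in a separate step.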
Fill missing values in numeric columns with the median
python
# Select the numeric columns
categorical_num = df.select_dtypes(include=[np.number]).columns.values
# Return the columns that contain missing values
def find_na_column(df, columns):
    miss_columns = []
    for column in columns:
        if df[column].isnull().sum() > 0:
            miss_columns.append(column)
    return miss_columns
# Numeric columns that contain missing values
categorical_num = find_na_column(df, categorical_num)
# Fill a column's missing values with its median
def fillNull(column):
    df[column] = df[column].fillna(df[column].median())
# Fill every numeric column that has missing values
for column in categorical_num:
    fillNull(column)
df.select_dtypes(include=[np.number]).isnull().sum()
ProsperRating (numeric) 0
Term 0
BorrowerRate 0
EmploymentStatusDuration 0
CreditScoreRangeLower 0
CreditScoreRangeUpper 0
CurrentCreditLines 0
OpenCreditLines 0
TotalCreditLinespast7years 0
OpenRevolvingAccounts 0
OpenRevolvingMonthlyPayment 0
InquiriesLast6Months 0
TotalInquiries 0
CurrentDelinquencies 0
AmountDelinquent 0
LoanOriginalAmount 0
RevolvingCreditBalance 0
BankcardUtilization 0
TradesNeverDelinquent (percentage) 0
DebtToIncomeRatio 0
StatedMonthlyIncome 0
MonthlyLoanPayment 0
dtype: int64
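As a compact illustration of the median fill, the sketch below uses a made-up frame (the values are assumptions) with one extreme entry. The median of the observed values is 0.25, whereas the mean would be pulled up to 1.4 by the outlier, which is one reason a median fill can be preferable for skewed fields.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing value and one outlier (5.00).
toy = pd.DataFrame({
    "DebtToIncomeRatio": [0.10, np.nan, 0.30, 0.20, 5.00],
    "Term": [36, 60, 36, 12, 36],
})

numeric_cols = toy.select_dtypes(include=[np.number]).columns
medians = toy[numeric_cols].median()   # missing values are skipped
filled = toy.fillna(medians)           # each column is filled with its own median

print(medians["DebtToIncomeRatio"])           # 0.25
print(filled["DebtToIncomeRatio"].tolist())
```

Passing a Series of medians to `fillna` fills all columns in one call, which is an alternative to looping over columns.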
Encode and filter the target and the categorical variables (LoanStatus, EmploymentStatus, Term, IsBorrowerHomeowner, IncomeRange, IncomeVerifiable)
python
# Statuses encoded as 1; every other status (for example Current) becomes 0.
repaid_statuses = {'Completed', 'FinalPaymentInProgress', 'Past Due (1-15 days)',
                   'Past Due (16-30 days)', 'Past Due (31-60 days)',
                   'Past Due (61-90 days)', 'Past Due (91-120 days)'}
def loanStatus(value):
    # strip() keeps stray spaces around a label from causing a silent mismatch
    return 1 if str(value).strip() in repaid_statuses else 0
df["LoanStatus"] = df["LoanStatus"].map(loanStatus)
# Income ranges encoded as 0-6; any other value becomes 7.
income_codes = {'$0': 0, '$1-24,999': 1, '$25,000-49,999': 2, '$50,000-74,999': 3,
                '$75,000-99,999': 4, '$100,000+': 5, 'Not employed': 6}
def incomeType(value):
    return income_codes.get(str(value).strip(), 7)
df['IncomeRange'] = df['IncomeRange'].map(incomeType)
# Keep rows labelled 0 or 1 (loanStatus returns only these two values, so no rows are dropped).
df = df[df['LoanStatus'].isin([0, 1])].copy()
df['EmploymentStatus'] = df['EmploymentStatus'].replace({'Employed':1,'Self-employed':1,'Other':1,'Full-time':1,'Not employed':0,'Retired':0,'Part-time':1})
df['Term'] = df['Term'].replace({12.0:1,60.0:2,36.0:3})
df['IsBorrowerHomeowner'] = df['IsBorrowerHomeowner'].astype(int)
df['IncomeVerifiable'] = df['IncomeVerifiable'].astype(int)
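Encoding by exact string match is sensitive to stray whitespace in either the data or the label list. One way to audit this, sketched here on made-up values (the `status` series and the `known` set are hypothetical), is to compare the labels present in the data with the labels the encoder recognizes, before and after stripping whitespace.

```python
import pandas as pd

# Illustrative status column; one value carries a trailing space on purpose.
status = pd.Series(["Completed", "Current", "Past Due (1-15 days) ", "Completed"])

# Hypothetical set of labels the encoder knows about.
known = {"Completed", "Current", "Past Due (1-15 days)"}

# Labels in the data that the encoder would not recognize.
unmatched_raw = sorted(set(status) - known)
unmatched_clean = sorted(set(status.str.strip()) - known)

print(unmatched_raw)    # ['Past Due (1-15 days) ']
print(unmatched_clean)  # []
```

An empty result after stripping suggests the mismatch was only whitespace; anything left over is a label that falls into the encoder's default branch.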
Bin the continuous variables (DebtToIncomeRatio, BorrowerRate, StatedMonthlyIncome)
python
# Bin edges are right-inclusive; a value outside every bin (for example StatedMonthlyIncome <= 1) gets code -1.
df['DebtToIncomeRatio'] = pd.cut(df['DebtToIncomeRatio'], [-0.001,0.15,0.3,0.5,10.01]).cat.codes
df['BorrowerRate'] = pd.cut(df['BorrowerRate'], [-0.001,0.1,0.2,0.3,2]).cat.codes
df['StatedMonthlyIncome'] = pd.cut(df['StatedMonthlyIncome'], [1,2000,4000,6000,8000,10000000]).cat.codes
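To make the bin codes concrete, the sketch below applies the same income edges to four made-up values. Intervals from `pd.cut` are right-inclusive by default, so a value of 0 lies outside the first bin `(1, 2000]` and receives the code -1. The second call, which starts the edges at 0 with `include_lowest=True`, is only one possible variation and not what the case itself does.

```python
import pandas as pd

income = [0.0, 1500.0, 3000.0, 9000.0]   # illustrative monthly incomes

# Same edges as the StatedMonthlyIncome binning in this case.
bins = [1, 2000, 4000, 6000, 8000, 10000000]
codes = pd.cut(income, bins).codes.tolist()

# Variation (an assumption): start at 0 and include the lowest edge.
bins_alt = [0, 2000, 4000, 6000, 8000, 10000000]
codes_alt = pd.cut(income, bins_alt, include_lowest=True).codes.tolist()

print(codes)      # [-1, 0, 1, 4]
print(codes_alt)  # [0, 0, 1, 4]
```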
Split the dataset 70/30
python
from sklearn.model_selection import train_test_split
X = df.drop(["LoanStatus"],axis=1)
Y = df["LoanStatus"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
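With `test_size=0.3`, 30% of the rows (rounded up) go to the test set. The sketch below shows this on synthetic arrays (the `X_toy` and `y_toy` data are assumptions). It also passes `stratify`, which the case does not use, as an optional way to keep the class ratio similar in both parts when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 rows, 25% of them labelled 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# stratify is an optional addition here, not part of the original split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)      # (14, 2) (6, 2)
print(y_tr.mean(), y_te.mean())    # class-1 share in each part
```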
Normalize the data
python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test= scaler.transform(X_test)
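`MinMaxScaler` maps each feature to (x - min) / (max - min) using the minimum and maximum seen during `fit`. Because the scaler is fitted on the training set only, test values outside the training range land outside [0, 1]. The sketch below shows this on made-up numbers. Since `MultinomialNB` is designed for non-negative features, negative scaled values are worth checking for. The `clip=True` option shown last is one possible safeguard and is not used in the case.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])   # illustrative training feature
test = np.array([[2.5], [12.0], [-1.0]])   # includes values outside the training range

# Fit on the training data only, then transform the test data.
scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test).ravel()

# Optional variation: clip transformed values into [0, 1].
clipped = MinMaxScaler(clip=True).fit(train).transform(test).ravel()

print(scaled)    # 0.25, 1.2 and -0.1
print(clipped)   # 0.25, 1.0 and 0.0
```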
Model Building and Training
python
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()
mnb.fit(X_train,Y_train.astype(int))
MultinomialNB()
python
mnb.score(X_test,Y_test.astype(int))
0.747364708648458
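The value returned by `score` is mean accuracy, which is hard to judge without knowing the class balance. A common sanity check, sketched below on synthetic data (all arrays here are assumptions and unrelated to the Prosper results), is to compare the model with a majority-class baseline and to inspect the confusion matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_toy = rng.random((200, 3))                  # synthetic non-negative features
y_toy = (rng.random(200) < 0.3).astype(int)   # synthetic labels, roughly 30% ones

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
model = MultinomialNB().fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
model_acc = model.score(X_toy, y_toy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_toy, model.predict(X_toy), labels=[0, 1])

print(baseline_acc, model_acc)
print(cm)
```

If the model's accuracy is close to the baseline, the confusion matrix usually shows that it predicts almost only the majority class.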