Chapter 8: Classification. Naive Bayes Case Study: Personal Credit Assessment on a P2P Platform

Case 3: Personal Credit Assessment on a P2P Platform

Case Background

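This case study uses the well-known prosperLoanData.csv dataset and a naive Bayes model to predict whether a borrower will repay a loan normally. Prosper, the US peer-to-peer (P2P) lending platform the data comes from, is a website on which people who need to borrow and people with idle funds can match with each other directly. The platform is reported to have more than 980,000 members and over 200 million US dollars in loans, and is described as the largest P2P lending platform in the world. The case study links borrower attributes in the dataset, such as income and credit lines, to the loan status (whether the loan is repaid normally). The goal is to assess personal credit from borrower information; concretely, these attributes are used to predict whether the borrower can repay normally.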

Data Loading and Splitting

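import numpy as np
import pandas as pd
import warnings

# Silence library warnings to keep the notebook output short
warnings.filterwarnings('ignore')

# Load the raw Prosper loan data (the CSV is expected in the working directory)
base = pd.read_csv('./prosperLoanData.csv')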

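# Keep listings whose creation timestamp sorts after '2008-1-1'.
# Note: ListingCreationDate is read as text, so this comparison is lexicographic;
# parse the column with pd.to_datetime for a true calendar-date filter.
df_all = base[base['ListingCreationDate'] > '2008-1-1']

# reset_index returns a new DataFrame (displayed below); df_all keeps its original index
df_all.reset_index(drop=True)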

| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 1 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 2 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| 3 | 0F05359734824199381F61D | 1074836 | 2013-12-14 08:26:37.093000000 | NaN | 60 | Current | NaN | 0.15425 | 0.1314 | 0.1214 | ... | -25.33 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 4 | 0F0A3576754255009D63151 | 750899 | 2013-04-12 09:52:56.147000000 | NaN | 36 | Current | NaN | 0.31032 | 0.2712 | 0.2612 | ... | -22.95 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85056 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | ... | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 85057 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | ... | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 85058 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | ... | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 85059 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 85060 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | ... | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |

85061 rows × 81 columns
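The filter above compares `ListingCreationDate` as strings, so rows are kept according to lexicographic order rather than calendar order. The following is a minimal sketch on made-up timestamps (the `toy` frame and the names `as_text`, `parsed`, and `as_dates` are illustrative and not part of the original notebook) that shows the difference and one way to get a true date filter with `pd.to_datetime`.

```python
import pandas as pd

# Toy timestamps in the same text format as ListingCreationDate (made-up values).
toy = pd.DataFrame({
    "ListingCreationDate": [
        "2007-12-31 23:59:59.000000000",
        "2008-03-15 10:00:00.000000000",
        "2008-11-02 08:30:00.000000000",
    ]
})

# String comparison is lexicographic: "2008-03..." sorts before "2008-1-1"
# because "0" < "1" at the sixth character, so the March row is dropped.
as_text = toy[toy["ListingCreationDate"] > "2008-1-1"]

# Parsing the column first turns the comparison into a real date comparison.
parsed = pd.to_datetime(toy["ListingCreationDate"])
as_dates = toy[parsed > pd.Timestamp("2008-01-01")].reset_index(drop=True)

print(len(as_text), len(as_dates))
```

Switching the real notebook to the parsed comparison would change which rows are kept, so the row counts shown in this section would no longer apply as-is.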

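columns_new = [
    'ProsperRating (numeric)',  # Prosper rating assigned to the listing
    'Term',  # loan term in months
    'BorrowerRate',  # borrower's interest rate on the loan
    'LoanStatus',  # loan status
    'EmploymentStatus',  # employment status
    'EmploymentStatusDuration',  # length of the employment status in months
    'IsBorrowerHomeowner',  # whether the borrower owns a home
    'CreditScoreRangeLower',  # lower bound of the consumer credit score range
    'CreditScoreRangeUpper',  # upper bound of the consumer credit score range
    'CurrentCreditLines',  # number of current credit lines
    'OpenCreditLines',  # number of open credit lines
    'TotalCreditLinespast7years',  # number of credit lines in the past 7 years
    'OpenRevolvingAccounts',  # number of open revolving accounts
    'OpenRevolvingMonthlyPayment',  # monthly payment on revolving accounts
    'InquiriesLast6Months',  # number of credit inquiries in the last 6 months
    'TotalInquiries',  # total number of credit inquiries
    'CurrentDelinquencies',  # number of accounts currently delinquent
    'AmountDelinquent',  # amount delinquent in dollars
    'LoanOriginalAmount',  # original loan amount
    'RevolvingCreditBalance',  # revolving credit balance
    'BankcardUtilization',  # bankcard utilization
    'TradesNeverDelinquent (percentage)',  # share of trades that were never delinquent
    'DebtToIncomeRatio',  # borrower's debt-to-income ratio
    'IncomeRange',  # borrower's annual income range
    'IncomeVerifiable',  # whether the income can be verified
    'StatedMonthlyIncome',  # stated monthly income
    'MonthlyLoanPayment'  # scheduled monthly loan payment
]
# Keep only the selected columns; copy so later edits do not touch df_all
df = df_all[columns_new].copy()
df.info()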


<class 'pandas.core.frame.DataFrame'>
Int64Index: 85061 entries, 1 to 113936
Data columns (total 27 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   ProsperRating (numeric)             84853 non-null  float64
 1   Term                                85061 non-null  int64  
 2   BorrowerRate                        85061 non-null  float64
 3   LoanStatus                          85061 non-null  object 
 4   EmploymentStatus                    85061 non-null  object 
 5   EmploymentStatusDuration            85041 non-null  float64
 6   IsBorrowerHomeowner                 85061 non-null  bool   
 7   CreditScoreRangeLower               85061 non-null  float64
 8   CreditScoreRangeUpper               85061 non-null  float64
 9   CurrentCreditLines                  85061 non-null  float64
 10  OpenCreditLines                     85061 non-null  float64
 11  TotalCreditLinespast7years          85061 non-null  float64
 12  OpenRevolvingAccounts               85061 non-null  int64  
 13  OpenRevolvingMonthlyPayment         85061 non-null  float64
 14  InquiriesLast6Months                85061 non-null  float64
 15  TotalInquiries                      85061 non-null  float64
 16  CurrentDelinquencies                85061 non-null  float64
 17  AmountDelinquent                    85061 non-null  float64
 18  LoanOriginalAmount                  85061 non-null  int64  
 19  RevolvingCreditBalance              85061 non-null  float64
 20  BankcardUtilization                 85061 non-null  float64
 21  TradesNeverDelinquent (percentage)  85061 non-null  float64
 22  DebtToIncomeRatio                   77740 non-null  float64
 23  IncomeRange                         85061 non-null  object 
 24  IncomeVerifiable                    85061 non-null  bool   
 25  StatedMonthlyIncome                 85061 non-null  float64
 26  MonthlyLoanPayment                  85061 non-null  float64
dtypes: bool(2), float64(19), int64(3), object(3)
memory usage: 17.0+ MB

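Fill missing values in object-typed columns with 'unknown'

# Names of the columns stored as object (string) dtype
categorical = df.select_dtypes(include=['object']).columns.values
df[categorical] = df[categorical].fillna('unknown')
# Verify that no non-numeric column has missing values left
df.select_dtypes(exclude=[np.number]).isnull().sum()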


LoanStatus             0
EmploymentStatus       0
IsBorrowerHomeowner    0
IncomeRange            0
IncomeVerifiable       0
dtype: int64

Fill missing values in numeric columns with the median

# Select the numeric columns
categorical_num = df.select_dtypes(include=[np.number]).columns.values

# Return the columns from `columns` that contain at least one missing value
def find_na_column(df, columns):
    miss_columns = []
    for column in columns:
        if df[column].isnull().sum() > 0:
            miss_columns.append(column)
    return miss_columns

# Numeric columns that have missing values
categorical_num = find_na_column(df, categorical_num)

# Fill the missing values of one column of df with that column's median
def fillNull(column):
    df[column] = df[column].fillna(df[column].median())

# Fill every numeric column that has missing values
for column in categorical_num:
    fillNull(column)

# Verify that no numeric column has missing values left
df.select_dtypes(include=[np.number]).isnull().sum()


ProsperRating (numeric)               0
Term                                  0
BorrowerRate                          0
EmploymentStatusDuration              0
CreditScoreRangeLower                 0
CreditScoreRangeUpper                 0
CurrentCreditLines                    0
OpenCreditLines                       0
TotalCreditLinespast7years            0
OpenRevolvingAccounts                 0
OpenRevolvingMonthlyPayment           0
InquiriesLast6Months                  0
TotalInquiries                        0
CurrentDelinquencies                  0
AmountDelinquent                      0
LoanOriginalAmount                    0
RevolvingCreditBalance                0
BankcardUtilization                   0
TradesNeverDelinquent (percentage)    0
DebtToIncomeRatio                     0
StatedMonthlyIncome                   0
MonthlyLoanPayment                    0
dtype: int64
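The column-by-column loop above can also be written without a helper function. The following is a minimal sketch on a toy frame with made-up values; it assumes pandas' `DataFrame.median(numeric_only=True)` and `DataFrame.fillna` called with a Series of per-column fill values, and the names `toy`, `medians`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with two numeric columns containing gaps and one text column.
toy = pd.DataFrame({
    "DebtToIncomeRatio": [0.10, np.nan, 0.30, 0.50],
    "EmploymentStatusDuration": [12.0, 24.0, np.nan, 60.0],
    "EmploymentStatus": ["Employed", "Other", "Retired", "Employed"],
})

# Per-column medians of the numeric columns (NaN is skipped by default).
medians = toy.median(numeric_only=True)

# fillna with a Series fills each column with its own value; the text column
# is not in `medians` and is left untouched.
filled = toy.fillna(medians)

print(filled)
```

The median is less sensitive than the mean to a few extreme values, which is one common reason to prefer it for skewed quantities such as income ratios.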

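Encode the target variable and the categorical variables as integers, then filter (LoanStatus, EmploymentStatus, Term, IsBorrowerHomeowner, IncomeRange, IncomeVerifiable)

def loanStatus(value):
    # Map a LoanStatus label to 1 or 0. The grouping is kept as in the original notebook:
    # 'Completed', 'FinalPaymentInProgress' and the listed 'Past Due' labels map to 1;
    # every other label (including 'Current') maps to 0.
    # Caution: two labels end with a trailing space and match only if the raw data
    # is spelled exactly that way.
    if value in ['Completed', 'FinalPaymentInProgress', 'Past Due (1-15 days)',
                 'Past Due (31-60 days)', 'Past Due (61-90 days) ', 'Past Due (91-120 days) ',
                 'Past Due (16-30 days)']:
        return 1
    else:
        return 0

df['LoanStatus'] = df['LoanStatus'].map(loanStatus)

def incomeType(value):
    # Encode the income bracket as an integer code.
    # Caution: '$0 ' carries a trailing space, so a plain '$0' falls through to 7.
    if value == '$0 ':
        return 0
    elif value == '$1-24,999':
        return 1
    elif value == '$25,000-49,999':
        return 2
    elif value == '$50,000-74,999':
        return 3
    elif value == '$75,000-99,999':
        return 4
    elif value == '$100,000+':
        return 5
    elif value == 'Not employed':
        return 6
    else:
        return 7

df['IncomeRange'] = df['IncomeRange'].map(incomeType)

# After the mapping LoanStatus only holds 0 or 1, so this filter keeps every row
df = df[df['LoanStatus'].isin([0, 1])].copy()

# Assign the results back instead of calling replace(..., inplace=True) on a column
# selection, which may not update df in recent pandas versions
df['EmploymentStatus'] = df['EmploymentStatus'].replace(
    {'Employed': 1, 'Self-employed': 1, 'Other': 1, 'Full-time': 1,
     'Not employed': 0, 'Retired': 0, 'Part-time': 1})
df['Term'] = df['Term'].replace({12: 1, 60: 2, 36: 3})
df['IsBorrowerHomeowner'] = df['IsBorrowerHomeowner'].astype(int)
df['IncomeVerifiable'] = df['IncomeVerifiable'].astype(int)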
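Because the encoding above matches labels by exact string, a stray space in a key silently sends a label to the fallback code. The following is a minimal sketch of one way to list unmatched labels before encoding; the labels, the `mapping` dictionary, and all variable names are made up for illustration and are not the notebook's actual mapping.

```python
import pandas as pd

# Made-up status labels; the real column may contain other spellings.
status = pd.Series(["Completed", "Current", "Past Due (61-90 days)", "Chargedoff"])

# Hypothetical mapping in which one key carries a trailing space by mistake.
mapping = {"Completed": 1, "Past Due (61-90 days) ": 1}

# Series.map returns NaN for labels that are not keys of the mapping,
# which makes the unmatched labels easy to list.
encoded = status.map(mapping)
unmatched = sorted(status[encoded.isna()].unique())

# Stripping whitespace from the keys lets the intended label match;
# labels that are still unmatched fall back to 0.
clean_mapping = {key.strip(): value for key, value in mapping.items()}
encoded_clean = status.map(clean_mapping).fillna(0).astype(int)

print(unmatched)
print(encoded_clean.tolist())
```

Listing the unmatched labels first makes it an explicit decision which of them should fall back to the default code.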

Bin the continuous variables (DebtToIncomeRatio, BorrowerRate, StatedMonthlyIncome)

# Bin each column and keep the integer bin codes.
# Intervals are right-closed; a value outside every interval gets the code -1.
df['DebtToIncomeRatio'] = pd.cut(df['DebtToIncomeRatio'], [-0.001, 0.15, 0.3, 0.5, 10.01]).cat.codes
df['BorrowerRate'] = pd.cut(df['BorrowerRate'], [-0.001, 0.1, 0.2, 0.3, 2]).cat.codes
# The first edge is 1, so incomes of 1 or less fall outside the bins (code -1)
df['StatedMonthlyIncome'] = pd.cut(df['StatedMonthlyIncome'], [1, 2000, 4000, 6000, 8000, 10000000]).cat.codes
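To make the binning step concrete, the following is a minimal sketch on made-up incomes using the same edges as the `StatedMonthlyIncome` binning above. It shows how `pd.cut` treats edge values and values outside the bins; the names `income`, `codes`, and `codes_from_zero` are illustrative.

```python
import pandas as pd

# Made-up monthly incomes, including 0 and a value exactly on a bin edge.
income = pd.Series([0.0, 1500.0, 2000.0, 7500.0, 25000.0])

# Intervals are right-closed by default: (1, 2000], (2000, 4000], ...
bins = [1, 2000, 4000, 6000, 8000, 10000000]
binned = pd.cut(income, bins=bins)

# .cat.codes gives the position of the bin; values outside every bin become -1.
codes = binned.cat.codes

# Moving the first edge slightly below zero (as done for the other two columns)
# places an income of 0 in the first bin instead.
codes_from_zero = pd.cut(income, bins=[-0.001] + bins[1:]).cat.codes

print(codes.tolist())           # [-1, 0, 0, 3, 4]
print(codes_from_zero.tolist()) # [0, 0, 0, 3, 4]
```

With the original edges, a code of -1 acts as an extra category for incomes outside the bins; whether that is intended is a modelling choice the notebook does not state.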

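Split the dataset into training and test sets at a 7:3 ratio

from sklearn.model_selection import train_test_split

# Features are all columns except the target
X = df.drop(["LoanStatus"], axis=1)
Y = df["LoanStatus"]
# 70% training, 30% test; random_state fixes the shuffle for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)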
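When the two classes are imbalanced, a plain random split can give the training and test sets slightly different class proportions. The following is a minimal sketch of a stratified 7:3 split on toy data, using the `stratify` parameter of `train_test_split`; it is an optional variation and not what the notebook above does, and the toy arrays are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with an imbalanced binary label (30 ones, 70 zeros).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Both parts keep a 30% share of positive labels here, which makes accuracy figures on the test set easier to compare with the training data.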

Min-max normalization

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training set only, then apply it to both sets
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
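Fitting the scaler on the training set alone means the test set is rescaled with the training minimum and maximum, so test values can land outside [0, 1]. The following is a minimal sketch with made-up single-feature data; the arrays and the names `train_scaled` and `test_scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data; the test set holds values below and above the training range.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [40.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=10 and max=30 from the training rows only

train_scaled = scaler.transform(train)  # (x - 10) / (30 - 10)
test_scaled = scaler.transform(test)    # same formula, training min and max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [-0.25  1.5 ]
```

`MultinomialNB` expects non-negative feature values when fitting, which the scaled training set satisfies by construction. If out-of-range test values are a concern, recent scikit-learn versions offer `MinMaxScaler(clip=True)`, which clips transformed values to the feature range.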

Model Building and Training

from sklearn.naive_bayes import MultinomialNB

# Multinomial naive Bayes with default settings (smoothing parameter alpha=1.0)
mnb = MultinomialNB()
mnb.fit(X_train, Y_train.astype(int))


MultinomialNB()

# Mean accuracy on the held-out test set
mnb.score(X_test, Y_test.astype(int))


0.747364708648458
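An accuracy figure is easier to interpret next to a majority-class baseline and a confusion matrix. The following is a minimal sketch using `DummyClassifier`, `accuracy_score`, and `confusion_matrix` from scikit-learn; the labels and predictions are made up for illustration and are not results from the Prosper data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# Baseline: always predict the most frequent class (features are ignored).
dummy_features = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_features, y_true)
baseline_acc = baseline.score(dummy_features, y_true)

model_acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(baseline_acc, model_acc)
print(cm)
```

In this toy case the predictions reach the same accuracy as always guessing the majority class, and the confusion matrix shows that two of the three positive cases are missed. Applying the same comparison to `mnb` on the real test set would show how much the model adds over the baseline.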
