Chapter 8 Classification, Naive Bayes Case Study: Personal Credit Assessment on a P2P Platform

Case 3: Personal Credit Assessment on a P2P Platform

Background

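This case uses the classic prosperLoanData.csv dataset and a naive Bayes model to predict whether a borrower will repay a loan normally. Prosper, a US P2P lending platform, is a site where people who need to borrow and lenders with idle funds can match with each other directly. It currently has more than 980,000 members and over 200 million USD in loans, and is the world's largest P2P lending platform. This case links the income and credit-line data in the dataset to loan status (whether the loan is repaid normally). The goal is to assess a borrower's personal credit from borrower information; specifically, to use this information to predict whether the borrower can repay normally.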

Data Loading and Splitting

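```python
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load the raw Prosper loan data
base = pd.read_csv('./prosperLoanData.csv')
```

```python
# Keep listings created after the cutoff. ListingCreationDate is read as text, so this
# is a lexicographic string comparison rather than a calendar comparison; the row
# counts shown below come from the comparison as written.
df_all = base[base['ListingCreationDate'] > '2008-1-1']
# Display with a fresh 0-based index (the result is not assigned back to df_all)
df_all.reset_index(drop=True)
```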

| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 1 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 2 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| 3 | 0F05359734824199381F61D | 1074836 | 2013-12-14 08:26:37.093000000 | NaN | 60 | Current | NaN | 0.15425 | 0.1314 | 0.1214 | ... | -25.33 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 4 | 0F0A3576754255009D63151 | 750899 | 2013-04-12 09:52:56.147000000 | NaN | 36 | Current | NaN | 0.31032 | 0.2712 | 0.2612 | ... | -22.95 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85056 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | ... | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 85057 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | ... | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 85058 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | ... | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 85059 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 85060 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | ... | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |

85061 rows × 81 columns
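The date filter above compares `ListingCreationDate` as text. The following is a minimal sketch, on made-up timestamps in the same text format, of how a string comparison against `'2008-1-1'` can differ from a calendar comparison on parsed dates. The toy rows and the names `toy`, `by_string`, and `by_date` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up timestamps in the same text format as ListingCreationDate.
toy = pd.DataFrame({
    "ListingCreationDate": [
        "2007-12-31 23:59:59.000000000",
        "2008-03-15 10:00:00.000000000",
        "2008-11-02 08:30:00.000000000",
        "2013-04-12 09:52:56.147000000",
    ]
})

# Lexicographic comparison: "2008-03..." sorts before "2008-1-1"
# because the character '0' is smaller than '1'.
by_string = toy[toy["ListingCreationDate"] > "2008-1-1"]

# Calendar comparison after parsing the column to datetime64.
parsed = pd.to_datetime(toy["ListingCreationDate"])
by_date = toy[parsed > pd.Timestamp("2008-01-01")]

print(len(by_string), len(by_date))  # prints: 2 3
```

If a true calendar cutoff is intended, parsing the column first is the safer choice; doing so on the real data would change the row counts reported in this case.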

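```python
columns_new = [
    'ProsperRating (numeric)',             # Prosper rating (numeric)
    'Term',                                # loan term in months
    'BorrowerRate',                        # borrower's interest rate
    'LoanStatus',                          # loan status
    'EmploymentStatus',                    # employment status
    'EmploymentStatusDuration',            # time in the current employment status (months)
    'IsBorrowerHomeowner',                 # whether the borrower is a homeowner
    'CreditScoreRangeLower',               # lower bound of the credit score range
    'CreditScoreRangeUpper',               # upper bound of the credit score range
    'CurrentCreditLines',                  # number of current credit lines
    'OpenCreditLines',                     # number of open credit lines
    'TotalCreditLinespast7years',          # number of credit lines in the past 7 years
    'OpenRevolvingAccounts',               # number of open revolving accounts
    'OpenRevolvingMonthlyPayment',         # monthly payment on revolving accounts
    'InquiriesLast6Months',                # credit inquiries in the last 6 months
    'TotalInquiries',                      # total number of credit inquiries
    'CurrentDelinquencies',                # number of accounts currently delinquent
    'AmountDelinquent',                    # amount delinquent
    'LoanOriginalAmount',                  # original loan amount
    'RevolvingCreditBalance',              # revolving credit balance
    'BankcardUtilization',                 # bankcard utilization
    'TradesNeverDelinquent (percentage)',  # share of trades never delinquent
    'DebtToIncomeRatio',                   # borrower's debt-to-income ratio
    'IncomeRange',                         # borrower's income range
    'IncomeVerifiable',                    # whether the income is verifiable
    'StatedMonthlyIncome',                 # stated monthly income
    'MonthlyLoanPayment',                  # scheduled monthly loan payment
]
# Keep only the selected columns
df = pd.DataFrame(df_all, columns=columns_new)
df.info()
```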
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85061 entries, 1 to 113936
Data columns (total 27 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   ProsperRating (numeric)             84853 non-null  float64
 1   Term                                85061 non-null  int64  
 2   BorrowerRate                        85061 non-null  float64
 3   LoanStatus                          85061 non-null  object 
 4   EmploymentStatus                    85061 non-null  object 
 5   EmploymentStatusDuration            85041 non-null  float64
 6   IsBorrowerHomeowner                 85061 non-null  bool   
 7   CreditScoreRangeLower               85061 non-null  float64
 8   CreditScoreRangeUpper               85061 non-null  float64
 9   CurrentCreditLines                  85061 non-null  float64
 10  OpenCreditLines                     85061 non-null  float64
 11  TotalCreditLinespast7years          85061 non-null  float64
 12  OpenRevolvingAccounts               85061 non-null  int64  
 13  OpenRevolvingMonthlyPayment         85061 non-null  float64
 14  InquiriesLast6Months                85061 non-null  float64
 15  TotalInquiries                      85061 non-null  float64
 16  CurrentDelinquencies                85061 non-null  float64
 17  AmountDelinquent                    85061 non-null  float64
 18  LoanOriginalAmount                  85061 non-null  int64  
 19  RevolvingCreditBalance              85061 non-null  float64
 20  BankcardUtilization                 85061 non-null  float64
 21  TradesNeverDelinquent (percentage)  85061 non-null  float64
 22  DebtToIncomeRatio                   77740 non-null  float64
 23  IncomeRange                         85061 non-null  object 
 24  IncomeVerifiable                    85061 non-null  bool   
 25  StatedMonthlyIncome                 85061 non-null  float64
 26  MonthlyLoanPayment                  85061 non-null  float64
dtypes: bool(2), float64(19), int64(3), object(3)
memory usage: 17.0+ MB
```

Fill missing values in object-typed columns with 'unknown'

```python
# Object-typed (string) columns
categorical = df.select_dtypes(include=['object']).columns.values
df[categorical] = df[categorical].fillna('unknown')
# Count remaining missing values in the non-numeric columns
df.select_dtypes(exclude=[np.number]).isnull().sum()
```
```
LoanStatus             0
EmploymentStatus       0
IsBorrowerHomeowner    0
IncomeRange            0
IncomeVerifiable       0
dtype: int64
```

Fill missing numeric values with the median

```python
# Select the numeric columns
categorical_num = df.select_dtypes(include=[np.number]).columns.values

# Return the columns that contain missing values
def find_na_column(df, columns):
    miss_columns = []
    for column in columns:
        if df[column].isnull().sum() > 0:
            miss_columns.append(column)
    return miss_columns

# Numeric columns that contain missing values
categorical_num = find_na_column(df, categorical_num)

# Fill a column's missing values with that column's median
def fillNull(column):
    df[column] = df[column].fillna(df[column].median())

# Fill every affected column
for column in categorical_num:
    fillNull(column)

# Confirm that no numeric column still has missing values
df.select_dtypes(include=[np.number]).isnull().sum()
```
```
ProsperRating (numeric)               0
Term                                  0
BorrowerRate                          0
EmploymentStatusDuration              0
CreditScoreRangeLower                 0
CreditScoreRangeUpper                 0
CurrentCreditLines                    0
OpenCreditLines                       0
TotalCreditLinespast7years            0
OpenRevolvingAccounts                 0
OpenRevolvingMonthlyPayment           0
InquiriesLast6Months                  0
TotalInquiries                        0
CurrentDelinquencies                  0
AmountDelinquent                      0
LoanOriginalAmount                    0
RevolvingCreditBalance                0
BankcardUtilization                   0
TradesNeverDelinquent (percentage)    0
DebtToIncomeRatio                     0
StatedMonthlyIncome                   0
MonthlyLoanPayment                    0
dtype: int64
```
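The median fill above loops over the affected columns one at a time. As a minimal sketch on a made-up frame, the same idea can be written in one vectorized step, because `fillna` accepts a Series of per-column fill values. The toy data and the names `toy` and `num_cols` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two numeric columns and one text column.
toy = pd.DataFrame({
    "DebtToIncomeRatio": [0.10, np.nan, 0.30, 0.50],
    "EmploymentStatusDuration": [12.0, 24.0, np.nan, 60.0],
    "EmploymentStatus": ["Employed", None, "Retired", "Other"],
})

# Restrict to numeric columns, then fill each one with its own median.
num_cols = toy.select_dtypes(include=[np.number]).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

print(toy)
```

The text column is left untouched here, which mirrors the separate 'unknown' fill used for object-typed columns.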

Encode and filter the target and categorical variables (LoanStatus, EmploymentStatus, Term, IsBorrowerHomeowner, IncomeRange, IncomeVerifiable)

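```python
# Binary target: 1 for Completed, FinalPaymentInProgress and the listed Past Due
# statuses; 0 for everything else (including Current).
# NOTE: two entries end with a trailing space; if the CSV values carry no trailing
# space, those two never match and fall through to 0.
def loanStatus(value):
    if value in ['Completed', 'FinalPaymentInProgress', 'Past Due (1-15 days)',
                 'Past Due (31-60 days)', 'Past Due (61-90 days) ', 'Past Due (91-120 days) ',
                 'Past Due (16-30 days)']:
        return 1
    else:
        return 0
df["LoanStatus"] = df["LoanStatus"].map(loanStatus)

# Ordinal code for IncomeRange; any unlisted label gets 7.
# NOTE: '$0 ' also ends with a space, so a plain '$0' would be coded 7.
def incomeType(value):
    if value == '$0 ':
        return 0
    elif value == '$1-24,999':
        return 1
    elif value == '$25,000-49,999':
        return 2
    elif value == '$50,000-74,999':
        return 3
    elif value == '$75,000-99,999':
        return 4
    elif value == '$100,000+':
        return 5
    elif value == 'Not employed':
        return 6
    else:
        return 7
df["IncomeRange"] = df["IncomeRange"].map(incomeType)
# LoanStatus is already 0/1; copy so the assignments below act on an independent frame
df = df[df['LoanStatus'].isin([0, 1])].copy()
# Assign results back instead of calling replace(..., inplace=True) on a column selection
df['EmploymentStatus'] = df['EmploymentStatus'].replace({'Employed': 1, 'Self-employed': 1, 'Other': 1, 'Full-time': 1, 'Not employed': 0, 'Retired': 0, 'Part-time': 1})
df['Term'] = df['Term'].replace({12.0: 1, 60.0: 2, 36.0: 3})
# Boolean columns become 0/1
df['IsBorrowerHomeowner'] = df['IsBorrowerHomeowner'].astype(int)
df['IncomeVerifiable'] = df['IncomeVerifiable'].astype(int)
```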
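The label lists above include entries with a trailing space, such as `'$0 '`, which only match if the raw values carry the same space. A possible way to make such an encoding insensitive to stray whitespace is sketched below with a dictionary lookup. `INCOME_CODES` and `encode_income` are hypothetical names introduced for illustration, and the codes simply mirror those returned by `incomeType`.

```python
import pandas as pd

# Hypothetical lookup mirroring incomeType; 7 is the catch-all code.
INCOME_CODES = {
    "$0": 0,
    "$1-24,999": 1,
    "$25,000-49,999": 2,
    "$50,000-74,999": 3,
    "$75,000-99,999": 4,
    "$100,000+": 5,
    "Not employed": 6,
}

def encode_income(series: pd.Series) -> pd.Series:
    """Strip surrounding whitespace, map known labels, send the rest to 7."""
    return series.str.strip().map(INCOME_CODES).fillna(7).astype(int)

# Made-up labels, with and without a trailing space.
toy = pd.Series(["$0", "$0 ", "$100,000+", "Not displayed"])
codes = encode_income(toy)
print(codes.tolist())  # [0, 0, 5, 7]
```

Applying this to the real data could change the encoded values, and therefore the reported score, if the original comparison silently missed some labels.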

Bin the continuous variables (DebtToIncomeRatio, BorrowerRate, StatedMonthlyIncome)

```python
# pd.cut builds right-closed bins; .codes gives each value's bin index
# (a value that falls outside every bin gets the code -1)
dcat = pd.cut(list(df['DebtToIncomeRatio'].values), [-0.001, 0.15, 0.3, 0.5, 10.01])
df['DebtToIncomeRatio'] = dcat.codes
rate = pd.cut(list(df['BorrowerRate'].values), [-0.001, 0.1, 0.2, 0.3, 2])
df['BorrowerRate'] = rate.codes
mcat = pd.cut(list(df['StatedMonthlyIncome'].values), [1, 2000, 4000, 6000, 8000, 10000000])
df['StatedMonthlyIncome'] = mcat.codes
```
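To make the binning behaviour concrete, here is a small sketch that applies the same `StatedMonthlyIncome` bin edges to a handful of made-up incomes. It illustrates that `pd.cut` uses right-closed intervals by default and that values outside every interval receive the code -1. The `incomes` list is an assumption chosen to hit the edge cases.

```python
import pandas as pd

# Same edges as the StatedMonthlyIncome binning; intervals are (1, 2000], (2000, 4000], ...
bins = [1, 2000, 4000, 6000, 8000, 10000000]

# Made-up incomes, including values at and below the lowest edge.
incomes = [0.0, 1.0, 1500.0, 2000.0, 2000.01, 9000.0]

codes = pd.cut(incomes, bins).codes
print(list(codes))  # [-1, -1, 0, 0, 1, 4]
```

With these edges, a stated income of 0 or 1 lands in no bin, so it is coded -1 rather than joining the lowest income bin.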

Split the dataset 7:3 (training : test)

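```python
from sklearn.model_selection import train_test_split
# Features are every column except the target
X = df.drop(["LoanStatus"], axis=1)
Y = df["LoanStatus"]
# Hold out 30% of the rows as the test set; fix the seed for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
```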
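The split above is purely random. As an optional variation that the original does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below shows this on made-up labels with an 80/20 class balance; the arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 80/20 balance in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification matters here depends on how imbalanced `LoanStatus` is after encoding, which this case does not report.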

Data Normalization

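```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Learn each feature's min and max from the training set only
scaler.fit(X_train)
# Apply the same scaling to both sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```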
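Because the scaler above is fitted on the training set only, test values are scaled with the training minimum and maximum. A minimal sketch on made-up numbers shows one consequence: a test value beyond the training range is mapped outside [0, 1]. The arrays `train` and `test` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature: the training range is [0, 10].
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

# Fit on the training data only, then reuse the fitted scaler for the test data.
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

print(scaled_test.ravel())  # [0.25 1.2 ]
```

Fitting on the training data alone keeps information from the test set out of preprocessing, at the cost of occasional out-of-range values like the 1.2 here.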

Model Building and Training

```python
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, Y_train.astype(int))
```

```
MultinomialNB()
```

```python
# Mean accuracy on the held-out test set
mnb.score(X_test, Y_test.astype(int))
```

```
0.747364708648458
```
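An accuracy figure is easier to interpret next to a trivial baseline. The sketch below uses synthetic data, not the Prosper data, to show how the score of a `MultinomialNB` model could be compared against a classifier that always predicts the most frequent class, together with a confusion matrix. All data here is randomly generated and the variable names are assumptions; no result from it applies to the case above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic, non-negative features and an imbalanced label (about 25% positives).
rng = np.random.default_rng(42)
X = rng.random((1000, 5))
y = (rng.random(1000) < 0.25).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = MultinomialNB().fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])

print(baseline_acc, model_acc)
print(cm)
```

Running the same comparison on `X_test` and `Y_test` from this case would show how far the reported accuracy sits above the majority-class rate.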