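[Machine Learning] Kaggle in Practice: Credit Card Fraud Detection (Scenario Analysis, Data Preprocessing, Feature Engineering, Model Training, Model Evaluation and Optimization)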

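Building a Credit Card Fraud Detection Model

Modeling approach

The problem this project addresses

This project applies machine learning to historical credit card transaction data to build a fraud detection model that flags fraudulent use of a customer's card as early as possible.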

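Project background

The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds among 284,807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions.

It contains only numerical input variables that are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data cannot be provided. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.

The description above is taken from the introduction to this dataset on the Kaggle website; for more about the dataset, see "Credit Card Fraud Detection".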
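
To make the imbalance concrete, the following minimal sketch recomputes the fraud share from the two counts quoted above and compares it with the accuracy of a trivial baseline that labels every transaction as normal. The variable names are illustrative and not part of the original notebook.

```python
# Counts quoted in the dataset description above.
n_total = 284_807
n_fraud = 492

fraud_share = n_fraud / n_total
# Accuracy of a hypothetical baseline that always predicts "normal" (Class = 0).
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"fraud share:       {fraud_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

Such a baseline is right more than 99.8% of the time while catching no fraud at all, so accuracy alone says little on this dataset; this is why recall gets most of the attention later on.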

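Scenario analysis (choosing an algorithm)

First, the data we have are two days of cardholders' credit card transactions, with many dimensions, and the problem is to predict whether a cardholder's card is being used fraudulently. There are only two possible outcomes: fraud or no fraud. The data is also labeled (the Class field is the target column), so this is a supervised learning scenario. Detecting card fraud is therefore a binary classification problem, which can be tackled with binary classification algorithms. This project uses logistic regression (Logistic Regression).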
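
As a rough illustration of how logistic regression turns features into a binary decision, the sketch below applies the logistic (sigmoid) function to a weighted sum and then thresholds the resulting probability at 0.5. The weights, bias and feature values are made up for illustration; they are not fitted to this dataset.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class (fraud) for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and one hypothetical sample with two features.
weights = [1.5, -2.0]
bias = -0.5
sample = [2.0, 0.5]

p = predict_proba(sample, weights, bias)
label = int(p >= 0.5)  # 1 = fraud, 0 = normal
print(p, label)
```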

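Analyzing the data

The data is structured, so no feature abstraction is needed. Features V1 to V28 have already been processed with PCA, whereas Time and Amount are on a very different scale from the other features and need feature scaling to bring all features onto the same scale. In terms of data quality, there are no garbled or empty values. The Class field is the target column, and the other columns are features.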

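Model evaluation

All of the data is labeled, so the model built on the training set can be evaluated with cross-validation together with a held-out test set: most of the data is used for training and the rest for prediction and evaluation. The plan here is a 70%/30% split, although the code in section 4 uses test_size = 0.2, that is, 80%/20%.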

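Scenario summary

The business scenario can be summarized as follows:

Learn from historical records to predict whether a cardholder's card is being used fraudulently; this is a binary, supervised classification scenario, and logistic regression (Logistic Regression) is the chosen algorithm.
The data is structured, so no feature abstraction is needed, but feature scaling is required.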

Data file

1. Data preprocessing

1.1 Imports

import numpy as np
import pandas as pd
pd.set_option('display.float_format',lambda x :'%.4f' % x)

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import missingno as msno # missing-value visualization tool; pip install missingno

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.metrics import auc,roc_auc_score,roc_curve,recall_score,accuracy_score,classification_report

from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

1.2 Loading the data

data = pd.read_csv('./creditcard.csv')
data.head()

data.tail()

5 rows × 31 columns

print(data.shape)
data.info()

(284807, 31)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64  
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

data.describe().T

msno.matrix(data)

<AxesSubplot:>

data.isnull().sum().sum()

2. Feature engineering

2.1 Target variable

fig,axs = plt.subplots(1,2,figsize = (14,7))

sns.countplot(x = 'Class',data = data,ax = axs[0])
axs[0].set_title('Frequency of each Class')

data['Class'].value_counts().plot(kind = 'pie',ax = axs[1],autopct = '%1.2f%%')
axs[1].set_title('Percent of each Class')

Text(0.5, 1.0, 'Percent of each Class')

data.groupby(by = 'Class').size()

Class
0    284315
1       492
dtype: int64

2.2 Feature derivation

data.head() # Time is in seconds, which is too fine-grained

5 rows × 31 columns

data['Hour'] = data['Time'].apply(lambda x : divmod(x,3600)[0]) # whole hours elapsed since the first transaction
data

284807 rows × 32 columns
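
The Hour feature above is the number of whole hours elapsed since the first transaction, so over a two-day window it can run from 0 up to 47 rather than from 0 to 23. The sketch below shows that computation on a few made-up Time values, together with an optional hour-of-day variant. The helper names are hypothetical, and the variant is not used in the original notebook.

```python
def elapsed_hours(seconds):
    """Whole hours since the first transaction (same as divmod(x, 3600)[0])."""
    return int(seconds // 3600)

def hour_of_day(seconds):
    """Hour folded into a 24-hour cycle; assumes the first transaction is at hour 0."""
    return elapsed_hours(seconds) % 24

# Hypothetical Time values in seconds.
for t in [0.0, 3599.0, 3600.0, 86400.0, 172792.0]:
    print(t, elapsed_hours(t), hour_of_day(t))
```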

2.3 Feature selection

2.3.1 Normal spending versus fraud

XFraud = data.loc[data['Class'] == 1] # fraud
XnonFraud = data.loc[data['Class'] == 0] # normal spending

correlationNonFraud = XnonFraud.loc[:,data.columns != 'Class'].corr()

mask = np.zeros_like(correlationNonFraud)

index = np.triu_indices_from(correlationNonFraud) # indices of the upper-right triangle
mask[index] = True # mask: 0 means the cell is shown, 1 means it is hidden

kw = {'width_ratios':[1,1,0.05],'wspace':0.2}
f,(ax1,ax2,ax3) = plt.subplots(1,3,gridspec_kw=kw,figsize = (22,9))


cmap = sns.diverging_palette(220,8,as_cmap = True) # a diverging color palette
sns.heatmap(correlationNonFraud,ax = ax1,vmin = -1,vmax = 1,square=False,
            linewidths=0.5,mask = mask,cbar=False,cmap= cmap)
ax1.set_title('Normal')

correlationFraud = XFraud.loc[:,data.columns != 'Class'].corr()
sns.heatmap(correlationFraud,vmin = -1,vmax= 1,cmap = cmap,ax = ax2,
            square=False,linewidths=0.5,mask = mask,yticklabels=True,cbar_ax=ax3,
           cbar_kws={'orientation':'vertical','ticks':[-1,-0.5,0,0.5,1]})

ax2.set_title('Fraud')

Text(0.5, 1.0, 'Fraud')

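The figure above shows that correlations between some variables are more pronounced in the fraud samples. In particular, V1, V2, V3, V4, V5, V6, V7, V9, V10, V11, V12, V14, V16, V17, V18 and V19 vary together in a fairly regular pattern in the fraud samples.

Features V8, V13, V15, V20, V21, V22, V23, V24, V25, V26, V27 and V28 show no clear pattern.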

from matplotlib import colors

plt.colormaps()

['Accent',
 'Accent_r',
 'Blues',
 'Blues_r',
 'BrBG',
 ...
 'viridis',
 'viridis_r',
 'vlag',
 'vlag_r',
 'winter',
 'winter_r']

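2.3.2 Transaction amount and transaction count

f,(ax1,ax2) = plt.subplots(2,1,sharex=True,figsize = (16,6))

ax1.hist(data['Amount'][data['Class'] == 1],bins = 30)
ax1.set_title('Fraud')
ax1.set_yscale('log') # plt.yscale would only act on the current axes (ax2)


ax2.hist(data['Amount'][data['Class'] == 0],bins = 100)
ax2.set_title('Normal')

plt.xlabel('Amount($)')
plt.ylabel('count')
ax2.set_yscale('log')

Compared with normal transactions, fraudulent transactions are scattered and small in amount, which suggests that fraudsters prefer small purchases so as not to attract the cardholder's attention.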

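2.3.3 Time of fraud

Parameters:

height: height of each facet in inches (scalar); named size in older seaborn releases
aspect: aspect ratio of each facet (scalar), so that aspect * height gives the facet width

sns.catplot(x = 'Hour',data = data,kind = 'count',palette = 'ocean',height = 6,
            aspect = 3) # catplot replaces the deprecated factorplot

<seaborn.axisgrid.FacetGrid at 0x292aee31670>

Credit card spending is most frequent between 9 a.m. and 11 p.m. each day.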

2.3.4 Transaction amount versus transaction time

f,(ax1,ax2) = plt.subplots(2,1,sharex=True,figsize = (16,6))

cond1 = data['Class'] == 1
ax1.scatter(data['Hour'][cond1],data['Amount'][cond1])
ax1.set_title('Fraud')

cond2 = data['Class'] == 0
ax2.scatter(data['Hour'][cond2],data['Amount'][cond2])
ax2.set_title('Normal')

Text(0.5, 1.0, 'Normal')

sns.catplot(x = 'Hour',kind = 'count',data = data[cond1],height=9,aspect=2)

<seaborn.axisgrid.FacetGrid at 0x292af9cd430>

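The figures above show that, among the fraud samples, the outliers occur in the hours when customers use their cards less often. Fraud peaks at 43 cases at 11 a.m. on the first day; the remaining cases mostly fall between 11 p.m. and 9 a.m. the next morning. This suggests that, to avoid attracting the cardholder's attention, fraudsters prefer to act while cardholders are asleep or at times when spending is frequent. In addition, the largest fraudulent amount is only 2,125.87 US dollars.

data['Amount'][cond1].max()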

2125.87

2.3.5 Feature distributions (help with selecting features)

data.head()

5 rows × 32 columns

from matplotlib import font_manager

fm = font_manager.FontManager()

[font.name for font in fm.ttflist]

['DejaVu Serif Display',
 'DejaVu Sans Mono',
 'cmss10',
 'DejaVu Serif',
 'DejaVu Sans',
 'STIXSizeFourSym',
 'STIXNonUnicode',
 'cmtt10',
 ....
 'Century Schoolbook',
 'Calisto MT',
 'Calibri',
 'Malgun Gothic',
 'Britannic Bold',
 'Matura MT Script Capitals']

sns.__version__

'0.11.1'

data

284807 rows × 32 columns

plt.rcParams['font.family'] = 'STKaiti' # a CJK-capable font; only needed if the titles contain Chinese text
v_feat = data.iloc[:,1:29].columns
plt.figure(figsize=(16,4 * 28))
cond1 = data['Class'] == 1
cond2 = data['Class'] == 0

gs = gridspec.GridSpec(28,1) # one subplot per feature
for i,cn in enumerate(v_feat):
    ax = plt.subplot(gs[i])
    sns.distplot(data[cn][cond1],bins = 50) # fraud
    sns.distplot(data[cn][cond2],bins = 100) # normal spending
    ax.set_title('Distribution of feature ' + cn)

The plots above show how each variable is distributed for fraudulent versus normal transactions. We keep the variables whose distributions differ clearly between the two classes and therefore drop V8, V13, V15, V20, V21, V22, V23, V24, V25, V26, V27 and V28. This agrees with the conclusion drawn earlier from the correlation heatmaps. We also drop Time and keep Hour, which is less dispersed.

droplist = ['V8','V13','V15','V20','V21','V22','V23','V24','V25','V26','V27','V28','Time']

data_new = data.drop(labels=droplist,axis = 1)
display(data.shape, data_new.shape)

(284807, 32)



(284807, 19)

data_new.head()

The number of features is reduced from 31 to 18 (excluding the target variable).

2.4 Feature scaling

Because Hour and Amount are on a very different scale from the other features, they need to be scaled.

col = ['Amount','Hour']
sc = StandardScaler() # z-score standardization

data_new[col] = sc.fit_transform(data_new[col])
data_new.head()

data_new.describe().T
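
What StandardScaler does to each column can be reproduced by hand: subtract the column mean and divide by the population standard deviation. The sketch below uses made-up amounts, not values taken from the dataset, and the function name is hypothetical.

```python
import statistics

def z_score(values):
    """Standardize a list of numbers to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof = 0), as StandardScaler uses
    return [(v - mean) / std for v in values]

amounts = [1.0, 2.5, 10.0, 149.62, 2125.87]  # hypothetical transaction amounts
scaled = z_score(amounts)
print([round(s, 4) for s in scaled])
```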

2.5 Ranking feature importance

feture = list(data_new.columns)
print(feture)

['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V9', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17', 'V18', 'V19', 'Amount', 'Class', 'Hour']

feture.remove('Class') # remove the target name; this modifies the list in place
feture

['V1',
 'V2',
 'V3',
 'V4',
 'V5',
 'V6',
 'V7',
 'V9',
 'V10',
 'V11',
 'V12',
 'V14',
 'V16',
 'V17',
 'V18',
 'V19',
 'Amount',
 'Hour']

Build the X and y variables

X = data_new[feture]
y = data_new['Class']
display(X.head(),y.head())

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

Rank the features with the random forest's feature importance

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

clf.fit(X,y)
clf.feature_importances_

array([0.01974556, 0.015216  , 0.02202158, 0.03775329, 0.01995699,
       0.02257269, 0.02873717, 0.04056692, 0.08524621, 0.06078786,
       0.18454186, 0.12167233, 0.06079882, 0.19714222, 0.03220492,
       0.01833037, 0.01716956, 0.01553564])

plt.rcParams['figure.figsize'] = (12,6)
plt.style.use('fivethirtyeight')

from matplotlib import style

style.available

['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

len(feture)

importances = clf.feature_importances_
feat_name = feture
feat_name = np.array(feat_name)
index = np.argsort(importances)[::-1]

plt.bar(range(len(index)),importances[index],color = 'lightblue')
plt.step(range(18),np.cumsum(importances[index]))
_ = plt.xticks(range(18),labels=feat_name[index],rotation = 'vertical',fontsize = 14)

feat_name

['V1',
 'V2',
 'V3',
 'V4',
 'V5',
 'V6',
 'V7',
 'V9',
 'V10',
 'V11',
 'V12',
 'V14',
 'V16',
 'V17',
 'V18',
 'V19',
 'Amount',
 'Hour']
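
The ranking and the cumulative curve in the bar chart can also be read off numerically. The sketch below reuses the importance values printed earlier for this particular run (a random forest gives slightly different numbers each time it is fitted) and counts how many top-ranked features are needed to reach 80% of the total importance. The 80% cut-off is an arbitrary choice for illustration.

```python
from itertools import accumulate

features = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V9', 'V10', 'V11',
            'V12', 'V14', 'V16', 'V17', 'V18', 'V19', 'Amount', 'Hour']
# Values copied from the clf.feature_importances_ output shown earlier.
importances = [0.01974556, 0.015216, 0.02202158, 0.03775329, 0.01995699,
               0.02257269, 0.02873717, 0.04056692, 0.08524621, 0.06078786,
               0.18454186, 0.12167233, 0.06079882, 0.19714222, 0.03220492,
               0.01833037, 0.01716956, 0.01553564]

# Sort features by importance, largest first.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
cumulative = list(accumulate(value for _, value in ranked))

# Number of top-ranked features needed to reach 80% of the total importance.
n_for_80 = next(i + 1 for i, total in enumerate(cumulative) if total >= 0.80)

for (name, value), total in zip(ranked, cumulative):
    print(f"{name:>6}  {value:.4f}  {total:.4f}")
print("features needed for 80%:", n_for_80)
```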

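3. Model training

3.1 Oversampling

As noted earlier, the target column Class is heavily imbalanced, which makes it hard for the model to learn. Common remedies for class imbalance are oversampling and undersampling. This project uses oversampling, specifically SMOTE (Synthetic Minority Oversampling Technique).

# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE # creates new minority samples from nearest neighbors

print('Class counts before oversampling:\n',y.value_counts())

Class counts before oversampling:
 0    284315
1       492
Name: Class, dtype: int64

smote = SMOTE()
# X and y are the features and the labels
X,y = smote.fit_resample(X,y)
print('Labels after oversampling:\n',y)

Labels after oversampling: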
 0         0
1         0
2         0
3         0
4         0
         ..
568625    1
568626    1
568627    1
568628    1
568629    1
Name: Class, Length: 568630, dtype: int64

y.value_counts()

0    284315
1    284315
Name: Class, dtype: int64
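
The core idea of SMOTE is to create a synthetic minority sample on the line segment between a minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified, from-scratch illustration of that interpolation step on made-up 2-D points. It is not the imblearn implementation, and the function names are hypothetical.

```python
import math
import random

def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda other: math.dist(point, other))[:k]

def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point between a random minority point and a neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbor = rng.choice(nearest_neighbors(base, others, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.0, 3.0)]  # hypothetical fraud samples
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because every synthetic point is an interpolation between two existing minority points, the new samples stay inside the region already occupied by the minority class.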

3.2 Building the model

3.2.1 Accuracy

model = LogisticRegression()
model.fit(X,y) # the classes are now balanced
y_ = model.predict(X)
print('Logistic regression accuracy:',accuracy_score(y,y_))
# For fraud detection we mainly want the algorithm to find the fraudulent transactions;
# normal transactions matter much less.

Logistic regression accuracy: 0.9380581397393736

Confusion matrix and recall

from sklearn.metrics import confusion_matrix # confusion matrix

cm = confusion_matrix(y,y_)
print(cm)
recall = cm[1,1]/(cm[1,1] + cm[1,0])
print('Recall:',recall)

[[276963   7352]
 [ 27870 256445]]
Recall: 0.9019749221813833
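
The figures above can be recomputed directly from the four cells of the printed confusion matrix. The sketch below does so in plain Python and adds precision, which the notebook does not print at this point; the variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predicted classes.
cm = [[276963, 7352],
      [27870, 256445]]

tn, fp = cm[0]
fn, tp = cm[1]

recall = tp / (tp + fn)        # share of actual fraud that was found
precision = tp / (tp + fp)     # share of flagged transactions that are fraud
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("recall:   ", recall)
print("precision:", precision)
print("accuracy: ", accuracy)
```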

import itertools

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix of predicted versus true labels.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plot_confusion_matrix(cm,classes=[0,1])

3.2.2 ROC and AUC

proba_ = model.predict_proba(X)[:,1] # column 1 is the probability of class 1 (the positive class, fraud)

fpr,tpr,thresholds_ = roc_curve(y,proba_)

roc_auc = auc(fpr,tpr) # area under the curve

# plot the ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.5f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Text(0.5, 0, 'False Positive Rate')
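
To show what roc_curve and auc compute, the sketch below builds the ROC points for a tiny made-up example by sweeping the threshold over the distinct scores and then integrates the curve with the trapezoidal rule. It is a simplified illustration with hypothetical helper names, not scikit-learn's implementation.

```python
def roc_points(y_true, scores):
    """ROC points (FPR, TPR) obtained by thresholding at each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = trapezoid_area(points)
print(points)
print("AUC =", area)
```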

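4. Model evaluation and optimization

In the previous step the model was trained and tested on the same dataset, so the scores are optimistic and any overfitting goes unnoticed.

In general, there are three ways to split a dataset into training and test sets:

Hold-out
Cross-validation
Bootstrapping

This project splits the data with cross-validation into three parts: a training set, a validation set and a test set. The model learns on the training set, its parameters are tuned on the validation set, and its performance is finally assessed on the test set.

For tuning we use grid search: we define a set of candidate values for each parameter, and grid search then exhaustively tries every combination and picks the best one according to the chosen scoring metric.

To combine cross-validation and grid search, we use GridSearchCV from scikit-learn's model_selection module.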
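
To give a feel for how much work this combination implies, the sketch below enumerates every parameter combination of the grid used in section 4.1 and splits a toy index range into 10 folds. It uses plain consecutive folds for simplicity (GridSearchCV stratifies the folds for classifiers), and the helper name and the toy sample count are assumptions made for illustration.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k consecutive folds of (train, validation) indices."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds

# Same candidate values as the param_grid in section 4.1.
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2']}
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

folds = k_fold_indices(n_samples=25, k=10)  # 25 is a toy sample count
n_fits = len(candidates) * len(folds)
print(len(candidates), "candidates x", len(folds), "folds =", n_fits, "fits")
```

Each candidate is fitted once per fold, so the cost of a grid search grows with the product of the grid size and the number of folds.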

4.1 Cross-validation

Selecting parameters with cross-validation

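%%time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# build the parameter grid
# note: the default lbfgs solver only supports the l2 penalty, so the l1 candidates fail to fit
# and score NaN unless a solver such as liblinear or saga is passed to LogisticRegression
param_grid = {'C': [0.01,0.1, 1, 10, 100, 1000,],'penalty': [ 'l1', 'l2']}

# use LogisticRegression with the param_grid combinations; cv sets 10-fold cross-validation
grid_search = GridSearchCV(LogisticRegression(),param_grid,cv=10) 

grid_search.fit(X_train, y_train) # fit on the training set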

Wall time: 1min 5s





GridSearchCV(cv=10, estimator=LogisticRegression(),
             param_grid={'C': [0.01, 0.1, 1, 10, 100, 1000],
                         'penalty': ['l1', 'l2']})

Viewing the best parameters

results = pd.DataFrame(grid_search.cv_results_) 
display(results)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.5f}".format(grid_search.best_score_))

Best parameters: {'C': 10, 'penalty': 'l2'}
Best cross-validation score: 0.93776

Evaluation on the test data

y_pred = grid_search.predict(X_test)

print('Accuracy:',accuracy_score(y_test,y_pred))

Accuracy: 0.9391432038408104

Classification report

from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.91      0.98      0.94     56981
           1       0.97      0.90      0.94     56745

    accuracy                           0.94    113726
   macro avg       0.94      0.94      0.94    113726
weighted avg       0.94      0.94      0.94    113726

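4.2 Confusion matrix

# confusion matrix on the test data
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# plot the confusion matrix of the tuned model
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')

Recall metric in the testing dataset:  0.9031104062031897

After cross-validated training and parameter tuning, recall on the held-out test set is 0.9031, essentially the same as the 0.9020 measured in section 3.2.1 on the data the model was trained on. The tuned model therefore generalizes to unseen data, but tuning by itself brings little gain in recall.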

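4.3 Model evaluation

Different problems usually call for different metrics to measure model performance. Suppose we want an algorithm to predict whether a tumor is malignant, and 5 out of 100 patients have a malignant one. For the doctor, raising the model's recall matters more than raising its precision, because from the patient's point of view missing a malignant tumor is far more serious than wrongly flagging a benign one as malignant.

4.3.1 Confusion matrix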

# predicted probabilities
y_pred_proba = grid_search.predict_proba(X_test) 

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]  # candidate thresholds

plt.figure(figsize=(15,10))
np.set_printoptions(precision=2)
j = 1
for t in thresholds:
    # turn probabilities into class labels with the threshold
    y_pred = y_pred_proba[:,1] > t
    plt.subplot(3,3,j)
    j += 1
    # compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test, y_pred)
    print("Recall:", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]),end = '\t')
    print('Accuracy:',(cnf_matrix[0,0] + cnf_matrix[1,1])/(cnf_matrix.sum()))
    # plot the confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names)

Recall: 0.9837342497136311	Accuracy: 0.8754814202557023
Recall: 0.957952242488325	Accuracy: 0.9291103177813341
Recall: 0.9321878579610539	Accuracy: 0.9376659690835869
Recall: 0.9182835492113842	Accuracy: 0.9406292316620649
Recall: 0.9031104062031897	Accuracy: 0.9391432038408104
Recall: 0.8919904837430611	Accuracy: 0.9371559713697836
Recall: 0.8860516345052427	Accuracy: 0.9368833863848196
Recall: 0.8795312362322671	Accuracy: 0.9348433955296063
Recall: 0.8651158692395806	Accuracy: 0.9291806622935829
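
The same threshold sweep can be mimicked without any plotting. The sketch below applies several thresholds to a handful of made-up predicted probabilities and prints recall and precision for each, showing on a small scale how recall falls as the threshold rises. All data and names in it are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

# Made-up true labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
proba = [0.05, 0.15, 0.30, 0.45, 0.70, 0.20, 0.55, 0.65, 0.85, 0.95]

rows = []
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    y_pred = [int(p > threshold) for p in proba]
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else float('nan')
    rows.append((threshold, recall, precision))
    print(f"threshold={threshold}  recall={recall:.2f}  precision={precision:.2f}")
```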


4.3.2 Precision-recall

from sklearn.metrics import precision_recall_curve

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
colors = ['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal', 'red', 'yellow', 'green', 'blue']

plt.figure(figsize=(12,7))

j = 1
for t,color in zip(thresholds,colors):
    y_pred = y_pred_proba[:,1] > t # whether the predicted probability exceeds the threshold

    precision, recall, threshold = precision_recall_curve(y_test, y_pred)
    area = auc(recall, precision)
    cm = confusion_matrix(y_test,y_pred)
    # TP/(TP + FN)
    r = cm[1,1]/(cm[1,0] + cm[1,1])
  
    # plot the precision-recall curve
    plt.plot(recall, precision, color=color,
                 label='Threshold=%s,  AUC=%0.3f,  recall=%0.3f' %(t,area,r))
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall Curve')
    plt.legend(loc="lower left")

4.3.3 ROC curve

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
colors = ['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal', 'red', 'yellow', 'green', 'blue']

plt.figure(figsize=(12,7))

j = 1
for t,color in zip(thresholds,colors):
#     y_pred = grid_search.predict(X_test) # model predictions on the test data
    y_pred = y_pred_proba[:,1] >= t # whether the predicted probability reaches the (manually chosen) threshold
  
    cm = confusion_matrix(y_test,y_pred)
    # TP/(TP + FP)
    precision = cm[1,1]/(cm[0,1] + cm[1,1])

    fpr,tpr,_ = roc_curve(y_test,y_pred)
    accuracy = accuracy_score(y_test,y_pred)
  
    auc_ = auc(fpr,tpr)
  
    # plot the ROC curve
    plt.plot(fpr, tpr, color=color,
                 label='Threshold=%s,  AUC=%0.3f,  precision=%0.3f' %(t , auc_,precision))
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('ROC Curve')
    plt.legend(loc="lower right")

4.3.4 Trends of the evaluation metrics

'''
true negatives: `C_{0,0}`
false negatives: `C_{1,0}`
true positives: `C_{1,1}`
false positives: `C_{0,1}`
'''
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
recalls = [] # recall
precisions = [] # precision
aucs = [] # area under the curve
y_pred_proba = grid_search.predict_proba(X_test)
for threshold in thresholds:
    y_ = y_pred_proba[:,1] >= threshold
    cm = confusion_matrix(y_test,y_)
    # TP/(TP + FN)
    recalls.append(cm[1,1]/(cm[1,0] + cm[1,1])) # recall: share of actual positives that are found, e.g. 85 found out of 200 real patients = 42.5%
    # TP/(TP + FP)
    precisions.append(cm[1,1]/(cm[0,1] + cm[1,1])) # precision: share of flagged cases that are real, e.g. 100 flagged, 85 truly ill and 15 healthy = 85%
    fpr,tpr,_ = roc_curve(y_test,y_)
    auc_ = auc(fpr,tpr)
    aucs.append(auc_)
    
plt.figure(figsize=(12,6))
plt.plot(thresholds,recalls,label = 'Recall')
plt.plot(thresholds,aucs,label = 'auc')
plt.plot(thresholds,precisions,label = 'precision')
plt.legend()
plt.xlabel('thresholds')

Text(0.5, 0, 'thresholds')

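4.4 Optimal threshold

Precision and recall pull in opposite directions. The confusion matrices, PR curves and ROC curves above show that the lower the threshold, the higher the recall: the model finds more of the fraudulent transactions, but at the cost of more false alarms. As the threshold rises, recall gradually falls, precision gradually rises and the number of false alarms drops. Adjusting the threshold therefore controls how aggressively the model fights fraud: set a lower threshold to catch more fraud, and a higher one otherwise.

In practice, the choice of threshold depends on comparing the marginal profit and the marginal cost to the business. A low threshold does find more cardholders whose cards have been used fraudulently, but as false alarms increase it adds to the post-lending team's workload and degrades the experience of customers wrongly flagged as fraud victims, which lowers customer satisfaction. The threshold at which marginal profit and marginal cost balance is the optimal one. There are exceptions: during a financial crisis, loan defaults and card fraud tend to become more likely, and financial institutions are then more willing to hold the risk line at any cost.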
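
One simple way to make this trade-off operational is to attach a cost to each kind of error and pick the threshold with the lowest total cost. The sketch below does this for invented error counts and invented unit costs. Every number and name in it is an assumption for illustration, not a result from this project.

```python
# Hypothetical outcomes on a validation set: (threshold, false positives, false negatives).
outcomes = [
    (0.1, 900, 10),
    (0.3, 400, 30),
    (0.5, 150, 60),
    (0.7, 60, 90),
    (0.9, 20, 140),
]

COST_FALSE_ALARM = 5.0     # assumed cost of reviewing one wrongly flagged transaction
COST_MISSED_FRAUD = 100.0  # assumed loss from one undetected fraud

def total_cost(false_positives, false_negatives):
    """Total cost of the errors made at one threshold."""
    return false_positives * COST_FALSE_ALARM + false_negatives * COST_MISSED_FRAUD

costs = {t: total_cost(fp, fn) for t, fp, fn in outcomes}
best_threshold = min(costs, key=costs.get)

for t, cost in costs.items():
    print(f"threshold={t}  total cost={cost:.0f}")
print("lowest-cost threshold:", best_threshold)
```

Raising the assumed cost of a missed fraud pushes the lowest-cost threshold down, which matches the observation that institutions accept more false alarms when the risk of fraud is high.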