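Chapter 9 Ensemble Learning, Boosting Case Study: Credit Card Fraud Classification

Case Study: Credit Card Fraud Classification

Background

The dataset contains credit card transactions made by European cardholders in September 2013.

It covers transactions that occurred over two days, with only 492 frauds among 284,807 transactions.

The dataset is highly imbalanced: the positive class (fraudulent transactions) accounts for only 0.172% of all transactions.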
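To make the imbalance concrete, the short sketch below recomputes the fraud share from the two counts quoted above and shows the accuracy a trivial classifier would reach by labelling every transaction as legitimate. The variable names are illustrative and not part of the original notebook.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_fraud = 492

# Share of the positive (fraud) class, in percent.
fraud_share = 100 * n_fraud / n_transactions

# Accuracy of a baseline that always predicts "not fraud".
baseline_accuracy = 100 * (n_transactions - n_fraud) / n_transactions

print(f"Fraud share: {fraud_share:.3f}%")
print(f"Always-negative baseline accuracy: {baseline_accuracy:.3f}%")
```

Because this baseline already scores above 99.8% accuracy without catching a single fraud, accuracy alone says little on this dataset; detection rate, F1 and AUC are more informative.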

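It contains only numerical input variables, which are the result of a PCA transformation.

For confidentiality reasons, the original features and further background information about the data are not provided.

Features V1, V2, ..., V28 are the principal components obtained with principal component analysis (PCA).

The only features that have not been transformed with PCA are Time and Amount.

The feature Time contains the seconds elapsed between each transaction and the first transaction in the dataset.

The feature Amount is the transaction amount; it can be used for example-dependent cost-sensitive learning.

The feature Class is the label variable: it takes the value 1 in case of fraud and 0 otherwise.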
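One possible reading of "example-dependent cost-sensitive learning" is to weight each training row by the money at stake. The sketch below is only an illustration on synthetic data: the weighting rule (a missed fraud costs its amount, a legitimate transaction a fixed unit cost) and the toy columns are assumptions, not something the case study itself applies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Tiny synthetic stand-in for the real data (hypothetical values).
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "V1": rng.normal(size=n),
    "V2": rng.normal(size=n),
    "Amount": rng.gamma(shape=2.0, scale=50.0, size=n),
})
toy["Class"] = (toy["V1"] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

# Assumed weighting scheme: frauds are weighted by 1 + Amount,
# legitimate transactions keep a unit weight.
sample_weight = np.where(toy["Class"] == 1, 1.0 + toy["Amount"], 1.0)

# Most scikit-learn boosting estimators accept per-row weights in fit().
clf = GradientBoostingClassifier(n_estimators=20, random_state=42)
clf.fit(toy[["V1", "V2", "Amount"]], toy["Class"], sample_weight=sample_weight)
```

With this kind of weighting, misclassifying a large fraudulent transaction is penalised more heavily during training than misclassifying a small one.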

Reading and Splitting the Data

```python
# Data handling
import pandas as pd
import numpy as np

# Plotting (matplotlib/seaborn for static plots, plotly for interactive ones)
import matplotlib
from IPython.display import Image
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

import gc
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

pd.set_option('display.max_columns', 100)

# Train/test split settings
TEST_SIZE = 0.20  # fraction of rows held out by train_test_split

RANDOM_STATE = 42
```

Next, read the data and print its dimensions.

```python
data = pd.read_csv("./creditcard.csv")

print("Credit Card Fraud Detection data -  rows:", data.shape[0], " columns:", data.shape[1])
```

```
Credit Card Fraud Detection data -  rows: 284807  columns: 31
```

Then inspect the first rows with head().

```python
data.head()
```

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | -0.551600 | -0.617801 | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | -0.166974 | 1.612727 | 1.065235 | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | 0.207643 | 0.624501 | 0.066084 | 0.717293 | -0.165946 | 2.345865 | -2.890083 | 1.109969 | -0.121359 | -2.261857 | 0.524980 | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | -0.054952 | -0.226487 | 0.178228 | 0.507757 | -0.287924 | -0.631418 | -1.059647 | -0.684093 | 1.965775 | -1.232622 | -0.208038 | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | 0.753074 | -0.822843 | 0.538196 | 1.345852 | -1.119670 | 0.175121 | -0.451449 | -0.237033 | -0.038195 | 0.803487 | 0.408542 | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |

As the printed shape shows, the dataset contains 284,807 records in 31 columns.

Check the dataset for missing values.

```python
# Number and percentage of missing values per column
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() / data.isnull().count() * 100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).transpose()
```

| | Time | V16 | Amount | V28 | V27 | V26 | V25 | V24 | V23 | V22 | V21 | V20 | V19 | V18 | V17 | V15 | V1 | V14 | V13 | V12 | V11 | V10 | V9 | V8 | V7 | V6 | V5 | V4 | V3 | V2 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Percent | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

The result above shows that the dataset has no missing values.
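When the per-column table is not needed, the same check can be reduced to a single boolean. The sketch below runs on a hypothetical toy frame with one deliberately missing value, so the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
toy = pd.DataFrame({"Time": [0.0, 1.0, 2.0], "Amount": [149.62, np.nan, 378.66]})

# Count missing values per column, then reduce to a single yes/no answer.
missing_per_column = toy.isna().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column[missing_per_column > 0])
print("Any missing values:", has_missing)
```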

Visualize the number of transactions in each class (fraud vs. non-fraud).

```python
# Count transactions per class
temp = data["Class"].value_counts()
df = pd.DataFrame({'Class': temp.index, 'values': temp.values})

# Bar chart of the class counts
trace = go.Bar(
    x=df['Class'], y=df['values'],
    name="Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)",
    marker=dict(color="Blue"),
    text=df['values']
)
temp_data = [trace]
layout = dict(title='Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)',
              xaxis=dict(title='Class', showticklabels=True),
              yaxis=dict(title='Number of transactions'),
              hovermode='closest', width=600
              )
fig = dict(data=temp_data, layout=layout)
iplot(fig, filename='class')
```

The chart confirms that the dataset is imbalanced: only 492 records (0.172%) are fraudulent.

Next, visualize how fraudulent and legitimate transactions are distributed over time.

```python
# Transaction times for each class
class_0 = data.loc[data['Class'] == 0]["Time"]
class_1 = data.loc[data['Class'] == 1]["Time"]

hist_data = [class_0, class_1]
group_labels = ['Not Fraud', 'Fraud']

# Density curves only (no histogram bars, no rug plot)
fig = ff.create_distplot(hist_data, group_labels, show_hist=False, show_rug=False)
fig['layout'].update(title='Credit Card Transactions Time Density Plot', xaxis=dict(title='Time [s]'))
iplot(fig, filename='dist_only')
```

The density plot suggests that fraudulent transactions are spread fairly evenly over time and keep occurring during the night.
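The night-time observation can also be checked numerically by bucketing Time into hours of the day. The sketch below uses a hypothetical toy frame; it also assumes that the first transaction falls at hour 0, which the dataset description does not state.

```python
import pandas as pd

# Hypothetical toy data: Time in seconds since the first transaction.
toy = pd.DataFrame({
    "Time": [0, 3_600, 7_250, 90_000, 93_700, 170_000],
    "Class": [0, 1, 0, 1, 1, 0],
})

# Assumption: the clock starts at hour 0 of day one; the real wall-clock
# offset of the first transaction is not documented.
toy["Hour"] = (toy["Time"] // 3_600) % 24

# Number of transactions and number of frauds in each hour bucket.
per_hour = toy.groupby("Hour")["Class"].agg(transactions="count", frauds="sum")
print(per_hour)
```

Applied to the real data, comparing the `frauds` column with the `transactions` column per hour would show whether fraud stays roughly constant while overall traffic drops.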

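Next, split the data into a training set and a test set.

```python
# Hold out TEST_SIZE of the rows for testing; shuffle before splitting
train_df, test_df = train_test_split(data, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=True)
```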
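The split above is a plain random split. With so few positives, an optional variant is to stratify on the label so that both partitions keep the same fraud ratio. The sketch below illustrates this on a hypothetical toy frame; it is not the split used for the results reported in this case study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 2% positive class.
toy = pd.DataFrame({"x": np.arange(1_000), "Class": [1] * 20 + [0] * 980})

# stratify keeps the class ratio the same in both partitions.
train_toy, test_toy = train_test_split(
    toy, test_size=0.20, random_state=42, shuffle=True, stratify=toy["Class"]
)

print("Train fraud share:", train_toy["Class"].mean())
print("Test fraud share:", test_toy["Class"].mean())
```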

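Because the dataset is imbalanced, the training data needs to be rebalanced. The majority class (non-fraud) is undersampled so that fraud makes up 1% of the training set: every fraud row is kept, and 99 non-fraud rows are sampled for each of them.

```python
# Count the fraud samples in the training set
train_fraud_df = train_df[train_df['Class'] == 1]
no_of_fraud = train_fraud_df.shape[0]

# Undersample the non-fraud samples: 99 per fraud sample, so fraud is 1% of the result
no_of_non_fraud = no_of_fraud * 99
train_non_fraud_df = train_df[train_df['Class'] == 0].sample(no_of_non_fraud, random_state=RANDOM_STATE)
no_of_non_fraud = train_non_fraud_df.shape[0]

# Combine the two parts and shuffle the row order
train_df = pd.concat([train_fraud_df, train_non_fraud_df], axis=0)
train_df = train_df.sample(frac=1, random_state=RANDOM_STATE)

print("Total Fraud in Train Data :", no_of_fraud)
print("Total non Fraud in Train Data :", no_of_non_fraud)
```

```
Total Fraud in Train Data : 394
Total non Fraud in Train Data : 39006
```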
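The factor 99 follows from the target share: if fraud should be a fraction s of the result, the number of majority rows is n_fraud * (1 - s) / s. The sketch below wraps that rule in a small helper; the function name `undersample_majority` and its parameters are hypothetical and only meant to illustrate the calculation.

```python
import pandas as pd

def undersample_majority(df, target_col, minority_share, random_state=42):
    """Hypothetical helper: keep all minority rows and sample majority rows
    so that the minority class makes up `minority_share` of the result."""
    minority = df[df[target_col] == 1]
    majority = df[df[target_col] == 0]
    # n_min / (n_min + n_maj) = share  =>  n_maj = n_min * (1 - share) / share
    n_majority = round(len(minority) * (1 - minority_share) / minority_share)
    n_majority = min(n_majority, len(majority))  # cannot sample more rows than exist
    sampled = majority.sample(n_majority, random_state=random_state)
    # Concatenate both parts and shuffle the rows.
    return pd.concat([minority, sampled]).sample(frac=1, random_state=random_state)

# Toy data: 10 positives among 10,000 rows.
toy = pd.DataFrame({"x": range(10_000), "Class": [1] * 10 + [0] * 9_990})
balanced = undersample_majority(toy, "Class", minority_share=0.01)
print(balanced["Class"].value_counts())
```

With `minority_share=0.01` the helper reproduces the 1:99 ratio used above; other shares can be tried without changing the code.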
```python
# Label column and the 30 input features used by all models
target = 'Class'
predictors = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
              'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
              'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',
              'Amount']
```
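Since the predictors are simply every column except the label, the list can also be derived from the column names instead of being typed out. The sketch below shows this on an empty hypothetical frame that mimics the dataset's column layout.

```python
import pandas as pd

# Hypothetical empty frame with the same column layout as the dataset.
columns = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
toy = pd.DataFrame(columns=columns)

target = "Class"
# Every column except the label is used as a predictor.
predictors = [col for col in toy.columns if col != target]
print(len(predictors), predictors[:3], predictors[-1])
```

Note that this variant would also pick up any helper column added to the frame later (such as a prediction column), so it should be evaluated before such columns are created.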

Building and Training the AdaBoost Model

```python
# Train AdaBoost with default hyperparameters on the rebalanced training set
ada_clf = AdaBoostClassifier(random_state=RANDOM_STATE)

ada_clf.fit(train_df[predictors], train_df[target].values)

# Predict on the untouched (still imbalanced) test set
test_df['prediction'] = ada_clf.predict(test_df[predictors])

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = pd.crosstab(test_df[target].values, test_df['prediction'], rownames=['Actual'], colnames=['Predicted'])
fig, ax1 = plt.subplots(ncols=1, figsize=(7, 7))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True, ax=ax1,
            linewidths=.2, linecolor="Darkblue", cmap="Blues", fmt='d')
plt.title('Confusion Matrix', fontsize=16)
plt.show()
```
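The four cells of a confusion matrix are enough to derive recall and precision by hand. The sketch below uses scikit-learn's `confusion_matrix` on made-up labels; the arrays are hypothetical and unrelated to the real test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)      # share of frauds that were caught
precision = tp / (tp + fp)   # share of fraud alerts that were correct
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In a fraud setting, the bottom-left cell (missed frauds) and the top-right cell (false alarms) are usually the ones worth reading first.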
```python
# Empty frame that collects one row of metrics per model
metric_data = pd.DataFrame(columns=['Model Name', 'Detection Rate', 'AUC', 'F1 Score', 'Accuracy', 'Fraud Loss Saved'])


def fraud_loss_saved(dataset, key):
    """Print evaluation metrics for a scored dataset and append them to metric_data.

    `dataset` must contain the columns 'Class', 'prediction' and 'Amount';
    `key` is the label stored in the 'Model Name' column.
    """
    global metric_data

    df = dataset.copy()

    # Monetary view: how much of the total fraudulent amount was flagged
    total_fraud_amt = df[df['Class'] == 1]['Amount'].sum()
    print("Total Fraud Amount in Test Data : " + str(round(total_fraud_amt, 2)))
    total_fraud_amt_detected = df.loc[(df['prediction'] == 1) & (df['Class'] == 1)]['Amount'].sum()
    print("Total Fraud Amount Detected in Test Data : " + str(round(total_fraud_amt_detected, 2)))
    loss_saved = round(100 * total_fraud_amt_detected / total_fraud_amt, 2)
    print("Fraud Loss Saved (%): " + str(loss_saved))

    # Detection rate is the recall of the fraud class, in percent
    detection_rate = 100 * (df[df['prediction'] == 1]['Class'].sum()) / df['Class'].sum()
    print("Detection Rate (%) : " + str(round(detection_rate, 2)))
    accuracy = 100 * accuracy_score(df['Class'], df['prediction'])
    print("Accuracy : " + str(round(accuracy, 2)))
    f1 = f1_score(df['Class'], df['prediction'])
    print("F1 Score : " + str(round(f1, 4)))
    # Note: this AUC is computed from hard 0/1 predictions, not from probability scores
    auc_score = roc_auc_score(df['Class'], df['prediction'])
    print("AUC Score : " + str(round(auc_score, 4)))

    # Append one row with this model's metrics to the shared comparison table
    temp_df = pd.DataFrame(
        [[key, detection_rate, auc_score, f1, accuracy, loss_saved]],
        columns=['Model Name', 'Detection Rate', 'AUC', 'F1 Score', 'Accuracy', 'Fraud Loss Saved'])
    metric_data = pd.concat([metric_data, temp_df], axis=0)
```
```python
fraud_loss_saved(test_df, 'AdaBoost - Test Data')
```

```
Total Fraud Amount in Test Data : 16078.4
Total Fraud Amount Detected in Test Data : 12019.13
Fraud Loss Saved (%): 74.75
Detection Rate (%) : 79.59
Accuracy : 99.91
F1 Score : 0.7573
AUC Score : 0.8977
```
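"Fraud Loss Saved" and "Detection Rate" are both recall-style metrics: the first weights each fraud by its amount, the second counts every fraud equally. The sketch below works through both on five hypothetical transactions; the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical scored transactions, for illustration only.
toy = pd.DataFrame({
    "Amount":     [100.0, 50.0, 250.0, 20.0, 600.0],
    "Class":      [1,     0,    1,     0,    1],
    "prediction": [1,     0,    0,     1,    1],
})

is_fraud = toy["Class"] == 1
is_caught = is_fraud & (toy["prediction"] == 1)

# Value-weighted recall: amount of fraud flagged / total fraud amount.
loss_saved_pct = 100 * toy.loc[is_caught, "Amount"].sum() / toy.loc[is_fraud, "Amount"].sum()
# Count-based recall (the detection rate): frauds flagged / total frauds.
detection_rate_pct = 100 * is_caught.sum() / is_fraud.sum()

print(round(loss_saved_pct, 2), round(detection_rate_pct, 2))
```

In the AdaBoost output above, the loss saved (74.75%) is lower than the detection rate (79.59%), which means the frauds that were missed carried above-average amounts.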

Building and Training the GBDT Model

```python
# Train gradient-boosted decision trees with default hyperparameters
gbdt_clf = GradientBoostingClassifier(random_state=RANDOM_STATE)

gbdt_clf.fit(train_df[predictors], train_df[target].values)

# Overwrite the prediction column with the GBDT predictions
test_df['prediction'] = gbdt_clf.predict(test_df[predictors])

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = pd.crosstab(test_df[target].values, test_df['prediction'], rownames=['Actual'], colnames=['Predicted'])
fig, ax1 = plt.subplots(ncols=1, figsize=(7, 7))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True, ax=ax1,
            linewidths=.2, linecolor="Darkblue", cmap="Blues", fmt='d')
plt.title('Confusion Matrix', fontsize=16)
plt.show()
```

```python
fraud_loss_saved(test_df, 'GBDT - Test Data')
```

```
Total Fraud Amount in Test Data : 16078.4
Total Fraud Amount Detected in Test Data : 12252.02
Fraud Loss Saved (%): 76.2
Detection Rate (%) : 82.65
Accuracy : 99.79
F1 Score : 0.5724
AUC Score : 0.9124
```
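The AUC values in this case study are computed from hard 0/1 predictions. In that situation `roc_auc_score` reduces to the balanced accuracy, that is, the mean of the detection rate and the true-negative rate. The sketch below demonstrates this on synthetic data and contrasts it with an AUC computed from predicted probabilities; the synthetic dataset and the model settings are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

hard_labels = clf.predict(X_test)               # 0/1 decisions
fraud_scores = clf.predict_proba(X_test)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_test, hard_labels)
balanced_acc = balanced_accuracy_score(y_test, hard_labels)
auc_from_scores = roc_auc_score(y_test, fraud_scores)

print("AUC from hard labels  :", round(auc_from_labels, 4))
print("Balanced accuracy     :", round(balanced_acc, 4))
print("AUC from probabilities:", round(auc_from_scores, 4))
```

An AUC based on probabilities measures how well the model ranks transactions across all thresholds, so it is usually the more informative number when the decision threshold may still be tuned.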

Building and Training the XGBoost Model

```python
# Train XGBoost with default hyperparameters
xgb_clf = XGBClassifier(random_state=RANDOM_STATE)

xgb_clf.fit(train_df[predictors], train_df[target].values)

# Overwrite the prediction column with the XGBoost predictions
test_df['prediction'] = xgb_clf.predict(test_df[predictors])

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = pd.crosstab(test_df[target].values, test_df['prediction'], rownames=['Actual'], colnames=['Predicted'])
fig, ax1 = plt.subplots(ncols=1, figsize=(7, 7))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True, ax=ax1,
            linewidths=.2, linecolor="Darkblue", cmap="Blues", fmt='d')
plt.title('Confusion Matrix', fontsize=16)
plt.show()
```

```python
fraud_loss_saved(test_df, 'XGBoost - Test Data')
```

```
Total Fraud Amount in Test Data : 16078.4
Total Fraud Amount Detected in Test Data : 12346.92
Fraud Loss Saved (%): 76.79
Detection Rate (%) : 83.67
Accuracy : 99.94
F1 Score : 0.8241
AUC Score : 0.9182
```

Model Comparison

```python
metric_data
```

| | Model Name | Detection Rate | AUC | F1 Score | Accuracy | Fraud Loss Saved |
|---|---|---|---|---|---|---|
| 0 | AdaBoost - Test Data | 79.591837 | 0.897695 | 0.757282 | 99.912222 | 74.75 |
| 0 | GBDT - Test Data | 82.653061 | 0.912351 | 0.572438 | 99.787578 | 76.20 |
| 0 | XGBoost - Test Data | 83.673469 | 0.918200 | 0.824121 | 99.938556 | 76.79 |

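Among the three models, XGBoost performs best on every reported metric (detection rate 83.67%, AUC 0.918, F1 0.824, accuracy 99.94%, fraud loss saved 76.79%), so it identifies fraud most accurately and protects the most money. GBDT detects slightly more fraud than AdaBoost but has a much lower F1 score (0.572 vs. 0.757), which indicates that it raises more false alarms.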
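The table does not list precision, but it can be recovered from the detection rate (recall) and the F1 score, because F1 = 2PR / (P + R). The sketch below solves that relation for P using the values reported in the table; the helper name `precision_from` is hypothetical.

```python
# Detection rate (as a fraction) and F1 score as reported in the comparison table.
reported = {
    "AdaBoost": {"recall": 0.79591837, "f1": 0.757282},
    "GBDT":     {"recall": 0.82653061, "f1": 0.572438},
    "XGBoost":  {"recall": 0.83673469, "f1": 0.824121},
}

def precision_from(recall, f1):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)

precisions = {name: precision_from(m["recall"], m["f1"]) for name, m in reported.items()}
for name, p in precisions.items():
    print(f"{name}: precision ~ {p:.3f}")
```

The recovered precisions are roughly 0.72 for AdaBoost, 0.44 for GBDT and 0.81 for XGBoost, so more than half of GBDT's fraud alerts are false alarms even though its detection rate is higher than AdaBoost's.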
