[Machine Learning] Credit Card Fraud Detection in Practice: Logistic Regression + Oversampling

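Contents

Preface
- [Machine Learning] Credit Card Fraud Detection in Practice: Logistic Regression + Undersampling (https://blog.csdn.net/m0_66822255/article/details/161661031?spm=1001.2014.3001.5502)
1. Full Code
2. Oversampling vs. Undersampling
3. Key Difference 1: Handling the Imbalance
4. Key Difference 2: Threshold Tuning (Undersampling Version Only)
5. Key Difference 3: Sample Handling
6. Recommendations
Summary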

Preface

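Credit card fraud detection is a textbook case of extreme class imbalance: normal transactions (negative samples) vastly outnumber fraudulent ones (positive samples). A model trained directly on the raw data leans toward the majority class, so its recall on fraud ends up very low. This article uses SMOTE oversampling to generate new minority-class (fraud) samples until the two classes are balanced, fits a logistic regression model, selects the best regularization parameter by cross-validation, and then evaluates the final model.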
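To make the imbalance problem concrete, here is a minimal sketch of the extreme case: a degenerate model that labels every transaction as normal. The class counts are hypothetical and chosen only for illustration; they are not taken from creditcard.csv.

```python
# Hypothetical class counts, for illustration only (not from creditcard.csv).
n_normal, n_fraud = 9_983, 17

y_true = [0] * n_normal + [1] * n_fraud
y_pred = [0] * len(y_true)  # degenerate model: always predicts "normal"

# Count correct predictions and correctly detected frauds.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / len(y_true)
recall = true_pos / n_fraud

print(f"accuracy = {accuracy:.4f}")  # high, because almost everything is normal
print(f"recall   = {recall:.4f}")    # zero, because no fraud is ever flagged
```

Accuracy looks excellent while recall is zero, which is why the rest of the article scores models by recall rather than accuracy.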

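Readers who are interested in this project but new to it may want to start with my earlier article on logistic regression with undersampling:

[Machine Learning] Credit Card Fraud Detection in Practice: Logistic Regression + Undersampling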

1. Full Code


import pandas as pd
import numpy as np
import time

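# Plot the confusion matrix as a heat map
def cm_plot(y, yp):
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    cm = confusion_matrix(y, yp)
    plt.matshow(cm, cmap=plt.cm.Blues)
    plt.colorbar()
    # Annotate each cell with its count (row = true label, column = predicted label).
    # The loop variables are named row/col so they do not shadow the parameter y.
    for row in range(len(cm)):
        for col in range(len(cm)):
            plt.annotate(cm[row, col], xy=(col, row), horizontalalignment='center',
                         verticalalignment='center')
    # Axis labels only need to be set once, after the loop
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt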

data = pd.read_csv(r"creditcard.csv", encoding='utf8', engine='python')
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data[['Amount']])

data = data.drop(['Time'],axis=1)

from sklearn.model_selection import train_test_split
X_whole = data.drop('Class',axis=1)
y_whole = data.Class


x_train_w, x_test_w,y_train_w, y_test_w =\
    train_test_split(X_whole,y_whole, test_size = 0.3, random_state=100)

from imblearn.over_sampling import SMOTE

oversampling = SMOTE(random_state=0)
os_x_train,os_y_train = oversampling.fit_resample(x_train_w,y_train_w)

# Cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Note: SMOTE was applied before the data is split into folds, so the validation
# folds contain synthetic samples and the recall reported here is likely optimistic.
scores = []
c_param_range = [0.01,0.1,1,10,100]
z = 1
for i in c_param_range:  # one logistic regression model per C value (5 in total); the first iteration uses C=0.01
    start_time = time.time()
    lr = LogisticRegression(C = i, solver='lbfgs', max_iter=1000)
    score = cross_val_score(lr, os_x_train, os_y_train, cv=8,scoring='recall')
    # scoring options include 'accuracy', 'recall', and 'roc_auc'
    score_mean= sum(score)/len(score)  # mean recall over the 8 folds
    scores.append(score_mean)
    end_time = time.time()
    print('Round {}...'.format(z))
    print("time spent: {:.2f}".format(end_time-start_time))
    print("recall: {}".format(score_mean))
    z +=1

# Keep the C value with the highest mean recall
best_c=c_param_range[np.argmax(scores)]
print("........Best regularization parameter C: {}.....".format(best_c))

lr = LogisticRegression(C = best_c, solver='lbfgs',max_iter =1000)
lr.fit(os_x_train, os_y_train)
from sklearn import metrics
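# Predicted labels on the oversampled training set (large; includes SMOTE-generated samples)
train_predicted = lr.predict(os_x_train)
print(metrics.classification_report(os_y_train, train_predicted))
# Predicted labels on the held-out test set (smaller; original, imbalanced class distribution)
test_predicted = lr.predict(x_test_w)
print(metrics.classification_report(y_test_w, test_predicted))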

2. Oversampling vs. Undersampling

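Method	Core approach	Advantages	Disadvantages
Undersampling	Randomly draw from the majority class (normal transactions) as many samples as the minority class (fraud) has	Fast to train; creates no synthetic samples	Discards most of the majority-class information, so generalization may suffer
Oversampling (SMOTE)	Generate new samples by interpolating between minority-class samples	Keeps all majority-class information; less prone to overfitting than simply duplicating minority samples	Generated samples may be unrealistic; training takes longer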

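The two pipelines differ only in how the class imbalance is handled before cross-validation selects the logistic regression regularization parameter C, and in whether the decision threshold is tuned afterwards.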
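The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imblearn implementation: the function name `smote_like_sample`, the tiny 2-D data, and the choice of `k` are all assumptions made for the example.

```python
import numpy as np

def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point on the segment between a random minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the chosen sample to every minority sample.
    dists = np.linalg.norm(minority - base, axis=1)
    # Position 0 after sorting is the sample itself (distance 0), so skip it.
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbour_idx)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return base + lam * (neighbour - base)

# Hypothetical 2-D minority-class samples, for illustration only.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
rng = np.random.default_rng(0)
synthetic = np.array([smote_like_sample(minority, k=3, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies between two real minority samples, which is why SMOTE adds variety compared with plain duplication but can also place points in regions where no real fraud would occur.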

3. Key Difference 1: Handling the Imbalance

Undersampling


x_train_w_copy = x_train_w.copy()
x_train_w_copy['Class'] = y_train_w

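# Separate the majority class (normal) from the minority class (fraud)
pg = x_train_w_copy[x_train_w_copy['Class'] == 0]   # majority class
ng = x_train_w_copy[x_train_w_copy['Class'] == 1]   # minority class

# Randomly keep as many majority samples as there are minority samples
pg = pg.sample(len(ng))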

data_balanced = pd.concat([pg, ng])

X_train_balanced = data_balanced.drop('Class', axis=1)
y_train_balanced = data_balanced.Class

Oversampling


from imblearn.over_sampling import SMOTE

oversampler = SMOTE(random_state=0)
X_train_balanced, y_train_balanced = oversampler.fit_resample(x_train_w, y_train_w)

Both approaches end with X_train_balanced and y_train_balanced, but undersampling shrinks the majority class while oversampling grows the minority class.

4. Key Difference 2: Threshold Tuning (Undersampling Version Only)


# Probability of the fraud class; it does not depend on the threshold, so compute it once
proba = lr.predict_proba(x_test_w)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for thresh in thresholds:
    pred = (proba > thresh).astype(int)
    recall = metrics.recall_score(y_test_w, pred)
    print(f"threshold {thresh} -> recall: {recall:.3f}")

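The original oversampling code does not tune the threshold, but the snippet above can be appended to the SMOTE version and has a similar effect.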

5. Key Difference 3: Sample Handling

Undersampling version


# ---------- Undersampling ----------
train_temp = x_train_w.copy()
train_temp['Class'] = y_train_w
majority = train_temp[train_temp['Class'] == 0].sample(n=len(train_temp[train_temp['Class'] == 1]))
balanced = pd.concat([majority, train_temp[train_temp['Class'] == 1]])
X_bal = balanced.drop('Class', axis=1)
y_bal = balanced['Class']

# ---------- Select C by cross-validation ----------
best_c = ...  # same shared cross-validation code as above

# ---------- Train ----------
lr = LogisticRegression(C=best_c, solver='lbfgs', max_iter=1000)
lr.fit(X_bal, y_bal)

# ---------- Evaluate on the test set ----------
y_pred = lr.predict(x_test_w)
print(metrics.classification_report(y_test_w, y_pred))

# ---------- Threshold tuning ----------
prob = lr.predict_proba(x_test_w)[:, 1]  # fraud-class probability, computed once
for thresh in [0.3, 0.4, 0.5]:
    pred = (prob > thresh).astype(int)
    print(f"threshold {thresh} recall: {metrics.recall_score(y_test_w, pred):.3f}")

Oversampling version


# ---------- Oversampling ----------
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)
X_bal, y_bal = smote.fit_resample(x_train_w, y_train_w)

# ---------- Select C by cross-validation ----------
best_c = ...  # same shared cross-validation code as above

# ---------- Train ----------
lr = LogisticRegression(C=best_c, solver='lbfgs', max_iter=1000)
lr.fit(X_bal, y_bal)

# ---------- Evaluate on the test set ----------
y_pred = lr.predict(x_test_w)
print(metrics.classification_report(y_test_w, y_pred))

6. Recommendations

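On the same credit card dataset (30% test split, same random seed):

Undersampling produces a small training set: twice the number of minority-class (fraud) samples in the training split.

Oversampling produces a large training set: twice the number of majority-class (normal) samples in the training split.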
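The size difference is simple arithmetic, sketched below. The helper `balanced_sizes` and the class counts are hypothetical, and the oversampled size assumes SMOTE's default behaviour of raising the minority class to the majority count.

```python
def balanced_sizes(n_majority, n_minority):
    """Return (undersampled_total, oversampled_total) for a binary problem."""
    undersampled_total = 2 * n_minority  # majority cut down to the minority count
    oversampled_total = 2 * n_majority   # minority raised to the majority count
    return undersampled_total, oversampled_total

# Hypothetical training-split class counts, for illustration only.
n_majority, n_minority = 199_000, 350
under_total, over_total = balanced_sizes(n_majority, n_minority)

print(f"undersampled training set: {under_total} rows")
print(f"oversampled training set:  {over_total} rows")
print(f"size ratio: {over_total / under_total:.0f}x")
```

With counts of this order the oversampled set is hundreds of times larger, which is where the difference in training time comes from.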

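If missing a fraud is very costly (bank transactions, for example), prefer SMOTE, which tends to give higher recall here.

If computing resources are limited or the dataset is extremely large, use undersampling together with threshold tuning.

With either method, lowering the classification threshold (to 0.3, say) raises recall further, at the cost of precision: more normal transactions get blocked by mistake.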
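The trade-off can be seen on a toy example. The labels and predicted fraud probabilities below are made up for illustration, and `precision_recall_at` is a hypothetical helper, not part of scikit-learn. It uses the same `proba > threshold` rule as the threshold-tuning code earlier.

```python
import numpy as np

def precision_recall_at(y_true, proba, threshold):
    """Precision and recall of the fraud class when flagging proba > threshold."""
    pred = proba > threshold
    tp = np.sum(pred & (y_true == 1))   # frauds correctly flagged
    fp = np.sum(pred & (y_true == 0))   # normal transactions wrongly flagged
    fn = np.sum(~pred & (y_true == 1))  # frauds missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.55, 0.40, 0.60, 0.65, 0.90])

results = {t: precision_recall_at(y_true, proba, t) for t in (0.5, 0.3)}
for t, (precision, recall) in results.items():
    print(f"threshold {t}: precision = {precision:.3f}, recall = {recall:.3f}")
```

On this toy data, moving the threshold from 0.5 to 0.3 catches the one fraud that was missed, but two more normal transactions are flagged as well.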

Summary

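Because the classes are equal in size after oversampling, the model usually fits the balanced training data well, and both recall and precision there can be very high.

The test set, however, keeps the original extreme imbalance (very few fraud samples). The model can still reach good recall on it (typically 80% to 95%), but precision may be low, because some normal transactions are misclassified as fraud. Depending on business needs, the classification threshold can then be adjusted (for example, flagging fraud at a probability below 0.5) to raise recall further.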
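Why precision drops on the imbalanced test set follows from the base rate. The sketch below is a back-of-the-envelope calculation, not a result from this dataset: the function `expected_precision` and all three input rates are hypothetical.

```python
def expected_precision(prevalence, recall, false_positive_rate):
    """Expected precision given the fraud share, recall, and false positive rate."""
    true_pos = prevalence * recall                     # share of all samples: fraud and flagged
    false_pos = (1 - prevalence) * false_positive_rate  # share of all samples: normal but flagged
    return true_pos / (true_pos + false_pos)

# Same hypothetical classifier quality, two different class ratios.
balanced = expected_precision(prevalence=0.5, recall=0.90, false_positive_rate=0.02)
imbalanced = expected_precision(prevalence=0.002, recall=0.90, false_positive_rate=0.02)

print(f"precision on balanced data:   {balanced:.3f}")
print(f"precision on imbalanced data: {imbalanced:.3f}")
```

With identical recall and false positive rate, precision falls sharply once fraud is rare, because even a small fraction of the many normal transactions outnumbers the true frauds.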