Machine Learning: Adjusting a Logistic Regression Model for Imbalanced Samples with Oversampling and Undersampling

Contents

  • Adjusting a logistic regression model for imbalanced samples with oversampling and undersampling
    • 1 Oversampling
      • 1.1 Class imbalance
      • 1.2 Concept
      • 1.3 Illustration
      • 1.4 The SMOTE algorithm
      • 1.5 Importing the algorithm
      • 1.6 Functions and usage
      • 1.7 Visualizing the class counts
    • 2 Undersampling
      • 2.1 Concept
      • 2.2 Illustration
      • 2.3 Data-processing steps
      • 2.4 Visualizing the class counts
    • 3 Tuning the model in practice

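1 Oversampling


1.1 Class imbalance

A dataset is imbalanced when the number of samples differs greatly between classes, typically with one class far outnumbering the others.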
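
A quick way to judge how skewed a label column is: count each class and compare the largest count with the smallest. The sketch below uses a small made-up label list rather than the credit-card data, and `imbalance_ratio` is a hypothetical helper name introduced only for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the per-class counts and the majority/minority count ratio."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Toy labels: 95 negatives (0) and 5 positives (1); illustrative only.
toy_labels = [0] * 95 + [1] * 5
counts, ratio = imbalance_ratio(toy_labels)
print(dict(counts), ratio)  # {0: 95, 1: 5} 19.0
```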

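1.2 Concept

Oversampling increases the number of minority-class samples until it matches the number of majority-class samples.

1.3 Illustration

1.4 The SMOTE algorithm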
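
SMOTE (Synthetic Minority Over-sampling Technique) builds new minority samples by interpolation: it picks a minority point, picks one of that point's nearest minority neighbours, and places a synthetic point at a random position on the segment between them. The following is a simplified NumPy sketch of that idea, not the imbalanced-learn implementation; the function name `smote_sketch` and its parameters are assumptions made for this example.

```python
import numpy as np


def smote_sketch(x_min, n_new, k=3, seed=0):
    """Generate n_new synthetic rows from the minority samples x_min."""
    rng = np.random.default_rng(seed)
    x_min = np.asarray(x_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(x_min[:, None, :] - x_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest minority neighbours
    synthetic = np.empty((n_new, x_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(x_min))            # random minority sample
        nb = x_min[rng.choice(neighbours[i])]   # one of its k neighbours
        gap = rng.random()                      # position on the segment, in [0, 1)
        synthetic[j] = x_min[i] + gap * (nb - x_min[i])
    return synthetic


# Four made-up minority points in two dimensions.
x_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_sketch(x_minority, n_new=6)
print(new_rows.shape)  # (6, 2)
```

Because every synthetic row is a convex combination of two real minority points, the new samples stay inside the region already occupied by the minority class.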

1.5 Importing the algorithm

python
# SMOTE ships with the imbalanced-learn package (pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE

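1.6 Functions and usage

  • ov = SMOTE(random_state=0): creates the SMOTE oversampler.

random_state is the random seed; using the same value makes the resampling reproducible.

  • x_ov, y_ov = ov.fit_resample(x_tr_all, y_tr_all)
    • x_ov, y_ov: the resampled features and labels, in which both classes have the same number of samples
    • x_tr_all, y_tr_all: the original training features and labels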

1.7 Visualizing the class counts

python
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Font settings kept from the original; they matter only when labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']
mpl.rcParams['axes.unicode_minus'] = False

# Load the data and z-score standardize Amount, writing the result back to the Amount column
scaler = StandardScaler()
data = pd.read_csv('creditcard.csv')
data['Amount'] = scaler.fit_transform(data[['Amount']])
# Drop the Time column
data = data.drop(['Time'], axis=1)
# Features x: every column except Class
x_all = data.drop(['Class'], axis=1)
# Class is the label column
y_all = data.Class
# Training features, test features, training labels, test labels;
# test_size is the fraction of rows held out for testing
x_tr_all, x_te_all, y_tr_all, y_te_all = \
    train_test_split(x_all, y_all, test_size=0.2, random_state=1000)

# Bar chart of the imbalanced class counts (whole dataset)
labels_count = y_all.value_counts()
plt.title('Class counts before oversampling')
plt.xlabel('Class')
plt.ylabel('Count')
labels_count.plot(kind='bar')
plt.show()

# Oversample the training set so both classes are balanced
ov = SMOTE(random_state=0)
x_tr_ov, y_tr_ov = ov.fit_resample(x_tr_all, y_tr_all)

# Bar chart of the balanced class counts (resampled training set)
labels_count = y_tr_ov.value_counts()
plt.title('Class counts after oversampling')
plt.xlabel('Class')
plt.ylabel('Count')
labels_count.plot(kind='bar')
plt.show()

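2 Undersampling


2.1 Concept

Undersampling reduces the number of majority-class samples until it matches the number of minority-class samples. The discarded rows may carry important information, so some signal can be lost.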

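2.2 Illustration

2.3 Data-processing steps

  • pt_eg = data_tr[data_tr['Class'] == 0]: select the majority-class rows (Class 0)
  • ng_eg = data_tr[data_tr['Class'] == 1]: select the minority-class rows (Class 1)
  • pt_eg = pt_eg.sample(len(ng_eg)): randomly draw as many majority rows as there are minority rows
  • data_c = pd.concat([pt_eg, ng_eg]): concatenate the two classes into one balanced table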
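
The four steps above can be tried on a toy table before running them on the real data. In this sketch the DataFrame contents and the feature column `V1` are made up, and `random_state=4` is passed to `sample` (an assumption of this example) so the draw is reproducible.

```python
import pandas as pd

# Toy training table: 20 majority rows (Class 0) and 4 minority rows (Class 1).
data_tr = pd.DataFrame({
    "V1": range(24),
    "Class": [0] * 20 + [1] * 4,
})

pt_eg = data_tr[data_tr["Class"] == 0]             # majority class
ng_eg = data_tr[data_tr["Class"] == 1]             # minority class
pt_eg = pt_eg.sample(len(ng_eg), random_state=4)   # keep 4 random majority rows
data_c = pd.concat([pt_eg, ng_eg])                 # balanced table

print(data_c["Class"].value_counts().to_dict())
```

After the concat both classes have four rows each, and sixteen majority rows have been discarded, which is the information loss mentioned in 2.1.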

2.4 Visualizing the class counts

Code:

python
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Font settings kept from the original; they matter only when labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']
mpl.rcParams['axes.unicode_minus'] = False

# Load the data and z-score standardize Amount, writing the result back to the Amount column
scaler = StandardScaler()
data = pd.read_csv('creditcard.csv')
data['Amount'] = scaler.fit_transform(data[['Amount']])
# Drop the Time column
data = data.drop(['Time'], axis=1)
# Features x: every column except Class
x_all = data.drop(['Class'], axis=1)
# Class is the label column
y_all = data.Class
# Training features, test features, training labels, test labels;
# test_size is the fraction of rows held out for testing
x_tr_all, x_te_all, y_tr_all, y_te_all = \
    train_test_split(x_all, y_all, test_size=0.2, random_state=1000)

# Bar chart of the imbalanced class counts (whole dataset)
labels_count = y_all.value_counts()
plt.title('Class counts before undersampling')
plt.xlabel('Class')
plt.ylabel('Count')
labels_count.plot(kind='bar')
plt.show()

# Undersampling
# Attach the label Series as a new column on a copy of the feature DataFrame,
# so x_tr_all itself is left unchanged
data_tr = x_tr_all.copy()
data_tr['Class'] = y_tr_all
pt_eg = data_tr[data_tr['Class'] == 0]
ng_eg = data_tr[data_tr['Class'] == 1]
# Randomly keep as many majority rows as there are minority rows;
# random_state fixes the draw
pt_eg = pt_eg.sample(len(ng_eg), random_state=4)
data_c = pd.concat([pt_eg, ng_eg])
x_data_c = data_c.drop(['Class'], axis=1)
y_data_c = data_c['Class']

# Bar chart of the balanced class counts (undersampled training set)
labels_count = y_data_c.value_counts()
plt.title('Class counts after undersampling')
plt.xlabel('Class')
plt.ylabel('Count')
labels_count.plot(kind='bar')
plt.show()


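3 Tuning the model in practice

The code below trains the model on the imbalanced samples, the undersampled samples, and the oversampled samples in turn and prints a report for each. Recall on the minority class rises clearly once the training samples are balanced.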
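
Recall is the fraction of actual positives the model finds, TP / (TP + FN), which is why it is the metric watched here rather than accuracy. A minimal sketch with made-up labels shows how a model that never predicts the rare class still scores high accuracy; `recall_and_accuracy` is a hypothetical helper written for this illustration.

```python
def recall_and_accuracy(y_true, y_pred, positive=1):
    """Compute recall for the positive class and overall accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    recall = tp / (tp + fn)
    accuracy = correct / len(y_true)
    return recall, accuracy


# 98 negatives and 2 positives; the "model" predicts negative for everything.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100
recall, accuracy = recall_and_accuracy(y_true, y_pred)
print(recall, accuracy)  # 0.0 0.98
```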

Code:

python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

# Load the data and z-score standardize Amount, writing the result back to the Amount column
scaler = StandardScaler()
data = pd.read_csv('creditcard.csv')
data['Amount'] = scaler.fit_transform(data[['Amount']])
# Drop the Time column
data = data.drop(['Time'], axis=1)
# Features x: every column except Class
x_all = data.drop(['Class'], axis=1)
# Class is the label column
y_all = data.Class
# Training features, test features, training labels, test labels;
# test_size is the fraction of rows held out for testing
x_tr_all, x_te_all, y_tr_all, y_te_all = \
    train_test_split(x_all, y_all, test_size=0.2, random_state=1000)

# Imbalanced samples: choose C by cross-validated recall
scores = []
c_range = [0.01, 0.1, 1, 10, 100]
# Try each regularization strength C
for i in c_range:
    lg = LogisticRegression(C=i, penalty='l2', solver='lbfgs', max_iter=1000)
    # Recall on each of the 5 cross-validation folds
    score = cross_val_score(lg, x_tr_all, y_tr_all, cv=5, scoring='recall')
    # Mean recall over the folds
    score_m = sum(score) / len(score)
    # Record the mean recall for this C
    scores.append(score_m)
best_c = c_range[np.argmax(scores)]
lg = LogisticRegression(C=best_c, penalty='l2', max_iter=1000)
# Fit on the training set, then evaluate on the held-out test set
lg.fit(x_tr_all, y_tr_all)
te_pr = lg.predict(x_te_all)
print("Trained on imbalanced samples")
print(metrics.classification_report(y_te_all, te_pr))

# Undersampling
# Attach the labels to a copy of the training features so x_tr_all is left unchanged
data_tr = x_tr_all.copy()
data_tr['Class'] = y_tr_all
pt_eg = data_tr[data_tr['Class'] == 0]
ng_eg = data_tr[data_tr['Class'] == 1]
# Randomly keep as many majority rows as there are minority rows;
# random_state fixes the draw
pt_eg = pt_eg.sample(len(ng_eg), random_state=4)
data_c = pd.concat([pt_eg, ng_eg])
x_data_c = data_c.drop(['Class'], axis=1)
# Class is the label column
y_data_c = data_c.Class
# Cross-validation
scores = []
c_range = [0.01, 0.1, 1, 10, 100]
# Try each regularization strength C
for i in c_range:
    lg = LogisticRegression(C=i, penalty='l2', solver='lbfgs', max_iter=1000)
    # Recall on each of the 5 cross-validation folds
    score = cross_val_score(lg, x_data_c, y_data_c, cv=5, scoring='recall')
    # Mean recall over the folds
    score_m = sum(score) / len(score)
    # Record the mean recall for this C
    scores.append(score_m)
best_c = c_range[np.argmax(scores)]
# Build the final model with the C that gave the highest mean recall
lg = LogisticRegression(C=best_c, penalty='l2', max_iter=1000)
lg.fit(x_data_c, y_data_c)
te_pr = lg.predict(x_te_all)
print("Trained on undersampled, balanced samples")
print(metrics.classification_report(y_te_all, te_pr))

# Oversampling
from imblearn.over_sampling import SMOTE
# Rebuild the same train/test split from the raw file so this part starts from clean feature tables
scaler = StandardScaler()
data = pd.read_csv('creditcard.csv')
# z-score standardize Amount, writing the result back to the Amount column
data['Amount'] = scaler.fit_transform(data[['Amount']])
# Drop the Time column
data = data.drop(['Time'], axis=1)
# Features x: every column except Class
x_all = data.drop(['Class'], axis=1)
# Class is the label column
y_all = data.Class
x_tr_all, x_te_all, y_tr_all, y_te_all = \
    train_test_split(x_all, y_all, test_size=0.2, random_state=1000)
# Oversample only the training set
ov = SMOTE(random_state=0)
x_tr_ov, y_tr_ov = ov.fit_resample(x_tr_all, y_tr_all)
# Cross-validation
scores = []
c_range = [0.01, 0.1, 1, 10, 100]
# Try each regularization strength C
for i in c_range:
    lg = LogisticRegression(C=i, penalty='l2', solver='lbfgs', max_iter=1000)
    # Recall on each of the 5 cross-validation folds
    score = cross_val_score(lg, x_tr_ov, y_tr_ov, cv=5, scoring='recall')
    # Mean recall over the folds
    score_m = sum(score) / len(score)
    # Record the mean recall for this C
    scores.append(score_m)
best_c = c_range[np.argmax(scores)]
lg = LogisticRegression(C=best_c, penalty='l2', max_iter=1000)
lg.fit(x_tr_ov, y_tr_ov)
te_pr1 = lg.predict(x_te_all)
print("Trained on oversampled, balanced samples")
print(metrics.classification_report(y_te_all, te_pr1))

Output:
