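Machine Learning in Practice: A Benign/Malignant Tumor Classifier (Binary Logistic Regression)
Preface
This experiment inevitably has flaws and mistakes; corrections and criticism are welcome!
If you run into environment problems (Jupyter Notebook setup, incompatible modules), see the linked setup guide.
Dataset and experiment file download
File shared via Baidu Netdisk: 肿瘤诊断.zip
Link: https://pan.baidu.com/s/1Q55Y71pEC4G51AFjfuJ18g?pwd=tmdt
Extraction code: tmdt
Related reading
For detailed notes on binary logistic regression, see the article below, which explains all the core concepts used in this experiment:
Supervised Learning in Machine Learning (2): Binary Logistic Regression (机器学习之监督学习(二)二元逻辑回归)
Experiment
Importing the required modules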
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from logistic_regression import *
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,precision_score,recall_score,roc_auc_score,roc_curve,confusion_matrix,precision_recall_curve,f1_score,fbeta_score,auc
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,train_test_split
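Here logistic_regression.py is a module I wrote myself. It has two parts:
① a hand-written binary logistic regression (L2 regularization + mini-batch gradient descent);
② classifier evaluation, which outputs several metrics (accuracy, precision, recall, f1_score, fbeta_score, the P-R curve, the ROC curve, and the AUC value).
Its contents are as follows: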
python
import numpy as np
import matplotlib.pyplot as plt
from math import exp
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, roc_curve, confusion_matrix,
                             precision_recall_curve, f1_score, fbeta_score, auc)

# Prediction function
def logistic_predict(X, w, b, sigma=0.5):
    # sigma: decision threshold, default 0.5
    # p: predicted probability vector, y_pred: predicted class vector
    m = X.shape[0]
    y_pred = np.zeros(m)
    p = np.zeros(m)
    for i in range(m):
        z_i = np.dot(X[i], w) + b
        p[i] = sigmoid(z_i)
        y_pred[i] = p[i] >= sigma
    return p, y_pred

# Sigmoid function (scalar input)
def sigmoid(z):
    return 1 / (1 + exp(-z))

# Cost function (with regularization)
def get_cost_logistic(X, y, w, b, lamb):
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        z_i = np.dot(X[i], w) + b
        f_wb_i = sigmoid(z_i)
        cost += -y[i] * np.log(f_wb_i) - (1 - y[i]) * np.log(1 - f_wb_i)
    cost = cost / m
    # Add the regularization term
    reg_cost = (lamb / (2 * m)) * np.sum(np.square(w))
    cost += reg_cost
    return cost

# Gradient function (with regularization)
def get_gradient(X, y, w, b, lamb):
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0
    for i in range(m):
        error = sigmoid(np.dot(X[i, :], w) + b) - y[i]
        dj_db += error
        for j in range(n):
            dj_dw[j] += (error * X[i, j])
    # Average the gradient
    dj_db /= m
    dj_dw /= m
    # Add the gradient of the regularization term
    dj_dw += (lamb / m) * w
    return dj_dw, dj_db

# Mini-batch gradient descent
def gradient_descent(X, y, w_in, b_in, alpha=0.01, lamb=0.01, iters=1000, batch_size=32):
    '''
    Args:
        X: feature matrix
        y: target vector
        w_in: initial weight vector
        b_in: initial bias
        alpha: learning rate, default 0.01
        lamb: regularization coefficient, default 0.01
        iters: number of gradient descent iterations, default 1000
        batch_size: mini-batch size, default 32
    Outputs:
        w: trained weight vector
        b: trained bias
        p: probability vector predicted with the final parameters
        y_hat: class vector predicted with the final parameters
        cost_his: cost history recorded during training
    '''
    m = X.shape[0]
    w = w_in
    b = b_in
    cost_his = []
    for i in range(iters):
        # Randomly draw batch_size samples (without replacement)
        indices = np.random.choice(m, batch_size, replace=False)
        X_batch = X[indices]
        y_batch = y[indices]
        dj_dw, dj_db = get_gradient(X_batch, y_batch, w, b, lamb)
        w = w - dj_dw * alpha
        b = b - dj_db * alpha
        cost_his.append(get_cost_logistic(X, y, w, b, lamb))  # cost on the whole dataset
        if i % (iters / 10) == 0:
            print(f'iteration: {i}, cost: {cost_his[i]}')
    print(f'final w: {w}, b: {b}')
    p, y_hat = logistic_predict(X, w, b)
    return w, b, p, y_hat, cost_his

# Plot the P-R curve
def plot_precision_recall_curve(y_train, y_score):
    # Compute precision and recall
    _precision, _recall, thresholds = precision_recall_curve(y_train, y_score)
    # Draw the precision-recall curve
    plt.figure(figsize=(8, 6))
    plt.plot(_recall, _precision, color='blue')
    plt.xlabel('召回率 (Recall)')
    plt.ylabel('精确率 (Precision)')
    plt.title('Precision-Recall Curve')
    plt.show()

# Plot the ROC curve
def plot_roc_curve(y_train, y_score):
    # Compute the ROC curve
    fpr, tpr, thresholds = roc_curve(y_train, y_score)
    # Compute the AUC
    roc_auc = auc(fpr, tpr)
    # Draw the ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='red', linestyle='--')  # diagonal
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('假阳性率 (False Positive Rate)')
    plt.ylabel('真阳性率 (True Positive Rate)')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc='lower right')
    plt.grid()
    plt.show()
    return roc_auc

# Evaluate the classifier and output several metrics (accuracy, precision, recall,
# f1_score, fbeta_score, the P-R curve, the ROC curve, and the AUC value)
def judge_classification(y_target, y_predicted, y_score):
    '''
    Args:
        y_target: target vector
        y_predicted: predicted class vector
        y_score: predicted probability vector
    '''
    option = int(input('是否设置fbeta_score,1-yes,0-no'))
    if option:
        beta = float(input('输入beta:'))  # float so that values such as 0.5 are accepted
    else:
        beta = 1
    # Compute the metrics
    accuracy = accuracy_score(y_target, y_predicted)
    confusion_Matrix = confusion_matrix(y_target, y_predicted)
    precision = precision_score(y_target, y_predicted)
    recall = recall_score(y_target, y_predicted)
    F1_score = f1_score(y_target, y_predicted)
    # beta must be passed by keyword (positional use raises an error from sklearn 1.0 on)
    fBeta_score = fbeta_score(y_target, y_predicted, beta=beta)
    AUC = roc_auc_score(y_target, y_score)
    # Print the results
    print(f"准确率 (Accuracy): {accuracy:.4f}")
    print(f'混淆矩阵(Confusion Matrix):{confusion_Matrix}')
    print(f"精确率 (Precision): {precision:.4f}")
    print(f"召回率 (Recall): {recall:.4f}")
    print(f'f1 score:{F1_score:.4f}')
    if option:
        print(f'fbeta score:{fBeta_score:4f}')
    print('P-R曲线图如下:')
    plot_precision_recall_curve(y_target, y_score)
    print('roc曲线图如下:')
    plot_roc_curve(y_target, y_score)
    print(f'roc_auc_score:{AUC:4f}')
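The cost and gradient functions above loop over samples in pure Python, which is easy to follow but slow. As a rough sketch, the same formulas can be written with NumPy array operations. The names `sigmoid_vec`, `cost_vec` and `gradient_vec` and the toy data are hypothetical and are not part of logistic_regression.py.

```python
import numpy as np

def sigmoid_vec(z):
    """Element-wise sigmoid that accepts scalars or NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-z))

def cost_vec(X, y, w, b, lamb):
    """Regularized cross-entropy cost, same formula as get_cost_logistic."""
    m = X.shape[0]
    f = sigmoid_vec(X @ w + b)
    data_cost = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
    return data_cost + lamb / (2 * m) * np.sum(w ** 2)

def gradient_vec(X, y, w, b, lamb):
    """Gradient of the regularized cost, same formula as get_gradient."""
    m = X.shape[0]
    error = sigmoid_vec(X @ w + b) - y
    dj_dw = X.T @ error / m + (lamb / m) * w
    dj_db = error.mean()
    return dj_dw, dj_db

# Small random problem used only to exercise the functions (not the tumor data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(float)
w_demo = np.zeros(3)

# With w = 0 and b = 0 every probability is 0.5, so the cost equals log(2).
print(cost_vec(X_demo, y_demo, w_demo, 0.0, lamb=0.01))
```

Swapping such functions into the training loop should leave the update rule unchanged while removing most of the Python-level looping.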
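In addition, to display Chinese characters and minus signs correctly in the plots, configure matplotlib beforehand:
python
# Set a font that supports Chinese
import matplotlib
matplotlib.rcParams['font.family'] = 'SimHei' # or 'Microsoft YaHei'
matplotlib.rcParams['axes.unicode_minus'] = False # render the minus sign '-' correctly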
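Data preprocessing
Read the dataset with pandas' read_excel function. Because the original file stores the benign and malignant samples in separate blocks, shuffle the rows first with sample; frac=1 returns 100% of the rows in random order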
python
data=pd.read_excel('datasets/肿瘤数据.xlsx',sheet_name='Sheet1').sample(frac=1,random_state=42)
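To make the effect of `sample(frac=1, random_state=42)` concrete, here is a small sketch on a made-up frame (the six rows below are hypothetical and only mimic a file sorted by class).

```python
import pandas as pd

# Hypothetical toy frame standing in for the spreadsheet: rows sorted by class.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "Class": ["良性"] * 3 + ["恶性"] * 3})

# frac=1 samples 100% of the rows without replacement, i.e. a row-wise shuffle;
# random_state makes the permutation reproducible.
shuffled = df.sample(frac=1, random_state=42)

# The original index labels travel with the rows (this is why data.info() reports
# a non-sequential index); reset_index gives a clean 0..n-1 index.
shuffled_reset = shuffled.reset_index(drop=True)
print(shuffled_reset)
```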
View the first five rows of the data
python
data.head()
View the summary information of the data
python
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 675 entries, 396 to 102
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 675 non-null int64
1 Clump Thickness 675 non-null int64
2 Uniformity of Cell Size 675 non-null int64
3 Uniformity of Cell Shape 675 non-null int64
4 Marginal Adhesion 675 non-null int64
5 Single Epithelial Cell Size 675 non-null int64
6 Bare Nuclei 675 non-null object
7 Bland Chromatin 675 non-null int64
8 Normal Nucleoli 675 non-null int64
9 Mitoses 675 non-null int64
10 Class 675 non-null object
dtypes: int64(9), object(2)
memory usage: 63.3+ KB
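Column 6 (Bare Nuclei) has dtype object, which indicates non-numeric entries, so the data needs cleaning
Data cleaning: the 'Bare Nuclei' column contains missing values marked with '?'. Delete the rows with missing values and convert the column to int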
python
data['Bare Nuclei']=data['Bare Nuclei'].astype('string')
cond=(data['Bare Nuclei']!='?')
data=data[cond]
data['Bare Nuclei']=data['Bare Nuclei'].astype('int')
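An alternative way to do the same cleaning, shown here only as a sketch on a hypothetical four-row frame, is to let `pd.to_numeric` turn every non-numeric entry into NaN and then drop those rows. This also catches placeholders other than '?'.

```python
import pandas as pd

# Hypothetical toy data: one missing value marked with '?'.
toy = pd.DataFrame({"Bare Nuclei": [1, "?", 10, 3],
                    "Class": ["良性", "良性", "恶性", "良性"]})

# errors="coerce" converts anything that cannot be parsed as a number into NaN.
toy["Bare Nuclei"] = pd.to_numeric(toy["Bare Nuclei"], errors="coerce")

# Drop the rows with missing values; .copy() avoids chained-assignment warnings.
clean = toy.dropna(subset=["Bare Nuclei"]).copy()
clean["Bare Nuclei"] = clean["Bare Nuclei"].astype(int)
print(clean)
```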
The id column is not a useful feature, so set it as the index
python
data.set_index('id',drop=True,inplace=True)
Extract the features and the target
python
X_=data.iloc[:,:-1]
y_=data.iloc[:,-1]
Split the dataset (training set : test set = 7:3) {no validation set is used in this experiment}
python
X_train,X_test,y_train,y_test=train_test_split(X_,y_,test_size=0.3)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(462, 9)
(462,)
(199, 9)
(199,)
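The call above does not fix a random seed, so each run produces a different split. A minimal sketch of a reproducible, class-balanced split is shown below; the toy arrays and the choice of `random_state=42` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 35% positive.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 65 + [1] * 35)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

print(X_tr.shape, X_te.shape)      # (70, 2) (30, 2)
print(y_tr.mean(), y_te.mean())    # both close to 0.35
```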
Feature scaling with the standardization module in sklearn.preprocessing. Note that this step converts the datasets from DataFrame to ndarray
python
scaler=StandardScaler()
X_s=scaler.fit_transform(X_)
X_train_s=scaler.fit_transform(X_train)
X_test_s=scaler.fit_transform(X_test)
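The code above calls `fit_transform` separately on `X_`, `X_train` and `X_test`, so each set is standardized with its own mean and standard deviation. A common alternative is to fit the scaler on the training set only and reuse those statistics for the test set, so that no information from the test data enters preprocessing. The sketch below uses random toy data; with the real data this change may shift the reported metrics slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features standing in for X_train and X_test.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test_toy = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learn mean/std from the training data
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly 0
```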
Encode the output: 1 = malignant, 0 = benign, using the map function with a conversion dictionary
python
y_train=y_train.map({'良性':0,'恶性':1}).values
y_test=y_test.map({'良性':0,'恶性':1}).values
y_=y_.map({'良性':0,'恶性':1}).values
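One thing to watch with `map` is that any label missing from the dictionary silently becomes NaN. A quick check, sketched here on a hypothetical series with one unexpected label, can catch this before training.

```python
import pandas as pd

# Hypothetical labels; '未知' is deliberately not in the mapping.
labels = pd.Series(["良性", "恶性", "良性", "未知"])
encoded = labels.map({"良性": 0, "恶性": 1})

# Unmapped values become NaN, so count them before calling .values.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)  # 1
```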
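Apply the same processing to a separate, unlabeled prediction dataset
python
# Read the file
data_predicted=pd.read_excel('datasets/肿瘤测试数据.xlsx',sheet_name='Sheet1')
# Use the id column as the index
data_predicted.set_index('id',drop=True,inplace=True)
# .copy() so that the in-place edits below act on an independent frame
X_predicted=data_predicted.iloc[:,:-1].copy()
# Fill missing values with the mean of that column in X_test
X_predicted.replace('?',np.nan,inplace=True)
X_predicted['Bare Nuclei']=X_predicted['Bare Nuclei'].fillna(X_test['Bare Nuclei'].mean())
# Feature scaling
X_predicted_s=scaler.fit_transform(X_predicted)
X_predicted_s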
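Hand-written binary logistic regression model (mini-batch gradient descent)
Get the training-set dimensions and initialize the weight vector and bias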
python
m,n=X_train_s.shape
w_in=np.zeros(n)
b_in=0
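Call the hand-written training function to start training. The learning rate and regularization coefficient use their default of 0.01, the mini-batch size uses its default of 32, and the number of iterations is set to 10000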
python
%%time
w1,b1,p_train,y_train_hat,his=gradient_descent(X_train_s,y_train,w_in,b_in,iters=10000)
iteration: 0, cost: 0.6835531909218033
iteration: 1000, cost: 0.09561179284006076
iteration: 2000, cost: 0.08143988852386196
iteration: 3000, cost: 0.07602179037731364
iteration: 4000, cost: 0.07297962427710258
iteration: 5000, cost: 0.07104092210035404
iteration: 6000, cost: 0.06952930739554782
iteration: 7000, cost: 0.06836304664921707
iteration: 8000, cost: 0.0674995109940224
iteration: 9000, cost: 0.0668178167569174
final w: [1.45252021 0.53480326 0.76912313 0.661257 0.15524646 1.3456865
0.90781468 0.86672594 0.80311382], b: -0.7663124236956939
CPU times: total: 25.5 s
Wall time: 1min 33s
Plot the learning curve (cost vs. iteration)
python
plt.plot(his)
Check the training performance. For the meaning of each metric, see 机器学习之监督学习(二)二元逻辑回归; the output plots can be viewed in the experiment files.
python
judge_classification(y_train,y_train_hat,p_train)
是否设置fbeta_score,1-yes,0-no 1
输入beta: 2
C:\Users\21316\.conda\envs\ai\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass beta=2 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
准确率 (Accuracy): 0.9762
混淆矩阵(Confusion Matrix):[[294 4]
[ 7 157]]
精确率 (Precision): 0.9752
召回率 (Recall): 0.9573
f1 score:0.9662
fbeta score:0.960832
P-R曲线图如下:
roc曲线图如下:
推荐阈值0.30989255200204613
roc_auc_score:0.997299
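The precision, recall and fbeta values printed above follow directly from the confusion matrix. As a quick check, the sketch below recomputes them by hand with the standard F-beta formula, taking β=2 as entered at the prompt.

```python
# Training-set confusion matrix reported above (threshold 0.5):
# rows are true classes (0, 1), columns are predicted classes (0, 1).
tn, fp, fn, tp = 294, 4, 7, 157

precision = tp / (tp + fp)   # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)      # share of malignant cases that were found

# F-beta weights recall beta times as much as precision.
beta = 2
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f"precision={precision:.4f}, recall={recall:.4f}, F2={f_beta:.6f}")
```

The result agrees with the 0.9752, 0.9573 and 0.960832 shown in the output.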
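The classification results are quite good: only 4 false positives and 7 false negatives (read from the confusion matrix). I set β=2, the weight of recall relative to precision, because in my view it matters more that malignant tumors are successfully diagnosed than that benign tumors are not misdiagnosed as malignant. Recall is therefore emphasized and the f2_score is used. On the training set the f2_score is 0.961 and auc=0.997. The output also recommends a threshold of about 0.31, so let's test the results again with a threshold of 0.3.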
python
y_train_hat_=(p_train>0.3)
judge_classification(y_train,y_train_hat_,p_train)
是否设置fbeta_score,1-yes,0-no 1
输入beta: 2
C:\Users\21316\.conda\envs\ai\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass beta=2 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
准确率 (Accuracy): 0.9805
混淆矩阵(Confusion Matrix):[[290 8]
[ 1 163]]
精确率 (Precision): 0.9532
召回率 (Recall): 0.9939
f1 score:0.9731
fbeta score:0.985490
P-R曲线图如下:
roc曲线图如下:
推荐阈值0.30989255200204613
roc_auc_score:0.997299
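The f2_score rises to 0.9855. The model parameters are unchanged, but at this threshold false negatives drop from 7 to 1 at the cost of 4 more false positives, which suits our recall-weighted objective better, so 0.3 is a more suitable threshold for this tumor classifier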
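The notebook output shows a recommended threshold, but the code that computes it is not part of the module listing above. One plausible approach, which is an assumption and not necessarily what the experiment file does, is to scan the thresholds on the P-R curve and keep the one with the highest F-beta. The helper name `best_fbeta_threshold` and the toy scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_fbeta_threshold(y_true, y_score, beta=2.0):
    """Return (threshold, F-beta) for the P-R curve point with the highest F-beta."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision, recall) point has no associated threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    num = (1 + beta**2) * precision * recall
    den = beta**2 * precision + recall
    # Guard against 0/0 when both precision and recall are zero.
    fbeta = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    best = int(np.argmax(fbeta))
    return float(thresholds[best]), float(fbeta[best])

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.35, 0.3, 0.6, 0.55, 0.7, 0.8, 0.9])

thr, f2 = best_fbeta_threshold(y_true, y_score, beta=2.0)
print(thr, f2)
```

Choosing the threshold on the training set and then reporting test metrics at that threshold, as done in this experiment, keeps the test set out of the selection step.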
Next, check how the hand-written model performs on the test set
python
p_test,y_test_hat=logistic_predict(X_test_s,w1,b1,0.3)
judge_classification(y_test,y_test_hat,p_test)
是否设置fbeta_score,1-yes,0-no 1
输入beta: 2
C:\Users\21316\.conda\envs\ai\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass beta=2 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
准确率 (Accuracy): 0.9648
混淆矩阵(Confusion Matrix):[[125 5]
[ 2 67]]
精确率 (Precision): 0.9306
召回率 (Recall): 0.9710
f1 score:0.9504
fbeta score:0.962644
P-R曲线图如下:
roc曲线图如下:
推荐阈值0.19691750268862288
roc_auc_score:0.994426
Summary of the hand-written model:
Metric | train (σ=0.5) | train (σ=0.3) | test (σ=0.3) |
---|---|---|---|
ACC | 0.976 | 0.981 | 0.965 |
Precision | 0.975 | 0.953 | 0.931 |
Recall | 0.957 | 0.994 | 0.971 |
f1_score | 0.966 | 0.973 | 0.950 |
f2_score | 0.961 | 0.985 | 0.963 |
auc | 0.997 | 0.997 | 0.994 |
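sklearn logistic regression
Create a logistic regression classifier and fit it to the data. predict outputs the predicted class vector and predict_proba outputs the predicted probability vector; the default threshold of 0.5 is used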
python
LR=LogisticRegression()
LR.fit(X_train_s,y_train)
y_train_S=LR.predict(X_train_s)
p_train_S=LR.predict_proba(X_train_s)[:,1]
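`predict` applies a fixed 0.5 threshold for binary problems. To evaluate the sklearn model at another threshold, such as the 0.3 used for the hand-written model, the probabilities from `predict_proba` can be thresholded manually. The sketch below uses random toy data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem with two features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)[:, 1]      # probability of class 1

pred_default = clf.predict(X_toy)           # same as thresholding proba at 0.5
pred_custom = (proba >= 0.3).astype(int)    # lower threshold flags more samples as positive

print(pred_default.sum(), pred_custom.sum())
```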
Check the training performance
python
judge_classification(y_train,y_train_S,p_train_S)
是否设置fbeta_score,1-yes,0-no 1
输入beta: 2
C:\Users\21316\.conda\envs\ai\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass beta=2 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
准确率 (Accuracy): 0.9740
混淆矩阵(Confusion Matrix):[[293 5]
[ 7 157]]
精确率 (Precision): 0.9691
召回率 (Recall): 0.9573
f1 score:0.9632
fbeta score:0.959658
P-R曲线图如下:
roc曲线图如下:
推荐阈值0.2884938390167448
roc_auc_score:0.997626
Check the test performance
python
y_test_S=LR.predict(X_test_s)
p_test_S=LR.predict_proba(X_test_s)[:,1]
judge_classification(y_test,y_test_S,p_test_S)
是否设置fbeta_score,1-yes,0-no 1
输入beta: 2
C:\Users\21316\.conda\envs\ai\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass beta=2 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
准确率 (Accuracy): 0.9648
混淆矩阵(Confusion Matrix):[[125 5]
[ 2 67]]
精确率 (Precision): 0.9306
召回率 (Recall): 0.9710
f1 score:0.9504
fbeta score:0.962644
P-R曲线图如下:
roc曲线图如下:
推荐阈值0.1641399332130378
roc_auc_score:0.994091
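The test metrics of the sklearn logistic regression are almost identical to those of our hand-written model (note that the hand-written model was evaluated at threshold 0.3 and sklearn at its default 0.5), so the two models perform comparably. The experiment files also run predictions on another unlabeled dataset, where the two models give the same results; readers can open the experiment files to check.
Finally, run 10-fold cross-validation on the entire dataset; the mean accuracy reaches 0.967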
python
k = 10 # number of folds
cv_scores = cross_validate(LR, X_s, y_, cv=k, scoring=['precision','recall','accuracy'])
print(f"{k}-折交叉验证的平均精准率: {np.mean(cv_scores['test_precision']):.4f}")
print(f"{k}-折交叉验证的平均召回率: {np.mean(cv_scores['test_recall']):.4f}")
print(f"{k}-折交叉验证的平均准确率: {np.mean(cv_scores['test_accuracy']):.4f}")
10-折交叉验证的平均精准率: 0.9579
10-折交叉验证的平均召回率: 0.9486
10-折交叉验证的平均准确率: 0.9667
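The cross-validation above runs on `X_s`, which was standardized once using all rows. A variant, sketched here on synthetic data as an assumption-laden illustration, wraps the scaler and the classifier in a pipeline so that scaling is refit inside each fold, and adds an F2 scorer to match the metric emphasized earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tumor data: 300 samples, 9 features.
X_toy, y_toy = make_classification(n_samples=300, n_features=9, random_state=0)

# The scaler is fitted on the training part of every fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

scoring = {
    "accuracy": "accuracy",
    "recall": "recall",
    "f2": make_scorer(fbeta_score, beta=2),  # recall weighted twice as much as precision
}
cv_toy = cross_validate(pipe, X_toy, y_toy, cv=10, scoring=scoring)

for name in scoring:
    print(name, round(float(np.mean(cv_toy[f"test_{name}"])), 4))
```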
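Compared with the cross-validation results of the various classifiers in the MATLAB toolbox, our classifier ranks near the top, which is a pretty good result.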