In multi-class classification, choosing a suitable algorithm and tuning it well are central to model quality. This article works through a real multi-class dataset with the classic algorithms logistic regression (LR), random forest (RF), support vector machine (SVM), AdaBoost, Gaussian naive Bayes (GNB), and XGBoost, and adds a simple fully connected neural network and a convolutional neural network for comparison. The algorithms are compared on per-class recall, overall accuracy, and related metrics, providing a reference for similar tasks.
I. Data Preparation and Preprocessing
The dataset used in this experiment has already had basic preprocessing applied (missing values filled with column means) and is split into two Excel files, one for training and one for testing. Features and labels are separated as follows:
```python
import pandas as pd
from sklearn import metrics

# Load the data
train_data = pd.read_excel('训练集数据[平均值填充].xlsx')
train_data_x = train_data.iloc[:, 1:]  # feature columns (column 2 onward)
train_data_y = train_data.iloc[:, 0]   # label column (column 1)
test_data = pd.read_excel('测试集数据[平均值填充].xlsx')
test_data_x = test_data.iloc[:, 1:]
test_data_y = test_data.iloc[:, 0]

# Dictionary collecting the results of every algorithm
result_data = {}
```
The labels fall into four classes: 0, 1, 2, and 3. Evaluation below centers on the recall of each class and the overall accuracy.
II. Classic Algorithms: Implementation and Tuning
1. Logistic Regression (LR)
Logistic regression is a linear classifier, suited to problems whose decision boundary is close to linearly separable. For multi-class tasks, pay attention to the choice of regularization, solver, and multi-class strategy:
• Regularization (penalty): L1 regularization, paired with the liblinear solver (which supports L1 together with the one-vs-rest multi-class strategy);
• Hyperparameter tuning: a grid search selected C=100 (inverse regularization strength), max_iter=100, and multi_class='auto' as the best parameters, among others.
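The grid search mentioned above can be sketched as follows. This is an illustrative reconstruction, not the exact search from the experiment: the synthetic data, the grid values, and cv=5 are all assumptions.

```python
# Hypothetical sketch of the grid search behind the tuned LR parameters.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real 4-class training set.
X, y = make_classification(n_samples=400, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)

param_grid = {
    'C': [0.1, 1, 10, 100],   # inverse regularization strength
    'penalty': ['l1', 'l2'],  # both supported by liblinear
}
grid = GridSearchCV(LogisticRegression(solver='liblinear', max_iter=100),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

On the real data, `grid.best_estimator_` would then be evaluated on the held-out test set.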
Core implementation code:
```python
from sklearn.linear_model import LogisticRegression

LR_result = {}
# Instantiate the model with the tuned hyperparameters
lr = LogisticRegression(C=100, max_iter=100, multi_class='auto', penalty='l1', solver='liblinear')
lr.fit(train_data_x, train_data_y)
# Predict and evaluate
train_predicted = lr.predict(train_data_x)
print('LR train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = lr.predict(test_data_x)
print('LR test:\n', metrics.classification_report(test_data_y, test_predicted))
# Extract the key metrics (recall for classes 0-3 plus overall accuracy)
# by tokenizing the text report; the indices assume the standard
# classification_report layout for four classes
a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
LR_result['recall_0'] = float(b[6])
LR_result['recall_1'] = float(b[11])
LR_result['recall_2'] = float(b[16])
LR_result['recall_3'] = float(b[21])
LR_result['acc'] = float(b[25])
result_data['LR'] = LR_result
```
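Tokenizing the printed report, as above, is fragile: the indices break if the report layout changes. sklearn exposes the same numbers directly through `recall_score` and `accuracy_score`; a minimal stand-alone sketch, with `y_true` and `y_pred` standing in for `test_data_y` and `test_predicted`:

```python
# Less fragile way to collect the same metrics without parsing text.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_pred = np.array([0, 1, 2, 2, 0, 1, 3, 3])

# average=None returns one recall value per label, in label order.
per_class_recall = recall_score(y_true, y_pred, average=None, labels=[0, 1, 2, 3])
result = {f'recall_{c}': float(r) for c, r in zip([0, 1, 2, 3], per_class_recall)}
result['acc'] = accuracy_score(y_true, y_pred)
print(result)
```

The same pattern could replace the `split()`-index bookkeeping for every model in this article.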
2. Random Forest (RF)
Random forest is a tree-based ensemble that reduces overfitting risk by voting across many decision trees, making it a good fit for nonlinear data:
• Main tuning knobs: number of trees (n_estimators), tree depth (max_depth), feature sampling (max_features), etc.;
• Best parameters found: bootstrap=False (no bootstrap sampling), max_depth=20, max_features='log2', n_estimators=50, among others.
Core implementation code:
```python
from sklearn.ensemble import RandomForestClassifier

RF_result = {}
rf = RandomForestClassifier(bootstrap=False,
                            max_depth=20,
                            max_features='log2',
                            min_samples_leaf=1,
                            min_samples_split=2,
                            n_estimators=50,
                            random_state=487)
rf.fit(train_data_x, train_data_y)
# Predict and extract the metrics
train_predicted = rf.predict(train_data_x)
test_predicted = rf.predict(test_data_x)
print('RF train:\n', metrics.classification_report(train_data_y, train_predicted))
print('RF test:\n', metrics.classification_report(test_data_y, test_predicted))
rf_test_report = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = rf_test_report.split()
RF_result['recall_0'] = float(b[6])
RF_result['recall_1'] = float(b[11])
RF_result['recall_2'] = float(b[16])
RF_result['recall_3'] = float(b[21])
RF_result['acc'] = float(b[25])
result_data['RF'] = RF_result
```
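A fitted forest also reports which inputs it relies on via `feature_importances_`, which can guide the feature-engineering step mentioned later. A sketch on synthetic stand-in data (the real Excel columns are not reproduced here):

```python
# Inspecting feature importances of a forest with the tuned settings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           n_classes=4, random_state=487)
rf = RandomForestClassifier(n_estimators=50, max_depth=20,
                            max_features='log2', bootstrap=False,
                            random_state=487)
rf.fit(X, y)
# Importances sum to 1; the largest values mark the most-used features.
top = np.argsort(rf.feature_importances_)[::-1][:3]
print(top, rf.feature_importances_[top])
```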
3. Support Vector Machine (SVM)
An SVM maps the data into a higher-dimensional space through a kernel function to achieve nonlinear classification; for multi-class tasks the kernel type is the main thing to tune:
• Best parameters: polynomial kernel (poly), degree=4, gamma=1, C=1, etc.;
• Key setting: probability=True enables probability predictions, convenient for further evaluation later.
Core implementation code:
```python
from sklearn.svm import SVC

SVM_result = {}
svm = SVC(C=1, coef0=0.1, degree=4, gamma=1, kernel='poly', probability=True, random_state=100)
svm.fit(train_data_x, train_data_y)
test_predicted = svm.predict(test_data_x)
print('SVM test:\n', metrics.classification_report(test_data_y, test_predicted))
a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
SVM_result['recall_0'] = float(b[6])
SVM_result['recall_1'] = float(b[11])
SVM_result['recall_2'] = float(b[16])
SVM_result['recall_3'] = float(b[21])
SVM_result['acc'] = float(b[25])
result_data['SVM'] = SVM_result
```
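Because probability=True is set, the fitted model exposes `predict_proba`, calibrated internally via cross-validation. A minimal sketch on synthetic stand-in data:

```python
# Probability predictions from an SVC fitted with probability=True.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=13, n_informative=6,
                           n_classes=4, random_state=100)
svm = SVC(C=1, kernel='poly', degree=4, gamma=1, coef0=0.1,
          probability=True, random_state=100)
svm.fit(X, y)
# Each row is a probability distribution over the 4 classes.
proba = svm.predict_proba(X[:2])
print(proba.shape, proba.sum(axis=1))
```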
4. AdaBoost
A tree-based boosting ensemble that reweights samples so that successive weak learners focus on the hard-to-classify cases:
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

AdaBoost_result = {}
adf = AdaBoostClassifier(algorithm='SAMME',
                         estimator=DecisionTreeClassifier(max_depth=2),
                         n_estimators=200,
                         learning_rate=1.0,
                         random_state=0)
adf.fit(train_data_x, train_data_y)
train_predicted = adf.predict(train_data_x)  # predictions on the training set
print('AdaBoost train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = adf.predict(test_data_x)    # predictions on the test set
print('AdaBoost test:\n', metrics.classification_report(test_data_y, test_predicted))
a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
AdaBoost_result['recall_0'] = float(b[6])   # recall for class 0
AdaBoost_result['recall_1'] = float(b[11])  # recall for class 1
AdaBoost_result['recall_2'] = float(b[16])  # recall for class 2
AdaBoost_result['recall_3'] = float(b[21])  # recall for class 3
AdaBoost_result['acc'] = float(b[25])       # overall accuracy
result_data['AdaBoost'] = AdaBoost_result
```
5. Gaussian Naive Bayes (GNB)
A probabilistic classifier based on Bayes' theorem that assumes Gaussian-distributed features; simple to implement and fast:
```python
from sklearn.naive_bayes import GaussianNB

GNB_result = {}
gnb = GaussianNB()
gnb.fit(train_data_x, train_data_y)
train_predicted = gnb.predict(train_data_x)  # predictions on the training set
print('GNB train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = gnb.predict(test_data_x)    # predictions on the test set
print('GNB test:\n', metrics.classification_report(test_data_y, test_predicted))
a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
GNB_result['recall_0'] = float(b[6])   # recall for class 0
GNB_result['recall_1'] = float(b[11])  # recall for class 1
GNB_result['recall_2'] = float(b[16])  # recall for class 2
GNB_result['recall_3'] = float(b[21])  # recall for class 3
GNB_result['acc'] = float(b[25])       # overall accuracy
result_data['GNB'] = GNB_result
```
6. XGBoost
A heavily optimized gradient-boosted tree library that improves results through regularization and parallel computation; for multi-class tasks, set objective='multi:softmax':
```python
import xgboost as xgb

XGBoost_result = {}
xgb_model = xgb.XGBClassifier(learning_rate=0.05,
                              n_estimators=200,
                              max_depth=7,
                              min_child_weight=1,
                              gamma=0,
                              subsample=0.6,
                              colsample_bytree=0.8,
                              objective='multi:softmax',
                              seed=0)
xgb_model.fit(train_data_x, train_data_y)
train_predicted = xgb_model.predict(train_data_x)
print('XGBoost train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = xgb_model.predict(test_data_x)  # predictions on the test set
print('XGBoost test:\n', metrics.classification_report(test_data_y, test_predicted))
a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
XGBoost_result['recall_0'] = float(b[6])   # recall for class 0
XGBoost_result['recall_1'] = float(b[11])  # recall for class 1
XGBoost_result['recall_2'] = float(b[16])  # recall for class 2
XGBoost_result['recall_3'] = float(b[21])  # recall for class 3
XGBoost_result['acc'] = float(b[25])       # overall accuracy
result_data['XGBoost'] = XGBoost_result
```
7. Neural Network
Besides the traditional machine-learning algorithms, we also tried a simple fully connected neural network (implemented in PyTorch) to compare nonlinear fitting capacity:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Define the network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(13, 32)  # 13 input features, 32 hidden units
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, 4)   # 4 output classes

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Convert the DataFrames to tensors
X_train = torch.tensor(train_data_x.values, dtype=torch.float32)
Y_train = torch.tensor(train_data_y.values)
X_test = torch.tensor(test_data_x.values, dtype=torch.float32)
Y_test = torch.tensor(test_data_y.values)

# Instantiate the model, loss function, and optimizer
model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Evaluation helper
def evaluate_model(model, X_data, Y_data, train_or_test):
    size = len(X_data)
    with torch.no_grad():
        predictions = model(X_data)
        correct = (predictions.argmax(1) == Y_data).type(torch.float).sum().item()
        correct /= size
        loss = criterion(predictions, Y_data).item()
        print(f"{train_or_test}: \t Accuracy: {100 * correct}%")
    return correct

# Training loop
epochs = 15000
accs = []
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, Y_train)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}')
        train_acc = evaluate_model(model, X_train, Y_train, 'train')
        test_acc = evaluate_model(model, X_test, Y_test, 'test')
        accs.append(test_acc * 100)

# Store the best test accuracy seen during training
net_result = {}
net_result['acc'] = max(accs)
result_data['net'] = net_result
```
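Since the sklearn models above are scored with classification_report, the network can be evaluated on the same footing by converting its predictions to numpy (e.g. `model(X_test).argmax(1).numpy()`). A minimal stand-alone sketch, with toy arrays standing in for the real logits and labels:

```python
# Feeding a network's argmax predictions to classification_report.
import numpy as np
from sklearn.metrics import classification_report

y_true = np.array([0, 1, 2, 3, 0, 1])
logits = np.array([[4.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0],
                   [0.0, 0.0, 2.0, 0.0],
                   [0.0, 0.0, 0.0, 5.0],
                   [6.0, 0.0, 0.0, 0.0],
                   [2.0, 1.0, 0.0, 0.0]])  # last row: a deliberate mistake
y_pred = logits.argmax(axis=1)
print(classification_report(y_true, y_pred, digits=6))
```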
8. Convolutional Neural Network
Finally, a small 1-D CNN treats the 13 features as a one-channel sequence, stacks three convolutions, and pools over the feature axis before classification:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Define a 1-D convolutional network
class ConvNet(nn.Module):
    def __init__(self, num_features, hidden_size, num_classes):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = x.unsqueeze(1)  # add a channel dimension: (batch, 1, features)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = x.mean(dim=2)   # global average pooling over the feature axis
        x = self.fc(x)
        return x

hidden_size = 10
num_classes = 4
model = ConvNet(13, hidden_size, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 15000
accs = []
for epoch in range(num_epochs):
    outputs = model(X_train)            # forward pass
    loss = criterion(outputs, Y_train)
    optimizer.zero_grad()               # backward pass and update
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:          # report every 100 epochs
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')
        # Evaluate the model
        with torch.no_grad():
            predictions = model(X_train)
            predicted_classes = predictions.argmax(dim=1)
            accuracy = (predicted_classes == Y_train).float().mean()
            print(f'Train Accuracy: {accuracy.item() * 100:.2f}%')
            predictions = model(X_test)
            predicted_classes = predictions.argmax(dim=1)
            accuracy = (predicted_classes == Y_test).float().mean()
            print(f'Test Accuracy: {accuracy.item() * 100:.2f}%')
            accs.append(accuracy * 100)

cnn_result = {}
cnn_result['acc'] = max(accs).item()
result_data['cnn'] = cnn_result
print(result_data)
```
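As a sanity check on the ConvNet's forward pass, the tensor shapes can be traced layer by layer; the batch size of 8 below is an arbitrary assumption:

```python
# Shape walkthrough for the 1-D CNN: 13 features become one channel of
# length 13, convolutions widen the channels, pooling removes the length.
import torch
import torch.nn as nn

x = torch.randn(8, 13)                              # (batch, features)
x = x.unsqueeze(1)                                  # (8, 1, 13)
x = nn.Conv1d(1, 16, kernel_size=3, padding=1)(x)   # (8, 16, 13)
x = nn.Conv1d(16, 32, kernel_size=3, padding=1)(x)  # (8, 32, 13)
x = nn.Conv1d(32, 64, kernel_size=3, padding=1)(x)  # (8, 64, 13)
x = x.mean(dim=2)                                   # (8, 64)
print(x.shape)
```

With padding=1 and kernel_size=3, each convolution preserves the sequence length, so only the channel count changes until the pooling step.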
Finally, all results are serialized to JSON (a convenient format for saving and exchanging structured data, as CSV is for tables) so that different imputation strategies can be compared later:
```python
import json

# Wrap the results under a key naming the imputation strategy
result = {}
result['mean fill'] = result_data
# Open with 'w' so an existing file is overwritten
with open(r'temp_data/平均值填充result.json', 'w', encoding='utf-8') as file:
    # json.dump() converts the dictionary to JSON and writes it out
    json.dump(result, file, ensure_ascii=False, indent=4)
```
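The saved file can later be read back with `json.load`. A round-trip sketch; the temporary directory and the toy dictionary stand in for the real path and results:

```python
# JSON round-trip: what is dumped can be loaded back unchanged.
import json
import os
import tempfile

result = {'mean fill': {'LR': {'acc': 0.9, 'recall_0': 0.8}}}
path = os.path.join(tempfile.mkdtemp(), 'result.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(result, f, ensure_ascii=False, indent=4)
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded == result)  # → True
```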
III. Comparing and Analyzing the Algorithms
1. Core metrics
For every algorithm we store:
• Per-class: recall for classes 0/1/2/3, reflecting how well each class is recognized;
• Overall: test-set accuracy, reflecting overall classification quality.
2. General takeaways (based on experience with similar tasks)
• Linear problems: LR is stable when features relate linearly to the label, but limited on nonlinear data;
• Nonlinear problems: RF and XGBoost usually do better; XGBoost tends to reach higher accuracy on moderately sized data, while RF trains faster and is more robust;
• Small samples: SVM (poly kernel) and GNB have a clear edge, with low training cost and little tendency to overfit;
• Ensembles: AdaBoost lifts weak learners substantially but is sensitive to noisy data;
• Neural networks: a simple fully connected network needs enough epochs and tuning; it has more headroom on complex features, but with the simple architecture used here its advantages did not fully show.
IV. Summary and Extensions
This experiment implemented the classic machine-learning algorithms for a multi-class task end to end, from code to metric extraction. The core takeaways:
- Match the algorithm to the data: prefer LR for linear data, RF/XGBoost for nonlinear data, and SVM/GNB for small samples;
- Hyperparameter tuning is key: grid search (GridSearchCV) is the general-purpose method, but watch parameter compatibility (e.g. LR's penalty must be supported by its solver);
- Evaluate comprehensively: beyond accuracy, per-class recall shows how well each class is recognized, which suits the deeper evaluation that multi-class tasks need.
Directions for future work:
• Feature engineering: add feature selection and normalization/standardization to improve input quality;
• Model fusion: combine the algorithms' outputs by voting or stacking;
• Neural-network improvements: add Dropout and batch normalization, or try CNN/LSTM architectures suited to sequence-like features.
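The stacking idea above can be sketched with sklearn's StackingClassifier. The choice of base learners, the meta-model, and the synthetic data below are illustrative assumptions, not results from this experiment:

```python
# Stacking two of the base learners under a logistic-regression meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           n_classes=4, random_state=0)
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
                ('gnb', GaussianNB())],
    final_estimator=LogisticRegression(max_iter=200))
stack.fit(X, y)
print(stack.score(X, y))
```

Internally, StackingClassifier trains the meta-model on cross-validated predictions of the base learners, so the fusion does not simply memorize the training set.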