Contents
1. Logistic Regression (LR)
2. Random Forest (RF)
3. Gaussian Naive Bayes (GNB)
4. Support Vector Machine (SVM)
5. AdaBoost
6. XGBoost
I. Stage Transition and Development Goals
In "Mineral Classification System Development Notes (I)" we completed the collection, cleaning, and preprocessing of the mineral dataset, analyzed its missing values, and generated a standardized, training-ready dataset by deleting the rows with missing values. This stage continues that workflow and, building on the preprocessed data, pursues the following goals:
- Train mineral classification models with six classic machine learning algorithms
- Optimize model hyperparameters via grid search to improve classification performance
- Build a unified evaluation framework to compare the models on the test set
- Record and analyze the experimental results to inform later model selection for the system
II. Data Preparation
Data source: the training set (训练数据集[删除空缺行].xlsx) and test set (测试数据集[删除空缺行].xlsx) generated in the preprocessing stage
Data split:
- Features (X): all attribute columns of each sample (every column except the final label column)
- Labels (y):
  - Training labels: three classes, 0, 1, and 3 (every training sample with label 2 had missing values and was removed together with the other incomplete rows during preprocessing)
  - Test labels: four classes, 0, 1, 2, and 3 (the complete label-2 samples were kept to probe how the models generalize to a class unseen in training)
Special handling:
XGBoost requires consecutive integer class labels, so we build the mapping {0:0, 1:1, 3:2} to re-encode the original labels, and after prediction apply the reverse mapping {0:0, 1:1, 2:3} to restore them. The label 2 that appears only in the test set is handled separately: any predicted code not covered by the reverse mapping is judged as 2.
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import json

# Load the datasets produced in the preprocessing stage
train_data = pd.read_excel('../temp_data/训练数据集[删除空缺行].xlsx')
test_data = pd.read_excel('../temp_data/测试数据集[删除空缺行].xlsx')

# Split features and labels (the label is the last column)
train_X = train_data.iloc[:, :-1]
train_y = train_data.iloc[:, -1]  # training labels: 0, 1, 3
test_X = test_data.iloc[:, :-1]
test_y = test_data.iloc[:, -1]    # test labels: 0, 1, 2, 3

# Label mapping for XGBoost (requires consecutive integer classes)
label_mapping = {0: 0, 1: 1, 3: 2}
reverse_mapping = {v: k for k, v in label_mapping.items()}
train_y_xgb = train_y.map(label_mapping)  # re-encode as 0, 1, 2
test_y_xgb = test_y.map(label_mapping)    # label 2 has no code and becomes NaN (not used below)

# Container for all models' evaluation results
result_data = {}
```
III. Model Selection and Training
Six classic classification algorithms are compared. Each model's hyperparameters are tuned with grid search (GridSearchCV) using 5-fold cross-validation to determine the best configuration:
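All six searches below repeat the same GridSearchCV pattern; as a minimal sketch, the shared routine could be factored out as follows (the `run_grid_search` helper is our own illustration, not part of the original scripts, which inline the equivalent calls per model):

```python
# Illustrative helper only: each model section below inlines the equivalent
# calls directly rather than using this function.
def run_grid_search(estimator, param_grid, X, y, cv=5):
    """Run a k-fold grid search and return the refit best estimator."""
    grid_search = GridSearchCV(estimator, param_grid, cv=cv, n_jobs=-1)
    grid_search.fit(X, y)
    print("Best params:", grid_search.best_params_)
    return grid_search.best_estimator_
```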
1. Logistic Regression (LR)
Core parameters: C=0.001, max_iter=100, multi_class='ovr', penalty='l1', solver='liblinear'
Characteristics: L1 (Lasso) regularization suits feature selection on high-dimensional data; the one-vs-rest (ovr) strategy handles the multi-class problem
```python
# Grid search (enable when tuning)
# logreg = LogisticRegression()
# param_grid = [
#     {'penalty': ['l1'], 'solver': ['liblinear'], 'C': [0.001, 0.01, 0.1], 'multi_class': ['ovr']},
#     {'penalty': ['l2'], 'solver': ['lbfgs'], 'C': [0.001, 0.01, 0.1], 'multi_class': ['multinomial']}
# ]
# grid_search = GridSearchCV(logreg, param_grid, cv=5)
# grid_search.fit(train_X, train_y)
# print("LR best params:", grid_search.best_params_)

# Train with the best parameters found
LR_result = {}
lr = LogisticRegression(C=0.001, max_iter=100, multi_class='ovr',
                        penalty='l1', solver='liblinear')
lr.fit(train_X, train_y)

# Evaluate
train_pred = lr.predict(train_X)
test_pred = lr.predict(test_X)
print("LR train-set report:\n", metrics.classification_report(train_y, train_pred))
print("LR test-set report:\n", metrics.classification_report(test_y, test_pred))

# Extract per-class recall and overall accuracy from the report dict
# (more robust than parsing the text report by token position)
report = metrics.classification_report(test_y, test_pred, output_dict=True)
LR_result['recall_0'] = report['0']['recall']
LR_result['recall_1'] = report['1']['recall']
LR_result['recall_2'] = report['2']['recall']
LR_result['recall_3'] = report['3']['recall']
LR_result['acc'] = report['accuracy']
result_data['LR'] = LR_result
```
2. Random Forest (RF)
Core parameters: bootstrap=True, criterion='gini', max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200
Characteristics: an ensemble of decision trees that reduces overfitting risk; Gini impurity as the split criterion; trees grown to full depth
```python
# Grid search (enable when tuning)
# rf = RandomForestClassifier(random_state=42)
# param_grid = {
#     'n_estimators': [100, 200],
#     'max_depth': [None, 20],
#     'min_samples_split': [2, 5],
#     'bootstrap': [True]
# }
# grid_search = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)
# grid_search.fit(train_X, train_y)
# print("RF best params:", grid_search.best_params_)

# Train with the best parameters found
RF_result = {}
rf = RandomForestClassifier(
    bootstrap=True, criterion='gini', max_depth=None,
    min_samples_leaf=1, min_samples_split=2, n_estimators=200,
    random_state=42
)
rf.fit(train_X, train_y)

# Evaluate
train_pred = rf.predict(train_X)
test_pred = rf.predict(test_X)
print("RF train-set report:\n", metrics.classification_report(train_y, train_pred))
print("RF test-set report:\n", metrics.classification_report(test_y, test_pred))

# Extract per-class recall and overall accuracy
report = metrics.classification_report(test_y, test_pred, output_dict=True)
RF_result['recall_0'] = report['0']['recall']
RF_result['recall_1'] = report['1']['recall']
RF_result['recall_2'] = report['2']['recall']
RF_result['recall_3'] = report['3']['recall']
RF_result['acc'] = report['accuracy']
result_data['RF'] = RF_result
```
3. Gaussian Naive Bayes (GNB)
Core parameters: var_smoothing=1e-06
Characteristics: a probabilistic model based on Bayes' theorem; the var_smoothing parameter improves numerical stability
```python
# Grid search (enable when tuning)
# gnb = GaussianNB()
# param_grid = {'var_smoothing': [1e-9, 1e-6, 1e-3]}
# grid_search = GridSearchCV(gnb, param_grid, cv=5)
# grid_search.fit(train_X, train_y)
# print("GNB best params:", grid_search.best_params_)

# Train with the best parameters found
GNB_result = {}
gnb = GaussianNB(var_smoothing=1e-06)
gnb.fit(train_X, train_y)

# Evaluate
train_pred = gnb.predict(train_X)
test_pred = gnb.predict(test_X)
print("GNB train-set report:\n", metrics.classification_report(train_y, train_pred))
print("GNB test-set report:\n", metrics.classification_report(test_y, test_pred))

# Extract per-class recall and overall accuracy
report = metrics.classification_report(test_y, test_pred, output_dict=True)
GNB_result['recall_0'] = report['0']['recall']
GNB_result['recall_1'] = report['1']['recall']
GNB_result['recall_2'] = report['2']['recall']
GNB_result['recall_3'] = report['3']['recall']
GNB_result['acc'] = report['accuracy']
result_data['GNB'] = GNB_result
```
4. Support Vector Machine (SVM)
Core parameters: C=10, gamma=1, kernel='rbf', max_iter=1000
Characteristics: the RBF kernel captures non-linear relationships; the relatively large C penalizes misclassification more heavily
```python
# Grid search (enable when tuning)
# svm = SVC(random_state=42)
# param_grid = {
#     'kernel': ['rbf'],
#     'C': [1, 10],
#     'gamma': [0.1, 1],
#     'max_iter': [1000]
# }
# grid_search = GridSearchCV(svm, param_grid, cv=5, n_jobs=-1)
# grid_search.fit(train_X, train_y)
# print("SVM best params:", grid_search.best_params_)

# Train with the best parameters found
SVM_result = {}
svm = SVC(C=10, gamma=1, kernel='rbf', max_iter=1000, random_state=42)
svm.fit(train_X, train_y)

# Evaluate
train_pred = svm.predict(train_X)
test_pred = svm.predict(test_X)
print("SVM train-set report:\n", metrics.classification_report(train_y, train_pred))
print("SVM test-set report:\n", metrics.classification_report(test_y, test_pred))

# Extract per-class recall and overall accuracy
report = metrics.classification_report(test_y, test_pred, output_dict=True)
SVM_result['recall_0'] = report['0']['recall']
SVM_result['recall_1'] = report['1']['recall']
SVM_result['recall_2'] = report['2']['recall']
SVM_result['recall_3'] = report['3']['recall']
SVM_result['acc'] = report['accuracy']
result_data['SVM'] = SVM_result
```
5. AdaBoost
Core parameters: algorithm='SAMME', learning_rate=0.5, n_estimators=50
Characteristics: ensembles weak classifiers via the SAMME algorithm; a learning rate of 0.5 moderates each boosting step, with 50 base estimators
```python
# Grid search (enable when tuning)
# ada = AdaBoostClassifier(random_state=42)
# param_grid = {
#     'n_estimators': [50, 100],
#     'learning_rate': [0.5, 1.0],
#     'algorithm': ['SAMME']
# }
# grid_search = GridSearchCV(ada, param_grid, cv=5, n_jobs=-1)
# grid_search.fit(train_X, train_y)
# print("AdaBoost best params:", grid_search.best_params_)

# Train with the best parameters found
Ada_result = {}
ada = AdaBoostClassifier(
    algorithm='SAMME', learning_rate=0.5, n_estimators=50, random_state=42
)
ada.fit(train_X, train_y)

# Evaluate
train_pred = ada.predict(train_X)
test_pred = ada.predict(test_X)
print("AdaBoost train-set report:\n", metrics.classification_report(train_y, train_pred))
print("AdaBoost test-set report:\n", metrics.classification_report(test_y, test_pred))

# Extract per-class recall and overall accuracy
report = metrics.classification_report(test_y, test_pred, output_dict=True)
Ada_result['recall_0'] = report['0']['recall']
Ada_result['recall_1'] = report['1']['recall']
Ada_result['recall_2'] = report['2']['recall']
Ada_result['recall_3'] = report['3']['recall']
Ada_result['acc'] = report['accuracy']
result_data['AdaBoost'] = Ada_result
```
6. XGBoost
Core parameters: colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_depth=3, n_estimators=200
Characteristics: a tree-based ensemble; column subsampling (80%) guards against overfitting, and the depth-3 trees keep model complexity in check
```python
# Grid search (enable when tuning)
# xgb = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss', num_class=3)
# param_grid = {
#     'n_estimators': [100, 200],
#     'max_depth': [3, 5],
#     'learning_rate': [0.1],
#     'colsample_bytree': [0.8]
# }
# grid_search = GridSearchCV(xgb, param_grid, cv=5, n_jobs=-1)
# grid_search.fit(train_X, train_y_xgb)  # note: fit on the re-encoded labels
# print("XGBoost best params:", grid_search.best_params_)

# Train with the best parameters found (on the re-encoded labels)
XGB_result = {}
xgb_best = XGBClassifier(
    colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_depth=3,
    n_estimators=200, reg_alpha=0, reg_lambda=0, subsample=0.8,
    random_state=42, use_label_encoder=False, eval_metric='mlogloss', num_class=3
)
xgb_best.fit(train_X, train_y_xgb)

# Evaluate, restoring the original labels via the reverse mapping
train_pred_encoded = xgb_best.predict(train_X)
train_pred = [reverse_mapping[code] for code in train_pred_encoded]
test_pred_encoded = xgb_best.predict(test_X)
# Any code outside the mapping is judged as the test-only label 2; with
# num_class=3 every code 0-2 is mapped, so label 2 is in fact never predicted
test_pred = [reverse_mapping[code] if code in reverse_mapping else 2
             for code in test_pred_encoded]
print("XGBoost train-set report:\n", metrics.classification_report(train_y, train_pred))
print("XGBoost test-set report:\n", metrics.classification_report(test_y, test_pred))

# Extract per-class recall and overall accuracy
report = metrics.classification_report(test_y, test_pred, output_dict=True)
XGB_result['recall_0'] = report['0']['recall']
XGB_result['recall_1'] = report['1']['recall']
XGB_result['recall_2'] = report['2']['recall']
XGB_result['recall_3'] = report['3']['recall']
XGB_result['acc'] = report['accuracy']
result_data['XGBoost'] = XGB_result

# Save all results
with open('../temp_data/结果数据[删除空缺行].json', 'w', encoding='utf-8') as f:
    json.dump(result_data, f, ensure_ascii=False, indent=4)
```
IV. Model Evaluation and Results Analysis
Evaluation metrics
- Per-class recall: recorded separately for classes 0, 1, 2, and 3
- Overall accuracy (acc): the model's overall classification accuracy on the test set
Evaluation results
| Model | Recall (0) | Recall (1) | Recall (2) | Recall (3) | Accuracy (acc) |
|---|---|---|---|---|---|
| LR | 1.0 | 0.0 | 0.0 | 0.0 | 0.6 |
| RF | 0.933333 | 0.333333 | 0.0 | 1.0 | 0.68 |
| GNB | 0.733333 | 0.333333 | 0.0 | 1.0 | 0.56 |
| SVM | 0.866667 | 0.0 | 0.0 | 1.0 | 0.56 |
| AdaBoost | 0.8 | 0.666667 | 0.0 | 1.0 | 0.68 |
| XGBoost | 0.933333 | 0.166667 | 0.0 | 1.0 | 0.64 |
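The table above can be regenerated directly from the JSON file saved at the end of the training script; a minimal pandas sketch:

```python
import json
import pandas as pd

# Load the per-model results saved by the training script above
with open('../temp_data/结果数据[删除空缺行].json', encoding='utf-8') as f:
    results = json.load(f)

# One row per model; columns are recall_0..recall_3 and acc
df = pd.DataFrame(results).T
print(df.round(6))
```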
Results analysis
- Overall: Random Forest (RF) and AdaBoost perform best, both reaching an accuracy of 0.68; Logistic Regression (LR) and SVM fail to identify label 1 entirely (recall 0)
- Per class:
  - Label 0: LR identifies it best (recall 1.0), followed by XGBoost and RF (0.933333)
  - Label 1: AdaBoost performs best (recall 0.666667), markedly higher than the other models
  - Label 2: every model has recall 0, chiefly because the training set contains no samples of this class (they were deleted with the missing-value rows), so the models could not learn its features
  - Label 3: RF, GNB, SVM, AdaBoost, and XGBoost all identify it perfectly (recall 1.0), suggesting this class is well separated from the others
- Generalization: with label 2 absent from the training set, no model can recognize it, underscoring how critical training-data completeness is to generalization
V. Development Summary
- Trained and tuned six classification models, verifying how well different algorithms suit the mineral classification task
- Solved XGBoost's non-consecutive-label issue with a label-mapping scheme, keeping the evaluation protocol consistent across models
- Clarified the impact of missing training data on model performance: every model fails on label 2 because its training samples are absent
- Identified RF and AdaBoost as the best-performing models at this stage, providing a basis for subsequent system-level model selection
Next, we will repeat this model-training pipeline on the datasets produced by the other five preprocessing strategies (mean imputation, median imputation, mode imputation, linear-regression imputation, and random-forest imputation), as sketched below.
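As a forward-looking sketch, the follow-up runs could simply loop over those datasets; note that the file names below mirror this stage's naming pattern but are assumptions, not confirmed paths:

```python
# Hypothetical driver for the follow-up experiments; the imputation-strategy
# file names follow this stage's pattern and are assumed, not confirmed.
strategies = ['平均值填充', '中位数填充', '众数填充', '线性回归填充', '随机森林填充']
for strategy in strategies:
    train_data = pd.read_excel(f'../temp_data/训练数据集[{strategy}].xlsx')
    test_data = pd.read_excel(f'../temp_data/测试数据集[{strategy}].xlsx')
    # ...repeat the same feature/label split, training, and evaluation as above,
    # saving each run to a per-strategy results JSON for comparison
```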