[Machine Learning] Task 5: Classification on the Wine and Iris Datasets

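Bagging: resamples the training data (bootstrap sampling) to build multiple independent models, then combines their predictions by averaging (regression) or voting (classification). Random forest is the typical example.
Boosting: trains multiple models sequentially, with each model correcting the errors of the models before it. Common algorithms include GBDT, XGBoost, and LightGBM.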

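(1) Random forest

Random forest is a concrete implementation of Bagging. It builds many decision trees, each on a random sample of the rows and a random subset of the features, and predicts by voting (classification) or averaging (regression). Combining many decorrelated trees reduces overfitting relative to a single tree and makes the predictions more stable.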
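The bagging idea can be sketched by hand. The following is a minimal illustration rather than scikit-learn's internal implementation; the choice of the wine dataset, 25 trees, and `max_features="sqrt"` for per-split feature sampling are assumptions made here for demonstration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative value, not tuned
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote across trees for every test sample
all_votes = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
majority = np.array([np.bincount(col, minlength=3).argmax() for col in all_votes.T])
bagged_accuracy = float(np.mean(majority == y_test))
print("Hand-rolled bagging accuracy:", bagged_accuracy)
```

In practice `RandomForestClassifier` does all of this internally; the loop only makes the two sources of randomness (rows and features) visible.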

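(2) Gradient Boosted Decision Trees (GBDT)

GBDT is a Boosting method that builds trees one at a time. Each new tree is fitted to the residuals of the current ensemble (more generally, to the negative gradient of the loss), so adding it reduces the remaining error. GBDT handles nonlinear relationships and feature interactions well.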
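To make the "fit the residuals" step concrete, here is a minimal sketch of gradient boosting for squared-error regression. The synthetic sine data, the tree depth, the learning rate, and the number of rounds are all assumptions chosen for illustration; real GBDT libraries add many refinements on top of this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [float(np.mean((y - prediction) ** 2))]

for _ in range(50):
    # For squared error, the negative gradient is simply the residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a shrunken copy of the new tree to the ensemble
    prediction += learning_rate * tree.predict(X)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The training error shrinks with every round because each tree is fitted to whatever error the previous trees left behind.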

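(3) XGBoost

XGBoost is an efficient implementation of GBDT. It speeds up training through parallel computation, adds regularization terms to the objective to reduce overfitting, and handles missing values natively.

(4) LightGBM

LightGBM is an efficient GBDT implementation developed by Microsoft. It uses a histogram-based decision tree algorithm that significantly improves training speed and memory efficiency, which makes it suitable for large datasets and high-dimensional features.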

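1.2 Parameter optimization

Parameter (hyperparameter) optimization is a key step in improving model performance. Common methods include:

(1) Grid search

Grid search is an exhaustive method: it lists every combination of values in a predefined parameter space, then trains and evaluates a model for each one. It is computationally expensive but simple to understand, and it suits small parameter spaces.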
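The cost of grid search is easy to quantify by enumerating the grid. The sketch below uses only the standard library; the grid values match the XGBoost grid used later in this post, and the 3-fold figure assumes `cv=3`.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}

# Cartesian product of all value lists, one dict per combination
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3  # assumed number of cross-validation folds
print("Combinations:", len(combos))                   # 3 * 3 * 3 = 27
print("Model fits with 3-fold CV:", len(combos) * n_folds)  # 81
print("First combination:", combos[0])
```

Adding one more parameter with three values would triple the number of fits, which is why grid search stops being practical as the space grows.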

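(2) Random search

Random search trains on randomly sampled parameter combinations. With a fixed budget of trials it explores high-dimensional spaces more efficiently than grid search and often finds good parameters in less time.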
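A minimal sketch of random search with scikit-learn's `RandomizedSearchCV` follows. The random forest estimator, the sampling ranges, and the budget of 8 trials are assumptions chosen to keep the example fast; they are not taken from the experiments in this post.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_wine(return_X_y=True)

# Distributions to sample from (randint upper bound is exclusive)
param_distributions = {
    "n_estimators": randint(50, 201),
    "max_depth": randint(2, 11),
    "min_samples_split": randint(2, 6),
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,            # number of sampled combinations (the search budget)
    scoring="accuracy",
    cv=3,
    random_state=42,
)
search.fit(X, y)

print("Sampled combinations:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

The budget is controlled by `n_iter` alone, so widening the ranges does not increase the running time.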

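(3) Bayesian optimization

Bayesian optimization builds a probabilistic model of the objective function and updates it after each trial to decide which parameters to try next. It suits models that are expensive to train because it reduces the number of evaluations.

(4) Genetic algorithms

Genetic algorithms imitate natural selection: selection, crossover, and mutation operations drive a global search over the parameter space. They suit complex optimization problems.

(5) Hyperparameter optimization tools

Tools such as Optuna, Hyperopt, and Ray Tune integrate several optimization algorithms, automate the parameter search, and provide visualization and analysis features.

1.3 Model interpretation

Understanding how a model reaches its decisions is essential for transparency and interpretability. Common interpretation methods include:

(1) Feature importance analysis

Feature importance analysis measures how much each feature affects the predictions and identifies the features that drive the model's decisions. Random forest and XGBoost both provide tree-based feature importance scores.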
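As a hedged illustration, the sketch below compares the impurity-based importances built into a random forest with permutation importance computed on held-out data. Permutation importance is an additional technique brought in here for comparison; the dataset, split, and `n_repeats=10` are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importance: computed from the training process, sums to 1
impurity_importance = model.feature_importances_

# Permutation importance: accuracy drop when one feature is shuffled
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

ranked = sorted(
    zip(wine.feature_names, impurity_importance, perm.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
for name, impurity, permuted in ranked[:5]:
    print(f"{name:30s} impurity={impurity:.3f} permutation={permuted:.3f}")
```

The two rankings need not agree: impurity-based scores reflect the training data, while permutation scores reflect the effect on held-out accuracy.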

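(2) SHAP

SHAP values come from cooperative game theory and explain a model by computing each feature's contribution to a prediction. They quantify the effect of every feature on a single prediction and can be aggregated into a global view of model behavior.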
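The game-theoretic idea can be shown without the `shap` library by computing exact Shapley values for a tiny hand-written model. Everything in this sketch is hypothetical: the model `f`, the instance `x`, and the all-zero baseline are invented for illustration, and the brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi

# Hypothetical model with one interaction term
def f(x0, x1, x2):
    return 2 * x0 + x1 + x0 * x2

x = (1.0, 2.0, 3.0)        # instance to explain
baseline = (0.0, 0.0, 0.0)  # "feature absent" reference values

def value(coalition):
    # Features in the coalition take the instance value, the rest the baseline
    args = [x[i] if i in coalition else baseline[i] for i in range(3)]
    return f(*args)

phi = shapley_values(value, 3)
print("Shapley values:", phi)                      # [3.5, 2.0, 1.5]
print("Sum:", sum(phi), "= f(x) - f(baseline) =", f(*x) - f(*baseline))
```

The interaction term `x0 * x2` contributes 3 to the prediction and is split evenly between features 0 and 2, and the values sum to the difference between the prediction and the baseline output.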

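(3) Partial dependence plots (PDP)

A PDP visualizes how one or more features affect the prediction on average, which reveals the relationship between those features and the model output.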
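A partial dependence curve can be computed by hand: fix one feature at a grid value for every sample, predict, and average. The sketch below does this for a random forest on the iris dataset; the choice of model, feature (petal length), class (setosa), and a 10-point grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

feature = 2  # column index of "petal length (cm)"
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 10)

pd_curve = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature] = value  # force the feature to this value for all rows
    # Average predicted probability of class 0 (setosa) over the dataset
    pd_curve.append(float(model.predict_proba(X_mod)[:, 0].mean()))

for value, prob in zip(grid, pd_curve):
    print(f"petal length = {value:.2f} -> mean P(setosa) = {prob:.3f}")
```

Plotting `grid` against `pd_curve` gives the one-dimensional PDP for that feature and class.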

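(4) Adversarial example analysis

Generating adversarial examples and observing how the predictions change makes it possible to assess the model's fragility and probe its decision boundary.

(5) LIME

LIME is a model-agnostic explanation method. It fits a local linear model around a single instance to approximate the complex model's behavior there, which makes individual predictions interpretable.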

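2. Wine dataset classification task

2.1 Import the required libraries and load the dataset

(1) Goal:

Import the required libraries (pandas, numpy, matplotlib, seaborn, sklearn, xgboost, lightgbm, and so on).
Load the wine dataset provided by scikit-learn.

(2) Code: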

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
import xgboost as xgb
import lightgbm as lgb

# Load the wine dataset
wine = load_wine()

# Extract the feature matrix (X) and the target variable (y)
X_wine, y_wine = wine.data, wine.target

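(3) Explanation:

Importing libraries: import the libraries needed for data handling, visualization, machine learning, and model evaluation.
- pandas is used for data handling, numpy for numerical computation, and matplotlib and seaborn for plotting.
- xgboost and lightgbm are two widely used gradient boosting libraries; both provide classifiers.
Loading the wine dataset: load_wine() from scikit-learn loads the wine dataset, which contains 178 samples with 13 features in 3 classes.
Extracting the feature matrix and target variable: X_wine holds the feature data and y_wine holds the class labels.

2.2 Explore and visualize the data

(1) Goal:

Inspect the basic information and summary statistics of the dataset and confirm that it has no missing values or anomalies.
Visualize the data with seaborn to understand the relationship between the features and the target label.

(2) Code: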

# Convert the data to a DataFrame
df_wine = pd.DataFrame(X_wine, columns=wine.feature_names)
df_wine['target'] = y_wine

# Inspect the dataset
print(df_wine.info())
print(df_wine.describe())

# Visualize the relationship between the wine features and the target
sns.pairplot(df_wine, hue="target", markers=["o", "s", "D"])
plt.show()

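(3) Explanation:

Converting to a DataFrame: turn the numpy array into a pandas DataFrame so that the data is easier to inspect and manipulate.
Exploring the dataset: info() and describe() show the basic information, including data types, missing values, and summary statistics.
- info() shows the structure of the dataset, including the non-null count of each column, which confirms that no values are missing.
- describe() reports descriptive statistics for each feature, such as the mean, standard deviation, minimum, maximum, and quartiles.
Visualizing the data: seaborn's pairplot() draws pairwise scatter plots of the features, which shows how they relate to the target variable (the wine class).
- hue="target" colors the points by wine class.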
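2.3 Split the dataset into training and test sets

(1) Goal:

Split the wine dataset into a training set and a test set so that the model can be trained and then validated.
Use an 80:20 split: 80% of the data for training and 20% for testing.

(2) Code: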

# Split the wine dataset into training and test sets
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine, y_wine, test_size=0.2, random_state=42)

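(3) Explanation:

train_test_split(): splits the dataset into a training set and a test set. Here 80% of the data is used to train the model and 20% to test it.
random_state=42: fixes the random seed so that the split is identical on every run, which makes the results reproducible.
X_train_wine and y_train_wine are the features and labels of the wine training set; X_test_wine and y_test_wine are those of the test set.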
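With only 36 test samples, a purely random split can shift the class proportions. As an optional variation (not used in the code above), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both parts. The sketch below compares the test-set class counts of the two approaches.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Plain random split, as used in this section
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified split: class proportions in the test set follow the full dataset
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)
expected = np.bincount(y) * len(y_test_strat) / len(y)

print("Test-set class counts (plain):     ", plain_counts)
print("Test-set class counts (stratified):", strat_counts)
print("Proportional target:               ", np.round(expected, 2))
```

Whether to stratify is a design choice; it mainly matters for small or imbalanced datasets such as this one.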

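2.4 Tune the parameters of the `XGBoost` model

(1) Goal:

Use grid search (GridSearchCV) to tune the parameters of the XGBoost classifier and find the combination that gives the best performance.

(2) Code: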

# Define the XGBoost model and the parameter grid
xgb_model_wine = xgb.XGBClassifier(random_state=42)
xgb_param_grid_wine = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3]
}

# Run the grid search with cross-validation
xgb_grid_search_wine = GridSearchCV(estimator=xgb_model_wine, param_grid=xgb_param_grid_wine, scoring='accuracy', cv=3)
xgb_grid_search_wine.fit(X_train_wine, y_train_wine)

# Print the best parameters
print("XGBoost best parameters (wine dataset):", xgb_grid_search_wine.best_params_)

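(3) Explanation:

Defining the model: xgb.XGBClassifier is the XGBoost classifier, used here for the wine classification task.
Parameter tuning:
- Grid search (GridSearchCV) tunes n_estimators (number of trees), max_depth (tree depth), and learning_rate.
- cv=3 means 3-fold cross-validation, which is used to score each combination and select the best one.
Printing the best parameters: best_params_ holds the best parameter combination found for the XGBoost model.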
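Besides best_params_, a fitted GridSearchCV exposes the score of every combination in `cv_results_`. The sketch below shows one way to inspect it. To keep the example fast and free of extra dependencies it swaps in scikit-learn's `GradientBoostingClassifier` and a smaller grid; both are stand-ins assumed here, not the XGBoost setup from this section.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid (2 * 2 * 2 = 8 combinations), chosen only for speed
small_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# One row per parameter combination
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").head())
```

Looking at the full table shows how close the runners-up are; when several combinations score almost the same, the "best" one is partly a matter of chance.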

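2.5 Predict on the test set and evaluate the model

(1) Goal:

Use the best XGBoost model found by the grid search to predict on the test set, and evaluate its classification performance.

(2) Code: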

# Predict on the test set with the best XGBoost model
xgb_y_pred_wine = xgb_grid_search_wine.best_estimator_.predict(X_test_wine)

# Evaluate the model (accuracy and F1 score)
xgb_accuracy_wine = accuracy_score(y_test_wine, xgb_y_pred_wine)
xgb_f1_wine = f1_score(y_test_wine, xgb_y_pred_wine, average='weighted')

# Print the results
print(f"XGBoost classifier - wine dataset: accuracy = {xgb_accuracy_wine}, F1 score = {xgb_f1_wine}")

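(3) Explanation:

Prediction: best_estimator_ returns the best XGBoost model found by the grid search, and predict() generates predictions for the test set.
Performance evaluation:
- accuracy_score() computes the accuracy, i.e., the fraction of samples that are predicted correctly.
- f1_score() computes the F1 score, which balances precision and recall; average='weighted' averages the per-class F1 scores weighted by the number of samples in each class.
Printing the results: print the accuracy and F1 score of the XGBoost model on the wine dataset.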

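2.6 Visualize model performance

(1) Goal:

Draw a bar chart of the accuracy and F1 score of the XGBoost model on the wine dataset to show its performance at a glance.

(2) Code: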

# Draw a bar chart of the XGBoost model's performance
performance_metrics = ['Accuracy', 'F1 Score']
xgb_performance_wine = [xgb_accuracy_wine, xgb_f1_wine]

plt.figure(figsize=(8, 6))
plt.bar(performance_metrics, xgb_performance_wine, color='blue')
plt.title('XGBoost performance on the wine dataset')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.show()

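(3) Explanation:

Plotting: matplotlib's bar() function draws a bar chart of the XGBoost model's accuracy and F1 score.
Visual summary: the bar chart shows the classification performance at a glance and makes the model's strengths and weaknesses easier to judge.
- ylim(0, 1) sets the Y-axis range to 0 through 1.

3. Iris dataset classification task

3.1 Import the required libraries and load the dataset

(1) Goal:

Import the required libraries (pandas, numpy, matplotlib, seaborn, sklearn, xgboost, lightgbm, and so on).
Load the iris dataset provided by scikit-learn.

(2) Code: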

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
import xgboost as xgb
import lightgbm as lgb

# Load the iris dataset
iris = load_iris()

# Extract the feature matrix (X) and the target variable (y)
X_iris, y_iris = iris.data, iris.target

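(3) Explanation:

Importing libraries: import the libraries needed for data handling, visualization, machine learning, and model evaluation.
- pandas is used for data handling, numpy for numerical computation, and matplotlib and seaborn for plotting.
- xgboost and lightgbm are two widely used gradient boosting libraries; both provide classifiers.
Loading the iris dataset: load_iris() from scikit-learn loads the iris dataset, which contains 150 samples with 4 features in 3 iris species.
Extracting the feature matrix and target variable: X_iris holds the feature data and y_iris holds the class labels.

3.2 Explore and visualize the data

(1) Goal:

Inspect the basic information and summary statistics of the dataset and confirm that it has no missing values or anomalies.
Visualize the data with seaborn to understand the relationship between the features and the target label.

(2) Code: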

# Convert the data to a DataFrame
df_iris = pd.DataFrame(X_iris, columns=iris.feature_names)
df_iris['target'] = y_iris

# Inspect the dataset
print(df_iris.info())
print(df_iris.describe())

# Visualize the relationship between the iris features and the target
sns.pairplot(df_iris, hue="target", markers=["o", "s", "D"])
plt.show()

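(3) Explanation:

Converting to a DataFrame: turn the numpy array into a pandas DataFrame so that the data is easier to inspect and manipulate.
Exploring the dataset: info() and describe() show the basic information, including data types, missing values, and summary statistics.
- info() shows the structure of the dataset, including the non-null count of each column, which confirms that no values are missing.
- describe() reports descriptive statistics for each feature, such as the mean, standard deviation, minimum, maximum, and quartiles.
Visualizing the data: seaborn's pairplot() draws pairwise scatter plots of the features, which shows how they relate to the target variable (the iris species).
- hue="target" colors the points by iris species.

3.3 Split the dataset into training and test sets

(1) Goal:

Split the iris dataset into a training set and a test set so that the model can be trained and then validated.
Use an 80:20 split: 80% of the data for training and 20% for testing.

(2) Code: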

# Split the iris dataset into training and test sets
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

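(3) Explanation:

train_test_split(): splits the dataset into a training set and a test set. Here 80% of the data is used to train the model and 20% to test it.
random_state=42: fixes the random seed so that the split is identical on every run, which makes the results reproducible.
X_train_iris and y_train_iris are the features and labels of the iris training set; X_test_iris and y_test_iris are those of the test set.

3.4 Tune the parameters of the `LightGBM` model

(1) Goal:

Use grid search (GridSearchCV) to tune the parameters of the LightGBM classifier and find the combination that gives the best performance.

(2) Code: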

# Define the LightGBM model and the parameter grid
lgb_model_iris = lgb.LGBMClassifier(random_state=42)
lgb_param_grid_iris = {
    'n_estimators': [50, 100, 200],
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.01, 0.1, 0.3]
}

# Run the grid search with cross-validation
lgb_grid_search_iris = GridSearchCV(estimator=lgb_model_iris, param_grid=lgb_param_grid_iris, scoring='accuracy', cv=3)
lgb_grid_search_iris.fit(X_train_iris, y_train_iris)

# Print the best parameters
print("LightGBM best parameters (iris dataset):", lgb_grid_search_iris.best_params_)

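(3) Explanation:

Defining the model: lgb.LGBMClassifier is the LightGBM classifier, used here for the iris classification task.
Parameter tuning:
- Grid search (GridSearchCV) tunes n_estimators (number of trees), num_leaves (maximum number of leaves per tree), and learning_rate.
- cv=3 means 3-fold cross-validation, which is used to score each combination and select the best one.
Printing the best parameters: best_params_ holds the best parameter combination found for the LightGBM model.

3.5 Predict on the test set and evaluate the model

(1) Goal:

Use the best LightGBM model found by the grid search to predict on the test set, and evaluate its classification performance.

(2) Code: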

# Predict on the test set with the best LightGBM model
lgb_y_pred_iris = lgb_grid_search_iris.best_estimator_.predict(X_test_iris)

# Evaluate the model (accuracy and F1 score)
lgb_accuracy_iris = accuracy_score(y_test_iris, lgb_y_pred_iris)
lgb_f1_iris = f1_score(y_test_iris, lgb_y_pred_iris, average='weighted')

# Print the results
print(f"LightGBM classifier - iris dataset: accuracy = {lgb_accuracy_iris}, F1 score = {lgb_f1_iris}")

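(3) Explanation:

Prediction: best_estimator_ returns the best LightGBM model found by the grid search, and predict() generates predictions for the test set.
Performance evaluation:
- accuracy_score() computes the accuracy, i.e., the fraction of samples that are predicted correctly.
- f1_score() computes the F1 score, which balances precision and recall; average='weighted' averages the per-class F1 scores weighted by the number of samples in each class.
Printing the results: print the accuracy and F1 score of the LightGBM model on the iris dataset.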
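To show what `average='weighted'` computes, the following sketch derives the weighted F1 score by hand and checks it against scikit-learn. The label arrays are made-up toy data, not predictions from the models in this post.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])

per_class_f1 = []
support = []
for c in np.unique(y_true):
    tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
    fp = np.sum((y_pred == c) & (y_true != c))  # false positives
    fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    per_class_f1.append(2 * precision * recall / (precision + recall))
    support.append(np.sum(y_true == c))

# Weighted average: each class counts in proportion to its number of samples
weighted_f1 = float(np.average(per_class_f1, weights=support))

print("Per-class F1:", np.round(per_class_f1, 4))
print("Weighted F1 (by hand):", weighted_f1)
print("Weighted F1 (sklearn):", f1_score(y_true, y_pred, average="weighted"))
print("Accuracy:", accuracy_score(y_true, y_pred))
```

The hand computation omits the zero-division handling that `f1_score` performs, which is safe here because every class appears in both arrays.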

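3.6 Visualize model performance

(1) Goal:

Draw a bar chart of the accuracy and F1 score of the LightGBM model on the iris dataset to show its performance at a glance.

(2) Code: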

# Draw a bar chart of the LightGBM model's performance
performance_metrics = ['Accuracy', 'F1 Score']
lgb_performance_iris = [lgb_accuracy_iris, lgb_f1_iris]

plt.figure(figsize=(8, 6))
plt.bar(performance_metrics, lgb_performance_iris, color='green')
plt.title('LightGBM performance on the iris dataset')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.show()

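(3) Explanation:

Plotting: matplotlib's bar() function draws a bar chart of the LightGBM model's accuracy and F1 score.
Visual summary: the bar chart shows the classification performance at a glance and makes the model's strengths and weaknesses easier to judge.
- ylim(0, 1) sets the Y-axis range to 0 through 1.

4. Complete code and results

4.1 Wine

(1) Complete code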

# Step 1: import the required modules
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Step 2: load the dataset and separate features and labels
wine = load_wine()
X = wine.data
y = wine.target

# Step 3: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and test sets
print("Training feature shape:", X_train.shape)
print("Test feature shape:", X_test.shape)
print("Training label shape:", y_train.shape)
print("Test label shape:", y_test.shape)

# Step 4: define the random forest classifier and tune its parameters
rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters:", grid_search.best_params_)

# Step 5: take the model trained with the best parameters.
# GridSearchCV (refit=True by default) has already retrained it on the
# full training set, so no additional fit() call is needed.
best_model = grid_search.best_estimator_

# Step 6: print the classifier's confusion matrix
y_pred = best_model.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", conf_matrix)

# Draw the confusion matrix as a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Step 7: get the feature importances, sort them, and visualize them
feature_importances = best_model.feature_importances_
feature_names = wine.feature_names
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Draw the feature importance bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance from Random Forest')
plt.show()

# Step 8: select features whose importance is above the mean importance
selector = SelectFromModel(best_model, threshold='mean', prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Step 9: train a model on the selected features and evaluate it
rf_selected = RandomForestClassifier(random_state=42)
rf_selected.fit(X_train_selected, y_train)
y_pred_selected = rf_selected.predict(X_test_selected)

accuracy_selected = accuracy_score(y_test, y_pred_selected)
f1_selected = f1_score(y_test, y_pred_selected, average='weighted')

print("Accuracy after feature selection:", accuracy_selected)
print("F1 score after feature selection:", f1_selected)

# Step 10: draw the performance comparison chart
performance_metrics = ['Accuracy', 'F1 Score']
original_performance = [accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average='weighted')]
selected_performance = [accuracy_selected, f1_selected]

fig, ax = plt.subplots(figsize=(10, 6))
bar_width = 0.35
index = np.arange(len(performance_metrics))

bar1 = ax.bar(index, original_performance, bar_width, label='Original Features')
bar2 = ax.bar(index + bar_width, selected_performance, bar_width, label='Selected Features')

ax.set_xlabel('Performance Metrics')
ax.set_title('Classifier Performance Comparison Before and After Feature Selection')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(performance_metrics)
ax.legend()

# Annotate each bar with its value
for bar in bar1 + bar2:
    height = bar.get_height()
    ax.annotate('%.2f' % height, xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')

plt.show()

(2) Output

Training feature shape: (142, 13)
Test feature shape: (36, 13)
Training label shape: (142,)
Test label shape: (36,)
Best parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Confusion matrix:
 [[14  0  0]
 [ 0 14  0]
 [ 0  0  8]]

Accuracy after feature selection: 0.9722222222222222
F1 score after feature selection: 0.9717752234993614

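(3) Analysis

The two models in this run should be read separately. The tuned random forest trained on all 13 features classifies all 36 test samples correctly: the confusion matrix has no off-diagonal entries. The model retrained on the selected features reaches 97.22% accuracy and a weighted F1 score of 0.9718, which corresponds to one misclassified sample out of 36. The best parameters found (no depth limit, 100 trees, min_samples_split of 2) coincide with scikit-learn's defaults for the random forest classifier. Both results indicate that the task is well within reach of the model, but with only 36 test samples they are a rough estimate of generalization rather than proof of it.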
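Because a single 36-sample test set gives a coarse estimate, one optional follow-up is to cross-validate over the whole dataset and look at the spread of the scores. This sketch is an addition, not part of the original experiment; the 5 stratified folds and the random forest settings are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_wine(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5 stratified folds: every sample is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(4))
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

The standard deviation across folds gives a sense of how much a single-split accuracy can move purely because of which samples land in the test set.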

4.2 Iris

(1) Complete code

# Step 1: import the required libraries and the dataset loader
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
import xgboost as xgb
import lightgbm as lgb

# Step 2: load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
features = iris.feature_names

# Convert the data to a DataFrame for easier inspection
df = pd.DataFrame(X, columns=features)
df['target'] = y

# Step 3: explore the data (basic information and feature relationships)
print(df.info())
print(df.describe())

# Visualize the relationships between features
sns.pairplot(df, hue="target", markers=["o", "s", "D"])
plt.show()

# Step 4: split the dataset into training and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: tune the parameters of the chosen classifiers (XGBoost and LightGBM)
# 1. XGBoost parameter tuning
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3]
}
xgb_grid_search = GridSearchCV(estimator=xgb_model, param_grid=xgb_param_grid, scoring='accuracy', cv=3)
xgb_grid_search.fit(X_train, y_train)
xgb_best_model = xgb_grid_search.best_estimator_
print(f"XGBoost best parameters: {xgb_grid_search.best_params_}")

# 2. LightGBM parameter tuning
# verbosity=-1 suppresses LightGBM's warnings
lgb_model = lgb.LGBMClassifier(random_state=42, verbosity=-1)

lgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.01, 0.1, 0.3]
}

lgb_grid_search = GridSearchCV(estimator=lgb_model, param_grid=lgb_param_grid, scoring='accuracy', cv=3)
lgb_grid_search.fit(X_train, y_train)

lgb_best_model = lgb_grid_search.best_estimator_
print(f"LightGBM best parameters: {lgb_grid_search.best_params_}")

# Step 6: predict on the test set and save the predictions
xgb_y_pred = xgb_best_model.predict(X_test)
lgb_y_pred = lgb_best_model.predict(X_test)

# Save the predictions to a CSV file
pd.DataFrame({'XGBoost prediction': xgb_y_pred, 'LightGBM prediction': lgb_y_pred}).to_csv('classification_results.csv', index=False)

# Step 7: evaluate the two tuned classifiers on the test set
# XGBoost classifier
xgb_accuracy = accuracy_score(y_test, xgb_y_pred)
xgb_f1 = f1_score(y_test, xgb_y_pred, average='weighted')
print(f"XGBoost classifier - accuracy: {xgb_accuracy}, F1 score: {xgb_f1}")

# LightGBM classifier
lgb_accuracy = accuracy_score(y_test, lgb_y_pred)
lgb_f1 = f1_score(y_test, lgb_y_pred, average='weighted')
print(f"LightGBM classifier - accuracy: {lgb_accuracy}, F1 score: {lgb_f1}")

# Step 8: visualize the comparison
performance_metrics = ['Accuracy', 'F1 Score']
xgb_performance = [xgb_accuracy, xgb_f1]
lgb_performance = [lgb_accuracy, lgb_f1]

fig, ax = plt.subplots(figsize=(10, 6))
bar_width = 0.35
index = np.arange(len(performance_metrics))

bar1 = ax.bar(index, xgb_performance, bar_width, label='XGBoost')
bar2 = ax.bar(index + bar_width, lgb_performance, bar_width, label='LightGBM')

# English labels render with matplotlib's default font, so no
# platform-specific font configuration is required
ax.set_xlabel('Performance Metrics')
ax.set_title('XGBoost vs LightGBM Performance Comparison')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(performance_metrics)
ax.legend()

# Annotate each bar with its value
for bar in bar1 + bar2:
    height = bar.get_height()
    ax.annotate('%.2f' % height, xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')

plt.show()

(2) Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
None
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000
mean            5.843333          3.057333           3.758000
std             0.828066          0.435866           1.765298
min             4.300000          2.000000           1.000000
25%             5.100000          2.800000           1.600000
50%             5.800000          3.000000           4.350000
75%             6.400000          3.300000           5.100000
max             7.900000          4.400000           6.900000

       petal width (cm)      target
count        150.000000  150.000000
mean           1.199333    1.000000
std            0.762238    0.819232
min            0.100000    0.000000
25%            0.300000    0.000000
50%            1.300000    1.000000
75%            1.800000    2.000000
max            2.500000    2.000000

XGBoost best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200}
LightGBM best parameters: {'learning_rate': 0.01, 'n_estimators': 200, 'num_leaves': 31}
XGBoost classifier - accuracy: 1.0, F1 score: 1.0
LightGBM classifier - accuracy: 1.0, F1 score: 1.0

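(3) Analysis

Both tuned models classify all 30 test samples of the iris dataset correctly, giving an accuracy and F1 score of 1.0. Iris is a small dataset with well-separated classes, so this result shows that both models fit the data well. A 30-sample test set is too small, however, to tell the two models apart or to support strong claims about generalization and stability.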