Southwest Jiaotong University [Machine Learning Lab 9]

Objective

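Following the random forest approach, build a bagging ensemble that uses decision trees as base learners for a multi-class classification task.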
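Bagging trains each base learner on a bootstrap sample, that is, a sample of the same size as the training set drawn with replacement. The following is a minimal sketch of that sampling step, not the experiment's code: the helper `bootstrap_indices`, the sample size, and the seed are illustrative choices. It also checks the standard result that a bootstrap sample contains roughly 63% of the distinct original rows.

```python
import numpy as np


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return rng.integers(0, n_samples, size=n_samples)


rng = np.random.default_rng(0)  # seed chosen arbitrarily for this illustration
n = 10_000                      # hypothetical training-set size
idx = bootstrap_indices(n, rng)

# Fraction of distinct rows that made it into the bootstrap sample
unique_fraction = np.unique(idx).size / n
# Expected value of that fraction: 1 - (1 - 1/n)^n, which approaches 1 - 1/e
expected_fraction = 1 - (1 - 1 / n) ** n

print(f"observed: {unique_fraction:.3f}, expected: {expected_fraction:.3f}")
```

The rows that are never drawn differ from tree to tree, which is one source of diversity among the base learners.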

Requirements

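Implement a random forest model and use it to classify a handwritten digit recognition dataset. Use decision trees as the base models, information entropy as the splitting criterion, and a randomly selected attribute subset of size 50. Set the number of decision trees T to 1, 2, ..., 20 in turn, compute the random forest's accuracy on the test set, and plot how the accuracy changes as the number of base models increases.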
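To make the splitting criterion concrete, here is a small sketch of information entropy, H = -Σ p_k log2 p_k, and of the information gain of a binary split. The function names `entropy` and `information_gain` and the toy labels are assumptions made for illustration; the sketch shows the quantity being optimized, not scikit-learn's internal implementation.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy, in bits, of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(labels, mask):
    """Entropy reduction from splitting labels into labels[mask] and labels[~mask]."""
    n = labels.size
    left, right = labels[mask], labels[~mask]
    if left.size == 0 or right.size == 0:
        return 0.0  # a split that leaves one side empty separates nothing
    weighted = (left.size / n) * entropy(left) + (right.size / n) * entropy(right)
    return entropy(labels) - weighted


y = np.array([0, 0, 1, 1])                            # toy labels, two balanced classes
perfect_split = np.array([True, True, False, False])  # separates the classes exactly
useless_split = np.array([True, False, True, False])  # both sides stay 50/50

print(entropy(y))
print(information_gain(y, perfect_split))
print(information_gain(y, useless_split))
```

A tree that uses this criterion prefers, at each node, the candidate split with the largest gain among the attributes it is allowed to consider.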

Environment

Python, numpy, matplotlib, sklearn

Code

```python
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.tree import DecisionTreeClassifier

matplotlib.use("TKAgg")


# Majority vote across the base learners
def vote(predictions_matrix):
    """Return the most frequent label in each row (one row per sample, one column per tree)."""
    n_samples = predictions_matrix.shape[0]
    final_predictions = np.zeros(n_samples, dtype=int)

    for i in range(n_samples):
        row = predictions_matrix[i]
        # Find every predicted class and its count
        unique, counts = np.unique(row, return_counts=True)
        # Pick the most frequent class. On a tie, argmax returns the first maximum,
        # so the smallest label wins because np.unique returns sorted values.
        final_predictions[i] = unique[np.argmax(counts)]

    return final_predictions


# Load the training data (column 0 is the label, the remaining columns are features)
train_data = np.genfromtxt("experiment_09_training_set.csv", delimiter=",", skip_header=1)
columnOfTrainDataset = train_data.shape[1]

# Load the test data
test_data = np.genfromtxt("experiment_09_testing_set.csv", delimiter=",", skip_header=1)
test_x = test_data[:, 1:columnOfTrainDataset]
test_y = test_data[:, 0]

# Set T and the random seed
T = 20
np.random.seed(42)

# Vectors for plotting
x_line = np.arange(1, T + 1)
y_line = np.zeros(T)

# For each ensemble size, train a fresh forest and evaluate it
for number in range(1, T + 1):
    # Trained decision trees
    models = []
    # Train the base models
    for i in range(number):
        # Bootstrap sample: draw row indices with replacement
        index = np.random.choice(train_data.shape[0], size=train_data.shape[0], replace=True)
        # Select x and y by index
        train_x = train_data[index, 1:columnOfTrainDataset]
        train_y = train_data[index, 0]
        # Base model: a decision tree that uses information entropy as the split
        # criterion and considers 50 randomly chosen features at each split
        tree = DecisionTreeClassifier(criterion='entropy', max_features=50)
        tree.fit(train_x, train_y)
        # Store the model
        models.append(tree)

    # Prediction matrix: one column per tree
    pred = np.zeros((test_data.shape[0], number))

    # Predict with every decision tree
    for i, tree in enumerate(models):
        pred[:, i] = tree.predict(test_x)
    # Combine the predictions by majority vote
    pred_y = vote(pred)
    # Compute the accuracy
    accuracy = np.sum(pred_y == test_y) / test_y.shape[0]
    # Print the result
    print(f"T: {number} -> accuracy: {accuracy:.4f}")
    y_line[number - 1] = accuracy

# Plot
plt.figure(1)
plt.plot(x_line, y_line, color='b', linewidth=1.5, label='Accuracy line')
plt.xlabel("T", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.title("Accuracy line", fontsize=14)
plt.legend(loc='upper right', frameon=True)
plt.grid(alpha=0.3, linestyle=':')
plt.ylim(0, 1.1)
ax = plt.gca()
ax.xaxis.set_major_locator(MultipleLocator(2))
plt.tight_layout()
plt.show()
```
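The script retrains a fresh forest for every value of T, which fits 1 + 2 + ... + 20 = 210 trees in total. One possible alternative, sketched here under stated assumptions, is to train the trees once, collect their predictions in a matrix, and score the majority vote over the first t columns for each t. This changes the experiment slightly, because the forests for different t then share trees. The helper `prefix_accuracies` and its `n_classes` parameter are hypothetical names, and the labels are assumed to be integers in the range 0 to `n_classes - 1`.

```python
import numpy as np


def prefix_accuracies(pred, y_true, n_classes):
    """Accuracy of the majority vote over the first t columns of pred, for t = 1..T.

    pred:   (n_samples, T) integer array, one column of predicted labels per tree
    y_true: (n_samples,) integer array of true labels
    """
    n_samples, n_trees = pred.shape
    votes = np.zeros((n_samples, n_classes), dtype=int)  # running vote counts
    rows = np.arange(n_samples)
    accuracies = np.zeros(n_trees)
    for t in range(n_trees):
        # Add tree t's vote for every sample (each row index appears once)
        votes[rows, pred[:, t]] += 1
        # argmax returns the first maximum, so ties go to the smallest label,
        # the same tie-breaking that np.unique followed by np.argmax produces
        accuracies[t] = np.mean(votes.argmax(axis=1) == y_true)
    return accuracies


# Tiny hand-made example: 3 samples, 3 trees, labels in {0, 1, 2}
pred = np.array([[0, 1, 1],
                 [2, 2, 0],
                 [1, 0, 2]])
y_true = np.array([1, 2, 0])
acc = prefix_accuracies(pred, y_true, n_classes=3)
print(acc)
```

With the script's variables, this would be called on a 20-column prediction matrix cast to integers, for example `prefix_accuracies(pred.astype(int), test_y.astype(int), n_classes=10)`, where 10 classes is an assumption based on the digits 0 to 9.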

Results analysis

Accuracy on the test set

| T | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Accuracy | 0.7980 | 0.8022 | 0.8730 | 0.9036 | 0.9163 | 0.9237 | 0.9303 | 0.9373 | 0.9386 | 0.9421 |
| T | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| Accuracy | 0.9447 | 0.9457 | 0.9489 | 0.9513 | 0.9500 | 0.9505 | 0.9547 | 0.9553 | 0.9548 | 0.9561 |
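As a rough way to read the table, the sketch below computes the change in accuracy from each T to the next, using only the values reported above. The variable names are illustrative, and the numbers come from a single run, so small differences should not be over-interpreted.

```python
# Test-set accuracies copied from the table, for T = 1..20
reported = [
    0.7980, 0.8022, 0.8730, 0.9036, 0.9163, 0.9237, 0.9303, 0.9373, 0.9386, 0.9421,
    0.9447, 0.9457, 0.9489, 0.9513, 0.9500, 0.9505, 0.9547, 0.9553, 0.9548, 0.9561,
]

# gains[i] is the change in accuracy when going from T = i + 1 to T = i + 2
gains = [round(b - a, 4) for a, b in zip(reported, reported[1:])]

largest_gain_at = gains.index(max(gains)) + 2             # value of T reached by the biggest jump
decreases = [i + 2 for i, g in enumerate(gains) if g < 0]  # values of T where accuracy dropped
total_gain = round(reported[-1] - reported[0], 4)

print(f"largest single gain: {max(gains):.4f} at T = {largest_gain_at}")
print(f"accuracy decreased at T = {decreases}")
print(f"total gain from T = 1 to T = 20: {total_gain:.4f}")
```

Most of the improvement arrives in the first few trees, and the curve flattens afterwards with small fluctuations. The small step from T = 1 to T = 2 is consistent with how the vote breaks ties: with only two trees, any disagreement is a tie that is resolved in favor of the smaller label.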

Accuracy curve as T increases