Hand-Writing a "Random Forest" in Python to Classify the Iris Dataset

Task description
Solution
- 1. Import the required libraries
- 2. Data sampling function `sample`
- 3. Set the random seed and hyperparameters
- 4. Define the random forest class `random_forest`
- 5. Load the dataset and split it into training and test sets
- 6. Create and train the random forest model
- 7. Predict and compute accuracy
Code

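Task description

Your task is to write a "random_forest" class that solves a classification problem on the Fisher Iris dataset (sklearn.datasets.load_iris). The class takes "n_estimators", "max_depth", "subspaces_dim" and "random_state" as input parameters, which are described below. The class must define the .fit() and .predict() methods, as well as a ._estimators field that stores the list of algorithms used in the ensemble.

n_estimators - the number of trees in the ensemble
max_depth - the maximum depth of each tree in the ensemble
subspaces_dim - the dimension of the random subspace used by each tree

To build a training subsample for each base classifier, you can use the "sample" class implemented in the previous task. If you use it, remember to include its definition in the solution file for this task. We also recommend organizing the subspace selection by filling a "subspace_idx" list that records the subspace chosen for each base classifier.

Note: the sklearn.ensemble.RandomForestClassifier class is not allowed in this task. (Normally we could call that class directly; here we build the pieces ourselves to understand how they fit together.)

In addition, choose the hyperparameters with which your algorithm achieves the best quality on the test sample (measured by accuracy, the fraction of correct answers) with test_size=0.3, and set them as the global variables N_ESTIMATORS, MAX_DEPTH, SUBSPACE_DIM.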

Class template:

import numpy as np
np.random.seed(42)

N_ESTIMATORS = None
MAX_DEPTH = None
SUBSPACE_DIM = None

class random_forest(object):
  def __init__(self, n_estimators: int, max_depth: int, subspaces_dim: int, random_state: int):
    self.n_estimators = n_estimators
    self.max_depth = max_depth
    self.subspaces_dim = subspaces_dim
    self.random_state = random_state
    """
      Set all required fields in the class constructor
    """

  def fit(self, X, y):
    for i in range(self.n_estimators):
      """
        Write the code that trains each tree on its own subsample.
      """
      pass

  def predict(self, X):
    """
      Write the code that returns the ensemble's aggregated prediction
    """
    pass

Solution

We implement a custom random forest classifier and use it to classify the Iris dataset.

1. Import the required libraries

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

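numpy: the core Python library for scientific computing, providing a high-performance multidimensional array object and tools for working with such arrays.
load_iris: imported from sklearn.datasets, loads the Iris dataset.
train_test_split: imported from sklearn.model_selection, splits a dataset into a training set and a test set.
accuracy_score: imported from sklearn.metrics, computes the accuracy of the model's predictions.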

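2. Data sampling function `sample`

Bootstrap sampling is random sampling with replacement: samples are drawn at random from the original dataset and each drawn sample is put back, so the same sample can be drawn more than once.

Purpose:

- Increase data diversity: in a random forest, each decision tree is trained on a different bootstrap sample. Each tree therefore sees a slightly different view of the data, which keeps the trees from all learning the same thing and improves the generalization of the ensemble.
- Estimate model stability: bootstrap sampling can also be used to estimate a model's stability and error, for example by measuring the error on the out-of-bag data (the samples that were not drawn).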

def sample(X, y, n_samples=None):
    if n_samples is None:
        n_samples = len(X)
    indices = np.random.choice(len(X), n_samples, replace=True)
    return X[indices], y[indices]

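This function draws a bootstrap sample from the feature matrix X and the label vector y.
n_samples is the number of samples to draw; if it is not given, it defaults to the number of samples in the original dataset.
np.random.choice picks the sample indices at random, and replace=True allows the same index to be picked more than once.
The function returns the sampled feature matrix and the matching label vector.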
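
To make the out-of-bag idea concrete, here is a minimal sketch, assuming a hypothetical dataset of 1000 rows and a local `RandomState` (neither is part of the solution). It draws one bootstrap sample of indices and counts how many distinct rows were picked; for large n the expected share of distinct rows is 1 - 1/e, about 63%, and the rest are out-of-bag.

```python
import numpy as np

# Hypothetical setup: 1000 row indices and a local generator for repeatability.
rng = np.random.RandomState(0)
n_rows = 1000

# One bootstrap draw: n_rows indices sampled with replacement.
indices = rng.choice(n_rows, n_rows, replace=True)

# Rows that appear at least once are "in bag"; the rest are out-of-bag.
in_bag = np.unique(indices)
out_of_bag = np.setdiff1d(np.arange(n_rows), in_bag)

in_bag_fraction = len(in_bag) / n_rows
print(f"in-bag fraction: {in_bag_fraction:.3f}")
print(f"out-of-bag rows: {len(out_of_bag)}")
```

The out-of-bag rows of a tree can serve as a small validation set for that tree, since the tree never saw them during training.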

3. Set the random seed and hyperparameters

np.random.seed(42)

N_ESTIMATORS = 100
MAX_DEPTH = 5
SUBSPACE_DIM = 2

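np.random.seed(42): seeds NumPy's global random number generator so that every run produces the same random sequence, which makes the results reproducible.
N_ESTIMATORS: the number of decision trees in the forest, set to 100 here.
MAX_DEPTH: the maximum depth of each decision tree, set to 5; capping the depth helps limit overfitting.
SUBSPACE_DIM: the dimension of the random feature subspace each tree is trained on, set to 2 (Iris has 4 features in total).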
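
The effect of the seed can be checked directly. The sketch below is illustrative only: the variable names are hypothetical, and the second half shows `np.random.default_rng`, a local generator that the solution itself does not use but that avoids depending on global state.

```python
import numpy as np

# Re-seeding the global generator replays the same sequence of draws.
np.random.seed(42)
first_draw = np.random.choice(10, 5, replace=True)

np.random.seed(42)
second_draw = np.random.choice(10, 5, replace=True)

same_global = np.array_equal(first_draw, second_draw)
print(same_global)  # True

# Alternative (assumption, not used in the solution): a local generator.
# Two generators built from the same seed produce the same stream,
# and neither is affected by other code calling np.random.* functions.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same_local = np.array_equal(rng_a.integers(0, 10, 5), rng_b.integers(0, 10, 5))
print(same_local)  # True
```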

4. Define the random forest class `random_forest`

class random_forest(object):
    def __init__(self, n_estimators: int, max_depth: int, subspaces_dim: int, random_state: int):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.subspaces_dim = subspaces_dim
        self.random_state = random_state
        self._estimators = []

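This class implements the random forest classifier.
__init__ is the constructor. It stores the forest's parameters: the number of trees, the maximum depth, the subspace dimension and the random seed.
self._estimators starts as an empty list and will hold the decision trees of the forest.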

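    def fit(self, X, y):
        # Imported here so that this snippet is self-contained
        from sklearn.tree import DecisionTreeClassifier

        # Reset state so that calling fit twice does not stack two forests
        self._estimators = []
        self.subspace_idx = []
        for i in range(self.n_estimators):
            X_sample, y_sample = sample(X, y)
            # Pick this tree's feature columns and remember them for predict
            subspace_idx = np.random.choice(X.shape[1], self.subspaces_dim, replace=False)
            self.subspace_idx.append(subspace_idx)
            X_subspace = X_sample[:, subspace_idx]

            estimator = DecisionTreeClassifier(max_depth=self.max_depth, random_state=self.random_state)
            estimator.fit(X_subspace, y_sample)
            self._estimators.append(estimator)

The fit method trains the random forest.
For each decision tree:
- Call sample to draw a bootstrap sample, giving the training rows X_sample and their labels y_sample.
- Use np.random.choice to pick subspaces_dim features without replacement, giving the subspace indices subspace_idx, and append them to self.subspace_idx.
- Extract the feature subspace X_subspace from the sampled data.
- Create a DecisionTreeClassifier with the configured maximum depth and random seed.
- Train the tree on the feature subspace and the matching labels.
- Append the trained tree to the self._estimators list.

About feature subspace sampling:

Principle: when training each decision tree, a random subset of the features is used as that tree's feature subspace instead of all features.
Purpose:
- Decorrelate the trees: because different trees are trained on different feature subsets, they are less likely to all rely on the same dominant features, which further increases the diversity of the ensemble.
- Reduce overfitting: limiting the number of features each tree can use keeps individual trees simpler, which lowers the risk of overfitting.

    def predict(self, X):
        predictions = []
        for estimator, subspace_idx in zip(self._estimators, self.subspace_idx):
            # Use exactly the feature columns this tree was trained on
            X_subspace = X[:, subspace_idx]
            pred = estimator.predict(X_subspace)
            predictions.append(pred)

        predictions = np.array(predictions).T
        final_predictions = np.array([np.bincount(pred).argmax() for pred in predictions])
        return final_predictions

The predict method predicts labels for the input data X.
For each decision tree in the forest:
- Look up the subspace indices stored for that tree during fit and extract the same columns from X, giving X_subspace. Reusing the stored indices matters: a tree fitted on one pair of columns produces meaningless output if it is handed a different pair at prediction time.
- Predict with the tree on that subspace and append the result to the predictions list.
The stacked predictions are then transposed so that each row holds all the trees' votes for one sample.
For each sample, np.bincount counts how often each class was predicted, and argmax picks the most frequent class as the final prediction.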
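
The voting step can be traced on a tiny example. This is a sketch with made-up votes (3 hypothetical trees, 4 samples), not output from the model above.

```python
import numpy as np

# Hypothetical votes: one row per tree, one column per sample.
tree_votes = np.array([
    [0, 1, 2, 2],
    [0, 2, 2, 1],
    [1, 1, 2, 1],
])

# Transpose so that each row holds every tree's vote for one sample.
per_sample_votes = tree_votes.T  # shape (4, 3)

# Count the votes per class and keep the most frequent class.
majority = np.array([np.bincount(votes).argmax() for votes in per_sample_votes])
print(majority)  # [0 1 2 1]

# On a tie, argmax returns the first maximum, i.e. the smallest class label.
tie_winner = np.bincount(np.array([0, 1])).argmax()
print(tie_winner)  # 0
```

Note that np.bincount requires non-negative integer labels, which holds for the Iris targets 0, 1 and 2.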

5. Load the dataset and split it into training and test sets

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

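load_iris loads the Iris dataset; the feature matrix is stored in X and the label vector in y.
train_test_split splits the dataset into a training set and a test set, with 30% of the samples going to the test set.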
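
As a quick sanity check on the split, the sketch below prints the resulting shapes. The stratified variant at the end is an optional assumption, not part of the solution: passing `stratify=y` keeps the class proportions identical in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the solution: 30% of the 150 samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)

# Optional variant (assumption): stratify on the labels so each class
# contributes the same number of test samples.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
strat_counts = np.bincount(y_test_strat)
print(strat_counts)  # [15 15 15]
```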

6. Create and train the random forest model

rf = random_forest(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, subspaces_dim=SUBSPACE_DIM, random_state=42)
rf.fit(X_train, y_train)

Create an instance rf of the random_forest class, passing in the hyperparameters.
Call fit to train the random forest.

7. Predict and compute accuracy

y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Call predict on the test set to obtain the predictions y_pred.
Use accuracy_score to compute the model's accuracy on the test set.
Print the accuracy.
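
Accuracy is simply the fraction of predictions that match the true labels. The sketch below reproduces it by hand on made-up labels (the arrays are hypothetical, not model output) and also breaks it down per class.

```python
import numpy as np

# Hypothetical true labels and predictions for five samples.
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 2, 1, 1])

# Fraction of positions where prediction and truth agree: 4 of 5.
manual_accuracy = float(np.mean(y_true == y_hat))
print(manual_accuracy)  # 0.8

# Share of each true class that was predicted correctly.
per_class = {
    int(c): float(np.mean(y_hat[y_true == c] == c)) for c in np.unique(y_true)
}
print(per_class)  # {0: 1.0, 1: 1.0, 2: 0.5}
```

A per-class view like this shows which species the forest confuses, something a single accuracy number hides.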

Code

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


# Bootstrap sampling function
def sample(X, y, n_samples=None):
    if n_samples is None:
        n_samples = len(X)
    # Pick row indices at random, with replacement
    indices = np.random.choice(len(X), n_samples, replace=True)
    return X[indices], y[indices]


np.random.seed(42)

# Hyperparameters
N_ESTIMATORS = 100
MAX_DEPTH = 5
SUBSPACE_DIM = 2


class random_forest(object):
    def __init__(self, n_estimators: int, max_depth: int, subspaces_dim: int, random_state: int):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.subspaces_dim = subspaces_dim
        self.random_state = random_state
        # Trained decision trees
        self._estimators = []
        # Feature indices used by each tree, in the same order as _estimators
        self.subspace_idx = []

    def fit(self, X, y):
        # Reset state so that calling fit twice does not stack two forests
        self._estimators = []
        self.subspace_idx = []
        for i in range(self.n_estimators):
            # Draw a bootstrap sample of the rows
            X_sample, y_sample = sample(X, y)
            # Pick this tree's feature subspace and remember it for predict
            subspace_idx = np.random.choice(X.shape[1], self.subspaces_dim, replace=False)
            self.subspace_idx.append(subspace_idx)
            X_subspace = X_sample[:, subspace_idx]

            # Create and train the decision tree
            estimator = DecisionTreeClassifier(max_depth=self.max_depth, random_state=self.random_state)
            estimator.fit(X_subspace, y_sample)
            self._estimators.append(estimator)

    def predict(self, X):
        predictions = []
        for estimator, subspace_idx in zip(self._estimators, self.subspace_idx):
            # Use exactly the feature columns this tree was trained on
            X_subspace = X[:, subspace_idx]
            # Each tree votes
            pred = estimator.predict(X_subspace)
            predictions.append(pred)

        predictions = np.array(predictions).T
        # Majority vote gives the final prediction
        final_predictions = np.array([np.bincount(pred).argmax() for pred in predictions])
        return final_predictions


# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the random forest
rf = random_forest(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, subspaces_dim=SUBSPACE_DIM, random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Output: the script prints one line of the form "Accuracy: <value>", the fraction of the 45 test samples classified correctly. For comparison, a variant of predict that draws a fresh random subspace for every tree, so that trees are fed columns they were not trained on, printed Accuracy: 0.7111111111111111.
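
The task also asks for hyperparameters chosen by test-set accuracy. The following is a minimal sketch of such a search, not the author's procedure: the grid values and the helper names `fit_forest` and `predict_forest` are assumptions made for illustration, and a local `RandomState` is used instead of the global seed. Keep in mind that selecting hyperparameters on the test set, as the task requires, gives an optimistic estimate of real-world accuracy.

```python
import itertools

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_estimators, max_depth, subspaces_dim, seed):
    """Hypothetical helper: train trees on bootstrap rows and random columns."""
    rng = np.random.RandomState(seed)
    forest = []
    for _ in range(n_estimators):
        rows = rng.choice(len(X), len(X), replace=True)
        cols = rng.choice(X.shape[1], subspaces_dim, replace=False)
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
        tree.fit(X[rows][:, cols], y[rows])
        # Keep each tree together with the columns it was trained on.
        forest.append((tree, cols))
    return forest


def predict_forest(forest, X):
    """Hypothetical helper: majority vote over all trees."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest]).T
    return np.array([np.bincount(row).argmax() for row in votes])


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Small illustrative grid (assumed values): 2 * 2 * 4 = 16 combinations.
grid = itertools.product([10, 50], [2, 5], [1, 2, 3, 4])

results = {}
for n_estimators, max_depth, subspaces_dim in grid:
    forest = fit_forest(X_train, y_train, n_estimators, max_depth, subspaces_dim, seed=42)
    y_hat = predict_forest(forest, X_test)
    results[(n_estimators, max_depth, subspaces_dim)] = accuracy_score(y_test, y_hat)

# Best (n_estimators, max_depth, subspaces_dim) by test accuracy.
best = max(results, key=results.get)
print("best:", best, "accuracy:", results[best])
```

The winning tuple can then be copied into N_ESTIMATORS, MAX_DEPTH and SUBSPACE_DIM.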
