【大数据】机器学习------决策树

一、基本流程

决策树是一种基于树结构的分类和回归方法，它通过对特征空间进行划分，每个内部节点表示一个特征测试，每个分支代表一个测试输出，每个叶节点代表一个类别或回归值。

特征选择：根据某种准则（如信息增益、信息增益比、基尼指数等）选择最优特征进行划分。
决策树生成：从根节点开始，根据选定的特征对样本进行划分，生成子节点，递归地构建决策树。
决策树剪枝：通过剪枝处理防止过拟合，提高决策树的泛化能力。

二、划分选择

1. 信息增益（ID3 算法）

信息增益表示得知特征 X X X 的信息而使得类 Y Y Y 的信息不确定性减少的程度。

设数据集 D D D 的信息熵为：

其中 C k C_k Ck 是类别 k k k 的样本集合， ∣ C k ∣ |C_k| ∣Ck∣ 是类别 k k k 的样本数量， ∣ D ∣ |D| ∣D∣ 是数据集 D D D 的样本总数。

对于特征 A A A，信息增益为：

2. 信息增益比（C4.5 算法）

信息增益比克服了信息增益偏向于选择取值较多的特征的问题，定义为：

3. 基尼指数（CART 算法）

基尼指数表示集合的不确定性，对于数据集 D D D：

其中

对于特征 A A A 的基尼指数：

三、剪枝处理

1. 预剪枝

在决策树生成过程中，对每个节点在划分前进行估计，如果当前节点的划分不能带来决策树泛化性能的提升，则停止划分。

2. 后剪枝

先生成完整的决策树，然后自底向上对非叶节点进行考察，若将其替换为叶节点能提高泛化性能，则进行剪枝。

四、连续与缺失值处理

1. 连续值处理

对于连续特征，通常将其离散化，如采用二分法，将连续特征的取值排序，取相邻值的平均值作为划分点，计算不同划分点的信息增益（或其他指标），选择最优划分点。

2. 缺失值处理

样本权重调整：对于含有缺失值的样本，根据无缺失值样本中该特征的取值分布，将其以一定权重划分到不同子节点。
属性值填充：使用一些策略（如均值、中位数、众数）填充缺失值。

五、多变量决策树

多变量决策树不是为每个节点寻找一个最优划分属性，而是试图建立一个线性组合作为划分属性，如：

六、代码示例

1. 使用 sklearn 实现决策树分类

python 复制代码

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# 加载鸢尾花数据集
iris = load_iris()
X = iris.data
y = iris.target


# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# 初始化决策树分类器，使用信息增益（默认）
clf = DecisionTreeClassifier(criterion='entropy')


# 训练模型
clf.fit(X_train, y_train)


# 预测
y_pred = clf.predict(X_test)


# 计算准确率
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

2. 自定义决策树（使用信息增益）

python 复制代码

import numpy as np


def entropy(y):
    """计算信息熵"""
    unique_labels = np.unique(y)
    entropy = 0
    for label in unique_labels:
        p = np.mean(y == label)
        entropy -= p * np.log2(p)
    return entropy


def information_gain(X, y, feature_index):
    """计算信息增益"""
    base_entropy = entropy(y)
    values = np.unique(X[:, feature_index])
    new_entropy = 0
    for value in values:
        sub_y = y[X[:, feature_index] == value]
        p = len(sub_y) / len(y)
        new_entropy += p * entropy(sub_y)
    return base_entropy - new_entropy


class Node:
    """决策树节点类"""
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value


def build_tree(X, y, depth=0, max_depth=5):
    """构建决策树"""
    if len(np.unique(y)) == 1:
        return Node(value=y[0])
    if depth >= max_depth:
        return Node(value=np.bincount(y).argmax())
    n_features = X.shape[1]
    best_gain = 0
    best_feature = None
    best_threshold = None
    for feature_index in range(n_features):
        gain = information_gain(X, y, feature_index)
        if gain > best_gain:
            best_gain = gain
            best_feature = feature_index
    if best_gain == 0:
        return Node(value=np.bincount(y).argmax())
    feature_values = np.unique(X[:, best_feature])
    best_threshold = (feature_values[:-1] + feature_values[1:]) / 2
    best_threshold = best_threshold[np.argmax([information_gain(X, y, best_feature, t) for t in best_threshold])]
    left_indices = X[:, best_feature] <= best_threshold
    right_indices = X[:, best_feature] > best_threshold
    left = build_tree(X[left_indices], y[left_indices], depth + 1, max_depth)
    right = build_tree(X[right_indices], y[right_indices], depth + 1, max_depth)
    return Node(feature_index=best_feature, threshold=best_threshold, left=left, right=right)


def predict_sample(node, sample):
    """预测单个样本"""
    if node.value is not None:
        return node.value
    if sample[node.feature_index] <= node.threshold:
        return predict_sample(node.left, sample)
    else:
        return predict_sample(node.right, sample)


def predict(tree, X):
    """预测多个样本"""
    return np.array([predict_sample(tree, sample) for sample in X])


# 示例数据
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])


# 构建决策树
tree = build_tree(X, y, max_depth=3)


# 预测
y_pred = predict(tree, X)
print(y_pred)

代码解释：

1. 使用 sklearn 实现决策树分类：

load_iris() 加载鸢尾花数据集。
DecisionTreeClassifier(criterion='entropy') 初始化一个使用信息熵作为划分准则的决策树分类器。
clf.fit(X_train, y_train) 训练模型。
clf.predict(X_test) 对测试集进行预测。
accuracy_score(y_test, y_pred) 计算准确率。

2. 自定义决策树（使用信息增益）：

entropy(y) 函数计算信息熵。
information_gain(X, y, feature_index) 计算信息增益。
Node 类定义决策树的节点。
build_tree(X, y, depth=0, max_depth=5) 递归构建决策树，使用信息增益选择特征和阈值进行划分。
predict_sample(node, sample) 对单个样本进行预测。
predict(tree, X) 对多个样本进行预测。