[AI Fundamentals] Decision Tree Experiment Analysis

Environment: Anaconda, Jupyter Notebook, Graphviz

Packages used: numpy, scikit-learn, matplotlib, IPython

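1. Decision tree basics

Decision tree model: each time the data passes through a node it is split into several groups (two groups if the tree is binary).

A decision tree should place the features that separate the classes best at the higher-level nodes, so that they are applied first.

How to judge how well a node separates the data

Entropy

Entropy is a measure of the uncertainty of a random variable.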

Entropy formula: $H(X) = -\sum_{i=1}^{n} p_i \log_k p_i$, where $p_i$ is the probability of the $i$-th value and $k$ is usually 2.

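We use entropy to describe how mixed a set of data is. The lower the disorder (entropy) of the groups produced by a split, the better that node separates the data.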
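To make the formula concrete, here is a minimal sketch that computes the entropy of a list of class labels with numpy. The function name `entropy` and the toy label lists are illustrative choices, not part of the original experiment.

```python
import numpy as np

def entropy(labels, base=2):
    """Shannon entropy of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                      # empirical class probabilities
    h = -(p * np.log(p)).sum() / np.log(base)      # change of base: log_k(p) = ln(p) / ln(k)
    return float(h) + 0.0                          # "+ 0.0" turns a floating-point -0.0 into 0.0

print(entropy([0, 0, 0, 0]))   # a pure group has zero entropy
print(entropy([0, 0, 1, 1]))   # two classes, evenly mixed: 1 bit
print(entropy([0, 0, 0, 1]))   # mostly one class: about 0.811
```

A pure group scores 0 and an even two-class mix scores 1 bit, which is why a split that produces low-entropy groups is preferred.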

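Criteria for evaluating a split

Information gain: how much feature x reduces the uncertainty of class y (how pure the groups are after the split; ideally samples of the same class end up together)
Information gain ratio: also takes the feature's own entropy into account, gain ratio = information gain / entropy of the feature itself
CART: uses the Gini index, $\mathrm{Gini}(p) = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2$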

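A drawback of using information gain to rate a split: if a feature takes a very large number of distinct values, such as an id that is unique for every sample, then splitting on it gives groups with very low entropy and therefore a very high information gain, yet the split is meaningless.

The Gini index is faster to compute, because it needs no logarithm.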
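The id problem can be reproduced on a tiny example. The sketch below is an illustration under assumed toy data: the arrays `labels`, `sample_id` and `useful`, and the helper names, are hypothetical and not from the original experiment. It computes information gain, gain ratio and the Gini index directly from the definitions above.

```python
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum()) + 0.0

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy of each group."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    weighted = sum(
        (feature == v).mean() * entropy(labels[feature == v])
        for v in np.unique(feature)
    )
    return entropy(labels) - weighted

def gain_ratio(feature, labels):
    # divides by the feature's own entropy; undefined for a constant feature
    return information_gain(feature, labels) / entropy(feature)

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

# hypothetical toy data: 8 samples, 2 classes
labels    = np.array([0, 0, 0, 0, 1, 1, 1, 1])
sample_id = np.arange(8)                        # unique per sample, carries no meaning
useful    = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # informative but imperfect feature

for name, feat in (("id", sample_id), ("useful", useful)):
    print(name, round(information_gain(feat, labels), 3), round(gain_ratio(feat, labels), 3))
print("gini of labels:", gini(labels))
```

On this toy data the id column gets the largest possible information gain (1.0, against roughly 0.55 for the useful feature), while the gain ratio reverses the order (about 0.33 against about 0.57), which is the motivation for dividing by the feature's own entropy.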

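Preventing overfitting in decision trees

Pre-pruning

Prune while the tree is being built, by limiting the depth, the number of leaf nodes, the number of samples per leaf, or the minimum information gain.

Post-pruning

Prune after the tree has been fully built.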

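Pruning is guided by a cost measure $C_\alpha(T) = C(T) + \alpha |T_{leaf}|$, where $C(T)$ is the loss of tree $T$ on the training data and $|T_{leaf}|$ is its number of leaf nodes; the more leaves a tree has, the larger the penalty.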
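scikit-learn exposes this kind of post-pruning as minimal cost-complexity pruning, through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method. The sketch below is one possible way to explore it, assuming the full four-feature iris data rather than the two petal features used later; it is not the procedure from the original experiment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# grow the full tree, then ask for the alpha values at which subtrees get pruned
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

leaves = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against a tiny negative value from rounding
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X, y)
    leaves.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaves={leaves[-1]}")
```

A larger alpha puts a higher price on each leaf, so the number of leaves can only stay the same or shrink as alpha grows.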

2. Experiment setup

Installing Graphviz

I am using Ubuntu 22.04; for other operating systems, look up the installation instructions online.

shell

sudo apt install graphviz
# print the installed version to confirm that dot is on the PATH
dot -V

Importing packages

python

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')

3. Drawing a decision tree

Training the tree

The iris dataset is used here.

python

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# keep only petal length and petal width
x = iris.data[:,2:]
y = iris.target

# limit the maximum depth to 2
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(x,y)

Getting an image of the tree

python

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file='iris_tree.dot',
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

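The .dot file named by the out_file parameter is written to the current working directory, which for a notebook is normally the notebook's own directory. Convert it to a PNG file with the command dot -Tpng iris_tree.dot -o iris_tree.png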
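When Graphviz is not available, the same tree can be inspected as plain text. The sketch below uses scikit-learn's `export_text` as an alternative to the .dot route; the added `random_state=42` is an assumption made only to keep the output repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]   # petal length and petal width, as in the experiment
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, iris.target)

# one line per node: the split condition for internal nodes, the class for leaves
rules = export_text(tree_clf, feature_names=iris.feature_names[2:])
print(rules)
```

The text view carries the same split thresholds as the image, although it leaves out the Gini values and the sample counts.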

Displaying the image

python

from IPython.display import Image
# width and height are given in pixels
Image(filename='iris_tree.png', width=400, height=400)

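Each node in the image shows its Gini index. The left child has a Gini index of 0, meaning it contains a single class, so it is not split any further.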
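The numbers shown in the image can also be read from the fitted model. The sketch below inspects the `tree_` attribute of a tree trained the same way as above (the `random_state=42` is an added assumption for repeatability) and prints the Gini index, sample count and leaf status of the root and its two children.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(iris.data[:, 2:], iris.target)

t = tree_clf.tree_
root = 0
left, right = t.children_left[root], t.children_right[root]

for name, node in (("root", root), ("left child", left), ("right child", right)):
    is_leaf = t.children_left[node] == -1   # -1 marks a node with no children
    print(f"{name:11s} gini={t.impurity[node]:.3f} "
          f"samples={t.n_node_samples[node]} leaf={is_leaf}")
```

The root holds three equally sized classes, so its Gini index is 1 - 3 * (1/3)^2, about 0.667, while the left child holds only the 50 setosa samples and is already a leaf.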

4. Decision boundaries

The code below is mostly plotting.

python

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, x, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
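    # build a grid of points covering the plot area
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    # predict the class of every grid point
    x_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(x_new).reshape(x1.shape)
    # shade the region of each predicted class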
    custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if plot_training:
        plt.plot(x[:,0][y == 0], x[:,1][y == 0], "yo", label='iris-setosa')
        plt.plot(x[:,0][y == 1], x[:,1][y == 1], "bs", label='iris-versicolor')
        plt.plot(x[:,0][y == 2], x[:,1][y == 2], "g^", label='iris-virginica')
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel("$x_1$", fontsize=18)
        plt.ylabel("$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc='lower right', fontsize=14)

plt.figure(figsize=(8,4))
plot_decision_boundary(tree_clf, x, y)
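# depth-0 split, drawn as a vertical line at petal length = 2.45
plt.plot([2.45, 2.45], [0,    3],    'k-',  linewidth=2)
# depth-1 split at petal width = 1.75, drawn only to the right of the first split
plt.plot([2.45, 7.5],  [1.75, 1.75], 'k--', linewidth=2)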
plt.plot([4.95, 4.95], [0,    1.75], 'k:',  linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3],    'k:',  linewidth=2)
plt.text(1.40, 1.0, 'depth=0', fontsize=15)
plt.text(3.2,  1.8, 'depth=1', fontsize=13)
plt.text(4.05, 0.5, 'depth=2', fontsize=11)
plt.title('Decision Tree decision boundaries')

plt.show()

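5. Decision tree regularization (pruning)

Parameters provided by scikit-learn

The DecisionTreeClassifier class has several parameters that restrict the shape of the tree in a similar way:

min_samples_split: the minimum number of samples a node must have before it can be split
min_samples_leaf: the minimum number of samples a leaf node must have
max_leaf_nodes: the maximum number of leaf nodes
max_features: the maximum number of features evaluated for a split at each node (usually left unrestricted)
max_depth: the maximum depth of the tree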
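To see how each of these parameters constrains the tree, the sketch below fits one classifier per setting and reports the resulting depth and number of leaves. The specific values (3, 10, 4, 5) and the `settings` dictionary are illustrative assumptions, not recommendations from the original experiment.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=53)

# hypothetical settings, one restriction at a time
settings = {
    "default": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=10": {"min_samples_split": 10},
    "min_samples_leaf=4": {"min_samples_leaf": 4},
    "max_leaf_nodes=5": {"max_leaf_nodes": 5},
}

sizes = {}
for name, params in settings.items():
    clf = DecisionTreeClassifier(random_state=42, **params).fit(X, y)
    sizes[name] = (clf.get_depth(), clf.get_n_leaves())
    print(f"{name:22s} depth={sizes[name][0]}  leaves={sizes[name][1]}")
```

Each restriction stops the tree from growing until every leaf is pure, which is the pre-pruning idea from section 1.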

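Before and after regularization

Only the min_samples_leaf parameter is compared here.

python

from sklearn.datasets import make_moons
x,y = make_moons(n_samples=100, noise=0.25, random_state=53)
# decision tree classifier with no restrictions
tree_clf_default = DecisionTreeClassifier(random_state=42).fit(x,y)
# decision tree classifier with a minimum number of samples per leaf
tree_clf_leaf_count = DecisionTreeClassifier(min_samples_leaf=4,random_state=42).fit(x,y)

plt.figure(figsize=(12,4))
plt.subplot(121)
# iris=False: use generic axis labels, since this is not the iris data
plot_decision_boundary(tree_clf_default,    x, y ,axes=[-1.5, 3, -1, 1.5], iris=False)
plt.title('No restrictions')

plt.subplot(122)
plot_decision_boundary(tree_clf_leaf_count, x, y ,axes=[-1.5, 3, -1, 1.5], iris=False)
plt.title('min_samples_leaf=4')

plt.show()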

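Without any restriction the decision tree overfits: it keeps splitting until every training point, noise included, falls into a region of its own class.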
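The visual impression can be backed by numbers. The sketch below compares training accuracy with accuracy on a separate sample; the held-out set (1000 points drawn with `random_state=0`) is an assumption added for illustration and is not part of the original experiment.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_moons(n_samples=100, noise=0.25, random_state=53)
# hypothetical held-out data from the same distribution
X_test, y_test = make_moons(n_samples=1000, noise=0.25, random_state=0)

default = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
regular = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_train, y_train)

for name, clf in (("no restrictions", default), ("min_samples_leaf=4", regular)):
    print(f"{name:20s} train={clf.score(X_train, y_train):.3f} "
          f"test={clf.score(X_test, y_test):.3f}")
```

The unrestricted tree fits the training set perfectly. A gap between its training and test accuracy is the usual sign of overfitting, and the regularized tree typically narrows that gap, although the exact figures depend on the data drawn.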

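6. Sensitivity of decision trees to the data

If the training data changes, the tree changes as well. For example, because every split is parallel to an axis, rotating the data can produce a completely different decision boundary.

python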

np.random.seed(6)

xs = np.random.rand(100, 2) - 0.5
ys = (xs[:,0] > 0).astype(np.float32) * 2

# rotate by 45 degrees (pi / 4 radians)
angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
xsr = xs.dot(rotation_matrix)

tree_clf_s = DecisionTreeClassifier(random_state=42).fit(xs,  ys)
tree_clf_sr = DecisionTreeClassifier(random_state=42).fit(xsr, ys)

plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf_s,xs,ys,axes=[-0.7, 0.7, -0.7, 0.7], iris=False)
plt.title('original training set')

plt.subplot(122)
plot_decision_boundary(tree_clf_sr, xsr, ys, axes=[-0.7,0.7,-0.7,0.7],iris=False)
plt.title('training set rotated by 45 degrees')

plt.show()
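The effect of the rotation also shows up in the size of the fitted trees. The sketch below rebuilds the same two datasets and compares node counts and depths; the variable names `tree_s` and `tree_sr` are chosen here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

np.random.seed(6)
xs = np.random.rand(100, 2) - 0.5
ys = (xs[:, 0] > 0).astype(np.float32) * 2   # class depends only on the first coordinate

angle = np.pi / 4                            # 45 degrees
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)],
                            [np.sin(angle),  np.cos(angle)]])
xsr = xs.dot(rotation_matrix)

tree_s = DecisionTreeClassifier(random_state=42).fit(xs, ys)
tree_sr = DecisionTreeClassifier(random_state=42).fit(xsr, ys)

for name, clf in (("original", tree_s), ("rotated", tree_sr)):
    print(f"{name:9s} nodes={clf.tree_.node_count}  depth={clf.get_depth()}")
```

On the original data one threshold on the first coordinate separates the classes, so the tree has a root and two leaves. After the rotation the true boundary is diagonal, and the tree has to approximate it with a staircase of axis-parallel splits.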

7. Decision tree regression

Training the model

python

# build the data: a noisy quadratic
np.random.seed(42)
m=200
x = np.random.rand(m,1)
y = 4 * (x - 0.5) ** 2 + np.random.randn(m,1) / 10

# train the model
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(max_depth=2).fit(x,y)
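A regression tree with the default squared-error criterion predicts, for each leaf, the mean target of the training samples that land in it. The sketch below checks this on the same synthetic data; `random_state=42` and the flattening of `y` with `ravel()` are additions made here for repeatability and convenience.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
m = 200
x = np.random.rand(m, 1)
y = (4 * (x - 0.5) ** 2 + np.random.randn(m, 1) / 10).ravel()

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(x, y)

leaf_ids = tree_reg.apply(x)      # index of the leaf each sample falls into
pred = tree_reg.predict(x)

for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    print(f"leaf {leaf}: samples={mask.sum()} "
          f"mean y={y[mask].mean():.4f} prediction={pred[mask][0]:.4f}")
```

For every leaf the prediction matches the mean of the targets in that leaf, which is why the fitted function is piecewise constant.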

Plotting the tree

python

export_graphviz(
    tree_reg,
    out_file='regression_tree.dot',
    feature_names=['x1'],
    rounded=True,
    filled=True
)
# first convert the file: dot -Tpng regression_tree.dot -o regression_tree.png
Image(filename='regression_tree.png', width=600, height=600)

Plotting the regression predictions

python

tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2).fit(x,y)
tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3).fit(x,y)

def plot_regression_predictions(tree_reg,x,y,axes=[0,1,-0.2,1],ylabel='$y$'):
    # evaluate the model on a dense grid so the step function is visible
    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1,1)
    y_pred = tree_reg.predict(x1)
    plt.axis(axes)
    plt.xlabel('$x_1$', fontsize=18)
    if ylabel:
        plt.ylabel(ylabel, fontsize=18, rotation=0)
    plt.plot(x,y,'b.')
    plt.plot(x1,y_pred,'r.-', linewidth=2, label=r'$\hat{y}$')

plt.figure(figsize=(11,4))
plt.subplot(121)

plot_regression_predictions(tree_reg1,x,y)

for split, style in ((0.1973,'k--'),(0.7718,'k--')):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
plt.text(0.21, 0.65, 'Depth=0', fontsize=15)
plt.text(0.01, 0.2,  'Depth=1', fontsize=13)
plt.text(0.65, 0.8,  'Depth=1', fontsize=13)
plt.legend(loc='upper center', fontsize=18)
plt.title('max_depth=2', fontsize=14)

plt.subplot(122)
plot_regression_predictions(tree_reg2,x,y,ylabel=None)
for split, style in ((0.1973,'k--'),(0.7718,'k--')):
    plt.plot([split,split],[-0.2,1],style, linewidth=2)
# extra splits at depth 2, drawn dotted so they stand apart from the shallower ones
for split in (0.0458, 0.1298, 0.2873, 0.9040):
    plt.plot([split,split],[-0.2,1],'k:', linewidth=2)
plt.text(0.3,0.5, 'depth=2',fontsize=13)
plt.title('max_depth=3',fontsize=14)

plt.show()

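The red line in each plot is the regression function learned by the tree: a step function that is constant within each leaf.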
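How closely the step function follows the training points depends on the depth. The sketch below compares the training mean squared error for max_depth 2, 3 and an unrestricted tree on the same synthetic data; the unrestricted case and the `train_mse` dictionary are additions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
m = 200
x = np.random.rand(m, 1)
y = (4 * (x - 0.5) ** 2 + np.random.randn(m, 1) / 10).ravel()

train_mse = {}
for depth in (2, 3, None):   # None means the tree grows without a depth limit
    reg = DecisionTreeRegressor(random_state=42, max_depth=depth).fit(x, y)
    train_mse[depth] = mean_squared_error(y, reg.predict(x))
    print(f"max_depth={depth}: training MSE = {train_mse[depth]:.4f}")
```

The training error can only fall as the depth grows, and the unrestricted tree reproduces every training point exactly. That is the regression counterpart of the overfitting seen in section 5, so a low training error alone does not indicate a better model.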