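Undergraduate Machine Learning Course, Lab 3: Decision Trees for Classification

Lab 3.1: Decision trees for classification

Use sklearn.tree.DecisionTreeClassifier to classify tumors (breast-cancer dataset)
Compute the 10-fold cross-validated accuracy, precision, recall, and F1 score at a maximum depth of 10
Plot how the 10-fold cross-validated accuracy changes as the maximum depth goes from 1 to 10

1. Load the data

python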

import numpy as np
import pandas as pd
data = pd.read_csv('breast-cancer.csv')
print(data.shape)
data.head()

python

data = data.values
# Features: every column from index 2 up to (but excluding) the last one
data_x = data[:, 2:-1]
# Labels: column 1, taken directly as a 1-D array
data_y = data[:, 1]


print(data_x.shape)
print(data_y.shape)

2. Import the model

python

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

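3. Training and prediction

Compute the 10-fold cross-validated accuracy, precision, recall, and F1 score of a decision tree with a maximum depth of 10, using the features data_x and the labels data_y

python

model = DecisionTreeClassifier(max_depth = 10) # max_depth sets the maximum depth of the tree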


prediction = cross_val_predict(model,data_x,data_y,cv = 10)
 
acc1 = accuracy_score(data_y,prediction)
precision1 = precision_score(data_y,prediction,average="macro")
recall1 = recall_score(data_y,prediction,average="macro")
f1 = f1_score(data_y,prediction,average="macro")

print("Decision tree, 10-fold cross-validation metrics")
print("Accuracy:",acc1)
print("Precision:",precision1)
print("Recall:",recall1)
print("F1 score:",f1)
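
The metrics above are computed with average="macro": each class gets its own precision, recall, and F1, and the reported value is the unweighted mean over the classes. The sketch below reproduces that averaging with plain NumPy on a small made-up label vector; the function name macro_scores and the example labels are illustrative assumptions, not part of the lab.

```python
import numpy as np

def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged over the classes with equal weight."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == c) & (y_true == c))  # predicted c, truly c
        fp = np.sum((y_pred == c) & (y_true != c))  # predicted c, truly another class
        fn = np.sum((y_pred != c) & (y_true == c))  # truly c, predicted another class
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)

# Made-up labels for illustration only
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "B", "B", "B", "B", "B", "M"]
precision, recall, f1_macro = macro_scores(y_true, y_pred)
print(precision, recall, f1_macro)
```

Because every class counts equally, a small class influences the macro average as much as a large one, which is worth keeping in mind when the classes are imbalanced.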

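4. Vary the maximum depth and plot the accuracy

Plot how the 10-fold cross-validated accuracy of the decision tree changes as the maximum depth goes from 1 to 10

python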

import matplotlib.pyplot as plt
%matplotlib inline

y = []

for i in range(10):
    model = DecisionTreeClassifier(max_depth = i + 1)
    prediction = cross_val_predict(model,data_x,data_y,cv=10)
    y.append(prediction)
    
x = np.linspace(1,10,10)
test = [accuracy_score(data_y, val) for val in y]

plt.figure()
plt.plot(x,test,'-')
plt.title("DecisionTree's accuracy_score changes with the max_depth")
plt.xlabel("max_depth")    
plt.ylabel("accuracy_score")

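5. Tune the parameters to get the model that generalizes best

Consult the decision tree documentation and tune the parameters to obtain the best model

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

Then report the parameter settings and the resulting generalization metrics below

criterion: the function used to measure the quality of a split. It can be "gini" (Gini impurity) or "entropy" (information gain).

max_depth: the maximum depth of the tree. It controls the complexity of the tree and the risk of overfitting.

min_samples_split: the minimum number of samples required to split an internal node.

min_samples_leaf: the minimum number of samples required at a leaf node.

max_features: the number of features to consider when looking for the best split.

random_state: the seed that controls the randomness.

GridSearchCV automates the tuning: given a grid of candidate values, it cross-validates every combination and reports the best-scoring parameters.

Because the search is exhaustive, it suits small datasets; the number of model fits grows multiplicatively with the grid, so on large datasets it can take impractically long to finish.

python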

from sklearn.model_selection import GridSearchCV

modelfit = DecisionTreeClassifier(max_depth = 10)
param_grid = {'criterion':['gini','entropy'],'max_depth':[10,11,12],
                  'min_samples_leaf':[1,2,3,4,5],'max_features':[1,2,3,4,5],'min_samples_split':[2,3,4,5]}

grid = GridSearchCV(modelfit,param_grid,cv = 10)

grid.fit(data_x,data_y)

best = grid.best_params_  # best parameter combination found by the search
print(best)
# Rebuild the tree with every tuned parameter, including criterion
best_decision_tree_classifier = DecisionTreeClassifier(**best)

# Evaluate the tuned tree (not the earlier `model`) with 10-fold cross-validation
prediction11 = cross_val_predict(best_decision_tree_classifier, data_x, data_y, cv=10)
acc11 = accuracy_score(data_y, prediction11)
precision11 = precision_score(data_y, prediction11, average="macro")
recall11 = recall_score(data_y, prediction11, average="macro")
f1_11 = f1_score(data_y, prediction11, average="macro")
print("-------------------")
print("Accuracy:", acc11)
print("Precision:", precision11)
print("Recall:", recall11)
print("F1 score:", f1_11)
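
A rough sense of the cost of this search comes from counting the grid. The sketch below enumerates the same param_grid with itertools.product from the standard library; this is only a counting aid under that assumption, since GridSearchCV performs its own enumeration internally.

```python
from itertools import product

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 11, 12],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_features': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5],
}

# Every combination of one value per parameter is one candidate model
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# With cv = 10, each candidate is fitted once per fold
n_fits = n_candidates * 10

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

Adding one more value to any list multiplies the count, which is why an exhaustive grid becomes slow quickly. With the default refit setting, GridSearchCV also fits the best candidate once more on the full data.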

Tuned decision tree parameters and the resulting metrics

Parameter settings:

criterion: gini; max_depth: 10; max_features: 5; min_samples_leaf: 5; min_samples_split: 3

Metric scores:

Accuracy: 0.9104
Precision: 0.9033
Recall: 0.9056
F1 score: 0.9044

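Lab 3.2: Decision trees for regression

Use sklearn.tree.DecisionTreeRegressor on the Kaggle house price prediction problem
Compute the 10-fold cross-validated MAE and RMSE on the training set for a decision tree with a maximum depth of 10
Plot the MAE on the training set and the test set as the maximum depth goes from 1 to 30
Choose a reasonable maximum depth and justify the choice

1. Load the data

python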

import pandas as pd
data = pd.read_csv('train.csv')
# Drop every feature (column) that has missing values
data.dropna(axis = 1, inplace = True)

# Keep only the integer-valued features
data = data[[col for col in data.dtypes.index if data.dtypes[col] == 'int64']]

2. Split the dataset

Use 70% for training and 30% for testing

python

from sklearn.utils import shuffle
data_shuffled = shuffle(data, random_state = 32)
split_line = int(len(data_shuffled) * 0.7)
training_data = data_shuffled[:split_line]
testing_data = data_shuffled[split_line:]

3. Import the model

python

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
import numpy as np

4. Select the features and the label

python

features = data.columns.tolist()
target = 'SalePrice'
features.remove(target)

5. Training and prediction

Below, compute the 10-fold cross-validated MAE and RMSE of a decision tree with a maximum depth of 10, trained on all features of the training set

python

model12 = DecisionTreeRegressor(max_depth = 10)
# 10-fold cross-validation on the training set: each sample is predicted by a
# model that was fitted on the other nine folds
predictions12 = cross_val_predict(model12, training_data[features], training_data[target], cv = 10)

mae12 = mean_absolute_error(training_data[target], predictions12)
mse12 = mean_squared_error(training_data[target], predictions12)
rmse12 = np.sqrt(mse12)

print("Mean Absolute Error:", mae12)
print("Mean Squared Error:", mse12)
print("Root Mean Squared Error:", rmse12)
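
To make the two error measures concrete, here is a minimal NumPy sketch on four made-up prices (the numbers are hypothetical, chosen only so the arithmetic is easy to follow). MAE averages the absolute errors, while RMSE squares them first, so a single large error raises RMSE more than MAE.

```python
import numpy as np

# Hypothetical true and predicted prices, for illustration only
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 300.0, 210.0])

errors = y_pred - y_true                 # [10, -10, 0, -40]
mae = np.mean(np.abs(errors))            # (10 + 10 + 0 + 40) / 4
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((100 + 100 + 0 + 1600) / 4)

print("MAE:", mae)
print("RMSE:", rmse)
```

RMSE is never smaller than MAE on the same predictions; the gap between them hints at how unevenly the errors are spread.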

6. Vary the maximum depth and plot the MAE

Plot the training-set and test-set MAE of the decision tree as the maximum depth goes from 1 to 30

python

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")

depths = list(range(1, 31))
train_mae12 = []  # training-set MAE for each depth
y12 = []          # test-set MAE for each depth

for depth in depths:
    model12 = DecisionTreeRegressor(max_depth = depth)
    model12.fit(training_data[features], training_data[target])
    train_predictions12 = model12.predict(training_data[features])
    predictions12 = model12.predict(testing_data[features])

    train_mae12.append(mean_absolute_error(training_data[target], train_predictions12))
    mae12 = mean_absolute_error(testing_data[target], predictions12)
    mse12 = mean_squared_error(testing_data[target], predictions12)
    rmse12 = np.sqrt(mse12)
    y12.append(mae12)

    print('----------------------')
    print("max_depth: ", depth)
    print("Mean Absolute Error:", mae12)
    print("Mean Squared Error:", mse12)
    print("Root Mean Squared Error:", rmse12)

python

plt.figure()
plt.plot(depths, train_mae12, '-', label = "training set")
plt.plot(depths, y12, '-', label = "test set")
plt.title("DecisionTree's MAE changes with the max_depth")
plt.xlabel("max_depth")
plt.ylabel("MAE")
plt.legend()

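7. Choose a reasonable maximum depth and justify the choice

Judging from the curve, a maximum depth of 6 is a reasonable choice. Around depth 6 the MAE is close to its global minimum; increasing the depth further brings no clear improvement, and beyond a certain point the MAE starts to fluctuate. Weighing both efficiency and accuracy, I chose a maximum depth of 6.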
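
The reasoning above (take the shallowest tree whose error is already close to the best one) can be written down as a small rule. The sketch below is one possible formulation, not the method used in the lab: the helper pick_depth, the 2% tolerance, and the MAE values are all hypothetical.

```python
import numpy as np

def pick_depth(depths, maes, tolerance=0.02):
    """Return the smallest depth whose MAE is within `tolerance` (relative) of the lowest MAE."""
    maes = np.asarray(maes, dtype=float)
    threshold = maes.min() * (1 + tolerance)
    for depth, mae in zip(depths, maes):
        if mae <= threshold:
            return depth

# Hypothetical MAE values, for illustration only (not results from this lab)
depths = [1, 2, 3, 4, 5, 6, 7, 8]
maes = [40.0, 33.0, 29.0, 27.0, 26.2, 25.9, 26.0, 26.3]

chosen = pick_depth(depths, maes)
print("chosen max_depth:", chosen)
```

With a tolerance of 0 the rule reduces to picking the depth with the lowest MAE; a small positive tolerance prefers a shallower, cheaper tree when the difference is negligible.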

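Lab 3.3: Implementing a decision tree

Using the LendingClub Safe Loans dataset:

Implement three splitting criteria: information gain, information gain ratio, and Gini index
Train three decision trees, one per criterion, on the given training set
Compute the accuracy, precision, recall, and F1 score of the three trees on the training set and the test set at a maximum depth of 10
Draw the decision tree (optional)

In this part we implement a very simple binary decision tree

1. Load the data

python

# Import libraries
import pandas as pd
import numpy as np
import json
# Load the data
loans = pd.read_csv('lending-club-data.csv', low_memory=False)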

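The quantity we want to predict is described by safe_loans and bad_loans, which mark positive and negative examples. We build safe_loans from bad_loans, setting it to +1 for positive examples (bad_loans == 0) and -1 for negative ones, and then delete the bad_loans column

python

# Preprocess the data and use safe_loans as the label
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
del loans['bad_loans']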

We use only four columns, grade, term, home_ownership, and emp_length, as features and safe_loans as the label, keeping just these five columns of loans

python

features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

2. Split into training and test sets

python

from sklearn.utils import shuffle
loans = shuffle(loans, random_state = 34)

split_line = int(len(loans) * 0.6)
train_data = loans.iloc[: split_line]
test_data = loans.iloc[split_line:]

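3. Feature preprocessing

All of the features are categorical, so they need to be preprocessed with one-hot encoding.

The idea of one-hot encoding is to turn a categorical feature into a vector. Suppose feature $A$ takes three values $\{a, b, c\}$ that are all equivalent. If we represented them with the numbers 1, 2, 3, computations would be biased: algorithms that rely on a distance measure would consider 2 close to 1 and 3 far from 1, even though the three values should be treated alike. The fix is to represent them with three-dimensional vectors, $[1, 0, 0]$ for a, $[0, 1, 0]$ for b, and $[0, 0, 1]$ for c, so that all three are equidistant: the Euclidean distance between any two of them is $\sqrt{2}$. That is the idea behind one-hot encoding.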
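
The distance argument is easy to check numerically. The following sketch compares the pairwise distances under the integer coding 1, 2, 3 with those under the one-hot vectors; the dictionary names are illustrative only.

```python
import numpy as np
from itertools import combinations

integer_code = {"a": 1, "b": 2, "c": 3}
one_hot_code = {
    "a": np.array([1, 0, 0]),
    "b": np.array([0, 1, 0]),
    "c": np.array([0, 0, 1]),
}

integer_dists = {}
one_hot_dists = {}
for u, v in combinations("abc", 2):
    # Absolute difference for the integer coding
    integer_dists[(u, v)] = abs(integer_code[u] - integer_code[v])
    # Euclidean distance for the one-hot coding
    one_hot_dists[(u, v)] = np.linalg.norm(one_hot_code[u] - one_hot_code[v])

print(integer_dists)   # a and c end up twice as far apart as a and b
print(one_hot_dists)   # every pair is sqrt(2) apart
```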

pandas generates one-hot vectors with get_dummies

python

def one_hot_encoding(data, features_categorical):
    '''
    Parameter
    ----------
    data: pd.DataFrame
    
    features_categorical: list(str)
    '''
    
    # Loop over all categorical features
    for cat in features_categorical:
        
        # One-hot encode this column, using the feature name as the prefix
        one_encoding = pd.get_dummies(data[cat], prefix = cat)
        
        # Append the one-hot columns to the existing dataframe
        data = pd.concat([data, one_encoding],axis=1)
        
        # Drop the original categorical column
        del data[cat]
    
    return data

First one-hot encode the training set, then the test set. Note that if feature $A$ takes the values $\{a, b, c\}$ in the training set, three columns are generated, $A\_a$, $A\_b$, and $A\_c$, and a model trained on this set takes exactly these three features into account. If a test sample has $A = d$, its $A\_a$, $A\_b$, and $A\_c$ are all 0; we ignore $A\_d$ because that feature did not exist when the model was trained.

python

train_data = one_hot_encoding(train_data, features)
train_data.head()
one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)
one_hot_features

Next, one-hot encode the test set, keeping only the features that appear in one_hot_features

python

test_data_tmp = one_hot_encoding(test_data, features)
# Create an empty DataFrame with the same columns as the training set
test_data = pd.DataFrame(columns = train_data.columns)
for feature in train_data.columns:
    # If this training-set column also exists in test_data_tmp, copy it over
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        # Otherwise fill the column with zeros
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype = 'uint8')
        
test_data.head()
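
The same column alignment can also be expressed with DataFrame.reindex, which keeps the training columns, drops unseen ones, and fills missing ones with a constant. The sketch below shows this on a tiny made-up feature A; it is an alternative formulation under those assumptions, not the code used in the lab.

```python
import pandas as pd

# Toy data: the test set contains the value "d", which never occurs in training
train = pd.DataFrame({"A": ["a", "b", "c"], "y": [1, -1, 1]})
test = pd.DataFrame({"A": ["a", "d"], "y": [1, -1]})

train_enc = pd.get_dummies(train, columns=["A"], dtype=int)
test_enc = pd.get_dummies(test, columns=["A"], dtype=int)

# Keep exactly the training columns: A_d is dropped, A_b and A_c are added as zeros
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

The second test row (A = d) ends up with zeros in A_a, A_b, and A_c, matching the convention described above.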

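After this step every feature is 0 or 1 and the label is 1 or -1. That completes the preprocessing pipeline.

4. Implement the three splitting criteria

Decision trees have several commonly used splitting criteria, such as information gain, information gain ratio, and the Gini index

We need functions that, given the labels of all samples in a tree node, compute the value of the corresponding criterion

The three criteria above are implemented next

By convention, samples whose feature value is 0 go to the left subtree and samples whose feature value is 1 go to the right subtree

4.1 Information gain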

Information entropy:

$$\mathrm{Ent}(D) = - \sum^{\vert \mathcal{Y} \vert}_{k = 1} p_k \log_2 p_k$$

Information gain:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum^{V}_{v=1} \frac{\vert D^v \vert}{\vert D \vert} \mathrm{Ent}(D^v)$$

When computing the entropy, we adopt the convention that if $p = 0$ then $p \log_2 p = 0$

python

def information_entropy(labels_in_node):
    '''
    Compute the information entropy of the current node
    
    Parameter
    ----------
    labels_in_node: np.ndarray, e.g. [-1, 1, -1, 1, 1]
    
    Returns
    ----------
    float: information entropy
    '''
    # Total number of samples
    num_of_samples = labels_in_node.shape[0]
    
    if num_of_samples == 0:
        return 0
    
    # Number of samples labelled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    
    # Number of samples labelled -1
    num_of_negative = len(labels_in_node[labels_in_node == -1])
    
    # Proportion of positive samples
    prob_positive = num_of_positive / num_of_samples
    
    # Proportion of negative samples
    prob_negative = num_of_negative / num_of_samples
    
    # By convention, p * log2(p) = 0 when p = 0
    if prob_positive == 0:
        positive_part = 0
    else:
        positive_part = prob_positive * np.log2(prob_positive)
    
    if prob_negative == 0:
        negative_part = 0
    else:
        negative_part = prob_negative * np.log2(prob_negative)
    
    return - ( positive_part + negative_part )

The following are six test cases

python

# Entropy test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
print(information_entropy(example_labels)) # 0.97095

# Entropy test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(information_entropy(example_labels)) # 0.86312
    
# Entropy test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(information_entropy(example_labels)) # 0.86312

# Entropy test case 4
example_labels = np.array([-1] * 9 + [1] * 8)
print(information_entropy(example_labels)) # 0.99750

# Entropy test case 5
example_labels = np.array([1] * 8)
print(information_entropy(example_labels)) # 0

# Entropy test case 6
example_labels = np.array([])
print(information_entropy(example_labels)) # 0
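
The function above hard-codes the two labels 1 and -1. As a point of comparison, here is a sketch of the same entropy formula for any number of classes, using np.unique to count the labels. The name entropy is an assumption for this example; since np.unique only returns classes that actually occur, the $p = 0$ case never arises.

```python
import numpy as np

def entropy(labels):
    """Information entropy (base 2) of an array of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    # Count how often each distinct label occurs
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

print(entropy([-1, -1, 1, 1, 1]))        # same labels as test case 1 above
print(entropy(["a", "b", "c", "d"]))     # four equally likely classes
```

For two classes this agrees with information_entropy; four equally likely classes give 2 bits.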

Next, complete the function that computes the information gain of every feature
There are three parts to fill in

python

def compute_information_gains(data, features, target, annotate = False):
    '''
    Compute the information gain of every feature
    
    Parameter
    ----------
        data: pd.DataFrame, the samples, with both feature and label columns
        
        features: list(str), list of feature names
        
        target: str, name of the label column
        
        annotate: boolean, whether to print the information gain of every feature, default False
        
    Returns
    ----------
        information_gains: dict, key: str, feature name
                                 value: float, information gain
    '''
    
    # Store the information gain of each feature in a dict
    # The key is the feature name, the value is the information gain
    information_gains = dict()
    
    # Loop over all features and compute the information gain of each one
    for feature in features:
        
        # Left subtree: all samples whose value of this feature is 0
        left_split_target = data[data[feature] == 0][target]
        
        # Right subtree: all samples whose value of this feature is 1
        right_split_target =  data[data[feature] == 1][target]
            
        # Entropy of the left subtree
        left_entropy = information_entropy(left_split_target)
        
        # Weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Entropy of the right subtree
        right_entropy = information_entropy(right_split_target)
        
        # Weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))
        
        # Entropy of the current node
        current_entropy = information_entropy(data[target])
            
        # Information gain of splitting on this feature
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        
        # Store the feature name and its gain as a key-value pair
        information_gains[feature] = gain
        
        if annotate:
            print(" ", feature, gain)
            
    return information_gains

python

# Information gain test case 1
print(compute_information_gains(train_data, one_hot_features, target)['grade_A']) # 0.01759

# Information gain test case 2
print(compute_information_gains(train_data, one_hot_features, target)['term_ 60 months']) # 0.01429

# Information gain test case 3
print(compute_information_gains(train_data, one_hot_features, target)['grade_B']) # 0.00370

4.2 Information gain ratio

Information gain ratio:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

where

$$\mathrm{IV}(a) = - \sum^V_{v=1} \frac{\vert D^v \vert}{\vert D \vert} \log_2 \frac{\vert D^v \vert}{\vert D \vert}$$

Complete the function that computes the information gain ratio of every feature
There are five parts to fill in

python

def compute_information_gain_ratios(data, features, target, annotate = False):
    '''
    Compute and store the information gain ratio of every feature
    
    Parameter
    ----------
    data: pd.DataFrame, data with both feature and label columns
    
    features: list(str), list of feature names
    
    target: str, name of the label column
    
    annotate: boolean, default False, whether to print the values
    
    Returns
    ----------
    gain_ratios: dict, key: str, feature name
                       value: float, information gain ratio
    '''
    
    gain_ratios = dict()
    
    # Loop over all features and evaluate the criterion for each one
    for feature in features:
        
        # Left subtree: all samples whose value of this feature is 0
        left_split_target = data[data[feature] == 0][target]
        
        # Right subtree: all samples whose value of this feature is 1
        right_split_target =  data[data[feature] == 1][target]
            
        # Entropy of the left subtree
        left_entropy = information_entropy(left_split_target)
        
        # Weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Entropy of the right subtree
        right_entropy = information_entropy(right_split_target)
        
        # Weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))
        
        # Entropy of the current node
        current_entropy = information_entropy(data[target])
        
        # Information gain of the current node
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)
        
        # Term of the IV sum for feature value 0 (the minus sign is applied once, below)
        if left_weight == 0:
            left_IV = 0
        else:
            left_IV = left_weight * np.log2(left_weight)
        
        # Term of the IV sum for feature value 1
        if right_weight == 0:
            right_IV = 0
        else:
            right_IV = right_weight * np.log2(right_weight)
        
        # IV is the negative of the sum of the per-subtree terms, so it is non-negative
        IV = - (left_IV + right_IV)
            
        # Information gain ratio of splitting on this feature
        # A tiny constant is added to the denominator so that IV = 0 does not produce np.inf
        gain_ratio = gain / (IV + np.finfo(np.longdouble).eps)
        
        # Store the information gain ratio
        gain_ratios[feature] = gain_ratio
        
        if annotate:
            print(" ", feature, gain_ratio)
            
    return gain_ratios

python

# Information gain ratio test case 1
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_A']) # 0.02573

# Information gain ratio test case 2
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_B']) # 0.00417

# Information gain ratio test case 3
print(compute_information_gain_ratios(train_data, one_hot_features, target)['term_ 60 months']) # 0.01970
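
For a binary split, $\mathrm{IV}(a)$ depends only on how the samples are distributed between the two subtrees. The sketch below evaluates it for a balanced and a skewed split; the helper intrinsic_value and the gain of 0.01 are hypothetical, introduced only to show how the same gain is rescaled.

```python
import numpy as np

def intrinsic_value(weights):
    """IV(a) = -sum_v w_v * log2(w_v), where w_v is the fraction of samples in subtree v."""
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]  # convention: a term with weight 0 contributes 0
    return float(-np.sum(w * np.log2(w)))

iv_balanced = intrinsic_value([0.5, 0.5])  # samples split evenly
iv_skewed = intrinsic_value([0.9, 0.1])    # 90% of the samples on one side

gain = 0.01  # hypothetical information gain, identical for both splits
print("balanced:", iv_balanced, gain / iv_balanced)
print("skewed:  ", iv_skewed, gain / iv_skewed)
```

IV is largest for an even split and shrinks as the split becomes lopsided, so for the same gain a lopsided binary split receives the higher gain ratio.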

4.3 Gini index

Gini value of dataset $D$:

$$\begin{aligned} \mathrm{Gini}(D) & = \sum^{\vert \mathcal{Y} \vert}_{k=1} \sum_{k' \neq k} p_k p_{k'}\\ & = 1 - \sum^{\vert \mathcal{Y} \vert}_{k=1} p^2_k. \end{aligned}$$

Gini index of attribute $a$:

$$\mathrm{Gini\_index}(D, a) = \sum^V_{v = 1} \frac{\vert D^v \vert}{\vert D \vert} \mathrm{Gini}(D^v)$$

Complete the computation of the Gini value of a dataset
There are three parts to fill in

python

def gini(labels_in_node):
    '''
    Compute the Gini value of the samples in a node
    
    Parameters
    ----------
    labels_in_node: np.ndarray, labels of the samples, e.g. [-1, -1, 1, 1, 1]
    
    Returns
    ---------
    gini: float, Gini value
    '''
    
    # Total number of samples
    num_of_samples = labels_in_node.shape[0]
    
    if num_of_samples == 0:
        return 0
    
    # Number of samples labelled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    
    # Number of samples labelled -1
    num_of_negative = len(labels_in_node[labels_in_node == -1])
    
    # Proportion of positive samples
    prob_positive = num_of_positive / num_of_samples
    
    # Proportion of negative samples
    prob_negative = num_of_negative / num_of_samples
    
    # Gini value
    gini = 1 - (prob_positive ** 2 + prob_negative ** 2)
    
    return gini

python

# Gini value test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
print(gini(example_labels)) # 0.48

# Gini value test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(gini(example_labels)) # 0.40816
    
# Gini value test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(gini(example_labels)) # 0.40816

# Gini value test case 4
example_labels = np.array([-1] * 9 + [1] * 8)
print(gini(example_labels)) # 0.49827

# Gini value test case 5
example_labels = np.array([1] * 8)
print(gini(example_labels)) # 0

# Gini value test case 6
example_labels = np.array([])
print(gini(example_labels)) # 0

Then compute the Gini index of every feature
There are three parts to fill in

python

def compute_gini_indices(data, features, target, annotate = False):
    '''
    Compute the Gini index obtained when splitting on each feature
    
    Parameter
    ----------
    data: pd.DataFrame, data with both feature and label columns
    
    features: list(str), list of feature names
    
    target: str, name of the label column
    
    annotate: boolean, default False, whether to print the values
    
    Returns
    ----------
    gini_indices: dict, key: str, feature name
                       value: float, Gini index
    '''
    
    gini_indices = dict()
    # Loop over all features and evaluate the criterion for each one
    for feature in features:
        # Left subtree: all samples whose value of this feature is 0
        left_split_target = data[data[feature] == 0][target]
        
        # Right subtree: all samples whose value of this feature is 1
        right_split_target =  data[data[feature] == 1][target]
            
        # Gini value of the left subtree
        left_gini = gini(left_split_target.values)
        
        # Weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Gini value of the right subtree
        right_gini = gini(right_split_target.values)
        
        # Weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))
        
        # Gini index of splitting the current node on this feature
        gini_index = left_weight * left_gini + right_weight * right_gini
        
        # Store the result
        gini_indices[feature] = gini_index
        
        if annotate:
            print(" ", feature, gini_index)
            
    return gini_indices

python

# Gini index test case 1
print(compute_gini_indices(train_data, one_hot_features, target)['grade_A']) # 0.30095

# Gini index test case 2
print(compute_gini_indices(train_data, one_hot_features, target)['grade_B']) # 0.30568

# Gini index test case 3
print(compute_gini_indices(train_data, one_hot_features, target)['term_ 36 months']) # 0.30055
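
To see the weighted sum in the Gini index formula at work, the sketch below splits eight made-up samples on one binary feature and computes the index by hand. The arrays and the helper gini_value are hypothetical and independent of the LendingClub data.

```python
import numpy as np

def gini_value(labels):
    """Gini(D) = 1 - sum_k p_k^2 for an array of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(1.0 - np.sum(probs ** 2))

# Made-up binary feature and labels, for illustration only
feature = np.array([0, 0, 0, 1, 1, 1, 1, 1])
labels = np.array([-1, -1, 1, 1, 1, 1, 1, -1])

left = labels[feature == 0]    # [-1, -1, 1]       -> Gini = 4/9
right = labels[feature == 1]   # [1, 1, 1, 1, -1]  -> Gini = 8/25

# Weighted average of the two subtree Gini values
gini_index = (len(left) / len(labels)) * gini_value(left) \
           + (len(right) / len(labels)) * gini_value(right)

print("before split:", gini_value(labels))
print("after split: ", gini_index)
```

A lower Gini index means purer subtrees, which is why the best feature is the one with the smallest value.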

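5. Select the best feature

With the three splitting criteria implemented, the next step is the function that picks the best feature
There are three parts to fill in

python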

def best_splitting_feature(data, features, target, criterion = 'gini', annotate = False):
    '''
    Given a splitting criterion and the data, find the best feature to split on
    
    Parameters
    ----------
    data: pd.DataFrame, data with both feature and label columns
    
    features: list(str), list of feature names
    
    target: str, name of the label column
    
    criterion: str, which criterion to use, one of 'information_gain', 'gain_ratio', 'gini'
    
    annotate: boolean, default False, whether to print the values
    
    Returns
    ----------
    best_feature: str, name of the best splitting feature
    
    '''
    if criterion == 'information_gain':
        if annotate:
            print('using information gain')
        
        # Information gain of every candidate feature
        information_gains = compute_information_gains(data, features, target, annotate)
    
        # information_gains is a dict; the best feature is the key with the largest value
        best_feature = max(information_gains, key=information_gains.get)
        return best_feature

    elif criterion == 'gain_ratio':
        if annotate:
            print('using information gain ratio')
        
        # Information gain ratio of every candidate feature
        gain_ratios = compute_information_gain_ratios(data, features, target, annotate)
    
        # The best feature is the one with the largest gain ratio
        best_feature = max(gain_ratios,key=gain_ratios.get)

        return best_feature
    
    elif criterion == 'gini':
        if annotate:
            print('using gini')
        
        # Gini index of every candidate feature
        gini_indices = compute_gini_indices(data, features, target, annotate)
        
        # The best feature is the one with the smallest Gini index
        best_feature = min(gini_indices, key=gini_indices.get)
        return best_feature
    else:
        raise ValueError("Invalid criterion: %s" % criterion)
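
The selection step relies on max and min with key=dict.get, which returns the key whose value is largest or smallest. A minimal sketch with made-up scores (the feature names and numbers are hypothetical):

```python
# Hypothetical criterion values per feature, for illustration only
gains = {"feature_x": 0.018, "feature_y": 0.004, "feature_z": 0.014}
gini_indices = {"feature_x": 0.301, "feature_y": 0.306, "feature_z": 0.300}

# Information gain and gain ratio: larger is better
best_by_gain = max(gains, key=gains.get)

# Gini index: smaller is better
best_by_gini = min(gini_indices, key=gini_indices.get)

print(best_by_gain, best_by_gini)
```

Iterating a dict yields its keys, and key=gains.get ranks each key by its value; on ties, max and min return the first key encountered.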

6. Check whether all samples in a node belong to the same class

python

def intermediate_node_num_mistakes(labels_in_node):
    '''
    Count the samples of the minority class in a tree node, e.g. for the input [1, 1, -1, -1, 1] the result is 2
    
    Parameter
    ----------
    labels_in_node: np.ndarray, pd.Series
    
    Returns
    ----------
    int: number of minority-class samples
    
    '''
    # An empty array has no minority samples
    if len(labels_in_node) == 0:
        return 0
    
    # Number of samples labelled 1
    num_of_one = np.sum(labels_in_node == 1)
    
    # Number of samples labelled -1
    num_of_minus_one = np.sum(labels_in_node == -1)
    
    # The smaller of the two counts is the size of the minority class
    return min(num_of_one, num_of_minus_one)

python

# Test case 1
print(intermediate_node_num_mistakes(np.array([1, 1, -1, -1, -1]))) # 2

# Test case 2
print(intermediate_node_num_mistakes(np.array([]))) # 0

# Test case 3
print(intermediate_node_num_mistakes(np.array([1]))) # 0

7. Create a leaf node

python

def create_leaf(target_values):
    '''
    Determine the label of the current leaf node and store the leaf information in a dict
    
    Parameter:
    ----------
    target_values: pd.Series, labels of the samples in the current leaf node

    Returns:
    ----------
    leaf: dict, representing a leaf node,
            leaf['splitting_feature'], None, a leaf has no splitting feature
            leaf['left'], None, a leaf has no left subtree
            leaf['right'], None, a leaf has no right subtree
            leaf['is_leaf'], True, whether the node is a leaf
            leaf['prediction'], int, the prediction of this leaf
    '''
    # Create the leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True}
   
    # Count the -1 and +1 labels in the node
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])    

    # The leaf predicts the majority class; a tie is resolved as -1
    if num_ones > num_minus_ones:
        leaf['prediction'] = 1
    else:
        leaf['prediction'] = -1

    # Return the leaf node
    return leaf

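8. Build the decision tree recursively

The recursion stops under three conditions:

If all samples in the node share the same label, the node needs no further splitting and becomes a leaf
If every feature has already been used higher up, so no feature remains to split the samples in this node, the node becomes a leaf
If the depth of the current node has reached the maximum depth we set for the tree, the node becomes a leaf

python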

def decision_tree_create(data, features, target, criterion = 'gini', current_depth = 0, max_depth = 10, annotate = False):
    '''
    Parameter:
    ----------
    data: pd.DataFrame, the data

    features: list(str), list of feature names

    target: str, name of the label column

    criterion: str, splitting criterion, one of 'information_gain', 'gain_ratio', 'gini'

    current_depth: int, current depth, tracked across the recursive calls

    max_depth: int, maximum depth of the tree; the recursion stops once it is reached

    Returns:
    ----------
    dict, dict['is_leaf']          : False, the current node is not a leaf
          dict['prediction']       : None, a non-leaf node has no prediction
          dict['splitting_feature']: splitting_feature, the feature this node splits on
          dict['left']             : dict
          dict['right']            : dict
    '''
    
    if criterion not in ['information_gain', 'gain_ratio', 'gini']:
        raise ValueError("Invalid criterion: %s" % criterion)
    
    # Work on a copy of the feature list; each feature is removed once it has been used for a split
    remaining_features = features[:]
    
    # Extract the labels
    target_values = data[target]
    if annotate:
        print("-" * 50)
        print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))

    # Stopping condition 1
    # All samples in the node belong to one class, i.e. the minority class is empty
    # intermediate_node_num_mistakes, written earlier, performs this check
    if intermediate_node_num_mistakes(target_values) == 0:
        if annotate:
            print("Stopping condition 1 reached.")
        return create_leaf(target_values)   # create a leaf node
    
    # Stopping condition 2
    # No features are left to split on, i.e. remaining_features is empty
    if not remaining_features:
        if annotate:
            print("Stopping condition 2 reached.")
        return create_leaf(target_values)   # create a leaf node
    
    # Stopping condition 3
    # The current depth has reached the required maximum depth
    if current_depth == max_depth:
        if annotate:
            print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values)   # create a leaf node

    # Find the best splitting feature with best_splitting_feature
    splitting_feature = best_splitting_feature(data, remaining_features, target, criterion, annotate)
    # Split the data in two using that feature
    # Data for the left subtree
    left_split = data[data[splitting_feature] == 0]
    
    # Data for the right subtree
    right_split = data[data[splitting_feature] == 1]
    # The split is done, so remove this feature from the remaining features
    remaining_features.remove(splitting_feature)
    
    # Print the feature used for the split and the sample counts of the left and right subtrees
    if annotate:
        print("Split on feature %s. (%s, %s)" % (\
                          splitting_feature, len(left_split), len(right_split)))
        
    # If the split sends every sample into a single subtree, turn that subtree into a leaf directly
    # Case 1: every sample went to the left subtree
    if len(left_split) == len(data):
        if annotate:
            print("Creating leaf node.")
        return create_leaf(left_split[target])
    
    # Case 2: every sample went to the right subtree
    if len(right_split) == len(data):
        if annotate:
            print("Creating leaf node.")
        return create_leaf(right_split[target])
        
    # Recursively build the left subtree
    left_tree = decision_tree_create(left_split, remaining_features, target, criterion, current_depth + 1, max_depth, annotate)
    
    # Recursively build the right subtree
    right_tree = decision_tree_create(right_split, remaining_features, target, criterion, current_depth + 1, max_depth, annotate)

    # Return the non-leaf node
    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

Train a model

python

my_decision_tree = decision_tree_create(train_data, one_hot_features, target, 'gini', max_depth = 6, annotate = False)
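
Since the tree is a nested dict, it can be inspected with a short recursive walk, which is one lightweight way to approach the optional "draw the tree" task. The sketch below assumes the dict layout returned by decision_tree_create ('is_leaf', 'prediction', 'splitting_feature', 'left', 'right'); the helpers tree_to_lines, count_leaves, and make_leaf, as well as the toy tree, are hypothetical.

```python
def tree_to_lines(tree, indent=0):
    """Render a dict-based binary tree as indented text lines."""
    pad = "  " * indent
    if tree["is_leaf"]:
        return [f"{pad}predict {tree['prediction']}"]
    lines = [f"{pad}{tree['splitting_feature']} == 0:"]
    lines += tree_to_lines(tree["left"], indent + 1)
    lines.append(f"{pad}{tree['splitting_feature']} == 1:")
    lines += tree_to_lines(tree["right"], indent + 1)
    return lines

def count_leaves(tree):
    """Number of leaf nodes in the tree."""
    if tree["is_leaf"]:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])

def make_leaf(prediction):
    return {"is_leaf": True, "prediction": prediction,
            "splitting_feature": None, "left": None, "right": None}

# Hand-built toy tree, for illustration only
toy_tree = {
    "is_leaf": False, "prediction": None, "splitting_feature": "grade_A",
    "left": {"is_leaf": False, "prediction": None, "splitting_feature": "term_ 36 months",
             "left": make_leaf(-1), "right": make_leaf(1)},
    "right": make_leaf(1),
}

lines = tree_to_lines(toy_tree)
print("\n".join(lines))
print("leaves:", count_leaves(toy_tree))
```

Calling tree_to_lines(my_decision_tree) in the notebook would print the trained tree in the same format.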

9. Prediction

Next we complete the prediction function

python

def classify(tree, x, annotate = False):
    '''
    Predict recursively, one sample at a time
    
    Parameters
    ----------
    tree: dict
    
    x: pd.Series, the sample to predict
    
    annotate: boolean, whether to print the decision path
    
    Returns
    ----------
    the predicted label
    '''
    if tree['is_leaf']:
        if annotate:
            print("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

We test it on the first sample of the test set

python

test_sample = test_data.iloc[0]
print(test_sample)

python

print('True class: %s ' % (test_sample['safe_loans']))
print('Predicted class: %s ' % classify(my_decision_tree, test_sample))

Print the decision path the tree follows

python

classify(my_decision_tree, test_sample, annotate=True)

10. Evaluate the model on the test set

python

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

First write a batch prediction function. It takes a pd.DataFrame such as the whole test set and returns an np.ndarray holding the model's predictions
There is one part to fill in

python

def predict(tree, data):
    '''
    Iterate over the rows of data, predict each sample, and return the predictions as an np.ndarray
    
    Parameter
    ----------
    tree: dict, the model
    
    data: pd.DataFrame, the data
    
    Returns
    ----------
    predictions: np.ndarray, the model's predictions for these samples
    '''
    predictions = np.zeros(len(data)) # same length as data
    
    # Classify the samples one row at a time
    for i in range(len(data)):
        predictions[i] = classify(tree,data.iloc[i])
    
    return predictions
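
Row-by-row access with data.iloc[i] builds a pandas Series for every sample, which tends to be slow on large test sets. One possible alternative, sketched below under the assumption that the tree uses the same dict layout as above, converts the frame to plain dicts once with to_dict("records") and walks the tree iteratively. The names classify_one and predict_records and the toy tree are hypothetical.

```python
import numpy as np
import pandas as pd

def classify_one(tree, x):
    """Walk from the root to a leaf; x is any mapping from feature name to 0/1."""
    node = tree
    while not node["is_leaf"]:
        if x[node["splitting_feature"]] == 0:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]

def predict_records(tree, data):
    """Predict every row of a DataFrame by iterating over plain dict records."""
    return np.array([classify_one(tree, row) for row in data.to_dict("records")])

# Toy tree and data, for illustration only
toy_tree = {"is_leaf": False, "prediction": None, "splitting_feature": "grade_A",
            "left": {"is_leaf": True, "prediction": -1},
            "right": {"is_leaf": True, "prediction": 1}}
toy_data = pd.DataFrame({"grade_A": [0, 1, 1], "safe_loans": [-1, 1, -1]})

toy_predictions = predict_records(toy_tree, toy_data)
print(toy_predictions)
```

The predictions are the same as with the recursive classify; only the way each row is accessed changes.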

11. Compute the four metrics of the models built with each splitting criterion and fill in the table below

The maximum depth of the tree is 6

python

# YOUR CODE HERE
criteria = ['information_gain', 'gain_ratio', 'gini']
for c in criteria:
    print(f'Using criterion: {c}')
    my_decision_tree=decision_tree_create(train_data, one_hot_features, target, c, max_depth=6, annotate=False)
    predictions = predict(my_decision_tree, test_data)
    acc33 = accuracy_score(test_data[target], predictions)
    precision33 = precision_score(test_data[target], predictions, average='macro')
    recall33 = recall_score(test_data[target], predictions, average='macro')
    f133 = f1_score(test_data[target], predictions, average='macro')
    print(f'Accuracy: {acc33}\nPrecision: {precision33}\nRecall: {recall33}\nF1 Score: {f133}\n')

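Lab 3.4: Applying random forests to iris classification

Load the iris dataset from sklearn and use the first two features for classification

Evaluate with four metrics: accuracy, precision, recall, and F1

Visualize the classification result

python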

import sklearn
import numpy as np

1. Load the data

python

from sklearn.datasets import load_iris

feat,label = load_iris(return_X_y=True)
data = load_iris()
feat_names = data['feature_names']
label_names = data['target_names']
print(feat_names)
print(label_names)

Select the first two features

python

feat = feat[:,:2]
feat.shape

2. Import the model

python

from sklearn.ensemble import RandomForestClassifier

3. Train the model

python

from sklearn.model_selection import train_test_split
# Split into training and test sets
feat_train, feat_test, label_train, label_test = train_test_split(feat, label, test_size=0.2, random_state=42)
# Create the random forest classifier
rf_classifier34 = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model
rf_classifier34.fit(feat_train, label_train)
# Predict on the test set
label_pred = rf_classifier34.predict(feat_test)

4. Compute the evaluation metrics

python

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

python

# YOUR CODE HERE
acc34 = accuracy_score(label_test, label_pred)
precision34 = precision_score(label_test, label_pred, average='macro')
recall34 = recall_score(label_test, label_pred, average='macro')
f134 = f1_score(label_test, label_pred, average='macro')
print(f'Accuracy: {acc34}\nPrecision: {precision34}\nRecall: {recall34}\nF1 Score: {f134}\n')

5. Visualize the classification result

python

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Build a grid over the feature space for drawing the decision regions
x_min, x_max = feat[:, 0].min() - 1, feat[:, 0].max() + 1
y_min, y_max = feat[:, 1].min() - 1, feat[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

# Predict the class of every grid point
Z = rf_classifier34.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

cmap_background = ListedColormap(['#FFAAAA', '#AAAAFF', '#AAFFAA'])
cmap_points = ListedColormap(['#FF0000', '#0000FF', '#00FF00'])

# Draw the decision regions
plt.contourf(xx, yy, Z, cmap=cmap_background, alpha=0.3)

# Draw the test samples, colored by their true class
scatter = plt.scatter(feat_test[:, 0], feat_test[:, 1], c=label_test, cmap=cmap_points, edgecolors='k', marker='o', s=80)

# Title and axis labels
plt.title('Random Forest Classifier - Iris Dataset')
plt.xlabel(feat_names[0])
plt.ylabel(feat_names[1])

# Legend: take one handle per class from the scatter plot so every class is listed
legend_labels = [f'{label_names[i]} ({i})' for i in range(len(label_names))]
handles, _ = scatter.legend_elements()
plt.legend(handles, legend_labels)

# Show the figure
plt.show()

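Experiment 3.5 Implement AdaBoost and Classify Tumors

Load the breast cancer dataset from sklearn.

Implement AdaBoost yourself. Choose your own base learners (off-the-shelf Scikit-learn classifiers are allowed) and build at least two AdaBoost variants with different base learners.

Compare them using four metrics: Accuracy, Precision, Recall, and F1. Randomly select 70% of the data as the training set and 30% as the test set, and present the results in a table.

Compare against the results of Scikit-learn's AdaBoostClassifier, using the same base learners as in your own implementation.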

python 复制代码

import sklearn
import numpy as np

Load the data

python 复制代码

from sklearn.datasets import load_breast_cancer

feat,label = load_breast_cancer(return_X_y=True)

feat.shape

Split the dataset: 70% training set, 30% test set (random seed fixed at 32)

python 复制代码

from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(feat, label, test_size = 0.3, random_state = 32)
trainX.shape

Implement AdaBoost

python 复制代码

import copy
# YOUR CODE HERE
class MyAdaBoostClassifier:
    def __init__(self, base_estimator, n_estimators=50, learning_rate=1.0, n_classes=2):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.lr = learning_rate
        self.R = n_classes
        # One independent copy of the base learner per boosting round
        self.estimators = [copy.deepcopy(base_estimator) for _ in range(n_estimators)]
        self.alphas = []  # weight of each weak learner

    def fit(self, X, y):
        self.alphas = []  # reset so that calling fit again does not accumulate weights
        sample_weight = np.ones(len(X)) / len(X)  # initialize the sample weights to 1/N
        for model in self.estimators:
            model.fit(X, y, sample_weight=sample_weight)  # train the weak learner
            miss = model.predict(X) != y
            error = np.sum(sample_weight * miss)  # weighted training error
            error = np.clip(error, 1e-10, 1 - 1e-10)  # avoid log(0) and division by zero
            alpha = self.lr * (np.log((1 - error) / error) + np.log(self.R - 1))  # learner weight
            sample_weight *= np.exp(alpha * miss)  # up-weight the misclassified samples
            sample_weight /= np.sum(sample_weight)  # renormalize the sample weights
            self.alphas.append(alpha)
        return self

    def predict(self, X):  # assumes the class labels are encoded as 0, 1, ...
        # Stack the class probabilities predicted by every weak learner
        probas = np.asarray([model.predict_proba(X) for model in self.estimators])
        # Alpha-weighted average of the class probabilities. np.average already
        # divides by the sum of the weights, so no further normalization is needed.
        probas = np.average(probas, weights=np.array(self.alphas), axis=0)
        return np.argmax(probas, axis=1)
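
The update rules coded in fit and predict above can be written out as follows. This is a restatement of the code, not a derivation. The symbols w_i (sample weight), h_m (the m-th weak learner), p_m (its predict_proba output), and M (number of rounds) are notation introduced here; η stands for learning_rate, R for n_classes, and N for the number of training samples.

```latex
\begin{align*}
w_i^{(1)} &= \frac{1}{N}, \\
\varepsilon_m &= \sum_{i=1}^{N} w_i^{(m)} \, \mathbb{1}\!\left[h_m(x_i) \neq y_i\right], \\
\alpha_m &= \eta \left( \ln\frac{1-\varepsilon_m}{\varepsilon_m} + \ln(R-1) \right), \\
w_i^{(m+1)} &= \frac{w_i^{(m)} \exp\!\left(\alpha_m \, \mathbb{1}\!\left[h_m(x_i) \neq y_i\right]\right)}
                    {\sum_{j=1}^{N} w_j^{(m)} \exp\!\left(\alpha_m \, \mathbb{1}\!\left[h_m(x_j) \neq y_j\right]\right)}, \\
\hat{y}(x) &= \arg\max_{k} \; \frac{\sum_{m=1}^{M} \alpha_m \, p_m(k \mid x)}{\sum_{m=1}^{M} \alpha_m}.
\end{align*}
```

With R = 2 the term ln(R-1) vanishes, which leaves the familiar binary AdaBoost learner weight. Note that the last line averages class probabilities, whereas the textbook SAMME rule votes with hard labels, so this class is best read as a variant of SAMME rather than the standard algorithm.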

Use a decision tree as the base classifier

python 复制代码

from sklearn.tree import DecisionTreeClassifier
# YOUR CODE HERE
base_tree = DecisionTreeClassifier(max_depth=1)  # You can customize the base estimator
Mymodel1 = MyAdaBoostClassifier(base_estimator=base_tree, n_estimators=50, learning_rate=1.0)
Mymodel1.fit(trainX, trainY)

Use logistic regression as the base classifier

python 复制代码

from sklearn.linear_model import LogisticRegression
# YOUR CODE HERE
base_lr = LogisticRegression(max_iter=10000)  # the default max_iter=100 usually does not converge on these unscaled features
Mymodel2 = MyAdaBoostClassifier(base_estimator=base_lr, n_estimators=50, learning_rate=1.0)
Mymodel2.fit(trainX, trainY)

Compute the evaluation metrics

python 复制代码

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
prediction1 = Mymodel1.predict(testX)
prediction2 = Mymodel2.predict(testX)
# YOUR CODE HERE
acc351 = accuracy_score(testY, prediction1)
precision351 = precision_score(testY, prediction1)
recall351 = recall_score(testY, prediction1)
f1351 = f1_score(testY, prediction1)
print(f'Accuracy: {acc351}\nPrecision: {precision351}\nRecall: {recall351}\nF1 Score: {f1351}\n')
acc352 = accuracy_score(testY, prediction2)
precision352 = precision_score(testY, prediction2)
recall352 = recall_score(testY, prediction2)
f1352 = f1_score(testY, prediction2)
print(f'Accuracy: {acc352}\nPrecision: {precision352}\nRecall: {recall352}\nF1 Score: {f1352}\n')
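
For binary labels, precision_score, recall_score, and f1_score report the class encoded as 1 by default (pos_label=1). In load_breast_cancer, label 1 is the benign class, so the precision and recall printed above describe benign tumors. A minimal sketch of a per-class view follows, assuming a single decision stump as a stand-in model; the variable names are illustrative and not part of the original notebook:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=32
)

# Stand-in model for illustration only; any fitted classifier can be used here.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
y_pred = stump.predict(X_test)

# target_names is ordered by label: index 0 is malignant, index 1 is benign.
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)

# Recall of the malignant class, obtained by changing pos_label.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print("Malignant recall:", malignant_recall)
```

The same two calls can be applied to prediction1 and prediction2 to see how each ensemble handles the malignant class.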

Use the sklearn model

python 复制代码

from sklearn.ensemble import AdaBoostClassifier
# YOUR CODE HERE
# Decision tree as the base learner
base_classifier = DecisionTreeClassifier(max_depth=1)
ada_boost_model = AdaBoostClassifier(base_classifier, n_estimators=50, learning_rate=1.0)
ada_boost_model.fit(trainX, trainY)

# Predict on the test set
predictions353 = ada_boost_model.predict(testX)

# Compute the evaluation metrics
acc353 = accuracy_score(testY, predictions353)
precision353 = precision_score(testY, predictions353)
recall353 = recall_score(testY, predictions353)
f1353 = f1_score(testY, predictions353)
print(f'Accuracy: {acc353}\nPrecision: {precision353}\nRecall: {recall353}\nF1 Score: {f1353}\n')
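
The task also asks for a results table and for a Scikit-learn baseline with each base learner, while the cell above covers only the decision tree. A minimal sketch under a few assumptions: the split is recreated so the block runs on its own, LogisticRegression gets max_iter=10000 (an assumption, to let the solver converge on unscaled features), random_state=32 is passed to AdaBoostClassifier for repeatability, and the evaluate helper and the row labels are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same 70/30 split and seed as in the experiment.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=32)


def evaluate(name, y_true, y_pred):
    """Collect the four metrics for one model as a table row (hypothetical helper)."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


base_learners = {
    "sklearn AdaBoost (decision stump)": DecisionTreeClassifier(max_depth=1),
    "sklearn AdaBoost (logistic regression)": LogisticRegression(max_iter=10000),
}

rows = []
for name, base in base_learners.items():
    # The base learner is passed positionally because its keyword name
    # differs between scikit-learn releases.
    clf = AdaBoostClassifier(base, n_estimators=50, learning_rate=1.0, random_state=32)
    clf.fit(X_train, y_train)
    rows.append(evaluate(name, y_test, clf.predict(X_test)))

results = pd.DataFrame(rows).set_index("model").round(4)
print(results)
```

Rows for Mymodel1 and Mymodel2 can be added the same way, by appending evaluate(...) results built from prediction1 and prediction2 before the DataFrame is created.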
