Machine Learning: Decision Trees (DecisionTree)

In a decision tree, choosing which feature to split on is called feature selection.

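Feature selection looks for the feature whose split gives the samples on each side the best "home": ideally, the samples in each branch belong to a single class as far as possible (for a two-way split, one class on the left and another on the right). A node split this way has high purity, and a common measure of purity is information entropy.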

Suppose a dataset $D$ contains $N$ samples ($n=1,2,3,\dots,N$), each sample has $K$ attributes ($k=1,2,3,\dots,K$), and the samples fall into $C$ classes ($c=1,2,3,\dots,C$). The information entropy of $D$ is then defined as:

$$
Entropy(D)=-\sum_{c=1}^{C}p_c \log {p_c}=-\sum_{c=1}^{C}\frac{N_c}{N} \log \frac{N_c}{N}\tag{1}
$$

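In Eq. $(1)$, $p_c$ is the proportion of samples in $D$ that belong to class $c$, and $N_c$ is the number of samples in class $c$. The numerical results in this article use base-2 logarithms. The smaller $Entropy(D)$ is, the higher the purity of $D$.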
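As a minimal sketch of Eq. $(1)$, the function below computes the entropy of a node from its per-class sample counts. The function name `entropy` and the count-list interface are assumptions made for illustration; base-2 logarithms are used, and empty classes are skipped.

```python
import math

def entropy(class_counts):
    """Base-2 information entropy of a node, given the sample count N_c of each class."""
    total = sum(class_counts)
    h = 0.0
    for n_c in class_counts:
        if n_c > 0:                      # an empty class contributes nothing
            p_c = n_c / total
            h -= p_c * math.log2(p_c)
    return h

print(entropy([5, 5]))    # evenly mixed two-class node -> 1.0
print(entropy([10, 0]))   # pure node -> 0.0
```

The two calls show the extremes for two classes: an evenly mixed node has the highest entropy (lowest purity), and a pure node has entropy 0.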

Suppose a discrete attribute $A$ has $M$ possible values ($m=1,2,3,\dots,M$). Splitting the sample set $D$ on $A$ partitions it into $M$ subsets $D^m$, where $N_m$ is the number of samples in $D^m$. The entropy of each $D^m$ is computed with Eq. $(1)$. Because the subsets differ in size, each one is weighted by $N_m/N$, which gives the information gain of splitting $D$ on attribute $A$:

$$
\begin{aligned}
Gain(D,A)&=Entropy(D)-\sum_{m=1}^{M}\frac{N_m}{N}Entropy(D^m)\\
&=Entropy(D)-\sum_{m=1}^{M}\frac{N_m}{N}\sum_{c=1}^{C}(-\frac{N_{mc}}{N_m}\log\frac{N_{mc}}{N_m})
\end{aligned}\tag{2}
$$

In Eq. $(2)$, $N_{mc}$ is the number of samples of class $c$ in subset $D^m$.
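Eq. $(2)$ can be sketched in the same style. Here `information_gain` is a hypothetical helper that takes the counts $N_{mc}$ as a nested list (one inner list per subset $D^m$) and recovers the parent counts by summing over the subsets; this interface is an assumption, not something prescribed by the text.

```python
import math

def entropy(class_counts):
    """Base-2 entropy from per-class sample counts (Eq. 1)."""
    total = sum(class_counts)
    return sum(-n / total * math.log2(n / total) for n in class_counts if n > 0)

def information_gain(subset_counts):
    """Information gain of a split (Eq. 2); subset_counts[m][c] holds N_mc."""
    parent_counts = [sum(col) for col in zip(*subset_counts)]    # N_c = sum over m of N_mc
    n = sum(parent_counts)
    weighted = sum(sum(s) / n * entropy(s) for s in subset_counts)
    return entropy(parent_counts) - weighted

print(information_gain([[2, 0], [0, 2]]))   # split separates the classes perfectly
print(information_gain([[1, 1], [1, 1]]))   # split leaves every subset as mixed as the parent
```

A split that separates the classes perfectly recovers the whole parent entropy as gain, while a split whose subsets are as mixed as the parent gains nothing.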
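
Table 1: Watermelon dataset

| ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 1 | green | curly | muffled | clear | hollow | hard | yes |
| 2 | dark | curly | dull | clear | hollow | hard | yes |
| 3 | dark | curly | muffled | clear | hollow | hard | yes |
| 4 | green | curly | dull | clear | hollow | hard | yes |
| 5 | light | curly | muffled | clear | hollow | hard | yes |
| 6 | green | slightly curly | muffled | clear | slightly hollow | soft | yes |
| 7 | dark | slightly curly | muffled | slightly blurry | slightly hollow | soft | yes |
| 8 | dark | slightly curly | muffled | clear | slightly hollow | hard | yes |
| 9 | dark | slightly curly | dull | slightly blurry | slightly hollow | hard | no |
| 10 | green | straight | crisp | clear | flat | soft | no |
| 11 | light | straight | crisp | blurry | flat | hard | no |
| 12 | light | curly | muffled | blurry | flat | soft | no |
| 13 | green | slightly curly | muffled | slightly blurry | hollow | hard | no |
| 14 | light | slightly curly | dull | slightly blurry | hollow | hard | no |
| 15 | dark | slightly curly | muffled | clear | slightly hollow | soft | no |
| 16 | light | curly | muffled | blurry | flat | hard | no |
| 17 | green | curly | dull | slightly blurry | slightly hollow | hard | no |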

Take the watermelon dataset in Table 1 as an example. It contains 17 samples ($n=1,2,3,\dots,17$), each with 6 attributes ($k=1,2,3,\dots,6$), and the samples fall into 2 classes ($c\in\{\text{yes},\text{no}\}$).

1. Of the 17 samples, 8 are good melons and 9 are bad melons, so with base-2 logarithms the entropy of $D$ is:

$$
Entropy(D)=-\frac{8}{17} \log \frac{8}{17}-\frac{9}{17} \log \frac{9}{17}=0.9975\tag{3}
$$

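2. Compute the information gain of every attribute in the set {color, root, sound, texture, umbilicus, touch}. Take the attribute "touch" as an example; it has two values, {hard, soft}:

  • $D^1$ (touch = hard): the 12 samples {1,2,3,4,5,8,9,11,13,14,16,17}, of which 6 are good melons {1,2,3,4,5,8} and 6 are bad melons {9,11,13,14,16,17};

  • $D^2$ (touch = soft): the 5 samples {6,7,10,12,15}, of which 2 are good melons {6,7} and 3 are bad melons {10,12,15}.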

    Using Eq. $(1)$, the entropies of these two subsets are:

    $$
    \begin{aligned}
    Entropy(D^1)&=-\frac{6}{12} \log \frac{6}{12}-\frac{6}{12} \log \frac{6}{12}=1.0000\\
    Entropy(D^2)&=-\frac{2}{5} \log \frac{2}{5}-\frac{3}{5} \log \frac{3}{5}=0.9710
    \end{aligned}
    $$

3. Using Eq. $(2)$, the information gain of the attribute "touch" is:

$$
\begin{aligned}
Gain(D,\text{touch})&=Entropy(D)-\sum_{m=1}^{M}\frac{N_m}{N}Entropy(D^m)\\
&=0.9975-(\frac{12}{17}\times1.0000+\frac{5}{17}\times0.9710)\\
&=0.0060
\end{aligned}
$$
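Steps 2 and 3 can be reproduced directly from the sample IDs listed above. This is a rough sketch: the set names (`good`, `hard`, `soft`) and the helper `class_counts` are illustrative, and the good-melon IDs 1 to 8 are taken from the subsets enumerated in step 2.

```python
import math

all_ids = set(range(1, 18))                              # samples 1..17
good = {1, 2, 3, 4, 5, 6, 7, 8}                          # good melons, per the subsets in step 2
hard = {1, 2, 3, 4, 5, 8, 9, 11, 13, 14, 16, 17}         # D^1: touch = hard
soft = all_ids - hard                                    # D^2: touch = soft

def class_counts(ids):
    """[number of good melons, number of bad melons] among the given sample IDs."""
    n_good = len(ids & good)
    return [n_good, len(ids) - n_good]

def entropy(counts):
    total = sum(counts)
    return sum(-n / total * math.log2(n / total) for n in counts if n > 0)

h_parent = entropy(class_counts(all_ids))
h_children = sum(len(s) / len(all_ids) * entropy(class_counts(s)) for s in (hard, soft))
gain_touch = h_parent - h_children

print(class_counts(hard), class_counts(soft))
print(f"{h_parent:.4f} {gain_touch:.4f}")
```

Carried at full floating-point precision, the gain comes out at about 0.0060. A hand calculation that rounds the intermediate entropies first can differ from this in the last digit.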

4. Using Eq. $(2)$, compute the information gain of the other attributes:

$$
\begin{aligned}
Gain(D,\text{color})&=0.9975-[\frac{6}{17}\times Entropy(\text{color}=\text{green})+\frac{6}{17}\times Entropy(\text{color}=\text{dark})+\frac{5}{17}\times Entropy(\text{color}=\text{light})]\\
&=0.9975-[\frac{6}{17}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+\frac{6}{17}\times(-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}))+\frac{5}{17}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))]\\
&=0.1081\\
Gain(D,\text{root})&=0.9975-[\frac{8}{17}\times Entropy(\text{root}=\text{curly})+\frac{7}{17}\times Entropy(\text{root}=\text{slightly curly})+\frac{2}{17}\times Entropy(\text{root}=\text{straight})]\\
&=0.9975-[\frac{8}{17}\times(-(\frac{5}{8}\log\frac{5}{8}+\frac{3}{8}\log\frac{3}{8}))+\frac{7}{17}\times(-(\frac{3}{7}\log\frac{3}{7}+\frac{4}{7}\log\frac{4}{7}))+\frac{2}{17}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\\
&=0.1427\\
Gain(D,\text{sound})&=0.9975-[\frac{10}{17}\times Entropy(\text{sound}=\text{muffled})+\frac{5}{17}\times Entropy(\text{sound}=\text{dull})+\frac{2}{17}\times Entropy(\text{sound}=\text{crisp})]\\
&=0.9975-[\frac{10}{17}\times(-(\frac{6}{10}\log\frac{6}{10}+\frac{4}{10}\log\frac{4}{10}))+\frac{5}{17}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))+\frac{2}{17}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\\
&=0.1408\\
Gain(D,\text{texture})&=0.9975-[\frac{9}{17}\times Entropy(\text{texture}=\text{clear})+\frac{5}{17}\times Entropy(\text{texture}=\text{slightly blurry})+\frac{3}{17}\times Entropy(\text{texture}=\text{blurry})]\\
&=0.9975-[\frac{9}{17}\times(-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9}))+\frac{5}{17}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))+\frac{3}{17}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\\
&=0.3806\\
Gain(D,\text{umbilicus})&=0.9975-[\frac{7}{17}\times Entropy(\text{umbilicus}=\text{hollow})+\frac{6}{17}\times Entropy(\text{umbilicus}=\text{slightly hollow})+\frac{4}{17}\times Entropy(\text{umbilicus}=\text{flat})]\\
&=0.9975-[\frac{7}{17}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{6}{17}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+\frac{4}{17}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\\
&=0.2892
\end{aligned}
$$

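The calculations above include a special case: for some attribute values (root = straight, sound = crisp, texture = blurry, umbilicus = flat), every sample with that value is a negative example and the number of positive examples is 0. The entropy of such a subset is:

$$
E=-(1\times \log(1)+0\times \log(0))
$$

Mathematically, since $\lim_{x\to 0^+}x\log(x)=0$, the term $0\times\log(0)$ is taken to be 0, so the entropy in this case is 0.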
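In code, this convention usually has to be written out explicitly, because evaluating the logarithm at 0 fails. The snippet below is one way to do it, not the only one; `plogp` is a hypothetical helper name.

```python
import math

def plogp(p):
    """p * log2(p), extended to p == 0 by the limit value 0."""
    if p == 0:
        return 0.0        # math.log2(0) raises ValueError, so guard explicitly
    return p * math.log2(p)

# x * log2(x) shrinks toward 0 as x approaches 0 from above.
terms = [x * math.log2(x) for x in (1e-2, 1e-4, 1e-8, 1e-16)]
print(terms)

# Entropy of a subset with only negative examples, e.g. texture = blurry (0 good, 3 bad).
h_pure = -(plogp(3 / 3) + plogp(0 / 3))
print(h_pure == 0)   # True
```

Skipping zero-count classes inside the entropy sum has the same effect as this guard.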

5. Select the split attribute with the largest information gain, which here is "texture". After the split the samples are distributed as follows:

| Attribute: value | Samples | Good melons | Bad melons | Entropy |
|---|---|---|---|---|
| texture: clear | {1,2,3,4,5,6,8,10,15} | {1,2,3,4,5,6,8} | {10,15} | $E(\text{texture}=\text{clear})=-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9})=0.7642$ |
| texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | $E(\text{texture}=\text{slightly blurry})=-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5})=0.7219$ |
| texture: blurry | {11,12,16} | {} | {11,12,16} | $E(\text{texture}=\text{blurry})=-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3})=0.0000$ |
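The choice in step 5 can be checked numerically. In this sketch the (good, bad) counts for each attribute value are copied from the fractions in the step 4 calculation; the dictionary `COUNTS`, the English attribute names, and the helper functions are assumptions made for the example.

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-n / total * math.log2(n / total) for n in counts if n > 0)

def information_gain(subset_counts):
    parent = [sum(col) for col in zip(*subset_counts)]       # class counts of the whole node
    n = sum(parent)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in subset_counts)

# (good, bad) counts for each value of each attribute on the full dataset.
COUNTS = {
    "color":     [(3, 3), (4, 2), (1, 4)],   # green, dark, light
    "root":      [(5, 3), (3, 4), (0, 2)],   # curly, slightly curly, straight
    "sound":     [(6, 4), (2, 3), (0, 2)],   # muffled, dull, crisp
    "texture":   [(7, 2), (1, 4), (0, 3)],   # clear, slightly blurry, blurry
    "umbilicus": [(5, 2), (3, 3), (0, 4)],   # hollow, slightly hollow, flat
    "touch":     [(6, 6), (2, 3)],           # hard, soft
}

gains = {attr: information_gain(c) for attr, c in COUNTS.items()}
best = max(gains, key=gains.get)

for attr, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr:10s} {g:.4f}")
print("split on:", best)
```

Every attribute's counts sum to 8 good and 9 bad melons, which is a quick sanity check that no sample was dropped when tallying.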

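Table 2: Watermelon dataset, texture = clear

| ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 1 | green | curly | muffled | clear | hollow | hard | yes |
| 2 | dark | curly | dull | clear | hollow | hard | yes |
| 3 | dark | curly | muffled | clear | hollow | hard | yes |
| 4 | green | curly | dull | clear | hollow | hard | yes |
| 5 | light | curly | muffled | clear | hollow | hard | yes |
| 6 | green | slightly curly | muffled | clear | slightly hollow | soft | yes |
| 8 | dark | slightly curly | muffled | clear | slightly hollow | hard | yes |
| 10 | green | straight | crisp | clear | flat | soft | no |
| 15 | dark | slightly curly | muffled | clear | slightly hollow | soft | no |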

Information gain of each remaining attribute on the texture = clear subset:

$$
\begin{aligned}
Gain(D^{\text{texture}=\text{clear}},\text{color})&=0.7642-[\frac{4}{9}\times Entropy(\text{color}=\text{green})+\frac{4}{9}\times Entropy(\text{color}=\text{dark})+\frac{1}{9}\times Entropy(\text{color}=\text{light})]\\
&=0.7642-[\frac{4}{9}\times(-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4}))+\frac{4}{9}\times(-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4}))+\frac{1}{9}\times(-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1}))]\\
&=0.0431\\
Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.7642-[\frac{5}{9}\times Entropy(\text{root}=\text{curly})+\frac{3}{9}\times Entropy(\text{root}=\text{slightly curly})+\frac{1}{9}\times Entropy(\text{root}=\text{straight})]\\
&=0.7642-[\frac{5}{9}\times(-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5}))+\frac{3}{9}\times(-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3}))+\frac{1}{9}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.4581\\
Gain(D^{\text{texture}=\text{clear}},\text{sound})&=0.7642-[\frac{6}{9}\times Entropy(\text{sound}=\text{muffled})+\frac{2}{9}\times Entropy(\text{sound}=\text{dull})+\frac{1}{9}\times Entropy(\text{sound}=\text{crisp})]\\
&=0.7642-[\frac{6}{9}\times(-(\frac{5}{6}\log\frac{5}{6}+\frac{1}{6}\log\frac{1}{6}))+\frac{2}{9}\times(-(\frac{2}{2}\log\frac{2}{2}+\frac{0}{2}\log\frac{0}{2}))+\frac{1}{9}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.3309\\
Gain(D^{\text{texture}=\text{clear}},\text{umbilicus})&=0.7642-[\frac{5}{9}\times Entropy(\text{umbilicus}=\text{hollow})+\frac{3}{9}\times Entropy(\text{umbilicus}=\text{slightly hollow})+\frac{1}{9}\times Entropy(\text{umbilicus}=\text{flat})]\\
&=0.7642-[\frac{5}{9}\times(-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5}))+\frac{3}{9}\times(-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3}))+\frac{1}{9}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.4581\\
Gain(D^{\text{texture}=\text{clear}},\text{touch})&=0.7642-[\frac{6}{9}\times Entropy(\text{touch}=\text{hard})+\frac{3}{9}\times Entropy(\text{touch}=\text{soft})]\\
&=0.7642-[\frac{6}{9}\times(-(\frac{6}{6}\log\frac{6}{6}+\frac{0}{6}\log\frac{0}{6}))+\frac{3}{9}\times(-(\frac{1}{3}\log\frac{1}{3}+\frac{2}{3}\log\frac{2}{3}))]\\
&=0.4581
\end{aligned}
$$

The three attributes "root", "umbilicus", and "touch" all attain the maximum information gain, so any one of them may be chosen for the next split.
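
Table 3: Watermelon dataset, texture = slightly blurry

| ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 7 | dark | slightly curly | muffled | slightly blurry | slightly hollow | soft | yes |
| 9 | dark | slightly curly | dull | slightly blurry | slightly hollow | hard | no |
| 13 | green | slightly curly | muffled | slightly blurry | hollow | hard | no |
| 14 | light | slightly curly | dull | slightly blurry | hollow | hard | no |
| 17 | green | curly | dull | slightly blurry | slightly hollow | hard | no |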

Information gain of each remaining attribute on the texture = slightly blurry subset:

$$
\begin{aligned}
Gain(D^{\text{texture}=\text{slightly blurry}},\text{color})&=0.7219-[\frac{2}{5}\times Entropy(\text{color}=\text{green})+\frac{2}{5}\times Entropy(\text{color}=\text{dark})+\frac{1}{5}\times Entropy(\text{color}=\text{light})]\\
&=0.7219-[\frac{2}{5}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))+\frac{2}{5}\times(-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2}))+\frac{1}{5}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.3219\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{root})&=0.7219-[\frac{4}{5}\times Entropy(\text{root}=\text{slightly curly})+\frac{1}{5}\times Entropy(\text{root}=\text{curly})]\\
&=0.7219-[\frac{4}{5}\times(-(\frac{1}{4}\log\frac{1}{4}+\frac{3}{4}\log\frac{3}{4}))+\frac{1}{5}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.0729\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{sound})&=0.7219-[\frac{2}{5}\times Entropy(\text{sound}=\text{muffled})+\frac{3}{5}\times Entropy(\text{sound}=\text{dull})]\\
&=0.7219-[\frac{2}{5}\times(-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2}))+\frac{3}{5}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\\
&=0.3219\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{umbilicus})&=0.7219-[\frac{3}{5}\times Entropy(\text{umbilicus}=\text{slightly hollow})+\frac{2}{5}\times Entropy(\text{umbilicus}=\text{hollow})]\\
&=0.7219-[\frac{3}{5}\times(-(\frac{1}{3}\log\frac{1}{3}+\frac{2}{3}\log\frac{2}{3}))+\frac{2}{5}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\\
&=0.1709\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{touch})&=0.7219-[\frac{1}{5}\times Entropy(\text{touch}=\text{soft})+\frac{4}{5}\times Entropy(\text{touch}=\text{hard})]\\
&=0.7219-[\frac{1}{5}\times(-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1}))+\frac{4}{5}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\\
&=0.7219
\end{aligned}
$$

The attribute "touch" has the largest information gain, so "touch" is chosen for the next split.
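
Table 4: Watermelon dataset, texture = blurry

| ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 11 | light | straight | crisp | blurry | flat | hard | no |
| 12 | light | curly | muffled | blurry | flat | soft | no |
| 16 | light | curly | muffled | blurry | flat | hard | no |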

All samples at this node belong to the same class (bad melon), so no further split is needed.
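Putting the steps together, the whole procedure (compute the gains, split on the best attribute, stop when a node is pure) can be sketched as a short recursive function. This is an illustrative sketch under several assumptions rather than a reference implementation: the English value names are translations, the labels follow the sample sets used in the worked example (samples 1 to 8 good, 9 to 17 bad), ties are broken by whichever attribute `max` meets first, branches are created only for values present at a node, and a node with no attributes left falls back to the majority class.

```python
import math
from collections import Counter

ATTRS = ["color", "root", "sound", "texture", "umbilicus", "touch"]

# Table 1, one sample per line in ID order; the last field is the label.
RAW = """\
green curly muffled clear hollow hard yes
dark curly dull clear hollow hard yes
dark curly muffled clear hollow hard yes
green curly dull clear hollow hard yes
light curly muffled clear hollow hard yes
green slightly_curly muffled clear slightly_hollow soft yes
dark slightly_curly muffled slightly_blurry slightly_hollow soft yes
dark slightly_curly muffled clear slightly_hollow hard yes
dark slightly_curly dull slightly_blurry slightly_hollow hard no
green straight crisp clear flat soft no
light straight crisp blurry flat hard no
light curly muffled blurry flat soft no
green slightly_curly muffled slightly_blurry hollow hard no
light slightly_curly dull slightly_blurry hollow hard no
dark slightly_curly muffled clear slightly_hollow soft no
light curly muffled blurry flat hard no
green curly dull slightly_blurry slightly_hollow hard no
"""
ROWS = [tuple(line.split()) for line in RAW.splitlines()]

def entropy(rows):
    """Base-2 entropy of the labels (last field) of the given rows."""
    counts = Counter(r[-1] for r in rows)
    return sum(-n / len(rows) * math.log2(n / len(rows)) for n in counts.values())

def partition(rows, i):
    """Group rows by the value of attribute i."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    return groups

def gain(rows, i):
    """Information gain of splitting rows on attribute i (Eq. 2)."""
    groups = partition(rows, i)
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def build(rows, attr_idx):
    """Return a label for a leaf, or {attribute: {value: subtree}} for a split."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1 or not attr_idx:         # pure node, or nothing left to split on
        return labels.most_common(1)[0][0]
    best = max(attr_idx, key=lambda i: gain(rows, i))
    rest = [i for i in attr_idx if i != best]
    return {ATTRS[best]: {v: build(g, rest) for v, g in partition(rows, best).items()}}

tree = build(ROWS, list(range(len(ATTRS))))
print(list(tree))                             # attribute chosen at the root
print(tree["texture"]["slightly_blurry"])     # subtree for texture = slightly blurry
print(tree["texture"]["blurry"])              # pure node, so a leaf
```

The root split and the two branches printed here match the worked example: "texture" at the root, "touch" under texture = slightly blurry, and a bad-melon leaf under texture = blurry. Which attribute appears under texture = clear depends on how the three-way tie is broken.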
