Machine Learning: Decision Trees (DecisionTree, C4.5)

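In "Machine Learning: Decision Trees (DecisionTree, ID3)" we noted that ID3 cannot handle attributes that are continuous-valued or that contain missing values. The C4.5 algorithm addresses both of these limitations.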

1. Handling continuous-valued attributes

Given a dataset $D$ and a continuous-valued attribute $A$ that takes $M$ distinct values, the attribute can be discretized by bi-partition as follows:

  1. Sort the $M$ distinct values in ascending order and denote the sorted values $\{a^1, a^2, ..., a^M\}$;
  2. For each pair of adjacent values $a^{i}$ and $a^{i+1}$, take their mean $\frac{a^{i}+a^{i+1}}{2}$ as a split point $t$; the subsets after the split are denoted $D_t^-$ (values no greater than $t$) and $D_t^+$ (values greater than $t$);
  3. For the continuous-valued attribute $A$, this yields a set of candidate split points with $M-1$ elements:
    $$T_A=\left\{\frac{a^{i}+a^{i+1}}{2}\,\middle|\,1\le i\le M-1\right\}\tag{1}$$
  4. Evaluate these candidate split points in the same way as discrete attribute values, and choose the best one to partition the sample set:
    $$\begin{aligned} Gain(D, A)&=\max_{t\in T_A}Gain(D, A, t)\\ &=\max_{t\in T_A}\left(Entropy(D)-\sum_{\lambda\in \{-, +\}}\frac{N_t^{\lambda}}{N}Entropy(D_t^{\lambda})\right) \end{aligned}\tag{2}$$
    In Eq. (2), $Gain(D, A, t)$ is the information gain of the sample set $D$ after bi-partition at split point $t$, $D_t^{\lambda}$ denotes a subset after the bi-partition, $N_t^{\lambda}$ is the number of samples in that subset, and $N$ is the number of samples in $D$.
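The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `entropy`, `candidate_splits` and `best_split` are chosen here for illustration, and the four-sample toy data is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def candidate_splits(values):
    """Midpoints of adjacent distinct sorted values, i.e. the set T_A of Eq. (1)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def best_split(values, labels):
    """Return (gain, t) maximizing Eq. (2) over the candidate split points."""
    n = len(labels)
    base = entropy(labels)
    best = (-1.0, None)
    for t in candidate_splits(values):
        left = [y for x, y in zip(values, labels) if x <= t]   # D_t^-
        right = [y for x, y in zip(values, labels) if x > t]   # D_t^+
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best[0]:
            best = (gain, t)
    return best

# Hypothetical toy data: four samples, one continuous attribute.
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = ["no", "no", "yes", "yes"]
print(candidate_splits(toy_values))        # [1.5, 2.5, 3.5]
print(best_split(toy_values, toy_labels))  # (1.0, 2.5)
```

Splitting at 2.5 separates the two classes perfectly, so both subsets have zero entropy and the gain equals the entropy of the whole toy set, 1.0.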

Table 1: Watermelon dataset 3.0

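| No. | Color | Root | Knock sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 1 | green | curled | muffled | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |
| 2 | black | curled | dull | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 3 | black | curled | muffled | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 4 | green | curled | dull | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 5 | white | curled | muffled | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | muffled | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 7 | black | slightly curled | muffled | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 8 | black | slightly curled | muffled | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 9 | black | slightly curled | dull | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 11 | white | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 12 | white | curled | muffled | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 13 | green | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | white | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 15 | black | slightly curled | muffled | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 16 | white | curled | muffled | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 17 | green | curled | dull | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |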

The watermelon dataset in Table 1 contains 17 samples ($n=1,2,3,...,17$), each with 8 attributes ($k=1,2,3,...,8$), and the samples fall into 2 classes ($c\in\{\text{yes},\text{no}\}$). Of the 17 samples, 8 are good melons and 9 are bad melons, so the information entropy of the dataset $D$ (all logarithms are base 2) is:
$$Entropy(D)=-(\frac{8}{17}\log\frac{8}{17}+\frac{9}{17}\log\frac{9}{17})=0.9975$$
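The entropy value can be checked numerically. The helper name `entropy_from_counts` is introduced here for illustration only; it takes the per-class sample counts and uses base-2 logarithms, skipping empty classes so that $0\log 0$ is treated as 0.

```python
import math

def entropy_from_counts(counts):
    """Base-2 entropy from per-class sample counts (empty classes are skipped)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# 8 good melons and 9 bad melons, as in Table 1.
print(round(entropy_from_counts([8, 9]), 4))  # 0.9975
```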

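Take the attribute "sugar content" as an example. Sorting the 17 samples by their value on this attribute in ascending order gives:
Table 2: Watermelon dataset 3.0, sorted by sugar content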

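| No. | Color | Root | Knock sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 16 | white | curled | muffled | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 11 | white | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 9 | black | slightly curled | dull | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 12 | white | curled | muffled | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 17 | green | curled | dull | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |
| 7 | black | slightly curled | muffled | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 13 | green | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | white | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 8 | black | slightly curled | muffled | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 5 | white | curled | muffled | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | muffled | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 3 | black | curled | muffled | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 4 | green | curled | dull | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 15 | black | slightly curled | muffled | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 2 | black | curled | dull | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 1 | green | curled | muffled | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |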

The candidate bi-partition split points of the 17 samples on this attribute are:

| Sorted sugar content | Candidate split point (midpoint with the next value) |
|---|---|
| 0.042 | 0.0495 |
| 0.057 | 0.074 |
| 0.091 | 0.095 |
| 0.099 | 0.101 |
| 0.103 | 0.126 |
| 0.149 | 0.155 |
| 0.161 | 0.1795 |
| 0.198 | 0.2045 |
| 0.211 | 0.213 |
| 0.215 | 0.226 |
| 0.237 | 0.2505 |
| 0.264 | 0.2655 |
| 0.267 | 0.2925 |
| 0.318 | 0.344 |
| 0.370 | 0.373 |
| 0.376 | 0.418 |
| 0.460 | |

  • At split point 0.0495, the two subsets are $D_{0.0495}^-$: {16} and $D_{0.0495}^+$: {11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1} (taking $0\log 0=0$):
    $$\begin{aligned} Entropy(D_{0.0495}^-)&=-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1})=0\\ Entropy(D_{0.0495}^+)&=-(\frac{8}{16}\log\frac{8}{16}+\frac{8}{16}\log\frac{8}{16})=1.0\\ Gain(D, \text{sugar content}, 0.0495)&= Entropy(D)-\sum_{\lambda\in\{-, +\}}\frac{N_{0.0495}^{\lambda}}{N} Entropy(D_{0.0495}^{\lambda})\\ &= 0.9975-(\frac{1}{17}\times0+\frac{16}{17}\times1.0)\\ &=0.0563 \end{aligned}$$
  • At split point 0.074, the two subsets are $D_{0.074}^-$: {16, 11} and $D_{0.074}^+$: {9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.074)= 0.9975-\{\frac{2}{17}\times[-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2})]+\frac{15}{17}\times[-(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})]\}=0.1179$$
  • At split point 0.095, the two subsets are $D_{0.095}^-$: {16, 11, 9} and $D_{0.095}^+$: {12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.095)= 0.9975-\{\frac{3}{17}\times[-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3})]+\frac{14}{17}\times[-(\frac{8}{14}\log\frac{8}{14}+\frac{6}{14}\log\frac{6}{14})]\}=0.1861$$
  • At split point 0.101, the two subsets are $D_{0.101}^-$: {16, 11, 9, 12} and $D_{0.101}^+$: {17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.101)= 0.9975-\{\frac{4}{17}\times[-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4})]+\frac{13}{17}\times[-(\frac{8}{13}\log\frac{8}{13}+\frac{5}{13}\log\frac{5}{13})]\}=0.2624$$
  • At split point 0.126, the two subsets are $D_{0.126}^-$: {16, 11, 9, 12, 17} and $D_{0.126}^+$: {7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.126)= 0.9975-\{\frac{5}{17}\times[-(\frac{0}{5}\log\frac{0}{5}+\frac{5}{5}\log\frac{5}{5})]+\frac{12}{17}\times[-(\frac{8}{12}\log\frac{8}{12}+\frac{4}{12}\log\frac{4}{12})]\}=0.3492$$
  • At split point 0.155, the two subsets are $D_{0.155}^-$: {16, 11, 9, 12, 17, 7} and $D_{0.155}^+$: {13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.155)= 0.9975-\{\frac{6}{17}\times[-(\frac{1}{6}\log\frac{1}{6}+\frac{5}{6}\log\frac{5}{6})]+\frac{11}{17}\times[-(\frac{7}{11}\log\frac{7}{11}+\frac{4}{11}\log\frac{4}{11})]\}=0.1561$$
  • At split point 0.1795, the two subsets are $D_{0.1795}^-$: {16, 11, 9, 12, 17, 7, 13} and $D_{0.1795}^+$: {14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.1795)= 0.9975-\{\frac{7}{17}\times[-(\frac{1}{7}\log\frac{1}{7}+\frac{6}{7}\log\frac{6}{7})]+\frac{10}{17}\times[-(\frac{7}{10}\log\frac{7}{10}+\frac{3}{10}\log\frac{3}{10})]\}=0.2354$$
  • At split point 0.2045, the two subsets are $D_{0.2045}^-$: {16, 11, 9, 12, 17, 7, 13, 14} and $D_{0.2045}^+$: {8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2045)= 0.9975-\{\frac{8}{17}\times[-(\frac{1}{8}\log\frac{1}{8}+\frac{7}{8}\log\frac{7}{8})]+\frac{9}{17}\times[-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9})]\}=0.3371$$
  • At split point 0.213, the two subsets are $D_{0.213}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8} and $D_{0.213}^+$: {5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.213)= 0.9975-\{\frac{9}{17}\times[-(\frac{2}{9}\log\frac{2}{9}+\frac{7}{9}\log\frac{7}{9})]+\frac{8}{17}\times[-(\frac{6}{8}\log\frac{6}{8}+\frac{2}{8}\log\frac{2}{8})]\}=0.2111$$
  • At split point 0.226, the two subsets are $D_{0.226}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5} and $D_{0.226}^+$: {6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.226)= 0.9975-\{\frac{10}{17}\times[-(\frac{3}{10}\log\frac{3}{10}+\frac{7}{10}\log\frac{7}{10})]+\frac{7}{17}\times[-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7})]\}=0.1237$$
  • At split point 0.2505, the two subsets are $D_{0.2505}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6} and $D_{0.2505}^+$: {3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2505)= 0.9975-\{\frac{11}{17}\times[-(\frac{4}{11}\log\frac{4}{11}+\frac{7}{11}\log\frac{7}{11})]+\frac{6}{17}\times[-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6})]\}=0.0615$$
  • At split point 0.2655, the two subsets are $D_{0.2655}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3} and $D_{0.2655}^+$: {10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2655)= 0.9975-\{\frac{12}{17}\times[-(\frac{5}{12}\log\frac{5}{12}+\frac{7}{12}\log\frac{7}{12})]+\frac{5}{17}\times[-(\frac{3}{5}\log\frac{3}{5}+\frac{2}{5}\log\frac{2}{5})]\}=0.0202$$
  • At split point 0.2925, the two subsets are $D_{0.2925}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10} and $D_{0.2925}^+$: {4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2925)= 0.9975-\{\frac{13}{17}\times[-(\frac{5}{13}\log\frac{5}{13}+\frac{8}{13}\log\frac{8}{13})]+\frac{4}{17}\times[-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4})]\}=0.0715$$
  • At split point 0.344, the two subsets are $D_{0.344}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4} and $D_{0.344}^+$: {15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.344)= 0.9975-\{\frac{14}{17}\times[-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})]+\frac{3}{17}\times[-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3})]\}=0.0241$$
  • At split point 0.373, the two subsets are $D_{0.373}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15} and $D_{0.373}^+$: {2, 1}
    $$Gain(D, \text{sugar content}, 0.373)= 0.9975-\{\frac{15}{17}\times[-(\frac{6}{15}\log\frac{6}{15}+\frac{9}{15}\log\frac{9}{15})]+\frac{2}{17}\times[-(\frac{2}{2}\log\frac{2}{2}+\frac{0}{2}\log\frac{0}{2})]\}=0.1408$$
  • At split point 0.418, the two subsets are $D_{0.418}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2} and $D_{0.418}^+$: {1}
    $$Gain(D, \text{sugar content}, 0.418)= 0.9975-\{\frac{16}{17}\times[-(\frac{7}{16}\log\frac{7}{16}+\frac{9}{16}\log\frac{9}{16})]+\frac{1}{17}\times[-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1})]\}=0.0669$$

Therefore, the maximum information gain obtained by splitting on "sugar content" is 0.3492, at split point 0.126:
$$Gain(D, \text{sugar content})=Gain(D, \text{sugar content}, t=0.126)=0.3492$$

Similarly, the maximum information gain obtained by splitting on "density" is 0.2624, at split point 0.3815:
$$Gain(D, \text{density})=Gain(D, \text{density}, t=0.3815)=0.2624$$
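The two results above can be reproduced with a short script. This is only a sketch: `ROWS`, `entropy` and `best_split` are names introduced here, the density and sugar values are copied from Table 1, and the class labels assume that samples 1 to 8 are the good melons and 9 to 17 the bad ones, which is what the subsets listed above imply.

```python
import math

# (density, sugar content, good melon?) for samples 1..17 of Table 1.
# Assumption: samples 1-8 are good melons, 9-17 are bad melons.
ROWS = [
    (0.697, 0.460, True), (0.774, 0.376, True), (0.634, 0.264, True),
    (0.608, 0.318, True), (0.556, 0.215, True), (0.403, 0.237, True),
    (0.481, 0.149, True), (0.437, 0.211, True), (0.666, 0.091, False),
    (0.243, 0.267, False), (0.245, 0.057, False), (0.343, 0.099, False),
    (0.639, 0.161, False), (0.657, 0.198, False), (0.360, 0.370, False),
    (0.593, 0.042, False), (0.719, 0.103, False),
]

def entropy(pos, neg):
    """Base-2 entropy of a two-class subset given its class counts."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def best_split(column):
    """Best (gain, split point) for one column, scanning the sorted midpoints."""
    pairs = sorted((row[column], row[2]) for row in ROWS)
    n = len(pairs)
    total_pos = sum(good for _, good in pairs)
    base = entropy(total_pos, n - total_pos)
    best_gain, best_t = -1.0, None
    left_pos = 0
    for i in range(n - 1):
        left_pos += pairs[i][1]            # good melons among the first i+1 samples
        if pairs[i][0] == pairs[i + 1][0]:
            continue                       # equal values give no split point
        left_n, right_n = i + 1, n - i - 1
        right_pos = total_pos - left_pos
        remainder = (left_n / n * entropy(left_pos, left_n - left_pos)
                     + right_n / n * entropy(right_pos, right_n - right_pos))
        if base - remainder > best_gain:
            best_gain = base - remainder
            best_t = (pairs[i][0] + pairs[i + 1][0]) / 2
    return best_gain, best_t

for name, column in (("density", 0), ("sugar content", 1)):
    gain, t = best_split(column)
    print(f"{name}: gain = {gain:.3f} at t = {t:.4f}")
```

The script keeps a running count of good melons on the left side instead of rebuilding both subsets for every candidate, so one sorted pass per attribute is enough.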

Continuous-valued attributes can be handled in this way.

2. Handling attributes with missing values

Table 3: Watermelon dataset with missing values

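| No. | Color | Root | Knock sound | Texture | Navel | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 1 | --- | curled | muffled | clear | sunken | hard-smooth | yes |
| 2 | black | curled | dull | clear | sunken | --- | yes |
| 3 | black | curled | --- | clear | sunken | hard-smooth | yes |
| 4 | green | curled | dull | clear | sunken | hard-smooth | yes |
| 5 | --- | curled | muffled | clear | sunken | hard-smooth | yes |
| 6 | green | slightly curled | muffled | clear | --- | soft-sticky | yes |
| 7 | black | slightly curled | muffled | slightly blurry | slightly sunken | soft-sticky | yes |
| 8 | black | slightly curled | muffled | --- | slightly sunken | hard-smooth | yes |
| 9 | black | --- | dull | slightly blurry | slightly sunken | hard-smooth | no |
| 10 | green | stiff | crisp | --- | flat | soft-sticky | no |
| 11 | white | stiff | crisp | blurry | flat | --- | no |
| 12 | white | curled | --- | blurry | flat | soft-sticky | no |
| 13 | --- | slightly curled | muffled | slightly blurry | sunken | hard-smooth | no |
| 14 | white | slightly curled | dull | slightly blurry | sunken | hard-smooth | no |
| 15 | black | slightly curled | muffled | clear | --- | soft-sticky | no |
| 16 | white | curled | muffled | blurry | flat | hard-smooth | no |
| 17 | green | --- | dull | slightly blurry | slightly sunken | hard-smooth | no |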

(1) How should the splitting attribute be chosen when attribute values are missing?

Given a training set $D$ and an attribute $A$, let $\widetilde{D}$ denote the subset of samples that have no missing value on $A$. Suppose $A$ has $M$ possible values $\{a^1, a^2, ..., a^M\}$; let $\widetilde{D}^m$ denote the subset of $\widetilde{D}$ that takes value $a^m$ on $A$, and let $\widetilde{D}_k$ denote the subset of $\widetilde{D}$ that belongs to class $k$ ($k=1,2,...,K$), so that $\widetilde{D}=\cup_{k=1}^{K}\widetilde{D}_k=\cup_{m=1}^{M}\widetilde{D}^m$. Assign each sample $x$ a weight $w_x$ and define:
$$\begin{aligned} \rho&=\frac{\sum_{x\in\widetilde{D}}w_x}{\sum_{x\in D}w_x}\\ \widetilde{p}_k&=\frac{\sum_{x\in\widetilde{D}_k}w_x}{\sum_{x\in \widetilde{D}}w_x}\\ \widetilde{r}_m&=\frac{\sum_{x\in\widetilde{D}^m}w_x}{\sum_{x\in \widetilde{D}}w_x} \end{aligned}$$

Here $\rho$ is the proportion of samples without missing values, $\widetilde{p}_k$ is the proportion of class $k$ among the samples without missing values, and $\widetilde{r}_m$ is the proportion of samples taking value $a^m$ on $A$ among the samples without missing values, so $\sum_{k=1}^K\widetilde{p}_k=\sum_{m=1}^M\widetilde{r}_m=1$.

Based on these definitions, the information gain can be generalized to data with missing values as:
$$\begin{aligned} Entropy(\widetilde{D})&=-\sum_{k=1}^{K}\widetilde{p}_k\log\widetilde{p}_k\\ Gain(D, A)&=\rho\times Gain(\widetilde{D}, A)=\rho\times[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)] \end{aligned}$$
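As a small illustration of the three weighted proportions, the sketch below evaluates them for the color column of Table 3. The names `COLOR`, `GOOD` and `WEIGHT` are introduced here for illustration, every sample is given an initial weight of 1, and the labels assume that samples 1 to 8 are good melons and 9 to 17 are bad melons.

```python
from collections import defaultdict
from fractions import Fraction

# Color column of Table 3 for samples 1..17; None marks a missing value.
COLOR = [None, "black", "black", "green", None, "green", "black", "black",
         "black", "green", "white", "white", None, "white", "black", "white",
         "green"]
GOOD = [i < 8 for i in range(17)]   # assumption: samples 1-8 good, 9-17 bad
WEIGHT = [Fraction(1)] * 17         # every sample starts with weight 1

known = [i for i, value in enumerate(COLOR) if value is not None]
w_known = sum(WEIGHT[i] for i in known)

# rho: weighted share of samples whose color is known.
rho = w_known / sum(WEIGHT)

# p_tilde[k]: weighted share of class k among the known samples.
p_tilde = {k: sum(WEIGHT[i] for i in known if GOOD[i] == k) / w_known
           for k in (True, False)}

# r_tilde[v]: weighted share of value v among the known samples.
r_tilde = defaultdict(Fraction)
for i in known:
    r_tilde[COLOR[i]] += WEIGHT[i] / w_known

print(rho)            # 14/17
print(p_tilde)        # good: 3/7, bad: 4/7
print(dict(r_tilde))  # black: 3/7, green: 2/7, white: 2/7
```

Exact fractions are used so that the proportions can be compared directly with the hand calculation (for instance $\frac{6}{14}=\frac{3}{7}$ for black).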

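Taking Table 3 as an example, the root node contains all 17 samples, $D=\{1,2,3,...,17\}$, each with weight 1. The information gain of each attribute is then: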

  • Attribute "color": the subset without missing values is $\widetilde{D}=\{2,3,4,6,7,8,9,10,11,12,14,15,16,17\}$, with 3 values: "black", "green", "white"
    $$\begin{aligned} Gain(D, \text{color})&=\rho\times[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)]\\ &=\frac{14}{17}\times\{-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})-[\frac{6}{14}\times(-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}))+\frac{4}{14}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+\frac{4}{14}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2519 \end{aligned}$$
  • Attribute "root": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,8,10,11,12,13,14,15,16\}$, with 3 values: "curled", "slightly curled", "stiff"
    $$\begin{aligned} Gain(D, \text{root})&=\frac{15}{17}\times\{-(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{6}{15}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+\frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1711 \end{aligned}$$
  • Attribute "knock sound": the subset without missing values is $\widetilde{D}=\{1,2,4,5,6,7,8,9,10,11,13,14,15,16,17\}$, with 3 values: "muffled", "dull", "crisp"
    $$\begin{aligned} Gain(D, \text{knock sound})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{8}{15}\times(-(\frac{5}{8}\log\frac{5}{8}+\frac{3}{8}\log\frac{3}{8}))+\frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))+\frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1448 \end{aligned}$$
  • Attribute "texture": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,9,11,12,13,14,15,16,17\}$, with 3 values: "clear", "slightly blurry", "blurry"
    $$\begin{aligned} Gain(D, \text{texture})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{7}{15}\times(-(\frac{6}{7}\log\frac{6}{7}+\frac{1}{7}\log\frac{1}{7}))+\frac{5}{15}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))+\frac{3}{15}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\}\\ &=0.4235 \end{aligned}$$
  • Attribute "navel": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,7,8,9,10,11,12,13,14,16,17\}$, with 3 values: "sunken", "slightly sunken", "flat"
    $$\begin{aligned} Gain(D, \text{navel})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{4}{15}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+\frac{4}{15}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2888 \end{aligned}$$
  • Attribute "touch": the subset without missing values is $\widetilde{D}=\{1,3,4,5,6,7,8,9,10,12,13,14,15,16,17\}$, with 2 values: "hard-smooth", "soft-sticky"
    $$\begin{aligned} Gain(D, \text{touch})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{10}{15}\times(-(\frac{5}{10}\log\frac{5}{10}+\frac{5}{10}\log\frac{5}{10}))+\frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))]\}\\ &=0.0057 \end{aligned}$$
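The six gains above can be recomputed with one generic function. The sketch below is an illustration under stated assumptions: `RAW`, `ROWS`, `GOOD` and `gain_with_missing` are names introduced here, attribute values are written as single tokens (for example `slightly_curled`, and `hard`/`soft` for the two touch values), `?` marks a missing value, and samples 1 to 8 are taken to be the good melons.

```python
import math
from collections import defaultdict

ATTRS = ["color", "root", "knock", "texture", "navel", "touch"]
# Table 3, one line per sample 1..17; "?" marks a missing value.
RAW = """
? curled muffled clear sunken hard
black curled dull clear sunken ?
black curled ? clear sunken hard
green curled dull clear sunken hard
? curled muffled clear sunken hard
green slightly_curled muffled clear ? soft
black slightly_curled muffled slightly_blurry slightly_sunken soft
black slightly_curled muffled ? slightly_sunken hard
black ? dull slightly_blurry slightly_sunken hard
green stiff crisp ? flat soft
white stiff crisp blurry flat ?
white curled ? blurry flat soft
? slightly_curled muffled slightly_blurry sunken hard
white slightly_curled dull slightly_blurry sunken hard
black slightly_curled muffled clear ? soft
white curled muffled blurry flat hard
green ? dull slightly_blurry slightly_sunken hard
"""
ROWS = [[None if v == "?" else v for v in line.split()]
        for line in RAW.strip().splitlines()]
GOOD = [i < 8 for i in range(len(ROWS))]  # assumption: samples 1-8 are good

def entropy(class_weights):
    """Base-2 entropy from per-class total weights."""
    total = sum(class_weights)
    return -sum(w / total * math.log2(w / total) for w in class_weights if w > 0)

def gain_with_missing(col, weights):
    """rho * (Entropy(D~) - sum_m r_m * Entropy(D~^m)) for attribute column col."""
    known = [i for i, row in enumerate(ROWS) if row[col] is not None]
    w_known = sum(weights[i] for i in known)
    rho = w_known / sum(weights)
    overall = [0.0, 0.0]                         # [bad, good] weight in D~
    per_value = defaultdict(lambda: [0.0, 0.0])  # same, per attribute value
    for i in known:
        overall[GOOD[i]] += weights[i]
        per_value[ROWS[i][col]][GOOD[i]] += weights[i]
    conditional = sum(sum(cw) / w_known * entropy(cw)
                      for cw in per_value.values())
    return rho * (entropy(overall) - conditional)

weights = [1.0] * len(ROWS)
gains = {name: gain_with_missing(c, weights) for c, name in enumerate(ATTRS)}
for name, g in gains.items():
    print(f"{name}: {g:.3f}")
print("best:", max(gains, key=gains.get))
```

Because the function takes the sample weights as an argument, the same code could be reused at deeper nodes, where samples with missing values carry fractional weights.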
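(2) Given a splitting attribute, how should a sample be assigned if its value on that attribute is missing?
  • If the sample's value on the splitting attribute is known, it is sent to the child node matching that value, and its weight in the child node remains 1;
  • If the sample's value on the splitting attribute is unknown, it is sent to all child nodes at once, and its weight in the child node for value $a^m$ is adjusted to $\widetilde{r}_m\cdot w_x$, i.e. the proportion $\widetilde{r}_m$ of that child node among the samples with known values.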

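The attribute "texture" has the largest information gain and is used for the next split. It has 15 samples with known values (7 clear, 5 slightly blurry, 3 blurry) and 2 samples with unknown values, {8, 10}.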

| Attribute: value | Samples | Good melons | Bad melons | Missing-value samples | Weight of missing-value samples | Total weight |
|---|---|---|---|---|---|---|
| Texture: clear | {1,2,3,4,5,6,15} | {1,2,3,4,5,6} | {15} | {8,10} | $2\times\frac{7}{15}$ | $7+2\times\frac{7}{15}$ |
| Texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | {8,10} | $2\times\frac{5}{15}$ | $5+2\times\frac{5}{15}$ |
| Texture: blurry | {11,12,16} | - | {11,12,16} | {8,10} | $2\times\frac{3}{15}$ | $3+2\times\frac{3}{15}$ |
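The weights in the table can be obtained mechanically. The following sketch uses a hypothetical helper `split_with_missing`: samples with a known texture go to one child with their full weight, and samples with a missing texture go to every child with their weight scaled by that child's share among the known samples. The `TEXTURE` mapping is copied from Table 3.

```python
from fractions import Fraction

# Texture column of Table 3, keyed by sample number; None marks a missing value.
TEXTURE = {
    1: "clear", 2: "clear", 3: "clear", 4: "clear", 5: "clear", 6: "clear",
    7: "slightly blurry", 8: None, 9: "slightly blurry", 10: None,
    11: "blurry", 12: "blurry", 13: "slightly blurry", 14: "slightly blurry",
    15: "clear", 16: "blurry", 17: "slightly blurry",
}

def split_with_missing(values, weights):
    """Return {attribute value: {sample: weight in that child node}}."""
    known = {s: v for s, v in values.items() if v is not None}
    w_known = sum(weights[s] for s in known)
    share = {}                                   # share of each value among known samples
    for s, v in known.items():
        share[v] = share.get(v, 0) + weights[s] / w_known
    children = {v: {} for v in share}
    for s, v in values.items():
        if v is not None:
            children[v][s] = weights[s]          # known value: one child, full weight
        else:
            for child, r in share.items():       # missing value: every child, scaled
                children[child][s] = r * weights[s]
    return children

weights = {s: Fraction(1) for s in TEXTURE}
children = split_with_missing(TEXTURE, weights)
for value, members in children.items():
    print(value, sum(members.values()))
```

The child totals come out as $7+\frac{14}{15}$, $5+\frac{10}{15}$ and $3+\frac{6}{15}$, and they add up to 17, so no sample weight is lost in the split.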

The child node texture = clear contains 7 samples with known values, {1,2,3,4,5,6,15}, of which 6 are good melons and 1 is a bad melon. Assuming that the class distribution of the missing-value samples matches that of the known samples, i.e. $\frac{6}{7}$ and $\frac{1}{7}$, the information entropy of the child node texture = clear is:
$$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}})&=-\sum_{i=1}^{k}p_i\log p_i\\ &=-(\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}+\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2})\\ &=0.5916 \end{aligned}$$

  • Child node texture = clear: information gain of the attribute color
    • Color = black has 3 samples (2 positive, 1 negative); color = green has 2 samples (2 positive); 2 samples have missing values

    • Weights of the missing-value samples: the weight for color = black is $\frac{3}{5}$, for a total of $2\times\frac{3}{5}=\frac{6}{5}$; the weight for color = green is $\frac{2}{5}$, for a total of $2\times\frac{2}{5}=\frac{4}{5}$

    • Color = black: weight of positive samples $2+\frac{2}{3}\times\frac{3}{5}\times2$; weight of negative samples $1+\frac{1}{3}\times\frac{3}{5}\times2$; total weight $2+\frac{2}{3}\times\frac{3}{5}\times2+1+\frac{1}{3}\times\frac{3}{5}\times2=3+\frac{3}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{color}=\text{black})=-(\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}+\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2})=0.9183$$

    • Color = green: weight of positive samples $2+\frac{2}{2}\times\frac{2}{5}\times2$; weight of negative samples $0+\frac{0}{2}\times\frac{2}{5}\times2$; total weight $2+\frac{2}{2}\times\frac{2}{5}\times2+0+\frac{0}{2}\times\frac{2}{5}\times2=2+\frac{2}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{color}=\text{green})=-(\frac{2+\frac{2}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{2+\frac{2}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}+\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2})=0.0$$

$$Gain(D^{\text{texture}=\text{clear}},\text{color})=0.5916-(\frac{3+\frac{3}{5}\times2}{7+\frac{7}{15}\times2}\times0.9183+\frac{2+\frac{2}{5}\times2}{7+\frac{7}{15}\times2}\times0.0)=0.1054$$

  • Child node texture = clear: information gain of the attribute root

    • Root = curled has 5 samples (5 positive); root = slightly curled has 2 samples (1 positive, 1 negative); no samples have missing values
      $$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}},\text{root}=\text{curled})&=-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5})=0.0\\ Entropy(D^{\text{texture}=\text{clear}},\text{root}=\text{slightly curled})&=-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2})=1.0\\ Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.5916-(\frac{5}{7}\times0.0+\frac{2}{7}\times1.0)=0.3058 \end{aligned}$$
  • Child node texture = clear: information gain of the attribute knock sound

    • Knock sound = muffled has 4 samples (3 positive, 1 negative); knock sound = dull has 2 samples (2 positive); 1 sample has a missing value

    • Weights of the missing-value sample: the weight for knock sound = muffled is $\frac{4}{6}$, for a total of $\frac{4}{6}$; the weight for knock sound = dull is $\frac{2}{6}$, for a total of $\frac{2}{6}$

    • Knock sound = muffled: weight of positive samples $3+\frac{3}{4}\times\frac{4}{6}$; weight of negative samples $1+\frac{1}{4}\times\frac{4}{6}$; total weight $3+\frac{3}{4}\times\frac{4}{6}+1+\frac{1}{4}\times\frac{4}{6}=4+\frac{4}{6}$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{knock sound}=\text{muffled})=-(\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}+\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}})=0.8112$$

    • Knock sound = dull: weight of positive samples $2+\frac{2}{2}\times\frac{2}{6}$; weight of negative samples $0+\frac{0}{2}\times\frac{2}{6}$; total weight $2+\frac{2}{2}\times\frac{2}{6}+0+\frac{0}{2}\times\frac{2}{6}=2+\frac{2}{6}$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{knock sound}=\text{dull})=-(\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}+\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}})=0.0$$
      $$Gain(D^{\text{texture}=\text{clear}},\text{knock sound})=0.5916-(\frac{4+\frac{4}{6}}{7+\frac{7}{15}\times2}\times0.8112+\frac{2+\frac{2}{6}}{7+\frac{7}{15}\times2}\times0.0)=0.1144$$
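The knock sound calculation can be traced numerically. The sketch below simply mirrors the arithmetic used in this walkthrough (missing-value weight spread over branches in proportion to the known samples, class ratios inside each branch left unchanged); it is not meant as a general C4.5 implementation, and the names `branches`, `node_weight` and `node_entropy` are introduced here for illustration.

```python
import math

def entropy(pos, neg):
    """Base-2 entropy of a two-class distribution given class weights."""
    total = pos + neg
    return -sum(w / total * math.log2(w / total) for w in (pos, neg) if w > 0)

# Child node texture = clear: 7 known samples plus 2 missing ones at weight 7/15.
node_weight = 7 + 2 * 7 / 15
# Proportional assignment keeps the 6:1 class ratio, so the entropy is that of (6, 1).
node_entropy = entropy(6, 1)

# (positive, negative) counts of the known knock-sound values inside this node.
branches = {"muffled": (3, 1), "dull": (2, 0)}
n_missing = 1                                        # one sample lacks a knock-sound value
n_known = sum(pos + neg for pos, neg in branches.values())

gain = node_entropy
for pos, neg in branches.values():
    # Known samples plus this branch's share of the missing-value sample.
    branch_weight = (pos + neg) + n_missing * (pos + neg) / n_known
    gain -= branch_weight / node_weight * entropy(pos, neg)

print(f"{gain:.3f}")
```

The printed value agrees with the hand-computed gain of about 0.114; the small difference in the fourth decimal comes from the intermediate values 0.5916 and 0.8112 being truncated in the text.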
