Machine Learning: Decision Trees (DecisionTree: C4.5)

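In "Machine Learning: Decision Trees (DecisionTree: ID3)" we noted that ID3 cannot handle attributes that are continuous-valued or that have missing values. The C4.5 algorithm addresses both of these limitations of ID3.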

1. Handling continuous-valued attributes

Given a dataset $D$ and a continuous-valued attribute $A$ that takes $M$ distinct values in $D$, the attribute can be discretized by bi-partition, as follows:

  1. Sort the $M$ distinct values in ascending order and denote the sorted values by $\{a^1, a^2, ..., a^M\}$;
  2. For each pair of adjacent values $a^{i}$ and $a^{i+1}$, take their mean $t=\frac{a^{i}+a^{i+1}}{2}$ as a split point; the two subsets produced by splitting at $t$ are denoted $D_t^-$ (samples whose value is at most $t$) and $D_t^+$ (samples whose value is greater than $t$);
  3. For the continuous-valued attribute $A$, this gives a set of $M-1$ candidate split points:
    $$T_A=\left\{\frac{a^{i}+a^{i+1}}{2}\,\middle|\,1\le i\le M-1\right\}\tag{1}$$
  4. Evaluate these candidate split points in the same way as the values of a discrete attribute, and pick the best one to split the sample set:
    $$\begin{aligned} Gain(D, A)&=\max_{t\in T_A}Gain(D, A, t)\\ &=\max_{t\in T_A}\left(Entropy(D)-\sum_{\lambda\in \{-, +\}}\frac{N_t^{\lambda}}{N}Entropy(D_t^{\lambda})\right) \end{aligned}\tag{2}$$
    In Eq. (2), $Gain(D, A, t)$ is the information gain of splitting the sample set $D$ in two at the point $t$, $D_t^{\lambda}$ is a subset produced by the split, $N_t^{\lambda}$ is the number of samples in that subset, and $N$ is the number of samples in $D$.
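The four steps above can be sketched in Python as follows. This is a minimal illustration, not the reference C4.5 implementation: the function names (`entropy`, `candidate_splits`, `best_split`) and the four-sample toy data are assumptions introduced here, and the logarithm is taken to be base 2.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 Shannon entropy of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def candidate_splits(values):
    """Midpoints of adjacent distinct sorted values: the set T_A of Eq. (1)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def best_split(values, labels):
    """Return the (split point, gain) pair that maximizes Eq. (2)."""
    base, n = entropy(labels), len(labels)
    best_t, best_gain = None, -1.0
    for t in candidate_splits(values):
        left = [y for v, y in zip(values, labels) if v <= t]   # D_t^-
        right = [y for v, y in zip(values, labels) if v > t]   # D_t^+
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Hypothetical toy data (not from the article): four samples, two classes.
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = ["no", "no", "yes", "yes"]
toy_t, toy_gain = best_split(toy_values, toy_labels)
print(candidate_splits(toy_values), toy_t, toy_gain)
```

On the toy data the split at 2.5 separates the two classes perfectly, so it is the one returned.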

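Table 1: Watermelon dataset 3.0

| ID | Color | Root | Knock sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 5 | pale | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 11 | pale | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 12 | pale | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | pale | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 16 | pale | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |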

The watermelon dataset in Table 1 contains 17 samples ($n=1,2,3,...,17$), each with 8 attributes ($k=1,2,3,...,8$), and the samples fall into 2 classes ($c\in\{\text{yes}, \text{no}\}$). Of the 17 samples, 8 are good melons and 9 are bad melons, so the information entropy of the dataset $D$ (all logarithms are base 2) is:
$$Entropy(D)=-(\frac{8}{17}\log\frac{8}{17}+\frac{9}{17}\log\frac{9}{17})=0.9975$$
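The entropy value above can be checked with a few lines of Python. This is only a sketch: the helper name `entropy_from_counts` is introduced here for illustration, and base-2 logarithms are assumed.

```python
import math


def entropy_from_counts(counts):
    """Base-2 entropy from per-class sample counts; empty classes contribute 0."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# 8 good melons and 9 bad melons, as stated in the text.
root_entropy = entropy_from_counts([8, 9])
print(round(root_entropy, 4))  # 0.9975
```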

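Take the attribute "sugar content" as an example. Sorting the 17 samples by their value on this attribute in ascending order gives:

Table 2: Watermelon dataset 3.0, sorted by sugar content

| ID | Color | Root | Knock sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 16 | pale | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 11 | pale | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 12 | pale | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | pale | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 5 | pale | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |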

The candidate bi-partition points for the 17 samples on this attribute are listed below; each split point is the midpoint between the value in its row and the next value.

| Sorted sugar content | Candidate split point |
|---|---|
| 0.042 | 0.0495 |
| 0.057 | 0.074 |
| 0.091 | 0.095 |
| 0.099 | 0.101 |
| 0.103 | 0.126 |
| 0.149 | 0.155 |
| 0.161 | 0.1795 |
| 0.198 | 0.2045 |
| 0.211 | 0.213 |
| 0.215 | 0.226 |
| 0.237 | 0.2505 |
| 0.264 | 0.2655 |
| 0.267 | 0.2925 |
| 0.318 | 0.344 |
| 0.370 | 0.373 |
| 0.376 | 0.418 |
| 0.460 | |
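The candidate split points can be reproduced from the sorted sugar-content values with a short Python sketch. The variable names are illustrative, and the results are rounded to four decimals only to suppress floating-point noise.

```python
# Sugar-content values of the 17 samples, in ascending order.
sorted_sugar = [0.042, 0.057, 0.091, 0.099, 0.103, 0.149, 0.161, 0.198, 0.211,
                0.215, 0.237, 0.264, 0.267, 0.318, 0.370, 0.376, 0.460]

# Midpoint of every pair of adjacent values: M - 1 = 16 candidates.
midpoints = [round((a + b) / 2, 4) for a, b in zip(sorted_sugar, sorted_sugar[1:])]

print(len(midpoints))
print(midpoints)
```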

  • With the split point 0.0495, the two resulting subsets are $D_{0.0495}^-$: {16} and $D_{0.0495}^+$: {11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}. Using the convention $0\log 0=0$:
    $$\begin{aligned} Entropy(D_{0.0495}^-)&=-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1})=0\\ Entropy(D_{0.0495}^+)&=-(\frac{8}{16}\log\frac{8}{16}+\frac{8}{16}\log\frac{8}{16})=1.0\\ Gain(D, \text{sugar content}, 0.0495)&= Entropy(D)-\sum_{\lambda\in\{-, +\}}\frac{N_{0.0495}^{\lambda}}{N} Entropy(D_{0.0495}^{\lambda})\\ &= 0.9975-(\frac{1}{17}\times 0+\frac{16}{17}\times 1.0)\\ &=0.0563 \end{aligned}$$
  • With the split point 0.074, the two resulting subsets are $D_{0.074}^-$: {16, 11} and $D_{0.074}^+$: {9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.074)= 0.9975-\{\frac{2}{17}\times[-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2})]+\frac{15}{17}\times[-(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})]\}=0.1179$$
  • With the split point 0.095, the two resulting subsets are $D_{0.095}^-$: {16, 11, 9} and $D_{0.095}^+$: {12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.095)= 0.9975-\{\frac{3}{17}\times[-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3})]+\frac{14}{17}\times[-(\frac{8}{14}\log\frac{8}{14}+\frac{6}{14}\log\frac{6}{14})]\}=0.1861$$
  • With the split point 0.101, the two resulting subsets are $D_{0.101}^-$: {16, 11, 9, 12} and $D_{0.101}^+$: {17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.101)= 0.9975-\{\frac{4}{17}\times[-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4})]+\frac{13}{17}\times[-(\frac{8}{13}\log\frac{8}{13}+\frac{5}{13}\log\frac{5}{13})]\}=0.2624$$
  • With the split point 0.126, the two resulting subsets are $D_{0.126}^-$: {16, 11, 9, 12, 17} and $D_{0.126}^+$: {7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.126)= 0.9975-\{\frac{5}{17}\times[-(\frac{0}{5}\log\frac{0}{5}+\frac{5}{5}\log\frac{5}{5})]+\frac{12}{17}\times[-(\frac{8}{12}\log\frac{8}{12}+\frac{4}{12}\log\frac{4}{12})]\}=0.3492$$
  • With the split point 0.155, the two resulting subsets are $D_{0.155}^-$: {16, 11, 9, 12, 17, 7} and $D_{0.155}^+$: {13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.155)= 0.9975-\{\frac{6}{17}\times[-(\frac{1}{6}\log\frac{1}{6}+\frac{5}{6}\log\frac{5}{6})]+\frac{11}{17}\times[-(\frac{7}{11}\log\frac{7}{11}+\frac{4}{11}\log\frac{4}{11})]\}=0.1561$$
  • With the split point 0.1795, the two resulting subsets are $D_{0.1795}^-$: {16, 11, 9, 12, 17, 7, 13} and $D_{0.1795}^+$: {14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.1795)= 0.9975-\{\frac{7}{17}\times[-(\frac{1}{7}\log\frac{1}{7}+\frac{6}{7}\log\frac{6}{7})]+\frac{10}{17}\times[-(\frac{7}{10}\log\frac{7}{10}+\frac{3}{10}\log\frac{3}{10})]\}=0.2354$$
  • With the split point 0.2045, the two resulting subsets are $D_{0.2045}^-$: {16, 11, 9, 12, 17, 7, 13, 14} and $D_{0.2045}^+$: {8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2045)= 0.9975-\{\frac{8}{17}\times[-(\frac{1}{8}\log\frac{1}{8}+\frac{7}{8}\log\frac{7}{8})]+\frac{9}{17}\times[-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9})]\}=0.3371$$
  • With the split point 0.213, the two resulting subsets are $D_{0.213}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8} and $D_{0.213}^+$: {5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.213)= 0.9975-\{\frac{9}{17}\times[-(\frac{2}{9}\log\frac{2}{9}+\frac{7}{9}\log\frac{7}{9})]+\frac{8}{17}\times[-(\frac{6}{8}\log\frac{6}{8}+\frac{2}{8}\log\frac{2}{8})]\}=0.2111$$
  • With the split point 0.226, the two resulting subsets are $D_{0.226}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5} and $D_{0.226}^+$: {6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.226)= 0.9975-\{\frac{10}{17}\times[-(\frac{3}{10}\log\frac{3}{10}+\frac{7}{10}\log\frac{7}{10})]+\frac{7}{17}\times[-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7})]\}=0.1237$$
  • With the split point 0.2505, the two resulting subsets are $D_{0.2505}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6} and $D_{0.2505}^+$: {3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2505)= 0.9975-\{\frac{11}{17}\times[-(\frac{4}{11}\log\frac{4}{11}+\frac{7}{11}\log\frac{7}{11})]+\frac{6}{17}\times[-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6})]\}=0.0615$$
  • With the split point 0.2655, the two resulting subsets are $D_{0.2655}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3} and $D_{0.2655}^+$: {10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2655)= 0.9975-\{\frac{12}{17}\times[-(\frac{5}{12}\log\frac{5}{12}+\frac{7}{12}\log\frac{7}{12})]+\frac{5}{17}\times[-(\frac{3}{5}\log\frac{3}{5}+\frac{2}{5}\log\frac{2}{5})]\}=0.0202$$
  • With the split point 0.2925, the two resulting subsets are $D_{0.2925}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10} and $D_{0.2925}^+$: {4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2925)= 0.9975-\{\frac{13}{17}\times[-(\frac{5}{13}\log\frac{5}{13}+\frac{8}{13}\log\frac{8}{13})]+\frac{4}{17}\times[-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4})]\}=0.0715$$
  • With the split point 0.344, the two resulting subsets are $D_{0.344}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4} and $D_{0.344}^+$: {15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.344)= 0.9975-\{\frac{14}{17}\times[-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})]+\frac{3}{17}\times[-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3})]\}=0.0241$$
  • With the split point 0.373, the two resulting subsets are $D_{0.373}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15} and $D_{0.373}^+$: {2, 1}
    $$Gain(D, \text{sugar content}, 0.373)= 0.9975-\{\frac{15}{17}\times[-(\frac{6}{15}\log\frac{6}{15}+\frac{9}{15}\log\frac{9}{15})]+\frac{2}{17}\times[-(\frac{2}{2}\log\frac{2}{2}+\frac{0}{2}\log\frac{0}{2})]\}=0.1408$$
  • With the split point 0.418, the two resulting subsets are $D_{0.418}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2} and $D_{0.418}^+$: {1}
    $$Gain(D, \text{sugar content}, 0.418)= 0.9975-\{\frac{16}{17}\times[-(\frac{7}{16}\log\frac{7}{16}+\frac{9}{16}\log\frac{9}{16})]+\frac{1}{17}\times[-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1})]\}=0.0669$$

Therefore, the maximum information gain obtained by splitting on the attribute "sugar content" is 0.3492, reached at the split point 0.126:
$$Gain(D, \text{sugar content})=Gain(D, \text{sugar content}, t=0.126)=0.3492$$

Likewise, the maximum information gain obtained by splitting on the attribute "density" is 0.2624, reached at the split point 0.3815:
$$Gain(D, \text{density})=Gain(D, \text{density}, t=0.3815)=0.2624$$

Continuous-valued attributes can be handled in this way.
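As a cross-check of both results, the sketch below scans every candidate split point for "sugar content" and "density". It is an illustration rather than a full C4.5 implementation: the function names are introduced here, and the class labels are an assumption that samples 1 to 8 are the good melons and 9 to 17 the bad ones, which matches the class counts used in the worked splits above.

```python
import math


def entropy(pos, neg):
    """Base-2 entropy of a two-class set with pos/neg sample counts."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)


def best_split(values, is_good):
    """Return the (split point, information gain) pair with the largest gain."""
    n = len(values)
    base = entropy(sum(is_good), n - sum(is_good))
    distinct = sorted(set(values))
    best = (None, -1.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [g for v, g in zip(values, is_good) if v <= t]
        right = [g for v, g in zip(values, is_good) if v > t]
        cond = sum(len(s) / n * entropy(sum(s), len(s) - sum(s)) for s in (left, right))
        if base - cond > best[1]:
            best = (t, base - cond)
    return best


# Density and sugar content of samples 1..17, in ID order.
density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719]
sugar = [0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
         0.267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103]
# Assumed labels: samples 1-8 are good melons, samples 9-17 are not.
is_good = [True] * 8 + [False] * 9

sugar_t, sugar_gain = best_split(sugar, is_good)
density_t, density_gain = best_split(density, is_good)
print(round(sugar_t, 4), round(sugar_gain, 3))
print(round(density_t, 4), round(density_gain, 3))
```

The gains printed here are computed without intermediate rounding, so they can differ from the hand-computed values in the last decimal place.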

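2. Handling attributes with missing values

Table 3: Watermelon dataset with missing values (- marks a missing value)

| ID | Color | Root | Knock sound | Texture | Navel | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 1 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 2 | dark | curled | muffled | clear | sunken | - | yes |
| 3 | dark | curled | - | clear | sunken | hard-smooth | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | yes |
| 5 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 6 | green | slightly curled | dull | clear | - | soft-sticky | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | yes |
| 8 | dark | slightly curled | dull | - | slightly sunken | hard-smooth | yes |
| 9 | dark | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |
| 10 | green | stiff | crisp | - | flat | soft-sticky | no |
| 11 | pale | stiff | crisp | blurry | flat | - | no |
| 12 | pale | curled | - | blurry | flat | soft-sticky | no |
| 13 | - | slightly curled | dull | slightly blurry | sunken | hard-smooth | no |
| 14 | pale | slightly curled | muffled | slightly blurry | sunken | hard-smooth | no |
| 15 | dark | slightly curled | dull | clear | - | soft-sticky | no |
| 16 | pale | curled | dull | blurry | flat | hard-smooth | no |
| 17 | green | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |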

(1) How should the split attribute be chosen when attribute values are missing?

Given a training set $D$ and an attribute $A$, let $\widetilde{D}$ denote the subset of samples that have no missing value on $A$. Suppose $A$ has $M$ possible values $\{a^1, a^2, ..., a^M\}$; let $\widetilde{D}^m$ denote the subset of $\widetilde{D}$ whose value on $A$ is $a^m$ ($m=1,2,...,M$), and let $\widetilde{D}_k$ denote the subset of $\widetilde{D}$ that belongs to class $k$ ($k=1,2,...,K$). Then $\widetilde{D}=\bigcup_{k=1}^{K}\widetilde{D}_k=\bigcup_{m=1}^{M}\widetilde{D}^m$. Assign each sample $x$ a weight $w_x$ and define:
$$\begin{aligned} \rho&=\frac{\sum_{x\in\widetilde{D}}w_x}{\sum_{x\in D}w_x}\\ \widetilde{p}_k&=\frac{\sum_{x\in\widetilde{D}_k}w_x}{\sum_{x\in \widetilde{D}}w_x}\\ \widetilde{r}_m&=\frac{\sum_{x\in\widetilde{D}^m}w_x}{\sum_{x\in \widetilde{D}}w_x} \end{aligned}$$

Here $\rho$ is the proportion of samples without a missing value, $\widetilde{p}_k$ is the proportion of class $k$ among the samples without a missing value, and $\widetilde{r}_m$ is the proportion of samples taking the value $a^m$ on $A$ among the samples without a missing value, so that $\sum_{k=1}^K\widetilde{p}_k=\sum_{m=1}^M\widetilde{r}_m=1$.

Based on these definitions, the information gain can be generalized to data with missing values as:
$$\begin{aligned} Entropy(\widetilde{D})&=-\sum_{k=1}^{K}\widetilde{p}_k\log\widetilde{p}_k\\ Gain(D, A)&=\rho\times Gain(\widetilde{D}, A)=\rho\times[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)] \end{aligned}$$

At the root node, $D$ contains all 17 samples {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17}, each with weight 1.

  • Attribute "color": the subset without missing values is $\widetilde{D}=\{2,3,4,6,7,8,9,10,11,12,14,15,16,17\}$, and the attribute takes 3 values: "dark", "green" and "pale"
    $$\begin{aligned} Gain(D, \text{color})&=\rho\times[Entropy(\widetilde{D})-\sum_{m}\widetilde{r}_mEntropy(\widetilde{D}^m)]\\ &=\frac{14}{17}\times\{-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})-[\frac{6}{14}\times(-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}))+\frac{4}{14}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+\frac{4}{14}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2519 \end{aligned}$$
  • Attribute "root": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,8,10,11,12,13,14,15,16\}$, and the attribute takes 3 values: "curled", "slightly curled" and "stiff"
    $$\begin{aligned} Gain(D, \text{root})&=\frac{15}{17}\times\{- (\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})- [\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+ \frac{6}{15}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+ \frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1711 \end{aligned}$$
  • Attribute "knock sound": the subset without missing values is $\widetilde{D}=\{1,2,4,5,6,7,8,9,10,11,13,14,15,16,17\}$, and the attribute takes 3 values: "dull", "muffled" and "crisp"
    $$\begin{aligned} Gain(D, \text{knock sound})&=\frac{15}{17}\times\{- (\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})- [\frac{8}{15}\times(-(\frac{5}{8}\log\frac{5}{8}+\frac{3}{8}\log\frac{3}{8}))+ \frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))+ \frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1448 \end{aligned}$$
  • Attribute "texture": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,9,11,12,13,14,15,16,17\}$, and the attribute takes 3 values: "clear", "slightly blurry" and "blurry"
    $$\begin{aligned} Gain(D, \text{texture})&=\frac{15}{17}\times\{- (\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})- [\frac{7}{15}\times(-(\frac{6}{7}\log\frac{6}{7}+\frac{1}{7}\log\frac{1}{7}))+ \frac{5}{15}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))+ \frac{3}{15}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\}\\ &=0.4235 \end{aligned}$$
  • Attribute "navel": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,7,8,9,10,11,12,13,14,16,17\}$, and the attribute takes 3 values: "sunken", "slightly sunken" and "flat"
    $$\begin{aligned} Gain(D, \text{navel})&=\frac{15}{17}\times\{- (\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})- [\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+ \frac{4}{15}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+ \frac{4}{15}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2888 \end{aligned}$$
  • Attribute "touch": the subset without missing values is $\widetilde{D}=\{1,3,4,5,6,7,8,9,10,12,13,14,15,16,17\}$, and the attribute takes 2 values: "hard-smooth" and "soft-sticky"
    $$\begin{aligned} Gain(D, \text{touch})&=\frac{15}{17}\times\{- (\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})- [\frac{10}{15}\times(-(\frac{5}{10}\log\frac{5}{10}+\frac{5}{10}\log\frac{5}{10}))+ \frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))]\}\\ &=0.0057 \end{aligned}$$
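The six gains above can be recomputed with a short Python sketch of the generalized formula $\rho\times[Entropy(\widetilde{D})-\sum_m\widetilde{r}_mEntropy(\widetilde{D}^m)]$. The function names, the English attribute values, and the use of `None` for a missing value are conventions introduced here for illustration; the class labels assume samples 1 to 8 are good melons and 9 to 17 are bad, which matches the class counts in the formulas above.

```python
import math
from collections import defaultdict


def entropy(class_weights):
    """Base-2 entropy of a {class: total weight} mapping."""
    total = sum(class_weights.values())
    return -sum(w / total * math.log2(w / total) for w in class_weights.values() if w > 0)


def gain_with_missing(values, labels, weights=None):
    """rho * (Entropy(D~) - sum_m r_m * Entropy(D~^m)); None marks a missing value."""
    weights = weights or [1.0] * len(values)
    known = [(v, y, w) for v, y, w in zip(values, labels, weights) if v is not None]
    total_known = sum(w for _, _, w in known)
    rho = total_known / sum(weights)
    by_class = defaultdict(float)
    by_value = defaultdict(lambda: defaultdict(float))
    for v, y, w in known:
        by_class[y] += w
        by_value[v][y] += w
    cond = sum(sum(c.values()) / total_known * entropy(c) for c in by_value.values())
    return rho * (entropy(by_class) - cond)


names = ["color", "root", "knock sound", "texture", "navel", "touch"]
rows = [  # samples 1..17 of the dataset with missing values
    (None, "curled", "dull", "clear", "sunken", "hard-smooth"),
    ("dark", "curled", "muffled", "clear", "sunken", None),
    ("dark", "curled", None, "clear", "sunken", "hard-smooth"),
    ("green", "curled", "muffled", "clear", "sunken", "hard-smooth"),
    (None, "curled", "dull", "clear", "sunken", "hard-smooth"),
    ("green", "slightly curled", "dull", "clear", None, "soft-sticky"),
    ("dark", "slightly curled", "dull", "slightly blurry", "slightly sunken", "soft-sticky"),
    ("dark", "slightly curled", "dull", None, "slightly sunken", "hard-smooth"),
    ("dark", None, "muffled", "slightly blurry", "slightly sunken", "hard-smooth"),
    ("green", "stiff", "crisp", None, "flat", "soft-sticky"),
    ("pale", "stiff", "crisp", "blurry", "flat", None),
    ("pale", "curled", None, "blurry", "flat", "soft-sticky"),
    (None, "slightly curled", "dull", "slightly blurry", "sunken", "hard-smooth"),
    ("pale", "slightly curled", "muffled", "slightly blurry", "sunken", "hard-smooth"),
    ("dark", "slightly curled", "dull", "clear", None, "soft-sticky"),
    ("pale", "curled", "dull", "blurry", "flat", "hard-smooth"),
    ("green", None, "muffled", "slightly blurry", "slightly sunken", "hard-smooth"),
]
labels = ["yes"] * 8 + ["no"] * 9  # assumed: samples 1-8 good, 9-17 bad

gains = {name: gain_with_missing([r[i] for r in rows], labels)
         for i, name in enumerate(names)}
best_attribute = max(gains, key=gains.get)
for name, g in gains.items():
    print(f"{name}: {g:.3f}")
print("best:", best_attribute)
```

Because nothing is rounded along the way, the printed values can differ from the hand-computed ones in the fourth decimal place; the ranking of the attributes is unaffected.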
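(2) Given a split attribute, how should a sample be assigned if its value on that attribute is missing?

  • If the sample's value on the split attribute is known, send it to the child node that corresponds to its value; its weight in that child node stays unchanged at $w_x$ (here 1);
  • If the sample's value on the split attribute is unknown, send it to all child nodes at once; its weight in the child node for value $a^m$ becomes $\widetilde{r}_m\cdot w_x$, i.e. it is scaled by that child's share of the samples with a known value.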

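The attribute "texture" has the largest information gain and is therefore used for the next split. The node contains 15 samples with a known value (7 clear, 5 slightly blurry, 3 blurry) and 2 samples, {8, 10}, with an unknown value.

| Attribute: value | Samples | Good melons | Bad melons | Missing-value samples | Weight of missing-value samples | Total weight |
|---|---|---|---|---|---|---|
| Texture: clear | {1,2,3,4,5,6,15} | {1,2,3,4,5,6} | {15} | {8,10} | $2\times\frac{7}{15}$ | $7+2\times\frac{7}{15}$ |
| Texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | {8,10} | $2\times\frac{5}{15}$ | $5+2\times\frac{5}{15}$ |
| Texture: blurry | {11,12,16} | - | {11,12,16} | {8,10} | $2\times\frac{3}{15}$ | $3+2\times\frac{3}{15}$ |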
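The total weights in the last column can be checked with exact fractions. The sketch below only illustrates the bookkeeping; the dictionary keys and variable names are chosen here and are not part of the original article.

```python
from fractions import Fraction

# Samples with a known texture value, per branch.
known_counts = {"clear": 7, "slightly blurry": 5, "blurry": 3}
n_missing = 2  # samples 8 and 10, each carrying weight 1 at the parent node
n_known = sum(known_counts.values())

child_weight = {}
for value, count in known_counts.items():
    share = Fraction(count, n_known)                  # r_m for this branch
    child_weight[value] = count + n_missing * share   # known weight + distributed weight

for value, weight in child_weight.items():
    print(value, weight)
print("sum:", sum(child_weight.values()))
```

The branch weights add up to 17, so distributing the two missing-value samples over the children preserves the total weight of the parent node.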

The child node texture = clear contains 7 samples with a known value, {1,2,3,4,5,6,15}, of which 6 are good melons and 1 is a bad melon. Assuming that the samples with a missing value follow the same class distribution as the known samples in this node, namely $\frac{6}{7}$ and $\frac{1}{7}$, the entropy of the child node texture = clear is:
$$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}})&=-\sum_{i=1}^{k}p_i\log p_i\\ &=-(\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}+\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2})\\ &=0.5916 \end{aligned}$$
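The arithmetic of this formula can be reproduced as follows. The sketch follows the assumption stated above (the weight of the missing-value samples is spread over the classes in the same 6:1 ratio as the known samples); the variable and function names are introduced here for illustration.

```python
import math


def binary_entropy(p):
    """Base-2 entropy of a two-class distribution with class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


good, bad = 6, 1       # known samples in the texture = clear node
n_missing = 2          # samples 8 and 10
share = 7 / 15         # r_m of the texture = clear branch

extra = n_missing * share                       # weight received from missing-value samples
good_weight = good + good / (good + bad) * extra
total_weight = good + bad + extra
node_entropy = binary_entropy(good_weight / total_weight)
print(round(node_entropy, 3))
```

Under this assumption the class proportions stay at 6/7 and 1/7, so the result equals the entropy of the 7 known samples alone.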

  • In the child node texture = clear, compute the information gain of the attribute "color"
    • color = dark has 3 samples (2 positive and 1 negative); color = green has 2 samples (2 positive); 2 samples have a missing color value

    • Weights of the missing-value samples: the share for color = dark is $\frac{3}{5}$, giving a total weight of $2\times\frac{3}{5}=\frac{6}{5}$; the share for color = green is $\frac{2}{5}$, giving a total weight of $2\times\frac{2}{5}=\frac{4}{5}$

    • color = dark: weight of positive samples: $2+\frac{2}{3}\times\frac{3}{5}\times2$; weight of negative samples: $1+\frac{1}{3}\times\frac{3}{5}\times2$; total weight: $2+\frac{2}{3}\times\frac{3}{5}\times2+1+\frac{1}{3}\times\frac{3}{5}\times2=3+\frac{3}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{color}=\text{dark}})=-(\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}+\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2})=0.9183$$

    • color = green: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{5}\times2$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{5}\times2$; total weight: $2+\frac{2}{2}\times\frac{2}{5}\times2+0+\frac{0}{2}\times\frac{2}{5}\times2=2+\frac{2}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{color}=\text{green}})=-(\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}+\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2})=0.0$$

    $$Gain(D^{\text{texture}=\text{clear}},\text{color})=0.5916-(\frac{3+\frac{3}{5}\times2}{7+\frac{7}{15}\times2}\times0.9183+\frac{2+\frac{2}{5}\times2}{7+\frac{7}{15}\times2}\times0.0)=0.1054$$

  • In the child node texture = clear, compute the information gain of the attribute "root"

    • root = curled has 5 samples (5 positive); root = slightly curled has 2 samples (1 positive and 1 negative); no sample has a missing value
      $$\begin{aligned} Entropy(D^{\text{texture}=\text{clear},\,\text{root}=\text{curled}})&=-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5})=0.0\\ Entropy(D^{\text{texture}=\text{clear},\,\text{root}=\text{slightly curled}})&=-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2})=1.0\\ Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.5916-(\frac{5}{7}\times0.0+\frac{2}{7}\times1.0)=0.3058 \end{aligned}$$
  • In the child node texture = clear, compute the information gain of the attribute "knock sound"

    • knock sound = dull has 4 samples (3 positive and 1 negative); knock sound = muffled has 2 samples (2 positive); 1 sample has a missing value

    • Weights of the missing-value sample: the share for knock sound = dull is $\frac{4}{6}$, giving a total weight of $\frac{4}{6}$; the share for knock sound = muffled is $\frac{2}{6}$, giving a total weight of $\frac{2}{6}$

    • knock sound = dull: weight of positive samples: $3+\frac{3}{4}\times\frac{4}{6}$; weight of negative samples: $1+\frac{1}{4}\times\frac{4}{6}$; total weight: $3+\frac{3}{4}\times\frac{4}{6}+1+\frac{1}{4}\times\frac{4}{6}=4+\frac{4}{6}$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{knock sound}=\text{dull}})=-(\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}+\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}})=0.8112$$

    • knock sound = muffled: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{6}$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{6}$; total weight: $2+\frac{2}{2}\times\frac{2}{6}+0+\frac{0}{2}\times\frac{2}{6}=2+\frac{2}{6}$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{knock sound}=\text{muffled}})=-(\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}+\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}})=0.0$$
      $$Gain(D^{\text{texture}=\text{clear}},\text{knock sound})=0.5916-(\frac{4+\frac{4}{6}}{7+\frac{7}{15}\times2}\times0.8112+\frac{2+\frac{2}{6}}{7+\frac{7}{15}\times2}\times0.0)=0.1144$$
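The knock-sound computation can be traced step by step in Python. This sketch mirrors the weighting used in the formulas above, including the assumption that the missing-value sample is spread over the classes of each branch in proportion to the known samples; it is not the standard C4.5 procedure, and the helper names are hypothetical.

```python
import math


def binary_entropy(p):
    """Base-2 entropy of a two-class distribution with class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def branch_weights(pos, neg, share, n_missing):
    """Weighted positive, negative and total mass of one branch."""
    known = pos + neg
    extra = share * n_missing                 # weight received from missing-value samples
    pos_w = pos + pos / known * extra
    neg_w = neg + neg / known * extra
    return pos_w, neg_w, pos_w + neg_w


node_entropy = 0.5916                         # entropy of the texture = clear node (from the text)
node_weight = 7 + 2 * 7 / 15                  # total weight of the texture = clear node
branches = {"dull": (3, 1), "muffled": (2, 0)}  # (positive, negative) known samples
n_known, n_missing = 6, 1

branch_entropy = {}
gain = node_entropy
for name, (pos, neg) in branches.items():
    share = (pos + neg) / n_known
    pos_w, neg_w, total = branch_weights(pos, neg, share, n_missing)
    branch_entropy[name] = binary_entropy(pos_w / total)
    gain -= total / node_weight * branch_entropy[name]

print({k: round(v, 3) for k, v in branch_entropy.items()}, round(gain, 3))
```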
