Decision Trees in Machine Learning (DecisionTree: C4.5)

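In Decision Trees in Machine Learning (DecisionTree: ID3), we noted that ID3 cannot handle attributes that are continuous-valued or that have missing values. The C4.5 algorithm removes both of these limitations.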

1. Handling continuous-valued attributes

Given a dataset $D$ and a continuous-valued attribute $A$ that takes $M$ distinct values on $D$, the attribute can be discretized by bi-partition, as follows:

  1. Sort the $M$ distinct values in ascending order and denote the sorted values by $\{a^1, a^2, ..., a^M\}$;
  2. For each pair of adjacent values $a^{i}$ and $a^{i+1}$, take their midpoint $\frac{a^{i}+a^{i+1}}{2}$ as a split point $t$. A split point $t$ divides $D$ into the subsets $D_t^-$ (samples whose value on $A$ is at most $t$) and $D_t^+$ (samples whose value on $A$ is greater than $t$);
  3. For the continuous-valued attribute $A$, this yields a set of $M-1$ candidate split points:
    $$T_A=\{\frac{a^{i}+a^{i+1}}{2}\mid 1\le i\le M-1\}\tag{1}$$
  4. Evaluate these candidate split points in the same way as the values of a discrete attribute, and choose the best one to split the sample set:
    $$\begin{aligned} Gain(D, A)&=\max_{t\in T_A}Gain(D, A, t)\\ &=\max_{t\in T_A}\Big(Entropy(D)-\sum_{\lambda\in \{-, +\}}\frac{N_t^{\lambda}}{N}Entropy(D_t^{\lambda})\Big) \end{aligned}\tag{2}$$
    In Eq. (2), $Gain(D, A, t)$ is the information gain of splitting the sample set $D$ in two at the split point $t$, $D_t^{\lambda}$ denotes a subset after the split, $N_t^{\lambda}$ is the number of samples in that subset, and $N$ is the number of samples in $D$.
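The four steps above can be sketched in Python as follows. This is a minimal illustration rather than C4.5's reference implementation: the function names `entropy`, `candidate_splits`, and `best_split` are hypothetical, logarithms are assumed to be base 2, and samples with a value of at most $t$ are assumed to go to $D_t^-$.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def candidate_splits(values):
    """Midpoints between adjacent distinct sorted values, as in Eq. (1)."""
    a = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(a, a[1:])]

def best_split(values, labels):
    """Return (gain, t) maximizing Eq. (2) over the candidate split points."""
    base, n = entropy(labels), len(labels)
    best = (-1.0, None)
    for t in candidate_splits(values):
        left = [y for v, y in zip(values, labels) if v <= t]   # D_t^-
        right = [y for v, y in zip(values, labels) if v > t]   # D_t^+
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        best = max(best, (gain, t))
    return best

# Toy data (not from the article): the classes separate perfectly at 2.5.
print(best_split([1, 2, 3, 4], ["n", "n", "y", "y"]))  # (1.0, 2.5)
```

Scanning every midpoint costs one pass over the data per candidate, which is acceptable for small datasets such as the one used below.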

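Table 1. Watermelon dataset 3.0

| No. | Color | Root | Sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 5 | light | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 11 | light | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 12 | light | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 16 | light | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |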

The watermelon dataset in Table 1 contains 17 samples ($n=1,2,3,...,17$), each described by 8 attributes ($k=1,2,3,...,8$), and the samples fall into 2 classes ($c\in\{\text{yes},\text{no}\}$). Of the 17 samples, 8 are good melons and 9 are bad melons, so the information entropy of the dataset $D$, with logarithms taken to base 2, is:
$$Entropy(D)=-(\frac{8}{17}\log_2\frac{8}{17}+\frac{9}{17}\log_2\frac{9}{17})=0.9975$$
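As a quick check of this value, the entropy can be evaluated directly. The helper name `binary_entropy` is hypothetical, and the code assumes base-2 logarithms and the usual convention that $0\log_2 0=0$.

```python
import math

def binary_entropy(n_pos, n_neg):
    """Entropy in bits of a two-class set, with 0 * log(0) treated as 0."""
    total = n_pos + n_neg
    h = 0.0
    for count in (n_pos, n_neg):
        if count:  # skip empty classes instead of evaluating log(0)
            p = count / total
            h -= p * math.log2(p)
    return h

print(round(binary_entropy(8, 9), 4))  # 0.9975
```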

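Taking the attribute "sugar content" as an example, sorting the 17 samples by their value on this attribute in ascending order gives:

Table 2. Watermelon dataset 3.0, sorted by "sugar content"

| No. | Color | Root | Sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 16 | light | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 11 | light | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 12 | light | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 5 | light | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |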

The sorted values of the 17 samples on this attribute, and the 16 bi-partition candidate split points obtained from adjacent pairs, are:

| Sorted value $a^i$ | Candidate split point $\frac{a^{i}+a^{i+1}}{2}$ |
|---|---|
| 0.042 | 0.0495 |
| 0.057 | 0.074 |
| 0.091 | 0.095 |
| 0.099 | 0.101 |
| 0.103 | 0.126 |
| 0.149 | 0.155 |
| 0.161 | 0.1795 |
| 0.198 | 0.2045 |
| 0.211 | 0.213 |
| 0.215 | 0.226 |
| 0.237 | 0.2505 |
| 0.264 | 0.2655 |
| 0.267 | 0.2925 |
| 0.318 | 0.344 |
| 0.370 | 0.373 |
| 0.376 | 0.418 |
| 0.460 | |

  • For the split point 0.0495, the two subsets are $D_{0.0495}^-$: {16} and $D_{0.0495}^+$: {11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}. Using the convention $0\log_2 0=0$:
    $$\begin{aligned} Entropy(D_{0.0495}^-)&=-(\frac{0}{1}\log_2\frac{0}{1}+\frac{1}{1}\log_2\frac{1}{1})=0\\ Entropy(D_{0.0495}^+)&=-(\frac{8}{16}\log_2\frac{8}{16}+\frac{8}{16}\log_2\frac{8}{16})=1.0\\ Gain(D, \text{sugar content}, 0.0495)&= Entropy(D)-\sum_{\lambda\in\{-, +\}}\frac{N_{0.0495}^{\lambda}}{N} Entropy(D_{0.0495}^{\lambda})\\ &= 0.9975-(\frac{1}{17}\times0+\frac{16}{17}\times1.0)\\ &=0.0563 \end{aligned}$$
  • For the split point 0.074, the two subsets are $D_{0.074}^-$: {16, 11} and $D_{0.074}^+$: {9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.074)= 0.9975-\{\frac{2}{17}\times[-(\frac{0}{2}\log_2\frac{0}{2}+\frac{2}{2}\log_2\frac{2}{2})]+\frac{15}{17}\times[-(\frac{8}{15}\log_2\frac{8}{15}+\frac{7}{15}\log_2\frac{7}{15})]\}=0.1179$$
  • For the split point 0.095, the two subsets are $D_{0.095}^-$: {16, 11, 9} and $D_{0.095}^+$: {12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.095)= 0.9975-\{\frac{3}{17}\times[-(\frac{0}{3}\log_2\frac{0}{3}+\frac{3}{3}\log_2\frac{3}{3})]+\frac{14}{17}\times[-(\frac{8}{14}\log_2\frac{8}{14}+\frac{6}{14}\log_2\frac{6}{14})]\}=0.1861$$
  • For the split point 0.101, the two subsets are $D_{0.101}^-$: {16, 11, 9, 12} and $D_{0.101}^+$: {17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.101)= 0.9975-\{\frac{4}{17}\times[-(\frac{0}{4}\log_2\frac{0}{4}+\frac{4}{4}\log_2\frac{4}{4})]+\frac{13}{17}\times[-(\frac{8}{13}\log_2\frac{8}{13}+\frac{5}{13}\log_2\frac{5}{13})]\}=0.2624$$
  • For the split point 0.126, the two subsets are $D_{0.126}^-$: {16, 11, 9, 12, 17} and $D_{0.126}^+$: {7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.126)= 0.9975-\{\frac{5}{17}\times[-(\frac{0}{5}\log_2\frac{0}{5}+\frac{5}{5}\log_2\frac{5}{5})]+\frac{12}{17}\times[-(\frac{8}{12}\log_2\frac{8}{12}+\frac{4}{12}\log_2\frac{4}{12})]\}=0.3492$$
  • For the split point 0.155, the two subsets are $D_{0.155}^-$: {16, 11, 9, 12, 17, 7} and $D_{0.155}^+$: {13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.155)= 0.9975-\{\frac{6}{17}\times[-(\frac{1}{6}\log_2\frac{1}{6}+\frac{5}{6}\log_2\frac{5}{6})]+\frac{11}{17}\times[-(\frac{7}{11}\log_2\frac{7}{11}+\frac{4}{11}\log_2\frac{4}{11})]\}=0.1561$$
  • For the split point 0.1795, the two subsets are $D_{0.1795}^-$: {16, 11, 9, 12, 17, 7, 13} and $D_{0.1795}^+$: {14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.1795)= 0.9975-\{\frac{7}{17}\times[-(\frac{1}{7}\log_2\frac{1}{7}+\frac{6}{7}\log_2\frac{6}{7})]+\frac{10}{17}\times[-(\frac{7}{10}\log_2\frac{7}{10}+\frac{3}{10}\log_2\frac{3}{10})]\}=0.2354$$
  • For the split point 0.2045, the two subsets are $D_{0.2045}^-$: {16, 11, 9, 12, 17, 7, 13, 14} and $D_{0.2045}^+$: {8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2045)= 0.9975-\{\frac{8}{17}\times[-(\frac{1}{8}\log_2\frac{1}{8}+\frac{7}{8}\log_2\frac{7}{8})]+\frac{9}{17}\times[-(\frac{7}{9}\log_2\frac{7}{9}+\frac{2}{9}\log_2\frac{2}{9})]\}=0.3371$$
  • For the split point 0.213, the two subsets are $D_{0.213}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8} and $D_{0.213}^+$: {5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.213)= 0.9975-\{\frac{9}{17}\times[-(\frac{2}{9}\log_2\frac{2}{9}+\frac{7}{9}\log_2\frac{7}{9})]+\frac{8}{17}\times[-(\frac{6}{8}\log_2\frac{6}{8}+\frac{2}{8}\log_2\frac{2}{8})]\}=0.2111$$
  • For the split point 0.226, the two subsets are $D_{0.226}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5} and $D_{0.226}^+$: {6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.226)= 0.9975-\{\frac{10}{17}\times[-(\frac{3}{10}\log_2\frac{3}{10}+\frac{7}{10}\log_2\frac{7}{10})]+\frac{7}{17}\times[-(\frac{5}{7}\log_2\frac{5}{7}+\frac{2}{7}\log_2\frac{2}{7})]\}=0.1237$$
  • For the split point 0.2505, the two subsets are $D_{0.2505}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6} and $D_{0.2505}^+$: {3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2505)= 0.9975-\{\frac{11}{17}\times[-(\frac{4}{11}\log_2\frac{4}{11}+\frac{7}{11}\log_2\frac{7}{11})]+\frac{6}{17}\times[-(\frac{4}{6}\log_2\frac{4}{6}+\frac{2}{6}\log_2\frac{2}{6})]\}=0.0615$$
  • For the split point 0.2655, the two subsets are $D_{0.2655}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3} and $D_{0.2655}^+$: {10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2655)= 0.9975-\{\frac{12}{17}\times[-(\frac{5}{12}\log_2\frac{5}{12}+\frac{7}{12}\log_2\frac{7}{12})]+\frac{5}{17}\times[-(\frac{3}{5}\log_2\frac{3}{5}+\frac{2}{5}\log_2\frac{2}{5})]\}=0.0202$$
  • For the split point 0.2925, the two subsets are $D_{0.2925}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10} and $D_{0.2925}^+$: {4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2925)= 0.9975-\{\frac{13}{17}\times[-(\frac{5}{13}\log_2\frac{5}{13}+\frac{8}{13}\log_2\frac{8}{13})]+\frac{4}{17}\times[-(\frac{3}{4}\log_2\frac{3}{4}+\frac{1}{4}\log_2\frac{1}{4})]\}=0.0715$$
  • For the split point 0.344, the two subsets are $D_{0.344}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4} and $D_{0.344}^+$: {15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.344)= 0.9975-\{\frac{14}{17}\times[-(\frac{6}{14}\log_2\frac{6}{14}+\frac{8}{14}\log_2\frac{8}{14})]+\frac{3}{17}\times[-(\frac{2}{3}\log_2\frac{2}{3}+\frac{1}{3}\log_2\frac{1}{3})]\}=0.0241$$
  • For the split point 0.373, the two subsets are $D_{0.373}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15} and $D_{0.373}^+$: {2, 1}
    $$Gain(D, \text{sugar content}, 0.373)= 0.9975-\{\frac{15}{17}\times[-(\frac{6}{15}\log_2\frac{6}{15}+\frac{9}{15}\log_2\frac{9}{15})]+\frac{2}{17}\times[-(\frac{2}{2}\log_2\frac{2}{2}+\frac{0}{2}\log_2\frac{0}{2})]\}=0.1408$$
  • For the split point 0.418, the two subsets are $D_{0.418}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2} and $D_{0.418}^+$: {1}
    $$Gain(D, \text{sugar content}, 0.418)= 0.9975-\{\frac{16}{17}\times[-(\frac{7}{16}\log_2\frac{7}{16}+\frac{9}{16}\log_2\frac{9}{16})]+\frac{1}{17}\times[-(\frac{1}{1}\log_2\frac{1}{1}+\frac{0}{1}\log_2\frac{0}{1})]\}=0.0669$$

Therefore, the maximum information gain obtained by splitting on the attribute "sugar content" is 0.3492, reached at the split point 0.126:
$$Gain(D, \text{sugar content})=Gain(D, \text{sugar content}, t=0.126)=0.3492$$

In the same way, the maximum information gain obtained by splitting on the attribute "density" is 0.2624, reached at the split point 0.3815:
$$Gain(D, \text{density})=Gain(D, \text{density}, t=0.3815)=0.2624$$
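The two results above can be reproduced with a short scan over all candidate split points. This is a sketch under stated assumptions: the helper names `H` and `scan` are hypothetical, and the class labels (samples 1 to 8 good, samples 9 to 17 bad) are inferred from the class counts used in the worked calculations.

```python
import math

def H(pos, neg):
    """Binary entropy in bits from class counts, taking 0 * log(0) as 0."""
    n = pos + neg
    return -sum(c / n * math.log2(c / n) for c in (pos, neg) if c)

def scan(values, labels):
    """Return (t, gain) for each midpoint between adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    n, pos = len(pairs), sum(labels)
    base, left_pos, out = H(pos, n - pos), 0, []
    for i in range(1, n):
        left_pos += pairs[i - 1][1]            # good melons among the i smallest
        if pairs[i - 1][0] == pairs[i][0]:
            continue                            # equal neighbours give no split
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        right_pos = pos - left_pos
        gain = (base - i / n * H(left_pos, i - left_pos)
                - (n - i) / n * H(right_pos, n - i - right_pos))
        out.append((t, gain))
    return out

# Values listed in sample order 1..17; labels are an assumption (1 = good melon).
labels = [1] * 8 + [0] * 9
sugar = [0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
         0.267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103]
density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719]

for name, column in (("sugar content", sugar), ("density", density)):
    t, gain = max(scan(column, labels), key=lambda r: r[1])
    print(f"{name}: t = {t:.4f}, gain = {gain:.4f}")
```

The printed gains may differ from the hand-computed ones in the last decimal place, because the code keeps full floating-point precision for $Entropy(D)$ instead of the rounded 0.9975.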

This is how continuous-valued attributes are handled.

2. Handling attributes with missing values

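Table 3. Watermelon dataset with missing values ("-" marks a missing value)

| No. | Color | Root | Sound | Texture | Navel | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 1 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 2 | dark | curled | muffled | clear | sunken | - | yes |
| 3 | dark | curled | - | clear | sunken | hard-smooth | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | yes |
| 5 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 6 | green | slightly curled | dull | clear | - | soft-sticky | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | yes |
| 8 | dark | slightly curled | dull | - | slightly sunken | hard-smooth | yes |
| 9 | dark | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |
| 10 | green | stiff | crisp | - | flat | soft-sticky | no |
| 11 | light | stiff | crisp | blurry | flat | - | no |
| 12 | light | curled | - | blurry | flat | soft-sticky | no |
| 13 | - | slightly curled | dull | slightly blurry | sunken | hard-smooth | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | no |
| 15 | dark | slightly curled | dull | clear | - | soft-sticky | no |
| 16 | light | curled | dull | blurry | flat | hard-smooth | no |
| 17 | green | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |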

(1) How do we select the splitting attribute when some attribute values are missing?

Given a training set $D$ and an attribute $A$, let $\widetilde{D}$ denote the subset of samples in $D$ that have no missing value on $A$. Suppose $A$ can take $M$ values $\{a^1, a^2, ..., a^M\}$; let $\widetilde{D}^m$ denote the subset of $\widetilde{D}$ whose value on $A$ is $a^m$, and let $\widetilde{D}_k$ denote the subset of $\widetilde{D}$ that belongs to class $k$ ($k=1,2,...,K$). Then $\widetilde{D}=\cup_{k=1}^{K}\widetilde{D}_k=\cup_{m=1}^{M}\widetilde{D}^m$. Assign each sample $x$ a weight $w_x$ and define:
$$\rho=\frac{\sum_{x\in\widetilde{D}}w_x}{\sum_{x\in D}w_x},\qquad \widetilde{p}_k=\frac{\sum_{x\in\widetilde{D}_k}w_x}{\sum_{x\in \widetilde{D}}w_x},\qquad \widetilde{r}_m=\frac{\sum_{x\in\widetilde{D}^m}w_x}{\sum_{x\in \widetilde{D}}w_x}$$

Here $\rho$ is the proportion of samples without missing values, $\widetilde{p}_k$ is the proportion of class $k$ among the samples without missing values, and $\widetilde{r}_m$ is the proportion of samples taking the value $a^m$ on $A$ among the samples without missing values. Hence $\sum_{k=1}^K\widetilde{p}_k=\sum_{m=1}^M\widetilde{r}_m=1$.

Based on these definitions, the information gain can be generalized to data with missing values as:
$$\begin{aligned} Entropy(\widetilde{D})&=-\sum_{k=1}^{K}\widetilde{p}_k\log_2\widetilde{p}_k\\ Gain(D, A)&=\rho\times Gain(\widetilde{D}, A)=\rho\times[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)] \end{aligned}$$
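The generalized gain can be sketched as a small function. This is an illustration under assumptions, not C4.5's actual code: the names `weighted_entropy` and `gain_with_missing` are hypothetical, a missing value is represented by `None`, and every weight defaults to 1.

```python
import math
from collections import defaultdict

def weighted_entropy(rows):
    """Entropy in bits of (label, weight) pairs, using weight fractions p_k."""
    total = sum(w for _, w in rows)
    by_class = defaultdict(float)
    for label, w in rows:
        by_class[label] += w
    return -sum(w / total * math.log2(w / total) for w in by_class.values() if w)

def gain_with_missing(values, labels, weights=None):
    """Gain(D, A) = rho * Gain(D~, A), where None marks a missing value."""
    if weights is None:
        weights = [1.0] * len(values)
    known = [(v, y, w) for v, y, w in zip(values, labels, weights) if v is not None]
    if not known:
        return 0.0                               # attribute unknown everywhere
    w_known = sum(w for _, _, w in known)
    rho = w_known / sum(weights)                 # share of samples without missing values
    groups = defaultdict(list)
    for v, y, w in known:
        groups[v].append((y, w))
    cond = sum(sum(w for _, w in g) / w_known * weighted_entropy(g)   # r_m * Entropy(D~^m)
               for g in groups.values())
    return rho * (weighted_entropy([(y, w) for _, y, w in known]) - cond)

# Toy data (not from the article): one of four samples has a missing value.
print(round(gain_with_missing(["a", "a", "b", None], [1, 1, 0, 0]), 4))  # 0.6887
```

In the toy call, the three known samples are separated perfectly by the attribute, so the gain on $\widetilde{D}$ equals its entropy (about 0.9183), scaled by $\rho=3/4$.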

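For the dataset in Table 3, $D=\{1,2,3,...,17\}$ and every sample starts with weight $w_x=1$. The information gain of each attribute is then: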

  • Attribute "color": the subset without missing values is $\widetilde{D}=\{2,3,4,6,7,8,9,10,11,12,14,15,16,17\}$, and the attribute takes 3 values, "dark", "green", and "light"
    $$\begin{aligned} Gain(D, \text{color})&=\rho\times[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)]\\ &=\frac{14}{17}\times\{-(\frac{6}{14}\log_2\frac{6}{14}+\frac{8}{14}\log_2\frac{8}{14})-[\frac{6}{14}\times(-(\frac{4}{6}\log_2\frac{4}{6}+\frac{2}{6}\log_2\frac{2}{6}))+\frac{4}{14}\times(-(\frac{2}{4}\log_2\frac{2}{4}+\frac{2}{4}\log_2\frac{2}{4}))+\frac{4}{14}\times(-(\frac{0}{4}\log_2\frac{0}{4}+\frac{4}{4}\log_2\frac{4}{4}))]\}\\ &=0.2519 \end{aligned}$$
  • Attribute "root": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,8,10,11,12,13,14,15,16\}$, and the attribute takes 3 values, "curled", "slightly curled", and "stiff"
    $$\begin{aligned} Gain(D, \text{root})&=\frac{15}{17}\times\{-(\frac{8}{15}\log_2\frac{8}{15}+\frac{7}{15}\log_2\frac{7}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log_2\frac{5}{7}+\frac{2}{7}\log_2\frac{2}{7}))+\frac{6}{15}\times(-(\frac{3}{6}\log_2\frac{3}{6}+\frac{3}{6}\log_2\frac{3}{6}))+\frac{2}{15}\times(-(\frac{0}{2}\log_2\frac{0}{2}+\frac{2}{2}\log_2\frac{2}{2}))]\}\\ &=0.1711 \end{aligned}$$
  • Attribute "sound": the subset without missing values is $\widetilde{D}=\{1,2,4,5,6,7,8,9,10,11,13,14,15,16,17\}$, and the attribute takes 3 values, "dull", "muffled", and "crisp"
    $$\begin{aligned} Gain(D, \text{sound})&=\frac{15}{17}\times\{-(\frac{7}{15}\log_2\frac{7}{15}+\frac{8}{15}\log_2\frac{8}{15})-[\frac{8}{15}\times(-(\frac{5}{8}\log_2\frac{5}{8}+\frac{3}{8}\log_2\frac{3}{8}))+\frac{5}{15}\times(-(\frac{2}{5}\log_2\frac{2}{5}+\frac{3}{5}\log_2\frac{3}{5}))+\frac{2}{15}\times(-(\frac{0}{2}\log_2\frac{0}{2}+\frac{2}{2}\log_2\frac{2}{2}))]\}\\ &=0.1448 \end{aligned}$$
  • Attribute "texture": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,9,11,12,13,14,15,16,17\}$, and the attribute takes 3 values, "clear", "slightly blurry", and "blurry"
    $$\begin{aligned} Gain(D, \text{texture})&=\frac{15}{17}\times\{-(\frac{7}{15}\log_2\frac{7}{15}+\frac{8}{15}\log_2\frac{8}{15})-[\frac{7}{15}\times(-(\frac{6}{7}\log_2\frac{6}{7}+\frac{1}{7}\log_2\frac{1}{7}))+\frac{5}{15}\times(-(\frac{1}{5}\log_2\frac{1}{5}+\frac{4}{5}\log_2\frac{4}{5}))+\frac{3}{15}\times(-(\frac{0}{3}\log_2\frac{0}{3}+\frac{3}{3}\log_2\frac{3}{3}))]\}\\ &=0.4235 \end{aligned}$$
  • Attribute "navel": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,7,8,9,10,11,12,13,14,16,17\}$, and the attribute takes 3 values, "sunken", "slightly sunken", and "flat"
    $$\begin{aligned} Gain(D, \text{navel})&=\frac{15}{17}\times\{-(\frac{7}{15}\log_2\frac{7}{15}+\frac{8}{15}\log_2\frac{8}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log_2\frac{5}{7}+\frac{2}{7}\log_2\frac{2}{7}))+\frac{4}{15}\times(-(\frac{2}{4}\log_2\frac{2}{4}+\frac{2}{4}\log_2\frac{2}{4}))+\frac{4}{15}\times(-(\frac{0}{4}\log_2\frac{0}{4}+\frac{4}{4}\log_2\frac{4}{4}))]\}\\ &=0.2888 \end{aligned}$$
  • Attribute "touch": the subset without missing values is $\widetilde{D}=\{1,3,4,5,6,7,8,9,10,12,13,14,15,16,17\}$, and the attribute takes 2 values, "hard-smooth" and "soft-sticky"
    $$\begin{aligned} Gain(D, \text{touch})&=\frac{15}{17}\times\{-(\frac{7}{15}\log_2\frac{7}{15}+\frac{8}{15}\log_2\frac{8}{15})-[\frac{10}{15}\times(-(\frac{5}{10}\log_2\frac{5}{10}+\frac{5}{10}\log_2\frac{5}{10}))+\frac{5}{15}\times(-(\frac{2}{5}\log_2\frac{2}{5}+\frac{3}{5}\log_2\frac{3}{5}))]\}\\ &=0.0057 \end{aligned}$$
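The six gains above can be checked with the sketch below. The encoding is an assumption made for compactness: each attribute of Table 3 is stored as a 17-character string (one hypothetical letter code per sample, "-" for a missing value), the class labels (samples 1 to 8 good, 9 to 17 bad) are inferred from the class counts in the formulas, and all sample weights are taken to be 1 so that weights reduce to counts.

```python
import math
from collections import Counter

LABELS = "YYYYYYYYNNNNNNNNN"          # assumed: samples 1-8 good, 9-17 bad
TABLE = {                              # one character per sample, "-" = missing
    "color":   "-DDG-GDDDGLL-LDLG",   # D dark, G green, L light
    "root":    "CCCCCSSS-TTCSSSC-",   # C curled, S slightly curled, T stiff
    "sound":   "DM-MDDDDMCC-DMDDM",   # D dull, M muffled, C crisp
    "texture": "CCCCCCS-S-BBSSCBS",   # C clear, S slightly blurry, B blurry
    "navel":   "SSSSS-LLLFFFSS-FL",   # S sunken, L slightly sunken, F flat
    "touch":   "H-HHHSSHHS-SHHSHH",   # H hard-smooth, S soft-sticky
}

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(column, labels=LABELS):
    """rho * (Entropy(D~) - sum_m r_m * Entropy(D~^m)) with unit weights."""
    known = [(v, y) for v, y in zip(column, labels) if v != "-"]
    rho = len(known) / len(column)
    cond = 0.0
    for value in {v for v, _ in known}:
        subset = [y for v, y in known if v == value]
        cond += len(subset) / len(known) * entropy(subset)
    return rho * (entropy([y for _, y in known]) - cond)

gains = {name: gain(column) for name, column in TABLE.items()}
best = max(gains, key=gains.get)
for name, g in gains.items():
    print(f"{name}: {g:.4f}")
print("best attribute:", best)
```

With this encoding, `best` comes out as "texture", in line with the hand calculation; individual gains may differ from the values above in the last decimal place because of rounding.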
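(2) Given a splitting attribute, how is a sample divided if its value on that attribute is missing?
  • If the sample's value on the splitting attribute is known, the sample goes to the child node for that value and keeps its weight $w_x$ (1 for the initial samples);
  • If the sample's value on the splitting attribute is unknown, the sample is placed into every child node at once, and its weight in the child node for the value $a^m$ becomes $\widetilde{r}_m\cdot w_x$.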
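The two rules above can be sketched as follows. The function name `split_with_missing` and the use of `None` for a missing value are assumptions for illustration; `Fraction` is used so the weights stay exact. The usage example encodes the "texture" column of Table 3 with hypothetical one-letter codes (C clear, S slightly blurry, B blurry).

```python
from collections import defaultdict
from fractions import Fraction

def split_with_missing(values, weights):
    """Partition samples by attribute value; None marks a missing value.

    Returns {value: {sample_number: weight}}. A sample with a missing value
    enters every branch with weight r_m * w_x, where r_m is the branch's
    share of the total weight of the samples whose value is known.
    """
    known_total = sum(w for v, w in zip(values, weights) if v is not None)
    branches = defaultdict(dict)
    for i, (v, w) in enumerate(zip(values, weights), start=1):
        if v is not None:
            branches[v][i] = w                   # known value: weight unchanged
    # Shares are computed before any fractional samples are added.
    shares = {v: sum(b.values()) / known_total for v, b in branches.items()}
    for i, (v, w) in enumerate(zip(values, weights), start=1):
        if v is None:
            for value, r in shares.items():
                branches[value][i] = r * w       # missing value: fractional copy
    return dict(branches)

texture = [None if c == "-" else c for c in "CCCCCCS-S-BBSSCBS"]
out = split_with_missing(texture, [Fraction(1)] * 17)
print(out["C"][8], sum(out["C"].values()))  # 7/15 119/15
```

Samples 8 and 10 each enter the "clear" branch with weight 7/15, so the branch's total weight is 7 + 2 × 7/15 = 119/15.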

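The attribute "texture" has the largest information gain, so it is used for the next split. It has 15 samples with known values (7 clear, 5 slightly blurry, 3 blurry) and 2 samples with unknown values, {8, 10}.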

| Attribute: value | Samples | Good melons | Bad melons | Missing-value samples | Weight of missing-value samples | Total weight |
|---|---|---|---|---|---|---|
| texture: clear | {1,2,3,4,5,6,15} | {1,2,3,4,5,6} | {15} | {8,10} | $2\times\frac{7}{15}$ | $7+2\times\frac{7}{15}$ |
| texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | {8,10} | $2\times\frac{5}{15}$ | $5+2\times\frac{5}{15}$ |
| texture: blurry | {11,12,16} | - | {11,12,16} | {8,10} | $2\times\frac{3}{15}$ | $3+2\times\frac{3}{15}$ |

The child node texture = clear contains 7 samples with a known value, {1,2,3,4,5,6,15}, of which 6 are good melons and 1 is a bad melon. Assuming that the class distribution of the samples with missing values matches that of the samples with known values, namely $\frac{6}{7}$ and $\frac{1}{7}$, the information entropy of the child node texture = clear is:
$$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}})&=-\sum_{k=1}^{K}p_k\log_2 p_k\\ &=-(\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log_2\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}+\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log_2\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2})\\ &=0.5916 \end{aligned}$$
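The arithmetic in this formula can be traced with exact fractions. The sketch below follows the assumption stated in the text (the weight of the missing-value samples is spread over the classes in the known 6:1 ratio); the variable names are hypothetical.

```python
import math
from fractions import Fraction

known_good, known_bad = 6, 1             # texture = clear, value known
r_clear = Fraction(7, 15)                # share of the "clear" branch
n_missing = 2                            # samples 8 and 10

extra = n_missing * r_clear              # weight added to the branch: 14/15
total = known_good + known_bad + extra   # 7 + 14/15
# Assumption from the text: the added weight follows the known 6:1 class ratio.
w_good = known_good + Fraction(6, 7) * extra
w_bad = known_bad + Fraction(1, 7) * extra

p_good, p_bad = w_good / total, w_bad / total
entropy = -sum(float(p) * math.log2(float(p)) for p in (p_good, p_bad))
print(p_good, p_bad)  # 6/7 1/7
```

Under this assumption the weighted class proportions reduce to exactly 6/7 and 1/7, so the node entropy equals the entropy of the 7 known samples.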

  • Child node texture = clear: information gain of the attribute "color"
    • color = dark has 3 samples (2 positive and 1 negative); color = green has 2 samples (2 positive); 2 samples (1 and 5) have a missing value

    • Weights of the missing-value samples: the share of color = dark is $\frac{3}{5}$, giving a total weight of $2\times\frac{3}{5}=\frac{6}{5}$; the share of color = green is $\frac{2}{5}$, giving a total weight of $2\times\frac{2}{5}=\frac{4}{5}$

    • color = dark: weight of positive samples: $2+\frac{2}{3}\times\frac{3}{5}\times2$; weight of negative samples: $1+\frac{1}{3}\times\frac{3}{5}\times2$; total weight: $2+\frac{2}{3}\times\frac{3}{5}\times2+1+\frac{1}{3}\times\frac{3}{5}\times2=3+\frac{3}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{color}=\text{dark}})=-(\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log_2\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}+\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log_2\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2})=0.9183$$

    • color = green: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{5}\times2$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{5}\times2$; total weight: $2+\frac{2}{2}\times\frac{2}{5}\times2+0+\frac{0}{2}\times\frac{2}{5}\times2=2+\frac{2}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{color}=\text{green}})=-(\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log_2\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}+\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log_2\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2})=0.0$$

      $$Gain(D^{\text{texture}=\text{clear}},\text{color})=0.5916-(\frac{3+\frac{3}{5}\times2}{7+\frac{7}{15}\times2}\times0.9183+\frac{2+\frac{2}{5}\times2}{7+\frac{7}{15}\times2}\times0.0)=0.1054$$

  • Child node texture = clear: information gain of the attribute "root"

    • root = curled has 5 samples (5 positive); root = slightly curled has 2 samples (1 positive and 1 negative); no sample has a missing value
      $$\begin{aligned} Entropy(D^{\text{texture}=\text{clear},\,\text{root}=\text{curled}})&=-(\frac{5}{5}\log_2\frac{5}{5}+\frac{0}{5}\log_2\frac{0}{5})=0.0\\ Entropy(D^{\text{texture}=\text{clear},\,\text{root}=\text{slightly curled}})&=-(\frac{1}{2}\log_2\frac{1}{2}+\frac{1}{2}\log_2\frac{1}{2})=1.0\\ Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.5916-(\frac{5}{7}\times0.0+\frac{2}{7}\times1.0)=0.3058 \end{aligned}$$
  • Child node texture = clear: information gain of the attribute "sound"

    • sound = dull has 4 samples (3 positive and 1 negative); sound = muffled has 2 samples (2 positive); 1 sample (sample 3) has a missing value

    • Weights of the missing-value sample: the share of sound = dull is $\frac{4}{6}$, giving a total weight of $\frac{4}{6}$; the share of sound = muffled is $\frac{2}{6}$, giving a total weight of $\frac{2}{6}$

    • sound = dull: weight of positive samples: $3+\frac{3}{4}\times\frac{4}{6}$; weight of negative samples: $1+\frac{1}{4}\times\frac{4}{6}$; total weight: $3+\frac{3}{4}\times\frac{4}{6}+1+\frac{1}{4}\times\frac{4}{6}=4+\frac{4}{6}$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{sound}=\text{dull}})=-(\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log_2\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}+\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log_2\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}})=0.8112$$

    • sound = muffled: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{6}$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{6}$; total weight: $2+\frac{2}{2}\times\frac{2}{6}+0+\frac{0}{2}\times\frac{2}{6}=2+\frac{2}{6}$
      $$Entropy(D^{\text{texture}=\text{clear},\,\text{sound}=\text{muffled}})=-(\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log_2\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}+\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log_2\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}})=0.0$$
      $$Gain(D^{\text{texture}=\text{clear}},\text{sound})=0.5916-(\frac{4+\frac{4}{6}}{7+\frac{7}{15}\times2}\times0.8112+\frac{2+\frac{2}{6}}{7+\frac{7}{15}\times2}\times0.0)=0.1144$$
