Building a Decision Tree with the ID3 Algorithm Using Information Gain
Build a decision tree from the following dataset using the ID3 (information gain) method.

election | season | oil price | stock price |
---|---|---|---|
no | winter | rise | rise |
no | summer | fall | fall |
no | summer | rise | rise |
yes | winter | rise | fall |
no | winter | fall | fall |
yes | summer | rise | rise |
yes | summer | fall | fall |
yes | winter | fall | fall |

1. Compute the entropy of the dataset

Entropy is defined as:

$$H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where $p_i$ is the probability of class $i$.

The last column of the dataset, `stock price`, takes two values: `rise` and `fall`. Counting each value:

- `rise` occurs 3 times.
- `fall` occurs 5 times.
- The dataset has 8 records in total.

So $p(\text{rise}) = \frac{3}{8}$ and $p(\text{fall}) = \frac{5}{8}$.

The entropy $H(S)$ is:

$$H(S) = -\left(\frac{3}{8}\log_2\frac{3}{8} + \frac{5}{8}\log_2\frac{5}{8}\right)$$

Evaluating:

$$H(S) = -\left(0.375 \log_2 0.375 + 0.625 \log_2 0.625\right)$$

$$\log_2 0.375 \approx -1.415, \quad \log_2 0.625 \approx -0.678$$

$$H(S) = -\left(0.375 \times (-1.415) + 0.625 \times (-0.678)\right)$$

$$H(S) = -\left(-0.5306 - 0.42375\right) = 0.95435$$

The entropy of the dataset is $H(S) \approx 0.954$.
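
The hand calculation above can be checked with a few lines of Python. This is a minimal sketch: the helper name `entropy` and the `labels` list are illustrative choices, not part of the original walkthrough.

```python
from collections import Counter
from math import log2

# The `stock price` column of the 8-row dataset, in table order.
labels = ["rise", "fall", "rise", "fall", "fall", "rise", "fall", "fall"]

def entropy(values):
    """Shannon entropy in bits of a list of class labels."""
    total = len(values)
    counts = Counter(values)
    # Only classes that actually occur appear in `counts`,
    # so log2 is never called on zero.
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(round(entropy(labels), 3))  # prints 0.954
```

Working from counts rather than hard-coded probabilities keeps the function reusable for the subsets that appear in later steps.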
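
2. Split the dataset by each feature and compute the conditional entropy

We compute the conditional entropy for each feature (`election`, `season`, `oil price`) in turn.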

(1) Feature `election`

`election` takes two values: `yes` and `no`.

- When `election = yes`, there are 4 records: `rise` occurs 1 time and `fall` occurs 3 times. The entropy is:

  $$H(S \mid \text{election} = \text{yes}) = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{4}\log_2\frac{3}{4}\right)$$

  $$H(S \mid \text{election} = \text{yes}) = -\left(0.25 \times (-2) + 0.75 \times (-0.415)\right) \approx 0.811$$

- When `election = no`, there are 4 records: `rise` occurs 2 times and `fall` occurs 2 times. The entropy is:

  $$H(S \mid \text{election} = \text{no}) = -\left(\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right)$$

  $$H(S \mid \text{election} = \text{no}) = -\left(0.5 \times (-1) + 0.5 \times (-1)\right) = 1$$

The conditional entropy $H(S \mid \text{election})$ is the average of the two, weighted by subset size:

$$H(S \mid \text{election}) = \frac{4}{8} H(S \mid \text{election} = \text{yes}) + \frac{4}{8} H(S \mid \text{election} = \text{no})$$

$$H(S \mid \text{election}) = 0.5 \times 0.811 + 0.5 \times 1 = 0.9055$$

The information gain $IG(S, \text{election})$ is:

$$IG(S, \text{election}) = H(S) - H(S \mid \text{election})$$

$$IG(S, \text{election}) = 0.954 - 0.9055 = 0.0485$$
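
The same computation for `election` can be sketched in Python. The function names `entropy` and `conditional_entropy` and the `rows` list are assumptions made for this illustration.

```python
from collections import Counter, defaultdict
from math import log2

# (election, stock price) pairs from the dataset, in table order.
rows = [
    ("no", "rise"), ("no", "fall"), ("no", "rise"), ("yes", "fall"),
    ("no", "fall"), ("yes", "rise"), ("yes", "fall"), ("yes", "fall"),
]

def entropy(labels):
    """Entropy in bits; p * log2(1 / p) is the same as -p * log2(p)."""
    total = len(labels)
    return sum(c / total * log2(total / c) for c in Counter(labels).values())

def conditional_entropy(pairs):
    """Size-weighted average entropy of the label within each feature value."""
    groups = defaultdict(list)
    for feature_value, label in pairs:
        groups[feature_value].append(label)
    total = len(pairs)
    return sum(len(g) / total * entropy(g) for g in groups.values())

h_s = entropy([label for _, label in rows])
h_cond = conditional_entropy(rows)
gain = h_s - h_cond
print(round(h_cond, 4), round(gain, 4))  # prints 0.9056 0.0488
```

At full precision the results are about 0.9056 and 0.0488. The hand-computed 0.9055 and 0.0485 differ slightly because the intermediate values were rounded to three decimals first.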

(2) Feature `season`

`season` takes two values: `winter` and `summer`.

- When `season = winter`, there are 4 records: `rise` occurs 1 time and `fall` occurs 3 times. The entropy is:

  $$H(S \mid \text{season} = \text{winter}) = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{4}\log_2\frac{3}{4}\right) \approx 0.811$$

- When `season = summer`, there are 4 records: `rise` occurs 2 times and `fall` occurs 2 times. The entropy is:

  $$H(S \mid \text{season} = \text{summer}) = -\left(\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right) = 1$$

The conditional entropy $H(S \mid \text{season})$ is:

$$H(S \mid \text{season}) = \frac{4}{8} H(S \mid \text{season} = \text{winter}) + \frac{4}{8} H(S \mid \text{season} = \text{summer})$$

$$H(S \mid \text{season}) = 0.5 \times 0.811 + 0.5 \times 1 = 0.9055$$

The information gain $IG(S, \text{season})$ is:

$$IG(S, \text{season}) = H(S) - H(S \mid \text{season})$$

$$IG(S, \text{season}) = 0.954 - 0.9055 = 0.0485$$

(3) Feature `oil price`

`oil price` takes two values: `rise` and `fall`.

- When `oil price = rise`, there are 4 records: `rise` occurs 3 times and `fall` occurs 1 time. The entropy is:

  $$H(S \mid \text{oil price} = \text{rise}) = -\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right)$$

  $$H(S \mid \text{oil price} = \text{rise}) = -\left(0.75 \times (-0.415) + 0.25 \times (-2)\right) \approx 0.811$$

- When `oil price = fall`, there are 4 records: `rise` occurs 0 times and `fall` occurs 4 times. Using the convention $0 \log_2 0 = 0$, the entropy is:

  $$H(S \mid \text{oil price} = \text{fall}) = -\left(\frac{0}{4}\log_2\frac{0}{4} + \frac{4}{4}\log_2\frac{4}{4}\right) = 0$$

The conditional entropy $H(S \mid \text{oil price})$ is:

$$H(S \mid \text{oil price}) = \frac{4}{8} H(S \mid \text{oil price} = \text{rise}) + \frac{4}{8} H(S \mid \text{oil price} = \text{fall})$$

$$H(S \mid \text{oil price}) = 0.5 \times 0.811 + 0.5 \times 0 = 0.4055$$

The information gain $IG(S, \text{oil price})$ is:

$$IG(S, \text{oil price}) = H(S) - H(S \mid \text{oil price})$$

$$IG(S, \text{oil price}) = 0.954 - 0.4055 = 0.5485$$
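
The `oil price = fall` case contains a zero probability, which needs care in code: Python's `math.log2(0)` raises a `ValueError`. A common way to handle this, sketched below with the hypothetical helper `entropy_from_probs`, is to skip zero-probability terms, which matches the convention that $0 \log_2 0 = 0$.

```python
from math import log2

def entropy_from_probs(probs):
    """Entropy in bits of a probability list.

    Terms with p == 0 are skipped, which implements 0 * log2(0) = 0.
    Writing each term as p * log2(1 / p) equals -p * log2(p).
    """
    return sum(p * log2(1 / p) for p in probs if p > 0)

h_rise = entropy_from_probs([3 / 4, 1 / 4])  # oil price = rise
h_fall = entropy_from_probs([0 / 4, 4 / 4])  # oil price = fall
h_cond = 0.5 * h_rise + 0.5 * h_fall

print(round(h_rise, 3))  # prints 0.811
print(h_fall)            # prints 0.0
print(round(h_cond, 4))  # prints 0.4056
```

The full-precision conditional entropy is about 0.4056, versus 0.4055 when the rounded value 0.811 is used.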

3. Summary of information gain

- $IG(S, \text{election}) = 0.0485$
- $IG(S, \text{season}) = 0.0485$
- $IG(S, \text{oil price}) = 0.5485$

`oil price` is therefore the feature with the largest information gain.
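
As a cross-check, the three gains can be computed in one pass over the dataset. This is a sketch only: `FEATURES`, `DATA`, `entropy`, and `info_gain` are names introduced here for illustration.

```python
from collections import Counter, defaultdict
from math import log2

FEATURES = ["election", "season", "oil price"]
# Rows as (election, season, oil price, stock price), in table order.
DATA = [
    ("no", "winter", "rise", "rise"),
    ("no", "summer", "fall", "fall"),
    ("no", "summer", "rise", "rise"),
    ("yes", "winter", "rise", "fall"),
    ("no", "winter", "fall", "fall"),
    ("yes", "summer", "rise", "rise"),
    ("yes", "summer", "fall", "fall"),
    ("yes", "winter", "fall", "fall"),
]

def entropy(labels):
    """Entropy in bits; p * log2(1 / p) is the same as -p * log2(p)."""
    total = len(labels)
    return sum(c / total * log2(total / c) for c in Counter(labels).values())

def info_gain(rows, col):
    """H(S) minus the size-weighted entropy after splitting on column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - cond

gains = {name: info_gain(DATA, i) for i, name in enumerate(FEATURES)}
best = max(gains, key=gains.get)

for name, g in gains.items():
    print(f"{name}: {g:.4f}")
print("best:", best)  # prints best: oil price
```

At full precision the gains come out near 0.0488, 0.0488, and 0.5488, so the ranking agrees with the rounded hand calculation.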

4. Select the root node

In the previous step we computed the information gain of each feature:

- $IG(S, \text{election}) = 0.0485$
- $IG(S, \text{season}) = 0.0485$
- $IG(S, \text{oil price}) = 0.5485$

Because `oil price` has the largest information gain, we choose it as the root node.

5. Split the dataset on `oil price`

`oil price` takes two values: `rise` and `fall`.

(1) Subset 1: `oil price = rise`

When `oil price = rise`, the records are:

election | season | oil price | stock price |
---|---|---|---|
no | winter | rise | rise |
no | summer | rise | rise |
yes | winter | rise | fall |
yes | summer | rise | rise |

- `stock price = rise` occurs 3 times.
- `stock price = fall` occurs 1 time.

The entropy is:

$$H(S \mid \text{oil price} = \text{rise}) = -\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right)$$

$$H(S \mid \text{oil price} = \text{rise}) = -\left(0.75 \times (-0.415) + 0.25 \times (-2)\right) \approx 0.811$$

Because the entropy is not 0, the `oil price = rise` subset needs to be split further.

(2) Subset 2: `oil price = fall`

When `oil price = fall`, the records are:

election | season | oil price | stock price |
---|---|---|---|
no | summer | fall | fall |
no | winter | fall | fall |
yes | summer | fall | fall |
yes | winter | fall | fall |

- `stock price = rise` occurs 0 times.
- `stock price = fall` occurs 4 times.

Using the convention $0 \log_2 0 = 0$, the entropy is:

$$H(S \mid \text{oil price} = \text{fall}) = -\left(\frac{0}{4}\log_2\frac{0}{4} + \frac{4}{4}\log_2\frac{4}{4}\right) = 0$$

Because the entropy is 0, the `oil price = fall` subset is pure and needs no further splitting. It becomes a leaf node with `stock price = fall`.
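
Splitting on the root feature and testing each subset for purity can be sketched as follows. The helpers `split_by` and `is_pure` and the column constants are hypothetical names used only in this example.

```python
from collections import defaultdict

# Rows as (election, season, oil price, stock price), in table order.
DATA = [
    ("no", "winter", "rise", "rise"),
    ("no", "summer", "fall", "fall"),
    ("no", "summer", "rise", "rise"),
    ("yes", "winter", "rise", "fall"),
    ("no", "winter", "fall", "fall"),
    ("yes", "summer", "rise", "rise"),
    ("yes", "summer", "fall", "fall"),
    ("yes", "winter", "fall", "fall"),
]
OIL, LABEL = 2, 3  # column indices

def split_by(rows, col):
    """Group rows by the value they hold in column `col`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[col]].append(row)
    return dict(subsets)

def is_pure(rows):
    """A subset is pure when every row carries the same label (entropy 0)."""
    return len({row[LABEL] for row in rows}) == 1

subsets = split_by(DATA, OIL)
for value, rows in subsets.items():
    status = "leaf" if is_pure(rows) else "needs another split"
    print(f"oil price = {value}: {len(rows)} rows, {status}")
```

Checking for a single distinct label is equivalent to checking for zero entropy, and avoids comparing floating-point values.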

6. Split the `oil price = rise` subset further

Only the `oil price = rise` subset remains to be processed:

election | season | oil price | stock price |
---|---|---|---|
no | winter | rise | rise |
no | summer | rise | rise |
yes | winter | rise | fall |
yes | summer | rise | rise |

We again compute the information gain of the remaining features, `election` and `season`. All counts and entropies in this step are taken within the `oil price = rise` subset, so its entropy of 0.811 plays the role that $H(S)$ played at the root.

(1) Feature `election`

`election` takes two values: `yes` and `no`.

- When `election = yes`, there are 2 records: `rise` occurs 1 time and `fall` occurs 1 time. The entropy is:

  $$H(S \mid \text{election} = \text{yes}) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$$

- When `election = no`, there are 2 records: `rise` occurs 2 times and `fall` occurs 0 times. The entropy is:

  $$H(S \mid \text{election} = \text{no}) = -\left(\frac{2}{2}\log_2\frac{2}{2} + \frac{0}{2}\log_2\frac{0}{2}\right) = 0$$

The conditional entropy $H(S \mid \text{election})$ is:

$$H(S \mid \text{election}) = \frac{2}{4} H(S \mid \text{election} = \text{yes}) + \frac{2}{4} H(S \mid \text{election} = \text{no})$$

$$H(S \mid \text{election}) = 0.5 \times 1 + 0.5 \times 0 = 0.5$$

The information gain $IG(S, \text{election})$ is:

$$IG(S, \text{election}) = H(S \mid \text{oil price} = \text{rise}) - H(S \mid \text{election})$$

$$IG(S, \text{election}) = 0.811 - 0.5 = 0.311$$

(2) Feature `season`

`season` takes two values: `winter` and `summer`.

- When `season = winter`, there are 2 records: `rise` occurs 1 time and `fall` occurs 1 time. The entropy is:

  $$H(S \mid \text{season} = \text{winter}) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$$

- When `season = summer`, there are 2 records: `rise` occurs 2 times and `fall` occurs 0 times. The entropy is:

  $$H(S \mid \text{season} = \text{summer}) = -\left(\frac{2}{2}\log_2\frac{2}{2} + \frac{0}{2}\log_2\frac{0}{2}\right) = 0$$

The conditional entropy $H(S \mid \text{season})$ is:

$$H(S \mid \text{season}) = \frac{2}{4} H(S \mid \text{season} = \text{winter}) + \frac{2}{4} H(S \mid \text{season} = \text{summer})$$

$$H(S \mid \text{season}) = 0.5 \times 1 + 0.5 \times 0 = 0.5$$

The information gain $IG(S, \text{season})$ is:

$$IG(S, \text{season}) = H(S \mid \text{oil price} = \text{rise}) - H(S \mid \text{season})$$

$$IG(S, \text{season}) = 0.811 - 0.5 = 0.311$$
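
(3) Choose the splitting feature

`election` and `season` have the same information gain (0.311 each), so either one can serve as the splitting feature. Here we choose `election`.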
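
The tie between the two features can be reproduced in code. The sketch below assumes a simple tie-breaking rule, "first candidate in listing order wins", which happens to match the choice of `election` made here; the names `SUBSET`, `CANDIDATES`, and `info_gain` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

# The `oil price = rise` subset as (election, season, stock price).
SUBSET = [
    ("no", "winter", "rise"),
    ("no", "summer", "rise"),
    ("yes", "winter", "fall"),
    ("yes", "summer", "rise"),
]
CANDIDATES = {"election": 0, "season": 1}  # feature name -> column index

def entropy(labels):
    """Entropy in bits; p * log2(1 / p) is the same as -p * log2(p)."""
    total = len(labels)
    return sum(c / total * log2(total / c) for c in Counter(labels).values())

def info_gain(rows, col):
    """Subset entropy minus the size-weighted entropy after splitting on `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - cond

gains = {name: info_gain(SUBSET, col) for name, col in CANDIDATES.items()}

# max() returns the first maximal key, and dicts keep insertion order,
# so a tie resolves to `election`. Rounding guards against float noise.
chosen = max(gains, key=lambda name: round(gains[name], 9))

for name, g in gains.items():
    print(f"{name}: {g:.3f}")  # both print 0.311
print("chosen:", chosen)       # prints chosen: election
```

Choosing `season` instead would give a tree of the same depth with the two inner tests swapped.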

7. Split the subset on `election`

(1) Subset 1: `election = yes`

When `election = yes`, the records are:

election | season | oil price | stock price |
---|---|---|---|
yes | winter | rise | fall |
yes | summer | rise | rise |

- `stock price = rise` occurs 1 time.
- `stock price = fall` occurs 1 time.

The entropy is:

$$H(S \mid \text{election} = \text{yes}) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$$

Because the entropy is not 0, this subset needs to be split further.

(2) Subset 2: `election = no`

When `election = no`, the records are:

election | season | oil price | stock price |
---|---|---|---|
no | winter | rise | rise |
no | summer | rise | rise |

- `stock price = rise` occurs 2 times.
- `stock price = fall` occurs 0 times.

The entropy is:

$$H(S \mid \text{election} = \text{no}) = -\left(\frac{2}{2}\log_2\frac{2}{2} + \frac{0}{2}\log_2\frac{0}{2}\right) = 0$$

This subset is pure, so it becomes a leaf node with `stock price = rise`.

8. Split the `election = yes` subset further

For the `election = yes` subset:

election | season | oil price | stock price |
---|---|---|---|
yes | winter | rise | fall |
yes | summer | rise | rise |

`season` is the only feature left, so we use it as the splitting feature:

- When `season = winter`, `stock price = fall`.
- When `season = summer`, `stock price = rise`.

Both resulting subsets are now pure.
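
Steps 1 through 8 follow one recursive pattern: pick the feature with the highest information gain, split, and recurse until a subset is pure. The sketch below is one possible way to write that recursion, not a reference ID3 implementation. The nested-dict tree format, the function names, and the "first feature wins a tie" rule are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2

FEATURES = ("election", "season", "oil price")
# Rows as (election, season, oil price, stock price), in table order.
DATA = [
    ("no", "winter", "rise", "rise"),
    ("no", "summer", "fall", "fall"),
    ("no", "summer", "rise", "rise"),
    ("yes", "winter", "rise", "fall"),
    ("no", "winter", "fall", "fall"),
    ("yes", "summer", "rise", "rise"),
    ("yes", "summer", "fall", "fall"),
    ("yes", "winter", "fall", "fall"),
]

def entropy(labels):
    """Entropy in bits; p * log2(1 / p) is the same as -p * log2(p)."""
    total = len(labels)
    return sum(c / total * log2(total / c) for c in Counter(labels).values())

def split_by(rows, col):
    """Group rows by the value they hold in column `col`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[col]].append(row)
    return dict(subsets)

def info_gain(rows, col):
    cond = sum(
        len(s) / len(rows) * entropy([r[-1] for r in s])
        for s in split_by(rows, col).values()
    )
    return entropy([r[-1] for r in rows]) - cond

def build_tree(rows, cols):
    """Return a label (leaf) or {feature: {value: subtree}} (inner node)."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not cols:
        # Pure subset, or no features left: use the majority label.
        return Counter(labels).most_common(1)[0][0]
    # Ties go to the earliest column; rounding guards against float noise.
    best = max(cols, key=lambda c: round(info_gain(rows, c), 9))
    rest = [c for c in cols if c != best]
    return {
        FEATURES[best]: {
            value: build_tree(subset, rest)
            for value, subset in split_by(rows, best).items()
        }
    }

tree = build_tree(DATA, [0, 1, 2])
print(tree)
```

On this dataset the recursion picks `oil price` at the root, then `election`, then `season`, matching the choices made by hand.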

9. Build the complete decision tree

The final decision tree is:

oil price?
├── fall: stock price = fall
└── rise: election?
    ├── no: stock price = rise
    └── yes: season?
        ├── winter: stock price = fall
        └── summer: stock price = rise
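
To confirm that the tree above is consistent with the training data, it can be encoded as a nested dictionary and applied to each of the 8 records. The dictionary layout and the `predict` helper are assumptions for this sketch.

```python
FEATURES = ("election", "season", "oil price")

# The final tree: inner nodes are {feature: {value: subtree}}, leaves are labels.
TREE = {
    "oil price": {
        "fall": "fall",
        "rise": {
            "election": {
                "no": "rise",
                "yes": {"season": {"winter": "fall", "summer": "rise"}},
            }
        },
    }
}

# Rows as (election, season, oil price, stock price), in table order.
DATA = [
    ("no", "winter", "rise", "rise"),
    ("no", "summer", "fall", "fall"),
    ("no", "summer", "rise", "rise"),
    ("yes", "winter", "rise", "fall"),
    ("no", "winter", "fall", "fall"),
    ("yes", "summer", "rise", "rise"),
    ("yes", "summer", "fall", "fall"),
    ("yes", "winter", "fall", "fall"),
]

def predict(tree, record):
    """Walk from the root to a leaf, following the record's feature values."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))  # each inner node tests a single feature
        node = node[feature][record[feature]]
    return node

correct = sum(
    predict(TREE, dict(zip(FEATURES, row[:3]))) == row[3] for row in DATA
)
print(f"{correct}/{len(DATA)} training rows classified correctly")  # prints 8/8 ...
```

A perfect score on the training rows is expected here because every leaf of the tree is pure. It says nothing about how the tree would perform on unseen data.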