Building a Decision Tree with the ID3 Algorithm Using Information Gain

- 1. Compute the entropy of the dataset
- 2. Split the dataset by feature and compute the conditional entropy
  - (1) Feature `election`
  - (2) Feature `season`
  - (3) Feature `oil price`
- 3. Summary of information gains
- 4. Selecting the root node
- 5. Split the dataset on `oil price`
  - (1) Subset 1: `oil price = rise`
  - (2) Subset 2: `oil price = fall`
- 6. Split the `oil price = rise` subset further
  - (1) Feature `election`
  - (2) Feature `season`
  - (3) Choosing the split feature
- 7. Split the subset on `election`
  - (1) Subset 1: `election = yes`
  - (2) Subset 2: `election = no`
- 8. Split the `election = yes` subset further
- 9. Build the complete decision tree

Build a decision tree from the following dataset using the ID3 (information gain) method.

election	season	oil price	stock price
no	winter	rise	rise
no	summer	fall	fall
no	summer	rise	rise
yes	winter	rise	fall
no	winter	fall	fall
yes	summer	rise	rise
yes	summer	fall	fall
yes	winter	fall	fall

1. Compute the entropy of the dataset

The entropy is defined as:

$$H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where $p_i$ is the probability of class $i$.

In the dataset, the last column, stock price, takes two possible values: rise and fall. Counting how often each occurs:

- rise occurs 3 times.
- fall occurs 5 times.
- The dataset has 8 records in total.

Therefore $p(\text{rise}) = \frac{3}{8}$ and $p(\text{fall}) = \frac{5}{8}$.

The entropy $H(S)$ is:

$$H(S) = -\left( \frac{3}{8} \log_2 \frac{3}{8} + \frac{5}{8} \log_2 \frac{5}{8} \right)$$

Evaluating:

$$H(S) = -\left( 0.375 \log_2 0.375 + 0.625 \log_2 0.625 \right)$$

$$\log_2 0.375 \approx -1.415, \quad \log_2 0.625 \approx -0.678$$

$$H(S) = -\left( 0.375 \times (-1.415) + 0.625 \times (-0.678) \right)$$

$$H(S) = -\left( -0.5306 - 0.42375 \right) = 0.95435$$

The entropy of the dataset is $H(S) \approx 0.954$.
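The hand calculation above can be checked with a few lines of Python. This is a minimal sketch: the function name `entropy` and the list `stock_price` are illustrative names chosen here, and the list simply copies the stock price column in table order.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Classes that never occur are absent from the Counter, so the
    # convention 0 * log2(0) = 0 is handled implicitly.
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# The stock price column of the 8-row dataset, in table order.
stock_price = ["rise", "fall", "rise", "fall", "fall", "rise", "fall", "fall"]

print(round(entropy(stock_price), 3))  # 0.954
```

A perfectly mixed two-class list gives 1 bit and a single-class list gives 0, which matches the values used for the subsets later on.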

2. Split the dataset by feature and compute the conditional entropy

We compute the conditional entropy for each feature (election, season, oil price) in turn.
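For reference, the per-feature calculations all follow the same general pattern. In the formulas here, $A$ stands for any feature and $S_v$ for the subset of records where $A = v$; both symbols are introduced only for this summary and are not part of the original notation.

$$H(S \mid A) = \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

$$IG(S, A) = H(S) - H(S \mid A)$$

Each feature in this dataset has two values with 4 records each, so every weight $\frac{|S_v|}{|S|}$ at the root is $\frac{4}{8}$.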

(1) Feature `election`

election takes two values: yes and no.

When election = yes, there are 4 records:
- stock price = rise occurs 1 time, stock price = fall occurs 3 times.
- The entropy is:
  $$H(S \mid \text{election} = \text{yes}) = -\left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right)$$
  $$H(S \mid \text{election} = \text{yes}) = -\left( 0.25 \times (-2) + 0.75 \times (-0.415) \right) = 0.811$$

When election = no, there are 4 records:
- stock price = rise occurs 2 times, stock price = fall occurs 2 times.
- The entropy is:
  $$H(S \mid \text{election} = \text{no}) = -\left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right)$$
  $$H(S \mid \text{election} = \text{no}) = -\left( 0.5 \times (-1) + 0.5 \times (-1) \right) = 1$$

The conditional entropy $H(S \mid \text{election})$ is:

$$H(S \mid \text{election}) = \frac{4}{8} H(S \mid \text{election} = \text{yes}) + \frac{4}{8} H(S \mid \text{election} = \text{no})$$

$$H(S \mid \text{election}) = 0.5 \times 0.811 + 0.5 \times 1 = 0.9055$$

The information gain $IG(S, \text{election})$ is:

$$IG(S, \text{election}) = H(S) - H(S \mid \text{election})$$

$$IG(S, \text{election}) = 0.954 - 0.9055 = 0.0485$$

(2) Feature `season`

season takes two values: winter and summer.

When season = winter, there are 4 records:
- stock price = rise occurs 1 time, stock price = fall occurs 3 times.
- The entropy is:
  $$H(S \mid \text{season} = \text{winter}) = -\left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right) = 0.811$$

When season = summer, there are 4 records:
- stock price = rise occurs 2 times, stock price = fall occurs 2 times.
- The entropy is:
  $$H(S \mid \text{season} = \text{summer}) = -\left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right) = 1$$

The conditional entropy $H(S \mid \text{season})$ is:

$$H(S \mid \text{season}) = \frac{4}{8} H(S \mid \text{season} = \text{winter}) + \frac{4}{8} H(S \mid \text{season} = \text{summer})$$

$$H(S \mid \text{season}) = 0.5 \times 0.811 + 0.5 \times 1 = 0.9055$$

The information gain $IG(S, \text{season})$ is:

$$IG(S, \text{season}) = H(S) - H(S \mid \text{season})$$

$$IG(S, \text{season}) = 0.954 - 0.9055 = 0.0485$$

(3) Feature `oil price`

oil price takes two values: rise and fall.

When oil price = rise, there are 4 records:
- stock price = rise occurs 3 times, stock price = fall occurs 1 time.
- The entropy is:
  $$H(S \mid \text{oil price} = \text{rise}) = -\left( \frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right)$$
  $$H(S \mid \text{oil price} = \text{rise}) = -\left( 0.75 \times (-0.415) + 0.25 \times (-2) \right) = 0.811$$

When oil price = fall, there are 4 records:
- stock price = rise occurs 0 times, stock price = fall occurs 4 times.
- The entropy, taking $0 \log_2 0 = 0$ by convention, is:
  $$H(S \mid \text{oil price} = \text{fall}) = -\left( \frac{0}{4} \log_2 \frac{0}{4} + \frac{4}{4} \log_2 \frac{4}{4} \right) = 0$$

The conditional entropy $H(S \mid \text{oil price})$ is:

$$H(S \mid \text{oil price}) = \frac{4}{8} H(S \mid \text{oil price} = \text{rise}) + \frac{4}{8} H(S \mid \text{oil price} = \text{fall})$$

$$H(S \mid \text{oil price}) = 0.5 \times 0.811 + 0.5 \times 0 = 0.4055$$

The information gain $IG(S, \text{oil price})$ is:

$$IG(S, \text{oil price}) = H(S) - H(S \mid \text{oil price})$$

$$IG(S, \text{oil price}) = 0.954 - 0.4055 = 0.5485$$

3. Summary of information gains

$$IG(S, \text{election}) = 0.0485$$

$$IG(S, \text{season}) = 0.0485$$

$$IG(S, \text{oil price}) = 0.5485$$

oil price is therefore the feature with the largest information gain.
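The three gains can be reproduced programmatically. The sketch below is illustrative only: `ROWS`, `FEATURES`, `TARGET`, `conditional_entropy`, and `information_gain` are names chosen here, and the rows are copied from the dataset table.

```python
from collections import Counter
from math import log2

# Rows: (election, season, oil price, stock price), copied from the table.
ROWS = [
    ("no", "winter", "rise", "rise"),
    ("no", "summer", "fall", "fall"),
    ("no", "summer", "rise", "rise"),
    ("yes", "winter", "rise", "fall"),
    ("no", "winter", "fall", "fall"),
    ("yes", "summer", "rise", "rise"),
    ("yes", "summer", "fall", "fall"),
    ("yes", "winter", "fall", "fall"),
]
FEATURES = {"election": 0, "season": 1, "oil price": 2}  # column indices
TARGET = 3  # index of the stock price column

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def conditional_entropy(rows, col):
    """Weighted average target entropy after splitting on column `col`."""
    total = len(rows)
    result = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[TARGET] for r in rows if r[col] == value]
        result += len(subset) / total * entropy(subset)
    return result

def information_gain(rows, col):
    """Entropy of the target minus the conditional entropy given `col`."""
    return entropy([r[TARGET] for r in rows]) - conditional_entropy(rows, col)

gains = {name: information_gain(ROWS, col) for name, col in FEATURES.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
print("largest gain:", max(gains, key=gains.get))
```

At full floating-point precision the gains come out near 0.0488 and 0.5488. The values 0.0485 and 0.5485 in the text differ slightly because the intermediate entropies were rounded to three decimals before subtracting; the ranking of the features is unaffected.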

4. Selecting the root node

In the previous step we computed the information gain of each feature:

$$IG(S, \text{election}) = 0.0485$$

$$IG(S, \text{season}) = 0.0485$$

$$IG(S, \text{oil price}) = 0.5485$$

Because oil price has the largest information gain, we choose oil price as the root node.

5. Split the dataset on `oil price`

oil price takes two values: rise and fall.

(1) Subset 1: `oil price = rise`

When oil price = rise, the records are:

election	season	oil price	stock price
no	winter	rise	rise
no	summer	rise	rise
yes	winter	rise	fall
yes	summer	rise	rise

- stock price = rise occurs 3 times.
- stock price = fall occurs 1 time.

The entropy is:

$$H(S \mid \text{oil price} = \text{rise}) = -\left( \frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right)$$

$$H(S \mid \text{oil price} = \text{rise}) = -\left( 0.75 \times (-0.415) + 0.25 \times (-2) \right) = 0.811$$

Because the entropy is not 0, the oil price = rise subset must be split further.

(2) Subset 2: `oil price = fall`

When oil price = fall, the records are:

election	season	oil price	stock price
no	summer	fall	fall
no	winter	fall	fall
yes	summer	fall	fall
yes	winter	fall	fall

- stock price = rise occurs 0 times.
- stock price = fall occurs 4 times.

The entropy is:

$$H(S \mid \text{oil price} = \text{fall}) = -\left( \frac{0}{4} \log_2 \frac{0}{4} + \frac{4}{4} \log_2 \frac{4}{4} \right) = 0$$

Because the entropy is 0, the oil price = fall subset is pure and needs no further splitting. It becomes a leaf node with stock price = fall.
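The split and the purity check can be sketched in code as well. The helpers `split_by` and `is_pure` are hypothetical names used only for this illustration; a subset is treated as pure when all of its records share one stock price label, which is equivalent to its entropy being 0.

```python
from collections import defaultdict

# Rows: (election, season, oil price, stock price), copied from the table.
ROWS = [
    ("no", "winter", "rise", "rise"),
    ("no", "summer", "fall", "fall"),
    ("no", "summer", "rise", "rise"),
    ("yes", "winter", "rise", "fall"),
    ("no", "winter", "fall", "fall"),
    ("yes", "summer", "rise", "rise"),
    ("yes", "summer", "fall", "fall"),
    ("yes", "winter", "fall", "fall"),
]
OIL_PRICE = 2  # column index of the split feature
TARGET = 3     # column index of stock price

def split_by(rows, col):
    """Group rows by the value they take in column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)
    return dict(groups)

def is_pure(rows):
    """True when every row carries the same target label (entropy 0)."""
    return len({r[TARGET] for r in rows}) == 1

subsets = split_by(ROWS, OIL_PRICE)
for value, subset in subsets.items():
    labels = [r[TARGET] for r in subset]
    status = "leaf" if is_pure(subset) else "needs further splitting"
    print(f"oil price = {value}: {labels} -> {status}")
```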

6. Split the `oil price = rise` subset further

Only the oil price = rise subset remains to be processed:

election	season	oil price	stock price
no	winter	rise	rise
no	summer	rise	rise
yes	winter	rise	fall
yes	summer	rise	rise

We compute the information gain of the remaining features again. All entropies in this section are computed within the oil price = rise subset.

(1) Feature `election`

election takes two values: yes and no.

When election = yes, there are 2 records:
- stock price = rise occurs 1 time, stock price = fall occurs 1 time.
- The entropy is:
  $$H(S \mid \text{election} = \text{yes}) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1$$

When election = no, there are 2 records:
- stock price = rise occurs 2 times, stock price = fall occurs 0 times.
- The entropy is:
  $$H(S \mid \text{election} = \text{no}) = -\left( \frac{2}{2} \log_2 \frac{2}{2} + \frac{0}{2} \log_2 \frac{0}{2} \right) = 0$$

The conditional entropy $H(S \mid \text{election})$ is:

$$H(S \mid \text{election}) = \frac{2}{4} H(S \mid \text{election} = \text{yes}) + \frac{2}{4} H(S \mid \text{election} = \text{no})$$

$$H(S \mid \text{election}) = 0.5 \times 1 + 0.5 \times 0 = 0.5$$

The information gain $IG(S, \text{election})$ is:

$$IG(S, \text{election}) = H(S \mid \text{oil price} = \text{rise}) - H(S \mid \text{election})$$

$$IG(S, \text{election}) = 0.811 - 0.5 = 0.311$$

(2) Feature `season`

season takes two values: winter and summer.

When season = winter, there are 2 records:
- stock price = rise occurs 1 time, stock price = fall occurs 1 time.
- The entropy is:
  $$H(S \mid \text{season} = \text{winter}) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1$$

When season = summer, there are 2 records:
- stock price = rise occurs 2 times, stock price = fall occurs 0 times.
- The entropy is:
  $$H(S \mid \text{season} = \text{summer}) = -\left( \frac{2}{2} \log_2 \frac{2}{2} + \frac{0}{2} \log_2 \frac{0}{2} \right) = 0$$

The conditional entropy $H(S \mid \text{season})$ is:

$$H(S \mid \text{season}) = \frac{2}{4} H(S \mid \text{season} = \text{winter}) + \frac{2}{4} H(S \mid \text{season} = \text{summer})$$

$$H(S \mid \text{season}) = 0.5 \times 1 + 0.5 \times 0 = 0.5$$

The information gain $IG(S, \text{season})$ is:

$$IG(S, \text{season}) = H(S \mid \text{oil price} = \text{rise}) - H(S \mid \text{season})$$

$$IG(S, \text{season}) = 0.811 - 0.5 = 0.311$$

(3) Choosing the split feature

election and season have the same information gain (both 0.311), so either one can serve as the split feature. Here we choose election.
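The tie can be confirmed numerically on the four-row subset. In this sketch `SUBSET`, `CANDIDATES`, and the helper functions are illustrative names; the oil price column is omitted because it is constant within the subset. Breaking the tie by taking the first candidate is one possible rule, assumed here so that the result matches the choice made in the text.

```python
from collections import Counter
from math import log2

# The four rows with oil price = rise: (election, season, stock price).
SUBSET = [
    ("no", "winter", "rise"),
    ("no", "summer", "rise"),
    ("yes", "winter", "fall"),
    ("yes", "summer", "rise"),
]
CANDIDATES = {"election": 0, "season": 1}  # remaining features
TARGET = 2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, col):
    """Entropy of the target minus the weighted entropy after splitting."""
    remainder = 0.0
    for value in {r[col] for r in rows}:
        part = [r[TARGET] for r in rows if r[col] == value]
        remainder += len(part) / len(rows) * entropy(part)
    return entropy([r[TARGET] for r in rows]) - remainder

gains = {name: information_gain(SUBSET, col) for name, col in CANDIDATES.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")

# max() returns the first maximal key in dict order, so a tie resolves
# to "election" only because it is listed first in CANDIDATES.
best = max(gains, key=gains.get)
print("split on:", best)
```

Listing season first would produce a different but equally valid tree of the same depth, since ID3 itself does not prescribe how ties are broken.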

7. Split the subset on `election`

(1) Subset 1: `election = yes`

When election = yes, the records are:

election	season	oil price	stock price
yes	winter	rise	fall
yes	summer	rise	rise

- stock price = rise occurs 1 time.
- stock price = fall occurs 1 time.

The entropy is:

$$H(S \mid \text{election} = \text{yes}) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1$$

Because the entropy is not 0, this subset must be split further.

(2) Subset 2: `election = no`

When election = no, the records are:

election	season	oil price	stock price
no	winter	rise	rise
no	summer	rise	rise

- stock price = rise occurs 2 times.
- stock price = fall occurs 0 times.

The entropy is:

$$H(S \mid \text{election} = \text{no}) = -\left( \frac{2}{2} \log_2 \frac{2}{2} + \frac{0}{2} \log_2 \frac{0}{2} \right) = 0$$

The entropy is 0, so this subset is pure and becomes a leaf node with stock price = rise.

8. Split the `election = yes` subset further

For the election = yes subset:

election	season	oil price	stock price
yes	winter	rise	fall
yes	summer	rise	rise

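season is the only remaining feature, so we split on it.

- When season = winter, stock price = fall.
- When season = summer, stock price = rise.

Both resulting subsets are pure, so each becomes a leaf node.

9. Build the complete decision tree

The final decision tree is: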

oil price?
├── fall: stock price = fall
└── rise: election?
    ├── no: stock price = rise
    └── yes: season?
        ├── winter: stock price = fall
        └── summer: stock price = rise
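The whole procedure can be condensed into a short recursive function. This is a minimal sketch of ID3 under stated assumptions, not a reference implementation: the names `DATA`, `COLUMNS`, `id3`, and `predict` are chosen for this example, the tree is represented as nested dictionaries, ties are broken by feature order, and a majority vote is used if the features run out before a subset becomes pure (a case that does not arise with this dataset).

```python
from collections import Counter
from math import log2

COLUMNS = ["election", "season", "oil price"]
TARGET = "stock price"
DATA = [dict(zip(COLUMNS + [TARGET], row)) for row in [
    ("no", "winter", "rise", "rise"),
    ("no", "summer", "fall", "fall"),
    ("no", "summer", "rise", "rise"),
    ("yes", "winter", "rise", "fall"),
    ("no", "winter", "fall", "fall"),
    ("yes", "summer", "rise", "rise"),
    ("yes", "summer", "fall", "fall"),
    ("yes", "winter", "fall", "fall"),
]]

def entropy(rows):
    """Shannon entropy (in bits) of the target column over `rows`."""
    total = len(rows)
    counts = Counter(r[TARGET] for r in rows)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, feature):
    """Entropy of `rows` minus the weighted entropy after splitting."""
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, features):
    """Return a leaf label or a nested dict {feature: {value: subtree}}."""
    labels = [r[TARGET] for r in rows]
    if len(set(labels)) == 1:      # pure subset: emit a leaf
        return labels[0]
    if not features:               # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    # On a tie, max() keeps the first feature in list order.
    best = max(features, key=lambda f: information_gain(rows, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, remaining)
    return {best: branches}

def predict(tree, row):
    """Walk the nested dicts until a leaf label is reached."""
    while isinstance(tree, dict):
        feature, branches = next(iter(tree.items()))
        tree = branches[row[feature]]
    return tree

tree = id3(DATA, COLUMNS)
print(tree)
print(predict(tree, {"election": "yes", "season": "winter", "oil price": "rise"}))
```

With election listed before season in `COLUMNS`, the nested dictionary has the same shape as the tree above, and every record of the training table is classified correctly.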
