使用ID3算法根据信息增益构建决策树

使用ID3算法根据信息增益构建决策树

      • [1. 计算数据集的熵(Entropy of the dataset)](#1. 计算数据集的熵(Entropy of the dataset))
      • [2. 按特征划分数据集并计算条件熵](#2. 按特征划分数据集并计算条件熵)
        • [(1)特征 `election`](#(1)特征 election)
        • [(2)特征 `season`](#(2)特征 season)
        • [(3)特征 `oil price`](#(3)特征 oil price)
      • [3. 总结信息增益](#3. 总结信息增益)
      • [4 根节点选择](#4 根节点选择)
      • [5. 根据 `oil price` 划分数据集](#5. 根据 oil price 划分数据集)
        • [(1) 子集 1: `oil price = rise`](#(1) 子集 1: oil price = rise)
        • [(2) 子集 2: `oil price = fall`](#(2) 子集 2: oil price = fall)
      • [6. 对 `oil price = rise` 的子集继续划分](#6. 对 oil price = rise 的子集继续划分)
        • [(1) 特征 `election`](#(1) 特征 election)
        • [(2) 特征 `season`](#(2) 特征 season)
        • [(3) 选择划分特征](#(3) 选择划分特征)
      • [7. 根据 `election` 划分子集](#7. 根据 election 划分子集)
        • [(1) 子集 1: `election = yes`](#(1) 子集 1: election = yes)
        • [(2) 子集 2: `election = no`](#(2) 子集 2: election = no)
      • [8. 对 `election = yes` 的子集继续划分](#8. 对 election = yes 的子集继续划分)
      • [9. 构建完整决策树](#9. 构建完整决策树)

根据ID3(信息增益)方法利用以下数据集构建决策树。

election season oil price stock price
no winter rise rise
no summer fall fall
no summer rise rise
yes winter rise fall
no winter fall fall
yes summer rise rise
yes summer fall fall
yes winter fall fall

1. 计算数据集的熵(Entropy of the dataset)

熵的公式为:

H ( S ) = − ∑ i = 1 n p i log ⁡ 2 ( p i ) H(S) = - \sum_{i=1}^n p_i \log_2(p_i) H(S)=−i=1∑npilog2(pi)

其中 p i p_i pi 是每个类别的概率。

在数据集中,最后一列 stock price 有两种可能的值:risefall。统计每种值的频率:

  • rise 出现了 3 次。
  • fall 出现了 5 次。
  • 数据集总共有 8 条记录。

因此,rise 的概率 p ( r i s e ) = 3 8 p(rise) = \frac{3}{8} p(rise)=83,fall 的概率 p ( f a l l ) = 5 8 p(fall) = \frac{5}{8} p(fall)=85。

熵 H ( S ) H(S) H(S) 为:

H ( S ) = − ( 3 8 log ⁡ 2 3 8 + 5 8 log ⁡ 2 5 8 ) H(S) = - \left( \frac{3}{8} \log_2 \frac{3}{8} + \frac{5}{8} \log_2 \frac{5}{8} \right) H(S)=−(83log283+85log285)

计算:

H ( S ) = − ( 0.375 log ⁡ 2 0.375 + 0.625 log ⁡ 2 0.625 ) H(S) = - \left( 0.375 \log_2 0.375 + 0.625 \log_2 0.625 \right) H(S)=−(0.375log20.375+0.625log20.625)

log ⁡ 2 0.375 ≈ − 1.415 , log ⁡ 2 0.625 ≈ − 0.678 \log_2 0.375 \approx -1.415, \quad \log_2 0.625 \approx -0.678 log20.375≈−1.415,log20.625≈−0.678

H ( S ) = − ( 0.375 × − 1.415 + 0.625 × − 0.678 ) H(S) = - \left( 0.375 \times -1.415 + 0.625 \times -0.678 \right) H(S)=−(0.375×−1.415+0.625×−0.678)

H ( S ) = − ( − 0.5306 − 0.42375 ) = 0.95435 H(S) = - \left( -0.5306 - 0.42375 \right) = 0.95435 H(S)=−(−0.5306−0.42375)=0.95435

数据集的熵为 H ( S ) ≈ 0.954 H(S) \approx 0.954 H(S)≈0.954。


2. 按特征划分数据集并计算条件熵

我们需要分别计算每个特征(electionseasonoil price)的条件熵。

(1)特征 election

election 有两个取值:yesno

  • election = yes 时,有 4 条记录,其中:

    • rise 出现 1 次,fall 出现 3 次。
    • 熵为:
      H ( S ∣ e l e c t i o n = y e s ) = − ( 1 4 log ⁡ 2 1 4 + 3 4 log ⁡ 2 3 4 ) H(S|election=yes) = - \left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right) H(S∣election=yes)=−(41log241+43log243)
      H ( S ∣ e l e c t i o n = y e s ) = − ( 0.25 × − 2 + 0.75 × − 0.415 ) = 0.811 H(S|election=yes) = - \left( 0.25 \times -2 + 0.75 \times -0.415 \right) = 0.811 H(S∣election=yes)=−(0.25×−2+0.75×−0.415)=0.811
  • election = no 时,有 4 条记录,其中:

    • rise 出现 2 次,fall 出现 2 次。
    • 熵为:
      H ( S ∣ e l e c t i o n = n o ) = − ( 2 4 log ⁡ 2 2 4 + 2 4 log ⁡ 2 2 4 ) H(S|election=no) = - \left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right) H(S∣election=no)=−(42log242+42log242)
      H ( S ∣ e l e c t i o n = n o ) = − ( 0.5 × − 1 + 0.5 × − 1 ) = 1 H(S|election=no) = - \left( 0.5 \times -1 + 0.5 \times -1 \right) = 1 H(S∣election=no)=−(0.5×−1+0.5×−1)=1

条件熵 H ( S ∣ e l e c t i o n ) H(S|election) H(S∣election) 为:

H ( S ∣ e l e c t i o n ) = 4 8 H ( S ∣ e l e c t i o n = y e s ) + 4 8 H ( S ∣ e l e c t i o n = n o ) H(S|election) = \frac{4}{8} H(S|election=yes) + \frac{4}{8} H(S|election=no) H(S∣election)=84H(S∣election=yes)+84H(S∣election=no)

H ( S ∣ e l e c t i o n ) = 0.5 × 0.811 + 0.5 × 1 = 0.9055 H(S|election) = 0.5 \times 0.811 + 0.5 \times 1 = 0.9055 H(S∣election)=0.5×0.811+0.5×1=0.9055

信息增益 I G ( S , e l e c t i o n ) IG(S, election) IG(S,election) 为:

I G ( S , e l e c t i o n ) = H ( S ) − H ( S ∣ e l e c t i o n ) IG(S, election) = H(S) - H(S|election) IG(S,election)=H(S)−H(S∣election)

I G ( S , e l e c t i o n ) = 0.954 − 0.9055 = 0.0485 IG(S, election) = 0.954 - 0.9055 = 0.0485 IG(S,election)=0.954−0.9055=0.0485


(2)特征 season

season 有两个取值:wintersummer

  • season = winter 时,有 4 条记录,其中:

    • rise 出现 1 次,fall 出现 3 次。
    • 熵为:
      H ( S ∣ s e a s o n = w i n t e r ) = − ( 1 4 log ⁡ 2 1 4 + 3 4 log ⁡ 2 3 4 ) = 0.811 H(S|season=winter) = - \left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right) = 0.811 H(S∣season=winter)=−(41log241+43log243)=0.811
  • season = summer 时,有 4 条记录,其中:

    • rise 出现 2 次,fall 出现 2 次。
    • 熵为:
      H ( S ∣ s e a s o n = s u m m e r ) = − ( 2 4 log ⁡ 2 2 4 + 2 4 log ⁡ 2 2 4 ) = 1 H(S|season=summer) = - \left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right) = 1 H(S∣season=summer)=−(42log242+42log242)=1

条件熵 H ( S ∣ s e a s o n ) H(S|season) H(S∣season) 为:

H ( S ∣ s e a s o n ) = 4 8 H ( S ∣ s e a s o n = w i n t e r ) + 4 8 H ( S ∣ s e a s o n = s u m m e r ) H(S|season) = \frac{4}{8} H(S|season=winter) + \frac{4}{8} H(S|season=summer) H(S∣season)=84H(S∣season=winter)+84H(S∣season=summer)

H ( S ∣ s e a s o n ) = 0.5 × 0.811 + 0.5 × 1 = 0.9055 H(S|season) = 0.5 \times 0.811 + 0.5 \times 1 = 0.9055 H(S∣season)=0.5×0.811+0.5×1=0.9055

信息增益 I G ( S , s e a s o n ) IG(S, season) IG(S,season) 为:

I G ( S , s e a s o n ) = H ( S ) − H ( S ∣ s e a s o n ) IG(S, season) = H(S) - H(S|season) IG(S,season)=H(S)−H(S∣season)

I G ( S , s e a s o n ) = 0.954 − 0.9055 = 0.0485 IG(S, season) = 0.954 - 0.9055 = 0.0485 IG(S,season)=0.954−0.9055=0.0485


(3)特征 oil price

oil price 有两个取值:risefall

  • oil price = rise 时,有 4 条记录,其中:

    • rise 出现 3 次,fall 出现 1 次。
    • 熵为:
      H ( S ∣ o i l p r i c e = r i s e ) = − ( 3 4 log ⁡ 2 3 4 + 1 4 log ⁡ 2 1 4 ) H(S|oil price=rise) = - \left( \frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right) H(S∣oilprice=rise)=−(43log243+41log241)
      H ( S ∣ o i l p r i c e = r i s e ) = − ( 0.75 × − 0.415 + 0.25 × − 2 ) = 0.811 H(S|oil price=rise) = - \left( 0.75 \times -0.415 + 0.25 \times -2 \right) = 0.811 H(S∣oilprice=rise)=−(0.75×−0.415+0.25×−2)=0.811
  • oil price = fall 时,有 4 条记录,其中:

    • rise 出现 0 次,fall 出现 4 次。
    • 熵为:
      H ( S ∣ o i l p r i c e = f a l l ) = − ( 0 4 log ⁡ 2 0 4 + 4 4 log ⁡ 2 4 4 ) = 0 H(S|oil price=fall) = - \left( \frac{0}{4} \log_2 \frac{0}{4} + \frac{4}{4} \log_2 \frac{4}{4} \right) = 0 H(S∣oilprice=fall)=−(40log240+44log244)=0

条件熵 H ( S ∣ o i l p r i c e ) H(S|oil price) H(S∣oilprice) 为:

H ( S ∣ o i l p r i c e ) = 4 8 H ( S ∣ o i l p r i c e = r i s e ) + 4 8 H ( S ∣ o i l p r i c e = f a l l ) H(S|oil price) = \frac{4}{8} H(S|oil price=rise) + \frac{4}{8} H(S|oil price=fall) H(S∣oilprice)=84H(S∣oilprice=rise)+84H(S∣oilprice=fall)

H ( S ∣ o i l p r i c e ) = 0.5 × 0.811 + 0.5 × 0 = 0.4055 H(S|oil price) = 0.5 \times 0.811 + 0.5 \times 0 = 0.4055 H(S∣oilprice)=0.5×0.811+0.5×0=0.4055

信息增益 I G ( S , o i l p r i c e ) IG(S, oil price) IG(S,oilprice) 为:

I G ( S , o i l p r i c e ) = H ( S ) − H ( S ∣ o i l p r i c e ) IG(S, oil price) = H(S) - H(S|oil price) IG(S,oilprice)=H(S)−H(S∣oilprice)

I G ( S , o i l p r i c e ) = 0.954 − 0.4055 = 0.5485 IG(S, oil price) = 0.954 - 0.4055 = 0.5485 IG(S,oilprice)=0.954−0.4055=0.5485


3. 总结信息增益

  • I G ( S , e l e c t i o n ) = 0.0485 IG(S, election) = 0.0485 IG(S,election)=0.0485
  • I G ( S , s e a s o n ) = 0.0485 IG(S, season) = 0.0485 IG(S,season)=0.0485
  • I G ( S , o i l p r i c e ) = 0.5485 IG(S, oil price) = 0.5485 IG(S,oilprice)=0.5485

因此,oil price 是信息增益最大的特征。


4 根节点选择

在上一步中,我们计算了各特征的信息增益:

  • I G ( S , e l e c t i o n ) = 0.0485 IG(S, election) = 0.0485 IG(S,election)=0.0485
  • I G ( S , s e a s o n ) = 0.0485 IG(S, season) = 0.0485 IG(S,season)=0.0485
  • I G ( S , o i l p r i c e ) = 0.5485 IG(S, oil price) = 0.5485 IG(S,oilprice)=0.5485

由于 oil price 的信息增益最大,我们选择 oil price 作为根节点。


5. 根据 oil price 划分数据集

oil price 有两个取值:risefall

(1) 子集 1: oil price = rise

oil price = rise 时,数据如下:

election season oil price stock price
no winter rise rise
no summer rise rise
yes winter rise fall
yes summer rise rise
  • stock price = rise 出现 3 次。
  • stock price = fall 出现 1 次。

熵为:

H ( S ∣ o i l p r i c e = r i s e ) = − ( 3 4 log ⁡ 2 3 4 + 1 4 log ⁡ 2 1 4 ) H(S|oil price=rise) = - \left( \frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right) H(S∣oilprice=rise)=−(43log243+41log241)

H ( S ∣ o i l p r i c e = r i s e ) = − ( 0.75 × − 0.415 + 0.25 × − 2 ) = 0.811 H(S|oil price=rise) = - \left( 0.75 \times -0.415 + 0.25 \times -2 \right) = 0.811 H(S∣oilprice=rise)=−(0.75×−0.415+0.25×−2)=0.811

由于熵不为 0,oil price = rise 的子集还需要进一步划分。


(2) 子集 2: oil price = fall

oil price = fall 时,数据如下:

election season oil price stock price
no summer fall fall
no winter fall fall
yes summer fall fall
yes winter fall fall
  • stock price = rise 出现 0 次。
  • stock price = fall 出现 4 次。

熵为:

H ( S ∣ o i l p r i c e = f a l l ) = − ( 0 4 log ⁡ 2 0 4 + 4 4 log ⁡ 2 4 4 ) = 0 H(S|oil price=fall) = - \left( \frac{0}{4} \log_2 \frac{0}{4} + \frac{4}{4} \log_2 \frac{4}{4} \right) = 0 H(S∣oilprice=fall)=−(40log240+44log244)=0

由于熵为 0,oil price = fall 的子集是纯的,不需要进一步划分。此时,stock price = fall 是叶节点。


6. 对 oil price = rise 的子集继续划分

现在我们只需要处理 oil price = rise 的子集:

election season oil price stock price
no winter rise rise
no summer rise rise
yes winter rise fall
yes summer rise rise

我们再次计算剩余特征的信息增益。


(1) 特征 election

election 有两个取值:yesno

  • election = yes 时,有 2 条记录,其中:

    • rise 出现 1 次,fall 出现 1 次。
    • 熵为:
      H ( S ∣ e l e c t i o n = y e s ) = − ( 1 2 log ⁡ 2 1 2 + 1 2 log ⁡ 2 1 2 ) = 1 H(S|election=yes) = - \left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1 H(S∣election=yes)=−(21log221+21log221)=1
  • election = no 时,有 2 条记录,其中:

    • rise 出现 2 次,fall 出现 0 次。
    • 熵为:
      H ( S ∣ e l e c t i o n = n o ) = − ( 2 2 log ⁡ 2 2 2 + 0 2 log ⁡ 2 0 2 ) = 0 H(S|election=no) = - \left( \frac{2}{2} \log_2 \frac{2}{2} + \frac{0}{2} \log_2 \frac{0}{2} \right) = 0 H(S∣election=no)=−(22log222+20log220)=0

条件熵 H ( S ∣ e l e c t i o n ) H(S|election) H(S∣election) 为:

H ( S ∣ e l e c t i o n ) = 2 4 H ( S ∣ e l e c t i o n = y e s ) + 2 4 H ( S ∣ e l e c t i o n = n o ) H(S|election) = \frac{2}{4} H(S|election=yes) + \frac{2}{4} H(S|election=no) H(S∣election)=42H(S∣election=yes)+42H(S∣election=no)

H ( S ∣ e l e c t i o n ) = 0.5 × 1 + 0.5 × 0 = 0.5 H(S|election) = 0.5 \times 1 + 0.5 \times 0 = 0.5 H(S∣election)=0.5×1+0.5×0=0.5

信息增益 I G ( S , e l e c t i o n ) IG(S, election) IG(S,election) 为:

I G ( S , e l e c t i o n ) = H ( S ∣ o i l p r i c e = r i s e ) − H ( S ∣ e l e c t i o n ) IG(S, election) = H(S|oil price=rise) - H(S|election) IG(S,election)=H(S∣oilprice=rise)−H(S∣election)

I G ( S , e l e c t i o n ) = 0.811 − 0.5 = 0.311 IG(S, election) = 0.811 - 0.5 = 0.311 IG(S,election)=0.811−0.5=0.311


(2) 特征 season

season 有两个取值:wintersummer

  • season = winter 时,有 2 条记录,其中:

    • rise 出现 1 次,fall 出现 1 次。
    • 熵为:
      H ( S ∣ s e a s o n = w i n t e r ) = − ( 1 2 log ⁡ 2 1 2 + 1 2 log ⁡ 2 1 2 ) = 1 H(S|season=winter) = - \left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1 H(S∣season=winter)=−(21log221+21log221)=1
  • season = summer 时,有 2 条记录,其中:

    • rise 出现 2 次,fall 出现 0 次。
    • 熵为:
      H ( S ∣ s e a s o n = s u m m e r ) = − ( 2 2 log ⁡ 2 2 2 + 0 2 log ⁡ 2 0 2 ) = 0 H(S|season=summer) = - \left( \frac{2}{2} \log_2 \frac{2}{2} + \frac{0}{2} \log_2 \frac{0}{2} \right) = 0 H(S∣season=summer)=−(22log222+20log220)=0

条件熵 H ( S ∣ s e a s o n ) H(S|season) H(S∣season) 为:

H ( S ∣ s e a s o n ) = 2 4 H ( S ∣ s e a s o n = w i n t e r ) + 2 4 H ( S ∣ s e a s o n = s u m m e r ) H(S|season) = \frac{2}{4} H(S|season=winter) + \frac{2}{4} H(S|season=summer) H(S∣season)=42H(S∣season=winter)+42H(S∣season=summer)

H ( S ∣ s e a s o n ) = 0.5 × 1 + 0.5 × 0 = 0.5 H(S|season) = 0.5 \times 1 + 0.5 \times 0 = 0.5 H(S∣season)=0.5×1+0.5×0=0.5

信息增益 I G ( S , s e a s o n ) IG(S, season) IG(S,season) 为:

I G ( S , s e a s o n ) = H ( S ∣ o i l p r i c e = r i s e ) − H ( S ∣ s e a s o n ) IG(S, season) = H(S|oil price=rise) - H(S|season) IG(S,season)=H(S∣oilprice=rise)−H(S∣season)

I G ( S , s e a s o n ) = 0.811 − 0.5 = 0.311 IG(S, season) = 0.811 - 0.5 = 0.311 IG(S,season)=0.811−0.5=0.311


(3) 选择划分特征

electionseason 的信息增益相同(均为 0.311)。我们可以任选一个作为划分特征。这里选择 election


7. 根据 election 划分子集

(1) 子集 1: election = yes

election = yes 时,数据如下:

election season oil price stock price
yes winter rise fall
yes summer rise rise
  • stock price = rise 出现 1 次。
  • stock price = fall 出现 1 次。

熵为:

H ( S ∣ e l e c t i o n = y e s ) = − ( 1 2 log ⁡ 2 1 2 + 1 2 log ⁡ 2 1 2 ) = 1 H(S|election=yes) = - \left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1 H(S∣election=yes)=−(21log221+21log221)=1

由于熵不为 0,还需要进一步划分。


(2) 子集 2: election = no

election = no 时,数据如下:

election season oil price stock price
no winter rise rise
no summer rise rise
  • stock price = rise 出现 2 次。
  • stock price = fall 出现 0 次。

熵为:

H ( S ∣ e l e c t i o n = n o ) = − ( 2 2 log ⁡ 2 2 2 + 0 2 log ⁡ 2 0 2 ) = 0 H(S|election=no) = - \left( \frac{2}{2} \log_2 \frac{2}{2} + \frac{0}{2} \log_2 \frac{0}{2} \right) = 0 H(S∣election=no)=−(22log222+20log220)=0

此时,stock price = rise 是叶节点。


8. 对 election = yes 的子集继续划分

对于 election = yes 的子集:

election season oil price stock price
yes winter rise fall
yes summer rise rise

我们可以选择 season 作为划分特征。

  • season = winter 时,stock price = fall
  • season = summer 时,stock price = rise

此时,两个子集都是纯的。


9. 构建完整决策树

最终决策树如下:

复制代码
oil price?
├── fall: stock price = fall
└── rise:
    ├── election?
    │   ├── no: stock price = rise
    │   └── yes:
    │       ├── season?
    │       │   ├── winter: stock price = fall
    │       │   └── summer: stock price = rise

相关推荐
OpenBayes贝式计算3 天前
解决视频模型痛点,TurboDiffusion 高效视频扩散生成系统;Google Streetview 涵盖多个国家的街景图像数据集
人工智能·深度学习·机器学习
OpenBayes贝式计算3 天前
OCR教程汇总丨DeepSeek/百度飞桨/华中科大等开源创新技术,实现OCR高精度、本地化部署
人工智能·深度学习·机器学习
够快云库4 天前
能源行业非结构化数据治理实战:从数据沼泽到智能资产
大数据·人工智能·机器学习·企业文件安全
B站_计算机毕业设计之家4 天前
电影知识图谱推荐问答系统 | Python Django系统 Neo4j MySQL Echarts 协同过滤 大数据 人工智能 毕业设计源码(建议收藏)✅
人工智能·python·机器学习·django·毕业设计·echarts·知识图谱
Flying pigs~~4 天前
机器学习之逻辑回归
人工智能·机器学习·数据挖掘·数据分析·逻辑回归
Evand J4 天前
通过matlab实现机器学习的小项目示例(鸢尾花分类)
机器学习·支持向量机·matlab
_Li.4 天前
Simulink - 6DOF (Euler Angles)
人工智能·算法·机器学习·游戏引擎·cocos2d
Project_Observer4 天前
工时日志在项目进度管理中扮演着怎样的角色?
数据库·深度学习·机器学习
scott1985124 天前
Improving Classifier-Free Guidance of Flow Matching via Manifold Projection
人工智能·python·机器学习