Maximum Likelihood Estimation, Shannon Entropy, Cross-Entropy, and KL Divergence: A Detailed Explanation with Implementations

Ming | 2025.12

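This article covers maximum likelihood estimation, Shannon entropy, cross-entropy, and KL divergence, in that order. The four concepts are tightly linked and build on one another; together they form the core foundation for probabilistic modeling, information measurement, and distribution comparison in modern machine learning and information theory.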

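1. Maximum Likelihood Estimation

What is maximum likelihood estimation (MLE), and what problem does it solve?

Let's set aside the rigorous mathematical definition for now and start with a simple example.

Suppose we have a mystery bag containing red, green, and blue balls, but we do not know the proportion of each color. All we know is that the color distribution has the following structure, where $\theta$ is an unknown parameter: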

| $X$ | $x_1$: red | $x_2$: green | $x_3$: blue |
| --- | --- | --- | --- |
| Probability | $\theta$ | $1 - \frac{3\theta}{2}$ | $\frac{\theta}{2}$ |

That is, the probability of drawing a red ball is $\theta$, a blue ball $\frac{\theta}{2}$, and a green ball $1-\frac{3\theta}{2}$. For all three to be valid probabilities, $\theta$ must lie in $(0, \frac{2}{3})$.

The only thing we can do is draw from the bag at random, with replacement, and observe the colors. We cannot open the bag and look inside. How, then, can we estimate $\theta$ from a finite set of observations?

You might think: as long as we draw enough times and count how often each color appears, the frequencies will approximate the true probabilities, and we can work out $\theta$ from them. That is indeed a reasonable approach. In real problems, however, we often have only a limited number of samples, sometimes very few; or the variable is high-dimensional (say, 1000 colors, with not one unknown $\theta$ but a great many unknown parameters), so counting frequencies directly can be unstable or infeasible. This is where the powerful idea of maximum likelihood estimation comes in.

Let's continue with the example above.

Suppose we draw 6 times and observe the color sequence $x_{1},x_{2},x_{2},x_{3},x_{2},x_{3}$. Call this observed event $A$. Under the distribution above, the probability of $A$ for a given $\theta$ is:
$$L(\theta) = \theta \cdot \left(\frac{\theta}{2}\right)^{2} \cdot \left(1-\frac{3\theta}{2}\right)^{3}$$

This uses the product rule for independent events: the draws are mutually independent, so the probability of the whole sequence equals the product of the probabilities of the individual draws. Here red appears once, blue twice, and green three times.

Now comes the key idea: the observation $A$ is an established fact. Since it actually happened, it is reasonable to believe that its probability should be relatively large. Our task, therefore, is to find the $\theta$ that maximizes $L(\theta)$. This is the core intuition of maximum likelihood estimation: find the parameter under which the observed data looks most plausible.
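As a quick numerical illustration of "find the $\theta$ that maximizes $L(\theta)$", the sketch below evaluates the likelihood on a dense grid and picks the largest value. The grid search, the function name `likelihood`, and the grid resolution are assumptions made for illustration only; they are not part of the original walkthrough.

```python
import numpy as np

def likelihood(theta: np.ndarray) -> np.ndarray:
    """L(theta) for one red, two blue, and three green draws."""
    return theta * (theta / 2) ** 2 * (1 - 3 * theta / 2) ** 3

# Valid theta values lie strictly between 0 and 2/3.
grid = np.linspace(1e-6, 2 / 3 - 1e-6, 200_001)

# Brute-force search: take the grid point with the largest likelihood.
theta_hat = grid[np.argmax(likelihood(grid))]

print(f"theta_hat = {theta_hat:.4f}")
```

A grid search is only practical here because there is a single parameter; with many parameters one would typically use gradient-based optimization instead.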

In practice, maximizing $L(\theta)$ directly can be inconvenient: it is a product of many probabilities, so its value can be extremely small and prone to numerical underflow. To simplify computation and analysis, we usually take the natural logarithm, which gives the log-likelihood function:
$$\ell(\theta) = \ln L(\theta) = \ln\theta + 3\ln\left(1-\frac{3\theta}{2}\right) + 2\ln\left(\frac{\theta}{2}\right)$$

Because $\ln$ is monotonically increasing, maximizing $L(\theta)$ is equivalent to maximizing $\ell(\theta)$. The logarithm turns the product into a sum, which both simplifies differentiation and improves numerical stability.
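To finish the example, the maximizer can be found by setting the derivative of $\ell(\theta)$ to zero. The derivation below is a worked sketch added for illustration; it assumes $0 < \theta < \frac{2}{3}$ so that every logarithm is defined.

$$
\begin{aligned}
\ell(\theta) &= 3\ln\theta + 3\ln\left(1-\frac{3\theta}{2}\right) - 2\ln 2 \\
\frac{d\ell}{d\theta} &= \frac{3}{\theta} - \frac{9/2}{1-\frac{3\theta}{2}} = 0 \\
3\left(1-\frac{3\theta}{2}\right) &= \frac{9\theta}{2} \\
\hat{\theta} &= \frac{1}{3}
\end{aligned}
$$

Since $\ell(\theta)$ is a sum of concave terms on this interval, the stationary point is the maximum. With $\hat{\theta} = \frac{1}{3}$, the fitted model assigns probability $\frac{1}{3}$ to red, $\frac{1}{2}$ to green, and $\frac{1}{6}$ to blue.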

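Now let's place maximum likelihood estimation in a machine learning context. The distribution table above is the model we defined, and the data obtained by sampling with replacement is the training set. $L(\theta)$ is the objective to maximize; equivalently, the negative log-likelihood $-\ell(\theta)$ plays the role of the loss function to minimize. Put more formally: in machine learning we often treat the training data as observations sampled from some true distribution. We define a parameterized model $p_{\theta}(x)$ and hope it approximates that true distribution. The likelihood function $L(\theta)$ then measures how plausible the observed training data is under the model. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, and this is the theoretical origin of the cross-entropy loss used in many supervised learning models, such as logistic regression and neural network classifiers.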
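The link between negative log-likelihood and cross-entropy can be checked on the ball example. The sketch below is illustrative only: the integer encoding of colors, the helper names, and the choice $\theta = 0.4$ are all assumptions. It shows that the negative log-likelihood equals the number of samples times the cross-entropy (defined in Section 3, here with natural logarithms) between the observed frequencies and the model.

```python
import numpy as np

# Observed draws from the example: 0 = red, 1 = green, 2 = blue.
draws = np.array([0, 1, 1, 2, 1, 2])

def model_probs(theta: float) -> np.ndarray:
    """Model probabilities for (red, green, blue) from the table above."""
    return np.array([theta, 1 - 3 * theta / 2, theta / 2])

def negative_log_likelihood(theta: float) -> float:
    """Sum of -ln p_theta(x) over all observed draws."""
    return float(-np.sum(np.log(model_probs(theta)[draws])))

def cross_entropy_nats(p: np.ndarray, q: np.ndarray) -> float:
    """Cross-entropy H(p, q) using natural logarithms (nats)."""
    return float(-np.sum(p * np.log(q)))

theta = 0.4  # arbitrary valid value, chosen only for illustration
empirical = np.bincount(draws, minlength=3) / len(draws)  # observed frequencies

nll = negative_log_likelihood(theta)
ce = cross_entropy_nats(empirical, model_probs(theta))

# The two quantities differ only by the factor N (the number of draws).
print(np.isclose(nll, len(draws) * ce))
```

Because the factor $N$ does not depend on $\theta$, the $\theta$ that minimizes one also minimizes the other.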

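2. Shannon Entropy

Before discussing Shannon entropy, we first need to understand "information content". Intuitively, information content is how much information an event carries when it occurs. A simple example makes this concrete. Xiao Ming is a top student who places near the top of his grade in every exam. The probability that he does so again in the next exam is therefore very high. If he does, the event carries very little information, because everyone expected it: good grades are the norm for Xiao Ming. If instead his results drop sharply and he falls to the lower ranks of his grade, the event carries a great deal of information and prompts all sorts of speculation and concern: Is he ill? Has he run into difficulties recently? Has something happened at home? The information carried by such a surprise far exceeds that of an outcome that matches expectations.

From this we can conclude: the more probable an event, the less information it carries; the less probable, the more information it carries. In other words, information content is a decreasing function of the event's probability.

For an event $x$ that occurs with probability $P(x)$, its information content $I_{p}(x)$, measured in bits, is defined as:
$$I_{p}(x) = \log_{2}\frac{1}{P(x)} = -\log_{2}P(x)$$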

To make this formula more concrete, consider the following discrete probability distribution $P(x)$:

| $X$ | $x_1$ | $x_2$ | $x_3$ |
| --- | --- | --- | --- |
| $P(x)$ | 0.05 | 0.3 | 0.65 |

Applying the formula to each event gives:

| $X$ | $x_1$ | $x_2$ | $x_3$ |
| --- | --- | --- | --- |
| $I_{p}(x)$ | 4.3219 | 1.7370 | 0.6215 |

The very unlikely event $x_1$ carries a lot of information, while the very likely event $x_3$ carries little. This is exactly the mathematical expression of our intuition.
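The per-event values can be reproduced in a few lines. This is a minimal sketch; the function name `information_content` is an assumption introduced here, and the function assumes every probability is strictly positive.

```python
import numpy as np

def information_content(prob: np.ndarray) -> np.ndarray:
    """Information content in bits: I(x) = -log2 P(x), for P(x) > 0."""
    return -np.log2(prob)

probs = np.array([0.05, 0.3, 0.65])
info = information_content(probs)

# Print each probability next to its information content.
for p, i in zip(probs, info):
    print(f"P(x) = {p:.2f} -> I(x) = {i:.4f} bits")
```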

Next, we introduce Shannon entropy. If we care not only about the information content of a single event but want to measure the "average uncertainty" or "average information content" of an entire probability distribution $P$, we compute the expected value of the information content, which is the Shannon entropy:
$$H(P)= \sum_{i=1}^n P(x_{i})I_{p}(x_i) =-\sum_{i=1}^n P(x_{i})\log_2 P(x_{i})$$

Shannon entropy describes the average amount of information carried by a random variable under the distribution $P$. The larger the entropy, the more uncertain the distribution and the larger its average information content; the smaller the entropy, the more concentrated the distribution and the lower its uncertainty.
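As a worked example (added here for illustration), applying the definition to the three-event distribution with probabilities 0.05, 0.3, and 0.65 gives:

$$
\begin{aligned}
H(P) &= -\left[0.05\log_2 0.05 + 0.3\log_2 0.3 + 0.65\log_2 0.65\right] \\
&\approx 0.05 \times 4.3219 + 0.3 \times 1.7370 + 0.65 \times 0.6215 \\
&\approx 0.2161 + 0.5211 + 0.4040 \\
&\approx 1.141 \text{ bits}
\end{aligned}
$$

This is below the largest value possible for three outcomes, $\log_2 3 \approx 1.585$ bits, because most of the probability mass sits on $x_3$.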

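For example, if one event in a distribution is almost certain to occur (probability close to 1), the entropy is close to 0, because almost no extra information is needed to describe the outcome. Conversely, if all events are equally likely (as when rolling a fair die), the entropy reaches its maximum, because every outcome is equally "surprising" and describing the result requires the most information.

The code below should make the concept of Shannon entropy more concrete: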

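```python
import numpy as np

# Three different probability distributions
p_1 = np.array([0.01, 0.0, 0.98, 0.01])
p_2 = np.array([0.25, 0.25, 0.25, 0.25])
p_3 = np.array([0.3, 0.2, 0.1, 0.4])

def shannonEntropy(p: np.ndarray) -> float:
    """Shannon entropy of a discrete distribution, in bits."""
    # Add 1e-12 inside the log to avoid log(0); a zero-probability
    # term still contributes 0 because it is multiplied by p = 0.
    return -np.sum(p * np.log2(p + 1e-12))

print(shannonEntropy(p_1))
print(shannonEntropy(p_2))
print(shannonEntropy(p_3))

# Output (the uniform case is exactly 2 bits without the 1e-12 offset)
# 0.16144054253749263
# 1.9999999999942293
# 1.8464393446652445
```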

3. Cross-Entropy

Cross-entropy is everywhere in machine learning, especially in classification tasks. Simply put, the cross-entropy $H(P,Q)$ reflects how much a distribution $Q$ differs from a reference distribution $P$: for a fixed $P$, the closer $Q$ is to $P$, the smaller the cross-entropy, and the further $Q$ departs from $P$, the larger it becomes. Its minimum over $Q$ is reached at $Q = P$, where it equals the entropy $H(P)$ rather than zero.

The cross-entropy is computed as:
$$H(P,Q)= \sum_{i=1}^n P(x_i)I_{q}(x_i) =-\sum_{i=1}^n P(x_i)\log_2 Q(x_i)$$

To build intuition, let's look at two examples.

Suppose that in a three-class classification task, the true label distribution of a sample and the model's predicted distribution are as follows:

| Event $x$ | $x_1$ (cat) | $x_2$ (dog) | $x_3$ (bird) |
| --- | --- | --- | --- |
| True distribution $P_1(x)$ | 0.00 | 1.00 | 0.00 |
| Predicted distribution $Q_1(x)$ | 0.05 | 0.90 | 0.05 |

Here the true distribution says the sample is a "dog" (probability 1). The model's prediction is also concentrated on "dog", with only small probabilities assigned to the other classes. The cross-entropy is:
$$
\begin{aligned}
H(P_1, Q_1) &= -\sum_{i=1}^{3} P_1(x_i) \log_2 Q_1(x_i) \\
&= -\left[P_1(x_1) \log_2 Q_1(x_1) + P_1(x_2) \log_2 Q_1(x_2) + P_1(x_3) \log_2 Q_1(x_3)\right] \\
&= -\left[0 \times \log_2(0.05) + 1 \times \log_2(0.90) + 0 \times \log_2(0.05)\right] \\
&= -\log_2(0.90) \\
&\approx 0.152
\end{aligned}
$$

This value is very small, which indicates that the two distributions are highly similar.

Now consider a different case, in which the model makes a completely wrong prediction:

| Event $x$ | $x_1$ (cat) | $x_2$ (dog) | $x_3$ (bird) |
| --- | --- | --- | --- |
| True distribution $P_2(x)$ | 0.00 | 1.00 | 0.00 |
| Predicted distribution $Q_2(x)$ | 0.70 | 0.15 | 0.15 |

$$
\begin{aligned}
H(P_2, Q_2) &= -\sum_{i=1}^{3} P_2(x_i) \log_2 Q_2(x_i) \\
&= -\left[P_2(x_1) \log_2 Q_2(x_1) + P_2(x_2) \log_2 Q_2(x_2) + P_2(x_3) \log_2 Q_2(x_3)\right] \\
&= -\left[0 \times \log_2(0.70) + 1 \times \log_2(0.15) + 0 \times \log_2(0.15)\right] \\
&= -\log_2(0.15) \\
&\approx 2.737
\end{aligned}
$$

The cross-entropy is much larger, clearly reflecting the large difference between the two distributions.
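Both calculations collapse to a single term because the true distribution is one-hot. The sketch below reproduces them under that assumption; the helper name `one_hot_cross_entropy` and the class index encoding are introduced here for illustration.

```python
import numpy as np

def one_hot_cross_entropy(true_class: int, q: np.ndarray) -> float:
    """Cross-entropy in bits when the true distribution is one-hot."""
    # Only the true class has weight 1, so the sum reduces to one term.
    return float(-np.log2(q[true_class]))

dog = 1  # index of "dog" in (cat, dog, bird)
q_1 = np.array([0.05, 0.90, 0.05])  # confident, correct prediction
q_2 = np.array([0.70, 0.15, 0.15])  # confident, wrong prediction

print(f"{one_hot_cross_entropy(dog, q_1):.3f}")  # 0.152
print(f"{one_hot_cross_entropy(dog, q_2):.3f}")  # 2.737
```

This is why, for hard labels, the cross-entropy loss depends only on the probability the model assigns to the correct class.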

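So the essence of cross-entropy is this: it is the expectation, weighted by the true distribution $P$, of the information content defined by the distribution $Q$. It measures the average number of bits needed to describe events drawn from the true distribution $P$ when the description is based on the (possibly wrong) estimate $Q$.

Or, to use a more down-to-earth analogy:

Suppose you have to get a message across to a foreigner (P) using your English (Q).

If your English is good (Q close to P), you do not need many words to be understood.

If your English is poor (Q far from P), you need a lot of filler and hand gestures to communicate.

Cross-entropy measures how much you have to say per message on average, necessary words and filler combined.

A few lines of code are enough to verify and get a feel for the cross-entropy calculation: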

```python
import numpy as np

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute the cross-entropy H(P, Q) between discrete distributions P and Q, in bits.
    """
    # Add 1e-12 inside the log to avoid log(0)
    return -np.sum(p * np.log2(q + 1e-12))

# Example calculation
p = np.array([0.3, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.5])
print(cross_entropy(p, q))  # Output: 1.599...
```

4. KL Divergence

KL divergence serves a similar purpose to cross-entropy: both measure how different two probability distributions are. Likewise, the more similar the two distributions, the smaller the KL divergence; the more they differ, the larger it is.

You might ask: if cross-entropy can already measure the difference between distributions, why do we need KL divergence? To understand this, look at the formula for KL divergence:
$$D_{KL}(P \parallel Q)= H(P,Q)-H(P) = \sum_{i=1}^{n}P(x_i)\log_{2}\frac{P(x_i)}{Q(x_i)}$$

You may be surprised to find that the KL divergence between $P$ and $Q$ is simply their cross-entropy minus the Shannon entropy of $P$. KL divergence is the difference between the cross-entropy and the entropy of the true distribution. This means that KL divergence takes the cross-entropy and subtracts the uncertainty inherent in the true distribution, so it reflects more purely the difference between the approximating distribution $Q$ and the true distribution $P$. In other words, cross-entropy contains both "the entropy of the true distribution" and "the difference between the two distributions", whereas KL divergence keeps only the latter.
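The identity follows directly from the definitions of cross-entropy and Shannon entropy. The short derivation below is added as a worked step:

$$
\begin{aligned}
H(P,Q) - H(P) &= -\sum_{i=1}^{n} P(x_i)\log_2 Q(x_i) + \sum_{i=1}^{n} P(x_i)\log_2 P(x_i) \\
&= \sum_{i=1}^{n} P(x_i)\left[\log_2 P(x_i) - \log_2 Q(x_i)\right] \\
&= \sum_{i=1}^{n} P(x_i)\log_2\frac{P(x_i)}{Q(x_i)} = D_{KL}(P \parallel Q)
\end{aligned}
$$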

Note that KL divergence has the following important mathematical properties:
$$D_{KL}(P \parallel Q) \ge 0 \quad \text{and, in general,} \quad D_{KL}(P \parallel Q) \ne D_{KL}(Q \parallel P)$$

The KL divergence between any two probability distributions is always greater than or equal to 0, with equality exactly when $P = Q$ (this is Gibbs' inequality). If you ever compute a KL divergence and get a negative result, be careful: it most likely means the implementation is wrong or the probabilities were not properly normalized. In addition, KL divergence is asymmetric, which can be seen from the formula alone, since $P$ and $Q$ play different roles in it. $D_{KL}(P \parallel Q)$ treats $P$ as the target distribution and measures how the predicted distribution $Q$ differs from the target $P$, whereas $D_{KL}(Q \parallel P)$ swaps the two roles.
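Both properties can be checked empirically on randomly generated distributions. The sketch below is illustrative only: the helper `kl_bits`, the Dirichlet sampling, the seed, and the sample size are assumptions, and a numerical check on samples is not a proof.

```python
import numpy as np

def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q) in bits, assuming strictly positive, normalized inputs."""
    return float(np.sum(p * np.log2(p / q)))

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Sample 1000 pairs of random 4-outcome distributions.
pairs = [(rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))) for _ in range(1000)]

forward = np.array([kl_bits(p, q) for p, q in pairs])  # D_KL(P || Q)
reverse = np.array([kl_bits(q, p) for p, q in pairs])  # D_KL(Q || P)

# Non-negativity: the smallest value over all sampled pairs is not below 0.
print(forward.min() >= 0 and reverse.min() >= 0)

# The divergence of a distribution from itself is 0.
print(np.isclose(kl_bits(pairs[0][0], pairs[0][0]), 0.0))

# Asymmetry: the two directions generally give different values.
print(np.allclose(forward, reverse))
```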

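The following code makes the relationship between cross-entropy and KL divergence more concrete. It uses $\log_2$ throughout, so every quantity is in bits and the identity $D_{KL}(P \parallel Q) = H(P,Q) - H(P)$ can be read directly off the printed values:

```python
import numpy as np

def kl_divergence(p, q):
    """
    Compute the KL divergence D_KL(P || Q) between discrete distributions P and Q, in bits.
    """
    # Clip to a small positive value to avoid log(0)
    p = np.clip(p, 1e-12, 1)
    q = np.clip(q, 1e-12, 1)
    # Renormalize so that both are valid probability distributions
    p = p / np.sum(p)
    q = q / np.sum(q)

    # Use log2 so the units match shannonEntropy and cross_entropy
    return np.sum(p * np.log2(p / q))
```

```python
# Reuses shannonEntropy and cross_entropy defined earlier
p = np.array([0.02, 0.08, 0.9])
q = np.array([0.1, 0.1, 0.8])
print(f"H(p) = {shannonEntropy(p):.6f}")             # Shannon entropy of p
print(f"H(q) = {shannonEntropy(q):.6f}")             # Shannon entropy of q
print(f"H(p,q) = {cross_entropy(p, q):.6f}")         # cross-entropy of p and q
print(f"H(q,p) = {cross_entropy(q, p):.6f}")         # cross-entropy of q and p
print(f"D_kl(p||q) = {kl_divergence(p, q):.6f}")     # KL divergence from q to p
print(f"D_kl(q||p) = {kl_divergence(q, p):.6f}")     # KL divergence from p to q
```

```text
# Output
H(p) = 0.541188
H(q) = 0.921928
H(p,q) = 0.621928
H(q,p) = 1.050374
D_kl(p||q) = 0.080740
D_kl(q||p) = 0.128446
```

As stated above, cross-entropy contains both "the entropy of the true distribution" and "the difference between the two distributions", whereas KL divergence keeps only the latter. KL divergence therefore isolates the difference between two probability distributions more cleanly than cross-entropy does.

```python
p = np.array([0.02, 0.08, 0.9])
q = np.array([0.1, 0.1, 0.8])
k = np.array([0.5, 0.4, 0.1])
print(f"H(p,q) = {cross_entropy(p, q):.6f}")
print(f"H(p,k) = {cross_entropy(p, k):.6f}")
print(f"D_kl(p||q) = {kl_divergence(p, q):.6f}")
print(f"D_kl(p||k) = {kl_divergence(p, k):.6f}")
```

```text
# Output
H(p,q) = 0.621928
H(p,k) = 3.115490
D_kl(p||q) = 0.080740
D_kl(p||k) = 2.574301
```

In both cases the KL divergence is smaller than the cross-entropy by $H(p) \approx 0.541$ bits (up to rounding), because it contains only the difference between the distributions and excludes the entropy of the target distribution.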

Given all this, why is cross-entropy so commonly used as the loss function in machine learning, while KL divergence is mentioned far less often?

In fact, minimizing cross-entropy in machine learning is the same as minimizing KL divergence. The labels $P$ of the training data are fixed, so $H(P)$ is a constant that does not depend on the model. By the identity
$$D_{KL}(P \parallel Q)= H(P,Q)-H(P)$$

minimizing $D_{KL}(P \parallel Q)$ over $Q$ is exactly the same as minimizing $H(P,Q)$. For one-hot labels, $H(P) = 0$ and the two quantities coincide.
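The equivalence can be illustrated numerically: for a fixed target $P$ and several candidate predictions $Q$, the gap between cross-entropy and KL divergence is always $H(P)$, so both pick the same best candidate. The sketch below is a minimal illustration; the helper names, the random candidates, and the seed are assumptions and do not represent an actual training loop.

```python
import numpy as np

def entropy_bits(p: np.ndarray) -> float:
    """Shannon entropy H(p) in bits, for strictly positive p."""
    return float(-np.sum(p * np.log2(p)))

def cross_entropy_bits(p: np.ndarray, q: np.ndarray) -> float:
    """Cross-entropy H(p, q) in bits, for strictly positive q."""
    return float(-np.sum(p * np.log2(q)))

def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """KL divergence D_KL(p || q) in bits, for strictly positive p and q."""
    return float(np.sum(p * np.log2(p / q)))

p = np.array([0.02, 0.08, 0.9])  # fixed target distribution

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
candidates = rng.dirichlet(np.ones(3), size=5)  # hypothetical model outputs

ce = np.array([cross_entropy_bits(p, q) for q in candidates])
kl = np.array([kl_bits(p, q) for q in candidates])

# The gap between the two losses is the constant H(P) for every candidate.
print(np.allclose(ce - kl, entropy_bits(p)))

# So both losses rank the candidates identically and share the same minimizer.
print(np.argmin(ce) == np.argmin(kl))
```

Since the two losses differ by a constant, their gradients with respect to the model parameters are also identical, which is why frameworks can optimize cross-entropy directly.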