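The Relationship Between the Fisher Matrix and the Hessian: Proving That the Fisher Matrix Is the Expected Hessian of the Negative Log-Likelihood

We prove that the Fisher matrix equals the expectation of the Hessian of the negative log-likelihood.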

Notation

  • $f_{\theta}(\cdot)$: the probability density.

  • $p(x \mid \theta) = p_{\theta}(x) = \prod_{i=1}^{N} f_{\theta}(x_i)$: the likelihood function.

  • $s(\theta) = \nabla_{\theta} \log p_{\theta}(x)$: the score function, i.e. the gradient of the log-likelihood.

  • $I = E_{p_{\theta}(x)}\left[ (\nabla_{\theta} \log p_{\theta}(x)) (\nabla_{\theta} \log p_{\theta}(x))^T \right]$: the Fisher matrix.

  • $I_{i,j}(\theta) = E_{p_{\theta}(x)}\left[ (D_i \log p_{\theta}(x)) (D_j \log p_{\theta}(x)) \right]$: the entry in row $i$, column $j$ of the Fisher matrix, where $D_i = \frac{\partial}{\partial \theta_i}$ and $D_{i,j} = \frac{\partial^2}{\partial \theta_i \, \partial \theta_j}$.

  • $H_{i,j} = D_{i,j} \log p_{\theta}(x)$: the entry in row $i$, column $j$ of the Hessian of the log-likelihood.
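To make the notation concrete, here is a minimal sketch for a single Bernoulli observation ($N = 1$). The model choice and the names `log_f`, `score`, and `fisher` are assumptions made for illustration; they do not come from the original text. The score is approximated by a central difference, and the expectation is taken exactly over $x \in \{0, 1\}$.

```python
import math

def log_f(x, theta):
    """Log-density of a Bernoulli(theta) observation x in {0, 1}."""
    return x * math.log(theta) + (1 - x) * math.log(1 - theta)

def score(x, theta, eps=1e-6):
    """Score: derivative of the log-likelihood w.r.t. theta (central difference)."""
    return (log_f(x, theta + eps) - log_f(x, theta - eps)) / (2 * eps)

def fisher(theta):
    """Fisher information E[score^2], with the expectation taken exactly over x."""
    probs = {0: 1 - theta, 1: theta}
    return sum(p * score(x, theta) ** 2 for x, p in probs.items())

theta = 0.3
fisher_numeric = fisher(theta)
# Known closed form for the Bernoulli model: 1 / (theta * (1 - theta)).
fisher_closed_form = 1.0 / (theta * (1 - theta))
print(fisher_numeric, fisher_closed_form)
```

With one scalar parameter the Fisher matrix is a single number; for a parameter vector the same expectation is taken over the outer product of the score with itself.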

Proof

Goal:
$$I_{i,j}(\theta) = -E_{p_{\theta}(x)}\left[ H_{i,j} \right]$$

Start from $H_{i,j}$. Using $D_j \log p_{\theta}(x) = \frac{D_j p_{\theta}(x)}{p_{\theta}(x)}$ and then the quotient rule:
$$
\begin{aligned}
H_{i,j} & = D_{i,j} \log p_{\theta}(x) \\
& = D_i \left( \frac{D_j p_{\theta}(x)}{p_{\theta}(x)} \right) \\
& = \frac{(D_{i,j} p_{\theta}(x)) \cdot p_{\theta}(x) - D_i p_{\theta}(x) \, D_j p_{\theta}(x)}{p_{\theta}^2(x)} \\
& = \frac{D_{i,j} p_{\theta}(x)}{p_{\theta}(x)} - \frac{D_i p_{\theta}(x)}{p_{\theta}(x)} \cdot \frac{D_j p_{\theta}(x)}{p_{\theta}(x)}
\end{aligned}
$$

Hence the right-hand side of the goal is:

$$
-E_{p_{\theta}(x)}\left[ H_{i,j} \right] = -E_{p_{\theta}(x)}\left[ \frac{D_{i,j} p_{\theta}(x)}{p_{\theta}(x)} \right] + E_{p_{\theta}(x)}\left[ \left( \frac{D_i p_{\theta}(x)}{p_{\theta}(x)} \right) \cdot \left( \frac{D_j p_{\theta}(x)}{p_{\theta}(x)} \right) \right]
$$

where the first expectation vanishes:
$$
\begin{aligned}
E_{p_{\theta}(x)}\left[ \frac{D_{i,j} p_{\theta}(x)}{p_{\theta}(x)} \right] & = \int \frac{D_{i,j} p_{\theta}(x)}{p_{\theta}(x)} \cdot p_{\theta}(x) \, dx \\
& = \int D_{i,j} p_{\theta}(x) \, dx \\
& = D_{i,j} \int p_{\theta}(x) \, dx \qquad && (\text{interchanging differentiation and integration}) \\
& = D_{i,j} \, 1 \qquad && (\text{the derivative of a constant is } 0) \\
& = 0
\end{aligned}
$$
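The interchange of differentiation and integration relies on regularity conditions that the derivation leaves implicit, for instance that the support of $p_{\theta}$ does not depend on $\theta$. As an illustration that is not part of the original argument, take the uniform density $f_{\theta}(x) = 1/\theta$ on $[0, \theta]$. There the interchange already fails for the first derivative:

$$
\frac{d}{d\theta} \int_0^{\theta} \frac{1}{\theta} \, dx = \frac{d}{d\theta} \, 1 = 0,
\qquad
\int_0^{\theta} \frac{\partial}{\partial \theta} \left( \frac{1}{\theta} \right) dx = \int_0^{\theta} -\frac{1}{\theta^2} \, dx = -\frac{1}{\theta} \neq 0 .
$$

For models of this kind the identity between the Fisher matrix and the expected negative Hessian need not hold.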

Moreover, by the chain rule:
$$\frac{D_i p_{\theta}(x)}{p_{\theta}(x)} = D_i \log p_{\theta}(x)$$

Since the first term vanishes, the right-hand side reduces to:
$$
\begin{aligned}
-E_{p_{\theta}(x)}\left[ H_{i,j} \right] & = E_{p_{\theta}(x)}\left[ \left( \frac{D_i p_{\theta}(x)}{p_{\theta}(x)} \right) \cdot \left( \frac{D_j p_{\theta}(x)}{p_{\theta}(x)} \right) \right] \\
& = E_{p_{\theta}(x)}\left[ (D_i \log p_{\theta}(x)) (D_j \log p_{\theta}(x)) \right] \\
& = I_{i,j}(\theta)
\end{aligned}
$$

This completes the proof.
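The identity can be sanity-checked numerically. The sketch below is an assumption-laden illustration, not part of the original proof: it uses a single Gaussian observation with parameters $\theta = (\mu, \sigma)$, hand-derived gradients and Hessians of the log-density, and a Monte Carlo average with a fixed seed. The function names are hypothetical.

```python
import random

def score(x, mu, sigma):
    """Gradient of log N(x; mu, sigma^2) with respect to (mu, sigma)."""
    d = x - mu
    return [d / sigma**2, -1.0 / sigma + d**2 / sigma**3]

def hessian(x, mu, sigma):
    """Hessian of log N(x; mu, sigma^2) with respect to (mu, sigma)."""
    d = x - mu
    h_mu_sigma = -2.0 * d / sigma**3
    return [[-1.0 / sigma**2, h_mu_sigma],
            [h_mu_sigma, 1.0 / sigma**2 - 3.0 * d**2 / sigma**4]]

def estimate(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimates of E[score score^T] and of -E[Hessian]."""
    rng = random.Random(seed)
    fisher = [[0.0, 0.0], [0.0, 0.0]]
    neg_hess = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)  # sample from the model itself
        s = score(x, mu, sigma)
        h = hessian(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                fisher[i][j] += s[i] * s[j] / n_samples
                neg_hess[i][j] -= h[i][j] / n_samples
    return fisher, neg_hess

fisher_mc, neg_hessian_mc = estimate(mu=1.0, sigma=2.0)
print(fisher_mc)
print(neg_hessian_mc)
```

For this parameterization the exact Fisher matrix is $\mathrm{diag}(1/\sigma^2, \; 2/\sigma^2)$, so both estimates should land close to $\mathrm{diag}(0.25, \; 0.5)$ up to sampling noise.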

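In practice, computing $H$ is very expensive, whereas computing $I$ and using it as an approximation of $H$ is comparatively easy. Some pruning methods exploit this, for example NAP ("Network Automatic Pruning: Start NAP and Take a Nap"), which builds on OBS and OBD.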
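As a rough illustration of why this is convenient, the sketch below computes the diagonal of the Fisher matrix for a tiny logistic regression using only first-order gradients, and compares it with a finite-difference Hessian diagonal of the negative log-likelihood. The toy data, the weights, and all function names are assumptions made for this example. The final saliency line follows the OBD-style form $\frac{1}{2} H_{kk} w_k^2$ with the Fisher diagonal substituted for the Hessian diagonal; it is not NAP's actual procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, xs, ys):
    """Negative log-likelihood of logistic regression, summed over the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fisher_diagonal(w, xs):
    """Diagonal of the Fisher matrix from first-order gradients only.

    For each input, the squared gradient of log p(y | x, w) is averaged over
    labels y drawn from the model's own predictive distribution.
    """
    diag = [0.0] * len(w)
    for x in xs:
        p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        for y, prob_y in ((0, 1 - p), (1, p)):
            grad = [(y - p) * xk for xk in x]  # gradient of log p(y | x, w)
            for k, g in enumerate(grad):
                diag[k] += prob_y * g * g
    return diag

def hessian_diagonal(w, xs, ys, eps=1e-4):
    """Diagonal of the NLL Hessian via second-order central differences."""
    base = nll(w, xs, ys)
    diag = []
    for k in range(len(w)):
        up = list(w)
        up[k] += eps
        down = list(w)
        down[k] -= eps
        diag.append((nll(up, xs, ys) - 2 * base + nll(down, xs, ys)) / eps**2)
    return diag

# Hypothetical toy data: a bias feature plus one input feature.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
ys = [1, 0, 1, 0]
w = [0.2, -0.4]

fisher_diag = fisher_diagonal(w, xs)
hessian_diag = hessian_diagonal(w, xs, ys)
# OBD-style saliency with the Fisher diagonal standing in for the Hessian diagonal.
saliency = [0.5 * f * wk**2 for f, wk in zip(fisher_diag, w)]
print(fisher_diag, hessian_diag, saliency)
```

Logistic regression is a special case in which the Hessian of the negative log-likelihood does not depend on the labels, so the two diagonals agree up to finite-difference error. For general networks the Fisher matrix matches the Hessian only in expectation under the model, which is why it serves as an approximation.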

References:

https://zhuanlan.zhihu.com/p/546885304?utm_psn=1840735001693523969
https://jaketae.github.io/study/fisher/
https://mark.reid.name/blog/fisher-information-and-log-likelihood.html
https://bobondemon.github.io/2022/01/07/Score-Function-and-Fisher-Information-Matrix/
