Statistical Decisions and Bayes Risk

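Bayesian statistical decision theory treats the various problems of statistical inference from a single viewpoint and with a single method. Statistical decision theory, in turn, gives statistical inference a rational interpretation.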

Bayesian hypothesis testing can be viewed as a special case of Bayesian statistical decision under 0-1 loss.

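$$
\begin{array}{ccccc}
\text{Statistical decision} & \rightarrow & \text{Criterion function} & \rightarrow & \text{Decision function} \\
\downarrow \text{introduce a prior} & & & & \\
\text{Bayesian statistical decision} & \rightarrow & \text{Bayes risk} & \xrightarrow{\text{Bayes decision rule}} & \text{Bayes decision function}
\end{array}
$$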

Statistical Decisions: Introducing Risk

The three elements of a statistical decision problem

In statistical decision theory, a statistical decision problem usually consists of three basic elements:

Sample space and family of distributions
- All possible values of the sample make up the sample space $\mathcal{X}$.
- The joint distribution of the sample $\boldsymbol X = (X_1, X_2, \cdots, X_n)^{\top} \in \mathcal{X}$ belongs to a family of distributions $\{F_\theta(\boldsymbol x) : \boldsymbol x \in \mathcal{X}, \theta \in \varTheta\}$, where $\theta$ is an unknown parameter and $\varTheta$ is the parameter space.
Decision space: the set $\mathcal{A}$ of all decisions $a$ is called the decision space.
Loss function

A non-negative function of two variables defined on $\varTheta \times \mathcal{A}$,
$$ L(\theta, a) \geqslant 0, \quad \forall \theta \in \varTheta, \, a \in \mathcal{A}, $$

is called the loss function of the statistical decision problem. $L(\theta, a)$ is the cost incurred when the true parameter is $\theta$ and decision $a$ is taken.
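As a small illustration, the two losses that appear in this article (squared loss for estimation and 0-1 loss for testing) can be written as plain functions of the pair $(\theta, a)$. The function names below are hypothetical and chosen only for this sketch.

```python
def squared_loss(theta: float, a: float) -> float:
    """Squared loss L(theta, a) = (theta - a)^2, typical for point estimation."""
    return (theta - a) ** 2


def zero_one_loss(theta: int, a: int) -> float:
    """0-1 loss: no cost for the correct decision, unit cost otherwise."""
    return 0.0 if a == theta else 1.0


# Both functions are non-negative on Theta x A, as the definition requires.
print(squared_loss(2.0, 1.5))  # cost of estimating theta = 2.0 by a = 1.5
print(zero_one_loss(1, 0))     # cost of deciding 0 when theta = 1
```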

Decision function

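Suppose the three elements of a statistical decision problem are given: the sample space $\mathcal{X}$ with the family of sample distributions $\{F_\theta(\boldsymbol x): \theta \in \varTheta\}$, the decision space $\mathcal{A}$, and the loss function $L(\theta, a)$.

Making a statistical decision means finding, under some criterion, a function $f$ that assigns a decision $a \in \mathcal{A}$ to every observed sample value $\boldsymbol x = (x_1, x_2, \cdots, x_n)^{\top} \in \mathcal{X}$. This function is a mapping $f: \mathcal{X} \to \mathcal{A}$ from the sample space $\mathcal{X}$ to the decision space $\mathcal{A}$ that sends the sample $\boldsymbol x$ to the decision $a = f(\boldsymbol x)$. It is called a statistical decision function, or decision function for short.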

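Common statistical decision criteria include the Bayes criterion (minimizing expected loss), the Neyman-Pearson (N-P) criterion, the minimax criterion, and the uniformly optimal criterion. A decision function $f(\boldsymbol x)$ is a solution of the given statistical decision problem: it determines which decision is taken for the observed data, and the chosen criterion determines which decision function counts as optimal.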

For a sample $\boldsymbol X = (X_1, X_2, \cdots, X_n)^{\top}$, $f(\boldsymbol X) = f(X_1, X_2, \cdots, X_n)$ is a statistic (a function of the sample).

Risk function

The performance of a decision function $f$ is usually measured by its risk function. Given the loss function $L(\theta, a)$, the risk function is defined as
$$ R(\theta, f) = {E}_\theta[L(\theta, f(\boldsymbol X))] = \int L(\theta, f(\boldsymbol x))\, p(\boldsymbol x\mid \theta)\,{\rm d} \boldsymbol x $$
where $p(\boldsymbol x \mid \theta)$ denotes the density of $F_\theta$.

$R(\theta, f)$ is the expected (average) loss incurred by using the decision function $f$ when the parameter is $\theta$.
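To make the risk function concrete, here is a minimal sketch under assumptions that are not part of the text above: the sample is summarized by $X \sim \text{Binomial}(n, \theta)$, the loss is squared loss, and two hypothetical decision functions `mle` and `shrunk` are compared. Because the sample is discrete, the integral in the definition becomes a finite sum, which can be evaluated exactly.

```python
from math import comb


def mle(x: int, n: int) -> float:
    """Decision function f1(x) = x / n (the sample proportion)."""
    return x / n


def shrunk(x: int, n: int) -> float:
    """Decision function f2(x) = (x + 1) / (n + 2), pulled toward 1/2."""
    return (x + 1) / (n + 2)


def risk(theta: float, f, n: int) -> float:
    """R(theta, f) = sum_x (theta - f(x))^2 * P(X = x | theta) under squared loss."""
    total = 0.0
    for x in range(n + 1):
        pmf = comb(n, x) * theta**x * (1 - theta) ** (n - x)
        total += (theta - f(x, n)) ** 2 * pmf
    return total


n = 10
for theta in (0.1, 0.5, 0.9):
    # Print theta, the risk of f1, and the risk of f2 at that theta.
    print(theta, risk(theta, mle, n), risk(theta, shrunk, n))
```

In this sketch neither decision function has the smaller risk for every $\theta$: `shrunk` does better near $\theta = 0.5$ and `mle` does better near the endpoints. The risk function alone therefore does not rank the two, which is why an additional criterion is needed.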

The Bayesian Approach: Introducing a Prior

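Unlike classical statistics, which relies on the data alone, the Bayesian approach treats the parameter $\theta$ as a random variable with prior distribution $\pi(\theta)$ and combines prior information with the data to make the optimal decision.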

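The prior distribution $\pi(\theta)$ is a probability distribution that expresses subjective or objective beliefs about the unknown parameter $\theta$; it reflects the uncertainty about $\theta$ before any data are seen.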

The Bayes risk $R_\pi(f)$ is the average of the risk function $R(\theta, f)$ weighted by the prior distribution $\pi(\theta)$:
$$ R_\pi(f) = {E}_\pi[R(\theta, f)] = \int R(\theta, f)\, \pi(\theta)\,{\rm d}\theta $$

Equivalently,
$$ R_\pi(f) = {E}_\pi[{E}_\theta[L(\theta, f(\boldsymbol X))]] $$

or, written directly,
$$ R_\pi(f) = {E}[L(\theta, f(\boldsymbol X))] $$

where the expectation is taken with respect to the joint distribution $\pi(\theta)\, p(\boldsymbol x\mid \theta)$.

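$R_\pi(f)$ is the Bayes risk of the decision function $f$ under the prior distribution $\pi(\theta)$: the loss of this decision function averaged over all possible values of $\theta$. If some decision function attains the minimum Bayes risk, it is the Bayes optimal solution, and this optimal decision function is called the Bayes decision function.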
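The weighted average that defines the Bayes risk can be approximated numerically. The sketch below assumes a Binomial$(n, \theta)$ sample, squared loss, and a uniform prior on $(0, 1)$; none of these choices come from the text, and the names `mle`, `shrunk`, `uniform_prior`, and `bayes_risk` are hypothetical. The integral over $\theta$ is approximated with a midpoint rule.

```python
from math import comb


def mle(x: int, n: int) -> float:
    return x / n


def shrunk(x: int, n: int) -> float:
    return (x + 1) / (n + 2)


def risk(theta: float, f, n: int) -> float:
    """Exact risk R(theta, f) under squared loss for X ~ Binomial(n, theta)."""
    return sum(
        (theta - f(x, n)) ** 2 * comb(n, x) * theta**x * (1 - theta) ** (n - x)
        for x in range(n + 1)
    )


def uniform_prior(theta: float) -> float:
    """Density of the uniform prior on (0, 1)."""
    return 1.0


def bayes_risk(f, n: int, prior_density, grid_size: int = 2000) -> float:
    """Midpoint-rule approximation of the integral of R(theta, f) * pi(theta) over (0, 1)."""
    h = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * h
        total += risk(theta, f, n) * prior_density(theta) * h
    return total


n = 10
print(bayes_risk(mle, n, uniform_prior))     # close to 1 / (6 n)
print(bayes_risk(shrunk, n, uniform_prior))  # close to 1 / (6 (n + 2))
```

Unlike the risk function, the Bayes risk is a single number per decision function, so the two rules can be ranked. Under this uniform prior `shrunk` has the smaller Bayes risk, which is expected because $(x + 1)/(n + 2)$ is the posterior mean of $\theta$ under a uniform prior.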

Bayes Decision Rule

Posterior risk

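Given the observation $\boldsymbol x$, the posterior expected loss of a decision $a$ is defined as
$$ R(\pi, a \mid \boldsymbol x) = {E}_\pi[L(\theta, a) \mid \boldsymbol X = \boldsymbol x] = \int L(\theta, a)\, \pi(\theta \mid \boldsymbol x)\,{\rm d}\theta $$

where the expectation is taken with respect to the posterior distribution $\pi(\theta\mid \boldsymbol x)$. The posterior expected loss is the expected loss computed from the posterior distribution; it is the average loss of taking decision $a$ given the current data, and is also called the posterior risk.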
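When the parameter takes only finitely many values, the integral in the posterior risk becomes a sum that is easy to compute. The following sketch assumes a coin with three hypothetical candidate biases, a uniform prior over them, $x = 7$ heads in $n = 10$ tosses, and squared loss; all of these are illustrative assumptions.

```python
from math import comb


def posterior(thetas, prior, x, n):
    """Posterior pi(theta | x) on a finite parameter set, by Bayes' theorem."""
    weights = [
        p * comb(n, x) * t**x * (1 - t) ** (n - x) for t, p in zip(thetas, prior)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_risk(a, thetas, post):
    """R(pi, a | x) = sum_theta (theta - a)^2 * pi(theta | x) under squared loss."""
    return sum((t - a) ** 2 * p for t, p in zip(thetas, post))


thetas = [0.2, 0.5, 0.8]       # hypothetical candidate coin biases
prior = [1 / 3, 1 / 3, 1 / 3]  # uniform prior over the candidates
post = posterior(thetas, prior, x=7, n=10)

for a in (0.2, 0.5, 0.8):
    # Posterior risk of announcing a as the estimate of theta.
    print(a, posterior_risk(a, thetas, post))
```

Among these three actions, $a = 0.8$ has the smallest posterior risk, because most of the posterior mass sits on $\theta = 0.8$ after observing 7 heads in 10 tosses.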

The principle of minimum posterior risk

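After observing the data $\boldsymbol x$, compute the expected loss (that is, the posterior risk) of every possible decision $a$ from the posterior distribution $\pi(\theta\mid \boldsymbol x)$, and choose the decision $a$ with the smallest posterior risk. This is called the Bayes decision rule.

The optimal decision $f^*(\boldsymbol x)$ can be written as
$$ f^*(\boldsymbol x) = \arg\min_{a \in \mathcal{A}} R(\pi, a \mid \boldsymbol x) $$

Minimizing the posterior risk for each $\boldsymbol x$ also minimizes the Bayes risk, because the Bayes risk is the posterior risk averaged over the marginal density $m(\boldsymbol x) = \int p(\boldsymbol x \mid \theta)\, \pi(\theta)\,{\rm d}\theta$ of the sample: $R_\pi(f) = \int R(\pi, f(\boldsymbol x) \mid \boldsymbol x)\, m(\boldsymbol x)\,{\rm d}\boldsymbol x$.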
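The rule can be sketched for the 0-1 loss case mentioned at the start of the article, where Bayesian hypothesis testing appears as a special case. The setup below is an assumption made for illustration: two simple hypotheses $\theta = 0$ and $\theta = 1$, one observation $X \sim N(\theta, 1)$, and equal prior probabilities. Under 0-1 loss the posterior risk of an action is one minus the posterior probability of the hypothesis it selects, so the Bayes decision is the hypothesis with the larger posterior probability.

```python
from math import exp, pi, sqrt

THETAS = (0.0, 1.0)  # hypothetical simple hypotheses H0: theta = 0, H1: theta = 1
PRIOR = (0.5, 0.5)   # hypothetical prior probabilities
ACTIONS = (0, 1)     # action a means "decide theta = THETAS[a]"


def normal_pdf(x: float, mean: float, sd: float = 1.0) -> float:
    """Density of N(mean, sd^2) at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def posterior(x: float):
    """Posterior probabilities of the two hypotheses given the observation x."""
    weights = [p * normal_pdf(x, t) for t, p in zip(THETAS, PRIOR)]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_risk(a: int, post) -> float:
    """Under 0-1 loss: the posterior probability that action a is wrong."""
    return sum(p for i, p in enumerate(post) if i != a)


def bayes_decision(x: float) -> int:
    """f*(x) = argmin over actions of the posterior risk."""
    post = posterior(x)
    return min(ACTIONS, key=lambda a: posterior_risk(a, post))


print(bayes_decision(0.2), bayes_decision(0.8))
```

With equal priors and equal variances, this rule decides $\theta = 1$ when $x$ is above the midpoint $0.5$ and $\theta = 0$ when it is below.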

Posterior distribution vs. posterior risk

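In Bayesian inference, the most important result is the posterior distribution $\pi(\theta\mid \boldsymbol x)$, which summarizes all the information about the parameter.
In Bayesian decision making, the key quantity is the posterior risk, which starts from the posterior distribution and serves as the tool that guides concrete actions.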

An example

Suppose we want to estimate the mean $\mu$ of a normal population, where $X \sim N(\mu, \sigma^2)$, with the prior $\mu \sim N(\mu_0, \sigma_0^2)$.

- Observe the data $x$.
- Obtain the posterior distribution $\mu \mid x \sim N(\mu_{\text{post}}, \sigma_{\text{post}}^2)$.
- Decide which value to use as the estimate of $\mu$ (for example, a point estimate).
- If the loss function is the squared loss $L(\mu, a) = (\mu - a)^2$, the decision that minimizes the posterior expected loss is the posterior mean:
$$ \hat{\mu} = {E}[\mu \mid x] $$
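This example can be checked numerically. The sketch below assumes that $\sigma^2$ is known and that a single observation $x$ is available, and uses the standard conjugate formulas $\sigma_{\text{post}}^2 = (1/\sigma_0^2 + 1/\sigma^2)^{-1}$ and $\mu_{\text{post}} = \sigma_{\text{post}}^2\,(\mu_0/\sigma_0^2 + x/\sigma^2)$. The numerical values and function names are hypothetical. For a normal posterior the posterior expected squared loss of an action $a$ is $\sigma_{\text{post}}^2 + (\mu_{\text{post}} - a)^2$, so a grid search over $a$ should land on the posterior mean.

```python
def normal_posterior(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_post, sigma_post2) of mu after one observation x ~ N(mu, sigma2)."""
    precision = 1.0 / sigma0_2 + 1.0 / sigma2
    sigma_post2 = 1.0 / precision
    mu_post = sigma_post2 * (mu0 / sigma0_2 + x / sigma2)
    return mu_post, sigma_post2


def posterior_expected_sq_loss(a, mu_post, sigma_post2):
    """E[(mu - a)^2 | x] = sigma_post2 + (mu_post - a)^2 for a normal posterior."""
    return sigma_post2 + (mu_post - a) ** 2


# Hypothetical numbers: prior N(0, 1), known variance 1, observation x = 2.
mu_post, sigma_post2 = normal_posterior(x=2.0, sigma2=1.0, mu0=0.0, sigma0_2=1.0)

# Grid search over candidate actions a in [-1, 3] with step 0.01.
candidates = [i / 100 for i in range(-100, 301)]
best = min(candidates, key=lambda a: posterior_expected_sq_loss(a, mu_post, sigma_post2))

print(mu_post, sigma_post2, best)
```

The grid minimizer coincides with `mu_post`, and the minimum posterior risk equals the posterior variance `sigma_post2`.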

"A statistical decision problem involves a space of distributions... a decision space $D$... and a loss function $W$. A decision function $\delta(x)$ is a function defined over the sample space, taking values in the decision space."
"A statistical decision problem is characterized by a family of distributions... a decision space $\mathcal{A}$, and a loss function $L$. A decision rule (or decision function) is a function $\delta(x)$ mapping the sample space $\mathcal{X}$ into the decision space $\mathcal{A}$."

Suppose the three components of a statistical decision problem are given: a sample $\boldsymbol{X}$ taking values in the sample space $\mathcal{X}$ with a family of distributions $\{F_\theta(\boldsymbol x) : \theta \in \Theta\}$, a decision space $\mathcal{A}$, and a loss function $L(\theta, a)$.

The problem is to find a rule which, for every observed sample point $\boldsymbol{x} = (x_1, x_2, \cdots, x_n)^\top \in \mathcal{X}$, determines a specific decision $a \in \mathcal{A}$. Such a rule is a function defined on the sample space $\mathcal{X}$ and taking values in the decision space $\mathcal{A}$ (i.e., a mapping from $\mathcal{X}$ to $\mathcal{A}$). This function is called a statistical decision function, or simply a decision function, denoted by
$$ f(\boldsymbol{x}) = f(x_1, x_2, \cdots, x_n). $$