Sufficient Statistic (充分统计量): Concept and Applications, in Chinese and English

Sufficient Statistic: Concept and Applications

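In statistics, the sufficient statistic (Sufficient Statistic) is a core concept. It is a function computed from the sample that captures, without loss, all the information in the sample about the parameter of the distribution. In parameter estimation, a sufficient statistic lets us extract exactly the statistical information that is needed, which makes inference more efficient.

Starting from the definition of a sufficient statistic and using exponential family distributions as examples, this article examines the concept and its importance in statistical inference.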

1. Definition of a Sufficient Statistic

Let $X = \{x_1, x_2, \dots, x_n\}$ be a sample from the distribution $p(x|\theta)$, where $\theta$ is the parameter of the distribution. A statistic $T(X)$ is called a sufficient statistic for $\theta$ if it satisfies the factorization theorem (Factorization Theorem), that is, if the joint density (or mass function) of the sample can be written as:

$$p(X|\theta) = h(X)\, g(T(X), \theta),$$

where:

$T(X)$ is a function of the sample, i.e. a statistic;
$h(X)$ is a function that does not depend on $\theta$;
$g(T(X), \theta)$ is a function of $\theta$ that depends on the sample only through $T(X)$.

Intuition: the sufficient statistic $T(X)$ extracts all of the information in the sample about $\theta$, while $h(X)$ captures whatever else in the sample is unrelated to $\theta$.
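The factorization can be checked numerically on a small case. The sketch below assumes i.i.d. Bernoulli($\theta$) observations, an example chosen here for illustration rather than one taken from the article. In that model the likelihood is $\theta^{T}(1-\theta)^{n-T}$ with $T = \sum_i x_i$ and $h(X) = 1$, so two samples with the same sum should have the same likelihood for every $\theta$. All function and variable names are hypothetical.

```python
import math


def bernoulli_likelihood(xs, theta):
    """Likelihood of an i.i.d. Bernoulli(theta) sample, computed from the raw data."""
    lik = 1.0
    for x in xs:
        lik *= theta if x == 1 else 1.0 - theta
    return lik


# Two different samples that share the same sufficient statistic T = sum = 3
sample_a = [1, 1, 0, 0, 1, 0]
sample_b = [0, 1, 0, 1, 0, 1]
# A sample of the same size with a different sum (T = 4)
sample_c = [1, 1, 1, 1, 0, 0]

thetas = [0.1, 0.3, 0.5, 0.9]
for theta in thetas:
    la = bernoulli_likelihood(sample_a, theta)
    lb = bernoulli_likelihood(sample_b, theta)
    # Equal up to floating-point rounding, because both depend on the data only through T
    print(theta, math.isclose(la, lb))
```

The order of the observations differs between `sample_a` and `sample_b`, yet nothing about $\theta$ can be learned from that order, which is what sufficiency of the sum expresses.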

2. Why Sufficient Statistics Matter

Once the sufficient statistic $T(X)$ has been computed, the remaining information in the original sample $X$ is redundant for estimating $\theta$. In other words, inference based on $T(X)$ is equivalent to inference based on the whole sample $X$.

For example, for a sample from the normal distribution $\mathcal{N}(\mu, \sigma^2)$:

The sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is a sufficient statistic for $\mu$ when $\sigma^2$ is known;
The sample mean together with the sample variance $s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$, that is, the pair $(\bar{x}, s^2)$, is a sufficient statistic for $(\mu, \sigma^2)$ when both are unknown.
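To make the normal example concrete, the sketch below evaluates the log-likelihood twice: once from the full sample and once from $(n, \bar{x}, s^2)$ alone, using the identity $\sum_i (x_i - \mu)^2 = n s^2 + n(\bar{x} - \mu)^2$. The simulated data, the seed, and the function names are assumptions made for illustration.

```python
import math
import random


def loglik_full(xs, mu, sigma2):
    """Normal log-likelihood computed from every observation."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
        for x in xs
    )


def normal_summary(xs):
    """Return (n, sample mean, sample variance with 1/n normalization)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    return n, xbar, s2


def loglik_from_stats(n, xbar, s2, mu, sigma2):
    """Same log-likelihood, using only the sufficient statistic (xbar, s2) and n."""
    return -0.5 * n * math.log(2 * math.pi * sigma2) - n * (s2 + (xbar - mu) ** 2) / (2 * sigma2)


rng = random.Random(0)
xs = [rng.gauss(1.5, 2.0) for _ in range(1000)]  # simulated data, for illustration only
n, xbar, s2 = normal_summary(xs)

for mu, sigma2 in [(0.0, 1.0), (1.5, 4.0), (3.0, 0.5)]:
    full = loglik_full(xs, mu, sigma2)
    compact = loglik_from_stats(n, xbar, s2, mu, sigma2)
    print(mu, sigma2, math.isclose(full, compact, rel_tol=1e-9, abs_tol=1e-6))
```

After `normal_summary` has run, the 1000 raw values are no longer needed to evaluate the likelihood at any $(\mu, \sigma^2)$.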

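3. Exponential Family Distributions and Sufficient Statistics

The exponential family is an important class of distributions in statistics. Readers with questions about the form of its density are referred to the author's other article, "指数族分布（Exponential Family of Distributions）的两种形式及其区别" (The Two Forms of the Exponential Family of Distributions and Their Differences). The probability density function (or mass function) of every member of the family can be written in the common form:

$$p(x|\theta) = h(x) \exp\left(\eta(\theta)^T t(x) - A(\theta)\right),$$

where:

$\eta(\theta)$ is the natural parameter, a function of $\theta$;
$t(x)$ is the sufficient statistic of the observation $x$;
$A(\theta)$ is the log-normalizer (log-partition function), which ensures that the distribution integrates (or sums) to 1;
$h(x)$ is the base measure, which does not depend on the parameter.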

3.1 Common Examples of Exponential Family Distributions

Normal distribution (mean and variance both unknown)

Probability density function:
$$p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Written in exponential family form:
$$p(x|\mu, \sigma^2) = \exp\left(-\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \ln(2\pi\sigma^2)\right).$$

The sufficient statistic is:
$$t(x) = (x, x^2).$$

Poisson distribution

Probability mass function:
$$p(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$

Written in exponential family form:
$$p(x|\lambda) = \exp\left(x \ln \lambda - \lambda - \ln x!\right).$$

The sufficient statistic is:
$$t(x) = x.$$

Binomial distribution (number of trials $n$ known)

Probability mass function:
$$p(x|n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \dots, n.$$

Written in exponential family form:
$$p(x|n, p) = \exp\left(x \ln \frac{p}{1-p} + n \ln (1-p) + \ln \binom{n}{x}\right).$$

The sufficient statistic is:
$$t(x) = x.$$
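The examples above give $t(x)$ for a single observation, while the definition in Section 1 concerns a statistic $T(X)$ of the whole sample. The step connecting the two can be sketched as follows, under the assumption (not stated explicitly in the text) that the $N$ observations $x_1, \dots, x_N$ are independent and identically distributed. The symbol $N$ is used here for the sample size to avoid a clash with the binomial number of trials $n$.

```latex
\begin{aligned}
p(X|\theta)
  &= \prod_{i=1}^{N} h(x_i)\exp\left(\eta(\theta)^T t(x_i) - A(\theta)\right) \\
  &= \underbrace{\left[\prod_{i=1}^{N} h(x_i)\right]}_{h(X)}
     \;\underbrace{\exp\left(\eta(\theta)^T \sum_{i=1}^{N} t(x_i) - N A(\theta)\right)}_{g(T(X),\,\theta)},
  \qquad T(X) = \sum_{i=1}^{N} t(x_i).
\end{aligned}
```

This matches the factorization $p(X|\theta) = h(X)\, g(T(X), \theta)$, so for a Poisson sample $T(X) = \sum_i x_i$, and for a normal sample with both parameters unknown $T(X) = \left(\sum_i x_i, \sum_i x_i^2\right)$.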

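4. Applications

4.1 Parameter Estimation

Sufficient statistics greatly simplify parameter estimation. For example, in maximum likelihood estimation (MLE), the likelihood depends on the data only through $T(X)$, apart from a factor that does not involve $\theta$, so it can be built and maximized from $T(X)$ directly without handling the whole sample.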

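4.2 Data Compression

A sufficient statistic compresses the high-dimensional sample $X$ into a statistic $T(X)$ that is often low-dimensional, while still retaining all of the information about $\theta$. This is especially important in large-scale data analysis.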
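One way this compression might look in practice is sketched below, assuming a normal model whose sufficient statistic for a sample is $\left(\sum_i x_i, \sum_i x_i^2\right)$ together with the count. Each data chunk is reduced to three numbers, and summaries from separate chunks are merged by addition. The class `NormalStats` and the helper `summarize_chunk` are hypothetical names, and the sum-of-squares variance formula is kept for clarity even though it can lose precision on data with a large mean.

```python
from dataclasses import dataclass


@dataclass
class NormalStats:
    """Running sufficient statistics for a normal model: count, sum, sum of squares."""
    n: int = 0
    sum_x: float = 0.0
    sum_x2: float = 0.0

    def update(self, x: float) -> None:
        # Fold one observation into the summary; the raw value can then be discarded
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def merge(self, other: "NormalStats") -> "NormalStats":
        # Summaries of disjoint chunks combine by simple addition
        return NormalStats(self.n + other.n, self.sum_x + other.sum_x, self.sum_x2 + other.sum_x2)

    def mean(self) -> float:
        return self.sum_x / self.n

    def variance(self) -> float:
        # 1/n-normalized variance recovered from the summary alone
        m = self.mean()
        return self.sum_x2 / self.n - m * m


def summarize_chunk(chunk) -> NormalStats:
    stats = NormalStats()
    for x in chunk:
        stats.update(x)
    return stats


chunk_a = [2.0, 4.0, 4.0, 4.0]
chunk_b = [5.0, 5.0, 7.0, 9.0]
merged = summarize_chunk(chunk_a).merge(summarize_chunk(chunk_b))
print(merged.n, merged.mean(), merged.variance())
```

Because merging is just addition, the chunks could be processed on different machines or at different times without changing the result.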

4.3 Bayesian Inference

In the Bayesian framework, sufficient statistics simplify the computation of the posterior distribution, because $p(\theta|X) \propto p(T(X)|\theta)\,p(\theta)$.

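5. Summary

Sufficient statistics are a key tool in statistical inference: they extract the information in a sample about the parameters of the distribution efficiently. The exponential family form lets us identify sufficient statistics directly and see how they appear in different distributions. Their wide use in parameter estimation, data compression, and Bayesian inference further underlines their importance.

Readers can start from common exponential family distributions such as the normal and Poisson distributions and try deriving their sufficient statistics to deepen their understanding of the concept.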

Sufficient Statistic: Concept and Applications

In statistics, the concept of a sufficient statistic plays a fundamental role. A sufficient statistic is a function of a dataset that captures all the information about a parameter of interest contained within the data. By leveraging sufficient statistics, we can efficiently perform parameter inference without processing the entire dataset.

This article introduces sufficient statistics, their mathematical definition, and their relevance in statistical inference. We will illustrate the concept with examples from exponential family distributions, along with detailed mathematical formulations.

1. Definition of Sufficient Statistic

Let $X = \{x_1, x_2, \dots, x_n\}$ be a sample drawn from a probability distribution $p(x|\theta)$, where $\theta$ is the parameter of interest. A statistic $T(X)$ is called a sufficient statistic for $\theta$ if it satisfies the factorization theorem:

$$p(X|\theta) = h(X) \, g(T(X), \theta),$$

where:

$T(X)$ is the statistic (a function of the data);
$h(X)$ is a function independent of $\theta$;
$g(T(X), \theta)$ depends only on $T(X)$ and $\theta$.

Intuition

A sufficient statistic $T(X)$ extracts all the information about $\theta$ from the dataset $X$. Once $T(X)$ is computed, the original dataset $X$ provides no additional value for parameter estimation.

2. Importance of Sufficient Statistics

Efficient Parameter Estimation

Once the sufficient statistic $T(X)$ is computed, we can perform inference on $\theta$ without using the entire dataset. This simplifies calculations, especially for large datasets.

Data Compression

A sufficient statistic reduces the dimensionality of the data while retaining all relevant information about $\theta$. For example, instead of using a large dataset, we only need $T(X)$, which is often a low-dimensional vector.

Bayesian Inference

In Bayesian statistics, the posterior distribution $p(\theta|X)$ depends only on $T(X)$. This simplifies the computation of posterior distributions.

3. Exponential Family and Sufficient Statistics

The exponential family of distributions provides a convenient framework for identifying sufficient statistics. A probability distribution belongs to the exponential family if it can be expressed as:

$$p(x|\theta) = h(x) \exp\left(\eta(\theta)^T t(x) - A(\theta)\right),$$

where:

$\eta(\theta)$ is the natural parameter;
$t(x)$ is the sufficient statistic;
$A(\theta)$ is the log-partition function, ensuring normalization;
$h(x)$ is a base measure independent of $\theta$.

3.1 Examples of Exponential Family Distributions

Normal Distribution ($\mu$ and $\sigma^2$ both unknown)

Probability density function:
$$p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Rewritten in exponential family form:
$$p(x|\mu, \sigma^2) = \exp\left(-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right).$$

The sufficient statistic is:
$$t(x) = (x, x^2).$$

Poisson Distribution

Probability mass function:
$$p(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$

Rewritten in exponential family form:
$$p(x|\lambda) = \exp\left(x \ln \lambda - \lambda - \ln x!\right).$$

The sufficient statistic is:
$$t(x) = x.$$

Binomial Distribution

Probability mass function:
$$p(x|n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \dots, n.$$

Rewritten in exponential family form:
$$p(x|n, p) = \exp\left(x \ln \frac{p}{1-p} + n \ln (1-p) + \ln \binom{n}{x}\right).$$

The sufficient statistic is:
$$t(x) = x.$$
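Matching each rewritten density against the general form $h(x)\exp\left(\eta(\theta)^T t(x) - A(\theta)\right)$ gives the components summarized below. This is one reading of the formulas in this section, with $h(x)$ pulled out of the exponent, the normal case taken with both $\mu$ and $\sigma^2$ as parameters, and the binomial $n$ treated as fixed; other equivalent parameterizations exist.

```latex
\begin{array}{l|l|l|l|l}
\text{Distribution} & \eta(\theta) & t(x) & A(\theta) & h(x) \\ \hline
\mathcal{N}(\mu, \sigma^2)
  & \left(\dfrac{\mu}{\sigma^2},\; -\dfrac{1}{2\sigma^2}\right)
  & (x,\; x^2)
  & \dfrac{\mu^2}{2\sigma^2} + \dfrac{1}{2}\ln(2\pi\sigma^2)
  & 1 \\[2ex]
\text{Poisson}(\lambda)
  & \ln \lambda
  & x
  & \lambda
  & \dfrac{1}{x!} \\[2ex]
\text{Binomial}(n, p),\ n \text{ fixed}
  & \ln \dfrac{p}{1-p}
  & x
  & -n \ln(1-p)
  & \dbinom{n}{x}
\end{array}
```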

4. Applications of Sufficient Statistics

4.1 Maximum Likelihood Estimation (MLE)

The likelihood function for parameter $\theta$ can be written in terms of the sufficient statistic $T(X)$. This simplifies the optimization process in MLE, reducing computational complexity.

For example, for the Poisson distribution, the MLE for $\lambda$ is:
$$\hat{\lambda} = \frac{\sum_{i=1}^n x_i}{n},$$

where $T(X) = \sum_{i=1}^n x_i$.
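The Poisson case can be sketched in a few lines. The part of the log-likelihood that involves $\lambda$ is $T(X)\ln\lambda - n\lambda$, so the closed-form estimate $T(X)/n$ can be compared against a brute-force search that sees only $T(X)$ and $n$. The sample values and the grid are made up for illustration, and the function name is hypothetical.

```python
import math


def poisson_loglik_kernel(total, n, lam):
    """Part of the Poisson log-likelihood that depends on lam.

    The dropped term, -sum(ln x_i!), does not involve lam and so does not
    affect where the maximum lies.
    """
    return total * math.log(lam) - n * lam


counts = [3, 0, 2, 4, 1, 2]   # illustrative data
n = len(counts)
total = sum(counts)           # sufficient statistic T(X)

lam_hat = total / n           # closed-form MLE

# Brute-force check on a grid, using only (total, n) and never the raw counts
grid = [0.5 + 0.01 * k for k in range(400)]
best = max(grid, key=lambda lam: poisson_loglik_kernel(total, n, lam))

print(lam_hat, best)
```

Any other sample of the same size with the same total would give exactly the same estimate.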

4.2 Bayesian Inference

In Bayesian inference, the posterior distribution depends only on $T(X)$:
$$p(\theta|X) \propto p(T(X)|\theta)\,p(\theta).$$

This makes the computation of posterior distributions more tractable, especially in conjugate prior settings.
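A conjugate setting makes this especially visible. The sketch below assumes a Poisson likelihood with a Gamma($\alpha$, $\beta$) prior on $\lambda$ in the shape-rate parameterization, a standard conjugate pair that is not discussed in the article itself. The posterior is then Gamma($\alpha + T(X)$, $\beta + n$), so the update needs only the sum and the count. The prior hyperparameters and data are made up, and the function name is hypothetical.

```python
def gamma_poisson_posterior(alpha, beta, total, n):
    """Gamma(shape, rate) posterior for a Poisson rate.

    `total` is the sufficient statistic T(X) = sum of counts and `n` is the
    number of observations; the raw counts themselves are not needed.
    """
    return alpha + total, beta + n


counts = [3, 0, 2, 4, 1, 2]    # illustrative data
alpha0, beta0 = 2.0, 1.0       # hypothetical prior hyperparameters

# One-shot update from the sufficient statistic of the whole sample
post = gamma_poisson_posterior(alpha0, beta0, sum(counts), len(counts))

# The same data processed in two batches gives the same posterior
a1, b1 = gamma_poisson_posterior(alpha0, beta0, sum(counts[:2]), 2)
post_batched = gamma_poisson_posterior(a1, b1, sum(counts[2:]), 4)

posterior_mean = post[0] / post[1]   # mean of Gamma(shape, rate) is shape / rate
print(post, post_batched, posterior_mean)
```

The batched update agrees with the one-shot update because the sufficient statistic of the combined data is the sum of the per-batch statistics.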

4.3 Data Summarization

Sufficient statistics compress data into a smaller, sufficient representation. For instance, in large-scale data applications, computing sufficient statistics instead of storing entire datasets saves storage and computational resources.

5. Summary

Sufficient statistics are a cornerstone of statistical inference, enabling efficient parameter estimation and data summarization. By focusing on the exponential family, we can better understand how sufficient statistics operate in various common distributions, such as the normal, Poisson, and binomial distributions.

Understanding and utilizing sufficient statistics not only simplifies complex statistical procedures but also offers practical advantages in data analysis, particularly in settings with large datasets or complex Bayesian models. Readers are encouraged to explore further by deriving sufficient statistics for different distributions and applying them to real-world problems.