Frequentism vs. Bayesianism: Two Views of Probability and Their Applications in Machine Learning
Introduction
Probability is a core concept in modern statistics and machine learning, yet it is understood in different ways by different schools of thought. Frequentism and Bayesianism are the two major schools in the field of probability; grounded in frequencies and in beliefs respectively, they differ markedly in how probability is defined, computed, and applied.
Using practical examples from machine learning, this article introduces the basic ideas of the frequentist and Bayesian schools, examines how each is used to estimate model parameters, and focuses on how Bayesian methods solve practical problems by incorporating prior information.
The Frequentist School: Probability as the Limit of Frequencies
Frequentists regard probability as the limiting value of the frequency of a random event over a large number of repeated experiments. At the core of this view are random experiments and the law of large numbers.
Key Ideas
- Random experiment: suppose an experiment has a sample space $\Omega$ containing all possible outcomes. An event $A$ is a subset of the sample space.
- Definition of probability: by running many experiments, we can compute the frequency with which event $A$ occurs:
$$P(A) = \lim_{n \to \infty} \frac{\text{number of occurrences of } A}{n}$$
where $n$ is the number of trials.
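As an illustration of this limiting-frequency view, here is a minimal simulation sketch (assuming NumPy is available; the fair-coin probability of 0.5 and the trial counts are illustrative assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Event A: a fair coin lands heads (true probability 0.5 is an illustrative assumption)
p_true = 0.5

for n in [10, 100, 10_000, 1_000_000]:
    flips = rng.random(n) < p_true   # n independent trials of the experiment
    freq = flips.mean()              # relative frequency of event A among the n trials
    print(f"n = {n:>9}: empirical frequency = {freq:.4f}")
```

As $n$ grows, the empirical frequency settles near the underlying probability, which is exactly the quantity the frequentist definition identifies with $P(A)$.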
Applications in Machine Learning
The most common frequentist tool in machine learning is maximum likelihood estimation (MLE). MLE assumes the model parameters are fixed values and estimates them by maximizing the probability of the observed data.
Example: MLE in Linear Regression
In a linear regression model, assume the input $x$ and output $y$ satisfy the relationship
$$y = \beta x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
where $\beta$ is the regression coefficient and $\epsilon$ is normally distributed noise.
- Objective function: given a dataset $\{(x_i, y_i)\}_{i=1}^n$, the goal is to maximize the likelihood of the data:
$$L(\beta) = \prod_{i=1}^n p(y_i \mid x_i, \beta)$$
where $p(y_i \mid x_i, \beta)$ is the probability of the output $y_i$ given the parameter $\beta$.
- Log-likelihood: to simplify the computation, take the logarithm:
$$\ell(\beta) = \sum_{i=1}^n \ln p(y_i \mid x_i, \beta)$$
Differentiating $\ell(\beta)$ and setting the derivative to zero yields the estimate of $\beta$ (a code sketch follows below).
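As referenced above, here is a minimal sketch of this MLE fit (assuming NumPy; the synthetic data and the true coefficient are illustrative assumptions). With Gaussian noise, maximizing $\ell(\beta)$ is equivalent to least squares, and for this no-intercept model the estimate has the closed form $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from y = beta * x + noise (beta_true = 2.0 is an illustrative choice)
beta_true, sigma = 2.0, 1.0
x = rng.uniform(-3, 3, size=100)
y = beta_true * x + rng.normal(0.0, sigma, size=100)

# Setting the derivative of the Gaussian log-likelihood to zero gives the closed form
beta_mle = np.sum(x * y) / np.sum(x ** 2)
print(f"MLE estimate of beta: {beta_mle:.3f}")
```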
Frequentist methods are computationally simple and easy to interpret, but because they treat the parameter as a fixed value they cannot directly quantify parameter uncertainty, nor can they incorporate additional prior information.
The Bayesian School: Probability as a Degree of Belief
Bayesians define probability as a degree of belief in the occurrence of an event and emphasize updating that belief with prior knowledge. The core idea stems from Bayes' theorem:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
Key Ideas
- Prior distribution $P(\theta)$: the belief about the parameter $\theta$ before observing the data $D$.
- Likelihood function $P(D \mid \theta)$: how likely the data $D$ is when the parameter is $\theta$.
- Posterior distribution $P(\theta \mid D)$: the updated distribution over the parameter obtained by combining the prior and the likelihood.
At its core, the Bayesian approach combines data with prior knowledge to dynamically update what we know about the parameters.
Applications in Machine Learning
Bayesian methods are widely used in machine learning for parameter estimation, model selection, and uncertainty quantification. Bayesian linear regression, below, illustrates the approach concretely.
Example: Bayesian Linear Regression
Assume the linear regression model
$$y = \beta x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
- Prior distribution: the prior knowledge about the regression coefficient $\beta$ is assumed to be a normal distribution:
$$\beta \sim \mathcal{N}(\mu_0, \sigma_0^2)$$
- Likelihood function: given the data $D = \{(x_i, y_i)\}_{i=1}^n$, the likelihood of the data is
$$P(D \mid \beta) = \prod_{i=1}^n \mathcal{N}(y_i \mid \beta x_i, \sigma^2)$$
- Posterior distribution: by Bayes' theorem, the posterior is
$$P(\beta \mid D) \propto P(D \mid \beta)\, P(\beta)$$
With a normal prior and a normal likelihood, the posterior is also normal:
$$\beta \mid D \sim \mathcal{N}(\mu_n, \sigma_n^2)$$
where
$$\sigma_n^2 = \left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \sum_{i=1}^n x_i^2\right)^{-1}, \quad \mu_n = \sigma_n^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2} \sum_{i=1}^n x_i y_i\right)$$
- Predictive distribution: for a new input $x^*$, the predictive distribution is
$$P(y^* \mid x^*, D) = \int P(y^* \mid x^*, \beta)\, P(\beta \mid D)\, d\beta$$
which works out to (see the code sketch below)
$$y^* \mid x^*, D \sim \mathcal{N}(\mu_n x^*, \; \sigma^2 + \sigma_n^2 x^{*2})$$
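The sketch below (assuming NumPy; the synthetic data, prior, and test input are illustrative assumptions) implements these conjugate-update and predictive formulas directly:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data from y = beta * x + noise (all concrete numbers are illustrative)
beta_true, sigma = 2.0, 1.0
x = rng.uniform(-3, 3, size=20)
y = beta_true * x + rng.normal(0.0, sigma, size=20)

# Prior over beta: N(mu0, sigma0^2)
mu0, sigma0_sq = 0.0, 1.0

# Posterior parameters from the conjugate-update formulas above
sigma_n_sq = 1.0 / (1.0 / sigma0_sq + np.sum(x ** 2) / sigma ** 2)
mu_n = sigma_n_sq * (mu0 / sigma0_sq + np.sum(x * y) / sigma ** 2)

# Predictive distribution for a new input x*
x_star = 1.5
pred_mean = mu_n * x_star
pred_var = sigma ** 2 + sigma_n_sq * x_star ** 2

print(f"posterior:  beta | D ~ N({mu_n:.3f}, {sigma_n_sq:.4f})")
print(f"predictive: y* | x*={x_star} ~ N({pred_mean:.3f}, {pred_var:.3f})")
```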
Bayesian methods thus estimate the parameters, quantify their uncertainty, and incorporate prior information into the model in a natural way.
Advantages of the Bayesian School
- Incorporating prior information: the prior distribution brings in external knowledge, which is especially valuable when data are limited.
- Uncertainty quantification: the posterior distribution directly reflects parameter uncertainty, something frequentist point estimates do not provide.
- Flexible modeling: complex models and high-dimensional data can be handled naturally.
- Suitability for online learning: beliefs are adjusted dynamically by treating each posterior as the prior for the next update (see the sketch below).
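As referenced in the last point, here is a minimal sketch of online Bayesian updating (assuming NumPy; the data-generating values are illustrative assumptions): each observation is absorbed with a one-step conjugate update, and the resulting posterior serves as the prior for the next observation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Online (sequential) Bayesian updating for y = beta * x + noise.
# All concrete numbers below are illustrative assumptions.
beta_true, sigma = 2.0, 1.0
mu, var = 0.0, 1.0                       # initial prior: beta ~ N(0, 1)

for t in range(1, 6):
    x_t = rng.uniform(-3, 3)
    y_t = beta_true * x_t + rng.normal(0.0, sigma)

    # One-step conjugate update (the formulas above with n = 1);
    # the updated posterior becomes the prior for the next observation
    var_new = 1.0 / (1.0 / var + x_t ** 2 / sigma ** 2)
    mu = var_new * (mu / var + x_t * y_t / sigma ** 2)
    var = var_new

    print(f"after observation {t}: beta ~ N({mu:.3f}, {var:.4f})")
```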
Summary
The frequentist and Bayesian schools define the essence of probability from different angles, and their uses in machine learning have distinct strengths. Frequentist methods are simple and direct, and well suited to large samples; Bayesian methods stand out for incorporating prior information and quantifying uncertainty, making them especially attractive when data are scarce or uncertainty is high.
In modern machine learning, Bayesian methods are receiving growing attention: from Bayesian neural networks (BNNs) to variational inference (VI), the Bayesian school is profoundly shaping the development and application of probabilistic models.
Frequentist vs. Bayesian Approaches: Understanding Probability through Machine Learning Examples
Introduction
Probability is a foundational concept in both statistics and machine learning, yet it is interpreted differently by two major schools of thought: Frequentism and Bayesianism. The Frequentist approach defines probability as the long-run frequency of events in repeated experiments, whereas the Bayesian approach interprets probability as a degree of belief based on prior knowledge and evidence.
In this blog, we'll explore these two perspectives, focusing on the Bayesian approach and its powerful applications in machine learning. We will also look at how the Bayesian methodology integrates prior knowledge to better model uncertainty and guide decision-making.
The Frequentist Approach: Probability as the Limit of Frequencies
In the Frequentist interpretation, probability is defined as the long-run frequency with which an event occurs when an experiment is repeated infinitely. This means probability is objective and is rooted in observable data through repeated experiments.
Key Concepts
- Random Experiment: Consider a random experiment with a sample space $\Omega$ that contains all possible outcomes. An event $A$ is a subset of the sample space.
- Probability: For any given event $A$, its probability is the limiting value of the frequency of $A$ occurring in repeated trials. Mathematically, this is expressed as:
$$P(A) = \lim_{n \to \infty} \frac{\text{Number of occurrences of } A}{n}$$
Here, the probability is based purely on observed frequencies as the number of trials becomes large.
Example in Machine Learning
In a typical Frequentist setting, if we want to estimate the probability that a given model classifies a data point correctly, we might do the following:
- Training and Testing: Split the data into training and test sets.
- Repetitive Experiments: Train the model multiple times on different data subsets and evaluate it on others.
- Probability Estimate: The probability of correct classification for a given model is the relative frequency of correct classifications across these repeated experiments.
For example, if we test a classifier on a dataset of 1000 data points and it correctly classifies 850 of them, then the probability of correct classification is approximately:
$$P(\text{correct}) = \frac{850}{1000} = 0.85$$
This is a frequentist probability: it is the observed frequency of correct classifications in repeated tests.
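A small sketch of this repeated-evaluation procedure, assuming scikit-learn is available; the synthetic dataset and the logistic-regression classifier are illustrative choices, not part of the original example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative synthetic dataset and classifier (assumptions, not from the text)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Repeated train/test evaluation; each fold's accuracy is a relative frequency
# of correct classifications on held-out data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print("per-fold accuracy:", np.round(scores, 3))
print("frequentist estimate of P(correct):", scores.mean())
```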
The Bayesian Approach: Probability as a Degree of Belief
The Bayesian interpretation, by contrast, defines probability as the degree of belief or confidence in an event, given the available evidence. Bayesian probability allows for subjectivity, as it incorporates prior knowledge or beliefs before new data is observed.
Key Concepts
- Prior: A Bayesian starts with a prior belief about the probability of an event, represented as $P(\theta)$, where $\theta$ represents the parameters of interest.
- Likelihood: When new data $D$ is observed, the likelihood function $P(D \mid \theta)$ expresses how likely the data is, given different parameter values.
- Posterior: Using Bayes' Theorem, the prior belief is updated with the new data to produce the posterior probability $P(\theta \mid D)$, which represents the updated belief after observing the data.
Mathematically, Bayes' Theorem is written as:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
Where:
- $P(\theta \mid D)$ is the posterior probability (updated belief).
- $P(D \mid \theta)$ is the likelihood (how likely the data is given the parameters).
- $P(\theta)$ is the prior (initial belief).
- $P(D)$ is the marginal likelihood or evidence, which normalizes the posterior so that it integrates to 1.
Example in Machine Learning
Consider a classification problem where we want to predict whether an email is spam or not. A Bayesian classifier could be used, where:
- The prior belief might be based on the frequency of spam emails in the past (e.g., 20% of emails are spam).
- The likelihood would capture how likely certain words (like "free" or "winner") are in spam versus non-spam emails.
- The posterior would combine these prior beliefs with the observed evidence (i.e., the actual words in the current email) to predict whether the email is spam or not.
For a Bayesian classifier, the posterior probability would give us the probability of the email being spam, considering both our prior knowledge about the frequency of spam emails and the current data (the words in the email).
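A toy sketch of that posterior calculation under a naive-Bayes independence assumption; the 20% prior matches the figure above, but the word likelihoods and the example email are illustrative assumptions:

```python
# A toy Bayesian spam update (the word likelihoods below are illustrative
# assumptions, not estimates from real data)
prior_spam = 0.20                      # P(spam): prior belief from past email frequencies
prior_ham = 1.0 - prior_spam           # P(not spam)

# P(word | class) for a few words, assuming conditional independence (naive Bayes)
likelihood_spam = {"free": 0.30, "winner": 0.20, "meeting": 0.02}
likelihood_ham  = {"free": 0.05, "winner": 0.01, "meeting": 0.15}

email_words = ["free", "winner"]       # words observed in the current email

# Unnormalized posteriors: prior times the product of word likelihoods
post_spam = prior_spam
post_ham = prior_ham
for w in email_words:
    post_spam *= likelihood_spam[w]
    post_ham *= likelihood_ham[w]

# Normalize by the evidence P(D) so the two posteriors sum to 1
evidence = post_spam + post_ham
print("P(spam | words)     =", post_spam / evidence)
print("P(not spam | words) =", post_ham / evidence)
```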
Why Bayesianism in Machine Learning?
While the Frequentist approach focuses on observing data through repeated experiments, the Bayesian approach is especially powerful because it allows us to incorporate prior knowledge and beliefs. This makes it highly useful in situations where we don't have enough data to make reliable frequentist estimates, or where prior expertise can guide model predictions.
Bayesian in Practice: Parameter Estimation
In machine learning, Bayesian methods can be used to estimate model parameters, especially when there is limited data. For example, consider a simple linear regression model:
$$y = \theta_0 + \theta_1 x + \epsilon$$
In a Bayesian framework, we would:
- Specify a prior for the parameters $\theta_0$ and $\theta_1$, such as assuming they follow a normal distribution with mean 0 and variance $\sigma^2$.
- Calculate the likelihood of observing the data given these parameters.
- Use Bayes' Theorem to update our beliefs about the parameters after observing the data and compute the posterior distribution.
The advantage here is that instead of estimating fixed values for $\theta_0$ and $\theta_1$, Bayesian inference gives us a distribution over possible values for each parameter, reflecting the uncertainty in our estimates.
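A minimal sketch of this workflow, assuming NumPy, a known noise standard deviation, and a zero-mean Gaussian prior on both parameters; all concrete numbers are illustrative assumptions. With a Gaussian prior and Gaussian noise, the posterior over $(\theta_0, \theta_1)$ is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from y = theta0 + theta1 * x + noise (illustrative values)
theta_true = np.array([1.0, 2.0])
sigma = 0.5
x = rng.uniform(-2, 2, size=30)
X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
y = X @ theta_true + rng.normal(0.0, sigma, size=30)

# Prior: theta ~ N(0, tau^2 * I)
tau_sq = 1.0

# Conjugate Gaussian posterior (known noise variance sigma^2):
#   covariance S = (X^T X / sigma^2 + I / tau^2)^{-1}
#   mean       m = S X^T y / sigma^2
S = np.linalg.inv(X.T @ X / sigma ** 2 + np.eye(2) / tau_sq)
m = S @ (X.T @ y) / sigma ** 2

print("posterior mean of (theta0, theta1):", np.round(m, 3))
print("posterior std  of (theta0, theta1):", np.round(np.sqrt(np.diag(S)), 3))
```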
Conclusion: Why Choose Bayesianism?
The Bayesian approach provides a flexible and systematic way to incorporate prior knowledge and beliefs, which can be incredibly useful when data is scarce or uncertain. Unlike the Frequentist approach, which relies on large amounts of data for frequent estimates, Bayesianism enables us to make predictions and decisions with limited information by constantly updating our beliefs as new data becomes available.
In modern machine learning, where dealing with uncertainty and prior knowledge is critical (especially in tasks like classification, regression, and time series forecasting), Bayesian methods are a powerful tool. The ability to integrate prior knowledge, model uncertainty, and update predictions systematically makes Bayesian analysis particularly appealing for many real-world applications.
As machine learning continues to advance, Bayesian methods will undoubtedly play a crucial role in shaping how models learn from data, make decisions, and adapt to new information.
Postscript
Completed in Shanghai at 13:14 on December 2, 2024, with the assistance of the GPT-4o large language model.