Introduction to Machine Learning, Chapter 16: Bayesian Estimation

Table of Contents

Preface

[Set Up the Common Environment First (Mac-specific Chinese Font Display + Base Libraries)](#Set Up the Common Environment First (Mac-specific Chinese Font Display + Base Libraries))

[16.1 Introduction](#16.1 Introduction)

Core Concepts

Intuitive Analogy

Mind Map

[16.2 Bayesian Estimation of Discrete Distribution Parameters](#16.2 Bayesian Estimation of Discrete Distribution Parameters)

[16.2.1 K>2 States: The Dirichlet Distribution](#16.2.1 K>2 States: The Dirichlet Distribution)

Core Concepts

Plain-Language Explanation

[Full Code + Visualization (Prior vs Posterior)](#Full Code + Visualization (Prior vs Posterior))

Code Notes

Output

[16.2.2 K=2 States: The Beta Distribution](#16.2.2 K=2 States: The Beta Distribution)

Core Concepts

Plain-Language Explanation

[Full Code + Visualization (Different Priors vs Posteriors)](#Full Code + Visualization (Different Priors vs Posteriors))

Code Notes

Output

[16.3 Bayesian Estimation of Gaussian Distribution Parameters](#16.3 Bayesian Estimation of Gaussian Distribution Parameters)

[16.3.1 Univariate Case: Unknown Mean, Known Variance](#16.3.1 Univariate Case: Unknown Mean, Known Variance)

Core Concepts

Plain-Language Explanation

[Full Code + Visualization (Prior vs Posterior)](#Full Code + Visualization (Prior vs Posterior))

Code Notes

Output

[16.3.2 Univariate Case: Unknown Mean, Unknown Variance](#16.3.2 Univariate Case: Unknown Mean, Unknown Variance)

Core Concepts

Plain-Language Explanation

[Full Code + 3D Visualization (Posterior Distribution)](#Full Code + 3D Visualization (Posterior Distribution))

Code Notes

[16.3.3 Multivariate Case: Unknown Mean, Unknown Covariance](#16.3.3 Multivariate Case: Unknown Mean, Unknown Covariance)

Core Concepts

Plain-Language Explanation

[Full Code + Visualization (Covariance Matrix Heatmaps)](#Full Code + Visualization (Covariance Matrix Heatmaps))

Code Notes

[16.4 Bayesian Estimation of Function Parameters](#16.4 Bayesian Estimation of Function Parameters)

[16.4.1 Regression (Bayesian Linear Regression)](#16.4.1 Regression (Bayesian Linear Regression))

Core Concepts

Plain-Language Explanation

[Full Code + Visualization (Bayesian vs Ordinary Linear Regression)](#Full Code + Visualization (Bayesian vs Ordinary Linear Regression))

Code Notes

[16.4.2 Regression with a Prior on the Noise Precision](#16.4.2 Regression with a Prior on the Noise Precision)

Core Concepts

Full Code (Core Extension)

[16.4.3 Using Basis or Kernel Functions (Bayesian Kernel Regression)](#16.4.3 Using Basis or Kernel Functions (Bayesian Kernel Regression))

Core Concepts

[Full Code + Visualization (Non-linear Fitting)](#Full Code + Visualization (Non-linear Fitting))

[16.4.4 Bayesian Classification (Bayesian Logistic Regression)](#16.4.4 Bayesian Classification (Bayesian Logistic Regression))

Core Concepts

[Full Code + Visualization (Binary Classification)](#Full Code + Visualization (Binary Classification))

[16.5 Choosing a Prior](#16.5 Choosing a Prior)

Core Concepts

Visualization (Impact of Different Priors)

Mind Map

[16.6 Bayesian Model Comparison](#16.6 Bayesian Model Comparison)

Core Concepts

Plain-Language Explanation

Full Code (Comparing Two Regression Models)

Results

[16.7 Bayesian Estimation of Mixture Models](#16.7 Bayesian Estimation of Mixture Models)

Core Concepts

[Full Code (Bayesian GMM)](#Full Code (Bayesian GMM))

[16.8 Nonparametric Bayesian Modeling](#16.8 Nonparametric Bayesian Modeling)

Core Concepts

Plain-Language Explanation

[16.9 Gaussian Processes](#16.9 Gaussian Processes)

Core Concepts

Full Code (Gaussian Process Regression)

[16.10 The Dirichlet Process and the Chinese Restaurant](#16.10 The Dirichlet Process and the Chinese Restaurant)

Core Concepts

[16.11 Latent Dirichlet Allocation (LDA)](#16.11 Latent Dirichlet Allocation (LDA))

Core Concepts

[Full Code (LDA Topic Modeling)](#Full Code (LDA Topic Modeling))

[16.12 The Beta Process and the Indian Buffet](#16.12 The Beta Process and the Indian Buffet)

Core Concepts

[16.13 Notes](#16.13 Notes)

[16.14 Exercises](#16.14 Exercises)

[16.15 References](#16.15 References)

Summary


Preface

Bayesian estimation is one of the most "philosophical" parameter-estimation approaches in machine learning: instead of treating a parameter as a fixed, immovable "rock" the way the frequentist school does, it treats the parameter as a random variable with its own "temperament" (a probability distribution) and keeps revising its belief about the parameter by combining prior knowledge with observed data. This post breaks down the core content of Chapter 16 of Introduction to Machine Learning (Bayesian Estimation) in plain language; every core topic comes with runnable Python code and comparison plots to help you fully digest Bayesian estimation.

Set Up the Common Environment First (Mac-specific Chinese Font Display + Base Libraries)

All the code runs on the following environment; execute this base configuration first:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Global settings: plot style + figure size
# (apply the style sheet first; style.use() would otherwise overwrite the font settings below)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

# Matplotlib Chinese-font configuration for macOS
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['font.family'] = 'Arial Unicode MS'
plt.rcParams['axes.facecolor'] = 'white'

16.1 Introduction

Core Concepts

The essence of Bayesian estimation is "updating beliefs":

  • Prior distribution: your "initial guess" about the parameter (e.g., "I guess the coin's heads probability is 0.5");
  • Likelihood function: the data's "feedback" about the parameter (e.g., 8 heads in 10 flips revises that guess);
  • Posterior distribution: the "final belief" after combining the initial guess with the data; formula: posterior ∝ prior × likelihood
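To make the "posterior ∝ prior × likelihood" update concrete, here is a minimal numeric sketch (my own illustration, not from the book) that updates a discretized belief about a coin's heads probability after seeing 8 heads in 10 flips:

import numpy as np
from scipy.stats import binom

# Discretize the possible values of the heads probability theta
theta = np.linspace(0.01, 0.99, 99)

prior = np.ones_like(theta) / len(theta)   # initial guess: every theta equally plausible
likelihood = binom.pmf(8, 10, theta)       # data feedback: 8 heads in 10 flips
posterior = prior * likelihood             # posterior ∝ prior × likelihood
posterior /= posterior.sum()               # normalize so the probabilities sum to 1

print(f"Posterior mean of theta: {np.sum(theta * posterior):.3f}")  # pulled from 0.5 toward 0.8

The single normalization step stands in for the denominator of Bayes' theorem, which is why the proportional form is all you need in practice.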

Intuitive Analogy

Bayesian estimation is like a doctor making a diagnosis: experience (the prior) suggests you probably have a cold, your symptoms (the likelihood) are then taken into account, and the result is the probability that you actually have a cold (the posterior).

Mind Map


16.2 Bayesian Estimation of Discrete Distribution Parameters

Parameter estimation for discrete distributions is the "warm-up exercise" of Bayesian inference; the key tool is the conjugate prior (the posterior stays in the same family as the prior, which makes the computation very convenient).

16.2.1 K>2 States: The Dirichlet Distribution

Core Concepts

When a discrete variable has K states (e.g., the 6 faces of a die), the parameters are the per-state probabilities [θ1, θ2, ..., θK] (summing to 1). The conjugate prior is the Dirichlet distribution; the observed data are the per-state counts [n1, n2, ..., nK], and the posterior is again a Dirichlet distribution.

Plain-Language Explanation

The Dirichlet distribution is a "prior over dice": you might guess that all faces are equally likely (prior parameters α=[1,1,1,1,1,1]); after 100 rolls the faces come up different numbers of times (the likelihood), and the posterior revises each face's probability while remaining a Dirichlet distribution.

Full Code + Visualization (Prior vs Posterior)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Matplotlib configuration for macOS (English-only labels here, so no Chinese font is needed)
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['axes.facecolor'] = 'white'

# Global settings: plot style + figure size
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

def dirichlet_visualization():
    # 1. Define the parameters: K=6 (the six faces of a die)
    K = 6
    # Prior parameters: smaller alpha means a vaguer prior; larger alpha means a more confident prior
    alpha_prior = np.ones(K)  # Uniform prior: assume every face is equally likely
    # Observed data: counts per face after 100 die rolls
    n_obs = np.array([15, 20, 18, 17, 14, 16])
    # Posterior parameters: by Dirichlet conjugacy, posterior alpha = prior alpha + observed counts
    alpha_post = alpha_prior + n_obs

    # 2. Draw Dirichlet samples (for visualization)
    np.random.seed(42)  # Fix the random seed so the results are reproducible
    prior_samples = stats.dirichlet.rvs(alpha_prior, size=10000)
    post_samples = stats.dirichlet.rvs(alpha_post, size=10000)

    # 3. Visualization: prior vs posterior comparison (probability distribution of each face)
    fig, axes = plt.subplots(2, 1, figsize=(12, 10))

    # Prior distribution (English labels)
    axes[0].boxplot(prior_samples)
    axes[0].set_title('Dirichlet Prior Distribution (K=6, alpha=[1,1,1,1,1,1])', fontsize=14)
    axes[0].set_ylabel('Probability Value', fontsize=12)
    axes[0].set_xticklabels([f'Face {i + 1}' for i in range(K)])

    # Posterior distribution (English labels)
    axes[1].boxplot(post_samples)
    axes[1].set_title('Dirichlet Posterior Distribution (Observed Counts=[15,20,18,17,14,16])', fontsize=14)
    axes[1].set_ylabel('Probability Value', fontsize=12)
    axes[1].set_xlabel('Dice Face Number', fontsize=12)
    axes[1].set_xticklabels([f'Face {i + 1}' for i in range(K)])

    plt.tight_layout()
    plt.show()

# Run the function
if __name__ == "__main__":
    dirichlet_visualization()
Code Notes
  • stats.dirichlet.rvs: draws samples from a Dirichlet distribution;
  • Conjugacy in action: posterior parameters = prior parameters + observed counts, with no complicated integrals;
  • The box plots compare the prior (uniform, with large spread in each face's probability) against the posterior (much smaller spread, aligned with the observed counts).
Output

In the prior box plots each face's probability is centered around 0.17 (1/6) with a large spread; in the posterior box plots each face's probability tracks the observed counts (face 2 is slightly higher, for example) and the spread shrinks markedly, illustrating how "data revises belief".

16.2.2 K=2 States: The Beta Distribution

Core Concepts

K=2 is the special case of a discrete distribution (e.g., a coin's heads/tails). The parameter θ is the heads probability (0<θ<1); the conjugate prior is the Beta distribution, and the posterior is again a Beta distribution:

  • Prior: Beta(α, β)
  • Likelihood: Binomial (n trials, k heads)
  • Posterior: Beta(α+k, β+(n−k))
Plain-Language Explanation

The Beta distribution is a "prior over coins": α=1, β=1 is the uniform prior (no strong opinion about the coin); α>1 with β>1 concentrates mass around the middle (the coin is probably fair); α≠β leans toward one side or the other.

Full Code + Visualization (Different Priors vs Posteriors)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Basic configuration only (no Chinese font needed)
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['axes.facecolor'] = 'white'
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

def beta_visualization():
    theta_true = 0.6
    n_trials = 50
    np.random.seed(42)
    k_heads = np.sum(stats.bernoulli.rvs(theta_true, size=n_trials))
    print(f'Number of heads in {n_trials} flips: {k_heads}')

    priors = {
        'Uniform Prior': (1, 1),
        'Fairness-Favoring Prior': (10, 10),
        'Heads-Favoring Prior': (20, 5)
    }

    theta_range = np.linspace(0, 1, 1000)
    fig, ax = plt.subplots(figsize=(12, 8))

    for prior_name, (alpha, beta) in priors.items():
        prior_pdf = stats.beta.pdf(theta_range, alpha, beta)
        alpha_post = alpha + k_heads
        beta_post = beta + (n_trials - k_heads)
        post_pdf = stats.beta.pdf(theta_range, alpha_post, beta_post)

        ax.plot(theta_range, prior_pdf, label=f'{prior_name} (α={alpha}, β={beta})', linestyle='--', linewidth=2)
        ax.plot(theta_range, post_pdf, label=f'{prior_name} Posterior (α={alpha_post}, β={beta_post})', linewidth=2)

    ax.axvline(theta_true, color='red', linestyle='-.', linewidth=3, label=f'True θ={theta_true}')
    ax.set_title('Beta Distribution: Posterior Comparison with Different Priors (50 Coin Flips)', fontsize=14)
    ax.set_xlabel('Probability of Heads θ', fontsize=12)
    ax.set_ylabel('Probability Density', fontsize=12)
    ax.legend(fontsize=10)
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    beta_visualization()
Code Notes

stats.bernoulli.rvs: draws Bernoulli samples (simulated coin flips);

stats.beta.pdf: evaluates the Beta probability density;

Comparing the posteriors under the 3 priors: once there is enough data, the posterior converges toward the true value regardless of the prior (the data "overwhelms" the prior).

Output

The posterior from the uniform prior is the most spread out, the fairness-favoring prior gives a more concentrated posterior, and the heads-favoring prior starts shifted to the right but still ends up close to the true value;

The red dash-dot line marks the true θ=0.6, and all the posteriors concentrate around it.


16.3 Bayesian Estimation of Gaussian Distribution Parameters

The Gaussian (normal) distribution is the core continuous distribution. Its Bayesian treatment splits into several cases depending on which of the mean/variance are known or unknown; we focus on the 3 most common ones.

16.3.1 Univariate Case: Unknown Mean, Known Variance

Core Concepts

With the variance σ² known, the conjugate prior for the mean μ is itself Gaussian, μ ~ N(μ₀, σ₀²). After observing n samples with sample mean x̄, the posterior is again Gaussian, with 1/σ_N² = 1/σ₀² + n/σ² and μ_N = σ_N² · (μ₀/σ₀² + n·x̄/σ²), which is exactly the update used in the code below.
Plain-Language Explanation

It is like guessing a class's average height (prior: around 170 cm, give or take 5 cm), then measuring 10 students (sample mean 172 cm); combined with the known variance of heights, you arrive at a revised guess for the average height (the posterior).

Full Code + Visualization (Prior vs Posterior)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'    # White background
plt.style.use('seaborn-v0_8-whitegrid')     # Plot style
plt.rcParams['figure.figsize'] = (12, 8)    # Default figure size

def gaussian_mean_unknown_var_known():
    # 1. True parameters
    mu_true = 5.0    # True mean
    sigma2_true = 4.0  # Known variance (σ²=4 → σ=2)
    sigma_true = np.sqrt(sigma2_true)
    
    # 2. Generate observed data (fixed random seed for reproducibility)
    np.random.seed(42)
    n_samples_list = [5, 20, 100]  # Different sample sizes to show posterior changes
    data = {n: stats.norm.rvs(mu_true, sigma_true, size=n) for n in n_samples_list}
    
    # 3. Prior parameters (intentionally misaligned to show Bayesian correction)
    mu_prior = 0.0    # Prior mean (wrong guess)
    sigma2_prior = 9.0  # Prior variance (σ²=9 → σ=3)
    sigma_prior = np.sqrt(sigma2_prior)
    
    # 4. Calculate posterior parameters (for different sample sizes)
    post_params = {}
    for n, sample in data.items():
        x_bar = np.mean(sample)  # Sample mean
        # Posterior variance formula: 1/(n/σ²_true + 1/σ²_prior)
        sigma2_post = 1 / (n / sigma2_true + 1 / sigma2_prior)
        # Posterior mean formula: σ²_post * (μ_prior/σ²_prior + n*x̄/σ²_true)
        mu_post = sigma2_post * (mu_prior / sigma2_prior + n * x_bar / sigma2_true)
        post_params[n] = (mu_post, np.sqrt(sigma2_post))
        # Print intermediate results for clarity
        print(f"Sample size {n}: Sample mean={x_bar:.2f}, Posterior mean={mu_post:.2f}, Posterior std={np.sqrt(sigma2_post):.2f}")
    
    # 5. Visualization: Prior + Posteriors with different sample sizes
    mu_range = np.linspace(-5, 10, 1000)  # Range for mean μ
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Plot prior distribution (black dashed line for contrast)
    prior_pdf = stats.norm.pdf(mu_range, mu_prior, sigma_prior)
    ax.plot(mu_range, prior_pdf, label=f'Prior N({mu_prior}, {sigma2_prior})', 
            linestyle='--', linewidth=3, color='black')
    
    # Plot posteriors for different sample sizes (different colors)
    colors = ['red', 'blue', 'green']
    for i, (n, (mu_post, sigma_post)) in enumerate(post_params.items()):
        post_pdf = stats.norm.pdf(mu_range, mu_post, sigma_post)
        ax.plot(mu_range, post_pdf, 
                label=f'Sample size {n}: Posterior N({mu_post:.2f}, {sigma_post**2:.2f})', 
                color=colors[i], linewidth=2)
    
    # Mark true mean (orange dash-dot line as reference)
    ax.axvline(mu_true, color='orange', linestyle='-.', linewidth=3, label=f'True μ={mu_true}')
    
    # Plot annotations and styling
    ax.set_title('Gaussian Distribution: Bayesian Estimation of Unknown Mean (Known Variance σ²=4)', fontsize=14)
    ax.set_xlabel('Mean μ', fontsize=12)
    ax.set_ylabel('Probability Density', fontsize=12)
    ax.legend(fontsize=10)  # Adjust legend font size
    plt.tight_layout()  # Prevent label truncation
    plt.show()

# Run the function
if __name__ == "__main__":
    gaussian_mean_unknown_var_known()
Code Notes
  • The prior mean is deliberately set to 0 (the true value is 5) so you can watch the posterior correct itself as the sample size grows;
  • The larger the sample, the closer the posterior mean gets to the true value 5 and the smaller the posterior variance (greater certainty).
Output

As the sample size grows from 5 to 20 to 100, the posterior mean moves away from the prior toward the true value and ends up essentially at 5 (the exact values are printed by the script);

The posterior curves also become progressively "thinner" (smaller variance), reflecting growing certainty about the mean.

16.3.2 Univariate Case: Unknown Mean, Unknown Variance

Core Concepts

Now μ and σ² must be estimated together. The conjugate prior is the Normal-Inverse-Gamma distribution (μ is Normal given σ², and σ² is Inverse-Gamma), and the posterior is again Normal-Inverse-Gamma.

Plain-Language Explanation

This amounts to guessing both the average height and how much heights vary. The two parameters influence each other, and the conjugate prior lets you revise both at the same time.

Full Code + 3D Visualization (Posterior Distribution)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'    # White background
plt.style.use('seaborn-v0_8-whitegrid')     # Clean plot style
plt.rcParams['figure.figsize'] = (14, 10)   # Default figure size for 3D plot

def gaussian_mean_var_unknown():
    # 1. True parameters
    mu_true = 2.0
    sigma2_true = 1.5  # True variance
    sigma_true = np.sqrt(sigma2_true)
    
    # 2. Generate observed data (fixed seed for reproducibility)
    np.random.seed(42)
    n_samples = 50
    data = stats.norm.rvs(mu_true, sigma_true, size=n_samples)
    x_bar = np.mean(data)  # Sample mean
    s2 = np.sum((data - x_bar)**2) / n_samples  # Sample variance
    print(f"Generated {n_samples} samples | Sample mean: {x_bar:.2f} | Sample variance: {s2:.2f}")
    
    # 3. Prior parameters (Normal-Inverse Gamma)
    mu0 = 0.0    # Prior mean
    k0 = 1.0     # Prior precision weight
    alpha0 = 2.0 # Inverse Gamma prior α
    beta0 = 2.0  # Inverse Gamma prior β
    print(f"Prior parameters | μ₀={mu0}, k₀={k0}, α₀={alpha0}, β₀={beta0}")
    
    # 4. Posterior parameters (Normal-Inverse Gamma)
    kn = k0 + n_samples
    mun = (k0 * mu0 + n_samples * x_bar) / kn
    alphan = alpha0 + n_samples / 2
    betan = beta0 + 0.5 * s2 * n_samples + (k0 * n_samples * (x_bar - mu0)**2) / (2 * kn)
    print(f"Posterior parameters | μₙ={mun:.2f}, kₙ={kn}, αₙ={alphan:.2f}, βₙ={betan:.2f}")
    
    # 5. Generate grid data for 3D visualization
    mu_range = np.linspace(mun - 2, mun + 2, 100)  # Range around posterior mean
    sigma2_range = np.linspace(0.1, 5, 100)        # Variance range (avoid 0)
    mu_grid, sigma2_grid = np.meshgrid(mu_range, sigma2_range)
    
    # 6. Calculate joint posterior PDF (Normal-Inverse Gamma)
    # Step 1: Inverse Gamma PDF for variance σ²
    inv_gamma_pdf = stats.invgamma.pdf(sigma2_grid, a=alphan, scale=betan)
    # Step 2: Normal PDF for mean μ (conditional on σ²)
    norm_pdf = stats.norm.pdf(mu_grid, loc=mun, scale=np.sqrt(sigma2_grid / kn))
    # Step 3: Joint posterior = Normal PDF * Inverse Gamma PDF
    post_pdf = norm_pdf * inv_gamma_pdf
    
    # 7. 3D Visualization of posterior distribution
    fig = plt.figure(figsize=(14, 10))
    ax = fig.add_subplot(111, projection='3d')
    
    # Plot 3D surface
    surf = ax.plot_surface(mu_grid, sigma2_grid, post_pdf, 
                           cmap='viridis', alpha=0.8, linewidth=0)
    fig.colorbar(surf, ax=ax, shrink=0.5, aspect=5, label='Posterior Probability Density')
    
    # Mark true parameter values (red star for reference)
    ax.scatter(mu_true, sigma2_true, 0, color='red', s=200, 
               label=f'True Values (μ={mu_true}, σ²={sigma2_true})', marker='*')
    
    # Plot annotations and styling
    ax.set_title('Gaussian Distribution: Joint Posterior of Unknown Mean & Variance (Normal-Inverse Gamma)', fontsize=14)
    ax.set_xlabel('Mean μ', fontsize=12, labelpad=10)
    ax.set_ylabel('Variance σ²', fontsize=12, labelpad=10)
    ax.set_zlabel('Posterior Probability Density', fontsize=12, labelpad=10)
    ax.legend(fontsize=10, loc='upper left')
    
    # Adjust viewing angle for better visibility
    ax.view_init(elev=20, azim=45)
    plt.tight_layout()
    plt.show()

# Run the function
if __name__ == "__main__":
    gaussian_mean_var_unknown()
Code Notes
  • stats.invgamma.pdf: evaluates the Inverse-Gamma probability density;
  • The 3D surface shows the joint posterior over μ and σ²; its peak lies close to the true values (2.0, 1.5);
  • The red star marks the true values, making it easy to see the posterior concentrating around them.

16.3.3 Multivariate Case: Unknown Mean, Unknown Covariance

Core Concepts

For a multivariate Gaussian N(μ, Σ) with unknown mean vector μ and unknown covariance matrix Σ, the conjugate prior is the Normal-Inverse-Wishart distribution, and the posterior stays in the same family.

Plain-Language Explanation

For example, when estimating the joint distribution of several features (height, weight, age), you guess each feature's mean as well as the covariances between features (e.g., how strongly height and weight are correlated).

Full Code + Visualization (Covariance Matrix Heatmaps)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'  # White background
plt.style.use('seaborn-v0_8-whitegrid')  # Clean plot style
plt.rcParams['figure.figsize'] = (14, 6)  # Default figure size for heatmaps


def multivariate_gaussian_estimation():
    # 1. True parameters (2D Gaussian for visualization)
    mu_true = np.array([3.0, 5.0])  # 2-dimensional mean vector
    cov_true = np.array([[2.0, 1.0], [1.0, 3.0]])  # True covariance matrix
    print(f"True mean vector: {mu_true}")
    print(f"True covariance matrix:\n{cov_true}")

    # 2. Generate observed data (fixed seed for reproducibility)
    np.random.seed(42)
    n_samples = 100
    data = stats.multivariate_normal.rvs(mean=mu_true, cov=cov_true, size=n_samples)
    print(f"\nGenerated {n_samples} 2D samples")

    # 3. Prior parameters (Normal-Inverse Wishart)
    mu0 = np.array([0.0, 0.0])  # Prior mean vector
    k0 = 1.0  # Prior weight
    nu0 = 4.0  # Inverse Wishart degrees of freedom
    S0 = np.eye(2)  # Inverse Wishart scale matrix
    print(f"\nPrior parameters:")
    print(f"- Prior mean (μ₀): {mu0}")
    print(f"- Prior weight (k₀): {k0}")
    print(f"- Inverse Wishart df (ν₀): {nu0}")
    print(f"- Inverse Wishart scale (S₀):\n{S0}")

    # 4. Calculate posterior parameters
    x_bar = np.mean(data, axis=0)  # Sample mean vector
    print(f"\nSample mean vector: {x_bar.round(2)}")

    # Posterior mean vector
    kn = k0 + n_samples
    mun = (k0 * mu0 + n_samples * x_bar) / kn

    # Posterior degrees of freedom
    nun = nu0 + n_samples

    # Posterior scale matrix (corrected calculation)
    # Step 1: Sum of squared deviations (sample covariance scaled by n)
    ss_dev = (data - x_bar).T @ (data - x_bar)
    # Step 2: Prior-sample correction term
    correction_term = (k0 * n_samples / kn) * np.outer((x_bar - mu0), (x_bar - mu0))
    # Step 3: Full posterior scale matrix
    S_n = S0 + ss_dev + correction_term

    print(f"\nPosterior parameters:")
    print(f"- Posterior mean (μₙ): {mun.round(2)}")
    print(f"- Posterior weight (kₙ): {kn}")
    print(f"- Posterior df (νₙ): {nun}")
    print(f"- Posterior scale (Sₙ):\n{S_n.round(2)}")

    # 5. Generate posterior covariance matrix samples (Inverse Wishart)
    # Note: invwishart.rvs(scale=S, df=nu) → samples from IW(nu, S)
    cov_post_samples = stats.invwishart.rvs(df=nun, scale=S_n, size=1000)
    cov_post_mean = np.mean(cov_post_samples, axis=0)  # Mean of posterior samples
    print(f"\nMean of posterior covariance samples:\n{cov_post_mean.round(2)}")

    # 6. Visualization: True vs Posterior Covariance (Heatmaps)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 9))

    # True covariance heatmap
    sns.heatmap(cov_true, annot=True, fmt='.2f', ax=ax1, cmap='Blues',
                cbar=True, cbar_kws={'label': 'Covariance Value'})
    ax1.set_title('True Covariance Matrix', fontsize=14, pad=10)
    ax1.set_xlabel('Feature Dimension', fontsize=12)
    ax1.set_ylabel('Feature Dimension', fontsize=12)
    ax1.set_xticklabels(['Dimension 1', 'Dimension 2'])
    ax1.set_yticklabels(['Dimension 1', 'Dimension 2'])

    # Posterior covariance mean heatmap
    sns.heatmap(cov_post_mean, annot=True, fmt='.2f', ax=ax2, cmap='Blues',
                cbar=True, cbar_kws={'label': 'Covariance Value'})
    ax2.set_title('Posterior Covariance Matrix (Mean of 1000 Samples)', fontsize=14, pad=10)
    ax2.set_xlabel('Feature Dimension', fontsize=12)
    ax2.set_ylabel('Feature Dimension', fontsize=12)
    ax2.set_xticklabels(['Dimension 1', 'Dimension 2'])
    ax2.set_yticklabels(['Dimension 1', 'Dimension 2'])

    # Overall plot title
    fig.suptitle('Multivariate Gaussian: True vs Posterior Covariance (Normal-Inverse Wishart)',
                 fontsize=16, y=1.02)
    plt.tight_layout()
    plt.show()


# Run the function
if __name__ == "__main__":
    multivariate_gaussian_estimation()
Code Notes
  • stats.multivariate_normal.rvs: draws multivariate Gaussian samples;
  • stats.invwishart.rvs: draws Inverse-Wishart samples (the prior/posterior over the covariance);
  • The heatmaps compare the true covariance with the mean of the posterior covariance samples; the values are almost identical.

16.4 Bayesian Estimation of Function Parameters

Bayesian methods can estimate not only distribution parameters but also the parameters of functions (e.g., regression and classification models); the key idea is to treat the function's parameters as random variables.

16.4.1 Regression (Bayesian Linear Regression)

Core Concepts

Place a Gaussian prior on the weight vector, w ~ N(0, α⁻¹I), and assume Gaussian observation noise with precision β. The posterior over w is then Gaussian with covariance Σ_w = (αI + βXᵀX)⁻¹ and mean μ_w = β·Σ_w·Xᵀy, which is exactly what the code below computes.
Plain-Language Explanation

Ordinary linear regression finds a single "best w" (a fixed value); Bayesian linear regression finds a distribution over w (all plausible values of w with their probabilities) and predicts with a weighted average over them, which is more robust.

Full Code + Visualization (Bayesian vs Ordinary Linear Regression)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'    # White background
plt.style.use('seaborn-v0_8-whitegrid')     # Clean plot style
plt.rcParams['figure.figsize'] = (12, 8)    # Default figure size

def bayesian_linear_regression():
    # 1. Generate simulated data
    np.random.seed(42)  # Fixed seed for reproducibility
    x = np.linspace(0, 10, 100).reshape(-1, 1)  # Feature vector (100 samples)
    w_true = np.array([[2.0], [-0.1]])          # True weights (intercept + slope)
    X = np.hstack([np.ones_like(x), x])         # Add intercept term (design matrix)
    y_true = X @ w_true                         # True regression line (no noise)
    y = y_true + np.random.normal(0, 0.5, size=y_true.shape)  # Add Gaussian noise
    
    # Print key data parameters
    print("=== Data Generation ===")
    print(f"True weights (intercept, slope): {w_true.ravel()}")
    print(f"Feature shape: {x.shape}, Design matrix shape: {X.shape}")
    print(f"Observed data shape: {y.shape}")
    
    # 2. Ordinary Least Squares (OLS) regression
    w_ols = np.linalg.inv(X.T @ X) @ X.T @ y
    y_ols = X @ w_ols
    print("\n=== OLS Results ===")
    print(f"OLS weights (intercept, slope): {w_ols.ravel().round(4)}")
    
    # 3. Bayesian Linear Regression
    alpha = 1.0  # Prior precision (regularization parameter)
    beta = 4.0   # Likelihood precision (1/variance of noise)
    # Posterior covariance matrix of weights
    Sigma_w = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    # Posterior mean of weights
    mu_w = beta * Sigma_w @ X.T @ y
    
    print("\n=== Bayesian Regression Results ===")
    print(f"Prior precision (α): {alpha}, Likelihood precision (β): {beta}")
    print(f"Posterior mean weights (intercept, slope): {mu_w.ravel().round(4)}")
    print(f"Posterior covariance matrix:\n{Sigma_w.round(4)}")
    
    # Generate weight samples from posterior distribution (for prediction uncertainty)
    n_samples = 100  # Number of weight samples
    w_samples = stats.multivariate_normal.rvs(mean=mu_w.ravel(), cov=Sigma_w, size=n_samples)
    
    # 4. Bayesian prediction (with uncertainty)
    y_pred_samples = X @ w_samples.T  # Predictions from each weight sample
    y_pred_mean = np.mean(y_pred_samples, axis=1).reshape(-1,1)  # Mean prediction
    y_pred_std = np.std(y_pred_samples, axis=1).reshape(-1,1)    # Prediction std (uncertainty)
    
    # 5. Visualization: OLS vs Bayesian Regression
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Plot observed data points
    ax.scatter(x, y, label='Observed Data', alpha=0.6, s=50, color='gray')
    # Plot true regression line (black solid line)
    ax.plot(x, y_true, label='True Regression Line', color='black', linewidth=3)
    # Plot OLS regression line (red dashed line)
    ax.plot(x, y_ols, label='Ordinary Least Squares (OLS)', color='red', linewidth=2, linestyle='--')
    # Plot Bayesian regression mean (blue solid line)
    ax.plot(x, y_pred_mean, label='Bayesian Regression Mean', color='blue', linewidth=2)
    # Plot 95% credible interval (±2σ) for Bayesian prediction
    ax.fill_between(
        x.ravel(), 
        (y_pred_mean - 2*y_pred_std).ravel(), 
        (y_pred_mean + 2*y_pred_std).ravel(), 
        alpha=0.2, color='blue', 
        label='Bayesian 95% Credible Interval'
    )
    
    # Plot styling and annotations
    ax.set_title('Bayesian Linear Regression vs Ordinary Least Squares (OLS)', fontsize=14, pad=10)
    ax.set_xlabel('x (Feature)', fontsize=12)
    ax.set_ylabel('y (Response)', fontsize=12)
    ax.legend(fontsize=10, loc='upper right')
    ax.set_xlim(-0.5, 10.5)  # Slight padding for better visualization
    plt.tight_layout()
    plt.show()

# Run the function
if __name__ == "__main__":
    bayesian_linear_regression()
Code Notes
  • Bayesian regression returns not only a mean prediction but also a credible interval (the blue band), quantifying predictive uncertainty;
  • Ordinary linear regression gives a single line and cannot express uncertainty;
  • The Bayesian mean curve almost coincides with the true line, and the credible interval covers most of the observations.

16.4.2 Regression with a Prior on the Noise Precision

Core Concepts

Building on 16.4.1, place a Gamma prior on the noise precision β (1/variance) instead of fixing it, and estimate β's posterior as well; this further improves the model's robustness.

Full Code (Core Extension)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'    # White background
plt.style.use('seaborn-v0_8-whitegrid')     # Clean plot style
plt.rcParams['figure.figsize'] = (12, 8)    # Default figure size

def bayesian_regression_with_noise_prior():
    # Reuse base data from Bayesian Linear Regression example
    np.random.seed(42)  # Fixed seed for reproducibility
    x = np.linspace(0, 10, 100).reshape(-1, 1)  # Feature vector (100 samples)
    w_true = np.array([[2.0], [-0.1]])          # True weights (intercept + slope)
    X = np.hstack([np.ones_like(x), x])         # Design matrix (add intercept)
    y_true = X @ w_true                         # True regression line (no noise)
    y = y_true + np.random.normal(0, 0.5, size=y_true.shape)  # Add Gaussian noise
    
    # Print basic data info
    print("=== Data Generation ===")
    print(f"True weights (intercept, slope): {w_true.ravel()}")
    print(f"True noise standard deviation: 0.5")
    print(f"Feature shape: {x.shape}, Design matrix shape: {X.shape}")
    
    # 1. Prior parameters
    alpha = 1.0  # Precision of weight prior (Gaussian)
    a0 = 1.0     # Shape parameter of Gamma prior for noise precision β
    b0 = 1.0     # Rate parameter of Gamma prior for noise precision β
    print("\n=== Prior Parameters ===")
    print(f"Weight prior precision (α): {alpha}")
    print(f"Noise precision prior (Gamma) - shape (a₀): {a0}, rate (b₀): {b0}")
    
    # 2. Iterative estimation of weights (w) and noise precision (β) via MAP
    def neg_log_posterior(params):
        """Negative log posterior for minimization (MAP estimation)"""
        # Split parameters into weights (w) and noise precision (β)
        w = params[:-1].reshape(-1, 1)
        beta = params[-1]
        
        # Prevent non-positive beta (invalid precision)
        if beta <= 1e-6:
            return 1e9  # Large penalty for invalid beta
        
        # Calculate components of log posterior
        log_prior_w = -0.5 * alpha * np.sum(w**2)  # Log prior for weights (Gaussian)
        log_likelihood = 0.5 * len(y) * np.log(beta) - 0.5 * beta * np.sum((y - X @ w)**2)  # Log likelihood
        log_prior_beta = (a0 - 1) * np.log(beta) - b0 * beta  # Log prior for β (Gamma)
        
        # Return negative log posterior (for minimization)
        return -(log_prior_w + log_likelihood + log_prior_beta)
    
    # Initialize parameters (weights=0, beta=1.0)
    init_params = np.hstack([np.zeros(X.shape[1]), [1.0]])
    print(f"\n=== MAP Estimation ===")
    print(f"Initial parameters (intercept, slope, β): {init_params}")
    
    # Minimize negative log posterior (L-BFGS-B for constrained optimization)
    res = minimize(
        neg_log_posterior, 
        init_params, 
        method='L-BFGS-B',
        bounds=[(None, None), (None, None), (1e-6, None)]  # β must be positive
    )
    
    # Extract MAP estimates
    if res.success:
        w_map = res.x[:-1].reshape(-1, 1)  # MAP estimate of weights
        beta_map = res.x[-1]               # MAP estimate of noise precision
        sigma_map = 1 / np.sqrt(beta_map)  # Noise standard deviation (from precision)
        
        print(f"Optimization successful!")
        print(f"MAP weights (intercept, slope): {w_map.ravel().round(4)}")
        print(f"MAP noise precision (β): {beta_map:.4f}")
        print(f"MAP noise standard deviation (σ): {sigma_map:.4f}")
    else:
        raise RuntimeError(f"Optimization failed: {res.message}")
    
    # 3. Prediction with MAP estimates
    y_pred = X @ w_map  # Mean prediction
    y_std = sigma_map   # Noise standard deviation (for uncertainty)
    
    # 4. Visualization
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Plot observed data points
    ax.scatter(x, y, label='Observed Data', alpha=0.6, s=50, color='gray')
    # Plot true regression line
    ax.plot(x, y_true, label='True Regression Line', color='black', linewidth=3)
    # Plot MAP prediction line
    ax.plot(
        x, y_pred, 
        label=f'Bayesian Regression (MAP, β={beta_map:.2f})', 
        color='green', linewidth=2
    )
    # Plot 95% predictive interval (±2σ)
    ax.fill_between(
        x.ravel(), 
        (y_pred - 2*y_std).ravel(), 
        (y_pred + 2*y_std).ravel(), 
        alpha=0.2, color='green', 
        label='95% Predictive Interval'
    )
    
    # Plot styling and annotations
    ax.set_title('Bayesian Linear Regression with Gamma Prior on Noise Precision', fontsize=14, pad=10)
    ax.set_xlabel('x (Feature)', fontsize=12)
    ax.set_ylabel('y (Response)', fontsize=12)
    ax.legend(fontsize=10, loc='upper right')
    ax.set_xlim(-0.5, 10.5)  # Add padding to x-axis
    plt.tight_layout()
    plt.show()

# Run the function
if __name__ == "__main__":
    bayesian_regression_with_noise_prior()

16.4.3 Using Basis or Kernel Functions (Bayesian Kernel Regression)

Core Concepts

Use kernel functions (e.g., a Gaussian/RBF kernel) to map the low-dimensional features into a higher-dimensional space and then run Bayesian regression there, which handles non-linear problems.

Full Code + Visualization (Non-linear Fitting)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'  # White background
plt.style.use('seaborn-v0_8-whitegrid')  # Clean plot style
plt.rcParams['figure.figsize'] = (12, 8)  # Default figure size


def bayesian_kernel_regression():
    # 1. Generate non-linear synthetic data
    np.random.seed(42)  # Fixed seed for reproducibility
    x = np.linspace(0, 10, 100).reshape(-1, 1)  # Feature vector (100 samples)
    y_true = np.sin(x) + 0.5 * x  # True non-linear function
    y = y_true + np.random.normal(0, 0.3, size=y_true.shape)  # Add Gaussian noise

    # Print basic data information
    print("=== Data Generation ===")
    print(f"Feature shape: {x.shape}")
    print(f"True function: y = sin(x) + 0.5x")
    print(f"Noise standard deviation: 0.3")

    # 2. Gaussian Radial Basis Function (RBF) kernel
    def rbf_kernel(x, centers, gamma=0.1):
        """
        Radial Basis Function (RBF) kernel (Gaussian kernel)
        Maps input features to high-dimensional space for non-linear regression

        Parameters:
            x: Input features (n_samples, 1)
            centers: Kernel centers (n_centers, 1)
            gamma: Kernel bandwidth parameter (controls smoothness)

        Returns:
            Kernel matrix (n_samples, n_centers)
        """
        # Use broadcasting to compute the distance from every sample to every center
        # x is expanded to (100, 1, 1) and centers to (1, 10, 1) to get pairwise differences
        x_expanded = x[:, np.newaxis]  # (100, 1, 1)
        centers_expanded = centers[np.newaxis, :, :]  # (1, 10, 1)
        # Squared Euclidean distances, shape (100, 10)
        dists = np.sum((x_expanded - centers_expanded) ** 2, axis=2)
        return np.exp(-gamma * dists)

    # Select kernel centers (10 evenly spaced centers over [0, 10])
    n_centers = 10
    centers = np.linspace(0, 10, n_centers).reshape(-1, 1)
    print(f"\n=== Kernel Configuration ===")
    print(f"Number of RBF kernel centers: {n_centers}")
    print(f"Kernel centers: {centers.ravel().round(2)}")
    print(f"RBF kernel gamma (bandwidth): 0.1")

    # Generate kernel matrix (map to high-dimensional space)
    K = rbf_kernel(x, centers, gamma=0.1)
    K = np.hstack([np.ones_like(x), K])  # Add intercept term to kernel matrix
    print(f"Kernel matrix shape (after adding intercept): {K.shape}")

    # 3. Bayesian Kernel Regression
    alpha = 1.0  # Prior precision for kernel weights
    beta = 10.0  # Likelihood precision (1/variance of noise)
    print(f"\n=== Bayesian Regression Parameters ===")
    print(f"Weight prior precision (α): {alpha}")
    print(f"Likelihood precision (β): {beta}")

    # Calculate posterior covariance and mean of kernel weights
    Sigma_w = np.linalg.inv(alpha * np.eye(K.shape[1]) + beta * K.T @ K)
    mu_w = beta * Sigma_w @ K.T @ y

    # Mean prediction (posterior mean)
    y_pred = K @ mu_w

    # Generate weight samples to estimate prediction uncertainty
    n_weight_samples = 100
    w_samples = stats.multivariate_normal.rvs(
        mean=mu_w.ravel(),
        cov=Sigma_w,
        size=n_weight_samples
    )
    # Calculate prediction standard deviation (uncertainty)
    y_pred_samples = K @ w_samples.T
    y_pred_std = np.std(y_pred_samples, axis=1).reshape(-1, 1)

    print(f"\n=== Prediction Results ===")
    print(f"Number of weight samples for uncertainty: {n_weight_samples}")
    print(f"Mean prediction std (across all x): {np.mean(y_pred_std):.4f}")

    # 4. Visualization: Bayesian Kernel Regression for Non-linear Data
    fig, ax = plt.subplots(figsize=(12, 8))

    # Plot observed data points
    ax.scatter(x, y, label='Observed Data', alpha=0.6, s=50, color='gray')
    # Plot true non-linear function
    ax.plot(x, y_true, label='True Non-linear Function', color='black', linewidth=3)
    # Plot Bayesian kernel regression mean prediction
    ax.plot(x, y_pred, label='Bayesian Kernel Regression (RBF)', color='purple', linewidth=2)
    # Plot 95% credible interval (±2σ) for prediction uncertainty
    ax.fill_between(
        x.ravel(),
        (y_pred - 2 * y_pred_std).ravel(),
        (y_pred + 2 * y_pred_std).ravel(),
        alpha=0.2, color='purple',
        label='95% Credible Interval'
    )

    # Plot styling and annotations
    ax.set_title('Bayesian Kernel Regression (RBF Kernel) for Non-linear Data', fontsize=14, pad=10)
    ax.set_xlabel('x (Feature)', fontsize=12)
    ax.set_ylabel('y (Response)', fontsize=12)
    ax.legend(fontsize=10, loc='upper left')
    ax.set_xlim(-0.5, 10.5)  # Add padding to x-axis
    ax.set_ylim(-0.5, 7)  # Adjust y-axis for better visualization
    plt.tight_layout()
    plt.show()


# Run the function
if __name__ == "__main__":
    bayesian_kernel_regression()

16.4.4 Bayesian Classification (Bayesian Logistic Regression)

Core Concepts

For classification, estimate the logistic-regression parameters in a Bayesian way and output class probabilities rather than hard labels, which is better suited to settings where uncertainty matters.

Full Code + Visualization (Binary Classification)
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize, approx_fprime
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'    # White background
plt.style.use('seaborn-v0_8-whitegrid')     # Clean plot style
plt.rcParams['figure.figsize'] = (12, 8)    # Default figure size

def bayesian_logistic_regression():
    # 1. Generate binary classification data
    np.random.seed(42)  # Fixed seed for reproducibility
    # Class 0 samples (100 samples, 2 features)
    x1 = np.random.multivariate_normal([1, 2], [[1, 0.5], [0.5, 1]], 100)
    # Class 1 samples (100 samples, 2 features)
    x2 = np.random.multivariate_normal([4, 5], [[1, 0.5], [0.5, 1]], 100)
    # Combine features and labels
    X = np.vstack([x1, x2])  # (200, 2)
    y = np.hstack([np.zeros(100), np.ones(100)])  # (200,)
    
    # Print basic data information
    print("=== Data Generation ===")
    print(f"Feature matrix shape: {X.shape}")
    print(f"Label vector shape: {y.shape}")
    print(f"Number of Class 0 samples: {np.sum(y == 0)}")
    print(f"Number of Class 1 samples: {np.sum(y == 1)}")

    # 2. Sigmoid (logistic) function with numerical stability
    def sigmoid(z):
        """
        Sigmoid function with clipping to avoid numerical overflow/underflow
        """
        # Clip z to prevent exp(-z) from being too large/small
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    # 3. Bayesian Logistic Regression (MAP Estimation)
    def neg_log_posterior(w):
        """
        Negative log posterior for MAP estimation (likelihood + prior)
        """
        # Linear predictor (z = X*w + intercept)
        z = X @ w[:-1] + w[-1]
        # Log likelihood (add small epsilon to avoid log(0))
        ll = np.sum(
            y * np.log(sigmoid(z) + 1e-9) + 
            (1 - y) * np.log(1 - sigmoid(z) + 1e-9)
        )
        # Gaussian prior (L2 regularization, variance=10)
        prior = -0.5 * np.sum(w**2) / 10
        # Return negative log posterior (for minimization)
        return -(ll + prior)
    
    # Initialize parameters (2 features + intercept)
    init_w = np.zeros(3)
    print(f"\n=== MAP Estimation ===")
    print(f"Initial parameters (w1, w2, intercept): {init_w}")
    
    # Minimize negative log posterior (BFGS method)
    res = minimize(
        neg_log_posterior, 
        init_w, 
        method='BFGS',
        options={'gtol': 1e-6, 'maxiter': 1000}  # Tighter convergence criteria
    )
    
    # Check optimization success
    if res.success:
        w_map = res.x  # MAP estimate of parameters
        print(f"Optimization successful!")
        print(f"MAP parameters (w1, w2, intercept): {w_map.round(4)}")
    else:
        raise RuntimeError(f"Optimization failed: {res.message}")

    # 4. Generate parameter samples (Laplace approximation)
    # Calculate Hessian matrix (second derivative) for posterior covariance
    def hessian(f, x, epsilon=1e-5):
        """
        Numerical approximation of Hessian matrix
        """
        n = len(x)
        hess = np.zeros((n, n))
        for i in range(n):
            x_i = x.copy()
            x_i[i] += epsilon
            grad_i = approx_fprime(x_i, f, epsilon)
            x_i[i] -= 2*epsilon
            grad_j = approx_fprime(x_i, f, epsilon)
            hess[i] = (grad_i - grad_j) / (2*epsilon)
        # Symmetrize Hessian (numerical stability)
        return (hess + hess.T) / 2
    
    # Compute Hessian at MAP estimate
    H = hessian(neg_log_posterior, w_map)
    # Regularize Hessian to ensure positive definiteness
    H_reg = H + 1e-6 * np.eye(H.shape[0])
    # Posterior covariance (inverse of Hessian)
    Sigma_w = np.linalg.inv(H_reg)
    
    # Generate weight samples from Laplace approximation (Gaussian)
    n_samples = 100
    w_samples = stats.multivariate_normal.rvs(
        mean=w_map, 
        cov=Sigma_w, 
        size=n_samples,
        random_state=42  # Fixed seed for reproducibility
    )
    print(f"\n=== Laplace Approximation ===")
    print(f"Number of parameter samples: {n_samples}")
    print(f"Posterior covariance matrix:\n{Sigma_w.round(4)}")

    # 5. Visualization: Decision Boundaries (Parameter Uncertainty)
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Plot data points (Class 0 and Class 1)
    ax.scatter(X[y==0, 0], X[y==0, 1], label='Class 0', alpha=0.6, s=60, color='blue')
    ax.scatter(X[y==1, 0], X[y==1, 1], label='Class 1', alpha=0.6, s=60, color='orange')
    
    # Plot multiple decision boundaries (from parameter samples)
    x_grid = np.linspace(-2, 7, 100)
    for w in w_samples[:20]:  # Plot first 20 samples (uncertainty)
        # Avoid division by zero
        if np.abs(w[1]) > 1e-6:
            y_grid = -(w[0]*x_grid + w[2]) / w[1]
            ax.plot(x_grid, y_grid, color='gray', alpha=0.2, linewidth=1)
    
    # Plot MAP decision boundary (red highlight)
    if np.abs(w_map[1]) > 1e-6:
        y_map = -(w_map[0]*x_grid + w_map[2]) / w_map[1]
        ax.plot(x_grid, y_map, color='red', linewidth=3, label='MAP Decision Boundary')
    
    # Plot styling and annotations
    ax.set_title('Bayesian Logistic Regression: Decision Boundaries (Parameter Uncertainty)', fontsize=14, pad=10)
    ax.set_xlabel('Feature 1', fontsize=12)
    ax.set_ylabel('Feature 2', fontsize=12)
    ax.legend(fontsize=10, loc='upper left')
    ax.set_xlim(-2, 7)
    ax.set_ylim(-1, 8)
    plt.tight_layout()
    plt.show()

# Run the function
if __name__ == "__main__":
    bayesian_logistic_regression()

16.5 Choosing a Prior

Core Concepts

Choosing the prior is the "soul" of Bayesian estimation. Common strategies:

  1. Non-informative priors: inject no subjective information (e.g., uniform priors, Jeffreys priors);
  2. Conjugate priors: computationally convenient, since the posterior stays in the same family as the prior;
  3. Empirical priors: based on domain knowledge or historical data;
  4. Hierarchical priors: the prior's own parameters also get priors (hyperpriors).

Visualization (Impact of Different Priors)

Reusing the Beta-distribution code from 16.2.2, the key conclusions are (a short numeric check follows this list):

  • With little data, the prior matters a lot;
  • With plenty of data, the prior's influence becomes negligible ("let the data speak");
  • A non-informative prior (e.g., α=1, β=1) is a safe choice for beginners.
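As a quick numeric check of these conclusions, the following minimal sketch (my own illustration, not from the book; it reuses the true heads probability 0.6 and the three priors from 16.2.2) compares the posterior means under different Beta priors for a small and a large sample:

import numpy as np
import scipy.stats as stats

np.random.seed(42)
theta_true = 0.6
priors = {'Uniform (1,1)': (1, 1), 'Fair-coin (10,10)': (10, 10), 'Heads-biased (20,5)': (20, 5)}

for n in (10, 1000):  # small sample vs large sample
    k = np.sum(stats.bernoulli.rvs(theta_true, size=n))
    print(f"n={n}, heads={k}")
    for name, (a, b) in priors.items():
        post_mean = (a + k) / (a + b + n)  # mean of the Beta(a+k, b+n-k) posterior
        print(f"  {name:>18}: posterior mean = {post_mean:.3f}")

With n=10 the three posterior means differ noticeably; with n=1000 they are all essentially 0.6.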

Mind Map


16.6 Bayesian Model Comparison

Core Concepts

Bayesian methods can compare competing models, not just estimate parameters. The key quantity is the marginal likelihood (the likelihood with the parameters integrated out), p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ; the ratio of two marginal likelihoods is the Bayes factor, e.g. BF₂₁ = p(D|M₂) / p(D|M₁).

Plain-Language Explanation

Model comparison is like choosing a tool: the marginal likelihood measures how well each tool (model) explains the data, and the larger the Bayes factor, the stronger the evidence for the corresponding model.

Full Code (Comparing Two Regression Models)

# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize, approx_fprime
from mpl_toolkits.mplot3d import Axes3D

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'  # White background
plt.style.use('seaborn-v0_8-whitegrid')  # Clean plot style
plt.rcParams['figure.figsize'] = (12, 8)  # Default figure size


def bayesian_model_comparison():
    # 1. Generate synthetic data (true model: linear + quadratic term)
    np.random.seed(42)  # Fixed seed for reproducibility
    x = np.linspace(0, 10, 50).reshape(-1, 1)  # 50 samples, 1 feature
    y_true = 2 * x + 0.1 * x ** 2  # True quadratic function
    y = y_true + np.random.normal(0, 1, size=y_true.shape)  # Add Gaussian noise (σ=1)

    # Print basic data information
    print("=== Data Generation ===")
    print(f"Feature shape: {x.shape}")
    print(f"Label shape: {y.shape}")
    print(f"True model: y = 2x + 0.1x²")
    print(f"Noise standard deviation: 1.0")

    # 2. Define candidate models (design matrices)
    # Model 1: Linear model (y = w0 + w1*x)
    X1 = np.hstack([np.ones_like(x), x])
    # Model 2: Quadratic model (y = w0 + w1*x + w2*x²)
    X2 = np.hstack([np.ones_like(x), x, x ** 2])

    print(f"\n=== Model Configuration ===")
    print(f"Model 1 (Linear) - Design matrix shape: {X1.shape} (intercept + x)")
    print(f"Model 2 (Quadratic) - Design matrix shape: {X2.shape} (intercept + x + x²)")

    # 3. Marginal likelihood calculation (Laplace approximation)
    def marginal_likelihood(X, y, alpha=1.0, beta=1.0):
        """
        Calculate marginal likelihood using Laplace approximation
        (Bayesian model evidence for model comparison)

        Parameters:
            X: Design matrix (n_samples, n_params)
            y: Observed labels (n_samples, 1)
            alpha: Prior precision for weights
            beta: Likelihood precision (1/variance of noise)

        Returns:
            Marginal likelihood (model evidence)
        """
        n_params = X.shape[1]

        # Negative log posterior for MAP estimation
        def neg_log_post(w):
            # Reshape weights to match label dimensions
            w_reshaped = w.reshape(-1, 1)
            # Log likelihood (Gaussian)
            ll = -0.5 * beta * np.sum((y - X @ w_reshaped) ** 2)
            # Gaussian prior (L2 regularization)
            prior = -0.5 * alpha * np.sum(w ** 2)
            # Return negative log posterior (for minimization)
            return -(ll + prior)
        
        # Fix 1: use the (regularized) OLS solution as the initial value (instead of all zeros) to speed up convergence
        # Ridge-regularized OLS solution: (X.T X + (alpha/beta) I)^-1 X.T y
        w_ols = np.linalg.inv(X.T @ X + alpha/beta * np.eye(n_params)) @ X.T @ y
        init_w = w_ols.ravel()  # Convert to a 1D array

        # Fix 2: tune the optimizer - use a more robust method and looser convergence criteria
        res = minimize(
            neg_log_post,
            init_w,
            method='L-BFGS-B',  # Quasi-Newton method, more robust than plain BFGS
            tol=1e-4,           # Looser convergence tolerance (avoids failures due to precision loss)
            options={
                'gtol': 1e-4, 
                'maxiter': 2000,
                'disp': False    # Suppress optimizer progress output
            }
        )

        # Fix 3: relax the success check (the best parameters found are good enough in practice)
        if not res.success:
            print(f"⚠️  Warning: MAP optimization did not fully converge for model with {n_params} params: {res.message}")
            print(f"   But using the best found parameters (converged enough for practical use)")
        
        w_map = res.x  # MAP estimate of weights
        neg_log_post_map = neg_log_post(w_map)  # Negative log posterior at MAP

        # Calculate Hessian matrix (analytical form for Gaussian likelihood/prior)
        H = alpha * np.eye(n_params) + beta * X.T @ X

        # Regularize Hessian to ensure positive definiteness
        H_reg = H + 1e-8 * np.eye(n_params)

        # Laplace approximation for marginal likelihood
        det_H = np.linalg.det(H_reg)
        # Add small epsilon to avoid log(0) in det (numerical stability)
        if det_H <= 1e-10:
            det_H = 1e-10

        ml = (2 * np.pi) ** (n_params / 2) * (det_H) ** (-0.5) * np.exp(-neg_log_post_map)

        # Print model-specific results
        print(f"\n--- Model with {n_params} parameters ---")
        print(f"OLS initial weights: {init_w.round(4)}")
        print(f"MAP weights: {w_map.round(4)}")
        print(f"Negative log posterior at MAP: {neg_log_post_map:.4f}")
        print(f"Hessian determinant: {det_H:.2e}")
        print(f"Marginal likelihood: {ml:.2e}")

        return ml

    # 4. Calculate marginal likelihood for both models
    print("\n=== Marginal Likelihood Calculation ===")
    ml1 = marginal_likelihood(X1, y)  # Linear model
    ml2 = marginal_likelihood(X2, y)  # Quadratic model

    # 5. Calculate Bayes Factor (BF_21 = ML2 / ML1)
    bf_21 = ml2 / ml1
    # Handle extreme values (numerical stability)
    bf_21 = np.clip(bf_21, 1e-10, 1e10)

    print(f"\n=== Bayesian Model Comparison Results ===")
    print(f"Model 1 (Linear) marginal likelihood: {ml1:.2e}")
    print(f"Model 2 (Quadratic) marginal likelihood: {ml2:.2e}")
    print(f"Bayes Factor BF_21 (Model 2 vs Model 1): {bf_21:.2e}")

    # Interpret Bayes Factor (standard interpretation)
    if bf_21 > 100:
        interpretation = "Extreme evidence for Model 2"
    elif bf_21 > 10:
        interpretation = "Strong evidence for Model 2"
    elif bf_21 > 3:
        interpretation = "Moderate evidence for Model 2"
    elif bf_21 > 1:
        interpretation = "Weak evidence for Model 2"
    elif bf_21 == 1:
        interpretation = "No evidence for either model"
    else:
        interpretation = "Evidence for Model 1"

    print(f"Interpretation: {interpretation}")
    
    # Additional step: visualize both models' fits (a sanity check on the model-selection result)
    fig, ax = plt.subplots(figsize=(12, 8))
    # Plot the raw data
    ax.scatter(x, y, alpha=0.6, s=50, color='gray', label='Observed Data')
    # Plot the true curve
    ax.plot(x, y_true, color='black', linewidth=3, label='True Quadratic Model')

    # Compute and plot the fit of Model 1 (linear)
    w1_map = np.linalg.inv(X1.T @ X1 + 1.0/1.0 * np.eye(2)) @ X1.T @ y
    y1_pred = X1 @ w1_map
    ax.plot(x, y1_pred, color='red', linewidth=2, linestyle='--', label='Model 1 (Linear) Fit')

    # Compute and plot the fit of Model 2 (quadratic)
    w2_map = np.linalg.inv(X2.T @ X2 + 1.0/1.0 * np.eye(3)) @ X2.T @ y
    y2_pred = X2 @ w2_map
    ax.plot(x, y2_pred, color='blue', linewidth=2, label='Model 2 (Quadratic) Fit')
    
    ax.set_title('Bayesian Model Comparison: Linear vs Quadratic Fit', fontsize=14)
    ax.set_xlabel('x (Feature)', fontsize=12)
    ax.set_ylabel('y (Response)', fontsize=12)
    ax.legend(fontsize=10)
    ax.set_xlim(-0.5, 10.5)
    plt.tight_layout()
    plt.show()

# Run the function
if __name__ == "__main__":
    bayesian_model_comparison()
Results

The marginal likelihood of Model 2 (quadratic) is much larger than that of Model 1 (linear), and the Bayes factor is far greater than 1, so Model 2 is the better model (it matches the true data-generating process).


16.7 Bayesian Estimation of Mixture Models

Core Concepts

For mixture models (e.g., the Gaussian mixture model, GMM), Bayesian estimation uses variational inference or MCMC to estimate each component's parameters and weights, which helps avoid overfitting.

Full Code (Bayesian GMM)

# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize, approx_fprime
from mpl_toolkits.mplot3d import Axes3D
from sklearn.mixture import BayesianGaussianMixture

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'    # White background
plt.style.use('seaborn-v0_8-whitegrid')     # Clean plot style
plt.rcParams['figure.figsize'] = (12, 8)    # Default figure size

def bayesian_mixture_model():
    # 1. Generate synthetic data (3 Gaussian components)
    np.random.seed(42)  # Fixed seed for reproducibility
    
    # Component 1: Mean [0,0], Identity covariance
    X1 = np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]], 100)
    # Component 2: Mean [3,3], Positive covariance
    X2 = np.random.multivariate_normal([3, 3], [[1, 0.5], [0.5, 1]], 100)
    # Component 3: Mean [-3,3], Negative covariance
    X3 = np.random.multivariate_normal([-3, 3], [[1, -0.5], [-0.5, 1]], 100)
    
    # Combine all components
    X = np.vstack([X1, X2, X3])
    n_true_components = 3
    
    # Print basic data information
    print("=== Data Generation ===")
    print(f"Total samples: {X.shape[0]} (100 per component)")
    print(f"Feature dimensions: {X.shape[1]}")
    print(f"True number of Gaussian components: {n_true_components}")
    print(f"Component 1: Mean [0,0], Cov [[1,0],[0,1]]")
    print(f"Component 2: Mean [3,3], Cov [[1,0.5],[0.5,1]]")
    print(f"Component 3: Mean [-3,3], Cov [[1,-0.5],[-0.5,1]]")

    # 2. Bayesian Gaussian Mixture Model (automatic component selection)
    print(f"\n=== Bayesian GMM Training ===")
    # Initialize BGMM with more components than needed (automatic pruning)
    # Use Dirichlet process prior for automatic component selection
    bgmm = BayesianGaussianMixture(
        n_components=10,          # Preset 10 components (will be pruned)
        covariance_type='full',   # Full covariance matrix (matches data generation)
        weight_concentration_prior_type='dirichlet_process',  # Key for auto-selection
        weight_concentration_prior=1.0,
        random_state=42,
        max_iter=1000,            # Sufficient iterations for convergence
        tol=1e-6                  # Tight convergence tolerance
    )
    
    # Fit BGMM to data
    bgmm.fit(X)
    # Predict component assignments for each sample
    y_pred = bgmm.predict(X)
    
    # 3. Analyze component selection results
    # Threshold for "effective" components (weights > small epsilon)
    weight_threshold = 1e-3
    n_effective = np.sum(bgmm.weights_ > weight_threshold)
    effective_weights = bgmm.weights_[bgmm.weights_ > weight_threshold]
    
    print(f"Preset number of components: 10")
    print(f"Effective components (weight > {weight_threshold}): {n_effective}")
    print(f"True components: {n_true_components}")
    print(f"Weights of effective components: {effective_weights.round(4)}")
    print(f"Means of effective components:\n{bgmm.means_[bgmm.weights_ > weight_threshold].round(2)}")

    # 4. Visualization: Clustering results with component centers
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Plot clustered data points
    scatter = ax.scatter(
        X[:, 0], X[:, 1], 
        c=y_pred, 
        cmap='viridis', 
        alpha=0.6, 
        s=60,
        edgecolors='black',
        linewidth=0.5
    )
    
    # Plot effective component centers (mean positions)
    effective_means = bgmm.means_[bgmm.weights_ > weight_threshold]
    ax.scatter(
        effective_means[:, 0], effective_means[:, 1],
        marker='*',
        s=300,
        c='red',
        edgecolors='white',
        linewidth=2,
        label='Effective Component Centers'
    )
    
    # Plot styling and annotations
    ax.set_title('Bayesian Gaussian Mixture Model (Automatic Component Selection)', fontsize=14, pad=10)
    ax.set_xlabel('Feature 1', fontsize=12)
    ax.set_ylabel('Feature 2', fontsize=12)
    ax.legend(fontsize=10, loc='upper right')
    
    # Add colorbar with component labels
    cbar = plt.colorbar(scatter, ax=ax, label='Component Label')
    cbar.ax.tick_params(labelsize=10)
    
    # Set axis limits for better visualization
    ax.set_xlim(-6, 6)
    ax.set_ylim(-2, 6)
    
    plt.tight_layout()
    plt.show()

# Run the function
if __name__ == "__main__":
    bayesian_mixture_model()

16.8 Nonparametric Bayesian Modeling

Core Concepts

Parametric models assume a fixed number of parameters; nonparametric Bayesian models let the number of parameters grow with the data (e.g., infinite mixture models), the key ingredient being an "infinite-dimensional prior".

Plain-Language Explanation

A parametric model packs the balls into a fixed number of boxes; a nonparametric model creates as many boxes as the balls turn out to need.
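To see what such an "infinite-dimensional prior" looks like in practice, here is a minimal sketch (my own illustration, not from the book) of the truncated stick-breaking construction behind Dirichlet-process mixtures; the concentration parameter alpha=1.0 and the 20-atom truncation are illustrative assumptions:

import numpy as np

def stick_breaking_weights(alpha=1.0, n_atoms=20, seed=42):
    """Truncated stick-breaking draw of Dirichlet-process mixture weights."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_atoms)          # break off a Beta(1, alpha) fraction each round
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining                            # weight_k = beta_k * prod_{j<k}(1 - beta_j)

weights = stick_breaking_weights()
print(weights.round(3))        # a few sizable weights followed by a long tail of tiny ones
print(weights.sum().round(3))  # close to 1; the rest belongs to the un-broken remainder of the stick

Only a handful of weights end up non-negligible, which is why the effective number of components ("boxes") adapts to the data.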


16.9 Gaussian Processes

Core Concepts

A Gaussian process (GP) is an "infinite-dimensional Gaussian distribution" used for regression/classification. Its heart is the kernel function (which encodes the correlation between samples), and its output is a distribution over entire functions rather than over a handful of parameters.

Full Code (Gaussian Process Regression)

# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize, approx_fprime
from mpl_toolkits.mplot3d import Axes3D
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'  # White background
plt.style.use('seaborn-v0_8-whitegrid')  # Clean plot style
plt.rcParams['figure.figsize'] = (12, 8)  # Default figure size


def gaussian_process():
    # 1. Generate non-linear synthetic data
    np.random.seed(42)  # Fixed seed for reproducibility
    x = np.linspace(0, 10, 50).reshape(-1, 1)  # 50 training samples
    y_true = np.sin(x)  # True underlying function
    y = y_true + np.random.normal(0, 0.1, size=x.shape)  # Add Gaussian noise (σ=0.1)

    # Print basic data information
    print("=== Data Generation ===")
    print(f"Training data shape: {x.shape} (50 samples, 1 feature)")
    print(f"True function: y = sin(x)")
    print(f"Noise standard deviation: 0.1")

    # 2. Gaussian Process Regression (GPR) setup and training
    # Define RBF (Squared Exponential) kernel with initial hyperparameters
    kernel = 1.0 * RBF(length_scale=1.0)
    print(f"\n=== Gaussian Process Configuration ===")
    print(f"Initial kernel: {kernel}")
    print(f"Kernel parameters: amplitude=1.0, length_scale=1.0")

    # Initialize GPR model (optimize kernel hyperparameters during fitting)
    gpr = GaussianProcessRegressor(
        kernel=kernel,
        random_state=42,
        n_restarts_optimizer=10,  # Multiple restarts to avoid local minima
        alpha=0.01,  # Noise variance (matches data generation)
        normalize_y=True  # Normalize target for better optimization
    )

    # Fit GPR to training data (optimizes kernel hyperparameters)
    gpr.fit(x, y)

    # Print optimized kernel parameters (key fix: parse the kernel parameters correctly)
    print(f"Optimized kernel: {gpr.kernel_}")
    
    # Fix 1: extract the numeric kernel parameters correctly (from the fitted kernel_)
    # Get the kernel hyperparameters
    kernel_params = gpr.kernel_.get_params()
    # For a Constant * RBF product kernel, extract the amplitude and length scale correctly
    if isinstance(gpr.kernel_, ConstantKernel):
        amplitude = np.sqrt(gpr.kernel_.constant_value)
        length_scale = 1.0
    elif hasattr(gpr.kernel_, 'k1') and hasattr(gpr.kernel_, 'k2'):
        # Handle the ConstantKernel * RBF composite kernel
        amplitude = np.sqrt(gpr.kernel_.k1.constant_value)
        length_scale = gpr.kernel_.k2.length_scale
    else:
        # Fallback: read the log-hyperparameter vector directly
        hyperparams = gpr.kernel_.theta
        amplitude = np.exp(hyperparams[0]/2) if len(hyperparams)>=1 else 1.0
        length_scale = np.exp(hyperparams[1]) if len(hyperparams)>=2 else 1.0
    
    print(f"Optimized amplitude: {amplitude:.4f}")
    print(f"Optimized length_scale: {length_scale:.4f}")

    # 3. Prediction with uncertainty estimation
    # Create prediction grid (extend beyond training data to show extrapolation)
    x_pred = np.linspace(0, 12, 200).reshape(-1, 1)  # 200 test points (0-12)
    y_pred, y_std = gpr.predict(x_pred, return_std=True)

    print(f"\n=== Prediction Results ===")
    print(f"Prediction grid shape: {x_pred.shape} (0 to 12)")
    print(f"Mean prediction std (0-10): {np.mean(y_std[x_pred.ravel() <= 10]):.4f}")
    print(f"Mean prediction std (10-12): {np.mean(y_std[x_pred.ravel() > 10]):.4f} (extrapolation)")

    # 4. Visualization: GPR with uncertainty
    fig, ax = plt.subplots(figsize=(12, 8))

    # Plot training data points
    ax.scatter(
        x, y,
        label='Training Data',
        alpha=0.8,
        s=60,
        color='gray',
        edgecolors='black',
        linewidth=0.5
    )

    # Plot true underlying function (for reference)
    ax.plot(
        x_pred, np.sin(x_pred),
        label='True Function (sin(x))',
        color='black',
        linewidth=2,
        linestyle='--'
    )

    # Plot GPR mean prediction
    ax.plot(
        x_pred, y_pred,
        label='GPR Mean Prediction',
        color='red',
        linewidth=2
    )

    # Plot 95% predictive interval (±2σ)
    ax.fill_between(
        x_pred.ravel(),
        y_pred - 2 * y_std,
        y_pred + 2 * y_std,
        alpha=0.2,
        color='red',
        label='95% Predictive Interval'
    )

    # Highlight extrapolation region (10-12)
    ax.axvspan(10, 12, alpha=0.1, color='orange', label='Extrapolation Region')

    # Plot styling and annotations
    ax.set_title('Gaussian Process Regression (Non-linear Fitting with Uncertainty)', fontsize=14, pad=10)
    ax.set_xlabel('x (Feature)', fontsize=12)
    ax.set_ylabel('y (Response)', fontsize=12)
    ax.legend(fontsize=10, loc='upper right')
    ax.set_xlim(-0.5, 12.5)
    ax.set_ylim(-1.5, 1.5)

    plt.tight_layout()
    plt.show()


# Run the function
if __name__ == "__main__":
    gaussian_process()

16.10 狄利克雷过程和中国餐馆

核心概念

狄利克雷过程(DP)是非参数贝叶斯的核心,"中国餐馆过程" 是 DP 的直观比喻:

  • 餐馆有无限张桌子(对应混合模型的成分);
  • 新顾客(数据点)要么坐到已有桌子(加入已有成分),要么新开桌子(新增成分);
  • 坐到已有桌子的概率与该桌已有人数(成分的权重)成正比,新开桌子的概率与集中参数 α 成正比,下面的最小模拟代码直观展示了这一机制。
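
下面给出一个最小的 CRP 模拟示意(函数名 chinese_restaurant_process 以及 alpha、n_customers 等参数均为本文演示自行设定,并非书中或某个库的固定接口),可以直观看到 "人多的桌子更容易继续变大" 这种 "富者愈富" 效应:

python 复制代码
import numpy as np

def chinese_restaurant_process(n_customers=100, alpha=2.0, seed=42):
    """模拟中国餐馆过程:返回每位顾客的桌号(聚类分配)和各桌人数。"""
    rng = np.random.default_rng(seed)
    tables = []        # tables[k] = 第 k 张桌子当前的人数
    assignments = []   # 每位顾客被分到的桌号
    for _ in range(n_customers):
        # 坐到已有桌子的概率 ∝ 该桌人数,新开桌子的概率 ∝ alpha
        weights = np.array(tables + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(tables):
            tables.append(1)       # 新开一张桌子(新增成分)
        else:
            tables[choice] += 1    # 加入已有桌子(已有成分)
        assignments.append(choice)
    return assignments, tables

if __name__ == "__main__":
    assignments, tables = chinese_restaurant_process(100, alpha=2.0)
    print(f"100 位顾客共开了 {len(tables)} 张桌子,每桌人数:{tables}")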

16.11 本征狄利克雷分配(LDA)

核心概念

LDA(Latent Dirichlet Allocation,常译作 "潜在狄利克雷分配")是以狄利克雷分布为先验的主题模型,常用于文本的主题发现,核心是 "文档 - 主题 - 单词" 的三层贝叶斯生成模型:

  1. 每个文档对应一个主题分布(狄利克雷先验);
  2. 每个主题对应一个单词分布(狄利克雷先验);
  3. 文档中的每个单词由 "选主题→选单词" 两步生成(下面先用几行代码模拟这一生成过程,再给出完整的主题建模代码)。
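
下面是按上述生成过程采样一篇 "玩具文档" 的最小示意(n_topics、vocab_size、alpha、beta 等取值以及函数名 lda_generate_doc 都是本文为演示随手设定的,仅用于理解模型,不是推断代码):

python 复制代码
import numpy as np

def lda_generate_doc(n_words=20, n_topics=3, vocab_size=8, alpha=0.5, beta=0.1, seed=0):
    """按 LDA 的生成过程采样一篇文档,返回文档的主题分布和单词索引序列。"""
    rng = np.random.default_rng(seed)
    # 每个主题的单词分布:K 个 Dirichlet(beta) 样本
    topic_word = rng.dirichlet([beta] * vocab_size, size=n_topics)
    # 该文档的主题分布:一个 Dirichlet(alpha) 样本
    doc_topic = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topic)        # 第一步:选主题
        w = rng.choice(vocab_size, p=topic_word[z])  # 第二步:按主题选单词
        words.append(int(w))
    return doc_topic, words

if __name__ == "__main__":
    doc_topic, words = lda_generate_doc()
    print("文档的主题分布:", np.round(doc_topic, 3))
    print("生成的单词索引:", words)

真实的 LDA 建模(见下面的完整代码)做的是反向推断:从观测到的单词反推文档的主题分布和主题的单词分布。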

完整代码(LDA 主题建模)

python 复制代码
# Import required libraries
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize, approx_fprime
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import warnings

warnings.filterwarnings('ignore')  # Suppress LDA convergence warnings

# Basic matplotlib configuration (no Chinese font dependency)
plt.rcParams['axes.unicode_minus'] = False  # Fix minus sign display
plt.rcParams['axes.facecolor'] = 'white'  # White background
plt.style.use('seaborn-v0_8-whitegrid')  # Clean plot style
plt.rcParams['figure.figsize'] = (12, 10)  # Default figure size


def lda_topic_modeling():
    # 1. Load and preprocess data (20 Newsgroups)
    print("=== Data Loading ===")
    # Select 3 distinct categories for clear topic separation
    categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball']
    print(f"Selected categories: {categories}")

    # Fetch data (remove headers/footers/quotes to focus on content)
    data = fetch_20newsgroups(
        subset='all',
        categories=categories,
        remove=('headers', 'footers', 'quotes'),
        random_state=42
    )

    # Use first 100 documents for faster training
    n_docs = 100
    texts = data.data[:n_docs]
    true_labels = data.target[:n_docs]

    # Print basic data information
    print(f"Total documents loaded: {len(data.data)} (using first {n_docs})")
    print(f"Number of categories: {len(categories)}")
    print(
        f"Document length range: {min(len(t) for t in texts[:10])} - {max(len(t) for t in texts[:10])} chars (first 10 docs)")

    # 2. Text vectorization (Bag-of-Words with frequency counts)
    print(f"\n=== Text Vectorization ===")
    vectorizer = CountVectorizer(
        max_features=1000,  # Keep top 1000 most frequent words
        stop_words='english',  # Remove English stop words (the, a, etc.)
        lowercase=True,  # Convert all text to lowercase
        token_pattern=r'\b[a-zA-Z]{3,}\b'  # Keep words with ≥3 letters
    )

    # Fit vectorizer and transform text to document-term matrix
    X = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()

    # Print vectorization results
    print(f"Document-term matrix shape: {X.shape} (documents × vocabulary)")
    print(f"Vocabulary size: {len(feature_names)} (top 1000 words)")
    print(f"Top 5 words in vocabulary: {feature_names[:5]}")

    # 3. LDA Topic Modeling
    print(f"\n=== LDA Model Training ===")
    n_topics = 3  # Match number of categories
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42,
        max_iter=100,  # Sufficient iterations for convergence
        learning_method='batch',  # Batch learning for better accuracy
        evaluate_every=10,  # Evaluate perplexity every 10 iterations
        perp_tol=1e-3,  # Perplexity tolerance for convergence
        n_jobs=-1  # Use all CPU cores for faster training
    )

    # Fit LDA model to document-term matrix
    lda.fit(X)

    # Print model convergence metrics
    print(f"Final perplexity score: {lda.perplexity(X):.2f} (lower = better)")
    print(f"Number of iterations: {lda.n_iter_}")

    # 4. Analyze and visualize topic results
    # Function to print top words for each topic
    def print_top_words(model, feature_names, n_top_words=10):
        """
        Print top N words for each LDA topic (sorted by importance)
        """
        print(f"\n=== Top {n_top_words} Words per Topic ===")
        for topic_idx, topic in enumerate(model.components_):
            # Normalize topic distribution to probabilities
            topic_dist = topic / topic.sum()
            # Get indices of top N words
            top_word_indices = topic.argsort()[:-n_top_words - 1:-1]
            # Get top words and their probabilities
            top_words = [(feature_names[i], topic_dist[i]) for i in top_word_indices]

            # Print topic information
            print(f"\nTopic {topic_idx + 1}:")
            for word, prob in top_words:
                print(f"  {word:<12} (prob: {prob:.4f})")

    # Print top 10 words for each topic
    print_top_words(lda, feature_names, 10)

    # 5. Get document-topic distributions
    doc_topic_dist = lda.transform(X)
    # Assign each document to the most probable topic
    doc_topic_assignments = np.argmax(doc_topic_dist, axis=1)

    # Print topic assignment statistics
    print(f"\n=== Topic Assignment Results ===")
    for topic_idx in range(n_topics):
        n_docs_assigned = np.sum(doc_topic_assignments == topic_idx)
        print(f"Topic {topic_idx + 1}: {n_docs_assigned} documents ({n_docs_assigned / n_docs * 100:.1f}%)")

    # 6. Visualize topic-word distributions (top 8 words per topic)
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    n_vis_words = 8

    for topic_idx, ax in enumerate(axes):
        # Get top words for current topic
        topic = lda.components_[topic_idx]
        top_word_indices = topic.argsort()[:-n_vis_words - 1:-1]
        top_words = [feature_names[i] for i in top_word_indices]
        top_probs = topic[top_word_indices] / topic.sum()

        # Create horizontal bar plot
        ax.barh(range(n_vis_words), top_probs[::-1], color=f'C{topic_idx}')
        ax.set_yticks(range(n_vis_words))
        ax.set_yticklabels(top_words[::-1])
        ax.set_title(f'Topic {topic_idx + 1} (Top {n_vis_words} Words)', fontsize=12)
        ax.set_xlabel('Word Probability', fontsize=10)
        ax.set_xlim(0, max(top_probs) * 1.1)

    # Add main title
    fig.suptitle('LDA Topic Modeling: Word Distributions for Each Topic', fontsize=14, y=0.99)
    plt.tight_layout()
    plt.show()

    # 7. Visualize document-topic distribution (first 20 documents)
    fig, ax = plt.subplots(figsize=(12, 8))
    n_docs_vis = 20
    doc_indices = range(n_docs_vis)

    # Plot stacked bar chart of topic probabilities
    # Note: LDA topic indices are not tied to the original category names,
    # so topics are labeled generically here
    ax.bar(doc_indices, doc_topic_dist[:n_docs_vis, 0], label='Topic 1', color='C0')
    ax.bar(doc_indices, doc_topic_dist[:n_docs_vis, 1], bottom=doc_topic_dist[:n_docs_vis, 0],
           label='Topic 2', color='C1')
    ax.bar(doc_indices, doc_topic_dist[:n_docs_vis, 2],
           bottom=doc_topic_dist[:n_docs_vis, 0] + doc_topic_dist[:n_docs_vis, 1], label='Topic 3',
           color='C2')

    ax.set_title(f'Document-Topic Distribution (First {n_docs_vis} Documents)', fontsize=14)
    ax.set_xlabel('Document Index', fontsize=12)
    ax.set_ylabel('Topic Probability', fontsize=12)
    ax.legend(fontsize=10)
    ax.set_xticks(doc_indices)
    ax.set_ylim(0, 1.05)

    plt.tight_layout()
    plt.show()


# Run the function
if __name__ == "__main__":
    lda_topic_modeling()

16.12 贝塔过程和印度自助餐

核心概念

贝塔过程(BP)是二值特征的非参数先验,"印度自助餐过程(IBP)" 是 BP 的直观比喻:

  • 自助餐有无限道菜(对应特征);
  • 第 n 位顾客(数据点)以 "之前选过这道菜的人数 / n" 的概率品尝已有的每道菜(特征),并额外尝试约 Poisson(α/n) 道新菜;
  • 适合建模高维稀疏的二值特征(比如用户 - 物品交互),下面的模拟代码展示了由此得到的稀疏二值特征矩阵。
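
下面给出一个最小的 IBP 模拟示意(函数名 indian_buffet_process 及 alpha、n_customers 等参数均为本文演示自行设定),输出的 0/1 矩阵每行是一位顾客、每列是一道菜(特征):

python 复制代码
import numpy as np

def indian_buffet_process(n_customers=10, alpha=2.0, seed=42):
    """模拟印度自助餐过程:返回 顾客 × 菜品 的二值特征矩阵。"""
    rng = np.random.default_rng(seed)
    dish_counts = []   # dish_counts[k] = 已品尝过第 k 道菜的顾客数
    rows = []          # 每位顾客的选菜记录
    for n in range(1, n_customers + 1):
        # 旧菜:以 m_k / n 的概率品尝
        row = [rng.random() < m / n for m in dish_counts]
        for k, taken in enumerate(row):
            if taken:
                dish_counts[k] += 1
        # 新菜:数量服从 Poisson(alpha / n)
        n_new = rng.poisson(alpha / n)
        dish_counts.extend([1] * n_new)
        row.extend([True] * n_new)
        rows.append(row)
    # 对齐成矩形的 0/1 矩阵(后出现的菜在更早的顾客处补 0)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

if __name__ == "__main__":
    Z = indian_buffet_process(10, alpha=2.0)
    print(f"特征矩阵形状:{Z.shape}(10 位顾客 × {Z.shape[1]} 道菜)")
    print(Z)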

16.13 注释

1.本文所有代码基于 Python 3.8+,依赖库:numpy、scipy、matplotlib、seaborn、scikit-learn;

2.贝叶斯估计的核心是 "概率思维",而非复杂公式,建议新手先理解共轭先验和后验更新的逻辑,再去推导具体公式;

3.非参数贝叶斯(DP、GP、IBP)是进阶内容,建议先掌握参数贝叶斯再深入。

16.14 习题

  1. 修改 16.2.2 的贝塔分布代码,尝试 α=0.5, β=0.5 的先验(杰弗里斯先验),观察后验变化;
  2. 基于 16.3.1 的代码,尝试未知方差、已知均值的高斯分布贝叶斯估计;
  3. 用贝叶斯逻辑回归解决鸢尾花数据集的二分类问题;
  4. 对比贝叶斯 GMM 和普通 GMM 在少数据场景下的聚类效果。

16.15 参考文献

  1. 《机器学习导论》(Ethem Alpaydin 著)第 16 章;
  2. 《Pattern Recognition and Machine Learning》(Bishop 著)第 2、3、6 章;
  3. scikit-learn 官方文档:https://scikit-learn.org/stable/

总结

1.贝叶斯估计的核心是 "先验 + 似然→后验",共轭先验能大幅简化计算,是新手入门的关键;

2.离散分布用贝塔 / 狄利克雷先验,高斯分布用正态 / 逆伽马 / 逆威沙特先验,函数参数用高斯过程 / 贝叶斯回归;

3.贝叶斯的优势是能量化不确定性(置信区间、参数分布),模型比较和非参数建模是进阶方向。

所有代码均可直接复制运行(Mac 系统),建议动手修改参数,直观感受贝叶斯估计的魅力!如果有问题,欢迎在评论区交流~
