Section 22: Correlation Analysis (Covariance, Correlation Coefficients, and Heatmap Interpretation)

- Learning objectives
- Why learn this
- Core concepts
  - 1. Covariance
    - 1.1 Formula
    - 1.2 Limitations of covariance
  - 2. Pearson correlation coefficient
    - 2.1 Formula
    - 2.2 Interpreting the coefficient
    - 2.3 Limitations of Pearson
  - 3. Spearman rank correlation
    - 3.1 Why Spearman?
  - 4. Kendall rank correlation
  - 5. Significance testing
    - 5.1 What the p-value means
  - 6. Reading correlation heatmaps
    - 6.1 A standard correlation heatmap
    - 6.2 Heatmap reading techniques
  - 7. Multicollinearity detection (VIF)
  - 8. Partial correlation
- Exercises
  - Exercise 1: E-commerce data analysis
  - Exercise 2: Demonstrating that correlation is not causation
- Summary
- Next section

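Learning objectives

After completing this section, you will be able to:

- Explain what covariance is, how it is computed, and where it falls short
- Use three correlation coefficients: Pearson, Spearman, and Kendall
- Interpret a correlation coefficient mathematically and in practice
- Test whether a correlation is statistically significant (p-value test)
- Read a correlation heatmap correctly and avoid common misreadings
- Apply the principle that correlation does not imply causation
- Use correlation analysis for feature selection and multicollinearity detection
- Apply correlation analysis flexibly in business scenarios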

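Why learn this

In data analysis we keep asking questions like:

- "Is ad spend related to sales?"
- "Is customer age correlated with purchase amount?"
- "Are these two metrics redundant?"

All of these ask the same thing: how strongly are two variables related?

Correlation analysis is the standard way to answer that question. It is a core data analysis skill and an essential tool in machine learning feature selection, credit risk modeling, market research, and similar fields.

As an analogy, correlation analysis is like measuring how well two people move in step. You can say they are "close" or "barely acquainted", but how close exactly? The correlation coefficient quantifies this with a single number between -1 and 1: the sign gives the direction and the absolute value (0 to 1) gives the strength.

There is one serious trap: correlation is not causation. Ice cream sales and drowning counts are highly correlated, so does eating ice cream cause drowning? Of course not. The real explanation is that summer has arrived (the common factor is rising temperature). This is a point the section stresses.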

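Core concepts

1. Covariance

Covariance is the "raw" form of correlation: it measures the direction and magnitude with which two variables vary together.

1.1 Formula

Cov(X, Y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)

Put simply: if Y tends to increase when X increases, the covariance is positive; if Y tends to decrease when X increases, it is negative. Dividing by n - 1 gives the sample covariance, which is also what np.cov computes by default.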

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid', font_scale=1.05)

# Simulated data
np.random.seed(42)
n = 100
X = np.random.normal(50, 10, n)
Y = 2 * X + np.random.normal(0, 5, n)  # Y is positively correlated with X

# Sample covariance computed by hand
def covariance(x, y):
    return np.sum((x - np.mean(x)) * (y - np.mean(y))) / (len(x) - 1)

cov_manual = covariance(X, Y)
cov_numpy = np.cov(X, Y)[0, 1]

print("=" * 60)
print("Covariance")
print("=" * 60)
print(f"Manual: {cov_manual:.2f}")
print(f"NumPy:  {cov_numpy:.2f}")

# Covariance matrix (variances on the diagonal, covariance off the diagonal)
cov_matrix = np.cov(X, Y)
print(f"\nCovariance matrix:\n{cov_matrix}")

# Visualization
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X, Y, alpha=0.6, color='#1E88E5', s=50)
ax.set_title(f'Positive correlation (covariance = {cov_numpy:.2f})', fontsize=14)
ax.set_xlabel('X')
ax.set_ylabel('Y = 2X + noise')
plt.tight_layout()
plt.show()

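1.2 Limitations of covariance

Covariance has one big problem: its magnitude depends on the units of the variables. For example, the covariance between height and weight is 100 times larger when height is measured in centimeters than when it is measured in meters. This makes covariances hard to compare directly.

# Effect of units on covariance
height_cm = np.array([165, 170, 175, 180, 185])
height_m = height_cm / 100
weight = np.array([55, 60, 65, 70, 75])

print(f"Covariance of height (cm) and weight: {np.cov(height_cm, weight)[0, 1]:.2f}")
print(f"Covariance of height (m) and weight:  {np.cov(height_m, weight)[0, 1]:.4f}")
print("\nThe two differ by a factor of 100: covariance depends on units, so it is hard to compare.")

This is why we need the correlation coefficient: it removes the effect of units, so the strength of a relationship can be compared directly across different pairs of variables.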
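
As a quick check of that claim, the sketch below uses a small set of made-up height and weight values (illustrative numbers, not data from this article). It rescales height from centimeters to meters and compares the covariance with the Pearson coefficient from np.corrcoef, which the next section introduces.

```python
import numpy as np

# Illustrative, made-up measurements (not from the article's dataset)
height_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weight_kg = np.array([52.0, 61.0, 63.0, 72.0, 74.0, 83.0])
height_m = height_cm / 100  # same heights, different unit

cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
r_m = np.corrcoef(height_m, weight_kg)[0, 1]

print(f"covariance ratio (cm / m): {cov_cm / cov_m:.1f}")
print(f"Pearson r (cm): {r_cm:.6f}")
print(f"Pearson r (m):  {r_m:.6f}")
```

The covariance changes by the unit conversion factor, while the two correlation values agree, which is the property that makes correlation coefficients comparable across variables.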

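2. Pearson correlation coefficient

The Pearson correlation coefficient is the most widely used correlation coefficient. It measures the strength of the linear relationship between two variables.

2.1 Formula

r = Cov(X, Y) / (σ_X * σ_Y)

It is simply the covariance divided by the product of the two standard deviations, which cancels out the units.

# Pearson correlation coefficient
def pearson_correlation(x, y):
    """Compute the Pearson correlation coefficient by hand."""
    cov = np.sum((x - np.mean(x)) * (y - np.mean(y))) / (len(x) - 1)
    # ddof=1 so the standard deviations use the same n - 1 as the covariance
    return cov / (np.std(x, ddof=1) * np.std(y, ddof=1))

r_manual = pearson_correlation(X, Y)
r_numpy = np.corrcoef(X, Y)[0, 1]

print(f"Pearson r (manual): {r_manual:.4f}")
print(f"Pearson r (NumPy):  {r_numpy:.4f}")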

2.2 Interpreting the coefficient

┌────────────────────────────────────────────────────────────┐
│  Pearson correlation coefficient: interpretation guide     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Range of r: [-1, 1]                                       │
│                                                            │
│  |r| >= 0.8        ──► very strong                         │
│  0.6 <= |r| < 0.8  ──► strong                              │
│  0.4 <= |r| < 0.6  ──► moderate                            │
│  0.2 <= |r| < 0.4  ──► weak                                │
│  |r| < 0.2         ──► very weak or none                   │
│                                                            │
│  r > 0: positive (Y tends to increase as X increases)      │
│  r < 0: negative (Y tends to decrease as X increases)      │
│  r = 0: no linear correlation (may still be non-linear!)   │
│                                                            │
└────────────────────────────────────────────────────────────┘
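
The thresholds in the guide above can be wrapped in a small helper so that reports label coefficients consistently. This is only a sketch: `describe_r` is a hypothetical name introduced here, and the cut-offs are the rule-of-thumb values from the guide rather than a formal standard.

```python
def describe_r(r: float) -> tuple[str, str]:
    """Return (strength, direction) for a Pearson r, using the guide's cut-offs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")

    strength = abs(r)
    if strength >= 0.8:
        label = "very strong"
    elif strength >= 0.6:
        label = "strong"
    elif strength >= 0.4:
        label = "moderate"
    elif strength >= 0.2:
        label = "weak"
    else:
        label = "very weak or none"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return label, direction


for value in (0.93, -0.65, 0.45, -0.25, 0.05):
    label, direction = describe_r(value)
    print(f"r = {value:+.2f}: {label}, direction: {direction}")
```

A label like this is a reading aid only; what counts as a "strong" correlation varies by field, so the cut-offs are worth adjusting to the domain.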

2.3 Limitations of Pearson

Pearson only detects linear relationships. When the relationship is non-linear, Pearson r can be close to 0, or much weaker than the true dependence, even though the two variables are strongly related.

# Limitation of Pearson: it misses non-linear relationships
np.random.seed(42)
n = 200

# Scenario 1: linear relationship
x1 = np.random.uniform(-5, 5, n)
y1 = 2 * x1 + np.random.normal(0, 1, n)

# Scenario 2: parabola (non-linear)
x2 = np.random.uniform(-5, 5, n)
y2 = x2**2 + np.random.normal(0, 2, n)

# Scenario 3: sine wave (non-linear)
x3 = np.random.uniform(-5, 5, n)
y3 = np.sin(x3) + np.random.normal(0, 0.2, n)

# Scenario 4: X shape (no linear correlation, but clear structure)
x4 = np.random.uniform(-3, 3, n)
y4 = np.random.choice([-1, 1], n) * np.abs(x4) + np.random.normal(0, 0.3, n)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for ax, x, y, title in zip(axes.flatten(),
                            [x1, x2, x3, x4],
                            [y1, y2, y3, y4],
                            [f'Linear (r = {np.corrcoef(x1, y1)[0,1]:.3f})',
                             f'Parabola (r = {np.corrcoef(x2, y2)[0,1]:.3f})',
                             f'Sine (r = {np.corrcoef(x3, y3)[0,1]:.3f})',
                             f'X shape (r = {np.corrcoef(x4, y4)[0,1]:.3f})']):
    ax.scatter(x, y, alpha=0.5, color='#1E88E5', s=40)
    ax.set_title(title, fontsize=13, fontweight='bold')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key findings:")
print("  - Linear: Pearson r reflects the relationship accurately")
print("  - Parabola: strong dependence, yet Pearson r is close to 0")
print("  - Sine: strong dependence, yet Pearson r stays weak")
print("  - X shape: Pearson r is close to 0 despite clear structure")
print("\nConclusion: Pearson only measures linear association. When r is small, always check a scatter plot.")

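3. Spearman rank correlation

The Spearman coefficient first converts the data to ranks and then computes the Pearson correlation of those ranks. It therefore measures monotonic relationships, not only linear ones.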
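
The "rank first, then Pearson" definition can be verified directly. The following sketch, on made-up data with an exponential trend (the data and seed are illustrative choices), ranks both variables with scipy.stats.rankdata, applies pearsonr to the ranks, and compares the result with spearmanr.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 50)
y = np.exp(x / 20) + rng.normal(0, 1, 50)  # monotonic trend, clearly non-linear

# Step 1: replace each value by its rank (ties would receive the average rank)
x_rank = rankdata(x)
y_rank = rankdata(y)

# Step 2: ordinary Pearson correlation computed on the ranks
rho_manual, _ = pearsonr(x_rank, y_rank)
rho_scipy, _ = spearmanr(x, y)

print(f"Pearson on ranks: {rho_manual:.6f}")
print(f"scipy spearmanr:  {rho_scipy:.6f}")
```

Because only the ordering of the values enters the calculation, any strictly increasing transformation of x or y leaves the Spearman coefficient unchanged.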

3.1 Why Spearman?

Spearman is a good fit when:

- The relationship is monotonic but not necessarily linear
- The data contain extreme outliers (working on ranks makes Spearman much less sensitive to them than Pearson)
- The variables are ordinal (for example, satisfaction scores from 1 to 5)

from scipy.stats import spearmanr, pearsonr

# Spearman vs Pearson
print("=" * 60)
print("Spearman vs Pearson")
print("=" * 60)

# Scenario: monotonic non-linear relationship
np.random.seed(42)
x = np.random.uniform(1, 100, 200)
y = np.log(x) + np.random.normal(0, 0.1, 200)

pearson_r, pearson_p = pearsonr(x, y)
spearman_r, spearman_p = spearmanr(x, y)

print("\nMonotonic non-linear relationship (Y = log(X) + noise)")
print(f"Pearson r:  {pearson_r:.4f} (p-value: {pearson_p:.6f})")
print(f"Spearman r: {spearman_r:.4f} (p-value: {spearman_p:.6f})")

# Scenario with an outlier
x_clean = np.random.normal(50, 10, 100)
y_clean = 2 * x_clean + np.random.normal(0, 5, 100)

# Add one extreme outlier
x_dirty = np.append(x_clean, [200])
y_dirty = np.append(y_clean, [50])

pearson_r_dirty, _ = pearsonr(x_dirty, y_dirty)
spearman_r_dirty, _ = spearmanr(x_dirty, y_dirty)

print("\nData with one outlier:")
print(f"Pearson r:  {pearson_r_dirty:.4f}")
print(f"Spearman r: {spearman_r_dirty:.4f}")

print("\nConclusion: Spearman is more robust to outliers and captures monotonic non-linear relationships better")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Monotonic non-linear relationship
axes[0].scatter(x, y, alpha=0.6, color='#1E88E5')
axes[0].plot(np.sort(x), np.log(np.sort(x)), 'r--', linewidth=2, label='log(x)')
axes[0].set_title('Monotonic non-linear relationship', fontsize=13)
axes[0].legend()

# Effect of the outlier
axes[1].scatter(x_clean, y_clean, alpha=0.6, color='#1E88E5', label='Regular data')
axes[1].scatter([200], [50], color='red', s=100, zorder=5, label='Outlier')
axes[1].set_title('Effect of an outlier on Pearson', fontsize=13)
axes[1].legend()

plt.tight_layout()
plt.show()

4. Kendall rank correlation

Kendall's tau is computed from the proportion of concordant and discordant pairs. It suits small samples and data with many tied values.

from scipy.stats import kendalltau

# Kendall's tau
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]  # each adjacent pair is swapped; the overall order is preserved

tau, p_value = kendalltau(x, y)

print("Kendall's tau example:")
print(f"X: {x}")
print(f"Y: {y}")
print(f"Kendall tau: {tau:.4f} (p-value: {p_value:.4f})")

# Comparing the three coefficients
np.random.seed(42)
x_large = np.random.normal(50, 15, 500)
y_large = x_large + np.random.normal(0, 10, 500)

pearson_r, pearson_p = pearsonr(x_large, y_large)
spearman_r, spearman_p = spearmanr(x_large, y_large)
kendall_r, kendall_p = kendalltau(x_large, y_large)

print("\nComparison of the three coefficients (n = 500):")
print(f"Pearson r:   {pearson_r:.4f} (p-value: {pearson_p:.6f})")
print(f"Spearman r:  {spearman_r:.4f} (p-value: {spearman_p:.6f})")
print(f"Kendall tau: {kendall_r:.4f} (p-value: {kendall_p:.6f})")

print("\nSelection guide:")
print("  - Pearson: linear relationships; its p-value assumes roughly normal data")
print("  - Spearman: outliers present, or a monotonic but not necessarily linear relationship")
print("  - Kendall: small samples, or many tied values (ordinal data)")
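
To make the concordant/discordant idea concrete, the sketch below counts the pairs by hand for the ten-point example and compares the result with scipy's kendalltau. It uses the simple tau-a formula, (concordant - discordant) / total pairs, which coincides with scipy's default only because this example has no ties.

```python
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    product = (x[i] - x[j]) * (y[i] - y[j])
    if product > 0:
        concordant += 1   # x and y order this pair the same way
    elif product < 0:
        discordant += 1   # x and y disagree on the order of this pair

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs  # tau-a, fine here: no ties
tau_scipy, _ = kendalltau(x, y)

print(f"concordant = {concordant}, discordant = {discordant}, pairs = {n_pairs}")
print(f"manual tau = {tau_manual:.4f}, scipy tau = {tau_scipy:.4f}")
```

Of the 45 pairs, only the 5 swapped neighbors are discordant, so tau = (40 - 5) / 45 ≈ 0.7778. With tied values the denominator needs a correction (tau-b), which is what scipy applies by default.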

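5. Significance testing

The correlation coefficient tells you how strong a relationship is in the sample; statistical significance tells you whether a value that large could plausibly have arisen by chance alone.

5.1 What the p-value means

Hypothesis testing framework:
  H0 (null hypothesis): the true correlation between the two variables is 0 (no correlation)
  H1 (alternative hypothesis): the true correlation is not 0

p < 0.05: reject H0; the correlation is "statistically significant"
p >= 0.05: fail to reject H0; the correlation is "not significant"

from scipy.stats import pearsonr

np.random.seed(42)

# Scenario 1: large sample + weak correlation
n1 = 10000
x1 = np.random.normal(0, 1, n1)
y1 = 0.1 * x1 + np.random.normal(0, 1, n1)  # very weak correlation

# Scenario 2: small sample + strong correlation
n2 = 20
x2 = np.random.normal(0, 1, n2)
y2 = 3 * x2 + np.random.normal(0, 1, n2)    # very strong correlation

r1, p1 = pearsonr(x1, y1)
r2, p2 = pearsonr(x2, y2)

print("=" * 60)
print("Correlation and significance")
print("=" * 60)
print("\nScenario 1 (large sample + weak correlation):")
print(f"  r = {r1:.4f}, p-value = {p1:.6f}")
print(f"  {'Conclusion: r is small, but with this many samples it is significant ✓' if p1 < 0.05 else 'Conclusion: not significant'}")

print("\nScenario 2 (small sample + strong correlation):")
print(f"  r = {r2:.4f}, p-value = {p2:.6f}")
print(f"  {'Conclusion: strong and significant ✓' if p2 < 0.05 else 'Conclusion: not significant (sample too small)'}")

print("\nKey points:")
print("  - With large samples, even a tiny correlation can be 'significant' (statistically significant != practically important)")
print("  - With small samples, even a strong correlation can be 'not significant' (low statistical power)")
print("  - Always read r and the p-value together!")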
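
It can help to see where the Pearson p-value comes from. Under H0, and assuming roughly normal data, the statistic t = r * sqrt((n - 2) / (1 - r²)) follows a t distribution with n - 2 degrees of freedom. The sketch below, on made-up data, computes the two-sided p-value from that statistic and compares it with the value reported by scipy.stats.pearsonr.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)

r, p_scipy = stats.pearsonr(x, y)

# Test statistic for H0: true correlation = 0
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.4f}, t = {t_stat:.4f}")
print(f"p (manual t-test) = {p_manual:.6f}")
print(f"p (scipy)         = {p_scipy:.6f}")
```

The formula also explains the sample-size effect shown above: for a fixed r, the statistic grows with sqrt(n - 2), so a large enough sample makes even a tiny r significant.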

6. Reading correlation heatmaps

6.1 A standard correlation heatmap

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Column names are kept in Chinese; if they render as empty boxes, point
# plt.rcParams['font.sans-serif'] at a font that supports CJK characters.

# Simulated business data
np.random.seed(42)
n = 500
df = pd.DataFrame({
    '广告投入': np.random.uniform(10, 100, n),      # ad spend
    '网站流量': 0,                                  # site traffic (placeholder, filled in below)
    '注册人数': 0,                                  # sign-ups (placeholder)
    '购买人数': 0,                                  # buyers (placeholder)
    '客单价': np.random.normal(200, 40, n),         # average order value
    '总营收': 0,                                    # total revenue (placeholder)
    '客户满意度': np.random.uniform(3, 5, n),       # customer satisfaction
    '复购率': 0                                     # repeat purchase rate (placeholder)
})

# Generate columns that depend on one another
df['网站流量'] = df['广告投入'] * 5 + np.random.normal(0, 100, n)
df['注册人数'] = df['网站流量'] * 0.08 + np.random.normal(0, 50, n)
df['购买人数'] = df['注册人数'] * 0.3 + np.random.normal(0, 20, n)
df['总营收'] = df['购买人数'] * df['客单价'] + np.random.normal(0, 10000, n)
df['复购率'] = 0.3 + 0.1 * df['客户满意度'] / 5 + np.random.normal(0, 0.05, n)

# Correlation matrix
corr_pearson = df.corr(method='pearson')

fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Panel 1: Pearson heatmap (lower triangle only; the mask hides the
# redundant upper triangle and the diagonal)
mask_upper = np.triu(np.ones_like(corr_pearson, dtype=bool))
sns.heatmap(corr_pearson, annot=True, cmap='coolwarm', center=0,
            mask=mask_upper, fmt='.2f', linewidths=1,
            ax=axes[0], vmin=-1, vmax=1,
            cbar_kws={'label': 'Pearson r'})
axes[0].set_title('Pearson correlation heatmap', fontsize=16, fontweight='bold')

# Panel 2: Spearman heatmap
corr_spearman = df.corr(method='spearman')
sns.heatmap(corr_spearman, annot=True, cmap='coolwarm', center=0,
            mask=mask_upper, fmt='.2f', linewidths=1,
            ax=axes[1], vmin=-1, vmax=1,
            cbar_kws={'label': 'Spearman rho'})
axes[1].set_title('Spearman correlation heatmap', fontsize=16, fontweight='bold')

plt.tight_layout()
plt.show()

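6.2 Heatmap reading techniques

# Automatically list strongly correlated feature pairs
def find_strong_correlations(corr_matrix, threshold=0.7):
    """Return the variable pairs whose |r| is at or above the threshold."""
    pairs = []
    columns = corr_matrix.columns
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):  # upper triangle: each pair once
            r = corr_matrix.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append({
                    'var_1': columns[i],
                    'var_2': columns[j],
                    'r': round(r, 3),
                    'direction': 'positive' if r > 0 else 'negative'
                })
    return pd.DataFrame(pairs)

print("=" * 60)
print("Strongly correlated pairs (|r| >= 0.7)")
print("=" * 60)
strong_pairs = find_strong_correlations(corr_pearson, threshold=0.7)
print(strong_pairs.to_string(index=False))

# Features most correlated with the target variable
target = '总营收'
corr_with_target = corr_pearson[target].drop(target).abs().sort_values(ascending=False)

print(f"\nFeatures most correlated with '{target}' (sorted by absolute value):")
for var, r in corr_with_target.items():
    direction = '+' if corr_pearson.loc[var, target] > 0 else '-'
    print(f"  {var}: {direction}{r:.3f}")

Key points for reading a heatmap:

- Color intensity: the more saturated the color, the stronger the correlation
- Sign: with a diverging palette such as coolwarm, warm colors (red/orange) mark positive correlations and cool colors (blue) mark negative ones
- Diagonal: always 1 (each variable is perfectly correlated with itself)
- Strongly correlated feature pairs: if two features have |r| > 0.8, multicollinearity is likely; consider keeping only one of them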
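
The last point, keeping only one feature from each strongly correlated pair, can be sketched as follows. `drop_highly_correlated` is a hypothetical helper written for this illustration, the demo data are made up, and the rule it applies (drop the later column of each offending pair) is just one simple convention.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(data: pd.DataFrame, threshold: float = 0.8):
    """Return (reduced_frame, dropped_columns) based on |Pearson r| > threshold."""
    corr_abs = data.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper_mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    upper = corr_abs.where(upper_mask)
    # A column is dropped if it is too correlated with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return data.drop(columns=to_drop), to_drop


rng = np.random.default_rng(2)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.1, size=300),  # nearly duplicates "a"
    "b": rng.normal(size=300),                      # unrelated to "a"
})

reduced, dropped = drop_highly_correlated(demo, threshold=0.8)
print("dropped:", dropped)
print("kept:   ", list(reduced.columns))
```

In real projects the choice of which feature to keep is usually a business decision (data quality, cost of collection, interpretability) rather than a matter of column order.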

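7. Multicollinearity detection (VIF)

When two or more features are highly correlated, the coefficients of a regression model become unstable. The variance inflation factor (VIF) is used to detect this multicollinearity. For feature j, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Features to check (df is the DataFrame built in section 6.1)
numeric_df = df[['广告投入', '网站流量', '注册人数', '购买人数', '客单价', '客户满意度']]

# Add a constant column so each auxiliary regression has an intercept;
# copy first so numeric_df itself is left untouched
X_with_const = numeric_df.copy()
X_with_const['const'] = 1.0

vif_data = pd.DataFrame()
vif_data['feature'] = X_with_const.columns
vif_data['VIF'] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]

print("=" * 60)
print("Variance inflation factor (VIF)")
print("=" * 60)
print(vif_data.to_string(index=False))

print("\nReading VIF (rule-of-thumb thresholds):")
print("  VIF = 1:       no collinearity")
print("  1 < VIF < 5:   mild collinearity (usually acceptable)")
print("  5 <= VIF < 10: moderate collinearity (worth a closer look)")
print("  VIF >= 10:     severe collinearity (should be addressed)")

# Visualization
fig, ax = plt.subplots(figsize=(10, 5))
vif_no_const = vif_data[vif_data['feature'] != 'const']
# 'Set1' is a palette name, not a color, so expand it into one color per bar
bar_colors = sns.color_palette('Set1', len(vif_no_const))
bars = ax.barh(vif_no_const['feature'], vif_no_const['VIF'], color=bar_colors)
ax.axvline(5, color='#E53935', linestyle='--', alpha=0.7, label='VIF = 5 (watch)')
ax.axvline(10, color='red', linestyle='--', alpha=0.7, label='VIF = 10 (danger)')
ax.set_title('Multicollinearity check (VIF)', fontsize=14, fontweight='bold')
ax.set_xlabel('VIF')
ax.legend()

# Annotate each bar with its VIF value
for bar, vif in zip(bars, vif_no_const['VIF']):
    ax.text(bar.get_width() + 0.2, bar.get_y() + bar.get_height()/2,
            f'{vif:.1f}', va='center', fontsize=11)

plt.tight_layout()
plt.show()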
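
For readers who want to see what the VIF number means, here is a NumPy-only sketch of the definition VIF_j = 1 / (1 - R_j²), where R_j² is obtained by regressing feature j on all other features plus an intercept. `vif_manual` is a hypothetical helper and the three demo columns are made up; I would expect it to agree with statsmodels when a constant column is included there, but that comparison is not run here.

```python
import numpy as np
import pandas as pd


def vif_manual(data: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of `data`."""
    values = {}
    for col in data.columns:
        y = data[col].to_numpy(dtype=float)
        others = data.drop(columns=col).to_numpy(dtype=float)
        # Design matrix: intercept + all other features
        design = np.column_stack([np.ones(len(data)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        values[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(values, name="VIF")


rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.2, size=500),  # strongly tied to x1
    "x3": rng.normal(size=500),                  # independent of both
})

vif = vif_manual(demo)
print(vif.round(2))
```

The two tied columns end up with large VIF values while the independent one stays near 1, which matches the reading rules printed above.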

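8. Partial correlation

The partial correlation coefficient measures the net correlation between two variables after controlling for other variables.

# Partial correlation: the relationship between '网站流量' (site traffic) and
# '注册人数' (sign-ups) after controlling for '广告投入' (ad spend)
# Install: pip install pingouin

try:
    # Imported inside try so that a missing package is handled below
    from pingouin import partial_corr

    result = partial_corr(data=df, x='网站流量', y='注册人数', covar='广告投入')
    partial_r = result['r'].values[0]
    print("Partial correlation result:")
    print("  After controlling for '广告投入':")
    print(f"  Partial correlation of 网站流量 and 注册人数: {partial_r:.4f}")
    print(f"  p-value: {result['p-val'].values[0]:.6f}")

    # Compare with the simple correlation
    simple_r = df['网站流量'].corr(df['注册人数'])
    print(f"\n  Simple correlation with nothing controlled: {simple_r:.4f}")
    print("\n  Interpretation:")
    if abs(partial_r) < abs(simple_r):
        print("  -> The correlation drops after controlling for '广告投入', so part of the")
        print("     traffic/sign-up association is explained by ad spend")
except ImportError:
    print("Hint: install the pingouin library to run partial correlation analysis")
    print("Command: pip install pingouin")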
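
If pingouin is not available, a first-order partial correlation can also be computed with NumPy alone: regress each variable on the control variable and correlate the residuals. The sketch below uses made-up data in which x and y are both driven by a common factor z and have no direct link; `partial_corr_residual` is a hypothetical helper name.

```python
import numpy as np


def partial_corr_residual(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    design = np.column_stack([np.ones(len(z)), z])  # intercept + control variable
    # Residuals of the least-squares fits x ~ z and y ~ z
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]


rng = np.random.default_rng(4)
z = rng.normal(size=2000)               # common driver
x = 2 * z + rng.normal(size=2000)       # depends only on z
y = -1.5 * z + rng.normal(size=2000)    # depends only on z

simple_r = np.corrcoef(x, y)[0, 1]
partial_r = partial_corr_residual(x, y, z)
print(f"simple r:  {simple_r:.3f}")
print(f"partial r: {partial_r:.3f}")
```

The simple correlation is clearly negative, yet it nearly vanishes once z is controlled for. This is the same pattern as a confounded relationship: the association between x and y exists only through their shared driver.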

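Exercises

Exercise 1: E-commerce data analysis

Run a complete correlation analysis on the e-commerce dataset below:

- Compute the Pearson and Spearman correlation coefficients for all numeric variables
- Plot a correlation heatmap
- Find the 3 features most correlated with profit ('利润')
- Check for multicollinearity (VIF)
- Write up your conclusions

Reference answer:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'UV': np.random.uniform(1000, 50000, n),
    '转化率': np.random.uniform(0.01, 0.15, n),      # conversion rate
    '客单价': np.random.normal(150, 50, n),          # average order value
    '复购率': np.random.uniform(0.1, 0.6, n),        # repeat purchase rate
    '广告费用': np.random.uniform(5000, 500000, n),  # ad cost
    '客服满意度': np.random.uniform(3.0, 5.0, n)     # support satisfaction
})

# Derived metrics: orders, GMV, profit
df['订单数'] = df['UV'] * df['转化率']
df['GMV'] = df['订单数'] * df['客单价']
df['利润'] = df['GMV'] - df['广告费用']

# 1. Correlation matrices
corr = df.corr(method='pearson')
corr_spearman = df.corr(method='spearman')
target = '利润'

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 2. Heatmap (lower triangle only)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, mask=mask,
            fmt='.2f', linewidths=0.5, ax=axes[0], vmin=-1, vmax=1)
axes[0].set_title('Correlation heatmap', fontsize=14)

# Correlation with the target, excluding the columns the target is computed from
corr_target = corr[target].drop([target, 'GMV', '订单数']).sort_values(ascending=False)
colors = ['#E53935' if x > 0 else '#1E88E5' for x in corr_target.values]
axes[1].barh(corr_target.index, corr_target.values, color=colors)
axes[1].set_title(f'Correlation with {target}', fontsize=14)
axes[1].axvline(0, color='black', linewidth=0.5)

plt.tight_layout()
plt.show()

# 3. Top 3 features by absolute correlation (the + format prints the sign once)
print(f"\nTop 3 features most correlated with '{target}':")
top3 = corr_target.abs().nlargest(3).index
for i, var in enumerate(top3, start=1):
    print(f"  {i}. {var}: Pearson {corr_target[var]:+.3f}, "
          f"Spearman {corr_spearman.loc[var, target]:+.3f}")

# 4. VIF on the six base features. Derived columns are left out because
#    利润 = GMV - 广告费用 is an exact linear combination.
base = df[['UV', '转化率', '客单价', '复购率', '广告费用', '客服满意度']].assign(const=1.0)
vif = pd.Series(
    [variance_inflation_factor(base.values, i) for i in range(base.shape[1])],
    index=base.columns,
).drop('const')
print("\nVIF:")
print(vif.round(2).to_string())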

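Exercise 2: Demonstrating that correlation is not causation

Build a dataset that shows a classic case of "correlated but not causal".

Reference answer:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr  # needed for the correlation below

np.random.seed(42)
n = 365  # one year of daily data

# Hidden variable: temperature
temperature = 15 + 15 * np.sin(np.arange(n) * 2 * np.pi / 365) + np.random.normal(0, 3, n)

# Ice cream sales as a function of temperature
ice_cream = np.maximum(0, 50 * (temperature - 15) / 20 + np.random.normal(0, 10, n))

# Drownings as a function of temperature (more people swim in summer)
drowning = np.maximum(0, 5 * (temperature - 18) / 15 + np.random.normal(0, 1.5, n))

df = pd.DataFrame({
    '日期': pd.date_range('2024-01-01', periods=n),  # date
    '温度': temperature,                             # temperature
    '冰淇淋销量': ice_cream,                         # ice cream sales
    '溺水人数': drowning.astype(int)                 # drownings
})

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# The misleading correlation
r, p = pearsonr(df['冰淇淋销量'], df['溺水人数'])
axes[0].scatter(df['冰淇淋销量'], df['溺水人数'], alpha=0.5, color='#E53935', s=40)
axes[0].set_title(f'Ice cream sales vs drownings\nr = {r:.3f} (spurious!)', fontsize=13)
axes[0].set_xlabel('Ice cream sales')
axes[0].set_ylabel('Drownings')
axes[0].grid(True, alpha=0.3)

# The real story: temperature drives both
axes[1].scatter(df['温度'], df['冰淇淋销量'], alpha=0.5, color='#1E88E5', s=40, label='Ice cream sales')
axes[1].scatter(df['温度'], df['溺水人数'], alpha=0.5, color='#43A047', s=40, label='Drownings')
axes[1].set_title('The real story: temperature is the common cause', fontsize=13)
axes[1].set_xlabel('Temperature')
axes[1].set_ylabel('Count')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("The classic lesson:")
print("  Ice cream sales and drownings are strongly correlated, but neither causes the other!")
print("  The real common cause (the confounder) is temperature")
print("\nThe analyst's golden rule:")
print("  Correlation only says 'these move together'; it does not say which one causes which!")
print("  Establishing causation takes randomized experiments, control of confounders, or causal inference methods")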

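Summary

This section covered the core methods of correlation analysis:

- Covariance: the raw measure of co-variation; depends on units, so it is hard to compare directly
- Pearson correlation coefficient: the most widely used; captures only linear relationships, and its significance test assumes approximately normal data
- Spearman rank correlation: robust to outliers; suited to monotonic non-linear relationships
- Kendall rank correlation: suited to small samples and ordinal data
- Significance testing: the p-value indicates whether a correlation is statistically significant
- Correlation heatmap: a matrix visualization that shows all pairwise correlations at a glance
- VIF: flags features that are largely explained by the other features (multicollinearity)
- Partial correlation: the net correlation after controlling for other variables

Three key takeaways:

- Correlation is not causation: this is the most important principle in data analysis
- Read r and the p-value together: a strong but non-significant correlation, or a significant but weak one, both call for further analysis
- Plot the scatter first: a correlation coefficient is a single number, while a scatter plot reveals non-linear relationships and outliers

Next section

Section 23 covers the basics of statistical inference and hypothesis testing. So far we have analyzed correlations in data, but are these relationships real, or artifacts of random fluctuation? Hypothesis testing is the standard tool for answering that question. We will cover the t-test, the chi-square test, ANOVA, the correct interpretation of p-values, confidence intervals, and a complete A/B testing case study.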
