Introduction to Causal Inference: The Shift in Thinking from Correlation to Causation, and the Basic Methods

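Contents

- 1. Why Correlation Is Not Enough
- 2. Three Frameworks for Causal Inference
  - 2.1 The Potential Outcomes Framework (Rubin Causal Model)
  - 2.2 Causal Graphs (Pearl's do-Calculus)
  - 2.3 How the Three Frameworks Complement Each Other
- 3. Confounders: The Root of Correlation Fallacies
  - 3.1 Formal Definition of a Confounder
  - 3.2 Simpson's Paradox: When Confounding Reverses the Direction
- 4. Backdoor Adjustment: The Standard Way to Remove Confounding
  - 4.1 The Backdoor Criterion and the Adjustment Formula
- 5. Propensity Score Matching (PSM): Emulating a Randomized Experiment
  - 5.1 Definition of the Propensity Score
  - 5.2 A Complete PSM Implementation
- 6. Instrumental Variables (IV): Handling Unobserved Confounding
  - 6.1 When Instrumental Variables Are Needed
  - 6.2 Two-Stage Least Squares (2SLS)
- 7. Causal Discovery: Inferring the Causal Graph from Data
  - 7.1 Why Causal Discovery Is Needed
  - 7.2 The Core Idea of the PC Algorithm
- 8. In Practice: Comparing Three Causal Estimates of Ad Effectiveness
  - 8.1 Scenario
  - 8.2 Interpreting the Results
- 9. Putting Causal Inference into Production
  - 9.1 The Cost of Migrating from Correlational to Causal Models
  - 9.2 When Causal Inference Is Needed
  - 9.3 Double Machine Learning (Double ML): Causal Estimation at Scale
- 10. Summary
- References and Further Reading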

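1. Why Correlation Is Not Enough

A classic example: in one e-commerce platform's data, users who bought premium insoles had a 30-day retention rate 15% higher than ordinary users. Acting on this correlation ("promote insoles to all users to lift retention") would most likely waste the budget.

The real reason may be that users who were already highly active are both more inclined to buy premium accessories and more inclined to stay. In that case insole purchases and retention are both effects of a common cause, high activity, and there is no causal relationship between them.

This is the core dilemma for machine learning engineers: ML models are good at finding correlations, but business decisions require causation.

- Correlational question: $P(Y \mid X = x)$, the distribution of Y when X = x is observed
- Causal question: $P(Y \mid \text{do}(X = x))$, the distribution of Y after X is forced to x by intervention

"Observing" and "forcing by intervention" are entirely different questions. The former is about statistical association; the latter is what causal inference addresses.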
[Figure: flowchart contrasting two lines of reasoning. Correlational thinking: observation (active users bought insoles) → conclusion (buying insoles raises retention) → decision (promote insole purchases) → result (no significant effect). Causal thinking: identify the confounder (user activity, which affects both insole purchases and retention) → net effect after controlling for the confounder → conclusion (the effect of the insoles themselves is close to 0) → decision (raising user activity is the right lever).]
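The gap between the observational and the interventional question can be made concrete with a toy simulation. This is a minimal sketch under assumed numbers: the activity rate, purchase probabilities, and retention probabilities below are hypothetical and chosen only so that the purchase has no effect on retention.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical mechanism: activity U drives both insole purchase X and retention Y.
# X has NO effect on Y in this toy world.
u = rng.binomial(1, 0.5, n)              # 1 = highly active user
p_retain = np.where(u == 1, 0.7, 0.3)    # retention depends on U only

# Observational regime: users choose X themselves, so X depends on U
x_obs = rng.binomial(1, np.where(u == 1, 0.8, 0.2))
y_obs = rng.binomial(1, p_retain)
observational_gap = y_obs[x_obs == 1].mean() - y_obs[x_obs == 0].mean()

# Interventional regime: do(X = x) sets X for everyone, ignoring U.
# Y's mechanism does not read X, so both arms follow the same retention law.
y_do1 = rng.binomial(1, p_retain)        # world where X is forced to 1
y_do0 = rng.binomial(1, p_retain)        # world where X is forced to 0
interventional_gap = y_do1.mean() - y_do0.mean()

print(f"P(Y=1 | X=1) - P(Y=1 | X=0)         = {observational_gap:.3f}")
print(f"P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)) = {interventional_gap:.3f}")
```

In expectation the observational gap is 0.62 - 0.38 = 0.24 while the interventional gap is 0, even though both are computed from the same mechanism.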

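2. Three Frameworks for Causal Inference

2.1 The Potential Outcomes Framework (Rubin Causal Model)

Every unit has two potential outcomes:

- $Y_i(1)$: the outcome if the unit receives the treatment (treatment = 1)
- $Y_i(0)$: the outcome if the unit does not receive the treatment (treatment = 0)

Individual causal effect: $\tau_i = Y_i(1) - Y_i(0)$

The problem: the same user cannot be in the treatment group and the control group at once, so only one of $Y_i(1)$ and $Y_i(0)$ is ever observed. The other one is called the counterfactual.

Average treatment effect (ATE): $\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$

A randomized experiment (A/B test) guarantees $Y(t) \perp T$ (the potential outcomes are independent of treatment assignment), so the simple difference in group means is an unbiased estimate of the ATE. Observational studies need additional assumptions and methods to identify causal effects.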
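To illustrate why randomization makes the difference in means work, the sketch below simulates both potential outcomes for every unit (something only a simulation can do) and then hides one of them. The distributions and the effect size are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simulated potential outcomes; in real data only one of them is ever seen
y0 = rng.normal(10.0, 2.0, n)        # outcome without treatment
tau = rng.normal(1.5, 0.5, n)        # individual causal effects tau_i
y1 = y0 + tau                        # outcome with treatment
true_ate = (y1 - y0).mean()

# Randomized assignment: T is independent of (Y(0), Y(1)) by construction
t = rng.binomial(1, 0.5, n)
y_obs = np.where(t == 1, y1, y0)     # the counterfactual is discarded here

# Difference in observed group means
ate_hat = y_obs[t == 1].mean() - y_obs[t == 0].mean()

print(f"true ATE:            {true_ate:.3f}")
print(f"difference in means: {ate_hat:.3f}")
```

The estimate uses only `y_obs` and `t`, which is all an analyst would have, yet it lands close to the ATE computed from the hidden potential outcomes.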

2.2 Causal Graphs (Pearl's do-Calculus)

A **directed acyclic graph (DAG)** represents the direct causal relationships between variables:
[Figure: DAG with four nodes: user activity (confounder U, unobserved), insole purchase (treatment T), 30-day retention (outcome Y), and ad exposure (instrument Z).]

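In this DAG:

- Backdoor path: $T \leftarrow U \rightarrow Y$. This is the confounding path. It creates a non-causal association between T and Y, which here inflates their correlation.
- Causal path: $T \rightarrow Y$. This directed path carries the actual causal effect. (It is the direct path, not an instance of Pearl's front-door criterion, which works through mediators.)
- Instrument: Z (ad exposure) affects only T and has no direct effect on Y.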
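The backdoor paths in a small DAG can be listed mechanically. The sketch below encodes the four edges described above (U→T, U→Y, T→Y, Z→T) and enumerates every simple path from T to Y whose first edge points into T. The helper names `neighbors` and `backdoor_paths` are made up for this illustration, and the code only enumerates paths; it does not check whether a path is blocked by a collider or by a conditioning set.

```python
# Directed edges of the example DAG, written as (parent, child)
edges = {("U", "T"), ("U", "Y"), ("T", "Y"), ("Z", "T")}


def neighbors(node):
    """Nodes connected to `node` by an edge in either direction."""
    out = set()
    for parent, child in edges:
        if parent == node:
            out.add(child)
        if child == node:
            out.add(parent)
    return out


def backdoor_paths(treatment, outcome):
    """All simple paths treatment ... outcome whose first edge points INTO treatment."""
    paths = []

    def dfs(path):
        last = path[-1]
        if last == outcome:
            paths.append(path)
            return
        for nxt in sorted(neighbors(last)):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            if len(path) == 1 and (nxt, treatment) not in edges:
                continue  # first step must follow an arrow into the treatment
            dfs(path + [nxt])

    dfs([treatment])
    return paths


found = backdoor_paths("T", "Y")
print(found)  # [['T', 'U', 'Y']]
```

The path through Z is a dead end (Z has no other edges), so the only backdoor path is the one through U, matching the list above.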

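2.3 How the Three Frameworks Complement Each Other

| Framework | Strengths | Typical use |
| --- | --- | --- |
| Potential outcomes (Rubin) | Intuitive; maps directly onto experimental design | A/B test design, PSM |
| Causal graphs (Pearl) | Makes the causal structure visible; identifies confounders systematically | Analysing complex confounding, identifying instruments |
| Structural causal models (SCM) | The most complete; supports counterfactual inference | Advanced settings that need individual-level causal effects |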

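3. Confounders: The Root of Correlation Fallacies

3.1 Formal Definition of a Confounder

A variable $C$ is a confounder of the causal relationship $T \rightarrow Y$ when all three of the following hold:

1. $C$ affects the treatment $T$
2. $C$ affects the outcome $Y$
3. $C$ does not lie on the causal path $T \rightarrow Y$ (it is not a mediator)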

3.2 Simpson's Paradox: When Confounding Reverses the Direction

python

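import pandas as pd

# Classic Simpson's paradox example: does the drug improve recovery?
# Columns: 'n' = patients in the group, 'recovered' = patients who recovered.
data_overall = pd.DataFrame({
    'n': [700, 300],
    'recovered': [500, 250],
}, index=['treated', 'control'])

def rates(df):
    """Recovery rate per group."""
    return df['recovered'] / df['n']

print("=== Pooled data (ignoring the confounder, sex) ===")
overall = rates(data_overall)
print(f"treated recovery rate: {overall['treated']:.3f}")
print(f"control recovery rate: {overall['control']:.3f}")
print("Apparent conclusion: the treated group recovers less often?\n")

# Stratified by sex. The two strata add up to the pooled table above.
data_male = pd.DataFrame({
    'n': [600, 100],
    'recovered': [405, 65],
}, index=['treated', 'control'])

data_female = pd.DataFrame({
    'n': [100, 200],
    'recovered': [95, 185],
}, index=['treated', 'control'])

print("=== Stratified by sex (controlling for the confounder) ===")
for label, df in [('male', data_male), ('female', data_female)]:
    r = rates(df)
    print(f"{label:6s} treated: {r['treated']:.3f}   control: {r['control']:.3f}")
print()
print("Actual conclusion: within each sex, the treated group recovers more often")
print("Why: men are both more likely to take the drug and less likely to recover "
      "(confounder = sex)")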

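Key lesson: in imbalanced observational data, comparing groups directly without controlling for confounders can produce a conclusion that points in exactly the opposite direction.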

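4. Backdoor Adjustment: The Standard Way to Remove Confounding

4.1 The Backdoor Criterion and the Adjustment Formula

Backdoor criterion: a set of variables $Z$ is a valid backdoor adjustment set if it satisfies both conditions:

1. $Z$ blocks every backdoor path between $T$ and $Y$
2. $Z$ contains no descendant of $T$ (so mediators, and anything else downstream of $T$, must not be adjusted for)

Backdoor adjustment formula:

$$P(Y \mid \text{do}(T=t)) = \sum_z P(Y \mid T=t, Z=z) \cdot P(Z=z)$$

Intuition: within each stratum of $Z$, compare outcomes across values of T; then average the strata, weighted by the marginal distribution of $Z$.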
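For a single discrete adjustment variable, the formula can be evaluated directly from a table of counts. The sketch below does this for illustrative, made-up counts in which sex is the only confounder of a drug's effect on recovery; the dictionary layout and the helper names `p_z` and `p_y_do` are assumptions for the example.

```python
# counts[(z, t)] = (number of patients, number recovered); illustrative numbers
counts = {
    ("male", 1): (600, 405),
    ("male", 0): (100, 65),
    ("female", 1): (100, 95),
    ("female", 0): (200, 185),
}
strata = ("male", "female")
total = sum(n for n, _ in counts.values())


def p_z(z):
    """Marginal P(Z = z), pooling treated and control patients."""
    return sum(n for (zz, _), (n, _) in counts.items() if zz == z) / total


def p_y_do(t):
    """Backdoor formula: sum over z of P(Y=1 | T=t, Z=z) * P(Z=z)."""
    return sum(counts[(z, t)][1] / counts[(z, t)][0] * p_z(z) for z in strata)


# Naive comparison: pool the strata and compare raw recovery rates
naive = (405 + 95) / (600 + 100) - (65 + 185) / (100 + 200)
ate = p_y_do(1) - p_y_do(0)

print(f"naive difference:  {naive:+.4f}")
print(f"backdoor-adjusted: {ate:+.4f}")
```

With these counts the naive difference is negative while the adjusted effect is 0.7575 - 0.7325 = +0.025, because each stratum is reweighted by its share of the whole population rather than by its share of each treatment arm.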

python

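import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def backdoor_adjustment(data, treatment, outcome, confounders):
    """
    Estimate the average treatment effect (ATE) by backdoor adjustment.
    
    Requirements:
    1. All confounders are known and observed
    2. The number of confounders is manageable (avoids the curse of dimensionality)
    
    data: DataFrame
    treatment: name of the treatment column (binary)
    outcome: name of the outcome column (numeric; with a 0/1 outcome this is
             a linear probability model)
    confounders: list of confounder column names
    """
    # Exact stratification only works for discrete confounders; for continuous
    # confounders we adjust through a regression model instead.
    
    # Approach 1: regression adjustment (assumes a linear outcome model)
    X = data[confounders + [treatment]]
    y = data[outcome]
    
    model = LinearRegression()
    model.fit(X, y)
    
    # Under the linear model, the treatment coefficient is the causal effect
    treatment_coef = model.coef_[-1]
    
    # Approach 2: predict both potential outcomes for every unit (more general;
    # with this additive linear model it coincides with the coefficient above)
    data_treat = data.copy()
    data_treat[treatment] = 1
    data_control = data.copy()
    data_control[treatment] = 0
    
    X_treat = data_treat[confounders + [treatment]]
    X_control = data_control[confounders + [treatment]]
    
    potential_outcome_1 = model.predict(X_treat)
    potential_outcome_0 = model.predict(X_control)
    
    ate_regression = (potential_outcome_1 - potential_outcome_0).mean()
    
    return {
        'treatment_coefficient': treatment_coef,
        'ate_regression': ate_regression,
        'method': 'backdoor_regression_adjustment'
    }

# Example: ad effectiveness
np.random.seed(42)
n = 5000

# Confounder: the user's historical activity
activity = np.random.normal(0, 1, n)
# Ad exposure probability depends on activity (active users see more ads)
ad_exposure_prob = 1 / (1 + np.exp(-0.8 * activity))
ad_shown = np.random.binomial(1, ad_exposure_prob)
# Conversion depends on both the ad and activity.
# The ad's coefficient is 0.3 on the log-odds scale.
ad_log_odds_effect = 0.3
conversion_prob = 1 / (1 + np.exp(-(ad_log_odds_effect * ad_shown + 0.7 * activity)))
conversion = np.random.binomial(1, conversion_prob)

# The estimators below work on the probability scale, so the matching ground
# truth is the average change in conversion probability caused by the ad.
true_effect = np.mean(
    1 / (1 + np.exp(-(ad_log_odds_effect + 0.7 * activity)))
    - 1 / (1 + np.exp(-(0.7 * activity)))
)

df = pd.DataFrame({
    'activity': activity,
    'ad_shown': ad_shown,
    'conversion': conversion
})

# Naive comparison (no adjustment for the confounder)
naive = df.groupby('ad_shown')['conversion'].mean()
naive_effect = naive[1] - naive[0]

# Backdoor adjustment (controlling for activity)
adjusted = backdoor_adjustment(df, 'ad_shown', 'conversion', ['activity'])

print(f"True causal effect (ATE):    {true_effect:.3f}")
print(f"Naive difference in means:   {naive_effect:.3f}  (inflated by confounding)")
print(f"Backdoor-adjusted estimate:  {adjusted['ate_regression']:.3f}")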

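5. Propensity Score Matching (PSM): Emulating a Randomized Experiment

5.1 Definition of the Propensity Score

The propensity score is the probability that a unit receives the treatment, given the covariates $X$:

$$e(X) = P(T=1 \mid X)$$

Balancing property (Rosenbaum and Rubin): conditional on the propensity score $e(X)$, treatment is independent of the covariates:

$$T \perp X \mid e(X)$$

This means that among units with the same propensity score, the treated and control groups have the same distribution of observed covariates. If, in addition, treatment is unconfounded given $X$, assignment within such a subgroup is as good as random. Matching treated and control samples with similar propensity scores therefore removes confounding by the observed covariates.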
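The balancing property can be checked numerically. The sketch below uses two binary covariates and a known (not estimated) propensity score chosen so that the covariate patterns (1, 0) and (0, 1) share the same score. All probabilities are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# Two binary covariates with different base rates
x1 = rng.binomial(1, 0.5, n)
x2 = rng.binomial(1, 0.3, n)

# True propensity score: depends on X only through x1 + x2,
# so the patterns (1, 0) and (0, 1) both have e(X) = 0.5
e = 1 / (1 + np.exp(-(x1 + x2 - 1.0)))
t = rng.binomial(1, e)

# Without conditioning, x1 is imbalanced between treated and control units
raw_gap = x1[t == 1].mean() - x1[t == 0].mean()

# Within the stratum e(X) = 0.5, the share of x1 = 1 is the same in both groups
stratum = (x1 + x2) == 1
balanced_gap = (
    x1[stratum & (t == 1)].mean() - x1[stratum & (t == 0)].mean()
)

print(f"gap in mean(x1), all units:        {raw_gap:.3f}")
print(f"gap in mean(x1), same e(X) = 0.5:  {balanced_gap:.3f}")
```

Units in the stratum still differ in their covariates, but those differences no longer predict who was treated, which is what matching on the score relies on.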

5.2 A Complete PSM Implementation

python

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

class PropensityScoreMatching:
    """
    Propensity score matching (PSM)
    
    Steps:
    1. Estimate the propensity score P(T=1|X) with logistic regression
    2. Nearest-neighbour matching (with replacement): for each treated sample,
       find the `ratio` control samples with the closest propensity scores
    3. Estimate the ATT (average treatment effect on the treated) on the matched data
    
    Key assumption (ignorability / unconfoundedness):
    given the observed covariates X, T is conditionally independent of the
    potential outcomes Y(0), Y(1)
    """
    
    def __init__(self, caliper=0.05, ratio=1):
        """
        caliper: maximum propensity score distance allowed within a match,
                 in units of the propensity score's standard deviation
        ratio: number of control samples matched to each treated sample
        """
        self.caliper = caliper
        self.ratio = ratio
        self.propensity_model = None
        self.matched_data = None
    
    def fit(self, X, T, outcome_name=None, outcome=None):
        """
        X: covariate matrix
        T: treatment indicator (0/1)
        outcome_name: currently unused
        outcome: outcome variable (optional; used to estimate the effect)
        """
        X = np.asarray(X)
        T = np.asarray(T)
        
        # Step 1: estimate the propensity scores
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        self.propensity_model = LogisticRegression(
            max_iter=1000, C=1.0, random_state=42
        )
        self.propensity_model.fit(X_scaled, T)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        self.scaler = scaler
        
        # Step 2: split into treated and control units
        treat_idx = np.where(T == 1)[0]
        control_idx = np.where(T == 0)[0]
        
        treat_ps = self.propensity_scores[treat_idx]
        control_ps = self.propensity_scores[control_idx]
        
        # Step 3: nearest-neighbour matching with a caliper,
        # scaled by the standard deviation of the propensity score
        ps_std = self.propensity_scores.std()
        actual_caliper = self.caliper * ps_std
        
        nbrs = NearestNeighbors(n_neighbors=self.ratio, algorithm='ball_tree')
        nbrs.fit(control_ps.reshape(-1, 1))
        
        distances, indices = nbrs.kneighbors(treat_ps.reshape(-1, 1))
        
        # Keep only matches inside the caliper; controls stay grouped per
        # treated unit so that pairs remain aligned when ratio > 1
        matched_treat = []
        matched_groups = []
        
        for i, (dist_row, idx_row) in enumerate(zip(distances, indices)):
            valid = dist_row <= actual_caliper
            if valid.any():
                matched_treat.append(treat_idx[i])
                matched_groups.append(control_idx[idx_row[valid]])
        
        self.matched_treat_idx = np.array(matched_treat, dtype=int)
        self.matched_control_groups = matched_groups
        self.matched_control_idx = (
            np.concatenate(matched_groups) if matched_groups else np.array([], dtype=int)
        )
        
        print(f"Treated samples:       {len(treat_idx)}")
        print(f"Control samples:       {len(control_idx)}")
        print(f"Treated units matched: {len(matched_treat)}")
        print(f"Match rate:            {len(matched_treat)/len(treat_idx):.1%}")
        
        # Effect estimate on the matched sample
        if outcome is not None:
            outcome = np.asarray(outcome)
            # Each treated outcome minus the mean outcome of its own matched controls
            diffs = [
                outcome[t] - outcome[group].mean()
                for t, group in zip(matched_treat, matched_groups)
            ]
            self.att = float(np.mean(diffs))
            print(f"\nATT (average treatment effect on the treated): {self.att:.4f}")
        
        return self
    
    def check_balance(self, X, T, feature_names=None):
        """
        Check covariate balance before and after matching.
        A standardized mean difference (SMD) < 0.1 is usually considered well balanced.
        """
        X = np.asarray(X)
        T = np.asarray(T)
        if feature_names is None:
            feature_names = [f'X{i}' for i in range(X.shape[1])]
        
        treat_idx = np.where(T == 1)[0]
        control_idx = np.where(T == 0)[0]
        
        results = []
        for i, name in enumerate(feature_names):
            # Before matching
            t_before = X[treat_idx, i].mean()
            c_before = X[control_idx, i].mean()
            pooled_std = np.sqrt((X[treat_idx, i].var() + X[control_idx, i].var()) / 2)
            smd_before = abs(t_before - c_before) / (pooled_std + 1e-10)
            
            # After matching (same denominator, so the two SMDs are comparable)
            matched_t = X[self.matched_treat_idx, i]
            matched_c = X[self.matched_control_idx, i]
            smd_after = abs(matched_t.mean() - matched_c.mean()) / (pooled_std + 1e-10)
            
            results.append({
                'feature': name,
                'SMD_before': round(smd_before, 3),
                'SMD_after': round(smd_after, 3),
                'balanced': '✅' if smd_after < 0.1 else '⚠️'
            })
        
        df = pd.DataFrame(results)
        print("\n=== Covariate balance check ===")
        print(df.to_string(index=False))
        return df

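6. Instrumental Variables (IV): Handling Unobserved Confounding

6.1 When Instrumental Variables Are Needed

Backdoor adjustment and PSM share one premise: every confounder is observed. In practice, variables such as "the user's true purchase intent" or "a physician's experience-driven preferences" cannot be measured directly.

An instrumental variable $Z$ must satisfy three conditions:

- Relevance: $Z$ is associated with the treatment $T$ ($Z$ affects $T$)
- Exclusion restriction: $Z$ has no direct effect on the outcome $Y$ (it affects $Y$ only through $T$)
- Exogeneity (independence): $Z$ is unrelated to the unobserved confounders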

[Figure: DAG with four nodes: unobserved confounder U (e.g. the user's true purchase intent), treatment T (ad click), outcome Y (purchase conversion), and instrument Z (randomized ad placement).]
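With one instrument and one treatment, the three conditions lead to a one-line estimator, the ratio of covariances cov(Z, Y) / cov(Z, T), often called the Wald or IV ratio estimator. The sketch below checks it on a simulated linear model with an unobserved confounder. The coefficients (true effect 2.0, confounder strength 3.0, first-stage strength 0.5) are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

u = rng.normal(0, 1, n)                        # unobserved confounder
z = rng.binomial(1, 0.5, n).astype(float)      # randomized instrument
t = 0.5 * z + u + rng.normal(0, 1, n)          # relevance: Z shifts T
y = 2.0 * t + 3.0 * u + rng.normal(0, 1, n)    # exclusion: Z is absent here

# Ordinary least squares slope of Y on T: biased because U drives both
ols_slope = np.cov(t, y)[0, 1] / np.var(t, ddof=1)

# IV ratio estimator: uses only the variation in T that comes from Z
wald = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

print("true effect:  2.000")
print(f"OLS slope:    {ols_slope:.3f}")
print(f"IV estimate:  {wald:.3f}")
```

The IV estimate never uses U, yet it recovers the effect because Z is independent of U by construction. Its variance is noticeably larger than that of OLS, which is the price paid for discarding the confounded variation in T.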

6.2 Two-Stage Least Squares (2SLS)

python

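import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats

class TwoStageLeastSquares:
    """
    Two-stage least squares (2SLS) instrumental variable estimator
    
    Stage 1: regress the treatment T on the instrument Z to get T_hat
             (the part of T that is free of unobserved confounding)
    Stage 2: regress the outcome Y on T_hat to get the causal effect
    
    Limitations:
    1. Needs a strong instrument (a weak one leads to large variance)
    2. Identifies only the local average treatment effect (LATE) for
       "compliers", not the average treatment effect (ATE) for everyone
    """
    
    def __init__(self):
        self.stage1_model = None
        self.stage2_model = None
    
    def fit(self, T, Y, Z, X_controls=None):
        """
        T: treatment (n,)
        Y: outcome (n,)
        Z: instrument(s), (n,) or (n, k)
        X_controls: additional control variables (optional)
        """
        T = np.asarray(T)
        n = len(T)
        
        # Stage 1 feature matrix
        if X_controls is not None:
            X1 = np.column_stack([Z, X_controls])
        else:
            X1 = Z.reshape(-1, 1) if Z.ndim == 1 else Z
        
        # Stage 1: T ~ Z + X_controls
        self.stage1_model = LinearRegression()
        self.stage1_model.fit(X1, T)
        T_hat = self.stage1_model.predict(X1)
        
        # Weak-instrument check (rule of thumb: F > 10 suggests a strong instrument).
        # This is the overall first-stage F; when X_controls is given it also
        # reflects the controls, so a partial F on Z alone is the stricter check.
        residuals_stage1 = T - T_hat
        ss_total = np.sum((T - T.mean()) ** 2)
        ss_resid = np.sum(residuals_stage1 ** 2)
        r2_stage1 = 1 - ss_resid / ss_total
        
        k = X1.shape[1]
        f_stat = (r2_stage1 / k) / ((1 - r2_stage1) / (n - k - 1))
        print(f"Stage 1 F statistic: {f_stat:.2f} {'✅ strong instrument' if f_stat > 10 else '⚠️ weak instrument'}")
        
        # Stage 2: Y ~ T_hat + X_controls
        if X_controls is not None:
            X2 = np.column_stack([T_hat, X_controls])
            X2_actual = np.column_stack([T, X_controls])
        else:
            X2 = T_hat.reshape(-1, 1)
            X2_actual = T.reshape(-1, 1)
        
        self.stage2_model = LinearRegression()
        self.stage2_model.fit(X2, Y)
        
        self.late_estimate = self.stage2_model.coef_[0]
        
        # Standard error (2SLS formula, assuming homoskedasticity).
        # The residuals must combine the stage-2 coefficients with the ACTUAL
        # treatment T; residuals computed from T_hat give the wrong variance.
        residuals = Y - self.stage2_model.predict(X2_actual)
        design = np.column_stack([np.ones(n), X2])  # intercept, T_hat, controls
        dof = n - design.shape[1]
        sigma2 = np.sum(residuals ** 2) / dof
        
        XtX_inv = np.linalg.pinv(design.T @ design)
        self.se = np.sqrt(sigma2 * XtX_inv[1, 1])  # index 1 is T_hat
        self.t_stat = self.late_estimate / self.se
        self.p_value = 2 * (1 - stats.t.cdf(abs(self.t_stat), df=dof))
        
        print(f"\nLATE (local average treatment effect): {self.late_estimate:.4f}")
        print(f"Standard error: {self.se:.4f}")
        print(f"t statistic: {self.t_stat:.4f}")
        print(f"p-value: {self.p_value:.4f}")
        
        return self


# Example: randomized ad placement as the instrument
def simulate_iv_example():
    np.random.seed(42)
    n = 3000
    
    # Unobserved confounder: the user's purchase intent
    purchase_intent = np.random.normal(0, 1, n)
    
    # Instrument: whether the ad appears on the first screen
    # (assigned at random, so it does not depend on purchase intent)
    first_screen = np.random.binomial(1, 0.5, n)
    
    # Ad clicks depend on both first-screen placement and purchase intent
    click_prob = 1 / (1 + np.exp(-(0.6 * first_screen + 0.8 * purchase_intent)))
    ad_click = np.random.binomial(1, click_prob)
    
    # Conversion: the click's coefficient is 0.2 on the log-odds scale,
    # while the confounder has a much stronger influence
    conversion_prob = 1 / (1 + np.exp(-(0.2 * ad_click + 1.0 * purchase_intent)))
    conversion = np.random.binomial(1, conversion_prob)
    
    # Ground truth on the probability scale, which is the scale 2SLS works on.
    # This is the population ATE; the LATE for compliers can differ from it.
    true_effect = np.mean(
        1 / (1 + np.exp(-(0.2 + 1.0 * purchase_intent)))
        - 1 / (1 + np.exp(-(1.0 * purchase_intent)))
    )
    
    # Naive estimate (no adjustment for the confounder)
    naive_effect = conversion[ad_click == 1].mean() - conversion[ad_click == 0].mean()
    
    # 2SLS estimate
    iv = TwoStageLeastSquares()
    iv.fit(ad_click.astype(float), conversion.astype(float), first_screen.astype(float))
    
    print(f"\nTrue average effect (probability scale): {true_effect:.4f}")
    print(f"Naive difference in means: {naive_effect:.4f} (inflated by confounding)")
    print(f"2SLS IV estimate:          {iv.late_estimate:.4f}")
    
    return iv

iv_result = simulate_iv_example()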

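7. Causal Discovery: Inferring the Causal Graph from Data

7.1 Why Causal Discovery Is Needed

Backdoor adjustment, PSM, and instrumental variables all assume that the causal structure is known (which variables are confounders, which are mediators). In practice, researchers often do not fully know the causal relationships among the variables.

Causal discovery: inferring the DAG structure automatically from observational data. In general the graph can be recovered only up to a Markov equivalence class, and only under additional assumptions.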
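Constraint-based discovery methods rest on one observation: the causal structure leaves a footprint in which conditional independences hold. The sketch below simulates a chain X → Z → Y with assumed coefficients and shows that X and Y are correlated marginally but (nearly) uncorrelated once Z is regressed out. The helper `residual` is a made-up name for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Chain X -> Z -> Y: X influences Y only through Z
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)


def residual(a, b):
    """Residual of a after a least-squares regression on b (with intercept)."""
    design = np.column_stack([np.ones_like(b), b])
    coef, *_ = np.linalg.lstsq(design, a, rcond=None)
    return a - design @ coef


# Marginal correlation: X and Y are clearly dependent
marginal_r = np.corrcoef(x, y)[0, 1]

# Partial correlation given Z: correlate what is left after removing Z
partial_r = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(f"corr(X, Y)      = {marginal_r:.3f}")
print(f"corr(X, Y | Z)  = {partial_r:.3f}")
```

A vanishing partial correlation is evidence that no direct X-Y edge is needed once Z is accounted for, under the linear-Gaussian assumptions used here.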

7.2 The Core Idea of the PC Algorithm

python

from itertools import combinations
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def pc_algorithm_skeleton(data, alpha=0.05):
    """
    PC algorithm, phase 1: skeleton learning
    
    Core idea:
    1. Start from the complete graph (every pair of variables connected)
    2. Grow the conditioning set size step by step, running conditional
       independence tests
    3. If X ⊥ Y | Z, remove the X-Y edge and record Z as the separating set
    
    Limitations:
    - The test used here assumes linear relationships and Gaussian noise
      (a nonparametric test can be swapped in)
    - Worst-case cost grows exponentially with the number of variables
    - Not robust to data that violates the assumptions
    """
    n_vars = data.shape[1]
    var_names = data.columns.tolist() if hasattr(data, 'columns') else list(range(n_vars))
    
    # Start from the complete undirected graph (adjacency sets)
    adjacency = {v: set(var_names) - {v} for v in var_names}
    sep_sets = {}
    
    cond_set_size = 0
    
    while True:
        for x in var_names:
            for y in list(adjacency[x]):
                # Candidate conditioning variables: neighbours of x other than y.
                # The ordered pair (y, x) is visited as well, which covers the
                # neighbours of y.
                adj_x = adjacency[x] - {y}
                
                if len(adj_x) < cond_set_size:
                    continue
                
                # Try every conditioning subset of size cond_set_size
                for z_set in combinations(adj_x, cond_set_size):
                    z_set = list(z_set)
                    
                    # Conditional independence test (partial correlation)
                    is_independent = conditional_independence_test(
                        data, x, y, z_set, alpha
                    )
                    
                    if is_independent:
                        # Remove the edge x-y and remember what separated them
                        adjacency[x].discard(y)
                        adjacency[y].discard(x)
                        sep_sets[(x, y)] = z_set
                        sep_sets[(y, x)] = z_set
                        break
        
        cond_set_size += 1
        
        # Stop once no node has enough other neighbours to form a conditioning set
        max_adj_size = max(len(adj) for adj in adjacency.values())
        if max_adj_size - 1 < cond_set_size or cond_set_size > n_vars:
            break
    
    return adjacency, sep_sets


def conditional_independence_test(data, x, y, z_set, alpha=0.05):
    """
    Conditional independence test based on partial correlation
    H0: X ⊥ Y | Z
    Returns True when H0 is NOT rejected at level alpha.
    """
    is_frame = hasattr(data, 'values')  # DataFrame vs. plain ndarray
    x_vals = data[x].values if is_frame else data[:, x]
    y_vals = data[y].values if is_frame else data[:, y]
    
    if len(z_set) == 0:
        # Unconditional case: plain correlation test
        r, p = pearsonr(x_vals, y_vals)
        return p > alpha
    
    # Partial correlation: regress out z_set, then correlate the residuals
    Z = data[z_set].values if is_frame else data[:, z_set]
    
    reg_x = LinearRegression().fit(Z, x_vals)
    reg_y = LinearRegression().fit(Z, y_vals)
    
    resid_x = x_vals - reg_x.predict(Z)
    resid_y = y_vals - reg_y.predict(Z)
    
    # pearsonr's p-value uses n - 2 degrees of freedom and ignores |Z|;
    # the difference is negligible when n is much larger than |Z|
    r, p = pearsonr(resid_x, resid_y)
    return p > alpha

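8. In Practice: Comparing Three Causal Estimates of Ad Effectiveness

8.1 Scenario

A platform ran a batch of personalized ads and wants to estimate the true causal effect of "showing the ad" on "purchase conversion within 7 days".

In the observational data, high-value users are both more likely to be targeted by the ad and more likely to buy. This is textbook confounding.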


import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def simulate_ad_effectiveness_study(n=10000, seed=42):
    """
    Simulate an observational study of ad effectiveness.
    True causal effect = 0.05 on the purchase-probability scale
    (the lift from the ad itself is very limited).
    """
    np.random.seed(seed)
    
    # User features (observable covariates)
    age = np.random.normal(35, 10, n)
    income_level = np.random.randint(1, 6, n)  # levels 1-5
    platform_activity = np.random.exponential(2, n)  # historical activity
    
    # Purchase intent (not observable in practice; this is the source of the confounding)
    purchase_intent = 0.3 * (income_level - 3) + 0.4 * platform_activity + np.random.normal(0, 1, n)
    purchase_intent = (purchase_intent - purchase_intent.min()) / (purchase_intent.max() - purchase_intent.min())
    
    # Ad targeting: high-value users (high activity + high income) are more likely to be targeted
    ad_score = 0.5 * income_level + 0.8 * platform_activity + np.random.normal(0, 0.5, n)
    ad_prob = 1 / (1 + np.exp(-0.8 * (ad_score - ad_score.mean()) / ad_score.std()))
    
    # Instrument: delivery time slot, rotated at random by the system
    # (one independent draw per user, unrelated to user features)
    peak_hour = np.random.binomial(1, 0.4, n)  # 40% fall in the peak slot
    # Peak slots raise the exposure probability (relevance condition of the instrument)
    ad_prob_with_iv = np.clip(ad_prob + 0.15 * peak_hour, 0, 1)
    # One realized exposure drives the outcome, so every estimator targets the same treatment
    ad_shown = np.random.binomial(1, ad_prob_with_iv)
    ad_shown_iv = ad_shown  # kept as its own column for the IV step
    
    # Purchase outcome: the ad adds true_effect to the baseline purchase probability,
    # so the estimates below are on the same scale as true_effect
    true_effect = 0.05
    base_prob = 1 / (1 + np.exp(-(0.9 * purchase_intent + np.random.normal(0, 0.1, n))))
    buy_prob = np.clip(base_prob + true_effect * ad_shown, 0, 1)
    bought = np.random.binomial(1, buy_prob)
    
    df = pd.DataFrame({
        'age': age,
        'income_level': income_level,
        'platform_activity': platform_activity,
        'ad_shown': ad_shown,
        'ad_shown_iv': ad_shown_iv,
        'peak_hour': peak_hour,
        'bought': bought,
        'purchase_intent': purchase_intent  # unobserved in practice; kept only for validation
    })
    
    return df, true_effect

df, true_effect = simulate_ad_effectiveness_study()

# ================================================================
# Method 1: naive comparison (no control for confounding)
# ================================================================
naive = df.groupby('ad_shown')['bought'].mean()
naive_effect = naive[1] - naive[0]

# ================================================================
# Method 2: backdoor adjustment (controls for observed confounders)
# ================================================================
X_confounders = df[['age', 'income_level', 'platform_activity']].values
T = df['ad_shown'].values
Y = df['bought'].values

result_backdoor = backdoor_adjustment(
    df, 'ad_shown', 'bought',
    ['age', 'income_level', 'platform_activity']
)

# ================================================================
# Method 3: PSM (propensity score matching)
# ================================================================
psm = PropensityScoreMatching(caliper=0.05)
psm.fit(X_confounders, T, outcome=Y)

# ================================================================
# Method 4: instrumental variable (2SLS), handles unobserved confounding
# ================================================================
iv_2sls = TwoStageLeastSquares()
iv_2sls.fit(
    df['ad_shown_iv'].values.astype(float),
    Y.astype(float),
    df['peak_hour'].values.astype(float)
)

# ================================================================
# Summary of the comparison
# ================================================================
print("\n" + "="*55)
print(f"{'Method':<25} {'Estimate':>12} {'Error':>10}")
print("="*55)
estimates = {
    'True causal effect': true_effect,
    'Naive comparison': naive_effect,
    'Backdoor (regression)': result_backdoor['ate_regression'],
    'PSM (ATT)': getattr(psm, 'att', None),
    'IV / 2SLS (LATE)': iv_2sls.late_estimate
}

for method, est in estimates.items():
    if est is not None:
        error = abs(est - true_effect)
        marker = " ✅" if error < 0.01 else (" ⚠️" if error < 0.03 else " ❌")
        print(f"{method:<25} {est:>12.4f} {error:>10.4f}{marker}")
print("="*55)

8.2 Interpreting the Results

Figure: method comparison (true effect: 0.05)
Naive comparison: 0.07-0.12, biased upward, large error
Backdoor adjustment: ~0.052, close to the truth; requires observable confounders
PSM: ~0.053, close to the truth; nonparametric and flexible
Instrumental variable: ~0.048, the most accurate; can handle unobserved confounding

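Method	Accuracy	When it applies	Implementation effort
Naive comparison	❌ Heavily biased	No confounding (very rare)	Trivial
Backdoor adjustment	✅ Good	Confounders are observed	Low
PSM	✅ Good	Confounders are observed; nonparametric	Medium
Instrumental variable	✅✅ Best in this simulation	A valid instrument is available	High

The ratings describe this single simulated scenario. An instrumental-variable estimate usually has a wider confidence interval than an adjustment-based one, so its strength is robustness to unobserved confounding rather than precision.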
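
The "When it applies" column can be illustrated with a small stand-alone simulation. The sketch below is not the study code from 8.1: it assumes a linear outcome, one observed confounder `c`, one unobserved confounder `u`, and a randomized binary instrument `z`, all of which are hypothetical names and coefficients chosen for illustration. It compares a naive difference in means, a regression adjustment that can only use `c`, and the Wald (ratio) form of the instrumental-variable estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.05

c = rng.normal(size=n)             # observed confounder
u = rng.normal(size=n)             # unobserved confounder
z = rng.binomial(1, 0.5, size=n)   # randomized instrument

# Treatment depends on both confounders and on the instrument
t = (0.8 * c + 0.8 * u + 1.0 * z + rng.normal(size=n) > 0.9).astype(float)
# Outcome: constant treatment effect plus both confounders
y = true_effect * t + 0.3 * c + 0.3 * u + rng.normal(scale=0.5, size=n)


def ols_coef(columns, target):
    """OLS with intercept; returns the coefficient on the first column."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1]


# Naive: difference in mean outcome between treated and untreated
naive = y[t == 1].mean() - y[t == 0].mean()
# Backdoor adjustment using only the observed confounder c
adjusted = ols_coef([t, c], y)
# Wald estimator: effect of z on y divided by effect of z on t
wald = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

for name, est in [("naive", naive), ("adjusted (c only)", adjusted), ("IV (Wald)", wald)]:
    print(f"{name:<20} {est:8.4f}   error {abs(est - true_effect):.4f}")
```

Under these assumptions, adjusting for `c` removes only part of the bias because `u` stays uncontrolled, while the Wald ratio stays centered on the true effect at the price of a noisier estimate.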

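9. Putting Causal Inference into Practice

9.1 The Cost of Moving from Correlational to Causal Models

Dimension	Correlational model (ML)	Causal model
Objective	Predict $\hat{Y}$	Estimate $\tau$ (the treatment effect)
Data requirements	Large volumes of historical data	Randomization or an instrumental variable
Evaluation	AUC, MSE	ATT, LATE + confidence intervals
Decisions it suits	Personalized recommendation	Policy interventions, pricing, marketing
Transferability	Weak	Strong (causal effects can be extrapolated)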

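9.2 When Do You Need Causal Inference?

Three questions to ask:

Are you going to intervene? ("send users a coupon" vs. "predict whether a user will buy") → Intervention decisions need causal estimates.
Is there obvious confounding? (for example, high-value users are more likely to be selected for treatment) → Confounding calls for causal methods.
Can you run an A/B test? → If you can, run it; use observational methods only when you cannot (price increases, long-term interventions).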
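
The third question matters because randomization breaks the link between user value and treatment, so a plain difference in means is already unbiased. A minimal sketch of that point, with the variable names, coefficients, and sample size all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40_000

value = rng.normal(size=n)               # user value: drives purchases
treated = rng.binomial(1, 0.5, size=n)   # coin-flip assignment, independent of value
y = 0.05 * treated + 0.3 * value + rng.normal(scale=0.5, size=n)

y1, y0 = y[treated == 1], y[treated == 0]

# Difference in means and its standard error (unequal-variance form)
diff = y1.mean() - y0.mean()
se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"estimated effect: {diff:.4f}, 95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

No covariates are needed for unbiasedness here; adjusting for them would only tighten the interval.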

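9.3 Double Machine Learning (Double ML): Causal Estimation at Scale

When the confounders are high-dimensional, double machine learning (Chernozhukov et al., 2018) uses ML models for both the propensity score and the outcome prediction, then regresses residuals on residuals to obtain a debiased causal estimate: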


from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold
import numpy as np

def double_machine_learning(T, Y, X, n_splits=5, random_state=42):
    """
    Double machine learning (partially linear model).
    Suited to: high-dimensional confounders, a binary treatment T,
    and a linear (constant) causal effect.
    
    Core idea:
    1. Predict T with ML (removes the confounders' influence on treatment), giving residuals V
    2. Predict Y with ML (removes the confounders' influence on the outcome), giving residuals U
    3. Regress U ~ V; the coefficient is the causal effect
    
    Key point: cross-fitting keeps overfitting bias out of the residuals
    """
    # Positional indexing below needs NumPy arrays rather than pandas objects
    T, Y, X = np.asarray(T), np.asarray(Y), np.asarray(X)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    V = np.zeros_like(T, dtype=float)  # residuals of T
    U = np.zeros_like(Y, dtype=float)  # residuals of Y
    
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        T_train, T_test = T[train_idx], T[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]
        
        # Step 1: predict T on the held-out fold
        model_t = GradientBoostingClassifier(n_estimators=50, random_state=random_state)
        model_t.fit(X_train, T_train)
        T_hat = model_t.predict_proba(X_test)[:, 1]
        V[test_idx] = T_test - T_hat
        
        # Step 2: predict Y on the held-out fold
        model_y = GradientBoostingRegressor(n_estimators=50, random_state=random_state)
        model_y.fit(X_train, Y_train)
        Y_hat = model_y.predict(X_test)
        U[test_idx] = Y_test - Y_hat
    
    # Final step: regress U ~ V (no intercept)
    from numpy.linalg import lstsq
    theta, _, _, _ = lstsq(V.reshape(-1, 1), U, rcond=None)
    ate_estimate = theta[0]
    
    # Standard error (homoskedastic OLS formula)
    residuals = U - V * ate_estimate
    se = np.sqrt(np.sum(residuals**2) / (len(T) - 1)) / np.sqrt(np.sum(V**2))
    
    print("\nDouble machine learning estimate:")
    print(f"ATE: {ate_estimate:.4f}")
    print(f"95% CI: [{ate_estimate - 1.96*se:.4f}, {ate_estimate + 1.96*se:.4f}]")
    
    return ate_estimate, se
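
The residual-on-residual step in the code corresponds to the partially linear model of Chernozhukov et al. (2018). As a sketch, with $g$, $m$ and $\ell$ denoting unknown nuisance functions (notation assumed here for illustration) and $\theta$ matching the `theta` variable in the code:

```latex
\begin{aligned}
Y &= \theta\, T + g(X) + \varepsilon, & \mathbb{E}[\varepsilon \mid X, T] &= 0 \\
T &= m(X) + v, & \mathbb{E}[v \mid X] &= 0 \\
\ell(X) &:= \mathbb{E}[Y \mid X] = \theta\, m(X) + g(X)
  & &\Rightarrow\quad Y - \ell(X) = \theta\, v + \varepsilon \\
\hat{U}_i &= Y_i - \hat{\ell}(X_i), & \hat{V}_i &= T_i - \hat{m}(X_i) \\
\hat{\theta} &= \frac{\sum_i \hat{V}_i\, \hat{U}_i}{\sum_i \hat{V}_i^{2}}
\end{aligned}
```

Subtracting $\ell(X)$ removes everything in $Y$ that $X$ can explain, so the only systematic variation left in $\hat{U}$ is $\theta$ times the treatment residual. This is why a no-intercept regression of `U` on `V` recovers the effect.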

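10. Summary

Moving from correlation to causation is not a technical upgrade but a switch of mental framework:

Correlational models answer "who will buy", which supports precise targeting
Causal models answer "how many additional purchases did the ad cause", which quantifies the value of an intervention

Where each of the four methods applies:

Method	Core assumption	Typical use
Backdoor adjustment	No unobserved confounding	Few covariates, solid business understanding
PSM	No unobserved confounding	Complex treatment assignment; a nonparametric approach is preferred
Instrumental variable	A valid instrument exists	Unobserved confounding is present and there is a source of randomness
Double ML	No unobserved confounding; linear causal effect	High-dimensional confounders, large samples

Causal inference has long sat in the ML engineer's toolbox as something "heard of but never used". The barrier is not the math but recognizing the problem: learning to spot confounders and to ask "what would happen if we intervened" is where the real value of these methods lies.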

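References and Further Reading

Causal inference is closely tied to imbalanced-data handling: in observational studies the treatment and control groups are often severely imbalanced. See the earlier post "Imbalanced Data in Practice: Sampling Strategies / Cost-Sensitive Learning / Evaluation Metrics / Business Scenarios" for balancing strategies beyond PSM.

The rigor of model evaluation applies to causal estimates as well: confidence intervals and significance tests for causal effects follow much the same reasoning as the earlier post "Model Evaluation and Validation: Cross-Validation Strategies / Statistical Tests / Calibration Curves / Multi-Metric Decision Framework".

To understand why a feature's correlation with the target is not causation, see "Advanced Feature Selection and Feature Engineering: Filter / Wrapper / Embedded Methods + Domain-Specific Feature Design". Mutual information and correlation coefficients measure statistical association, not causal strength.

The "labeling cost" problem in semi-supervised and self-supervised learning is, at its core, also a causal problem: the distribution gap between labeled and unlabeled data is a form of selection bias. See the earlier post "Semi-Supervised and Self-Supervised Learning: Practical Solutions When Labels Are Scarce" for ways to handle it.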
