Introduction to Causal Inference: Shifting from Correlation to Causation, with Basic Methods

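1. Why correlation is not enough

Consider a classic example: in one e-commerce platform's data, users who bought premium insoles have a 30-day retention rate 15% higher than ordinary users. Acting on that correlation ("promote insoles to every user to lift retention") will most likely waste budget.

A plausible real explanation: users who were already highly active are both more likely to buy premium accessories and more likely to stay. Insole purchases and retention are both effects of the common cause "high activity", and there may be little or no causal link between the two.

This is the core dilemma for machine learning engineers: ML models are good at finding correlations, but business decisions require causation.

  • Correlational question: $P(Y \mid X = x)$, the distribution of Y when we observe X = x
  • Causal question: $P(Y \mid \text{do}(X = x))$, the distribution of Y after we force X = x by intervention

"Observing" and "intervening" are different questions. The first is about statistical association; the second is what causal inference is about.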
[Figure: flowchart contrasting two ways of thinking. Correlational thinking: observe that active users bought insoles → conclude that buying insoles raises retention → decide to promote insoles → result: no significant effect. Causal thinking: identify the confounder (user activity, which affects both insole purchase and retention) → estimate the net effect after controlling for the confounder → conclude that the insole's own effect is close to 0 → decide that raising user activity is the right path.]
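The gap between $P(Y \mid X = x)$ and $P(Y \mid \text{do}(X = x))$ can be made concrete with a small simulation of the insole story. The sketch below is illustrative only: the probabilities (0.3, 0.6, 0.1, 0.7) and the variable names are assumptions, and the model deliberately contains no arrow from insole purchase to retention.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed structural model: activity -> purchase, activity -> retention,
# and NO purchase -> retention arrow.
active = rng.binomial(1, 0.3, n)                # hidden common cause
p_buy = np.where(active == 1, 0.6, 0.1)
bought = rng.binomial(1, p_buy)                 # observed X
p_retain = np.where(active == 1, 0.7, 0.3)      # depends on activity only


def simulate_retention():
    """Draw retention from its mechanism, which ignores the purchase."""
    return rng.binomial(1, p_retain)


# Seeing: compare users who happened to buy with users who did not.
retained = simulate_retention()
seeing_gap = retained[bought == 1].mean() - retained[bought == 0].mean()

# Doing: force the purchase on (or off) for everyone. The retention
# mechanism is untouched because the purchase is not one of its inputs.
retained_do1 = simulate_retention()
retained_do0 = simulate_retention()
doing_gap = retained_do1.mean() - retained_do0.mean()

print(f"observed gap, P(Y|X=1) - P(Y|X=0):          {seeing_gap:.3f}")
print(f"interventional gap, P(Y|do(1)) - P(Y|do(0)): {doing_gap:.3f}")
```

Under these assumptions the observed gap is large while the interventional gap sits near zero, which is the pattern the insole example describes.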


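2. Three frameworks for causal inference

2.1 The potential outcomes framework (Rubin causal model)

Each unit has two potential outcomes:

  • $Y_i(1)$: the outcome if the unit receives the treatment (treatment = 1)
  • $Y_i(0)$: the outcome if the unit does not receive it (treatment = 0)

Individual causal effect: $\tau_i = Y_i(1) - Y_i(0)$

The problem: the same user cannot be in the treatment group and the control group at once, so only one of $Y_i(1)$ and $Y_i(0)$ is ever observed. The other one is called the counterfactual.

Average treatment effect (ATE): $\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$

A randomized experiment (A/B test) guarantees $Y(t) \perp T$ for $t \in \{0, 1\}$ (the potential outcomes are independent of treatment assignment), so the simple difference in group means is an unbiased estimate of the ATE. Observational studies need additional assumptions and methods to identify causal effects.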
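Because a simulation can generate both potential outcomes for every unit, it can show what real data never reveals. The following sketch, with made-up effect sizes and a hypothetical self-selection rule, compares the difference in means under self-selected treatment with the same difference under random assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Both potential outcomes exist here only because this is a simulation.
baseline = rng.normal(0, 1, n)                    # unit-level heterogeneity
y0 = baseline                                     # Y_i(0)
y1 = baseline + 0.5 + 0.2 * rng.normal(0, 1, n)   # Y_i(1)
true_ate = (y1 - y0).mean()

# Observational regime (assumed): high-baseline units opt in more often.
t_obs = rng.binomial(1, 1 / (1 + np.exp(-2 * baseline)))
y_obs = np.where(t_obs == 1, y1, y0)              # only one outcome is seen
naive_obs = y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean()

# Randomized regime: a coin flip decides treatment, so Y(t) is independent of T.
t_rand = rng.binomial(1, 0.5, n)
y_rand = np.where(t_rand == 1, y1, y0)
diff_rand = y_rand[t_rand == 1].mean() - y_rand[t_rand == 0].mean()

print(f"true ATE:                    {true_ate:.3f}")
print(f"self-selected, naive diff:   {naive_obs:.3f}")
print(f"randomized, diff in means:   {diff_rand:.3f}")
```

The randomized difference tracks the true ATE, while the self-selected comparison mixes the treatment effect with the baseline difference between those who opted in and those who did not.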

2.2 Causal graphs (Pearl's do-calculus)

A **directed acyclic graph (DAG)** represents the direct causal relationships between variables:
[Figure: DAG with four nodes: user activity (confounder U, unobserved), insole purchase (treatment T), 30-day retention (outcome Y), and ad exposure (instrument Z).]

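In this DAG:

  • Back-door path: $T \leftarrow U \rightarrow Y$. This is the confounding path. It creates a non-causal association between T and Y; in this example it inflates the apparent effect, though in general the bias can go in either direction
  • Causal path: $T \rightarrow Y$, the directed path from treatment to outcome, which carries the real causal effect. (The term "front-door" is reserved for Pearl's front-door criterion, which identifies an effect through a mediator.)
  • Instrument: Z (ad exposure) affects T, has no direct effect on Y, and is independent of U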
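The path classification above can be checked mechanically. This is a minimal sketch in plain Python: the `edges` dictionary is a hypothetical encoding of the four-node DAG (U → T, U → Y, Z → T, T → Y), and the helper names are made up for illustration.

```python
# Hypothetical encoding of the DAG: each key lists the nodes it points to.
edges = {"U": ["T", "Y"], "Z": ["T"], "T": ["Y"], "Y": []}


def steps_from(node):
    """Neighbours of `node` ignoring direction, tagged with the arrow as read from `node`."""
    outgoing = [(child, "->") for child in edges[node]]
    incoming = [(parent, "<-") for parent, children in edges.items() if node in children]
    return outgoing + incoming


def simple_paths(current, target, path):
    """Yield every simple path that extends `path` from `current` to `target`."""
    if current == target:
        yield path
        return
    visited = {node for node, _ in path}
    for nxt, arrow in steps_from(current):
        if nxt not in visited:
            yield from simple_paths(nxt, target, path + [(nxt, arrow)])


def render(path):
    """Format a path such as T <- U -> Y."""
    text = path[0][0]
    for node, arrow in path[1:]:
        text += f" {arrow} {node}"
    return text


paths = list(simple_paths("T", "Y", [("T", None)]))
# A back-door path starts with an arrow pointing INTO the treatment.
backdoor_paths = [render(p) for p in paths if p[1][1] == "<-"]
# A causal path follows arrows forward all the way from T to Y.
causal_paths = [render(p) for p in paths if all(a == "->" for _, a in p[1:])]

print("back-door paths:", backdoor_paths)
print("causal paths:   ", causal_paths)
```

Z contributes no path from T to Y because its only edge points into T, which is consistent with the requirement that an instrument reaches Y only through T.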

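2.3 How the three frameworks complement each other

| Framework | Strengths | Typical use |
|---|---|---|
| Potential outcomes (Rubin) | Intuitive; maps directly onto experimental design | A/B test design, PSM matching |
| Causal graphs (Pearl) | Visualizes causal structure; identifies confounding systematically | Analysis of complex confounding, identifying instruments |
| Structural causal models (SCM) | Most complete; supports counterfactual inference | Advanced settings that need individual-level causal effects |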

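3. Confounders: the root of the correlation fallacy

3.1 A formal definition of a confounder

A variable $C$ is a confounder of the causal effect $T \rightarrow Y$ when all three of the following hold:

  1. $C$ affects the treatment $T$
  2. $C$ affects the outcome $Y$
  3. $C$ is not on the causal path $T \rightarrow Y$ (it is not a mediator)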
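Condition 3 matters in practice: adjusting for a confounder removes bias, but adjusting for a mediator removes part of the effect being measured. The sketch below uses an assumed linear model (C → T, C → Y, T → M → Y, with a total effect of T on Y of 0.5 × 0.8 = 0.4); all coefficients and the helper `ols_coef` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000


def ols_coef(columns, y):
    """Least-squares coefficients (intercept first) of y on the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Assumed data-generating process.
c = rng.normal(size=n)                       # confounder
t = 0.9 * c + rng.normal(size=n)             # treatment, driven partly by C
m = 0.5 * t + rng.normal(size=n)             # mediator on the path T -> M -> Y
y = 0.8 * m + 1.0 * c + rng.normal(size=n)   # outcome

unadjusted = ols_coef([t], y)[1]             # biased by the open back-door path
adjust_confounder = ols_coef([t, c], y)[1]   # recovers the total effect
adjust_mediator = ols_coef([t, c, m], y)[1]  # blocks the causal path itself

print(f"no adjustment:            {unadjusted:.3f}")
print(f"adjusting for C:          {adjust_confounder:.3f}")
print(f"adjusting for C and M:    {adjust_mediator:.3f}")
```

In this setup only the second regression targets the total effect; the third conditions on the mediator and so reports the (here nonexistent) direct effect.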

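3.2 Simpson's paradox: confounding can reverse the direction

python
import pandas as pd

# Simpson's paradox with illustrative numbers: drug effectiveness by sex.
# Each row is one (sex, group) cell: number of patients and number recovered.
data = pd.DataFrame(
    {
        'patients':  [600, 100, 100, 200],
        'recovered': [420,  60,  95, 180],
    },
    index=pd.MultiIndex.from_tuples(
        [('male', 'drug'), ('male', 'control'),
         ('female', 'drug'), ('female', 'control')],
        names=['sex', 'group'],
    ),
)

# Pooled data: sex (the confounder) is ignored
overall = data.groupby(level='group').sum()
overall['recovery_rate'] = overall['recovered'] / overall['patients']
print("=== Pooled data (sex ignored) ===")
print(overall)
print("Apparent conclusion: the drug group recovers less often?\n")

# Stratified by sex: the confounder is controlled
data['recovery_rate'] = data['recovered'] / data['patients']
print("=== Stratified by sex ===")
print(data)
print()
print("Actual conclusion: within each sex, the drug group recovers more often")
print("Why: men are both more likely to take the drug and less likely to recover (confounder = sex)")

Key lesson: in unbalanced observational data, comparing groups directly without controlling for confounders can give a conclusion that is the exact opposite of the truth.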


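4. Back-door adjustment: the standard way to remove confounding

4.1 The back-door criterion and the adjustment formula

Back-door criterion: a set of variables $Z$ is a valid back-door adjustment set if it satisfies both conditions below. (Here $Z$ denotes the adjustment set, not the instrument of Section 2.2.)

  1. $Z$ blocks every back-door path between $T$ and $Y$
  2. $Z$ contains no descendant of $T$ (so mediators must not be controlled for)

Back-door adjustment formula:

$$P(Y \mid \text{do}(T=t)) = \sum_z P(Y \mid T=t, Z=z) \cdot P(Z=z)$$

Intuition: compare outcomes across values of T within each stratum of $Z$, then average over the strata, weighted by the marginal distribution of $Z$.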
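With a single discrete adjustment variable, the formula can be evaluated by hand from a table of counts. The sketch below does exactly that; the counts are illustrative (a drug/control study stratified by sex, where sex plays the role of $Z$) and the function names are hypothetical.

```python
# Illustrative counts: stratum -> group -> (patients, recovered)
counts = {
    "male":   {"drug": (600, 420), "control": (100, 60)},
    "female": {"drug": (100, 95),  "control": (200, 180)},
}
total = sum(n for groups in counts.values() for n, _ in groups.values())


def p_z(z):
    """Marginal probability P(Z = z) of a stratum."""
    return sum(n for n, _ in counts[z].values()) / total


def p_y_given_t_z(t, z):
    """Recovery rate P(Y = 1 | T = t, Z = z) inside one stratum."""
    patients, recovered = counts[z][t]
    return recovered / patients


def p_y_do(t):
    """Back-door adjustment: sum over z of P(Y | T = t, Z = z) * P(Z = z)."""
    return sum(p_y_given_t_z(t, z) * p_z(z) for z in counts)


def p_y_given_t(t):
    """Pooled recovery rate P(Y = 1 | T = t), ignoring the strata."""
    patients = sum(counts[z][t][0] for z in counts)
    recovered = sum(counts[z][t][1] for z in counts)
    return recovered / patients


naive_effect = p_y_given_t("drug") - p_y_given_t("control")
adjusted_effect = p_y_do("drug") - p_y_do("control")
print(f"naive difference:    {naive_effect:+.4f}")
print(f"adjusted difference: {adjusted_effect:+.4f}")
```

With these counts the pooled comparison is negative while the adjusted effect is positive: the strata are weighted by $P(Z=z)$ rather than by how many people in each stratum happened to take the drug.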

python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

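def backdoor_adjustment(data, treatment, outcome, confounders):
    """
    Estimate the average treatment effect (ATE) by back-door adjustment,
    implemented as regression adjustment with a linear outcome model.

    Requirements:
    1. All confounders are known and observed
    2. The number of confounders is manageable (curse of dimensionality)

    data: DataFrame
    treatment: name of the treatment column (binary 0/1)
    outcome: name of the outcome column (continuous; with a binary outcome
             this is a linear probability model and the effect is a
             difference in probabilities)
    confounders: list of confounder column names
    """
    # Exact stratification is impractical for continuous confounders,
    # so a regression model stands in for the per-stratum comparison.

    # Approach 1: regression control (under a linear model)
    X = data[confounders + [treatment]]
    y = data[outcome]

    model = LinearRegression()
    model.fit(X, y)

    # Under the linear model, the treatment coefficient is the causal effect
    treatment_coef = model.coef_[-1]

    # Approach 2: predict both potential outcomes for every unit and average
    # the difference. With this additive linear model the result equals the
    # coefficient above; the two differ only once the outcome model has
    # interactions or is nonlinear.
    data_treat = data.copy()
    data_treat[treatment] = 1
    data_control = data.copy()
    data_control[treatment] = 0

    X_treat = data_treat[confounders + [treatment]]
    X_control = data_control[confounders + [treatment]]

    potential_outcome_1 = model.predict(X_treat)
    potential_outcome_0 = model.predict(X_control)

    ate_regression = (potential_outcome_1 - potential_outcome_0).mean()

    return {
        'treatment_coefficient': treatment_coef,
        'ate_regression': ate_regression,
        'method': 'backdoor_regression_adjustment'
    }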

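# Example: ad-effect analysis
np.random.seed(42)
n = 5000

# Confounder: the user's historical activity
activity = np.random.normal(0, 1, n)
# Ad exposure probability depends on activity (active users see more ads)
ad_exposure_prob = 1 / (1 + np.exp(-0.8 * activity))
ad_shown = np.random.binomial(1, ad_exposure_prob)
# Conversion depends on both the ad and activity.
# 0.3 is the ad's coefficient on the log-odds scale, not a change in probability.
logit_effect = 0.3
p_if_shown = 1 / (1 + np.exp(-(logit_effect + 0.7 * activity)))
p_if_not_shown = 1 / (1 + np.exp(-(0.7 * activity)))
conversion_prob = np.where(ad_shown == 1, p_if_shown, p_if_not_shown)
conversion = np.random.binomial(1, conversion_prob)
# True ATE on the probability scale, which is what the estimators below target
true_effect = (p_if_shown - p_if_not_shown).mean()

df = pd.DataFrame({
    'activity': activity,
    'ad_shown': ad_shown,
    'conversion': conversion
})

# Naive comparison (no confounder control)
naive = df.groupby('ad_shown')['conversion'].mean()
naive_effect = naive[1] - naive[0]

# Back-door adjustment (controlling for activity)
adjusted = backdoor_adjustment(df, 'ad_shown', 'conversion', ['activity'])

print(f"True ATE (probability scale):  {true_effect:.3f}")
print(f"Naive correlational estimate:  {naive_effect:.3f}  (inflated by confounding)")
print(f"Back-door adjusted ATE:        {adjusted['ate_regression']:.3f}")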

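5. Propensity score matching (PSM): emulating a randomized experiment

5.1 Definition of the propensity score

The propensity score is the probability that a unit receives treatment, given its covariates $X$:

$$e(X) = P(T=1 \mid X)$$

Rosenbaum-Rubin theorem: the propensity score is a balancing score, so

$$T \perp X \mid e(X)$$

and if treatment assignment is unconfounded given $X$, that is $(Y(0), Y(1)) \perp T \mid X$ with $0 < e(X) < 1$, then it is also unconfounded given $e(X)$ alone.

This means that within a subgroup sharing the same propensity score, treatment assignment is as good as random with respect to the observed covariates. Matching treated and control units with similar propensity scores therefore removes the confounding that comes from those covariates. Unobserved confounders are not addressed.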
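The balancing property $T \perp X \mid e(X)$ can be seen directly in simulated data. In the sketch below the propensity score is known by construction rather than estimated, which is an assumption made to keep the example short; the coefficients and the 0.49 to 0.51 band are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Two covariates and a propensity score that is known by construction.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
e_x = 1 / (1 + np.exp(-(1.0 * x1 + 0.5 * x2)))   # e(X) = P(T=1 | X)
t = rng.binomial(1, e_x)


def mean_gap(x, mask):
    """Treated-minus-control difference in the mean of x within `mask`."""
    return x[mask & (t == 1)].mean() - x[mask & (t == 0)].mean()


everyone = np.ones(n, dtype=bool)
raw_gap = mean_gap(x1, everyone)                 # covariate imbalance overall

# Units whose propensity scores are almost identical.
band = (e_x > 0.49) & (e_x < 0.51)
band_gap = mean_gap(x1, band)

print(f"x1 gap, all units:             {raw_gap:.3f}")
print(f"x1 gap, e(X) in (0.49, 0.51):  {band_gap:.3f}")
print(f"units in the band:             {band.sum()}")
```

Across all units the treated group has clearly higher `x1`; inside the narrow propensity band the gap mostly disappears, which is what matching on the score relies on.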

5.2 A complete PSM implementation

python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

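class PropensityScoreMatching:
    """
    Propensity score matching (PSM)

    Steps:
    1. Estimate the propensity score P(T=1|X) with logistic regression
    2. Nearest-neighbour matching: for each treated unit, find the `ratio`
       control units with the closest propensity scores (with replacement)
    3. Estimate the ATT (average treatment effect on the treated) on the
       matched sample

    Key assumption (ignorability / no unmeasured confounding):
    given the observed covariates X, T is conditionally independent of the
    potential outcomes Y(0), Y(1)
    """
    
    def __init__(self, caliper=0.05, ratio=1):
        """
        caliper: maximum allowed propensity-score gap for a match,
                 in units of the propensity-score standard deviation
        ratio: number of control units matched to each treated unit
        """
        self.caliper = caliper
        self.ratio = ratio
        self.propensity_model = None
        self.matched_data = None
    
    def fit(self, X, T, outcome_name=None, outcome=None):
        """
        X: covariate matrix
        T: treatment indicator (0/1)
        outcome_name: currently unused
        outcome: outcome values (optional; needed to estimate the ATT)
        """
        T = np.asarray(T)
        
        # Step 1: estimate the propensity score
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        self.propensity_model = LogisticRegression(
            max_iter=1000, C=1.0, random_state=42
        )
        self.propensity_model.fit(X_scaled, T)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        self.scaler = scaler
        
        # Step 2: split into treated and control units
        treat_idx = np.where(T == 1)[0]
        control_idx = np.where(T == 0)[0]
        
        treat_ps = self.propensity_scores[treat_idx]
        control_ps = self.propensity_scores[control_idx]
        
        # Step 3: nearest-neighbour matching with a caliper,
        # expressed as a fraction of the propensity-score standard deviation
        ps_std = self.propensity_scores.std()
        actual_caliper = self.caliper * ps_std
        
        nbrs = NearestNeighbors(n_neighbors=self.ratio, algorithm='ball_tree')
        nbrs.fit(control_ps.reshape(-1, 1))
        
        distances, indices = nbrs.kneighbors(treat_ps.reshape(-1, 1))
        
        # Keep only matches inside the caliper. Matching is with replacement:
        # the same control unit may serve several treated units.
        matched_treat = []
        matched_control = []   # flat list of every matched control
        matched_sets = []      # controls per treated unit, aligned with matched_treat
        
        for i, (dist_row, idx_row) in enumerate(zip(distances, indices)):
            valid = dist_row <= actual_caliper
            if valid.any():
                controls = control_idx[idx_row[valid]]
                matched_treat.append(treat_idx[i])
                matched_control.extend(controls)
                matched_sets.append(controls)
        
        self.matched_treat_idx = matched_treat
        self.matched_control_idx = matched_control
        
        print(f"Treated units:          {len(treat_idx)}")
        print(f"Control units:          {len(control_idx)}")
        print(f"Treated units matched:  {len(matched_treat)}")
        print(f"Match rate:             {len(matched_treat)/len(treat_idx):.1%}")
        
        # ATT: each treated outcome minus the mean outcome of its own matched
        # controls, so the pairing stays aligned even when ratio > 1
        if outcome is not None:
            outcome = np.asarray(outcome)
            effects = [
                outcome[t_i] - outcome[c_i].mean()
                for t_i, c_i in zip(matched_treat, matched_sets)
            ]
            self.att = float(np.mean(effects))
            print(f"\nATT (average treatment effect on the treated): {self.att:.4f}")
        
        return self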
    
    def check_balance(self, X, T, feature_names=None):
        """
        Check covariate balance before and after matching.
        A standardized mean difference (SMD) below 0.1 is usually taken
        as good balance.
        """
        X = np.asarray(X)
        T = np.asarray(T)
        if feature_names is None:
            feature_names = [f'X{i}' for i in range(X.shape[1])]
        
        treat_idx = np.where(T == 1)[0]
        control_idx = np.where(T == 0)[0]
        
        results = []
        for i, name in enumerate(feature_names):
            # Before matching
            t_before = X[treat_idx, i].mean()
            c_before = X[control_idx, i].mean()
            pooled_std = np.sqrt((X[treat_idx, i].var() + X[control_idx, i].var()) / 2)
            smd_before = abs(t_before - c_before) / (pooled_std + 1e-10)
            
            # After matching: use every matched control (no truncation) and
            # keep the pre-matching pooled std as the common denominator
            matched_t = X[self.matched_treat_idx, i]
            matched_c = X[self.matched_control_idx, i]
            smd_after = abs(matched_t.mean() - matched_c.mean()) / (pooled_std + 1e-10)
            
            results.append({
                'feature': name,
                'SMD_before': round(smd_before, 3),
                'SMD_after': round(smd_after, 3),
                'balanced': '✅' if smd_after < 0.1 else '⚠️'
            })
        
        df = pd.DataFrame(results)
        print("\n=== Covariate balance check ===")
        print(df.to_string(index=False))
        return df

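6. Instrumental variables (IV): handling unobserved confounding

6.1 When an instrument is needed

Back-door adjustment and PSM share one premise: every confounder is observed. In practice, variables such as "the user's true purchase intent" or "a doctor's experience-based preferences" cannot be measured directly.

An instrumental variable must satisfy three conditions:

  1. Relevance: $Z$ is associated with the treatment $T$ ($Z$ affects $T$)
  2. Exclusion restriction: $Z$ has no direct effect on the outcome $Y$ (it influences $Y$ only through $T$)
  3. Exogeneity (independence): $Z$ is unrelated to the unobserved confounders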
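When the three conditions hold in a linear model, the simplest IV estimator is the ratio of two covariances, $\text{Cov}(Z, Y) / \text{Cov}(Z, T)$ (the Wald estimator). The sketch below simulates such a model; the coefficients and variable names are assumptions chosen for illustration, and the confounder `u` is used only to generate the data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Assumed linear model with an unobserved confounder u.
u = rng.normal(size=n)                        # unobserved: never used in estimation
z = rng.binomial(1, 0.5, n).astype(float)     # instrument, assigned at random (exogeneity)
t = 0.5 * z + 0.8 * u + rng.normal(size=n)    # relevance: z shifts t
true_effect = 0.3
y = true_effect * t + 1.0 * u + rng.normal(size=n)   # exclusion: z enters only via t

# Ordinary regression slope of y on t: biased because u drives both.
ols_slope = np.cov(t, y)[0, 1] / np.var(t, ddof=1)

# Wald / IV estimator: effect of z on y divided by effect of z on t.
wald_iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

print(f"true effect:  {true_effect:.3f}")
print(f"OLS slope:    {ols_slope:.3f}")
print(f"IV (Wald):    {wald_iv:.3f}")
```

The IV estimate is noisier than OLS because it uses only the part of the variation in `t` that comes from `z`, but it is centred on the true effect while OLS is not.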

[Figure: DAG with an unobserved confounder U (e.g. the user's true purchase intent), treatment T (ad click), outcome Y (purchase conversion), and instrument Z (randomized ad placement).]

6.2 Two-stage least squares (2SLS)

python
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats

class TwoStageLeastSquares:
    """
    Two-stage least squares (2SLS) instrumental-variable estimator
    
    Stage 1: regress the treatment T on the instrument Z to get T_hat
             (the part of T driven by Z, free of the unobserved confounding)
    Stage 2: regress the outcome Y on T_hat to get the causal effect
    
    Limitations:
    1. Needs a strong instrument (a weak one gives large variance and bias)
    2. Identifies only the local average treatment effect (LATE) for
       "compliers", not the ATE for the whole population
    """
    
    def __init__(self):
        self.stage1_model = None
        self.stage2_model = None
    
    def fit(self, T, Y, Z, X_controls=None):
        """
        T: treatment (n,)
        Y: outcome (n,)
        Z: instrument(s), shape (n,) or (n, k)
        X_controls: extra control variables (optional)
        """
        n = len(T)
        
        # Stage 1 feature matrix
        if X_controls is not None:
            X1 = np.column_stack([Z, X_controls])
        else:
            X1 = Z.reshape(-1, 1) if Z.ndim == 1 else Z
        
        # Stage 1: T ~ Z + X_controls
        self.stage1_model = LinearRegression()
        self.stage1_model.fit(X1, T)
        T_hat = self.stage1_model.predict(X1)
        
        # Weak-instrument check (rule of thumb: F > 10).
        # This is the overall first-stage F; it equals the F for the
        # instruments only when there are no control variables.
        residuals_stage1 = T - T_hat
        ss_total = np.sum((T - T.mean()) ** 2)
        ss_resid = np.sum(residuals_stage1 ** 2)
        r2_stage1 = 1 - ss_resid / ss_total
        
        k = X1.shape[1]
        f_stat = (r2_stage1 / k) / ((1 - r2_stage1) / (n - k - 1))
        print(f"Stage 1 F statistic: {f_stat:.2f} {'✅ strong instrument' if f_stat > 10 else '⚠️ weak instrument'}")
        
        # Stage 2: Y ~ T_hat + X_controls
        if X_controls is not None:
            X2 = np.column_stack([T_hat, X_controls])
        else:
            X2 = T_hat.reshape(-1, 1)
        
        self.stage2_model = LinearRegression()
        self.stage2_model.fit(X2, Y)
        
        self.late_estimate = self.stage2_model.coef_[0]
        
        # Standard errors (homoskedastic 2SLS). The residuals must use the
        # observed T, not T_hat; plugging in T_hat gives the wrong sigma^2.
        X_observed = X2.copy()
        X_observed[:, 0] = np.asarray(T)
        residuals = Y - (self.stage2_model.intercept_ + X_observed @ self.stage2_model.coef_)
        
        # Design matrix of the second stage, including the intercept column
        D_hat = np.column_stack([np.ones(n), X2])
        dof = n - D_hat.shape[1]
        sigma2 = np.sum(residuals ** 2) / dof
        
        XtX_inv = np.linalg.pinv(D_hat.T @ D_hat)
        self.se = np.sqrt(sigma2 * XtX_inv[1, 1])   # index 1 = coefficient on T_hat
        self.t_stat = self.late_estimate / self.se
        self.p_value = 2 * (1 - stats.t.cdf(abs(self.t_stat), df=dof))
        
        print(f"\nLATE (local average treatment effect): {self.late_estimate:.4f}")
        print(f"Standard error: {self.se:.4f}")
        print(f"t statistic: {self.t_stat:.4f}")
        print(f"p-value: {self.p_value:.4f}")
        
        return self


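# Example: randomized ad placement as the instrument
def simulate_iv_example():
    np.random.seed(42)
    n = 3000
    
    # Unobserved confounder: the user's purchase intent
    purchase_intent = np.random.normal(0, 1, n)
    
    # Instrument: whether the ad appears on the first screen
    # (assigned at random, so independent of purchase intent)
    first_screen = np.random.binomial(1, 0.5, n)
    
    # Ad clicks depend on both the placement and purchase intent
    click_prob = 1 / (1 + np.exp(-(0.6 * first_screen + 0.8 * purchase_intent)))
    ad_click = np.random.binomial(1, click_prob)
    
    # Conversion: the click's coefficient is 0.2 on the log-odds scale,
    # and the confounder has a strong effect as well
    p_if_click = 1 / (1 + np.exp(-(0.2 + 1.0 * purchase_intent)))
    p_if_no_click = 1 / (1 + np.exp(-(1.0 * purchase_intent)))
    conversion = np.random.binomial(1, np.where(ad_click == 1, p_if_click, p_if_no_click))
    # True average effect on the probability scale, the scale 2SLS works on.
    # 2SLS targets the effect among compliers, which need not equal this average.
    true_effect = (p_if_click - p_if_no_click).mean()
    
    # Naive estimate (no confounder control)
    naive_effect = conversion[ad_click == 1].mean() - conversion[ad_click == 0].mean()
    
    # 2SLS estimate
    iv = TwoStageLeastSquares()
    iv.fit(ad_click.astype(float), conversion.astype(float), first_screen.astype(float))
    
    print(f"\nTrue average effect (probability scale): {true_effect:.4f}")
    print(f"Naive correlational estimate:            {naive_effect:.4f} (inflated by confounding)")
    print(f"2SLS IV estimate:                        {iv.late_estimate:.4f}")
    
    return iv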

iv_result = simulate_iv_example()

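7. Causal discovery: inferring the causal graph from data

7.1 Why causal discovery is needed

Back-door adjustment, PSM, and instrumental variables all assume that the causal structure is known (which variables are confounders and which are mediators). In practice, researchers often do not fully know the causal relations among the variables.

Causal discovery: inferring the DAG structure automatically from observational data.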
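What makes this possible is that different graph structures leave different conditional-independence patterns in the data. The sketch below contrasts two assumed three-variable structures, a fork and a collider, using partial correlation on simulated linear-Gaussian data; the helper `partial_corr` and all coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000


def partial_corr(a, b, given=None):
    """Correlation of a and b after linearly regressing out `given` (if any)."""
    if given is not None:
        G = np.column_stack([np.ones(len(a)), given])
        a = a - G @ np.linalg.lstsq(G, a, rcond=None)[0]
        b = b - G @ np.linalg.lstsq(G, b, rcond=None)[0]
    return np.corrcoef(a, b)[0, 1]


# Fork: X <- C -> Y. X and Y are associated, and independent given C.
c = rng.normal(size=n)
x = c + rng.normal(size=n)
y = c + rng.normal(size=n)
fork_marginal = partial_corr(x, y)
fork_given_c = partial_corr(x, y, c)

# Collider: A -> S <- B. A and B are independent, and dependent given S.
a = rng.normal(size=n)
b = rng.normal(size=n)
s = a + b + rng.normal(size=n)
collider_marginal = partial_corr(a, b)
collider_given_s = partial_corr(a, b, s)

print(f"fork:     corr(X, Y) = {fork_marginal:+.3f},  corr(X, Y | C) = {fork_given_c:+.3f}")
print(f"collider: corr(A, B) = {collider_marginal:+.3f},  corr(A, B | S) = {collider_given_s:+.3f}")
```

Conditioning removes the association in the fork and creates one in the collider. Constraint-based algorithms use tests of this kind to decide which edges to keep.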

7.2 The core idea of the PC algorithm

python
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

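def pc_algorithm_skeleton(data, alpha=0.05):
    """
    PC algorithm, phase 1: skeleton learning
    
    Core idea:
    1. Start from the complete graph (every pair of variables connected)
    2. Grow the conditioning-set size step by step, running conditional
       independence tests
    3. If X ⊥ Y | Z, remove the X-Y edge and record Z as the separating set
    
    Limitations:
    - The test used here assumes linear relations and Gaussian noise
      (nonparametric tests can replace it)
    - Worst-case cost grows exponentially with the number of variables
    - Not robust to data that violate the assumptions
    """
    n_vars = data.shape[1]
    var_names = data.columns.tolist() if hasattr(data, 'columns') else list(range(n_vars))
    
    # Start from the complete undirected graph, stored as adjacency sets
    adjacency = {v: set(var_names) - {v} for v in var_names}
    sep_sets = {}
    
    cond_set_size = 0
    
    while True:
        for x in var_names:
            for y in list(adjacency[x]):
                # Every ordered pair is visited: (x, y) conditions on subsets
                # of x's neighbours and (y, x) on subsets of y's neighbours,
                # so both sides of the edge are covered.
                adj_x = adjacency[x] - {y}
                
                if len(adj_x) < cond_set_size:
                    continue
                
                # Try every conditioning subset of size cond_set_size
                for z_set in combinations(adj_x, cond_set_size):
                    z_set = list(z_set)
                    
                    # Conditional independence test (partial correlation)
                    is_independent = conditional_independence_test(
                        data, x, y, z_set, alpha
                    )
                    
                    if is_independent:
                        # Remove the edge x-y and remember what separated them
                        adjacency[x].discard(y)
                        adjacency[y].discard(x)
                        sep_sets[(x, y)] = z_set
                        sep_sets[(y, x)] = z_set
                        break
        
        cond_set_size += 1
        
        # Stop once no node has enough other neighbours to form a
        # conditioning set of the new size
        max_adj_size = max(len(adj) for adj in adjacency.values())
        if max_adj_size - 1 < cond_set_size or cond_set_size > n_vars:
            break
    
    return adjacency, sep_sets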


def conditional_independence_test(data, x, y, z_set, alpha=0.05):
    """
    Conditional independence test based on the partial correlation coefficient
    H0: X ⊥ Y | Z
    Returns True when H0 is not rejected at level alpha.
    """
    from sklearn.linear_model import LinearRegression
    
    is_frame = hasattr(data, 'values')
    
    if len(z_set) == 0:
        # Unconditional case: plain Pearson correlation
        if is_frame:
            x_vals = data[x].values
            y_vals = data[y].values
        else:
            x_vals = data[:, x]
            y_vals = data[:, y]
        
        r, p = pearsonr(x_vals, y_vals)
        return p > alpha
    
    else:
        # Partial correlation: regress z_set out of both variables and
        # correlate the residuals
        if is_frame:
            Z = data[z_set].values
            X_col = data[x].values
            Y_col = data[y].values
        else:
            Z = data[:, z_set]
            X_col = data[:, x]
            Y_col = data[:, y]
        
        reg_x = LinearRegression().fit(Z, X_col)
        reg_y = LinearRegression().fit(Z, Y_col)
        
        resid_x = X_col - reg_x.predict(Z)
        resid_y = Y_col - reg_y.predict(Z)
        
        # The p-value ignores the degrees of freedom used by the regressions,
        # an approximation that is adequate when n is much larger than len(z_set)
        r, p = pearsonr(resid_x, resid_y)
        return p > alpha

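8. In practice: comparing three causal estimates of ad effectiveness

8.1 Scenario

A platform has run a batch of personalized ads and wants to estimate the true causal effect of "showing the ad" on "purchase conversion within 7 days".

In the observational data, high-value users are both more likely to be targeted by the ad and more likely to buy. This is textbook confounding.

python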
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def simulate_ad_effectiveness_study(n=10000, seed=42):
    """
    Simulate data for an ad-effectiveness causal analysis.
    The ad's true effect is a log-odds coefficient of 0.05
    (the ad itself adds very little).
    """
    np.random.seed(seed)
    
    # User features (observed covariates)
    age = np.random.normal(35, 10, n)
    income_level = np.random.randint(1, 6, n)  # levels 1-5
    platform_activity = np.random.exponential(2, n)  # historical activity
    
    # Purchase intent: not available to the analyst. It is built from
    # income and activity plus private noise, then rescaled to [0, 1].
    purchase_intent = 0.3 * (income_level - 3) + 0.4 * platform_activity + np.random.normal(0, 1, n)
    purchase_intent = (purchase_intent - purchase_intent.min()) / (purchase_intent.max() - purchase_intent.min())
    
    # Ad targeting: high-value users (active, high income) are more likely to be shown the ad
    ad_score = 0.5 * income_level + 0.8 * platform_activity + np.random.normal(0, 0.5, n)
    ad_prob = 1 / (1 + np.exp(-0.8 * (ad_score - ad_score.mean()) / ad_score.std()))
    ad_shown = np.random.binomial(1, ad_prob)
    
    # Instrument: ad delivery time slot, rotated at random by the system
    # (so unrelated to user features). One draw per user.
    peak_hour = np.random.binomial(1, 0.4, n)  # 40% fall in the peak slot
    # The display rate is higher in the peak slot (the relevance condition)
    ad_prob_with_iv = np.clip(ad_prob + 0.15 * peak_hour, 0, 1)
    ad_shown_iv = np.random.binomial(1, ad_prob_with_iv)
    
    # Purchase outcome. true_effect is the coefficient of ad_shown on the
    # log-odds scale; the implied change in purchase probability is smaller.
    # Note that bought is generated from ad_shown, not from ad_shown_iv.
    true_effect = 0.05
    buy_prob = 1 / (1 + np.exp(-(true_effect * ad_shown + 0.9 * purchase_intent + np.random.normal(0, 0.1, n))))
    bought = np.random.binomial(1, buy_prob)
    
    df = pd.DataFrame({
        'age': age,
        'income_level': income_level,
        'platform_activity': platform_activity,
        'ad_shown': ad_shown,
        'ad_shown_iv': ad_shown_iv,
        'peak_hour': peak_hour,
        'bought': bought,
        'purchase_intent': purchase_intent  # unobservable in practice; kept only for validation
    })
    
    return df, true_effect

df, true_effect = simulate_ad_effectiveness_study()

# ================================================================
# Method 1: naive comparison (no adjustment for confounding)
# ================================================================
naive = df.groupby('ad_shown')['bought'].mean()
naive_effect = naive[1] - naive[0]

# ================================================================
# Method 2: backdoor adjustment (adjust for observed confounders)
# ================================================================
X_confounders = df[['age', 'income_level', 'platform_activity']].values
T = df['ad_shown'].values
Y = df['bought'].values

result_backdoor = backdoor_adjustment(
    df, 'ad_shown', 'bought',
    ['age', 'income_level', 'platform_activity']
)

# ================================================================
# Method 3: PSM (propensity score matching)
# ================================================================
psm = PropensityScoreMatching(caliper=0.05)
psm.fit(X_confounders, T, outcome=Y)

# ================================================================
# Method 4: instrumental variable (2SLS), for unobserved confounding
# ================================================================
iv_2sls = TwoStageLeastSquares()
iv_2sls.fit(
    df['ad_shown_iv'].values.astype(float),
    Y.astype(float),
    df['peak_hour'].values.astype(float)
)

# ================================================================
# Summary of the comparison
# ================================================================
print("\n" + "="*55)
print(f"{'Method':<25} {'Estimate':>12} {'Error':>10}")
print("="*55)
estimates = {
    'True causal effect': true_effect,
    'Naive correlation': naive_effect,
    'Backdoor (regression)': result_backdoor['ate_regression'],
    'PSM (ATT)': getattr(psm, 'att', None),
    'Instrumental var (LATE)': iv_2sls.late_estimate
}

for method, est in estimates.items():
    if est is not None:
        # Absolute distance from the known simulated effect
        error = abs(est - true_effect)
        marker = " ✅" if error < 0.01 else (" ⚠️" if error < 0.03 else " ❌")
        print(f"{method:<25} {est:>12.4f} {error:>10.4f}{marker}")
print("="*55)
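
The comparison above relies on `backdoor_adjustment`, `PropensityScoreMatching`, and `TwoStageLeastSquares` defined earlier in the article. For readers who want something runnable in isolation, here is a much smaller sketch of the same idea. Everything in it is an assumption made for illustration: a single binary confounder `c`, a randomized binary instrument `z`, a true effect of 0.05, and three hypothetical helper functions that use only NumPy.

```python
import numpy as np

def simulate(n=200_000, seed=0):
    """Toy data: c confounds t and y; z shifts t but never touches y directly."""
    rng = np.random.default_rng(seed)
    c = rng.binomial(1, 0.5, n)                      # observed confounder (high-value user)
    z = rng.binomial(1, 0.5, n)                      # randomized instrument
    t = rng.binomial(1, 0.2 + 0.4 * c + 0.2 * z)     # ad exposure
    y = rng.binomial(1, 0.10 + 0.05 * t + 0.30 * c)  # purchase; true effect of t is 0.05
    return c, z, t, y

def naive_difference(t, y):
    """Difference in purchase rates, ignoring confounding."""
    return y[t == 1].mean() - y[t == 0].mean()

def regression_adjustment(t, y, c):
    """OLS of y on [1, t, c]; the coefficient on t is the adjusted effect."""
    design = np.column_stack([np.ones(len(t)), t, c])
    coef, *_ = np.linalg.lstsq(design, y.astype(float), rcond=None)
    return coef[1]

def wald_iv(t, y, z):
    """Wald estimator: effect of z on y divided by effect of z on t."""
    return (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

c, z, t, y = simulate()
estimates = {
    "naive": naive_difference(t, y),
    "regression adjustment": regression_adjustment(t, y, c),
    "Wald IV": wald_iv(t, y, z),
}
for name, value in estimates.items():
    print(f"{name:<24}{value:.4f}")
```

In this toy setup the naive gap should land far above 0.05, while the adjusted and IV estimates should land near it. The IV estimate is usually the noisiest of the three, because it divides by a first-stage difference that is itself estimated.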

8.2 Interpreting the Results

[Figure: method comparison. Approximate values as reported by the author; a rerun will vary with the seed and implementation.]

  • True effect: 0.05
  • Naive comparison: 0.07-0.12 (biased upward, large error)
  • Backdoor adjustment: ~0.052 (close; requires observed confounders)
  • PSM: ~0.053 (close; nonparametric, flexible)
  • Instrumental variable: ~0.048 (closest; can handle unobserved confounding)

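| Method | Accuracy | When it applies | Implementation difficulty |
|---|---|---|---|
| Naive comparison | ❌ Large bias | No confounding (very rare) | Very easy |
| Backdoor adjustment | ✅ Good | Confounders are observed | (not given) |
| PSM | ✅ Good | Confounders are observed; no outcome model to specify | (not given) |
| Instrumental variable | ✅✅ Closest in this example, though IV estimates generally have higher variance | Requires a valid instrument | (not given) |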
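
"Requires a valid instrument" has two parts. Exogeneity (the instrument affects the outcome only through the treatment) cannot be tested from data alone, but relevance can be checked with the first-stage F statistic. The sketch below is an illustration under assumptions: `first_stage_f` is a hypothetical helper, the data are synthetic, and the "F above roughly 10" threshold is a widely used rule of thumb rather than a guarantee.

```python
import numpy as np

def first_stage_f(t, z):
    """F statistic of the single-instrument first stage t ~ 1 + z."""
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(t)
    design = np.column_stack([np.ones(n), z])
    coef, *_ = np.linalg.lstsq(design, t, rcond=None)
    resid = t - design @ coef
    rss = resid @ resid                      # residual sum of squares
    tss = np.sum((t - t.mean()) ** 2)        # total sum of squares
    return (tss - rss) / (rss / (n - 2))     # one restriction, n - 2 residual dof

rng = np.random.default_rng(1)
z = rng.binomial(1, 0.4, 10_000)             # e.g. a randomly rotated peak-hour slot
t_strong = rng.binomial(1, 0.3 + 0.15 * z)   # slot raises exposure by 15 points
t_weak = rng.binomial(1, 0.3 + 0.005 * z)    # slot barely moves exposure

f_strong = first_stage_f(t_strong, z)
f_weak = first_stage_f(t_weak, z)
print(f"strong instrument F = {f_strong:.1f}")
print(f"weak instrument   F = {f_weak:.1f}")
```

A weak first stage makes the IV estimate unstable, because the estimator divides by a quantity that is close to zero.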

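9. Putting Causal Inference into Production

9.1 The Cost of Moving from Correlational to Causal Models

| Dimension | Correlational model (ML) | Causal model |
|---|---|---|
| Goal | Predict $\hat{Y}$ | Estimate $\tau$ (the treatment effect) |
| Data requirements | Large volumes of historical data | Randomization, a valid instrument, or observed confounders |
| Evaluation | AUC, MSE | ATT, LATE + confidence intervals |
| Decisions it suits | Personalized recommendation | Policy interventions, pricing, marketing |
| Extrapolation | (not given) | Strong (causal effects carry over to interventions, provided the assumptions hold) |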

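9.2 When Do You Need Causal Inference?

Three diagnostic questions

  1. Are you deciding on an intervention? ("Send users a coupon" vs. "predict whether a user will buy") → Intervention decisions need causal estimates.
  2. Is there obvious confounding? (High-value users are more likely to be selected for treatment) → Confounding calls for causal methods.
  3. Can you run an A/B test? → If you can, run it. Fall back on observational methods only when you cannot (price increases, long-term interventions).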
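
When question 3 is answered "yes", randomization removes confounding and the analysis reduces to a difference in conversion rates with a confidence interval. A minimal sketch follows, using a normal-approximation interval; `ab_test_lift` is a hypothetical helper and the counts are made up for illustration.

```python
import math

def ab_test_lift(conv_treat, n_treat, conv_ctrl, n_ctrl, z_crit=1.96):
    """Absolute lift in conversion rate and its approximate 95% interval."""
    p_t = conv_treat / n_treat
    p_c = conv_ctrl / n_ctrl
    lift = p_t - p_c
    # Standard error of a difference between two independent proportions
    se = math.sqrt(p_t * (1 - p_t) / n_treat + p_c * (1 - p_c) / n_ctrl)
    return lift, (lift - z_crit * se, lift + z_crit * se)

# Hypothetical experiment: 5,000 users per arm
lift, (ci_low, ci_high) = ab_test_lift(
    conv_treat=1650, n_treat=5000,
    conv_ctrl=1400, n_ctrl=5000,
)
print(f"lift = {lift:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the data are consistent with a real effect of the intervention, with no confounder adjustment needed because assignment was random.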

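9.3 Double Machine Learning (Double ML): Causal Estimation at Scale

When the confounders are high-dimensional, double machine learning (Chernozhukov et al., 2018) uses ML models for both nuisance tasks: predicting the treatment from the covariates (the propensity score) and predicting the outcome from the covariates. It then regresses the outcome residuals on the treatment residuals to obtain a debiased causal estimate.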
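
Written out, the partially linear model behind this procedure looks as follows. The symbols $\theta$, $g$, $m$, and $\ell$ are introduced here for illustration; $V$ and $U$ are the treatment and outcome residuals, as named in this section's code.

```latex
\begin{aligned}
Y &= \theta\,T + g(X) + \varepsilon, & E[\varepsilon \mid X, T] &= 0,\\
T &= m(X) + V, & E[V \mid X] &= 0,\\
\hat{V}_i &= T_i - \hat{m}(X_i), & \hat{U}_i &= Y_i - \hat{\ell}(X_i), \quad \ell(X) = E[Y \mid X],\\
\hat{\theta} &= \frac{\sum_i \hat{V}_i\,\hat{U}_i}{\sum_i \hat{V}_i^{2}}.
\end{aligned}
```

Subtracting $E[Y \mid X] = \theta\,m(X) + g(X)$ from the first equation gives $Y - \ell(X) = \theta\,V + \varepsilon$, which is why a no-intercept regression of $U$ on $V$ recovers $\theta$.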

python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold
import numpy as np

def double_machine_learning(T, Y, X, n_splits=5, random_state=42):
    """
    Double machine learning (partially linear model).
    Suited to: high-dimensional confounders, a binary treatment T, and a
    constant (linear) causal effect.

    Core idea:
    1. Predict T from X with ML and keep the residual V
       (removes the confounders' influence on the treatment)
    2. Predict Y from X with ML and keep the residual U
       (removes the confounders' influence on the outcome)
    3. Regress U on V; the slope is the causal effect

    Key point: cross-fitting keeps overfitting in the nuisance models
    from biasing the estimate.
    """
    X, T, Y = np.asarray(X), np.asarray(T), np.asarray(Y)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    V = np.zeros_like(T, dtype=float)  # residual of T
    U = np.zeros_like(Y, dtype=float)  # residual of Y

    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        T_train, T_test = T[train_idx], T[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]

        # Nuisance model 1: predict T (propensity score), evaluated out of fold
        model_t = GradientBoostingClassifier(n_estimators=50, random_state=random_state)
        model_t.fit(X_train, T_train)
        T_hat = model_t.predict_proba(X_test)[:, 1]
        V[test_idx] = T_test - T_hat

        # Nuisance model 2: predict Y, evaluated out of fold
        model_y = GradientBoostingRegressor(n_estimators=50, random_state=random_state)
        model_y.fit(X_train, Y_train)
        Y_hat = model_y.predict(X_test)
        U[test_idx] = Y_test - Y_hat

    # Final stage: no-intercept regression of U on V
    theta, _, _, _ = np.linalg.lstsq(V.reshape(-1, 1), U, rcond=None)
    ate_estimate = theta[0]

    # Heteroskedasticity-robust (sandwich) standard error
    residuals = U - V * ate_estimate
    se = np.sqrt(np.mean(V**2 * residuals**2) / len(T)) / np.mean(V**2)

    print(f"\nDouble machine learning estimate:")
    print(f"ATE: {ate_estimate:.4f}")
    print(f"95% CI: [{ate_estimate - 1.96*se:.4f}, {ate_estimate + 1.96*se:.4f}]")

    return ate_estimate, se
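
To see the residual-on-residual step recover a known effect, here is a dependency-light sketch on synthetic data. It is not the function above: to stay NumPy-only it swaps the gradient-boosting nuisance models for cross-fitted linear regressions, which is adequate here only because the simulated outcome is linear in the confounders. The helper names and the true effect of 0.5 are assumptions for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def cross_fitted_plr(T, Y, X, n_splits=5, seed=0):
    """Cross-fitted partialling-out estimator with linear nuisance models."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), n_splits)
    V = np.zeros(n)  # treatment residuals
    U = np.zeros(n)  # outcome residuals
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit on the other folds, predict on the held-out fold
        V[test_idx] = T[test_idx] - predict_linear(fit_linear(X[train], T[train]), X[test_idx])
        U[test_idx] = Y[test_idx] - predict_linear(fit_linear(X[train], Y[train]), X[test_idx])
    theta = (V @ U) / (V @ V)
    eps = U - theta * V
    se = np.sqrt(np.mean(V**2 * eps**2) / n) / np.mean(V**2)  # sandwich SE
    return theta, se

# Synthetic data: X confounds both T and Y; the true effect of T on Y is 0.5
rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 5))
score = X[:, 0] + 0.5 * X[:, 1]
T = rng.binomial(1, 1 / (1 + np.exp(-score))).astype(float)
Y = 0.5 * T + score + rng.normal(size=n)

naive = Y[T == 1].mean() - Y[T == 0].mean()
theta, se = cross_fitted_plr(T, Y, X)
print(f"naive difference: {naive:.3f}")
print(f"cross-fitted estimate: {theta:.3f} (SE {se:.3f})")
```

The naive difference absorbs the confounders' effect and should land well above 0.5, while the cross-fitted estimate should land close to it. With nonlinear confounding, flexible learners such as the gradient-boosting models in the function above are the more appropriate choice.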

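10. Summary

Moving from correlation to causation is not a technical upgrade; it is a switch of mental framework.

  • Correlational models answer "who will buy" and help you reach the right users.
  • Causal models answer "how many extra purchases did the ad cause" and help you quantify the value of an intervention.

Where each of the four methods applies:

| Method | Core assumption | Best suited for |
|---|---|---|
| Backdoor adjustment | No unobserved confounding | Few covariates, well-understood business context |
| PSM | No unobserved confounding | Complex treatment assignment, when you prefer not to specify an outcome model |
| Instrumental variable | A valid instrument exists | Unobserved confounding, with a source of random variation available |
| Double ML | No unobserved confounding; constant (partially linear) causal effect | High-dimensional confounders, large samples |

Causal inference has long sat in the ML engineer's toolbox as something "heard of but never used". The barrier is not the math but problem identification: learning to spot confounders and to ask "what would happen if we intervened" is where the real value of these methods lies.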


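References and Further Reading

Causal inference and imbalanced-data handling are naturally linked: in observational studies the treatment and control groups are often severely imbalanced. See the earlier post "Imbalanced Data in Practice: Sampling Strategies / Cost-Sensitive Learning / Evaluation Metrics / Business Scenarios" for balancing strategies beyond PSM.

The rigor of model evaluation applies equally to causal estimates: confidence intervals and significance tests for causal effects follow the same line of thinking as the earlier post "Model Evaluation and Validation: Cross-Validation Strategies / Statistical Tests / Calibration Curves / Multi-Metric Decision Frameworks".

To understand why correlation between a feature and the target is not causation, see "Advanced Feature Selection and Feature Engineering: Filter / Wrapper / Embedded Methods + Domain-Specific Feature Design". Mutual information and correlation coefficients measure statistical association, not causal strength.

The "labeling cost" problem in semi-supervised and self-supervised learning can also be viewed through a causal lens: the distribution gap between labeled and unlabeled data is a form of selection bias. See the earlier post "Semi-Supervised and Self-Supervised Learning: Practical Solutions for Label-Scarce Settings" for related strategies.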

