Confounding Example: Finding Causal Effects from Observed Data

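Suppose you are given some data containing a treatment and an outcome. Can you determine whether the treatment causes the outcome, or whether the observed correlation is entirely due to another common cause?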

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import dowhy
from dowhy import CausalModel
import dowhy.datasets, dowhy.plotter

# Config dict to set the logging level
import logging.config
DEFAULT_LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'loggers': {
        '': {
            'level': 'INFO',
        },
    }
}

logging.config.dictConfig(DEFAULT_LOGGING)

Let's create a mystery dataset for which we need to determine whether there is a causal effect.

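Create the dataset. It is generated by one of the following two models:

* Model 1: The treatment causes the outcome.
* Model 2: The treatment does not cause the outcome. All observed correlation is due to a common cause.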

[2]:
# Randomly pick the true effect: 1 (Model 1) or 0 (Model 2)
rvar = 1 if np.random.uniform() > 0.5 else 0
data_dict = dowhy.datasets.xy_dataset(10000, effect=rvar,
                                      num_common_causes=1,
                                      sd_error=0.2)
df = data_dict['df']
print(df[["Treatment", "Outcome", "w0"]].head())
   Treatment    Outcome        w0
0   7.588773  15.439320  1.483548
1   8.464945  17.079393  2.464048
2   1.985935   4.099270 -3.865950
3   7.118398  13.774239  0.746508
4   3.870168   7.347380 -2.322948
[3]:
dowhy.plotter.plot_treatment_outcome(df[data_dict["treatment_name"]], df[data_dict["outcome_name"]],
                             df[data_dict["time_val"]])
[Figure: output of dowhy.plotter.plot_treatment_outcome]
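To see why the raw data cannot distinguish the two candidate models, here is a minimal NumPy sketch of both. It is not the implementation of `dowhy.datasets.xy_dataset`; the function `simulate_xy` and the coefficients 2.0 and 3.0 are assumptions chosen only for illustration.

```python
import numpy as np

def simulate_xy(n, effect, seed=0):
    """Toy generator (hypothetical): w0 drives both variables.

    `effect` is the true causal slope of treatment on outcome
    (1 for Model 1, 0 for Model 2).
    """
    rng = np.random.default_rng(seed)
    w0 = rng.normal(0.0, 1.0, n)                     # common cause
    treatment = 2.0 * w0 + rng.normal(0.0, 0.2, n)   # treatment depends on w0
    outcome = 3.0 * w0 + effect * treatment + rng.normal(0.0, 0.2, n)
    return w0, treatment, outcome

correlations = {}
for effect in (1, 0):
    _, treatment, outcome = simulate_xy(10_000, effect)
    correlations[effect] = np.corrcoef(treatment, outcome)[0, 1]
    print(f"effect={effect}: corr(Treatment, Outcome) = {correlations[effect]:.3f}")
```

In this sketch both models produce a strong correlation between treatment and outcome, so correlation alone does not reveal which model generated the data.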

Using DoWhy to Resolve the Mystery: Does the Treatment Cause the Outcome?

Step 1: Model the Problem as a Causal Graph

Initialize the causal model.

[4]:
model = CausalModel(
        data=df,
        treatment=data_dict["treatment_name"],
        outcome=data_dict["outcome_name"],
        common_causes=data_dict["common_causes_names"],
        instruments=data_dict["instrument_names"])
model.view_model(layout="dot")
[Figure: causal graph drawn by model.view_model]

Display the causal model stored in the local file "causal_model.png".

[5]:
from IPython.display import Image, display
display(Image(filename="causal_model.png"))
[Figure: causal_model.png]

Step 2: Identify the Causal Effect Using Properties of the Formal Causal Graph

Identify the causal effect using properties of the causal graph.

[6]:
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d
────────────(E[Outcome|w0])
d[Treatment]
Estimand assumption 1, Unconfoundedness: If U→{Treatment} and U→Outcome then P(Outcome|Treatment,w0,U) = P(Outcome|Treatment,w0)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
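The backdoor estimand above describes how the outcome changes with the treatment once the common cause `w0` is held fixed. The following sketch illustrates that idea on simulated data by comparing the slope over all rows with the slope inside a narrow stratum of `w0`. The data-generating coefficients, the stratum width of 0.05, and the helper `slope` are assumptions made for this example, not part of DoWhy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_effect = 1.0                                   # assumed for this toy example

w0 = rng.normal(0.0, 1.0, n)                        # common cause
treatment = 2.0 * w0 + rng.normal(0.0, 1.0, n)
outcome = 3.0 * w0 + true_effect * treatment + rng.normal(0.0, 0.2, n)

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Slope over all rows: mixes the causal effect with the path through w0.
naive = slope(treatment, outcome)

# Slope inside a narrow band of w0: w0 is (almost) held constant.
stratum = np.abs(w0) < 0.05
conditional = slope(treatment[stratum], outcome[stratum])

print(f"naive slope:       {naive:.3f}")
print(f"slope given w0~0:  {conditional:.3f}")
```

Holding `w0` approximately constant blocks the backdoor path through the common cause, so the within-stratum slope lands close to the assumed true effect while the naive slope does not.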

Step 3: Estimate the Causal Effect

Once we have identified the estimand, we can use any statistical method to estimate the causal effect.

For simplicity, we use linear regression.

[7]:
estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression")
print("Causal Estimate is " + str(estimate.value))

# Plot the fitted line between treatment and outcome; its slope is the causal effect estimate
dowhy.plotter.plot_causal_effect(estimate, df[data_dict["treatment_name"]], df[data_dict["outcome_name"]])
Causal Estimate is 1.0121551154547133
[Figure: output of dowhy.plotter.plot_causal_effect]
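Conceptually, a backdoor estimate via linear regression amounts to regressing the outcome on the treatment together with the common cause and reading off the treatment coefficient. The sketch below shows this with plain `numpy.linalg.lstsq` on simulated data. It is an illustration under assumed coefficients, not DoWhy's internal code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_effect = 1.0                                   # assumed for this toy example

w0 = rng.normal(0.0, 1.0, n)                        # common cause
treatment = 2.0 * w0 + rng.normal(0.0, 0.2, n)
outcome = 3.0 * w0 + true_effect * treatment + rng.normal(0.0, 0.2, n)

# Adjusted regression: outcome ~ intercept + treatment + w0
X_adj = np.column_stack([np.ones(n), treatment, w0])
coef_adj, *_ = np.linalg.lstsq(X_adj, outcome, rcond=None)

# Naive regression: outcome ~ intercept + treatment (ignores w0)
X_naive = np.column_stack([np.ones(n), treatment])
coef_naive, *_ = np.linalg.lstsq(X_naive, outcome, rcond=None)

print(f"adjusted treatment coefficient: {coef_adj[1]:.3f}")
print(f"naive treatment coefficient:    {coef_naive[1]:.3f}")
```

Including `w0` as a regressor recovers a coefficient near the assumed effect, whereas the naive fit absorbs the influence of the common cause into the treatment coefficient.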

Checking Whether the Estimate Is Correct

[8]:
print("DoWhy estimate is " + str(estimate.value))
print("Actual true causal effect was {0}".format(rvar))
DoWhy estimate is 1.0121551154547133
Actual true causal effect was 1

Step 4: Refute the Estimate

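We can also try to refute the estimate to check its robustness to our assumptions (a form of sensitivity analysis, but more extensive).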

Adding a Random Common Cause Variable

[9]:
res_random=model.refute_estimate(identified_estimand, estimate, method_name="random_common_cause")
print(res_random)
Refute: Add a random common cause
Estimated effect:1.0121551154547133
New effect:1.0121408959949347
p value:0.96
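The idea behind this check can be sketched without DoWhy: add a covariate that is pure noise and confirm that the adjusted estimate barely moves. The helper `adjusted_slope` and the simulated data are assumptions for illustration. DoWhy's refuter additionally reports a p-value, which this sketch does not compute.

```python
import numpy as np

def adjusted_slope(treatment, outcome, *covariates):
    """Treatment coefficient from a least-squares fit that includes the covariates."""
    X = np.column_stack([np.ones_like(treatment), treatment, *covariates])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]

rng = np.random.default_rng(2)
n = 10_000
w0 = rng.normal(0.0, 1.0, n)                        # real common cause
treatment = 2.0 * w0 + rng.normal(0.0, 0.2, n)
outcome = 3.0 * w0 + 1.0 * treatment + rng.normal(0.0, 0.2, n)

original = adjusted_slope(treatment, outcome, w0)

# A "common cause" that is independent noise should not change the estimate.
random_cause = rng.normal(0.0, 1.0, n)
refuted = adjusted_slope(treatment, outcome, w0, random_cause)

print(f"original estimate:        {original:.4f}")
print(f"with random common cause: {refuted:.4f}")
```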

Replacing the Treatment with a Random (Placebo) Variable

[10]:
res_placebo=model.refute_estimate(identified_estimand, estimate,
        method_name="placebo_treatment_refuter", placebo_type="permute")
print(res_placebo)
Refute: Use a Placebo Treatment
Estimated effect:1.0121551154547133
New effect:-7.908637019507836e-05
p value:1.0
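A permutation placebo can be sketched as follows: shuffle the treatment column so that it no longer relates to the outcome or the common cause, then re-estimate. If the original estimate reflects a real effect, the placebo estimate should be close to zero. The helper and the simulated data are assumptions for illustration, not DoWhy's implementation.

```python
import numpy as np

def adjusted_slope(treatment, outcome, *covariates):
    """Treatment coefficient from a least-squares fit that includes the covariates."""
    X = np.column_stack([np.ones_like(treatment), treatment, *covariates])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]

rng = np.random.default_rng(3)
n = 10_000
w0 = rng.normal(0.0, 1.0, n)
treatment = 2.0 * w0 + rng.normal(0.0, 0.2, n)
outcome = 3.0 * w0 + 1.0 * treatment + rng.normal(0.0, 0.2, n)

original = adjusted_slope(treatment, outcome, w0)

# Shuffling breaks any link between the treatment column and the other variables.
placebo_treatment = rng.permutation(treatment)
placebo = adjusted_slope(placebo_treatment, outcome, w0)

print(f"original estimate: {original:.4f}")
print(f"placebo estimate:  {placebo:.4f}")
```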

Removing a Random Subset of the Data

[11]:
res_subset=model.refute_estimate(identified_estimand, estimate,
        method_name="data_subset_refuter", subset_fraction=0.9)
print(res_subset)

Refute: Use a subset of data
Estimated effect:1.0121551154547133
New effect:1.0119332285840539
p value:0.96
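The subset check can be sketched in the same spirit: repeatedly re-estimate on a random 90% of the rows and compare against the full-data estimate. The number of repetitions (20), the helper, and the simulated data are assumptions for this example only.

```python
import numpy as np

def adjusted_slope(treatment, outcome, *covariates):
    """Treatment coefficient from a least-squares fit that includes the covariates."""
    X = np.column_stack([np.ones_like(treatment), treatment, *covariates])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]

rng = np.random.default_rng(4)
n = 10_000
w0 = rng.normal(0.0, 1.0, n)
treatment = 2.0 * w0 + rng.normal(0.0, 0.2, n)
outcome = 3.0 * w0 + 1.0 * treatment + rng.normal(0.0, 0.2, n)

original = adjusted_slope(treatment, outcome, w0)

subset_estimates = []
for _ in range(20):
    # Keep a random 90% of the rows, sampled without replacement.
    idx = rng.choice(n, size=int(0.9 * n), replace=False)
    subset_estimates.append(adjusted_slope(treatment[idx], outcome[idx], w0[idx]))

subset_mean = float(np.mean(subset_estimates))
print(f"original estimate:    {original:.4f}")
print(f"mean subset estimate: {subset_mean:.4f}")
```

A stable estimate should vary only slightly across subsets; large swings would suggest that a few rows drive the result.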

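As you can see, the estimate is robust to these simple refutations: adding a random common cause or dropping part of the data leaves it essentially unchanged, while replacing the treatment with a placebo drives the estimated effect to nearly zero.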