DoWhy example on the Twins dataset
Here we study the Twins dataset, as analyzed by Louizos et al. We focus on same-sex twin pairs in which both twins weigh less than 2 kg. The treatment t = 1 denotes being born the heavier twin, and the outcome is each twin's mortality during the first year of life. The confounder used is 'gestat10', the number of gestational weeks before birth, since it is highly correlated with the outcome. The results obtained with the methods below are consistent with those reported in the paper.
[1]:
import pandas as pd
import numpy as np
import dowhy
from dowhy import CausalModel
from dowhy import causal_estimators
# Config dict to set the logging level
import logging.config
DEFAULT_LOGGING = {
'version': 1,
'disable_existing_loggers': False,
'loggers': {
'': {
'level': 'WARN',
},
}
}
logging.config.dictConfig(DEFAULT_LOGGING)
# Disabling warnings output
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
Loading the data
The data loading step combines the covariates, treatment, and outcome, and resolves the paired structure of the data. Since each record contains entries for both twins of a pair, their mortalities can be treated as the two potential outcomes. The treatment is defined by the twins' birth weights. Hence, to obtain a binary treatment, each child's information is placed in a separate row, instead of compressing both twins' information into the same row as in the original data source.
[2]:
#The covariates data has 46 features
x = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/TWINS/twin_pairs_X_3years_samesex.csv")
#The outcome data contains mortality of the lighter and heavier twin
y = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/TWINS/twin_pairs_Y_3years_samesex.csv")
#The treatment data contains weight in grams of both the twins
t = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/TWINS/twin_pairs_T_3years_samesex.csv")
[3]:
#_0 denotes features specific to the lighter twin and _1 denotes features specific to the heavier twin
lighter_columns = ['pldel', 'birattnd', 'brstate', 'stoccfipb', 'mager8',
'ormoth', 'mrace', 'meduc6', 'dmar', 'mplbir', 'mpre5', 'adequacy',
'orfath', 'frace', 'birmon', 'gestat10', 'csex', 'anemia', 'cardiac',
'lung', 'diabetes', 'herpes', 'hydra', 'hemo', 'chyper', 'phyper',
'eclamp', 'incervix', 'pre4000', 'preterm', 'renal', 'rh', 'uterine',
'othermr', 'tobacco', 'alcohol', 'cigar6', 'drink5', 'crace',
'data_year', 'nprevistq', 'dfageq', 'feduc6', 'infant_id_0',
'dlivord_min', 'dtotord_min', 'bord_0',
'brstate_reg', 'stoccfipb_reg', 'mplbir_reg']
heavier_columns = [ 'pldel', 'birattnd', 'brstate', 'stoccfipb', 'mager8',
'ormoth', 'mrace', 'meduc6', 'dmar', 'mplbir', 'mpre5', 'adequacy',
'orfath', 'frace', 'birmon', 'gestat10', 'csex', 'anemia', 'cardiac',
'lung', 'diabetes', 'herpes', 'hydra', 'hemo', 'chyper', 'phyper',
'eclamp', 'incervix', 'pre4000', 'preterm', 'renal', 'rh', 'uterine',
'othermr', 'tobacco', 'alcohol', 'cigar6', 'drink5', 'crace',
'data_year', 'nprevistq', 'dfageq', 'feduc6',
'infant_id_1', 'dlivord_min', 'dtotord_min', 'bord_1',
'brstate_reg', 'stoccfipb_reg', 'mplbir_reg']
[4]:
#Since the data has a pair structure, process it into a separate row per twin so that each child can be treated as an instance
data = []
for i in range(len(t.values)):
    #keep the pair only if both twins weigh less than 2000 g
    if t.iloc[i].values[1] >= 2000 or t.iloc[i].values[2] >= 2000:
        continue
    this_instance_lighter = list(x.iloc[i][lighter_columns].values)
    this_instance_heavier = list(x.iloc[i][heavier_columns].values)
    #adding weight
    this_instance_lighter.append(t.iloc[i].values[1])
    this_instance_heavier.append(t.iloc[i].values[2])
    #adding treatment, is_heavier
    this_instance_lighter.append(0)
    this_instance_heavier.append(1)
    #adding the outcome
    this_instance_lighter.append(y.iloc[i].values[1])
    this_instance_heavier.append(y.iloc[i].values[2])
    data.append(this_instance_lighter)
    data.append(this_instance_heavier)
[5]:
cols = [ 'pldel', 'birattnd', 'brstate', 'stoccfipb', 'mager8',
'ormoth', 'mrace', 'meduc6', 'dmar', 'mplbir', 'mpre5', 'adequacy',
'orfath', 'frace', 'birmon', 'gestat10', 'csex', 'anemia', 'cardiac',
'lung', 'diabetes', 'herpes', 'hydra', 'hemo', 'chyper', 'phyper',
'eclamp', 'incervix', 'pre4000', 'preterm', 'renal', 'rh', 'uterine',
'othermr', 'tobacco', 'alcohol', 'cigar6', 'drink5', 'crace',
'data_year', 'nprevistq', 'dfageq', 'feduc6',
'infant_id', 'dlivord_min', 'dtotord_min', 'bord',
'brstate_reg', 'stoccfipb_reg', 'mplbir_reg','wt','treatment','outcome']
df = pd.DataFrame(columns=cols,data=data)
df.head()
[5]:
|   | pldel | birattnd | brstate | stoccfipb | mager8 | ormoth | mrace | meduc6 | dmar | mplbir | ... | infant_id | dlivord_min | dtotord_min | bord | brstate_reg | stoccfipb_reg | mplbir_reg | wt | treatment | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | ... | 35.0 | 3.0 | 3.0 | 2.0 | 5.0 | 5.0 | 5.0 | 936.0 | 0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | ... | 34.0 | 3.0 | 3.0 | 1.0 | 5.0 | 5.0 | 5.0 | 1006.0 | 1 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 | ... | 47.0 | NaN | NaN | NaN | 5.0 | 5.0 | 5.0 | 737.0 | 0 | 0.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 | ... | 46.0 | NaN | NaN | NaN | 5.0 | 5.0 | 5.0 | 850.0 | 1 | 1.0 |
| 4 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | ... | 52.0 | 1.0 | 1.0 | 1.0 | 5.0 | 5.0 | 5.0 | 1830.0 | 0 | 0.0 |

5 rows × 53 columns
[6]:
df = df.astype({"treatment": 'bool'}, copy=False)  #explicitly cast the treatment column to boolean
df.fillna(value=df.mean(), inplace=True)  #fill missing numeric values with the column mean
df.fillna(value=df.mode().loc[0], inplace=True)  #fill any remaining missing values with the column mode
data_1 = df[df["treatment"]==1]
data_0 = df[df["treatment"]==0]
print(np.mean(data_1["outcome"]))
print(np.mean(data_0["outcome"]))
print("ATE", np.mean(data_1["outcome"])- np.mean(data_0["outcome"]))
0.16421895861148197
0.1894192256341789
ATE -0.025200267022696926
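The naive difference in mean outcomes above makes no adjustment for the confounder. To illustrate why adjustment matters, here is a minimal sketch of back-door adjustment by stratification on a discrete confounder: it uses synthetic data (not the Twins data), with a confounder that drives both treatment and outcome, so the naive difference is biased while the stratified estimate recovers the true zero effect.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000
stratum = rng.integers(0, 10, n)                       # stands in for gestat10
# treatment probability rises with the stratum -> confounding
treatment = (rng.random(n) < 0.2 + 0.06 * stratum).astype(int)
# outcome depends only on the stratum, so the true effect is zero
outcome = (rng.random(n) < 0.05 + 0.02 * stratum).astype(float)
df = pd.DataFrame({"stratum": stratum, "treatment": treatment, "outcome": outcome})

# naive contrast: biased because treated units sit in high-outcome strata
naive = (df.loc[df.treatment == 1, "outcome"].mean()
         - df.loc[df.treatment == 0, "outcome"].mean())

# back-door adjustment: within-stratum differences, weighted by stratum size
adjusted = sum(
    (g.loc[g.treatment == 1, "outcome"].mean()
     - g.loc[g.treatment == 0, "outcome"].mean()) * len(g) / len(df)
    for _, g in df.groupby("stratum")
)
print(naive, adjusted)   # naive is biased upward; adjusted is close to 0
```

This weighted average over strata is the same back-door computation that the identified estimand below formalizes, with 'gestat10' playing the role of the stratum variable.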
1. Modeling
[7]:
#The causal model has "treatment = is_heavier", "outcome = mortality" and "gestat10 = gestational weeks before birth"
model = CausalModel(
    data=df,
    treatment='treatment',
    outcome='outcome',
    common_causes='gestat10'
)
2. Identification
[8]:
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────────(E[outcome|gestat10])
d[treatment]
Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→outcome then P(outcome|treatment,gestat10,U) = P(outcome|treatment,gestat10)
### Estimand : 2
Estimand name: iv
No such variable(s) found!
### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
3. Estimation using various methods
3.1 Using linear regression
[9]:
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.linear_regression",
                                 test_significance=True)
print(estimate)
print("ATE", np.mean(data_1["outcome"])- np.mean(data_0["outcome"]))
print("Causal Estimate is " + str(estimate.value))
*** Causal Estimate ***
## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────────(E[outcome|gestat10])
d[treatment]
Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→outcome then P(outcome|treatment,gestat10,U) = P(outcome|treatment,gestat10)
## Realized estimand
b: outcome~treatment+gestat10
Target units: ate
## Estimate
Mean value: -0.0252002670226934
p-value: [7.18902894e-08]
ATE -0.025200267022696926
Causal Estimate is -0.0252002670226934
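The realized estimand above is a linear regression of outcome on treatment and gestat10, with the treatment coefficient as the effect estimate. A minimal sketch of the same computation via plain least squares, on synthetic data (the variable names and the assumed true effect of -0.025 are illustrative, not taken from the Twins data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
gestat10 = rng.integers(0, 10, n).astype(float)   # stand-in confounder
treatment = rng.integers(0, 2, n).astype(float)
true_effect = -0.025
outcome = 0.05 + 0.02 * gestat10 + true_effect * treatment + rng.normal(0, 0.1, n)

# design matrix for outcome ~ treatment + gestat10 (intercept, treatment, confounder)
X = np.column_stack([np.ones(n), treatment, gestat10])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(coef[1])   # treatment coefficient, close to true_effect
```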
3.2 Using propensity score matching
[10]:
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_matching")
print("Causal Estimate is " + str(estimate.value))
print("ATE", np.mean(data_1["outcome"])- np.mean(data_0["outcome"]))
Causal Estimate is 0.4873998664886515
ATE -0.025200267022696926
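The matching estimate diverges sharply from the naive ATE here. One possible diagnostic (not part of the original notebook) is to inspect the fitted propensity scores: in this dataset the treatment is exactly balanced by construction (each pair contributes one heavier and one lighter twin), so with a single confounder the propensities cluster tightly around 0.5, giving the matcher little to discriminate on. A minimal sketch with synthetic stand-ins for the relevant columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
gestat10 = rng.integers(0, 10, (n, 1)).astype(float)  # stand-in confounder
treatment = rng.integers(0, 2, n)                     # balanced, independent of confounder

# propensity score model: P(treatment = 1 | gestat10)
ps = LogisticRegression().fit(gestat10, treatment).predict_proba(gestat10)[:, 1]
print(ps.min(), ps.max())   # a narrow band around 0.5
```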
4. Refutation
4.1 Adding a random common cause
[11]:
refute_results = model.refute_estimate(identified_estimand, estimate,
                                       method_name="random_common_cause")
print(refute_results)
Refute: Add a random common cause
Estimated effect:0.4873998664886515
New effect:0.48739986648865147
p value:1.0
4.2 Using a placebo treatment
[12]:
res_placebo = model.refute_estimate(identified_estimand, estimate,
                                    method_name="placebo_treatment_refuter",
                                    placebo_type="permute",
                                    num_simulations=20)
print(res_placebo)
Refute: Use a Placebo Treatment
Estimated effect:0.4873998664886515
New effect:-0.224774699599466
p value:0.22366883618379163
4.3 Using a data subset refuter
[13]:
res_subset = model.refute_estimate(identified_estimand, estimate,
                                   method_name="data_subset_refuter",
                                   subset_fraction=0.9,
                                   num_simulations=20)
print(res_subset)
Refute: Use a subset of data
Estimated effect:0.4873998664886515
New effect:0.01235223216355291
p value:0.053337396435071405