DoWhy example on the Twins dataset

Here we study the Twins dataset examined by Louizos et al. We focus on twins of the same sex who each weigh less than 2 kg. The treatment t = 1 corresponds to being the heavier twin at birth, and the outcome is each twin's mortality during the first year of life. The confounder used is 'gestat10', the number of gestational weeks prior to birth, since it is highly correlated with the outcome. The results obtained with the methods below are consistent with those reported in the paper.

[1]:
import pandas as pd
import numpy as np
import dowhy
from dowhy import CausalModel
from dowhy import causal_estimators

# Config dict to set the logging level
import logging.config
DEFAULT_LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'loggers': {
        '': {
            'level': 'WARN',
        },
    }
}
logging.config.dictConfig(DEFAULT_LOGGING)
# Disabling warnings output
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

Load the data

Loading the data involves combining the covariates, treatment, and outcome, and resolving the paired structure of the data. Since each record contains entries for both twins, their mortalities can be treated as two potential outcomes. The treatment is assigned based on the twins' weights. Therefore, to obtain a binary treatment, each child's information is placed in its own row rather than being compressed into a single row for the pair, as in the original data source.

[2]:
#The covariates data has 46 features
x = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/TWINS/twin_pairs_X_3years_samesex.csv")

#The outcome data contains mortality of the lighter and heavier twin
y = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/TWINS/twin_pairs_Y_3years_samesex.csv")

#The treatment data contains weight in grams of both the twins
t = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/TWINS/twin_pairs_T_3years_samesex.csv")
[3]:
#_0 denotes features specific to the lighter twin and _1 denotes features specific to the heavier twin
lighter_columns = ['pldel', 'birattnd', 'brstate', 'stoccfipb', 'mager8',
       'ormoth', 'mrace', 'meduc6', 'dmar', 'mplbir', 'mpre5', 'adequacy',
       'orfath', 'frace', 'birmon', 'gestat10', 'csex', 'anemia', 'cardiac',
       'lung', 'diabetes', 'herpes', 'hydra', 'hemo', 'chyper', 'phyper',
       'eclamp', 'incervix', 'pre4000', 'preterm', 'renal', 'rh', 'uterine',
       'othermr', 'tobacco', 'alcohol', 'cigar6', 'drink5', 'crace',
       'data_year', 'nprevistq', 'dfageq', 'feduc6', 'infant_id_0',
       'dlivord_min', 'dtotord_min', 'bord_0',
       'brstate_reg', 'stoccfipb_reg', 'mplbir_reg']
heavier_columns = [ 'pldel', 'birattnd', 'brstate', 'stoccfipb', 'mager8',
       'ormoth', 'mrace', 'meduc6', 'dmar', 'mplbir', 'mpre5', 'adequacy',
       'orfath', 'frace', 'birmon', 'gestat10', 'csex', 'anemia', 'cardiac',
       'lung', 'diabetes', 'herpes', 'hydra', 'hemo', 'chyper', 'phyper',
       'eclamp', 'incervix', 'pre4000', 'preterm', 'renal', 'rh', 'uterine',
       'othermr', 'tobacco', 'alcohol', 'cigar6', 'drink5', 'crace',
       'data_year', 'nprevistq', 'dfageq', 'feduc6',
       'infant_id_1', 'dlivord_min', 'dtotord_min', 'bord_1',
       'brstate_reg', 'stoccfipb_reg', 'mplbir_reg']
[4]:
#Since the data comes in twin pairs, process it so that each twin gets its own row and each child can be treated as a separate instance
data = []

for i in range(len(t.values)):

    #keep the pair only if both twins weigh less than 2 kg
    if t.iloc[i].values[1]>=2000 or t.iloc[i].values[2]>=2000:
        continue

    this_instance_lighter = list(x.iloc[i][lighter_columns].values)
    this_instance_heavier = list(x.iloc[i][heavier_columns].values)

    #adding weight
    this_instance_lighter.append(t.iloc[i].values[1])
    this_instance_heavier.append(t.iloc[i].values[2])

    #adding treatment, is_heavier
    this_instance_lighter.append(0)
    this_instance_heavier.append(1)

    #adding the outcome
    this_instance_lighter.append(y.iloc[i].values[1])
    this_instance_heavier.append(y.iloc[i].values[2])
    data.append(this_instance_lighter)
    data.append(this_instance_heavier)
[5]:
cols = [ 'pldel', 'birattnd', 'brstate', 'stoccfipb', 'mager8',
       'ormoth', 'mrace', 'meduc6', 'dmar', 'mplbir', 'mpre5', 'adequacy',
       'orfath', 'frace', 'birmon', 'gestat10', 'csex', 'anemia', 'cardiac',
       'lung', 'diabetes', 'herpes', 'hydra', 'hemo', 'chyper', 'phyper',
       'eclamp', 'incervix', 'pre4000', 'preterm', 'renal', 'rh', 'uterine',
       'othermr', 'tobacco', 'alcohol', 'cigar6', 'drink5', 'crace',
       'data_year', 'nprevistq', 'dfageq', 'feduc6',
       'infant_id', 'dlivord_min', 'dtotord_min', 'bord',
       'brstate_reg', 'stoccfipb_reg', 'mplbir_reg','wt','treatment','outcome']
df = pd.DataFrame(columns=cols,data=data)
df.head()
[5]:
pldel birattnd brstate stoccfipb mager8 ormoth mrace meduc6 dmar mplbir ... infant_id dlivord_min dtotord_min bord brstate_reg stoccfipb_reg mplbir_reg wt treatment outcome
0 1.0 1.0 1.0 1.0 3.0 0.0 1.0 3.0 1.0 1.0 ... 35.0 3.0 3.0 2.0 5.0 5.0 5.0 936.0 0 0.0
1 1.0 1.0 1.0 1.0 3.0 0.0 1.0 3.0 1.0 1.0 ... 34.0 3.0 3.0 1.0 5.0 5.0 5.0 1006.0 1 0.0
2 1.0 1.0 1.0 1.0 3.0 0.0 1.0 2.0 0.0 1.0 ... 47.0 NaN NaN NaN 5.0 5.0 5.0 737.0 0 0.0
3 1.0 1.0 1.0 1.0 3.0 0.0 1.0 2.0 0.0 1.0 ... 46.0 NaN NaN NaN 5.0 5.0 5.0 850.0 1 1.0
4 1.0 1.0 1.0 1.0 3.0 0.0 1.0 3.0 1.0 1.0 ... 52.0 1.0 1.0 1.0 5.0 5.0 5.0 1830.0 0 0.0

5 rows × 53 columns

[6]:
df = df.astype({"treatment":'bool'}, copy=False) #explicitly assigning treatment column as boolean

df.fillna(value=df.mean(),inplace=True)    #fill missing numeric values with the column mean
df.fillna(value=df.mode().loc[0],inplace=True)    #fill any remaining missing values with the column mode

data_1 = df[df["treatment"]==1]
data_0 = df[df["treatment"]==0]
print(np.mean(data_1["outcome"]))
print(np.mean(data_0["outcome"]))
print("ATE", np.mean(data_1["outcome"])- np.mean(data_0["outcome"]))
0.16421895861148197
0.1894192256341789
ATE -0.025200267022696926
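
As an optional sanity check that is not part of the original notebook, the claim that 'gestat10' is highly correlated with the outcome can be inspected by comparing mean mortality across gestational-age bins:

#Optional check: mean mortality per gestational-age bin
print(df.groupby("gestat10")["outcome"].mean())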

1. Model

[7]:
#The causal model has "treatment = is_heavier", "outcome = mortality" and "gestat10 = gestational weeks before birth"
model=CausalModel(
        data = df,
        treatment='treatment',
        outcome='outcome',
        common_causes='gestat10'
        )
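
Optionally, the assumed graph (treatment → outcome, with gestat10 as a common cause) can be rendered with DoWhy's view_model; this sketch assumes an optional graph-rendering dependency such as pygraphviz or pydot is installed:

#Optional: render the assumed causal graph (requires pygraphviz or pydot)
model.view_model()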

2. Identify

[8]:
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d
────────────(E[outcome|gestat10])
d[treatment]
Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→outcome then P(outcome|treatment,gestat10,U) = P(outcome|treatment,gestat10)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

3. Estimate using different methods

3.1 Using linear regression

[9]:
estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression", test_significance=True
)

print(estimate)
print("ATE", np.mean(data_1["outcome"])- np.mean(data_0["outcome"]))
print("Causal Estimate is " + str(estimate.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d
────────────(E[outcome|gestat10])
d[treatment]
Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→outcome then P(outcome|treatment,gestat10,U) = P(outcome|treatment,gestat10)

## Realized estimand
b: outcome~treatment+gestat10
Target units: ate

## Estimate
Mean value: -0.0252002670226934
p-value: [7.18902894e-08]

ATE -0.025200267022696926
Causal Estimate is -0.0252002670226934
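
As a quick cross-check that is not part of the original notebook, the backdoor linear-regression estimate can be reproduced with a plain OLS fit of the outcome on treatment and gestat10; the coefficient on treatment should match the value reported above:

#Cross-check: plain OLS of outcome on treatment + gestat10
from sklearn.linear_model import LinearRegression

X = df[["treatment", "gestat10"]].astype(float)    #bool treatment becomes 0/1
ols = LinearRegression().fit(X, df["outcome"])
print("OLS coefficient on treatment:", ols.coef_[0])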

3.2 Using propensity score matching

[10]:
estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_matching"
)

print("Causal Estimate is " + str(estimate.value))

print("ATE", np.mean(data_1["outcome"])- np.mean(data_0["outcome"]))
Causal Estimate is 0.4873998664886515
ATE -0.025200267022696926

4. Refute

4.1 Adding a random common cause

[11]:
refute_results=model.refute_estimate(identified_estimand, estimate,
        method_name="random_common_cause")
print(refute_results)
Refute: Add a random common cause
Estimated effect:0.4873998664886515
New effect:0.48739986648865147
p value:1.0

4.2 Using a placebo treatment

[12]:
res_placebo=model.refute_estimate(identified_estimand, estimate,
        method_name="placebo_treatment_refuter", placebo_type="permute",
        num_simulations=20)
print(res_placebo)
Refute: Use a Placebo Treatment
Estimated effect:0.4873998664886515
New effect:-0.224774699599466
p value:0.22366883618379163

4.3 Using a data subset refuter

[13]:
res_subset=model.refute_estimate(identified_estimand, estimate,
        method_name="data_subset_refuter", subset_fraction=0.9,
        num_simulations=20)
print(res_subset)
Refute: Use a subset of data
Estimated effect:0.4873998664886515
New effect:0.01235223216355291
p value:0.053337396435071405
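
The refutations above were run against the propensity-score-matching estimate. The same refuters accept any estimate object, so, as a sketch that simply reuses the calls shown earlier, the linear-regression estimate from section 3.1 can be checked in the same way:

#Sketch: apply the data subset refuter to the linear-regression estimate
lr_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression")
res_subset_lr = model.refute_estimate(identified_estimand, lr_estimate,
        method_name="data_subset_refuter", subset_fraction=0.9,
        num_simulations=20)
print(res_subset_lr)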