Lalonde Pandas API 示例#

作者:Adam Kelleher

我们将快速介绍一个使用 DoSampler 高级 Python API 的示例。DoSampler 与大多数经典因果效应估计器不同。它不估计干预下的统计量,而是旨在提供 Pearl 因果推断的通用性。在这种情况下,干预下变量的联合分布是感兴趣的量。非参数地表示联合分布很困难,因此我们提供该分布的样本,我们称之为“do”样本。

在这里,当你指定一个结果时,它就是你在干预下采样的变量。我们仍然需要执行通常的流程,以确保数量(结果的条件干预分布)是可识别的。我们利用包中其余部分熟悉的功能在“底层”完成此操作。你会注意到 DoSampler 的 kwargs 有一些相似之处。

[1]:
import os, sys
sys.path.append(os.path.abspath("../../../"))

获取数据#

首先,从 LaLonde 示例下载数据。

[2]:
import dowhy.datasets

lalonde = dowhy.datasets.lalonde_dataset()

causal 命名空间#

我们为包含因果推断方法的 pandas.DataFrame 创建了一个“命名空间”。你可以在这里通过 lalonde.causal 访问它,其中 lalonde 是我们的 pandas.DataFramecausal 包含了我们所有的新方法!当你 import dowhy.api 时,这些方法会神奇地加载到你现有的(以及未来的)dataframe 中。

[3]:
import dowhy.api

现在我们有了 causal 命名空间,让我们试试吧!

do 操作#

这里的关键特性是 do 方法,它会生成一个新的 dataframe,用指定的值替换治疗变量,并用结果的干预分布中的样本替换结果。如果你不为治疗变量指定值,它将保持治疗变量不变。

[4]:
do_df = lalonde.causal.do(x='treat',
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'}
                         )

注意你会得到通常的关于可识别性的输出和提示。这全都是 dowhy 在底层完成的!

我们现在在 do_df 中有了一个干预样本。它看起来与原始 dataframe 非常相似。比较它们。

[5]:
lalonde.head()
[5]:
treat age educ black hisp married nodegr re74 re75 re78 u74 u75
0 False 23.0 10.0 1.0 0.0 0.0 1.0 0.0 0.0 0.00 1.0 1.0
1 False 26.0 12.0 0.0 0.0 0.0 0.0 0.0 0.0 12383.68 1.0 1.0
2 False 22.0 9.0 1.0 0.0 0.0 1.0 0.0 0.0 0.00 1.0 1.0
3 False 18.0 9.0 1.0 0.0 0.0 1.0 0.0 0.0 10740.08 1.0 1.0
4 False 45.0 11.0 1.0 0.0 0.0 1.0 0.0 0.0 11796.47 1.0 1.0
[6]:
do_df.head()
[6]:
treat age educ black hisp married nodegr re74 re75 re78 u74 u75 propensity_score weight
0 True 42.0 12.0 1.0 0.0 0.0 0.0 0.000 0.0000 2456.1530 1.0 1.0 0.566923 1.763908
1 True 38.0 9.0 0.0 0.0 0.0 1.0 0.000 0.0000 6408.9500 1.0 1.0 0.446850 2.237888
2 False 17.0 10.0 1.0 0.0 0.0 1.0 0.000 0.0000 275.5661 1.0 1.0 0.638315 1.566624
3 False 26.0 10.0 1.0 0.0 1.0 1.0 6140.367 558.7734 0.0000 0.0 0.0 0.573852 1.742608
4 False 39.0 12.0 1.0 0.0 1.0 0.0 19785.320 6608.1370 499.2572 0.0 0.0 0.387147 2.582995

治疗效果估计#

我们可以通过以下方式获得治疗效果的朴素估计:

[7]:
(lalonde[lalonde['treat'] == 1].mean() - lalonde[lalonde['treat'] == 0].mean())['re78']
[7]:
$\displaystyle 1794.34240427027$

我们可以对干预分布的新样本做同样的操作,以获得因果效应估计。

[8]:
(do_df[do_df['treat'] == 1].mean() - do_df[do_df['treat'] == 0].mean())['re78']
[8]:
$\displaystyle 1727.89919646219$

我们可以使用正态近似法得到 95% 置信区间的大致误差条,例如:

[9]:
import numpy as np
1.96*np.sqrt((do_df[do_df['treat'] == 1].var()/len(do_df[do_df['treat'] == 1])) +
             (do_df[do_df['treat'] == 0].var()/len(do_df[do_df['treat'] == 0])))['re78']
[9]:
$\displaystyle 1245.9600841561$

但请注意,这些不包含倾向得分估计误差。为此,自助法程序可能更合适。

这只是我们可以从 're78' 的干预分布中计算的一个统计量。我们也可以获得所有干预矩,包括 're78' 的函数。我们可以充分利用 pandas 的强大功能,例如:

[10]:
do_df['re78'].describe()
[10]:
count      445.000000
mean      5080.937222
std       6618.419440
min          0.000000
25%          0.000000
50%       3523.578000
75%       8048.603000
max      60307.930000
Name: re78, dtype: float64
[11]:
lalonde['re78'].describe()
[11]:
count      445.000000
mean      5300.763699
std       6631.491695
min          0.000000
25%          0.000000
50%       3701.812000
75%       8124.715000
max      60307.930000
Name: re78, dtype: float64

甚至绘制聚合图表,例如:

[12]:
%matplotlib inline
[13]:
import seaborn as sns

sns.barplot(data=lalonde, x='treat', y='re78')
[13]:
<Axes: xlabel='treat', ylabel='re78'>
../_images/example_notebooks_lalonde_pandas_api_25_1.png
[14]:
sns.barplot(data=do_df, x='treat', y='re78')
[14]:
<Axes: xlabel='treat', ylabel='re78'>
../_images/example_notebooks_lalonde_pandas_api_26_1.png

指定干预#

你可以找到干预下结果的分布,以设定治疗的值。

[15]:
do_df = lalonde.causal.do(x={'treat': 1},
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'}
                         )
[16]:
do_df.head()
[16]:
treat age educ black hisp married nodegr re74 re75 re78 u74 u75 propensity_score weight
0 True 26.0 10.0 1.0 0.0 0.0 1.0 25929.68 6788.958 672.8773 0.0 0.0 0.375730 2.661487
1 True 29.0 12.0 1.0 0.0 0.0 0.0 10881.94 1817.284 0.0000 0.0 0.0 0.545410 1.833483
2 True 18.0 11.0 1.0 0.0 0.0 1.0 0.00 0.000 4814.6270 1.0 1.0 0.351612 2.844044
3 True 28.0 11.0 1.0 0.0 0.0 1.0 0.00 1284.079 60307.9300 1.0 0.0 0.367046 2.724453
4 True 25.0 8.0 1.0 0.0 0.0 1.0 0.00 0.000 0.0000 1.0 1.0 0.398144 2.511651

这个新的 dataframe 给出了当 'treat' 设置为 1're78' 的分布。

关于 do 方法如何工作的更多详细信息,请查看 docstring:

[17]:
help(lalonde.causal.do)
Help on method do in module dowhy.api.causal_data_frame:

do(x, method='weighting', num_cores=1, variable_types={}, outcome=None, params=None, graph: networkx.classes.digraph.DiGraph = None, common_causes=None, estimand_type=<EstimandType.NONPARAMETRIC_ATE: 'nonparametric-ate'>, stateful=False) method of dowhy.api.causal_data_frame.CausalAccessor instance
    The do-operation implemented with sampling. This will return a pandas.DataFrame with the outcome
    variable(s) replaced with samples from P(Y|do(X=x)).

    If the value of `x` is left unspecified (e.g. as a string or list), then the original values of `x` are left in
    the DataFrame, and Y is sampled from its respective P(Y|do(x)). If the value of `x` is specified (passed with a
    `dict`, where variable names are keys, and values are specified) then the new `DataFrame` will contain the
    specified values of `x`.

    For some methods, the `variable_types` field must be specified. It should be a `dict`, where the keys are
    variable names, and values are 'o' for ordered discrete, 'u' for un-ordered discrete, 'd' for discrete, or 'c'
    for continuous.

    Inference requires a set of control variables. These can be provided explicitly using `common_causes`, which
    contains a list of variable names to control for. These can be provided implicitly by specifying a causal graph
    with `dot_graph`, from which they will be chosen using the default identification method.

    When the set of control variables can't be identified with the provided assumptions, a prompt will raise to the
    user asking whether to proceed. To automatically over-ride the prompt, you can set the flag
    `proceed_when_unidentifiable` to `True`.

    Some methods build components during inference which are expensive. To retain those components for later
    inference (e.g. successive calls to `do` with different values of `x`), you can set the `stateful` flag to `True`.
    Be cautious about using the `do` operation statefully. State is set on the namespace, rather than the method, so
    can behave unpredictably. To reset the namespace and run statelessly again, you can call the `reset` method.

    :param x: str, list, dict: The causal state on which to intervene, and (optional) its interventional value(s).
    :param method: The inference method to use with the sampler. Currently, `'mcmc'`, `'weighting'`, and
        `'kernel_density'` are supported. The `mcmc` sampler requires `pymc3>=3.7`.
    :param num_cores: int: if the inference method only supports sampling a point at a time, this will parallelize
        sampling.
    :param variable_types: dict: The dictionary containing the variable types. Must contain the union of the causal
        state, control variables, and the outcome.
    :param outcome: str: The outcome variable.
    :param params: dict: extra parameters to set as attributes on the sampler object
    :param dot_graph: str: A string specifying the causal graph.
    :param common_causes: list: A list of strings containing the variable names to control for.
    :param estimand_type: str: 'nonparametric-ate' is the only one currently supported. Others may be added later, to allow for specific, parametric estimands.
    :param proceed_when_unidentifiable: bool: A flag to over-ride user prompts to proceed when effects aren't
        identifiable with the assumptions provided.
    :param stateful: bool: Whether to retain state. By default, the do operation is stateless.

    :return: pandas.DataFrame: A DataFrame containing the sampled outcome