Lalonde Pandas API 示例#

作者：Adam Kelleher

我们将快速介绍一个使用 DoSampler 高级 Python API 的示例。DoSampler 与大多数经典因果效应估计器不同。它不估计干预下的统计量，而是旨在提供 Pearl 因果推断的通用性。在这种情况下，干预下变量的联合分布是感兴趣的量。非参数地表示联合分布很困难，因此我们提供该分布的样本，我们称之为“do”样本。

在这里，当你指定一个结果时，它就是你在干预下采样的变量。我们仍然需要执行通常的流程，以确保数量（结果的条件干预分布）是可识别的。我们利用包中其余部分熟悉的功能在“底层”完成此操作。你会注意到 DoSampler 的 kwargs 有一些相似之处。

[1]:

import os, sys
sys.path.append(os.path.abspath("../../../"))

获取数据#

首先，从 LaLonde 示例下载数据。

[2]:

import dowhy.datasets

lalonde = dowhy.datasets.lalonde_dataset()

`causal` 命名空间#

我们为包含因果推断方法的 pandas.DataFrame 创建了一个“命名空间”。你可以在这里通过 lalonde.causal 访问它，其中 lalonde 是我们的 pandas.DataFrame，causal 包含了我们所有的新方法！当你 import dowhy.api 时，这些方法会神奇地加载到你现有的（以及未来的）dataframe 中。

[3]:

import dowhy.api

现在我们有了 causal 命名空间，让我们试试吧！

`do` 操作#

这里的关键特性是 do 方法，它会生成一个新的 dataframe，用指定的值替换治疗变量，并用结果的干预分布中的样本替换结果。如果你不为治疗变量指定值，它将保持治疗变量不变。

[4]:

do_df = lalonde.causal.do(x='treat',
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'}
                         )

注意你会得到通常的关于可识别性的输出和提示。这全都是 dowhy 在底层完成的！

我们现在在 do_df 中有了一个干预样本。它看起来与原始 dataframe 非常相似。比较它们。

[5]:

lalonde.head()

[5]:

	treat	age	educ	black	nodegr	re78	u74	u75
0	False	23.0	10.0	1.0	1.0	0.00	1.0	1.0
1	False	26.0	12.0	0.0	0.0	12383.68	1.0	1.0
2	False	22.0	9.0	1.0	1.0	0.00	1.0	1.0
3	False	18.0	9.0	1.0	1.0	10740.08	1.0	1.0
4	False	45.0	11.0	1.0	1.0	11796.47	1.0	1.0

[6]:

do_df.head()

[6]:

	treat	age	educ	black	married	nodegr	re74	re75	re78	u74	u75	propensity_score	weight
0	True	42.0	12.0	1.0	0.0	0.0	0.000	0.0000	2456.1530	1.0	1.0	0.566923	1.763908
1	True	38.0	9.0	0.0	0.0	1.0	0.000	0.0000	6408.9500	1.0	1.0	0.446850	2.237888
2	False	17.0	10.0	1.0	0.0	1.0	0.000	0.0000	275.5661	1.0	1.0	0.638315	1.566624
3	False	26.0	10.0	1.0	1.0	1.0	6140.367	558.7734	0.0000	0.0	0.0	0.573852	1.742608
4	False	39.0	12.0	1.0	1.0	0.0	19785.320	6608.1370	499.2572	0.0	0.0	0.387147	2.582995

治疗效果估计#

我们可以通过以下方式获得治疗效果的朴素估计：

[7]:

(lalonde[lalonde['treat'] == 1].mean() - lalonde[lalonde['treat'] == 0].mean())['re78']

[7]:

$\displaystyle 1794.34240427027$

我们可以对干预分布的新样本做同样的操作，以获得因果效应估计。

[8]:

(do_df[do_df['treat'] == 1].mean() - do_df[do_df['treat'] == 0].mean())['re78']

[8]:

$\displaystyle 1727.89919646219$

我们可以使用正态近似法得到 95% 置信区间的大致误差条，例如：

[9]:

import numpy as np
1.96*np.sqrt((do_df[do_df['treat'] == 1].var()/len(do_df[do_df['treat'] == 1])) +
             (do_df[do_df['treat'] == 0].var()/len(do_df[do_df['treat'] == 0])))['re78']

[9]:

$\displaystyle 1245.9600841561$

但请注意，这些不包含倾向得分估计误差。为此，自助法程序可能更合适。

这只是我们可以从 're78' 的干预分布中计算的一个统计量。我们也可以获得所有干预矩，包括 're78' 的函数。我们可以充分利用 pandas 的强大功能，例如：

[10]:

do_df['re78'].describe()

[10]:

count      445.000000
mean      5080.937222
std       6618.419440
min          0.000000
25%          0.000000
50%       3523.578000
75%       8048.603000
max      60307.930000
Name: re78, dtype: float64

[11]:

lalonde['re78'].describe()

[11]:

count      445.000000
mean      5300.763699
std       6631.491695
min          0.000000
25%          0.000000
50%       3701.812000
75%       8124.715000
max      60307.930000
Name: re78, dtype: float64

甚至绘制聚合图表，例如：

[12]:

%matplotlib inline

[13]:

import seaborn as sns

sns.barplot(data=lalonde, x='treat', y='re78')

[13]:

<Axes: xlabel='treat', ylabel='re78'>

../_images/example_notebooks_lalonde_pandas_api_25_1.png

[14]:

sns.barplot(data=do_df, x='treat', y='re78')

[14]:

<Axes: xlabel='treat', ylabel='re78'>

../_images/example_notebooks_lalonde_pandas_api_26_1.png

指定干预#

你可以找到干预下结果的分布，以设定治疗的值。

[15]:

do_df = lalonde.causal.do(x={'treat': 1},
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'}
                         )

[16]:

do_df.head()

[16]:

	treat	age	educ	black	nodegr	re74	re75	re78	u74	u75	propensity_score	weight
0	True	26.0	10.0	1.0	1.0	25929.68	6788.958	672.8773	0.0	0.0	0.375730	2.661487
1	True	29.0	12.0	1.0	0.0	10881.94	1817.284	0.0000	0.0	0.0	0.545410	1.833483
2	True	18.0	11.0	1.0	1.0	0.00	0.000	4814.6270	1.0	1.0	0.351612	2.844044
3	True	28.0	11.0	1.0	1.0	0.00	1284.079	60307.9300	1.0	0.0	0.367046	2.724453
4	True	25.0	8.0	1.0	1.0	0.00	0.000	0.0000	1.0	1.0	0.398144	2.511651

这个新的 dataframe 给出了当 'treat' 设置为 1 时 're78' 的分布。

关于 do 方法如何工作的更多详细信息，请查看 docstring：

[17]:

help(lalonde.causal.do)

Help on method do in module dowhy.api.causal_data_frame:

do(x, method='weighting', num_cores=1, variable_types={}, outcome=None, params=None, graph: networkx.classes.digraph.DiGraph = None, common_causes=None, estimand_type=<EstimandType.NONPARAMETRIC_ATE: 'nonparametric-ate'>, stateful=False) method of dowhy.api.causal_data_frame.CausalAccessor instance
The do-operation implemented with sampling. This will return a pandas.DataFrame with the outcome
variable(s) replaced with samples from P(Y|do(X=x)).

If the value of `x` is left unspecified (e.g. as a string or list), then the original values of `x` are left in
the DataFrame, and Y is sampled from its respective P(Y|do(x)). If the value of `x` is specified (passed with a
`dict`, where variable names are keys, and values are specified) then the new `DataFrame` will contain the
specified values of `x`.

For some methods, the `variable_types` field must be specified. It should be a `dict`, where the keys are
variable names, and values are 'o' for ordered discrete, 'u' for un-ordered discrete, 'd' for discrete, or 'c'
for continuous.

Inference requires a set of control variables. These can be provided explicitly using `common_causes`, which
contains a list of variable names to control for. These can be provided implicitly by specifying a causal graph
with `dot_graph`, from which they will be chosen using the default identification method.

When the set of control variables can't be identified with the provided assumptions, a prompt will raise to the
user asking whether to proceed. To automatically over-ride the prompt, you can set the flag
`proceed_when_unidentifiable` to `True`.

Some methods build components during inference which are expensive. To retain those components for later
inference (e.g. successive calls to `do` with different values of `x`), you can set the `stateful` flag to `True`.
Be cautious about using the `do` operation statefully. State is set on the namespace, rather than the method, so
can behave unpredictably. To reset the namespace and run statelessly again, you can call the `reset` method.

:param x: str, list, dict: The causal state on which to intervene, and (optional) its interventional value(s).
:param method: The inference method to use with the sampler. Currently, `'mcmc'`, `'weighting'`, and
`'kernel_density'` are supported. The `mcmc` sampler requires `pymc3>=3.7`.
:param num_cores: int: if the inference method only supports sampling a point at a time, this will parallelize
sampling.
:param variable_types: dict: The dictionary containing the variable types. Must contain the union of the causal
state, control variables, and the outcome.
:param outcome: str: The outcome variable.
:param params: dict: extra parameters to set as attributes on the sampler object
:param dot_graph: str: A string specifying the causal graph.
:param common_causes: list: A list of strings containing the variable names to control for.
:param estimand_type: str: 'nonparametric-ate' is the only one currently supported. Others may be added later, to allow for specific, parametric estimands.
:param proceed_when_unidentifiable: bool: A flag to over-ride user prompts to proceed when effects aren't
identifiable with the assumptions provided.
:param stateful: bool: Whether to retain state. By default, the do operation is stateless.

:return: pandas.DataFrame: A DataFrame containing the sampled outcome