Lalonde Pandas API 示例#
作者:Adam Kelleher
我们将快速介绍一个使用 DoSampler 高级 Python API 的示例。DoSampler 与大多数经典因果效应估计器不同。它不估计干预下的统计量,而是旨在提供 Pearl 因果推断的通用性。在这种情况下,干预下变量的联合分布是感兴趣的量。非参数地表示联合分布很困难,因此我们提供该分布的样本,我们称之为“do”样本。
在这里,当你指定一个结果时,它就是你在干预下采样的变量。我们仍然需要执行通常的流程,以确保数量(结果的条件干预分布)是可识别的。我们利用包中其余部分熟悉的功能在“底层”完成此操作。你会注意到 DoSampler 的 kwargs 有一些相似之处。
[1]:
import os, sys
sys.path.append(os.path.abspath("../../../"))
获取数据#
首先,从 LaLonde 示例下载数据。
[2]:
import dowhy.datasets
lalonde = dowhy.datasets.lalonde_dataset()
causal
命名空间#
我们为包含因果推断方法的 pandas.DataFrame
创建了一个“命名空间”。你可以在这里通过 lalonde.causal
访问它,其中 lalonde
是我们的 pandas.DataFrame
,causal
包含了我们所有的新方法!当你 import dowhy.api
时,这些方法会神奇地加载到你现有的(以及未来的)dataframe 中。
[3]:
import dowhy.api
现在我们有了 causal
命名空间,让我们试试吧!
do
操作#
这里的关键特性是 do
方法,它会生成一个新的 dataframe,用指定的值替换治疗变量,并用结果的干预分布中的样本替换结果。如果你不为治疗变量指定值,它将保持治疗变量不变。
[4]:
do_df = lalonde.causal.do(x='treat',
outcome='re78',
common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'}
)
注意你会得到通常的关于可识别性的输出和提示。这全都是 dowhy
在底层完成的!
我们现在在 do_df
中有了一个干预样本。它看起来与原始 dataframe 非常相似。比较它们。
[5]:
lalonde.head()
[5]:
treat | age | educ | black | hisp | married | nodegr | re74 | re75 | re78 | u74 | u75 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 23.0 | 10.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.00 | 1.0 | 1.0 |
1 | False | 26.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12383.68 | 1.0 | 1.0 |
2 | False | 22.0 | 9.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.00 | 1.0 | 1.0 |
3 | False | 18.0 | 9.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 10740.08 | 1.0 | 1.0 |
4 | False | 45.0 | 11.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11796.47 | 1.0 | 1.0 |
[6]:
do_df.head()
[6]:
treat | age | educ | black | hisp | married | nodegr | re74 | re75 | re78 | u74 | u75 | propensity_score | weight | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | 42.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000 | 0.0000 | 2456.1530 | 1.0 | 1.0 | 0.566923 | 1.763908 |
1 | True | 38.0 | 9.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.000 | 0.0000 | 6408.9500 | 1.0 | 1.0 | 0.446850 | 2.237888 |
2 | False | 17.0 | 10.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.000 | 0.0000 | 275.5661 | 1.0 | 1.0 | 0.638315 | 1.566624 |
3 | False | 26.0 | 10.0 | 1.0 | 0.0 | 1.0 | 1.0 | 6140.367 | 558.7734 | 0.0000 | 0.0 | 0.0 | 0.573852 | 1.742608 |
4 | False | 39.0 | 12.0 | 1.0 | 0.0 | 1.0 | 0.0 | 19785.320 | 6608.1370 | 499.2572 | 0.0 | 0.0 | 0.387147 | 2.582995 |
治疗效果估计#
我们可以通过以下方式获得治疗效果的朴素估计:
[7]:
(lalonde[lalonde['treat'] == 1].mean() - lalonde[lalonde['treat'] == 0].mean())['re78']
[7]:
我们可以对干预分布的新样本做同样的操作,以获得因果效应估计。
[8]:
(do_df[do_df['treat'] == 1].mean() - do_df[do_df['treat'] == 0].mean())['re78']
[8]:
我们可以使用正态近似法得到 95% 置信区间的大致误差条,例如:
[9]:
import numpy as np
1.96*np.sqrt((do_df[do_df['treat'] == 1].var()/len(do_df[do_df['treat'] == 1])) +
(do_df[do_df['treat'] == 0].var()/len(do_df[do_df['treat'] == 0])))['re78']
[9]:
但请注意,这些不包含倾向得分估计误差。为此,自助法程序可能更合适。
这只是我们可以从 're78'
的干预分布中计算的一个统计量。我们也可以获得所有干预矩,包括 're78'
的函数。我们可以充分利用 pandas 的强大功能,例如:
[10]:
do_df['re78'].describe()
[10]:
count 445.000000
mean 5080.937222
std 6618.419440
min 0.000000
25% 0.000000
50% 3523.578000
75% 8048.603000
max 60307.930000
Name: re78, dtype: float64
[11]:
lalonde['re78'].describe()
[11]:
count 445.000000
mean 5300.763699
std 6631.491695
min 0.000000
25% 0.000000
50% 3701.812000
75% 8124.715000
max 60307.930000
Name: re78, dtype: float64
甚至绘制聚合图表,例如:
[12]:
%matplotlib inline
[13]:
import seaborn as sns
sns.barplot(data=lalonde, x='treat', y='re78')
[13]:
<Axes: xlabel='treat', ylabel='re78'>

[14]:
sns.barplot(data=do_df, x='treat', y='re78')
[14]:
<Axes: xlabel='treat', ylabel='re78'>

指定干预#
你可以找到干预下结果的分布,以设定治疗的值。
[15]:
do_df = lalonde.causal.do(x={'treat': 1},
outcome='re78',
common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'}
)
[16]:
do_df.head()
[16]:
treat | age | educ | black | hisp | married | nodegr | re74 | re75 | re78 | u74 | u75 | propensity_score | weight | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | 26.0 | 10.0 | 1.0 | 0.0 | 0.0 | 1.0 | 25929.68 | 6788.958 | 672.8773 | 0.0 | 0.0 | 0.375730 | 2.661487 |
1 | True | 29.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 10881.94 | 1817.284 | 0.0000 | 0.0 | 0.0 | 0.545410 | 1.833483 |
2 | True | 18.0 | 11.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.00 | 0.000 | 4814.6270 | 1.0 | 1.0 | 0.351612 | 2.844044 |
3 | True | 28.0 | 11.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.00 | 1284.079 | 60307.9300 | 1.0 | 0.0 | 0.367046 | 2.724453 |
4 | True | 25.0 | 8.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.00 | 0.000 | 0.0000 | 1.0 | 1.0 | 0.398144 | 2.511651 |
这个新的 dataframe 给出了当 'treat'
设置为 1
时 're78'
的分布。
关于 do
方法如何工作的更多详细信息,请查看 docstring:
[17]:
help(lalonde.causal.do)
Help on method do in module dowhy.api.causal_data_frame:
do(x, method='weighting', num_cores=1, variable_types={}, outcome=None, params=None, graph: networkx.classes.digraph.DiGraph = None, common_causes=None, estimand_type=<EstimandType.NONPARAMETRIC_ATE: 'nonparametric-ate'>, stateful=False) method of dowhy.api.causal_data_frame.CausalAccessor instance
The do-operation implemented with sampling. This will return a pandas.DataFrame with the outcome
variable(s) replaced with samples from P(Y|do(X=x)).
If the value of `x` is left unspecified (e.g. as a string or list), then the original values of `x` are left in
the DataFrame, and Y is sampled from its respective P(Y|do(x)). If the value of `x` is specified (passed with a
`dict`, where variable names are keys, and values are specified) then the new `DataFrame` will contain the
specified values of `x`.
For some methods, the `variable_types` field must be specified. It should be a `dict`, where the keys are
variable names, and values are 'o' for ordered discrete, 'u' for un-ordered discrete, 'd' for discrete, or 'c'
for continuous.
Inference requires a set of control variables. These can be provided explicitly using `common_causes`, which
contains a list of variable names to control for. These can be provided implicitly by specifying a causal graph
with `dot_graph`, from which they will be chosen using the default identification method.
When the set of control variables can't be identified with the provided assumptions, a prompt will raise to the
user asking whether to proceed. To automatically over-ride the prompt, you can set the flag
`proceed_when_unidentifiable` to `True`.
Some methods build components during inference which are expensive. To retain those components for later
inference (e.g. successive calls to `do` with different values of `x`), you can set the `stateful` flag to `True`.
Be cautious about using the `do` operation statefully. State is set on the namespace, rather than the method, so
can behave unpredictably. To reset the namespace and run statelessly again, you can call the `reset` method.
:param x: str, list, dict: The causal state on which to intervene, and (optional) its interventional value(s).
:param method: The inference method to use with the sampler. Currently, `'mcmc'`, `'weighting'`, and
`'kernel_density'` are supported. The `mcmc` sampler requires `pymc3>=3.7`.
:param num_cores: int: if the inference method only supports sampling a point at a time, this will parallelize
sampling.
:param variable_types: dict: The dictionary containing the variable types. Must contain the union of the causal
state, control variables, and the outcome.
:param outcome: str: The outcome variable.
:param params: dict: extra parameters to set as attributes on the sampler object
:param dot_graph: str: A string specifying the causal graph.
:param common_causes: list: A list of strings containing the variable names to control for.
:param estimand_type: str: 'nonparametric-ate' is the only one currently supported. Others may be added later, to allow for specific, parametric estimands.
:param proceed_when_unidentifiable: bool: A flag to over-ride user prompts to proceed when effects aren't
identifiable with the assumptions provided.
:param stateful: bool: Whether to retain state. By default, the do operation is stateless.
:return: pandas.DataFrame: A DataFrame containing the sampled outcome