使用 OverRule 评估支持和重叠#

动机#

在执行后门调整时,即使我们正确指定了因果图并观察到所有相关的后门调整变量 \(X\),我们只能估计满足以下两个条件的协变量 \(X\) 个体的因果效应:* 支持:简单来说,我们要求在我们的数据集中观察到类似的个体。对于离散协变量 \(X\),这可以形式化为要求

\[P(X) > 0\]

* 重叠:我们要求对于相似的个体,存在观察到处理和对照的可能性。形式上,我们要求

\[1 > P(T = 1 \mid X) > 0\]

OverRule [1] 是一种学习布尔规则集的方法,用于表征满足这两个条件的个体集合,并在本笔记本中通过一些简单的示例来构建直觉。

致谢#

OverRule 的代码改编自 (clinicalml/overlap-code),进行了最小修改和一些简化,遵循 MIT 许可证。

[1] Oberst, M., Johansson, F., Wei, D., Gao, T., Brat, G., Sontag, D., & Varshney, K. (2020). 观测研究中重叠的表征。载于 S. Chiappa & R. Calandra (编著),第二十三届国际人工智能与统计会议论文集(第 108 卷,页 788–798)。PMLR。https://arxiv.org/abs/1907.04138

目录#

  1. 简单二维示例图示

    1. 问题数据

    2. 使用默认参数应用 OverRule

    3. 解释 OverRule 的输出

    4. 配置 OverRule

  2. Lalonde 和 PSID 数据集图示

[1]:
import numpy as np
import pandas as pd

import dowhy.datasets
from dowhy import CausalModel
# Functional API
from dowhy.causal_refuters.assess_overlap import assess_support_and_overlap_overrule

# Data classes to configure ruleset optimization
from dowhy.causal_refuters.assess_overlap_overrule import SupportConfig, OverlapConfig

import matplotlib.pyplot as plt

简单二维示例图示#

返回目录

问题数据#

返回目录

在此示例中,我们有一对二元协变量 \(X_1, X_2\),它们简单地违反了上述条件:* 支持:没有 \(X_1 = X_2 = 1\) 的样本,即 \(P(X_1 = 1, X_2 = 1) = 0\) * 重叠:只有 \(X_1 = 0, X_2 = 0\) 的个体有机会同时接受处理 \((T = 1)\) 和对照 \((T = 0)\)

[2]:
test_data = pd.DataFrame(
        np.array(
            [
                [0, 0, 1, 1],
                [0, 0, 0, 0],
                [0, 1, 1, 0],
                [1, 0, 0, 0],
            ]
            * 50
        ),
        columns=["X1", "X2", "T", "Y"],
    )

我们可以按如下方式可视化这些模式,其中我们添加了一些扰动以便更容易看到

[3]:
rng = np.random.default_rng(0)
jitter_data = test_data.copy()
jitter_data['X1'] = jitter_data['X1'] + rng.normal(0, 0.05, size=(200,))
jitter_data['X2'] = jitter_data['X2'] + rng.normal(0, 0.05, size=(200,))

fig, ax = plt.subplots(1, 1, figsize=(5, 5))
for t in [0, 1]:
    this_data = jitter_data.query('T == @t')
    ax.scatter(this_data['X1'], this_data['X2'], label=f'T = {t}', edgecolor='w', linewidths=0.5)

plt.legend()
plt.show()
../_images/example_notebooks_dowhy_refuter_assess_overlap_7_0.png

注意,重叠区域仅在 \(X_1 = X_2 = 0\) 时成立。

使用默认参数应用 OverRule#

返回目录

此方法目前实现为一种反驳方法,但也可以通过函数式 API 访问。下面我们演示这两种方法,然后讨论它们的参数。

请注意,OverRule 返回析取范式 (DNF) 规则,可以理解为 OR 的 AND,并在调用 print(refute) 时给出。

使用 CausalModel 的示例用法#

[4]:
model = CausalModel(
    data=test_data,
    treatment="T",
    outcome="Y",
    common_causes=['X1', 'X2']
)
[5]:
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

estimate = model.estimate_effect(identified_estimand, method_name="backdoor.linear_regression")
[6]:
# We disucss configuration in more depth below.
support_config = SupportConfig(seed=0)
overlap_config = OverlapConfig()
[7]:
refute = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name='assess_overlap',
    support_config=support_config,
    overlap_config=overlap_config,
)
[8]:
print(refute)
Rules cover 50.0% of all samples
Overall, 50.0% of samples meet the criteria for inclusion in the overlap set,
defined as (a) being covered by support rules and having propensity
score in (0.10, 0.90)
Rules capture 100.0% of samples which meet these criteria

How to read rules: The following rules are given in Disjuntive Normal Form,
a series of AND clauses (e.g., X and Y and Z) joined by ORs. Hence, if a sample
 satifies any of the clauses for the support rules, it is included in the support,
and likewise for the overlap rules.

DETAILED RULES:
SUPPORT Rules: Found 2 rule(s), covering 100.0% of samples
            Rule #0: (not X1)
                 [Covers 75.0% of samples]
         OR Rule #1: (X1)
                 AND (not X2)
                 [Covers 25.0% of samples]
OVERLAP Rules: Found 1 rule(s), covering 50.0% of samples
            Rule #0: (not X1)
                 AND (not X2)
                 [Covers 50.0% of samples]

使用函数式 API 的示例用法#

请注意,结果 \(Y\) 未用作 OverRule 的一部分,因此在此处不需要。

[9]:
refute = assess_support_and_overlap_overrule(
    data=test_data,
    backdoor_vars=['X1', 'X2'],
    treatment_name='T',
    support_config=support_config,
    overlap_config=overlap_config
)

只拟合支持或重叠规则#

您还可以通过传递 support_only=Trueoverlap_only=True 来运行 OverRule,以便学习支持或重叠规则。

[10]:
refute = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name='assess_overlap',
    support_only=True,
    support_config=support_config,
)
[11]:
print(refute)
Rules cover 100.0% of all samples

How to read rules: The following rules are given in Disjuntive Normal Form,
a series of AND clauses (e.g., X and Y and Z) joined by ORs. Hence, if a sample
 satifies any of the clauses for the support rules, it is included in the support,
and likewise for the overlap rules.

DETAILED RULES:
SUPPORT Rules: Found 2 rule(s), covering 100.0% of samples
            Rule #0: (not X1)
                 [Covers 75.0% of samples]
         OR Rule #1: (X1)
                 AND (not X2)
                 [Covers 25.0% of samples]
No Overlap Rules Fitted (support_only=True).
[12]:
refute = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name='assess_overlap',
    overlap_only=True,
    overlap_config=overlap_config
)
[13]:
print(refute)
Rules cover 50.0% of all samples
Overall, 50.0% of samples meet the criteria for inclusion in the overlap set,
defined as (a) being covered by support rules and having propensity
score in (0.10, 0.90)
Rules capture 100.0% of samples which meet these criteria

How to read rules: The following rules are given in Disjuntive Normal Form,
a series of AND clauses (e.g., X and Y and Z) joined by ORs. Hence, if a sample
 satifies any of the clauses for the support rules, it is included in the support,
and likewise for the overlap rules.

DETAILED RULES:
No Support Rules Fitted (overlap_only=True).
OVERLAP Rules: Found 1 rule(s), covering 50.0% of samples
            Rule #0: (not X1)
                 AND (not X2)
                 [Covers 50.0% of samples]

解释 OverRule 的输出#

返回目录

我们将使用以下规则集来说明如何解释输出。

SUPPORT Rules: Found 2 rule(s), covering 100.0% of samples
        Rule #0: (not X1)
             [Covers 75.0% of samples]
     OR Rule #1: (X1)
             AND (not X2)
             [Covers 25.0% of samples]
OVERLAP Rules: Found 1 rule(s), covering 50.0% of samples
        Rule #0: (not X2)
             AND (not X1)
             [Covers 50.0% of samples]

对于任何样本,我们可以通过检查每个规则(规则 0 和规则 1)来检查该样本是否被支持覆盖。如果这些规则中的任意一条适用,则该样本在支持内。考虑一个点 (X1 = 1, X2 = 1),我们知道它不在原始数据的支持中。由于每个二元输入被视为布尔值,我们可以检查:* 规则 0: (not X1) –> 此规则不被 (X1 = 1, X2 = 1) 满足,因为 X1 = 1 等同于 X2 = True * 规则 1: (X1) AND (not X2) –> 此规则不被满足,因为 X2 = 1

同样,我们可以检查一个样本是否被重叠规则覆盖。考虑点 (X1 = 1, X2 = 0)。它满足支持规则,因为它匹配规则 0。然而,对于重叠规则,我们可以检查:* 规则 0: (not X2) AND (not X1) –> 此规则不被满足,因为 X1 = 1。

注意:当我们稍后基于 OverRule 学习到的规则过滤样本时,我们将过滤掉任何不同时匹配支持规则和重叠规则的样本。

配置 OverRule#

返回目录

OverRule 可以通过传递以下参数进行配置:* overlap_eps:这是一个位于 (0, 0.5) 中的数字 \(\epsilon\),它将重叠区域定义为 \(P(T = 1 \mid X)\) 位于 \((\epsilon, 1 - \epsilon)\) 中的点。

  • cat_feats:任何类别特征的列表(如果 dtype 是 object 则自动推断)。

  • support_config:一个 SupportConfig(见下文),控制优化参数。

  • overlap_config:一个 OverlapConfig(见下文),控制优化参数。

[14]:
support_config = SupportConfig()
print(support_config)
SupportConfig(n_ref_multiplier=1.0, seed=None, alpha=0.98, lambda0=0.01, lambda1=0.001, K=20, D=20, B=10, iterMax=10, num_thresh=9, thresh_override=None, solver='ECOS', rounding='greedy_sweep')
[15]:
overlap_config = OverlapConfig()
print(overlap_config)
OverlapConfig(alpha=0.95, lambda0=0.001, lambda1=0.001, K=20, D=20, B=10, iterMax=10, num_thresh=9, thresh_override=None, solver='ECOS', rounding='greedy_sweep')

关键参数:对于 SupportConfigOverlapConfig,默认值通常是合理的,但根据问题可能需要更改一些参数,详细如下。

  • alpha:

    • 对于学习支持规则,我们要求至少包含原始数据的 alpha 分数,同时最小化所得支持集的体积。

    • 对于学习重叠规则,至少包含“重叠点”的 alpha 分数,同时排除尽可能多的非重叠点。此处,“重叠点”定义为那些 \(P(T = 1 \mid X)\) 位于 [overlap_eps, 1 - overlap_eps] 范围内(使用 XGBClassifier 估计)的点。

  • lambda0, lambda1:这些是控制最终规则复杂度的正则化项。

    • lambda0 是为每个新规则添加的惩罚(一个规则可以包含多个文字)。例如,(not X1) AND (not X2) 是一个单一规则。

    • lambda1 是为每个文字添加的惩罚。例如,(not X1) AND (not X2) 包含两个文字。

  • num_thresh:OverRule 在学习规则集之前将所有特征转换为二元规则。对于连续变量,默认情况下通过转换为十分位数来实现。可以通过更改阈值数量或指定一个字典(在 thresh_override 中)将特征映射到阈值的 np.ndarray 来更改此默认行为。

Lalonde 和 PSID 数据集图示#

返回目录

本节展示了如何在进行另一项因果分析的过程中应用 OverRule。此处,我们演示了两种方法:* 在进行任何因果效应估计之前应用 OverRule * 使用学习到的规则过滤样本

将 OverRule 纳入因果推断工作流程如下所示:* 首先,在执行任何因果效应估计之前,应用 OverRule 来学习规则 * 其次,根据学习到的规则过滤样本 * 第三,在过滤后的样本上估计因果效应

为了说明这一点,我们使用两个数据集构建了一个有点人工的因果推断任务:* 来自 Lalonde (1986) 的原始数据集,其中包含实验样本,通过 dowhy.datasets.lalonde_dataset() 加载。由于这是实验样本,我们将看到重叠在该样本中的所有样本上都成立。 * 一个完全由未处理样本组成的观测数据集,即 PSID 数据集,在 Smith & Todd (2005) 中结合原始实验数据集进行了分析。它从 dowhy.datasets.psid_dataset() 加载。

通过合并这些数据集然后评估重叠,我们发现在合并的数据集中,重叠仅在约 30% 的样本中成立。 直观地说,许多来自 PSID 的样本被排除,这是由于原始实验的严格纳入标准所致。

[48]:
import pandas as pd

import dowhy.datasets
from dowhy.causal_refuters.assess_overlap import assess_support_and_overlap_overrule
from dowhy.causal_refuters.assess_overlap_overrule import OverlapConfig, SupportConfig

# Experimental sample from Lalonde
experiment_data = dowhy.datasets.lalonde_dataset()

# Observational sample
observational_controls = dowhy.datasets.psid_dataset()

experiment_data["exp_samp"] = True
observational_controls["exp_samp"] = False

data = pd.concat([experiment_data, observational_controls], axis=0, ignore_index=True)

support_config = SupportConfig(seed=0, lambda0=0.01, lambda1=0.01)
overlap_config = OverlapConfig(lambda0=0.01, lambda1=0.01)

在运行任何因果分析之前,我们首先检查支持和重叠。请注意,alpha 的默认设置为 0.98,这意味着:* 支持规则将尝试覆盖至少 98% (alpha) 的数据,且体积最小。* 重叠规则将尝试覆盖至少 98% (alpha) 具有倾向得分在 [overlap_eps, 1-overlap_eps] 范围内的样本。

检查实验样本中的支持/重叠#

[17]:
refute_experiment = assess_support_and_overlap_overrule(
    data=experiment_data,
    backdoor_vars="nodegr+black+hisp+age+educ+married+re74+re75".split("+"),
    treatment_name="treat",
    support_config=support_config,
    overlap_config=overlap_config,
)

# Observe how everyone is in the overlap set, so we do not learn any overlap rules
print(refute_experiment)
All samples in the support region satisfy the overlap condition.
No Overlap Rules will be learned.
Rules cover 98.9% of all samples

How to read rules: The following rules are given in Disjuntive Normal Form,
a series of AND clauses (e.g., X and Y and Z) joined by ORs. Hence, if a sample
 satifies any of the clauses for the support rules, it is included in the support,
and likewise for the overlap rules.

DETAILED RULES:
SUPPORT Rules: Found 4 rule(s), covering 98.9% of samples
            Rule #0: ([re74 <= 0.000])
                 [Covers 73.3% of samples]
         OR Rule #1: (black)
                 AND (not hisp)
                 AND ([age <= 35.000])
                 AND ([educ <= 12.000])
                 [Covers 71.5% of samples]
         OR Rule #2: (not married)
                 AND ([re74 <= 7838.192])
                 AND ([re75 <= 5146.040])
                 [Covers 73.7% of samples]
         OR Rule #3: (not black)
                 AND ([age <= 30.000])
                 AND ([educ <= 12.000])
                 AND ([educ > 8.000])
                 [Covers 10.6% of samples]
No Overlap Rules Fitted (support_only=True).

注意:如上所示,如果 OverRule 发现所有样本都满足重叠条件,则不会麻烦地学习单独的重叠规则。

评估合并样本中的支持/重叠#

现在,分析合并数据集,我们发现存在显著的重叠不足。

[18]:
var_list = "nodegr+black+hisp+age+educ+married+re74+re75".split("+")
refute = assess_support_and_overlap_overrule(
    data=data,
    backdoor_vars=var_list,
    treatment_name="treat",
    support_config=support_config,
    overlap_config=overlap_config,
)
print(refute)
Rules cover 34.6% of all samples
Overall, 31.2% of samples meet the criteria for inclusion in the overlap set,
defined as (a) being covered by support rules and having propensity
score in (0.10, 0.90)
Rules capture 96.8% of samples which meet these criteria

How to read rules: The following rules are given in Disjuntive Normal Form,
a series of AND clauses (e.g., X and Y and Z) joined by ORs. Hence, if a sample
 satifies any of the clauses for the support rules, it is included in the support,
and likewise for the overlap rules.

DETAILED RULES:
SUPPORT Rules: Found 2 rule(s), covering 98.1% of samples
            Rule #0: ([re74 <= 33307.536])
                 AND ([re75 <= 32691.290])
                 [Covers 87.6% of samples]
         OR Rule #1: (not nodegr)
                 AND (not hisp)
                 AND ([educ > 11.000])
                 [Covers 60.6% of samples]
OVERLAP Rules: Found 5 rule(s), covering 35.3% of samples
            Rule #0: ([re74 <= 1282.723])
                 [Covers 20.0% of samples]
         OR Rule #1: ([re75 <= 716.129])
                 [Covers 20.1% of samples]
         OR Rule #2: (black)
                 AND (not married)
                 [Covers 15.2% of samples]
         OR Rule #3: (not nodegr)
                 AND ([re74 <= 7837.067])
                 [Covers 12.2% of samples]
         OR Rule #4: ([age <= 26.000])
                 AND ([educ > 11.000])
                 AND ([re75 <= 11637.097])
                 [Covers 7.7% of samples]

在进行因果分析前使用 OverRule 进行过滤#

OverRule 可以接受与原始数据框具有相同列的任何数据框,并过滤到包含在支持区域和重叠区域中的单元。

[56]:
bool_mask = refute.predict_overlap_support(data)

# This is a simple wrapper around refute.predict_overlap_support
filtered_df = refute.filter_dataframe(data).copy()

assert np.array_equal(data[bool_mask], filtered_df)
[47]:
bool_mask.mean()
[47]:
$\displaystyle 0.345826235093697$

过滤后进行因果推断#

假设我们只获得了合并样本,并希望进行因果效应估计。使用 OverRule 表征支持/重叠区域后,我们现在可以过滤数据并在缩减后的数据集上进行分析。

[43]:
# Note that we are using the filtered_df here
model = CausalModel(
        data = filtered_df,
        treatment='treat',
        outcome='re78',
        common_causes="nodegr+black+hisp+age+educ+married+re74+re75".split("+")
)
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_weighting",
        target_units="ate",
        method_params={"weighting_scheme":"ips_weight"})

print("Causal Estimate is " + str(estimate.value))
Causal Estimate is -958.9002273474416

免责声明:这仅是说明如何将 OverRule 纳入因果推断工作流程的示例。然而,任务本身有点人工:在这里,我们合并了实验数据(来自 Lalonde)和观测数据(来自 PSID),这可能会给我们的估计带来偏差。我们在此这样做仅是为了证明 OverRule 可以识别合并样本中处理和对照之间缺乏重叠的情况。