Basic Example for Generating Samples from a GCM#

A graphical causal model (GCM) describes the data generation process of the modeled variables. Therefore, once a GCM is fitted, we can also generate completely new samples from it, i.e., we can treat it as a generator of synthetic data based on the underlying model. Generating new samples can generally be done by sorting the nodes in topological order, randomly sampling from the root nodes, and then propagating the data through the graph by evaluating the downstream causal mechanisms with randomly sampled noise. The dowhy.gcm package provides a simple helper function that does this automatically, offering a simple API to draw samples from a GCM.
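
To make this procedure concrete, the following is a minimal sketch of such an ancestral sampling scheme over an arbitrary DAG. The helpers sample_root and evaluate_mechanism are hypothetical stand-ins for the fitted root distributions and causal mechanisms; gcm.draw_samples handles all of this internally:

import networkx as nx

def draw_samples_sketch(dag, sample_root, evaluate_mechanism, num_samples):
    samples = {}
    # Topological order guarantees that parents are sampled before children.
    for node in nx.topological_sort(dag):
        parents = list(dag.predecessors(node))
        if not parents:
            # Root node: draw directly from its fitted marginal distribution.
            samples[node] = sample_root(node, num_samples)
        else:
            # Non-root node: evaluate the causal mechanism on the parent
            # samples, with fresh noise drawn inside the mechanism.
            samples[node] = evaluate_mechanism(node, {p: samples[p] for p in parents}, num_samples)
    return samples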

Let's take a look at the following example:

[1]:
import numpy as np
import pandas as pd

X = np.random.normal(loc=0, scale=1, size=1000)
Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
data.head()
[1]:
          X         Y          Z
0 -0.810737 -0.393341  -2.842339
1  0.475764  0.753873   2.736163
2  0.652685 -1.099695  -0.105575
3 -2.459928 -3.809661 -15.383110
4 -0.110089 -0.447364  -1.085281

Similar to the introduction, we generate data for the simple linear DAG X→Y→Z. Let's define the GCM and fit it to the data:

[2]:
import networkx as nx
import dowhy.gcm as gcm

causal_model = gcm.StructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')]))
gcm.auto.assign_causal_mechanisms(causal_model, data) # Automatically assigns additive noise models to non-root nodes
gcm.fit(causal_model, data)
Fitting causal mechanism of node Z: 100%|██████████| 3/3 [00:00<00:00, 457.41it/s]
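
To see which mechanisms the auto-assignment picked, we can inspect the fitted model node by node (a quick sketch; the root node X should get a marginal distribution model, while Y and Z get additive noise models):

for node in causal_model.graph.nodes:
    # Prints the causal mechanism assigned to each node.
    print(node, ':', causal_model.causal_mechanism(node))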

We have now learned a generative model of the variables, based on the defined causal graph and the additive noise model assumption. To generate new samples from this model, we simply need to call:

[3]:
generated_data = gcm.draw_samples(causal_model, num_samples=1000)
generated_data.head()
[3]:
          X         Y         Z
0  1.265052  3.045911  7.820090
1  1.017337  0.639482  2.721905
2  0.359946  1.851538  3.884363
3 -0.448821 -1.161121 -3.166857
4  0.125756  2.136710  8.440217
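
As an informal first check that the generated samples resemble the observed ones, we can compare simple summary statistics before running the more rigorous evaluation below:

# The first two moments of each variable should roughly match if the
# generated distribution resembles the observed one.
print(data.describe().loc[['mean', 'std']])
print(generated_data.describe().loc[['mean', 'std']])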

If our modeling assumptions are correct, the generated data should now resemble the distribution of the observed data, i.e., the generated samples should correspond to the joint distribution we defined for our example data at the beginning. One way to make sure of this is to estimate the KL divergence between the observed and the generated distribution. For this, we can make use of the evaluation module:

[4]:
print(gcm.evaluate_causal_model(causal_model, data, evaluate_causal_mechanisms=False, evaluate_invertibility_assumptions=False))
Test permutations of given graph: 100%|██████████| 6/6 [00:00<00:00, 18.71it/s]
[Image: ../_images/example_notebooks_gcm_draw_samples_8_1.png]
Evaluated and the overall average KL divergence between generated and observed distribution and the graph structure. The results are as follows:

==== Evaluation of Generated Distribution ====
The overall average KL divergence between the generated and observed distribution is 0.02997159374891036
The estimated KL divergence indicates an overall very good representation of the data distribution.

==== Evaluation of the Causal Graph Structure ====
+-------------------------------------------------------------------------------------------------------+
|                                         Falsification Summary                                         |
+-------------------------------------------------------------------------------------------------------+
| The given DAG is not informative because 2 / 6 of the permutations lie in the Markov                  |
| equivalence class of the given DAG (p-value: 0.33).                                                   |
| The given DAG violates 0/1 LMCs and is better than 66.7% of the permuted DAGs (p-value: 0.33).        |
| Based on the provided significance level (0.2) and because the DAG is not informative,                |
| we do not reject the DAG.                                                                             |
+-------------------------------------------------------------------------------------------------------+

==== NOTE ====
Always double check the made model assumptions with respect to the graph structure and choice of causal mechanisms.
All these evaluations give some insight into the goodness of the causal model, but should not be overinterpreted, since some causal relationships can be intrinsically hard to model. Furthermore, many algorithms are fairly robust against misspecifications or poor performances of causal mechanisms.

This confirms that the generated distribution is close to the observed one.
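
For a more fine-grained picture, we can also estimate the marginal KL divergence per variable. The sketch below assumes the auto_estimate_kl_divergence estimator from dowhy.gcm.divergence, which the evaluation above relies on internally (the exact estimator name may differ between versions):

from dowhy.gcm.divergence import auto_estimate_kl_divergence

for column in data.columns:
    # Estimated KL divergence between the observed and generated marginal of each variable.
    kl = auto_estimate_kl_divergence(data[column].to_numpy(), generated_data[column].to_numpy())
    print('KL divergence for %s: %f' % (column, kl))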

While the evaluation also provides us with insights into the causal graph structure, we cannot confirm the graph structure; we can only reject it if we find inconsistencies between the dependencies observed in the data and those represented by the graph. In our case, we do not reject the DAG, but there are other equivalent DAGs that would not be rejected either. To see this, consider the example above: X→Y→Z and X←Y←Z would generate the same observational distribution (since they encode the same conditionals), but only X→Y→Z would generate the correct interventional distribution (e.g., when intervening on Y).
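
To make this last point tangible, we can draw interventional samples from the fitted model. The sketch below applies a hard intervention do(Y := 2) via gcm.interventional_samples; under X→Y→Z, X keeps its observational distribution while Z follows the intervened value of Y, whereas the reversed chain X←Y←Z would instead propagate the intervention into X:

# Hard intervention that sets Y to the constant 2 in every sample.
intervened_data = gcm.interventional_samples(causal_model, {'Y': lambda y: 2}, num_samples_to_draw=1000)
intervened_data.head()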