
Graph Autoencoder Causal Discovery

Updated 2 February 2026
  • The paper demonstrates that GAE integrates acyclicity constraints with neural networks to accurately infer DAG structures, achieving lower SHD and higher TPR than baselines such as NOTEARS and DAG-GNN.
  • GAE-based causal discovery employs an encoder-decoder architecture with MLPs to capture nonlinear dependencies and generate latent embeddings from vector-valued data.
  • Empirical results show competitive training times and robust performance on synthetic benchmarks, though scalability to larger real-world graphs remains an open challenge.

Graph autoencoder based causal discovery refers to a class of neural methodologies that integrate graph autoencoder (GAE) architectures with explicit constraints and objectives for learning causal structures, typically in the form of directed acyclic graphs (DAGs), from multivariate data. These methods exploit the representational capacity of deep architectures, coupled with acyclicity constraints inspired by combinatorial causal discovery, to efficiently and flexibly infer the underlying causal graph, particularly in regimes characterized by nonlinear or vector-valued structural equation models (SEMs) and higher dimensionality (Ng et al., 2019).

1. Problem Formulation and Background

The primary goal in causal structure learning is to estimate the graphical structure $\mathcal{G}$, represented as a DAG, governing a set of observed variables $X_1, \dots, X_d$, potentially vector-valued, given i.i.d. samples. The canonical model class is the additive-noise model (ANM):

$$X_i = f_i\big(X_{\mathrm{pa}(i)}\big) + Z_i, \quad i = 1, \dots, d$$

where $X_{\mathrm{pa}(i)}$ denotes the parents of node $i$, the $f_i$ are generally unknown deterministic functions, and the noise terms $Z_i$ are jointly independent. The structure-learning task is to recover the adjacency matrix of the ground-truth DAG from data.
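As a concrete illustration, sampling from a small ANM is straightforward. The three-node chain, the functions $f_i$, and the sample size below are all hypothetical choices for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-node chain X1 -> X2 -> X3; f2, f3 are illustrative nonlinearities.
n = 1000
z = rng.normal(size=(n, 3))          # jointly independent noise terms Z_i
x = np.zeros((n, 3))
x[:, 0] = z[:, 0]                    # X1 = Z1 (no parents)
x[:, 1] = np.sin(x[:, 0]) + z[:, 1]  # X2 = f2(X1) + Z2
x[:, 2] = x[:, 1] ** 2 + z[:, 2]     # X3 = f3(X2) + Z3
```

The structure-learning task is then: given only the sample matrix `x`, recover the adjacency of the generating DAG.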

Traditional algorithms, including constraint- and score-based methods, must search the discrete, combinatorial space of DAGs, which is NP-hard. Recent advances such as NOTEARS reparameterize the DAG as a continuous adjacency matrix $A \in \mathbb{R}^{d \times d}$ and impose a smooth acyclicity constraint:

$$h(A) = \mathrm{Tr}\big[e^{A \odot A}\big] - d = 0$$

This approach transforms the structure search into a differentiable optimization problem.
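The constraint can be evaluated directly with a matrix exponential. A minimal sketch using NumPy/SciPy, with two illustrative test matrices:

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(A: np.ndarray) -> float:
    """h(A) = tr(exp(A ∘ A)) - d; zero iff A encodes a DAG."""
    d = A.shape[0]
    return float(np.trace(expm(A * A)) - d)

A_dag = np.array([[0.0, 1.0], [0.0, 0.0]])  # single edge 1 -> 2, acyclic
A_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])  # 2-cycle

print(acyclicity(A_dag))  # ≈ 0 (acyclic)
print(acyclicity(A_cyc))  # ≈ 1.086 (cycle detected)
```

The Hadamard square $A \odot A$ makes the penalty insensitive to edge signs, and the trace of the exponential counts weighted closed walks of every length, so it vanishes exactly when no cycle exists.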

2. Graph Autoencoder Architecture for Causal Discovery

The GAE-based approach generalizes traditional structural equation models by introducing deep neural parameterizations:

  • Encoder $g_1: \mathbb{R}^l \rightarrow \mathbb{R}^{l'}$ (e.g., a multilayer perceptron)
  • Decoder $g_2: \mathbb{R}^{l'} \rightarrow \mathbb{R}^l$

Given a sample $X^{(j)} \in \mathbb{R}^{d \times l}$:

  1. The encoder computes latent embeddings $H^{(j)} = g_1(X^{(j)})$.
  2. The adjacency matrix mixes the embeddings: $H^{(j)\prime} = A^T H^{(j)}$.
  3. The decoder reconstructs the sample: $\widehat{X}^{(j)} = g_2(H^{(j)\prime})$.

This pipeline enforces a causal inductive bias: $A$ dictates information flow, while the nonlinearities $g_1$ and $g_2$ enable the model to represent arbitrary deterministic SEMs. The encoder $f_E(X) = g_1(X)$ and decoder $f_D(H, A) = g_2(A^T H)$ share parameters across nodes.
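The encode–mix–decode pipeline above can be sketched in a few lines. The single-hidden-layer MLPs, their widths, and the random weights here are illustrative stand-ins for trained parameters (the paper's experiments use deeper MLPs):

```python
import numpy as np

rng = np.random.default_rng(1)
d, l, lp = 5, 1, 1  # d variables, observed dim l, latent dim l' (illustrative)

# Shared MLP weights for encoder g1 and decoder g2, applied node-wise.
W1e, W2e = rng.normal(size=(l, 16)), rng.normal(size=(16, lp))
W1d, W2d = rng.normal(size=(lp, 16)), rng.normal(size=(16, l))
relu = lambda t: np.maximum(t, 0.0)

def gae_forward(X, A):
    """X: (d, l) sample; A: (d, d) weighted adjacency. Returns (d, l) reconstruction."""
    H = relu(X @ W1e) @ W2e           # 1. encode: H = g1(X), shared across nodes
    H_mixed = A.T @ H                 # 2. mix embeddings along causal edges
    return relu(H_mixed @ W1d) @ W2d  # 3. decode: X_hat = g2(A^T H)

X = rng.normal(size=(d, l))
A = np.triu(rng.normal(size=(d, d)), k=1)  # strictly upper-triangular => a DAG
X_hat = gae_forward(X, A)
```

Because the same $g_1$ and $g_2$ are applied to every node, the only node-specific object is the adjacency $A$, which is what the training procedure optimizes.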

3. Objective Function and Acyclicity Constraint

The model is trained to minimize the reconstruction error, subject to an $\ell_1$ sparsity regularizer and an acyclicity constraint:

$$\min_{A,\Theta_1,\Theta_2}\ \frac{1}{2n} \sum_{j=1}^n \big\|X^{(j)} - \widehat{X}^{(j)}\big\|_F^2 + \lambda \|A\|_1$$

subject to $h(A) = 0$.

An augmented Lagrangian formulation is used:

$$L_\rho(A, \Theta_1, \Theta_2, \alpha) = L_{\mathrm{rec}} + \lambda \|A\|_1 + \alpha\, h(A) + \frac{\rho}{2}\, h(A)^2$$

Optimization employs alternating updates:

  • Gradient-based steps for $(A, \Theta_1, \Theta_2)$ (Adam optimizer, full batch)
  • Lagrange-multiplier update $\alpha \leftarrow \alpha + \rho\, h(A)$ and penalty update $\rho \leftarrow \beta \rho$ as needed

This continuous relaxation ensures differentiability and scalability, while $h(A)$ enforces that the learned adjacency encodes a DAG.
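A minimal sketch of the augmented-Lagrangian loop for the linear special case ($g_1 = g_2 = \mathrm{id}$), where the gradients have closed form: the toy two-variable SEM, step sizes, and iteration counts are illustrative, plain gradient descent stands in for the Adam inner solver, and the $\ell_1$ term is omitted for brevity:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

# Toy linear SEM: edge X1 -> X2 with weight 2 (illustrative ground truth).
n, d = 500, 2
A_true = np.array([[0.0, 2.0], [0.0, 0.0]])
X = rng.normal(size=(n, d)) @ np.linalg.inv(np.eye(d) - A_true)  # X = X A_true + Z

def h(A):  # acyclicity constraint tr(exp(A ∘ A)) - d
    return float(np.trace(expm(A * A)) - A.shape[0])

A, alpha, rho, beta = np.zeros((d, d)), 0.0, 1.0, 10.0
for _ in range(8):                  # outer augmented-Lagrangian rounds
    lr = 0.01 / (1.0 + rho)         # shrink steps as the penalty grows (heuristic)
    for _ in range(300):            # inner gradient steps on L_rho w.r.t. A
        grad_h = 2.0 * A * expm(A * A).T            # d h / d A
        grad = -X.T @ (X - X @ A) / n + (alpha + rho * h(A)) * grad_h
        A -= lr * grad
    alpha += rho * h(A)             # Lagrange-multiplier update
    rho *= beta                     # penalty update
```

Each outer round tightens the penalty, driving $h(A)$ toward zero while the inner loop trades reconstruction error against constraint violation.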

4. Expressiveness: Nonlinear SEMs and Vector-Valued Data

The use of MLPs for $g_1$ and $g_2$ enables the framework to model a broad class of nonlinear dependencies within ANMs. When $g_1$ and $g_2$ are set to the identity, the model reduces to a linear SEM. The approach extends naturally to vector-valued variables ($X_i \in \mathbb{R}^l$) by setting $g_1: \mathbb{R}^l \to \mathbb{R}^{l'}$ and $g_2: \mathbb{R}^{l'} \to \mathbb{R}^l$, allowing efficient low-dimensional embedding and information flow, with a single $A$ controlling inter-variable causality (Ng et al., 2019).

5. Network Architecture and Training Regimen

Experiments have used 3-layer MLPs for both $g_1$ and $g_2$ (hidden-layer width 16, ReLU activation), with latent dimension $l' = 1$ when $l = 1$ and $l' = 3$ when $l = 5$. Hyperparameters for the Lagrangian penalty and the $\ell_1$ sparsity term are tuned empirically. Optimization proceeds until the acyclicity violation $|h(A)|$ is sufficiently small ($< 10^{-8}$). Training is efficient: wall-clock time scales near-linearly with $d$, so large graphs ($d \geq 100$) can be handled on standard hardware.

6. Empirical Performance and Benchmarks

The approach has been benchmarked on synthetic data generated from random Erdős–Rényi DAGs with nonlinear SEMs:

  • $X = A^T \cos(X+1) + Z$
  • $X = 2\sin\big(A^T\cos(X+1) + 0.5\big) + \big(A^T\cos(X+1) + 0.5\big) + Z$, with $Z \sim \mathcal{N}(0, I)$
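Because $A$ is acyclic, such samples can be generated node by node in topological order. The sketch below implements the first SEM above, assuming an upper-triangular $A$; the graph size, edge density, and weight range are illustrative choices, not the paper's exact benchmark settings:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 1000

# Random strictly upper-triangular weights stand in for an Erdős–Rényi DAG.
mask = rng.random((d, d)) < 0.3
A = np.triu(rng.uniform(0.5, 2.0, size=(d, d)) * mask, k=1)

# Solve X = A^T cos(X + 1) + Z in topological order (node i's parents have index < i).
Z = rng.normal(size=(n, d))
X = np.zeros((n, d))
for i in range(d):
    X[:, i] = np.cos(X + 1.0) @ A[:, i] + Z[:, i]  # only filled-in parents contribute
```

Triangularity guarantees that column `A[:, i]` touches only already-computed nodes, so a single forward sweep produces exact samples from the SEM.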

Metrics include structural Hamming distance (SHD) and true positive rate (TPR). GAE-based causal discovery attains lower SHD and higher TPR than NOTEARS (linear) and DAG-GNN (variational autoencoder), especially for larger graphs ($d = 100$) and nonlinear dependencies. Training times remain competitive (under 2 minutes for $d = 100$ on an NVIDIA V100 GPU), while baseline methods are significantly slower (Ng et al., 2019).

| Model   | SHD (nonlinear, $d=100$) | TPR (nonlinear, $d=100$) | Training time |
|---------|--------------------------|--------------------------|---------------|
| GAE     | Lowest                   | Highest                  | < 2 min       |
| NOTEARS | Higher                   | Lower                    | N/A           |
| DAG-GNN | Higher                   | Lower                    | ~77 min       |
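The metrics in the table can be computed from binary adjacency matrices. A common convention, assumed here rather than spelled out in the source, counts a reversed edge once in SHD and scores TPR over correctly directed edges:

```python
import numpy as np

def shd_and_tpr(A_true, A_est):
    """SHD (extra + missing + reversed edges, reversals counted once) and TPR."""
    T = (A_true != 0).astype(int)  # ground-truth directed edges
    E = (A_est != 0).astype(int)   # estimated directed edges
    extra    = int(np.sum((E == 1) & (T == 0) & (T.T == 0)))  # edge absent either way
    missing  = int(np.sum((T == 1) & (E == 0) & (E.T == 0)))  # true edge not predicted
    reversed_ = int(np.sum((E == 1) & (T == 0) & (T.T == 1))) # right edge, wrong direction
    shd = extra + missing + reversed_
    tpr = float(np.sum((E == 1) & (T == 1))) / max(int(np.sum(T)), 1)
    return shd, tpr

T = np.array([[0, 1], [0, 0]])  # true graph: 1 -> 2
E = np.array([[0, 0], [1, 0]])  # estimate: edge reversed
print(shd_and_tpr(T, E))        # (1, 0.0): one reversal, no correct edges
```

Lower SHD and higher TPR jointly indicate closer recovery of the ground-truth DAG.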

7. Limitations, Interpretability, and Open Challenges

GAE-based causal discovery reliably integrates acyclicity enforcement into neural pipelines while maintaining explicit causal interpretability of the learned adjacency $A$. The shared encoder/decoder architecture is both parameter-efficient and broadly applicable. However, experiments to date have largely been restricted to synthetic data ($d \leq 100$), leaving scalability to much larger graphs and real-world datasets an open direction. Sensitivity to hyperparameters and the appropriate choice of latent dimension $l'$ also remain underexplored.

No formal guarantee ensures that the continuous relaxation, even when $h(A) = 0$ is satisfied, exactly recovers the true DAG at finite sample sizes. Caution is therefore warranted when applying these techniques in settings with complex confounding or where theoretical identifiability is critical (Ng et al., 2019).

References

Ng, I., Zhu, S., Chen, Z., & Fang, Z. (2019). A Graph Autoencoder Approach to Causal Structure Learning. arXiv:1911.07420.
