
Graph Autoencoder Causal Discovery

Updated 2 February 2026
  • The paper demonstrates that GAE integrates acyclicity constraints with neural networks to accurately infer DAG structures, achieving lower SHD and higher TPR than baselines such as NOTEARS and DAG-GNN.
  • GAE-based causal discovery employs an encoder-decoder architecture with MLPs to capture nonlinear dependencies and generate latent embeddings from vector-valued data.
  • Empirical results show competitive training times and robust performance on synthetic benchmarks, though scalability to larger real-world graphs remains an open challenge.

Graph autoencoder based causal discovery refers to a class of neural methodologies that integrate graph autoencoder (GAE) architectures with explicit constraints and objectives for learning causal structures, typically in the form of directed acyclic graphs (DAGs), from multivariate data. These methods exploit the representational capacity of deep architectures, coupled with acyclicity constraints inspired by combinatorial causal discovery, to efficiently and flexibly infer the underlying causal graph, particularly in regimes characterized by nonlinear or vector-valued structural equation models (SEMs) and higher dimensionality (Ng et al., 2019).

1. Problem Formulation and Background

The primary goal in causal structure learning is to estimate the graphical structure $\mathcal{G}$, represented as a DAG, governing a set of observed variables $X_1, \dots, X_d$, potentially vector-valued, given i.i.d. samples. The canonical model class is the additive-noise model (ANM):

$$X_i = f_i\big(X_{\mathrm{pa}(i)}\big) + Z_i, \quad i = 1, \dots, d$$

where $X_{\mathrm{pa}(i)}$ denotes the parents of node $i$, the $f_i$ are generally unknown deterministic functions, and the noise terms $Z_i$ are jointly independent. The structure-learning task is to recover the adjacency matrix of the ground-truth DAG from data.
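As a concrete illustration, sampling from a small ANM is straightforward. The three-node chain, the functions $f_i$, and the sample size below are all hypothetical choices for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-node chain X1 -> X2 -> X3; f2, f3 are illustrative nonlinearities.
n = 1000
z = rng.normal(size=(n, 3))          # jointly independent noise terms Z_i
x = np.zeros((n, 3))
x[:, 0] = z[:, 0]                    # X1 = Z1 (no parents)
x[:, 1] = np.sin(x[:, 0]) + z[:, 1]  # X2 = f2(X1) + Z2
x[:, 2] = x[:, 1] ** 2 + z[:, 2]     # X3 = f3(X2) + Z3
```

The structure-learning task is then: given only the sample matrix `x`, recover the adjacency of the generating DAG.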

Traditional algorithms, including constraint- and score-based methods, must search the discrete, combinatorial space of DAGs, which is NP-hard. Recent advances such as NOTEARS reparameterize the DAG as a continuous adjacency matrix $A \in \mathbb{R}^{d \times d}$ and impose a smooth acyclicity constraint:

$$h(A) = \mathrm{Tr}\big[e^{A \odot A}\big] - d = 0$$

This approach transforms the structure search into a differentiable optimization problem.
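The constraint can be evaluated directly with a matrix exponential. A minimal sketch using NumPy/SciPy, with two illustrative test matrices:

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(A: np.ndarray) -> float:
    """h(A) = tr(exp(A ∘ A)) - d; zero iff A encodes a DAG."""
    d = A.shape[0]
    return float(np.trace(expm(A * A)) - d)

A_dag = np.array([[0.0, 1.0], [0.0, 0.0]])  # single edge 1 -> 2, acyclic
A_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])  # 2-cycle

print(acyclicity(A_dag))  # ≈ 0 (acyclic)
print(acyclicity(A_cyc))  # ≈ 1.086 (cycle detected)
```

The Hadamard square $A \odot A$ makes the penalty insensitive to edge signs, and the trace of the exponential counts weighted closed walks of every length, so it vanishes exactly when no cycle exists.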

2. Graph Autoencoder Architecture for Causal Discovery

The GAE-based approach generalizes traditional structural equation models by introducing deep neural parameterizations:

  • Encoder $g_1: \mathbb{R}^l \rightarrow \mathbb{R}^{l'}$ (e.g., a multilayer perceptron)
  • Decoder $g_2: \mathbb{R}^{l'} \rightarrow \mathbb{R}^l$

Given a sample $X^{(j)} \in \mathbb{R}^{d \times l}$:

  1. The encoder computes latent embeddings $H^{(j)} = g_1(X^{(j)})$.
  2. The adjacency matrix mixes the embeddings: $H^{(j)\prime} = A^T H^{(j)}$.
  3. The decoder reconstructs the sample: $\widehat{X}^{(j)} = g_2(H^{(j)\prime})$.

This pipeline enforces a causal inductive bias: $A$ dictates information flow, while the nonlinearities $g_1$ and $g_2$ enable the model to represent arbitrary deterministic SEMs. The encoder $f_E(X) = g_1(X)$ and decoder $f_D(H, A) = g_2(A^T H)$ share parameters across nodes.
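The encode–mix–decode pipeline above can be sketched in a few lines. The single-hidden-layer MLPs, their widths, and the random weights here are illustrative stand-ins for trained parameters (the paper's experiments use deeper MLPs):

```python
import numpy as np

rng = np.random.default_rng(1)
d, l, lp = 5, 1, 1  # d variables, observed dim l, latent dim l' (illustrative)

# Shared MLP weights for encoder g1 and decoder g2, applied node-wise.
W1e, W2e = rng.normal(size=(l, 16)), rng.normal(size=(16, lp))
W1d, W2d = rng.normal(size=(lp, 16)), rng.normal(size=(16, l))
relu = lambda t: np.maximum(t, 0.0)

def gae_forward(X, A):
    """X: (d, l) sample; A: (d, d) weighted adjacency. Returns (d, l) reconstruction."""
    H = relu(X @ W1e) @ W2e           # 1. encode: H = g1(X), shared across nodes
    H_mixed = A.T @ H                 # 2. mix embeddings along causal edges
    return relu(H_mixed @ W1d) @ W2d  # 3. decode: X_hat = g2(A^T H)

X = rng.normal(size=(d, l))
A = np.triu(rng.normal(size=(d, d)), k=1)  # strictly upper-triangular => a DAG
X_hat = gae_forward(X, A)
```

Because the same $g_1$ and $g_2$ are applied to every node, the only node-specific object is the adjacency $A$, which is what the training procedure optimizes.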

3. Objective Function and Acyclicity Constraint

The model is trained to minimize the reconstruction error, subject to an $\ell_1$ sparsity regularizer and an acyclicity constraint:

$$\min_{A,\Theta_1,\Theta_2}\ \frac{1}{2n} \sum_{j=1}^n \big\|X^{(j)} - \widehat{X}^{(j)}\big\|_F^2 + \lambda \|A\|_1$$

subject to $h(A) = 0$.

An augmented Lagrangian formulation is used:

$$L_\rho(A, \Theta_1, \Theta_2, \alpha) = L_{\mathrm{rec}} + \lambda \|A\|_1 + \alpha\, h(A) + \frac{\rho}{2}\, h(A)^2$$

Optimization employs alternating updates:

  • Gradient-based steps for $(A, \Theta_1, \Theta_2)$ (Adam optimizer, full batch)
  • Lagrange-multiplier update $\alpha \leftarrow \alpha + \rho\, h(A)$ and penalty update $\rho \leftarrow \beta \rho$ as needed

This continuous relaxation ensures differentiability and scalability, while $h(A)$ enforces that the learned adjacency encodes a DAG.
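A minimal sketch of the augmented-Lagrangian loop for the linear special case ($g_1 = g_2 = \mathrm{id}$), where the gradients have closed form: the toy two-variable SEM, step sizes, and iteration counts are illustrative, plain gradient descent stands in for the Adam inner solver, and the $\ell_1$ term is omitted for brevity:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

# Toy linear SEM: edge X1 -> X2 with weight 2 (illustrative ground truth).
n, d = 500, 2
A_true = np.array([[0.0, 2.0], [0.0, 0.0]])
X = rng.normal(size=(n, d)) @ np.linalg.inv(np.eye(d) - A_true)  # X = X A_true + Z

def h(A):  # acyclicity constraint tr(exp(A ∘ A)) - d
    return float(np.trace(expm(A * A)) - A.shape[0])

A, alpha, rho, beta = np.zeros((d, d)), 0.0, 1.0, 10.0
for _ in range(8):                  # outer augmented-Lagrangian rounds
    lr = 0.01 / (1.0 + rho)         # shrink steps as the penalty grows (heuristic)
    for _ in range(300):            # inner gradient steps on L_rho w.r.t. A
        grad_h = 2.0 * A * expm(A * A).T            # d h / d A
        grad = -X.T @ (X - X @ A) / n + (alpha + rho * h(A)) * grad_h
        A -= lr * grad
    alpha += rho * h(A)             # Lagrange-multiplier update
    rho *= beta                     # penalty update
```

Each outer round tightens the penalty, driving $h(A)$ toward zero while the inner loop trades reconstruction error against constraint violation.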

4. Expressiveness: Nonlinear SEMs and Vector-Valued Data

The use of MLPs for $g_1$ and $g_2$ enables the framework to model a broad class of nonlinear dependencies within ANMs. When $g_1$ and $g_2$ are set to the identity, the model reduces to a linear SEM. The approach extends naturally to vector-valued variables ($X_i \in \mathbb{R}^l$) by setting $g_1: \mathbb{R}^l \to \mathbb{R}^{l'}$ and $g_2: \mathbb{R}^{l'} \to \mathbb{R}^l$, allowing efficient low-dimensional embedding and information flow, with a single $A$ controlling inter-variable causality (Ng et al., 2019).

5. Network Architecture and Training Regimen

Experiments have used 3-layer MLPs for both $g_1$ and $g_2$ (hidden-layer width 16, ReLU activation), with latent dimension $l' = 1$ when $l = 1$ and $l' = 3$ when $l = 5$. Hyperparameters for the Lagrangian penalty and the $\ell_1$ sparsity term are tuned empirically. Optimization proceeds until the acyclicity violation $|h(A)|$ is sufficiently small ($< 10^{-8}$). Training is efficient: wall-clock time scales near-linearly with $d$, so large graphs ($d \geq 100$) can be handled on standard hardware.

6. Empirical Performance and Benchmarks

The approach has been benchmarked on synthetic data generated from random Erdős–Rényi DAGs with nonlinear SEMs:

  • $X = A^T \cos(X+1) + Z$
  • $X = 2\sin\big(A^T\cos(X+1) + 0.5\big) + \big(A^T\cos(X+1) + 0.5\big) + Z$, with $Z \sim \mathcal{N}(0, I)$
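Because $A$ is acyclic, such samples can be generated node by node in topological order. The sketch below implements the first SEM above, assuming an upper-triangular $A$; the graph size, edge density, and weight range are illustrative choices, not the paper's exact benchmark settings:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 1000

# Random strictly upper-triangular weights stand in for an Erdős–Rényi DAG.
mask = rng.random((d, d)) < 0.3
A = np.triu(rng.uniform(0.5, 2.0, size=(d, d)) * mask, k=1)

# Solve X = A^T cos(X + 1) + Z in topological order (node i's parents have index < i).
Z = rng.normal(size=(n, d))
X = np.zeros((n, d))
for i in range(d):
    X[:, i] = np.cos(X + 1.0) @ A[:, i] + Z[:, i]  # only filled-in parents contribute
```

Triangularity guarantees that column `A[:, i]` touches only already-computed nodes, so a single forward sweep produces exact samples from the SEM.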

Metrics include structural Hamming distance (SHD) and true positive rate (TPR). GAE-based causal discovery attains lower SHD and higher TPR than NOTEARS (linear) and DAG-GNN (variational autoencoder), especially for larger graphs ($d = 100$) and nonlinear dependencies. Training times remain competitive (under 2 minutes for $d = 100$ on an NVIDIA V100 GPU), while baseline methods are significantly slower (Ng et al., 2019).

| Model   | SHD (nonlinear, $d=100$) | TPR (nonlinear, $d=100$) | Training time |
|---------|--------------------------|--------------------------|---------------|
| GAE     | Lowest                   | Highest                  | < 2 min       |
| NOTEARS | Higher                   | Lower                    | N/A           |
| DAG-GNN | Higher                   | Lower                    | ~77 min       |
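The metrics in the table can be computed from binary adjacency matrices. A common convention, assumed here rather than spelled out in the source, counts a reversed edge once in SHD and scores TPR over correctly directed edges:

```python
import numpy as np

def shd_and_tpr(A_true, A_est):
    """SHD (extra + missing + reversed edges, reversals counted once) and TPR."""
    T = (A_true != 0).astype(int)  # ground-truth directed edges
    E = (A_est != 0).astype(int)   # estimated directed edges
    extra    = int(np.sum((E == 1) & (T == 0) & (T.T == 0)))  # edge absent either way
    missing  = int(np.sum((T == 1) & (E == 0) & (E.T == 0)))  # true edge not predicted
    reversed_ = int(np.sum((E == 1) & (T == 0) & (T.T == 1))) # right edge, wrong direction
    shd = extra + missing + reversed_
    tpr = float(np.sum((E == 1) & (T == 1))) / max(int(np.sum(T)), 1)
    return shd, tpr

T = np.array([[0, 1], [0, 0]])  # true graph: 1 -> 2
E = np.array([[0, 0], [1, 0]])  # estimate: edge reversed
print(shd_and_tpr(T, E))        # (1, 0.0): one reversal, no correct edges
```

Lower SHD and higher TPR jointly indicate closer recovery of the ground-truth DAG.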

7. Limitations, Interpretability, and Open Challenges

GAE-based causal discovery reliably integrates acyclicity enforcement into neural pipelines while maintaining explicit causal interpretability of the learned adjacency $A$. The shared encoder/decoder architecture is both parameter-efficient and broadly applicable. However, experiments to date have largely been restricted to synthetic data ($d \leq 100$), leaving scalability to much larger graphs and real-world datasets an open direction. Sensitivity to hyperparameters and the appropriate choice of latent dimension $l'$ also remain underexplored.

No formal guarantee ensures that the continuous relaxation, even when $h(A) = 0$ is satisfied, exactly recovers the true DAG at finite sample sizes. Caution is therefore warranted when applying these techniques in settings with complex confounding or where theoretical identifiability is critical (Ng et al., 2019).

References

Ng, I., Zhu, S., Chen, Z., & Fang, Z. (2019). A Graph Autoencoder Approach to Causal Structure Learning. arXiv:1911.07420.
