Graph Autoencoder Causal Discovery
- The paper demonstrates that GAE integrates acyclicity constraints with neural networks to accurately infer DAG structures, achieving lower SHD and higher TPR.
- GAE-based causal discovery employs an encoder-decoder architecture with MLPs to capture nonlinear dependencies and generate latent embeddings from vector-valued data.
- Empirical results show competitive training times and robust performance on synthetic benchmarks, though scalability to larger real-world graphs remains an open challenge.
Graph autoencoder based causal discovery refers to a class of neural methodologies that integrate graph autoencoder (GAE) architectures with explicit constraints and objectives for learning causal structures, typically in the form of directed acyclic graphs (DAGs), from multivariate data. These methods exploit the representational capacity of deep architectures, coupled with acyclicity constraints inspired by combinatorial causal discovery, to efficiently and flexibly infer the underlying causal graph, particularly in regimes characterized by nonlinear or vector-valued structural equation models (SEMs) and higher dimensionality (Ng et al., 2019).
1. Problem Formulation and Background
The primary goal in causal structure learning is to estimate the graphical structure $\mathcal{G}$ — represented as a DAG — governing a set of observed variables $X = (X_1, \dots, X_d)$, potentially vector-valued, given $n$ i.i.d. samples. The canonical model class is the additive-noise model (ANM):

$$X_j = f_j\big(X_{\mathrm{pa}(j)}\big) + \epsilon_j, \qquad j = 1, \dots, d,$$

where $\mathrm{pa}(j)$ denotes the parents of node $j$, the $f_j$ are generally unknown deterministic functions, and the noise terms $\epsilon_j$ are jointly independent. The structure learning task is to recover the adjacency matrix $A$ of the ground-truth DAG from data.
Traditional algorithms, including constraint- and score-based methods, must search the discrete, combinatorial space of DAGs, which is NP-hard. Recent advances, such as NOTEARS, reparameterize the DAG as a continuous weighted adjacency matrix $A \in \mathbb{R}^{d \times d}$, imposing a smooth acyclicity constraint:

$$h(A) = \operatorname{tr}\!\left(e^{A \circ A}\right) - d = 0,$$

where $\circ$ denotes the Hadamard (elementwise) product.
This approach transforms the structure search into a differentiable optimization problem.
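As a concrete illustration, the acyclicity function $h(A)$ can be sketched in a few lines of NumPy; here the matrix exponential is approximated by a truncated Taylor series (the truncation depth is an assumption for illustration, not part of the original formulation):

```python
import numpy as np

def acyclicity(A: np.ndarray, terms: int = 30) -> float:
    """h(A) = tr(exp(A o A)) - d; equals zero iff A encodes a DAG."""
    d = A.shape[0]
    M = A * A                      # Hadamard product; entries are non-negative
    E = np.eye(d)                  # running sum of the Taylor series of exp(M)
    term = np.eye(d)
    for k in range(1, terms):
        term = term @ M / k        # next Taylor term M^k / k!
        E = E + term
    return float(np.trace(E) - d)
```

For a DAG such as the two-node chain `A = [[0, 1], [0, 0]]`, `acyclicity(A)` is zero, while any cycle yields a strictly positive value, which the optimizer is driven to eliminate.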
2. Graph Autoencoder Architecture for Causal Discovery
The GAE-based approach generalizes traditional structural equation models by introducing deep neural parameterizations:

- Encoder $f_1$ (e.g., multi-layer perceptron)
- Decoder $f_2$ (e.g., multi-layer perceptron)

Given a sample $X \in \mathbb{R}^d$:

- The encoder computes latent embeddings $H = f_1(X)$.
- Latent embeddings are mixed by the adjacency matrix: $\tilde{H} = A^\top H$.
- The decoder reconstructs: $\hat{X} = f_2\big(A^\top f_1(X)\big)$.

This pipeline enforces a causal inductive bias, where $A$ dictates information flow, and the nonlinearities $f_1$, $f_2$ enable the model to represent arbitrary deterministic SEMs. The encoder and decoder are both parameter-shared across nodes.
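The steps above can be sketched in NumPy as follows; the shapes, weight names, and the single ReLU hidden layer standing in for the MLPs are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hidden = 100, 5, 16   # samples, variables, hidden width (assumed sizes)

# Per-node MLP weights (scalar in -> hidden -> scalar out), shared across nodes.
W1_enc, W2_enc = rng.normal(size=(1, hidden)), rng.normal(size=(hidden, 1))
W1_dec, W2_dec = rng.normal(size=(1, hidden)), rng.normal(size=(hidden, 1))
A = rng.normal(size=(d, d)) * (1 - np.eye(d))   # weighted adjacency, no self-loops

def node_mlp(x, W1, W2):
    # Applied elementwise per node; the same parameters serve every node.
    return (np.maximum(x[..., None] @ W1, 0.0) @ W2)[..., 0]

X = rng.normal(size=(n, d))
H = node_mlp(X, W1_enc, W2_enc)           # latent embeddings, shape (n, d)
X_hat = node_mlp(H @ A, W1_dec, W2_dec)   # A^T-mixing per sample, then decode
```

Note that `H @ A` applies $A^\top$ to each sample's embedding vector, so column $j$ of the mixed representation aggregates only the embeddings of $j$'s candidate parents.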
3. Objective Function and Acyclicity Constraint
The model is trained to minimize the reconstruction error, subject to a sparsity regularizer and the acyclicity constraint:

$$\min_{A,\,\theta} \;\; \frac{1}{2n} \sum_{i=1}^{n} \big\| X^{(i)} - \hat{X}^{(i)} \big\|_2^2 + \lambda \|A\|_1 \quad \text{subject to} \quad h(A) = 0.$$
An augmented Lagrangian formulation is used:

$$L_\rho(A, \theta; \alpha) = \frac{1}{2n} \sum_{i=1}^{n} \big\| X^{(i)} - \hat{X}^{(i)} \big\|_2^2 + \lambda \|A\|_1 + \alpha\, h(A) + \frac{\rho}{2}\, h(A)^2,$$

where $\alpha$ is the Lagrange multiplier and $\rho > 0$ the penalty coefficient.
Optimization employs alternating updates:
- Gradient-based steps for $(A, \theta)$ (Adam optimizer, full-batch)
- Lagrange multiplier $\alpha$ and penalty coefficient $\rho$ updates as needed

This continuous relaxation ensures differentiability and scalability, while $h(A) = 0$ enforces that the learned adjacency encodes a DAG.
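The alternating scheme amounts to an outer loop over dual and penalty updates around an inner minimization. A generic augmented-Lagrangian skeleton is sketched below; `inner_minimize` and the progress threshold `gamma` / growth factor `eta` are assumed names and values, not the paper's exact schedule:

```python
def augmented_lagrangian(h, inner_minimize, params,
                         alpha=0.0, rho=1.0, rho_max=1e16,
                         gamma=0.25, eta=10.0, tol=1e-8):
    """Outer loop: minimize L_rho over (A, theta), then update alpha and rho."""
    h_prev = float("inf")
    while rho < rho_max:
        params = inner_minimize(params, alpha, rho)  # e.g., full-batch Adam steps
        h_val = h(params)
        if h_val <= tol:                  # acyclicity satisfied: stop
            break
        if h_val > gamma * h_prev:        # insufficient progress: raise penalty
            rho *= eta
        alpha += rho * h_val              # dual ascent on the multiplier
        h_prev = h_val
    return params
```

Each outer iteration makes violating acyclicity more expensive, so the inner minimizer is progressively pushed toward solutions with $h(A) \approx 0$.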
4. Expressiveness: Nonlinear SEMs and Vector-Valued Data
The use of MLPs $f_1$ and $f_2$ enables the learning framework to model a broad class of nonlinear dependencies within ANMs. When $f_1$ and $f_2$ are set to the identity, the model reduces to a linear SEM. The approach naturally extends to vector-valued variables ($X_j \in \mathbb{R}^m$ with $m > 1$) by appropriately setting the input and output dimensions of $f_1$ and $f_2$, allowing for efficient low-dimensional embedding and information flow, with a single adjacency matrix $A$ controlling inter-variable causality (Ng et al., 2019).
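A shape-level sketch of the vector-valued case follows; the sizes and the linear stand-ins for $f_1$, $f_2$ are illustrative assumptions (the paper uses MLPs):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, k = 50, 4, 3, 2   # samples, variables, per-variable dim, latent dim (assumed)
X = rng.normal(size=(n, d, m))

# Linear stand-ins for f1: R^m -> R^k and f2: R^k -> R^m, shared across variables.
W_enc = rng.normal(size=(m, k))
W_dec = rng.normal(size=(k, m))
A = rng.normal(size=(d, d)) * (1 - np.eye(d))   # one adjacency governs all dims

H = X @ W_enc                            # (n, d, k) latent embeddings
mixed = np.einsum("ji,njk->nik", A, H)   # apply A^T: mixes across variables only
X_hat = mixed @ W_dec                    # (n, d, m) reconstruction
```

The key point is that $A$ acts only along the variable axis, so the same $d \times d$ causal structure is shared by every coordinate of the vector-valued variables.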
5. Network Architecture and Training Regimen
Experiments have used 3-layer MLPs for both $f_1$ and $f_2$ (hidden layer width 16, ReLU activation), with the latent dimension set according to whether variables are scalar- or vector-valued. Hyperparameters for the Lagrangian penalty and sparsity are tuned empirically, and optimization proceeds until the acyclicity value $h(A)$ is sufficiently small. The model supports efficient training, with wall-clock training time scaling near-linearly in the number of variables $d$, allowing large graphs to be handled on standard hardware.
6. Empirical Performance and Benchmarks
The approach has been benchmarked on synthetic data generated from random Erdős–Rényi DAGs paired with nonlinear SEMs.
Metrics include Structural Hamming Distance (SHD) and True Positive Rate (TPR). GAE-based causal discovery attains lower SHD and higher TPR than NOTEARS (linear) and DAG-GNN (variational autoencoder), especially for larger graphs and nonlinear dependencies. Training times remain competitive (under 2 minutes on an NVIDIA V100 GPU for the largest graphs tested), while baseline methods are significantly slower (Ng et al., 2019).
| Model | SHD (nonlinear SEM) | TPR (nonlinear SEM) | Training Time |
|---|---|---|---|
| GAE | Lowest | Highest | <2 min |
| NOTEARS | Higher | Lower | N/A |
| DAG-GNN | Higher | Lower | ~77 min |
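The two metrics above can be computed from binary adjacency matrices as follows; this uses a common convention in which a reversed edge contributes one to SHD, and the function names are illustrative:

```python
import numpy as np

def shd(B_true: np.ndarray, B_est: np.ndarray) -> int:
    """Structural Hamming Distance: missing/extra edges plus reversals."""
    und_true = B_true + B_true.T             # undirected skeletons
    und_est = B_est + B_est.T
    extra_missing = np.sum(np.triu(und_true != und_est, k=1))
    reversed_edges = np.sum((B_true == 1) & (B_est == 0) & (B_est.T == 1))
    return int(extra_missing + reversed_edges)

def tpr(B_true: np.ndarray, B_est: np.ndarray) -> float:
    """True Positive Rate: correctly oriented edges over true edges."""
    return float(np.sum((B_true == 1) & (B_est == 1)) / np.sum(B_true == 1))
```

For example, an estimate that recovers one of two true edges and reverses the other has SHD 1 and TPR 0.5 under this convention.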
7. Limitations, Interpretability, and Open Challenges
GAE-based causal discovery reliably integrates acyclicity enforcement into neural pipelines and maintains explicit causal interpretability of the learned adjacency $A$. A shared encoder/decoder architecture is both parameter-efficient and broadly applicable. However, experiments to date have been largely restricted to synthetic data, leaving scalability to much larger graphs and real-world datasets an open direction. Sensitivity to hyperparameters and appropriate selection of the latent dimension remain underexplored.
No formal guarantee is provided that the continuous relaxation, even when $h(A) = 0$ is satisfied, exactly recovers the true DAG at finite sample sizes. This suggests caution is warranted when applying these techniques to situations with complex confounding or where theoretical identifiability is critical (Ng et al., 2019).