Score Matching for Causal Discovery
- The paper introduces a score matching approach that transforms causal discovery into recovering DAG structures via the data's score function and its derivatives.
- It details mathematical foundations such as the use of gradients and Hessians to identify leaf nodes and parental relationships in both continuous and discrete settings.
- The framework extends to various data modalities—including temporal, networked, and latent confounded systems—using scalable algorithms with robust theoretical guarantees.
The score matching framework for causal discovery is a methodology that exploits the score function—defined as the gradient of the log-density of a multivariate observational distribution—to infer causal directed acyclic graph (DAG) structures from data. Originating in continuous, additive noise models, the approach has been significantly broadened in recent literature to encompass arbitrary noise distributions, temporal and networked data, discrete settings, and even partially observed or confounded systems. This encyclopedic entry surveys the theoretical foundations, identifiability results, main algorithms, sample complexity, latent variable extensions, and recent developments in score matching–based approaches to causal structure learning.
1. Mathematical Foundations: Score Function and Objectives
Let $X=(X_1,\dots,X_d)\in\mathbb{R}^d$ (or $X\in\mathcal{X}^d$ for a discrete domain $\mathcal{X}$) have joint density (or mass) $p(x)$. The score function is defined as $s(x)=\nabla_x\log p(x)$. For discrete data, generalized score matching replaces the gradient $\nabla_x$ with a suitable linear operator, often the marginalization operator, leading to the "discrete score" whose components are the reciprocal singleton conditional probabilities $s_j(x)=1/p(x_j\mid x_{-j})$ (Vo et al., 22 Jan 2026).
Causal discovery is recast as a problem of identifying structural properties (topological order, parental sets) from the observed score landscape. The score matching objective minimizes the Fisher divergence between the data density $p$ and an unnormalized model $q_\theta$:
$$J(\theta)=\tfrac{1}{2}\,\mathbb{E}_{x\sim p}\big[\lVert\nabla_x\log p(x)-\nabla_x\log q_\theta(x)\rVert_2^2\big],$$
where $q_\theta$ need only be specified up to its normalizing constant. For discrete distributions, an analogous generalized loss over the discrete scores is used.
Discretization enables extension to categorical or ordinal data domains via the same formalism (Vo et al., 22 Jan 2026). For temporal and networked data, the objective is applied to stacked variable representations capturing network-lagged or time-lagged dependencies (Chen et al., 2024).
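As a concrete toy illustration (a sketch not drawn from the cited papers), the Fisher-divergence objective admits a closed-form minimizer for a one-dimensional unnormalized Gaussian model $q_\theta(x)\propto\exp(-\theta x^2/2)$ after applying Hyvärinen's integration-by-parts identity $J(\theta)=\mathbb{E}[s_\theta'(x)+\tfrac12 s_\theta(x)^2]+\text{const}$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=50_000)  # observational data

# Model score: s_theta(x) = -theta * x, so s_theta'(x) = -theta.
# Hyvarinen objective (Fisher divergence up to an additive constant):
#   J(theta) = E[ -theta + 0.5 * theta**2 * x**2 ]
# Setting dJ/dtheta = 0 gives the closed-form minimizer:
theta_hat = 1.0 / np.mean(x ** 2)  # estimates the precision 1 / sigma**2
```

With `sigma = 2.0`, the estimate concentrates near the true precision $1/\sigma^2 = 0.25$; the same objective, with the gradient replaced by the marginalization operator, underlies the discrete losses above.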
2. Identifiability, Leaf Criteria, and Causal Structure Recovery
The identifiability of the causal DAG, or its Markov equivalence class, underpins the efficacy of score matching for causal discovery. Key results are established under additive noise models (ANMs)
$$X_i=f_i(\mathrm{pa}(X_i))+\varepsilon_i,$$
with mutually independent noises $\varepsilon_i$ of arbitrary distribution and each $f_i$ non-constant and continuously differentiable (no further nonlinearity assumption is required beyond what nontrivial identifiability already demands) (Montagna et al., 2023, Montagna et al., 2024).
Leaf (or sink) nodes can be identified via properties of the score and its derivatives:
- In continuous ANMs, for a leaf node $j$, the diagonal Hessian entry $\partial^2_{x_j}\log p(x)$ is constant in $x$; thus $\mathrm{Var}_X\big[\partial^2_{x_j}\log p(X)\big]=0$ (Rolland et al., 2022, Montagna et al., 2023).
- For non-leaves, this variance is strictly positive due to dependence on children.
- In discrete models, a Schur-concave "randomness measure" $\mathcal{R}$ (e.g., entropy, negative variance) applied to the singleton conditional reciprocals $s_j(X)=1/p(X_j\mid X_{-j})$ achieves separation: the leaf is identified as the node $j$ maximizing $\mathcal{R}(s_j(X))$ (Vo et al., 22 Jan 2026).
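For intuition, the discrete score entering the criterion above can be computed directly from a joint probability table (a minimal sketch; the joint table is an invented toy example, not data from the cited work):

```python
import numpy as np

# Illustrative joint pmf p(x1, x2) over two binary variables.
p = np.array([[0.36, 0.04],   # p(x1=0, x2=0), p(x1=0, x2=1)
              [0.12, 0.48]])  # p(x1=1, x2=0), p(x1=1, x2=1)
assert np.isclose(p.sum(), 1.0)

def discrete_score(p, j, x):
    """Reciprocal singleton conditional 1 / p(x_j | x_{-j}):
    marginalize coordinate j out of the joint, then divide."""
    marg = p.sum(axis=j, keepdims=True)  # sum over values of x_j
    cond = p / marg                      # p(x_j | x_{-j})
    return 1.0 / cond[x]

# Score components at the point x = (0, 0):
s1 = discrete_score(p, 0, (0, 0))  # 1 / p(x1=0 | x2=0) = 1 / 0.75
s2 = discrete_score(p, 1, (0, 0))  # 1 / p(x2=0 | x1=0) = 1 / 0.90
```

A randomness measure such as entropy is then evaluated on the distribution of each component $s_j(X)$ under the joint, and the maximizer is peeled as the leaf.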
Recovery proceeds iteratively: at each step, the node with minimal (continuous case) or maximal (discrete case) variance/randomness is peeled as a leaf. Parent identification is achieved by variance drops in the scores or by inspecting off-diagonal Hessian entries (Chen et al., 2024).
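The peeling criterion can be checked end-to-end on a two-variable nonlinear ANM whose score is available in closed form (a minimal sketch using the analytic score Jacobian in place of a kernel or neural estimator):

```python
import numpy as np

rng = np.random.default_rng(1)
n, noise = 10_000, 0.5
x1 = rng.normal(size=n)
x2 = x1 ** 2 + noise * rng.normal(size=n)  # X1 -> X2, so X2 is the leaf

# Diagonal of the score Jacobian (Hessian of log p) for this model:
#   d s1 / d x1 = -1 + 2*(x2 - x1**2)/noise**2 - 4*x1**2/noise**2  (varies with x)
#   d s2 / d x2 = -1 / noise**2                                     (constant)
h11 = -1 + 2 * (x2 - x1 ** 2) / noise ** 2 - 4 * x1 ** 2 / noise ** 2
h22 = np.full(n, -1 / noise ** 2)

variances = [h11.var(), h22.var()]
leaf = int(np.argmin(variances))  # index 1: X2 is correctly peeled as the leaf
```

The leaf's diagonal entry has exactly zero variance, while the non-leaf's is strictly positive; iterating this step on the remaining variables yields a topological order.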
Identifiability results extend to models with arbitrary (non-Gaussian) noise and to settings where only partial observability or latent confounding is present, with careful conditions and additional testing for direct versus confounded edges (Montagna et al., 2024, Bellot et al., 2021).
3. Algorithms and Scalable Estimation Strategies
Multiple algorithmic instantiations implement the score matching paradigm:
| Algorithm | Data Type | Causal Content | Core Principle |
|---|---|---|---|
| SCORE (Rolland et al., 2022) | Continuous, ANM | Topological order + full DAG | Variance of diagonal Hessians |
| NoGAM (Montagna et al., 2023) | Continuous, arbitrary noise | Topological order + full DAG | Residual prediction of score components |
| DAS (Montagna et al., 2023) | Continuous, large-scale | Complete DAG via Hessian structure | Parent tests via second derivatives |
| PICK (Chen et al., 2024) | i.i.d./temporal/network | Order + parents via efficiency | Variance-drop parent identification |
| SciNO (Kang et al., 18 Aug 2025) | Continuous, high-dim. | Hessian-stable, scalable ordering | Neural operator for score estimation |
| AdaScore (Montagna et al., 2024) | General (with latents) | PAG, direct/indirect effects | Sink/Hessian/MSE and cross-derivative tests |
| Generalized (discrete) (Vo et al., 22 Jan 2026) | Discrete/categorical | Topological order | Randomness-based discrete score |
Efficient estimation of the score and its Hessian is central:
- Kernel Stein estimators compute the score and Hessian through RKHS function fitting with closed-form matrix expressions (Rolland et al., 2022).
- Neural score methods and denoising score matching rely on deep networks regressing toward known optimal forms in the presence of noise (Zhu et al., 2023, Kang et al., 18 Aug 2025).
- Neural operators (e.g., SciNO) train Fourier or functional neural operators in Sobolev spaces to simultaneously control the function and all necessary derivatives, supporting stable high-dimensional inference (Kang et al., 18 Aug 2025).
- Parent identification acceleration (PICK) leverages score and Hessian variance comparisons to achieve substantially better scaling, drastically reducing the expensive regression steps required in classic pruning (Chen et al., 2024).
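A kernel-based score estimator of the kind listed above can be sketched with the Stein gradient estimator over an RBF kernel (a minimal one-dimensional illustration; the bandwidth and regularization values are arbitrary choices, not those of the cited papers):

```python
import numpy as np

def stein_score_estimate(x, bandwidth=1.0, reg=0.1):
    """Estimate s(x_i) = d/dx log p(x_i) from samples x of shape (n,)
    via the Stein gradient estimator: G = -(K + reg*I)^{-1} @ b,
    where K is the RBF Gram matrix and b_i = sum_j dK(x_i, x_j)/dx_j."""
    diff = x[:, None] - x[None, :]                  # x_i - x_j
    K = np.exp(-diff ** 2 / (2 * bandwidth ** 2))   # RBF Gram matrix
    b = (diff / bandwidth ** 2 * K).sum(axis=1)     # sum_j d k(x_i, x_j) / d x_j
    return -np.linalg.solve(K + reg * np.eye(len(x)), b)

rng = np.random.default_rng(2)
x = rng.normal(size=400)          # N(0, 1) samples: the true score is s(x) = -x
s_hat = stein_score_estimate(x)   # should track -x up to regularization shrinkage
```

The closed-form matrix solve is what makes kernel estimators attractive at moderate sample sizes, and what makes them costly as the sample size grows.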
Edge pruning and skeleton completion utilize score or Hessian criteria, followed by additive model significance testing or targeted parent search. In the presence of confounding, functional trimming or spectral adjustment disentangles score components contributing to causal versus spurious associations (Bellot et al., 2021).
4. Theoretical Guarantees and Sample Complexity
Mathematical guarantees substantiate the framework's recovery abilities:
- Identifiability: Under general structural conditions (smooth additive mechanisms, independent noise), a sequence of leaf and parent tests on the score and its derivatives yields the unique DAG (Rolland et al., 2022, Montagna et al., 2023, Vo et al., 22 Jan 2026).
- Consistency: Both kernel- and neural-based score estimators converge to the true score under mild regularity conditions (universal kernels, proper regularization) as the sample size grows (Montagna et al., 2023).
- Sample complexity: For neural score estimation, achieving uniform $\epsilon$-accuracy in score approximation and ensuing exact topological order recovery typically requires a sample size polynomial in $1/\epsilon$, up to log factors for covering numbers of the network class (Zhu et al., 2023). The error rates in order recovery are controlled by curvature margins (variance gaps in Hessians), model smoothness, and estimator approximation rates.
- Discrete case: Identifiability and order recovery from randomness measures are established under minimal non-degeneracy and monotonicity of the chosen functional (entropy or negative variance) (Vo et al., 22 Jan 2026).
For high-dimensional or confounded data, theoretical guarantees rely on signal strength assumptions (beta-min conditions, spectral separation), compliance with the independent mechanisms principle, and, for deconfounding, sufficient singular value separation in the confounder matrix (Bellot et al., 2021).
5. Extensions: Latent Variables, Discrete Data, and Temporal Models
Recent advances systematically extend score matching causal discovery to complex data regimes:
- Latent variable and confounded systems:
- Deconfounded Score approaches remove spectral signatures of latent confounders prior to standard score-based neighborhood regression (Bellot et al., 2021).
- AdaScore unifies linear, nonlinear, and hidden variable settings, iteratively applying Hessian and sink tests to recover the correct equivalence class (PAG) and direct/unconfounded effect directionality (Montagna et al., 2024).
- The Hessian of the observed log-density links directly to m-separation in the corresponding maximal ancestral graph, enabling recovery of graph skeletons under hidden confounding (Montagna et al., 2024).
- Discrete/categorical data:
- Generalized score matching employs marginalization operators, using reciprocal singleton conditional probabilities to define a score function. Identifiability of the causal order is achieved via concave "randomness" measures, with practical algorithms leveraging continuous-time diffusion surrogates for estimation (Vo et al., 22 Jan 2026).
- Temporal and networked observations:
- PICK adapts score-matching for time-series or networked data via stacked graph representations, allowing for joint intra- and inter-snapshot structure recovery while maintaining scalability through efficiency-lifted parent search (Chen et al., 2024).
- Hybrid and mixed strategies with latent variables:
- Algorithmic frameworks combining decomposable (score-guided) CPDAG search (e.g., BOSS, GRaSP) with targeted CI testing (FCIT) provide permutation- and path-guided search schemes for structure learning, outperforming standard FCI variants in precision and computational efficiency (Ramsey et al., 5 Oct 2025).
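The stacked representation described for temporal data above can be sketched generically (a toy construction of lagged blocks, not PICK's actual implementation; variable and function names are illustrative):

```python
import numpy as np

def stack_lags(X, lags=1):
    """Build the stacked representation [X_t, X_{t-1}, ..., X_{t-lags}]
    from a (T, d) multivariate time series, so that a static score-based
    method applied to the stacked matrix can recover both instantaneous
    (intra-snapshot) and lagged (inter-snapshot) parents."""
    T, d = X.shape
    blocks = [X[lags - k:T - k] for k in range(lags + 1)]  # k = 0 .. lags
    return np.hstack(blocks)  # shape (T - lags, d * (lags + 1))

X = np.arange(12, dtype=float).reshape(6, 2)  # T = 6, d = 2 toy series
Z = stack_lags(X, lags=1)                     # shape (5, 4)
```

Each row of `Z` pairs a snapshot with its predecessor, so edges found among the stacked columns decompose into instantaneous and one-step-lagged dependencies.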
6. Empirical Results, Scalability, and Limitations
Extensive empirical validation demonstrates:
- Accuracy: Score-based approaches (SCORE, NoGAM, DAS, SciNO, PICK) consistently achieve lower or competitive SHD, SID, and topological-order divergence on standard synthetic, real-world, and high-dimensional datasets relative to classical constraint-based and greedy score-based DAG-learning baselines (Rolland et al., 2022, Montagna et al., 2023, Montagna et al., 2023, Chen et al., 2024, Kang et al., 18 Aug 2025).
- Scalability: DAS, SciNO, and PICK scale to hundreds of variables; kernel-based methods are computationally intensive for very large numbers of samples or variables, but neural operator implementations enable orders-of-magnitude acceleration (Montagna et al., 2023, Kang et al., 18 Aug 2025).
- Latent/confounded cases: Deconfounded scoring methods produce more reliable adjacency skeletons and robust causal edge recovery in dense confounding, as evidenced in semi-synthetic and biological network benchmarks (Bellot et al., 2021, Montagna et al., 2024).
Limitations remain in the cost of score approximation in high dimensions, the need for restrictive noise or mechanism assumptions (e.g., non-Gaussianity) for identifiability in certain cases, and the necessity of careful hyperparameter tuning in the presence of confounding or in purely discrete settings.
7. Future Directions and Open Challenges
Active research directions include:
- Unified discrete-continuous models: Extending generalized score matching to mixed-type and irregular data domains using combined operators.
- Amortized and online inference: Reducing computational burden by designing masked or amortized score estimators that obviate repeated model retraining per variable (Vo et al., 22 Jan 2026).
- Robustness and finite-sample theory: Sharpening finite-sample consistency and error rates for order and edge recovery, particularly for discrete diffusion estimators and in the presence of confounders.
- Integration with neural generative models: Pairing score-informed causal orderings with autoregressive or foundation models to enable semantically informed structure learning and reasoning (Kang et al., 18 Aug 2025).
- Large-scale benchmarking and validation: Establishing standardized, challenging benchmarks to probe algorithmic behavior in diverse, high-dimensional, and confounded real-world data (Montagna et al., 2023).
The score matching framework for causal discovery constitutes a rigorous, scalable, and extensible set of techniques, substantively expanding the causal structure discovery toolkit across data modalities and experimental regimes (Rolland et al., 2022, Montagna et al., 2023, Montagna et al., 2023, Kang et al., 18 Aug 2025, Bellot et al., 2021, Montagna et al., 2024, Chen et al., 2024, Zhu et al., 2023, Ramsey et al., 5 Oct 2025, Vo et al., 22 Jan 2026).