Ground Truth Causal Effects

Updated 24 January 2026

Ground truth causal effects are precisely defined causal estimands constructed via experimental design or synthetic data to objectively validate inference methods.
State-of-the-art methodologies leverage generative models, RCT rejection sampling, and local graph discovery to accurately recover and certify causal effects.
Benchmarking with metrics like bias, RMSE, and coverage ensures the reliability of causal estimators under various structural and statistical assumptions.

A ground truth causal effect is a causal estimand whose value is known exactly, either by design or by construction of the data generating process, and which therefore provides an objective benchmark for validating causal inference methods. The ground truth may pertain to average, conditional, direct, indirect, or isolated causal effects, depending on the setting. Unit-level causal effects are never observed directly in real-world data, because only one of a unit's potential outcomes is ever realized. This section reviews formal definitions, synthetic and semi-synthetic settings, benchmarking methodologies, estimator evaluation, and the state-of-the-art in recovering and certifying ground-truth causal effects.

1. Formal Definitions and Potential Outcomes Framework

Ground truth causal effects are specified within the potential outcomes (Neyman–Rubin) or structural causal model (SCM) formalism. The canonical estimands include:

Individual Treatment Effect (ITE):

$\tau(w) = \mathbb{E}\left[ Y(1) - Y(0) \mid W=w \right]$

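where $W$ denotes covariates and $Y(1), Y(0)$ are the potential outcomes (Neal et al., 2020). Strictly, the unit-level effect is $Y_i(1) - Y_i(0)$; $\tau(w)$ is its conditional expectation given $W = w$.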

Average Treatment Effect (ATE):

$\tau = \mathbb{E}[ Y(1) - Y(0) ] = \int \tau(w) \, dP_{W}(w)$

Conditional Average Treatment Effect (CATE):

$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X=x]$

for covariates $X$ (Young et al., 17 Jan 2026).

Controlled Direct Effect (CDE):

$\mathrm{CDE}_{X \rightarrow Y|do(Z=z)}(x,x') = \mathbb{E}\left[ Y \mid do(X=x), do(Z=z) \right] - \mathbb{E}\left[ Y \mid do(X=x'), do(Z=z) \right]$

(Loranchet et al., 5 May 2025).

Direct, Indirect, and Total Causal Effects in Neural Networks:

$\begin{aligned} \mathrm{ACE}_i &= \mathbb{E}[ \hat{Y} \mid do(X_i = x) ] - \mathbb{E}[ \hat{Y} \mid do(X_i = x^*) ] \\ \mathrm{ADCE}_i &= \mathbb{E}[ \hat{Y} \mid do(X_i = x, Z=Z_0) ] - \mathbb{E}[ \hat{Y} \mid do(X_i = x^*, Z=Z_0) ] \\ \mathrm{AICE}_i &= \mathbb{E}[ \hat{Y} \mid do(X_i = x^*, Z=Z_x) ] - \mathbb{E}[ \hat{Y} \mid do(X_i = x^*, Z=Z_0) ] \end{aligned}$

where $Z$ are the children of $X_i$ in the SCM (Reddy et al., 2023).

Isolated Causal Effect of Natural Language:

$\tau^* = \mathbb{E}_{a^c \sim P^*} \left[ \mathbb{E}_Y \left( Y(1, a^c) - Y(0, a^c) \right) \right]$

for text $X = (a(X), a^c(X))$ and a target non-focal distribution $P^*$ (Lin et al., 2024).
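As a concrete illustration of the first three estimands, the following minimal sketch simulates both potential outcomes for every unit, so that unit-level effects, the CATE, and the ATE can be read off directly. The data-generating process, its coefficients, and the name `simulate_units` are hypothetical choices made for illustration and are not taken from any of the cited papers.

```python
import random
from statistics import mean

def simulate_units(n, seed=0):
    """Simulate units with a binary covariate and BOTH potential outcomes."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        x = rng.randint(0, 1)                    # binary covariate X
        y0 = 1.0 + 0.5 * x + rng.gauss(0, 1)     # potential outcome Y(0)
        y1 = y0 + 2.0 + 1.0 * x                  # Y(1): effect is 2 if x == 0, 3 if x == 1
        units.append((x, y0, y1))
    return units

units = simulate_units(10_000)

# Unit-level effects are available only because the data are synthetic.
ite = [y1 - y0 for _, y0, y1 in units]
ate = mean(ite)

# CATE: average effect within each covariate stratum.
cate = {v: mean(y1 - y0 for x, y0, y1 in units if x == v) for v in (0, 1)}

# Averaging the CATE over the covariate distribution recovers the ATE.
share = {v: sum(1 for x, _, _ in units if x == v) / len(units) for v in (0, 1)}
ate_from_cate = sum(cate[v] * share[v] for v in (0, 1))
```

The last step mirrors the integral in the ATE definition: weighting each stratum's CATE by the stratum's share of the population gives the same number as averaging the unit-level effects.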

2. Synthetic and Semi-Synthetic Data: Engineered Ground Truth

Because individual causal effects are not observable in real data, ground truth is established either via controlled experimental design or sophisticated generative modeling.

2.1 Flexible Generative Models (RealCause, Credence)

Structural causal models are parameterized via expressive neural networks (e.g., normalizing flows, VAEs) to match observed covariate, treatment, and outcome distributions, while also encoding a user-specified function for the ITE, confounding bias, or other estimands.
After fitting, these generative models support sampling of all potential outcomes for any unit, yielding known $\tau(w)$, $\tau(x)$, and global quantities such as the ATE.
Validation of realism is performed with two-sample tests (KS, MMD, Energy distance) to confirm indistinguishability from empirical data (Neal et al., 2020, Parikh et al., 2022).
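The realism check described above can be sketched with a hand-written two-sample Kolmogorov-Smirnov statistic. This is an illustrative stand-in, not code from RealCause or Credence: the "observed" sample and both candidate generators are simulated here, and all names are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in a + b:                      # the supremum is attained at a data point
        fa = bisect_right(a, v) / len(a)
        fb = bisect_right(b, v) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap

rng = random.Random(0)
observed = [rng.gauss(0, 1) for _ in range(5_000)]     # stand-in for real outcomes
good_model = [rng.gauss(0, 1) for _ in range(5_000)]   # generator matching the data
bad_model = [rng.gauss(1, 1) for _ in range(5_000)]    # generator with a shifted mean

ks_good = ks_statistic(observed, good_model)   # small: samples are hard to tell apart
ks_bad = ks_statistic(observed, bad_model)     # large: the generator is unrealistic
```

A small statistic is evidence that the generator's marginal distribution is hard to distinguish from the data; it says nothing by itself about whether the encoded causal effects are realistic.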

2.2 RCT Rejection Sampling

In settings where a randomized controlled trial (RCT) is available, the ground-truth ATE is fixed by the randomization. Subsampling the RCT via rejection sampling with respect to a user-defined $P^*(T \mid C)$ creates a confounded observational dataset in which the ATE remains identifiable by adjusting for $C$ and equals the RCT ATE (under positivity and SUTVA).
This enables benchmarking of any estimator against the true ATE from the original randomization (Keith et al., 2023).
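A minimal sketch of this subsampling idea, under stated assumptions: the RCT is simulated, the target $P^*(T \mid C)$ is an arbitrary illustrative choice, and the acceptance rule is the standard rejection-sampling ratio $P^*(T \mid C) / (P(T)\,M)$, which may differ in detail from the cited implementation.

```python
import random
from statistics import mean

rng = random.Random(0)
TRUE_ATE = 2.0

# Simulated RCT (hypothetical): T is randomized with P(T=1) = 0.5; C affects only Y.
rct = []
for _ in range(40_000):
    c = rng.randint(0, 1)
    t = rng.randint(0, 1)
    y = TRUE_ATE * t + 3.0 * c + rng.gauss(0, 1)
    rct.append((c, t, y))

def p_star(t, c):
    """User-defined target P*(T=t | C=c): treatment is likelier when c == 1."""
    p1 = 0.8 if c == 1 else 0.2
    return p1 if t == 1 else 1.0 - p1

P_T = 0.5          # marginal treatment probability in the RCT (same for t = 0 and 1)
M = 0.8 / P_T      # upper bound on the ratio P*(T|C) / P(T)

# Keep each unit with probability P*(T|C) / (P(T) * M).
sample = [(c, t, y) for c, t, y in rct if rng.random() < p_star(t, c) / (P_T * M)]

def diff_in_means(rows):
    return mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)

def adjusted_ate(rows):
    """Backdoor adjustment: stratify on C and weight strata by their frequency."""
    total = 0.0
    for c in (0, 1):
        stratum = [r for r in rows if r[0] == c]
        total += diff_in_means(stratum) * len(stratum) / len(rows)
    return total

rct_estimate = diff_in_means(rct)    # close to TRUE_ATE thanks to randomization
naive = diff_in_means(sample)        # biased upward: C now confounds T and Y
adjusted = adjusted_ate(sample)      # adjusting for C recovers the RCT value
```

In this toy setup the unadjusted contrast on the subsample is badly biased, while the estimate that adjusts for $C$ lands near the RCT benchmark, which is the comparison such a benchmark is built to support.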

2.3 Experimental–Observational Data Pairs

In cases with paired experimental and observational data from nearly identical populations, the experimental ATE serves as the ground truth. Methods are judged by their ability to recover this value under observational assumptions (unconfoundedness, overlap, etc.), with identical covariate support as a precondition (Young et al., 17 Jan 2026).

3. Algorithms for Ground-Truth Recovery and Effect Certification

3.1 Structural Learning under Faithfulness Assumptions

Generalized $k$-Triangle-Faithfulness enables consistent estimation of identifiable causal effects and parent sets in non-Gaussian, nonparametric settings. The VCSGS and Edge Estimation algorithms recover interventional distributions $p(Y \mid do(X=\cdot))$ with uniform convergence, but output "Unknown" when the effect is not identifiable (Wang et al., 2021).

3.2 Local vs Global Graph Discovery for Direct Effects

The LocPC and LocPC-CDE algorithms recover the minimal portion of the essential graph needed to identify a controlled direct effect, using only local conditional independence tests. The non-orientability criterion certifies when the CDE is not identifiable, bypassing computationally expensive global discovery (Loranchet et al., 5 May 2025).
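To make the target of these algorithms concrete, the sketch below evaluates a CDE by direct simulation in a toy linear SCM whose graph is assumed known. It does not implement LocPC or LocPC-CDE. The graph (X → Z → Y together with X → Y), the coefficients, and the function `mean_outcome` are hypothetical.

```python
import random

A, B, C = 1.5, 2.0, 0.8   # hypothetical coefficients for X->Y, Z->Y, X->Z

def mean_outcome(x, z=None, n=20_000, seed=0):
    """Monte Carlo estimate of E[Y | do(X=x)], or E[Y | do(X=x), do(Z=z)] if z is given."""
    rng = random.Random(seed)            # same seed: common random numbers across calls
    total = 0.0
    for _ in range(n):
        u_z, u_y = rng.gauss(0, 1), rng.gauss(0, 1)
        z_val = z if z is not None else C * x + u_z   # do(Z=z) replaces Z's mechanism
        total += A * x + B * z_val + u_y
    return total / n

# Controlled direct effect of X on Y with the mediator held at z = 0.
cde = mean_outcome(1.0, z=0.0) - mean_outcome(0.0, z=0.0)

# Total effect, for contrast: the mediated path X -> Z -> Y is left active.
total_effect = mean_outcome(1.0) - mean_outcome(0.0)
```

In this linear model the CDE equals the direct coefficient `A`, while the total effect is `A + B * C`; the gap between the two is the contribution of the mediated path.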

3.3 Witness-Protection Program (WPP) for Partial Identification

When point identification fails (unmeasured confounding), WPP blends conditional independence constraints, path-cancellation relaxations, and linear programming to yield posterior bounds on the average causal effect. Bayesian inference over faithfulness violation parameters quantifies uncertainty in these bounds (Silva et al., 2014).

4. Performance Metrics and Method Evaluation

Ground-truth causal effects allow rigorous benchmarking via the following metrics:

| Metric | Definition | Application |
| --- | --- | --- |
| Bias | $\lvert \hat{\tau} - \tau \rvert$ | Point estimation accuracy, e.g. ATE |
| RMSE | $\sqrt{ \frac{1}{n} \sum_{i=1}^n (\hat{\tau}_i - \tau_i)^2 }$ | ATE, ITE, PEHE |
| PEHE | $\sqrt{ \frac{1}{n} \sum_{i=1}^n ( \hat{\tau}(w_i) - \tau(w_i) )^2 }$ | Individual effect accuracy |
| Coverage | Empirical CI coverage of $\tau$ | Frequentist inference calibration |

(Neal et al., 2020, Parikh et al., 2022, Lin et al., 2024)
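The metrics in the table translate directly into code. The following is a minimal sketch with hypothetical function names and made-up example numbers; it assumes the ground truth is available as a scalar (ATE) or as a function of covariates (PEHE).

```python
import math

def bias(tau_hat, tau):
    """Absolute error of a point estimate against the known effect."""
    return abs(tau_hat - tau)

def rmse(estimates, truths):
    """Root mean squared error over paired estimates and true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(estimates))

def pehe(tau_hat_fn, tau_fn, covariates):
    """RMSE between estimated and true effect functions, evaluated at each w_i."""
    return rmse([tau_hat_fn(w) for w in covariates], [tau_fn(w) for w in covariates])

def coverage(intervals, tau):
    """Share of confidence intervals (lo, hi) that contain the true effect."""
    return sum(lo <= tau <= hi for lo, hi in intervals) / len(intervals)

# Illustrative values only.
example_bias = bias(2.3, 2.0)
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_pehe = pehe(lambda w: 2 * w, lambda w: 2 * w + 1, [0, 1, 2])
example_coverage = coverage([(1.5, 2.5), (2.1, 2.9), (1.0, 2.0)], 2.0)
```

Coverage is typically computed over many simulated replications of the same benchmark, with one interval per replication.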

For heterogeneous effect estimation, cross-validation metrics are constructed via honest-sample splitting or transformed-outcome surrogates, enabling selection and pruning of tree-based models in the absence of individual-level ground truth (Athey et al., 2015).
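The transformed-outcome idea can be illustrated as follows. With a known propensity $e$, the surrogate $Y^* = Y\,(T - e) / (e(1 - e))$ satisfies $\mathbb{E}[Y^* \mid X = x] = \tau(x)$, so it can score candidate effect models without access to unit-level effects. The sketch assumes a randomized design with $e = 0.5$ and a hypothetical data-generating process; it is not the cited authors' implementation.

```python
import random

rng = random.Random(0)
E = 0.5   # known propensity (assumption: randomized treatment)

def true_cate(x):
    return 1.0 + 2.0 * x      # hypothetical effect function, used only to simulate

rows = []
for _ in range(100_000):
    x = rng.randint(0, 1)
    t = 1 if rng.random() < E else 0
    y = x + true_cate(x) * t + rng.gauss(0, 1)
    y_star = y * (t - E) / (E * (1 - E))      # transformed outcome
    rows.append((x, y_star))

def avg(values):
    values = list(values)
    return sum(values) / len(values)

# Stratum means of Y* estimate the CATE without observing any unit-level effect.
cate_from_y_star = {v: avg(ys for x, ys in rows if x == v) for v in (0, 1)}

def surrogate_mse(tau_hat):
    """Squared error against Y*; it ranks models like the infeasible error against tau."""
    return avg((ys - tau_hat(x)) ** 2 for x, ys in rows)

score_good = surrogate_mse(true_cate)          # correctly specified candidate
score_bad = surrogate_mse(lambda x: 0.0)       # candidate that predicts no effect
```

The surrogate error differs from the true effect-estimation error only by a term that does not depend on the candidate model, which is why it can be used for model selection even though $Y^*$ itself is very noisy.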

5. Identifiability, Assumptions, and Sensitivity

The validity of ground-truth recovery is contingent on structural and statistical assumptions:

Randomization: True ground-truth ATE in RCTs relies on perfect random assignment and consistency.
Unconfoundedness / Overlap: Observational estimators require that potential outcomes are independent of treatment given covariates, and that propensity scores are bounded away from 0 and 1.
Local Faithfulness: Structural learning methods may operate with local as opposed to global faithfulness, facilitating identification of targeted effects with minimal assumptions (Loranchet et al., 5 May 2025).
Omitted Variable Bias: For isolated effects (especially in language), fidelity–overlap trade-offs and bias bounds quantify the vulnerability of ground-truth approximation to omitted features, encapsulating finite sample and model-specification error (Lin et al., 2024).
Path Cancellations and Relaxations: When point identification is compromised by confounding or faithfulness violations, frameworks such as WPP provide transparent, worst-case bounds rather than point estimates, with practical recommendations for trading the strength of assumptions against interval width (Silva et al., 2014).
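The overlap requirement listed above is easy to check before any estimator is run. The sketch below is a simple diagnostic under assumed inputs: the propensity scores are taken as given and strictly between 0 and 1, and the threshold `eps` and both function names are hypothetical.

```python
def overlap_report(propensities, eps=0.05):
    """Summarize violations of the band [eps, 1 - eps]; scores must lie in (0, 1)."""
    violations = [e for e in propensities if not (eps <= e <= 1 - eps)]
    # Largest inverse-propensity weight any unit could receive.
    max_weight = max(max(1 / e, 1 / (1 - e)) for e in propensities)
    return {
        "n_violations": len(violations),
        "share_violations": len(violations) / len(propensities),
        "max_ipw_weight": max_weight,
    }

def trim(rows, eps=0.05):
    """Drop rows (propensity, treatment, outcome) whose propensity is outside the band."""
    return [r for r in rows if eps <= r[0] <= 1 - eps]

scores = [0.01, 0.3, 0.5, 0.7, 0.98]
report = overlap_report(scores)
kept = trim([(e, 1, 0.0) for e in scores])
```

Trimming changes the population over which the effect is averaged, so a trimmed estimate should be compared with a ground truth computed on the same trimmed population.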

6. Empirical Findings and Best Practices

State-of-the-art empirical studies report the following:

S-learner meta-estimators with flexible outcome models (e.g. RBF-SVM) exhibit the lowest ATE RMSE and PEHE under realistic generative ground truths (Neal et al., 2020).
Predictive cross-validation performance (RMSE for outcomes, F-score/AP for propensities) correlates strongly with causal performance ($\rho \approx 0.7$ to $0.9$) (Neal et al., 2020).
Doubly-robust algorithms (DR-learner) provide the most stable recovery of ground truth ATE under correct trimming, careful hyperparameter tuning, and model ensembling (Young et al., 17 Jan 2026).
For synthetic or semi-synthetic benchmarks (RealCause/Credence), unit-level and population-level ground-truth effects are precisely accessible, permitting estimation error assessment and method ranking (Parikh et al., 2022).
Controlled manipulation via RCT rejection sampling preserves ground truth ATE while generating confounded datasets for robust estimator comparison (Keith et al., 2023).
In applications where text is the treatment, isolated effect estimation must explicitly model and quantify omitted confounders, using robust sensitivity metrics and reporting fidelity–overlap trade-offs (Lin et al., 2024).
Honest splitting and cross-fitted evaluation restore valid inference for tree-based heterogeneous effect models, despite never observing $\tau_i$ directly (Athey et al., 2015).
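The doubly robust idea mentioned above can be illustrated with the augmented inverse-probability-weighted (AIPW) estimator of the ATE, evaluated against a known ground truth. This is a sketch on hypothetical simulated data with nuisance models fitted as simple stratum averages; it is not the DR-learner pipeline of the cited study and omits cross-fitting, trimming, and ensembling.

```python
import random

rng = random.Random(0)
TRUE_ATE = 2.0

# Hypothetical confounded data: X raises both the treatment probability and the outcome.
rows = []
for _ in range(40_000):
    x = rng.randint(0, 1)
    t = 1 if rng.random() < (0.8 if x else 0.2) else 0
    y = 3.0 * x + TRUE_ATE * t + rng.gauss(0, 1)
    rows.append((x, t, y))

def avg(values):
    values = list(values)
    return sum(values) / len(values)

# Nuisance models fitted as stratum averages (saturated in the binary X).
e_hat = {v: avg(t for x, t, _ in rows if x == v) for v in (0, 1)}
mu_hat = {(v, a): avg(y for x, t, y in rows if x == v and t == a)
          for v in (0, 1) for a in (0, 1)}

def aipw(rows, mu, e):
    """AIPW estimate of the ATE from an outcome model mu(x, t) and a propensity e(x)."""
    scores = []
    for x, t, y in rows:
        m1, m0 = mu(x, 1), mu(x, 0)
        scores.append(m1 - m0 + t * (y - m1) / e(x) - (1 - t) * (y - m0) / (1 - e(x)))
    return avg(scores)

naive = avg(y for _, t, y in rows if t == 1) - avg(y for _, t, y in rows if t == 0)
dr_fitted = aipw(rows, lambda x, a: mu_hat[(x, a)], lambda x: e_hat[x])
# Double robustness: one wrong nuisance model is tolerated if the other is right.
dr_bad_outcome = aipw(rows, lambda x, a: 0.0, lambda x: e_hat[x])
dr_bad_propensity = aipw(rows, lambda x, a: mu_hat[(x, a)], lambda x: 0.5)
```

Because the true ATE is known by construction, the naive contrast can be seen to be biased while all three AIPW variants stay close to `TRUE_ATE`, which is the kind of comparison the benchmarks above are designed to enable.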

7. Limitations and Open Directions

While synthetic and semi-synthetic approaches provide unit-level ground truth, they rest on the assumption that the generative or experimental process mimics relevant structure and confounding of actual data. Approximations—e.g., in language representation, causal structure specification, or high-dimensional covariance—impose practical limits. Open problems include:

Automating discovery of causal structure from data to enable scalable ground-truth construction (Reddy et al., 2023).
Incorporating latent confounders or complex mediation in generative benchmarking frameworks.
Extending local-identification theory to uncertain or partially observed graphs (Loranchet et al., 5 May 2025).
Quantifying robustness of isolated/multimodal causal effect estimates in the presence of model misfit and limited overlap (Lin et al., 2024).

The continued development of methodologies for constructing, validating, and precisely benchmarking against ground-truth causal effects—across tabular, text, time series, and complex system domains—remains a core focus of modern causal inference research.