Valid statistical inference under synthetic data augmentation

Develop statistical inference procedures for models trained with synthetic data augmentation that yield valid uncertainty quantification. Such procedures must characterize and propagate both synthesis-induced randomness and generative-model error, rather than treating augmented synthetic samples as fixed or as equivalent to real observations.

Background

Synthetic data-augmented approaches explicitly generate perturbed or counterfactual samples to improve robustness and out-of-distribution generalization, often targeting distributions that differ from the original training distribution. While such augmentation can improve predictive performance, it introduces additional stochasticity and potential bias from the generative model.

The paper emphasizes that treating augmented synthetic observations as if they were real data undermines valid inference because it ignores both the randomness of the augmentation process and errors from model misspecification. A principled framework is needed to quantify and propagate these uncertainties to support valid statistical inference.
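The undercoverage described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's method: the "generative model" is a hypothetical Gaussian fitted to the real sample, the target is the population mean, and the function name `coverage_experiment` and all of its parameters are invented for illustration. The naive interval pools real and synthetic points and treats them as independent real observations.

```python
import numpy as np


def coverage_experiment(n_real=50, n_synth=500, n_reps=2000, mu=0.0, seed=0):
    """Estimate coverage of nominal 95% intervals for the mean.

    Returns (naive_coverage, real_only_coverage), where the naive interval
    treats pooled real + synthetic points as i.i.d. real data.
    """
    rng = np.random.default_rng(seed)
    z = 1.96  # normal quantile for a nominal 95% interval
    hits_naive = 0
    hits_real = 0
    for _ in range(n_reps):
        # Real data drawn from the true distribution N(mu, 1).
        real = rng.normal(mu, 1.0, size=n_real)

        # Hypothetical generative model (an assumption of this sketch):
        # a Gaussian fitted to the real sample, then sampled from.
        synth = rng.normal(real.mean(), real.std(ddof=1), size=n_synth)

        # Naive analysis: pool everything and use the usual standard error,
        # ignoring that the synthetic points depend on the real sample.
        pooled = np.concatenate([real, synth])
        se_naive = pooled.std(ddof=1) / np.sqrt(pooled.size)
        hits_naive += int(abs(pooled.mean() - mu) <= z * se_naive)

        # Baseline: interval computed from the real data alone.
        se_real = real.std(ddof=1) / np.sqrt(n_real)
        hits_real += int(abs(real.mean() - mu) <= z * se_real)

    return hits_naive / n_reps, hits_real / n_reps


if __name__ == "__main__":
    naive, real_only = coverage_experiment()
    print(f"naive pooled coverage: {naive:.3f}")
    print(f"real-only coverage:    {real_only:.3f}")
```

In this toy setting the synthetic points carry no information beyond the real sample they were fitted to, so the pooled standard error shrinks while the actual estimation error does not, and the naive interval covers the true mean far less often than its nominal 95% level. The sketch only captures synthesis randomness and the dependence on the fitted model; it does not address generative-model misspecification.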

References

However, conducting valid statistical inference under data-augmented approaches remains challenging and largely open, due to the difficulty of characterizing both the randomness introduced by synthetic data and the errors arising from the generative model.

Harnessing Synthetic Data from Generative AI for Statistical Inference (2603.05396 - Abdel-Azim et al., 5 Mar 2026) in Section 3.3, Synthetic data-augmented approaches