
Inference with Predicted Data (IPD) Insights

Updated 8 December 2025
  • Inference with Predicted Data (IPD) is a framework that uses machine learning predictions alongside true observations to enable valid estimation of regression coefficients and other statistical parameters.
  • IPD methodologies correct for bias and propagate prediction uncertainty using techniques such as post-prediction calibration, prediction-powered inference, and moment-based corrections.
  • Modern approaches, including bootstrap and multiple-imputation methods, yield unbiased estimators and reliable confidence intervals even when using error-prone predicted outcomes.

Inference with Predicted Data (IPD) concerns the statistical analysis of parameters when true observations of an outcome are partially or entirely replaced by predictions from an ML model. The challenge is to leverage large amounts of inexpensive but error-prone predicted data, together with a much smaller set of gold-standard observations, to obtain valid inference, particularly for regression coefficients, group means, quantiles, or other model parameters. Standard statistical methods, if naively applied to predicted outcomes as if they were observed, generally yield biased estimates and invalid confidence intervals due to unaddressed model bias, prediction uncertainty, and error propagation. Modern IPD methodologies provide formal frameworks for correcting bias and appropriately calibrating uncertainty, thereby enabling valid inference in high-throughput, semi-supervised, or data-augmented environments.

1. Formal Statistical Framework

Let $Y \in \mathbb{R}$ denote the true, often unobserved outcome, and $X \in \mathbb{R}^p$ the fully observed covariate vector. A possibly complex “black box” prediction function $\hat{f}: \mathbb{R}^p \to \mathbb{R}$, typically trained externally, yields predicted outcomes $\hat{Y} = \hat{f}(X)$. The IPD scenario consists of two datasets: a small labeled subset $D_\ell = \{(X_i, Y_i)\}_{i=1}^n$ where both $X$ and $Y$ are observed, and a large unlabeled or semi-supervised set $D_u = \{X_j\}_{j=1}^N$ where only $X$ is observed and $\hat{Y}$ can be computed.

The scientific estimand is a low-dimensional functional $\theta = \Psi(P_{Y,X})$, such as the regression coefficients in $Y = X^\top \beta + \varepsilon$. If one naively fits a downstream model $g(X, \hat{Y})$ to the predicted outcome, the corresponding parameter target is generally $\Psi(P_{\hat{Y},X}) \neq \Psi(P_{Y,X})$, resulting in estimand distortion.

Key formal elements include:

  • Distributional framework: $(X, Y) \sim P_0$, unknown but fixed.
  • Prediction rule uncertainty: even for fixed $X$, $\hat{f}$ exhibits model (training) variance and possibly bias.
  • Goal: obtain valid inference for $\theta$ (e.g., regression coefficients, means, odds ratios) using $(D_\ell, D_u, \hat{f})$.
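The estimand distortion above can be seen in a few lines of simulation. The sketch below (all names, the data-generating model, and the quadratic miscalibration of the predictor are illustrative assumptions, not taken from the cited papers) fits the same simple regression to the small labeled set and to predictions on the large unlabeled set:

```python
import random

random.seed(0)

def ols_slope(x, y):
    """Simple-regression slope: Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

beta_true = 2.0
f_hat = lambda x: beta_true * x + 0.5 * x * x  # biased "black box" predictor

# Small labeled subset D_ell: both X and Y observed.
X_l = [random.uniform(0, 2) for _ in range(200)]
Y_l = [beta_true * x + random.gauss(0, 0.5) for x in X_l]

# Large unlabeled set D_u: only X observed; outcomes replaced by f_hat(X).
X_u = [random.uniform(0, 2) for _ in range(20000)]
Yhat_u = [f_hat(x) for x in X_u]

beta_labeled = ols_slope(X_l, Y_l)   # targets Psi(P_{Y,X}), i.e. beta = 2
beta_naive = ols_slope(X_u, Yhat_u)  # targets Psi(P_{Yhat,X}), shifted to ~3
print(beta_labeled, beta_naive)
```

The labeled fit is noisy but centered at the true slope, while the naive fit to predictions converges to a different parameter entirely, no matter how large $N$ grows.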

2. Sources and Decomposition of Error

Three primary sources of inferential error arise when substituting predicted for observed outcomes (Hoffman et al., 2024):

  1. Model Bias: The prediction function $\hat{f}$ may not provide unbiased or well-calibrated estimates of $E[Y \mid X]$, so $\text{Bias}(x) = E[\hat{Y} \mid X = x] - E[Y \mid X = x]$ is generally nonzero. Systematic miscalibration or lack of model transportability manifests as persistent bias across $X$.
  2. Model Training Uncertainty: Because $\hat{f}$ is fit to a finite training sample $D_{\text{train}}$, $\hat{f}(x)$ has nonzero variance over the draw of $D_{\text{train}}$. This uncertainty propagates to downstream effect estimates.
  3. Error Propagation to Inference: Treating $\hat{Y}$ as if it were observed introduces two problems: (i) bias in the estimating equations for $\theta$, and (ii) underestimation of standard errors, since the additional variance due to imputation is ignored.

The total variance of the prediction error $\epsilon = \hat{Y} - Y$ decomposes into $\mathrm{Var}(\epsilon) = \text{Bias}^2 + \mathrm{Var}_{\text{model}} + \mathrm{Var}(\eta)$, where $\mathrm{Var}_{\text{model}}$ is the model (training) variance and $\mathrm{Var}(\eta)$ is the irreducible noise in $Y$ given $X$ (Hoffman et al., 2024).

For downstream linear regression, the induced bias in the coefficient estimator is $\text{Bias}(\hat{\beta}_{\text{naive}}) \approx \Sigma_X^{-1} E[X \cdot \text{Bias}(X)]$, where $\Sigma_X = E[XX^\top]$.
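This bias formula can be checked numerically in the scalar, through-the-origin case, where $\Sigma_X$ reduces to $E[X^2]$. The simulation below assumes a specific miscalibration function $\text{Bias}(x) = 0.2x + 0.1x^2$ purely for illustration:

```python
import random

random.seed(1)
N = 100_000
X = [random.gauss(0, 1) for _ in range(N)]

bias_fn = lambda x: 0.2 * x + 0.1 * x * x  # assumed miscalibration Bias(x)
beta_true = 1.5
# Predictions equal the conditional mean plus the systematic bias.
Yhat = [beta_true * x + bias_fn(x) for x in X]

sigma_x = sum(x * x for x in X) / N         # empirical E[X^2]
cross = sum(x * bias_fn(x) for x in X) / N  # empirical E[X * Bias(X)]
predicted_bias = cross / sigma_x            # Sigma_X^{-1} E[X * Bias(X)]

# Naive through-the-origin OLS of the predictions on X.
beta_naive = sum(x * y for x, y in zip(X, Yhat)) / sum(x * x for x in X)
empirical_bias = beta_naive - beta_true
print(predicted_bias, empirical_bias)  # both near 0.2
```

With standard-normal $X$, the quadratic part of the bias is uncorrelated with $X$ and only the linear part (coefficient $0.2$) shifts the slope, exactly as the formula predicts.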

3. Statistical Methodologies for IPD

Multiple estimator families have been developed to address IPD bias and variance. The methodologies fall broadly into assumption-lean, relationship-model-based, and measurement-error-inspired approaches.

A. Post-Prediction Inference (PostPI):

  • Relies on a low-dimensional relationship model $Y = \gamma_0 + \gamma_1 \hat{Y} + \eta$ fit on the labeled sample $D_\ell$.
  • Naive estimates are corrected by applying the estimated calibration relationship, with variance estimates based on the residual variance in $D_\ell$.
  • Validity requires $\mathrm{Cov}(X, \eta) = 0$, which is typically violated if $\hat{f}$ does not account for all structure in $X$ (Salerno et al., 12 Jul 2025; Salerno et al., 2024).
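A minimal PostPI-style sketch, under the favorable assumption of a linearly miscalibrated predictor (so the relationship model is well specified and $\mathrm{Cov}(X, \eta) = 0$ holds); the code is illustrative, not the reference PostPI implementation:

```python
import random

random.seed(2)

def ols(x, y):
    """Return (intercept, slope) of a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

beta_true = 2.0
f_hat = lambda x: 0.5 + 1.4 * x  # linearly miscalibrated predictor (assumed)

# Small labeled set and large unlabeled set.
X_l = [random.uniform(0, 2) for _ in range(300)]
Y_l = [beta_true * x + random.gauss(0, 0.4) for x in X_l]
X_u = [random.uniform(0, 2) for _ in range(10000)]

# Step 1: relationship (calibration) model Y ~ g0 + g1 * Yhat on labeled data.
g0, g1 = ols([f_hat(x) for x in X_l], Y_l)

# Step 2: calibrated outcomes on the unlabeled data, then the downstream fit.
Y_cal = [g0 + g1 * f_hat(x) for x in X_u]
_, beta_postpi = ols(X_u, Y_cal)

_, beta_naive = ols(X_u, [f_hat(x) for x in X_u])
print(beta_naive, beta_postpi)  # naive is biased; calibrated is near 2
```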

B. Prediction-Powered Inference (PPI):

  • Constructs an unbiased estimator by adding a “rectifier” term: $\hat{\theta}_{\text{PPI}} = \hat{\theta}_{\text{naive}} + \widehat{\Delta}$, where $\widehat{\Delta} = \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} [\psi_\theta(X_i, Y_i) - \psi_\theta(X_i, \hat{Y}_i)]$ and $\psi_\theta$ is the relevant estimating function (Angelopoulos et al., 2023; Salerno et al., 5 Dec 2025).
  • No assumptions are made on the quality of $\hat{f}$.
  • Variance estimates combine the variance in the labeled and unlabeled sets (Luo et al., 2024).
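For a population mean, the estimating function is simply $\psi_\theta(x, y) = y$, so the rectifier reduces to the average labeled residual $Y_i - \hat{Y}_i$. A simulated sketch (the constant prediction bias of $0.8$ is an arbitrary assumption for illustration):

```python
import random

random.seed(3)

f_hat = lambda x: x + 0.8  # predictor with an assumed constant bias of +0.8
true_mean = 5.0

# Labeled pairs: Y = X + noise, with E[Y] = 5.
X_l = [random.gauss(5.0, 1.0) for _ in range(500)]
Y_l = [x + random.gauss(0, 0.3) for x in X_l]

# Large unlabeled set: only X observed.
X_u = [random.gauss(5.0, 1.0) for _ in range(50000)]

theta_naive = sum(f_hat(x) for x in X_u) / len(X_u)  # inherits the +0.8 bias
rectifier = sum(y - f_hat(x) for x, y in zip(X_l, Y_l)) / len(X_l)
theta_ppi = theta_naive + rectifier                  # bias cancels
print(theta_naive, theta_ppi)
```

The rectifier estimates exactly the systematic gap between predicted and true outcomes, so the correction holds regardless of how the (here arbitrary) predictor was built.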

C. Moment-Based and Semiparametric Corrections:

  • Generalizations such as the moment-based correction (Salerno et al., 12 Jul 2025) address the failure of the classical calibration model by targeting the fundamental population moment condition (e.g., $E[X(Y - X^\top \beta)] = 0$), yielding unbiased plug-in estimators and consistent variance estimators even when calibration errors correlate with $X$.

D. Bootstrap and Multiple-Imputation-Style Methods:

  • Algorithms implement multiple imputations of predicted outcomes, followed by downstream estimation for each draw and combination using Rubin’s rules (Hoffman et al., 2024).
  • Resampling (bootstrap) over $D_{\text{train}}$ or $D_\ell$ provides empirical confidence intervals that account for prediction and calibration uncertainty (Kluger et al., 30 Jan 2025).
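A minimal multiple-imputation-style sketch for a mean, assuming a linear calibration model with homoskedastic residuals; note it deliberately omits resampling of the calibration fit itself, which fuller procedures (e.g., bootstrap over $D_\ell$) also account for:

```python
import random
import statistics

random.seed(4)

f_hat = lambda x: 1.1 * x  # modestly miscalibrated predictor (assumed)
X_l = [random.gauss(3.0, 1.0) for _ in range(300)]
Y_l = [x + random.gauss(0, 0.5) for x in X_l]      # true mean of Y is 3
X_u = [random.gauss(3.0, 1.0) for _ in range(5000)]

# Calibration model Y ~ g0 + g1 * Yhat on the labeled data, plus residual SD.
Yhat_l = [f_hat(x) for x in X_l]
m_yh, m_y = statistics.mean(Yhat_l), statistics.mean(Y_l)
g1 = (sum((a - m_yh) * (b - m_y) for a, b in zip(Yhat_l, Y_l))
      / sum((a - m_yh) ** 2 for a in Yhat_l))
g0 = m_y - g1 * m_yh
resid_sd = statistics.pstdev([b - (g0 + g1 * a) for a, b in zip(Yhat_l, Y_l)])

M = 20  # number of imputed datasets
estimates, within_vars = [], []
for _ in range(M):
    # Draw plausible outcomes: calibrated prediction plus residual noise.
    Y_imp = [g0 + g1 * f_hat(x) + random.gauss(0, resid_sd) for x in X_u]
    estimates.append(statistics.mean(Y_imp))                   # per-draw estimate
    within_vars.append(statistics.variance(Y_imp) / len(X_u))  # its variance

# Rubin's rules: pooled point estimate and total (within + between) variance.
point = statistics.mean(estimates)
total_var = (statistics.mean(within_vars)
             + (1 + 1 / M) * statistics.variance(estimates))
print(point, total_var ** 0.5)
```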

E. Software Implementation:

  • The R package ipd implements these modern IPD methods through a unified interface, providing plug-in and bias-corrected estimators, with rigorous (sandwich-form or bootstrap) variance estimation (Salerno et al., 2024).

4. Theoretical Guarantees and Efficiency

IPD methods provide robust statistical guarantees:

  • Unbiasedness: Correction procedures using labeled data yield estimators centered at the true parameter $\theta$ under standard conditions, regardless of the ML model quality (provided selection is MCAR) (Angelopoulos et al., 2023; Salerno et al., 12 Jul 2025; Kluger et al., 30 Jan 2025).
  • Correct Coverage: Confidence intervals constructed via plug-in or resampling variance estimators (including those with $N/n$ inflation for the calibration error) achieve nominal coverage even for arbitrarily complex $\hat{f}$.
  • Variance Reduction: The asymptotic variance of IPD estimators is never greater (and typically smaller) than the variance from using only the labeled sample, whenever the ML predictions are at least weakly informative (Kluger et al., 30 Jan 2025, Angelopoulos et al., 2023).
  • Finite-Sample Validity: Nonasymptotic versions of the main theorems guarantee finite-sample validity for broad settings, including nonuniform sampling and arbitrary ML model accuracy (Kluger et al., 30 Jan 2025).

In federated settings, distributed IPD methods remain valid without sharing raw data, provided suitable aggregation of summary statistics (Luo et al., 2024).

5. Practical Considerations and Implementation

Effective use of IPD methodologies requires attention to data splitting, diagnostics, choice of calibration or correction method, and reporting.

  • Labeled Subsample Size: Representative gold-standard labeled data (minimum $n \sim 100$–$200$ for regression calibration; larger for complex $\hat{f}$ or high-dimensional $X$) is required for stable estimation of bias and calibration models (Salerno et al., 12 Jul 2025).
  • Diagnostics: Residual plots, the calibration-curve slope, and the $R^2$ between $\hat{Y}_i$ and $Y_i$ should be examined on the labeled set to assess potential model miscalibration or heteroskedasticity (Hoffman et al., 2024; Salerno et al., 2024).
  • Method Selection: Assumption-lean (PPI, PSPA) methods are recommended when calibration models may be misspecified; relationship-model-based corrections are tractable in simple settings. Recent generalizations (moment-based estimators) remove restrictive independence assumptions (Salerno et al., 12 Jul 2025, Salerno et al., 5 Dec 2025).
  • Variance Inflation: Sandwich-style or bootstrap variance estimators are essential to avoid undercoverage, particularly with large $N/n$ ratios.
  • Reporting: Practitioners should report empirical bias curves, variance decompositions (bias$^2$, model variance, irreducible noise), and CI coverage diagnostics to quantify uncertainty and methodological robustness (Hoffman et al., 2024).
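The calibration-slope and $R^2$ diagnostics can be computed directly on the labeled set. In the sketch below (the predictor and noise levels are arbitrary simulated assumptions), a fitted slope far from 1 or an intercept far from 0 flags miscalibration:

```python
import random

random.seed(5)
X_l = [random.uniform(0, 4) for _ in range(400)]
Y_l = [1.5 * x + random.gauss(0, 0.6) for x in X_l]
Yhat_l = [0.3 + 1.2 * x for x in X_l]  # assumed imperfect predictor

# Calibration regression of observed Y on predicted Yhat.
n = len(Y_l)
m_yh, m_y = sum(Yhat_l) / n, sum(Y_l) / n
sxx = sum((a - m_yh) ** 2 for a in Yhat_l)
sxy = sum((a - m_yh) * (b - m_y) for a, b in zip(Yhat_l, Y_l))
slope = sxy / sxx
intercept = m_y - slope * m_yh

# R^2 of the calibration fit: how informative the predictions are.
ss_tot = sum((b - m_y) ** 2 for b in Y_l)
ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(Yhat_l, Y_l))
r2 = 1 - ss_res / ss_tot

print(slope, intercept, r2)  # slope far from 1 flags miscalibration
```

Here the predictions are informative (high $R^2$) but systematically miscalibrated (slope above 1), the regime in which calibration-based corrections pay off.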

6. Extensions and Advanced Topics

A. Nonparametric and Survival Targets:

  • Extensions to survival analysis (IPD reconstruction from Kaplan–Meier curves), quantile regression, and nonparametric functionals leverage analogous calibration and correction frameworks, often under additional structure such as piecewise constant hazard models (Fu et al., 2022, Lang et al., 3 Nov 2025).

B. Nonuniform Sampling & Covariate Imputation:

  • The Predict-Then-Debias (PTD) estimator and its weighted bootstrap extension allow valid inference in two-phase stratified, weighted, or clustered designs and when only a subset of features is imputed by ML (Kluger et al., 30 Jan 2025).

C. Federated and Decentralized IPD:

  • Federated Prediction-Powered Inference (Fed-PPI) extends IPD to multiple data silos, enabling valid aggregated inference when both labeled and unlabeled data are decentralized and private (Luo et al., 2024).

7. Connections to Classical Theory and Open Problems

Inference with Predicted Data is fundamentally linked to classical survey sampling (double sampling), measurement error models (regression calibration), missing data (imputation theory), and semi-supervised learning. All recent IPD methods can be seen as modern generalizations of these paradigms, applying efficient influence function–based estimation, bias correction, and variance inflation to arbitrary downstream targets (Salerno et al., 5 Dec 2025). Open questions include semiparametric efficiency bounds under data missing completely at random or missing at random, optimal allocation of labeling budgets, handling of distributional shift, and extension to high-dimensional or non-Euclidean targets.

