Mass Imputation: Survey Data Integration
- Mass Imputation is a survey data integration method that imputes entire outcome vectors for probability samples using predictive models based on observational big data.
- It relies on key assumptions like ignorability and common support to achieve design-consistent point estimation and valid variance estimation.
- Practical implementations include nearest neighbor, k-nearest neighbor, GAM, and regression calibration approaches to efficiently combine survey and nonprobability data.
Mass Imputation (MI) is a survey data integration technique in which values of a study variable collected in an observational “big data” source are statistically imputed for all units of a probability sample that lacks the variable of interest. Unlike traditional missing data imputation, which fills in sporadic or partially missing items within a dataset, mass imputation fills in the entire outcome vector for the target probability sample, leveraging a predictive relationship established on the nonprobability data. Mass imputation is particularly relevant in finite-population inference when researchers seek to combine design-based probability surveys with large, often nonrepresentative, observational datasets to estimate population totals, means, or distributional quantities. Rigorous versions of the method provide design-consistent point estimation and valid variance estimation under minimal modeling assumptions, provided ignorability and support conditions are met (Yang et al., 2018).
1. Problem Structure and Formal Assumptions
Let $U = \{1, \dots, N\}$ index a finite population, with covariates $x_i$ and a survey outcome $y_i$ for all $i$ in $U$. In the classic mass imputation setup:
- Sample A: A probability sample of size $n_A$, with known inclusion probabilities $\pi_i$, provides observed $x_i$ only.
- Sample B: A large observational sample (nonprobability), size $n_B$, provides joint observations $(x_i, y_i)$, but lacks design weights.
The inferential target is a finite-population mean or sum, such as $\theta_N = N^{-1} \sum_{i \in U} g(y_i)$ for a fixed function $g$.
Key conditions for statistical validity are:
- Ignorability: The conditional distribution in the big data matches the superpopulation: $f(y \mid x, \delta_B = 1) = f(y \mid x)$, where $\delta_B$ indicates membership in Sample B.
- Overlap (Common Support): The design densities $f(x \mid \delta_B = 1)$ and $f(x)$ are mutually absolutely continuous; the density ratio $f(x \mid \delta_B = 1)/f(x)$ is bounded away from 0 and $\infty$ (Yang et al., 2018).
Distinct from conventional missing-data imputation, mass imputation does not attempt to recover $y_i$ for a small subset of item nonrespondents but imputes $y_i^*$ for each element $i$ in Sample A, filling all of $\{y_i^* : i \in A\}$.
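As a concrete illustration of the two-sample structure above, the sketch below simulates a finite population and draws Sample A (probability, $x$ only) and Sample B (nonprobability, $(x, y)$, selection depending on $x$ only, hence ignorable). All distributions and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population U = {1, ..., N} with covariate x and outcome y.
N = 10_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)

# Sample A: Poisson probability sample with known inclusion probabilities pi_i;
# only x is observed for its units.
pi = 0.02 + 0.02 * (x - x.min()) / (x.max() - x.min())  # pi_i in [0.02, 0.04]
in_A = rng.random(N) < pi
x_A, pi_A = x[in_A], pi[in_A]

# Sample B: large nonprobability sample; selection depends on x only
# (ignorable given x), both x and y observed, no design weights.
p_B = 1.0 / (1.0 + np.exp(-(0.5 + x)))  # unknown in practice
in_B = rng.random(N) < p_B
x_B, y_B = x[in_B], y[in_B]

print("n_A =", in_A.sum(), "n_B =", in_B.sum())
```

Note the asymmetry that drives mass imputation: Sample B is much larger than Sample A but carries no design weights, while Sample A is small but has known $\pi_i$.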
2. Mass Imputation Estimators
Several classes of estimators fall under mass imputation, depending on the imputation and prediction strategy:
(A) Nearest Neighbor Imputation (NNI):
For each $i \in A$, define $y_i^* = y_{j(i)}$, where $j(i) = \arg\min_{j \in B} \lVert x_i - x_j \rVert$, and estimate $\hat{\theta}_{\mathrm{NNI}} = \hat{N}^{-1} \sum_{i \in A} \pi_i^{-1} y_{j(i)}$ with $\hat{N} = \sum_{i \in A} \pi_i^{-1}$.
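A minimal sketch of the single-NN estimator, assuming a one-dimensional covariate and the Hájek (ratio) form of the design-weighted mean:

```python
import numpy as np

def nni_estimate(x_A, pi_A, x_B, y_B):
    """Single nearest-neighbor mass imputation (Hajek form).

    Each unit i in the probability sample A borrows y from its nearest
    neighbor j(i) in the big data sample B; the imputed values are then
    averaged with design weights d_i = 1/pi_i.
    """
    x_A, pi_A = np.asarray(x_A, float), np.asarray(pi_A, float)
    x_B, y_B = np.asarray(x_B, float), np.asarray(y_B, float)
    j = np.abs(x_A[:, None] - x_B[None, :]).argmin(axis=1)  # j(i) for each i in A
    d = 1.0 / pi_A
    return float(np.sum(d * y_B[j]) / np.sum(d))

# Toy check: A-units at 0 and 1 match B-units at 0.1 and 0.9.
# nni_estimate([0, 1], [0.5, 0.5], [0.1, 0.9], [10, 20]) -> 15.0
```

The brute-force distance matrix is $O(n_A n_B)$; a k-d tree or similar index would be used at realistic sample sizes.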
(B) $k$-Nearest Neighbor Fractional Imputation:
For each $i \in A$, find the $k$ nearest neighbors $j_1(i), \dots, j_k(i)$ in $B$ and set $y_i^* = k^{-1} \sum_{l=1}^{k} y_{j_l(i)}$, again averaging with design weights $\pi_i^{-1}$ in the final estimator.
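The fractional variant replaces the single donor with the equal-weight average of $k$ donors; a minimal sketch under the same one-dimensional setup:

```python
import numpy as np

def knn_fractional_estimate(x_A, pi_A, x_B, y_B, k=5):
    """k-nearest-neighbor fractional mass imputation: each A-unit's
    imputed value is the equal-weight average of its k closest B-units,
    followed by the design-weighted (Hajek) mean."""
    x_A, pi_A = np.asarray(x_A, float), np.asarray(pi_A, float)
    x_B, y_B = np.asarray(x_B, float), np.asarray(y_B, float)
    # indices of the k smallest distances per row
    nn = np.argsort(np.abs(x_A[:, None] - x_B[None, :]), axis=1)[:, :k]
    y_imp = y_B[nn].mean(axis=1)          # fractional average over k donors
    d = 1.0 / pi_A
    return float(np.sum(d * y_imp) / np.sum(d))

# Toy check: the two nearest donors to x=0 are y=1 and y=2.
# knn_fractional_estimate([0.0], [1.0], [0, 1, 2, 3], [1, 2, 3, 4], k=2) -> 1.5
```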
(C) Generalized Additive Model (GAM) Imputation:
Fit $m(x) = E(y \mid x)$ by an exponential-family GAM (e.g., a link function applied to a sum of smooth terms in the components of $x$), then impute $y_i^* = \hat{m}(x_i)$ for each $i \in A$, giving $\hat{\theta}_{\mathrm{GAM}} = \hat{N}^{-1} \sum_{i \in A} \pi_i^{-1} \hat{m}(x_i)$.
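The structure of model-based MI, fit on B, predict on A, design-weight the predictions, can be sketched with a simple polynomial least-squares fit standing in for the GAM smoother (an illustrative substitute, not the paper's estimator):

```python
import numpy as np

def model_mi_estimate(x_A, pi_A, x_B, y_B, degree=3):
    """Model-based mass imputation: fit m(x) = E(y | x) on sample B,
    impute m(x_i) for every i in A, then take the design-weighted mean.
    A polynomial fit stands in here for the exponential-family GAM."""
    x_A, pi_A = np.asarray(x_A, float), np.asarray(pi_A, float)
    coef = np.polyfit(np.asarray(x_B, float), np.asarray(y_B, float), deg=degree)
    y_imp = np.polyval(coef, x_A)         # hat{m}(x_i) for i in A
    d = 1.0 / pi_A
    return float(np.sum(d * y_imp) / np.sum(d))

# Toy check with an exactly linear m(x) = 2x fitted on B:
# model_mi_estimate([1.5], [1.0], [0, 1, 2, 3], [0, 2, 4, 6], degree=1) -> 3.0
```

In practice any consistent smoother (penalized splines, local regression) can fill the role of the fitted $\hat{m}$.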
(D) Regression-Calibration-Imputed MI:
When sample B membership is known, perform calibration to population-level totals using auxiliary variables and outcomes. Update design weights $\omega_i$, $i \in B$, subject to
$\sum_{i \in B} \omega_i z_i = \sum_{i \in A} \pi_i^{-1} z_i$, where $z_i$ collects the calibration variables,
and set
$\hat{\theta}_{\mathrm{cal}} = \hat{N}^{-1} \sum_{i \in B} \omega_i y_i$.
These approaches generalize Rivers’s (2007) mass-imputation matching estimator (Yang et al., 2018).
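One standard way to satisfy linear calibration equations of the kind used in (D) is a GREG-style multiplicative adjustment, $\omega_i = d_i(1 + z_i^\top \hat{\lambda})$, with $\hat{\lambda}$ solving the constraints. The sketch below is a generic linear calibration routine under that assumption, not the paper's exact algorithm:

```python
import numpy as np

def calibrate_weights(Z_B, d_B, target):
    """Linear (GREG-style) calibration: adjust initial weights d_B on
    sample B so that the weighted totals of the calibration variables Z_B
    (rows = units, columns = variables) hit the target totals, e.g. the
    design-weighted totals computed from sample A.

    Returns w_i = d_i * (1 + z_i' lam), with lam solving the calibration
    equations sum_i d_i z_i z_i' lam = target - sum_i d_i z_i.
    """
    Z_B = np.asarray(Z_B, float)
    d_B = np.asarray(d_B, float)
    target = np.asarray(target, float)
    lam = np.linalg.solve(Z_B.T @ (d_B[:, None] * Z_B), target - Z_B.T @ d_B)
    return d_B * (1.0 + Z_B @ lam)

# Usage: intercept + one covariate, uniform starting weights.
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
w = calibrate_weights(Z, np.ones(3), np.array([4.0, 5.0]))
# The calibrated totals w @ Z now equal the targets [4, 5].
```

By construction the calibrated totals match the targets exactly whenever the solve succeeds, which is the property the calibration equations require.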
3. Theoretical Guarantees and Variance Estimation
Under $n_B \to \infty$ (big data), ignorability, and common support, the MI estimators are:
- Consistent: $\hat{\theta} - \theta_N \to 0$ in probability as $n_A, N \to \infty$
- Asymptotically normal: $\sqrt{n_A}(\hat{\theta} - \theta_N) \to N(0, V)$, where $V$ matches the design-based variance of the Horvitz–Thompson estimator as if $y_i$ were observed in $A$.
Variance estimation is “plug-in”: apply the standard Horvitz–Thompson variance estimator to the imputed values, e.g., $\hat{V} = \hat{N}^{-2} \sum_{i \in A} \sum_{j \in A} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \frac{y_i^*}{\pi_i} \frac{y_j^*}{\pi_j}$, where $\pi_{ij}$ are joint inclusion probabilities.
For $k$-NNI and GAM MI, a second (“model-based”) term enters due to smoothing bias, but for large $n_B$ this is negligible for NNI and minor for $k$-NNI.
Calibration-based MI achieves further variance reduction, with explicit design-based variance expressions involving calibration residuals (Yang et al., 2018).
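The plug-in idea can be sketched with the common with-replacement approximation to the Horvitz–Thompson variance, which avoids joint inclusion probabilities; this is a simplification of the design-exact formula, valid when $n_B$ is large enough that imputation error is negligible:

```python
import numpy as np

def hajek_mean_var(y_imp, pi_A):
    """Plug-in point and variance estimates that treat the imputed values
    y_imp as if they were observed outcomes in sample A.

    Uses the with-replacement approximation to the Horvitz-Thompson
    variance of the Hajek mean (no joint inclusion probabilities needed)."""
    y_imp, pi_A = np.asarray(y_imp, float), np.asarray(pi_A, float)
    d = 1.0 / pi_A
    N_hat = d.sum()
    theta = np.sum(d * y_imp) / N_hat
    n = len(y_imp)
    e = d * (y_imp - theta)                     # design-weighted residuals
    var = (n / (n - 1)) * np.sum(e**2) / N_hat**2
    return float(theta), float(var)
```

Under equal inclusion probabilities this reduces to the familiar $s^2/n$ of the imputed values.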
4. Performance: Simulation Evidence and Empirical Comparison
Simulation studies with a moderate probability-sample size $n_A$ and a much larger big-data size $n_B$, under various scenarios (linear/nonlinear outcome, linear/nonlinear selection into $B$), yield:
| Estimator | Bias | SE | 95% Coverage |
|---|---|---|---|
| Horvitz–Thompson (full sample) | 0.2 | 6.5 | 96.0% |
| NNI | 0.2 | 6.5 | 95.1% |
| $k$-NNI | 0.2 | 4.9 | 96.1% |
| GAM | 0.1 | 4.5 | 95.7% |
| Calibration | 0.0 | 3.2 | 95.5% |
Key conclusions:
- Single-NNI matches the efficiency of the full-sample design estimator when feasible.
- $k$-NNI and GAM imputation achieve further variance reduction.
- Regression calibration is the most efficient, nearly halving SE versus non-calibrated approaches.
- In contrast, inverse-probability weighting (IPW) and doubly robust estimators are highly sensitive to model misspecification and can be severely biased.
5. Connections to Other Methods and Special Cases
Rivers (2007) matching is a special case of single nearest-neighbor mass imputation, and its asymptotic variance is matched by the plug-in Horvitz–Thompson formula assuming the big data are “very large” (Yang et al., 2018). By treating the outputs of mass imputation as completed survey data, standard design-based and model-assisted estimators are accessible.
Calibration extensions leverage auxiliary sample membership indicators, enabling even greater efficiency gains by satisfying calibration equations that align to known population benchmarks.
6. Limitations, Implementation, and Practical Guidance
Key assumptions and limitations:
- Ignorability of the selection into $B$ is critical; the assumption $f(y \mid x, \delta_B = 1) = f(y \mid x)$ must be reasonable.
- Overlap may fail in high dimension, though a large $n_B$ mitigates this via “close” matches in covariate space.
- Variance inflation due to finite $n_B$ and model error is negligible if $n_B$ is large and calibration is used.
- In practice, high-dimensional $x$ can degrade nearest-neighbor matching, though $k$-NNI and model-based imputation (GAM or regression) can help, as can variable selection or sufficient dimension reduction. If $B$’s selection is nonignorable even after conditioning on $x$, further modeling is necessary.
Practical guidance:
- Use the richest possible set of covariates to improve ignorability.
- Employ $k$-NNI, GAM, or calibrated versions for improved efficiency.
- Estimate variance using standard replication (jackknife, bootstrap) or the Horvitz–Thompson plug-in formula; in large-$n_B$ scenarios, imputation error is asymptotically negligible.
- Calibration, if feasible, should be performed whenever sample membership or population totals are known for the auxiliary variables (Yang et al., 2018).
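The replication option above can be sketched as a simple i.i.d. bootstrap of Sample A around single-NN imputation; this ignores stratification and clustering, so a real survey would use a design-consistent replication scheme (e.g., jackknife by PSU) instead:

```python
import numpy as np

def bootstrap_variance(x_A, pi_A, x_B, y_B, n_reps=200, seed=0):
    """Bootstrap variance for NN mass imputation: resample sample A with
    replacement and re-run imputation + estimation on each replicate.
    Sample B is held fixed, consistent with treating n_B as very large."""
    rng = np.random.default_rng(seed)
    x_A, pi_A = np.asarray(x_A, float), np.asarray(pi_A, float)
    x_B, y_B = np.asarray(x_B, float), np.asarray(y_B, float)

    def nni(xa, pia):
        # single nearest-neighbor imputation + Hajek mean
        j = np.abs(xa[:, None] - x_B[None, :]).argmin(axis=1)
        d = 1.0 / pia
        return np.sum(d * y_B[j]) / np.sum(d)

    n = len(x_A)
    reps = [nni(x_A[idx], pi_A[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_reps))]
    return float(np.var(reps, ddof=1))
```

As a sanity check, a constant outcome in Sample B forces every replicate to the same value, so the bootstrap variance is exactly zero.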
Mass Imputation sets a robust, design-based framework for general-purpose survey and big data integration that allows for efficient, statistically valid, and readily implemented finite-population inference under minimal but explicit modeling assumptions.