Survey-Calibrated Distributional Random Forest (SDRF)
- The SDRF method provides a robust nonparametric framework that integrates survey design features for accurate estimation of conditional laws.
- It employs pseudo-population bootstrapping and PSU-level honesty to handle stratified, clustered, and weighted survey samples effectively.
- In simulations, SDRF achieves lower RMSE than unweighted DRF and its bias shrinks as the population grows; theoretical results establish consistency in both finite-population and super-population settings, and an NHANES application illustrates its use.
Survey-Calibrated Distributional Random Forest (SDRF) is a forest-based nonparametric methodology for model-free estimation of conditional laws and their functionals, designed to accommodate data from complex survey designs. SDRF builds on distributional random forests (DRF) but incorporates features essential for validity under stratified, clustered, and weighted survey sampling. Leveraging kernel mean embeddings and the Maximum Mean Discrepancy (MMD) criterion, SDRF achieves consistent estimation of conditional distributions and associated functionals—such as means, quantiles, and tolerance regions—under both finite-population and model-based (super-population) inferential regimes. It is constructed using survey-calibrated (pseudo-population) bootstrapping, enforces “PSU-level honesty,” and performs node splits based on an MMD criterion computed from Hájek-type design-weighted node distributions, enabling coherent inference for population-representative targets (Zou et al., 9 Dec 2025).
1. Estimation Targets and Consistency Criteria
Let $\{(X_i, Y_i)\}_{i \in S}$ denote data observed under a complex survey design $p(\cdot)$, with $N$ population units, first-order inclusion probabilities $\pi_i$, and survey weights $w_i$. SDRF targets the estimation of:
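- The finite-population conditional law $P_N(\cdot \mid x)$ of $Y$ given $X = x$, defined on the realized population of $N$ units.
- The super-population conditional law $P(\cdot \mid x)$ of $Y$ given $X = x$ under an i.i.d. model for $(X_i, Y_i)$.
- Continuous functionals of these laws, including conditional means, quantiles, and covariance operators.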
Estimation is assessed under two forms of consistency:
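- Design consistency: For (almost every) realized finite population, the estimated law $\hat{P}(\cdot \mid x)$ converges to the finite-population conditional law $P_N(\cdot \mid x)$, that is, $\mathrm{MMD}(\hat{P}(\cdot \mid x), P_N(\cdot \mid x)) \to 0$ in probability under the sampling design.
- Model (super-population) consistency: $\mathrm{MMD}(\hat{P}(\cdot \mid x), P(\cdot \mid x)) \to 0$ in probability, where $P(\cdot \mid x)$ is the super-population conditional law.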
2. Ensemble Construction: Pseudo-Population Bootstrap and PSU-Level Honesty
Pseudo-Population Bootstrap: SDRF follows the pseudo-population bootstrap strategy of Conti–Mecatti (2020) and Wang–Peng–Kim (2022):
- Construct a pseudo-population of size approximately $N$ by replicating each sampled unit $i$ according to its survey weight $w_i$.
- Resample from the pseudo-population according to the original survey design $p(\cdot)$, yielding bootstrap multipliers $\xi_i$ (one per sampled unit) with bounded moments and weak dependence.
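The two bootstrap steps above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not the authors' implementation: the function name `pseudo_population_multipliers` is hypothetical, weights are assumed to be at least 1, non-integer weights are handled by randomized rounding, and the "original design" is mimicked by drawing pseudo-units with probability proportional to $1/w_i$ with replacement. A faithful version would re-run the actual stratified, clustered design.

```python
import numpy as np

def pseudo_population_multipliers(weights, rng):
    """Return one bootstrap multiplier per sampled unit (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    n = weights.size
    # Step 1: replicate unit i about w_i times (randomized rounding of w_i).
    floor = np.floor(weights)
    copies = (floor + (rng.random(n) < weights - floor)).astype(int)
    # owner[j] is the index of the sampled unit that pseudo-unit j copies.
    owner = np.repeat(np.arange(n), copies)
    # Step 2 (ASSUMPTION): mimic the design by drawing n pseudo-units with
    # probability proportional to 1 / w_i, with replacement for simplicity.
    p = 1.0 / weights[owner]
    p /= p.sum()
    draws = rng.choice(owner, size=n, replace=True, p=p)
    # The multiplier of unit i is the number of times its copies were drawn.
    return np.bincount(draws, minlength=n).astype(float)

rng = np.random.default_rng(0)
w = np.array([1.5, 2.0, 10.0, 4.25, 3.0])
xi = pseudo_population_multipliers(w, rng)
```

Under this simplified scheme each unit's copies carry total selection probability close to $1/n$, so the multipliers sum to $n$ and average about one per unit.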
PSU-level Honesty: For each tree, SDRF splits primary sampling units (PSUs) into “splitting” and “estimation” samples via independent Bernoulli draws at the PSU level, using a fixed assignment probability. All observations within a PSU are allocated together, so the data used to choose splits never overlap with the data used for leaf estimation, and no cluster is divided between the two roles. This enforces honesty conditional on the survey design and is essential for consistent variance estimation under clustering.
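A minimal sketch of the PSU-level assignment described above, assuming each observation carries a PSU identifier. The function name `psu_honest_split` and the argument `split_prob` are illustrative choices, not names from the source.

```python
import numpy as np

def psu_honest_split(psu_ids, split_prob, rng):
    """Assign whole PSUs to the splitting or estimation sample.

    Returns two boolean masks over observations (splitting, estimation).
    """
    psu_ids = np.asarray(psu_ids)
    # inverse[i] is the position of observation i's PSU in the unique list.
    psus, inverse = np.unique(psu_ids, return_inverse=True)
    # One independent Bernoulli draw per PSU, not per observation.
    psu_to_split = rng.random(psus.size) < split_prob
    in_split = psu_to_split[inverse]
    return in_split, ~in_split

psu = np.array([0, 0, 1, 1, 1, 2, 3, 3])
split_mask, est_mask = psu_honest_split(psu, 0.5, np.random.default_rng(1))
```

Because the draw is made once per PSU and then broadcast to its members, every observation in a cluster lands in the same sample.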
3. Split Selection via Design-Weighted MMD on Kernel Mean Embeddings
SDRF conducts splitting using MMD between design-weighted embedded node (child) distributions:
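- For a kernel $k$ on the outcome space (bounded, $c_0$-universal, positive definite), with RKHS $\mathcal{H}$ and feature map $\varphi(y) = k(y, \cdot)$.
- The kernel mean embedding for a measure $P$ is
$$\mu_P = \int \varphi(y)\, dP(y) \in \mathcal{H},$$
and the MMD is
$$\mathrm{MMD}(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}}.$$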
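- For a candidate split with children $C_L, C_R$, design-weighted (Hájek-type) empirical measures are formed using the split-sample and bootstrap multipliers:
$$\hat{P}_{C} = \frac{\sum_{i \in C} w_i \xi_i \, \delta_{Y_i}}{\sum_{i \in C} w_i \xi_i}, \qquad C \in \{C_L, C_R\},$$
where $w_i$ are the survey weights and $\xi_i$ the bootstrap multipliers, with corresponding mean embeddings $\hat{\mu}_{C}$ in the RKHS $\mathcal{H}$ and MMD.
- The split criterion maximizes the design-weighted MMD:
$$\widehat{\mathrm{MMD}}(C_L, C_R) = \| \hat{\mu}_{C_L} - \hat{\mu}_{C_R} \|_{\mathcal{H}}.$$
The split maximizing $\widehat{\mathrm{MMD}}(C_L, C_R)$ is selected at each node.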
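To make the split criterion concrete, the sketch below computes the squared MMD between two Hájek-normalized child measures from a Gram matrix and scans thresholds on a single covariate. Maximizing the squared MMD is equivalent to maximizing the MMD itself. The Gaussian kernel, the function names, and the per-unit mass `a` (for example, survey weight times bootstrap multiplier) are assumptions made for illustration; the source does not specify these details.

```python
import numpy as np

def gaussian_gram(Y, bandwidth):
    """Gram matrix of a Gaussian kernel on the outcomes (assumed kernel)."""
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def weighted_mmd2(K, a, left):
    """Squared MMD between the normalized weighted measures of two children.

    K: Gram matrix, a: nonnegative unit masses, left: boolean mask of the
    left child (the right child is its complement).
    """
    p_left = np.where(left, a, 0.0)
    p_right = np.where(left, 0.0, a)
    d = p_left / p_left.sum() - p_right / p_right.sum()
    # ||mu_L - mu_R||^2 in the RKHS equals the quadratic form d' K d.
    return float(d @ K @ d)

def best_threshold(x, K, a):
    """Scan axis-aligned thresholds on one covariate; return (threshold, criterion)."""
    best = (None, -np.inf)
    for t in np.unique(x)[:-1]:  # dropping the maximum keeps the right child nonempty
        left = x <= t
        if a[left].sum() <= 0 or a[~left].sum() <= 0:
            continue  # skip children that carry no weighted mass
        crit = weighted_mmd2(K, a, left)
        if crit > best[1]:
            best = (float(t), crit)
    return best

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
Y = np.where(x <= 0.5, 0.0, 5.0) + 0.1 * rng.standard_normal(40)
K = gaussian_gram(Y, 1.0)
a = np.ones(40)
threshold, criterion = best_threshold(x, K, a)
```

In this toy example the outcome distribution shifts at $x = 0.5$, so the selected threshold falls next to that change point. Precomputing the Gram matrix once per node keeps each candidate evaluation to a single quadratic form.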
4. SDRF Algorithmic Steps and Prediction
Each SDRF tree is constructed as follows:
- Draw pseudo-population bootstrap multipliers.
- Randomly assign PSUs to split/estimation groups.
- Recursively partition the split-sample according to the MMD criterion (axis-aligned splits), subject to depth and node size regularization.
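- For each query point $x$, aggregate over all $B$ trees: compute ensemble weights
$$\alpha_i(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{w_i \, \mathbf{1}\{i \in \mathcal{E}_b,\ X_i \in L_b(x)\}}{\sum_{j} w_j \, \mathbf{1}\{j \in \mathcal{E}_b,\ X_j \in L_b(x)\}},$$
where $w_i$ are the survey weights, $\mathcal{E}_b$ is the estimation sample of tree $b$, and $L_b(x)$ is the leaf of tree $b$ containing $x$, and estimate the conditional mean embedding by
$$\hat{\mu}(x) = \sum_{i} \alpha_i(x)\, k(Y_i, \cdot),$$
with $k$ the outcome kernel.
The induced empirical conditional law is $\hat{P}(\cdot \mid x) = \sum_i \alpha_i(x)\, \delta_{Y_i}$, from which any continuous functional may be estimated in plug-in fashion.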
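The aggregation and plug-in steps can be illustrated as follows. The sketch assumes the fitted trees are summarized by three arrays: `leaf_ids[b, i]` (leaf of unit `i` in tree `b`), `query_leaf[b]` (leaf containing the query), and `est_mask[b, i]` (whether unit `i` is in tree `b`'s estimation sample). These names and this data layout are hypothetical, and using only the survey weights inside each leaf is an assumption.

```python
import numpy as np

def forest_weights(leaf_ids, query_leaf, est_mask, w):
    """Average, over trees, the weight-normalized leaf membership of each unit."""
    n_trees, n = leaf_ids.shape
    alpha = np.zeros(n)
    used = 0
    for b in range(n_trees):
        in_leaf = (leaf_ids[b] == query_leaf[b]) & est_mask[b]
        mass = w[in_leaf].sum()
        if mass > 0:  # trees whose query leaf holds no estimation units are skipped
            alpha[in_leaf] += w[in_leaf] / mass
            used += 1
    return alpha / max(used, 1)

def weighted_quantile(y, alpha, q):
    """Smallest y whose weighted empirical CDF reaches level q."""
    order = np.argsort(y)
    cdf = np.cumsum(alpha[order])
    idx = min(np.searchsorted(cdf, q * cdf[-1]), len(y) - 1)
    return y[order][idx]

leaf_ids = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
query_leaf = np.array([0, 1])
est_mask = np.array([[True, True, True, True], [True, False, True, True]])
w = np.array([1.0, 3.0, 2.0, 2.0])
y = np.array([1.0, 2.0, 3.0, 4.0])

alpha = forest_weights(leaf_ids, query_leaf, est_mask, w)
cond_mean = alpha @ y                          # plug-in conditional mean
cond_median = weighted_quantile(y, alpha, 0.5)  # plug-in conditional median
```

Once the weights are available, any functional of the weighted empirical law (means, quantiles, covariances) is a one-line computation.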
5. Theoretical Guarantees and Assumptions
SDRF is analyzed under a set of explicit conditions:
- Design: Conditional non-informativeness (D1), stable sampling fraction (D2), bounded inclusion probabilities (D3), controlled second order (D4).
- Resampling: Design-adapted pseudo-population bootstrap (R1–R3).
- Kernel: Bounded, $c_0$-universal kernel on a locally compact Polish outcome space (S1, K1, K2).
- Algorithmic regularity: Leaf shape/mass (A1–A3), partition diameter and bootstrap-leaf consistency (B1–B2).
Theorems establish:
- Local split consistency: The bootstrap-based split criterion converges to its finite-population and super-population counterparts.
- Forest consistency (MMD): The estimated conditional law converges in MMD to the finite-population conditional law and to the super-population conditional law, respectively.
- Plug-in functional consistency: Any continuous functional of the estimated law converges to its finite-population or model-based target.
Underlying proofs rely on survey-weighted LLN/CLT for the Hájek estimator, uniform convergence of weighted kernel means, argmax-regularity, and risk decompositions for MMD (Zou et al., 9 Dec 2025).
6. Empirical Evaluation
Simulation Design: SDRF is evaluated under a stratified two-stage cluster survey design (multiple strata, first-stage PPS sampling of PSUs, second-stage SRSWOR within PSUs) and a bivariate Gaussian super-population model for outcomes. Population sizes and ensemble sizes are varied, with 200 replicates per scenario.
Metrics:
- MMD distance on holdout grids.
- MSE of the estimated conditional mean $\mathbb{E}[Y \mid X = x]$.
- SDRF is compared to DRF (unweighted, i.i.d.).
Results:
- MMD error mean and SD decrease with increasing population size; increasing the number of trees mainly reduces SD.
- SDRF attains uniformly lower RMSE than DRF across the settings considered; DRF exhibits persistent bias, while SDRF’s bias decreases as the population size grows.
7. Application: NHANES Conditional Tolerance Regions
SDRF is applied to NHANES 2011–2012, featuring a multistage survey design.
- Outcomes: ; covariates: .
- SDRF yields survey-calibrated conditional weights, from which conditional mean and covariance are estimated.
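- Mahalanobis scores and survey-weighted empirical quantiles yield tolerance regions:
$$\hat{R}_{\tau}(x) = \{\, y : (y - \hat{m}(x))^{\top} \hat{\Sigma}(x)^{-1} (y - \hat{m}(x)) \le \hat{q}_{\tau}(x) \,\},$$
where $\hat{m}(x)$ and $\hat{\Sigma}(x)$ are the estimated conditional mean and covariance and $\hat{q}_{\tau}(x)$ is the survey-weighted empirical $\tau$-quantile of the Mahalanobis scores, facilitating subgroup risk profiling. Tolerance regions are visualized across covariate values and compared with ADA diagnostic cutoffs.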
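The tolerance-region construction can be sketched as below, assuming conditional weights for a query point are already available (here called `weights`, summing to one). The function names and the synthetic data are illustrative only; no NHANES values are used.

```python
import numpy as np

def tolerance_region(Y, weights, level):
    """Weighted mean, covariance, and squared Mahalanobis radius at a given level.

    Y: (n, d) outcomes; weights: (n,) conditional weights summing to one;
    level: target content, e.g. 0.9.
    """
    mean = weights @ Y
    centered = Y - mean
    cov = (centered * weights[:, None]).T @ centered
    precision = np.linalg.inv(cov)
    # Squared Mahalanobis score of every observation.
    scores = np.einsum("ij,jk,ik->i", centered, precision, centered)
    # Weighted empirical quantile of the scores.
    order = np.argsort(scores)
    cdf = np.cumsum(weights[order])
    idx = min(np.searchsorted(cdf, level * cdf[-1]), len(scores) - 1)
    return mean, cov, scores[order][idx]

def in_region(y, mean, cov, radius2):
    """True if y lies inside the ellipsoidal tolerance region."""
    diff = np.asarray(y, dtype=float) - mean
    return float(diff @ np.linalg.solve(cov, diff)) <= radius2

rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 2))      # synthetic bivariate outcomes
weights = np.full(500, 1.0 / 500)      # stand-in for forest weights at one query
mean, cov, radius2 = tolerance_region(Y, weights, 0.9)
coverage = sum(weights[i] for i in range(500) if in_region(Y[i], mean, cov, radius2))
```

The region is an ellipsoid centered at the weighted mean; its weighted in-sample content is close to the requested level by construction.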
8. Practical Implementation and Limitations
- Hyperparameter choices: Number of trees, PSU subsampling fraction, minimum node size, and a depth restriction that ensures regular leaf shapes.
- Kernel bandwidth: Median heuristic applied to pairwise outcome distances $\|Y_i - Y_j\|$.
- Complexity: Each tree evaluates a set of candidate splits, and each split requires kernel sums over the observations in the node; the pseudo-population bootstrap adds further overhead per tree.
- Extensions: Accommodates non-Euclidean and functional outcomes via appropriate kernels; outcome-dependent sampling requires modified assumptions.
- Current limitations: Finite-sample variability may benefit from post-forest smoothing in $x$. Construction of confidence bands and comprehensive uncertainty quantification under survey designs is noted as an ongoing area of research.
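For the kernel bandwidth item above, a common form of the median heuristic sets the bandwidth to the median pairwise Euclidean distance between outcomes. The sketch below is one such variant; the subsampling cap `max_points` is an added convenience for large samples and is not taken from the source.

```python
import numpy as np

def median_heuristic_bandwidth(Y, max_points=1000, rng=None):
    """Median of pairwise Euclidean distances between outcome vectors."""
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    if len(Y) > max_points:
        # Subsample to keep the pairwise-distance matrix small.
        rng = np.random.default_rng() if rng is None else rng
        Y = Y[rng.choice(len(Y), size=max_points, replace=False)]
    dists = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    upper = np.triu_indices(len(Y), k=1)  # each unordered pair counted once
    return float(np.median(dists[upper]))

bandwidth = median_heuristic_bandwidth(np.array([0.0, 1.0, 3.0]))
```

For the three points 0, 1, and 3 the pairwise distances are 1, 3, and 2, so the heuristic returns their median.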
SDRF demonstrates the feasibility and statistical validity of survey-aware, model-free conditional distribution estimation in complex survey contexts, addressing both algorithmic and inferential challenges associated with finite-population and super-population targets (Zou et al., 9 Dec 2025).