Survey-Calibrated Distributional Random Forest (SDRF)

Updated 11 December 2025

SDRF is a nonparametric framework that incorporates survey design features into the estimation of conditional laws.
It uses pseudo-population bootstrapping and PSU-level honesty to handle stratified, clustered, and weighted survey samples.
In simulations SDRF attains lower bias and RMSE than unweighted DRF, and it is illustrated on NHANES data; the accompanying theory establishes consistency for both finite-population and super-population targets.

Survey-Calibrated Distributional Random Forest (SDRF) is a forest-based nonparametric methodology for model-free estimation of conditional laws and their functionals, designed to accommodate data from complex survey designs. SDRF builds on distributional random forests (DRF) but incorporates features essential for validity under stratified, clustered, and weighted survey sampling. Leveraging kernel mean embeddings and the Maximum Mean Discrepancy (MMD) criterion, SDRF achieves consistent estimation of conditional distributions and associated functionals—such as means, quantiles, and tolerance regions—under both finite-population and model-based (super-population) inferential regimes. It is constructed using survey-calibrated (pseudo-population) bootstrapping, enforces “PSU-level honesty,” and performs node splits based on an MMD criterion computed from Hájek-type design-weighted node distributions, enabling coherent inference for population-representative targets (Zou et al., 9 Dec 2025).

1. Estimation Targets and Consistency Criteria

Let $(Y,X)$ denote data observed under a complex survey design $\mathbf p$, with $N$ population units, first-order inclusion probabilities $\pi_i$, and survey weights $w_i=1/\pi_i$. SDRF targets the estimation of:

The finite-population conditional law:

$P^N_{Y\mid X=\mathbf x} = \frac{1}{N(\mathbf x)}\sum_{i=1}^N I(X_i = \mathbf x)\delta_{Y_i},$

where $N(\mathbf x) = \sum_{i=1}^N I(X_i = \mathbf x)$.

The super-population conditional law under an i.i.d. model:

$P_{Y\mid X=\mathbf x} = \mathcal L(Y \mid X = \mathbf x).$

Continuous functionals $\Psi(P_{Y\mid X=\mathbf x})$ of either law, including conditional means, quantiles, and covariance operators.
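As a small illustration of the finite-population target, the sketch below evaluates $P^N_{Y\mid X=\mathbf x}$ and its mean on a toy, fully observed population with a discrete covariate. The function name and the toy data are hypothetical and serve only to make the definition concrete; in practice the population is not observed, which is why the estimator is needed.

```python
import numpy as np

def finite_population_conditional_law(X, Y, x):
    """Atoms and probabilities of P^N_{Y|X=x} for a discrete covariate.

    X has shape (N, p), Y has shape (N,), x has shape (p,).
    """
    X = np.asarray(X)
    Y = np.asarray(Y, dtype=float)
    mask = np.all(X == np.asarray(x), axis=1)  # I(X_i = x)
    n_x = int(mask.sum())                      # N(x)
    if n_x == 0:
        raise ValueError("no population unit has X_i = x")
    atoms = Y[mask]
    probs = np.full(n_x, 1.0 / n_x)            # equal mass 1/N(x) on each atom
    return atoms, probs

# Toy population of N = 5 units (illustrative values only).
X_pop = np.array([[0], [0], [1], [1], [1]])
Y_pop = np.array([1.0, 3.0, 2.0, 4.0, 6.0])

atoms, probs = finite_population_conditional_law(X_pop, Y_pop, [1])
cond_mean = float(np.sum(probs * atoms))       # the functional Psi = conditional mean
```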

Estimation is assessed under two forms of consistency:

Design consistency: For (almost every) realized finite population,

$P_{S^N\mid\omega}\left( d_v\left(\widehat\Psi_N(\mathbf x),\,\Psi(P^N_{Y\mid X=\mathbf x})\right) > \varepsilon \right) \to 0 \quad (N\to\infty).$

Model (super-population) consistency:

$P_{\Omega\times\mathcal S^N}\left( d_v\left(\widehat\Psi_N(\mathbf x),\,\Psi(P_{Y\mid X=\mathbf x})\right) > \varepsilon \right) \to 0.$

2. Ensemble Construction: Pseudo-Population Bootstrap and PSU-Level Honesty

Pseudo-Population Bootstrap: SDRF follows the pseudo-population bootstrap strategy of Conti–Mecatti (2020) and Wang–Peng–Kim (2022):

Construct a pseudo-population of size $\widehat N = \sum_{i=1}^{n_s} w_i$ by creating $w_i$ copies of each sampled unit $i$.
Resample from the pseudo-population according to the original survey design $\mathbf p$, yielding bootstrap multipliers $\{n_i^*\}$ with $E(n_i^*|\xi)=1$, bounded moments, and weak dependence.
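The two steps can be sketched under a strong simplifying assumption: that the design is Poisson sampling, so each pseudo-population copy of unit $i$ is re-selected independently with probability $\pi_i = 1/w_i$. The multiplier $n_i^*$ is then the number of selected copies, which has expectation one. The function name is hypothetical, weights are rounded to integers for replication, and a stratified or clustered design would need a resampling step that mirrors that design instead.

```python
import numpy as np

def pseudo_population_multipliers(weights, rng):
    """Bootstrap multipliers n_i^* under an assumed Poisson sampling design."""
    w = np.asarray(weights, dtype=float)
    copies = np.rint(w).astype(int)   # assumption: integer number of replicates
    pi = 1.0 / copies                 # inclusion probability of each copy
    # Number of re-selected copies of unit i: Binomial(copies_i, pi_i), mean 1.
    return rng.binomial(copies, pi)

rng = np.random.default_rng(0)
weights = np.array([2.0, 5.0, 10.0, 4.0])   # illustrative survey weights
N_hat = weights.sum()                        # pseudo-population size

# Monte Carlo check that the multipliers average to one per unit.
draws = np.stack([pseudo_population_multipliers(weights, rng)
                  for _ in range(20000)])
mean_multipliers = draws.mean(axis=0)
```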

PSU-level Honesty: For each tree, SDRF partitions the primary sampling units (PSUs) into a “splitting” sample and an “estimation” sample via independent Bernoulli draws at the PSU level, with probability $q$ of assignment to the splitting sample. All observations within a PSU are allocated together, so the data used to choose splits never overlap with the data used for leaf estimation. This enforces honesty conditional on the survey design and is essential for consistent variance estimation under clustering.
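A minimal sketch of the PSU-level allocation follows. It assumes that $q$ is the probability that a PSU lands in the splitting sample, which matches the $1/(\pi_i q)$ and $1/(\pi_i(1-q))$ weights used later; the function name and the PSU labels are hypothetical.

```python
import numpy as np

def psu_honest_split(psu_ids, q, rng):
    """Return a boolean mask: True if the unit's PSU is in the splitting sample."""
    psu_ids = np.asarray(psu_ids)
    unique_psus, inverse = np.unique(psu_ids, return_inverse=True)
    # One Bernoulli(q) draw per PSU, not per observation.
    psu_in_split = rng.random(unique_psus.size) < q
    # Broadcast the PSU-level decision back to every unit in that PSU.
    return psu_in_split[inverse]

rng = np.random.default_rng(1)
psu_ids = np.array([10, 10, 11, 11, 11, 12, 13, 13])  # illustrative PSU labels
in_split = psu_honest_split(psu_ids, q=0.5, rng=rng)
in_est = ~in_split                                     # estimation sample
```

Drawing at the PSU level rather than the unit level keeps correlated observations from the same cluster on one side of the split/estimation divide.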

3. Split Selection via Design-Weighted MMD on Kernel Mean Embeddings

SDRF conducts splitting using MMD between design-weighted embedded node (child) distributions:

Let $\mathbf k: \mathcal Y \times \mathcal Y \to \mathbb{R}$ be a bounded, $C_0$-universal, positive-definite kernel with RKHS $\mathcal H$ and feature map $\Phi$.
The kernel mean embedding of a measure $P$ is

$\mu_{\mathbf k}(P) = \int_{\mathcal Y} \Phi(y) \, dP(y),$

and the MMD is

$d_{\mathbf k}(P,Q) = \|\mu_{\mathbf k}(P)-\mu_{\mathbf k}(Q)\|_{\mathcal H}.$
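For two weighted empirical measures $P=\sum_i a_i\delta_{y_i}$ and $Q=\sum_j b_j\delta_{z_j}$, the squared MMD expands into Gram-matrix sums, $a^\top K_{yy}a - 2a^\top K_{yz}b + b^\top K_{zz}b$. The sketch below computes it with a Gaussian kernel, which is one bounded kernel commonly used for Euclidean outcomes; the kernel choice, the function names, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    """Gaussian kernel Gram matrix between rows of A (n, d) and B (m, d)."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def weighted_mmd2(Y_p, a, Y_q, b, bandwidth=1.0):
    """Squared MMD between sum_i a_i delta_{Y_p[i]} and sum_j b_j delta_{Y_q[j]}."""
    a = np.asarray(a, dtype=float) / np.sum(a)   # normalize to probability weights
    b = np.asarray(b, dtype=float) / np.sum(b)
    return float(a @ gaussian_gram(Y_p, Y_p, bandwidth) @ a
                 - 2.0 * a @ gaussian_gram(Y_p, Y_q, bandwidth) @ b
                 + b @ gaussian_gram(Y_q, Y_q, bandwidth) @ b)

rng = np.random.default_rng(0)
Y1 = rng.normal(size=(50, 2))
Y2 = Y1 + 3.0                      # same shape, shifted location
ones = np.ones(50)

mmd2_same = weighted_mmd2(Y1, ones, Y1, ones)   # identical measures
mmd2_diff = weighted_mmd2(Y1, ones, Y2, ones)   # clearly separated measures
```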

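For a candidate split $\theta=(j,t)$ with children $C_L, C_R$, design-weighted (Hájek-type) empirical measures are formed from the splitting sample and the bootstrap multipliers:

$\hat P^*_L(\theta) = \frac{1}{\hat N_L} \sum_{i:\, X_i\in C_L(\theta)}\frac{n^*_{b,i}}{\pi_i q}\delta_{Y_i},$

where the sum runs over units in the splitting sample and $\hat N_L = \sum_{i:\, X_i\in C_L(\theta)} n^*_{b,i}/(\pi_i q)$ is the Hájek normalizer, so that $\hat P^*_L(\theta)$ has total mass one. $\hat P^*_R(\theta)$ and $\hat N_R$ are defined analogously for the right child, $\hat N_{P_a}$ denotes the same quantity for the parent node, and the mean embeddings and MMD are computed from these measures.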

The split criterion maximizes the design-weighted MMD:

$M^*_b(\theta) = \frac{\hat N_L\,\hat N_R}{\hat N_{P_a}^{2}} \Bigl\| \mu_{\mathbf k}(\hat P^*_{L}(\theta)) - \mu_{\mathbf k}(\hat P^*_{R}(\theta)) \Bigr\|^2_{\mathcal H}.$

The split maximizing $M^*_b(\theta)$ is selected at each node.
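An exhaustive search over axis-aligned candidates can be sketched as follows. It assumes a Gaussian kernel, takes `w` to be the precomputed design weights $n^*_{b,i}/(\pi_i q)$ of the splitting-sample units in the node, and uses $\hat N_{P_a}=\hat N_L+\hat N_R$; the function names are hypothetical and the search is written for clarity rather than speed.

```python
import numpy as np

def gaussian_gram(Y, bandwidth=1.0):
    """Gaussian kernel Gram matrix of the outcomes Y (n, d)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth**2))

def split_score(K, w, left):
    """Design-weighted criterion (N_L * N_R / N_Pa^2) * MMD^2 for one split."""
    w_left = np.where(left, w, 0.0)
    w_right = np.where(left, 0.0, w)
    n_left, n_right = w_left.sum(), w_right.sum()
    if n_left == 0 or n_right == 0:
        return -np.inf
    diff = w_left / n_left - w_right / n_right   # Hajek-normalized weights
    mmd2 = diff @ K @ diff                       # ||mu_L - mu_R||^2 in the RKHS
    return n_left * n_right / (n_left + n_right)**2 * mmd2

def best_split(X, Y, w, bandwidth=1.0):
    """Search all features j and midpoints t; return (score, j, t)."""
    K = gaussian_gram(Y, bandwidth)
    best = (-np.inf, None, None)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:
            score = split_score(K, w, X[:, j] <= t)
            if score > best[0]:
                best = (score, j, t)
    return best

# Synthetic node: the outcome distribution shifts when feature 0 exceeds 0.5.
rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.linspace(0.0, 1.0, n), rng.random(n)])
Y = np.where(X[:, [0]] > 0.5, 5.0, 0.0) + 0.1 * rng.normal(size=(n, 2))
w = rng.uniform(1.0, 5.0, size=n)   # illustrative design weights

score, j, t = best_split(X, Y, w)
```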

4. SDRF Algorithmic Steps and Prediction

Each SDRF tree is constructed as follows:

Draw pseudo-population bootstrap multipliers.
Randomly assign PSUs to split/estimation groups.
Recursively partition the split-sample according to the MMD criterion (axis-aligned splits), subject to depth and node size regularization.
For each query $\mathbf x$, aggregate over all trees by computing the ensemble weights

$\widehat\omega_i(\mathbf x) = \frac{1}{B} \sum_{b=1}^B \frac{n^*_{b,i}/(\pi_i(1-q))\,I(X_i\in L^*_b(\mathbf x))}{\sum_{j\in D^*_{b,\mathrm{est}}} n^*_{b,j}/(\pi_j(1-q))\,I(X_j\in L^*_b(\mathbf x))}$

and estimate

$\widehat\mu(\mathbf x) = \sum_{i=1}^{n_s} \widehat\omega_i(\mathbf x)\,\Phi(Y_i).$

The induced empirical conditional law is $\widehat P(\cdot|X=\mathbf x) = \sum_i \widehat\omega_i(\mathbf x)\delta_{Y_i}$, from which any continuous functional $\Psi$ may be estimated by plug-in.
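The weight formula can be sketched directly once each tree reports which leaf every unit and the query fall into. The array layout below (`leaf_ids`, `query_leaf`, `mult`, `in_est`) is a hypothetical interface, not one defined by the source, and trees whose query leaf contains no estimation units are skipped, which is an assumption about a case the formula leaves undefined.

```python
import numpy as np

def forest_weights(leaf_ids, query_leaf, mult, pi, in_est, q):
    """Ensemble weights omega_i(x).

    leaf_ids   : (B, n) leaf index of each unit in each tree
    query_leaf : (B,)   leaf index of the query x in each tree
    mult       : (B, n) bootstrap multipliers n*_{b,i}
    pi         : (n,)   first-order inclusion probabilities
    in_est     : (B, n) True if the unit is in the tree's estimation sample
    """
    n_trees, n = leaf_ids.shape
    omega = np.zeros(n)
    used = 0
    for b in range(n_trees):
        in_leaf = leaf_ids[b] == query_leaf[b]
        raw = mult[b] / (pi * (1.0 - q)) * in_est[b] * in_leaf
        total = raw.sum()
        if total > 0:            # assumption: skip trees with an empty leaf
            omega += raw / total
            used += 1
    if used == 0:
        raise ValueError("query leaf is empty in every tree")
    return omega / used

# Two trees, four units (illustrative values only).
leaf_ids = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
query_leaf = np.array([0, 1])
mult = np.ones((2, 4))
pi = np.array([0.5, 0.25, 0.5, 0.25])
in_est = np.ones((2, 4), dtype=bool)
Y = np.array([1.0, 2.0, 3.0, 4.0])

omega = forest_weights(leaf_ids, query_leaf, mult, pi, in_est, q=0.5)
cond_mean = float(omega @ Y)     # plug-in conditional mean
```

Because each tree's weights are normalized before averaging, the ensemble weights sum to one and define a proper probability measure on the observed outcomes.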

5. Theoretical Guarantees and Assumptions

SDRF is analyzed under a set of explicit conditions:

Design: Conditional non-informativeness (D1), stable sampling fraction (D2), bounded inclusion probabilities (D3), controlled second order (D4).
Resampling: Design-adapted pseudo-population bootstrap (R1–R3).
Kernel: Bounded and $C_0$-universal, with $\mathcal Y$ a locally compact Polish space (S1, K1, K2).
Algorithmic regularity: Leaf shape/mass (A1–A3), partition diameter and bootstrap-leaf consistency (B1–B2).

Theorems establish:

Local split consistency: The bootstrap-based split criterion converges to its finite-population and super-population counterparts.
Forest consistency (MMD): The estimated conditional law $\widehat P(\cdot|X=\mathbf x)$ converges in MMD to $P^N_{Y|X=\mathbf x}$ (finite population) and $P_{Y|X=\mathbf x}$ (super-population), respectively.
Plug-in functional consistency: Any continuous functional $\Psi$ of the estimated law converges to its finite-population or model-based target.

Underlying proofs rely on survey-weighted LLN/CLT for the Hájek estimator, uniform convergence of weighted kernel means, argmax-regularity, and risk decompositions for MMD (Zou et al., 9 Dec 2025).

6. Empirical Evaluation

Simulation Design: SDRF is evaluated under a stratified two-stage cluster survey design (multiple strata, first-stage PPS sampling of PSUs, second-stage SRSWOR within PSUs) and a bivariate Gaussian super-population model for outcomes. Each scenario, defined by a population size $N$ and an ensemble size $B$, is replicated 200 times.

Metrics:

MMD distance $d_{\mathbf k}(\widehat P(\cdot\,|\,X), P^N(\cdot\,|\,X))$ on holdout $X$ grids.
MSE of the conditional mean $\Psi(\mathbf x) = E[Y_1|X=\mathbf x]$.
Comparator: standard DRF, which ignores the survey weights and treats the sample as i.i.d.

Results:

The mean and SD of the MMD error decrease with increasing $N$; increasing $B$ mainly reduces the SD.
SDRF attains lower RMSE than DRF for every $(N,B)$ combination considered; DRF exhibits persistent bias, while SDRF’s bias decreases as $N$ and $B$ grow.

7. Application: NHANES Conditional Tolerance Regions

SDRF is applied to NHANES 2011–2012, featuring a multistage survey design.

Outcomes: $Y=(\text{FPG},\text{HbA1c})$; covariates: $X=(\text{Age}, \text{BMI}, \text{Sex})$.
SDRF yields survey-calibrated conditional weights, from which conditional mean $\hat\mu(\mathbf x)$ and covariance $\hat\Sigma(\mathbf x)$ are estimated.
Mahalanobis scores $S(\mathbf{x}, y)$ and survey-weighted empirical quantiles yield tolerance regions:

$\hat C^{\alpha}(\mathbf x) = \{y : S(\mathbf x,y) \le \hat q_{1-\alpha}\},$

facilitating subgroup risk profiling. Tolerance regions are visualized across $(\mathrm{Age},\mathrm{BMI},\mathrm{Sex})$ and compared with ADA diagnostic cutoffs.
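The construction can be sketched as follows, assuming the score is the squared Mahalanobis distance $S(\mathbf x,y)=(y-\hat\mu(\mathbf x))^\top\hat\Sigma(\mathbf x)^{-1}(y-\hat\mu(\mathbf x))$ and that the quantile is taken with respect to the same conditional weights. Both choices are assumptions, since the text does not spell out the score or the weights used for the quantile. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def weighted_quantile(values, weights, level):
    """Smallest value whose cumulative normalized weight reaches `level`."""
    order = np.argsort(values)
    cum = np.cumsum(weights[order]) / weights.sum()
    idx = min(int(np.searchsorted(cum, level, side="left")), len(values) - 1)
    return values[order][idx]

def tolerance_region(Y, omega, alpha):
    """Weighted mean, covariance, score cutoff, and score function at one x."""
    mu = omega @ Y                                   # conditional mean
    centered = Y - mu
    sigma = (omega[:, None] * centered).T @ centered # conditional covariance
    precision = np.linalg.inv(sigma)

    def score(y):
        d = np.atleast_2d(y) - mu
        return np.einsum("ij,jk,ik->i", d, precision, d)

    q_hat = weighted_quantile(score(Y), omega, 1.0 - alpha)
    return mu, sigma, q_hat, score

# Synthetic bivariate outcomes with uniform conditional weights.
rng = np.random.default_rng(0)
Y = rng.multivariate_normal([100.0, 5.5], [[100.0, 2.0], [2.0, 0.25]], size=500)
omega = np.full(500, 1.0 / 500)

mu, sigma, q_hat, score = tolerance_region(Y, omega, alpha=0.1)
coverage = omega[score(Y) <= q_hat].sum()   # weighted mass inside the region
```

A new point `y` belongs to the region when `score(y) <= q_hat`.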

8. Practical Implementation and Limitations

Hyperparameter choices: Number of trees $B\in[50,200]$; PSU subsampling fraction $q\in[0.5,0.8]$; minimum node size $\max\{20, \lceil\sqrt{n_s}\rceil\}$; depth restricted to ensure regular leaf shapes.
Kernel bandwidth: Median heuristic applied to the pairwise distances $\|Y_i-Y_j\|$.
Complexity: Each tree evaluates $\mathcal O(n_s \log n_s)$ splits, and each split incurs $\mathcal O(n_\mathrm{node})$ kernel sums; the pseudo-population bootstrap adds $\mathcal O(n_s)$ overhead per tree.
Extensions: Accommodates non-Euclidean $\mathcal Y$ and functional outcomes via appropriate kernels; outcome-dependent sampling requires modified assumptions.
Current limitations: Finite-sample variability may benefit from post-forest smoothing in $X$. Construction of confidence bands and comprehensive uncertainty quantification under survey designs remain areas of ongoing research.
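Two of the choices listed above reduce to one-line computations: the minimum node size rule and the median-heuristic bandwidth. The sketch below implements both; the function names are hypothetical, and the bandwidth is taken as the unweighted median of pairwise Euclidean distances, which is one common reading of the median heuristic.

```python
import math
import numpy as np

def min_node_size(n_s):
    """Minimum node size max{20, ceil(sqrt(n_s))} for a sample of size n_s."""
    return max(20, math.ceil(math.sqrt(n_s)))

def median_heuristic_bandwidth(Y):
    """Median of the pairwise Euclidean distances ||Y_i - Y_j||, i < j."""
    Y = np.asarray(Y, dtype=float)
    diffs = Y[:, None, :] - Y[None, :, :]
    dists = np.sqrt(np.sum(diffs**2, axis=-1))
    upper = np.triu_indices(len(Y), k=1)   # each unordered pair once
    return float(np.median(dists[upper]))

Y_toy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
bandwidth = median_heuristic_bandwidth(Y_toy)   # pair distances: 5, 10, 5
small = min_node_size(100)
large = min_node_size(10000)
```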

SDRF demonstrates the feasibility and statistical validity of survey-aware, model-free conditional distribution estimation in complex survey contexts, addressing both algorithmic and inferential challenges associated with finite-population and super-population targets (Zou et al., 9 Dec 2025).


References

Zou et al. (2025). Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces.
