Robust Model Merging with Bagging (BOOM)
- The paper introduces BOOM, a method that employs bagging to generate diverse models and merging strategies to fuse them, reducing variance under model misspecification.
- It integrates techniques from frequentist mixture modeling, Bayesian posterior inference, and deep neural model merging to enhance robustness and improve uncertainty calibration.
- Empirical results highlight significant gains, including quadrupled goodness-of-fit $p$-values and reduced MSE, demonstrating BOOM’s effectiveness in small-sample and high-variance regimes.
Bagging-Based Robust Model Merging (BOOM) refers to a family of methodologies that use bagging principles—i.e., bootstrap aggregation—combined with rigorous merging strategies to produce robust probabilistic, Bayesian, or deep learning models. The approach has found realization in frequentist mixture modeling, Bayesian inference, empirical Bayes estimation, and large-scale neural model training. BOOM’s central innovation is to generate multiple models via bootstrap or subsample variation and to merge them, either in distribution or parameter space, to enhance robustness against model misspecification, instability, and domain shift.
1. Core Principles and Mathematical Foundations
BOOM generally leverages the following pipeline:
- Bagging: Generate replicates of the data set (either by nonparametric bootstrapping or subsampling) and independently fit a base model to each replicate.
- Merging: Fuse these base models, either by averaging predictions/distributions (density/marginal/posteriors) or by direct parameter-space fusion (e.g., weighted averaging of neural net weights).
- Robustness Objective: Replace overoptimistic or unstable single-model estimates with ensemble- or bootstrapped-aggregated results, reducing variance and increasing distributional robustness under model/data misspecification.
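A minimal end-to-end sketch of this three-step pipeline, using ordinary least squares as a stand-in base model and a uniform parameter average as the merge rule (all function names and constants here are illustrative, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    # Base model: ordinary least-squares coefficients.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def boom_merge(X, y, M=20):
    # Bagging: fit the base model to M bootstrap replicates,
    # then merge in parameter space by uniform averaging.
    n = len(y)
    thetas = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)       # nonparametric bootstrap
        thetas.append(fit_ols(X[idx], y[idx]))
    return np.mean(thetas, axis=0)             # parameter-space merge

# Toy data: y = 2*x + noise, with an intercept column.
X = np.c_[np.ones(50), np.linspace(0.0, 1.0, 50)]
y = 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)
theta_boom = boom_merge(X, y)
```

Because each bootstrap fit is independent, the loop parallelizes trivially; only the final averaging step couples the replicates.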
The mathematical implementation of BOOM varies by statistical paradigm:
- Mixture Models: Form a convex combination $f_{\mathrm{mix}}(x) = \sum_{k=1}^{K} \pi_k f_k(x)$, with $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$; maximize an external robustness criterion (e.g., a goodness-of-fit $p$-value) rather than the pure likelihood (Adnan et al., 2021).
- Bayesian Inference: Compute bagged posteriors $p^*(\theta \mid x_{1:n}) := \frac{1}{B} \sum_{b=1}^{B} p(\theta \mid x^{(b)})$, where $x^{(b)}$ is a bootstrap sample of the data; merge by averaging the posteriors (Huggins et al., 2019, Huggins et al., 2020).
- Empirical Bayes: Apply bagging to hyperparameter estimation and average plug-in posterior means (Sugasawa, 2017).
- Deep Learning/Representation Models: Train multiple models on diverse data subsamples and merge parameters (e.g., weight averaging or Multi-SLERP) into a single model (Zhang et al., 5 Feb 2026).
2. General BOOM Algorithmic Frameworks
Frequentist Context: Mixture Probabilistic Models
In the convex mixture setting, given data $x_1, \dots, x_n$, one selects base densities $f_1, \dots, f_K$, each fit to a bootstrap replicate of the data. The mixture weights $\pi_1, \dots, \pi_K$ are then optimized, typically by coordinate ascent, to maximize a robustness criterion (e.g., the $p$-value from a goodness-of-fit test) rather than the likelihood. The process iterates until convergence, yielding $\hat{f}_{\mathrm{mix}}(x) = \sum_{k=1}^{K} \hat{\pi}_k f_k(x)$ (Adnan et al., 2021).
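A toy sketch of this frequentist scheme, assuming two base families (a normal and a Student-t, each fit to its own bootstrap replicate) and the Kolmogorov-Smirnov $p$-value as the robustness criterion; the specific families and the one-dimensional grid search are illustrative simplifications of the coordinate ascent described above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=400)            # heavy-tailed sample

# Each base density is fit to its own bootstrap replicate of the data.
xb1 = rng.choice(x, size=x.size, replace=True)
xb2 = rng.choice(x, size=x.size, replace=True)
norm_params = stats.norm.fit(xb1)             # (loc, scale)
t_params = stats.t.fit(xb2)                   # (df, loc, scale)

def mix_cdf(z, pi):
    # CDF of the convex combination pi * Normal + (1 - pi) * Student-t.
    return pi * stats.norm.cdf(z, *norm_params) + \
           (1 - pi) * stats.t.cdf(z, *t_params)

# One-dimensional grid search (a degenerate coordinate search) for the
# weight maximizing the Kolmogorov-Smirnov goodness-of-fit p-value.
grid = np.linspace(0.0, 1.0, 21)
pvals = [stats.kstest(x, lambda z, p=p: mix_cdf(z, p)).pvalue for p in grid]
pi_hat = grid[int(np.argmax(pvals))]
```

With $K > 2$ components, the same search would cycle over one weight at a time on the simplex, which is the coordinate-ascent form used in the mixture setting.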
Bayesian and Empirical Bayes Inference
For general posterior inference, the BOOM principle ("BayesBag") is to average posteriors over bootstrap datasets:

$$p^*(\theta \mid x_{1:n}) = \frac{1}{B} \sum_{b=1}^{B} p\big(\theta \mid x^{(b)}\big),$$

where $x^{(b)}$ denotes the $b$-th bootstrap sample. For model selection, the process aggregates posterior model probabilities over bootstraps, dampening overconfident or unstable selections when models are nearly indistinguishable (Huggins et al., 2020, Huggins et al., 2019).
In empirical Bayes, the BOOM estimator averages the plug-in estimates produced by fitting hyperparameters on bootstrap replicates, with an expected MSE guaranteed not to exceed the average MSE of the individual plug-in estimators (Sugasawa, 2017).
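A minimal BayesBag sketch for a conjugate normal-mean model (the model and all constants are illustrative); because the bagged posterior is an equal-weight average of the $B$ bootstrap posteriors, sampling from it amounts to pooling draws from each component posterior:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=100)   # observed data

sigma2, mu0, tau2 = 4.0, 0.0, 10.0             # known var, prior mean/var
B = 50                                          # number of bootstrap datasets

def conjugate_posterior(data):
    # Normal-mean posterior with known variance (standard conjugate update).
    n = data.size
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + data.sum() / sigma2)
    return post_mean, post_var

# Pool draws from the B bootstrap posteriors = draws from the bagged posterior.
draws = []
for _ in range(B):
    xb = rng.choice(x, size=x.size, replace=True)
    m, v = conjugate_posterior(xb)
    draws.append(rng.normal(m, np.sqrt(v), size=200))
bagged = np.concatenate(draws)
```

The pooled sample is wider than any single bootstrap posterior, reflecting the variance inflation that gives BayesBag its robustness under misspecification.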
Deep Learning: Robust Model Merging for Representation Learning
BOOM in high-dimensional neural models involves separately training multiple models on mutually independent subsampled datasets, then merging parameter-wise into a single model $\theta_{\mathrm{BOOM}} = \sum_{m=1}^{M} \alpha_m \theta_m$, where $\alpha_m \ge 0$ and $\sum_{m=1}^{M} \alpha_m = 1$. The merge weights can be tuned on a validation set for out-of-domain (OOD) robustness (Zhang et al., 5 Feb 2026).
Generic Pseudocode for BOOM (Bayesian/Deep Learning context):
```python
thetas = []
for m in range(M):
    D_m = subsample_or_bootstrap(full_data)    # bagging: bootstrap or subsample
    thetas.append(train_model(D_m))            # fit each base model independently
theta_BOOM = merge_parameters(thetas, alphas)  # parameter-space merge
```
3. Optimization, Statistical Guarantees, and Theoretical Properties
Theoretical properties of BOOM approaches vary with context:
- Simplex Constraint and Non-Concavity: In mixture modeling, the weights $\pi$ are constrained to the simplex, but the optimization criterion (e.g., the $p$-value) is not necessarily concave in $\pi$. No global convergence is guaranteed beyond local stationary points of the coordinate search (Adnan et al., 2021).
- Asymptotic Distributional Theory: For bagged posteriors, under regularity conditions, the BOOM posterior converges to a normal distribution with variance inflated by the bootstrap size, yielding better uncertainty calibration and sandwich-type variance correction under model misspecification (Huggins et al., 2019).
- MSE Reduction for EB: Bagged empirical Bayes estimators are theoretically guaranteed to have integrated mean squared error not exceeding the average MSE of the constituent estimators, $\mathrm{MSE}\big(\tfrac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{(b)}\big) \le \tfrac{1}{B}\sum_{b=1}^{B}\mathrm{MSE}\big(\hat{\theta}^{(b)}\big)$, a consequence of the convexity of squared-error loss (Sugasawa, 2017).
- No Oracle Bounds: No explicit error or oracle bounds are generally provided in current BOOM literature; empirical improvement is the main evidence.
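The MSE-reduction property follows from convexity of squared error, so it holds in every replication, not just on average; a quick numerical check, with sample means standing in for plug-in empirical Bayes estimators (an illustrative simplification):

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, n, B, reps = 1.0, 30, 25, 500

mse_of_average, average_mse = 0.0, 0.0
for _ in range(reps):
    x = rng.normal(theta_true, 1.0, size=n)
    # Plug-in estimators from B bootstrap replicates (sample means here).
    ests = np.array([rng.choice(x, n, replace=True).mean() for _ in range(B)])
    mse_of_average += (ests.mean() - theta_true) ** 2  # bagged estimator error
    average_mse += ((ests - theta_true) ** 2).mean()   # average component error
mse_of_average /= reps
average_mse /= reps
# By Jensen's inequality, mse_of_average <= average_mse in every replication.
```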
4. Practical Implementations and Empirical Results
BOOM has been validated in a range of statistical and machine learning contexts:
| Context | BOOM Implementation | Empirical Result Highlights |
|---|---|---|
| Mixtures/Frequentist | Convex mixtures, optimizing the $p$-value over the simplex | Substantially increases goodness-of-fit $p$-value; e.g., quadruples the $p$-value versus a single model on extreme temperature data (Adnan et al., 2021) |
| Bayesian Inference | Bagged posteriors (BayesBag) | Improves reproducibility, uncertainty calibration, and predictive accuracy under misspecification (Huggins et al., 2019, Huggins et al., 2020) |
| Empirical Bayes | Bootstrap-averaged plug-in estimators | Uniformly lowers MSE, especially for small-sample/high-shrinkage settings (Sugasawa, 2017) |
| Deep Text Embeddings | Merge parameter vectors of M models, OOD-tuned weights | Increases MTEB (Eng) and RTEB scores, reduces GPU training time by 60–80%; improves OOD generalization (Zhang et al., 5 Feb 2026) |
A recurring finding is that BOOM methods provide the greatest benefit in small-sample, high-variance, or model-misspecified regimes, while incurring minimal loss (sometimes slightly more conservative inference) in well-specified or large-sample settings.
5. Extensions, Limitations, and Best Practices
Extensions:
- Applicability spans empirical Bayes, nonparametric Bayesian models, mixed-effects and hierarchical models, deep neural networks, disease mapping, model selection, and real-world high-dimensional regression.
- In time-series or spatial contexts, block-bootstraps can replace i.i.d. resampling.
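A moving-block bootstrap that could replace the i.i.d. resampling step for serially dependent data (the block length and helper name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def block_bootstrap(x, block_len=10):
    # Moving-block bootstrap: resample overlapping blocks and concatenate
    # them, preserving short-range serial dependence within each block.
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [x[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]

# Toy autocorrelated series (a scaled random walk).
x = 0.1 * np.cumsum(rng.normal(size=200))
xb = block_bootstrap(x)
```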
Limitations:
- BOOM approaches incur additional computational cost (a factor of $B$ in Bayesian/EB settings, proportional to the number of component models $M$ in deep models), though all model fits are embarrassingly parallel.
- Optimization is often brute-force or coordinate-based; no universally efficient convex optimization algorithm is available for non-likelihood targets.
- Some techniques assume approximate normality for calibration indices; performance in multimodal or singular models may not align with theory.
Recommended guidelines:
- Select $B$ (the number of bootstrap replicates) between $100$ and $200$ for Bayesian/EB settings, trading off stability against cost.
- Calibrate the bootstrap sample size to the original sample size $n$ by default, or to a smaller value under suspected misspecification.
- In deep models, use diverse subcorpora (e.g., 50–100% data fractions), 4–6 component models, and tune merge weights for OOD subsets.
- Use the mismatch index proposed for bagged posteriors (Huggins et al., 2019) in Bayesian contexts to diagnose model-data mismatch.
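Tuning the merge weight on a held-out validation set, as recommended for OOD robustness, can be sketched for two linear components (all parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two component models (linear-regression weights) trained elsewhere,
# plus a small out-of-domain validation set -- all illustrative.
theta_1 = np.array([1.8, 0.5])
theta_2 = np.array([2.2, -0.1])
X_val = np.c_[np.linspace(-1.0, 1.0, 40), np.ones(40)]
y_val = 2.0 * X_val[:, 0] + 0.2 + rng.normal(scale=0.05, size=40)

def val_mse(theta):
    # Validation loss of a candidate (merged) parameter vector.
    return float(np.mean((X_val @ theta - y_val) ** 2))

# Grid search over the merge weight alpha on the validation set.
alphas = np.linspace(0.0, 1.0, 101)
losses = [val_mse(a * theta_1 + (1 - a) * theta_2) for a in alphas]
alpha_star = alphas[int(np.argmin(losses))]
theta_merged = alpha_star * theta_1 + (1 - alpha_star) * theta_2
```

Because the grid includes $\alpha \in \{0, 1\}$, the merged model can never do worse on the validation set than either component alone.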
6. Relation to Broader Model Averaging and Ensemble Methods
BOOM generalizes and fuses traditional bagging (bootstrap aggregation) with robust model merging:
- Distinct from classical ensembles: BOOM typically provides a single, merged model for inference, rather than a voting or averaging ensemble at prediction time, preserving computational efficiency.
- Distinction from “bootstrap-of-MAP”: In Bayesian settings, BOOM merges full posteriors, not point estimates, thereby maintaining richer uncertainty quantification.
- Differentiation from stacking: While stacking aggregates predictions rather than model representations or posteriors, BOOM fuses at the parameter/distributional level.
- Applicability to continual/incremental learning: BOOM enables lightweight model updates by merging a newly trained model on recent data (with a small historical subset) into the previous BOOM model, avoiding full retraining (Zhang et al., 5 Feb 2026).
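The continual-learning update can be sketched as a convex merge of the previous BOOM model with a model trained on recent data (the mixing weight here is illustrative, not a published value):

```python
import numpy as np

def continual_merge(theta_boom, theta_recent, alpha=0.2):
    # Incremental BOOM update: fold a model trained on recent data into the
    # existing merged model without full retraining.
    return (1 - alpha) * theta_boom + alpha * theta_recent

theta_boom = np.array([1.0, 2.0, 3.0])      # previous merged model
theta_recent = np.array([1.5, 2.5, 3.5])    # model trained on new data
theta_updated = continual_merge(theta_boom, theta_recent)
```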
7. Key References and Published Results
BOOM has been explored or formalized under various names and in different subfields:
- Mixture modeling via bagging and boosting: "A Bagging and Boosting Based Convexly Combined Optimum Mixture Probabilistic Model" (Adnan et al., 2021).
- Bagged posteriors for Bayesian inference/model selection: "Robust Inference and Model Criticism Using Bagged Posteriors" (Huggins et al., 2019); "Reproducible Model Selection Using Bagged Posteriors" (Huggins et al., 2020).
- Bagged Empirical Bayes: "On Bootstrap Averaging Empirical Bayes Estimators" (Sugasawa, 2017).
- Robust neural model merging for text embedding: "Bagging-Based Model Merging for Robust General Text Embeddings" (Zhang et al., 5 Feb 2026).