Bagging-Based Model Merging for Robust General Text Embeddings

Published 5 Feb 2026 in cs.IR, cs.AI, and cs.CL | (2602.05787v2)

Abstract: General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. However, it remains unclear how different multi-task training strategies compare in practice, and how to efficiently adapt embedding models as new domains and data types continually emerge. In this work, we present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We compare batch-level shuffling, sequential training variants, two-stage training, and multiple merging granularities, and find that simple batch-level shuffling consistently yields the strongest overall performance, suggesting that task conflicts are limited and training datasets are largely complementary. Despite its effectiveness, batch-level shuffling exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, BOOM naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that BOOM consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.


Authors (7)

Summary

The paper introduces BOOM, a bagging-based model merging framework that combines models trained on bootstrapped subsets to enhance robustness and out-of-domain generalization.
It systematically compares data scheduling and merging strategies, demonstrating that batch-level shuffling outperforms sequential training and isolated merging techniques.
BOOM supports both static and incremental learning, offering operational efficiency and continual adaptation without the need for full retraining.

Bagging-Based Model Merging for Robust General Text Embeddings

Background and Motivation

General text embedding models form the backbone of a broad spectrum of NLP and IR applications, including retrieval, classification, clustering, reranking, and semantic textual similarity (STS). The ability to generalize across seen and unseen domains remains pivotal for practical deployment, especially as retrieval-augmented generation and domain-specific applications expand. Such models are typically trained on multi-task corpora, often with batch-level shuffling or sequential scheduling, to encourage broad generalization. When new domains appear, however, full retraining is computationally expensive, which makes it a poor fit for incremental adaptation.

Prior work has explored ensemble learning, particularly bootstrap aggregating (bagging), and model merging as practical alternatives; merging avoids the inference latency and resource costs of serving an ensemble. Even so, a systematic comparison of multi-task training strategies and merging granularities for text embedding has been missing.

Systematic Study of Multi-Task Training and Model Merging

This paper conducts a comprehensive evaluation of both data scheduling and model merging paradigms for general text embedding models. The examined data scheduling strategies are batch-level shuffling, dataset-level sequential training, task-level sequential training, and two-stage training. For merging, dataset-level, task-level, and cluster-level granularities are considered with several merging algorithms, including Multi-SLERP, Task Arithmetic, TIES, SCE, and Model Stock.
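
The difference between two of these schedules can be illustrated with a minimal sketch. It assumes each batch is drawn from a single dataset, which the summary does not state, and the names `make_batches`, `dataset_level_sequential`, and `batch_level_shuffle` are hypothetical helpers, not the authors' code. Both schedules contain the same batches; only their order differs.

```python
import random
from typing import Dict, List, Sequence, Tuple

Batch = Tuple[str, list]  # (dataset name, examples in the batch)


def make_batches(name: str, examples: Sequence, batch_size: int,
                 rng: random.Random) -> List[Batch]:
    """Shuffle one dataset and cut it into single-dataset batches (assumption)."""
    items = list(examples)
    rng.shuffle(items)
    return [(name, items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


def dataset_level_sequential(datasets: Dict[str, Sequence], batch_size: int,
                             seed: int = 0) -> List[Batch]:
    """All batches of the first dataset, then all batches of the next, and so on."""
    rng = random.Random(seed)
    schedule: List[Batch] = []
    for name, examples in datasets.items():
        schedule.extend(make_batches(name, examples, batch_size, rng))
    return schedule


def batch_level_shuffle(datasets: Dict[str, Sequence], batch_size: int,
                        seed: int = 0) -> List[Batch]:
    """Same batches as above, but interleaved at random across datasets."""
    rng = random.Random(seed)
    schedule: List[Batch] = []
    for name, examples in datasets.items():
        schedule.extend(make_batches(name, examples, batch_size, rng))
    rng.shuffle(schedule)  # mix batches from all datasets
    return schedule


if __name__ == "__main__":
    # Toy data: integers stand in for training examples.
    toy = {"retrieval": range(8), "sts": range(100, 106)}
    print([n for n, _ in dataset_level_sequential(toy, batch_size=2)])
    print([n for n, _ in batch_level_shuffle(toy, batch_size=2)])
```

Under the sequential schedule the model sees every batch from one dataset before moving to the next, which is the setting where the study reports catastrophic forgetting; the shuffled schedule exposes it to all datasets throughout training.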

Empirical results highlight key observations:

Batch-level shuffling consistently yields the highest in-domain and OOD performance among the compared strategies, indicating that task conflict is limited and that the training datasets are largely complementary.
Sequential dataset-level training suffers from severe catastrophic forgetting and weaker generalization. More generally, the finer the granularity at which datasets are mixed during training, the stronger the resulting model.
Model merging after separate task/dataset training generally underperforms batch-level shuffling, underscoring the value of dynamic, interleaved training.

Figure 1: Average performance (%) of general text embedding models trained with different proportions of the multi-task training set on in-domain and OOD evaluation sets.

Furthermore, pairwise dataset interaction and hierarchical clustering analyses show that synergistic relationships predominate. Even for pairs identified as potentially conflicting, joint batch-level training outperforms independent training, which weakens the case for avoiding conflict aggressively through merging.

Figure 2: Pairwise Dataset Interaction Matrix quantifies synergy and conflict across training datasets.

Figure 3: Average performance (%) comparison on MTEB (Eng, v2) between models trained jointly and independently on three pairs of datasets.

Bagging-Based Robust Model Merging (BOOM)

Building upon the observation of pervasive synergy and limited conflict, the authors introduce the Bagging-based rObust mOdel Merging (BOOM) framework. BOOM leverages standard batch-level shuffling for training multiple models on different bootstrapped subsets, followed by parameter-space fusion using merging methods (e.g., Multi-SLERP). This approach retains the robustness advantages of ensembles but compresses them into a single model, maintaining inference efficiency.
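
To make parameter-space fusion concrete, the following is a minimal sketch of spherical linear interpolation (SLERP) between two flattened weight vectors. It is an illustration under assumptions, not the paper's implementation: Multi-SLERP extends the idea to more than two models, and the summary does not say whether merging is applied to raw weights or to differences from a shared base model. The function name `slerp` and the plain-list representation of weights are choices made here for brevity.

```python
import math
from typing import List


def slerp(a: List[float], b: List[float], t: float,
          eps: float = 1e-8) -> List[float]:
    """Interpolate from a (t=0) to b (t=1) along the arc between them.

    Falls back to linear interpolation when the vectors are (anti)parallel
    or one of them is near zero, where the arc is not well defined.
    """
    linear = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return linear

    # Angle between the two weight vectors.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return linear

    # Standard SLERP weights.
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Two orthogonal unit vectors stand in for two models' weights.
    merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
    print(merged)  # both components equal, and the result keeps unit norm
```

Unlike plain averaging, which would shrink the midpoint of these two unit vectors to norm of about 0.71, SLERP keeps the merged vector on the same sphere.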

Figure 4: The overall framework of BOOM for static and incremental settings.

BOOM is designed for both static and incremental learning settings:

Static Setting: Models are trained on $M$ sampled subsets of the corpus, each with varied size or composition, and merged.
Incremental Learning Setting: When new data arrives, a new model is trained on the union of new data and a small, representative historical subset, then merged with the previously deployed model. This enables rapid knowledge update with minimal computational overhead.
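
The two settings above can be sketched as a small pipeline. Everything here is a hedged illustration: `train` is a hypothetical placeholder for fine-tuning an embedding model, a "model" is reduced to a list of floats, subsets are sampled without replacement (the summary does not specify the sampling scheme), and uniform weight averaging is swapped in for the merging step in place of Multi-SLERP to keep the example short.

```python
import random
from typing import Callable, List, Sequence

Model = List[float]                   # toy stand-in for model weights
TrainFn = Callable[[list], Model]     # hypothetical training routine


def sample_subsets(corpus: Sequence, fractions: Sequence[float],
                   seed: int = 0) -> List[list]:
    """Draw one random subset per fraction (without replacement: an assumption)."""
    rng = random.Random(seed)
    items = list(corpus)
    return [rng.sample(items, max(1, round(f * len(items)))) for f in fractions]


def average_merge(models: Sequence[Model]) -> Model:
    """Uniform parameter averaging, used here as a simple stand-in for Multi-SLERP."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def boom_static(corpus: Sequence, fractions: Sequence[float],
                train: TrainFn, seed: int = 0) -> Model:
    """Static setting: train one model per sampled subset, then merge them."""
    subsets = sample_subsets(corpus, fractions, seed)
    return average_merge([train(subset) for subset in subsets])


def boom_incremental(deployed: Model, new_data: Sequence, history: Sequence,
                     replay_fraction: float, train: TrainFn,
                     seed: int = 0) -> Model:
    """Incremental setting: train on new data plus a small historical subset,
    then merge the update model into the deployed model."""
    replay = sample_subsets(history, [replay_fraction], seed)[0]
    update = train(list(new_data) + replay)
    return average_merge([deployed, update])


if __name__ == "__main__":
    # Toy "training": the model is a single parameter, the mean of its data.
    toy_train = lambda examples: [sum(examples) / len(examples)]
    corpus = list(range(10))
    deployed = boom_static(corpus, fractions=[0.5, 1.0], train=toy_train)
    updated = boom_incremental(deployed, new_data=[50, 60], history=corpus,
                               replay_fraction=0.2, train=toy_train)
    print(deployed, updated)
```

A fraction of 1.0 corresponds to a model trained on the full corpus. In the incremental path only the update model is trained, which is where the reduction in training cost comes from.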

Experimental evaluation is conducted on MTEB (English, v2), MTEB (Code, v1), and RTEB (beta), covering in-domain and OOD scenarios. Results show:

BOOM consistently yields superior in-domain and OOD performance versus batch-level shuffling and BGE-en-ICL style in-context learning, even when training cost is matched.
Performance saturates as the number and diversity of merged models increase, and including a model trained on the full dataset is crucial for maximizing robustness.
In incremental settings, BOOM enables continual adaptation with reduced training cost, preserving and enhancing generalization.

Implications and Outlook

The findings have significant practical and theoretical implications:

Operational Efficiency: BOOM enables sustainable, low-cost continual learning and deployment, allowing enterprises to update text embedders without expensive full-corpus retraining.
Generalization and Robustness: By leveraging the synergy among diverse tasks and domains, BOOM achieves stronger OOD robustness while maintaining inference efficiency and scalability.
Component Reusability: Modular merging protocols and sampling strategies support flexible adaptation as new use cases and domains arise.

On the theoretical side, the limited task conflict observed in general text embedding training suggests that future multi-task learning frameworks may favor interleaved batch-level exposure over explicit task partitioning and conflict mitigation. Model merging theories and algorithms developed for other domains such as vision (e.g., task vectors and geometric consensus) may need tailoring to the requirements of language representation and generalization.

Future Directions

Key avenues remain for further exploration:

Development of specialized merging algorithms optimized for text embedding geometric constraints and semantic objectives.
End-to-end assessment of BOOM’s impact on downstream retrieval and retrieval-augmented generation workflows.
Extension to non-English, cross-lingual, and multimodal embedding scenarios to further validate robustness under extreme distributional shifts.

Conclusion

This work presents a systematic analysis of multi-task training and merging strategies for general text embeddings. The empirical results argue against the premise of pervasive task conflict and favor fine-grained batch mixing. The BOOM framework, inspired by ensemble bagging, offers a scalable and robust approach for both static and incremental learning. By combining training on sampled subsets with model merging, BOOM improves in-domain and OOD performance over full-corpus batch-level shuffling across diverse benchmarks while reducing training cost in incremental settings. The result is a practical blueprint for building and maintaining robust text embedding models as domains and data evolve.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper advances a useful empirical perspective and proposes BOOM, yet several aspects remain uncertain or underexplored. The following concrete gaps can guide future work:

Compute-controlled comparisons are missing: BOOM trains multiple base models and then merges; improvements may conflate bagging benefits with increased total optimization (tokens/steps). Establish baselines under equalized compute and data exposure.
BOOM hyperparameters are under-specified: number of submodels, subset sizes, sampling with/without replacement, overlap ratios, and random seed variance are not systematically ablated. Provide a clear protocol and sensitivity analysis.
Merging algorithm choice within BOOM is not explored: the method defaults to Multi-SLERP without testing whether TIES, Task Arithmetic, Karcher Mean, or other merging approaches affect performance, stability, or OOD robustness.
Order and compositional stability of merging is unclear: behavior under sequential pairwise merges, repeated incremental updates, and long chains of merges (merge-of-merges) is not measured; potential order sensitivity and drift remain open.
Limited theoretical insight on why batch-level shuffling dominates: conflict is inferred via loss-based heuristics; direct measures (e.g., gradient cosine similarity, representational interference/transfer metrics) and causal analyses are absent.
The pairwise “conflict/synergy” metric relies on average training loss: its relation to test-time generalization is unvalidated; alternative diagnostics (e.g., gradient conflicts, validation loss, cross-task transfer) are not compared.
Backbone diversity is narrow: experiments use only Qwen3-0.6B and Qwen3-4B. Generality to encoder-only PLMs (e.g., BERT), encoder-decoder PLMs (e.g., T5), larger or smaller LLMs, and other model families remains unknown.
Training-length dependence is untested: conclusions are drawn from one-epoch fine-tuning; whether rankings of strategies (BLS vs sequential vs two-stage) persist across multi-epoch schedules and different LR regimes is unexamined.
Batch-level shuffling (BLS) sampling policy is unspecified: dataset/task sampling probabilities and reweighting strategies are not detailed or ablated; impact on under/over-represented tasks is unclear.
Two-stage baseline fidelity is uncertain: the implemented two-stage training may not match recipes from NV-Embed or Gemini embeddings; sensitivity to stage composition, mixing ratios, and negative policies is not studied.
Statistical rigor is limited: results are reported as averages without multiple seeds, confidence intervals, or statistical tests; claims lack significance evidence.
IND/OOD split methodology is not fully specified: how IND vs OOD in MTEB (Eng, v2) is defined, and whether any residual train–test leakage exists beyond three excluded datasets, is not audited.
Multilingual generalization is not evaluated: despite including multilingual datasets in General-Full-Data, no cross-lingual benchmarks (e.g., MIRACL, mTEB multilingual tasks) are reported.
Real-world RAG impact is unmeasured: end-to-end improvements in retrieval-augmented generation (answer quality, faithfulness, latency) from BOOM are not assessed.
Inference characteristics of merged models are unquantified: latency, throughput, memory footprint, and quantization compatibility post-merge are not reported.
Long-horizon incremental learning remains open: cumulative effects of many small updates (e.g., stability, drift, order dependence, and retention) are not measured over multiple rounds.
Forgetting/preservation trade-offs in incremental merges are not analyzed: per-domain/per-task retention when adding new domains via BOOM is not quantified beyond aggregate metrics.
Robustness to stronger distribution shifts and adversarial settings is untested: current OOD coverage (MTEB v2 OOD, RTEB open, MTEB Code) may not capture harder shifts or adversarial perturbations.
Macro averages may hide regressions: per-dataset and per-task trade-offs (e.g., specific domains harmed by merging) are not provided; targeted error analyses are lacking.
Alternative compression baselines are missing: no comparison to distilling an ensemble into a single student, model soups from checkpoints, or MoE adapters; relative merits vs BOOM are unclear.
Interaction with LoRA is ambiguous: for Qwen3-4B, fine-tuning uses LoRA (r=32), but it is unclear whether merging operates on adapters, merged full weights, or both; best practices for adapter-level merging are unsettled.
Negative sampling choices are fixed: “7 hard negatives” and in-batch negative policies are adopted without ablation; how these interact with BOOM and affect OOD generalization is unknown.
Code retrieval coverage is limited: training uses ~10k queries for each of five languages; generalization to other languages/frameworks and larger code corpora is untested.
Cluster-level merging findings may be dataset-specific: the observed broad “synergy” could depend on Eng-Text-Data; replication on different training collections (e.g., more diverse or noisier datasets) is needed.
Safety, bias, and fairness are unexamined: effects of bagging/merging on representational biases across demographics, domains, or languages are not evaluated.
Calibration and uncertainty of embeddings are not assessed: whether BOOM reduces variance or stabilizes similarity scores (as bagging suggests) is unquantified.
Practitioner guidance is missing: no actionable rules-of-thumb on selecting BOOM settings (subset sizes, number of models) under fixed budgets or desired OOD gains.
Reproducibility gaps: configuration labels like “{50}-and-R” are not defined; the anonymous code link and dataset availability/licensing for long-term reproducibility are uncertain.
Comparisons to continual-learning baselines are absent: no evaluation against EWC, LwF, replay-based methods, or streaming ERM for incremental adaptation.
Representation geometry is not analyzed: effects of merging on embedding anisotropy, clustering structure, and neighborhood stability are not examined.
