Branch-and-Merge Language Adaptation
- The paper introduces a branch-and-merge paradigm that integrates specialized model branches to mitigate catastrophic forgetting and boost cross-lingual performance.
- The methodology employs targeted fine-tuning and versatile merge operators, including linear averaging, task arithmetic, and adapter-based fusion, to consolidate expertise.
- Empirical results demonstrate significant improvements in low-resource settings and multilingual adaptation, with gains up to 7.3% in cross-lingual benchmarks.
Branch-and-merge language adaptation encompasses a family of strategies in which a base model is “branched” into one or more variants—typically through task- or domain-specific fine-tuning or pretraining—followed by a “merge” step that algorithmically fuses the specialized models or their parameter deltas to yield a single, more capable or more robust system. This paradigm facilitates efficient integration of skills or knowledge acquired across distinct domains, languages, or data distributions, with recent work demonstrating benefits for cross-lingual transfer, multilingual model construction, catastrophic forgetting mitigation, low-resource language adaptation, modular code integration, and efficient scaling. The term applies both to full-parameter merging of independently trained LLMs and to branch-and-merge recipes for parameter-efficient tuning (e.g., merging language adapters), as well as to task-specific and modality-compositional extensions.
1. Theoretical Foundations and Problem Formulation
Branch-and-merge methods are motivated by the fact that many adaptation tasks require a model to simultaneously acquire new skills (e.g., understanding a novel language or task) without erasing preexisting capabilities. In the canonical setup, one begins from an initial set of model parameters $\theta_0$, then creates multiple branches $\theta_1, \dots, \theta_K$ by fine-tuning or continued pretraining each copy on a different domain, language, modality, or data slice. The core objective is to recombine these branches—typically by an explicit parameter merge operator—so the resulting model encapsulates the union of capabilities more efficiently than curriculum, continued pretraining, or naive mixture approaches.
Formally, let each branch $i$ be defined by a weight vector $\theta_i$ and potentially its associated "task vector" $\tau_i = \theta_i - \theta_0$. The merge step is a mapping $\theta^* = \mathcal{M}(\theta_1, \dots, \theta_K)$, where $\mathcal{M}$ can range from simple linear/interpolative schemes to more complex, sparsity- or importance-aware logic, sometimes leveraging additional meta-information (e.g., domain posteriors, adapter structure).
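As an illustration (not drawn from any of the cited papers), the task-vector formulation and two of the simplest merge operators can be sketched over per-parameter weight dictionaries; all function names here are hypothetical:

```python
# Sketch of task-vector extraction and two basic merge operators,
# treating each model as a dict of parameter tensors.
import numpy as np

def task_vector(theta_branch, theta_base):
    """tau_i = theta_i - theta_0, computed per parameter tensor."""
    return {k: theta_branch[k] - theta_base[k] for k in theta_base}

def merge_linear(branches, weights):
    """Linear averaging: theta* = sum_i alpha_i * theta_i
    (weights are typically chosen to sum to 1)."""
    keys = branches[0].keys()
    return {k: sum(w * b[k] for w, b in zip(weights, branches)) for k in keys}

def merge_task_arithmetic(theta_base, taus, lam=1.0):
    """Task arithmetic: theta* = theta_0 + lambda * sum_i tau_i."""
    return {k: theta_base[k] + lam * sum(t[k] for t in taus) for k in theta_base}
```

In practice the dictionaries would be model `state_dict`s and the branches must share a base checkpoint, as noted in Section 5.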
2. Branching Strategies: Domain, Language, and Data Partitioning
Branching can be instantiated along several dimensions:
- Domain/Task: Models are fine-tuned on specific domains (e.g., law, medical, code, math), as in Branch-Train-Merge (BTM) (Li et al., 2022), Branch-Train-MiX (BTX) (Sukhbaatar et al., 2024), or domain-expert SFT followed by fusion (Sun et al., 6 Mar 2025).
- Language: Continued pretraining or fine-tuning on monolingual, cross-lingual, or code-mixed data produces language-specialized experts. Examples include Faroese adaptation from Scandinavian branches (Kunz et al., 1 Oct 2025), integration of code-mixed En-Hi/En-Es branches (Kodali et al., 22 Oct 2025), and low-resource language merging (Tao et al., 2024).
- Adapter-based PEFT: Adapter branching allows for parameter-efficient skill encapsulation, merging adapters to enable cross-lingual transfer (e.g., AdaMergeX (Zhao et al., 2024)) or phylogeny-inspired sharing (Faisal et al., 2022).
- Slices or Curricula: Data can be partitioned into arbitrary slices (chronological, random, or experience replay-augmented), then each branch is trained independently on its slice (Alexandrov et al., 2024).
Branch selection often reflects typological, task, or provenance-related variation, and can be determined empirically (replay-based, prior-informed branching, etc.).
3. Merging Algorithms and Parameter Fusion
Merge operators are technically diverse, with major classes including:
| Merge Method | Formulaic Template | Characteristics |
|---|---|---|
| Linear Averaging | $\theta^* = \sum_i \alpha_i \theta_i$ | Default; fast, agnostic |
| Task Arithmetic | $\theta^* = \theta_0 + \lambda \sum_i \tau_i$ | Exploits vector deltas |
| Slerp | Spherical interpolation in parameter space | Maintains parameter norm and directionality |
| TIES | TrIm, Elect Sign & Merge; sign-preserving, sparsifies conflicting weights | Prioritizes large, aligned changes |
| DARE-TIES | Adds random delta dropout and rescaling before TIES | Further sparsity |
| Arcee Fusion | Importance mask based on distillation loss × parameter delta | Selective per-parameter merge |
| MoE Routing | FFN experts assembled from branches, non-FFN layers averaged; routing layer added | Mixture-of-experts merging |
Adapter merging in AdaMergeX leverages adapter-internal algebra (e.g., addition for LoRA, elementwise multiplication for (IA)³, etc.), with language-gap divergences estimated on reference tasks and transferred accordingly (Zhao et al., 2024). Language integration via Tree-sitter-based ASTs in LastMerge (Duarte et al., 25 Jul 2025) is another syntax-aware "merge" scenario.
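A minimal sketch of the TIES-style trim/elect-sign/merge logic over flattened task vectors may clarify the table's third row; this is a simplification of the published algorithm, and the `keep_frac` trimming fraction is illustrative:

```python
# Simplified TIES-style merge: trim small deltas, elect a per-parameter
# sign by majority mass, then average only the sign-consistent entries.
import numpy as np

def ties_merge(theta_base, taus, keep_frac=0.2, lam=1.0):
    taus = np.stack(taus)                       # (num_branches, num_params)
    # 1) Trim: zero all but the top-k magnitude entries per branch.
    k = max(1, int(keep_frac * taus.shape[1]))
    for t in taus:
        cutoff = np.sort(np.abs(t))[-k]
        t[np.abs(t) < cutoff] = 0.0
    # 2) Elect sign: per-parameter sign of the summed (trimmed) deltas.
    elected = np.sign(taus.sum(axis=0))
    # 3) Disjoint merge: average only entries agreeing with the elected sign.
    agree = (np.sign(taus) == elected) & (taus != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tau = (taus * agree).sum(axis=0) / counts
    return theta_base + lam * merged_tau
```

Entries where the branches disagree in sign contribute only their majority-sign values, which is how conflicting updates are suppressed rather than averaged toward zero.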
4. Empirical Results, Metrics, and Applications
Branch-and-merge techniques consistently yield positive outcomes across several axes:
- Cross-lingual and low-resource adaptation: Merging monolingual adaptation and task-solving branches yields significant improvements for low-resource languages. For example, TIES-based merging achieves up to +4.7% average in low-resource settings versus sequential CT-then-SFT (Tao et al., 2024), and merged Scandinavian/Faroese models surpass all English-only baselines on FoBLiMP/FoBCoMP (Kunz et al., 1 Oct 2025). AdaMergeX shows +3.4% to +7.3% gains over SOTA cross-lingual methods on XCOPA/XQuAD (Zhao et al., 2024).
- Catastrophic forgetting mitigation: Iterative branching/merging over data slices (BaM) matches or exceeds standard CPT in target-language accuracy, with up to 66% lower parameter-change magnitude and drastically less English forgetting (Alexandrov et al., 2024).
- Code integration and structured merge: LastMerge achieves 15% fewer aFP than JDime due to identifier-aware AST matching, with runtime and accuracy at (or above) language-specific baselines (Duarte et al., 25 Jul 2025).
- Multilingual or multimodal fusion: Slerp/Linear or Task Arithmetic merges (“Model Soup”, “MultiSlerp”) tend to outperform TIES/DARE on linguistic acceptability and minimal-pair probes, with upscaled model merges retaining more language features (Glocker et al., 11 Dec 2025).
- Mixture-of-Experts (MoE) scaling: BTX achieves +18.8 points in math and +17.2 in code (over 7B Llama) by branching on specialist data and then fusing via MoE layers (Sukhbaatar et al., 2024).
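The Slerp operator appearing in several of these results interpolates along the arc between two weight vectors rather than the chord, which is why it preserves norm and directionality for equal-norm endpoints; a minimal sketch over flattened weights (variable names are illustrative):

```python
# Spherical linear interpolation (Slerp) between two flattened
# weight vectors; falls back to linear interpolation when the
# vectors are nearly parallel.
import numpy as np

def slerp(theta_a, theta_b, t=0.5, eps=1e-8):
    a, b = np.ravel(theta_a), np.ravel(theta_b)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    # Angle between the two parameter directions.
    cos_omega = np.clip(np.dot(a / (na + eps), b / (nb + eps)), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < eps:                     # nearly parallel: plain lerp
        return (1 - t) * a + t * b
    sin_omega = np.sin(omega)
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / sin_omega
```

For two orthonormal directions at `t=0.5` the result lies midway along the arc and keeps unit norm, whereas linear averaging would shrink it by a factor of $\sqrt{2}/2$.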
Metrics typically include perplexity, average F1, Pass@k, morphology/syntax probe accuracy, and minimal-pair acceptability, with paired baselines (full FT, continued pretraining, joint training) for comparison.
5. Practical Implementation Considerations
Effective deployment of branch-and-merge pipelines centers on several crucial practices:
- Base model alignment: All branches must be initialized from the same checkpoint to maintain parameter-space compatibility (Sigris et al., 23 Sep 2025).
- Merge weighting and tuning: Merge coefficients (e.g., $\alpha_i$, $\lambda$) require validation-based selection for best behavioral trade-offs; per-layer tuning, slack variables, or adaptive soft gates are plausible extensions (Zhao et al., 2024, Tao et al., 2024).
- Branch granularity: Overly fine-grained or semantic branch splitting may degrade mergeability or yield diminishing returns (observed when merging many branches, especially for distant languages) (Glocker et al., 11 Dec 2025).
- Adapter type and merge algebra: Merging must honor the algebraic structure of each adapter type (addition, multiplication, matrix op) for effectiveness in PEFT scenarios (Zhao et al., 2024).
- Replay and tokenization: For language transfer, high-quality experience replay prevents forgetting, and tokenizer adaptation (e.g., Cyrillic extension) minimizes fertility costs (Alexandrov et al., 2024).
- Sparsification and parameter selection: Algorithms like TIES, DARE, or Arcee Fusion optimize which parameters—or signs—are imported, balancing specialization vs. generalization (Sun et al., 6 Mar 2025).
General failure modes include representation misalignment (notably when merging LoRA/adapter weights under low-data regimes (Kunz et al., 1 Oct 2025)), loss of fine-grained features under aggressive sparsification, and degraded merging across typologically distant language pairs.
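The DARE preprocessing step listed above (random dropout of task-vector entries with rescaling of the survivors, applied before a TIES or averaging merge) can be sketched as follows; the default `drop_rate` here is illustrative:

```python
# DARE-style sparsification: randomly Drop delta entries And REscale
# the survivors by 1/(1 - p), so the expected task vector is unchanged.
import numpy as np

def dare(tau, drop_rate=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(tau.shape) >= drop_rate   # keep with prob 1 - p
    return (tau * mask) / (1.0 - drop_rate)
```

Because surviving entries are scaled by $1/(1-p)$, averaging many DARE-processed deltas recovers the original expected update while greatly reducing per-branch parameter conflicts.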
6. Extensions, Impact, and Future Directions
Recent developments extend the branch-and-merge concept beyond the full-parameter, monolingual paradigm:
- Phylogenetically-structured adapters: Branching and joint merging at language family/group/tree levels yields large transfer gains for unseen languages (Faisal et al., 2022).
- Multiplex thinking in LLMs: Token-wise "branch-and-merge" is formalized via stochastic multiplex tokens, compacting parallel reasoning steps into a merged embedding at each step, improving math reasoning accuracy and test-time scalability across Pass@k (Tang et al., 13 Jan 2026).
- Distillation and model compression: The Branch–Merge distillation approach, with selective SFT on domain-expert teachers and Arcee-masked parameter fusion, enables nearly teacher-level accuracy at a small fraction of size and retraining cost (Sun et al., 6 Mar 2025).
- Large-scale system modularity: Upscaling via HyperCloning (Glocker et al., 11 Dec 2025) increases “mergeability” for modular multilingual systems, though naive merging remains inferior to joint training; this suggests targeted merge-specialist algorithms as the next frontier.
Open questions and directions concern per-language/layer adaptive weighting, learned merge architectures, automated hyperparameter selection, alignment-aware parameter fusion, and theoretical underpinnings for when language gap invariance or merge success is guaranteed.
7. Comparison of Key Branch-and-Merge Methods
| Approach | Branch Mechanism | Merge Operator / Recipe | Application | Key Reference |
|---|---|---|---|---|
| BTM | Domain SFT/data split | Posterior-weighted average, ensembling | Modular domain adaptation | (Li et al., 2022) |
| BTX | Async domain SFT | MoE construction + router FT | Multi-domain expert LLMs | (Sukhbaatar et al., 2024) |
| AdaMergeX | Adapter PEFT SFT | Structure-adaptive algebraic merging | Cross-lingual transfer | (Zhao et al., 2024) |
| BaM | Data slicing + SFT | Iterative linear/Slerp merge | Catastrophic forgetting mitigation | (Alexandrov et al., 2024) |
| LastMerge | Language-agnostic AST | Tree-sitter based structural merging | Polyglot code integration | (Duarte et al., 25 Jul 2025) |
| TinyR1 Branch–Merge | Domain SFT distillation | Importance-masked per-parameter fusion | Model compression | (Sun et al., 6 Mar 2025) |
| UniMoS | Modality separation | Dynamic ensemble of “modality branches” | VLM unsupervised domain adaptation | (Li et al., 2024) |
| Code-mixed Merge | CPT on code-mixed | Task Arithmetic, TIES, + supervised vector merge | Robust mixed-code NLP | (Kodali et al., 22 Oct 2025) |
| Scaling + Merge | Data-matched upscaling | Linear, Task Arithmetic, MultiSlerp, TIES | Modular high-resource language LMs | (Glocker et al., 11 Dec 2025) |
Branch-and-merge language adaptation provides a modular, computationally scalable alternative to monolithic or sequential adaptation, with empirical improvements across cross-lingual transfer, catastrophic forgetting, model compression, and high-accuracy domain mixture construction. Its success crucially depends on principled branch selection, merge strategy optimization, and structural compatibility across branches, with ongoing research addressing scaling, multilinguality, and fine-grained fusion mechanisms.