Branch-and-Merge Language Adaptation
- The paper introduces a branch-and-merge paradigm that integrates specialized model branches to mitigate catastrophic forgetting and boost cross-lingual performance.
- The methodology employs targeted fine-tuning and versatile merge operators, including linear averaging, task arithmetic, and adapter-based fusion, to consolidate expertise.
- Empirical results demonstrate significant improvements in low-resource settings and multilingual adaptation, with gains up to 7.3% in cross-lingual benchmarks.
Branch-and-merge language adaptation encompasses a family of strategies in which a base model is “branched” into one or more variants—typically through task- or domain-specific fine-tuning or pretraining—followed by a “merge” step that algorithmically fuses the specialized models or their parameter deltas to yield a single, more capable or more robust system. This paradigm facilitates efficient integration of skills or knowledge acquired across distinct domains, languages, or data distributions, with recent work demonstrating benefits for cross-lingual transfer, multilingual model construction, catastrophic forgetting mitigation, low-resource language adaptation, modular code integration, and efficient scaling. The term applies both to full-parameter merging of independently trained LLMs and to branch-and-merge recipes for parameter-efficient tuning (e.g., merging language adapters), as well as to task-specific and modality-compositional extensions.
1. Theoretical Foundations and Problem Formulation
Branch-and-merge methods are motivated by the fact that many adaptation tasks require a model to simultaneously acquire new skills (e.g., understanding a novel language or task) without erasing preexisting capabilities. In the canonical setup, one begins from an initial set of model parameters $\theta_0$, then creates multiple branches $\theta_1, \dots, \theta_K$ by fine-tuning or continued pretraining each copy on a different domain, language, modality, or data slice. The core objective is to recombine these branches—typically by an explicit parameter merge operator—so the resulting model encapsulates the union of capabilities more efficiently than curriculum, continued pretraining, or naive mixture approaches.
Formally, let each branch $i$ be defined by a weight vector $\theta_i$ and potentially its associated "task vector" $\tau_i = \theta_i - \theta_0$. The merge step is a mapping $\theta^* = \mathcal{M}(\theta_1, \dots, \theta_K)$, where $\mathcal{M}$ can range from simple linear/interpolative schemes to more complex, sparsity- or importance-aware logic, sometimes leveraging additional meta-information (e.g., domain posteriors, adapter structure).
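As an illustration (not drawn from any of the cited papers), the task-vector formulation and two of the simplest merge operators can be sketched over per-parameter weight dictionaries; all function names here are hypothetical:

```python
# Sketch of task-vector extraction and two basic merge operators,
# treating each model as a dict of parameter tensors.
import numpy as np

def task_vector(theta_branch, theta_base):
    """tau_i = theta_i - theta_0, computed per parameter tensor."""
    return {k: theta_branch[k] - theta_base[k] for k in theta_base}

def merge_linear(branches, weights):
    """Linear averaging: theta* = sum_i alpha_i * theta_i
    (weights are typically chosen to sum to 1)."""
    keys = branches[0].keys()
    return {k: sum(w * b[k] for w, b in zip(weights, branches)) for k in keys}

def merge_task_arithmetic(theta_base, taus, lam=1.0):
    """Task arithmetic: theta* = theta_0 + lambda * sum_i tau_i."""
    return {k: theta_base[k] + lam * sum(t[k] for t in taus) for k in theta_base}
```

In practice the dictionaries would be model `state_dict`s and the branches must share a base checkpoint, as noted in Section 5.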
2. Branching Strategies: Domain, Language, and Data Partitioning
Branching can be instantiated along several dimensions:
- Domain/Task: Models are fine-tuned on specific domains (e.g., law, medical, code, math), as in Branch-Train-Merge (BTM) (Li et al., 2022), Branch-Train-MiX (BTX) (Sukhbaatar et al., 2024), or domain-expert SFT followed by fusion (Sun et al., 6 Mar 2025).
- Language: Continued pretraining or fine-tuning on monolingual, cross-lingual, or code-mixed data produces language-specialized experts. Examples include Faroese adaptation from Scandinavian branches (Kunz et al., 1 Oct 2025), integration of code-mixed En-Hi/En-Es branches (Kodali et al., 22 Oct 2025), and low-resource language merging (Tao et al., 2024).
- Adapter-based PEFT: Adapter branching allows for parameter-efficient skill encapsulation, merging adapters to enable cross-lingual transfer (e.g., AdaMergeX (Zhao et al., 2024)) or phylogeny-inspired sharing (Faisal et al., 2022).
- Slices or Curricula: Data can be partitioned into arbitrary slices (chronological, random, or experience replay-augmented), then each branch is trained independently on its slice (Alexandrov et al., 2024).
Branch selection often reflects typological, task, or provenance-related variation, and can be determined empirically (replay-based, prior-informed branching, etc.).
3. Merging Algorithms and Parameter Fusion
Merge operators are technically diverse, with major classes including:
| Merge Method | Formulaic Template | Characteristics |
|---|---|---|
| Linear Averaging | $\theta^* = \sum_i \alpha_i \theta_i$ | Default; fast, agnostic |
| Task Arithmetic | $\theta^* = \theta_0 + \lambda \sum_i \tau_i$ | Exploits vector deltas |
| Slerp | Spherical interpolation in parameter space | Maintains parameter norm and directionality |
| TIES | TrIm, Elect Sign & Merge; sign-preserving, sparsifies conflicting weights | Prioritizes large, aligned changes |
| DARE-TIES | Adds random delta dropout and rescaling before TIES | Further sparsity |
| Arcee Fusion | Importance mask based on distillation loss × parameter delta | Selective per-parameter merge |
| MoE Routing | FFN experts assembled from branches, non-FFN layers averaged; routing layer added | Mixture-of-experts merging |
Adapter merging in AdaMergeX leverages adapter-internal algebra (e.g., addition for LoRA, elementwise multiplication for (IA)³, etc.), with language-gap divergences estimated on reference tasks and transferred accordingly (Zhao et al., 2024). Language integration via Tree-sitter-based ASTs in LastMerge (Duarte et al., 25 Jul 2025) is another syntax-aware "merge" scenario.
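A minimal sketch of the TIES-style trim/elect-sign/merge logic over flattened task vectors may clarify the table's third row; this is a simplification of the published algorithm, and the `keep_frac` trimming fraction is illustrative:

```python
# Simplified TIES-style merge: trim small deltas, elect a per-parameter
# sign by majority mass, then average only the sign-consistent entries.
import numpy as np

def ties_merge(theta_base, taus, keep_frac=0.2, lam=1.0):
    taus = np.stack(taus)                       # (num_branches, num_params)
    # 1) Trim: zero all but the top-k magnitude entries per branch.
    k = max(1, int(keep_frac * taus.shape[1]))
    for t in taus:
        cutoff = np.sort(np.abs(t))[-k]
        t[np.abs(t) < cutoff] = 0.0
    # 2) Elect sign: per-parameter sign of the summed (trimmed) deltas.
    elected = np.sign(taus.sum(axis=0))
    # 3) Disjoint merge: average only entries agreeing with the elected sign.
    agree = (np.sign(taus) == elected) & (taus != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tau = (taus * agree).sum(axis=0) / counts
    return theta_base + lam * merged_tau
```

Entries where the branches disagree in sign contribute only their majority-sign values, which is how conflicting updates are suppressed rather than averaged toward zero.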
4. Empirical Results, Metrics, and Applications
Branch-and-merge techniques consistently yield positive outcomes across several axes:
- Cross-lingual and low-resource adaptation: Merging monolingual adaptation and task-solving branches yields significant improvements for low-resource languages. For example, TIES-based merging achieves up to +4.7% average in low-resource settings versus sequential CT-then-SFT (Tao et al., 2024), and merged Scandinavian/Faroese models surpass all English-only baselines on FoBLiMP/FoBCoMP (Kunz et al., 1 Oct 2025). AdaMergeX shows +3.4% to +7.3% gains over SOTA cross-lingual methods on XCOPA/XQuAD (Zhao et al., 2024).
- Catastrophic forgetting mitigation: Iterative branching/merging over data slices (BaM) matches or exceeds standard CPT in target-language accuracy, with up to 66% lower parameter-change magnitude and drastically less English forgetting (Alexandrov et al., 2024).
- Code integration and structured merge: LastMerge achieves 15% fewer aFP than JDime due to identifier-aware AST matching, with runtime and accuracy at (or above) language-specific baselines (Duarte et al., 25 Jul 2025).
- Multilingual or multimodal fusion: Slerp/Linear or Task Arithmetic merges (“Model Soup”, “MultiSlerp”) tend to outperform TIES/DARE on linguistic acceptability and minimal-pair probes, with upscaled model merges retaining more language features (Glocker et al., 11 Dec 2025).
- Mixture-of-Experts (MoE) scaling: BTX achieves +18.8 points in math and +17.2 in code (over 7B Llama) by branching on specialist data and then fusing via MoE layers (Sukhbaatar et al., 2024).
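The Slerp operator appearing in several of these results interpolates along the arc between two weight vectors rather than the chord, which is why it preserves norm and directionality for equal-norm endpoints; a minimal sketch over flattened weights (variable names are illustrative):

```python
# Spherical linear interpolation (Slerp) between two flattened
# weight vectors; falls back to linear interpolation when the
# vectors are nearly parallel.
import numpy as np

def slerp(theta_a, theta_b, t=0.5, eps=1e-8):
    a, b = np.ravel(theta_a), np.ravel(theta_b)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    # Angle between the two parameter directions.
    cos_omega = np.clip(np.dot(a / (na + eps), b / (nb + eps)), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < eps:                     # nearly parallel: plain lerp
        return (1 - t) * a + t * b
    sin_omega = np.sin(omega)
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / sin_omega
```

For two orthonormal directions at `t=0.5` the result lies midway along the arc and keeps unit norm, whereas linear averaging would shrink it by a factor of $\sqrt{2}/2$.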
Metrics typically include perplexity, average F1, Pass@k, morphology/syntax probe accuracy, and minimal-pair acceptability, with paired baselines (full FT, continued pretraining, joint training) for comparison.
5. Practical Implementation Considerations
Effective deployment of branch-and-merge pipelines centers on several crucial practices:
- Base model alignment: All branches must be initialized from the same checkpoint to maintain parameter-space compatibility (Sigris et al., 23 Sep 2025).
- Merge weighting and tuning: Merge coefficients (e.g., $\alpha_i$, $\lambda$) require validation-based selection for best behavioral trade-offs; per-layer tuning, slack variables, or adaptive soft gates are plausible extensions (Zhao et al., 2024, Tao et al., 2024).
- Branch granularity: Overly fine-grained or semantic branch splitting may degrade mergeability or yield diminishing returns (observed when merging many branches, especially for distant languages) (Glocker et al., 11 Dec 2025).
- Adapter type and merge algebra: Merging must honor the algebraic structure of each adapter type (addition, multiplication, matrix op) for effectiveness in PEFT scenarios (Zhao et al., 2024).
- Replay and tokenization: For language transfer, high-quality experience replay prevents forgetting, and tokenizer adaptation (e.g., Cyrillic extension) minimizes fertility costs (Alexandrov et al., 2024).
- Sparsification and parameter selection: Algorithms like TIES, DARE, or Arcee Fusion optimize which parameters—or signs—are imported, balancing specialization vs. generalization (Sun et al., 6 Mar 2025).
General failure modes include representation misalignment (notably when merging LoRA/adapter weights under low-data regimes (Kunz et al., 1 Oct 2025)), loss of fine-grained features under aggressive sparsification, and degraded merging across typologically distant language pairs.
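The DARE preprocessing step listed above (random dropout of task-vector entries with rescaling of the survivors, applied before a TIES or averaging merge) can be sketched as follows; the default `drop_rate` here is illustrative:

```python
# DARE-style sparsification: randomly Drop delta entries And REscale
# the survivors by 1/(1 - p), so the expected task vector is unchanged.
import numpy as np

def dare(tau, drop_rate=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(tau.shape) >= drop_rate   # keep with prob 1 - p
    return (tau * mask) / (1.0 - drop_rate)
```

Because surviving entries are scaled by $1/(1-p)$, averaging many DARE-processed deltas recovers the original expected update while greatly reducing per-branch parameter conflicts.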
6. Extensions, Impact, and Future Directions
Recent developments extend the branch-and-merge concept beyond the full-parameter, monolingual paradigm:
- Phylogenetically-structured adapters: Branching and joint merging at language family/group/tree levels yields large transfer gains for unseen languages (Faisal et al., 2022).
- Multiplex thinking in LLMs: Token-wise "branch-and-merge" is formalized via stochastic multiplex tokens, compacting parallel reasoning steps into a merged embedding at each step, improving math reasoning accuracy and test-time scalability across Pass@k (Tang et al., 13 Jan 2026).
- Distillation and model compression: The Branch–Merge distillation approach, with selective SFT on domain-expert teachers and Arcee-masked parameter fusion, enables nearly teacher-level accuracy at a small fraction of size and retraining cost (Sun et al., 6 Mar 2025).
- Large-scale system modularity: Upscaling via HyperCloning (Glocker et al., 11 Dec 2025) increases “mergeability” for modular multilingual systems, though naive merging remains inferior to joint training; this suggests targeted merge-specialist algorithms as the next frontier.
Open questions and directions concern per-language/layer adaptive weighting, learned merge architectures, automated hyperparameter selection, alignment-aware parameter fusion, and theoretical underpinnings for when language gap invariance or merge success is guaranteed.
7. Comparison of Key Branch-and-Merge Methods
| Approach | Branch Mechanism | Merge Operator / Recipe | Application | Key Reference |
|---|---|---|---|---|
| BTM | Domain SFT/data split | Posterior-weighted average, ensembling | Modular domain adaptation | (Li et al., 2022) |
| BTX | Async domain SFT | MoE construction + router FT | Multi-domain expert LLMs | (Sukhbaatar et al., 2024) |
| AdaMergeX | Adapter PEFT SFT | Structure-adaptive algebraic merging | Cross-lingual transfer | (Zhao et al., 2024) |
| BaM | Data slicing + SFT | Iterative linear/Slerp merge | Catastrophic forgetting mitigation | (Alexandrov et al., 2024) |
| LastMerge | Language-agnostic AST | Tree-sitter based structural merging | Polyglot code integration | (Duarte et al., 25 Jul 2025) |
| TinyR1 Branch–Merge | Domain SFT distillation | Importance-masked per-parameter fusion | Model compression | (Sun et al., 6 Mar 2025) |
| UniMoS | Modality separation | Dynamic ensemble of “modality branches” | VLM unsupervised domain adaptation | (Li et al., 2024) |
| Code-mixed Merge | CPT on code-mixed | Task Arithmetic, TIES, + supervised vector merge | Robust mixed-code NLP | (Kodali et al., 22 Oct 2025) |
| Scaling + Merge | Data-matched upscaling | Linear, Task Arithmetic, MultiSlerp, TIES | Modular high-resource language LMs | (Glocker et al., 11 Dec 2025) |
Branch-and-merge language adaptation provides a modular, computationally scalable alternative to monolithic or sequential adaptation, with empirical improvements across cross-lingual transfer, catastrophic forgetting, model compression, and high-accuracy domain mixture construction. Its success crucially depends on principled branch selection, merge strategy optimization, and structural compatibility across branches, with ongoing research addressing scaling, multilinguality, and fine-grained fusion mechanisms.