Weighted Adapter Merging

Updated 2 February 2026

The paper presents a parameter-efficient method for synthesizing a unified adapter by merging task-specific adapters using convex or affine combinations, enhancing multi-task and continual learning.
It details various weight selection strategies—including uniform, similarity-based, and router-based approaches—to minimize interference and optimize performance across diverse tasks.
The technique is extended through dynamic, hierarchical, and structure-adaptive merging methods, achieving improved transfer accuracy and scalable on-device deployment.

Weighted adapter merging is a parameter-efficient technique for synthesizing a single adapter (or a small number of adapters) from a collection of task-specific adapters trained on different datasets, domains, or tasks. By constructing a convex or affine combination of pretrained adapter weights, this approach enables multi-task and continual learning without additional back-propagation, while controlling interference and memory footprint. Weighted adapter merging underpins many recent developments in scalable multi-domain adaptation, task arithmetic, cross-lingual transfer, and storage-efficient on-device deployment.

1. Formal Definition and Theoretical Foundations

Let $\Theta$ denote the parameters of a frozen pretrained base model, and let $\{\Theta^*_i\}_{i=1}^K$ be $K$ fine-tuned models (usually adapters) trained on non-overlapping datasets $D_1, ..., D_K$ . The task vector for task $i$ is defined as $\tau_i = \Theta^*_i - \Theta$ . In the standard Task Arithmetic (TA) framework, weighted adapter merging forms a merged model as

$\widehat{\Theta} = \Theta + \sum_{i=1}^K \lambda_i\tau_i,$

where $\lambda_i$ are scalar weights. This formulation, or its vectorized (per-parameter/per-group) generalization, defines the class of weighted adapter merging techniques.

A key theoretical insight is that non-orthogonal $\tau_i$ generally interfere when combined, degrading per-task performance. The merging gap for each task $i$ ,

$G_i = L_i(\Theta + \sum_j\lambda_j\tau_j) - L_i(\Theta + \lambda_i \tau_i),$

has the first-order Taylor expansion

$G_i \approx k_i \sum_{j\ne i} \lambda_j \langle \tau_i, \tau_j \rangle, \; k_i<0,$

implying zero gap is achieved when $\{\tau_i\}$ are mutually orthogonal, i.e., $\langle \tau_i, \tau_j \rangle = 0$ for all $i \ne j$ (Xiong et al., 2024).

2. Weight Selection Strategies

Uniform and Data-Driven Weights

The simplest method sets $\lambda_i = 1/K$ for uniform averaging ("adapter soup") (Chronopoulou et al., 2023). However, uniform weights can lead to performance collapse when many diverse or incompatible adapters are merged, due to destructive interference (Nguyen et al., 2024).

Data-driven or validation-based weighting strategies include:

Similarity-based selection: Selecting the top- $k$ adapters with highest cosine or SBERT similarity to a held-out sample, then merging with uniform weights (Chronopoulou et al., 2023).
Metric-weighted averaging (MWA): Assign each adapter a scalar metric $m_i$ (e.g., validation loss), then set

$w_i = \frac{\exp(\alpha \Delta m_i)}{\sum_j \exp(\alpha \Delta m_j)}$

where $\Delta m_i$ is a positive measure of adapter "quality" and $\alpha$ is a penalty factor controlling peakiness (Yu et al., 23 Apr 2025).

Router-based dynamic weights: At inference, per-instance weights $p_t$ are computed via a router as normalized softmaxed cosine similarities between input features and adapter centroids,

$p_t = \frac{\exp(\ell_t/\tau)}{\sum_{i=1}^T \exp(\ell_i/\tau)}$

enabling instance-level merging (Cheng et al., 2024).

Conflict- and Sign-Aware Merging

Certain schemes address conflicts by considering parameter signs and trimming conflicting updates:

TIES-Merging: Averages only parameters where all adapters agree in sign; conflicting coordinates are set to zero (Dehghan et al., 2024).
DARE ("Drop-And-Rescale"): Randomly drops a fraction of parameters and re-normalizes before averaging (Dehghan et al., 2024).
FSD (Fraction-of-Sign-Difference) selection: Subset adapters to be merged based on minimal pairwise sign disagreement, as high sign conflict predicts severe accuracy drops (Nguyen et al., 2024).

3. Regularization, Orthogonalization, and Algebraic Structure

The Adaptive Weight Disentanglement (AWD) approach explicitly introduces a redundant vector $\delta$ and defines disentangled task vectors $\hat{\tau}_i = \tau_i - \delta$ , optimizing for

Orthogonality loss: $L_0(\delta) = \frac{1}{K(K-1)}\sum_{i\ne j} |\cos(\tau_i - \delta, \tau_j - \delta)|$
Redundancy penalty: $L_r(\delta) = \|\delta\|_2$

with total loss $L(\delta) = L_0(\delta) + \alpha L_r(\delta)$ , where $\alpha$ trades off orthogonality against task fidelity. AWDisentangles interfering directions, boosting merged model accuracy, with optimal performance obtained for small $\alpha \in [10^{-2}, 10^{-6}]$ (Xiong et al., 2024).

For adapters with specific algebraic insertion forms (e.g., LoRA, (IA) $^3$ , prefix-tuning), structure-adaptive merging (as in AdaMergeX) generalizes the weighted sum to appropriately matched algebraic operations (additive, multiplicative, or concatenative) (Zhao et al., 2024).

4. Dynamic, Hierarchical, and Continual Weighted Merging

Dynamic Instance-Level Weighted Merging: Methods such as DAM compute per-sample weights for adapters at inference, allowing on-the-fly construction of an instance-adaptive adapter via softmaxed similarities (Cheng et al., 2024).
Online Continual Merging: Approaches like K-Merge incrementally merge new adapters into stored clusters, maintaining a running weighted mean:

$\Delta W^* = \frac{n_c \Delta W_c + \Delta W^{(t)}}{n_c+1}$

with $n_c$ tracking the cluster size (Shenaj et al., 15 Oct 2025).

Hierarchical (HAM): HAM learns a scalar importance $\alpha_t$ per task-adapter, forms groups by similarity, concatenates pruned task adapters in each group, and finally produces a merged adapter as a weighted sum of group adapters:

$\Delta W_{\mathrm{merged}} = \frac{1}{M} \sum_{j=1}^M \alpha_{G_j}(B_{G_j}A_{G_j})$

(Coleman et al., 16 Sep 2025).

5. Practical Considerations and Empirical Findings

Weighted adapter merging has consistently demonstrated:

Improved transfer and multi-task performance relative to naïve averaging, with task gains especially marked when interference is minimized via orthogonalization or subset selection (Xiong et al., 2024, He et al., 2023, Chronopoulou et al., 2023).
Data-free and scalable applicability, including for LoRA and other PEFT methods, with memory and compute requirements effectively reduced relative to storing a full adapter per task (Yu et al., 23 Apr 2025, Shenaj et al., 15 Oct 2025, Ceritli et al., 23 Jul 2025).
Empirical gains over uniform merging: metric-weighted averaging provided up to +5.05% task accuracy improvement vs. uniform, and router-based dynamic merging improved accuracy by 6–7.5 percentage points relative to single best or fixed-weight baselines (Yu et al., 23 Apr 2025, Cheng et al., 2024).

Typical merging schemes are robust to weight selection in moderate intervals (e.g., $\lambda \in [0.3, 1.0]$ for scaled merging), with performance optimized near $\lambda \in [0.5, 0.8]$ (Xiong et al., 2024). In medical and safety-critical settings, careful tuning of the merge ratio yields intermediate models that balance domain-specific knowledge with instruction alignment (Zou, 26 Jan 2026).

6. Limitations, Failure Modes, and Open Challenges

Negative Interference: Uniform merging across incompatible adapters (high sign disagreement or unrelated domains) can destroy both in-domain and generalization performance; performance drops $\sim$ 12% have been observed for indiscriminate averaging over a large set (Nguyen et al., 2024).
Order Sensitivity: In continual merging, the sequence and weighting schedule (e.g., geometric, harmonic) impact final performance (Dehghan et al., 2024).
Initialization Synchrony: All adapters must be initialized from the same random seed or checkpoint; mismatched initializations cause catastrophic degradation (Chronopoulou et al., 2023).
Static vs. Dynamic Weights: Static schemes may be suboptimal for inputs dissimilar to the training set; instance-aware routing/gating improves transfer in such regimes (Cheng et al., 2024, Ozsoy, 22 Jan 2026).
Layerwise and Structure-Specific Fusion: Layerwise or per-parameter gating remains largely unexplored; improper merging algebra (e.g., additive on multiplicative adapters) can collapse performance (Zhao et al., 2024).

There is ongoing research to address these limitations via orthogonalization (AWD), dynamic routing, hierarchical splitting (HAM), and structure-adaptive frameworks.

7. Algorithm Summaries and Implementation Guidance

Implementations of weighted adapter merging generally follow this pattern:

Collect task-specific adapters and, optionally, select a relevant subset using domain/semantic similarity or sign-based conflict metrics.
Compute weights: uniform, validation/metric-based, router-based, or learned via gating MLP.
Merge parameters via weighted sum, possibly with sign-trimming, drop-and-rescale, or with disentangled/orthogonalized task vectors.
Replace or concatenate merged adapter(s) into the base model, preserving proper layer naming and structure (see (Xiong et al., 2024, Yu et al., 23 Apr 2025, Zou, 26 Jan 2026)).
Validate merged model on target domains/tasks, tuning merge coefficients and monitoring performance.

When merging LoRA adapters, ensure consistency in factorization, seed initialization, and that merging is restricted to only the adapters (not the base weights).

Empirical best practices include:

Restricting merges to adapters with low pairwise conflict (as informed by FSD or task similarity).
Regularizing with orthogonality constraints or redundancy penalties in the disentanglement loss.
For continual or on-device settings, adopting streaming or hierarchical update rules to minimize interference and computational overhead.

Weighted adapter merging thus provides a unified, theoretically principled, and empirically validated toolbox for parameter-efficient multi-task modeling, transfer, and continual adaptation in both language and vision domains (Xiong et al., 2024, Chronopoulou et al., 2023, Coleman et al., 16 Sep 2025, Yu et al., 23 Apr 2025, Cheng et al., 2024, He et al., 2023, Zhao et al., 2024, Zou, 26 Jan 2026).