
Weighted Adapter Merging

Updated 2 February 2026
  • The paper presents a parameter-efficient method for synthesizing a unified adapter by merging task-specific adapters using convex or affine combinations, enhancing multi-task and continual learning.
  • It details various weight selection strategies—including uniform, similarity-based, and router-based approaches—to minimize interference and optimize performance across diverse tasks.
  • The technique is extended through dynamic, hierarchical, and structure-adaptive merging methods, achieving improved transfer accuracy and scalable on-device deployment.

Weighted adapter merging is a parameter-efficient technique for synthesizing a single adapter (or a small number of adapters) from a collection of task-specific adapters trained on different datasets, domains, or tasks. By constructing a convex or affine combination of pretrained adapter weights, this approach enables multi-task and continual learning without additional back-propagation, while controlling interference and memory footprint. Weighted adapter merging underpins many recent developments in scalable multi-domain adaptation, task arithmetic, cross-lingual transfer, and storage-efficient on-device deployment.

1. Formal Definition and Theoretical Foundations

Let $\Theta$ denote the parameters of a frozen pretrained base model, and let $\{\Theta^*_i\}_{i=1}^K$ be $K$ fine-tuned models (usually adapters) trained on non-overlapping datasets $D_1, \ldots, D_K$. The task vector for task $i$ is defined as $\tau_i = \Theta^*_i - \Theta$. In the standard Task Arithmetic (TA) framework, weighted adapter merging forms a merged model as

$$\widehat{\Theta} = \Theta + \sum_{i=1}^K \lambda_i \tau_i,$$

where $\lambda_i$ are scalar weights. This formulation, or its vectorized (per-parameter/per-group) generalization, defines the class of weighted adapter merging techniques.
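The Task Arithmetic merge above can be sketched in a few lines; this is an illustrative toy with flat parameter vectors, not any paper's released implementation (real adapters would be per-layer tensors):

```python
import numpy as np

def merge_task_vectors(theta_base, finetuned, lambdas):
    """Merge fine-tuned models via a weighted sum of task vectors.

    theta_base: base parameters Theta
    finetuned:  list of fine-tuned parameter vectors Theta*_i
    lambdas:    list of scalar weights lambda_i
    """
    merged = theta_base.copy()
    for theta_i, lam in zip(finetuned, lambdas):
        tau_i = theta_i - theta_base   # task vector tau_i = Theta*_i - Theta
        merged += lam * tau_i
    return merged

# Two toy "fine-tuned" models that each move one coordinate of the base.
theta = np.zeros(4)
models = [theta + np.array([1.0, 0.0, 0.0, 0.0]),
          theta + np.array([0.0, 2.0, 0.0, 0.0])]
merged = merge_task_vectors(theta, models, [0.5, 0.5])
```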

A key theoretical insight is that non-orthogonal $\tau_i$ generally interfere when combined, degrading per-task performance. The merging gap for each task $i$,

$$G_i = L_i\Big(\Theta + \sum_j \lambda_j \tau_j\Big) - L_i(\Theta + \lambda_i \tau_i),$$

has the first-order Taylor expansion

$$G_i \approx k_i \sum_{j \ne i} \lambda_j \langle \tau_i, \tau_j \rangle, \qquad k_i < 0,$$

implying that a zero gap is achieved when $\{\tau_i\}$ are mutually orthogonal, i.e., $\langle \tau_i, \tau_j \rangle = 0$ for all $i \ne j$ (Xiong et al., 2024).
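The first-order interference term $\sum_{j \ne i} \lambda_j \langle \tau_i, \tau_j \rangle$ is cheap to compute, so one can check it directly before merging. A minimal illustration (names and examples are ours, not from the paper):

```python
import numpy as np

def predicted_interference(taus, lambdas, i):
    """First-order interference on task i: sum_{j != i} lambda_j <tau_i, tau_j>.

    Mutually orthogonal task vectors make every inner product, and hence
    the predicted merging gap, vanish.
    """
    return sum(lambdas[j] * float(np.dot(taus[i], taus[j]))
               for j in range(len(taus)) if j != i)

orthogonal  = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # <tau_1, tau_2> = 0
overlapping = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]   # <tau_1, tau_2> = 1

gap_orth = predicted_interference(orthogonal, [1.0, 1.0], 0)
gap_over = predicted_interference(overlapping, [1.0, 1.0], 0)
```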

2. Weight Selection Strategies

Uniform and Data-Driven Weights

The simplest method sets $\lambda_i = 1/K$ for uniform averaging ("adapter soup") (Chronopoulou et al., 2023). However, uniform weights can lead to performance collapse when many diverse or incompatible adapters are merged, due to destructive interference (Nguyen et al., 2024).

Data-driven or validation-based weighting strategies include:

  • Similarity-based selection: selecting the top-$k$ adapters with the highest cosine or SBERT similarity to a held-out sample, then merging with uniform weights (Chronopoulou et al., 2023).
  • Metric-weighted averaging (MWA): assign each adapter a scalar metric $m_i$ (e.g., validation loss), then set

$$w_i = \frac{\exp(\alpha \, \Delta m_i)}{\sum_j \exp(\alpha \, \Delta m_j)},$$

where $\Delta m_i$ is a positive measure of adapter "quality" and $\alpha$ is a penalty factor controlling the peakiness of the weight distribution (Yu et al., 23 Apr 2025).
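The MWA softmax above is straightforward to implement; this sketch follows the formula's notation ($\Delta m_i$, $\alpha$) and is not tied to any released codebase:

```python
import math

def mwa_weights(delta_m, alpha):
    """Softmax over alpha * delta_m_i: a larger quality metric yields a
    larger merge weight; larger alpha makes the distribution peakier."""
    exps = [math.exp(alpha * d) for d in delta_m]
    total = sum(exps)
    return [e / total for e in exps]

# Three adapters with quality metrics 0.1, 0.3, 0.2; the second dominates.
w = mwa_weights([0.1, 0.3, 0.2], alpha=10.0)
```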

  • Router-based dynamic weights: at inference, per-instance weights $p_t$ are computed via a router as softmax-normalized cosine similarities $\ell_t$ between input features and adapter centroids,

$$p_t = \frac{\exp(\ell_t/\tau)}{\sum_{i=1}^T \exp(\ell_i/\tau)},$$

enabling instance-level merging (Cheng et al., 2024).
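A minimal router sketch under our own assumptions (how centroids are built and the temperature value are illustrative, not the paper's choices):

```python
import numpy as np

def route(x, centroids, temperature=0.1):
    """Per-instance adapter weights: softmax over cosine similarities
    between the input feature x and each adapter's centroid."""
    x = x / np.linalg.norm(x)
    sims = np.array([np.dot(x, c / np.linalg.norm(c)) for c in centroids])
    logits = sims / temperature
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# An input close to the first centroid gets most of the merge weight.
p = route(np.array([1.0, 0.1]),
          [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```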

Conflict- and Sign-Aware Merging

Certain schemes address conflicts by considering parameter signs and trimming conflicting updates:

  • TIES-Merging: averages only parameters where all adapters agree in sign; conflicting coordinates are set to zero (Dehghan et al., 2024).
  • DARE ("Drop-And-Rescale"): randomly drops a fraction of the delta parameters and rescales the survivors before averaging (Dehghan et al., 2024).
  • FSD (Fraction-of-Sign-Difference) selection: restricts the merge to a subset of adapters with minimal pairwise sign disagreement, since high sign conflict predicts severe accuracy drops (Nguyen et al., 2024).
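A toy sign-aware merge in the spirit of TIES: keep only coordinates where every adapter agrees in sign, zero the rest, then average. This is a simplified sketch, omitting the published method's trimming and sign-electing steps:

```python
import numpy as np

def sign_agree_merge(taus):
    """Average task vectors, zeroing coordinates with any sign conflict."""
    taus = np.stack(taus)
    signs = np.sign(taus)
    # A coordinate "agrees" if every adapter shares the first adapter's
    # (nonzero) sign there.
    agree = np.all(signs == signs[0], axis=0) & (signs[0] != 0)
    merged = taus.mean(axis=0)
    merged[~agree] = 0.0
    return merged

# Coordinate 1 conflicts in sign (-1 vs +1), so it is zeroed out.
m = sign_agree_merge([np.array([1.0, -1.0, 2.0]),
                      np.array([2.0,  1.0, 4.0])])
```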

3. Regularization, Orthogonalization, and Algebraic Structure

The Adaptive Weight Disentanglement (AWD) approach explicitly introduces a redundant vector $\delta$ and defines disentangled task vectors $\hat{\tau}_i = \tau_i - \delta$, optimizing for

  • Orthogonality loss: $L_0(\delta) = \frac{1}{K(K-1)} \sum_{i \ne j} \left| \cos(\tau_i - \delta,\, \tau_j - \delta) \right|$
  • Redundancy penalty: $L_r(\delta) = \|\delta\|_2$

with total loss $L(\delta) = L_0(\delta) + \alpha L_r(\delta)$, where $\alpha$ trades off orthogonality against task fidelity. AWD disentangles interfering directions, boosting merged-model accuracy, with optimal performance obtained for small $\alpha \in [10^{-6}, 10^{-2}]$ (Xiong et al., 2024).
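The AWD objective is easy to evaluate for a candidate $\delta$; a sketch of the loss only (the paper optimizes $\delta$ by gradient descent, which is omitted here):

```python
import numpy as np

def awd_loss(taus, delta, alpha):
    """L(delta) = mean |cos(tau_i - delta, tau_j - delta)| over i != j,
    plus an L2 redundancy penalty alpha * ||delta||_2."""
    hats = [t - delta for t in taus]
    K = len(hats)
    ortho = 0.0
    for i in range(K):
        for j in range(K):
            if i != j:
                c = np.dot(hats[i], hats[j]) / (
                    np.linalg.norm(hats[i]) * np.linalg.norm(hats[j]))
                ortho += abs(c)
    ortho /= K * (K - 1)
    return ortho + alpha * np.linalg.norm(delta)

# Already-orthogonal task vectors with delta = 0 give zero loss.
loss = awd_loss([np.array([1.0, 1.0]), np.array([1.0, -1.0])],
                np.zeros(2), alpha=1e-4)
```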

For adapters with specific algebraic insertion forms (e.g., LoRA, (IA)$^3$, prefix-tuning), structure-adaptive merging (as in AdaMergeX) generalizes the weighted sum to appropriately matched algebraic operations (additive, multiplicative, or concatenative) (Zhao et al., 2024).

4. Dynamic, Hierarchical, and Continual Weighted Merging

  • Dynamic Instance-Level Weighted Merging: Methods such as DAM compute per-sample weights for adapters at inference, allowing on-the-fly construction of an instance-adaptive adapter via softmaxed similarities (Cheng et al., 2024).
  • Online Continual Merging: Approaches like K-Merge incrementally merge new adapters into stored clusters, maintaining a running weighted mean:

$$\Delta W^* = \frac{n_c \, \Delta W_c + \Delta W^{(t)}}{n_c + 1},$$

with $n_c$ tracking the cluster size (Shenaj et al., 15 Oct 2025).
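The running-mean update above takes one line; variable names mirror the formula, not any released codebase:

```python
def running_merge(delta_w_cluster, n_c, delta_w_new):
    """Fold a new adapter update into the running mean of a cluster
    that currently aggregates n_c adapters."""
    return (n_c * delta_w_cluster + delta_w_new) / (n_c + 1)

# Cluster mean 2.0 over 3 adapters, new adapter update 6.0 -> new mean 3.0.
w = running_merge(2.0, 3, 6.0)
```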

  • Hierarchical (HAM): HAM learns a scalar importance $\alpha_t$ per task adapter, forms groups by similarity, concatenates pruned task adapters within each group, and finally produces a merged adapter as a weighted sum over the $M$ group adapters:

$$\Delta W_{\mathrm{merged}} = \frac{1}{M} \sum_{j=1}^{M} \alpha_{G_j} \left( B_{G_j} A_{G_j} \right)$$

(Coleman et al., 16 Sep 2025).
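The final HAM combination step can be sketched as follows; group formation and pruning are omitted, and the shapes and names here are illustrative assumptions:

```python
import numpy as np

def ham_merge(groups):
    """Merged LoRA-style update: mean over groups of alpha_G * (B_G @ A_G).

    groups: list of (alpha, B, A) with B of shape (d, r) and A of shape (r, k).
    """
    M = len(groups)
    return sum(alpha * (B @ A) for alpha, B, A in groups) / M

# Two rank-1 group adapters with importances 1.0 and 3.0.
B = np.ones((2, 1))
A = np.ones((1, 2))
delta_w = ham_merge([(1.0, B, A), (3.0, B, A)])
```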

5. Practical Considerations and Empirical Findings

Weighted adapter merging has consistently demonstrated robustness to the choice of merging coefficients: typical schemes tolerate weights in moderate intervals (e.g., $\lambda \in [0.3, 1.0]$ for scaled merging), with performance optimized near $\lambda \in [0.5, 0.8]$ (Xiong et al., 2024). In medical and safety-critical settings, careful tuning of the merge ratio yields intermediate models that balance domain-specific knowledge with instruction alignment (Zou, 26 Jan 2026).

6. Limitations, Failure Modes, and Open Challenges

  • Negative Interference: uniform merging across incompatible adapters (high sign disagreement or unrelated domains) can destroy both in-domain and generalization performance; drops of roughly 12% have been observed for indiscriminate averaging over a large adapter set (Nguyen et al., 2024).
  • Order Sensitivity: In continual merging, the sequence and weighting schedule (e.g., geometric, harmonic) impact final performance (Dehghan et al., 2024).
  • Initialization Synchrony: All adapters must be initialized from the same random seed or checkpoint; mismatched initializations cause catastrophic degradation (Chronopoulou et al., 2023).
  • Static vs. Dynamic Weights: Static schemes may be suboptimal for inputs dissimilar to the training set; instance-aware routing/gating improves transfer in such regimes (Cheng et al., 2024, Ozsoy, 22 Jan 2026).
  • Layerwise and Structure-Specific Fusion: Layerwise or per-parameter gating remains largely unexplored; improper merging algebra (e.g., additive on multiplicative adapters) can collapse performance (Zhao et al., 2024).

There is ongoing research to address these limitations via orthogonalization (AWD), dynamic routing, hierarchical splitting (HAM), and structure-adaptive frameworks.

7. Algorithm Summaries and Implementation Guidance

Implementations of weighted adapter merging generally follow this pattern:

  1. Collect task-specific adapters and, optionally, select a relevant subset using domain/semantic similarity or sign-based conflict metrics.
  2. Compute weights: uniform, validation/metric-based, router-based, or learned via gating MLP.
  3. Merge parameters via weighted sum, possibly with sign-trimming, drop-and-rescale, or with disentangled/orthogonalized task vectors.
  4. Replace or concatenate merged adapter(s) into the base model, preserving proper layer naming and structure (see (Xiong et al., 2024, Yu et al., 23 Apr 2025, Zou, 26 Jan 2026)).
  5. Validate merged model on target domains/tasks, tuning merge coefficients and monitoring performance.
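Steps 1–3 of the recipe above can be sketched end to end. The conflict threshold, metric choice, and sharpness parameter below are illustrative assumptions, not prescriptions from any one paper:

```python
import numpy as np

def merge_pipeline(taus, metrics, alpha=5.0, max_conflict=0.5):
    """Select low-conflict adapters, weight them by a quality metric,
    and return the weighted sum of the surviving task vectors."""
    taus = np.stack(taus)
    signs = np.sign(taus)
    # Step 1: drop adapters whose mean sign disagreement with the rest
    # exceeds max_conflict (an FSD-style selection heuristic).
    keep = []
    for i in range(len(taus)):
        others = [j for j in range(len(taus)) if j != i]
        disagree = np.mean([(signs[i] != signs[j]).mean() for j in others])
        if disagree <= max_conflict:
            keep.append(i)
    # Step 2: softmax metric-based weights over the survivors.
    m = np.array([metrics[i] for i in keep])
    w = np.exp(alpha * m)
    w /= w.sum()
    # Step 3: merge via weighted sum of task vectors.
    return sum(wi * taus[i] for wi, i in zip(w, keep))

# Two compatible adapters with equal quality: the merge recovers their mean.
merged = merge_pipeline([np.array([1.0, 1.0]), np.array([1.0, 1.0])],
                        [0.0, 0.0])
```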

When merging LoRA adapters, ensure consistency in factorization, seed initialization, and that merging is restricted to only the adapters (not the base weights).

Empirical best practices include:

  • Restricting merges to adapters with low pairwise conflict (as informed by FSD or task similarity).
  • Regularizing with orthogonality constraints or redundancy penalties in the disentanglement loss.
  • For continual or on-device settings, adopting streaming or hierarchical update rules to minimize interference and computational overhead.

Weighted adapter merging thus provides a unified, theoretically principled, and empirically validated toolbox for parameter-efficient multi-task modeling, transfer, and continual adaptation in both language and vision domains (Xiong et al., 2024, Chronopoulou et al., 2023, Coleman et al., 16 Sep 2025, Yu et al., 23 Apr 2025, Cheng et al., 2024, He et al., 2023, Zhao et al., 2024, Zou, 26 Jan 2026).
