Dynamical Adapter Fusion (DAF)

Updated 5 February 2026
  • Dynamical Adapter Fusion is a method that fuses multiple task-specific adapters dynamically to enable scalable and memory-efficient adaptation in pretrained models.
  • It employs an MLP-based gating mechanism for multilingual Text2Cypher and a PAC-Bayes-derived fusion rule for vision class-incremental learning, balancing knowledge transfer and retention.
  • DAF achieves competitive performance by recovering joint fine-tuning gains using minimal training data while effectively mitigating catastrophic forgetting.

Dynamical Adapter Fusion (DAF) is a technique for combining multiple task- or language-specific adapters into a single, effective module for inference in pretrained models. DAF addresses two major use cases: (1) enabling scalable, incremental adaptation for multilingual text-to-Cypher (Text2Cypher) systems without retraining or touching the base model or prior adapters (Ozsoy, 22 Jan 2026), and (2) constructing a global adapter for class-incremental learning (CIL) of vision backbones, dynamically fusing information from new tasks, the prior global adapter, and initialization in a PAC-Bayesian framework (Liu et al., 29 Jan 2026). DAF’s key innovation is to replace static, uniform merging of adapters with a dynamic, input-dependent (or learning-stage dependent) mechanism, balancing knowledge transfer, data efficiency, and retention across tasks and languages.

1. Motivation and Problem Setting

Adapter-based transfer learning circumvents the inefficiencies of full model fine-tuning by introducing lightweight modules—adapters—that can be specialized for distinct tasks, languages, or incremental tasks. However, naively maintaining one adapter per task/language hinders knowledge sharing and increases retrieval/serving costs. Uniform or static merging of adapters is suboptimal, as it cannot adapt to per-input cues or minimize interference.

Multilingual NLU Scenario

In the multilingual Text2Cypher setting, a frozen LLM (Meta-Llama-3.1-8B) is augmented with LoRA adapters $\{\Delta W_i\}$, each specialized for a target language (EN, ES, TR) (Ozsoy, 22 Jan 2026). Joint multilingual fine-tuning yields the highest performance but is not scalable: each new language addition would require re-tuning the whole base model and all adapters.

Class-Incremental Learning Scenario

In class-incremental learning, a frozen backbone (e.g., ViT-B/16) is paired with adapters trained on sequences of disjoint tasks. Post-hoc fusion of task-specific adapters into a single global module is required to ensure inference uses only one adapter and to facilitate cross-task generalization, all while preventing catastrophic forgetting of prior knowledge (Liu et al., 29 Jan 2026).

2. Algorithmic and Architectural Frameworks

Dynamic Gating via Fusion MLP (Multilingual Task)

DAF for adapter fusion employs a two-layer MLP to compute input-dependent routing weights across frozen adapters (Ozsoy, 22 Jan 2026):

  • Inputs: The MLP receives the mean-pooled base-model embedding $\bar h_{\mathrm{base}} \in \mathbb{R}^{B \times H}$ and "preview" features $f_{i,\mathrm{preview}}$ from each adapter (pre-softmax logits averaged over the last 200 tokens), concatenated across adapters.
  • Architecture: Let $x_{\mathrm{MLP}} = [\bar h_{\mathrm{base}}, f_{\mathrm{preview}}]$ of dimension $D = H + nV$; the MLP computes $z_1 = \mathrm{ReLU}(x_{\mathrm{MLP}} W_1 + b_1)$ and $z_2 = z_1 W_2 + b_2$, with $z_2 \in \mathbb{R}^{B \times n}$. No dropout is employed.
  • Gating: Row-wise softmax produces adapter weights $w(x) = \mathrm{softmax}(\hat g(x))$ such that $\sum_{i=1}^n w_i(x) = 1$. The final output logits are fused as $\mathrm{logits}_{\mathrm{fused}}(x) = \sum_{i=1}^n w_i(x)\,\mathrm{logits}_i(x)$.
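As an illustration, the gating computation above can be sketched in NumPy. The function name, hidden width, and weight shapes are assumptions for exposition, not the paper's released code:

```python
import numpy as np

def fusion_gate(h_base_mean, preview_feats, adapter_logits, W1, b1, W2, b2):
    """Input-dependent gating over n frozen adapters (illustrative sketch).

    h_base_mean:    (B, H)      mean-pooled base-model embedding
    preview_feats:  (B, n*V)    concatenated per-adapter preview features
    adapter_logits: (B, n, V)   per-adapter output logits to be fused
    """
    x = np.concatenate([h_base_mean, preview_feats], axis=-1)  # (B, H + n*V)
    z1 = np.maximum(x @ W1 + b1, 0.0)                          # ReLU; no dropout
    g = z1 @ W2 + b2                                           # (B, n) gate logits
    g = g - g.max(axis=-1, keepdims=True)                      # numerically stable softmax
    w = np.exp(g) / np.exp(g).sum(axis=-1, keepdims=True)      # rows sum to 1
    # fuse per-adapter logits with the input-conditional mixture weights
    fused = np.einsum("bn,bnv->bv", w, adapter_logits)         # (B, V)
    return fused, w
```

With trained $W_1, W_2$, each row of `w` is an input-conditional mixture over the $n$ adapters, so the fused logits interpolate between the language-specific experts per example.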

PAC-Bayes-Derived Fusion for Incremental Learning

For CIL, DAF fuses the new-task adapter $\theta_t$, the preceding global adapter $\theta_g^{(t-1)}$, and a robust initialization prior $\theta_0$ via an affine combination parameterized by $\beta_t$:

$$\theta_g^{(t)} = \beta_t \theta_0 + \beta_t \theta_g^{(t-1)} + (1 - 2\beta_t)\,\theta_t$$

The fusion weight $\beta_t$ is dynamically optimized to minimize a Taylor-approximated surrogate loss derived from a PAC-Bayes generalization bound. The calculation involves the gradients and Fisher information (a diagonal Hessian approximation) at $\theta_t$ and incorporates stabilization constraints (Liu et al., 29 Jan 2026). Robust initialization is achieved via a running average of past task adapters.
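The affine combination itself is a one-liner over the flattened adapter parameters; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def fuse_global_adapter(theta_t, theta_g_prev, theta_0, beta_t):
    """Affine fusion of the new-task adapter, previous global adapter,
    and robust initialization prior. The three coefficients
    (beta, beta, 1 - 2*beta) sum to 1, so this is an affine combination."""
    return beta_t * theta_0 + beta_t * theta_g_prev + (1.0 - 2.0 * beta_t) * theta_t
```

Note that $\beta_t = 0$ keeps only the new-task adapter (maximal plasticity), while $\beta_t = 1/3$ recovers the uniform static fusion used as an ablation baseline.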

3. Theoretical Foundations

PAC-Bayes Risk Analysis

For CIL, the risk of the fused adapter (a mean-field posterior) is bounded by an empirical-risk term, two KL regularizers, and a confidence term:

$$\mathcal{R}(Q_t) \leq \frac{1}{|D_t|}\,\mathbb{E}_{h\sim Q_t}\sum_{(x,y)\in D_t}\ell(h,x,y) + \frac{1}{\lambda}\,\mathrm{KL}(Q_s\,\|\,P_{\mathrm{init}}) + \frac{1}{\lambda}\,\mathrm{KL}(Q_g\,\|\,Q_g^{(t-1)}) + O\!\left(\frac{\log(1/\delta)}{|D_t|}\right)$$

This decomposition captures empirical risk (on new data), a generalization regularizer (proximity to the data-free prior), and a forgetting regularizer (proximity to the previous global adapter). The fusion objective formalizes the trade-off between stability (memory retention) and plasticity (adaptability to new data).

Gating and Fusion Formulas

Both application domains use explicit, mathematically grounded fusion formulas. In the multilingual scenario, the fusion MLP gating produces input-conditional mixture weights, whereas in CIL, the fusion is a solved quadratic minimization over $\beta_t$, with practical stabilization via Fisher scaling and robust averaging (Ozsoy, 22 Jan 2026, Liu et al., 29 Jan 2026).

4. Training Procedures and Implementation

Multilingual Adapter Fusion

  • Frozen Weights: Only MLP weights are trained; LoRA adapters and base LLM are fixed.
  • Supervision and Data Efficiency: Only a subsample (2,500 examples per language, 20% of the total) is needed; one epoch suffices.
  • Loss: Standard cross-entropy on the fused logits; $\ell_2$ regularization on the MLP.
  • Optimization: AdamW, learning rate $2\times 10^{-4}$, weight decay 0.01, gradient accumulation factor 4.

CIL Adapter Fusion

  • Stage-wise Algorithm: For each task, after adapter-specific training (SGD), DAF computes $\beta_t$ in closed form, fuses the adapters, updates the running average, and discards the task-specific parameters.
  • Hessian Approximation: Diagonal Fisher information is used to stabilize $\beta_t$.
  • Memory Efficiency: No exemplar or replay buffer is used. Only the global adapter and running average are retained post-fusion.
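The stage-wise procedure above can be sketched as follows, with a `compute_beta` callback standing in for the paper's closed-form, Fisher-stabilized solution (the callback and function names are assumptions for illustration):

```python
import numpy as np

def daf_incremental_stream(task_adapters, compute_beta):
    """Stage-wise DAF sketch: fuse each newly trained task adapter into the
    global adapter, update the running-average initialization, and discard
    the task-specific parameters. Only theta_g and theta_0 are retained."""
    theta_g = None   # global adapter used at inference
    theta_0 = None   # running average of past task adapters (robust init)
    for t, theta_t in enumerate(task_adapters, start=1):
        if theta_g is None:
            # First task: the global adapter is just the task adapter.
            theta_g, theta_0 = theta_t.copy(), theta_t.copy()
            continue
        beta = compute_beta(theta_t, theta_g, theta_0)
        # Affine fusion: beta*theta_0 + beta*theta_g_prev + (1-2*beta)*theta_t
        theta_g = beta * theta_0 + beta * theta_g + (1 - 2 * beta) * theta_t
        # Incremental running average over task adapters seen so far
        theta_0 = theta_0 + (theta_t - theta_0) / t
    return theta_g
```

Because only `theta_g` and `theta_0` persist across tasks, memory stays constant in the number of tasks, matching the no-exemplar design described above.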

5. Empirical Results and Comparative Analysis

Multilingual Fusion

  • ROUGE-L Performance (Text2Cypher):
    • Base model: EN 0.65, ES 0.60, TR 0.55 (avg 0.60)
    • Joint fine-tuning: EN 0.86, ES 0.85, TR 0.83 (avg 0.85)
    • Uniform merge: EN 0.79, ES 0.76, TR 0.71 (avg 0.75)
    • DAF (fusion MLP): EN/ES 0.80, TR 0.78 (avg 0.79)
  • DAF achieves a +0.04 average improvement over linear merging and recovers $\approx 75\%$ of joint fine-tuning gains using only 20% of the data.
  • Routing weights $w(x)$ align strongly with the true language (e.g., $w_{\mathrm{ES}} \approx 0.96$ for Spanish inputs).
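The reported averages can be checked directly (a quick sanity calculation from the numbers above, not from the papers):

```python
# ROUGE-L averages reported in the comparison above
base, joint, uniform, daf = 0.60, 0.85, 0.75, 0.79

gain_over_merge = daf - uniform             # improvement over linear merging
recovered = (daf - base) / (joint - base)   # fraction of joint fine-tuning gains
```

Here `recovered` evaluates to 0.19 / 0.25 = 0.76, consistent with the stated ≈75% recovery.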

CIL Fusion

  • Vision Benchmarks (ViT-B/16-IN21K):
    • CIFAR-100: 94.58% avg, 91.15% final (beats MOS by 1.28–1.90 pts)
    • ImageNet-R: 84.01%/79.63% vs. MOS 82.96%/77.93%
    • ImageNet-A: 72.06%/62.54% vs. MOS 67.08%/56.22%
  • Without any replay buffer, DAF surpasses rehearsal-based methods by 3–7 points across the three datasets.
  • Ablations show static fusion (fixed $\beta = 1/3$) improves over naïvely keeping the last adapter; PAC-Bayes-derived fusion adds a further ≈2 points; robust initialization boosts results further.
  • Hyperparameter sweeps for the Fisher scaling $\alpha$ have negligible impact (within ±0.2 points).

Scalability and Incremental Expansion

| Method | Retrain on New Lang./Task? | Data Needed per Step | Merge Complexity |
|---|---|---|---|
| Joint fine-tuning | Yes | Grows linearly | Full |
| Linear merging | No | Static | O(#adapters) |
| DAF (all domains) | No | 2.5k examples per adapter (MLP only) | O(MLP params) |

With DAF, incremental expansion to a new language or class requires only a small number of new training examples and retraining of the small fusion MLP, not the adapters or backbone (Ozsoy, 22 Jan 2026, Liu et al., 29 Jan 2026).

6. Insights, Limitations, and Future Directions

DAF consistently achieves strong empirical performance in both NLU and vision CIL benchmarks without retraining the base or adapters. Its dynamic gating/fusion architectures minimize catastrophic forgetting (by interpolating between previous knowledge and new data), capture global knowledge via robust initialization, and facilitate scalable, memory-efficient deployment.

Limitations include the reliance on known task boundaries (in the CIL setting), the use of Fisher diagonals as a curvature proxy (which could be refined), and validation primarily on Vision Transformers and select language tasks. Extensions to task-free streams and other backbone architectures remain areas for further research.

7. References

  • "Adapter Fusion for Multilingual Text2Cypher with Linear and Learned Gating" (Ozsoy, 22 Jan 2026)
  • "Dynamical Adapter Fusion: Constructing A Global Adapter for Pre-Trained Model-based Class-Incremental Learning" (Liu et al., 29 Jan 2026)
