Adapter-Based Continual Pre-Training
- Adapter-based continual pre-training is a modular technique that integrates lightweight, task-specific adapters into a frozen backbone to prevent catastrophic forgetting.
- It utilizes a two-layer feedforward architecture with down- and up-projections to efficiently adapt pre-trained models across diverse tasks.
- Empirical results across vision, language, and multi-modal domains demonstrate strong performance with minimal parameter updates compared to full fine-tuning.
Adapter-based continual pre-training is a paradigm for extending the capabilities of large pre-trained neural networks to new tasks, languages, or domains while mitigating catastrophic forgetting and avoiding the inefficiencies of full model re-training. The approach inserts small, often layer-wise, task-specific modules (adapters) into a frozen backbone, enabling modular, parameter-efficient transfer and specialization. Contemporary research demonstrates that such methods yield strong empirical performance across vision, language, speech, and multi-modal domains while updating only a small fraction of the backbone model's parameters per task.
1. Core Principles and Adapter Architectures
Adapter-based continual pre-training is predicated on decoupling feature extraction (handled by a fixed backbone, typically a Transformer) from continual adaptation (handled by lightweight adapters). The standard adapter is a two-layer feed-forward MLP "bottleneck" with down-projection, non-linearity, and up-projection, wrapped in a residual connection: $h' = h + W_{\text{up}}\,\sigma(W_{\text{down}} h)$, where $W_{\text{down}} \in \mathbb{R}^{r \times d}$ (with $r \ll d$), $W_{\text{up}} \in \mathbb{R}^{d \times r}$, and $\sigma$ is a nonlinearity (e.g., ReLU, GELU) (Feng et al., 2022, Kessler et al., 2021, Zhang et al., 2023, Yan et al., 2022).
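A minimal NumPy sketch of this bottleneck (shapes and initialization are illustrative; the zero-initialized up-projection makes the adapter start as an identity map, a common practical choice):

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    z = np.maximum(0.0, h @ W_down.T)  # ReLU(W_down h), shape (r,)
    return h + z @ W_up.T              # h + W_up z, shape (d,)

d, r = 8, 2                            # hidden size d, bottleneck width r << d
rng = np.random.default_rng(0)
W_down = 0.01 * rng.standard_normal((r, d))
W_up = np.zeros((d, r))                # zero init: adapter starts as identity
h = rng.standard_normal(d)
out = adapter_forward(h, W_down, W_up)
```

Zero-initializing the up-projection means the frozen backbone's behavior is unchanged before adapter training begins, which eases optimization.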
Variants include:
- Hierarchical adapters: Partition adapters into a base adapter (multi-task trained to capture general skills) and task-specific adapters (each specializing on a new task), yielding hierarchical inductive transfer (Feng et al., 2022).
- Linked or lateral adapters: Fuse the outputs of all (past, present, and even future) task adapters via attention-weighted sums, computed by a trainable MLP over task embeddings (Chandra et al., 2024).
- Self-expanding adapters: Dynamically allocate new adapters only when significant distribution shift is detected, employing a z-score outlier test on representation descriptors (Wang et al., 2024).
- Label-specific adapters: Each class in continual learning is equipped with a small prototype memory; feature aggregation occurs over all class adapters for inference (Luo et al., 29 May 2025).
- Multi-modal adapters: Adapters are placed not just in unimodal blocks but also in cross-modal fusion layers (e.g., vision-language transformers) (Li et al., 2024).
- Domain-specific heads/units: Parallel domain-heads/hidden-units are added to self-attention and FFN, increasing capacity adaptively (Yan et al., 2022).
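As an illustration of the self-expanding criterion above, a z-score outlier test on a scalar shift score might look like the following (the threshold and scores are hypothetical stand-ins for the representation descriptors used in practice):

```python
import statistics

def needs_new_adapter(shift_history, new_score, z_thresh=3.0):
    """Expand only when the new task's shift score is a z-score outlier
    relative to scores observed under the existing adapter set."""
    mu = statistics.mean(shift_history)
    sigma = max(statistics.stdev(shift_history), 1e-8)
    return (new_score - mu) / sigma > z_thresh

# e.g., per-task descriptor reconstruction errors on familiar data
history = [0.10, 0.12, 0.11, 0.09, 0.13]
```

A score close to the historical range leaves the adapter pool unchanged; a clear outlier triggers allocation of a fresh adapter.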
2. Training Methodologies and Continual Learning Protocols
The prototypical continual pre-training protocol involves freezing the pre-trained backbone and successively training or extending adapters as tasks arrive:
- Sequential task introduction: For each new task or domain, either a fresh adapter slice (e.g., new down/up projection) is introduced, or an existing set of adapters is extended (Gao et al., 2024, Kessler et al., 2021, Yan et al., 2022). Old adapters are frozen after being trained.
- Two-stage training (ATLAS): First, an "experience-based" stage reuses knowledge via weighted sums of old adapters; second, a "novel knowledge expansion" stage introduces a new adapter if needed (Li et al., 2024).
- Replay/memory-free: Most methods avoid storing raw data or gradient rehearsal; stability is enforced by architectural isolation and sometimes regularization (e.g., Fisher penalty in EWC, or orthogonality constraints between new and old adapters) (Chandra et al., 2024, Gao et al., 2024, Yan et al., 2022).
- Adapter consolidation or distillation: Where memory is limited or redundancy is detected, adapter fusion or MSE-based distillation over unlabeled buffer pools consolidates multiple adapters (Ermis et al., 2022).
- Momentum accumulation: Parallel online/offline adapters are maintained, with the offline adapter updated by exponential moving average (EMA) for cross-task stability (Gao et al., 2023).
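The momentum-accumulation update above reduces to a standard exponential moving average; a minimal sketch (weights represented as flat lists, momentum value illustrative):

```python
def ema_update(offline, online, momentum=0.99):
    """Offline adapter tracks the online adapter via an exponential moving
    average, smoothing out task-specific drift across the task sequence."""
    return [momentum * w_off + (1.0 - momentum) * w_on
            for w_off, w_on in zip(offline, online)]

offline, online = [0.0, 0.0], [1.0, 2.0]
offline = ema_update(offline, online, momentum=0.9)
```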
A representative procedure is shown below for hierarchical adapters (Feng et al., 2022):
- Multi-task train the base adapter on all old tasks; freeze.
- For each task, train its specific adapter on that task, keeping base (general) adapters frozen.
- When a new task arrives, add a new task-specific adapter and train only it, keeping both base and all previous adapters fixed.
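This freezing schedule can be sketched as simple bookkeeping (task names and string states are illustrative; a real implementation would toggle `requires_grad` on the corresponding parameter groups):

```python
class AdapterBank:
    """Bookkeeping for the hierarchical protocol: a shared base adapter is
    multi-task trained then frozen; each new task gets its own adapter and
    all earlier task adapters are frozen."""

    def __init__(self):
        self.base_frozen = False
        self.task_adapters = {}  # task_id -> "trainable" | "frozen"

    def freeze_base(self):
        self.base_frozen = True

    def add_task(self, task_id):
        for tid in self.task_adapters:   # freeze all earlier task adapters
            self.task_adapters[tid] = "frozen"
        self.task_adapters[task_id] = "trainable"

    def trainable(self):
        return [t for t, s in self.task_adapters.items() if s == "trainable"]

bank = AdapterBank()
bank.freeze_base()        # step 1: base adapter multi-task trained, then frozen
bank.add_task("task_1")   # step 2: only task_1's adapter trains
bank.add_task("task_2")   # step 3: task_1 is frozen, task_2 trains
```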
3. Objectives, Regularization, and Stability
The primary loss is typically a cross-entropy (classification), retrieval, or contrastive/self-supervised objective (as in SSL or MAE pre-training). Adapter-based approaches additionally introduce distinctive regularizers or losses:
- Orthogonal loss: Enforce new adapter directions orthogonal to all previous adapters to minimize interference (Gao et al., 2024).
- EWC-style penalty: Online computation of the Fisher information matrix constrains adapter-attention network weights to prevent forgetting (Chandra et al., 2024).
- Feature distillation: For per-label adapters, old label embeddings are preserved by L2 regularization over class prototypes (Luo et al., 29 May 2025).
- Adaptive expansion: A pre-trained autoencoder per adapter (at each representation level) signals whether a new adapter is needed, driving sub-linear parameter growth by amortizing adaptation over the distributional complexity rather than total number of tasks (Wang et al., 2024).
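A minimal sketch of the orthogonality penalty described above (the matrices are toy examples; real methods apply this to adapter down-/up-projection weights):

```python
import numpy as np

def orthogonal_penalty(W_new, frozen_Ws):
    """Sum of squared inner products between the new adapter's projection
    rows and those of all frozen adapters; zero iff fully orthogonal."""
    return sum(float(np.sum((W_new @ W.T) ** 2)) for W in frozen_Ws)

W_old = np.array([[1.0, 0.0, 0.0, 0.0]])   # frozen adapter direction
W_orth = np.array([[0.0, 1.0, 0.0, 0.0]])  # new, orthogonal: no interference
W_par = np.array([[1.0, 0.0, 0.0, 0.0]])   # new, parallel: maximal interference
```

Adding this penalty to the task loss pushes each new adapter into directions unused by its predecessors, which is what architecturally limits interference.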
In self-supervised or domain-adaptive settings (e.g., PAD for infrared images (Zhang et al., 2023)), the adapter is trained via a masked reconstruction loss (e.g., MAE), with only the adapters and the decoder updated while all backbone parameters remain fixed.
4. Empirical Evaluation and Comparative Results
Multiple experimental protocols are established across vision, dialogue, language, speech, and multi-modal domains:
- Parameter efficiency: Adapter-based continual pre-training consistently reduces the trainable parameter count per task, often to <5% of the full model (Feng et al., 2022, Kessler et al., 2021, Zhang et al., 2023, Gao et al., 2023).
- Catastrophic forgetting: By freezing both the backbone and non-current adapters, prior task performance remains stable, outperforming full fine-tuning or regularization-only baselines (Kessler et al., 2021, Yan et al., 2022, Chandra et al., 2024).
- Downstream transfer: Tasks such as dialogue response retrieval, semantic segmentation, ASR, classification, and multi-modal inference benefit from continual adapter-based pre-training, often surpassing prompt-based PET baselines, naive replay, and even some rehearsal-based methods (Feng et al., 2022, Ebouky et al., 22 Sep 2025, Li et al., 2024).
- Quantitative gains: For instance, AdaHIT attains an average hits@1 of 0.8158 at only 4% per-task parameter cost (vs. 100% for full fine-tuning) (Feng et al., 2022); SEMA achieves 86.98% accuracy on class-incremental CIFAR-100 vs. ≈80% for prompt-based L2P (Wang et al., 2024); LADA shows 2–3% gains over MoE-style adapter routing on X-TAIL (Luo et al., 29 May 2025).
| Method | Backbones | Continual Setting | Notable Results |
|---|---|---|---|
| AdaHIT | Poly-Encoder (Dialog) | Multi-task and incremental | 0.8158 avg hits@1 @ 4.2% ∆params (Feng et al., 2022) |
| cwav2vec 2.0 | wav2vec 2.0 (Speech) | Language-incremental | <0.5% WER drop, 0% forgetting, 32% faster (Kessler et al., 2021) |
| ADA | ViT/DeiT (Vision) | Adapter-pool/binary/multi-class | ≈95M params for all tasks @ pool K=4 (Ermis et al., 2022) |
| SEMA | ViT-B (Vision) | Class-incremental, VTAB | Sub-linear growth, 6.5× lower than linear (Wang et al., 2024) |
| Linked Adapters | ViT (Vision) | Task-incremental | +1–2% over standalone adapters (Chandra et al., 2024) |
| C-ADA | ViT (Vision) | Class/Domain-incremental | +2.5–6% over CODA/L2P, 0% rehearsal (Gao et al., 2024) |
| PAD | ViT (Vision, IR-SSL) | Domain transfer | +0.9 to +2.4 mIoU over full MAE ft (Zhang et al., 2023) |
| LADA | CLIP ViT | Label-incremental | +0.8–2.9% avg/last over strong PET (Luo et al., 29 May 2025) |
| GLARE | ViT (Semantic Seg) | Domain-SSL | +0.4–0.6 mIoU, ≈3% param overhead (Ebouky et al., 22 Sep 2025) |
| AF-Adapter | RoBERTa (Chinese, Biomed) | Domain-continual | +2% downstream vs. PCL-MedBERT, –11% forgetting (Yan et al., 2022) |
| ATLAS | ViLT (Vision-Language) | Multi-modal, upstream | +0.7–1.4% avg, <1% forgetting (Li et al., 2024) |
5. Extensions: Modularity, Knowledge Sharing, and Sub-Linear Growth
Advanced architectures address several challenges beyond naive per-task adapter assignment:
- Knowledge sharing and lateral transfer: Attention-based mechanisms (e.g., Linked Adapters (Chandra et al., 2024), ATLAS (Li et al., 2024)) achieve both forward and backward knowledge transfer by learning per-layer, per-task attention weights via MLPs or cross-attention. During inference, the system leverages not only the current task's adapter but all adapters, mitigating both under-sharing and redundancy.
- Self-regulating growth: Expansion is triggered only by genuine shifts in representation space (quantified by descriptor AEs), yielding parameter growth proportional to intrinsic task diversity rather than total task count (Wang et al., 2024).
- Distillation/consolidation: Adapter fusion and pool-size management (e.g., ADA with K=4) maintain a memory-efficient pool, replacing or distilling only as needed (Ermis et al., 2022).
- Per-label scaling: In label-incremental CLIP settings, dedicated, lightweight memory per label sidesteps router errors endemic to MoE or prompt-pool architectures (Luo et al., 29 May 2025).
- Patchwise scaling: Dynamic patchwise scaling in vision-specific adapters further improves domain adaptation (e.g., in IR) by allowing fine spatial control over adapter influence (Zhang et al., 2023).
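The attention-weighted fusion used for knowledge sharing (e.g., in Linked Adapters) can be sketched as follows; here the attention scores are supplied directly rather than produced by the trainable MLP over task embeddings that the method describes:

```python
import numpy as np

def fuse_adapters(outputs, scores):
    """Softmax-attention-weighted sum over per-task adapter outputs."""
    w = np.exp(scores - np.max(scores))  # numerically stable softmax
    w = w / w.sum()
    return sum(wi * out for wi, out in zip(w, outputs))

outs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # two task adapters
fused = fuse_adapters(outs, scores=np.array([0.0, 0.0]))  # equal weighting
```

Because the weights are learned rather than fixed, the model can up-weight whichever past adapters are most relevant to the current input.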
6. Domains of Application and Current Limitations
Adapter-based continual pre-training has been scaled across the following axes:
- Vision: Class/domain/task-incremental learning with ViT/DeiT, semantic segmentation with adapter-only pre-training (UniAdapter in GLARE (Ebouky et al., 22 Sep 2025)), and IR-specific SSL (PAD (Zhang et al., 2023)).
- Language: Domain-incremental masked language modeling (e.g., biomedical adaptation in AF-Adapter (Yan et al., 2022)), cross-lingual SSL in speech (wav2vec 2.0 + Language Adapters (Kessler et al., 2021)).
- Multi-modal: Unified handling of image, text, and combinations thereof, leveraging two-stage learning (ATLAS (Li et al., 2024)).
- Rehearsal- and memory-free CL: All architectures by design avoid or minimize the need for replay buffers or task data storage.
Current limitations include:
- Parameter growth: Most per-task or per-label adapter instantiations scale at least linearly with the number of tasks or labels; progressive self-expansion and adapter compression mechanisms aim to mitigate this overhead (Wang et al., 2024, Chandra et al., 2024).
- Domain shift/extrapolation: The freezing of the entire backbone strongly constrains representational adaptation; in cases of extreme domain shift, existing frozen features may not suffice (Zhang et al., 2023).
- Task identity: Most knowledge fusion or attention-based reuse schemes assume task labels are known at inference; class-incremental or task-free regimes require further methodological advances (Chandra et al., 2024).
- Adapter placement: Current conventions often inject adapters only into MLP blocks or upper transformer layers; the optimal per-layer allocation remains under-explored.
7. Outlook and Future Directions
Recent advances suggest several key research trajectories:
- Improved knowledge transfer: Learning sophisticated, dynamic fusion of all adapters (forward/backward), possibly leveraging meta-learning or generative router networks for adaptive weighting (Chandra et al., 2024, Li et al., 2024).
- Adapter compression/merging: Approaches that consolidate redundant adapters post hoc, reducing parameter and compute footprint for long-sequence continual learning (Ermis et al., 2022).
- Cross-modal continual adaptation: Extension of modular adapters to ever more complex combinations (vision, audio, language), with multi-modal fusion modules trained in continual settings (Li et al., 2024).
- Task-agnostic and streaming settings: Removal of reliance on explicit task labels at inference, demanding more distributed, context-aware routing (Chandra et al., 2024).
- Integration with memory/rehearsal: Hybrid architectures combining adapter isolation with external memory, replay, or generative rehearsal for theoretical guarantees of no forgetting.
- AutoML/NAS for adapter placement and sizing: Search for per-layer, per-task optimal adapter capacity, balancing stability and plasticity on-the-fly (Yan et al., 2022).
Adapter-based continual pre-training enables highly parameter-efficient, robust extension of large pre-trained models. Through methodological advances in modular isolation, hierarchical transfer, adaptive attention, and self-regulating growth, these approaches have enabled continual learning systems to match or surpass traditional fine-tuning methods with orders-of-magnitude smaller per-task overhead, broad applicability, and greatly increased resistance to catastrophic forgetting (Feng et al., 2022, Chandra et al., 2024, Wang et al., 2024, Luo et al., 29 May 2025, Li et al., 2024).