
Continual Learning LLM (CL-LLM) Overview

Updated 31 January 2026
  • Continual Learning LLMs are large models designed to adapt sequentially to new tasks without catastrophic forgetting.
  • They employ methods such as replay-based techniques, parameter-efficient adapter approaches, and prompt-only strategies to balance rapid adaptation with knowledge retention.
  • Evaluations use metrics like overall accuracy and forgetting, while addressing challenges such as dynamic task boundaries and compute efficiency.

Continual Learning LLM (CL-LLM) refers to LLMs equipped with explicit mechanisms to adapt sequentially to a stream of tasks, domains, or modalities without catastrophic forgetting of previously acquired skills or knowledge. CL-LLMs are evaluated on their ability to balance plasticity (rapid adaptation) with stability (retention of earlier information) across various forms of sequential adaptation, including supervised fine-tuning, domain or ability transfer, instruction tuning, and multimodal integration. The field has converged around distinct algorithmic approaches, benchmark suites, and evaluation protocols, as synthesized in recent empirical and survey works.

1. Conceptual Framework and Taxonomy

CL-LLMs are situated at the intersection of traditional continual learning and high-capacity deep LLMs. The fundamental taxonomy distinguishes:

  • Vertical Continual Learning (VCL): Progression from large-scale general pre-training to specialized adaptation (e.g., domain- or user-specific fine-tuning), mitigating "vertical forgetting" as the model specializes (Shi et al., 2024).
  • Horizontal Continual Learning (HCL): Sequential adaptation across tasks, domains, or time-steps at a fixed level of specificity, with the challenge of "horizontal forgetting" as the model incorporates novel content (Shi et al., 2024).

Three canonical stages are recognized:

Formally, the CL-LLM objective for tasks/domains \{T_i\}, i = 1, \dots, N, is

  \min_{\theta}\ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i(\theta),

subject to constraints on minimal degradation on previous tasks (Kang et al., 14 Sep 2025, Muttakhiroh et al., 5 Aug 2025).
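The "minimal degradation" constraint can be made explicit. One common hedged formalization (the tolerance ε and the snapshot notation θ_j* are ours, not taken from the cited papers):

```latex
\min_{\theta}\ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i(\theta)
\quad \text{s.t.} \quad
\mathcal{L}_j(\theta) - \mathcal{L}_j(\theta_j^{*}) \le \varepsilon
\quad \forall\, j < N,
```

where θ_j* denotes the parameter snapshot immediately after training on T_j, so the loss on any earlier task may rise by at most ε as later tasks are learned.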

2. Algorithmic Approaches

Major approaches to CL-LLMs include replay methods, parameter-efficient adaptation, architectural isolation, modular/memory-based methods, and inference-time (prompt-only or training-free) adaptation.

Replay-based Methods:

  • Surprise-prioritized Replay (SuRe): Maintains a buffer of high-negative-log-likelihood (high-NLL, i.e., "surprising") examples per task to tightly approximate the earlier task mixture. This sampling is empirically superior to uniform replay, and SuRe+EMA dual learners yield SOTA performance on both standard and large-number-of-tasks (LNT) benchmarks (Hazard et al., 27 Nov 2025).
  • Knowledge-graph Replay (KILO): Instead of raw example buffers, maintains a dynamic knowledge graph, retrieving task/domain-relevant fact triples for use as prompts with logit distillation to regularize drift (Muttakhiroh et al., 5 Aug 2025).
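The surprise-prioritized selection rule above can be sketched as a small buffer that keeps only the highest-NLL examples per task. This is a minimal illustration of the idea, not the exact SuRe recipe; the capacity, scoring, and sampling policy here are illustrative assumptions.

```python
import heapq
import random

class SurpriseReplayBuffer:
    """Keeps the highest-NLL ("most surprising") examples per task."""

    def __init__(self, capacity_per_task: int = 64):
        self.capacity = capacity_per_task
        self.buffers = {}   # task_id -> min-heap of (nll, counter, example)
        self._counter = 0   # tie-breaker so heapq never compares examples

    def add(self, task_id, example, nll: float):
        heap = self.buffers.setdefault(task_id, [])
        self._counter += 1
        item = (nll, self._counter, example)
        if len(heap) < self.capacity:
            heapq.heappush(heap, item)
        elif nll > heap[0][0]:
            # more surprising than the least-surprising kept example
            heapq.heapreplace(heap, item)

    def sample(self, k: int):
        """Uniformly sample k stored examples across all past tasks."""
        pool = [ex for heap in self.buffers.values() for _, _, ex in heap]
        return random.sample(pool, min(k, len(pool)))
```

During training on task t, each minibatch would be augmented with `sample(k)` examples drawn from tasks learned earlier, approximating the earlier task mixture with its hardest examples.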

Adapter and Subspace Methods:

  • Dual-Learner LoRA: Two LoRA adapters per attention layer ("fast" updated by SGD, "slow" tracking fast via EMA) to reduce update variance and prevent fast knowledge drift (Hazard et al., 27 Nov 2025).
  • MoE-CL: Parameter isolation via dedicated LoRA expert per task, combined with a shared expert and adversarial discriminator to encourage transfer while preserving task-specific information (Kang et al., 14 Sep 2025).
  • Selective Subspace De-correlation (ELLA): Penalizes the overlap of current adapter updates with high-energy regions of past updates (via a memory-constant energy matrix), achieving SOTA without replay or architectural expansion (Biswas et al., 5 Jan 2026).
  • CUR-based LoRA (CURLoRA): CUR decomposition of weights with only the U matrix fine-tuned, drastically reducing parameters and regularizing updates (Fawi, 2024).
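The dual-learner mechanism above can be sketched in a few lines: a "fast" LoRA pair takes gradient steps while a "slow" pair tracks it via an exponential moving average. This is a minimal numpy sketch of the idea; the shapes, toy update, and merge rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class DualLoRA:
    """Fast/slow LoRA pair: fast adapter is updated by SGD, slow
    adapter tracks it via an EMA to smooth update variance."""

    def __init__(self, d: int, r: int, beta: float = 0.995, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A_fast = rng.standard_normal((d, r)) * 0.01
        self.B_fast = np.zeros((r, d))        # standard LoRA init: B = 0
        self.A_slow = self.A_fast.copy()
        self.B_slow = self.B_fast.copy()
        self.beta = beta                      # EMA rate, ~0.995 in the text

    def sgd_step(self, grad_A, grad_B, lr: float = 1e-2):
        self.A_fast -= lr * grad_A
        self.B_fast -= lr * grad_B
        # slow learner drifts toward the fast one, damping abrupt drift
        self.A_slow = self.beta * self.A_slow + (1 - self.beta) * self.A_fast
        self.B_slow = self.beta * self.B_slow + (1 - self.beta) * self.B_fast

    def delta_w(self, slow: bool = True):
        """Low-rank weight update ΔW = A @ B used at inference."""
        A, B = (self.A_slow, self.B_slow) if slow else (self.A_fast, self.B_fast)
        return A @ B
```

Serving the slow adapter is what buys stability: each SGD step moves the served weights by only a (1 − β) fraction of the fast learner's motion.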

Modular Routing and Multimodal CL:

  • MR-LoRA: A pretrained LoRA-based router selects among modality- or domain-specific experts per input, preventing interference as new abilities are acquired (Zhao et al., 5 Jun 2025).

Prompt-only and Training-free CL:

  • CLOB/CIS: Uses only prompt engineering over a frozen LLM; all class/task knowledge is summarized into short text snippets updated incrementally, completely avoiding parameter update and classic catastrophic forgetting (Qiu et al., 2024).
  • InCA: Combines an external, incremental tag-based Gaussian classifier for class narrowing, with in-context prompting of only top-k class summaries, achieving high accuracy without parameter updates or replay (Momeni et al., 2024).
  • JitRL: Non-parametric memory of transitions and test-time logit modification realizes closed-form KL-constrained RL without gradient updates, allowing LLM agents to adapt continually in interactive tasks (Li et al., 26 Jan 2026).
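The prompt-only recipe shared by CLOB/CIS and InCA can be sketched as a thin wrapper around a frozen LLM: class knowledge lives only in short text summaries that are updated incrementally, and classification is an in-context prompt over a candidate shortlist. `llm` is any callable prompt → completion; the summary-update and classification templates below are illustrative assumptions, not the cited papers' exact prompts.

```python
class PromptOnlyClassifier:
    """Class-incremental classification over a frozen LLM using only
    per-class text summaries; no parameter is ever updated."""

    def __init__(self, llm, top_k: int = 3):
        self.llm = llm
        self.top_k = top_k
        self.summaries = {}   # class name -> short text summary

    def learn_class(self, name: str, examples: list):
        """Incrementally (re)summarize a class from new examples."""
        prior = self.summaries.get(name, "")
        prompt = (f"Summarize class '{name}' in one sentence.\n"
                  f"Prior summary: {prior or '(none)'}\n"
                  f"New examples: {examples}")
        self.summaries[name] = self.llm(prompt)

    def classify(self, text: str, candidates: list) -> str:
        """Prompt the frozen LLM with only the top-k candidate summaries."""
        shortlist = candidates[: self.top_k]
        options = "\n".join(f"- {c}: {self.summaries[c]}" for c in shortlist)
        prompt = (f"Classes:\n{options}\n"
                  f"Input: {text}\nAnswer with one class name.")
        return self.llm(prompt).strip()
```

In InCA the candidate shortlist would come from an external Gaussian classifier over class tags; here it is simply passed in by the caller.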

3. Evaluation Protocols and Metrics

Standard evaluation schemes construct a performance matrix P_{t,i}, where the model is trained up to task t and evaluated on task i (Shi et al., 2024, Joshi et al., 13 Jun 2025). Core metrics include:

  • Overall Accuracy (OA): Final average performance across all tasks.
  • Forgetting (F): Average performance drop from peak on earlier tasks.
  • Forward Transfer (FWT): Improvement on new tasks relative to baseline after sequential adaptation.
  • Backward Transfer (BWT): Performance change on prior tasks after new ones are learned.
  • Area Under Learning Curve (AULC): Integrates improvement trajectory.
  • Continual-Learning F_β Score: Harmonic mean of stability (1 − F) and plasticity (diagonal performance).
  • Task/Domain Benchmarks: Standard CL datasets (AGNews, Yahoo, GLUE/SuperGLUE), domain adaptation suites, multimodal streams (LLaVA, MLLM-CL), agentic coding benchmarks (SWE-Bench-CL), and large-scale industrial testbeds (Tencent3, MTL5).
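The matrix-based metrics above can be computed directly from P. A minimal numpy sketch, using common conventions (forgetting as drop from per-task peak; FWT as zero-shot accuracy on each task just before it is learned, relative to a standalone baseline — exact definitions vary slightly across papers):

```python
import numpy as np

def cl_metrics(P, baseline=None):
    """Continual-learning metrics from P[t, i] = accuracy on task i
    after sequential training through task t (an N x N matrix).

    FWT requires per-task standalone baselines; pass `baseline`
    (length-N array) or FWT is omitted.
    """
    N = P.shape[0]
    oa = float(P[-1].mean())                                   # Overall Accuracy
    # Forgetting: average drop from each earlier task's peak to its final score
    f = float(np.mean([P[:, i].max() - P[-1, i] for i in range(N - 1)]))
    # Backward transfer: final accuracy minus just-after-training accuracy
    bwt = float(np.mean([P[-1, i] - P[i, i] for i in range(N - 1)]))
    out = {"OA": oa, "F": f, "BWT": bwt}
    if baseline is not None:
        # Forward transfer: accuracy on task i before learning it,
        # relative to that task's standalone baseline
        out["FWT"] = float(np.mean([P[i - 1, i] - baseline[i]
                                    for i in range(1, N)]))
    return out
```

AULC would additionally integrate mean accuracy over the rows of P as training progresses; it is omitted here for brevity.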

4. Representative Results and Ablation Insights

Empirical advances have established adapter-based CL methods with explicit buffer or memory mechanisms as current SOTA, with notable observations:

| Method | Std CL OA (%) | LNT/LS OA (%) | Replay | Buffer | Param Expansion |
|---|---|---|---|---|---|
| Sequential FT / LoRA | 28.5–39.3 | 4.2 | No | No | No |
| EWC / O-LoRA | 47.9–75.4 | 44.8–68.8 | No | No | Yes |
| Reservoir/Surprise Replay | 76.9–77.2 | 69.1–72.1 | Yes | Yes | No |
| Dual-Learner EMA (SuRe) | 78.1 | 75.1 | Yes | Yes | No |
| ELLA | 79.9 | 73.6 | No | No | No |
| MoE-CL | — | — | No | No | Yes |

  • SuRe+EMA: Outperforms prior regularization and adapter expansion methods by up to +5 points on LNT. Replay buffer selection using sequence NLL is superior to uniform or label-only sampling. EMA rates near β = 0.995 optimize the stability/plasticity trade-off (Hazard et al., 27 Nov 2025).
  • ELLA: Outperforms replay-based approaches despite no buffer, due to selective de-correlation. Gains of up to +9.6% OA and improved generalization to new tasks (Biswas et al., 5 Jan 2026).
  • MoE-CL: Adversarial discriminator on the shared expert further boosts transfer and retention compared to purely isolated adapters (Kang et al., 14 Sep 2025).
  • MR-LoRA: Router-based expert selection achieves highest last-task and average accuracy on multimodal continual benchmarks, with router performance saturating at ≈20 few-shot samples (Zhao et al., 5 Jun 2025).
  • CLOB/CIS, InCA: Prompt-only and replay-free prompt selection approaches perform near joint-fine-tuning upper bounds for class-incremental text classification, with negligible explicit forgetting (Qiu et al., 2024, Momeni et al., 2024).
  • JitRL: Achieves >30x cost reduction compared to training-intensive RL methods for continual learning in LLM agents while outperforming all prior training-free baselines (Li et al., 26 Jan 2026).
  • Multimodal CL: mSGM+small replay (1% buffer) yields highest retained NL skills while matching VL accuracy in sequential vision-language tasks (Srivastava et al., 2024).

5. Multimodal and Agentic Continual Learning

Multimodal CL-LLMs extend the challenge to mixed vision-language or action domains:

  • Domain/Ability Split: MLLM-CL and similar benchmarks separate domain-incremental (IID) vs ability-incremental (non-IID) transfer (Zhao et al., 5 Jun 2025).
  • Adapter Isolation and Routing: Parameter isolation per modality/domain, with a pretrained router (LoRA-based), prevents catastrophic interference as new abilities—such as OCR, math reasoning, or agentic GUI navigation—are acquired (Zhao et al., 5 Jun 2025, Srivastava et al., 2024).
  • Agentic Learning: SWE-Bench-CL introduces continual sequences for LLM-based coding agents, measuring success/forgetting across evolving issue streams. Memory-enabled models require semantically precise retrieval to avoid performance drift via "prompt poisoning" (Joshi et al., 13 Jun 2025).
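The router-based expert selection described above can be approximated with a nearest-centroid classifier over input embeddings, fit from a handful of few-shot samples per expert — mirroring the observation that router accuracy saturates at roughly 20 samples. This centroid router is an illustrative stand-in, not the MR-LoRA router itself; `embed` is any callable text → vector.

```python
import numpy as np

class ExpertRouter:
    """Nearest-centroid router over per-domain embedding prototypes,
    a minimal stand-in for router-based LoRA expert selection."""

    def __init__(self, embed):
        self.embed = embed
        self.centroids = {}   # expert name -> mean embedding

    def fit_expert(self, name: str, few_shot_texts: list):
        """Fit one expert's prototype from a few labeled samples."""
        vecs = np.stack([self.embed(t) for t in few_shot_texts])
        self.centroids[name] = vecs.mean(axis=0)

    def route(self, text: str) -> str:
        """Pick the expert whose centroid has highest cosine similarity."""
        v = self.embed(text)
        v = v / (np.linalg.norm(v) + 1e-12)

        def score(c):
            c = c / (np.linalg.norm(c) + 1e-12)
            return float(v @ c)

        return max(self.centroids, key=lambda n: score(self.centroids[n]))
```

In a full system, the routed expert name would select which frozen LoRA adapter to attach before running the backbone, so adding a new ability only requires fitting one new centroid and training one new adapter.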

6. Practical and Industrial Considerations

Scalability, memory and computational cost, and serving constraints shape realistic CL-LLM deployment.

7. Open Challenges and Future Directions

Despite substantial progress, several fundamental open questions remain:

  • Unified Theory: No complete theoretical framework yet explains knowledge transfer, retention, and forgetting in high-capacity transformer models (Shi et al., 2024).
  • Dynamic Task Boundary Detection: Most methods rely on explicit task boundaries, which may be unavailable in real-world streams (Biswas et al., 5 Jan 2026, Hazard et al., 27 Nov 2025).
  • Bufferless and Replay-free CL: True replay-free continual adaptation, especially for multimodal and streaming settings, remains challenging (Zhao et al., 5 Jun 2025, Biswas et al., 5 Jan 2026).
  • Personalization and Preference Alignment: Continual, non-forgetting adaptation to user-specific preferences and distributions requires new frameworks for privacy and safety (Shi et al., 2024).
  • Compute-Efficient CL: Optimization of CL-LLMs under compute and memory constraints, rather than pure buffer size limits, is an emerging area (Shi et al., 2024).
  • Evaluation Benchmarks: Ongoing development and adoption of robust, long-sequence and open-ended benchmarks (e.g., SWE-Bench-CL, TRACE, MLLM-CL) are necessary for meaningful measurement of stability–plasticity balance at scale (Zhao et al., 5 Jun 2025, Joshi et al., 13 Jun 2025).

Continual Learning LLMs form a rapidly progressing field aimed at scalable, robust sequential adaptation in high-capacity LLMs, encompassing buffer-based rehearsal, adapter and subspace methods, modular and memory-augmented architectures, and training-free inference strategies. Comprehensive evaluation frameworks and benchmarks are essential for capturing the trade-offs between plasticity and stability across modalities and application domains (Shi et al., 2024, Hazard et al., 27 Nov 2025, Biswas et al., 5 Jan 2026, Kang et al., 14 Sep 2025, Muttakhiroh et al., 5 Aug 2025, Qiu et al., 2024, Momeni et al., 2024).
