
Online Continual Learning Overview

Updated 26 January 2026
  • Online Continual Learning is the process where models update their parameters incrementally from evolving, non-stationary data streams in a single pass without revisiting past samples.
  • Key methodologies include memory-based replay, regularization, and contrastive techniques that mitigate catastrophic forgetting and manage domain drift.
  • Emerging advances leverage plug-and-play modules and resource-adaptive frameworks to balance rapid adaptation with the preservation of past knowledge.

Online Continual Learning (OCL) is the study of algorithms and frameworks that enable learning models—typically deep neural networks—to update their knowledge incrementally from a non-stationary data stream, in a single pass, without revisiting previously encountered samples. OCL imposes strict memory, computational, and latency constraints, presenting unique challenges relative to classical batch continual learning, notably catastrophic forgetting, which is the rapid degradation of performance on old tasks when new knowledge is acquired.

1. Foundational Problem Definition and Constraints

OCL is formally characterized by the sequential arrival of data points or (mini-)batches $(x_t, y_t)$, each drawn from a possibly non-stationary, temporally evolving distribution $\pi_t$, with the model parameters $\theta_t$ updated after each step: $\theta_{t+1} = \mathcal{U}(\theta_t, (x_t, y_t))$. The defining OCL constraints are:

  • Single pass over each sample: Each data point is used only once for parameter update, prohibiting multiple epochs or revisiting.
  • Hard or partial memory constraints: Either prohibit storage of previous samples (rehearsal-free), limit memory to a small buffer ($|\mathcal{M}| \ll T$), or ban auxiliary model parameters.
  • Absence of task identity: Task boundaries are not signaled; inference and learning proceed without knowledge of task indices.
  • Real-time or low-latency requirements: Updates must be performed quickly, often in resource-constrained environments (Bidaki et al., 9 Jan 2025).
  • No access to future data: Only present and possibly buffered past data is available.

These properties distinguish OCL from both classical online learning (which assumes i.i.d. streams) and batch continual learning (which allows multiple epochs or access to full task data) (Bidaki et al., 9 Jan 2025, Parisi et al., 2020).
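The single-pass protocol above can be sketched as a minimal streaming loop. This is an illustrative skeleton only: `ocl_stream` and `update` are hypothetical names, with `update` standing in for the abstract operator $\mathcal{U}$ and a scalar standing in for the model parameters.

```python
from collections import deque

def ocl_stream(stream, update, buffer_size=0):
    """Minimal single-pass OCL loop: each (x, y) is consumed exactly once,
    and only a bounded buffer M (|M| << T) of past samples is retained."""
    theta = 0.0                         # stand-in for model parameters
    buffer = deque(maxlen=buffer_size)  # bounded memory M (empty if rehearsal-free)
    for x, y in stream:                 # no access to future data
        theta = update(theta, x, y, list(buffer))  # theta_{t+1} = U(theta_t, (x_t, y_t))
        if buffer_size:
            buffer.append((x, y))       # the only way a sample survives this step
    return theta
```

With `buffer_size=0` this degenerates to the strict rehearsal-free setting; a positive size gives the bounded-memory setting assumed by replay methods.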

2. Core Algorithms and Methodological Taxonomy

The OCL literature divides algorithms into major families, sometimes integrated in hybrid forms (Bidaki et al., 9 Jan 2025, Parisi et al., 2020):

  • Memory-based replay (experience replay and its variants): Maintain a small buffer $\mathcal{M}$ of exemplars sampled (often by reservoir sampling [Vitter '85]) from the data stream. The model is updated by interleaving current examples and replayed past samples, typically via a composite loss:

$$\mathcal{L} = \mathcal{L}_\mathrm{new}(x_t, y_t; \theta) + \lambda\,\mathcal{L}_\mathrm{replay}(\mathcal{M}; \theta)$$

Notable methods: ER, ER-ACE, DER++, MIR, SCR, HPCR (Soutif-Cormerais et al., 2023, Lin et al., 2023, Zhang et al., 2022).

  • Regularization-based: Impose quadratic or distillation penalties to slow parameter drift relevant to past tasks (EWC, SI), or to preserve old logits (LwF, Dark Experience Replay, Batch-level Distillation) (Fini et al., 2020).
  • Proxy-based/Contrastive replay: Leverage class proxies or use contrastive losses in buffer replay (PCR, HPCR), often combining positive sample--proxy or sample--sample relations (Lin et al., 2023).
  • Parameter-isolation/architecture-based: Allocate distinct parameter regions to different tasks or data regions (Progressive Neural Networks, HAT) (Bidaki et al., 9 Jan 2025, Parisi et al., 2020), less common in strictly memory-constrained or task-free OCL.
  • Regularization via self-supervision: Integrate equivariant pretext tasks (rotation/jigsaw) for feature regularization (CLER) (Bonicelli et al., 2023) or other self-supervised signals.
  • Replay-free and task-free methods: Eliminate all inter-batch data storage, instead relying on class-level prompts and prototype/class mean accumulation to preserve discrimination (e.g. prompt-NCM methods) (Wang et al., 1 Oct 2025).
  • Plug-and-play modules for adaptability: E.g., S6MOD augments replay pipelines with learnable mixture discretizations and class-conditional routing in downstream SSM branches (Liu et al., 2024).
  • Gradient compensation and resource-efficient pipeline frameworks: Automated parallelization with staleness correction, as in the Ferret framework, enables hardware-efficient OCL under varying memory budgets (Zhou et al., 15 Mar 2025).
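As a concrete illustration of the memory-based family, reservoir sampling plus the composite loss above can be sketched in a few lines. The function names are illustrative, and the per-sample replay loss is passed in as a callable rather than tied to any particular method.

```python
import random

def reservoir_update(buffer, item, n_seen, capacity):
    """Reservoir sampling (Vitter '85): after n_seen stream items, each one
    remains in the buffer with equal probability capacity / n_seen."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = random.randrange(n_seen)
        if j < capacity:
            buffer[j] = item

def composite_loss(loss_new, sample_loss, buffer, lam=1.0):
    """L = L_new + lambda * mean replay loss over the buffer M."""
    if not buffer:
        return loss_new
    return loss_new + lam * sum(sample_loss(s) for s in buffer) / len(buffer)
```

Replay variants such as ER-ACE or DER++ differ mainly in what `sample_loss` computes (cross-entropy on buffered labels, distillation on stored logits, etc.).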

3. Catastrophic Forgetting, Domain Drift, and Stability–Plasticity Trade-Off

A central OCL challenge is catastrophic forgetting: as the model adapts to data from new classes or tasks, its accuracy on earlier ones drops. This results from parameter updates overwriting prior representations, exacerbated by class-imbalance and single-pass constraints (Parisi et al., 2020, Lin et al., 2023).

  • Domain drift further aggravates forgetting: as new-task data shifts the feature space, both intra-class compactness and inter-task separation can decay, leading to overlapping decision boundaries and rapid knowledge loss. Solutions such as Drift-Reducing Rehearsal (DRR) integrate centroids for representative rehearsal and two-level margin losses to anchor both class and task clusters against drift (Lyu et al., 2024).
  • The stability–plasticity dilemma—the need to adapt rapidly to new data (plasticity) without erasing past knowledge (stability)—is operationalized algorithmically via loss weighting, gradient-norm balancing (as in Batch-level Distillation, BLD), or architectural allocation (Fini et al., 2020).
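The logit-preservation idea behind LwF-style regularization can be made concrete with a small softened-cross-entropy sketch (pure Python on single logit vectors; a real implementation would operate on batched tensors).

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(old_logits, new_logits, temperature=2.0):
    """Cross-entropy between the frozen old model's softened outputs
    (teacher) and the current model's softened outputs (student).
    Penalizes drift of the new predictions away from the old ones."""
    p = softmax(old_logits, temperature)
    q = softmax(new_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss bottoms out (at the teacher's entropy) when the two sets of logits agree, so adding it to the task loss trades plasticity against stability via its weight.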

4. Memory Efficiency and Algorithmic Strategies

OCL algorithms are often differentiated by their memory overhead and mechanisms for rehearsal or knowledge preservation:

| Method | Inter-batch memory | Auxiliary params | Rehearsal buffer | Replay type |
| --- | --- | --- | --- | --- |
| Batch-level Distillation | None | None | No | Intra-batch distillation |
| Experience Replay (ER) | Bounded ($\mathcal{M}$) | Maybe (method-dependent) | Yes | Memory-based replay |
| ER-FSL | Bounded ($\mathcal{M}$) | None (standard NN) | Yes | Subspace-based replay |
| Proxy-based (PCR/HPCR) | Bounded ($\mathcal{M}$) | Softmax proxies | Yes | Proxy-contrastive batch replay |
| Prompt NCM (F2OCL) | None | Learnable prompts | No | Rehearsal-free, prototypes only |

  • Replay-free/strict constraint settings (MC-OCL): Only model parameters are retained across batches, with possible intra-batch temporary storage (e.g., probability/logit banks). BLD achieves high accuracy (e.g., 86.2% on 5-task CIFAR-10) with only ≈32 kB batch memory, outperforming L2 regularization and closely matching single-pass LwF at a small fraction of the overhead (Fini et al., 2020).
  • Buffer-limited settings: Most practical OCL setups use buffer sizes in the hundreds or low thousands (e.g., 1000–5000), with reservoir or more sophisticated selection strategies for sample diversity (GSS, ring buffer, centroid-based) (Soutif-Cormerais et al., 2023, Zhang et al., 2022, Lyu et al., 2024).
  • Feature subspace learning: ER-FSL assigns each incoming mini-batch to a subspace of features ($F_k$), then replays buffer data in the union of all subspaces, minimizing destructive interference and preserving degrees of freedom for old tasks. Empirically, ER-FSL outperforms PCR and previous ER variants by 1.6–2.2% on CIFAR-100 and MiniImageNet (Lin, 2024).
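The prototype/class-mean accumulation used by rehearsal-free NCM approaches can be sketched as follows. This is illustrative only: real methods compute means over deep features from a backbone, while plain lists stand in here, and both function names are hypothetical.

```python
def update_prototype(mean, count, feature):
    """Incremental class-mean update: no exemplars are stored,
    only a running mean and a counter per class."""
    count += 1
    mean = [m + (f - m) / count for m, f in zip(mean, feature)]
    return mean, count

def ncm_predict(feature, prototypes):
    """Nearest-class-mean rule: predict the class whose prototype
    is closest to the feature in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda c: sq_dist(feature, prototypes[c]))
```

Because the running mean is updated in O(1) per sample, the memory footprint is one vector and one counter per class, independent of stream length.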

5. Experimental Benchmarks, Metrics, and Empirical Insights

Standard evaluation protocols and metrics reflect the streaming, incremental nature of OCL:

  • Benchmarks: Split CIFAR-10/100, Split TinyImageNet and MiniImageNet, CORe50, permuted MNIST, among others (Bidaki et al., 9 Jan 2025, Soutif-Cormerais et al., 2023, Parisi et al., 2020, Lyu et al., 2024).
  • Metrics:
    • Average Accuracy $A_T$: Mean test accuracy over all tasks after the final task $T$.
    • Forgetting $F_T$: Average drop from peak to final accuracy per task.
    • Backward Transfer (BWT): $BWT = \tfrac{1}{T-1}\sum_{i=1}^{T-1}(a_{T,i} - a_{i,i})$, where $a_{t,i}$ is accuracy on task $i$ after training on task $t$.
    • Probed accuracy: Evaluation of learned features by re-training a linear probe at each stage (Soutif-Cormerais et al., 2023).
    • Resource metrics: Online accuracy gain per unit memory, latency (Ferret) (Zhou et al., 15 Mar 2025).
  • Empirical findings:
    • Properly tuned ER remains a very strong baseline, often difficult for more complex methods to beat by more than 1–2% (Soutif-Cormerais et al., 2023).
    • Underfitting is prevalent in strict OCL settings, as each mini-batch is typically replayed only a few times (3–9 passes), limiting convergence on any task (Soutif-Cormerais et al., 2023).
    • Techniques integrating sample diversity in the buffer, strong/semantically distinct augmentations, or subspace allocation improve both final accuracy and forgetting metrics (Lin, 2024, Zhang et al., 2022, Lin et al., 2023, Yu et al., 2022).
    • Memory-free, prompt-based NCM approaches reach impressive average accuracy (e.g., 71.18% on CIFAR-100, Sup-21K backbone) without rehearsal and with minimal forgetting (Wang et al., 1 Oct 2025).
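The accuracy-matrix metrics above can be computed as follows, assuming a convention where `acc[t][i]` holds test accuracy on task `i` after training through task `t` (the function name is illustrative).

```python
def ocl_metrics(acc):
    """Average accuracy A_T, forgetting F_T, and backward transfer BWT
    from a T x T accuracy matrix (acc[t][i]: accuracy on task i after task t)."""
    T = len(acc)
    avg_acc = sum(acc[T - 1]) / T
    # forgetting: average peak-minus-final accuracy over the first T-1 tasks
    forgetting = sum(max(acc[t][i] for t in range(i, T)) - acc[T - 1][i]
                     for i in range(T - 1)) / (T - 1)
    # backward transfer: final accuracy minus just-after-training accuracy
    bwt = sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)
    return avg_acc, forgetting, bwt
```

Note that BWT is the signed counterpart of forgetting: a large positive forgetting corresponds to a strongly negative backward transfer.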

6. Advances and Emerging Directions

Recent methodological advances and system-level frameworks emphasize robustness, adaptability, and efficiency:

  • Holistic Proxy-based Contrastive Replay (HPCR): Integrates anchor–sample contrast, decoupled temperature control for gradient vs. probability, and dual distillation channels (proxy, sample-to-sample) to enhance feature discrimination and prevent catastrophic forgetting, achieving 2–5% accuracy gains over strong baselines across CIFAR-10/100, MiniImageNet, and TinyImageNet (Lin et al., 2023).
  • Plug-and-play modules for adaptability (S6MOD): Embeds multiple state-space model discretizations with class-conditional routing based on uncertainty, combining contrastive and regularization objectives to boost both plasticity and stability (Liu et al., 2024).
  • Multi-level supervision and reverse self-distillation (MOSE): Leverages a hierarchy of sub-experts, aligning shallow and deep representations to mitigate overfitting/underfitting, and achieves substantial accuracy improvements (e.g., +7.3% on CIFAR-100) (Yan et al., 2024).
  • Resource-adaptive frameworks (Ferret): Employs pipeline parallelism, dynamic model partitioning, and self-tuned gradient compensation to optimize data value under memory constraints, achieving up to 3.7× lower memory overhead for the same online accuracy as competitors (Zhou et al., 15 Mar 2025).
  • Rehearsal-free, task-free OCL: Prompt-NCM architectures eliminate buffers, combining learnable prompts and prototype mean aggregation to match or exceed rehearsal-based baselines (e.g., 55.26% average accuracy on ImageNet-R) (Wang et al., 1 Oct 2025).
  • Augmentation-augmented rehearsal: Retrospective Augmented Rehearsal (RAR) smooths the empirical risk landscape and is plug-compatible with state-of-the-art replay algorithms, yielding consistent 9–17% gains (Zhang et al., 2022).

7. Challenges, Limitations, and Open Problems

Continued OCL progress faces key obstacles:

  • Catastrophic forgetting remains fundamental, especially as memory budgets decrease or rehearsal is prohibited.
  • Computational and memory resource management: Achieving high online accuracy in strictly memory- or latency-bounded environments requires sophisticated system-level techniques (pipeline parallelism, dynamic partitioning) (Zhou et al., 15 Mar 2025).
  • Scalability and real-world deployment: Most OCL algorithms are tuned for vision tasks; extension to audio, NLP, or multimodal streams is comparatively underexplored (Bidaki et al., 9 Jan 2025).
  • Online hyperparameter tuning: Automatic, adaptive optimization of buffer allocation, loss weighting, and other hyperparameters remains an open challenge (Soutif-Cormerais et al., 2023, Zhang et al., 2022).
  • Benchmarking heterogeneity: The lack of standardized, domain-diverse, and authentically non-i.i.d. OCL benchmarks complicates fair, reproducible evaluation (Soutif-Cormerais et al., 2023, Kruszewski et al., 2020).
  • Extending OCL to new learning paradigms: Federated, unsupervised, or open-set continual learning introduce additional complexity (e.g., privacy, communication, data drift/detection).

Promising research directions include adaptive memory mechanisms (dynamic buffer sizing, sparse retrieval, generative replay), unsupervised/self-supervised continual learning for data streams, and system-level approaches for resource-constrained, real-time, and federated settings (Bidaki et al., 9 Jan 2025, Zhou et al., 15 Mar 2025, Lyu et al., 2024).


