
Exemplar Replay in Continual Learning

Updated 5 February 2026
  • Exemplar replay is a continual learning strategy that stores representative past samples to maintain performance and counteract catastrophic forgetting.
  • Modern variants use synthetic, compressed, and masked proxies to efficiently replay information while handling privacy and memory constraints.
  • Adaptive exemplar selection methods, including herding and proportional allocation, are key to balancing learning stability and plasticity.

Exemplar replay is a continual learning strategy in which a representative subset of previously observed samples—termed “exemplars”—is stored and periodically re-introduced during subsequent training phases. By replaying these exemplars alongside new data, the method seeks to preserve model performance on prior tasks and mitigate catastrophic forgetting, especially in non-stationary, incremental, or federated settings. Modern research has expanded exemplar replay beyond conventional raw sample replay to include synthetic, compressed, masked, or privacy-preserving proxies, forming a rich design space for continual and lifelong learning.

1. Classical Exemplar Replay: Mechanism and Mathematical Foundations

In the standard form, exemplar replay operates by maintaining a fixed-size buffer $\mathcal{M}$ containing samples from previously encountered tasks. During new-task training, minibatches are drawn from both the incoming data and the buffer, and losses are computed jointly. Typical loss functions comprise the primary objective (e.g., cross-entropy over new samples) and a knowledge-distillation or replay loss over exemplars to preserve past task performance.

For example, in the Adaptively Distilled Exemplar Replay (ADER) framework (Mi et al., 2020), the combined loss at cycle $t$ is

$$L_{\rm ADER}(\theta_t) = L_{\rm CE}(\theta_t) + \lambda_t\, L_{\rm KD}(\theta_t),$$

where $L_{\rm CE}$ is the standard cross-entropy on new data and $L_{\rm KD}$ is a distillation loss on exemplars, with $\lambda_t$ calculated adaptively based on the rates of new versus old items and samples.
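The combined loss above can be sketched in a few lines of numpy. This is a minimal illustration of the CE-plus-weighted-distillation structure, not ADER's actual implementation: the temperature value, function names, and toy logits are all assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax over rows
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true labels
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student outputs
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1))

def combined_replay_loss(new_logits, new_labels, ex_logits, teacher_logits, lam):
    # L = L_CE(new data) + lambda_t * L_KD(exemplars), as in the formula above
    return cross_entropy(new_logits, new_labels) + lam * distillation_loss(ex_logits, teacher_logits)
```

With `lam = 0` the loss reduces to plain cross-entropy on new data; increasing `lam` shifts weight toward preserving the old model's outputs on exemplars.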

Exemplar selection is a key issue. Representative methods include:

  • Herding: Greedy selection of samples whose features best match the mean class embedding.
  • Proportional allocation: Assigning more buffer slots to more frequently observed classes or items.
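Herding can be sketched as a greedy loop over class features. This minimal numpy version assumes features are already extracted and uses L2 distance between the running mean of chosen exemplars and the class mean as the selection criterion.

```python
import numpy as np

def herding_select(features, m):
    """Greedily pick m rows whose running mean best tracks the class mean."""
    mu = features.mean(axis=0)
    chosen, acc = [], np.zeros_like(mu)
    for k in range(1, m + 1):
        # Score each candidate by how close the running mean would get to mu
        gains = np.linalg.norm(mu - (acc + features) / k, axis=1)
        gains[chosen] = np.inf  # never re-pick an exemplar
        i = int(np.argmin(gains))
        chosen.append(i)
        acc += features[i]
    return chosen
```

The first pick is always the sample closest to the class mean; subsequent picks correct the accumulated deviation, which is why herded buffers approximate class statistics better than uniform sampling at small sizes.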

Buffer management enforces a global memory constraint, redistributing slots and replacing exemplars as new classes arrive. The ADER procedure further weights distillation based on the empirical rarity of new events, automatically tuning the stability/plasticity trade-off (Mi et al., 2020).

2. Compressed, Synthetic, and Privacy-Preserving Variants

Raw exemplar storage is often impractical in privacy-sensitive, resource-constrained, or large-scale settings. Recent research introduces several alternatives:

  • Autoencoder-Based Hybrid Replay (AHR) (Nori et al., 9 May 2025): Exemplars are compressed into latent codes using a hybrid autoencoder, reducing storage from $\mathcal{O}(t)$ to $\mathcal{O}(0.1t)$. The decoder reconstructs samples for replay, and latent-space structure is enforced using a charged-particle repulsion mechanism for novel class centroids.
  • Distilled Replay (Rosasco et al., 2021, Wang et al., 3 Aug 2025): Instead of storing real samples, highly informative synthetic exemplars are distilled via optimization to match gradient or output trajectories of previous models. Privacy-Preserving Replay (Pr²R) (Wang et al., 3 Aug 2025) extends this to Re-ID by condensing multiple images into a single blurred, privacy-neutral exemplar, optimized to induce the same update gradients as the originals.
  • Exemplar Masking for Multimodal CL (Lee et al., 2024): In multimodal scenarios, storage per exemplar is reduced by masking non-informative tokens (image or text) based on modal and cross-modal attention scores. This enables a larger effective buffer size under the same memory constraint, with masking ratios around 70%.
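The attention-based token-masking idea can be illustrated with a small sketch. The scoring input and the simple keep-top-scores rule here are simplifying assumptions, not the paper's exact procedure; the point is that discarding ~70% of tokens per exemplar frees budget for roughly three times as many exemplars.

```python
import numpy as np

def mask_exemplar(tokens, attn_scores, mask_ratio=0.7):
    """Keep only the highest-attention tokens of an exemplar (~30% with ratio 0.7)."""
    n_keep = max(1, int(round(len(tokens) * (1 - mask_ratio))))
    keep = np.argsort(attn_scores)[::-1][:n_keep]  # indices of top-scoring tokens
    keep = np.sort(keep)                           # preserve original token order
    return [tokens[i] for i in keep]
```

In a real multimodal system the scores would come from modal and cross-modal attention maps; here they are supplied directly.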

These developments represent a movement from literal replay toward information-dense, privacy-robust, and memory-efficient rehearsal mechanisms.

3. Theoretical Insights and Limitations

While exemplar replay empirically mitigates catastrophic forgetting, recent analysis reveals it is not universally benign. In overparameterized continual linear regression, replay can actually increase forgetting if exemplar selection and buffer size interact adversely with the geometry of task subspaces (Mahdaviyeh et al., 4 Jun 2025). In worst-case constructions, a single replayed sample can elevate forgetting from $\mathcal{O}(1/T)$ to $\Theta(1)$. Even under random selection, forgetting can be non-monotonic with buffer size and may worsen unless the stored exemplars span the appropriate subspaces.

Principal-angle analysis quantifies this effect: when the span of replayed samples aligns with critical directions (angles near $\pi/4$), forgetting is amplified. Sufficiently large buffers ($m \geq d$) or careful task alignment can mitigate this, but in practice, both selection policy and domain similarity must be accounted for.
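Principal angles between the replay span and a task subspace can be computed with the standard QR-plus-SVD recipe. This sketch assumes both subspaces are given as matrices whose columns span them; the function name is an illustrative choice.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between the column spans of A and B, in radians."""
    Qa, _ = np.linalg.qr(A)                       # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)                       # orthonormal basis for span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)  # cosines of the angles
    return np.arccos(np.clip(s, -1.0, 1.0))
```

Angles of 0 mean the replay buffer fully covers a task direction; $\pi/2$ means it is blind to it; the analysis above identifies intermediate alignments near $\pi/4$ as the problematic regime.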

4. Exemplar Replay Beyond Vision: Multimodal, Recommender, and Federated Contexts

Exemplar replay strategies extend to multimodal, sequence-based, and federated domains, often requiring significant adaptations:

  • Session-based recommendation (Mi et al., 2020): Exemplars correspond to historical user-item interaction sequences. ADER selects representative sessions per item, proportional to item frequency, and applies adaptive distillation.
  • Federated class-incremental learning (Sun et al., 2024): Exemplar condensation is performed locally on clients, optimizing condensed exemplars to match training gradients, feature relationships, and global prototype consistency. Disentangled generative sharing (via a shared VAE) reduces information heterogeneity across non-i.i.d. clients.
  • Lifelong person re-ID under privacy constraints (Wang et al., 3 Aug 2025, Xu et al., 2024): Exemplar-free replay methods (e.g., DASK) rehearse old distributions via learned style-transfer networks parametrized by adaptive convolution kernels, while Pr²R builds blurred, gradient-matched synthetic images.

Exemplar replay thus underpins continual learning protocols across supervised, multimodal, generative, and federated architectures, but buffer geometry, privacy, and domain shifts place unique constraints on implementation.

5. Empirical Performance and Comparative Evaluations

Exemplar replay methods often outperform regularization-only, naive fine-tuning, and exemplar-free baselines at mitigating catastrophic forgetting across class-incremental and domain-incremental benchmarks:

  • ADER yields Recall@20 scores (DIGINETICA/YOOCHOOSE) exceeding both jointly retrained and regularized models (Mi et al., 2020).
  • AHR achieves state-of-the-art accuracy and compute/memory efficiency, outperforming iCaRL, BiC, REMIND, and related exemplars-plus-generative approaches (Nori et al., 9 May 2025).
  • Pr²R achieves +6.1% mAP on seen domains over the best prior replay-based Re-ID methods and closes the gap to joint-training upper bounds (Wang et al., 3 Aug 2025).
  • Exemplar Masking methods can triple the number of multimodal exemplars stored, yielding up to +2.4% accuracy compared to unmasked replay at long incremental horizons (Lee et al., 2024).

Ablation studies confirm that tailored exemplar selection (herding, proportional allocation, token masking, or relationship-matching) is critical for memory efficiency and stability.

6. Alternative Exemplar-Free and Summarization Approaches

Methods entirely eliminating exemplar storage are increasingly prominent:

  • Nearest-Class-Mean (NCM) classifier (He et al., 2022): Maintains only online running means of class features, obviating exemplar replay. When the feature extractor is strong (e.g., a pretrained ImageNet backbone), NCM outperforms contemporary exemplar-replay approaches under memory constraints.
  • Distribution Rehearsal (DASK) (Xu et al., 2024): Trains an instance-adaptive kernel prediction network (AKPNet) to reconstruct old-style images, enabling knowledge rehearsal and consolidation with no exemplar storage.
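An online NCM classifier reduces to maintaining per-class running sums of features. This minimal sketch assumes fixed-length feature vectors from a frozen extractor and Euclidean distance; the class name is illustrative.

```python
import numpy as np

class NCMClassifier:
    """Nearest-class-mean classifier over streamed feature vectors."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def update(self, feat, label):
        # Online update of the running mean for this class (no samples stored)
        if label not in self.sums:
            self.sums[label] = np.zeros_like(feat, dtype=float)
            self.counts[label] = 0
        self.sums[label] += feat
        self.counts[label] += 1

    def predict(self, feat):
        # Assign to the class whose mean feature is nearest
        means = {c: self.sums[c] / self.counts[c] for c in self.sums}
        return min(means, key=lambda c: np.linalg.norm(feat - means[c]))
```

Memory cost is one vector per class regardless of stream length, which is why NCM is attractive when exemplar storage is forbidden.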

These approaches are well-suited to hard privacy or edge-memory restrictions, though they may lag in expressivity behind well-tuned, memory-efficient replay.

7. Practical Guidelines and Trade-offs

The effectiveness of exemplar replay is tightly coupled to buffer selection, allocation, replay schedule, and the space of possible sample representations:

  • Sample selection: Balanced, herding, or relationship-preserving selection mitigates sampling bias.
  • Buffer size: Marginal replay gains are non-monotonic; both tiny and sufficiently large buffers may outperform intermediate sizes (Mahdaviyeh et al., 4 Jun 2025).
  • Generative vs. memory-based trade-off: Hybrid and condensed methods—autoencoders, VAEs, or synthetic exemplars—reduce storage at cost of additional computation and hyperparameter tuning (Nori et al., 9 May 2025, Sun et al., 2024).
  • Privacy considerations: Synthetic or masked exemplars achieve privacy compliance but may entail a fidelity gap unless optimized for gradient correspondence and inter-class separation (Wang et al., 3 Aug 2025).
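Proportional slot allocation under a global budget can be sketched as follows; the integer-flooring and overshoot-trimming rules are illustrative choices for handling rounding, not taken from any specific paper.

```python
def allocate_slots(class_counts, budget):
    """Allocate a fixed exemplar budget across classes, proportional to frequency."""
    total = sum(class_counts.values())
    # Floor division, but guarantee every class at least one slot
    slots = {c: max(1, (budget * n) // total) for c, n in class_counts.items()}
    # Trim any rounding overshoot from the largest allocations
    while sum(slots.values()) > budget:
        big = max(slots, key=slots.get)
        slots[big] -= 1
    return slots
```

For example, with class counts {50, 30, 20} and a budget of 10, the allocation is {5, 3, 2}; the minimum-one-slot guard matters when many rare classes would otherwise round to zero.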

In sum, exemplar replay constitutes a crucial axis in the contemporary landscape of continual, federated, and privacy-sensitive machine learning. Ongoing work balances between expressive, memory-intensive replay, compressed or synthetic summarization, and distribution-matching rehearsal, with rigorous empirical and theoretical scrutiny informing best practices and future directions.
