Memory Replay & Experience Buffers

Updated 6 February 2026

Memory replay and experience buffers are systems that store past transitions to allow interleaved training and combat issues like sample inefficiency and catastrophic forgetting.
They utilize diverse sampling schemes—uniform, prioritization, adaptive, and learnable policies—to refine training stability and improve convergence in both RL and continual learning settings.
Advanced techniques such as reservoir sampling, synthetic replay, and buffer refreshment optimize memory composition and effectively manage limited-memory regimes for enhanced task retention.

Memory replay and experience buffers constitute a central paradigm in contemporary deep reinforcement learning (RL) and continual learning. The experience buffer holds transitions or exemplars, enabling the interleaving of past and new data during learning. This mechanism allows agents and models to mitigate issues of sample inefficiency and catastrophic forgetting arising from non-iid data streams and sequential task exposure. The structure, sampling schemes, prioritization policies, memory composition, and specific integration strategies of replay mechanisms have substantial impact on training dynamics, learning stability, and task retention.

1. Fundamentals of Experience Replay and Buffer Architectures

The canonical experience replay buffer in off-policy RL, denoted $\mathcal{M}$ or $\mathcal{D}_n$ , is a finite-capacity data structure storing a sequence of transitions $(s,a,r,s')$ or supervised examples $(x,y)$ . At each update step $t$ , learning may be performed on current experience and memory samples, e.g., by minimizing a composite loss

$\mathcal{L}(\theta)=\mathcal{L}_{\mathrm{new}}(\theta; x_t,y_t)+\lambda\,\frac{1}{B}\sum_{k=1}^B\mathcal{L}_{\mathrm{replay}}(\theta; x_{j_k},y_{j_k})$

where replay minibatches are drawn from the buffer $\mathcal{M}$ (Krutsylo, 16 Feb 2025).

In RL, the buffer enables breaking the correlation in sequential rollouts and stabilizes bootstrapped updates by offering a broader data distribution. In the continual learning regime, it enables the rehearsal of prior task distributions, directly addressing catastrophic forgetting (Buzzega et al., 2020, Zhuo et al., 2023).

Variants include:

Reservoir sampling for unbiased maintenance of an iid subsample of all seen data (Buzzega et al., 2020, Isele et al., 2018, Zhuo et al., 2023).
Ring buffers/FIFO for fixed-horizon retention (Isele et al., 2018).
Episodic/long-term and short-term buffers that emulate complementary learning systems (Isele et al., 2018, Korycki et al., 2021).

Experience replay applies not only to value-based RL, but also to continuous-control and streaming learning, with architecture-specific buffer management (Ramicic et al., 2019, Hayes et al., 2018).

2. Sampling Schemes: Uniformity, Prioritization, and Adaptive Policies

A critical design choice is the selection mechanism for replayed samples. Uniform sampling sets $p_i=1/M$ for each slot in a buffer of size $M$ . However, both heuristic prioritization (e.g., TD-error, loss, reward, typicality) and principled learnable policies can reshape the replay distribution.

Uniform vs. Non-Uniform and Adaptive Schemes

Uniform replay is the default, but recent work demonstrates non-uniform sampling can strictly improve retention and accuracy on standard continual learning benchmarks (Krutsylo, 16 Feb 2025). For example, randomized weights $w_i$ assigned to each buffer entry and normalized to produce sampling probabilities $\mathcal{D}_n$ 0 consistently yield a significant $\mathcal{D}_n$ 1 over uniform sampling, with statistically significant improvements across multiple buffer sizes and datasets (e.g., CIFAR-10: up to +4.68%, Imagenette: up to +3.54% for $\mathcal{D}_n$ 2) (Krutsylo, 16 Feb 2025).
Adaptive per-sample weight updates—using exponential moving average loss, gradient magnitude, or other utility proxies—enable online prioritization:
- $\mathcal{D}_n$ 3 or $\mathcal{D}_n$ 4 after each replay step, with periodic normalization (Krutsylo, 16 Feb 2025).
- Empirical evidence shows that higher replay probability correlates moderately with lower loss samples, rather than strictly focusing on the highest-loss or most uncertain (Krutsylo, 16 Feb 2025).
Learnable replay policies have also been instantiated in the "Experience Replay Optimization" (ERO) bilevel framework, where one policy $\mathcal{D}_n$ 5 selects samples to maximize agent improvement as measured by the reward-difference signal $\mathcal{D}_n$ 6 (Zha et al., 2019).
Safety-biased and risk-sensitive sampling has been investigated, with convergence guarantees attained by setting sampling weights proportional to empirical reward variance and negatively exponentiated reward (to emphasize risk-averse behavior) (Szlak et al., 2021).
Quantum-inspired schemes perform qubit-based amplitude manipulation on each transition, encoding both priority (TD-error) and diversity (replay count), from which a sampling probability $\mathcal{D}_n$ 7 is derived. This approach matches or outperforms classical PER on numerous Atari benchmarks (Wei et al., 2021).

3. Memory Buffer Composition: Selection, Compression, and Packing

Buffer population—and, in the context of finite continual learning memory, class-specific allocation—is crucial.

Reservoir sampling enables class-unbiased selection, but further strategies for small buffers have emerged:
- TEAL prioritizes typical exemplars in feature space (inverse average KNN distance) and ensures diversity via progressive clustering, outperforming herding and random selection especially for $\mathcal{D}_n$ 8 exemplars per class (Shaul-Ariel et al., 2024).
- Saliency-Guided Experience Packing (SGEP/EPR): Selects high-saliency image patches using Grad-CAM, compresses multiple highly-informative patches per class into the fixed buffer, with zero-padding for compatibility, yielding sharper accuracy and lower backward transfer loss especially in the "tiny" buffer regime (Saha et al., 2021).
- Compressed Activation Replay (CAR): Stores compressed intermediate activations alongside input-output pairs, enforcing feature-space consistency through an additional loss term and drastically reducing forgetting with only marginal memory overhead (Balaji et al., 2020).
Streaming clustering (ExStream) summarizes data into a handful of per-class prototypes via on-the-fly merging. This approach achieves near-offline performance with $\mathcal{D}_n$ 9 prototypes per class, reducing catastrophic forgetting with $(s,a,r,s')$ 0 memory (Hayes et al., 2018).
Distilled Replay generates synthetic per-class exemplars $(s,a,r,s')$ 1 by matching the full class-gradient of the loss function, permitting an extreme reduction to $(s,a,r,s')$ 2 example per class at competitive accuracy (Rosasco et al., 2021).

Buffer content must also manage class balance, utility, and feature distribution drift: - Class-balanced reservoir, loss-aware replacement, and data or feature-based augmentation (e.g., MBA, Lossoir, Balancoir, BiC, ELRD) jointly improve retention (Buzzega et al., 2020). - Empirically, buffer management schemes that enforce class balance or importance-based retention yield substantial gains in class-IL/continual learning metrics (Buzzega et al., 2020, Shaul-Ariel et al., 2024).

4. Theoretical Analysis and Optimization of Replay Mechanisms

Theoretical frameworks model replay as variance-reducing resampling, and dynamical systems capture the effect of buffer size and prioritization:

Variance-reduction via U- and V-statistics: For estimators of the form

$(s,a,r,s')$ 3

with $(s,a,r,s')$ 4 a mini-batch estimator on a buffer $(s,a,r,s')$ 5, both the variance and computational complexity of RL estimators (e.g., LSTD) can be improved over the classic plug-in estimator $(s,a,r,s')$ 6. Analytical results show that for suitable replay ratio $(s,a,r,s')$ 7 and mini-batch size $(s,a,r,s')$ 8, the variance strictly improves ( $(s,a,r,s')$ 9), and cost reduces to $(x,y)$ 0 in kernel ridge regression (Han et al., 1 Feb 2025).

ODE analyses of Q-learning with replay buffer of size $(x,y)$ 1 and minibatch $(x,y)$ 2 show that both too small and too large buffers slow convergence, with an interior optimum $(x,y)$ 3 for small $(x,y)$ 4, and that prioritized replay helps primarily in the large $(x,y)$ 5 regime. Adaptive buffer resizing strategies based on TD-error drift of oldest samples reliably seek this optimum (Liu et al., 2017).
Convergence guarantees for Q-learning with arbitrary replay sampling schemes require that sampling weights $(x,y)$ 6; GLIE conditions and appropriate step-size schedules suffice for almost-sure convergence to the fixed-point of the limit Bellman operator. Safety-biased replay (variance-prioritized, negative-reward weighted) can provably shift learned policy toward risk-averse behavior in finite MDPs (Szlak et al., 2021).

5. Memory Replay in Continual Learning: Catastrophic Forgetting and Buffer-Efficient Methods

Experience replay is the principal paradigm for rehearsal-based continual learning, with buffer strategies deeply impacting catastrophic forgetting and final task accuracy.

Distribution-matching and coverage-maximization: Empirical studies consistently show that (a) uniform reservoir sampling, which approximates the distribution of all prior data, is superior to reward or “surprise” prioritization for catastrophic forgetting; (b) coverage maximization (favoring outliers) may outperform when rare events or short tasks are critical (Isele et al., 2018).
Small-buffer regimes particularly benefit from advanced selection methods:
- TEAL delivers up to 4–6% absolute accuracy gain over random or herding in CIFAR/tinyImageNet/CUB with $(x,y)$ 7– $(x,y)$ 8 exemplars/class (Shaul-Ariel et al., 2024).
- Saliency packing delivers 2–5% higher accuracy than full-image replay when $(x,y)$ 9 is minimal (Saha et al., 2021).
Strong Experience Replay (SER): Incorporates backward consistency (distillation on old buffer items) and forward consistency (distillation on new-task data using old model), leading to substantially reduced forgetting, especially in low-memory cases (e.g., class-IL accuracy on CIFAR-100 with $t$ 0: ER $t$ 1, DER++ $t$ 2, SER $t$ 3) (Zhuo et al., 2023).
Feature space drift and mitigation: Classical ER can leave intermediate representations unconstrained, especially in hybrid encoder–multihead architectures; methods such as CAR store compressed activations and penalize drift, achieving lower task forgetting (40% → 13.4% on Taskonomy at $t$ 4) (Balaji et al., 2020).

6. Buffer Refreshment, Synthetic Replay, and Novel Augmentation Mechanisms

Recent methods extend beyond simple storage and sampling, incorporating buffer refreshing and synthetic experience generation to further improve sample-efficiency and stability.

Lucid Dreaming (LiDER): Past states are revisited under the current policy, and if the “dreamed” return exceeds the original, the new trajectory replaces or augments the buffer. Extensive Atari experiments confirm consistent performance gains, with refreshed high-advantage trajectories prioritized (Du et al., 2020).
Synthetic experience construction: Interpolated Experience Replay (IER) estimates expected $t$ 5 tuples via averaging over observed rewards and transitions, yielding synthetic replay elements that reduce variance, boost sample efficiency, and ensure stability in stochastic/grid-world RL tasks (2002.01370). Distilled Replay incontinual learning generates per-class synthetic examples through gradient-matching, maintaining competitive final accuracies while reducing buffer size by $t$ 6 (Rosasco et al., 2021).
Augmented Memory Replay (AMR): Introduces scalar per-sample reward augmentation, learned via a compact neural net using TD-error, reward, and entropy features, dynamically shifting replay influence toward underfit or rare transitions and empirically improving sample efficiency in continuous-control domains (Ramicic et al., 2019).

7. Empirical Insights, Benchmarks, and Practical Recommendations

Across RL and continual learning domains, memory replay is consistently validated for both sample efficiency and task retention.

For deep RL (Atari, MuJoCo), buffer size, replay ratio, prioritization, and synthetic augmentation all tightly interact with learning stability (Fedus et al., 2020, Zha et al., 2019, Du et al., 2020, Wei et al., 2021, Ramicic et al., 2019).
In large-scale continual learning (Split-CIFAR, Taskonomy), classical ER is competitive but degrades under tight memory; techniques such as CAR, TEAL, and SER consistently boost performance in these regimes (Balaji et al., 2020, Shaul-Ariel et al., 2024, Zhuo et al., 2023, Saha et al., 2021).
The joint selection (which samples to store) and prioritization (how often to replay) remains an open optimization frontier (Krutsylo, 16 Feb 2025).
Variance-reduction theorems and ODE analyses provide guidelines for buffer/batch size and subsampling, with a key scaling law $t$ 7 to guarantee convergence and efficiency (Liu et al., 2017, Han et al., 1 Feb 2025).
Adaptive buffer management (e.g., aER) and principled selection (TEAL, CAR, EPR) are recommended for low-memory regimes (Liu et al., 2017, Shaul-Ariel et al., 2024, Balaji et al., 2020, Saha et al., 2021).

In summary, memory replay and experience buffers have evolved from uniform, FIFO designs to encompass highly adaptive, task- and sample-aware policies: selecting, compressing, prioritizing, synthesizing, and even refreshing experiences, with sharp theoretical and empirical understanding underpinning current and future systems. Researchers are directed to integrate buffer selection and prioritization, employ compressed or task-aware storage, and deploy adaptive or principled sampling policies—particularly when operating under tight memory and demanding regime of lifelong learning and deep RL (Krutsylo, 16 Feb 2025, Buzzega et al., 2020, Ramicic et al., 2019, Wei et al., 2021, Saha et al., 2021, Fedus et al., 2020, Zhuo et al., 2023, Balaji et al., 2020, Shaul-Ariel et al., 2024, Liu et al., 2017, 2002.01370, Isele et al., 2018).