
Memory Rewriting in RL: Concepts & Architectures

Updated 27 January 2026
  • Memory rewriting in RL is a mechanism where agents actively erase outdated data and selectively integrate new evidence.
  • It underpins methods like gating, explicit overwrite actions, and convex blending used in architectures such as LSTM, Transformers, and replay buffers.
  • This dynamic update strategy improves sample efficiency, minimizes catastrophic forgetting, and enhances decision-making in nonstationary tasks.

Memory rewriting in reinforcement learning (RL) refers to the set of mechanisms by which agents selectively update, overwrite, or erase prior experiences or internal representations, balancing long-term retention against dynamic adaptation to new data, signals, or environmental shifts. Unlike pure retention—simply storing and retrieving prior information—memory rewriting is central to adaptive decision-making in partially observable or nonstationary domains, with concrete instantiations spanning experience replay buffers, external structured memories, recurrent and attention-based architectures, and agent-designed transition graphs. Modern approaches utilize explicit erasure, gating, merging, overwrite actions, or calibrated updates to regulate the persistence and transformation of learned traces, with significant impacts on sample efficiency, generalization, catastrophic forgetting, and memory interference. Recent research has shifted toward benchmarking and constructing mechanisms that support continual, context-driven update, recognizing that trainable forgetting and selective overwriting are as fundamental as stable retention.

1. Formalism and Core Principles of Memory Rewriting

At its foundation, memory rewriting in RL is defined within the POMDP framework as actions and update functions that modulate a latent memory state $m_t = f_\phi(h_t)$, where $h_t = (o_0, a_0, \dots, o_t)$ is the agent's action-observation history. The general differentiable update is expressed as

$$m_{t+1} = W_\phi(F_\phi(m_t), E_\phi(\eta_t)),$$

where $F_\phi$ implements the retention/forgetting gate, and $E_\phi$ encodes new evidence $\eta_t = (a_t, o_{t+1})$ (Shchendrigin et al., 21 Jan 2026). The crux of memory rewriting is that $F_\phi$ must actively erase outdated, irrelevant, or contradictory content in $m_t$ when $E_\phi(\eta_t)$ provides new salient signals, such that

$$\frac{\partial m_{t'}}{\partial m_t} \approx 0 \quad \text{(erasure)}, \qquad \frac{\partial m_{t'}}{\partial \eta_t} > 0 \quad \text{(integration)}$$

for specific sequences of inputs. This principle holds whether memory is implemented as buffer, recurrent state, attention-weighted cache, or external structured store.

Benchmarks such as Endless T-Maze and Color-Cubes explicitly separate regimes of pure retention ($N = 1$ cues) from rewriting ($N > 1$, sequential cue overwrite), exposing architectures incapable of adaptive erasure (Shchendrigin et al., 21 Jan 2026). Empirical results indicate that trainable forgetting gates (as in LSTM/GRU) and explicit overwrite mechanisms are necessary conditions for robust, context-driven memory rewriting.

2. Architectures and Algorithms for Memory Rewriting

A full spectrum of RL approaches implement memory rewriting via discrete or continuous mechanisms:

  • Recurrent Networks (LSTM/GRU): Employ input/forget/output gates, with the forget gate $f_t$ enabling selective erasure of latent cell features. LSTM's gating supports high-dimensional, context-conditioned overwrite, yielding superior performance in rewriting benchmarks (Shchendrigin et al., 21 Jan 2026).
  • Transformer and Structured Memory Extensions: Transformers maintain caches or global slots but lack explicit erasure, resulting in stale information persisting unless complemented by adaptive rewriting modules (e.g., ELMUR’s LRU-based slot overwrite and convex blending mechanism) (Cherepanov et al., 8 Oct 2025).
  • External and Modular Memories: Agents utilize buffer-based or structured memories, such as Stable Hadamard Memory (SHM), which updates memory as

$$M_t = M_{t-1} \odot C_\theta(x_t) + U_\varphi(x_t)$$

where $C_\theta(x_t)$ calibrates (reinforces/erases) and $U_\varphi(x_t)$ writes new content (Le et al., 2024). Hadamard calibration bounds gradients and enables cell-wise selective rewriting.

  • Replay Buffers with Active Overwriting: Experience replay control via buffer size and sample weighting directly implements rewriting. Adaptive algorithms (aER) adjust buffer capacity online in response to old samples’ TD-error, optimizing the trade-off between retaining and overwriting experiences (Liu et al., 2017).

Architectures such as GWR-R merge similar transitions into graph nodes and prune stale edges, dynamically rewriting stored state-action trajectories, while Forget-and-Grow (FoG) exploits explicit decay of sampling probabilities and critic expansion to combine continuous erasure and new feature encoding (Hafez et al., 2023, Kang et al., 3 Jul 2025).
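The SHM-style Hadamard update can be sketched in a few lines of NumPy, with the calibration map $C_\theta$ and write map $U_\varphi$ reduced to toy linear functions (shapes and parameterizations are assumptions for illustration, not the paper's architecture):

```python
import numpy as np

def shm_update(M_prev, x, Wc, Wu):
    """SHM-style step: M_t = M_{t-1} * C(x_t) + U(x_t) (element-wise).
    C calibrates each cell (values near 0 erase, near 1 reinforce);
    U writes new content. Both are toy linear maps of the input x."""
    d, k = M_prev.shape
    C = (1.0 / (1.0 + np.exp(-(Wc @ x)))).reshape(d, k)  # calibration in (0, 1)
    U = (Wu @ x).reshape(d, k)                           # new write content
    return M_prev * C + U

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 2))    # memory matrix M_{t-1}
x = rng.standard_normal(3)         # current input x_t
Wc = rng.standard_normal((8, 3))   # 8 = 4 * 2 calibration outputs
Wu = rng.standard_normal((8, 3))
M_next = shm_update(M, x, Wc, Wu)
```

Because calibration acts cell-wise through a bounded factor, individual entries can be erased or reinforced independently, which is the selective-rewriting property the update formula expresses.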

3. Mechanisms: Write, Erase, Overwrite, and Blending

Memory rewriting mechanisms are instantiated as:

  • Explicit Overwrite Actions: Agents decide on discrete write actions $w \in W$, directly choosing the new memory state $m' = w$ (binary or buffer-based external memory) (Icarte et al., 2020).
  • Sliding Windows and Push/Skip Buffers: Policies select when to push new observations/actions into a windowed buffer or skip (retain the previous state), controlling the overwrite schedule (Icarte et al., 2020).
  • Erasure/Calibration Gates: Structured memories (SHM) generate calibration matrices $C_t$ whose element-wise product with the old memory erases or reinforces it, providing stability and selectivity (Le et al., 2024).
  • Convex Blending with LRU: ELMUR updates selected slots, using replacement for empty slots and $\lambda$-weighted blending for the least-recently-used slot,

$$m^{i+1}_{j^*} = \lambda\,\tilde{u}^{i+1}_{j^*} + (1-\lambda)\,m^{i}_{j^*}$$

resulting in smooth decay of obsolete content and insertion of critical new cues (Cherepanov et al., 8 Oct 2025).
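A toy sketch of the LRU-plus-blending write rule, assuming slots are rows of an array and NaN marks an unused slot (these bookkeeping details are illustrative choices, not ELMUR's actual implementation):

```python
import numpy as np

def elmur_write(slots, last_used, update, lam=0.5, t=0):
    """ELMUR-style slot write (sketch): an empty slot (NaN timestamp)
    is replaced outright; otherwise the least-recently-used slot j*
    receives the convex blend m_{j*} <- lam * u + (1 - lam) * m_{j*}."""
    empty = np.where(np.isnan(last_used))[0]
    if empty.size > 0:
        j = int(empty[0])
        slots[j] = update                                 # replacement
    else:
        j = int(np.argmin(last_used))                     # LRU slot
        slots[j] = lam * update + (1.0 - lam) * slots[j]  # convex blend
    last_used[j] = t
    return j

slots = np.zeros((2, 3))
last_used = np.full(2, np.nan)
elmur_write(slots, last_used, np.ones(3), t=1)        # fills slot 0
elmur_write(slots, last_used, 2.0 * np.ones(3), t=2)  # fills slot 1
j = elmur_write(slots, last_used, 4.0 * np.ones(3), lam=0.5, t=3)  # blends LRU slot 0
```

With $\lambda = 0.5$ the third write blends the oldest slot halfway toward the new content, giving the smooth decay of obsolete information described above.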

4. Theoretical Bounds and Empirical Validation

The dynamics of memory rewriting have been characterized both analytically and empirically:

  • ODE Models of Replay Memory: Continuous-time models reveal non-monotonic dependencies between buffer size $M$ and learning speed, with both small and large $M$ constraining convergence. Adaptive schemes (aER) auto-tune capacity for maximal efficiency (Liu et al., 2017).
  • Stability of Calibration Gates: Randomized, context-driven calibration in SHM maintains bounded expectation of cumulative gating factors and decorrelates temporal updates, preventing gradient explosion/vanishing—contrary to fixed calibration (Le et al., 2024).
  • Effective Memory Horizon: In ELMUR, LRU rewrites combined with cross-attention decouple memory size from sequence length, sustaining retention and update over up to $10^6$ steps, exceeding standard transformer context by more than $10^5\times$ (Cherepanov et al., 8 Oct 2025).
  • Sample Diversity and Forgetting: GWR-R's rewrites increase the minimal pairwise distance among replay samples, decorrelating training batches and reducing catastrophic forgetting, but introduce a compression-performance trade-off controlled by the activation threshold $a_T$ (Hafez et al., 2023).

Benchmarks in Endless T-Maze and Color-Cubes, as well as meta-RL environments, consistently demonstrate that recurrent models (especially LSTM), structured calibration, and explicit overwrite actions outperform standard transformer and fixed-decay memories in rewriting regimes (Shchendrigin et al., 21 Jan 2026, Le et al., 2024).
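The aER idea of tuning replay capacity online can be sketched as a FIFO buffer that grows when old samples still carry TD error and shrinks (overwriting them sooner) when they do not. This is a deliberately simplified toy under assumed thresholds and doubling/halving rules, not the published algorithm:

```python
import collections
import random

class AdaptiveReplayBuffer:
    """FIFO replay buffer with online capacity adaptation (simplified
    sketch of the aER idea): informative old samples argue for a larger
    buffer; stale ones are overwritten faster via a smaller buffer."""

    def __init__(self, capacity=1000, min_cap=100, max_cap=100000):
        self.buf = collections.deque(maxlen=capacity)
        self.min_cap, self.max_cap = min_cap, max_cap

    def add(self, transition):
        self.buf.append(transition)        # oldest entry overwritten when full

    def sample(self, k):
        return random.sample(list(self.buf), min(k, len(self.buf)))

    def adapt(self, old_td_error, threshold=0.1):
        """Double capacity if the oldest samples' TD error is still large;
        halve it otherwise (clipped to [min_cap, max_cap])."""
        cap = self.buf.maxlen
        if old_td_error > threshold:
            new_cap = min(self.max_cap, cap * 2)
        else:
            new_cap = max(self.min_cap, cap // 2)
        # Rebuilding the deque keeps the newest entries, dropping the oldest.
        self.buf = collections.deque(self.buf, maxlen=new_cap)
        return new_cap
```

The key design point is that capacity, not just sampling weight, controls how quickly old experience is overwritten.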

5. Impact, Limitations, and Ongoing Challenges

Memory rewriting directly impacts sample efficiency, generalizability, and robustness to new or contradicting evidence. Major findings across recent works include:

  • Need for Trainable Forgetting: Fixed schedules (e.g., exponential decay) and indiscriminate attention caches often fail under variable episode lengths or task nonstationarity. Architectures with trainable gating or context-driven overwrite generalize better (Shchendrigin et al., 21 Jan 2026).
  • Credit Assignment for Writes: Scalable optimization of memory rewriting requires improved alignment of RL reward signals with specific erase/add or overwrite events, especially in differentiable or graph-structured memories (Ramani, 2019).
  • Interference and Memory Capacity: Merging or pruning in GWR-R and episodic buffers can introduce interference or under-represent critical transitions if overwriting is overly aggressive. Adaptive control of capacity and rewriting rate is necessary.
  • Benchmarks and Meta-Learning: New diagnostic tasks (variable-length, multi-modal cues, sequential overwriting) and curriculum/meta-learned solutions to optimal rewrite scheduling are identified as priorities (Shchendrigin et al., 21 Jan 2026).

Limitations persist: context-independence assumptions in gating, recursive implementation overheads, and instability under sparse reward regimes for attention-based methods. There remains significant demand for designs that couple retention, selective erasure, and capacity management with scalable, sample-efficient RL under strict partial observability (Le et al., 2024, Shchendrigin et al., 21 Jan 2026).

6. Taxonomy and Comparative Table of Memory-Rewriting Approaches

Below is an explicit taxonomy assembling principal mechanisms and representative papers:

| Mechanism | Core Update Formula | Representative Work (arXiv) |
|---|---|---|
| Explicit overwrite | $m' = w$ (agent chooses new state) | (Icarte et al., 2020; Yu et al., 3 Jul 2025) |
| Erasure/calibration | $M_t = M_{t-1} \odot C_t + U_t$ | (Le et al., 2024; Shchendrigin et al., 21 Jan 2026) |
| Convex blending (LRU) | $m^{i+1}_j = \lambda u^{i+1}_j + (1-\lambda) m^i_j$ | (Cherepanov et al., 8 Oct 2025) |
| Time-decay buffer | $w_i(t) = \max\{T, (1-\epsilon)^{t-t_i}\}$ | (Kang et al., 3 Jul 2025) |
| Graph merge/prune | Node insertion, update, edge pruning | (Hafez et al., 2023) |
| Experience replay | Overwrite oldest; adaptive buffer size | (Liu et al., 2017; Ramani, 2019) |

These strategies encapsulate the diverse trade-offs among capacity, stability, trainability, and retention versus update that are critical in contemporary RL. Ongoing research continues to refine mechanisms for aligning credit assignment with write events, suppressing interference, and learning adaptive rewrite schedules.
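The time-decay buffer formula reads as a sampling weight that ages out old transitions geometrically but never drops them below a floor $T$; a one-line sketch (parameter names assumed for illustration):

```python
def decay_weight(t, t_i, eps=0.01, floor=0.05):
    """FoG-style time-decay sampling weight (sketch):
    w_i(t) = max(T, (1 - eps)^(t - t_i)), where t_i is the step at which
    transition i was stored, eps the decay rate, and floor the weight
    floor T that keeps old transitions sampleable."""
    return max(floor, (1.0 - eps) ** (t - t_i))
```

The floor implements continuous erasure without hard deletion: old experience is de-emphasized rather than discarded outright.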

7. Future Directions and Open Problems

Key areas for future advances include:

  • Meta-learned Forgetting Parameters: Dynamic adaptation of rewrite schedules and gating factors to task statistics and environmental drift (Shchendrigin et al., 21 Jan 2026).
  • Integrating Planning with Memory-Rewriting: Hybrid model-based and episodic recall architectures, facilitating joint adaptation of internal state and world model (Ramani, 2019).
  • Unsupervised Memory Objectives: Auxiliary objectives (e.g., reconstruction, contrastive losses) to shape more robust, interference-resistant memory rewriting (Ramani, 2019).
  • Scalable Structural Memories: Efficient graph, map, and matrix-based external memories with principled rewrite mechanisms balancing sample decorrelation and long-term retention (Hafez et al., 2023, Cherepanov et al., 8 Oct 2025).

A plausible implication is that the act of forgetting—not merely remembering—is an essential ingredient for mastering RL in realistic, nonstationary, and partially observable domains, and architectures with explicit, context-sensitive rewrite dynamics are anticipated to dominate future agent design.
