SyReM: Continual Learning for Motion Forecasting

Updated 19 January 2026
  • SyReM is a continual learning framework that balances stability and plasticity by merging gradient projection with selective memory rehearsal.
  • It operates in an online, one-pass training paradigm using reservoir sampling and gradient similarity to select optimal rehearsal samples.
  • Evaluated on the INTERACTION benchmark, SyReM nearly eliminates forgetting (MR-BWT ≈ -0.01%) and reduces MR-CT by 27% relative to the non-continual Vanilla baseline.

Synergetic Memory Rehearsal (SyReM) is a continual learning (CL) scheme designed to resolve the stability–plasticity dilemma in deep neural network (DNN)-based motion forecasting. The method synergistically unites a gradient-projection–based stability constraint with a selective memory rehearsal mechanism to mitigate catastrophic forgetting while maintaining high learning plasticity. SyReM operates in an online CL paradigm where data from diverse scenarios arrive as a one-pass stream, and it achieves state-of-the-art performance on naturalistic driving prediction tasks evaluated on the INTERACTION benchmark (Lin et al., 27 Aug 2025).

1. Conceptual Foundations: The Stability–Plasticity Dilemma

Continual learning methods for DNNs face the dual challenge of preserving stability (retaining old knowledge) and ensuring plasticity (adaptation to new data). Excessive emphasis on memory stability leads to impaired learning plasticity, while prioritizing plasticity results in catastrophic forgetting (i.e., degradation of performance on previously learned scenarios). SyReM explicitly addresses this competition for model parameters by introducing architectural and training constraints that balance both objectives.

SyReM is distinguished by its maintenance of a compact long-term memory buffer $\mathcal{M}$ and two tightly coupled modules:

  • A stability constraint based on gradient projection that prevents increases in average loss over stored buffer samples.
  • A selective rehearsal strategy that prioritizes buffer samples whose loss gradients closely resemble those of new data, thereby targeting plasticity enhancement without destabilizing previously acquired knowledge.

2. Stability Constraint: Mathematical Specification

Let $f_{\theta}$ denote the forecasting model with parameters $\theta$ and $\mathcal{M}$ the buffer of $|\mathcal{M}|$ previously seen samples. The average buffer loss is $\ell(f_\theta, \mathcal{M})$. At any update step $c$, SyReM enforces

$\ell(f_{\theta_c}, \mathcal{M}) \leq \ell(f_{\theta_{c-1}}, \mathcal{M})$

An equivalent gradient-based inner-product form is

$\langle \mathbf{g}, \mathbf{g}_{\mathcal{M}} \rangle \geq 0$

where $\mathbf{g} = \nabla_{\theta} \mathcal{L}_{\text{total}}$ (combined loss gradient) and $\mathbf{g}_{\mathcal{M}} = \nabla_{\theta} \ell(f_\theta, \mathcal{M})$ (buffer loss gradient). If this condition is violated, the training gradient is projected onto the closest feasible direction in the $\ell_2$ norm:

$\tilde{\mathbf{g}}^* = \begin{cases} \mathbf{g}, & \langle\mathbf{g}, \mathbf{g}_{\mathcal{M}}\rangle \geq 0 \\[6pt] \mathbf{g} - \dfrac{\mathbf{g}^\top \mathbf{g}_{\mathcal{M}}}{\|\mathbf{g}_{\mathcal{M}}\|^2_2} \mathbf{g}_{\mathcal{M}}, & \langle\mathbf{g}, \mathbf{g}_{\mathcal{M}}\rangle < 0 \end{cases}$

To first order, this ensures that updates do not increase the average loss on stored scenarios.
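The case analysis above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation; the function name and the toy vectors are assumptions:

```python
import numpy as np

def project_gradient(g, g_mem, eps=1e-12):
    """Project the training gradient g onto the closest direction (in the
    l2 sense) satisfying <g, g_mem> >= 0, so that the update does not
    increase the average buffer loss to first order."""
    dot = float(np.dot(g, g_mem))
    if dot >= 0:
        return g  # no conflict: keep the original gradient
    # Subtract the component of g that opposes the buffer gradient.
    return g - (dot / (float(np.dot(g_mem, g_mem)) + eps)) * g_mem

# A conflicting pair: g points against the buffer gradient.
g = np.array([1.0, -1.0])
g_mem = np.array([0.0, 1.0])
g_tilde = project_gradient(g, g_mem)  # now <g_tilde, g_mem> is ~0
```

After projection, the inner product with the buffer gradient is non-negative, which is exactly the feasibility condition the constraint demands.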

3. SyReM Algorithm: Operational Workflow

SyReM adheres to strict online (one-pass) continual learning. For each incoming sample $(X, Y)$:

  1. Buffer Management:
    • Use reservoir sampling to update the buffer $\mathcal{M}$, maintaining representative coverage over the $N_{\text{seen}}$ total samples observed so far.
    • If the buffer is full, new samples replace older entries chosen uniformly at random.
  2. Batch Formation:
    • Compose the current training batch $\mathcal{D}_{\mathcal{T}_c}$ from the $B$ newest samples.
  3. Selective Memory Rehearsal:
    • Randomly draw $M \geq 2B$ candidates from $\mathcal{M}$.
    • Compute loss gradients $\{\mathbf{g}_k\}_{k=1}^{M}$ for the buffer candidates and $\mathbf{g}_c$ for the current batch.
    • Rank candidates by cosine similarity $q_k = \frac{\mathbf{g}_c^\top \mathbf{g}_k}{\|\mathbf{g}_c\|_2 \|\mathbf{g}_k\|_2}$.
    • Assemble the rehearsal batch $\mathcal{M}_{\text{reh}}$ from the top-$B$ scoring candidates.
  4. Composite Loss:
    • Compute the new-data loss $\mathcal{L}_{\text{new}}$, the rehearsal loss $\mathcal{L}_{\text{r}}$, and the total loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{new}} + \mathcal{L}_{\text{r}}$.
  5. Gradient Computation and Stability Enforcement:
    • Compute $\mathbf{g}$ and $\mathbf{g}_{\mathcal{M}}$, projecting $\mathbf{g}$ if the stability constraint is violated.
  6. Parameter Update:
    • Update $\theta \leftarrow \theta - \alpha \tilde{\mathbf{g}}^*$ using the (possibly projected) gradient.

Reservoir sampling details and full pseudocode are described in Section IV.A of (Lin et al., 27 Aug 2025).
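The buffer-management step relies on standard reservoir sampling, which keeps every sample seen so far in the buffer with equal probability. A minimal, self-contained sketch (function and variable names are illustrative, not from the paper's code):

```python
import random

def reservoir_update(buffer, sample, n_seen, capacity):
    """One reservoir-sampling step: after processing n_seen + 1 samples,
    each observed sample occupies the buffer with equal probability
    capacity / (n_seen + 1)."""
    if len(buffer) < capacity:
        buffer.append(sample)  # buffer not yet full: always keep
    else:
        j = random.randrange(n_seen + 1)  # uniform over all samples seen
        if j < capacity:
            buffer[j] = sample  # replace a uniformly chosen slot
    return n_seen + 1

# Stream 100 samples through a capacity-10 buffer.
buffer, n_seen = [], 0
for sample in range(100):
    n_seen = reservoir_update(buffer, sample, n_seen, capacity=10)
```

The replacement probability shrinks as the stream grows, which is what keeps the buffer an unbiased snapshot of all scenarios seen so far rather than a sliding window over the most recent ones.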

4. Selective Rehearsal Mechanism: Gradient Similarity Dynamics

Selective rehearsal leverages gradient similarity to maximize downstream transfer from buffered samples. For each candidate buffer sample $k$:

$q_k = \frac{\mathbf{g}_c^\top \mathbf{g}_k}{\|\mathbf{g}_c\|_2 \|\mathbf{g}_k\|_2}, \quad k = 1, \ldots, M$

Candidates are ranked by $q_k$. Top-scoring samples, which are most aligned with current-task updates, constitute the rehearsal batch $\mathcal{M}_{\text{reh}}$:

$\mathcal{L}_{\text{r}} = \mathbb{E}_{(X,Y)\sim \mathcal{M}_{\text{reh}}} \big[ \ell(f_\theta(X), Y) \big]$

Empirically, rehearsal samples selected by SyReM exhibit predominantly positive cosine similarity with the current gradient, whereas random replay yields roughly $50\%$ negative similarities that conflict with current-task updates.
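A minimal sketch of this ranking, assuming per-sample loss gradients have already been flattened into vectors (NumPy, with illustrative names and toy data):

```python
import numpy as np

def select_rehearsal(g_current, candidate_grads, top_b):
    """Rank candidate buffer samples by cosine similarity q_k between
    their loss gradient and the current-batch gradient; return the
    indices of the top-B most aligned candidates plus all scores."""
    def cosine(u, v, eps=1e-12):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))
    scores = [cosine(g_current, g_k) for g_k in candidate_grads]
    order = np.argsort(scores)[::-1]  # descending similarity
    return order[:top_b].tolist(), scores

# M = 4 candidate gradients, keep the top B = 2 most aligned.
g_c = np.array([1.0, 0.0])
candidates = [np.array([1.0, 0.1]), np.array([-1.0, 0.0]),
              np.array([0.5, 0.5]), np.array([0.0, 1.0])]
chosen, scores = select_rehearsal(g_c, candidates, top_b=2)  # chosen == [0, 2]
```

Candidate 1 points directly against the current gradient ($q_1 = -1$), so it is excluded; rehearsing it would pull the update away from the current task, which is precisely the conflict this mechanism avoids.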

5. Hyperparameter Effects

SyReM’s efficacy is modulated by several hyperparameters:

| Hyperparameter | Typical Value | Effect |
| --- | --- | --- |
| Buffer size $\lvert\mathcal{M}\rvert$ | $1{,}000$ ($\approx 0.5\%$ of total data) | Memory fidelity vs. computational overhead |
| Training batch size $B$ | $8$ | Magnitude of plasticity vs. stability |
| Candidate pool $M$ | $16$ | Quality of rehearsal sample selection |
| Stability threshold | $0$ | Strictness of loss non-increase |
| Similarity threshold | None (top-$B$) | Selection by highest cosine similarity only |

Larger buffers enhance stability and larger candidate pools improve rehearsal selection, both at increased compute cost. An overly large batch size $B$ risks more frequent violations of the stability constraint, and no explicit similarity cutoff is applied beyond top-$B$ selection.
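The typical values above could be bundled into a single configuration object. The following dataclass is a hypothetical sketch; the field names are not from the paper's code, and the default step size is an illustrative placeholder:

```python
from dataclasses import dataclass

@dataclass
class SyReMConfig:
    """Hypothetical hyperparameter bundle mirroring the typical values above."""
    buffer_size: int = 1000           # |M|: long-term memory capacity (~0.5% of data)
    batch_size: int = 8               # B: newest samples per online step
    candidate_pool: int = 16          # M: candidates drawn for rehearsal ranking
    stability_threshold: float = 0.0  # required lower bound on <g, g_M>
    lr: float = 1e-3                  # alpha: step size (illustrative value only)

cfg = SyReMConfig()
assert cfg.candidate_pool >= 2 * cfg.batch_size  # the M >= 2B requirement
```

Encoding the $M \geq 2B$ requirement as an assertion makes the coupling between the candidate pool and the batch size explicit at configuration time.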

6. Empirical Results: INTERACTION Motion Forecasting

SyReM was evaluated on 11 sequential subdatasets from the INTERACTION driving benchmark. Baselines included Vanilla (no CL), Vanilla-GP (gradient projection only), SyReM-R (stability constraint with random replay), and the joint training upper bound JoTr.

Key findings:

  • Vanilla exhibited severe forgetting: MR-BWT $\approx +20\%$.
  • Vanilla-GP improved stability (MR-BWT $\approx +5.9\%$) but impaired plasticity (higher MR-CT).
  • SyReM achieved MR-BWT $\approx -0.01\%$ (virtually no forgetting, even a slight improvement) and a $27\%$ reduction in MR-CT versus Vanilla.
  • FDE-BWT for SyReM was lowest at all stages, while FDE-CT was equal to or better than Vanilla's except in one subtask.
  • In combined joint evaluation, SyReM outperformed all online-CL baselines and matched or exceeded JoTr once multiple tasks had been presented.
  • Zero-shot generalization (forward transfer: FDE-FWT, MR-FWT) was uniformly superior for SyReM compared to the baselines.

7. Ablation Studies and Theoretical Insights

Ablation with SyReM-R demonstrated that the gradient-projection constraint alone suffices for stability, but random sample replay yields diminished plasticity (MR-CT higher by approximately $26\%$). Recorded cosine-similarity distributions for selected rehearsal samples were skewed positive under SyReM but near-even under random replay, supporting the conclusion that random selection introduces gradient conflict.

This supports the conclusion that SyReM’s modules are complementary: gradient projection independently enforces stability, selective rehearsal is crucial for plasticity enhancement, and only their conjunction reconciles both objectives without trade-off. The interplay enables escape from the stability–plasticity dilemma:

  • Stability is safeguarded by constraining buffer loss via gradient projection.
  • Plasticity is boosted by focusing rehearsal on buffer samples whose gradient directions align with the current task, thereby protecting historical knowledge while facilitating efficient acquisition of new skills.

SyReM’s public implementation is available at https://github.com/BIT-Jack/SyReM (Lin et al., 27 Aug 2025).
