
Demonstration-Guided Continual RL

Updated 28 December 2025
  • The paper demonstrates that externalizing prior knowledge into a self-evolving demonstration repository enables state-of-the-art forward transfer and minimizes forgetting in continual RL tasks.
  • It employs a curriculum-based exploration strategy that seamlessly shifts from demonstration guidance to autonomous exploration for rapid adaptation across dynamic tasks.
  • Experimental results on navigation and locomotion benchmarks confirm that DGCRL outperforms traditional methods in Average Performance and Forgetting metrics.

Demonstration-Guided Continual Reinforcement Learning (DGCRL) encompasses a class of algorithms designed to address the stability–plasticity dilemma in continual reinforcement learning (CRL). In dynamic, non-stationary environments, RL agents must learn over a sequence of tasks without catastrophic forgetting or retraining from scratch, while adapting rapidly to novel task conditions. DGCRL externalizes prior knowledge not as parameter regularization or replay buffers, but as a self-evolving demonstration repository that directly influences agent exploration policy at the behavioral level. This integration of demonstration-guided exploration and curriculum scheduling offers state-of-the-art forward transfer, stability, and knowledge reuse in dynamic continual RL benchmarks (Yang et al., 21 Dec 2025). Demonstration-guided approaches also extend to the reward inference setting, as in lifelong inverse reinforcement learning (Lifelong IRL) (Mendez et al., 2022), further highlighting the generality of expert-trajectory-driven transfer in CRL.

1. Formal Problem Setting and Mathematical Framework

DGCRL is formulated for a sequence $\mathcal{D} = \{M_1, \ldots, M_N\}$ of Markov decision processes (MDPs) that share state and action spaces $(\mathcal{S}, \mathcal{A})$ but may differ in transition dynamics $T$ and reward $R$. Each task is $M_i = (\mathcal{S}, \mathcal{A}, T_i, R_i, \gamma)$, with discount factor $\gamma \in [0, 1)$. The agent's objective is to optimize the average return:

$$J_{\mathrm{CRL}}(\pi) = \frac{1}{N}\sum_{i=1}^N J(\pi; M_i), \quad \text{where } J(\pi; M) = \mathbb{E}_{\pi, M}\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right].$$

Key desiderata include:

  • Stability: Maintain performance on past tasks (minimize forgetting).
  • Plasticity: Rapidly acquire new knowledge (maximize forward transfer).

DGCRL diverges from prior CRL approaches by externalizing prior knowledge into a demonstration repository (guide policy set $\Pi_g$), yielding direct behavioral control during agent exploration.
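The objective above can be sketched numerically as a Monte-Carlo estimate over per-task episodes (a minimal illustration; function names are our own, not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of J(pi; M) from one episode's reward sequence."""
    return float(sum((gamma ** t) * r for t, r in enumerate(rewards)))

def crl_objective(per_task_rewards, gamma=0.99):
    """J_CRL: the average of the per-task returns over the sequence M_1..M_N."""
    return float(np.mean([discounted_return(rs, gamma) for rs in per_task_rewards]))
```

For example, with two tasks yielding reward sequences `[1, 1]` and `[0, 2]` at $\gamma = 0.5$, the per-task returns are 1.5 and 1.0, giving an objective of 1.25.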

2. Demonstration Repository Construction and Evolution

DGCRL’s core insight is to maintain a continually updated set $\Pi_g = \{\pi_g^{(1)}, \pi_g^{(2)}, \ldots\}$ of guide policies, each representing a previously successful or expert trajectory. For a new task $M_i$, the agent retrieves the demonstration $\pi_{g,i} \in \Pi_g$ yielding the highest expected return on $M_i$, i.e.,

$$\pi_{g,i} = \arg\max_{\pi \in \Pi_g} \; J(\pi; M_i),$$

and records its performance threshold $r_{\mathrm{thr},i} = J(\pi_{g,i}; M_i)$.

Self-evolution of the repository proceeds as follows: if the current policy $\pi_i$ (a mixture of demonstration-guided and exploratory behaviors) achieves return $r_{\pi_i} > \beta_t\, r_{\mathrm{thr},i}$ (with $\beta_t \to 1$ over training), then $\pi_i$ is added to $\Pi_g$ and the threshold is updated. This mechanism ensures that $\Pi_g$ encodes increasingly performant and relevant behaviors as the agent's experience grows (Yang et al., 21 Dec 2025).
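The retrieval and self-evolution rules can be sketched as follows (a minimal illustration; the paper states only that the threshold is updated, so raising it to the new return is our assumption):

```python
def select_guide(repo, evaluate):
    """Retrieve the guide policy with the highest estimated return on the new
    task: pi_{g,i} = argmax over the repository of J(pi; M_i).
    `evaluate(pi)` is any estimator of J(pi; M_i)."""
    returns = [evaluate(pi) for pi in repo]
    best = max(range(len(repo)), key=lambda k: returns[k])
    return repo[best], returns[best]  # (pi_{g,i}, r_{thr,i})

def maybe_evolve(repo, policy, policy_return, r_thr, beta):
    """Add the current policy once its return beats beta * r_thr, and raise
    the threshold to the new return (assumed update rule)."""
    if policy_return > beta * r_thr:
        repo.append(policy)
        return max(policy_return, r_thr)
    return r_thr
```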

3. Curriculum-Guided Exploration and Policy Scheduling

DGCRL implements a curriculum-based exploration mechanism that combines demonstration and agent policies within each episode of horizon $H$. A guide length $h_t$ (initially $H$, decremented by $\Delta_h$ after threshold-exceeding rollouts) segments each episode:

  • For $t = 0, \ldots, h_t - 1$, actions are sampled from the guide policy: $a_t \sim \pi_{g,i}(\cdot \mid s_t)$.
  • For $t = h_t, \ldots, H - 1$, actions are sampled from the agent's own exploration policy: $a_t \sim \pi_{e,i}(\cdot \mid s_t)$.

This phased control “jump-starts” the agent from promising state regions and then transitions to autonomous exploration. The guide length $h_t$ decreases as agent performance surpasses demonstration quality, scheduling a gradual shift from demonstration guidance to pure exploration. The update rule is:

$$h_{t+1} = h_t - \Delta_h \quad \text{if} \quad r_{\pi_i} > r_{\mathrm{thr},i}.$$

Formally, the induced episode return is

$$J(\pi_i; M_i) = \mathbb{E}\left[\sum_{t=0}^{h_t - 1} \gamma^t r_t + \sum_{t=h_t}^{H-1} \gamma^t r_t\right].$$

No explicit imitation loss is required, since demonstration policies directly influence the visitation distribution (Yang et al., 21 Dec 2025).
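The guided-prefix rollout and the guide-length schedule above might be sketched like this (the environment interface and function names are illustrative, not from the reference implementation):

```python
def mixed_rollout(reset, step, guide_act, explore_act, h, H):
    """One episode of horizon H: guide policy for the first h steps, the
    agent's exploration policy afterwards. Returns transitions and return."""
    s = reset()
    transitions, ep_return = [], 0.0
    for t in range(H):
        a = guide_act(s) if t < h else explore_act(s)
        s_next, r = step(s, a)
        transitions.append((s, a, r, s_next))
        ep_return += r
        s = s_next
    return transitions, ep_return

def update_guide_length(h, ep_return, r_thr, delta_h):
    """Shrink the guided prefix once the mixed policy beats the threshold."""
    return max(h - delta_h, 0) if ep_return > r_thr else h
```

On a toy chain environment where the reward equals the action, guiding with action 1 for the first two of four steps yields a return of 2, after which the guide length is decremented.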

4. DGCRL Algorithmic Implementation

DGCRL is realized atop off-policy actor-critic RL (TD3 in the reference implementation). The training protocol for each task $M_i$ is:

  1. Select demonstration $\pi_{g,i}$, set $r_{\mathrm{thr},i}$, and initialize $h \leftarrow H$.
  2. While $h \geq 0$:

    • Execute the mixed policy for $H$ steps and gather transitions $\mathcal{B}$.
    • Update the actor (parameters $\theta$) and twin critics (parameters $\phi$) by minimizing, respectively,

    $$L(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}}\left[(Q_\phi(s,a) - y)^2\right], \qquad y = r + \gamma\, Q_{\bar\phi}\big(s', \bar\pi_e(s') + \epsilon\big),$$

    $$L(\theta) = -\mathbb{E}_{s \sim \mathcal{B}}\left[Q_\phi\big(s, \pi_e(s; \theta)\big)\right].$$

    • If $r_{\pi_i} > \beta_t\, r_{\mathrm{thr},i}$: add $\pi_i$ to $\Pi_g$ and decrement $h \leftarrow h - \Delta_h$.

  3. Proceed to the next task.

Critical hyperparameters for TD3 include learning rates of $3 \times 10^{-4}$, $\gamma = 0.99$ or $0.95$, target update rate $\tau = 5 \times 10^{-3}$, and action noise in $[0.05, 0.1]$. The initial repository size is $|\Pi_g| = 50$ (60 for HalfCheetah); $\Delta_h$ is task-dependent (Yang et al., 21 Dec 2025).
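As a concrete illustration of the critic target $y$ in step 2, here is a sketch of the standard TD3 clipped double-Q target (the loss above writes a single target critic $Q_{\bar\phi}$; taking the minimum over twin target critics, with clipped Gaussian smoothing noise $\epsilon$, is standard TD3 and is assumed here):

```python
import numpy as np

def td3_targets(rewards, next_states, pi_target, q1_target, q2_target,
                gamma=0.99, noise_std=0.1, noise_clip=0.5, rng=None):
    """Clipped double-Q target: y = r + gamma * min(Q1', Q2')(s', pi'(s') + eps),
    where eps is clipped Gaussian noise added to the target action."""
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = np.clip(rng.normal(0.0, noise_std, size=next_states.shape),
                  -noise_clip, noise_clip)
    a_next = pi_target(next_states) + eps
    q_next = np.minimum(q1_target(next_states, a_next),
                        q2_target(next_states, a_next))
    return rewards + gamma * q_next
```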

5. Experimental Results and Benchmarking

Empirical evaluation demonstrates DGCRL's efficacy on both synthetic non-stationary 2D navigation (three variation modes: goal/reward, puddle/transition, both) and MuJoCo locomotion tasks (Hopper, HalfCheetah, Ant with target-velocity shifts). Each sequence contains $N = 50$ tasks with episode horizon $H = 100$. Baselines comprise naive sequential RL, Robust Policy (domain randomization), Adaptive (LSTM), MAML, and LLIRL.

Quantitative comparisons, using metrics such as Average Performance (AP), Forward Transfer (FT), and Forgetting (F), show that DGCRL achieves superior performance and lower (sometimes negative) forgetting:

| Benchmark | AP (DGCRL) | AP (Baseline Range) | FT (DGCRL) | FT (Baseline Range) | Forgetting (DGCRL) | Forgetting (Baseline Range) |
|---|---|---|---|---|---|---|
| Navigation v1 | –6.7 | –43 … –78 | 0.82 | –0.02 … 0.62 | –1.3 | 22 … 31 |
| Hopper | +93.8 | –3 … –25 | 0.80 | –0.14 … 0.39 | –3.5 | –60 … 35 |

Learning curves exhibit a rapid jump-start (due to demonstration coverage), stable convergence, brief dips correlating with reductions in guide length $h_t$, and swift recovery, confirming the value of curriculum scheduling (Yang et al., 21 Dec 2025).
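The AP and Forgetting metrics can be illustrated with one common set of definitions from the continual-learning literature (the paper's exact formulas may differ; Forward Transfer additionally requires single-task reference scores and is omitted here):

```python
import numpy as np

def continual_metrics(P):
    """Continual-RL metrics from a performance matrix P, where P[i, j] is
    performance on task j after finishing training on task i.
    AP: mean final performance across all tasks.
    Forgetting: mean drop from just-after-training performance to final
    performance on earlier tasks (negative values mean backward transfer)."""
    N = P.shape[0]
    ap = float(P[-1].mean())
    forgetting = float(np.mean([P[j, j] - P[-1, j] for j in range(N - 1)]))
    return ap, forgetting
```

Under these definitions, negative Forgetting (as DGCRL reports) indicates that performance on earlier tasks improved after later training.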

6. Sensitivity, Ablations, and Theoretical Insights

Sensitivity analyses demonstrate that increasing the initial repository size accelerates convergence and improves AP/FT, but only marginally affects forgetting. Notably, DGCRL retains a performance lead even with minimal demonstrations (20% of the full set).

Ablations confirm that (i) resetting both actor and critic parameters between tasks maximizes AP/FT, and (ii) so-called “pure replay” baselines (Initial Trajectory Replay, Evolving Trajectory Replay) are inferior to DGCRL, establishing that dynamic curriculum and self-evolution are crucial beyond mere replay of demonstrations.

A theoretical regret analysis (Appendix) indicates a sublinear dependency $O(K^{4/3} T^{-1/3})$ on the number of tasks $K$, though direct comparisons to alternate CRL regret bounds are pending (Yang et al., 21 Dec 2025).

7. Relation to Lifelong Inverse Reinforcement Learning and Broader Context

Lifelong IRL (Mendez et al., 2022) extends the DGCRL methodology to the reward inference regime. Instead of policy cloning, it employs maximum-entropy IRL with a hierarchical latent reward basis $L$ and sparse task coefficients $s^{(t)}$, incrementally recovering reusable components across a sequence of demonstration-driven tasks. The online learning algorithm alternates between single-task reward inference and basis updating (via LASSO and ridge regression), supporting both forward and reverse transfer, i.e., improvement on earlier tasks as more tasks are processed. This approach provides an efficient, interpretable instantiation of demonstration-guided transfer, embodying the conceptual foundation of DGCRL in the inverse RL domain.
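The sparse-coding half of that alternation can be sketched with ISTA, one standard LASSO solver (a simplified illustration under the factorization $\theta^{(t)} = L\, s^{(t)}$; Lifelong IRL's full update also refits the basis $L$ itself via ridge regression):

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the L1 penalty (the core of one ISTA step)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_code(L, theta_t, lam=0.1, steps=300, lr=0.5):
    """Fit sparse coefficients s_t such that L @ s_t ~ theta_t
    (LASSO via iterative shrinkage-thresholding). Task t's reward is then
    r_t(state) = phi(state) @ (L @ s_t)."""
    s = np.zeros(L.shape[1])
    for _ in range(steps):
        grad = L.T @ (L @ s - theta_t)      # gradient of 0.5 * ||L s - theta||^2
        s = soft_threshold(s - lr * grad, lr * lam)
    return s
```

For an identity basis, the solution is simply the soft-thresholded target weights, which makes the sparsifying effect of the L1 penalty easy to see.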

8. Limitations and Prospects

Current DGCRL variants are evaluated exclusively in simulated domains. Scalability of the demonstration repository may require advanced retrieval and pruning (e.g., clustering-based indexing) for real-world application. Conventional forgetting metrics can yield misleading negative values; the development of more robust continual RL evaluation protocols is needed. DGCRL directly shapes sampled state-action distributions but does not address scenarios with evolving observation modalities or online feature learning. The open challenge remains to extend theoretical analysis, repository management, and demonstration-guided control to broader, high-dimensional, or safety-critical settings (Yang et al., 21 Dec 2025).
