
Demonstration-Guided Continual RL

Updated 28 December 2025
  • The paper demonstrates that externalizing prior knowledge into a self-evolving demonstration repository enables state-of-the-art forward transfer and minimizes forgetting in continual RL tasks.
  • It employs a curriculum-based exploration strategy that seamlessly shifts from demonstration guidance to autonomous exploration for rapid adaptation across dynamic tasks.
  • Experimental results on navigation and locomotion benchmarks confirm that DGCRL outperforms traditional methods in Average Performance and Forgetting metrics.

Demonstration-Guided Continual Reinforcement Learning (DGCRL) encompasses a class of algorithms designed to address the stability–plasticity dilemma in continual reinforcement learning (CRL). In dynamic, non-stationary environments, RL agents must learn over a sequence of tasks without catastrophic forgetting or retraining from scratch, while adapting rapidly to novel task conditions. DGCRL externalizes prior knowledge not as parameter regularization or replay buffers, but as a self-evolving demonstration repository that directly influences agent exploration policy at the behavioral level. This integration of demonstration-guided exploration and curriculum scheduling offers state-of-the-art forward transfer, stability, and knowledge reuse in dynamic continual RL benchmarks (Yang et al., 21 Dec 2025). Demonstration-guided approaches also extend to the reward inference setting, as in lifelong inverse reinforcement learning (Lifelong IRL) (Mendez et al., 2022), further highlighting the generality of expert-trajectory-driven transfer in CRL.

1. Formal Problem Setting and Mathematical Framework

DGCRL is formulated for a sequence $\mathcal{D} = \{M_1, \ldots, M_N\}$ of Markov decision processes (MDPs) that share state and action spaces $(\mathcal{S}, \mathcal{A})$ but may differ in transition dynamics $T$ and reward $R$. Each task is $M_i = (\mathcal{S}, \mathcal{A}, T_i, R_i, \gamma)$, with discount factor $\gamma \in [0, 1)$. The agent's objective is to optimize the average return:

$$J_{\mathrm{CRL}}(\pi) = \frac{1}{N}\sum_{i=1}^N J(\pi; M_i), \quad \text{where } J(\pi; M) = \mathbb{E}_{\pi, M}\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right].$$

Key desiderata include:

  • Stability: Maintain performance on past tasks (minimize forgetting).
  • Plasticity: Rapidly acquire new knowledge (maximize forward transfer).

DGCRL diverges from prior CRL approaches by externalizing prior knowledge into a demonstration repository (guide policy set $\Pi_g$), yielding direct behavioral control during agent exploration.
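The objective above can be sketched numerically as a Monte-Carlo estimate over per-task episodes (a minimal illustration; function names are our own, not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of J(pi; M) from one episode's reward sequence."""
    return float(sum((gamma ** t) * r for t, r in enumerate(rewards)))

def crl_objective(per_task_rewards, gamma=0.99):
    """J_CRL: the average of the per-task returns over the sequence M_1..M_N."""
    return float(np.mean([discounted_return(rs, gamma) for rs in per_task_rewards]))
```

For example, with two tasks yielding reward sequences `[1, 1]` and `[0, 2]` at $\gamma = 0.5$, the per-task returns are 1.5 and 1.0, giving an objective of 1.25.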

2. Demonstration Repository Construction and Evolution

DGCRL’s core insight is to maintain a continually updated set $\Pi_g = \{\pi_g^{(1)}, \pi_g^{(2)}, \ldots\}$ of guide policies, each representing a previously successful or expert trajectory. For a new task $M_i$, the agent retrieves the demonstration $\pi_{g,i} \in \Pi_g$ yielding the highest expected return on $M_i$, i.e.,

$$\pi_{g,i} = \arg\max_{\pi \in \Pi_g} \; J(\pi; M_i),$$

and records its performance threshold $r_{\mathrm{thr},i} = J(\pi_{g,i}; M_i)$.

Self-evolution of the repository proceeds as follows: if the current policy $\pi_i$ (a mixture of demonstration-guided and exploratory behaviors) achieves return $r_{\pi_i} > \beta_t\, r_{\mathrm{thr},i}$ (with $\beta_t \to 1$ over training), then $\pi_i$ is added to $\Pi_g$ and the threshold is updated. This mechanism ensures that $\Pi_g$ encodes increasingly performant and relevant behaviors as the agent's experience grows (Yang et al., 21 Dec 2025).
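The retrieval and self-evolution rules can be sketched as follows (a minimal illustration; the paper states only that the threshold is updated, so raising it to the new return is our assumption):

```python
def select_guide(repo, evaluate):
    """Retrieve the guide policy with the highest estimated return on the new
    task: pi_{g,i} = argmax over the repository of J(pi; M_i).
    `evaluate(pi)` is any estimator of J(pi; M_i)."""
    returns = [evaluate(pi) for pi in repo]
    best = max(range(len(repo)), key=lambda k: returns[k])
    return repo[best], returns[best]  # (pi_{g,i}, r_{thr,i})

def maybe_evolve(repo, policy, policy_return, r_thr, beta):
    """Add the current policy once its return beats beta * r_thr, and raise
    the threshold to the new return (assumed update rule)."""
    if policy_return > beta * r_thr:
        repo.append(policy)
        return max(policy_return, r_thr)
    return r_thr
```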

3. Curriculum-Guided Exploration and Policy Scheduling

DGCRL implements a curriculum-based exploration mechanism that combines demonstration and agent policies within each episode of horizon $H$. A guide length $h_t$ (initially $H$, decremented by $\Delta_h$ after threshold-exceeding rollouts) segments each episode:

  • For $t = 0, \ldots, h_t - 1$, actions are sampled from the guide policy: $a_t \sim \pi_{g,i}(\cdot \mid s_t)$.
  • For $t = h_t, \ldots, H - 1$, actions are sampled from the agent's own exploration policy: $a_t \sim \pi_{e,i}(\cdot \mid s_t)$.

This phased control “jump-starts” the agent from promising state regions and then transitions to autonomous exploration. The guide length $h_t$ decreases as agent performance surpasses demonstration quality, scheduling a gradual shift from demonstration guidance to pure exploration. The update rule is:

$$h_{t+1} = h_t - \Delta_h \quad \text{if} \quad r_{\pi_i} > r_{\mathrm{thr},i}.$$

Formally, the induced episode return is

$$J(\pi_i; M_i) = \mathbb{E}\left[\sum_{t=0}^{h_t - 1} \gamma^t r_t + \sum_{t=h_t}^{H-1} \gamma^t r_t\right].$$

No explicit imitation loss is required, since demonstration policies directly influence the visitation distribution (Yang et al., 21 Dec 2025).
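The guided-prefix rollout and the guide-length schedule above might be sketched like this (the environment interface and function names are illustrative, not from the reference implementation):

```python
def mixed_rollout(reset, step, guide_act, explore_act, h, H):
    """One episode of horizon H: guide policy for the first h steps, the
    agent's exploration policy afterwards. Returns transitions and return."""
    s = reset()
    transitions, ep_return = [], 0.0
    for t in range(H):
        a = guide_act(s) if t < h else explore_act(s)
        s_next, r = step(s, a)
        transitions.append((s, a, r, s_next))
        ep_return += r
        s = s_next
    return transitions, ep_return

def update_guide_length(h, ep_return, r_thr, delta_h):
    """Shrink the guided prefix once the mixed policy beats the threshold."""
    return max(h - delta_h, 0) if ep_return > r_thr else h
```

On a toy chain environment where the reward equals the action, guiding with action 1 for the first two of four steps yields a return of 2, after which the guide length is decremented.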

4. DGCRL Algorithmic Implementation

DGCRL is realized atop off-policy actor-critic RL (TD3 in the reference implementation). The training protocol for each task $M_i$ is:

  1. Select demonstration $\pi_{g,i}$, set $r_{\mathrm{thr},i}$, and initialize $h \leftarrow H$.
  2. While $h \geq 0$:

    • Execute the mixed policy for $H$ steps and gather transitions $\mathcal{B}$.
    • Update the actor (parameters $\theta$) and twin critics (parameters $\phi$) by minimizing, respectively,

    $$L(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}}\left[(Q_\phi(s,a) - y)^2\right], \qquad y = r + \gamma\, Q_{\bar\phi}\big(s', \bar\pi_e(s') + \epsilon\big),$$

    $$L(\theta) = -\mathbb{E}_{s \sim \mathcal{B}}\left[Q_\phi\big(s, \pi_e(s; \theta)\big)\right].$$

    • If $r_{\pi_i} > \beta_t\, r_{\mathrm{thr},i}$: add $\pi_i$ to $\Pi_g$ and decrement $h \leftarrow h - \Delta_h$.

  3. Proceed to the next task.

Critical hyperparameters for TD3 include learning rates of $3 \times 10^{-4}$, $\gamma = 0.99$ or $0.95$, target update rate $\tau = 5 \times 10^{-3}$, and action noise in $[0.05, 0.1]$. The initial repository size is $|\Pi_g| = 50$ (60 for HalfCheetah); $\Delta_h$ is task-dependent (Yang et al., 21 Dec 2025).
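As a concrete illustration of the critic target $y$ in step 2, here is a sketch of the standard TD3 clipped double-Q target (the loss above writes a single target critic $Q_{\bar\phi}$; taking the minimum over twin target critics, with clipped Gaussian smoothing noise $\epsilon$, is standard TD3 and is assumed here):

```python
import numpy as np

def td3_targets(rewards, next_states, pi_target, q1_target, q2_target,
                gamma=0.99, noise_std=0.1, noise_clip=0.5, rng=None):
    """Clipped double-Q target: y = r + gamma * min(Q1', Q2')(s', pi'(s') + eps),
    where eps is clipped Gaussian noise added to the target action."""
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = np.clip(rng.normal(0.0, noise_std, size=next_states.shape),
                  -noise_clip, noise_clip)
    a_next = pi_target(next_states) + eps
    q_next = np.minimum(q1_target(next_states, a_next),
                        q2_target(next_states, a_next))
    return rewards + gamma * q_next
```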

5. Experimental Results and Benchmarking

Empirical evaluation demonstrates DGCRL's efficacy on both synthetic non-stationary 2D navigation (three variation modes: goal/reward, puddle/transition, both) and MuJoCo locomotion tasks (Hopper, HalfCheetah, Ant with target-velocity shifts). Each sequence contains $N = 50$ tasks with episode horizon $H = 100$. Baselines comprise naive sequential RL, Robust Policy (domain randomization), Adaptive (LSTM), MAML, and LLIRL.

Quantitative comparisons, using metrics such as Average Performance (AP), Forward Transfer (FT), and Forgetting (F), show that DGCRL achieves superior performance and lower (sometimes negative) forgetting:

| Benchmark | AP (DGCRL) | AP (Baseline Range) | FT (DGCRL) | FT (Baseline Range) | Forgetting (DGCRL) | Forgetting (Baseline Range) |
|---|---|---|---|---|---|---|
| Navigation v1 | –6.7 | –43 … –78 | 0.82 | –0.02 … 0.62 | –1.3 | 22 … 31 |
| Hopper | +93.8 | –3 … –25 | 0.80 | –0.14 … 0.39 | –3.5 | –60 … 35 |

Learning curves exhibit a rapid jump-start (due to demonstration coverage), stable convergence, brief dips correlating with reductions in guide length $h_t$, and swift recovery, confirming the value of curriculum scheduling (Yang et al., 21 Dec 2025).
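The AP and Forgetting metrics can be illustrated with one common set of definitions from the continual-learning literature (the paper's exact formulas may differ; Forward Transfer additionally requires single-task reference scores and is omitted here):

```python
import numpy as np

def continual_metrics(P):
    """Continual-RL metrics from a performance matrix P, where P[i, j] is
    performance on task j after finishing training on task i.
    AP: mean final performance across all tasks.
    Forgetting: mean drop from just-after-training performance to final
    performance on earlier tasks (negative values mean backward transfer)."""
    N = P.shape[0]
    ap = float(P[-1].mean())
    forgetting = float(np.mean([P[j, j] - P[-1, j] for j in range(N - 1)]))
    return ap, forgetting
```

Under these definitions, negative Forgetting (as DGCRL reports) indicates that performance on earlier tasks improved after later training.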

6. Sensitivity, Ablations, and Theoretical Insights

Sensitivity analyses demonstrate that increasing the initial repository size accelerates convergence and improves AP/FT, but only marginally affects forgetting. Notably, DGCRL retains a performance lead even with minimal demonstrations (20% of the full set).

Ablations confirm that (i) resetting both actor and critic parameters between tasks maximizes AP/FT, and (ii) so-called “pure replay” baselines (Initial Trajectory Replay, Evolving Trajectory Replay) are inferior to DGCRL, establishing that dynamic curriculum and self-evolution are crucial beyond mere replay of demonstrations.

A theoretical regret analysis (Appendix) indicates a sublinear dependency $O(K^{4/3} T^{-1/3})$ on the number of tasks $K$, though direct comparisons to alternate CRL regret bounds are pending (Yang et al., 21 Dec 2025).

7. Relation to Lifelong Inverse Reinforcement Learning and Broader Context

Lifelong IRL (Mendez et al., 2022) extends the DGCRL methodology to the reward inference regime. Instead of policy cloning, it employs maximum-entropy IRL with a hierarchical latent reward basis $L$ and sparse task coefficients $s^{(t)}$, incrementally recovering reusable components across a sequence of demonstration-driven tasks. The online learning algorithm alternates between single-task reward inference and basis updating (via LASSO and ridge regression), supporting both forward and reverse transfer, i.e., improvement on earlier tasks as more tasks are processed. This approach provides an efficient, interpretable instantiation of demonstration-guided transfer, embodying the conceptual foundation of DGCRL in the inverse RL domain.
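The sparse-coding half of that alternation can be sketched with ISTA, one standard LASSO solver (a simplified illustration under the factorization $\theta^{(t)} = L\, s^{(t)}$; Lifelong IRL's full update also refits the basis $L$ itself via ridge regression):

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the L1 penalty (the core of one ISTA step)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_code(L, theta_t, lam=0.1, steps=300, lr=0.5):
    """Fit sparse coefficients s_t such that L @ s_t ~ theta_t
    (LASSO via iterative shrinkage-thresholding). Task t's reward is then
    r_t(state) = phi(state) @ (L @ s_t)."""
    s = np.zeros(L.shape[1])
    for _ in range(steps):
        grad = L.T @ (L @ s - theta_t)      # gradient of 0.5 * ||L s - theta||^2
        s = soft_threshold(s - lr * grad, lr * lam)
    return s
```

For an identity basis, the solution is simply the soft-thresholded target weights, which makes the sparsifying effect of the L1 penalty easy to see.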

8. Limitations and Prospects

Current DGCRL variants are evaluated exclusively in simulated domains. Scalability of the demonstration repository may require advanced retrieval and pruning (e.g., clustering-based indexing) for real-world application. Conventional forgetting metrics can yield misleading negative values; the development of more robust continual RL evaluation protocols is needed. DGCRL directly shapes sampled state-action distributions but does not address scenarios with evolving observation modalities or online feature learning. The open challenge remains to extend theoretical analysis, repository management, and demonstration-guided control to broader, high-dimensional, or safety-critical settings (Yang et al., 21 Dec 2025).
