
Residual Skill Policies (ReSkill)

Updated 2 February 2026
  • Residual Skill Policies (ReSkill) are a hierarchical reinforcement learning approach that integrates state-conditioned skill priors with residual controllers to enable efficient and adaptable robotic control.
  • The framework employs flow-based generative models and β-VAE skill embeddings to extract and refine temporally extended action sequences, overcoming limitations of undirected exploration.
  • Empirical evaluations on MuJoCo robotic tasks show that ReSkill learns 3–5× faster than baseline methods while reaching higher final rewards.

Residual Skill Policies (ReSkill) constitute a hierarchical reinforcement learning framework designed to enable efficient and adaptable skill-based control in robotic environments, particularly under significant domain shift between demonstration and downstream tasks. The ReSkill framework couples a state-conditioned skill prior—constructed via flow-based generative models on latent skill spaces—with a fine-grained residual action policy, enabling downstream RL agents to rapidly explore meaningful skills and adapt to distributional divergences or unmodeled task variations. This methodology addresses both the inefficiency of undirected skill-space exploration and the rigidity of task transfer when skill priors are learned from fixed demonstration repertoires (Rana et al., 2022).

1. Problem Setting and Hierarchical Skill Structure

ReSkill considers the downstream task as a finite-horizon Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where $\mathcal{S}\subseteq\mathbb{R}^n$ is the state space, $\mathcal{A}\subseteq\mathbb{R}^m$ is the atomic action space, $\mathcal{T}$ encodes transition probabilities, $R$ is the (potentially sparse) reward function, and $\gamma$ is the discount factor. "Skills" are defined as temporally extended action sequences of $H$ steps, $z\in\mathcal{Z}$, extracted from demonstrations (e.g., using classical controllers). A high-level RL policy $\pi_{\mathrm{HL}}(z\mid s)$ selects skill codes, each decoded to an action sequence $\{a_t\}_{t=0}^{H-1}$. The RL objective is

$$J(\pi_{\mathrm{HL}}) = \mathbb{E}_{\pi_{\mathrm{HL}}}\left[\sum_{k=0}^{K-1}\gamma^k \tilde r_k\right], \qquad \tilde r_k = \sum_{t=0}^{H-1} R(s_{kH+t}, a_{kH+t}),$$

where $K = T/H$ for a $T$-step episode. ReSkill introduces two extensions: (i) a state-conditioned skill prior $p(z\mid s)$, and (ii) a low-level residual policy $\pi_\delta(\delta a\mid s, z, a')$ that adapts each decoded action $a'$, reconnecting the high-level skill abstraction to the full atomic action space.
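The skill-level reward $\tilde r_k$ is just the per-step reward trace summed over non-overlapping $H$-step windows. A minimal sketch (the helper name is illustrative, not from the authors' code):

```python
import numpy as np

def skill_level_rewards(rewards, H):
    """Aggregate a T-step reward trace into K = T // H skill-level rewards."""
    rewards = np.asarray(rewards, dtype=float)
    K = len(rewards) // H
    # Each row of the reshaped array is one H-step skill execution.
    return rewards[:K * H].reshape(K, H).sum(axis=1)

r = skill_level_rewards([1.0] * 20, H=10)  # two skills, each accumulating 10.0
```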

2. State-Conditioned Generative Skill Model

2.1 VAE-Based Skill Embedding

A $\beta$-VAE is trained to encode demonstration skill trajectories into a continuous latent space $\mathcal{Z}\subseteq\mathbb{R}^{d_z}$. Given a dataset of skill sequences $\{(s_{t:t+H-1}, a_{t:t+H-1})\}$, the encoder $q_\phi(z\mid s,a)$ and decoder $p_\theta(a_t\mid z,s_t)$ are jointly optimized to minimize

$$\mathcal{L}_{\mathrm{embed}} = -\mathbb{E}_{q_\phi(z\mid s,a)}\left[\sum_{t=0}^{H-1}\log p_\theta(a_t\mid z, s_t)\right] + \beta\, D_{KL}\big(q_\phi(z\mid s,a)\,\|\,p(z)\big),$$

with $p(z) = \mathcal{N}(0, I)$, and the reconstruction term implemented as mean-squared error.
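With a diagonal-Gaussian encoder and an MSE reconstruction term, this loss has a simple closed form; the following is a hedged numerical sketch (function and argument names are assumptions, not the paper's code):

```python
import numpy as np

def embed_loss(a_true, a_recon, mu, log_std, beta):
    """Beta-VAE loss: MSE reconstruction + beta * KL(q(z|s,a) || N(0, I))."""
    recon = np.sum((a_true - a_recon) ** 2)  # -log p_theta up to a constant
    var = np.exp(2.0 * log_std)
    # Closed-form KL between N(mu, diag(var)) and the standard normal prior.
    kl = 0.5 * np.sum(mu ** 2 + var - 1.0 - 2.0 * log_std)
    return recon + beta * kl
```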

2.2 State-Conditioned Flow-Based Skill Prior

Latent sampling from a global prior $p(z) = \mathcal{N}(0, I)$ is inefficient, since most latent codes are irrelevant to the current state. A state-conditioned prior $p(z\mid s_0)$ is instead learned via a normalizing flow $f:\mathcal{Z}\times\mathcal{S}\to\mathbb{R}^{d_z}$, parameterizing

$$p(z\mid s_0) \propto p_\mathcal{G}\big(f(z, s_0)\big)\,\big|\det \nabla_z f(z, s_0)\big|.$$

Here $p_\mathcal{G}(g) = \mathcal{N}(0, I)$ is the Gaussian base distribution, and $z = f^{-1}(g, s_0)$ permits direct sampling. The flow is optimized via the negative log-likelihood loss

$$\mathcal{L}_{\mathrm{prior}} = -\log p_\mathcal{G}\big(f(z, s_0)\big) - \log\big|\det \nabla_z f(z, s_0)\big|.$$

The total skill-model loss, with gradients from $\mathcal{L}_{\mathrm{prior}}$ blocked from flowing into the VAE encoder for stability, is

$$\mathcal{L}_{\mathrm{skills}} = \mathcal{L}_{\mathrm{embed}} + \mathcal{L}_{\mathrm{prior}}.$$
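One state-conditioned affine-coupling layer (RealNVP style) can be sketched as follows. The scale and shift callables stand in for small MLPs that would take the fixed half of $z$ concatenated with $s_0$; all names here are illustrative assumptions:

```python
import numpy as np

def coupling_forward(z, s0, scale_fn, shift_fn):
    """One conditional affine-coupling step: transform half of z, keep the rest."""
    d = len(z) // 2
    z1, z2 = z[:d], z[d:]
    log_s = scale_fn(z1, s0)                     # conditioned on fixed half + state
    g2 = z2 * np.exp(log_s) + shift_fn(z1, s0)   # affine transform of second half
    log_det = np.sum(log_s)                      # this layer's log|det df/dz|
    return np.concatenate([z1, g2]), log_det
```

Because each coupling step is invertible with a triangular Jacobian, stacking four such layers (as in the paper) keeps both sampling via $f^{-1}$ and the log-determinant cheap.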

3. Residual Low-Level Policy for Fine-Grained Adaptation

After offline training of the VAE and flow (which are frozen thereafter), action selection during RL proceeds as follows:

  1. Observe state $s_t$.
  2. Sample $g\sim\pi_{\mathrm{HL}}(g\mid s_t)$.
  3. Compute the skill code $z = f^{-1}(g, s_t) \sim p(z\mid s_t)$.
  4. For each $h=0$ to $H-1$:
    • Decode the primitive action $a'_t$ from $p_\theta(a'\mid z, s_t)$.
    • Sample the residual $\delta a_t \sim \pi_\delta(\delta a\mid s_t, z, a'_t)$.
    • Execute $a_t = a'_t + \delta a_t$.
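The steps above can be sketched as a single rollout loop with stand-in callables for the frozen decoder, the inverse flow, and the two policies (all names are illustrative, not the authors' API):

```python
def rollout_skill(s, H, pi_hl, f_inv, decoder, pi_delta, env_step):
    """Execute one H-step skill: sample z via the flow, decode, add residuals."""
    g = pi_hl(s)                 # sample in the flow's Gaussian base space
    z = f_inv(g, s)              # map to a skill code z ~ p(z | s)
    total_reward = 0.0
    for _ in range(H):
        a_prime = decoder(z, s)                 # decoded primitive action
        a = a_prime + pi_delta(s, z, a_prime)   # residual correction
        s, r = env_step(a)
        total_reward += r
    return s, total_reward
```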

Both $\pi_{\mathrm{HL}}$ and $\pi_\delta$ are neural networks trained on-policy using PPO. The clipped PPO objective for each is

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\big(r_t(\theta)\hat A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\big)\right],$$

where $r_t(\theta)$ is the policy probability ratio and the policy input is $(s_t)$ for the high-level policy or $(s_t, z, a'_t)$ for the low-level residual policy.
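The clipped surrogate itself is a one-liner over batched ratios and advantages; a minimal sketch with $\epsilon = 0.2$ as in the paper's setup:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped PPO surrogate: min of unclipped and clipped importance terms."""
    ratio, adv = np.asarray(ratio, dtype=float), np.asarray(adv, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))
```

The `min` makes the objective pessimistic: ratios outside $[1-\epsilon, 1+\epsilon]$ cannot increase the surrogate, which limits the size of each policy update.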

4. Training Protocol and Architecture

4.1 Two-Stage Skill Module Learning

Stage 1 involves collecting 40,000 trajectories from classical controllers (push, pick-and-place, hook), sliced into $H$-step skills. Stage 2 jointly trains the VAE and flow via $\mathcal{L}_{\mathrm{skills}}$.

4.2 Hierarchical RL Fine-Tuning

Weights of the VAE and flow are frozen; $\pi_{\mathrm{HL}}$ and $\pi_\delta$ are initialized randomly. $\pi_{\mathrm{HL}}$ is trained alone for a 20,000-step warm-up; the residual controller is then introduced gradually with a sigmoid gating schedule

$$w(x) = \frac{1}{1+\exp(-k(x-C))},$$

with $C=10{,}000$ and $k=3\times10^{-4}$. The executed action is $a_t = a'_t + w(x)\,\delta a_t$. Both policies are then trained jointly with PPO (clip ratio $0.2$, $\gamma=0.99$, learning rate $3\times10^{-4}$).
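The gate is a plain logistic function of the step counter; a quick sketch with the constants given above (`residual_gate` is an illustrative name):

```python
import math

def residual_gate(x, C=10_000, k=3e-4):
    """Sigmoid schedule weighting the residual contribution at step x."""
    return 1.0 / (1.0 + math.exp(-k * (x - C)))

# At x = C the residual carries half weight; the gate rises toward 1 afterwards.
mid = residual_gate(10_000)
```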

Architectural specifications:

  • VAE encoder: 1× LSTM (input = state $\oplus$ action, hidden size 128), followed by an MLP ($128\rightarrow64\rightarrow2\times4$) producing the mean and log-std.
  • VAE decoder: MLP ($4\oplus$ state $\rightarrow128\rightarrow64\rightarrow$ action), tanh output.
  • Flow: 4 RealNVP affine-coupling layers, each conditioned on the initial state.
  • Policy networks: 2-layer MLPs ($[256, 256]$), ReLU activations.
  • Optimizers: Adam (learning rate $1\times10^{-4}$ for VAE/flow, $3\times10^{-4}$ for PPO).
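As a concreteness check, the encoder specification above can be transcribed into a PyTorch module. This is a hedged reconstruction from the listed sizes only (class and argument names are assumptions, and $d_z = 4$ per the hyperparameters):

```python
import torch
import torch.nn as nn

class SkillEncoder(nn.Module):
    """LSTM over (state ⊕ action) pairs, then an MLP head for mean/log-std."""
    def __init__(self, state_dim, action_dim, d_z=4, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 64), nn.ReLU(),
            nn.Linear(64, 2 * d_z),          # 2 * d_z outputs: mean and log-std
        )

    def forward(self, sa_seq):               # sa_seq: (B, H, state_dim + action_dim)
        out, _ = self.lstm(sa_seq)
        mu, log_std = self.head(out[:, -1]).chunk(2, dim=-1)
        return mu, log_std
```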

5. Empirical Evaluation and Results

ReSkill is evaluated on four MuJoCo Fetch-arm variants: Slippery Push (reduced friction), Table Cleanup (tray obstacle), Pyramid Stack (stacking on a higher block), and Complex Hook (random obstacles and hook usage). Baselines include scripted controllers, BC + fine-tuning, SAC/PPO from scratch, HAC, PARROT, and SPiRL, along with ablations of the skill prior and the residual controller.

| Configuration | Sample Efficiency | Asymptotic Performance |
|---|---|---|
| ReSkill | 3–5× faster than SPiRL | Highest reward |
| ReSkill w/o prior | Poor early exploration (<10%) | Stalled performance |
| ReSkill w/o residual | Stalls at suboptimal reward | Limited adaptation |
| Isotropic Gaussian prior | <1% meaningful skills | Fails to explore |

ReSkill exhibits >45% meaningful skill execution early in training with the state-conditioned prior, sharply contrasting with <1% under the isotropic Gaussian prior. Without the residual controller, performance plateaus when facing task variations requiring fine adaptation. Both components are confirmed necessary for optimal transfer and exploration.

6. Implementation Details

ReSkill is implemented in PyTorch with OpenAI Spinning Up's PPO. Batch sizes are 128 for VAE training and 64 for PPO. Key hyperparameters include skill horizon $H=10$, latent dimension $d_z=4$, and PPO clip ratio 0.2. All code is available through the project website.

7. Principal Insights and Theoretical Implications

State-conditioned skill priors substantially enhance exploration, biasing the agent toward contextually relevant skills (>45% of early trajectory steps). The residual controller adapts high-level skill abstractions to unseen environment perturbations—such as differing friction coefficients, unexpected obstacles, or shape/height changes—that were not present in the demonstrations. The bijective nature of the flow-based prior keeps the entire latent skill space accessible, ensuring no skill diversity is lost. The two-stage protocol (offline skill learning followed by online PPO) achieves sample efficiency and final-policy performance exceeding end-to-end methods and prior-based methods lacking structured, state-conditioned priors. In summary, Residual Skill Policies offer a generalizable, efficient approach for skill-based RL in robotics, uniting state conditioning and fine-grained adaptation within a unified action-space design (Rana et al., 2022).
