
Residual Skill Policies (ReSkill)

Updated 2 February 2026
  • Residual Skill Policies (ReSkill) are a hierarchical reinforcement learning approach that integrates state-conditioned skill priors with residual controllers to enable efficient and adaptable robotic control.
  • The framework employs flow-based generative models and β-VAE skill embeddings to extract and refine temporally extended action sequences, overcoming limitations of undirected exploration.
  • Empirical evaluations on MuJoCo robotic tasks show that ReSkill learns 3–5× faster than baseline methods while reaching higher final rewards.

Residual Skill Policies (ReSkill) constitute a hierarchical reinforcement learning framework designed to enable efficient and adaptable skill-based control in robotic environments, particularly under significant domain shift between demonstration and downstream tasks. The ReSkill framework couples a state-conditioned skill prior—constructed via flow-based generative models on latent skill spaces—with a fine-grained residual action policy, enabling downstream RL agents to rapidly explore meaningful skills and adapt to distributional divergences or unmodeled task variations. This methodology addresses both the inefficiency of undirected skill-space exploration and the rigidity of task transfer when skill priors are learned from fixed demonstration repertoires (Rana et al., 2022).

1. Problem Setting and Hierarchical Skill Structure

ReSkill considers the downstream task as a finite-horizon Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where $\mathcal{S}\subseteq\mathbb{R}^n$ is the state space, $\mathcal{A}\subseteq\mathbb{R}^m$ is the atomic action space, $\mathcal{T}$ encodes transition probabilities, $R$ is the (potentially sparse) reward function, and $\gamma$ is the discount factor. "Skills" are defined as temporally extended action sequences of $H$ steps, $z\in\mathcal{Z}$, extracted from demonstrations (e.g., using classical controllers). A high-level RL policy $\pi_{\mathrm{HL}}(z\mid s)$ selects skill codes, each decoded to an action sequence $\{a_t\}_{t=0}^{H-1}$. The RL objective is

$$J(\pi_{\mathrm{HL}}) = \mathbb{E}_{\pi_{\mathrm{HL}}}\left[\sum_{k=0}^{K-1}\gamma^k \tilde r_k\right], \qquad \tilde r_k = \sum_{t=0}^{H-1} R(s_{kH+t}, a_{kH+t}),$$

where $K = T/H$ for a $T$-step episode. ReSkill introduces two extensions: (i) a state-conditioned skill prior $p(z\mid s)$, and (ii) a low-level residual policy $\pi_\delta(\delta a\mid s, z, a')$ that adapts each decoded action $a'$, reconnecting the high-level skill abstraction to the full atomic action space.
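The skill-level reward $\tilde r_k$ is just the per-step reward trace summed over non-overlapping $H$-step windows. A minimal sketch (the helper name is illustrative, not from the authors' code):

```python
import numpy as np

def skill_level_rewards(rewards, H):
    """Aggregate a T-step reward trace into K = T // H skill-level rewards."""
    rewards = np.asarray(rewards, dtype=float)
    K = len(rewards) // H
    # Each row of the reshaped array is one H-step skill execution.
    return rewards[:K * H].reshape(K, H).sum(axis=1)

r = skill_level_rewards([1.0] * 20, H=10)  # two skills, each accumulating 10.0
```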

2. State-Conditioned Generative Skill Model

2.1 VAE-Based Skill Embedding

A $\beta$-VAE is trained to encode demonstration skill trajectories into a continuous latent space $\mathcal{Z}\subseteq\mathbb{R}^{d_z}$. Given a dataset of skill sequences $\{(s_{t:t+H-1}, a_{t:t+H-1})\}$, the encoder $q_\phi(z\mid s,a)$ and decoder $p_\theta(a_t\mid z,s_t)$ are jointly optimized to minimize

$$\mathcal{L}_{\mathrm{embed}} = -\mathbb{E}_{q_\phi(z\mid s,a)}\left[\sum_{t=0}^{H-1}\log p_\theta(a_t\mid z, s_t)\right] + \beta\, D_{KL}\big(q_\phi(z\mid s,a)\,\|\,p(z)\big),$$

with $p(z) = \mathcal{N}(0, I)$, and the reconstruction term implemented as mean-squared error.
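With a diagonal-Gaussian encoder and an MSE reconstruction term, this loss has a simple closed form; the following is a hedged numerical sketch (function and argument names are assumptions, not the paper's code):

```python
import numpy as np

def embed_loss(a_true, a_recon, mu, log_std, beta):
    """Beta-VAE loss: MSE reconstruction + beta * KL(q(z|s,a) || N(0, I))."""
    recon = np.sum((a_true - a_recon) ** 2)  # -log p_theta up to a constant
    var = np.exp(2.0 * log_std)
    # Closed-form KL between N(mu, diag(var)) and the standard normal prior.
    kl = 0.5 * np.sum(mu ** 2 + var - 1.0 - 2.0 * log_std)
    return recon + beta * kl
```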

2.2 State-Conditioned Flow-Based Skill Prior

Latent sampling from a global prior $p(z) = \mathcal{N}(0, I)$ is inefficient, since most latent codes are irrelevant to the current state. A state-conditioned prior $p(z\mid s_0)$ is instead learned via a normalizing flow $f:\mathcal{Z}\times\mathcal{S}\to\mathbb{R}^{d_z}$, parameterizing

$$p(z\mid s_0) \propto p_\mathcal{G}\big(f(z, s_0)\big)\,\big|\det \nabla_z f(z, s_0)\big|.$$

Here $p_\mathcal{G}(g) = \mathcal{N}(0, I)$ is the Gaussian base distribution, and $z = f^{-1}(g, s_0)$ permits direct sampling. The flow is optimized via the negative log-likelihood loss

$$\mathcal{L}_{\mathrm{prior}} = -\log p_\mathcal{G}\big(f(z, s_0)\big) - \log\big|\det \nabla_z f(z, s_0)\big|.$$

The total skill-model loss, with gradients from $\mathcal{L}_{\mathrm{prior}}$ blocked from flowing into the VAE encoder for stability, is

$$\mathcal{L}_{\mathrm{skills}} = \mathcal{L}_{\mathrm{embed}} + \mathcal{L}_{\mathrm{prior}}.$$
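One state-conditioned affine-coupling layer (RealNVP style) can be sketched as follows. The scale and shift callables stand in for small MLPs that would take the fixed half of $z$ concatenated with $s_0$; all names here are illustrative assumptions:

```python
import numpy as np

def coupling_forward(z, s0, scale_fn, shift_fn):
    """One conditional affine-coupling step: transform half of z, keep the rest."""
    d = len(z) // 2
    z1, z2 = z[:d], z[d:]
    log_s = scale_fn(z1, s0)                     # conditioned on fixed half + state
    g2 = z2 * np.exp(log_s) + shift_fn(z1, s0)   # affine transform of second half
    log_det = np.sum(log_s)                      # this layer's log|det df/dz|
    return np.concatenate([z1, g2]), log_det
```

Because each coupling step is invertible with a triangular Jacobian, stacking four such layers (as in the paper) keeps both sampling via $f^{-1}$ and the log-determinant cheap.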

3. Residual Low-Level Policy for Fine-Grained Adaptation

After offline training of the VAE and flow (which are frozen thereafter), action selection during RL proceeds as follows:

  1. Observe state $s_t$.
  2. Sample $g\sim\pi_{\mathrm{HL}}(g\mid s_t)$.
  3. Compute the skill code $z = f^{-1}(g, s_t) \sim p(z\mid s_t)$.
  4. For each $h=0$ to $H-1$:
    • Decode the primitive action $a'_t$ from $p_\theta(a'\mid z, s_t)$.
    • Sample the residual $\delta a_t \sim \pi_\delta(\delta a\mid s_t, z, a'_t)$.
    • Execute $a_t = a'_t + \delta a_t$.
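The steps above can be sketched as a single rollout loop with stand-in callables for the frozen decoder, the inverse flow, and the two policies (all names are illustrative, not the authors' API):

```python
def rollout_skill(s, H, pi_hl, f_inv, decoder, pi_delta, env_step):
    """Execute one H-step skill: sample z via the flow, decode, add residuals."""
    g = pi_hl(s)                 # sample in the flow's Gaussian base space
    z = f_inv(g, s)              # map to a skill code z ~ p(z | s)
    total_reward = 0.0
    for _ in range(H):
        a_prime = decoder(z, s)                 # decoded primitive action
        a = a_prime + pi_delta(s, z, a_prime)   # residual correction
        s, r = env_step(a)
        total_reward += r
    return s, total_reward
```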

Both $\pi_{\mathrm{HL}}$ and $\pi_\delta$ are neural networks trained on-policy using PPO. The clipped PPO objective for each is

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\big(r_t(\theta)\hat A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\big)\right],$$

where $r_t(\theta)$ is the policy probability ratio and the policy input is $(s_t)$ for the high-level policy or $(s_t, z, a'_t)$ for the low-level residual policy.
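The clipped surrogate itself is a one-liner over batched ratios and advantages; a minimal sketch with $\epsilon = 0.2$ as in the paper's setup:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped PPO surrogate: min of unclipped and clipped importance terms."""
    ratio, adv = np.asarray(ratio, dtype=float), np.asarray(adv, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))
```

The `min` makes the objective pessimistic: ratios outside $[1-\epsilon, 1+\epsilon]$ cannot increase the surrogate, which limits the size of each policy update.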

4. Training Protocol and Architecture

4.1 Two-Stage Skill Module Learning

Stage 1 involves collecting 40,000 trajectories from classical controllers (push, pick-and-place, hook), sliced into $H$-step skills. Stage 2 jointly trains the VAE and flow via $\mathcal{L}_{\mathrm{skills}}$.

4.2 Hierarchical RL Fine-Tuning

Weights of the VAE and flow are frozen; $\pi_{\mathrm{HL}}$ and $\pi_\delta$ are initialized randomly. $\pi_{\mathrm{HL}}$ is trained alone for a 20,000-step warm-up; the residual controller is then introduced gradually with a sigmoid gating schedule

$$w(x) = \frac{1}{1+\exp(-k(x-C))},$$

with $C=10{,}000$ and $k=3\times10^{-4}$. The executed action is $a_t = a'_t + w(x)\,\delta a_t$. Both policies are then trained jointly with PPO (clip ratio $0.2$, $\gamma=0.99$, learning rate $3\times10^{-4}$).
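The gate is a plain logistic function of the step counter; a quick sketch with the constants given above (`residual_gate` is an illustrative name):

```python
import math

def residual_gate(x, C=10_000, k=3e-4):
    """Sigmoid schedule weighting the residual contribution at step x."""
    return 1.0 / (1.0 + math.exp(-k * (x - C)))

# At x = C the residual carries half weight; the gate rises toward 1 afterwards.
mid = residual_gate(10_000)
```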

Architectural specifications:

  • VAE encoder: 1× LSTM (input = state $\oplus$ action, hidden size 128), followed by an MLP ($128\rightarrow64\rightarrow2\times4$) producing the mean and log-std.
  • VAE decoder: MLP ($4\oplus$ state $\rightarrow128\rightarrow64\rightarrow$ action), tanh output.
  • Flow: 4 RealNVP affine-coupling layers, each conditioned on the initial state.
  • Policy networks: 2-layer MLPs ($[256, 256]$), ReLU activations.
  • Optimizers: Adam (learning rate $1\times10^{-4}$ for VAE/flow, $3\times10^{-4}$ for PPO).
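As a concreteness check, the encoder specification above can be transcribed into a PyTorch module. This is a hedged reconstruction from the listed sizes only (class and argument names are assumptions, and $d_z = 4$ per the hyperparameters):

```python
import torch
import torch.nn as nn

class SkillEncoder(nn.Module):
    """LSTM over (state ⊕ action) pairs, then an MLP head for mean/log-std."""
    def __init__(self, state_dim, action_dim, d_z=4, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 64), nn.ReLU(),
            nn.Linear(64, 2 * d_z),          # 2 * d_z outputs: mean and log-std
        )

    def forward(self, sa_seq):               # sa_seq: (B, H, state_dim + action_dim)
        out, _ = self.lstm(sa_seq)
        mu, log_std = self.head(out[:, -1]).chunk(2, dim=-1)
        return mu, log_std
```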

5. Empirical Evaluation and Results

ReSkill is evaluated on four MuJoCo Fetch-arm variants: Slippery Push (reduced friction), Table Cleanup (tray obstacle), Pyramid Stack (stacking on a higher block), and Complex Hook (random obstacles and hook usage). Baselines include scripted controllers, BC + fine-tuning, SAC/PPO from scratch, HAC, PARROT, and SPiRL, along with ablations of the skill prior and the residual controller.

| Configuration | Sample Efficiency | Asymptotic Performance |
|---|---|---|
| ReSkill | 3–5× faster than SPiRL | Highest reward |
| ReSkill w/o prior | Poor early exploration (<10%) | Stalled performance |
| ReSkill w/o residual | Stalls at suboptimal reward | Limited adaptation |
| Isotropic Gaussian prior | <1% meaningful skills | Fails to explore |

ReSkill exhibits >45% meaningful skill execution early in training with the state-conditioned prior, sharply contrasting with <1% under the isotropic Gaussian prior. Without the residual controller, performance plateaus when facing task variations requiring fine adaptation. Both components are confirmed necessary for optimal transfer and exploration.

6. Implementation Details

ReSkill is implemented in PyTorch with OpenAI Spinning Up's PPO. Batch sizes are 128 for VAE training and 64 for PPO. Key hyperparameters include skill horizon $H=10$, latent dimension $d_z=4$, and PPO clip ratio 0.2. All code is available through the project website.

7. Principal Insights and Theoretical Implications

State-conditioned skill priors substantially enhance exploration, biasing the agent toward contextually relevant skills (>45% of early trajectory steps). The residual controller adapts high-level skill abstractions to unseen environment perturbations—such as differing friction coefficients, unexpected obstacles, or shape/height changes—that were not present in the demonstrations. The bijective nature of the flow-based prior keeps the entire latent skill space accessible, ensuring no skill diversity is lost. The two-stage protocol (offline skill learning followed by online PPO) achieves sample efficiency and final-policy performance exceeding end-to-end methods and prior-based methods lacking structured, state-conditioned priors. In summary, Residual Skill Policies offer a generalizable, efficient approach for skill-based RL in robotics, uniting state conditioning and fine-grained adaptation within a unified action-space design (Rana et al., 2022).
