Residual Skill Discovery in RL and Robotics
- Residual Skill Discovery is a framework that augments pre-trained skill representations with novel residual adaptations to bridge gaps between simulation and real-world tasks.
- It employs latent skill spaces via a β-VAE framework and state-conditioned priors to guide efficient exploration and hierarchical reinforcement learning.
- The residual adaptation mechanism refines skill behaviors in sim-to-real transfer, significantly enhancing sample efficiency and overall task performance.
Residual Skill Discovery refers to a family of methodologies for learning, adapting, and composing new skills in reinforcement learning (RL) and robotics scenarios by leveraging existing skill representations while addressing gaps between pre-trained knowledge and novel task demands or deployment realities. Central to these approaches is the use of latent skill spaces and residual adaptation mechanisms that enable agents to discover, refine, and implement skill behaviors that are not present in the initial demonstration or simulation distribution but are required for effective generalization to new environments, including sim-to-real transfer.
1. Latent Skill Spaces and State-Conditioned Priors
Residual skill discovery typically begins with the construction of a latent skill space. Demonstration data comprising state–action sequences of fixed horizon $H$ are embedded in a continuous latent domain via a β-VAE framework:
- Encoder $q_\phi(z \mid \tau)$: produces a latent vector $z$ for each demonstration trajectory $\tau$,
- Decoder $p_\theta(a \mid z, s)$: generates state-conditioned actions.
The embedding is optimized using the β-VAE loss:

$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z \mid \tau)}\big[\log p_\theta(a \mid z, s)\big] + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid \tau)\,\|\,\mathcal{N}(0, I)\big),$$

with $\beta$ weighting the KL regularizer. Sampling skills directly from the unit Gaussian prior $\mathcal{N}(0, I)$ proves inefficient, since only a small subset of skills is relevant in any given state. To address this, a state-conditioned skill prior $p_\psi(z \mid s)$ is learned via Real-NVP flow models, parameterizing skill distributions adaptively for each environment state. This mechanism dramatically accelerates early exploration: in the Slippery-Push task, a skill prior yielded 45.4% physical-interaction steps versus 0.56% for Gaussian noise (Rana et al., 2022).
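The core building block of a Real-NVP prior is the affine coupling layer; conditioning its scale and shift on the environment state is what makes the skill distribution state dependent. The sketch below is illustrative only (class and method names are hypothetical, and fixed random weights stand in for a trained conditioner network); it demonstrates the key structural property, exact invertibility:

```python
import numpy as np

rng = np.random.default_rng(0)

class StateConditionedCoupling:
    """Affine coupling layer: splits z into halves; the scale and shift
    applied to the second half are functions of the first half AND the
    state s, making the induced distribution state conditioned."""
    def __init__(self, dim, state_dim, hidden=16):
        self.d = dim // 2
        # Fixed random weights stand in for a trained conditioner net.
        self.W1 = rng.normal(size=(self.d + state_dim, hidden)) * 0.1
        self.W2 = rng.normal(size=(hidden, 2 * (dim - self.d))) * 0.1

    def _scale_shift(self, z1, s):
        h = np.tanh(np.concatenate([z1, s]) @ self.W1)
        log_scale, shift = np.split(h @ self.W2, 2)
        return np.tanh(log_scale), shift   # bounded log-scale for stability

    def forward(self, z, s):
        z1, z2 = z[:self.d], z[self.d:]
        log_s, t = self._scale_shift(z1, s)
        return np.concatenate([z1, z2 * np.exp(log_s) + t])

    def inverse(self, g, s):
        g1, g2 = g[:self.d], g[self.d:]
        log_s, t = self._scale_shift(g1, s)
        return np.concatenate([g1, (g2 - t) * np.exp(-log_s)])

layer = StateConditionedCoupling(dim=4, state_dim=3)
s = rng.normal(size=3)        # environment state
z = rng.normal(size=4)        # latent skill
g = layer.forward(z, s)
z_rec = layer.inverse(g, s)
print(np.allclose(z, z_rec))  # → True: the map is bijective given s
```

Because the transform is bijective for every fixed state, log-densities remain tractable and the same network can both sample skills and score them, which is what the hierarchical policy below relies on.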
2. Hierarchical Reinforcement Learning with Residual Adaptation
Upon skill-space pretraining, residual skill discovery employs a hierarchical RL framework comprising two levels:
- High-level policy $\pi_H(g \mid s)$ samples latent skill proposals for each state, mapped into the skill space via the bijective flow $z = f_\psi^{-1}(g; s)$.
- Low-level residual policy $\pi_R(\Delta a \mid s, z, a')$ computes fine-grained adaptations by outputting a residual action added to the decoder's output: $a_t = a' + \Delta a$, with $a' = p_\theta(z, s_t)$.
This enables the agent to “nudge” inferred skill trajectories into novel behaviors not present in the demonstrations, effectively bridging the train–test domain gap (e.g., adapting pushing skills on low friction surfaces or overcoming novel tray barriers) (Rana et al., 2022).
3. Optimization Objectives and Workflow
Training proceeds in two distinct stages:
- Stage I: Joint minimization of the embedding and flow-prior losses over demonstration data:

$$\mathcal{L}_{\text{I}} = \mathcal{L}_{\text{VAE}} - \mathbb{E}_{(s,\tau)}\big[\log p_\psi(z \mid s)\big],$$

where the second term fits the Real-NVP prior $p_\psi$ to the encoded skills.
- Stage II: Joint on-policy RL, typically using Proximal Policy Optimization (PPO), for both $\pi_H$ and $\pi_R$. Returns are segmented into skill blocks for high-level optimization and atomic transitions for residual adaptation. A gating function gradually introduces the residual pathway, transitioning from pure skill reuse to skill adaptation.
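The gating function in Stage II can be sketched as a simple warmup-then-ramp schedule. The shape and step counts below are illustrative assumptions, not taken from the source; the point is only the qualitative transition from pure skill reuse (gate at 0) to full residual adaptation (gate at 1):

```python
def residual_gate(step, warmup=50_000, ramp=100_000):
    """Illustrative gating schedule (assumed form, not from the paper):
    the residual correction is disabled during warmup, then ramps
    linearly from 0 to 1 over `ramp` environment steps."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)

# Gated action selection: a_t = a_prime + residual_gate(step) * delta_a
print(residual_gate(0))        # → 0.0  (pure skill reuse)
print(residual_gate(100_000))  # → 0.5  (partial adaptation)
print(residual_gate(500_000))  # → 1.0  (full residual pathway)
```

Annealing the residual in this way keeps early exploration anchored to the demonstrated skills, so the residual policy learns corrections on top of sensible behavior rather than fighting random actions.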
Pseudocode for action selection is as follows:
```
Inputs: state s_t, frozen modules f_psi (flow), p_theta (decoder),
        high-level policy pi_H, residual policy pi_R

g = sample(pi_H(. | s_t))                    # propose skill in flow space
z = f_psi^{-1}(g; s_t)                       # map to latent skill via bijective flow
for tau in range(H):                         # execute skill for horizon H
    a_prime = p_theta(z, s_t)                # decode nominal action
    delta_a = sample(pi_R(. | s_t, z, a_prime))  # residual correction
    a_t = a_prime + delta_a
    execute(a_t)                             # observe next state, reward
```
4. Residual Discovery in Sim-to-Real Transfer: Spectral Methods
Residual skill discovery has been extended to sim-to-real transfer using spectral representation learning (Ma et al., 2024). Starting with an MDP $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, the spectral decomposition of the transition kernel

$$P(s' \mid s, a) = \phi(s, a)^\top \mu(s')$$

yields a feature basis $\phi$ spanning all policy $Q$-functions under fixed dynamics.
Skill transfer leverages the basis $\phi_{\text{sim}}$ from the simulator. For the real world, the residual transition is spectrally decomposed via

$$P_{\text{real}}(s' \mid s, a) - P_{\text{sim}}(s' \mid s, a) = \phi_{\text{res}}(s, a)^\top \mu_{\text{res}}(s'),$$

with $\phi_{\text{res}}$ found by least-squares optimization under a constraint enforcing orthogonality between the novel features $\phi_{\text{res}}$ and the simulator basis $\phi_{\text{sim}}$:

$$\langle \phi_{\text{res}}, \phi_{\text{sim}} \rangle = 0.$$

This ensures newly discovered skills capture dynamics not representable by the simulator basis, directly filling the sim-to-real gap and yielding up to 30.2% improvement in quadrotor tracking performance (Ma et al., 2024).
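In a tabular setting the orthogonality-constrained least-squares step has a closed form: project the sim-to-real transition gap onto the orthogonal complement of the simulator basis, then take its top singular directions. The sketch below is illustrative (synthetic data, not the STEADY implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sa, n_s, k_sim, k_res = 20, 8, 3, 2   # |S x A|, |S'|, basis sizes (synthetic)

# Simulator basis Phi_sim (orthonormal columns) and a synthetic
# sim-to-real gap matrix representing P_real - P_sim.
Phi_sim, _ = np.linalg.qr(rng.normal(size=(n_sa, k_sim)))
gap = rng.normal(size=(n_sa, n_s)) * 0.1

# Project the gap onto the orthogonal complement of span(Phi_sim) ...
gap_perp = gap - Phi_sim @ (Phi_sim.T @ gap)

# ... and take the top singular directions as residual features:
# Phi_res @ Mu_res is the best rank-k_res fit to the projected gap.
U, S, Vt = np.linalg.svd(gap_perp, full_matrices=False)
Phi_res = U[:, :k_res]                  # novel features, orthogonal by construction
Mu_res = S[:k_res, None] * Vt[:k_res]   # matching next-state factors

print(np.abs(Phi_sim.T @ Phi_res).max() < 1e-10)  # → True
```

The projection guarantees the residual features carry no information already representable by the simulator basis, which is exactly the constraint $\langle \phi_{\text{res}}, \phi_{\text{sim}} \rangle = 0$ in this finite case.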
5. Policy Synthesis and Skill Composition
Following residual skill identification, policies are synthesized using the augmented feature vector $\phi(s, a) = [\phi_{\text{sim}}(s, a); \phi_{\text{res}}(s, a)]$. Value functions and policies are linearly parameterized in this expanded space:

$$Q(s, a) = w^\top \phi(s, a),$$

with policy actors often regularized via a KL penalty toward the base simulator policy, promoting retention of stable behaviors while enabling new skill applications:

$$\max_\pi \; \mathbb{E}_\pi\big[Q(s, a)\big] - \lambda\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_{\text{sim}}(\cdot \mid s)\big).$$
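A minimal sketch of the linear parameterization on the augmented features (dimensions and the least-squares fit are illustrative assumptions; in practice $w$ is fit to Bellman targets rather than noiseless values):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k_sim, k_res = 200, 3, 2             # samples and feature sizes (synthetic)

# Augmented feature vector phi = [phi_sim; phi_res] per (s, a) sample.
phi_sim = rng.normal(size=(n, k_sim))
phi_res = rng.normal(size=(n, k_res))
phi = np.hstack([phi_sim, phi_res])

# Synthetic Q-targets generated by a known weight vector.
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
q_targets = phi @ w_true

# Linear parameterization Q(s, a) = w^T phi(s, a), fit by least squares.
w, *_ = np.linalg.lstsq(phi, q_targets, rcond=None)
print(np.allclose(w, w_true))           # → True
```

Because the residual features are orthogonal to the simulator basis, the weights on $\phi_{\text{sim}}$ and $\phi_{\text{res}}$ separate cleanly: the former encode transferred value structure, the latter the corrections for real-world dynamics.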
6. Empirical Performance and Adaptation Capabilities
Residual skill discovery methods have demonstrated superior sample efficiency and adaptability in both simulated and real-world robotics settings. On sparse MuJoCo manipulation tasks (Slippery-Push, Table-Cleanup, Pyramid-Stack, Complex-Hook), ReSkill converged to greater than 90% success, outperforming conventional RL and prior skill-based methods which either failed entirely or saturated at <60% success due to lack of adaptation (Rana et al., 2022). In sim-to-real quadrotor control, STEADY achieved up to 30.2% reduction in trajectory tracking error relative to zero-shot transfer and 11.9% improvement over skill transfer without residual discovery (Ma et al., 2024).
Ablations confirm the crucial roles of both state-conditioned priors and residual pathways:
- Removal of skill prior slows early exploration by 4–10×.
- Omission of the residual policy sharply reduces final task performance.
7. Conceptual Foundations and Scope
Residual skill discovery is characterized by:
- Discovery of New Skill Variations: The residual pathway enables adaptation beyond the pretraining distribution, systematically uncovering new skill behaviors.
- State-Conditioned Discovery: Skill relevance and sampling are tailored to local context via state-dependent priors or spectral features.
- Sample-Efficient Exploration and Robustness: Compact latent spaces and adaptive composition promote fast exploration and reliable generalization.
This paradigm resolves critical challenges in hierarchical RL and sim-to-real transfer by balancing prior-knowledge reuse with principled online adaptation. A plausible implication is that residual skill discovery constitutes an essential mechanism for scalable deployment of autonomous agents under distribution shift and task novelty.