Residual Skill Discovery in RL and Robotics
- Residual Skill Discovery is a framework that augments pre-trained skill representations with novel residual adaptations to bridge gaps between simulation and real-world tasks.
- It employs latent skill spaces via a β-VAE framework and state-conditioned priors to guide efficient exploration and hierarchical reinforcement learning.
- The residual adaptation mechanism refines skill behaviors in sim-to-real transfer, significantly enhancing sample efficiency and overall task performance.
Residual Skill Discovery refers to a family of methodologies for learning, adapting, and composing new skills in reinforcement learning (RL) and robotics scenarios by leveraging existing skill representations while addressing gaps between pre-trained knowledge and novel task demands or deployment realities. Central to these approaches is the use of latent skill spaces and residual adaptation mechanisms that enable agents to discover, refine, and implement skill behaviors that are not present in the initial demonstration or simulation distribution but are required for effective generalization to new environments, including sim-to-real transfer.
1. Latent Skill Spaces and State-Conditioned Priors
Residual skill discovery typically begins with the construction of a latent skill space. Demonstration data comprising state–action sequences of fixed horizon $H$ are embedded in a continuous latent domain via a β-VAE framework:
- Encoder $q_\phi(z \mid \tau)$: produces a latent vector $z$ for each demonstration trajectory $\tau$,
- Decoder $p_\theta(a \mid z, s)$: generates state-conditioned actions.
The embedding is optimized using the β-VAE loss:

$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z \mid \tau)}\big[\log p_\theta(a \mid z, s)\big] + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid \tau)\,\|\,\mathcal{N}(0, I)\big),$$

with $\beta$ weighting the KL regularizer. Sampling skills directly from the unit Gaussian prior $\mathcal{N}(0, I)$ proves inefficient, since only a small subset of skills is relevant in any given state. To address this, a state-conditioned skill prior $p_\psi(z \mid s)$ is learned via Real-NVP flow models, parameterizing skill distributions adaptively for each environment state. This mechanism dramatically accelerates early exploration: in the Slippery-Push task, a skill prior yielded 45.4% physical-interaction steps versus 0.56% for Gaussian noise (Rana et al., 2022).
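The core building block of a Real-NVP prior is the affine coupling layer; conditioning its scale and shift on the environment state is what makes the skill distribution state dependent. The sketch below is illustrative only (class and method names are hypothetical, and fixed random weights stand in for a trained conditioner network); it demonstrates the key structural property, exact invertibility:

```python
import numpy as np

rng = np.random.default_rng(0)

class StateConditionedCoupling:
    """Affine coupling layer: splits z into halves; the scale and shift
    applied to the second half are functions of the first half AND the
    state s, making the induced distribution state conditioned."""
    def __init__(self, dim, state_dim, hidden=16):
        self.d = dim // 2
        # Fixed random weights stand in for a trained conditioner net.
        self.W1 = rng.normal(size=(self.d + state_dim, hidden)) * 0.1
        self.W2 = rng.normal(size=(hidden, 2 * (dim - self.d))) * 0.1

    def _scale_shift(self, z1, s):
        h = np.tanh(np.concatenate([z1, s]) @ self.W1)
        log_scale, shift = np.split(h @ self.W2, 2)
        return np.tanh(log_scale), shift   # bounded log-scale for stability

    def forward(self, z, s):
        z1, z2 = z[:self.d], z[self.d:]
        log_s, t = self._scale_shift(z1, s)
        return np.concatenate([z1, z2 * np.exp(log_s) + t])

    def inverse(self, g, s):
        g1, g2 = g[:self.d], g[self.d:]
        log_s, t = self._scale_shift(g1, s)
        return np.concatenate([g1, (g2 - t) * np.exp(-log_s)])

layer = StateConditionedCoupling(dim=4, state_dim=3)
s = rng.normal(size=3)        # environment state
z = rng.normal(size=4)        # latent skill
g = layer.forward(z, s)
z_rec = layer.inverse(g, s)
print(np.allclose(z, z_rec))  # → True: the map is bijective given s
```

Because the transform is bijective for every fixed state, log-densities remain tractable and the same network can both sample skills and score them, which is what the hierarchical policy below relies on.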
2. Hierarchical Reinforcement Learning with Residual Adaptation
Upon skill-space pretraining, residual skill discovery employs a hierarchical RL framework comprising two levels:
- High-level policy $\pi_H(g \mid s)$ samples latent skill proposals for each state, mapped into the skill space via the bijective flow $z = f_\psi^{-1}(g; s)$.
- Low-level residual policy $\pi_R(\Delta a \mid s, z, a')$ computes fine-grained adaptations by outputting a residual action added to the decoder's output: $a_t = a' + \Delta a$, with $a' = p_\theta(z, s_t)$.
This enables the agent to “nudge” inferred skill trajectories into novel behaviors not present in the demonstrations, effectively bridging the train–test domain gap (e.g., adapting pushing skills on low friction surfaces or overcoming novel tray barriers) (Rana et al., 2022).
3. Optimization Objectives and Workflow
Training proceeds in two distinct stages:
- Stage I: Joint minimization of the embedding and flow-prior losses over demonstration data:

$$\mathcal{L}_{\text{I}} = \mathcal{L}_{\text{VAE}} - \mathbb{E}_{(s,\tau)}\big[\log p_\psi(z \mid s)\big],$$

where the second term fits the Real-NVP prior $p_\psi$ to the encoded skills.
- Stage II: Joint on-policy RL, typically using Proximal Policy Optimization (PPO), for both $\pi_H$ and $\pi_R$. Returns are segmented into skill blocks for high-level optimization and atomic transitions for residual adaptation. A gating function gradually introduces the residual pathway, transitioning from pure skill reuse to skill adaptation.
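The gating function in Stage II can be sketched as a simple warmup-then-ramp schedule. The shape and step counts below are illustrative assumptions, not taken from the source; the point is only the qualitative transition from pure skill reuse (gate at 0) to full residual adaptation (gate at 1):

```python
def residual_gate(step, warmup=50_000, ramp=100_000):
    """Illustrative gating schedule (assumed form, not from the paper):
    the residual correction is disabled during warmup, then ramps
    linearly from 0 to 1 over `ramp` environment steps."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)

# Gated action selection: a_t = a_prime + residual_gate(step) * delta_a
print(residual_gate(0))        # → 0.0  (pure skill reuse)
print(residual_gate(100_000))  # → 0.5  (partial adaptation)
print(residual_gate(500_000))  # → 1.0  (full residual pathway)
```

Annealing the residual in this way keeps early exploration anchored to the demonstrated skills, so the residual policy learns corrections on top of sensible behavior rather than fighting random actions.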
Pseudocode for action selection is as follows:
```
Inputs: state s_t, frozen modules f_psi (flow), p_theta (decoder),
        high-level policy pi_H, residual policy pi_R

g = sample(pi_H(. | s_t))                    # propose skill in flow space
z = f_psi^{-1}(g; s_t)                       # map to latent skill via bijective flow
for tau in range(H):                         # execute skill for horizon H
    a_prime = p_theta(z, s_t)                # decode nominal action
    delta_a = sample(pi_R(. | s_t, z, a_prime))  # residual correction
    a_t = a_prime + delta_a
    execute(a_t)                             # observe next state, reward
```
4. Residual Discovery in Sim-to-Real Transfer: Spectral Methods
Residual skill discovery has been extended to sim-to-real transfer using spectral representation learning (Ma et al., 2024). Starting with an MDP $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, the spectral decomposition of the transition kernel

$$P(s' \mid s, a) = \phi(s, a)^\top \mu(s')$$

yields a feature basis $\phi$ spanning all policy $Q$-functions under fixed dynamics.
Skill transfer leverages the basis $\phi_{\text{sim}}$ from the simulator. For the real world, the residual transition is spectrally decomposed via

$$P_{\text{real}}(s' \mid s, a) - P_{\text{sim}}(s' \mid s, a) = \phi_{\text{res}}(s, a)^\top \mu_{\text{res}}(s'),$$

with $\phi_{\text{res}}$ found by least-squares optimization under a constraint enforcing orthogonality between the novel features $\phi_{\text{res}}$ and the simulator basis $\phi_{\text{sim}}$:

$$\langle \phi_{\text{res}}, \phi_{\text{sim}} \rangle = 0.$$

This ensures newly discovered skills capture dynamics not representable by the simulator basis, directly filling the sim-to-real gap and yielding up to 30.2% improvement in quadrotor tracking performance (Ma et al., 2024).
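In a tabular setting the orthogonality-constrained least-squares step has a closed form: project the sim-to-real transition gap onto the orthogonal complement of the simulator basis, then take its top singular directions. The sketch below is illustrative (synthetic data, not the STEADY implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sa, n_s, k_sim, k_res = 20, 8, 3, 2   # |S x A|, |S'|, basis sizes (synthetic)

# Simulator basis Phi_sim (orthonormal columns) and a synthetic
# sim-to-real gap matrix representing P_real - P_sim.
Phi_sim, _ = np.linalg.qr(rng.normal(size=(n_sa, k_sim)))
gap = rng.normal(size=(n_sa, n_s)) * 0.1

# Project the gap onto the orthogonal complement of span(Phi_sim) ...
gap_perp = gap - Phi_sim @ (Phi_sim.T @ gap)

# ... and take the top singular directions as residual features:
# Phi_res @ Mu_res is the best rank-k_res fit to the projected gap.
U, S, Vt = np.linalg.svd(gap_perp, full_matrices=False)
Phi_res = U[:, :k_res]                  # novel features, orthogonal by construction
Mu_res = S[:k_res, None] * Vt[:k_res]   # matching next-state factors

print(np.abs(Phi_sim.T @ Phi_res).max() < 1e-10)  # → True
```

The projection guarantees the residual features carry no information already representable by the simulator basis, which is exactly the constraint $\langle \phi_{\text{res}}, \phi_{\text{sim}} \rangle = 0$ in this finite case.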
5. Policy Synthesis and Skill Composition
Following residual skill identification, policies are synthesized using the augmented feature vector $\phi(s, a) = [\phi_{\text{sim}}(s, a); \phi_{\text{res}}(s, a)]$. Value functions and policies are linearly parameterized in this expanded space:

$$Q(s, a) = w^\top \phi(s, a),$$

with policy actors often regularized via a KL penalty toward the base simulator policy, promoting retention of stable behaviors while enabling new skill applications:

$$\max_\pi \; \mathbb{E}_\pi\big[Q(s, a)\big] - \lambda\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_{\text{sim}}(\cdot \mid s)\big).$$
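A minimal sketch of the linear parameterization on the augmented features (dimensions and the least-squares fit are illustrative assumptions; in practice $w$ is fit to Bellman targets rather than noiseless values):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k_sim, k_res = 200, 3, 2             # samples and feature sizes (synthetic)

# Augmented feature vector phi = [phi_sim; phi_res] per (s, a) sample.
phi_sim = rng.normal(size=(n, k_sim))
phi_res = rng.normal(size=(n, k_res))
phi = np.hstack([phi_sim, phi_res])

# Synthetic Q-targets generated by a known weight vector.
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
q_targets = phi @ w_true

# Linear parameterization Q(s, a) = w^T phi(s, a), fit by least squares.
w, *_ = np.linalg.lstsq(phi, q_targets, rcond=None)
print(np.allclose(w, w_true))           # → True
```

Because the residual features are orthogonal to the simulator basis, the weights on $\phi_{\text{sim}}$ and $\phi_{\text{res}}$ separate cleanly: the former encode transferred value structure, the latter the corrections for real-world dynamics.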
6. Empirical Performance and Adaptation Capabilities
Residual skill discovery methods have demonstrated superior sample efficiency and adaptability in both simulated and real-world robotics settings. On sparse MuJoCo manipulation tasks (Slippery-Push, Table-Cleanup, Pyramid-Stack, Complex-Hook), ReSkill converged to greater than 90% success, outperforming conventional RL and prior skill-based methods which either failed entirely or saturated at <60% success due to lack of adaptation (Rana et al., 2022). In sim-to-real quadrotor control, STEADY achieved up to 30.2% reduction in trajectory tracking error relative to zero-shot transfer and 11.9% improvement over skill transfer without residual discovery (Ma et al., 2024).
Ablations confirm the crucial roles of both state-conditioned priors and residual pathways:
- Removal of skill prior slows early exploration by 4–10×.
- Omission of the residual policy sharply reduces final task performance.
7. Conceptual Foundations and Scope
Residual skill discovery is characterized by:
- Discovery of New Skill Variations: The residual pathway enables adaptation beyond the pretraining distribution, systematically uncovering new skill behaviors.
- State-Conditioned Discovery: Skill relevance and sampling are tailored to local context via state-dependent priors or spectral features.
- Sample-Efficient Exploration and Robustness: Compact latent spaces and adaptive composition promote fast exploration and reliable generalization.
This paradigm resolves critical challenges in hierarchical RL and sim-to-real transfer by balancing prior-knowledge reuse with principled online adaptation. A plausible implication is that residual skill discovery constitutes an essential mechanism for scalable deployment of autonomous agents under distribution shift and task novelty.