
Skill-MDP in Hierarchical Reinforcement Learning

Updated 9 February 2026
  • Skill-MDP is a formal extension of Markov Decision Processes that integrates temporally extended actions (skills) for hierarchical planning and control.
  • It employs discrete or continuous skill abstractions using variational inference and diffusion models to learn interpretable, modular skills.
  • Skill-MDPs enable efficient planning and multi-agent coordination by bridging low-level primitive control with high-level strategic reasoning.

A Skill-MDP is a Markov Decision Process augmented to formalize and leverage temporally-extended actions—known as skills, options, or macro-actions—for tractable planning, sample efficiency, skill compositionality, and interpretable control in reinforcement learning and imitation settings. The Skill-MDP abstraction underpins contemporary hierarchical learning algorithms, bridging low-level primitive control and high-level strategic reasoning via learned or discovered skill policies, skill dynamics models, and often multi-level reward or intrinsic motivation signals.

1. Formal Definition and Mathematical Structure

A Skill-MDP extends the standard MDP or semi-MDP formalism with a discrete or continuous set of skills (macro-actions), each associated with its own policy, initiation set, potentially stochastic duration, and termination rule. Let M = (S, A, P, R, \gamma) denote a base MDP. A Skill-MDP comprises:

  • State Space (S): The standard set of environment states, ranging from low-dimensional coordinates to high-dimensional embeddings of images or multimodal sensory input, e.g., S \subset \mathbb{R}^I after visual encoding (Liang et al., 2023), or agent pose/object states in control tasks (Shi et al., 2022).
  • Skill Space (K or \mathcal{K}): A (typically finite) set of temporally extended actions, \mathcal{K} = \{\sigma_1, ..., \sigma_m\}, or a learned codebook \mathcal{C} = \{z^1, ..., z^K\} (Liang et al., 2023, Gu et al., 5 Jan 2026). Each skill \sigma or z^k executes a policy \pi_\sigma(a|s) or generates a sequence of actions over a horizon H.
  • Transition Dynamics (P(s'|s, k)): Defines the state distribution after executing skill k from state s. In data-driven implementations this is often approximated by a skill-conditioned latent dynamics model or, for trajectory generation, by a conditional diffusion process (Liang et al., 2023, Shi et al., 2022, Gu et al., 5 Jan 2026).
  • Reward Function (R(s, k)): Quantifies the immediate or cumulative reward for selecting skill k in state s. In imitation-based settings this may be implicit, evaluated only via task success rates (Liang et al., 2023), while in model-based RL settings rewards can be estimated over the skill execution window (Shi et al., 2022).
  • Discount Factor (\gamma): Governs temporal credit assignment; Skill-MDPs may either inherit the base MDP's \gamma or use \gamma = 1 for finite-horizon macro-timesteps (Liang et al., 2023).

Formally, when skills are interpreted as options, the induced semi-MDP's Bellman equation for a skill policy \mu is:

V^{\mu, \Sigma}(s) = \sum_{\sigma} \mathbf{1}\{\mu(s)=\sigma\} \left[ R_{\Sigma}(s, \sigma) + \gamma \, \mathbb{E}_{s' \sim P_{\Sigma}(\cdot|s, \sigma)} V^{\mu, \Sigma}(s') \right]

where R_\Sigma and P_\Sigma are the skill-level cumulative reward and transition kernel (Mankowitz et al., 2015).
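The skill-level Bellman equation above can be evaluated directly on a small tabular Skill-MDP. The sketch below assumes the skill-level reward R_\Sigma and transition kernel P_\Sigma are already given as arrays; the function names and array layout are illustrative, not from any cited paper:

```python
import numpy as np

def evaluate_skill_policy(mu, R, P, gamma=0.95, tol=1e-8):
    """Iterative policy evaluation at the skill level of a tabular Skill-MDP.

    mu[s]      -- index of the skill selected in state s (deterministic mu)
    R[s, k]    -- skill-level cumulative reward R_Sigma(s, k)
    P[s, k, :] -- skill-level transition kernel P_Sigma(s' | s, k)
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman backup restricted to the skill chosen by mu in each state
        V_new = np.array([
            R[s, mu[s]] + gamma * P[s, mu[s]] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because each backup looks one *skill* ahead rather than one primitive step ahead, the same fixed-point iteration converges over macro-timesteps, which is the source of the planning speedup that Skill-MDPs provide.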

2. Skill Abstraction, Representation, and Learning

Skill discovery and representation are central to Skill-MDP algorithms. Abstractions range from hand-coded macro-actions to data-driven learned skills:

  • Discrete Skill Codebooks: Skills are represented as tokens in a finite codebook constructed via vector quantization on continuous latent codes, ensuring semantic interpretability (Liang et al., 2023, Gu et al., 5 Jan 2026).
  • Variational Skill Encoders: Conditional variational autoencoders (VAEs) or Bayesian nonparametric models infer a set of latent skills from offline demonstration or RL trajectories, with the option to dynamically infer the number of skills using Dirichlet Process or stick-breaking priors (Villecroze et al., 2022).
  • Parameterization: Each skill is associated with its own intra-skill policy (e.g., a neural network over primitive actions), an initiation set (states in which it may be executed), and a termination rule (Mankowitz et al., 2015). For language-conditioned tasks, both visual and instruction embeddings are fused to produce continuous skill codes, quantized to tokens (Liang et al., 2023, Gu et al., 5 Jan 2026).
  • Interpretability and Modularity: A small K, codebook regularization, and discrete skill quantization promote interpretability (e.g., skills mapped to human-understandable actions like "pull drawer") (Liang et al., 2023) and support modular composition and reuse.
  • Skill Learning Objectives: Objectives typically combine reconstruction or imitation losses at the action/trajectory level with regularization (e.g., VQ losses (Liang et al., 2023), KL divergence on VAEs (Shi et al., 2022), or mutual information-based intrinsic rewards (Yang et al., 2019)) to ensure that skills are both efficient for planning and decodable from their trajectories.
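The vector-quantization step behind discrete skill codebooks can be sketched in a few lines. This is a generic VQ nearest-neighbor lookup with the standard codebook/commitment losses, not the specific implementation of any cited method; the 0.25 commitment weight is a common default, assumed here:

```python
import numpy as np

def quantize_skill(z_e, codebook, beta=0.25):
    """Map a continuous skill code z_e to its nearest codebook token.

    z_e      -- (d,) continuous encoder output
    codebook -- (K, d) learnable skill embeddings
    Returns (token index, quantized vector, scalar VQ loss).
    """
    dists = np.sum((codebook - z_e) ** 2, axis=1)   # squared L2 to each token
    k = int(np.argmin(dists))
    z_q = codebook[k]
    # VQ objective: codebook loss pulls the selected embedding toward the
    # encoder output; commitment loss pulls the encoder toward the codebook
    # (in a framework with autodiff, each term has a stop-gradient on one side)
    codebook_loss = np.sum((z_q - z_e) ** 2)
    commitment_loss = np.sum((z_e - z_q) ** 2)
    return k, z_q, codebook_loss + beta * commitment_loss
```

The returned token index k is what the high-level policy or planner manipulates; the quantized vector z_q is what conditions the low-level decoder or diffusion model.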

3. Planning and Policy Optimization in Skill Space

Skill-MDPs enable the construction of hierarchical planners and policies that operate at the skill level:

  • Hierarchical Execution: At inference time, a high-level policy selects the next skill every H steps; the low-level system executes the corresponding skill policy for H steps before returning control to the planner (Shi et al., 2022, Liang et al., 2023, 2022.04675).
  • Planning Algorithms: Imaginary rollouts in skill space are facilitated by learned skill dynamics models, enabling model-predictive control or sampling-based planners (e.g., Cross-Entropy Method in skill-space) to select optimal sequences of skills (Shi et al., 2022).
  • Diffusion-Based Trajectory Generation: Recent methods inject skill embeddings as conditioning variables into diffusion models (U-Net architectures), generating entire sequences of future states consistent with a skill (Liang et al., 2023, Gu et al., 5 Jan 2026).
  • Multi-Agent and Cooperative Skill-MDPs: In multi-agent scenarios, Skill-MDPs enable coordination via high-level skill assignment and low-level skill execution, with agents alternating between learning decodable, distinct skills and joint cooperative policies (Yang et al., 2019).
  • Policy Optimization: Skill-level policies can be optimized via RL losses defined on the SMDP (e.g., temporal difference targets at the skill timescale), with gradients propagated through learned skill abstractions and skill transition models (Shi et al., 2022, Liang et al., 2023).
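A sampling-based planner in skill space, such as the Cross-Entropy Method mentioned above, can be sketched as follows. The `dynamics` and `reward` callables stand in for learned skill-level models and are assumptions of this sketch, as are all hyperparameter defaults:

```python
import numpy as np

def cem_plan_skills(s0, dynamics, reward, n_skills, horizon=4,
                    n_samples=200, n_elites=20, n_iters=5, seed=0):
    """Cross-Entropy Method over sequences of discrete skills.

    dynamics(s, k) -> s'    -- skill-level transition model (assumed given)
    reward(s, k)   -> float -- skill-level reward model (assumed given)
    Returns the first skill of the best sequence found (MPC-style).
    """
    rng = np.random.default_rng(seed)
    probs = np.full((horizon, n_skills), 1.0 / n_skills)  # per-step categorical
    best_seq, best_ret = None, -np.inf
    for _ in range(n_iters):
        seqs = np.stack([
            [rng.choice(n_skills, p=probs[t]) for t in range(horizon)]
            for _ in range(n_samples)
        ])
        returns = np.empty(n_samples)
        for i, seq in enumerate(seqs):
            s, ret = s0, 0.0
            for k in seq:                 # imaginary rollout in skill space
                ret += reward(s, k)
                s = dynamics(s, k)
            returns[i] = ret
        if returns.max() > best_ret:
            best_ret = returns.max()
            best_seq = seqs[np.argmax(returns)]
        # refit the sampling distribution to the elite sequences
        elites = seqs[np.argsort(returns)[-n_elites:]]
        for t in range(horizon):
            counts = np.bincount(elites[:, t], minlength=n_skills)
            probs[t] = (counts + 1e-3) / (counts + 1e-3).sum()
    return int(best_seq[0])
```

Because each rollout step is one skill rather than one primitive action, a horizon of 4 here can cover hundreds of environment steps, which is why planning in skill space scales where primitive-level MPC does not.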

4. Intrinsic Motivation, Goal-Conditioning, and Open-Ended Skill Acquisition

Skill-MDPs naturally capture agents acting under intrinsic or competence-based rewards, leading to the autonomous formation of skill repertoires:

  • Autotelic Learning: Agents sample self-generated goals from a latent space and learn goal-conditioned policies, with intrinsic rewards that quantify novelty (model prediction error) and competence progress (improvement toward goal achievement) (Srivastava et al., 6 Feb 2025).
  • Goal-Conditioned Policies: The policy \pi(a|s,g) conditions actions on both current state and a goal g sampled from a generator, often leveraging contrastive or distance-based reward signals in embedding spaces (Srivastava et al., 6 Feb 2025).
  • Skill-GMDP Formulations: Skill-MDPs in open-ended exploration settings augment the MDP with a goal space G, an intrinsic reward function r_i, and a curriculum or meta-loop that adapts the goal-sampling distribution to maximize skill acquisition and coverage (Srivastava et al., 6 Feb 2025).
  • Taxonomy of Exploration: Approaches include random goal exploration, curriculum-driven, or competence-based sampling, with evaluation grounded in achieved goal diversity, generalization to unseen targets, and robustness metrics (Srivastava et al., 6 Feb 2025).
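Competence-based goal sampling, one of the exploration strategies listed above, can be sketched with a simple learning-progress heuristic. The bookkeeping format (a dict of per-goal success histories) and the window size are assumptions of this sketch, not from the cited work:

```python
import numpy as np

def learning_progress_weights(success_history, window=10):
    """Competence-progress goal sampling: a common autotelic heuristic.

    success_history -- dict: goal id -> list of 0/1 outcomes (assumed log)
    Returns (sorted goal ids, categorical distribution) favouring goals
    whose recent success rate changed most (absolute learning progress).
    """
    goals = sorted(success_history)
    lp = np.empty(len(goals))
    for i, g in enumerate(goals):
        h = success_history[g]
        recent = np.mean(h[-window:]) if h else 0.0
        earlier = np.mean(h[-2 * window:-window]) if len(h) > window else 0.0
        lp[i] = abs(recent - earlier)      # |competence progress| per goal
    if lp.sum() == 0:
        # no signal yet: fall back to uniform goal exploration
        return goals, np.full(len(goals), 1.0 / len(goals))
    return goals, lp / lp.sum()
```

Sampling goals in proportion to |learning progress| concentrates practice on goals at the frontier of the agent's competence, while mastered and currently unlearnable goals both receive low weight.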

5. Training Algorithms, Theoretical Guarantees, and Practical Deployments

Skill-MDP-based methods implement a range of training workflows, often underpinned by convergence guarantees or specific empirical strategies:

  • Bootstrapping via Skill-MDPs: Iterative algorithms (e.g., Learning Skills via Bootstrapping, LSB) alternate between policy evaluation and local skill learning within subregions of the state space, contracting policy approximation error with each iteration and yielding explicit near-optimality guarantees in terms of the number of skills m, the per-skill error \eta_\mathcal{P}, and the partition granularity (Mankowitz et al., 2015).
  • Variational Inference for Skill Discovery: Structured approximate posterior distributions, continuous relaxations (Gumbel-Softmax), and nonparametric priors allow for flexible and scalable skill inference from offline trajectories, removing hand-tuning of skill cardinality (Villecroze et al., 2022).
  • Hierarchical Losses and Optimization: Combined loss landscapes, encompassing skill abstraction (VQ or variational losses), diffusion model denoising error, and inverse dynamics imitation, enable end-to-end differentiable training (Liang et al., 2023, Gu et al., 5 Jan 2026).
  • Evaluation: Performance is typically measured via task success rate on compositional instructions, generalization to held-out tasks, or sample efficiency (environment interactions needed to reach a fixed subtask count), with modern Skill-MDP algorithms achieving state-of-the-art results on benchmarks such as LOReL Sawyer, Meta-World MT10, and CALVIN (Liang et al., 2023, Shi et al., 2022, Gu et al., 5 Jan 2026).
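The Gumbel-Softmax continuous relaxation mentioned above, which lets gradients flow through a discrete skill choice, can be sketched as follows (a generic Concrete-distribution sample, not any paper's specific implementation):

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Differentiable relaxation of a categorical skill assignment.

    Draws a Gumbel-Softmax (Concrete) sample over K skill indices; as
    temperature -> 0 the sample approaches a one-hot skill choice, while
    higher temperatures give smoother, lower-variance gradients.
    """
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick (epsilons avoid log(0))
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + gumbel) / temperature
    y = y - y.max()                      # shift for numerical stability
    expy = np.exp(y)
    return expy / expy.sum()             # soft one-hot over skills
```

In practice the temperature is annealed during training, so early epochs explore soft mixtures of skills while later epochs commit to near-discrete skill assignments.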

6. Applications, Variants, and Extensions

Skill-MDPs have found broad application in robotic manipulation, navigation, multi-agent games, developmental robotics, and open-ended learning. Key themes include:

  • Skill Reuse and Transfer: Learned skills or option libraries can be transferred across tasks or shared across agents to accelerate adaptation (Yang et al., 2019, Mankowitz et al., 2015).
  • Dynamic Skill Discovery: Bayesian nonparametric approaches avoid specification of fixed skill count and adaptively allocate capacity during training (Villecroze et al., 2022).
  • Interpretability and Human Alignment: Skill tokenization and visualization (e.g., heatmaps, word-clouds, language-to-skill mappings) facilitate transparency and human-robot interaction (Liang et al., 2023, Gu et al., 5 Jan 2026).
  • Integration with Diffusion Models: Skill-conditioned diffusion policies show strong performance in robotic domains, enabling robust, interpretable planning from raw perceptual inputs (Liang et al., 2023, Gu et al., 5 Jan 2026).
  • Multi-Agent Cooperation: Hierarchical multi-agent Skill-MDPs enable scalable centralized training with decentralized execution, fostering complementary skill selection and coordinated team behavior (Yang et al., 2019).

These developments underline the Skill-MDP framework’s centrality in unifying temporally abstracted reinforcement learning, modular composition, and autonomous agent design across diverse settings (Liang et al., 2023, Shi et al., 2022, Srivastava et al., 6 Feb 2025, Gu et al., 5 Jan 2026, Yang et al., 2019, Mankowitz et al., 2015, Villecroze et al., 2022).
