Distance-Maximizing Skill Discovery (DSD)
- Distance-Maximizing Skill Discovery (DSD) is a reinforcement learning framework that learns latent skill-conditioned policies designed to traverse the state space using explicit distance measures.
- It employs norm-matching and Wasserstein objectives to ensure that learned skills cover realistic spatial extents and adhere to underlying geometric constraints.
- DSD demonstrates practical advantages such as zero-shot goal-reaching and superior state-space coverage in robotics, grid environments, and Atari games compared to MI-based methods.
Distance-Maximizing Skill Discovery (DSD) is a framework within unsupervised reinforcement learning aimed at learning latent skill-conditioned policies that drive agents as far as possible through the state space, utilizing explicit distance measures. Unlike classical mutual information (MI) objectives that focus on state discriminability, DSD prioritizes maximizing the spatial or functional range of discovered behaviors, often subject to structured constraints. This approach is motivated by the need for diverse, reusable skills in settings lacking extrinsic rewards, such as robotics or hard-exploration domains.
1. Foundational Objective and Formalism
Distance-Maximizing Skill Discovery proposes to learn a latent skill-conditioned policy $\pi(a \mid s, z)$ and an associated state embedding $\phi : \mathcal{S} \to \mathbb{R}^n$ such that sampling a skill $z$ induces the system to traverse in a corresponding direction in latent space, while transitions in the embedding reflect meaningful spatial or task-related distances. For a given state transition $(s, s')$ and a nonnegative distance function $d(s, s')$, the canonical DSD optimization is

$$\max_{\pi, \phi}\; \mathbb{E}\left[(\phi(s') - \phi(s))^{\top} z\right]$$

subject to

$$\|\phi(s') - \phi(s)\| \leq d(s, s') \quad \text{for all } (s, s').$$

This formulation enforces that latent transitions do not unrealistically exceed the metric structure of the original state space, ensuring that skills genuinely correspond to substantial traversals in the system's true geometry (Park et al., 2023).
In the case where $d$ is the Euclidean distance $d(s, s') = \|s' - s\|$, the formulation reduces to Lipschitz-constrained Skill Discovery (LSD). Alternative metrics, such as those based on controllability or learned densities, can be employed to bias discovery toward more complex or challenging transitions (Park et al., 2023). A dual-Lagrangian relaxation facilitates optimization by introducing a multiplier $\lambda$ for practical enforcement of the constraint.
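The objective and its Lagrangian relaxation can be sketched as follows. This is a minimal NumPy illustration: the projection encoder `phi` and the multiplier value `lam` are hypothetical placeholders, not the implementation from the cited papers.

```python
import numpy as np

# Placeholder encoder for illustration; in practice phi is a learned network.
def phi(s):
    return s[:2]  # project onto the first two state dimensions

def dsd_reward(s, s_next, z, dist_fn, lam=30.0):
    """Inner-product DSD reward plus a Lagrangian penalty on the constraint:

        r = (phi(s') - phi(s))^T z  -  lam * max(0, ||phi(s') - phi(s)|| - d(s, s'))
    """
    delta = phi(s_next) - phi(s)
    reward = float(delta @ z)
    violation = max(0.0, np.linalg.norm(delta) - dist_fn(s, s_next))
    return reward - lam * violation

# The Euclidean ground distance recovers the LSD special case.
euclid = lambda s, s2: np.linalg.norm(s2 - s)

s = np.zeros(4)
s_next = np.array([1.0, 0.5, 0.0, 0.0])
z = np.array([1.0, 0.0])  # unit skill direction
print(dsd_reward(s, s_next, z, euclid))
```

Because the embedded transition here never exceeds the Euclidean distance, the penalty term is inactive and the reward reduces to the pure inner product.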
2. Norm-Matching and Wasserstein Extensions
Traditional MI-based unsupervised skill methods, such as DIAYN or VIC, maximize $I(S; Z)$ via a KL or cross-entropy term, which may promote state “variety” but does not necessarily encourage skills that traverse the state space over long distances (Durugkar et al., 2021). DSD and related methods incorporate explicit distance objectives, notably:
- Norm-matching DSD (as in "Constrained Skill Discovery: Quadruped Locomotion with Unsupervised Reinforcement Learning" (Atanassov et al., 2024)): directly optimizes a mean squared error (MSE) loss between the desired skill vector and the actual latent transition, i.e.

$$\mathcal{L} = \left\| z - (\phi(s') - \phi(s)) \right\|^2$$

with the constraint $\|\phi(s') - \phi(s)\| \leq \|s' - s\|$.
- Wasserstein DSD / Wasserstein Intrinsic Control (WIC): maximizes the expected Wasserstein-1 distance of skill-conditioned state visitation distributions from the initial state $s_0$, formally

$$W_1\!\left(\rho_{\pi}(\cdot \mid z),\, \delta_{s_0}\right) = \sup_{\|f\|_{L} \leq 1} \mathbb{E}_{s \sim \rho_{\pi}}\!\left[f(s)\right] - f(s_0),$$

where the supremum is the Kantorovich–Rubinstein dual and $\|f\|_{L}$ is the Lipschitz norm under a ground metric, making the policy push state distributions as far from $s_0$ as possible (Durugkar et al., 2021).
The distinction is that Wasserstein-based approaches explicitly embed the underlying geometric metric into the reward, while norm-matching DSD ensures both magnitude and direction alignment between the latent and real transitions.
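The two reward variants above can be sketched in a few lines. In this illustration, `delta_phi` stands for the latent transition $\phi(s') - \phi(s)$ and `f` for a 1-Lipschitz potential; both are placeholders (in practice `f` is a network trained with a Lipschitz penalty or spectral normalization).

```python
import numpy as np

def norm_matching_reward(delta_phi, z):
    """Negative MSE between skill vector and latent transition:
    aligns both the direction AND the magnitude of the motion with z."""
    return -float(np.sum((z - delta_phi) ** 2))

def wasserstein_reward(f, s, s0):
    """Kantorovich-Rubinstein dual reward r = f(s) - f(s0); maximizing it
    pushes visited states far from s0 under the ground metric."""
    return f(s) - f(s0)

# A transition that undershoots the commanded skill is penalized:
print(norm_matching_reward(np.array([0.5, 0.0]), np.array([1.0, 0.0])))  # -0.25
```

Note that the norm-matching reward is maximized (at zero) only by an exact match, whereas the Wasserstein reward keeps growing as the agent moves farther from $s_0$.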
3. Algorithmic and Architectural Implementation
Typical DSD implementations involve:
- A parameterized encoder $\phi$ (an MLP) that maps states to a latent space $\mathbb{R}^n$.
- A skill prior $p(z)$, typically uniform on a disk or sphere of bounded norm.
- A policy $\pi(a \mid s, z)$, usually an MLP receiving state and skill as input and outputting action means and covariances.
- Training employs actor-critic or off-policy RL (e.g., Soft Actor-Critic), where the intrinsic reward is given by $r = (\phi(s_{t+1}) - \phi(s_t))^{\top} z$, or its norm-matching variant (see above).
- Constraint enforcement is done via multiplier-based Lagrangian optimization or by clamping the latent transition such that $\|\phi(s_{t+1}) - \phi(s_t)\|$ never exceeds $d(s_t, s_{t+1})$. For example, in (Atanassov et al., 2024), the latent transition is rescaled whenever it surpasses the state-space metric.
- Extrinsic regularization is often applied to promote smooth or physically plausible behavior, particularly in robotics (e.g., joint torque/joint jerk penalties, feet air-time, orientation and base height penalties) (Atanassov et al., 2024).
- In controllability-aware DSD (CSD), the distance metric is dynamically learned to increase for transitions that are harder to achieve, providing a curriculum effect (Park et al., 2023).
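The rescaling option for constraint enforcement can be sketched as follows. This is a hypothetical NumPy illustration using a Euclidean state-space metric, not the exact published implementation.

```python
import numpy as np

def clamp_latent_transition(delta_phi, s, s_next):
    """Rescale the latent transition so that its norm never exceeds the
    state-space distance ||s' - s||, preserving its direction."""
    d = np.linalg.norm(s_next - s)
    n = np.linalg.norm(delta_phi)
    if n > d and n > 0.0:
        return delta_phi * (d / n)
    return delta_phi

delta = np.array([4.0, 3.0])                    # ||delta_phi|| = 5
s, s_next = np.zeros(2), np.array([1.0, 1.0])   # ||s' - s|| ~ 1.414
clamped = clamp_latent_transition(delta, s, s_next)
print(np.linalg.norm(clamped))                  # ~1.414, direction preserved
```

Compared with the Lagrangian alternative, this hard projection guarantees the constraint holds at every step rather than only in expectation.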
A representative table of architecture/hyperparameter choices (for quadruped locomotion) (Atanassov et al., 2024):
| Module | Architecture / Details | Value / Dim |
|---|---|---|
| Skill prior | Uniform disk (or $50$) | — |
| Encoder $\phi$ | 2-layer MLP, 512 units per layer, ReLU | — |
| Decoder | Unit-variance Gaussian | — |
| Policy | 2-layer MLP, 256 units, ReLU; outputs action mean and covariance | 12 DOF (robot) |
| Optimizer | Adam | — |
4. Empirical Evaluation and Results
Distance-Maximizing Skill Discovery has been empirically validated in diverse environments:
- Quadruped locomotion (Atanassov et al., 2024): DSD yields uniform state-space coverage and skill-controllable speed distributions from 0 to 3 m/s, unlike LSD/METRA baselines, which concentrate at high velocities regardless of the skill. Uniform multi-skill sweeps of 6 × 6 m Cartesian areas are achieved. On real hardware (ANYmal robot), DSD demonstrates zero-shot goal-reaching with mean errors 0.16 m (simulation) and m (hardware), substantially outperforming LSD ( m overshoot).
- Tabular and grid environments (Durugkar et al., 2021): Wasserstein DSD skills traverse to the edges of the grid, covering 100% of reachable states, while VIC baselines cover only ~20%. In Four-Rooms, DSD covers roughly three times as many unique states as MI-based methods.
- Atari (Durugkar et al., 2021): DSD provides significantly higher episodic returns and improved coverage in difficult exploration games.
- Robotic manipulation and high-DOF locomotion (Park et al., 2023): In FetchPush, CSD achieves 980 covered object bins, compared to DIAYN/DADS (100) and LSD (220). Downstream sparse-goal success rates are 75% for CSD and only 10–20% for mutual-info/Euclidean DSD-derived skills. Analogous performance improvements are found in kitchen and Ant environments.
These results demonstrate that DSD consistently outperforms both MI-based and norm-only skill discovery methods in metrics reflecting spatial coverage, controllability, and downstream utility.
5. Advanced Distance Functions and Controllability Awareness
While early DSD variants used Euclidean or metric-induced distances, subsequent advances integrate learned or controllability-aware metrics to encourage the acquisition of more challenging or informative skills.
Controllability-Aware Skill Discovery (CSD) (Park et al., 2023) introduces a learned distance $d(s, s')$, defined as a Mahalanobis distance whose parameters are fit via the negative log-likelihood of a conditional Gaussian density model $q(s' \mid s)$. As skills become proficient at certain transitions, the model likelihood $q(s' \mid s)$ increases and $d(s, s')$ decreases, thereby dynamically reducing the reward for repeated or easy transitions and driving skill discovery toward more complex behaviors. This mechanism induces a natural curriculum for unsupervised agents without reward engineering.
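A minimal sketch of such a controllability-aware distance, assuming the conditional Gaussian's mean `mu` and covariance `sigma` come from a learned density model (placeholders here):

```python
import numpy as np

def csd_distance(s_next, mu, sigma):
    """Mahalanobis distance of the observed next state under the conditional
    Gaussian q(s'|s) = N(mu(s), sigma(s)): well-predicted (easy) transitions
    land near the mean and get a small distance; unlikely (hard) transitions
    get a large one."""
    diff = s_next - mu
    return float(np.sqrt(diff @ np.linalg.inv(sigma) @ diff))

mu = np.zeros(2)
print(csd_distance(np.array([0.1, 0.0]), mu, np.eye(2)))  # ~0.1 (easy transition)
print(csd_distance(np.array([2.0, 0.0]), mu, np.eye(2)))  # ~2.0 (hard transition)
```

Plugged into the DSD constraint, a shrinking distance for easy transitions tightens the bound on the latent displacement, which is the curriculum effect described above.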
6. Applications, Zero-Shot Control, and Limitations
DSD methods provide a unified latent command space for downstream goal-directed or planning tasks. For instance, the learned embedding $\phi$ enables direct mapping from a current state $s$ and goal $g$ to a latent skill $z = \phi(g) - \phi(s)$. Conditioning the policy on $z$ permits real-time, closed-loop convergence to arbitrary Cartesian goals without retraining; this is the basis for zero-shot goal-reaching demonstrated on real quadruped platforms (Atanassov et al., 2024).
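The zero-shot scheme can be sketched as below; the placeholder encoder and the bounded-norm rescaling (to stay inside the skill prior's support) are illustrative assumptions.

```python
import numpy as np

def goal_to_skill(phi, s, goal, max_norm=1.0):
    """Closed-loop zero-shot goal reaching: pick the skill pointing from the
    current latent state toward the goal's latent embedding, re-computed at
    every control step."""
    z = phi(goal) - phi(s)
    n = np.linalg.norm(z)
    if n > max_norm:
        z = z * (max_norm / n)  # keep z inside the bounded skill prior
    return z

phi = lambda s: s[:2]  # placeholder encoder for illustration
z = goal_to_skill(phi, np.zeros(3), np.array([3.0, 4.0, 0.0]))
print(z, np.linalg.norm(z))  # unit-norm skill pointing toward the goal
```

Because $z$ is recomputed from the current state at every step, the controller self-corrects as the agent moves, with no planner or fine-tuning in the loop.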
However, DSD approaches have limitations:
- Distance constraints (e.g., Euclidean) bound only straight-line displacement and thus serve only as upper bounds on true path distance; oscillatory motions may exploit this gap.
- In highly dynamic settings, the learned metric may inadequately account for obstacles, path constraints, or environmental variation.
- Incorporating planning-level constraints or obstacle avoidance into the latent metric remains an open challenge (Atanassov et al., 2024).
7. Comparison with Mutual Information and Diversity Methods
A central distinction between DSD and earlier MI-based methods (e.g., DIAYN, VIC) is that the latter reward distinguishability of skill-induced final states, not the geometric or task-specific range achieved. As shown in (Durugkar et al., 2021, Atanassov et al., 2024, Park et al., 2023), MI objectives may yield locally oscillatory or low-coverage behaviors despite high state discriminability. In contrast, DSD—particularly with norm-matching or Wasserstein objectives—explicitly enforces spatially distant or functionally challenging behaviors, which empirically results in broader state-space coverage, better alignment with metric structure, and improved transferability to complex downstream tasks.
A plausible implication is that for domains where the spatial or functional reach of skills is critical (e.g., navigation, manipulation, locomotion), DSD and its controllability-aware variants provide a principled and empirically validated approach for skill discovery surpassing solely discriminability-based objectives.