APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots

Published 15 May 2025 in cs.RO | (2505.10022v2)

Abstract: Learning by imitation provides an effective way for robots to develop well-regulated complex behaviors and directly benefit from natural demonstrations. State-of-the-art imitation learning (IL) approaches typically leverage Adversarial Motion Priors (AMP), which, despite their impressive results, suffer from two key limitations. They are prone to mode collapse, which often leads to overfitting to the simulation environment and thus increased sim-to-real gap, and they struggle to learn diverse behaviors effectively. To overcome these limitations, we introduce APEX (Action Priors enable Efficient eXploration): a simple yet versatile IL framework that integrates demonstrations directly into reinforcement learning (RL), maintaining high exploration while grounding behavior with expert-informed priors. We achieve this through a combination of decaying action priors, which initially bias exploration toward expert demonstrations but gradually allow the policy to explore independently. This is complemented by a multi-critic RL framework that effectively balances stylistic consistency with task performance. Our approach achieves sample-efficient IL and enables the acquisition of diverse skills within a single policy. APEX generalizes to varying velocities and preserves reference-like styles across complex tasks such as navigating rough terrain and climbing stairs, utilizing only flat-terrain kinematic motion data as a prior. We validate our framework through extensive hardware experiments on the Unitree Go2 quadruped. There, APEX yields diverse and agile locomotion gaits, inherent gait transitions, and the highest reported speed for the platform to the best of our knowledge (peak velocity of ~3.3 m/s on hardware). Our results establish APEX as a compelling alternative to existing IL methods, offering better efficiency, adaptability, and real-world performance. https://marmotlab.github.io/APEX/


Summary

The paper introduces APEX, integrating decaying action priors with multi-critic reinforcement learning for sample-efficient imitation on articulated robots.
It demonstrates fast gaits and terrain adaptation on hardware, reaching a peak speed of about 3.3 m/s while preserving reference-like style.
The approach mitigates mode collapse seen in adversarial methods, enabling scalable deployment and versatile multi-skill policy learning.

APEX Framework for Skill Imitation on Articulated Robots

Introduction

The APEX framework targets two long-standing challenges in imitation learning for legged robots. Traditionally, imitation learning (IL) uses expert demonstrations to train policies through Behavioral Cloning (BC), which often generalizes poorly. Adversarial imitation learning (AIL) based on GANs has advanced IL significantly, but it is prone to mode collapse, which widens the sim-to-real gap and limits real-world applicability. APEX (Action Priors enable Efficient eXploration) addresses these issues by integrating expert data directly into reinforcement learning (RL), keeping exploration high while grounding behavior with expert-informed priors. It does so with progressively decaying action priors, which initially bias exploration toward actions derived from demonstrations and then let the policy explore more independently as training progresses. A multi-critic RL architecture balances stylistic consistency against task performance, yielding sample-efficient imitation learning and multiple skills within a single policy.

Figure 1: (a) A Japanese Spitz exhibiting a canter gait. (b) Learned canter gait using animal motion capture data achieves peak speeds > 3.3 m/s. (c, d) Generalization to stairs and slopes using only flat-ground imitation data, preserving the trot and canter gaits, respectively. (e) Gait adaptation based on velocity using a single reinforcement learning policy. (f) Extension of APEX to humanoids.

APEX Framework Implementation

APEX's novelty lies in integrating action priors into RL to guide policy exploration. The action prior is a feedforward torque computed with Proportional-Derivative (PD) control on joint angles and velocities, and it is added to the policy's action with a weight that decays over training. Early in training, exploration is therefore heavily biased toward expert-like behavior, and the bias fades as the weight shrinks. Mathematically, the applied torque is $\tau_t = a_t + \gamma^{t/k} \beta_t$, where $a_t$ is the policy action, $\gamma^{t/k}$ is the decaying weight, and the action prior is $\beta_t = K_p(\hat{\theta}_{t} - \theta_t) + K_d(-\dot{\theta}_t)$, with $\hat{\theta}_t$ the reference joint angle from the demonstration, $\theta_t$ and $\dot{\theta}_t$ the measured joint angle and velocity, and $K_p$, $K_d$ the PD gains. This setup stabilizes early policy learning by reducing reward sparsity and lowering the variance of PPO advantage estimates, which leads to more robust policy updates.
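
The torque formula above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the gain values, and the choices `gamma=0.99` and `k=100.0` are assumptions made for the example, and `t` is assumed to count training progress (for instance, policy iterations) rather than time within an episode.

```python
import numpy as np

def action_prior(theta_ref, theta, theta_dot, kp, kd):
    """PD feedforward torque: beta_t = Kp * (theta_ref - theta) + Kd * (-theta_dot)."""
    return kp * (theta_ref - theta) + kd * (-theta_dot)

def decay_weight(t, gamma=0.99, k=100.0):
    """Weight gamma^(t/k) on the prior; gamma and k are illustrative values."""
    return gamma ** (t / k)

def applied_torque(action, beta, t, gamma=0.99, k=100.0):
    """tau_t = a_t + gamma^(t/k) * beta_t."""
    return action + decay_weight(t, gamma, k) * beta

# Hypothetical two-joint example (all numbers are made up for illustration).
theta_ref = np.array([0.1, -0.2])   # reference joint angles from a demonstration
theta = np.zeros(2)                 # measured joint angles
theta_dot = np.array([1.0, 0.0])    # measured joint velocities
beta = action_prior(theta_ref, theta, theta_dot, kp=20.0, kd=0.5)

policy_action = np.zeros(2)         # an untrained policy outputting zero torque
tau_early = applied_torque(policy_action, beta, t=0)        # prior at full weight
tau_late = applied_torque(policy_action, beta, t=100_000)   # prior mostly decayed
```

With a zero policy action, `tau_early` equals the PD prior itself, so the robot is pulled toward the reference pose from the first rollout. As `t` grows the weight shrinks toward zero and the applied torque approaches the raw policy action, which matches the statement that the prior is not needed at deployment.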

Figure 2: Overview of APEX. Only dashed lines are required during deployment; 1) Imitation data can be collected from various sources; 2) Action Priors are feed-forward torques calculated from kinematic data and added to the actions to bias exploration.

Generalization and Diverse Behavior Learning

APEX's structured exploration enables efficient training of diverse skills within a single policy. A multi-critic architecture, in which separate critics estimate value for disjoint reward groups (imitation and task completion), improves the balance between style and task. In addition, phase-based tracking conditions the policy on a normalized gait phase, allowing continuous transitions between distinct behaviors without an external high-level controller. Together, these mechanisms produce smooth gait transitions and robust adaptation to unstructured terrain, even though the imitation dataset contains no explicit transition data.
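
One common way to realize a multi-critic setup is to compute a separate advantage estimate per reward group, normalize each, and combine them with weights before the PPO update. The sketch below assumes that scheme; the source does not state how APEX combines its critics, so the normalization, the weights, and all function names here are illustrative assumptions. Episode terminations are ignored for brevity.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (no terminations)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    adv = np.zeros_like(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

def combined_advantage(groups, weights, eps=1e-8):
    """Weighted sum of per-critic advantages, each normalized separately.

    groups maps a reward-group name to (rewards, values, last_value), where
    values come from that group's own critic.
    """
    total = 0.0
    for name, (rewards, values, last_value) in groups.items():
        adv = gae(rewards, values, last_value)
        adv = (adv - adv.mean()) / (adv.std() + eps)
        total = total + weights[name] * adv
    return total

# Hypothetical 4-step rollout with made-up rewards and critic outputs.
groups = {
    "imitation": ([0.9, 0.8, 0.7, 0.9], [0.5, 0.5, 0.5, 0.5], 0.5),
    "task":      ([0.0, 1.0, 0.0, 2.0], [0.2, 0.4, 0.1, 0.6], 0.0),
}
advantage = combined_advantage(groups, weights={"imitation": 0.5, "task": 0.5})
```

Normalizing each group before summing keeps one reward scale from dominating the other, so the style-task trade-off is governed by the weights rather than by raw reward magnitudes.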

Figure 3: (a) Canter gait on uneven slopes. (b) Blind trot gait traversing stairs, successful in 5/5 trials. (c) Robustness on uneven terrain of policies trained on flat terrain (pace gait shown).

Performance Evaluation

In simulation and in hardware tests on the Unitree Go2 quadruped, APEX consistently outperforms existing motion imitation frameworks such as AMP. APEX policies trained with less data and less time execute a variety of gaits with high fidelity, stylistic consistency, and adaptability. Notably, APEX reaches a peak speed of about 3.3 m/s on hardware, which AMP-trained policies do not match because of their oscillations and sim-to-real difficulties. The advantage extends to multi-gait learning and terrain generalization: APEX preserves the reference behavior under domain randomization, whereas AMP policies tend to drift toward conservative motions.

Figure 4: Comparison of single-gait execution in the real world. Each set of images shows the reference motion (top row), APEX (middle row), and AMP (bottom row). APEX matches reference style closely, while AMP shows deviations.

Conclusion

The APEX framework is a robust alternative for imitation learning in legged robotics. It uses imitation data directly within reinforcement learning, without relying on complex adversarial models, and thereby avoids common pitfalls such as mode collapse and the added computational cost of adversarial training. APEX transfers well to real hardware and generalizes across command velocities, terrains, and even humanoid platforms. Its results represent a clear advance in the efficiency, adaptability, and performance of robotic motion imitation.

Figure 5: Multi-skill policy comparison in real world experiments, showing APEX (top) and AMP (bottom) for pace, pronk, trot, and canter.

Overall, APEX is a pragmatic step forward for imitation learning frameworks, opening paths toward scalable, deployable robotic systems for diverse applications. Future work could refine morphological retargeting and integrate exteroceptive sensors for richer interaction with the environment.
