
Teacher Motion Priors: Enhancing Robot Locomotion over Challenging Terrain

Published 14 Apr 2025 in cs.RO and cs.AI | (2504.10390v2)

Abstract: Achieving robust locomotion on complex terrains remains a challenge due to high-dimensional control and environmental uncertainties. This paper introduces a teacher prior framework based on the teacher-student paradigm, integrating imitation and auxiliary task learning to improve learning efficiency and generalization. Unlike traditional paradigms that strongly rely on encoder-based state embeddings, our framework decouples the network design, simplifying the policy network and deployment. A high-performance teacher policy is first trained using privileged information to acquire generalizable motion skills. The teacher's motion distribution is transferred to the student policy, which relies only on noisy proprioceptive data, via a generative adversarial mechanism to mitigate performance degradation caused by distributional shifts. Additionally, auxiliary task learning enhances the student policy's feature representation, speeding up convergence and improving adaptability to varying terrains. The framework is validated on a humanoid robot, showing a great improvement in locomotion stability on dynamic terrains and significant reductions in development costs. This work provides a practical solution for deploying robust locomotion strategies in humanoid robots.

Summary

  • The paper presents a novel teacher-student framework that uses adversarial imitation and auxiliary tasks to enhance robot locomotion on variable terrains.
  • It demonstrates substantial improvements in terrain traversal, tracking accuracy, and energy efficiency compared to traditional methods.
  • The approach features decoupled architectures and extensive domain randomization, enabling effective sim-to-real transfer in humanoid robots.

Introduction and Motivation

Achieving robust autonomous locomotion on complex natural terrains with high-degree-of-freedom humanoid robots is a persistent challenge. Traditional model-based approaches, while demonstrating reliable control in controlled environments, exhibit limited adaptability in unstructured, dynamic domains. Recent work on deep RL-based policies has advanced robustness, but sim-to-real transfer, distributional shift, and dependence on privileged observations limit real-world efficacy. Notably, teacher-student policy paradigms have mitigated these constraints by first training a teacher policy with privileged information and subsequently distilling knowledge to a deployable proprioception-only student. However, these methods frequently suffer from distribution mismatch, network coupling, and limited policy generalization.

This paper introduces the Teacher Motion Priors (TMP) framework, a novel teacher-student method that leverages generative adversarial training and auxiliary task learning for efficient, highly-generalizable robot locomotion. The principal innovations address distributional shift via adversarial imitation, decouple student and teacher architectures for deployment simplicity, and exploit multi-task auxiliary objectives for accelerated learning and terrain adaptation.

Methodology

Problem Formulation and Overall Framework

Humanoid locomotion is modeled as a POMDP. During training, the teacher policy has access to both proprioceptive ($o_t$) and high-dimensional privileged ($o^p_t$) observations (e.g., height maps, foot contacts), facilitating sample-efficient acquisition of robust strategies. The student observes only noisy proprioceptive states, reflecting realistic sensor limitations.
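This observation split can be sketched concretely as follows. All dimensions, field contents, and the noise scale are illustrative placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

PROPRIO_DIM = 45   # e.g., joint positions/velocities, IMU (illustrative)
PRIV_DIM = 187     # e.g., height-map samples, contact flags (illustrative)
STACK = 5          # number of stacked past frames

def teacher_obs(proprio_hist, privileged):
    """Teacher sees clean stacked proprioception plus the privileged state."""
    return np.concatenate([np.concatenate(proprio_hist), privileged])

def student_obs(proprio_hist, noise_std=0.05):
    """Student sees only proprioception, corrupted with Gaussian noise."""
    stacked = np.concatenate(proprio_hist)
    return stacked + rng.normal(0.0, noise_std, size=stacked.shape)

hist = [rng.standard_normal(PROPRIO_DIM) for _ in range(STACK)]
priv = rng.standard_normal(PRIV_DIM)
t_obs = teacher_obs(hist, priv)   # shape (45*5 + 187,) = (412,)
s_obs = student_obs(hist)         # shape (225,)
```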

The TMP training proceeds in two phases:

  1. Teacher Policy Training: Uses PPO within a large-capacity actor-critic architecture. The teacher receives frame-stacked proprioceptive and privileged states, optimizing a composite RL objective incorporating clipped surrogate loss, value estimation, and entropy regularization. Gaussian noise is injected into proprioceptive channels to improve domain robustness.
  2. Student Policy Training: The student, architecturally decoupled from the teacher, receives only proprioceptive observations. Imitation is achieved not via classical supervised behavior cloning but through a GAIL-inspired generative adversarial process. A discriminator $\mathcal{D}$ is trained to differentiate between teacher and student (state, action) pairs over recent frames, providing imitation feedback to the student. To further mitigate the compounding errors endemic in imitation (especially under distributional drift), the student additionally optimizes auxiliary prediction objectives, sharing initial network layers with the policy and thus promoting robust, noise-resilient representation learning.
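The adversarial transfer in step 2 can be sketched with a toy discriminator. The linear logistic form and the reward shaping $r = -\log(1 - \mathcal{D}(s, a))$ follow standard GAIL conventions and are assumptions here, not details confirmed by the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Discriminator:
    """Toy linear logistic discriminator over (state, action) pairs.
    Outputs P(pair came from the teacher's motion distribution)."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal(dim) * 0.01
        self.b = 0.0

    def prob_teacher(self, sa):
        return sigmoid(sa @ self.w + self.b)

def imitation_reward(disc, state, action, eps=1e-8):
    """GAIL-style reward: high when the student's (s, a) fools the
    discriminator into labeling it as teacher data."""
    d = disc.prob_teacher(np.concatenate([state, action]))
    return -np.log(1.0 - d + eps)

disc = Discriminator(dim=10 + 4)
r = imitation_reward(disc, np.zeros(10), np.zeros(4))  # ~0.693 for D=0.5
```

This reward is added to the student's task reward, so the policy gradient pulls the student's visitation distribution toward the teacher's rather than matching actions pointwise.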

Losses and Training Details

  • Teacher Loss:

$$\mathcal{L}_{\text{teacher}} = \mathcal{L}_{\text{clip}} + \lambda_v \mathcal{L}_v - \lambda_e \mathcal{L}_e$$

This objective balances efficient credit assignment, value estimation accuracy, and action exploration.
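A minimal numerical sketch of this composite PPO objective follows; the clip threshold and loss coefficients are illustrative defaults, not the paper's values:

```python
import numpy as np

def ppo_teacher_loss(ratio, advantage, value_pred, value_target,
                     entropy, lam_v=0.5, lam_e=0.01, clip_eps=0.2):
    """L_clip + lam_v * L_v - lam_e * L_e, per the teacher objective.
    Coefficients are illustrative placeholders."""
    # Clipped surrogate (negated because we minimize the loss)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    l_clip = -np.mean(np.minimum(unclipped, clipped))
    # Value-function regression loss
    l_v = np.mean((value_pred - value_target) ** 2)
    # Entropy bonus encourages exploration (hence subtracted)
    l_e = np.mean(entropy)
    return l_clip + lam_v * l_v - lam_e * l_e

loss = ppo_teacher_loss(
    ratio=np.array([1.1, 0.8]),
    advantage=np.array([1.0, -0.5]),
    value_pred=np.array([0.9, 0.2]),
    value_target=np.array([1.0, 0.0]),
    entropy=np.array([1.2, 1.1]),
)  # -0.35 + 0.5*0.025 - 0.01*1.15 = -0.349
```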

  • Adversarial Knowledge Transfer:

The discriminator objective is:

$$\mathcal{L}_{\text{disc}} = \lambda_{\text{pred}} \mathcal{L}_{\text{pred}} + \lambda_{\text{grad}} \mathcal{L}_{\text{grad}} + \lambda_{\text{weight}} \mathcal{L}_{\text{weight}}$$

where $\mathcal{L}_{\text{pred}}$ is a binary cross-entropy prediction loss, $\mathcal{L}_{\text{grad}}$ is a gradient penalty that enforces smoothness of the discriminator, and $\mathcal{L}_{\text{weight}}$ is weight decay that curbs overfitting.
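These three terms can be sketched for a linear logistic discriminator, where the input gradient has the closed form $\nabla_x \mathcal{D} = \mathcal{D}(1-\mathcal{D})\,w$. The architecture and all coefficients are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disc_loss(w, b, teacher_sa, student_sa,
              lam_pred=1.0, lam_grad=10.0, lam_weight=1e-4, eps=1e-8):
    """BCE prediction + gradient penalty + weight decay for a linear
    logistic discriminator. Coefficients are illustrative placeholders."""
    d_teacher = sigmoid(teacher_sa @ w + b)   # should approach 1
    d_student = sigmoid(student_sa @ w + b)   # should approach 0
    l_pred = (-np.mean(np.log(d_teacher + eps))
              - np.mean(np.log(1.0 - d_student + eps)))
    # Analytic input gradient for D(x) = sigmoid(w.x + b)
    g = (d_teacher * (1.0 - d_teacher))[:, None] * w[None, :]
    l_grad = np.mean(np.sum(g ** 2, axis=1))  # penalize steep gradients
    l_weight = np.sum(w ** 2)                 # weight decay
    return lam_pred * l_pred + lam_grad * l_grad + lam_weight * l_weight

rng = np.random.default_rng(0)
w = rng.standard_normal(8) * 0.1
loss = disc_loss(w, 0.0,
                 rng.standard_normal((16, 8)),
                 rng.standard_normal((16, 8)))
```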

  • Student Loss:

$$\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{clip}} + \lambda_v \mathcal{L}_v - \lambda_e \mathcal{L}_e + \lambda_{\text{aux}} \mathcal{L}_{\text{aux}} + \lambda_{\text{disc}} \mathcal{L}_{\text{disc}}$$

This balances adversarial imitation, auxiliary prediction, and RL exploration-exploitation.

A structured curriculum over parametrized terrain types (slopes, rough patches, stairs, discrete obstacles), combined with extensive domain randomization, is integral to learning terrain-agnostic policies.
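A typical terrain curriculum promotes a robot to harder terrain levels when it traverses most of its commanded distance and demotes it when it fails; domain randomization perturbs physical parameters each episode. The promotion thresholds and randomization ranges below are illustrative, not the paper's settings:

```python
import random

TERRAIN_TYPES = ["slope", "rough", "stairs", "discrete_obstacles"]
MAX_LEVEL = 10

def update_level(level, dist_walked, dist_commanded):
    """Promote on >80% of commanded distance traversed, demote on <40%.
    Thresholds are illustrative."""
    frac = dist_walked / max(dist_commanded, 1e-6)
    if frac > 0.8:
        return min(level + 1, MAX_LEVEL)
    if frac < 0.4:
        return max(level - 1, 0)
    return level

def randomize_dynamics(rng):
    """Sample of domain-randomization knobs (illustrative ranges)."""
    return {
        "friction": rng.uniform(0.4, 1.2),
        "added_base_mass_kg": rng.uniform(-1.0, 3.0),
        "motor_strength_scale": rng.uniform(0.9, 1.1),
    }

rng = random.Random(0)
level = 0
for frac in [0.9, 0.9, 0.3]:          # two successes, one failure
    level = update_level(level, frac, 1.0)
params = randomize_dynamics(rng)
```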

Network Architecture

TMP employs larger, deeper teacher networks and significantly slimmed student architectures, decoupled for efficient deployment. The student policy shares layers with the auxiliary network. The discriminator models spatiotemporal action-state dependencies for improved behavioral fidelity.
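The layer sharing between the student policy and the auxiliary network can be sketched as one encoder feeding two heads. Layer sizes, the tanh nonlinearity, and the choice of auxiliary target are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    return rng.standard_normal((d_in, d_out)) * 0.05

class StudentNet:
    """Slim student network: a shared encoder feeds both the action head
    and an auxiliary prediction head (e.g., regressing quantities such as
    base velocity that only privileged sensors observe). Sizes are
    illustrative placeholders."""
    def __init__(self, obs_dim=225, hidden=128, act_dim=12, aux_dim=3):
        self.enc = linear(obs_dim, hidden)   # shared layers
        self.pi = linear(hidden, act_dim)    # policy head
        self.aux = linear(hidden, aux_dim)   # auxiliary head

    def forward(self, obs):
        h = np.tanh(obs @ self.enc)          # shared features
        return h @ self.pi, h @ self.aux

net = StudentNet()
action, aux_pred = net.forward(np.zeros(225))
```

Because the auxiliary gradient flows through the shared encoder, the auxiliary task shapes the features the policy consumes, which is the mechanism credited with faster convergence and noise resilience.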

Experimental Results

Simulated Evaluation

Comprehensive ablations and benchmarks are conducted in Isaac Gym using CASBOT SE, a full-size humanoid (1.65 m, 46 kg, 18 DoFs; 12 actively controlled).

  • Terrain Level Convergence: TMP outperforms the baseline PPO, standard teacher-student (TS), and regularized online adaptation (ROA) methods. Final terrain level is improved by 26.39% over TS and 17.20% over ROA—reflecting greater traversability and adaptation speed.
  • Tracking Accuracy: Across all terrain modalities, TMP decreases velocity tracking error by 44.16% (discrete obstacles), 40.53% (rough slopes), 39.17% (slopes), and 27.74% (stairs) relative to TS; and 30.25%, 28.16%, 23.71%, 26.66% relative to ROA, respectively.
  • Cost of Transport (CoT): Energy efficiency metrics reveal that TMP policies achieve the lowest CoT among compared methods, with improvements of up to 26.67% (discrete obstacles) over TS and up to 14.35% (slopes) over ROA.

Real-World Deployment

The student policy trained by TMP is directly deployed on CASBOT SE without additional fine-tuning or privileged observation dependence. Robust performance is demonstrated in:

  • Slope Traversal: Dynamic adaptation via joint modulation, notably leveraging ankle mechanisms.
  • Rough Brick-Paved Surfaces: Adaptive stepping compensates for variable foot contacts, minimizing upper body perturbations.
  • Disturbance Recovery: External pushes are countered in real-time, maintaining postural stability by exploiting learned feedback policies.

These physical results underscore the generalization and robustness attained through the proposed adversarial-auxiliary training paradigm.

Implications and Future Directions

TMP offers several significant advances for RL-based robot locomotion. By decoupling student architecture from the teacher, deployment flexibility is enhanced, facilitating integration with exteroceptive modalities (e.g., vision) without legacy retraining. Adversarial imitation with auxiliary tasks represents a robust mechanism for overcoming distributional shift, a critical challenge in sim-to-real RL transfer. Empirical gains in stability, tracking, and energy consumption demonstrate the broad applicability of TMP in mobile robotics and potential extensions to manipulation and multi-contact behaviors.

The framework's modularity suggests future research directions in:

  • Integrating visual or tactile exteroceptive signals into the student, leveraging TMP for multimodal sensor fusion.
  • Extending adversarial prior transfer to other robot morphologies or manipulation domains.
  • Examining curriculum and domain randomization strategies for more complex, task-parameterized environments.

Conclusion

The Teacher Motion Priors framework advances state-of-the-art teacher-student locomotion by fusing adversarial knowledge transfer and auxiliary multi-task learning. The resulting student policies display significantly improved generalization, tracking, and energy efficiency in both simulated and physical settings, demonstrating that principled generative imitation with auxiliary structure is a powerful paradigm for robust, adaptable humanoid robotic control ["Teacher Motion Priors: Enhancing Robot Locomotion over Challenging Terrain" (2504.10390)].
