- The paper introduces the Humanoid Perception Controller (HPC), a novel method integrating external terrain perception and sensor noise mitigation for robust humanoid robot locomotion on challenging terrain.
- HPC employs a teacher-student distillation framework where an oracle policy trained on clean data guides a student policy that uses a world model with a variational information bottleneck for sensor denoising.
- Experiments show HPC achieves superior velocity tracking and terrain negotiation, maintains performance under noise, and is stable in real-world uncertain environments, confirming the approach's effectiveness.
The paper introduces a novel Humanoid Perception Controller (HPC) designed to enhance humanoid robot locomotion over challenging terrain by integrating external terrain perception with sensor noise mitigation. The approach employs a teacher-student distillation framework, where an oracle policy, trained on noise-free data, guides the learning of a student policy. The student policy utilizes a world model with a variational information bottleneck for sensor denoising and state estimation.
The methodology consists of two key stages:
- Oracle Policy Training: An optimal reference policy $\pi^*: \mathcal{S}^p \to \mathcal{A}$ is derived within a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$. The policy is parameterized via proximal policy optimization (PPO) with the objective of maximizing the expected discounted reward:

$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim p_{\pi}}\left[\sum_{t=0}^{T} \gamma^t r(\boldsymbol{s}^p_t, \boldsymbol{a}_t)\right]$

* $\gamma \in [0,1)$: Discount factor.
* $\boldsymbol{s}^p_t \in \mathcal{S}^p$: Privileged information set.
The oracle policy leverages privileged observations $\boldsymbol{o}^p_t = \left\{ h^p_t, \boldsymbol{p}^p_t, \boldsymbol{R}^p_t, \boldsymbol{v}^p_t, \boldsymbol{\omega}^p_t, \boldsymbol{v}^*_t, \boldsymbol{\omega}^*_t, c^p_t, \boldsymbol{q}_t, \dot{\boldsymbol{q}}_t, \boldsymbol{a}_{t-1}, \boldsymbol{e}_{t} \right\}$ to maximize performance. A terrain encoder $\mathcal{T}_{\theta_t}$ transforms noise-free height maps $\boldsymbol{e}_t$ into spatial features $\boldsymbol{f}_t^{\text{terrain}} \in \mathbb{R}^{d_e}$, which are then combined with proprioceptive states and kinematic measurements. The architecture employs LSTM layers and MLP branches for both the actor and critic networks.
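As a sanity check on the oracle objective, here is a minimal sketch of the discounted return being maximized inside the expectation; the horizon and reward values are illustrative, not taken from the paper:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Finite-horizon discounted return sum_t gamma^t * r_t,
    i.e. the quantity inside the expectation of the PPO objective."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# A constant reward of 1 over T steps yields the geometric partial sum
# (1 - gamma^T) / (1 - gamma).
T, gamma = 5, 0.5
ret = discounted_return(np.ones(T), gamma)
```

With `gamma = 0.5` and five unit rewards this evaluates to $(1 - 0.5^5)/(1 - 0.5) = 1.9375$.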
- Student Policy Training: The student model, comprising a world model and a locomotion policy, is trained via teacher-student distillation. The student observation includes goal commands, proprioception, and noisy terrain perception: $\boldsymbol{o}_t = \left\{ \boldsymbol{\omega}_t, \boldsymbol{p}_t, \boldsymbol{v}_t, \boldsymbol{v}^*_t, \boldsymbol{\omega}^*_t, \boldsymbol{q}_t, \dot{\boldsymbol{q}}_t, \boldsymbol{a}_{t-1}, \tilde{\boldsymbol{e}}_{t} \right\}$. A variational autoencoder (VAE) maps the noisy observation history $\boldsymbol{o}_{1:t}$ to privileged states $\boldsymbol{s}^p_t$ through latent variables $\boldsymbol{z}_t$. The evidence lower bound (ELBO) is optimized as:
$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_{\phi_s}(\boldsymbol{z}_t|\boldsymbol{o}_{1:t})}\left[\log p_{\psi_s}(\boldsymbol{s}^p_t|\boldsymbol{z}_t)\right] - \beta D_{\text{KL}}\left(q_{\phi_s}(\boldsymbol{z}_t|\boldsymbol{o}_{1:t}) \parallel p(\boldsymbol{z}_t)\right)$
* $q_{\phi_s}$: Recognition model (encoder).
* $p_{\psi_s}$: Generative model (decoder).
* $\beta$: Weighting coefficient for the KL-divergence term.
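A minimal numerical sketch of this objective, assuming a diagonal-Gaussian encoder and a unit-variance Gaussian decoder so that the reconstruction term reduces to an MSE; the closed-form KL against a standard-normal prior is standard beta-VAE machinery, not code from the paper:

```python
import numpy as np

def elbo_loss(mu_z, logvar_z, s_pred, s_true, beta=0.1):
    """Negative ELBO for a diagonal-Gaussian q(z|o) and a unit-variance
    Gaussian decoder p(s^p|z): MSE reconstruction of the privileged state
    plus a beta-weighted closed-form KL(q || N(0, I))."""
    recon = np.mean((s_pred - s_true) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar_z - mu_z**2 - np.exp(logvar_z))
    return recon + beta * kl

# With q(z|o) = N(0, I) and perfect reconstruction, the loss vanishes.
loss_vae = elbo_loss(np.zeros(8), np.zeros(8), np.ones(4), np.ones(4))
```

Minimizing this loss is equivalent to maximizing the ELBO above; the variational bottleneck strength is controlled by `beta`.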
Dataset Aggregation (DAgger) is employed for behavior cloning, minimizing the mean squared error (MSE) between student and teacher actions. The imitation objective is defined as:
$\mathcal{L}_{\text{imitation}} = \mathbb{E}_{(\boldsymbol{o}_t, \boldsymbol{a}_t^{\text{teacher}}) \sim \mathcal{D}} \left[ \left\| \pi_{\xi_s}(\boldsymbol{o}_t) - \boldsymbol{a}_t^{\text{teacher}} \right\|_2^2 \right]$
* $\mathcal{D}$: Aggregated dataset.
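The aggregation-and-relabeling loop can be sketched as follows; the toy policies and observations are hypothetical, and serve only to show the shape of one DAgger round:

```python
import numpy as np

def dagger_round(student, teacher, rollout_obs, dataset):
    """One DAgger iteration: roll out the *student*, relabel the visited
    observations with *teacher* actions, aggregate them into the dataset,
    and report the behavior-cloning MSE over the aggregate."""
    dataset.extend((o, teacher(o)) for o in rollout_obs)
    obs = np.stack([o for o, _ in dataset])
    acts = np.stack([a for _, a in dataset])
    return float(np.mean((student(obs) - acts) ** 2))

# Toy policies: the teacher echoes the observation, the student outputs zeros.
teacher = lambda o: o
student = lambda obs: np.zeros_like(obs)
data = []
imitation_mse = dagger_round(student, teacher, [np.ones(2), np.ones(2)], data)
```

Because the dataset is built from states the student actually visits, the MSE target above matches the imitation objective rather than plain behavior cloning on teacher rollouts.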
The student policy's training objective combines variational inference with behavior cloning through a multi-task loss function:
$\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{imitation}} + \lambda \, \mathcal{L}_{\text{ELBO}}$
* $\lambda$: Weight balancing the reconstruction-imitation trade-off.
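As a minimal illustration of the multi-task weighting (the value of `lam` is hypothetical, not the paper's):

```python
def student_loss(l_imitation, l_elbo, lam=0.5):
    """Multi-task student objective: imitation MSE plus the
    lambda-weighted variational (negative-ELBO) term."""
    return l_imitation + lam * l_elbo

# lam scales how strongly state reconstruction competes with imitation.
l_total = student_loss(1.0, 2.0, lam=0.5)
```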
A comprehensive domain randomization framework accounts for sensor inaccuracies and terrain deformability, modeling the perceived elevation map $\hat{\mathcal{E}}_t \in \mathbb{R}^{H \times W}$ as:

$\hat{\mathcal{E}}_t = \boldsymbol{\alpha} \odot \mathcal{E}_t + \boldsymbol{\beta} + \boldsymbol{\epsilon}_t$

* $\mathcal{E}_t$: Ground-truth elevation matrix.
* $\boldsymbol{\alpha} \sim \mathcal{U}[0.8, 1.2]$: Multiplicative noise coefficient.
* $\boldsymbol{\beta} \sim \mathcal{N}(0, 0.05^2)$: Persistent terrain deformation.
* $\boldsymbol{\epsilon}_t \sim \mathcal{GP}(0, k(l))$: Zero-mean Gaussian process with Matérn kernel $k(l)$.
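The randomization model can be sketched as below. Note the stand-in: white noise smoothed row-wise by a box filter approximates spatial correlation, whereas the paper samples $\boldsymbol{\epsilon}_t$ from a Matérn-kernel Gaussian process; the $\boldsymbol{\epsilon}_t$ amplitude and correlation length are illustrative assumptions:

```python
import numpy as np

def perturb_elevation(E, rng, corr_len=3):
    """Perceived elevation map: alpha ⊙ E + beta + eps_t.
    Smoothed white noise stands in for a Matern-kernel GP sample."""
    H, W = E.shape
    alpha = rng.uniform(0.8, 1.2, size=(H, W))  # multiplicative sensor noise
    beta = rng.normal(0.0, 0.05)                # persistent terrain deformation
    white = rng.normal(0.0, 0.02, size=(H, W))  # illustrative eps_t amplitude
    kernel = np.ones(corr_len) / corr_len
    eps = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, white
    )
    return alpha * E + beta + eps

rng = np.random.default_rng(0)
E_hat = perturb_elevation(np.zeros((4, 4)), rng)
```

Applying the same perturbation during training exposes the student's world model to the sensor corruption it must denoise at deployment.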
Experimental results demonstrate that HPC achieves superior velocity-tracking precision and terrain-negotiation capability compared to baseline methods. Ablation studies highlight the necessity of both the world model and the distillation process. HPC also degrades gracefully under extreme noise conditions, preserving a significant portion of its baseline terrain performance. Real-world experiments confirm the system's stability and robustness in uncertain, noisy sensing environments. One limitation is the complex trade-off between reconstruction fidelity and policy imitation effectiveness, which requires careful balancing of the variational bottleneck coefficient and the imitation loss weight during training.