- The paper introduces the Humanoid Perception Controller (HPC), a novel method integrating external terrain perception and sensor noise mitigation for robust humanoid robot locomotion on challenging terrain.
- HPC employs a teacher-student distillation framework where an oracle policy trained on clean data guides a student policy that uses a world model with a variational information bottleneck for sensor denoising.
- Experiments show HPC achieves superior velocity tracking and terrain negotiation, maintains performance under noise, and is stable in real-world uncertain environments, confirming the approach's effectiveness.
The paper introduces a novel Humanoid Perception Controller (HPC) designed to enhance humanoid robot locomotion over challenging terrain by integrating external terrain perception with sensor noise mitigation. The approach employs a teacher-student distillation framework, where an oracle policy, trained on noise-free data, guides the learning of a student policy. The student policy utilizes a world model with a variational information bottleneck for sensor denoising and state estimation.
The methodology consists of two key stages:
- Oracle Policy Training: An optimal reference policy $\pi^*: \mathcal{S}^p \to \mathcal{A}$ is derived within a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$. The policy is parameterized via proximal policy optimization (PPO) with the objective of maximizing the expected discounted reward:

$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim p_{\pi}}\left[\sum_{t=0}^{T} \gamma^t r(\boldsymbol{s}^p_t, \boldsymbol{a}_t)\right]$

* $\gamma \in [0,1)$: Discount factor.
* $\boldsymbol{s}^p_t \in \mathcal{S}^p$: Privileged information set.
The oracle policy leverages privileged observations $\boldsymbol{o}^p_t = \left\{ h^p_t, \boldsymbol{p}^p_t, \boldsymbol{R}^p_t, \boldsymbol{v}^p_t, \boldsymbol{\omega}^p_t, \boldsymbol{v}^*_t, \boldsymbol{\omega}^*_t, c^p_t, \boldsymbol{q}_t, \dot{\boldsymbol{q}}_t, \boldsymbol{a}_{t-1}, \boldsymbol{e}_{t} \right\}$ to maximize performance. A terrain encoder $\mathcal{T}_{\theta_t}$ transforms noise-free height maps $\boldsymbol{e}_t$ into spatial features $\boldsymbol{f}_t^{\text{terrain}} \in \mathbb{R}^{d_e}$, which are then combined with proprioceptive states and kinematic measurements. The architecture employs LSTM layers and MLP branches for both the actor and critic networks.
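As a sanity check on the oracle objective, here is a minimal sketch of the discounted return being maximized inside the expectation; the horizon and reward values are illustrative, not taken from the paper:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Finite-horizon discounted return sum_t gamma^t * r_t,
    i.e. the quantity inside the expectation of the PPO objective."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# A constant reward of 1 over T steps yields the geometric partial sum
# (1 - gamma^T) / (1 - gamma).
T, gamma = 5, 0.5
ret = discounted_return(np.ones(T), gamma)
```

With `gamma = 0.5` and five unit rewards this evaluates to $(1 - 0.5^5)/(1 - 0.5) = 1.9375$.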
- Student Policy Training: The student model, comprising a world model and a locomotion policy, is trained via teacher-student distillation. The student observation includes goal commands, proprioception, and noisy terrain perception: $\boldsymbol{o}_t = \left\{ \boldsymbol{\omega}_t, \boldsymbol{p}_t, \boldsymbol{v}_t, \boldsymbol{v}^*_t, \boldsymbol{\omega}^*_t, \boldsymbol{q}_t, \dot{\boldsymbol{q}}_t, \boldsymbol{a}_{t-1}, \tilde{\boldsymbol{e}}_{t} \right\}$. A variational autoencoder (VAE) maps the noisy observation history $\boldsymbol{o}_{1:t}$ to privileged states $\boldsymbol{s}^p_t$ through latent variables $\boldsymbol{z}_t$. The evidence lower bound (ELBO) is optimized as:
$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_{\phi_s}(\boldsymbol{z}_t|\boldsymbol{o}_{1:t})}\left[\log p_{\psi_s}(\boldsymbol{s}^p_t|\boldsymbol{z}_t)\right] - \beta D_{\text{KL}}\left(q_{\phi_s}(\boldsymbol{z}_t|\boldsymbol{o}_{1:t}) \parallel p(\boldsymbol{z}_t)\right)$
* $q_{\phi_s}$: Recognition model (encoder).
* $p_{\psi_s}$: Generative model (decoder).
* $\beta$: Weighting coefficient for the KL-divergence term.
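A minimal numerical sketch of this objective, assuming a diagonal-Gaussian encoder and a unit-variance Gaussian decoder so that the reconstruction term reduces to an MSE; the closed-form KL against a standard-normal prior is standard beta-VAE machinery, not code from the paper:

```python
import numpy as np

def elbo_loss(mu_z, logvar_z, s_pred, s_true, beta=0.1):
    """Negative ELBO for a diagonal-Gaussian q(z|o) and a unit-variance
    Gaussian decoder p(s^p|z): MSE reconstruction of the privileged state
    plus a beta-weighted closed-form KL(q || N(0, I))."""
    recon = np.mean((s_pred - s_true) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar_z - mu_z**2 - np.exp(logvar_z))
    return recon + beta * kl

# With q(z|o) = N(0, I) and perfect reconstruction, the loss vanishes.
loss_vae = elbo_loss(np.zeros(8), np.zeros(8), np.ones(4), np.ones(4))
```

Minimizing this loss is equivalent to maximizing the ELBO above; the variational bottleneck strength is controlled by `beta`.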
Dataset Aggregation (DAgger) is employed for behavior cloning, minimizing the mean squared error (MSE) between student and teacher actions. The imitation objective is defined as:
$\mathcal{L}_{\text{imitation}} = \mathbb{E}_{(\boldsymbol{o}_t, \boldsymbol{a}_t^{\text{teacher}}) \sim \mathcal{D}} \left[ \left\| \pi_{\xi_s}(\boldsymbol{o}_t) - \boldsymbol{a}_t^{\text{teacher}} \right\|_2^2 \right]$
* $\mathcal{D}$: Aggregated dataset.
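The aggregation-and-relabeling loop can be sketched as follows; the toy policies and observations are hypothetical, and serve only to show the shape of one DAgger round:

```python
import numpy as np

def dagger_round(student, teacher, rollout_obs, dataset):
    """One DAgger iteration: roll out the *student*, relabel the visited
    observations with *teacher* actions, aggregate them into the dataset,
    and report the behavior-cloning MSE over the aggregate."""
    dataset.extend((o, teacher(o)) for o in rollout_obs)
    obs = np.stack([o for o, _ in dataset])
    acts = np.stack([a for _, a in dataset])
    return float(np.mean((student(obs) - acts) ** 2))

# Toy policies: the teacher echoes the observation, the student outputs zeros.
teacher = lambda o: o
student = lambda obs: np.zeros_like(obs)
data = []
imitation_mse = dagger_round(student, teacher, [np.ones(2), np.ones(2)], data)
```

Because the dataset is built from states the student actually visits, the MSE target above matches the imitation objective rather than plain behavior cloning on teacher rollouts.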
The student policy's training objective combines variational inference with behavior cloning through a multi-task loss function:
$\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{imitation}} + \lambda \, \mathcal{L}_{\text{ELBO}}$
* $\lambda$: Weight balancing the reconstruction-imitation trade-off.
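As a minimal illustration of the multi-task weighting (the value of `lam` is hypothetical, not the paper's):

```python
def student_loss(l_imitation, l_elbo, lam=0.5):
    """Multi-task student objective: imitation MSE plus the
    lambda-weighted variational (negative-ELBO) term."""
    return l_imitation + lam * l_elbo

# lam scales how strongly state reconstruction competes with imitation.
l_total = student_loss(1.0, 2.0, lam=0.5)
```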
A comprehensive domain randomization framework accounts for sensor inaccuracies and terrain deformability, modeling the perceived elevation map $\hat{\mathcal{E}}_t \in \mathbb{R}^{H \times W}$ as:

$\hat{\mathcal{E}}_t = \boldsymbol{\alpha} \odot \mathcal{E}_t + \boldsymbol{\beta} + \boldsymbol{\epsilon}_t$

* $\mathcal{E}_t$: Ground-truth elevation matrix.
* $\boldsymbol{\alpha} \sim \mathcal{U}[0.8, 1.2]$: Multiplicative noise coefficient.
* $\boldsymbol{\beta} \sim \mathcal{N}(0, 0.05^2)$: Persistent terrain deformation.
* $\boldsymbol{\epsilon}_t \sim \mathcal{GP}(0, k(l))$: Zero-mean Gaussian process with Matérn kernel $k(l)$.
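The randomization model can be sketched as below. Note the stand-in: white noise smoothed row-wise by a box filter approximates spatial correlation, whereas the paper samples $\boldsymbol{\epsilon}_t$ from a Matérn-kernel Gaussian process; the $\boldsymbol{\epsilon}_t$ amplitude and correlation length are illustrative assumptions:

```python
import numpy as np

def perturb_elevation(E, rng, corr_len=3):
    """Perceived elevation map: alpha ⊙ E + beta + eps_t.
    Smoothed white noise stands in for a Matern-kernel GP sample."""
    H, W = E.shape
    alpha = rng.uniform(0.8, 1.2, size=(H, W))  # multiplicative sensor noise
    beta = rng.normal(0.0, 0.05)                # persistent terrain deformation
    white = rng.normal(0.0, 0.02, size=(H, W))  # illustrative eps_t amplitude
    kernel = np.ones(corr_len) / corr_len
    eps = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, white
    )
    return alpha * E + beta + eps

rng = np.random.default_rng(0)
E_hat = perturb_elevation(np.zeros((4, 4)), rng)
```

Applying the same perturbation during training exposes the student's world model to the sensor corruption it must denoise at deployment.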
Experimental results demonstrate that HPC achieves superior velocity-tracking precision and terrain-negotiation capability compared to baseline methods. Ablation studies highlight the necessity of both the world model and the distillation process. HPC also degrades gracefully under extreme noise conditions, preserving a significant portion of its baseline terrain performance. Real-world experiments confirm the system's stability and robustness in uncertain, noisy sensing environments. One limitation is the complex trade-off between reconstruction fidelity and policy imitation effectiveness, which requires careful balancing of the variational bottleneck coefficient and the imitation loss weight during training.