AME-2: Attention-Based Neural Map Encoder
- The paper introduces AME-2, a framework that integrates uncertainty-aware, attention-based mapping with hierarchical RL to achieve dynamic, robust legged locomotion.
- The system fuses local elevation maps and proprioception via multi-head attention to create interpretable, adaptive policy inputs for agile maneuvering.
- Empirical results on quadrupedal and bipedal platforms demonstrate improved zero-shot generalization, robustness to sensor noise, and effective sim-to-real transfer.
The Attention-Based Neural Map Encoder (AME-2) is a reinforcement learning (RL) framework for agile and generalized legged locomotion that integrates an uncertainty-aware, attention-based environment representation with hierarchical policy learning. Designed to overcome the limitations of both conventional model-based pipelines and end-to-end sensorimotor RL models, AME-2 achieves dynamic, robust whole-body maneuvers on complex terrains while supporting interpretable policy decision-making and robust zero-shot generalization under sensor noise, terrain occlusion, and sparse foothold conditions. The framework was validated on both quadrupedal (ANYmal-D) and bipedal (LimX TRON1) robotic platforms, demonstrating simultaneous agility, generalization, and robustness in both simulated and real-world settings (Zhang et al., 13 Jan 2026).
1. Motivation and Unified Framework
AME-2 targets the longstanding trade-offs between agility, generalization, and robustness in legged locomotion. Classical mapping with model-based control can generalize but is limited in agility and resilience to occlusions. Conversely, end-to-end sensorimotor policies excel at agile tasks (e.g., parkour) but lack interpretability and generalization. AME-2 introduces a mid-level paradigm: it augments the sensorimotor stack with a learned, egocentric height map encoder that incorporates per-cell uncertainty and multi-head attention, enabling the RL control policy to dynamically prioritize salient map regions based on both global context and proprioceptive feedback.
Key advancements include tight, learnable integration of mapping and control to maximize task-relevant information flow and the use of a compact, interpretable terrain representation. Explicit map uncertainty reasoning confers robustness to occlusions and sensor degradation; the modular design provides pathways for interpretability and debugging.
2. Map Encoder Architecture and Attention Mechanism
The AME-2 encoder ingests three primary input types: a local elevation map with per-cell mean and log-variance, a global context feature capturing macroscopic terrain layout, and an embedding of robot proprioception (base state, joint configuration/velocity, history, and command intent). Local elevation and associated uncertainties are predicted per cell using a lightweight U-Net with gated residual blocks. The global context is extracted by a multilayer perceptron (MLP) applied to local features and aggregated with max-pooling.
Feature fusion proceeds as follows: the local feature map is combined with a positional encoding, and a global feature vector is formed by pooling. For attention, the query (Q) combines the global and proprioceptive features, while the local map cells serve as attention keys (K) and values (V). Multi-head attention computes cellwise relevance as softmax(QKᵀ/√dₖ)V, weighting each cell by its salience to the current query.
The attended, weighted local embedding is concatenated with the global feature vector to form the policy's map embedding, which is combined with the proprioceptive embedding and passed to an MLP policy decoder. This scheme allows the policy to focus adaptively on the terrain subregions most salient for safe and agile traversal, conditioned on global context and locomotion intent.
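As a concrete illustration, the fusion step can be sketched as a single attention head over flattened map cells (the dimensions, random projections, and single head are simplifying assumptions; AME-2 learns these projections and uses multiple heads):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def map_attention(query, cell_feats, d_k=32, seed=0):
    """One attention head over map cells (illustrative shapes and
    random projections; the real encoder learns these weights)."""
    rng = np.random.default_rng(seed)
    d = query.shape[0]
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    q = query @ Wq                        # query from global + proprioceptive features
    K = cell_feats @ Wk                   # map cells as keys
    V = cell_feats @ Wv                   # map cells as values
    w = softmax(K @ q / np.sqrt(d_k))     # cellwise relevance weights
    return w @ V, w                       # attended embedding, attention heatmap

cells = np.random.default_rng(1).standard_normal((64, 16))  # 64 cells, 16-dim features
emb, w = map_attention(cells.mean(axis=0), cells)
```

The returned weight vector is exactly the quantity visualized as an attention heatmap in the interpretability analysis.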
3. Reinforcement Learning Training Regime
The RL agent’s state at each timestep comprises the proprioceptive embedding and the AME-2 map embedding; actions are target joint positions for a high-rate (400 Hz) PD controller, selected at 50 Hz by the policy.
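The two-rate control loop can be sketched as follows (the PD gains are illustrative placeholders, not the paper's values):

```python
import numpy as np

POLICY_HZ, PD_HZ = 50, 400
DECIMATION = PD_HZ // POLICY_HZ   # 8 inner PD updates per policy action

def pd_torque(q, qd, q_target, kp=40.0, kd=1.0):
    # PD law tracking the policy's target joint positions; the policy holds
    # each target constant for DECIMATION inner-loop steps
    return kp * (q_target - q) - kd * qd
```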
The reward function encompasses position and heading tracking, goal-seeking, standing stability, and penalties for unsafe or inefficient behaviors (early terminations, excessive torque, slip, base roll, contact violations). The training regimen employs PPO with an asymmetric actor–critic architecture: the actor uses AME-2 while the critic adopts a Mixture-of-Experts (MoE) module for computational efficiency.
Training occurs in two stages: a privileged teacher policy is trained on ground-truth maps (80,000 iterations), followed by a student policy trained on the learned mapping pipeline (40,000 iterations). The student policy incorporates PPO loss, action-distillation from the teacher, and a representation-matching objective on the AME-2 embedding. The PPO loss is omitted during the initial 5,000 student iterations to enhance stability.
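The staged student objective can be sketched as a weighted sum whose PPO term is gated off during warm-up (the loss weights `w_d` and `w_r` are assumptions; only the 5,000-iteration warm-up is stated in the source):

```python
def student_loss(it, ppo, distill, repr_match, warmup=5000, w_d=1.0, w_r=1.0):
    # PPO loss is omitted for the first `warmup` student iterations for
    # stability; distillation and representation-matching terms always apply
    ppo_term = ppo if it >= warmup else 0.0
    return ppo_term + w_d * distill + w_r * repr_match
```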
4. Learning-Based Uncertainty-Aware Mapping Pipeline
Input depth clouds are rasterized to local elevation grids, which are processed by a gated U-Net to predict per-cell elevation and uncertainty. Training uses a β-negative log-likelihood (β-NLL) loss, with rough-terrain samples up-weighted to promote calibration and sharp uncertainty boundaries.
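Assuming the β-NLL formulation of heteroscedastic regression (the exact weighting scheme in the paper may differ), the per-cell loss can be sketched as:

```python
import numpy as np

def beta_nll(mean, log_var, target, beta=0.5, cell_weight=None):
    """Gaussian NLL per cell, reweighted by sigma^(2*beta); cell_weight can
    up-weight rough-terrain cells (both choices are illustrative)."""
    var = np.exp(log_var)
    nll = 0.5 * (log_var + (target - mean) ** 2 / var)
    loss = nll * var ** beta          # in practice the var factor is detached
    if cell_weight is not None:
        loss = loss * cell_weight
    return loss.mean()
```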
For global mapping, local predictions are fused onto a persistent grid using odometry. The fusion protocol computes a measurement variance for each incoming cell and applies a probabilistic winner-take-all strategy: the incoming estimate replaces the stored one with a probability governed by the relative variances of the two, with updates drawn randomly according to this probability. Update acceptance is further gated by uncertainty thresholds to avoid building overconfident beliefs in occluded or unobserved regions, while still enabling rapid map updates when high-certainty data arrives. This process yields a global elevation map in which local uncertainty is explicit and dynamically modulated during operation.
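One plausible form of the gated, probabilistic winner-take-all update for a single cell (the acceptance probability and gate threshold here are assumptions, not the paper's exact rule):

```python
import numpy as np

def fuse_cell(map_mu, map_var, meas_mu, meas_var, var_gate=0.5, rng=None):
    # Gate: reject incoming measurements whose variance exceeds the threshold,
    # so the map never builds overconfident beliefs from occluded observations
    if meas_var > var_gate:
        return map_mu, map_var
    rng = rng or np.random.default_rng()
    # Winner-take-all: the lower-variance estimate wins with higher probability
    p_accept = map_var / (map_var + meas_var)
    if rng.random() < p_accept:
        return meas_mu, meas_var
    return map_mu, map_var
```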
5. Sim-to-Real Transfer and Domain Robustness
Robust sim-to-real transfer is realized via a teacher–student RL scheme in which the student is trained with the onboard mapping module, domain-randomized observations, and injected corruptions. Randomization encompasses robot dynamics (mass, friction, delays), sensor noise, and structured observation corruptions (missing points, map drift, uncertainty spikes, partial resets). Uncertainty enters both the AME-2 encoder (as a map channel) and mapping fusion, which promotes cautious policy behavior in unknown and occluded regions. The deployed controller executes in approximately 2 ms per policy step on commodity CPUs using ONNX Runtime.
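The structured observation corruptions injected during student training can be sketched as follows (the dropout, drift, and spike parameters are illustrative, not the paper's randomization ranges):

```python
import numpy as np

def corrupt_heightmap(h, rng, p_drop=0.2, drift_std=0.02, p_spike=0.03):
    """Apply missing points, global map drift, and artifact spikes to a
    height grid, mimicking real-world mapping degradation."""
    h = h + rng.normal(0.0, drift_std)               # odometry-induced drift
    drop = rng.random(h.shape) < p_drop              # missing points
    spikes = rng.random(h.shape) < p_spike           # height artifacts
    h = np.where(spikes, h + rng.normal(0.0, 0.5, h.shape), h)
    return np.where(drop, np.nan, h)                 # NaN marks dropped cells
```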
6. Empirical Evaluation and Ablation Analysis
The AME-2 framework was evaluated on multiple metrics:
- Success on training terrains: Teacher achieves ≈96%, student 94% (dense/climb) and 92% (sparse).
- Zero-shot transfer to unseen/mixed terrains: Teacher 95.2% avg, student 82.4% avg.
- Dynamic metrics: real-world speeds up to 2 m/s (ANYmal-D); platform climbs of 1.0 m (quadruped) and 0.48 m up / 0.88 m down (biped).
Ablation studies demonstrate that AME-2 surpasses predecessor architectures (AME-1: 51.2% test success; MoE: 45.0%) and outperforms end-to-end visual recurrent students (51.5%). The uncertainty-aware, Bayesian mapping pipeline yields lower calibration loss (0.046 on test terrains) and sharper uncertainty signals than alternative methods. The policy is also robust to perception degradation: success exceeds 83% under 20% point dropout or 3% artifact rates.
Notable emergent behaviors include active probe-based perception (robot inspects obstacles before committing), whole-body impact management, and interpretable attention—heatmaps demonstrate consistent focus on critical foothold/transition points and key terrain landmarks.
7. Interpretability, Applications, and Extensions
AME-2’s explicit attention mechanism provides a pathway for high-level interpretability, safety certification, and system debugging, as visualizations of attention heatmaps consistently correspond to task-critical terrain features. The modular mapping and encoding structure is amenable to extension to fully 3D voxel-based representations, dynamic obstacle integration, and higher-DOF robots with multiple contact modalities. The design suggests broad applicability to autonomous mobile robotics in unstructured environments requiring both agility and generalization, without sacrificing the transparency crucial for deployment in safety-critical domains (Zhang et al., 13 Jan 2026).