- The paper demonstrates a unified DF-based representation that decouples object geometry from humanoid motion, enabling robust long-horizon interaction.
- It presents a three-stage training pipeline—behavior cloning, adversarial RL, and visual-motor distillation—that achieves strong sim-to-real transfer.
- Empirical results show 80–100% success on scale variations and 62.1% on multi-task sequences, highlighting impressive generalization and adaptability.
Unified Distance Field Representations for Long-Horizon Humanoid Interaction
The paper "LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations" (2602.21723) addresses a longstanding challenge in embodied intelligence: autonomous, contact-rich humanoid interaction across extended horizons and heterogeneous object geometries. Conventional approaches rely on reference-based motion imitation or task-specific reward engineering, resulting in policies that are overfit to training instances and unable to generalize or compose skills robustly. Reference-based methods entangle object geometry with motion demonstrations, leading to brittle execution in novel scenarios; reference-free methods lack a principled interaction signal, yielding isolated policies that fail to compose across tasks. The work posits that this representational bottleneck fundamentally limits practical deployment of humanoid robots for real-world multi-task interaction.
Distance Field-Based Interaction Representation
The paper advocates distance field (DF) representations as a unified, geometry-agnostic state space for humanoid-object interaction. A DF assigns to every spatial point its distance to the nearest object surface, with the gradient encoding surface normals. By constructing per-link features comprising DF values, surface gradients, and velocity decompositions into surface-normal and tangential components, the interaction representation captures both spatial proximity and contact dynamics. Critically, the local DF structure is invariant to object shapes, scales, and placements, furnishing a transferable signal for reasoning about interaction independent of memorized absolute trajectories. Temporal sequences of DF-based features are encoded via a variational autoencoder (VAE) to yield robust latents, mitigating sensor noise and facilitating generalization.
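The per-link feature construction described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses an analytic sphere SDF as a stand-in object, a central finite difference for the DF gradient, and assumes link positions and velocities arrive as NumPy arrays.

```python
import numpy as np

def sphere_sdf(p, center, radius):
    """Signed distance from points p (N,3) to a sphere surface (illustrative object)."""
    return np.linalg.norm(p - center, axis=-1) - radius

def df_features(link_pos, link_vel, center, radius, eps=1e-4):
    """Per-link DF features: distance, surface-normal gradient, and velocity
    decomposed into surface-normal and tangential components (sketch only)."""
    d = sphere_sdf(link_pos, center, radius)                 # (N,) distances
    # Central finite difference of the DF; its gradient approximates the outward normal.
    grad = np.zeros_like(link_pos)
    for i in range(3):
        dp = np.zeros(3); dp[i] = eps
        grad[:, i] = (sphere_sdf(link_pos + dp, center, radius)
                      - sphere_sdf(link_pos - dp, center, radius)) / (2 * eps)
    n = grad / np.linalg.norm(grad, axis=-1, keepdims=True)
    v_n = np.sum(link_vel * n, axis=-1, keepdims=True) * n   # normal velocity component
    v_t = link_vel - v_n                                     # tangential velocity component
    return np.concatenate([d[:, None], n, v_n, v_t], axis=-1)  # (N, 10) feature rows
```

Because the features depend only on local distance, normal direction, and relative motion, the same computation applies unchanged to any object for which a distance field can be queried, which is the source of the representation's shape invariance.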
Three-Stage Training Pipeline
The control framework is realized through a three-stage pipeline:
- Behavior Cloning Pre-training: A teacher policy tracks reference motions augmented by residual learning in simulation, generating physically valid interaction data across tasks. A student policy is behavior-cloned from teacher rollouts but operates solely on DF-based latents, root trajectory commands, and proprioception, deliberately omitting reference motions to match inference-time constraints.
- Adversarial Interaction Prior Reinforcement Learning: Post-training is conducted in procedurally randomized environments with varied object geometries, eliminating reference motions and task-specific rewards. An Adversarial Interaction Prior (AIP) discriminator supervises interaction in the DF latent space, rewarding the geometric structure of valid interaction independent of absolute joint configurations. Policy optimization uses a composite reward combining root tracking, geometric interaction style (via the AIP), and whole-body motion style (via an AMP discriminator).
- Visual-Motor Distillation: For real-world deployment, privileged global object information is replaced by visual perception: the DF-conditioned policy is distilled into a vision-only policy operating on egocentric depth observations. Visual encoders are trained via DAgger-style supervision to align vision-based latents with DF-based latents, enabling execution using only onboard sensors.
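The stage-2 composite reward can be illustrated with a short sketch. Everything here is an assumption for exposition: the weights, the Gaussian tracking kernel, and the least-squares style reward (the form popularized by AMP-style training) stand in for whatever exact terms the paper uses; the discriminator scores are treated as precomputed scalars.

```python
import numpy as np

def style_reward(d_score):
    """Least-squares adversarial style reward, r = max(0, 1 - 0.25*(d - 1)^2),
    as commonly used with AMP-style discriminators (illustrative form)."""
    return max(0.0, 1.0 - 0.25 * (d_score - 1.0) ** 2)

def composite_reward(root_pos, root_target, aip_score, amp_score,
                     w_track=0.5, w_aip=0.3, w_amp=0.2, sigma=0.5):
    """Sketch of the stage-2 reward: root tracking plus geometric interaction
    style (AIP discriminator on DF latents) and whole-body motion style (AMP
    discriminator). Weights and kernel width are hypothetical."""
    r_track = np.exp(-np.sum((root_pos - root_target) ** 2) / sigma ** 2)
    return w_track * r_track + w_aip * style_reward(aip_score) + w_amp * style_reward(amp_score)
```

The key design point survives any choice of weights: neither style term references absolute joint configurations, so the policy is rewarded for geometrically valid interaction rather than for reproducing a memorized trajectory.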
Numerical Evaluation and Claims
Strong empirical results substantiate several key claims:
- Generalization across Object Scales and Shapes: The DF-based policies maintain 80–100% success on PickUp and SitStand tasks across object scales from 0.4× to 1.6×, outperforming reference-based HDMI and ResMimic baselines that sharply degrade outside the training distribution. The policy generalizes to unseen shapes, including spheres and cylinders, without retraining.
- Long-Horizon Skill Composition: A single DF-conditioned policy achieves 62.1% success on 5-task trajectories, remaining viable up to 40 sequential tasks in simulation. Ablated variants collapse to zero success beyond short sequences, indicating the necessity of the unified geometric representation and training pipeline.
- Vision-Based Deployment: The distilled vision-based policy preserves performance trends, achieving 63.7–99.7% success for PickUp under scale variation. While vision-based execution exhibits a modest performance reduction due to perceptual uncertainty, it exceeds reference-free and vision-guided reference-based baselines in generalization capacity.
- Real-World Robustness: The DF-based policy achieves 8–10/10 success for PickUp and SitStand across varying object sizes and seat heights on a physical humanoid platform. Tracking accuracy exceeds 75%, and stable interaction is demonstrated even with unseen objects.
Implications and Theoretical Impact
The work demonstrates that DF-based representations constitute a scalable, differentiable, and shape-invariant signal for contact-rich humanoid control. By grounding interaction in dynamic local geometry rather than kinematic templates, the approach decouples policy execution from demonstration-specific biases, substantially improving generalization and adaptability. The adversarial prior applied in DF latent space regularizes interaction validity without constraining the kinematic template, enabling synthesis of novel poses for unseen geometries while preserving physical plausibility and task success. This representational unification obviates the need for handcrafted rewards or explicit task sequencing, supporting seamless skill composition in a single policy architecture.
The practical implication is clear: humanoid robots trained using DF conditioning can autonomously execute, recover, and compose skills across heterogeneous interaction regimes without explicit replanning, a capability required for scalable deployment in unstructured, multi-task environments. The theoretical insight—that local geometric structure captures the essential invariants of interaction—opens avenues for extending this paradigm to articulated and deformable objects, as well as for robust policy learning under partial observability.
Future Directions
The authors suggest broadening the applicability of DF-based representations to interaction with articulated and deformable objects, and further improving robustness in partially observable settings by integrating richer sensory modalities and uncertainty-aware encoders. The demonstrated sim-to-real transfer underscores the viability of geometry-conditioned policies for real-world humanoid robotics.
Conclusion
By leveraging distance field representations as a unified, geometry-invariant signal, the proposed framework achieves reference-free, generalizable, and compositional humanoid interaction across long horizons and varied object geometries (2602.21723). A comprehensive three-stage pipeline—behavior cloning, adversarial RL, and visual distillation—enables scale-robust skill acquisition and real-world deployment with minimal sensory assumptions. The approach advances embodied intelligence by resolving a core representational bottleneck, offering a practical methodology for autonomous whole-body control with broad implications for AI-driven robotics.