HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Published 3 Mar 2026 in cs.RO | (2603.03243v1)

Abstract: We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io


Summary

The paper introduces a scalable framework using robot-free human demonstrations to learn whole-body mobile manipulation.
It combines egocentric sensing, an embodiment-agnostic 3D visual representation, and a relaxed head-gaze action representation to bridge the visual and kinematic gaps between human and robot.
Empirical results show success rates of 90% (Laundry), 85% (Delivery), and 80% (Tablescape), outperforming the Wrist-Only, Head-Only, RGB-Only, and w/o Active Neck baselines.


Motivation and Problem Formulation

HoMMI addresses the challenge of scaling whole-body mobile manipulation learning by enabling data collection and policy learning directly from robot-free human demonstrations. The difficulty comes from the need for bimanual and whole-body coordination, active perception, and long-horizon navigation. Prior approaches based on teleoperation or stationary setups (for example UMI, VR, or external cameras) scale poorly and do not adequately capture global context, navigation, or full-body coordination. Naively adding egocentric sensing to portable demonstration devices widens the human-to-robot embodiment gap, that is, the differences in visual appearance and kinematics, which makes policy transfer harder.

System Architecture and Data Collection

HoMMI extends the UMI framework with a head-mounted camera for egocentric sensing. The devices are synchronized through ARKit, giving a portable and scalable demonstration interface that records RGB, depth, and 6-DoF pose trajectories across all modalities at 60 Hz. This enables dense, multimodal, in-the-wild data collection for bimanual mobile manipulation without teleoperation overhead.

Figure 1: HoMMI architecture supporting robot-free, egocentric demonstration collection, cross-embodiment policy learning, and constraint-aware whole-body control.
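The summary does not say how the synchronized streams are aligned after recording. As a minimal sketch of one common approach, the following resamples several timestamped streams onto a shared 60 Hz timeline by nearest-timestamp lookup. The function names, the dictionary layout, and the nearest-neighbor choice are assumptions for illustration, not the authors' pipeline or an ARKit API.

```python
import numpy as np

def nearest_indices(stream_ts, target_ts):
    """For each target timestamp, return the index of the closest stream sample.

    Timestamps are in seconds; stream_ts must be sorted and hold >= 2 samples.
    """
    stream_ts = np.asarray(stream_ts, dtype=float)
    target_ts = np.asarray(target_ts, dtype=float)
    right = np.searchsorted(stream_ts, target_ts)      # first sample >= target
    right = np.clip(right, 1, len(stream_ts) - 1)
    left = right - 1
    pick_left = (target_ts - stream_ts[left]) <= (stream_ts[right] - target_ts)
    return np.where(pick_left, left, right)

def align_streams(streams, rate_hz=60.0):
    """Resample timestamped streams onto one shared clock.

    streams: dict mapping a (hypothetical) device name to its sorted timestamps.
    Returns the shared timeline and, per stream, the sample index to use.
    """
    start = max(ts[0] for ts in streams.values())      # every device has started
    end = min(ts[-1] for ts in streams.values())       # no device has stopped
    n = int(np.floor((end - start) * rate_hz)) + 1
    timeline = start + np.arange(n) / rate_hz
    indices = {name: nearest_indices(ts, timeline) for name, ts in streams.items()}
    return timeline, indices
```

Restricting the timeline to the interval where every device is recording avoids extrapolating any stream beyond its first or last sample.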

Embodiment-Agnostic Policy Design

Visual Embedding

To bridge the visual gap, the system employs a 3D visual representation for egocentric observations. Egocentric RGB is lifted into geometry-aware tokens using depth and patch features, masked to remove embodiment-specific cues (arms and body), and transformed into a gripper-centric coordinate frame. Attention pooling then aggregates the task-relevant visual information.

Figure 2: 3D representation for egocentric visual input, masking arms/body and converting to gripper-centric frame.
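The lift, mask, and re-frame steps can be sketched with standard pinhole back-projection and rigid transforms. This is an illustration under stated assumptions rather than the paper's implementation: it uses per-pixel features instead of patch tokens, takes the arm/body mask as a given input, and all function and argument names are hypothetical.

```python
import numpy as np

def lift_to_points(depth, K):
    """Back-project a depth image (H, W), in meters, to camera-frame points (H, W, 3)."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))     # pixel column and row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def to_gripper_frame(points_cam, T_world_cam, T_world_gripper):
    """Re-express camera-frame points (N, 3) in the gripper frame.

    Both transforms are 4x4 homogeneous poses of the camera and gripper in a shared world frame.
    """
    T_gripper_cam = np.linalg.inv(T_world_gripper) @ T_world_cam
    return points_cam @ T_gripper_cam[:3, :3].T + T_gripper_cam[:3, 3]

def gripper_centric_tokens(depth, feats, body_mask, K, T_world_cam, T_world_gripper):
    """Return (positions, features) for pixels with valid depth that are not arm/body.

    feats: (H, W, C) per-pixel features (assumption; the paper uses patch features).
    body_mask: (H, W) bool array, True where the demonstrator's or robot's body is visible.
    """
    keep = (depth > 0) & ~body_mask
    points = lift_to_points(depth, K)[keep]
    return to_gripper_frame(points, T_world_cam, T_world_gripper), feats[keep]
```

Because the surviving tokens carry positions relative to the gripper, the same scene yields the same representation whether the head camera is worn by a person or mounted on the robot's neck.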

Action Representation

To resolve kinematic incompatibility, HoMMI implements a relaxed head action via a "3D look-at point" abstraction. Human head pose trajectories are replaced by view-agnostic gaze targets that the robot translates to feasible neck motions, respecting hardware constraints and preserving active perception strategies. This circumvents infeasible joint tracking and improves transferability between different embodiments.

Figure 3: Relaxed head gaze representation enables precise hand-eye tracking and effective perception while avoiding kinematic constraint violations.
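To make the look-at abstraction concrete, here is a minimal sketch that converts a 3D look-at point into pan and tilt angles for a 2-DoF neck and clamps them to joint limits. The frame convention (x forward, y left, z up, pan about z, positive tilt looking up) and the limit values are assumptions chosen for illustration; the paper's robot may use different conventions.

```python
import numpy as np

def look_at_to_pan_tilt(target_in_neck, pan_limits=(-1.5, 1.5), tilt_limits=(-0.8, 0.8)):
    """Map a look-at point, expressed in the neck base frame, to clamped (pan, tilt) in radians.

    Assumed frame: x forward, y left, z up. The limits are hypothetical placeholders.
    """
    x, y, z = target_in_neck
    pan = np.arctan2(y, x)                     # rotation about the vertical axis
    tilt = np.arctan2(z, np.hypot(x, y))       # elevation above the horizontal plane
    pan = float(np.clip(pan, *pan_limits))
    tilt = float(np.clip(tilt, *tilt_limits))
    return pan, tilt
```

A target only constrains where the camera points, so any neck configuration that keeps it in view is acceptable. When the clamped angles cannot reach the target, the remaining rotation would have to come from the torso or base, which is the role the summary assigns to the whole-body controller.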

Whole-Body Control and Policy Learning

The robot hardware consists of a high-DoF bimanual manipulator with a holonomic base, a torso, a 2-DoF neck, and wrist-mounted cameras. A differential whole-body IK solver (Mink) converts policy-generated end-effector poses and head look-at points into joint and base velocities, balancing tracking, stability, smoothness, and posture through weighted cost functions under explicit constraints. Policy inference is decoupled from control: observations and actions are timestamped and latency-matched to preserve real-time consistency.

Figure 4: Hardware schematic for high-bandwidth perception and closed-loop control on mobile manipulator.
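The weighted-cost idea behind differential IK can be illustrated with a plain NumPy damped least-squares step. This sketch does not use Mink's API and is not the authors' controller. The task tuple layout is hypothetical, and the final velocity clip is a crude stand-in for the explicit constraints that a QP-based solver would enforce.

```python
import numpy as np

def diff_ik_step(tasks, n_dof, v_max, damping=1e-3):
    """One weighted, damped least-squares differential IK step.

    tasks: list of (J, err, weight), with J of shape (m, n_dof) and err of shape (m,)
           giving the desired task-space velocity (hypothetical layout).
    Minimizes sum_i weight_i * ||J_i v - err_i||^2 + damping * ||v||^2 over the
    joint/base velocity v, then clips v to +/- v_max.
    """
    H = damping * np.eye(n_dof)
    g = np.zeros(n_dof)
    for J, err, weight in tasks:
        H += weight * J.T @ J                  # accumulate the normal equations
        g += weight * J.T @ err
    v = np.linalg.solve(H, g)
    return np.clip(v, -v_max, v_max)
```

In a setup like the one described, the task list might hold one entry per hand pose, one for the head look-at direction, and a low-weight posture term, so that hand tracking dominates when the objectives conflict.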

Empirical Evaluation

HoMMI is benchmarked across three long-horizon mobile manipulation tasks: Laundry (cloth placement in bin), Delivery (box delivery to trolley), and Tablescape (mat unfolding on table). Each task involves variations in object placement and initial configuration, probing robustness and generalization.

Figure 5: Representative rollout for Laundry task, demonstrating whole-body and active perception integration.

Figure 6: Policy rollout for Delivery task, highlighting navigation and active perception in variable workspace.

Figure 7: Tablescape task rollout, illustrating fine-grained bimanual and spatial coordination.

Baselines and Ablations

Comparisons include Wrist-Only (classic UMI), RGB-Only (adding head RGB but regressing head pose), Head-Only (no wrist input), and w/o Active Neck (disabling head motion). Each ablation isolates modality and embodiment effects.

Quantitative Results

HoMMI achieves high success rates: 90% (Laundry), 85% (Delivery), 80% (Tablescape), consistently outperforming baselines that suffer from grasp failures, navigation drift, object misalignment, or inability to recover from out-of-view targets. Key findings:

Wrist-only policies lack global context and fail at search/navigation/alignment.
Head-only input misses local contact cues essential for precise manipulation.
RGB-only policies degrade under visual/appearance mismatch, causing out-of-distribution failures.
Disabling active perception undermines adaptability, especially in search and fine placement.

Figure 8: Success rates across all tasks, showing HoMMI outperforms all baselines for long-horizon mobile manipulation.

Policy Attention and Observability

Policy attention maps show that HoMMI's cross-embodiment hand-eye policy concentrates on task-relevant objects and contact regions. This supports combining egocentric and wrist views and using gripper-centric framing for spatial awareness. Baseline attention is diffuse and less informative.

Figure 9: Attention maps for egocentric observations, highlighting sharp task-focused attention in HoMMI.

Implications, Limitations, and Future Directions

HoMMI demonstrates that robot-free human demonstrations, when augmented with egocentric sensing and an embodiment-agnostic policy design, suffice for learning versatile, long-horizon whole-body mobile manipulation. The approach scales data collection, improves human-to-robot transfer, and produces precise, coordinated behaviors in unconstrained environments, which lowers the barrier to collecting mobile manipulation data without a robot.

The short observation history limits policy memory and long-term recovery; integrating extended sequence modeling or memory architectures is an evident future direction. The reliance on vision-only input also restricts safety and compliance for contact-rich manipulation; tactile and force sensing could improve robustness. Residual hardware embodiment gaps persist, and generative hardware co-design and domain adaptation may further reduce them.

Conclusion

HoMMI introduces a scalable framework for learning mobile manipulation directly from robot-free demonstrations, leveraging embodiment-agnostic visual and action abstractions and constraint-aware whole-body control. Empirically, it reaches 80% to 90% success on three long-horizon tasks and outperforms the Wrist-Only, Head-Only, RGB-Only, and w/o Active Neck baselines. These results suggest that cross-embodiment hand-eye policies trained on robot-free data are a practical route to real-world mobile manipulation.
