HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model

Published 12 Feb 2026 in cs.RO | (2602.11758v1)

Abstract: Humanoid robots show promise for complex whole-body tasks in unstructured environments. Although Human-Object Interaction (HOI) has advanced, most methods focus on fully actuated objects rigidly coupled to the robot, ignoring underactuated objects with independent dynamics and non-holonomic constraints. These introduce control challenges from coupling forces and occlusions. We present HAIC, a unified framework for robust interaction across diverse object dynamics without external state estimation. Our key contribution is a dynamics predictor that estimates high-order object states (velocity, acceleration) solely from proprioceptive history. These predictions are projected onto static geometric priors to form a spatially grounded dynamic occupancy map, enabling the policy to infer collision boundaries and contact affordances in blind spots. We use asymmetric fine-tuning, where a world model continuously adapts to the student policy's exploration, ensuring robust state estimation under distribution shifts. Experiments on a humanoid robot show HAIC achieves high success rates in agile tasks (skateboarding, cart pushing/pulling under various loads) by proactively compensating for inertial perturbations, and also masters multi-object long-horizon tasks like carrying a box across varied terrain by predicting the dynamics of multiple objects.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a dynamics-aware world model that predicts latent object states using historical proprioceptive data and reference motion.
It employs explicit geometric projection and an asymmetric distillation framework to enable robust sim-to-real transfer without external perception.
Real-world evaluations demonstrate 100% success in complex agile tasks, including skateboarding and cart manipulation, under visual occlusion.

Humanoid Agile Object Interaction Control via Dynamics-Aware World Model (HAIC): Technical Analysis

Introduction and Problem Motivation

The HAIC framework introduces a comprehensive approach to humanoid Human-Object Interaction (HOI) focusing on robust, agile interaction with both fully actuated and underactuated objects in visually occluded, unstructured environments. The primary limitation of prior HOI methods is their focus on fully actuated objects, where object kinematics are tightly constrained by the end-effector, resulting in tractable, but unrealistic, dynamics. Real-world scenarios involve underactuated objects, such as carts and skateboards, introducing independent dynamics and non-holonomic constraints that generate complex contact forces and frequently occlude visual state. Traditional policies, reliant on privileged state estimation or explicit visual feedback, are unsuited to these challenges.

HAIC addresses these limitations by integrating proprioception-driven, simulation-distilled world modeling with high-order dynamics prediction and explicit geometric priors, enabling sim-to-real transfer for whole-body humanoid control without reliance on external perception.

Figure 1: HAIC targets a diverse suite of fully actuated and underactuated objects, critical for robust HOI.

Dynamics-Aware World Model Architecture

HAIC’s central innovation is a dynamics-aware world model (DWM) that predicts latent, unobserved object states—velocity and acceleration—exclusively from historical proprioceptive data and reference motion. The core pipeline comprises the following stages:

High-Order Dynamics Prediction: The model leverages a temporal window of proprioception to infer linear/angular positions, velocities, and accelerations of interactive objects, accounting for coupled humanoid-object dynamics during contact and manipulation.
Explicit Geometric Projection: Predicted object kinematics are mapped onto canonical point cloud templates (geometric priors) via a differentiable transform, yielding a dynamic, spatially grounded representation of occupancy and surface contact affordances. This geometric grounding is critical for stable control under visual occlusion, functioning as an implicit collision boundary predictor.
Figure 2: The DWM architecture predicts latent object dynamics from proprioceptive histories, projecting states onto geometric priors to reconstruct spatial features required for downstream control.

Policy Distillation and Training Paradigm

HAIC employs a two-stage asymmetric distillation model (privileged teacher → partial observation student):

Stage 1: The privileged teacher, using full state (including ground-truth object states), is trained via RL to establish high-performing policy and feature spaces. Concurrently, the student policy and the world model are supervised to mimic teacher actions and features through joint policy distillation and world model prediction.
Stage 2: The student policy, restricted to proprioception and geometric priors, is fine-tuned using an asymmetric actor-critic setup—where the critic has privileged access—while the DWM is online-updated to maintain consistency under distribution drift. To mitigate instability, an exponential moving average (EMA) is used for the world model weights during rollouts.
Figure 3: Overview of HAIC’s teacher-student, sim-to-real asymmetric fine-tuning framework. The student leverages only onboard proprioception and geometric priors at deployment.

Reward Design and Multiple Objects Interaction

To support robust multi-object manipulation, HAIC introduces a contact reward unifying geometric alignment and force regulation, structured into guidance and execution phases. The reward jointly optimizes proximity of end-effectors to semantic targets during contact search and maintains regulated contact force post-establishment, preventing drift and instability. This strategy supports mode-switching and discrete contact transitions critical for long-horizon sequential tasks.

Figure 4: Multi-object contact guidance integrates geometric and force constraints for mode switching and robust manipulation.

Empirical Results: Real-World and Sim-to-Real Evaluation

Real-World Performance

HAIC demonstrates superior sim-to-real transfer and robust generalization in complex underactuated and multi-terrain scenarios on a physical humanoid platform. Key empirical highlights include:

Underactuated Object Interaction: In skateboarding and cart manipulation tasks, HAIC achieves up to 100% gliding and balancing success on skateboards and carts, outperforming proprioception-only and baseline HDMI* (Table results).
Long-Horizon Composite Tasks: HAIC is the only approach able to sequentially load payloads and manipulate carts with high success rates, whereas all baselines fail due to error accumulation.
Multi-Terrain Locomotion: In terrain-adaptive box-carrying tasks (platforms, slopes, stairs), HAIC maintains 100% success rates even under visual occlusion and compounded task difficulty. The DWM’s explicit geometric projection and dynamics prediction effectively bridge the perception gap in sensor-constrained deployments.
Figure 5: HAIC exhibits robust stability in all real-world interaction domains, outperforming baselines that fail under trajectory drift and loss of balance.

Figure 6: Sim-to-real transfer for HAIC shows robust, zero-shot generalization across HOI tasks, terrain types, object sizes, orientations, and load weights.

Ablation Study

The authors rigorously dissect contributions of dynamics prediction and geometric projection:

Position/Orientation-Only Models (Vec-Pose, Geo-Pose) can maintain balance but are unable to handle impulsive perturbations in high-dynamics scenarios, failing in de-mount phases of skateboarding.
Flattened Vector Dynamics (Vec-Dyn) improve robustness but lack geometric alignment for spatial constraints, resulting in drift.
Full HAIC (DWM with high-order dynamics + geometric projection) achieves highest sample efficiency, lowest pose/orientation errors, and the only consistent success in long-horizon and complex terrain scenarios.
Zero-Shot Generalization: HAIC policies generalize to unseen object mass, size, and environmental perturbations due to the inductive bias and explicit modeling in the world model.
Figure 7: Comparative training curves reveal HAIC sample efficiency and higher asymptotic performance on sequential HOI tasks.

Implications and Future Directions

HAIC’s architecture marks a significant step in realistic, robust sim-to-real humanoid control for HOI:

Practical Deployment: Eliminating dependence on privileged sensors/motion capture for object/terrain state, policies are deployable on unmodified hardware platforms.
Theoretical Contributions: The explicit integration of geometric priors and high-order dynamics sets a new standard for internal world modeling in partial observability. The approach robustifies control under interplay of occlusions, distribution shift, and contact discontinuities.
Bottlenecks/Future Expansion: The next research directions should include bridging perception-action using real-time vision, incorporating deformable or articulated objects, and scaling up world models to more diverse, non-rigid environments. Extension to multi-contact and collaborative settings (human-robot joint manipulation) is plausible given the demonstrated generalization.

Conclusion

HAIC fundamentally extends humanoid HOI by fusing dynamics-aware world modeling with explicit geometric priors and asymmetric training, enabling real-world agile whole-body interaction with underactuated objects and complex multistage environments under partial observations. Quantitative results confirm strong sim-to-real transfer, task success, and robustness, while ablation underscores the necessity of high-order prediction and geometric reasoning. These advances pave the way for vision-agnostic, generalizable humanoid deployment in unstructured, visually occluded domains.