- The paper introduces a two-stage framework that decouples grasping and manipulation, enabling physically viable HOI synthesis for dexterous robotics.
- It employs conditional flow matching to accelerate inference by 40× while maintaining smooth, stable hand-object trajectories.
- Results on GRAB and HOT3D demonstrate improved physical realism, semantic fidelity, and seamless transfer to real-world robotic manipulation tasks.
FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation
Introduction
The synthesis of physically plausible, semantically aligned hand-object interaction (HOI) motions is foundational for advancing dexterous robot manipulation in unstructured environments. Existing vision-language-action (VLA) models are limited by their lack of an explicit interaction representation and their inability to perform long-horizon, contact-rich tasks. "FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation" (2602.13444) introduces a framework that decouples geometric grasping from semantically grounded manipulation via conditional flow matching, conditioned on egocentric observations, language commands, and 3D scene context. This approach enables real-time, physically viable HOI sequence generation, substantially improving physical realism, semantic correspondence, and transferability to robotic platforms.
Method
Two-Stage Flow-Matching Architecture
FlowHOI decomposes HOI synthesis into two stages:
- Grasping Stage: Uses a pretrained grasping prior, fine-tuned from large-scale reconstructed egocentric HOI data, to establish contact-stable initializations for hand-object configuration. This stage focuses solely on geometry and reachability, conditioning on object mesh, initial hand-object state, and a grasp-oriented instruction.
- Manipulation Stage: Generates subsequent interaction trajectories, conditioning on a fused hybrid 3D scene token (geometric+semantic features), holistic global scene representation, transition hand-object state, and the full language command. Motion-text alignment loss is imposed to ensure semantic fidelity.
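The decoupling above can be sketched structurally as follows. This is an illustrative interface sketch only: the function names, state fields, and stand-in computations are assumptions, not the paper's implementation; the point is that the two stages communicate solely through a transition hand-object state.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HandObjectState:
    """Shared handoff between the two stages (fields are illustrative)."""
    hand_pose: np.ndarray    # e.g. MANO-style hand parameters
    object_pose: np.ndarray  # object 6-DoF pose

def grasp_stage(object_mesh_feat, init_state, grasp_instruction):
    """Stage 1 (sketch): geometry-only grasp synthesis.

    Conditions only on object geometry, the initial hand-object state,
    and a grasp-oriented instruction; stands in for the pretrained prior.
    """
    # Placeholder update: nudge the hand toward the object geometry to
    # mimic producing a contact-stable grasp configuration.
    hand = init_state.hand_pose + 0.1 * object_mesh_feat[: init_state.hand_pose.size]
    return HandObjectState(hand_pose=hand, object_pose=init_state.object_pose)

def manipulation_stage(scene_tokens, transition_state, command, horizon=16):
    """Stage 2 (sketch): trajectory generation from the grasp onward.

    Conditions on fused 3D scene tokens, the transition state handed over
    by stage 1, and the full language command.
    """
    traj = [transition_state.hand_pose + 0.01 * t * scene_tokens.mean()
            for t in range(horizon)]
    return np.stack(traj)

init = HandObjectState(hand_pose=np.zeros(9), object_pose=np.zeros(6))
grasp = grasp_stage(np.ones(16), init, "pick up the mug")
traj = manipulation_stage(np.ones(32), grasp, "pour the mug into the bowl")
```

Because the stages only share a `HandObjectState`, each can be conditioned on exactly the inputs its sub-task needs.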
Figure 1: Overview of the FlowHOI two-stage framework, decoupling grasping and manipulation phases, each leveraging task-specific conditioning.
Conditional Flow Matching
Instead of diffusion models, FlowHOI employs conditional flow matching. This ODE-based generative paradigm accelerates inference, requiring only ≈0.16 s per sequence (≈40× faster than DiffH2O), by learning a neural vector field that transports noise to data along continuous probability flow paths, so sampling needs only a handful of solver steps rather than hundreds of denoising iterations.
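The mechanics can be illustrated with the standard linear (optimal-transport) conditional path; this is a generic flow-matching sketch, not the paper's actual network, and the single-data-point closed-form field below stands in for a trained model:

```python
import numpy as np

def cfm_training_target(x0, x1, t):
    """Linear probability path: interpolant x_t and its regression target.

    x0: noise sample, x1: data sample, t in [0, 1). A flow-matching model
    is trained to regress the constant velocity u = x1 - x0 at (x_t, t).
    """
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    return xt, u

def conditional_field(x, t, x1):
    """Closed-form conditional vector field for a single data point x1.

    Inverting x_t = (1 - t) x0 + t x1 gives u(x, t) = (x1 - x) / (1 - t);
    a trained network approximates the marginal of such fields.
    """
    return (x1 - x) / (1.0 - t)

def sample(x1, n_steps=8, rng=None):
    """Euler integration of dx/dt = u(x, t) from noise (t=0) to data (t=1).

    Few ODE steps suffice, which is the source of the large speedup over
    iterative diffusion sampling.
    """
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(x1.shape)  # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * conditional_field(x, k * dt, x1)
    return x

target = np.array([0.5, -1.2, 2.0])  # stand-in for a hand-object pose vector
out = sample(target, n_steps=8, rng=0)
```

With the exact conditional field and the linear path, even 8 Euler steps land on the target; a learned field trades this exactness for generality across the data distribution.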
Hand-Object Data Reconstruction
A central barrier is the scarcity of high-fidelity paired HOI sequences. FlowHOI introduces a pipeline to reconstruct such data from large-scale egocentric video, including:
- Transition point detection via wrist kinematic analysis.
- SAM3- and DepthAnything3-based object segmentation and metric 3D reconstruction from static pre-grasp frames.
- MANO IK fitting to tracked hand poses, with optimized object translation for contact enforcement and non-penetration under physical constraints.
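The object-translation optimization in the last step can be sketched as follows. This is a simplified, assumed construction: a sphere SDF stands in for the reconstructed object mesh, and the penalty weights are illustrative, not the pipeline's actual objective.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere; negative values indicate penetration."""
    return np.linalg.norm(points - center, axis=-1) - radius

def align_object_translation(hand_points, center, radius, steps=200, lr=0.05):
    """Sketch of contact enforcement with non-penetration.

    Optimizes an object translation t so tracked hand points end up on the
    object surface (sdf ~ 0) rather than inside it (sdf < 0).
    """
    t = np.zeros(3)
    for _ in range(steps):
        grad = np.zeros(3)
        for p in hand_points:
            d = sphere_sdf(p, center + t, radius)
            n = p - (center + t)
            n = n / (np.linalg.norm(n) + 1e-9)   # outward surface normal at p
            w = 2.0 if d < 0 else 0.2            # penetration penalized harder than gap
            grad += w * d * (-n)                 # gradient of (w/2) d^2 w.r.t. t
        t -= lr * grad / len(hand_points)
    return t

# Two tracked hand points that initially penetrate a unit sphere at the origin.
hand_pts = np.array([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]])
t_opt = align_object_translation(hand_pts, center=np.zeros(3), radius=1.0)
final_sdf = sphere_sdf(hand_pts, t_opt, 1.0)  # ~0: contact without penetration
```

The asymmetric weighting resolves interpenetration first while the weak contact term keeps the fingertips attached to the surface.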
The resulting dataset substantially improves grasping prior accuracy and generalization across diverse interaction topologies.
Figure 4: Full HOI data pipeline: transition frame detection, mesh reconstruction, and hand-object alignment under spatial constraints.
Experiments
Quantitative Results
Evaluations were performed on GRAB and HOT3D, encompassing metrics for physical interaction quality (interpenetration volume, contact ratio), motion quality (action recognition, diversity), and physical feasibility (Isaac Gym simulations after robot retargeting). Notable findings include:
- On GRAB, FlowHOI delivers the best interpenetration volume (10.93 cm³), highest simulation success rate (55.96%, a 1.7× improvement over the best diffusion baseline), and robust action recognition (0.95).
- On HOT3D, semantic grounding and contact consistency are preserved despite real-world scene noise, achieving top contact ratio and sustained success rates.
- Inference is consistently 40× faster than prior diffusion baselines, without loss of realism or diversity.
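For concreteness, an interpenetration-volume metric of the kind reported above can be computed from voxel-occupancy overlap; the paper's exact measurement protocol is not spelled out here, so this is an assumed but standard construction using spheres as stand-ins for hand and object geometry.

```python
import numpy as np

def interpenetration_volume(occ_a, occ_b, voxel_size):
    """Overlap volume (cm^3) between two boolean occupancy grids."""
    return np.count_nonzero(occ_a & occ_b) * voxel_size ** 3

def sphere_occupancy(grid, center, radius):
    """Boolean occupancy of a sphere on a grid of voxel-center coordinates."""
    return np.linalg.norm(grid - center, axis=-1) <= radius

# Voxel grid over a 10 cm cube at 0.25 cm resolution (units: cm).
voxel = 0.25
axis = np.arange(-5.0, 5.0, voxel) + voxel / 2
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

hand_occ = sphere_occupancy(grid, np.array([0.0, 0.0, 0.0]), 2.0)
obj_occ = sphere_occupancy(grid, np.array([3.0, 0.0, 0.0]), 2.0)
vol = interpenetration_volume(hand_occ, obj_occ, voxel)  # lens-shaped overlap
```

Two radius-2 cm spheres with centers 3 cm apart have an analytic lens overlap of about 2.88 cm³, which the voxel estimate approximates at this resolution.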
Figure 5: Qualitative HOI generation comparison with DiffH2O and LatentHOI, exhibiting FlowHOI's improved realism and semantic compliance.
Figure 6: Real-world deployment: FlowHOI-retargeted HOI sequences guide Franka + Allegro hardware to successful execution in pouring, drinking, tilting, and squeezing tasks.
Qualitative and Robustness Analyses
Real-World and Simulation Validation
Generated HOI sequences are retargeted to Allegro hand kinematics via PyRoki, then further refined by the DexTrack policy for physically valid execution in Isaac Gym and on Franka + Allegro hardware. Empirically, successful execution hinges on sustained contact and task-compliant object motion, a critical advantage over purely kinematic or heuristic-plausibility approaches.
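The kinematic retargeting step is, at its core, an inverse-kinematics fit of robot joints to generated hand keypoints. PyRoki's actual API is not reproduced here; the damped least-squares solver below on a toy planar two-link "finger" is a hedged, minimal sketch of the idea.

```python
import numpy as np

def fk(q, l1=0.05, l2=0.04):
    """Planar 2-link finger: joint angles -> fingertip position (meters)."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q, l1=0.05, l2=0.04):
    """Analytic Jacobian of the fingertip position w.r.t. joint angles."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def retarget(target, q0, iters=100, damping=1e-3):
    """Damped least-squares IK: fit joints to a generated fingertip target.

    A stand-in for full hand retargeting (e.g. MANO keypoints -> Allegro
    joints); real retargeters solve richer constrained versions of this.
    """
    q = q0.copy()
    for _ in range(iters):
        err = target - fk(q)
        J = jacobian(q)
        # Damping keeps the step well-conditioned near singular poses.
        dq = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
        q += dq
    return q

target = np.array([0.06, 0.03])  # hypothetical generated fingertip position
q = retarget(target, q0=np.array([0.3, 0.3]))
```

The subsequent physics refinement (DexTrack in the paper) is what turns such kinematically feasible fits into dynamically executable trajectories.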
Figure 7: Comparison of kinematic, retargeted, and physics-executed HOI trajectories highlights feasibility gaps, validating the need for physically grounded evaluation.
Implications and Future Developments
FlowHOI’s semantics-grounded, flow-based HOI generation closes the gap between scalable perceptual policy learning and physically robust, transferable dexterous manipulation. Its explicit decoupling of grasping and manipulation stages and its rich scene grounding facilitate generalization across tasks and robot embodiments, forming a flexible, robot-agnostic interface for downstream control and planning.
Key implications include:
- Robust transfer: FlowHOI's intermediate HOI representation is decoupled from robot-specific actuation, enabling zero-shot or few-shot transfer between embodiments.
- Model efficiency: The conditional flow matching approach introduces a paradigm shift for fast inference in generative motion modeling, directly impacting real-time control and planning feasibility.
- Data-centric ML: The reconstruction pipeline democratizes the acquisition of high-quality interaction datasets from egocentric video, enabling scalable pretraining and improved generalization.
Potential future directions involve relaxing the dependence on accurate initial hand-object state estimation, deploying adaptive contact-aware controllers, and scaling to multi-body, mobile, or exocentric interaction priors.
Conclusion
FlowHOI represents a substantive advancement in semantics- and geometry-aware HOI motion generation for dexterous robotics. By integrating flow matching, hybrid 3D scene embedding, and a scalable reconstruction pipeline, it surpasses previous diffusion-based and vision-language approaches in physical plausibility, semantic alignment, and inference speed. The system’s success in both simulated and real-world robotic manipulation tasks substantiates the practical utility of a semantics-grounded, robot-agnostic HOI representation as an interface for next-generation embodied agents.