
FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

Published 13 Feb 2026 in cs.RO, cs.AI, and cs.CV | (2602.13444v1)

Abstract: Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.

Summary

  • The paper introduces a two-stage framework that decouples grasping and manipulation, enabling physically viable HOI synthesis for dexterous robotics.
  • It employs conditional flow matching to accelerate inference by 40× while maintaining smooth, stable hand-object trajectories.
  • Results on GRAB and HOT3D demonstrate improved physical realism, semantic fidelity, and seamless transfer to real-world robotic manipulation tasks.


Introduction

The synthesis of physically plausible, semantically aligned hand-object interaction (HOI) motions is foundational for advancing dexterous robot manipulation in unstructured environments. Existing vision-language-action (VLA) models are limited by their lack of explicit interaction representation and inability to perform long-horizon, contact-rich tasks. "FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation" (2602.13444) introduces a framework decoupling geometric grasping and semantically grounded manipulation via conditional flow matching, leveraging egocentric observations, language commands, and 3D scene context. This approach enables real-time, physically viable HOI sequence generation—substantially improving physical realism, semantic correspondence, and transferability to robotic platforms.

Method

Two-Stage Flow-Matching Architecture

FlowHOI decomposes HOI synthesis into two stages:

  1. Grasping Stage: Uses a pretrained grasping prior, fine-tuned from large-scale reconstructed egocentric HOI data, to establish contact-stable initializations for hand-object configuration. This stage focuses solely on geometry and reachability, conditioning on object mesh, initial hand-object state, and a grasp-oriented instruction.
  2. Manipulation Stage: Generates subsequent interaction trajectories, conditioning on a fused hybrid 3D scene token (geometric+semantic features), holistic global scene representation, transition hand-object state, and the full language command. A motion-text alignment loss is imposed to ensure semantic fidelity.

    Figure 1: Overview of the FlowHOI two-stage framework, decoupling grasping and manipulation phases, each leveraging task-specific conditioning.
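To make the stage-specific conditioning concrete, the two stages' inputs can be sketched as plain data containers. The field names below are illustrative assumptions, not the paper's actual interface:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class GraspStageInputs:
    """Geometry-centric conditioning for the grasping stage (illustrative fields)."""
    object_mesh: Any                # target object geometry
    initial_hand_object_state: Any  # starting hand and object poses
    grasp_instruction: str          # grasp-oriented instruction, e.g. "grasp the mug"

@dataclass
class ManipulationStageInputs:
    """Semantics-centric conditioning for the manipulation stage (illustrative fields)."""
    scene_tokens: Any               # compact hybrid 3D scene tokens (geometric + semantic)
    global_scene_feature: Any       # holistic global scene representation
    transition_state: Any           # hand-object state handed over from the grasping stage
    instruction: str                # full language command
```

The point of the split is that the grasping stage never sees the full scene or task semantics, and the manipulation stage never re-solves contact geometry from scratch.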

Conditional Flow Matching

Instead of diffusion models, FlowHOI employs conditional flow matching. This ODE-based generative paradigm accelerates inference—requiring only ≈0.16s per sequence (≈40× faster than DiffH2O)—by amortizing the transformation from noise to data along continuous probability flows parameterized by neural vector fields. Key design aspects include:

  • x-prediction as the target, improving temporal trajectory stability.
  • Motion-text alignment via contrastive InfoNCE loss between T5-encoded action descriptions and Transformer-embedded motions.
  • Hybrid 3D scene encoding, fusing geometric Concerto features and semantic SceneSplat embeddings, compressed via a Perceiver-style bottleneck for efficiency.
  • Sequential inpainting and ODE-level state clamping at the grasp-to-manipulation transition to guarantee spatiotemporal continuity.

    Figure 2: HOI data reconstruction from egocentric video: segmenting grasp/manipulate phases, reconstructing metric object mesh, and optimizing hand-object alignment for data curation.
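As a minimal sketch of the sampling side (not the paper's implementation), an Euler integrator for the flow ODE under x-prediction looks like the following. On the linear path x_t = (1 - t) * x0 + t * x1, a predicted clean sample x1_hat implies the velocity (x1_hat - x_t) / (1 - t):

```python
import numpy as np

def euler_sample(predict_x1, x0, num_steps=10):
    """Integrate the flow ODE from noise (t=0) to data (t=1) with Euler steps.

    Under x-prediction the network outputs an estimate of the clean sample x1;
    on the linear path the implied velocity is (x1_hat - x_t) / (1 - t).
    """
    x = x0.astype(float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                      # t stays strictly below 1
        x1_hat = predict_x1(x, t)       # network call (here: a stand-in)
        v = (x1_hat - x) / (1.0 - t)    # velocity implied by x-prediction
        x = x + dt * v                  # Euler step
    return x

# Sanity check with an oracle that always predicts the true clean sample:
# the integrator should land exactly on it.
target = np.array([1.0, -2.0, 0.5])
sample = euler_sample(lambda x, t: target, np.zeros(3), num_steps=8)
```

A handful of such steps means a handful of network evaluations per sequence, which is the basic source of the reported speedup over many-step diffusion sampling.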

Hand-Object Data Reconstruction

A central barrier is the scarcity of high-fidelity paired HOI sequences. FlowHOI introduces a pipeline to reconstruct such data from large-scale egocentric video, including:

  • Transition point detection via wrist kinematic analysis.
  • SAM3- and DepthAnything3-based object segmentation and metric 3D reconstruction from static pre-grasp frames.
  • MANO IK fitting to tracked hand poses, with optimized object translation for contact enforcement and non-penetration under physical constraints.
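The wrist-kinematics transition detection can be illustrated with a simple speed-threshold heuristic; the specific criterion and threshold below are assumptions for illustration, not the paper's exact rule:

```python
import numpy as np

def detect_grasp_transition(wrist_positions, fps=30.0, speed_thresh=0.05):
    """Heuristic transition detection from wrist kinematics.

    The grasp-to-manipulation boundary is taken as the first frame where
    wrist speed (m/s) falls below a threshold after the fastest approach
    frame (hypothetical criterion for illustration).
    """
    vel = np.diff(wrist_positions, axis=0) * fps      # per-frame velocity, m/s
    speed = np.linalg.norm(vel, axis=1)
    peak = int(np.argmax(speed))                      # fastest approach frame
    settled = np.where(speed[peak:] < speed_thresh)[0]
    return peak + int(settled[0]) if len(settled) else len(speed)
```

On a synthetic trajectory that approaches quickly and then holds still, this returns the first static frame, which is the behavior a real detector would need on pre-grasp video.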

The resulting dataset substantially improves grasping prior accuracy and generalization across diverse interaction topologies.

Figure 4: Full HOI data pipeline: transition frame detection, mesh reconstruction, and hand-object alignment under spatial constraints.

Experiments

Quantitative Results

Evaluations were performed on GRAB and HOT3D, encompassing metrics for physical interaction quality (interpenetration volume, contact ratio), motion quality (action recognition, diversity), and physical feasibility (Isaac Gym simulations after robot retargeting). Notable findings include:

  • On GRAB, FlowHOI delivers the best interpenetration volume (10.93 cm$^3$), highest simulation success rate (55.96%, a 1.7× improvement over the best diffusion baseline), and robust action recognition (0.95).
  • On HOT3D, semantic grounding and contact consistency are preserved despite real-world scene noise, achieving top contact ratio and sustained success rates.
  • Inference is consistently 40× faster than prior diffusion baselines, without loss of realism or diversity.

    Figure 5: Qualitative HOI generation comparison with DiffH2O and LatentHOI, exhibiting FlowHOI's improved realism and semantic compliance.


    Figure 6: Real-world deployment: FlowHOI-retargeted HOI sequences guide Franka + Allegro hardware to successful execution in pouring, drinking, tilting, and squeezing tasks.

Qualitative and Robustness Analyses

  • FlowHOI trajectories maintain robust bimanual coordination, stable contact, and context-aware collision avoidance in 3D scenes.
  • Ablations demonstrate that hybrid geometric/semantic scene tokens and T5-based semantic alignment significantly boost both action recognition and final pose accuracy.
  • x-prediction as a target eliminates temporal jitter observed with v-prediction.

    Figure 3: x-prediction yields smooth hand motion, while v-prediction introduces visually apparent pose instability and kinematic noise.
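The motion-text alignment term can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings, with NumPy arrays standing in for the T5 text features and Transformer motion features:

```python
import numpy as np

def info_nce(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (motion, text) embeddings.

    Each motion treats its matching text as the positive and all other texts
    in the batch as negatives, and vice versa for each text.
    """
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature            # (B, B) scaled cosine similarities
    labels = np.arange(len(logits))

    def xent(l):                              # row-wise cross-entropy, diagonal positives
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly paired embeddings give a much lower loss than mismatched pairs, which is the gradient signal that grounds generated motions in the instruction.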

Real-World and Simulation Validation

Generated HOI sequences are retargeted to Allegro hand kinematics via PyRoki, then further refined by the DexTrack policy for physically valid execution in Isaac Gym and on Franka + Allegro hardware. Empirically, successful execution is contingent on sustained contact and task-compliant object manipulation, a critical advantage over purely kinematic or heuristic-plausibility approaches.

Figure 7: Comparison of kinematic, retargeted, and physics-executed HOI trajectories highlights feasibility gaps, validating the need for physically grounded evaluation.
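The retargeting step can be illustrated with generic numerical IK on a toy planar two-joint finger; this is a stand-in sketch under assumed kinematics, not PyRoki's actual API or the Allegro hand model:

```python
import numpy as np
from scipy.optimize import minimize

def fk_planar_finger(q, lengths=(1.0, 0.6)):
    """Fingertip position of a toy planar 2-joint finger (unit-scale links)."""
    l1, l2 = lengths
    return np.array([
        l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
        l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1]),
    ])

def retarget_fingertip(target_tip, q0=None):
    """Find joint angles whose fingertip matches a human fingertip keypoint
    taken from a generated HOI frame (generic least-squares IK)."""
    q0 = np.zeros(2) if q0 is None else np.asarray(q0, dtype=float)
    cost = lambda q: float(np.sum((fk_planar_finger(q) - target_tip) ** 2))
    return minimize(cost, q0, method="BFGS").x
```

In the actual pipeline the retargeted kinematic trajectory is then handed to a learned tracking policy (DexTrack) so that contacts remain physically consistent during execution.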

Implications and Future Developments

FlowHOI’s semantics-grounded, flow-based HOI generation closes the gap between scalable perceptual policy learning and physically robust, transferable dexterous manipulation. Its explicit decoupling of grasping and manipulation stages and its rich scene grounding facilitate interaction generalization across tasks and robot embodiments, forming a flexible, robot-agnostic interface for downstream control and planning.

Key implications include:

  • Robust transfer: FlowHOI's intermediate HOI representation is decoupled from robot-specific actuation, enabling zero-shot or few-shot transfer between embodiments.
  • Model efficiency: The conditional flow matching approach introduces a paradigm shift for fast inference in generative motion modeling, directly impacting real-time control and planning feasibility.
  • Data-centric ML: The reconstruction pipeline democratizes the acquisition of high-quality interaction datasets from egocentric video, enabling scalable pretraining and improved generalization.

Potential future directions involve relaxing the dependence on accurate initial hand-object state estimation, deploying adaptive contact-aware controllers, and scaling to multi-body, mobile, or exocentric interaction priors.

Conclusion

FlowHOI represents a substantive advancement in semantics- and geometry-aware HOI motion generation for dexterous robotics. By integrating flow matching, hybrid 3D scene embedding, and a scalable reconstruction pipeline, it achieves physical plausibility, semantic alignment, and rapid inference exceeding previous diffusion-based and vision-language approaches. The system’s success in both simulated and real-world robotic manipulation tasks substantiates the practical utility of semantics-grounded, robot-agnostic HOI representation as an interface for next-generation embodied agents.
