InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion

Published 26 Mar 2024 in cs.CV (arXiv:2403.17422v1)

Abstract: We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction. Sampling from our model yields plausible and diverse two-hand shapes in close interaction with or without an object. Our prior can be incorporated into any optimization or learning methods to reduce ambiguity in an ill-posed setup. Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus, we propose to decompose the modeling of joint distribution into the modeling of factored unconditional and conditional single instance distribution. In particular, we introduce a diffusion model that learns the single-hand distribution unconditional and conditional to another hand via conditioning dropout. For sampling, we combine anti-penetration and classifier-free guidance to enable plausible generation. Furthermore, we establish the rigorous evaluation protocol of two-hand synthesis, where our method significantly outperforms baseline generative models in terms of plausibility and diversity. We also demonstrate that our diffusion prior can boost the performance of two-hand reconstruction from monocular in-the-wild images, achieving new state-of-the-art accuracy.

Summary

  • The paper presents a cascaded reverse diffusion framework that decomposes the complex joint distribution of two-hand interactions into simpler parts.
  • It employs classifier-free and anti-penetration guidance to balance fidelity and diversity for generating plausible hand models.
  • Benchmark metrics like FHID and KHID demonstrate superior performance, setting a new foundation for VR/AR applications and future research.

Leveraging Cascaded Reverse Diffusion for Two-Hand Interaction Generation

Introduction to Two-Hand Interaction Generation

Modeling the complex interaction of two hands, whether engaging with each other or with an object, is a formidable challenge in generative modeling. To date, efforts on two-hand interactions have largely concentrated on reconstruction from monocular images; generating such interactions, however, remains under-explored, a gap that this work, titled "InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion," aims to fill. At the core of the challenge is the combinatorial nature of hand articulations and interactions, which makes directly learning their joint distribution difficult.

Breaking Down the Complexity

The key innovation in this paper is the proposed framework that simplifies the generative process by decomposing the joint distribution of two interacting hands into separate, more manageable distributions. This decomposition elegantly reduces the learning complexity and enables the generation of highly plausible hand shapes in close interaction contexts. Specifically, the authors leverage a diffusion-based method, incorporating anti-penetration and classifier-free guidance to enhance the generative fidelity and diversity.
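The decomposition can be written compactly as follows, where $\mathbf{x}_l$ and $\mathbf{x}_r$ denote the parameters of the left and right hand (the symbol names here are ours, chosen for illustration):

```latex
% Joint two-hand distribution factored into an unconditional
% single-hand term and a conditional single-hand term:
p(\mathbf{x}_l, \mathbf{x}_r) = p(\mathbf{x}_l)\, p(\mathbf{x}_r \mid \mathbf{x}_l)
```

A single diffusion model can then cover both factors, since the conditional term reduces to the unconditional one when the condition is dropped.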

Technical Contributions

The authors introduce several noteworthy technical innovations in their approach to two-hand interaction generation. Importantly, they exploit the left-right symmetry of hands to learn a single shared distribution model for both, applying conditioning dropout to model the conditional and unconditional single-hand distributions with one network. The paper also introduces cascaded reverse diffusion, where the sampling process first generates one hand and then conditions the generation of the second hand on the first. This cascaded approach uses classifier-free guidance to balance fidelity and diversity, and further employs anti-penetration guidance to avert physically implausible intersections between the generated hand models.
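The training and sampling logic described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the denoiser, step-size rule, dimensionality, dropout probability, and guidance weight are all stand-in assumptions, and the anti-penetration guidance term is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8            # toy dimensionality of one hand's pose parameters (assumed)
T = 50           # number of reverse-diffusion steps (assumed)
W_CFG = 2.0      # classifier-free guidance weight (assumed value)
P_DROP = 0.1     # conditioning-dropout probability during training (assumed)

def denoiser(x_t, t, cond):
    """Stand-in for the learned noise predictor eps_theta(x_t, t, cond).
    cond=None plays the role of the null token for the unconditional
    branch. A toy linear map, not the paper's network."""
    c = np.zeros(D) if cond is None else 0.5 * cond
    return 0.1 * x_t + c

def training_condition(other_hand):
    """Conditioning dropout: with probability P_DROP the condition is
    replaced by the null token, so one network learns both p(x) and p(x|c)."""
    return None if rng.random() < P_DROP else other_hand

def sample_hand(cond, guidance=0.0):
    """Toy reverse-diffusion loop with classifier-free guidance:
    eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    x = rng.normal(size=D)
    for t in reversed(range(T)):
        eps_u = denoiser(x, t, None)
        if cond is None:
            eps = eps_u
        else:
            eps_c = denoiser(x, t, cond)
            eps = eps_u + guidance * (eps_c - eps_u)
        x = x - eps / T   # simplified update, not DDPM's exact posterior step
    return x

# Cascaded sampling: first hand unconditionally, then the second hand
# conditioned on the first.
hand_a = sample_hand(cond=None)
hand_b = sample_hand(cond=hand_a, guidance=W_CFG)
print(hand_a.shape, hand_b.shape)
```

The point of the cascade is that each sampling pass only ever models a single hand, so the network never has to represent the combinatorial two-hand space directly.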

Evaluation and Benchmarking

Given the absence of established benchmarks in the domain of two-hand synthesis, the authors construct a rigorous evaluation protocol to quantify the plausibility and diversity of generated hand interactions. This protocol includes metrics adapted from the generative modeling domain, such as Fréchet Hand Interaction Distance (FHID) and Kernel Hand Interaction Distance (KHID), alongside novel metrics specifically designed to quantify the physical plausibility of interaction. The resulting benchmarks not only demonstrate the superior performance of InterHandGen against existing baselines but also establish a methodological foundation for future research in this area.
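FHID follows the same recipe as FID-style metrics: fit a Gaussian to features of real and generated samples, then compute the Fréchet distance between the two Gaussians. The sketch below shows that distance computation only; the hand-interaction feature extractor the paper uses is not shown, and the random features here are placeholders.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets,
    the quantity underlying FID-style metrics such as FHID."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    diff = mu_a - mu_b
    # tr((C_a C_b)^{1/2}) via the eigenvalues of C_a @ C_b, which are
    # non-negative (up to numerical noise) for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    covmean_tr = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * covmean_tr

rng = np.random.default_rng(0)
# Matching distributions should score near zero; shifted ones much higher.
same = frechet_distance(rng.normal(size=(500, 4)), rng.normal(size=(500, 4)))
far = frechet_distance(rng.normal(size=(500, 4)),
                       rng.normal(loc=3.0, size=(500, 4)))
print(same < far)
```

KHID works analogously but replaces the Gaussian assumption with a kernel-based (MMD-style) discrepancy, which behaves better on small sample sets.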

Implications and Future Directions

The implications of InterHandGen span both practical applications and theoretical advancements in generative AI. From a practical standpoint, high-fidelity hand interaction models hold significant promise for enhancing user experiences in virtual and augmented reality environments. Theoretically, this work propels forward the understanding of complex interaction generation, shedding light on the potential of cascaded diffusion processes in tackling high-dimensional generative tasks.

Speculatively, extending the framework to incorporate dynamic interactions or applying its underlying principles to other domains of complex interaction, such as human-to-human or human-to-object interactions, could yield fascinating avenues for future research. Moreover, the versatility of the proposed model suggests potential in refining optimization or learning-based methods across a broad spectrum of applications in computer vision and beyond.

In conclusion, "InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion" presents a methodologically sound and technically innovative approach to generating plausible two-hand interactions. By decomposing the generative process and employing a cascaded diffusion strategy, this work lays a robust foundation for future explorations into complex interaction modeling and sets a new benchmark in two-hand interaction generation.
