InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
Abstract: We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction. Sampling from our model yields plausible and diverse two-hand shapes in close interaction, with or without an object. Our prior can be incorporated into any optimization or learning method to reduce ambiguity in an ill-posed setup. Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus, we propose to decompose the modeling of the joint distribution into the modeling of factored unconditional and conditional single-instance distributions. In particular, we introduce a diffusion model that learns the single-hand distribution both unconditionally and conditioned on the other hand via conditioning dropout. For sampling, we combine anti-penetration and classifier-free guidance to enable plausible generation. Furthermore, we establish a rigorous evaluation protocol for two-hand synthesis, under which our method significantly outperforms baseline generative models in terms of plausibility and diversity. We also demonstrate that our diffusion prior can boost the performance of two-hand reconstruction from monocular in-the-wild images, achieving new state-of-the-art accuracy.
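The two mechanisms named in the abstract, conditioning dropout at training time and classifier-free guidance at sampling time, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `drop_condition` and `guided_noise`, the use of `None` as the null condition, and the guidance form (1 + w) * eps_cond - w * eps_uncond (the commonly used classifier-free guidance formulation) are all assumptions made for illustration. The abstract does not specify the paper's exact weighting or its anti-penetration term, and the latter is omitted here.

```python
import random


def drop_condition(cond, p_drop, rng, null_token=None):
    """Conditioning dropout (hypothetical helper).

    With probability p_drop the condition (e.g. the other hand) is replaced
    by a null token, so one network can learn both the unconditional and the
    conditional single-hand distribution.
    """
    return null_token if rng.random() < p_drop else cond


def guided_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance in its commonly used form (assumed here).

    Combines the unconditional and conditional noise estimates element-wise:
    (1 + w) * eps_cond - w * eps_uncond. With w = 0 this reduces to the
    purely conditional estimate.
    """
    return [(1.0 + w) * c - w * u for u, c in zip(eps_uncond, eps_cond)]


# Training side: decide per sample whether the network sees the other hand.
rng = random.Random(0)
cond_for_step = drop_condition("other_hand_params", p_drop=0.1, rng=rng)

# Sampling side: blend the two noise estimates from the same network.
eps = guided_noise(eps_uncond=[0.0, 0.0], eps_cond=[1.0, 2.0], w=2.0)
```

Under the factorization described in the abstract, sampling would first draw one hand from the unconditional branch and then draw the other hand from the conditional branch using the guided estimate. Which hand is sampled first is not stated in the abstract.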