Incorporating joint-level hand pose conditioning into video diffusion models

Establish an effective method for incorporating joint-level hand pose conditioning into video diffusion models to enable precise, dexterous hand–object interactions in egocentric settings.

Background

Extended reality applications require generative video models that respond to fine-grained user motion, especially hand and finger articulation. Existing world models typically accept only coarse controls (e.g., text, keyboard, or camera motion) and cannot represent dexterous hand–object interactions.

The paper highlights that prior approaches conditioned on camera or full-body pose lack the precision needed for joint-level hand control, motivating the need for a principled mechanism to integrate tracked hand poses into video diffusion models.
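To make the conditioning problem concrete, one plausible (purely illustrative) mechanism is to flatten per-frame 3D joint positions into a vector, project it into the diffusion model's token dimension, and add a temporal position encoding so the denoiser can align each pose token with its video frame via cross-attention. The function below is a minimal sketch under these assumptions; the projection is random rather than learned, and none of the names come from the paper.

```python
import numpy as np

def encode_hand_pose(joints, d_model=64, seed=0):
    """Encode per-frame hand joint positions into conditioning tokens.

    joints: (T, J, 3) array of 3D joint positions (e.g. J=21 for a
    single tracked hand). Returns a (T, d_model) array of tokens that
    a video diffusion model could attend to via cross-attention.
    Illustrative sketch only, not the paper's method.
    """
    T, J, _ = joints.shape
    rng = np.random.default_rng(seed)
    # Stand-in for a learned linear projection (random here).
    W = rng.standard_normal((J * 3, d_model)) / np.sqrt(J * 3)
    tokens = joints.reshape(T, J * 3) @ W
    # Sinusoidal temporal position encoding, so each token carries
    # the index of the video frame it conditions.
    pos = np.arange(T)[:, None]
    dim = np.arange(d_model)[None, :]
    angle = pos / (10000 ** (2 * (dim // 2) / d_model))
    pe = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))
    return tokens + pe

# Example: 16 frames, 21 joints per frame.
tokens = encode_hand_pose(np.zeros((16, 21, 3)))
print(tokens.shape)
```

In a real system the projection would be trained jointly with the diffusion backbone, and the pose tokens would enter each denoising block alongside text or camera conditioning.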

As a result, it remains an open question how to effectively incorporate joint-level hand pose conditioning into video diffusion models.