Extending Talking Avatars to Grounded Human-Object Interaction (GHOI)

Establish a text-driven grounded human-object interaction (GHOI) capability for talking avatar video generation that enables avatars to perform text-aligned interactions with surrounding objects.
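To fix terms, a minimal sketch of the task's input-output contract follows, assuming a PyTorch-style setting. All names here (GHOIRequest, generate_ghoi_video) are hypothetical illustrations, not the paper's API.

```python
# Hypothetical sketch of the GHOI task's I/O contract: a reference image
# grounds the scene, audio drives speech, and a textual command specifies
# the object interaction to perform. Names are illustrative only.
from dataclasses import dataclass

import torch


@dataclass
class GHOIRequest:
    reference_image: torch.Tensor  # (3, H, W) scene containing the avatar and objects
    speech_audio: torch.Tensor     # (num_samples,) waveform driving the talking face
    command: str                   # e.g. "pick up the cup on the desk"


def generate_ghoi_video(req: GHOIRequest, num_frames: int = 64) -> torch.Tensor:
    """Placeholder: return a video tensor (num_frames, 3, H, W) in which the
    avatar speaks to the audio while executing the commanded interaction."""
    _, h, w = req.reference_image.shape
    return torch.zeros(num_frames, 3, h, w)
```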

Background

The paper introduces Grounded Human-Object Interaction (GHOI) for talking avatars, emphasizing the need for environmental perception and the ability to execute text-aligned interactions within the scene. Existing audio-driven, pose-driven, and subject-consistent approaches either lack explicit modeling of objects and environments or require impractical control signals, leading to a control-quality trade-off and difficulty grounding interactions in the provided reference image.

The authors propose InteractAvatar, a dual-stream framework that decouples perception and planning from video synthesis, and introduce modules to improve scene-aware motion planning and audio-interaction-aware rendering. Despite these advances, the extension from simple talking avatar generation to fully grounded human-object interactions is explicitly identified as an open challenge.
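As an illustration of this decoupling, the following sketch separates a perception-and-planning stream from a rendering stream, assuming PyTorch. The module names, shapes, and fusion choices are assumptions for exposition, not the paper's actual InteractAvatar architecture.

```python
# Illustrative dual-stream decoupling (hypothetical modules, not the paper's):
# stream 1 perceives the scene and plans a text-aligned motion; stream 2
# renders frames conditioned on that plan and the driving audio.
import torch
import torch.nn as nn


class PerceptionPlanner(nn.Module):
    """Stream 1: perceive the reference scene and plan a text-aligned motion."""

    def __init__(self, dim: int = 256, plan_len: int = 16):
        super().__init__()
        self.scene_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.plan_queries = nn.Parameter(torch.randn(plan_len, dim))

    def forward(self, image: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Tokenize the scene, append text tokens and learned plan queries,
        # then fuse; the plan-query outputs become the motion plan.
        scene = self.scene_encoder(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        queries = self.plan_queries.expand(image.size(0), -1, -1)     # (B, plan_len, dim)
        fused = self.fuse(torch.cat([scene, text_emb, queries], dim=1))
        return fused[:, -queries.size(1):]  # (B, plan_len, dim) motion plan


class InteractionRenderer(nn.Module):
    """Stream 2: synthesize frames conditioned on the plan and the audio."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_frame = nn.Linear(dim, 3 * 64 * 64)  # toy 64x64 decoder head

    def forward(self, plan: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # Audio tokens attend to the motion plan so that lip sync and
        # interaction timing share one conditioning pathway.
        frames, _ = self.cross(audio_emb, plan, plan)
        return self.to_frame(frames).view(frames.size(0), -1, 3, 64, 64)
```

One motivation for keeping the plan as a compact token sequence is that the planning stream can then be supervised or inspected on scene understanding alone, while the rendering stream focuses on audio-synchronized, interaction-aware synthesis.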

References

"Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects."