Generating 3D humans that functionally interact with 3D scenes

Develop methods to generate 3D humans that functionally interact with 3D scenes, producing interactions that respect object functionality, with applications in embodied AI, robotics, and interactive content creation.

Background

The paper introduces FunHSI, a training-free framework for generating functionally correct human-scene interactions from open-vocabulary task prompts. The authors emphasize that current approaches often lack explicit reasoning about object functionality and corresponding human-scene contact, which leads to implausible or functionally incorrect results.

They contrast data-driven methods, which depend on large paired datasets, with zero-shot/training-free methods that leverage vision-language models (VLMs), noting that existing solutions mainly handle coarse, general interactions (e.g., sitting or walking) and struggle with fine-grained functional tasks (e.g., operating knobs or switches). This context motivates the broader open problem of synthesizing functionally aware 3D human-scene interactions.

References

"Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation."

Open-Vocabulary Functional 3D Human-Scene Interaction Generation (2601.20835, Liu et al., 28 Jan 2026), Abstract (page 1)