Achieving fine-grained spatiotemporal control in human motion generation

Develop human motion generation models that achieve simultaneous fine-grained control over spatial structure at the level of individual body parts and over temporal dynamics across the motion sequence, enabling precise alignment of generated motions with detailed spatiotemporal constraints.

Background

The paper surveys recent advances in controllable human motion generation across modalities (text, interactions, key-pose guidance, audio), noting that current approaches either focus on high-level sequence- or action-level control or offer only limited part-level control without temporal alignment. The authors emphasize that existing datasets typically lack temporally aligned, part-level annotations, and that model designs have not unified fine-grained spatial and temporal control.

Within this context, the paper explicitly identifies the absence of unified fine-grained spatiotemporal control as a persistent challenge, motivating their FrankenMotion framework and FrankenStein dataset to address atomic body-part and action-level conditioning along time. The stated open problem frames the broader need for methods that can provide precise control over individual body parts while maintaining coherent temporal structure.

References

"Despite these advances, achieving fine-grained spatial and temporal control in motion generation remains a challenging open problem."

FrankenMotion: Part-level Human Motion Generation and Composition (2601.10909, Li et al., 15 Jan 2026), Section 2 (Related Work), "Motion generation with control".