Obtaining action–video pairs for training action-conditioned video models
Construct large-scale datasets that pair continuous 3D physical actions (such as forces, torques, and robot end-effector commands) with corresponding videos, suitable for training action-conditioned video generation models. The central difficulty is that the precise physical actions underlying the motion observed in a video usually cannot be inferred after the fact, so such pairs are hard to recover from existing video corpora.
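To make the target data concrete, the sketch below fixes one possible record schema for a single pair. It is a minimal illustration under stated assumptions, not a format used by any existing dataset: the field names (`frames`, `forces`, `torques`, `ee_commands`) and their shapes are hypothetical choices for per-frame-aligned action channels.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ActionVideoPair:
    """One hypothetical training example: a video clip plus per-frame
    continuous physical actions aligned to the frames."""

    frames: np.ndarray       # (T, H, W, 3) uint8 RGB frames
    forces: np.ndarray       # (T, 3) float, applied force in newtons
    torques: np.ndarray      # (T, 3) float, applied torque in N*m
    ee_commands: np.ndarray  # (T, 7) float, e.g. an end-effector pose target
    dt: float                # seconds between consecutive frames

    def __post_init__(self) -> None:
        # Every action channel must align with the video's T frames.
        t = self.frames.shape[0]
        assert self.forces.shape == (t, 3)
        assert self.torques.shape == (t, 3)
        assert self.ee_commands.shape == (t, 7)
```

One reason the pairing problem is hard to sidestep: records of this shape can be logged directly in simulation or teleoperation, where the actions are known at capture time, but not reconstructed from ordinary video, where they were never recorded.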
References
Meanwhile, attempts to directly encode physical actions as tokens face two critical obstacles: first, actions such as forces and torques are continuous and unbounded, resisting the tokenization schemes that suit bounded, low-dimensional inputs like camera poses; second, obtaining action-video pairs for training remains an open problem, as inferring the precise physical actions that caused the motions observed in a video is often infeasible.
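To illustrate the first obstacle concretely (my own sketch, not from the source): uniform-bin tokenization works for a bounded quantity such as a camera yaw angle, but for an unbounded force any fixed bin range must clip, so distinct large actions collapse onto the same token. The function name, ranges, and bin count below are illustrative assumptions.

```python
import numpy as np


def tokenize_uniform(x: np.ndarray, lo: float, hi: float,
                     n_bins: int = 256) -> np.ndarray:
    """Map continuous values to discrete tokens by uniform binning over [lo, hi]."""
    x = np.clip(x, lo, hi)        # out-of-range values are silently clipped
    t = (x - lo) / (hi - lo)      # normalize to [0, 1]
    return np.minimum((t * n_bins).astype(int), n_bins - 1)


# Bounded input (camera yaw in [-pi, pi]): every value gets a faithful token.
yaw_token = tokenize_uniform(np.array([1.0]), -np.pi, np.pi)

# Unbounded input (force in newtons): whatever range we pick, values beyond
# it alias. Here a 50 N and a 500 N push become the same token -- the failure
# mode that pushes toward continuous conditioning instead of tokenization.
force_tokens = tokenize_uniform(np.array([50.0, 500.0]), -20.0, 20.0)
assert force_tokens[0] == force_tokens[1]
```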