Obtaining action–video pairs for training action‑conditioned video models

Construct large-scale datasets that pair continuous 3D physical actions (such as forces, torques, and robot end-effector commands) with corresponding videos, suitable for training action-conditioned video generation models. The core challenge is that the precise physical actions underlying the motion observed in a video often cannot be inferred after the fact.
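
As a purely illustrative sketch of what one record in such a dataset might contain, the dataclass below pairs a time-aligned action trajectory with its video. All field names, shapes, and units are assumptions for illustration, not a schema from the paper:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ActionVideoPair:
    """One hypothetical training record: a time-aligned action trajectory
    and the video showing its physical consequences (T steps, T frames)."""
    forces: np.ndarray       # (T, 3) applied forces in newtons; continuous and unbounded
    torques: np.ndarray      # (T, 3) applied torques in newton-meters
    ee_commands: np.ndarray  # (T, 7) end-effector pose targets (xyz + quaternion)
    frames: np.ndarray       # (T, H, W, 3) RGB frames, uint8
```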

Background

The paper argues that directly encoding continuous, unbounded 3D physical actions (e.g., forces and torques) as tokens is difficult, and that a key bottleneck for training action-conditioned video models is the lack of datasets pairing such actions with their visual consequences.
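To make the tokenization obstacle concrete, here is a minimal sketch (mine, not the paper's) of a uniform binning tokenizer. It handles a quantity with a known, bounded range acceptably, but for unbounded forces any fixed range clips large magnitudes into the same token; `bin_tokenize` and the specific ranges are hypothetical:

```python
import numpy as np


def bin_tokenize(x: np.ndarray, lo: float, hi: float, n_bins: int = 256) -> np.ndarray:
    """Uniformly discretize values in [lo, hi] into integer tokens."""
    x_clipped = np.clip(x, lo, hi)  # unbounded inputs must be clipped first
    return ((x_clipped - lo) / (hi - lo) * (n_bins - 1)).round().astype(np.int64)


# A bounded quantity (e.g., a normalized camera yaw) tokenizes without clipping:
yaw = np.array([-0.9, 0.0, 0.7])
print(bin_tokenize(yaw, lo=-1.0, hi=1.0))          # [ 13 128 217]

# An unbounded force has no natural [lo, hi]; large magnitudes collapse together:
forces = np.array([5.0, 50.0, 500.0, 5000.0])
print(bin_tokenize(forces, lo=-100.0, hi=100.0))   # 500 N and 5000 N map to the same token
```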

To bypass this data requirement, the proposed RealWonder system uses physics simulation to translate action inputs into visual motion cues (optical flow and coarse RGB), enabling training with only flow–video pairs. Nevertheless, the authors explicitly note that acquiring genuine action–video pairs remains an open problem in the field.
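The sketch below illustrates the shape of such a simulation-mediated pipeline under loose assumptions. Every function here (`simulate_scene`, `render_motion_cues`, `generate_video`) is a hypothetical stub standing in for the components the summary describes, not RealWonder's actual interface:

```python
import numpy as np


def simulate_scene(initial_frame: np.ndarray, actions: np.ndarray) -> list[dict]:
    """Hypothetical physics-simulator stand-in: rolls the actions forward and
    returns per-step object poses. A real system would use a rigid-body engine."""
    return [{"pose": np.eye(4)} for _ in range(len(actions))]  # placeholder poses


def render_motion_cues(states: list[dict]) -> tuple[np.ndarray, np.ndarray]:
    """Convert simulated states into the visual conditioning signals:
    dense optical flow and a coarse RGB render (shapes illustrative)."""
    T, H, W = len(states), 64, 64
    flow = np.zeros((T, H, W, 2), dtype=np.float32)      # (dx, dy) per pixel
    coarse_rgb = np.zeros((T, H, W, 3), dtype=np.uint8)  # rough renders
    return flow, coarse_rgb


def generate_video(initial_frame, flow, coarse_rgb):
    """Placeholder for a flow-conditioned video model trained on flow-video pairs."""
    return coarse_rgb  # a real model would synthesize photorealistic frames


# Actions never reach the video model directly; only their simulated visual
# consequences do, so no action-video pairs are needed to train the generator.
actions = np.random.randn(16, 6)  # e.g., concatenated force + torque per step
frame0 = np.zeros((64, 64, 3), dtype=np.uint8)
flow, coarse = render_motion_cues(simulate_scene(frame0, actions))
video = generate_video(frame0, flow, coarse)
```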

References

Meanwhile, attempts to directly encode physical actions as tokens face two critical obstacles: first, actions like forces and torques are continuous and unbounded, resisting the tokenization schemes that work for discrete inputs like camera poses; second, obtaining action-video pairs for training remains an open problem, as inferring the precise physical actions that caused observed motions in videos is often infeasible.

RealWonder: Real-Time Physical Action-Conditioned Video Generation (2603.05449 - Liu et al., 5 Mar 2026) in Section 1 (Introduction)