Bidirectional vs. Autoregressive Architectures for World Action Models
Determine whether bidirectional or autoregressive architectures are better suited for World Action Models (WAMs) that are built on pretrained video diffusion backbones and jointly predict future video frames and robot actions. Compare the two with respect to modality alignment between video and action, error accumulation in closed-loop control, and inference efficiency.
References
Pretrained video diffusion models offer rich spatiotemporal priors learned from web-scale data, making them attractive backbones for robot policies. However, converting these models into effective World Action Models (WAMs) presents three key challenges: (1) Video-action alignment: jointly predicting video and actions requires tight coupling between visual futures and motor commands, yet naively combining separate video and action heads can lead to misalignment; (2) Architectural design: it remains unclear whether bidirectional or autoregressive architectures are better suited for WAMs, a choice with implications for modality alignment, error accumulation, and inference efficiency; and (3) Real-time inference: video diffusion models require iterative denoising over high-dimensional latent spaces, making them prohibitively slow for closed-loop control.
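
To make the architectural distinction in challenge (2) concrete, the following is a minimal sketch of the attention masks usually associated with the two designs. It is an illustration under stated assumptions, not the implementation of any particular WAM: the function names, the grouping of tokens into per-timestep "chunks" (each assumed to hold one step's video latent tokens and action tokens together), and the block-causal form of the autoregressive mask are all hypothetical choices made here for clarity.

```python
# Illustrative sketch only: names and token layout are assumptions,
# not taken from any specific World Action Model.
# Convention: mask[q][k] is True when query token q may attend to key token k.


def bidirectional_mask(num_chunks, tokens_per_chunk):
    """Full attention: every token sees every other token, past and future."""
    n = num_chunks * tokens_per_chunk
    return [[True] * n for _ in range(n)]


def block_causal_mask(num_chunks, tokens_per_chunk):
    """Autoregressive over chunks: a token sees its own chunk and earlier ones.

    Attention inside a chunk stays bidirectional, so the (assumed) video and
    action tokens of the same timestep can still attend to each other.
    """
    n = num_chunks * tokens_per_chunk
    chunk_of = [i // tokens_per_chunk for i in range(n)]
    return [[chunk_of[k] <= chunk_of[q] for k in range(n)] for q in range(n)]


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    full = bidirectional_mask(num_chunks=3, tokens_per_chunk=2)
    causal = block_causal_mask(num_chunks=3, tokens_per_chunk=2)
    print("bidirectional pairs:", count_allowed(full))
    print("block-causal pairs:", count_allowed(causal))
    for row in causal:
        print("".join("x" if allowed else "." for allowed in row))
```

Under this sketch, the bidirectional mask lets every timestep condition on the whole predicted horizon, whereas the block-causal mask never lets an earlier chunk depend on a later one. That second property is what allows an autoregressive model to emit chunks one at a time and to reuse cached keys and values for chunks it has already generated, a standard feature of causal attention, while each newly generated chunk must condition on the model's own earlier outputs.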