Bidirectional vs. Autoregressive Architectures for World Action Models

Determine whether bidirectional or autoregressive architectures are better suited to World Action Models (WAMs) built on pretrained video diffusion backbones, which jointly predict future video frames and robot actions. The comparison should cover three criteria: modality alignment between video and action, error accumulation in closed-loop control, and inference efficiency.

Background

The paper introduces World Action Models (WAMs) that jointly predict video and actions using a pretrained video diffusion backbone. A central design decision concerns whether the model should be bidirectional (processing fixed-length sequences) or autoregressive (leveraging KV caching and visual history).
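The difference between the two designs can be illustrated through their attention masks. The following is a minimal sketch, not the paper's implementation: it assumes tokens are grouped by frame and that the autoregressive variant uses a frame-level (block-causal) mask, a common choice that the source does not confirm. The function names and the `tokens_per_frame` parameter are hypothetical.

```python
def bidirectional_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Full attention: every token attends to every token in the fixed-length window."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Frame-level causal attention (assumed variant): token i attends to token j
    only if j belongs to the same frame as i or to an earlier frame."""
    n = num_frames * tokens_per_frame
    return [
        [(j // tokens_per_frame) <= (i // tokens_per_frame) for j in range(n)]
        for i in range(n)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    full = bidirectional_mask(num_frames=3, tokens_per_frame=2)
    causal = block_causal_mask(num_frames=3, tokens_per_frame=2)
    print(attended_pairs(full), attended_pairs(causal))
```

Under the block-causal mask, the keys and values of past frames never depend on later frames, which is the property that lets an autoregressive model cache them (KV caching) rather than recompute them at each step. The bidirectional mask has no such property, so the whole fixed-length window is processed together.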

The authors highlight trade-offs: bidirectional models may suffer from modality alignment issues and require subsampling that distorts native frame rates, while autoregressive models can preserve alignment and enable efficient inference but may raise concerns about error accumulation. Which architecture is fundamentally better for WAMs remains unresolved, and the answer affects practical deployment and generalization.

References

Pretrained video diffusion models offer rich spatiotemporal priors from web-scale data, making them attractive backbones for robot policies. However, converting these models into effective World Action Models (WAMs) presents three key challenges: (1) Video-action alignment: jointly predicting video and actions requires tight coupling between visual futures and motor commands, yet naively combining separate video and action heads can lead to misalignment; (2) Architectural design: it remains unclear whether bidirectional or autoregressive architectures are better suited for WAMs, with implications in modality alignment, error accumulation, and inference efficiency; and (3) Real-time inference: video diffusion models require iterative denoising across high-dimensional latent spaces, making them prohibitively slow for closed-loop control.

World Action Models are Zero-shot Policies (2602.15922 - Ye et al., 17 Feb 2026), Section 3 (Method), opening paragraph