Cosmos Policy

A lightning talk on adapting Cosmos-Predict2 for robotic control via latent frame injection and unified planning.
Script
If a machine can vividly dream the future with perfect physics, does it already know how to act in the present? The authors of this paper explore whether a generative video model—trained on millions of clips—can become a robot brain without complex surgery. They introduce Cosmos Policy, a method that turns video prediction directly into motor control and planning.
Previous attempts to adapt video models often grafted extra modules onto the network, adding complexity and requiring multiple training stages. In contrast, this work changes nothing about the model's architecture. Instead, it alters the data structure itself to seamlessly blend robot actions with video frames.
This visual demonstrates the core mechanism called 'Latent Frame Injection'. The model treats proprioception vectors, action chunks, and value estimates exactly like video frames by encoding them into the existing latent space. This allows a single unified model to process observations and output actions using the same diffusion process used for generating video.
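Below is a minimal sketch of that injection idea, purely for intuition. The latent shapes, helper names, and padding scheme are assumptions for illustration, not the paper's actual implementation; the point is that low-dimensional robot signals get packed into tensors shaped like video-frame latents so the unmodified diffusion backbone can denoise them alongside real frames.

```python
import torch

# Assumed latent frame shape; the real tokenizer's dimensions may differ.
LATENT_C, LATENT_H, LATENT_W = 16, 32, 32
FRAME_NUMEL = LATENT_C * LATENT_H * LATENT_W

def vector_to_latent_frame(vec: torch.Tensor) -> torch.Tensor:
    """Zero-pad a flat vector (e.g. an action chunk) into one pseudo 'frame'."""
    padded = torch.zeros(FRAME_NUMEL)
    padded[: vec.numel()] = vec.flatten()
    return padded.view(LATENT_C, LATENT_H, LATENT_W)

def build_latent_sequence(frame_latents, proprio, action_chunk, value):
    """Concatenate real frame latents with injected pseudo-frames along time."""
    injected = torch.stack([
        vector_to_latent_frame(proprio),       # current joint state
        vector_to_latent_frame(action_chunk),  # actions to be predicted/denoised
        vector_to_latent_frame(value),         # scalar value estimate
    ])
    return torch.cat([frame_latents, injected], dim=0)  # (T + 3, C, H, W)

# Toy example: 2 observed frame latents, 7-DoF proprioception, 16-step action chunk.
seq = build_latent_sequence(
    frame_latents=torch.randn(2, LATENT_C, LATENT_H, LATENT_W),
    proprio=torch.randn(7),
    action_chunk=torch.randn(16, 7),
    value=torch.tensor([0.5]),
)
print(seq.shape)  # torch.Size([5, 16, 32, 32])
```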
By training on a mix of expert demonstrations and self-generated rollouts, the model learns three functions simultaneously: acting, predicting the future, and estimating value. This trifecta enables the robot to 'think before it acts' at test time, sampling multiple action chunks and selecting the one with the highest predicted success.
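Here is a hedged sketch of that "think before acting" loop. The three callables stand in for the unified model's acting, world-modeling, and value-estimation roles; their names and signatures are placeholders, not the paper's API.

```python
import torch

def select_best_action_chunk(obs, sample_chunk, imagine, estimate_value,
                             num_candidates=8):
    """Sample candidate action chunks, imagine each outcome, keep the best-valued one."""
    best_chunk, best_value = None, float("-inf")
    for _ in range(num_candidates):
        chunk = sample_chunk(obs)       # one candidate action chunk from the policy role
        future = imagine(obs, chunk)    # predicted future frames under that chunk
        value = estimate_value(future)  # predicted task success for that imagined future
        if value > best_value:
            best_chunk, best_value = chunk, value
    return best_chunk, best_value

# Toy usage with random stand-ins for the three roles.
obs = torch.randn(2, 16, 32, 32)
chunk, value = select_best_action_chunk(
    obs,
    sample_chunk=lambda o: torch.randn(16, 7),
    imagine=lambda o, a: torch.randn(4, 16, 32, 32),
    estimate_value=lambda f: torch.rand(()).item(),
)
print(chunk.shape, value)
```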
The results are compelling: nearly 99% success on the LIBERO simulation suite and substantial gains over baselines on real-world ALOHA tasks. However, the planning capability comes at a cost: it adds roughly 5 seconds of inference latency and requires significant self-generated rollout data to fine-tune the world model effectively.
Cosmos Policy proves that with the right data representation, we can repurpose the physics priors of video generators for precise robotic control. For more details on this unified approach to planning and acting, check out the full paper or visit EmergentMind.com.