Walk through Paintings: Egocentric World Models from Internet Priors
A lightning talk on how researchers transform passive video diffusion models into action-conditioned world models using lightweight fine-tuning.

Script
Imagine trying to teach a robot to navigate, but you only have videos of people walking—with no record of the specific decisions they made to turn left or right. This paper tackles that exact information gap by converting passive internet videos into interactive world models without requiring massive amounts of expensive action-labeled data.
The core problem driving this research is that while we have petabytes of passive video on the internet, we have very little data that pairs video with specific agent actions. The authors aim to bridge this divide by taking powerful video diffusion models, which understand how the world looks, and teaching them to understand how the world reacts to movement.
To solve this, the researchers introduce a method called EgoWM. Instead of retraining a model from scratch, they piggyback on the existing architecture of standard video diffusion models. Specifically, they inject action signals directly into the model's timestep-conditioning pathway—the part of the network that tracks the flow of time during generation.
This diagram illustrates the mechanism: the action sequence is processed into an embedding and added to the standard timestep embedding. Because both U-Nets and Transformer backbones already use these channels to modulate generation, this approach allows the model to learn action-following dynamics with minimal changes to its vast internal knowledge.
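The idea can be sketched in a few lines. This is a minimal illustrative module, not the paper's actual code: the layer sizes, the MLP design, and the module name `ActionConditioner` are all assumptions; the only part taken from the talk is that an action embedding is summed with the timestep embedding before it modulates the backbone.

```python
import torch
import torch.nn as nn

class ActionConditioner(nn.Module):
    """Hypothetical sketch of the conditioning pathway described above:
    embed the action signal and add it to the diffusion timestep
    embedding, so existing modulation layers (e.g. AdaLN / FiLM) see
    actions "for free" without architectural changes."""

    def __init__(self, action_dim: int, embed_dim: int):
        super().__init__()
        # Small MLP projecting raw actions into the timestep-embedding space.
        self.action_mlp = nn.Sequential(
            nn.Linear(action_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, timestep_emb: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # Both tensors are (batch, embed_dim); the sum flows into the
        # backbone wherever the timestep embedding already did.
        return timestep_emb + self.action_mlp(actions)

# Example: a 6-DoF action vector conditioning a 128-dim timestep embedding.
cond = ActionConditioner(action_dim=6, embed_dim=128)
t_emb = torch.randn(2, 128)
acts = torch.randn(2, 6)
out = cond(t_emb, acts)  # shape (2, 128), same as the timestep embedding
```

Because the injection point already exists in both U-Net and Transformer backbones, only the small action MLP is new; the pretrained weights are left largely untouched, which is what makes the fine-tuning lightweight.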
Evaluating these models is tricky because a video can look sharp but still be physically impossible. The authors introduced a new metric called the Structural Consistency Score, or SCS, which specifically measures whether walls and objects stay stable during movement. Using this, they outperformed the leading baseline, Navigation World Models, by up to 80 percent while running nearly 6 times faster.
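The paper's exact SCS definition is not given in this talk, but the intuition, that scene structure should stay put from frame to frame, can be illustrated with a simple stand-in. Everything here (edge extraction via gradient magnitude, IoU averaging, the threshold) is an assumption for illustration, not the authors' metric.

```python
import numpy as np

def structural_consistency(frames: np.ndarray, thresh: float = 0.1) -> float:
    """Hypothetical stand-in for a structural-consistency score.

    frames: (T, H, W) grayscale video in [0, 1]. We extract coarse
    structure as a binary edge map per frame, then average the IoU of
    edge maps between consecutive frames. Stable walls and objects
    keep their edges in place, so the score stays near 1.0; geometry
    that melts or drifts drives it toward 0.0.
    """
    edges = []
    for f in frames:
        gy, gx = np.gradient(f)                 # spatial gradients
        edges.append(np.hypot(gx, gy) > thresh)  # binary edge map
    scores = []
    for a, b in zip(edges[:-1], edges[1:]):
        union = np.logical_or(a, b).sum()
        inter = np.logical_and(a, b).sum()
        scores.append(inter / union if union else 1.0)
    return float(np.mean(scores))

# A perfectly static clip (three identical frames) scores 1.0.
frame = np.linspace(0.0, 1.0, 64).reshape(8, 8)
clip = np.stack([frame, frame, frame])
print(structural_consistency(clip))  # 1.0
```

A metric like this complements sharpness-based scores: a video can look crisp per-frame while its geometry is physically impossible across frames, which is exactly the failure mode the talk says SCS was designed to catch.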
By successfully combining internet-scale visual priors with lightweight action conditioning, this paper offers a scalable path toward general-purpose world models. It demonstrates that we don't need to choose between the breadth of web data and the control of robotic data; we can effectively have both.