AdaWorld: Learning Adaptable World Models with Latent Actions
Abstract: World models aim to learn action-controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions, which hinders their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.
Explain it Like I'm 14
What is this paper about?
This paper introduces AdaWorld, a kind of “world model.” A world model is like a smart simulator: given what you do (your actions), it predicts what will happen next in a video or game. AdaWorld’s big idea is to learn a hidden, compact “action code” from ordinary videos (with no labels) and use that to make the simulator easy to adapt to new environments and new action sets with very little extra training.
What questions are they trying to answer?
The authors focus on three simple questions:
- Can a model learn about actions just by watching videos, without needing lots of labeled data?
- If it learns these hidden “action codes,” can it use them to control and predict what happens in many different environments (like different games or real-world scenes)?
- Can this make adapting to new tasks faster and cheaper (fewer examples, fewer training steps), while still planning and simulating well?
How did they do it? (Methods explained simply)
Learning “latent actions” from videos
- Think of two frames in a video: frame at time t, and the next frame at time t+1.
- The model learns a tiny “secret code” that captures what changed between those two frames — that’s the latent action. It’s “latent” because it’s hidden inside the model, not a human-written label like “move left.”
- To force the code to focus on what matters (the action, not background colors or textures), they use an “information bottleneck.” This means the code is very small, so it can’t carry everything; it must carry the most important change (like the movement).
- They train this with an autoencoder (a kind of compression-and-reconstruction system) and a variation of a VAE (Variational Autoencoder) called a β-VAE. The β part lets them control how strict the compression is: more strict pushes the code to represent only the essentials; less strict lets it carry more details.
Analogy: Imagine watching a superhero flip a switch. The latent action is like a tiny note that just says “the switch was flipped,” ignoring the wallpaper and the lighting, so the model focuses on the cause of change, not the background.
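To make the bottleneck idea concrete, here is a minimal numpy sketch. It is not the paper's actual architecture (which uses learned neural encoders on real video frames); the random projection `W` stands in for a trained encoder, and the sizes are illustrative assumptions. The key points it shows are (1) the code is tiny compared to the frames, and (2) the β term in the β-VAE objective controls how strict the compression is.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_latent_action(frame_t, frame_t1, latent_dim=4):
    """Toy bottleneck: summarize the CHANGE between consecutive frames
    as a tiny code. W is a stand-in for a learned encoder network."""
    diff = (frame_t1 - frame_t).ravel()
    W = rng.standard_normal((latent_dim, diff.size)) / np.sqrt(diff.size)
    return W @ diff  # latent action: a few numbers describing "what changed"

def beta_vae_objective(recon_error, mu, logvar, beta=4.0):
    """beta-VAE loss: reconstruction error + beta * KL(q(z|x) || N(0, I)).
    A larger beta tightens the bottleneck, pushing the code to keep only
    the essentials (the action), not textures or background."""
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon_error + beta * kl
```

Note how a 64-pixel frame pair collapses into a 4-number code: the code physically cannot store the wallpaper, so training pressure forces it to store the switch-flip.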
Training the world model with those actions
- After the model learns how to extract latent actions from videos, they train a world model that predicts the next frame given:
- The current frame,
- A short memory of recent frames,
- And the latent action code.
- They use a “diffusion model” to make each predicted frame look realistic. Diffusion models are like un-blurring a noisy picture step-by-step until it looks right.
- The world model predicts one frame at a time and feeds its own predictions forward. This step-by-step style is called “autoregressive,” like writing a story one sentence at a time, using the last sentence to guide the next.
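The autoregressive loop above can be sketched in a few lines. This is a schematic, not the paper's implementation: `world_model` is a placeholder for the full diffusion model (which internally denoises a noisy frame step by step), and the `memory` size is an illustrative assumption. What it captures is the feed-forward structure: each predicted frame joins the context used to predict the next one.

```python
import numpy as np

def rollout(world_model, initial_frames, latent_actions, memory=3):
    """Autoregressive rollout: predict one frame at a time, conditioning
    on a short memory of recent frames plus the latent action, then feed
    the prediction back in as context for the next step."""
    frames = list(initial_frames)
    for action in latent_actions:
        context = frames[-memory:]                    # short memory of recent frames
        frames.append(world_model(context, action))   # one (diffusion) prediction step
    return frames
```

A usage example with a toy stand-in model, where each "frame" is just shifted by the action:

```python
toy_model = lambda ctx, a: ctx[-1] + a
frames = rollout(toy_model, [np.zeros((2, 2))], [np.ones((2, 2))] * 3)
# frames grows from 1 to 4 entries, each built from the previous prediction
```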
Making it adaptable
- Because the world model was trained to use latent actions (instead of fixed, human-defined labels), adapting to new environments is mostly about finding the right latent action codes for those new actions.
- With just a few examples of a new action (say, “jump”), they:
- Extract the latent actions from those examples,
- Average them to get a stable “action embedding,”
- Plug that into the world model and do minimal fine-tuning.
- The model can also “transfer” actions: if it sees a demonstration (like “push object forward” in one video), it can extract that latent action and replay it in a different scene without retraining.
- Because the latent action space is continuous (not limited to a small set of discrete buttons), they can “mix” actions by blending their codes to create new behaviors (action composition).
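The adaptation recipe in the list above is simple enough to sketch directly. Assuming latent actions are just vectors (as in the paper's continuous action space), "learning" a new action from a few demos is averaging their extracted codes, and composing actions is blending codes; the function names here are illustrative, not from the paper.

```python
import numpy as np

def action_embedding(latent_codes):
    """Average the latent actions extracted from a few demonstrations of
    the same action (e.g. several 'jump' clips) into one stable embedding
    that can be plugged into the world model with minimal fine-tuning."""
    return np.mean(np.stack(latent_codes), axis=0)

def compose_actions(emb_a, emb_b, alpha=0.5):
    """Because the latent action space is continuous, blending two
    embeddings yields a new in-between behavior (action composition)."""
    return alpha * emb_a + (1.0 - alpha) * emb_b
```

Averaging smooths out demo-to-demo noise (camera jitter, slightly different jump heights) while keeping the shared signal, which is why a handful of examples can suffice.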
What did they find? (Main results)
Here are the key findings in plain language:
- Action transfer without retraining: The model can watch a video of an action and reuse that action in a different context (like a new game level or viewpoint) by applying the extracted latent action sequence.
- Faster adaptation with few samples: Compared to traditional methods, AdaWorld needs fewer labeled examples and fewer training steps to adapt to new environments (like Habitat, Minecraft, and DMLab) and still produces higher-quality simulations.
- Better visual planning: Using the world model to “imagine” and plan steps (via model predictive control), AdaWorld achieved higher success rates in tasks from the Procgen benchmark than common baselines and even classical Q-learning in the same limited-data setting.
- Flexible control: The continuous latent action space allows composing and creating new actions by averaging or interpolating latent action codes.
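For the planning result, here is a minimal sketch of random-shooting model predictive control with a world model. It is a generic MPC template under simplifying assumptions (scalar states, a given reward function, a small set of candidate latent actions), not AdaWorld's exact planner: sample action sequences, imagine each rollout in the model, and execute the first action of the best one.

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_with_mpc(world_model, state, candidate_actions, reward_fn,
                  horizon=5, n_samples=32):
    """Random-shooting MPC: sample candidate latent-action sequences,
    roll each out inside the world model ("imagining" the future),
    score the imagined trajectory, and return the first action of the
    highest-scoring sequence."""
    best_score, best_first = -np.inf, None
    for _ in range(n_samples):
        seq = [candidate_actions[rng.integers(len(candidate_actions))]
               for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:                 # imagined rollout, no real interaction
            s = world_model(s, a)
            total += reward_fn(s)
        if total > best_score:
            best_score, best_first = total, seq[0]
    return best_first
```

In practice the agent replans after every executed action, so small model errors do not compound over the whole task.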
Why this is important:
- It shows that action-aware pretraining (using latent actions learned from videos) makes world models more adaptable, saving time and data while improving simulation and planning in new places.
What does it mean for the future? (Implications)
- Easier transfer across domains: Robots, game agents, and interactive systems could quickly learn new controls in new environments using a few demonstrations, rather than thousands of labeled samples.
- Lower costs and faster development: Training big models is expensive. Learning from unlabeled videos and needing less fine-tuning reduces cost and speeds up deployment.
- More general, human-like learning: The idea mirrors how people learn — by watching and extracting the essence of actions — and reusing that knowledge in new situations.
- Creative control: Being able to blend or create new actions opens doors to building interactive environments and tools that are more flexible.
Simple note on limitations and future work:
- Speed: The model isn’t real-time yet; faster sampling and distillation could help.
- Very long rollouts and totally new content: Like many video models, it gets harder the longer you predict without fresh input; scaling models and data may improve this.
- Some failure cases still happen; more research is needed to make it even more robust.
Overall, AdaWorld shows a practical path to smarter, more adaptable simulators that learn action knowledge from ordinary videos and quickly plug into new tasks.