
AdaWorld: Learning Adaptable World Models with Latent Actions

Published 24 Mar 2025 in cs.AI, cs.CV, cs.LG, and cs.RO | (2503.18938v4)

Abstract: World models aim to learn action-controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.

Summary

  • The paper introduces a latent action autoencoder using a β-VAE objective to disentangle actions from context for effective control transfer.
  • It demonstrates rapid adaptation to new environments with minimal labeled data, significantly reducing FVD and boosting ECS metrics.
  • The approach supports action composition and robust planning, enabling scalable world models in robotics, gaming, and simulation.

AdaWorld: Learning Adaptable World Models with Latent Actions

Introduction and Motivation

AdaWorld addresses a central limitation in current world model research: the lack of adaptability to novel environments with heterogeneous action spaces, especially under constraints of limited action-labeled data and minimal finetuning. Existing world models, even those initialized from large-scale video pretraining, typically require substantial action annotation and retraining to achieve action controllability in new domains. This restricts their scalability and practical deployment in diverse, real-world scenarios where action spaces are not standardized and labeled data is scarce.

AdaWorld proposes a paradigm shift by introducing action-aware pretraining via self-supervised extraction of latent actions from videos. The core hypothesis is that incorporating action information during pretraining—rather than relying solely on action-agnostic video data—enables the resulting world models to generalize and adapt efficiently to new environments and action spaces with minimal supervision.

Methodology

Latent Action Autoencoder

The foundation of AdaWorld is a latent action autoencoder, instantiated as a Transformer-based architecture. The encoder receives two consecutive frames (f_t, f_{t+1}) and produces a compact, continuous latent action ã that encodes the transition between them. The decoder reconstructs f_{t+1} from f_t and ã. An information bottleneck, implemented via a β-VAE objective, enforces that ã captures only the most salient, context-invariant aspects of the transition, effectively disentangling action from background and scene context.

Key architectural details include:

  • Patch-based tokenization of frames (16 × 16 patches).
  • Spatiotemporal Transformer blocks with interleaved spatial and temporal attention.
  • Rotary embeddings in temporal attention to encode causality.
  • Posterior parameterization of ã as a Gaussian, with sampling during training.
  • β-VAE loss to balance expressiveness and disentanglement.

Empirically, the continuous latent action space is shown to be more expressive and transferable than discrete alternatives (e.g., VQ-VAE codebooks), supporting nuanced action representation and flexible composition.
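To make the bottleneck objective concrete, here is a minimal numpy sketch of the β-VAE loss with the reparameterization trick. This is an illustration only, not the authors' implementation: the real encoder and decoder are Transformers, whereas here the posterior parameters and reconstruction are toy placeholders.

```python
import numpy as np

def beta_vae_loss(f_next, f_next_recon, mu, logvar, beta=4.0):
    """beta-VAE objective for the latent action bottleneck.

    The reconstruction term pushes the decoder to rebuild f_{t+1} from
    (f_t, latent action); the KL term (weighted by beta) squeezes the
    latent so it keeps only the most salient transition information.
    """
    recon = np.mean((f_next - f_next_recon) ** 2)
    # KL divergence between N(mu, sigma^2) and the N(0, I) prior
    kl = -0.5 * np.mean(1 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl

def reparameterize(mu, logvar, rng):
    """Sample the latent action a~ via the reparameterization trick."""
    std = np.exp(0.5 * logvar)
    return mu + std * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
mu, logvar = np.zeros(8), np.zeros(8)             # toy posterior parameters
a_tilde = reparameterize(mu, logvar, rng)          # sampled latent action
f_next = rng.standard_normal((16, 16))
loss = beta_vae_loss(f_next, f_next, mu, logvar)   # perfect reconstruction
```

Raising β tightens the bottleneck (more disentanglement, less detail); lowering it lets the code carry more of the transition.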

Action-Aware World Model Pretraining

After training the latent action autoencoder, AdaWorld uses the encoder to extract latent actions from large-scale, unlabeled video corpora. These latent actions serve as unified, context-invariant conditions for pretraining an autoregressive world model.

The world model is based on a diffusion architecture (initialized from Stable Video Diffusion), modified to support frame-level control and deep aggregation of action information. At each step, the model predicts the next frame conditioned on the current latent action and a memory of historical frames. Training employs noise augmentation and random memory lengths to improve robustness and mitigate long-term drift.

Adaptation and Transfer

AdaWorld's design enables two principal adaptation mechanisms:

  1. Action Transfer: Given a demonstration video, the latent action encoder extracts a sequence of latent actions, which can be replayed in novel contexts to generate new trajectories exhibiting the demonstrated behavior, without any additional training.
  2. World Model Adaptation: For environments with explicit action labels, AdaWorld averages the latent actions corresponding to each action to initialize action embeddings. Finetuning with a small number of action-labeled samples rapidly adapts the world model to the new action space, leveraging the continuity and expressiveness of the latent action manifold.

Additionally, AdaWorld supports action composition by interpolating or clustering in the latent action space, enabling the synthesis of novel control primitives and flexible action sets.
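The embedding-initialization step in mechanism 2 can be sketched as below. The helper `encode_latent_action` is a hypothetical stand-in for the pretrained encoder; the averaging per label follows the paper's described procedure.

```python
import numpy as np
from collections import defaultdict

def encode_latent_action(frame_pair):
    """Stand-in for the pretrained latent action encoder."""
    f_t, f_next = frame_pair
    return (f_next - f_t).reshape(-1)[:8]   # toy 8-dim transition code

def init_action_embeddings(labeled_pairs):
    """Average the latent actions extracted for each labeled action to
    initialize that action's embedding before finetuning."""
    buckets = defaultdict(list)
    for frame_pair, label in labeled_pairs:
        buckets[label].append(encode_latent_action(frame_pair))
    return {label: np.mean(codes, axis=0) for label, codes in buckets.items()}

rng = np.random.default_rng(0)
pairs = [((rng.standard_normal((4, 4)), rng.standard_normal((4, 4))), "left")
         for _ in range(3)]
emb = init_action_embeddings(pairs)
```

Starting finetuning from these averaged embeddings, rather than random ones, is what the paper's ablations identify as critical for rapid and stable adaptation.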

Experimental Results

AdaWorld is evaluated on a diverse suite of environments, including held-out domains (Habitat, Minecraft, DMLab) and standard video datasets (LIBERO, SSv2). The training corpus aggregates over 2 billion frames from robot, human, and game environments, ensuring broad coverage and diversity.

Key empirical findings:

  • Action Transfer: AdaWorld achieves substantial improvements in Fréchet Video Distance (FVD) and Embedding Cosine Similarity (ECS) over action-agnostic and discrete-latent baselines. For example, on LIBERO, AdaWorld reduces FVD from 1545.2 (action-agnostic) to 767.0 and increases ECS from 0.702 to 0.804, indicating more faithful and context-invariant action transfer.
  • World Model Adaptation: With only 100 samples per action and 800 finetuning steps, AdaWorld outperforms baselines in PSNR and LPIPS across all tested environments. The adaptation curves show that AdaWorld achieves higher simulation fidelity with fewer samples and faster convergence.
  • Visual Planning: In model-predictive control (MPC) tasks on Procgen environments, AdaWorld achieves higher success rates than both action-agnostic world models and Q-learning baselines, even without finetuning. For instance, in the Jumper environment, AdaWorld (without finetuning) attains a 68% success rate versus 20.67% for the action-agnostic baseline.
  • Ablations: Increasing training data diversity improves generalization to unseen domains. The use of averaged action embeddings for adaptation is critical for rapid and stable finetuning.
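The visual planning setup above can be sketched as random-shooting MPC in the latent action space. Everything below is a toy stand-in (the `simulate` world model and the goal-distance cost are assumptions, not the paper's exact planner), but the loop structure of sample, imagine, score, pick best is the same idea.

```python
import numpy as np

def simulate(frame, action_seq):
    """Toy world-model rollout: accumulates latent actions onto the frame."""
    for a in action_seq:
        frame = frame + a
    return frame

def plan(frame, goal, action_dim=2, horizon=4, n_candidates=64, seed=0):
    """Random-shooting MPC: sample candidate latent-action sequences, roll
    each out through the world model, keep the one whose imagined final
    state lands closest to the goal."""
    rng = np.random.default_rng(seed)
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.standard_normal((horizon, action_dim))
        cost = np.sum((simulate(frame.copy(), seq) - goal) ** 2)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

start = np.zeros(2)
goal = np.array([1.0, -1.0])
seq, cost = plan(start, goal)
```

Because AdaWorld's latent actions are already control-relevant, sampling in this space lets the planner succeed even before any finetuning, which is what the Procgen results show.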

Implications and Future Directions

AdaWorld demonstrates that action-aware pretraining with self-supervised latent actions is a scalable and effective strategy for building adaptable world models. The approach circumvents the need for large-scale action annotation and supports rapid adaptation to new environments and action spaces. The continuous latent action space enables nuanced control, action composition, and flexible expansion of the action set.

Practical implications include:

  • Efficient deployment of world models in robotics, gaming, and simulation domains with heterogeneous or evolving action spaces.
  • Reduced data annotation and retraining costs for new tasks or environments.
  • Enhanced sample efficiency and planning performance in low-data regimes.

Theoretical implications center on the disentanglement of action and context, the expressiveness of continuous latent action spaces, and the potential for compositionality in control.

Limitations include non-real-time inference speed (due to diffusion-based generation), challenges in long-horizon rollouts, and imperfect handling of complex physics or dramatic scene changes. Addressing these may require model distillation, improved sampling, and further scaling of model and data.

Future research directions may explore:

  • Real-time or near-real-time world model inference via distillation or alternative generative architectures.
  • Integration with language-conditioned or multimodal control interfaces.
  • Scaling to more complex, open-ended environments and tasks.
  • Deeper investigation into the compositionality and interpretability of the latent action space.

Conclusion

AdaWorld establishes a new paradigm for adaptable world model pretraining by leveraging self-supervised latent action extraction and action-aware pretraining. The approach achieves strong empirical results in action transfer, adaptation, and planning, with significant improvements in sample efficiency and generalization. The methodology and findings have broad implications for the development of scalable, flexible, and efficient world models in embodied AI and interactive simulation.


Explain it Like I'm 14

What is this paper about?

This paper introduces AdaWorld, a kind of “world model.” A world model is like a smart simulator: given what you do (your actions), it predicts what will happen next in a video or game. AdaWorld’s big idea is to learn a hidden, compact “action code” from ordinary videos (with no labels) and use that to make the simulator easy to adapt to new environments and new action sets with very little extra training.

What questions are they trying to answer?

The authors focus on three simple questions:

  • Can a model learn about actions just by watching videos, without needing lots of labeled data?
  • If it learns these hidden “action codes,” can it use them to control and predict what happens in many different environments (like different games or real-world scenes)?
  • Can this make adapting to new tasks faster and cheaper (fewer examples, fewer training steps), while still planning and simulating well?

How did they do it? (Methods explained simply)

Learning “latent actions” from videos

  • Think of two frames in a video: frame at time t, and the next frame at time t+1.
  • The model learns a tiny “secret code” that captures what changed between those two frames — that’s the latent action. It’s “latent” because it’s hidden inside the model, not a human-written label like “move left.”
  • To force the code to focus on what matters (the action, not background colors or textures), they use an “information bottleneck.” This means the code is very small, so it can’t carry everything; it must carry the most important change (like the movement).
  • They train this with an autoencoder (a kind of compression-and-reconstruction system) and a variation of a VAE (Variational Autoencoder) called a β-VAE. The β part lets them control how strict the compression is: more strict pushes the code to represent only the essentials; less strict lets it carry more details.

Analogy: Imagine watching a superhero flip a switch. The latent action is like a tiny note that just says “the switch was flipped,” ignoring the wallpaper and the lighting, so the model focuses on the cause of change, not the background.

Training the world model with those actions

  • After the model learns how to extract latent actions from videos, they train a world model that predicts the next frame given:
    • The current frame,
    • A short memory of recent frames,
    • And the latent action code.
  • They use a “diffusion model” to make each predicted frame look realistic. Diffusion models are like un-blurring a noisy picture step-by-step until it looks right.
  • The world model predicts one frame at a time and feeds its own predictions forward. This step-by-step style is called “autoregressive,” like writing a story one sentence at a time, using the last sentence to guide the next.

Making it adaptable

  • Because the world model was trained to use latent actions (instead of fixed, human-defined labels), adapting to new environments is mostly about finding the right latent action codes for those new actions.
  • With just a few examples of a new action (say, “jump”), they:
    • Extract the latent actions from those examples,
    • Average them to get a stable “action embedding,”
    • Plug that into the world model and do minimal fine-tuning.
  • The model can also “transfer” actions: if it sees a demonstration (like “push object forward” in one video), it can extract that latent action and replay it in a different scene without retraining.
  • Because the latent action space is continuous (not limited to a small set of discrete buttons), they can “mix” actions by blending their codes to create new behaviors (action composition).
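Because the codes live in a continuous space, "mixing" two actions can be as simple as linear interpolation between their codes. A toy sketch (the two example codes are made up for illustration):

```python
import numpy as np

def compose_actions(code_a, code_b, alpha=0.5):
    """Blend two latent action codes: alpha=0 gives pure code_a,
    alpha=1 gives pure code_b, in-between values yield novel blends."""
    return (1 - alpha) * code_a + alpha * code_b

move_left = np.array([1.0, 0.0])    # hypothetical learned codes
move_up = np.array([0.0, 1.0])
diagonal = compose_actions(move_left, move_up)   # an in-between behavior
```

This only works because the latent space is continuous; with a discrete codebook (as in VQ-VAE baselines), there is no meaningful "halfway point" between two actions.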

What did they find? (Main results)

Here are the key findings in plain language:

  • Action transfer without retraining: The model can watch a video of an action and reuse that action in a different context (like a new game level or viewpoint) by applying the extracted latent action sequence.
  • Faster adaptation with few samples: Compared to traditional methods, AdaWorld needs fewer labeled examples and fewer training steps to adapt to new environments (like Habitat, Minecraft, and DMLab) and still produces higher-quality simulations.
  • Better visual planning: Using the world model to “imagine” and plan steps (via model predictive control), AdaWorld achieved higher success rates in tasks from the Procgen benchmark than common baselines and even classical Q-learning in the same limited-data setting.
  • Flexible control: The continuous latent action space allows composing and creating new actions by averaging or interpolating latent action codes.

Why this is important:

  • It shows that action-aware pretraining (using latent actions learned from videos) makes world models more adaptable, saving time and data while improving simulation and planning in new places.

What does it mean for the future? (Implications)

  • Easier transfer across domains: Robots, game agents, and interactive systems could quickly learn new controls in new environments using a few demonstrations, rather than thousands of labeled samples.
  • Lower costs and faster development: Training big models is expensive. Learning from unlabeled videos and needing less fine-tuning reduces cost and speeds up deployment.
  • More general, human-like learning: The idea mirrors how people learn — by watching and extracting the essence of actions — and reusing that knowledge in new situations.
  • Creative control: Being able to blend or create new actions opens doors to building interactive environments and tools that are more flexible.

Simple note on limitations and future work:

  • Speed: The model isn’t real-time yet; faster sampling and distillation could help.
  • Very long rollouts and totally new content: Like many video models, it gets harder the longer you predict without fresh input; scaling models and data may improve this.
  • Some failure cases still happen; more research is needed to make it even more robust.

Overall, AdaWorld shows a practical path to smarter, more adaptable simulators that learn action knowledge from ordinary videos and quickly plug into new tasks.
