DINO-WM: Zero-Shot Latent World Model
- The paper introduces DINO-WM, a framework that leverages frozen DINOv2 patch embeddings for zero-shot planning in complex visual environments.
- It employs a causal vision transformer and a learned MLP to predict future latent representations from offline trajectories without pixel reconstruction.
- Empirical evaluations show DINO-WM outperforms baselines in navigation, manipulation, and dynamics tasks, achieving high success rates and low error metrics.
DINO-WM (DINO World Model) is a world modeling framework that leverages spatial patch features from the pre-trained DINOv2 vision transformer to enable zero-shot planning in complex visual environments. Unlike classical world models that rely on generative pixel reconstructions or task-specific latent representations, DINO-WM operates fully in latent space, eschewing pixel reconstruction in favor of direct prediction of future DINOv2 patch embeddings. This paradigm facilitates offline training on diverse, pre-recorded trajectories and supports task-agnostic, zero-shot planning through direct optimization of action sequences to reach a desired visual goal (Zhou et al., 2024).
1. Motivation and Core Principles
Learning accurate visual world models for control from pixels remains a significant challenge, primarily due to two limitations: the computational burden of pixel-level prediction (necessitating generative or diffusion models), and the lack of generality in existing latent-space approaches that entangle representation learning with task or reconstruction objectives. DINO-WM addresses these by using a frozen, spatially structured DINOv2 encoder, which provides object-centric, spatially localized patch embeddings learned from web-scale datasets. This decoupling allows world modeling to focus purely on dynamics in a compact, robust feature space, enabling efficient offline training from passive data and versatile test-time planning (Zhou et al., 2024).
Key principles include:
- Latent-only modeling: All training and inference occur in DINOv2 patch embedding space; pixel reconstruction is unused except for optional qualitative inspection.
- Offline, task-agnostic training: The model is trained purely on trajectories of observations and actions, with no rewards, demonstrations, or inverse models.
- Zero-shot planning: At test time, goal-conditioning is achieved by optimizing over the DINO feature space, allowing immediate adaptation to new tasks by specifying different goal observations.
2. Model Architecture and Dynamics
DINO-WM formalizes dynamics prediction in partially observable MDPs by encoding each observation $o_t$ as fixed DINOv2 patch embeddings $z_t = \mathrm{enc}(o_t) \in \mathbb{R}^{N \times E}$, where $N$ is the number of spatial patches and $E$ is the DINOv2 embedding dimension (typically 384 for ViT-S/14). The encoder $\mathrm{enc}$ is frozen throughout.
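The patch-grid arithmetic above can be made concrete with a few lines of bookkeeping (an illustrative sketch only; the actual frozen encoder is the pre-trained DINOv2 model, not reproduced here):

```python
# Patch-grid bookkeeping for a frozen DINOv2 ViT-S/14 encoder
# (illustrative sketch; in DINO-WM the real pre-trained encoder
# produces these embeddings and is kept frozen).
IMG = 224     # input resolution used for observations
PATCH = 14    # DINOv2 patch size
E = 384       # ViT-S embedding dimension

N = (IMG // PATCH) ** 2   # spatial patch tokens per frame
z_shape = (N, E)          # shape of one encoded observation z_t
print(N, z_shape)
```

At 224×224 input this yields a 16×16 grid, i.e. 256 patch tokens of dimension 384 per frame.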
The transition model is a causal vision transformer (ViT) parameterized by $\theta$, which takes as input the past $H$ DINO-patch latent frames $z_{t-H+1:t}$ and the corresponding action embeddings, and produces a prediction $\hat{z}_{t+1}$ of the next patch collection. Each action $a_t$ is mapped to an embedding via a learned MLP $\phi$ and broadcast to the patch dimension. Causality is enforced so that only earlier frames inform the prediction at each timestep:

$$\hat{z}_{t+1} = p_\theta\big(z_{t-H+1:t},\; \phi(a_{t-H+1:t})\big).$$
An optional decoder can be used for qualitative inspection by mapping predicted latents back to the pixel space, but this decoder is never used in either training or test-time planning.
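The shapes involved in one prediction step can be sketched in numpy. This is a toy stand-in, not the paper's architecture: `predict_next` replaces the causal ViT with a single linear mixing step, and `action_mlp` is a hypothetical two-layer encoder, but the action-broadcast-to-patches pattern matches the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# history length, patches, embed dim, raw action dim, action-embed dim
H, N, E, A_DIM, E_ACT = 3, 256, 384, 2, 10   # illustrative sizes

def action_mlp(a, W1, W2):
    """Hypothetical stand-in for the learned action encoder phi."""
    return np.tanh(a @ W1) @ W2                               # (E_ACT,)

def predict_next(z_hist, a_hist, params):
    """Toy stand-in for the causal ViT p_theta: embed actions, broadcast
    them to every patch, and linearly mix the history into one frame."""
    W1, W2, Wz, Wa = params
    a_emb = np.stack([action_mlp(a, W1, W2) for a in a_hist])  # (H, E_ACT)
    a_tok = np.broadcast_to(a_emb[:, None, :], (H, N, E_ACT))  # per-patch copy
    x = np.concatenate([z_hist, a_tok], axis=-1)               # (H, N, E+E_ACT)
    # aggregate over the H past frames only (no access to future frames)
    return np.mean(x, axis=0) @ np.concatenate([Wz, Wa], axis=0)  # (N, E)

params = (rng.normal(size=(A_DIM, 32)) * 0.1,
          rng.normal(size=(32, E_ACT)) * 0.1,
          rng.normal(size=(E, E)) * 0.01,
          rng.normal(size=(E_ACT, E)) * 0.01)

z_hist = rng.normal(size=(H, N, E))     # frozen-encoder latents of past frames
a_hist = rng.normal(size=(H, A_DIM))    # past actions
z_next_hat = predict_next(z_hist, a_hist, params)
print(z_next_hat.shape)                 # one predicted patch grid, (N, E)
```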
3. Training Regimen and Data
Training proceeds fully offline on pre-collected trajectories of the form $\tau = (o_1, a_1, o_2, a_2, \ldots, o_T)$, without any reward or expert control signals. DINO-WM trains only the transformer weights $\theta$ of $p_\theta$ and the parameters of the action-to-latent MLP $\phi$.
Prediction uses teacher forcing over windows of $H$ frames. For each such window, the model predicts the next step of patch embeddings, supervised via a standard latent-space mean-squared-error loss:

$$\mathcal{L}(\theta, \phi) = \big\| p_\theta\big(z_{t-H+1:t},\, \phi(a_{t-H+1:t})\big) - z_{t+1} \big\|_2^2.$$
No reconstruction, reward, terminal, or auxiliary losses are used. Trajectory collection is environment-specific but always uses random or noisy expert-like behavior, ensuring coverage of varied states (e.g., 2,000 random 50-step PointMaze trajectories, over 18,500 replayed or noisy expert Push-T sequences, and similar volumes for manipulation tasks) (Zhou et al., 2024).
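The teacher-forced training loop amounts to sliding a length-$H$ window over each trajectory of frozen-encoder latents and penalizing the one-step latent error. A minimal sketch, with tiny illustrative sizes and a hypothetical `toy_predictor` (a "copy the last frame" baseline standing in for $p_\theta$; actions are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
H, N, E = 3, 16, 8     # tiny illustrative sizes

def latent_loss(z_pred, z_target):
    """MSE between predicted and frozen-encoder next-frame latents."""
    return float(np.mean((z_pred - z_target) ** 2))

def toy_predictor(z_hist):
    # hypothetical stand-in for p_theta: repeat the most recent frame
    return z_hist[-1]

# One trajectory of T pre-encoded frames z_0 .. z_{T-1}; ground-truth
# history is always fed in (teacher forcing), only the next frame is
# supervised.
T = 8
z_traj = rng.normal(size=(T, N, E))

total = 0.0
for t in range(H, T):
    z_pred = toy_predictor(z_traj[t - H:t])
    total += latent_loss(z_pred, z_traj[t])
avg_loss = total / (T - H)
print(avg_loss)   # average one-step latent MSE over windows
```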
4. Zero-Shot Test-Time Planning
DINO-WM enables zero-shot planning by optimizing action sequences to minimize the distance in DINOv2-patch latent space between the model's roll-out and the goal observation's encoding. Formally, for a given start $o_0$ and goal $o_g$, actions $a_{0:T-1}$ are sought to minimize

$$\mathcal{C}(a_{0:T-1}) = \big\| \hat{z}_T - z_g \big\|_2^2,$$

where $\hat{z}_T$ is obtained by rolling out $p_\theta$ from $z_0 = \mathrm{enc}(o_0)$ under the candidate actions, and $z_g = \mathrm{enc}(o_g)$.
The cross-entropy method (CEM) is the preferred optimizer: it iteratively refines a Gaussian distribution over action sequences by sampling candidates, selecting the elites with the lowest planning cost, refitting the mean and covariance to those elites, and resampling. A gradient-based shooting method is also possible via backpropagation through the world model, but CEM outperforms it because the latent dynamics are non-smooth.
Planning proceeds in a model-predictive control (MPC) loop: after executing the first few (typically one) actions of the best sequence, the current observation is re-encoded and planning resumes, closing the perception–action loop (Zhou et al., 2024).
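The CEM-based planning loop can be sketched generically. This is a minimal illustration, not the paper's implementation: the `rollout` here is a toy 2-D "world model" where actions translate the latent directly, and names like `cem_plan`, `n_elites` are hypothetical choices; in DINO-WM the rollout would unroll $p_\theta$ over DINOv2 patch latents and compare against the encoded goal image.

```python
import numpy as np

rng = np.random.default_rng(2)

def cem_plan(z0, z_goal, rollout, horizon, a_dim,
             n_samples=64, n_elites=8, n_iters=5):
    """Cross-entropy method over open-loop action sequences.

    rollout(z0, actions) unrolls the (here: toy) latent world model and
    returns the final predicted latent; cost is squared distance to the
    goal's latent."""
    mu = np.zeros((horizon, a_dim))
    sigma = np.ones((horizon, a_dim))
    for _ in range(n_iters):
        cand = mu + sigma * rng.normal(size=(n_samples, horizon, a_dim))
        costs = np.array([np.sum((rollout(z0, a) - z_goal) ** 2) for a in cand])
        elites = cand[np.argsort(costs)[:n_elites]]      # lowest-cost sequences
        mu = elites.mean(axis=0)                         # refit Gaussian
        sigma = elites.std(axis=0) + 1e-6
    return mu

# Toy 'world model': the latent is a 2-D point translated by each action.
def rollout(z0, actions):
    return z0 + actions.sum(axis=0)

z0, z_goal = np.zeros(2), np.array([3.0, -1.0])
plan = cem_plan(z0, z_goal, rollout, horizon=4, a_dim=2)
# MPC step: execute only the first planned action, then re-encode and replan.
z1 = rollout(z0, plan[:1])
print(np.linalg.norm(rollout(z0, plan) - z_goal))  # shrinks over CEM iterations
```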
5. Empirical Evaluation and Comparison
DINO-WM is evaluated on a suite of 2D/3D environments involving navigation, manipulation, and multi-particle dynamics using 224×224 RGB observations:
- PointMaze (navigation),
- Push-T (block pushing),
- Wall/Two-Room (randomized barriers),
- Rope Manipulation (robotic rope pulling),
- Granular Manipulation (particle aggregation).
Zero-shot performance metrics include Success Rate (SR) on discrete goal tasks and Chamfer Distance (CD) for continuous shape-matching. DINO-WM substantially outperforms competitive baselines including IRIS, DreamerV3, and TD-MPC2. For example, on Push-T, DINO-WM achieves SR = 0.90, compared to IRIS (0.32), DreamerV3 (0.04), and TD-MPC2 (0.00). On Rope Manipulation, DINO-WM yields a CD of 0.41 (lower is better), compared to IRIS (1.11) and DreamerV3 (2.49).
Zero-Shot Planning Performance
| Model | PointMaze SR | Push-T SR | Wall SR | Rope CD | Granular CD |
|---|---|---|---|---|---|
| IRIS | 0.74 | 0.32 | 0.04 | 1.11 | 0.37 |
| DreamerV3 | 1.00 | 0.04 | 1.00 | 2.49 | 1.05 |
| TD-MPC2 | 0.00 | 0.00 | 0.00 | 2.52 | 1.21 |
| DINO-WM | 0.98 | 0.90 | 0.96 | 0.41 | 0.26 |
Ablation studies show spatial DINOv2 patch embeddings are critical: replacing them with global R3M, ResNet18, or DINOv2 CLS token leads to significantly degraded performance, especially on manipulation and spatial reasoning tasks (Zhou et al., 2024).
DINO-WM demonstrates robust generalization to novel configurations, such as new object shapes, wall placements, or reduced particle number, maintaining or exceeding baseline performance (e.g., SR = 0.82 on unseen wall configurations vs DreamerV3 SR = 0.76, IRIS SR = 0.06).
Open-loop latent rollouts, decoded with the optional decoder, exhibit very low LPIPS (Push-T: 0.007; Wall: 0.0016) and high SSIM (Push-T: 0.985; Wall: 0.997), outperforming methods directly trained on pixel reconstruction.
6. Insights, Limitations, and Extensions
Several insights emerge from DINO-WM's results:
- The use of a frozen, high-quality spatial encoder (DINOv2 patches) enables the world model to focus exclusively on modeling environment dynamics, eliminating the confounding of feature and generative modeling objectives.
- Causal ViT-based latent predictors, trained entirely offline, are sufficient for precise prediction in domains requiring contact dynamics and complex object interactions.
- Latent-space planning using CEM is both straightforward and effective—no reward shaping or demonstration-derived inverse models are required.
Limitations include:
- The method requires access to aligned action data during offline training; applicability to settings lacking ground-truth actions is limited.
- Gradient-based action optimization may be hampered by latent non-smoothness.
- Current planning is single-level; hierarchical extensions or learning latent cost models for high-level goal following remain open directions.
Comparison to recent related latent world models, notably LaDi-WM (Huang et al., 13 May 2025), highlights a trend: latent-based dynamics using pre-trained visual feature spaces (e.g., DINO, CLIP/SigLip) substantially improve generalization and sample efficiency. LaDi-WM further integrates latent diffusion and semantic features for predictive manipulation, confirming the advantages of DINO-based encodings over pixel- or global-token models.
7. Context within Visual World Modeling
DINO-WM exemplifies the shift from pixel-centric to representation-centric visual world models, leveraging the semantic richness and spatial locality of transformer-trained visual encoders. Its decoupled, modular approach contrasts with end-to-end latent-pixel RL systems, supporting markedly improved zero-shot adaptability and interpretability. Future research may further unify geometric and semantic encodings (as in LaDi-WM), integrate language-conditioned goals, or extend to settings with partial or absent action annotation (Zhou et al., 2024, Huang et al., 13 May 2025).