
DINO-WM: Zero-Shot Latent World Model

Updated 3 February 2026
  • The paper introduces DINO-WM, a framework that leverages frozen DINOv2 patch embeddings for zero-shot planning in complex visual environments.
  • It employs a causal vision transformer and a learned MLP to predict future latent representations from offline trajectories without pixel reconstruction.
  • Empirical evaluations show DINO-WM outperforms baselines in navigation, manipulation, and dynamics tasks, achieving high success rates and low error metrics.

DINO-WM (DINO World Model) is a world modeling framework that leverages spatial patch features from the pre-trained DINOv2 vision transformer to enable zero-shot planning in complex visual environments. Unlike classical world models that rely on generative pixel reconstructions or task-specific latent representations, DINO-WM operates fully in latent space, eschewing pixel reconstruction in favor of direct prediction of future DINOv2 patch embeddings. This paradigm facilitates offline training on diverse, pre-recorded trajectories and supports task-agnostic, zero-shot planning through direct optimization of action sequences to reach a desired visual goal (Zhou et al., 2024).

1. Motivation and Core Principles

Learning accurate visual world models for control from pixels remains a significant challenge, primarily due to two limitations: the computational burden of pixel-level prediction (necessitating generative or diffusion models), and the lack of generality in existing latent-space approaches that entangle representation learning with task or reconstruction objectives. DINO-WM addresses these by using a frozen, spatially structured DINOv2 encoder, which provides object-centric, spatially localized patch embeddings learned from web-scale datasets. This decoupling allows world modeling to focus purely on dynamics in a compact, robust feature space, enabling efficient offline training from passive data and versatile test-time planning (Zhou et al., 2024).

Key principles include:

  • Latent-only modeling: All training and inference occur in DINOv2 patch embedding space; pixel reconstruction is unused except for optional qualitative inspection.
  • Offline, task-agnostic training: The model is trained purely on trajectories of observations and actions, with no rewards, demonstrations, or inverse models.
  • Zero-shot planning: At test time, goal-conditioning is achieved by optimizing over the DINO feature space, allowing immediate adaptation to new tasks by specifying different goal observations.

2. Model Architecture and Dynamics

DINO-WM formalizes dynamics prediction in partially observable MDPs by encoding each observation $o_t \in \mathbb{R}^{H \times W \times 3}$ as $N$ fixed DINOv2 patch embeddings, $z_t \in \mathbb{R}^{N \times E}$, where $N = (H/P) \times (W/P)$ and $E$ is the DINOv2 embedding dimension (typically 384 for ViT-S/14). The encoder is frozen throughout.
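As a quick sanity check on these shapes, the patch count for the 224×224 inputs used in the paper's evaluations can be computed directly (a minimal sketch assuming DINOv2's patch size of 14 and the ViT-S embedding dimension of 384):

```python
# Shape bookkeeping for the DINOv2 patch-embedding encoder.
# Illustrative values: 224x224 RGB input, patch size P=14, ViT-S embed dim E=384.
H, W, P, E = 224, 224, 14, 384

N = (H // P) * (W // P)   # number of patch tokens per frame
print(N)                  # 256 patches
print((N, E))             # each observation o_t encodes to z_t of shape (256, 384)
```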

The transition model is a causal vision transformer (ViT) parameterized by $\theta$, which takes as input the past $H$ DINO-patch latent frames $z_{t-H:t-1}$ and the action embeddings $\phi(a_{t-H:t-1})$, and outputs a prediction of the next patch collection $\hat z_t$. Each action $a_t$ is mapped via a learned MLP $\phi$ and broadcast to the patch dimension. Causality is enforced so that only earlier frames inform the prediction at each timestep:

$$\hat z_t = p_\theta\big( z_{t-H:t-1},\, \phi(a_{t-H:t-1}) \big).$$

An optional decoder $q_\psi$ can map predicted latents $\hat z_t$ back to pixel space for qualitative inspection, but it is used in neither training nor test-time planning.
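The interface of one transition step can be sketched in numpy. The sketch below substitutes a stand-in linear map for the causal ViT (`W_phi`, `W_dyn`, and `predict_next` are illustrative names, not the paper's implementation); what carries over is the tensor bookkeeping, in particular the broadcast of each action embedding across all patch tokens:

```python
import numpy as np

# Minimal numpy sketch of the transition interface (NOT the paper's causal ViT):
# a per-frame action embedding phi(a_t) is broadcast to every patch token, and a
# stand-in linear map plays the role of the transformer p_theta.
rng = np.random.default_rng(0)
N, E, A, H_hist = 4, 8, 2, 3           # patches, embed dim, action dim, history length

W_phi = rng.normal(size=(A, E))        # stand-in for the learned action MLP phi

def phi(a):
    """Map one action (A,) to a latent embedding (E,)."""
    return a @ W_phi

W_dyn = rng.normal(size=(2 * E, E)) / np.sqrt(2 * E)  # stand-in dynamics weights

def predict_next(z_hist, a_hist):
    """Predict z_t from the past H frames and actions.

    z_hist: (H, N, E) patch latents; a_hist: (H, A) actions.
    Only past frames enter the prediction, so causality holds trivially here.
    """
    a_tok = np.stack([np.broadcast_to(phi(a), (N, E)) for a in a_hist])  # (H, N, E)
    x = np.concatenate([z_hist, a_tok], axis=-1)   # (H, N, 2E) frame+action tokens
    return x.mean(axis=0) @ W_dyn                  # pooled stand-in for attention -> (N, E)

z_hist = rng.normal(size=(H_hist, N, E))
a_hist = rng.normal(size=(H_hist, A))
print(predict_next(z_hist, a_hist).shape)          # (4, 8): one predicted latent frame
```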

3. Training Regimen and Data

Training proceeds fully offline on pre-collected trajectories of the form $\{(o_0, a_0, o_1, a_1, \dots)\}$ without any reward or expert control signals. DINO-WM trains only the parameters $\theta$ (the transformer weights of $p_\theta$) and $\phi$ (the action-to-latent MLP).

Prediction uses teacher forcing over windows of $H+1$ frames. For each such window, the model predicts the next $H$ steps of patch embeddings, supervised via a standard latent-space $L_2$ loss:

$$\mathcal{L}_{\text{pred}} = \sum_{k=1}^{H} \left\| \hat z_{t-H+k} - z_{t-H+k} \right\|_2^2.$$

No reconstruction, reward, terminal, or auxiliary losses are used. Trajectory collection is environment-specific but always uses random or noisy expert-like behavior, ensuring coverage of varied states (e.g., 2,000 random 50-step PointMaze trajectories, over 18,500 replayed or noisy expert Push-T sequences, and similar volumes for manipulation tasks) (Zhou et al., 2024).
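The teacher-forcing objective reduces to a plain squared-$L_2$ comparison between predicted latents and the frozen encoder's targets. A minimal numpy sketch (`latent_l2_loss` is a hypothetical helper, and the shapes are illustrative):

```python
import numpy as np

# Sketch of the latent-space teacher-forcing loss: over a window, every
# predicted patch latent is compared to the frozen DINOv2 target with squared L2.
def latent_l2_loss(z_pred, z_true):
    """z_pred, z_true: (H, N, E) predicted vs. ground-truth patch latents."""
    return np.sum((z_pred - z_true) ** 2)

rng = np.random.default_rng(0)
z_true = rng.normal(size=(3, 4, 8))                # H=3 steps, N=4 patches, E=8 dims

print(latent_l2_loss(z_true, z_true))              # perfect prediction -> 0.0
print(latent_l2_loss(z_true + 0.1, z_true))        # 96 entries off by 0.1 -> ~0.96
```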

4. Zero-Shot Test-Time Planning

DINO-WM enables zero-shot planning by optimizing action sequences to minimize the $L_2$ distance, in DINOv2 patch latent space, between the model's rollout and the encoding of a goal observation. Formally, for a given start $o_0$ and goal $o_g$, actions $a_{0:H-1}$ are sought that minimize

$$\mathcal{C}(a_{0:H-1}) = \left\| \hat z_H(a_{0:H-1}) - z_g \right\|_2^2,$$

where $\hat z_0 = z_0$ and $\hat z_{t+1} = p_\theta\big( \hat z_{t-H+1:t}, \phi(a_{t-H+1:t}) \big)$.

The cross-entropy method (CEM) is the preferred optimizer: it iteratively refits a Gaussian distribution over action sequences by sampling candidates, selecting those with the lowest planning cost, recomputing the mean and covariance from these elites, and resampling. A gradient-based shooting method is also possible via backpropagation through the world model, but it is outperformed by CEM because the latent dynamics are non-smooth.

Planning proceeds in a model-predictive control (MPC) loop: after executing the first $k$ (typically one) actions of the best sequence, the current observation is re-encoded and planning resumes, closing the perception–action loop (Zhou et al., 2024).
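The CEM loop can be sketched end to end on a toy problem. The sketch below substitutes a known linear latent dynamics $z' = z + aB$ for rollouts through $p_\theta$; all names (`cem_plan`, `rollout_cost`) and hyperparameters (population, elite count, iterations) are illustrative, not the paper's:

```python
import numpy as np

# Toy CEM planner in latent space. A known linear map z' = z + a @ B stands in
# for rollouts through the learned world model p_theta; the optimization loop
# itself mirrors the CEM procedure described above.
rng = np.random.default_rng(0)
D, A, T = 4, 2, 5                        # latent dim, action dim, planning horizon
B = rng.normal(size=(A, D))              # stand-in dynamics

def rollout_cost(actions, z0, z_goal):
    """Open-loop rollout of an action sequence, scored by L2 goal cost."""
    z = z0.copy()
    for a in actions:
        z = z + a @ B
    return np.sum((z - z_goal) ** 2)

def cem_plan(z0, z_goal, iters=30, pop=256, elite=32):
    """Iteratively refit a Gaussian over action sequences to the elites."""
    mu = np.zeros((T, A))
    sigma = np.ones((T, A))
    for _ in range(iters):
        cand = mu + sigma * rng.normal(size=(pop, T, A))       # sample candidates
        costs = np.array([rollout_cost(c, z0, z_goal) for c in cand])
        elites = cand[np.argsort(costs)[:elite]]               # keep the cheapest
        mu = elites.mean(axis=0)                               # refit the Gaussian
        sigma = elites.std(axis=0) + 1e-6
    return mu

z0 = np.zeros(D)
z_goal = rng.normal(size=A) @ B          # a reachable goal latent
plan = cem_plan(z0, z_goal)
print(rollout_cost(plan, z0, z_goal))    # small residual: goal reached in latent space
```

In the full MPC loop, only the first action of `plan` would be executed before the new observation is re-encoded and `cem_plan` is called again.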

5. Empirical Evaluation and Comparison

DINO-WM is evaluated on a suite of 2D/3D environments involving navigation, manipulation, and multi-particle dynamics using 224×224 RGB observations:

  • PointMaze (navigation),
  • Push-T (block pushing),
  • Wall/Two-Room (randomized barriers),
  • Rope Manipulation (robotic rope pulling),
  • Granular Manipulation (particle aggregation).

Zero-shot performance metrics include Success Rate (SR) on discrete goal tasks and Chamfer Distance (CD) for continuous shape-matching. DINO-WM substantially outperforms competitive baselines including IRIS, DreamerV3, and TD-MPC2. For example, on Push-T, DINO-WM achieves SR = 0.90, compared to IRIS (0.32), DreamerV3 (0.04), and TD-MPC2 (0.00). On Rope Manipulation, DINO-WM yields a CD of 0.41 (lower is better), compared to IRIS (1.11) and DreamerV3 (2.49).

Zero-Shot Planning Performance

| Model     | PointMaze SR | Push-T SR | Wall SR | Rope CD | Granular CD |
|-----------|--------------|-----------|---------|---------|-------------|
| IRIS      | 0.74         | 0.32      | 0.04    | 1.11    | 0.37        |
| DreamerV3 | 1.00         | 0.04      | 1.00    | 2.49    | 1.05        |
| TD-MPC2   | 0.00         | 0.00      | 0.00    | 2.52    | 1.21        |
| DINO-WM   | 0.98         | 0.90      | 0.96    | 0.41    | 0.26        |

(SR: higher is better; CD: lower is better.)

Ablation studies show spatial DINOv2 patch embeddings are critical: replacing them with global R3M, ResNet18, or DINOv2 CLS token leads to significantly degraded performance, especially on manipulation and spatial reasoning tasks (Zhou et al., 2024).

DINO-WM demonstrates robust generalization to novel configurations, such as new object shapes, wall placements, or reduced particle number, maintaining or exceeding baseline performance (e.g., SR = 0.82 on unseen wall configurations vs DreamerV3 SR = 0.76, IRIS SR = 0.06).

Open-loop latent rollouts, decoded with the optional decoder, exhibit very low LPIPS (Push-T: 0.007; Wall: 0.0016) and high SSIM (Push-T: 0.985; Wall: 0.997), outperforming methods directly trained on pixel reconstruction.

6. Insights, Limitations, and Extensions

Several insights emerge from DINO-WM's results:

  • The use of a frozen, high-quality spatial encoder (DINOv2 patches) enables the world model to focus exclusively on modeling environment dynamics, eliminating the confounding of feature and generative modeling objectives.
  • Causal ViT-based latent predictors, trained entirely offline, are sufficient for precise prediction in domains requiring contact dynamics and complex object interactions.
  • Latent-space planning using CEM is both straightforward and effective—no reward shaping or demonstration-derived inverse models are required.

Limitations include:

  • The method requires access to aligned action data during offline training; applicability to settings lacking ground-truth actions is limited.
  • Gradient-based action optimization may be hampered by latent non-smoothness.
  • Current planning is single-level; hierarchical extensions or learning latent cost models for high-level goal following remain open directions.

Comparison to recent related latent world models, notably LaDi-WM (Huang et al., 13 May 2025), highlights a trend: latent-space dynamics built on pre-trained visual feature spaces (e.g., DINO, CLIP/SigLIP) substantially improve generalization and sample efficiency. LaDi-WM further integrates latent diffusion and semantic features for predictive manipulation, confirming the advantages of DINO-based encodings over pixel- or global-token models.

7. Context within Visual World Modeling

DINO-WM exemplifies the shift from pixel-centric to representation-centric visual world models, leveraging the semantic richness and spatial locality of transformer-trained visual encoders. Its decoupled, modular approach contrasts with end-to-end latent-pixel RL systems, supporting markedly improved zero-shot adaptability and interpretability. Future research may further unify geometric and semantic encodings (as in LaDi-WM), integrate language-conditioned goals, or extend to settings with partial or absent action annotation (Zhou et al., 2024, Huang et al., 13 May 2025).
