DINO-WM: Zero-Shot Latent World Model
- The paper introduces DINO-WM, a framework that leverages frozen DINOv2 patch embeddings for zero-shot planning in complex visual environments.
- It employs a causal vision transformer and a learned MLP to predict future latent representations from offline trajectories without pixel reconstruction.
- Empirical evaluations show DINO-WM outperforms baselines in navigation, manipulation, and dynamics tasks, achieving high success rates and low error metrics.
DINO-WM (DINO World Model) is a world modeling framework that leverages spatial patch features from the pre-trained DINOv2 vision transformer to enable zero-shot planning in complex visual environments. Unlike classical world models that rely on generative pixel reconstructions or task-specific latent representations, DINO-WM operates fully in latent space, eschewing pixel reconstruction in favor of direct prediction of future DINOv2 patch embeddings. This paradigm facilitates offline training on diverse, pre-recorded trajectories and supports task-agnostic, zero-shot planning through direct optimization of action sequences to reach a desired visual goal (Zhou et al., 2024).
1. Motivation and Core Principles
Learning accurate visual world models for control from pixels remains a significant challenge, primarily due to two limitations: the computational burden of pixel-level prediction (necessitating generative or diffusion models), and the lack of generality in existing latent-space approaches that entangle representation learning with task or reconstruction objectives. DINO-WM addresses these by using a frozen, spatially structured DINOv2 encoder, which provides object-centric, spatially localized patch embeddings learned from web-scale datasets. This decoupling allows world modeling to focus purely on dynamics in a compact, robust feature space, enabling efficient offline training from passive data and versatile test-time planning (Zhou et al., 2024).
Key principles include:
- Latent-only modeling: All training and inference occur in DINOv2 patch embedding space; pixel reconstruction is unused except for optional qualitative inspection.
- Offline, task-agnostic training: The model is trained purely on trajectories of observations and actions, with no rewards, demonstrations, or inverse models.
- Zero-shot planning: At test time, goal-conditioning is achieved by optimizing over the DINO feature space, allowing immediate adaptation to new tasks by specifying different goal observations.
2. Model Architecture and Dynamics
DINO-WM formalizes dynamics prediction in partially observable MDPs by encoding each observation $o_t$ as fixed DINOv2 patch embeddings $z_t = \mathrm{enc}(o_t) \in \mathbb{R}^{N \times E}$, where $N$ is the number of spatial patches and $E$ is the DINOv2 embedding dimension (typically 384 for ViT-S/14). The encoder $\mathrm{enc}$ is frozen throughout.
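The patch-grid arithmetic above can be made concrete with a few lines of bookkeeping (an illustrative sketch only; the actual frozen encoder is the pre-trained DINOv2 model, not reproduced here):

```python
# Patch-grid bookkeeping for a frozen DINOv2 ViT-S/14 encoder
# (illustrative sketch; in DINO-WM the real pre-trained encoder
# produces these embeddings and is kept frozen).
IMG = 224     # input resolution used for observations
PATCH = 14    # DINOv2 patch size
E = 384       # ViT-S embedding dimension

N = (IMG // PATCH) ** 2   # spatial patch tokens per frame
z_shape = (N, E)          # shape of one encoded observation z_t
print(N, z_shape)
```

At 224×224 input this yields a 16×16 grid, i.e. 256 patch tokens of dimension 384 per frame.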
The transition model is a causal vision transformer (ViT) parameterized by $\theta$, which takes as input the past $H$ DINO-patch latent frames $z_{t-H+1:t}$ and the corresponding action embeddings, and produces a prediction $\hat{z}_{t+1}$ of the next patch collection. Each action $a_t$ is mapped to an embedding via a learned MLP $\phi$ and broadcast to the patch dimension. Causality is enforced so that only earlier frames inform the prediction at each timestep:

$$\hat{z}_{t+1} = p_\theta\big(z_{t-H+1:t},\; \phi(a_{t-H+1:t})\big).$$
An optional decoder can be used for qualitative inspection by mapping predicted latents back to the pixel space, but this decoder is never used in either training or test-time planning.
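The shapes involved in one prediction step can be sketched in numpy. This is a toy stand-in, not the paper's architecture: `predict_next` replaces the causal ViT with a single linear mixing step, and `action_mlp` is a hypothetical two-layer encoder, but the action-broadcast-to-patches pattern matches the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# history length, patches, embed dim, raw action dim, action-embed dim
H, N, E, A_DIM, E_ACT = 3, 256, 384, 2, 10   # illustrative sizes

def action_mlp(a, W1, W2):
    """Hypothetical stand-in for the learned action encoder phi."""
    return np.tanh(a @ W1) @ W2                               # (E_ACT,)

def predict_next(z_hist, a_hist, params):
    """Toy stand-in for the causal ViT p_theta: embed actions, broadcast
    them to every patch, and linearly mix the history into one frame."""
    W1, W2, Wz, Wa = params
    a_emb = np.stack([action_mlp(a, W1, W2) for a in a_hist])  # (H, E_ACT)
    a_tok = np.broadcast_to(a_emb[:, None, :], (H, N, E_ACT))  # per-patch copy
    x = np.concatenate([z_hist, a_tok], axis=-1)               # (H, N, E+E_ACT)
    # aggregate over the H past frames only (no access to future frames)
    return np.mean(x, axis=0) @ np.concatenate([Wz, Wa], axis=0)  # (N, E)

params = (rng.normal(size=(A_DIM, 32)) * 0.1,
          rng.normal(size=(32, E_ACT)) * 0.1,
          rng.normal(size=(E, E)) * 0.01,
          rng.normal(size=(E_ACT, E)) * 0.01)

z_hist = rng.normal(size=(H, N, E))     # frozen-encoder latents of past frames
a_hist = rng.normal(size=(H, A_DIM))    # past actions
z_next_hat = predict_next(z_hist, a_hist, params)
print(z_next_hat.shape)                 # one predicted patch grid, (N, E)
```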
3. Training Regimen and Data
Training proceeds fully offline on pre-collected trajectories of the form $\tau = (o_1, a_1, o_2, a_2, \ldots, o_T)$, without any reward or expert control signals. DINO-WM trains only the transformer weights $\theta$ of $p_\theta$ and the parameters of the action-to-latent MLP $\phi$.
Prediction uses teacher forcing over windows of $H$ frames. For each such window, the model predicts the next step of patch embeddings, supervised via a standard latent-space mean-squared-error loss:

$$\mathcal{L}(\theta, \phi) = \big\| p_\theta\big(z_{t-H+1:t},\, \phi(a_{t-H+1:t})\big) - z_{t+1} \big\|_2^2.$$
No reconstruction, reward, terminal, or auxiliary losses are used. Trajectory collection is environment-specific but always uses random or noisy expert-like behavior, ensuring coverage of varied states (e.g., 2,000 random 50-step PointMaze trajectories, over 18,500 replayed or noisy expert Push-T sequences, and similar volumes for manipulation tasks) (Zhou et al., 2024).
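The teacher-forced training loop amounts to sliding a length-$H$ window over each trajectory of frozen-encoder latents and penalizing the one-step latent error. A minimal sketch, with tiny illustrative sizes and a hypothetical `toy_predictor` (a "copy the last frame" baseline standing in for $p_\theta$; actions are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
H, N, E = 3, 16, 8     # tiny illustrative sizes

def latent_loss(z_pred, z_target):
    """MSE between predicted and frozen-encoder next-frame latents."""
    return float(np.mean((z_pred - z_target) ** 2))

def toy_predictor(z_hist):
    # hypothetical stand-in for p_theta: repeat the most recent frame
    return z_hist[-1]

# One trajectory of T pre-encoded frames z_0 .. z_{T-1}; ground-truth
# history is always fed in (teacher forcing), only the next frame is
# supervised.
T = 8
z_traj = rng.normal(size=(T, N, E))

total = 0.0
for t in range(H, T):
    z_pred = toy_predictor(z_traj[t - H:t])
    total += latent_loss(z_pred, z_traj[t])
avg_loss = total / (T - H)
print(avg_loss)   # average one-step latent MSE over windows
```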
4. Zero-Shot Test-Time Planning
DINO-WM enables zero-shot planning by optimizing action sequences to minimize the distance in DINOv2-patch latent space between the model's roll-out and the goal observation's encoding. Formally, for a given start $o_0$ and goal $o_g$, actions $a_{0:T-1}$ are sought to minimize

$$\mathcal{C}(a_{0:T-1}) = \big\| \hat{z}_T - z_g \big\|_2^2,$$

where $\hat{z}_T$ is obtained by rolling out $p_\theta$ from $z_0 = \mathrm{enc}(o_0)$ under the candidate actions, and $z_g = \mathrm{enc}(o_g)$.
The cross-entropy method (CEM) is the preferred optimizer: it iteratively refines a Gaussian distribution over action sequences by sampling candidates, selecting the elites with the lowest planning cost, refitting the mean and covariance to those elites, and resampling. A gradient-based shooting method is also possible via backpropagation through the world model, but CEM outperforms it because the latent dynamics are non-smooth.
Planning proceeds in a model-predictive control (MPC) loop: after executing the first few (typically one) actions of the best sequence, the current observation is re-encoded and planning resumes, closing the perception–action loop (Zhou et al., 2024).
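The CEM-based planning loop can be sketched generically. This is a minimal illustration, not the paper's implementation: the `rollout` here is a toy 2-D "world model" where actions translate the latent directly, and names like `cem_plan`, `n_elites` are hypothetical choices; in DINO-WM the rollout would unroll $p_\theta$ over DINOv2 patch latents and compare against the encoded goal image.

```python
import numpy as np

rng = np.random.default_rng(2)

def cem_plan(z0, z_goal, rollout, horizon, a_dim,
             n_samples=64, n_elites=8, n_iters=5):
    """Cross-entropy method over open-loop action sequences.

    rollout(z0, actions) unrolls the (here: toy) latent world model and
    returns the final predicted latent; cost is squared distance to the
    goal's latent."""
    mu = np.zeros((horizon, a_dim))
    sigma = np.ones((horizon, a_dim))
    for _ in range(n_iters):
        cand = mu + sigma * rng.normal(size=(n_samples, horizon, a_dim))
        costs = np.array([np.sum((rollout(z0, a) - z_goal) ** 2) for a in cand])
        elites = cand[np.argsort(costs)[:n_elites]]      # lowest-cost sequences
        mu = elites.mean(axis=0)                         # refit Gaussian
        sigma = elites.std(axis=0) + 1e-6
    return mu

# Toy 'world model': the latent is a 2-D point translated by each action.
def rollout(z0, actions):
    return z0 + actions.sum(axis=0)

z0, z_goal = np.zeros(2), np.array([3.0, -1.0])
plan = cem_plan(z0, z_goal, rollout, horizon=4, a_dim=2)
# MPC step: execute only the first planned action, then re-encode and replan.
z1 = rollout(z0, plan[:1])
print(np.linalg.norm(rollout(z0, plan) - z_goal))  # shrinks over CEM iterations
```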
5. Empirical Evaluation and Comparison
DINO-WM is evaluated on a suite of 2D/3D environments involving navigation, manipulation, and multi-particle dynamics using 224×224 RGB observations:
- PointMaze (navigation),
- Push-T (block pushing),
- Wall/Two-Room (randomized barriers),
- Rope Manipulation (robotic rope pulling),
- Granular Manipulation (particle aggregation).
Zero-shot performance metrics include Success Rate (SR) on discrete goal tasks and Chamfer Distance (CD) for continuous shape-matching. DINO-WM substantially outperforms competitive baselines including IRIS, DreamerV3, and TD-MPC2. For example, on Push-T, DINO-WM achieves SR = 0.90, compared to IRIS (0.32), DreamerV3 (0.04), and TD-MPC2 (0.00). On Rope Manipulation, DINO-WM yields a CD of 0.41 (lower is better), compared to IRIS (1.11) and DreamerV3 (2.49).
Zero-Shot Planning Performance
| Model | PointMaze SR | Push-T SR | Wall SR | Rope CD | Granular CD |
|---|---|---|---|---|---|
| IRIS | 0.74 | 0.32 | 0.04 | 1.11 | 0.37 |
| DreamerV3 | 1.00 | 0.04 | 1.00 | 2.49 | 1.05 |
| TD-MPC2 | 0.00 | 0.00 | 0.00 | 2.52 | 1.21 |
| DINO-WM | 0.98 | 0.90 | 0.96 | 0.41 | 0.26 |
Ablation studies show spatial DINOv2 patch embeddings are critical: replacing them with global R3M, ResNet18, or DINOv2 CLS token leads to significantly degraded performance, especially on manipulation and spatial reasoning tasks (Zhou et al., 2024).
DINO-WM demonstrates robust generalization to novel configurations, such as new object shapes, wall placements, or reduced particle number, maintaining or exceeding baseline performance (e.g., SR = 0.82 on unseen wall configurations vs DreamerV3 SR = 0.76, IRIS SR = 0.06).
Open-loop latent rollouts, decoded with the optional decoder, exhibit very low LPIPS (Push-T: 0.007; Wall: 0.0016) and high SSIM (Push-T: 0.985; Wall: 0.997), outperforming methods directly trained on pixel reconstruction.
6. Insights, Limitations, and Extensions
Several insights emerge from DINO-WM's results:
- The use of a frozen, high-quality spatial encoder (DINOv2 patches) enables the world model to focus exclusively on modeling environment dynamics, eliminating the confounding of feature and generative modeling objectives.
- Causal ViT-based latent predictors, trained entirely offline, are sufficient for precise prediction in domains requiring contact dynamics and complex object interactions.
- Latent-space planning using CEM is both straightforward and effective—no reward shaping or demonstration-derived inverse models are required.
Limitations include:
- The method requires access to aligned action data during offline training; applicability to settings lacking ground-truth actions is limited.
- Gradient-based action optimization may be hampered by latent non-smoothness.
- Current planning is single-level; hierarchical extensions or learning latent cost models for high-level goal following remain open directions.
Comparison to recent related latent world models, notably LaDi-WM (Huang et al., 13 May 2025), highlights a trend: latent-based dynamics using pre-trained visual feature spaces (e.g., DINO, CLIP/SigLip) substantially improve generalization and sample efficiency. LaDi-WM further integrates latent diffusion and semantic features for predictive manipulation, confirming the advantages of DINO-based encodings over pixel- or global-token models.
7. Context within Visual World Modeling
DINO-WM exemplifies the shift from pixel-centric to representation-centric visual world models, leveraging the semantic richness and spatial locality of transformer-trained visual encoders. Its decoupled, modular approach contrasts with end-to-end latent-pixel RL systems, supporting markedly improved zero-shot adaptability and interpretability. Future research may further unify geometric and semantic encodings (as in LaDi-WM), integrate language-conditioned goals, or extend to settings with partial or absent action annotation (Zhou et al., 2024, Huang et al., 13 May 2025).