
Learning and Leveraging World Models in Visual Representation Learning

Published 1 Mar 2024 in cs.CV, cs.AI, and cs.LG | (2403.00504v1)

Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe of learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations such as contrastive methods, or equivariant representations such as masked image modelling.


Summary

  • The paper introduces IWM, a novel framework that reuses a transformation-conditioned predictor to learn robust visual representations.
  • It employs a Vision Transformer backbone with masked image modeling and complex augmentations to achieve controlled invariance and equivariance.
  • Fine-tuning the world model on downstream tasks shows notable efficiency gains and improved performance in classification and segmentation.

Learning and Leveraging World Models in Visual Representation Learning

Introduction

This work introduces Image World Models (IWM), a generalization of the Joint-Embedding Predictive Architecture (JEPA) framework, aimed at advancing self-supervised visual representation learning by explicitly learning a world model. Unlike traditional paradigms where the learned world model is discarded after pretraining, IWM treats the predictor as a reusable, fine-tunable world model, thus bridging the gap between world modeling in reinforcement learning and representation learning in computer vision. The framework extends masked image modeling to encompass global photometric transformations within the latent space, yielding representations that can be controlled to be either invariant or equivariant with respect to defined transformations.

Image World Model Architecture

IWM instances are grounded in a Vision Transformer (ViT) backbone, where the predictor network acts as the “world model” operating in latent space. Source and target views are derived from shared base augmentations of the same image: the source is generated by masking patches and applying strong photometric and destructive augmentations, while the target receives only mild transformations. The predictor is conditioned on the transformation parameters and on mask tokens that provide geometric context, and it reconstructs the target's latent representation from the source (Figure 1).

Figure 1: The IWM pipeline: a source image is masked and augmented, then passed through the encoder. The predictor (world model) uses masked source embeddings, transformation parameters, and mask tokens to predict target latent representations computed via EMA of the encoder.

The loss function is the mean squared error between the predictor's output and the EMA-encoded target representation at the predicted token positions. Conditioning the predictor on transformation parameters is essential: models omitting it are forced into learning invariant representations and lose the ability to capture transformation-specific semantics.
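
To make the training loop concrete, the following is a minimal PyTorch-style sketch of one IWM update. The module interfaces (`encoder`, `target_encoder`, `predictor`), the tensor shapes, and the assumption of an equal number of masked patches per sample are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def iwm_step(encoder, target_encoder, predictor, optimizer,
             target_view, source_view, aug_params, mask, ema_decay=0.996):
    """One IWM update (sketch). target_view / source_view: (B, C, H, W) views of the
    same images; aug_params: (B, P) parameters of the photometric transformation
    separating source from target; mask: (B, N) boolean, True where a patch is masked."""
    # Target representations come from the EMA encoder applied to the mildly augmented view.
    with torch.no_grad():
        target_tokens = target_encoder(target_view)             # (B, N, D)

    # The online encoder sees the strongly augmented, masked source view.
    source_tokens = encoder(source_view, mask=mask)             # (B, N_visible, D)

    # The predictor (world model) is conditioned on the transformation parameters
    # and on mask tokens marking the positions to reconstruct.
    predictions = predictor(source_tokens, aug_params, mask)    # (B, N_masked, D)

    # Squared L2 distance to the EMA targets at the predicted positions.
    targets = target_tokens[mask].view(predictions.shape)
    loss = F.mse_loss(predictions, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Exponential-moving-average update of the target encoder.
    with torch.no_grad():
        for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
            p_t.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)

    return loss.item()
```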

World Model Equivariance, Invariance, and Capacity

A core contribution is the empirical dissection of the requirements to learn effective image world models:

  • Conditioning on Transformation: Proper conditioning (feature concatenation or sequence tokens) is necessary for the predictor to be non-invariant and to capture the structure of the transformation; a conditioning sketch follows this list.
  • Transformation Complexity: Utilizing sufficiently complex photometric and destructive augmentations pushes the predictor toward modeling transformation equivariance. Weak augmentations lead to degenerate invariant solutions.
  • Predictor Capacity: Predictor depth and embedding dimensionality directly affect the ability to learn nontrivial world models; deeper predictors demonstrate more robust and stable learning of equivariant representations (Figure 2).

    Figure 2: The predictor retrieves nearest-neighbor embeddings after applying latent-space transformations, confirming the ability to model and invert augmentations.
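
The two conditioning routes mentioned in the first item above, feature concatenation and sequence tokens, can be sketched as follows. The module, parameter names, and shapes are assumptions for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class TransformationConditioning(nn.Module):
    """Embed raw augmentation parameters and inject them into the predictor input,
    either as an extra sequence token or via feature concatenation."""

    def __init__(self, embed_dim: int, num_aug_params: int, mode: str = "token"):
        super().__init__()
        self.mode = mode
        self.embed = nn.Linear(num_aug_params, embed_dim)     # project raw parameters to token width
        self.proj = nn.Linear(2 * embed_dim, embed_dim)       # used only for feature concatenation

    def forward(self, tokens: torch.Tensor, aug_params: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) predictor input; aug_params: (B, P) transformation parameters.
        cond = self.embed(aug_params)                         # (B, D)
        if self.mode == "token":
            # Sequence-token conditioning: append the embedded parameters as one extra token.
            return torch.cat([tokens, cond.unsqueeze(1)], dim=1)     # (B, N + 1, D)
        # Feature concatenation: concatenate along the feature dimension, then project back to D.
        expanded = cond.unsqueeze(1).expand(-1, tokens.size(1), -1)  # (B, N, D)
        return self.proj(torch.cat([tokens, expanded], dim=-1))      # (B, N, D)
```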

Visualization and Qualitative Analysis

The paper presents detailed visual evidence of the predictor’s capability. When queried with a source embedding and transformation parameters, IWM maps these to latent codes whose nearest neighbors in an augmented latent bank consistently match the appropriate transformed targets, with slight limitations in perfectly inverting non-bijective transformations (e.g., grayscale; Figure 3).

Figure 3: Retrieval in latent space—nearest neighbor matches for predictions are the original or minimally transformed images, indicating fine-grained control over latent transformations.
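
This retrieval analysis can be summarized in a short sketch. The interfaces are assumed, and representations are mean-pooled over tokens for readability, which simplifies the token-level predictor used in the paper.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def nearest_neighbor_retrieval(encoder, predictor, image, aug_params,
                               bank_images, bank_labels):
    """Predict the latent of `image` under the transformation `aug_params`, then
    retrieve the nearest neighbor from a bank of encoded augmented images."""
    source = encoder(image.unsqueeze(0)).mean(dim=1)           # (1, D) pooled source latent
    predicted = predictor(source, aug_params.unsqueeze(0))     # (1, D) predicted transformed latent

    # Encode every augmented bank image and mean-pool its tokens.
    bank = torch.stack([encoder(img.unsqueeze(0)).mean(dim=1).squeeze(0)
                        for img in bank_images])               # (K, D)

    # Cosine similarity between the prediction and every bank entry.
    sims = F.cosine_similarity(predicted, bank)                # (K,)
    best = int(sims.argmax())
    return best, bank_labels[best]
```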

Further, the world model's capacity for fine-grained control is illustrated by systematically varying transformation parameters in latent space, producing smooth and interpretable changes in the predicted representations (Figure 4).

Figure 4: IWM's predictions across systematically varied transformation parameters, highlighting the granularity and faithfulness of the learned model.

Leveraging World Models for Downstream Tasks

Crucially, the IWM predictor is shown to be directly beneficial for downstream discriminative tasks:

  • Predictor Fine-Tuning: Fine-tuning only the world model on top of a frozen encoder matches or surpasses standard encoder fine-tuning. Gains exceed 1.8 accuracy points on ImageNet classification when using the equivariant IWM predictor rather than a randomly initialized head of the same architecture, at a fraction of the parameter cost.
  • Multitask Finetuning: Inspired by instruction tuning, the predictor can be conditioned on task tokens and fine-tuned jointly over multiple tasks, maintaining or improving single-task baselines while amortizing adaptation cost (Figure 5; a sketch follows this list).

    Figure 5: Multitask predictor tuning setup—batches are mixed across tasks, and the predictor is conditioned using task-specific tokens, enabling label-efficient multitask adaptation.

  • Segmentation: Similar trends are observed for semantic segmentation tasks such as ADE20k, with predictor fine-tuning for the equivariant IWM delivering significant gains over encoder-based adaptation.
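
As referenced in the multitask item above, the following is a rough sketch of how a predictor initialized from the IWM world model could be conditioned with task tokens on top of a frozen encoder. The class, the read-out at the task-token position, and all names and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultitaskPredictorTuner(nn.Module):
    """Frozen-encoder features pass through the pretrained predictor, conditioned on a
    learned per-task token; a lightweight linear head per task produces the outputs."""

    def __init__(self, predictor: nn.Module, embed_dim: int, task_output_dims: dict):
        super().__init__()
        self.predictor = predictor                              # initialized from the IWM world model
        self.task_tokens = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(1, 1, embed_dim))
            for name in task_output_dims})
        self.heads = nn.ModuleDict({
            name: nn.Linear(embed_dim, out_dim)
            for name, out_dim in task_output_dims.items()})

    def forward(self, frozen_tokens: torch.Tensor, task_name: str) -> torch.Tensor:
        # frozen_tokens: (B, N, D) patch embeddings from the frozen encoder.
        token = self.task_tokens[task_name].expand(frozen_tokens.size(0), -1, -1)
        x = torch.cat([token, frozen_tokens], dim=1)            # prepend the task token
        x = self.predictor(x)                                   # predictor treated here as a plain token-sequence transformer
        return self.heads[task_name](x[:, 0])                   # read out at the task-token position
```

During fine-tuning, batches from the different tasks are interleaved and only the predictor, task tokens, and heads receive gradients; the encoder stays frozen.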

Representational Tradeoffs: Abstraction Spectrum

IWM enables continuous control of representational abstraction along the invariance–equivariance spectrum:

  • Invariant Models: Achieved by disabling transformation conditioning or reducing predictor complexity, yielding representations similar to those of contrastive methods; linear probing performance is maximized because transformation-specific information is abstracted away.
  • Equivariant Models: With ample predictor capacity and strong conditioning, richer, transformation-aware representations are learned, paralleling masked image modeling; these excel under richer adaptation strategies (predictor or encoder fine-tuning, attentive probing) but underperform in linear evaluation (Figure 6).

Figure 6: Performance trade-offs across the equivariance–invariance spectrum for IWM according to linear, attentive, and predictor finetuning protocols.

This interpretability and control are uniquely facilitated by the explicit world model design in IWM, positioning the approach as modular and adaptable to task requirements.

Implications and Future Directions

The explicit modeling and re-use of the world model in visual representation learning have both practical and theoretical implications:

  • Parameter and Compute Efficiency: Reusing the IWM predictor for downstream adaptation avoids the inefficiency of full encoder retraining; parameter sharing is highly beneficial in many-task regimes.
  • Versatile Semantics: The approach subsumes contrastive and masked modeling as special cases, enabling flexible design of representations based on the task's abstraction needs.
  • Extension Potential: The framework naturally extends to multimodal, continual, and instruction-tuned settings—avenues vital for scalable, general-purpose visual models.

Conclusion

This work establishes Image World Models as a principled, general, and practical extension of the JEPA framework. By conditioning the predictor, tuning transformation complexity, and scaling predictor capacity, IWM learns representations whose abstraction level is continuously tunable and whose predictor can be efficiently repurposed for discriminative and multi-task downstream settings. This view unifies self-supervised approaches under a world-modeling lens and is likely to inform future architecture and protocol choices in scalable visual and multimodal representation learning.
