
Learning and Leveraging World Models in Visual Representation Learning

Published 1 Mar 2024 in cs.CV, cs.AI, and cs.LG | (2403.00504v1)

Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe of learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations such as contrastive methods, or equivariant representations such as masked image modelling.


Summary

  • The paper introduces IWM, a novel framework that reuses a transformation-conditioned predictor to learn robust visual representations.
  • It employs a Vision Transformer backbone with masked image modeling and complex augmentations to achieve controlled invariance and equivariance.
  • Fine-tuning the world model on downstream tasks shows notable efficiency gains and improved performance in classification and segmentation.

Learning and Leveraging World Models in Visual Representation Learning

Introduction

This work introduces Image World Models (IWM), a generalization of the Joint-Embedding Predictive Architecture (JEPA) framework, aimed at advancing self-supervised visual representation learning by explicitly learning a world model. Unlike traditional paradigms where the learned world model is discarded after pretraining, IWM treats the predictor as a reusable, fine-tunable world model, thus bridging the gap between world modeling in reinforcement learning and representation learning in computer vision. The framework extends masked image modeling to encompass global photometric transformations within the latent space, yielding representations that can be controlled to be either invariant or equivariant with respect to defined transformations.

Image World Model Architecture

IWM instances are grounded in a Vision Transformer (ViT) backbone, where the predictor network acts as the “world model” operating in latent space. Source and target views are derived from shared base augmentations of the same image: the source is generated by masking patches and applying strong photometric and destructive augmentations, while the target receives only mild transformations. The predictor is conditioned on the transformation parameters and on mask tokens that provide geometric context, and it reconstructs the target's latent representation from the source (Figure 1).

Figure 1: The IWM pipeline: a source image is masked and augmented, then passed through the encoder. The predictor (world model) uses masked source embeddings, transformation parameters, and mask tokens to predict target latent representations computed via EMA of the encoder.

The loss function is the mean squared error between the predictor's output and the EMA-encoded target representation at the predicted token positions. Conditioning the predictor on transformation parameters is essential: models omitting it are forced into learning invariant representations and lose the ability to capture transformation-specific semantics.
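
To make the training loop concrete, the following is a minimal PyTorch-style sketch of one IWM update. The module interfaces (`encoder`, `target_encoder`, `predictor`), the tensor shapes, and the assumption of an equal number of masked patches per sample are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def iwm_step(encoder, target_encoder, predictor, optimizer,
             target_view, source_view, aug_params, mask, ema_decay=0.996):
    """One IWM update (sketch). target_view / source_view: (B, C, H, W) views of the
    same images; aug_params: (B, P) parameters of the photometric transformation
    separating source from target; mask: (B, N) boolean, True where a patch is masked."""
    # Target representations come from the EMA encoder applied to the mildly augmented view.
    with torch.no_grad():
        target_tokens = target_encoder(target_view)             # (B, N, D)

    # The online encoder sees the strongly augmented, masked source view.
    source_tokens = encoder(source_view, mask=mask)             # (B, N_visible, D)

    # The predictor (world model) is conditioned on the transformation parameters
    # and on mask tokens marking the positions to reconstruct.
    predictions = predictor(source_tokens, aug_params, mask)    # (B, N_masked, D)

    # Squared L2 distance to the EMA targets at the predicted positions.
    targets = target_tokens[mask].view(predictions.shape)
    loss = F.mse_loss(predictions, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Exponential-moving-average update of the target encoder.
    with torch.no_grad():
        for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
            p_t.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)

    return loss.item()
```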

World Model Equivariance, Invariance, and Capacity

A core contribution is the empirical dissection of the requirements to learn effective image world models:

  • Conditioning on Transformation: Proper conditioning (feature concatenation or sequence tokens) is necessary for the predictor to be non-invariant and to capture the structure of the transformation; a conditioning sketch follows this list.
  • Transformation Complexity: Utilizing sufficiently complex photometric and destructive augmentations pushes the predictor toward modeling transformation equivariance. Weak augmentations lead to degenerate invariant solutions.
  • Predictor Capacity: Predictor depth and embedding dimensionality directly affect the ability to learn nontrivial world models; deeper predictors demonstrate more robust and stable learning of equivariant representations (Figure 2).

    Figure 2: The predictor retrieves nearest-neighbor embeddings after applying latent-space transformations, confirming the ability to model and invert augmentations.
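
The two conditioning routes mentioned in the first item above, feature concatenation and sequence tokens, can be sketched as follows. The module, parameter names, and shapes are assumptions for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class TransformationConditioning(nn.Module):
    """Embed raw augmentation parameters and inject them into the predictor input,
    either as an extra sequence token or via feature concatenation."""

    def __init__(self, embed_dim: int, num_aug_params: int, mode: str = "token"):
        super().__init__()
        self.mode = mode
        self.embed = nn.Linear(num_aug_params, embed_dim)     # project raw parameters to token width
        self.proj = nn.Linear(2 * embed_dim, embed_dim)       # used only for feature concatenation

    def forward(self, tokens: torch.Tensor, aug_params: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) predictor input; aug_params: (B, P) transformation parameters.
        cond = self.embed(aug_params)                         # (B, D)
        if self.mode == "token":
            # Sequence-token conditioning: append the embedded parameters as one extra token.
            return torch.cat([tokens, cond.unsqueeze(1)], dim=1)     # (B, N + 1, D)
        # Feature concatenation: concatenate along the feature dimension, then project back to D.
        expanded = cond.unsqueeze(1).expand(-1, tokens.size(1), -1)  # (B, N, D)
        return self.proj(torch.cat([tokens, expanded], dim=-1))      # (B, N, D)
```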

Visualization and Qualitative Analysis

The paper presents detailed visual evidence of the predictor’s capability. When queried with a source embedding and transformation parameters, IWM maps these to latent codes whose nearest neighbors in an augmented latent bank consistently match the appropriate transformed targets, with slight limitations in perfectly inverting non-bijective transformations (e.g., grayscale; Figure 3).

Figure 3: Retrieval in latent space—nearest neighbor matches for predictions are the original or minimally transformed images, indicating fine-grained control over latent transformations.
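
This retrieval analysis can be summarized in a short sketch. The interfaces are assumed, and representations are mean-pooled over tokens for readability, which simplifies the token-level predictor used in the paper.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def nearest_neighbor_retrieval(encoder, predictor, image, aug_params,
                               bank_images, bank_labels):
    """Predict the latent of `image` under the transformation `aug_params`, then
    retrieve the nearest neighbor from a bank of encoded augmented images."""
    source = encoder(image.unsqueeze(0)).mean(dim=1)           # (1, D) pooled source latent
    predicted = predictor(source, aug_params.unsqueeze(0))     # (1, D) predicted transformed latent

    # Encode every augmented bank image and mean-pool its tokens.
    bank = torch.stack([encoder(img.unsqueeze(0)).mean(dim=1).squeeze(0)
                        for img in bank_images])               # (K, D)

    # Cosine similarity between the prediction and every bank entry.
    sims = F.cosine_similarity(predicted, bank)                # (K,)
    best = int(sims.argmax())
    return best, bank_labels[best]
```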

Further, the world model's capacity for fine-grained control is illustrated by systematically varying transformation parameters in latent space, producing smooth and interpretable changes in the predicted representations (Figure 4).

Figure 4: IWM's predictions across systematically varied transformation parameters, highlighting the granularity and faithfulness of the learned model.

Leveraging World Models for Downstream Tasks

Crucially, the IWM predictor is shown to be directly beneficial for downstream discriminative tasks:

  • Predictor Fine-Tuning: Fine-tuning only the world model on top of a frozen encoder matches or surpasses standard encoder fine-tuning. Gains exceed 1.8 accuracy points on ImageNet classification when using the equivariant IWM predictor rather than a randomly initialized head of the same architecture, at a fraction of the parameter cost.
  • Multitask Finetuning: Inspired by instruction tuning, the predictor can be conditioned on task tokens and fine-tuned jointly over multiple tasks, maintaining or improving single-task baselines while amortizing adaptation cost (Figure 5; a sketch follows this list).

    Figure 5: Multitask predictor tuning setup—batches are mixed across tasks, and the predictor is conditioned using task-specific tokens, enabling label-efficient multitask adaptation.

  • Segmentation: Similar trends are observed for semantic segmentation tasks such as ADE20k, with predictor fine-tuning for the equivariant IWM delivering significant gains over encoder-based adaptation.
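
As referenced in the multitask item above, the following is a rough sketch of how a predictor initialized from the IWM world model could be conditioned with task tokens on top of a frozen encoder. The class, the read-out at the task-token position, and all names and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultitaskPredictorTuner(nn.Module):
    """Frozen-encoder features pass through the pretrained predictor, conditioned on a
    learned per-task token; a lightweight linear head per task produces the outputs."""

    def __init__(self, predictor: nn.Module, embed_dim: int, task_output_dims: dict):
        super().__init__()
        self.predictor = predictor                              # initialized from the IWM world model
        self.task_tokens = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(1, 1, embed_dim))
            for name in task_output_dims})
        self.heads = nn.ModuleDict({
            name: nn.Linear(embed_dim, out_dim)
            for name, out_dim in task_output_dims.items()})

    def forward(self, frozen_tokens: torch.Tensor, task_name: str) -> torch.Tensor:
        # frozen_tokens: (B, N, D) patch embeddings from the frozen encoder.
        token = self.task_tokens[task_name].expand(frozen_tokens.size(0), -1, -1)
        x = torch.cat([token, frozen_tokens], dim=1)            # prepend the task token
        x = self.predictor(x)                                   # predictor treated here as a plain token-sequence transformer
        return self.heads[task_name](x[:, 0])                   # read out at the task-token position
```

During fine-tuning, batches from the different tasks are interleaved and only the predictor, task tokens, and heads receive gradients; the encoder stays frozen.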

Representational Tradeoffs: Abstraction Spectrum

IWM enables continuous control of representational abstraction along the invariance–equivariance spectrum:

  • Invariant Models: Achieved by disabling transformation conditioning or reducing predictor complexity, yielding representations similar to those of contrastive methods; linear probing performance is maximized because transformation-specific information is abstracted away.
  • Equivariant Models: With ample predictor capacity and strong conditioning, richer, transformation-aware representations are learned, paralleling masked image modeling; these excel under richer adaptation strategies (predictor or encoder fine-tuning, attentive probing) but underperform in linear evaluation (Figure 6).

Figure 6: Performance trade-offs across the equivariance–invariance spectrum for IWM according to linear, attentive, and predictor finetuning protocols.

This interpretability and control are uniquely facilitated by the explicit world model design in IWM, positioning the approach as modular and adaptable to task requirements.

Implications and Future Directions

The explicit modeling and re-use of the world model in visual representation learning have both practical and theoretical implications:

  • Parameter and Compute Efficiency: Reusing the IWM predictor for downstream adaptation avoids the inefficiency of full encoder retraining; parameter sharing is highly beneficial in many-task regimes.
  • Versatile Semantics: The approach subsumes contrastive and masked modeling as special cases, enabling flexible design of representations based on the task's abstraction needs.
  • Extension Potential: The framework naturally extends to multimodal, continual, and instruction-tuned settings—avenues vital for scalable, general-purpose visual models.

Conclusion

This work establishes Image World Models as a principled, general, and practical extension of the JEPA framework. By conditioning the predictor, tuning transformation complexity, and scaling predictor capacity, IWM learns representations whose abstraction level is continuously tunable and whose predictor can be efficiently repurposed for discriminative and multi-task downstream settings. This view unifies self-supervised approaches under a world-modeling lens and is likely to inform future architecture and protocol choices in scalable visual and multimodal representation learning.
