Optimal Objective for Representation Learning in Image-Based World Models

Determine the optimal objective function for learning latent state representations in image-based model-based reinforcement learning, for example in world models built on the Recurrent State-Space Model (RSSM), so that the learned representations capture task-essential information without overfitting to irrelevant visual details.

Background

The paper highlights that, despite the success of architectures like the Recurrent State-Space Model (RSSM) in world-model-based reinforcement learning, the choice of objective for learning latent representations from images is not settled. In high-dimensional visual settings, representation learning is especially challenging, making the selection of an effective objective crucial.
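To make the setting concrete, the following is a minimal NumPy sketch of one RSSM-style latent transition: a deterministic recurrent state updated from the previous state, latent, and action, plus a stochastic latent sampled from a state-conditioned prior. All dimensions, weight matrices, and the unit-variance prior are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
H, Z, A = 16, 8, 2  # deterministic state, stochastic latent, action

W_h = rng.standard_normal((H, H + Z + A)) * 0.1  # recurrent weights (toy)
W_mu = rng.standard_normal((Z, H)) * 0.1         # prior-mean weights (toy)

def rssm_step(h, z, a):
    # Deterministic path: recurrent update from previous state, latent, action.
    h_next = np.tanh(W_h @ np.concatenate([h, z, a]))
    # Stochastic path: sample the next latent from a state-conditioned prior
    # (unit variance here purely for simplicity).
    mu = W_mu @ h_next
    z_next = mu + rng.standard_normal(Z)
    return h_next, z_next

h, z, a = np.zeros(H), np.zeros(Z), np.ones(A)
h, z = rssm_step(h, z, a)
print(h.shape, z.shape)  # (16,) (8,)
```

The open question concerns which loss should shape the latents `h` and `z`, not this transition structure itself.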

Decoder-based approaches optimize pixel-level reconstruction, which can overemphasize large but task-irrelevant background regions. Decoder-free approaches typically rely on data augmentation as an external regularizer to prevent representation collapse, but augmentation choices can distort task-critical information. These limitations motivate the explicit open question of what objective should guide representation learning in image-based world models.
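The two failure modes above can be illustrated with toy versions of the respective objectives: a pixel-wise reconstruction loss, where every pixel contributes equally (so large backgrounds dominate), and an augmentation-invariance loss between two views, which is trivially zero for collapsed or identical representations and so needs an external regularizer. Both functions are simplified sketches, not any particular method's loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_loss(decoded, image):
    # Pixel-wise MSE: each pixel is weighted equally, so a large
    # task-irrelevant background region dominates the signal.
    return np.mean((decoded - image) ** 2)

def invariance_loss(z_a, z_b):
    # Decoder-free sketch: 1 - cosine similarity between representations
    # of two augmented views; identical (or collapsed) views give 0,
    # which is why such objectives need an anti-collapse regularizer.
    z_a = z_a / np.linalg.norm(z_a, axis=-1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(z_a * z_b, axis=-1))

image = rng.standard_normal((64, 64, 3))
decoded = image + 0.1 * rng.standard_normal((64, 64, 3))
print(reconstruction_loss(decoded, image))  # small, noise-level MSE

z = rng.standard_normal((8, 32))
print(invariance_loss(z, z))  # identical views: ~0.0
```

Neither term, on its own, selects for task-essential information, which is precisely the gap the open question points at.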

References

While architectures like the Recurrent State-Space Model (RSSM) have achieved remarkable success (Hafner et al., 2025), a fundamental question remains open: What is the optimal objective function for learning the representation itself?

R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation  (2603.18202 - Morihira et al., 18 Mar 2026) in Section 1 (Introduction)