Scaling abstract world-model representations across domains and modalities

Develop scalable methods for learning compact, abstract world-model representations that discard irrelevant details and generalize across arbitrary domains and modalities, rather than approaches that preserve complete observational information through reconstructable latent representations or raw data.

Background

The paper contrasts human mental models, which use compact representations that omit irrelevant details, with most current AI world-model techniques, which retain full observational information either through reconstructable latents or raw data such as pixels. It highlights that modern video-generation world models often capture detailed pixel-level dynamics, whereas language provides a higher level of abstraction.
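To make the contrast concrete, the following is a minimal toy sketch (not from the paper) of why a reconstruction objective forces a latent to retain irrelevant detail, while a latent self-prediction (dynamics) objective does not. The two-dimensional observation, the encoder that drops the noise dimension, and both losses are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observations: the first dim is task-relevant state advancing by +1
# per step; the second dim is irrelevant noise (hypothetical setup).
T = 100
state = np.arange(T, dtype=float)
noise = rng.normal(size=T)
obs = np.stack([state, noise], axis=1)           # shape (T, 2)

def encode(o):
    """Abstract encoder: keep the relevant dim, discard the noise dim."""
    return o[:, :1]                               # shape (T, 1)

z = encode(obs)

# 1) Reconstruction objective: decode z back to the full observation.
#    No decoder can recover the discarded noise, so this loss is bounded
#    below by (half of) the noise variance; minimizing it pushes the
#    encoder to keep the irrelevant dimension.
decoded = np.concatenate([z, np.zeros_like(z)], axis=1)  # noise guessed as 0
recon_loss = np.mean((decoded - obs) ** 2)

# 2) Latent-dynamics objective: predict the next latent from the current
#    one. Here z_{t+1} = z_t + 1, so the loss is zero even though the
#    noise dimension was thrown away.
pred_next = z[:-1] + 1.0
dyn_loss = np.mean((pred_next - z[1:]) ** 2)

print(f"reconstruction loss: {recon_loss:.3f}")   # stuck near noise variance / 2
print(f"latent-dynamics loss: {dyn_loss:.3f}")    # exactly 0 here
```

The point of the sketch is only that the choice of training objective, not model capacity, determines whether irrelevant observation detail can be discarded; real abstract world models would replace the hand-built encoder with a learned one.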

The authors note that despite interest in abstract representations (e.g., compact state abstractions), a principled, scalable approach that works across arbitrary domains and modalities has not been established. This motivates unified multimodal models as a potential pathway, but the general scaling problem remains unresolved.

References

Although psychology and cognitive science suggest that human mental models rely on compact representations that discard irrelevant details, how to scale approaches capable of learning such abstract representations to arbitrary domains and modalities is still unclear.

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models  (2601.19834 - Wu et al., 27 Jan 2026) in Section 2: Related Work — World models