Develop a Continuous Unified Visual Tokenizer for Understanding and Generation

Develop a simple yet effective continuous visual tokenizer that naturally supports both visual understanding and image generation.

Background

Unified multimodal models often use two separate tokenizers to produce semantic tokens for understanding and pixel-reconstructable tokens for generation, increasing system complexity and limiting synergy between tasks.

Alternative approaches based on discrete, quantized representations introduce discretization errors that can degrade generation quality. This motivates a continuous tokenizer that can serve both understanding and generation without such drawbacks. OpenVision 3 is presented as a step toward this goal, but the broader challenge remains open.
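To make the contrast concrete, here is a toy NumPy sketch of the two token paths. Everything in it — the shapes, the linear encoder, and the nearest-neighbor codebook quantizer — is an illustrative assumption, not the paper's architecture: it only shows that a continuous latent passes through losslessly, while snapping it to a finite codebook introduces the discretization error described above.

```python
import numpy as np

rng = np.random.default_rng(0)

class ContinuousTokenizer:
    """Illustrative unified tokenizer: one continuous latent space serves both
    understanding (semantic features) and generation (pixel reconstruction).
    All dimensions and projections here are hypothetical."""

    def __init__(self, patch_dim: int, latent_dim: int):
        self.enc = rng.standard_normal((patch_dim, latent_dim)) * 0.02
        self.dec = rng.standard_normal((latent_dim, patch_dim)) * 0.02

    def encode(self, patches: np.ndarray) -> np.ndarray:
        # Continuous tokens: no quantization step, so no discretization error.
        return patches @ self.enc

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # Pixel-reconstruction path used for generation.
        return tokens @ self.dec

def quantize(tokens: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    # Discrete baseline: snap each token to its nearest codebook entry,
    # which is where discretization error enters.
    dists = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[dists.argmin(axis=1)]

patches = rng.standard_normal((16, 48))   # 16 flattened image patches (toy sizes)
tok = ContinuousTokenizer(patch_dim=48, latent_dim=8)

z = tok.encode(patches)                   # continuous tokens, preserved exactly
codebook = rng.standard_normal((4, 8))    # deliberately tiny codebook
zq = quantize(z, codebook)                # discrete tokens, information lost

quant_err = np.abs(zq - z).mean()         # nonzero discretization error
```

The continuous path hands `z` to the generation decoder unchanged, whereas the discrete path can only hand over `zq`, so `quant_err` bounds how faithful generation can be; a unified continuous tokenizer avoids this loss while still exposing the same latent for understanding.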

As a result, developing a simple yet effective continuous visual tokenizer that naturally supports both visual understanding and generation remains an open and practically important challenge.

References

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation  (2601.15369 - Zhang et al., 21 Jan 2026) in Section 1, Introduction