Do multimodal models provide capabilities beyond language-only models?

Determine whether the joint modeling of text, vision, and audio can introduce capabilities beyond those captured by language-only models.

Background

LongCat-Next adopts the Discrete Native Autoregression (DiNA) paradigm to represent text, vision, and audio as discrete tokens within a single decoder-only model, aiming to unify understanding and generation across modalities without sacrificing language ability.

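As an illustration of this unified discrete-token setup, the sketch below shows one common way such a scheme can be wired up: modality-specific tokenizers map text, images, and audio into disjoint ranges of a single shared vocabulary, and one decoder-only transformer predicts the next token over all modalities. This is a minimal, hypothetical PyTorch sketch, not the LongCat-Next implementation; the vocabulary sizes, tokenizers, and model dimensions are placeholder assumptions.

```python
# Illustrative sketch of a shared discrete-token space for a decoder-only model.
# All sizes and the tokenizers themselves are placeholder assumptions.

import torch
import torch.nn as nn

# Assumed vocabulary layout: [text | image codes | audio codes | special tokens]
TEXT_VOCAB = 32_000      # e.g. BPE text tokens
IMAGE_CODES = 8_192      # e.g. codes from a VQ image tokenizer
AUDIO_CODES = 4_096      # e.g. codes from a neural audio codec
SPECIAL = 4              # <bos>, <eos>, <img>, <aud>
VOCAB_SIZE = TEXT_VOCAB + IMAGE_CODES + AUDIO_CODES + SPECIAL

IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_CODES
BOS = VOCAB_SIZE - 4

def to_shared_ids(text_ids, image_codes, audio_codes):
    """Shift modality-local code indices into the shared vocabulary."""
    return torch.cat([
        torch.tensor([BOS]),
        text_ids,                      # already in [0, TEXT_VOCAB)
        image_codes + IMAGE_OFFSET,    # VQ codes shifted past the text range
        audio_codes + AUDIO_OFFSET,    # codec codes shifted past the image range
    ])

class TinyDecoder(nn.Module):
    """A small decoder-only transformer over the shared token space."""
    def __init__(self, d_model=256, n_layers=2, n_heads=4, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, ids):
        seq_len = ids.size(1)
        x = self.embed(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.blocks(x, mask=causal_mask.to(ids.device))
        return self.lm_head(x)   # next-token logits over all modalities

if __name__ == "__main__":
    # Toy inputs standing in for real tokenizer outputs.
    text_ids = torch.randint(0, TEXT_VOCAB, (16,))
    image_codes = torch.randint(0, IMAGE_CODES, (64,))   # e.g. an 8x8 grid of VQ codes
    audio_codes = torch.randint(0, AUDIO_CODES, (32,))
    seq = to_shared_ids(text_ids, image_codes, audio_codes).unsqueeze(0)
    logits = TinyDecoder()(seq)
    print(seq.shape, logits.shape)  # torch.Size([1, 113]) torch.Size([1, 113, 44292])
```

Under this kind of scheme, understanding and generation reduce to the same next-token objective regardless of modality, which is what makes the question of whether perceptual tokens add capabilities beyond text a well-posed comparison.
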
Despite these advances, the authors explicitly identify a central unresolved question: whether multimodality confers abilities that surpass what can be learned from language alone. They further note that the interaction between large-scale perceptual pretraining and discrete token modeling remains underexplored, with preliminary results suggesting a potential mismatch between continuous perceptual representations and discrete token modeling.

References

A central open question is whether multimodality can introduce capabilities beyond those already captured by language. In particular, the interaction between large-scale perceptual pretraining and discrete token modeling remains underexplored.

LongCat-Next: Lexicalizing Modalities as Discrete Tokens (2603.27538 - Team et al., 29 Mar 2026), Section: Discussion and Future Work — Data Scaling and Representation Learning