Do multimodal models provide capabilities beyond language-only models?
Determine whether jointly modeling text, vision, and audio introduces capabilities beyond those captured by language-only models.
References
A central open question is whether multimodality can introduce capabilities beyond those already captured by language. In particular, the interaction between large-scale perceptual pretraining and discrete token modeling remains underexplored.
— LongCat-Next: Lexicalizing Modalities as Discrete Tokens
(2603.27538 - Team et al., 29 Mar 2026) in Section: Discussion and Future Work — Data Scaling and Representation Learning