Richer spatial representation without increasing visual tokens
Determine whether there exists a visual representation for vision encoders in vision–language models that encodes richer spatial information without increasing the number of visual tokens, so that spatially grounded evidence can be preserved under a fixed multimodal token budget.
References
To capture fine details, VLMs often increase image resolution or the number of visual tokens, but this quickly raises compute and memory costs in both the vision encoder and the LLM. This limitation raises an interesting open question: is there a better visual representation that also encodes richer spatial information without increasing the number of vision tokens?
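To make the cost pressure concrete, the following sketch (an illustration, not code from the paper) shows how patch-token count, and hence quadratic self-attention cost, grows with input resolution for a ViT-style encoder; the patch size of 14 is an assumption chosen for illustration:

```python
def vit_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (CLS token ignored)."""
    return (image_size // patch_size) ** 2

for size in (224, 448, 896):
    tokens = vit_token_count(size)
    # Self-attention compares every token pair, so cost grows ~quadratically
    # in token count: doubling resolution yields 4x tokens and ~16x pairs.
    print(f"{size}px -> {tokens} tokens, ~{tokens**2:,} attention pairs")
```

Doubling resolution from 224px to 448px quadruples the visual tokens (256 to 1024), which is exactly the blow-up the question above seeks to avoid.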
— Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
(2603.19209 - Kuo et al., 19 Mar 2026) in Introduction, Section 1