Richer spatial representation without increasing visual tokens

Determine whether there exists a visual representation for vision encoders in vision–language models (VLMs) that encodes richer spatial information without increasing the number of visual tokens, so that spatially grounded evidence can be preserved under a fixed multimodal token budget.

Background

Vision–language models typically operate under a fixed budget of visual tokens for efficiency. Increasing image resolution or token count improves fine-grained spatial detail, but rapidly inflates compute and memory in both the vision encoder and the LLM.
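To make the cost concrete, here is an illustrative sketch (not from the paper) of how ViT-style patchification ties visual token count, and hence LLM attention cost, to image resolution. The patch size of 14 is a common choice for CLIP-style encoders, assumed here for illustration.

```python
# Illustrative sketch: token count and attention cost vs. resolution,
# assuming a ViT-style encoder with non-overlapping square patches.
def num_visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Tokens produced by splitting a square image into patches."""
    return (image_size // patch_size) ** 2

# Doubling resolution quadruples the token count ...
tokens_336 = num_visual_tokens(336)   # 24 * 24 = 576
tokens_672 = num_visual_tokens(672)   # 48 * 48 = 2304

# ... and self-attention cost grows quadratically in sequence length,
# so the attention cost ratio is (2304 / 576) ** 2 = 16x.
attn_cost_ratio = (tokens_672 / tokens_336) ** 2

print(tokens_336, tokens_672, attn_cost_ratio)
```

This quadratic blow-up in the LLM, on top of the encoder's own cost, is why simply raising resolution is not a sustainable route to richer spatial detail.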

The authors highlight the need for representations that retain richer spatial information without expanding the number of tokens, motivating exploration of alternative backbones such as state space models (e.g., VMamba) as potential solutions.
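A toy sketch (an assumption for illustration, not the paper's implementation) of why state space models are attractive here: an SSM processes a token sequence with a single O(N) recurrent scan, while self-attention materializes an N x N score matrix, so its cost grows quadratically with the number of visual tokens.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.1):
    """O(N) scan: h_t = a * h_{t-1} + b * x_t (scalar state per channel)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def attention(x):
    """O(N^2) self-attention with identity projections, for contrast."""
    scores = x @ x.T / np.sqrt(x.shape[1])               # N x N matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(0).normal(size=(576, 8))       # 576 visual tokens
y_ssm = ssm_scan(x)       # time and memory linear in 576
y_attn = attention(x)     # score matrix alone holds 576 * 576 entries
print(y_ssm.shape, y_attn.shape)
```

VMamba's actual design (selective scans over 2D feature maps) is more involved than this scalar recurrence, but the scaling contrast is the point: linear rather than quadratic growth in sequence length.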

References

To capture fine details, VLMs often increase image resolution or the number of visual tokens, but this quickly raises compute and memory costs in both the vision encoder and the LLM. This limitation raises an interesting open question: "is there a better visual representation that also encodes richer spatial information without increasing the number of vision tokens?"

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders (2603.19209 - Kuo et al., 19 Mar 2026) in Introduction, Section 1