Text‑conditioned image tokenization for efficient multimodal encoding

Determine how to leverage text‑conditioning to adaptively and efficiently tokenize images in mid‑fusion vision–language models, so that regions irrelevant to a given textual query (for example, background areas in high‑resolution scenes) are encoded at lower resolution to reduce the number of visual tokens without degrading performance on agentic tasks.

Background

The paper shows that increasing image resolution or the number of visual tokens improves high‑resolution reasoning and GUI grounding performance, but at a significant efficiency cost due to quadratic attention scaling with context length. All tested featurization techniques operate independently of the text prompt, suggesting potential inefficiency when only parts of a high‑resolution image are relevant to a specific query.
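As a back-of-the-envelope illustration of why token count matters, self-attention cost grows roughly with the square of the sequence length. The numbers below (hidden size, layer count, token counts) are illustrative assumptions, not values from the report:

```python
def attn_flops(n_tokens: int, d: int = 1024, layers: int = 32) -> int:
    """Rough per-forward-pass self-attention cost, ~ layers * n^2 * d.

    Ignores constants and the feed-forward blocks; only the quadratic
    attention term is modeled. All parameter values are illustrative.
    """
    return layers * n_tokens**2 * d

base = attn_flops(1024)   # e.g., a low-resolution image encoding
hires = attn_flops(4096)  # 4x the visual tokens after upscaling
print(hires / base)  # 16.0 — quadrupling tokens gives ~16x attention cost
```

This is why halving the tokens spent on query-irrelevant regions can pay off disproportionately at high resolutions.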

The authors note that incorporating text‑conditioning into the image tokenization process could save tokens by lowering resolution in irrelevant regions. They reference related ideas such as BLIP‑2’s Q‑Former but observe that such approaches have not yet demonstrated clear benefits for agentic tasks.
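One way such text-conditioning could work is to score each image patch against the text query, keep full-resolution tokens only for the most relevant patches, and pool the remainder into a few coarse summary tokens. The sketch below is a hypothetical illustration of this idea using cosine similarity and average pooling; it is not the Q-Former mechanism or anything proposed in the report, and all function names and parameters are assumptions:

```python
import numpy as np

def text_conditioned_tokenize(patch_emb, text_emb, keep_k=16, coarse_groups=4):
    """Hypothetical sketch: keep the keep_k patches most similar to the text
    query at full resolution, and average-pool the remaining low-relevance
    patches into coarse_groups summary tokens."""
    # Cosine similarity between each patch embedding and the text embedding.
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = p @ t
    order = np.argsort(-scores)          # patches sorted by relevance
    fine_tokens = patch_emb[order[:keep_k]]
    # Pool the low-relevance patches (e.g., background) into a few tokens.
    groups = np.array_split(order[keep_k:], coarse_groups)
    coarse_tokens = np.stack([patch_emb[g].mean(axis=0) for g in groups])
    return np.concatenate([fine_tokens, coarse_tokens], axis=0)

# Toy usage: 256 patches of dim 32 compress to 16 fine + 4 coarse tokens.
rng = np.random.default_rng(0)
patches = rng.standard_normal((256, 32))
query = rng.standard_normal(32)
tokens = text_conditioned_tokenize(patches, query)
print(tokens.shape)  # (20, 32)
```

A learned variant would replace the cosine-similarity gate with cross-attention (as in BLIP-2's Q-Former) and train the pooling end to end; the open question is whether this helps rather than hurts on agentic tasks such as GUI grounding.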

References

It is an open question how to leverage text-conditioning to most efficiently tokenize the image---for example, if a specific question is asked about a high-resolution scene, the background could be encoded in a lower resolution to save on tokens.

Phi-4-reasoning-vision-15B Technical Report (2603.03975 - Aneja et al., 4 Mar 2026) in Open research questions, Section 2.2 (Vision Encoder and Image Processing)