Text‑conditioned image tokenization for efficient multimodal encoding
Determine how to leverage text conditioning to adaptively and efficiently tokenize images in mid-fusion vision–language models. Regions irrelevant to a given textual query (for example, background areas in high-resolution scenes) would be encoded at lower resolution, reducing the number of visual tokens without degrading performance on agentic tasks.
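One plausible instantiation of this idea is to score image patches by their similarity to the query embedding, keep the most relevant patches at full resolution, and merge the rest into coarser pooled tokens. The sketch below assumes precomputed patch and text embeddings; the function name, `keep_ratio`, and `pool` parameters are illustrative, not from the source.

```python
import numpy as np

def adaptive_tokenize(patch_embeds, text_embed, keep_ratio=0.25, pool=4):
    """Hypothetical sketch: keep query-relevant patches at full resolution
    and average-pool the low-relevance remainder into coarser tokens.

    patch_embeds: (N, D) array of image-patch embeddings
    text_embed:   (D,) embedding of the textual query
    keep_ratio:   fraction of patches kept at full resolution
    pool:         number of low-relevance patches merged into one token
    """
    # Cosine similarity between each patch and the query.
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    relevance = p @ t

    n_keep = max(1, int(len(patch_embeds) * keep_ratio))
    order = np.argsort(-relevance)
    keep_idx = np.sort(order[:n_keep])   # relevant: kept as-is
    drop_idx = np.sort(order[n_keep:])   # irrelevant: pooled coarsely

    fine = patch_embeds[keep_idx]
    coarse_src = patch_embeds[drop_idx]
    # Trim so the low-relevance set divides evenly into pooling groups
    # (a real system would handle the remainder rather than drop it).
    usable = (len(coarse_src) // pool) * pool
    coarse = coarse_src[:usable].reshape(-1, pool, fine.shape[1]).mean(axis=1)
    return np.concatenate([fine, coarse], axis=0)
```

With 64 patches, `keep_ratio=0.25`, and `pool=4`, this yields 16 fine tokens plus 12 coarse tokens, a 28-token sequence instead of 64. A trained model would learn the relevance scoring and pooling end to end rather than use fixed cosine similarity.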
References
It is an open question how to leverage text-conditioning to most efficiently tokenize the image---for example, if a specific question is asked about a high-resolution scene, the background could be encoded in a lower resolution to save on tokens.
— Phi-4-reasoning-vision-15B Technical Report
(2603.03975 - Aneja et al., 4 Mar 2026) in Open research questions, Section 2.2 (Vision Encoder and Image Processing)