Stability of VLM behavior when language is presented as pixels
Determine whether Vision–Language Models (VLMs) maintain stable behavior and performance when language inputs are supplied as visualized text, i.e., text rendered into images, rather than as discrete tokenized text, in order to assess whether pixel-based and token-based representations of language are functionally equivalent.
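To make the two input conditions concrete, here is a minimal Python sketch of how the same prompt could be presented in both modalities. It assumes the Pillow library; render_text_as_image and query_vlm are illustrative names introduced here, not from the paper, and query_vlm is a placeholder for whatever VLM interface is under test.

    import textwrap
    from PIL import Image, ImageDraw, ImageFont  # requires Pillow

    def render_text_as_image(text: str, width: int = 512, margin: int = 10) -> Image.Image:
        """Render a text prompt onto a white canvas so a VLM receives it as pixels."""
        font = ImageFont.load_default()
        lines = textwrap.wrap(text, width=60)
        line_height = 14  # approximate line height of the default bitmap font
        height = 2 * margin + line_height * max(len(lines), 1)
        img = Image.new("RGB", (width, height), color="white")
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines):
            draw.text((margin, margin + i * line_height), line, fill="black", font=font)
        return img

    # Hypothetical comparison loop (query_vlm is a stand-in, not a real API):
    # prompt = "What is the capital of France?"
    # answer_tokens = query_vlm(text=prompt)                         # token-based condition
    # answer_pixels = query_vlm(image=render_text_as_image(prompt))  # pixel-based condition
    # Stability would then be measured by agreement between the two answers.

Under this setup, modality equivalence amounts to the two conditions yielding matching outputs across a prompt set; systematic divergence would indicate the perceptual reading step degrades behavior.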
References
This blind spot overlooks the perceptual challenge of reading language from pixels, leaving it unclear whether model behavior remains stable when language isn't conveyed symbolically.
— VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
(arXiv:2602.04802, Liu et al., 4 Feb 2026), in Introduction (Section 1)