Stability of VLM behavior when language is presented as pixels

Determine whether Vision–Language Models (VLMs) behave and perform consistently when language inputs are supplied as text rendered inside images, rather than as discrete tokenized text, in order to assess modality equivalence between pixel-based and token-based language representations.
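For concreteness, the contrast can be sketched as follows. This is a toy illustration, not the paper's pipeline: real setups would rasterize the prompt with an image library such as Pillow before passing it to the vision encoder, but a minimal pure-Python pixel grid makes the idea visible.

```python
# Toy sketch: turn a string into a 2D "pixel" grid. A VLM receiving this
# grid must read the language through its vision encoder, whereas the
# token-based path would receive the string via its text tokenizer.
# The 3x3 glyphs below are invented for illustration only.

GLYPHS = {  # 1 = ink, . = background
    "H": ["1.1", "111", "1.1"],
    "I": ["111", ".1.", "111"],
}

def render(text):
    """Rasterize `text` into three rows of a toy bitmap."""
    rows = ["", "", ""]
    for ch in text:
        glyph = GLYPHS.get(ch.upper(), ["...", "...", "..."])
        for r in range(3):
            rows[r] += glyph[r] + "."  # one-pixel gap between glyphs
    return rows

for row in render("HI"):
    print(row)
```

The same string thus has two representations: a short token sequence, or a pixel array whose content must first be perceived. VISTA-Bench's question is whether model behavior is stable across that switch.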

Background

The paper highlights that most existing evaluations supply language as pure text tokens, overlooking cases where language appears as visualized text within images. This creates a blind spot regarding how models process language when it is rendered as pixels rather than tokens.

The authors argue that the perceptual challenge of reading language from pixels may affect behavior, and they make explicit the open question of whether model behavior remains stable under this change in input representation, motivating a dedicated benchmark.

References

This blind spot overlooks the perceptual challenge of reading language from pixels, leaving it unclear whether model behavior remains stable when language isn't conveyed symbolically.

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?  (2602.04802 - Liu et al., 4 Feb 2026) in Introduction (Section 1)