Persistence and manifestation of the modality gap in multimodal tasks

Determine whether the performance degradation observed when pure-text tasks are converted to visualized-text inputs persists in multimodal tasks, from perception to reasoning, and characterize how this modality gap manifests in such scenarios.

Background

Concurrent work (e.g., VTCBench) has reported significant performance drops when text-only tasks are rendered as images, which suggests a modality gap in unimodal (text-only) settings. Comprehensive analyses in multimodal contexts, however, remain limited.

The authors state explicitly that it is unclear whether this modality gap extends to multimodal tasks and what form it takes there. This open question motivates their benchmark design and evaluations.
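To make the quantity in question concrete, the following is a minimal sketch of how a modality gap could be measured per task category, assuming paired evaluations in which the same question is posed once as pure text and once as visualized text. The record keys (`category`, `mode`, `correct`) and the function name `modality_gap` are hypothetical and are not taken from VISTA-Bench or VTCBench; the metric shown (accuracy difference) is an illustrative choice, not necessarily the one the authors use.

```python
from collections import defaultdict


def modality_gap(records):
    """Per-category accuracy for each input mode, plus the gap between them.

    Each record is a dict with hypothetical keys:
      category: task family, e.g. "perception" or "reasoning"
      mode:     "text" (pure-text question) or "visual" (question rendered as an image)
      correct:  bool, whether the model answered correctly
    """
    # counts[category][mode] = [number correct, number total]
    counts = defaultdict(lambda: {"text": [0, 0], "visual": [0, 0]})
    for r in records:
        hits_total = counts[r["category"]][r["mode"]]
        hits_total[0] += int(r["correct"])
        hits_total[1] += 1

    report = {}
    for category, modes in counts.items():
        acc = {m: (h / t if t else float("nan")) for m, (h, t) in modes.items()}
        report[category] = {
            "text_acc": acc["text"],
            "visual_acc": acc["visual"],
            # Positive gap: the model does worse when the text is rendered as an image.
            "gap": acc["text"] - acc["visual"],
        }
    return report
```

Reporting the gap separately per category, rather than as one pooled number, is what allows a comparison between perception-style and reasoning-style tasks.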

References

It therefore remains unclear whether similar phenomena persist in multimodal tasks from perception to reasoning and how they manifest in such scenarios.

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? (2602.04802 - Liu et al., 4 Feb 2026) in Introduction (Section 1)