Persistence and manifestation of the modality gap in multimodal tasks
Determine whether the performance degradation observed when pure-text tasks are converted to visualized-text inputs persists in multimodal tasks, including perception and reasoning, and characterize how this modality gap manifests in those settings.
References
It therefore remains unclear whether similar phenomena persist in multimodal tasks from perception to reasoning and how they manifest in such scenarios.
— VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
(2602.04802 - Liu et al., 4 Feb 2026) in Introduction (Section 1)