When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

Published 19 Apr 2026 in cs.CV and cs.AI | (2604.17375v1)

Abstract: Recent advances in Vision-Language Models (VLMs) have substantially enhanced their ability across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark, including large-scale human-validated samples with specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely-used public datasets using a scalable hybrid pipeline of VLM-assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1-L5) that reflects the semantic contradiction between overlay text and visual reality. Moreover, we propose Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a novel Vision-Text Disentanglement framework that employs a dual-encoder architecture. Concretely, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. We develop an Adaptive Token Routing Strategy to enable dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments conducted on our VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, which outperforms state-of-the-art counterparts across diverse video question answering tasks.


Authors (6)

Summary

The paper presents the VisualTextTrap benchmark with detailed metrics to quantify text-induced hallucination in vision-language models.
It introduces the VTHM-MoE architecture, featuring dual-encoder design, query-guided patch selection, and specialized experts for temporal, action, object, and spatial reasoning.
Empirical analysis shows significant performance drops in existing models under adversarial text overlays, while VTHM-MoE maintains robust VQA accuracy.

Text Overlay-Induced Hallucination in Vision-Language Models: Characterization, Benchmarking, and Mitigation

Problem Definition and Motivation

The paper addresses a critical and previously underexplored failure mode in vision-language models (VLMs), termed Text Overlay-Induced Hallucination (TOIH). TOIH occurs when overlay text rendered in video frames contradicts the ground-truth visual content, yet the VLM's prediction follows the textual semantics and disregards the visual evidence. The authors present systematic evidence that leading VLMs, both open and proprietary, exhibit high rates of TOIH across video question answering (VQA) tasks covering temporal, action, object, and spatial reasoning dimensions.

They argue that current benchmarks and mitigation strategies are insufficient because they focus primarily on visually congruent overlays or on coarse object hallucination in images. Consequently, genuine multimodal grounding cannot be reliably assessed, and model robustness to adversarial text-visual conflict remains poorly understood.

The VisualTextTrap Benchmark: Construction and Methodology

To systematically measure TOIH, the authors introduce VisualTextTrap, the first dedicated benchmark for text overlay-induced hallucination. The benchmark is constructed by curating 6,057 samples from LLaVA-Video, VideoMME, and TemporalBench, incorporating real and user-generated content. Each sample is annotated with:

Four cognitive dimensions: Temporal, Action, Object, Spatial
88 fine-grained attributes: To enable precise semantic and perceptual diagnosis
Three overlay text conditions: Text-Free, Text-Congruent, Text-Contradictory
Five-level Semantic Conflict Score (SCS): Quantifies conflict severity between overlay text and visual ground truth

Benchmark construction employs a hybrid pipeline: model-assisted generation of adversarial overlays (Claude-Sonnet-4.6), multi-model and human expert validation, and hierarchical cognitive complexity annotation. This design enables controlled contrastive evaluation, allowing quantitative and qualitative analysis of how VLMs respond to varying conflict intensity and complexity.
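The annotation scheme above can be pictured as a small record type. The following is a minimal sketch only: the field names, the `Sample` class, and the `group_contrastive` helper are hypothetical, and the summary does not describe the benchmark's actual file format or attribute names. It also assumes that a Semantic Conflict Score is only present when an overlay exists.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

DIMENSIONS = ("Temporal", "Action", "Object", "Spatial")
CONDITIONS = ("Text-Free", "Text-Congruent", "Text-Contradictory")


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one benchmark item (not the authors' schema)."""
    video_id: str
    question: str
    dimension: str                      # one of DIMENSIONS
    attribute: str                      # one of the 88 attributes (names not given here)
    condition: str                      # one of CONDITIONS
    overlay_text: Optional[str] = None  # None when there is no overlay
    scs: Optional[int] = None           # assumed 1..5, set only when an overlay exists

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if self.condition not in CONDITIONS:
            raise ValueError(f"unknown condition: {self.condition}")
        if self.scs is not None and not 1 <= self.scs <= 5:
            raise ValueError("scs must be between 1 and 5")


def group_contrastive(samples):
    """Group samples of the same (video, question) by overlay condition."""
    groups = defaultdict(dict)
    for s in samples:
        groups[(s.video_id, s.question)][s.condition] = s
    return dict(groups)
```

Grouping by video and question is what makes the evaluation contrastive: the same question can be scored with no overlay, a congruent overlay, and a contradictory overlay, so any accuracy difference is attributable to the text.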

Evaluation Metrics

The authors design 14 metrics across three layers, including:

Hallucination vulnerability: Hallucination Resistance Rate (HRR), Visual Yielding Rate (VYR), and Text-Induced Hallucination Rate (TIHR)
Conflict sensitivity: Semantic Conflict Sensitivity Index (SCSI), Hallucination Surge Rate (HSR)
Cognitive load sensitivity: Layer-specific Pearson correlations between cognitive complexity and resistance

This multidimensional metric suite provides detailed profiling of VLM behavior under TOIH.
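The summary names the metrics but does not give their formulas. The sketch below therefore uses assumed, simplified definitions that are not taken from the paper: a resistance rate as the share of Text-Contradictory items answered correctly, and a text-induced rate as the share whose prediction matches the option implied by the overlay. The record keys (`prediction`, `answer`, `text_distractor`, `scs`) are hypothetical.

```python
from collections import defaultdict


def resistance_rate(records):
    """Assumed HRR-like score: fraction of contradictory items answered correctly."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(r["prediction"] == r["answer"] for r in records) / len(records)


def text_induced_rate(records):
    """Assumed TIHR-like score: fraction of items where the model chose the
    option suggested by the overlay text."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(r["prediction"] == r["text_distractor"] for r in records) / len(records)


def resistance_by_scs(records):
    """Resistance rate per Semantic Conflict Score level, in ascending order."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["scs"]].append(r)
    return {level: resistance_rate(rs) for level, rs in sorted(buckets.items())}
```

Breaking the resistance rate down by conflict level is one plausible way to expose the kind of conflict sensitivity that SCSI and HSR are described as measuring, although the paper's exact aggregation may differ.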

Visual Text Hallucination Mitigation MoE (VTHM-MoE) Architecture

To mitigate TOIH, the paper introduces the Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE) framework. This architecture targets explicit cross-modal disentanglement and dynamic mitigation:

Dual-Encoder Backbone: Parallel frozen Qwen3-VL-8B visual and OCR encoders, yielding patch-level feature sequences that separately represent scene content and text overlays.
Query-Guided Patch Selection and Cross-Attention: Patches are selected and conditioned by question relevance for both modalities; three-token representations capture visual features, OCR features, and their difference.
Four Specialized Expert Modules: Temporal, Action, Object, and Spatial experts, each pre-trained for dimension-specific conflict detection and mitigation.
Adaptive Token Routing Strategy: At inference, token-level cross-modal consistency scores direct high-inconsistency tokens to specialized experts, while default (non-conflict) routing is preserved for standard VQA.
Training on Synthetic-Conflict Data: The model is trained using a broad distribution of contradiction types and conflict intensities, ensuring robust coverage.

Only the routing, conditioning, and expert modules are trainable; encoders and LLM weights remain frozen, facilitating efficient transfer to new tasks.
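The components listed above can be illustrated with a toy routing function. This is a sketch under stated assumptions, not the authors' implementation: cosine similarity stands in for the unspecified cross-modal consistency score, the `threshold` value and the `expert_scores` gate outputs are hypothetical, and plain Python lists stand in for encoder features.

```python
import math

EXPERTS = ("Temporal", "Action", "Object", "Spatial")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def three_token(visual, ocr):
    """Return (visual features, OCR features, their element-wise difference)."""
    diff = [a - b for a, b in zip(visual, ocr)]
    return visual, ocr, diff


def route(visual, ocr, expert_scores, threshold=0.5):
    """Send a token to the default path when the two feature vectors agree,
    otherwise to the expert with the highest (hypothetical) gate score."""
    consistency = cosine(visual, ocr)
    if consistency >= threshold:
        return "default"
    return max(EXPERTS, key=lambda name: expert_scores[name])
```

The point of the conditional branch is the property the paper emphasizes: consistent tokens bypass the experts, so behaviour on Text-Free and Text-Congruent inputs is left unchanged, while inconsistent tokens receive dimension-specific handling.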

Empirical Results and Analysis

TOIH Prevalence and Model Vulnerability

Experiments demonstrate that all contemporary VLMs evaluated, regardless of scale or fine-tuning strategy, exhibit substantial accuracy degradation under contradictory overlay text. For example, Qwen3-VL-8B on VideoMME falls from 42.7% to 7.1%, a drop of 35.6 percentage points. This vulnerability is pervasive and is not alleviated by chain-of-thought prompting, model scaling, or supervised fine-tuning.
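For the Qwen3-VL-8B figures quoted above, the absolute and relative drops can be checked with a few lines of arithmetic. The helper name `accuracy_drop` is illustrative only.

```python
def accuracy_drop(baseline, under_overlay):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    absolute = baseline - under_overlay
    relative = absolute / baseline
    return absolute, relative


abs_drop, rel_drop = accuracy_drop(42.7, 7.1)
print(f"{abs_drop:.1f} pp, {rel_drop:.1%}")  # prints "35.6 pp, 83.4%"
```

The relative figure makes the severity clearer than the absolute one: roughly five out of six previously correct answers are lost once the contradictory text is added.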

Probabilistic analysis confirms that overlay text captures model probability mass, systematically shifting predictions toward text-induced distractors at the expense of visual grounding.

VTHM-MoE Performance

VTHM-MoE is reported to match or exceed all open-source and proprietary baselines on TOIH mitigation. On VisualTextTrap:

Hallucination Resistance Rate (HRR): VTHM-MoE attains HRR comparable to (or marginally superior to) the strongest closed-source baseline, Gemini-3.1-Pro.
Sustained Native VQA Accuracy: The adaptive routing mechanism preserves VQA accuracy on Text-Free and Text-Congruent conditions, indicating effective conditional activation of mitigation pathways.
Dimension-Aware Robustness: Ablation and case analyses show that the four-expert architecture selectively targets hallucination type, and explicit cross-modal difference representations are critical for mitigation.

Notably, chain-of-thought (CoT) prompting and supervised fine-tuning (SFT) provide only minor improvements on Text-Free or Text-Congruent samples and fail to address TOIH. Among the mitigation approaches compared, VTHM-MoE's mixture-of-experts design is the one that remains effective under adversarial overlay conditions.

Cognitive Complexity Effects

The analysis reveals that hallucination resistance in open-source VLMs deteriorates with increasing temporal or spatial reasoning demand (negative correlation with cognitive complexity), whereas VTHM-MoE and Gemini-3.1-Pro maintain or improve robustness under higher complexity. This reflects the specific mitigation capacity of VTHM-MoE’s dimension-aware architecture and the limitations of extant models in deep cross-modal grounding.
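The "negative correlation" here refers to a Pearson coefficient between cognitive complexity and resistance. A minimal sketch of that computation follows; the input values are synthetic placeholders chosen for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Synthetic example: resistance falling as complexity level rises.
complexity_levels = [1, 2, 3, 4]
resistance = [0.60, 0.50, 0.35, 0.20]
r = pearson(complexity_levels, resistance)  # negative for this toy data
```

A coefficient near -1 would correspond to the pattern described for open-source VLMs, while a coefficient near zero or above would correspond to the behaviour reported for VTHM-MoE and Gemini-3.1-Pro.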

Ablation Studies

Removing any one of the components (either encoder of the dual-encoder backbone, the explicit difference token, the conflict classifier, or the expert modules) results in significant accuracy loss under TOIH. This underscores the need for explicit inconsistency signaling and expert specialization rather than relying solely on implicit reasoning or monolithic LLM modules.

Implications and Future Directions

The work has significant implications for both diagnostic assessment and robust design of VLMs:

Multimodal model evaluation must include adversarial conflicting overlays to reveal over-reliance on text and superficial shortcutting.
Architectures must develop explicit mechanisms for disentangling and reconciling modality conflicts, not only via scaling or generic prompting.
The VTHM-MoE routing paradigm may serve as a template for future extensions to other cross-modal and multi-turn settings, where text or metadata can undermine visual grounding.

The paper further points to open problems. Fine-grained semantic query classes (pragmatic, logical) and higher-order multi-turn or causal reasoning were not fully addressed, representing important directions for future work.

Conclusion

This work systematically defines, benchmarks, and mitigates Text Overlay-Induced Hallucination in vision-language models. VisualTextTrap enables rigorous and fine-grained diagnosis of model failure modes on adversarial overlays, while VTHM-MoE introduces a dimension-specialized, adaptively routed expert architecture that sets a new standard in robustness. Future research must build on this framework to address more complex semantic and temporal reasoning under adversarial and noisy cross-modal contexts, ensuring that VLMs achieve genuine, and not merely apparent, visual grounding.
