MMSearchVQA: Multi-Hop VQA Benchmark
- MMSearchVQA is a suite of benchmarks and datasets that assess multi-hop visual question answering by integrating text and image retrieval for iterative reasoning.
- It enforces multi-turn tool-based retrieval by alternating between visual searches and text queries to construct chains of evidence for complex questions.
- Key variants like DeepMMSearchVQA, MMSearch-Plus, and MSearcher advance research in search efficiency, grounding accuracy, and dynamic multi-modal reasoning.
MMSearchVQA refers to a family of benchmarks and datasets that collectively define, evaluate, and enable progress in retrieval-augmented, multi-hop Visual Question Answering (VQA) for multimodal agents. These resources are designed to train and assess the ability of multimodal large language models (MLLMs) and vision–language models (VLMs) to reason across textual and visual domains, utilize external web-search tools, and perform iterative, multi-turn reasoning to resolve knowledge-intensive queries. Key variants include DeepMMSearchVQA (Narayan et al., 14 Oct 2025), MMSearch-Plus (Tao et al., 29 Aug 2025), and the multi-hop MMSearchVQA benchmark introduced alongside MSearcher (Yu et al., 14 Jan 2026).
1. Motivation and Problem Scope
MMSearchVQA benchmarks are constructed to overcome fundamental limitations in typical VQA and browsing agent datasets: shallow reasoning steps, absence of cross-modal retrieval orchestration, and insufficient representation of realistic web-search complexity. Real-world information seeking often demands coordination of multiple search modalities (visual recognition, text search) and propagation of weak local signals (e.g., part-based image crops, micro-text). Existing datasets tend to restrict retrieval depth, leading to agent policies that issue a single image search or direct text lookup.
MMSearchVQA benchmarks enforce multi-hop requirements by demanding at least two, often up to three or more, retrieval steps per question. Each instance obliges agents to alternate “think” states (visual or intermediate knowledge processing) with targeted tool invocations (image or text search) and to synthesize chains of evidence before producing a final answer (Yu et al., 14 Jan 2026). These properties facilitate the training and evaluation of agents capable of dynamic reasoning, tool selection, and long-horizon planning under mixed-modality and retrieval noise.
2. Dataset Construction and Annotation Protocols
DeepMMSearchVQA (Narayan et al., 14 Oct 2025) is built from a large-scale InfoSeek training split, encompassing 200,000 (question, image, answer) triples. The dataset generation pipeline employs the Gemini-2.5-Pro model primed with an expert assistant prompt to construct multi-turn reasoning traces. At each step, the model emits a <reason> block followed by one of the following tool calls: <img_search>[referring expression]</img_search>, <img_search><img></img></img_search> (whole image), <text_search>…</text_search>, or <answer>…</answer> (terminal answer). Quality control is fully automated and enforces exact match between predicted and source answers, tag-schema conformance, and recursive invocation of external tools (image cropping via GroundingDINO, text search, and LLM summarization) until the answer is attained.
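The automated quality-control step described above can be sketched as a tag-schema validator. This is a minimal illustration, not the released pipeline: the regex grammar and exact-match check below are assumptions based on the tag names given in the text.

```python
import re

# Hypothetical tag grammar inferred from the description above; the
# released data's exact schema may differ.
STEP_RE = re.compile(
    r"<reason>(?P<reason>.*?)</reason>\s*"
    r"(?:<img_search>(?P<img>.*?)</img_search>"
    r"|<text_search>(?P<text>.*?)</text_search>"
    r"|<answer>(?P<answer>.*?)</answer>)",
    re.DOTALL,
)

def validate_trace(trace: str, gold_answer: str) -> bool:
    """Check tag-schema conformance and exact-match termination."""
    steps = list(STEP_RE.finditer(trace))
    if not steps:
        return False
    final = steps[-1].group("answer")
    # A valid trace must terminate in an <answer> block that exactly
    # matches the source answer (the exact-match QC criterion).
    return final is not None and final.strip() == gold_answer.strip()

trace = (
    "<reason>Identify the building first.</reason>"
    "<img_search><img></img></img_search>"
    "<reason>Now find its architect.</reason>"
    "<text_search>architect of the Flatiron Building</text_search>"
    "<reason>Evidence names Daniel Burnham.</reason>"
    "<answer>Daniel Burnham</answer>"
)
print(validate_trace(trace, "Daniel Burnham"))  # True
```

A production validator would additionally reject traces whose intermediate tool calls fail or whose step count exceeds the curriculum budget.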
The MMSearchVQA variant described in MSearcher (Yu et al., 14 Jan 2026) comprises 6,000 unique multi-hop questions sourced from Wikidata entity graphs, automatically expanded and validated using Wikipedia evidence chains. Each question is constructed such that every hop is both necessary and unambiguously supported by explicit text snippets. Questions are balanced for difficulty (easy, medium, hard) based on DeepSeek-V3 multiple-run validation.
No human annotation or inter-annotator agreement metrics (e.g., Cohen's κ) are used in either corpus; consistency and validity are maintained exclusively through automated matching to ground-truth labels and semantic-equivalence judging via LLM-as-Judge protocols.
3. Task Composition and Knowledge Taxonomy
MMSearchVQA datasets are partitioned according to both modality and reasoning structure.
- DeepMMSearchVQA contains 10,000 samples, split 50:50 between search-free and search-required queries, evenly distributed over six knowledge categories (e.g., Geography, Art, Science). Each question is annotated with the number of required search turns, whether it is single-hop (one search turn) or multi-hop (two or more search turns), and with the specific tool-usage events (whole-image search, cropped-image search, text search with refinements).
- MMSearch-Plus (Tao et al., 29 Aug 2025) comprises 311 instances, each demanding localization and propagation of weak spatial or temporal signals, often requiring part-level cropping, iterative text/image search, and provenance checks.
- The MSearcher benchmark (Yu et al., 14 Jan 2026) offers diverse coverage: biography, geography, science, arts, sports, and miscellaneous, each question carrying a ground-truth answer and an explicit multi-hop evidence chain.
Typical question formats include yes/no, open-ended, and multi-step inferences that interleave image and textual reasoning. Visual contexts may span full-scene understanding or focused entity crops requiring precise bounding-box attention (e.g., sponsor patch domain discrimination).
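The per-instance annotations described in this section can be pictured as a small record type. The field names below are illustrative, not the released schema:

```python
from dataclasses import dataclass, field

@dataclass
class MMSearchVQAInstance:
    # Field names are hypothetical; only the annotated attributes
    # (category, search-turn count, tool events) come from the text.
    question: str
    image_path: str
    answer: str
    category: str            # e.g. "Geography", "Art", "Science"
    search_turns: int        # number of required search turns
    tools: list = field(default_factory=list)  # e.g. ["img_search", "text_search"]

    @property
    def is_multi_hop(self) -> bool:
        # Single-hop: one search turn; multi-hop: two or more.
        return self.search_turns >= 2

ex = MMSearchVQAInstance(
    question="Which league does the team on the sponsor patch play in?",
    image_path="img/0001.jpg",
    answer="Bundesliga",
    category="Sports",
    search_turns=2,
    tools=["img_search", "text_search"],
)
print(ex.is_multi_hop)  # True
```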
4. Agent Frameworks and Reasoning Methodologies
MMSearchVQA benchmarks require agents to operate via tool-equipped reasoning frameworks, orchestrating the following actions at each turn:
- Text search: web retrieval for the top-k snippets, often followed by summarization.
- Image search: reverse image lookup (on the whole scene or targeted crops).
- Crop/select_bbox: localization of regions bearing critical cues, grounded by models such as GroundingDINO.
- Multi-turn reasoning: endogenous query refinement predicated on retrieved information and state history.
- Self-reflection/Self-correction: iterative adjustment to queries and regions as new evidence is incorporated.
A commonly used agent loop (as in MMSearch-Plus) formalizes this as a sequence over a maximum round budget, with the LLM proposing tools/actions, consuming results, and breaking upon answer emission. Performance in these frameworks is tightly bound to accurate, efficient grounding and the capacity to dynamically budget search hops to avoid excessive or unfocused retrieval (Tao et al., 29 Aug 2025, Narayan et al., 14 Oct 2025).
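The bounded-round agent loop described above can be sketched generically. This is a toy skeleton under stated assumptions: the `propose` policy stands in for an MLLM, and the tool interfaces are placeholders, not any framework's actual API:

```python
from typing import Callable, Optional

def agent_loop(
    propose: Callable[[list], dict],   # policy: history -> action dict
    tools: dict,                       # tool name -> callable(arg) -> observation
    max_rounds: int = 10,
) -> Optional[str]:
    """Propose a tool action, execute it, feed the observation back,
    and break upon answer emission or budget exhaustion."""
    history = []
    for _ in range(max_rounds):
        action = propose(history)
        if action["tool"] == "answer":     # terminal: emit final answer
            return action["arg"]
        obs = tools[action["tool"]](action["arg"])
        history.append((action, obs))      # evidence returned to the policy
    return None                            # round budget exhausted

# Toy policy: one text search, then answer from the observation.
def toy_policy(history):
    if not history:
        return {"tool": "text_search", "arg": "capital of France"}
    return {"tool": "answer", "arg": history[-1][1]}

toy_tools = {"text_search": lambda q: "Paris"}
print(agent_loop(toy_policy, toy_tools))  # Paris
```

The `max_rounds` cap corresponds to the round budget mentioned above; dynamic search budgeting amounts to learning when to emit the terminal action early.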
5. Evaluation Protocols and Key Results
Three primary metrics govern MMSearchVQA evaluation:
| Metric | Definition & Protocol | Example Value |
|---|---|---|
| Answer Accuracy | Final answer matched against the gold answer set; semantic equivalence judged by GPT-4o | o3 agent: 36.0% (full rollout) (Tao et al., 29 Aug 2025) |
| Bounding-box IoU | Intersection-over-union of predicted crop regions for crop-selection calls | Measured vs. human annotation |
| Search Efficiency | Avg. number of search calls per successful trajectory | o3: ~3.7 (success), ~6.4 (fail) |
In DeepMMSearchVQA-based model training, supervised fine-tuning is followed by online RL (e.g., GRPO, BN-GSPO) with group-relative or batch-normalized advantage objectives, curriculum constraints (max 10 tool calls, token limits), and composite reward functions balancing correctness and format conformity.
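The group-relative advantage used in GRPO-style training normalizes each rollout's reward by the statistics of its sampling group. The sketch below shows only that normalization plus an illustrative composite reward (weights are assumptions); real GRPO additionally applies a clipped policy-gradient objective with KL regularization:

```python
import statistics

def group_relative_advantages(rewards):
    """Center and scale each rollout's reward by its group mean/std,
    as in group-relative (GRPO-style) advantage estimation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Composite reward balancing correctness and format conformity,
# as described above; the 1.0 / 0.1 weights are illustrative only.
def composite_reward(correct: bool, well_formatted: bool) -> float:
    return 1.0 * correct + 0.1 * well_formatted

group = [(True, True), (False, True), (False, False), (True, False)]
rewards = [composite_reward(c, f) for c, f in group]
print(group_relative_advantages(rewards))
```

Because the advantages are centered within each group, correct rollouts receive positive learning signal and incorrect ones negative, without a learned value baseline.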
Empirical results establish sustained gains with full agentic rollout (part-based visual reasoning, provenance verification, deep planning), with state-of-the-art models (SenseNova-MARS-8B, DeepMMSearch-R1) surpassing both open-source and proprietary baselines across MMSearchVQA and HR-MMSearch (Chng et al., 30 Dec 2025, Narayan et al., 14 Oct 2025).
6. Error Taxonomy, Limitations, and Insights
Error analysis on MMSearchVQA tasks reveals dominant classes:
- No relevant information found: 51.1% (noisy or overly specific queries)
- Hallucination: 11.5% (parametric memory or mistaken attribution)
- Key information not extracted: 8.4% (summarization omissions)
- Relevance not verified: 6.9% (false positive source acceptance)
- Image/multimodal understanding breakdown
Failures commonly stem from brittle grounding, weak cropping models, and inefficient planning. Search sprawl (excess calls with diminishing returns) and insufficient snippet ranking degrade accuracy in long-horizon contexts (Tao et al., 29 Aug 2025). The lack of human-annotated inter-annotator agreement and exclusive Wikipedia evidence in some variants (e.g., MSearcher MMSearchVQA) may impose limitations on generalization to open-web settings and introduce subtle ambiguities (Yu et al., 14 Jan 2026).
7. Comparative Landscape and Future Directions
MMSearchVQA benchmarks are contrasted against prior multimodal datasets:
- InfoSeek: shallow, single-hop reasoning; insufficient for hybrid tool orchestration.
- MMSearch/MMSearch-R1: single salient entity collapse, high-recall image search dominance.
- BrowseComp/MM-BrowseComp: textual navigation and image anchoring but little sustained visual reasoning.
- MMSearch-Plus: rigorous spatial–temporal extrapolation, exhaustive crop/retrieval, and provenance demands (Tao et al., 29 Aug 2025).
Future research avenues, as suggested in the source literature, include integration of video/transcript modalities, richer in-LLM visual reasoning (instruction-guided masking), learned dynamic search budgeting, and cross-modal retrieval/generation joint learning via advanced RL objectives (Chng et al., 30 Dec 2025, Lim et al., 11 Apr 2025). MMSearchVQA agents are poised to combine enhanced perception with efficient multi-tool rollouts to advance the field of multimodal web-based information seeking.