MMDR-Bench: Multimodal Research Evaluation
- MMDR-Bench is a multimodal benchmark designed to evaluate deep research agents on integrating text and visual evidence into detailed, citation-rich reports.
- It employs a multi-stage evaluation pipeline—FLAE, TRACE, and MOSAIC—to rigorously assess writing quality, citation accuracy, and visual evidence fidelity.
- The benchmark features 140 expert-curated tasks across 21 domains, simulating real-world research scenarios and detailed report generation.
MMDeepResearch-Bench (MMDR-Bench) is a comprehensive benchmark suite developed to evaluate multimodal deep research agents (DRAs). DRAs are systems that, given complex tasks, must perform multi-step retrieval, integrate diverse sources, extract evidence from both text and images, and synthesize citation-rich analyst-grade reports. Unlike preceding benchmarks that are limited to text-only tasks or short-form multimodal question answering, MMDR-Bench targets the unique demands of real-world research where interpreting and citing visual evidence (e.g., charts, diagrams, screenshots) is as critical as reasoning over text. MMDR-Bench provides a suite of 140 expert-curated, image-text bundled tasks spanning 21 domains and introduces a multi-stage evaluation pipeline with granular analysis of report quality, citation fidelity, and multimodal textual-visual consistency (Huang et al., 18 Jan 2026).
1. Motivation and Novelty
MMDR-Bench was designed to address limitations in existing evaluation frameworks, which bifurcate into (a) text-only deep-research benchmarks such as DeepResearch Bench and DeepScholar that lack visual components, and (b) multimodal QA or perception benchmarks (e.g., ChartQA, DocVQA, MMSearch) that assess only short-horizon, usually single-answer tasks. Real research agents are often tasked with synthesizing across textual and visual modalities, grounding claims in external sources, and presenting findings in structured, evidence-rich reports. MMDR-Bench explicitly fills this gap by requiring DRAs to produce, for each task, a citation-grounded document with inline claims, embedded and referenced images, section structure, and explicit links between narrative and multimodal evidence (Huang et al., 18 Jan 2026).
Key innovations include:
- Realistic, expert-authored, image–text task suite (140 tasks across 21 domains).
- Unified, interpretable evaluation pipeline—Formula-LLM Adaptive Evaluation (FLAE), Trustworthy Retrieval-Aligned Citation Evaluation (TRACE), and Multimodal Support-Aligned Integrity Check (MOSAIC).
- Introduction of Visual Evidence Fidelity (VEF) as a strict gating criterion, prohibiting hallucination or visual misreading.
2. Task Construction and Characteristics
MMDR-Bench comprises tasks partitioned into two regimes:
- Daily Tasks (40 tasks, 11 domains): Emulate lightweight research scenarios—screenshots, UI captures, and casual visuals.
- Research Tasks (100 tasks, 10 technical and social-science domains): Feature information-dense charts, tables, and diagrams requiring deep quantitative reasoning and synthesis.
Each task consists of:
- A prompt (e.g., “Based on the enclosed bar chart and the article excerpt, compare the energy efficiency across regions and cite your sources.”).
- A set of images central to the task (charts, document excerpts, diagrams).
- A required output: a long-form, sectioned report (Introduction, Analysis, Conclusion, and References) with inline claims each followed by citation indices, embedded images with captioned citations, and demonstrable use of both the provided images and cited online sources as explicit evidence.
Explicit evidence use is required: every claim must be grounded in a cited URL or specifically referenced image, and demonstrations of how visual artifacts inform conclusions are mandatory (Huang et al., 18 Jan 2026).
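The task bundle and its grounding requirement can be sketched as a minimal data model. The field names and schema below are illustrative assumptions for exposition, not the benchmark's actual file format:

```python
from dataclasses import dataclass, field

@dataclass
class MMDRTask:
    """One image-text bundled task (hypothetical schema)."""
    task_id: str
    regime: str                       # "daily" or "research"
    domain: str                       # one of the 21 domains
    prompt: str                       # the research instruction
    images: list[str] = field(default_factory=list)  # provided visuals

@dataclass
class Claim:
    """One inline claim in the generated report."""
    text: str
    citations: list[str]              # URLs or image references grounding it

def is_grounded(claim: Claim) -> bool:
    """MMDR-Bench requires every claim to cite a URL or a referenced image."""
    return len(claim.citations) > 0

task = MMDRTask("daily-001", "daily", "consumer-tech",
                "Compare the energy efficiency across regions and cite sources.",
                images=["bar_chart.png"])
claim = Claim("Region A is the most energy-efficient.", ["bar_chart.png"])
assert is_grounded(claim)
```

An evaluation harness would reject any report containing a `Claim` whose `citations` list is empty, which is the textual side of the grounding rule; the visual side is checked by MOSAIC below.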
3. Evaluation Pipeline
MMDR-Bench introduces a tri-component evaluation pipeline with fine-grained metrics:
3.1. FLAE — Formula-LLM Adaptive Evaluation
FLAE measures writing quality along three dimensions: Readability, Insightfulness, and Structural Completeness. It fuses a formula-driven channel (exploiting quantifiable textual features) with LLM-judge outputs, moderated by an adaptive fusion coefficient $\alpha$, and produces a weighted per-dimension score:
- Formula channel: $S_F^{(d)} = f_d(\mathbf{x})$, where $\mathbf{x}$ includes lexical diversity, section count, citation statistics, and sentence length.
- Judge LLM channel: $S_J^{(d)}$, obtained via calibrated scoring prompts.
- Fusion: $S^{(d)} = \alpha\, S_F^{(d)} + (1 - \alpha)\, S_J^{(d)}$.
- Overall FLAE: $\mathrm{FLAE} = \sum_d w_d\, S^{(d)}$, where $w_d$ are task-adaptive weights.
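A minimal sketch of this two-channel fusion, assuming a linear formula channel; the feature coefficients, judge score, and fusion coefficient used here are illustrative, not the benchmark's calibrated values:

```python
# Sketch of FLAE's per-dimension fusion of a formula channel with an
# LLM-judge channel. All numeric constants are illustrative.

def formula_score(features: dict[str, float], coeffs: dict[str, float]) -> float:
    """Formula channel: linear combination of quantifiable text features."""
    return sum(coeffs[k] * features[k] for k in coeffs)

def fuse(s_formula: float, s_judge: float, alpha: float) -> float:
    """Adaptive fusion of the two channels for one dimension."""
    return alpha * s_formula + (1 - alpha) * s_judge

def flae(dim_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Overall FLAE: task-adaptive weighted sum over dimensions
    (weights assumed to sum to 1)."""
    return sum(weights[d] * dim_scores[d] for d in dim_scores)

features = {"lexical_diversity": 0.62, "section_count": 5, "mean_sent_len": 21.0}
coeffs = {"lexical_diversity": 40.0, "section_count": 2.0, "mean_sent_len": 0.5}
s_formula = formula_score(features, coeffs)   # formula channel for one dimension
s_dim = fuse(s_formula, 78.0, alpha=0.3)      # blend with a judge score of 78
```

The adaptive coefficient `alpha` lets the pipeline lean on the cheap, deterministic formula channel where textual features are reliable and on the LLM judge where they are not.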
3.2. TRACE — Trustworthy Retrieval-Aligned Citation Evaluation
TRACE evaluates the faithfulness of claim grounding:
- Parses inline citation markers to reference URLs.
- Extracts claim–URL pairs from text.
- Assesses for each URL:
- Consistency (does claim match cited text?),
- Coverage (are all essential claims cited?),
- Textual Fidelity (are paraphrases/quotes faithful?),
- Visual Evidence Fidelity (VEF; strict PASS/FAIL: if the VEF score falls below 6/10 or an identity-critical error occurs, the report fails).
- Weighting: VEF carries 0.4 of the TRACE score, i.e., 20% of the full benchmark score.
- Aggregate: $\mathrm{TRACE} = \sum_{m} w_m\, s_m$, where $s_m$ and $w_m$ are the judge scores and weights for metrics $m \in \{\text{Consistency}, \text{Coverage}, \text{Textual Fidelity}, \text{VEF}\}$.
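The aggregation and the VEF hard gate can be sketched as follows. VEF's 0.4 weight and the 6/10 failure threshold come from the text; the remaining metric weights are assumed equal here purely for illustration:

```python
# Sketch of TRACE aggregation with the VEF hard gate.
# Judge scores are assumed to be on a 0-10 scale, rescaled to 0-100.

VEF_THRESHOLD = 6.0  # PASS/FAIL cutoff on the 10-point scale

def trace(scores: dict[str, float], identity_critical_error: bool) -> float:
    """Weighted sum of judge scores; the report fails outright if the
    VEF gate trips (low VEF score or an identity-critical error)."""
    if identity_critical_error or scores["vef"] < VEF_THRESHOLD:
        return 0.0  # hard failure, not a mere score reduction
    weights = {"consistency": 0.2, "coverage": 0.2,
               "textual_fidelity": 0.2, "vef": 0.4}
    return 10.0 * sum(weights[m] * scores[m] for m in weights)

report_scores = {"consistency": 8.0, "coverage": 7.0,
                 "textual_fidelity": 9.0, "vef": 7.5}
score = trace(report_scores, identity_critical_error=False)
```

Note that the gate returns zero rather than a discounted score, matching the strict PASS/FAIL semantics that distinguishes VEF from ordinary weighted metrics.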
3.3. MOSAIC — Multimodal Support-Aligned Integrity Check
MOSAIC inspects whether image-grounded statements are consistent with visual artifacts:
- Extracts MM-items (statements referencing images).
- Classifies each as data chart, OCR chart, diagram, or photo.
- Scores per item: $s_i$, covering semantic alignment, numeric accuracy, and VQA alignment.
- Aggregation uses type-specific weights (e.g., data charts emphasize accuracy).
- Overall: $\mathrm{MOSAIC} = \frac{\sum_i w_{t(i)}\, s_i}{\sum_i w_{t(i)}}$, the type-weighted mean over items, where $t(i)$ is the visual type of item $i$.
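A sketch of this type-weighted aggregation; the per-type weights below are illustrative stand-ins for the benchmark's actual type-specific weights:

```python
# Sketch of MOSAIC's aggregation over image-grounded statements
# (MM-items). Weights emphasize information-dense visual types;
# the specific values are illustrative.

TYPE_WEIGHTS = {"data_chart": 1.5, "ocr_chart": 1.2, "diagram": 1.0, "photo": 0.8}

def mosaic(items: list[tuple[str, float]]) -> float:
    """Type-weighted mean of per-item scores.

    Each item is a (visual_type, score) pair, where the type is one of
    the four MOSAIC classes and the score covers semantic alignment,
    numeric accuracy, and VQA alignment for that statement."""
    num = sum(TYPE_WEIGHTS[t] * s for t, s in items)
    den = sum(TYPE_WEIGHTS[t] for t, _ in items)
    return num / den if den else 0.0

items = [("data_chart", 8.0), ("photo", 9.0), ("diagram", 6.5)]
overall = mosaic(items)
```

Up-weighting data charts reflects the observation that numeric misreads of charts and tables are among the most damaging multimodal errors.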
The composite MMDR-Bench score is a weighted combination of the three components, $S = w_F\,\mathrm{FLAE} + w_T\,\mathrm{TRACE} + w_M\,\mathrm{MOSAIC}$. MOSAIC is gated: it is activated only if FLAE and TRACE exceed minimal thresholds (Huang et al., 18 Jan 2026).
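The gating logic can be sketched as below. TRACE's 0.5 weight follows from VEF carrying 0.4 of TRACE while amounting to 20% of the full score; the FLAE/MOSAIC split and the gate thresholds are illustrative assumptions:

```python
# Sketch of the gated composite score: MOSAIC contributes only when
# FLAE and TRACE clear minimal thresholds. Thresholds and the
# FLAE/MOSAIC weight split are illustrative.

FLAE_GATE = 20.0   # assumed minimal FLAE threshold
TRACE_GATE = 20.0  # assumed minimal TRACE threshold

def composite(flae_s: float, trace_s: float, mosaic_s: float) -> float:
    """Weighted combination with a gated MOSAIC term."""
    gate_open = flae_s >= FLAE_GATE and trace_s >= TRACE_GATE
    mm = mosaic_s if gate_open else 0.0  # MOSAIC not activated below gate
    return 0.3 * flae_s + 0.5 * trace_s + 0.2 * mm

full = composite(50.0, 60.0, 40.0)   # gate open: all three contribute
gated = composite(10.0, 60.0, 40.0)  # gate shut: MOSAIC term dropped
```

Gating prevents a system from earning multimodal-integrity credit for a report whose prose or citations are too weak to be a meaningful research artifact.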
4. Systematic Findings
Evaluation of 25 SOTA systems reveals multi-dimensional trade-offs:
| Tier / System Classification | FLAE (Quality) | TRACE (Citation) | MOSAIC (MM Integrity) |
|---|---|---|---|
| Dedicated DRAs (e.g., Gemini Deep Research) | 49.41 | High Consistency/Coverage/VEF | Strong MOSAIC (∼41) |
| Search-enabled Multimodal LMMs (e.g., Gemini 3 Pro, Flash) | 44.68–44.43 | High Visual Evidence Fidelity (GPT-5.2: 46.43) | Competitive |
| Single-shot LLMs (DeepSeek-V3.2, GPT-5 mini) | Down to 38.49 | Lower | Inferior |
Key observations:
- Generation–Grounding Trade-off: Models with fluent prose (high FLAE) often perform poorly on TRACE, indicating challenges in grounding claims across modalities. Conversely, strict evidence adherence can degrade coherence.
- Vision Integration: Multimodal capabilities enhance semantic alignment with task requirements but increase susceptibility to detail extraction errors (DTE) in charts/tables, negatively impacting citation fidelity.
- Agentic Pipelines: DRAs improve information coverage and multi-source aggregation, but experience higher entity-misidentification errors (EMI), likely associated with complex retrieval chains.
5. Critical Metrics and Constraints
MMDR-Bench enforces Visual Evidence Fidelity (VEF) as a hard constraint, ensuring that no report passes unless model interpretations of visuals reach a threshold and are devoid of identity-critical errors. VEF’s stringent nature distinguishes MMDR-Bench from earlier multimodal benchmarks, where visual errors often result only in score reductions but not outright task failure. TRACE and MOSAIC surface error classes—detail extraction errors, entity drift, false citations, and improper grounding—that are otherwise obscured in aggregate scores, providing actionable signals for system development (Huang et al., 18 Jan 2026).
6. Implications and Future Directions
Findings from MMDR-Bench evaluations suggest:
- Vision augments DRAs only if models can robustly parse visual detail. Auxiliary or ambiguous images frequently result in hallucinated evidence or misattribution.
- Multimodal alignment is necessary but insufficient; a high TRACE score requires disciplined citation and careful integration of both textual and visual evidence.
- Advanced retrieval and visual processing pipelines, not mere scaling of base model parameters, are critical for improved performance.
- Persistent challenges include fine-grained table/chart parsing, maintaining entity-level consistency across multi-step retrieval, and eliminating spurious reasoning from noisy visual artifacts.
Suggested research trajectories include designing sophisticated chart and diagram parsers for DTE reduction, incorporating entity-aware memory to curb drift in modular pipelines, and enforcing tighter coupling between visual grounding and citation to ensure traceability for every claim (Huang et al., 18 Jan 2026).
7. Resources and Reproducibility
MMDR-Bench provides an open-source repository containing:
- Full dataset: 140 image-text bundled tasks with prompts and images.
- Evaluation code for FLAE, TRACE, and MOSAIC, including audit logs and metrics.
- Paper with detailed methodology, the full experimental protocol, and an error taxonomy.
This infrastructure supports the evaluation and development of new multimodal deep research systems. By providing both granular and composite signals, MMDR-Bench serves as a platform for accelerating progress in citation-grounded, multimodal analytic agents (Huang et al., 18 Jan 2026).