- The paper introduces MiRAGE, a multi-agent framework that generates multimodal, multi-hop QA datasets for domain-specific evaluation of RAG models.
- It employs recursive context expansion and adversarial verification to ensure high factuality and complexity, with faithfulness scores above 0.91 in most technical domains.
- Empirical results highlight the importance of dense textual artifact descriptions in maintaining multimodal reasoning when visual grounding remains limited.
MiRAGE: A Multiagent Framework for Multimodal Multihop QA Dataset Generation for RAG Evaluation
Motivation and Context
The deployment of Retrieval-Augmented Generation (RAG) architectures for enterprise applications is outpacing the development of domain-specific, multimodal, and multi-hop QA evaluation benchmarks. Most existing QA datasets target open, text-centric domains and single-hop reasoning. However, real-world technical corpora integrate text, figures, and tables, necessitating complex, multi-step, and multimodal reasoning. Prior synthetic QA generators employ linear, non-interactive architectures that cannot ensure factuality, depth, or diversity, resulting in hallucinations and trivialized evaluations.
Method
MiRAGE proposes a multi-agent, model-agnostic architecture to generate robust, domain-aligned, multimodal, multi-hop QA datasets for RAG system evaluation. The framework orchestrates the following agents:
- Multimodal Data Ingestion Agent: Parses technical documents, segmenting them into semantically coherent, multimodal chunks. It utilizes a VLM to generate dense, technical descriptions for visual components, ensuring preservation of document structure and semantic proximity across modalities.
- Domain and Persona Identification Agent: Executes topic modeling on the corpus using multimodal embeddings, clusters semantic topics, and synthesizes both the domain label and expert persona to condition subsequent QA generation.
- Recursive Semantically Multihop Retrieval Agent: Conducts iterative retrieval to construct context windows, recursively expanding from seed chunks. It identifies information gaps in the accumulated context and generates search queries to fill them, yielding a monotonically growing, information-rich context for complex, multi-hop queries.
- Question-Answer Generation and Adversarial Verification Agents: Conditioned on the aggregated context, persona, and domain, the QA generation agent creates candidate QA pairs. A separate adversarial verifier rigorously ensures answers are fully grounded in, and require, the constructed context, enforcing factuality and penalizing hallucination.
- Refinement, Clustering, and Deduplication Agents: Final QA pairs undergo community detection and clustering based on joint semantic and context similarity. A stratified policy merges redundancies and retains unique, representative pairs, maximizing coverage and minimizing duplication.
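The recursive context-expansion loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the keyword-overlap retriever and the use of the accumulated context as the next query are stand-in assumptions for MiRAGE's actual retrieval and gap-detection agents.

```python
def retrieve(corpus, query, exclude):
    """Toy retriever: return the chunk id with the most word overlap
    with the query, ignoring chunks already in the context."""
    q_words = set(query.lower().split())
    best, best_score = None, 0
    for cid, text in corpus.items():
        if cid in exclude:
            continue
        score = len(q_words & set(text.lower().split()))
        if score > best_score:
            best, best_score = cid, score
    return best  # None if nothing new overlaps the query


def expand_context(corpus, seed_id, max_hops=3):
    """Recursively grow a context window from a seed chunk.

    Each hop re-queries the retriever with the full accumulated context,
    a crude proxy for the gap-driven query generation described above,
    and stops once no new chunk adds information.
    """
    context = [seed_id]
    for _ in range(max_hops):
        query = " ".join(corpus[c] for c in context)
        nxt = retrieve(corpus, query, exclude=set(context))
        if nxt is None:
            break
        context.append(nxt)
    return context
```

A real implementation would replace word overlap with dense multimodal embeddings and let an LLM agent decide when the context is sufficient.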
The architecture supports flexible backends, demonstrated with both Gemini 2.5 Flash and GPT-5 Mini reasoning models and Nomic-based semantic retrievers. Importantly, MiRAGE can operate in the absence of native VLM support, provided dense textual artifact descriptions are available.
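The deduplication step can be illustrated with a minimal sketch. Note the assumptions: this uses a greedy nearest-kept filter over cosine similarity rather than the community detection the paper describes, and the embeddings and threshold are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def dedup(embeddings, threshold=0.9):
    """Keep an item only if it is not a near-duplicate (cosine >=
    threshold) of an item already kept; returns the kept indices.

    A greedy stand-in for clustering + representative selection."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

In MiRAGE the similarity would combine question and context embeddings, and a stratified policy (rather than first-seen order) would pick the representative of each cluster.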
Empirical Evaluation
MiRAGE was deployed on four diverse corpora: SP Global (Finance), UNECE GTRs (Regulation), Arxiv Q-Bio (Science), and NYTimes Opinions (Journalism), covering a spectrum from highly structured, tabular-intensive domains to visually and semantically diffuse news articles. The generated benchmarks contain 1000 QA pairs per domain.
Reasoning complexity, measured by hop count, consistently exceeded 2.3 in technical domains (peaking at 2.84 for Gemini 2.5 Flash on SP Global), confirming the efficacy of recursive context expansion and the rejection of trivial, extractive QA pairs. Only the news domain, inherently less structured, yielded hop counts of ≈1.2.
Faithfulness scores, as assessed by LLM-as-a-judge and automated metrics, remained >0.91 in three domains, substantiating the effectiveness of adversarial verification in filtering hallucinations. Relevance also remained >0.81 across all settings.
Visual grounding remains a weakness of current VLM architectures, with scores staying moderate (<0.45 across models and domains). Analysis indicates heavy reliance on textual artifact descriptions: when only raw images are available (without descriptions), faithfulness drops sharply and multimodal reasoning is limited.
Ablations demonstrated that removing the verification agent reduces faithfulness by over 20 points, removing persona/domain injection lowers question difficulty, and disabling recursive multi-hop context building prevents complex reasoning. Semantic chunking, while not always increasing hop counts, is critical for preserving continuity and fidelity. When QA generation is conditioned only on raw image content, faithfulness decreases; conversely, removing raw images while retaining their descriptions preserves high QA quality, indicating that textual representations remain critical given current VLM limitations.
The thematic alignment between corpora and generated datasets, quantified via Jensen-Shannon divergence, remains low, indicating MiRAGE preserves the latent distribution of real-world technical content.
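The alignment metric itself is standard. A minimal sketch of the base-2 Jensen-Shannon divergence between two topic distributions (the example distributions below are illustrative, not the paper's data):

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits, skipping zero-probability terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between distributions p and q.

    Symmetric and bounded in [0, 1] with base-2 logs: 0 means identical
    topic distributions, 1 means disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Applied to topic-frequency histograms of the source corpus and the generated QA set, a value near 0 supports the claim that the latent thematic distribution is preserved.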
Theoretical and Practical Implications
MiRAGE demonstrates that domain-specific, high-complexity, multimodal multi-hop QA datasets can be automatically generated using agentic, recursive multi-agent architectures. It highlights the necessity of agent specialization: context expansion, adversarial verification, and persona conditioning are all vital for dataset quality. The empirical results expose persistent deficits in visual reasoning in current VLMs, with synthetic QA quality dramatically enhanced by dense artifact descriptions.
This framework enables rapid, low-overhead creation of rigorous benchmarks reflecting proprietary corpora, critical for robust, domain-grounded RAG evaluation. MiRAGE paves the way for scalable test set generation in settings where annotated multimodal expert QA pairs are unavailable or infeasible.
Future Work
Current limitations are primarily computational—multi-agent recursive loops elevate cost and latency. The framework's effectiveness on corpora reliant on true image-understanding (not text) is constrained by VLM capabilities. Future work should prioritize token-efficient agent orchestration and extend benchmarking to open-source reasoning and retrieval models, such as Qwen3-VL, to enhance accessibility. Advances in VLMs are necessary to unlock reliable, fine-grained visual reasoning and grounding in multimodal QA.
Conclusion
MiRAGE offers a technically rigorous, multi-agent pipeline for the automated generation of domain-specific, multimodal, and multi-hop QA benchmarks. It establishes a new standard for evaluation dataset creation, enabling systematic, scenario-aligned assessment of RAG systems on tasks requiring complex, context-rich, and multimodal reasoning (2601.15487).