mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs

Published 16 May 2025 in cs.IR | (2505.11180v1)

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of LLMs. However, existing RAG evaluation predominantly focuses on text retrieval and relies on opaque, end-to-end assessments of generated outputs. To address these limitations, we introduce mmRAG, a modular benchmark designed for evaluating multi-modal RAG systems. Our benchmark integrates queries from six diverse question-answering datasets spanning text, tables, and knowledge graphs, which we uniformly convert into retrievable documents. To enable direct, granular evaluation of individual RAG components -- such as the accuracy of retrieval and query routing -- beyond end-to-end generation quality, we follow standard information retrieval procedures to annotate document relevance and derive dataset relevance. We establish baseline performance by evaluating a wide range of RAG implementations on mmRAG.

Abstract PDF Upgrade to Chat

Summary

The paper introduces mmRAG, a modular benchmark evaluating retrieval-augmented generation using multi-modal datasets from text, tables, and knowledge graphs.
It provides granular diagnostics by separately assessing retrieval, query routing, and generation components with cross-dataset relevance annotations.
Baseline evaluations of classic and neural retrievers underscore mmRAG’s potential to drive improvements in complex RAG systems.

"mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs" (2505.11180)

Introduction and Motivation

The paper "mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs" addresses the limitations of existing Retrieval-Augmented Generation (RAG) benchmarks, which primarily focus on single-modal text retrieval and often rely on end-to-end assessments. These evaluations obscure the granularity needed to distinguish failures in retrieval from those in generation. The authors introduce mmRAG, a novel benchmark designed to facilitate a more detailed evaluation across multiple modalities—specifically text, tables, and knowledge graphs (KGs). The benchmark aims to provide comprehensive diagnostics by evaluating specific RAG components, including retrieval and query routing accuracy, using cross-dataset relevance annotations.

Contributions and Novel Features

The mmRAG benchmark offers the following innovations:

Unified Multi-Modal Corpus: It amalgamates datasets from different modalities, including text, tables, and KGs, creating a comprehensive environment for evaluating RAG systems’ performance across these diverse data types.
Cross-Dataset Relevance Annotation: By annotating queries with relevant document chunks across all involved datasets, mmRAG allows for the direct measurement of retrieval accuracy and ensures that the benchmark reflects real-world scenarios where information might span multiple data sources.
Whole-Process Modular Evaluation: The benchmark provides distinct annotations for various components of the RAG process — from query routing to retrieval and generation — enabling thorough performance diagnostics for each component separately.
Figure 1: Construction of mmRAG.

Methodology for Benchmark Construction

The construction of mmRAG comprises three main phases: dataset collection, data processing, and data annotation.

Dataset Collection: The benchmark draws from six diverse QA datasets encompassing text, tables, and KGs, selected to cover a broad range of real-world query types and reasoning tasks. This collection allows mmRAG to evaluate RAG systems’ handling of complex, heterogeneous data.
Data Processing: This phase involves the selection and processing of documents into retrievable chunks. Techniques include clustering for query selection, transforming dataset corpora into unified document formats, and segmenting documents into manageable chunks for efficient retrieval.
Data Annotation: Utilizing a standardized IR pooling method and LLM-based automatic annotation, mmRAG provides detailed relevance labels for retrieval and queries. Importantly, it also introduces dataset-level relevance labels, crucial for evaluating routing accuracy.

Evaluation of Retrievers and Query Routers

The authors establish baseline performances for a variety of retrieval models ranging from classic lexical methods like BM25 to modern neural models such as BGE and GTE. These models were evaluated using standard IR metrics, yielding insights into their retrieval efficiencies and guiding future improvements in RAG systems.

Figure 2: Evaluation of query routers (routing accuracy).

Furthermore, mmRAG facilitates the evaluation of query routers, comparing semantic and LLM-based routing strategies. Direct evaluation with the dataset-level relevance labels was shown to be an effective measure for routing accuracy, providing a more nuanced assessment compared to generation quality metrics alone.

Figure 3: Evaluation of query routers (generation quality).

Implications and Future Directions

The introduction of mmRAG represents a pivotal advancement for RAG benchmarking, enabling detailed, component-level evaluation that surpasses the capabilities of previous benchmarks. By facilitating evaluation at multiple levels — from retrieval to generation — and across multiple data modalities, mmRAG sets a new standard for diagnosing and optimizing RAG systems.

Future work involves expanding the benchmark to encompass additional modalities, such as images, and optimizing the annotation process to enhance scalability. Additionally, exploring domain-specific data splits could further refine the evaluation of RAG systems within particular fields like medicine or law.

Conclusion

mmRAG advances RAG research by providing a detailed, multi-modal, and modular benchmark. It offers significant insights into retrieval, query routing, and generation, all essential for the continued enhancement of RAG systems. The benchmark's robust design aims to inspire further research into retrieval-augmented generation and encourage a broader adoption of diverse knowledge sources in machine learning applications.

Markdown