T$^2$-RAGBench: Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation

Published 4 Jun 2025 in cs.IR | (2506.12071v1)

Abstract: While most financial documents contain a combination of textual and tabular information, robust Retrieval-Augmented Generation (RAG) systems are essential for effectively accessing and reasoning over such content to perform complex numerical tasks. This paper introduces T$^2$-RAGBench, a benchmark comprising 32,908 question-context-answer triples, designed to evaluate RAG methods on real-world financial data. Unlike typical QA datasets that operate under Oracle-context settings, where the relevant context is explicitly provided, T$^2$-RAGBench challenges models to first retrieve the correct context before conducting numerical reasoning. Existing QA datasets involving text and tables typically contain context-dependent questions, which may yield multiple correct answers depending on the provided context. To address this, we transform these datasets into a context-independent format, enabling reliable RAG evaluation. We conduct a comprehensive evaluation of popular RAG methods. Our analysis identifies Hybrid BM25, a technique that combines dense and sparse vectors, as the most effective approach for text-and-table data. However, results demonstrate that T$^2$-RAGBench remains challenging even for SOTA LLMs and RAG methods. Further ablation studies examine the impact of embedding models and corpus size on retrieval performance. T$^2$-RAGBench provides a realistic and rigorous benchmark for existing RAG methods on text-and-table data. Code and dataset are available online.

Abstract PDF Upgrade to Chat

Summary

The paper introduces T²-RAGBench as a benchmark to rigorously evaluate RAG models on mixed text and table data from financial documents.
It details a dual-phase task that first retrieves appropriate context and then extracts answers, employing metrics like MRR and Number Match.
Experimental results show that while Hybrid BM25 improves performance, significant retrieval bottlenecks remain for state-of-the-art models.

T $^2$ -RAGBench: Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation

Introduction

The paper introduces T $^2$ -RAGBench, a novel benchmark designed to evaluate Retrieval-Augmented Generation (RAG) methods specifically in the context of mixed text and table data. Unlike traditional QA datasets that assume an oracle-context setting, T $^2$ -RAGBench requires models to retrieve and reason over contexts that are not pre-specified. This benchmark comprises 32,908 question-context-answer triples extracted from real-world financial documents, emphasizing context-independent question formulation to provide a rigorous evaluation of RAG capabilities.

Current RAG methods typically handle either purely textual data or rely on context-specific datasets, which limits their efficacy in documents comprising text and tables. Such documents are common in financial reporting and require robust systems capable of processing both modalities effectively. Existing datasets in the financial sector, like FinQA and ConvFinQA, often suffer from context-dependent questions which hinder comprehensive evaluation of RAG methods. T $^2$ -RAGBench closes this gap by providing a context-independent and numerically challenging dataset.

Figure 1: Overview of current SOTA approaches with a focus on the unknown-context setting of T $^2$ -RAGBench.

Benchmark Composition and Data Preparation

T $^2$ -RAGBench integrates subsets from existing datasets, including FinQA, ConvFinQA, VQAonBD, and TAT-DQA. Each subset underwent rigorous preprocessing to ensure question reformulation yielded context-independent queries. This involved using LLMs to reformulate questions, ensuring each had a single, unambiguous answer, verified through both automated and manual validation processes.

Task Definition

The benchmark challenges models with a dual-phase task: first, they must retrieve the appropriate context from the corpus using a function $f$ , and then extract the correct answer using a model $M$ . The evaluation employs metrics like Mean Reciprocal Rank (MRR) to measure retrieval effectiveness and a novel Number Match metric for assessing numerical reasoning accuracy.

Evaluation of RAG Methods

A comprehensive evaluation was conducted on popular RAG methods. It was found that Hybrid BM25, which combines dense and sparse vectors, outperformed other methods on text-and-table data, though T $^2$ -RAGBench remained challenging even for SOTA models.

Figure 2: Human agreement on questions showing the shift from context-dependent to context-independent queries.

Experimental Results and Performance Analysis

Despite performance improvements with Hybrid BM25, a notable gap remains compared to oracle contexts, highlighting ongoing challenges in effectively retrieving mixed-modality data. Ablation studies revealed modest gains from using advanced embedding models, indicating that retrieval remains a bottleneck in text-and-table tasks.

Conclusion and Future Directions

T $^2$ -RAGBench sets a new standard for evaluating RAG systems on text-and-table data, with significant implications for financial document analysis. Future work could explore expanding the benchmark to include a wider range of document types and assessing methods that leverage multi-modal data interactions further. This benchmark encourages advancements in RAG methodologies to better handle the complexities of mixed-modality information retrieval and generation.