
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

Published 17 Dec 2024 in cs.CL (arXiv:2412.13018v2)

Abstract: As a typical and practical application of LLMs, Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.

Summary

  • The paper introduces OmniEval, a novel benchmark that automatically evaluates RAG systems across diverse financial query scenarios.
  • It employs a matrix-based, multi-stage framework combining GPT-4 driven data generation with human annotation, achieving an 87.47% acceptance rate.
  • The study highlights significant performance gaps in retrievers and generators, urging advancements in domain-specific RAG system evaluations.


Introduction

"OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain" introduces a comprehensive evaluation benchmark tailored for Retrieval-Augmented Generation (RAG) systems in the financial sector. The study addresses the inadequacy of LLMs in handling domain-specific knowledge, a consequence of their generic training data. To tackle this limitation, the proposed benchmark, OmniEval, provides a structured, multi-dimensional evaluation framework aimed at financial queries, coupling human and automated data generation methodologies to assess the RAG pipeline's performance comprehensively.

OmniEval Framework

The OmniEval benchmark consists of several core components:

  • Matrix-based RAG Evaluation System: OmniEval categorizes query scenarios into five distinct task classes—extractive QA, multi-hop reasoning, long-form QA, contrast QA, and conversational QA—and 16 financial topics. This matrix-based approach enables a detailed assessment of a RAG system’s capabilities across varied financial query types.
  • Multi-dimensional Evaluation Data Generation: The benchmark combines GPT-4-driven automated data generation with human annotation, achieving an 87.47% acceptance rate in human evaluations (Figure 1).

    Figure 1: The visualization of OmniEval's generation pipeline of evaluation data.

  • Multi-stage Evaluation System: Recognizing the critical role of both retrievers and generators within RAG systems, OmniEval scrutinizes retrieval processes and end-response quality, facilitating a holistic evaluation of the pipeline.
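The matrix-based design above can be sketched as a simple task-by-topic bucketing of evaluation queries. This is an illustrative reconstruction, not the benchmark's actual code: the task names mirror the five classes listed in the paper, but the topic labels and the `bucket_queries` helper are hypothetical (OmniEval defines 16 financial topics; only a few stand-ins are shown).

```python
from collections import defaultdict

# The five task classes named in the paper; topic labels below are
# illustrative stand-ins for OmniEval's 16 financial topics.
TASK_CLASSES = ["extractive_qa", "multi_hop_reasoning", "long_form_qa",
                "contrast_qa", "conversational_qa"]
TOPICS = ["stocks", "funds", "insurance", "futures"]  # subset for illustration

def bucket_queries(queries):
    """Group (task, topic, query) triples into the task x topic evaluation matrix."""
    matrix = defaultdict(list)
    for task, topic, query in queries:
        assert task in TASK_CLASSES and topic in TOPICS
        matrix[(task, topic)].append(query)
    return matrix

queries = [
    ("extractive_qa", "stocks", "What was Company A's 2023 revenue?"),
    ("multi_hop_reasoning", "funds", "Which fund beat its benchmark two years running?"),
    ("extractive_qa", "stocks", "Who is the CEO of Company B?"),
]
matrix = bucket_queries(queries)
print(len(matrix[("extractive_qa", "stocks")]))  # 2
```

Each cell of the resulting matrix can then be scored independently, which is what makes the per-topic, per-task breakdowns in the paper's figures possible.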

Evaluation Metrics

OmniEval employs both rule-based metrics, such as MAP and Rouge, and LLM-based metrics, which include hallucination detection and numerical accuracy. This dual approach enhances reliability through manual annotations and LLM-driven evaluations.
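The two rule-based metrics named above are standard and easy to sketch from their definitions: Rouge-L scores generation quality via the longest common subsequence between a candidate answer and a reference, while MAP averages per-query average precision over retrieved documents. The minimal implementations below are textbook versions for illustration, not OmniEval's own scoring code.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """Rouge-L F1 between whitespace-tokenized candidate and reference strings."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def average_precision(retrieved_ids, relevant_ids):
    """AP for a single query; MAP is the mean of AP over all queries."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / len(relevant_ids) if relevant_ids else 0.0
```

For example, a retriever that ranks the only relevant document second gets `average_precision(["d2", "d1"], {"d1"}) == 0.5`. The LLM-based metrics (hallucination detection, numerical accuracy) cannot be reduced to formulas like these, which is why the paper fine-tunes an LLM evaluator against manual annotations.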

Experimental Results

Experiments conducted on various retrievers, including BGE-M3 and GTE-Qwen2-1.5B, demonstrate the system's robustness and the variability of performance across task- and topic-specific scenarios. The evaluations confirm considerable performance disparities in RAG systems across topics and tasks, underscoring both the gap and the potential for model improvement (Figure 2).

Figure 2: Visualization of matrix-based evaluation of GTE-Qwen2-1.5B+Yi15-34B on Rouge-L.

Implications and Future Work

The creation of OmniEval marks a significant stride towards the automatic and detailed evaluation of RAG systems within specialized domains. By providing a nuanced assessment of model performance tailored to complex financial scenarios, OmniEval underscores the necessity of integrating comprehensive domain-specific evaluation benchmarks into model training and testing. Future research directions may include expanding the scope of OmniEval to cover more nuanced financial topics and employing it as a training tool for refining LLM performance on expert tasks.

Conclusion

OmniEval presents a significant advancement in the evaluation of RAG systems for domain-specific applications. By offering a multi-dimensional, matrix-driven assessment framework, it illuminates the nuanced performance of retrievers and generators across diverse financial query scenarios, paving the way for enhanced RAG capabilities in specialized fields. Such domain-oriented benchmarks are crucial for steering LLM development toward adept handling of intricate domain-specific queries, offering a structured approach to addressing the gap between general AI models and expert domain necessities.
