LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation

Published 28 Feb 2025 in cs.CL and cs.IR | (2502.20640v1)

Abstract: Retrieval-augmented generation (RAG) has proven highly effective in improving LLMs across various domains. However, there is no benchmark specifically designed to assess the effectiveness of RAG in the legal domain, which restricts progress in this area. To fill this gap, we propose LexRAG, the first benchmark to evaluate RAG systems for multi-turn legal consultations. LexRAG consists of 1,013 multi-turn dialogue samples and 17,228 candidate legal articles. Each sample is annotated by legal experts and consists of five rounds of progressive questioning. LexRAG includes two key tasks: (1) Conversational knowledge retrieval, requiring accurate retrieval of relevant legal articles based on multi-turn context. (2) Response generation, focusing on producing legally sound answers. To ensure reliable reproducibility, we develop LexiT, a legal RAG toolkit that provides a comprehensive implementation of RAG system components tailored for the legal domain. Additionally, we introduce an LLM-as-a-judge evaluation pipeline to enable detailed and effective assessment. Through experimental analysis of various LLMs and retrieval methods, we reveal the key limitations of existing RAG systems in handling legal consultation conversations. LexRAG establishes a new benchmark for the practical application of RAG systems in the legal domain, with its code and data available at https://github.com/CSHaitao/LexRAG.

Abstract PDF Upgrade to Chat

Summary

The paper introduces LexRAG, a benchmark that evaluates retrieval-augmented generation in multi-turn legal consultations by addressing challenges like context management and legal reasoning.
It leverages a dataset of 1,013 expert-annotated dialogues and 17,228 legal articles to assess both conversational knowledge retrieval and precise response generation.
The study reveals that even advanced LLMs with neural retrieval strategies struggle with complex legal reasoning, highlighting significant opportunities for future improvements.

LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation

Introduction

The paper introduces LexRAG, a benchmark designed to evaluate Retrieval-Augmented Generation (RAG) systems within multi-turn legal consultation dialogues. RAG has demonstrated enhanced performance in several domains by synergizing information retrieval with language generation. However, its adoption and evaluation within the legal field face significant challenges. Legal dialogues often involve extensive context management, abrupt topic shifts, and complex legal reasoning. LexRAG aims to provide an evaluation framework that addresses these challenges through a specially curated dataset and toolkit.

LexRAG Benchmark

LexRAG consists of 1,013 dialogues, each featuring five turns of interaction, annotated by legal experts. The dataset includes 17,228 legal articles across various domains, challenging the RAG systems to retrieve legally relevant information and generate precise responses. The benchmark focuses on two key tasks: Conversational Knowledge Retrieval and Response Generation. Systems must effectively integrate dialogue context to retrieve relevant articles and generate responses consistent with legal reasoning and logic.

Figure 1: An example of a legal consultation in LexRAG.

The complexity of legal domain RAG lies in managing multi-turn interactions where queries evolve through clarification and topic shifts, often unrelated to lexical similarity. Retrieval systems must reason beyond lexical matching to pinpoint the legal essence of inquiries.

LexiT Toolkit

To facilitate the adoption and reproducible evaluation of RAG systems, the LexiT toolkit provides a modular framework encompassing retrieval, generation, and evaluation components. LexiT supports various retrieval strategies, from BM25 for lexical matching to neural methods like BGE and GTE-Qwen2. For query processing, it offers techniques ranging from simple last-query utilization to sophisticated query rewriting using transformer models to ensure coherent and contextually appropriate inputs for retrieval.

Figure 2: Overview of LexiT Components.

LexiT also features an LLM-as-a-judge evaluation approach, using LLMs to simulate expert evaluation across dimensions like factuality, logical coherence, and completeness.

Challenges and Evaluation

Experiments conducted using LexRAG reveal significant challenges in legal domain RAG systems. While dense retrieval models outperform lexical methods, they still underperform in scenarios requiring complex legal reasoning. The best-performing retrieval strategy only managed a Recall@10 of 33.33%, indicating substantial room for improvement.

Similarly, response generation tasks highlight the limitations of even advanced LLMs. Under ideal settings with reference-annotated legal content, LLMs like Qwen-2.5-72B show improvement but still fail to capture comprehensive legal reasoning. This reflects the foundational gap in LLMs when confronted with the nuanced reasoning required in legal domains.

Conclusion

LexRAG sets a new standard for evaluating RAG systems in legal contexts, providing insights into the unique challenges posed by legal consultation dialogues. By incorporating modules that handle the intricacies of legal query management and context integration, LexRAG serves as an essential tool for advancing legal AI research. Future expansions will target multilingual datasets and the adaptation of RAG systems to diverse legal systems, extending its applicability beyond the Chinese legal context.