
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Published 7 May 2025 in cs.CL and cs.AI | (2505.04847v1)

Abstract: Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.

Summary

  • The paper introduces FaithJudge, a few-shot LLM-as-a-judge evaluation approach, and an enhanced version of Vectara's Hallucination Leaderboard to benchmark and improve LLM faithfulness in RAG.
  • Current hallucination detection methods demonstrate limited accuracy (~50%), underscoring the necessity for more effective techniques to ensure factual consistency in LLM outputs.
  • FaithJudge achieved an F1-macro of 82.1% on the FaithBench dataset, showing substantial improvement in approximating human judgment for hallucination detection compared to prior methods.
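The F1-macro figure reported above is the unweighted mean of per-class F1 scores, so the minority class (hallucinated summaries) counts as much as the majority class. A minimal sketch of the metric, assuming binary faithful/hallucinated labels (label names are illustrative, not the paper's):

```python
def f1_macro(y_true, y_pred, labels=("faithful", "hallucinated")):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because each class contributes equally, a judge that simply labels everything "faithful" scores poorly here even if most summaries really are faithful.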

Benchmarking LLM Faithfulness in Retrieval-Augmented Generation

This paper investigates the challenge of hallucinations within LLMs when employing Retrieval-Augmented Generation (RAG) techniques. Despite RAG's purpose of grounding model outputs in retrieved documents to enhance faithfulness, issues persist with LLMs producing unsupported or contradictory information. The research highlights efforts aimed at measuring and mitigating such hallucinations, specifically within the context of summarization tasks.

The research makes two significant contributions: Vectara's Hallucination Leaderboard and FaithJudge. Vectara's Hallucination Leaderboard utilizes the Hughes Hallucination Evaluation Model (HHEM) 2.1 to evaluate the prevalence of hallucinations in LLM-generated summaries. It serves as a benchmarking system comparing over 130 LLMs and represents a significant effort to track ongoing progress in LLM faithfulness.

The paper identifies limitations in existing hallucination detection methodologies, which generally demonstrate modest accuracy on benchmarks such as FaithBench and AggreFact. Specifically, it highlights that many current models achieve close to 50% accuracy, underscoring substantial room for improvement. It also explores zero-shot detection using LLMs themselves as judges, finding that larger models, such as GPT-4o and o3-mini-high, tend to perform best, though overall accuracy still leaves considerable room for improvement.

FaithJudge is proposed as an enhanced hallucination evaluation approach that relies on few-shot prompting informed by human-annotated examples of hallucinations. By guiding the judge with these labelled examples, FaithJudge better approximates human judgment through automated means. Initial tests show FaithJudge achieving an F1-macro of 82.1% on the FaithBench dataset with the o3-mini-high judge, a substantial gain over previous methods.
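The paper does not reproduce its prompt here, but the few-shot judging setup it describes can be sketched as a prompt builder that interleaves human-annotated hallucination examples before the candidate to be judged. All field names, wording, and the verdict format below are assumptions for illustration, not the paper's actual prompt:

```python
def build_judge_prompt(source_doc, candidate_summary, annotated_examples):
    """Assemble a few-shot LLM-as-a-judge prompt: each example pairs a
    source/summary with a human hallucination annotation, steering the
    judge toward human-consistent verdicts on the final candidate."""
    parts = [
        "You are a strict judge of summary faithfulness.",
        "A summary is hallucinated if it adds unsupported claims or contradicts the source.",
        "",
    ]
    for i, ex in enumerate(annotated_examples, 1):
        parts += [
            f"Example {i}:",
            f"Source: {ex['source']}",
            f"Summary: {ex['summary']}",
            f"Human annotation: {ex['annotation']}",
            "",
        ]
    parts += [
        "Now judge the following.",
        f"Source: {source_doc}",
        f"Summary: {candidate_summary}",
        "Verdict (faithful / hallucinated):",
    ]
    return "\n".join(parts)
```

The resulting string would then be sent to the judge model (e.g., o3-mini-high in the paper's experiments), whose verdict is compared against human labels.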

The research also extends FaithJudge's applicability beyond summarization to other RAG domains like question answering and data-to-text generation using the RAGTruth dataset. This extension aims to build a broader understanding of LLM performance across tasks requiring faithful content generation.

The implications of this research are multifaceted. Practically, more reliable hallucination detection can lead to LLMs that are more trustworthy in real-world applications, especially in sectors where factual consistency is critical, such as the legal and medical fields. Theoretically, this work suggests a promising path toward improving automated fact-checking by leveraging LLM capabilities in a few-shot setting, and it encourages future research on refining LLM evaluation with robust benchmarking systems like FaithJudge.

In conclusion, this paper underscores the ongoing challenge of mitigating hallucinations within RAG frameworks and presents tools that contribute to measuring and improving faithfulness in LLM outputs. Future efforts will likely focus on enhancing the robustness of models against hallucinations and refining the methodologies for evaluation and benchmarking, thereby driving forward advancements in the creation of reliable AI systems.
