- The paper shows GPT-4o achieves 56% agreement with human evaluations in a manual condition, rising to 72% with post-editing support.
- It employs a three-tier support scale (FS, PS, NS) and weighted precision/recall metrics; system-level scores from LLM and human judges correlate at Kendall's τ > 0.79.
- Error analysis reveals GPT-4o tends to over-assign Partial Support while human judges sometimes miss supporting evidence, suggesting LLM judges as a cost-effective option for RAG assessment.
This paper investigates the feasibility of using LLMs, specifically GPT-4o, as automated judges for evaluating the "support" aspect in Retrieval-Augmented Generation (RAG) systems, comparing their performance against human judges (2504.15205). Support evaluation determines if the information presented in a generated answer sentence is factually backed by the cited source documents. This is crucial for assessing RAG system quality and reducing hallucinations.
The study was conducted using data from the TREC 2024 RAG Track, involving 45 system submissions across 36 diverse, non-factoid topics. The evaluation focused on sentence-level support, using a three-tier scale: Full Support (FS), Partial Support (PS), and No Support (NS). Due to budget constraints, only the first cited passage for each answer sentence was assessed.
Two primary conditions were used for human assessment:
- Manual from scratch: Human judges assessed support without any prior information.
- Manual with post-editing: Human judges were shown GPT-4o's predicted support label before making their final assessment.
GPT-4o was used as the automatic LLM judge, prompted with the answer sentence and the cited passage text to output one of the three support labels. Evaluation metrics included weighted precision (penalizing over-citation) and weighted recall (penalizing under-citation), assigning weights of 1.0 for FS, 0.5 for PS, and 0.0 for NS.
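Concretely, the weighting scheme maps each support label to a numeric score that is then aggregated per system. A minimal sketch, assuming a simple mean over judged sentences (the function name and aggregation are illustrative; the paper's exact precision/recall definitions, which penalize over- and under-citation separately, may differ):

```python
# Support-label weights as described above; the aggregation below is an
# illustrative average, not necessarily the paper's exact metric.
SUPPORT_WEIGHT = {"FS": 1.0, "PS": 0.5, "NS": 0.0}

def weighted_support_score(labels):
    """Mean support weight over a system's judged answer sentences."""
    if not labels:
        return 0.0
    return sum(SUPPORT_WEIGHT[lab] for lab in labels) / len(labels)

print(weighted_support_score(["FS", "PS", "NS", "FS"]))  # 0.625
```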
Key Findings:
- Agreement: In the "manual from scratch" condition, GPT-4o and human judgments matched exactly 56% of the time. This agreement increased to 72% in the "manual with post-editing" condition.
- Correlation: System-level scores (weighted precision and recall) showed strong correlation (Kendall's τ > 0.79) between human and GPT-4o judges across both conditions.
- Disagreement Analysis: A follow-up study with an independent human judge and LLAMA-3.1 405B re-assessed the 537 cases where the original human judge and GPT-4o disagreed.
- Surprisingly, the independent human judge showed higher agreement with GPT-4o (Cohen's κ ≈ 0.27–0.29) than with the original human judge (Cohen's κ ≈ −0.03 to 0.07).
- LLAMA-3.1 also showed strong agreement with GPT-4o (Cohen's κ ≈ 0.46–0.60).
- Disagreements most frequently involved the "Partial Support" label. GPT-4o tended to label more instances as PS, while human judges labeled more as NS.
- Error Types:
- GPT-4o errors: Confusing similar concepts, failing to evaluate the entire sentence, assigning PS when the passage offered no support (NS).
- Human errors: Insufficiently careful reading leading to missed supporting evidence (labeling FS as NS), potential bias from prior knowledge overriding passage content.
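The two agreement statistics used above can be illustrated on toy data. A self-contained sketch, assuming invented label sequences and scores (the real analysis covered 537 disagreement cases and 45 systems):

```python
# Toy illustration of Cohen's kappa (label-level agreement) and
# Kendall's tau (system-level rank correlation). All data is invented.
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def kendall_tau(x, y):
    """Kendall's tau-a over paired system scores (no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (x[i] - x[j]) * (y[i] - y[j])
            s += (d > 0) - (d < 0)
    return s / (n * (n - 1) // 2)

judge_a = ["FS", "PS", "NS", "FS", "PS", "NS"]  # e.g. a human judge
judge_b = ["FS", "PS", "PS", "FS", "NS", "NS"]  # e.g. an LLM judge
print(round(cohen_kappa(judge_a, judge_b), 3))  # 0.5

human_scores = [0.9, 0.7, 0.5, 0.3]    # per-system weighted precision (toy)
llm_scores = [0.85, 0.55, 0.6, 0.2]
print(round(kendall_tau(human_scores, llm_scores), 3))  # 0.667
```

In practice, `scipy.stats.kendalltau` and `sklearn.metrics.cohen_kappa_score` provide equivalent, tie-aware implementations.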
Conclusion and Practical Implications:
The results suggest that LLMs like GPT-4o can be a reliable and potentially more consistent alternative or supplement to human judges for RAG support evaluation, especially given the higher agreement observed in the disagreement study. Using LLMs could significantly reduce the cost and effort of large-scale RAG evaluations. The study highlights that disagreements often center on ambiguous "Partial Support" cases and identifies specific error patterns for both humans and LLMs, offering directions for improving future support assessment protocols and LLM-based evaluation methods. The choice between human and LLM judges may depend on budget, scale, and the specific requirements for evaluation rigor.