- The paper shows GPT-4o achieves 56% agreement with human evaluations in a manual condition, rising to 72% with post-editing support.
- It employs a three-tier support scale (FS, PS, NS) and weighted precision/recall metrics; system-level scores from LLM and human judges correlate at Kendall's τ > 0.79.
- Error analysis reveals GPT-4o tends to over-assign Partial Support while human judges sometimes miss supporting evidence, suggesting LLM judges as a cost-effective option for RAG assessment.
This paper investigates the feasibility of using LLMs, specifically GPT-4o, as automated judges for evaluating the "support" aspect in Retrieval-Augmented Generation (RAG) systems, comparing their performance against human judges (2504.15205). Support evaluation determines if the information presented in a generated answer sentence is factually backed by the cited source documents. This is crucial for assessing RAG system quality and reducing hallucinations.
The study was conducted using data from the TREC 2024 RAG Track, involving 45 system submissions across 36 diverse, non-factoid topics. The evaluation focused on sentence-level support, using a three-tier scale: Full Support (FS), Partial Support (PS), and No Support (NS). Due to budget constraints, only the first cited passage for each answer sentence was assessed.
Two primary conditions were used for human assessment:
- Manual from scratch: Human judges assessed support without any prior information.
- Manual with post-editing: Human judges were shown GPT-4o's predicted support label before making their final assessment.
GPT-4o was used as the automatic LLM judge, prompted with the answer sentence and the cited passage text to output one of the three support labels. Evaluation metrics included weighted precision (penalizing over-citation) and weighted recall (penalizing under-citation), assigning weights of 1.0 for FS, 0.5 for PS, and 0.0 for NS.
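Concretely, the weighting scheme maps each support label to a numeric score that is then aggregated per system. A minimal sketch, assuming a simple mean over judged sentences (the function name and aggregation are illustrative; the paper's exact precision/recall definitions, which penalize over- and under-citation separately, may differ):

```python
# Support-label weights as described above; the aggregation below is an
# illustrative average, not necessarily the paper's exact metric.
SUPPORT_WEIGHT = {"FS": 1.0, "PS": 0.5, "NS": 0.0}

def weighted_support_score(labels):
    """Mean support weight over a system's judged answer sentences."""
    if not labels:
        return 0.0
    return sum(SUPPORT_WEIGHT[lab] for lab in labels) / len(labels)

print(weighted_support_score(["FS", "PS", "NS", "FS"]))  # 0.625
```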
Key Findings:
- Agreement: In the "manual from scratch" condition, GPT-4o and human judgments matched exactly 56% of the time. This agreement increased to 72% in the "manual with post-editing" condition.
- Correlation: System-level scores (weighted precision and recall) showed strong correlation (Kendall's τ > 0.79) between human and GPT-4o judges across both conditions.
- Disagreement Analysis: A follow-up study with an independent human judge and LLAMA-3.1 405B re-assessed the 537 cases where the original human judge and GPT-4o disagreed.
- Surprisingly, the independent human judge showed higher agreement with GPT-4o (Cohen's κ ≈ 0.27–0.29) than with the original human judge (Cohen's κ ≈ −0.03 to 0.07).
- LLAMA-3.1 also showed strong agreement with GPT-4o (Cohen's κ ≈ 0.46–0.60).
- Disagreements most frequently involved the "Partial Support" label. GPT-4o tended to label more instances as PS, while human judges labeled more as NS.
- Error Types:
- GPT-4o errors: Confusing similar concepts, failing to evaluate the entire sentence, assigning PS when the passage offered no support (NS).
- Human errors: Insufficiently careful reading leading to missed supporting evidence (labeling FS as NS), potential bias from prior knowledge overriding passage content.
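The two agreement statistics used above can be illustrated on toy data. A self-contained sketch, assuming invented label sequences and scores (the real analysis covered 537 disagreement cases and 45 systems):

```python
# Toy illustration of Cohen's kappa (label-level agreement) and
# Kendall's tau (system-level rank correlation). All data is invented.
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def kendall_tau(x, y):
    """Kendall's tau-a over paired system scores (no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (x[i] - x[j]) * (y[i] - y[j])
            s += (d > 0) - (d < 0)
    return s / (n * (n - 1) // 2)

judge_a = ["FS", "PS", "NS", "FS", "PS", "NS"]  # e.g. a human judge
judge_b = ["FS", "PS", "PS", "FS", "NS", "NS"]  # e.g. an LLM judge
print(round(cohen_kappa(judge_a, judge_b), 3))  # 0.5

human_scores = [0.9, 0.7, 0.5, 0.3]    # per-system weighted precision (toy)
llm_scores = [0.85, 0.55, 0.6, 0.2]
print(round(kendall_tau(human_scores, llm_scores), 3))  # 0.667
```

In practice, `scipy.stats.kendalltau` and `sklearn.metrics.cohen_kappa_score` provide equivalent, tie-aware implementations.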
Conclusion and Practical Implications:
The results suggest that LLMs like GPT-4o can be a reliable and potentially more consistent alternative or supplement to human judges for RAG support evaluation, especially given the higher agreement observed in the disagreement study. Using LLMs could significantly reduce the cost and effort of large-scale RAG evaluations. The study highlights that disagreements often center on ambiguous "Partial Support" cases and identifies specific error patterns for both humans and LLMs, offering directions for improving future support assessment protocols and LLM-based evaluation methods. The choice between human and LLM judges may depend on budget, scale, and the specific requirements for evaluation rigor.