FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation
Abstract: The rapid adoption of LMs across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline for evaluating LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgments by VERIFY correlate better with human evaluations than those of existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect (unsupported) and inconclusive (undecidable) LM responses. These prompts form FactBench, a dataset of 1K prompts spanning 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely used LMs from the GPT, Gemini, and Llama families on FactBench, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance decreasing from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual precision than Llama3.1-70B-Instruct across all evaluation methods, owing to its higher output subjectivity, which leads to more content being labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.
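The abstract describes VERIFY as labeling each content unit supported, unsupported, or undecidable against retrieved evidence, then ranking prompts by their rate of unsupported and undecidable units. A minimal sketch of that scoring step is below; the `Unit` dataclass and `hallucination_score` function are illustrative assumptions, not the paper's actual implementation, and the real pipeline's retrieval and labeling components are elided.

```python
from dataclasses import dataclass
from typing import List

# The three labels a content unit can receive after evidence checking,
# per the abstract's taxonomy.
SUPPORTED, UNSUPPORTED, UNDECIDABLE = "supported", "unsupported", "undecidable"

@dataclass
class Unit:
    text: str
    label: str  # one of the three labels above

def hallucination_score(units: List[Unit]) -> float:
    """Fraction of units that are unsupported or undecidable.

    Under this reading of the pipeline, prompts whose responses yield
    the highest scores would be flagged as 'hallucination prompts'.
    """
    if not units:
        return 0.0
    bad = sum(u.label in (UNSUPPORTED, UNDECIDABLE) for u in units)
    return bad / len(units)

# Toy example: one verified fact, one contradicted claim, one
# subjective claim that no evidence can settle.
units = [
    Unit("The Eiffel Tower is in Paris.", SUPPORTED),
    Unit("It was painted green in 2021.", UNSUPPORTED),
    Unit("Most visitors prefer the view at dusk.", UNDECIDABLE),
]
print(round(hallucination_score(units), 3))  # 2 of 3 units not supported
```

Note that undecidable units count against a prompt here, which matches the abstract's observation that higher output subjectivity (more undecidable content) lowers measured factual precision.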