Quantifying LLM Hallucination

Develop reliable and reproducible evaluation methodologies that accurately quantify hallucination rates in large language models, particularly for document-grounded question answering and other long-form contexts.

Background

The paper highlights that existing approaches to measuring hallucination suffer from contamination, reliance on LLM-as-judge scoring, and narrow task focus. It notes that widely used benchmarks have specific limitations: TruthfulQA raises contamination concerns, while SimpleQA is limited to short-form responses.

The authors argue that no existing benchmark adequately captures long-form, document-grounded knowledge extraction, motivating the need for robust methodologies (such as their RIKER approach) that enable deterministic scoring and resist contamination. This open challenge underpins the broader motivation of the study.
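As a rough illustration of what deterministic, contamination-resistant scoring can look like, the sketch below checks model answers against facts known to appear verbatim in the source document and computes a hallucination rate by pure string matching, with no LLM-as-judge in the loop. This is not the RIKER implementation; the helper names (normalize, score_answer, hallucination_rate) and the containment-matching criterion are assumptions for illustration only.

```python
# Minimal sketch of deterministic, document-grounded scoring.
# NOT the paper's RIKER method; it only illustrates replacing LLM-as-judge
# scoring with exact, reproducible checks against facts that are known to
# appear verbatim in the source document. All names are hypothetical.

import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for exact matching."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def score_answer(answer: str, gold_facts: Iterable[str]) -> bool:
    """Return True iff every gold fact (verbatim from the document) appears in the answer.

    Pure string matching makes the score deterministic: the same answer
    always yields the same result across runs and platforms.
    """
    norm_answer = normalize(answer)
    return all(normalize(fact) in norm_answer for fact in gold_facts)


def hallucination_rate(answers: list[str], gold: list[list[str]]) -> float:
    """Fraction of answers that fail the deterministic grounding check."""
    failures = sum(
        1 for ans, facts in zip(answers, gold) if not score_answer(ans, facts)
    )
    return failures / len(answers) if answers else 0.0


if __name__ == "__main__":
    answers = [
        "The contract was signed on 12 May 2021 in Geneva.",
        "The contract was signed in Paris.",  # second answer hallucinates the location
    ]
    gold = [["12 May 2021", "Geneva"], ["12 May 2021", "Geneva"]]
    print(hallucination_rate(answers, gold))  # -> 0.5
```

Because scoring reduces to string comparison, results are bit-for-bit reproducible and cannot be gamed by answers memorized from public benchmarks, which is the property the authors seek from a contamination-resistant methodology.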

References

Quantifying LLM hallucination remains an open challenge.

How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms (2603.08274 - Roig, 9 Mar 2026), Section 2.1 (Related Work: Hallucination Measurement)