Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

Published 7 Mar 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2403.04696v2

Abstract: LLMs are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.


Summary

  • The paper presents a novel Claim-Conditioned Probability (CCP) method that quantifies token-level uncertainty to detect LLM hallucinations.
  • CCP outperforms traditional baselines on ROC-AUC metrics in multilingual experiments, enhancing reliability without extra external resources.
  • The method decomposes text into atomic claims and aggregates uncertainty scores, offering a scalable pipeline for LLM fact-checking.

Fact-Checking the Output of LLMs via Token-Level Uncertainty Quantification

Introduction

LLMs have emerged as powerful tools in natural language processing, excelling in a wide array of tasks such as information retrieval, medical question answering, and content generation. Despite their versatility and the growing trust placed in them as sources of information, these models face a critical challenge: the propensity to hallucinate. Hallucinations are erroneous claims that appear inside otherwise coherent, largely factual text, which makes them hard for users to spot and therefore risky. The paper introduces a fact-checking pipeline based on token-level uncertainty quantification (UQ), which leverages uncertainty scores derived from the neural network's output to detect unreliable predictions.

Methods

The proposed method, Claim-Conditioned Probability (CCP), measures the uncertainty of the specific claim value expressed by the model, filtering out uncertainty about which claim to generate and which surface form to use. CCP scores the atomic claims in the generated output and improves over baselines such as Maximum Probability across a range of LLMs and languages, while remaining competitive with fact-checking approaches that rely on external knowledge.

Figure 1: Visual comparison of our proposed Claim-Conditioned Probability method to the Maximum Probability baseline. CCP accurately identifies the incorrectly specified number of awards.
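
The mechanics of CCP can be sketched as follows. This is a minimal, hypothetical Python sketch rather than the authors' implementation: it assumes access to the top-k alternative tokens with their probabilities, a `substitute` helper that rebuilds the text with a candidate token, and an `nli_label` function (for example, backed by an NLI model) that classifies a perturbed text against the original; all of these names are placeholders.

```python
# Schematic sketch of Claim-Conditioned Probability (CCP), based on the
# description above. TokenAlternatives, substitute, and nli_label are
# hypothetical placeholders; the paper's implementation details may differ.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TokenAlternatives:
    """Top-k alternatives for one generated token position."""
    position: int
    greedy_token: str
    candidates: List[Tuple[str, float]]  # (token_text, probability) pairs

def token_ccp(
    alts: TokenAlternatives,
    original_text: str,
    substitute: Callable[[str, int, str], str],   # rebuilds the text with a candidate token
    nli_label: Callable[[str, str], str],         # returns "entail" / "contradict" / "neutral"
) -> float:
    """CCP for one token: probability mass of candidates that preserve the claim's
    meaning, renormalised over meaning-preserving plus meaning-changing candidates.
    Neutral candidates (a different claim or surface form) are discarded, which is
    how uncertainty irrelevant to the claim's value is filtered out."""
    entail_mass, contradict_mass = 0.0, 0.0
    for token, prob in alts.candidates:
        label = nli_label(original_text, substitute(original_text, alts.position, token))
        if label == "entail":
            entail_mass += prob
        elif label == "contradict":
            contradict_mass += prob
        # "neutral" candidates are ignored by design
    denom = entail_mass + contradict_mass
    # Convention for this sketch: if no candidate is informative, treat the token as certain.
    return entail_mass / denom if denom > 0 else 1.0

def claim_uncertainty(token_ccps: List[float]) -> float:
    """Aggregate token-level CCP into a claim-level uncertainty score:
    one minus the product of CCP over the tokens belonging to the claim."""
    prod = 1.0
    for p in token_ccps:
        prod *= p
    return 1.0 - prod
```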

The fact-checking pipeline begins by decomposing the generated text into atomic claims using a smaller specialized model or an API such as ChatGPT. Token-level uncertainty scores are then computed and aggregated into claim-level uncertainties, which are used to highlight potentially unreliable parts of the text for end users.
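
The sketch below ties these steps together under simplifying assumptions: `decompose_into_claims`, `token_uncertainties`, and the threshold are placeholders standing in for the claim splitter (e.g., a ChatGPT prompt), the token-level UQ method (e.g., CCP from the previous sketch), and a calibrated decision rule. Matching each claim back to its token positions is itself a nontrivial step that this sketch assumes is already solved.

```python
# Schematic end-to-end sketch of the fact-checking pipeline described above;
# component names are hypothetical, not the authors' API.
from typing import Callable, Dict, List, Tuple

def fact_check(
    generated_text: str,
    decompose_into_claims: Callable[[str], List[Tuple[str, List[int]]]],  # claim text + its token positions
    token_uncertainties: Callable[[str], Dict[int, float]],               # token position -> uncertainty in [0, 1]
    threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Split the text into atomic claims, aggregate token-level uncertainty
    per claim, and flag claims whose uncertainty exceeds the threshold."""
    per_token = token_uncertainties(generated_text)
    report = []
    for claim_text, positions in decompose_into_claims(generated_text):
        # Simple mean aggregation for illustration; the CCP sketch above
        # instead multiplies token-level scores within a claim.
        scores = [per_token[p] for p in positions if p in per_token]
        claim_score = sum(scores) / len(scores) if scores else 0.0
        report.append((claim_text, claim_score, claim_score > threshold))
    return report
```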

Experiments

The experiments involved biography generation with seven LLMs in four languages, including English, Chinese, and Arabic. A benchmark was constructed using FactScore, an automatic fact-checking tool, for annotation. In addition, human evaluations were conducted to validate the automated annotations, and CCP was compared against several baselines, including Maximum Probability and P(True).

Figure 2: ROC-AUC of claim-level UQ methods based on FactScore labels, aggregated into bins when considering facts from various sentence lengths.

Overall, CCP outperformed traditional baselines and correlated strongly with manual annotations, indicating its efficacy and practicality when external fact-checking resources are unavailable.
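
To make the evaluation protocol concrete, the snippet below shows how claim-level uncertainty scores can be scored against binary factuality labels (e.g., from FactScore or human annotators) with ROC-AUC, as in the figure above. The labels and scores here are invented for illustration and do not come from the paper.

```python
# Illustrative ROC-AUC evaluation of claim-level uncertainty scores
# against binary factuality labels; all numbers are made up.
from sklearn.metrics import roc_auc_score

# 1 = the claim is non-factual (hallucinated), 0 = the claim is supported
labels = [0, 0, 1, 0, 1, 1, 0, 1]
# Higher uncertainty should indicate a higher chance of a non-factual claim
uncertainty_scores = [0.05, 0.20, 0.85, 0.10, 0.60, 0.90, 0.30, 0.40]

print(f"ROC-AUC: {roc_auc_score(labels, uncertainty_scores):.3f}")
```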

Results

CCP consistently surpassed other UQ methods in performance across multiple LLMs and languages, showcasing robust improvements in detecting hallucinations:

  • On English benchmarks with FactScore labels, CCP achieved higher ROC-AUC scores than the baselines across all tested models, including GPT-3.5-turbo and other state-of-the-art LLMs.
  • Multilingual evaluations reinforced the effectiveness of CCP, particularly for Chinese and Arabic, where CCP maintained a substantial lead over the alternatives.

The paper's empirical findings suggest that token-level uncertainty quantification can effectively serve as an internal mechanism for hallucination detection, potentially replacing more complex systems dependent on external knowledge bases.

Conclusion

The research outlines strides in fact-checking LLM outputs using token-level UQ, culminating in the development of CCP, a method that filters out uncertainty irrelevant to the claim being checked. By refining uncertainty estimation, the approach improves the reliability of LLM outputs without relying on external knowledge sources and with only modest additional computational cost. Future work could refine how uncertainty is post-processed within LLM architectures and extend the approach to a broader range of NLP tasks.

In summary, CCP provides a pragmatic solution to the growing concern of hallucinations in LLM outputs, advancing both the theoretical understanding and practical approaches in AI-driven text generation validation.
