Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

Published 7 Mar 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2403.04696v2

Abstract: LLMs are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.


Summary

  • The paper presents a novel Claim-Conditioned Probability (CCP) method that quantifies token-level uncertainty to detect LLM hallucinations.
  • CCP outperforms traditional baselines on ROC-AUC metrics in multilingual experiments, enhancing reliability without extra external resources.
  • The method decomposes text into atomic claims and aggregates uncertainty scores, offering a scalable pipeline for LLM fact-checking.

Fact-Checking the Output of LLMs via Token-Level Uncertainty Quantification

Introduction

LLMs have emerged as powerful tools in natural language processing, excelling in a wide array of tasks such as information retrieval, medical question answering, and content generation. Despite their versatility and the growing trust placed in them as sources of information, these models face a critical challenge: the propensity to hallucinate. Hallucinations are erroneous claims that appear inside otherwise coherent, largely factual text, which makes them hard for users to spot and therefore risky. The paper introduces a fact-checking pipeline based on token-level uncertainty quantification (UQ), which leverages uncertainty scores derived from the neural network's output to detect unreliable predictions.

Methods

The proposed method, Claim-Conditioned Probability (CCP), measures the uncertainty of the specific claim value expressed by the model, filtering out uncertainty about which claim to generate and which surface form to use. CCP scores the atomic claims in the generated output and improves over baselines such as Maximum Probability across a range of LLMs and languages, while remaining competitive with fact-checking approaches that rely on external knowledge.

Figure 1: Visual comparison of our proposed Claim-Conditioned Probability method to the Maximum Probability baseline. CCP accurately identifies the incorrectly specified number of awards.
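
The mechanics of CCP can be sketched as follows. This is a minimal, hypothetical Python sketch rather than the authors' implementation: it assumes access to the top-k alternative tokens with their probabilities, a `substitute` helper that rebuilds the text with a candidate token, and an `nli_label` function (for example, backed by an NLI model) that classifies a perturbed text against the original; all of these names are placeholders.

```python
# Schematic sketch of Claim-Conditioned Probability (CCP), based on the
# description above. TokenAlternatives, substitute, and nli_label are
# hypothetical placeholders; the paper's implementation details may differ.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TokenAlternatives:
    """Top-k alternatives for one generated token position."""
    position: int
    greedy_token: str
    candidates: List[Tuple[str, float]]  # (token_text, probability) pairs

def token_ccp(
    alts: TokenAlternatives,
    original_text: str,
    substitute: Callable[[str, int, str], str],   # rebuilds the text with a candidate token
    nli_label: Callable[[str, str], str],         # returns "entail" / "contradict" / "neutral"
) -> float:
    """CCP for one token: probability mass of candidates that preserve the claim's
    meaning, renormalised over meaning-preserving plus meaning-changing candidates.
    Neutral candidates (a different claim or surface form) are discarded, which is
    how uncertainty irrelevant to the claim's value is filtered out."""
    entail_mass, contradict_mass = 0.0, 0.0
    for token, prob in alts.candidates:
        label = nli_label(original_text, substitute(original_text, alts.position, token))
        if label == "entail":
            entail_mass += prob
        elif label == "contradict":
            contradict_mass += prob
        # "neutral" candidates are ignored by design
    denom = entail_mass + contradict_mass
    # Convention for this sketch: if no candidate is informative, treat the token as certain.
    return entail_mass / denom if denom > 0 else 1.0

def claim_uncertainty(token_ccps: List[float]) -> float:
    """Aggregate token-level CCP into a claim-level uncertainty score:
    one minus the product of CCP over the tokens belonging to the claim."""
    prod = 1.0
    for p in token_ccps:
        prod *= p
    return 1.0 - prod
```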

The fact-checking pipeline begins by decomposing the generated text into atomic claims using a smaller specialized model or an API such as ChatGPT. Token-level uncertainty scores are then computed and aggregated into claim-level uncertainties, which are used to highlight potentially unreliable parts of the text for end users.
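
The sketch below ties these steps together under simplifying assumptions: `decompose_into_claims`, `token_uncertainties`, and the threshold are placeholders standing in for the claim splitter (e.g., a ChatGPT prompt), the token-level UQ method (e.g., CCP from the previous sketch), and a calibrated decision rule. Matching each claim back to its token positions is itself a nontrivial step that this sketch assumes is already solved.

```python
# Schematic end-to-end sketch of the fact-checking pipeline described above;
# component names are hypothetical, not the authors' API.
from typing import Callable, Dict, List, Tuple

def fact_check(
    generated_text: str,
    decompose_into_claims: Callable[[str], List[Tuple[str, List[int]]]],  # claim text + its token positions
    token_uncertainties: Callable[[str], Dict[int, float]],               # token position -> uncertainty in [0, 1]
    threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Split the text into atomic claims, aggregate token-level uncertainty
    per claim, and flag claims whose uncertainty exceeds the threshold."""
    per_token = token_uncertainties(generated_text)
    report = []
    for claim_text, positions in decompose_into_claims(generated_text):
        # Simple mean aggregation for illustration; the CCP sketch above
        # instead multiplies token-level scores within a claim.
        scores = [per_token[p] for p in positions if p in per_token]
        claim_score = sum(scores) / len(scores) if scores else 0.0
        report.append((claim_text, claim_score, claim_score > threshold))
    return report
```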

Experiments

The experiments involved biography generation with seven LLMs in four languages, including English, Chinese, and Arabic. A benchmark was constructed using FactScore, an automatic fact-checking tool, for annotation. In addition, human evaluations were conducted to validate the automated annotations, and CCP was compared against several baselines, including Maximum Probability and P(True).

Figure 2: ROC-AUC of claim-level UQ methods based on FactScore labels, aggregated into bins when considering facts from various sentence lengths.

Overall, CCP outperformed traditional baselines and correlated strongly with manual annotations, indicating its efficacy and practicality when external fact-checking resources are unavailable.
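
To make the evaluation protocol concrete, the snippet below shows how claim-level uncertainty scores can be scored against binary factuality labels (e.g., from FactScore or human annotators) with ROC-AUC, as in the figure above. The labels and scores here are invented for illustration and do not come from the paper.

```python
# Illustrative ROC-AUC evaluation of claim-level uncertainty scores
# against binary factuality labels; all numbers are made up.
from sklearn.metrics import roc_auc_score

# 1 = the claim is non-factual (hallucinated), 0 = the claim is supported
labels = [0, 0, 1, 0, 1, 1, 0, 1]
# Higher uncertainty should indicate a higher chance of a non-factual claim
uncertainty_scores = [0.05, 0.20, 0.85, 0.10, 0.60, 0.90, 0.30, 0.40]

print(f"ROC-AUC: {roc_auc_score(labels, uncertainty_scores):.3f}")
```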

Results

CCP consistently surpassed other UQ methods in performance across multiple LLMs and languages, showcasing robust improvements in detecting hallucinations:

  • On English benchmarks with FactScore labels, CCP achieved higher ROC-AUC scores than the baselines across all tested models, including GPT-3.5-turbo and other state-of-the-art LLMs.
  • Multilingual evaluations reinforced the effectiveness of CCP, particularly for Chinese and Arabic, where CCP maintained a substantial lead over the alternatives.

The paper's empirical findings suggest that token-level uncertainty quantification can effectively serve as an internal mechanism for hallucination detection, potentially replacing more complex systems dependent on external knowledge bases.

Conclusion

The research outlines strides in fact-checking LLM outputs using token-level UQ, culminating in the development of CCP, a method that filters out uncertainty irrelevant to the claim being checked. By refining uncertainty estimation, the approach improves the reliability of LLM outputs without relying on external knowledge sources and with only modest additional computational cost. Future work could refine how uncertainty is post-processed within LLM architectures and extend the approach to a broader range of NLP tasks.

In summary, CCP provides a pragmatic solution to the growing concern of hallucinations in LLM outputs, advancing both the theoretical understanding and practical approaches in AI-driven text generation validation.
