FactCHD: Benchmarking Fact-Conflicting Hallucination Detection

Published 18 Oct 2023 in cs.CL, cs.AI, cs.CV, cs.IR, and cs.LG | arXiv:2310.12086v3

Abstract: Despite their impressive generative capabilities, LLMs are hindered by fact-conflicting hallucinations in real-world applications. The accurate identification of hallucinations in texts generated by LLMs, especially in complex inferential scenarios, is a relatively unexplored area. To address this gap, we present FactCHD, a dedicated benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. A distinctive element of FactCHD is its integration of fact-based evidence chains, significantly enhancing the depth of evaluating the detectors' explanations. Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately. Furthermore, we introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence. The benchmark dataset is available at https://github.com/zjunlp/FactCHD.


Summary

  • The paper presents FactCHD, a benchmark for detecting and explaining fact-conflicting outputs from large language models.
  • It draws on diverse data sources to simulate realistic query-response scenarios across factuality patterns, including multi-hop and comparative reasoning.
  • Empirical evaluations with models such as GPT-3.5-turbo and Llama2-chat expose the weaknesses of current detectors and motivate the proposed Truth-Triangulator framework and specialized tuning approaches.

FactCHD: Benchmarking Fact-Conflicting Hallucination Detection

The paper presents FactCHD, a dedicated benchmark for detecting fact-conflicting hallucinations in outputs generated by LLMs. While LLMs have demonstrated significant generative capabilities, their tendency to produce factually inaccurate or hallucinatory text poses a barrier to deployment in critical domains such as finance, healthcare, and law. This work tackles the relatively unexplored problem of hallucination detection by establishing a comprehensive framework and dataset for evaluating LLMs' ability to recognize and explain factual inconsistencies.

Core Contributions

  1. Introduction of FactCHD: FactCHD is a benchmark designed to detect hallucinations stemming from conflicting facts. Unlike traditional fact-verification tasks, it simulates a realistic "Query-Response" scenario in which explicit claims or evidence may be absent. The dataset incorporates diverse factuality patterns, including vanilla, multi-hop, comparison, and set operation, each presenting unique challenges in reasoning and fact comprehension.
  2. Diverse Data Collection: The dataset spans multiple domains, drawing on varied sources such as knowledge graphs (KGs) and text corpora to reflect real-world application scenarios. Notably, FactCHD organizes these data within a categorical framework that treats vanilla, multi-hop reasoning, comparative analysis, and set operations as its fundamental factuality patterns.
  3. Golden Evidence Chains: FactCHD introduces golden chains of evidence to evaluate the capacity of hallucination detectors not only to identify non-factual statements but also to provide coherent, accurate explanations for their judgments. This emphasis on explanation is a key distinguishing feature of the dataset (a schematic instance illustrating the task format and an evidence chain is sketched after this list).
  4. Evaluation of LLMs and Approaches: The paper provides empirical evaluations using models such as GPT-3.5-turbo, Llama2-chat, and Alpaca across several learning paradigms: zero-shot prompting, in-context learning, and specifically tuned detection models. Results show significant variability in performance, highlighting the effectiveness of specialized tuning and knowledge-augmented approaches.
  5. Truth-Triangulator Framework: To enhance the reliability of hallucination detection, the authors propose the Truth-Triangulator, a framework inspired by triangulation theory. It cross-references multiple evidence sources, employing a Truth Seeker (a tool-enhanced ChatGPT-style detector) and a Truth Guardian (a LoRA-tuned Llama2 detector) to independently assess the factual accuracy of responses before a Fact Verdict Manager reconciles their judgments into a consensus (a simplified control-flow sketch follows the example instance below).
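
To make the task format concrete, here is a minimal sketch of what a FactCHD-style instance might look like. The field names (pattern, query, response, label, evidence_chain) are illustrative assumptions rather than the dataset's released schema; the linked GitHub repository documents the actual format.

```python
# A hypothetical FactCHD-style instance. The task: given a query-response
# pair, judge whether the response is FACTUAL or NON-FACTUAL and justify
# the verdict with a chain of evidence. Field names are illustrative.
example_instance = {
    # one of: "vanilla", "multi-hop", "comparison", "set operation"
    "pattern": "multi-hop",
    "query": "Which country is the director of Parasite from?",
    "response": "Bong Joon-ho, who directed Parasite, is from Japan.",
    "label": "NON-FACTUAL",
    # the golden evidence chain that grounds the verdict
    "evidence_chain": [
        "Parasite (2019) was directed by Bong Joon-ho.",
        "Bong Joon-ho was born in Daegu, South Korea.",
        "Hence the claim that he is from Japan conflicts with the facts.",
    ],
}
```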

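Below is a simplified sketch of the Truth-Triangulator control flow, assuming the Truth Seeker is a tool-enhanced ChatGPT-style detector and the Truth Guardian is a LoRA-tuned Llama2 detector, as the paper describes. The function names, the zero-shot prompt wording, and the conflict-resolution rule are illustrative assumptions, not the authors' exact implementation.

```python
from dataclasses import dataclass

# Illustrative zero-shot prompt in the spirit of the paper's
# Query-Response setting (wording is an assumption, not the paper's).
ZERO_SHOT_PROMPT = (
    "Given the QUERY and RESPONSE below, decide whether the RESPONSE is "
    "FACTUAL or NON-FACTUAL, and justify the verdict with a chain of "
    "evidence.\nQUERY: {query}\nRESPONSE: {response}"
)

@dataclass
class Judgment:
    verdict: str   # "FACTUAL" or "NON-FACTUAL"
    evidence: str  # rationale / evidence chain supporting the verdict

def truth_seeker(query: str, response: str) -> Judgment:
    """Placeholder: a tool-enhanced detector (e.g., ChatGPT with search
    or retrieval tools) prompted with ZERO_SHOT_PROMPT."""
    raise NotImplementedError

def truth_guardian(query: str, response: str) -> Judgment:
    """Placeholder: a specialized detector (e.g., Llama2 fine-tuned
    with LoRA on FactCHD-style training data)."""
    raise NotImplementedError

def fact_verdict_manager(seeker: Judgment, guardian: Judgment) -> Judgment:
    """Triangulate the two judgments. If they agree, merge their
    evidence; on conflict, this sketch defers to the tool-enhanced
    seeker, whose verdict is grounded in retrieved evidence rather
    than parametric memory. (Resolution rule is illustrative only.)"""
    if seeker.verdict == guardian.verdict:
        return Judgment(seeker.verdict,
                        seeker.evidence + "\n" + guardian.evidence)
    return seeker

def truth_triangulator(query: str, response: str) -> Judgment:
    return fact_verdict_manager(
        truth_seeker(query, response),
        truth_guardian(query, response),
    )
```
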
Implications and Future Directions

The introduction of FactCHD has several implications for the deployment of LLMs in sensitive or high-stakes environments. By providing a structured means to assess and improve hallucination detection, it supports the development of more reliable AI systems. Additionally, by combining evidence-based explanations with detection, FactCHD encourages transparency in model decision-making processes.

From a theoretical perspective, the benchmark offers a paradigm for understanding complex factual relationships within generated text, paving the way for more sophisticated LLM designs that could inherently manage fact verification. Practically, the approach embodied in FactCHD could facilitate the creation of AI tools better attuned to diverse, nuanced applications, ultimately enhancing trustworthiness and user confidence in AI systems.

For ongoing research and future developments, FactCHD invites further work on scalable knowledge integration, on leveraging advances in retrieval-augmented generation, and on methods that counteract the inherent limitations of LLMs in real-world fact-checking. Such efforts will be crucial in refining hallucination detection frameworks to meet the rigors of real-world deployment.
