FactCHD: Benchmarking Fact-Conflicting Hallucination Detection
Abstract: Despite their impressive generative capabilities, LLMs are hindered by fact-conflicting hallucinations in real-world applications. Accurately identifying hallucinations in LLM-generated text, especially in complex inferential scenarios, remains a relatively unexplored area. To address this gap, we present FactCHD, a benchmark designed specifically for detecting fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset spanning several factuality patterns, including vanilla, multi-hop, comparison, and set operations. A distinctive element of FactCHD is its integration of fact-based evidence chains, which significantly deepens the evaluation of detectors' explanations. Experiments on different LLMs expose the shortcomings of current approaches in accurately detecting factual errors. Furthermore, we introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and a LoRA-tuned Llama2, aiming to yield more credible detection by combining predictive results and evidence. The benchmark dataset is available at https://github.com/zjunlp/FactCHD.
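The triangulation idea in the abstract can be sketched as a simple reconciliation of two detectors' outputs. The dataclass fields, detector names, and the confidence-based tie-breaking rule below are illustrative assumptions, not the paper's exact algorithm:

```python
# Hypothetical sketch of the Truth-Triangulator idea: two detectors
# (e.g. a tool-enhanced model and a LoRA-tuned model) each emit a
# verdict plus supporting evidence, and a simple judge reconciles them.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # "FACTUAL" or "NON-FACTUAL"
    evidence: str      # evidence chain supporting the label
    confidence: float  # detector's self-reported confidence in [0, 1]

def triangulate(a: Verdict, b: Verdict) -> Verdict:
    """Merge two detectors' verdicts into one final judgement."""
    if a.label == b.label:
        # Agreement: keep the shared label and pool the evidence chains.
        return Verdict(a.label, f"{a.evidence} | {b.evidence}",
                       max(a.confidence, b.confidence))
    # Disagreement: fall back to the more confident detector
    # (an assumed tie-breaking rule, chosen for illustration).
    return a if a.confidence >= b.confidence else b

# Example: the tool-enhanced detector cites external evidence with high
# confidence, so its NON-FACTUAL verdict wins on disagreement.
guardian = Verdict("NON-FACTUAL", "Wikidata lookup contradicts the claim", 0.9)
seeker = Verdict("FACTUAL", "parametric model recall", 0.6)
final = triangulate(guardian, seeker)
print(final.label)  # prints "NON-FACTUAL"
```

The point of combining a retrieval-grounded detector with a fine-tuned one is that their errors are unlikely to be correlated, so agreement is a stronger signal than either verdict alone.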