FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation
Abstract: The rapid adoption of LMs across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline for evaluating LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgments by VERIFY correlate better with human evaluations than those of existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect (unsupported) and inconclusive (undecidable) LM responses. These prompts form FactBench, a dataset of 1K prompts spanning 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely used LMs from the GPT, Gemini, and Llama families on FactBench, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance decreasing from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual precision than Llama3.1-70B-Instruct across all evaluation methods, owing to its higher output subjectivity, which leads to more content being labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.
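The abstract describes VERIFY as labeling each content unit supported, unsupported, or undecidable against retrieved evidence, then ranking prompts by their rate of unsupported and undecidable units. A minimal sketch of that scoring step is below; the `Unit` dataclass and `hallucination_score` function are illustrative assumptions, not the paper's actual implementation, and the real pipeline's retrieval and labeling components are elided.

```python
from dataclasses import dataclass
from typing import List

# The three labels a content unit can receive after evidence checking,
# per the abstract's taxonomy.
SUPPORTED, UNSUPPORTED, UNDECIDABLE = "supported", "unsupported", "undecidable"

@dataclass
class Unit:
    text: str
    label: str  # one of the three labels above

def hallucination_score(units: List[Unit]) -> float:
    """Fraction of units that are unsupported or undecidable.

    Under this reading of the pipeline, prompts whose responses yield
    the highest scores would be flagged as 'hallucination prompts'.
    """
    if not units:
        return 0.0
    bad = sum(u.label in (UNSUPPORTED, UNDECIDABLE) for u in units)
    return bad / len(units)

# Toy example: one verified fact, one contradicted claim, one
# subjective claim that no evidence can settle.
units = [
    Unit("The Eiffel Tower is in Paris.", SUPPORTED),
    Unit("It was painted green in 2021.", UNSUPPORTED),
    Unit("Most visitors prefer the view at dusk.", UNDECIDABLE),
]
print(round(hallucination_score(units), 3))  # 2 of 3 units not supported
```

Note that undecidable units count against a prompt here, which matches the abstract's observation that higher output subjectivity (more undecidable content) lowers measured factual precision.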