UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation
Abstract: Large language models (LLMs) have become central to contemporary natural language processing and are increasingly deployed across diverse industries. However, these large-scale probabilistic models cannot yet guarantee the quality required for professional content generation: they often produce hallucinated text, which undermines their practical utility in professional settings. To assess the real-world reliability of LLMs in text generation, numerous benchmarks for evaluating hallucination have been developed. Due to cost and time constraints, however, these benchmarks typically rely on constrained generation techniques, such as directed hallucination induction or the deliberate alteration of authentic text to introduce hallucinations. Such techniques do not match the unrestricted text generation demanded by real-world applications. Moreover, a well-established Chinese-language dataset for evaluating hallucination in text generation is currently lacking. We therefore develop UHGEval, an Unconstrained Hallucination Generation Evaluation benchmark, which compiles outputs produced by LLMs under minimal restrictions. We also establish a comprehensive evaluation framework that enables subsequent researchers to conduct scalable and reproducible experiments. Finally, we run extensive experiments on prominent Chinese LLMs and the GPT series models, deriving practical insights into their hallucination behavior.