UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation
Abstract: Large language models (LLMs) have become central to contemporary natural language processing and are increasingly deployed across diverse industries. However, these large-scale probabilistic models cannot yet guarantee the quality required for professional content generation: they often produce hallucinated text, which undermines their practical utility in professional settings. To assess the real-world reliability of LLMs in text generation, numerous benchmarks for evaluating hallucination have been developed. Due to cost and time constraints, however, these benchmarks typically rely on constrained generation techniques, such as directed hallucination induction or the deliberate alteration of authentic text to introduce hallucinations. Such techniques do not match the unrestricted text generation demanded by real-world applications. Moreover, a well-established Chinese-language dataset for evaluating hallucination in text generation is currently lacking. We therefore develop UHGEval, an Unconstrained Hallucination Generation Evaluation benchmark, which compiles outputs produced by LLMs under minimal restrictions. We also establish a comprehensive evaluation framework that enables subsequent researchers to conduct scalable and reproducible experiments. Finally, we run extensive experiments on prominent Chinese LLMs and the GPT series models, deriving practical insights into their hallucination behavior.