
Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge

Published 14 Nov 2024 in cs.AI and cs.CL | (2411.09689v3)

Abstract: LLM hallucination, where unfaithful text is generated, presents a critical challenge for LLMs' practical applications. Current detection methods often resort to external knowledge, LLM fine-tuning, or supervised training with large hallucination-labeled datasets. Moreover, these approaches do not distinguish between different types of hallucinations, which is crucial for enhancing detection performance. To address such limitations, we introduce hallucination probing, a new task that classifies LLM-generated text into three categories: aligned, misaligned, and fabricated. Driven by our novel discovery that perturbing key entities in prompts affects LLM's generation of these three types of text differently, we propose SHINE, a novel hallucination probing method that does not require external knowledge, supervised training, or LLM fine-tuning. SHINE is effective in hallucination probing across three modern LLMs, and achieves state-of-the-art performance in hallucination detection, outperforming seven competing methods across four datasets and four LLMs, underscoring the importance of probing for accurate detection.

Summary

  • The paper introduces a novel categorization of LLM outputs into aligned, misaligned, and fabricated to pinpoint hallucination origins.
  • It proposes a zero-shot Model Knowledge Test that perturbs key embeddings to effectively distinguish fabricated responses with 77.85% accuracy.
  • The study combines MKT with SelfCheckGPT for refined hallucination detection, reducing reliance on external datasets and enhancing LLM reliability.

An Overview of "LLM Hallucination Reasoning with Zero-shot Knowledge Test"

The paper "LLM Hallucination Reasoning with Zero-shot Knowledge Test" by Seongmin Lee et al. addresses the pervasive problem of hallucinations in LLMs. Hallucinations, where LLMs produce incorrect or unsubstantiated outputs, pose serious reliability concerns for practical applications where accuracy is paramount. Traditional detection methods rely on external datasets, LLM fine-tuning, or comparison against trusted external knowledge sources. This paper instead introduces a novel categorization of hallucinations and a zero-shot methodology that assesses an LLM's internal knowledge directly, without external dependencies.

Main Contributions

The paper introduces two key innovations to the field of hallucination detection in LLMs:

  1. Hallucination Reasoning Task: The paper proposes classifying LLM-generated text into three distinct categories: aligned, misaligned, and fabricated. This categorization is essential as it pinpoints potential causes of hallucinations — lack of knowledge in the case of fabricated outputs, and randomness or prior dependency for misaligned texts. By addressing these subdivisions, the work enhances the accuracy of hallucination detection and provides clarity on the root of the errors.
  2. Model Knowledge Test (MKT): A distinctive zero-shot method, MKT, evaluates if the LLM possesses sufficient knowledge to generate a particular output. This method does not require any prior dataset training or fine-tuning of the LLM. Instead, it employs a mechanism to perturb the embedding of key subjects in the text and examines the effects on the LLM's output generation. This effectively distinguishes fabricated from non-fabricated responses.
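The perturbation idea behind MKT can be illustrated with a toy model: add noise to the embedding of the key subject and measure how much the output distribution shifts. The sketch below is a minimal illustration of that intuition, not the authors' implementation; the "language model", the noise scale, and the scoring function are all assumptions.

```python
# Toy sketch of the Model Knowledge Test (MKT) intuition: perturb the
# embedding of the key subject and measure how strongly the model's
# output distribution reacts. All names here are illustrative stand-ins,
# not the paper's actual implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_lm(embeddings, W):
    # A stand-in "language model": mean-pools token embeddings and
    # projects them to a next-token distribution.
    pooled = embeddings.mean(axis=0)
    return softmax(W @ pooled)

def knowledge_score(embeddings, W, subject_idx, sigma=1.0, n_trials=20):
    # Average KL divergence between the clean output distribution and
    # the distribution after Gaussian noise is added to the subject's
    # embedding. A large shift suggests the output genuinely depends on
    # the subject (the model draws on real knowledge); a small shift
    # suggests the output is untethered to it, i.e. possibly fabricated.
    p = toy_lm(embeddings, W)
    kls = []
    for _ in range(n_trials):
        perturbed = embeddings.copy()
        perturbed[subject_idx] += sigma * rng.standard_normal(embeddings.shape[1])
        q = toy_lm(perturbed, W)
        kls.append(np.sum(p * np.log(p / q)))
    return float(np.mean(kls))

d, vocab, seq = 16, 32, 5
W = rng.standard_normal((vocab, d))
emb = rng.standard_normal((seq, d))
score = knowledge_score(emb, W, subject_idx=2)
print(f"MKT-style knowledge score: {score:.3f}")
```

In a real LLM the same probe would perturb the hidden-state embeddings of the subject tokens and compare generation likelihoods before and after, but the zero-shot character is the same: no labels, no fine-tuning, only a sensitivity measurement.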

Methodology

The methodology proceeds in two stages. First, the Model Knowledge Test perturbs the embeddings of key subjects in the text to probe the LLM's internal knowledge; responses the model lacks the knowledge to support are flagged as fabricated. Second, an Alignment Test built on the SelfCheckGPT framework classifies the remaining responses as aligned or misaligned, refining the overall hallucination detection.
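The two-stage decision logic can be sketched as a small classifier over two scores: a knowledge score from the MKT-style probe and a self-consistency score from a SelfCheckGPT-style alignment test. The thresholds, function name, and score conventions below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the two-stage probing pipeline: stage 1 filters out
# fabricated responses via a knowledge score; stage 2 splits the rest
# into aligned vs. misaligned via a self-consistency score. Thresholds
# and names are hypothetical, not the authors' settings.

def classify_response(knowledge_score: float, consistency_score: float,
                      knowledge_threshold: float = 0.5,
                      consistency_threshold: float = 0.5) -> str:
    # Stage 1: Model Knowledge Test. A low knowledge score indicates the
    # model lacks grounding for this response -> fabricated.
    if knowledge_score < knowledge_threshold:
        return "fabricated"
    # Stage 2: Alignment Test (SelfCheckGPT-style). Among responses the
    # model has knowledge for, low consistency across resampled
    # generations -> misaligned; high consistency -> aligned.
    if consistency_score < consistency_threshold:
        return "misaligned"
    return "aligned"

print(classify_response(0.2, 0.9))  # fabricated
print(classify_response(0.8, 0.3))  # misaligned
print(classify_response(0.8, 0.9))  # aligned
```

Ordering matters here: running the knowledge test first means the alignment test never has to reason about responses the model could not have grounded in the first place.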

Experimental Validation

Experimental results demonstrate the robustness of the proposed approach. The method was evaluated on the newly constructed NEC and Biography datasets alongside existing benchmarks. The experiments show that the combination of MKT and SelfCheckGPT outperforms existing zero-shot methods such as Hallucination Score and Semantic Entropy; for instance, MKT correctly identifies 77.85% of fabricated instances in the NEC dataset.

Implications and Future Directions

The paper's contributions could pave the way for more reliable LLMs by making hallucination detection simpler to integrate into practical applications. By offering a finer-grained understanding of the different types of hallucinations, the paper lays a foundation for future LLMs that can self-assess the fidelity of their outputs before dissemination. As the authors suggest, future research might focus on making the Alignment Test less computationally demanding and on validating the approach on broader datasets.

In conclusion, the study represents a significant step forward in hallucination detection for LLMs, grounded in a nuanced classification of LLM-generated text. This insight should benefit both the practical deployment and the theoretical investigation of LLM reliability and accuracy.
