- The paper demonstrates that hallucinations arise when LLMs generate content about entities they do not recognize, using sparse autoencoders to uncover self-awareness.
- It applies activation steering and patching techniques to causally link latent directions with entity recognition, modulating the model's factual recall behavior.
- The study identifies uncertainty latents that predict incorrect responses, offering insights to enhance model reliability and reduce hallucinations.
Knowledge Awareness and Hallucinations in LLMs
This essay surveys a study of hallucinations in LLMs and the mechanisms that underlie them. The focus is on entity recognition as a key part of these mechanisms, using sparse autoencoders (SAEs) as an interpretability tool to uncover self-knowledge within these models. The study specifically examines the causal relationship between entity recognition directions and model behaviors such as knowledge refusal and hallucination.
Entity Recognition and Mechanisms
The study identifies that hallucinations often arise when models attempt to generate information about entities they do not actually recognize. Using sparse autoencoders, the authors discover linear directions in the model's residual stream that signal whether the model can recall factual information about an entity, essentially encapsulating a form of self-awareness about its own capabilities.
Figure 1: We identify SAE latents in the final token of the entity residual stream (i.e., hidden state) that activate almost exclusively on either unknown or known entities.
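A latent that "activates almost exclusively" on one class can be found by comparing firing frequencies across the two entity sets. The sketch below, using synthetic SAE activations, computes a simple separation score per latent (the exact scoring formula in the paper may differ; this is an illustrative stand-in):

```python
import numpy as np

def separation_scores(acts_known, acts_unknown, threshold=0.0):
    """For each SAE latent, the fraction of known-entity prompts on which it
    fires minus the fraction of unknown-entity prompts (sign convention:
    positive = 'known' latent, negative = 'unknown' latent)."""
    freq_known = (acts_known > threshold).mean(axis=0)     # (n_latents,)
    freq_unknown = (acts_unknown > threshold).mean(axis=0)
    return freq_known - freq_unknown

rng = np.random.default_rng(0)
n_latents = 8
# Synthetic binary SAE activations: latent 3 fires mostly on known entities,
# latent 5 mostly on unknown ones; the rest fire at a 5% noise rate.
known = rng.random((100, n_latents)) < 0.05
unknown = rng.random((100, n_latents)) < 0.05
known[:, 3] = rng.random(100) < 0.9
unknown[:, 5] = rng.random(100) < 0.9

scores = separation_scores(known.astype(float), unknown.astype(float))
print(int(np.argmax(scores)), int(np.argmin(scores)))  # candidate known/unknown latents
```

Ranking latents by this score is how the top known/unknown candidates in Figure 2 would be selected.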
The application of SAEs demonstrates that these entity recognition directions generalize across different entity types, such as movies, cities, players, and songs. By manipulating these directions, researchers can influence the model's behavior to either hallucinate or refuse to answer questions about specific entities, showcasing a causal effect on knowledge refusal behavior.
Methodology
To conduct this study, entities across various domains were classified into 'known' or 'unknown' categories based on their factual recall accuracy. Questions about these entities were used as prompts, and the model's responses were analyzed to understand the influence of entity recognition directions on the output.
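The labeling step can be sketched as follows. The helper names, question templates, and correctness cutoff below are assumptions for illustration, with the model mocked by a lookup table:

```python
def label_entities(entities, questions, model_answer, gold, min_correct=2):
    """Label an entity 'known' if the model answers at least `min_correct`
    attribute questions about it correctly, else 'unknown'.
    (The exact cutoff used in the study is an assumption here.)"""
    labels = {}
    for e in entities:
        correct = sum(
            model_answer(q.format(entity=e)) == gold[(e, q)]
            for q in questions
        )
        labels[e] = "known" if correct >= min_correct else "unknown"
    return labels

questions = ["Who directed {entity}?", "What year was {entity} released?"]
gold = {
    ("Alien", questions[0]): "Ridley Scott",
    ("Alien", questions[1]): "1979",
    ("Zyxwv", questions[0]): "Nobody",
    ("Zyxwv", questions[1]): "N/A",
}
# Mock model: recalls 'Alien' correctly, guesses wrong on the made-up title.
answers = {q.format(entity="Alien"): gold[("Alien", q)] for q in questions}
mock = lambda prompt: answers.get(prompt, "I think it was 2005")

labels = label_entities(["Alien", "Zyxwv"], questions, mock, gold)
print(labels)
```

In the real pipeline, `model_answer` would query the LLM being studied and `gold` would come from a knowledge base.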
Figure 2: Layerwise evolution of the Top 5 latents in Gemma 2 2B SAEs, showing known (left) and unknown (right) latent separation scores.
Activation steering techniques were then applied: the activation values of specific latent directions were altered, producing observable changes in model behavior. Steering coefficients were tuned on a validation set, allowing systematic evaluation of the model's performance when answering queries about unknown entities.
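The two steps above, adding a scaled latent direction to the residual stream and grid-searching the coefficient on a validation metric, can be sketched with numpy. The application point, coefficient grid, and toy metric are all assumptions:

```python
import numpy as np

def steer(resid, direction, coeff):
    """Add coeff times a unit-normalized latent direction to residual-stream
    activations (the token positions steered are an assumption)."""
    unit = direction / np.linalg.norm(direction)
    return resid + coeff * unit

def pick_coefficient(resid, direction, metric, coeffs):
    """Grid-search steering coefficients on validation activations, keeping
    the one that maximizes the supplied behavioral metric."""
    return max(coeffs, key=lambda c: metric(steer(resid, direction, c)))

d_model = 16
direction = np.eye(d_model)[0]      # toy 'unknown entity' latent direction
resid = np.zeros((4, d_model))      # toy residual-stream states (batch of 4)

# Toy validation metric: refusal behavior peaks when the mean projection onto
# the direction reaches a target magnitude (stand-in for a refusal-rate score).
target = 4.6
metric = lambda h: -((h @ direction).mean() - target) ** 2

best = pick_coefficient(resid, direction, metric, coeffs=[0, 2, 4, 6, 8])
print(best)
```

With a real model, `metric` would run steered generations and score refusal rate versus fluency rather than a projection target.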
Mechanistic Insights
The mechanistic analysis reveals that entity recognition directions significantly influence the attention paid to entity tokens during factual recall processes. Steering with these directions modulates attention mechanisms, such as attribute extraction heads, thereby altering the model's propensity to hallucinate or perform knowledge refusal.
Figure 3: (a,b) Activation patching results on the residual streams and output attention of heads indicate that attention to entities is greater for known entities.
By employing activation patching, researchers demonstrated that the attention to entities increases when the model recognizes an entity, which in turn enhances the model's ability to extract attributes accurately.
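The core move in activation patching is copying activations from a "clean" (known-entity) forward pass into a "corrupted" (unknown-entity) one at chosen positions, then measuring the behavioral change. A minimal sketch, with the layer and entity-token positions assumed:

```python
import numpy as np

def patch(run_corrupt, run_clean, positions):
    """Activation patching: overwrite the corrupted run's activations with the
    clean run's at the given token positions, leaving the rest untouched."""
    patched = run_corrupt.copy()
    patched[positions] = run_clean[positions]
    return patched

rng = np.random.default_rng(1)
seq_len, d_model = 6, 8
clean = rng.normal(size=(seq_len, d_model))    # known-entity forward pass
corrupt = rng.normal(size=(seq_len, d_model))  # unknown-entity forward pass

entity_positions = [2, 3]                      # entity token span (assumed)
patched = patch(corrupt, clean, entity_positions)

# Patched positions now match the clean run; all other positions are unchanged.
print(np.allclose(patched[2:4], clean[2:4]))
```

In practice the patched activations are injected back into the forward pass (e.g. via hooks) so that downstream attention to the entity tokens can be compared across the two conditions.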
Uncertainty and Predictions
The study further uncovers that certain latents, which can be considered representations of uncertainty, are predictive of a model's incorrect responses. These uncertainty directions discriminate between correct and mistaken answers, serving as potential indicators of where the model lacks confidence or possesses insufficient knowledge.
Figure 4: Left: Activation values of the Gemma 2B IT 'unknown' latent on correct and incorrect responses. Right: Tokens whose logits this latent most increases.
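How well a latent's activation separates incorrect from correct responses can be quantified with a rank-based AUROC. The sketch below uses synthetic activation values (the evaluation metric is an assumption; the paper's exact analysis may differ):

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """Rank-based AUROC: probability that a positive example (incorrect
    answer) receives a higher score than a negative one; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += (p > n) + 0.5 * (p == n)
    return wins / (len(scores_pos) * len(scores_neg))

rng = np.random.default_rng(2)
# Synthetic 'uncertainty' latent activations: higher on incorrect responses.
incorrect = rng.normal(loc=2.0, size=200)
correct = rng.normal(loc=0.0, size=200)

score = auroc(incorrect, correct)
print(round(score, 2))
```

An AUROC well above 0.5 would indicate the latent carries real signal about impending errors, matching the separation shown in Figure 4 (left).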
Conclusion
This research enriches our understanding of how LLMs encode self-knowledge about entities and how the corresponding directions can be manipulated to alter factual recall behavior and mitigate hallucinations. The study identifies concrete mechanisms that underpin model predictability and reliability, and its findings may inform model development strategies that prioritize factual accuracy and reduce the frequency of hallucinations. Further work could refine these techniques to fine-tune LLM behaviors in complex real-world applications.