Do Androids Know They're Only Dreaming of Electric Sheep?
Abstract: We design probes trained on the internal representations of a transformer LLM to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate span-level hallucinations on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained, and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. In this setting, probing proves accurate for hallucination evaluation, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par with or better than the expert annotator on two of three generation tasks. Overall, we find that probing is a feasible and efficient alternative for LLM hallucination evaluation when model states are available.
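The core technique the abstract describes, training a lightweight classifier ("probe") on a transformer layer's hidden states to predict hallucination labels, can be illustrated with a minimal toy sketch. This is not the paper's implementation: the random vectors below merely stand in for per-token activations extracted at some layer, and the dimensions, learning rate, and label-generation process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-token hidden states from one transformer layer
# (in practice these would be extracted activations, e.g. at layer 4).
d_model, n_tokens = 64, 2000
X = rng.normal(size=(n_tokens, d_model))

# Synthetic span-level hallucination labels: assume a linear direction in
# activation space separates hallucinated from grounded tokens, plus noise.
w_true = rng.normal(size=d_model)
y = (X @ w_true + 0.5 * rng.normal(size=n_tokens) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Linear probe: logistic regression fit by full-batch gradient descent.
w, b, lr = np.zeros(d_model), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)          # predicted hallucination probability
    w -= lr * (X.T @ (p - y)) / n_tokens
    b -= lr * (p - y).mean()

acc = float(((sigmoid(X @ w + b) > 0.5) == y).mean())
print(f"probe train accuracy: {acc:.2f}")
```

Because the probe is just a linear map over cached activations, it is cheap to train and evaluate at every layer, which is what makes the layer-wise comparison in the abstract (95% of peak performance by layer 4) practical.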