Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
Abstract: Hallucinations in LLMs refer to the phenomenon of LLMs producing responses that are coherent yet factually inaccurate. This issue undermines the effectiveness of LLMs in practical applications, necessitating research into detecting and mitigating them. Previous studies have mainly concentrated on post-processing techniques for hallucination detection, which tend to be computationally intensive and limited in effectiveness because they are separate from the LLM's inference process. To overcome these limitations, we introduce MIND, an unsupervised training framework that leverages the internal states of LLMs for real-time hallucination detection without requiring manual annotations. We also present HELM, a new benchmark for evaluating hallucination detection across multiple LLMs, featuring diverse LLM outputs and the internal states of LLMs recorded during inference. Our experiments demonstrate that MIND outperforms existing state-of-the-art methods in hallucination detection.
Explain it Like I'm 14
Overview
This paper tackles a problem with AI chatbots called hallucination: the AI gives an answer that sounds confident and fluent but is factually wrong. The authors introduce two things:
- MIND: a way to spot hallucinations while the AI is still writing its answer, using signals inside the AI itself.
- HELM: a new test set that helps researchers measure how good different methods are at catching hallucinations across several AI models.
What questions were the researchers trying to answer?
- Can we detect when an AI is about to say something wrong by watching its “inner signals,” not just its words?
- Can we train a hallucination detector without needing lots of human-labeled examples?
- Can we make detection fast enough to happen in real time (as the AI is typing)?
- How can we fairly test and compare hallucination detectors across different AI models?
How did they do it? (Explained simply)
Think of an AI model like a student. When it answers a question, it doesn't just write words (the answer you see); it also has a hidden "thought process" (its internal signals) that helps decide each word. The authors use those hidden signals to spot when the AI might be going off track.
Here’s the approach in everyday terms:
- Training without human labels (unsupervised):
- They use Wikipedia articles. For each article, they cut it off partway and ask the AI to continue the text.
- They pick an important name or thing (an “entity”) that appears after the first sentence in the original article.
- If the AI's continuation handles that specific entity the way the real article does, they label it "not a hallucination."
- If the continuation gets the entity wrong or leaves it out, they label it a "hallucination."
- No humans are needed to label each example; the rules do it automatically.
- Watching the AI’s inner signals:
- As the AI generates words, it produces hidden numbers that represent what it “thinks” is going on. You can imagine these like a heartbeat monitor or dashboard gauges for the AI’s brain.
- The authors focus on a compact summary of those signals (especially from the last word the AI produced).
- They train a small, fast classifier (a simple neural network called an MLP) to predict: “Does this look like the AI is hallucinating right now?”
- Real-time detection:
- Because the classifier is small and uses signals the AI already produces, it can run while the AI is typing.
- If the detector thinks a hallucination is likely, the system can trigger a safety step, like looking up facts (retrieval-augmented generation) before continuing.
- A new test set (HELM) to compare methods:
- They built a benchmark with outputs from six different AI models (including LLaMA, GPT-J, Falcon, and OPT).
- For each model’s output, they collected human labels (is this sentence a hallucination?) and saved the models’ inner signals while generating those outputs.
- This lets researchers test different detection methods fairly, both at the sentence level and the whole-passage level.
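The auto-labeling step described above can be sketched in a few lines of Python. This is a deliberate simplification that assumes a plain case-insensitive substring check; the paper's actual rule-based matching is more involved, and the example text is hypothetical:

```python
def auto_label(generated: str, target_entity: str) -> int:
    """Label a model continuation without human help.

    Returns 0 ("not a hallucination") if the continuation mentions the
    entity the real Wikipedia article continues with, else 1. This
    case-insensitive substring check is a simplified stand-in for the
    paper's rule-based matching.
    """
    return 0 if target_entity.lower() in generated.lower() else 1

# Hypothetical example: the real article's next sentence is about "Marie Curie".
print(auto_label("Marie Curie won the Nobel Prize in 1903.", "Marie Curie"))  # 0
print(auto_label("The prize that year went to someone else.", "Marie Curie"))  # 1
```

In the real pipeline, `generated` would be the LLM's continuation of the truncated article, and thousands of such auto-labeled pairs become the detector's training set.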
Key terms in plain language:
- Internal states/embeddings: the AI’s hidden “notes to itself” that help it choose the next word.
- Unsupervised: learning without hand-made answer keys; the system creates training labels by following rules.
- Classifier: a small program that looks at data and decides which category it belongs to (here: hallucination or not).
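To make the "classifier on inner signals" idea concrete, here is a toy sketch in plain Python: a one-hidden-layer MLP that maps a stand-in last-token hidden state to a hallucination probability. The sizes and weights are made up for illustration; real LLM hidden states have thousands of dimensions, and MIND trains its classifier's weights on the auto-labeled data rather than using random ones:

```python
import math
import random

def mlp_score(hidden_state, w1, b1, w2, b2):
    """One hidden layer (ReLU) followed by a sigmoid: P(hallucination)."""
    h = [max(0.0, sum(x * w for x, w in zip(hidden_state, row)) + b)
         for row, b in zip(w1, b1)]
    z = sum(hi * w for hi, w in zip(h, w2)) + b2
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
dim, hidden = 8, 4  # toy sizes; real hidden states are far larger
w1 = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
b1 = [0.0] * hidden
w2 = [random.uniform(-1, 1) for _ in range(hidden)]
b2 = 0.0

# Stand-in for the hidden state of the last generated token.
state = [random.uniform(-1, 1) for _ in range(dim)]
p = mlp_score(state, w1, b1, w2, b2)
assert 0.0 < p < 1.0  # a sigmoid always lands strictly between 0 and 1
print("flag for fact-checking" if p > 0.5 else "keep generating")
```

Because the model already computes these hidden states while generating, scoring them with such a small network adds almost no delay; if the score crosses a threshold mid-generation, the system could pause and run a safety step such as retrieval before continuing.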
What did they find, and why is it important?
The authors report several main results:
- Their method (MIND) usually beats other popular detectors at spotting hallucinations across different AI models.
- It works quickly in real time, adding only a tiny amount of delay while the AI is generating text. Some other methods are much slower because they need to ask the AI many times or run extra heavy checks.
- You don’t need a huge amount of training data. A few thousand auto-labeled examples from Wikipedia were enough to get strong performance.
- Using training data tailored to the specific AI model (generated by that same model) helps the detector do better. In other words, a detector trained on Model A’s signals works best on Model A.
Why this matters:
- Safer AI: Catching hallucinations early means fewer wrong answers get sent to users.
- Faster and cheaper: Because MIND uses signals the AI already creates, it’s lightweight and practical to deploy.
- More general: The approach works across multiple LLMs, not just one.
What could this change in the future?
If AI systems can detect their own mistakes as they write, they can:
- Pause and verify facts before continuing.
- Use the web or a trusted database to confirm claims on the fly.
- Warn users when confidence is low.
The HELM benchmark also helps the research community:
- It gives a shared, fair way to measure detection methods across different models.
- It includes not just the text but also the models’ inner signals, which opens the door to new, smarter detectors.
The authors note one limitation: they only used the AI’s internal signals. In the future, combining those signals with the actual text content (what was said) might improve accuracy even more.
In short, this paper shows a practical, fast way to spot AI hallucinations as they happen and offers a new benchmark to push this research forward.