The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
Abstract: Hallucination is a persistent challenge in LLMs: even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept, knowledge overshadowing, in which a model's dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithm of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Building on the overshadowing effect, we propose a new decoding strategy, CoDA, to mitigate hallucinations, which notably enhances model factuality on Overshadow (27.9%), MemoTrap (13.1%), and NQ-Swap (18.3%). Our findings not only deepen our understanding of the mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable LLMs.
Explain it Like I'm 14
What is this paper about?
This paper tries to explain why LLMs sometimes “hallucinate” — that is, say things that sound confident but are factually wrong — even when they were trained on clean, correct data. The authors introduce a simple idea called knowledge overshadowing: when one piece of knowledge is more common or stronger in the model, it can drown out a less common but relevant piece, leading the model to mix up facts.
Example: If you ask “Who is a famous singer in North Korea?”, the word “North Korea” may remind the model strongly of “Kim Jong Un” (a very popular association), and that can overshadow the word “singer,” causing the model to answer incorrectly.
What questions did the researchers ask?
They focus on a few clear questions:
- What makes overshadowing happen in LLMs?
- Can we predict when hallucinations are likely before the model even answers (or before it’s trained)?
- Why, in theory, does overshadowing occur?
- Can we reduce hallucinations caused by overshadowing without retraining the model?
How did they study it?
To keep things clear and fair, the authors built controlled “mini-worlds” for the model to learn from, using synthetic (made-up but clean) data. This lets them change one factor at a time and see what happens, like a science experiment.
Here are the key ideas and steps, explained in everyday terms:
- Knowledge overshadowing: Think of two facts that share some words in a sentence, like a “loud” fact and a “quiet” fact. If the loud one is much more common, longer, or more dominant in the model, it can drown out the quiet one when the model generates an answer.
- Controlled experiments: They trained models from scratch on simple, clean sentences built from tokens (like LEGO bricks). By adjusting the “loudness” of one fact versus another, they measured how often the model mixed them up.
- Three factors that matter most: The team looked at how hallucinations change with:
- Knowledge popularity: How often one fact appears compared to another.
- Knowledge length: How long the shared context is compared to the short, specific clue that should guide the answer.
- Model size: How big the model is.
- A simple rule (the “log-linear law”): They found that hallucinations increase in a predictable way: if you plot hallucination rate against the logarithm of each factor, you get a straight line. In plain words, every time you scale up popularity, length, or model size by the same ratio (like doubling), the hallucination risk goes up by a steady step.
- Tested beyond mini-worlds: They fine-tuned real models on varied tasks (like logic, math, and time/location facts) and saw that the same rule holds fairly well. They could even predict hallucination rates ahead of time, with about 8% average error.
- A practical fix (CoDA): They created a decoding method called CoDA that aims to stop overshadowing during answering, without retraining the model. CoDA works in two steps:
  - It spots which word(s) in the prompt are getting drowned out, by temporarily masking words and checking how the model's next-word guesses change.
  - It adjusts the model's decoding to "turn down" the overly dominant influence and "turn up" the overshadowed clue, so the final answer respects the important but quiet parts of the question.
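The log-linear law above can be sketched as a tiny predictive model. This is a hedged illustration only: the coefficients `a`, `b`, `c`, `d` below are made-up placeholders, not values fitted in the paper.

```python
import math

# Hedged sketch of the log-linear law: predicted hallucination rate grows
# linearly in the log of each factor. The coefficients (a, b, c, d) are
# hypothetical placeholders, NOT values reported in the paper.
def predicted_hallucination_rate(popularity_ratio, length_ratio, model_size,
                                 a=0.10, b=0.10, c=0.01, d=0.0):
    rate = (a * math.log(popularity_ratio)   # popular vs. rare knowledge
            + b * math.log(length_ratio)     # long shared context vs. short clue
            + c * math.log(model_size)       # number of parameters
            + d)
    return min(max(rate, 0.0), 1.0)          # clamp to a valid rate

# Doubling the popularity imbalance adds a fixed step to the predicted risk:
r1 = predicted_hallucination_rate(2, 1, 1e8)
r2 = predicted_hallucination_rate(4, 1, 1e8)
```

Because the relationship is linear in the logarithm, scaling any factor by the same ratio (say, doubling) moves the predicted rate by the same fixed step, which is exactly the "straight line on a log axis" behavior described above.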
What did they find, and why is it important?
Main findings:
- Hallucinations can happen even with perfect training data: It’s not just about bad or noisy data. The model’s internal balancing of different facts can cause errors.
- A predictable pattern (log-linear law): Hallucinations rise steadily as:
- More popular knowledge overshadows less popular knowledge.
- Longer shared context overshadows a short, crucial keyword.
- Bigger models compress information more and can blur fine details, increasing overshadowing.
- You can predict risk before training or answering: Because the relationship is predictable, you can estimate hallucination rates ahead of time by looking at data popularity, prompt structure, and model size.
- You can reduce overshadowing at decoding time: CoDA improved factual accuracy on three benchmarks, with sizable gains:
- Overshadow dataset: +27.9%
- MemoTrap: +13.1%
- NQ-Swap: +18.3%
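The masking-and-rebalancing idea behind CoDA can be illustrated with a toy contrastive-decoding step. This is a simplified sketch, not the paper's actual algorithm: the two-token vocabulary, the logit values, and the single `alpha` knob are all assumptions for illustration.

```python
import math

def softmax(logits):
    # Convert logits to probabilities (numerically stable form).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy rebalancing step in the spirit of CoDA (a simplified assumption, not
# the paper's exact method): compare next-token logits from the full prompt
# with logits from the prompt where the overshadowed cue (e.g. "singer")
# is masked, then amplify tokens that depended on that cue.
def rebalance(logits_full, logits_cue_masked, alpha=1.0):
    adjusted = [f + alpha * (f - m)              # boost cue-dependent tokens
                for f, m in zip(logits_full, logits_cue_masked)]
    return softmax(adjusted)

# Hypothetical vocabulary: [dominant answer, overshadowed answer].
logits_full = [3.0, 2.5]        # dominant answer still narrowly wins
logits_cue_masked = [3.0, 0.5]  # overshadowed answer collapses without the cue
probs = rebalance(logits_full, logits_cue_masked)
```

In this toy setup, the dominant answer wins under plain decoding, but the difference between the two runs reveals that the second token relied on the masked cue, so the rebalanced distribution favors it instead.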
Why it matters:
- This shifts the conversation from only “detecting hallucinations after they happen” to “predicting and preventing them before they happen.”
- The insight helps researchers and engineers design better training data, write better prompts, choose model sizes wisely, and use smarter decoding to avoid mistakes.
How does this fit with theory?
The authors link overshadowing to how models “generalize” — go from what they memorized to making good guesses in new situations. When some knowledge is much more common or represented by longer context, the model gets very good at that piece, which can make it accidentally push aside less common but relevant details. In short, the same forces that improve general performance can also make overshadowing — and thus hallucinations — more likely.
What are the broader implications?
- Better planning: Teams can estimate hallucination risk before training or deploying a model by checking how balanced their data is, how prompts are structured, and how large the model is.
- Safer prompts and data: Keep crucial clues from being tiny needles in big haystacks. If an important keyword is too short or buried in long text, restructure the prompt so it stands out.
- Smarter inference: Use decoding methods like CoDA to balance influences on-the-fly and protect important but less dominant information.
- More predictable AI: Understanding overshadowing helps build LLMs that are both powerful and reliable, which is key for education, healthcare, law, and other high-stakes uses.
In short, the paper shows that “louder” knowledge can drown out “quieter” knowledge in LLMs, causing confident but wrong answers. It offers a simple, testable rule to predict when that will happen and a practical method to reduce it — moving us toward AI that’s not just smart, but also trustworthy.