
Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

Published 23 May 2025 in cs.CL, cs.AI, and cs.LG | (2505.18148v1)

Abstract: LLMs face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.

Summary

  • The paper finds that smaller gold contexts significantly reduce LLM accuracy in long-context tasks compared to larger gold contexts.
  • Smaller gold contexts increase LLM sensitivity to the position of relevant information, amplifying primacy bias, whereby information placed early in the context window helps more.
  • The paper shows gold context size matters even with distractors, implying real-world LLM systems need strategies to handle varying lengths of relevant information effectively.

Quantifying the Influence of Gold Context Size in Long-Context Problems

The paper "Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find" offers a comprehensive analysis of how the size of the relevant information, termed "gold context," influences the performance of LLMs on complex long-context tasks. This investigation addresses a critical gap in NLP research, which has predominantly focused on positional bias and distractor quantity without adequately considering the size of the relevant span within the context.

Core Findings

  1. Impact of Gold Context Size: The primary finding is a strong correlation between gold context size and LLM performance on long-context question answering tasks. Smaller gold contexts lead to notably reduced accuracy, complicating the retrieval and assimilation of relevant information by the models. This outcome holds consistently across different domains, including general knowledge, biomedical reasoning, and mathematical reasoning benchmarks.
  2. Positional Sensitivity and Primacy Bias: Smaller gold contexts amplify the positional sensitivity of LLMs. Models exhibit a primacy bias—enhanced performance when relevant information is positioned early in the context window. This bias diminishes with larger gold contexts, which are less susceptible to the detrimental effects of misplaced relevant content.
  3. Robustness in Noisy Environments: Increasing the volume of distractor information does not mitigate the performance gap between small and large gold contexts, underscoring the robustness of gold context size as a pivotal factor influencing retrieval effectiveness in noisy settings.
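The kind of probe described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual evaluation harness: the `build_haystack` helper and the fractional `position` scheme are assumptions made for clarity. The idea is to hold the distractor pool fixed while independently varying the gold passage's size and its placement in the assembled context.

```python
def build_haystack(gold: str, distractors: list[str], position: float) -> str:
    """Assemble a long context by inserting the gold passage among distractors.

    position is a fraction in [0, 1]: 0.0 places the gold first (probing
    primacy bias), 1.0 places it last. Gold "size" here is just the length
    of the gold string.
    """
    idx = round(position * len(distractors))
    segments = distractors[:idx] + [gold] + distractors[idx:]
    return "\n\n".join(segments)

# Hypothetical probe: vary gold size and position with a fixed distractor pool.
distractors = [f"Irrelevant passage {i}." for i in range(10)]
small_gold = "Fact: X."            # short gold context
large_gold = "Fact: X. " * 20      # same fact inside a larger gold context

early = build_haystack(small_gold, distractors, position=0.0)
late = build_haystack(small_gold, distractors, position=1.0)
```

Each (gold size, position) cell would then be scored by asking the model a question answerable only from the gold passage; the paper's finding is that the small-gold cells degrade most, especially at late positions.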

Theoretical and Practical Implications

From a theoretical perspective, the paper highlights the intrinsic sensitivity of LLMs to the structural composition of input data, extending the discourse beyond context length and positional bias. This insight underscores how difficult it is for models to prioritize and synthesize minimal yet essential information amid large volumes of distracting content—a significant challenge in NLP.

Practically, the findings indicate that real-world implementations of NLP systems must meticulously manage input structure, particularly in agentic systems where context variability is unavoidable. There is an opportunity to develop strategies for context structuring and evidence expansion that can better accommodate the varying length of relevant information streams, thereby enhancing model reliability and performance.
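One evidence-expansion strategy consistent with the findings is to enlarge a retrieved gold snippet with its surrounding sentences before prompting, so the relevant span is no longer a tiny needle. The sketch below is a hypothetical illustration of that idea, not a method from the paper; the sentence-window approach and the `expand_evidence` helper are assumptions.

```python
def expand_evidence(doc_sents: list[str], hit_idx: int, window: int) -> str:
    """Expand a retrieved sentence into a larger gold span by including
    up to `window` neighboring sentences on each side, clamped to the
    document boundaries."""
    lo = max(0, hit_idx - window)
    hi = min(len(doc_sents), hit_idx + window + 1)
    return " ".join(doc_sents[lo:hi])

doc = ["S0.", "S1.", "S2.", "S3.", "S4."]
print(expand_evidence(doc, hit_idx=2, window=1))  # -> "S1. S2. S3."
```

The trade-off is the usual one in retrieval-augmented pipelines: a wider window enlarges the gold context (which the paper suggests helps) but also consumes context budget that could otherwise hold additional evidence.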

Future Directions

Going forward, this study suggests avenues for advancing LLM architectures to increase resilience against context-size variability. The exploration of design modifications or algorithms that can optimize aggregation despite significant dispersion of relevant data is ripe with potential. Continuing this line of research will be critical to developing LLM-driven tools across diverse sectors, improving their ability to handle expansive, heterogeneous information environments reliably and efficiently.
