Needle-in-a-Haystack Retrieval
- Needle-in-a-Haystack (NIAH) is a retrieval task that isolates one relevant passage from a large, diverse set of multilingual and multi-script distractors.
- The MLNeedle benchmark evaluates retrieval performance in both monolingual and cross-lingual settings, systematically varying context length and needle position.
- Empirical findings reveal significant performance drops with increased context, heightened language sensitivity, and a pronounced bias against mid-context needles.
The needle-in-a-haystack (NIAH) paradigm formalizes the challenge of extracting a single relevant item—the “needle”—buried within a large collection of irrelevant or distracting information—the “haystack.” In multilingual LLMs, this retrieval task becomes particularly complex due to language diversity, script variance, and varying context lengths. The MLNeedle benchmark offers a systematic evaluation of multilingual long-context retrieval, measuring how state-of-the-art LLMs handle both monolingual and cross-lingual scenarios. Central findings include severe declines in performance as context length increases, pronounced language sensitivity, and strong positional biases, especially in cross-lingual settings (Hengle et al., 2024).
1. Formalization of Multilingual Needle-in-a-Haystack Retrieval
Formally, the NIAH task considers an ordered context $C = (p_1, p_2, \ldots, p_N)$, where each $p_i$ is a passage in one of several possible languages. A query $q$ (posed in English) has an associated target passage $p_{k^*} \in C$ such that $a \subseteq p_{k^*}$ (i.e., $p_{k^*}$ contains or entails the answer $a$). The goal is to select the correct index $\hat{k} = \arg\max_i \Pr(p_i \text{ answers } q \mid C)$, with $\hat{k} = k^*$ ideally holding for all examples.
In multilingual NIAH, both the needle and distractor passages may vary in language and script. This setting tests not only token-level recall but also the model’s semantic and script-agnostic reasoning capabilities.
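The formalization above can be sketched in code. This is a minimal illustration, not the benchmark's implementation: `NIAHInstance`, `retrieve_needle`, and the keyword-overlap scorer are hypothetical names, and the toy score function stands in for a real model's retrieval behaviour.

```python
from dataclasses import dataclass

@dataclass
class NIAHInstance:
    """One needle-in-a-haystack example: an ordered context
    C = (p_1, ..., p_N), an English query q, and the gold index k*."""
    passages: list[str]   # the haystack, possibly mixed-language
    query: str            # q, posed in English
    needle_index: int     # k*, position of the answer-bearing passage

def retrieve_needle(instance: NIAHInstance, score) -> int:
    """Return the index of the passage maximizing a relevance score
    s(q, p_i); `score` is a stand-in for the model's behaviour."""
    scores = [score(instance.query, p) for p in instance.passages]
    return max(range(len(scores)), key=scores.__getitem__)

def overlap(q: str, p: str) -> int:
    """Toy keyword-overlap scorer (an assumption for illustration)."""
    return len(set(q.lower().split()) & set(p.lower().split()))

inst = NIAHInstance(
    passages=["Der Rhein ist ein Fluss.",          # German distractor
              "The Danube flows through Vienna.",  # the needle
              "El Ebro está en España."],          # Spanish distractor
    query="Which river flows through Vienna?",
    needle_index=1,
)
assert retrieve_needle(inst, overlap) == inst.needle_index
```

Note that a lexical scorer like this fails precisely in the cross-lingual case the benchmark targets: when the needle's language differs from the query's, no surface overlap exists and semantic matching is required.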
2. MLNeedle Benchmark Design
MLNeedle instantiates multilingual NIAH in a controlled question answering (QA) setup leveraging the MLQA dataset for diverse language support:
- Monolingual vs. Cross-lingual
- Monolingual: All passages and the query are in the same language.
- Cross-lingual: The query is always in English, but needle and distractors may be in other languages.
- Languages: Seven typologically diverse languages are used:
- Indo-European, Latin-script: English (en), German (de), Spanish (es)
  - Mid-resource: Hindi (hi, Devanagari script), Vietnamese (vi, Latin script)
- High-variance scripts, low-resource: Arabic (ar), Simplified Chinese (zh)
- Context Length & Needle Position:
- Context lengths: ≈4K to ≈32K tokens (number of distractors varies from ≈10 to ≈50).
  - Needle inserted at the beginning ($k = 1$), middle ($k \approx N/2$), or end ($k = N$) of the context.
- Distractor Construction:
  - For each query, distractor passages are selected by cosine similarity of multilingual SBERT embeddings over mMARCO/Wikipedia passages, with a filter ensuring none contain the answer. They are ordered by decreasing relevance, and the needle is inserted at the designated position.
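The distractor-construction step can be sketched as follows. This is a simplified stand-in: the real pipeline embeds passages with multilingual SBERT, whereas here `build_context` operates on precomputed toy vectors, and all names are illustrative.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_context(query_vec, needle, candidates, needle_pos, n_distractors):
    """Rank answer-free candidate passages by decreasing similarity
    to the query, keep the top n_distractors, then splice the needle
    in at the requested position (0 = beginning, n = end)."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c["vec"]),
                    reverse=True)
    context = [c["text"] for c in ranked[:n_distractors]]
    context.insert(needle_pos, needle)
    return context

# Toy 3-dimensional stand-in embeddings; MLNeedle uses multilingual
# SBERT vectors over real passages.
rng = np.random.default_rng(0)
query_vec = np.array([1.0, 0.0, 0.0])
candidates = [{"text": f"distractor-{i}", "vec": rng.normal(size=3)}
              for i in range(5)]
ctx = build_context(query_vec, "NEEDLE", candidates,
                    needle_pos=2, n_distractors=4)
assert len(ctx) == 5 and ctx[2] == "NEEDLE"
```

Varying `needle_pos` and `n_distractors` corresponds directly to the benchmark's two controlled axes: needle position and context length.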
3. Evaluation Metrics
MLNeedle evaluates models on two principal metrics over a test set $T$ of examples:
- Exact Accuracy (Retrieval Accuracy):
  $$\text{Acc}_{\text{exact}} = \frac{1}{|T|} \sum_{(q,\, a) \in T} \mathbb{1}\!\left[\hat{a} = a\right]$$
  where $\hat{a}$ is the translated model prediction and $a$ the ground-truth answer in English.
- Existence Accuracy:
  $$\text{Acc}_{\text{exist}} = \frac{1}{|T|} \sum_{(q,\, a) \in T} \mathbb{1}\!\left[\hat{e} = 1\right]$$
  where $\hat{e} \in \{0, 1\}$ reflects the model's binary prediction of the presence of an answer.
Evaluation involves English translation of outputs and fixed generative settings (temperature = 0.7, top-k = 50).
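The two metrics reduce to simple averages of indicator variables. A minimal sketch (function names are illustrative, not the benchmark's API; translation of predictions into English is assumed to have happened upstream):

```python
def exact_accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of examples whose (translated) prediction exactly
    matches the English ground-truth answer."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def existence_accuracy(exist_preds: list[bool]) -> float:
    """Fraction of examples where the model asserts an answer is
    present. In the standard NIAH setup a needle is always inserted,
    so the gold label is uniformly 'yes'."""
    return sum(exist_preds) / len(exist_preds)

preds = ["Vienna", "Paris", "Berlin"]
golds = ["Vienna", "Paris", "Munich"]
assert abs(exact_accuracy(preds, golds) - 2 / 3) < 1e-9
```

Comparing the two scores separates failure modes: a high existence score with a low exact score suggests the model notices the needle but garbles the answer in generation.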
4. Experimental Setup and Models
- LLMs Evaluated:
- Llama2-7B-Chat (4,096 tokens)
- Llama3-8B-Instruct (8,192 tokens)
- Cohere-Aya-23-8B (8,192 tokens)
- Mistral-7B-Instruct-v0.2 (32,768 tokens)
- Data Construction:
- Contexts are assembled from MLQA-aligned passages.
- Distractors are retrieved and ordered by semantic similarity.
- Context lengths are adjusted to fit within each model’s token budget.
- Systematic placement of needles (begin, middle, end) is used to probe positional bias.
5. Key Empirical Findings
A. Degradation with Context Length
All models experience a near-monotonic decrease in exact accuracy as the context grows. For example, Mistral-7B-Instruct's mean accuracy declines from 0.579 (short-context baseline) to 0.485 at 4K and 0.397 at 32K tokens. Critically, no model attains high retrieval accuracy at its maximum claimed context length; the "effective length" (the longest context at which accuracy stays within 25% of the baseline) is well below the published context window.
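The effective-length criterion can be computed mechanically from an accuracy-versus-length curve. A sketch under the 25%-relative-drop threshold suggested later in this article (the function name is an assumption):

```python
def effective_length(acc_by_len: dict[int, float],
                     baseline: float,
                     max_drop: float = 0.25):
    """Longest context length whose accuracy stays within `max_drop`
    (relative) of the short-context baseline; None if even the
    shortest length falls below the threshold."""
    threshold = baseline * (1.0 - max_drop)
    passing = [length for length, acc in sorted(acc_by_len.items())
               if acc >= threshold]
    return passing[-1] if passing else None

# Mistral-7B-Instruct figures quoted above: baseline 0.579,
# 0.485 at 4K, 0.397 at 32K.
acc = {4_000: 0.485, 32_000: 0.397}
assert effective_length(acc, baseline=0.579) == 4_000
```

Under this criterion, a model advertising a 32K window but scoring 0.397 there has an effective length of only about 4K tokens.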
B. Language Sensitivity
Retrieval is highly language-dependent:
- Monolingual English: Highest accuracy (e.g., Mistral ~0.68 at 4K tokens).
- Non-English Needles: Performance degrades sharply—German/Spanish needles lose ~30% relative accuracy, while Chinese/Arabic needles can deteriorate by 50% or more.
- Distractor Language: Changing the haystack language while keeping the needle in English affects accuracy only marginally. Retrieval difficulty is thus primarily determined by the needle's language, not the distractors'.
C. Needle Position (“Lost-in-the-Middle” Effect)
A pronounced U-shaped accuracy curve is observed. Retrieval is most successful when the needle is at the beginning or end, and lowest when it is embedded in the middle of the context.
Example (Mistral at 8K tokens):
- Beginning: ~0.50
- Middle: ~0.40
- End: ~0.52
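One simple way to summarize the U-shape is the relative accuracy drop at the middle versus the mean of the two edge positions. A small sketch using the figures above (the function name and the summary statistic itself are illustrative, not a metric from the benchmark):

```python
def mid_context_penalty(acc_begin: float, acc_mid: float,
                        acc_end: float) -> float:
    """Relative accuracy drop at the middle position versus the
    mean of the beginning/end positions."""
    edge = (acc_begin + acc_end) / 2.0
    return (edge - acc_mid) / edge

# Mistral at 8K tokens, figures from the text.
penalty = mid_context_penalty(0.50, 0.40, 0.52)
assert 0.21 < penalty < 0.22   # roughly a 21-22% relative drop
```

For the quoted numbers, burying the needle mid-context costs Mistral roughly a fifth of its edge-position accuracy.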
6. Conclusions, Best Practices, and Recommendations
The MLNeedle benchmark produces three critical takeaways regarding long-context multilingual retrieval:
- Effective Context Is Much Shorter Than Advertised: Models do not leverage context windows beyond 4–8K tokens for high-fidelity retrieval, despite larger claimed context sizes.
- Severe Language Sensitivity: Retrieval capability collapses for typologically distant, non-Latin-script languages due to script and resource disparities.
- Persistent Positional Bias: Information in the middle of long contexts remains hardest to extract, regardless of language.
Recommendations for future evaluation and model development:
- Incorporate multilingual NIAH tasks that include scripts beyond Latin.
- Randomize and report needle position to expose positional retrieval biases.
- Report not just claimed but also effective context lengths using systematic drop thresholds (e.g., 25% loss from baseline).
- Use both exact and existence accuracy metrics to disentangle retrieval from generation errors.
7. Broader Implications and Future Directions
MLNeedle’s publicly released data and code establish a foundation for rigorous, multilingual long-context evaluation of LLMs. Systematic benchmarking along the dimensions outlined here is essential for developing models that can perform robust retrieval in diverse, real-world scenarios where relevant information must be located across vast, multilingual corpora.
The results imply that future architectures and training regimes must:
- Address robust learning over scripts and languages that deviate from pretraining distributional norms.
- Tackle attention collapse and positional sensitivity with new architectural or retrieval-augmented strategies.
- Move beyond superficial context-size claims, reporting verifiable retrieval performance as a function of both content and position (Hengle et al., 2024).