- The paper demonstrates that retrieval benchmark queries remain answerable through semantic redundancy, even amid significant corpus evolution.
- It employs dual snapshots with GPT-powered nugget extraction and LLM-based judging to capture evolving relevance across technical repositories.
- Stable embedding model rankings with high Kendall tau correlations underline the temporal reliability of the evaluation methodology.
Evaluating Temporal Drift in Retrieval Benchmarks: An In-Depth Analysis
Introduction
The persistence of the Cranfield paradigm in information retrieval (IR) evaluation has rendered benchmarks static, with the assumption that corpora, query sets, and relevance judgments remain temporally invariant. However, technical domains, characterized by rapid documentation evolution, codebase modifications, and feature deprecation, pose fundamental challenges for evaluation rooted in static corpora. "Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks" (2603.04532) conducts a nuanced analysis of temporal drift using dual snapshots of FreshStack, focusing on LangChain and affiliated repositories, and interrogates the robustness of retrieval evaluation under corpus shifts. This essay reviews the work's methodological rigor, empirical findings, and implications for future IR evaluation protocols.
Methodology
Corpus and Query Anchoring
The study constructs temporally explicit test collections for October 2024 and October 2025, anchored to ten GitHub repositories relevant to retrieval-augmented generation (RAG) tools, predominantly LangChain, Chroma, LlamaIndex, and their satellites. All main-branch commits prior to each date are used, extracting documentation, code, and notebooks, chunked with a 2048-token limit. Each query is derived from Stack Overflow and decomposed into "nuggets" (atomic informational units) using GPT-4o, enabling granular document-level support assessment.
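The chunking step described above can be sketched as follows. The 2048-token limit comes from the paper; the whitespace tokenization and the function name are illustrative assumptions standing in for whatever tokenizer the study actually used.

```python
def chunk_document(text: str, max_tokens: int = 2048) -> list[str]:
    """Split a document into chunks of at most max_tokens tokens.

    Whitespace tokenization is a stand-in for the paper's (unspecified)
    tokenizer; the 2048-token limit matches the study's setup.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

In practice a subword tokenizer matching the embedding models would replace the whitespace split, but the sliding-window structure is the same.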
Retrieval and Judgment Framework
A hybrid retrieval pool is assembled through lexical (BM25) and dense (BGE [6], E5 Mistral [40], Qwen3 [44]) models, with Qwen3-4B-Instruct leveraged to generate sub-questions and closed-book answers, augmenting the breadth of judgment pools beyond original user queries. Top-50 document fusions are then subjected to nugget support assessment using Cohere's Command A, an LLM-based judge operating at nugget granularity, classifying documents as relevant if they support at least one essential nugget.
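Two pieces of this framework lend themselves to a sketch. First, pooling the lexical and dense runs into a top-50 set: the paper does not specify its fusion scheme, so reciprocal rank fusion (RRF), a standard choice, is an assumption here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 50
) -> list[str]:
    """Fuse ranked document lists (e.g., a BM25 run and dense runs) into
    a single judgment pool.

    RRF is a common fusion method; the paper pools top-50 documents per
    query, but its exact fusion scheme is an assumption in this sketch.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def judge_relevance(doc: str, nuggets: list[dict], supports) -> bool:
    """Apply the paper's decision rule: a document is relevant iff it
    supports at least one essential nugget.

    `supports(doc, nugget_text) -> bool` stands in for the LLM judge
    (Command A in the paper); injecting it as a callable is an
    illustrative abstraction, not the paper's API.
    """
    return any(
        nugget["essential"] and supports(doc, nugget["text"])
        for nugget in nuggets
    )
```

The second function makes the relevance rule explicit: nugget-level support, not whole-document similarity, decides the label.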
Metrics and Benchmarking
Model ranking and retrieval efficacy are evaluated using α-nDCG@10 (a diversity-aware variant of nDCG that rewards coverage of distinct nuggets), Coverage@20, and Recall@50, providing a multi-dimensional perspective on both relevance and result diversity. Model rankings between temporal snapshots are compared via Kendall τ correlation coefficients to assess consistency.
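The ranking-consistency check can be reproduced with a minimal Kendall τ over per-model metric scores. This is a bare concordant/discordant count without tie correction (libraries such as SciPy implement tie-aware variants), so it is a sketch of the comparison, not the paper's exact computation.

```python
from itertools import combinations


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall tau between two model rankings induced by metric scores,
    e.g., Recall@50 on the 2024 vs. 2025 snapshots.

    Minimal sketch: ties are not specially handled.
    """
    models = sorted(scores_a.keys() & scores_b.keys())
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0
```

A τ of 1.0 means the two snapshots order the models identically; values near the paper's reported 0.978 for Recall@50 indicate almost no pairwise swaps.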
Empirical Results
Sustained Query Support
Across 203 Stack Overflow queries and 640 nuggets, only a single nugget loses support in the 2025 snapshot, despite significant corpus restructuring, including a 67% reduction in LangChain documentation. This demonstrates a key empirical insight: distributed redundancy and migration of code and documentation between related repositories (notably LangChain to LlamaIndex, Chroma, and Transformers) maintain the answerability of nearly all historical queries. The dominant mechanism is not document stasis, but content migration and expansion across modular competitor frameworks.
Shift in Distribution of Relevant Documents
A pronounced redistribution of relevant documents accompanies corpus drift:
- In 2024, 50.9% of relevant documents stemmed from LangChain, falling to 24.8% by 2025, with LlamaIndex and LangChainJS overtaking it in share.
- Case studies, e.g., ImportError with UnstructuredPDFLoader, exhibit substantial migration: in 2025, answers for this query draw from six repositories, compared to one dominant source in 2024.
- The expanded diversity of relevant sources in the 2025 corpus highlights the necessity for retrieval systems to operate atop robust semantic representations rather than repository-specific heuristics.
Retrieval Model Robustness Under Temporal Drift
The performance of diverse embedding models (e.g., Qwen3-4B/8B, Stella, BGE, E5, Voyage) exhibits only minor declines between years. Kendall τ correlation for model rankings remains robust: Recall@50 achieves 0.978, α-nDCG@10 reaches 0.846, and Coverage@20, though lower at 0.692, still indicates moderate consistency.
- Qwen3 embeddings consistently achieve the highest retrieval scores.
- Overall, model rankings are highly stable, with only minimal churn under corpus shift, strongly supporting the temporal reliability of the FreshStack benchmark in this context.
Theoretical and Practical Implications
The findings provide valuable evidence that IR benchmarks built from technical documentation can remain valid even under significant corpus drift, provided that (a) content independence, redundancy, or migration exists across repositories, and (b) relevance is assessed at the semantic nugget level, not reliant on static file references. This raises key theoretical questions:
- To what extent does this robustness generalize beyond modular, cross-linked technical codebases to more monolithic corpora (e.g., Wikipedia, scientific literature) where non-redundant content is at risk of becoming unreachable?
- Does the strong stability of embedding model rankings suggest sufficient invariance in underlying semantic representations, or is this an artifact of evaluation metrics focusing on short/medium context retrieval rather than deeper reasoning or comprehensiveness?
Practically, the automated judgment pipeline, combining LLM-based nugget generation and LLM-based judging, substantially reduces the annotator bottleneck for temporally evolving benchmarks. This positions such frameworks as essential for continuous evaluation of RAG and LLM-integrated systems deployed in dynamic technical domains.
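The overall shape of such a pipeline, re-judging a benchmark against a new corpus snapshot without human annotators, can be sketched as an orchestration of three injected stages. All three callables (`extract_nuggets` for the GPT-4o nugget step, `retrieve` for the hybrid pool, `judge` for the LLM-based support judge) are hypothetical stand-ins, not the paper's actual interfaces.

```python
def refresh_benchmark(queries, corpus_snapshot, extract_nuggets, retrieve, judge):
    """Re-derive per-query relevance judgments for a corpus snapshot.

    Hypothetical stand-ins for the paper's stages:
      extract_nuggets(query)        -> list of nugget strings
      retrieve(query, snapshot)     -> pooled candidate doc ids
      judge(doc_id, nugget)         -> bool (does the doc support it?)
    """
    judgments = {}
    for query in queries:
        nuggets = extract_nuggets(query)
        pool = retrieve(query, corpus_snapshot)
        judgments[query] = {
            doc: [n for n in nuggets if judge(doc, n)] for doc in pool
        }
    return judgments
```

Because every stage is automated, the same loop can be re-run on each snapshot date, which is precisely what makes continuous, temporally aware evaluation tractable.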
Future Directions
The study foregrounds several open avenues, including:
- Investigation into query-nugget drift, where not only the corpus but the informational need evolves and the "correct" answer may change.
- Application of this methodology to other domains with less content migration, where judgments must be regenerated at each snapshot.
- Longitudinal evaluation of retrieval systems over multi-year timescales, incorporating non-technical, general knowledge corpora.
Conclusion
"Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks" (2603.04532) provides a methodologically sound demonstration that IR benchmarks can exhibit significant temporal robustness, even under extensive documentation churn, when evaluated across dynamic, cross-referential technical repositories. The empirical evidence of stable query grounding, diversified source support, and highly correlated model rankings challenges the necessity of frequent test collection regeneration in such domains. The work motivates a re-examination of static-benchmark assumptions in IR and presents a scalable workflow for dynamic, automated, and temporally aware evaluation, with significant implications for retrieval system assessment in both research and production environments.