- The paper shows a strong correlation (Pearson >0.8) between coverage-based retrieval metrics and the information coverage of RAG outputs.
- It compares traditional relevance metrics with diversity-oriented metrics, revealing that non-redundant, aspect-diverse retrieval better supports complex answer synthesis.
- The study analyzes various retrieval stacks and pipeline structures, indicating that while iterative generation can partly mitigate retrieval gaps, targeted coverage metrics remain critical.
Introduction
The paper "Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage" (2603.08819) delivers a comprehensive empirical investigation into how the quality and objective of upstream retrieval within Retrieval-Augmented Generation (RAG) systems influence the information coverage of generated responses. In contrast to ad hoc retrieval, where maximizing topical relevance in returned documents is the central goal, RAG pipelines for report generation require that the retrieved set collectively cover diverse, non-redundant aspects of the information need. This work systematically quantifies correlations between coverage-oriented retrieval metrics and the actual information coverage achieved in generative outputs, analyzing a broad suite of retrieval stacks, generation pipelines, modalities, and evaluation frameworks.
Motivation and Background
RAG pipelines have become central for complex information-seeking tasks, where synthesis of information from a corpus is required to answer multi-faceted queries. Traditionally, retrieval system evaluation has focused on document-level relevance (e.g., MRR, nDCG, MAP). However, in the context of RAG, these traditional relevance metrics cannot capture the holistic coverage of information or penalize redundancy among retrieved documents. Prior work in retrieval diversification (e.g., α-nDCG, Subtopic Recall) better aligns with the conceptual requirements of RAG, as these metrics reflect not only topical match but also the coverage of distinct informational facets. Nevertheless, the field has lacked rigorous evidence on whether strong retrieval performance with such diversity-oriented metrics translates to higher downstream information coverage in generated responses.
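To make the contrast concrete, the two diversity-oriented metrics named above can be sketched as follows. This is a minimal illustrative implementation, not the paper's evaluation code: documents are represented simply as sets of subtopic IDs (hypothetical relevance judgments), and the ideal ordering for α-nDCG is built with the standard greedy approximation.

```python
from math import log2

def alpha_ndcg(ranking, alpha=0.5, k=10):
    """alpha-nDCG@k: rewards newly covered subtopics, discounts redundancy.

    `ranking` is a list of sets; each set holds the subtopic IDs that the
    document at that rank covers (hypothetical judgments for illustration).
    """
    def dcg(docs):
        seen = {}  # subtopic -> times covered so far
        total = 0.0
        for rank, subtopics in enumerate(docs[:k], start=1):
            # Each repeat coverage of a subtopic is geometrically discounted.
            gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
            for s in subtopics:
                seen[s] = seen.get(s, 0) + 1
            total += gain / log2(rank + 1)
        return total

    # Ideal ordering: greedily pick the document with the highest marginal
    # gain (exact optimization is NP-hard; greedy is the usual convention).
    remaining, ideal, seen = list(ranking), [], {}
    while remaining:
        best = max(remaining,
                   key=lambda doc: sum((1 - alpha) ** seen.get(s, 0) for s in doc))
        remaining.remove(best)
        for s in best:
            seen[s] = seen.get(s, 0) + 1
        ideal.append(best)
    idcg = dcg(ideal)
    return dcg(ranking) / idcg if idcg else 0.0

def subtopic_recall(ranking, all_subtopics, k=10):
    """Fraction of the query's subtopics covered by the top-k documents."""
    covered = set().union(*ranking[:k]) if ranking[:k] else set()
    return len(covered & all_subtopics) / len(all_subtopics)
```

A relevance-only metric would score a ranking of three equally relevant but redundant documents the same as three complementary ones; both functions above separate those cases.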
Experimental Design
To operationalize this investigation, the paper examines 15 text and 10 multimodal retrieval stacks (spanning BM25, dense retrievers, multilingual models, and rerankers) across distinct RAG pipelines (linear, subquery-based, and iterative/reflective generation). Downstream pipelines include GPT-Researcher, Bullet List, and LangGraph for text, as well as CAG for video. Evaluation is conducted on TREC NeuCLIR 2024, TREC RAG 2024, and WikiVideo, in conjunction with automatic and human-in-the-loop nugget-based evaluation frameworks (e.g., Auto-ARGUE, MiRAGE). Nugget coverage in generated responses serves as the primary endpoint, measuring the extent to which responses incorporate the atomic factual units judged necessary for a complete answer.
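The nugget-coverage endpoint reduces to a simple ratio: of the atomic facts required for a complete answer, how many does the response contain? A minimal sketch follows; the frameworks named above (e.g., Auto-ARGUE) use an LLM judge to decide whether a response entails each nugget, so the substring-matching default here is only a stand-in.

```python
def nugget_coverage(response, nuggets, judge=None):
    """Fraction of required nuggets a generated response contains.

    `nuggets` is a list of atomic fact strings. `judge` should decide
    whether the response conveys a nugget; the default naive substring
    check stands in for the LLM-based entailment judges used in practice.
    """
    judge = judge or (lambda resp, nugget: nugget.lower() in resp.lower())
    hits = sum(1 for nugget in nuggets if judge(response, nugget))
    return hits / len(nuggets) if nuggets else 0.0
```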
The analysis is stratified at both the per-topic level (does a more coverage-diverse input set yield better response coverage for a given query?) and the system level (do improved retrieval stacks yield consistently better RAG pipelines?).
Findings
The central result is the strong, consistent correlation found between coverage-based retrieval metrics (α-nDCG, Subtopic Recall, nugget-based nDCG) and actual information coverage in RAG outputs. This correlation manifests at both the topic and system levels. For instance, Pearson correlation coefficients exceed 0.8 for α-nDCG and nugget coverage in several settings, establishing that retrieval systems tuned for aspectual coverage also yield generated responses with higher informational completeness.
Conversely, the canonical relevance-only metrics (e.g., relevance-based nDCG) exhibit substantially weaker correlation with nugget coverage, particularly in report-style complex-answer tasks. This empirically confirms that relevance alone is an insufficient surrogate for RAG pipeline effectiveness where the goal is complex answer synthesis.
Influence of RAG Pipeline Complexity
An important nuance uncovered is that pipeline structure modulates the retrieval-generation relationship. In simple retrieve-then-generate (linear) pipelines, improvements in coverage-based retrieval metrics directly yield higher generation coverage. In contrast, pipelines with complex, iterative, or subquery-driven generation (e.g., LangGraph, Bullet List) exhibit partial decoupling, as the generation logic can in some cases adapt to retrieval deficiencies—effectively shifting performance bottlenecks from retrieval to generative adaptation. However, this adaptivity does not universally recover information coverage loss and leads to increased evaluation variance.
Evaluation Consistency and Multimodality
The retrieval-coverage correlation generalizes across automatic evaluation tools (e.g., Auto-ARGUE, MiRAGE), with minor variations due to metric operationalization (e.g., direct nugget recall vs. recall conditioned on grounded citations), and across modalities (text and video). In multimodal RAG, the coverage-based retrieval metrics remain reliable predictors for factuality and content coverage in generated video-grounded articles, although the degree of correlation may be diluted when LLMs over-rely on parametric knowledge and under-utilize retrieved evidence.
Implications and Future Directions
By establishing coverage-based retrieval metrics as robust, cost-effective early indicators of downstream generative performance, this work enables more efficient and principled selection, tuning, and benchmarking of retrieval modules in RAG system design. The results also make clear that simply maximizing traditional relevance metrics is insufficient for RAG pipelines targeting complex, multi-aspect answers; retrieval models must be optimized for information coverage and minimal redundancy.
The partial decoupling effect in sophisticated pipelines suggests that further research is warranted into co-design and joint optimization strategies for retrieval and generation modules. Additionally, developing models that explicitly integrate nugget/objective awareness or perform querying with dynamic facet identification could further enhance coverage.
The methodology and results provide a strong foundation for generalizing coverage-first evaluation protocols, especially as generative models increasingly interface with large, heterogeneous knowledge corpora and as RAG paradigms are extended to new modalities or languages.
Conclusion
This paper delivers a rigorous, large-scale empirical demonstration that retrieval strategies prioritizing information coverage—quantified via diversity-oriented metrics—are strong predictors of RAG pipeline performance in terms of coverage in generated outputs. This finding holds across evaluation levels, modalities, and pipeline structures, with caveats regarding the adaptability of iterative LLM-driven generation. These results provide actionable guidelines for RAG system development: retriever selection and tuning should be dominated by coverage-centric objectives, and coverage metrics can partially obviate costly end-to-end generative assessment when they are appropriately aligned with the downstream task. This advances the field’s understanding of retrieval-generation interdependencies and sets a framework for future RAG model advancement and evaluation.