
Utility-Driven Retrieval

Updated 5 February 2026
  • Utility-driven retrieval is a paradigm that selects documents based on their ability to enhance LLM performance on specific tasks, rather than mere topical relevance.
  • It employs utility-based labeling and reranking strategies to improve metrics such as F1 and NDCG@k, ensuring higher accuracy and efficiency in answer generation.
  • Emerging methods like iterative utility maximization, stochastic sampling, and model distillation balance retrieval efficiency with performance gains in complex query scenarios.

Utility-driven retrieval refers to a class of information retrieval and retrieval-augmented generation (RAG) methodologies that prioritize the retrieval of documents or passages based on their downstream usefulness ("utility") for a target model and task, as opposed to classic topical relevance matching. In RAG pipelines, utility is operationalized as the extent to which external content—once incorporated by an LLM—actually enhances the model’s ability to answer a query or fulfill a task, with the key insight being that utility is model- and task-specific rather than a static, universal property.

1. Conceptual Shift: From Relevance to Utility

Traditional information retrieval ranks documents by their topical or semantic relevance to a query: given a query $q$ and a document $d$, a relevance label indicates whether $d$ is "about" $q$. In retrieval-augmented generation, however, the objective is not simply to retrieve on-topic passages but to select those that measurably improve the downstream generation—e.g., the accuracy or factual correctness of an LLM-generated answer.

The critical distinction is that a passage may be topically relevant but contribute no actionable information (generic or redundant statements), or, worse, introduce confusion for the LLM. Utility-driven retrieval explicitly selects for passages that increase task performance, such as exact match, F1, or ROUGE scores on generated outputs (Zhang et al., 13 Oct 2025, Dai et al., 3 Mar 2025, Chandra et al., 27 Jan 2026, Zhang et al., 25 Jul 2025).

2. Formal Definitions and Model-Specificity

The notion of utility is formalized in terms of the LLM’s capacity for “answer improvement.” For a given LLM $\mathcal{L}$, the gold utility passages for query $q$ are those that increase the probability that $\mathcal{L}$ outputs an answer containing the ground truth, compared to internal knowledge alone:

$$u_i = \mathbb{I}\big[\text{has\_answer}(\mathcal{L}(q, d_i)) > \text{has\_answer}(\mathcal{L}(q, \emptyset))\big]$$

$$\mathcal{G}_q = \{ d_i \in \mathcal{C} \mid u_i = 1 \}$$

where $\mathcal{C}$ is the set of candidate passages for $q$ (Zhang et al., 13 Oct 2025).
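The definition above can be sketched directly in code. This is a minimal illustration, not an implementation from the cited work: `llm_answer` is a hypothetical stand-in for an LLM call that takes a query and an optional context passage, and `has_answer` is a simple containment check.

```python
def has_answer(output: str, gold: str) -> bool:
    """Binary check: does the generated output contain the ground truth?"""
    return gold.lower() in output.lower()

def gold_utility_set(query: str, gold: str, candidates: list, llm_answer) -> list:
    """Return G_q = {d_i : u_i = 1} for one specific LLM.

    `llm_answer(query, passage)` is a hypothetical callable wrapping the LLM;
    passing None as the passage means answering from internal knowledge alone.
    """
    # Baseline: does the LLM already answer correctly with an empty context?
    base = has_answer(llm_answer(query, None), gold)
    gold_set = []
    for d in candidates:
        with_doc = has_answer(llm_answer(query, d), gold)
        # u_i = 1 iff adding d_i flips an incorrect answer to a correct one.
        if with_doc and not base:
            gold_set.append(d)
    return gold_set
```

Note that the gold set depends on the `llm_answer` callable itself, which is exactly the model-specificity discussed next.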

Crucially, utility is not an absolute attribute of $d_i$; it is LLM- and task-specific. Different LLMs, by virtue of varied internal knowledge and comprehension ability, yield different sets $\mathcal{G}_q$ for the same query, rendering LLM-specific utility non-transferable: $\mathcal{G}_q^{\mathcal{L}_1} \neq \mathcal{G}_q^{\mathcal{L}_2}$ for non-identical models (Zhang et al., 13 Oct 2025).

Empirical findings show that human-annotated relevant passages recover only about half of the LLM-specific gold utility sets and sometimes degrade performance, especially when the LLM already “knows” the answer (Zhang et al., 13 Oct 2025, Chandra et al., 27 Jan 2026).

3. Benchmarking and Utility Judgment Methodologies

Benchmarking utility-driven retrieval involves:

  1. Retrieving a candidate set of passages (e.g., top-20 from a dense retriever).
  2. Constructing the LLM-specific gold utility set $\mathcal{G}_q$ using single-passage context runs.
  3. Creating utility-labeled data by evaluating LLM output correctness for various passage inclusions.
  4. Evaluating candidate retrieval/selection/ranking approaches against gold labels using metrics such as Precision/Recall/F1 (set-based) or NDCG@k (rank-based).
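The set-based judgment evaluation in step 4 can be sketched as follows; this is a generic Precision/Recall/F1 computation over predicted versus gold utility sets, not code from the cited benchmarks.

```python
def set_prf1(predicted: set, gold: set) -> tuple:
    """Set-based Precision, Recall, and F1 of a predicted utility subset
    against the gold utility set G_q."""
    tp = len(predicted & gold)  # passages correctly judged useful
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```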

Table: Key Measures in Utility-Driven Evaluation

| Step | Metric | Output |
| --- | --- | --- |
| Retrieval | Candidate set $\mathcal{C}$ | Top-N passages |
| Gold utility construction | $u_i$ | Gold set $\mathcal{G}_q$ |
| Judgment task | Precision, Recall, F1 | Predicted utility subset |
| Ranking-based evaluation | NDCG@k, Recall@k | Predicted ranking |
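For the ranking-based row, a standard binary-gain NDCG@k can be computed as below, treating a passage's utility label $u_i \in \{0,1\}$ as its relevance gain. This is the textbook formula, shown for concreteness rather than taken from any cited system.

```python
import math

def ndcg_at_k(ranking: list, gold_set: set, k: int) -> float:
    """Binary-gain NDCG@k: gain is 1 if the passage is in the gold utility
    set G_q, 0 otherwise; discount is 1/log2(rank + 1)."""
    dcg = sum((1.0 if d in gold_set else 0.0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:k]))
    # Ideal DCG: all gold passages packed at the top of the ranking.
    n_ideal = min(len(gold_set), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(n_ideal))
    return dcg / idcg if idcg > 0 else 0.0
```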

Verbalized selection (asking the LLM to select or rank passages using pseudo-answers), attention-based proxies, and likelihood-based methods are all explored. Verbalized listwise selection with a pseudo-answer is empirically most robust, with F1 up to 56–58% in optimal settings, although LLMs frequently fail to reject all passages for “known” queries (Zhang et al., 13 Oct 2025).
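A verbalized listwise prompt of the kind described above might be assembled as follows. The prompt wording is purely illustrative (the cited papers do not publish this exact template); the key ingredients are the pseudo-answer drafted by the LLM and an explicit option to reject all passages.

```python
def build_listwise_prompt(query: str, pseudo_answer: str, passages: list) -> str:
    """Assemble a verbalized listwise utility-selection prompt: the model sees
    its own pseudo-answer plus numbered candidates, and may answer 'none'."""
    listing = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Question: {query}\n"
        f"Draft answer: {pseudo_answer}\n"
        f"Candidate passages:\n{listing}\n"
        "List the numbers of the passages that would actually help answer "
        "the question, or reply 'none' if your internal knowledge already "
        "suffices."
    )
```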

Automated metrics such as Semantic Perplexity Reduction (SePer) further model retrieval utility as the increase in the probability that an LLM generates a semantically correct answer post-retrieval, capturing both information gain and alignment with model knowledge (Dai et al., 3 Mar 2025).
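The core idea behind such metrics—utility as the gain in the probability of a correct answer after retrieval—reduces to a simple difference. The sketch below is a deliberate simplification: `answer_prob` is a hypothetical callable returning the model's probability of a correct answer given a context, whereas the actual SePer metric clusters sampled answers by semantic equivalence before scoring.

```python
def retrieval_utility(query: str, doc, answer_prob) -> float:
    """Utility of doc for query = P(correct | q, d) - P(correct | q alone).
    `answer_prob(query, context)` is a hypothetical model-probing function;
    context=None means the model relies on internal knowledge only."""
    return answer_prob(query, doc) - answer_prob(query, None)
```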

4. Optimization and End-to-End Learning

Retrievers and rerankers can be optimized explicitly for utility rather than relevance:

  • Iterative Utility Maximization: Search engines can be optimized for the expected utility, using agent feedback to guide an EM algorithm that alternates between updating utility labels and retriever parameters (Salemi et al., 2024).
  • Differentiable Sampling: Stochastic RAG treats the retrieval process as stochastic sampling without replacement, employing differentiable straight-through Gumbel-top-k approximations to maximize expected utility end-to-end (Zamani et al., 2024).
  • Cascade and Distillation Approaches: Large LLMs can be used as teachers to label utility, enabling lightweight utility-based selectors (e.g., RankQwen1.7B, UtilityQwen1.7B) to be distilled, making dynamic selection feasible at lower computational cost (Zhang et al., 25 Jul 2025).
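The Gumbel-top-k trick mentioned above can be illustrated in a few lines. This sketch only shows the sampling step (perturb scores with Gumbel noise, take the top k, which samples k items without replacement proportionally to softmax weights); the differentiable straight-through relaxation used for end-to-end training is omitted.

```python
import numpy as np

def gumbel_top_k(scores: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Sample k distinct indices without replacement, proportional to
    softmax(scores), via the Gumbel-top-k trick: add i.i.d. Gumbel(0, 1)
    noise (equivalently -log(-log U)) and keep the k largest perturbed scores."""
    gumbel_noise = rng.gumbel(size=len(scores))
    return np.argsort(scores + gumbel_noise)[::-1][:k]
```

In the end-to-end setting, the hard top-k selection is paired with a straight-through estimator so that gradients of the expected utility flow back into the retriever's scoring function.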

Dynamic Information Retrieval (DIR) extends these concepts to multi-stage scenarios, with Bellman-style recursive expected-utility objectives that accommodate feedback and personalized diversification (Sloan et al., 2016).

5. Practical Systems and Empirical Implications

In practice, utility-based selection or reranking has consistently outperformed classical relevance-optimized retrieval in RAG, particularly for complex and multi-hop queries. For example, utility-based retrievers such as SCARLet leverage perturbation-based attribution to model inter-passage synergy, improving multi-task generalization and model robustness (Xu et al., 1 Apr 2025).

6. Limitations, Challenges, and Future Directions

Outstanding challenges include:

  • Judgment Difficulty: LLMs are better at identifying “useful” passages than at abstaining when external context is not required; rejection accuracy for "empty" queries is low (<5%) (Zhang et al., 13 Oct 2025).
  • Cost: Direct LLM-based utility judgment is computationally expensive, motivating distillation, sliding-window inference, and lightweight modeling (Zhang et al., 25 Jul 2025, Chandra et al., 27 Jan 2026).
  • Transfer and Generalization: Gold utility sets are not transferable across models, but outputs from verbalized or distilled selectors can sometimes transfer, indicating a partial decoupling between downstream model and retriever utility signal (Zhang et al., 13 Oct 2025).
  • Joint Interaction Modeling: Most current frameworks select passages independently; modeling inter-passage interaction at retrieval time (beyond post-hoc attribution) remains an open research direction (Xu et al., 1 Apr 2025).
  • Task Generalization: Current approaches often focus on QA, but extending utility-driven supervision to summarization, fact verification, code generation, and continuous retrieval scenarios is an active research area (Dai et al., 3 Mar 2025, Xu et al., 1 Apr 2025).

Future work aims at lighter-weight, utility-tuned selectors that can abstain when retrieval is unnecessary, together with end-to-end optimization strategies that maximize signal for the true end task (Zhang et al., 13 Oct 2025, Dai et al., 3 Mar 2025, Zamani et al., 2024).
