
Instructed Retrievers

Updated 19 January 2026
  • Instructed retrievers are models that condition document ranking on both a query and explicit user instructions, enabling customizable relevance criteria.
  • They utilize dual-encoder and cross-encoder architectures with instruction-aware training and contrastive losses to improve facet-specific retrieval performance.
  • Empirical results show metric improvements like higher nDCG and p-MRR, while also highlighting challenges in instruction sensitivity and robustness.

Instructed retrievers are retrieval models that, in addition to a canonical query, accept free-form natural-language instructions specifying user intent, guiding what "relevant" means on a per-query or per-session basis. Unlike traditional retrieval paradigms, which define relevance axiomatically or through implicit training-task cues, instructed retrievers operationalize relevance as a function of both query and instruction, enabling fine-grained, steerable, and scenario-specific retrieval across diverse tasks, modalities, and user objectives.

1. Formal Definition and Conceptual Distinction

Instructed retrievers generalize classic dense or neural retrievers by conditioning ranking not only on the user query q (or a seed document), but also on a user-specified instruction I that modulates the relevance criterion. The core scoring function takes the form score(q, I; d) = ⟨E_{q,I}(q, I), E_d(d)⟩, where E_{q,I} is an instruction-conditioned encoder and E_d encodes the candidate document. This enables retrieval tailored along facets (e.g., "method"), user characteristics ("for medical doctors"), style, source, and other high-level intent specifications.
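In code, the scheme reduces to a dot product between an instruction-fused query embedding and an independently encoded document. A minimal runnable sketch, with toy deterministic embeddings standing in for the learned encoders E_{q,I} and E_d (all function names here are illustrative):

```python
import hashlib
import numpy as np

DIM = 64

def embed(text):
    # Toy deterministic embedding standing in for a learned Transformer
    # encoder; seeded from a hash of the text so results are reproducible.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def score(query, instruction, doc):
    # score(q, I; d) = <E_{q,I}(q, I), E_d(d)>: the instruction is fused
    # with the query before encoding; the document is encoded on its own.
    return float(embed(instruction + " [SEP] " + query) @ embed(doc))

def rank(query, instruction, docs):
    # Changing the instruction changes the query-side embedding, and hence
    # the ranking, even for an identical query.
    return sorted(docs, key=lambda d: score(query, instruction, d), reverse=True)
```

Because the instruction is fused only on the query side, the document index can be precomputed once and reused across arbitrary instructions.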

In contrast, standard dense retrievers (e.g. SciNCL, Specter2) are "instruction-agnostic": they optimize semantic similarity between queries and documents for a fixed, often task-dependent notion of relevance, with no mechanism for query-time adaptation to auxiliary user-specified criteria (Maheshwari et al., 16 Jan 2026). The instructed paradigm subsumes instruction-agnostic methods as the case where I is null or fixed.

The distinction is operationally critical in exploratory or multi-faceted search, where users wish to focus retrieval on specific aspects of complex seeds; for example, "find papers whose Method matches this paper's Method" as opposed to generic semantic similarity (Maheshwari et al., 16 Jan 2026). Standard retrievers without explicit aspect or instruction conditioning tend to overweight overall topic, failing to accommodate such fine-grained user intent.

2. Architectures, Training, and Prompting Strategies

Bi-Encoder and Dual-Encoder Paradigms: Many instructed retrievers use dual-encoder frameworks. Architectures such as TART-dual and Promptriever employ a Transformer encoder E shared between the "query+instruction" fusion and the document, with the joint query formed by concatenation as [I; q] (equivalently, q ∥ I) (Asai et al., 2022, Weller et al., 2024). Lightweight cross-attention between instruction and query tokens, as in InF-Embed (Zhuang et al., 27 May 2025), further enhances alignment.

Cross-Encoder and Reranking: TART-full, MonoT5-style rerankers, and cross-encoder LLMs (e.g. GPT-4o, Mistral-7B-Instruct) accept the triplet (I, q, d) and compute joint representations using cross-attention (Asai et al., 2022, Maheshwari et al., 16 Jan 2026). This allows for holistic matching and is especially effective for nuanced instruction following, at the cost of higher inference time.

Instruction Integration: Parameter-isolated designs such as I³ employ auxiliary "introspector" modules that attend over both early-layer query features and separately encoded instructions, producing an intent vector that modulates deeper query representations. The backbone encoders remain frozen, preserving base retrieval competency while enabling instruction awareness (Pan et al., 2023).
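A heavily simplified sketch of this parameter-isolated pattern, with a single nonlinear projection standing in for the introspector's attention and a toy frozen backbone (names, shapes, and the projection itself are illustrative, not the actual I³ design):

```python
import numpy as np

DIM = 16
rng = np.random.default_rng(0)
# The only "trainable" parameters in this sketch: the introspector weights.
W = rng.standard_normal((DIM, 2 * DIM)) * 0.1

def backbone(text):
    # Stand-in for the frozen backbone encoder (kept fixed during tuning).
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).standard_normal(DIM)

def introspector(q_feat, i_feat):
    # Attends jointly over early-layer query features and the separately
    # encoded instruction (here: one projection of their concatenation),
    # producing an intent vector.
    return np.tanh(W @ np.concatenate([q_feat, i_feat]))

def conditioned_query(query, instruction):
    # The intent vector additively modulates the deeper query representation;
    # only the introspector would be trained, preserving base competency.
    q_feat = backbone(query)
    return q_feat + introspector(q_feat, backbone(instruction))
```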

Prompt-based Ranking for LLMs: Pairwise Ranking Prompting (PRP) leverages LLMs by constructing templates like:

    Which document is more relevant to Q under instruction I?
    A: [D₁]
    B: [D₂]
    Answer: A or B.
This approach efficiently induces ranked lists via O(N log N) pairwise comparisons and circumvents context-length bottlenecks of listwise re-ranking (Maheshwari et al., 16 Jan 2026).
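The comparison-sort view of PRP can be sketched as follows, with the LLM's A/B verdict replaced by a token-overlap stub (`llm_prefers` and its heuristic are placeholders for the prompt-and-parse model call, not the paper's implementation):

```python
from functools import cmp_to_key

def llm_prefers(query, instruction, d1, d2):
    # Stub for the LLM judgment: prefer the document sharing more tokens
    # with the instruction-augmented query. A real system would fill the
    # A/B template above and parse the model's answer.
    probe = set((instruction + " " + query).lower().split())
    s1 = len(probe & set(d1.lower().split()))
    s2 = len(probe & set(d2.lower().split()))
    return "A" if s1 >= s2 else "B"

def prp_rank(query, instruction, docs):
    # O(N log N) comparator calls: sort with the pairwise verdict.
    def cmp(d1, d2):
        return -1 if llm_prefers(query, instruction, d1, d2) == "A" else 1
    return sorted(docs, key=cmp_to_key(cmp))
```

Since sorting needs only O(N log N) comparator calls, no single prompt has to hold the full candidate list; note that a real LLM comparator can be noisy or intransitive, which `sorted` tolerates but does not correct.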

Loss Functions: Instruction-aware training typically involves contrastive losses aligning the fused (q, I) representation with the relevant document, with explicit contrast between positives and hard negatives. A multivariate InfoNCE loss over (I, q, d) (as in InF-Embed) or in-batch softmax for the dual-encoder TART further enhances instruction discrimination (Zhuang et al., 27 May 2025, Asai et al., 2022).
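As an illustration, the in-batch softmax variant of this objective can be written directly over precomputed embeddings (a sketch only; real training backpropagates through the encoders and mixes mined hard negatives into the batch):

```python
import numpy as np

def info_nce(fused_q, docs, temperature=0.05):
    # fused_q: (B, D) instruction-conditioned query embeddings E_{q,I}(q, I).
    # docs:    (B, D) document embeddings; docs[i] is the positive for
    #          fused_q[i], and all other rows act as in-batch negatives.
    sims = fused_q @ docs.T / temperature           # (B, B) similarity logits
    sims -= sims.max(axis=1, keepdims=True)         # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # NLL of the positives
```

When each query's positive dominates its row of similarities the loss approaches zero; misaligned positives drive it up, which is what pushes the fused query embedding toward instruction-consistent documents.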

3. Evaluation Protocols and Benchmarks

Instructed retrievers are assessed not just by standard IR metrics (nDCG@k, MAP), but by explicit measures of instruction sensitivity and robustness.

Instruction Sensitivity Metrics:

  • p-MRR: Measures pairwise Mean Reciprocal Rank when comparing rankings under two different instructions for the same query seed; near zero indicates insensitivity, positive values indicate proper instruction following, negatives signal counter-intuitive behavior (Maheshwari et al., 16 Jan 2026).
  • Robustness@k: Minimum nDCG@k across all paraphrased or altered instructions for each query; lower values signal fragility to instruction variants (Oh et al., 2024).
  • Strict Instruction Compliance Ratio (SICR): Fraction of queries where both positive instruction pushes the gold document up and the reversed instruction demotes it, capturing strict adherence (Zhou et al., 2024).
  • WISE: Weighted Instruction Sensitivity Evaluation, accounting for position and magnitude of rank changes under instruction switches (Zhou et al., 2024).
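Of these, Robustness@k is the simplest to state in code: compute nDCG@k for each instruction variant's ranking and keep the minimum. A sketch assuming graded relevance labels given in ranked order (function names are illustrative):

```python
import numpy as np

def ndcg_at_k(ranked_rels, k):
    # ranked_rels: graded relevance of documents, in ranked order.
    gains = 2.0 ** np.asarray(ranked_rels[:k], dtype=float) - 1
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(gains @ discounts)
    ideal = sorted(ranked_rels, reverse=True)[:k]
    igains = 2.0 ** np.asarray(ideal, dtype=float) - 1
    idcg = float(igains @ discounts[:len(igains)])
    return dcg / idcg if idcg > 0 else 0.0

def robustness_at_k(rankings_per_variant, k):
    # Worst-case nDCG@k across paraphrased/altered instruction variants.
    return min(ndcg_at_k(r, k) for r in rankings_per_variant)
```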

Benchmarks:

  • FollowIR, InstructIR, InfoSearch: Curate instance-level (query, instruction, document) triplets or mini-pools with fine-grained, user-aligned instructions covering dimensions such as Audience, Keyword, Format, Language, Length, and Source. Many benchmarks include both direct and reversed constraints for sensitivity analysis (Zhuang et al., 27 May 2025, Oh et al., 2024, Zhou et al., 2024).
  • mFollowIR: Multilingual instruction-following benchmark extending beyond English, evaluating both cross-lingual and multilingual performance in Chinese, Persian, and Russian (Weller et al., 31 Jan 2025).
  • Exploratory Search Collections: Aspect-controlled corpus with expert annotations on per-aspect relevance to evaluate facet steering and instruction sensitivity in exploratory research sessions (Maheshwari et al., 16 Jan 2026).

4. Empirical Findings, Robustness, and Shortcomings

Ranking Performance:

  • Instructed retrievers such as GritLM-7B and gpt-4o_PRP yield significant improvements in content-relevance metrics, e.g., +5.9 nDCG@20 over strong baselines on aspect-guided exploration (Maheshwari et al., 16 Jan 2026), +14.3 p-MRR on FollowIR (Weller et al., 2024), and nDCG@10 up to 52.9 on BEIR for I³ (Pan et al., 2023).
  • Cross-encoder and listwise reranking architectures (e.g., GPT-4o, RankZephyr) produce the highest instruction-following scores (e.g., WISE ≈ +33.5, SICR ≈ 32%) (Zhou et al., 2024).

Instruction Sensitivity:

  • Gains in nDCG do not imply improved instruction following: many models hover near zero or negative p-MRR, underscoring that increased content relevance alone fails to guarantee correct instruction response (Maheshwari et al., 16 Jan 2026).
  • Dense retrievers often match content but ignore or inconsistently follow auxiliary constraints like document length, style, or negation (Zhou et al., 2024, Oh et al., 2024).
  • Instruction-tuned models can overfit to narrow, task-style instructions, substantially underperforming on real-world, user-aligned, or paraphrased instructions (e.g., INSTRUCTOR-XL nDCG@10 drops 25+ points vs. base GTR) (Oh et al., 2024).

Multilingual Setting:

  • English instruction-tuned methods transfer well cross-lingually (p-MRR ≈ 8–10 in mFollowIR), but performance drops significantly in non-English queries due to lack of target-language instruction data (Weller et al., 31 Jan 2025).

Vulnerabilities and Risks:

  • The steerability of instructed retrievers introduces attack surfaces: such models can be induced to surface disallowed or harmful passages via carefully crafted instructions, even when the LLM is safety aligned (BehnamGhader et al., 11 Mar 2025).
  • Fine-grained instructions can retrieve specific, malicious content (mean rank ≈2.09 out of 10 for target passages) (BehnamGhader et al., 11 Mar 2025).

5. Dataset Construction and Hard Negative Strategies

High-fidelity evaluation and robust training hinge on richly varied, carefully validated instruction–query–document corpora.

| Corpus | Distinctive Features | Size |
| --- | --- | --- |
| Promptriever | Free-form, per-query instructions (4 styles) | ~0.5M triplets |
| InF-IR | Instruction/query-poisoning negatives; o3-mini validation | 38,759 positives, 77,518 negatives |
| BERRI (TART) | 37 tasks × 3–8 expert-written instructions | ~5M (q, d⁺) pairs |
| InstructIR | User-aligned, multi-category, instance-level instructions | 9,906 (q, I, d) triplets |
| InfoSearch | 6 doc-level attributes; reversed instructions | 600 base queries × 3 modes |

Hard negatives are synthesized by instruction and query "poisoning," ensuring that negatives are plausible (semantically similar) but instruction-misaligned (Zhuang et al., 27 May 2025, Weller et al., 2024). Validation is performed using advanced reasoning models (e.g. o3-mini) and human rating to filter out ambiguous cases.

Instance-level negatives (documents relevant to q but not under I) are critical. Benchmarks and training regimens that employ only generic or task-level instructions risk overfitting and poor robustness to user-aligned scenarios (Oh et al., 2024, Zhuang et al., 27 May 2025).
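Instruction poisoning of this kind can be sketched as swapping instructions across otherwise valid triplets; in practice (e.g., InF-IR) such candidates are then validated by a reasoning model and human raters, which this toy version omits:

```python
def poison_negatives(triplets):
    # triplets: list of (query, instruction, positive_doc).
    # Instruction poisoning keeps the (query, doc) pair but swaps in
    # another example's instruction, yielding a semantically plausible but
    # instruction-misaligned negative (to be validated downstream).
    negatives = []
    for i, (q, instr, doc) in enumerate(triplets):
        for j, (_, other_instr, _) in enumerate(triplets):
            if j != i and other_instr != instr:
                negatives.append((q, other_instr, doc))
    return negatives
```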

6. Recommendations, Open Challenges, and Future Directions

Training Methodology:

  • Employ highly diverse, instance-level instructions—not fixed or task-level prompts—to cover the spectrum of real user scenarios (Oh et al., 2024).
  • Include hard negatives that violate the intended instruction, in addition to query-based negatives, during contrastive training (Zhuang et al., 27 May 2025).
  • Incorporate multi-task or multi-dimensional objectives (e.g., WISE penalty terms) to jointly optimize for both content relevance and instruction compliance (Zhou et al., 2024).

Evaluation:

  • Use instruction-sensitivity metrics (p-MRR, SICR, WISE) in addition to standard nDCG/MAP for comprehensive evaluation.
  • Track robustness to paraphrastic and reversed instructions, and report worst-case rather than mean-case ranking behavior (Oh et al., 2024, Zhou et al., 2024).

Modeling Innovations:

  • Integrate explicit aspect vectors, multi-vector matching, or "introspector" modules for compositional facet conditioning (Pan et al., 2023, Maheshwari et al., 16 Jan 2026).
  • Explore hybrid architectures: a lightweight instruction encoder modulating a scalable dense or sparse index (Weller et al., 31 Jan 2025).
  • Expand to hierarchical or interactive retrieval (multi-instruction search, online steering), and cross-modal instruction-following (e.g., VideoITG for segment and frame retrieval in video) (Wang et al., 17 Jul 2025).

Risks and Mitigation:

  • The steerability that makes instructed retrievers useful also enables adversarial misuse (Section 4); safety alignment of the retriever itself, rather than only the downstream LLM, remains an open problem (BehnamGhader et al., 11 Mar 2025).

Open Questions:

  • How to balance generalization and instruction sensitivity without sacrificing robustness or benign retrieval performance?
  • What is the optimal division of labor between dense-packed retrievers and reranking LLMs, given architectural and cost constraints?
  • How should models handle fine-grained, user-aligned, multi-lingual, and multimodal instructions at web scale, especially for under-resourced languages and formats?

Future research will advance instruction-following retrieval through more granular supervision, richer instance-level corpora, multi-dimensional objectives, robust instruction-sensitive benchmarking, and the synthesis of scalable encoder architectures with flexible, instruction-aware reranking. The goal is to realize retrieval models that not only "know what you want" but also reliably "do what you say" under evolving, context-rich user intents (Maheshwari et al., 16 Jan 2026, Zhou et al., 2024, Oh et al., 2024, Zhuang et al., 27 May 2025).
