Generative Retrieval in IR
- Generative Retrieval is a novel IR paradigm that treats search as sequence generation to directly produce document identifiers.
- It leverages encoder–decoder architectures and constraint-based decoding to unify corpus encoding and document ranking.
- Practical insights include efficient identifier design, multi-task co-training, and strong benchmark performance on datasets like MSMARCO and NQ320K.
Generative Retrieval enacts a paradigm shift in information retrieval (IR) by framing search as sequence generation: an LLM directly outputs the unique identifiers (DocIDs) of relevant documents in response to a query. Rather than relying on external sparse or dense indices, generative retrieval systems encode corpus and relevance information within the model parameters and produce document rankings end to end. This approach enables direct optimization for retrieval objectives, potentially superior calibration and scaling properties, integration with multi-task learning frameworks, and fine-grained semantic matching between queries and document representations (Kuo et al., 2024, Zhang et al., 26 Sep 2025, Li et al., 2023).
1. Formal Definition and Comparison to Classical Paradigms
Generative retrieval models are typically based on encoder–decoder transformer architectures. Given a query $q$, the model generates the identifier $d = (d_1, \dots, d_{|d|})$ of a relevant document via conditional language modeling:

$$P(d \mid q; \theta) = \prod_{i=1}^{|d|} P(d_i \mid d_{<i}, q; \theta)$$
This contrasts with sparse retrieval (e.g., BM25), which uses inverted term indices and statistical ranking, and dense retrieval, which relies on dual-encoder vector similarity:
- Sparse IR: $\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$ (BM25)
- Dense IR: $\mathrm{score}(q, d) = E_q(q)^{\top} E_d(d)$
Generative retrieval collapses encoding and scoring into an autoregressive operation whose output is the DocID. The model is trained on (query, DocID) pairs using maximum likelihood, and at inference decodes the highest-probability valid DocIDs, returning the corresponding documents (Kuo et al., 2024, Lee et al., 2023, Nguyen et al., 2023).
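The scoring side of this formulation can be sketched in a few lines. Everything below is a hypothetical stand-in — in particular `cond_prob`, which imitates a trained decoder's next-token distribution — but it shows how ranking reduces to summing token log-probabilities per candidate DocID:

```python
import math

def cond_prob(token, prefix, query):
    """Hypothetical P(token | prefix, query): favors DocID tokens that
    appear in the query (a crude stand-in for learned relevance)."""
    return 0.9 if token in query else 0.1

def docid_log_likelihood(docid, query):
    """Sum of log P(d_i | d_<i, q) over the DocID token sequence."""
    logp = 0.0
    for i, tok in enumerate(docid):
        logp += math.log(cond_prob(tok, docid[:i], query))
    return logp

def rank_docids(query, docids):
    """Return DocIDs sorted by descending sequence log-likelihood."""
    return sorted(docids, key=lambda d: docid_log_likelihood(d, query),
                  reverse=True)

ranking = rank_docids("neural search", ["d-xz", "d-ae"])
```

In a real system the product of conditional probabilities comes from the decoder, and the top-k is found by constrained beam search rather than exhaustive scoring.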
2. Document Identifier Design and Corpus Indexing
Identifier construction is critical for effective generative retrieval. Several schemes exist:
| Identifier Type | Description | Scalability/Tradeoffs |
|---|---|---|
| Numeric/Atomic | Unique integer tokens per doc | Large output vocabularies; poor semantics |
| Lexical/String | Titles, keyword sets, pseudo-queries, n-grams | Leverages PLM knowledge; collision risks |
| Hierarchical/Semantic | Multi-level cluster/path codes (e.g., via RQ) | Tree-like scaling; prefix support |
| Learned/Adaptive | Tokenizers over learned document embeddings | Adaptive but complex to tune |
String-based and dynamic lexical identifiers (e.g., as in GLEN) align best with LLMs, mitigate vocabulary mismatch, and support modular index growth (Lee et al., 2023, Li et al., 2023). Hierarchical cluster-based identifiers and semantic codes (e.g., as in RIPOR) offer scalability and efficient prefix-based beam search (Zeng et al., 2023, Nguyen et al., 2023). Identifier design must also minimize collisions and maximize distinctness for effective ranking, and may employ autoencoder-based fusion or multi-task optimization for enhanced corpus coverage and relevance (Tang et al., 2024).
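Hierarchical identifier construction can be sketched under strong simplifying assumptions — 1-D "embeddings" and median bisection standing in for the k-means or residual-quantization clustering that real systems use. The point of the sketch is the property that matters for decoding: semantically close documents share code prefixes.

```python
# Toy sketch: assign each document a hierarchical path code by recursively
# bisecting the corpus on the embedding median. Nearby documents end up
# sharing code prefixes, which enables prefix-constrained beam search.

def assign_codes(doc_embs, depth=3):
    """Map doc_id -> tuple code from recursive median splits."""
    codes = {doc: [] for doc in doc_embs}

    def split(docs, level):
        if level == depth or len(docs) <= 1:
            return
        median = sorted(doc_embs[d] for d in docs)[len(docs) // 2]
        left = [d for d in docs if doc_embs[d] < median]
        right = [d for d in docs if doc_embs[d] >= median]
        if not left or not right:          # degenerate split: stop recursing
            return
        for d in left:
            codes[d].append(0)
        for d in right:
            codes[d].append(1)
        split(left, level + 1)
        split(right, level + 1)

    split(list(doc_embs), 0)
    return {d: tuple(c) for d, c in codes.items()}

embs = {"doc_a": 0.10, "doc_b": 0.15, "doc_c": 0.90, "doc_d": 0.95}
codes = assign_codes(embs)
```

Here `doc_a` and `doc_b` share a first code digit while `doc_c` and `doc_d` land under the other branch; a production scheme would cluster high-dimensional embeddings and use a larger branching factor per level.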
3. Sequence-to-Sequence Architecture and Retrieval Workflow
Generative retrieval models feature a shared encoder and autoregressive decoder(s):
- Shared encoder processes the query (and potentially enriched context via connectors/context augmentation).
- Decoder(s) generate DocIDs (retrieval head) and, if multi-task, grounded answer text (question answering head as in UniGen).
Decoding is performed under constraints—prefix-tree (trie), FM-index, term-set invariant search—to guarantee only valid identifiers are returned. Several frameworks employ dual decoders with shared encoders for retrieval and downstream tasks (RAG, QA) with connector bridges to address input length and semantic gaps (Li et al., 2023).
Example decoding objective for DocID generation:

$$\hat{d} = \arg\max_{d \in \mathcal{D}} \prod_{i=1}^{|d|} P(d_i \mid d_{<i}, q; \theta),$$

where $\mathcal{D}$ is the set of valid DocIDs enforced by the decoding constraint.
For multi-task frameworks, joint optimization weights the per-task losses:

$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{retrieval}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{QA}}$$
Connector augmentation and iterative refinement, as in UniGen, further enhance the seq2seq mapping between enriched queries and DocIDs or answers (Li et al., 2023).
4. Optimization Strategies and Learning-to-Rank
Generative retrieval traditionally suffers from a mismatch between its sequence-likelihood training objective and the ranking objective of retrieval. Several advances address this:
- Learning-to-Rank: LTRGR incorporates pairwise margin-based losses on top of sequential generation, optimizing direct ranking of documents and bridging objective mismatch (Li et al., 2023).
- Contrastive Learning: DOGR and RIPOR employ document-oriented contrastive objectives, pulling query and document representations together for positives, pushing apart for negatives, and integrating prefix-level ranking constraints aligned with beam search processes (Lu et al., 11 Feb 2025, Zeng et al., 2023).
- Distillation: Distillation Enhanced Generative Retrieval (DGR) uses teacher rankers to supply multi-graded rank order and applies bespoke RankNet-style losses among candidate DocIDs, propagating soft relevance signals into the generative retriever (Li et al., 2024).
- Multi-Graded Supervision: multi-graded contrastive losses and joint DocID fusion enable fine-grained graded-relevance discrimination, outperforming binary-only generative approaches (Tang et al., 2024).
Advanced negative sampling (e.g., prefix-aware, retrieval-augmented), cross-entropy with auxiliary ranking losses, and curriculum methods on prefix lengths support robustness and effective beam-search survival.
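The pairwise margin idea behind LTRGR-style learning-to-rank can be sketched numerically. The exact loss form here is an assumption for illustration; the scores play the role of sequence log-likelihoods produced by the generative model:

```python
# Sketch of a pairwise margin (hinge) ranking loss: penalize cases where a
# positive DocID's generation score does not beat a negative's by `margin`.

def pairwise_margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: max(0, margin - (pos_score - neg_score))."""
    return max(0.0, margin - (pos_score - neg_score))

def batch_ltr_loss(pairs, margin=1.0):
    """Average margin loss over (positive, negative) score pairs."""
    losses = [pairwise_margin_loss(p, n, margin) for p, n in pairs]
    return sum(losses) / len(losses)

# First pair is ranked correctly with margin to spare (zero loss);
# second pair is mis-ranked and contributes a positive loss.
loss = batch_ltr_loss([(-1.2, -5.0), (-4.0, -3.5)])
```

Layered on top of the standard cross-entropy generation loss, terms like this push the model to separate positives from hard negatives in score space, which is what sequence likelihood alone does not guarantee.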
5. Empirical Results, Comparisons, and Mechanistic Insights
Generative retrieval approaches yield state-of-the-art or competitive performance on standard benchmarks:
| Dataset | Model | Recall@1 | MRR@10 | Notable Results | Reference |
|---|---|---|---|---|---|
| NQ320K | GLEN | 69.1% | 75.4% | +1 pp over the GenRet baseline | (Lee et al., 2023) |
| MSMARCO | RIPOR | 33.3% | -- | +30.5% MRR improvement over LTRGR | (Zeng et al., 2023) |
| NQ, MS MARCO | UniGen | 42.34% | 56.38% | Joint retrieval and QA gains over pipelines | (Li et al., 2023) |
| BEIR | ZeroGR-3B | -- | 48.1 | Zero-shot, multi-task generalization | (Sun et al., 12 Oct 2025) |
| Book Search | GBSᵖ | 56.7% | 46.9 | Outperforms RIPOR on multi-chapter dataset | (Tang et al., 19 Jan 2025) |
| E-commerce | GenR-PO | 37.6% | -- | Headroom on long-tail queries, conversion gains | (Li et al., 2024) |
Mechanistic interpretability studies reveal that generative retrieval models encode ranking logic almost entirely in the decoder, with late-stage cross-attention and position-wise MLPs critical for final decision-making; the encoder can often be swapped with minimal performance loss, suggesting decoupling and targeted adaptation pathways (Reusch et al., 25 Mar 2025). GR can be analytically unified with multi-vector dense retrieval: relevance scores are provably sums over products of query and document token vectors weighted by an alignment matrix derived from attention (Wu et al., 2024, Nguyen et al., 2023).
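The unification with multi-vector dense retrieval can be stated schematically (symbols chosen here for illustration, paraphrasing the cited analysis):

$$s(q, d) = \sum_{i=1}^{|q|} \sum_{j=1}^{|d|} A_{ij}\, \mathbf{q}_i^{\top} \mathbf{d}_j,$$

where $\mathbf{q}_i$ and $\mathbf{d}_j$ are query and document token vectors and $A$ is an alignment matrix derived from the model's attention, so the generative relevance score decomposes into attention-weighted token-level dot products.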
6. Advancements, Extensions, and Domain Adaptations
Recent works substantially broaden the applicability and efficiency of generative retrieval:
- Zero-shot Generalization: ZeroGR demonstrates that instruction-tuned docid and query generators, in conjunction with reverse-annealed decoding, enable robust retrieval on unseen tasks and heterogeneous corpora (Sun et al., 12 Oct 2025).
- Cross-modal Retrieval: ACE applies generative retrieval to multimodal (text-to-image/audio/video) retrieval, constructing hierarchical identifiers (coarse from K-Means, fine from RQ-VAE) and achieves superior Recall@1 over dual-tower baselines (Fang et al., 2024).
- Recommendation Systems: TIGER and GRAM pioneer generative retrieval for recommendation and e-commerce, integrating semantic or alignment-based identifiers, direct preference optimization, and partial-order learning for item search and ranking (Rajput et al., 2023, Pang et al., 2 Apr 2025).
- Long-Document Search: GBS extends generative retrieval to hierarchical, multifaceted books via outline-oriented encoder design and comprehensive data augmentation (Tang et al., 19 Jan 2025).
- Reasoning-Augmented Retrieval: R4R alternates between structured chain-of-thought reasoning and identifier decoding, updating context iteratively for enhanced ranking, yielding further gains over base GR systems (Zhang et al., 15 Oct 2025).
7. Challenges, Limitations, and Research Directions
Key limitations include scalability to massive corpora, collision management in identifier spaces, dynamic corpus updates, and fine-grained multi-graded supervision. As corpus size grows, output vocabulary and beam search depth strain model capacity. Ongoing solutions include:
- Hierarchical or dynamically learned identifier schemes (Zeng et al., 2023, Lee et al., 2023)
- Hybrid architectures combining generative and dense retrieval (coarse-to-fine, fallback to ANN indexing) (Zhang et al., 26 Sep 2025, Kuo et al., 2024)
- Incremental and online index update protocols, modular adapter layers, and memory replay (Kuo et al., 2024)
- Multi-task co-training and integration with QA/summarization (Li et al., 2023, Kuo et al., 2024)
- Improving connector design, data construction pipelines (pseudo-query/answer generation), and instruction-driven zero-shot generalizability (Sun et al., 12 Oct 2025, Tang et al., 19 Jan 2025)
- Distillation from high-quality teacher rankers for robust learning signal propagation (Li et al., 2024)
The field continues advancing methods for identifier learning, scalability, robust constraint-based decoding, cross-modal retrieval, preference optimization, mechanistic understanding, and practical deployment at industrial scale.
Generative retrieval, through unified sequence-to-sequence modeling that natively produces document identifiers, offers a fundamentally differentiable, highly adaptive, and semantically expressive approach to information retrieval—challenging the boundaries of classic sparse and dense IR paradigms and establishing fertile ground for continued research and deployment (Kuo et al., 2024, Zhang et al., 26 Sep 2025, Li et al., 2023, Zeng et al., 2023, Sun et al., 12 Oct 2025).