
Retrieval Preference Optimization (RPO)

Updated 12 February 2026
  • RPO is a framework that aligns retriever relevance scores with generator outputs by optimizing explicit preference signals from user feedback and reward models.
  • It employs methods such as regularized direct preference optimization, contrastive likelihood objectives, and reinforcement learning to bridge retrieval gaps and mitigate noise.
  • RPO has demonstrated robust gains across applications like e-commerce search, open-domain QA, code generation, and conversational tasks with significant benchmark improvements.

Retrieval Preference Optimization (RPO) is a class of preference-based alignment strategies for retrieval-augmented or retrieval-centric language generation systems. RPO aims to bridge mismatches between the “preference” of a retriever (what it scores as relevant) and the downstream objectives of a generator (what produces the best task-specific output), or to align output generation directly with user, retriever, or context relevance signals. Its methods span regularized direct preference optimization, contrastive likelihood objectives, and reinforcement learning formulations, and have been successfully instantiated for settings such as e-commerce generative retrieval, retrieval-augmented open-domain QA, code synthesis, and conversational search. Theoretical and practical advances in RPO address core issues of retrieval noise, knowledge conflicts, and inference efficiency while achieving robust gains on real-world benchmarks.

1. Foundations and Motivation

Standard retrieval-augmented generation (RAG) settings combine two subsystems: (1) a retriever that selects relevant external contexts, and (2) a generator (typically a large language model, LLM) that conditions on the retrieved data to produce a response, code, or identifier. In such systems, several critical challenges arise:

  • The retriever’s internal relevance scoring may not match downstream generation utility, especially when the context is noisy, redundant, or misaligned.
  • The generator often naively relies on retrieved content, even when this conflicts with useful internal (parametric) knowledge, leading to hallucinations or over-inclusion errors.
  • Optimization approaches that only fine-tune generation or retrieval independently fail to resolve these “preference gaps,” and cascaded post-hoc reranking is computationally expensive and brittle (Yan et al., 23 Jan 2025, Gao et al., 2024).
  • In end-user-facing scenarios such as e-commerce search or code generation, human-click or generator-quality signals provide valuable direct supervision, but require new objective formulations to exploit efficiently (Li et al., 2024, Gao et al., 2024).

RPO explicitly formalizes and optimizes these preference signals within unified or contrastive learning objectives, enabling the joint or decoupled alignment of retrieval and generation behaviors.

2. Mathematical Formulations and Objective Functions

RPO generalizes direct preference optimization objectives from reinforcement learning from human feedback (RLHF) by leveraging explicit preference data extracted from user clicks, generator reward models, or retriever ranking feedback (Li et al., 2024, Yan et al., 23 Jan 2025, Zhang et al., 2024, Gao et al., 2024).

Direct Preference Optimization with Retrieval-Aware Losses

Across diverse applications, RPO uses objectives of the form

\mathcal{L}_{\mathrm{RPO}} = -\,\mathbb{E}_{(x,y^+,y^-)\sim D}\,\log \sigma\left( \Delta_{\theta}(x;y^+,y^-) - \epsilon \right)

where:

  • (x, y^+, y^-) is a preference triple: y^+ is preferred to y^- for context x.
  • \Delta_{\theta}(x;y^+,y^-) = \log \pi_\theta(y^+|x) - \log \pi_\theta(y^-|x) is the log-likelihood gap.
  • \epsilon is a margin hyperparameter.
  • \sigma is the logistic sigmoid.

This formulation underpins both retrieval-aware DPO (Li et al., 2024, Yan et al., 23 Jan 2025, Zhang et al., 2024) and margin-based ORPO losses (Liu et al., 16 Feb 2025).
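As a concrete illustration, the margin loss above can be computed for a single preference triple in a few lines of plain Python. The function name `rpo_margin_loss` and its argument layout are illustrative, not taken from any cited implementation:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, the sigma in the RPO objective."""
    return 1.0 / (1.0 + math.exp(-z))

def rpo_margin_loss(logp_pos: float, logp_neg: float, eps: float = 0.0) -> float:
    """Margin-based preference loss for one (x, y+, y-) triple.

    logp_pos and logp_neg are log pi_theta(y+|x) and log pi_theta(y-|x);
    eps is the margin hyperparameter from the objective.
    """
    delta = logp_pos - logp_neg          # log-likelihood gap Delta_theta
    return -math.log(sigmoid(delta - eps))
```

The loss shrinks toward zero as the log-likelihood gap grows past the margin, and raising `eps` strictly increases the loss for a fixed gap, which is what pushes the policy to separate preferred from rejected outputs by at least the margin.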

Joint Reward Decomposition

For RAG, RPO decomposes the reward into generation and retrieval relevance:

\mathcal{L}_{\mathrm{RPO}} = -\,\mathbb{E}_{q,y_w,y_l,D}\, \log \sigma\Big( \beta\Big[ \underbrace{\log\tfrac{\pi_\theta(y_w|q,D)}{\pi_{\mathrm{ref}}(y_w|q,D)} - \log\tfrac{\pi_\theta(y_l|q,D)}{\pi_{\mathrm{ref}}(y_l|q,D)}}_{\text{generation preference}} \pm \frac{1}{|D|}\log\tfrac{\pi_\theta(D|q)}{\pi_{\mathrm{ref}}(D|q)} \Big] \Big)

Here, the retrieval reward is incorporated additively, and the sign is set by which knowledge source should be trusted (Yan et al., 23 Jan 2025).
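A minimal sketch of this joint objective for one training example, assuming scalar log-probabilities are already available from the policy and a frozen reference model (all names and the argument layout here are hypothetical):

```python
import math

def rag_rpo_loss(lp_w, lp_l, lp_w_ref, lp_l_ref,
                 lp_docs, lp_docs_ref, n_docs,
                 beta=0.1, trust_context=True):
    """Joint generation + retrieval preference loss for one example.

    lp_w / lp_l: log pi(y_w|q,D) and log pi(y_l|q,D) under the policy;
    *_ref: the same quantities under the frozen reference model;
    lp_docs / lp_docs_ref: log pi(D|q) under policy and reference.
    The sign of the retrieval term flips depending on whether the
    retrieved context should be trusted over parametric knowledge.
    """
    gen_pref = (lp_w - lp_w_ref) - (lp_l - lp_l_ref)   # generation preference
    retr_term = (lp_docs - lp_docs_ref) / n_docs        # length-normalized retrieval reward
    sign = 1.0 if trust_context else -1.0
    z = beta * (gen_pref + sign * retr_term)
    return -math.log(1.0 / (1.0 + math.exp(-z)))        # -log sigmoid(z)
```

Flipping `trust_context` changes which knowledge source the gradient favors: with a positive retrieval reward, trusting the context lowers the loss relative to distrusting it.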

Preference Gap Minimization

In retrieval-augmented code generation, the “preference gap” is quantified as:

\mathrm{Gap}(q) = \mathbb{E}_{r \sim P_{\mathrm{retr}}(\cdot|q)}\left[ s_{\mathrm{retr}}(r,q) - s_{\mathrm{gen}}(r,q) \right]

where s_{\mathrm{retr}}(r,q) is the retriever’s relevance score and s_{\mathrm{gen}}(r,q) is the generator’s performance metric, e.g., BLEU or CodeBLEU. RPO in this context minimizes \mathrm{Gap}(q) subject to redundancy and length constraints (Gao et al., 2024).
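The gap itself is just an expectation over retrieved contexts, so a Monte Carlo estimate from sampled (relevance, generation-quality) score pairs is straightforward. This helper is an illustrative sketch, not code from the cited paper:

```python
def preference_gap(samples):
    """Monte Carlo estimate of Gap(q) from retrieved samples r ~ P_retr(.|q).

    Each sample is a pair (s_retr, s_gen): the retriever's relevance score
    and the generator-side quality metric (e.g., BLEU) for that context.
    A large positive gap means the retriever rates contexts far above
    their actual usefulness to the generator.
    """
    return sum(s_retr - s_gen for s_retr, s_gen in samples) / len(samples)
```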

3. Instantiations Across Application Domains

(A) E-commerce Generative Retrieval

Generative Retrieval with Preference Optimization (GenR-PO) (Li et al., 2024) casts retrieval as autoregressive generation of multi-span item identifiers, significantly reducing order/noise variance compared to previous approaches. Click-log-derived preference triplets and DPO guide the model to maximize the likelihood of spans associated with purchased items, while constrained beam search—backed by an FM-index over catalog spans—ensures only catalog-valid, interpretable outputs.
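The effect of constrained decoding can be sketched without a full FM-index: a plain prefix scan over tokenized catalog spans returns the only next tokens a beam is allowed to emit. This is a toy stand-in for the actual index structure, with hypothetical names throughout:

```python
def catalog_valid_continuations(catalog, prefix):
    """Next tokens that keep a partial identifier inside the catalog.

    catalog: a set of token tuples (catalog-valid identifier spans);
    prefix: the tokens decoded so far on one beam.
    Beam search masks every token not in the returned set, so only
    catalog-valid, interpretable outputs can be generated. A real
    system would answer this query with an FM-index rather than a scan.
    """
    n = len(prefix)
    return {span[n] for span in catalog
            if len(span) > n and span[:n] == prefix}
```

An empty return set signals that the beam has completed a span (or left the catalog) and should be finalized or pruned.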

(B) Retrieval-Augmented Question Answering and Knowledge Tasks

Preference optimization in open-domain QA settings, as instantiated in RPO (Yan et al., 23 Jan 2025) and KnowPO (Zhang et al., 2024), is motivated by the need for controllable knowledge selection under conflicting contexts:

  • RPO (Yan et al., 23 Jan 2025) integrates joint reward models for answer quality and retrieval relevance, formulating a single unified objective that steers the LLM to trust or disregard external context based on learned retrieval relevance awareness.
  • KnowPO (Zhang et al., 2024) constructs a balanced dataset of knowledge-conflict triples simulating “contextual ignorance” and “overinclusion” errors, and uses DPO-style contrastive learning to favor accurate context use or parameter fallback as required.

(C) Code Generation

In Retrieval-Augmented Code Generation (RACG), RPO identifies the fundamental mismatch between retriever objectives (e.g., maximizing ground truth similarity) and generator utility. The Preference-Guided Refactored Tuning approach (Gao et al., 2024) introduces a refactorer module between retriever and generator, trained in two stages (supervised and PPO-based preference alignment) to compress, denoise, and tune contexts for maximal generator performance.

(D) Conversational and Rewriting Tasks

RetPO (Yoon et al., 2024) aligns query reformulation with retriever preferences by first generating a diverse pool of candidate rewrites via LLM prompting, evaluating them with a retriever, and then fine-tuning via SFT and DPO using large-scale retriever feedback triplets, yielding superior retrieval performance in sequential conversational search settings.
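Turning retriever feedback into DPO training data amounts to ranking candidate rewrites by retrieval score and pairing higher-scoring against lower-scoring ones. The following is a hypothetical sketch of that pairing step, not the RetPO authors' code:

```python
def rewrite_preference_pairs(scored_rewrites, margin=0.0):
    """Build DPO (preferred, rejected) pairs from retriever-scored rewrites.

    scored_rewrites: list of (rewrite, retrieval_score) for one query.
    Only pairs whose score gap exceeds `margin` are kept, filtering out
    near-ties that would provide a noisy preference signal.
    """
    ranked = sorted(scored_rewrites, key=lambda rs: rs[1], reverse=True)
    pairs = []
    for i, (r_hi, s_hi) in enumerate(ranked):
        for r_lo, s_lo in ranked[i + 1:]:
            if s_hi - s_lo > margin:
                pairs.append((r_hi, r_lo))
    return pairs
```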

4. Training Pipelines and Optimization Architectures

The following pipeline archetypes recur across RPO implementations:

| Stage | Description | Core Reference |
| --- | --- | --- |
| SFT | Supervised fine-tuning on preferred outputs (e.g., click-selected spans, optimal rewrites, good answers) | (Li et al., 2024, Yan et al., 23 Jan 2025, Yoon et al., 2024) |
| DPO/RPO | Direct preference optimization with pairwise or triplet loss on human/retriever preference data | (Yan et al., 23 Jan 2025, Zhang et al., 2024) |
| Preference-aware RL | RL (e.g., PPO) with generator reward feedback to post-process retrieved input | (Gao et al., 2024) |
| Inference constraints | Constrained decoding via catalogs, FM-indexes, or reward selection | (Li et al., 2024, Gao et al., 2024) |

At inference, RPO-trained systems leverage either catalog-constrained search (for interpretable item retrieval (Li et al., 2024)), additive calibration of retrieval and generation scores (Zhang et al., 2024), or direct generation with optimized preference objectives (Yan et al., 23 Jan 2025).
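The additive-calibration option can be sketched as a weighted combination of generation log-probability and retrieval relevance per candidate answer; `alpha`, the tuple layout, and the function name are assumptions for illustration rather than the cited method's exact interface:

```python
def calibrated_answer(candidates, alpha=0.5):
    """Select an answer by additively combining two scores at inference.

    candidates: list of (answer, gen_logprob, retr_score) triples, where
    gen_logprob is the generator's log-probability for the answer and
    retr_score is the supporting context's retrieval relevance.
    alpha weights how strongly retrieval relevance can override the
    generator's own (parametric) preference.
    """
    return max(candidates, key=lambda c: c[1] + alpha * c[2])[0]
```

With a small `alpha` the generator's parametric preference dominates; raising it lets a well-supported but lower-likelihood answer win instead.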

5. Empirical Performance and Evaluation

RPO approaches consistently outperform dense retrieval, vanilla RAG, and cascade-based reranking baselines across domains:

  • On e-commerce long-tail queries, GenR-PO achieves Recall@1000 = 0.4310 (vs. RSR = 0.3100) with significant online conversion improvements (Li et al., 2024).
  • Across open-domain QA benchmarks, RPO achieves 4–10 percentage points accuracy gains without extra inference cost (Yan et al., 23 Jan 2025). KnowPO shows up to +37 percentage points in adherence rate over prior tuning methods (Zhang et al., 2024).
  • In code generation, the RRG refactorer pipeline delivers up to +28% EM, +13% BLEU, and +6.8% CodeBLEU over baselines (Gao et al., 2024).
  • In conversational search, RetPO improves mean reciprocal rank (MRR) and recall by double-digit points on TopiOCQA and QReCC (Yoon et al., 2024).

Ablation analyses uniformly show large drops in downstream metrics when DPO/preference optimization is removed, confirming the necessity of explicit preference modeling.

6. Interpretability, Limitations, and Future Directions

RPO frameworks contribute directly to interpretability, especially in configurations that generate explicit, human-readable rationales or catalog-aware spans (Li et al., 2024, Yoon et al., 2024). Interpretability is further supported by the structure of preference triplets and explicit reward decomposition.

Key limitations include:

  • Dependency on reliable preference data (e.g., click logs, task accuracy labels) and well-calibrated retriever/generator signals.
  • Computational overhead during preference-data collection (but not inference).
  • Potential brittleness to retrieval failures or overly noisy context distributions (Yan et al., 23 Jan 2025).

Future research directions highlighted by cited works focus on context compression, multi-hop or iterative retrieval preference optimization, representation learning for preference signals, and extending contrastive margin-based losses to broader RAG and multimodal pipelines (Yan et al., 23 Jan 2025, Gao et al., 2024).

7. Summary Table: RPO Across Domains

| Domain | Main Signal for Preference | Objective Style | Key Gains |
| --- | --- | --- | --- |
| E-commerce | Click logs, spans | DPO over SFT spans; constrained decoding | Long-tail Recall@K, conversion ↑ |
| Open-domain QA | Correctness under conflict | Joint reward decomposition RL/DPO | 4–10 pp accuracy ↑ |
| Code generation | Generator reward (CodeBLEU) | PPO-tuned refactorer | Up to +28% EM, BLEU ↑ |
| Conversational rewrite | Retriever ranking | SFT + DPO on retrieval ranks | ~+20 pp MRR, Recall |

Retrieval Preference Optimization establishes a rigorous, empirically validated foundation for aligning retrieval and generation objectives in LLMs, directly addressing failure modes of RAG systems and unlocking performance and interpretability gains in diverse real-world applications (Li et al., 2024, Yan et al., 23 Jan 2025, Zhang et al., 2024, Gao et al., 2024, Yoon et al., 2024).
