
Direct Preference Optimization for RAG

Updated 12 February 2026
  • The paper presents PA-RAG, which integrates a DPO loss into RAG systems to align LLM outputs with enhanced informativeness, robustness, and citation accuracy.
  • It leverages multi-perspective preference data from distinct datasets (RI, RR, CQ) using staged optimization to refine output quality without auxiliary reranking.
  • Empirical results on Llama2 models show significant gains in exact match, recall, and precision over SFT and baseline methods, mitigating issues like catastrophic forgetting.

Direct Preference Optimization (DPO) for Retrieval-Augmented Generation (RAG) is a training paradigm designed to end-to-end align LLM generators within RAG systems to multiple desiderata—specifically, informativeness, robustness, and citation quality—through the systematic use of pairwise preference learning. PA-RAG (“Multiple Perspective Preference Alignment for Retrieval-Augmented Generation”) provides the canonical instantiation of DPO in RAG, leveraging multi-perspective preference data and a staged DPO loss to refine LLM outputs without auxiliary reranking or verification modules (Wu et al., 2024).

1. Optimization Objective and DPO Loss

PA-RAG adopts the Direct Preference Optimization (DPO) loss as defined by Rafailov et al. (NeurIPS 2023), formulating the objective over preference tuples $(x, y^+, y^-)$, where $y^+$ is the preferred output and $y^-$ the rejected output for input $x$. With a frozen reference policy $\pi_{\mathrm{ref}}$ (the generator after instruction fine-tuning), the DPO loss for model parameters $\theta$ is:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = \mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}} \left[ \log\left(1 + \exp\left(-\beta \left[\log\frac{\pi_\theta(y^+|x)}{\pi_{\mathrm{ref}}(y^+|x)} - \log\frac{\pi_\theta(y^-|x)}{\pi_{\mathrm{ref}}(y^-|x)}\right]\right)\right) \right]$$

with:

  • $\pi_\theta(y|x)$: the trainable generator's likelihood of output $y$ given $x$
  • $\pi_{\mathrm{ref}}(y|x)$: the likelihood under the frozen reference generator
  • $\mathcal{D}$: dataset of preference triples
  • $\beta > 0$: temperature (sharpness) parameter

This can be rewritten as:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{\mathcal{D}} \left[ \log \sigma\left(\beta \left[\log\frac{\pi_\theta(y^+|x)}{\pi_{\mathrm{ref}}(y^+|x)} - \log\frac{\pi_\theta(y^-|x)}{\pi_{\mathrm{ref}}(y^-|x)}\right]\right) \right]$$

where $\sigma(z) = 1/(1 + e^{-z})$. Minimization directly raises the log-probability of preferred responses relative to rejected ones, integrating preference signals into next-token prediction without training an explicit reward model.
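The pairwise loss above reduces to a one-line softplus once the sequence log-likelihoods are in hand. A minimal sketch for a single preference triple, assuming the standard reference-policy formulation (function and argument names are illustrative, not the paper's code):

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Pairwise DPO loss for one preference triple (x, y+, y-).

    Each argument is a sequence log-likelihood: logp_* under the trainable
    policy, ref_logp_* under the frozen reference policy.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)) == log(1 + exp(-margin))  (softplus form)
    return math.log1p(math.exp(-margin))
```

At zero margin the loss equals log 2 ≈ 0.693; as the policy's log-ratio for the chosen response grows relative to the rejected one, the loss decays toward zero.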

2. Multi-Perspective Preference Data Construction

Three distinct datasets target RAG’s alignment requirements:

  • Response Informativeness (RI): For each prompt, up to 5 “golden” documents covering the short answers are included. Chosen outputs $y^+$ are ChatGPT-3.5 generations, rewritten to ensure all answers are present with correct citations. Rejected outputs $y^-$ come from an SFT-fine-tuned generator prompted with some golden documents removed, so they contain only incomplete answers.
  • Response Robustness (RR): Prompts are corrupted by mixing up to 5 golden documents with 4 noisy documents (2 answer-free but topically related; 2 irrelevant documents drawn from other questions). Chosen responses reuse the RI outputs (with citations renumbered for the new context), while rejected responses are the SFT generator's incomplete outputs for these noisy prompts.
  • Citation Quality (CQ): Prompts contain partial outputs truncated at an erroneous citation. Rejected responses are the generator's original completions with unsupported or irrelevant citations (flagged by a T5-11B NLI verifier), and chosen responses are citation-corrected rewrites.

Each dataset consists of triples $(x, y^+, y^-)$, with sizes RI: 11,788; RR: 13,399; CQ: 22,525 (combined over the ASQA, WebQ, and NQ datasets).
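Concretely, one RR-style triple can be pictured as the record below. The layout and field names are our own illustration, not the paper's released schema, and the example question is invented:

```python
# Hypothetical layout of one Response Robustness (RR) preference triple.
rr_triple = {
    "question": "Who wrote The Remains of the Day?",
    "documents": [
        {"id": 1, "text": "Kazuo Ishiguro's 1989 novel ...", "golden": True},
        {"id": 2, "text": "The Booker Prize shortlist ...", "golden": False},  # topical, answer-free
        {"id": 3, "text": "Photosynthesis converts light ...", "golden": False},  # irrelevant
    ],
    "chosen": "The Remains of the Day was written by Kazuo Ishiguro [1].",
    "rejected": "The novel won the Booker Prize in 1989.",  # incomplete, no answer
}

def has_golden_citation(triple):
    """Check that the chosen response cites at least one golden document."""
    golden_ids = {d["id"] for d in triple["documents"] if d["golden"]}
    return any(f"[{i}]" in triple["chosen"] for i in golden_ids)
```

A simple validator like `has_golden_citation` mirrors the construction rule that chosen responses must carry correct citations into the (renumbered) golden documents.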

3. Staged DPO in the RAG Pipeline

PA-RAG’s training process is structured in two key phases:

  • Phase I – Instruction Fine-Tuning (IFT): The base LLM is first trained on ∼58.9K ChatGPT citation-rewrite examples (batch size 128, LR 2e-5, 1 epoch).
  • Phase II – Multi-Stage Preference Optimization via DPO: DPO is performed in three stages, each over one preference dataset (RI, then RR, then CQ), with per-stage learning rates ($\alpha_{RI} = \alpha_{RR} = 2 \times 10^{-6}$, $\alpha_{CQ} = 2 \times 10^{-7}$), batch size 64, and a single pass (1 epoch). Each minibatch computes the DPO loss and updates the LLM parameters. The method ensures incremental preference alignment toward informativeness, robustness, then citation accuracy.

Training pseudocode:

for (stage, D_stage) in [(RI, D_RI), (RR, D_RR), (CQ, D_CQ)]:
    choose learning rate α_stage
    for each minibatch (x, y⁺, y⁻) ∈ D_stage:        # one epoch per stage
        l⁺ ← log π_θ(y⁺ | x) − log π_ref(y⁺ | x)
        l⁻ ← log π_θ(y⁻ | x) − log π_ref(y⁻ | x)
        loss ← mean( log(1 + exp(−β · (l⁺ − l⁻))) )
        θ ← θ − α_stage · ∇_θ loss
return π_θ
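As a runnable toy version of this loop, one can collapse the policy to a single scalar parameter θ — the log-probability margin it assigns to y⁺ over y⁻ — and run the stages by plain gradient descent. This is a didactic sketch under that drastic simplification, not the paper's training code; stage names and learning rates are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def staged_dpo(stages, theta=0.0, beta=0.5):
    """Toy staged DPO: theta is the policy's log-prob margin for y+ over y-.

    Each stage is (name, lr, ref_margins), where ref_margins holds the
    frozen reference model's margin for each preference pair.
    """
    for name, lr, ref_margins in stages:
        for m_ref in ref_margins:  # one epoch, batch size 1
            # loss = -log sigmoid(beta * (theta - m_ref))
            # d(loss)/d(theta) = -beta * sigmoid(-beta * (theta - m_ref))
            grad = -beta * sigmoid(-beta * (theta - m_ref))
            theta -= lr * grad
    return theta

stages = [
    ("RI", 0.5, [0.0, 0.1, -0.2]),
    ("RR", 0.5, [0.2, 0.0, 0.3]),
    ("CQ", 0.05, [0.1, 0.2]),   # smaller LR for the final stage, as in PA-RAG
]
theta_final = staged_dpo(stages)
```

Because the DPO gradient always pushes the chosen-over-rejected margin up, θ increases monotonically here; in the full model the same signal is distributed across all parameters via backpropagation.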

4. Empirical Performance and Ablative Analysis

Evaluation across ASQA, WebQ, NQ, and TriviaQA demonstrates that PA-RAG achieves consistently higher exact match (EM), recall, and precision compared to:

  • Base LLM generator
  • Pipeline RAG methods (RetRobust-13B, Self-RAG-13B)
  • SFT-only baselines

Empirical results for Llama2-7b:

  • Base: EM = 35.61, Rec = 20.51, Prec = 34.51
  • +IFT: EM = 37.98, Rec = 75.69, Prec = 70.17
  • +PA-RAG (all DPO stages): EM = 46.16, Rec = 77.66, Prec = 76.82

Ablations reveal:

  • Each stage (IFT → IFT+RI → IFT+RI+RR → IFT+RI+RR+CQ) offers incremental performance gains.
  • DPO on preference data enables continual improvement, unlike SFT on mixed preferences, which triggers catastrophic forgetting (EM decreases when SFT is applied to RI+RR).
  • The order RI→RR is superior to RR→RI or mixing both at once.
  • Fluency metrics (MAUVE) are stable throughout training.

5. Theoretical Framework and Convergence Guarantees

The DPO objective is theoretically grounded in reward maximization under KL regularization. For a latent reward $r(x, y)$, the objective $\max_\pi \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ has the closed-form solution $\pi^*$ satisfying: $$\frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} = \frac{1}{Z(x)} \exp\left[\frac{r(x, y)}{\beta}\right]$$ where $\pi_{\mathrm{ref}}$ denotes the reference generator distribution and $Z(x)$ is a normalizing constant. PA-RAG does not require an explicit reward model; instead, the LLM's own capabilities for preference discrimination and generation suffice, effectively rendering it self-rewarding and self-improving. The DPO loss is convex in the log-probability margin, which aids stable optimization, and the staged schedule is credited with avoiding the catastrophic forgetting observed under SFT on mixed preference data.
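The closed form can be sanity-checked numerically on a toy discrete output space: the distribution $\pi_{\mathrm{ref}}(y)\exp(r(y)/\beta)/Z$ should attain a higher KL-regularized objective than any other distribution. A minimal check with invented toy numbers (not from the paper):

```python
import math
import random

def closed_form_policy(pi_ref, r, beta):
    # pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)
    w = [p * math.exp(ri / beta) for p, ri in zip(pi_ref, r)]
    z = sum(w)
    return [x / z for x in w]

def kl_objective(pi, pi_ref, r, beta):
    # E_pi[r] - beta * KL(pi || pi_ref)
    exp_r = sum(p * ri for p, ri in zip(pi, r))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return exp_r - beta * kl

pi_ref = [0.5, 0.3, 0.2]     # toy reference distribution over 3 outputs
rewards = [1.0, 0.2, -0.5]   # toy latent rewards
beta = 0.7
pi_star = closed_form_policy(pi_ref, rewards, beta)

# pi_star should dominate randomly sampled distributions on the objective.
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in pi_ref]
    pi = [x / sum(raw) for x in raw]
    assert kl_objective(pi_star, pi_ref, rewards, beta) >= kl_objective(pi, pi_ref, rewards, beta)
```

Since $\pi^*$ is the unique maximizer of this concave objective, the assertion holds for any sampled competitor; DPO exploits exactly this correspondence to train against preferences without ever materializing $r$.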

6. Implementation Details and Architectural Choices

Key settings:

  • Base architectures: Llama2-7b-Chat, Llama2-13b-Chat, Llama3-8b-Instruct
  • Hardware: 4×Nvidia A800 80GB, 1TB RAM
  • Citation verifier: TRUE (T5-11B) NLI

Training hyperparameters include phase-specific batch sizes and learning rates, adopted directly from the PA-RAG implementation.

| Training Phase | Dataset(s) Used | LR | Batch Size | Epochs |
|---|---|---|---|---|
| Instruction tuning | IFT (ChatGPT) | 2×10⁻⁵ | 128 | 1 |
| Preference alignment | RI, then RR, then CQ | 2×10⁻⁶ (RI, RR); 2×10⁻⁷ (CQ) | 64 | 1 |
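For reproduction, the schedule above can be encoded as a small config structure; the key and phase names here are our own, with values as reported in the table:

```python
# PA-RAG training schedule as summarized above (field names are illustrative).
SCHEDULE = [
    {"phase": "IFT",    "data": "ChatGPT rewrites", "lr": 2e-5, "batch_size": 128, "epochs": 1},
    {"phase": "DPO-RI", "data": "RI",               "lr": 2e-6, "batch_size": 64,  "epochs": 1},
    {"phase": "DPO-RR", "data": "RR",               "lr": 2e-6, "batch_size": 64,  "epochs": 1},
    {"phase": "DPO-CQ", "data": "CQ",               "lr": 2e-7, "batch_size": 64,  "epochs": 1},
]

def lr_for(phase):
    """Look up the learning rate for a named training phase."""
    return next(s["lr"] for s in SCHEDULE if s["phase"] == phase)
```

Note the order of the list is load-bearing: the stages must run RI, then RR, then CQ, matching the ablation finding that this ordering outperforms alternatives.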

7. Significance and Implications for RAG Development

DPO for RAG, as instantiated in PA-RAG, enables single LLM generators to integrate multi-dimensional refinement signals (informativeness, robustness, citation correctness) directly through preference data and staged optimization. This approach circumvents the pitfalls of SFT-based forgetting and avoids dependence on reranking or post-hoc modules. A plausible implication is that DPO-based end-to-end RAG may favorably scale to broader settings where direct multi-perspective alignment is essential, especially in domains demanding high factuality and citation reliability (Wu et al., 2024).
