
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Published 26 May 2025 in cs.LG and cs.CL | (arXiv:2505.19770v1)

Abstract: We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.

Summary

  • The paper analyzes the performance gap between RLHF and DPO, showing differences arise primarily from explicit and implicit representation gaps when reward or policy models are mis-specified.
  • Under policy model mis-specification, RLHF outperforms DPO by leveraging accurately modeled rewards, whereas DPO performs better with reward model mis-specification by optimizing solely based on preferences.
  • RLHF demonstrates better statistical efficiency than DPO when reward functions are sparse, requiring fewer samples for accurate estimation in large-scale data settings.

Performance Gap in Preference-Based Policy Learning: RLHF vs DPO

The paper Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO investigates the interplay between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) in preference-based policy learning. Its central focus is the performance discrepancy between the two methods, which arises predominantly from representation gaps when model specifications deviate from their ideal forms.

Analytical Insights

The authors dissect the performance gap into two primary sources: explicit and implicit representation gaps. The explicit gap arises under exact optimization conditions, in which the representation capabilities of reward and policy model classes are crucial. Here, both RLHF and DPO are thoroughly examined to determine their respective advantages and deficiencies under various model mis-specifications.

  1. Exact Optimization with No Mis-specification: The study confirms that when both the reward and policy models are perfectly specified (realizable ground truth), RLHF and DPO converge upon the same optimal policy. This equivalence suggests that under ideal conditions, the choice between RLHF and DPO is largely a matter of procedural preference rather than performance.
  2. Policy Model Mis-specification: When the policy model cannot realize the optimal solution, RLHF demonstrates superiority by optimizing the reward model first, thereby achieving better policy outcomes than DPO. This reflects RLHF’s proficiency in leveraging accurately modeled rewards to enhance policy learning.
  3. Reward Model Mis-specification: Conversely, when the reward model is mis-specified, DPO can still achieve an optimal policy defined solely by preferences, outperforming RLHF which might be constrained by its sub-optimal reward model.
  4. Double Mis-specification: The interplay between reward and policy classes becomes complex in scenarios of mutual mis-specification. When reward and policy model classes are isomorphic, RLHF and DPO exhibit similar performance, yet online DPO may offer improved outcomes. The findings stress the importance of model architecture choice, advocating for scenarios where DPO’s iterative refinements can leverage feedback loops.
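
All four cases compare policies under the same KL-regularized objective. In the standard formulation from the RLHF/DPO literature (the notation here, with β the regularization weight and π_ref the reference policy, follows common usage rather than text quoted from the paper):

```latex
% Shared KL-regularized objective for both RLHF and DPO
V(\pi) \;=\; \mathbb{E}_{y \sim \pi}\!\left[ r^\star(y) \right]
        \;-\; \beta\,\mathrm{KL}\!\left(\pi \,\middle\|\, \pi_{\mathrm{ref}}\right)

% Its closed-form maximizer, which DPO assumes the policy class can realize:
\pi^\star(y) \;=\; \frac{\pi_{\mathrm{ref}}(y)\,\exp\!\left(r^\star(y)/\beta\right)}{Z},
\qquad Z \;=\; \sum_{y'} \pi_{\mathrm{ref}}(y')\,\exp\!\left(r^\star(y')/\beta\right)

% Inverting this map yields DPO's implicit ("surrogate") reward, up to a constant:
\hat r_\theta(y) \;=\; \beta \log \frac{\pi_\theta(y)}{\pi_{\mathrm{ref}}(y)} \;+\; \text{const}
```

Mis-specification then means either that no $r_\phi$ in the reward class matches $r^\star$, or that no $\pi_\theta$ in the policy class matches $\pi^\star$, which is exactly what separates the four cases above.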

The paper also explores the approximate optimization setting, highlighting statistical-efficiency gaps between RLHF and DPO. In particular, it shows that RLHF can exploit sparse structure in the reward function, markedly reducing sample complexity compared to DPO. This advantage is most pronounced at scale, where reward sparsity allows RLHF to reach accurate estimates with far fewer samples.

Empirical Substantiation

Empirical results reinforce theoretical claims, indicating that RLHF’s structured two-stage approach can consistently outperform DPO when reward learning is efficiently integrated, especially in realistic settings where computational constraints prevail. Experimental data illustrate the robustness of RLHF in adapting to various preference learning contexts, while emphasizing the pitfalls DPO faces when policy models are inaccurately specified.

Implications and Future Directions

The paper clarifies the conditions under which RLHF or DPO should be adopted. Practitioners in AI and machine learning gain a clearer picture of when to favor each method, for instance when reward or policy realizability is uncertain or computational resources are limited. The theoretical frameworks introduced not only explain current methodological choices but also point toward future preference-based policy learning algorithms, particularly ones that handle sparse and mis-specified conditions gracefully.

Future research might explore the potential of hybrid models that combine the strengths of RLHF and DPO, crafting algorithms capable of dynamically selecting reward and policy optimization pathways based on the model specifications and environment constraints. The findings could also inform the development of more adaptive frameworks that leverage the inherent strengths of RLHF’s two-stage learning, enhancing statistical efficiency and policy accuracy in broader applications.


Explain it Like I'm 14

What is this paper about?

This paper compares two popular ways to train AI systems using human preferences:

  • RLHF (Reinforcement Learning from Human Feedback): first learns a “reward model” (a judge that scores answers), then trains the AI (the policy) to get high scores.
  • DPO (Direct Preference Optimization): skips training a separate judge and directly adjusts the AI to match human preferences.

The authors ask: when does each method work better, and why? They show that differences in what the models can represent (their “expressive power”) and how much data you have can create a performance gap between RLHF and DPO.

What big questions does the paper ask?

The paper studies, in simple terms:

  • If both methods had perfect data and training, would they perform the same?
  • What happens if the “judge” model (reward model) is too simple to match reality, or if the “actor” model (policy model) is too simple?
  • Can collecting new data while training (online DPO) help?
  • With limited data, which method learns faster and more reliably?

How did the researchers study this?

They use a mix of math and small experiments to explore two settings:

1) Exact optimization (infinite data, perfect training)

  • Imagine you can train forever and have all the preference data you need.
  • Here, the only limitation is what your models can represent (like whether your “toolbox” is big enough to build the right thing). This is called a “representation gap” or “model mis-specification.”

They analyze four cases:

  • No mis-specification (both models are strong enough)
  • Policy is too weak (the actor can’t express the best behavior)
  • Reward is too weak (the judge can’t score correctly)
  • Both are weak (they can’t represent the true solution well in different ways)

They also study “online DPO,” where the AI keeps collecting new preference data during training to improve faster.

2) Approximate optimization (finite data, real training)

  • In real life, you only have limited comparisons from humans.
  • The authors build a simple example where the true “judge” (reward) is mostly zero except for a few important parts (this is called “sparse”). Think of a music equalizer with many sliders but only a few are actually important.
  • They compare how many samples each method needs to learn well (sample efficiency).

Key ideas explained simply

  • Reward model (RLHF): a separate “judge” that scores answers. The policy then learns to get high scores from this judge.
  • DPO’s “surrogate reward”: instead of a separate judge, DPO estimates a kind of “fake reward” from how it changes the probabilities of answers. This can be harder to learn well with limited data.
  • Bradley–Terry model: a standard way to turn “A is better than B” votes into training signals by comparing scores.
  • KL regularization: a gentle rule that keeps the AI from drifting too far from its starting behavior (prevents wild, unsafe changes).
  • “Isomorphic” model classes: the reward model and policy model have roughly equal expressive power.
  • “Sparse” reward: only a few features matter; learning these few is much easier than learning everything.
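
These ideas can be made concrete with a tiny numeric sketch. This is a generic illustration of the Bradley–Terry probability and the KL-regularized closed-form policy from the RLHF/DPO literature, not code from the paper:

```python
import math

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry: P(A preferred over B) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def kl_regularized_optimal_policy(rewards, ref_policy, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || pi_ref):
    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for r, p in zip(rewards, ref_policy)]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

# Toy example: three possible answers, uniform reference policy.
rewards = [1.0, 0.0, -1.0]
ref = [1 / 3] * 3

pi = kl_regularized_optimal_policy(rewards, ref, beta=0.5)
# A larger beta keeps the policy closer to the reference (stronger KL pull).
pi_conservative = kl_regularized_optimal_policy(rewards, ref, beta=5.0)
```

The `pi_conservative` run shows the KL regularizer at work: with a large β the optimal policy stays near uniform even though the rewards clearly rank the answers.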

What did they find?

Main findings in the perfect-data world (exact optimization)

  • No mis-specification (both models strong):
    • RLHF and DPO can both reach the same best policy.
    • Online DPO can make the training path closer to the true goal, helping it converge well.
  • Policy too weak (actor can’t express the best behavior):
    • RLHF is at least as good as DPO, and can be better.
    • Even online DPO cannot always catch up.
    • Why: RLHF learns the true judge first, then finds the best policy possible within its limits. DPO tries to jump straight to the policy and can get stuck.
  • Reward too weak (judge can’t correctly represent human preferences):
    • DPO is at least as good as RLHF, and can be better.
    • Why: RLHF depends on a flawed judge; DPO avoids that by not training a separate judge.
  • Both reward and policy are weak:
    • If they’re “isomorphic” (roughly equally expressive), RLHF and DPO perform about the same.
    • In this case, online DPO can actually outperform both by collecting better training pairs during learning.
    • If they’re not matched in power, there’s no one-size-fits-all winner; it depends on the task.

Main findings with limited data (approximate optimization)

  • In a simple task where the true reward is “sparse” (only a few features matter), RLHF learns much faster and needs fewer samples than DPO.
  • Roughly:
    • DPO’s estimation error scales like “dimension/number of samples,” which can be slow when the system is large.
    • RLHF’s error scales like “(number of important features × log of dimension)/number of samples,” which is much better if only a few features matter.
  • Translation: If the judge (reward) is simple, learning the judge first (RLHF) is more data-efficient than trying to learn the policy directly (DPO).
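
The two rates can be felt in a generic sparse-regression toy. This is not the paper's DTSP construction; a plain linear model stands in for reward estimation, with a dense estimator paying for all d dimensions and a sparsity-aware one paying only for the k that matter:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 100, 3               # samples, feature dimension, true sparsity
theta_true = np.zeros(d)
theta_true[:k] = [2.0, -1.5, 1.0]   # only k features carry reward signal

X = rng.standard_normal((n, d))
y = X @ theta_true + rng.standard_normal(n)  # noisy reward observations

# Dense estimator (analogous to learning the surrogate reward directly):
# ordinary least squares, whose squared error grows like d / n.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sparsity-aware estimator (analogous to RLHF with a sparse reward model):
# keep only the k largest coefficients; error scales like k * log(d) / n.
theta_sparse = np.zeros(d)
top = np.argsort(np.abs(theta_ols))[-k:]
theta_sparse[top] = theta_ols[top]

err_dense = np.linalg.norm(theta_ols - theta_true)
err_sparse = np.linalg.norm(theta_sparse - theta_true)
```

On this seed the sparse estimator's error is several times smaller, mirroring the k log(d)/n versus d/n gap described above.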

Experiments

  • The authors ran small-scale tests on a safety dataset for LLMs.
  • Results matched their theory: RLHF tended to do better when the policy model was limited or when data were limited and the reward was simple; DPO shined when the reward model was too limited; online DPO helped in some matched cases.

Why does this matter?

  • Choosing between RLHF and DPO:
    • Pick RLHF when:
      • Your policy model is limited.
      • The true reward is likely simple or “sparse.”
      • You care about sample efficiency (doing more with fewer human comparisons).
    • Pick DPO when:
      • You don’t trust your reward model to represent preferences well.
      • You prefer a simpler, more stable training loop without reinforcement learning.
    • Consider online DPO when:
      • Reward and policy are similarly expressive, and you can gather data while training.
  • Practical impact:
    • Better AI training: Knowing when each method works best can save time, data, and compute.
    • Safer systems: Regularization and the right method choice help avoid unstable or biased behavior.
    • Research insight: Two-stage learning (RLHF) can have a statistical edge because it isolates a simpler target (the judge) before training the policy.
  • Limitations the authors note:
    • They rely on a standard comparison model (Bradley–Terry) for preferences, which isn’t perfect in all cases.
    • Experiments were done with relatively small models; more testing at larger scales would be helpful.
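
The checklist above can be distilled into a toy decision helper. This is a hypothetical sketch: it encodes the qualitative regimes only, and the caller must judge mis-specification themselves (e.g., from frozen layers, head sizes, or known capacity bottlenecks):

```python
def recommend_method(reward_misspecified: bool, policy_misspecified: bool,
                     can_sample_online: bool = False) -> str:
    """Map a mis-specification regime to the method favored in that regime."""
    if policy_misspecified and not reward_misspecified:
        return "RLHF"        # strong reward, weak policy -> two-stage wins
    if reward_misspecified and not policy_misspecified:
        return "DPO"         # weak reward, strong policy -> skip the judge
    if reward_misspecified and policy_misspecified:
        # Roughly isomorphic classes: RLHF ~ DPO; online sampling can do better.
        return "online DPO" if can_sample_online else "RLHF or DPO"
    # Both realizable: either works; online data can speed convergence.
    return "online DPO" if can_sample_online else "RLHF or DPO"
```

For example, a limited policy model with a trusted reward model (`recommend_method(False, True)`) lands on RLHF, matching the findings above.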

In short: There isn’t a universal winner between RLHF and DPO. Which one you should use depends on how powerful your models are and how much data you have. If the reward is simple or you have limited data, RLHF often wins. If your reward model can’t capture human preferences well, DPO can be safer. And online DPO can sometimes beat both by gathering better training examples on the fly.

Knowledge Gaps

Below is a concise, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions that can guide future work.

  • Extend analyses beyond the multi-armed bandit setting to full sequential decision-making (MDPs) with state, trajectory credit assignment, and PPO-style actor-critic components typical in LLM RLHF.
  • Characterize how KL regularization at the sequence vs. token level, and dynamic reference policy updates (e.g., EMA of SFT), impact the RLHF–DPO performance gap.
  • Provide necessary and sufficient conditions (not just existence proofs) for when RLHF, offline DPO, or online DPO is superior under different model mis-specification regimes.
  • Analyze intermediate regimes with partially overlapping reward and policy model classes (beyond strict isomorphism, strict inclusion, or strict superset), and derive performance comparisons under these overlaps.
  • Develop a principled framework for sampler design in online DPO that provably reduces the objective mismatch and closes the implicit representation gap without requiring strong closeness assumptions (small δ) to the ground-truth reward.
  • Quantify how reward scale (Rmax, β) affects the approximation in Theorem 4 and the empirical observation that online DPO degrades at larger reward scales; derive scale-robust samplers or objectives.
  • Replace or generalize the Bradley–Terry preference model to handle real-world violations (non-transitivity, heteroskedastic noise, annotator bias, multi-item rankings) and analyze how such misspecification shifts the RLHF–DPO gap.
  • Formalize and test robustness to preference noise: characterize breakdown points and develop robust variants of reward learning and DPO that maintain performance under adversarial or high-variance annotator signals.
  • Provide a theoretically grounded and practically effective alternative to BT-based reward modeling (acknowledged limitation), including a concrete algorithmic proposal and guarantees.
  • Generalize the sparse recovery separation results beyond the DTSP construction to broader architectures (deep, non-linear feature maps), longer sequences, and multi-token dependencies; quantify sample complexity under realistic LLM settings.
  • Establish distribution-free (worst-case) bounds rather than “there exists an environment” constructions to characterize typical-case performance gaps across tasks and datasets.
  • Investigate whether and how real-world reward functions are sparse in practice; develop empirical diagnostics for sparsity and leverage structured sparsity (group, hierarchical) to improve RLHF sample efficiency.
  • Provide a formal, data-driven estimator and confidence bounds for the proposed reward quality metric (pairwise difference ℓ2 distance), and relate it to downstream value gap with finite samples.
  • Analyze data coverage effects (offline DPO OOD exploitation, support mismatch) and derive conditions under which online data collection mitigates coverage-induced biases in both RLHF and DPO.
  • Compare the paper’s “pairwise policy gradient” implementation to standard PPO (advantage normalization, KL control, baselines), and provide theory explaining when pairwise PG preserves or loses RLHF advantages.
  • Study hyperparameter sensitivity (β, KL weight, sampling ratios) and provide guidelines or adaptive schemes that provably optimize the trade-off between regularization, stability, and final performance.
  • Bridge the gap between theoretical assumptions (exact optimization, infinite preference data, linear/log-linear classes) and practical LLM training (non-convex neural nets, finite compute), with results that survive realistic optimization noise.
  • Address the reliance on fixed, external “ground-truth reward oracles” for evaluation; propose human-in-the-loop or model-agnostic evaluation protocols that do not assume access to r⋆.
  • Extend the analysis to multi-preference modalities (safety, helpfulness, harmlessness) and joint optimization settings (multi-task or multi-objective RLHF/DPO); characterize conflicts and Pareto front behavior.
  • Provide constructive algorithms for the “isomorphic mis-specification” regime where online DPO can outperform both RLHF and offline DPO; include sampler, objective, and convergence guarantees.
  • Quantify how policy model mis-specification scales with parameter count and architecture (transformer depth/width), and provide formal results showing when reward models tend to be less mis-specified than policy models.
  • Examine partial or progressive freezing strategies (as used in experiments) more systematically and derive theoretical insights into their impact on mis-specification and performance gaps.
  • Investigate calibration of the surrogate reward (up to constants) and its effect on learned policies under finite samples; develop methods to correct offsets/scales in DPO-like objectives.
  • Scale the empirical validation to larger models and diverse datasets; evaluate whether the theoretical dichotomy holds under production-like conditions and with human evaluations.

Practical Applications

Immediate Applications

Below are practical uses that can be deployed now, drawing on the paper’s findings about when RLHF, DPO, or online DPO are preferable and how to improve training efficiency and robustness.

  • Alignment-training decision checklist and workflow selection (software/AI industry; academia)
    • What: A lightweight protocol to choose RLHF vs DPO vs online DPO based on observed or anticipated mis-specifications:
      • Strong reward model, weak policy model → prefer RLHF (two-stage) to recover best achievable policy under constraints.
      • Weak reward model, strong policy model → prefer DPO to avoid reward-model error.
      • Both models realizable → either RLHF or DPO; online DPO can accelerate convergence with appropriate sampling.
      • Isomorphic reward/policy classes → RLHF ≈ DPO; online DPO can outperform via iterative sampling.
    • Tools/products: A “decision assistant” script/notebook that checks model sizes (reward vs policy), training constraints (frozen layers, compute), and dataset coverage to recommend the training pipeline.
    • Assumptions/dependencies: Availability of pairwise preference data that fits the Bradley–Terry model; basic ability to quantify model capacity/mis-specification (e.g., layer freezing, parameter counts, known bottlenecks).
  • PILAF-style sampler module for online DPO (software/AI industry)
    • What: Implement the PILAF sampler to make online DPO’s objective approximate the regularized value function and speed convergence.
    • Tools/products: A training component that mixes current-policy sampling with a distributional correction as in PILAF; integrates into existing DPO trainers.
    • Assumptions/dependencies: Reasonable estimate of reward-policy differences and bounded reward scale; iterative training workflow; stability monitoring as large reward scales can amplify approximation error.
  • Sparse reward modeling for data efficiency (software/AI industry; education; healthcare; safety)
    • What: Use an LLM reward head with L1/L0 regularization to exploit implicit sparsity in rewards, improving sample efficiency versus surrogate reward learning in DPO.
    • Tools/products: “Sparse Reward Head Trainer” with L1/L0 constraints; feature selection utilities to identify sparse structures in reward features; training recipes that freeze most layers and only fit a linear head.
    • Assumptions/dependencies: Reward signals are implicitly sparse (common for safety/harmlessness or simple preference criteria); sufficient feature quality from the LLM backbone; small-to-moderate amounts of preference data.
  • Budget-aware annotation planning (policy; industry; finance)
    • What: Use the paper’s sample-efficiency insights to plan preference annotation budgets—favor RLHF when rewards are sparse or data is limited to reduce required samples and cost.
    • Tools/products: “Preference Budget Estimator” that translates target error bounds (e.g., O(√(k log d / n))) into approximate labeling needs; integrates with MLOps planning dashboards.
    • Assumptions/dependencies: Basic estimates of feature dimension d and sparsity k; consistent BT-style preferences; reliable data collection pipeline.
  • Mis-specification diagnostics and monitoring (software/AI industry; academia)
    • What: Measure the “reward-model quality gap” via pairwise difference L2 norms and track regularized value improvements to detect harmful mis-specification (policy too weak, reward too weak).
    • Tools/products: “Representational Gap Dashboard” reporting:
      • Empirical L2 error on preference differences for reward vs surrogate reward.
      • Regularized value V under current policy.
      • Capacity indicators (parameters, frozen blocks).
    • Assumptions/dependencies: Access to evaluation preferences and instrumentation in training loops; consistent KL-regularization settings; a stable base policy reference.
  • Safer LLM fine-tuning under constrained compute (software/AI industry; safety)
    • What: When deployment requires freezing many layers or small optimization steps, use RLHF with a competent reward model (often more robust than DPO under weak policy capacity) to push toward safer behaviors.
    • Tools/products: PPO (or pairwise policy-gradient) tuned with advantage normalization and KL regularization; reward oracles for harmfulness; simple reward scaling (“margin” tuning).
    • Assumptions/dependencies: Reward model is not severely mis-specified; stable RL training pipeline; availability of safety preference datasets.
  • Robotics and edge AI preference learning under limited model capacity (robotics; IoT)
    • What: For on-device learning where the policy class is limited (small models, frozen layers), prefer two-stage RLHF to achieve the best policy within constraints.
    • Tools/products: Lightweight reward models running on edge devices; periodic cloud-assisted PPO updates using stored preferences.
    • Assumptions/dependencies: BT-like preference signals available (from users/operators); manageable connectivity for periodic updates; policy-class constraints acknowledged.
  • Benchmarking and reporting practices (academia; policy)
    • What: Standardize reporting of regularized value metrics and mis-specification diagnostics to make RLHF vs DPO comparisons fair and reproducible.
    • Tools/products: Benchmark suites capturing:
      • Reward scale sensitivity.
      • Online vs offline sampling regimes.
      • Frozen-layer settings to simulate weak policy classes.
    • Assumptions/dependencies: Community consensus on regularized metrics; access to preference datasets; reproducible training configs.
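
To make the “Preference Budget Estimator” idea above concrete, the error rates quoted earlier can be inverted into rough sample budgets. The functions below are a hypothetical sketch: the theory’s constants are omitted, so the outputs are order-of-magnitude planning numbers, not bounds.

```python
import math

def sparse_rlhf_budget(k: int, d: int, target_err: float) -> int:
    """Invert the RLHF-style rate err ~ sqrt(k * log(d) / n):
    solving for n gives n ~ k * log(d) / err^2 (constants dropped)."""
    return math.ceil(k * math.log(d) / target_err ** 2)

def dense_dpo_budget(d: int, target_err: float) -> int:
    """Same inversion for the dense DPO-style rate err ~ sqrt(d / n)."""
    return math.ceil(d / target_err ** 2)

# Example: d = 10,000 features, k = 5 of them relevant, target error 0.1.
n_rlhf = sparse_rlhf_budget(5, 10_000, 0.1)
n_dpo = dense_dpo_budget(10_000, 0.1)
```

With these illustrative numbers the sparse budget is hundreds of times smaller than the dense one, which is the k ≪ d regime where the paper’s separation is largest.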

Long-Term Applications

Below are applications that need further research, scaling, or development before broad deployment.

  • Automated algorithm selection and hybrid training (software/AI industry; academia)
    • What: Systems that automatically infer whether the environment is in a “strong-reward/weak-policy” regime, detect isomorphism between reward and policy classes, and switch among RLHF, DPO, and online DPO—or blend them.
    • Potential workflows/products: AutoML-like “Alignment Orchestrator” that:
      • Runs short diagnostics to estimate mis-specification and sample efficiency.
      • Chooses training mode, sampler, and regularization.
      • Adaptively transitions from reward modeling to online DPO as data and capacity evolve.
    • Assumptions/dependencies: Reliable diagnostics for model isomorphism and quality gaps; robust online training infrastructure; safety guardrails for mode-switching.
  • Architectures with isomorphic reward-policy parameterizations (software/AI industry; academia)
    • What: Design neural architectures that intentionally make the reward class and policy class isomorphic, enabling RLHF ≈ DPO guarantees and simplifying pipeline choices; then exploit online DPO’s potential performance edge.
    • Tools/products: Joint parameterization libraries; reward-policy mapping layers; constrained head designs to align classes.
    • Assumptions/dependencies: Careful theoretical guarantees and empirical validation; maturity of tooling to enforce and verify isomorphism; scalable training.
  • Active preference learning and adaptive sampling (software/AI industry; policy; finance)
    • What: Combine online DPO (with PILAF-like samplers) and active learning to prioritize informative comparisons, further improving sample efficiency for alignment tasks and reducing annotation costs.
    • Tools/products: Active preference selection services; dynamic sampler APIs; cost-aware labeling planners.
    • Assumptions/dependencies: Stable online training; human-in-the-loop pipelines; reliable uncertainty estimates for pairwise preferences.
  • Reward modeling standards and governance (policy; standards bodies)
    • What: Guidelines for disclosing training choices and diagnostics (e.g., when RLHF vs DPO was used, reward sparsity assumptions, sample budgets), helping regulators and customers assess alignment quality and risks.
    • Tools/products: Reporting templates and audit frameworks; metrics for reward-model quality and policy mis-specification; third-party certification programs.
    • Assumptions/dependencies: Consensus on metrics and disclosures; cooperation from model providers; alignment with privacy and safety regulations.
  • Sector-specific sparse reward libraries (healthcare; education; safety; finance)
    • What: Curated sparse reward heads tailored to domain tasks (e.g., correctness and safety in healthcare Q&A; bias and clarity in education; compliance and risk in finance) to leverage RLHF’s sample-efficiency advantage.
    • Tools/products: Domain reward templates; feature engineering packs; evaluation suites with domain-specific pairwise datasets.
    • Assumptions/dependencies: High-quality domain data; trustworthy domain evaluators; robust mapping of domain criteria to sparse features.
  • Process reward and hierarchical alignment (software/AI industry; academia)
    • What: Move beyond terminal rewards to token- or step-level “q-function” modeling where appropriate, while mitigating the statistical inefficiency identified for surrogate reward learning; combine sparse terminal rewards with structured process signals.
    • Tools/products: Hybrid process/terminal reward toolkits; hierarchical reward heads; curriculum training recipes.
    • Assumptions/dependencies: More research into stable and efficient process reward learning; availability of rich feedback (rationales, critiques); scalable token-level training.
  • Safety-by-design for limited compute environments (robotics; IoT; public-sector deployments)
    • What: Design alignment strategies for devices with strict compute budgets—favor RLHF with sparse reward heads and constrained policy updates; integrate periodic online DPO updates when device or cloud resources allow.
    • Tools/products: Edge safety alignment SDKs; cloud-assisted preference training services; compliance-grade monitoring.
    • Assumptions/dependencies: Reliable on-device feature extraction; intermittent connectivity; safe fallback behaviors during updates.
  • Education and workforce training on preference learning (academia; education)
    • What: Curricula and practical labs that teach the dichotomy between RLHF and DPO, mis-specification regimes, and sample-efficiency trade-offs; prepare practitioners to design data-efficient alignment pipelines.
    • Tools/products: Teaching modules (DTSP tasks, sampler implementations), open datasets, reproducible notebooks.
    • Assumptions/dependencies: Access to small-scale models and datasets; institutional support for applied ML education.
  • Cost-aware MLOps planning and governance (finance; large enterprises)
    • What: Holistic planning tools that combine annotation budgets, compute constraints, and expected error bounds to select alignment strategies and forecast ROI.
    • Tools/products: MLOps planners integrating preference-label cost models, sample-efficiency calculators, and training stability metrics.
    • Assumptions/dependencies: Integration with enterprise data platforms; reliable measurement of task sparsity and feature dimension; governance policies for alignment quality.

Notes on general assumptions across applications:

  • The Bradley–Terry model for pairwise preferences is assumed.
  • KL-regularized value function serves as the core performance metric.
  • Mis-specification conditions depend on the relative capacity of reward vs policy classes (e.g., smaller reward heads vs large policies; frozen layers).
  • Statistical efficiency gains from sparse reward modeling depend on actual sparsity k and feature dimension d; gains are largest when k ≪ d.
  • Online DPO’s benefits depend on good sampler design and stable iterative updates; large reward scales may amplify approximation errors if not controlled.
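
Several applications above lean on a sparse, L1-regularized Bradley–Terry reward head. A minimal numpy sketch of that idea, fitted with proximal gradient descent (ISTA), is below; it is illustrative, not the paper’s implementation, and names like `lam` and `fit_sparse_reward_head` are ours:

```python
import numpy as np

def fit_sparse_reward_head(diff_feats, prefs, lam=0.05, lr=0.1, steps=2000):
    """L1-regularized linear Bradley-Terry reward head via proximal gradient.

    diff_feats[i] is phi(y_winner) - phi(y_loser) for the i-th comparison;
    prefs[i] = 1 means the first response was preferred.
    """
    n, d = diff_feats.shape
    w = np.zeros(d)
    for _ in range(steps):
        logits = diff_feats @ w
        p = 1.0 / (1.0 + np.exp(-logits))         # BT win probability
        grad = diff_feats.T @ (p - prefs) / n      # logistic-loss gradient
        w -= lr * grad
        # Soft-threshold (proximal step for the L1 penalty): zeroes out
        # coordinates whose signal does not exceed the regularization.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

# Toy data: 200 comparisons, 50-dim features, only 2 features truly matter.
rng = np.random.default_rng(1)
w_true = np.zeros(50)
w_true[:2] = [1.5, -1.0]
diffs = rng.standard_normal((200, 50))
probs = 1.0 / (1.0 + np.exp(-(diffs @ w_true)))
labels = (rng.random(200) < probs).astype(float)

w_hat = fit_sparse_reward_head(diffs, labels)
```

The L1 penalty drives most irrelevant coefficients exactly to zero while keeping the two true signal dimensions, which is the mechanism behind the sample-efficiency gains discussed above.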

Glossary

  • Approximate optimization: Optimization in the finite-sample regime where statistical errors affect solutions. "For approximate optimization, i.e., the finite-sample regime, we study the implicit representation gap incurred by statistical efficiencies in \Cref{sec:approximate_optimization}."
  • Autoregressive mechanism: The sequential token-generation process used by LLMs, where outputs depend on previous tokens. "Given the autoregressive mechanism of LM, it is natural to consider the prompt and response together and view the combination as a whole action."
  • Bradley–Terry (BT) model: A probabilistic model of pairwise comparisons that maps preference likelihoods to reward differences via a sigmoid. "The reward modeling stage assumes human preferences follow the Bradley-Terry (BT) model \citep{BTmodel}, allowing a prompt-response pair to be assigned a scalar reward."
  • Coverage conditions: Assumptions on the support of sampled data that influence convergence guarantees of learning algorithms. "and milder coverage conditions~\citep{song2024importanceonlinedataunderstanding,xiong2024iterative}, than vanilla DPO."
  • Direct Preference Optimization (DPO): A method that directly optimizes policies from preference data, bypassing explicit reward modeling. "Another popular algorithm in this area is direct preference optimization~(DPO, \citet{rafailov2023direct}), which utilizes the closed-form solution (assuming realizability as well) for the policy optimization stage to bypass the reward modeling stage and directly fine-tune the base LM as a policy model \pi_\theta using the preference dataset."
  • Dual-token Sparse Prediction (DTSP): A constructed task with two sequential tokens and a sparse ground-truth reward to analyze statistical efficiency. "Dual-token Sparse Prediction~(DTSP).\quad Under log-linear policy class, the policy model is required to sequentially output two tokens, $y$ and $\omega$, and the ground-truth reward is:"
  • Isomorphic model classes: Reward and policy model classes that can be mapped deterministically to each other, enabling direct comparison. "Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified."
  • Kullback–Leibler (KL) divergence: A measure of difference between probability distributions used as a regularizer in policy optimization. "based on $r_\phi$ under a Kullback-Leibler~(KL) divergence-regularized multi-armed bandit setting."
  • Linear MDP model: A linear parameterization of rewards over features in a Markov decision process framework. "it is natural to assume the reward model to be parameterized as a linear MDP model"
  • Log-linear policy class: A policy parameterization where log-probabilities are linear in features, common in structured prediction. "assuming linear reward class and log-linear policy class"
  • Maximum likelihood estimation (MLE): A statistical method to fit models by maximizing the likelihood of observed data. "by maximizing the population MLE objective:"
  • Multi-armed bandit: A decision-making framework with multiple actions (arms) and unknown rewards, often used to model policy selection. "A multi-armed bandit is defined by an action space $Y$ and a reward function $r: Y \to \mathbb R$."
  • Online DPO: A variant of DPO that samples preference pairs from a policy-dependent distribution during training. "We also consider an online variant of DPO, where the pairwise data are sampled from a distribution which could depend on the current policy."
  • Partition function: The normalization constant ensuring probabilities sum to one in exponential-family forms of optimal policies. "$Z:=\sum_{y\in\mathcal Y}\pi_{\text{ref}}(y)\exp(r^\star(y)/\beta)$ is the partition function."
  • PILAF Sampler: A mixture sampling strategy for online DPO designed to approximate value optimization. "PILAF Sampler is a probabilistic mixture of two sampler pairs:"
  • Preference-based policy learning: The paradigm of learning policies from pairwise preference signals instead of absolute rewards. "The above RLHF paradigm falls inside a broader problem, preference-based policy learning~\citep{Wirth2017ASO}."
  • Proximal Policy Optimization (PPO): A reinforcement learning algorithm that stabilizes policy updates via clipped objectives. "the base LM is ``online'' fine-tuned with RL algorithms such as proximal policy optimization~(PPO, \citet{schulman2017proximal})"
  • q function: A recursive value function used to compute optimal token-level policies in regularized settings. "where the $q$ function is determined in a recursive way:"
  • Realizability: The assumption that the ground-truth reward or optimal policy lies within the chosen model class. "The key assumption behind DPO's design is the realizability of the closed-form solution of the optimal policy."
  • Regularized value function: The performance metric combining expected reward with a regularization term (e.g., KL) to control policy deviation. "For a reward function $r$ and policy $\pi$, we define the regularized value function as:"
  • Reinforcement Learning from Human Feedback (RLHF): A two-stage paradigm that learns a reward model from preferences and then optimizes a policy against it. "Reinforcement learning from human feedback (RLHF, \citet{Christiano2017DeepRL,Ziegler2019FineTuningLM}) is an important paradigm improving the natural language understanding and generation capabilities of LLMs."
  • Representation gap: The performance difference induced by limitations in model class expressiveness or statistical efficiency. "We analyze the behavior of RLHF and DPO in the idealized setting of exact optimization, where both methods have access to infinite preference data and can optimize their respective objectives without statistical or computational error. Our focus is on the performance gap induced by the representation gap"
  • Reward model class: The family of parameterized reward functions considered during learning. "Let $\mathcal{F} = \{r_\phi : \phi \in \mathbb{R}^{d_R}\}$ denote the reward model class,"
  • Reward model mis-specification: A mismatch where the true reward cannot be represented within the chosen reward model class. "We now consider the setting where the ground-truth reward function $r^\star$ is not realizable by the reward model class $\mathcal{F}$, while the optimal policy $\pi^\star$ lies within the policy class $\Pi$."
  • Sigmoid function: The logistic function mapping real numbers to probabilities, used in the BT model. "Let $\sigma:\mathbb R\rightarrow \mathbb R$ be the sigmoid function, where $\sigma(x)=1/(1+\exp(-x))$."
  • Stopping-gradient operator (sg): An operator that halts gradient propagation through a computation for optimization stability or analysis. "$sg$ is the stopping-gradient operator, where $\nabla_\theta[sg\{f(\theta)\}]=\boldsymbol{0}$."
  • Surrogate reward: A constructed reward from the log ratio of policy and reference policy, enabling preference-based policy training without explicit reward learning. "By leveraging the surrogate reward $\hat r_\theta(y):=\beta\log\frac{\pi_\theta(y)}{\pi_{\text{ref}}(y)}$, DPO bypasses reward learning and directly learns the policy from preference data:"
  • Tabular parameterization: A discrete, table-based model representation often used to ensure realizability in theoretical analyses. "in \citet{rafailov2023direct}, both the reward class and policy class are tabular parameterized, making their optimal solutions realizable."
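Several of the glossary entries above (Bradley–Terry model, sigmoid function, surrogate reward, partition function) fit together in a short computation. A minimal sketch using illustrative scalar log-probabilities rather than an actual language model; the function names and inputs are hypothetical:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function sigma(x) = 1 / (1 + exp(-x)) from the glossary."""
    return 1.0 / (1.0 + math.exp(-x))

def bt_preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry: P(y_w preferred over y_l) = sigma(r(y_w) - r(y_l))."""
    return sigmoid(r_w - r_l)

def surrogate_reward(logp_theta: float, logp_ref: float, beta: float) -> float:
    """DPO surrogate reward: beta * log(pi_theta(y) / pi_ref(y))."""
    return beta * (logp_theta - logp_ref)

def dpo_loss(logp_theta_w: float, logp_theta_l: float,
             logp_ref_w: float, logp_ref_l: float, beta: float) -> float:
    """Per-pair DPO loss: negative BT log-likelihood of the preferred
    response, with surrogate rewards standing in for an explicit reward model."""
    margin = (surrogate_reward(logp_theta_w, logp_ref_w, beta)
              - surrogate_reward(logp_theta_l, logp_ref_l, beta))
    return -math.log(sigmoid(margin))

def kl_regularized_optimal_policy(rewards, ref_probs, beta: float):
    """Closed-form optimum of the KL-regularized objective:
    pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized by the
    partition function Z."""
    unnorm = [p * math.exp(r / beta) for r, p in zip(rewards, ref_probs)]
    Z = sum(unnorm)  # partition function
    return [u / Z for u in unnorm]
```

Note that when the policy equals the reference policy, both surrogate rewards vanish and the DPO loss reduces to -log(1/2) = log 2, which matches the BT model assigning probability 1/2 to either response.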

Open Problems

We found no open problems mentioned in this paper.
