
Preference-Based Policy Learning

Updated 19 January 2026
  • Preference-based policy learning is a framework that optimizes policies using human-provided pairwise preferences rather than predefined scalar rewards.
  • It employs probabilistic models like Bradley–Terry and contrastive objectives to directly guide policy improvements in complex, high-dimensional action spaces.
  • This approach has proven effective in applications such as LLM alignment, robotics, and wireless caching, enhancing robustness and sample efficiency.

Preference-based policy learning is a family of machine learning and control methods that optimize policies not from scalar rewards but from preferences over trajectories, episodes, or policy behaviors. Rather than assuming the task objective is specified as a reward function, preference-based approaches rely on pairwise or ordinal judgments (often human-generated) to drive policy improvement. These techniques now underpin fundamental reinforcement learning research, practical LLM alignment (e.g., RLHF), sample-efficient control, recommendation systems, and wireless systems with very large state/action spaces.

1. Mathematical Foundations and Preference Models

In preference-based policy learning, the agent’s ultimate performance criterion is determined by a (possibly unobservable) ordering over trajectories, behaviors, or outcome distributions. The preference feedback typically takes the form of pairwise labels: given two alternatives (e.g., trajectories τ⁰ and τ¹, or action sequences), an oracle (human or automatic) returns which is preferred. The most common probabilistic model is the Bradley–Terry (BT) or Bradley–Terry–Luce (BTL) model:

Pr(τ¹ ≻ τ⁰) = σ( r*(τ¹) − r*(τ⁰) ),

where r*(·) is the true, possibly latent, scalar utility or reward, and σ(·) is the sigmoid function (Zhan et al., 2023, Jia, 17 Nov 2025, Kim et al., 13 Jan 2026). Variants may operate at the trajectory, segment, token, or action level, depending on the application.
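The BT preference probability is simple to state in code. The following minimal sketch (the utility values are placeholder numbers, not taken from any cited paper) shows that equal utilities yield indifference and a higher-utility trajectory is preferred with probability above one half:

```python
import math

def bt_preference_prob(r1: float, r0: float) -> float:
    """Bradley-Terry probability that the trajectory with utility r1
    is preferred over the one with utility r0: sigmoid of the gap."""
    return 1.0 / (1.0 + math.exp(-(r1 - r0)))

# Equal utilities -> indifference (probability 0.5); a utility gap of
# +1 tilts the preference toward the better trajectory.
p_equal = bt_preference_prob(1.0, 1.0)   # 0.5
p_better = bt_preference_prob(2.0, 1.0)  # sigmoid(1) ~ 0.731
```

Note that the two orderings are complementary: Pr(τ¹ ≻ τ⁰) + Pr(τ⁰ ≻ τ¹) = 1 under this model.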

In some domains (e.g., wireless caching (Chen et al., 2017)), preference models can involve user-specific behavior, individual conditional preference vectors, and population-level heterogeneity, with policies parameterized to maximize expected offload or utility for each user. In conversational recommendation (Zhang et al., 2023), vague preferences are captured via soft estimation with time-aware decay and choice-based extraction.

2. Preference-based Policy Learning Algorithms

Two-Stage Approach: Reward Modeling + RL

The canonical PbRL pipeline first fits a parametric reward (or utility) model r̂_ψ from preference data, typically via likelihood maximization under a BT/BTL model:

min_ψ −E[ y log P_ψ(τ⁰ ≻ τ¹) + (1 − y) log P_ψ(τ¹ ≻ τ⁰) ],

where P_ψ is the softmax over cumulative rewards (Kang et al., 2023, Chen et al., 2017, Zhang et al., 2024). The learned r̂_ψ is then used as a surrogate scalar reward in an (offline or online) RL algorithm (e.g., SAC, PPO, IQL, or CQL).
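This likelihood-maximization step can be sketched with a toy linear reward model over hand-built trajectory features. Everything here (the features, learning rate, and data) is illustrative, assuming a linear r_ψ; real pipelines fit neural reward models with the same BT negative log-likelihood:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bt_nll(psi, pairs):
    """Mean negative log-likelihood of preferences under a linear
    reward r_psi(f) = psi . f; pairs hold (preferred, rejected)
    trajectory feature vectors."""
    loss = 0.0
    for fw, fl in pairs:
        gap = sum(p * (w - l) for p, w, l in zip(psi, fw, fl))
        loss -= math.log(sigmoid(gap))
    return loss / len(pairs)

def grad_step(psi, pairs, lr=0.1):
    """One gradient-descent step on the BT negative log-likelihood."""
    grad = [0.0] * len(psi)
    for fw, fl in pairs:
        gap = sum(p * (w - l) for p, w, l in zip(psi, fw, fl))
        coef = sigmoid(gap) - 1.0  # d(-log sigmoid(gap)) / d(gap)
        for i, (w, l) in enumerate(zip(fw, fl)):
            grad[i] += coef * (w - l)
    return [p - lr * g / len(pairs) for p, g in zip(psi, grad)]

# Toy data: preferred trajectories have a larger first feature.
pairs = [([2.0, 0.5], [1.0, 0.5]), ([3.0, 0.2], [1.5, 0.4])]
psi = [0.0, 0.0]
before = bt_nll(psi, pairs)
for _ in range(50):
    psi = grad_step(psi, pairs)
after = bt_nll(psi, pairs)  # strictly smaller than `before`
```

The fitted weights assign higher reward to the feature that separates preferred from rejected trajectories, which is exactly what the downstream RL algorithm then optimizes.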

Direct Policy Optimization without Reward Modeling

Recent methods bypass explicit reward modeling entirely. Instead, they optimize policies directly against observed preferences using contrastive objectives: the policy is rewarded for generating behaviors more similar to preferred demonstrations and less similar to non-preferred ones—without committing to a scalar reward function (An et al., 2023, Kang et al., 2023, Carr et al., 2023). For instance, DPPO defines a contrastive score based on the distance between the policy's actions and those in human-labeled segments, maximizing direct agreement (An et al., 2023). Preference-guided policy optimization (OPPO) constructs a high-dimensional context space for policy conditioning, then iteratively matches offline trajectory behaviors and solves for an optimal context aligning with preferences (Kang et al., 2023).
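To make the contrastive idea concrete, here is a hypothetical distance-based score in the spirit of DPPO: the policy's action is scored higher when it is closer to the preferred segment than to the rejected one. This is a sketch of the idea, not the paper's exact objective:

```python
import math

def contrastive_preference_score(action, preferred, rejected):
    """Hypothetical contrastive score: higher when the policy's action
    is closer (in squared distance) to the human-preferred segment's
    action than to the rejected one. Illustrates distance-based direct
    preference objectives; not DPPO's actual formula."""
    d_pref = sum((a - p) ** 2 for a, p in zip(action, preferred))
    d_rej = sum((a - r) ** 2 for a, r in zip(action, rejected))
    return 1.0 / (1.0 + math.exp(-(d_rej - d_pref)))

# An action matching the preferred behavior scores above 0.5;
# one matching the rejected behavior scores below 0.5.
agree = contrastive_preference_score([1.0], preferred=[1.0], rejected=[3.0])
disagree = contrastive_preference_score([3.0], preferred=[1.0], rejected=[3.0])
```

Maximizing such a score over the policy's actions pushes behavior toward preferred demonstrations without ever fitting a scalar reward.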

Regret-based and Policy-labeled Learning

To address likelihood mismatch in sequential, non-optimal-data settings (e.g., RLHF), policy-labeled preference learning (PPL) scores segments by their regret relative to their behavior policy, rather than treating all samples as optimal, resulting in more robust credit assignment (Cho et al., 6 May 2025).

Min-max and Distributionally Robust Approaches

PbPO formulates preference optimization for LLMs as a min-max game between the main policy and a reward model constrained within a confidence set derived from preference data, offering high-probability regret bounds and conservative improvement (Jia, 17 Nov 2025).

3. Preference Data Collection, Query Selection, and Efficiency

Query Design and Discriminability

Obtaining preference data from humans is resource-intensive, making query efficiency critical. Discriminability—how easily a human can distinguish the better of two trajectories—is a fundamental constraint (Kadokawa et al., 9 May 2025). To maximize learning per query, DAPPER generates queries by comparing behaviors from independently trained policies rather than within-policy trajectories, ensuring greater diversity and higher discriminability. The querying process is guided by a learned discriminator that estimates distinguishability, actively sampling “most labelable” pairs.
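The selection step itself reduces to an argmax over a learned distinguishability estimate. The sketch below assumes a hypothetical discriminator callable mapping a candidate pair to a score in [0, 1]; it illustrates discriminability-aware querying in general, not DAPPER's exact procedure:

```python
def select_query(candidate_pairs, discriminability):
    """Pick the segment pair a human can most easily label, as judged
    by a (hypothetical) learned discriminability estimator that maps a
    pair to a score in [0, 1]."""
    return max(candidate_pairs, key=discriminability)

# Toy usage: candidate pairs tagged with mock discriminability scores.
pairs = [("pair_a", 0.2), ("pair_b", 0.9), ("pair_c", 0.5)]
best = select_query(pairs, discriminability=lambda p: p[1])
```

In practice the discriminator is trained from past labeling outcomes, so "most labelable" queries are surfaced first and annotator effort is not wasted on near-indistinguishable pairs.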

Query-Policy Alignment

Traditional query selection schemes based on overall reward-model uncertainty may be misaligned with the policy’s state-action support, reducing actual policy improvement. QPA addresses this by enforcing near on-policy buffer sampling for query selection and a hybrid replay strategy for critic updates, ensuring feedback and learning are both aligned to the currently visited distribution (Hu et al., 2023).

Feedback Transformation and Aggregation

In online or noisy real-time feedback settings, methods such as Pref-GUIDE transform scalar evaluative data into windows of temporally-coherent preference comparisons, aggregating across users for robust policy updates (Ji et al., 10 Aug 2025).
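A minimal version of this scalar-to-preference transformation compares aggregated ratings of two segments within a time window and abstains when they are too close to call. The window, margin, and mean aggregation here are illustrative choices, not Pref-GUIDE's exact rule:

```python
def scalar_to_preference(ratings_a, ratings_b, margin=0.1):
    """Turn two windows of scalar evaluative feedback into a pairwise
    label: 1 if segment A's mean rating beats B's by `margin`, 0 if B
    wins, None if the gap is too small to trust as a preference."""
    mean_a = sum(ratings_a) / len(ratings_a)
    mean_b = sum(ratings_b) / len(ratings_b)
    if mean_a - mean_b > margin:
        return 1
    if mean_b - mean_a > margin:
        return 0
    return None

label = scalar_to_preference([0.8, 0.9], [0.1, 0.2])  # A clearly wins -> 1
tie = scalar_to_preference([0.5], [0.5])              # ambiguous -> None
```

Abstaining on near-ties filters out noise before the comparisons are aggregated across users for the policy update.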

Motion- and Task-adaptive Query Selection

For high-dimensional robotic tasks, SENIOR introduces motion-distinction-based selection (MDS) schemes to preferentially sample segment pairs that display large, diverse, and task-relevant differences, facilitating easier labeling and informative feedback (Ni et al., 17 Jun 2025).

4. Theoretical Properties: Sample Complexity, Optimality, and Convergence

Extensive recent work has analyzed the sample complexity, regret bounds, and optimal policy existence guarantees of PbRL and preference-based learning.

  • Under linear reward assumptions and known/unknown transitions, preference-based reward-agnostic algorithms like REGIME achieve optimal policy learning with Õ(d²/ε²) human queries, independent of state/action space size, outperforming previous work that scaled with |S| or |A| (Zhan et al., 2023).
  • For contextual bandits, offline preference-based methods avoid human bias and yield O(1/√n) suboptimality rates, while rating-based approaches degrade under bias or heterogeneous noise (Ji et al., 2023).
  • Best-policy pure exploration with trajectory feedback, as in PSPL, enjoys non-asymptotic simple Bayesian regret bounds scaling as Õ(1/√K) in the number K of online episodes, with additional OOD correction from offline data and rater competence (Agnihotri et al., 31 Jan 2025).
  • In settings with general preferences (not induced by any underlying reward), the Direct Preference Process provides necessary and sufficient conditions (total, consistent preference relations) for the existence of optimal policies and a recursive ordinal Bellman equation (Carr et al., 2023).
  • For on-policy learning (e.g., DPO/LLM alignment), coverage improvement ensures each new policy has better data support for preference learning, leading to exponential convergence in the number of on-policy updates for sufficient batch size, and sharp separations with off-policy (fixed-data) sample complexity (Kim et al., 13 Jan 2026).

5. Practical Applications and Empirical Impact

Preference-based policy learning has demonstrated empirical success across a variety of domains:

  • LLM Post-training: Direct preference objectives (e.g., DPO, SimPER) are the backbone of modern RLHF and alignment for LLMs (Oh et al., 24 Sep 2025, Jia, 17 Nov 2025, Kim et al., 13 Jan 2026), with theoretical explanations for the superiority of on-policy and min-max learning. Methods such as the FPA modification address over-penalization of shared tokens in mathematical reasoning tasks by proactively regularizing gradients using predictors of the future policy, yielding substantial gains (Oh et al., 24 Sep 2025).
  • Sample-efficient Robotics: Model-based PbRL frameworks efficiently combine learned dynamics and pretraining via auto-labeled demonstrations, drastically reducing the number of real environment rollouts required relative to model-free methods (Liu et al., 2023). Methods like DAPPER and SENIOR further enhance query efficiency and policy improvement in real-world high-DoF tasks (Kadokawa et al., 9 May 2025, Ni et al., 17 Jun 2025).
  • Conversational Recommendation and CRS: VPPL generalizes preference-based policy learning to scenarios with ambiguous, vague, or temporally-volatile preferences, using soft estimations and graph-based RL to robustly guide recommendations (Zhang et al., 2023).
  • Wireless Caching: Preference-based caching policies that discriminate between content popularity and individual user preferences can yield 20–30 percentage point gains in offloading over standard popularity-based baselines (Chen et al., 2017).
  • Multi-Objective RL: Human-in-the-loop MORL systems (e.g., CBOB) that elicit preferences over trade-offs and adapt search to decision-maker utility efficiently concentrate learning effort on the region of interest in Pareto space, outperforming both conventional MORL and other preference-based MORL approaches (Li et al., 2024).
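Among the LLM post-training methods above, the DPO objective is simple enough to state per example: the policy is pushed to raise the likelihood of the preferred response relative to the rejected one, measured against a frozen reference policy. A minimal sketch (inputs are sequence log-probabilities; the numbers are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the
    margin compares the policy's log-probability gain over a frozen
    reference model on the preferred (w) vs. rejected (l) response."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy equals the reference, the margin is 0 and the loss
# is log 2 regardless of beta; widening the margin lowers the loss.
loss_at_ref = dpo_loss(-10.0, -12.0, ref_logp_w=-10.0, ref_logp_l=-12.0)
loss_improved = dpo_loss(-9.0, -13.0, ref_logp_w=-10.0, ref_logp_l=-12.0)
```

The β parameter controls how far the policy may drift from the reference before the implicit KL penalty dominates.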

6. Challenges, Recent Advances, and Open Directions

Preference-based policy learning faces inherent challenges:

  • Distribution Shift and Generalization: Offline-trained reward models may fail to generalize to novel agent behaviors. Solutions include incorporating virtual preferences linking demo/agent state-action pairs and reliability-weighted preference modeling (Zhang et al., 2024).
  • Information Bottleneck: Scalar reward modeling can induce lossy compression of human judgment. High-dimensional context or latent representation-based approaches (OPPO) can better preserve non-Markovian or structured preference information (Kang et al., 2023).
  • Label Efficiency and Robustness: Encoding dynamics via self-supervised losses (REED) enables better extrapolation from sparse preference labels (Metcalf et al., 2022). Direct preference optimization and reward-agnostic exploration further reduce human labeling cost (Zhan et al., 2023, An et al., 2023).
  • Algorithmic Scalability and Stability: Designs such as hybrid samplers (preferential G-optimal design), discriminability-aware query scheduling, and distributionally-robust min-max objectives provide principled means to accelerate convergence and avoid degenerate solutions (Kim et al., 13 Jan 2026, Jia, 17 Nov 2025, Oh et al., 24 Sep 2025, Kadokawa et al., 9 May 2025).
  • Preference Vagueness and Human Feedback Limitations: Handling non-binary or ambiguous preference signals is achievable via soft estimation, voting schemes, or population aggregation (Ji et al., 10 Aug 2025, Zhang et al., 2023).

Open questions include theory for general function class convergence (Kim et al., 13 Jan 2026), extending regret and sample complexity guarantees to high-dimensional continuous settings (Agnihotri et al., 31 Jan 2025), robust preference modeling under severe human noise (Ji et al., 2023), and efficient scaling to multi-turn or long-horizon settings with ambiguous, time-varying, or partial preferences.


References:

  • (Chen et al., 2017) "Caching Policy for Cache-enabled D2D Communications by Learning User Preference"
  • (An et al., 2023) "Direct Preference-based Policy Optimization without Reward Modeling"
  • (Kang et al., 2023) "Beyond Reward: Offline Preference-guided Policy Optimization"
  • (Carr et al., 2023) "Conditions on Preference Relations that Guarantee the Existence of Optimal Policies"
  • (Zhang et al., 2024) "Online Policy Learning from Offline Preferences"
  • (Cho et al., 6 May 2025) "Policy-labeled Preference Learning: Is Preference Enough for RLHF?"
  • (Oh et al., 24 Sep 2025) "Future Policy Aware Preference Learning for Mathematical Reasoning"
  • (Ni et al., 17 Jun 2025) "SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based RL"
  • (Metcalf et al., 2022) "Rewards Encoding Environment Dynamics Improves Preference-based RL"
  • (Jia, 17 Nov 2025) "Bootstrapping LLMs via Preference-Based Policy Optimization"
  • (Kim et al., 13 Jan 2026) "Coverage Improvement and Fast Convergence of On-policy Preference Learning"
  • (Hu et al., 2023) "Query-Policy Misalignment in Preference-Based Reinforcement Learning"
  • (Kadokawa et al., 9 May 2025) "DAPPER: Discriminability-Aware Policy-to-Policy Preference-Based Efficient RL..."
  • (Zhan et al., 2023) "Provable Reward-Agnostic Preference-Based Reinforcement Learning"
  • (Liu et al., 2023) "Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models"
  • (Li et al., 2024) "Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective RL"
  • (Zhang et al., 2023) "Vague Preference Policy Learning for Conversational Recommendation"
  • (Ji et al., 2023) "Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems"
  • (Agnihotri et al., 31 Jan 2025) "Active RLHF via Best Policy Learning from Trajectory Preference Feedback"
  • (Ji et al., 10 Aug 2025) "Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning"