Preference-based Reward and Judge Design
- Preference-based reward and judge design is a framework that uses human or synthetic feedback to replace hand-engineered reward functions with inferred signals for training RL agents and LLMs.
- It integrates probabilistic modeling, optimization, active learning, and interpretability to improve data efficiency, robustness, and task alignment in both real-world applications and language model systems.
- Practical implementations, such as hierarchical reward models and generative judges, demonstrate enhanced calibration, transparency, and reduced labeling effort in complex reinforcement learning and multi-objective tasks.
Preference-based reward and judge design encompasses a collection of methodologies and algorithms aimed at inferring, constructing, or serving reward signals (and associated “judge” modules) for reinforcement learning (RL) and LLM alignment using human (or synthetic) preferences rather than hand-engineered reward functions. The field integrates probabilistic modeling, optimization, active learning, interpretability, and robustness principles to address the reward-engineering bottleneck in real-world agents and scalable LLM systems. The following sections synthesize foundational concepts, methodological variants, practical judge architectures, evaluation regimes, interpretability mechanisms, and robustness strategies based on recent developments in the literature.
1. Foundations of Preference-Based Reward Modeling
Preference-based reward models replace explicit scalar reward specification with human (or proxy) feedback that encodes relative preferences between system behaviors. Formally, given a Markov decision process (MDP) and observed trajectories $\tau_1, \tau_2$, the designer lacks access to true rewards but can access preference labels over trajectory pairs, often in the form: "Is $\tau_1$ preferred to $\tau_2$?" The canonical likelihood for such comparisons is the Bradley–Terry model, in which the probability of preferring $\tau_1$ over $\tau_2$ is $P(\tau_1 \succ \tau_2) = \sigma(R(\tau_1) - R(\tau_2))$, with $R$ the (possibly learned) cumulative reward and $\sigma$ the sigmoid function (Sun et al., 2024).
In LLM alignment, preference datasets $\mathcal{D} = \{(x, y_w, y_l)\}$ (prompt, preferred response, and dispreferred response) enable training scalar reward heads or generative reward models that serve as proxies for human judgment (Ye et al., 2024, Yu et al., 17 Feb 2025). The output reward is subsequently used within RLHF, DPO, or policy selection protocols.
Order-consistency is critical: a reward model must induce rankings over candidate behaviors that match ground-truth preference orderings, up to strict monotonicity. This property is preserved by both Bradley–Terry models and many classifier-based surrogates (Sun et al., 2024).
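As a minimal illustration of the Bradley–Terry setup above, the following sketch computes the preference probability and the corresponding negative log-likelihood loss; the scalar rewards here are hypothetical stand-ins for a learned reward model's outputs:

```python
import math

def bt_preference_prob(r_preferred: float, r_dispreferred: float) -> float:
    """Bradley-Terry probability that the first trajectory is preferred,
    given (possibly learned) cumulative rewards for each trajectory:
    P = sigmoid(R(tau_1) - R(tau_2))."""
    return 1.0 / (1.0 + math.exp(-(r_preferred - r_dispreferred)))

def bt_loss(pairs):
    """Negative log-likelihood over labeled pairs (r_w, r_l),
    where r_w is the reward assigned to the human-preferred trajectory."""
    return -sum(math.log(bt_preference_prob(rw, rl)) for rw, rl in pairs) / len(pairs)
```

Note that equal rewards yield probability 0.5, and the loss shrinks as the reward margin on preferred trajectories grows, which is exactly the order-consistency property described above.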
2. Hierarchical, Structured, and Multi-Signal Reward Construction
Classical approaches treat all preference information as homogeneous; recent advances exploit structure in the space of feedback signals. HERON (Bukharin et al., 2023) proposes a hierarchical reward modeling approach in which multiple feedback signals, often available in complex RL domains or multi-criteria evaluation tasks, are rank-ordered by expert importance. Preferences between trajectory pairs are elicited via a margin-based decision tree, deepening through signals in order of importance until a decisive difference is found:
- For signals $s_1, \dots, s_n$ ranked by expert importance, the comparison between two trajectories $\tau_1, \tau_2$ proceeds level-wise; at each level (signal $s_i$), a margin $\delta_i$ is used to resolve minor/noisy differences.
- If $|s_i(\tau_1) - s_i(\tau_2)| > \delta_i$, a preference is recorded; otherwise, deeper signals are considered.
- The resulting labeled dataset enables data-efficient preference learning via Bradley–Terry loss over selective, high-signal pairs.
Such strategies significantly reduce the labeling effort relative to "flat" pairwise comparison schemes, enhance robustness to feature scale-shifts, and enable flexible policy adaptation by reordering feedback-signal hierarchies (Bukharin et al., 2023).
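The hierarchical labeling rule described above can be sketched as follows; the function name and dictionary-based trajectory representation are illustrative, not HERON's actual interface:

```python
def hierarchical_label(traj_a, traj_b, signals, margins):
    """HERON-style hierarchical preference labeling (sketch).
    signals: callables ordered by expert-ranked importance;
    margins: per-signal thresholds that absorb minor/noisy differences.
    Returns 1 if A is preferred, -1 if B is preferred, 0 if no signal
    is decisive (such pairs can be skipped for data efficiency)."""
    for signal, margin in zip(signals, margins):
        diff = signal(traj_a) - signal(traj_b)
        if abs(diff) > margin:          # decisive at this level
            return 1 if diff > 0 else -1
        # otherwise descend to the next (less important) signal
    return 0
```

Reordering the `signals` list implements the flexible policy adaptation mentioned above: the same raw feedback yields different labeled datasets under different importance hierarchies.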
3. Judge Architecture: Parametric, Generative, and Aggregation Strategies
3.1 Scalar and Multi-Objective Parametric Judges
Standard preference-based RMs for RLHF/LLM alignment are deep networks (MLP, Transformer blocks) trained via preference (Bradley–Terry) cross-entropy on trajectory or response pairs (Sun et al., 2024, Bukharin et al., 2023). Contextual features, concatenated prompts and outputs, or trajectory embeddings serve as inputs.
Recent work demonstrates the necessity of interpretable and context-aware judges:
- ArmoRM (Wang et al., 2024) decomposes the reward vector into objectives ("helpfulness", "conciseness", etc.), each trained with binary cross-entropy against absolute-rated labels (if available). A Mixture-of-Experts (MoE) gating network combines the objective vector into a scalar reward conditioned on the prompt context, enabling context-sensitive reward amalgamation and traceable, steerable alignment.
- Generalized Additive Model (GAM) and MLP aggregators (Sprejer et al., 29 Oct 2025) combine outputs from multiple rubric-conditioned (persona) judges into a final preference score. GAMs learn spline calibrations for each judge; MLPs learn complex non-linear combinations.
3.2 Generative Judges and Self-Judgment
Con-J (Ye et al., 2024) and Persona-judge (Zhang et al., 17 Apr 2025) put forward generative LLM-as-judge paradigms:
- Con-J elicits natural-language justifications/rationales for each preference, training models via Direct Preference Optimization (DPO) to generate and explain their judgments. This approach improves both interpretability and robustness to dataset biases such as verbosity or stylistic artifacts.
- Persona-judge operates at decoding-time, using a draft LLM (prompted with persona A) and a judge LLM (persona B) to filter candidate tokens at each step, by explicit distributional comparison and token-level rejection/sampling. This method injects desired preferences without further fine-tuning, supporting multi-persona alignment in a modular, scalable manner.
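The token-level filtering idea can be sketched with a speculative-style accept/reject step; this is a simplified toy with dict-based token distributions, not the paper's exact procedure:

```python
import random

def persona_judge_step(draft_probs, judge_probs, rng):
    """One decoding-time filtering step (simplified Persona-judge sketch).
    draft_probs / judge_probs: dicts mapping token -> probability under
    the draft persona and the judge persona (illustrative names).
    A token proposed by the draft is accepted with probability
    min(1, p_judge / p_draft); on rejection, the judge's own
    distribution supplies the token instead."""
    tokens = list(draft_probs)
    token = rng.choices(tokens, weights=[draft_probs[t] for t in tokens])[0]
    accept = min(1.0, judge_probs.get(token, 0.0) / draft_probs[token])
    if rng.random() < accept:
        return token
    # rejected: fall back to sampling from the judge persona
    fallback = list(judge_probs)
    return rng.choices(fallback, weights=[judge_probs[t] for t in fallback])[0]
```

When the two personas agree on a token, it passes through unchanged; tokens the judge persona assigns zero probability are always replaced, which is how preference injection happens without any fine-tuning.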
3.3 Judge Calibration and Aggregation
To mitigate rubric sensitivity, annotator bias, or instability in judge signals (human or LLM-based), learned aggregators (GAM/MLP) reweight and recalibrate raw judge scores, outperforming naive averaging and maintaining calibration under adversarial distortions (Sprejer et al., 29 Oct 2025). Persona-based synthesis of preference labels using LLMs allows for systematic control over bias, diversity, and robustness, and learned aggregation provides interpretability and robustness advantages in downstream RLHF pipelines.
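A minimal stand-in for such a learned aggregator, fitting per-judge weights and a bias by least-squares gradient descent rather than an actual GAM or MLP, illustrates why learned reweighting can beat naive averaging:

```python
def fit_aggregator(judge_scores, targets, lr=0.1, steps=2000):
    """Fit per-judge weights + bias by gradient descent on squared error.
    A toy surrogate for the learned GAM/MLP aggregators: instead of naively
    averaging judge scores, learn how much each judge should count.
    judge_scores: list of rows, one row of per-judge scores per example."""
    k = len(judge_scores[0])
    n = len(judge_scores)
    w, b = [0.0] * k, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * k, 0.0
        for row, y in zip(judge_scores, targets):
            err = sum(wi * xi for wi, xi in zip(w, row)) + b - y
            for i, xi in enumerate(row):
                gw[i] += err * xi
            gb += err
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b
```

If one judge tracks the target preference and another is mostly noise, the fitted weights suppress the noisy judge, whereas a naive average gives both equal say.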
4. Active Preference Elicitation and Query Optimization
Data efficiency in preference-based reward learning is enhanced by active query selection strategies:
- Information Gain: Maximizing the expected mutual information between reward model parameters and the user's response to the next query leads to highly informative, user-friendly queries (Bıyık et al., 2020, Bıyık et al., 2020).
- Generalized Acquisition Functions: Recent advances consider goal-aligned behavioral or policy-equivalence classes rather than precise parameter identification, selecting queries that maximize disambiguation among equivalence classes of reward functions, substantially reducing required queries under domain shift or transfer (Ellis et al., 2024).
- Batch and Experimental Design: To further lower label complexity, queries can be accumulated in batches and subsetted by submodular design (e.g., D-optimality) for parallel labeling, as in efficient RLHF (Schlaginhaufen et al., 11 Jun 2025). This enables practical RLHF with batch updates and a number of preference queries that scales with the feature dimension.
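The D-optimal batch idea can be illustrated with a small greedy sketch: each pick maximizes the determinant of the ridge-regularized information matrix, so redundant query directions are skipped. Feature differences are 2-D here for simplicity; real implementations work in higher dimensions with submodular guarantees:

```python
def greedy_d_optimal(candidates, k, ridge=1e-3):
    """Greedy batch selection of preference queries by D-optimality (sketch).
    candidates: 2-D feature-difference vectors x = phi(traj_1) - phi(traj_2).
    Each pick maximizes det of M = ridge*I + sum of x x^T over chosen x."""
    def det(m):
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]

    def add(m, x):
        return [[m[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]

    M = [[ridge, 0.0], [0.0, ridge]]
    chosen, remaining = [], list(range(len(candidates)))
    for _ in range(k):
        best = max(remaining, key=lambda i: det(add(M, candidates[i])))
        M = add(M, candidates[best])
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Given one informative direction already covered, the greedy rule prefers an orthogonal candidate over a near-duplicate, which is the intuition behind the reduced label complexity.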
5. Robustness, Interpretability, and Shortcuts Mitigation
Scalar reward models are vulnerable to reward hacking—over-optimization for spurious correlates of human preferences (verbosity, tone, etc.)—and lack transparency in decision processes (Ye et al., 21 Oct 2025, Wang et al., 2024, Ye et al., 2024). Contemporary frameworks introduce:
- Invariant kernels and regularization (PRISM): Training reward models to be invariant under group transformations (e.g., rewriting, verbosity shifts, sycophancy) using Haar-averaged kernels and decorrelation penalties, enforcing reliance on generalizable, shortcut-agnostic features (Ye et al., 21 Oct 2025).
- Multi-objective decomposition (ArmoRM): By exposing per-objective scores and learning context-aware gating, developers can audit, steer, and intervene in reward calculation, substantially reducing reward hacking and improving alignment (Wang et al., 2024).
- Generative rationales (Con-J): Natural-language explanations paired with preferences serve as regularizers and diagnostic tools to detect and counteract spurious correlations (Ye et al., 2024).
- Judge aggregation and calibration (GAM/MLP): Calibration procedures mitigate judge-specific drifts and rubric instabilities, yielding robust scoring even under noisy or adversarial judge contamination (Sprejer et al., 29 Oct 2025).
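A toy version of the invariance idea above: penalize the variance of the reward across a group of transformed versions (paraphrases, verbosity shifts) of the same response. This is an illustrative surrogate for transformation-invariance training, not PRISM's Haar-averaged kernel construction:

```python
def invariance_penalty(reward_fn, response_group):
    """Shortcut-mitigation sketch: variance of reward over a group of
    semantically equivalent transformations of one response. A reward
    model relying on generalizable features scores them alike (penalty
    near 0); one exploiting shortcuts such as length does not."""
    rewards = [reward_fn(r) for r in response_group]
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)
```

For example, a length-based reward (`len` as a stand-in shortcut) is heavily penalized on a verbosity-shifted pair, while a length-invariant reward incurs zero penalty; added to the Bradley–Terry loss, such a term pushes training away from the shortcut.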
6. Practical Algorithms and Guidelines
A range of practical algorithms and design patterns emerge:
| Approach | Query Strategy | Judge/Reward Model |
|---|---|---|
| HERON (Bukharin et al., 2023) | Hierarchical signal ranking | Margin-based decision tree + BT loss |
| DemPref (Bıyık et al., 2020) | Demonstrations + Active Pref. | Bayesian linear reward, Boltzmann |
| GAM/MLP Aggregation | Persona-based LLM labels | Spline/MLP-calibrated aggregation |
| DPO + Con-J (Ye et al., 2024) | LLM-generated rationales | Generative chain-of-thought judge |
| Persona-judge (Zhang et al., 17 Apr 2025) | Self-filtering at decoding | LLM as token-level judge |
| Efficient RLHF (Schlaginhaufen et al., 11 Jun 2025) | Randomized, D-optimal queries | Linear reward, logistic/probit |
| PRISM (Ye et al., 21 Oct 2025) | Standard pairwise | Kernel-invariant, decorrelated |
Key guidelines include:
- Always design feedback queries and reward features aligned with downstream behavioral goals, not parameter recovery (Ellis et al., 2024).
- Use modular, interpretable judge structures (multi-objective, generative rationales) for traceability and steerability (Wang et al., 2024, Ye et al., 2024).
- Employ active, outcome-aware query selection (mutual information, batch design) to minimize preference query burden (Bıyık et al., 2020, Schlaginhaufen et al., 11 Jun 2025).
- For application scenarios prone to shortcut exploitation, favor invariant regularization and routine auditing of RM correlations with known shortcut features (Ye et al., 21 Oct 2025).
- Aggregate multiple persona/rubric-based judge outputs with learned calibration to ensure resilience to annotator bias and rubric drift (Sprejer et al., 29 Oct 2025).
- For online or OOD deployment, augment fast reward models with stronger LLM-as-judge queries when uncertainty exceeds a threshold, balancing cost and robustness (Xu et al., 23 Oct 2025).
7. Empirical Performance, Limitations, and Outlook
Empirical benchmarks across RL domains, LLM alignment tasks, and synthetic environments demonstrate substantial benefits in data efficiency, robustness, and final policy/LLM performance for structured, interpretable, and actively queried preference-based reward and judge design. For example:
- HERON improves both sample efficiency and robustness to signal scale shifts versus linear baselines (Bukharin et al., 2023).
- PRISM reduces shortcut-feature correlations to near zero and improves out-of-distribution ranking fidelity (Ye et al., 21 Oct 2025).
- Multi-judge aggregation and generative, rationale-producing judges achieve higher explained variance, calibration, and resistance to rubric or annotation noise (Sprejer et al., 29 Oct 2025, Ye et al., 2024).
- Data and compute cost are minimized with batch, experimental design, or self-judgment methods (Persona-judge, Refine-n-Judge).
Limitations include dependence on availability and diversity of feedback signals, residual judge biases, and the necessity of matching the design of reward/aggregation mechanisms to the anticipated deployment domain's value system and failure modes. Open research directions include scaling and efficiently training highly-interpretable multi-objective or generative judges, further integrating robust calibration into production RLHF pipelines, and principled theoretical analysis of generalization properties under non-i.i.d. (distribution-shifting) deployment.
References:
(Bukharin et al., 2023, Sprejer et al., 29 Oct 2025, Wang et al., 2024, Sun et al., 2024, Ye et al., 2024, Bıyık et al., 2020, Cayir et al., 3 Aug 2025, Ye et al., 21 Oct 2025, Zhang et al., 17 Apr 2025, Schlaginhaufen et al., 11 Jun 2025, Bıyık et al., 2020, Hejna et al., 2023, Katz et al., 2021, Yu et al., 17 Feb 2025, Ji et al., 2023, Ward et al., 2022, Xu et al., 23 Oct 2025, Ellis et al., 2024).