
TriTrust-PBRL: Robust Reward Learning

Updated 2 February 2026
  • TriTrust-PBRL is a unified framework that infers reward functions from trajectory comparisons provided by heterogeneous experts.
  • It employs a shared neural reward model combined with expert-specific trust parameters to classify annotations as reliable, noisy, or adversarial using gradient-based dynamics.
  • Empirical evaluations on manipulation and locomotion tasks show that TTP outperforms existing PBRL methods under adversarial and noisy conditions.

TriTrust-PBRL (TTP) is a unified framework for preference-based reinforcement learning (PBRL) designed to robustly learn reward functions from trajectory comparisons labeled by multiple, heterogeneous experts exhibiting varying reliability profiles. TTP jointly learns a shared reward model and scalar trust parameters for each expert, automatically separating reliable, noisy, and adversarial annotators via gradient-based dynamics. The approach demonstrates theoretical identifiability properties and empirical robustness in scenarios where existing PBRL methods fail under adversarial corruption (Hosseini et al., 26 Jan 2026).

1. Problem Setting and Formalization

PBRL seeks to infer a reward function $R_\phi(\tau)=\sum_t r_\phi(s_t, a_t)$ from data comprising trajectory pairs $(\tau, \tau')$ with binary labels $y \in \{0, 1\}$ indicating expert preference. In practical crowdsourcing, preference data originates from $K$ distinct experts, each annotating a private, possibly disjoint, set $D_k$ of comparisons. Annotators fall into three classes:

  • Accurate: $y$ aligns with the unknown true return $R^*(\tau)$.
  • Noisy: $y$ is random (accuracy ≈ 50%).
  • Adversarial: $y$ systematically flips the truth (accuracy ≈ 0%).

This heterogeneous characteristic renders simple uniform aggregation or filtering insufficient in the presence of adversarial experts.
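The three annotator classes can be simulated directly. A minimal sketch, in which the toy trajectories and the `true_return` helper are illustrative assumptions rather than anything from the paper:

```python
import random

# Hypothetical ground-truth return for toy trajectories: a trajectory is
# just a list of per-step rewards here (an illustrative simplification).
def true_return(tau):
    return sum(tau)

def label(tau, tau_prime, expert_type, rng):
    """Emit y=1 if the expert reports a preference for tau over tau_prime."""
    truth = 1 if true_return(tau) > true_return(tau_prime) else 0
    if expert_type == "accurate":
        return truth              # aligns with the true return
    if expert_type == "noisy":
        return rng.randint(0, 1)  # coin flip: ~50% accuracy
    return 1 - truth              # adversarial: systematically flipped

rng = random.Random(0)
tau, tau_p = [1.0, 2.0], [0.5, 0.5]
print(label(tau, tau_p, "accurate", rng))     # 1
print(label(tau, tau_p, "adversarial", rng))  # 0
```

Uniform aggregation over such a pool averages truthful and flipped labels together, which is why the adversarial class in particular cannot be handled by simple filtering.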

2. Model Architecture and Parameterization

TTP employs:

  • Shared Reward Model: $r_\phi(s,a)$, a neural network (commonly an MLP) with parameters $\phi$.
  • Expert-specific Trust Parameters: a scalar $\beta_k$ for each expert $k$, interpreted as:
    • $\beta_k \gg 0$: trust and amplify expert $k$'s preferences.
    • $\beta_k \approx 0$: ignore expert $k$ (uninformative/noisy).
    • $\beta_k < 0$: systematically invert adversarial preferences.

The aggregated reward for a trajectory is $R_\phi(\tau) = \sum_t r_\phi(s_t, a_t)$. For each comparison labeled $y$ by expert $k$, the preference likelihood is modeled as $P(y=1 \mid \tau, \tau', k; \phi, \beta) = \sigma(\beta_k \cdot \Delta R_\phi)$, where $\Delta R_\phi = R_\phi(\tau) - R_\phi(\tau')$ and $\sigma(z) = 1/(1+e^{-z})$.
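This likelihood is a plain logistic function of the trust-scaled return gap. A minimal sketch (the helper name `pref_likelihood` is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pref_likelihood(R_tau, R_tau_prime, beta_k):
    """P(y=1 | tau, tau', k) = sigma(beta_k * (R(tau) - R(tau')))."""
    return sigmoid(beta_k * (R_tau - R_tau_prime))

# The three trust regimes from above, with R(tau)=2.0 > R(tau')=1.0:
print(pref_likelihood(2.0, 1.0, 3.0))   # high: trusted expert, amplified
print(pref_likelihood(2.0, 1.0, 0.0))   # exactly 0.5: expert ignored
print(pref_likelihood(2.0, 1.0, -3.0))  # low: adversarial expert, inverted
```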

3. Joint Objective and Learning Algorithm

The training objective integrates per-expert trust:

$$L(\phi, \beta) = -\sum_{k=1}^{K} \sum_{(\tau, \tau', y) \in D_k} \Big[ y \log \sigma(\beta_k \Delta R_\phi) + (1-y)\log\big(1 - \sigma(\beta_k \Delta R_\phi)\big) \Big] + \lambda_r \|\phi\|^2 + \lambda_\beta \|\beta\|^2$$
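For reference, the objective can be written out in plain Python. The linear reward model and the per-expert `data[k]` layout are simplifying assumptions for illustration, not the paper's implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(phi, beta, data, lam_r=1e-3, lam_b=1e-3):
    """Negative log-likelihood over all experts plus L2 regularizers.
    data[k] is a list of (delta_R_features, y) pairs for expert k;
    Delta R_phi is modeled linearly (phi . features) for illustration."""
    total = 0.0
    for k, D_k in enumerate(data):
        for feats, y in D_k:
            dR = sum(p * f for p, f in zip(phi, feats))  # Delta R_phi
            p1 = sigmoid(beta[k] * dR)
            total -= y * math.log(p1) + (1 - y) * math.log(1 - p1)
    total += lam_r * sum(p * p for p in phi) + lam_b * sum(b * b for b in beta)
    return total

# Expert 0 labels truthfully; expert 1's label is flipped. A negative
# beta[1] fits the flipped labels better than naive trust does.
demo = [[([1.0], 1)], [([1.0], 0)]]
print(loss([1.0], [1.0, -1.0], demo))
```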

Gradients for each sample inform updates:

  • For $\phi$: $\big(\sigma(\beta_k \Delta R_i) - y_i\big)\,\beta_k\,\frac{\partial \Delta R_i}{\partial \phi} + 2\lambda_r\phi$
  • For $\beta_k$: $\big(\sigma(\beta_k \Delta R_i) - y_i\big)\,\Delta R_i + 2\lambda_\beta\beta_k$
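The per-sample $\beta_k$ gradient can be sanity-checked against a finite difference of the per-sample loss. A sketch under the same notation (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(beta_k, dR, y, lam_b=1e-3):
    """Per-sample negative log-likelihood plus the beta regularizer."""
    p1 = sigmoid(beta_k * dR)
    return -(y * math.log(p1) + (1 - y) * math.log(1 - p1)) + lam_b * beta_k**2

def grad_beta(beta_k, dR, y, lam_b=1e-3):
    """Analytic gradient: (sigma(beta_k*dR) - y) * dR + 2*lam_b*beta_k."""
    return (sigmoid(beta_k * dR) - y) * dR + 2 * lam_b * beta_k

# Central finite difference agrees with the analytic form.
beta_k, dR, y, eps = 0.7, 1.3, 1, 1e-6
fd = (nll(beta_k + eps, dR, y) - nll(beta_k - eps, dR, y)) / (2 * eps)
assert abs(fd - grad_beta(beta_k, dR, y)) < 1e-6
```

Note the sign of the descent step: for a reliable expert ($y$ matching the sign of $\Delta R$) the data term is negative, so gradient descent pushes $\beta_k$ up; for an adversarial expert it is positive, pushing $\beta_k$ down.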

Key trust parameter dynamics:

  • Reliable experts: $\beta_k$ pushed larger and positive.
  • Noisy experts: $\beta_k$ remains near zero.
  • Adversarial experts: $\beta_k$ driven negative, inverting their preferences.

Practical enhancements include bounding $\beta_k$ via $\alpha_k = \tanh(\beta_k)$ with normalization ($\alpha_k \leftarrow \alpha_k / \max_j |\alpha_j|$) to enforce scale invariance and stability. Weighted losses $w_k = |\alpha_k|$ further diminish the impact of noisy experts.
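The bounding and normalization scheme is a two-line transform. A minimal sketch, assuming an illustrative pool of one reliable, one noisy, and one adversarial expert:

```python
import math

def bounded_trusts(betas):
    """alpha_k = tanh(beta_k), then normalize by the largest magnitude
    so that max_k |alpha_k| = 1 (scale invariance)."""
    alphas = [math.tanh(b) for b in betas]
    m = max(abs(a) for a in alphas)
    return [a / m for a in alphas]

betas = [4.0, 0.05, -3.0]           # reliable, noisy, adversarial (illustrative)
alphas = bounded_trusts(betas)
weights = [abs(a) for a in alphas]  # w_k = |alpha_k| down-weights the noisy expert
print(alphas)
```

After normalization, the reliable expert sits at $\alpha = 1$, the noisy expert's weight is near zero, and the adversarial expert keeps a large negative trust.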

4. Theoretical Guarantees and Gradient Behavior

TTP’s theoretical analysis includes:

  • Logistic Monotonicity Lemma: the function $z \mapsto \sigma(\beta z)$ is strictly monotonic for $\beta \ne 0$.
  • Identifiability Theorem: suppose (i) the trajectory graph (nodes: trajectories; edges: compared pairs) is connected, and (ii) the expert overlap graph (experts linked if they share any comparison) is connected. Then any two solutions $(R, \beta)$ and $(R', \beta')$ with identical prediction likelihoods satisfy $R'(\tau) = a R(\tau) + b$ with $a \ne 0$ and $\beta'_k = \beta_k / a$. Thus rewards are identifiable up to an affine transform, and trust parameters rescale inversely.
  • Corollary for Bounded Trusts: with $\alpha_k = \tanh(\beta_k)$ and normalization, the ambiguity reduces to $a = \pm 1$ and an additive shift $b$. This justifies the bounded normalization scheme and removes the scale degeneracy.

These properties guarantee expert separation (trust, ignore, flip) emerges naturally from gradient optimization, without explicit expert supervision.
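The affine ambiguity in the identifiability theorem can be checked numerically on a toy reward table (the reward values and transform constants below are arbitrary assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(R, beta_k, tau, tau_p):
    """P(y=1) under a tabular reward R and trust beta_k."""
    return sigmoid(beta_k * (R[tau] - R[tau_p]))

R = {"t1": 2.0, "t2": 0.5}            # toy reward table (illustrative)
a, b = 3.0, -1.0                      # affine transform, a != 0
R2 = {t: a * R[t] + b for t in R}     # R'(tau) = a*R(tau) + b
beta, beta2 = 1.4, 1.4 / a            # beta'_k = beta_k / a

p1 = likelihood(R, beta, "t1", "t2")
p2 = likelihood(R2, beta2, "t1", "t2")
assert abs(p1 - p2) < 1e-12           # identical prediction likelihoods
```

The additive shift $b$ cancels inside the return difference, and the scale $a$ cancels against $\beta'_k = \beta_k / a$, which is exactly the residual ambiguity the theorem describes.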

5. Empirical Benchmarks and Observed Behavior

TTP’s robustness is demonstrated on manipulation and locomotion domains:

  • MetaWorld (Door-Open-v2, Sweep-Into-v2) and DMControl (Cheetah-Run, Walker-Walk).
  • Simulated expert pools (B-Pref protocol): $K=4$, with $B_k \in \{+1, 0, -1\}$ encoding the reliable/noisy/adversarial composition. Typical mixtures include $[1,1,1,-1]$ (25% adversarial) and $[1,1,1,0]$ (25% noisy).

Comparative baselines:

| Method | Weighting/Filtering | Under Adversarial Mixture $[1,1,1,-1]$ | Under Noisy Mixture $[1,1,1,0]$ |
|---|---|---|---|
| TTP | Per-expert trust gradients | ≈ oracle (~90% on Sweep-Into) | Outperforms PEBBLE/MCP, edges out RIME |
| PEBBLE | Uniform | Fails catastrophically (<20%) | Lower than TTP/RIME |
| RIME | KL filtering | Intermediate (40–60%) | Comparable to TTP, lower in late training |
| MCP | Mixup smoothing | Intermediate | Lower than TTP/RIME |
| Oracle SAC | True reward | Upper bound | Upper bound |

Key phenomena:

  • The model's $\alpha_k$ trajectories separate the expert classes early in training (~50K steps), and the classification remains stable thereafter.
  • Performance depends on the fraction of reliable experts: a few hundred comparisons suffice when reliable experts dominate, but several thousand are required as adversarial/noisy prevalence increases. A plausible implication is that system resilience scales nonlinearly with adversarial/noisy concentration.

6. Practical Integration, Limitations, and Extensions

TTP integrates into existing PBRL pipelines by replacing uniform preference likelihoods with per-expert trust-weighted terms. Only the reward-learning phase changes; policy optimization (e.g., SAC) operates on the learned $R_\phi$ unmodified.
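As a sketch of this drop-in change, a single reward-learning update with trust-weighted terms might look like the following. The linear $r_\phi$, the batch layout, and all hyperparameters are illustrative assumptions, not the paper's code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward_learning_step(phi, beta, batch, lr=0.2, lam_r=1e-3, lam_b=1e-3):
    """One SGD step on the trust-weighted preference objective.
    Policy optimization downstream (e.g. SAC on R_phi) is untouched.
    batch holds (expert k, features of Delta R, label y) triples."""
    g_phi = [2 * lam_r * p for p in phi]
    g_beta = [2 * lam_b * b for b in beta]
    for k, feats, y in batch:
        dR = sum(p * f for p, f in zip(phi, feats))   # Delta R_phi
        err = sigmoid(beta[k] * dR) - y               # shared error term
        for i, f in enumerate(feats):
            g_phi[i] += err * beta[k] * f             # reward-model gradient
        g_beta[k] += err * dR                         # trust gradient
    return ([p - lr * g for p, g in zip(phi, g_phi)],
            [b - lr * g for b, g in zip(beta, g_beta)])

# Toy run: expert 0 labels truthfully, expert 1 adversarially.
batch = [(0, [1.0], 1), (0, [-1.0], 0), (1, [1.0], 0), (1, [-1.0], 1)]
phi, beta = [1.0], [0.1, 0.1]
for _ in range(200):
    phi, beta = reward_learning_step(phi, beta, batch)
print(beta)  # expert 0's trust turns positive, expert 1's negative
```

Both experts start with the same small positive trust; the gradient dynamics alone separate them, with no expert-level supervision.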

Limitations identified:

  • A single global $\beta_k$ cannot capture context-dependent reliability for different comparison types.
  • All comparisons are treated equally; the framework does not distinguish informative from ambiguous queries.

Future research directions:

  • Context-dependent trust scoring $\beta_k(\tau, \tau')$ or hierarchical trust structures.
  • Active expert selection strategies based on dynamically computed trust.
  • Models that disentangle expert reliability from per-comparison uncertainty, allowing separation of annotator noise and inherent label ambiguity.

Overall, TriTrust-PBRL provides a scalable solution for robust reward learning from mixed-quality expert preferences, attaining identifiability guarantees and empirical resilience without relying on explicit expert supervision or engineered features beyond index identification (Hosseini et al., 26 Jan 2026).
