TriTrust-PBRL: Robust Reward Learning
- TriTrust-PBRL is a unified framework that infers reward functions from trajectory comparisons provided by heterogeneous experts.
- It employs a shared neural reward model combined with expert-specific trust parameters to classify annotations as reliable, noisy, or adversarial using gradient-based dynamics.
- Empirical evaluations on manipulation and locomotion tasks show that TriTrust-PBRL (TTP) outperforms existing PBRL methods under adversarial and noisy conditions.
TriTrust-PBRL (TTP) is a unified framework for preference-based reinforcement learning (PBRL) designed to robustly learn reward functions from trajectory comparisons labeled by multiple, heterogeneous experts exhibiting varying reliability profiles. TTP jointly learns a shared reward model and scalar trust parameters for each expert, automatically separating reliable, noisy, and adversarial annotators via gradient-based dynamics. The approach demonstrates theoretical identifiability properties and empirical robustness in scenarios where existing PBRL methods fail under adversarial corruption (Hosseini et al., 26 Jan 2026).
1. Problem Setting and Formalization
PBRL seeks to infer a reward function from data comprising trajectory pairs (τ⁺, τ⁻) with binary labels indicating expert preference. In practical crowdsourcing, preference data originates from K distinct experts, each annotating a private, possibly disjoint, set of comparisons. Annotators exhibit three reliability profiles:
- Accurate: labeling aligns with the unknown true return R* (accuracy ≈ 100%).
- Noisy: labeling is random (accuracy ≈ 50%).
- Adversarial: labeling systematically flips the truth (accuracy ≈ 0%).
This heterogeneity renders simple uniform aggregation or filtering insufficient in the presence of adversarial experts.
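The three annotator classes can be simulated directly. The sketch below is illustrative, not code from the paper; `label_pair` and the expert-type names are hypothetical conveniences:

```python
import numpy as np

rng = np.random.default_rng(0)

def label_pair(return_a, return_b, expert_type, rng):
    """Binary preference label (1 means trajectory A preferred) from one expert."""
    truth = 1 if return_a > return_b else 0
    if expert_type == "accurate":
        return truth                    # aligns with the true return
    if expert_type == "noisy":
        return int(rng.random() < 0.5)  # coin flip: ~50% accuracy
    if expert_type == "adversarial":
        return 1 - truth                # systematic flip: ~0% accuracy
    raise ValueError(expert_type)

# One comparison, labeled once by each annotator class
returns = rng.normal(size=2)
labels = {t: label_pair(returns[0], returns[1], t, rng)
          for t in ("accurate", "noisy", "adversarial")}
print(labels)
```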
2. Model Architecture and Parameterization
TTP employs:
- Shared Reward Model: r_θ(s, a), a neural network (commonly an MLP), governed by parameters θ.
- Expert-specific Trust Parameters: a scalar α_k for each expert k. Its interpretation:
  - α_k > 0: trust and amplify expert k's preferences.
  - α_k ≈ 0: ignore expert k (uninformative/noisy).
  - α_k < 0: systematically invert adversarial preferences.

The aggregated return of a trajectory τ is R_θ(τ) = Σ_t r_θ(s_t, a_t). For each comparison (τ⁺, τ⁻) labeled by expert k, the preference likelihood is modeled as P(τ⁺ ≻ τ⁻ | k) = σ(α_k Δ_θ), where σ(x) = 1/(1 + e⁻ˣ) and Δ_θ = R_θ(τ⁺) − R_θ(τ⁻).
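A minimal sketch of the trust-weighted preference likelihood, assuming the trajectory return sums per-step rewards (the function names here are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def trajectory_return(step_rewards):
    """R_theta(tau): sum of predicted per-step rewards along a trajectory."""
    return float(np.sum(step_rewards))

def preference_likelihood(r_plus, r_minus, alpha_k):
    """P(tau+ preferred | expert k) = sigma(alpha_k * (R(tau+) - R(tau-)))."""
    return sigmoid(alpha_k * (r_plus - r_minus))

r_plus = trajectory_return([1.0, 0.5, 0.5])   # R(tau+) = 2.0
r_minus = trajectory_return([0.0, 0.0, 0.0])  # R(tau-) = 0.0

print(preference_likelihood(r_plus, r_minus, 1.0))   # > 0.5: amplified agreement
print(preference_likelihood(r_plus, r_minus, 0.0))   # = 0.5: expert ignored
print(preference_likelihood(r_plus, r_minus, -1.0))  # < 0.5: preference inverted
```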
3. Joint Objective and Learning Algorithm
The training objective integrates per-expert trust through a negative log-likelihood over every expert's comparison set D_k:

L(θ, α) = − Σ_k Σ_{(τ⁺, τ⁻) ∈ D_k} log σ(α_k Δ_θ), where Δ_θ = R_θ(τ⁺) − R_θ(τ⁻).

Gradients for each sample inform updates:
- For θ: ∇_θ L = −(1 − σ(α_k Δ_θ)) α_k ∇_θ Δ_θ.
- For α_k: ∂L/∂α_k = −(1 − σ(α_k Δ_θ)) Δ_θ.
Key trust parameter dynamics:
- Reliable experts: pushed larger and positive.
- Noisy experts: remains near zero.
- Adversarial experts: driven negative, inverting preference.
Practical enhancements include bounding each α_k to a fixed range and normalizing the trust vector, which enforces scale invariance and stabilizes training. Weighted losses further diminish the impact of noisy experts.
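The trust dynamics above can be reproduced in a toy setting. The sketch below uses illustrative assumptions not taken from the paper: fixed positive return gaps Δ, plain gradient descent on α alone, and no bounding or normalization. It shows α driven positive for a reliable expert, toward zero for a noisy one, and negative for an adversarial one:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fixed return gaps Delta = R(tau+) - R(tau-), signed so tau+ is truly better
n = 2000
delta = np.abs(rng.normal(1.0, 0.5, size=n))

# y = 1 means the expert labels tau+ as preferred
experts = {
    "reliable": np.ones(n),                             # always correct
    "noisy": rng.integers(0, 2, size=n).astype(float),  # coin flips
    "adversarial": np.zeros(n),                         # always flipped
}

alphas = {}
for name, y in experts.items():
    alpha, lr = 0.0, 0.1
    for _ in range(200):
        p = sigmoid(alpha * delta)
        # gradient of -[y log p + (1-y) log(1-p)] w.r.t. alpha is (p - y) * delta
        grad = np.mean((p - y) * delta)
        alpha -= lr * grad
    alphas[name] = alpha

print(alphas)  # reliable: positive, noisy: near zero, adversarial: negative
```

No expert labels or supervision enter the update; the separation emerges purely from agreement or disagreement with the return gaps, mirroring the gradient dynamics described above.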
4. Theoretical Guarantees and Gradient Behavior
TTP’s theoretical analysis includes:
- Logistic Monotonicity Lemma: for fixed Δ ≠ 0, the map α ↦ σ(αΔ) is strictly monotonic in α.
- Identifiability Theorem: Suppose (i) the trajectory graph (nodes: trajectories, edges: compared pairs) is connected and (ii) the expert overlap graph (experts linked if they share any comparison) is connected. Then solutions (θ, α) and (θ′, α′) with identical prediction likelihoods must satisfy R_θ′ = c R_θ + d (c ≠ 0) and α′_k = α_k / c for all k. Thus rewards are identifiable up to an affine transform, and trust parameters rescale inversely.
- Corollary for Bounded Trusts: with trust parameters bounded and normalized, the ambiguity reduces to a global sign flip and an additive reward offset. This justifies the bounded normalization scheme and removes the scale degeneracy.
These properties guarantee expert separation (trust, ignore, flip) emerges naturally from gradient optimization, without explicit expert supervision.
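The scale ambiguity in the identifiability result can be checked numerically: rescaling return gaps by a nonzero constant c and trusts by 1/c leaves every predicted likelihood unchanged. A small illustrative check, not code from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
delta = rng.normal(size=10)          # return gaps R(tau+) - R(tau-)
alpha = np.array([1.2, 0.0, -0.8])   # per-expert trust parameters
c = 3.7                              # arbitrary nonzero rescaling factor

# Rescaling returns by c and trusts by 1/c leaves all likelihoods unchanged;
# an additive reward offset cancels inside the gap and never appears at all.
p_original = sigmoid(np.outer(alpha, delta))
p_rescaled = sigmoid(np.outer(alpha / c, c * delta))
print(np.allclose(p_original, p_rescaled))  # True
```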
5. Empirical Benchmarks and Observed Behavior
TTP’s robustness is demonstrated on manipulation and locomotion domains:
- MetaWorld (Door-Open-v2, Sweep-Into-v2) and DMControl (Cheetah-Run, Walker-Walk).
- Simulated expert pools (B-Pref protocol): K = 4 experts, with a reliability vector whose entries encode the reliable (1), noisy (0), and adversarial (−1) composition. Typical mixtures include [1, 1, 1, −1] (25% adversarial) and [1, 1, 1, 0] (25% noisy).
Comparative baselines:
| Method | Weighting/Filtering | Performance Under Adversarial Mixture ([1,1,1,-1]) | Performance Under Noisy Mixture ([1,1,1,0]) |
|---|---|---|---|
| TTP | Per-expert trust gradient | ≈oracle (∼90% Sweep-Into) | Outperforms PEBBLE/MCP, edges out RIME |
| PEBBLE | Uniform | Fails catastrophically (<20%) | Lower than TTP/RIME |
| RIME | KL filtering | Intermediate (40-60%) | Comparable to TTP, lower later |
| MCP | Mixup smoothing | Intermediate | Lower than TTP/RIME |
| Oracle SAC | True reward | Baseline upper bound | Baseline upper bound |
Key phenomena:
- Trust-parameter trajectories separate the expert classes early in training (~50K steps), and the classification remains stable thereafter.
- Performance depends on the reliable-expert fraction: a few hundred comparisons suffice when reliable experts dominate, but several thousand are required as adversarial/noisy prevalence increases. A plausible implication is that system resilience scales nonlinearly with adversarial/noisy concentration.
6. Practical Integration, Limitations, and Extensions
TTP integrates into existing PBRL pipelines by replacing uniform preference likelihoods with per-expert trust-weighted terms. Only the reward-learning phase changes; policy optimization (e.g., SAC) operates on the learned reward R_θ unmodified.
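The integration point amounts to a one-line change in the reward-learning loss. In the sketch below, `uniform_pref_loss` and `trust_weighted_pref_loss` are hypothetical names for the standard and trust-weighted variants:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def uniform_pref_loss(delta):
    """Standard PBRL reward-learning loss: every label trusted equally."""
    return -np.log(sigmoid(delta))

def trust_weighted_pref_loss(delta, alpha_k):
    """Drop-in TTP-style replacement: same loss, expert k's trust applied."""
    return -np.log(sigmoid(alpha_k * delta))

d = 1.5  # return gap R(tau+) - R(tau-)
# With alpha_k = 1 the two losses coincide; nothing downstream changes
print(uniform_pref_loss(d), trust_weighted_pref_loss(d, 1.0))
```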
Limitations identified:
- A single global cannot capture context-dependent reliability for different comparison types.
- All comparisons are treated equally; the framework does not distinguish informative from ambiguous queries.
Future research directions:
- Context-dependent trust scoring or hierarchical trust structures.
- Active expert selection strategies based on dynamically computed trust.
- Models that disentangle expert reliability from per-comparison uncertainty, allowing separation of annotator noise and inherent label ambiguity.
Overall, TriTrust-PBRL provides a scalable solution for robust reward learning from mixed-quality expert preferences, attaining identifiability guarantees and empirical resilience without relying on explicit expert supervision or engineered features beyond index identification (Hosseini et al., 26 Jan 2026).