TriTrust-PBRL: Robust Reward Learning
- TriTrust-PBRL is a unified framework that infers reward functions from trajectory comparisons provided by heterogeneous experts.
- It employs a shared neural reward model combined with expert-specific trust parameters to classify annotations as reliable, noisy, or adversarial using gradient-based dynamics.
- Empirical evaluations on manipulation and locomotion tasks show that TriTrust-PBRL (TTP) outperforms existing PBRL methods under adversarial and noisy conditions.
TriTrust-PBRL (TTP) is a unified framework for preference-based reinforcement learning (PBRL) designed to robustly learn reward functions from trajectory comparisons labeled by multiple, heterogeneous experts exhibiting varying reliability profiles. TTP jointly learns a shared reward model and scalar trust parameters for each expert, automatically separating reliable, noisy, and adversarial annotators via gradient-based dynamics. The approach demonstrates theoretical identifiability properties and empirical robustness in scenarios where existing PBRL methods fail under adversarial corruption (Hosseini et al., 26 Jan 2026).
1. Problem Setting and Formalization
PBRL seeks to infer a reward function from data comprising trajectory pairs (τ⁺, τ⁻) with binary labels indicating expert preference. In practical crowdsourcing, preference data originates from K distinct experts, each annotating a private, possibly disjoint, set of comparisons. Annotators exhibit three reliability profiles:
- Accurate: labeling aligns with the unknown true return R* (accuracy ≈ 100%).
- Noisy: labeling is random (accuracy ≈ 50%).
- Adversarial: labeling systematically flips the truth (accuracy ≈ 0%).
This heterogeneity renders simple uniform aggregation or filtering insufficient in the presence of adversarial experts.
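The three annotator classes can be simulated directly. The sketch below is illustrative, not code from the paper; `label_pair` and the expert-type names are hypothetical conveniences:

```python
import numpy as np

rng = np.random.default_rng(0)

def label_pair(return_a, return_b, expert_type, rng):
    """Binary preference label (1 means trajectory A preferred) from one expert."""
    truth = 1 if return_a > return_b else 0
    if expert_type == "accurate":
        return truth                    # aligns with the true return
    if expert_type == "noisy":
        return int(rng.random() < 0.5)  # coin flip: ~50% accuracy
    if expert_type == "adversarial":
        return 1 - truth                # systematic flip: ~0% accuracy
    raise ValueError(expert_type)

# One comparison, labeled once by each annotator class
returns = rng.normal(size=2)
labels = {t: label_pair(returns[0], returns[1], t, rng)
          for t in ("accurate", "noisy", "adversarial")}
print(labels)
```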
2. Model Architecture and Parameterization
TTP employs:
- Shared Reward Model: r_θ(s, a), a neural network (commonly an MLP), governed by parameters θ.
- Expert-specific Trust Parameters: a scalar α_k for each expert k. Its interpretation:
  - α_k > 0: trust and amplify expert k's preferences.
  - α_k ≈ 0: ignore expert k (uninformative/noisy).
  - α_k < 0: systematically invert adversarial preferences.

The aggregated return of a trajectory τ is R_θ(τ) = Σ_t r_θ(s_t, a_t). For each comparison (τ⁺, τ⁻) labeled by expert k, the preference likelihood is modeled as P(τ⁺ ≻ τ⁻ | k) = σ(α_k Δ_θ), where σ(x) = 1/(1 + e⁻ˣ) and Δ_θ = R_θ(τ⁺) − R_θ(τ⁻).
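A minimal sketch of the trust-weighted preference likelihood, assuming the trajectory return sums per-step rewards (the function names here are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def trajectory_return(step_rewards):
    """R_theta(tau): sum of predicted per-step rewards along a trajectory."""
    return float(np.sum(step_rewards))

def preference_likelihood(r_plus, r_minus, alpha_k):
    """P(tau+ preferred | expert k) = sigma(alpha_k * (R(tau+) - R(tau-)))."""
    return sigmoid(alpha_k * (r_plus - r_minus))

r_plus = trajectory_return([1.0, 0.5, 0.5])   # R(tau+) = 2.0
r_minus = trajectory_return([0.0, 0.0, 0.0])  # R(tau-) = 0.0

print(preference_likelihood(r_plus, r_minus, 1.0))   # > 0.5: amplified agreement
print(preference_likelihood(r_plus, r_minus, 0.0))   # = 0.5: expert ignored
print(preference_likelihood(r_plus, r_minus, -1.0))  # < 0.5: preference inverted
```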
3. Joint Objective and Learning Algorithm
The training objective integrates per-expert trust through a negative log-likelihood over every expert's comparison set D_k:

L(θ, α) = − Σ_k Σ_{(τ⁺, τ⁻) ∈ D_k} log σ(α_k Δ_θ), where Δ_θ = R_θ(τ⁺) − R_θ(τ⁻).

Gradients for each sample inform updates:
- For θ: ∇_θ L = −(1 − σ(α_k Δ_θ)) α_k ∇_θ Δ_θ.
- For α_k: ∂L/∂α_k = −(1 − σ(α_k Δ_θ)) Δ_θ.
Key trust parameter dynamics:
- Reliable experts: pushed larger and positive.
- Noisy experts: remains near zero.
- Adversarial experts: driven negative, inverting preference.
Practical enhancements include bounding each α_k to a fixed range and normalizing the trust vector, which enforces scale invariance and stabilizes training. Weighted losses further diminish the impact of noisy experts.
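The trust dynamics above can be reproduced in a toy setting. The sketch below uses illustrative assumptions not taken from the paper: fixed positive return gaps Δ, plain gradient descent on α alone, and no bounding or normalization. It shows α driven positive for a reliable expert, toward zero for a noisy one, and negative for an adversarial one:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fixed return gaps Delta = R(tau+) - R(tau-), signed so tau+ is truly better
n = 2000
delta = np.abs(rng.normal(1.0, 0.5, size=n))

# y = 1 means the expert labels tau+ as preferred
experts = {
    "reliable": np.ones(n),                             # always correct
    "noisy": rng.integers(0, 2, size=n).astype(float),  # coin flips
    "adversarial": np.zeros(n),                         # always flipped
}

alphas = {}
for name, y in experts.items():
    alpha, lr = 0.0, 0.1
    for _ in range(200):
        p = sigmoid(alpha * delta)
        # gradient of -[y log p + (1-y) log(1-p)] w.r.t. alpha is (p - y) * delta
        grad = np.mean((p - y) * delta)
        alpha -= lr * grad
    alphas[name] = alpha

print(alphas)  # reliable: positive, noisy: near zero, adversarial: negative
```

No expert labels or supervision enter the update; the separation emerges purely from agreement or disagreement with the return gaps, mirroring the gradient dynamics described above.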
4. Theoretical Guarantees and Gradient Behavior
TTP’s theoretical analysis includes:
- Logistic Monotonicity Lemma: for fixed Δ ≠ 0, the map α ↦ σ(αΔ) is strictly monotonic in α.
- Identifiability Theorem: Suppose (i) the trajectory graph (nodes: trajectories, edges: compared pairs) is connected and (ii) the expert overlap graph (experts linked if they share any comparison) is connected. Then solutions (θ, α) and (θ′, α′) with identical prediction likelihoods must satisfy R_θ′ = c R_θ + d (c ≠ 0) and α′_k = α_k / c for all k. Thus rewards are identifiable up to an affine transform, and trust parameters rescale inversely.
- Corollary for Bounded Trusts: with trust parameters bounded and normalized, the ambiguity reduces to a global sign flip and an additive reward offset. This justifies the bounded normalization scheme and removes the scale degeneracy.
These properties guarantee expert separation (trust, ignore, flip) emerges naturally from gradient optimization, without explicit expert supervision.
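The scale ambiguity in the identifiability result can be checked numerically: rescaling return gaps by a nonzero constant c and trusts by 1/c leaves every predicted likelihood unchanged. A small illustrative check, not code from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
delta = rng.normal(size=10)          # return gaps R(tau+) - R(tau-)
alpha = np.array([1.2, 0.0, -0.8])   # per-expert trust parameters
c = 3.7                              # arbitrary nonzero rescaling factor

# Rescaling returns by c and trusts by 1/c leaves all likelihoods unchanged;
# an additive reward offset cancels inside the gap and never appears at all.
p_original = sigmoid(np.outer(alpha, delta))
p_rescaled = sigmoid(np.outer(alpha / c, c * delta))
print(np.allclose(p_original, p_rescaled))  # True
```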
5. Empirical Benchmarks and Observed Behavior
TTP’s robustness is demonstrated on manipulation and locomotion domains:
- MetaWorld (Door-Open-v2, Sweep-Into-v2) and DMControl (Cheetah-Run, Walker-Walk).
- Simulated expert pools (B-Pref protocol): K = 4 experts, with a reliability vector whose entries encode the reliable (1), noisy (0), and adversarial (−1) composition. Typical mixtures include [1, 1, 1, −1] (25% adversarial) and [1, 1, 1, 0] (25% noisy).
Comparative baselines:
| Method | Weighting/Filtering | Performance Under Adversarial Mixture ([1,1,1,-1]) | Performance Under Noisy Mixture ([1,1,1,0]) |
|---|---|---|---|
| TTP | Per-expert trust gradient | ≈oracle (∼90% Sweep-Into) | Outperforms PEBBLE/MCP, edges out RIME |
| PEBBLE | Uniform | Fails catastrophically (<20%) | Lower than TTP/RIME |
| RIME | KL filtering | Intermediate (40-60%) | Comparable to TTP, lower later |
| MCP | Mixup smoothing | Intermediate | Lower than TTP/RIME |
| Oracle SAC | True reward | Baseline upper bound | Baseline upper bound |
Key phenomena:
- Trust-parameter trajectories separate the expert classes early in training (~50K steps), and the classification remains stable thereafter.
- Performance depends on the reliable-expert fraction: a few hundred comparisons suffice when reliable experts dominate, but several thousand are required as adversarial/noisy prevalence increases. A plausible implication is that system resilience scales nonlinearly with adversarial/noisy concentration.
6. Practical Integration, Limitations, and Extensions
TTP integrates into existing PBRL pipelines by replacing uniform preference likelihoods with per-expert trust-weighted terms. Only the reward-learning phase changes; policy optimization (e.g., SAC) operates on the learned reward R_θ unmodified.
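The integration point amounts to a one-line change in the reward-learning loss. In the sketch below, `uniform_pref_loss` and `trust_weighted_pref_loss` are hypothetical names for the standard and trust-weighted variants:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def uniform_pref_loss(delta):
    """Standard PBRL reward-learning loss: every label trusted equally."""
    return -np.log(sigmoid(delta))

def trust_weighted_pref_loss(delta, alpha_k):
    """Drop-in TTP-style replacement: same loss, expert k's trust applied."""
    return -np.log(sigmoid(alpha_k * delta))

d = 1.5  # return gap R(tau+) - R(tau-)
# With alpha_k = 1 the two losses coincide; nothing downstream changes
print(uniform_pref_loss(d), trust_weighted_pref_loss(d, 1.0))
```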
Limitations identified:
- A single global cannot capture context-dependent reliability for different comparison types.
- All comparisons are treated equally; the framework does not distinguish informative from ambiguous queries.
Future research directions:
- Context-dependent trust scoring or hierarchical trust structures.
- Active expert selection strategies based on dynamically computed trust.
- Models that disentangle expert reliability from per-comparison uncertainty, allowing separation of annotator noise and inherent label ambiguity.
Overall, TriTrust-PBRL provides a scalable solution for robust reward learning from mixed-quality expert preferences, attaining identifiability guarantees and empirical resilience without relying on explicit expert supervision or engineered features beyond index identification (Hosseini et al., 26 Jan 2026).