The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards

Published 11 Jan 2024 in cs.LG (arXiv:2401.05710v3)

Abstract: The reward signal plays a central role in defining the desired behaviors of agents in reinforcement learning (RL). Rewards collected from realistic environments could be perturbed, corrupted, or noisy due to an adversary, sensor error, or because they come from subjective human feedback. Thus, it is important to construct agents that can learn under such rewards. Existing methodologies for this problem make strong assumptions, including that the perturbation is known in advance, clean rewards are accessible, or that the perturbation preserves the optimal policy. We study a new, more general, class of unknown perturbations, and introduce a distributional reward critic framework for estimating reward distributions and perturbations during training. Our proposed methods are compatible with any RL algorithm. Despite their increased generality, we show that they achieve comparable or better rewards than existing methods in a variety of environments, including those with clean rewards. Under the challenging and generalized perturbations we study, we win/tie the highest return in 44/48 tested settings (compared to 11/48 for the best baseline). Our results broaden and deepen our ability to perform RL in reward-perturbed environments.

References (39)
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  2. Vulnerability of deep reinforcement learning to policy induction attacks. In Machine Learning and Data Mining in Pattern Recognition: 13th International Conference, MLDM 2017, New York, NY, USA, July 15-20, 2017, Proceedings 13, pp.  262–275. Springer, 2017.
  3. A distributional perspective on reinforcement learning. In International conference on machine learning, pp.  449–458. PMLR, 2017.
  4. Provably robust blackbox optimization for reinforcement learning. In Conference on Robot Learning, pp.  683–696. PMLR, 2020.
  5. Reinforcement learning with stochastic reward machines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  6429–6436, 2022.
  6. Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp.  1096–1105. PMLR, 2018a.
  7. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018b.
  8. Reinforcement learning with a corrupted reward channel. arXiv preprint arXiv:1705.08417, 2017.
  9. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
  10. Inverse reward design. Advances in neural information processing systems, 30, 2017.
  11. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
  12. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.
  13. Mobile robot navigation using prioritized experience replay q-learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  2036–2041. IEEE, 2019.
  14. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  15. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017a.
  16. Detecting adversarial attacks on neural network policies with visual foresight. arXiv preprint arXiv:1710.00814, 2017b.
  17. The effects of memory replay in reinforcement learning. In 2018 56th annual allerton conference on communication, control, and computing (Allerton), pp.  478–485. IEEE, 2018.
  18. Robust training under label noise by over-parameterization. In International Conference on Machine Learning, pp.  14153–14172. PMLR, 2022.
  19. Normalized loss functions for deep learning with noisy labels. In International conference on machine learning, pp.  6543–6553. PMLR, 2020.
  20. Ddsketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. arXiv preprint arXiv:1908.10693, 2019.
  21. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  22. Noisy reinforcements in reinforcement learning: some case studies based on gridworlds. In Proceedings of the 6th WSEAS international conference on applied computer science, pp.  296–300, 2006.
  23. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp.  278–287. Citeseer, 1999.
  24. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  25. Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632, 2017.
  26. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pp.  2817–2826. PMLR, 2017.
  27. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  28. Learning rewards to optimize global performance metrics in deep reinforcement learning. arXiv preprint arXiv:2303.09027, 2023.
  29. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, pp.  7974–7984. PMLR, 2020.
  30. Reward estimation for variance reduction in deep reinforcement learning. In Proceedings of The 2nd Conference on Robot Learning, 2018.
  31. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  29–37. PMLR, 2018.
  32. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  33. Reward is enough. Artificial Intelligence, 299:103535, 2021.
  34. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  35. Regression as classification: Influence of task formulation on neural network features. In International Conference on Artificial Intelligence and Statistics, pp.  11563–11582. PMLR, 2023.
  36. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp.  5026–5033. IEEE, 2012.
  37. Reinforcement learning with perturbed rewards. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  6202–6209, 2020.
  38. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31, 2018.
  39. Robust bayesian inverse reinforcement learning with sparse behavior noise. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.

Summary

  • The paper introduces the Distributional Reward Critic (DRC) framework which recasts reward estimation as a classification problem to recover true rewards under arbitrary perturbations.
  • It leverages neural networks and a generalized confusion matrix model to predict and correct perturbed reward distributions, improving robustness in various environments.
  • Experimental results on MuJoCo and discrete control tasks demonstrate that DRC and its variant, GDRC, outperform traditional methods in maintaining policy optimality.


Introduction

This paper addresses the challenge of reinforcement learning (RL) in environments where reward signals are subject to unknown perturbations. Unlike existing methods, which impose strong assumptions (e.g., smoothness or a known perturbation), this work introduces a distributional reward critic framework designed to handle arbitrary perturbations that may discretize and shuffle the reward distribution, provided the true reward remains the most frequently observed value after perturbation. The overarching objective is to equip RL agents with mechanisms to infer the true rewards from noisy observations accurately, ensuring recovery of the optimal policy.

The Distributional Reward Critic (DRC) Approach

Architecture and Methodology

The DRC framework draws on distributional RL techniques, modeling the perturbed reward for each state-action pair as a distribution rather than a point estimate. A neural network predicts the reward distribution, recasting reward estimation as a classification task trained with cross-entropy loss instead of a regression error, inspired by evidence that classification formulations can yield better statistical performance [Stewart et al., "Regression as Classification", 2023].
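As a rough illustration of this recasting (a minimal sketch, not the paper's implementation): continuous rewards are mapped to one of n_r interval classes, the critic outputs a softmax over those classes trained with cross-entropy against the observed reward's bin, and a point estimate is read off the distribution's mode. The function names and the bin-midpoint decoding below are illustrative assumptions.

```python
import numpy as np

def discretize(r, r_min, r_max, n_bins):
    """Map a continuous reward to one of n_bins interval indices."""
    idx = int((r - r_min) / (r_max - r_min) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def cross_entropy(logits, observed_bin):
    """Cross-entropy between the critic's softmax over reward bins
    and the bin of the observed (possibly perturbed) reward."""
    z = logits - logits.max()                     # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[observed_bin]

def predicted_reward(logits, r_min, r_max, n_bins):
    """Point estimate: midpoint of the most probable bin (the mode)."""
    k = int(np.argmax(logits))
    width = (r_max - r_min) / n_bins
    return r_min + (k + 0.5) * width
```

Training the critic then amounts to minimizing `cross_entropy` over observed transitions; the mode-based decoding is what makes the estimate robust when the perturbation merely reshuffles probability mass away from the true bin without displacing its mode.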

Theoretical Insights

The theoretical underpinning is a generalized confusion matrix (GCM) model, which extends previously studied reward-confusion classes to cover continuous reward perturbations. Under the GCM, DRC can recover accurate reward distributions whenever the perturbation is "mode-preserving", i.e., the true reward remains the modal class. The paper shows theoretically that, in this classification framing, a critic with softmax outputs trained on perturbed reward labels reconstructs the true rewards.
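A small numerical check of the mode-preserving idea (using a hypothetical confusion matrix of our own choosing, not one from the paper): as long as each row's diagonal entry is its unique maximum, the true reward class is recoverable as the argmax of the empirical distribution of observed classes, even when the diagonal mass is below 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_r = 4

# Hypothetical mode-preserving confusion matrix: row t gives the distribution
# over observed reward classes when the true class is t. Every diagonal entry
# is the strict row maximum, yet none exceeds 0.5.
C = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.3, 0.4, 0.2],
    [0.1, 0.2, 0.3, 0.4],
])

def recover_true_class(true_class, n_samples=5000):
    """Sample perturbed observations and return the modal observed class."""
    obs = rng.choice(n_r, size=n_samples, p=C[true_class])
    return int(np.argmax(np.bincount(obs, minlength=n_r)))
```

With enough samples per state-action pair, `recover_true_class(t)` returns `t` for every row, which is the intuition behind DRC's recovery guarantee under mode-preserving perturbations.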

Experimental Evaluation

Benchmarking Against Perturbed Environments

Figure 1: The episode reward for $n_r \in \{5, 10, 20\}$ as $n_o$ varies for DRC in Hopper.

DRC matches or surpasses traditional methods across a range of MuJoCo environments and discrete control tasks (e.g., Pendulum and CartPole), winning or tying the highest return in 44 of 48 tested settings, even under perturbations that alter the optimal policy.

Comparative Performance Analysis

DRC is compared against prominent baselines such as reward estimation (RE) and Surrogate Reward (SR). Notably, DRC excels in settings where the reward structure is unknown, as is common in practical applications. The empirical results underline DRC's robustness relative to baselines designed for simpler, less varied perturbation models.

Figure 2: Results on MuJoCo environments under GCM perturbations. Solid-line methods can be applied without any prior information about the perturbation.

Reward Critic Variants and Robustness

The authors observe further gains with the generalized DRC (GDRC), which learns the interval granularity autonomously from the cross-entropy dynamics of the observed data, aligning the output classes with the underlying true reward intervals.
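The summary does not spell out GDRC's granularity-selection mechanics. As one hedged sketch of the underlying idea, a bin count could be chosen by held-out cross-entropy (negative log-likelihood) of a smoothed histogram over observed rewards; the function names and candidate grid here are our own assumptions, not the paper's procedure.

```python
import numpy as np

def heldout_nll(train, val, n_bins, lo, hi):
    """Held-out negative log-likelihood of a smoothed histogram density."""
    edges = np.linspace(lo, hi, n_bins + 1)
    width = (hi - lo) / n_bins
    counts, _ = np.histogram(train, bins=edges)
    probs = (counts + 1.0) / (counts.sum() + n_bins)      # Laplace smoothing
    idx = np.clip(np.digitize(val, edges) - 1, 0, n_bins - 1)
    return float(-np.mean(np.log(probs[idx] / width)))

def select_granularity(samples, candidates=(2, 5, 10, 20, 50)):
    """Pick the bin count whose histogram model best fits held-out rewards."""
    gen = np.random.default_rng(0)
    perm = gen.permutation(len(samples))
    half = len(samples) // 2
    train, val = samples[perm[:half]], samples[perm[half:]]
    lo, hi = float(samples.min()), float(samples.max())
    return min(candidates, key=lambda n: heldout_nll(train, val, n, lo, hi))
```

On rewards drawn from a few discrete levels plus small noise, this criterion rejects bin counts too coarse to separate the levels, which is the qualitative behavior attributed to GDRC's adaptive granularity.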

Conclusion

The introduction of DRC and its GDRC extension marks a significant step in RL's ability to contend with arbitrary, non-trivial reward perturbations. By modeling rewards distributionally, RL systems can more faithfully recover the true signal and thus retain policy optimality despite noisy environments. Future work could balance classification entropy to mitigate reward misestimation, for example through broader sampling strategies or entropy regularization.

Figures

Figure 1 illustrates performance as the granularity of the reward critic's output varies, showcasing the effect of distributional dimensionality on policy efficacy.

Figure 2 highlights GDRC's resilience across perturbation scenarios by comparing training robustness under GCM perturbations.
