The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards

Published 11 Jan 2024 in cs.LG (arXiv:2401.05710v3)

Abstract: The reward signal plays a central role in defining the desired behaviors of agents in reinforcement learning (RL). Rewards collected from realistic environments could be perturbed, corrupted, or noisy due to an adversary, sensor error, or because they come from subjective human feedback. Thus, it is important to construct agents that can learn under such rewards. Existing methodologies for this problem make strong assumptions, including that the perturbation is known in advance, clean rewards are accessible, or that the perturbation preserves the optimal policy. We study a new, more general, class of unknown perturbations, and introduce a distributional reward critic framework for estimating reward distributions and perturbations during training. Our proposed methods are compatible with any RL algorithm. Despite their increased generality, we show that they achieve comparable or better rewards than existing methods in a variety of environments, including those with clean rewards. Under the challenging and generalized perturbations we study, we win/tie the highest return in 44/48 tested settings (compared to 11/48 for the best baseline). Our results broaden and deepen our ability to perform RL in reward-perturbed environments.

References (39)
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  2. Vulnerability of deep reinforcement learning to policy induction attacks. In Machine Learning and Data Mining in Pattern Recognition: 13th International Conference, MLDM 2017, New York, NY, USA, July 15-20, 2017, Proceedings 13, pp.  262–275. Springer, 2017.
  3. A distributional perspective on reinforcement learning. In International conference on machine learning, pp.  449–458. PMLR, 2017.
  4. Provably robust blackbox optimization for reinforcement learning. In Conference on Robot Learning, pp.  683–696. PMLR, 2020.
  5. Reinforcement learning with stochastic reward machines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  6429–6436, 2022.
  6. Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp.  1096–1105. PMLR, 2018a.
  7. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018b.
  8. Reinforcement learning with a corrupted reward channel. arXiv preprint arXiv:1705.08417, 2017.
  9. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
  10. Inverse reward design. Advances in neural information processing systems, 30, 2017.
  11. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
  12. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.
  13. Mobile robot navigation using prioritized experience replay q-learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  2036–2041. IEEE, 2019.
  14. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  15. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017a.
  16. Detecting adversarial attacks on neural network policies with visual foresight. arXiv preprint arXiv:1710.00814, 2017b.
  17. The effects of memory replay in reinforcement learning. In 2018 56th annual allerton conference on communication, control, and computing (Allerton), pp.  478–485. IEEE, 2018.
  18. Robust training under label noise by over-parameterization. In International Conference on Machine Learning, pp.  14153–14172. PMLR, 2022.
  19. Normalized loss functions for deep learning with noisy labels. In International conference on machine learning, pp.  6543–6553. PMLR, 2020.
  20. Ddsketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. arXiv preprint arXiv:1908.10693, 2019.
  21. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  22. Noisy reinforcements in reinforcement learning: some case studies based on gridworlds. In Proceedings of the 6th WSEAS international conference on applied computer science, pp.  296–300, 2006.
  23. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp.  278–287. Citeseer, 1999.
  24. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  25. Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632, 2017.
  26. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pp.  2817–2826. PMLR, 2017.
  27. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  28. Learning rewards to optimize global performance metrics in deep reinforcement learning. arXiv preprint arXiv:2303.09027, 2023.
  29. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, pp.  7974–7984. PMLR, 2020.
  30. Reward estimation for variance reduction in deep reinforcement learning. In Proceedings of The 2nd Conference on Robot Learning, 2018.
  31. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  29–37. PMLR, 2018.
  32. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  33. Reward is enough. Artificial Intelligence, 299:103535, 2021.
  34. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  35. Regression as classification: Influence of task formulation on neural network features. In International Conference on Artificial Intelligence and Statistics, pp.  11563–11582. PMLR, 2023.
  36. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp.  5026–5033. IEEE, 2012.
  37. Reinforcement learning with perturbed rewards. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  6202–6209, 2020.
  38. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31, 2018.
  39. Robust bayesian inverse reinforcement learning with sparse behavior noise. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.

Summary

  • The paper introduces the Distributional Reward Critic (DRC) framework which recasts reward estimation as a classification problem to recover true rewards under arbitrary perturbations.
  • It leverages neural networks and a generalized confusion matrix model to predict and correct perturbed reward distributions, improving robustness in various environments.
  • Experimental results on MuJoCo and discrete control tasks demonstrate that DRC and its variant, GDRC, outperform traditional methods in maintaining policy optimality.


Introduction

This paper addresses the challenge of reinforcement learning (RL) in environments where reward signals are subject to unknown perturbations. Unlike existing methods, which impose strong assumptions (e.g., smoothness or a known perturbation), this work introduces a distributional reward critic framework designed to handle arbitrary perturbations that may discretize and shuffle the reward distribution, provided the true reward remains the most frequently observed value after perturbation. The overarching objective is to equip RL agents with mechanisms to infer the true rewards from noisy observations accurately, ensuring recovery of the optimal policy.

The Distributional Reward Critic (DRC) Approach

Architecture and Methodology

The DRC framework draws on distributional RL techniques, modeling the perturbed reward for each state-action pair as a distribution rather than a point estimate. A neural network predicts the reward distribution, recasting reward estimation as a classification task trained with cross-entropy loss instead of a regression error, inspired by evidence that classification formulations can yield better statistical performance [Stewart et al., "Regression as Classification", 2023].
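As a rough illustration of this recasting (a minimal sketch, not the paper's implementation): continuous rewards are mapped to one of n_r interval classes, the critic outputs a softmax over those classes trained with cross-entropy against the observed reward's bin, and a point estimate is read off the distribution's mode. The function names and the bin-midpoint decoding below are illustrative assumptions.

```python
import numpy as np

def discretize(r, r_min, r_max, n_bins):
    """Map a continuous reward to one of n_bins interval indices."""
    idx = int((r - r_min) / (r_max - r_min) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def cross_entropy(logits, observed_bin):
    """Cross-entropy between the critic's softmax over reward bins
    and the bin of the observed (possibly perturbed) reward."""
    z = logits - logits.max()                     # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[observed_bin]

def predicted_reward(logits, r_min, r_max, n_bins):
    """Point estimate: midpoint of the most probable bin (the mode)."""
    k = int(np.argmax(logits))
    width = (r_max - r_min) / n_bins
    return r_min + (k + 0.5) * width
```

Training the critic then amounts to minimizing `cross_entropy` over observed transitions; the mode-based decoding is what makes the estimate robust when the perturbation merely reshuffles probability mass away from the true bin without displacing its mode.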

Theoretical Insights

The theoretical underpinning is a generalized confusion matrix (GCM) model, which extends previously studied reward-confusion classes to cover continuous reward perturbations. Under the GCM, DRC can recover accurate reward distributions whenever the perturbation is "mode-preserving", i.e., the true reward remains the modal class. The paper shows theoretically that, in this classification framing, a critic with softmax outputs trained on perturbed reward labels reconstructs the true rewards.
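A small numerical check of the mode-preserving idea (using a hypothetical confusion matrix of our own choosing, not one from the paper): as long as each row's diagonal entry is its unique maximum, the true reward class is recoverable as the argmax of the empirical distribution of observed classes, even when the diagonal mass is below 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_r = 4

# Hypothetical mode-preserving confusion matrix: row t gives the distribution
# over observed reward classes when the true class is t. Every diagonal entry
# is the strict row maximum, yet none exceeds 0.5.
C = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.3, 0.4, 0.2],
    [0.1, 0.2, 0.3, 0.4],
])

def recover_true_class(true_class, n_samples=5000):
    """Sample perturbed observations and return the modal observed class."""
    obs = rng.choice(n_r, size=n_samples, p=C[true_class])
    return int(np.argmax(np.bincount(obs, minlength=n_r)))
```

With enough samples per state-action pair, `recover_true_class(t)` returns `t` for every row, which is the intuition behind DRC's recovery guarantee under mode-preserving perturbations.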

Experimental Evaluation

Benchmarking Against Perturbed Environments

Figure 1: The episode reward for $n_r \in \{5, 10, 20\}$ as $n_o$ varies for DRC in Hopper.

DRC matches or surpasses traditional methods across a range of MuJoCo environments and discrete control tasks (e.g., Pendulum and CartPole), winning or tying the highest return in 44 of 48 tested settings, even under perturbations that alter the optimal policy.

Comparative Performance Analysis

DRC is compared against prominent baselines such as reward estimation (RE) and Surrogate Reward (SR). Notably, DRC excels in settings where the reward structure is unknown, as is common in practical applications. The empirical results underline DRC's robustness relative to baselines designed for simpler, less varied perturbation models.

Figure 2: Results on MuJoCo environments under GCM perturbations. Solid-line methods can be applied without any prior information about the perturbation.

Reward Critic Variants and Robustness

The authors observe further gains with the generalized DRC (GDRC), which learns the interval granularity autonomously from the cross-entropy dynamics of the observed data, aligning the output classes with the underlying true reward intervals.
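The summary does not spell out GDRC's granularity-selection mechanics. As one hedged sketch of the underlying idea, a bin count could be chosen by held-out cross-entropy (negative log-likelihood) of a smoothed histogram over observed rewards; the function names and candidate grid here are our own assumptions, not the paper's procedure.

```python
import numpy as np

def heldout_nll(train, val, n_bins, lo, hi):
    """Held-out negative log-likelihood of a smoothed histogram density."""
    edges = np.linspace(lo, hi, n_bins + 1)
    width = (hi - lo) / n_bins
    counts, _ = np.histogram(train, bins=edges)
    probs = (counts + 1.0) / (counts.sum() + n_bins)      # Laplace smoothing
    idx = np.clip(np.digitize(val, edges) - 1, 0, n_bins - 1)
    return float(-np.mean(np.log(probs[idx] / width)))

def select_granularity(samples, candidates=(2, 5, 10, 20, 50)):
    """Pick the bin count whose histogram model best fits held-out rewards."""
    gen = np.random.default_rng(0)
    perm = gen.permutation(len(samples))
    half = len(samples) // 2
    train, val = samples[perm[:half]], samples[perm[half:]]
    lo, hi = float(samples.min()), float(samples.max())
    return min(candidates, key=lambda n: heldout_nll(train, val, n, lo, hi))
```

On rewards drawn from a few discrete levels plus small noise, this criterion rejects bin counts too coarse to separate the levels, which is the qualitative behavior attributed to GDRC's adaptive granularity.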

Conclusion

The introduction of DRC and its GDRC extension marks a significant step in RL's ability to contend with arbitrary, non-trivial reward perturbations. By modeling rewards distributionally, RL systems can more faithfully recover the true signal and thus retain policy optimality despite noisy environments. Future work could balance classification entropy to mitigate reward misestimation, for example through broader sampling strategies or entropy regularization.

Figures

Figure 1 illustrates performance as the granularity of the reward critic's output varies, showcasing the effect of distributional dimensionality on policy efficacy.

Figure 2 highlights GDRC's resilience across perturbation scenarios by comparing training robustness under GCM perturbations.
