No-Regret Reinforcement Learning with Heavy-Tailed Rewards

Published 25 Feb 2021 in cs.LG and stat.ML | (2102.12769v1)

Abstract: Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging techniques from robust mean estimation, we propose Heavy-UCRL2 and Heavy-Q-Learning, and show that they achieve near-optimal regret bounds in this setting. Our algorithms also naturally generalize to deep reinforcement learning applications; we instantiate Heavy-DQN as an example of this. We demonstrate that all of our algorithms outperform baselines on both synthetic MDPs and standard RL benchmarks.

Abstract PDF Upgrade to Chat

Citations (11)

View on Semantic Scholar

Summary

The paper introduces two RL algorithms, Heavy-UCRL2 and Heavy-Q-Learning, that extend traditional methods to handle heavy-tailed reward distributions with robust mean estimators.
By truncating extreme rewards and adding exploration bonuses, the methods achieve near-optimal regret bounds in challenging Markov Decision Processes.
Experimental results on synthetic and deep RL benchmarks demonstrate improved performance over standard algorithms in high-variability, heavy-tailed settings.

No-Regret Reinforcement Learning with Heavy-Tailed Rewards

Reinforcement Learning (RL) algorithms often assume that reward distributions are light-tailed (e.g., Gaussian or bounded). However, many real-world applications involve rewards that follow heavy-tailed distributions, such as financial systems or network routing times. This paper addresses the challenge of designing RL algorithms that efficiently handle heavy-tailed rewards, introducing Heavy-UCRL2 and Heavy-Q-Learning, which achieve near-optimal regret bounds in this setting.

Figure 1: The DoubleChain MDP, illustrating the transition structure for two actions.

Problem Statement

The reinforcement learning problem is set in the context of a Markov Decision Process (MDP) with heavy-tailed reward distributions. Traditionally, RL algorithms focus on learning transition probabilities, but with heavy-tailed rewards, the learning difficulty shifts towards accurately estimating these rewards. The goal is to maximize the undiscounted return or equivalently, minimize regret.

Proposed Algorithms

Heavy-UCRL2

Heavy-UCRL2 is an extension of UCRL2 for heavy-tailed rewards. It uses robust mean estimators like the truncated mean to construct valid confidence intervals for rewards. This ensures that the algorithm explores effectively even when rewards have large variances. The pseudocode for Heavy-UCRL2 follows a similar structure to UCRL2, but with modifications to accommodate robust mean estimation techniques.

Heavy-Q-Learning

Heavy-Q-Learning adapts $Q$ -learning for environments with heavy-tailed rewards. The reward is truncated, and an additional bonus is added to handle the uncertainty introduced by heavy-tailed distributions. The initialization is prone to using optimistic values, which helps in navigating exploration effectively without misleading updates from extreme reward samples.

Heavy-DQN

This approach extends Heavy-Q-Learning into deep RL settings using DQNs, incorporating adaptive reward truncation via count-based exploration methods like SimHash. This adaptive strategy helps stabilize training dynamics in deep reinforcement learning environments impacted by heavy-tailed noise.

Theoretical Results

The paper presents various theoretical guarantees for the proposed algorithms:

Minimax Regret Bound: Both Heavy-UCRL2 and Heavy-Q-Learning achieve regret bounds of $\tilde{O}((SAT)^{1})$ , where $A$ is the action space and $T$ is the time horizon, indicating robustness against reward distribution heavy-tailedness.
Problem-Dependent Regret: Continued analysis shows logarithmic dependence on the number of state-action pairs, emphasizing efficient learning in environments where extreme outcomes are possible.
Regret Under Changing MDPs: The algorithms maintain performance even when MDP transitions, indicating adaptability to dynamic environments.

Experimental Validation

Extensive experiments demonstrate improved performance of Heavy-UCRL2 and Heavy-Q-Learning across synthetic MDPs like SixArms and DoubleChain (Figure 1), outperforming standard algorithms not designed for heavy-tailed settings. Deep RL experiments validate Heavy-DQN's effectiveness in environments like MountainCar, showcasing robust learning where reward noise follows heavy-tailed distributions.

Conclusion

The introduction of Heavy-UCRL2 and Heavy-Q-Learning represents significant progress in realistically handling environments with heavy-tailed reward distributions. These algorithms effectively apply techniques from robust statistics to RL, underscoring RL's potential in complex real-world applications with extreme variability. The continued exploration of heavy-tailed reward settings holds promise for enhancing RL's applicability in uncertain and dynamic environments.

Markdown Report Issue