Uniform Last-Iterate Guarantee for Bandits and Reinforcement Learning
Abstract: Existing metrics for reinforcement learning (RL), such as regret, PAC bounds, or uniform-PAC (Dann et al., 2017), typically evaluate cumulative performance while allowing the agent to play an arbitrarily bad policy at any finite time t. Such behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger metric, the uniform last-iterate (ULI) guarantee, which captures both the cumulative and the instantaneous performance of RL algorithms. Specifically, ULI characterizes instantaneous performance by requiring that the per-round suboptimality of the played policy be bounded by a function that is monotonically decreasing in the round t, preventing the agent from revisiting bad policies once sufficient samples are available. We demonstrate that a near-optimal ULI guarantee directly implies near-optimal cumulative performance across the aforementioned metrics, but not the other way around. To examine the achievability of ULI, we first provide two positive results for bandit problems with finitely many arms, showing that elimination-based algorithms and high-probability adversarial algorithms, equipped with stronger analyses or additional designs, can attain near-optimal ULI guarantees. We also provide a negative result, indicating that optimistic algorithms cannot achieve a near-optimal ULI guarantee. Furthermore, we propose an efficient algorithm for linear bandits with infinitely many arms, which achieves the ULI guarantee given access to an optimization oracle. Finally, we propose an algorithm that achieves a near-optimal ULI guarantee for the online reinforcement learning setting.
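The intuition behind why elimination-based algorithms are natural candidates for ULI can be illustrated concretely: once an arm is permanently eliminated, it can never be played again, so the suboptimality of the played arm can only shrink as rounds accumulate. Below is a minimal sketch of successive elimination for a K-armed bandit (not the paper's algorithm; the Gaussian noise model and the particular confidence-radius formula are illustrative assumptions):

```python
import math
import random

def successive_elimination(means, horizon, noise_std=0.1, delta=0.05, seed=0):
    """Illustrative sketch: successive elimination on a K-armed bandit.

    Pulls all active arms round-robin and permanently eliminates any arm
    whose upper confidence bound falls below the best lower confidence
    bound. Because eliminated arms are never revisited, the suboptimality
    of the played arm is non-increasing over phases -- the property that
    a ULI-style guarantee formalizes.
    """
    rng = random.Random(seed)
    k = len(means)
    active = list(range(k))
    counts = [0] * k
    sums = [0.0] * k
    played = []  # arm index played at each round
    t = 0
    while t < horizon:
        # One round-robin pass over the currently active arms.
        for a in list(active):
            if t >= horizon:
                break
            reward = means[a] + rng.gauss(0.0, noise_std)  # assumed noise model
            counts[a] += 1
            sums[a] += reward
            played.append(a)
            t += 1
        # Anytime confidence radius (one standard choice; assumed form).
        def rad(a):
            return noise_std * math.sqrt(
                2.0 * math.log(4.0 * k * counts[a] ** 2 / delta) / counts[a]
            )
        best_lcb = max(sums[a] / counts[a] - rad(a) for a in active)
        # Keep only arms whose UCB still exceeds the best LCB.
        active = [a for a in active if sums[a] / counts[a] + rad(a) >= best_lcb]
    return played, active
```

On a well-separated instance such as `means=[0.9, 0.5, 0.2]`, the two suboptimal arms are eliminated after a modest number of pulls, after which every remaining round plays the best arm; by contrast, an optimistic (UCB-style) algorithm may occasionally revisit a bad arm at any finite time, which is the failure mode the ULI metric rules out.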
- Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.
- Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.
- Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 2013.
- Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(94):2785–2836, 2010.
- Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010.
- Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002a.
- The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
- Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
- High-probability regret bounds for bandit online linear optimization. In Conference on Learning Theory, 2008.
- Generalized policy elimination: an efficient algorithm for nonparametric contextual bandits. In Conference on Uncertainty in Artificial Intelligence, 2020.
- Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, 2014.
- Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
- Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
- Anytime optimal algorithms in stochastic multi-armed bandits. In International Conference on Machine Learning, 2016.
- Last-iterate convergent policy gradient primal-dual methods for constrained MDPs. arXiv preprint arXiv:2306.11700, 2023.
- Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(6), 2006.
- Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, 2019.
- Efficient batched algorithm for contextual linear bandits with large action space via soft elimination. In Advances in Neural Information Processing Systems, 2023.
- Uniform-PAC bounds for reinforcement learning with linear function approximation. In Advances in Neural Information Processing Systems, 2021.
- lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, 2014.
- Efficient pure exploration in adaptive round model. In Advances in Neural Information Processing Systems, 2019.
- PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, 2012.
- The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
- Bandit algorithms. Cambridge University Press, 2020.
- Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, 2020.
- Bias no more: High-probability data-dependent regret bounds for adversarial bandits and MDPs. In Advances in Neural Information Processing Systems, 2020.
- Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously. In International Conference on Machine Learning, 2021.
- A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, 2010.
- Instance-optimal PAC algorithms for contextual bandits. In Advances in Neural Information Processing Systems, 2022.
- ReLOAD: Reinforcement learning with optimistic ascent-descent for last-iterate convergence in constrained MDPs. In International Conference on Machine Learning, 2023.
- Geometric exploration for online control. In Advances in Neural Information Processing Systems, 2020.
- Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30(2):199, 2015.
- Uniform-PAC guarantees for model-based RL with bounded eluder dimension. In Conference on Uncertainty in Artificial Intelligence, 2023.
- Optimal PAC multiple arm identification with applications to crowdsourcing. In International Conference on Machine Learning, 2014.
- Contextual bandits with large action spaces: Made practical. In International Conference on Machine Learning, 2022.
- An optimal algorithm for stochastic and adversarial bandits. In International Conference on Artificial Intelligence and Statistics, 2019.