Near-optimal Per-Action Regret Bounds for Sleeping Bandits
Abstract: We derive near-optimal per-action regret bounds for sleeping bandits, in which both the set of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound carries an extra multiplicative factor of $K\sqrt{\ln{K}}$. We close this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX, and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This yields new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depend primarily on the sum of experts' confidences. We prove a lower bound showing that for any minimax optimal algorithm, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.
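To make the core idea concrete, the following is a minimal, hypothetical sketch of an EXP3-style update that samples only from the arms available ("awake") in each round; it is an illustration of the general approach described in the abstract, not the paper's exact algorithm, and the function names, fixed learning rate `eta`, and callbacks `available`/`loss` are all assumptions for the example.

```python
import numpy as np

def sleeping_exp3(K, T, eta, available, loss, seed=0):
    """Illustrative EXP3-style variant for sleeping bandits (a sketch,
    not the paper's algorithm): exponential weights over awake arms only,
    with an importance-weighted loss estimate for the played arm."""
    w = np.zeros(K)  # log-weights, one per arm
    rng = np.random.default_rng(seed)
    total_loss = 0.0
    for t in range(T):
        A_t = np.asarray(available(t))      # indices of awake arms this round
        logits = w[A_t]
        p = np.exp(logits - logits.max())   # softmax restricted to awake arms
        p /= p.sum()
        i = rng.choice(len(A_t), p=p)       # sample an awake arm
        arm = A_t[i]
        ell = loss(t, arm)                  # observed loss in [0, 1]
        total_loss += ell
        w[arm] -= eta * ell / p[i]          # importance-weighted update
    return total_loss

# Toy usage: all 3 arms always awake, arm 0 is always best.
if __name__ == "__main__":
    tl = sleeping_exp3(
        K=3, T=500, eta=0.1,
        available=lambda t: np.arange(3),
        loss=lambda t, arm: 0.0 if arm == 0 else 1.0,
    )
    print(tl)  # well below the ~333 expected loss of uniform play
```

Because the softmax is renormalized over $A_t$ each round, weights accumulated while an arm sleeps are simply carried forward, which is the mechanism the generalized EXP3/EXP3-IX/FTRL analyses make precise.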