Learning Adversarial MDPs with Stochastic Hard Constraints

Published 6 Mar 2024 in cs.LG (arXiv:2403.03672v3)

Abstract: We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first, we address general CMDPs, designing an algorithm that attains sublinear regret and sublinear cumulative positive constraint violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraint violation. Finally, we show that in the last two scenarios a dependence on Slater's parameter is unavoidable. To the best of our knowledge, ours is the first work to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can handle general non-stationary environments subject to requirements far stricter than those manageable by existing approaches, enabling their adoption in a much wider range of applications.
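The performance measures named in the abstract can be stated formally. The following is a sketch using standard CMDP notation (episode losses $\ell_t$, constraint costs $g_i$, thresholds $\alpha_i$, policy set $\Pi$); the symbols are illustrative and may differ from the paper's exact notation.

```latex
% Regret over T episodes against the best feasible policy in hindsight
R_T \coloneqq \sum_{t=1}^{T} \ell_t(\pi_t)
  \;-\; \min_{\pi \in \Pi_{\mathrm{feas}}} \sum_{t=1}^{T} \ell_t(\pi)

% Cumulative positive constraint violation: only excesses over the
% thresholds count, so over-satisfaction in one episode cannot cancel
% a violation in another (the "hard constraints" setting)
V_T \coloneqq \sum_{t=1}^{T} \max_{i \in [m]}
  \bigl[\, g_i(\pi_t) - \alpha_i \,\bigr]^{+}

% Slater's parameter: the feasibility margin of the strictly
% feasible policy assumed in the last two scenarios
\rho \coloneqq \max_{\pi \in \Pi} \min_{i \in [m]}
  \bigl( \alpha_i - g_i(\pi) \bigr) \;>\; 0
```

Under these definitions, the three scenarios respectively guarantee sublinear $R_T$ with sublinear $V_T$, sublinear $R_T$ with per-episode constraint satisfaction at high probability, and sublinear $R_T$ with constant $V_T$, with the latter two bounds necessarily depending on $\rho$.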
