Policy Gradient for Robust Markov Decision Processes

Published 29 Oct 2024 in cs.LG and cs.AI (arXiv:2410.22114v2)

Abstract: We develop a generic policy gradient method with a global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems due to their scalability and efficiency, adapting them to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for policy optimization with an adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and offer novel insights into the inner problem solution through Transition Mirror Ascent (TMA). Additionally, we propose parametric transition kernels for both discrete and continuous state-action spaces, broadening the applicability of our approach. Empirical results validate the robustness and global convergence of DRPMD across various challenging robust MDP settings.
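Since the abstract only sketches the algorithm's structure, the following is a minimal illustrative Python sketch of a double-loop scheme in that spirit: an outer policy mirror descent step (KL mirror map, i.e. multiplicative weights) taken against the current worst-case kernel, and an inner "transition mirror ascent" that searches an (s,a)-rectangular uncertainty set for an adversarial kernel. The uncertainty set (a convex mixture around a nominal kernel P0), step sizes, the growing inner-iteration schedule standing in for the paper's adaptive tolerance, and all function names are assumptions made for illustration, not the paper's specification.

```python
# Illustrative sketch only: a double-loop robust policy mirror descent on a
# tabular robust MDP. The uncertainty set, step sizes, and tolerance schedule
# below are assumptions for illustration, not DRPMD's exact specification.
import numpy as np

def evaluate(policy, P, R, gamma, rho):
    """Exact policy evaluation for a fixed kernel P (shape S x A x S).
    Returns the value V and discounted state-action occupancies d(s,a)."""
    S, A = policy.shape
    P_pi = np.einsum('sa,sab->sb', policy, P)               # S x S kernel under pi
    r_pi = np.einsum('sa,sa->s', policy, R)                 # expected one-step reward
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)  # discounted state occupancy
    return V, d_s[:, None] * policy                         # d(s,a) = d(s) * pi(a|s)

def transition_mirror_ascent(policy, Q_adv, P0, R, gamma, rho, kappa, eta, iters):
    """Inner loop: the adversary takes exponentiated-gradient (mirror) steps over
    the mixing kernel Q in the set { (1-kappa)*P0 + kappa*Q }, pushing the
    policy's value down (ascent on the adversary's objective)."""
    for _ in range(iters):
        P = (1 - kappa) * P0 + kappa * Q_adv
        V, d = evaluate(policy, P, R, gamma, rho)
        # dV_rho / dP[s,a,s'] = gamma * d(s,a) * V(s'); chain rule adds a kappa factor.
        grad = gamma * kappa * d[:, :, None] * V[None, None, :]
        Q_adv = Q_adv * np.exp(-eta * grad)                 # descent on value = adversarial ascent
        Q_adv /= Q_adv.sum(axis=2, keepdims=True)           # renormalize each (s,a) simplex
    return (1 - kappa) * P0 + kappa * Q_adv, Q_adv

def drpmd_sketch(P0, R, gamma, rho, kappa=0.2, eta_pi=1.0, eta_p=1.0, outer=100):
    """Outer loop: policy mirror descent (KL mirror map) against the worst-case
    kernel returned by the inner loop, with growing inner effort per iteration."""
    S, A, _ = P0.shape
    policy = np.full((S, A), 1.0 / A)
    Q_adv = P0.copy()
    for k in range(outer):
        # Stand-in for the adaptive tolerance: solve the inner problem more
        # accurately as the outer iteration count k grows.
        P_worst, Q_adv = transition_mirror_ascent(
            policy, Q_adv, P0, R, gamma, rho, kappa, eta_p, iters=5 + k // 10)
        V, _ = evaluate(policy, P_worst, R, gamma, rho)
        Q_sa = R + gamma * np.einsum('sab,b->sa', P_worst, V)   # robust Q-values
        # KL mirror step = multiplicative weights (max-shifted for numerical stability).
        policy = policy * np.exp(eta_pi * (Q_sa - Q_sa.max(axis=1, keepdims=True)))
        policy /= policy.sum(axis=1, keepdims=True)
    return policy
```

The mixture-style ambiguity set is chosen purely for convenience: it keeps every inner projection a simple simplex renormalization. The paper's actual TMA solver, its adaptive-tolerance rule, and its parametric kernels for continuous spaces would replace the fixed mixture and iteration schedule used here.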

