Regularized Q-Learning with Linear Function Approximation
Abstract: Regularized Markov decision processes (MDPs) serve as models of sequential decision making under uncertainty in which the decision maker has limited information-processing capacity and/or aversion to model ambiguity. With function approximation, the convergence properties of learning algorithms for regularized MDPs (e.g., soft Q-learning) are not well understood, because the composition of the regularized Bellman operator with a projection onto the span of basis vectors is not a contraction with respect to any norm. In this paper, we consider a bi-level optimization formulation of regularized Q-learning with linear function approximation. The lower-level optimization problem aims to identify a value function approximation that satisfies Bellman's recursive optimality condition, while the upper level aims to find the projection onto the span of basis vectors. This formulation motivates a single-loop algorithm with finite-time convergence guarantees. The algorithm operates on two time-scales: updates to the projection of state-action values are 'slow' in that they are implemented with a step size that is smaller than the one used for the 'faster' updates of approximate solutions to Bellman's recursive optimality equation. We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise. In addition, we provide a performance guarantee for the policies derived from the proposed algorithm.
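As a rough illustration of the two-timescale structure described in the abstract, the sketch below implements entropy-regularized (soft) Q-learning with linear function approximation, using a fast parameter for the approximate Bellman solution and a slow parameter for the projection/target. This is a minimal sketch under stated assumptions, not the paper's algorithm: the specific update rules, the softmax behavior policy, and all names (`phi`, `theta`, `omega`, `tau`, `alpha`, `beta`, and the `env` interface) are illustrative assumptions.

```python
import numpy as np

# Minimal illustrative sketch (not the paper's exact algorithm): two-timescale
# soft Q-learning with linear function approximation. theta is the "fast"
# iterate that chases the regularized (soft) Bellman target built from the
# "slow" iterate omega, which in turn slowly tracks theta. All names
# (phi, theta, omega, tau, alpha, beta, env) are illustrative assumptions.

def soft_value(q_values, tau):
    """Entropy-regularized (soft) state value: tau * logsumexp(Q / tau)."""
    m = np.max(q_values)
    return m + tau * np.log(np.sum(np.exp((q_values - m) / tau)))

def two_timescale_soft_q(env, phi, num_actions, dim,
                         tau=0.1, gamma=0.99,
                         alpha=1e-2, beta=1e-3, num_steps=100_000):
    theta = np.zeros(dim)   # fast iterate: approximate Bellman solution
    omega = np.zeros(dim)   # slow iterate: projection / target parameter
    s = env.reset()
    for _ in range(num_steps):
        # Behavior policy: softmax (Boltzmann) over current fast estimates.
        q_s = np.array([phi(s, a) @ theta for a in range(num_actions)])
        p = np.exp((q_s - q_s.max()) / tau)
        a = np.random.choice(num_actions, p=p / p.sum())
        s_next, r, done = env.step(a)  # assumed (state, reward, done) interface

        # Soft Bellman target computed with the slow parameter omega.
        q_next = np.array([phi(s_next, b) @ omega for b in range(num_actions)])
        target = r + (0.0 if done else gamma * soft_value(q_next, tau))

        # Fast update (larger step size alpha): TD-style step toward the target.
        feat = phi(s, a)
        theta += alpha * (target - feat @ theta) * feat

        # Slow update (smaller step size beta << alpha): omega tracks theta.
        omega += beta * (theta - omega)

        s = env.reset() if done else s_next
    return theta, omega
```

In this sketch, beta << alpha is what makes the projection updates 'slow' relative to the 'faster' Bellman-equation updates, mirroring the step-size separation described in the abstract.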