Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning
Abstract: Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures safety during learning and therefore has broad application prospects. However, directly applying standard off-policy reinforcement learning algorithms in the offline setting usually fails, because out-of-distribution (OOD) state-action pairs lead to inaccurate Q-value estimates. Penalizing the Q-values of OOD state-action pairs is an effective way to address this problem. Among such methods, count-based approaches have achieved good results in discrete domains with a simple form. Inspired by this, a novel pseudo-count method for continuous domains, the Grid-Mapping Pseudo-Count method (GPC), is proposed by extending count-based methods from discrete to continuous domains. First, the continuous state and action spaces are mapped to a discrete space via grid mapping, and the Q-values of OOD state-action pairs are then constrained through the resulting pseudo-counts. Second, a theoretical proof shows that GPC obtains an appropriate uncertainty constraint under fewer assumptions than other pseudo-count methods. Third, GPC is combined with the Soft Actor-Critic algorithm (SAC) to obtain a new algorithm, GPC-SAC. Finally, experiments on the D4RL datasets show that GPC-SAC achieves better performance with lower computational cost than other algorithms that constrain Q-values.
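The abstract names the mechanism but not its details. Below is a minimal Python sketch of how a grid-mapping pseudo-count and the resulting Q-value penalty could be realized; the class name `GridPseudoCount`, the number of bins, and the `1/sqrt(n + 1)` penalty form are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

class GridPseudoCount:
    """Illustrative grid-mapping pseudo-count (an assumed form, not the authors' code).

    Each dimension of the concatenated (state, action) vector is divided into
    `num_bins` equal cells between known lower/upper bounds, and a visit counter
    is kept per occupied cell.
    """

    def __init__(self, low, high, num_bins=20):
        self.low = np.asarray(low, dtype=np.float64)
        self.high = np.asarray(high, dtype=np.float64)
        self.num_bins = num_bins
        self.counts = {}  # maps a cell-index tuple -> visit count

    def _cell(self, state, action):
        x = np.concatenate([state, action])
        # Map each coordinate into {0, ..., num_bins - 1}.
        idx = np.floor((x - self.low) / (self.high - self.low) * self.num_bins)
        idx = np.clip(idx, 0, self.num_bins - 1).astype(int)
        return tuple(idx)

    def update(self, state, action):
        # Called once per dataset transition to accumulate pseudo-counts.
        cell = self._cell(state, action)
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def count(self, state, action):
        return self.counts.get(self._cell(state, action), 0)

    def penalty(self, state, action, beta=1.0):
        # Larger penalty for rarely (or never) visited state-action cells; the
        # 1/sqrt(n + 1) bonus shape mirrors common count-based methods and is an
        # assumption here, not necessarily the paper's exact formula.
        return beta / np.sqrt(self.count(state, action) + 1.0)
```

In a SAC-style critic update, such a penalty could be subtracted when forming the Bellman target, e.g. `target = r - penalty(s', a') + gamma * (min(Q1', Q2') - alpha * log_pi(a'|s'))`, so that Q-values of rarely visited state-action cells are pushed down; this target form is again an assumption based on the abstract's description, not the paper's stated update rule.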