
A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning

Published 13 Jun 2023 in cs.LG and cs.AI (arXiv:2306.07541v2)

Abstract: Offline reinforcement learning (RL) offers a promising way to train an agent entirely from previously collected data. However, constrained by the limited quality of the offline dataset, its performance is often sub-optimal. It is therefore desirable to further finetune the agent with additional online interaction before deployment. Unfortunately, offline-to-online RL faces two main challenges: constrained exploratory behavior and state-action distribution shift. To this end, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solutions to both challenges through the tool of uncertainty. Specifically, SUNG quantifies uncertainty with a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy that selects informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method that applies conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples, smoothly bridging the offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in the D4RL benchmark.
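
The abstract describes three mechanisms: a VAE that scores state-action uncertainty via visitation density, an optimistic exploration rule that picks actions with both high value and high uncertainty, and an adaptive exploitation rule that penalizes only high-uncertainty samples. The PyTorch sketch below illustrates how these pieces could fit together; it is not the authors' released code. The class and function names (StateActionVAE, optimistic_action, adaptive_critic_loss), the rank-sum action scoring, the 0.9 uncertainty quantile, and all network sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StateActionVAE(nn.Module):
    """VAE over (s, a) pairs; the per-sample negative ELBO (reconstruction
    error plus KL) serves as the state-action uncertainty estimate."""

    def __init__(self, state_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        in_dim = state_dim + action_dim
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def uncertainty(self, s, a):
        x = torch.cat([s, a], dim=-1)
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        recon_err = F.mse_loss(self.dec(z), x, reduction='none').sum(-1)
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)
        return recon_err + kl  # higher = less visited = more uncertain


def optimistic_action(policy, critic, vae, s, n_candidates=20, noise=0.1):
    """Optimistic exploration: perturb the policy's action into candidates,
    then pick the one ranking highest on both Q-value and uncertainty."""
    with torch.no_grad():
        s_rep = s.unsqueeze(0).expand(n_candidates, -1)
        a = policy(s_rep)
        a = (a + noise * torch.randn_like(a)).clamp(-1.0, 1.0)
        q = critic(s_rep, a).squeeze(-1)
        u = vae.uncertainty(s_rep, a)
        # Rank-sum aggregation (an assumption here): favors actions that
        # score high on value AND uncertainty simultaneously.
        score = q.argsort().argsort() + u.argsort().argsort()
        return a[score.argmax()]


def adaptive_critic_loss(vae, q_pred, td_target, conservative_penalty,
                         s, a, quantile=0.9):
    """Adaptive exploitation: standard TD loss everywhere, plus a
    conservative (e.g. CQL-style) penalty only on high-uncertainty samples."""
    with torch.no_grad():
        u = vae.uncertainty(s, a)
        high_u = (u > u.quantile(quantile)).float()  # 1 for uncertain samples
    td_loss = F.mse_loss(q_pred, td_target, reduction='none')
    return (td_loss + high_u * conservative_penalty).mean()
```

The concrete conservative penalty and finetuning schedule would come from whichever base offline RL method SUNG is wrapped around, since the abstract states the framework combines with different offline RL algorithms.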
