A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning
Abstract: Offline reinforcement learning (RL) provides a promising solution for learning an agent in a fully data-driven paradigm. However, constrained by the limited quality of the offline dataset, the resulting performance is often sub-optimal, and it is therefore desirable to further finetune the agent via extra online interactions before deployment. Unfortunately, offline-to-online RL faces two main challenges: constrained exploratory behavior and state-action distribution shift. To this end, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solutions to both challenges with the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy that selects informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method that smoothly bridges the offline and online stages by applying conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in the D4RL benchmark.
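To make the abstract's three components concrete, the following is a minimal sketch in PyTorch. It is an illustrative reconstruction, not the authors' released implementation: the class and function names, the assumed `q_net(s, a)` and `sample_action(s)` interfaces, and the simple threshold gate `unc > tau` are all assumptions for illustration, and the paper's actual scoring and gating rules may differ.

```python
import torch
import torch.nn as nn


class StateActionVAE(nn.Module):
    """VAE over (state, action) pairs. A high per-sample negative ELBO is
    treated as low visitation density, i.e. high epistemic uncertainty
    (a common proxy; the exact estimator here is an assumption)."""

    def __init__(self, state_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        sa_dim = state_dim + action_dim
        self.encoder = nn.Sequential(
            nn.Linear(sa_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))  # outputs [mu, log_var]
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, sa_dim))

    def uncertainty(self, state, action):
        x = torch.cat([state, action], dim=-1)
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        recon = self.decoder(z)
        rec_err = ((recon - x) ** 2).sum(-1)                   # reconstruction
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)
        return rec_err + kl                                    # per-sample -ELBO


def optimistic_action(vae, q_net, sample_action, state, n_cand=20, beta=1.0):
    """Optimistic exploration: among candidates drawn from a stochastic
    policy, pick the action scoring high on BOTH Q-value and uncertainty.
    `q_net(s, a) -> (batch,)` and `sample_action(s) -> (batch, act_dim)`
    are assumed interfaces."""
    cands = torch.stack([sample_action(state) for _ in range(n_cand)])  # (N,B,A)
    states = state.unsqueeze(0).expand(n_cand, -1, -1)                  # (N,B,S)
    score = q_net(states, cands) + beta * vae.uncertainty(states, cands)
    best = score.argmax(dim=0)                                          # (B,)
    idx = best.view(1, -1, 1).expand(1, -1, cands.size(-1))
    return cands.gather(0, idx).squeeze(0)                              # (B,A)


def adaptive_critic_loss(td_loss, conservative_term, unc, tau):
    """Adaptive exploitation: apply the conservative (offline-style) penalty
    only to samples whose uncertainty exceeds a threshold tau; use the
    standard TD objective elsewhere. All inputs are per-sample tensors."""
    gate = (unc > tau).float()
    return (td_loss + gate * conservative_term).mean()
```

The design choice this sketch highlights is that a single uncertainty estimate drives both sides of the framework: it is added to the Q-value as an exploration bonus during action selection, and it gates the conservative penalty per sample during training, so the objective interpolates between offline-style and standard online RL.

References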
- Uncertainty-based offline reinforcement learning with diversified Q-ensemble. Advances in Neural Information Processing Systems, 34:7436–7447, 2021.
- Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
- Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. In International Conference on Learning Representations, 2022.
- R-max: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Offline reinforcement learning at multiple frequencies. In Conference on Robot Learning, pages 2041–2051, 2023.
- Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
- UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
- Better exploration with optimistic actor critic. Advances in Neural Information Processing Systems, 32, 2019.
- RvS: What is essential for offline RL via supervised learning? In International Conference on Learning Representations, 2022.
- D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
- Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596, 2018.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.
- Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters. Advances in Neural Information Processing Systems, 35:18267–18281, 2022.
- Model-based offline reinforcement learning with pessimism-modulated dynamics belief. Advances in Neural Information Processing Systems, 35:449–461, 2022.
- Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
- MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33:21810–21823, 2020.
- Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
- An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
- Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pages 5774–5783, 2021.
- Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2022.
- Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
- Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- SUNRISE: A simple unified framework for ensemble learning in deep reinforcement learning. In International Conference on Machine Learning, pages 6131–6141, 2021.
- Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712, 2022.
- Clinical decision transformer: Intended treatment recommendation through goal prompting. arXiv preprint arXiv:2302.00612, 2023.
- Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- A review of uncertainty for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 18, pages 155–162, 2022.
- Challenges and opportunities in offline reinforcement learning from visual observations. arXiv preprint arXiv:2206.04779, 2022.
- Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions. arXiv preprint arXiv:2303.17396, 2023.
- Mildly conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
- MOORe: Model-based offline-to-online reinforcement learning. arXiv preprint arXiv:2201.10070, 2022.
- Fine-tuning offline policies with optimistic action selection. In Deep Reinforcement Learning Workshop NeurIPS, 2022.
- AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
- Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. arXiv preprint arXiv:2303.05479, 2023.
- Offline reinforcement learning as anti-exploration. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8106–8114, 2022.
- Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
- Reinforcement learning upside down: Don't predict rewards – just map them to actions. arXiv preprint arXiv:1912.02875, 2019.
- Supported policy optimization for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
- Uncertainty weighted actor-critic for offline reinforcement learning. In International Conference on Machine Learning, pages 11319–11328, 2021.
- A policy-guided imitation approach for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:4085–4098, 2022.
- Offline RL with no OOD actions: In-sample learning via implicit value regularization. In International Conference on Learning Representations, 2023.
- MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
- Actor-critic alignment for offline-to-online reinforcement learning. 2023.
- Policy expansion for bridging offline-to-online reinforcement learning. In International Conference on Learning Representations, 2023.
- User retention-oriented recommendation with decision transformer. In Proceedings of the ACM Web Conference, 2023.
- Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning. arXiv preprint arXiv:2210.13846, 2022.
- Adaptive policy learning for offline-to-online reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
- Online decision transformer. In International Conference on Machine Learning, pages 27042–27059, 2022.
- Behavior proximal policy optimization. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3c13LptpIph.