Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration
Abstract: Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine-tuning to enhance both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shift during the transition from the offline to the online phase, and (ii) the difficulty of efficient exploration in the large joint state-action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively leverages the pre-trained offline policy for exploration, thereby significantly reducing the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.
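To make the two mechanisms in the abstract concrete, here is a minimal illustrative sketch, not the paper's actual algorithm: it assumes OVM lower-bounds the online bootstrap target with a frozen copy of the offline Q-function (so pre-trained values are not unlearned early in fine-tuning), and that SE lets one agent at a time explore while the others follow the pre-trained offline policy. The function names, the round-robin explorer schedule, and the `max`-based target are all hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def ovm_target(reward, gamma, q_online_next, q_offline_next, done):
    """Hypothetical OVM-style target: a frozen offline Q-value serves as a
    memory that lower-bounds the online bootstrap, guarding pre-trained
    values against unlearning during the offline-to-online transition."""
    bootstrap = max(float(np.max(q_online_next)), float(np.max(q_offline_next)))
    return reward + gamma * (1.0 - done) * bootstrap

def sequential_explore(step, n_agents, q_online, pi_offline, epsilon):
    """Hypothetical SE rollout step: only one agent (round-robin by step)
    explores; the rest act with the pre-trained offline policy, so the
    effective joint action space to be explored shrinks from exponential
    in n_agents to roughly that of a single agent."""
    explorer = step % n_agents
    actions = []
    for i in range(n_agents):
        if i == explorer and rng.random() < epsilon:
            actions.append(int(rng.integers(q_online[i].shape[0])))  # random exploratory action
        else:
            actions.append(int(pi_offline[i]))  # follow the pre-trained offline policy
    return actions
```

Under this sketch, a non-terminal transition with reward 1.0, discount 0.9, and a frozen offline estimate of 2.0 (exceeding the online estimate) yields a target of 1.0 + 0.9 * 2.0 = 2.8, i.e., the offline value dominates the bootstrap until the online critic catches up.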