Constrained Decision Transformer for Offline Safe Reinforcement Learning
Abstract: Safe reinforcement learning (RL) trains a policy that satisfies safety constraints by interacting with the environment. We aim to tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulty. The inherent trade-off between safety and task performance inspires us to propose the constrained decision transformer (CDT) approach, which can dynamically adjust this trade-off during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while retaining zero-shot adaptation to different constraint thresholds, making our approach more suitable for real-world RL under constraints. The code is available at https://github.com/liuzuxin/OSRL.
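To make the core idea concrete, the sketch below shows how a decision-transformer-style policy can be conditioned on both a reward return-to-go and a cost return-to-go target, so that the safety/performance trade-off becomes an input chosen at deployment time. This is a minimal illustration under stated assumptions, not the OSRL implementation: the class name `SimpleCDT`, the per-timestep token layout, and all dimensions are hypothetical.

```python
# Minimal sketch of a CDT-like policy: a vanilla decision transformer augmented
# with an extra cost-return-to-go token per timestep. Hypothetical names/dims.
import torch
import torch.nn as nn

class SimpleCDT(nn.Module):
    def __init__(self, state_dim, act_dim, d_model=128, n_layers=3, n_heads=4, max_len=20):
        super().__init__()
        # Per-timestep tokens: (reward-to-go, cost-to-go, state, action).
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_ctg = nn.Linear(1, d_model)   # cost-return token (absent in vanilla DT)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.embed_time = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, states, actions, rtg, ctg, timesteps):
        # states: (B, T, state_dim); actions: (B, T, act_dim)
        # rtg/ctg: (B, T, 1) reward/cost returns-to-go; timesteps: (B, T) long
        B, T = states.shape[:2]
        t = self.embed_time(timesteps)
        # Interleave the 4 token types per timestep: (B, 4T, d_model).
        tokens = torch.stack([
            self.embed_rtg(rtg) + t,
            self.embed_ctg(ctg) + t,
            self.embed_state(states) + t,
            self.embed_action(actions) + t,
        ], dim=2).reshape(B, 4 * T, -1)
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(4 * T).to(states.device)
        h = self.backbone(tokens, mask=mask)
        # Predict the action from each timestep's state token (index 2 of 4).
        h = h.reshape(B, T, 4, -1)
        return self.predict_action(h[:, :, 2])
```

Under this setup, zero-shot adaptation to a new constraint threshold reduces to changing the initial cost-return-to-go target `ctg` fed to the model at rollout time, with no retraining.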