Bayesian Conservative Policy Optimization (BCPO): A Novel Uncertainty-Calibrated Offline Reinforcement Learning with Credible Lower Bounds
Abstract: Offline reinforcement learning (RL) aims to learn decision policies from a fixed batch of logged transitions, without additional environment interaction. Despite remarkable empirical progress, offline RL remains fragile under distribution shift: value-based methods can overestimate the value of unseen actions, yielding policies that exploit model errors rather than genuine long-term rewards. We propose \emph{Bayesian Conservative Policy Optimization (BCPO)}, a unified framework that converts epistemic uncertainty into \emph{provably conservative} policy improvement. BCPO maintains a hierarchical Bayesian posterior over environment/value models, constructs a \emph{credible lower bound} (LCB) on action values, and performs policy updates under explicit KL regularization toward the behavior distribution. This yields an uncertainty-calibrated analogue of conservative policy iteration in the offline regime. We provide a finite-MDP theory showing that the pessimistic fixed point lower-bounds the true value function with high probability and that KL-controlled updates improve a computable lower bound on the return. Empirically, we verify the methodology on a real offline replay dataset for the CartPole benchmark obtained via the \texttt{d3rlpy} ecosystem, and report diagnostics that link uncertainty growth and policy drift to offline instability, motivating principled early stopping and calibration.
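The two ingredients named in the abstract, a credible lower bound on action values and a KL-regularized update toward the behavior distribution, can be illustrated with a minimal tabular sketch. This is not the authors' implementation. It assumes the posterior is represented by a set of sampled Q-tables, that the lower bound takes the form of the posterior mean minus `kappa` posterior standard deviations, and that the policy step is the standard closed-form maximizer of E_pi[Q] - temperature * KL(pi || behavior). The names `lower_credible_bound`, `kl_regularized_policy`, `kappa`, and `temperature` are hypothetical, and the random Q samples are placeholder data.

```python
import numpy as np

def lower_credible_bound(q_samples, kappa=1.0):
    """Pessimistic Q estimate from posterior draws (assumed form: mean - kappa * std).

    q_samples has shape (n_samples, n_states, n_actions).
    """
    mean = q_samples.mean(axis=0)
    std = q_samples.std(axis=0)
    return mean - kappa * std

def kl_regularized_policy(q_lcb, behavior, temperature=1.0):
    """Closed-form maximizer of E_pi[q_lcb] - temperature * KL(pi || behavior).

    The solution is pi(a|s) proportional to behavior(a|s) * exp(q_lcb(s,a) / temperature).
    """
    logits = np.log(behavior + 1e-12) + q_lcb / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

def kl_divergence(pi, behavior):
    """Per-state KL(pi || behavior), usable as a policy-drift diagnostic."""
    return np.sum(pi * (np.log(pi + 1e-12) - np.log(behavior + 1e-12)), axis=1)

# Placeholder data: 32 hypothetical posterior draws for 4 states and 2 actions.
rng = np.random.default_rng(0)
q_samples = rng.normal(size=(32, 4, 2))
behavior = np.full((4, 2), 0.5)  # uniform behavior policy (assumption)

q_lcb = lower_credible_bound(q_samples, kappa=1.0)
pi = kl_regularized_policy(q_lcb, behavior, temperature=1.0)
drift = kl_divergence(pi, behavior)
```

In this sketch a larger `temperature` keeps the policy closer to the behavior distribution, and a larger `kappa` penalizes actions whose value estimates disagree across posterior draws. The per-state `drift` values are one simple way to monitor the policy drift that the abstract links to offline instability.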