Mitigating PROWL’s conservatism under one-sided validity
Develop methods that reduce the finite-sample conservatism of PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL) while preserving strict one-sided validity of the reward certificate, i.e., maintaining that the unobserved target reward R* satisfies R* ≥ (R − U)+ almost surely. In particular, construct procedures that mitigate the over-correction arising from bounding the optimistic error without relying on an exact oracle bias correction, yet still provide guaranteed lower bounds on the true value.
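To make the conservatism concrete, the following is a minimal numerical sketch (not the PROWL procedure itself) of the certificate form (R − U)+: given optimistic reward estimates R and a high-probability bound U on the optimistic error, clipping R − U at zero yields a one-sided valid lower bound on R*, but the gap R* − (R − U)+ can be large when U is loose. The names `R`, `U`, `R_star`, and `certified_lower_bound` are illustrative assumptions, not part of the PROWL API.

```python
import numpy as np

def certified_lower_bound(R, U):
    """Certificate of the form (R - U)+ : clip R - U at zero."""
    return np.maximum(R - U, 0.0)

rng = np.random.default_rng(0)
R_star = rng.uniform(0.5, 1.0, size=1000)   # unobserved true rewards (simulated)
noise = rng.normal(0.0, 0.05, size=1000)    # optimistic estimation error
R = R_star + noise                           # observed optimistic estimates

# A loose, conservative bound on the optimistic error (here 4 standard
# deviations), standing in for a finite-sample PAC-Bayesian bound.
U = 0.2

lb = certified_lower_bound(R, U)

# One-sided validity: lb <= R_star should hold for essentially all samples,
# but the average gap R_star - lb reflects the over-correction.
coverage = np.mean(lb <= R_star)
avg_gap = np.mean(R_star - lb)
```

In this toy setting the coverage is near 1, while the average gap is close to U itself, illustrating why tightening U without an exact oracle bias correction, and without sacrificing validity, is the open problem.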
Context
The PROWL framework can be overly conservative in finite samples because it bounds the optimistic error rather than applying an exact oracle bias correction. Mitigating this over-correction while maintaining strict one-sided validity remains a crucial open problem.