Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data

Published 18 Mar 2024 in stat.ML, cs.AI, and cs.LG (arXiv:2403.11841v1)

Abstract: In real-world scenarios, datasets collected from randomized experiments are often constrained in size due to limitations of time and budget. As a result, leveraging large observational datasets becomes a more attractive option for achieving high-quality policy learning. However, most existing offline reinforcement learning (RL) methods depend on two key assumptions, unconfoundedness and positivity, which frequently do not hold in observational data contexts. Recognizing these challenges, we propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL). We utilize a mediator variable, based on the front-door criterion, to remove the confounding bias; additionally, we adopt the pessimistic principle to address the distributional shift between the action distributions induced by candidate policies and by the behavior policy that generates the observational data. Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it suffices to learn a lower bound of the mediator distribution function, rather than of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. Moreover, we provide theoretical guarantees for the proposed algorithms and demonstrate their efficacy through simulations, as well as real-world experiments on offline datasets from a leading ride-hailing platform.
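For readers unfamiliar with the front-door criterion the abstract invokes, the classical single-stage adjustment below conveys the idea; the notation (action A, mediator M, outcome Y) is generic rather than taken from the paper, and it assumes M fully mediates the effect of A on Y and is itself unconfounded given A.

```latex
% Front-door adjustment (Pearl): identifies the interventional effect of A on Y
% through a mediator M, even when A and Y share an unobserved confounder.
\[
  P\bigl(y \mid \mathrm{do}(a)\bigr)
  \;=\; \sum_{m} P(m \mid a) \sum_{a'} P\bigl(y \mid a', m\bigr)\, P(a').
\]
```

To make the pessimistic-mediator idea concrete, here is a minimal synthetic sketch in Python. It is not the authors' PESCAL implementation: the data-generating process, the single-stage (bandit-style) setting, and the Hoeffding-style penalty constant are illustrative assumptions. The sketch estimates the mediator distribution p(m | a) from offline data, subtracts an uncertainty width to obtain a conservative lower bound, and ranks actions through that lower bound rather than through an estimated Q-function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline dataset: actions A, mediators M, rewards R.
# (Synthetic; the paper's real data come from a ride-hailing platform.)
n, n_actions, n_mediators = 5000, 3, 4
A = rng.integers(0, n_actions, size=n)               # behavior-policy actions
M = (A + rng.integers(0, 2, size=n)) % n_mediators   # mediator driven by the action
R = (M == 2).astype(float) + 0.1 * rng.normal(size=n)

# Step 1: estimate the mediator distribution p(m | a) by empirical frequencies.
counts = np.zeros((n_actions, n_mediators))
for a, m in zip(A, M):
    counts[a, m] += 1
n_a = counts.sum(axis=1, keepdims=True)
p_hat = counts / np.maximum(n_a, 1)

# Step 2: pessimism -- shrink each estimated probability by a Hoeffding-style
# uncertainty width and clip at zero. The constant 0.5 is an illustrative choice.
width = 0.5 * np.sqrt(np.log(n_actions * n_mediators) / np.maximum(n_a, 1))
p_lower = np.clip(p_hat - width, 0.0, None)

# Step 3: estimate the mean reward of each mediator value, then score actions
# through the pessimistic mediator distribution instead of a Q-function.
r_m = np.array([R[M == m].mean() for m in range(n_mediators)])
action_values = p_lower @ r_m
print("pessimistic action values:", np.round(action_values, 3))
print("selected action:", int(action_values.argmax()))
```

Because each estimated probability is penalized before being clipped at zero, the row sums of the pessimistic estimate can fall below one; scoring actions with this sub-distribution is what keeps the value estimates conservative for non-negative rewards, mirroring how the paper replaces sequential uncertainty quantification of the Q-function with a lower bound on the mediator distribution.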

