Model approximation in MDPs with unbounded per-step cost
Abstract: We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy $\hat{\pi}^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$? We answer this question by bounding a weighted norm of the difference between the value function of $\hat{\pi}^{\star}$ when used in $\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend our results and obtain potentially tighter upper bounds by considering affine transformations of the per-step cost. We further provide upper bounds that explicitly depend on the weighted distance between the cost functions and the weighted distance between the transition kernels of the original and approximate models. We present examples to illustrate our results.
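The question the abstract poses can be checked numerically on a small finite MDP. The sketch below (an illustration, not the paper's weighted-norm analysis) solves an approximate model $\hat{\mathcal{M}}$, deploys its optimal policy $\hat{\pi}^{\star}$ in the true model $\mathcal{M}$, and compares the resulting suboptimality gap against a classical *unweighted* bound for bounded costs; the model sizes, perturbation magnitudes, and the coarse bound $2\bigl(\varepsilon_c + \gamma\,\varepsilon_P\, c_{\max}/(1-\gamma)\bigr)/(1-\gamma)$ are illustrative assumptions. The paper's contribution is weighted-norm versions of such bounds that remain meaningful when the per-step cost is unbounded.

```python
import numpy as np

def value_iteration(P, c, gamma, tol=1e-10):
    """Solve a cost-minimizing MDP. P: (S, A, S) kernel, c: (S, A) cost."""
    S, A = c.shape
    V = np.zeros(S)
    while True:
        Q = c + gamma * (P @ V)            # (S, A) state-action values
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)  # value function, greedy policy
        V = V_new

def policy_eval(P, c, gamma, pi):
    """Exact evaluation of a deterministic policy pi in model (P, c)."""
    S = c.shape[0]
    P_pi = P[np.arange(S), pi]             # (S, S) kernel under pi
    c_pi = c[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# True model M and a slightly perturbed approximate model M-hat.
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((S, A))
P_hat = P + 0.01 * rng.random((S, A, S)); P_hat /= P_hat.sum(axis=2, keepdims=True)
c_hat = c + 0.01 * rng.random((S, A))

V_star, _ = value_iteration(P, c, gamma)           # optimal value of M
_, pi_hat = value_iteration(P_hat, c_hat, gamma)   # optimal policy of M-hat
V_pihat = policy_eval(P, c, gamma, pi_hat)         # pi-hat deployed in M

eps_c = np.max(np.abs(c - c_hat))                  # sup-norm cost mismatch
eps_P = np.max(np.abs(P - P_hat).sum(axis=2))      # worst-case L1 kernel mismatch
c_max = max(c.max(), c_hat.max())

# Coarse classical bound: any policy's value is at most c_max / (1 - gamma).
bound = 2 * (eps_c + gamma * eps_P * c_max / (1 - gamma)) / (1 - gamma)
gap = np.max(V_pihat - V_star)                     # suboptimality of pi-hat in M
print(f"gap = {gap:.4f}  <=  bound = {bound:.4f}")
```

Because the bound uses the worst-case cost $c_{\max}/(1-\gamma)$ as a proxy for the value function, it is very loose; the weighted-norm bounds in the paper replace this uniform bound with a weight function that tracks the actual growth of the value function.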
- B. Bozkurt, A. Mahajan, A. Nayyar, and Y. Ouyang, “Weighted norm bounds in MDPs with unbounded per-step cost,” in Conference on Decision and Control. Singapore: IEEE, Dec. 2023.