Experiment Planning with Function Approximation
Abstract: We study the problem of experiment planning with function approximation in contextual bandit problems. In settings where there is significant overhead to deploying adaptive algorithms -- for example, when the execution of the data collection policies is required to be distributed, or when a human in the loop is needed to implement these policies -- producing a set of data collection policies in advance is paramount. We study the setting where a large dataset of contexts, but not rewards, is available and may be used by the learner to design an effective data collection strategy. Although this problem has been well studied when rewards are linear, results are still missing for more complex reward models. In this work we propose two experiment planning strategies compatible with function approximation. The first is an eluder planning and sampling procedure that recovers optimality guarantees depending on the eluder dimension of the reward function class. For the second, we show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small. We conclude by introducing a statistical gap that highlights the fundamental differences between planning and adaptive learning, and we provide results for planning with model selection.
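As a rough illustration of the second strategy mentioned in the abstract, the sketch below shows a non-adaptive pipeline in the spirit of uniform sampling with a small action set: actions are chosen uniformly at random on the pre-collected contexts, and a regression oracle (here a ridge regressor standing in for a general function class) is fit offline on the logged rewards to produce a greedy policy. This is a minimal sketch under assumed names and interfaces, not the paper's algorithm or implementation.

```python
# Minimal sketch (illustrative, not the paper's method) of a uniform-sampling
# experiment plan for a contextual bandit with a small, finite action set:
# 1) plan: assign one uniformly random action to each pre-collected context,
# 2) after rewards are observed, fit a reward model offline,
# 3) return the greedy policy induced by the fitted model.
# Names such as plan_uniform and fit_greedy_policy are assumptions for this sketch.

import numpy as np
from sklearn.linear_model import Ridge  # stand-in for a generic regression oracle


def plan_uniform(contexts: np.ndarray, n_actions: int, seed=None) -> np.ndarray:
    """Non-adaptive data-collection plan: one uniformly random action per context."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_actions, size=len(contexts))


def fit_greedy_policy(contexts: np.ndarray, actions: np.ndarray,
                      rewards: np.ndarray, n_actions: int):
    """Offline regression on the logged data, followed by greedy action selection."""

    def featurize(X: np.ndarray, a: np.ndarray) -> np.ndarray:
        # Encode a (context, action) pair as the context concatenated with a
        # one-hot indicator of the action.
        return np.hstack([X, np.eye(n_actions)[a]])

    model = Ridge(alpha=1.0).fit(featurize(contexts, actions), rewards)

    def policy(x: np.ndarray) -> int:
        # Predict the reward of every action for context x and act greedily.
        candidates = featurize(np.tile(x, (n_actions, 1)), np.arange(n_actions))
        return int(np.argmax(model.predict(candidates)))

    return policy
```

The plan itself depends only on the context dataset and can therefore be shipped before any rewards are observed; the regression and greedy step run once the collected rewards come back, which is the separation between planning and adaptive learning that the abstract emphasizes.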