Towards Principled Representation Learning from Videos for Reinforcement Learning
Abstract: We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Despite significant empirical advances on this problem, a theoretical understanding remains absent. We initiate a theoretical investigation into principled approaches for representation learning, focusing on learning the latent state representations of the underlying MDP from video data. We study two settings: one where the observations contain only iid noise, and a more challenging setting that additionally contains exogenous noise, i.e., temporally correlated, non-iid noise such as the motion of people or cars in the background. We analyze three commonly used approaches: autoencoding, temporal contrastive learning, and forward modeling. We prove upper bounds for temporal contrastive learning and forward modeling in the presence of only iid noise, showing that these approaches can learn the latent state and use it for efficient downstream RL with polynomial sample complexity. When exogenous noise is also present, we establish a lower bound showing that the sample complexity of learning from video data can be exponentially worse than learning from action-labeled trajectory data. This partially explains why reinforcement learning with video pre-training is hard. We evaluate these representation learning methods in two visual domains, yielding results consistent with our theoretical findings.
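As a concrete illustration of one of the three approaches studied, the following is a minimal sketch of a temporal contrastive objective on video frames, assuming an InfoNCE-style loss where encodings of consecutive frames form positive pairs and the other frames in the batch serve as negatives. The encoder and array shapes here are illustrative and not taken from the paper.

```python
import numpy as np

def temporal_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style temporal contrastive loss.

    anchors[i] and positives[i] are encodings of consecutive frames
    (x_t, x_{t+1}); the remaining rows of the batch act as negatives.
    """
    # L2-normalize the encodings so similarities are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by the temperature.
    logits = a @ p.T / temperature
    # Cross-entropy with the matching temporal pair on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))           # stand-in encodings of frames x_t
# Consecutive-frame encodings close to their anchors (true temporal pairs).
loss_aligned = temporal_contrastive_loss(z, z + 0.01 * rng.normal(size=z.shape))
# Unrelated encodings (no temporal structure to exploit).
loss_random = temporal_contrastive_loss(z, rng.normal(size=(8, 16)))
```

Minimizing this loss drives the encoder to map temporally adjacent frames to nearby representations, which is the mechanism the paper's upper bound analyzes under iid observation noise; `loss_aligned` is correspondingly much smaller than `loss_random` on the synthetic batch above.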