
Learning to Act without Actions

Published 17 Dec 2023 in cs.LG and cs.AI (arXiv:2312.10812v2)

Abstract: Pre-training large models on vast amounts of web data has proven to be an effective approach for obtaining powerful, general models in domains such as language and vision. However, this paradigm has not yet taken hold in reinforcement learning. This is because videos, the most abundant form of embodied behavioral data on the web, lack the action labels required by existing methods for imitating behavior from demonstrations. We introduce Latent Action Policies (LAPO), a method for recovering latent action information, and thereby latent-action policies, world models, and inverse dynamics models, purely from videos. LAPO is the first method able to recover the structure of the true action space just from observed dynamics, even in challenging procedurally-generated environments. LAPO enables training latent-action policies that can be rapidly fine-tuned into expert-level policies, either offline using a small action-labeled dataset, or online with rewards. LAPO takes a first step towards pre-training powerful, generalist policies and world models on the vast amounts of videos readily available on the web.

References (50)
  1. Karl Johan Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.
  2. Playing hard exploration games by watching youtube. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp.  2935–2945, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/35309226eb45ec366ca86a4329a2b7c3-Abstract.html.
  3. Video pretraining (VPT): learning to act by watching unlabeled online videos. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9c7008aff45b5d8f0973b23e1a22ada0-Abstract-Conference.html.
  4. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  5. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL http://arxiv.org/abs/1308.3432.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp.  9630–9640. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00951. URL https://doi.org/10.1109/ICCV48922.2021.00951.
  8. Variational lossy autoencoder. In International Conference on Learning Representations, 2016.
  9. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pp.  1282–1289. PMLR, 2019.
  10. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  2048–2056. PMLR, 2020. URL http://proceedings.mlr.press/v119/cobbe20a.html.
  11. Jukebox: A generative model for music. CoRR, abs/2005.00341, 2020. URL https://arxiv.org/abs/2005.00341.
  12. Imitating latent policies from observation. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  1755–1763. PMLR, 2019. URL http://proceedings.mlr.press/v97/edwards19a.html.
  13. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  1406–1415. PMLR, 2018. URL http://proceedings.mlr.press/v80/espeholt18a.html.
  14. Reinforcement learning from passive data via latent intentions. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  11321–11339. PMLR, 2023. URL https://proceedings.mlr.press/v202/ghosh23a.html.
  15. R. Gray. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984. doi: 10.1109/MASSP.1984.1162229.
  16. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  17. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  14096–14113. PMLR, 2023. URL https://proceedings.mlr.press/v202/huh23a.html.
  18. Marcus Hutter. On the existence and convergence of computable universal priors. In International Conference on Algorithmic Learning Theory, pp.  298–312. Springer, 2003.
  19. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  20. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  21. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  22. UMAP: uniform manifold approximation and projection for dimension reduction. CoRR, abs/1802.03426, 2018. URL http://arxiv.org/abs/1802.03426.
  23. Playable video generation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp.  10061–10070. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.00993. URL https://openaccess.thecvf.com/content/CVPR2021/html/Menapace_Playable_Video_Generation_CVPR_2021_paper.html.
  24. Transformers are sample-efficient world models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=vhFu1Acb0xb.
  25. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
  26. Dean Pomerleau. ALVINN: an autonomous land vehicle in a neural network. In David S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, [NIPS Conference, Denver, Colorado, USA, 1988], pp.  305–313. Morgan Kaufmann, 1988.
  27. Language models are unsupervised multitask learners. 2019.
  28. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  29. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp.  8821–8831. PMLR, 2021a.
  30. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp.  8821–8831. PMLR, 2021b. URL http://proceedings.mlr.press/v139/ramesh21a.html.
  31. A generalist agent. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=1ikK0kHjvj.
  32. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp.  234–241. Springer, 2015.
  33. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.  661–668. JMLR Workshop and Conference Proceedings, 2010.
  34. Reinforcement learning with videos: Combining offline observations with interaction. In Jens Kober, Fabio Ramos, and Claire J. Tomlin (eds.), 4th Conference on Robot Learning, CoRL 2020, 16-18 November 2020, Virtual Event / Cambridge, MA, USA, volume 155 of Proceedings of Machine Learning Research, pp.  339–354. PMLR, 2020. URL https://proceedings.mlr.press/v155/schmeckpeper21a.html.
  35. Jürgen Schmidhuber. Discovering neural nets with low kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
  36. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
  37. R. J. Solomonoff. A formal theory of inductive inference, parts I and II. Information and Control, 7:1–22, 224–254, 1964.
  38. Preventing mode collapse when imitating latent policies from observations, 2023. URL https://openreview.net/forum?id=Mf9fQ0OgMzo.
  39. Reinforcement learning: An introduction. MIT press, 2018.
  40. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  41. Behavioral cloning from observation. In Jérôme Lang (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp.  4950–4957. ijcai.org, 2018. doi: 10.24963/ijcai.2018/687. URL https://doi.org/10.24963/ijcai.2018/687.
  42. Recent advances in imitation learning from observation. In Sarit Kraus (ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp.  6325–6331. ijcai.org, 2019a. doi: 10.24963/ijcai.2019/882. URL https://doi.org/10.24963/ijcai.2019/882.
  43. Generative adversarial imitation from observation. ICML Workshop on Imitation, Intent, and Interaction (I3), 2019b.
  44. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  6306–6315, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html.
  45. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  46. Imitation learning from observations by minimizing inverse dynamics disagreement. Advances in neural information processing systems, 32, 2019.
  47. Become a proficient player with limited data through watching pure videos. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=Sy-o2N0hF4f.
  48. Soundstream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022. doi: 10.1109/TASLP.2021.3129994. URL https://doi.org/10.1109/TASLP.2021.3129994.
  49. Learning to drive by watching youtube videos: Action-conditioned contrastive policy pretraining. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI, volume 13686 of Lecture Notes in Computer Science, pp.  111–128. Springer, 2022. doi: 10.1007/978-3-031-19809-0_7. URL https://doi.org/10.1007/978-3-031-19809-0_7.
  50. Semi-supervised offline reinforcement learning with action-free trajectories. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  42339–42362. PMLR, 2023. URL https://proceedings.mlr.press/v202/zheng23b.html.

Summary

  • The paper introduces Latent Action Policies (LAPO), a method for recovering latent action information from video data without action labels.
  • It jointly trains an inverse dynamics model and a forward dynamics model, using a vector-quantized bottleneck to produce compressed yet interpretable latent actions.
  • LAPO adapts latent policies to the true action space using minimal labeled data or online RL, significantly outperforming policies trained from scratch.

Latent Action Policies (LAPO): Learning to Act without Actions

Introduction

The paper "Learning to Act without Actions" introduces Purely Observational Policy Pre-training (POPP), a method for learning policies, world models, and inverse dynamics models (IDMs) directly from video data without access to action labels. This approach addresses a central challenge in scaling reinforcement learning (RL) to web-scale data, where the most abundant behavioral data—videos—lack explicit action annotations. POPP leverages unsupervised objectives to recover latent action information from observed environment dynamics, enabling the training of latent-action policies that can be rapidly adapted to the true action space with minimal labeled data or online interaction.

Methodology

LAPO is built on two key components: an inverse dynamics model (IDM) and a forward dynamics model (FDM). The IDM predicts a latent action $z_t$ given a sequence of observations $(o_{t-k}, \ldots, o_t, o_{t+1})$, while the FDM predicts the next observation $\hat{o}_{t+1}$ given the past observations and the latent action. Both models are trained jointly to minimize the next-state prediction error, with the latent action serving as an information bottleneck. Vector quantization (VQ) is applied to the latent actions to enforce discrete, reusable representations and prevent the IDM from simply copying future observations.
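
A minimal PyTorch sketch of this joint objective is given below. It is illustrative only: observations are flattened vectors, the IDM conditions on a single past frame rather than a longer context, the linear modules stand in for the paper's convolutional architectures, and the VQ-VAE-style commitment losses are an assumption about the exact bottleneck formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, z_dim, num_codes = 64 * 64 * 3, 32, 64   # hypothetical sizes

class VQBottleneck(nn.Module):
    """Quantize latent actions against a learned codebook (straight-through)."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, z_dim)

    def forward(self, z):
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)  # nearest code
        z_q = self.codebook(idx)
        # Codebook + commitment losses, as in VQ-VAE (van den Oord et al., 2017).
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z.
        return z + (z_q - z).detach(), vq_loss

idm = nn.Linear(2 * obs_dim, z_dim)        # z_t from (o_t, o_{t+1})
fdm = nn.Linear(obs_dim + z_dim, obs_dim)  # o-hat_{t+1} from (o_t, z_t)
vq = VQBottleneck()

def lapo_loss(o_t, o_next):
    """Joint IDM/FDM objective: reconstruct o_{t+1} through the latent bottleneck."""
    z, vq_loss = vq(idm(torch.cat([o_t, o_next], dim=-1)))
    o_pred = fdm(torch.cat([o_t, z], dim=-1))
    return F.mse_loss(o_pred, o_next) + vq_loss
```

The bottleneck is what makes the latent action-like: without it, the IDM could simply pass the full next frame through $z_t$ and the FDM would learn nothing about dynamics.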

The training procedure consists of three stages, each sketched in code below:

  1. Latent IDM Training: Learn a compressed latent action representation via predictive consistency between IDM and FDM.
  2. Behavior Cloning: Use the trained IDM to label transitions in the observation-only dataset with latent actions, then train a policy to imitate these latent actions.
  3. Decoding Latent Actions: Adapt the latent policy to the true action space using either a small action-labeled dataset (offline decoding) or online RL (online decoding).
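
Stage two can be expressed as a short continuation of the earlier sketch. The regression loss onto quantized latents is an assumption for illustration; `idm`, `vq`, and the shape constants carry over from the previous snippet.

```python
# Stage 2 sketch: relabel action-free video with latent actions from the frozen
# IDM, then behavior-clone a latent policy pi(z | o).
policy = nn.Linear(obs_dim, z_dim)
p_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(o_t, o_next):
    with torch.no_grad():                                   # IDM is frozen here
        z_target, _ = vq(idm(torch.cat([o_t, o_next], dim=-1)))
    loss = F.mse_loss(policy(o_t), z_target)                # clone the quantized latents
    p_opt.zero_grad(); loss.backward(); p_opt.step()
    return loss.item()
```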

Latent Action Space Analysis

LAPO demonstrates that the learned latent action space is highly interpretable and closely corresponds to the true action space, even though no ground-truth action labels are used during training. UMAP projections of the latent action space reveal distinct clusters aligned with true actions across diverse environments (Figure 1).

Figure 1: UMAP projection of the learned latent action space for Miner, showing interpretable clusters corresponding to true actions, despite training without action labels.
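
A projection like the one in Figure 1 can be reproduced with umap-learn along these lines. The input files are hypothetical, and the true actions are used only to color the points, never during training.

```python
import numpy as np
import umap                         # pip install umap-learn
import matplotlib.pyplot as plt

latents = np.load("latent_actions.npy")      # (N, z_dim) IDM outputs, hypothetical file
true_actions = np.load("true_actions.npy")   # (N,) ground truth, for coloring only

xy = umap.UMAP(n_components=2).fit_transform(latents)
plt.scatter(xy[:, 0], xy[:, 1], c=true_actions, s=2, cmap="tab20")
plt.title("Latent action space, colored by true action")
plt.show()
```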

Further analysis across all 16 Procgen environments confirms that the structure of the latent action space varies with environment complexity and partial observability. In environments with higher partial observability, the latent space exhibits more fragmentation, reflecting the need to encode off-screen or unobserved information (Figure 2).

Figure 2: UMAP projection of the learned latent action space for all 16 Procgen games, illustrating environment-dependent structure and alignment with true actions.

Policy Adaptation and Performance

LAPO's latent policies can be efficiently adapted to the true action space. When a small action-labeled dataset is available, a decoder trained on as few as 200 labeled transitions enables the latent policy to exceed the performance of a policy trained from scratch with 4 million steps of PPO. Performance plateaus with increasing labeled data, since the decoder maps latent actions to true actions without conditioning on the state, which caps its capacity (Figure 3).

Figure 3: Test performance of the latent policy with an offline-trained decoder, showing rapid gains with few labeled samples and plateauing below online RL decoding.
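
A sketch of this offline decoding stage, continuing the snippets above: a small head maps latent actions to true actions, trained on a handful of labeled transitions. The 15-way output matches Procgen's discrete action space, and `labeled_transitions` is an assumed iterable of tensors.

```python
# Stage 3 (offline variant): fit a decoder from latent to true actions.
decoder = nn.Linear(z_dim, 15)                  # latent action -> true-action logits
d_opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

for o_t, o_next, a_t in labeled_transitions:    # e.g. ~200 labeled samples
    with torch.no_grad():
        z, _ = vq(idm(torch.cat([o_t, o_next], dim=-1)))
    loss = F.cross_entropy(decoder(z), a_t)
    d_opt.zero_grad(); loss.backward(); d_opt.step()

def act(o_t):
    """Compose latent policy and decoder into a true-action policy."""
    return decoder(policy(o_t)).argmax(dim=-1)
```

Note that the decoder sees only the latent action, not the observation; this state-invariance is what the plateau in Figure 3 reflects.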

In the online setting, fine-tuning the latent policy with RL enables rapid recovery of expert-level performance, often exceeding the original expert within 4 million frames. By contrast, PPO trained from scratch achieves only 44% of expert performance in the same period. Ablations demonstrate the importance of vector quantization and supervised decoder initialization for efficient adaptation (Figure 4).


Figure 4: Left: Mean episodic returns for decoding LAPO's latent policy vs. PPO from scratch. Right: Mean test returns relative to expert policies across all Procgen environments.
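
The online variant amounts to stacking the decoder head on the latent policy and fine-tuning the composite with rewards. A minimal sketch, assuming any standard PPO implementation can take `agent` as its actor network; only the composition is LAPO-specific.

```python
class DecodedPolicy(nn.Module):
    def __init__(self, latent_policy, decoder):
        super().__init__()
        self.latent_policy = latent_policy   # pre-trained in stage 2
        self.decoder = decoder               # initialized by supervised decoding

    def forward(self, obs):
        return self.decoder(self.latent_policy(obs))   # logits over true actions

agent = DecodedPolicy(policy, decoder)
# e.g. plug `agent` into a PPO trainer as the actor and fine-tune end-to-end.
```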

Comparison to Prior Work

LAPO differs fundamentally from prior approaches such as ILPO, FICC, VPT, and BCO. Unlike ILPO, which relies on a small discrete set of latent actions and suffers from mode collapse in visually diverse environments, LAPO learns expressive vector-quantized latent actions through an IDM-based objective, enabling robust modeling of stochasticity and partial observability. LAPO also avoids the need for substantial action-labeled data, unlike semi-supervised methods, and does not require access to the true action space during training, unlike imitation learning from observation (IfO) methods.

Limitations

LAPO's performance can be affected by delayed action effects, significant environment stochasticity, and the need for larger models when scaling to more complex domains. Delayed effects can be mitigated by extending the temporal context of the IDM and FDM architectures, potentially using Transformer-based models. Stochasticity may require larger datasets to ensure robust latent representations. Scaling to web-scale video data will necessitate careful balancing of model capacity and bottleneck strength.

Implications and Future Directions

LAPO provides a pathway for unsupervised pretraining of generalist RL policies and world models on massive video datasets, analogous to pretraining paradigms in language and vision. The ability to recover action information and train adaptable policies from pure observation opens new possibilities for leveraging web-scale behavioral data. Future work should focus on scaling LAPO to more powerful architectures, integrating multi-task and multi-modal data, and improving generalization to unseen tasks.

Conclusion

LAPO establishes that comprehensive action information can be recovered from pure video via unsupervised objectives, enabling the training of latent-action policies that are rapidly adaptable to the true action space. The method achieves strong empirical results across diverse environments, often exceeding expert performance with minimal labeled data or online interaction. LAPO represents a significant advance toward scalable, generalist RL agents trained on web-scale observational data, with broad implications for the future of unsupervised policy pretraining and embodied AI.
