
Learning to Act without Actions

Published 17 Dec 2023 in cs.LG and cs.AI (arXiv:2312.10812v2)

Abstract: Pre-training large models on vast amounts of web data has proven to be an effective approach for obtaining powerful, general models in domains such as language and vision. However, this paradigm has not yet taken hold in reinforcement learning. This is because videos, the most abundant form of embodied behavioral data on the web, lack the action labels required by existing methods for imitating behavior from demonstrations. We introduce Latent Action Policies (LAPO), a method for recovering latent action information, and thereby latent-action policies, world models, and inverse dynamics models, purely from videos. LAPO is the first method able to recover the structure of the true action space just from observed dynamics, even in challenging procedurally-generated environments. LAPO enables training latent-action policies that can be rapidly fine-tuned into expert-level policies, either offline using a small action-labeled dataset, or online with rewards. LAPO takes a first step towards pre-training powerful, generalist policies and world models on the vast amounts of videos readily available on the web.

References (50)
  1. Karl Johan Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.
  2. Playing hard exploration games by watching youtube. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp.  2935–2945, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/35309226eb45ec366ca86a4329a2b7c3-Abstract.html.
  3. Video pretraining (VPT): learning to act by watching unlabeled online videos. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9c7008aff45b5d8f0973b23e1a22ada0-Abstract-Conference.html.
  4. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  5. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL http://arxiv.org/abs/1308.3432.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp.  9630–9640. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00951. URL https://doi.org/10.1109/ICCV48922.2021.00951.
  8. Variational lossy autoencoder. In International Conference on Learning Representations, 2016.
  9. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pp.  1282–1289. PMLR, 2019.
  10. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  2048–2056. PMLR, 2020. URL http://proceedings.mlr.press/v119/cobbe20a.html.
  11. Jukebox: A generative model for music. CoRR, abs/2005.00341, 2020. URL https://arxiv.org/abs/2005.00341.
  12. Imitating latent policies from observation. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  1755–1763. PMLR, 2019. URL http://proceedings.mlr.press/v97/edwards19a.html.
  13. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  1406–1415. PMLR, 2018. URL http://proceedings.mlr.press/v80/espeholt18a.html.
  14. Reinforcement learning from passive data via latent intentions. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  11321–11339. PMLR, 2023. URL https://proceedings.mlr.press/v202/ghosh23a.html.
  15. R. Gray. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984. doi: 10.1109/MASSP.1984.1162229.
  16. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  17. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  14096–14113. PMLR, 2023. URL https://proceedings.mlr.press/v202/huh23a.html.
  18. Marcus Hutter. On the existence and convergence of computable universal priors. In International Conference on Algorithmic Learning Theory, pp.  298–312. Springer, 2003.
  19. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  20. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  21. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  22. UMAP: uniform manifold approximation and projection for dimension reduction. CoRR, abs/1802.03426, 2018. URL http://arxiv.org/abs/1802.03426.
  23. Playable video generation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp.  10061–10070. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.00993. URL https://openaccess.thecvf.com/content/CVPR2021/html/Menapace_Playable_Video_Generation_CVPR_2021_paper.html.
  24. Transformers are sample-efficient world models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=vhFu1Acb0xb.
  25. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
  26. Dean Pomerleau. ALVINN: an autonomous land vehicle in a neural network. In David S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, [NIPS Conference, Denver, Colorado, USA, 1988], pp.  305–313. Morgan Kaufmann, 1988.
  27. Language models are unsupervised multitask learners. 2019.
  28. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  29. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp.  8821–8831. PMLR, 2021a.
  30. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp.  8821–8831. PMLR, 2021b. URL http://proceedings.mlr.press/v139/ramesh21a.html.
  31. A generalist agent. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=1ikK0kHjvj.
  32. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp.  234–241. Springer, 2015.
  33. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.  661–668. JMLR Workshop and Conference Proceedings, 2010.
  34. Reinforcement learning with videos: Combining offline observations with interaction. In Jens Kober, Fabio Ramos, and Claire J. Tomlin (eds.), 4th Conference on Robot Learning, CoRL 2020, 16-18 November 2020, Virtual Event / Cambridge, MA, USA, volume 155 of Proceedings of Machine Learning Research, pp.  339–354. PMLR, 2020. URL https://proceedings.mlr.press/v155/schmeckpeper21a.html.
  35. Jürgen Schmidhuber. Discovering neural nets with low kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
  36. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
  37. R. J. Solomonoff. A formal theory of inductive inference, parts I and II. Information and Control, 7:1–22, 224–254, 1964.
  38. Preventing mode collapse when imitating latent policies from observations, 2023. URL https://openreview.net/forum?id=Mf9fQ0OgMzo.
  39. Reinforcement learning: An introduction. MIT press, 2018.
  40. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  41. Behavioral cloning from observation. In Jérôme Lang (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp.  4950–4957. ijcai.org, 2018. doi: 10.24963/ijcai.2018/687. URL https://doi.org/10.24963/ijcai.2018/687.
  42. Recent advances in imitation learning from observation. In Sarit Kraus (ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp.  6325–6331. ijcai.org, 2019a. doi: 10.24963/ijcai.2019/882. URL https://doi.org/10.24963/ijcai.2019/882.
  43. Generative adversarial imitation from observation. ICML Workshop on Imitation, Intent, and Interaction (I3), 2019b.
  44. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  6306–6315, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html.
  45. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  46. Imitation learning from observations by minimizing inverse dynamics disagreement. Advances in neural information processing systems, 32, 2019.
  47. Become a proficient player with limited data through watching pure videos. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=Sy-o2N0hF4f.
  48. Soundstream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022. doi: 10.1109/TASLP.2021.3129994. URL https://doi.org/10.1109/TASLP.2021.3129994.
  49. Learning to drive by watching youtube videos: Action-conditioned contrastive policy pretraining. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI, volume 13686 of Lecture Notes in Computer Science, pp.  111–128. Springer, 2022. doi: 10.1007/978-3-031-19809-0_7. URL https://doi.org/10.1007/978-3-031-19809-0_7.
  50. Semi-supervised offline reinforcement learning with action-free trajectories. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  42339–42362. PMLR, 2023. URL https://proceedings.mlr.press/v202/zheng23b.html.

Summary

  • The paper introduces Latent Action Policies (LAPO), a method for recovering latent action information from video data without action labels.
  • It jointly trains an inverse dynamics model and a forward dynamics model, using a vector-quantized bottleneck to produce compressed yet interpretable latent actions.
  • LAPO adapts latent policies to the true action space using minimal labeled data or online RL, significantly outperforming policies trained from scratch.

Latent Action Policies (LAPO): Learning to Act without Actions

Introduction

The paper "Learning to Act without Actions" introduces Purely Observational Policy Pre-training (POPP), a method for learning policies, world models, and inverse dynamics models (IDMs) directly from video data without access to action labels. This approach addresses a central challenge in scaling reinforcement learning (RL) to web-scale data, where the most abundant behavioral data—videos—lack explicit action annotations. POPP leverages unsupervised objectives to recover latent action information from observed environment dynamics, enabling the training of latent-action policies that can be rapidly adapted to the true action space with minimal labeled data or online interaction.

Methodology

LAPO is built on two key components: an inverse dynamics model (IDM) and a forward dynamics model (FDM). The IDM predicts a latent action $z_t$ given a sequence of observations $(o_{t-k}, \ldots, o_t, o_{t+1})$, while the FDM predicts the next observation $\hat{o}_{t+1}$ given the past observations and the latent action. Both models are trained jointly to minimize the next-state prediction error, with the latent action serving as an information bottleneck. Vector quantization (VQ) is applied to the latent actions to enforce discrete, reusable representations and prevent the IDM from simply copying future observations.
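
A minimal PyTorch sketch of this joint objective is given below. It is illustrative only: observations are flattened vectors, the IDM conditions on a single past frame rather than a longer context, the linear modules stand in for the paper's convolutional architectures, and the VQ-VAE-style commitment losses are an assumption about the exact bottleneck formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, z_dim, num_codes = 64 * 64 * 3, 32, 64   # hypothetical sizes

class VQBottleneck(nn.Module):
    """Quantize latent actions against a learned codebook (straight-through)."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, z_dim)

    def forward(self, z):
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)  # nearest code
        z_q = self.codebook(idx)
        # Codebook + commitment losses, as in VQ-VAE (van den Oord et al., 2017).
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z.
        return z + (z_q - z).detach(), vq_loss

idm = nn.Linear(2 * obs_dim, z_dim)        # z_t from (o_t, o_{t+1})
fdm = nn.Linear(obs_dim + z_dim, obs_dim)  # o-hat_{t+1} from (o_t, z_t)
vq = VQBottleneck()

def lapo_loss(o_t, o_next):
    """Joint IDM/FDM objective: reconstruct o_{t+1} through the latent bottleneck."""
    z, vq_loss = vq(idm(torch.cat([o_t, o_next], dim=-1)))
    o_pred = fdm(torch.cat([o_t, z], dim=-1))
    return F.mse_loss(o_pred, o_next) + vq_loss
```

The bottleneck is what makes the latent action-like: without it, the IDM could simply pass the full next frame through $z_t$ and the FDM would learn nothing about dynamics.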

The training procedure consists of three stages, each sketched in code below:

  1. Latent IDM Training: Learn a compressed latent action representation via predictive consistency between IDM and FDM.
  2. Behavior Cloning: Use the trained IDM to label transitions in the observation-only dataset with latent actions, then train a policy to imitate these latent actions.
  3. Decoding Latent Actions: Adapt the latent policy to the true action space using either a small action-labeled dataset (offline decoding) or online RL (online decoding).
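
Stage two can be expressed as a short continuation of the earlier sketch. The regression loss onto quantized latents is an assumption for illustration; `idm`, `vq`, and the shape constants carry over from the previous snippet.

```python
# Stage 2 sketch: relabel action-free video with latent actions from the frozen
# IDM, then behavior-clone a latent policy pi(z | o).
policy = nn.Linear(obs_dim, z_dim)
p_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(o_t, o_next):
    with torch.no_grad():                                   # IDM is frozen here
        z_target, _ = vq(idm(torch.cat([o_t, o_next], dim=-1)))
    loss = F.mse_loss(policy(o_t), z_target)                # clone the quantized latents
    p_opt.zero_grad(); loss.backward(); p_opt.step()
    return loss.item()
```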

Latent Action Space Analysis

LAPO demonstrates that the learned latent action space is highly interpretable and closely corresponds to the true action space, even though no ground-truth action labels are used during training. UMAP projections of the latent action space reveal distinct clusters aligned with true actions across diverse environments (Figure 1).

Figure 1: UMAP projection of the learned latent action space for Miner, showing interpretable clusters corresponding to true actions, despite training without action labels.
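
A projection like the one in Figure 1 can be reproduced with umap-learn along these lines. The input files are hypothetical, and the true actions are used only to color the points, never during training.

```python
import numpy as np
import umap                         # pip install umap-learn
import matplotlib.pyplot as plt

latents = np.load("latent_actions.npy")      # (N, z_dim) IDM outputs, hypothetical file
true_actions = np.load("true_actions.npy")   # (N,) ground truth, for coloring only

xy = umap.UMAP(n_components=2).fit_transform(latents)
plt.scatter(xy[:, 0], xy[:, 1], c=true_actions, s=2, cmap="tab20")
plt.title("Latent action space, colored by true action")
plt.show()
```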

Further analysis across all 16 Procgen environments confirms that the structure of the latent action space varies with environment complexity and partial observability. In environments with higher partial observability, the latent space exhibits more fragmentation, reflecting the need to encode off-screen or unobserved information (Figure 2).

Figure 2: UMAP projection of the learned latent action space for all 16 Procgen games, illustrating environment-dependent structure and alignment with true actions.

Policy Adaptation and Performance

LAPO's latent policies can be efficiently adapted to the true action space. When a small action-labeled dataset is available, a decoder trained on as few as 200 labeled transitions enables the latent policy to exceed the performance of a policy trained from scratch with 4 million steps of PPO. Performance plateaus with increasing labeled data, since the decoder maps latent actions to true actions without conditioning on the state, which caps its capacity (Figure 3).

Figure 3: Test performance of the latent policy with an offline-trained decoder, showing rapid gains with few labeled samples and plateauing below online RL decoding.
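
A sketch of this offline decoding stage, continuing the snippets above: a small head maps latent actions to true actions, trained on a handful of labeled transitions. The 15-way output matches Procgen's discrete action space, and `labeled_transitions` is an assumed iterable of tensors.

```python
# Stage 3 (offline variant): fit a decoder from latent to true actions.
decoder = nn.Linear(z_dim, 15)                  # latent action -> true-action logits
d_opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

for o_t, o_next, a_t in labeled_transitions:    # e.g. ~200 labeled samples
    with torch.no_grad():
        z, _ = vq(idm(torch.cat([o_t, o_next], dim=-1)))
    loss = F.cross_entropy(decoder(z), a_t)
    d_opt.zero_grad(); loss.backward(); d_opt.step()

def act(o_t):
    """Compose latent policy and decoder into a true-action policy."""
    return decoder(policy(o_t)).argmax(dim=-1)
```

Note that the decoder sees only the latent action, not the observation; this state-invariance is what the plateau in Figure 3 reflects.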

In the online setting, fine-tuning the latent policy with RL enables rapid recovery of expert-level performance, often exceeding the original expert within 4 million frames. By contrast, PPO trained from scratch achieves only 44% of expert performance in the same period. Ablations demonstrate the importance of vector quantization and supervised decoder initialization for efficient adaptation (Figure 4).


Figure 4: Left: Mean episodic returns for decoding LAPO's latent policy vs. PPO from scratch. Right: Mean test returns relative to expert policies across all Procgen environments.
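
The online variant amounts to stacking the decoder head on the latent policy and fine-tuning the composite with rewards. A minimal sketch, assuming any standard PPO implementation can take `agent` as its actor network; only the composition is LAPO-specific.

```python
class DecodedPolicy(nn.Module):
    def __init__(self, latent_policy, decoder):
        super().__init__()
        self.latent_policy = latent_policy   # pre-trained in stage 2
        self.decoder = decoder               # initialized by supervised decoding

    def forward(self, obs):
        return self.decoder(self.latent_policy(obs))   # logits over true actions

agent = DecodedPolicy(policy, decoder)
# e.g. plug `agent` into a PPO trainer as the actor and fine-tune end-to-end.
```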

Comparison to Prior Work

LAPO differs fundamentally from prior approaches such as ILPO, FICC, VPT, and BCO. Unlike ILPO, which relies on a small discrete set of latent actions and suffers from mode collapse in visually diverse environments, LAPO learns expressive vector-quantized latent actions through an IDM-based objective, enabling robust modeling of stochasticity and partial observability. LAPO also avoids the need for substantial action-labeled data, unlike semi-supervised methods, and does not require access to the true action space during training, unlike imitation learning from observation (IfO) methods.

Limitations

LAPO's performance can be affected by delayed action effects, significant environment stochasticity, and the need for larger models when scaling to more complex domains. Delayed effects can be mitigated by extending the temporal context of the IDM and FDM architectures, potentially using Transformer-based models. Stochasticity may require larger datasets to ensure robust latent representations. Scaling to web-scale video data will necessitate careful balancing of model capacity and bottleneck strength.

Implications and Future Directions

LAPO provides a pathway for unsupervised pretraining of generalist RL policies and world models on massive video datasets, analogous to pretraining paradigms in language and vision. The ability to recover action information and train adaptable policies from pure observation opens new possibilities for leveraging web-scale behavioral data. Future work should focus on scaling LAPO to more powerful architectures, integrating multi-task and multi-modal data, and improving generalization to unseen tasks.

Conclusion

LAPO establishes that comprehensive action information can be recovered from pure video via unsupervised objectives, enabling the training of latent-action policies that are rapidly adaptable to the true action space. The method achieves strong empirical results across diverse environments, often exceeding expert performance with minimal labeled data or online interaction. LAPO represents a significant advance toward scalable, generalist RL agents trained on web-scale observational data, with broad implications for the future of unsupervised policy pretraining and embodied AI.
