Planning directly in latent action space

Develop methods for planning directly within the latent action space learned by latent action world models trained on in-the-wild videos, rather than mapping from a known action space. Specifically, construct sampling and optimization procedures that operate over the continuous latent action vectors inferred by the inverse dynamics model and that account for the geometry of sparsity- or noise-regularized latent actions, enabling goal-directed sequence generation in latent space.
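One concrete shape such an optimization procedure could take is sampling-based planning (e.g., the cross-entropy method) run directly over sequences of continuous latent action vectors. The sketch below is a minimal illustration, not the paper's method: `rollout` is a toy stand-in for the learned latent world model, and `goal_cost` is an assumed goal-distance function.

```python
import numpy as np

# Hypothetical interfaces, standing in for the learned world model:
#   rollout(z0, actions) -> predicted latent state after applying the sequence
#   goal_cost(z, goal)   -> scalar distance between a latent state and the goal
def rollout(z0, actions):
    # Toy linear dynamics in place of the learned forward model.
    z = z0.copy()
    for a in actions:
        z = 0.9 * z + a
    return z

def goal_cost(z, goal):
    return float(np.sum((z - goal) ** 2))

def cem_plan(z0, goal, horizon=5, dim=4, iters=10, pop=64, elites=8, seed=0):
    """Cross-entropy method over sequences of continuous latent actions."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, dim))
    sigma = np.ones((horizon, dim))
    for _ in range(iters):
        # Sample candidate latent-action sequences from the current Gaussian.
        cand = mu + sigma * rng.standard_normal((pop, horizon, dim))
        costs = np.array([goal_cost(rollout(z0, c), goal) for c in cand])
        # Refit the sampling distribution to the lowest-cost elites.
        elite = cand[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # planned latent-action sequence (horizon, dim)
```

Note that this naive Gaussian sampling is exactly where the geometric issues discussed below bite: for sparse or noisy latents, candidates drawn this way may fall off the manifold of latent actions the model was trained on.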

Background

The paper trains latent action world models on large-scale in-the-wild videos and demonstrates their utility for planning by learning a small controller that maps known actions (e.g., from robotics or navigation datasets) into the learned latent action space. This approach yields competitive planning performance relative to action-labeled baselines, but it relies on an explicit action-to-latent mapping.

In the appendix, the authors discuss operating directly in the latent action space, noting that continuous latent actions (e.g., VAE-like noisy latents and sparse energy-based latents) pose nontrivial challenges for sampling and planning. While discrete codebooks can be sampled straightforwardly, continuous latents require priors or MCMC (e.g., SGLD for energy-based models), and they observe sampling mismatches that worsen as latent capacity increases.
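For the energy-based case mentioned above, SGLD draws latent actions by following the energy gradient with injected Gaussian noise. The sketch below is a generic SGLD loop, not the paper's implementation; the energy function is a toy quadratic standing in for a learned energy over latent actions.

```python
import numpy as np

def sgld_sample(energy_grad, dim=4, steps=200, step_size=1e-2, seed=0):
    """Stochastic Gradient Langevin Dynamics over a continuous latent action.

    energy_grad(z) returns dE/dz for an (assumed) learned energy E(z);
    low-energy regions correspond to plausible latent actions.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(dim)  # initialize from a broad proposal
    for _ in range(steps):
        noise = rng.standard_normal(dim)
        # Langevin update: gradient descent on E plus scaled Gaussian noise.
        z = z - 0.5 * step_size * energy_grad(z) + np.sqrt(step_size) * noise
    return z

# Toy energy E(z) = 0.5 * ||z - mu||^2, whose gradient is (z - mu).
mu = np.array([1.0, -1.0, 0.5, 0.0])
sample = sgld_sample(lambda z: z - mu)
```

The step size and chain length trade off mixing against discretization error; the mismatch the authors observe corresponds to chains like this failing to cover the latent distribution as its capacity grows.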

They suggest learning-based generative approaches (e.g., diffusion models) as a possible direction, but emphasize that planning directly in latent action space remains an open problem, particularly due to geometric and sampling issues inherent to continuous latent actions.

References

Performing planning directly in latent action space is, to the best of our knowledge, an open problem that can be made worse depending on the geometry of the latent action space.

Learning Latent Action World Models In The Wild  (2601.05230 - Garrido et al., 8 Jan 2026) in Appendix, Section “Sampling latent actions”