- The paper introduces GVP-WM, which transforms video-generated visual plans into physically feasible action sequences using constrained trajectory optimization in latent space.
- It integrates diffusion-based video planning with a pretrained, action-conditioned world model to align semantic cues with dynamic feasibility, yielding higher success rates than baseline methods.
- Empirical evaluations on simulated manipulation and navigation tasks demonstrate improved robustness and efficiency under complex and noisy visual guidance.
Grounding Video-Generated Plans in Dynamically Feasible Actions via World Models
Introduction and Motivation
Large-scale video generative models, particularly diffusion-based architectures, now exhibit strong zero-shot visual planning capabilities, synthesizing temporally ordered videos in response to start and goal prompts. However, inferring actions directly from such video plans frequently violates real-world physical and temporal constraints, manifesting as object teleportation, kinematic violations, or motion blur, especially in out-of-distribution environments. Existing approaches that use video-generated subgoals for hierarchical control or model-predictive control implicitly assume that the generated visual plans are themselves feasible, an assumption that does not hold under these failure cases.
This work introduces Grounding Video Plans with World Models (GVP-WM), a test-time algorithm for mapping video-generated visual plans into physically feasible action sequences by projecting them onto a learned latent dynamics manifold defined by a pretrained, action-conditioned world model. The method approaches grounding as a constrained trajectory optimization problem in the latent space, ensuring both dynamic feasibility and semantic plan alignment. This contribution directly addresses the critical gap between visual-plan expressivity and physical-execution feasibility.
Methodology
GVP-WM Pipeline:
- A conditional video generative model (e.g., a diffusion-based image-to-video generator) produces a visual plan given the initial and goal observations.
- The video is encoded into a sequence of latent states using the pretrained world model’s visual encoder.
- Video-guided latent collocation is performed: both latent states and action sequences are jointly optimized to (a) semantically align with the video plan, (b) reach the latent goal, and (c) satisfy the world model’s transition dynamics as hard constraints.
- The augmented Lagrangian method (ALM) solves the resulting constrained optimization, alternating between primal updates (latent trajectory and actions) and dual updates (constraint multipliers).
- The resulting feasible action sequence is executed using receding-horizon model predictive control (MPC), with optional local action refinement by sampling around the optimal trajectory.
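The pipeline above can be sketched end to end as a toy program. Everything below is a hypothetical stand-in, not the paper's implementation: a linear map plays the role of the pretrained world model, random vectors play the encoded video latents, and finite-difference gradient descent replaces whatever primal solver the authors use. Only the structure mirrors the described method: joint latent-state/action decision variables, video-guided initialization, augmented-Lagrangian primal/dual alternation, and clipped actions.

```python
import numpy as np

# Toy stand-ins (our assumptions): f_dyn is a linear latent dynamics playing
# the role of the pretrained world model; video_latents stand in for the
# encoder's output on the generated video. All names and hyperparameters
# are illustrative.
rng = np.random.default_rng(0)
D, A_DIM, T = 4, 2, 6                        # latent dim, action dim, horizon
A_mat = 0.95 * np.eye(D)
B_mat = 0.1 * rng.normal(size=(D, A_DIM))

def f_dyn(z, a):                             # stand-in world-model transition
    return A_mat @ z + B_mat @ a

def cosine_align(z, v):                      # scale-invariant alignment loss
    return 1.0 - z @ v / (np.linalg.norm(z) * np.linalg.norm(v) + 1e-8)

video_latents = rng.normal(size=(T + 1, D))  # encoded video plan (stand-in)
z_goal = video_latents[-1]

# Collocation: the full latent trajectory AND the action sequence are
# decision variables, initialized from the video latents (video-guided init).
z = video_latents.copy()
a = np.zeros((T, A_DIM))
lam = np.zeros((T, D))                       # dual variables, one per step
rho = 1.0                                    # ALM penalty weight

def lagrangian():
    cost = sum(cosine_align(z[t], video_latents[t]) for t in range(T + 1))
    cost += np.sum((z[-1] - z_goal) ** 2)    # terminal goal loss
    cost += 1e-3 * np.sum(a ** 2)            # action regularization
    for t in range(T):
        c = z[t + 1] - f_dyn(z[t], a[t])     # dynamics residual (constraint)
        cost += lam[t] @ c + 0.5 * rho * np.sum(c ** 2)
    return cost

def num_grad(x, eps=1e-5):                   # finite-difference gradient
    g, flat = np.zeros_like(x), x.ravel()
    for i in range(flat.size):
        old = flat[i]
        flat[i] = old + eps; hi = lagrangian()
        flat[i] = old - eps; lo = lagrangian()
        flat[i] = old
        g.ravel()[i] = (hi - lo) / (2 * eps)
    return g

init_residual = max(np.linalg.norm(z[t + 1] - f_dyn(z[t], a[t]))
                    for t in range(T))
z0 = video_latents[0].copy()
for outer in range(8):                       # ALM outer loop
    lr = 0.1 / rho
    for inner in range(40):                  # primal: minimize the Lagrangian
        z -= lr * num_grad(z); z[0] = z0     # initial latent stays fixed
        a[:] = np.clip(a - lr * num_grad(a), -1.0, 1.0)  # bounded actions
    for t in range(T):                       # dual ascent on the residuals
        lam[t] += rho * (z[t + 1] - f_dyn(z[t], a[t]))
    rho = min(2.0 * rho, 64.0)

final_residual = max(np.linalg.norm(z[t + 1] - f_dyn(z[t], a[t]))
                     for t in range(T))
```

In a receding-horizon deployment, only the first action (or first few actions) of `a` would be executed before re-running the optimization from the newly observed state, which is the MPC step described above.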
Latent Collocation Formulation:
Unlike shooting-based or gradient-based action planners, GVP-WM treats the entire trajectory of latent states and actions as optimization variables. The objective is formulated as:
- Minimize a sum of:
- Video-plan-to-latent-state alignment (scale-invariant cosine loss).
- Terminal goal loss (latent MSE).
- Action regularization.
- Subject to constraints that latent transitions are explained by the learned world-model dynamics and actions are bounded.
This ensures that the resulting trajectories are not only semantically aligned with the video plan but also executable under physically plausible transitions.
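In symbols (our own notation, since this summary does not reproduce the paper's equations: \(v_t\) for encoded video latents, \(f_\theta\) for the world-model transition, \(z_{\mathrm{goal}}\) for the goal latent, \(\beta\) and \(\gamma\) for weighting coefficients), the collocation problem can be written as:

```latex
\min_{z_{0:T},\, a_{0:T-1}} \quad
\sum_{t=0}^{T} \left( 1 - \frac{z_t^{\top} v_t}{\lVert z_t \rVert \, \lVert v_t \rVert} \right)
+ \beta \, \lVert z_T - z_{\mathrm{goal}} \rVert_2^2
+ \gamma \sum_{t=0}^{T-1} \lVert a_t \rVert_2^2
\qquad \text{s.t.} \quad
z_{t+1} = f_\theta(z_t, a_t), \quad
a_{\min} \le a_t \le a_{\max}.
```

The first term is the scale-invariant cosine alignment, the second the terminal goal loss, the third the action regularizer; the dynamics equality constraints are what ALM enforces via its penalty and multiplier updates.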
Experimental Evaluation
Environments and Baselines
GVP-WM is evaluated on two simulated control domains with challenging, long-horizon tasks:
- Push-T: Contact-rich 2D manipulation requiring physical interactions.
- Wall: Visual navigation around obstacles, emphasizing geometric reasoning.
Video plans are generated with both zero-shot and domain-adapted (LoRA fine-tuned) diffusion models, as well as ground-truth expert rollouts for upper-bound comparison. The world model (DINO-WM) is kept fixed, leveraging DINOv2-based visual features.
Baselines include:
- MPC-CEM: Sampling-based action optimization without video guidance.
- MPC-GD: Gradient-based planner over actions, without video guidance.
- UniPi [5]: Direct video-to-action inference using inverse dynamics, bypassing world-model dynamics.
Quantitative Results
Key Claims:
- GVP-WM yields higher success rates compared to the inverse-dynamics baseline (UniPi) in both manipulation and navigation tasks, especially for longer horizons and under realistic, noisy video guidance.
- Under domain-adapted video plans, GVP-WM outperforms MPC-based planning without video guidance across nearly all settings; the exception is certain zero-shot, out-of-distribution cases at extreme horizons, where performance is merely comparable.
- Robustness: GVP-WM is substantially more robust to temporally inconsistent (blurred) video guidance, maintaining high success rates where direct video-to-action methods collapse (e.g., success remains above 0.8 for moderate blur, compared to near-zero for UniPi).
- Efficiency: GVP-WM’s optimization-based collocation requires less planning time per episode than sampling-based MPC-CEM, because the video semantic prior narrows the search space.
Qualitative and Ablation Analysis
- Zero-shot video plans often exhibit strong semantic drift and physics violations. GVP-WM either recovers feasible plans by rejecting inconsistent video guidance or fails gracefully when visual hallucinations dominate.
- Ablation studies confirm the importance of:
- Latent collocation (joint state/action optimization)—naive waypoint following in latent space fails.
- Video-guided initialization—starting from video latents is critical under high-quality visual guidance.
- Scale-invariant cosine loss—improves performance relative to MSE losses, which are susceptible to magnitude drift in pretrained visual encoders.
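The last ablation point is easy to check in isolation. The toy snippet below (our own example, not the paper's code) shows that a cosine alignment loss is invariant to a global rescaling of the latent, while MSE is not, which is the "magnitude drift" failure mode cited above:

```python
import numpy as np

def cosine_loss(z, v):
    # scale-invariant: depends only on the angle between z and v
    return 1.0 - z @ v / (np.linalg.norm(z) * np.linalg.norm(v))

def mse_loss(z, v):
    # scale-sensitive: penalizes any difference in magnitude
    return float(np.mean((z - v) ** 2))

rng = np.random.default_rng(1)
v = rng.normal(size=8)           # stand-in "video latent"
z_drifted = 3.0 * v              # same direction, drifted magnitude

print(round(abs(cosine_loss(z_drifted, v)), 6))   # → 0.0 (direction unchanged)
print(mse_loss(z_drifted, v) > 0)                 # → True (MSE penalizes scale)
```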
Theoretical and Practical Implications
GVP-WM directly mitigates the model-reality gap inherent in visual plan-based action inference, especially for out-of-distribution inputs. By anchoring optimization with video-generated priors but enforcing action sequence feasibility through learned world-model dynamics, the algorithm exploits the strengths of both generative video modeling and model-based reinforcement learning paradigms.
Theoretical implications include:
- The separation of plan generation and plan grounding enhances compositionality and interpretability.
- The synthesis of semantic guidance and latent collocation opens a path toward hierarchical planning architectures capable of integrating abstract visual plans with physical motion primitives.
Practically, GVP-WM represents a significant advance for integrating foundation-scale generative models into robotic and embodied agent control—enabling visual planning interfaces to robustly control agents without requiring end-to-end retraining or reward engineering.
Limitations and Future Prospects
GVP-WM’s efficacy relies on the accuracy of the world model and the quality of visual plan guidance. In scenarios with severe model-environment mismatch or extreme out-of-distribution video generation, feasible plan recovery is not guaranteed; indeed, in some zero-shot regimes, baseline planning without video guidance is preferable. The iterative nature of latent collocation introduces computational overhead relative to simple direct-inference approaches.
Potential future directions include:
- Extension to real-world robotic platforms, where the distributional gap between generated visual plans and physical execution is less severe.
- Joint training or co-adaptation of video generators and world models.
- Hierarchical, multi-level planning where GVP-WM anchors high-level strategies refined through local policy adaptation or online RL.
- Policy distillation from collocated planning to enable real-time policy deployment for long-horizon control.
Conclusion
GVP-WM provides a principled framework for transforming video-generated plans into executable, dynamically feasible actions by leveraging pretrained world-model dynamics in a video-guided latent collocation optimization loop. This approach achieves superior empirical robustness and performance relative to direct video-to-action methods and uncovers a promising direction for closing the planning-execution gap in visually guided robotics and control. As foundation video models and scalable world models improve, the integration demonstrated by GVP-WM is poised to become increasingly relevant in embodied AI.