Extending continuous planning to language-based procedural tasks

Determine principled techniques to extend low-level continuous planning approaches to high-level language-based agent planning, particularly for procedural tasks that require stronger semantic and temporal abstraction.

Background

The paper identifies two bottlenecks in using world-state distances as reward signals. Prior successes have largely relied on representations learned by visual foundation models in low-level continuous control settings (e.g., DINO-WM, JEPA-WMs, RoboCLIP), whereas agents operating in text require higher-level abstractions that capture semantics and temporal structure.
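For concreteness, the reward signal discussed here is typically a distance between learned state embeddings: the closer the current state's embedding is to the goal's, the higher the reward. The sketch below is a minimal illustration under that assumption only; the `encode` function, the embedding shapes, and the choice of an L2 metric are placeholders, not the API of DINO-WM, JEPA-based world models, or RoboCLIP.

```python
import numpy as np

def embedding_distance_reward(obs_embedding: np.ndarray,
                              goal_embedding: np.ndarray) -> float:
    """Dense reward as the negative L2 distance between the current
    observation's embedding and the goal embedding."""
    return -float(np.linalg.norm(obs_embedding - goal_embedding))

# Hypothetical usage: `encode` stands in for any frozen pretrained
# visual encoder applied to the current and goal observations.
# reward = embedding_distance_reward(encode(current_frame), encode(goal_frame))
```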

Within this context, the authors explicitly note that bridging from low-level continuous planning to high-level language-based agent planning—especially for procedural tasks—remains unresolved. This motivates their proposed StateFactory framework and the RewardPrediction benchmark, aimed at structuring text-based world states and evaluating fine-grained reward prediction in language-centric environments.
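As a rough illustration of what "structuring text-based world states" and "fine-grained reward prediction" could look like, the toy sketch below factorizes a textual state into attribute-value pairs and scores progress as the fraction of goal factors already satisfied. This is an assumption-laden example for intuition only; it is not the StateFactory design or the RewardPrediction benchmark's metric.

```python
from typing import Dict

# A factorized text world state: each named aspect of the environment
# maps to its current (textual) value.
Factor = Dict[str, str]

def factored_progress_reward(state: Factor, goal: Factor) -> float:
    """Toy fine-grained reward: fraction of goal factors whose value
    matches the current factorized state."""
    if not goal:
        return 0.0
    satisfied = sum(1 for key, value in goal.items() if state.get(key) == value)
    return satisfied / len(goal)

# Hypothetical procedural-task factors (names invented for illustration).
state = {"door": "closed", "key": "in_inventory", "lamp": "off"}
goal = {"door": "open", "key": "in_inventory"}
print(factored_progress_reward(state, goal))  # 0.5
```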

References

Yet, how to extend from low-level continuous planning to high-level language-based agent planning, especially for procedural tasks that require stronger semantic and temporal abstraction, remains an open challenge.

Reward Prediction with Factorized World States (2603.09400 - Shen et al., 10 Mar 2026) in Section 1: Introduction