- The paper demonstrates that current agents fail to leverage external world models to enhance long-horizon foresight in sequential tasks.
- Empirical results reveal that both optional and forced simulation modes often degrade performance, highlighting misalignment between simulation and action.
- Analysis attributes failures to poor governance in simulation invocation, interpretation, and integration rather than inherent model deficiencies.
Problem Statement and Motivation
The capacity for anticipatory cognition—projecting and reasoning about hypothetical future states—is pivotal for intelligent sequential decision making. While large vision-language models (VLMs) show proficiency in single-step or myopic planning, their failure modes on long-horizon, environment-coupled tasks frequently arise from a lack of effective foresight. Generative world models have recently emerged as off-the-shelf simulators that may help fill this gap, in principle allowing agents to preview the outcomes of future actions before committing to them in the real environment.
This work systematically investigates whether current LLM/VLM-based agents meaningfully benefit from having external world models available as optional tools, instead of only relying on internal rollouts or rigid, prompt-engineered forms of deliberate planning. The central research question is: can agents strategically consult a world model to enhance their downstream task cognition, and if not, what are the empirical and mechanistic reasons underpinning these failures?
The agent-environment-simulator triad and the study’s experimental protocol are formalized as a generalized tool-use framework in which an agent, at each step, may select either to act in the real environment or to invoke the world model for hypothetical simulation.
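The triad above can be sketched as a minimal decision loop. This is an illustrative reconstruction, not the paper's actual interface: the toy environment, agent, and all class and method names (`ToyEnv`, `ToyAgent`, `Decision`, `rollout`, `incorporate`) are assumptions made for the sketch. The key property is that a "simulate" choice queries the world model without advancing the real environment, while an "act" choice commits irreversibly.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    kind: str     # "act" or "simulate" (illustrative protocol, not the paper's)
    action: int

class ToyEnv:
    """Toy 1-D corridor: reaching position 3 counts as success."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos += action
        done = self.pos >= 3
        return self.pos, (1.0 if done else 0.0), done

class ToyWorldModel:
    """Clone of the environment dynamics, queried only hypothetically."""
    def rollout(self, state, action):
        return state + action   # imagined next state; real env is untouched

class ToyAgent:
    """Simulates once up front, then commits to acting."""
    def __init__(self):
        self.simulated = False
    def choose(self, obs):
        if not self.simulated:
            return Decision("simulate", 1)
        return Decision("act", 1)
    def incorporate(self, imagined_state):
        self.simulated = True   # foresight consumed; switch to acting

def run_episode(agent, env, world_model, max_steps=10):
    """At each step the agent either acts in the real environment
    or invokes the world model for a hypothetical rollout."""
    obs = env.reset()
    for _ in range(max_steps):
        d = agent.choose(obs)
        if d.kind == "simulate":
            agent.incorporate(world_model.rollout(obs, d.action))
        else:
            obs, reward, done = env.step(d.action)
            if done:
                return reward
    return 0.0
```

The framework's governance questions (when to simulate, how to interpret the rollout, how to fold it into the plan) live entirely inside `choose` and `incorporate`, which is precisely where the paper locates the failures.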
Figure 1: The world model as tool framework where the agent decides between real action taking and simulation.
Experimental Design
Task Selection
Evaluation spans two major families: (1) agentic sequential control tasks requiring long-horizon, articulated action in both 2D and 3D settings (FrozenLake, Navigation, PrimitiveSkill, Sokoban), and (2) VQA tasks demanding spatial or hypothetical visual reasoning (3DSRBench, MMSI Bench, SAT, Spatial-MM Object). For agentic tasks, a ground-truth simulator is cloned as the world model, while for VQA tasks, the generative video model WAN2.1 provides simulated state evolution. This dichotomy enables controlled analysis across both embodied and purely perceptual reasoning domains.
Model Sampling and Protocols
The analysis covers multiple families (GPT, Llama, Qwen), each at different scales, ensuring both closed- and open-source paradigms are tested. Three world model access modes are studied: invisible (no access), normal (optional access), and force (compulsory simulation before every action). Main metrics are task success rate (agent tasks) and answer accuracy (VQA).
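The three access modes can be expressed as a simple gate on tool availability, plus a legality check for the force condition. This is a hedged sketch of the protocol as described, with assumed names (`WMAccess`, `available_tools`, `must_simulate_first`, the tool string `world_model.simulate`), not the study's implementation.

```python
from enum import Enum

class WMAccess(Enum):
    INVISIBLE = "invisible"  # world model never exposed to the agent
    NORMAL = "normal"        # agent may optionally invoke it
    FORCE = "force"          # simulation required before every action

def available_tools(mode, base_tools):
    """Which tools the agent sees under each protocol (illustrative)."""
    if mode is WMAccess.INVISIBLE:
        return list(base_tools)
    return list(base_tools) + ["world_model.simulate"]

def must_simulate_first(mode, has_simulated_this_step):
    """Under 'force', a real action is legal only after a simulation call."""
    return mode is WMAccess.FORCE and not has_simulated_this_step
```

Keeping the mode outside the agent, as here, is what lets the study compare identical backbones across the three conditions.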
Core Results
Across all model families, tasks, and settings, the anticipated advantages of off-policy world model rollouts fail to materialize. In agent tasks, world model access not only fails to offer consistent improvement but often degrades performance by several percentage points in the aggregate across most backbones, with only sporadic, model-specific exceptions. In VQA settings, the effect is statistically neutral; accuracy distributions are essentially indistinguishable with or without simulated foresight.
Figure 2: Percentage breakdown of the world model's impact. "WM Helps" is generally less frequent than "WM Hurts" for agent tasks, while VQA shows a more balanced distribution.
Agents Rarely and Idiosyncratically Utilize Simulation
Usage rates—the frequency with which agents invoke the world model in the optional mode—are consistently low. In VQA, even the most willing models simulate on fewer than 10% of instances (the exception is the Llama family, which simulates more often yet gains little). Model family and scale have consistent effects: larger, more capable models consult simulations even less, reflecting high intrinsic confidence, while smaller models call the world model somewhat more often but rarely see a net benefit.
Figure 3: Comparison of world model usage rate and corresponding performance gain across model families and scales, for both agentic tasks (top row) and VQA tasks (bottom row). Results are aggregated by model family (GPT, Qwen) and size, highlighting systematic differences in invocation behavior and the limited, sometimes negative, returns of increased world model usage.
There is a robust, monotonic relationship between world model call count and task failure, in both agent and VQA settings: higher simulation frequency predicts lower task success and accuracy. Empirically, repeated calls reflect not strategic, cumulative reasoning but unresolved internal uncertainty and the absence of an integration policy.
Figure 4: Success rate as a function of world model call count. More calls often correlate with worse outcomes.
Forcing World Model Integration is Actively Harmful
When simulation is made mandatory before every action or answer, performance collapses further, in many cases by double digits in success rate, especially on sequential-agent tasks. Compulsory use amplifies all measured failure modes.
Figure 5: Comparison of agent task performance between world model invisible and forced world model use conditions.
Attribution Analysis Reveals Governance, Not Simulation, as the Bottleneck
Detailed attribution shows that the main limiting factor is not the accuracy of simulated rollouts, but the lack of cognitive pipeline governance: (I) agents do not know when to simulate, (II) cannot reliably interpret the signal, and (III) fail to ground simulated evidence into the action plan. Successes are attributable to carefully calibrated, strategically queried simulations with unambiguous verification and stable integration. Failures are pervasive—spanning calibration error (missed or unnecessary queries), ambiguous simulation interpretation, and instability in action integration.
Figure 6: A Taxonomy of World Model Governance Successes: Correctly functioning governance follows a three-stage cognitive pipeline enabled by Strategic Input (I), Clear Interpretation (II), and Grounded Action (III). Success arises from calibrated queries (I), unambiguous verification (II), and stable integration of simulations into actionable plans (III).
Figure 7: A Taxonomy of World Model Governance Failures: Pipeline breakdowns map to three disruptive pillars: Calibration Failures (I) causing unnecessary or missed simulation, Interpretation Ambiguity (II) corrupting signal-to-decision alignment, and Unstable Integration Policy (III) preventing foresight from becoming sustained progress. The dominant zones indicate the key bottleneck is governance stability, instead of foresight generation.
Theoretical and Practical Implications
The findings robustly undermine the assumption that adding high-fidelity world model simulators is, in itself, an effective means of bootstrapping agentic foresight or long-horizon reasoning. The central bottleneck in current VLM agents is the absence of mechanisms—either architectural, algorithmic, or protocol-driven—that enable calibrated, strategic, and reflective invocation of external simulation. Neither the modality nor the accuracy of rollouts explains the limited utility observed in practice; rather, the cognitive process connecting the agent’s uncertainty, tool selection, interpretation, and action commitment is misaligned or altogether missing.
Absent explicit governance, agents overtrust their own flawed internal priors, or alternatively, fall into over-planning and action loops. For VQA, simulation is often used as a confirmation rather than discovery tool, thus reinforcing the agent’s initial (potentially erroneous) beliefs instead of driving counterfactual exploration. In agentic control, failures are even starker, with mandates for simulation often resulting in severe performance collapses due to unresolved ambiguity and loss of focus.
Recommendations and Future Directions
The work identifies several actionable research directions:
- Decoupled and Dedicated Modules: Integration of explicit decider, reflector, and memory modules could structure the interaction loop between internal reasoning and foresight.
- Training With Foresight-Sensitive Objectives: Supervised fine-tuning or RL with direct incentives for strategic world model use, diversity and informativeness of queries, and information gain, rather than bare success rate, may be necessary.
- Formulating Hypotheses and Counterfactual Simulation: Agents should be encouraged not only to confirm current beliefs but to generate, discriminate, and select among hypotheses using world model evidence.
- Robust Prompting and Governance Protocols: Current prompt-based interfaces are inadequate for eliciting nuanced world model use; richer governance protocols are needed so agents can learn when, and how, to consult simulation.
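The first recommendation above—decoupled decider, reflector, and memory modules—can be sketched as a single governance step. Every interface here is hypothetical (the thresholded decider, the string verdicts, the method names); the sketch only shows how the three pillars (I calibration, II interpretation, III integration) would be separated into dedicated components.

```python
class UncertaintyDecider:
    """Pillar I (calibration): simulate only when confidence is low."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
    def should_simulate(self, confidence):
        return confidence < self.threshold

class Reflector:
    """Pillar II (interpretation): reduce a raw rollout to an unambiguous verdict."""
    def interpret(self, imagined_outcome, goal):
        return "consistent" if imagined_outcome == goal else "contradicts"

class Memory:
    """Pillar III (integration): persist verdicts so foresight survives later steps."""
    def __init__(self):
        self.verdicts = []
    def record(self, verdict):
        self.verdicts.append(verdict)
    def last(self):
        return self.verdicts[-1] if self.verdicts else None

def governed_step(decider, reflector, memory, confidence, imagined, goal):
    """One governance step: gate (I), interpret (II), integrate (III).
    Downstream action selection would consult memory.last()."""
    if decider.should_simulate(confidence):
        memory.record(reflector.interpret(imagined, goal))
    return memory.last()
```

The point of the decomposition is that a verdict recorded at a confident earlier step remains available later, instead of being re-derived or dropped, which is exactly the instability the failure taxonomy identifies.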
Conclusion
The hypothesis that the addition of a predictive world model as an optional (or forced) tool readily confers agentic foresight is empirically refuted: world model access is, at best, net-neutral and often reliably counterproductive. The main limitation is not the generative capacity of world models but the inability of agentic policies to regulate when, why, and how simulation informs action. Robust anticipatory cognition in agents will require advances in foresight governance, hypothesis management, and explicit mechanisms for integrating hypothetical and real experience in deliberate decision making.
References
Cheng Qian et al., "Current Agents Fail to Leverage World Model as Tool for Foresight" (arXiv:2601.03905).