
Current Agents Fail to Leverage World Model as Tool for Foresight

Published 7 Jan 2026 in cs.AI, cs.CL, and cs.LG | arXiv:2601.03905v2

Abstract: Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.

Summary

  • The paper demonstrates that current agents fail to leverage external world models to enhance long-horizon foresight in sequential tasks.
  • Empirical results reveal that both optional and forced simulation modes often degrade performance, highlighting misalignment between simulation and action.
  • Analysis attributes failures to poor governance in simulation invocation, interpretation, and integration rather than inherent model deficiencies.

Agents and the World Model-as-Tool Paradigm: Empirical Limits of Foresight Integration

Problem Statement and Motivation

The capacity for anticipatory cognition (projecting and reasoning about hypothetical future states) is pivotal for intelligent sequential decision making. While large vision-language models (VLMs) show proficiency in single-step or myopic planning, their failure modes on long-horizon, environment-coupled tasks frequently arise from a lack of effective foresight. Generative world models have recently emerged as off-the-shelf simulators that may fill this gap, in principle allowing agents to preview the outcomes of candidate actions before committing to them in the real environment.

This work systematically investigates whether current LLM/VLM-based agents meaningfully benefit from having external world models available as optional tools, rather than relying only on internal rollouts or rigid, prompt-engineered forms of deliberate planning. The central research question is: can agents strategically consult a world model to enhance their downstream task cognition, and if not, what are the empirical and mechanistic reasons underpinning these failures?

The agent-environment-simulator triad and the study's experimental protocol are formalized as a generalized tool-use framework in which the agent, at each step, may either act in the real environment or invoke the world model for hypothetical simulation (Figure 1).

Figure 1: The world model as tool framework where the agent decides between real action taking and simulation.
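The decision loop in Figure 1 can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual interface: `Decision`, `run_episode`, and the callback signatures are assumptions for exposition. The key property is that a simulation query has no real side effects, while an action commits an irreversible environment step.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    kind: str        # "act" or "simulate"
    payload: object  # an action to commit, or a hypothetical action to preview

def run_episode(decide: Callable[[object, list], Decision],
                env_step: Callable[[object], object],
                wm_rollout: Callable[[object, object], object],
                obs: object, max_steps: int = 20) -> list:
    """At each step the agent chooses: commit an action in the real
    environment, or query the world model for a hypothetical rollout."""
    trace = []
    context = []  # accumulated real observations and simulated previews
    for _ in range(max_steps):
        d = decide(obs, context)
        if d.kind == "simulate":
            preview = wm_rollout(obs, d.payload)  # no real side effects
            context.append(("sim", d.payload, preview))
            trace.append(("simulate", d.payload))
        else:
            obs = env_step(d.payload)             # real, irreversible step
            context.append(("real", d.payload, obs))
            trace.append(("act", d.payload))
    return trace
```

The `decide` callback is where the paper locates the bottleneck: everything downstream assumes the agent can judge when a preview is worth its cost.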

Experimental Design

Task Selection

Evaluation spans two major families: (1) agentic sequential control tasks requiring long-horizon, articulated action in both 2D and 3D settings (FrozenLake, Navigation, PrimitiveSkill, Sokoban), and (2) VQA tasks demanding spatial or hypothetical visual reasoning (3DSRBench, MMSI-Bench, SAT, Spatial-MM Object). For agentic tasks, a ground-truth simulator is cloned as the world model, while for VQA tasks, the generative video model WAN2.1 provides simulated state evolution. This dichotomy enables controlled analysis across both embodied and purely perceptual reasoning domains.

Model Sampling and Protocols

The analysis covers multiple families (GPT, Llama, Qwen), each at different scales, ensuring both closed- and open-source paradigms are tested. Three world model access modes are studied: invisible (no access), normal (optional access), and force (compulsory simulation before every action). Main metrics are task success rate (agent tasks) and answer accuracy (VQA).
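The three access modes can be captured as a small gating policy. The enum values mirror the protocol names above; the helper functions and their logic are assumptions sketched for clarity, not the study's code.

```python
from enum import Enum

class WMMode(Enum):
    INVISIBLE = "invisible"  # world model not exposed as a tool
    NORMAL = "normal"        # optional: the agent may invoke it
    FORCE = "force"          # compulsory simulation before every action

def allowed_tools(mode: WMMode) -> list:
    """Tool list exposed to the agent under each access mode."""
    return [] if mode is WMMode.INVISIBLE else ["world_model"]

def must_simulate_first(mode: WMMode, simulated_this_step: bool) -> bool:
    """Under FORCE, an action is only legal after a simulation call."""
    return mode is WMMode.FORCE and not simulated_this_step
```

Comparing success rate (agent tasks) or accuracy (VQA) across these three modes, holding the backbone fixed, isolates the effect of world model access itself.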

Core Results

World Model Access Fails to Systematically Improve Performance

Across all model families, tasks, and settings, the anticipated advantages of off-policy world model rollouts fail to materialize. In agent tasks, world model access not only fails to offer consistent improvement but often degrades aggregate performance by several percentage points across most backbones, with only sporadic, model-specific exceptions. In VQA settings, the effect is statistically neutral: accuracy distributions are essentially indistinguishable with or without simulated foresight (Figure 2).

Figure 2: Percentage breakdown of the world model's impact. "WM Helps" is generally less frequent than "WM Hurts" for agent tasks, while VQA shows a more balanced distribution.

Agents Rarely and Idiosyncratically Utilize Simulation

Usage rates (the frequency with which the world model is invoked as a tool in optional mode) are consistently very low, especially for VQA, where even the most willing models use simulation on fewer than 10% of instances (except Llama-family models, which invoke it more frequently while gaining little benefit). Model family and scale show consistent effects: larger, more capable models consult simulations even less, reflecting high intrinsic confidence, while smaller models call the world model somewhat more but rarely see net benefit (Figure 3).

Figure 3: Comparison of world model usage rate and corresponding performance gain across model families and scales, for both agentic tasks (top row) and VQA tasks (bottom row). Results are aggregated by model family (GPT, Qwen) and size, highlighting systematic differences in invocation behavior and the limited, sometimes negative, returns of increased world model usage.

Invocation Frequency and Performance are Inversely Correlated

There is a robust, monotonic relationship between world model call count and task failure in both agent and VQA settings: higher simulation frequency predicts lower task success and accuracy. Empirically, repeated calls do not reflect strategic, cumulative reasoning but instead flag unresolved internal uncertainty and the lack of an integration policy (Figure 4).

Figure 4: Success rate as a function of world model call count. More calls often correlate with worse outcomes.
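The analysis behind Figure 4 amounts to binning episodes by call count and comparing per-bin success rates. A minimal sketch, with a hypothetical episode format assumed for illustration:

```python
from collections import defaultdict

def success_by_call_count(episodes):
    """episodes: iterable of (num_wm_calls, succeeded) pairs.
    Returns a dict mapping call count -> empirical success rate."""
    totals, wins = defaultdict(int), defaultdict(int)
    for calls, ok in episodes:
        totals[calls] += 1
        wins[calls] += int(ok)
    return {c: wins[c] / totals[c] for c in sorted(totals)}

# Toy data showing the reported trend: more calls, lower success.
rates = success_by_call_count(
    [(0, True), (0, True), (2, False), (2, True), (5, False)])
# rates == {0: 1.0, 2: 0.5, 5: 0.0}
```

Note this correlation alone does not establish that calling the world model causes failure; the paper's attribution analysis is what localizes the problem to governance rather than simulation quality.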

Forcing World Model Integration is Actively Harmful

When simulation is made mandatory before every action or answer, performance collapses further, in many cases by double-digit drops in success rate, especially on sequential-agent tasks. Compulsory use amplifies all measured failure modes (Figure 5).

Figure 5: Comparison of agent task performance between world model invisible and forced world model use conditions.

Attribution Analysis Reveals Governance, Not Simulation, as the Bottleneck

Detailed attribution shows that the main limiting factor is not the accuracy of simulated rollouts but the lack of governance over the cognitive pipeline: (I) agents do not know when to simulate, (II) cannot reliably interpret the simulated signal, and (III) fail to ground simulated evidence in the action plan. Successes are attributable to carefully calibrated, strategically queried simulations with unambiguous verification and stable integration. Failures are pervasive, spanning calibration errors (missed or unnecessary queries), ambiguous simulation interpretation, and instability in action integration (Figure 6).

Figure 6: A Taxonomy of World Model Governance Successes: Correctly functioning governance follows a three-stage cognitive pipeline enabled by Strategic Input (I), Clear Interpretation (II), and Grounded Action (III). Success arises from calibrated queries (I), unambiguous verification (II), and stable integration of simulations into actionable plans (III).


Figure 7: A Taxonomy of World Model Governance Failures: Pipeline breakdowns map to three disruptive pillars: Calibration Failures (I) causing unnecessary or missed simulation, Interpretation Ambiguity (II) corrupting signal-to-decision alignment, and Unstable Integration Policy (III) preventing foresight from becoming sustained progress. The dominant zones indicate the key bottleneck is governance stability, instead of foresight generation.
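The three-stage pipeline from the taxonomy can be made concrete as a governance wrapper around a single step. All names, thresholds, and helper signatures here are illustrative assumptions; the point is only the ordering of the three checks.

```python
def governed_step(uncertainty: float, propose, wm_rollout, interpret, plan,
                  query_threshold: float = 0.5):
    """One decision step under explicit foresight governance:
    (I) calibrate, (II) interpret, (III) integrate."""
    action = propose()
    # (I) Calibration: only query the simulator when uncertainty warrants it.
    if uncertainty < query_threshold:
        return action
    # (II) Interpretation: reduce the rollout to an unambiguous verdict.
    verdict = interpret(wm_rollout(action))  # e.g. "safe" / "unsafe"
    # (III) Integration: ground the verdict into the committed plan.
    return plan(action, verdict)
```

A failure at any one stage (querying at the wrong times, an ambiguous `verdict`, or a `plan` that ignores it) reproduces the corresponding pillar in the failure taxonomy.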

Theoretical and Practical Implications

The findings robustly undermine the assumption that adding high-fidelity world model simulators is, in itself, an effective means of bootstrapping agentic foresight or long-horizon reasoning. The central bottleneck in current VLM agents is the absence of mechanisms—either architectural, algorithmic, or protocol-driven—that enable calibrated, strategic, and reflective invocation of external simulation. Neither the modality nor the accuracy of rollouts explains the limited utility observed in practice; rather, the cognitive process connecting the agent’s uncertainty, tool selection, interpretation, and action commitment is misaligned or altogether missing.

Absent explicit governance, agents overtrust their own flawed internal priors, or alternatively, fall into over-planning and action loops. For VQA, simulation is often used as a confirmation rather than discovery tool, thus reinforcing the agent’s initial (potentially erroneous) beliefs instead of driving counterfactual exploration. In agentic control, failures are even starker, with mandates for simulation often resulting in severe performance collapses due to unresolved ambiguity and loss of focus.

Recommendations and Future Directions

The work identifies several actionable research directions:

  • Decoupled and Dedicated Modules: Integration of explicit decider, reflector, and memory modules could structure the interaction loop between internal reasoning and foresight.
  • Training With Foresight-Sensitive Objectives: Supervised fine-tuning or RL with direct incentives for strategic world model use, diversity and informativeness of queries, and information gain, rather than bare success rate, may be necessary.
  • Formulating Hypotheses and Counterfactual Simulation: Agents should be encouraged not only to confirm current beliefs but to generate, discriminate, and select among hypotheses using world model evidence.
  • Robust Prompting and Governance Protocols: Current prompt-based interfaces are inadequate for learning nuanced strategies for world model use.
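The first bullet's decoupled modules might be structured as below. This is a speculative sketch of one way to realize the proposal; the class names, interfaces, and thresholds are all assumptions, not designs from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ForesightMemory:
    """Stores past simulation queries and outcomes for reuse."""
    rollouts: list = field(default_factory=list)

    def remember(self, query, outcome):
        self.rollouts.append((query, outcome))

class Decider:
    """Decides *when* a simulation is worth its cost."""
    def should_simulate(self, uncertainty: float, budget: int) -> bool:
        return budget > 0 and uncertainty > 0.5  # illustrative threshold

class Reflector:
    """Checks *whether* a simulated outcome actually changed the plan,
    a proxy for the query's information gain."""
    def informative(self, prior_choice, post_choice) -> bool:
        return prior_choice != post_choice
```

Training signals like those in the second bullet could then reward episodes where `Reflector.informative` is true, rather than rewarding raw success alone.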

Conclusion

The hypothesis that adding a predictive world model as an optional (or forced) tool readily confers agentic foresight is empirically refuted: world model access is at best net-neutral and frequently counterproductive. The main limitation is not the generative capacity of world models but the inability of agentic policies to regulate when, why, and how simulation informs action. Robust anticipatory cognition in agents will require advances in foresight governance, hypothesis management, and explicit mechanisms for integrating hypothetical and real experience into deliberate decision making.

References

Cheng Qian et al., "Current Agents Fail to Leverage World Model as Tool for Foresight," arXiv:2601.03905, 2026.
