On the Limits of Innate Planning in Large Language Models

Published 26 Nov 2025 in cs.AI | (2511.21591v1)

Abstract: LLMs achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper demonstrates that intrinsic planning using text-only LLMs in the 8-puzzle yields low success rates and highlights model-specific bottlenecks.
It details how structured prompting (CoT and AoT) and tiered feedback can improve performance while exposing limitations in state tracking and heuristic search.
The study emphasizes that pure LLM agents lack robust goal-directed planning, underscoring the need for integrated state-management and search modules.

Limits of Innate Planning in LLMs: An Evaluation on the 8-Puzzle

Introduction

This paper systematically explores the intrinsic planning and stateful reasoning capacities of LLMs in environments demanding sequential goal-directed behavior, specifically within the constraints of the classic 8-puzzle domain. Unlike prior approaches that exploit external code execution or tool-based augmentation, this study evaluates pure model performance, isolating the language system's ability for autonomous plan formulation, state tracking, and heuristic search. Four models—GPT-5-Thinking, Gemini-2.5-Pro, GPT-5-mini, and Llama 3.1 8B-Instruct—are rigorously tested under standardized prompting protocols and structured feedback regimes. The investigation reveals critical bottlenecks in near-term LLM agentic utility.

Figure 1: The 8-puzzle evaluation pipeline directly assesses LLM state tracking and planning without external code.

Methods

Task Design

The 8-puzzle is selected for its well-defined rules, solvable state space (181,440 unique configurations), and tight coupling of state representation and planning. Models are provided with a random puzzle instance and must generate a legal sequence of moves leading to the canonical solved configuration, using only textual reasoning.

Prompting Strategies

Three major prompting regimes are benchmarked:

Zero-Shot: Only rules and goal state are provided. No examples or stepwise cues.
Chain-of-Thought (CoT): Three worked examples encourage stepwise reasoning.
Algorithm-of-Thought (AoT): Three examples demonstrate search-based, Manhattan-guided trajectories.

Tiered Feedback

Failed attempts are iteratively retried using three feedback settings:

Repeat: Model retries from its last valid state with no extra feedback.
Specific: Model is told exactly why it failed (invalid move, early stop, loop), plus situational context.
Suggestive: In addition to specific feedback, the model receives a scalar hint (optimal solution length).
Figure 2: Structured feedback pipeline provides specific and suggestive error cues after failure, tracking model recovery.

External Move Validator

In a final, more assisted regime, models are given a list of valid moves from the current state and are instructed to select the best one, absent the need for legality checks, but preserving the planning challenge.

Results

Baseline Performance

Aggregate analysis shows low success rates across models and conditions. Notably, GPT-5-Thinking achieves a 30% success rate in AoT, modestly outperforming others, while Gemini-2.5-Pro and Llama 3.1 8B-Instruct are almost entirely unsuccessful. Failure modes are predominantly invalid moves for all except Llama in CoT (which displays severe parse errors).

Figure 3: Success rates of each model and prompting regime stratified by puzzle difficulty; moderate-length puzzles do not correlate monotonically with performance.

Failure Mode Analysis

Looping and invalid moves are the dominant causes for episode failure, especially evident in the breakdown of termination outcomes. GPT-5-mini's performance deteriorates with more complex prompting, shifting increasingly to graceful failure (clarification requests).

Figure 4: Stacked breakdown of trial outcomes reveals the varied impact of prompting on both success and distinct failure classes.

Feedback-Induced Gains

Introducing explicit feedback and progress-saving mechanisms yields enhanced rates. The strongest improvement is observed for GPT-5-Thinking under AoT with suggestive feedback, achieving a 68% success rate, but requiring significant computational and token budget (mean 24 minutes, 75,000 tokens, 49 moves per successful puzzle). For GPT-5-mini and Gemini-2.5-Pro, improvements are less pronounced (peaking at 18%), and further feedback sometimes inversely correlates with success.

Figure 5: Success rates for prompting strategies as feedback granularity and repetition increase, showing non-monotonic effects.

External Move Validator: Planning Isolates

When state-tracking (invalid moves) is offloaded, none of the models can solve a puzzle, with looping becoming the universal failure mode (GPT-5-Thinking loops in 100% of trials). GPT-5-mini demonstrates persistent early termination after a large number of legal—but ultimately unproductive—moves, reflecting weak long-term planning heuristics.

Figure 6: When restricted to valid actions, looping predominates, evidencing a failure of goal-directed planning heuristics.

Planning Efficiency

When valid moves alone are furnished, models make numerous moves with minimal state progress under Manhattan distance, confirming the lack of effective search strategies.

Figure 7: Comparative assessment between move count and solution progress quantifies inefficiency in planning when legality is guaranteed.

Discussion

This work presents a granular dissection of the innate planning and state-tracking ability in modern LLMs. Despite advances in prompt-based guidance (CoT, AoT), improvement is highly model-specific and can introduce new failures (parse errors, graceful refusals). Feedback and iterative retries boost success—especially for GPT-5-Thinking—but introduce prohibitive computational cost. The planning deficit becomes stark when state updates are outsourced: models overwhelmingly fail due to looping, unable to exploit even simple heuristics for goal attainment, underscoring brittle state representations and ineffective forward search.

The empirical results suggest two fundamental limitations:

State Representation: LLMs frequently hallucinate or lose track of the true environment state, producing invalid or incoherent action proposals undetected by self-checks. This risk directly undermines reliability in agentic applications.
Heuristic Planning: Absent tool augmentation, even SOTA models lack robust, goal-directed search strategies and cannot avoid cycles or random walks.

The observed disconnect between academic benchmark performance and planning underscores a gap between current LLM architectures and the demands of real-world autonomy. The unreliability and high confidence in flawed solutions indicate pressing safety concerns for tasks requiring sequential correctness (agentic robotics, financial trading, logistics).

Practical and Theoretical Implications

Agentic AI Deployment: Pure LLM agents are currently unsuitable for tasks requiring robust stateful planning without external tool support. Confidence calibration and explicit state-maintenance mechanisms remain unsolved challenges.
Prompt Engineering Limits: Advanced prompting is not universally beneficial and can induce confusion in certain architectures. Optimal prompting remains deeply dependent on model specifics and underlying representation capability.
Search-Augmenting Modules: Future progress may require architectural or functional integration of explicit search and state-encoding modules, potentially merging symbolic and neural reasoning for agentic reliability.
Self-Verification: Reliance on LLMs for high-confidence chain generation without error detection leads to undetectable mistakes propagating downstream.

Prospects for Future AI Systems

These findings indicate a need for research into explicit state-encoding, online self-verification (cf. (Stechly et al., 2024)), and planning modules that can endow LLMs with authentic heuristic search capabilities. Integration of symbolic reasoning or hybrid neuro-symbolic architectures may be required for reliable autonomous agents.

Conclusion

This study reveals that contemporary LLMs—both large and small—exhibit substantive deficiencies in innate planning and state tracking when deprived of external tools. Improvements via prompt and feedback engineering are costly and fail to achieve reliability, while outsourcing action validity exposes fundamental planning limitations. High-confidence error propagation remains a serious risk for agentic deployments. Closing these gaps will require architectural innovation targeting state maintenance and search.

Limitations

Results are constrained to the 8-puzzle domain and a set of 50 puzzles, with randomization minimizing but not guaranteeing against training contamination. Broader generalization to other sequential domains warrants further inquiry.

Markdown Report Issue