
Can Large Language Models Reason and Plan?

Published 7 Mar 2024 in cs.AI, cs.CL, and cs.LG (arXiv:2403.04121v2)

Abstract: While humans sometimes do show the capability of correcting their own erroneous guesses with self-critiquing, there seems to be no basis for that assumption in the case of LLMs.


Summary

  • The paper argues that LLMs lack an inherent capability for autonomous planning, as evidenced by GPT4's 30% accuracy in the Blocks World domain.
  • The study compares iterations from GPT3 to GPT4 and uses obfuscation techniques to distinguish genuine planning from mere memory-based retrieval.
  • The research advocates an iterative prompting framework that leverages external verifiers to mitigate LLMs' reasoning limitations while harnessing their generative strengths.

Evaluating the Planning and Reasoning Capabilities of LLMs

Introduction to the Study

LLMs have demonstrated remarkable linguistic behaviors, raising questions about what they can do beyond text completion, particularly on tasks traditionally associated with human reasoning and planning. This article scrutinizes whether LLMs can authentically plan and reason, or whether their apparent successes in these domains stem from other underlying mechanisms.

Core Findings and Methodology

The study began by analyzing GPT3's performance on a variety of planning instances derived from the International Planning Competition (IPC), including the Blocks World domain. The outcomes contradicted popular narratives about LLMs' planning abilities, revealing considerable limitations. The assessment was then extended to the more advanced GPT3.5 and GPT4, which improved across iterations but still lacked substantive planning capability.
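Evaluating a model on such instances reduces to checking whether its proposed plan is executable from the initial state and achieves the goal. The following is a minimal illustrative sketch of such a check for Blocks World; it is not the paper's evaluation harness, and all names and the state encoding are assumptions:

```python
# Minimal Blocks World plan checker (illustrative sketch, not the paper's code).
# State maps each block to what it rests on: "table" or another block.

def clear(state, block):
    """A block is clear if nothing rests on top of it."""
    return all(below != block for below in state.values())

def apply_action(state, action):
    """Apply a (move, block, dest) action; return the new state, or None if illegal."""
    move, block, dest = action
    if move != "move" or not clear(state, block):
        return None
    if dest != "table" and not clear(state, dest):
        return None
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def validate_plan(initial, goal, plan):
    """Replay the plan step by step; succeed only if every step is legal and the goal holds."""
    state = dict(initial)
    for action in plan:
        state = apply_action(state, action)
        if state is None:
            return False
    return all(state[b] == on for b, on in goal.items())

initial = {"A": "table", "B": "A", "C": "table"}   # B sits on A
goal = {"A": "B"}                                  # want A on B
plan = [("move", "B", "table"), ("move", "A", "B")]
print(validate_plan(initial, goal, plan))  # True
```

A sound checker like this accepts a plan only if it simulates correctly, which is what makes the reported accuracy numbers well-defined.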

  • GPT4 achieved 30% empirical accuracy in the Blocks World domain, higher than its predecessors, and performed significantly worse in the other tested domains.
  • When action and object names in the planning problems were obfuscated to separate genuine planning from approximate retrieval, GPT4's performance dropped significantly.

These observations provided strong evidence against LLMs' inherent ability to autonomously generate executable plans.
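The obfuscation step itself is mechanical: domain terms are systematically replaced with opaque aliases, so surface vocabulary can no longer cue retrieval while the problem structure is untouched. A minimal sketch of the idea follows; the alias mapping is illustrative, not necessarily the paper's exact one:

```python
# Illustrative obfuscation of a planning prompt: rename actions and objects so
# that familiar words cannot trigger memory-based retrieval. The mapping below
# is made up for illustration.

OBFUSCATION_MAP = {
    "pickup": "attack",
    "putdown": "succumb",
    "stack": "overcome",
    "unstack": "feast",
    "block": "object",
}

def obfuscate(prompt, mapping=OBFUSCATION_MAP):
    """Replace each domain term with its opaque alias (longest terms first,
    so e.g. 'unstack' is handled before 'stack')."""
    for term in sorted(mapping, key=len, reverse=True):
        prompt = prompt.replace(term, mapping[term])
    return prompt

original = "unstack block B from block A, then putdown block B"
print(obfuscate(original))
# -> "feast object B from object A, then succumb object B"
```

Since the transformation is a bijective renaming, a solver that truly plans should be unaffected by it; a sharp performance drop therefore points to retrieval rather than planning.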

Approaches and Techniques for Enhancing LLM Planning Capabilities

The paper explores two main strategies to potentially augment LLMs' planning and reasoning performances: fine-tuning and iterative prompting.

  • Fine-tuning: Despite initial hopes, fine-tuning did not yield a noticeable improvement in LLMs' planning capabilities. The technique essentially converts planning into a form of memory-based approximate retrieval rather than instilling genuine planning competence.
  • Iterative prompting: This involves back-prompting LLMs with hints or suggestions to improve their initial plan guesses. The paper emphasizes relying on external model-based plan verifiers, or expert humans in the loop, to authenticate the correctness of LLM-produced solutions, a setup the author terms the "LLM-Modulo" framework.
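The back-prompting loop described above can be sketched as follows. This is a hypothetical illustration, not the paper's code: `mock_llm` stands in for a real model API, and the verifier is a toy Blocks World checker.

```python
# Hypothetical LLM-Modulo loop: the LLM proposes a plan, an external sound
# verifier checks it, and any failure is fed back as a critique for the next
# round. All names are illustrative.

def is_clear(state, block):
    return block not in state.values()

def verify(state, goal, plan):
    """External verifier: replay (block, dest) moves; return (ok, critique)."""
    state = dict(state)
    for i, (block, dest) in enumerate(plan, 1):
        if not is_clear(state, block) or (dest != "table" and not is_clear(state, dest)):
            return False, f"step {i} ({block} -> {dest}) is not executable"
        state[block] = dest
    if all(state[b] == d for b, d in goal.items()):
        return True, "plan reaches the goal"
    return False, "plan executes but does not reach the goal"

def mock_llm(critique):
    """Stand-in for an LLM call: first guess is wrong, corrected after back-prompting."""
    if critique is None:
        return [("A", "B")]                      # illegal: B is sitting on A
    return [("B", "table"), ("A", "B")]          # revised guess

def llm_modulo(state, goal, max_rounds=3):
    """Generate-test loop: keep back-prompting until the verifier accepts."""
    critique = None
    for _ in range(max_rounds):
        plan = mock_llm(critique)
        ok, critique = verify(state, goal, plan)
        if ok:
            return plan
    return None

state = {"A": "table", "B": "A", "C": "table"}
print(llm_modulo(state, {"A": "B"}))  # [('B', 'table'), ('A', 'B')]
```

The key design point is that correctness is guaranteed by the verifier, not the LLM: the model only supplies candidate plans, so its unreliable self-critique never enters the soundness argument.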

Discussion on Autonomy in LLMs

The study critically addresses the distinction between LLMs' generation of correct answers through pattern recognition and a true ability to engage in principled reasoning. It identifies major challenges in discerning memorization from genuine problem-solving, in both LLMs and humans, particularly when either is trained on extensive corpora or "question banks."

Highlighting the limitations of self-verification strategies, the paper argues that claims of LLM "self-improvement" rest on flawed premises: without reliable external verification mechanisms, LLMs produce both false positives and false negatives when critiquing their own outputs.

Implications and Future Directions

The research provides a nuanced understanding of LLMs' capabilities and limitations, suggesting that while they fall short of performing autonomous planning and reasoning, their strengths in idea generation and approximate retrieval can be effectively utilized. It proposes leveraging LLMs in conjunction with external verifiers or human expertise within the "LLM-Modulo" framework, advocating for a balanced approach that harnesses the generative strengths of LLMs while mitigating their reasoning shortfalls.

This perspective challenges current assertions about LLMs' capabilities in planning and reasoning tasks while setting a constructive path forward: pairing LLMs' generative capacities with human expertise or robust verification systems to genuinely advance the field.

Conclusion

The paper concludes that, despite improvements across iterations from GPT3 to GPT4, there remains no compelling evidence to suggest that LLMs possess an inherent capability for autonomous reasoning or planning. Their primary function as universal approximate retrieval systems, however, opens up exciting avenues for supplementing human cognitive tasks, provided their limitations are thoroughly understood and accounted for. It calls for a tempered approach in evaluating LLMs' advances, advocating for strategies that pragmatically leverage their strengths while transparently addressing their deficiencies.
