- The paper categorizes planning benchmarks into seven distinct domains to comprehensively evaluate LLMs' planning capabilities.
- It employs a methodology spanning embodied environments, web navigation, scheduling, games, task automation, text-based reasoning, and integrated tasks.
- Results reveal LLM limitations in long-horizon and uncertain planning, underscoring the need for more robust evaluation benchmarks.
PLANET: Benchmarking LLMs' Planning Prowess
This paper introduces PLANET, a survey of benchmarks designed to evaluate the planning capabilities of LLMs. The authors categorize these benchmarks across several domains, including embodied environments, web navigation, scheduling, games/puzzles, task automation, text-based reasoning, and general agentic benchmarks. The work emphasizes the significance of planning in agentic AI and highlights the need for comprehensive benchmarks to assess and compare the performance of different planning algorithms.
Categorization of Planning Benchmarks
The study categorizes planning benchmarks into seven distinct groups, each focusing on different aspects of planning:
- Embodied Environments: These benchmarks involve LLM agents interacting with simulated or physical environments, often in household settings. Examples include Blocksworld, VirtualHome, PlanBench, ALFRED, and ALFWorld.
- Web Navigation: These environments assess an agent's ability to navigate and interact with websites to achieve specific goals. Datasets include WebShop, WebArena, VisualWebArena, and Mind2Web.
- Scheduling: These benchmarks evaluate planning for tasks involving time and resource constraints, such as trip planning, meeting scheduling, and calendar management. TravelPlanner and Natural Plan are key examples.
- Games and Puzzles: Games such as Rock-Paper-Scissors and puzzles like Tower of Hanoi are used to evaluate strategic planning, risk management, and multi-agent behaviors. SmartPlay and AucArena are representative benchmarks.
- Task Automation: These benchmarks focus on task decomposition and workflow automation, assessing the ability to break down complex tasks into actionable steps. TaskLAMA, CoScript, and WorldAPIs are prominent examples.
- Text-Based Reasoning: This category includes benchmarks that require advanced reasoning from LLMs, such as math problem-solving and code generation. PrOntoQA and TEACh fall into this category.
- Planning as a Subtask: These benchmarks evaluate LLMs' planning abilities as part of more extensive agentic tasks, including multi-step tool use and web navigation. AgentBench, SWE-Bench, and TheAgentCompany are notable examples.
Embodied Environments: Aligning Text and Action
Embodied environments are used to assess planning systems employing discrete action spaces, often limited to home-based tasks. Blocksworld is a classical example, involving the manipulation of blocks to achieve specific configurations. VirtualHome uses video simulations of household tasks with corresponding symbolic representations.
Figure 1: Sourced from ALFWorld~\citep{shridhar2021alfworldaligningtextembodied}, this example illustrates interactive alignment between text and embodied worlds.
ALFRED requires agents to follow natural language instructions paired with visual inputs to perform household tasks, demanding multi-step reasoning and fine-grained motor control. ALFWorld builds upon ALFRED by aligning high-level task reasoning with grounded execution, using a text-based interface to model environments with PDDL semantics.
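Blocksworld-style tasks reduce to search over block configurations. As a toy illustration (not the PDDL formulation the benchmarks actually use), a breadth-first planner over stacks of blocks can compute the minimum number of moves to reach a goal configuration:

```python
from collections import deque

def normalize(stacks):
    """Canonical, hashable form: drop empty stacks, sort the rest."""
    return tuple(sorted(tuple(s) for s in stacks if s))

def moves(state):
    """Yield successor states: move the top block of one stack onto
    another stack, or onto the table as a new stack."""
    stacks = [list(s) for s in state]
    for i, src in enumerate(stacks):
        if not src:
            continue
        for j in range(len(stacks)):
            if i == j:
                continue
            nxt = [list(s) for s in stacks]
            nxt[j].append(nxt[i].pop())
            yield normalize(nxt)
        if len(src) > 1:  # moving a lone block to the table is a no-op
            nxt = [list(s) for s in stacks]
            nxt.append([nxt[i].pop()])
            yield normalize(nxt)

def plan_length(start, goal):
    """Breadth-first search: length of a shortest plan, or None."""
    start, goal = normalize(start), normalize(goal)
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, depth = frontier.popleft()
        if state == goal:
            return depth
        for nxt in moves(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None
```

For example, reversing a three-block stack (C on top of B on top of A, into A on top of B on top of C) takes three moves. Benchmarks like PlanBench probe whether an LLM can produce such plans directly, without an explicit search procedure.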
Web Navigation: Simulating Real-World Computer Use
Benchmarks in web navigation evaluate LLM agents' ability to plan and execute actions on websites to achieve user goals, mimicking real-world computer usage. WebShop simulates an online shopping website with a vast array of real-world products and crowd-sourced instructions. WebArena provides a web environment with diverse, long-horizon tasks, such as online shopping and software management, testing the agent's ability to break down high-level goals into sequences of actions.
Figure 2: Adapted from VisualWebArena~\citep{koh2024visualwebarenaevaluatingmultimodalagents}, this example shows an agent's action trajectory to block the author of a target image post in /f/memes.
VisualWebArena focuses on visually grounded tasks, requiring multimodal agents to combine visual understanding with textual inputs. Mind2Web uses actual websites to evaluate an agent's ability to handle diverse interfaces and workflows, testing generalization and adaptability. OSWorld introduces real-world computer tasks across operating systems, evaluating multimodal agents in open-ended computing environments.
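The common skeleton behind these web benchmarks is an observe–act loop: the agent receives an observation (page text, DOM, or screenshot), emits an action, and repeats until the episode ends. A minimal sketch, where `ToyShop` and `policy` are invented stand-ins rather than any benchmark's actual API:

```python
def run_episode(env, policy, max_steps=10):
    """Generic observe-act loop: the policy maps an observation
    (e.g. page text or a DOM snapshot) to the next action."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        trajectory.append(action)
        obs, done = env.step(action)
        if done:
            break
    return trajectory

class ToyShop:
    """Toy two-step shopping environment: search, then buy."""
    def reset(self):
        self.stage = 0
        return "home page"

    def step(self, action):
        self.stage += 1
        done = action == "click buy"
        obs = "results page" if self.stage == 1 else "receipt"
        return obs, done

def policy(obs):
    # A real benchmark agent would call an LLM here; this stub
    # hard-codes a two-action plan for illustration.
    return "type query" if obs == "home page" else "click buy"
```

Long-horizon tasks in WebArena stress exactly this loop: the agent must keep the high-level goal in mind across many such observation–action steps.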
Scheduling: Managing Time and Resources
Scheduling benchmarks assess the capability to manage time and resources effectively, ensuring goals are achieved within specified constraints. TravelPlanner evaluates LLMs on realistic trip planning, requiring temporal planning and external knowledge. Natural Plan evaluates LLMs' ability to handle planning tasks described in natural language, focusing on realistic scenarios collected from Google services. These benchmarks often involve constraint satisfaction and optimality, assessing whether LLMs can generate efficient and valid plans.
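The constraint-satisfaction core of such tasks can be illustrated with a toy meeting scheduler; the `free_slots` helper and its interval format are assumptions for illustration, not any benchmark's interface:

```python
def free_slots(busy, day=(9, 17), length=1):
    """Return start hours at which every participant is free for
    `length` consecutive hours within the working day.
    `busy` maps person -> list of (start, end) busy intervals."""
    slots = []
    for start in range(day[0], day[1] - length + 1):
        end = start + length
        # A candidate slot is valid if it overlaps no busy interval.
        if all(end <= b_start or start >= b_end
               for intervals in busy.values()
               for (b_start, b_end) in intervals):
            slots.append(start)
    return slots
```

A valid plan must satisfy every constraint simultaneously; benchmarks like TravelPlanner additionally score optimality (e.g. cost or travel time) among the feasible plans.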
Games and Puzzles: Strategic Reasoning
Collaborative and competitive games provide environments to evaluate LLMs' strategic planning, risk management, and multi-agent behaviors. SmartPlay includes games like Rock-Paper-Scissors and Minecraft, challenging LLMs on reasoning with object dependencies and long-term planning. AucArena uses a simulated auction environment to test strategic planning, requiring agents to manage budgets and anticipate opponents' actions.
Figure 3: Adapted from Dualformer~\citep{su2024dualformercontrollablefastslow}, this example illustrates the maze navigation task, where the task (prompt) and the plan are both represented as token sequences.
GAMA-Bench comprises game-theoretic scenarios where multiple LLM agents interact, evaluating decision-making and coordination. Plancraft, a Minecraft-based dataset, tests multi-step planning in a sandbox environment. PPNL is a benchmark for spatial path planning tasks described in natural language, testing spatial and temporal reasoning.
Task Automation: Efficient Workflow Execution
Task decomposition facilitates efficient and reliable execution in workflow automation. TaskLAMA comprises annotated complex tasks structured into directed acyclic graphs, representing temporal dependencies between steps. CoScript presents the task of constrained language planning, requiring planners to respect various constraints on planning goals. WorldAPIs uses a top-down strategy to derive APIs from wikiHow instructions, expanding the action space for tasks in the physical world.
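The DAG structure used by TaskLAMA can be sketched with Python's standard `graphlib`: each step depends on a set of predecessor steps, and any topological order is a valid execution order. The task and step names below are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical step dependencies for a "bake a cake" task:
# each key must come after all steps in its dependency set.
deps = {
    "mix batter": {"gather ingredients"},
    "preheat oven": set(),
    "bake": {"mix batter", "preheat oven"},
    "frost": {"bake"},
}

# Any topological order respects the temporal dependencies.
order = list(TopologicalSorter(deps).static_order())
```

A benchmark in this category can then check an LLM-generated step ordering against the annotated DAG: the plan is valid iff every step appears after all of its dependencies.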
Text-Based Reasoning: Mathematical and Logical Challenges
Math benchmarks and code generation present significant challenges for LLM planning. PrOntoQA assesses LLMs' reasoning capability using synthetic world models represented in first-order logic, revealing that LLMs often struggle with planning, particularly in selecting the correct proof step. TEACh simulates a user interacting with a robot to perform household tasks, enhancing models' capabilities in language grounding and task execution.
Several benchmarks target LLMs' planning abilities as a component of their overall performance. AgentBench evaluates reasoning and decision-making in multi-turn, open-ended contexts. SWE-Bench assesses the ability of LLMs to resolve real-world GitHub issues. TheAgentCompany simulates a real-world software company environment, requiring AI agents to replicate workflows across multiple platforms. AgentGym provides a framework for developing LLM-based agents that can evolve across diverse environments and tasks. AgentBoard introduces a fine-grained "progress rate" metric to capture incremental achievements in complex environments. SafeAgentBench evaluates safety-aware task planning, testing the ability of agents to recognize and reject hazardous tasks.
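AgentBoard's precise metric is defined in its paper; as a hedged sketch of the idea, a progress rate can be computed as the fraction of annotated subgoal predicates satisfied by the current state, rewarding partial progress that a binary success metric would miss:

```python
def progress_rate(subgoals, state):
    """Fraction of subgoal predicates satisfied by `state`.
    `subgoals` is a list of functions state -> bool."""
    if not subgoals:
        return 1.0
    met = sum(1 for goal in subgoals if goal(state))
    return met / len(subgoals)
```

An agent that picks up and heats an object but fails to deliver it would score 2/3 rather than 0, which makes comparisons between near-miss and completely failed trajectories meaningful.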
Discussions: Addressing Benchmark Limitations
The survey identifies several limitations in current benchmark design:
- Complexity of World Models: Many benchmarks use simplified environments, limiting the ability of LLMs to build or adapt internal world models.
- Long-Horizon Tasks: LLM agents often lack mechanisms for state tracking and error correction in tasks requiring long sequences of actions.
- Planning under Uncertainty: Real-world scenarios often involve uncertainty and partial information, which are not adequately addressed in many benchmarks.
- Multimodal Planning Support: Most benchmarks are text-only, despite the increasing interest in multimodal agents, highlighting a gap in the evaluation of visual grounding.
Conclusion
The survey emphasizes the need for comprehensive benchmarks to evaluate LLM agents on planning tasks, organizing a wide range of recent benchmarks into seven categories and pointing out key gaps in benchmark development. The authors aim to guide researchers in selecting appropriate benchmarks and to inspire new directions for benchmark design. They highlight the potential of LLM agents in complex, real-world planning tasks while stressing the importance of responsible development and deployment.