Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Published 3 Jul 2024 in cs.CL, cs.AI, and cs.LG (arXiv:2407.03321v2)

Abstract: Recent works have explored using LLMs for planning problems. One approach examines translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). Existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate LLMs' ability to generate PDDL code from natural language descriptions of planning tasks. Planetarium features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. Finally, we evaluate several API-access and open-weight LLMs that reveal this task's complexity. For example, 96.1% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 94.4% are solvable, but only 24.8% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.

Citations (4)

Summary

  • The paper introduces Planetarium, a novel benchmark that evaluates language models' ability to convert natural language into semantically correct PDDL code.
  • The methodology uses graph isomorphism checks to verify semantic accuracy, addressing limitations of conventional planning validators.
  • Numerical results show that 96.1% of GPT-4o's generated PDDL problems are syntactically parseable and 94.4% are solvable, yet only 24.8% are semantically correct, highlighting the need for stricter evaluation.

A Rigorous Benchmark for Translating Text to Structured Planning Languages

The paper introduces a novel benchmark, termed Planetarium, designed to evaluate the capacity of LLMs to generate Planning Domain Definition Language (PDDL) code from natural language descriptions of planning tasks. The benchmark addresses critical challenges in accurately measuring the quality of generated PDDL code, thus targeting a significant gap in current research methodologies.

Recent advances have shown promise for using LLMs to translate natural language descriptions into structured planning languages such as PDDL. However, accurately evaluating the quality of such translations has remained challenging due to two primary issues: the reliance on planning validators, which may accept PDDL code that parses and even solves but does not encode the intended task, and the lack of benchmarks with varied natural language descriptions and adequately challenging task sets. In response to these challenges, Planetarium proposes a more rigorous evaluation methodology and provides an extensive dataset of diverse and difficult text-to-PDDL translation tasks.

Key Components of Planetarium

Evaluation Framework

Planetarium's evaluation framework includes a PDDL equivalence algorithm that ensures the correctness of generated PDDL by comparing it against a ground truth. This approach overcomes the limitations of conventional planning validators by precisely defining equivalence and implementing an efficient, automatic way of checking it. The evaluation framework operates by transforming PDDL code into scene graphs and performing isomorphism checks between these graphs. By doing so, it ensures that two PDDL problem formulations are deemed equivalent only if they represent the same underlying planning task, thereby providing a robust measure of semantic correctness.
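The scene-graph-plus-isomorphism idea can be sketched as follows. Note this is an illustrative encoding, not the paper's exact implementation: the graph construction, node labels, and `equivalent` helper here are assumptions chosen to show the principle that two problems match only if their labeled graphs are isomorphic.

```python
import networkx as nx

def scene_graph(objects, propositions):
    """Build a labeled graph: one node per object, one node per
    proposition, with position-labeled edges from each proposition
    to its arguments. (Illustrative; the paper's construction may differ.)"""
    g = nx.Graph()
    for obj in objects:
        g.add_node(("obj", obj), label="object")
    for i, (pred, *args) in enumerate(propositions):
        p = ("prop", i)
        g.add_node(p, label=pred)
        for pos, arg in enumerate(args):
            g.add_edge(p, ("obj", arg), pos=pos)
    return g

def equivalent(problem_a, problem_b):
    """Two problem states count as equivalent iff their scene graphs are
    isomorphic, respecting predicate labels and argument positions."""
    ga = scene_graph(*problem_a)
    gb = scene_graph(*problem_b)
    return nx.is_isomorphic(
        ga, gb,
        node_match=lambda a, b: a.get("label") == b.get("label"),
        edge_match=lambda a, b: a["pos"] == b["pos"],
    )

# The same Blocks World state with objects renamed is still equivalent,
# which is exactly what a string- or plan-based comparison would miss.
a = (["b1", "b2"], [("on", "b1", "b2"), ("clear", "b1")])
b = (["x", "y"], [("on", "x", "y"), ("clear", "x")])
print(equivalent(a, b))  # True
```

Because the check is structural rather than textual, a generated problem that names its objects differently from the ground truth, or lists propositions in another order, still passes, while a problem that swaps the arguments of a predicate does not.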

Dataset Composition

The dataset comprises 145,918 text-to-PDDL pairs across 73 unique state combinations, including tasks in the Blocks World and Gripper domains. Each pair includes a natural language description and its corresponding ground truth PDDL. The dataset varies along two main dimensions: abstractness (explicit vs. abstract descriptions) and problem size (number of propositions). These variations allow for a comprehensive evaluation of a model's ability to handle a wide range of scenarios, from straightforward to highly complex tasks.
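To make the task format concrete, a pair of this kind might look like the following. This is a hypothetical example in the dataset's style, not an entry copied from it. The natural language side could read: "You have two blocks, b1 and b2. b1 is on the table and b2 is on top of b1. Your goal is to have b2 on the table and b1 on top of b2." The ground-truth PDDL side would then be a problem file such as:

```pddl
;; Hypothetical Blocks World problem, illustrating the ground-truth
;; PDDL format (not an actual dataset entry).
(define (problem swap-two-blocks)
  (:domain blocksworld)
  (:objects b1 b2)
  (:init (arm-empty)
         (on-table b1)
         (on b2 b1)
         (clear b2))
  (:goal (and (on-table b2)
              (on b1 b2)
              (clear b1))))
```

An "abstract" variant of the same pair would describe the scene less literally (e.g. "stack the blocks in reverse order"), forcing the model to infer the propositions rather than transcribe them.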

Numerical Results

The evaluation of several LLMs, including GPT-4o and open-weight models such as Mistral v0.3 7B Instruct and Gemma 1.1 IT 2B and 7B, revealed substantial differences in performance. While GPT-4o generated syntactically parseable PDDL in 96.1% of cases and solvable PDDL in 94.4%, only 24.8% of its outputs were semantically correct. This stark contrast shows that syntactic parseability and solvability are poor proxies for whether a generated PDDL problem actually encodes the intended task.

Practical and Theoretical Implications

Practically, the implications of this research are substantial for deploying LLMs in environments requiring accurate translation of natural language to structured planning languages. Currently deployed systems could generate misleading plans if they do not adopt a stringent evaluation method like that proposed in Planetarium. Theoretically, this benchmark sets a higher standard for evaluating the semantic correctness of generated PDDL, encouraging future research to focus on improving the understanding and generation capabilities of LLMs with respect to structured planning languages.

Future Developments

The paper highlights several future directions, which include extending Planetarium to support more planning domains beyond Blocks World and Gripper, and to incorporate more expressive subsets of PDDL, such as those accounting for non-deterministic, temporal, and numeric domains. Enhancing the benchmark to cover these areas would facilitate evaluating LLMs on more complex, real-world planning tasks, pushing the boundaries of LLM capabilities further.

Overall, Planetarium provides a comprehensive and rigorous benchmark for evaluating the translation of natural language descriptions to PDDL, offering significant advancements over existing methodologies. The research underscores the necessity for precision in evaluating the correctness of generated PDDL and sets a new benchmark for future developments in this domain.
