An Overview of "Cooking Task Planning using LLM and Verified by Graph Network"
This paper presents a novel approach to task and motion planning (TAMP) in robotic cooking, tackling the inherent complexity of interpreting non-uniform cooking videos. The research introduces a framework that combines the capabilities of large language models (LLMs) with Functional Object-Oriented Networks (FOON) to autonomously generate and verify cooking task plans for robotic execution.
The approach addresses key challenges in adapting videos of human cooking demonstrations for robotic understanding and action. Traditional methodologies often struggle with the unstandardized nature of such videos, characterized by varying camera perspectives and omitted actions—issues compounded by an LLM's propensity to hallucinate, which often yields inaccurate or physically implausible task plans. By integrating LLMs with FOON, this research provides a mechanism to validate generated plans, ensuring the logical consistency and feasibility of task sequences.
Methodology and Results
The authors outline a multi-step pipeline that converts cooking videos into executable robotic tasks. It begins by splitting videos into frames and identifying segments pertinent to the cooking task using subtitle extraction. The LLM then uses these frames to infer target object node states—essentially, the desired outcome of each cooking step. The results are formatted as a graph structure via FOON, which verifies the logic of the sequence and provides feedback for any modifications needed.
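To make the verification step concrete, the following is a minimal sketch (not the authors' code) of FOON-style sequence checking: each step is modeled as a functional unit that consumes required object states and produces new ones, and a plan is valid only if every step's inputs are already satisfied when it executes. All class, function, and state names here are illustrative assumptions.

```python
# Illustrative sketch of FOON-style plan verification (hypothetical names,
# not the paper's implementation). Each step is a "functional unit" that
# requires certain object states and produces new ones.

from dataclasses import dataclass, field


@dataclass
class FunctionalUnit:
    motion: str                                 # e.g. "pour", "boil" (illustrative labels)
    inputs: set = field(default_factory=set)    # object states the motion requires
    outputs: set = field(default_factory=set)   # object states the motion produces


def verify_plan(initial_states: set, plan: list) -> tuple:
    """Check each step in order; report the first step whose inputs are unmet."""
    states = set(initial_states)
    for i, unit in enumerate(plan):
        missing = unit.inputs - states
        if missing:
            # This message is the kind of feedback that could be returned
            # to the LLM so it can regenerate the offending step.
            return False, f"step {i} ({unit.motion}): missing {sorted(missing)}"
        states |= unit.outputs
    return True, "plan verified"


# Hypothetical two-step plan: pour water into a pot, then boil it.
plan = [
    FunctionalUnit("pour", {"water:in_bottle", "pot:empty"}, {"pot:has_water"}),
    FunctionalUnit("boil", {"pot:has_water", "stove:on"}, {"water:boiled"}),
]
ok, msg = verify_plan({"water:in_bottle", "pot:empty", "stove:on"}, plan)
print(ok, msg)  # → True plan verified
```

Dropping "stove:on" from the initial states makes the second step fail, which is exactly the kind of logical inconsistency the graph check is meant to surface.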
The paper provides empirical evidence of the system's effectiveness by testing on five different recipes. The integrated framework successfully completed task-plan sequences for four recipes compared to just one for a baseline LLM-only method. Notably, the research highlights the method's efficacy in logic verification and error correction, illustrating its robustness against LLM-induced errors.
Contributions and Implications
This research significantly contributes to TAMP by illustrating a viable pathway to leveraging multimodal LLMs in conjunction with graph-based validation systems like FOON. The methodology demonstrates strong promise in practical implementation, with potential applications extending beyond robotic cooking into other domains requiring precise, long-horizon task planning.
The key innovation lies in creating a dynamic, iterative process that successfully incorporates error-checking feedback loops into planning—a critical advancement for real-world robotic applications. Furthermore, the extension of FOON in handling both target and environment object nodes enables this method to adapt task sequences across different operational environments, contributing to increased applicability and scalability.
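The generate-verify-repair loop described above can be sketched as follows. This is an illustrative simulation, not the paper's system: the LLM is stood in for by a canned list of candidate steps (the first deliberately flawed), whereas a real implementation would feed the verifier's error message back into the model's next prompt. All names and state labels are assumptions.

```python
# Sketch of an error-checking feedback loop for plan generation.
# A simulated "LLM" proposes candidate steps; the verifier rejects steps
# whose required object states are unmet, and the rejection message acts
# as the feedback for the next proposal.


def verify(step: dict, states: set):
    """Return None if the step's required states hold, else an error message."""
    missing = step["inputs"] - states
    return None if not missing else f"missing states: {sorted(missing)}"


def plan_with_feedback(candidates: list, states: set, max_retries: int = 3):
    """Accept the first candidate step that passes verification."""
    feedback = None
    for attempt, step in enumerate(candidates[:max_retries]):
        error = verify(step, states)
        if error is None:
            return step, attempt, feedback
        feedback = error  # in a real system, appended to the next LLM prompt
    raise RuntimeError("no valid step found within retry budget")


# Hypothetical candidates: the first step wrongly assumes the knife is clean.
candidates = [
    {"motion": "slice", "inputs": {"onion:whole", "knife:clean"}},
    {"motion": "wash_knife", "inputs": {"knife:dirty"}},
]
states = {"onion:whole", "knife:dirty"}
step, attempt, fb = plan_with_feedback(candidates, states)
print(step["motion"], attempt, fb)
# → wash_knife 1 missing states: ['knife:clean']
```

The point of the loop is that rejection is informative: rather than discarding a bad step silently, the verifier's message tells the generator precisely which precondition failed, steering the next proposal.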
Future Directions
While this method shows significant improvement over LLM-only approaches, there are areas for further exploration. A key limitation is the system's sequential validation process, which precludes backward error correction once a task has been added to the graph. Future work could explore mechanisms for holistic plan adjustments to enhance adaptability.
Additionally, addressing real-world operational failures, such as inaccuracies in initial environment estimations or grasping errors, could enhance the system’s robustness. Development of a real-time feedback system for re-planning could help address these issues, ensuring reliable execution of task plans despite environmental uncertainties.
In summary, this research presents a substantial advancement in robotic task planning methodologies by combining LLMs with structured graph-based validation tools. The framework holds potential for diverse applications that require complex, adaptable task plans, serving as a significant step towards more autonomous, intelligent robotic systems.