
Open-vocabulary Queryable Scene Representations for Real World Planning

Published 20 Sep 2022 in cs.RO, cs.AI, and cs.CV | arXiv:2209.09874v2

Abstract: Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation to address this problem. NLMap serves as a framework to gather and integrate contextual information into LLM planners, allowing them to see and query available objects in the scene before generating a context-conditioned plan. NLMap first establishes a natural language queryable scene representation with Visual Language Models (VLMs). An LLM-based object proposal module parses instructions and proposes involved objects to query the scene representation for object availability and location. An LLM planner then plans with such information about the scene. NLMap allows robots to operate without a fixed list of objects or executable options, enabling real robot operation unachievable by previous methods. Project website: https://nlmap-saycan.github.io

Citations (162)

Summary

  • The paper introduces NLMap, a framework that creates open-vocabulary, queryable scene representations using visual and language models.
  • It leverages semantic scene mapping and contextual object proposals to bridge unstructured natural language commands with structured scene data.
  • Empirical results show improved task success rates, especially for novel object tasks, marking a significant step in adaptive robotic planning.

Analyzing Open-Vocabulary Queryable Scene Representations for Real-World Planning in Robotics

This paper introduces a novel framework called Natural-Language Map (NLMap) designed to enhance the application of LLMs in robotic task planning within real-world scenarios. The central contribution is the creation of an open-vocabulary, queryable scene representation that integrates contextual information directly into the LLM, enabling robots to perceive and interact with complex, unstructured environments more effectively.

System Architecture

NLMap is built on vision-language models (VLMs), specifically contrastively trained models such as CLIP and ViLD, which are used to construct a semantic representation of a scene. The resulting representation can be queried with arbitrary natural language rather than a fixed label set. The system involves three core components:

  1. Semantic Scene Representation: During scene exploration, the robot uses VLMs to generate a language-queryable map. Bounding boxes are proposed class-agnostically, and VLM features are extracted, forming a feature point cloud. This representation encapsulates a wide spectrum of potential objects, far beyond any static list.
  2. Contextual Object Proposal: Through LLMs, objects involved in given tasks are proposed from natural language instructions. This bridging step between unstructured instructions and structured scene data is facilitated by the model’s ability to infer implicit task-related objects, handle fine-grained descriptions, and determine appropriate object granularity.
  3. Executable Option Generation and Planning: Based on object presence within the scene, executable options are formulated. The planning process is particularly noteworthy for its incorporation of scene-specific object availability, adjusting the plan dynamically to reflect feasible actions considering detected objects.
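The three components above can be sketched end to end in a few dozen lines. This is an illustrative sketch, not the paper's implementation: the helper names, the lookup-table stand-in for the LLM proposal module, and the random unit vectors standing in for CLIP/ViLD embeddings are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# --- 1. Semantic scene representation: a feature "point cloud" -----------
# One L2-normalized VLM embedding per class-agnostic bounding box, paired
# with a 3D location. Random unit vectors stand in for real CLIP/ViLD
# region features here.
scene = {
    "regions": np.stack([unit(rng.normal(size=512)) for _ in range(4)]),
    "locations": [(1.2, 0.4, 0.8), (2.0, 0.0, 0.7),
                  (0.5, 1.1, 0.0), (3.3, 2.2, 0.9)],
}

# Pretend text embeddings; in the real system these would come from the
# VLM's text encoder applied to the object name.
text_embedding = {"sponge": scene["regions"][0], "table": scene["regions"][1]}

def query_map(obj: str, threshold: float = 0.9):
    """Return (found, location) for the best-matching region, if any."""
    if obj not in text_embedding:
        return False, None
    scores = scene["regions"] @ text_embedding[obj]  # cosine similarity
    best = int(np.argmax(scores))
    if scores[best] < threshold:
        return False, None
    return True, scene["locations"][best]

# --- 2. Contextual object proposal (stand-in for an LLM call) ------------
def propose_objects(instruction: str) -> list[str]:
    lookup = {"wipe the table": ["sponge", "table"]}  # toy lookup, not an LLM
    return lookup.get(instruction, [])

# --- 3. Executable option generation --------------------------------------
def make_options(instruction: str) -> list[str]:
    options = []
    for obj in propose_objects(instruction):
        found, _loc = query_map(obj)
        if found:  # options are generated only for objects in the scene
            options += [f"go to the {obj}", f"pick up the {obj}"]
    return options

print(make_options("wipe the table"))
# -> ['go to the sponge', 'pick up the sponge', 'go to the table', 'pick up the table']
```

The key design point the sketch preserves is that the planner never sees a fixed object list: the option set is recomputed per instruction from whatever the scene query actually finds.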

Evaluation and Results

The paper reports empirical validation in both simulation and real-world scenarios, benchmarking NLMap against a privileged version of SayCan, a baseline LLM-based planner that lacks dynamic scene understanding. On standard task benchmarks adopted from SayCan, NLMap + SayCan achieves a 60% task success rate, reflecting some trade-off in direct task completion due to the additional perception layers. However, on tasks involving novel objects or elements not predefined in the SayCan schema, NLMap extends the system's capability, achieving an 80% success rate on novel object tasks. The system also identifies infeasible tasks more reliably, demonstrating context-aware operation.

Technical Implications and Future Directions

The introduction of NLMap addresses a critical gap in LLM-based robotic planners by mitigating the inflexibility of fixed object and action sets. Constructing semantic maps that can be queried on the fly with natural language marks a significant advance in enabling robots to operate in unstructured and diverse environments.

However, challenges remain, primarily in perception accuracy and in handling dynamic, changing scenes. While the current implementation is robust in static setups, extending the framework to account for scene changes over time is a promising direction for future research. Additionally, integrating real-time learning mechanisms could further improve the system's adaptability and performance in novel environments.

Overall, the research presents a compelling step toward more autonomous and adaptable robotic systems through the integration of advanced language and visual perception models. Potential applications span domains requiring mobile manipulation and context-sensitive task execution in open-world settings, paving the way for more generalizable robotic intelligence.
