
Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting

Published 7 May 2024 in cs.RO and cs.CV | arXiv:2405.04378v4

Abstract: We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of: (i) ASK-Splat, a GSplat representation that distills semantic and grasp affordance features into the 3D scene. ASK-Splat enables geometric, semantic, and affordance understanding of 3D scenes, which is critical in many robotics tasks; (ii) SEE-Splat, a real-time scene-editing module using 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real-world. SEE-Splat creates a "digital twin" of the evolving environment throughout the manipulation task; and (iii) Grasp-Splat, a grasp generation module that uses ASK-Splat and SEE-Splat to propose affordance-aligned candidate grasps for open-world objects. ASK-Splat is trained in real-time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real-time during operation. We demonstrate the superior performance of Splat-MOVER in hardware experiments on a Kinova robot compared to two recent baselines in four single-stage, open-vocabulary manipulation tasks and in four multi-stage manipulation tasks, using the edited scene to reflect changes due to prior manipulation stages, which is not possible with existing baselines. Video demonstrations and the code for the project are available at https://splatmover.github.io.


Summary

  • The paper presents an innovative modular framework that integrates semantic and affordance cues through editable Gaussian splatting for enhanced robotic manipulation.
  • It details a three-stage process utilizing ASK-Splat for semantic embedding, SEE-Splat for real-time scene editing, and Grasp-Splat for optimized grasp generation.
  • Experimental evaluations show improved success rates in complex manipulation tasks compared to contemporary methods, showcasing its potential in autonomous robotics.

Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting

The paper "Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting" introduces a modular robotics stack that incorporates editable Gaussian Splatting (GSplat) techniques. This framework, designed for multi-stage robotic manipulation, leverages the integration of semantic and affordance knowledge into a three-dimensional scene representation to facilitate complex manipulation tasks. This essay will explore the paper's proposed methods and how they enhance robotic manipulation capabilities.

Overview of Splat-MOVER

Splat-MOVER consists of three main components: ASK-Splat, SEE-Splat, and Grasp-Splat. Each module performs a distinct function; a sketch of how the modules could compose appears after Figure 1 below:

  1. ASK-Splat: This is a 3D scene representation that distills semantic and affordance knowledge into Gaussian splats. It enables the system to understand the geometric, semantic, and grasp-affordance aspects of objects in a scene, a critical feature for executing manipulation tasks efficiently. ASK-Splat is trained to embed vision-language semantics (from models like CLIP) and visual affordance indicators (from VRB). This embedding allows the system to respond to open vocabulary commands.
  2. SEE-Splat: A real-time scene-editing module that uses 3D semantic masking and infilling to visualize interactions within a scene. It generates a digital twin of the environment by continuously updating a virtual model of the scene to mirror real-world dynamics.
  3. Grasp-Splat: This module proposes and ranks candidate grasps using the semantic- and affordance-laden representation generated by ASK-Splat. Grasp-Splat evaluates candidate grasps by how well they align with the distilled affordance scores and semantic information, improving grasp success rates.

    Figure 1: ASK-Splat is trained with: (1) RGB images of the scene and is initialized with a sparse point cloud from structure-from-motion (in the green box); (2) CLIP latent semantic features computed from an encoder-decoder model (in the blue box); and (3) affordance codes from a vision-affordance foundation model (in the purple box), e.g., VRB.
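Taken together, the three modules form a perceive-grasp-edit loop for each manipulation stage. The sketch below shows one plausible way they could be composed; all class and method names (SplatMoverPipeline, query, query_pose, propose, move, pick_and_place) are illustrative assumptions, not the project's actual API.

```python
# Hypothetical composition of the three Splat-MOVER modules for one
# manipulation stage; names and interfaces are assumptions for illustration.
from dataclasses import dataclass
import numpy as np


@dataclass
class Grasp:
    pose: np.ndarray   # 4x4 gripper pose in the world frame
    score: float       # affordance-aligned ranking score


class SplatMoverPipeline:
    def __init__(self, ask_splat, see_splat, grasp_splat, robot):
        self.ask = ask_splat      # semantic + affordance GSplat scene (ASK-Splat)
        self.see = see_splat      # real-time scene editor / digital twin (SEE-Splat)
        self.grasp = grasp_splat  # affordance-aligned grasp proposer (Grasp-Splat)
        self.robot = robot

    def run_stage(self, pick_prompt: str, place_prompt: str) -> None:
        # 1. Localize the target object with an open-vocabulary semantic query.
        obj_mask = self.ask.query(pick_prompt)
        # 2. Propose grasps on the masked Gaussians and take the top-ranked one.
        best: Grasp = self.grasp.propose(obj_mask)[0]
        # 3. Execute the pick-and-place on the real robot.
        place_pose = self.ask.query_pose(place_prompt)
        self.robot.pick_and_place(best.pose, place_pose)
        # 4. Mirror the motion in the digital twin so that later stages plan
        #    against the updated scene, which is what enables multi-stage tasks.
        self.see.move(obj_mask, place_pose)
```

In a multi-stage task, run_stage would be invoked once per stage, with SEE-Splat's edited scene serving as the starting point for the next stage.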

Implementation and Performance Analysis

ASK-Splat: Semantic Scene Representation

ASK-Splat's training involves using RGB images and a sparse point cloud for initialization, followed by embedding semantic features from CLIP. This approach distills the high-dimensional semantic embeddings into a lower-dimensional latent space, allowing efficient scene manipulation and query performance. The use of an autoencoder to map these features conserves computational resources while maintaining detailed semantic understanding.
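As a rough illustration of this compression step, the snippet below sketches a small autoencoder that maps high-dimensional CLIP embeddings to a compact latent code suitable for storing per Gaussian, and back. The dimensions, layer sizes, and training loop are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of CLIP-feature compression with a small autoencoder; sizes and the
# training loop are illustrative assumptions.
import torch
import torch.nn as nn


class ClipFeatureAutoencoder(nn.Module):
    def __init__(self, clip_dim: int = 512, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(clip_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, clip_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Reconstruct the CLIP features from the compressed latent code.
        return self.decoder(self.encoder(feats))


# Training sketch: minimize reconstruction error on CLIP features extracted
# from the scanning-phase RGB images (random tensors stand in here).
model = ClipFeatureAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
clip_feats = torch.randn(1024, 512)  # placeholder for real CLIP embeddings
for _ in range(100):
    recon = model(clip_feats)
    loss = nn.functional.mse_loss(recon, clip_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```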

Scene-Editing with SEE-Splat

SEE-Splat performs real-time editing and visualization, which is essential for previewing a robot's interactions with its environment. It identifies relevant objects by semantic similarity and uses 3D Gaussian primitives to flexibly transform the scene. The module supports inserting, modifying, or removing objects, which is crucial for dynamic interaction and simulation (Figure 2); a minimal sketch of the masking-and-transform step follows the figure caption.

Figure 2: Given a natural-language prompt and a desired scene transformation, SEE-Splat leverages 3D ASK-Splat for open-vocabulary scene-editing via semantic localization of relevant Gaussian primitives in the scene, followed by 3D masking and transformation of these Gaussians.
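The following is a minimal sketch, under assumed data layouts, of the semantic masking and rigid transformation described in the caption above: per-Gaussian semantic features are compared against a text embedding, and the selected Gaussian centers are moved. Function names and the similarity threshold are illustrative, and a complete edit would also transform each Gaussian's orientation and covariance.

```python
# Sketch of SEE-Splat-style editing: select Gaussians by semantic similarity to
# a text query, then rigidly transform the selected centers. Data layouts,
# names, and the threshold are assumptions for illustration.
import numpy as np


def semantic_mask(gaussian_feats: np.ndarray,   # (N, D) per-Gaussian features
                  text_embedding: np.ndarray,   # (D,) query embedding
                  threshold: float = 0.6) -> np.ndarray:
    """Boolean mask over Gaussians whose features match the text query."""
    feats = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    return feats @ text > threshold


def transform_gaussians(means: np.ndarray,       # (N, 3) Gaussian centers
                        mask: np.ndarray,        # (N,) boolean selection
                        rotation: np.ndarray,    # (3, 3) rotation matrix
                        translation: np.ndarray  # (3,) translation vector
                        ) -> np.ndarray:
    """Apply a rigid transform to the selected Gaussian centers only.

    Note: only the centers are moved here for brevity; a full edit would also
    rotate each Gaussian's orientation and covariance.
    """
    edited = means.copy()
    edited[mask] = edited[mask] @ rotation.T + translation
    return edited
```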

Grasp-Splat: Enhanced Grasp Generation

Using the enriched representation from ASK-Splat, Grasp-Splat ranks grasp candidates proposed by models such as GraspNet. It scores these candidates by how well they align with the affordance values distilled into the scene representation, improving the likelihood of successful manipulation, and it balances semantic inputs and grasp affordances without relying on external guidance. A sketch of this affordance-based ranking is given after Figure 3.

Figure 3: 3D Gaussian Infilling in SEE-Splat: (Left) In general, without 3D Gaussian infilling, transformation of the Gaussians (e.g., moving the saucepan from the table to the electric stove) introduces artifacts, such as the hole in the table after moving the saucepan. (Right) Via 3D Gaussian infilling, SEE-Splat generates photorealistic renderings of the edited scene.
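Below is a minimal sketch of affordance-aligned grasp ranking under assumed data layouts: each candidate grasp (e.g., from GraspNet) is scored by the average affordance value of nearby Gaussians, and candidates are sorted by that score. The function name, radius, and scoring rule are illustrative assumptions rather than the paper's exact procedure.

```python
# Hypothetical affordance-aligned grasp ranking: score each candidate grasp by
# the mean affordance of Gaussians near the grasp point, then sort descending.
import numpy as np


def rank_grasps(grasp_positions: np.ndarray,  # (G, 3) candidate grasp centers
                gaussian_means: np.ndarray,   # (N, 3) Gaussian centers
                affordance: np.ndarray,       # (N,) distilled affordance scores
                radius: float = 0.02) -> np.ndarray:
    """Return candidate indices sorted from most to least affordance-aligned."""
    scores = []
    for g in grasp_positions:
        dists = np.linalg.norm(gaussian_means - g, axis=1)
        near = dists < radius
        # Average affordance of Gaussians within `radius` meters of the grasp.
        scores.append(affordance[near].mean() if near.any() else 0.0)
    return np.argsort(np.array(scores))[::-1]
```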

Evaluation

The paper reports evaluation metrics and comparisons in which Splat-MOVER outperforms two recent baselines, LERF-TOGO and F3RM, on four single-stage and four multi-stage open-vocabulary manipulation tasks. Splat-MOVER achieves higher success rates on pick-and-place tasks across several scenarios with minimal human input, underscoring the effectiveness of the integrated affordance-and-semantics approach.

Conclusion

Splat-MOVER enriches robotic manipulation by combining language-grounded scene understanding with affordance-oriented grasping, enabling complex, natural interactions in unstructured environments. While its reliance on current affordance models limits applicability to the data distributions those models were trained on, future work that expands these models could broaden Splat-MOVER's reach across varied robotic applications. Its real-time scene editing and scalable grasp generation underscore its relevance to the evolution of autonomous robotic capabilities.
