Closed-Loop Next-Best-View Planning for Target-Driven Grasping

Published 21 Jul 2022 in cs.RO | (2207.10543v1)

Abstract: Picking a specific object from clutter is an essential component of many manipulation tasks. Partial observations often require the robot to collect additional views of the scene before attempting a grasp. This paper proposes a closed-loop next-best-view planner that drives exploration based on occluded object parts. By continuously predicting grasps from an up-to-date scene reconstruction, our policy can decide online to finalize a grasp execution or to adapt the robot's trajectory for further exploration. We show that our reactive approach decreases execution times without loss of grasp success rates compared to common camera placements and handles situations where the fixed baselines fail. Video and code are available at https://github.com/ethz-asl/active_grasp.

Summary

The paper introduces a policy that integrates closed-loop next-best-view planning with real-time grasp detection to optimize sensor positioning and mitigate occlusions.
It employs a TSDF-based volumetric map and a convolutional neural network (VGN) to evaluate grasp quality and dynamically update grasp predictions.
The approach reduces grasp search times while maintaining high success rates compared to fixed camera setups, demonstrating robustness in cluttered environments.

Introduction

The paper addresses grasping a specific target object in clutter by combining closed-loop next-best-view planning with real-time grasp detection. The planner moves the camera to views that reveal occluded parts of the target, while a policy continuously updates grasp predictions from an evolving scene reconstruction built from incoming sensor measurements. Compared to fixed camera placements, the approach reduces execution times without lowering grasp success rates.

Figure 1: Detecting grasps on the orange packet using the wrist-mounted camera is hindered by occlusions from surrounding objects. By moving the sensor, our next-best-view grasp planner successfully discovers a grasp on the target item.

Methodology

The proposed framework comprises three components:

Volumetric Map Integration: The method employs a Truncated Signed Distance Function (TSDF) to create a volumetric representation of the scene. This allows for efficient integration of new sensor data and aids in accurate grasp detection.
Grasp Detection: A fully convolutional network, termed Volumetric Grasping Network (VGN), is utilized for grasp synthesis. This network computes grasp quality scores and associated grasp parameters at every voxel within the TSDF representation, filtering for grasps that are feasible both geometrically and kinematically.
Next-Best-View Planning: The policy generates potential sensor placements, selecting the one maximizing predicted information gain. The information gain metric is computed using ray casting to evaluate the visibility improvement of occluded voxels from new viewpoints.
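
The volumetric map integration step can be illustrated with the standard weighted running-average TSDF update. This is a minimal sketch, not the authors' implementation: the function name `integrate_tsdf`, the truncation distance, and the weight cap are assumptions chosen for illustration, and the projection of depth pixels into voxels is assumed to have happened already.

```python
import numpy as np

def integrate_tsdf(tsdf, weights, new_sdf, observed, trunc=0.04, max_weight=64.0):
    """Fuse one observation into a TSDF grid (hypothetical helper).

    tsdf, weights: current per-voxel distance (normalized to [-1, 1]) and weight.
    new_sdf: signed distance in meters measured by the new depth image.
    observed: boolean mask of voxels that the new depth image actually covers.
    """
    # Truncate and normalize the new measurement.
    d = np.clip(new_sdf / trunc, -1.0, 1.0)
    # Each covered voxel receives unit weight; uncovered voxels receive none.
    w_new = observed.astype(tsdf.dtype)
    total = weights + w_new
    # Weighted running average; voxels that were never observed keep their value.
    fused = np.where(total > 0,
                     (weights * tsdf + w_new * d) / np.maximum(total, 1e-9),
                     tsdf)
    return fused, np.minimum(total, max_weight)

tsdf = np.array([1.0, 1.0])
weights = np.array([1.0, 1.0])
fused, new_weights = integrate_tsdf(
    tsdf, weights, new_sdf=np.array([0.0, 0.0]), observed=np.array([True, False])
)
```

Because each update is a per-voxel average, a new depth image can be fused cheaply on every policy step, which is what makes the map suitable for closed-loop use.
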
Figure 2: Overview of the framework. Our policy continuously integrates sensor measurements into a volumetric map of the scene, computes grasps, and re-plans informative views until a stable grasp is detected.
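
The ray-casting information gain described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact metric: the voxel labels `FREE`, `OCCUPIED`, and `UNKNOWN`, the fixed-step ray marching, and the function names are hypothetical, and the gain is taken to be the number of unknown target voxels a candidate view would reveal before its rays hit a surface.

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, 2  # hypothetical voxel labels derived from the map

def information_gain(grid, target_mask, origin, directions, voxel_size, max_range):
    """Count unknown target voxels visible from `origin` along the given rays."""
    step = 0.5 * voxel_size
    shape = np.array(grid.shape)
    revealed = set()
    for d in directions:
        d = d / np.linalg.norm(d)
        for t in np.arange(0.0, max_range, step):
            idx = np.floor((origin + t * d) / voxel_size).astype(int)
            if np.any(idx < 0) or np.any(idx >= shape):
                continue  # outside the map; the ray may still enter it later
            i = tuple(idx)
            if grid[i] == OCCUPIED:
                break  # the ray is blocked by a known surface
            if grid[i] == UNKNOWN and target_mask[i]:
                revealed.add(i)
    return len(revealed)

def next_best_view(grid, target_mask, candidates, voxel_size, max_range):
    """Return the index of the candidate view with the highest gain, plus all gains."""
    gains = [information_gain(grid, target_mask, o, dirs, voxel_size, max_range)
             for o, dirs in candidates]
    return int(np.argmax(gains)), gains

# Toy scene: an occluder sits above two unknown voxels belonging to the target.
grid = np.full((5, 5, 5), FREE)
grid[2, 2, 2] = OCCUPIED
grid[2, 2, 1] = UNKNOWN
grid[2, 2, 0] = UNKNOWN
target_mask = grid == UNKNOWN

top_view = (np.array([2.5, 2.5, 4.5]), np.array([[0.0, 0.0, -1.0]]))
side_view = (np.array([0.5, 2.5, 1.5]), np.array([[1.0, 0.0, 0.0]]))
best, gains = next_best_view(grid, target_mask, [top_view, side_view],
                             voxel_size=1.0, max_range=5.0)
```

In the toy scene the top-down ray is blocked by the occluder, while the side ray reaches one of the hidden target voxels, so the side view is selected. A real sensor model would cast one ray per pixel rather than a single ray per view.
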

Implementation and Evaluation

The implementation emphasizes the responsiveness and adaptability of the exploration and grasping policy. Key details include:

Policy Execution Rate: The policy re-evaluates sensor data at 4 Hz, ensuring timely adaptations to grasp predictions based on evolving scene information.
Stopping Criteria: Several criteria are defined to balance exploration with exploitation, including a time budget, minimum information gain threshold, and a grasp stability check based on historical predictions.
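
The three stopping criteria can be combined in a small check that runs once per policy update (every 0.25 s at 4 Hz). The sketch below is an assumption-laden illustration: the class name, the threshold values, and the rule that a grasp counts as stable when it has been the best prediction for every update in the window are hypothetical and not taken from the paper.

```python
from collections import deque

class StoppingCriteria:
    """Hypothetical stopping check for a closed-loop view planner."""

    def __init__(self, time_budget, min_gain, window):
        self.time_budget = time_budget  # seconds of exploration allowed
        self.min_gain = min_gain        # smallest gain still worth a camera move
        self.history = deque(maxlen=window)  # best grasp of the last `window` updates

    def check(self, elapsed, best_gain, best_grasp):
        """Return the reason to stop exploring, or None to continue."""
        self.history.append(best_grasp)
        full = len(self.history) == self.history.maxlen
        if full and best_grasp is not None and all(g == best_grasp for g in self.history):
            return "stable_grasp"
        if elapsed >= self.time_budget:
            return "time_budget"
        if best_gain < self.min_gain:
            return "low_information_gain"
        return None

criteria = StoppingCriteria(time_budget=10.0, min_gain=5.0, window=3)
results = [
    criteria.check(0.25, 40.0, None),       # no grasp found yet
    criteria.check(0.50, 30.0, (1, 2, 3)),  # grasp appears (identified by voxel index)
    criteria.check(0.75, 20.0, (1, 2, 3)),
    criteria.check(1.00, 15.0, (1, 2, 3)),  # same grasp for 3 updates in a row
]
```

A longer window makes the stability test harder to pass, so the policy keeps exploring for more updates before committing to a grasp.
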

The experimental evaluation in both simulation and real-world scenarios showcases the efficacy of the approach. The proposed method significantly reduces search time while maintaining comparable or improved grasp success rates relative to baseline methods employing fixed camera placements.

Figure 3: Success rate and search time vs window size T. Larger values of T force the policy to explore longer, leading to higher success rates at the cost of longer search times.

Discussion

While the approach demonstrates robustness and efficiency, several areas are identified for future research:

Collision checking and path planning remain outside the scope of the current implementation, so some detected grasps occasionally cannot be executed.
The target object is currently identified by a bounding box; replacing this with real-time object detection or instance segmentation would increase autonomy.
Further exploration into multi-step manipulation tasks in dense clutter is suggested, where simple grasp execution might be inhibited by surrounding objects.
Figure 4: Images taken from the initial camera view for each test scene with the target object colored in green. Note that the RGB image has a smaller field of view compared to the depth sensor of the RealSense.

Conclusion

The paper presents a strategy for target-driven grasping in cluttered environments that moves the sensor online to mitigate occlusions. By continuously integrating new sensor data, the framework finds stable grasps quickly and remains robust across diverse scenes, including situations where fixed camera placements fail. The open issues noted above, such as collision-aware planning and automatic target identification, point to natural extensions toward manipulation in unstructured environments.