
Stereo R-CNN based 3D Object Detection for Autonomous Driving

Published 26 Feb 2019 in cs.CV and cs.RO (arXiv:1902.09738v2)

Abstract: We propose a 3D object detection method for autonomous driving that fully exploits the sparse and dense, semantic and geometric information in stereo imagery. Our method, called Stereo R-CNN, extends Faster R-CNN to stereo inputs so as to simultaneously detect and associate objects in the left and right images. We add extra branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with the 2D left-right boxes to compute a coarse 3D object bounding box. We then recover an accurate 3D bounding box by region-based photometric alignment using the left and right RoIs. Our method requires neither depth input nor 3D position supervision, yet outperforms all existing fully supervised image-based methods. Experiments on the challenging KITTI dataset show that our method outperforms the state-of-the-art stereo-based method by around 30% AP on both the 3D detection and 3D localization tasks. Code has been released at https://github.com/HKUST-Aerial-Robotics/Stereo-RCNN.

Citations (480)

Summary

  • The paper introduces a Stereo R-CNN framework that simultaneously detects and associates objects using stereo images, enhancing 3D localization.
  • It extends Faster R-CNN with a stereo region proposal network and dense photometric alignment, achieving approximately 30% higher detection precision on KITTI.
  • The methodology eliminates the dependency on LiDAR by leveraging both semantic and geometric cues for robust autonomous driving applications.

Introduction to Stereo R-CNN

The paper introduces Stereo R-CNN, an approach to 3D object detection for autonomous driving. By leveraging stereo imagery, the method offers an alternative to LiDAR, which is widely used but costly and produces only sparse measurements. Stereo R-CNN exploits both semantic and geometric information in stereo images, detecting and associating objects in the left and right images concurrently.

Network Architecture

The proposed Stereo R-CNN network extends the Faster R-CNN framework to handle stereo inputs and consists of three main components:

  • Stereo Region Proposal Network (RPN): Designed to output proposals for both left and right images by sharing weights. This module employs a novel target assignment strategy to handle stereo input effectively.
  • Stereo R-CNN: Includes branches for classifying objects, regressing stereo boxes, and predicting viewpoint angles and object dimensions. The network uses a concatenation of left-right RoI features to improve stereo detection precision.
  • Dense 3D Box Alignment: Implements a region-based photometric alignment to refine the 3D bounding box estimate. Rather than relying on depth input, this approach treats matching between the left and right RoIs as a geometric problem to achieve precise localization.

    Figure 1: Network architecture of the proposed Stereo R-CNN.
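The dense alignment step can be illustrated with a deliberately simplified sketch. Unlike the paper, which gives each RoI pixel its own depth offset derived from the 3D box shape, the toy version below assumes a fronto-parallel patch at a single depth, so every pixel shares one disparity d = f·b/z. All images, camera parameters, and the `photometric_depth` helper are made up for illustration:

```python
import numpy as np

def photometric_depth(left, right, box, f, b, z_grid):
    """Pick the candidate depth that best aligns the left RoI with the right image.

    Every RoI pixel is assumed to lie at the candidate depth z, giving one
    integer disparity d = f * b / z for the whole patch; the depth whose
    disparity-shifted right-image patch has the smallest mean squared
    photometric error against the left patch wins.
    """
    u0, v0, u1, v1 = box  # (u_min, v_min, u_max, v_max) in the left image
    patch = left[v0:v1, u0:u1].astype(np.float64)
    best_z, best_err = None, np.inf
    for z in z_grid:
        d = int(round(f * b / z))
        if u0 - d < 0:
            continue  # shifted patch would fall outside the right image
        cand = right[v0:v1, u0 - d:u1 - d].astype(np.float64)
        err = np.mean((patch - cand) ** 2)
        if err < best_err:
            best_z, best_err = z, err
    return best_z

# Synthetic rectified pair: the right image is the left image shifted
# left by a constant disparity of 10 pixels.
rng = np.random.default_rng(0)
left = rng.random((60, 200))
d_true = 10
right = np.zeros_like(left)
right[:, :-d_true] = left[:, d_true:]

f, b = 700.0, 0.5                           # made-up focal length / baseline
z_grid = [f * b / d for d in range(5, 21)]  # depths matching disparities 5..20
z = photometric_depth(left, right, (50, 10, 90, 40), f, b, z_grid)
# z == f * b / 10 == 35.0
```

The paper additionally restricts the error sum to valid object pixels (excluding, e.g., occluded borders of the RoI); this sketch ignores that and scores the whole rectangle.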

Key Contributions

The Stereo R-CNN introduces several key innovations:

  1. Simultaneous Detection and Association: The framework detects and associates objects in stereo images simultaneously, leveraging a unique classification and regression target assignment.
  2. 3D Box Estimator: Utilizes both keypoints and stereo box constraints to enhance 3D bounding box prediction.
  3. Photometric Alignment for 3D Localization: A dense alignment strategy ensures accurate object localization, outperforming traditional methods that require depth information.
  4. Evaluation and Results: Tested on the KITTI dataset, the method surpasses state-of-the-art stereo-based techniques, providing approximately 30% higher average precision in both 3D detection and localization tasks.
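The 3D box estimator in item 2 solves a small least-squares problem over keypoint and stereo-box constraints; the core geometric cue it exploits can be shown in isolation. For a rectified stereo pair, a point at depth z projects with horizontal disparity d = u_left − u_right = f·b/z, so the object-centre depth follows directly from the box-level disparity. A minimal sketch, with illustrative box coordinates and KITTI-like camera parameters (none taken from the paper):

```python
def coarse_center_depth(left_box, right_box, f, b):
    """Depth of the object centre from box-level stereo disparity.

    Boxes are (u_min, v_min, u_max, v_max); with rectified images the
    disparity of the horizontal box centres gives z = f * b / d.
    """
    u_left = 0.5 * (left_box[0] + left_box[2])
    u_right = 0.5 * (right_box[0] + right_box[2])
    disparity = u_left - u_right  # pixels; positive for points in front of the cameras
    return f * b / disparity

# Made-up associated boxes whose horizontal centres differ by 10 px,
# with KITTI-like intrinsics (focal ~721.5 px, baseline ~0.54 m).
z = coarse_center_depth((580, 150, 620, 190), (570, 150, 610, 190), 721.5, 0.54)
# z == 721.5 * 0.54 / 10 == 38.961 (metres)
```

This is only the depth component; the full estimator in the paper also recovers the horizontal position, vertical position, and orientation by jointly fitting the regressed dimensions, viewpoint, and perspective keypoint.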

Performance and Evaluation

The proposed method was evaluated on the KITTI benchmark, demonstrating significant improvements over existing image-based and stereo-based 3D detection approaches. The method achieved superior performance in both average precision for bird’s-eye view and 3D bounding box metrics across easy, moderate, and hard difficulty levels.

Practical Implications and Future Work

While Stereo R-CNN already outperforms several existing methods, there is clear room for further development. Future work could extend the model to multi-object tracking and integrate instance segmentation for more precise RoI selection.

In conclusion, the paper presents Stereo R-CNN as a viable, cost-effective alternative for 3D perception in autonomous driving, offering robust performance without depending on depth inputs. The framework not only improves on current image-based 3D object detection but also lays the groundwork for future extensions in diverse environments.
