- The paper presents a novel 4D Feature-Consistency Embedding (FCE) space that extracts 3D structures from stereo pairs without depth supervision.
- It integrates multi-scale feature consistency, a semantic-guided RBF module, and a structure-aware attention mechanism to enhance detection accuracy.
- RTS3D achieves over 24 FPS and a 10% increase in average precision on the KITTI benchmark, significantly advancing real-time detection for autonomous driving.
Real-Time Stereo 3D Detection for Autonomous Driving
The paper "RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving" presents an innovative approach to 3D object detection from stereo images. RTS3D addresses both the efficiency and accuracy limitations of the pseudo-LiDAR representations used in image-based 3D detection by introducing a novel intermediate representation, the 4D Feature-Consistency Embedding (FCE) space. This essay explores the technical details, contributions, and implications of the proposed method.
4D Feature-Consistency Embedding Space
RTS3D diverges from traditional 3D occupancy space approaches by employing the 4D FCE space. The FCE space serves as an intermediary in 3D scene representation, encoding both structural and semantic information without requiring depth supervision. The core concept is to determine feature consistency between stereo image pairs to extract underlying 3D structures.
- Multi-Scale Feature Consistency: Stereo pairs are passed through a ResNet18-based network to acquire multi-scale features. Consistency across these scales helps mitigate issues of textureless regions and reflective interference.
- Semantic-guided Radial Basis Function (RBF): A semantic-guided RBF module suppresses noise from non-target surfaces. It explicitly models semantic cues to refine regions of interest within the 3D space.
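The two components above can be sketched together: a candidate 3D point is projected into both images, features are sampled at the two projections, and agreement between them scores how likely the point lies on a real surface, while an RBF over semantic features down-weights non-target regions. The following is a minimal NumPy illustration, not the paper's implementation; the pinhole projection, nearest-neighbor sampling, and single semantic prototype are simplifying assumptions.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (shape (3,)) to pixel coords with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def feature_consistency(feat_l, feat_r, P_l, P_r, X):
    """Consistency score for one candidate 3D point: features sampled at its
    left/right projections should agree if X lies on a real surface."""
    def sample(feat, uv):
        h, w = feat.shape[:2]
        u, v = int(round(uv[0])), int(round(uv[1]))  # nearest-neighbor sampling
        if not (0 <= v < h and 0 <= u < w):
            return np.zeros(feat.shape[2])
        return feat[v, u]
    f_l = sample(feat_l, project(P_l, X))
    f_r = sample(feat_r, project(P_r, X))
    return -np.sum((f_l - f_r) ** 2)  # higher (closer to 0) = more consistent

def semantic_rbf_weight(f, center, sigma=1.0):
    """Semantic-guided RBF: down-weight points whose features are far from a
    semantic prototype (e.g., a 'car' feature center)."""
    return np.exp(-np.sum((f - center) ** 2) / (2 * sigma ** 2))
```

In the actual method this scoring is performed per scale of the ResNet18 feature pyramid and with bilinear rather than nearest-neighbor sampling; multi-scale agreement is what makes the score robust in textureless or reflective regions.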
Structural and Semantic Optimization
Addressing the inherent noise present in FCE space due to non-target interference and computational complexity, RTS3D employs several optimization strategies:
- Latent Space Generation: Uses fast monocular 3D detection to predict an initial latent space. This space is progressively refined by iterating detection results within the FCE space.
- Structure-aware Attention Module: Enhances spatial consistency by filtering features through a structure-aware attention mechanism that captures local geometry efficiently, avoiding the computational demand of full 3D convolutional networks.
- Fast PointNet Architecture: Adopts a variant of PointNet to predict 3D bounding boxes. Coupled with the structure-aware attention, it strikes a beneficial balance between computational efficiency and detection accuracy.
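The latent-space refinement described above can be caricatured as an iterate-and-rescore loop: a coarse box from the fast monocular detector is repeatedly adjusted toward higher feature consistency. In the paper the refinement is a learned network iterated within the FCE space; the coordinate-descent search and the `score_fn` callback below are hypothetical stand-ins that show only the loop structure.

```python
import numpy as np

def refine_box(box, score_fn, n_iters=2, step=0.1):
    """Iteratively perturb a coarse 3D box (e.g., from a fast monocular
    detector) and keep the perturbation that maximizes a consistency score.
    box: (x, y, z, w, h, l, yaw); score_fn maps a box to a scalar score."""
    box = np.asarray(box, dtype=float)
    for _ in range(n_iters):
        best, best_s = box, score_fn(box)
        for dim in range(len(box)):          # try a small move on each parameter
            for d in (-step, step):
                cand = box.copy()
                cand[dim] += d
                s = score_fn(cand)
                if s > best_s:
                    best, best_s = cand, s
        box = best                            # keep the best single-parameter move
    return box
```

A real `score_fn` would aggregate feature-consistency values over points sampled inside the candidate box; here any differentiable-free scorer works, which is what makes the depth-supervision-free training possible.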
Experimental Results
The efficacy of RTS3D is empirically demonstrated using the KITTI benchmark, highlighting several key outcomes:
- Real-Time Performance: Runs at over 24 FPS, making RTS3D the first real-time stereo-image 3D detection system.
- Accuracy Improvements: Delivers a 10% increase in average precision over prior state-of-the-art image-based methods, achieved without extensive annotations for supervised depth estimation.
Implications and Future Work
RTS3D presents practical advances for autonomous driving: it significantly reduces the computational overhead typical of 3D object detection and offers a scalable, accurate solution without heavy reliance on LiDAR systems. By operating directly on stereo imagery, the approach recovers 3D structure without first reconstructing an explicit depth map.
Potential future developments could extend the system by integrating temporal consistency across image frames or by further reducing spatial noise with unsupervised learning techniques; such extensions could improve robustness in the dynamic environments typical of autonomous driving.
Conclusion
The RTS3D framework offers an efficient, accurate alternative for 3D detection in autonomous systems by effectively leveraging stereo vision data. By circumventing depth supervision and heavy computational resources, it sets a precedent for real-time 3D object detection in practical, real-world scenarios. As autonomous systems continue to evolve, such innovations enable broader deployment in urban and dense environments while maintaining high accuracy and efficiency.