- The paper introduces the Orthographic Feature Transform to map perspective image features to an orthographic bird's-eye view for improved 3D detection.
- The method leverages a ResNet-based feature extractor and a convolutional top-down network to enhance spatial reasoning and robust object localization.
- Evaluations on the KITTI benchmark show that the approach outperforms previous monocular methods, reducing reliance on expensive LiDAR systems.
The paper "Orthographic Feature Transform for Monocular 3D Object Detection" by Thomas Roddick, Alex Kendall, and Roberto Cipolla presents a novel approach to 3D object detection using only monocular RGB images. The motivation stems from the substantial performance gap between monocular and LiDAR-based systems, attributed largely to the difficulty of inferring depth and scale from a single perspective image. By introducing the Orthographic Feature Transform (OFT), the authors enhance the network's ability to reason about spatial configurations in 3D space, addressing some of the inherent limitations of monocular setups.
Contribution and Methodology
The primary contribution of this paper is the Orthographic Feature Transform, a differentiable operation that maps features from the perspective image domain to an orthographic bird's-eye-view representation. This mitigates the depth-dependent scale variation and viewpoint distortion that typically compromise accuracy in monocular approaches, enabling consistent spatial analysis and effective object localization in a homogeneous feature space.
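The core idea can be sketched as follows: each cell of a bird's-eye-view grid is projected into the image via the camera intrinsics, and the image features falling inside the projected bounding box are average-pooled into that cell. This is a minimal NumPy illustration, not the authors' implementation; the function name, tensor shapes, and the brute-force per-voxel loop are our own assumptions.

```python
import numpy as np

def orthographic_feature_transform(feat_map, intrinsics, grid_xyz, voxel_size):
    """Map image features into a BEV grid by projecting each voxel's
    corners into the image and average-pooling the features inside the
    resulting bounding box.
    feat_map:   (C, H, W) image feature map
    intrinsics: (3, 3) camera calibration matrix
    grid_xyz:   (D, Wb, 3) voxel centres in camera coordinates (x, y, z)
    """
    C, H, W = feat_map.shape
    D, Wb, _ = grid_xyz.shape
    bev = np.zeros((C, D, Wb))
    half = voxel_size / 2.0
    # The eight corner offsets of a cubic voxel.
    offs = np.array([[sx, sy, sz]
                     for sx in (-half, half)
                     for sy in (-half, half)
                     for sz in (-half, half)])
    for i in range(D):
        for j in range(Wb):
            corners = grid_xyz[i, j] + offs       # (8, 3) in camera frame
            proj = corners @ intrinsics.T         # pinhole projection
            uv = proj[:, :2] / proj[:, 2:3]       # divide by depth
            u0, v0 = np.floor(uv.min(axis=0)).astype(int)
            u1, v1 = np.ceil(uv.max(axis=0)).astype(int)
            # Clamp the box to the image bounds.
            u0, v0 = max(u0, 0), max(v0, 0)
            u1, v1 = min(u1, W), min(v1, H)
            if u1 > u0 and v1 > v0:
                bev[:, i, j] = feat_map[:, v0:v1, u0:u1].mean(axis=(1, 2))
    return bev
```

In practice this pooling is vectorised and accelerated with integral images rather than looped per voxel, but the geometry is the same.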
The authors implement their approach as an end-to-end deep learning architecture. A ResNet-based front-end extracts image features, and the OFT maps these features into the bird's-eye-view representation. Because scale in this representation no longer depends on distance from the camera, the model can exploit consistent scale information for 3D bounding box prediction. A convolutional 'topdown network' then performs spatial reasoning over the ground plane, predicting object locations with confidence scores. The authors also describe how integral images are used to compute the average pooling inside the feature transform in constant time per voxel, keeping the method computationally efficient.
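Integral images (summed-area tables) are what make this pooling cheap: after a single cumulative-sum pass over the feature map, the sum over any axis-aligned box takes four lookups, regardless of box size. A short illustrative sketch (helper names are our own):

```python
import numpy as np

def integral_image(feat):
    """Summed-area table with a zero-padded first row and column,
    so box sums need no boundary special-cases.
    feat: (C, H, W) -> returns (C, H+1, W+1)."""
    C, H, W = feat.shape
    ii = np.zeros((C, H + 1, W + 1))
    ii[:, 1:, 1:] = feat.cumsum(axis=1).cumsum(axis=2)
    return ii

def box_mean(ii, v0, v1, u0, u1):
    """Average of feat[:, v0:v1, u0:u1] in O(1) via four lookups."""
    area = (v1 - v0) * (u1 - u0)
    box_sum = (ii[:, v1, u1] - ii[:, v0, u1]
               - ii[:, v1, u0] + ii[:, v0, u0])
    return box_sum / area
```

The one-off cumulative sum costs O(HW), after which every voxel's pooled feature is constant-time, independent of how large its projected bounding box is.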
Distinctively, the proposed method avoids reliance on explicit depth estimation: depth reasoning happens implicitly within the learned feature space, which helps detection remain robust across a range of object distances. The authors validate their approach on the KITTI 3D object detection benchmark, demonstrating state-of-the-art performance among monocular methods.
Results and Implications
In comparative evaluations against existing baselines, including the monocular Mono3D and the stereo-based 3DOP, the method performs strongly across most scenarios. OFT-Net achieves notable improvements in detecting objects far from the camera, indicative of its robustness in depth reasoning. The reported average precision on the KITTI dataset substantiates the effectiveness of performing object detection in orthographic space.
This approach holds significant implications for autonomous driving and robotics, where cost-effective and efficient sensing solutions are imperative. Monocular methods enriched by OFT could reduce dependency on expensive LiDAR systems, fostering advancements in autonomous system scalability and deployment.
Future Directions
Future work could explore further reducing the computational footprint of the OFT and extending the approach to handle dynamic object detection in congested urban environments. Additionally, exploring hybrid models that combine monocular and sparse depth information could enhance robustness, especially in challenging weather conditions where monocular imagery might be compromised. Advancing the efficiency of the topdown network could also encourage broader application in real-time systems.
Overall, the innovative step of transforming image features to orthographic space, as demonstrated by the authors, underscores a promising direction for non-LiDAR-dependent 3D object detection systems in AI research.