StereoDETR: Stereo-based Transformer for 3D Object Detection

Published 24 Nov 2025 in cs.CV | (2511.18788v1)

Abstract: Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet its inference speed is only half as fast. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. These two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for sampling points without requiring extra annotations. StereoDETR achieves real-time inference and is the first stereo-based method to surpass monocular approaches in speed. It also achieves competitive accuracy on the public KITTI benchmark, setting new state-of-the-art results on pedestrian and cyclist subsets. The code is available at https://github.com/shiyi-mu/StereoDETR-OPEN.

Summary

The paper introduces a dual-branch DETR that combines monocular and stereo branches to enhance 3D object detection accuracy and speed.
It couples the two branches through differentiable depth sampling and handles occlusion with a constrained supervision strategy for the sampling points, which requires no extra annotations.
Experiments on the KITTI benchmark report an inference time of 17.6 ms, competitive overall accuracy, and state-of-the-art results on the pedestrian and cyclist subsets.

Introduction

StereoDETR is a stereo-based 3D object detection framework designed to reduce computational cost and inference latency while keeping accuracy competitive. It adapts the DEtection TRansformer (DETR) architecture to binocular input. According to the abstract, it is the first stereo-based method to run faster than monocular approaches, and it sets new state-of-the-art results on the KITTI pedestrian and cyclist subsets.

Methodology

Architecture Overview

StereoDETR operates through dual branches: a monocular DETR branch and a stereo branch. This design integrates cross-view disparity features with DETR's object-centric coarse depth estimations, facilitating precise 3D localization.

Figure 1: Overview of StereoDETR architecture.

DETR Branch

The monocular DETR branch is built on 2D DETR, with additional output channels for predicting object scale, orientation, and depth sampling points. It adopts a simplified version of the MonoDETR and MonoDGP architectures and decouples depth prediction from the detector, which removes the additional encoders typically required for depth feature fusion.
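To make the idea of "additional channels" concrete, the following minimal sketch slices one flat per-query output vector into named prediction heads. The channel widths, the `QueryPrediction` type, and the `split_query_output` helper are all hypothetical and chosen for illustration; the source does not specify the actual layout.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical channel layout for one object query. These widths are
# illustrative assumptions, not values taken from the paper.
NUM_CLASSES = 3   # e.g. car, pedestrian, cyclist
BOX2D_DIMS = 4    # 2D box (cx, cy, w, h)
SIZE3D_DIMS = 3   # object scale (h, w, l)
ORIENT_DIMS = 2   # orientation encoded as (sin, cos)
OFFSET_DIMS = 2   # sampling-point offset (dx, dy)


@dataclass
class QueryPrediction:
    class_logits: List[float]
    box2d: List[float]
    size3d: List[float]
    orientation: List[float]
    sample_offset: List[float]


def split_query_output(vec: Sequence[float]) -> QueryPrediction:
    """Slice one flat per-query output vector into named heads."""
    widths = [NUM_CLASSES, BOX2D_DIMS, SIZE3D_DIMS, ORIENT_DIMS, OFFSET_DIMS]
    if len(vec) != sum(widths):
        raise ValueError(f"expected {sum(widths)} channels, got {len(vec)}")
    parts, start = [], 0
    for width in widths:
        parts.append(list(vec[start:start + width]))
        start += width
    return QueryPrediction(*parts)


pred = split_query_output([float(i) for i in range(14)])
print(pred.sample_offset)  # [12.0, 13.0]
```

Note that no depth value appears among these heads: in this sketch the query only says where to sample, and the depth itself would come from the stereo branch.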

Stereo Branch

StereoDETR's stereo branch computes correlation volumes between the left and right views and fuses them across scales into lightweight disparity features. These low-cost features are used to predict object-level depth maps, which avoids much of the latency commonly associated with stereo methods.
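As a rough illustration of what a correlation volume is, the sketch below correlates left and right feature vectors along a single image row and converts the best-matching disparity to depth with the standard rectified-stereo relation depth = focal × baseline / disparity. The function names, the toy one-hot features, the argmax readout, and the camera parameters are assumptions made for this example; the paper's stereo branch predicts depth from learned multi-scale features rather than a hard argmax.

```python
from typing import List

Row = List[List[float]]  # one image row: a feature vector per pixel


def correlation_row(left: Row, right: Row, max_disp: int) -> List[List[float]]:
    """Return corr[d][x], the channel-mean of left[x] * right[x - d]."""
    width, channels = len(left), len(left[0])
    corr = [[0.0] * width for _ in range(max_disp)]
    for d in range(max_disp):
        for x in range(d, width):  # pixels with x - d < 0 have no match
            dot = sum(l * r for l, r in zip(left[x], right[x - d]))
            corr[d][x] = dot / channels
    return corr


def disparity_to_depth(disp_px: float, focal_px: float, baseline_m: float) -> float:
    """Rectified stereo: depth = focal * baseline / disparity."""
    if disp_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disp_px


# Toy data: one-hot features, with the left row equal to the right row
# shifted by two pixels, so the true disparity is 2.
W = 6
right = [[1.0 if c == x else 0.0 for c in range(W)] for x in range(W)]
left = [[0.0] * W, [0.0] * W] + right[:W - 2]

corr = correlation_row(left, right, max_disp=4)
best_disp = max(range(4), key=lambda d: corr[d][4])
print(best_disp)  # 2
# Hypothetical camera: 720 px focal length, 0.5 m baseline.
print(disparity_to_depth(best_disp, focal_px=720.0, baseline_m=0.5))  # 180.0
```

The volume has only `max_disp` channels per pixel, which is why correlation-based matching is cheap compared with volumes that keep the full concatenated features.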

Figure 2: Multi-Scale Fusion module: transforms multi-scale correlation volumes into depth features.

Depth Sampling Strategy

A novel depth sampling strategy addresses occlusions by supervising sampling points based on object visibility, leveraging constrained supervision to bypass additional annotations. This offset sampling method is key to achieving accurate depth estimation in occlusion-prone environments.
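One common way to read a depth map at a predicted, non-integer location while keeping the operation differentiable is bilinear interpolation. The sketch below assumes that approach; the helper names, the clamping at the map border, and the toy depth values are illustrative assumptions, and the constrained supervision of the sampling points described in the paper is not modelled here.

```python
import math
from typing import List, Tuple


def sample_depth_bilinear(depth: List[List[float]], x: float, y: float) -> float:
    """Bilinearly interpolate depth[y][x] at a continuous location."""
    h, w = len(depth), len(depth[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the map (an assumption)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * depth[y0][x0] + ax * depth[y0][x1]
    bottom = (1 - ax) * depth[y1][x0] + ax * depth[y1][x1]
    return (1 - ay) * top + ay * bottom


def object_depth(
    depth: List[List[float]],
    center: Tuple[float, float],
    offset: Tuple[float, float],
) -> float:
    """Read depth at the sampling point center + offset."""
    return sample_depth_bilinear(depth, center[0] + offset[0], center[1] + offset[1])


depth_map = [
    [10.0, 20.0, 30.0],
    [20.0, 30.0, 40.0],
]
print(object_depth(depth_map, center=(1.0, 0.0), offset=(0.5, 0.5)))  # 30.0
```

Because the interpolation weights vary continuously with the offset, a depth loss can send gradients back to the offset itself, which lets the sampling point move away from pixels covered by an occluder.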

Figure 3: Depth sampling strategy designed to address the occlusion challenge.

Experimental Results

KITTI Benchmark Performance

On the KITTI benchmark, StereoDETR reports competitive overall accuracy and new state-of-the-art results on the pedestrian and cyclist subsets. It does so at an inference time of 17.6 ms per image, which makes it the first stereo-based method to run faster than monocular approaches.
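For readers who think in frames per second, the reported latency converts as follows. The helper names and the 30 Hz and 60 Hz camera rates are hypothetical and used only for illustration.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def fits_frame_budget(latency_ms: float, camera_hz: float) -> bool:
    """True if one inference finishes within one camera frame interval."""
    return latency_ms <= 1000.0 / camera_hz


fps = latency_ms_to_fps(17.6)
print(round(fps, 1))                 # 56.8
print(fits_frame_budget(17.6, 30.0))  # True  (hypothetical 30 Hz camera)
print(fits_frame_budget(17.6, 60.0))  # False (hypothetical 60 Hz camera)
```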

Figure 4: Comparison of accuracy and speed with existing camera-based methods on the KITTI test set (Car category, moderate difficulty).

Visualizations

Qualitative results show the predicted depth maps, 3D object centers, and non-occluded sampling points, along with the predicted 3D bounding boxes and their Bird's-Eye View projections.

Figure 5: Visualization results of predicted depth maps, 3D object centers, and non-occluded sampling points.

Figure 6: Visualization results of the predicted 3D bounding boxes and their corresponding representations in the Bird's-Eye View.

Conclusion

StereoDETR brings stereo 3D detection to real-time speed while keeping accuracy competitive, with state-of-the-art results on the KITTI pedestrian and cyclist subsets. It achieves this through a simplified architecture and a differentiable depth sampling strategy, which makes it relevant to latency-sensitive 3D vision applications such as autonomous driving. Its potential to adapt to open-world detection scenarios could also make it a useful building block for machine perception systems.
