Multi-View Attentive Contextualization for Multi-View 3D Object Detection
Abstract: We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite the remarkable progress in query-based MV3D object detection, prior art often suffers either from a failure to exploit high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. MvACon addresses both issues at once with a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to the specific 2D-to-3D feature lifting approach. In experiments, MvACon is thoroughly tested on the nuScenes benchmark using BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR, showing consistent detection performance improvement, especially in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer, with similar improvement. We show, both qualitatively and quantitatively, that global cluster-based contexts effectively encode dense scene-level information for MV3D object detection. The promising results of MvACon reinforce the adage in computer vision: "(contextualized) feature matters".
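The "representationally dense yet computationally sparse" idea described in the abstract can be illustrated with a minimal numpy sketch of cluster-based attentive contextualization: every 2D patch feature contributes to a small set of latent cluster contexts (dense in representation), and queries then attend only to those few contexts rather than to all patches (sparse in computation). Note this is an illustrative sketch under assumed shapes and a single learned assignment projection (`w_assign` is a hypothetical parameter), not the authors' exact MvACon implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cluster_contextualize(feats, queries, w_assign):
    """Sketch of cluster-based attentive feature contextualization.

    feats:    (N, D) flattened 2D image features (all patches, all scales)
    queries:  (Q, D) 3D object queries after lifting
    w_assign: (D, M) hypothetical learned projection scoring patches
              against M latent clusters
    """
    N, D = feats.shape
    # 1) Softly assign every patch to M latent clusters -> (N, M).
    #    Representationally dense: every patch participates.
    assign = softmax(feats @ w_assign, axis=1)
    # 2) Aggregate patches into M cluster contexts -> (M, D)
    #    (assignment-weighted mean per cluster).
    contexts = (assign / assign.sum(axis=0, keepdims=True)).T @ feats
    # 3) Queries attend to the M contexts instead of all N patches.
    #    Computationally sparse: O(Q*M) attention instead of O(Q*N).
    attn = softmax(queries @ contexts.T / np.sqrt(D), axis=1)
    # Residual update: contextualize queries with scene-level contexts.
    return queries + attn @ contexts

# Toy example: 1000 patches, 64-dim features, 8 clusters, 10 queries.
rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 64))
queries = rng.standard_normal((10, 64))
w_assign = rng.standard_normal((64, 8))
out = cluster_contextualize(feats, queries, w_assign)
print(out.shape)  # (10, 64)
```

Because the number of clusters M is small and fixed, the attention cost no longer grows with image resolution, which is what lets such a scheme exploit high-resolution features without the quadratic cost of dense attention.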
References:
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- M3d-rpn: Monocular 3d region proposal network for object detection. In ICCV, pages 9287–9296, 2019.
- nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
- End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- Viewpoint equivariance for multi-view 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9213–9222, 2023.
- Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2781–2790, 2022a.
- Monopair: Monocular 3d object detection using pairwise spatial relationships. In CVPR, pages 12093–12102, 2020.
- Pseudo-stereo for monocular 3d object detection in autonomous driving. In CVPR, 2022b.
- Learning depth-guided convolutions for monocular 3d object detection. In CVPR Workshops, pages 1000–1001, 2020.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9710–9719, 2021.
- Paca-vit: Learning patch-to-cluster attention in vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18568–18578, 2023.
- Transformer in transformer. Advances in Neural Information Processing Systems, 34:15908–15919, 2021.
- Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054, 2022.
- Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
- Monodtr: Monocular 3d object detection with depth-aware transformer. In CVPR, 2022.
- Groomed-nms: Grouped mathematically differentiable nms for monocular 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8973–8983, 2021.
- Deviant: Depth equivariant network for monocular 3d object detection. In ECCV, 2022.
- Dfa3d: 3d deformable attention for 2d-to-3d feature lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6684–6693, 2023a.
- Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving. In European Conference on Computer Vision, pages 644–660. Springer, 2020.
- Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
- Densely constrained depth estimator for monocular 3d object detection. ECCV, 2022a.
- Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1486–1494, 2023b.
- Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1477–1485, 2023c.
- Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022b.
- Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.
- Learning auxiliary monocular contexts helps monocular 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1810–1818, 2022a.
- Monocular 3d object detection with bounding box denoising in 3d by perceiver. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6436–6446, 2023a.
- Petr: Position embedding transformation for multi-view 3d object detection. In European Conference on Computer Vision, pages 531–548. Springer, 2022b.
- Petrv2: A unified framework for 3d perception from multi-camera images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3262–3272, 2023b.
- A survey of visual transformers. IEEE Transactions on Neural Networks and Learning Systems, 2023c.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021a.
- Autoshape: Real-time shape-aware monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15641–15650, 2021b.
- Geometry uncertainty projection network for monocular 3d object detection. arXiv:2107.13774, 2021.
- Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In ICCV, pages 6851–6860, 2019.
- Rethinking pseudo-lidar representation. In ECCV, pages 311–327. Springer, 2020.
- Delving into localization errors for monocular 3d object detection. In CVPR, pages 4721–4730, 2021.
- Vision-centric bev perception: A survey. arXiv preprint arXiv:2208.02797, 2022.
- Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- 3d bounding box estimation using deep learning and geometry. In CVPR, pages 7074–7082, 2017.
- Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.
- Depth is all you need for monocular 3d detection. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7024–7031, 2023.
- Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv preprint arXiv:2210.02443, 2022.
- Did-m3d: Decoupling instance depth for monocular 3d object detection. In ECCV, 2022.
- Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
- 3d point positional encoding for multi-camera 3d object detection transformers. arXiv preprint arXiv:2211.14710, 2022.
- Disentangling monocular 3d object detection. In ICCV, pages 1991–1999, 2019.
- Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16519–16529, 2021.
- Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12894–12904, 2021.
- Exploring object-centric temporal modeling for efficient multi-view 3d object detection. arXiv preprint arXiv:2303.11926, 2023.
- Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 913–922, 2021.
- Monocular 3d object detection with depth from motion. In ECCV, 2022a.
- Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022b.
- Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In CVPR, pages 8445–8453, 2019.
- Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022c.
- Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677, 2020.
- Mononerd: Nerf-like representations for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6814–6824, 2023.
- Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17830–17839, 2023.
- Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 579–588, 2021.
- Volo: Vision outlooker for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 45(5):6575–6586, 2022.
- Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3008, 2021a.
- Objects are different: Flexible monocular 3d object detection. In CVPR, pages 3289–3298, 2021b.
- Dimension embeddings for monocular 3d object detection. In CVPR, 2022.
- Potter: Pooling attention transformer for efficient human mesh recovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1620, 2023.
- Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.