Monocular Per-Object Distance Estimation with Masked Object Modeling
Abstract: Per-object distance estimation is critical in safety-sensitive applications such as surveillance and autonomous driving. While existing methods rely on geometric or deep supervised features, only a few attempts have been made to leverage self-supervised learning. In this respect, our paper draws inspiration from Masked Image Modeling (MiM) and extends it to multi-object tasks. MiM focuses on extracting global image-level representations and struggles to capture individual objects within the image. This is detrimental for distance estimation, as objects far away occupy negligible portions of the image. Conversely, our strategy, termed Masked Object Modeling (MoM), enables a novel application of masking techniques. In short, we devise an auxiliary objective that reconstructs the portions of the image pertaining to the objects detected in the scene. Training is performed in a single unified stage that simultaneously optimizes the masking objective and the downstream loss (i.e., distance estimation). We evaluate the effectiveness of MoM on a novel reference architecture (DistFormer) on the standard KITTI, NuScenes, and MOTSynth datasets. Our evaluation shows that the framework surpasses the state of the art (SoTA) and highlights its robust regularization properties. The MoM strategy enhances both zero-shot and few-shot capabilities when transferring from the synthetic to the real domain. Finally, it improves the robustness of the model in the presence of occluded or poorly detected objects. Code is available at https://github.com/apanariello4/DistFormer
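The single-stage objective described in the abstract can be illustrated with a minimal sketch. It assumes an L1 loss on per-object distances and a mean squared reconstruction error computed only on masked object patches. The function name `joint_loss`, the array layout, and the balancing coefficient `weight` are hypothetical and are not taken from the DistFormer code; the actual losses used by the authors may differ.

```python
import numpy as np


def joint_loss(pred_dist, true_dist, recon, target, masked, weight=1.0):
    """Combine a distance loss with a masked reconstruction loss.

    pred_dist, true_dist: (num_objects,) per-object distances.
    recon, target: (num_objects, num_patches, patch_dim) patch pixels
        cropped from the detected objects.
    masked: (num_objects, num_patches) boolean, True where a patch was hidden.
    weight: hypothetical coefficient balancing the two terms (assumption).
    """
    # Downstream term: L1 error on per-object distances (assumed form).
    dist_loss = float(np.abs(pred_dist - true_dist).mean())

    # Auxiliary term: squared error averaged over each patch, then
    # averaged over the masked patches only.
    per_patch = ((recon - target) ** 2).mean(axis=-1)
    n_masked = max(int(masked.sum()), 1)  # guard against an empty mask
    recon_loss = float((per_patch * masked).sum() / n_masked)

    # Both terms are summed and optimized together in one stage.
    return dist_loss + weight * recon_loss, dist_loss, recon_loss


# Toy usage: two objects, three patches each, four values per patch.
pred = np.array([10.0, 20.0])
true = np.array([12.0, 19.0])
recon = np.zeros((2, 3, 4))
target = np.ones((2, 3, 4))
masked = np.array([[True, False, False], [False, True, True]])
total, d_loss, r_loss = joint_loss(pred, true, recon, target, masked, weight=0.5)
```

Restricting the reconstruction term to masked object patches keeps the auxiliary signal focused on the objects themselves, which matches the motivation that distant objects cover only a small part of the image.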