FreDSNet: Joint Monocular Depth and Semantic Segmentation with Fast Fourier Convolutions
Abstract: In this work we present FreDSNet, a deep learning solution that obtains semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images offer task-specific advantages for scene understanding problems because they provide 360-degree contextual information about the entire environment. However, the inherent characteristics of omnidirectional images make accurate object detection and segmentation, as well as good depth estimation, more difficult. To overcome these problems, we exploit convolutions in the frequency domain, obtaining a wider receptive field in each convolutional layer. These convolutions allow us to leverage the full contextual information of omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image by exploiting fast Fourier convolutions. Our experiments show that FreDSNet performs comparably to state-of-the-art methods specialized in semantic segmentation or depth estimation. The FreDSNet code is publicly available at https://github.com/Sbrunoberenguel/FreDSNet
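The key idea behind the wider receptive field is that a convolution applied in the frequency domain mixes information from every spatial location at once, since each frequency bin of the Fourier transform depends on the whole image. The following is a minimal sketch of such a spectral transform in the style of fast Fourier convolutions (Chi et al., 2020); all function names, shapes, and weights here are illustrative assumptions, not FreDSNet's actual implementation.

```python
import numpy as np

def spectral_conv(x, w_real, w_imag):
    """Illustrative spectral-transform step (assumed, simplified):
    transform the feature map to the frequency domain, apply a
    pointwise (1x1) complex channel mixing, and transform back.
    Because each frequency bin aggregates the whole image, a single
    layer effectively has a global receptive field.

    x: (C, H, W) real-valued feature map
    w_real, w_imag: (C, C) pointwise mixing weights in frequency space
    """
    # Real 2D FFT over the spatial dimensions -> (C, H, W//2 + 1) complex
    X = np.fft.rfft2(x, axes=(-2, -1))
    W = w_real + 1j * w_imag  # complex 1x1 "convolution" weights
    C, H, Wf = X.shape
    # Mix channels independently at every frequency bin
    Y = (W @ X.reshape(C, -1)).reshape(C, H, Wf)
    # Inverse FFT back to the spatial domain, same spatial size as input
    return np.fft.irfft2(Y, s=x.shape[-2:], axes=(-2, -1))

rng = np.random.default_rng(0)
x = np.zeros((2, 8, 16))
x[0, 3, 5] = 1.0  # a single non-zero input pixel
out = spectral_conv(x, rng.standard_normal((2, 2)), rng.standard_normal((2, 2)))
print(out.shape)  # (2, 8, 16): output keeps the input's spatial resolution
```

Even a single non-zero input pixel influences the output across the whole map, which is the property the paper exploits to capture the full 360-degree context of a panorama in each layer.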