Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM
Abstract: Humans excel at forming mental maps of their surroundings, which lets them understand object relationships and navigate based on language queries. Our previous work, SI Maps [1], showed that combining instance-level information with a semantic understanding of the environment significantly improves performance on language-guided tasks. We extend this instance-level approach to 3D while making the pipeline more robust and improving both quantitative and qualitative results. Our method leverages foundation models for object recognition, image segmentation, and feature extraction. We propose a representation that yields a 3D point cloud map with instance-level embeddings, which provide the semantic understanding that natural language commands can query. Quantitatively, the work improves the success rate on language-guided tasks. Qualitatively, we observe that instances are identified more clearly, and that foundation models combined with language- and image-aligned embeddings let the map identify objects that a closed-set approach could not.
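To make the idea of querying instance-level embeddings concrete, here is a minimal sketch, not the authors' implementation. It assumes each mapped instance stores a centroid and one embedding vector in a shared language-image space, and that the text command has already been encoded into the same space. The names `Instance`, `cosine`, and `query_map`, and the three-dimensional toy vectors, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    """One object instance in the map (hypothetical structure)."""
    instance_id: int
    centroid: tuple   # (x, y, z) centre of the instance's 3D points
    embedding: list   # feature vector in a shared language-image space


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_map(instances, text_embedding, top_k=1):
    """Return the top_k instances whose embeddings best match the query."""
    ranked = sorted(
        instances,
        key=lambda inst: cosine(inst.embedding, text_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy map with made-up embeddings; a real map would hold
# high-dimensional features produced by a vision-language encoder.
toy_map = [
    Instance(0, (1.0, 0.5, 0.0), [0.9, 0.1, 0.0]),
    Instance(1, (4.0, 2.0, 0.0), [0.1, 0.9, 0.1]),
    Instance(2, (4.5, 2.2, 0.0), [0.0, 0.2, 0.9]),
]

# Stand-in for the encoded text command (assumed to be precomputed).
query = [0.0, 1.0, 0.0]

best = query_map(toy_map, query)[0]
print(best.instance_id, best.centroid)
```

Because the match is a similarity score rather than a lookup in a fixed label set, a query for an object class that was never explicitly labelled can still return the closest instance. A navigation stack could then use the returned centroid as a goal.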
- Instance-level semantic maps for vision language navigation. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, August 2023.
- Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.
- Visual language navigation: A survey and open challenges. Artificial Intelligence Review, 56(1):365–427, 2023.
- ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. arXiv, 2023.
- A survey on active simultaneous localization and mapping: State of the art and new frontiers. IEEE Transactions on Robotics, 2023.
- Open-vocabulary queryable scene representations for real world planning. arXiv preprint arXiv:2209.09874, 2022.
- Masked-attention mask transformer for universal image segmentation. arXiv, 2021.
- Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
- GOAT: Go to any thing, 2023.
- Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
- Towards learning a generic agent for vision-and-language navigation via pre-training. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13134–13143, 2020.
- HOP: History-and-order aware pretraining for vision-and-language navigation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15397–15406, Los Alamitos, CA, USA, June 2022. IEEE Computer Society.
- Meta-explore: Exploratory hierarchical vision-and-language navigation using scene object spectrum grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- Sequence-agnostic multi-object navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9573–9579, 2023.
- Volumetric instance-aware semantic mapping and 3D object discovery. IEEE Robotics and Automation Letters, 4(3):3037–3044, 2019.
- Volumetric instance-level semantic mapping via multi-view 2D-to-3D label diffusion. IEEE Robotics and Automation Letters, 7(2):3531–3538, 2022.
- Volumetric semantically consistent 3D panoptic mapping. arXiv preprint arXiv:2309.14737, 2023.
- Segment anything, 2023.
- Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
- Panoptic segmentation, 2019.
- OpenMask3D: Open-vocabulary 3D instance segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv, 2023.
- SSR-2D: Semantic 3D scene reconstruction from 2D images. arXiv preprint arXiv:2302.03640, 2023.
- A survey on real-time 3D scene reconstruction with SLAM methods in embedded systems, 2023.
- NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- ConceptFusion: Open-set multimodal 3D mapping. Robotics: Science and Systems (RSS), 2023.
- RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. Journal of Field Robotics, 36(2):416–446, 2019.
- G2o: A general framework for graph optimization. In 2011 IEEE International Conference on Robotics and Automation, pages 3607–3613, 2011.
- Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv, 2024.
- Recognize anything: A strong image tagging model. arXiv, 2023.
- A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
- OpenAI. ChatGPT. https://openai.com/blog/chatgpt.
- Matterport3D: Learning from RGB-D data in indoor environments. arXiv, 2017.
- Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019.
- Analyzing generalization of vision and language navigation to unseen outdoor areas. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7519–7532, Dublin, Ireland, May 2022. Association for Computational Linguistics.
- Ground then navigate: Language-guided navigation in dynamic scenes. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 4113–4120, 2023.
- Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint, 2021.