
Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM

Published 27 Apr 2024 in cs.CV and cs.RO | (2404.17922v1)

Abstract: Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work, SI Maps [1], showed that instance-level information and semantic understanding of an environment significantly improve performance on language-guided tasks. We extend this instance-level approach to 3D while making the pipeline more robust and improving quantitative and qualitative results. Our method leverages foundation models for object recognition, image segmentation, and feature extraction. We propose a representation that yields a 3D point cloud map with instance-level embeddings, providing the semantic understanding that natural language commands can query. Quantitatively, the work improves the success rate of language-guided tasks. Qualitatively, we observe clearer instance identification, and the foundation models' language- and image-aligned embeddings allow identifying objects that a closed-set approach could not.
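At query time, the abstract's core idea, a 3D map whose instances carry language- and image-aligned embeddings that natural-language commands can query, reduces to a similarity search in embedding space. A minimal sketch of that step follows; the 4-D toy vectors and the `rank_instances` helper are illustrative stand-ins for real foundation-model (e.g. CLIP-style) features, not the paper's actual implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_instances(instance_embeddings, query_embedding):
    """Rank map instances by similarity to a language-query embedding.

    Each instance in the 3D point cloud map carries one language- and
    image-aligned embedding; the text query is embedded into the same
    space, and instances are ordered by cosine similarity.
    """
    scores = [cosine(e, query_embedding) for e in instance_embeddings]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

# Toy 4-D embeddings standing in for real foundation-model features.
instances = [
    [1.0, 0.0, 0.0, 0.0],   # instance 0: e.g. a chair
    [0.0, 1.0, 0.0, 0.0],   # instance 1: e.g. a sofa
    [0.9, 0.1, 0.0, 0.0],   # instance 2: another chair-like instance
]
query = [1.0, 0.05, 0.0, 0.0]  # embedding of "go to the chair"

order, scores = rank_instances(instances, query)
print(order)  # → [0, 2, 1]: both chair-like instances outrank the sofa
```

Because the embeddings are language-aligned rather than tied to a fixed label set, a query for a category never seen during mapping can still match the nearest instance, which is what enables the open-set behaviour the abstract describes.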

References (40)
  1. Instance-level semantic maps for vision language navigation. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, August 2023.
  2. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.
  3. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, 56(1):365–427, 2023.
  4. ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. arXiv, 2023.
  5. A survey on active simultaneous localization and mapping: State of the art and new frontiers. IEEE Transactions on Robotics, 2023.
  6. Open-vocabulary queryable scene representations for real world planning. arXiv preprint arXiv:2209.09874, 2022.
  7. Masked-attention mask transformer for universal image segmentation. arXiv, 2021.
  8. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017.
  9. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  10. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
  11. GOAT: Go to any thing, 2023.
  12. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  13. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
  14. Towards learning a generic agent for vision-and-language navigation via pre-training. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13134–13143, 2020.
  15. HOP: History-and-order aware pre-training for vision-and-language navigation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15397–15406, Los Alamitos, CA, USA, June 2022. IEEE Computer Society.
  16. Meta-Explore: Exploratory hierarchical vision-and-language navigation using scene object spectrum grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  17. Sequence-agnostic multi-object navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9573–9579, 2023.
  18. Volumetric instance-aware semantic mapping and 3D object discovery. IEEE Robotics and Automation Letters, 4(3):3037–3044, 2019.
  19. Volumetric instance-level semantic mapping via multi-view 2D-to-3D label diffusion. IEEE Robotics and Automation Letters, 7(2):3531–3538, 2022.
  20. Volumetric semantically consistent 3D panoptic mapping. arXiv preprint arXiv:2309.14737, 2023.
  21. Segment Anything, 2023.
  22. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
  23. Panoptic segmentation, 2019.
  24. OpenMask3D: Open-vocabulary 3D instance segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  25. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv, 2023.
  26. SSR-2D: Semantic 3D scene reconstruction from 2D images. arXiv preprint arXiv:2302.03640, 2023.
  27. A survey on real-time 3D scene reconstruction with SLAM methods in embedded systems, 2023.
  28. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  29. ConceptFusion: Open-set multimodal 3D mapping. Robotics: Science and Systems (RSS), 2023.
  30. RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. Journal of Field Robotics, 36(2):416–446, 2019.
  31. g2o: A general framework for graph optimization. In 2011 IEEE International Conference on Robotics and Automation, pages 3607–3613, 2011.
  32. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv, 2024.
  33. Recognize Anything: A strong image tagging model. arXiv, 2023.
  34. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
  35. OpenAI. ChatGPT. https://openai.com/blog/chatgpt.
  36. Matterport3D: Learning from RGB-D data in indoor environments. arXiv, 2017.
  37. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019.
  38. Analyzing generalization of vision and language navigation to unseen outdoor areas. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7519–7532, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  39. Ground then navigate: Language-guided navigation in dynamic scenes. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 4113–4120, 2023.
  40. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint, 2021.
