
Do Visual-Language Grid Maps Capture Latent Semantics?

Published 15 Mar 2024 in cs.RO (arXiv:2403.10117v2)

Abstract: Visual-language models (VLMs) have recently been introduced in robotic mapping, where their latent representations, i.e., embeddings, are used to represent semantics in the map. They allow moving from a limited set of human-created labels toward open-vocabulary scene understanding, which is valuable for robots operating in complex real-world environments and interacting with humans. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, a rigorous analysis of the quality of maps built from these embeddings is missing. In this paper, we propose a way to analyze the quality of maps created using VLMs. We investigate two critical properties of map quality: queryability and distinctness. The evaluation of queryability addresses the ability to retrieve information from the embeddings. We investigate intra-map distinctness to study the ability of the embeddings to represent abstract semantic classes, and inter-map distinctness to evaluate the generalization properties of the representation. We propose metrics to evaluate these properties and use them to assess two state-of-the-art mapping methods, VLMaps and OpenScene, with two encoders, LSeg and OpenSeg, on real-world data from the Matterport3D dataset. Our findings show that while 3D features improve queryability, they are not scale invariant, whereas image-based embeddings generalize to multiple map resolutions. This allows the image-based methods to maintain smaller map sizes, which can be crucial for real-world deployment. Furthermore, we show that the choice of encoder affects the results. The results imply that properly thresholding open-vocabulary queries remains an open problem.
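To make the two properties concrete, the sketch below shows one plausible way to pose an open-vocabulary query against a grid map of VLM embeddings (queryability) and a simple separability proxy for intra-map distinctness. This is a minimal illustration under assumptions, not the paper's implementation: the function names, the 0.25 similarity threshold, the random stand-in features, and the silhouette-score proxy are all hypothetical choices; a real pipeline would encode the query with the text encoder matching the map's visual encoder (e.g., CLIP for LSeg-built maps).

```python
import numpy as np
from sklearn.metrics import silhouette_score

def query_grid_map(cell_embeddings, text_embedding, threshold=0.25):
    """Score every map cell against an open-vocabulary text query.

    cell_embeddings: (N, D) array, one fused VLM embedding per grid cell
                     (stand-in for LSeg/OpenSeg features here).
    text_embedding:  (D,) embedding of the query string from the matching
                     text encoder.
    threshold:       similarity cutoff; the paper argues that choosing
                     this value properly is an open problem.
    """
    # L2-normalize so the dot product equals cosine similarity.
    cells = cell_embeddings / np.linalg.norm(cell_embeddings, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    scores = cells @ text                  # (N,) cosine similarities
    return scores, scores >= threshold     # per-cell score and hit mask

# Toy usage with random data in place of real map features.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 512))         # 1000 cells, 512-dim embeddings
query = rng.normal(size=512)               # stand-in for an encoded query
scores, hits = query_grid_map(emb, query)
print(scores.shape, int(hits.sum()))

# Intra-map distinctness proxy (not the paper's exact metric): how well do
# cell embeddings separate into annotated semantic classes? Values near 0
# suggest the classes are entangled in embedding space.
labels = rng.integers(0, 5, size=1000)     # hypothetical per-cell class labels
print(silhouette_score(emb, labels, metric="cosine"))
```

The explicit `threshold` parameter highlights the open problem the abstract ends on: cosine scores between a text query and map cells are rarely near zero even for unrelated content, so a single cutoff that works across queries, encoders, and scenes is hard to choose.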

References (37)
  1. S. Garg et al., “Semantics for Robotic Mapping, Perception and Interaction: A Survey,” Foundations and Trends® in Robotics, vol. 8, no. 1–2, pp. 1–224, 2020, arXiv:2101.00443 [cs].
  2. A. Bendale and T. E. Boult, “Towards Open Set Deep Networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 1563–1572.
  3. C. Geng, S.-J. Huang, and S. Chen, “Recent Advances in Open Set Recognition: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3614–3631, Oct. 2021.
  4. A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  5. A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
  6. C. Huang, O. Mees, A. Zeng, and W. Burgard, “Audio visual language maps for robot navigation,” 2023, arXiv:2303.07522 [cs].
  7. M. Tenorth and M. Beetz, “KNOWROB - knowledge processing for autonomous personal robots,” in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. St. Louis, MO, USA: IEEE, Oct. 2009, pp. 4261–4266.
  8. C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual Language Maps for Robot Navigation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). London, United Kingdom: IEEE, May 2023, pp. 10608–10615.
  9. S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, Jun. 2023, pp. 23171–23181.
  10. D. Shah, B. Osiński, S. Levine et al., “LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action,” in Conference on Robot Learning. Atlanta, GA, USA: PMLR, Nov. 2023, pp. 492–504.
  11. S. Peng et al., “OpenScene: 3D scene understanding with open vocabularies,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, Jun. 2023, pp. 815–824.
  12. A. Chang et al., “Matterport3D: Learning from RGB-D Data in Indoor Environments,” Sep. 2017, arXiv:1709.06158 [cs].
  13. B. Chen et al., “Open-vocabulary queryable scene representations for real world planning,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). London, UK: IEEE, May 2023, pp. 11509–11522.
  14. Y. Yuan and A. Nüchter, “Uni-Fusion: Universal continuous mapping,” IEEE Transactions on Robotics, vol. 40, pp. 1373–1392, Jan. 2024.
  15. B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven Semantic Segmentation,” Apr. 2022, arXiv:2201.03546 [cs].
  16. G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Scaling open-vocabulary image segmentation with image-level labels,” in European Conference on Computer Vision. Tel Aviv, Israel: Springer, Oct. 2022, pp. 540–557.
  17. H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba, “Open Vocabulary Scene Parsing,” in 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, Oct. 2017, pp. 2021–2029.
  18. M. A. Bravo, S. Mittal, S. Ging, and T. Brox, “Open-vocabulary Attribute Detection,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, Jun. 2023, pp. 7041–7050.
  19. A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, “Zero-Shot Object Detection,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., Lecture Notes in Computer Science, vol. 11205. Cham: Springer International Publishing, Sep. 2018, pp. 397–414.
  20. A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, “Open-Vocabulary Object Detection Using Captions,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, Jun. 2021, pp. 14388–14397.
  21. X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary Object Detection via Vision and Language Knowledge Distillation,” May 2022, arXiv:2104.13921 [cs].
  22. C. Feng et al., “PromptDet: Towards open-vocabulary detection using uncurated images,” in European Conference on Computer Vision. Tel Aviv, Israel: Springer, Oct. 2022, pp. 701–717.
  23. A. I. Wagan, A. Godil, and X. Li, “Map quality assessment,” in Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems. Gaithersburg, MD, USA: ACM, Aug. 2008, pp. 278–282.
  24. T. P. Kucner, M. Luperto, S. Lowry, M. Magnusson, and A. J. Lilienthal, “Robust Frequency-Based Structure Extraction,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). Xi’an, China: IEEE, May 2021, pp. 1715–1721.
  25. S. Aravecchia, M. Clausel, and C. Pradalier, “Comparing metrics for evaluating 3D map quality in natural environments,” Robotics and Autonomous Systems, vol. 173, p. 104617, Mar. 2024.
  26. R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, Jun. 2013, pp. 1352–1359.
  27. J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, “Fusion++: Volumetric Object-Level SLAM,” in 2018 International Conference on 3D Vision (3DV). Verona, Italy: IEEE, Sep. 2018, pp. 32–41.
  28. M. Runz, M. Buffier, and L. Agapito, “MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects,” in 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Munich, Germany: IEEE, Oct. 2018, pp. 10–20.
  29. N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Meaningful maps with object-oriented semantic mapping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vancouver, BC, Canada: IEEE, Sep. 2017, pp. 5079–5085.
  30. W. Chen, S. Hu, R. Talak, and L. Carlone, “Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding,” Nov. 2023, arXiv:2209.05629 [cs].
  31. G. Narita, T. Seno, T. Ishikawa, and Y. Kaji, “PanopticFusion: Online Volumetric Semantic Mapping at the Level of Stuff and Things,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Macau, China: IEEE, Nov. 2019, pp. 4205–4212.
  32. R. Adams and L. Bischof, “Seeded region growing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, pp. 641–647, Jun. 1994.
  33. C. Huang, O. Mees, A. Zeng, and W. Burgard, “VLMaps,” GitHub repository. [Online]. Available: https://github.com/vlmaps/vlmaps
  34. M. Savva et al., “Habitat: A platform for embodied AI research,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, Oct. 2019, pp. 9338–9346.
  35. A. Szot et al., “Habitat 2.0: Training home assistants to rearrange their habitat,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 251–266.
  36. X. Puig et al., “Habitat 3.0: A co-habitat for humans, avatars and robots,” 2023, arXiv:2310.13724 [cs].
  37. W. H. Kruskal and W. A. Wallis, “Use of ranks in one-criterion variance analysis,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, Apr. 1952.
