
PRAM: Place Recognition Anywhere Model for Efficient Visual Localization

Published 11 Apr 2024 in cs.CV and cs.RO (arXiv:2404.07785v2)

Abstract: Visual localization is a key technique for a variety of applications, e.g., autonomous driving, AR/VR, and robotics. In these real applications, both efficiency and accuracy are important, especially on edge devices with limited computing resources. However, previous frameworks, e.g., absolute pose regression (APR), scene coordinate regression (SCR), and the hierarchical method (HM), sacrifice either accuracy or efficiency in indoor and outdoor environments. In this paper, we propose the place recognition anywhere model (PRAM), a new framework that performs visual localization efficiently and accurately by recognizing 3D landmarks. Specifically, PRAM first generates landmarks directly in 3D space in a self-supervised manner. Without relying on commonly used classic semantic labels, these 3D landmarks can be defined anywhere in indoor and outdoor scenes with higher generalization ability. Representing the map with 3D landmarks, PRAM discards global descriptors, repetitive local descriptors, and redundant 3D points, increasing memory efficiency significantly. Then, sparse keypoints, rather than dense pixels, are used as input tokens to a transformer-based recognition module for landmark recognition, which enables PRAM to recognize hundreds of landmarks with high time and memory efficiency. At test time, sparse keypoints and predicted landmark labels are used for outlier removal and landmark-wise 2D-3D matching as opposed to exhaustive 2D-2D matching, which further increases time efficiency. A comprehensive evaluation of APRs, SCRs, HMs, and PRAM on both indoor and outdoor datasets demonstrates that PRAM outperforms APRs and SCRs in large-scale scenes by a large margin and achieves accuracy competitive with HMs while reducing memory cost by over 90% and running 2.4 times faster, leading to a better balance between efficiency and accuracy.

Summary

  • The paper introduces PRAM, a model that localizes images efficiently via a two-stage landmark recognition and registration process, cutting runtime by 2.4x and storage by over 90% relative to hierarchical methods.
  • The paper employs a self-supervised, map-centric approach that defines 3D landmarks on sparse keypoints, eliminating manual labeling and reducing redundant computation (a minimal clustering sketch follows this list).
  • The paper validates PRAM's high accuracy and scalability across multiple datasets, showcasing its versatility in diverse indoor and outdoor environments.
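
The landmark generation step clusters the sparse 3D points of a structure-from-motion map and propagates each cluster id to the 2D keypoints that observe those points. The snippet below is a minimal sketch of that idea, not the paper's implementation: the paper describes the step only as self-supervised generation in 3D space, so plain k-means is used as a stand-in here, and the function names, array shapes, and the choice of 512 landmarks are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def generate_landmarks(points3d, n_landmarks=512):
    """Cluster SfM 3D points into landmarks.

    points3d: (N, 3) array of map point positions.
    Returns an (N,) array of integer landmark ids in [0, n_landmarks).
    """
    kmeans = KMeans(n_clusters=n_landmarks, n_init=10, random_state=0)
    return kmeans.fit_predict(points3d)


def label_keypoints(point_ids, point_labels):
    """Propagate landmark ids from 3D points to the 2D keypoints of an image.

    point_ids: (K,) index of the 3D point each keypoint observes,
               or -1 for keypoints without a 3D correspondence.
    point_labels: (N,) landmark ids from generate_landmarks.
    """
    labels = np.full(point_ids.shape, -1, dtype=np.int64)  # -1 = background
    valid = point_ids >= 0
    labels[valid] = point_labels[point_ids[valid]]
    return labels
```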

PRAM: Transforming Visual Localization through Place Recognition Anywhere Model

Introduction

Visual localization is pivotal to applications such as augmented/virtual reality (AR/VR), autonomous driving, and robotics. Established frameworks, namely Absolute Pose Regression (APR), Scene Coordinate Regression (SCR), and Hierarchical Methods (HM), have achieved significant milestones, yet each trades time and memory efficiency against accuracy, especially in large-scale scenes. Drawing inspiration from how humans localize by recognizing and then verifying landmarks, the Place Recognition Anywhere Model (PRAM) introduces a new paradigm that achieves efficient and accurate visual localization across varied environments.

Landmark Recognition and Registration

PRAM consists of two stages: landmark recognition and registration. It adopts a map-centric strategy that defines landmarks directly on 3D points rather than on semantic objects, so unique landmarks can be identified in both indoor and outdoor scenes. This removes the need for laborious manual labeling: landmark generation is fully self-supervised. For recognition, PRAM uses a transformer-based neural network operating on sparse keypoints extracted from images (a minimal sketch follows). Compared with traditional dense-pixel methods, this reduces the time and memory footprint substantially while retaining high recognition accuracy. The model first narrows localization down to a coarse location through landmark recognition, then performs landmark-wise verification for precise pose estimation, running 2.4 times faster and requiring over 90% less storage than existing hierarchical approaches.
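
A hedged sketch of the recognition stage, assuming a generic transformer encoder rather than the authors' exact architecture: each sparse keypoint becomes one token built from its local descriptor and its projected image position, and a per-token classifier predicts a landmark id (plus a background class for keypoints belonging to no landmark). All dimensions and layer counts here are illustrative.

```python
import torch
import torch.nn as nn


class SparseLandmarkRecognizer(nn.Module):
    def __init__(self, desc_dim=128, d_model=256, n_landmarks=512,
                 n_heads=4, n_layers=6):
        super().__init__()
        self.desc_proj = nn.Linear(desc_dim, d_model)
        self.pos_proj = nn.Linear(2, d_model)  # normalized (x, y) keypoint positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # +1 class for keypoints that belong to no landmark (background)
        self.classifier = nn.Linear(d_model, n_landmarks + 1)

    def forward(self, descs, kpts):
        # descs: (B, K, desc_dim) local descriptors; kpts: (B, K, 2) in [-1, 1]
        tokens = self.desc_proj(descs) + self.pos_proj(kpts)
        tokens = self.encoder(tokens)   # self-attention over sparse tokens only
        return self.classifier(tokens)  # (B, K, n_landmarks + 1) logits
```

Because the token count is the number of detected keypoints (typically a few hundred to a few thousand) rather than the number of pixels, self-attention stays cheap, which is the source of the time and memory savings the paper reports over dense recognition.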

Advantages and Contributions

PRAM's methodology introduces several advantages:

  • Efficiency in Large-Scale Scenes: By transforming global reference search into landmark recognition, PRAM demonstrates superior time and memory efficiency.
  • Reduction in Redundant Computations: The model filters potential outliers using predicted landmark labels and performs landmark-wise, semantic-aware registration, cutting down unnecessary matching (see the registration sketch after this list).
  • Flexibility and Extensibility: The framework accommodates multi-modality data, laying groundwork for advancements in visual localization like map-centric feature learning and sparse scene coordinate regression.
  • Significant Memory Savings: PRAM achieves substantial reductions in storage requirements by eliminating the need for storing extensive global and local descriptors.
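
The registration stage can be sketched as follows, under assumed names and data layout: `landmark_db` (a hypothetical dict mapping each landmark id to its 3D points and descriptors), `match_descriptors` (a stand-in for any nearest-neighbour or learned matcher), and all thresholds are illustrative, not the paper's implementation. The abstract specifies landmark-wise 2D-3D matching followed by pose estimation; PnP inside a RANSAC loop is the standard solver for that step, used here via OpenCV.

```python
import cv2
import numpy as np


def match_descriptors(d1, d2, thresh=0.9):
    """Mutual nearest-neighbour matching on L2-normalized descriptors."""
    sim = d1 @ d2.T
    nn12, nn21 = sim.argmax(1), sim.argmax(0)
    return [(i, j) for i, j in enumerate(nn12)
            if nn21[j] == i and sim[i, j] > thresh]


def localize(kpts2d, descs2d, labels2d, landmark_db, K):
    """kpts2d: (N, 2) pixel coords; labels2d: (N,) predicted landmark ids;
    landmark_db: dict mapping landmark id -> (points3d, descs3d);
    K: 3x3 camera intrinsics."""
    pts2d, pts3d = [], []
    for label in np.unique(labels2d):
        if label not in landmark_db:   # skip background / unrecognized labels
            continue
        sel = labels2d == label
        points3d, descs3d = landmark_db[label]
        # match only within this landmark, not against the whole map
        for i, j in match_descriptors(descs2d[sel], descs3d):
            pts2d.append(kpts2d[sel][i])
            pts3d.append(points3d[j])
    if len(pts3d) < 4:                 # PnP needs at least 4 correspondences
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts3d, np.float64), np.asarray(pts2d, np.float64),
        K, None, reprojectionError=8.0)
    return (rvec, tvec) if ok else None
```

Restricting matching to the recognized landmarks is what replaces exhaustive 2D-2D matching against retrieved reference images, and the predicted labels double as an outlier filter before the RANSAC loop.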

Implications and Future Directions

Beyond its immediate efficiency and accuracy gains, PRAM suggests several future research directions: enhanced landmark definition strategies, adaptive landmark generation, and the integration of multi-modal inputs for improved recognition accuracy. Its map-centric feature learning and its potential for sparse, large-scale scene coordinate regression also present opportunities for the broader AI and computer vision communities to explore.

Experimentation and Results

Evaluated on the 7Scenes, 12Scenes, CambridgeLandmarks, and Aachen Day-Night datasets, PRAM outperforms APR and SCR methods by a large margin in large-scale scenes and matches the accuracy of hierarchical methods while running 2.4 times faster with over 90% less storage, striking a better balance between efficiency and accuracy.

Conclusion

In summary, PRAM advances visual localization with an efficient and accurate landmark-recognition framework that generalizes across scales and settings. By combining self-supervised landmark generation, sparse transformer-based recognition, and landmark-wise registration, it addresses the efficiency and scalability limitations of previous methods and lays a foundation for the research community to build on.
