- The paper introduces a stereo visual-inertial SLAM approach using learned whole-image descriptors for real-time loop detection and kidnap recovery under large viewpoint differences.
- A modified NetVLAD architecture with decoupled convolutions achieves approximately three times faster computation and fewer parameters, while a novel "all-pair loss" improves training convergence.
- Experimental evaluation demonstrates high precision-recall that outperforms traditional methods, enabling more accurate robotic navigation; the work also contributes a new loss function for descriptor learning.
Overview of Learning Whole-Image Descriptors for Real-Time Loop Detection and Kidnap Recovery
The paper "Learning Whole-Image Descriptors for Real-time Loop Detection and Kidnap Recovery under Large Viewpoint Difference" by Manohar Kuse and Shaojie Shen introduces a sophisticated approach to stereo visual-inertial SLAM capable of handling challenging kidnap scenarios in real-time. This work extends the capabilities of SLAM systems by addressing the problem of large viewpoint differences in loop detection and introduces a novel way to train whole-image descriptors using deep learning methods.
The authors focus on improving the efficiency and accuracy of SLAM systems in environments where the viewpoint changes significantly. To achieve this, the paper presents a method based on a modified NetVLAD architecture with decoupled convolutions, which outperforms the standard implementation in computational efficiency and memory usage: it computes descriptors roughly three times faster with significantly fewer model parameters, making real-time operation practical.
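The efficiency gain from decoupling can be seen in a simple parameter count. The sketch below assumes a depthwise-separable-style factorization (a spatial filter per channel followed by a 1x1 channel-mixing step); the paper's exact factorization may differ, but the arithmetic illustrates why decoupling shrinks the model:

```python
# Parameter counts for a standard 2-D convolution versus a "decoupled"
# (depthwise + pointwise) factorization. Illustrative only -- the paper's
# exact layer design may differ from this MobileNet-style split.

def standard_conv_params(c_in, c_out, k):
    # one k x k filter for every (input channel, output channel) pair
    return c_in * c_out * k * k

def decoupled_conv_params(c_in, c_out, k):
    # depthwise stage: one k x k filter per input channel
    # pointwise stage: 1 x 1 filters that mix channels
    return c_in * k * k + c_in * c_out

if __name__ == "__main__":
    c_in, c_out, k = 256, 256, 3
    std = standard_conv_params(c_in, c_out, k)   # 589824
    dec = decoupled_conv_params(c_in, c_out, k)  # 67840
    print(f"standard: {std}, decoupled: {dec}, ratio: {std / dec:.1f}x")
```

For a typical 256-channel, 3x3 layer this factorization cuts the parameter count by almost an order of magnitude, which is where both the speedup and the memory savings come from.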
In addressing the training of these descriptors, the paper identifies limitations of the triplet loss traditionally used to train NetVLAD. An "all-pair loss" function is proposed and shown experimentally to converge better. The proposed system can maintain multiple coordinate systems and effectively reduces odometry drift, particularly in scenarios involving extensive changes in viewpoint.
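The intuition behind an all-pair formulation is that a triplet loss looks at one sampled (positive, negative) comparison per query, whereas an all-pair loss penalizes every positive-negative pair in the batch. The sketch below is a hinge-style illustration of that idea, not the paper's exact formulation:

```python
import numpy as np

def triplet_loss(q, positives, negatives, margin=0.5):
    """One comparison: the first positive against the hardest negative."""
    d_pos = np.linalg.norm(q - positives[0])
    d_neg = min(np.linalg.norm(q - n) for n in negatives)
    return max(0.0, d_pos - d_neg + margin)

def all_pair_loss(q, positives, negatives, margin=0.5):
    """Hinge over EVERY (positive, negative) pair for this query.
    Illustrative sketch only -- the paper's loss may differ in detail."""
    total = 0.0
    for p in positives:
        d_p = np.linalg.norm(q - p)
        for n in negatives:
            d_n = np.linalg.norm(q - n)
            total += max(0.0, d_p - d_n + margin)
    return total / (len(positives) * len(negatives))
```

Because every pair contributes a gradient signal instead of only the sampled triplet, training tends to see denser supervision per batch, which is consistent with the faster convergence the authors report.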
Experimental Evaluation
The paper presents a thorough experimental evaluation across multiple datasets to confirm the efficacy of the proposed methods. The authors report high precision-recall from the whole-image descriptors, particularly in non-fronto-parallel scenarios, demonstrating strong loop detection under varied conditions. The quantitative results underscore the descriptor's representation power through comparative analyses against traditional bag-of-visual-words and contemporary CNN-based approaches.
Implications and Future Directions
The implications of this research are twofold. Practically, the enhanced SLAM system enables more accurate robotic navigation and mapping even in environments previously considered challenging. This capability is crucial for applications requiring persistent autonomous operation in dynamic settings, such as indoor robotics and autonomous vehicles. Theoretically, the introduction of a novel cost function (all-pair loss) for learning descriptors opens pathways for further work in optimizing neural networks for specific SLAM challenges.
Looking forward, the confluence of high-speed, weakly supervised learning approaches with SLAM systems invites expansion toward broader applications. Future research might explore integrating semantic information to complement visual place recognition, improving SLAM systems' long-term autonomy and effectiveness under more complex conditions. Additionally, enhancing scalability could involve incorporating product quantization techniques to manage larger datasets without compromising real-time performance.
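Product quantization, mentioned above as a scalability direction, compresses each descriptor by splitting it into subvectors and storing only per-subvector codebook indices. A minimal sketch with pre-built (hypothetical) codebooks, omitting the k-means training step:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode descriptor x as one centroid index per subvector.

    codebooks: (m, k, d) array -- m subspaces, k centroids of dimension d.
    With k <= 256, each descriptor compresses to just m bytes.
    """
    m, k, d = codebooks.shape
    subvectors = x.reshape(m, d)
    return np.array(
        [np.argmin(np.linalg.norm(cb - s, axis=1))
         for cb, s in zip(codebooks, subvectors)],
        dtype=np.uint8,
    )

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

if __name__ == "__main__":
    # toy example: an 8-dim descriptor split into 2 subvectors of dim 4
    codebooks = np.arange(32, dtype=float).reshape(2, 4, 4)
    x = np.concatenate([codebooks[0][2], codebooks[1][1]])
    codes = pq_encode(x, codebooks)
    print(codes)  # [2 1] -- 2 bytes instead of 8 floats
```

For a real NetVLAD-style descriptor (thousands of dimensions of float32), this kind of compression can shrink the loop-closure database by two orders of magnitude, which is what makes city-scale databases tractable without abandoning real-time search.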
This work, released as open source and built on the VINS-Fusion framework, marks an incremental yet vital step in refining SLAM techniques to better tackle the multifaceted challenges posed by real-world environments.