SuperVINS: Real-time Visual-Inertial SLAM

Updated 17 February 2026
  • SuperVINS is a real-time visual-inertial SLAM framework integrating deep feature extraction and adaptive RANSAC to deliver robust tracking in low-light and motion-blur conditions.
  • It enhances VINS-Fusion by replacing traditional ORB features with SuperPoint and LightGlue, achieving up to 39.6% reduction in trajectory error in challenging scenarios.
  • The system employs a modular pipeline with sliding-window optimization and SuperPoint-based loop closure, maintaining real-time performance with efficient GPU acceleration.

SuperVINS is a real-time visual-inertial SLAM (Simultaneous Localization and Mapping) framework developed to address the robustness and accuracy limitations of traditional SLAM systems in challenging imaging conditions, such as low light and motion blur. Building on the established VINS-Fusion architecture, SuperVINS integrates deep learning-based feature extraction and matching modules, principally the SuperPoint and LightGlue neural networks, alongside an adaptive RANSAC-based enhancement strategy. The result is improved stability and trajectory tracking under degraded visual cues, with real-time performance maintained through efficient implementation and parallelized components (Luo et al., 2024).

1. System Design and Architecture

SuperVINS employs a modular pipeline structured to process synchronized monocular camera images ($I_k$) and inertial measurements ($u_k$) via the following key stages:

  • Preprocessing: Temporal calibration, geometric image undistortion, and compensation for IMU–camera extrinsic parameters.
  • Frontend:
    • SuperPoint network for feature detection and description.
    • LightGlue transformer model for sparse feature matching.
    • Adaptive RANSAC for geometric verification and outlier rejection.
    • IMU preintegration between keyframes.
  • Sliding-window Local Optimization: Tightly coupled visual-inertial bundle adjustment (BA) over the most recent $N$ state vector estimates.
  • Map Management: Keyframe selection and update conditioned on parallax or temporal thresholds.
  • Loop-closure Thread (Parallel):
    • Bag-of-words (DBoW3) vector quantization of SuperPoint descriptors.
    • Retrieval of candidate places, followed by geometric verification and pose-graph augmentation.
  • Global Pose-graph Optimization: Correction of accumulated drift using g2o or iSAM2 backends.
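
The map-management stage in the list above selects keyframes by parallax or temporal thresholds. A minimal sketch of such a rule follows. The class name, the mean-displacement definition of parallax, and both threshold values are hypothetical, since the source does not state the values SuperVINS uses.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class KeyframePolicy:
    # Hypothetical thresholds for illustration; not taken from the source.
    min_parallax_px: float = 10.0
    max_interval_s: float = 0.5

    def mean_parallax(self, prev_pts, curr_pts):
        """Average pixel displacement of features tracked across two frames."""
        if not prev_pts:
            return 0.0
        total = sum(hypot(cx - px, cy - py)
                    for (px, py), (cx, cy) in zip(prev_pts, curr_pts))
        return total / len(prev_pts)

    def is_keyframe(self, prev_pts, curr_pts, dt_since_last_kf):
        """Accept the frame when parallax OR elapsed time exceeds its threshold."""
        return (self.mean_parallax(prev_pts, curr_pts) >= self.min_parallax_px
                or dt_since_last_kf >= self.max_interval_s)


policy = KeyframePolicy()
# Two features that each moved 12 px: parallax alone triggers a keyframe.
print(policy.is_keyframe([(0, 0), (10, 10)], [(12, 0), (22, 10)], 0.1))
```

Either condition alone is enough to trigger a keyframe, so the window keeps being refreshed even when the platform hovers and parallax stays small.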

Compared with VINS-Fusion, the core modifications are the replacement of FAST+optical flow and ORB descriptors with SuperPoint deep descriptors, matching via LightGlue instead of brute-force Hamming on ORB, and loop-closure detection over a SuperPoint-trained DBoW3 vocabulary. All deep learning modules are served through ONNX and allow GPU acceleration on compatible devices.

2. Feature Extraction, Matching, and Geometric Verification

2.1 SuperPoint Feature Extraction

SuperPoint comprises an encoder–decoder CNN that outputs two entities for each image:

  • A heat-map $H \in \mathbb{R}^{(h/8) \times (w/8) \times 65}$, which is further post-processed with a softmax to produce per-pixel keypoint probabilities.
  • A semi-dense feature descriptor map $D' \in \mathbb{R}^{(h/8) \times (w/8) \times d}$; this is bicubically upsampled and $L_2$-normalized to yield dense descriptors $D \in \mathbb{R}^{h \times w \times d}$.
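
The softmax post-processing of the heat-map can be illustrated with a small pure-Python sketch. It assumes the standard SuperPoint layout, in which the first 64 channels of each coarse cell correspond to the pixels of an 8×8 patch and the 65th channel is a "no keypoint" dustbin. The function name and the nested-list data layout are illustrative choices, not the SuperVINS implementation.

```python
from math import exp

CELL = 8  # each coarse cell covers an 8x8 pixel patch


def decode_heatmap(logits):
    """Turn [h/8][w/8][65] logits into an [h][w] grid of keypoint probabilities."""
    hc, wc = len(logits), len(logits[0])
    prob = [[0.0] * (wc * CELL) for _ in range(hc * CELL)]
    for i in range(hc):
        for j in range(wc):
            cell = logits[i][j]
            m = max(cell)                    # subtract the max for numerical stability
            e = [exp(v - m) for v in cell]
            z = sum(e)                       # softmax denominator over all 65 channels
            # Assumed layout: channels 0..63 map row-major to the 8x8 patch;
            # channel 64 (dustbin) is dropped after normalisation.
            for c in range(CELL * CELL):
                prob[i * CELL + c // CELL][j * CELL + c % CELL] = e[c] / z
    return prob


# One coarse cell with a strong response on channel 9, i.e. pixel (row 1, col 1).
logits = [[[0.0] * 65]]
logits[0][0][9] = 5.0
prob = decode_heatmap(logits)
print(len(prob), len(prob[0]))  # 8 8
```

Because the dustbin takes part in the softmax but is then discarded, the probabilities inside a cell sum to less than one, which lets a cell express that it contains no keypoint.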

SuperPoint is trained using a unified detection and description loss:

$$f = L_p(X, Y) + L_p(X', Y') + \lambda \cdot L_d(D, D', S)$$

where $L_p$ denotes cross-entropy over keypoint presence and $L_d$ a contrastive loss enforcing descriptor consistency, with $\lambda$ balancing detection and description objectives.

2.2 LightGlue Feature Matching

Given the SuperPoint descriptor sets of two frames, LightGlue uses a transformer-based architecture to compute a soft-assignment matrix via alternating self- and cross-attention on the features of both frames. At each transformer layer:

  • Self-attention updates feature vectors within each image.
  • Cross-attention exchanges contextual information across images, refining correspondence hypotheses.

The matching process is supervised by a hierarchical assignment loss whose terms involve the ground-truth matches, the matchability predictions, and the sets of unmatchable points.
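
For orientation, the loss published with LightGlue has the form sketched below. It is an assumption, not something this text confirms, that SuperVINS relies on a matcher trained with exactly this objective, and the notation is chosen here for illustration: $\mathcal{M}$ is the set of ground-truth matches, $P^{(\ell)}$ the soft assignment predicted at layer $\ell$ of $L$ layers, $\sigma^{A,(\ell)}_i$ and $\sigma^{B,(\ell)}_j$ the matchability scores of points in images $A$ and $B$, and $\bar{\mathcal{A}}$, $\bar{\mathcal{B}}$ the sets of unmatchable points.

```latex
\mathcal{L} = -\frac{1}{L} \sum_{\ell=1}^{L} \left(
    \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \log P^{(\ell)}_{ij}
  + \frac{1}{2|\bar{\mathcal{A}}|} \sum_{i \in \bar{\mathcal{A}}} \log\left(1 - \sigma^{A,(\ell)}_{i}\right)
  + \frac{1}{2|\bar{\mathcal{B}}|} \sum_{j \in \bar{\mathcal{B}}} \log\left(1 - \sigma^{B,(\ell)}_{j}\right)
\right)
```

Averaging over layers supervises every intermediate prediction, which is one way to read the "hierarchical" qualifier.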

2.3 RANSAC-based Enhancement

To further reinforce matching robustness, the non-maximum suppression radius in SuperPoint is reduced (e.g., to […] px), allowing denser and more resilient feature tracks in poor conditions. From the LightGlue matches, minimal sets of four correspondences are sampled to estimate a planar homography using DLT. Matches are retained if they satisfy the geometric inlier threshold, which is adapted per sequence (e.g., […] px for EuRoC).
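
The sample-estimate-verify loop described above can be sketched in pure Python. To stay dependency-free, this sketch swaps the SVD-based DLT for a direct solve of the 8×8 linear system obtained by fixing $h_{33} = 1$, which is equivalent for a non-degenerate four-point sample. All function names are illustrative, and the threshold is a caller-supplied parameter because the value used by SuperVINS is not recoverable from this text.

```python
import random
from math import hypot, inf


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate sample")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def homography_from_4(pairs):
    """Homography (h33 fixed to 1) from four ((x, y), (u, v)) correspondences."""
    A, b = [], []
    for (x, y), (u, v) in pairs:
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def transfer_error(H, src, dst):
    """Pixel distance between H applied to src and the observed dst."""
    x, y = src
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return inf
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return hypot(u - dst[0], v - dst[1])


def ransac_filter(matches, threshold_px, iterations=50, seed=0):
    """Keep the largest set of matches consistent with one homography."""
    if len(matches) < 4:
        return []
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        try:
            H = homography_from_4(rng.sample(matches, 4))
        except ValueError:
            continue  # collinear or repeated points: draw another sample
        inliers = [m for m in matches if transfer_error(H, *m) < threshold_px]
        if len(inliers) > len(best):
            best = inliers
    return best
```

A typical call is `ransac_filter(matches, threshold_px=t)`, with `matches` a list of `((x, y), (u, v))` pixel pairs. Raising `t` retains more correspondences at the cost of admitting more outliers, which is the trade-off the adaptive threshold is meant to manage.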

3. Loop Closure and Global Consistency

SuperVINS’s loop closure pipeline is constructed on SuperPoint-based bag-of-words (BoW):

  • Vocabulary Construction: Descriptors from datasets (EuRoC, TUM, KITTI) are clustered using $k$-means to yield […] visual “words.” This vocabulary is managed by DBoW3.
  • Runtime Encoding: Each keyframe’s descriptors are mapped to a histogram with tf–idf weighting.
  • Loop Detection: Candidate loop frames are retrieved by histogram dot product. Frames whose score exceeds a similarity threshold (typically […]) are geometrically validated (via RANSAC-based geometric model fitting with a minimum inlier threshold).
  • Graph Augmentation: Upon acceptance, a loop-closure factor is added to the pose-graph for global consistency optimization.
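
The encoding and retrieval steps in the list above can be sketched as follows. This is a generic tf-idf bag-of-words scorer written for illustration, not the DBoW3 API. It assumes descriptors have already been quantized to integer word ids, and the $L_2$ normalization before the dot product is an assumption of this sketch.

```python
from collections import Counter
from math import log, sqrt


def idf_weights(database_words):
    """idf(w) = log(N / n_w): N keyframes in total, n_w of them containing word w."""
    n_images = len(database_words)
    doc_freq = Counter(w for words in database_words for w in set(words))
    return {w: log(n_images / n) for w, n in doc_freq.items()}


def bow_vector(words, idf):
    """Sparse, L2-normalised tf-idf histogram (word id -> weight) of one keyframe."""
    tf = Counter(words)
    vec = {w: (c / len(words)) * idf.get(w, 0.0) for w, c in tf.items()}
    norm = sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm > 0 else vec


def similarity(a, b):
    """Dot product of two sparse histograms."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(w, 0.0) for w, v in a.items())


# Word ids per keyframe (hypothetical data); word 1 occurs everywhere, so its idf is 0.
db = [[1, 2, 2, 3], [1, 4, 4, 5], [1, 2, 3, 3]]
idf = idf_weights(db)
v0, v1, v2 = (bow_vector(words, idf) for words in db)
print(similarity(v0, v2) > similarity(v0, v1))  # True
```

Words that occur in every keyframe receive zero weight, so only distinctive words contribute to the score that is compared against the loop-detection threshold.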

4. Optimization and State Estimation

SuperVINS employs a factor-graph representation in a sliding window over recent states. The optimizer minimizes a cost that sums three residual terms: the 3D-to-2D reprojection error, the preintegrated IMU error, and the loop-closure constraint.
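
A generic sliding-window objective with these three residual families can be written as below. The notation is assumed for illustration and is not taken from the source: $\mathcal{X}$ collects the states in the window, $z$ denotes measurements, and the sums run over feature observations (landmark $l$ seen in frame $j$), consecutive keyframes $k$, and loop pairs $(i, v)$. In practice each squared norm is weighted by the corresponding measurement covariance.

```latex
\min_{\mathcal{X}} \;
    \sum_{(l,j)} \left\| r_{\mathrm{vis}}\left(z_{lj}, \mathcal{X}\right) \right\|^{2}
  + \sum_{k} \left\| r_{\mathrm{imu}}\left(z_{k,k+1}, \mathcal{X}\right) \right\|^{2}
  + \sum_{(i,v)} \left\| r_{\mathrm{loop}}\left(z_{iv}, \mathcal{X}\right) \right\|^{2}
```

Because all three terms share the same state vector, visual and inertial errors are minimized jointly, which is what "tightly coupled" refers to.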

Nonlinear optimization is performed with Ceres or g2o (dense Schur complement). The sliding window holds […] keyframes, with keyframes marginalized as needed. Loop closures are incorporated asynchronously, triggering global graph optimization via iSAM2.

5. Implementation Parameters and Performance Budget

Key implementation and hyperparameter settings include:

| Component | Value/Setting | Comment |
|---|---|---|
| OS/Hardware | Ubuntu 18.04; RTX 2060 | CUDA 11.7, cuDNN 8.9.6, onnxruntime 1.16.3 |
| SuperPoint input | […] px | Internally downsampled by a factor of 8 |
| LightGlue model | Medium (256 heads, 12 layers) | About 3 ms/pair on GPU |
| Mask radius | […] px (EuRoC) | Denser features for challenging conditions |
| RANSAC iterations / inlier threshold | 50; […] px | Range […]–[…] px for jitter adaptation |
| Sliding window | […] keyframes | |
| Loop closure frequency | Every 10 keyframes | |
| Front-end budget | […] ms (CPU) | |
| LightGlue GPU budget | […] ms | |
| Total per-frame | […]–[…] ms | Fits real-time constraint |

6. Quantitative and Qualitative Evaluation

SuperVINS was validated on the EuRoC MAV dataset and compared against VINS-Fusion in terms of absolute trajectory error (ATE) and rotational and translational relative pose error (RPE).

| Sequence | VINS-Fusion ATE [m] | SuperVINS ATE [m] | Rot. RPE [rad/m] | Trans. RPE [m] | Notes |
|---|---|---|---|---|---|
| MH01 | 0.0911 | 0.0867 | 0.00203 | 0.00673 | |
| MH05 | 0.2621 | 0.1583 | 0.00379 | 0.00949 | 39.6% lower ATE |
| V202 | lost | 0.1003 | 0.00594 | 0.00850 | VINS-Fusion lost tracking |
| V203 | 0.1926 | 0.1687 | | | |

Qualitatively, SuperVINS demonstrated […] higher map point density in low-light and high-blur sequences, and its trajectories aligned visibly closer to ground truth in overlay plots. These results indicate a substantive improvement in both tracking reliability and mapping detail in adverse conditions.
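
The headline 39.6% figure can be checked directly from the ATE values reported above. The short script below recomputes the relative ATE reduction for each sequence where both systems completed the run. The helper name is illustrative.

```python
# (VINS-Fusion ATE [m], SuperVINS ATE [m]) per EuRoC sequence, as reported above.
# V202 is omitted because VINS-Fusion lost tracking and has no ATE to compare.
ate = {
    "MH01": (0.0911, 0.0867),
    "MH05": (0.2621, 0.1583),
    "V203": (0.1926, 0.1687),
}


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


reductions = {seq: round(relative_reduction(b, s), 1) for seq, (b, s) in ate.items()}
print(reductions)  # {'MH01': 4.8, 'MH05': 39.6, 'V203': 12.4}
```

The gain is small on the easy MH01 sequence and largest on MH05, consistent with the claim that the deep front end matters most under difficult imaging conditions.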

7. Ablation Studies and Empirical Insights

System component ablations and sensitivity analyses reveal:

  • SuperPoint vs. ORB: Replacing classical ORB+optical flow with SuperPoint yields 8–12% ATE reduction in challenging sequences; under nominal lighting, improvements are marginal (1–2%).
  • LightGlue + RANSAC: LightGlue reduces correspondence outliers by about 15% over brute-force Hamming; adaptive RANSAC thresholds further improve ATE by about 10% in high jitter.
  • Mask radius and Threshold Sweep: A smaller mask radius yields denser matches and more features but risks spatial clustering. Performance is best at […] px, with […] px optimal on EuRoC; stricter settings sacrifice inlier retention, while looser ones weaken outlier suppression.

This suggests that the coupling of deep-learned feature extraction, robust neural matching, and parameter-tuned verification, embedded within a classical VIO pipeline, achieves state-of-the-art tracking in previously failure-prone scenarios. The public release of SuperVINS code further supports reproducibility and extension in the SLAM research community (Luo et al., 2024).
