
SuperVINS: Real-time Visual-Inertial SLAM

Updated 17 February 2026
  • SuperVINS is a real-time visual-inertial SLAM framework integrating deep feature extraction and adaptive RANSAC to deliver robust tracking in low-light and motion-blur conditions.
  • It enhances VINS-Fusion by replacing traditional ORB features with SuperPoint and LightGlue, achieving up to 39.6% reduction in trajectory error in challenging scenarios.
  • The system employs a modular pipeline with sliding-window optimization and SuperPoint-based loop closure, maintaining real-time performance with efficient GPU acceleration.

SuperVINS is a real-time visual-inertial SLAM (Simultaneous Localization and Mapping) framework developed to address the robustness and accuracy limitations of traditional SLAM systems in challenging imaging conditions, such as low-light environments and motion blur. Building upon the established VINS-Fusion architecture, SuperVINS integrates deep learning-based feature extraction and matching modules—principally the SuperPoint and LightGlue neural networks—alongside an adaptive RANSAC-based enhancement strategy. This results in improved stability and trajectory tracking under degraded visual cues while maintaining real-time performance through efficient implementation and parallelized components (Luo et al., 2024).

1. System Design and Architecture

SuperVINS employs a modular pipeline structured to process synchronized monocular camera images ($I_k$) and inertial measurements ($u_k$) via the following key stages:

  • Preprocessing: Temporal calibration, geometric image undistortion, and compensation for IMU–camera extrinsic parameters.
  • Frontend:
    • SuperPoint network for feature detection and description.
    • LightGlue transformer model for sparse feature matching.
    • Adaptive RANSAC for geometric verification and outlier rejection.
    • IMU preintegration between keyframes.
  • Sliding-window Local Optimization: Tightly coupled visual-inertial bundle adjustment (BA) over the most recent $N$ state-vector estimates.
  • Map Management: Keyframe selection and update conditioned on parallax or temporal thresholds.
  • Loop-closure Thread (Parallel):
    • Bag-of-words (DBoW3) vector quantization of SuperPoint descriptors.
    • Retrieval of candidate places, followed by geometric verification and pose-graph augmentation.
  • Global Pose-graph Optimization: Correction of accumulated drift using g2o or iSAM2 backends.

Compared with VINS-Fusion, the core modifications are the replacement of FAST+optical flow and ORB descriptors with SuperPoint deep descriptors, matching via LightGlue instead of brute-force Hamming on ORB, and loop-closure detection over a SuperPoint-trained DBoW3 vocabulary. All deep learning modules are served through ONNX and allow GPU acceleration on compatible devices.
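The keyframe gating in the map-management stage (parallax or temporal thresholds, as described above) can be sketched as follows. This is an illustrative sketch: the function names and the threshold values `parallax_thresh` and `dt_thresh` are assumptions, not the published configuration.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float      # seconds
    mean_parallax: float  # mean pixel displacement of features tracked
                          # since the last keyframe

def is_keyframe(frame: Frame, last_kf: Frame,
                parallax_thresh: float = 10.0,   # px, assumed value
                dt_thresh: float = 0.5) -> bool:  # s, assumed value
    """Spawn a keyframe when parallax w.r.t. the last keyframe is large
    (new viewpoint) or too much time has elapsed (keeps the IMU
    preintegration interval short)."""
    return (frame.mean_parallax >= parallax_thresh
            or frame.timestamp - last_kf.timestamp >= dt_thresh)
```

Gating on either condition bounds both geometric drift (via parallax) and IMU integration error (via elapsed time).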

2. Feature Extraction, Matching, and Geometric Verification

2.1 SuperPoint Feature Extraction

SuperPoint comprises an encoder–decoder CNN that outputs two entities for each image:

  • A heat-map $H \in \mathbb{R}^{(h/8) \times (w/8) \times 65}$, which is further post-processed with a softmax to produce per-pixel keypoint probabilities.
  • A semi-dense descriptor map $D' \in \mathbb{R}^{(h/8) \times (w/8) \times d}$; this is bicubically upsampled and $L_2$-normalized to yield dense descriptors $D \in \mathbb{R}^{h \times w \times d}$.
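The detector-head post-processing above can be sketched in numpy: a channel-wise softmax, dropping the 65th "no keypoint" dustbin channel (the convention from the SuperPoint paper), then unfolding each cell's remaining 64 values into its 8×8 pixel block.

```python
import numpy as np

def decode_keypoint_heatmap(logits: np.ndarray) -> np.ndarray:
    """(h/8, w/8, 65) detector logits -> dense (h, w) keypoint
    probabilities: softmax over channels, drop the dustbin channel,
    then depth-to-space each 64-vector into its 8x8 pixel cell."""
    hc, wc, c = logits.shape
    assert c == 65, "expects the 65-channel SuperPoint detector output"
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    cells = probs[..., :64].reshape(hc, wc, 8, 8)  # drop dustbin, split cells
    return cells.transpose(0, 2, 1, 3).reshape(hc * 8, wc * 8)
```

Keypoints are then selected by thresholding this map and applying non-maximum suppression.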

SuperPoint is trained using a unified detection and description loss:

$$f = L_p(X, Y) + L_p(X', Y') + \lambda \cdot L_d(D, D', S)$$

where $L_p$ denotes cross-entropy over keypoint presence and $L_d$ a contrastive loss enforcing descriptor consistency, with $\lambda$ balancing the detection and description objectives.

2.2 LightGlue Feature Matching

Given two sets of SuperPoint descriptors $\{x_i^A\}_{i=1}^{N_A}$ and $\{x_j^B\}_{j=1}^{N_B}$, LightGlue uses a transformer-based architecture to compute a soft-assignment matrix $P \in \mathbb{R}^{N_A \times N_B}$ via alternating self- and cross-attention on features from the two frames. At each transformer layer:

  • Self-attention updates feature vectors within each image.
  • Cross-attention exchanges contextual information across images, refining correspondence hypotheses.
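A minimal numpy sketch of one such alternating block follows; it omits the learned projections, multi-head structure, and positional encodings of the real model and is meant only to show the self/cross information flow.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned weights)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))   # stable softmax
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def self_cross_block(xa, xb):
    """One illustrative LightGlue-style layer: each image's features
    first attend to themselves, then to the other image's features."""
    xa = xa + attention(xa, xa, xa)    # self-attention, image A
    xb = xb + attention(xb, xb, xb)    # self-attention, image B
    xa2 = xa + attention(xa, xb, xb)   # cross-attention A <- B
    xb2 = xb + attention(xb, xa, xa)   # cross-attention B <- A
    return xa2, xb2
```

Stacking such blocks lets correspondence evidence accumulate before the final assignment matrix is read out.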

The matching process is supervised by a hierarchical assignment loss:

$$\text{loss} = -\frac{1}{L} \sum_{\ell=1}^{L} \left[ \frac{1}{|M|} \sum_{(i,j)\in M} \log P_{ij}^{\ell} + \frac{1}{2|\bar{A}|} \sum_{i\in \bar{A}} \log(1-\sigma_i^{A}) + \frac{1}{2|\bar{B}|} \sum_{j\in \bar{B}} \log(1-\sigma_j^{B}) \right]$$

where $M$ encodes the ground-truth matches, $\sigma$ signifies the matchability predictions, and $\bar{A}, \bar{B}$ are the sets of predicted unmatchable points.

2.3 RANSAC-based Enhancement

To further reinforce matching robustness, the non-maximum suppression radius in SuperPoint is contracted (e.g., $r_{mask} = 4$ px), allowing denser and more resilient feature tracks in poor conditions. From the LightGlue matches, minimal sets of four correspondences are sampled to estimate a planar homography $H$ using DLT. Matches are retained if they satisfy the geometric threshold $\|x_2 - H x_1\|_2 < \tau_{geo}$, with the adaptive $\tau_{geo}$ tailored per sequence (e.g., $[0.22, 0.28]$ px for EuRoC).
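The verification step can be sketched as a pure-numpy 4-point DLT homography RANSAC. The iteration count and threshold below follow the values quoted in this article; everything else is an illustrative implementation, not the released code.

```python
import numpy as np

def fit_homography(p1, p2):
    """DLT: stack two linear equations per correspondence and take the
    right singular vector of the smallest singular value as H."""
    rows = []
    for (x, y), (u, v) in zip(p1, p2):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2] if abs(H[2, 2]) > 1e-12 else H

def ransac_homography(p1, p2, iters=50, tau=0.25, seed=0):
    """Sample minimal 4-point sets; return the inlier mask of the H
    with the most matches under the reprojection threshold tau (px)."""
    rng = np.random.default_rng(seed)
    n = len(p1)
    best = np.zeros(n, dtype=bool)
    for _ in range(iters):
        idx = rng.choice(n, size=4, replace=False)
        H = fit_homography(p1[idx], p2[idx])
        proj = np.c_[p1, np.ones(n)] @ H.T
        with np.errstate(divide="ignore", invalid="ignore"):
            proj = proj[:, :2] / proj[:, 2:3]   # dehomogenize
            err = np.linalg.norm(proj - p2, axis=1)
        inliers = err < tau
        if inliers.sum() > best.sum():
            best = inliers
    return best
```

A production system would refit $H$ on all inliers afterwards; the sketch stops at the consensus set, which is all the frontend needs for outlier rejection.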

3. Loop Closure and Global Consistency

SuperVINS’s loop-closure pipeline is built on a SuperPoint-based bag-of-words (BoW) representation:

  • Vocabulary Construction: Descriptors from datasets (EuRoC, TUM, KITTI) are clustered using $k$-means to yield $K$ visual “words.” This vocabulary is managed by DBoW3.
  • Runtime Encoding: Each keyframe’s $M$ descriptors are mapped to a weighted histogram $v \in \mathbb{R}^K$ with tf–idf weighting $v_k = \text{tf}_k \log(N_{docs}/\text{df}_k)$.
  • Loop Detection: Candidate loop frames are retrieved by histogram dot product, $s_{qi} = v_q^\top v_i$. Frames with $s_{qi} > \delta_{loop}$ (typically $\delta_{loop} \approx 50$) are geometrically validated (via RANSAC fitting of $F$ or $P$ with a minimum inlier threshold).
  • Graph Augmentation: Upon acceptance, a loop-closure factor is added to the pose-graph for global consistency optimization.
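The runtime encoding and retrieval steps can be sketched directly from the formulas above. This is a toy flat vocabulary with an assumed threshold; a real system uses DBoW3's hierarchical vocabulary and normalized scores.

```python
import numpy as np

def tfidf_vectors(frames_word_ids, K):
    """Map each frame's quantized word ids to a K-dim tf-idf vector:
    v_k = tf_k * log(N_docs / df_k)."""
    N = len(frames_word_ids)
    tf = np.zeros((N, K))
    for i, ids in enumerate(frames_word_ids):
        for w in ids:
            tf[i, w] += 1
    df = np.maximum((tf > 0).sum(axis=0), 1)  # document frequency per word
    return tf * np.log(N / df)

def loop_candidates(v_query, v_db, delta):
    """Return indices of frames whose similarity s_qi = v_q . v_i
    exceeds the loop threshold delta."""
    scores = v_db @ v_query
    return [i for i, s in enumerate(scores) if s > delta]
```

The idf factor downweights words that occur in most keyframes, so candidates are ranked by their rarer, more discriminative shared words.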

4. Optimization and State Estimation

SuperVINS employs a factor-graph representation in a sliding window over recent states $X = \{T_i, v_i, b_i^g, b_i^a\}_{i=0}^{K}$, minimizing the cost:

$$J(X) = \sum_{m \in C} \| r_{cam}(T_{i_m}, X_{j_m}; z_m) \|_{\Sigma_{cam}}^2 + \sum_{l \in I} \| r_{imu}(T_{i_l}, T_{i_{l+1}}; u_l) \|_{\Sigma_{imu}}^2 + \sum_{(p,q) \in L} \| r_{lc}(T_p, T_q) \|_{\Sigma_{lc}}^2$$

where $r_{cam}$ denotes the 3D-to-2D reprojection error, $r_{imu}$ the preintegrated IMU error, and $r_{lc}$ the loop-closure constraint.
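Each term in $J(X)$ is a Mahalanobis-weighted squared residual. A minimal sketch of the camera factor, assuming a standard pinhole model and 4×4 homogeneous camera-to-world poses (the exact parameterization in SuperVINS may differ):

```python
import numpy as np

def mahalanobis_sq(r, Sigma):
    """||r||^2_Sigma = r^T Sigma^{-1} r, the weighting applied to
    every factor in J(X)."""
    return float(r @ np.linalg.solve(Sigma, r))

def reprojection_residual(K, T_wc, X_w, z):
    """r_cam: project world point X_w into the camera with
    camera-to-world pose T_wc and intrinsics K, subtract observation z."""
    X_c = np.linalg.inv(T_wc) @ np.append(X_w, 1.0)  # world -> camera frame
    u = K @ (X_c[:3] / X_c[2])                       # pinhole projection
    return u[:2] - z
```

The solver stacks such residuals for all visual, inertial, and loop factors and iteratively linearizes them about the current state estimate.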

Nonlinear optimization is performed with Ceres or g2o (dense Schur complement). The sliding window size is $N \approx 20$ keyframes, with keyframes marginalized as needed. Loop closures are incorporated asynchronously, triggering global graph optimization via iSAM2.

5. Implementation Parameters and Performance Budget

Key implementation and hyperparameter settings include:

| Component | Value/Setting | Comment |
| --- | --- | --- |
| OS/Hardware | Ubuntu 18.04; RTX 2060 | CUDA 11.7, cuDNN 8.9.6, onnxruntime 1.16.3 |
| SuperPoint input | $240 \times 320$ px | Internally downsampled by $8\times$ |
| LightGlue model | Medium (256 heads, 12 layers) | $\sim$3 ms/pair on GPU |
| Mask radius ($r_{mask}$) | 4 px (EuRoC) | Denser features for challenging conditions |
| RANSAC iterations / inlier threshold | 50; $\tau_{geo} = 0.25$ px | Range 0.22–0.28 px for jitter adaptation |
| Sliding window | $N = 20$ keyframes | |
| Loop-closure frequency | Every 10 keyframes | |
| Front-end budget | $\sim$15 ms (CPU) | |
| LightGlue GPU budget | $\sim$5 ms | |
| Total per frame | 25–30 ms | Fits real-time constraint |

6. Quantitative and Qualitative Evaluation

SuperVINS was validated on the EuRoC MAV dataset, comparing against VINS-Fusion in terms of absolute trajectory error (ATE) and rotational and translational relative pose error (RPE).

| Sequence | VINS-Fusion ATE [m] | SuperVINS ATE [m] | Rot. RPE [rad/m] | Trans. RPE [m] | Notes |
| --- | --- | --- | --- | --- | --- |
| MH01 | 0.0911 | 0.0867 | 0.00203 | 0.00673 | |
| MH05 | 0.2621 | 0.1583 | 0.00379 | 0.00949 | 39.6% ATE reduction |
| V202 | — (lost) | 0.1003 | 0.00594 | 0.00850 | VINS-Fusion lost tracking |
| V203 | 0.1926 | 0.1687 | | | |

Qualitatively, SuperVINS demonstrated roughly $3\times$ higher map-point density in low-light and high-blur sequences, and trajectory overlays visually closer to ground truth. These results indicate a substantive improvement in both tracking reliability and mapping detail under adverse conditions.
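The ATE figures above are RMS translational errors over the aligned trajectory. A minimal sketch of the metric (trajectory alignment, e.g. via Umeyama, is omitted and the estimate assumed pre-aligned):

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error: RMSE of per-pose translation
    differences between an (aligned) estimate and ground truth."""
    d = est_xyz - gt_xyz
    return float(np.sqrt(np.mean(np.sum(d * d, axis=1))))

# Sanity check of the reported MH05 figure:
# (0.2621 - 0.1583) / 0.2621 ~= 39.6% ATE reduction.
```

RPE is computed analogously over relative pose increments, which isolates local drift from globally accumulated error.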

7. Ablation Studies and Empirical Insights

System component ablations and sensitivity analyses reveal:

  • SuperPoint vs. ORB: Replacing classical ORB + optical flow with SuperPoint yields an 8–12% ATE reduction in challenging sequences; under nominal lighting, improvements are marginal (1–2%).
  • LightGlue + RANSAC: LightGlue reduces correspondence outliers by $\sim$15% over brute-force Hamming matching; adaptive RANSAC thresholds further improve ATE by $\sim$10% under high jitter.
  • Mask Radius and Threshold Sweep: Denser matches (smaller $r_{mask}$) increase the feature count but risk spatial clustering. Performance is optimal at $r_{mask} = 4$ px, with $\tau_{geo} = 0.25$ px optimal on EuRoC; stricter or looser settings respectively trade off inlier retention and outlier suppression.

This suggests that the coupling of deep-learned feature extraction, robust neural matching, and parameter-tuned verification, embedded within a classical VIO pipeline, achieves state-of-the-art tracking in previously failure-prone scenarios. The public release of SuperVINS code further supports reproducibility and extension in the SLAM research community (Luo et al., 2024).

