
SuperVINS: Real-time Visual-Inertial SLAM

Updated 17 February 2026
  • SuperVINS is a real-time visual-inertial SLAM framework integrating deep feature extraction and adaptive RANSAC to deliver robust tracking in low-light and motion-blur conditions.
  • It enhances VINS-Fusion by replacing traditional ORB features with SuperPoint and LightGlue, achieving up to 39.6% reduction in trajectory error in challenging scenarios.
  • The system employs a modular pipeline with sliding-window optimization and SuperPoint-based loop closure, maintaining real-time performance with efficient GPU acceleration.

SuperVINS is a real-time visual-inertial SLAM (Simultaneous Localization and Mapping) framework developed to address the robustness and accuracy limitations of traditional SLAM systems in challenging imaging conditions, such as low-light environments and motion blur. Building upon the established VINS-Fusion architecture, SuperVINS integrates deep learning-based feature extraction and matching modules—principally the SuperPoint and LightGlue neural networks—alongside an adaptive RANSAC-based enhancement strategy. This results in improved stability and trajectory tracking under degraded visual cues while maintaining real-time performance through efficient implementation and parallelized components (Luo et al., 2024).

1. System Design and Architecture

SuperVINS employs a modular pipeline structured to process synchronized monocular camera images ($I_k$) and inertial measurements ($u_k$) via the following key stages:

  • Preprocessing: Temporal calibration, geometric image undistortion, and compensation for IMU–camera extrinsic parameters.
  • Frontend:
    • SuperPoint network for feature detection and description.
    • LightGlue transformer model for sparse feature matching.
    • Adaptive RANSAC for geometric verification and outlier rejection.
    • IMU preintegration between keyframes.
  • Sliding-window Local Optimization: Tightly coupled visual-inertial bundle adjustment (BA) over the most recent $N$ state-vector estimates.
  • Map Management: Keyframe selection and update conditioned on parallax or temporal thresholds.
  • Loop-closure Thread (Parallel):
    • Bag-of-words (DBoW3) vector quantization of SuperPoint descriptors.
    • Retrieval of candidate places, followed by geometric verification and pose-graph augmentation.
  • Global Pose-graph Optimization: Correction of accumulated drift using g2o or iSAM2 backends.

Compared with VINS-Fusion, the core modifications are the replacement of FAST+optical flow and ORB descriptors with SuperPoint deep descriptors, matching via LightGlue instead of brute-force Hamming on ORB, and loop-closure detection over a SuperPoint-trained DBoW3 vocabulary. All deep learning modules are served through ONNX and allow GPU acceleration on compatible devices.
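The keyframe gating in the map-management stage (parallax or temporal thresholds, as described above) can be sketched as follows. This is an illustrative sketch: the function names and the threshold values `parallax_thresh` and `dt_thresh` are assumptions, not the published configuration.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float      # seconds
    mean_parallax: float  # mean pixel displacement of features tracked
                          # since the last keyframe

def is_keyframe(frame: Frame, last_kf: Frame,
                parallax_thresh: float = 10.0,   # px, assumed value
                dt_thresh: float = 0.5) -> bool:  # s, assumed value
    """Spawn a keyframe when parallax w.r.t. the last keyframe is large
    (new viewpoint) or too much time has elapsed (keeps the IMU
    preintegration interval short)."""
    return (frame.mean_parallax >= parallax_thresh
            or frame.timestamp - last_kf.timestamp >= dt_thresh)
```

Gating on either condition bounds both geometric drift (via parallax) and IMU integration error (via elapsed time).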

2. Feature Extraction, Matching, and Geometric Verification

2.1 SuperPoint Feature Extraction

SuperPoint comprises an encoder–decoder CNN that outputs two entities for each image:

  • A heat-map $H \in \mathbb{R}^{(h/8) \times (w/8) \times 65}$, which is further post-processed with a softmax to produce per-pixel keypoint probabilities.
  • A semi-dense descriptor map $D' \in \mathbb{R}^{(h/8) \times (w/8) \times d}$; this is bicubically upsampled and $L_2$-normalized to yield dense descriptors $D \in \mathbb{R}^{h \times w \times d}$.
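The detector-head post-processing above can be sketched in numpy: a channel-wise softmax, dropping the 65th "no keypoint" dustbin channel (the convention from the SuperPoint paper), then unfolding each cell's remaining 64 values into its 8×8 pixel block.

```python
import numpy as np

def decode_keypoint_heatmap(logits: np.ndarray) -> np.ndarray:
    """(h/8, w/8, 65) detector logits -> dense (h, w) keypoint
    probabilities: softmax over channels, drop the dustbin channel,
    then depth-to-space each 64-vector into its 8x8 pixel cell."""
    hc, wc, c = logits.shape
    assert c == 65, "expects the 65-channel SuperPoint detector output"
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    cells = probs[..., :64].reshape(hc, wc, 8, 8)  # drop dustbin, split cells
    return cells.transpose(0, 2, 1, 3).reshape(hc * 8, wc * 8)
```

Keypoints are then selected by thresholding this map and applying non-maximum suppression.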

SuperPoint is trained using a unified detection and description loss:

$$f = L_p(X, Y) + L_p(X', Y') + \lambda \cdot L_d(D, D', S)$$

where $L_p$ denotes cross-entropy over keypoint presence and $L_d$ a contrastive loss enforcing descriptor consistency, with $\lambda$ balancing the detection and description objectives.

2.2 LightGlue Feature Matching

Given two sets of SuperPoint descriptors $\{x_i^A\}_{i=1}^{N_A}$ and $\{x_j^B\}_{j=1}^{N_B}$, LightGlue uses a transformer-based architecture to compute a soft-assignment matrix $P \in \mathbb{R}^{N_A \times N_B}$ via alternating self- and cross-attention on features from the two frames. At each transformer layer:

  • Self-attention updates feature vectors within each image.
  • Cross-attention exchanges contextual information across images, refining correspondence hypotheses.
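A minimal numpy sketch of one such alternating block follows; it omits the learned projections, multi-head structure, and positional encodings of the real model and is meant only to show the self/cross information flow.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned weights)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))   # stable softmax
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def self_cross_block(xa, xb):
    """One illustrative LightGlue-style layer: each image's features
    first attend to themselves, then to the other image's features."""
    xa = xa + attention(xa, xa, xa)    # self-attention, image A
    xb = xb + attention(xb, xb, xb)    # self-attention, image B
    xa2 = xa + attention(xa, xb, xb)   # cross-attention A <- B
    xb2 = xb + attention(xb, xa, xa)   # cross-attention B <- A
    return xa2, xb2
```

Stacking such blocks lets correspondence evidence accumulate before the final assignment matrix is read out.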

The matching process is supervised by a hierarchical assignment loss:

$$\text{loss} = -\frac{1}{L} \sum_{\ell=1}^{L} \left[ \frac{1}{|M|} \sum_{(i,j)\in M} \log P_{ij}^{\ell} + \frac{1}{2|\bar{A}|} \sum_{i\in \bar{A}} \log(1-\sigma_i^{A}) + \frac{1}{2|\bar{B}|} \sum_{j\in \bar{B}} \log(1-\sigma_j^{B}) \right]$$

where $M$ encodes the ground-truth matches, $\sigma$ signifies the matchability predictions, and $\bar{A}, \bar{B}$ are the sets of predicted unmatchable points.

2.3 RANSAC-based Enhancement

To further reinforce matching robustness, the non-maximum suppression radius in SuperPoint is contracted (e.g., $r_{mask} = 4$ px), allowing denser and more resilient feature tracks in poor conditions. From the LightGlue matches, minimal sets of four correspondences are sampled to estimate a planar homography $H$ using DLT. Matches are retained if they satisfy the geometric threshold $\|x_2 - H x_1\|_2 < \tau_{geo}$, with the adaptive $\tau_{geo}$ tailored per sequence (e.g., $[0.22, 0.28]$ px for EuRoC).
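The verification step can be sketched as a pure-numpy 4-point DLT homography RANSAC. The iteration count and threshold below follow the values quoted in this article; everything else is an illustrative implementation, not the released code.

```python
import numpy as np

def fit_homography(p1, p2):
    """DLT: stack two linear equations per correspondence and take the
    right singular vector of the smallest singular value as H."""
    rows = []
    for (x, y), (u, v) in zip(p1, p2):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2] if abs(H[2, 2]) > 1e-12 else H

def ransac_homography(p1, p2, iters=50, tau=0.25, seed=0):
    """Sample minimal 4-point sets; return the inlier mask of the H
    with the most matches under the reprojection threshold tau (px)."""
    rng = np.random.default_rng(seed)
    n = len(p1)
    best = np.zeros(n, dtype=bool)
    for _ in range(iters):
        idx = rng.choice(n, size=4, replace=False)
        H = fit_homography(p1[idx], p2[idx])
        proj = np.c_[p1, np.ones(n)] @ H.T
        with np.errstate(divide="ignore", invalid="ignore"):
            proj = proj[:, :2] / proj[:, 2:3]   # dehomogenize
            err = np.linalg.norm(proj - p2, axis=1)
        inliers = err < tau
        if inliers.sum() > best.sum():
            best = inliers
    return best
```

A production system would refit $H$ on all inliers afterwards; the sketch stops at the consensus set, which is all the frontend needs for outlier rejection.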

3. Loop Closure and Global Consistency

SuperVINS’s loop-closure pipeline is built on a SuperPoint-based bag-of-words (BoW) representation:

  • Vocabulary Construction: Descriptors from datasets (EuRoC, TUM, KITTI) are clustered using $k$-means to yield $K$ visual “words.” This vocabulary is managed by DBoW3.
  • Runtime Encoding: Each keyframe’s $M$ descriptors are mapped to a weighted histogram $v \in \mathbb{R}^K$ with tf–idf weighting $v_k = \text{tf}_k \log(N_{docs}/\text{df}_k)$.
  • Loop Detection: Candidate loop frames are retrieved by histogram dot product, $s_{qi} = v_q^\top v_i$. Frames with $s_{qi} > \delta_{loop}$ (typically $\delta_{loop} \approx 50$) are geometrically validated (via RANSAC fitting of $F$ or $P$ with a minimum inlier threshold).
  • Graph Augmentation: Upon acceptance, a loop-closure factor is added to the pose-graph for global consistency optimization.
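The runtime encoding and retrieval steps can be sketched directly from the formulas above. This is a toy flat vocabulary with an assumed threshold; a real system uses DBoW3's hierarchical vocabulary and normalized scores.

```python
import numpy as np

def tfidf_vectors(frames_word_ids, K):
    """Map each frame's quantized word ids to a K-dim tf-idf vector:
    v_k = tf_k * log(N_docs / df_k)."""
    N = len(frames_word_ids)
    tf = np.zeros((N, K))
    for i, ids in enumerate(frames_word_ids):
        for w in ids:
            tf[i, w] += 1
    df = np.maximum((tf > 0).sum(axis=0), 1)  # document frequency per word
    return tf * np.log(N / df)

def loop_candidates(v_query, v_db, delta):
    """Return indices of frames whose similarity s_qi = v_q . v_i
    exceeds the loop threshold delta."""
    scores = v_db @ v_query
    return [i for i, s in enumerate(scores) if s > delta]
```

The idf factor downweights words that occur in most keyframes, so candidates are ranked by their rarer, more discriminative shared words.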

4. Optimization and State Estimation

SuperVINS employs a factor-graph representation in a sliding window over recent states $X = \{T_i, v_i, b_i^g, b_i^a\}_{i=0}^{K}$, minimizing the cost:

$$J(X) = \sum_{m \in C} \| r_{cam}(T_{i_m}, X_{j_m}; z_m) \|_{\Sigma_{cam}}^2 + \sum_{l \in I} \| r_{imu}(T_{i_l}, T_{i_{l+1}}; u_l) \|_{\Sigma_{imu}}^2 + \sum_{(p,q) \in L} \| r_{lc}(T_p, T_q) \|_{\Sigma_{lc}}^2$$

where $r_{cam}$ denotes the 3D-to-2D reprojection error, $r_{imu}$ the preintegrated IMU error, and $r_{lc}$ the loop-closure constraint.
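Each term in $J(X)$ is a Mahalanobis-weighted squared residual. A minimal sketch of the camera factor, assuming a standard pinhole model and 4×4 homogeneous camera-to-world poses (the exact parameterization in SuperVINS may differ):

```python
import numpy as np

def mahalanobis_sq(r, Sigma):
    """||r||^2_Sigma = r^T Sigma^{-1} r, the weighting applied to
    every factor in J(X)."""
    return float(r @ np.linalg.solve(Sigma, r))

def reprojection_residual(K, T_wc, X_w, z):
    """r_cam: project world point X_w into the camera with
    camera-to-world pose T_wc and intrinsics K, subtract observation z."""
    X_c = np.linalg.inv(T_wc) @ np.append(X_w, 1.0)  # world -> camera frame
    u = K @ (X_c[:3] / X_c[2])                       # pinhole projection
    return u[:2] - z
```

The solver stacks such residuals for all visual, inertial, and loop factors and iteratively linearizes them about the current state estimate.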

Nonlinear optimization is performed with Ceres or g2o (dense Schur complement). The sliding window size is $N \approx 20$ keyframes, with keyframes marginalized as needed. Loop closures are incorporated asynchronously, triggering global graph optimization via iSAM2.

5. Implementation Parameters and Performance Budget

Key implementation and hyperparameter settings include:

| Component | Value/Setting | Comment |
| --- | --- | --- |
| OS/Hardware | Ubuntu 18.04; RTX 2060 | CUDA 11.7, cuDNN 8.9.6, onnxruntime 1.16.3 |
| SuperPoint input | $240 \times 320$ px | Internally downsampled by $8\times$ |
| LightGlue model | Medium (256 heads, 12 layers) | $\sim$3 ms/pair on GPU |
| Mask radius ($r_{mask}$) | 4 px (EuRoC) | Denser features for challenging conditions |
| RANSAC iterations / inlier threshold | 50; $\tau_{geo} = 0.25$ px | Range 0.22–0.28 px for jitter adaptation |
| Sliding window | $N = 20$ keyframes | |
| Loop-closure frequency | Every 10 keyframes | |
| Front-end budget | $\sim$15 ms (CPU) | |
| LightGlue GPU budget | $\sim$5 ms | |
| Total per frame | 25–30 ms | Fits real-time constraint |

6. Quantitative and Qualitative Evaluation

SuperVINS was validated on the EuRoC MAV dataset, comparing against VINS-Fusion in terms of absolute trajectory error (ATE) and rotational and translational relative pose error (RPE).

| Sequence | VINS-Fusion ATE [m] | SuperVINS ATE [m] | Rot. RPE [rad/m] | Trans. RPE [m] | Notes |
| --- | --- | --- | --- | --- | --- |
| MH01 | 0.0911 | 0.0867 | 0.00203 | 0.00673 | |
| MH05 | 0.2621 | 0.1583 | 0.00379 | 0.00949 | 39.6% ATE reduction |
| V202 | — (lost) | 0.1003 | 0.00594 | 0.00850 | VINS-Fusion lost tracking |
| V203 | 0.1926 | 0.1687 | | | |

Qualitatively, SuperVINS demonstrated roughly $3\times$ higher map-point density in low-light and high-blur sequences, and trajectory overlays visually closer to ground truth. These results indicate a substantive improvement in both tracking reliability and mapping detail under adverse conditions.
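The ATE figures above are RMS translational errors over the aligned trajectory. A minimal sketch of the metric (trajectory alignment, e.g. via Umeyama, is omitted and the estimate assumed pre-aligned):

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error: RMSE of per-pose translation
    differences between an (aligned) estimate and ground truth."""
    d = est_xyz - gt_xyz
    return float(np.sqrt(np.mean(np.sum(d * d, axis=1))))

# Sanity check of the reported MH05 figure:
# (0.2621 - 0.1583) / 0.2621 ~= 39.6% ATE reduction.
```

RPE is computed analogously over relative pose increments, which isolates local drift from globally accumulated error.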

7. Ablation Studies and Empirical Insights

System component ablations and sensitivity analyses reveal:

  • SuperPoint vs. ORB: Replacing classical ORB + optical flow with SuperPoint yields an 8–12% ATE reduction in challenging sequences; under nominal lighting, improvements are marginal (1–2%).
  • LightGlue + RANSAC: LightGlue reduces correspondence outliers by $\sim$15% over brute-force Hamming matching; adaptive RANSAC thresholds further improve ATE by $\sim$10% under high jitter.
  • Mask Radius and Threshold Sweep: Denser matches (smaller $r_{mask}$) increase the feature count but risk spatial clustering. Performance is optimal at $r_{mask} = 4$ px, with $\tau_{geo} = 0.25$ px optimal on EuRoC; stricter or looser settings respectively trade off inlier retention and outlier suppression.

This suggests that the coupling of deep-learned feature extraction, robust neural matching, and parameter-tuned verification, embedded within a classical VIO pipeline, achieves state-of-the-art tracking in previously failure-prone scenarios. The public release of SuperVINS code further supports reproducibility and extension in the SLAM research community (Luo et al., 2024).

