
Visual-Inertial Fusion: Sensor Integration

Updated 5 February 2026
  • Visual-Inertial Fusion is the integration of visual and inertial sensors to accurately estimate an agent’s pose, velocity, and trajectory in real time.
  • It combines high-frequency orientation and velocity estimates from IMUs with precise visual cues, employing methods such as EKF, spline optimization, and deep learning to overcome individual sensor limitations.
  • Recent advances include probabilistic residual weighting, neural fusion paradigms, and multi-modal integration with GNSS and LiDAR to enhance robustness in dynamic environments.

Visual-Inertial Fusion, also referred to as Visual-Inertial Odometry (VIO) or Visual-Inertial State Estimation, denotes the integration of visual (camera-based) and inertial (IMU) sensing modalities to estimate an agent’s metric pose, velocity, and trajectory over time. By combining complementary strengths—visual sensors yield accurate translational cues (and drift correction), while inertial sensors provide high-frequency, drift-prone but locally smooth orientation and velocity—fused solutions enable robust, real-time, and metric-accurate 3D motion estimation across a wide spectrum of robotics, AR/VR, and navigation tasks. Fusion can be implemented through tightly coupled optimization, extended Kalman filtering, batch polynomial parameterization, or—in recent advances—deep learning–based feature fusion and attention mechanisms.

1. Mathematical Foundations of Visual-Inertial Fusion

The canonical mathematical structure for visual-inertial fusion is a continuous or discrete-time state-space model:

  • State: pose, velocity, and biases $\mathbf{x}$ (e.g., $\mathbf{x} = [q, p, v, b_a, b_g]$ for quaternion, position, velocity, and accelerometer/gyroscope biases)
  • IMU propagation: typically modeled as stochastic continuous dynamics,

$$
\begin{aligned}
\dot{q}(t) &= \tfrac{1}{2}\, q(t) \circ \begin{bmatrix} 0 \\ \omega_{\text{imu}}(t) - b_g - n_g(t) \end{bmatrix} \\
\dot{v}(t) &= R(q(t))\left[a_{\text{imu}}(t) - b_a - n_a(t)\right] + g \\
\dot{p}(t) &= v(t)
\end{aligned}
$$

  • Visual measurements: project 3D landmarks (or structureless feature tracks) to the camera via calibrated intrinsics/extrinsics, resulting in non-linear reprojection error terms per tracked feature.

Integration of these modalities occurs in a filtering (EKF, ESKF), smoothing (bundle adjustment, factor graph), or direct optimization setting. Batch polynomial methods parameterize the continuous trajectory as a spline or Chebyshev series and directly tie camera/IMU observations to the global parameter coefficients (Zhang et al., 2024, Ovrén et al., 2018). IMU preintegration enables efficient marginalization of high-rate inertial data and permits formulation of relative pose/velocity residuals between arbitrary camera/IMU pairs (Cioffi et al., 2020).
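As a concrete illustration, the continuous-time IMU kinematics above can be discretized with a simple Euler step. The sketch below is a minimal, self-contained Python version (quaternion convention `(w, x, y, z)`, biases assumed already subtracted, noise terms omitted); it is not any particular system's propagation code:

```python
import math

def quat_mul(q1, q2):
    # Hamilton product of quaternions (w, x, y, z).
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def rotate(q, v):
    # Rotate vector v by unit quaternion q: q * (0, v) * q^-1.
    qv = quat_mul(quat_mul(q, (0.0, *v)), (q[0], -q[1], -q[2], -q[3]))
    return qv[1:]

def propagate(state, omega, accel, dt, g=(0.0, 0.0, -9.81)):
    """One Euler step of the IMU kinematics (biases already removed)."""
    q, p, v = state
    # Orientation: q_dot = 0.5 * q ∘ (0, omega); renormalize afterwards.
    dq = quat_mul(q, (0.0, *omega))
    q = tuple(qi + 0.5 * dqi * dt for qi, dqi in zip(q, dq))
    n = math.sqrt(sum(c * c for c in q))
    q = tuple(c / n for c in q)
    # Velocity: v_dot = R(q) a + g;  Position: p_dot = v.
    a_world = rotate(state[0], accel)
    v = tuple(vi + (ai + gi) * dt for vi, ai, gi in zip(v, a_world, g))
    p = tuple(pi + vi_old * dt for pi, vi_old in zip(p, state[2]))
    return q, p, v
```

A stationary sensor reading the specific force `(0, 0, 9.81)` (gravity reaction) with zero angular rate leaves the state unchanged, which is a quick sanity check for the sign conventions.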

2. Advances in Probabilistic Residual Weighting and Continuous-Time Fusion

Visual-inertial fusion accuracy hinges critically on correct normalization and weighting of diverse measurement residuals:

  • Probability-based weighting: When fitting a spline $\hat{y}(t \mid \Theta)$ to a noisy sensor time series $x(t)$, the residual variance is due not only to measurement noise $\sigma_f^2$ but also to the approximation (modeling) error $\sigma_e^2$ of the spline. The total residual variance is

$$\hat{\sigma}_r^2 = \hat{\sigma}_e^2 + \hat{\sigma}_f^2,$$

informing the optimal inverse-variance weight

$$\gamma_i = 1/\hat{\sigma}_{r,i}^2.$$

The frequency-domain spline error prediction uses the squared spectral residual outside the spline’s passband, allowing robust, automatic balancing across vision, gyro, and accelerometer modalities (Ovrén et al., 2018).

  • Continuous-time parameterizations: Representing the trajectory as a Chebyshev polynomial (Zhang et al., 2024) or cubic B-spline (Ovrén et al., 2018) offers analytic derivatives, fast residual evaluation, and direct enforcement of inertial and visual constraints in a single global least-squares problem. Automatic, signal-adaptive knot or polynomial-order selection ensures lossless representation up to a prescribed information threshold.
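The inverse-variance weighting above reduces to a one-liner per residual. The sketch below (with illustrative variance values, not taken from either cited paper) shows how per-modality weights would be formed before assembling the joint least-squares objective:

```python
def residual_weights(sigma_e2, sigma_f2):
    """Inverse-variance weights gamma_i = 1 / (sigma_e_i^2 + sigma_f_i^2).

    sigma_e2: per-modality spline approximation-error variances
    sigma_f2: per-modality measurement-noise variances
    """
    return [1.0 / (e2 + f2) for e2, f2 in zip(sigma_e2, sigma_f2)]

# Illustrative variances for vision, gyro, and accelerometer residuals.
weights = residual_weights(sigma_e2=[1e-4, 1e-6, 1e-3],
                           sigma_f2=[4e-4, 1e-6, 1e-3])
```

Each squared residual is then multiplied by its weight in the cost function, so a modality whose spline fit is poor (large approximation error) is automatically downweighted rather than forced to fit.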

3. Robust Outlier Handling and Filter Structures

Real-world pipelines must contend with significant outlier rates in vision (feature loss, dynamic objects, rolling shutter, poor illumination). Robust inference strategies include:

  • Bayes-optimal joint inference/classification allows simultaneous state and inlier set estimation (marginalizing over the inlier/outlier combinatorics), but is tractable only approximately. High-performing approximations include Mahalanobis gating, one-point RANSAC with leave-one-out cross-validation, history-of-innovation whiteness (Ljung–Box) tests, and their combinations in fixed-lag smoothers (Tsotsos et al., 2014).
  • Nullspace marginalization (as in MSCKF or LIC-Fusion) projects out unobservable landmark directions, reducing filter inconsistency under first-order linearization (Zuo et al., 2019).

Recent systems further introduce adaptive residual weighting and sensor gating via online health evaluation, dynamically upweighting or downweighting modalities (vision, IMU, DVL) based on residual statistics and quality metrics, with failsafe sensor deactivation/reactivation logic for extended corruptions (Wei et al., 2025).
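A minimal, self-contained sketch of the Mahalanobis gating step for a 2-D reprojection innovation follows. The interface and the chi-square threshold (95th percentile, 2 degrees of freedom) are illustrative choices, not any specific system's values:

```python
def mahalanobis_gate(innovation, S, threshold=5.991):
    """Chi-square gate on a 2-D measurement innovation.

    innovation: (r_x, r_y) reprojection residual of one feature
    S: 2x2 innovation covariance [[s00, s01], [s10, s11]]
    threshold: ~95th percentile of chi-square with 2 dof (assumed here)
    Returns True if the measurement passes the gate (treated as an inlier).
    """
    r0, r1 = innovation
    (s00, s01), (s10, s11) = S
    det = s00 * s11 - s01 * s10
    # d^2 = r^T S^-1 r, with the 2x2 inverse written out explicitly.
    d2 = (r0 * (s11 * r0 - s01 * r1) + r1 * (-s10 * r0 + s00 * r1)) / det
    return d2 <= threshold
```

In a filter, `S` is the predicted innovation covariance `H P H^T + R`; measurements failing the gate are dropped or passed to a stronger test such as one-point RANSAC.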

4. Deep Learning Paradigms for Visual-Inertial Fusion

End-to-end deep fusion architectures learn temporal and cross-modal aggregation directly from raw or pre-encoded features:

  • Causal Transformer-based fusion: The VIFT framework (Kurt et al., 2024) encodes visual and inertial sequences via frozen FlowNet and 1D-CNN feature extractors; these are concatenated and temporally fused with a small causal Transformer (multihead self-attention, causal masking). The fused latent is mapped to SE(3) increments (translation + axis–angle rotation), which update the pose on the manifold. Rotation regression is formulated with explicit manifold retraction (RPMG) to maintain SO(3) consistency.
  • Selective Sensor Fusion (Chen et al., 2019): Fusion masking—either deterministic soft (continuous masks) or stochastic hard (Gumbel-softmax Bernoulli masks)—attends more or less to each modality or individual feature channel, based on learned context. This enhances robustness under missing, delayed, or corrupted sensory input, and offers interpretability by visualizing modality/feature contributions per timestep.

Empirically, attention-based and selective fusion yield lower absolute and relative trajectory errors (ATE/RPE) than direct or naive fusion, especially under corrupted or imbalanced sensor conditions (Kurt et al., 2024, Chen et al., 2019).
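The deterministic soft-masking variant can be sketched in a few lines of pure Python. In the actual architecture the gate logits come from a small learned network conditioned on both modalities; here they are placeholder inputs, so this is an illustration of the masking mechanics only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_fusion(visual_feat, inertial_feat, gate_logits):
    """Deterministic soft fusion: per-channel sigmoid masks reweight each
    modality before concatenation. gate_logits has one entry per channel
    of the concatenated [visual, inertial] feature vector."""
    n = len(visual_feat)
    masks = [sigmoid(z) for z in gate_logits]
    gated_v = [m * f for m, f in zip(masks[:n], visual_feat)]
    gated_i = [m * f for m, f in zip(masks[n:], inertial_feat)]
    return gated_v + gated_i  # fused feature vector fed to the pose regressor
```

The stochastic hard-mask variant replaces the sigmoid with a Gumbel-softmax sample of a Bernoulli gate, which zeroes channels outright instead of attenuating them.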

5. Integration with Global and Auxiliary Modalities

Visual-inertial fusion is increasingly embedded within multi-modal navigation, mapping, and SLAM frameworks:

  • GNSS Fusion: Tightly-coupled factor graphs and filters now incorporate GNSS code, Doppler, and carrier-phase (even double-differenced) measurements, synchronizing visual-inertial trajectories with global ECEF or ENU coordinates. Optimal integration leverages IMU preintegration for correct time-alignment, automatic extrinsic calibration, and maintains cross-covariances between all states (Cioffi et al., 2020, Cao et al., 2021, Dong et al., 2023, Hu et al., 2024). Dropout-tolerant and uncertainty-aware schemes enable seamless operation through outages and challenging GNSS environments (Boche et al., 2022).
  • Neural Priors and NeRF Anchors: NVINS (Han et al., 2024) injects absolute-pose "anchors" with learned uncertainty from a NeRF-trained camera pose regressor into a factor graph, countering drift in standard VIO pipelines, and providing uncertainty-aware fusion via Bayesian MAP estimation.
  • LiDAR and Depth Sensing: LiDAR-inertial-visual fusion (e.g., through EKF, factor graphs, or joint optimization on Gaussians/splats) leveraging dense or sparse geometric priors can further constrain and initialize VIO pipelines in visually challenging scenarios (Hong et al., 2024, Zuo et al., 2019, Wei et al., 2025). Vision–depth–inertial methods extract robust descriptors, combine score maps for multimodal feature selection, and tightly couple depth cues into state estimation (Khattak et al., 2019).
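As an illustration of how a global position measurement enters such a factor graph, a whitened GNSS position residual might look like the sketch below. This is a generic factor under a rigid world-to-global alignment `(R_gw, t_gw)`, not any cited system's formulation; lever-arm and receiver-clock terms are omitted:

```python
def gnss_position_residual(p_vio, R_gw, t_gw, p_gnss, sigma):
    """Whitened residual tying a VIO position (world frame) to a GNSS fix
    (global frame) via an estimated alignment (R_gw, t_gw).

    p_vio, p_gnss: 3-vectors; R_gw: 3x3 rotation (nested lists);
    t_gw: 3-vector; sigma: per-axis GNSS standard deviation.
    """
    # Transform the VIO position into the global frame: p_g = R_gw p_w + t_gw
    p_pred = [sum(R_gw[i][j] * p_vio[j] for j in range(3)) + t_gw[i]
              for i in range(3)]
    # Whitened residual as used in the least-squares / factor-graph objective.
    return [(p_pred[i] - p_gnss[i]) / sigma for i in range(3)]
```

In a tightly coupled formulation the alignment `(R_gw, t_gw)` is itself part of the state and is refined jointly with the trajectory, which is what enables online global-frame alignment and extrinsic self-calibration.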

6. Performance Validation and Application Domains

State-of-the-art visual-inertial fusion pipelines are validated on diverse public datasets (KITTI, EuRoC, PennCOSYVIO, TUM-VI, and custom UAV/agile-locomotion datasets), where they demonstrate competitive trajectory accuracy and robustness across robotics, AR/VR, and navigation domains.

Implementation advances—such as square-root inverse filtering (Hu et al., 2024), Schur-complement–based online landmark elimination (Wei et al., 2025), and continuous-time global basis optimization (Zhang et al., 2024, Ovrén et al., 2018)—ensure real-time, numerically stable operation on embedded and resource-constrained platforms.

7. Future Research Directions and Open Challenges

Emerging directions in visual-inertial fusion include:

  • Further integration with neural priors and dense visual or geometric fields (e.g. NeRF, surface splats, volumetric representations), with uncertainty quantification and adaptive factor formulation (Han et al., 2024, Hong et al., 2024).
  • More expressive, context-dependent attention/fusion mechanisms, with explicit interpretability and dynamic reliability assessment (Kurt et al., 2024, Chen et al., 2019, Wei et al., 2025).
  • Precision global localization under urban/multipath conditions, incremental or online global frame alignment and extrinsic self-calibration (Dong et al., 2023, Hu et al., 2024).
  • Direct estimation of motion sub-states (e.g., velocity via event-based or high-speed sensing) decoupled from global position, for high-rate control and aggressive robotics (Xu et al., 2024).
  • Nonlinear, continuous-time, and hybrid time-frequency modeling frameworks, potentially extending the effectiveness of Chebyshev- or spline-based optimization to more general fusion scenarios (Zhang et al., 2024, Ovrén et al., 2018).

The field continues to advance toward comprehensive, statistically-principled, adaptive, and real-time state estimation by synthesizing principles from classical geometric estimation, probabilistic sensor fusion, and contemporary deep learning (Ovrén et al., 2018, Kurt et al., 2024, Han et al., 2024, Chen et al., 2019, Zhang et al., 2024).
