
Vision-Based State Estimation

Updated 3 February 2026
  • Vision-based state estimation is a technique that integrates visual data with sensor measurements to accurately infer state variables such as pose and velocity.
  • It employs recursive Bayesian filtering and sliding-window optimization to combine geometric image processing with inertial data for robust state inference.
  • Emerging hybrid and learning-based approaches enhance adaptability and resilience in challenging conditions, supporting diverse applications from UAV flight to autonomous landing.

Vision-based state estimation is the fusion of exteroceptive visual information—typically from one or more cameras—and, often, other sensor modalities to infer the state (pose, velocity, or other latent variables) of robotic systems, vehicles, or observed objects. This domain underpins modern SLAM, navigation, manipulation, and autonomous decision-making in robotics and related fields by exploiting the geometric and semantic richness of vision sensors. Vision-based state estimation algorithms span a wide spectrum, from classical geometric pipelines and optimization-based fusion to learning-mediated and hybrid approaches. This article surveys core methodologies, theoretical formulations, representative applications, and current challenges in vision-based state estimation, integrating recent advances from filtering, optimization, deep learning, and event-based paradigms.

1. Mathematical Foundations and Sensor Models

Vision-based state estimation relies on the fundamental mapping between scene state and image observations via projective geometry and sensor models. The system state $x$ may encode pose (e.g., $SE(3)$ elements), velocity, or more abstract latent features. Visual observations take the generic form $y_k = h(x_k) + v_k$, where $h$ is the camera (and possibly observation) model and $v_k$ is measurement noise.

The dominant image formation model is the pinhole (perspective) camera:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \,[R \mid T]\, \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$

with $K$ the intrinsic matrix, $[R \mid T]$ the extrinsics, and $s$ an arbitrary scale. Feature projections, homographies, and inverse operations enable the inference of metric state variables from pixel measurements, provided sufficient calibration and data association (Llorca et al., 2021).
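The pinhole model above can be sketched in a few lines of numpy; the intrinsics, extrinsics, and test point below are illustrative assumptions, not values from any cited system.

```python
import numpy as np

# Minimal pinhole projection sketch: world point -> pixel coordinates.
K = np.array([[500.0,   0.0, 320.0],   # fx, skew, cx
              [  0.0, 500.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera aligned with world axes
T = np.array([0.0, 0.0, 0.0])          # camera at world origin

def project(X_w):
    """Project a 3D world point to pixel coordinates (u, v)."""
    X_c = R @ X_w + T                  # world -> camera frame
    uvw = K @ X_c                      # apply intrinsics
    return uvw[:2] / uvw[2]            # divide out the scale s

u, v = project(np.array([1.0, 0.5, 5.0]))
# u = 500*1.0/5.0 + 320 = 420.0, v = 500*0.5/5.0 + 240 = 290.0
```

Inverting this mapping (e.g., via a known ground plane or stereo baseline) is what allows metric state to be recovered from pixels.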

Visual state estimation frequently fuses camera measurements with IMU data for metric scale recovery, dynamic observability, and robustness to occlusion or poor visual geometry (Zhang et al., 2024, Wisth et al., 2019). Recent pipelines extend to event-based cameras for low-latency egomotion (Greatorex et al., 20 Jan 2025), or leverage non-traditional image representations such as dense depth, semantic cues, or object detections (Gao et al., 27 Feb 2025).

2. Filtering and Optimization-based Fusion Frameworks

Two main algorithmic schools—recursive Bayesian filtering and optimization—dominate the state estimation landscape, often hybridized in modern systems.

  • Filtering (EKF, InEKF, MSCKF): State is recursively estimated using Bayesian updates as new data arrives. Extended Kalman Filters (EKF) for vision incorporate measurement models derived from image geometry, with image Jacobians corresponding to spatial image gradients in the infinite-dimensional case (Varley et al., 23 Sep 2025). Invariant EKFs (InEKF) on Lie groups handle manifold-valued state (pose and features) with error representations consistent with group structure, critical for maintaining consistency, especially under marginalization and delayed updates (Gao et al., 27 Feb 2025).
  • Sliding Window/Batch Optimization: State is estimated by minimizing a cost function over a batch or sliding window, incorporating visual reprojection residuals and, where applicable, IMU preintegration residuals (Dinh et al., 2019, Zhang et al., 2024). These systems optimize over pose, structure, and dynamic variables, with explicit Jacobians for visual and IMU terms. Modern backends use factor graphs (e.g., iSAM2 in GTSAM (Wisth et al., 2019)) to incrementally solve the MAP estimate, efficiently marginalizing states outside the window.
  • Equivalence and Divergence: Theoretically, under identical linearization and marginalization, filtering (information form) and batch optimization yield identical state estimates with i.i.d. Gaussian noise (Zhu et al., 2024). Divergence in practice arises from different marginalization schedules, state augmentation schemes, and Jacobian updating (e.g., First Estimate Jacobian—FEJ in filtering). Strict equivalence can be restored in sliding-window filters by adopting a two-step update that matches marginalization to optimization (Zhu et al., 2024).
  • Continuous-Time Trajectory Representations: Chebyshev polynomial trajectory parameterizations enable fully continuous-time representations, fusing asynchronous measurements analytically and mitigating preintegration approximations (Zhang et al., 2024).
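As a minimal illustration of the filtering branch above, the sketch below runs a single EKF measurement update against a 1D pinhole observation $u = f\,p_x/p_z + c_x$; the two-element state, covariances, and observed pixel are illustrative assumptions, not a cited pipeline.

```python
import numpy as np

# One EKF measurement update with a 1D pinhole observation model.
# State: camera-frame point position x = [px, pz]. All numbers illustrative.
f, cx = 500.0, 320.0

x = np.array([1.0, 5.0])             # prior mean
P = np.diag([0.04, 0.25])            # prior covariance
R_meas = np.array([[1.0]])           # pixel measurement noise variance

def h(x):
    """Predicted pixel coordinate of the point."""
    return np.array([f * x[0] / x[1] + cx])

def H_jac(x):
    """Measurement Jacobian dh/dx, linearized at the current estimate."""
    px, pz = x
    return np.array([[f / pz, -f * px / pz**2]])

z = np.array([424.0])                # observed pixel (illustrative)
H = H_jac(x)
S = H @ P @ H.T + R_meas             # innovation covariance
K_gain = P @ H.T @ np.linalg.inv(S)  # Kalman gain
x = x + K_gain @ (z - h(x))          # corrected state
P = (np.eye(2) - K_gain @ H) @ P     # corrected covariance
```

A sliding-window optimizer would instead stack many such reprojection residuals over a window of poses and minimize their weighted sum; as noted above, under matched linearization and marginalization the two routes coincide.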

3. Learning-Enhanced and Hybrid Vision-State Fusion

Deep learning and hybrid pipelines mediate vision-based state estimation at higher levels of abstraction:

  • End-to-End Regression: CNNs regress 3D pose, velocity, or force estimates directly from image data, sometimes leveraging auxiliary state inputs such as robot kinematics or history. Fusing state into the visual pipeline via parallel MLP branches leads to substantial accuracy gains across manipulation and UAV tasks, with particular benefits in non-egocentric target estimation (e.g., human pose from a following drone) (Cereda et al., 2022).
  • Multimodal and Hybrid Estimators: Hybrid frameworks combine analytical filters (e.g., Kalman, MPC-augmented rigid-body models) with deep representations of visual modalities. OptiState, for instance, employs a Vision Transformer autoencoder to compress depth input, then fuses these latent codes with proprioceptive estimates in a gated recurrent network for legged robots—yielding improvements over pure VIO or model-based approaches (Schperberg et al., 2024).
  • Event-Driven and Low-Latency Learning: Spiking neural network architectures can process event-camera data for egomotion estimation entirely within the event domain, relying on fixed, shallow circuit designs with precise spike timing to encode optical flow and recover state (Greatorex et al., 20 Jan 2025).
  • Robustness and Generalization: Learning-based formulations integrating both vision and state are more robust to domain shifts for force, pose, or motion estimation, outperforming pure vision or state-alone baselines across axes of tool, material, and viewpoint changes (Chua et al., 2020).
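The parallel-branch fusion pattern described above can be sketched as a forward pass: a stand-in visual encoder and a small state MLP are fused by concatenation before a regression head. Layer sizes and the random weights are assumptions for illustration only, not any cited architecture.

```python
import numpy as np

# Sketch of state-into-vision fusion via a parallel MLP branch (pattern only).
rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Vision branch: stand-in for a CNN encoder mapping image features -> embedding.
W_img = rng.standard_normal((64, 32)) * 0.1
# State branch: small MLP on proprioceptive/kinematic state.
W_state = rng.standard_normal((6, 8)) * 0.1
# Fusion head: concatenated embeddings -> regressed 3D pose.
W_head = rng.standard_normal((40, 3)) * 0.1

def estimate_pose(img_feats, state):
    z_img = relu(img_feats @ W_img)        # visual embedding
    z_state = relu(state @ W_state)        # state embedding
    z = np.concatenate([z_img, z_state])   # late fusion by concatenation
    return z @ W_head                      # regressed 3D pose

pose = estimate_pose(rng.standard_normal(64), np.zeros(6))
```

The key design choice is that the state branch runs in parallel with, rather than replacing, the visual pathway, so the network can exploit either modality when the other degrades.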

4. Robustness to Degraded Conditions and Non-Standard Environments

Vision-based state estimation faces challenges under degraded or non-standard sensory conditions, often mitigated by modeling choices, sensor fusion, or by pivoting to object-level representations:

  • Nocturnal/Object-Level State Estimation: Frameworks such as Night-Voyager exploit prior object maps (e.g., streetlights) and data association strategies to maintain consistent and accurate estimation even in severe low-light conditions, where pixel-level methods fail (Gao et al., 27 Feb 2025). Object-level measurements and invariant extended Kalman filters on the appropriate Lie groups provide robustness and efficiency.
  • Event-Based Sensing: Event-domain pipelines sidestep frame-based vision limitations, achieving low-latency and power efficiency, and provide strong accuracy in motion tracking compared to classical vision and learned approaches (Greatorex et al., 20 Jan 2025).
  • Certified State Estimation: Approaches using reachability analysis and mixed monotonicity can yield certified error bounds on the output of vision-based estimators, shielding learning-based or geometric estimators against adversarial or high-noise inputs for tasks such as autonomous landing (Leal et al., 2023).
  • Challenging Dynamics and Recovery: Vision-inertial systems using drift-compensation, confidence-aware landmark detection, and fallback to IMU or wheel odometry maintain performance during fast UAV flight (Novák et al., 2 Feb 2026) or under vision loss in ground robots (Gang et al., 2020).
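The certified-bounds idea can be illustrated with a toy monotone measurement model: if altitude relates to the pixel width of a marking of known size via the pinhole model $z = fW/w$, interval endpoints on the measurement map directly to certified endpoints on the state. The focal length, marking width, and pixel bounds are illustrative assumptions; real certified estimators use reachability analysis over far richer models.

```python
# Toy interval sketch of certified state estimation from a bounded measurement.
f = 500.0          # focal length, pixels (illustrative)
W = 10.0           # known physical width of the marking, metres (illustrative)

def altitude_bounds(w_lo, w_hi):
    """Certified altitude interval from a pixel-width interval [w_lo, w_hi]."""
    assert 0 < w_lo <= w_hi
    # z = f*W/w is monotone decreasing in w, so endpoints map to endpoints.
    return f * W / w_hi, f * W / w_lo

z_lo, z_hi = altitude_bounds(48.0, 52.0)   # detector output +/- 2 px
# z_lo = 5000/52 ~ 96.15 m, z_hi = 5000/48 ~ 104.17 m
```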

5. Practical Applications and Performance Benchmarks

Vision-based state estimation has demonstrated efficacy in a broad array of robotics and autonomy contexts:

  • Autonomous Landing and Relative Navigation: Tightly coupled monocular or vision-inertial estimators enable precise aerial or helicopter landing, leveraging manifold optimization and consistent IMU preintegration (Dinh et al., 2019, Bouazza et al., 22 Dec 2025). Relative state estimation using extended preintegration and graph optimization frameworks generalizes to leader–follower and multi-platform setups (Xia et al., 2023).
  • Legged Locomotion: Factor graph–based fusion of visual, inertial, and leg odometry improves robustness to occlusion, low texture, and slipping, cutting drift by up to 76% in real industrial settings (Wisth et al., 2019). Learning-enhanced frameworks further reduce RMSE over VIO and supply uncertainty quantification (Schperberg et al., 2024).
  • Fast UAV Flight: In vision-only, GNSS-denied drone racing, real-time landmark detection, VIO drift compensation, and IMU fusion yield robust 6DOF state estimation under aggressive flight (Novák et al., 2 Feb 2026).
  • Vehicle Speed and Target Tracking: Geometry-based, motion-model, and data-driven pipelines deliver sub-3% mean absolute errors in vision-based speed estimation; fusion with Bayesian filters or learning further enhances robustness (Llorca et al., 2021).
  • Small Body/Satellite Approach: Vision-based feature tracking and geometric routines reconstruct rotation axes and centers of celestial bodies to within $10^\circ$ for the majority of synthetic test cases, informing mission planning (Panicucci et al., 2023).
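A minimal version of the geometry-based speed-estimation pipeline mentioned above: pixel tracks are back-projected to the ground plane through a homography, then differentiated over time. The homography entries and the pixel track are illustrative assumptions.

```python
import numpy as np

# Geometry-based speed estimation sketch: pixel track -> ground plane -> speed.
H = np.array([[0.02, 0.0,  -6.4],    # pixel -> ground-plane metres (assumed)
              [0.0,  0.05, -12.0],
              [0.0,  0.0,    1.0]])

def to_ground(u, v):
    """Map a pixel to metric ground-plane coordinates via the homography."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]              # dehomogenize

def speed(px0, px1, dt):
    """Mean speed (m/s) between two pixel observations dt seconds apart."""
    return np.linalg.norm(to_ground(*px1) - to_ground(*px0)) / dt

v_mps = speed((400.0, 300.0), (430.0, 300.0), dt=0.1)
# 30 px * 0.02 m/px = 0.6 m over 0.1 s -> 6.0 m/s
```

Bayesian filtering over many such instantaneous estimates is what pushes mean absolute error toward the sub-3% figures reported above.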

The following table summarizes representative accuracy gains and domains:

| System | State Type | Accuracy/Finding | Reference |
|---|---|---|---|
| VILENS (ANYmal) | 6DOF pose, velocity | 76% ATE reduction vs. odom+IMU | (Wisth et al., 2019) |
| Vision+State Deep | 3D pose, force | ≥24% MAE reduction, better R² | (Cereda et al., 2022; Chua et al., 2020) |
| Chebyshev VINS | Full 6DOF | 31–77% velocity/pos. RMSE drop | (Zhang et al., 2024) |
| Event-based SNN | Yaw rate | ARRE 0.00014–0.00086 rad | (Greatorex et al., 20 Jan 2025) |
| Night-Voyager | Full pose | ≤0.6 m / 1.2° error in 10 scenes | (Gao et al., 27 Feb 2025) |
| UAV Racing Fusion | 6DOF, drift | 0.65 m pos. RMSE vs 17.4 m (VIO) | (Novák et al., 2 Feb 2026) |

6. Emerging Directions and Outstanding Challenges

Several active research challenges and future directions define the field:

  • Data Association and Representation: Robust feature or object association, particularly across lighting, weather, and domain shifts (nighttime, urban long-term, dynamic scenes), remains key. Methods exploiting object-level priors, learned semantics, or multi-modal fusion are promising (Gao et al., 27 Feb 2025).
  • Consistency and Theoretical Guarantees: Preserving observable subspace structure and filter consistency, especially in manifold state spaces under delayed or partial measurements, is a central design goal—addressed by feature decoupling and Lie group invariant error representations (Gao et al., 27 Feb 2025, Zhu et al., 2024).
  • Learning and Adaptation: Deployment of learning-based state estimators with provable or certified guarantees, effective uncertainty quantification, and adaptability to novel environments is an area of rapid progress (Leal et al., 2023, Schperberg et al., 2024, Chua et al., 2020).
  • Resource Efficiency: Event-based pipelines and efficient graph/filter designs facilitate real-time execution on edge and ultra-low-power platforms (Greatorex et al., 20 Jan 2025, Wisth et al., 2019, Cereda et al., 2022).
  • Hybrid Modular Pipelines: Combining model-based physics, visual geometry, and learned corrections in a tightly-coupled and interpretable manner achieves optimal performance under diverse conditions (Schperberg et al., 2024).
  • Community Benchmarks: The need for larger, more representative datasets (especially for night, event, or tactile scenarios), and standardized evaluation metrics informed by both practical deployment and theoretical consistency, is critical (Llorca et al., 2021).

Vision-based state estimation remains a fundamental, rapidly-evolving research area that bridges structured geometric inference, probabilistic filtering, graphical optimization, and deep learning. Continued convergence of these paradigms is expected to yield robust, efficient, and certifiable estimators for next-generation autonomous systems under a wide range of operational regimes.
