Relative Camera Pose Estimation (RCPE)
- RCPE is the process of recovering relative camera rotation and translation, foundational for visual odometry, SLAM, and 3D reconstruction.
- Methods range from classical epipolar geometry with minimal solvers to deep learning models that integrate geometric constraints during training.
- Recent advances leverage sensor fusion, probabilistic models, and robust optimization to enhance performance under challenging imaging conditions.
Relative camera pose estimation (RCPE) is the process of recovering the geometric relationship—specifically, the rotation and translation—between two or more camera views observing a rigid scene. This estimation is fundamental to a range of geometric computer vision tasks, including visual odometry (VO), simultaneous localization and mapping (SLAM), structure-from-motion (SfM), 3D object reconstruction, and multi-robot collaboration. RCPE methods span classic epipolar geometry solvers, direct and correspondence-based deep learning, multimodal sensor fusion, and advanced architectures for challenging conditions such as rolling shutter, wide baselines, or minimal overlap. Contemporary research analyzes both the algorithmic underpinnings and the empirical limitations in natural and artificial scenes.
1. Classical Epipolar Geometry and Minimal Solvers
The canonical RCPE formulation for two calibrated perspective cameras is based on the epipolar constraint

$$\mathbf{x}_2^\top E\,\mathbf{x}_1 = 0, \qquad E = [\mathbf{t}]_\times R,$$

where $E$ is the essential matrix, $R$ is the rotation, $\mathbf{t}$ is the translation (up to scale), and $\mathbf{x}_1, \mathbf{x}_2$ are normalized image points. Estimating $E$ from at least five correspondences leads to minimal solvers (Nistér's 5-point, Stewénius' 6-point, the 8-point algorithm), which are the backbone of RANSAC pipelines for outlier rejection (Garcia-Salguero et al., 2020, Garcia-Salguero et al., 2021). Increased robustness and certification of global optimality are achieved by manifold optimization on the essential manifold, and by formulating the estimation as a QCQP whose Lagrangian dual can be checked a posteriori to certify global optimality (Garcia-Salguero et al., 2020). Efficient Riemannian trust-region solvers, together with closed-form dual certificates, achieve both high accuracy and real-time performance (Garcia-Salguero et al., 2021), and can be robustified against outliers by embedding them in Black–Rangarajan and Graduated Non-Convexity frameworks.
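The epipolar constraint can be verified numerically on synthetic geometry. The following is a minimal NumPy sketch; the scene, pose values, and helper names are illustrative, not taken from any cited solver:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x with [v]_x @ u = v x u."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rodrigues(axis, angle):
    """Rotation matrix from an axis-angle pair (Rodrigues' formula)."""
    axis = axis / np.linalg.norm(axis)
    K = skew(axis)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

R = rodrigues(np.array([0.2, 1.0, -0.3]), 0.4)   # relative rotation
t = np.array([1.0, 0.1, 0.2])                    # relative translation (any scale)
E = skew(t) @ R                                  # essential matrix E = [t]_x R

# Points in the camera-1 frame, projected into both (normalized) image planes.
rng = np.random.default_rng(0)
X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], (20, 3))
x1 = X / X[:, 2:3]
X2 = (R @ X.T).T + t
x2 = X2 / X2[:, 2:3]

# Every true correspondence satisfies x2^T E x1 = 0 up to float precision.
residuals = np.einsum('ni,ij,nj->n', x2, E, x1)
```

The same residual is what a RANSAC loop thresholds to separate inliers from outliers.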
For multi-camera rigs, recent advances have yielded minimal solvers based on affine correspondences (2AC) that drastically reduce RANSAC sample complexity and enable real-time RCPE in general multi-camera configurations (Guan et al., 2023, Zhao et al., 2021). These exploit the additional constraints afforded by patch-level affine shape and can handle inter/intra-camera matches and mixed configurations.
2. Deep Learning Architectures for RCPE
Deep learning models for RCPE can be divided into direct regressors and correspondence-based (or hybrid) approaches. Direct regressors, such as RPNet (En et al., 2018), RelMobNet (Rajendran et al., 2022), and various Siamese CNNs (Melekhov et al., 2017), map image pairs to relative rotation and translation, typically expressed as 6D rotation representations or quaternions plus 3D Euclidean translations. Training losses combine weighted L2 distances on rotation and translation; recent curriculum designs (RelMobNet) eliminate the brittle loss-weighting hyperparameter by separating normalized and true-scale regression (Rajendran et al., 2022). Compared to correspondence+RANSAC pipelines, these architectures are more robust to textureless regions and repetitive patterns, and some (RPNet, RelMobNet) can infer metric translation, not just direction (En et al., 2018, Rajendran et al., 2022). However, their precision can lag behind feature-matching methods, and they may degrade under strong domain shift.
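To make the representation and loss choices concrete, here is a hedged NumPy sketch of the 6D rotation parameterization (Gram–Schmidt orthogonalization of two 3-vectors) and a weighted L2 pose loss; the function names and the fixed weight `beta` are illustrative, not any specific network's code:

```python
import numpy as np

def rotation_from_6d(r6):
    """Map a 6D vector to a valid rotation matrix via Gram-Schmidt --
    the continuous rotation representation many pose regressors use
    in place of quaternions."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - (b1 @ a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)  # columns: orthonormal, right-handed frame

def pose_loss(R_pred, t_pred, R_gt, t_gt, beta=1.0):
    """Weighted L2 pose loss of the kind the text describes: chordal
    rotation distance plus beta-weighted translation error. beta is the
    brittle hyperparameter curriculum designs aim to remove."""
    rot_term = np.linalg.norm(R_pred - R_gt, ord='fro') ** 2
    return rot_term + beta * np.linalg.norm(t_pred - t_gt) ** 2

# Any 6D input yields a proper rotation: orthonormal, determinant +1.
Rhat = rotation_from_6d(np.array([1.0, 0.2, -0.1, 0.3, 1.0, 0.5]))
```

The appeal of the 6D form is that, unlike quaternions, it has no discontinuities over SO(3), which makes regression targets better behaved.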
A key innovation is the integration of geometric constraints at the loss layer, enabling end-to-end learning that directly minimizes geometric inconsistency, for example via direct minimization of the epipolar constraint or via differentiable bundle adjustment modules (Jau et al., 2020). Such systems, incorporating learnable detection, description, matching, and outlier rejection—with the entire pipeline being differentiable—push the performance of deep learning RCPE toward or even on par with classical pipelines.
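One common form of such a geometric loss is the (first-order) Sampson epipolar distance. The sketch below is self-contained NumPy; the demo pose and noise level are invented for illustration:

```python
import numpy as np

def sampson_epipolar_loss(x1, x2, E):
    """Mean squared Sampson distance -- a differentiable geometric loss
    that penalizes violation of x2^T E x1 = 0.
    x1, x2: (N, 3) normalized homogeneous points."""
    Ex1 = x1 @ E.T              # epipolar lines in image 2
    Etx2 = x2 @ E               # epipolar lines in image 1
    num = np.einsum('ni,ni->n', x2, Ex1) ** 2
    den = Ex1[:, 0]**2 + Ex1[:, 1]**2 + Etx2[:, 0]**2 + Etx2[:, 1]**2
    return float(np.mean(num / den))

# Synthetic pose and correspondences (illustrative values).
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.0, 0.2])
tx = np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])
E = tx @ R

rng = np.random.default_rng(1)
X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], (30, 3))
x1 = X / X[:, 2:3]
X2 = (R @ X.T).T + t
x2 = X2 / X2[:, 2:3]

loss_clean = sampson_epipolar_loss(x1, x2, E)   # true pose: essentially zero
x1_noisy = x1.copy()
x1_noisy[:, :2] += rng.normal(0.0, 0.01, (30, 2))
loss_noisy = sampson_epipolar_loss(x1_noisy, x2, E)
```

Because every operation is smooth in the point coordinates, the same expression can backpropagate through learned detection and matching layers.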
An alternative paradigm is discrete distribution learning, typified by DirectionNet (Chen et al., 2021), which regresses a discretized spherical distribution over pose parameters rather than a single value, thereby capturing multimodality and inherent ambiguity in wide-baseline settings.
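The idea of regressing a distribution over directions rather than a point estimate can be sketched with a toy discretization of S² (a Fibonacci grid and a softmax — this is an assumption-laden illustration, not DirectionNet's actual architecture):

```python
import numpy as np

def fibonacci_sphere(n):
    """Quasi-uniform grid of n unit vectors on the sphere S^2."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def expected_direction(logits, grid):
    """Turn per-cell logits into a discrete distribution over directions
    (softmax), then take the renormalized probability-weighted mean."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    mean = p @ grid
    return mean / np.linalg.norm(mean)

grid = fibonacci_sphere(1000)
target = np.array([0.0, 0.0, 1.0])
logits = 20.0 * (grid @ target)        # distribution sharply peaked at +z
d = expected_direction(logits, grid)   # recovers a direction near +z
```

The benefit over a single regressed vector is that a flat or bimodal set of logits directly exposes ambiguity, which wide-baseline pairs frequently exhibit.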
3. Robustness to Challenging Imaging: Rolling Shutter, Wide Baselines, Sparse/No Overlap
Robust RCPE under rolling shutter distortion, wide baselines, and sparse overlap has motivated significant methodological advances. For consumer rolling-shutter cameras, incorporating inertial (IMU) measurements—gravity for roll/pitch, and gyro for angular rate—enables reduction of the minimal problem and yields efficient 9- or 11-point RS solvers (Lee et al., 2017). These solvers utilize the specific structure of the rolling-shutter epipolar constraint and, by fusing inertial information, reduce required correspondences by an order of magnitude.
For wide-baseline or minimal-overlap scenarios, discrete-distribution approaches—such as DirectionNet's factorization of the 5D relative pose space into distributions on S²—achieve significantly reduced errors compared to direct regression or SIFT-based pipelines, particularly when classic local features fail (Chen et al., 2021). Scene completion and hybrid representations (360°, 2D layout, planar patches) support RCPE where there is little or no geometric overlap, as in the Extreme Relative Pose Network (Yang et al., 2019).
Energy-based models, such as RelPose (Zhang et al., 2022), leverage top-down priors to estimate explicit multi-modal distributions over relative rotation, particularly addressing object symmetry and ambiguous viewpoints, and can be jointly optimized over multiple views via block-coordinate ascent.
4. Hybrid, Distributional, and Optimization-based Pipelines
Hybrid approaches fuse geometric and learned estimation. FAR combines a learnable transformer-based pose prior with classical 5-point solvers, using the transformer to infer scale and to guide RANSAC in both sampling and scoring (Rockwell et al., 2024). The fusion is controlled by learned weights, and the prior can be iteratively refined over the course of sampling. Similarly, SRPose (Yin et al., 2024) unifies sparse keypoints, intrinsic-calibration-aware position encoding, promptable attention, and end-to-end regression, yielding low-latency inference while remaining robust to variations in input resolution and camera intrinsics.
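To make the prior-guided scoring idea concrete, here is a toy, hand-weighted version of hypothesis scoring in which data support is discounted by distance to a predicted prior pose. The Gaussian form and `sigma_deg` are assumptions for illustration; FAR's actual fusion weights are learned:

```python
import numpy as np

def prior_weighted_score(inlier_count, R_hyp, R_prior, sigma_deg=10.0):
    """Toy prior-guided RANSAC scoring: the inlier count of a hypothesis
    is down-weighted by its geodesic distance to a prior rotation."""
    cos = (np.trace(R_prior.T @ R_hyp) - 1.0) / 2.0
    ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return inlier_count * np.exp(-0.5 * (ang / sigma_deg) ** 2)

# A hypothesis matching the prior keeps its full score; one 30 degrees
# away with the same inlier count is heavily discounted.
a = np.radians(30.0)
R_far = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0, 0.0, 1.0]])
score_near = prior_weighted_score(100, np.eye(3), np.eye(3))
score_far = prior_weighted_score(100, R_far, np.eye(3))
```

The same discount can bias sampling as well as scoring, so that minimal samples consistent with the prior are drawn more often.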
Distributional and probabilistic models, e.g. GARPS (Li et al., 17 Sep 2025), bypass 2D matching altogether. Instead, they obtain independent, direct single-view 3D Gaussian reconstructions from monocular depth networks, and align these by maximizing a differentiable GMM overlap objective that jointly accounts for metric geometry, colour, and semantics. The result is a robust, metric RCPE estimation even for wide baselines and untextured regions.
Optimization-based pipelines further include direct minimization of the cheirality constraint (i.e. positivity of depth) via the normal flow, as in DiffPoseNet (Parameshwara et al., 2022), and iterative optimization over trifocal constraints via points and lines for RCPE over three views (Qadir et al., 2017).
5. Sensor Fusion, Cooperative and Multi-agent RCPE
Beyond monocular vision, RCPE has been extended to encompass multi-modal sensor fusion and cooperative settings. CREPES (Xun et al., 2023) employs tightly-integrated fusion of active infrared LED markers, fish-eye cameras, UWB modules (for ranging), and IMUs, with an error-state Kalman filter and pose-graph optimization. This enables real-time, metric 6DOF pose estimation between robots in challenging conditions (dark, occlusions, long range), with sub-decimeter and sub-degree accuracy. For UAV swarm scenarios, dual-channel feature association combined with relative MSCKF achieves real-time RCPE at full frame-rate on embedded hardware, by combining a lightweight Lucas-Kanade tracking front-end with periodic high-quality learned matches and visual-inertial odometry increments (Wang et al., 2024). These systems demonstrate that robust, scalable, and decentralized RCPE is feasible with commodity and low-cost sensor ensembles.
6. Failure Modes and Theoretical Limits
Recent diagnostic benchmarks have identified persistent failure modes for models that do not explicitly respect projective and geometric constraints. Vision-language models (VLMs), despite proficiency in spatial attention and 2D reasoning, markedly underperform on RCPE tasks requiring robust inference of out-of-plane translation and roll (Deng et al., 29 Jan 2026). Even strong VLMs such as GPT-5 trail classical geometric pipelines (F1 0.64 vs. 0.97 for LoFTR+RANSAC), indicating the need for explicit geometric priors or modules and for consistency losses that enforce multi-view SE(3) structure. Error breakdowns show that failures concentrate on depth (out-of-plane) motion and roll, and that multi-image relational cues remain ungrounded in most current VLMs. This reinforces the continued importance of geometric constraints and multi-view consistency in RCPE system design.
7. Benchmarks and Quantitative Evaluation
RCPE algorithms are evaluated on metrics including:
- Angular rotation error: $\theta = \arccos\big(\tfrac{1}{2}(\operatorname{tr}(R_{\mathrm{gt}}^\top R_{\mathrm{est}}) - 1)\big)$
- Translation angular or Euclidean error (some methods are up-to-scale only)
- Absolute and relative trajectory errors (RTE, ATE, RMSE, median)
- Task-specific accuracy (e.g. ADD for objects, relocalization precision, 3D alignment error)
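The two angular metrics reduce to a few lines of NumPy; a reference sketch (function names are illustrative):

```python
import numpy as np

def rotation_angle_error(R_est, R_gt):
    """Geodesic rotation error in degrees: the angle of R_gt^T R_est."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_angle_error(t_est, t_gt):
    """Angular error (degrees) between translation directions -- the
    standard metric when translation is recovered only up to scale."""
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# A 5-degree rotation about z scores exactly 5 degrees against identity.
a = np.radians(5.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
err_rot = rotation_angle_error(Rz, np.eye(3))
```

The `np.clip` guards against arccos domain errors from floating-point drift, a common pitfall when evaluating near-identical poses.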
Key public datasets include the Cambridge Landmarks (En et al., 2018, Rajendran et al., 2022), DTU Robot (Melekhov et al., 2017), RealEstate10K (Li et al., 17 Sep 2025), CO3D (Zhang et al., 2022), Matterport3D, InteriorNet, ScanNet, TartanAir, KITTI, and multi-agent settings (PennCOSYVIO (Lee et al., 2017), UAV experiment fields (Wang et al., 2024), VIRAL (Wang et al., 2024)). State-of-the-art pipelines—GARPS (Li et al., 17 Sep 2025), FAR (Rockwell et al., 2024), SRPose (Yin et al., 2024), DirectionNet (Chen et al., 2021)—consistently outperform both traditional and direct deep networks on these metrics, especially for wide baselines, minimal overlap, and challenging conditions.
In summary, the RCPE field synthesizes decades of geometric computer vision with cutting-edge deep models and sensor fusion, offering a spectrum of algorithmic approaches—manifold optimization, deep regression, probabilistic and distributional models, robust minimal solvers, and multi-sensor fusion. Progress is quantifiable, especially for robust estimation under adverse imaging, and ongoing work focuses on both theoretical guarantees and real-world generalizability.