Mr. Virgil: Learning Multi-robot Visual-range Relative Localization

Published 11 Dec 2025 in cs.RO | (2512.10540v1)

Abstract: Ultra-wideband (UWB)-vision fusion localization has achieved extensive applications in the domain of multi-agent relative localization. The challenging matching problem between robots and visual detection renders existing methods highly dependent on identity-encoded hardware or delicate tuning algorithms. Overconfident yet erroneous matches may bring about irreversible damage to the localization system. To address this issue, we introduce Mr. Virgil, an end-to-end learning multi-robot visual-range relative localization framework, consisting of a graph neural network for data association between UWB rangings and visual detections, and a differentiable pose graph optimization (PGO) back-end. The graph-based front-end supplies robust matching results, accurate initial position predictions, and credible uncertainty estimates, which are subsequently integrated into the PGO back-end to elevate the accuracy of the final pose estimation. Additionally, a decentralized system is implemented for real-world applications. Experiments spanning varying robot numbers, simulation and real-world, occlusion and non-occlusion conditions showcase the stability and exactitude under various scenes compared to conventional methods. Our code is available at: https://github.com/HiOnes/Mr-Virgil.


Summary

The paper introduces a learning-based framework for multi-robot visual-range relative localization, using a GNN for probabilistic data association and a differentiable pose graph optimization back-end.
It leverages both simulated and real-world experiments to validate significant reductions in RMSE compared to conventional odometry and heuristic matching methods.
The decentralized design and uncertainty-aware matching enable robust performance under occlusions and noisy sensor conditions in diverse multi-agent scenarios.

Mr. Virgil: A Learning-Based Framework for Multi-Robot Visual-Range Relative Localization

Introduction and Motivation

The paper "Mr. Virgil: Learning Multi-robot Visual-range Relative Localization" (2512.10540) introduces a decentralized, end-to-end, learning-based system for accurate and robust relative localization in multi-robot (particularly drone swarm) scenarios, emphasizing operation under challenging occlusions and with noisy sensor modalities. The primary challenge addressed is multimodal data association—specifically, matching UWB range measurements with anonymous visual detections without resorting to ID-encoded hardware or hand-crafted association schemes. The authors propose to solve this through a learnable GNN-based front-end, which provides probabilistic soft associations, and a differentiable pose graph optimization (PGO) back-end for refining pose estimates.

System Architecture

The proposed architecture consists of two tightly coupled, differentiable modules: a graph neural network for data association and position prediction, and a differentiable PGO module for optimal pose estimation (Figure 1). The framework is implemented in a fully decentralized configuration to allow for scalable, real-time execution in real multi-robot scenarios (Figure 2).

Figure 1: The pipeline of the end-to-end multi-robot localization network, depicting GNN-based association, uncertainty modeling, and differentiable pose graph optimization.

Figure 2: The decentralized system design implemented with ROS, LibTorch, and Ceres Solver.

GNN-Based Graph Matching Front-end

The front-end formulates the association problem as a graph matching task between ID-aware UWB/ranging priors and ID-less visual detections using a multi-layer attentional GNN (Figure 3), inspired by advances in feature matching for computer vision. The network aggregates both intra-set (priors or detections) and inter-set (prior–detection) relationships to encode global structure and similarity, enabling matching across any number of robots and observations. The association is solved as a regularized (soft) assignment problem with the Sinkhorn algorithm.

Figure 3: The architecture of the graph matching network, emphasizing attentional layers and cross-set information flow.

This front-end produces, for each possible pair, soft assignment scores and predictive uncertainty estimates (covariances), critical for downstream optimization and mitigating irreversible errors from overconfident but wrong associations.
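To make the soft-assignment step concrete, the following is a minimal sketch of log-domain Sinkhorn normalization with an extra "dustbin" row and column, so that spurious detections and unobserved robots can stay unmatched. The dustbin construction, the function name `sinkhorn_with_dustbin`, and the hand-written score matrix are assumptions for illustration; in the paper the scores come from the GNN, and its exact formulation may differ.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sinkhorn_with_dustbin(scores, dustbin_score=0.0, iters=100):
    """Turn an m x n score matrix (priors x detections) into a soft
    assignment of shape (m+1) x (n+1); the last row/column are dustbins."""
    m, n = len(scores), len(scores[0])
    # Augment the scores with one dustbin column and one dustbin row.
    S = [list(row) + [dustbin_score] for row in scores]
    S.append([dustbin_score] * (n + 1))
    # Target marginals (in log space): every real prior or detection carries
    # mass 1; each dustbin can absorb everything from the other set.
    log_a = [0.0] * m + [math.log(n)]
    log_b = [0.0] * n + [math.log(m)]
    u = [0.0] * (m + 1)
    v = [0.0] * (n + 1)
    for _ in range(iters):
        # Alternate row and column normalization in log space.
        u = [log_a[i] - logsumexp([S[i][j] + v[j] for j in range(n + 1)])
             for i in range(m + 1)]
        v = [log_b[j] - logsumexp([S[i][j] + u[i] for i in range(m + 1)])
             for j in range(n + 1)]
    return [[math.exp(S[i][j] + u[i] + v[j]) for j in range(n + 1)]
            for i in range(m + 1)]

# Hypothetical scores: 2 ranging priors, 3 detections (the third is spurious).
scores = [[5.0, 0.0, 0.0],
          [0.0, 5.0, 0.0]]
P = sinkhorn_with_dustbin(scores, iters=500)
for row in P:
    print(["%.3f" % p for p in row])
```

Each real prior's row and each real detection's column sums to (approximately) one, so an entry can be read as a match probability, and the mass a detection places in the dustbin row indicates how likely it is to be spurious.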

Differentiable Pose Graph Optimization Back-end

The PGO back-end fuses the probabilistically weighted associations and raw measurements to jointly optimize 6-DoF robot poses in the selected reference frame. The optimization includes three main constraints: mutual observations (with uncertainty weighting), pose priors, and direct UWB ranging. The system leverages the Levenberg-Marquardt algorithm with exact gradients propagated back to the GNN to realize joint optimization and error correction.
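As a rough illustration of how the three constraint types interact, here is a toy Levenberg-Marquardt solver for a single neighbor's 2D position, with a pose prior, a visual observation scaled by a soft-association weight `w`, and a UWB range to the origin. This is a deliberately simplified sketch: the paper optimizes 6-DoF poses with its own back-end, and the noise levels, the way `w` enters the cost, and all names below are assumptions made for this example.

```python
import math

# Made-up noise levels in metres for this toy problem (not from the paper).
SIG_PRIOR, SIG_OBS, SIG_UWB = 0.5, 0.1, 0.05

def residuals(p, prior, z, w, d):
    """Whitened residuals for a neighbor at p = (x, y), observer at origin."""
    sw = math.sqrt(w)  # the association weight scales the visual term
    return [
        (p[0] - prior[0]) / SIG_PRIOR,            # pose prior
        (p[1] - prior[1]) / SIG_PRIOR,
        sw * (p[0] - z[0]) / SIG_OBS,             # weighted visual observation
        sw * (p[1] - z[1]) / SIG_OBS,
        (math.hypot(p[0], p[1]) - d) / SIG_UWB,   # UWB range
    ]

def jacobian(p, w):
    """Jacobian of the residual vector with respect to (x, y)."""
    sw = math.sqrt(w)
    rng = max(math.hypot(p[0], p[1]), 1e-9)
    return [
        [1.0 / SIG_PRIOR, 0.0],
        [0.0, 1.0 / SIG_PRIOR],
        [sw / SIG_OBS, 0.0],
        [0.0, sw / SIG_OBS],
        [p[0] / rng / SIG_UWB, p[1] / rng / SIG_UWB],
    ]

def levenberg_marquardt(prior, z, w, d, iters=50, lam=1e-2):
    """Minimize the sum of squared residuals, starting from the prior."""
    p = list(prior)
    cost = sum(r * r for r in residuals(p, prior, z, w, d))
    for _ in range(iters):
        r = residuals(p, prior, z, w, d)
        J = jacobian(p, w)
        # Damped normal equations: (J^T J + lam * I) step = -J^T r
        h00 = sum(row[0] * row[0] for row in J) + lam
        h01 = sum(row[0] * row[1] for row in J)
        h11 = sum(row[1] * row[1] for row in J) + lam
        g0 = sum(row[0] * ri for row, ri in zip(J, r))
        g1 = sum(row[1] * ri for row, ri in zip(J, r))
        det = h00 * h11 - h01 * h01
        step = [-(h11 * g0 - h01 * g1) / det, -(h00 * g1 - h01 * g0) / det]
        cand = [p[0] + step[0], p[1] + step[1]]
        cand_cost = sum(ri * ri for ri in residuals(cand, prior, z, w, d))
        if cand_cost < cost:   # accept the step and relax the damping
            p, cost, lam = cand, cand_cost, lam * 0.5
        else:                  # reject the step and increase the damping
            lam *= 10.0
    return p

# A wrong visual match at (0, 2) while the neighbor is truly at (3, 4).
truth, prior, d, wrong_z = (3.0, 4.0), (3.2, 3.8), 5.0, (0.0, 2.0)
for w in (1.0, 0.001):
    est = levenberg_marquardt(prior, wrong_z, w, d)
    err = math.hypot(est[0] - truth[0], est[1] - truth[1])
    print("w=%g estimate=(%.2f, %.2f) error=%.2f m" % (w, est[0], est[1], err))
```

With a confident weight the wrong detection drags the estimate far from the true position, while a small weight leaves the prior and range terms in control. This is the behavior the uncertainty weighting is meant to provide.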

Experimental Validation and Numerical Results

The effectiveness and generality of Mr. Virgil are validated extensively in diverse conditions:

Simulated random forest environments with up to 16 drones, explicitly evaluating under various occlusion levels and with high rates (>40%) of spurious visual detections (Figure 4, Figure 5).
Real-world indoor flight experiments, with and without occlusion (Figure 6), using a minimal IR-based visual detection system and no hardware identity encoding.

Figure 4: Simulated forest environment with robot occlusion variability as robot count increases.

Figure 5: Estimated trajectories in multi-drone simulation under the proposed framework.

Figure 6: Trajectories from 4 drones in real-world, high-noise (PVO-augmented) circumstances.

Localization Precision

Mr. Virgil improves substantially on both naive odometry (PVO) and a practical "Simple Match" baseline (directional nearest-neighbor with a threshold):

In the 16-robot simulation, Mr. Virgil reduces the position RMSE to 0.144 m, versus 0.198 m (Simple Match) and 6.067 m (PVO).
In real-world occluded scenarios, Mr. Virgil yields an RMSE of 0.129 m, versus 0.498 m (Simple Match) and 1.445 m (PVO).
In non-occluded real-world environments the gap is smaller, indicating that the gains are largest when sensing is ambiguous.
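To put the reported RMSE values in relative terms, the short calculation below converts them into percentage reductions. Only the RMSE numbers come from the results above; the helper name `relative_reduction` and the dictionary layout are illustrative choices.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Position RMSE in metres, as reported above.
rmse = {
    "16-robot simulation": {"Mr. Virgil": 0.144, "Simple Match": 0.198, "PVO": 6.067},
    "real-world occluded": {"Mr. Virgil": 0.129, "Simple Match": 0.498, "PVO": 1.445},
}

for scene, values in rmse.items():
    for baseline in ("Simple Match", "PVO"):
        drop = relative_reduction(values[baseline], values["Mr. Virgil"])
        print("%s: %.1f%% lower RMSE than %s" % (scene, drop, baseline))
```

This works out to roughly 27% and 98% below Simple Match and PVO in simulation, and roughly 74% and 91% in the occluded real-world case.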

Data Association

The GNN-based matcher significantly outperforms threshold heuristics in F1 score across all datasets (e.g., F1 = 0.985 on Real-NLOS with 5 robots vs. F1 = 0.976 and 0.939 for Simple Match with loose and strict thresholds, respectively), particularly in highly ambiguous, partially observed conditions (Figure 7).

Figure 7: Comparison of data associations, with Mr. Virgil providing soft, uncertainty-aware matches robust to ambiguity.
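For readers unfamiliar with the metric, the sketch below shows one common way to compute an F1 score for data association, treating each match as a (robot ID, detection index) pair. How the paper counts true and false matches is not specified here, so the pair representation and the example data are assumptions.

```python
def f1_score(predicted, ground_truth):
    """F1 over sets of (robot_id, detection_index) association pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)   # correct matches
    fp = len(predicted - ground_truth)   # matches that should not exist
    fn = len(ground_truth - predicted)   # true matches that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical frame: robot 2 is matched to the wrong detection.
truth = [(0, 1), (1, 0), (2, 3)]
pred = [(0, 1), (1, 0), (2, 2)]
print("F1 = %.3f" % f1_score(pred, truth))
```

A single wrong match counts as both a false positive and a false negative, which is why F1 penalizes confident mismatches more than simply leaving a detection unmatched.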

Ablation Results

Incorporating the PGO back-end yields a consistent >32% drop in localization error compared to the front-end alone or Simple Match + PGO.
The uncertainty-aware matching modulates the impact of erroneous matches, increasing robustness to spurious observations.
The system generalizes well in both few-shot training (strong performance after only two scenes) and sim-to-real transfer (the simulation-trained model matches real-trained accuracy to within 1 cm).

Figure 8: Combined loss and RPE decrease relative to number of training scenes; convergence after limited data.

Figure 9: Error distribution as a function of robot count in simulation; more robots improve GNN modeling and generalization.

Practical and Theoretical Implications

This work advances the paradigm of learning-based, decentralized relative localization in heterogeneous sensor swarms, particularly where identity is ambiguous and observations are noisy or intermittent. The fully differentiable architecture, robust to both occlusion and spurious detections, is practical in realistic environments and is compatible with resource-constrained hardware by design. From a theoretical perspective, the integration of assignment uncertainty and global structural modeling into matching, combined with differentiable optimization, bridges long-standing gaps between probabilistic data association and nonlinear trajectory optimization.

Major practical implications include reduced dependency on visual IDs or hand-tuned algorithms, robust scalable deployment in GPS-denied/crowded environments, and adaptation to varying team sizes. The formalism directly extends to other sensor configurations and could underpin collaborative SLAM or multi-agent tracking under minimal infrastructure.

Future Directions

Potential directions include:

Integration of full visual/inertial odometry chains rather than pseudo-odometry.
Multi-frame temporal modeling in both the front-end and optimization for improved temporal consistency.
Extension to more sophisticated sensor fusion (e.g., LIDAR, IMU).
Adaptation to large-scale, resource-constrained swarms in open-world settings.

Conclusion

Mr. Virgil establishes an end-to-end, uncertainty-aware framework for multi-robot visual-range localization without reliance on IDs or hand-crafted data association, advancing the state of the art in challenging multi-agent robotics contexts. It demonstrates significant gains in both accuracy and robustness across diverse scenarios and lays a foundation for future research in learning-based collective perception and localization.
