ImLoc: Scalable Dense Depth Localization
- ImLoc is a visual localization method that augments 2D image representations with dense, per-image depth maps to enable geometric reasoning without centralized 3D models.
- The system employs bidirectional dense matching and a GPU-accelerated LO-RANSAC pipeline, achieving state-of-the-art pose estimation accuracy with significant memory efficiency.
- By integrating image compression and scalable mapping techniques, ImLoc bridges 2D simplicity and 3D robustness, making it ideal for dynamic scenes and resource-constrained applications.
ImLoc is a visual localization method that revisits the 2D image-based mapping paradigm by augmenting each reference image with a dense, estimated depth map. This image+depth representation enables geometric reasoning comparable to that of centralized 3D structure-based methods while retaining the build-and-update simplicity, memory efficiency, and scalability of purely image-based systems. ImLoc combines dense matching for both mapping and localization, compact map storage with modern image codecs, and a GPU-accelerated LO-RANSAC pipeline for rapid and accurate pose estimation. Empirical results demonstrate state-of-the-art accuracy and memory efficiency across diverse and challenging benchmarks, validating that just-in-time per-image depth maps combined with dense matching capture the strengths of both 2D and 3D visual localization approaches (Jiang et al., 7 Jan 2026).
1. Image-Based Map Representation with Dense Depth
ImLoc represents the localization map as a collection of database entries, each consisting of:
- A global retrieval descriptor
- The RGB image
- An estimated dense depth map
- Known pose and intrinsics
Depth estimation for each image is performed during the mapping stage. For a pixel $\mathbf{u}$ in a database image, dense matching (specifically RoMa) against covisible images yields correspondences $\mathbf{u} \leftrightarrow \mathbf{u}'$, each with a match confidence. Writing $\mathbf{C}$ for the camera center of the database image, $\mathbf{r}$ for the unit ray through $\mathbf{u}$, and $\mathbf{C}'$, $\mathbf{r}'$ for the center and matched ray of a covisible image, the depth at $\mathbf{u}$ is triangulated by solving

$$d^{*} = \arg\min_{d}\; \theta\!\left(\mathbf{r}',\; \mathbf{C} + d\,\mathbf{r} - \mathbf{C}'\right),$$

where $\theta(\cdot,\cdot)$ denotes the angle between the two ray directions. Because camera poses are known, each individual match generates a single-DOF depth hypothesis; the hypothesis with the most inliers within an angular threshold is selected, then refined via weighted least squares in angle-space. Depth is stored per pixel as the scalar $d^{*}$ such that the 3D point is $\mathbf{X} = \mathbf{C} + d^{*}\mathbf{r}$.
This strategy for per-image depth acquisition enables geometric information without centralized point-cloud construction, facilitating scalable, memory-efficient map building.
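To make the hypothesize-and-verify scheme concrete, the following sketch triangulates a single pixel's depth from its dense matches, assuming known poses and unit rays. Function names and the inlier threshold are illustrative, and the confidence-weighted refinement here (a weighted mean of inlier hypotheses) is a simplification of the paper's angle-space least squares.

```python
import numpy as np

def angle_between(v, w):
    """Angle in radians between two 3D direction vectors."""
    c = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    return np.arccos(np.clip(c, -1.0, 1.0))

def triangulate_pixel_depth(C, r, C2s, r2s, confs, thresh=np.deg2rad(1.0)):
    """Hypothesize-and-verify depth for one reference pixel.

    C : (3,) reference camera center; r : (3,) unit ray through the pixel.
    C2s, r2s : camera centers and unit match rays of covisible images.
    confs : per-match confidences. `thresh` is an illustrative inlier angle.
    """
    # Each match contributes one single-DOF depth hypothesis via
    # least-squares intersection of the two rays.
    hyps = []
    for C2, r2 in zip(C2s, r2s):
        A = np.stack([r, -r2], axis=1)          # 3x2 system in (d, d')
        sol, *_ = np.linalg.lstsq(A, C2 - C, rcond=None)
        hyps.append(float(sol[0]))
    # Verify: count matches whose ray agrees with the candidate 3D point.
    best_d, best_inl = hyps[0], None
    for d in hyps:
        X = C + d * r
        res = np.array([angle_between(r2, X - C2) for C2, r2 in zip(C2s, r2s)])
        inl = res < thresh
        if best_inl is None or inl.sum() > best_inl.sum():
            best_d, best_inl = d, inl
    # Confidence-weighted refinement over the inlier hypotheses -- a
    # simplification of the paper's weighted least squares in angle-space.
    w = np.asarray(confs)[best_inl]
    return float(np.dot(w, np.asarray(hyps)[best_inl]) / w.sum())
```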
2. Dense Matching and Pose Estimation Pipeline
At localization (query) time, ImLoc executes the following sequence:
- Query Descriptor Extraction: The global descriptor is computed for the query image.
- Top-K Retrieval: Database images are ranked by global descriptor similarity (e.g., using Megaloc, NetVLAD, or EigenPlaces) and the $K$ highest-scoring images are selected.
- Bidirectional Dense Matching (RoMa): For each retrieved pair, dense pixel-level correspondences are established at a fixed matching resolution. Each query pixel is matched to a pixel in the retrieved image with an associated confidence; matches with confidence below a threshold, or mapping to invalid depths, are discarded.
- 2D–3D Correspondence Formation: Each remaining match pairs a query pixel with a 3D point, obtained by unprojecting the matched database pixel using its stored depth, known pose, and intrinsics.
- Pose Estimation via GPU LO-RANSAC: With the dense correspondences:
  - Correspondences are subsampled to a fixed budget.
  - The CPU samples minimal sets and solves P3P with PoseLib to generate pose hypotheses.
  - The GPU scores all hypotheses against all correspondences using a truncated, confidence-weighted reprojection error.
  - The best hypotheses are refined via nonlinear local optimization with a Cauchy loss.
  - The process repeats for up to $100,000$ iterations, or until the probability of having missed the best model falls below a target confidence threshold.
This pipeline is designed for memory and computational efficiency while maintaining high pose accuracy.
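The batched hypothesis-scoring step can be sketched in vectorized NumPy as a stand-in for the GPU kernel. The truncation radius `tau` and the soft inlier weighting below are illustrative assumptions, not the paper's exact scoring function.

```python
import numpy as np

def score_hypotheses(Rs, ts, K, X, x, conf, tau=12.0):
    """Score H pose hypotheses against all 2D-3D correspondences at once.

    Rs : (H,3,3) rotations; ts : (H,3) translations; K : (3,3) intrinsics;
    X : (N,3) 3D points; x : (N,2) query pixels; conf : (N,) confidences.
    tau is the truncation radius in pixels (illustrative value).
    """
    Xc = np.einsum('hij,nj->hni', Rs, X) + ts[:, None, :]   # (H,N,3) camera coords
    uv = np.einsum('ij,hnj->hni', K, Xc)                    # project with intrinsics
    uv = uv[..., :2] / uv[..., 2:3]
    err = np.linalg.norm(uv - x[None], axis=-1)             # (H,N) reprojection error
    behind = Xc[..., 2] <= 0                                # invalidate points behind camera
    weight = np.where(behind, 0.0, np.maximum(0.0, 1.0 - err / tau))
    return (weight * conf[None]).sum(axis=1)                # (H,) truncated, weighted score
```

The best hypothesis is then `Rs[np.argmax(scores)]`; batching the error computation over all hypotheses and correspondences is what makes a GPU implementation of this step efficient.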
3. Compression and Scalable Memory Efficiency
ImLoc's storage scheme utilizes full images rather than sparse descriptors, enabling effective use of contemporary image codecs:
- RGB images are downsampled (more aggressively for "micro" maps) and compressed with JPEG XL at a specified quality (90 for standard maps, 30 for micro), yielding roughly 60 KB (or 15 KB for micro) per image.
- Depth maps are clipped to a fixed working range, log-quantized to 256 ($8$-bit) levels, and losslessly compressed (roughly 17 KB per image, or 3 KB for micro).
- Retrieval feature vectors ($4096$-D) are stored in half-precision ($8$ KB per image).
Resolution, JPEG XL quality, and frame subsampling (e.g., $1:8$ for "nano" $2$ MB maps) can be varied, affording flexible trade-offs along the Pareto frontier of memory versus localization accuracy. For instance, on Cambridge Landmarks, the $90$ MB map yields a $10$ cm median translation error, whereas the $2$ MB map achieves $13$ cm, outperforming scene-coordinate regression competitors in the $3$–$13$ MB range.
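The depth-map quantization step admits a compact sketch. The clipping bounds `d_min` and `d_max` below are illustrative, since the article does not reproduce ImLoc's exact working range; the point is that logarithmic spacing gives roughly constant *relative* depth error across the range.

```python
import numpy as np

def quantize_depth(depth, d_min=0.1, d_max=100.0):
    """Clip depth to [d_min, d_max] and log-quantize to 256 (8-bit) levels.

    d_min/d_max are illustrative bounds, not the paper's values.
    """
    d = np.clip(depth, d_min, d_max)
    t = (np.log(d) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return np.round(t * 255.0).astype(np.uint8)

def dequantize_depth(q, d_min=0.1, d_max=100.0):
    """Inverse mapping: 8-bit level back to metric depth."""
    t = q.astype(np.float64) / 255.0
    return np.exp(np.log(d_min) + t * (np.log(d_max) - np.log(d_min)))
```

With these bounds the per-pixel relative round-trip error is at most about half a quantization step in log-space (under 1.5%), and the resulting 8-bit planes compress well losslessly.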
4. Benchmark Performance and Empirical Results
ImLoc is evaluated on several standard benchmarks, reporting accuracy as the percentage of queries localized within specified error thresholds:
| Benchmark | Map Size | Median Error | Accuracy (%) at (0.25 m, 2°) / (0.5 m, 5°) / (5 m, 10°) |
|---|---|---|---|
| Cambridge Landmarks | 90 MB | 11 cm / 0.2° | — |
| Cambridge Landmarks | 16 MB | 12 cm / 0.2° | — |
| Cambridge Landmarks | 2 MB | 13 cm / 0.3° | — |
| Aachen Day–Night (day) | — | — | 89.3 / 96.1 / 99.3 |
| Aachen Day–Night (night) | — | — | 74.3 / 91.6 / 99.0 |
| LaMAR (CAB scene) | — | — | up to 66.4% at (1 m, 5°) |
Compared to state-of-the-art alternatives, including HLoc and scene-coordinate regression (SCR) methods, ImLoc demonstrates superior accuracy at significantly reduced map memory footprints. For example, on Cambridge Landmarks, ImLoc attains an $11$ cm median translation error with a $90$ MB map, matching the $11$ cm accuracy of HLoc at $800$ MB and outperforming all SCR baselines ($4$–$260$ MB).
Across all datasets, ImLoc matches or exceeds centralized SfM-based approaches, confirming the effectiveness of the image+depth representation combined with dense matching and GPU-accelerated RANSAC (Jiang et al., 7 Jan 2026).
5. Relation to Previous Approaches
Traditional visual localization methods fall into two categories: 2D image-based and 3D structure-based pipelines. The former are simple to build and maintain, but lack the capacity for geometric reasoning, especially under challenging conditions. Structure-based approaches, leveraging centralized point-cloud models (e.g., SfM), offer high accuracy but suffer from reconstruction, update, and scalability limitations.
ImLoc's design bridges these paradigms by using per-image dense depth maps. This approach enables geometric constraints for effective localization without the overhead or rigidity of centralized 3D frameworks. Dense pixel-wise matching sidesteps the limitations of sparse local feature descriptors and enhances performance in dynamic or varying scenes.
For context, InLoc (Taira et al., 2018) pursued indoor localization with dense feature matching and pose verification through virtual view synthesis, but it required database construction from RGB-D panoramas and a centralized 3D point cloud. ImLoc dispenses with centralized point clouds, relying on just-in-time geometric reasoning from per-image depth.
6. Scalability, Flexibility, and Dynamic Scene Adaptation
ImLoc's memory efficiency and update flexibility derive from its non-centralized image+depth representation. Map updates (e.g., new images or changes in scene geometry) do not require reprocessing any global structure. Dynamic scenes, where objects or layouts change, are handled naturally by the retrieval and matching stages and by the per-image decoupling of depth estimation.
By adjusting map image resolution, JPEG XL compression quality, and reference frame subsampling, ImLoc supports application-specific trade-offs. This adaptability allows deployment in resource-constrained environments, from large-scale outdoor mapping to minimal "nano" maps for embedded systems.
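The non-centralized representation described above can be sketched as a simple per-image record store: adding coverage is an append, with no global model to re-triangulate or re-bundle. Field and class names below are illustrative, not ImLoc's actual data layout.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MapEntry:
    """One database image: everything localization needs, self-contained."""
    descriptor: np.ndarray   # global retrieval descriptor
    image: bytes             # compressed RGB (e.g., a JPEG XL bitstream)
    depth_q: np.ndarray      # 8-bit log-quantized depth map
    pose: np.ndarray         # 4x4 camera-to-world transform
    K: np.ndarray            # 3x3 intrinsics

@dataclass
class ImageDepthMap:
    entries: list = field(default_factory=list)

    def add(self, entry: MapEntry):
        # Append-only update: no centralized 3D structure to rebuild.
        self.entries.append(entry)

    def retrieve(self, query_desc: np.ndarray, k: int):
        # Rank entries by cosine similarity of global descriptors, return top-k.
        D = np.stack([e.descriptor for e in self.entries])
        sims = D @ query_desc / (np.linalg.norm(D, axis=1) * np.linalg.norm(query_desc))
        return [self.entries[i] for i in np.argsort(-sims)[:k]]
```

Because each entry is self-contained, dropping stale images or appending fresh ones leaves every other entry valid, which is what makes the map cheap to update in dynamic scenes.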
7. Implications and Future Directions
ImLoc demonstrates that geometric inference for visual localization can be recovered from fully 2D representations augmented by estimated depth, obviating the need for explicit centralized 3D structures and their maintenance overhead. This suggests further research avenues in compressing depth information, enhancing dense matching robustness, and generalizing to more complex or unstructured environments.
A plausible implication is the broader applicability of image+depth maps in mobile robotics, augmented reality, and autonomous navigation, particularly where map size, update latency, and dynamic scene handling are operational constraints. Future work may investigate further memory-reduction techniques and improvements in dense correspondence filtering under challenging visibility or illumination conditions (Jiang et al., 7 Jan 2026).