ImLoc: Scalable Dense Depth Localization
- ImLoc is a visual localization method that augments 2D image representations with dense, per-image depth maps to enable geometric reasoning without centralized 3D models.
- The system employs bidirectional dense matching and a GPU-accelerated LO-RANSAC pipeline, achieving state-of-the-art pose estimation accuracy with significant memory efficiency.
- By integrating image compression and scalable mapping techniques, ImLoc bridges 2D simplicity and 3D robustness, making it ideal for dynamic scenes and resource-constrained applications.
ImLoc is a visual localization method that revisits the 2D image-based mapping paradigm by augmenting each reference image with a dense, estimated depth map. This image+depth representation enables geometric reasoning comparable to that of centralized 3D structure-based methods while retaining the build-and-update simplicity, memory efficiency, and scalability of purely image-based systems. ImLoc combines dense matching for both mapping and localization, compact map storage with modern image codecs, and a GPU-accelerated LO-RANSAC pipeline for rapid and accurate pose estimation. Empirical results demonstrate state-of-the-art accuracy and memory efficiency across diverse and challenging benchmarks, validating that just-in-time per-image depth maps combined with dense matching capture the strengths of both 2D and 3D visual localization approaches (Jiang et al., 7 Jan 2026).
1. Image-Based Map Representation with Dense Depth
ImLoc represents the localization map as a collection of database entries, each consisting of:
- A global retrieval descriptor
- The RGB image
- An estimated dense depth map
- Known pose and intrinsics
Depth estimation for each image is performed during the mapping stage. For a pixel $\mathbf{u}$ in a database image, dense matching (specifically RoMa) against covisible images yields correspondences $\mathbf{u} \leftrightarrow \mathbf{u}'$, each with a match confidence. Writing $\mathbf{C}$ for the camera center of the database image, $\mathbf{r}$ for the unit ray through $\mathbf{u}$, and $\mathbf{C}'$, $\mathbf{r}'$ for the center and matched ray of a covisible image, the depth at $\mathbf{u}$ is triangulated by solving

$$d^{*} = \arg\min_{d}\; \theta\!\left(\mathbf{r}',\; \mathbf{C} + d\,\mathbf{r} - \mathbf{C}'\right),$$

where $\theta(\cdot,\cdot)$ denotes the angle between the two ray directions. Because camera poses are known, each individual match generates a single-DOF depth hypothesis; the hypothesis with the most inliers within an angular threshold is selected, then refined via weighted least squares in angle-space. Depth is stored per pixel as the scalar $d^{*}$ such that the 3D point is $\mathbf{X} = \mathbf{C} + d^{*}\mathbf{r}$.
This strategy for per-image depth acquisition enables geometric information without centralized point-cloud construction, facilitating scalable, memory-efficient map building.
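To make the hypothesize-and-verify scheme concrete, the following sketch triangulates a single pixel's depth from its dense matches, assuming known poses and unit rays. Function names and the inlier threshold are illustrative, and the confidence-weighted refinement here (a weighted mean of inlier hypotheses) is a simplification of the paper's angle-space least squares.

```python
import numpy as np

def angle_between(v, w):
    """Angle in radians between two 3D direction vectors."""
    c = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    return np.arccos(np.clip(c, -1.0, 1.0))

def triangulate_pixel_depth(C, r, C2s, r2s, confs, thresh=np.deg2rad(1.0)):
    """Hypothesize-and-verify depth for one reference pixel.

    C : (3,) reference camera center; r : (3,) unit ray through the pixel.
    C2s, r2s : camera centers and unit match rays of covisible images.
    confs : per-match confidences. `thresh` is an illustrative inlier angle.
    """
    # Each match contributes one single-DOF depth hypothesis via
    # least-squares intersection of the two rays.
    hyps = []
    for C2, r2 in zip(C2s, r2s):
        A = np.stack([r, -r2], axis=1)          # 3x2 system in (d, d')
        sol, *_ = np.linalg.lstsq(A, C2 - C, rcond=None)
        hyps.append(float(sol[0]))
    # Verify: count matches whose ray agrees with the candidate 3D point.
    best_d, best_inl = hyps[0], None
    for d in hyps:
        X = C + d * r
        res = np.array([angle_between(r2, X - C2) for C2, r2 in zip(C2s, r2s)])
        inl = res < thresh
        if best_inl is None or inl.sum() > best_inl.sum():
            best_d, best_inl = d, inl
    # Confidence-weighted refinement over the inlier hypotheses -- a
    # simplification of the paper's weighted least squares in angle-space.
    w = np.asarray(confs)[best_inl]
    return float(np.dot(w, np.asarray(hyps)[best_inl]) / w.sum())
```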
2. Dense Matching and Pose Estimation Pipeline
At localization (query) time, ImLoc executes the following sequence:
- Query Descriptor Extraction: The global descriptor is computed for the query image.
- Top-K Retrieval: Database images are ranked by global descriptor similarity (e.g., using Megaloc, NetVLAD, or EigenPlaces) and the $K$ highest-scoring images are selected.
- Bidirectional Dense Matching (RoMa): For each retrieved pair, dense pixel-level correspondences are established at a fixed matching resolution. Each query pixel is matched to a pixel in the retrieved image with an associated confidence; matches with confidence below a threshold, or mapping to invalid depths, are discarded.
- 2D–3D Correspondence Formation: Each remaining match pairs a query pixel with a 3D point, obtained by unprojecting the matched database pixel using its stored depth, known pose, and intrinsics.
- Pose Estimation via GPU LO-RANSAC: With the dense correspondences:
  - Correspondences are subsampled to a fixed budget.
  - The CPU samples minimal sets and solves P3P with PoseLib to generate pose hypotheses.
  - The GPU scores all hypotheses against all correspondences using a truncated, confidence-weighted reprojection error.
  - The best hypotheses are refined via nonlinear local optimization with a Cauchy loss.
  - The process repeats for up to $100,000$ iterations, or until the probability of having missed the best model falls below a target confidence threshold.
This pipeline is designed for memory and computational efficiency while maintaining high pose accuracy.
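The batched hypothesis-scoring step can be sketched in vectorized NumPy as a stand-in for the GPU kernel. The truncation radius `tau` and the soft inlier weighting below are illustrative assumptions, not the paper's exact scoring function.

```python
import numpy as np

def score_hypotheses(Rs, ts, K, X, x, conf, tau=12.0):
    """Score H pose hypotheses against all 2D-3D correspondences at once.

    Rs : (H,3,3) rotations; ts : (H,3) translations; K : (3,3) intrinsics;
    X : (N,3) 3D points; x : (N,2) query pixels; conf : (N,) confidences.
    tau is the truncation radius in pixels (illustrative value).
    """
    Xc = np.einsum('hij,nj->hni', Rs, X) + ts[:, None, :]   # (H,N,3) camera coords
    uv = np.einsum('ij,hnj->hni', K, Xc)                    # project with intrinsics
    uv = uv[..., :2] / uv[..., 2:3]
    err = np.linalg.norm(uv - x[None], axis=-1)             # (H,N) reprojection error
    behind = Xc[..., 2] <= 0                                # invalidate points behind camera
    weight = np.where(behind, 0.0, np.maximum(0.0, 1.0 - err / tau))
    return (weight * conf[None]).sum(axis=1)                # (H,) truncated, weighted score
```

The best hypothesis is then `Rs[np.argmax(scores)]`; batching the error computation over all hypotheses and correspondences is what makes a GPU implementation of this step efficient.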
3. Compression and Scalable Memory Efficiency
ImLoc's storage scheme utilizes full images rather than sparse descriptors, enabling effective use of contemporary image codecs:
- RGB images are downsampled (more aggressively for "micro" maps) and compressed with JPEG XL at a specified quality (90 for standard maps, 30 for micro), yielding roughly 60 KB (or 15 KB for micro) per image.
- Depth maps are clipped to a fixed working range, log-quantized to 256 ($8$-bit) levels, and losslessly compressed (roughly 17 KB per image, or 3 KB for micro).
- Retrieval feature vectors ($4096$-D) are stored in half-precision ($8$ KB per image).
Resolution, JPEG XL quality, and frame subsampling (e.g., $1:8$ for "nano" $2$ MB maps) can be varied, affording flexible trade-offs along the Pareto frontier of memory versus localization accuracy. For instance, on Cambridge Landmarks, the $90$ MB map yields a $10$ cm median translation error, whereas the $2$ MB map achieves $13$ cm, outperforming scene-coordinate regression competitors in the $3$–$13$ MB range.
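The depth-map quantization step admits a compact sketch. The clipping bounds `d_min` and `d_max` below are illustrative, since the article does not reproduce ImLoc's exact working range; the point is that logarithmic spacing gives roughly constant *relative* depth error across the range.

```python
import numpy as np

def quantize_depth(depth, d_min=0.1, d_max=100.0):
    """Clip depth to [d_min, d_max] and log-quantize to 256 (8-bit) levels.

    d_min/d_max are illustrative bounds, not the paper's values.
    """
    d = np.clip(depth, d_min, d_max)
    t = (np.log(d) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return np.round(t * 255.0).astype(np.uint8)

def dequantize_depth(q, d_min=0.1, d_max=100.0):
    """Inverse mapping: 8-bit level back to metric depth."""
    t = q.astype(np.float64) / 255.0
    return np.exp(np.log(d_min) + t * (np.log(d_max) - np.log(d_min)))
```

With these bounds the per-pixel relative round-trip error is at most about half a quantization step in log-space (under 1.5%), and the resulting 8-bit planes compress well losslessly.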
4. Benchmark Performance and Empirical Results
ImLoc is evaluated on several standard benchmarks, reporting accuracy as the percentage of queries localized within specified error thresholds:
| Benchmark | Map Size | Median Error | Accuracy (%) at (0.25 m, 2°) / (0.5 m, 5°) / (5 m, 10°) |
|---|---|---|---|
| Cambridge Landmarks | 90 MB | 11 cm / 0.2° | — |
| Cambridge Landmarks | 16 MB | 12 cm / 0.2° | — |
| Cambridge Landmarks | 2 MB | 13 cm / 0.3° | — |
| Aachen Day–Night (day) | — | — | 89.3 / 96.1 / 99.3 |
| Aachen Day–Night (night) | — | — | 74.3 / 91.6 / 99.0 |
| LaMAR (CAB scene) | — | — | up to 66.4% at (1 m, 5°) |
Compared to state-of-the-art alternatives, including HLoc and scene-coordinate regression (SCR) methods, ImLoc demonstrates superior accuracy at significantly reduced map memory footprints. For example, on Cambridge Landmarks, ImLoc attains an $11$ cm median translation error with a $90$ MB map, matching the $11$ cm accuracy of HLoc at $800$ MB and outperforming all SCR baselines ($4$–$260$ MB).
Across all datasets, ImLoc matches or exceeds centralized SfM-based approaches, confirming the effectiveness of the image+depth representation combined with dense matching and GPU-accelerated RANSAC (Jiang et al., 7 Jan 2026).
5. Relation to Previous Approaches
Traditional visual localization methods fall into two categories: 2D image-based and 3D structure-based pipelines. The former are simple to build and maintain, but lack the capacity for geometric reasoning, especially under challenging conditions. Structure-based approaches, leveraging centralized point-cloud models (e.g., SfM), offer high accuracy but suffer from reconstruction, update, and scalability limitations.
ImLoc's design bridges these paradigms by using per-image dense depth maps. This approach enables geometric constraints for effective localization without the overhead or rigidity of centralized 3D frameworks. Dense pixel-wise matching sidesteps the limitations of sparse local feature descriptors and enhances performance in dynamic or varying scenes.
For context, InLoc (Taira et al., 2018) pursued indoor localization with dense feature matching and pose verification through virtual view synthesis, but it required database construction from RGB-D panoramas and a centralized 3D point cloud. ImLoc dispenses with centralized point clouds, relying on just-in-time geometric reasoning from per-image depth.
6. Scalability, Flexibility, and Dynamic Scene Adaptation
ImLoc's memory efficiency and update flexibility derive from its non-centralized image+depth representation. Map updates (e.g., new images or changes in scene geometry) do not require reprocessing any global structure. Dynamic scenes, where objects or layouts change, are handled naturally by the retrieval and matching stages and by the per-image decoupling of depth estimation.
By adjusting map image resolution, JPEG XL compression quality, and reference frame subsampling, ImLoc supports application-specific trade-offs. This adaptability allows deployment in resource-constrained environments, from large-scale outdoor mapping to minimal "nano" maps for embedded systems.
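The non-centralized representation described above can be sketched as a simple per-image record store: adding coverage is an append, with no global model to re-triangulate or re-bundle. Field and class names below are illustrative, not ImLoc's actual data layout.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MapEntry:
    """One database image: everything localization needs, self-contained."""
    descriptor: np.ndarray   # global retrieval descriptor
    image: bytes             # compressed RGB (e.g., a JPEG XL bitstream)
    depth_q: np.ndarray      # 8-bit log-quantized depth map
    pose: np.ndarray         # 4x4 camera-to-world transform
    K: np.ndarray            # 3x3 intrinsics

@dataclass
class ImageDepthMap:
    entries: list = field(default_factory=list)

    def add(self, entry: MapEntry):
        # Append-only update: no centralized 3D structure to rebuild.
        self.entries.append(entry)

    def retrieve(self, query_desc: np.ndarray, k: int):
        # Rank entries by cosine similarity of global descriptors, return top-k.
        D = np.stack([e.descriptor for e in self.entries])
        sims = D @ query_desc / (np.linalg.norm(D, axis=1) * np.linalg.norm(query_desc))
        return [self.entries[i] for i in np.argsort(-sims)[:k]]
```

Because each entry is self-contained, dropping stale images or appending fresh ones leaves every other entry valid, which is what makes the map cheap to update in dynamic scenes.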
7. Implications and Future Directions
ImLoc demonstrates that geometric inference for visual localization can be recovered from fully 2D representations augmented by estimated depth, obviating the need for explicit centralized 3D structures and their maintenance overhead. This suggests further research avenues in compressing depth information, enhancing dense matching robustness, and generalizing to more complex or unstructured environments.
A plausible implication is the broader applicability of image+depth maps in mobile robotics, augmented reality, and autonomous navigation, particularly where map size, update latency, and dynamic scene handling are operational constraints. Future work may investigate further memory-reduction techniques and improvements in dense correspondence filtering under challenging visibility or illumination conditions (Jiang et al., 7 Jan 2026).