
ImLoc: Scalable Dense Depth Localization

Updated 8 January 2026
  • ImLoc is a visual localization method that augments 2D image representations with dense, per-image depth maps to enable geometric reasoning without centralized 3D models.
  • The system employs bidirectional dense matching and a GPU-accelerated LO-RANSAC pipeline, achieving state-of-the-art pose estimation accuracy with significant memory efficiency.
  • By integrating image compression and scalable mapping techniques, ImLoc bridges 2D simplicity and 3D robustness, making it ideal for dynamic scenes and resource-constrained applications.

ImLoc is a visual localization method that revisits 2D image-based mapping paradigms by augmenting each reference image with a dense, estimated depth map. This image+depth representation enables geometric reasoning comparable to centralized 3D structure-based methods, but retains the build-and-update simplicity, memory efficiency, and scalability of pure image-based systems. ImLoc uses dense matching for representation and localization, compact map storage with modern image codecs, and a GPU-accelerated LO-RANSAC pipeline for rapid and accurate pose estimation. Empirical results demonstrate state-of-the-art accuracy and memory efficiency across diverse and challenging benchmarks, validating that just-in-time per-image depth maps together with dense matching combine the strengths of both 2D and 3D visual localization approaches (Jiang et al., 7 Jan 2026).

1. Image-Based Map Representation with Dense Depth

ImLoc represents the localization map as a collection of database entries, each consisting of:

  • A global retrieval descriptor $f \in \mathbb{R}^D$
  • The RGB image $I \in \mathbb{R}^{H \times W \times 3}$
  • An estimated dense depth map $D \in \mathbb{R}^{H \times W}$
  • Known pose $(R, t)$ and intrinsics $K$
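
Such a database entry can be sketched as a simple record; the `MapEntry` name and field names are illustrative, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MapEntry:
    """One database entry of an ImLoc-style map (illustrative sketch)."""
    descriptor: np.ndarray  # global retrieval descriptor, shape (D,)
    image: np.ndarray       # RGB image, shape (H, W, 3)
    depth: np.ndarray       # estimated dense depth map, shape (H, W)
    R: np.ndarray           # rotation, shape (3, 3), world-to-camera
    t: np.ndarray           # translation, shape (3,)
    K: np.ndarray           # camera intrinsics, shape (3, 3)

    @property
    def camera_center(self) -> np.ndarray:
        # C = -R^T t for a world-to-camera pose (R, t)
        return -self.R.T @ self.t
```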

Depth estimation for each image is performed during the mapping stage. For a pixel $p$ in image $i$, dense matching (specifically RoMa) against covisible images $j$ yields correspondences $q$ with match confidence $w_{ij}$. Depth for $p$ is triangulated by solving:

$$X = \underset{X}{\arg\min} \sum_{(i,j)} w_{ij}\, \theta(K_i^{-1}p_i,\, X - C_i)^2$$

where $\theta(u, v)$ denotes the angle between rays $u$ and $v$, and $C_i$ is the camera center. Among all single-DOF hypotheses generated by individual matches (given known camera poses), the one with the most inliers within $2^\circ$ is selected, then refined via weighted least squares in angle space. Depth $D(p)$ is stored per pixel as the $\lambda$ such that $X = C_i + \lambda K_i^{-1}p_i$.
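
The hypothesize-then-refine scheme above can be sketched as follows. This is a simplification under stated assumptions: each hypothesis comes from a closed-form two-ray least-squares fit, and the angle-space weighted least-squares refinement is stood in for by a local 1-D search over the depth $\lambda$:

```python
import numpy as np

def angle(u, v):
    """Angle between rays u and v, in radians."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

def triangulate_depth(d_i, C_i, rays_j, centers_j, w_j,
                      inlier_thresh=np.deg2rad(2.0)):
    """Single-DOF depth triangulation sketch for one pixel ray.

    d_i: unit ray of pixel p in image i; C_i: its camera center.
    rays_j / centers_j / w_j: matched rays, camera centers, and
    match confidences from covisible images (known poses assumed).
    """
    # 1. One depth hypothesis per match: solve [d_i, -d_j][λ, μ]^T ≈ C_j - C_i.
    hypotheses = []
    for d_j, C_j in zip(rays_j, centers_j):
        A = np.stack([d_i, -d_j], axis=1)
        lam, _ = np.linalg.lstsq(A, C_j - C_i, rcond=None)[0]
        hypotheses.append(lam)

    def cost(lam):
        X = C_i + lam * d_i
        res = np.array([angle(d_j, X - C_j)
                        for d_j, C_j in zip(rays_j, centers_j)])
        inliers = res < inlier_thresh
        return inliers, np.sum(w_j * np.minimum(res, inlier_thresh) ** 2)

    # 2. Keep the hypothesis with the most angular inliers within 2 degrees.
    best = max((lam for lam in hypotheses if lam > 0),
               key=lambda lam: cost(lam)[0].sum())
    # 3. Crude stand-in for the weighted least-squares refinement in angle space.
    grid = best * np.linspace(0.9, 1.1, 201)
    return grid[np.argmin([cost(l)[1] for l in grid])]
```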

This strategy for per-image depth acquisition enables geometric information without centralized point-cloud construction, facilitating scalable, memory-efficient map building.

2. Dense Matching and Pose Estimation Pipeline

At localization (query) time, ImLoc executes the following sequence:

  1. Query Descriptor Extraction: The global descriptor $f_q$ is computed for the query image.
  2. Top-K Retrieval: Images $I_i$ from the database are ranked and the highest-scoring $K$ images selected via global descriptor similarity, e.g., using Megaloc, NetVLAD, or EigenPlaces.
  3. Bidirectional Dense Matching (RoMa): For each retrieved $(q, i)$ pair, dense pixel-level correspondences are established at resolution $560 \times 560$. For each query pixel $p_q$, a corresponding pixel $p_i$ in $I_i$ is found, associated with a match confidence $c(p_q \leftrightarrow p_i)$. Matches with $c < 0.05$ or invalid depths are discarded.
  4. 2D–3D Correspondence Formation: Remaining matches yield 2D–3D pairs $(p_q, X_i)$, where:

$$X_i = \text{backproject}(p_i;\, D_i(p_i), K_i, R_i, t_i)$$

$$\text{backproject}(x;\, d, K, R, t) = R^\top K^{-1}x \cdot d + (-R^\top t)$$
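
The backprojection formula translates directly to code; this sketch assumes a world-to-camera pose convention $p_{\text{cam}} = RX + t$, consistent with the camera center $C = -R^\top t$:

```python
import numpy as np

def backproject(x, d, K, R, t):
    """Lift pixel x (2-vector) with depth d to a 3D world point,
    assuming a world-to-camera pose (R, t)."""
    x_h = np.array([x[0], x[1], 1.0])   # homogeneous pixel coordinates
    ray_cam = np.linalg.inv(K) @ x_h    # un-normalized ray in camera frame
    X_cam = ray_cam * d                 # point in the camera frame
    return R.T @ X_cam - R.T @ t        # = R^T K^{-1} x · d + (-R^T t)
```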

  5. Pose Estimation via GPU LO-RANSAC: With $N \approx 10,000$ dense correspondences:
    • Up to $M = 10,000$ correspondences are subsampled.
    • The CPU samples $B = 1,000$ minimal sets and solves P3P with PoseLib to generate hypotheses $(R_k, t_k)$.
    • The GPU scores all correspondences using a truncated, confidence-weighted reprojection error.
    • Hypotheses are refined via nonlinear local optimization (Cauchy loss).
    • The process repeats for up to $100,000$ iterations or until the probability of missing the best model drops below $10^{-4}$.
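
The stopping rule in the last bullet is the standard adaptive RANSAC criterion; a minimal sketch, together with a truncated, confidence-weighted scoring function (both illustrative, not the paper's implementation), might look like:

```python
import math
import numpy as np

def ransac_iterations(inlier_ratio, sample_size=3, miss_prob=1e-4,
                      max_iters=100_000):
    """Iterations needed so that the probability of never drawing an
    all-inlier minimal sample (size 3 for P3P) falls below miss_prob."""
    p_good = inlier_ratio ** sample_size  # P(one sample is all inliers)
    if p_good <= 0.0:
        return max_iters
    if p_good >= 1.0:
        return 1
    n = math.log(miss_prob) / math.log(1.0 - p_good)
    return min(max_iters, math.ceil(n))

def score_hypothesis(errors, conf, tau):
    """Truncated, confidence-weighted reprojection score (lower is better):
    reprojection errors are clamped at tau and weighted by match confidence."""
    return float(np.sum(np.asarray(conf) * np.minimum(errors, tau) ** 2))
```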

This pipeline is designed for memory and computational efficiency while maintaining high pose accuracy.

3. Compression and Scalable Memory Efficiency

ImLoc's storage scheme utilizes full images rather than sparse descriptors, enabling effective use of contemporary image codecs:

  • RGB images are downsampled (to $560^2$, or $280^2$ for "micro" maps) and compressed with JPEG XL at a specified quality (90 for standard, 30 for micro), yielding $\approx 60$ KB (or $\approx 15$ KB) per image.
  • Depth maps are clipped to $[0.25\,\text{m}, 128\,\text{m}]$, log-quantized to 256 ($8$-bit) levels, and losslessly compressed ($\approx 17$ KB, or $\approx 3$ KB for micro, per image).
  • Retrieval feature vectors ($4096$-D) are stored in half-precision ($8$ KB/image).
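
The depth codec in the second bullet can be sketched as follows. Only the clipping range and the 8-bit log quantization come from the text; the exact bin placement is an assumption:

```python
import numpy as np

D_MIN, D_MAX = 0.25, 128.0  # metres, the stated clipping range

def quantize_depth(depth):
    """Clip depth to [D_MIN, D_MAX] and log-quantize to 8-bit codes."""
    d = np.clip(depth, D_MIN, D_MAX)
    code = np.log(d / D_MIN) / np.log(D_MAX / D_MIN) * 255.0
    return np.round(code).astype(np.uint8)

def dequantize_depth(code):
    """Invert the log quantization back to metric depth."""
    frac = code.astype(np.float64) / 255.0
    return D_MIN * np.exp(frac * np.log(D_MAX / D_MIN))
```

With 256 logarithmic levels over a 512x depth range, the round-trip relative error stays near 1%, which is why log quantization suits depth better than uniform binning.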

Resolution, JPEG XL quality, and frame subsampling (e.g., $1:8$ for "nano" $2$ MB maps) can be varied, affording flexible trade-offs along the Pareto frontier of memory and localization accuracy. For instance, on Cambridge Landmarks, the $90$ MB map yields a $10$ cm/$0.2^\circ$ median error, whereas the $2$ MB map achieves $13$ cm/$0.3^\circ$, outperforming $3$–$13$ MB scene-coordinate regression competitors.

4. Benchmark Performance and Empirical Results

ImLoc is evaluated on several standard benchmarks, reporting accuracy as the percentage of queries localized within specified error thresholds:

| Benchmark | Map Size | Median Error | Accuracy (% within error thresholds) |
|---|---|---|---|
| Cambridge Landmarks | 90 MB | 11 cm / 0.2° | not explicitly listed |
| Cambridge Landmarks | 16 MB | 12 cm / 0.2° | |
| Cambridge Landmarks | 2 MB | 13 cm / 0.3° | |
| Aachen Day–Night (day) | | | 89.3 / 96.1 / 99.3 |
| Aachen Day–Night (night) | | | 74.3 / 91.6 / 99.0 |
| LaMAR (CAB scene) | | | up to 66.4% @ (1 m, 5°) |

Compared to state-of-the-art alternatives, including HLoc and scene-coordinate regression methods, ImLoc demonstrates superior accuracy at significantly reduced map memory footprints. For example, ImLoc attains an $11$ cm/$0.2^\circ$ median error with a $90$ MB map for Cambridge Landmarks, matching the accuracy of HLoc ($11$ cm/$0.2^\circ$) at $800$ MB and outperforming all SCR baselines ($4$–$260$ MB).

Across all datasets, ImLoc matches or exceeds centralized SfM-based approaches, confirming the effectiveness of the image+depth representation combined with dense matching and GPU-accelerated RANSAC (Jiang et al., 7 Jan 2026).

5. Relation to Previous Approaches

Traditional visual localization methods fall into two categories: 2D image-based and 3D structure-based pipelines. The former are simple to build and maintain, but lack the capacity for geometric reasoning, especially under challenging conditions. Structure-based approaches, leveraging centralized point-cloud models (e.g., SfM), offer high accuracy but suffer from reconstruction, update, and scalability limitations.

ImLoc's design bridges these paradigms by using per-image dense depth maps. This approach enables geometric constraints for effective localization without the overhead or rigidity of centralized 3D frameworks. Dense pixel-wise matching sidesteps the limitations of sparse local feature descriptors and enhances performance in dynamic or varying scenes.

For context, InLoc (Taira et al., 2018) pursued indoor localization via dense feature matching and pose verification through virtual view synthesis, but required a database built from RGB-D panoramas and a centralized 3D point cloud. ImLoc dispenses with centralized point clouds, relying on just-in-time geometric reasoning from per-image depth.

6. Scalability, Flexibility, and Dynamic Scene Adaptation

ImLoc's memory efficiency and update flexibility derive from its non-centralized, image+depth representation. Updates to the map (e.g., new images, changes in scene geometry) do not require reprocessing global structure. Dynamic scenes, where objects or layouts change, are handled naturally via the retrieval and matching stages, and the loose coupling of depth estimation per image.

By adjusting map image resolution, JPEG XL compression quality, and reference frame subsampling, ImLoc supports application-specific trade-offs. This adaptability allows deployment in resource-constrained environments, from large-scale outdoor mapping to minimal "nano" maps for embedded systems.

7. Implications and Future Directions

ImLoc demonstrates that geometric inference for visual localization can be recovered from fully 2D representations augmented by estimated depth, obviating the need for explicit centralized 3D structures and their maintenance overhead. This suggests further research avenues in compressing depth information, enhancing dense matching robustness, and generalizing to more complex or unstructured environments.

A plausible implication is the broader applicability of image+depth maps in mobile robotics, augmented reality, and autonomous navigation, particularly where map size, update latency, and dynamic scene handling are operational constraints. Future work may investigate further memory-reduction techniques and improvements in dense correspondence filtering under challenging visibility or illumination conditions (Jiang et al., 7 Jan 2026).
