
Light3R-SfM: Towards Feed-forward Structure-from-Motion

Published 24 Jan 2025 in cs.CV and cs.LG | (2501.14914v1)

Abstract: We present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. Unlike existing SfM solutions that rely on costly matching and global optimization to achieve accurate 3D reconstructions, Light3R-SfM addresses this limitation through a novel latent global alignment module. This module replaces traditional global optimization with a learnable attention mechanism, effectively capturing multi-view constraints across images for robust and precise camera pose estimation. Light3R-SfM constructs a sparse scene graph via retrieval-score-guided shortest path tree to dramatically reduce memory usage and computational overhead compared to the naive approach. Extensive experiments demonstrate that Light3R-SfM achieves competitive accuracy while significantly reducing runtime, making it ideal for 3D reconstruction tasks in real-world applications with a runtime constraint. This work pioneers a data-driven, feed-forward SfM approach, paving the way toward scalable, accurate, and efficient 3D reconstruction in the wild.

Summary

  • The paper introduces Light3R-SfM, a fully learnable feed-forward Structure-from-Motion framework achieving efficient and scalable scene reconstruction from large image collections.
  • Its latent global alignment module uses a scalable attention mechanism to implicitly capture multi-view constraints and share global information efficiently.
  • It builds a sparse scene graph via a shortest path tree and accumulates pairwise 3D pointmaps globally, reducing memory and computation.

The paper introduces Light3R-SfM, a novel feed-forward framework for Structure-from-Motion (SfM) designed for efficiency and scalability on large, unconstrained image collections. It addresses the limitations of traditional SfM methods, which rely on costly matching and global optimization, by introducing a latent global alignment module that replaces traditional global optimization with a learnable attention mechanism. The method constructs a sparse scene graph via a retrieval-score-guided shortest path tree (SPT) to reduce memory usage and computational overhead.

The key contributions of Light3R-SfM are:

  • A fully learnable feed-forward SfM model that directly estimates globally aligned camera poses from unordered image collections, thereby eliminating expensive optimization-based global alignment.
  • A latent global alignment module with a scalable attention mechanism that implicitly captures multi-view constraints, enabling global information sharing between features prior to pairwise 3D reconstruction.

The Light3R-SfM pipeline consists of four main stages:

  1. Encoding: An image encoder extracts per-image feature tokens $F^{(0)}_i = \mathtt{Enc}(\mathcal{I}_i) \in \mathbb{R}^{\frac{HW}{p^2} \times d}$, where $\mathcal{I}_i \in \mathbb{R}^{H \times W \times 3}$ is the input image, $H$ and $W$ are its height and width, $p$ is the patch size of the encoder, and $d$ is the token dimensionality.
  2. Latent Global Alignment: This module performs implicit global alignment of the image tokens in latent space using a scalable attention mechanism.
    • It computes a global token $g_i^{(0)} \in \mathbb{R}^{d}$ for each set of image tokens $F^{(0)}_i$ by averaging over the spatial dimensions.
    • It applies $L$ latent global alignment blocks to share information globally across all image tokens.
    • At each level $l \in \{0, \ldots, L-1\}$, it shares information across all global image tokens $\{g^{(l)}_i\}_{i=1}^{N}$ via self-attention: $\{g_i^{(l+1)}\}_{i=1}^{N} = \mathtt{Self}(\{g_i^{(l)}\}_{i=1}^{N})$.
    • It propagates the updated global information to the dense image tokens $\{F^{(l)}_i\}_{i=1}^{N}$ of each image independently via cross-attention: $F_i^{(l+1)} = \mathtt{Cross}(F_i^{(l)}, \{g_i^{(l+1)}\}_{i=1}^{N})$.
    • Finally, it obtains the globally aligned image tokens $F_i$ via a residual connection: $F_i := F^{(0)}_i + F^{(L)}_i$.
  3. Scene Graph Construction: It constructs a scene graph that maximizes pairwise image similarities using the shortest path tree (SPT) algorithm. The matrix $S$ of all pairwise cosine similarities is computed as $S_{ij} = \left\langle \frac{\bar{F}_i}{\|\bar{F}_i\|_2}, \frac{\bar{F}_j}{\|\bar{F}_j\|_2} \right\rangle$, where $\bar{F}_i$ is a one-dimensional embedding obtained by average-pooling the tokens $F_i$ of each image.
  4. Decoding and Global Accumulation: The decoding step converts image pairs connected by an edge into pointmaps using a stereo reconstruction decoder. Global accumulation then merges the pairwise pointmaps by traversing the scene graph, yielding globally aligned pointmaps and, for each image, camera extrinsics $P_i \in \mathbb{R}^{4 \times 4}$, intrinsics $K_i \in \mathbb{R}^{3 \times 3}$, and a dense 3D pointmap at image resolution $X_i \in \mathbb{R}^{H \times W \times 3}$.
    • For every edge $(i, j) \in \mathcal{E}$ in the scene graph, the decoder outputs two pointmaps and associated confidence maps: $(X^{i}_{i}, X^{j}_{i}), (C^{i}_{i}, C^{j}_{i}) = \mathtt{Dec}(F_i, F_j)$, where the subscript denotes the camera coordinate frame in which the pointmap is expressed.
    • Per-edge local pointmap predictions are merged into a global one.
    • The global point cloud and confidences are initialized as $\mathbf{X} = \{X^{i}, X^{j}\}$ and $\mathbf{C} = \{C^{i}, C^{j}\}$.
    • Procrustes alignment estimates the optimal rigid-body transformation between the two pointmaps: $P_{k} = \mathtt{Procrustes}(X^{k}, X^{k}_{k}, \log C^{k})$.
    • The pointmap of node $l$ is then transformed into the global coordinate frame: $X^{l} = P_k^{-1} X^{l}_{k}$.
    • This is repeated for all edges in the scene graph, progressively accumulating the globally aligned pointmaps $X^{i}$ and confidence maps $C^{i}$.
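The latent global alignment step above (per-image global tokens, self-attention across images, cross-attention back to dense tokens, residual connection) can be sketched in NumPy. This is a minimal illustration with a single unlearned attention head; the actual module uses learned projections inside trained transformer-style blocks, and the number of blocks here is an arbitrary choice.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def latent_global_alignment(tokens, num_blocks=2):
    """tokens: list of N arrays, each (T_i, d) -- per-image feature tokens F_i^{(0)}.
    Returns globally aligned tokens F_i = F_i^{(0)} + F_i^{(L)} (sketch)."""
    F0 = [Fi.copy() for Fi in tokens]
    F = [Fi.copy() for Fi in tokens]
    # One global token per image: average of its dense tokens over the spatial dimension.
    g = np.stack([Fi.mean(axis=0) for Fi in F])               # (N, d)
    for _ in range(num_blocks):
        # Self-attention across the N global tokens shares information globally.
        g = g + attention(g, g, g)
        # Cross-attention injects the global context into each image's dense tokens.
        F = [Fi + attention(Fi, g, g) for Fi in F]
    # Residual connection back to the initial tokens.
    return [f0 + fL for f0, fL in zip(F0, F)]
```

Each image's token shape is preserved, so the aligned tokens can feed the pairwise decoder unchanged.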
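The scene-graph construction can be illustrated with a small Dijkstra-based shortest path tree over the similarity matrix $S$. Using $1 - S_{ij}$ as the edge length and picking the most "central" image as root are assumptions of this sketch; the paper guides the SPT by retrieval scores.

```python
import heapq
import numpy as np

def build_spt(S):
    """Build a shortest-path-tree scene graph from an (N, N) pairwise
    similarity matrix S. Edge length is 1 - similarity (illustrative choice).
    Returns the root index and the list of tree edges (parent, child)."""
    N = S.shape[0]
    edge_len = 1.0 - S
    root = int(np.argmax(S.sum(axis=1)))      # most similar-to-all image as root
    dist = np.full(N, np.inf)
    dist[root] = 0.0
    parent = np.full(N, -1)
    visited = np.zeros(N, dtype=bool)
    pq = [(0.0, root)]
    while pq:                                  # Dijkstra on the dense graph
        d, u = heapq.heappop(pq)
        if visited[u]:
            continue
        visited[u] = True
        for v in range(N):
            if v == u or visited[v]:
                continue
            nd = d + edge_len[u, v]
            if nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(pq, (nd, v))
    edges = [(int(parent[v]), v) for v in range(N) if v != root]
    return root, edges
```

The resulting tree has exactly N-1 edges, so only N-1 pairwise decodings are needed instead of the N(N-1)/2 of a fully connected graph.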
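A weighted rigid Procrustes (Kabsch) solver of the kind invoked as $\mathtt{Procrustes}(\cdot)$ above might look like the following sketch; clipping the (log-)confidence weights to be non-negative is an assumption of this illustration.

```python
import numpy as np

def weighted_procrustes(X_global, X_local, w):
    """Estimate the rigid transform (R, t) minimizing
    sum_p w_p * || R x_p + t - y_p ||^2, mapping local points X_local (M, 3)
    onto global points X_global (M, 3) with per-point weights w (M,).
    Weighted Kabsch algorithm via SVD."""
    w = np.clip(w, 0.0, None)
    w = w / w.sum()
    mu_g = (w[:, None] * X_global).sum(axis=0)      # weighted centroids
    mu_l = (w[:, None] * X_local).sum(axis=0)
    # Weighted cross-covariance between centered point sets.
    A = (X_global - mu_g).T @ (w[:, None] * (X_local - mu_l))
    U, _, Vt = np.linalg.svd(A)
    # Correct the sign to get a proper rotation (no reflection).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    t = mu_g - R @ mu_l
    return R, t
```

With the recovered $(R, t)$ assembled into a 4x4 matrix $P_k$, a pointmap expressed in a local camera frame can be mapped into the global frame as in the accumulation step.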

The model is supervised by both pairwise and global losses. The pairwise loss $L_{pair}$ supervises the pairwise local pointmaps per edge: $L_{pair} = \sum_{(i, j) \in \mathcal{E}} \left( L_{conf}(P_i \bar{X}^i, X^{i}_{i}, C^{i}_{i}, D^i) + L_{conf}(P_i \bar{X}^j, X^{j}_{i}, C^{j}_{i}, D^j) \right)$, where $L_{conf}(\bar{X}, X, C, D) := \sum_{p \in D} C_p \left\Vert X_{p} - \bar{X}_p \right\Vert - \alpha C_p$. Here $X$, $C$, and $\bar{X}$ denote the predicted pointmap, the confidence map, and the ground-truth pointmap, $D \subseteq \{1, \ldots, W\} \times \{1, \ldots, H\}$ is the set of pixels with valid ground truth, and $\alpha > 0$ regularizes the confidences so they are not pushed to $0$. The global loss $L_{global}$ supervises the transformed global pointmap prediction for each image: $L_{global} = \sum_{i=1}^{N} L_{conf}(\bar{X}^i, P_{\mathtt{align}} X^i, C^i, D^i)$. The total loss is $\mathcal{L} = L_{pair} + \lambda L_{global}$ with $\lambda = 0.1$.
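The confidence-weighted loss $L_{conf}$ above can be written as a short NumPy function. The value of $\alpha$ here is an arbitrary placeholder, not the paper's setting.

```python
import numpy as np

def conf_loss(X_gt, X_pred, C, mask, alpha=0.2):
    """Confidence-weighted pointmap loss L_conf:
    sum over valid pixels p of  C_p * ||X_p - Xbar_p|| - alpha * C_p.
    X_gt, X_pred: (H, W, 3) ground-truth and predicted pointmaps;
    C: (H, W) confidence map; mask: (H, W) bool, valid ground-truth pixels;
    alpha > 0 keeps the confidences from collapsing to zero."""
    err = np.linalg.norm(X_pred - X_gt, axis=-1)   # per-pixel Euclidean error
    per_pixel = C * err - alpha * C
    return per_pixel[mask].sum()
```

The pairwise and global losses are then sums of this term over tree edges and over images, respectively.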

The method was evaluated on Tanks&Temples, CO3Dv2, and the Waymo Open Dataset. The evaluation metrics include relative rotation accuracy (RRA), relative translation accuracy (RTA), average translation error (ATE), and registration rate (Reg.). Results on Tanks&Temples show that Light3R-SfM achieves competitive accuracy compared to other learning-based methods and rivals state-of-the-art optimization-based SfM techniques while offering significant improvements in efficiency and scalability. For instance, Light3R-SfM reconstructs a scene of 200 images in 33 seconds, whereas MASt3R-SfM takes approximately 27 minutes. Comparisons with Spann3R demonstrate the superiority of the latent global alignment module, which yields average increases of 145% and 84% in RRA and RTA scores, respectively. On the Waymo Open Dataset, Light3R-SfM achieves accuracy comparable to MASt3R-SfM at roughly 195x lower runtime, and outperforms Spann3R with better accuracy (about 4x in RTA@5) at more than 6x lower runtime. Ablation studies validate the impact of each component, including backbone initialization, global supervision, latent alignment, and graph construction strategies.
