
Light3R-SfM: Towards Feed-forward Structure-from-Motion

Published 24 Jan 2025 in cs.CV and cs.LG | (2501.14914v1)

Abstract: We present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. Unlike existing SfM solutions that rely on costly matching and global optimization to achieve accurate 3D reconstructions, Light3R-SfM addresses this limitation through a novel latent global alignment module. This module replaces traditional global optimization with a learnable attention mechanism, effectively capturing multi-view constraints across images for robust and precise camera pose estimation. Light3R-SfM constructs a sparse scene graph via retrieval-score-guided shortest path tree to dramatically reduce memory usage and computational overhead compared to the naive approach. Extensive experiments demonstrate that Light3R-SfM achieves competitive accuracy while significantly reducing runtime, making it ideal for 3D reconstruction tasks in real-world applications with a runtime constraint. This work pioneers a data-driven, feed-forward SfM approach, paving the way toward scalable, accurate, and efficient 3D reconstruction in the wild.

Summary

  • The paper introduces Light3R-SfM, a fully learnable feed-forward Structure-from-Motion framework achieving efficient and scalable scene reconstruction from large image collections.
  • Its latent global alignment module uses a scalable attention mechanism to implicitly capture multi-view constraints and share global information efficiently.
  • It builds a sparse scene graph via a shortest path tree and accumulates pairwise 3D pointmaps globally, reducing memory and computation.

The paper introduces Light3R-SfM, a novel feed-forward framework for Structure-from-Motion (SfM) designed for efficiency and scalability on large, unconstrained image collections. It addresses the limitations of traditional SfM methods, which rely on costly matching and global optimization, by introducing a latent global alignment module that replaces traditional global optimization with a learnable attention mechanism. The method constructs a sparse scene graph via a retrieval-score-guided shortest path tree (SPT) to reduce memory usage and computational overhead.

The key contributions of Light3R-SfM are:

  • A fully learnable feed-forward SfM model that directly estimates globally aligned camera poses from unordered image collections, thereby eliminating expensive optimization-based global alignment.
  • A latent global alignment module with a scalable attention mechanism that implicitly captures multi-view constraints, enabling global information sharing between features prior to pairwise 3D reconstruction.

The Light3R-SfM pipeline consists of four main stages:

  1. Encoding: An image encoder extracts per-image feature tokens $F^{(0)}_i = \mathtt{Enc}(\mathcal{I}_i) \in \mathbb{R}^{\frac{HW}{p^2} \times d}$, where $\mathcal{I}_i \in \mathbb{R}^{H \times W \times 3}$ is the input image, $H$ and $W$ are its height and width, $p$ is the patch size of the encoder, and $d$ is the token dimensionality.
  2. Latent Global Alignment: This module performs implicit global alignment of the image tokens in latent space using a scalable attention mechanism.
    • It computes a global token $g_i^{(0)} \in \mathbb{R}^{d}$ for each set of image tokens $F^{(0)}_i$ by averaging over the spatial dimensions.
    • It applies $L$ latent global alignment blocks to share information globally across all image tokens.
    • At each level $l \in \{0, \ldots, L-1\}$, it shares information across all global image tokens $\{g^{(l)}_i\}_{i=1}^{N}$ via self-attention: $\{g_i^{(l+1)}\}_{i=1}^{N} = \mathtt{Self}(\{g_i^{(l)}\}_{i=1}^{N})$.
    • It propagates the updated global information to the dense image tokens $\{F^{(l)}_i\}_{i=1}^{N}$ of each image independently via cross-attention: $F_i^{(l+1)} = \mathtt{Cross}(F_i^{(l)}, \{g_i^{(l+1)}\}_{i=1}^{N})$.
    • Finally, it obtains the globally aligned image tokens $F_i$ via a residual connection: $F_i := F^{(0)}_i + F^{(L)}_i$.
  3. Scene Graph Construction: It constructs a scene graph that maximizes pairwise image similarities using the shortest path tree (SPT) algorithm. The matrix $S$ of all pairwise cosine similarities is computed as $S_{ij} = \left\langle \frac{\bar{F}_i}{\|\bar{F}_i\|_2}, \frac{\bar{F}_j}{\|\bar{F}_j\|_2} \right\rangle$, where $\bar{F}_i$ is a one-dimensional embedding obtained by average-pooling the tokens $F_i$ of each image.
  4. Decoding and Global Accumulation: The decoding step converts image pairs connected by an edge into pointmaps using a stereo reconstruction decoder. Global accumulation then merges the pairwise pointmaps by traversing the scene graph, yielding globally aligned pointmaps and, for each image, camera extrinsics $P_i \in \mathbb{R}^{4 \times 4}$, intrinsics $K_i \in \mathbb{R}^{3 \times 3}$, and a dense 3D pointmap at image resolution $X_i \in \mathbb{R}^{H \times W \times 3}$.
    • For every edge $(i, j) \in \mathcal{E}$ in the scene graph, the decoder outputs two pointmaps and associated confidence maps: $(X^{i}_{i}, X^{j}_{i}), (C^{i}_{i}, C^{j}_{i}) = \mathtt{Dec}(F_i, F_j)$, where the subscript denotes the camera coordinate frame in which the pointmap is expressed.
    • Per-edge local pointmap predictions are merged into a global one.
    • The global point cloud and confidences are initialized as $\mathbf{X} = \{X^{i}, X^{j}\}$ and $\mathbf{C} = \{C^{i}, C^{j}\}$.
    • Procrustes alignment estimates the optimal rigid-body transformation between the two pointmaps: $P_{k} = \mathtt{Procrustes}(X^{k}, X^{k}_{k}, \log C^{k})$.
    • The pointmap of node $l$ is then transformed into the global coordinate frame: $X^{l} = P_k^{-1} X^{l}_{k}$.
    • This is repeated for all edges in the scene graph, progressively accumulating the globally aligned pointmaps $X^{i}$ and confidence maps $C^{i}$.
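The latent global alignment step above (per-image global tokens, self-attention across images, cross-attention back to dense tokens, residual connection) can be sketched in NumPy. This is a minimal illustration with a single unlearned attention head; the actual module uses learned projections inside trained transformer-style blocks, and the number of blocks here is an arbitrary choice.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def latent_global_alignment(tokens, num_blocks=2):
    """tokens: list of N arrays, each (T_i, d) -- per-image feature tokens F_i^{(0)}.
    Returns globally aligned tokens F_i = F_i^{(0)} + F_i^{(L)} (sketch)."""
    F0 = [Fi.copy() for Fi in tokens]
    F = [Fi.copy() for Fi in tokens]
    # One global token per image: average of its dense tokens over the spatial dimension.
    g = np.stack([Fi.mean(axis=0) for Fi in F])               # (N, d)
    for _ in range(num_blocks):
        # Self-attention across the N global tokens shares information globally.
        g = g + attention(g, g, g)
        # Cross-attention injects the global context into each image's dense tokens.
        F = [Fi + attention(Fi, g, g) for Fi in F]
    # Residual connection back to the initial tokens.
    return [f0 + fL for f0, fL in zip(F0, F)]
```

Each image's token shape is preserved, so the aligned tokens can feed the pairwise decoder unchanged.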
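The scene-graph construction can be illustrated with a small Dijkstra-based shortest path tree over the similarity matrix $S$. Using $1 - S_{ij}$ as the edge length and picking the most "central" image as root are assumptions of this sketch; the paper guides the SPT by retrieval scores.

```python
import heapq
import numpy as np

def build_spt(S):
    """Build a shortest-path-tree scene graph from an (N, N) pairwise
    similarity matrix S. Edge length is 1 - similarity (illustrative choice).
    Returns the root index and the list of tree edges (parent, child)."""
    N = S.shape[0]
    edge_len = 1.0 - S
    root = int(np.argmax(S.sum(axis=1)))      # most similar-to-all image as root
    dist = np.full(N, np.inf)
    dist[root] = 0.0
    parent = np.full(N, -1)
    visited = np.zeros(N, dtype=bool)
    pq = [(0.0, root)]
    while pq:                                  # Dijkstra on the dense graph
        d, u = heapq.heappop(pq)
        if visited[u]:
            continue
        visited[u] = True
        for v in range(N):
            if v == u or visited[v]:
                continue
            nd = d + edge_len[u, v]
            if nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(pq, (nd, v))
    edges = [(int(parent[v]), v) for v in range(N) if v != root]
    return root, edges
```

The resulting tree has exactly N-1 edges, so only N-1 pairwise decodings are needed instead of the N(N-1)/2 of a fully connected graph.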
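A weighted rigid Procrustes (Kabsch) solver of the kind invoked as $\mathtt{Procrustes}(\cdot)$ above might look like the following sketch; clipping the (log-)confidence weights to be non-negative is an assumption of this illustration.

```python
import numpy as np

def weighted_procrustes(X_global, X_local, w):
    """Estimate the rigid transform (R, t) minimizing
    sum_p w_p * || R x_p + t - y_p ||^2, mapping local points X_local (M, 3)
    onto global points X_global (M, 3) with per-point weights w (M,).
    Weighted Kabsch algorithm via SVD."""
    w = np.clip(w, 0.0, None)
    w = w / w.sum()
    mu_g = (w[:, None] * X_global).sum(axis=0)      # weighted centroids
    mu_l = (w[:, None] * X_local).sum(axis=0)
    # Weighted cross-covariance between centered point sets.
    A = (X_global - mu_g).T @ (w[:, None] * (X_local - mu_l))
    U, _, Vt = np.linalg.svd(A)
    # Correct the sign to get a proper rotation (no reflection).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    t = mu_g - R @ mu_l
    return R, t
```

With the recovered $(R, t)$ assembled into a 4x4 matrix $P_k$, a pointmap expressed in a local camera frame can be mapped into the global frame as in the accumulation step.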

The model is supervised by both pairwise and global losses. The pairwise loss $L_{pair}$ supervises the pairwise local pointmaps per edge: $L_{pair} = \sum_{(i, j) \in \mathcal{E}} \left( L_{conf}(P_i \bar{X}^i, X^{i}_{i}, C^{i}_{i}, D^i) + L_{conf}(P_i \bar{X}^j, X^{j}_{i}, C^{j}_{i}, D^j) \right)$, where $L_{conf}(\bar{X}, X, C, D) := \sum_{p \in D} C_p \left\Vert X_{p} - \bar{X}_p \right\Vert - \alpha C_p$. Here $X$, $C$, and $\bar{X}$ denote the predicted pointmap, the confidence map, and the ground-truth pointmap, $D \subseteq \{1, \ldots, W\} \times \{1, \ldots, H\}$ is the set of pixels with valid ground truth, and $\alpha > 0$ regularizes the confidences so they are not pushed to $0$. The global loss $L_{global}$ supervises the transformed global pointmap prediction for each image: $L_{global} = \sum_{i=1}^{N} L_{conf}(\bar{X}^i, P_{\mathtt{align}} X^i, C^i, D^i)$. The total loss is $\mathcal{L} = L_{pair} + \lambda L_{global}$ with $\lambda = 0.1$.
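The confidence-weighted loss $L_{conf}$ above can be written as a short NumPy function. The value of $\alpha$ here is an arbitrary placeholder, not the paper's setting.

```python
import numpy as np

def conf_loss(X_gt, X_pred, C, mask, alpha=0.2):
    """Confidence-weighted pointmap loss L_conf:
    sum over valid pixels p of  C_p * ||X_p - Xbar_p|| - alpha * C_p.
    X_gt, X_pred: (H, W, 3) ground-truth and predicted pointmaps;
    C: (H, W) confidence map; mask: (H, W) bool, valid ground-truth pixels;
    alpha > 0 keeps the confidences from collapsing to zero."""
    err = np.linalg.norm(X_pred - X_gt, axis=-1)   # per-pixel Euclidean error
    per_pixel = C * err - alpha * C
    return per_pixel[mask].sum()
```

The pairwise and global losses are then sums of this term over tree edges and over images, respectively.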

The method was evaluated on Tanks&Temples, CO3Dv2, and the Waymo Open Dataset. The evaluation metrics include relative rotation accuracy (RRA), relative translation accuracy (RTA), average translation error (ATE), and registration rate (Reg.). Results on Tanks&Temples show that Light3R-SfM achieves competitive accuracy compared to other learning-based methods and rivals state-of-the-art optimization-based SfM techniques while offering significant improvements in efficiency and scalability. For instance, Light3R-SfM reconstructs a scene of 200 images in 33 seconds, whereas MASt3R-SfM takes approximately 27 minutes. Comparisons with Spann3R demonstrate the superiority of the latent global alignment module, which yields average increases of 145% and 84% in RRA and RTA scores, respectively. On the Waymo Open Dataset, Light3R-SfM achieves accuracy comparable to MASt3R-SfM at roughly 195x lower runtime, and outperforms Spann3R with better accuracy (about 4x in RTA@5) at more than 6x lower runtime. Ablation studies validate the impact of each component, including backbone initialization, global supervision, latent alignment, and graph construction strategies.
