Light3R-SfM: Towards Feed-forward Structure-from-Motion
Published 24 Jan 2025 in cs.CV and cs.LG | (2501.14914v1)
Abstract: We present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. Unlike existing SfM solutions that rely on costly matching and global optimization to achieve accurate 3D reconstructions, Light3R-SfM addresses this limitation through a novel latent global alignment module. This module replaces traditional global optimization with a learnable attention mechanism, effectively capturing multi-view constraints across images for robust and precise camera pose estimation. Light3R-SfM constructs a sparse scene graph via retrieval-score-guided shortest path tree to dramatically reduce memory usage and computational overhead compared to the naive approach. Extensive experiments demonstrate that Light3R-SfM achieves competitive accuracy while significantly reducing runtime, making it ideal for 3D reconstruction tasks in real-world applications with a runtime constraint. This work pioneers a data-driven, feed-forward SfM approach, paving the way toward scalable, accurate, and efficient 3D reconstruction in the wild.
The paper introduces Light3R-SfM, a fully learnable feed-forward Structure-from-Motion framework achieving efficient and scalable scene reconstruction from large image collections.
Its latent global alignment module uses a scalable attention mechanism to implicitly capture multi-view constraints and share global information efficiently.
It builds a sparse scene graph via a shortest path tree and accumulates pairwise 3D pointmaps globally, reducing memory and computation.
The paper introduces Light3R-SfM, a novel feed-forward framework for Structure-from-Motion (SfM) designed for efficiency and scalability on large, unconstrained image collections. It addresses the limitations of traditional SfM methods, which rely on costly matching and global optimization, with a latent global alignment module that replaces optimization-based global alignment with a learnable attention mechanism. The method constructs a sparse scene graph via a retrieval-score-guided shortest path tree (SPT) to reduce memory usage and computational overhead.
The key contributions of Light3R-SfM are:
A fully learnable feed-forward SfM model that directly estimates globally aligned camera poses from unordered image collections, thereby eliminating expensive optimization-based global alignment.
A latent global alignment module with a scalable attention mechanism that implicitly captures multi-view constraints, enabling global information sharing between features prior to pairwise 3D reconstruction.
The Light3R-SfM pipeline consists of four main stages:
Encoding: An image encoder extracts per-image feature tokens F_i^(0) = Enc(I_i) ∈ R^{(H/p · W/p) × d}, where I_i ∈ R^{H×W×3} is the input image, H and W are its height and width, p is the patch size of the encoder, and d is the token dimensionality.
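As a toy illustration of the token shapes involved, the sketch below patchifies an image and projects each patch to a d-dimensional token. The linear projection standing in for Enc is a simplifying assumption; the paper's encoder is a learned transformer backbone.

```python
import numpy as np

def encode_image(image, patch_size, proj):
    """Toy stand-in for Enc: split the image into non-overlapping
    p x p patches and linearly project each patch to a d-dim token."""
    H, W, C = image.shape
    p = patch_size
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (H/p * W/p, p*p*C)
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * C)
    return patches @ proj  # feature tokens F_i^(0), shape (H/p * W/p, d)

rng = np.random.default_rng(0)
H, W, p, d = 32, 32, 8, 16
img = rng.standard_normal((H, W, 3))
proj = rng.standard_normal((p * p * 3, d))  # hypothetical projection weights
tokens = encode_image(img, p, proj)
print(tokens.shape)  # (16, 16): (H/p)*(W/p) = 16 tokens of dimension d = 16
```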
Latent Global Alignment: This module performs implicit global alignment in the latent space using a scalable attention mechanism to globally align image tokens in the feature space.
It computes a global token g_i^(0) ∈ R^d for each set of image tokens F_i^(0) by averaging over the spatial dimensions.
It then applies L latent global alignment blocks to share global information across all image tokens.
For each level l ∈ {0, …, L−1}, information is shared across all global tokens {g_i^(l)}_{i=1..N} via self-attention, defined as {g_i^(l+1)}_{i=1..N} = Self({g_i^(l)}_{i=1..N}).
The updated global information is propagated to the dense image tokens of each image independently via cross-attention: F_i^(l+1) = Cross(F_i^(l), {g_i^(l+1)}_{i=1..N}).
Finally, the globally aligned image tokens are obtained via a residual connection, F_i := F_i^(0) + F_i^(L).
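The steps above can be sketched in NumPy. This is a minimal sketch assuming identity query/key/value projections and no MLP sub-layers; the real blocks are learned transformer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Plain scaled dot-product attention (no learned projections).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def latent_global_alignment(F0, num_blocks=2):
    """Sketch of the latent global alignment module.
    F0 is a list of N token arrays, one (T, d) array per image."""
    F = [f.copy() for f in F0]
    g = np.stack([f.mean(axis=0) for f in F])      # global tokens g_i: (N, d)
    for _ in range(num_blocks):
        g = attention(g, g, g)                     # self-attn across images
        F = [attention(f, g, g) for f in F]        # cross-attn per image
    # residual connection: globally aligned tokens F_i := F_i^(0) + F_i^(L)
    return [f0 + f for f0, f in zip(F0, F)]

rng = np.random.default_rng(1)
F0 = [rng.standard_normal((9, 8)) for _ in range(4)]  # N=4 images, 9 tokens each
F_aligned = latent_global_alignment(F0)
print(len(F_aligned), F_aligned[0].shape)  # 4 (9, 8)
```

The key design point is cost: self-attention operates on N global tokens rather than all N·T dense tokens, so cross-image information sharing stays cheap as the collection grows.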
Scene Graph Construction: It constructs a scene graph that maximizes pairwise image similarities using the shortest path tree (SPT) algorithm. The matrix S of all pairwise cosine similarities is computed as S_ij = ⟨F̄_i/∥F̄_i∥₂, F̄_j/∥F̄_j∥₂⟩, where F̄_i is a one-dimensional embedding obtained by average-pooling the tokens F_i of each image.
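A minimal sketch of this stage, with two labeled assumptions: edge weights are taken as (1 − cosine similarity), and the root is chosen as the most "central" image (highest total similarity); the paper's exact weighting and root selection may differ.

```python
import numpy as np

def build_spt(tokens_per_image):
    """Build a shortest path tree over images from average-pooled embeddings.
    Returns the root index and the tree edges (parent, child)."""
    emb = np.stack([t.mean(axis=0) for t in tokens_per_image])
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    S = emb @ emb.T                      # pairwise cosine similarities S_ij
    N = len(emb)
    W = 1.0 - S                          # turn similarity into a distance
    root = int(S.sum(axis=1).argmax())   # assumed root: most central image
    dist = np.full(N, np.inf); dist[root] = 0.0
    parent = np.full(N, -1); visited = np.zeros(N, dtype=bool)
    for _ in range(N):                   # Dijkstra on the complete graph
        u = int(np.where(visited, np.inf, dist).argmin())
        visited[u] = True
        for v in range(N):
            if not visited[v] and dist[u] + W[u, v] < dist[v]:
                dist[v] = dist[u] + W[u, v]
                parent[v] = u
    # tree edges (parent, child) are the image pairs sent to the decoder
    return root, [(int(parent[v]), v) for v in range(N) if v != root]

rng = np.random.default_rng(2)
toks = [rng.standard_normal((5, 8)) for _ in range(5)]
root, edges = build_spt(toks)
print(len(edges))  # a tree over 5 nodes has 4 edges
```

A tree keeps only N−1 of the O(N²) candidate pairs, which is where the memory and compute savings over decoding every pair come from.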
Decoding and Global Accumulation: The decoding step converts image pairs connected by an edge into pointmaps using a stereo reconstruction decoder. The global accumulation step merges pairwise pointmaps by traversing the scene graph, yielding globally aligned pointmaps and, per image, camera extrinsics P_i ∈ R^{4×4}, intrinsics K_i ∈ R^{3×3}, and a dense 3D pointmap at image resolution X_i ∈ R^{H×W×3}.
For every edge (i,j) ∈ E in the scene graph, the decoder outputs two pointmaps and associated confidence maps: (X_i^i, X_j^i), (C_i^i, C_j^i) = Dec(F_i, F_j), where X_j^i denotes the pointmap of image j expressed in the camera frame of image i.
Per-edge local pointmap predictions are merged into a global one.
The global point cloud and confidences are initialized from the root edge as X = {X_i, X_j} and C = {C_i, C_j}.
For each subsequent edge (k, l), Procrustes alignment estimates the optimal rigid-body transformation between the global pointmap of the already-registered node k and its local prediction, weighted by log-confidence: P_k = Procrustes(X_k, X_k^k, log C_k).
The pointmap of node l is then transformed into the global coordinate frame: X_l = P_k^{-1} X_l^k.
This is repeated for all edges in the scene graph E.
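The Procrustes step admits a standard closed form (weighted Kabsch algorithm). The sketch below finds the rigid transform minimizing the weighted squared error between two point sets; treating the confidence weights as given scalars is the only assumption beyond the textbook solution.

```python
import numpy as np

def procrustes(X_global, X_local, weights):
    """Weighted rigid alignment (Kabsch): find R, t minimizing
    sum_p w_p * ||R x_local_p + t - x_global_p||^2, returned as a 4x4 matrix."""
    w = weights / weights.sum()
    mu_g = (w[:, None] * X_global).sum(axis=0)   # weighted centroids
    mu_l = (w[:, None] * X_local).sum(axis=0)
    H = ((X_local - mu_l) * w[:, None]).T @ (X_global - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # det correction guards against reflections
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_g - R @ mu_l
    P = np.eye(4); P[:3, :3] = R; P[:3, 3] = t
    return P

# sanity check: recover a known rotation + translation
rng = np.random.default_rng(3)
X_local = rng.standard_normal((100, 3))
a = 0.5
R_true = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
t_true = np.array([1.0, -2.0, 0.5])
X_global = X_local @ R_true.T + t_true
P = procrustes(X_global, X_local, np.ones(100))
print(np.allclose(P[:3, : 3], R_true, atol=1e-6))  # True
```

Because each edge needs only this closed-form solve rather than an iterative bundle adjustment, the accumulation stays feed-forward and fast.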
The model is supervised by both pairwise and global losses. The pairwise loss, Lpair, supervises the pairwise local pointmaps per-edge:
L_pair = Σ_{(i,j)∈E} ( L_conf(P_i X̄_i, X_i^i, C_i^i, D_i) + L_conf(P_i X̄_j, X_j^i, C_j^i, D_j) ), where L_conf(X̄, X, C, D) := Σ_{p∈D} C_p ∥X_p − X̄_p∥ − α log C_p.
X, C, and X̄ are the predicted pointmap, the confidence map, and the ground-truth pointmap; D ⊆ {1,…,W}×{1,…,H} is the set of pixels with valid ground truth, and α > 0 regularizes the confidences so they are not pushed to 0. The global loss, L_global, supervises the transformed global pointmap prediction for each image: L_global = Σ_{i=1..N} L_conf(X̄_i, P_align X_i, C_i, D_i). The total loss is L = L_pair + λ L_global, with λ = 0.1.
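A minimal sketch of the confidence-weighted term L_conf, assuming Euclidean per-pixel distance and a boolean validity mask for D:

```python
import numpy as np

def conf_loss(X_gt, X_pred, C, mask, alpha=0.2):
    """L_conf(Xbar, X, C, D) = sum_{p in D} C_p * ||X_p - Xbar_p|| - alpha * log C_p.
    The -alpha*log(C) term keeps confidences from collapsing to zero;
    alpha=0.2 here is an illustrative value, not the paper's setting."""
    err = np.linalg.norm(X_pred - X_gt, axis=-1)   # per-pixel 3D distance
    per_pixel = C * err - alpha * np.log(C)
    return per_pixel[mask].sum()

rng = np.random.default_rng(4)
H, W = 4, 4
X_gt = rng.standard_normal((H, W, 3))              # ground-truth pointmap
X_pred = X_gt + 0.01 * rng.standard_normal((H, W, 3))
C = np.full((H, W), 1.0)                           # predicted confidences
mask = np.ones((H, W), dtype=bool)                 # valid-pixel set D
L_pair_term = conf_loss(X_gt, X_pred, C, mask)
# total loss combines per-edge pairwise terms with the global term:
# L = L_pair + 0.1 * L_global
print(L_pair_term > 0)  # True
```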
The method was evaluated on Tanks&Temples, CO3Dv2, and the Waymo Open Dataset. The evaluation metrics include relative rotation accuracy (RRA), relative translation accuracy (RTA), average translation error (ATE), and registration rate (Reg.). Results on Tanks&Temples show that Light3R-SfM achieves competitive accuracy compared to other learning-based methods and rivals state-of-the-art optimization-based SfM techniques while offering significant improvements in efficiency and scalability. For instance, Light3R-SfM reconstructs a scene of 200 images in 33 seconds, whereas MASt3R-SfM takes approximately 27 minutes. Comparisons with Spann3R demonstrate the superiority of the latent global alignment module, which yields average increases of 145% in RRA and 84% in RTA.
On the Waymo Open Dataset, Light3R-SfM matches the accuracy of MASt3R-SfM at roughly 195× lower runtime, and outperforms Spann3R with about 4× higher RTA@5 at more than 6× lower runtime.
Ablation studies validate the impact of each component, including backbone initialization, global supervision, latent alignment, and graph construction strategies.