Efficient Vision Transformers for 3D Reconstruction

Develop efficient Vision Transformer architectures for 3D reconstruction that maintain long-range, multi-view geometric consistency across input image sequences.

Background

The paper highlights that most existing methods for making Vision Transformers efficient have been developed for 2D tasks such as image classification, where maintaining cross-view geometric consistency is not required.

In contrast, feed-forward 3D reconstruction must model long-range relationships across many views to preserve geometric consistency, which makes straightforward adaptations of efficient 2D ViT designs inadequate. Although FlashVGGT proposes descriptor-based cross-attention as one approach, the authors state that designing generally efficient Vision Transformers tailored to 3D reconstruction remains an open challenge.
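To make the efficiency argument concrete, the following minimal sketch counts query-key pairs for attention over a multi-view token sequence. It assumes that descriptor-based cross-attention means every image token attends to a small, fixed number of descriptor tokens per view instead of to all tokens from all views. The function names, the `descriptors_per_view` parameter, and the numbers are illustrative assumptions, not values or APIs from the paper.

```python
def full_attention_pairs(num_views: int, tokens_per_view: int) -> int:
    """Query-key pairs when every token attends to every token across all views."""
    total = num_views * tokens_per_view
    return total * total


def descriptor_attention_pairs(num_views: int, tokens_per_view: int,
                               descriptors_per_view: int) -> int:
    """Query-key pairs when every token attends only to per-view descriptor tokens.

    Assumption: each view is compressed into `descriptors_per_view` tokens
    (hypothetical parameter, used here only for illustration).
    """
    queries = num_views * tokens_per_view
    keys = num_views * descriptors_per_view
    return queries * keys


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    views, tokens, descriptors = 100, 1024, 64
    full = full_attention_pairs(views, tokens)
    compressed = descriptor_attention_pairs(views, tokens, descriptors)
    print(f"full: {full}, descriptor-based: {compressed}, ratio: {full // compressed}")
```

Under these assumptions the saving equals the per-view compression ratio (tokens per view divided by descriptors per view), while the cost still grows quadratically with the number of views. This is one way to see why efficiency over long image sequences remains difficult even with compressed attention.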

References

Consequently, it remains an open challenge to design efficient Vision Transformers for 3D reconstruction, as it demands maintaining long-range, multi-view geometric consistency.

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention (2512.01540 - Wang et al., 1 Dec 2025) in Related Work, Efficient Vision Transformers subsection (Section 2)