- The paper introduces LucidFusion, a novel feed-forward framework that generates high-quality 3D Gaussians from arbitrary unposed images using a Relative Coordinate Map instead of pose estimation.
- LucidFusion operates in two stages, first mapping images to Relative Coordinate Maps using a repurposed Stable Diffusion model, then refining outputs into 3D Gaussians for geometric fidelity.
- LucidFusion demonstrates superior performance over state-of-the-art methods, generating 512×512-resolution 3D Gaussians at up to 13 fps, offering efficiency for AR/VR and digital content creation.
LucidFusion: Generating 3D Gaussians with Arbitrary Unposed Images
Generating high-quality 3D content from sparse image inputs remains a significant challenge in computer vision and graphics. The paper "LucidFusion: Generating 3D Gaussians with Arbitrary Unposed Images" addresses this problem with a flexible, end-to-end feed-forward framework that efficiently generates 3D objects from unposed, sparse multi-view images.
Overview of Methodology
LucidFusion eschews traditional pose estimation frameworks, instead introducing the Relative Coordinate Map (RCM) as a central innovation. Unlike conventional methods reliant on known pose parameters or complex pose estimation techniques, RCM aligns geometric features consistently across diverse views without requiring explicit pose data. This shift away from dependence on pose information is significant, allowing for more adaptable and intuitive 3D generation from arbitrary images.
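The paper's exact RCM formulation is not reproduced here, but the core idea can be sketched: an RCM stores, for each foreground pixel, a 3D coordinate expressed in a single shared reference frame. Because every view's map lives in the same frame, unprojecting RCMs from different views yields points that are already aligned, with no per-view pose estimation step. A minimal illustration (all shapes and names are assumptions for this sketch, not the authors' code):

```python
import numpy as np

def rcm_to_points(rcm: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Lift a Relative Coordinate Map into a point cloud.

    Simplified reading of the RCM idea: each foreground pixel stores an
    (x, y, z) coordinate in a shared reference frame, so points from
    different views are aligned without any per-view pose.

    rcm:  (H, W, 3) per-pixel 3D coordinates.
    mask: (H, W) boolean foreground mask.
    """
    return rcm[mask]  # (N, 3) points, already in the common frame

# Toy example: maps from two "views" merge directly into one cloud.
h = w = 4
rcm_a = np.random.rand(h, w, 3).astype(np.float32)
rcm_b = np.random.rand(h, w, 3).astype(np.float32)
mask = np.ones((h, w), dtype=bool)
cloud = np.concatenate([rcm_to_points(rcm_a, mask),
                        rcm_to_points(rcm_b, mask)])
print(cloud.shape)  # (32, 3): fusion requires no pose estimation step
```

This is what distinguishes the RCM from an ordinary depth map, whose values are only meaningful relative to its own camera.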
The LucidFusion approach operates through two primary stages:
- RCM Prediction Stage: Utilizes a pre-trained Stable Diffusion model, repurposed as a feed-forward network, to map input images onto RCMs. This stage maintains geometric coherence across multiple views by aligning features, using the model's self-attention to share information across the concatenated multi-view inputs.
- 3D Gaussian Refinement Stage: Refines the noisy point clouds obtained from the RCMs into 3D Gaussians, enforcing global 3D consistency and improving geometric fidelity. The refinement applies a differentiable rendering loss to correct inaccuracies in the initial output's geometry and texture.
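The two stages above can be summarized as a hypothetical skeleton. The function names, shapes, and Gaussian parameterization here are illustrative stand-ins, not the authors' API; the real stage 1 is a trained network and stage 2 is optimized with a differentiable rendering loss:

```python
import numpy as np

def predict_rcms(images: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: a trained network (a repurposed Stable Diffusion
    model in the paper) would map N unposed views to N coordinate maps.
    Here we only emit dummy maps of the matching shape."""
    n, h, w, _ = images.shape
    return np.zeros((n, h, w, 3), dtype=np.float32)

def refine_to_gaussians(rcms: np.ndarray) -> dict:
    """Stage 2 stand-in: lift RCM pixels to points and attach per-point
    Gaussian parameters (means, scales, opacities) that the real model
    would regress and refine under a differentiable rendering loss."""
    points = rcms.reshape(-1, 3)
    return {
        "means": points,
        "scales": np.full((len(points), 3), 0.01, dtype=np.float32),
        "opacities": np.ones((len(points), 1), dtype=np.float32),
    }

views = np.zeros((4, 8, 8, 3), dtype=np.float32)  # four unposed input views
gaussians = refine_to_gaussians(predict_rcms(views))
print(gaussians["means"].shape)  # (256, 3)
```

The key structural point the sketch captures is that the pipeline is purely feed-forward: images go in, Gaussian parameters come out, with no pose-estimation loop in between.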
Technical Contributions and Results
Key contributions of the paper include:
- A novel RCM mechanism enabling 3D object generation from unposed inputs.
- Integration of the RCM with existing 2D networks, leveraging pre-existing model priors to enhance adaptability across a variety of objects and viewpoints.
- Demonstrated ability to generate high-resolution 3D Gaussians, notably achieving up to 13 frames per second for outputs at a resolution of 512 x 512.
LucidFusion demonstrates superior qualitative and quantitative performance compared to state-of-the-art baselines in multiple settings, such as sparse-view inputs and single-view generation extended with multi-view diffusion models. The method surpasses existing approaches on standard metrics (higher PSNR and SSIM, lower LPIPS), yielding higher fidelity and visual coherence in the generated 3D outputs.
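For reference, PSNR, one of the reported metrics, compares a rendered view against ground truth via mean squared error; higher is better. A standard implementation (SSIM and LPIPS require windowed statistics and a learned network, respectively, so they are omitted here):

```python
import numpy as np

def psnr(rendered: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((rendered - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Two flat images differing by 0.1 everywhere: MSE = 0.01, so PSNR = 20 dB.
a = np.full((4, 4), 0.5)
b = np.full((4, 4), 0.6)
print(round(psnr(a, b), 2))  # 20.0
```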
Implications and Future Directions
The implications of LucidFusion are multifaceted: it offers significant efficiencies for fields that need 3D content generation without the expertise traditionally required for accurate pose estimation. By removing the computational overhead of pose estimation, LucidFusion is positioned as a robust alternative for applications in AR/VR, gaming, and digital content creation, where flexibility and speed are paramount.
In terms of future developments, the RCM concept could potentially extend further into areas involving dynamic scenes or incorporating environmental textures and lighting. Moreover, exploration into generalization across a wider variety of capture conditions (e.g., varying field of view and resolution) could establish broader applicability and resilience of the LucidFusion model.
In conclusion, LucidFusion represents a notable advancement in the field of 3D reconstruction from image inputs, primarily through its elimination of pose dependency. By successfully integrating foundational 2D model capabilities within a novel coordinate mapping framework, LucidFusion sets a precedent for future innovations aimed at simplifying and expediting 3D generation processes.