GECO: Generative Image-to-3D within a SECOnd

Published 30 May 2024 in cs.CV | (2405.20327v2)

Abstract: Recent years have seen significant advancements in 3D generation. While methods like score distillation achieve impressive results, they often require extensive per-scene optimization, which limits their time efficiency. On the other hand, reconstruction-based approaches are more efficient but tend to compromise quality due to their limited ability to handle uncertainty. We introduce GECO, a novel method for high-quality 3D generative modeling that operates within a second. Our approach addresses the prevalent issues of uncertainty and inefficiency in existing methods through a two-stage approach. In the first stage, we train a single-step multi-view generative model with score distillation. Then, a second-stage distillation is applied to address the challenge of view inconsistency in the multi-view generation. This two-stage process ensures a balanced approach to 3D generation, optimizing both quality and efficiency. Our comprehensive experiments demonstrate that GECO achieves high-quality image-to-3D mesh generation with an unprecedented level of efficiency. We will make the code and model publicly available.

Citations (4)

Summary

  • The paper introduces a two-stage distillation framework that rapidly generates high-fidelity 3D models from a single image.
  • It employs score distillation and reconstruction techniques to achieve superior metrics such as PSNR, SSIM, and LPIPS on benchmark datasets.
  • Its feed-forward architecture supports fast and diverse 3D generation, enabling practical real-time applications in digital media.

An Insightful Overview of the GECO Framework for Efficient Image-to-3D Generation

The paper "GECO: Generative Image-to-3D within a Second" introduces a novel and efficient approach to high-quality 3D generative modeling from a single input image, achieving results in under a second. This method, known as GECO, stands out in the landscape of 3D generation by addressing the prevailing issues of uncertainty and inefficiency, which are common in existing techniques.

The paper delineates a two-stage process to achieve its goals. During the first stage, the authors employ score distillation on a single-step multi-view generative model. Then, in the second stage, they apply additional distillation to alleviate view inconsistency from multi-view prediction. This structured approach ensures a balanced outcome, optimizing both the quality and efficiency of the 3D generation.

Key Contributions and Methods

The paper's authors identify the limitations inherent in current 3D generation techniques:

  • Score distillation methods yield high-quality results but suffer from extensive per-scene optimization, adversely affecting time efficiency.
  • Reconstruction-based approaches compromise output quality due to their limited ability to handle associated uncertainties, although they prioritize efficiency.

To overcome these limitations, GECO integrates:

  1. A feed-forward architecture that generates 3D content in less than 0.35 seconds on a single L40 GPU.
  2. A two-stage distillation approach, leveraging pre-trained diffusion and reconstruction models to achieve high-fidelity and diverse 3D outputs.
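To make the feed-forward claim concrete, here is a minimal structural sketch of the inference path, not the authors' code: all function names, shapes, and the placeholder outputs are assumptions for exposition only. The point is that image-to-3D is a single fixed pass through two learned models, with no per-scene optimization loop.

```python
# Illustrative skeleton of GECO-style feed-forward inference.
# All names and shapes are hypothetical placeholders.
import numpy as np

def multiview_generator(image: np.ndarray, n_views: int = 6) -> np.ndarray:
    """Stand-in for the distilled single-step multi-view generator
    (Stage I): maps one RGB image (H, W, 3) to n_views views."""
    h, w, c = image.shape
    rng = np.random.default_rng(0)
    return rng.random((n_views, h, w, c))  # placeholder view images

def reconstruction_model(views: np.ndarray) -> dict:
    """Stand-in for the distilled reconstructor (Stage II): maps the
    multi-view images to a 3D representation (e.g. Gaussian params)."""
    return {"gaussians": np.zeros((1024, 14))}  # placeholder 3D params

def image_to_3d(image: np.ndarray) -> dict:
    # One feed-forward pass end to end -- no test-time optimization.
    views = multiview_generator(image)
    return reconstruction_model(views)

asset = image_to_3d(np.zeros((64, 64, 3)))
```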

Stage I: Multi-View Score Distillation

The first stage focuses on generative modeling of multi-view images using Variational Score Distillation (VSD). GECO trains a multi-view generator, initialized from a pre-trained multi-view diffusion model (Zero123Plus), to compensate for potential inconsistency in its outputs. This step employs a teacher-student paradigm in which the single-step generator learns to mimic the behavior of the pre-trained multi-step teacher. Because the consistency and quality of the multi-view images directly determine the final 3D output, this stage is the critical intermediate step where underlying inconsistencies must be addressed.
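The core of a VSD-style update can be sketched as follows. This is a simplified illustration under stated assumptions, not GECO's implementation: `teacher_score` stands in for the frozen pre-trained diffusion network and `lora_score` for the learned low-rank score of the current generator's distribution; the update direction for the generator's output is the gap between the two.

```python
# Sketch of one VSD-style gradient computation (hypothetical names).
import numpy as np

def vsd_gradient(x, teacher_score, lora_score, sigma=0.5, seed=0):
    """Score-distillation direction for a one-step generator's output x.

    x             : generated (multi-view) images, any array shape
    teacher_score : frozen pre-trained denoising score, x_t -> score
    lora_score    : learned score of the generator's own distribution
    sigma         : noise level at which the scores are compared
    """
    rng = np.random.default_rng(seed)
    x_t = x + sigma * rng.standard_normal(x.shape)   # noise the sample
    # The generator is pushed along the teacher/student score gap.
    return teacher_score(x_t) - lora_score(x_t)
```

In practice both scores come from diffusion U-Nets and the direction is applied through a surrogate loss in an autograd framework; the numpy form above only exposes the arithmetic of the update.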

Stage II: 3D Consistent Distillation

The second stage refines the 3D representation constructed from the generated multi-view images to ensure 3D consistency. By using a pre-trained reconstruction model such as LGM (Large Multi-view Gaussian Model), this phase tackles the artifacts and inconsistencies carried over from the multi-view image generation. Pseudo ground-truth images, generated through multi-step diffusion, form the basis for training. This stage relies on both RGB and LPIPS losses to improve reconstruction accuracy, yielding more robust and clean 3D representations.
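The second-stage training objective described above can be summarized as a weighted sum of a pixel-space RGB loss and a perceptual (LPIPS-style) loss against the multi-step pseudo ground truth. The sketch below is an assumption-laden simplification: `perceptual_dist` stands in for a real LPIPS network, and the weighting is illustrative.

```python
# Sketch of the Stage II distillation objective (illustrative only).
import numpy as np

def stage2_loss(rendered, pseudo_gt, perceptual_dist, w_lpips=0.5):
    """RGB (L2) loss plus a perceptual term against pseudo ground truth.

    rendered        : views rendered from the reconstructed 3D asset
    pseudo_gt       : pseudo ground-truth views from multi-step diffusion
    perceptual_dist : stand-in for an LPIPS-style distance function
    w_lpips         : relative weight of the perceptual term (assumed)
    """
    rgb = float(np.mean((rendered - pseudo_gt) ** 2))
    return rgb + w_lpips * perceptual_dist(rendered, pseudo_gt)
```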

Experimental Results and Evaluation

The proposed framework is evaluated on the Google Scanned Objects (GSO) dataset. Quantitative comparisons against existing methods, including LRM and TriplaneGaussian, demonstrate GECO's superior performance in metrics such as PSNR, SSIM, and LPIPS. GECO achieves high-quality 3D generation while maintaining ultra-low computation times.
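Of the reported metrics, PSNR is the simplest to state exactly; as a reference point, here is a standard implementation for images scaled to [0, max_val] (SSIM and LPIPS require windowed statistics and a learned network, respectively, and are usually taken from a library such as scikit-image or the lpips package).

```python
# Standard PSNR between a predicted and a ground-truth image.
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = float(np.mean((pred - gt) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```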

Highlights:

  • Quality and Efficiency: GECO exceeds previous models in both rendering quality from novel viewpoints and operational efficiency, achieving 3D Gaussian synthesis within 0.34 seconds.
  • Diversity: GECO showcases the ability to generate diverse 3D outputs from the same input image, addressing the challenge of uncertainty.
  • Integration with Text-to-Image Models: The capability of combining text-to-image models like SD-XL with GECO extends its utility to text-driven 3D generation.

Implications and Future Directions

Practically, GECO's rapid 3D generation enables real-time applications in digital content creation, such as gaming and virtual reality. Theoretically, it expands the boundary of efficient 3D generative modeling by harmonizing multi-view image synthesis with consistent 3D reconstruction.

Future developments in AI could enhance GECO's methodology by introducing end-to-end 3D generative models that eliminate intermediate representations, further simplifying the pipeline. Additionally, integrating advanced diffusion models could potentially eradicate any remaining multi-view inconsistencies, leading to even cleaner 3D outputs.

The practical impact of GECO on lowering the barrier for non-experts to create high-quality 3D models cannot be overstated, opening new avenues for innovation and creativity in digital media and beyond.

In summary, GECO presents a significant advancement in efficient and high-quality 3D generative modeling, addressing core challenges with novel distillation strategies and ensuring consistent, high-fidelity outputs at speeds suitable for real-time applications.
