ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

Published 25 May 2023 in cs.LG and cs.CV | (2305.16213v2)

Abstract: Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., $7.5$). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high rendering resolution (i.e., $512\times512$) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic. Project page and codes: https://ml.cs.tsinghua.edu.cn/prolificdreamer/

Summary

The paper introduces ProlificDreamer, which improves text-to-3D generation by modeling 3D parameters as random variables via Variational Score Distillation.
The method is a particle-based variational framework: a set of 3D parameters (particles) is updated by a gradient rule derived from a Wasserstein gradient flow and simulated as an ODE, so that the particles approximate the distribution of scenes matching the prompt.
Experiments indicate that ProlificDreamer mitigates the over-saturation, over-smoothing, and low-diversity problems of SDS, yielding higher fidelity, richer textures, and greater diversity, with potential applications in gaming, animation, and VR.

Introduction

The paper introduces ProlificDreamer, a framework for high-fidelity and diverse text-to-3D generation built on Variational Score Distillation (VSD). Score Distillation Sampling (SDS), the prevailing approach, suffers from over-saturation, over-smoothing, and low diversity. ProlificDreamer addresses these problems by treating the 3D parameters as a random variable rather than a constant and optimizing them within a particle-based variational framework, which improves both the diversity and the quality of the generated scenes.

Figure 1: Text-to-3D samples generated by ProlificDreamer.

Variational Score Distillation (VSD)

VSD is the core of ProlificDreamer. Instead of optimizing a single 3D scene, it optimizes a distribution over 3D scenes for the given prompt. The distribution is represented through particle-based variational inference: a set of 3D parameters (particles) is maintained and updated jointly. The update rule is derived from a Wasserstein gradient flow and amounts to simulating an ODE over the particles. According to the paper's analysis, this flow moves the particles toward the target distribution, which is what yields realistic and diverse 3D scenes. SDS corresponds to the special case in which the distribution collapses to a single point.
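For reference, the relationship between the two objectives can be written out as below. This is a sketch in the notation commonly used in the score-distillation literature, and the symbols are assumptions not defined in this summary: $g(\theta, c)$ is the differentiable renderer at camera pose $c$, $x_t = \alpha_t\, g(\theta, c) + \sigma_t\, \epsilon$ is the noised rendering, $y$ is the text prompt, $\epsilon_{\mathrm{pretrain}}$ is the frozen text-to-image noise predictor, $\epsilon_\phi$ is a noise predictor fitted to renderings of the current particles, and $\omega(t)$ is a time-dependent weight.

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t,\epsilon,c}\left[\omega(t)\,\bigl(\epsilon_{\mathrm{pretrain}}(x_t, t, y) - \epsilon\bigr)\,\frac{\partial g(\theta, c)}{\partial \theta}\right]
$$

$$
\nabla_\theta \mathcal{L}_{\mathrm{VSD}}(\theta) = \mathbb{E}_{t,\epsilon,c}\left[\omega(t)\,\bigl(\epsilon_{\mathrm{pretrain}}(x_t, t, y) - \epsilon_\phi(x_t, t, c, y)\bigr)\,\frac{\partial g(\theta, c)}{\partial \theta}\right]
$$

The two expressions differ only in the subtracted term: SDS subtracts the sampled noise $\epsilon$, while VSD subtracts the estimated noise of the particle distribution. If that distribution is a single point, its noised version is a Gaussian centered at $\alpha_t\, g(\theta, c)$, whose noise prediction at $x_t$ is exactly $\epsilon$, so the VSD gradient reduces to the SDS gradient.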

Figure 2: Overview of VSD, demonstrating the rendering and score computation process.
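The particle update described in this section can be illustrated with a toy 1D sketch. It is not the authors' implementation: the "renderer" is the identity, the "pretrained model" is a hypothetical two-mode Gaussian mixture with a closed-form score, and the particle noise predictor is computed in closed form from the particles themselves rather than learned by a network. The noise schedule, the weight `sigma**2`, and all function names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 1D target standing in for the pretrained diffusion model.
MODES = np.array([-2.0, 2.0])
MODE_STD = 0.5


def schedule(t):
    """Toy noise schedule (assumption): sigma_t = t, alpha_t = sqrt(1 - t^2)."""
    return np.sqrt(1.0 - t * t), t


def mixture_score(x, centers, var):
    """Score of an equal-weight Gaussian mixture with shared variance, at each x[i]."""
    diff = centers[None, :] - x[:, None]              # shape (n, k)
    logits = -diff ** 2 / (2.0 * var)
    logits -= logits.max(axis=1, keepdims=True)       # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return (w * diff).sum(axis=1) / var


def eps_pretrained(x_t, t):
    """Noise prediction of the toy target: eps = -sigma_t * score of the noised target."""
    alpha, sigma = schedule(t)
    var = (alpha * MODE_STD) ** 2 + sigma ** 2
    return -sigma * mixture_score(x_t, alpha * MODES, var)


def eps_particles(x_t, theta, t):
    """Noise prediction of the noised particle distribution (closed form, not learned)."""
    alpha, sigma = schedule(t)
    return -sigma * mixture_score(x_t, alpha * theta, sigma ** 2)


def vsd_step(theta, t, eps, lr):
    """One VSD-style update of all particles; dx_t/dtheta = alpha_t for an identity renderer."""
    alpha, sigma = schedule(t)
    x_t = alpha * theta + sigma * eps
    weight = sigma ** 2                               # assumed omega(t)
    grad = weight * (eps_pretrained(x_t, t) - eps_particles(x_t, theta, t)) * alpha
    return theta - lr * grad


def run(n_particles=64, steps=3000, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, size=n_particles)
    for _ in range(steps):
        t = rng.uniform(0.02, 0.98)
        eps = rng.normal(size=n_particles)
        theta = vsd_step(theta, t, eps, lr)
    return theta


if __name__ == "__main__":
    particles = run()
    print("particles left / right of 0:", (particles < 0).sum(), (particles > 0).sum())
```

With a single particle, `eps_particles` returns the sampled noise itself, so `vsd_step` becomes an SDS-style update; with many particles, the extra term pushes them apart, which is the mechanism behind the diversity claim in this toy setting.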

Technical Advancements

Compared to SDS, VSD optimizes a variational distribution, which naturally accommodates multiple plausible 3D scenes for a single prompt. SDS yields poor samples with both small and large classifier-free guidance (CFG) weights, whereas VSD works across a range of CFG weights, much like ancestral sampling from the diffusion model, and produces realistic renderings at a common CFG weight of $7.5$. The paper also introduces design improvements that are orthogonal to the distillation algorithm, namely a high rendering resolution of $512\times512$, an annealed distillation time schedule, and density initialization, which contribute to the fidelity and complexity of the generated 3D content.
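Two of the ingredients mentioned here, the CFG weight and the annealed time schedule, are simple enough to sketch. The snippet below uses one common CFG convention, in which a scale of 1 recovers the conditional prediction; other texts parameterize the weight differently. The schedule bounds and the switch step are illustrative assumptions, not values stated in this summary, and both function names are hypothetical.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def sample_timestep(step, rng, switch_step=5000, t_min=0.02,
                    t_max_early=0.98, t_max_late=0.50):
    """Annealed time schedule (assumed form): wide range of t early, smaller t later.

    Large t early in training shapes coarse structure; restricting to small t
    afterwards focuses the updates on fine detail.
    """
    t_max = t_max_early if step < switch_step else t_max_late
    return rng.uniform(t_min, t_max)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_u, eps_c = np.zeros(4), np.ones(4)
    print(cfg_combine(eps_u, eps_c, 7.5))
    print(sample_timestep(100, rng), sample_timestep(8000, rng))
```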

Figure 3: Samples demonstrating the superior realism and details generated by VSD compared to SDS.

Evaluation and Results

Experiments indicate that ProlificDreamer produces 3D outputs with higher fidelity and greater diversity than existing SDS-based methods. They also show that VSD behaves well across different settings, such as the CFG weight, and that the full pipeline can generate complex scenes with rich structure and texture.

Figure 4: Ablation study showing improvements in NeRF generation with proposed enhancements.

Conclusion

ProlificDreamer significantly enhances text-to-3D generation through the application of VSD, addressing previous limitations in SDS approaches. Its ability to produce high-fidelity, diverse, and complex 3D scenes opens new avenues for applications across various domains such as gaming, animation, and virtual reality. Future work could focus on accelerating the generation process and further refining the integration of scene understanding and camera positioning for even more detailed scene renderings.