Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Published 16 May 2024 in cs.CV (arXiv:2405.09874v1)

Abstract: We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from texts in only 1 minute. The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to circumvent the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose the dual-mode toggling inference strategy to use only 1/10 of the denoising steps with the 3D mode, successfully generating a 3D asset in just 10 seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at https://dual3d.github.io

References (68)
  1. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  2. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. arXiv preprint arXiv:2311.17984, 2023.
  3. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  5799–5809, 2021.
  4. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16123–16133, 2022.
  5. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pp.  333–350. Springer, 2022.
  6. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13142–13153, 2023.
  7. Gram: Generative radiance manifolds for 3d-aware image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10673–10683, 2022.
  8. Generative adversarial networks. COMMUNICATIONS OF THE ACM, 63(11), 2020.
  9. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  10. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
  11. Streetsurf: Extending multi-view implicit surface reconstruction to street views. arXiv preprint arXiv:2306.04988, 2023.
  12. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.
  13. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0RDcd5Axok.
  14. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  15. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.
  16. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  17. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  867–876, 2022.
  18. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  19. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.
  20. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  4401–4410, 2019.
  21. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8110–8119, 2020.
  22. Alias-free generative adversarial networks. In Proc. NeurIPS, 2021.
  23. Noise-free score distillation. arXiv preprint arXiv:2310.17590, 2023.
  24. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023.
  25. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  26. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 2014.
  27. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6), 2020.
  28. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  300–309, 2023.
  29. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. arXiv preprint arXiv:2312.16256, 2023.
  30. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
  31. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  9298–9309, 2023a.
  32. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  61–68, Dublin, Ireland, May 2022. URL https://aclanthology.org/2022.acl-short.8.
  33. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023b.
  34. Unidream: Unifying diffusion priors for relightable text-to-3d generation, 2023c.
  35. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  36. Scalable 3d captioning with pretrained models. arXiv preprint arXiv:2306.07279, 2023.
  37. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  38. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  39. Stylesdf: High-resolution 3d-consistent image and geometry generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13503–13513, 2022.
  40. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  41. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4195–4205, 2023.
  42. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  43. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  44. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  10901–10911, 2021.
  45. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  46. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp.  234–241. Springer, 2015.
  47. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  48. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  49. Graf: Generative radiance fields for 3d-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166, 2020.
  50. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  51. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  20875–20886, 2023.
  52. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021.
  53. Denoising diffusion implicit models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
  54. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023a.
  55. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder, 2023b.
  56. Textmesh: Generation of realistic 3d meshes from text prompts. arXiv preprint arXiv:2304.12439, 2023.
  57. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  58. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12619–12629, 2023a.
  59. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In 35th Conference on Neural Information Processing Systems, pp.  27171–27183. Curran Associates, Inc., 2021.
  60. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023b.
  61. Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  3202–3211, 2024.
  62. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pp.  3–19, 2018.
  63. Gram-hd: 3d-consistent image generation at high resolution with generative radiance manifolds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  2195–2205, 2023.
  64. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv preprint arXiv:2311.09217, 2023.
  65. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  66. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  9150–9161, 2023.
  67. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  586–595, 2018.
  68. Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. arXiv preprint arXiv:2308.14078, 2023.

Summary

  • The paper introduces Dual3D, a framework that employs dual-mode multi-view latent diffusion to efficiently generate 3D assets from textual descriptions.
  • It alternates between a 2D mode for fast denoising and a 3D mode for multi-view consistency, improving both inference time and asset realism.
  • Experimental evaluations show improved CLIP similarity and aesthetic quality, supporting practical use in gaming, VR, and robotics.

"Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion" (2405.09874)

Introduction and Motivation

The paper introduces Dual3D, a framework designed to efficiently generate high-quality 3D assets from textual descriptions. The significance of this work lies in its potential applications across domains such as gaming, virtual reality, and robotics, where timely and coherent 3D model generation is crucial. Existing methods either suffer from inefficiency or lack 3D consistency; Dual3D addresses both problems by leveraging a dual-mode multi-view latent diffusion model, which allows for both quick inference and high-quality output.

Figure 1: The framework of Dual3D. First, we fine-tune a pre-trained 2D LDM into a dual-mode multi-view LDM. Subsequently, we employ a dual-mode toggling inference strategy that switches between denoising modes to balance inference speed and 3D consistency. Finally, the mesh extracted from the neural surface is further optimized via our efficient texture refinement process, enhancing the photo-realism and details of the asset.
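
The refinement stage can be pictured as direct optimization of a texture map through a differentiable renderer. The sketch below is our own illustration, not the paper's implementation: the L2 objective, the Adam settings, and the inputs `render_fn`, `target_views`, and `cameras` are all assumptions.

```python
import torch
import torch.nn.functional as F

def refine_texture(texture, render_fn, target_views, cameras,
                   steps=200, lr=1e-2):
    """Fit a texture map to target renderings via differentiable rendering.

    texture:      (H, W, 3) tensor, initial texture of the extracted mesh.
    render_fn:    differentiable renderer, render_fn(texture, camera) -> image.
    target_views: (V, ...) tensor of images the refined texture should match.
    """
    texture = texture.clone().requires_grad_(True)
    opt = torch.optim.Adam([texture], lr=lr)  # illustrative hyperparameters
    for _ in range(steps):
        pred = torch.stack([render_fn(texture, cam) for cam in cameras])
        loss = F.mse_loss(pred, target_views)  # assumed L2 objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return texture.detach()
```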

Methodology

Dual-mode Multi-view Latent Diffusion Model

The core of Dual3D is its dual-mode multi-view latent diffusion model, which extends a pre-trained 2D LDM into a dual-mode format. The 2D mode aims for efficient denoising, while the 3D mode ensures consistency across views by generating a tri-plane neural surface. The model is trained on multi-view images derived from simple 3D scenes and handles novel views through a NeuS-based rendering technique for superior geometric quality.
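
To make the two modes concrete, here is a minimal sketch of one denoising step in each mode, assuming PyTorch-style callables; `denoiser`, `triplane_decoder`, `render_views`, and the `return_triplane` flag are illustrative placeholders, not the authors' released API.

```python
import torch

def denoise_step(denoiser, noisy_latents, text_emb, t, mode,
                 triplane_decoder=None, render_views=None, cameras=None):
    """One denoising step in either 2D or 3D mode.

    noisy_latents: (V, C, H, W) noisy latents for V views.
    """
    if mode == "2d":
        # 2D mode: a single forward pass of the latent denoising network
        # predicts the denoised multi-view latents directly -- fast, but the
        # views are only loosely coupled to one another.
        return denoiser(noisy_latents, text_emb, t)
    else:
        # 3D mode: the same network also emits tri-plane latents; rendering
        # the resulting neural surface from the V camera poses yields
        # latents that are 3D-consistent by construction.
        _, triplane_latents = denoiser(noisy_latents, text_emb, t,
                                       return_triplane=True)
        surface = triplane_decoder(triplane_latents)
        return render_views(surface, cameras)  # (V, C, H, W), consistent
```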

Figure 2: Two compositional 3D scenes rendered by Blender, where all visible assets are generated by our method with only texts as inputs. The text prompts for some assets are indicated by arrows. Please refer to our project page for the tour videos.

Dual-mode Toggling Inference Strategy

A significant innovation is the dual-mode toggling inference strategy, which optimizes the trade-off between speed and quality. By alternating between 2D and 3D denoising steps, Dual3D reduces computational cost without compromising 3D consistency, generating a 3D asset in a fraction of the time required by previous methods.
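
A hedged sketch of the toggling schedule, reusing the `denoise_step` sketch above: most steps run the cheap 2D mode, while roughly one in ten (plus the final step, so the output ends as a neural surface) runs the 3D mode. The periodic placement and the DDIM-style `scheduler.step` interface are our assumptions.

```python
def toggling_inference(denoiser, scheduler, latents, text_emb,
                       num_steps=50, three_d_every=10, **three_d_kwargs):
    """Denoise multi-view latents, toggling between 2D and 3D modes."""
    for i, t in enumerate(scheduler.timesteps[:num_steps]):
        # 3D mode on every k-th step and on the last one; 2D mode otherwise.
        use_3d = (i % three_d_every == 0) or (i == num_steps - 1)
        pred = denoise_step(denoiser, latents, text_emb, t,
                            mode="3d" if use_3d else "2d",
                            **(three_d_kwargs if use_3d else {}))
        latents = scheduler.step(pred, t, latents)  # DDIM-style update
    # The final asset comes from the tri-plane surface produced by the
    # last 3D-mode step, not from the 2D latents themselves.
    return latents
```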

Figure 3: The architecture of the dual-mode multi-view LDM. The noisy multi-view latents and three learnable tri-plane latents are fed in parallel into the 2D latent denoising network $Z_\theta$, in which all self-attention blocks are replaced by cross-view self-attention blocks. A tiny transformer enhances the connections between the multi-view features and the tri-plane features. The denoised tri-plane latents are decoded to a higher resolution with the 2D latent decoder $D$ and rendered to images via volume rendering of the tri-plane neural surface.
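
The cross-view self-attention the caption describes can be illustrated minimally: tokens from all views are merged into one sequence before attention, so each view's tokens can attend to every other view's. This is our sketch of the idea, not the paper's exact module, which reuses the pre-trained 2D LDM's attention weights.

```python
import torch
import torch.nn as nn

class CrossViewSelfAttention(nn.Module):
    """Joint self-attention over the tokens of all V views at once."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, N, C) -- B scenes, V views, N tokens per view.
        B, V, N, C = x.shape
        tokens = x.reshape(B, V * N, C)   # merge views into one sequence
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, V, N, C)    # restore the per-view layout
```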

Experimental Evaluation

The authors conducted extensive experiments demonstrating the state-of-the-art performance of Dual3D. The framework achieves competitive results across several metrics, including CLIP similarity and aesthetic quality, outperforming existing methods in both speed and output quality. These metrics indicate that Dual3D's assets are more consistent and realistic, affirming its practical applicability.
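
As a point of reference, CLIP similarity can be computed along these lines: embed each rendered view and the text prompt with a pre-trained CLIP model and average the cosine similarities. The checkpoint choice and the Hugging Face `transformers` API below are our assumptions; the paper's exact evaluation setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_similarity(image_paths, prompt,
                    model_name="openai/clip-vit-base-patch32"):
    """Average CLIP cosine similarity between rendered views and a prompt."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[prompt], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings, then average cosine similarity over the views.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```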

Figure 4: Qualitative comparison.

Discussion and Implications

Dual3D's approach of utilizing a dual-mode diffusion strategy to balance efficiency and quality represents a meaningful advancement in text-to-3D generation. The rapid inference times align well with industry requirements for real-time applications, while the framework's architecture allows for easy integration into existing pipelines. Future work could explore extending this model to handle more complex scenes or integrating additional modalities such as physics-based interactions.

Conclusion

Dual3D sets a new benchmark for efficient, high-quality text-to-3D asset generation. Its distinctive dual-mode strategy and toggling inference not only push the boundaries of current methodologies but also open avenues for broader applications. Future research can build on this foundation to explore further optimizations and expanded use cases, continuing to improve the synthesis of virtual content from textual descriptions.
