Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Published 16 May 2024 in cs.CV (arXiv:2405.09874v1)

Abstract: We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from texts in only 1 minute. The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to circumvent the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose the dual-mode toggling inference strategy to use only 1/10 of the denoising steps with the 3D mode, successfully generating a 3D asset in just 10 seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at https://dual3d.github.io

References (68)
  1. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  2. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. arXiv preprint arXiv:2311.17984, 2023.
  3. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  5799–5809, 2021.
  4. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16123–16133, 2022.
  5. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pp.  333–350. Springer, 2022.
  6. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13142–13153, 2023.
  7. Gram: Generative radiance manifolds for 3d-aware image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10673–10683, 2022.
  8. Generative adversarial networks. COMMUNICATIONS OF THE ACM, 63(11), 2020.
  9. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  10. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
  11. Streetsurf: Extending multi-view implicit surface reconstruction to street views. arXiv preprint arXiv:2306.04988, 2023.
  12. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.
  13. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0RDcd5Axok.
  14. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  15. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.
  16. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  17. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  867–876, 2022.
  18. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  19. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.
  20. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  4401–4410, 2019.
  21. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8110–8119, 2020.
  22. Alias-free generative adversarial networks. In Proc. NeurIPS, 2021.
  23. Noise-free score distillation. arXiv preprint arXiv:2310.17590, 2023.
  24. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023.
  25. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  26. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 2014.
  27. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6), 2020.
  28. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  300–309, 2023.
  29. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. arXiv preprint arXiv:2312.16256, 2023.
  30. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
  31. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  9298–9309, 2023a.
  32. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  61–68, Dublin, Ireland, May 2022. URL https://aclanthology.org/2022.acl-short.8.
  33. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023b.
  34. Unidream: Unifying diffusion priors for relightable text-to-3d generation, 2023c.
  35. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  36. Scalable 3d captioning with pretrained models. arXiv preprint arXiv:2306.07279, 2023.
  37. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  38. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  39. Stylesdf: High-resolution 3d-consistent image and geometry generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13503–13513, 2022.
  40. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  41. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4195–4205, 2023.
  42. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  43. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  44. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  10901–10911, 2021.
  45. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  46. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp.  234–241. Springer, 2015.
  47. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  48. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  49. Graf: Generative radiance fields for 3d-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166, 2020.
  50. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  51. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  20875–20886, 2023.
  52. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021.
  53. Denoising diffusion implicit models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
  54. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023a.
  55. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder, 2023b.
  56. Textmesh: Generation of realistic 3d meshes from text prompts. arXiv preprint arXiv:2304.12439, 2023.
  57. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  58. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12619–12629, 2023a.
  59. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In 35th Conference on Neural Information Processing Systems, pp.  27171–27183. Curran Associates, Inc., 2021.
  60. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023b.
  61. Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  3202–3211, 2024.
  62. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pp.  3–19, 2018.
  63. Gram-hd: 3d-consistent image generation at high resolution with generative radiance manifolds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  2195–2205, 2023.
  64. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv preprint arXiv:2311.09217, 2023.
  65. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  66. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  9150–9161, 2023.
  67. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  586–595, 2018.
  68. Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. arXiv preprint arXiv:2308.14078, 2023.

Summary

  • The paper introduces Dual3D, a framework that employs dual-mode multi-view latent diffusion to efficiently generate 3D assets from textual descriptions.
  • It alternates between a 2D mode for fast denoising and a 3D mode for multi-view consistency, improving both inference time and asset realism.
  • Experimental evaluations show improved CLIP similarity and aesthetic quality, supporting practical use in gaming, VR, and robotics.

"Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion" (2405.09874)

Introduction and Motivation

The paper introduces Dual3D, a framework designed to efficiently generate high-quality 3D assets from textual descriptions. The significance of this work lies in its potential applications across domains such as gaming, virtual reality, and robotics, where timely and coherent 3D model generation is crucial. Existing methods either suffer from inefficiency or lack 3D consistency; Dual3D addresses both problems by leveraging a dual-mode multi-view latent diffusion model, which allows for both quick inference and high-quality output.

Figure 1: The framework of Dual3D. First, we fine-tune a pre-trained 2D LDM into a dual-mode multi-view LDM. Subsequently, we employ a dual-mode toggling inference strategy that switches between denoising modes to balance inference speed and 3D consistency. Finally, the mesh extracted from the neural surface is further optimized via our efficient texture refinement process, enhancing the photo-realism and details of the asset.
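
The refinement stage can be pictured as direct optimization of a texture map through a differentiable renderer. The sketch below is our own illustration, not the paper's implementation: the L2 objective, the Adam settings, and the inputs `render_fn`, `target_views`, and `cameras` are all assumptions.

```python
import torch
import torch.nn.functional as F

def refine_texture(texture, render_fn, target_views, cameras,
                   steps=200, lr=1e-2):
    """Fit a texture map to target renderings via differentiable rendering.

    texture:      (H, W, 3) tensor, initial texture of the extracted mesh.
    render_fn:    differentiable renderer, render_fn(texture, camera) -> image.
    target_views: (V, ...) tensor of images the refined texture should match.
    """
    texture = texture.clone().requires_grad_(True)
    opt = torch.optim.Adam([texture], lr=lr)  # illustrative hyperparameters
    for _ in range(steps):
        pred = torch.stack([render_fn(texture, cam) for cam in cameras])
        loss = F.mse_loss(pred, target_views)  # assumed L2 objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return texture.detach()
```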

Methodology

Dual-mode Multi-view Latent Diffusion Model

The core of Dual3D is its dual-mode multi-view latent diffusion model, which extends a pre-trained 2D LDM into a dual-mode format. The 2D mode aims for efficient denoising, while the 3D mode ensures consistency across views by generating a tri-plane neural surface. The model is trained on multi-view images derived from simple 3D scenes and handles novel views through a NeuS-based rendering technique for superior geometric quality.
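
To make the two modes concrete, here is a minimal sketch of one denoising step in each mode, assuming PyTorch-style callables; `denoiser`, `triplane_decoder`, `render_views`, and the `return_triplane` flag are illustrative placeholders, not the authors' released API.

```python
import torch

def denoise_step(denoiser, noisy_latents, text_emb, t, mode,
                 triplane_decoder=None, render_views=None, cameras=None):
    """One denoising step in either 2D or 3D mode.

    noisy_latents: (V, C, H, W) noisy latents for V views.
    """
    if mode == "2d":
        # 2D mode: a single forward pass of the latent denoising network
        # predicts the denoised multi-view latents directly -- fast, but the
        # views are only loosely coupled to one another.
        return denoiser(noisy_latents, text_emb, t)
    else:
        # 3D mode: the same network also emits tri-plane latents; rendering
        # the resulting neural surface from the V camera poses yields
        # latents that are 3D-consistent by construction.
        _, triplane_latents = denoiser(noisy_latents, text_emb, t,
                                       return_triplane=True)
        surface = triplane_decoder(triplane_latents)
        return render_views(surface, cameras)  # (V, C, H, W), consistent
```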

Figure 2: Two compositional 3D scenes rendered by Blender, where all visible assets are generated by our method with only texts as inputs. The text prompts for some assets are indicated by arrows. Please refer to our project page for the tour videos.

Dual-mode Toggling Inference Strategy

A significant innovation is the dual-mode toggling inference strategy, which optimizes the trade-off between speed and quality. By alternating between 2D and 3D denoising steps, Dual3D reduces computational cost without compromising 3D consistency, generating a 3D asset in a fraction of the time required by previous methods.
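
A hedged sketch of the toggling schedule, reusing the `denoise_step` sketch above: most steps run the cheap 2D mode, while roughly one in ten (plus the final step, so the output ends as a neural surface) runs the 3D mode. The periodic placement and the DDIM-style `scheduler.step` interface are our assumptions.

```python
def toggling_inference(denoiser, scheduler, latents, text_emb,
                       num_steps=50, three_d_every=10, **three_d_kwargs):
    """Denoise multi-view latents, toggling between 2D and 3D modes."""
    for i, t in enumerate(scheduler.timesteps[:num_steps]):
        # 3D mode on every k-th step and on the last one; 2D mode otherwise.
        use_3d = (i % three_d_every == 0) or (i == num_steps - 1)
        pred = denoise_step(denoiser, latents, text_emb, t,
                            mode="3d" if use_3d else "2d",
                            **(three_d_kwargs if use_3d else {}))
        latents = scheduler.step(pred, t, latents)  # DDIM-style update
    # The final asset comes from the tri-plane surface produced by the
    # last 3D-mode step, not from the 2D latents themselves.
    return latents
```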

Figure 3: The architecture of the dual-mode multi-view LDM. The noisy multi-view latents and three learnable tri-plane latents are fed in parallel into the 2D latent denoising network $Z_\theta$, in which all self-attention blocks are replaced by cross-view self-attention blocks. A tiny transformer enhances the connections between the multi-view features and the tri-plane features. The denoised tri-plane latents are decoded to a higher resolution with the 2D latent decoder $D$ and rendered to images via volume rendering of the tri-plane neural surface.
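
The cross-view self-attention the caption describes can be illustrated minimally: tokens from all views are merged into one sequence before attention, so each view's tokens can attend to every other view's. This is our sketch of the idea, not the paper's exact module, which reuses the pre-trained 2D LDM's attention weights.

```python
import torch
import torch.nn as nn

class CrossViewSelfAttention(nn.Module):
    """Joint self-attention over the tokens of all V views at once."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, N, C) -- B scenes, V views, N tokens per view.
        B, V, N, C = x.shape
        tokens = x.reshape(B, V * N, C)   # merge views into one sequence
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, V, N, C)    # restore the per-view layout
```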

Experimental Evaluation

The authors conducted extensive experiments demonstrating the state-of-the-art performance of Dual3D. The framework achieves competitive results across several metrics, including CLIP similarity and aesthetic quality, outperforming existing methods in both speed and output quality. These metrics indicate that Dual3D's assets are more consistent and realistic, affirming its practical applicability.
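
As a point of reference, CLIP similarity can be computed along these lines: embed each rendered view and the text prompt with a pre-trained CLIP model and average the cosine similarities. The checkpoint choice and the Hugging Face `transformers` API below are our assumptions; the paper's exact evaluation setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_similarity(image_paths, prompt,
                    model_name="openai/clip-vit-base-patch32"):
    """Average CLIP cosine similarity between rendered views and a prompt."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[prompt], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings, then average cosine similarity over the views.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```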

Figure 4: Qualitative comparison.

Discussion and Implications

Dual3D's approach of utilizing a dual-mode diffusion strategy to balance efficiency and quality represents a meaningful advancement in text-to-3D generation. The rapid inference times align well with industry requirements for real-time applications, while the framework's architecture allows for easy integration into existing pipelines. Future work could explore extending this model to handle more complex scenes or integrating additional modalities such as physics-based interactions.

Conclusion

Dual3D sets a new benchmark for efficient, high-quality text-to-3D asset generation. Its distinctive dual-mode strategy and toggling inference not only push the boundaries of current methodologies but also open avenues for broader applications. Future research can build on this foundation to explore further optimizations and expanded use cases, continuing to improve the synthesis of virtual content from textual descriptions.
