SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

Published 28 Nov 2023 in cs.CV (arXiv:2311.17261v1)

Abstract: We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distill diffusion latent features without accurate geometric and style cues, SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core, SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in respective RGB renderings. To further secure the style consistency across views, we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables diverse and accurate texture synthesis for 3D-FRONT scenes, demonstrating significant improvements in visual quality and prompt fidelity over prior texture generation methods.


Summary

  • The paper introduces a novel approach leveraging depth-to-image diffusion priors to optimize high-quality, style-consistent texture synthesis for indoor 3D scenes.
  • It employs a Multiresolution Texture Field and Cross-attention Texture Decoder to integrate multi-scale texture details and maintain global style coherence.
  • The optimization via Variational Score Distillation enhances texture realism and geometric accuracy, paving the way for applications in VR, AR, and digital content creation.

High-Quality Texture Synthesis for Indoor Scenes with SceneTex

SceneTex presents an innovative approach to synthesizing high-quality, style-consistent textures for indoor 3D scenes using depth-to-image diffusion priors. This method reimagines texture synthesis as an optimization problem within the RGB space, emphasizing consistency in style and geometry across views.

Multiresolution Texture Field

At the heart of SceneTex's methodology is the Multiresolution Texture Field, which takes a hierarchical approach to texture encoding. Unlike previous models that rely on low-resolution latent spaces, SceneTex uses a multiresolution feature grid to capture a richer, more detailed representation of textures across scales (Figure 1).

Figure 1: Multiresolution feature grid for UV space encoding.

This grid allows SceneTex to integrate both low- and high-frequency texture details into the final rendering. By interpolating features at each resolution, the system ensures that every query coordinate is represented by a comprehensive UV embedding, yielding superior texture fidelity.
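The idea can be illustrated with a minimal sketch: one dense feature grid per resolution level, bilinearly interpolated at a UV coordinate, with the per-level features concatenated into a single embedding. This is a toy stand-in, not the paper's implementation (which builds on hash-grid-style multiresolution encodings); the class and function names here are illustrative.

```python
import numpy as np

def bilinear_lookup(grid, uv):
    """Bilinearly interpolate an (H, W, C) feature grid at a UV point in [0, 1]^2."""
    h, w, _ = grid.shape
    x, y = uv[0] * (w - 1), uv[1] * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * grid[y0, x0] + tx * grid[y0, x1]
    bot = (1 - tx) * grid[y1, x0] + tx * grid[y1, x1]
    return (1 - ty) * top + ty * bot

class MultiresTextureField:
    """Toy multiresolution UV feature field: one dense grid per level,
    queried by interpolating each level and concatenating the results."""
    def __init__(self, resolutions=(16, 64, 256), feat_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Coarse levels capture low-frequency color; fine levels capture detail.
        self.grids = [rng.normal(size=(r, r, feat_dim)).astype(np.float32)
                      for r in resolutions]

    def query(self, uv):
        # Per-level interpolated features, concatenated into one UV embedding.
        return np.concatenate([bilinear_lookup(g, uv) for g in self.grids])

field = MultiresTextureField()
emb = field.query(np.array([0.3, 0.7]))   # embedding of length levels * feat_dim
```

In practice these grids are optimized end-to-end, and a decoder maps the concatenated embedding to RGB; the sketch only shows the multi-scale lookup itself.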

Cross-attention Texture Decoder

SceneTex addresses the style inconsistencies that arise from occlusions and limited viewpoints with its Cross-attention Texture Decoder. This component aligns texture features across instances by applying a multi-head cross-attention mechanism (Figure 2).

Figure 2: Cross-attention to produce instance-aware UV embeddings.

By treating the UV embeddings of the rendering as queries and the pre-sampled reference textures as keys and values, this module maintains global style coherence. An MLP then maps the attended embeddings to RGB values, ensuring consistent textural quality across the scene.
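A single-head version of this query/key/value arrangement can be sketched as follows, with the rendered texels attending to reference embeddings of the same instance. This is a schematic, assuming plain scaled dot-product attention; the function names and the single-head simplification are illustrative, not the paper's multi-head decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_emb, ref_emb):
    """Single-head cross-attention: UV embeddings of the currently rendered
    texels (queries) attend to pre-sampled reference embeddings of the same
    instance (keys/values), producing instance-aware embeddings.

    query_emb: (N, D) embeddings of visible texels
    ref_emb:   (M, D) pre-sampled reference embeddings
    returns:   (N, D) attended embeddings
    """
    d_k = query_emb.shape[-1]
    scores = query_emb @ ref_emb.T / np.sqrt(d_k)   # (N, M) similarities
    weights = softmax(scores, axis=-1)              # attention over references
    return weights @ ref_emb                        # convex mix of references
```

Because each output is a weighted mix of the instance's reference features, texels seen from different views draw on the same shared pool, which is what enforces the cross-view style coherence described above.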

Optimization via Variational Score Distillation (VSD)

To optimize the texture field, SceneTex employs a pre-trained latent diffusion model within a Variational Score Distillation (VSD) framework. This method uses a score-distillation-based objective to enhance the realism and geometric accuracy of the synthesized textures (Figure 3).

Figure 3: Texture synthesis pipeline showcasing the integration of diffusion priors.

Unlike traditional models that suffer from poor visual quality due to resolution discrepancies, SceneTex directly produces high-resolution textures by querying the texture field with the updated diffusion priors, ensuring a seamless integration of texture detail and fidelity.
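The VSD update can be sketched schematically: noise the current rendering, query both the frozen pretrained prior and a fine-tuned model that tracks the current rendering distribution, and use their weighted prediction difference as the gradient signal. The score-predictor callables, the toy cosine noise schedule, and the weighting function are all stand-ins for illustration, not SceneTex's actual diffusion backbone.

```python
import numpy as np

def vsd_gradient(rendering, pretrained_score, finetuned_score, t, weight_fn, seed=0):
    """Schematic VSD gradient on an RGB rendering x.

    The signal is the weighted difference between the frozen pretrained
    model's noise prediction and the prediction of a fine-tuned model that
    estimates the score of the current rendering distribution.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=rendering.shape)
    alpha = np.cos(t * np.pi / 2)               # toy noise schedule
    sigma = np.sin(t * np.pi / 2)
    noised = alpha * rendering + sigma * noise  # forward-diffused rendering
    eps_pre = pretrained_score(noised, t)       # frozen diffusion prior
    eps_ft = finetuned_score(noised, t)         # tracks current renderings
    return weight_fn(t) * (eps_pre - eps_ft)    # backpropagated to the field

# Usage with stand-in predictors:
x = np.zeros((4, 4, 3))
g = vsd_gradient(x,
                 pretrained_score=lambda z, t: 0.1 * z,
                 finetuned_score=lambda z, t: np.zeros_like(z),
                 t=0.5,
                 weight_fn=lambda t: 1.0)
```

In the full pipeline this gradient is backpropagated through the differentiable renderer into the texture field, which is why the optimization happens directly in RGB space at rendering resolution rather than in a low-resolution latent space.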

Practical Implications and Future Directions

The SceneTex framework extends its applicability across various sectors such as virtual reality (VR), augmented reality (AR), and digital content creation. It not only strengthens the visual appeal of 3D scenes but also streamlines the workflow by minimizing manual effort in texture design.

Looking forward, the refinement of diffusion priors to eliminate unwanted shading artifacts remains a promising avenue. Additionally, broadening the dataset to include more diverse scene styles could further enhance the practical versatility of SceneTex in generating truly immersive environments.

Conclusion

SceneTex's methodology in leveraging depth-to-image diffusion priors marks a significant advancement in the domain of text-driven 3D texture synthesis. Through its novel integration of multiresolution texture fields and cross-attention decoding, SceneTex achieves unprecedented quality and style consistency in indoor scene textures. Its success suggests a rich potential for future exploration in automated texture design and application in various digital media spaces.
