Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering

Published 18 Dec 2023 in cs.CV, cs.AI, and cs.GR (arXiv:2312.11360v2)

Abstract: We present Paint-it, a text-driven high-fidelity texture map synthesis method for 3D meshes via neural re-parameterized texture optimization. Paint-it synthesizes texture maps from a text description by synthesis-through-optimization, exploiting the Score-Distillation Sampling (SDS). We observe that directly applying SDS yields undesirable texture quality due to its noisy gradients. We reveal the importance of texture parameterization when using SDS. Specifically, we propose Deep Convolutional Physically-Based Rendering (DC-PBR) parameterization, which re-parameterizes the physically-based rendering (PBR) texture maps with randomly initialized convolution-based neural kernels, instead of a standard pixel-based parameterization. We show that DC-PBR inherently schedules the optimization curriculum according to texture frequency and naturally filters out the noisy signals from SDS. In experiments, Paint-it obtains remarkable quality PBR texture maps within 15 min., given only a text description. We demonstrate the generalizability and practicality of Paint-it by synthesizing high-quality texture maps for large-scale mesh datasets and showing test-time applications such as relighting and material control using a popular graphics engine. Project page: https://kim-youwang.github.io/paint-it

Summary

  • The paper introduces a novel method that converts text descriptions into high-fidelity 3D textures using deep convolutional optimization and physically-based rendering.
  • It employs Score-Distillation Sampling with U-Net kernels to progressively refine texture details from low to high frequency.
  • Experimental results demonstrate enhanced texture coherence and lower FID scores compared to traditional synthesis methods.

"Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering"

Introduction to Paint-it

The paper "Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering" describes a novel method for synthesizing high-fidelity 3D textures through text-driven guidance. The core contribution of this work lies in leveraging Deep Convolutional Physically-Based Rendering (DC-PBR) for parameterizing texture maps, thereby enhancing the quality and realism of the synthesized textures.

Method Overview

Paint-it operates by transforming text descriptions into physically-based rendering (PBR) texture maps for 3D meshes. The process begins with an untextured 3D mesh and a textual description of the desired appearance. The system employs a deep convolutional model to re-parameterize the PBR texture maps, optimizing them using a Score-Distillation Sampling (SDS) process. The use of U-Net convolutional kernels enhances the optimization by prioritizing low-frequency textures initially and gradually adapting to high-frequency details (see Figure 1).

Figure 1: Paint-it's pipeline illustrating DC-PBR as an intermediary layer that enhances texture realism through convolutional re-parameterization.
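To make the re-parameterization idea concrete, the following is a minimal PyTorch-style sketch: the PBR maps are produced by a small, randomly initialized convolutional generator from a fixed noise input, and only the generator's weights are optimized. The tiny network, channel layout, and guidance-loss placeholder are illustrative assumptions, not the paper's actual U-Net architecture or rendering pipeline.

```python
# Minimal sketch of convolutional texture re-parameterization (illustrative only).
# Instead of optimizing texture pixels directly, PBR maps are the output of a
# randomly initialized conv net; only the network weights receive gradients.
import torch
import torch.nn as nn

class TinyConvGenerator(nn.Module):
    """Stand-in for the paper's U-Net: maps fixed noise to PBR texture maps."""
    def __init__(self, out_channels=8):  # 3 albedo + 3 normal + 1 roughness + 1 metalness
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(16, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

def dummy_guidance_loss(albedo, normal, rough_metal):
    # Placeholder for differentiable rendering + SDS guidance from a diffusion model.
    return (albedo.mean() - 0.5) ** 2 + normal.var() + rough_metal.mean()

generator = TinyConvGenerator()
z = torch.randn(1, 16, 512, 512)                      # fixed input, never optimized
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(100):                               # synthesis-through-optimization loop
    maps = generator(z)                               # UV-space PBR maps
    albedo, normal, rough_metal = maps[:, :3], maps[:, 3:6], maps[:, 6:]
    loss = dummy_guidance_loss(albedo, normal, rough_metal)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The property DC-PBR exploits is that such convolutional generators fit low-frequency content before high-frequency content, which acts as an implicit optimization curriculum and filters out noisy SDS gradients.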

Score-Distillation Sampling

Score-Distillation Sampling is pivotal for ensuring that the synthesized texture, as seen through rendered images, aligns with the user-provided text description. By adding noise to rendered images and querying a pre-trained text-conditional noise estimator, SDS refines the 3D representation to best match the textual prompt. This supports the generation of textures with intricate material properties such as reflectance and surface normals.
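A common way to implement SDS, sketched below, is to build a surrogate loss whose gradient with respect to the rendered image equals the SDS gradient estimate. The generic frozen denoiser `eps_model(noisy, t, text_emb)`, the schedule tensor `alphas_cumprod`, and the weighting `w(t) = 1 - alpha_bar_t` are assumptions for illustration, not Paint-it's exact implementation.

```python
# Illustrative SDS surrogate loss (generic denoiser assumed; not Paint-it's exact code).
import torch

def sds_loss(rendered, text_emb, eps_model, alphas_cumprod):
    """rendered: (B, C, H, W) differentiable render of the textured mesh.
    Returns a scalar whose gradient w.r.t. `rendered` is the SDS gradient estimate."""
    B = rendered.shape[0]
    t = torch.randint(20, 980, (B,), device=rendered.device)      # random diffusion step
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a_bar.sqrt() * rendered + (1 - a_bar).sqrt() * noise  # forward diffusion
    with torch.no_grad():
        eps_pred = eps_model(noisy, t, text_emb)                  # frozen, text-conditioned
    w = 1.0 - a_bar                                               # one common weighting choice
    grad = w * (eps_pred - noise)                                 # SDS gradient estimate
    # "Detach trick": backprop of this loss sends exactly `grad` into `rendered`.
    return (grad.detach() * rendered).sum() / B
```

In Paint-it, this gradient flows through the differentiable renderer into the convolutional kernels of DC-PBR rather than into raw texture pixels.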

Practical Applications

The practical utilities of Paint-it are manifold, extending to industries such as gaming and cinematic production where realistic 3D assets are imperative. Paint-it's ability to synthesize diverse texture maps offers significant flexibility in applications requiring dynamic relighting and material property adjustments, integrating into existing graphics engines like Blender (see Figure 2).

Figure 2: Practical applications of Paint-it in managing dynamic lighting and material properties using PBR texture maps.

Comparative Advantages

Paint-it excels in producing vivid, consistent, and realistic textures compared to contemporaneous methodologies, specifically those relying on color-projected 3D textures. Techniques such as pixel-based optimization or mesh-based re-texturing often fall short in quality or require substantial post-processing. Paint-it circumvents these issues by employing a global gradient update mechanism, improving coherence across surfaces and reducing artifacts like texture seams (see Figure 3).

Figure 3: Comparison of PBR map decomposition, illustrating Paint-it's cleaner separation of material properties versus Fantasia3D.

Experimental Insights

Experiments conducted on standard datasets like Objaverse demonstrated Paint-it's superiority in generating realistic textures with lower Fréchet Inception Distance (FID) scores and higher user-study ratings than alternative methods. The fidelity in texture synthesis is further reinforced by ablation studies emphasizing the significance of convolutional re-parameterization.
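For reference, FID between renderings of synthesized textures and a set of reference images can be computed with an off-the-shelf metric implementation. The sketch below assumes the torchmetrics package and uses random placeholder images purely to show the call pattern; it is not the paper's evaluation protocol.

```python
# FID between two image sets using torchmetrics (placeholder data, illustrative only).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)          # InceptionV3 pooled features
real_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fid.update(real_images, real=True)                    # e.g. reference renderings
fid.update(fake_images, real=False)                   # e.g. renders of synthesized textures
print(float(fid.compute()))                           # lower is better
```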

Conclusion

Paint-it represents a substantive advance in text-to-texture synthesis, merging deep learning techniques with physically-based rendering to facilitate the creation of high-quality, text-driven 3D textures. Although the method provides a robust solution for texture synthesis, its optimization latency points to potential areas for future research, such as more efficient loss functions or pre-trained models to expedite the synthesis process. The approach lays a foundation for future developments in AI-assisted graphics design, pushing toward automated generation of sophisticated digital assets.
