SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

Published 28 Nov 2023 in cs.CV (arXiv:2311.17261v1)

Abstract: We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distill diffusion latent features without accurate geometric and style cues, SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core, SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in respective RGB renderings. To further secure the style consistency across views, we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables diverse and accurate texture synthesis for 3D-FRONT scenes, demonstrating significant improvements in visual quality and prompt fidelity over prior texture generation methods.


Summary

  • The paper introduces a novel approach leveraging depth-to-image diffusion priors to optimize high-quality, style-consistent texture synthesis for indoor 3D scenes.
  • It employs a Multiresolution Texture Field and Cross-attention Texture Decoder to integrate multi-scale texture details and maintain global style coherence.
  • The optimization via Variational Score Distillation enhances texture realism and geometric accuracy, paving the way for applications in VR, AR, and digital content creation.

High-Quality Texture Synthesis for Indoor Scenes with SceneTex

SceneTex presents an innovative approach to synthesizing high-quality, style-consistent textures for indoor 3D scenes using depth-to-image diffusion priors. This method reimagines texture synthesis as an optimization problem within the RGB space, emphasizing consistency in style and geometry across views.

Multiresolution Texture Field

At the heart of SceneTex's methodology is the Multiresolution Texture Field, which takes a hierarchical approach to texture encoding. Unlike previous models that rely on low-resolution latent spaces, SceneTex uses a multiresolution feature grid to capture a richer, more detailed representation of textures across scales (Figure 1).

Figure 1: Multiresolution feature grid for UV space encoding.

This grid allows SceneTex to integrate both low- and high-frequency texture details into the final rendering. By interpolating features at each resolution, the system ensures that every query coordinate is represented by a comprehensive UV embedding, yielding superior texture fidelity.
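The idea can be illustrated with a minimal sketch: one dense feature grid per resolution level, bilinearly interpolated at a UV coordinate, with the per-level features concatenated into a single embedding. This is a toy stand-in, not the paper's implementation (which builds on hash-grid-style multiresolution encodings); the class and function names here are illustrative.

```python
import numpy as np

def bilinear_lookup(grid, uv):
    """Bilinearly interpolate an (H, W, C) feature grid at a UV point in [0, 1]^2."""
    h, w, _ = grid.shape
    x, y = uv[0] * (w - 1), uv[1] * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * grid[y0, x0] + tx * grid[y0, x1]
    bot = (1 - tx) * grid[y1, x0] + tx * grid[y1, x1]
    return (1 - ty) * top + ty * bot

class MultiresTextureField:
    """Toy multiresolution UV feature field: one dense grid per level,
    queried by interpolating each level and concatenating the results."""
    def __init__(self, resolutions=(16, 64, 256), feat_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Coarse levels capture low-frequency color; fine levels capture detail.
        self.grids = [rng.normal(size=(r, r, feat_dim)).astype(np.float32)
                      for r in resolutions]

    def query(self, uv):
        # Per-level interpolated features, concatenated into one UV embedding.
        return np.concatenate([bilinear_lookup(g, uv) for g in self.grids])

field = MultiresTextureField()
emb = field.query(np.array([0.3, 0.7]))   # embedding of length levels * feat_dim
```

In practice these grids are optimized end-to-end, and a decoder maps the concatenated embedding to RGB; the sketch only shows the multi-scale lookup itself.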

Cross-attention Texture Decoder

SceneTex addresses the style inconsistencies that arise from occlusions and limited viewpoints with its Cross-attention Texture Decoder. This component aligns texture features across instances by applying a multi-head cross-attention mechanism (Figure 2).

Figure 2: Cross-attention to produce instance-aware UV embeddings.

By treating the UV embeddings of the rendering as queries and the pre-sampled reference textures as keys and values, this module maintains global style coherence. An MLP then maps the attended embeddings to RGB values, ensuring consistent textural quality across the scene.
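A single-head version of this query/key/value arrangement can be sketched as follows, with the rendered texels attending to reference embeddings of the same instance. This is a schematic, assuming plain scaled dot-product attention; the function names and the single-head simplification are illustrative, not the paper's multi-head decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_emb, ref_emb):
    """Single-head cross-attention: UV embeddings of the currently rendered
    texels (queries) attend to pre-sampled reference embeddings of the same
    instance (keys/values), producing instance-aware embeddings.

    query_emb: (N, D) embeddings of visible texels
    ref_emb:   (M, D) pre-sampled reference embeddings
    returns:   (N, D) attended embeddings
    """
    d_k = query_emb.shape[-1]
    scores = query_emb @ ref_emb.T / np.sqrt(d_k)   # (N, M) similarities
    weights = softmax(scores, axis=-1)              # attention over references
    return weights @ ref_emb                        # convex mix of references
```

Because each output is a weighted mix of the instance's reference features, texels seen from different views draw on the same shared pool, which is what enforces the cross-view style coherence described above.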

Optimization via Variational Score Distillation (VSD)

To optimize the texture field, SceneTex employs a pre-trained latent diffusion model within a Variational Score Distillation (VSD) framework. This method uses a score-distillation-based objective to enhance the realism and geometric accuracy of the synthesized textures (Figure 3).

Figure 3: Texture synthesis pipeline showcasing the integration of diffusion priors.

Unlike traditional models that suffer from poor visual quality due to resolution discrepancies, SceneTex directly produces high-resolution textures by querying the texture field with the updated diffusion priors, ensuring a seamless integration of texture detail and fidelity.
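The VSD update can be sketched schematically: noise the current rendering, query both the frozen pretrained prior and a fine-tuned model that tracks the current rendering distribution, and use their weighted prediction difference as the gradient signal. The score-predictor callables, the toy cosine noise schedule, and the weighting function are all stand-ins for illustration, not SceneTex's actual diffusion backbone.

```python
import numpy as np

def vsd_gradient(rendering, pretrained_score, finetuned_score, t, weight_fn, seed=0):
    """Schematic VSD gradient on an RGB rendering x.

    The signal is the weighted difference between the frozen pretrained
    model's noise prediction and the prediction of a fine-tuned model that
    estimates the score of the current rendering distribution.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=rendering.shape)
    alpha = np.cos(t * np.pi / 2)               # toy noise schedule
    sigma = np.sin(t * np.pi / 2)
    noised = alpha * rendering + sigma * noise  # forward-diffused rendering
    eps_pre = pretrained_score(noised, t)       # frozen diffusion prior
    eps_ft = finetuned_score(noised, t)         # tracks current renderings
    return weight_fn(t) * (eps_pre - eps_ft)    # backpropagated to the field

# Usage with stand-in predictors:
x = np.zeros((4, 4, 3))
g = vsd_gradient(x,
                 pretrained_score=lambda z, t: 0.1 * z,
                 finetuned_score=lambda z, t: np.zeros_like(z),
                 t=0.5,
                 weight_fn=lambda t: 1.0)
```

In the full pipeline this gradient is backpropagated through the differentiable renderer into the texture field, which is why the optimization happens directly in RGB space at rendering resolution rather than in a low-resolution latent space.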

Practical Implications and Future Directions

The SceneTex framework extends its applicability across various sectors such as virtual reality (VR), augmented reality (AR), and digital content creation. It not only strengthens the visual appeal of 3D scenes but also streamlines the workflow by minimizing manual effort in texture design.

Looking forward, the refinement of diffusion priors to eliminate unwanted shading artifacts remains a promising avenue. Additionally, broadening the dataset to include more diverse scene styles could further enhance the practical versatility of SceneTex in generating truly immersive environments.

Conclusion

SceneTex's methodology in leveraging depth-to-image diffusion priors marks a significant advancement in the domain of text-driven 3D texture synthesis. Through its novel integration of multiresolution texture fields and cross-attention decoding, SceneTex achieves unprecedented quality and style consistency in indoor scene textures. Its success suggests a rich potential for future exploration in automated texture design and application in various digital media spaces.
