- The paper introduces PrimX, a compact tensor format that encodes shape, texture, and material for efficient 3D asset representation.
- It employs a Diffusion Transformer to model latent primitive tokens, enabling scalable and high-resolution 3D asset generation.
- Experimental results demonstrate superior geometric fidelity and speed compared to traditional 3D generative models.
Overview of 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
The paper, titled "3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion", introduces a scalable framework for efficient and high-quality 3D asset generation using a novel primitive-based diffusion model. The work addresses the limitations of conventional 3D generative models in speed and geometric fidelity, as well as the quality gap in producing physically based rendering (PBR) assets.
At the core of this framework is a new representation method called PrimX, which encodes shape, albedo, and material fields of 3D assets in a compact tensorial format. This representation facilitates the modeling of high-resolution 3D geometries accompanied by detailed textures and materials suitable for PBR. The generative framework, 3DTopia-XL, leverages a Diffusion Transformer (DiT) to model and generate these detailed 3D assets.
Key Contributions
- Novel 3D Representation:
- The introduction of PrimX allows a textured mesh to be represented as a compact tensor that encodes shape, color, and material in a unified format. It differs from past representations by integrating SDF, RGB, and material payloads within a set of small voxels distributed in space.
- PrimX supports efficient differentiable rendering, facilitating learning from both 3D datasets and image collections.
- Efficient Algorithm for Representation:
- An efficient initialization and fine-tuning algorithm allows PrimX to be rapidly tensorized from textured mesh files (GLB format), fitting significantly faster than prior triplane representations under comparable settings.
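As an illustration of what such an initialization might look like, the sketch below spreads primitive centers over points sampled from a mesh surface using farthest-point sampling. This is a hypothetical stand-in for the paper's initialization step, before fine-tuning optimizes each primitive's position, scale, and payload; the function name and point count are assumptions.

```python
import numpy as np

def init_primitive_centers(surface_pts, k=8):
    """Spread k primitive centers over sampled surface points via
    farthest-point sampling (illustrative; not the paper's exact
    initialization). Fine-tuning would then refine all parameters."""
    centers = [surface_pts[0]]
    # Track each point's distance to its nearest chosen center.
    d = np.linalg.norm(surface_pts - centers[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(d))          # farthest remaining point
        centers.append(surface_pts[idx])
        d = np.minimum(d, np.linalg.norm(surface_pts - surface_pts[idx], axis=1))
    return np.stack(centers)

pts = np.random.default_rng(0).standard_normal((500, 3))  # stand-in surface samples
centers = init_primitive_centers(pts)  # shape (8, 3)
```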
- Generative Framework Integration:
- The study presents a novel latent primitive diffusion approach within a Diffusion Transformer framework. This architecture effectively models the global correlations among primitive tokens without requiring positional encoding, thus enabling high-resolution 3D generative training.
- Qualitative and Quantitative Advancements:
- Extensive experiments, both qualitative and quantitative, demonstrate the superior capabilities of 3DTopia-XL in generating high-quality 3D assets. The model is notable for jointly producing geometry and detailed textures/materials, effectively bridging the quality gap for real-world applications.
Methodology
PrimX Representation:
- PrimX encodes a 3D mesh as a set of primitives, each parameterized by its position, a global scale factor, and a voxel grid containing the SDF, RGB, and material values. This representation not only proves to be efficient but also supports differentiable rendering.
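To make the layout concrete, the sketch below packs a set of primitives into a single tensor: each row holds a position, a scale, and a flattened voxel payload carrying SDF, RGB, and material channels. All dimensions (primitive count, voxel resolution, channel split) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def make_primx(num_prims=8, voxel_res=4, payload_ch=6, seed=0):
    """Illustrative PrimX tensor: each primitive is
    [position(3), scale(1), voxel payload(res^3 * channels)],
    where channels = 1 SDF + 3 RGB + 2 material (assumed split)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, (num_prims, 3))       # primitive centers
    scale = rng.uniform(0.05, 0.2, (num_prims, 1))     # per-primitive scale
    payload = rng.standard_normal((num_prims, voxel_res**3 * payload_ch))
    return np.concatenate([pos, scale, payload], axis=1)

x = make_primx()
# Each primitive token has dimension 3 + 1 + 4**3 * 6 = 388.
```

The tensorial form is what makes the representation amenable to a Transformer: each primitive becomes one token.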
Primitive Patch Compression:
- A 3D Variational Autoencoder (VAE) compresses primitives into latent tokens. This VAE employs 3D convolutional layers to encode voxelized patches into a more compact latent space, thereby supporting the efficient training of the generative model.
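The sketch below illustrates only the compression aspect: a toy encoder that downsamples each spatial axis of a voxel patch by mean pooling. This is a stand-in for the learned 3D convolutional layers; a real VAE encoder would also predict a latent mean and variance for sampling.

```python
import numpy as np

def encode_patch(voxel_patch, factor=2):
    """Toy stand-in for the 3D-conv VAE encoder: downsample each
    spatial axis by `factor` via mean pooling. Shapes only; no
    learned weights or variational sampling."""
    c, d, h, w = voxel_patch.shape
    v = voxel_patch.reshape(c, d // factor, factor,
                            h // factor, factor,
                            w // factor, factor)
    return v.mean(axis=(2, 4, 6))

# A (channels, depth, height, width) patch: SDF + RGB + material channels (assumed).
patch = np.random.default_rng(0).standard_normal((6, 8, 8, 8))
latent = encode_patch(patch)
# latent.shape == (6, 4, 4, 4): 8x fewer spatial entries per primitive.
```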
Latent Primitive Diffusion:
- The core of generative modeling in 3DTopia-XL utilizes a Transformer-based architecture to perform diffusion on the set of latent primitives. The diffusion model learns to denoise input noise through a series of steps, leveraging classifier-free guidance (CFG) and adaptive layer normalization.
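The denoising loop with classifier-free guidance can be sketched as follows. The `model` here is a dummy linear map standing in for the DiT, and the schedule is a simple Euler-style update; both are assumptions made to keep the example self-contained, not the paper's sampler.

```python
import numpy as np

def cfg_denoise(shape, cond, steps=50, w=3.0, seed=0):
    """Minimal diffusion sampling loop with classifier-free guidance.
    In the real system, `model` is a Diffusion Transformer over latent
    primitive tokens; here it is a dummy noise predictor."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # start from pure noise

    def model(x, c):                          # dummy noise predictor
        return 0.1 * x + (0.05 * c if c is not None else 0.0)

    for _ in range(steps):
        eps_cond = model(x, cond)             # conditional prediction
        eps_uncond = model(x, None)           # unconditional prediction
        # CFG: push the estimate toward the conditional direction.
        eps = eps_uncond + w * (eps_cond - eps_uncond)
        x = x - eps / steps                   # simple Euler-style step
    return x

tokens = cfg_denoise((16, 64), cond=1.0)      # 16 latent primitive tokens
```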
Results
Representation Evaluation:
- PrimX achieves superior fidelity in terms of geometry and texture compared to other representations such as MLP, triplane, and dense voxels, demonstrating both higher accuracy and faster runtime for fitting.
Image-to-3D Task:
- Comparisons with state-of-the-art reconstruction and diffusion models showcase 3DTopia-XL's ability to produce high-quality 3D assets that are visually consistent with input images while also providing detailed PBR materials.
Text-to-3D Task:
- Evaluations based on the CLIP Score indicate that 3DTopia-XL generates text-conditioned 3D assets with better image-text alignment compared to other generative models.
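For reference, the CLIP Score reduces to a cosine similarity between image and text embeddings. The sketch below uses dummy vectors; an actual evaluation would embed renders of the generated asset and the text prompt with a pretrained CLIP model.

```python
import numpy as np

def clip_score(img_emb, txt_emb):
    """Cosine similarity between an image embedding and a text
    embedding (dummy vectors here; real evaluation uses CLIP)."""
    a = img_emb / np.linalg.norm(img_emb)
    b = txt_emb / np.linalg.norm(txt_emb)
    return float(a @ b)

score = clip_score(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
# ≈ 0.707 for these toy vectors
```

Higher scores indicate better image-text alignment; scores are typically averaged over many prompt-render pairs.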
Implications and Future Directions
Practical Implications:
- The scalability and efficiency of 3DTopia-XL make it suitable for applications in gaming, film, and virtual reality, where high-quality 3D assets are crucial. The model's ability to generate detailed PBR assets ensures its practical use in scenarios requiring realistic rendering.
Theoretical Implications:
- The introduction of PrimX and its integration with a Diffusion Transformer framework provide new avenues for research in 3D generative modeling. The representational efficiency and tensorial nature of PrimX offer a robust platform for further exploration into scalable 3D generative models.
Future Directions:
- Future research could explore using 3DTopia-XL for dynamic object generation and editing. Given PrimX's support for differentiable rendering, extending the learning process to mixed 2D and 3D data could address the scarcity of high-quality 3D datasets. Additionally, scaling behavior could be studied further by increasing the number of primitives and enhancing the VAE's compression mechanisms.
In summary, 3DTopia-XL represents a significant step forward in automated, high-quality 3D asset generation, combining an innovative representation with a scalable generative model.