- The paper presents MeTTA, which combines feed-forward 3D mesh prediction with test-time adaptation using a generative multi-view prior to enhance reconstruction quality from single-view images.
- It introduces a learnable virtual camera that self-calibrates during test-time optimization to refine 2D-to-3D alignment and improve geometric fidelity.
- Experimental results demonstrate robust performance on out-of-distribution data, outperforming previous methods on metrics such as Chamfer Distance, F-Score, LPIPS, PSNR, and CLIP score.
Overview of "MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation"
The paper "MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation" addresses the enduring challenge of reconstructing 3D geometry and texture from single-view images, particularly in the presence of out-of-distribution (OoD) data. The proposed method, MeTTA, combines feed-forward mesh prediction with test-time adaptation, using a generative multi-view prior to refine and enhance the 3D reconstructions.
Contributions and Methodology
The authors present a multifaceted approach comprising:
- Initial Mesh and Viewpoint Prediction:
- An Image-to-3D module produces coarse mesh and viewpoint estimates, leveraging existing learning-based feed-forward models trained on datasets such as Pix3D and SUN RGB-D.
- Learnable Virtual Camera:
- To bridge the gap between the 2D image space and the 3D object space, the concept of a learnable virtual camera is introduced. This virtual camera self-calibrates during test-time optimization to refine viewpoint estimations, improving the alignment between the 2D input image and the 3D mesh.
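The idea of a self-calibrating camera can be illustrated with a toy sketch (not the paper's implementation): the camera pose is treated as a learnable parameter and adjusted by gradient descent until renders of the 3D shape line up with the 2D observation. Here a single rotation angle stands in for the full pose, with orthographic projection and a numerical gradient; the function names are invented for this example.

```python
import numpy as np

def rotate_y(points, theta):
    """Rotate 3D points (N, 3) about the y-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return points @ R.T

def project(points):
    """Orthographic projection onto the x-y image plane."""
    return points[:, :2]

def reprojection_loss(theta, points_3d, target_2d):
    """Mean squared 2D error between projected points and observations."""
    return np.mean((project(rotate_y(points_3d, theta)) - target_2d) ** 2)

def calibrate_camera(points_3d, target_2d, theta_init=0.0,
                     lr=0.5, steps=200, eps=1e-4):
    """Self-calibrate the single camera parameter by gradient descent,
    using a central-difference numerical gradient for clarity."""
    theta = theta_init
    for _ in range(steps):
        grad = (reprojection_loss(theta + eps, points_3d, target_2d)
                - reprojection_loss(theta - eps, points_3d, target_2d)) / (2 * eps)
        theta -= lr * grad
    return theta
```

In MeTTA the analogous update optimizes full viewpoint parameters jointly with mesh and texture through a differentiable renderer; the sketch above only shows the "camera as a learnable variable" pattern.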
- Test-Time Adaptation (TTA):
- Utilizing a pre-trained multi-view diffusion model, MeTTA implements a Score-Distillation Sampling (SDS) loss to iteratively update the mesh, texture, and camera parameters. This approach enables the handling of OoD test cases while enhancing geometric fidelity and texture realism.
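The SDS update direction can be sketched in a few lines. This is a deliberately simplified toy, not the paper's pipeline: `toy_noise_predictor` is an invented stand-in for the pre-trained multi-view diffusion model (here it simply "believes" the clean image is a fixed target), the render Jacobian is identity, and the timestep weighting is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.full((4, 4), 0.5)  # hypothetical image the toy prior prefers

def toy_noise_predictor(noisy, t):
    """Stand-in for the diffusion model's epsilon prediction: it assumes the
    clean image is TARGET and inverts noisy = sqrt(1-t)*x0 + sqrt(t)*eps."""
    return (noisy - np.sqrt(1.0 - t) * TARGET) / np.sqrt(t)

def sds_step(render, lr=0.05):
    """One Score Distillation Sampling update on the rendered image:
    noise the render at a random timestep, query the prior's noise
    prediction, and step along (eps_pred - eps)."""
    t = rng.uniform(0.02, 0.98)
    eps = rng.standard_normal(render.shape)
    noisy = np.sqrt(1.0 - t) * render + np.sqrt(t) * eps
    grad = toy_noise_predictor(noisy, t) - eps  # SDS gradient direction
    return render - lr * grad
```

Iterating `sds_step` pulls the render toward what the prior considers plausible; in MeTTA the same gradient is backpropagated through the renderer into mesh, texture, and camera parameters rather than applied to pixels directly.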
- Physically-Based Rendering (PBR) Textures:
- The texture component is parameterized using PBR principles, allowing the method to generate realistic textures compatible with standard graphics tools. This differentiates MeTTA from prior holistic 3D scene reconstruction research, which typically does not model such detailed texture properties.
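To make the PBR parameterization concrete, here is a minimal Cook-Torrance-style shading sketch using the albedo/roughness/metallic convention that standard engines consume. It is an illustrative simplification (scalar folded geometry term, no Fresnel angle dependence), not the paper's shading model.

```python
import numpy as np

def shade_pbr(albedo, roughness, metallic, n_dot_l, n_dot_h):
    """Simplified metallic-roughness shading for a single light.
    albedo/roughness/metallic are the per-texel PBR parameters."""
    # Diffuse lobe: metals have no diffuse reflection
    diffuse = albedo * (1.0 - metallic) / np.pi
    # GGX normal distribution term (alpha = roughness^2)
    a2 = np.maximum(roughness, 1e-3) ** 4
    d = a2 / (np.pi * ((n_dot_h ** 2) * (a2 - 1.0) + 1.0) ** 2)
    # Base reflectance: ~0.04 for dielectrics, albedo-tinted for metals
    f0 = 0.04 * (1.0 - metallic) + albedo * metallic
    specular = d * f0 * 0.25  # geometry/visibility folded into a constant
    return (diffuse + specular) * np.maximum(n_dot_l, 0.0)
```

Because the texture is expressed in these standard parameters rather than baked-in colors, the reconstructed asset can be relit consistently inside any engine that implements the same convention.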
Experimental Validation
The paper offers comprehensive experimental results to validate the proposed methodology:
- Cross-Domain Robustness:
- Evaluations on the 3D-Front dataset show that MeTTA can adapt to and effectively reconstruct objects from distributions not seen during training, outperforming existing feed-forward methods in terms of Chamfer Distance and F-Score metrics.
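Both metrics are standard for point-cloud evaluation and easy to state precisely. A brute-force reference implementation (fine for small evaluation sets; the threshold `tau` is an illustrative choice, not the paper's):

```python
import numpy as np

def chamfer_and_fscore(pred, gt, tau=0.1):
    """Chamfer Distance and F-Score between point clouds pred (N, 3)
    and gt (M, 3), via the full pairwise distance matrix."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    d_pred_to_gt = d.min(axis=1)   # nearest GT point per prediction
    d_gt_to_pred = d.min(axis=0)   # nearest prediction per GT point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()  # predictions near the surface
    recall = (d_gt_to_pred < tau).mean()     # surface covered by predictions
    fscore = 2.0 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```

Chamfer Distance penalizes average deviation in both directions, while F-Score counts the fraction of points within a fixed tolerance, so the two metrics together capture both gross and fine-grained geometric accuracy.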
- Real-World Applicability:
- Demonstrations on manually acquired real-world data and in-the-wild web images highlight MeTTA’s robustness and practical utility. The results exhibit significant visual and quantitative improvements, underscoring the necessity of the test-time adaptation stage.
- Qualitative and Quantitative Comparisons:
- When compared with feed-forward and iterative generative approaches, MeTTA shows superior performance in generating fine-grained, textured 3D meshes. Metrics such as LPIPS, PSNR, and CLIP score confirm the method’s efficacy in preserving semantic consistency and texture details even in novel views.
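Of the image metrics above, PSNR is the simplest to define from first principles (LPIPS and CLIP score require learned networks). A reference implementation for images scaled to [0, max_val]:

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

PSNR measures pixel-level fidelity in known views, which is why it is paired with perceptual (LPIPS) and semantic (CLIP) scores when judging consistency in novel views.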
Implications and Future Directions
The contributions of MeTTA have several significant implications:
- Practical Applications:
- The integration of PBR textures means the reconstructed 3D models can be directly utilized in various graphics engines and extended to applications in AR/VR and virtual communication, where lifelike and interactive virtual environments are critical.
- Handling OoD Data:
- The ability to handle OoD cases expands the potential use cases of 3D reconstruction models, making them more reliable in practical, unpredictable environments.
- Enhancing the Efficiency of 3D Model Generation:
- While MeTTA is faster than other optimization-based generative methods, it still requires substantial computational resources. Future work could focus on further optimizing computational efficiency, possibly through improved integration of the feed-forward and adaptation stages.
- Generalization across Categories:
- Current dependencies restrict generalization to predefined categories. Future extensions of MeTTA could enhance generalization across a more diverse range of object categories, for example by training on larger, more diverse datasets or by employing more sophisticated segmentation and viewpoint estimation techniques.
Conclusion
MeTTA represents a significant advancement in single-view to 3D textured mesh reconstruction. By combining state-of-the-art feed-forward methods with innovative test-time adaptation using a generative prior, it successfully addresses the limitations of existing approaches, particularly with respect to handling out-of-distribution data and achieving realistic texture mapping. The implications of this work extend across multiple domains of computer vision and graphics, setting a new standard for future research in 3D reconstruction.