ShaRF: Shape-conditioned Radiance Fields from a Single View

Published 17 Feb 2021 in cs.CV and cs.GR (arXiv:2102.08860v2)

Abstract: We present a method for estimating neural scene representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape, and then renders it to an image, with the object appearance being controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views in a geometrically consistent manner, and they faithfully represent the input object. Additionally, our method is able to generalize to images outside of the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. We demonstrate in several experiments the effectiveness of our approach on both synthetic and real images.

Citations (104)

Summary

  • The paper introduces a generative model that reconstructs 3D geometry and appearance from one image using separate shape and appearance networks.
  • It combines explicit voxel scaffolding with implicit neural radiance fields to reduce ambiguity and enhance detail fidelity.
  • Experimental results demonstrate improved novel view synthesis with higher PSNR and SSIM on benchmarks like ShapeNet-SRN.

Overview of "ShaRF: Shape-conditioned Radiance Fields from a Single View"

"ShaRF: Shape-conditioned Radiance Fields from a Single View" presents a novel approach to rendering 3D objects from a single image through shape-conditioned neural radiance fields. The method is particularly notable for its ability to infer high-fidelity 3D representations and generate consistent novel views that generalize well across domains.

Core Contributions

  1. Generative Model Design: The paper introduces a generative model that reconstructs the geometry and appearance of an object from a single image. The model consists of two key components: a shape network and an appearance network. The former converts a latent shape code into a voxelized geometric scaffold, while the latter leverages this scaffold to estimate a radiance field conditioned on both the shape and a second latent code controlling appearance.
  2. Explicit and Implicit Representations: Explicit voxel-based geometry and implicit neural representations are combined to enhance the fidelity of 3D object synthesis. The voxel grid provides structural guidance, effectively reducing inference ambiguity, which is especially crucial when working with a single image and without multi-view constraints.
  3. Novel View Synthesis: By optimizing both latent codes and network parameters, the model can fit the given image and reliably synthesize novel views. This process enhances generalization, allowing the model to adapt to different appearance domains, such as more realistic renderings or real photographs that were not part of the training data.
  4. Evaluation and Performance: The paper reports superior performance compared to previous models on the ShapeNet-SRN benchmark for novel view synthesis, particularly excelling in terms of PSNR and SSIM metrics. It also demonstrates the capability to perform 3D reconstruction on a par with competitive methods.
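The generative process described above (shape code → voxel scaffold → scaffold-gated radiance field → volume-rendered pixel) can be sketched as follows. This is a toy illustration, not the paper's architecture: the "networks" are stand-in functions, and all dimensions, names, and constants are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def shape_network(z_shape, grid_res=8):
    """Stand-in for the shape network: maps a latent shape code to a
    voxel occupancy grid. Here a fixed random linear layer plus sigmoid
    replaces the learned 3D decoder used in the paper."""
    W = rng.standard_normal((grid_res ** 3, z_shape.size)) * 0.1
    logits = W @ z_shape
    occupancy = 1.0 / (1.0 + np.exp(-logits))  # values in (0, 1)
    return occupancy.reshape(grid_res, grid_res, grid_res)

def appearance_network(x, occ, z_app):
    """Stand-in for the appearance network: density at point x is gated
    by the voxel scaffold (nearest voxel lookup), and a toy color is
    derived from the appearance code."""
    grid_res = occ.shape[0]
    idx = np.clip((x * grid_res).astype(int), 0, grid_res - 1)
    density = 10.0 * occ[tuple(idx)]           # scaffold-gated density
    color = 1.0 / (1.0 + np.exp(-z_app[:3]))   # toy view-independent RGB
    return color, density

def render_ray(origin, direction, occ, z_app, n_samples=32, t_far=1.0):
    """NeRF-style volume rendering: alpha-composite samples along a ray."""
    ts = np.linspace(0.0, t_far, n_samples)
    dt = ts[1] - ts[0]
    acc_color = np.zeros(3)
    transmittance = 1.0
    for t in ts:
        x = origin + t * direction
        color, density = appearance_network(x, occ, z_app)
        alpha = 1.0 - np.exp(-density * dt)
        acc_color += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return acc_color

# Toy forward pass: shape code -> scaffold -> one rendered pixel.
z_shape = rng.standard_normal(16)
z_app = rng.standard_normal(8)
occ = shape_network(z_shape)
pixel = render_ray(np.zeros(3), np.array([1.0, 1.0, 1.0]) / np.sqrt(3), occ, z_app)
```

Gating the density by the occupancy grid is what concentrates rendering effort near the object's probable surface; in the paper both networks are learned jointly rather than fixed as here.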

Methodological Insight

  • Shape Network: This network constructs a voxelized model of the object from a latent shape code. The resulting voxel grid acts as a geometric scaffold that constrains the spatial extent where the appearance information is rendered, thus concentrating the optimization process on the probable surface of the object.
  • Appearance Network: Operating on the voxel scaffold, this network estimates the radiance field by computing color and density values at any point within the object’s spatial domain. It effectively disentangles shape and appearance, which aids in synthesizing realistic views under diverse lighting conditions.
  • Optimization Strategy: The two-stage approach optimizes for shape first, using the appearance network for backpropagating errors, then refines appearance details. This stepwise refinement significantly reduces artifacts and enhances visual fidelity even when input images differ substantially from the training data.
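The two-stage inference described above (fit the shape code through the renderer first, then refine appearance) can be sketched with a minimal stand-in. This is a schematic only: the real system backpropagates through the volume renderer, while here a toy differentiable "renderer" and numerical gradients stand in, and the learning rate and step counts are arbitrary.

```python
import numpy as np

def toy_render(z_shape, z_app):
    """Stand-in for the differentiable render pipeline: any smooth map
    from the two latent codes to an 'image' suffices for this sketch."""
    return np.tanh(np.outer(z_shape, z_app))

def finite_diff_grad(f, z, eps=1e-4):
    """Central-difference gradient of a scalar loss w.r.t. a latent code
    (a stand-in for backpropagation through the renderer)."""
    g = np.zeros_like(z)
    for i in range(z.size):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        g[i] = (f(zp) - f(zm)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
target = toy_render(rng.standard_normal(4), rng.standard_normal(4))  # the "test image"
z_shape = rng.standard_normal(4)
z_app = rng.standard_normal(4)

loss = lambda zs, za: np.mean((toy_render(zs, za) - target) ** 2)
init_loss = loss(z_shape, z_app)

# Stage 1: optimize the shape code, with errors flowing back through the renderer.
for _ in range(300):
    z_shape -= 0.05 * finite_diff_grad(lambda z: loss(z, z_app), z_shape)

# Stage 2: refine the appearance code with the shape held fixed.
for _ in range(300):
    z_app -= 0.05 * finite_diff_grad(lambda z: loss(z_shape, z), z_app)

final_loss = loss(z_shape, z_app)
```

The alternation mirrors the paper's strategy at a high level: resolving geometry first constrains where appearance errors can hide, which is why the refinement stage reduces artifacts rather than compensating for a wrong shape.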

Implications and Future Directions

The ShaRF model provides significant advancements in neural rendering, especially in scenarios where data is limited to single images. It mitigates the dependence on extensive multi-view datasets while still offering high-fidelity reconstruction and rendering capabilities. The separation of shape and appearance latent codes could be extended to other applications in neural rendering.

Looking forward, improvements could tackle the complexity of scenes with multiple interacting objects or dynamic elements. Additionally, exploring more efficient training regimes or unsupervised shape learning could enhance the model's versatility and reduce its data dependency.

Conclusion

ShaRF pushes the boundaries of inverse rendering by adeptly combining explicit and implicit 3D representations. Its methodical disentanglement of shape and appearance provides a robust framework for high-quality scene reconstruction and novel view synthesis from minimal input data. This paper will serve as a valuable reference for future research on neural rendering and 3D reconstruction.
