Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image

Published 17 Dec 2020 in cs.CV, cs.AI, cs.GR, cs.LG, and stat.ML | arXiv:2012.09854v3

Abstract: We present Worldsheet, a method for novel view synthesis using just a single RGB image as input. The main insight is that simply shrink-wrapping a planar mesh sheet onto the input image, consistent with the learned intermediate depth, captures underlying geometry sufficient to generate photorealistic unseen views with large viewpoint changes. To operationalize this, we propose a novel differentiable texture sampler that allows our wrapped mesh sheet to be textured and rendered differentiably into an image from a target viewpoint. Our approach is category-agnostic, end-to-end trainable without using any 3D supervision, and requires a single image at test time. We also explore a simple extension by stacking multiple layers of Worldsheets to better handle occlusions. Worldsheet consistently outperforms prior state-of-the-art methods on single-image view synthesis across several datasets. Furthermore, this simple idea captures novel views surprisingly well on a wide range of high-resolution in-the-wild images, converting them into navigable 3D pop-ups. Video results and code are available at https://worldsheet.github.io.

Citations (77)

Summary

Single-Image 3D View Synthesis via Mesh Wrapping: A Detailed Overview

The paper "Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image" introduces a method for synthesizing novel views from a single input RGB image. The task requires capturing the three-dimensional structure of a scene and rendering new viewpoints photorealistically, with applications in fields such as virtual reality and robotics. The work departs from conventional methods, which typically require multiple images or depth input to achieve high-quality view synthesis.

Central Approach

The crux of the method is a mesh-wrapping technique called "Worldsheet." A planar mesh grid is laid over the input image and deformed according to a learned intermediate depth map so that it conforms to the scene geometry: each mesh vertex is displaced to capture the spatial layout of the objects in the image. Once wrapped, the mesh can be rendered from an altered virtual camera position or orientation to synthesize novel views.
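As a rough illustration of this wrapping step, the sketch below lifts a regular grid into 3D by sliding each vertex along its camera ray to a predicted depth. This is a minimal pinhole-camera sketch under assumed conventions (normalized image coordinates, a single field-of-view parameter); the paper's actual parameterization also predicts in-plane grid offsets and differs in detail.

```python
import math

def wrap_sheet(depth, fov_deg=60.0):
    """Lift a regular H x W grid of image-plane points into 3D using a
    per-vertex depth map (hypothetical pinhole model, not the paper's
    exact parameterization).

    depth: H x W nested list of depth values z > 0, with H, W >= 2.
    Returns a flat list of (x, y, z) vertex positions, row-major.
    """
    h, w = len(depth), len(depth[0])
    # Focal length for an image plane spanning [-0.5, 0.5].
    f = 0.5 / math.tan(math.radians(fov_deg) / 2)
    verts = []
    for i in range(h):
        for j in range(w):
            # Normalized image coordinates in [-0.5, 0.5].
            u = j / (w - 1) - 0.5
            v = i / (h - 1) - 0.5
            z = depth[i][j]
            # Back-project: each vertex slides along its camera ray to depth z.
            verts.append((u * z / f, v * z / f, z))
    return verts
```

Connecting neighboring vertices into triangles then yields the "shrink-wrapped" sheet that a standard mesh renderer can draw from any camera pose.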

Key components of this method include:

  • Differentiable Texture Sampling: A novel sampling technique projects texture from the original image onto the wrapped mesh and renders it from new viewpoints. Because the sampling is differentiable, gradients flow from the rendered image back through the texture to the mesh and depth predictions, which is what makes end-to-end training without 3D supervision possible while keeping the synthesized views visually coherent.

  • Layered Approach: To handle occlusions, the paper introduces an extension where multiple layers of meshes are stacked. Each layer can independently map different parts of the scene, providing a more robust model for occluded and complex geometries.
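The key property of a differentiable sampler is that texture lookups at continuous coordinates are smooth interpolations rather than hard pixel reads. The sketch below shows plain bilinear interpolation on a scalar texture; it illustrates the differentiability only and is not the paper's actual sampling operator, which works in the reverse (splatting) direction.

```python
def bilinear_sample(texture, u, v):
    """Sample a texture at continuous coordinates (u, v) using bilinear
    interpolation -- a smooth lookup whose output varies continuously
    with (u, v), so gradients can pass through it (illustrative sketch).

    texture: H x W nested list of scalar texels, H, W >= 2.
    u, v: continuous coordinates in texel units (u along width).
    """
    h, w = len(texture), len(texture[0])
    # Clamp the integer corner so all four neighbors are in bounds.
    x0 = min(max(int(u), 0), w - 2)
    y0 = min(max(int(v), 0), h - 2)
    fx, fy = u - x0, v - y0
    t00 = texture[y0][x0]
    t01 = texture[y0][x0 + 1]
    t10 = texture[y0 + 1][x0]
    t11 = texture[y0 + 1][x0 + 1]
    # Interpolate horizontally, then vertically.
    top = t00 * (1 - fx) + t01 * fx
    bot = t10 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bot * fy
```

In a full pipeline, per-color-channel lookups like this (or their splatting counterpart) sit between the input image and the rendered view, letting reconstruction losses on the rendering train the depth network.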

Empirical Validation

The paper reports strong performance across several benchmarks, notably the Matterport3D, Replica, and RealEstate10K datasets, with improvements over previous state-of-the-art view-synthesis architectures. The proposed model outperforms prior methods on metrics including PSNR and SSIM, indicating higher rendered-image quality and closer structural agreement with the ground truth.
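For reference, PSNR (one of the reported metrics) is a simple function of the mean squared error between a rendered view and the ground-truth image. The sketch below assumes intensities in [0, 1] and flat pixel lists; it is an independent illustration, not the paper's evaluation code.

```python
import math

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio between two images given as flat
    lists of pixel intensities. Higher is better; identical images
    give infinity."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / mse)
```

SSIM, the other reported metric, additionally compares local luminance, contrast, and structure statistics rather than raw per-pixel differences.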

Experiments that extend training to larger viewpoint changes demonstrate the robustness of the Worldsheet method, and evaluation on unseen datasets such as Replica shows that it generalizes without retraining. This generalization is crucial for real-world applications, where deployment conditions often differ from training conditions.

Practical and Theoretical Implications

Practically, the removal of multi-view or depth requirements at inference time simplifies the setup for applications. This reduction in necessary data inputs means a broader array of scenarios can be augmented with 3D capabilities using existing single-image datasets.

Theoretically, the study highlights the potential of mesh-based approaches in spatial understanding tasks. By proving effective without explicit 3D supervision, Worldsheet challenges traditional constraints and opens pathways for more generalized and adaptable view synthesis frameworks.

Future Directions

While the paper sets a strong foundation, it notes room for improvement in mesh resolution scalability and handling fine-grained object boundaries under extreme view changes. Exploring adaptive mesh resolutions or integrating self-supervised pre-training to better capture minute details could be promising avenues.

Moreover, this work could be extended to incorporate temporal coherence across video frames, potentially enabling seamless augmented-reality experiences.

Conclusion

Worldsheet represents a significant stride in single-image view synthesis by effectively leveraging a mesh-wrapping methodology to interpret and reconstruct the 3D scene structure. This methodological innovation addresses several historical limitations in view synthesis, setting a new standard in photorealistic rendering tasks from minimal input data. As such, it holds substantial potential for future exploration in advanced visual computing applications.
