Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion

Published 30 Jan 2025 in cs.CV and cs.LG | (2501.18804v1)

Abstract: Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry. In this paper we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning to both augment visual features with spatial information from different viewpoints, as well as to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as multi-view stereo and video depth estimation.


Summary

The paper introduces the MVGD architecture that uses diffusion-based modeling to directly generate novel images and depth maps.
It leverages scene scale normalization and learnable task embeddings to ensure consistent, scale-aware predictions across diverse datasets.
Experiments demonstrate state-of-the-art performance in PSNR, SSIM, and LPIPS metrics on benchmark datasets like RealEstate10k and ScanNet.

Multi-View Geometric Diffusion for Zero-Shot Novel View and Depth Synthesis

Overview

"Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion" presents Multi-View Geometric Diffusion (MVGD), a diffusion-based architecture that directly generates images and depth maps from novel viewpoints. The framework removes the need for the intermediate 3D representations (neural fields, voxel grids, or 3D Gaussians) that novel view synthesis methods typically rely on. Its key ingredients are raymap conditioning, learnable task embeddings, and scene scale normalization, which together yield scale-aware predictions that remain consistent across viewpoints.

Methodology

MVGD Architecture

MVGD learns a conditional distribution over novel images and depth maps, given a set of input views and their camera parameters. It builds on Recurrent Interface Networks (RIN) to keep computation manageable as the number of conditioning views grows. Generation happens directly at the pixel level, without the auto-encoders that latent diffusion models rely on.

Figure 1: Diagram of our proposed Multi-View Geometric Diffusion (MVGD) framework, at inference time.
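
The abstract describes raymap conditioning as the way spatial information from each camera is attached to visual features and to the novel view being generated. As a rough illustration of what a raymap contains, the sketch below computes a ray origin and a unit direction for every pixel of a pinhole camera. The function name `compute_raymap`, the camera-to-world pose convention, and the half-pixel offset are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def compute_raymap(K, cam_to_world, height, width):
    """Return per-pixel ray origins and unit directions, each of shape (H, W, 3).

    K is a 3x3 pinhole intrinsics matrix and cam_to_world a 4x4 pose
    (both conventions are assumptions of this sketch).
    """
    # Pixel centers in image coordinates (u along width, v along height).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3), homogeneous

    # Back-project through the inverse intrinsics to camera-frame directions.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into the world frame and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # All pixels of a pinhole camera share one origin: the camera center.
    origins = np.broadcast_to(t, dirs_world.shape).copy()
    return origins, dirs_world

# Example: a 64x48 camera at position (1, 2, 3) looking down +z.
K = np.array([[100.0, 0.0, 32.5], [0.0, 100.0, 24.5], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
origins, directions = compute_raymap(K, pose, height=48, width=64)
```

Concatenating `origins` and `directions` gives a six-channel map with the same spatial layout as the image, which is one common way to pass camera geometry to a network alongside pixel features.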

Task Embeddings and Scene Scale Normalization

The architecture employs learnable task embeddings to guide the generation towards specific modalities, allowing joint training for both image and depth synthesis. Scene Scale Normalization (SSN) is introduced to handle scale differences across diverse datasets, ensuring that generated depth maps maintain consistency with the conditioning cameras' scale.
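
One way to picture scene scale normalization is as a single scalar that rescales both the conditioning cameras and any depth values, so that every scene presents a comparable range to the network and predictions can be mapped back to the input cameras' units afterwards. The sketch below is a minimal version under stated assumptions: poses are expressed relative to the target camera, and the scale is taken to be the largest conditioning-camera distance from that origin. The exact definition of the scale used by MVGD is not given in this summary, and the names `scene_scale_normalize` and `denormalize_depth` are hypothetical.

```python
import numpy as np

def scene_scale_normalize(cond_cam_to_target, cond_depths=None, eps=1e-8):
    """Normalize conditioning camera translations (and depths) by one scene scale.

    cond_cam_to_target: (N, 4, 4) poses relative to the target camera (assumed).
    cond_depths: optional array of depth values in the same units.
    """
    poses = np.array(cond_cam_to_target, dtype=np.float64)  # copies the input
    # Assumed scale: largest camera distance from the target-centered origin.
    scale = max(np.linalg.norm(poses[:, :3, 3], axis=-1).max(), eps)
    poses[:, :3, 3] /= scale
    depths = None if cond_depths is None else np.asarray(cond_depths) / scale
    return poses, depths, scale

def denormalize_depth(pred_depth, scale):
    """Map a depth prediction in normalized units back to the input units."""
    return np.asarray(pred_depth) * scale

# Example: two conditioning cameras, 5 units and 1 unit from the target.
poses = np.stack([np.eye(4), np.eye(4)])
poses[0, :3, 3] = [3.0, 4.0, 0.0]
poses[1, :3, 3] = [0.0, 0.0, 1.0]
norm_poses, norm_depths, scale = scene_scale_normalize(poses, np.full((2, 2, 2), 10.0))
```

Because the scale in this sketch is defined relative to one target camera, changing the target changes the normalization, so each novel view is normalized separately.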

Training Strategy

The model is trained on a collection of more than 60 million multi-view samples drawn from publicly available datasets, which exposes it to a wide range of scenes. Learnable task embeddings and SSN make it possible to learn consistently from such heterogeneous data. The paper also proposes an incremental fine-tuning strategy in which larger models are obtained by fine-tuning smaller ones rather than training from scratch, and reports promising scaling behavior.
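
To make the idea of a learnable task embedding concrete, the sketch below keeps one vector per output modality and adds it to the tokens that describe the view to be generated, so the same network can be steered toward an image or a depth map. This is an illustrative assumption about the mechanism: the summary does not say how MVGD injects its task embeddings, and `TaskEmbedding`, `TASKS`, and the additive combination are hypothetical choices. The table is randomly initialized here; in a trained model it would be optimized with the rest of the parameters.

```python
import numpy as np

# Hypothetical task vocabulary: one entry per output modality.
TASKS = {"image": 0, "depth": 1}

class TaskEmbedding:
    """Lookup table holding one embedding vector per task."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialized here; a real model would train these values.
        self.table = rng.normal(scale=0.02, size=(len(TASKS), dim))

    def __call__(self, tokens, task):
        # Broadcast-add the selected task vector to every token.
        return tokens + self.table[TASKS[task]]

# Example: the same query tokens tagged for two different modalities.
embed = TaskEmbedding(dim=8)
query_tokens = np.zeros((5, 8))
image_queries = embed(query_tokens, "image")
depth_queries = embed(query_tokens, "depth")
```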

Experiments and Results

The paper reports state-of-the-art results in novel view synthesis as well as in multi-view stereo and video depth estimation. Key results show superior PSNR, SSIM, and LPIPS scores across several benchmarks, including RealEstate10k and ScanNet. MVGD also generalizes well, handling zero-shot novel view synthesis effectively.

Novel View Synthesis: MVGD significantly outperforms previous methods, even with fewer conditioning views, which is attributed to its implicit geometric reasoning.
Depth Estimation: The model's performance in stereo and video depth estimation benchmarks indicates its ability to predict depth maps accurately, even when trained with diverse and sometimes sparse datasets.
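
Of the image metrics named above, PSNR is the simplest to state: it is the log-scaled ratio between the squared peak pixel value and the mean squared error, so higher values mean a closer match to the ground-truth view. The sketch below is the standard definition, assuming images stored as floats in the range [0, 1]; the function name `psnr` and the `max_val` parameter are choices made for this example.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in decibels between two images."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Example: a prediction that is off by 0.1 at every pixel.
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
score = psnr(pred, target)
```

SSIM and LPIPS are not reproduced here, since they depend on windowed statistics and a pretrained network respectively.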

Figure 2: MVGD novel view and depth synthesis results, demonstrating its capability to generate scale-consistent predictions.

Limitations and Future Work

While MVGD performs strongly, it has limitations. It does not explicitly model dynamic scenes; temporal embeddings and motion tokens could extend it in that direction. In addition, the current version needs a separate generation pass for each novel viewpoint because of how scene scale normalization is applied, which leaves room for efficiency improvements.

Conclusion

MVGD generates images and depth maps directly, without relying on intermediate 3D representations. By combining a large-scale training collection with the training techniques described above, it achieves state-of-the-art results in novel view synthesis and depth estimation, including in zero-shot settings. Future work may focus on improving computational efficiency and extending the model to dynamic environments.
