3D Moments from Near-Duplicate Photos

Published 12 May 2022 in cs.CV | (2205.06255v1)

Abstract: We introduce 3D Moments, a new computational photography effect. As input we take a pair of near-duplicate photos, i.e., photos of moving subjects from similar viewpoints, common in people's photo collections. As output, we produce a video that smoothly interpolates the scene motion from the first photo to the second, while also producing camera motion with parallax that gives a heightened sense of 3D. To achieve this effect, we represent the scene as a pair of feature-based layered depth images augmented with scene flow. This representation enables motion interpolation along with independent control of the camera viewpoint. Our system produces photorealistic space-time videos with motion parallax and scene dynamics, while plausibly recovering regions occluded in the original views. We conduct extensive experiments demonstrating superior performance over baselines on public datasets and in-the-wild photos. Project page: https://3d-moments.github.io/


Summary

  • The paper introduces a novel 3D Moments technique that transforms near-duplicate photos into dynamic 3D space–time videos using a layered depth image pipeline.
  • The methodology leverages precise image alignment, depth estimation, and scene flow modeling to interpolate motion and generate realistic novel views.
  • Results show improved PSNR, SSIM, and LPIPS metrics, emphasizing the method’s potential for enhanced digital content creation and visual effects.

Analysis of "3D Moments from Near-Duplicate Photos"

The paper "3D Moments from Near-Duplicate Photos" introduces a computational photography effect, termed 3D Moments, that transforms near-duplicate photographs into dynamic 3D space-time videos. The work sits at the intersection of view synthesis, frame interpolation, and scene flow estimation, building on techniques from those fields to reconstruct a dynamic scene from minimal photographic input.

Methodology

The authors present a pipeline that synthesizes photorealistic space-time videos from two near-duplicate images taken from slightly different viewpoints at slightly different times. Central to their approach is the construction of feature-based Layered Depth Images (LDIs) augmented with scene flow, which jointly encode depth and motion in the scene. The feature LDIs serve as the core representation, enabling interpolation of scene motion while allowing independent control of the camera viewpoint.
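To make the representation concrete, here is a minimal sketch of what a feature LDI augmented with scene flow might look like as a data structure, and how a layer could be lifted to a 3D point cloud at an intermediate time. All class names, field shapes, and the linear advection along scene flow are illustrative assumptions, not the authors' code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FeatureLDILayer:
    """One layer of a feature-based LDI (hypothetical shapes)."""
    features: np.ndarray    # (H, W, C) learned per-pixel feature vectors
    depth: np.ndarray       # (H, W)    per-pixel depth for this layer
    alpha: np.ndarray       # (H, W)    layer opacity / validity mask
    scene_flow: np.ndarray  # (H, W, 3) per-pixel 3D motion toward the other photo

@dataclass
class FeatureLDI:
    """A scene as depth-ordered layers; one LDI per input photo."""
    layers: list  # list of FeatureLDILayer, ordered near-to-far

def lift_to_points(layer: FeatureLDILayer, K_inv: np.ndarray, t: float):
    """Unproject a layer to a 3D point cloud at interpolation time t in [0, 1],
    linearly advecting each point along its scene flow (an assumption here)."""
    H, W = layer.depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    rays = pix @ K_inv.T                             # camera-space ray directions
    pts = rays * layer.depth.reshape(-1, 1)          # 3D points at time 0
    pts = pts + t * layer.scene_flow.reshape(-1, 3)  # advect along scene flow
    valid = layer.alpha.reshape(-1) > 0.5
    feats = layer.features.reshape(-1, layer.features.shape[-1])
    return pts[valid], feats[valid]
```

In this sketch, rendering an intermediate frame would lift every layer of both LDIs, advect the points to time t, and splat the attached features into the target view.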

Key components of their methodology include:

  1. Image Alignment and Depth Estimation: The process begins by aligning the near-duplicate images with a homography estimated from optical flow, and predicting depth maps with a state-of-the-art monocular depth estimator (DPT). This alignment establishes a shared coordinate frame despite small differences in camera position.
  2. Layered Depth Image (LDI) Representation: The LDIs are generated through a disparity-based clustering mechanism that segments the input RGBD images into multiple layers based on depth discontinuities. Subsequently, these layers undergo an inpainting process to fill occluded regions with plausible color and depth information, which is instrumental in synthesizing novel views.
  3. Feature Extraction and Scene Flow Modeling: A neural network extracts features from the inpainted LDI layers, making the renderer robust to inaccuracies in the predicted depth and motion. Scene flow, which captures dynamic motion in 3D space, is computed from the 2D optical flow and the layered depth, then propagated into occluded regions under spatial-continuity assumptions.
  4. Bidirectional Splatting and Rendering: For rendering a novel view at any intermediate time, the LDIs are lifted into a 3D point cloud. A bidirectional splatting and rendering approach is used to synthesize these point clouds into a coherent output by leveraging feature maps and depth-based weight blending, enhancing rendering fidelity and temporal coherence.
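The blending in step 4 can be illustrated with a toy weighting scheme that favors the temporally nearer photo and nearer-to-camera content. The exponential depth weighting and the `beta` parameter below are assumptions chosen for illustration, not the paper's exact formulation:

```python
import numpy as np

def blend_bidirectional(render_fwd, render_bwd, depth_fwd, depth_bwd, t, beta=10.0):
    """Blend two renders splatted from the two LDIs at intermediate time t in [0, 1].

    Hypothetical weighting: a temporal term (favor the photo nearer in time)
    multiplied by a depth term (favor points closer to the camera), similar in
    spirit to depth-based weight blending.
    """
    w_fwd = (1.0 - t) * np.exp(-beta * depth_fwd)   # (H, W) weight for first photo
    w_bwd = t * np.exp(-beta * depth_bwd)           # (H, W) weight for second photo
    total = w_fwd + w_bwd + 1e-8                    # avoid division by zero
    return (w_fwd[..., None] * render_fwd + w_bwd[..., None] * render_bwd) / total[..., None]
```

At t = 0 the output reduces to the render from the first photo, at t = 1 to the second, with a smooth, depth-aware transition in between.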

Results and Implications

The proposed method outperforms baselines that chain frame interpolation with 3D photo synthesis in either order. Extensive quantitative evaluations on public datasets affirm the efficacy of the approach, which achieves higher PSNR and SSIM and notably better (lower) LPIPS scores, indicating superior perceptual quality.
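Of these metrics, PSNR is simple enough to sketch directly (SSIM and LPIPS are usually computed with packages such as scikit-image and the `lpips` library). A minimal implementation, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means a closer match."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Note the differing directions: PSNR and SSIM are better when higher, while LPIPS, a learned perceptual distance, is better when lower.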

This innovation holds significant practical implications in areas such as digital content creation, enhanced photo-realistic animations, and improved visual effects in cinematic productions. Theoretically, it expands the frontier of computational photography by overcoming constraints tied to static scene assumptions prominent in existing view synthesis methodologies.

Future Directions

Future research spurred by this work could involve enhancements to address limitations concerning non-planar geometry and non-linear motions, boosting robustness in more complex scenes. Exploring automatic determination of suitable photo pairs and failure detection mechanisms would also be beneficial for deploying the method in real-world applications.

In conclusion, "3D Moments from Near-Duplicate Photos" offers a compelling solution to transform near-duplicate images into dynamic 3D experiences, leveraging advancements in LDIs, scene flow, and deep learning to push the boundaries of modern image-based rendering.
