Geometry-Aware Monocular-to-Stereo Synthesis
- The paper introduces a geometry-aware framework that converts monocular video into temporally consistent stereo views using epipolar geometry and triangulation.
- It employs a two-stage pipeline—first estimating depth, flow, and pose, then synthesizing novel views with explicit warping and occlusion inpainting.
- Empirical results show low flow, pose, and depth errors and near real-time processing, validated on the DeMoN and synthetic benchmarks.
Geometry-aware monocular-to-stereo video synthesis refers to a set of computational techniques that convert a monocular (single-camera) video into a temporally coherent stereoscopic (two-view) video, such that the synthesized right (or left) view is physically consistent with a realistic baseline shift. This field leverages projective geometry, multi-view consistency, and learning-based generative modeling to infer 3D structure and enable novel view synthesis for immersive and XR applications.
1. Foundations and Geometric Constraints
Geometry-aware monocular-to-stereo synthesis builds fundamentally on projective geometry and multi-view theory. The monocular input consists of temporally ordered frames captured from a single moving or static camera with known or estimated intrinsic parameters $K$.
Rigid Motion and Epipolar Geometry
For two frames (source and target) with known calibration, the rigid motion $(R, t)$ between their camera poses defines the epipolar constraint, encoded by the fundamental matrix $F$ such that corresponding points $x_s$, $x_t$ satisfy $x_t^\top F x_s = 0$. Geometry-aware pipelines typically estimate these relative poses and enforce that optical flow is constrained to the epipolar lines, shrinking the search space for correspondence and improving flow and pose accuracy (Wang et al., 2019).
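As a concrete illustration, the following numpy sketch builds $F = K^{-\top}[t]_\times R K^{-1}$ from a hypothetical calibration and relative pose, then verifies that projections of the same 3D point satisfy the epipolar constraint. All numeric values here are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

def fundamental_from_pose(K, R, t):
    """Build the fundamental matrix F = K^-T [t]_x R K^-1 from a relative pose."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])   # cross-product (skew-symmetric) matrix
    Kinv = np.linalg.inv(K)
    return Kinv.T @ tx @ R @ Kinv

# Toy setup: identity rotation, pure horizontal translation (stereo-like motion).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])
F = fundamental_from_pose(K, R, t)

# Project one 3D point into both views and check x_t^T F x_s ≈ 0.
X = np.array([0.5, -0.2, 3.0])
xs = K @ X
xs /= xs[2]                 # source pixel, homogeneous
xt = K @ (R @ X + t)
xt /= xt[2]                 # target pixel, homogeneous
residual = abs(xt @ F @ xs)
print(residual)             # ≈ 0: the epipolar constraint holds
```

Constraining flow to the epipolar line of each pixel reduces the 2D correspondence search to a 1D search along that line, which is the source of the speed and accuracy gains noted above.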
Depth Triangulation and View Synthesis
Once flow and poses are estimated, the per-pixel depth $z$ in the source frame is estimated via triangulation. Given the depth and camera extrinsics, a 3D point $X$ can be reprojected into any target view, such as a virtual right-eye camera translated by baseline $b$. The corresponding image-plane position is computed as $x_t = \pi\!\left(K(RX + t)\right)$, where $\pi(\cdot)$ denotes perspective division. This rigorous projective mapping enables synthesis of novel stereo views (Wang et al., 2019).
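The lift-and-reproject mapping can be sketched in a few lines of numpy. The intrinsics and the 65 mm baseline below are illustrative assumptions; for a pure horizontal baseline the resulting horizontal shift reduces to the familiar disparity $d = f\,b / z$:

```python
import numpy as np

def reproject(pixel, depth, K, R, t):
    """Lift a pixel with known depth to 3D, then project into a target camera."""
    x, y = pixel
    X = depth * np.linalg.inv(K) @ np.array([x, y, 1.0])  # back-project to 3D
    p = K @ (R @ X + t)                                   # rigid transform + project
    return p[:2] / p[2]                                   # perspective division

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
baseline = 0.065                            # ~65 mm, a typical interocular distance
t_right = np.array([-baseline, 0.0, 0.0])   # virtual right eye, offset along camera x-axis

src = (400.0, 250.0)                        # source pixel
x_right = reproject(src, 2.0, K, np.eye(3), t_right)
disparity = src[0] - x_right[0]
print(disparity)                            # 16.25 = f * b / z = 500 * 0.065 / 2.0
```

Because the virtual motion is a pure horizontal translation, the vertical coordinate is unchanged and only a horizontal disparity is introduced, which is exactly the property a stereo pair needs.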
2. Classical and Modern Pipeline Architectures
Geometry-aware monocular-to-stereo approaches generally adopt a two-stage pipeline:
- Stage 1: Geometry estimation (depth, optical flow, and pose)
- Stage 2: Novel view synthesis with explicit geometric warping and occlusion inpainting
Flow–Pose–Depth Networks
Systems such as the "Flow-Motion and Depth Network" (Wang et al., 2019) utilize pyramidal feature encoders and joint flow/pose estimation, reducing the ambiguity in flow fields by enforcing epipolar constraints. The core architectural modules are:
- A 6-level pyramid feature extractor,
- A flow-motion subnetwork that refines flow constrained to epipolar lines,
- A pose regression head that predicts the 6-DoF relative transform,
- A triangulation network that learns to regress depth robustly even near projective degeneracies, using an 8-dimensional per-pixel vector comprising source–target correspondences and camera parameters.
Multi-View and Temporal Fusion
Multi-view extensions pool triangulation features from temporally adjacent frames and fuse them via per-pixel mean-pooling in the latent space to improve depth estimation and temporal stability, particularly beneficial in challenging or ambiguous regions (Wang et al., 2019).
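Per-pixel mean-pooling in the latent space can be sketched as follows; the feature shapes are assumed for illustration. Mean-pooling is permutation-invariant and independent of the number of views, which is what lets a fixed decoder consume a variable number of fused frames:

```python
import numpy as np

def fuse_views(feature_maps):
    """Per-pixel mean-pooling of triangulation features from N adjacent views.

    feature_maps: array of shape (N, C, H, W), one latent map per source-target pair.
    Returns a single (C, H, W) fused map, regardless of N.
    """
    return np.mean(feature_maps, axis=0)

# Hypothetical latent maps from N=4 temporally adjacent frames.
feats = np.random.default_rng(0).normal(size=(4, 8, 16, 16))
fused = fuse_views(feats)
print(fused.shape)   # (8, 16, 16): one fused latent map, independent of N
```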
3. Loss Functions and Supervision Strategies
Rigorous loss formulations are critical for geometric and perceptual fidelity:
- Optical flow loss: $L_{\text{flow}} = \frac{1}{N}\sum_i \lVert f_i - f_i^{*} \rVert_1$, the mean per-pixel deviation from the ground-truth flow $f^{*}$.
- Pose (motion) loss: $L_{\text{pose}} = \lVert t - t^{*} \rVert_1 + \lambda_r \lVert r - r^{*} \rVert_1$ on the translation and rotation components of the predicted relative transform.
- Scale-invariant depth loss (with berHu norm): $L_{\text{depth}} = \frac{1}{N}\sum_i \mathrm{berHu}\!\left(d_i - \bar{d}\right)$
with $d_i = \log z_i - \log z_i^{*}$, $\bar{d} = \frac{1}{N}\sum_j d_j$, and $z^{*}$ the ground-truth depth.
- Gradient (edge) loss: $L_{\text{grad}} = \frac{1}{N}\sum_i \lVert \nabla d_i \rVert_1$, sharpening depth discontinuities.
- Total loss: combined with balancing factors as $L = \lambda_{\text{flow}} L_{\text{flow}} + \lambda_{\text{pose}} L_{\text{pose}} + \lambda_{\text{depth}} L_{\text{depth}} + \lambda_{\text{grad}} L_{\text{grad}}$.
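The scale-invariant berHu depth loss can be sketched in numpy as below. The adaptive threshold c = 0.2·max|d| is a common convention in berHu-based depth regression, not necessarily the paper's exact setting:

```python
import numpy as np

def berhu(x, c):
    """Reverse Huber penalty: L1 below threshold c, scaled L2 above it."""
    ax = np.abs(x)
    return np.where(ax <= c, ax, (ax**2 + c**2) / (2.0 * c))

def scale_invariant_depth_loss(z_pred, z_gt):
    """berHu penalty on mean-centred log-depth residuals (scale-invariant)."""
    d = np.log(z_pred) - np.log(z_gt)
    d = d - d.mean()                      # remove the global log-scale offset
    c = 0.2 * np.abs(d).max() + 1e-12     # common adaptive threshold choice
    return berhu(d, c).mean()

z_gt = np.array([1.0, 2.0, 4.0, 8.0])
loss_same_scale = scale_invariant_depth_loss(3.0 * z_gt, z_gt)
print(loss_same_scale)  # ≈ 0: a global scale factor does not change the loss
```

The zero loss under a global 3x rescaling demonstrates the scale invariance: monocular depth is recoverable only up to scale, so the loss should not penalize a uniform scaling.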
Supervision can be full (using paired depth/flow ground truth), self-supervised (relying on photometric or geometric consistency), or hybrid, depending on dataset availability.
4. Stereo Video Synthesis Algorithms
The final stereo view synthesis relies on differentiable projective warping and efficient rendering procedures:
- For each pixel $x$ with estimated depth $z$ in the source frame, lift to 3D: $X = z\,K^{-1}\tilde{x}$, where $\tilde{x}$ is the homogeneous pixel coordinate;
- Reproject to the virtual right or left camera using the baseline translation;
- The new pixel position may be subpixel; retrieve intensity via bilinear interpolation;
- Stack synthesized left and right images as the stereo pair.
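The steps above can be sketched as a simplified backward-warping variant: assuming a per-pixel disparity map is already available for the virtual view, each right-view pixel bilinearly samples the left image at a subpixel position. This only illustrates the subpixel bilinear retrieval; the full pipeline forward-warps with source depth and inpaints disocclusions:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a (H, W) image at subpixel (x, y) via bilinear interpolation."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0                      # fractional offsets
    x0, y0 = np.clip(x0, 0, w - 1), np.clip(y0, 0, h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp-to-edge at borders
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bot

def synthesize_right_view(left, disparity):
    """Backward warp: each right-view pixel samples the left image at x + d."""
    h, w = left.shape
    right = np.zeros_like(left)
    for y in range(h):
        for x in range(w):
            right[y, x] = bilinear_sample(left, x + disparity[y, x], y)
    return right

left = np.tile(np.arange(8, dtype=float), (4, 1))   # horizontal ramp test image
disp = np.full((4, 8), 1.5)                         # constant 1.5 px disparity
right = synthesize_right_view(left, disp)
print(right[0, 2])  # 3.5: the value 1.5 px to the right of column 2
```

The sign convention follows the usual stereo geometry: a right-eye camera shifted rightward sees scene content shifted leftward, so the right view samples the left image at x + d with positive disparity d.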
Multi-view fusion, temporal smoothing, and occlusion-aware warping further enhance output quality and reduce artifacts at disocclusions or depth edges (Wang et al., 2019).
5. Quantitative and Empirical Results
Representative metrics reported on DeMoN and synthetic benchmarks include:
- Flow End-Point Error: 3.47 px
- Rotation error: 1.88°
- Translation error: 10.31°
- Depth (after optimal scale alignment):
- L1-inv = 0.015
- sc-inv = 0.195
- L1-rel = 0.134
- Increasing the number of fused views in synthetic sequences reduces the L1-rel error from 0.145 to 0.114
- Stereo reconstruction: the photometric error at the chosen virtual baseline remains low enough for stereo display comfort (Wang et al., 2019).
6. Practical and Theoretical Implications
The geometry-aware paradigm offers several key advantages:
- By constraining correspondence search to epipolar geometry, the solution space is reduced, accelerating flow estimation and improving accuracy.
- The triangulation layer encapsulates all projective information needed for robust depth regression without explicit inversion near epipolar degeneracy.
- End-to-end pipelines integrating flow, pose, and depth enable monocular-to-stereo conversion at real-time or near real-time rates.
- Model generalization is demonstrated on both synthetic and real-world datasets, including structure-from-motion setups and natural videos (Wang et al., 2019).
Limitations remain, including challenges in dynamic scenes (unless extended with additional temporal modeling), sensitivity to calibration errors, and occlusion handling under large baselines or non-Lambertian surfaces.
7. Extensions and Relation to Modern Approaches
While classical geometry-aware pipelines utilize explicit two-view or multi-view projective reasoning, contemporary monocular-to-stereo methods extend these ideas using generative priors, self-supervised learning, and video diffusion backbones:
- Multi-view fusion modules and deep inpainting networks address occlusions and disocclusion hallucination;
- Self-supervised and synthetic stereo training data generation enhances scalability in the absence of large-scale paired datasets;
- Temporal attention, sophisticated latent encodings, and learned cross-modal geometry augment the explicit projective warping core.
Hybrid systems leveraging both explicit geometry and learned priors represent the state of the art in monocular-to-stereo video synthesis (Wang et al., 2019).
References:
- "Flow-Motion and Depth Network for Monocular Stereo and Beyond" (Wang et al., 2019)