Geometry-Aware Monocular-to-Stereo Synthesis
- The paper introduces a geometry-aware framework that converts monocular video into temporally consistent stereo views using epipolar geometry and triangulation.
- It employs a two-stage pipeline—first estimating depth, flow, and pose, then synthesizing novel views with explicit warping and occlusion inpainting.
- Empirical results show low flow, pose, and depth errors and near real-time processing, validated on the DeMoN and synthetic benchmarks.
Geometry-aware monocular-to-stereo video synthesis refers to a set of computational techniques that convert a monocular (single-camera) video into a temporally coherent stereoscopic (two-view) video, such that the synthesized right (or left) view is physically consistent with a realistic baseline shift. This field leverages projective geometry, multi-view consistency, and learning-based generative modeling to infer 3D structure and enable novel view synthesis for immersive and XR applications.
1. Foundations and Geometric Constraints
Geometry-aware monocular-to-stereo synthesis builds fundamentally on projective geometry and multi-view theory. The monocular input consists of temporally ordered frames captured from a single moving or static camera with known or estimated intrinsic parameters $K$.
Rigid Motion and Epipolar Geometry
For two frames (source and target) with known calibration, the rigid motion $(R, t)$ between their camera poses defines the epipolar constraint, encoded by the fundamental matrix $F$ such that corresponding points $x_s$, $x_t$ satisfy $x_t^\top F x_s = 0$. Geometry-aware pipelines typically estimate these relative poses and enforce that optical flow is constrained to the epipolar lines, shrinking the search space for correspondence and improving flow and pose accuracy (Wang et al., 2019).
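As a concrete illustration, the following numpy sketch builds $F = K^{-\top}[t]_\times R K^{-1}$ from a hypothetical calibration and relative pose, then verifies that projections of the same 3D point satisfy the epipolar constraint. All numeric values here are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

def fundamental_from_pose(K, R, t):
    """Build the fundamental matrix F = K^-T [t]_x R K^-1 from a relative pose."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])   # cross-product (skew-symmetric) matrix
    Kinv = np.linalg.inv(K)
    return Kinv.T @ tx @ R @ Kinv

# Toy setup: identity rotation, pure horizontal translation (stereo-like motion).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])
F = fundamental_from_pose(K, R, t)

# Project one 3D point into both views and check x_t^T F x_s ≈ 0.
X = np.array([0.5, -0.2, 3.0])
xs = K @ X
xs /= xs[2]                 # source pixel, homogeneous
xt = K @ (R @ X + t)
xt /= xt[2]                 # target pixel, homogeneous
residual = abs(xt @ F @ xs)
print(residual)             # ≈ 0: the epipolar constraint holds
```

Constraining flow to the epipolar line of each pixel reduces the 2D correspondence search to a 1D search along that line, which is the source of the speed and accuracy gains noted above.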
Depth Triangulation and View Synthesis
Once flow and poses are estimated, the per-pixel depth $z$ in the source frame is estimated via triangulation. Given the depth and camera extrinsics, a 3D point $X$ can be reprojected into any target view, such as a virtual right-eye camera translated by baseline $b$. The corresponding image-plane position is computed as $x_t = \pi\!\left(K(RX + t)\right)$, where $\pi(\cdot)$ denotes perspective division. This rigorous projective mapping enables synthesis of novel stereo views (Wang et al., 2019).
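The lift-and-reproject mapping can be sketched in a few lines of numpy. The intrinsics and the 65 mm baseline below are illustrative assumptions; for a pure horizontal baseline the resulting horizontal shift reduces to the familiar disparity $d = f\,b / z$:

```python
import numpy as np

def reproject(pixel, depth, K, R, t):
    """Lift a pixel with known depth to 3D, then project into a target camera."""
    x, y = pixel
    X = depth * np.linalg.inv(K) @ np.array([x, y, 1.0])  # back-project to 3D
    p = K @ (R @ X + t)                                   # rigid transform + project
    return p[:2] / p[2]                                   # perspective division

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
baseline = 0.065                            # ~65 mm, a typical interocular distance
t_right = np.array([-baseline, 0.0, 0.0])   # virtual right eye, offset along camera x-axis

src = (400.0, 250.0)                        # source pixel
x_right = reproject(src, 2.0, K, np.eye(3), t_right)
disparity = src[0] - x_right[0]
print(disparity)                            # 16.25 = f * b / z = 500 * 0.065 / 2.0
```

Because the virtual motion is a pure horizontal translation, the vertical coordinate is unchanged and only a horizontal disparity is introduced, which is exactly the property a stereo pair needs.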
2. Classical and Modern Pipeline Architectures
Geometry-aware monocular-to-stereo approaches generally adopt a two-stage pipeline:
- Stage 1: Geometry estimation (depth, optical flow, and pose)
- Stage 2: Novel view synthesis with explicit geometric warping and occlusion inpainting
Flow–Pose–Depth Networks
Systems such as the "Flow-Motion and Depth Network" (Wang et al., 2019) utilize pyramidal feature encoders and joint flow/pose estimation, reducing the ambiguity in flow fields by enforcing epipolar constraints. The core architectural modules are:
- A 6-level pyramid feature extractor,
- A flow-motion subnetwork that refines flow constrained to epipolar lines,
- A pose regression head that predicts the 6-DoF relative transform,
- A triangulation network that learns to regress depth robustly even near projective degeneracies, using an 8-dimensional per-pixel vector comprising source–target correspondences and camera parameters.
Multi-View and Temporal Fusion
Multi-view extensions pool triangulation features from temporally adjacent frames and fuse them via per-pixel mean-pooling in the latent space to improve depth estimation and temporal stability, particularly beneficial in challenging or ambiguous regions (Wang et al., 2019).
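Per-pixel mean-pooling in the latent space can be sketched as follows; the feature shapes are assumed for illustration. Mean-pooling is permutation-invariant and independent of the number of views, which is what lets a fixed decoder consume a variable number of fused frames:

```python
import numpy as np

def fuse_views(feature_maps):
    """Per-pixel mean-pooling of triangulation features from N adjacent views.

    feature_maps: array of shape (N, C, H, W), one latent map per source-target pair.
    Returns a single (C, H, W) fused map, regardless of N.
    """
    return np.mean(feature_maps, axis=0)

# Hypothetical latent maps from N=4 temporally adjacent frames.
feats = np.random.default_rng(0).normal(size=(4, 8, 16, 16))
fused = fuse_views(feats)
print(fused.shape)   # (8, 16, 16): one fused latent map, independent of N
```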
3. Loss Functions and Supervision Strategies
Rigorous loss formulations are critical for geometric and perceptual fidelity:
- Optical flow loss: $L_{\text{flow}} = \frac{1}{N}\sum_i \lVert f_i - f_i^{*} \rVert_1$, the mean per-pixel deviation from the ground-truth flow $f^{*}$.
- Pose (motion) loss: $L_{\text{pose}} = \lVert t - t^{*} \rVert_1 + \lambda_r \lVert r - r^{*} \rVert_1$ on the translation and rotation components of the predicted relative transform.
- Scale-invariant depth loss (with berHu norm): $L_{\text{depth}} = \frac{1}{N}\sum_i \mathrm{berHu}\!\left(d_i - \bar{d}\right)$
with $d_i = \log z_i - \log z_i^{*}$, $\bar{d} = \frac{1}{N}\sum_j d_j$, and $z^{*}$ the ground-truth depth.
- Gradient (edge) loss: $L_{\text{grad}} = \frac{1}{N}\sum_i \lVert \nabla d_i \rVert_1$, sharpening depth discontinuities.
- Total loss: combined with balancing factors as $L = \lambda_{\text{flow}} L_{\text{flow}} + \lambda_{\text{pose}} L_{\text{pose}} + \lambda_{\text{depth}} L_{\text{depth}} + \lambda_{\text{grad}} L_{\text{grad}}$.
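The scale-invariant berHu depth loss can be sketched in numpy as below. The adaptive threshold c = 0.2·max|d| is a common convention in berHu-based depth regression, not necessarily the paper's exact setting:

```python
import numpy as np

def berhu(x, c):
    """Reverse Huber penalty: L1 below threshold c, scaled L2 above it."""
    ax = np.abs(x)
    return np.where(ax <= c, ax, (ax**2 + c**2) / (2.0 * c))

def scale_invariant_depth_loss(z_pred, z_gt):
    """berHu penalty on mean-centred log-depth residuals (scale-invariant)."""
    d = np.log(z_pred) - np.log(z_gt)
    d = d - d.mean()                      # remove the global log-scale offset
    c = 0.2 * np.abs(d).max() + 1e-12     # common adaptive threshold choice
    return berhu(d, c).mean()

z_gt = np.array([1.0, 2.0, 4.0, 8.0])
loss_same_scale = scale_invariant_depth_loss(3.0 * z_gt, z_gt)
print(loss_same_scale)  # ≈ 0: a global scale factor does not change the loss
```

The zero loss under a global 3x rescaling demonstrates the scale invariance: monocular depth is recoverable only up to scale, so the loss should not penalize a uniform scaling.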
Supervision can be full (using paired depth/flow ground truth), self-supervised (relying on photometric or geometric consistency), or hybrid, depending on dataset availability.
4. Stereo Video Synthesis Algorithms
The final stereo view synthesis relies on differentiable projective warping and efficient rendering procedures:
- For each pixel $x$ with estimated depth $z$ in the source frame, lift to 3D: $X = z\,K^{-1}\tilde{x}$, where $\tilde{x}$ is the homogeneous pixel coordinate;
- Reproject to the virtual right or left camera using the baseline translation;
- The new pixel position may be subpixel; retrieve intensity via bilinear interpolation;
- Stack synthesized left and right images as the stereo pair.
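The steps above can be sketched as a simplified backward-warping variant: assuming a per-pixel disparity map is already available for the virtual view, each right-view pixel bilinearly samples the left image at a subpixel position. This only illustrates the subpixel bilinear retrieval; the full pipeline forward-warps with source depth and inpaints disocclusions:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a (H, W) image at subpixel (x, y) via bilinear interpolation."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0                      # fractional offsets
    x0, y0 = np.clip(x0, 0, w - 1), np.clip(y0, 0, h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp-to-edge at borders
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bot

def synthesize_right_view(left, disparity):
    """Backward warp: each right-view pixel samples the left image at x + d."""
    h, w = left.shape
    right = np.zeros_like(left)
    for y in range(h):
        for x in range(w):
            right[y, x] = bilinear_sample(left, x + disparity[y, x], y)
    return right

left = np.tile(np.arange(8, dtype=float), (4, 1))   # horizontal ramp test image
disp = np.full((4, 8), 1.5)                         # constant 1.5 px disparity
right = synthesize_right_view(left, disp)
print(right[0, 2])  # 3.5: the value 1.5 px to the right of column 2
```

The sign convention follows the usual stereo geometry: a right-eye camera shifted rightward sees scene content shifted leftward, so the right view samples the left image at x + d with positive disparity d.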
Multi-view fusion, temporal smoothing, and occlusion-aware warping further enhance output quality and reduce artifacts at disocclusions or depth edges (Wang et al., 2019).
5. Quantitative and Empirical Results
Representative metrics reported on DeMoN and synthetic benchmarks include:
- Flow End-Point Error: 3.47 px
- Rotation error: 1.88°
- Translation error: 10.31°
- Depth (after optimal scale alignment):
- L1-inv = 0.015
- sc-inv = 0.195
- L1-rel = 0.134
- Increasing the number of fused views in synthetic sequences reduces the L1-rel error from 0.145 to 0.114
- Stereo reconstruction: the photometric error at the chosen virtual baseline remains low enough for stereo display comfort (Wang et al., 2019).
6. Practical and Theoretical Implications
The geometry-aware paradigm offers several key advantages:
- By constraining correspondence search to epipolar geometry, the solution space is reduced, accelerating flow estimation and improving accuracy.
- The triangulation layer encapsulates all projective information needed for robust depth regression without explicit inversion near epipolar degeneracy.
- End-to-end pipelines integrating flow, pose, and depth enable monocular-to-stereo conversion at real-time or near real-time rates.
- Model generalization is demonstrated on both synthetic and real-world datasets, including structure-from-motion setups and natural videos (Wang et al., 2019).
Limitations remain, including challenges in dynamic scenes (unless extended with additional temporal modeling), sensitivity to calibration errors, and occlusion handling under large baselines or non-Lambertian surfaces.
7. Extensions and Relation to Modern Approaches
While classical geometry-aware pipelines utilize explicit two-view or multi-view projective reasoning, contemporary monocular-to-stereo methods extend these ideas using generative priors, self-supervised learning, and video diffusion backbones:
- Multi-view fusion modules and deep inpainting networks address occlusions and disocclusion hallucination;
- Self-supervised and synthetic stereo training data generation enhances scalability in the absence of large-scale paired datasets;
- Temporal attention, sophisticated latent encodings, and learned cross-modal geometry augment the explicit projective warping core.
Hybrid systems leveraging both explicit geometry and learned priors represent the state of the art in monocular-to-stereo video synthesis (Wang et al., 2019).
References:
- "Flow-Motion and Depth Network for Monocular Stereo and Beyond" (Wang et al., 2019)