Dual-Pixel Disparity Extraction
- Dual-pixel disparity extraction is the process of recovering depth maps with sub-pixel disparity precision from split sensor views by leveraging optical geometry and defocus cues.
- It employs both classical cost-volume methods and physics-informed continuous cost aggregation to address challenges such as limited baseline and PSF mismatches.
- Recent approaches integrate deep networks and teacher-student distillation to robustly extract disparities in diverse and defocused imaging conditions.
Dual-pixel (DP) disparity extraction is the process of recovering per-pixel depth or disparity maps from the unique twin-view image data generated by dual-pixel sensors, which are increasingly prevalent in consumer cameras and mobile devices. A DP sensor splits each pixel into left/right sub-apertures, producing two interleaved images that together encode minute horizontal disparities and defocus effects reflective of scene geometry. Unlike classical stereo matching, the DP baseline is typically just the optical aperture, resulting in a fundamentally different photometric response and distinct challenges for accurate disparity estimation.
1. Optical and Physical Principles of Dual-Pixel Disparity
DP sensors capture two slightly different sub-aperture views by splitting each photosite laterally. The physical disparity between left/right subimages is governed by the lens geometry and the thin-lens equation $d = \frac{fL}{1 - f/g}\left(\frac{1}{g} - \frac{1}{Z}\right)$, where $f$ is the focal length, $L$ the aperture diameter (serving as the effective baseline), $g$ the focus distance, and $Z$ the scene depth. In pixel coordinates, $d_{\mathrm{px}} = d/p$, with $p$ the pixel pitch. Critically, this disparity is entangled with the diameter of the defocus blur ("circle of confusion"), and in practice most DP sensors exhibit significant spatial variation in their point spread function (PSF) between views. This necessitates disparity extraction algorithms that are robust to non-ideal matching conditions: non-identical blurs, strong noise, and a very limited disparity range of only a few pixels (Swami et al., 17 Jun 2025, Garg et al., 2019).
Under standard imaging conditions, the DP disparity is prominent only where there is substantial defocus; at the focal plane, the DP views become nearly identical (“all-in-focus” regime), resulting in minimal signal.
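The scaling of the disparity signal can be sketched numerically. The function below is an illustrative implementation of the thin-lens relation above; the default values are hypothetical smartphone-like parameters, not taken from any cited paper:

```python
def dp_disparity_px(z, f=4.4e-3, L=2.4e-3, g=2.0, pitch=1.4e-6):
    """Signed dual-pixel disparity in pixels for scene depth z (metres).

    Thin-lens sketch: the signal scales with the aperture L (the
    effective baseline) and is affine in inverse depth 1/z.
    f: focal length, L: aperture diameter, g: focus distance,
    pitch: pixel pitch (all in metres; defaults are illustrative).
    """
    d_m = (f * L) / (1.0 - f / g) * (1.0 / g - 1.0 / z)  # metres on sensor
    return d_m / pitch
```

At `z = g` the two views coincide (zero disparity), while a point at half the focus distance yields only a few pixels of signal, illustrating the tiny search range DP methods must operate in.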
2. Modeling and Inversion of the Disparity–Depth Relationship
Explicit inversion from measured DP disparity to scene depth is enabled by the thin-lens model. For many DP pipelines, the key relationship is $d = \kappa \frac{fL}{1 - f/g}\left(\frac{1}{g} - \frac{1}{Z}\right)$, where $L$ is the aperture diameter, $g$ the focus distance, $Z$ the actual depth, and $\kappa$ a calibration term accounting for sensor geometry (Kurita et al., 2024, Garg et al., 2019). Certain methods exploit the invertibility of this mapping to recover $Z$ directly from $d$ under the assumption of known camera intrinsics. However, real-world DP devices frequently lack reliable per-capture metadata for focus distance and aperture, leading to an inherent affine ambiguity: depth is recoverable only up to an unknown global scale and offset (in inverse depth), i.e., $d = A + B/Z$ with unknown $A, B$. Garg et al. formalize and address this ambiguity in supervised and self-supervised learning regimes (Garg et al., 2019), enforcing evaluation via affine-invariant error metrics.
Furthermore, the magnitude and reliability of DP disparity are scene-dependent. Blurred regions away from the focal plane yield the strongest signal; in-focus or low-texture regions present fundamentally ill-posed matching scenarios.
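The affine ambiguity can be made concrete with a toy calibration: under a model in which disparity is affine in inverse depth, two or more known depths suffice to fix the mapping, after which it inverts in closed form. All values below are illustrative, not from any cited dataset:

```python
import numpy as np

# Hedged toy model: DP disparity is affine in inverse depth, d = A + B / z.
z_calib = np.array([0.8, 1.5, 3.0, 6.0])      # known calibration depths (m)
A_true, B_true = 0.4, -1.2                    # unknown to the estimator
d_calib = A_true + B_true / z_calib           # observed disparities

# Least-squares fit of (A, B) from the calibration pairs.
M = np.stack([np.ones_like(z_calib), 1.0 / z_calib], axis=1)
(A_est, B_est), *_ = np.linalg.lstsq(M, d_calib, rcond=None)

def depth_from_disparity(d):
    """Invert d = A + B / z for z once (A, B) are calibrated."""
    return B_est / (d - A_est)
```

Without such calibration pairs (or per-capture focus/aperture metadata), only affine-invariant comparisons of predicted and true inverse depth are meaningful.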
3. Classical and Physics-Informed Cost-Volume Approaches
The earliest DP disparity extraction pipelines employ discrete matching cost volumes, e.g., sum-of-absolute-differences (SAD), normalized cross-correlation (NCC), or template-matching over tiny disparity ranges. However, direct application of stereo matching algorithms is suboptimal due to differing left/right PSFs and a small disparity domain (Monin et al., 2023). To address these limitations, Punnappurath et al. introduce the Continuous Cost Aggregation (CCA) scheme:
- A per-pixel, integer-disparity cost volume is computed.
- A local parabola (quadratic) is fit to the three costs around the integer minimum, yielding a continuous, convex cost function $c(d) = a d^2 + b d + c_0$ at each pixel.
- Semi-global 1D path-based aggregation is applied to these quadratic coefficients under a quadratic smoothness penalty, preserving the closed-form minimizer at each pixel, $d^* = -B/(2A)$, where $A, B$ are the coefficient sums over path directions.
- Multi-scale fusion is realized by upsampling, weighting, and adding coarse-level coefficients as priors to finer scales.
This yields memory and runtime comparable to standard SGM, but with robust subpixel disparity estimates and state-of-the-art accuracy on both DSLR and mobile DP datasets (Monin et al., 2023). A variant in (Kurita et al., 2024) employs explicit physics-based modeling of error statistics in template-matching, propagating Laplacian perturbations derived from synthetic DP simulations, and learns sparse-to-dense completion networks supervised on pseudo-DP data from standard RGB-D corpora.
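The parabola-fitting step at the heart of such continuous schemes can be illustrated on a single pixel. This is a simplified per-pixel sketch only, without the semi-global coefficient aggregation:

```python
import numpy as np

def subpixel_min(costs, disparities):
    """Fit a parabola a*d^2 + b*d + c through the three costs around the
    integer minimum; return its closed-form minimizer d* = -b / (2a)."""
    i = int(np.argmin(costs))
    i = int(np.clip(i, 1, len(costs) - 2))     # keep a 3-point neighborhood
    c_m, c_0, c_p = costs[i - 1], costs[i], costs[i + 1]
    a = 0.5 * (c_m + c_p) - c_0                # quadratic coefficients on a
    b = 0.5 * (c_p - c_m)                      # unit-spaced local grid
    offset = -b / (2.0 * a) if a > 0 else 0.0  # convexity check
    return disparities[i] + offset

# Costs sampled from (d - 0.3)^2 on integer disparities -2..2:
disps = np.arange(-2, 3, dtype=float)
costs = (disps - 0.3) ** 2
```

In full CCA it is these $(a, b)$ coefficients, not raw costs, that are aggregated along SGM-style paths before the same closed-form minimizer is applied.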
4. Learning-Based and Cross-Modal Disparity Extraction
Deep networks have displaced classical pipelines due to their ability to merge DP cues with monocular priors and to model complex, non-ideal behaviors (e.g., spatial PSF variance, vignetting). Several architectures and paradigms have emerged:
- Multi-Scale Cross-Correlation (MCCNet): Computes explicit correlation volumes between left/right DP features at multiple scales, regresses continuous disparities (via softmax-weighted summation), and fuses these into full-resolution disparity maps. Integration into deblurring decoders is achieved via cross-correlated skip connections. No explicit disparity supervision is used—all learning is through reconstruction losses (Swami, 16 Feb 2025).
- Attention-Based Encoders (DiFuse-Net, WBiPAM): Employ window-based bi-directional parallax attention to match left/right DP features within the limited disparity range typical of smartphones, and fuse DP and RGB features adaptively. Cross-modal transfer learning utilizes large RGB-D datasets for pretraining, enabling improved learning efficiency despite limited DP ground truth (Swami et al., 17 Jun 2025).
- Stereo-Knowledge Distillation: Networks are explicitly trained to mimic high-fidelity stereo teachers, using only the DP data as input during inference. Garg et al. demonstrate that "student" DP networks trained via L1 distillation from a stereo "teacher" (Unimatch) achieve significant gains over both monocular and prior DP baselines, notably in the dpMV dataset. This approach leverages synthetic stereo data as supervisory "dark knowledge" for real DP devices (Garg et al., 2024).
In all cases, the limited baseline (on the order of the aperture diameter) and PSF mismatches are crucial constraints; alignment and domain adaptation are recurring themes. For defocus deblurring, architectures such as DPANet deploy local feature correlation modules and deformable convolutions to robustly align the non-identical DP sub-aperture images, facilitating downstream deblurring (Li et al., 2022).
5. Integration with Deblurring and Coded Capture Systems
Most DP disparity pipelines are tightly coupled with deblurring or depth-enhanced imaging. Joint estimation is realized through two primary mechanisms:
- Dual Objective Networks: As in DDDNet, simultaneous recovery of all-in-focus images and depth/disparity is achieved via a combination of encoder-decoders. The analytic DP formation model is used both for image restoration supervision and to define a "reblur loss," enforcing that the predicted depth generates DP views consistent with the input (Pan et al., 2020).
- Coded-Aperture Dual-Pixel Sensing (CADS): DP disparity extraction is fundamentally constrained by the trade-off between defocus (blur diameter) and parallax (disparity). CADS jointly learns an optimal coded aperture pattern and a recovery network, tailoring both the optical transfer function and the algorithm to maximize disparity discriminability and deblurring accuracy. Quantitative experiments show 1.5 dB gain in all-in-focus image PSNR and 5–6% improvement in depth accuracy compared to naive DP (Ghanekar et al., 2024).
Joint training and end-to-end optimization of optical and computational components are increasingly prevalent. Multi-scale architectures and explicit cost fusion further enhance the robustness of disparity extraction against blur, vignetting, and real-world sensor artifacts.
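The reblur-consistency idea can be sketched in a heavily simplified form: ignoring defocus blur entirely and checking only that the predicted all-in-focus image, warped by half the predicted disparity in each direction, reproduces the two DP views. Function names are illustrative; the actual DDDNet loss also renders depth-dependent blur:

```python
import numpy as np

def warp_rows(img, shift):
    """Resample each row at x - shift via linear interpolation,
    i.e., move image content right by `shift` pixels (per pixel)."""
    H, W = img.shape
    xs = np.arange(W, dtype=float)
    out = np.empty_like(img)
    for r in range(H):
        out[r] = np.interp(xs - shift[r], xs, img[r])
    return out

def dp_consistency_loss(allfocus, disparity, dp_left, dp_right):
    """L1 check: the predicted all-in-focus image, warped by +/- d/2,
    should reproduce the two DP views (blur ignored in this sketch)."""
    left_hat = warp_rows(allfocus, +0.5 * disparity)
    right_hat = warp_rows(allfocus, -0.5 * disparity)
    return (np.abs(left_hat - dp_left).mean()
            + np.abs(right_hat - dp_right).mean())
```

Because the loss is computed against the raw inputs, it provides supervision for depth without any ground-truth disparity, which is the appeal of formation-model-based training.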
6. Datasets, Evaluation Protocols, and Benchmark Results
Advances in DP disparity extraction are tied to the availability of purpose-built datasets:
- DSLR and Mobile DP datasets: Punnappurath et al. (Monin et al., 2023) and Garg et al. (Garg et al., 2019, Garg et al., 2024) capture DP(RGB) sequences from DSLRs and multi-camera smartphone rigs. Ground truth is established via synchronized auxiliary RGB-D or multi-view stereo systems, with affine-invariant error metrics such as AI(1), AI(2), and Spearman rank correlation.
- Synthetic DP generation: Simulators generate DP pairs from standard RGB-D datasets by physically accurate thin-lens forward models. This enables supervised training without labor-intensive DP data capture (Pan et al., 2020, Kurita et al., 2024).
- Cross-modal datasets (DCDP, dpMV): Multi-view and dual-pixel paired datasets support training and evaluation of hybrid DP-RGB architectures, replication of real-world acquisition noise, and transfer learning (Swami et al., 17 Jun 2025, Garg et al., 2024).
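A minimal, hedged sketch of such a simulator for the simplest possible case, a single fronto-parallel plane: a disc PSF is split into left/right halves, one per sub-aperture. Real simulators additionally handle per-pixel depth layers and occlusion:

```python
import numpy as np

def half_disc_psfs(radius):
    """Left/right half-disc PSFs modeling the two DP sub-apertures."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = (x**2 + y**2 <= radius**2).astype(float)
    left, right = disc * (x <= 0), disc * (x >= 0)
    return left / left.sum(), right / right.sum()

def filter2(img, k):
    """2-D cross-correlation with 'same' output (small kernels only)."""
    r = k.shape[0] // 2
    pad = np.pad(img, r, mode='edge')
    out = np.zeros_like(img)
    for dy in range(k.shape[0]):
        for dx in range(k.shape[1]):
            out += k[dy, dx] * pad[dy:dy + img.shape[0],
                                   dx:dx + img.shape[1]]
    return out

def render_dp(img, blur_radius):
    """Synthesize a DP pair for one fronto-parallel depth plane."""
    kl, kr = half_disc_psfs(blur_radius)
    return filter2(img, kl), filter2(img, kr)
```

The two rendered views have mirrored half-disc blurs whose centroids are displaced in opposite directions, which is exactly the coupled defocus-plus-disparity signal DP methods exploit.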
Table: Representative quantitative benchmark for DP disparity extraction methods (from (Monin et al., 2023, Kurita et al., 2024, Garg et al., 2024)):
| Paper / Method          | AI(1) ↓ | AI(2) ↓ | 1−\|ρ_s\| ↓ | Params |
|-------------------------|---------|---------|-------------|--------|
| Punnappurath '20        | 0.0449  | 0.0724  | 0.2301      | ~0     |
| Kim '23 (end-to-end DP) | 0.0390  | 0.0679  | 0.2092      | 10.6M  |
| Physics-informed        | 0.0301  | 0.0667  | 0.0782      | 1.9M   |
| Stereo-distilled        | 0.129   | 0.165   | n/a         | 49M    |
These metrics reflect the accuracy (lower is better) and compactness of recent pipelines.
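The affine-invariant protocol behind these numbers can be sketched as follows. Note one simplification: Garg et al. define AI(1)/AI(2) with norm-specific affine fits, whereas this sketch uses a single least-squares fit before measuring error:

```python
import numpy as np

def affine_invariant_l1(pred, gt):
    """Fit gt ~ a*pred + b by least squares, then report the mean
    absolute error of the affinely aligned prediction."""
    M = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(M, gt, rcond=None)
    return np.abs(a * pred + b - gt).mean()

def spearman_metric(pred, gt):
    """1 - |rho_s|: penalizes only errors in depth ordering (no ties)."""
    ranks = lambda v: np.argsort(np.argsort(v)).astype(float)
    rho = np.corrcoef(ranks(pred), ranks(gt))[0, 1]
    return 1.0 - abs(rho)
```

Both metrics are zero for any prediction that is an affine (or, for the rank metric, any monotone) transform of the ground truth, which is precisely the equivalence class DP depth is recoverable within.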
7. Limitations, Challenges, and Future Directions
DP disparity extraction faces several persistent limitations:
- Affine ambiguity: Absolute depth from DP is fundamentally ambiguous up to global scale and offset unless per-capture calibration (focus/aperture) is provided (Garg et al., 2019). Many applications accept this, but others (metric 3D scene reconstruction) require more.
- PSF mismatch and spatial non-ideality: The differing blur kernels for each DP half-aperture and their spatial variability complicate both classical and deep architectures. Robust feature alignment (e.g., deformable convolutions, correlation volumes) is essential (Li et al., 2022).
- Weak signal near focal plane: Few DP pipelines can recover reliable disparities in all-in-focus or low-defocus regimes. Hybrid networks that incorporate both DP and RGB context offer improved robustness (Swami et al., 17 Jun 2025, Garg et al., 2024).
- Sensitivity to optical noise: Specularities, extreme low light, and atypical PSFs can corrupt DP measurements, necessitating fallback to monocular cues (Garg et al., 2024).
- Scarcity of real DP training data: Physics-informed simulation (Kurita et al., 2024), transfer learning, teacher-student distillation, and coded acquisition are key strategies for overcoming data bottlenecks.
Future research directions include learned coded optics for improved disparity contrast, semi-supervised training leveraging synthetic and real DP data, continuous cost aggregation in differentiable frameworks, and the integration of DP cues with event- or phase-based sensors. Emerging multi-modal and multi-task systems are likely to further leverage the strengths of DP disparity alongside complementary depth cues.