IGEV-Stereo is a deep network architecture for stereo matching that fuses local and non-local geometric cues using a Combined Geometry Encoding Volume.
It employs soft arg-min disparity initialization and a ConvGRU-based iterative updater to achieve subpixel-accurate depth estimation in just 3–8 iterations.
The system extends to IGEV++ and IGEV-MVS, demonstrating state-of-the-art performance on benchmarks like Scene Flow and KITTI with efficient inference.
Iterative Geometry Encoding Volume (IGEV-Stereo) refers to a deep network architecture designed for stereo matching that integrates recurrent updates with a geometry-aware and context-rich cost volume. By leveraging lightweight 3D convolutional regularization, multi-scale feature aggregation, and an efficient ConvGRU-based updater, IGEV-Stereo achieves state-of-the-art accuracy and rapid convergence on established benchmarks. Its advances are further extended to multi-range (IGEV++) and multi-view (IGEV-MVS) stereo, yielding strong performance and generalization in a variety of settings (Xu et al., 2023, Xu et al., 2024).
1. Combined Geometry Encoding Volume Construction
The principal innovation of IGEV-Stereo is the Combined Geometry Encoding Volume (CGEV), which synthesizes both local and non-local matching cues across multiple scales, enabling effective disambiguation in ill-posed regions and refinement of fine details. CGEV is constructed by fusing three principal components:
Local all-pairs correlation (APC) preserves granular matching evidence.
3D-CNN–filtered cost volume (GEV) encodes non-local geometry and scene context.
Disparity-pooled pyramids of APC and GEV capture multi-scale and large-disparity structures.
Given left (f_{l,4}) and right (f_{r,4}) feature maps at $1/4$ resolution, group-wise correlation volumes are computed as follows:

C_corr(g, d, x, y) = (1 / (C/N_g)) · ⟨f_{l,4}^g(x, y), f_{r,4}^g(x − d, y)⟩,  g = 1, …, N_g

where C is the number of feature channels and N_g the number of groups.
This is regularized by a lightweight 3D U-Net: CG=R(Ccorr)
At each 3D convolution stage, a channel-wise excitation modulates responses with the sigmoid of higher-level features: Ci′=σ(fl,i)⊙Ci
A parallel APC volume is built: CA(d,x,y)=⟨fl,4(x,y),fr,4(x−d,y)⟩
Disparity pooling forms a two-level pyramid: CGp=PooldCG,CAp=PooldCA
The full CGEV concatenates these at each disparity level: CCGEV(d)=[CG(d);CA(d);CGp(d/2);CAp(d/2)]
This fusion scheme encodes both global geometric context and fine local details, which is critical in low-texture, reflective, or occluded regions.
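As a concrete illustration, the volume construction above can be sketched in NumPy. The function names, shapes, and loop-based formulation are illustrative simplifications, not the authors' implementation:

```python
import numpy as np

def groupwise_corr(fl, fr, num_groups, max_disp):
    """Group-wise correlation volume C_corr: per-group inner products between
    left features at x and right features at x - d. fl, fr: (C, H, W)."""
    C, H, W = fl.shape
    cg = C // num_groups                             # channels per group (C / N_g)
    vol = np.zeros((num_groups, max_disp, H, W), dtype=fl.dtype)
    for d in range(max_disp):
        for g in range(num_groups):
            a = fl[g * cg:(g + 1) * cg, :, d:]       # f_l(x), valid where x >= d
            b = fr[g * cg:(g + 1) * cg, :, :W - d]   # f_r(x - d)
            vol[g, d, :, d:] = (a * b).sum(axis=0) / cg
    return vol

def allpairs_corr(fl, fr, max_disp):
    """All-pairs correlation volume C_A: a single inner product per disparity."""
    return groupwise_corr(fl, fr, 1, max_disp)[0]

def disparity_pool(vol):
    """Average-pool along the disparity axis (kernel 2, stride 2), producing
    the coarser level of the disparity pyramid (C_G^p or C_A^p)."""
    D = vol.shape[-3] // 2 * 2
    v = vol[..., :D, :, :]
    return 0.5 * (v[..., 0::2, :, :] + v[..., 1::2, :, :])
```

Concatenating the outputs of these functions along the feature axis at each disparity level yields a CGEV-style volume; in the real network the group-wise volume is additionally filtered by the 3D U-Net before concatenation.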
2. Disparity Initialization with Soft Arg Min
IGEV-Stereo applies a soft-argmin operation over the geometry encoding volume (GEV) to regress an initial estimate d_0, in contrast with standard RAFT-Stereo, which starts all disparities at zero:

d_0 = Σ_{d=0}^{D−1} d · Softmax(C_G(d))

A smooth-L1 loss explicitly supervises this initialization against the ground-truth disparity. On Scene Flow, this yields an initial disparity map that already lies close to the ground truth. This accurate starting state ensures that the subsequent ConvGRU-based iterative updater requires fewer updates, significantly accelerating convergence.
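A minimal NumPy sketch of the soft-argmin initialization, assuming the GEV stores matching scores where larger means a better match:

```python
import numpy as np

def soft_argmin_init(geo_volume):
    """Regress the initial disparity d0 as a softmax-weighted expectation over
    the disparity axis of the geometry encoding volume. geo_volume: (D, H, W)."""
    D = geo_volume.shape[0]
    e = np.exp(geo_volume - geo_volume.max(axis=0, keepdims=True))  # stable softmax
    p = e / e.sum(axis=0, keepdims=True)                            # (D, H, W), sums to 1
    disps = np.arange(D, dtype=np.float64).reshape(D, 1, 1)
    return (p * disps).sum(axis=0)                                  # (H, W) map of d0
```

With a sharply peaked volume the expectation collapses to the peak's disparity, and the smooth-L1 supervision keeps this regression close to ground truth.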
3. ConvGRU-based Iterative Disparity Refinement
For disparity refinement, IGEV-Stereo employs a multi-level ConvGRU stack. At each iteration k:
CGEV is sampled (via linear interpolation) around the current disparity d_k for each pixel, yielding geometry features:
G_k = { C_CGEV(d_k + i) : i ∈ [−r, r] }
Features G_k and the current disparity d_k are encoded by 2-layer CNNs and concatenated to form input x_k.
The ConvGRU cell evolves the hidden state h_k according to:
z_k = σ(Conv([h_{k−1}, x_k], W_z))
r_k = σ(Conv([h_{k−1}, x_k], W_r))
h̃_k = tanh(Conv([r_k ⊙ h_{k−1}, x_k], W_h))
h_k = (1 − z_k) ⊙ h_{k−1} + z_k ⊙ h̃_k
A decoder produces a residual Δd_k, yielding
d_{k+1} = d_k + Δd_k
By initializing with d_0, subpixel-accurate results are typically achieved in 3–8 iterations, a notable reduction compared to the 32 updates required by vanilla RAFT-Stereo.
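The gated update above can be sketched with per-pixel (1×1-convolution) gates; the weight shapes and the einsum formulation are illustrative simplifications of the actual multi-level ConvGRU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convgru_step(h, x, Wz, Wr, Wh):
    """One ConvGRU update with 1x1 'convolutions' (per-pixel linear maps, bias omitted).
    h: (Ch, H, W) hidden state; x: (Cx, H, W) input features; W*: (Ch, Ch + Cx)."""
    hx = np.concatenate([h, x], axis=0)                   # [h_{k-1}, x_k]
    z = sigmoid(np.einsum('oc,chw->ohw', Wz, hx))         # update gate z_k
    r = sigmoid(np.einsum('oc,chw->ohw', Wr, hx))         # reset gate r_k
    cand = np.tanh(np.einsum('oc,chw->ohw', Wh,
                             np.concatenate([r * h, x], axis=0)))  # candidate state
    return (1.0 - z) * h + z * cand                       # h_k

def refine(d, h, decode):
    """Decode the hidden state into a residual and update disparity."""
    return d + decode(h)                                  # d_{k+1} = d_k + delta_d_k
```

Because h_k is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays well-conditioned across iterations.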
4. Network Architecture and Loss Formulation
IGEV-Stereo comprises several tightly integrated modules:
Feature extractor: MobileNetV2 backbone pretrained on ImageNet, upsampled with skip connections to deliver 1/4-scale feature maps, with side outputs at coarser scales (1/8, 1/16, 1/32) to guide the 3D-CNNs.
Context network: A compact ResNet trunk provides multi-scale context maps (width=128), used for ConvGRU initialization and recurrent updates.
Volume builder: Encodes group-wise correlation, all-pairs correlation, disparity pooling, and concatenates to form CGEV.
Iterative updater: Three ConvGRUs (128-dimensional hidden state each, matching the context width), recurrently updating disparity.
Upsampling head: Predicts a learned per-pixel convex combination over a 3×3 neighborhood to upsample disparity from 1/4-scale to full resolution.
The model comprises roughly 12.6M parameters and achieves sub-second inference on full-resolution KITTI images.
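The learned convex upsampling of the head can be sketched as follows (a RAFT-style scheme; the weight layout is an assumption, and the weights are taken to be pre-normalized):

```python
import numpy as np

def convex_upsample(disp, weights, factor=4):
    """Upsample a coarse disparity map by `factor` using per-pixel convex
    combinations of its 3x3 neighborhood. disp: (H, W);
    weights: (9, factor, factor, H, W), softmax-normalized over axis 0."""
    H, W = disp.shape
    pad = np.pad(disp, 1, mode='edge')
    # 3x3 neighborhood of every coarse pixel, stacked: (9, H, W)
    nbr = np.stack([pad[i:i + H, j:j + W] for i in range(3) for j in range(3)])
    up = (weights * nbr[:, None, None]).sum(axis=0)       # (factor, factor, H, W)
    up = up.transpose(2, 0, 3, 1).reshape(H * factor, W * factor)
    return factor * up   # disparities scale with resolution
```

Multiplying by `factor` at the end is essential: a disparity measured at 1/4 resolution corresponds to four times as many pixels at full resolution.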
5. Empirical Results and Comparative Performance
IGEV-Stereo demonstrates high accuracy and speed across established benchmarks:
Scene Flow (test): lower EPE than prior volumetric methods such as PSMNet and GwcNet.
KITTI 2012 (2px, noc): best among published methods at submission.
KITTI 2015 D1-all: ranked first at submission.
Inference time: fastest among the top-10 leaderboard methods.
Ill-posed/reflective regions (KITTI 2012): lower out-Noc error than RAFT-Stereo while using far fewer iterations.
Cross-dataset generalization (Scene Flow training only): lower EPE than RAFT-Stereo on Middlebury (half resolution) and on ETH3D.
This suggests that the architecture not only accelerates convergence but also provides robustness to cross-domain transfer and difficult regions (Xu et al., 2023).
IGEV++ (Xu et al., 2024) generalizes the IGEV framework to Multi-range Geometry Encoding Volumes (MGEV), better handling large disparities and ill-posed regions:
MGEV encodes geometry at three disparity ranges: small, medium, and large.
Adaptive Patch Matching (APM): efficient matching in large-disparity regimes via coarsely quantized, weighted-patch correlation, keeping memory tractable as the search range grows.
Selective Geometry Feature Fusion (SGFF): per-pixel gating of the contributions from the small-, medium-, and large-range volumes, with weights learned from image features and initial disparities.
The ConvGRU updater is retained, with each iteration using fused features for robust updates.
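A per-pixel gating of the three range-specific feature volumes might look like this (a sketch; the gate-producing network is abstracted into precomputed logits):

```python
import numpy as np

def selective_fusion(feat_small, feat_medium, feat_large, gate_logits):
    """Fuse per-range geometry features with per-pixel softmax weights (SGFF sketch).
    feat_*: (C, H, W); gate_logits: (3, H, W) scores derived from image
    features and the initial disparity."""
    e = np.exp(gate_logits - gate_logits.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)        # (3, H, W), convex weights per pixel
    return w[0] * feat_small + w[1] * feat_medium + w[2] * feat_large
```

Because the weights form a convex combination at every pixel, the network can smoothly hand off between ranges, e.g. favoring the large-range volume only where the initial disparity suggests a large displacement.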
Quantitative improvements are substantial, including:
KITTI 2012 (2px, noc) and KITTI 2015 (D1-all): state-of-the-art error rates at competitive inference time.
Middlebury “large-disp” Bad 2.0: a substantial zero-shot error reduction over RAFT-Stereo.
Reflective regions (KITTI 2012, 3px noc): markedly lower error than RAFT-Stereo.
IGEV-MVS extends the approach to multi-view stereo by stacking pairwise CGEVs from multiple source views; on the DTU benchmark it achieved the best overall accuracy among learned methods at the time of publication (Xu et al., 2023).
Adding a single-range GEV to a baseline RAFT model brings roughly a 15% reduction in Scene Flow EPE.
Incorporating MGEV with APM further improves accuracy on large disparities by a substantial relative margin.
Selective feature fusion (SGFF) further reduces errors, especially for ill-posed regions.
Each component thus contributes quantifiably to IGEV’s convergence speed and generalizability:
Multi-scale, adaptive patch matching is necessary for handling large search spaces without prohibitive memory.
Learned per-pixel fusion provides context-sensitive updates essential for robust estimation in challenging scenes.
In summary, IGEV-Stereo and its derivatives (IGEV++, IGEV-MVS) combine geometry-aware volumetric encoding, efficient recurrent updating, and adaptive multi-scale strategies to set new accuracy and speed benchmarks in stereo and multi-view depth estimation (Xu et al., 2023, Xu et al., 2024).