
IGEV-Stereo: Iterative Geometry Encoding Volume

Updated 3 December 2025
  • IGEV-Stereo is a deep network architecture for stereo matching that fuses local and non-local geometric cues using a Combined Geometry Encoding Volume.
  • It employs soft arg-min disparity initialization and a ConvGRU-based iterative updater to achieve subpixel-accurate depth estimation in just 3–8 iterations.
  • The system extends to IGEV++ and IGEV-MVS, demonstrating state-of-the-art performance on benchmarks like Scene Flow and KITTI with efficient inference.

Iterative Geometry Encoding Volume (IGEV-Stereo) refers to a deep network architecture designed for stereo matching that integrates recurrent updates with a geometry-aware and context-rich cost volume. By leveraging lightweight 3D convolutional regularization, multi-scale feature aggregation, and an efficient ConvGRU-based updater, IGEV-Stereo achieves state-of-the-art accuracy and rapid convergence on established benchmarks. Its advances are further extended to multi-range (IGEV++) and multi-view (IGEV-MVS) stereo, yielding strong performance and generalization in a variety of settings (Xu et al., 2023, Xu et al., 2024).

1. Combined Geometry Encoding Volume Construction

The principal innovation of IGEV-Stereo is the Combined Geometry Encoding Volume (CGEV), which synthesizes both local and non-local matching cues across multiple scales, enabling effective disambiguation in ill-posed regions and refinement of fine details. CGEV is constructed by fusing three principal components:

  • Local all-pairs correlation (APC) preserves granular matching evidence.
  • 3D-CNN–filtered cost volume (GEV) encodes non-local geometry and scene context.
  • Disparity-pooled pyramids of APC and GEV capture multi-scale and large-disparity structures.

Given left ($\mathbf f_{l,4}$) and right ($\mathbf f_{r,4}$) feature maps at $1/4$ resolution, group-wise correlation volumes are computed as follows:

$$\mathbf C_{\rm corr}(g,d,x,y) = \frac{1}{C/N_g}\left\langle \mathbf f^g_{l,4}(x,y)\,,\,\mathbf f^g_{r,4}(x-d,y)\right\rangle, \quad g=1,\dots,N_g$$

This is regularized by a lightweight 3D U-Net: $\mathbf C_G = \mathbf R(\mathbf C_{\rm corr})$. At each 3D convolution stage, a channel-wise excitation modulates responses with the sigmoid of higher-level features: $\mathbf C_i' = \sigma(\mathbf f_{l,i}) \odot \mathbf C_i$
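To make the group-wise correlation concrete, the following NumPy sketch builds the volume $\mathbf C_{\rm corr}$ from two feature maps. Shapes, names, and the zero-padding of out-of-view columns are illustrative assumptions; the actual implementation operates on PyTorch tensors at $1/4$ resolution.

```python
import numpy as np

def groupwise_correlation(f_left, f_right, max_disp, num_groups):
    """Build a group-wise correlation volume C[g, d, y, x].

    f_left, f_right: (C, H, W) feature maps, C divisible by num_groups.
    For disparity d the right map is shifted by d pixels, matching
    f_r(x - d, y); columns shifted out of view are zero-padded
    (a sketch convention, not necessarily the paper's).
    """
    C, H, W = f_left.shape
    ch = C // num_groups                         # channels per group, C / N_g
    fl = f_left.reshape(num_groups, ch, H, W)
    vol = np.zeros((num_groups, max_disp, H, W), dtype=f_left.dtype)
    for d in range(max_disp):
        fr = np.zeros_like(f_right)
        if d == 0:
            fr[:] = f_right
        else:
            fr[:, :, d:] = f_right[:, :, :-d]    # f_r(x - d, y)
        frg = fr.reshape(num_groups, ch, H, W)
        # per-group inner product, normalized by C / N_g
        vol[:, d] = (fl * frg).sum(axis=1) / ch
    return vol
```

With identical left and right inputs, the zero-disparity slice reduces to a normalized self-correlation, which is non-negative by construction.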

A parallel APC volume is built: $\mathbf C_A(d,x,y) = \langle\mathbf f_{l,4}(x,y),\,\mathbf f_{r,4}(x-d,y)\rangle$

Disparity pooling forms a two-level pyramid: $\mathbf C_G^p = \mathrm{Pool}_d\,\mathbf C_G, \quad \mathbf C_A^p = \mathrm{Pool}_d\,\mathbf C_A$

The full CGEV concatenates these at each disparity level: $\mathbf C_{\rm CGEV}(d) = \left[\mathbf C_G(d);\; \mathbf C_A(d);\; \mathbf C^p_G(d/2);\; \mathbf C^p_A(d/2)\right]$

This fusion scheme encodes both global geometric context and fine local details, which is critical in low-texture, reflective, or occluded regions.
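The pooling-and-concatenation step above can be sketched as follows in NumPy. The single-channel volumes, the kernel-2 average pooling, and the helper names are assumptions for illustration; the real CGEV concatenates multi-channel feature volumes.

```python
import numpy as np

def disparity_pool(vol):
    """Average-pool a cost volume (D, H, W) along the disparity axis,
    kernel size 2 (a sketch of Pool_d)."""
    D = vol.shape[0] - vol.shape[0] % 2
    return vol[:D].reshape(D // 2, 2, *vol.shape[1:]).mean(axis=1)

def build_cgev(c_g, c_a):
    """Stack [C_G(d), C_A(d), C_G^p(d//2), C_A^p(d//2)] per disparity level.

    c_g, c_a: (D, H, W) single-channel volumes; returns (D, 4, H, W),
    one channel per CGEV component. Two pyramid levels only, as in the
    formula above.
    """
    cg_p, ca_p = disparity_pool(c_g), disparity_pool(c_a)
    levels = []
    for d in range(2 * cg_p.shape[0]):
        levels.append(np.stack([c_g[d], c_a[d], cg_p[d // 2], ca_p[d // 2]]))
    return np.stack(levels)
```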

2. Disparity Initialization with Soft Arg Min

IGEV-Stereo applies a soft-argmin operation over the geometry encoding volume (GEV) to regress an initial estimate $\mathbf d_0$, in contrast with standard RAFT-Stereo, which starts all disparities at zero:

$$\mathbf d_0(x,y) = \sum_{d=0}^{D/4-1} d \cdot \mathrm{Softmax}\bigl(\mathbf C_G(d,x,y)\bigr)$$

A smooth-$L_1$ loss is used to explicitly supervise this initialization:

$$\mathcal L_0 = \mathrm{Smooth}_{L_1}\bigl(\mathbf d_0 - \mathbf d_{\rm gt}\bigr)$$

On Scene Flow, this initialization already lies close to the ground-truth disparity. This accurate starting state ensures that the subsequent ConvGRU-based iterative updater requires fewer updates, significantly accelerating convergence.
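The soft-argmin regression can be sketched directly in NumPy. The convention that the volume stores matching scores (higher is better, so softmax rather than softmax of negated costs) is an assumption of this sketch.

```python
import numpy as np

def soft_argmin_init(c_g):
    """Regress an initial disparity map d_0 from a geometry volume C_G.

    c_g: (D, H, W) volume of matching scores (higher = better match,
    an assumed convention). Softmax over the disparity axis turns the
    scores into a per-pixel distribution; the expectation over d gives
    a subpixel initial estimate.
    """
    scores = c_g - c_g.max(axis=0, keepdims=True)   # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=0, keepdims=True)
    d = np.arange(c_g.shape[0]).reshape(-1, 1, 1)
    return (p * d).sum(axis=0)                      # (H, W) expectation of d
```

A sharply peaked volume at disparity 5 regresses an estimate very close to 5 everywhere, while flatter distributions yield subpixel blends of neighboring hypotheses.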

3. ConvGRU-based Iterative Disparity Refinement

For disparity refinement, IGEV-Stereo employs a multi-level ConvGRU stack. At each iteration $k$:

  1. CGEV is sampled (via linear interpolation) around the current disparity $\mathbf d_k$ for each pixel, within a radius $r$:

$$\mathbf G_k = \bigl\{\mathbf C_{\rm CGEV}(\mathbf d_k + i)\bigr\}_{i=-r}^{\,r}$$

  2. The sampled geometry features $\mathbf G_k$ and the current disparity $\mathbf d_k$ are encoded by 2-layer CNNs and concatenated to form the input $\mathbf x_k$.
  3. The ConvGRU cell evolves the hidden state $\mathbf h_k$ according to:

$$\begin{aligned} \mathbf z_k &= \sigma\bigl(\mathrm{Conv}([\mathbf h_{k-1}, \mathbf x_k], W_z)\bigr)\\ \mathbf r_k &= \sigma\bigl(\mathrm{Conv}([\mathbf h_{k-1}, \mathbf x_k], W_r)\bigr)\\ \tilde{\mathbf h}_k &= \tanh\bigl(\mathrm{Conv}([\mathbf r_k \odot \mathbf h_{k-1}, \mathbf x_k], W_h)\bigr)\\ \mathbf h_k &= (1-\mathbf z_k) \odot \mathbf h_{k-1} + \mathbf z_k \odot \tilde{\mathbf h}_k \end{aligned}$$

  4. A decoder produces a residual $\Delta\mathbf d_k$ from $\mathbf h_k$, yielding

$$\mathbf d_{k+1} = \mathbf d_k + \Delta\mathbf d_k$$

By initializing with the soft-argmin estimate $\mathbf d_0$, subpixel-accurate results are typically achieved in 3–8 iterations, a notable reduction compared to the 32 updates required by vanilla RAFT-Stereo.
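The gate equations of one GRU update can be sketched per pixel in NumPy. For brevity the 3×3 convolutions of the real ConvGRU are replaced by per-pixel (1×1) linear maps, so the weight matrices here are stand-ins, not the actual parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convgru_step(h, x, Wz, Wr, Wh):
    """One GRU update on a spatial hidden state.

    h: hidden state (Ch, H, W); x: input features (Cx, H, W).
    Wz, Wr, Wh: (Ch, Ch + Cx) matrices standing in for the 3x3
    convolutions of the real ConvGRU (1x1 maps for brevity).
    """
    hx = np.concatenate([h, x], axis=0)                    # (Ch+Cx, H, W)
    z = sigmoid(np.tensordot(Wz, hx, axes=([1], [0])))     # update gate
    r = sigmoid(np.tensordot(Wr, hx, axes=([1], [0])))     # reset gate
    rhx = np.concatenate([r * h, x], axis=0)
    h_tilde = np.tanh(np.tensordot(Wh, rhx, axes=([1], [0])))  # candidate
    return (1.0 - z) * h + z * h_tilde                     # gated blend
```

Because the candidate state passes through tanh and the update gate through a sigmoid, the hidden state stays bounded across iterations when started from zero.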

4. Network Architecture and Loss Formulation

IGEV-Stereo comprises several tightly integrated modules:

  • Feature extractor: MobileNetV2 backbone pretrained on ImageNet, upsampled with skip connections to deliver $1/4$-scale feature maps, with side outputs at coarser scales to guide the 3D-CNN regularization.
  • Context network: A compact ResNet trunk provides multi-scale context maps (width=128), used for ConvGRU initialization and recurrent updates.
  • Volume builder: Encodes group-wise correlation, all-pairs correlation, disparity pooling, and concatenates to form CGEV.
  • Iterative updater: Three stacked ConvGRUs, recurrently updating the disparity field.
  • Upsampling head: Predicts a learned per-pixel convex-combination kernel (RAFT-style) to upsample the disparity map from $1/4$ scale to full resolution.
  • Loss: The total loss combines smooth-$L_1$ supervision of the initial disparity with exponentially weighted $L_1$ terms over the $N$ iterative predictions:

$$\mathcal L = \mathrm{Smooth}_{L_1}(\mathbf d_0 - \mathbf d_{\rm gt}) + \sum_{k=1}^{N} \gamma^{\,N-k}\,\lVert \mathbf d_k - \mathbf d_{\rm gt}\rVert_1, \quad \gamma = 0.9$$
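A NumPy sketch of this total loss, assuming the RAFT-style exponential weight $\gamma = 0.9$ and mean reduction over pixels (both assumptions of this sketch):

```python
import numpy as np

def igev_loss(d_init, d_preds, d_gt, gamma=0.9):
    """Smooth-L1 on the initial disparity plus exponentially weighted
    L1 on the N iterative predictions (weight gamma**(N - k)).

    d_init: initial disparity map; d_preds: list [d_1, ..., d_N];
    d_gt: ground-truth disparity. All arrays share one (H, W) shape.
    """
    def smooth_l1(err):
        a = np.abs(err)
        # quadratic below 1 px, linear above, as in the standard definition
        return np.where(a < 1.0, 0.5 * a * a, a - 0.5).mean()

    loss = smooth_l1(d_init - d_gt)
    n = len(d_preds)
    for k, d_k in enumerate(d_preds, start=1):
        loss += gamma ** (n - k) * np.abs(d_k - d_gt).mean()
    return loss
```

Later iterations receive larger weights, so the supervision concentrates on the final refined disparities while still constraining early updates.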

The model comprises approximately 12.6M parameters and achieves sub-second inference on full-resolution KITTI images.

5. Empirical Results and Comparative Performance

IGEV-Stereo demonstrates high accuracy and speed across established benchmarks:

  • Scene Flow (test): EPE of 0.47 px, substantially lower than PSMNet and GwcNet.
  • KITTI 2012 (2px, noc): best among published methods at the time of submission.
  • KITTI 2015 D1-all: 1.59%, ranked first at submission.
  • Inference time: fastest among the top-10 leaderboard entries.
  • Ill-posed/reflective regions (KITTI 2012): lower out-Noc error than RAFT-Stereo while using far fewer iterations.
  • Cross-dataset generalization: lower zero-shot EPE than RAFT-Stereo on both Middlebury (half resolution) and ETH3D.

This suggests that the architecture not only accelerates convergence but also provides robustness to cross-domain transfer and difficult regions (Xu et al., 2023).

6. Extensions: IGEV++, Multi-view & Multi-range Encoding

IGEV++ (Xu et al., 2024) generalizes the IGEV framework to Multi-range Geometry Encoding Volumes (MGEV), better handling large disparities and ill-posed regions:

  • MGEV encodes geometry at three disparity ranges: small, medium, and large.
  • Adaptive Patch Matching (APM): Efficient matching in large-disparity regimes via coarsely quantized, weighted-patch correlation, keeping memory tractable over wide search ranges.

  • Selective Geometry Feature Fusion (SGFF): Per-pixel gating of the contributions from the small-, medium-, and large-range volumes, with weights learned from image features and initial disparities.

  • The ConvGRU updater is retained, with each iteration using fused features for robust updates.
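The per-pixel gated combination at the heart of SGFF can be sketched with a softmax over range-wise gating scores. The exact gating network of IGEV++ is not reproduced here; the logits are taken as given inputs, which is an assumption of this sketch.

```python
import numpy as np

def selective_fusion(feats, logits):
    """Fuse per-range geometry features with per-pixel softmax gates.

    feats: (R, C, H, W) features from R disparity ranges.
    logits: (R, H, W) gating scores, in the full model predicted from
    image features and initial disparities (here supplied directly).
    Returns the (C, H, W) per-pixel convex combination of the R ranges.
    """
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)        # (R, H, W), sums to 1 per pixel
    return (feats * w[:, None]).sum(axis=0)     # broadcast gate over channels
```

With equal logits every range contributes equally; strongly peaked logits let a single range dominate a pixel, which is the behavior SGFF exploits in ill-posed regions.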

Quantitative improvements are substantial, including:

  • Scene Flow: lower EPE than RAFT-Stereo, including on extended large-disparity evaluations, with fewer iterations.
  • KITTI 2012 (2px, noc) and KITTI 2015 (D1-all): top-ranked accuracy at competitive runtimes.
  • Middlebury “large-disp” Bad2.0: substantial zero-shot error reduction over RAFT-Stereo.
  • Reflective regions (KITTI 2012, 3px noc): markedly lower error than RAFT-Stereo.

IGEV-MVS extends the approach to multi-view stereo by stacking pairwise CGEVs from multiple source views; on the DTU benchmark it achieved the best overall accuracy among learned methods at the time of publication (Xu et al., 2023).

7. Ablations and Design Insights

Ablation studies (Xu et al., 2024) show that:

  • Adding a single-range GEV to the RAFT baseline brings roughly a 15% reduction in Scene Flow EPE.
  • Incorporating MGEV with APM further improves accuracy on large disparities by a substantial relative margin.
  • Selective feature fusion (SGFF) further reduces errors, especially for ill-posed regions.

Each component thus contributes quantifiably to IGEV’s convergence speed and generalizability:

  • Geometric regularization with lightweight 3D-CNN is crucial for non-local reasoning.
  • Multi-scale, adaptive patch matching is necessary for handling large search spaces without prohibitive memory.
  • Learned per-pixel fusion provides context-sensitive updates essential for robust estimation in challenging scenes.

In summary, IGEV-Stereo and its derivatives (IGEV++, IGEV-MVS) combine geometry-aware volumetric encoding, efficient recurrent updating, and adaptive multi-scale matching to set new accuracy and speed benchmarks in stereo and multi-view depth estimation (Xu et al., 2023, Xu et al., 2024).
