
Mamba Block DenVisCoM for Dense Vision

Updated 3 February 2026
  • Mamba Block DenVisCoM is a family of state-space neural network modules specialized in dense vision correspondence tasks, including optical flow, stereo disparity, and super-resolution.
  • It employs a split-branch structure with three-path mixing that combines convolutional processing and state-space scanning to capture both local and long-range dependencies.
  • DenVisCoM achieves efficient real-time performance with linear complexity, optimized patch-based processing, and competitive results in autonomous driving and medical imaging applications.

Mamba Block DenVisCoM refers to a family of advanced state-space modeling neural network modules specifically tailored for dense vision correspondence problems, including optical flow, stereo disparity, and super-resolution. DenVisCoM blocks feature in unified hybrid architectures where they serve as a memory- and speed-efficient alternative to standard attention, enabling accurate real-time visual correspondence estimation on high-resolution data (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025, Ji et al., 2024).

1. Motivation and Unified Dense Correspondence

The central challenge addressed by DenVisCoM is the simultaneous, high-resolution estimation of dense pixel correspondences between two input images, covering both optical flow (temporal motion) and disparity (spatial depth/stereo) (Anand et al., 2 Feb 2026). Traditionally, these tasks are solved by distinct pipelines, each duplicating heavy feature extraction, correlation-volume construction, and post-processing, a design that incurs computational redundancy and inconsistent matching. DenVisCoM is designed as a backbone block that inherently supports joint modeling, shares intermediate representations, and enforces an inductive bias for correspondence across tasks, thereby reducing inference cost and improving accuracy (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025).

2. Internal Architecture and Variants

DenVisCoM adopts a split-branch structure, typically realized as follows (Anand et al., 2 Feb 2026, Ji et al., 2024):

  • Input Reshaping: Feature maps for the image pair are concatenated, patchified (e.g., into 14×14 or 7×7 spatial tokens), and reshaped to arrange patch, channel, and token axes as required.
  • Branching: The input tensor is split into channel and patch subspaces, yielding four branches (left and right, each further decomposed).
  • Three-path Mixing:
    • Left and Right Convolution Branches: Each branch independently processes its patch tokens with 1D convolutions and a SiLU nonlinearity, specializing in capturing local, content-adaptive features.
    • Scan (SSM) Branch: Concatenated features from both images are processed by a state-space model (Mamba SSM or its variants), allowing efficient modeling of long-range dependencies across both spatial and inter-image axes.
  • Fusion: Outputs from all branches are concatenated, linearly projected, and (optionally) passed to further attention mechanisms or decoding heads.

This modular split/fusion design allows the DenVisCoM block to explicitly model both self-similarity (intra-image) and correspondence (inter-image) at each backbone stage (Anand et al., 2 Feb 2026).
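As a rough illustration of the split/scan/fuse flow described above (not the authors' implementation), the following toy sketch treats each image's patch features as a 1D sequence of scalars; `denviscom_block`, the smoothing kernel, and the scalar SSM parameters are all hypothetical stand-ins:

```python
import math

def silu(x):
    # SiLU nonlinearity: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def conv1d(seq, kernel):
    # 'same'-padded 1D convolution over a token sequence (single channel)
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(seq))]

def ssm_scan(seq, a=0.9, b=1.0, c=1.0):
    # minimal diagonal SSM: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t
    h, out = 0.0, []
    for x in seq:
        h = a * h + b * x
        out.append(c * h)
    return out

def denviscom_block(left, right, kernel=(0.25, 0.5, 0.25)):
    """Toy split-branch mixing: two local conv branches plus a joint SSM scan."""
    conv_l = [silu(v) for v in conv1d(left, kernel)]
    conv_r = [silu(v) for v in conv1d(right, kernel)]
    # scan branch sees both images concatenated: long-range, cross-image path
    scan = ssm_scan(list(left) + list(right))
    scan_l, scan_r = scan[:len(left)], scan[len(left):]
    # fusion: a simple residual sum standing in for concat + linear projection
    fused_l = [x + c_ + s for x, c_, s in zip(left, conv_l, scan_l)]
    fused_r = [x + c_ + s for x, c_, s in zip(right, conv_r, scan_r)]
    return fused_l, fused_r
```

Note how the scan branch is the only place the two images interact, mirroring the block's division of labor between intra-image locality (convolutions) and inter-image correspondence (state-space scan).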

3. Mathematical Formulation and State-Space Modeling

DenVisCoM leverages several state-space based methodologies:

  • Causal and Non-Causal State-Space Duality: Sequences of image tokens are processed using operators of the form

$$h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t h_t$$

When unrolled, and in particular in the NC-SSD (non-causal state-space duality) variant, the outputs can be written as

$$Y_i = C_i \sum_{j} m_j Z_j$$

where $m_j = 1/A_j$ and $Z_j$ is a per-position linear projection (Anand et al., 16 Nov 2025). The NC-SSD allows all-to-all (bidirectional) communication by removing the traditional causal mask, which is critical for correspondence reasoning.

  • Stereo Cross-Fusion: For stereo, the left and right streams $Z_L$ and $Z_R$ are cross-multiplied to yield a global correlation volume, further processed by state-space operations. This produces cross-attended feature representations that synergize with optical flow mechanisms (Anand et al., 16 Nov 2025).
  • Cost Volume Construction: For flow, matching scores are defined as normalized dot products between feature embeddings, with softmaxed weights yielding point correspondences:

$$P_{\text{flow}}(i, j) = \frac{\exp(\langle f_1(i), f_2(j) \rangle / \sqrt{D})}{\sum_k \exp(\langle f_1(i), f_2(k) \rangle / \sqrt{D})}$$

Flow vectors and disparities are computed as expectation values weighted by correspondence probabilities (Anand et al., 2 Feb 2026).
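The softmax matching and expectation step can be sketched in toy scalar form as follows (hypothetical helper names; features are short Python lists rather than learned embeddings):

```python
import math

def flow_probabilities(f1_i, f2):
    """Softmax-normalized scaled dot products between one query feature f1(i)
    and all candidate features f2(j), as in the P_flow definition above."""
    d = len(f1_i)
    scores = [sum(a * b for a, b in zip(f1_i, f2j)) / math.sqrt(d) for f2j in f2]
    m = max(scores)                          # max-shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_match(f1_i, f2, positions):
    """Flow/disparity as the probability-weighted expectation over positions."""
    return sum(p * x for p, x in zip(flow_probabilities(f1_i, f2), positions))
```

A query feature closest to candidate 0 pulls the expected position toward index 0, which is the soft-argmax behavior the expectation formulation relies on.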

A critical property is that the SSM computation is linear in sequence length, avoiding the quadratic memory and runtime penalty inherent in attention-based cost volumes.
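The linear-time property can be sketched by contrasting the causal $O(L)$ recurrence with a toy scalar version of the NC-SSD all-to-all sum (a constant $A_j$ is assumed for simplicity; function names are hypothetical):

```python
def causal_scan(x, a=0.9):
    """Causal SSM: y_t depends only on x_j for j <= t, via one O(L) pass."""
    h, out = 0.0, []
    for xt in x:
        h = a * h + xt
        out.append(h)
    return out

def nc_ssd(x, c=1.0, a=0.9):
    """Non-causal toy variant: every output sees one shared all-to-all summary,
    Y_i = C_i * sum_j m_j Z_j with m_j = 1/A_j.  The summary is computed once,
    so the whole sequence still costs O(L), not O(L^2)."""
    shared = sum(zj / a for zj in x)
    return [c * shared for _ in x]
```

Both passes touch each token a constant number of times, which is the key contrast with an attention-based cost volume, where every token is compared against every other token.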

4. Hybrid Integration: SSM and Transformer Attention

Contemporary DenVisCoM-based networks integrate both Mamba SSM blocks and Transformer-style attention:

  • Each composite block applies a DenVisCoM (Mamba SSM) layer with residual connection, followed by normalization and a Transformer block comprising both self- and cross-attention (multi-head form). This attention allows the model to capture fine geometric and photometric details that SSMs alone may miss (Anand et al., 2 Feb 2026).

The combined update at each stage is given as

$$X^{(t+1)} = X^{(t)} + A\left(\mathrm{Norm}\left(X^{(t)} + M\left(\mathrm{Norm}(X^{(t)})\right)\right)\right)$$

where $M$ and $A$ denote the DenVisCoM and attention blocks, respectively (Anand et al., 2 Feb 2026). This structure ensures that both global (SSM) and local (attention) dependencies are simultaneously modeled and fused.
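The composite update above can be written schematically with `mamba` and `attn` as pluggable stand-ins for the $M$ and $A$ blocks (the per-element normalization below is a toy stand-in, not the papers' LayerNorm):

```python
def norm(xs):
    """Toy normalization: zero-mean, unit-ish scale over a list of scalars."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    s = (var + 1e-6) ** 0.5
    return [(x - mu) / s for x in xs]

def hybrid_block(x, mamba, attn):
    """X_{t+1} = X_t + A(Norm(X_t + M(Norm(X_t)))), with M and A injected."""
    mid = [xi + mi for xi, mi in zip(x, mamba(norm(x)))]   # SSM sub-block + residual
    return [xi + ai for xi, ai in zip(x, attn(norm(mid)))] # attention sub-block + residual
```

Passing zero functions for both sub-blocks leaves the input unchanged, confirming that the residual pathway carries the signal through untouched.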

5. Real-Time and Memory Efficiency

DenVisCoM prioritizes real-time performance and memory scalability:

  • Linear Complexity: The core SSM operator runs in $O(n \cdot d)$ time, as opposed to $O(n^2 \cdot d)$ for self-attention or classic cost volumes.
  • Patch-Based Reduction: By operating on grouped patches (e.g., 14×14), sequence lengths are reduced by orders of magnitude, enabling inference at 30–50 FPS even at high image resolutions.
  • Efficient Batching: Shared projection weights and factorized convolutions further reduce computation and memory usage (Anand et al., 2 Feb 2026).
  • Pruning in Non-Causal SSM: In the NC-SSD variant, only the top-$K$ magnitude activations are computed per pass, reducing the summation workload from $O(L)$ to $O(K)$ per token (Anand et al., 16 Nov 2025).
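The patch-reduction and top-$K$ pruning arithmetic can be sketched as follows (a toy accounting with hypothetical helper names, not the papers' kernels):

```python
import heapq

def patched_length(h, w, p=14):
    """Token count after grouping pixels into p x p patches: a 448x448 image
    drops from ~200k pixel tokens to about one thousand patch tokens."""
    return (h // p) * (w // p)

def topk_sum(values, k):
    """NC-SSD-style pruning: keep only the top-K magnitude terms of the
    all-to-all sum, shrinking the per-token workload from O(L) to O(K)."""
    kept = heapq.nlargest(k, values, key=abs)
    return sum(kept)
```

The two levers compound: patchifying shrinks the sequence length $L$, and pruning then touches only $K \ll L$ of the remaining terms.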

Empirical benchmarks on a single RTX A6000 report an EPE of 1.34 at 39.9 FPS on KITTI15 optical flow, with a total memory footprint below 300 MB, matching or exceeding specialized baselines. With NC-SSD blocks, EPE can be as low as 0.54 (KITTI15, optical flow) and throughput can reach 51.7 FPS (disparity), outperforming Unimatch and AnyNet along the error-throughput-memory Pareto front (Anand et al., 16 Nov 2025).

6. Applications and Empirical Performance

DenVisCoM blocks and their hybrid networks are deployed in:

  • Joint Optical Flow and Disparity Estimation: Simultaneous, consistent predictions for real-time 3D perception in autonomous driving, robotics, and AR/VR contexts (Anand et al., 2 Feb 2026).
  • Medical Imaging Super-Resolution: Within the Deform-Mamba MRI architecture, DenVisCoM modules combine modulated deformable convolutions with SSM scanning, summing local edge-sensitive features with global sequence information (Ji et al., 2024).
  • Ablation Evidence: Removing the SSM scanning, the deformable convolutions, or the multi-scale fusion degrades MRI super-resolution performance as measured by PSNR and SSIM. The full DenVisCoM achieves PSNR = 32.65 and SSIM = 0.9270 (fastMRI 4×), outperforming alternatives (Ji et al., 2024). On generic correspondence benchmarks, DenVisCoM-based models show consistent improvements across both synthetic (Sintel, VKITTI) and real (KITTI15) datasets (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025).

7. Limitations and Future Directions

Known limitations and remaining challenges include:

  • Handling Large Displacements: While DenVisCoM significantly reduces matching error for moderate motion/disparity, very large motions may require finer multi-scale pyramids or adaptive patching (Anand et al., 2 Feb 2026).
  • Fixed Patch Sizes: Current implementations rely on fixed spatial partitioning (e.g., 14×14). A plausible implication is that adaptive or deformable patch grouping could further enhance flexibility and local detail capture.
  • No Feature Warping: Classical flow methods often incorporate explicit feature warping/refinement steps (e.g., RAFT-style updates), which are not yet present in default DenVisCoM pipelines. Integrating such refinements could further reduce endpoint errors (Anand et al., 2 Feb 2026).
  • Broader Generalization: Performance is robust on established computer vision and medical imaging benchmarks, but further validation across additional dense correspondence domains remains an open avenue.

DenVisCoM exemplifies a shift towards unified, state-space-driven architectures that are both real-time capable and maximally data efficient for dense vision correspondence (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025, Ji et al., 2024).
