Mamba Block DenVisCoM for Dense Vision
- Mamba Block DenVisCoM is a family of state-space neural network modules specialized in dense vision correspondence tasks, including optical flow, stereo disparity, and super-resolution.
- It employs a split-branch structure with three-path mixing that combines convolutional processing and state-space scanning to capture both local and long-range dependencies.
- DenVisCoM achieves efficient real-time performance with linear complexity, optimized patch-based processing, and competitive results in autonomous driving and medical imaging applications.
Mamba Block DenVisCoM refers to a family of advanced state-space modeling neural network modules specifically tailored for dense vision correspondence problems, including optical flow, stereo disparity, and super-resolution. DenVisCoM blocks feature in unified hybrid architectures where they serve as a memory- and speed-efficient alternative to standard attention, enabling accurate real-time visual correspondence estimation on high-resolution data (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025, Ji et al., 2024).
1. Motivation and Unified Dense Correspondence
The central challenge addressed by DenVisCoM is simultaneous, high-resolution estimation of dense pixel correspondences between two input images—accommodating both optical flow (temporal motion) and disparity (spatial depth/stereo) (Anand et al., 2 Feb 2026). Traditionally, these tasks are solved by distinct pipelines, each duplicating heavy feature extraction, correlation volume construction, and post-processing, an approach that incurs computational redundancy and inconsistent matching across tasks. DenVisCoM is designed as a backbone block that inherently supports joint modeling, shares intermediate representations, and enforces a correspondence inductive bias across tasks, thereby reducing inference cost and improving accuracy (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025).
2. Internal Architecture and Variants
DenVisCoM adopts a split-branch structure, typically realized as follows (Anand et al., 2 Feb 2026, Ji et al., 2024):
- Input Reshaping: Feature maps for the image pair are concatenated, patchified (e.g., into 14×14 or 7×7 spatial tokens), and reshaped to arrange patch, channel, and token axes as required.
- Branching: The input tensor is split into channel and patch subspaces, yielding four branches (left and right, each further decomposed).
- Three-path Mixing:
- Left and Right Convolution Branches: Each processes its own image's patches with 1D convolutions and a SiLU nonlinearity, specializing in capturing local, content-adaptive features.
- Scan (SSM) Branch: Concatenated features from both images are processed by a state-space model (Mamba SSM or its variants), allowing efficient modeling of long-range dependencies across both spatial and inter-image axes.
- Fusion: Outputs from all branches are concatenated, linearly projected, and (optionally) passed to further attention mechanisms or decoding heads.
This modular split/fusion design allows the DenVisCoM block to explicitly model both self-similarity (intra-image) and correspondence (inter-image) at each backbone stage (Anand et al., 2 Feb 2026).
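The split/branch/fuse flow above can be sketched numerically. The following is an illustrative NumPy stand-in, not the published layer: the depthwise convolution, diagonal SSM scan, and fusion projection are toy versions with random weights, and the shapes (7×7 patch tokens, 32 channels) are assumptions for the example.

```python
# Hypothetical sketch of the DenVisCoM split-branch forward pass.
# Conv/SSM/projection sub-modules are toy stand-ins with random weights.
import numpy as np

rng = np.random.default_rng(0)
T, C = 49, 32            # 7x7 patch tokens per image, channel width (assumed)

def silu(x):
    return x / (1.0 + np.exp(-x))

def conv1d(x, w):
    """Depthwise 1-D convolution along the token axis (same padding)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + k] * w).sum(axis=0) for t in range(x.shape[0])])

def ssm_scan(x, a=0.9):
    """Toy diagonal state-space scan: h_t = a*h_{t-1} + x_t, y_t = h_t."""
    h = np.zeros(x.shape[1]); ys = []
    for t in range(x.shape[0]):
        h = a * h + x[t]
        ys.append(h.copy())
    return np.stack(ys)

def denviscom_block(f_left, f_right):
    w = rng.standard_normal((3, C)) * 0.1          # shared depthwise kernel
    local_l = silu(conv1d(f_left,  w))             # left conv branch
    local_r = silu(conv1d(f_right, w))             # right conv branch
    joint = np.concatenate([f_left, f_right], axis=0)
    global_lr = ssm_scan(joint)                    # scan (SSM) branch over both images
    fused = np.concatenate(
        [local_l, local_r, global_lr[:T], global_lr[T:]], axis=1)
    w_out = rng.standard_normal((fused.shape[1], C)) * 0.1
    return fused @ w_out                           # linear fusion projection

out = denviscom_block(rng.standard_normal((T, C)),
                      rng.standard_normal((T, C)))
print(out.shape)   # (49, 32)
```

The key structural point is that the conv branches never see the other image, while the scan branch runs over the concatenated token sequence, which is what lets a single block mix intra- and inter-image information.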
3. Mathematical Formulation and State-Space Modeling
DenVisCoM leverages several state-space based methodologies:
- Causal and Non-Causal State-Space Duality: Sequences of image tokens x_1, ..., x_N are processed using operators of the form

  h_t = A h_{t-1} + B x_t,    y_t = C h_t.

Unrolled, the causal output is y_t = Σ_{s≤t} C A^{t−s} B x_s; in particular, in the NC-SSD variant, outputs can be written as

  y_t = Σ_{s=1}^{N} α_{t,s} W x_s,

where α_{t,s} is a data-dependent mixing coefficient and W is a per-position linear projection (Anand et al., 16 Nov 2025). The NC-SSD allows all-to-all (bidirectional) communication by removing the traditional causal mask, which is critical for correspondence reasoning.
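A small numerical check makes the causal/unrolled duality concrete. This is an illustrative scalar-state sketch (B = C = 1, fixed decay a), not the paper's parameterization: the recurrent scan matches its unrolled lower-triangular matrix form, and the non-causal variant simply drops the mask so every token mixes with every other.

```python
# Scalar-state SSM: recurrence vs. unrolled matrix form, causal and non-causal.
import numpy as np

N, a = 6, 0.8                      # sequence length, decay (illustrative)
x = np.arange(1.0, N + 1)          # token sequence x_1..x_N

# Recurrent form: h_t = a*h_{t-1} + x_t, y_t = h_t
h, y_scan = 0.0, []
for t in range(N):
    h = a * h + x[t]
    y_scan.append(h)
y_scan = np.array(y_scan)

# Unrolled form: y_t = sum_{s<=t} a^(t-s) x_s  ==  M_causal @ x
s_idx, t_idx = np.meshgrid(np.arange(N), np.arange(N))
M_causal = np.where(t_idx >= s_idx, a ** (t_idx - s_idx), 0.0)
assert np.allclose(M_causal @ x, y_scan)   # scan == unrolled causal operator

# Non-causal (NC-SSD-style): remove the mask -> all-to-all mixing
M_nc = a ** np.abs(t_idx - s_idx)
y_nc = M_nc @ x                    # every output sees past AND future tokens
print(y_nc.shape)                  # (6,)
```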
- Stereo Cross-Fusion: For stereo, the left and right feature streams F_L and F_R are cross-multiplied to yield a global correlation volume, which is further processed by state-space operations. This produces cross-attended feature representations that synergize with the optical flow mechanisms (Anand et al., 16 Nov 2025).
- Cost Volume Construction: For flow, matching scores are defined as normalized dot products between feature embeddings, with softmaxed weights yielding point correspondences:

  C(i, j) = f_1(i)^T f_2(j) / √D,    p(j | i) = softmax_j C(i, j).

Flow vectors and disparities are then computed as expectation values weighted by the correspondence probabilities, e.g. V(i) = Σ_j p(j | i) (x_j - x_i) (Anand et al., 2 Feb 2026).
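The softmax-expectation matching can be sketched in a few lines. This is a hedged illustration of the mechanics described above: feature values are random, the feature maps are tiny (4×5, 16 channels, assumed for the example), and only the score/softmax/expectation pattern follows the text.

```python
# Cost volume + softmax expectation: flow as probability-weighted displacement.
import numpy as np

rng = np.random.default_rng(1)
H, W, D = 4, 5, 16                           # tiny feature maps (illustrative)
f1 = rng.standard_normal((H * W, D))
f2 = rng.standard_normal((H * W, D))

scores = (f1 @ f2.T) / np.sqrt(D)            # C(i, j) = f1(i)^T f2(j) / sqrt(D)
p = np.exp(scores - scores.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)            # p(j | i) = softmax_j C(i, j)

ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (HW, 2)

flow = p @ coords - coords                   # V(i) = E_j[x_j] - x_i per pixel
print(flow.shape)                            # (20, 2)
```

A disparity head is the one-dimensional special case: the expectation is taken over horizontal offsets along the epipolar line instead of 2-D coordinates.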
A critical property is that the SSM computation is linear in sequence length, avoiding the quadratic memory and runtime penalty inherent in attention-based cost volumes.
4. Hybrid Integration: SSM and Transformer Attention
Contemporary DenVisCoM-based networks integrate both Mamba SSM blocks and Transformer-style attention:
- Each composite block applies a DenVisCoM (Mamba SSM) layer with residual connection, followed by normalization and a Transformer block comprising both self- and cross-attention (multi-head form). This attention allows the model to capture fine geometric and photometric details that SSMs alone may miss (Anand et al., 2 Feb 2026).
The combined update at each stage is given as

  X' = X + M(X),    Y = X' + A(LN(X')),

where M(·) and A(·) denote the DenVisCoM and attention blocks and LN denotes layer normalization (Anand et al., 2 Feb 2026). This structure ensures that both global (SSM) and local (attention) dependencies are simultaneously modeled and fused.
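The two-residual composition can be sketched with minimal stand-ins. In this hedged example, M is a toy causal SSM scan and A is single-head softmax self-attention; both are illustrative placeholders for the actual DenVisCoM and Transformer sub-layers, and the shapes are arbitrary.

```python
# Composite update X' = X + M(X), then Y = X' + A(LN(X')), with toy M and A.
import numpy as np

rng = np.random.default_rng(2)
T, C = 8, 4
X = rng.standard_normal((T, C))

def mamba_stub(x, a=0.9):                 # M: toy causal diagonal SSM scan
    h = np.zeros(x.shape[1]); out = []
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out.append(h.copy())
    return np.stack(out)

def layer_norm(x, eps=1e-5):              # LN over the channel axis
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attn_stub(x):                         # A: single-head self-attention
    s = (x @ x.T) / np.sqrt(x.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

X_prime = X + mamba_stub(X)                   # SSM sub-layer with residual
Y = X_prime + attn_stub(layer_norm(X_prime))  # attention sub-layer with residual
print(Y.shape)                                # (8, 4)
```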
5. Real-Time and Memory Efficiency
DenVisCoM prioritizes real-time performance and memory scalability:
- Linear Complexity: The core SSM operator has O(N) complexity in the token sequence length N, as opposed to O(N²) for self-attention or classic cost-volume construction.
- Patch-Based Reduction: By operating on grouped patches (e.g., 14×14), sequence lengths are reduced by orders of magnitude, enabling inference at 30–50 FPS even at high image resolutions.
- Efficient Batching: Shared projection weights and factorized convolutions further reduce computation and memory usage (Anand et al., 2 Feb 2026).
- Pruning in Non-Causal SSM: In the NC-SSD variant, only the top-K magnitude activations are computed per pass, reducing the summation workload from O(N) to O(K) per token (Anand et al., 16 Nov 2025).
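The top-K pruning idea can be illustrated as follows. This sketch is an assumption-laden toy: the mixing weights are random, and for clarity it computes the dense result and then masks it, whereas a real implementation would gather only the K selected terms per token to realize the O(K) cost.

```python
# Top-K pruning of a non-causal mixing operator: keep the K largest-|w|
# contributors per output token, zero the rest.
import numpy as np

rng = np.random.default_rng(3)
N, C, K = 64, 8, 8
weights = rng.standard_normal((N, N))     # dense all-to-all mixing weights (toy)
values = rng.standard_normal((N, C))      # per-token value projections (toy)

# Select the K largest-magnitude weights in each row, mask out the rest
idx = np.argsort(-np.abs(weights), axis=1)[:, :K]
mask = np.zeros_like(weights)
np.put_along_axis(mask, idx, 1.0, axis=1)
y_pruned = (weights * mask) @ values      # K summation terms per token, not N

y_dense = weights @ values
err = np.abs(y_pruned - y_dense).mean()   # approximation gap from pruning
print(y_pruned.shape)                     # (64, 8)
```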
Empirical benchmarks on a single RTX A6000 report an EPE of 1.34 at 39.9 FPS (KITTI15 flow), with total memory footprints below 300 MB, matching or exceeding specialized baselines. With NC-SSD blocks, EPE can be as low as 0.54 (KITTI15, optical flow) and FPS up to 51.7 (disparity), outperforming Unimatch and AnyNet along the error-throughput-memory Pareto front (Anand et al., 16 Nov 2025).
6. Applications and Empirical Performance
DenVisCoM blocks and their hybrid networks are deployed in:
- Joint Optical Flow and Disparity Estimation: Simultaneous, consistent predictions for real-time 3D perception in autonomous driving, robotics, and AR/VR contexts (Anand et al., 2 Feb 2026).
- Medical Imaging Super-Resolution: Within the Deform-Mamba MRI architecture, DenVisCoM modules combine modulated deformable convolutions with SSM scanning, summing local edge-sensitive features with global sequence information (Ji et al., 2024).
- Ablation Evidence: Removal of either the SSM scanning, deformable convolution, or multi-scale fusion degrades performance in MRI super-resolution as measured by PSNR and SSIM. Full DenVisCoM achieves PSNR = 32.65 and SSIM = 0.9270 (fastMRI 4×), outperforming alternatives (Ji et al., 2024). On generic correspondence benchmarks, DenVisCoM-based models show consistent improvements across both synthetic (Sintel, VKITTI) and real (KITTI15) benchmarks (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025).
7. Limitations and Future Directions
Known limitations and remaining challenges include:
- Handling Large Displacements: While DenVisCoM significantly reduces matching error for moderate motion/disparity, very large motions may require finer multi-scale pyramids or adaptive patching (Anand et al., 2 Feb 2026).
- Fixed Patch Sizes: Current implementations rely on fixed spatial partitioning (e.g., 14×14). A plausible implication is that adaptive or deformable patch grouping could further enhance flexibility and local detail capture.
- No Feature Warping: Classical flow methods often incorporate explicit feature warping/refinement steps (e.g., RAFT-style updates), which are not yet present in default DenVisCoM pipelines. Integrating such refinements could further reduce endpoint errors (Anand et al., 2 Feb 2026).
- Broader Generalization: Performance is robust on established computer vision and medical imaging benchmarks, but further validation across additional dense correspondence domains remains an open avenue.
DenVisCoM exemplifies a shift towards unified, state-space-driven architectures that are both real-time capable and maximally data efficient for dense vision correspondence (Anand et al., 2 Feb 2026, Anand et al., 16 Nov 2025, Ji et al., 2024).