
ConvMambaNet: Hybrid CNN & SSM Model

Updated 22 February 2026
  • ConvMambaNet is a hybrid neural architecture that combines CNN-based local feature extraction with Mamba state-space models for global context modeling.
  • It employs selective scan gating and multi-directional SSM blocks to achieve high accuracy in EEG seizure detection and visual segmentation tasks.
  • Empirical results highlight its favorable accuracy-parameter tradeoff and reduced computational complexity compared to traditional CNN and Transformer designs.

ConvMambaNet is a class of hybrid neural architectures that tightly integrate convolutional neural networks (CNNs) with structured state-space models (SSMs) of the Mamba family. These architectures are designed to harness the local pattern extraction strengths of convolution along with the ability of state-space models to capture long-range dependencies in temporal and spatial domains. ConvMambaNet has been deployed across diverse modalities, including time-series (e.g., EEG), visual recognition, and dense image segmentation, and is characterized by its efficiency, scalability, and favorable accuracy-parameter tradeoff compared to both pure CNN and Transformer designs (Khan et al., 19 Jan 2026, Munir et al., 4 Sep 2025, Chen et al., 2024).

1. Architectural Principles

ConvMambaNet architectures consistently realize a two-phase inductive bias—early stages extract local features via convolution, while later stages use the Mamba SSM block for global context modeling. Key elements include:

  • Convolutional layers: Small-kernel or depthwise convolutions as the feature extraction “stem” and in early blocks.
  • Mamba SSM block: Linear, gated state-space modeling that propagates contextual signals across arbitrarily long windows with O(T) (sequence) or O(N) (2D tokens) complexity.
  • Hybridization strategies: Integration points and interaction schemes (e.g., selective scan, multi-directional parallel scan) are tailored to domain—1D temporal modeling in EEG, 2D multi-axis scan for vision.

These hybrid models structurally resemble multi-stage pyramidal backbones, with increasingly abstract feature representations and complexity handled in later SSM or scan-based blocks.

2. Mathematical Formulation of the Mamba Block

The Mamba block implements a discrete-time state-space model tailored for efficient long-range sequence modeling. For step t (or scan position n in images):

x_t = A x_{t-1} + B u_t
y_t = C x_t + D u_t

where x_t is the state vector, u_t the current input feature (CNN output), and y_t the output embedding. A, B, C, and D are parameter matrices, with initialization and parameterization schemes enforcing stability (eigenvalues of A non-positive at initialization) and channel mixing. For 2D data, multi-directional scanning (e.g., “snake” order) is used, with learnable direction-dependent modifiers (Θ) on the input-to-state path (Munir et al., 4 Sep 2025, Chen et al., 2024).
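The recurrence above can be sketched as a literal per-step scan. This is a minimal NumPy illustration of the math only, not the hardware-efficient parallel scan used in real Mamba implementations; the toy matrices below are assumptions for the example.

```python
import numpy as np

def ssm_scan(u, A, B, C, D):
    """Run the discrete state-space recurrence x_t = A x_{t-1} + B u_t,
    y_t = C x_t + D u_t over a sequence u of shape (T, d_in)."""
    T = u.shape[0]
    x = np.zeros(A.shape[0])
    ys = []
    for t in range(T):
        x = A @ x + B @ u[t]          # state update
        ys.append(C @ x + D @ u[t])   # output projection
    return np.stack(ys)               # shape (T, d_out)

# Toy example: a scalar state with A = 0.9 acts as a leaky accumulator.
A = np.array([[0.9]]); B = np.array([[1.0]])
C = np.array([[1.0]]); D = np.array([[0.0]])
y = ssm_scan(np.ones((4, 1)), A, B, C, D)  # → 1.0, 1.9, 2.71, 3.439
```

The stability condition mentioned above (non-positive eigenvalues of A at initialization in continuous time; |A| < 1 after discretization) is what keeps this accumulation from diverging over long windows.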

Typical enhancements include:

  • Selective scan gating: Coordinate-wise gating to regulate which dimensions of x_t are updated, reducing redundancy and computation.
  • Depthwise local convolution: Kernel width (e.g., d_conv = 4) for local mixing before sequence modeling.
  • Multi-head attention modules: Optionally interleaved, enabling data-driven reweighting of temporal regions (for EEG).
  • LayerNorm/BatchNorm: Applied pre- or post-block to stabilize training.
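Coordinate-wise selective gating can be sketched as follows. This is an illustrative simplification: the exact gating parameterization (e.g., input-dependent discretization steps) varies across Mamba variants, and `W_gate` here is a hypothetical parameter introduced for the example.

```python
import numpy as np

def selective_update(x, u, A, B, W_gate):
    """One gated state-update step: a sigmoid gate computed from the input
    decides, per state coordinate, how much of the new candidate state to
    admit. W_gate is an illustrative (d_state, d_in) parameter."""
    g = 1.0 / (1.0 + np.exp(-(W_gate @ u)))   # per-coordinate gate in (0, 1)
    candidate = A @ x + B @ u                 # ungated SSM update
    return g * candidate + (1.0 - g) * x      # gated coords move, others hold

# With a strongly negative gate, the state barely changes for this input.
x = np.ones(2)
A = 0.5 * np.eye(2); B = np.eye(2)
W_gate = -10.0 * np.ones((2, 2))
x_new = selective_update(x, np.ones(2), A, B, W_gate)  # ≈ x (update suppressed)
```

The point of the mechanism is that state dimensions irrelevant to the current input are left untouched, which is the source of the redundancy and computation savings noted above.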

3. Domain-Specific Instantiations

3.1 EEG Seizure Detection

ConvMambaNet for EEG implements 1D CNNs to process multi-channel time windows, followed by the Mamba block along the temporal dimension. On the CHB-MIT Scalp EEG dataset (18 bipolar channels, 8 s windows, 4 s overlap):

  • Spatial feature extraction: 3-layer 1D-CNN with BatchNorm, ReLU, MaxPool, He/Kaiming initialization.
  • SSM configuration: State and projection matrices sized to the input feature map (B × T × C), with batch- or layer-norm and dropout regularization.
  • Training: Adam optimizer, binary cross-entropy loss; learning rate scheduling, early stopping.
  • Performance: 99% accuracy, F₁-score 0.99, AUC-ROC 0.97, outperforming CNN, RNN, and Transformer baselines at 0.5–1 GFLOP per window, with a real-time factor of ≈400 (Khan et al., 19 Jan 2026).
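The windowing scheme above (8 s windows with 4 s overlap, i.e., a 4 s stride) determines how many training examples a recording yields. A small sketch of that arithmetic:

```python
def num_windows(duration_s, win_s=8, overlap_s=4):
    """Number of fixed-length windows over a recording, using the window
    length and overlap quoted for the CHB-MIT setup above."""
    stride = win_s - overlap_s
    if duration_s < win_s:
        return 0
    return (duration_s - win_s) // stride + 1

num_windows(3600)  # a 1-hour recording → 899 windows
```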

3.2 Visual Recognition and Segmentation

ConvMambaNet-style architectures have been extended to 2D and encoder-decoder frameworks, especially for semantic segmentation and visual recognition.

  • Patch embedding and pyramid structure: U-shaped encoder-decoder with patch merging and skip connections (Chen et al., 2024).
  • 2D “Select-Scan” (SS2D) module: Performs four directional sequence scans per feature map, each modeled via SSM. After blockwise SSM transformation, outputs are aggregated and mapped back to 2D.
  • Linear complexity advantage: Replaces quadratic-transformer self-attention with O(L) state-space scan (L: sequence length or number of pixels).
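The four directional scans of the SS2D module amount to flattening the feature map in different orders before each 1D SSM pass. A minimal sketch of the scan orderings (row-major, column-major, and their reverses; the aggregation and SSM transform are omitted):

```python
import numpy as np

def four_way_scans(feat):
    """Flatten an H×W feature map along four scan directions, in the
    style of SS2D-like multi-directional scanning. feat has shape (H, W)."""
    row = feat.reshape(-1)       # left-to-right, top-to-bottom
    col = feat.T.reshape(-1)     # top-to-bottom, left-to-right
    return [row, row[::-1], col, col[::-1]]

grid = np.arange(6).reshape(2, 3)
scans = four_way_scans(grid)
# scans[0] → [0, 1, 2, 3, 4, 5]; scans[2] → [0, 3, 1, 4, 2, 5]
```

Each ordering gives every pixel a different causal context, so aggregating the four scan outputs approximates omnidirectional receptive fields without quadratic attention.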

Notably, on datasets such as Crack500, Ozgenel, and MC448 for crack segmentation, ConvMambaNet achieves mean Dice scores on par with or better than UNet-EB7, SwinUNet, and SegFormer-B5, with only 27M parameters and 16 GFLOPs versus 70M+ parameters and 22–40 GFLOPs for the comparators. Processing time is ≤16 ms for 448 × 448 inputs, with up to 90.6% fewer FLOPs at high resolutions (Chen et al., 2024).

4. Comparative Analysis and Empirical Results

Empirical comparisons across modalities consistently demonstrate:

| Model                 | Params (M) | Complexity (FLOPs)   | Representative Accuracy  |
|-----------------------|------------|----------------------|--------------------------|
| ConvMambaNet (EEG)    | 1.2        | 0.5–1 GFLOP / window | 99% ACC (CHB-MIT)        |
| ConvMambaNet (Vision) | 27         | 16 GFLOPs / image    | 56–79% mDS (crack seg.)  |
| UNet-EB7              | 70         | 22 GFLOPs            | 55.7–77.3% mDS           |
| SwinUNet              | 68         | 30 GFLOPs            | 53.3–76.1% mDS           |
| SegFormer-B5          | 71         | 40 GFLOPs            | 56.5–78.6% mDS           |

ConvMambaNet establishes new tradeoff frontiers for parameter efficiency and throughput, matching or exceeding Transformer-level performance with significantly reduced compute (Khan et al., 19 Jan 2026, Chen et al., 2024).
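The scaling argument behind this tradeoff frontier can be made concrete. Ignoring constants, self-attention mixes L tokens at roughly L² · d cost while a linear scan costs L · d, so the savings ratio grows linearly with token count. The patch size of 16 below is an assumption chosen for illustration, not a figure from the cited papers:

```python
def token_mixing_cost_ratio(L, d):
    """Ratio of quadratic self-attention mixing cost (~L^2 * d) to a
    linear SSM scan (~L * d); constants omitted, order-of-magnitude only."""
    attention = L * L * d
    ssm = L * d
    return attention / ssm  # simplifies to L

L = (448 // 16) ** 2                 # 28 × 28 = 784 tokens (assumed patching)
token_mixing_cost_ratio(L, 64)       # → 784.0
```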

5. Implementation Details and Training Protocols

Implementation variants adapt core block structure for the task (1D temporal, 2D spatial). Common training parameters include:

  • Optimizers: Adam or AdamW with learning rates 1 × 10⁻³ (EEG) and 5 × 10⁻⁵ (Vision), weight decay, and cosine annealing or plateau-based schedules.
  • Batch sizes: Task-dependent (e.g., 32 for EEG, 2 for segmentation).
  • Regularization: Dropout and L2 penalty in FC/SSM stages.
  • Data preprocessing: Channel-wise z-score normalization (EEG), extensive augmentation and pretraining (Vision), stratified splits to preserve class balance.

Segmentation deployments use Dice loss, skip connections, and patch merging/expansion stages. Channel configuration defaults to C = 64 in the vision case, and is variable in EEG.
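For reference, the cosine-annealing schedule mentioned above can be written in a few lines. The minimum learning rate of 0 and the exact warmup-free form are assumptions for the sketch; implementations typically use a framework scheduler rather than this hand-rolled version.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate, decaying from lr_max to lr_min.
    lr_max = 1e-3 matches the EEG setting above; the vision variant
    would use 5e-5."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

cosine_lr(0, 100)    # → 0.001 (start at lr_max)
cosine_lr(100, 100)  # → 0.0   (decay to lr_min)
```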

6. Computational Complexity and Real-Time Suitability

ConvMambaNet’s use of SSM blocks enables scalability and low-latency inference:

  • EEG: O(T · d_state² + d_state · d_model) cost, ≈0.02 s per 8 s window (RTX 2070 Super).
  • Vision: O(L) scan complexity per direction; 448 × 448 inputs processed in ~16 ms (Chen et al., 2024).
  • Parameter count: 1.2M (EEG) to 27M (Vision) across variants.
  • Energy and hardware compatibility: No quadratic memory or computation bottlenecks of Transformer attention; suited for real-time and edge deployment.
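As a sanity check on the EEG latency figures, the real-time factor follows directly from the window length and per-window inference time quoted above:

```python
def real_time_factor(window_s=8.0, latency_s=0.02):
    """Seconds of signal covered per second of compute, using the EEG
    figures above (8 s window, ≈0.02 s inference per window)."""
    return window_s / latency_s

real_time_factor()  # → 400.0, i.e., ~400x faster than real time
```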

7. Limitations and Clinical/Deployment Considerations

While ConvMambaNet demonstrates robust discrimination and efficiency, several limitations are noted:

  • EEG generalization: Existing EEG implementations have only been validated on pediatric datasets with fixed-channel montages. Adaptation to other populations or setups may require retraining or channel mapping.
  • Artifact sensitivity: Susceptibility to non-neural artifacts (e.g., muscle, movement) without dedicated artifact rejection (Khan et al., 19 Jan 2026).
  • Clinical validation: Prospective trials and regulatory approval are pending for some medical use cases.
  • Extensibility: The model’s static scan directions and block configuration may require adaptation for non-canonical input geometries (e.g., video, 3D).

In visual segmentation, the U-shaped encoder-decoder design with VMamba modules achieves state-of-the-art accuracy with dramatically lower FLOPs and parameter counts, but, as with all such designs, scaling to extremely high resolutions may necessitate further tuning (Chen et al., 2024).


ConvMambaNet represents a family of models that bridge local and global processing using a principled combination of convolution and structured state-space modeling, supporting accurate, efficient, and scalable learning across biomedical and vision domains (Khan et al., 19 Jan 2026, Munir et al., 4 Sep 2025, Chen et al., 2024).
