
Fusion-Mamba Block: Cross-Modal Feature Fusion

Updated 16 February 2026
  • Fusion-Mamba Block (FMB) is a neural module that integrates local and global features using state-space models for multi-scale and cross-modal fusion.
  • It employs dynamic gating, adaptive weighting, and multi-scale aggregation to effectively merge complementary feature streams with linear computational complexity.
  • FMBs are applied in micro-gesture recognition, remote sensing, medical imaging, and object detection, delivering measurable performance improvements.

A Fusion-Mamba Block (FMB) is a neural module that generalizes the state-space model (SSM) schemes in Mamba architectures to enable multi-scale, cross-modal, spatially and temporally aware feature fusion in vision and pattern recognition pipelines. FMBs unify local and global information paths, typically using enhancements such as dynamic gating, cross-modality parameterization, multi-scale context aggregation, and adaptive weighting. The underlying principle is to extend the linear-complexity, input-adaptive SSM of Mamba to fuse either different modalities or complementary feature streams, avoiding the quadratic cost of transformer-based fusion and the locality limits of CNNs. FMBs are widely used for efficient micro-gesture recognition, cross-modal object detection, multi-source remote sensing analysis, medical image fusion, shadow removal, and other vision tasks requiring subtle, context-sensitive integration of heterogeneous cues. Their design flexibility facilitates task-dependent specialization: motion-aware, priority-guided, multi-path, channel-swapped, deformable, and hybrid blocks have all been introduced in the literature to optimize for domain-specific challenges.

1. Core Structural Components and Mathematical Foundation

The general FMB architecture is modular, typically composed of (a) feature projection and normalization stages, (b) multi-scale or cross-modal fusion stages, (c) state-space modeling with dynamic parameters, and (d) aggregation or gating modules. The foundational computation is a learnable SSM block of the form

h_t = \overline{A}_t\, h_{t-1} + \overline{B}_t\, x_t, \quad y_t = C_t\, h_t + D\, x_t

where the evolution and output maps are often dynamically conditioned on the context (e.g., another modality, spatial location, or temporal differences) via MLP-generated parameters or convolutions (Li et al., 1 Jul 2025, Gao et al., 2024). Multi-path variants extend this by simultaneously processing sequences from multiple sources, introducing cross-state or cross-view parameterization. Gating mechanisms—learnable scalars or vectors—provide further dynamic selection, often deployed as elementwise gates in dual-branch/multi-input settings (Dong et al., 2024, Sun et al., 10 Nov 2025).
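The recursion can be sketched as a per-step scan in which diagonal (per-channel) evolution and input maps are generated from a context stream, e.g. another modality. This is a minimal NumPy sketch: the sigmoid/tanh parameterization and all names are illustrative assumptions, not a specific paper's recipe.

```python
import numpy as np

def dynamic_ssm_scan(x, ctx, Wa, Wb, C, D):
    """Scan implementing h_t = A_t * h_{t-1} + B_t * x_t and
    y_t = C * h_t + D * x_t, with per-channel A_t, B_t dynamically
    generated from a context sequence (e.g. the other modality).

    x, ctx: (T, d) token sequences; Wa, Wb: (d, d) parameter
    generators; C, D: (d,) output and skip maps.
    """
    T, d = x.shape
    h = np.zeros(d)
    ys = []
    for t in range(T):
        A_t = 1.0 / (1.0 + np.exp(-(ctx[t] @ Wa)))  # per-channel decay in (0, 1)
        B_t = np.tanh(ctx[t] @ Wb)                  # per-channel input map
        h = A_t * h + B_t * x[t]
        ys.append(C * h + D * x[t])
    return np.stack(ys)                             # (T, d)
```

Because every step reads only the previous state, the cost is linear in the sequence length, in contrast to the pairwise score matrix of attention.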

A common block organization involves:

  • Feature tensor preparation: Project input tokens (e.g., modalities, temporal slices) into shared latent spaces by 1×1 or depthwise convolutions, layer normalizations, or MLPs.
  • Local and global context modeling: Apply 3D/2D convolutions or spatial SSMs to extract neighborhood and long-range dependencies, respectively.
  • Motion or priority awareness: Encode temporal/spatial change via difference operators or learned scoring networks, informing token selection or ordering.
  • Adaptive fusion: Pool multi-scale or multi-modal outputs and merge by softmax-weighted summation or dynamic scalar gating.
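The four stages above can be sketched end-to-end with plain NumPy stand-ins. All names are hypothetical; a moving average stands in for the local convolution path and a cumulative mean for the global SSM path.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def fmb_forward(tokens_a, tokens_b, proj_a, proj_b, gate):
    """Illustrative FMB skeleton: (a) projection + normalization,
    (b) local context, (c) global context, (d) gated fusion.

    tokens_a, tokens_b: (N, d) feature streams from two sources
    proj_a, proj_b: (d, d) projections (stand-ins for 1x1 convs)
    gate: scalar in [0, 1] blending the two context paths
    """
    # (a) project both streams into a shared latent space
    za = layer_norm(tokens_a @ proj_a)
    zb = layer_norm(tokens_b @ proj_b)
    # (b) local context: 3-tap moving average per channel
    local = np.stack(
        [np.convolve(c, np.ones(3) / 3, mode="same") for c in za.T], axis=1)
    # (c) global context: running (scan-style) mean over the sequence
    glob = np.cumsum(zb, axis=0) / np.arange(1, len(zb) + 1)[:, None]
    # (d) adaptive fusion by a scalar gate
    return gate * local + (1.0 - gate) * glob
```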

2. Motion-Aware and Multi-Scale Spatiotemporal Modeling

Motion-awareness is essential in FMBs for tasks such as micro-gesture recognition and remote sensing. In "MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition" (Li et al., 12 Oct 2025), the FMB explicitly computes a central frame difference (CFD) tensor,

D_t = F_t - \frac{1}{2}\left(F_{t-1} + F_{t+1}\right),

to highlight frame-local motion. At multiple scales, each state neighborhood is fused via 3D convolution:

S_k(X)_{c,t,h,w} = \sum_{(\delta\tau,\,\delta h,\,\delta w)\in N_k} W_k(\delta\tau, \delta h, \delta w)\cdot X_{c,\,t+\delta\tau,\,h+\delta h,\,w+\delta w}

where N_k defines a k-scale local window and W_k are learned weights. Appearance (S_k(F)) and motion (S_k(D)) streams are blended per scale by a learnable gate θ_k. An adaptive scale-weighting module (ASWM) concatenates all scale outputs, computes attention logits via 3D convolutions, and fuses via softmax over scales, producing the output tensor of the FMB. This architecture maintains linear complexity while capturing the multi-scale spatiotemporal cues crucial for fine-grained action recognition.
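A minimal NumPy sketch of the central frame difference and the gated multi-scale blend follows. Uniform box filters stand in for the learned 3D convolutions W_k, and the softmax weights are supplied directly rather than predicted by the ASWM.

```python
import numpy as np

def central_frame_difference(F):
    """D_t = F_t - 0.5 * (F_{t-1} + F_{t+1}); edge frames replicated."""
    Fp = np.pad(F, ((1, 1), (0, 0), (0, 0)), mode="edge")  # (T+2, H, W)
    return F - 0.5 * (Fp[:-2] + Fp[2:])

def box_filter(X, k):
    """Uniform average over a (2k+1)^3 (t, h, w) neighborhood."""
    Xp = np.pad(X, k, mode="edge")
    r = 2 * k + 1
    out = np.zeros_like(X, dtype=float)
    for dt in range(r):
        for dh in range(r):
            for dw in range(r):
                out += Xp[dt:dt + X.shape[0],
                          dh:dh + X.shape[1],
                          dw:dw + X.shape[2]]
    return out / r**3

def msf_fuse(F, thetas, scale_logits):
    """Blend appearance and motion per scale with gate theta_k, then
    merge scales by softmax weights (stand-in for the ASWM)."""
    D = central_frame_difference(F)
    outs = []
    for k, theta in zip((1, 2, 3), thetas):
        outs.append(theta * box_filter(F, k) + (1.0 - theta) * box_filter(D, k))
    w = np.exp(scale_logits) / np.exp(scale_logits).sum()  # softmax over scales
    return sum(wk * o for wk, o in zip(w, outs))
```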

3. Cross-Modality and Cross-Source State-Space Fusion

For multi-modal and multi-source fusion (e.g., RGB-IR, LiDAR-camera, HSI-SAR), FMBs recast the SSM to perform cross-modality state evolution:

  • Symmetric dual-branch blocks: Each stream processes its own input while using the other as a parameter modulator. E.g., in UAVD-Mamba (Li et al., 1 Jul 2025), the FMB treats one modality's tokens as input and the other's as context, with state matrices A,BA, B dynamically generated based on the context branch:

h_t = A(x_t^{\text{self}}, x_t^{\text{other}})\cdot h_{t-1} + B(x_t^{\text{self}}, x_t^{\text{other}})\cdot x_t^{\text{self}}

This approach is mirrored in the multi-source Fus-Mamba block for remote sensing, where one modality generates time-varying SSM parameters, scanning the other as input (Gao et al., 2024). This architecture closely resembles cross-attention, but replaces the quadratic transformer cost with input-adaptive, linear state evolution.

  • Priority- and difference-guided permutation: In DEPF (Li et al., 9 Sep 2025), FMBs compute per-token priority from the RGB–IR difference and reorder the input to the SSM based on descending priority. This focuses model capacity on target-relevant tokens (e.g., vehicles).
  • Channel-swapping and channel-wise mixing: In XFMamba (Zheng et al., 4 Mar 2025) and related variants, shallow fusion occurs by interleaving channels from each branch, while deep fusion combines feature streams using a shared SSM decoder to align representations across views.
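The priority-guided reordering can be sketched as below. The L1 magnitude of the RGB-IR difference is a hypothetical stand-in for DEPF's learned scoring network, and `scan_fn` abstracts whichever sequence model (e.g. an SSM scan) consumes the reordered tokens.

```python
import numpy as np

def priority_guided_scan(rgb, ir, scan_fn):
    """Reorder tokens by a priority score derived from the RGB-IR
    difference, scan them in descending-priority order, then restore
    the original token order.

    rgb, ir: (N, d) aligned token streams
    scan_fn: shape-preserving sequence model applied to the fused tokens
    """
    score = np.abs(rgb - ir).sum(axis=-1)   # (N,) per-token priority
    order = np.argsort(-score)              # descending priority
    fused = scan_fn(rgb[order] + ir[order]) # high-priority tokens scanned first
    out = np.empty_like(fused)
    out[order] = fused                      # undo the permutation
    return out
```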

4. Multi-Path, Multi-Stage, and Multi-Domain Fusion Schemes

Modern FMB instances frequently coordinate multiple streams through cascaded or parallel blocks, exploiting both hierarchical and cross-domain structure:

  • Dual- and multi-phase fusion: Networks such as MambaDFuse (Li et al., 2024) employ a shallow channel-exchange fusion, followed by a deep multi-modal FMB stack where local and global detail streams interact through Mamba SSMs with per-token gating. The iterative gating mechanism, based on context from the fused stream, modulates the strength of detail exchange.
  • Algorithmic and iterative fusion: "EVM-Fusion" (Yang, 23 May 2025) demonstrates a two-stage FMB using (1) a cross-modal attention layer that integrates DenseNet-Mamba, UNet-Mamba, and traditional feature paths, and (2) a neural algorithmic fusion (NAF) controller. The NAF iteratively combines candidate fusion primitives via a GRU-controlled weighting, providing expressive, explainable adaptive fusion.
  • Spatial-frequency and mixed-domain coupling: SFMFusion (Sun et al., 10 Nov 2025) deploys FMBs that separately process the spatial, frequency, and channel domains, then combine outputs. The Dynamic Fusion Mamba Block (DFMB) applies spatial 2D-SSM with dynamic gating for two IR branches, fusing them according to a learned spatially-resolved sigmoid map.
  • Wavelet-driven and multi-scale approaches: WaveMamba (Zhu et al., 24 Jul 2025) decomposes feature maps via DWT into frequency sub-bands, applies FMBs to the LL (low-frequency) channels with channel-swapping and deep gated attention, and fuses high-frequency bands by an absolute-maximum strategy, followed by inverse DWT to preserve both detail and context.
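To make the wavelet-driven scheme concrete, here is a one-level 2-D Haar DWT and the absolute-maximum rule for high-frequency bands. The FMB processing of the LL band and the inverse DWT are omitted; even-sized inputs are assumed.

```python
import numpy as np

def haar_dwt_1level(X):
    """One-level 2-D Haar DWT of an even-sized map -> LL, LH, HL, HH."""
    a = (X[0::2, :] + X[1::2, :]) / 2.0   # row averages
    d = (X[0::2, :] - X[1::2, :]) / 2.0   # row details
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low (context)
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0  # high-high (fine detail)
    return LL, LH, HL, HH

def absmax_fuse(band_a, band_b):
    """Absolute-maximum rule: keep the source with larger magnitude."""
    return np.where(np.abs(band_a) >= np.abs(band_b), band_a, band_b)
```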

5. Gating, Weighting, and Dynamic Fusion in State-Space

The fusion interface between streams in FMBs is almost universally governed by gating, weighting, or attention-based dynamic selection:

  • Learnable scalar gating (θ_k): Blends motion and appearance paths in scale-specific Mamba fusions (Li et al., 12 Oct 2025).
  • Softmax over scales or attention weights: Used for adaptive scale merging in spatiotemporal blocks or channel/re-weighting in cross-view fusion (Li et al., 12 Oct 2025, Zheng et al., 4 Mar 2025).
  • Sigmoid or per-location dynamic maps: Applied in spatial-frequency blocks and DFMB for local, dynamic element-wise merge of parallel features (Sun et al., 10 Nov 2025).
  • GRU/MHA-based mixers: Orchestrate algorithmic fusion over multiple input paths in explainable architectures with learnable fusion trajectories (Yang, 23 May 2025).

This dynamic adaptation underpins the ability of FMBs to focus modeling power on contextually important representations, mitigate redundancy, and provide robustness to heterogeneity in cross-source data.
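A per-location sigmoid gate of the kind used in DFMB-style blocks can be sketched as follows; predicting the gate from the concatenated features via a single linear map is an illustrative assumption.

```python
import numpy as np

def sigmoid_gate_fuse(fa, fb, Wg):
    """Element-wise dynamic merge of two parallel feature maps.

    fa, fb: (H, W, d) feature maps from two branches
    Wg: (2d, 1) gate predictor applied to the concatenated features
    Returns a convex combination g * fa + (1 - g) * fb with a
    spatially-resolved gate map g in (0, 1).
    """
    logits = np.concatenate([fa, fb], axis=-1) @ Wg       # (H, W, 1)
    g = 1.0 / (1.0 + np.exp(-logits))                     # sigmoid gate map
    return g * fa + (1.0 - g) * fb
```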

6. Computational Complexity, Parameterization, and Empirical Results

FMBs are designed to preserve Mamba's key advantage: linear complexity in sequence length n, i.e., O(n), as opposed to O(n²) for transformers. This is achieved by:

  • Restricting context modeling to state-space recursion or localized convolution, never to pairwise attention over the full token set.
  • Sharing SSM parameters or using branch-specific parameter generators, with projections per scale, path, or modality.
  • Minimal overhead: even large-scale variants (e.g., in (Li et al., 1 Jul 2025)) introduce less than 1–2 GFLOPs or ≈200k parameters per stage in typical settings.

Empirically, FMB-equipped architectures consistently surpass both vanilla Mamba SSMs and transformer fusion baselines in terms of accuracy on action recognition, object detection, medical classification, and cross-modality fusion, while maintaining or improving computational efficiency. Representative gains include +0.7–2% mAP in UAV detection (Li et al., 1 Jul 2025, Li et al., 9 Sep 2025), robust cross-modal improvements in visual recognition, and perceptual fidelity improvements of up to 4–8% in medical and fused imagery (Li et al., 2024, Sun et al., 10 Nov 2025).
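A back-of-envelope comparison shows where the linear-complexity advantage comes from; the token count, width, and state size below are illustrative, not taken from the cited papers.

```python
# Self-attention over n tokens of width d costs ~ n^2 * d multiply-adds for
# the score matrix alone, while an SSM scan with state size N costs ~ n * d * N.
n, d, N = 16_384, 256, 16
attn_flops = n * n * d   # ~6.9e10
ssm_flops = n * d * N    # ~6.7e7
print(attn_flops / ssm_flops)  # 1024.0 -- the n/N ratio
```

The gap grows linearly with sequence length, which is why FMB-based fusion scales to high-resolution or long-sequence inputs where attention-based fusion does not.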

7. Applications and Integration Patterns

FMBs have been employed in a range of vision architectures, including micro-gesture recognition, cross-modal object detection, multi-source remote sensing analysis, medical image fusion, and shadow removal.

FMBs are generally inserted at critical fusion points—after early feature extraction, at every fusion stage, or in neck/depth modules—serving as both information aggregators and context filters.


References:

(Li et al., 12 Oct 2025, Li et al., 1 Jul 2025, Yang, 23 May 2025, Li et al., 9 Sep 2025, Wang et al., 6 Jul 2025, Li et al., 2024, Sun et al., 10 Nov 2025, Zhu et al., 24 Jul 2025, Xie et al., 2024, Zheng et al., 4 Mar 2025, Dong et al., 2024, Zhu et al., 2024, Gao et al., 2024, Li et al., 18 Aug 2025)
