
Dual-Dimensional Interaction Module (DDIM)

Updated 6 January 2026
  • DDIM is a cross-modal fusion mechanism that aligns RGB image and event stream features via dedicated spatial (CSIM) and temporal (CTIM) sub-modules.
  • CSIM employs cross-modal spatial attention and SS2D-based refinement to transfer geometric cues and enhance spatial feature contrast.
  • CTIM utilizes temporal interleaving with bi-directional scanning and attention to achieve microsecond-level temporal fusion, boosting segmentation precision.

The Dual-Dimensional Interaction Module (DDIM) is a cross-modal fusion mechanism introduced in the MambaSeg framework to enable fine-grained spatial and temporal alignment between RGB image features and event stream features for semantic segmentation tasks. Operating at each stage of a dual-branch Mamba architecture, DDIM comprises two sub-modules—Cross-Spatial Interaction Module (CSIM) and Cross-Temporal Interaction Module (CTIM)—that explicitly address both spatial and microsecond-level temporal interactions. The design provides efficient, low-overhead, and progressive fusion of modalities, thereby improving alignment and reducing ambiguity in multimodal perception pipelines (Gu et al., 30 Dec 2025).

1. Structural Overview and Data Flow

DDIM is strategically placed at four multi-scale locations in the MambaSeg encoder to facilitate inter-modal interaction at multiple resolutions. At each stage $i$, the module receives temporally stacked image features $I_i \in \mathbb{R}^{T\times H\times W}$ and event features $E_i \in \mathbb{R}^{T\times H\times W}$, where $T$ denotes the number of event bins. The module processes these as follows:

  1. CSIM performs spatial attention and refinement, yielding $I^S_{i+1}, E^S_{i+1}$.
  2. CTIM enacts temporal attention and alignment, outputting $I^T_{i+1}, E^T_{i+1}$.
  3. Output modalities proceed to the subsequent Mamba Visual State Space (VSS) block.

This interleaved architecture ensures iterative improvement of spatial and temporal consistency between the modalities throughout feature encoding, before unified decoding.
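The stage-wise data flow can be sketched as follows. This is an illustrative skeleton, not the authors' code: the `csim` and `ctim` stubs below are identity-plus-coupling placeholders standing in for the real attention sub-modules, and only the shapes and call order reflect the description above.

```python
import numpy as np

def csim(I, E):
    fused = 0.5 * (I + E)            # stand-in for cross-spatial interaction
    return I + 0.1 * fused, E + 0.1 * fused

def ctim(I, E):
    fused = 0.5 * (I + E)            # stand-in for cross-temporal interaction
    return I + 0.1 * fused, E + 0.1 * fused

T, H, W = 10, 8, 8                   # T event bins, H x W spatial resolution
I = np.zeros((T, H, W))              # temporally stacked image features
E = np.ones((T, H, W))               # event stream features

for stage in range(4):               # four multi-scale DDIM placements
    I, E = csim(I, E)                # spatial alignment first ...
    I, E = ctim(I, E)                # ... then temporal alignment
    assert I.shape == E.shape == (T, H, W)   # both streams keep their shape
```

In the real encoder, each stage's outputs feed the next Mamba VSS block rather than looping in place; the loop here only emphasizes that both streams are updated at every scale.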

2. Cross-Spatial Interaction Module (CSIM)

CSIM accomplishes spatial alignment via three core computations: cross-modal spatial attention, spatial refinement using a 2D Selective Scan (SS2D), and modality-aware residual updates.

2.1 Shallow Fusion and Pooling

  • Computes a shallow fusion $F^S_i = E_i + I_i$.
  • For each modality and the fusion map, applies $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ along the spatial plane; stacks the results to form $X_i \in \mathbb{R}^{6 \times H \times W}$.
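A minimal sketch of this pooling step, assuming the features are laid out as $(T, H, W)$ with $T$ acting as the channel axis being pooled (an assumption; the paper may pool differently):

```python
import numpy as np

T, H, W = 10, 8, 8
E = np.random.default_rng(0).standard_normal((T, H, W))   # event features E_i
I = np.random.default_rng(1).standard_normal((T, H, W))   # image features I_i
F = E + I                                   # shallow fusion F_i = E_i + I_i

maps = []
for feat in (E, I, F):
    maps.append(feat.mean(axis=0))          # AvgPool over channels -> (H, W)
    maps.append(feat.max(axis=0))           # MaxPool over channels -> (H, W)
X = np.stack(maps)                          # X_i in R^{6 x H x W}
assert X.shape == (6, H, W)
```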

2.2 Cross-Modal Spatial Attention

  • Processes $X_i$ with two convolutional layers and a sigmoid to yield $W^S \in \mathbb{R}^{3\times H\times W}$ (split into $W^S_E, W^S_I, W^S_F$).
  • Generates cross-modality sharpened features:

$$\begin{aligned} E^S_c &= E_i \odot W^S_I \odot W^S_F, \\ I^S_c &= I_i \odot W^S_E \odot W^S_F, \\ F^S_c &= \mathrm{Concat}(E^S_c, I^S_c) \end{aligned}$$

where $\odot$ denotes elementwise multiplication.
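A sketch of this attention step with random weights (the hidden width 64 and the $1\times 1$ kernels are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
E = rng.standard_normal((T, H, W))
I = rng.standard_normal((T, H, W))
X = rng.standard_normal((6, H, W))          # pooled stack from section 2.1

def conv1x1(x, c_out):
    w = rng.standard_normal((c_out, x.shape[0])) * 0.1
    return np.einsum('oc,chw->ohw', w, x)   # 1x1 convolution as a channel mix

hidden = np.maximum(conv1x1(X, 64), 0.0)    # first conv + ReLU
WS = 1.0 / (1.0 + np.exp(-conv1x1(hidden, 3)))   # second conv + sigmoid -> (3, H, W)
W_E, W_I, W_F = WS[0], WS[1], WS[2]

E_c = E * W_I * W_F                         # event stream gated by image + fused cues
I_c = I * W_E * W_F                         # image stream gated by event + fused cues
F_c = np.concatenate([E_c, I_c], axis=0)    # Concat -> (2T, H, W)
assert F_c.shape == (2 * T, H, W)
```

Note the cross-gating: the event stream is modulated by the image-derived weight $W^S_I$ and vice versa, which is what transfers cues across modalities.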

2.3 Spatial Refinement with SS2D

  • Passes the concatenated features $F^S_c$ to SS2D, which unfolds the tensor along four spatial directions and executes directional Mamba state-space (S6) scans. Outputs are reassembled for context enrichment.
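A toy sketch of the SS2D idea: unfold the $(C, H, W)$ tensor along four scan directions, run a 1-D recurrence over each (a simple exponential moving average stands in for the learned selective S6 scan), and merge the results. This shows only the four-direction unfolding pattern, not the actual selective-scan parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
F = rng.standard_normal((C, H, W))

def scan_1d(seq, decay=0.9):
    out = np.zeros_like(seq)                # seq: (C, L)
    state = np.zeros(seq.shape[0])
    for t in range(seq.shape[1]):           # linear recurrence along length L
        state = decay * state + (1 - decay) * seq[:, t]
        out[:, t] = state
    return out

row = F.reshape(C, H * W)                   # row-major scan order
col = F.transpose(0, 2, 1).reshape(C, H * W)   # column-major scan order
merged = (
    scan_1d(row).reshape(C, H, W)                                   # left-to-right
    + scan_1d(row[:, ::-1])[:, ::-1].reshape(C, H, W)               # right-to-left
    + scan_1d(col).reshape(C, W, H).transpose(0, 2, 1)              # top-to-bottom
    + scan_1d(col[:, ::-1])[:, ::-1].reshape(C, W, H).transpose(0, 2, 1)  # bottom-to-top
)
assert merged.shape == (C, H, W)
```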

2.4 Modality-Aware Residual Update

  • Splits the refined features: $E^S_s, I^S_s = \mathrm{Split}(F^S_s)$.
  • Applies spatial attention heads to update each modality via:

$$\begin{aligned} E^S_{i+1} &= E_i + E^S_s \odot \mathrm{SA}(E^S_s), \\ I^S_{i+1} &= I_i + I^S_s \odot \mathrm{SA}(I^S_s) \end{aligned}$$

This design enables transfer of structural and geometric cues between the RGB and event domains, enhancing spatial feature contrast while preserving modality identity.
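The residual update above can be sketched as follows, with $\mathrm{SA}$ modeled as a minimal spatial-attention head (channel-pooled descriptors, a $1\times 1$ mix, and a sigmoid; the paper's exact head may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
E, I = rng.standard_normal((2, T, H, W))    # stage inputs E_i, I_i
F_s = rng.standard_normal((2 * T, H, W))    # SS2D-refined concatenation F_s

def SA(x):
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W) pooled maps
    w = rng.standard_normal((1, 2)) * 0.5
    return 1.0 / (1.0 + np.exp(-np.einsum('oc,chw->ohw', w, desc)))  # (1, H, W) gate

E_s, I_s = F_s[:T], F_s[T:]                 # Split(F_s) back into modalities
E_next = E + E_s * SA(E_s)                  # E_{i+1} = E_i + E_s ⊙ SA(E_s)
I_next = I + I_s * SA(I_s)                  # I_{i+1} = I_i + I_s ⊙ SA(I_s)
assert E_next.shape == I_next.shape == (T, H, W)
```

The residual connection back to the unmodified $E_i$ and $I_i$ is what preserves modality identity while the gated term injects the fused context.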

3. Cross-Temporal Interaction Module (CTIM)

CTIM conducts temporal fusion through cross-modal temporal attention, bi-directional Mamba scans, and modality-aware temporal residual connections.

3.1 Temporal Interleaving and Attention

  • Temporally interleaves the image and event sequences: $F^T_i = \mathrm{Insert}(E_i, I_i)$, yielding a tensor in $\mathbb{R}^{2T \times H \times W}$.
  • Performs global max and average pooling to obtain descriptors, then computes temporal attention weights:

$$W^T_F = \sigma \left( \mathrm{Conv}(F^T_{\max}) + \mathrm{Conv}(F^T_{\mathrm{avg}}) \right)$$

  • Applies the weights to each modality: $E^T_c = E_i \odot W^T_F$, $I^T_c = I_i \odot W^T_F$, selectively emphasizing dynamic event-driven changes.
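A sketch of this temporal-attention step (the pooling and projection details are assumptions; the `Conv` layers are modeled as a shared linear projection over the $2T$ descriptors):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
E, I = rng.standard_normal((2, T, H, W))

F = np.empty((2 * T, H, W))
F[0::2], F[1::2] = E, I                     # Insert(E_i, I_i): interleave bins in time

f_max = F.max(axis=(1, 2))                  # global max pooling  -> (2T,) descriptor
f_avg = F.mean(axis=(1, 2))                 # global avg pooling  -> (2T,) descriptor
proj = rng.standard_normal((T, 2 * T)) * 0.1   # stand-in for the Conv projections
W_T = 1.0 / (1.0 + np.exp(-(proj @ f_max + proj @ f_avg)))  # sigma(Conv + Conv) -> (T,)

E_c = E * W_T[:, None, None]                # gate each temporal bin of the event stream
I_c = I * W_T[:, None, None]                # same gates applied to the image stream
assert E_c.shape == I_c.shape == (T, H, W)
```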

3.2 Bi-Directional Mamba Scan

  • Concatenates attended features and reshapes for bi-directional sequence scanning:

$$\begin{aligned} F^T_{\mathrm{fwd}} &= \mathrm{S6}(F^T_{\mathrm{flat}}), \\ F^T_{\mathrm{bwd}} &= \mathrm{S6}(\mathrm{Reverse}(F^T_{\mathrm{flat}})) \end{aligned}$$

  • Sums the forward and backward outputs, then reshapes ($\mathrm{Reshape}(F^T_{\mathrm{fwd}}+F^T_{\mathrm{bwd}})$) to yield bidirectional temporal context.
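The bi-directional pass can be sketched as below, again with a simple linear recurrence standing in for the learned S6 scan:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
F = rng.standard_normal((2 * T, H, W))      # concatenated attended features

def s6_like(seq, decay=0.9):
    out, state = np.zeros_like(seq), np.zeros(seq.shape[1])
    for t in range(seq.shape[0]):           # recurrence along the temporal sequence
        state = decay * state + (1 - decay) * seq[t]
        out[t] = state
    return out

flat = F.reshape(2 * T, H * W)              # flatten to a length-2T sequence
fwd = s6_like(flat)                         # forward scan
bwd = s6_like(flat[::-1])[::-1]             # backward scan, re-reversed to align
F_b = (fwd + bwd).reshape(2 * T, H, W)      # Reshape(F_fwd + F_bwd)
assert F_b.shape == (2 * T, H, W)
```

Summing the two directions gives every time step access to both past and future bins, which a single causal scan cannot provide.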

3.3 Modality-Aware Temporal Residual

  • Splits output and applies temporal attention, updating via:

$$\begin{aligned} E^T_{i+1} &= E_i + E^T_b \odot \mathrm{TA}(E^T_b), \\ I^T_{i+1} &= I_i + I^T_b \odot \mathrm{TA}(I^T_b) \end{aligned}$$

This process ensures progressive introduction of dynamic event cues into both streams.

4. Normalization, Weights, and Nonlinearities

  • Sigmoid ($\sigma(\cdot)$) restricts attention weights to $[0,1]$, enabling soft gating across all DDIM attention mechanisms.
  • ReLU ($\mathrm{ReLU}(\cdot)$) on initial activations in CSIM prevents negative intermediate values.
  • Global average and max pooling are employed for both spatial (CSIM) and temporal (CTIM) contexts, guiding attention to distinct activation patterns.
  • All convolutions are $1 \times 1$ or small ($3 \times 3$), keeping computational cost and memory requirements low.
  • No batch or layer normalization is used within DDIM; sigmoid gating empirically suffices for stability and efficiency.

5. Empirical Validation and Performance

Ablation studies on the DDD17 dataset quantify DDIM's contributions. The full module (CSIM + CTIM) achieves 77.56% mIoU and 96.33% pixel accuracy, outperforming the previous best (EISNet: 75.03% mIoU) with reduced parameter count (25.44M vs. 34.39M) and computational complexity (15.59G MACs vs. 17.3G).

Detailed breakdowns reveal:

  • Baseline (elementwise fusion): 74.38% mIoU
  • DDIM outperforms alternative fusion strategies (FFM, MRFM, CSF), each providing mIoU in the 76–76.7% range.
  • Isolated CSIM or CTIM individually exceed 76.2% mIoU; both combined yield the highest results.
  • Subcomponent ablation indicates all CSIM (CSA, SS2D, SA) and CTIM (CTA, BTSS, TA) functions are individually beneficial, with optimal results when all are present.

This demonstrates that DDIM improves overall segmentation accuracy and that each of its spatial and temporal mechanisms is individually justified by the ablations (Gu et al., 30 Dec 2025).

6. Implementation Considerations

Efficient implementation of DDIM is achieved through linear-complexity operations and lightweight neural designs.

  • Recommended event bin number: $T = 10$.
  • Four DDIM applications per encoder, corresponding to the multi-scale Mamba VSS stages.
  • All DDIM operations respect $\mathcal{O}(HWT)$ complexity, where $H$, $W$, $T$ are the spatial and temporal dimensions.
  • CSIM convolutions: 6 → mid-$C$ (64–128) → 3 channels.
  • CTIM compresses from $2T$ to $T$ channels via two $1 \times 1$ convolutions.
  • Optimizations include kernel fusion in CSIM, pre-computed pooling indices, and mixed-precision execution for speed.
  • Total module overhead remains below 10% of the backbone's compute cost.
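The channel configurations above can be sketched as plain $1\times 1$ channel mixes (the mid width of 64 is one of the stated 64–128 options, and the $2T$ intermediate width in the CTIM path is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8

def conv1x1(x, c_out):
    w = rng.standard_normal((c_out, x.shape[0])) * 0.1
    return np.einsum('oc,chw->ohw', w, x)   # 1x1 convolution as a channel mix

X = rng.standard_normal((6, H, W))
W_S = conv1x1(np.maximum(conv1x1(X, 64), 0.0), 3)     # CSIM: 6 -> 64 -> 3 channels

F_T = rng.standard_normal((2 * T, H, W))
Z = conv1x1(np.maximum(conv1x1(F_T, 2 * T), 0.0), T)  # CTIM: 2T -> 2T -> T channels

assert W_S.shape == (3, H, W) and Z.shape == (T, H, W)
```

Because every layer is a pointwise channel mix, the cost of each scales as $\mathcal{O}(HW)$ in the spatial dimensions, consistent with the stated $\mathcal{O}(HWT)$ budget.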

MambaSeg with DDIM uses the AdamW optimizer with typical image-event segmentation data augmentation. Separate learning rates and batch sizes are specified for DDD17 ($2 \times 10^{-4}$, batch size 12) and DSEC ($6 \times 10^{-5}$, batch size 4).

7. Context and Significance in Multimodal Segmentation

DDIM addresses limitations common in prior sensor fusion approaches, which typically prioritize spatial fusion while neglecting the microsecond temporal structure intrinsic to event cameras. By incorporating both spatial and temporal alignment, DDIM enables the exploitation of complementary sensor properties: rich textures from RGB and rapid, edge-focused dynamics from event streams. The result is enhanced robustness in challenging visual conditions (e.g., fast motion, HDR environments), with state-of-the-art empirical performance and scalable computational cost (Gu et al., 30 Dec 2025).
