
Dual-Dimensional Interaction Module (DDIM)

Updated 6 January 2026
  • DDIM is a cross-modal fusion mechanism that aligns RGB image and event stream features via dedicated spatial (CSIM) and temporal (CTIM) sub-modules.
  • CSIM employs cross-modal spatial attention and SS2D-based refinement to transfer geometric cues and enhance spatial feature contrast.
  • CTIM utilizes temporal interleaving with bi-directional scanning and attention to achieve microsecond-level temporal fusion, boosting segmentation precision.

The Dual-Dimensional Interaction Module (DDIM) is a cross-modal fusion mechanism introduced in the MambaSeg framework to enable fine-grained spatial and temporal alignment between RGB image features and event stream features for semantic segmentation tasks. Operating at each stage of a dual-branch Mamba architecture, DDIM comprises two sub-modules—Cross-Spatial Interaction Module (CSIM) and Cross-Temporal Interaction Module (CTIM)—that explicitly address both spatial and microsecond-level temporal interactions. The design provides efficient, low-overhead, and progressive fusion of modalities, thereby improving alignment and reducing ambiguity in multimodal perception pipelines (Gu et al., 30 Dec 2025).

1. Structural Overview and Data Flow

DDIM is strategically placed at four multi-scale locations in the MambaSeg encoder to facilitate inter-modal interaction at multiple resolutions. At each stage $i$, the module receives temporally stacked image features $I_i \in \mathbb{R}^{T\times H\times W}$ and event features $E_i \in \mathbb{R}^{T\times H\times W}$, where $T$ denotes the number of event bins. The module processes these as follows:

  1. CSIM performs spatial attention and refinement, yielding $I^S_{i+1}, E^S_{i+1}$.
  2. CTIM enacts temporal attention and alignment, outputting $I^T_{i+1}, E^T_{i+1}$.
  3. Output modalities proceed to the subsequent Mamba Visual State Space (VSS) block.

This interleaved architecture ensures iterative improvement of spatial and temporal consistency between the modalities throughout feature encoding, before unified decoding.
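The stage-wise data flow can be sketched as follows. This is an illustrative skeleton, not the authors' code: the `csim` and `ctim` stubs below are identity-plus-coupling placeholders standing in for the real attention sub-modules, and only the shapes and call order reflect the description above.

```python
import numpy as np

def csim(I, E):
    fused = 0.5 * (I + E)            # stand-in for cross-spatial interaction
    return I + 0.1 * fused, E + 0.1 * fused

def ctim(I, E):
    fused = 0.5 * (I + E)            # stand-in for cross-temporal interaction
    return I + 0.1 * fused, E + 0.1 * fused

T, H, W = 10, 8, 8                   # T event bins, H x W spatial resolution
I = np.zeros((T, H, W))              # temporally stacked image features
E = np.ones((T, H, W))               # event stream features

for stage in range(4):               # four multi-scale DDIM placements
    I, E = csim(I, E)                # spatial alignment first ...
    I, E = ctim(I, E)                # ... then temporal alignment
    assert I.shape == E.shape == (T, H, W)   # both streams keep their shape
```

In the real encoder, each stage's outputs feed the next Mamba VSS block rather than looping in place; the loop here only emphasizes that both streams are updated at every scale.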

2. Cross-Spatial Interaction Module (CSIM)

CSIM accomplishes spatial alignment via three core computations: cross-modal spatial attention, spatial refinement using a 2D Selective Scan (SS2D), and modality-aware residual updates.

2.1 Shallow Fusion and Pooling

  • Computes a shallow fusion $F^S_i = E_i + I_i$.
  • For each modality and the fusion map, applies $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ along the spatial plane; stacks the results to form $X_i \in \mathbb{R}^{6 \times H \times W}$.
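A minimal sketch of this pooling step, assuming the features are laid out as $(T, H, W)$ with $T$ acting as the channel axis being pooled (an assumption; the paper may pool differently):

```python
import numpy as np

T, H, W = 10, 8, 8
E = np.random.default_rng(0).standard_normal((T, H, W))   # event features E_i
I = np.random.default_rng(1).standard_normal((T, H, W))   # image features I_i
F = E + I                                   # shallow fusion F_i = E_i + I_i

maps = []
for feat in (E, I, F):
    maps.append(feat.mean(axis=0))          # AvgPool over channels -> (H, W)
    maps.append(feat.max(axis=0))           # MaxPool over channels -> (H, W)
X = np.stack(maps)                          # X_i in R^{6 x H x W}
assert X.shape == (6, H, W)
```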

2.2 Cross-Modal Spatial Attention

  • Processes $X_i$ with two convolutional layers and a sigmoid to yield $W^S \in \mathbb{R}^{3\times H\times W}$ (split into $W^S_E, W^S_I, W^S_F$).
  • Generates cross-modality sharpened features:

$$\begin{aligned} E^S_c &= E_i \odot W^S_I \odot W^S_F, \\ I^S_c &= I_i \odot W^S_E \odot W^S_F, \\ F^S_c &= \mathrm{Concat}(E^S_c, I^S_c) \end{aligned}$$

where $\odot$ denotes elementwise multiplication.
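A sketch of this attention step with random weights (the hidden width 64 and the $1\times 1$ kernels are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
E = rng.standard_normal((T, H, W))
I = rng.standard_normal((T, H, W))
X = rng.standard_normal((6, H, W))          # pooled stack from section 2.1

def conv1x1(x, c_out):
    w = rng.standard_normal((c_out, x.shape[0])) * 0.1
    return np.einsum('oc,chw->ohw', w, x)   # 1x1 convolution as a channel mix

hidden = np.maximum(conv1x1(X, 64), 0.0)    # first conv + ReLU
WS = 1.0 / (1.0 + np.exp(-conv1x1(hidden, 3)))   # second conv + sigmoid -> (3, H, W)
W_E, W_I, W_F = WS[0], WS[1], WS[2]

E_c = E * W_I * W_F                         # event stream gated by image + fused cues
I_c = I * W_E * W_F                         # image stream gated by event + fused cues
F_c = np.concatenate([E_c, I_c], axis=0)    # Concat -> (2T, H, W)
assert F_c.shape == (2 * T, H, W)
```

Note the cross-gating: the event stream is modulated by the image-derived weight $W^S_I$ and vice versa, which is what transfers cues across modalities.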

2.3 Spatial Refinement with SS2D

  • Passes the concatenated features $F^S_c$ to SS2D, which unfolds the tensor along four spatial directions and executes directional Mamba state-space (S6) scans. Outputs are reassembled for context enrichment.
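A toy sketch of the SS2D idea: unfold the $(C, H, W)$ tensor along four scan directions, run a 1-D recurrence over each (a simple exponential moving average stands in for the learned selective S6 scan), and merge the results. This shows only the four-direction unfolding pattern, not the actual selective-scan parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
F = rng.standard_normal((C, H, W))

def scan_1d(seq, decay=0.9):
    out = np.zeros_like(seq)                # seq: (C, L)
    state = np.zeros(seq.shape[0])
    for t in range(seq.shape[1]):           # linear recurrence along length L
        state = decay * state + (1 - decay) * seq[:, t]
        out[:, t] = state
    return out

row = F.reshape(C, H * W)                   # row-major scan order
col = F.transpose(0, 2, 1).reshape(C, H * W)   # column-major scan order
merged = (
    scan_1d(row).reshape(C, H, W)                                   # left-to-right
    + scan_1d(row[:, ::-1])[:, ::-1].reshape(C, H, W)               # right-to-left
    + scan_1d(col).reshape(C, W, H).transpose(0, 2, 1)              # top-to-bottom
    + scan_1d(col[:, ::-1])[:, ::-1].reshape(C, W, H).transpose(0, 2, 1)  # bottom-to-top
)
assert merged.shape == (C, H, W)
```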

2.4 Modality-Aware Residual Update

  • Splits the refined features: $E^S_s, I^S_s = \mathrm{Split}(F^S_s)$.
  • Applies spatial attention heads to update each modality via:

$$\begin{aligned} E^S_{i+1} &= E_i + E^S_s \odot \mathrm{SA}(E^S_s), \\ I^S_{i+1} &= I_i + I^S_s \odot \mathrm{SA}(I^S_s) \end{aligned}$$

This design enables transfer of structural and geometric cues between the RGB and event domains, enhancing spatial feature contrast while preserving modality identity.
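The residual update above can be sketched as follows, with $\mathrm{SA}$ modeled as a minimal spatial-attention head (channel-pooled descriptors, a $1\times 1$ mix, and a sigmoid; the paper's exact head may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
E, I = rng.standard_normal((2, T, H, W))    # stage inputs E_i, I_i
F_s = rng.standard_normal((2 * T, H, W))    # SS2D-refined concatenation F_s

def SA(x):
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W) pooled maps
    w = rng.standard_normal((1, 2)) * 0.5
    return 1.0 / (1.0 + np.exp(-np.einsum('oc,chw->ohw', w, desc)))  # (1, H, W) gate

E_s, I_s = F_s[:T], F_s[T:]                 # Split(F_s) back into modalities
E_next = E + E_s * SA(E_s)                  # E_{i+1} = E_i + E_s ⊙ SA(E_s)
I_next = I + I_s * SA(I_s)                  # I_{i+1} = I_i + I_s ⊙ SA(I_s)
assert E_next.shape == I_next.shape == (T, H, W)
```

The residual connection back to the unmodified $E_i$ and $I_i$ is what preserves modality identity while the gated term injects the fused context.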

3. Cross-Temporal Interaction Module (CTIM)

CTIM conducts temporal fusion through cross-modal temporal attention, bi-directional Mamba scans, and modality-aware temporal residual connections.

3.1 Temporal Interleaving and Attention

  • Temporally interleaves the image and event sequences: $F^T_i = \mathrm{Insert}(E_i, I_i)$, yielding a tensor in $\mathbb{R}^{2T \times H \times W}$.
  • Performs global max and average pooling to obtain descriptors, then computes temporal attention weights:

$$W^T_F = \sigma \left( \mathrm{Conv}(F^T_{\max}) + \mathrm{Conv}(F^T_{\mathrm{avg}}) \right)$$

  • Applies the weights to each modality: $E^T_c = E_i \odot W^T_F$, $I^T_c = I_i \odot W^T_F$, selectively emphasizing dynamic event-driven changes.
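A sketch of this temporal-attention step (the pooling and projection details are assumptions; the `Conv` layers are modeled as a shared linear projection over the $2T$ descriptors):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
E, I = rng.standard_normal((2, T, H, W))

F = np.empty((2 * T, H, W))
F[0::2], F[1::2] = E, I                     # Insert(E_i, I_i): interleave bins in time

f_max = F.max(axis=(1, 2))                  # global max pooling  -> (2T,) descriptor
f_avg = F.mean(axis=(1, 2))                 # global avg pooling  -> (2T,) descriptor
proj = rng.standard_normal((T, 2 * T)) * 0.1   # stand-in for the Conv projections
W_T = 1.0 / (1.0 + np.exp(-(proj @ f_max + proj @ f_avg)))  # sigma(Conv + Conv) -> (T,)

E_c = E * W_T[:, None, None]                # gate each temporal bin of the event stream
I_c = I * W_T[:, None, None]                # same gates applied to the image stream
assert E_c.shape == I_c.shape == (T, H, W)
```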

3.2 Bi-Directional Mamba Scan

  • Concatenates attended features and reshapes for bi-directional sequence scanning:

$$\begin{aligned} F^T_{\mathrm{fwd}} &= \mathrm{S6}(F^T_{\mathrm{flat}}), \\ F^T_{\mathrm{bwd}} &= \mathrm{S6}(\mathrm{Reverse}(F^T_{\mathrm{flat}})) \end{aligned}$$

  • Sums the forward and backward outputs, then reshapes ($\mathrm{Reshape}(F^T_{\mathrm{fwd}}+F^T_{\mathrm{bwd}})$) to yield bidirectional temporal context.
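The bi-directional pass can be sketched as below, again with a simple linear recurrence standing in for the learned S6 scan:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8
F = rng.standard_normal((2 * T, H, W))      # concatenated attended features

def s6_like(seq, decay=0.9):
    out, state = np.zeros_like(seq), np.zeros(seq.shape[1])
    for t in range(seq.shape[0]):           # recurrence along the temporal sequence
        state = decay * state + (1 - decay) * seq[t]
        out[t] = state
    return out

flat = F.reshape(2 * T, H * W)              # flatten to a length-2T sequence
fwd = s6_like(flat)                         # forward scan
bwd = s6_like(flat[::-1])[::-1]             # backward scan, re-reversed to align
F_b = (fwd + bwd).reshape(2 * T, H, W)      # Reshape(F_fwd + F_bwd)
assert F_b.shape == (2 * T, H, W)
```

Summing the two directions gives every time step access to both past and future bins, which a single causal scan cannot provide.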

3.3 Modality-Aware Temporal Residual

  • Splits output and applies temporal attention, updating via:

$$\begin{aligned} E^T_{i+1} &= E_i + E^T_b \odot \mathrm{TA}(E^T_b), \\ I^T_{i+1} &= I_i + I^T_b \odot \mathrm{TA}(I^T_b) \end{aligned}$$

This process ensures progressive introduction of dynamic event cues into both streams.

4. Normalization, Weights, and Nonlinearities

  • Sigmoid ($\sigma(\cdot)$) restricts attention weights to $[0,1]$, enabling soft gating across all DDIM attention mechanisms.
  • ReLU ($\mathrm{ReLU}(\cdot)$) on initial activations in CSIM prevents negative intermediate values.
  • Global average and max pooling are employed for both spatial (CSIM) and temporal (CTIM) contexts, guiding attention to distinct activation patterns.
  • All convolutions are $1 \times 1$ or small ($3 \times 3$), keeping computational cost and memory requirements low.
  • No batch or layer normalization is used within DDIM; sigmoid gating empirically suffices for stability and efficiency.

5. Empirical Validation and Performance

Ablation studies on the DDD17 dataset quantify DDIM's contributions. The full module (CSIM + CTIM) achieves 77.56% mIoU and 96.33% pixel accuracy, outperforming the previous best (EISNet: 75.03% mIoU) with reduced parameter count (25.44M vs. 34.39M) and computational complexity (15.59G MACs vs. 17.3G).

Detailed breakdowns reveal:

  • Baseline (elementwise fusion): 74.38% mIoU
  • DDIM outperforms alternative fusion strategies (FFM, MRFM, CSF), each providing mIoU in the 76–76.7% range.
  • Isolated CSIM or CTIM individually exceed 76.2% mIoU; both combined yield the highest results.
  • Subcomponent ablation indicates all CSIM (CSA, SS2D, SA) and CTIM (CTA, BTSS, TA) functions are individually beneficial, with optimal results when all are present.

This demonstrates that DDIM improves overall segmentation accuracy and that each of its spatial and temporal mechanisms is individually justified by the ablations (Gu et al., 30 Dec 2025).

6. Implementation Considerations

Efficient implementation of DDIM is achieved through linear-complexity operations and lightweight neural designs.

  • Recommended event bin number: $T = 10$.
  • Four DDIM applications per encoder, corresponding to the multi-scale Mamba VSS stages.
  • All DDIM operations respect $\mathcal{O}(HWT)$ complexity, where $H$, $W$, $T$ are the spatial and temporal dimensions.
  • CSIM convolutions: 6 → mid-$C$ (64–128) → 3 channels.
  • CTIM compresses from $2T$ to $T$ channels via two $1 \times 1$ convolutions.
  • Optimizations include kernel fusion in CSIM, pre-computed pooling indices, and mixed-precision execution for speed.
  • Total module overhead remains below 10% of the backbone's compute cost.
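The channel configurations above can be sketched as plain $1\times 1$ channel mixes (the mid width of 64 is one of the stated 64–128 options, and the $2T$ intermediate width in the CTIM path is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 8, 8

def conv1x1(x, c_out):
    w = rng.standard_normal((c_out, x.shape[0])) * 0.1
    return np.einsum('oc,chw->ohw', w, x)   # 1x1 convolution as a channel mix

X = rng.standard_normal((6, H, W))
W_S = conv1x1(np.maximum(conv1x1(X, 64), 0.0), 3)     # CSIM: 6 -> 64 -> 3 channels

F_T = rng.standard_normal((2 * T, H, W))
Z = conv1x1(np.maximum(conv1x1(F_T, 2 * T), 0.0), T)  # CTIM: 2T -> 2T -> T channels

assert W_S.shape == (3, H, W) and Z.shape == (T, H, W)
```

Because every layer is a pointwise channel mix, the cost of each scales as $\mathcal{O}(HW)$ in the spatial dimensions, consistent with the stated $\mathcal{O}(HWT)$ budget.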

MambaSeg with DDIM uses the AdamW optimizer with typical image-event segmentation data augmentation. Separate learning rates and batch sizes are specified for DDD17 ($2 \times 10^{-4}$, batch size 12) and DSEC ($6 \times 10^{-5}$, batch size 4).

7. Context and Significance in Multimodal Segmentation

DDIM addresses limitations common in prior sensor fusion approaches, which typically prioritize spatial fusion while neglecting the microsecond temporal structure intrinsic to event cameras. By incorporating both spatial and temporal alignment, DDIM enables the exploitation of complementary sensor properties: rich textures from RGB and rapid, edge-focused dynamics from event streams. The result is enhanced robustness in challenging visual conditions (e.g., fast motion, HDR environments), with state-of-the-art empirical performance and scalable computational cost (Gu et al., 30 Dec 2025).
