CoSSM: Cross-modal State-Space Modulator
- CoSSM is a multi-modal fusion paradigm that uses state-space models to integrate and align heterogeneous data efficiently.
- It leverages tailored cross-parameter exchange, channel/spatial mixing, and feature-space modulation to achieve superior performance and computational scalability.
- Empirical evaluations show CoSSM’s benefits in object detection, image fusion, and vision-language reasoning with marked parameter efficiency and lower computational overhead.
A Cross-modal State-Space Modulator (CoSSM) is a generic architectural paradigm for multi-modal fusion that leverages state-space models (SSMs) to integrate, align, and propagate information across heterogeneous modalities, while maintaining computational efficiency and strong representational flexibility. Distinct from attention-based fusion, CoSSM structures cross-modal interactions within the SSM formalism—employing tailored cross-parameter exchange, channel/spatial mixing, and feature-space modulation schemes to unify the dynamics of distinct data streams. CoSSMs have been instantiated across tasks such as multispectral object detection, multimodal image fusion, vision-language reasoning, point cloud completion, compressive spectral imaging, and parameter-efficient domain adaptation, resulting in empirically superior and hardware-scalable architectures for cross-modal reasoning.
1. State-Space Model Foundations in Multimodal Fusion
At the core of CoSSM lies the continuous- or discrete-time SSM, typically parameterized as
$h'(t) = A\,h(t) + B\,x(t), \quad y(t) = C\,h(t),$
or discretized (with step size $\Delta$) as
$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \quad y_k = C\,h_k,$
where $x$ is the input sequence, $h$ the latent state, and $y$ the output.
In CoSSMs, this SSM core is augmented by cross-modal interaction operators; key parameters (e.g., output/projector matrices) may be swapped, shared, or modulated by signals from other modalities to enable complementary and shared semantic integration (Shen et al., 19 Jul 2025, Sun et al., 9 Jan 2026).
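The discretized recurrence $h_k = \bar{A}h_{k-1} + \bar{B}x_k$, $y_k = C h_k$ can be sketched as a toy NumPy scan; all dimensions and the random parameters below are illustrative, not any cited paper's configuration:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Run the discretized SSM recurrence h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    d_state = A_bar.shape[0]
    L, d_out = x.shape[0], C.shape[0]
    h = np.zeros(d_state)
    y = np.zeros((L, d_out))
    for k in range(L):
        h = A_bar @ h + B_bar @ x[k]   # latent state update
        y[k] = C @ h                   # output projection
    return y

# toy example: scalar input, 4-dim latent state, 8 time steps
rng = np.random.default_rng(0)
A_bar = 0.9 * np.eye(4)               # stable diagonal dynamics (assumption)
B_bar = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
x = rng.standard_normal((8, 1))
y = ssm_scan(A_bar, B_bar, C, x)
print(y.shape)  # (8, 1)
```

Because the scan is linear in its input, doubling `x` doubles `y`, which is the property cross-modal parameter modulation exploits: swapping or reweighting $B$ and $C$ changes how each stream is written into and read out of the shared state.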
2. Cross-Modal Interaction Mechanisms
CoSSMs instantiate bespoke cross-modal couplings, such as:
- Cross-Parameter Swapping (CP-SSM): State-to-output matrices of each modality are exchanged, so the projection of states in one modality is parameterized by the other. Given features $F_A$, $F_B$ and learned per-sequence parameter triplets $(\Delta_A, B_A, C_A)$, $(\Delta_B, B_B, C_B)$, CP-SSM sets $C_A \leftarrow C_B$, $C_B \leftarrow C_A$, and each SSM branch runs its recurrence and projection with the opposite modality's output matrix (Shen et al., 19 Jul 2025).
- Shared-Parameter SSM (SP-SSM): Both modalities share the same SSM parameters, enforcing joint semantic alignment. A fused feature is projected to a universal parameter triplet $(\Delta, B, C)$, and parallel SSMs are run for $F_A$ and $F_B$ using identical dynamics (Shen et al., 19 Jul 2025).
- Dual-State Channel and Spatial Exchange: In image-based fusion, dedicated channel-exchange modules model dependencies across modalities by cross-scanning and swapping attention-like triplets (akin to QKV) prior to SSM recurrence. Spatial-exchange modules further concatenate or interleave spatial slices, then perform multi-directional SSM scanning for comprehensive spatial fusion (Sun et al., 9 Jan 2026).
- Feature-Space Domain Injection: Domain-representative embeddings, distilled by compact offset encoders, are injected directly into (frozen) pretrained architectures at multiple depths via modulator layers, conditioning the forward pass by additive feature offsets without altering backbone weights (Xian et al., 24 Dec 2025).
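The first of these couplings, cross-parameter swapping, can be illustrated with a minimal NumPy sketch. The shapes, the stable diagonal $A$, and the function names are illustrative assumptions, not the MS2Fusion implementation:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Discretized SSM recurrence: h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for xk in x:
        h = A_bar @ h + B_bar @ xk
        ys.append(C @ h)
    return np.stack(ys)

def cp_ssm(x_a, x_b, params_a, params_b):
    """Cross-Parameter Swapping: each branch keeps its own state dynamics
    (A, B) but projects its states with the OTHER modality's output matrix C,
    so each output is expressed in the other stream's semantics."""
    A_a, B_a, C_a = params_a
    A_b, B_b, C_b = params_b
    y_a = ssm_scan(A_a, B_a, C_b, x_a)   # modality A projected via C_B
    y_b = ssm_scan(A_b, B_b, C_a, x_b)   # modality B projected via C_A
    return y_a, y_b

# toy setup: two 6-step sequences of 3-dim features, 5-dim latent state
rng = np.random.default_rng(1)
make = lambda: (0.8 * np.eye(5), rng.standard_normal((5, 3)), rng.standard_normal((3, 5)))
y_a, y_b = cp_ssm(rng.standard_normal((6, 3)), rng.standard_normal((6, 3)), make(), make())
print(y_a.shape, y_b.shape)  # (6, 3) (6, 3)
```

The shared-parameter variant (SP-SSM) is the degenerate case where both calls receive the same `(A, B, C)` triplet.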
3. Cross-Modal State-Space Modulator Designs
A schematic CoSSM comprises three principal operations:
- Difference-driven Reweighting: Local feature discrepancies (or more general domain-guided signals) act as attention masks, emphasizing complementary content and suppressing redundancy (Sun et al., 9 Jan 2026).
- Dual State-Space Channel Fusion: Cross-modal triplets—extracted from linearly projected, cross-scanned channel features—are swapped and routed through 1D SSMs, promoting cross-modal functional complementarity at the channel level (Sun et al., 9 Jan 2026).
- Spatial Cross-Scanning: Features are interleaved and flattened along spatial axes (row, column, patch), enabling spatial SSM recurrences that globally aggregate across modalities with linear computational complexity, thereby maintaining hardware efficiency and global receptive field (Sun et al., 9 Jan 2026, Meng et al., 22 May 2025).
A concise block flow is:
F_A, F_B --(Δ = diff)--> [F_A', F_B'] --(Channel SSM)--> [F_A^c, F_B^c] --(Spatial SSM)--> [F_A'', F_B'']
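This block flow can be sketched end to end in NumPy. The channel stage below is a deliberately simplified stand-in (a cross-modal average) for the full swapped-triplet channel SSM, and the sigmoid-of-difference mask is one illustrative form of difference-driven reweighting; shapes are toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diff_reweight(F_a, F_b):
    """Stage 1: local feature discrepancy acts as a soft attention-like mask
    emphasizing complementary content (illustrative form)."""
    mask = sigmoid(np.abs(F_a - F_b))
    return F_a * mask, F_b * mask

def channel_mix(F_a, F_b):
    """Stage 2: simplified stand-in for dual-state channel fusion -- a
    residual cross-modal average instead of the full swapped-triplet SSM."""
    mixed = 0.5 * (F_a + F_b)
    return F_a + mixed, F_b + mixed

def spatial_cross_scan(F_a, F_b):
    """Stage 3: interleave the two modalities along the spatial axis and
    flatten, yielding one sequence a 1D SSM can scan in linear time."""
    H, W, D = F_a.shape
    stacked = np.stack([F_a, F_b], axis=1)   # (H, 2, W, D): rows interleaved
    return stacked.reshape(2 * H * W, D)

# toy 4x4 feature maps with 8 channels for two modalities
rng = np.random.default_rng(0)
F_a, F_b = rng.standard_normal((2, 4, 4, 8))
F_a, F_b = diff_reweight(F_a, F_b)
F_a, F_b = channel_mix(F_a, F_b)
seq = spatial_cross_scan(F_a, F_b)
print(seq.shape)  # (32, 8)
```

Multi-directional scanning (row, column, patch order) simply applies `spatial_cross_scan` under different axis permutations and merges the resulting SSM outputs.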
In language-visual models, CoSSM employs token-grid correlation modules to compute lightweight relevance maps between modalities, then modulates SSM dynamics (e.g., via FiLM) with those maps, replacing quadratic cross-attention layers (Trinh et al., 14 Nov 2025). In parameter-efficient domain adaptation, modulator submodules transform external offset codes and inject them into layer-normalized states throughout the backbone (Xian et al., 24 Dec 2025).
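A minimal sketch of this token-grid-plus-FiLM pattern follows, assuming cosine-similarity relevance and hypothetical projection weights `W_gamma`, `W_beta`; the pooling and conditioning form are illustrative, not Viper-F1's exact design:

```python
import numpy as np

def token_grid_correlation(text_tokens, vis_grid):
    """Lightweight relevance map: cosine similarity between every text token
    and every visual grid cell, in place of quadratic cross-attention."""
    t = text_tokens / np.linalg.norm(text_tokens, axis=-1, keepdims=True)
    v = vis_grid / np.linalg.norm(vis_grid, axis=-1, keepdims=True)
    return t @ v.T                               # (T, G) relevance map

def film_modulate(vis_grid, relevance, W_gamma, W_beta):
    """FiLM conditioning: per-cell scale and shift, derived from pooled
    relevance, modulate the features before the SSM recurrence."""
    cond = relevance.mean(axis=0)[:, None]       # (G, 1) pooled over tokens
    gamma = 1.0 + cond * W_gamma                 # (G, D) feature-wise scales
    beta = cond * W_beta                         # (G, D) feature-wise shifts
    return gamma * vis_grid + beta

# toy example: 4 text tokens, 3x3 = 9 grid cells, 16-dim features
rng = np.random.default_rng(2)
text, grid = rng.standard_normal((4, 16)), rng.standard_normal((9, 16))
R = token_grid_correlation(text, grid)
out = film_modulate(grid, R, rng.standard_normal(16), rng.standard_normal(16))
print(R.shape, out.shape)  # (4, 9) (9, 16)
```

The relevance map costs O(T·G) to build but never enters the recurrence itself, so the subsequent SSM scan over the G grid cells remains linear in sequence length.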
4. Empirical Benefits and Computational Properties
CoSSM variants have demonstrated the following empirical and computational advantages:
- Parameter Efficiency: Feature-space modulator approaches require as few as 1–7 million trainable parameters (vs. tens of millions for transformer or LoRA/adapter baselines) while achieving superior mAP and F1 on cross-modal benchmarks (Xian et al., 24 Dec 2025, Sun et al., 9 Jan 2026, Shen et al., 19 Jul 2025).
- Linear Complexity: SSM-based scanning yields global feature integration at $O(N)$ (recurrent scan) or $O(N \log N)$ (convolutional mode) complexity in sequence length, avoiding the quadratic cost of token-wise attention (Trinh et al., 14 Nov 2025, Meng et al., 22 May 2025, Inaganti et al., 25 Jan 2025).
- Hardware Scalability: State-space operations and block-parallel convolutions allow substantial GPU memory savings (e.g., 83.7% in recent implementations), and up to 49% faster inference (Li et al., 2024, Trinh et al., 14 Nov 2025).
- Fine-Grained Reasoning: On fine-grained multimodal reasoning benchmarks (e.g., VQAv2, MME, AI2D), CoSSM-equipped models match or outperform counterparts with much larger backbones, with a strengthened ability to localize task-relevant regions (Trinh et al., 14 Nov 2025).
- Interpretability and Physics Integration: By making explicit the physical or semantic factors underlying each modality (e.g., temperature, emissivity, texture), CoSSM forms such as PCMamba couple domain knowledge and data-driven learning (Meng et al., 22 May 2025).
5. Representative Architectures and Applications
| Framework | Modalities | Distinctive CoSSM Mechanism |
|---|---|---|
| MS2Fusion (Shen et al., 19 Jul 2025) | RGB–Thermal or Multispectral | Cross-param swap & shared-param SSM, scale-wise SSM fusion |
| DIFF-MF (Sun et al., 9 Jan 2026) | Infrared–Visible Images | Difference-guided reweighting, dual SSM channel/spatial fusion |
| Viper-F1 (Trinh et al., 14 Nov 2025) | Vision–Language | Token-grid correlation, FiLM-conditioned SSM |
| PCMamba (Meng et al., 22 May 2025) | Dual-Camera HSI (PAN+Compressive) | Physics-informed parameter disentanglement, cross-scan Mamba SSM |
| MambaTron (Inaganti et al., 25 Jan 2025) | Image–Point Cloud | Gated selective SSM scan for cross-modal state aggregation |
| DRI (Xian et al., 24 Dec 2025) | Multi-domain (e.g., visible–SAR) | Feature-space modulator, offset encoder, additive fusion |
Applications span detection, semantic segmentation, fused image enhancement, point cloud completion, re-identification, cross-domain retrieval, and scientific (physics-driven) imaging.
6. Design Trends, Extensions, and Variations
The CoSSM paradigm is converging on unified, residual-modulated architectures in which difference-driven guidance, dual-mode SSMs (both channel and spatial), and gating/fusion mechanisms are stacked modularly. Mixture-of-experts, dynamic striding, and temporal variants are natural extensions (Sun et al., 9 Jan 2026). Selective gating and cross-modal code injection strategies generalize to frozen-backbone adaptation and to low-resource transfer settings (Xian et al., 24 Dec 2025).
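The frozen-backbone code-injection pattern can be sketched as follows; the layer shapes, the tanh nonlinearity, and the per-depth projections are illustrative assumptions, not the DRI architecture:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def modulated_forward(x, frozen_Ws, offset_code, proj_Ws):
    """Frozen-backbone adaptation by additive feature offsets: the pretrained
    weights frozen_Ws are never updated; only the tiny per-depth projections
    proj_Ws of the domain offset code are trainable (illustrative sketch)."""
    h = x
    for W, P in zip(frozen_Ws, proj_Ws):
        h = np.tanh(h @ W)                    # frozen pretrained layer
        h = layer_norm(h) + offset_code @ P   # inject domain offset additively
    return h

# toy 3-layer backbone, 32-dim features, 8-dim domain offset code
rng = np.random.default_rng(3)
frozen = [rng.standard_normal((32, 32)) for _ in range(3)]
projs = [0.01 * rng.standard_normal((8, 32)) for _ in range(3)]
out = modulated_forward(rng.standard_normal((5, 32)), frozen, rng.standard_normal(8), projs)
print(out.shape)  # (5, 32)
```

Because the backbone weights stay fixed, the trainable footprint is only the offset encoder and the small projections, which is what keeps such modulators in the 1–7M parameter range reported above.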
Integrations with physics-informed layers and domain-specific factor disentanglement expand CoSSM to scientific and industrial imaging (Meng et al., 22 May 2025). Fine-grained attention via lightweight correlation modules or localized FiLM is a recurring method for precise cross-modal alignment in vision-language settings (Trinh et al., 14 Nov 2025).
7. Empirical Benchmarks and Comparative Results
CoSSM-equipped methods consistently report mAP and F1 improvements over attention-based or conventional CNN baselines, with direct evidence in:
- Object detection: MS2Fusion yields +3.3 mAP over YOLOv5 on FLIR, +3.2 on MFD, and +0.2 on LLVIP, while reducing computational overhead by up to 60% (Shen et al., 19 Jul 2025).
- Image fusion: DIFF-MF outperforms SSM, transformer, and CNN-based methods in visual and quantitative metrics on UAV and driving datasets (Sun et al., 9 Jan 2026).
- Vision-language: Viper-F1 attains 76.6 VQAv2 and 1376.2 MME with 0.8B parameters, overtaking 7B-parameter alternatives (Trinh et al., 14 Nov 2025).
- Point-cloud completion: MambaTron’s CoSSM achieves leading Chamfer distance and F1 on ShapeNet-ViPC and ScanObjectNN, with half the parameter count and sub-quadratic runtime (Inaganti et al., 25 Jan 2025).
- Re-identification: DRI's CoSSM design surpasses LoRA and adapter baselines by +2–3% mAP with <2M parameters (Xian et al., 24 Dec 2025), and is readily extensible to video, audio, and medical cross-modal settings.
These outcomes substantiate CoSSM as an efficient, modular, and effective paradigm for cross-modal sequence modeling and fusion across varied domains.