State Space Duality (SSD)
- State Space Duality is a mathematical framework that connects structured state-space models with self-attention via specific matrix decompositions.
- It unifies recurrence, convolution, and attention processes to enable linear-time computations and hardware-optimized performance in diverse deep learning tasks.
- SSD underpins applications in language, vision, reinforcement learning, and more by providing scalable, memory-efficient models with tangible empirical benefits.
State Space Duality (SSD) is a mathematical framework that establishes a precise correspondence between structured state-space models (SSMs) and efficient, structured self-attention mechanisms via specific matrix decompositions. In deep learning, especially following the advent of architectures such as Mamba2, SSD has concretized a new paradigm that unifies recurrent, convolutional, and attention-based sequence processing under a common algorithmic, algebraic, and hardware-efficient umbrella. The duality enables linear-time recurrent computation and quadratic-time masked attention to realize identical transformations under certain conditions, and has yielded tangible advances across language, vision, reinforcement learning, and physical modeling.
1. Formal Mathematical Definition
SSD is formulated in terms of linear time-invariant or time-varying systems acting on sequences. The canonical discrete-time SSM is specified by
$$h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^\top h_t,$$
where $x_t$ is the input, $h_t \in \mathbb{R}^N$ is the hidden state, $y_t$ is the output, and $(A_t, B_t, C_t)$ are (potentially time-varying) learned parameters. On unrolling, the sequence-to-sequence mapping is governed by a lower-triangular kernel matrix $M$, giving
$$y = Mx,$$
with $M_{ij} = C_i^\top A_i A_{i-1} \cdots A_{j+1} B_j$ for $i \ge j$ and $M_{ij} = 0$ otherwise.
The duality arises when $A_t$ is restricted to scalar multiples of the identity or to diagonal matrices. For the scalar-identity case $A_t = a_t I$, $M$ factors as
$$M_{ij} = \Big(\prod_{k=j+1}^{i} a_k\Big)\, C_i^\top B_j,$$
which can be written as
$$M = L \circ (CB^\top),$$
where $L$, with $L_{ij} = \prod_{k=j+1}^{i} a_k$ for $i \ge j$, is a $1$-semiseparable lower-triangular mask, $\circ$ is the Hadamard product, $C = [C_1, \dots, C_T]^\top$, and $B = [B_1, \dots, B_T]^\top$. This renders the primal (recurrent) and dual (masked attention) forms provably equivalent (Hu et al., 6 Oct 2025, Dao et al., 2024). Diagonal $A_t$ generalize this to sums of $N$ independent modes.
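The scalar-identity duality can be checked directly. The following NumPy sketch (a minimal illustration, not a production kernel) runs the linear-time recurrence and then rebuilds the same output from the masked-attention factorization $M = L \circ (CB^\top)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state dimension
a = rng.uniform(0.5, 1.0, T)     # scalars a_t (A_t = a_t * I)
B = rng.standard_normal((T, N))  # rows B_t
C = rng.standard_normal((T, N))  # rows C_t
x = rng.standard_normal(T)       # scalar input channel

# Primal form: linear-time recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t
h = np.zeros(N)
y_scan = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_scan[t] = C[t] @ h

# Dual form: masked attention y = (L ∘ C B^T) x with the 1-semiseparable mask
# L[i, j] = a_{j+1} * ... * a_i for i >= j (empty product = 1 on the diagonal)
L = np.zeros((T, T))
for i in range(T):
    for j in range(i + 1):
        L[i, j] = np.prod(a[j + 1 : i + 1])
M = L * (C @ B.T)
y_attn = M @ x

assert np.allclose(y_scan, y_attn)  # both forms produce identical outputs
```

Both paths compute the same lower-triangular kernel; they differ only in cost profile, $O(TN)$ for the scan versus $O(T^2 N)$ for the dense dual form.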
2. Dual Algorithmic Realizations: Recurrence and Structured Attention
SSD yields two algorithmic forms for processing sequences:
- Linear-time scan (recurrence): The standard SSM update $h_t = A_t h_{t-1} + B_t x_t$, $y_t = C_t^\top h_t$, achieved in $O(TN)$ time and $O(N)$ memory for state dimension $N$ and sequence length $T$.
- Masked attention (quadratic): The causal lower-triangular mask $L$ and outer product $CB^\top$ can be composed to execute $y = (L \circ CB^\top)x$ for input $x$, matching the SSM action but in a form conducive to batched GEMM acceleration.
The semiseparable structure of allows hybrid algorithms: the sequence is chunked into blocks, with each block executed as a dense matrix multiply and inter-block dependencies handled via short SSM scans. This strategy achieves 2–8× speedups over pure scans while retaining linear scaling and constant memory per step (Dao et al., 2024, Hu et al., 6 Oct 2025).
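The chunked hybrid can be sketched as follows: each length-$Q$ block is handled by a dense masked matmul, and a short recurrence carries the hidden state between blocks. This is a simplified single-channel illustration of the blocked algorithm, with the local mask formed via the cumulative-product ratio $L_{ij} = \mathrm{cum}_i / \mathrm{cum}_j$:

```python
import numpy as np

def ssd_scan(a, B, C, x):
    """Reference linear-time recurrence (primal form)."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssd_chunked(a, B, C, x, Q):
    """Hybrid SSD: dense matmuls inside each length-Q block, plus a short
    recurrence carrying the hidden state between blocks."""
    T, N = B.shape
    y = np.empty(T)
    h = np.zeros(N)                          # state entering each chunk
    for s in range(0, T, Q):
        e = min(s + Q, T)
        ac, Bc, Cc, xc = a[s:e], B[s:e], C[s:e], x[s:e]
        cum = np.cumprod(ac)                 # cum[t] = a_s * ... * a_{s+t}
        # Intra-chunk: local 1-semiseparable mask L[i, j] = cum[i] / cum[j], i >= j
        L = np.tril(np.outer(cum, 1.0 / cum))
        y[s:e] = (L * (Cc @ Bc.T)) @ xc
        # Inter-chunk: contribution of the carried-in state, decayed per step
        y[s:e] += cum * (Cc @ h)
        # Update the carried state for the next chunk
        h = cum[-1] * h + (Bc * ((cum[-1] / cum) * xc)[:, None]).sum(axis=0)
    return y

rng = np.random.default_rng(1)
T, N, Q = 32, 4, 8
a = rng.uniform(0.5, 1.0, T)
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
x = rng.standard_normal(T)
assert np.allclose(ssd_chunked(a, B, C, x, Q), ssd_scan(a, B, C, x))
```

The production kernels additionally batch over channels and heads and fuse these steps, but the block structure, dense intra-chunk matmuls stitched together by a cheap inter-chunk scan, is the same.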
3. Expressiveness, Generalizations, and Extensions
SSD's expressive scope exceeds that of purely scalar SSMs. Diagonal SSMs—with $A_t = \text{diag}\big(a_t^{(1)}, \dots, a_t^{(N)}\big)$ for $N$ modes—admit expressive kernel matrices $M = \sum_{n=1}^{N} L^{(n)} \circ \big(C^{(n)} B^{(n)\top}\big)$, which is equivalent to summing $N$ independent $1$-semiseparable masked attentions. More general $N$-semiseparable matrices capture all autoregressions with $O(N)$-time updates, and the theoretical lower bound for efficient sequence modeling is attained using diagonal SSMs (Hu et al., 6 Oct 2025).
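The mode decomposition can be verified numerically: a diagonal recurrence over $N$ state channels produces exactly the same output as the sum of $N$ per-mode $1$-semiseparable masked attentions (a small sketch, not an efficient implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 8, 3
A = rng.uniform(0.5, 1.0, (T, N))   # diagonal entries of A_t, one column per mode
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
x = rng.standard_normal(T)

# Diagonal recurrence: h_t = diag(A_t) h_{t-1} + B_t x_t, y_t = C_t . h_t
h = np.zeros(N)
y_scan = np.empty(T)
for t in range(T):
    h = A[t] * h + B[t] * x[t]
    y_scan[t] = C[t] @ h

# Sum of N independent 1-semiseparable masked attentions, one per mode n
y_modes = np.zeros(T)
for n in range(N):
    L = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1):
            L[i, j] = np.prod(A[j + 1 : i + 1, n])
    y_modes += (L * np.outer(C[:, n], B[:, n])) @ x

assert np.allclose(y_scan, y_modes)
```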
The SSD formalism extends to:
- Bidirectional and non-causal variants: By lifting the lower-triangularity requirement, SSD handles contexts where future and past must be fused (e.g., bidirectional recommenders (Qu et al., 2024); non-causal vision (Shi et al., 2024)).
- Time-aware modifications: Incorporating time-difference scaling into the scalar decay $a_t$ recovers per-position adaptivity essential for low-dimensional settings and time-sensitive recommendations (Fan et al., 2024).
- Nonlinear expansions: While canonical SSD is linear in the input $x$, nonlinearities (gating, feed-forward networks, mixture-of-experts) can be incorporated in the parameterization of $(A_t, B_t, C_t)$ or composed on top of the SSD layer.
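One common way to realize the bidirectional variant, sketched here under the simplifying assumption that the non-causal kernel is the sum of a forward and a time-reversed causal scan (the exact fusion differs per architecture), is:

```python
import numpy as np

def causal_ssd(a, B, C, x):
    """Causal (lower-triangular) scalar SSD scan."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def bidirectional_ssd(a_fwd, a_bwd, B, C, x):
    """Non-causal SSD: a forward scan plus a time-reversed scan, with the
    shared t == t term subtracted so it is counted exactly once."""
    y_fwd = causal_ssd(a_fwd, B, C, x)
    y_bwd = causal_ssd(a_bwd[::-1], B[::-1], C[::-1], x[::-1])[::-1]
    diag = np.einsum('tn,tn,t->t', C, B, x)   # doubled diagonal contribution
    return y_fwd + y_bwd - diag

rng = np.random.default_rng(4)
T, N = 10, 3
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
x = rng.standard_normal(T)
# Sanity check: with unit decays the effective mask is all-ones, so the
# output reduces to the fully unmasked linear attention y = (C B^T) x.
ones = np.ones(T)
assert np.allclose(bidirectional_ssd(ones, ones, B, C, x), (C @ B.T) @ x)
```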
4. Empirical Results and Applications
SSD has underpinned state-of-the-art results in multiple domains:
- Sequence modeling: Mamba2 with SSD yields linear cost, state-of-the-art accuracy, and orders-of-magnitude hardware acceleration compared with Transformers in large-context LMs (Dao et al., 2024, Hu et al., 6 Oct 2025).
- Vision: EfficientViM’s HSM-SSD achieves superior speed-accuracy trade-offs (up to 0.7% top-1 improvement and 2.5× throughput increase vs. prior ViT-family backbones) while using single-head and hidden-dimension mixing for substantial speedups (Lee et al., 2024). VSSD (non-causal SSD) dominates in image classification and detection tasks by erasing causality constraints (Shi et al., 2024).
- Reinforcement learning: SSD-Mamba2 enables cross-modal perception-action policies that surpass transformer fusion in return, safety, and sample efficiency, with >10× greater feasible sequence lengths (Tao et al., 9 Sep 2025).
- Multiple instance learning: Mamba2MIL leverages SSD for linear-time aggregation over >10k instances, excelling on pathology datasets versus attention-based MIL (Zhang et al., 2024).
- Physical priors: Hamiltonian SSD (H-SSD) incorporates symplectic structure and energy conservation, leading to robust long-horizon rollout and improved spatiotemporal coherence in world modeling tasks (Meziani, 8 Jan 2026).
- Combinational logic and security: Shallow State Duality in circuit locking exponentially inflates indistinguishable keyspaces, defeating practical unique-completion and combinational-equivalence attacks (Roshanisefat et al., 2020).
5. Hardware-Awareness, Memory, and Architectural Innovations
SSD’s amenability to efficient hardware acceleration is a driving feature:
- Matrix-multiply dominance: Large blocks of the sequence are amenable to high-throughput GEMMs, aligning with modern GPU/TPU architectures (Dao et al., 2024).
- Compact memory footprint: Only the $O(N)$ recurrent state per step must be kept; neither the full $T \times T$ mask nor the self-attention tensor ever needs to be materialized.
- Multi-stage and feature fusion: Designs such as multi-stage hidden-state fusion (EfficientViM), sequence registers (SSD4Rec), and weighted branch fusion (Mamba2MIL) exploit SSD’s algebraic structure to maximize both representational power and bandwidth (Lee et al., 2024, Qu et al., 2024, Zhang et al., 2024).
- Removal of multi-head overhead: HSM-SSD eliminates the need for memory-bound multi-head reshaping by using single-head, efficient mixing on compressed hidden states (Lee et al., 2024).
6. Theoretical Implications and Limits
SSD provides a unifying mathematical bridge between recurrent networks, convolutional models, and structured attention:
- Equivalence and efficiency: Any scalar SSM is a linear masked attention over a $1$-semiseparable mask and vice versa. This duality holds for diagonal SSMs and establishes the boundary of algorithmic efficiency for AR kernels (Hu et al., 6 Oct 2025, Dao et al., 2024).
- Limits: Arbitrary softmax attention maps cannot be realized by finite-dimensional SSMs because softmax generally generates full-rank, non-semiseparable matrices (Hu et al., 6 Oct 2025). Thus, SSD covers all "masked linear attention" but not nonlinear attentional kernels.
- Quantum and operator-theoretic analogues: SSD extends to quantum channel-state duality and low-rank representations in quantum information science, where randomized pure-state expansions enable compact data compression and process tomography (Yan et al., 2022).
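The semiseparability boundary in the limits above can be illustrated numerically: every block drawn from the lower-triangular part of a $1$-semiseparable mask is a rank-$1$ outer product, whereas a causal softmax attention matrix generically has full-rank blocks and so admits no such factorization (a small demonstration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 16, 8
a = rng.uniform(0.5, 1.0, T)

# 1-semiseparable mask: L[i, j] = prod_{k=j+1}^{i} a_k = s_i / s_j, so any
# off-diagonal block is the rank-1 outer product of s_i and 1 / s_j.
s = np.cumprod(a)
L = np.tril(np.outer(s, 1.0 / s))
block = L[T // 2:, :T // 2]              # off-diagonal block of the mask
assert np.linalg.matrix_rank(block) == 1

# Causal softmax attention: the elementwise exponential generically destroys
# low-rank structure, so the same block has rank greater than 1.
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
logits = Q @ K.T
mask = np.tril(np.ones((T, T), dtype=bool))
S = np.where(mask, np.exp(logits - logits.max()), 0.0)
S /= S.sum(axis=1, keepdims=True)
sm_block = S[T // 2:, :T // 2]
assert np.linalg.matrix_rank(sm_block) > 1
```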
7. Broader Impact and Future Directions
SSD widens the design space for expressive yet efficient sequence models, enabling subquadratic scaling, hardware acceleration, and inductive bias injection (e.g., via Hamiltonian priors). It motivates model design where structured recurrences and attention-like mixing can be interchanged based on task, regime, or deployment scenario. Further directions include extending SSD to more general structured matrices, nonlinear operator dualities, and deeper integration of physics-inspired symmetry and conservation laws.
References:
- "On Structured State-Space Duality" (Hu et al., 6 Oct 2025)
- "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" (Dao et al., 2024)
- "EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality" (Lee et al., 2024)
- "SSD4Rec: A Structured State Space Duality Model for Efficient Sequential Recommendation" (Qu et al., 2024)
- "VSSD: Vision Mamba with Non-Causal State Space Duality" (Shi et al., 2024)
- "Mamba2MIL: State Space Duality Based Multiple Instance Learning for Computational Pathology" (Zhang et al., 2024)
- "Can SSD-Mamba2 Unlock Reinforcement Learning for End-to-End Motion Control?" (Tao et al., 9 Sep 2025)
- "TiM4Rec: An Efficient Sequential Recommendation Model Based on Time-Aware Structured State Space Duality Model" (Fan et al., 2024)
- "Randomized channel-state duality" (Yan et al., 2022)
- "DFSSD: Deep Faults and Shallow State Duality..." (Roshanisefat et al., 2020)
- "Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architecture" (Meziani, 8 Jan 2026)