Diagonalized State-Space Model (S4D)

Updated 26 January 2026
  • Diagonalized State-Space Model (S4D) is a deep sequence model that constrains state transition matrices to be diagonal, facilitating efficient training and inference.
  • It generates an analytic convolution kernel as a sum of exponentials, providing clear spectral interpretation and effective handling of long-range dependencies across modalities.
  • The model’s diagonal structure substantially reduces computation and parameter count while maintaining competitive performance on the Long Range Arena (LRA) benchmark and on tasks such as EEG analysis.

The Diagonalized State-Space Model (S4D) is a deep sequence modeling architecture that constrains the state transition matrices of a linear state-space model (SSM) to be diagonal. This structural simplification yields models that are efficient for both training and inference, while retaining the expressive capacity required for competitive long-range modeling across modalities including text, EEG, audio, and spatio-temporal signals. S4D arose as a streamlined variant of the original S4 framework: it replaces the diagonal-plus-low-rank (DPLR) HiPPO matrices with fully diagonal parameterizations, which enables analytic kernel computation, trivial parallelization, and a much shorter implementation without significant loss in empirical accuracy.

1. Mathematical Formulation and Kernel Structure

S4D is defined through the standard continuous-time linear time-invariant SSM equations:

$$\dot x(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t),$$

where $A = \mathrm{diag}(\lambda_1, \dots, \lambda_N) \in \mathbb{C}^{N\times N}$ is a diagonal matrix of (generally complex) eigenvalues, $B \in \mathbb{C}^{N\times 1}$, $C \in \mathbb{C}^{1\times N}$, and $D \in \mathbb{C}$. Discretizing by zero-order hold (ZOH) with step $\Delta$, and indexing the state so that the input $u_k$ enters $x_k$ (the convention used by S4), yields:

$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k + D\,u_k,$$

where $\bar{A} = \exp(A\Delta) = \mathrm{diag}(e^{\lambda_i \Delta})$ and $\bar{B}_i = B_i\,(e^{\lambda_i \Delta} - 1)/\lambda_i$.

Unrolling the recurrence from a zero initial state gives the convolution kernel $K[n] = C \bar{A}^n \bar{B}$ and the causal convolution:

$$y[n] = \sum_{k=0}^{n} K[k]\,u[n-k] + D\,u[n], \qquad K[k] = \sum_{i=1}^{N} C_i\,\bar{B}_i\,e^{\lambda_i k \Delta}.$$

The frequency-domain (Laplace or z-transform) characteristics are analytically computable in terms of the SSM parameters. For the continuous-time system:

$$H(s) = C\,(sI - A)^{-1}B + D = \sum_{i=1}^{N} \frac{C_i B_i}{s-\lambda_i} + D.$$

This structure gives S4D a transparent spectral interpretation: each diagonal entry corresponds to a single-pole filter whose decay rate (time constant) is set by $\Re \lambda_i$ and whose resonant frequency is set by $\Im \lambda_i$.
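The equivalence between the recurrent and convolutional views can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the function names (`discretize_zoh`, `s4d_kernel`, `run_recurrence`) and all parameter values are illustrative assumptions, and it assumes the indexing convention in which the input $u_k$ enters the state at step $k$, which is the convention under which $K[n] = C \bar{A}^n \bar{B}$ holds with the sum starting at $k = 0$.

```python
import numpy as np

def discretize_zoh(lam, B, dt):
    """ZOH discretization of a diagonal SSM; A_bar and B_bar are length-N vectors."""
    A_bar = np.exp(lam * dt)
    B_bar = B * (A_bar - 1.0) / lam
    return A_bar, B_bar

def s4d_kernel(lam, B, C, dt, L):
    """K[k] = sum_i C_i * B_bar_i * exp(lam_i * k * dt) for k = 0..L-1."""
    A_bar, B_bar = discretize_zoh(lam, B, dt)
    powers = A_bar[:, None] ** np.arange(L)[None, :]  # (N, L) Vandermonde matrix
    return (C * B_bar) @ powers

def run_recurrence(lam, B, C, D, dt, u):
    """Step the diagonal recurrence from a zero state (u_k enters x_k)."""
    A_bar, B_bar = discretize_zoh(lam, B, dt)
    x = np.zeros_like(lam)
    y = np.empty(len(u), dtype=complex)
    for k, u_k in enumerate(u):
        x = A_bar * x + B_bar * u_k   # elementwise: A_bar is diagonal
        y[k] = C @ x + D * u_k
    return y

# Illustrative values only (not taken from any paper).
rng = np.random.default_rng(0)
N, L, dt = 8, 64, 0.05
lam = -0.5 + 1j * np.pi * np.arange(N)
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
D = 0.3
u = rng.standard_normal(L)

K = s4d_kernel(lam, B, C, dt, L)
y_conv = np.convolve(u, K)[:L] + D * u      # causal convolution view
y_rec = run_recurrence(lam, B, C, D, dt, u)  # recurrent view
max_err = np.max(np.abs(y_conv - y_rec))
print(max_err)
```

Because $\bar{A}$ is diagonal, the state update is an elementwise product, and the kernel reduces to a single matrix-vector product against a Vandermonde matrix of powers of $e^{\lambda_i \Delta}$.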

2. Parameterization, Initialization, and Numerical Considerations

S4D parameterizes the diagonal entries either directly as complex numbers constrained to have negative real part ($\Re \lambda_i < 0$, for stability), or through a reparameterization such as $\lambda_i = -\exp(\alpha_i) + i\beta_i$, which guarantees stability by construction. The vectors $B$ and $C$ are either trainable or fixed depending on the variant.
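As a small illustration of the stable reparameterization, the sketch below maps unconstrained real parameters to eigenvalues with negative real part. The helper name `stable_eigenvalues` and the sampled values are assumptions made for this example only.

```python
import numpy as np

def stable_eigenvalues(alpha, beta):
    """Map unconstrained reals (alpha, beta) to lambda = -exp(alpha) + i*beta.

    The real part is -exp(alpha) < 0 for every real alpha, so no projection
    or clipping is needed during optimization.
    """
    return -np.exp(alpha) + 1j * beta

rng = np.random.default_rng(0)
alpha = rng.normal(scale=1.0, size=16)    # illustrative unconstrained parameters
beta = rng.normal(scale=10.0, size=16)
lam = stable_eigenvalues(alpha, beta)

dt = 0.01
A_bar = np.exp(lam * dt)                  # discrete-time poles
print(lam.real.max(), np.abs(A_bar).max())
```

Since $|e^{\lambda_i \Delta}| = e^{\Re(\lambda_i)\Delta}$, any positive step size places every discrete-time pole strictly inside the unit circle.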

Initialization in S4D is critical:

  • HiPPO-derived: $\lambda_i$ initialized as the spectrum of the HiPPO-LegS normal matrix or approximations (e.g., "S4D-Lin" uses $\lambda_n = -\frac{1}{2} + i\pi n$, "S4D-Inv" uses an inverse law).
  • Frequency tuning: Methods such as S4D-FT scale $\Im \lambda_i$ by a factor $\alpha$ to bias initial frequency coverage (Wang et al., 24 Jan 2025).
  • Butterworth or Fourier: For denoising or spectral coverage, alternatives place discrete-time poles on the Butterworth circle or uniformly over the unit circle (S4D-DFouT) (Solozabal et al., 28 Aug 2025, Mei et al., 2024).
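  • Numerical stability: With $\Re \lambda_i < 0$ and $\Delta > 0$, the exponential mapping guarantees $|e^{\lambda_i \Delta}| < 1$, so every discrete-time pole lies strictly inside the unit circle.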
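Two of the initializations listed above are simple enough to write down directly. The sketch below implements the S4D-Lin rule as stated in the text and a frequency scaling in the spirit of S4D-FT; the function names are hypothetical, and the scaling helper reflects only the one-line description given here (multiply the imaginary parts by $\alpha$), not the full method of the cited paper.

```python
import numpy as np

def s4d_lin_init(N):
    """S4D-Lin: lambda_n = -1/2 + i*pi*n for n = 0..N-1."""
    n = np.arange(N)
    return -0.5 + 1j * np.pi * n

def scale_frequencies(lam, alpha):
    """Scale only the imaginary parts by alpha, leaving decay rates unchanged.

    Assumption: this is a literal reading of "scale Im(lambda_i) by a factor
    alpha"; any further details of S4D-FT are not modeled.
    """
    return lam.real + 1j * alpha * lam.imag

lam = s4d_lin_init(4)
lam_ft = scale_frequencies(lam, 0.5)
print(lam)
print(lam_ft)
```

With $\alpha < 1$ the initial oscillation frequencies are compressed toward zero, and with $\alpha > 1$ they are spread out, while $\Re \lambda_n = -\frac{1}{2}$ is untouched.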

Closed-form discretization is handled by either ZOH or bilinear transforms, with negligible practical difference (Gu et al., 2022). Conjugate symmetry is used to ensure real outputs when needed.
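The role of conjugate symmetry can be illustrated as follows. If every eigenvalue appears together with its complex conjugate (and $B$, $C$ are conjugated accordingly), the kernel is real, so it suffices to store one member of each pair and take twice the real part. This is a sketch under that assumption; `zoh_kernel` and the parameter values are illustrative.

```python
import numpy as np

def zoh_kernel(lam, B, C, dt, L):
    """Complex S4D kernel under ZOH: K[k] = sum_i C_i * B_bar_i * A_bar_i**k."""
    A_bar = np.exp(lam * dt)
    B_bar = B * (A_bar - 1.0) / lam
    return (C * B_bar) @ (A_bar[:, None] ** np.arange(L)[None, :])

rng = np.random.default_rng(0)
N_half, L, dt = 4, 32, 0.05
# Start at n = 1 so that no eigenvalue is its own conjugate.
lam_half = -0.5 + 1j * np.pi * np.arange(1, N_half + 1)
B_half = rng.standard_normal(N_half) + 1j * rng.standard_normal(N_half)
C_half = rng.standard_normal(N_half) + 1j * rng.standard_normal(N_half)

# Full model: each parameter together with its complex conjugate.
K_full = zoh_kernel(
    np.concatenate([lam_half, lam_half.conj()]),
    np.concatenate([B_half, B_half.conj()]),
    np.concatenate([C_half, C_half.conj()]),
    dt, L,
)
# Half model: store one member of each pair and take 2 * Re.
K_real = 2.0 * zoh_kernel(lam_half, B_half, C_half, dt, L).real

print(np.max(np.abs(K_full.imag)), np.max(np.abs(K_full.real - K_real)))
```

The half-size form halves both the stored parameters and the kernel computation while producing the same real-valued output.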

3. Computational and Memory Efficiency

The diagonalization in S4D yields a memory and compute profile much lighter than that of SSMs with dense or DPLR state matrices:

  • Per time step: $O(N)$ for state updates and output computation ($N$: state size).
  • Kernel computation: Vectorized Vandermonde matrix construction enables $O(NL)$ time for length-$L$ kernels, with FFT-based convolution in $O(L\log L)$ (Gupta et al., 2022).
  • Parameter count: For the main kernel, $O(N)$ parameters per channel, compared to $O(N^2)$ for dense $A$.
  • Streaming hardware: S4D can be mapped naturally to neuromorphic processors (Loihi 2), achieving millisecond latency and microjoule energy costs, since each state dimension is independent (Meyer et al., 2024).
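The kernel and convolution costs above can be made concrete with a short sketch: the kernel is one product against an $N \times L$ Vandermonde matrix, and the causal convolution is carried out with zero-padded real FFTs. The function names and sizes are assumptions for illustration, and the kernel uses the "twice the real part" form, which presumes conjugate-paired parameters.

```python
import numpy as np

def real_s4d_kernel(lam, B, C, dt, L):
    """Real kernel of length L in O(N*L) time via a Vandermonde product.

    Assumes each stored eigenvalue stands for a conjugate pair, hence 2 * Re.
    """
    A_bar = np.exp(lam * dt)
    B_bar = B * (A_bar - 1.0) / lam
    vandermonde = A_bar[:, None] ** np.arange(L)[None, :]  # shape (N, L)
    return 2.0 * ((C * B_bar) @ vandermonde).real

def fft_causal_conv(K, u):
    """Causal convolution in O(L log L) time.

    Zero-padding to length 2L prevents the circular convolution from
    wrapping around into the first L outputs.
    """
    L = len(u)
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)
    return y[:L]

rng = np.random.default_rng(0)
N, L, dt = 16, 1024, 0.01
lam = -0.5 + 1j * np.pi * np.arange(1, N + 1)
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
u = rng.standard_normal(L)

K = real_s4d_kernel(lam, B, C, dt, L)
y_fft = fft_causal_conv(K, u)
y_direct = np.convolve(u, K)[:L]   # O(L^2) reference
print(np.max(np.abs(y_fft - y_direct)))
```

The direct convolution is included only as a correctness reference; for long sequences the FFT path is the one that scales.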

4. Spectral Properties and Architectural Variants

The closed-form S4D kernel is an interpretable sum of exponentials, allowing direct analysis of the model’s frequency response:

  • Standalone S4D: Tends to produce mid- and high-pass kernels, favoring short- and moderate-range dependencies.
  • Hybrid architectures: Combining S4D with convolutional layers or gating transforms the kernel’s spectrum. Prepending depthwise or 1D convolutions produces band-pass characteristics; input gating can induce a strong low-pass profile suited to modeling exceptionally long-range dependencies (e.g., security vulnerabilities in code) (Ravikumar et al., 19 Jan 2026).
  • Spectral bias: The original HiPPO-initialized S4D is prone to non-uniform coverage and aliasing; DFouT initialization solves this by distributing poles uniformly in the Fourier plane, enhancing out-of-the-box long-range performance and convergence (Solozabal et al., 28 Aug 2025).

The interpretability of S4D’s frequency response provides practical design levers for tailoring model behavior to specific sequence modeling tasks.
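One way to inspect a kernel's frequency response is to evaluate it in closed form from the poles and compare it with the FFT of the sampled kernel. The sketch below does this for a length-$L$ kernel, using the finite geometric sum $\sum_{k=0}^{L-1} (\bar{A}_i e^{-i\omega})^k = (1 - \bar{A}_i^L)/(1 - \bar{A}_i e^{-i\omega})$ at the DFT frequencies $\omega = 2\pi m / L$. Names and parameter values are illustrative assumptions.

```python
import numpy as np

def discrete_params(lam, B, dt):
    """ZOH-discretized poles and input weights of a diagonal SSM."""
    A_bar = np.exp(lam * dt)
    return A_bar, B * (A_bar - 1.0) / lam

def kernel_spectrum(lam, B, C, dt, L):
    """Closed-form DFT of the length-L kernel, one single-pole term per state."""
    A_bar, B_bar = discrete_params(lam, B, dt)
    omega = 2.0 * np.pi * np.arange(L) / L
    z_inv = np.exp(-1j * omega)
    weights = C * B_bar * (1.0 - A_bar ** L)   # accounts for truncation at L
    return (weights[:, None] / (1.0 - A_bar[:, None] * z_inv[None, :])).sum(axis=0)

rng = np.random.default_rng(0)
N, L, dt = 8, 256, 0.05
lam = -0.5 + 1j * np.pi * np.arange(N)
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)

A_bar, B_bar = discrete_params(lam, B, dt)
K = (C * B_bar) @ (A_bar[:, None] ** np.arange(L)[None, :])

H_closed = kernel_spectrum(lam, B, C, dt, L)
H_fft = np.fft.fft(K)
magnitude = np.abs(H_closed)   # inspect for low-, band-, or high-pass shape
print(np.max(np.abs(H_closed - H_fft)))
```

Plotting `magnitude` against frequency shows where each pole contributes a peak, which is the kind of inspection that the design levers above rely on.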

5. Theoretical Properties: Expressivity, Convergence, and Duality

S4D enjoys several theoretical properties:

  • Universality: Any well-behaved SSM kernel can, in the limit, be approximated by a diagonal model (Gupta et al., 2022, Proposition 2.1).
  • Weak HiPPO convergence: As state size grows, the S4D (diagonal) kernel approaches that of the full DPLR HiPPO S4 in the $L_2$ sense for sufficiently smooth inputs, though not in operator norm; there is a theoretical accuracy gap for non-smooth or adversarial inputs (Yu et al., 2023).
  • Robustness: Standard S4D is sensitive to adversarial Fourier perturbations unless corrected (S4-PTD) (Yu et al., 2023).
  • Structured state-space duality: Diagonal SSMs are algebraically identical to 1-semiseparable (1-SS) masked attention mechanisms, and can be recast as sums of low-rank masked operators. This duality ceases to hold for standard softmax attention due to rank explosion (Hu et al., 6 Oct 2025).
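The matrix view behind this duality can be illustrated for the time-invariant case discussed in this article. The sketch below writes the sequence map as $y = M u$, where $M = \sum_i C_i \bar{B}_i L_i$ and each $L_i$ is a lower-triangular decay mask with entries $\bar{A}_i^{\,n-m}$; it then checks that this matches the recurrence and that an off-diagonal block of one mask has rank 1. It is an illustration under simplifying assumptions ($D = 0$, fixed parameters, hypothetical names), not the general construction of the cited work.

```python
import numpy as np

def decay_mask(a, L):
    """Lower-triangular mask with entries a**(n - m) for n >= m, else 0."""
    idx = np.arange(L)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, a ** np.maximum(diff, 0), 0.0)

rng = np.random.default_rng(0)
N, L, dt = 4, 16, 0.1
lam = -0.5 + 1j * np.pi * np.arange(N)
A_bar = np.exp(lam * dt)
B_bar = (A_bar - 1.0) / lam            # ZOH with B = 1
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
u = rng.standard_normal(L)

# Matrix ("attention-like") form: a weighted sum of N masked operators.
masks = [decay_mask(a, L) for a in A_bar]
M = sum(c * b * m for c, b, m in zip(C, B_bar, masks))
y_matrix = M @ u

# Recurrent form of the same map (u_k enters x_k, D = 0).
x = np.zeros(N, dtype=complex)
y_rec = np.empty(L, dtype=complex)
for k in range(L):
    x = A_bar * x + B_bar * u[k]
    y_rec[k] = C @ x

# Any block strictly below the diagonal of a single mask has rank 1.
block_rank = np.linalg.matrix_rank(masks[1][L // 2:, :L // 2])
print(np.max(np.abs(y_matrix - y_rec)), block_rank)
```

The quadratic-time matrix form and the linear-time recurrence compute the same outputs, which is the sense in which the two views are interchangeable here.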

6. Empirical Performance and Applications

S4D matches or surpasses S4 on several long-range benchmarks with lighter computational burden:

  • Long Range Arena (LRA): S4D achieves ≈85.5% average (S4: 86.1%), with S4D-DFouT reaching the highest diagonal SSM scores and uniquely succeeding on the extreme PathX-256 benchmark (Solozabal et al., 28 Aug 2025, Gu et al., 2022, Gupta et al., 2022).
  • Audio and time-series: On Speech Commands, S4D matches or closely trails S4, outperforming CNN and Transformer baselines (Gupta et al., 2022).
  • Spatio-temporal sensor data: rS4D (with a Butterworth-initialized low-pass front-end) yields lower RMSE and enhanced robustness to high-frequency noise in mobile sensor reconstruction tasks (Mei et al., 2024).
  • Hydrology: S4D-FT outperforms LSTM and conceptual rainfall-runoff models on CONUS-scale datasets (Wang et al., 24 Jan 2025).
  • EEG/BCI: S4D classifiers provide real-time, accurate motor imagery (MI) decoding, training on modest hardware and supporting interactive BCI pipelines (Tscherniak et al., 28 Nov 2025).

Model compression via $H^2$ optimal reduction can shrink S4D blocks by up to 32× without sacrificing accuracy on LRA, providing an effective path for efficient deployment (Sakamoto et al., 14 Jul 2025).

7. Implementation Considerations and Practical Guidelines

S4D’s architecture is characterized by:

A coherent design philosophy emerges: employ HiPPO/DFouT initializations for broad frequency coverage, inspect kernel responses empirically, and pair S4D with convolution or gating as dictated by the long-range characteristics of the target sequence modeling problem.

