Hidden State Mixer-SSD in Efficient Vision Models
- HSM-SSD is a neural network mechanism that efficiently captures full-sequence global dependencies in resource-constrained vision tasks by exploiting the duality between token-level and hidden state-level state space computations.
- The mechanism reduces computational complexity by shifting costly D→D projections to a smaller hidden state latent, resulting in significant throughput improvements on benchmarks like ImageNet-1k.
- Multi-stage hidden state fusion further refines feature representations, enabling competitive speed-accuracy trade-offs compared to state-of-the-art vision models.
Hidden State Mixer-SSD (HSM-SSD) is a neural network mechanism introduced in the EfficientViM architecture to enable efficient capture of global dependencies in resource-constrained vision tasks. HSM-SSD is based on the state space duality between token-level and hidden state–level computations. By relocating channel-mixing operations from the high-dimensional token space to a smaller hidden state latent, HSM-SSD achieves a substantial reduction in computational and memory complexity while retaining full-sequence global receptive field capabilities. This mechanism is further enhanced by multi-stage hidden-state fusion, providing a competitive speed-accuracy trade-off on large-scale benchmarks such as ImageNet-1k (Lee et al., 2024).
1. State Space Model Foundations
The HSM-SSD mechanism derives from state space models (SSMs) for sequence modeling. The general continuous-time linear time-invariant (LTI) SSM is described by

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C^{\top} h(t),$$

where $x(t)$ is the input signal, $h(t) \in \mathbb{R}^{N}$ is the N-dimensional hidden state, and $A$, $B$, $C$ are learnable system parameters.
For discrete-time processing of an input sequence $x \in \mathbb{R}^{L \times D}$ (sequence length $L$, channel dimension $D$), the zero-order hold discretization yields

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t^{\top} h_t,$$

with $\bar{A}_t = \exp(\Delta_t A)$ and $\bar{B}_t = \Delta_t B_t$ (with diagonal $A$ and a learned time step $\Delta_t$), and input-dependent projections $B_t, C_t \in \mathbb{R}^{N}$.
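The discrete recurrence above can be sketched as a sequential scan. This is a minimal illustration under the notation just introduced; the function and argument names are illustrative, not the paper's code.

```python
import numpy as np

def ssm_recurrence(x, A, B, C, delta):
    """Sequential scan of the discretized SSM (illustrative sketch).

    x:     (L, D) input sequence
    A:     (N,)   diagonal state matrix (negative entries for stability)
    B, C:  (L, N) input-dependent projections
    delta: (L,)   learned time steps
    Returns y: (L, D)
    """
    L, D = x.shape
    N = A.shape[0]
    h = np.zeros((N, D))
    y = np.empty_like(x)
    for t in range(L):
        A_bar = np.exp(delta[t] * A)           # zero-order-hold discretization of A
        B_bar = delta[t] * B[t]                # simplified discretization of B
        h = A_bar[:, None] * h + np.outer(B_bar, x[t])
        y[t] = C[t] @ h                        # read out the state through C_t
    return y
```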
2. State Space Duality and SSD Layer
The original State Space Duality (SSD) layer provides a dual matrix formulation of the above recurrence,

$$y = M x, \qquad M = L \circ (C B^{\top}),$$

where $L \in \mathbb{R}^{L \times L}$ is a lower-triangular decay matrix with

$$L_{ij} = \prod_{k=j+1}^{i} a_k \ \ \text{for } i \ge j, \qquad L_{ij} = 0 \ \ \text{otherwise},$$

with state weights $a_t = \exp(\Delta_t A)$ (from the step sizes $\Delta_t$), and $B, C \in \mathbb{R}^{L \times N}$ obtained from network projections of the input.
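The dual matrix form can be sketched directly; the cumulative-log trick for the decay matrix is one common way to build it. Names here are illustrative, and the decays are assumed strictly positive.

```python
import numpy as np

def ssd_dual(x, a, B, C):
    """Causal SSD in its dual matrix form y = (L ∘ C Bᵀ) x (illustrative sketch).

    x: (L, D) tokens; a: (L,) per-step decays in (0, 1]; B, C: (L, N).
    """
    s = np.concatenate([[0.0], np.cumsum(np.log(a))])[1:]  # s[i] = log(a_0 ... a_i)
    decay = np.tril(np.exp(s[:, None] - s[None, :]))       # decay[i, j] = a_{j+1}...a_i
    M = decay * (C @ B.T)                                  # 1-semiseparable mixing matrix
    return M @ x
```

A quick sanity check is that this matrix form reproduces the sequential recurrence exactly.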
For vision applications, a non-causal (NC-SSD) simplification is adopted: the causal decay mask is dropped, so a single aggregated hidden state $h = \bar{B}^{\top} x \in \mathbb{R}^{N \times D}$ summarizes the entire sequence and is broadcast back to every token via $y = C h$. In the original vision SSD block, the token-wise D→D projections dominate the compute cost, giving overall complexity $O(L D^{2})$ for sequences of moderate to large length $L$ and channel dimension $D$.
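The non-causal core reduces to two matrix products. A minimal sketch, assuming `B_bar` already contains the discretized input weights:

```python
import numpy as np

def nc_ssd(x, B_bar, C):
    """Non-causal SSD: one aggregated hidden state shared by all tokens (sketch).

    x: (L, D) tokens; B_bar: (L, N) discretized input weights; C: (L, N).
    """
    h = B_bar.T @ x   # (N, D): global, order-free aggregation over the sequence
    y = C @ h         # (L, D): broadcast the shared state back to every token
    return y
```

Both products cost O(LND); it is the surrounding token-space D→D projections, not this core, that account for the O(LD²) term.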
3. HSM-SSD Mechanism
HSM-SSD addresses this bottleneck by confining all high-cost D→D projections to the compact hidden state latent $h \in \mathbb{R}^{N \times D}$ (with $N \ll L$), significantly lowering the dominant term in the complexity.
- Step 1: Compressed Hidden Input Construction. The token sequence is first aggregated into the hidden state,
  $$h = \bar{B}^{\top} x \in \mathbb{R}^{N \times D},$$
  with $x \in \mathbb{R}^{L \times D}$ the input features, $\bar{B} = \Delta \odot B$, and $\odot$ elementwise multiplication.
- Step 2: Latent-Space Linear Projection. The only D→D projection now operates in the reduced $N \times D$ space,
  $$h' = h\,W, \qquad W \in \mathbb{R}^{D \times D},$$
  leading to $O(N D^{2})$ compute in place of $O(L D^{2})$.
- Step 3: Hidden State Mixer (HSM) Channel Mixing. A further D→D transformation is applied with sigmoid-gated channelwise mixing,
  $$\tilde{h} = \big(h' \odot \sigma(h\,W_{z})\big)\,W_{o},$$
  with $W_{z}, W_{o} \in \mathbb{R}^{D \times D}$ and $\sigma$ the sigmoid function.
- Step 4: Projection Back to Tokens. The mixed hidden state is mapped back to token space via $C \in \mathbb{R}^{L \times N}$:
  $$y = C \tilde{h} \in \mathbb{R}^{L \times D}.$$
- Complexity: Total per-layer compute becomes $O(LND + ND^{2})$, reducing the projection term by a factor of roughly $L/N$ relative to the original $O(LND + LD^{2})$.
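The four steps above can be composed into a single forward pass. This is a sketch under the notation of this section; the weight names `W_z` and `W_o` are illustrative, not the paper's code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def hsm_ssd(x, B, C, delta, W_z, W_o):
    """Sketch of an HSM-SSD layer forward pass (illustrative names).

    x: (L, D) tokens; B, C: (L, N) projections; delta: (L,) step sizes;
    W_z, W_o: (D, D) latent-space mixing weights.
    """
    B_bar = delta[:, None] * B          # (L, N) discretized input weights
    h = B_bar.T @ x                     # Step 1: (N, D) compressed hidden input
    z = h @ W_z                         # Step 2: D→D projection inside the N×D latent
    h_mix = (h * sigmoid(z)) @ W_o      # Step 3: sigmoid-gated channel mixing (HSM)
    return C @ h_mix                    # Step 4: map back to (L, D) token space
```

Every D→D matrix multiply touches only the (N, D) latent, so the token dimension L appears only in the cheap O(LND) aggregation and broadcast steps.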
4. Multi-Stage Hidden-State Fusion
EfficientViM with HSM-SSD employs multi-stage fusion to enhance the classifier representation. If the network has $S$ stages, each stage $s$ produces a hidden state $h^{(s)}$. Per-stage global pooling,
$$\bar{h}^{(s)} = \mathrm{pool}\big(h^{(s)}\big),$$
is followed by normalization and projection to class logits,
$$\hat{y}^{(s)} = W^{(s)}\,\mathrm{norm}\big(\bar{h}^{(s)}\big).$$
The outputs are combined into fused logits
$$\hat{y} = \sum_{s=1}^{S} \alpha_{s}\, \hat{y}^{(s)},$$
with learned fusion scalars $\alpha_{s}$. Training on the fused output encourages stagewise discriminative power in all hidden representations.
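The fusion rule can be sketched as follows. The pooling and normalization choices here (mean pooling, L2 normalization) are illustrative stand-ins, not necessarily the paper's exact operators.

```python
import numpy as np

def multi_stage_fusion(stage_states, weights, alphas):
    """Fuse per-stage hidden states into class logits (illustrative sketch).

    stage_states: list of (N_s, D_s) hidden states, one per stage
    weights:      list of (num_classes, D_s) per-stage classifier weights
    alphas:       (S,) learned fusion scalars
    """
    logits = 0.0
    for h, W, a in zip(stage_states, weights, alphas):
        pooled = h.mean(axis=0)                            # global pooling over states
        normed = pooled / (np.linalg.norm(pooled) + 1e-6)  # simple normalization stand-in
        logits = logits + a * (W @ normed)                 # weighted per-stage logits
    return logits
```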
5. Computational Bottleneck Alleviation
Benchmarking of the original SSD blocks highlighted that the chief computational and memory bottleneck arises from token-wise D→D projections and multi-head reshape/copy operations (≈25% of runtime was attributed to these memory-bound steps). HSM-SSD eliminates token-space D→D projections entirely, locating them within the much smaller hidden-state latent. Further, HSM-SSD replaces multi-head processing with a single head, recovering expressiveness through a statewise parameterization that varies across the $N$ hidden states. This reduces the theoretical FLOPs of the projections from $O(LD^{2})$ to $O(ND^{2})$ and empirically improves throughput by 8–15%.
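The projection savings follow directly from the complexity terms. A back-of-the-envelope comparison with illustrative (not paper-reported) dimensions:

```python
# Rough comparison of the dominant D→D projection terms (illustrative numbers).
L, D, N = 196, 256, 49           # e.g. 14×14 tokens, 256 channels, 49 hidden states
flops_token_space = L * D * D    # projections applied per token: O(L D^2)
flops_latent_space = N * D * D   # the same projections in the latent: O(N D^2)
ratio = flops_token_space / flops_latent_space
print(ratio)                     # = L / N = 4.0 with these dimensions
```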
6. Empirical Performance and Trade-offs
EfficientViM, equipped with HSM-SSD, achieves favorable speed-accuracy trade-offs on ImageNet-1k:
- EfficientViM-M1: 6.7M parameters, 239M FLOPs, 72.9% top-1 accuracy @20,731 im/s
- EfficientViM-M2: 13.9M, 355M FLOPs, 75.4% @17,005 im/s (+0.2% and 33% faster than SHViT-S2: 75.2% @15,899 im/s)
- EfficientViM-M3: 16.6M, 656M FLOPs, 77.6% @11,952 im/s (>1.5× speed of SHViT-S3)
- EfficientViM-M4: 19.6M, 1.11G FLOPs, 79.4% @8,170 im/s
Compared to prior state-of-the-art vision Mamba variants (VSSD-T at 76.1% @1,612 im/s), EfficientViM-M2 is 10× faster at similar or better accuracy (Lee et al., 2024).
| Model | Params (M) | Top-1 (%) | Throughput (im/s) |
|---|---|---|---|
| EfficientViM-M2 | 13.9 | 75.4 | 17,005 |
| SHViT-S2 | — | 75.2 | 15,899 |
| VSSD-T | — | 76.1 | 1,612 |
7. Significance and Outlook
The HSM-SSD architecture demonstrates that global interaction via state space models can be reconciled with high computational and memory efficiency. By exploiting the duality of token and hidden state spaces and optimizing channel mixing in a reduced latent, HSM-SSD achieves a full-sequence global receptive field while reducing real-world latency on GPU hardware. Multi-stage fusion further enhances stagewise discriminatory power. These innovations suggest that high-throughput, high-accuracy vision models can sidestep previous architectural bottlenecks associated with attention and multi-head mixing, offering a principled direction for scaling efficient sequence models in resource-limited settings (Lee et al., 2024).