Hidden State Mixer-SSD in Efficient Vision Models
- HSM-SSD is a neural network mechanism that efficiently captures full-sequence global dependencies in resource-constrained vision tasks by exploiting the duality between token-level and hidden state-level state space computations.
- The mechanism reduces computational complexity by shifting costly D→D projections to a smaller hidden state latent, resulting in significant throughput improvements on benchmarks like ImageNet-1k.
- Multi-stage hidden state fusion further refines feature representations, enabling competitive speed-accuracy trade-offs compared to state-of-the-art vision models.
Hidden State Mixer-SSD (HSM-SSD) is a neural network mechanism introduced in the EfficientViM architecture to enable efficient capture of global dependencies in resource-constrained vision tasks. HSM-SSD is based on the state space duality between token-level and hidden state–level computations. By relocating channel-mixing operations from the high-dimensional token space to a smaller hidden state latent, HSM-SSD achieves a substantial reduction in computational and memory complexity while retaining full-sequence global receptive field capabilities. This mechanism is further enhanced by multi-stage hidden-state fusion, providing a competitive speed-accuracy trade-off on large-scale benchmarks such as ImageNet-1k (Lee et al., 2024).
1. State Space Model Foundations
The HSM-SSD mechanism derives from state space models (SSMs) for sequence modeling. The general continuous-time linear time-invariant (LTI) SSM is described by

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C^{\top} h(t),$$

where $x(t)$ is the input signal, $h(t) \in \mathbb{R}^{N}$ is the N-dimensional hidden state, and $A$, $B$, $C$ are learnable system parameters.
For discrete-time processing of an input sequence $x \in \mathbb{R}^{L \times D}$ (sequence length $L$, channel dimension $D$), the zero-order hold discretization yields

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t^{\top} h_t,$$

with $\bar{A}_t = \exp(\Delta_t A)$ and $\bar{B}_t = \Delta_t B_t$ (with diagonal $A$ and a learned time step $\Delta_t$), and input-dependent projections $B_t, C_t \in \mathbb{R}^{N}$.
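The discrete recurrence above can be sketched as a sequential scan. This is a minimal illustration under the notation just introduced; the function and argument names are illustrative, not the paper's code.

```python
import numpy as np

def ssm_recurrence(x, A, B, C, delta):
    """Sequential scan of the discretized SSM (illustrative sketch).

    x:     (L, D) input sequence
    A:     (N,)   diagonal state matrix (negative entries for stability)
    B, C:  (L, N) input-dependent projections
    delta: (L,)   learned time steps
    Returns y: (L, D)
    """
    L, D = x.shape
    N = A.shape[0]
    h = np.zeros((N, D))
    y = np.empty_like(x)
    for t in range(L):
        A_bar = np.exp(delta[t] * A)           # zero-order-hold discretization of A
        B_bar = delta[t] * B[t]                # simplified discretization of B
        h = A_bar[:, None] * h + np.outer(B_bar, x[t])
        y[t] = C[t] @ h                        # read out the state through C_t
    return y
```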
2. State Space Duality and SSD Layer
The original State Space Duality (SSD) layer provides a dual matrix formulation of the above recurrence,

$$y = M x, \qquad M = L \circ (C B^{\top}),$$

where $L \in \mathbb{R}^{L \times L}$ is a lower-triangular decay matrix with

$$L_{ij} = \prod_{k=j+1}^{i} a_k \ \ \text{for } i \ge j, \qquad L_{ij} = 0 \ \ \text{otherwise},$$

with state weights $a_t = \exp(\Delta_t A)$ (from the step sizes $\Delta_t$), and $B, C \in \mathbb{R}^{L \times N}$ obtained from network projections of the input.
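The dual matrix form can be sketched directly; the cumulative-log trick for the decay matrix is one common way to build it. Names here are illustrative, and the decays are assumed strictly positive.

```python
import numpy as np

def ssd_dual(x, a, B, C):
    """Causal SSD in its dual matrix form y = (L ∘ C Bᵀ) x (illustrative sketch).

    x: (L, D) tokens; a: (L,) per-step decays in (0, 1]; B, C: (L, N).
    """
    s = np.concatenate([[0.0], np.cumsum(np.log(a))])[1:]  # s[i] = log(a_0 ... a_i)
    decay = np.tril(np.exp(s[:, None] - s[None, :]))       # decay[i, j] = a_{j+1}...a_i
    M = decay * (C @ B.T)                                  # 1-semiseparable mixing matrix
    return M @ x
```

A quick sanity check is that this matrix form reproduces the sequential recurrence exactly.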
For vision applications, a non-causal (NC-SSD) simplification is adopted: the causal decay mask is dropped, so a single aggregated hidden state $h = \bar{B}^{\top} x \in \mathbb{R}^{N \times D}$ summarizes the entire sequence and is broadcast back to every token via $y = C h$. In the original vision SSD block, the token-wise D→D projections dominate the compute cost, giving overall complexity $O(L D^{2})$ for sequences of moderate to large length $L$ and channel dimension $D$.
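The non-causal core reduces to two matrix products. A minimal sketch, assuming `B_bar` already contains the discretized input weights:

```python
import numpy as np

def nc_ssd(x, B_bar, C):
    """Non-causal SSD: one aggregated hidden state shared by all tokens (sketch).

    x: (L, D) tokens; B_bar: (L, N) discretized input weights; C: (L, N).
    """
    h = B_bar.T @ x   # (N, D): global, order-free aggregation over the sequence
    y = C @ h         # (L, D): broadcast the shared state back to every token
    return y
```

Both products cost O(LND); it is the surrounding token-space D→D projections, not this core, that account for the O(LD²) term.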
3. HSM-SSD Mechanism
HSM-SSD addresses this bottleneck by confining all high-cost D→D projections to the compact hidden state latent $h \in \mathbb{R}^{N \times D}$ (with $N \ll L$), significantly lowering the dominant term in the complexity.
- Step 1: Compressed Hidden Input Construction. The token sequence is first aggregated into the hidden state,
  $$h = \bar{B}^{\top} x \in \mathbb{R}^{N \times D},$$
  with $x \in \mathbb{R}^{L \times D}$ the input features, $\bar{B} = \Delta \odot B$, and $\odot$ elementwise multiplication.
- Step 2: Latent-Space Linear Projection. The only D→D projection now operates in the reduced $N \times D$ space,
  $$h' = h\,W, \qquad W \in \mathbb{R}^{D \times D},$$
  leading to $O(N D^{2})$ compute in place of $O(L D^{2})$.
- Step 3: Hidden State Mixer (HSM) Channel Mixing. A further D→D transformation is applied with sigmoid-gated channelwise mixing,
  $$\tilde{h} = \big(h' \odot \sigma(h\,W_{z})\big)\,W_{o},$$
  with $W_{z}, W_{o} \in \mathbb{R}^{D \times D}$ and $\sigma$ the sigmoid function.
- Step 4: Projection Back to Tokens. The mixed hidden state is mapped back to token space via $C \in \mathbb{R}^{L \times N}$:
  $$y = C \tilde{h} \in \mathbb{R}^{L \times D}.$$
- Complexity: Total per-layer compute becomes $O(LND + ND^{2})$, reducing the projection term by a factor of roughly $L/N$ relative to the original $O(LND + LD^{2})$.
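The four steps above can be composed into a single forward pass. This is a sketch under the notation of this section; the weight names `W_z` and `W_o` are illustrative, not the paper's code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def hsm_ssd(x, B, C, delta, W_z, W_o):
    """Sketch of an HSM-SSD layer forward pass (illustrative names).

    x: (L, D) tokens; B, C: (L, N) projections; delta: (L,) step sizes;
    W_z, W_o: (D, D) latent-space mixing weights.
    """
    B_bar = delta[:, None] * B          # (L, N) discretized input weights
    h = B_bar.T @ x                     # Step 1: (N, D) compressed hidden input
    z = h @ W_z                         # Step 2: D→D projection inside the N×D latent
    h_mix = (h * sigmoid(z)) @ W_o      # Step 3: sigmoid-gated channel mixing (HSM)
    return C @ h_mix                    # Step 4: map back to (L, D) token space
```

Every D→D matrix multiply touches only the (N, D) latent, so the token dimension L appears only in the cheap O(LND) aggregation and broadcast steps.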
4. Multi-Stage Hidden-State Fusion
EfficientViM with HSM-SSD employs multi-stage fusion to enhance the classifier representation. If the network has $S$ stages, each stage $s$ produces a hidden state $h^{(s)}$. Per-stage global pooling,
$$\bar{h}^{(s)} = \mathrm{pool}\big(h^{(s)}\big),$$
is followed by normalization and projection to class logits,
$$\hat{y}^{(s)} = W^{(s)}\,\mathrm{norm}\big(\bar{h}^{(s)}\big).$$
The outputs are combined into fused logits
$$\hat{y} = \sum_{s=1}^{S} \alpha_{s}\, \hat{y}^{(s)},$$
with learned fusion scalars $\alpha_{s}$. Training on the fused output encourages stagewise discriminative power in all hidden representations.
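The fusion rule can be sketched as follows. The pooling and normalization choices here (mean pooling, L2 normalization) are illustrative stand-ins, not necessarily the paper's exact operators.

```python
import numpy as np

def multi_stage_fusion(stage_states, weights, alphas):
    """Fuse per-stage hidden states into class logits (illustrative sketch).

    stage_states: list of (N_s, D_s) hidden states, one per stage
    weights:      list of (num_classes, D_s) per-stage classifier weights
    alphas:       (S,) learned fusion scalars
    """
    logits = 0.0
    for h, W, a in zip(stage_states, weights, alphas):
        pooled = h.mean(axis=0)                            # global pooling over states
        normed = pooled / (np.linalg.norm(pooled) + 1e-6)  # simple normalization stand-in
        logits = logits + a * (W @ normed)                 # weighted per-stage logits
    return logits
```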
5. Computational Bottleneck Alleviation
Benchmarking of the original SSD blocks highlighted that the chief computational and memory bottleneck arises from token-wise D→D projections and multi-head reshape/copy operations (≈25% of runtime was attributed to these memory-bound steps). HSM-SSD eliminates token-space D→D projections entirely, locating them within the much smaller hidden-state latent. Further, HSM-SSD replaces multi-head processing with a single head, recovering expressiveness through a statewise parameterization that varies across the $N$ hidden states. This reduces the theoretical FLOPs of the projections from $O(LD^{2})$ to $O(ND^{2})$ and empirically improves throughput by 8–15%.
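The projection savings follow directly from the complexity terms. A back-of-the-envelope comparison with illustrative (not paper-reported) dimensions:

```python
# Rough comparison of the dominant D→D projection terms (illustrative numbers).
L, D, N = 196, 256, 49           # e.g. 14×14 tokens, 256 channels, 49 hidden states
flops_token_space = L * D * D    # projections applied per token: O(L D^2)
flops_latent_space = N * D * D   # the same projections in the latent: O(N D^2)
ratio = flops_token_space / flops_latent_space
print(ratio)                     # = L / N = 4.0 with these dimensions
```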
6. Empirical Performance and Trade-offs
EfficientViM, equipped with HSM-SSD, achieves favorable speed-accuracy trade-offs on ImageNet-1k:
- EfficientViM-M1: 6.7M parameters, 239M FLOPs, 72.9% top-1 accuracy @20,731 im/s
- EfficientViM-M2: 13.9M, 355M FLOPs, 75.4% @17,005 im/s (+0.2% and 33% faster than SHViT-S2: 75.2% @15,899 im/s)
- EfficientViM-M3: 16.6M, 656M FLOPs, 77.6% @11,952 im/s (>1.5× speed of SHViT-S3)
- EfficientViM-M4: 19.6M, 1.11G FLOPs, 79.4% @8,170 im/s
Compared to prior state-of-the-art vision Mamba variants (VSSD-T at 76.1% @1,612 im/s), EfficientViM-M2 is 10× faster at similar or better accuracy (Lee et al., 2024).
| Model | Params (M) | Top-1 (%) | Throughput (im/s) |
|---|---|---|---|
| EfficientViM-M2 | 13.9 | 75.4 | 17,005 |
| SHViT-S2 | — | 75.2 | 15,899 |
| VSSD-T | — | 76.1 | 1,612 |
7. Significance and Outlook
The HSM-SSD architecture demonstrates that global interaction via state space models can be reconciled with high computational and memory efficiency. By exploiting the duality of token and hidden state spaces and optimizing channel mixing in a reduced latent, HSM-SSD achieves a full-sequence global receptive field while reducing real-world latency on GPU hardware. Multi-stage fusion further enhances stagewise discriminatory power. These innovations suggest that high-throughput, high-accuracy vision models can sidestep previous architectural bottlenecks associated with attention and multi-head mixing, offering a principled direction for scaling efficient sequence models in resource-limited settings (Lee et al., 2024).