
Matryoshka State Space Models

Updated 21 February 2026
  • Matryoshka State Space Models are a hierarchical approach that nests submodels within state space frameworks to capture long-range dependencies in sequential data.
  • The architecture employs learnable parameter slicing and joint training, facilitating elastic inference across different computational granularities.
  • Empirical results demonstrate high performance in vision and language tasks, with efficient adaptation to regime switches and variable resource constraints.

The Matryoshka State Space Model (MSSM) encompasses a class of hierarchical architectures for sequence modeling that combine nested decompositions of learnable parameters or state representations with the expressive capabilities of state space models (SSMs). The term "Matryoshka" references the nesting of submodels or processes in a strictly hierarchical fashion, akin to Russian matryoshka dolls. Recent developments have implemented the Matryoshka principle in both the parameter space of neural state space models for elastic and adaptive inference (Shukla et al., 2024), and in stochastic multiscale latent variable models with nonlinear dynamics under regime switching (Vélez-Cruz et al., 2024). This article provides a technical overview, formal definitions, foundational methodologies, and empirical characteristics of Matryoshka State Space Models as they appear in contemporary research.

1. Foundational Concepts: State Space Models and the Matryoshka Principle

State Space Models (SSMs) are sequence models defined in continuous or discrete time, characterized by a latent dynamical system whose hidden state captures long-range dependencies. For the continuous-time linear case, the canonical SSM form is

$$\dot{\mathbf{x}}(t) = \mathbf{A}\,\mathbf{x}(t) + \mathbf{B}\,\mathbf{u}(t), \qquad \mathbf{y}(t) = \mathbf{C}\,\mathbf{x}(t) + \mathbf{D}\,\mathbf{u}(t)$$

where the dynamics and observations are parameterized by $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$. Discrete-time versions and neural parameterizations extend these equations to deep learning settings.
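As a concrete illustration, the discrete-time recurrence $x_{k+1} = \mathbf{A} x_k + \mathbf{B} u_k$, $y_k = \mathbf{C} x_k + \mathbf{D} u_k$ can be computed with a plain sequential scan. This is a minimal numpy sketch of the generic linear SSM above, not any specific published implementation:

```python
import numpy as np

def ssm_scan(A, B, C, D, u):
    """Run a discrete-time linear SSM over inputs u of shape (T, m):
    x[k+1] = A x[k] + B u[k],  y[k] = C x[k] + D u[k], with x[0] = 0."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        ys.append(C @ x + D @ u_k)   # observe, then advance the state
        x = A @ x + B @ u_k
    return np.stack(ys)

# Toy 2-state system driven by a constant input
A = 0.5 * np.eye(2)
B = np.ones((2, 1))
C = np.ones((1, 2))
D = np.zeros((1, 1))
y = ssm_scan(A, B, C, D, np.ones((3, 1)))
```

Deep SSM layers replace this sequential loop with parallel scan or convolutional forms, but the state-update semantics are the same.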

The "Matryoshka" principle refers to endowing a model class—such as a neural SSM—with a nested hierarchy of submodels or processes. In recent literature, two approaches operationalize this hierarchy:

  • Neural SSM parameter subslicing (MatMamba (Shukla et al., 2024)): All learnable tensors (including projections, convolutional kernels, and SSM parameters) are constructed so that contiguous slices along their inner dimensions yield valid smaller submodels, each representing a nested granularity.
  • Hierarchical latent state nesting (MSSM for regime-switching (Vélez-Cruz et al., 2024)): The latent state space is organized into multiple temporal scales, with fine-to-coarse feedback and coarse-to-fine conditioning, supporting the modeling of complex, nested nonlinear dynamics and transitions across regimes.

Both lines of work emphasize joint training or inference over the full hierarchy and support efficient adaptation to variable resource or modeling requirements.

2. Neural Matryoshka State Space Models: Architecture and Training

The neural instantiation of the Matryoshka SSM, exemplified by MatMamba (Shukla et al., 2024), integrates the Matryoshka approach with the Mamba2 SSM block, producing a model that jointly trains multiple granularities in a universal architecture.

Mamba2 Block (Base Case)

Each Mamba2 block transforms a sequence via three steps:

  1. Input projection and grouped convolution: $XBC = \sigma\big(\mathrm{Conv}(\ldots)\big)$ with learnable projection and convolution tensors.
  2. Chunk-wise SSM scan: $Y = \mathrm{SSM}(XBC, W_{dt} u; A, D)$ using parallel SSM heads.
  3. Gated output projection: $M(u) = \mathrm{Norm}\big(Y \odot \sigma(W_z u)\big)\, W_\mathrm{out}^T$.
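The three steps can be pictured with a deliberately simplified numpy sketch. This is not the fused chunk-wise Mamba2 kernel: the B/C state projections are folded away, each head keeps a single scalar decay from $A$, and the step size `dt` is taken as a softplus of a learned projection. All shapes and parameter names here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rms_norm(y, eps=1e-6):
    return y / np.sqrt(np.mean(y ** 2, axis=-1, keepdims=True) + eps)

def mamba2_block_sketch(u, W_x, conv_k, A, D, W_dt, W_z, W_out):
    """u: (T, d_model) -> (T, d_model). A toy analogue of the 3-step block."""
    # 1. Input projection, causal depthwise conv, SiLU nonlinearity
    xbc = u @ W_x.T                                   # (T, d_inner)
    T, d = xbc.shape
    pad = np.vstack([np.zeros((len(conv_k) - 1, d)), xbc])
    conv = sum(k * pad[i:i + T] for i, k in enumerate(conv_k))
    xbc = conv * sigmoid(conv)
    # 2. Per-head diagonal SSM scan with input-dependent step size dt
    dt = np.log1p(np.exp(u @ W_dt.T))                 # softplus, (T, n_heads)
    h = A.shape[0]
    x = np.zeros((h, d // h))
    ys = np.empty_like(xbc)
    for t in range(T):
        decay = np.exp(A * dt[t])[:, None]            # one decay per head (A < 0)
        x = decay * x + dt[t][:, None] * xbc[t].reshape(h, -1)
        ys[t] = (x + D[:, None] * xbc[t].reshape(h, -1)).reshape(d)
    # 3. Gated output projection: Norm(Y * sigma(W_z u)) W_out^T
    return rms_norm(ys * sigmoid(u @ W_z.T)) @ W_out.T

rng = np.random.default_rng(0)
T, d_model, d_inner, n_heads = 4, 6, 8, 2
out = mamba2_block_sketch(
    u=rng.standard_normal((T, d_model)),
    W_x=0.1 * rng.standard_normal((d_inner, d_model)),
    conv_k=np.array([0.5, 0.3, 0.2]),
    A=-np.ones(n_heads),                 # negative -> decaying state
    D=np.ones(n_heads),
    W_dt=0.1 * rng.standard_normal((n_heads, d_model)),
    W_z=0.1 * rng.standard_normal((d_inner, d_model)),
    W_out=0.1 * rng.standard_normal((d_model, d_inner)),
)
```

The point of the sketch is structural: every tensor has a "width" axis ($d_\mathrm{inner}$ or $n_\mathrm{heads}$) along which the Matryoshka slicing below operates.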

Matryoshka Nesting

To realize $g$ nested granularities $M_1 \subset M_2 \subset \dots \subset M_g$, all major tensors are sliced:

  • For width multiplier $m_i < d_\mathrm{model}$ at sub-block $i$:
    • $d_i = e \cdot m_i$, $h_i = d_i / d_\mathrm{head}$
    • Input projections: $W_x^{(i)} = W_x[0:d_i, :]$
    • SSM parameters: $A^{(i)} = A[0:h_i]$, $D^{(i)} = D[0:h_i]$, $W_{dt}^{(i)} = W_{dt}[0:h_i, :]$
    • Output projection: $W_\mathrm{out}^{(i)} = W_\mathrm{out}[:, 0:d_i]$.

Each sub-block implements the same computation as the base Mamba2, with reduced width and parameter count, but every $M_i$ is a strict parameter subset of $M_j$ for $j > i$. This architecture supports elastic and adaptive deployment.
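The slicing rules above translate directly into code. The dictionary layout and key names here are illustrative, not MatMamba's actual parameter container; the essential property is that slices are views into the full model's tensors:

```python
import numpy as np

def slice_submodel(params, d_i, h_i):
    """Take leading slices of each tensor to form the i-th nested submodel.
    numpy slices are views, so every submodel literally shares the largest
    model's memory: M_i is a strict parameter subset of M_j for j > i."""
    return {
        "W_x":   params["W_x"][:d_i, :],    # input projection: first d_i rows
        "A":     params["A"][:h_i],         # first h_i SSM heads
        "D":     params["D"][:h_i],
        "W_dt":  params["W_dt"][:h_i, :],
        "W_out": params["W_out"][:, :d_i],  # output projection: first d_i columns
    }

d_model, e, d_head = 64, 2, 8
full = {
    "W_x":   np.zeros((e * d_model, d_model)),
    "A":     np.zeros(e * d_model // d_head),
    "D":     np.zeros(e * d_model // d_head),
    "W_dt":  np.zeros((e * d_model // d_head, d_model)),
    "W_out": np.zeros((d_model, e * d_model)),
}
m_i = 32                                    # width multiplier for sub-block i
sub = slice_submodel(full, d_i=e * m_i, h_i=e * m_i // d_head)
```

Because the slices alias the full tensors, a gradient step on any submodel updates the shared prefix seen by all larger submodels.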

Joint Training

The model is trained by accumulating $g$ forward passes per minibatch (one per submodel/slice) before a single backward pass: $$\mathcal{L}_\mathrm{joint}(x, y) = \sum_{i=1}^{g} \lambda_i\, \ell(f_i(x), y)$$ Typically, $\lambda_i = 1/g$. No explicit distillation or complex regularization is needed beyond classical techniques.
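A minimal sketch of the joint objective. Here `submodels` stands in for the $g$ slice evaluations $f_i$, and the MSE loss is a placeholder for whatever task loss $\ell$ is used:

```python
import numpy as np

def joint_loss(submodels, loss_fn, x, y, weights=None):
    """Sum the per-granularity losses on one minibatch. In practice each
    f_i(x) is a forward pass of the i-th slice, and the g losses are
    accumulated before a single backward pass."""
    g = len(submodels)
    weights = weights if weights is not None else [1.0 / g] * g
    return sum(w * loss_fn(f(x), y) for w, f in zip(weights, submodels))

mse = lambda pred, y: float(np.mean((pred - y) ** 2))
x = y = np.ones(4)
# Two toy "submodels": the identity (perfect fit) and the zero map
L = joint_loss([lambda v: v, lambda v: 0 * v], mse, x, y)
```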

3. Multiscale Hierarchical MSSMs for Nonlinear and Switching Dynamics

Another Matryoshka SSM formalism, developed in (Vélez-Cruz et al., 2024), encodes nested nonlinear and nonstationary latent dynamics across $L$ temporal scales:

  • At each scale $l$, for entity $d$ and local time $t_l$, the latent state is $\mathbf{x}^{[l]}_{d,t_l} \in \mathbb{R}^{N_l}$. The ensemble of latent states at all scales forms a hierarchically nested trajectory.
  • The evolution function $f_l$ for $\mathbf{x}^{[l]}$ depends on both finer- and coarser-scale latent states, with process noise injected at each scale: $$\mathbf{x}^{[l]}_{d,t_l} = f_l\big(\mathbf{A}^{[l]}_{t_l}\mathbf{x}^{[l]}_{d,t_l-1},\ \{\mathbf{x}^{[l']}_{d,1:T_{l'}}\}_{l'<l},\ \{\mathbf{x}^{[l']}_{d,t_{l'}-1}\}_{l'>l}\big) + \mathbf{w}^{[l]}_{d,t_l-1}$$
  • The coarsest scale ($l = L$) can introduce additional inter-entity coupling and discrete regime indicators $s^{[L]}_{d,t_L}$ with mixing weights $\boldsymbol{\pi}_{d,t_L}$ governed by a Dirichlet–categorical prior.

This hierarchical factorization allows detailed modeling of multiscale dependence structures, regime changes, and cross-entity couplings. Observation models and full posterior densities are constructed accordingly.
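To make the cross-scale dependence concrete, here is a toy two-scale ($L = 2$) update in numpy with coarse-to-fine conditioning and fine-to-coarse feedback. The tanh dynamics and coupling matrices are illustrative assumptions, not the paper's learned $f_l$:

```python
import numpy as np

def nested_step(x_fine, x_coarse, A_f, A_c, C_cf, C_fc, sigma=0.0, rng=None):
    """One joint update of a two-scale nested latent state: the fine state
    is conditioned on the coarse state (coarse-to-fine), and the coarse
    state then receives feedback from the updated fine state."""
    rng = rng or np.random.default_rng(0)
    x_fine = np.tanh(A_f @ x_fine + C_cf @ x_coarse) \
        + sigma * rng.standard_normal(x_fine.shape)
    x_coarse = np.tanh(A_c @ x_coarse + C_fc @ x_fine) \
        + sigma * rng.standard_normal(x_coarse.shape)
    return x_fine, x_coarse

# Noise-free 1-D example with identity dynamics and couplings
I = np.eye(1)
xf, xc = nested_step(np.zeros(1), np.ones(1), I, I, I, I, sigma=0.0)
```

In the full model, the coarse scale advances once per several fine-scale steps, so the feedback aggregates a window of fine-scale states rather than a single one.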

4. Inference and Computational Complexity

Matryoshka SSMs in Neural Networks

  • Inference (slice $i$): For sequence length $N$ and sub-block width $m_i$, the cost is $O(N e m_i + N m_i / d_\mathrm{head})$, matching an equivalent standalone SSM model.
  • Training (full model): Requires $g$ forward passes and one backward pass, with total cost $O((g+1) N e d_g + (g+1) N d_g / d_\mathrm{head})$. Memory cost tracks the largest submodel.

Multiscale SMC for Nonlinear MSSMs

Inference in the stochastic MSSM uses a nested Sequential Monte Carlo (particle filter) algorithm. For each scale and time step, particle weights are updated based on observation likelihoods, and resampling is performed as needed. At the coarsest scale, discrete regime indicators and their Dirichlet parameters are updated analytically.

  • Under mild conditions, the SMC estimator converges to the true posterior as the particle count $N \to \infty$.
  • This enables online and efficient inference for complex multiscale switching processes.
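The per-scale filtering step can be illustrated with a generic bootstrap particle filter; the nested, multiscale algorithm in the paper effectively wraps one such filter per scale. The linear-Gaussian toy model below is an assumption chosen for demonstration, not the paper's experimental setup:

```python
import numpy as np

def bootstrap_pf(y_obs, n, transition, loglik, init, rng=None):
    """Minimal bootstrap particle filter: propagate particles through the
    transition, weight them by the observation log-likelihood, and resample
    when the effective sample size (ESS) falls below n/2."""
    rng = rng or np.random.default_rng(0)
    parts = init(n, rng)
    means = []
    for y in y_obs:
        parts = transition(parts, rng)
        logw = loglik(y, parts)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(float(np.sum(w * parts)))       # filtered posterior mean
        if 1.0 / np.sum(w ** 2) < n / 2:             # ESS-triggered resampling
            parts = parts[rng.choice(n, size=n, p=w)]
    return np.array(means)

# Toy AR(1) state with Gaussian observations pinned near 1.0
means = bootstrap_pf(
    y_obs=np.ones(20),
    n=2000,
    transition=lambda p, rng: 0.9 * p + 0.3 * rng.standard_normal(p.size),
    loglik=lambda y, p: -0.5 * ((y - p) / 0.5) ** 2,
    init=lambda n, rng: rng.standard_normal(n),
)
```

The filtered mean is pulled from the zero-mean prior toward the observations, mimicking how the nested SMC tracks latent trajectories online.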

5. Empirical Performance and Simulation Evidence

  • Vision (ImageNet-1K): MatMamba retains top-1 accuracy within $\lesssim 0.2\%$ of standalone Mamba2 baselines at all granularities. Inference throughput equals or exceeds that of ViT-B/16 above 512 px; GPU memory scaling is superior at higher resolutions.
  • Language modeling: Validated on FineWeb for sizes from 130M to 1.4B; scaling curves and final validation losses match baseline Mamba2 models at all nested widths.
  • Adaptive retrieval: When database encoding uses the largest submodel and queries are evaluated with smaller models, 1-NN accuracy drops less than 0.5%, outperforming baseline decomposable models.
  • Simulation studies: On $L = 2$ scales, $D = 6$ entities, and $M = 2$ regimes over $T_2 = 100$ steps, the nested SMC achieves:
    • Coarse-level RMSE per individual in the range 0.111–0.170
    • Regime detection accuracy above 95%
    • Immediate regime switch tracking without systematic delay
  • Stable estimation error, efficient adaptation to regime switches, and cross-scale latent consistency are empirically demonstrated.

6. Practical Benefits and Limitations

Advantages

  • Elasticity: One jointly-trained model yields a range of nested submodels for different compute budgets without repeated retraining.
  • Consistency: Metric-space and embedding-space consistency across submodels benefits adaptive, hybrid, and speculative inference.
  • Efficient large-scale deployment: Neural Matryoshka SSMs exhibit linear scaling with model width and outperform Transformers in certain bandwidth-constrained settings or large-resolution vision tasks.

Limitations

  • Training Overhead: Training with $g$ folds incurs $\sim g\times$ the forward cost per step, though memory is not increased beyond that of the largest submodel.
  • Interpolation Gaps: Slices interpolated between explicitly trained granularities (the "Mix’n’Match" strategy) may underperform without additional regularization or self-distillation.
  • Submodel Scalability: Very small submodels may underperform if $g$ is set too low; additional granularities can mitigate this.

Empirical Summary Table

| Task | MatMamba: Submodel Gap vs Standalone | Adaptive Accuracy |
|---|---|---|
| ImageNet-1K (Vision) | ≤ 0.2% | ≥ Baseline |
| FineWeb (Language) | Matches scaling curves | Smooth interpolation |
| 1-NN Retrieval | < 0.5% drop | > 2–3% drop for baseline SSM |

7. Theoretical Guarantees and Research Connections

  • The nested Matryoshka hierarchy ensures that parameter sharing is maximally efficient: each smaller submodel is strictly a prefix or subset of a larger model, supporting probabilistic and deterministic tractability.
  • For stochastic, regime-switching MSSMs, the combination of state-space hierarchy, Dirichlet–categorical priors, and nested SMC filtering provides closed-form updates and theoretical convergence guarantees under standard regularity assumptions.
  • Connections to Matryoshka Representation Learning and previous multiscale sequence modeling frameworks are realized both in parameter space (as in MatMamba) and latent state hierarchy (as in (Vélez-Cruz et al., 2024)), providing a unified view of nested model classes and adaptive inference for sequential data.
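The Dirichlet–categorical conjugacy behind the closed-form regime updates is standard Bayesian machinery; a one-function sketch (variable names are illustrative):

```python
import numpy as np

def dirichlet_update(alpha, counts):
    """Conjugate closed-form update: a Dirichlet(alpha) prior on regime
    probabilities, combined with observed categorical regime counts,
    yields the posterior Dirichlet(alpha + counts)."""
    post = np.asarray(alpha, float) + np.asarray(counts, float)
    return post, post / post.sum()   # posterior parameters and posterior mean

# Symmetric prior over two regimes; regime 1 observed 3 times, regime 2 once
post, mean = dirichlet_update([1.0, 1.0], [3, 1])
```

This conjugacy is what lets the nested SMC update the regime-mixing weights analytically rather than by sampling.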

The Matryoshka State Space Model formalism thus enables advanced resource-adaptive sequence modeling and multiscale dynamical inference, with strong empirical and theoretical guarantees across multiple domains and methodologies (Shukla et al., 2024, Vélez-Cruz et al., 2024).
