STG-Mamba: Efficient Spatial-Temporal Graph Models
- STG-Mamba is a family of models that integrates input-dependent selective state-space modeling with explicit spatial and temporal graph structures to capture both local and long-range dependencies.
- It employs dynamic gating and decomposition into spatial and temporal blocks, effectively addressing challenges in traffic forecasting, EEG analysis, and motion synthesis.
- STG-Mamba achieves state-of-the-art accuracy with linear computational complexity, offering significant computational savings and robust performance across diverse graph-based tasks.
Spatial-Temporal Graph Mamba (STG-Mamba) refers to a family of models that integrate selective state-space modeling—particularly the Mamba architecture—with explicit spatial and temporal graph structures. These approaches have been developed to address the computational inefficiencies and limited long-range dependency modeling of conventional attention-based spatio-temporal graph neural networks (STGNNs). STG-Mamba models achieve state-of-the-art accuracy and significant computational savings for tasks such as traffic forecasting, dynamic graph embedding, motion synthesis, EEG analysis, and more, by employing input-dependent, graph-aware state evolution at linear complexity.
1. Selective State-Space Modeling in Spatio-Temporal Graphs
The core technical innovation in STG-Mamba is the application of selective state-space models (SSMs) over both spatial and temporal domains of graph-structured data. Where classical SSMs (e.g., S4) are linear time-invariant, Mamba introduces selectivity: recurrence parameters are input-dependent, enabling dynamic gating and context-dependent memory. In discrete time, the Mamba block operates as:
$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,$$

with $\Delta_t$, $B_t$, and $C_t$ each determined by small neural networks of the input $x_t$, and $\bar{A}_t = \exp(\Delta_t A)$, $\bar{B}_t \approx \Delta_t B_t$ obtained via zero-order-hold discretization; the recurrence is computed via a scan over $t$ for stable state updates (Pandey et al., 2024, Wang et al., 2024, Li et al., 2024). This structure allows the model to adaptively filter and propagate information depending on local graph context or temporal phase, yielding richer representations than static GNNs or sequence models.
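The selective recurrence can be sketched as follows for a single channel. This is a minimal illustration, not the STG-Mamba implementation: parameter shapes, the diagonal $A$, and the simplified $\bar{B}_t = \Delta_t B_t$ discretization are assumptions made for clarity.

```python
import numpy as np

def selective_scan(x, A, W_delta, W_B, W_C):
    """Single-channel selective SSM sketch.
    x: (L, d_in) input features; A: (n,) diagonal state matrix (negative
    for stability); W_delta: (d_in,); W_B, W_C: (d_in, n).
    The step size, input, and readout projections all depend on x[t],
    which is the 'selectivity' that distinguishes Mamba from S4."""
    L = x.shape[0]
    n = A.shape[0]
    h = np.zeros(n)
    y = np.empty(L)
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))  # softplus -> positive step size
        B_t = x[t] @ W_B                          # input-dependent input map (n,)
        C_t = x[t] @ W_C                          # input-dependent readout (n,)
        A_bar = np.exp(delta * A)                 # zero-order-hold discretization
        h = A_bar * h + (delta * B_t) * x[t, 0]   # selective state update
        y[t] = C_t @ h                            # readout
    return y
```

In practice this loop is replaced by a parallel associative scan, which is what gives Mamba its linear-time, hardware-efficient training.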
2. Architectural Design and Spatial-Temporal Decomposition
STG-Mamba models typically decompose or fuse spatial and temporal modeling through dedicated state-space modules:
- Spatial Mamba Block: Processes node features or graph embeddings for each time slice, integrating local graph structure via GNNs or graph convolutions, and running Mamba over node sequences (potentially ordered via importance or graph-theoretic priority) (Wang et al., 2024, Pandey et al., 2024).
- Temporal Mamba Block: Processes the time-evolution of each node (or, in some cases, node set), using a state-space scan across sequential node features, potentially incorporating calendar/periodic embeddings (e.g., day-of-week, time-of-day) (Hamad et al., 5 Jul 2025, Choi et al., 2024).
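The calendar/periodic embeddings mentioned above can be attached to per-node temporal features as in this minimal sketch; the table sizes, embedding dimensions, and `steps_per_day` granularity are illustrative assumptions, not values from any cited model (random tables stand in for trained embeddings).

```python
import numpy as np

def add_calendar_features(x, timestamps, dow_dim=4, tod_dim=4,
                          steps_per_day=288, seed=0):
    """x: (L, d) node features; timestamps: (L,) integer step indices.
    Returns (L, d + dow_dim + tod_dim) by concatenating day-of-week and
    time-of-day lookups, ready for a temporal SSM scan."""
    rng = np.random.default_rng(seed)
    dow_table = rng.standard_normal((7, dow_dim))            # one row per weekday
    tod_table = rng.standard_normal((steps_per_day, tod_dim))  # one row per slot
    dow = dow_table[(timestamps // steps_per_day) % 7]
    tod = tod_table[timestamps % steps_per_day]
    return np.concatenate([x, dow, tod], axis=-1)
```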
Distinct models leverage different forms of composition:
| Model | Spatial Block | Temporal Block | Fusion Mechanism | Graph Handling |
|---|---|---|---|---|
| STG-Mamba (Li et al., 2024) | GNN (Dynamic-Filter) | S3M (SSSM) | Kalman-Filter Graph Neural | Explicit A, dynamic |
| SpoT-Mamba (Choi et al., 2024) | Node-walk Mamba | Per-node Mamba | Transformer (across nodes) | Explicit, walk-based |
| MCST-Mamba (Hamad et al., 5 Jul 2025) | Temporal + Spatial Mamba | Temporal/Spatial | Learnable scalar fusion | Implicit, learned |
| MSGM (Liu et al., 21 Jul 2025) | Multi-depth GCN | MSST-Mamba | Token embedding/MSSTBlocks | EEG with priors |
| STGM (Tang et al., 9 Jul 2025) | SG-SSM (GCN) | TGF-SSM/TGB-SSM | LayerNorm+Gates+MLPs | Skeleton graphs |
Some architectures—such as MCST-Mamba—explicitly separate temporal and spatial SSMs, processing node histories and spatial sensor snapshots independently then fusing via adaptive weighting (Hamad et al., 5 Jul 2025). Others factorize or integrate the two directions sequentially or cascade their outputs through residual or gating paths (Li et al., 2024, Tang et al., 9 Jul 2025).
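The adaptive-weighting fusion used by the separated-SSM designs can be sketched as a learnable scalar gate; the sigmoid parameterization here is an assumption about the form, not MCST-Mamba's exact mechanism.

```python
import numpy as np

def fuse(temporal_out, spatial_out, alpha_logit):
    """Blend temporal- and spatial-SSM outputs with one learnable scalar.
    alpha_logit would be a trained parameter; the sigmoid keeps the
    mixing weight in (0, 1)."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))
    return alpha * temporal_out + (1.0 - alpha) * spatial_out
```

With `alpha_logit = 0` the two branches are averaged; training moves the weight toward whichever branch is more predictive for the dataset.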
3. Integration with Graph Neural Networks
STG-Mamba designs consistently combine SSM blocks with graph neural primitives, including GCN, GatedGCN, GINEConv, and variants thereof:
- Edge-aware convolutions: Models like GDG-Mamba include edge attributes in spatial processing using GINEConv, enforcing expressive spatial representations (Pandey et al., 2024).
- Dynamic adjacency and node reordering: State-space updates can depend on per-step dynamic graph structures, often via node prioritization heuristics (degree/centrality) for SSM scan ordering (Wang et al., 2024, Li et al., 2024).
- Kalman Fusion: STG-Mamba (Li et al., 2024) introduces Kalman Filtering Graph Neural Networks (KFGN), fusing embeddings from different temporal granularities according to estimated uncertainty, drawing principled weights analogously to the Kalman gain.
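The degree/centrality-based scan ordering described above amounts to permuting node features before the SSM pass; a minimal sketch (the stable descending-degree sort is one plausible heuristic, not a prescribed one):

```python
import numpy as np

def degree_ordered_sequence(X, A):
    """X: (N, d) node features; A: (N, N) binary adjacency.
    Returns features reordered by descending degree, so an SSM scan
    visits high-priority nodes first, plus the permutation used."""
    degree = A.sum(axis=1)
    order = np.argsort(-degree, kind="stable")  # stable: ties keep input order
    return X[order], order
```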
This combination of local GNN aggregation and global state-space propagation captures both fine-grained and long-range spatial-temporal dependencies.
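The uncertainty-weighted Kalman fusion can be sketched as a scalar (or elementwise) Kalman-style update; treating the fine-granularity embedding as the prior and the coarse one as the measurement is an illustrative choice, and variable names are hypothetical.

```python
import numpy as np

def kalman_fuse(z_fine, var_fine, z_coarse, var_coarse):
    """Fuse two embeddings of the same quantity, weighting each inversely
    to its estimated variance, by analogy with the Kalman gain."""
    gain = var_fine / (var_fine + var_coarse)  # Kalman-gain analogue
    z = z_fine + gain * (z_coarse - z_fine)    # fused estimate
    var = (1.0 - gain) * var_fine              # fused uncertainty shrinks
    return z, var
```

Note that the fused variance is never larger than either input variance, which is the sense in which the fusion "draws principled weights".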
4. Computational Complexity and Efficiency
A primary driver of STG-Mamba’s adoption is its computational profile. Where transformer and full-attention models scale quadratically with sequence length and/or graph size due to exhaustive pairwise attention, all forms of STG-Mamba achieve linear complexity, $\mathcal{O}(L)$ for sequence length $L$ in the relevant dimension (Pandey et al., 2024, Wang et al., 2024). In large-scale settings (traffic, EEG, video), this translates to orders-of-magnitude savings in FLOPs and inference time. For example, STG-Mamba reports a substantial speed-up over transformer counterparts at long sequence lengths (Li et al., 2024), and MSGM operates at $151$ ms per EEG sample on a Jetson Xavier NX with $349$k parameters (Liu et al., 21 Jul 2025).
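The asymptotic gap can be made concrete with a back-of-envelope multiply-add count; the constant factors below are rough assumptions, not measured FLOP counts from any of the cited models.

```python
def attention_madds(L, d):
    """Full self-attention per layer: QK^T plus attention-weighted V,
    each roughly L*L*d multiply-adds -> O(L^2 * d)."""
    return 2 * L * L * d

def ssm_scan_madds(L, d, n):
    """SSM scan per layer: state update and readout over an n-dim state
    for each of d channels at each step -> O(L * d * n)."""
    return 3 * L * d * n
```

At, say, $L = 10{,}000$, $d = 64$, $n = 16$, the scan needs hundreds of times fewer multiply-adds than attention, consistent with the orders-of-magnitude savings reported above.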
Furthermore, ablation studies in MCST-Mamba show the critical contribution of the spatial Mamba block, with its removal nearly doubling MAE/RMSE on multi-channel traffic benchmarks (Hamad et al., 5 Jul 2025).
5. Applications and Empirical Results
STG-Mamba models demonstrate robust domain-generalization across:
- Traffic and spatio-temporal forecasting: PEMS-D4, PEMS-D8—MCST-Mamba achieves $7.88$/$20.97$ MAE/RMSE (PEMS-D4) compared to $13.30$/$29.85$ for previous best SSM, cutting errors nearly in half (Hamad et al., 5 Jul 2025).
- Dynamic link prediction and graph sequence modeling: Mamba-based variants outperform transformer-based GNNs on highly volatile datasets such as Reality Mining, UCI, and Bitcoin (Pandey et al., 2024); Graph-Mamba and SpoT-Mamba maintain stronger long-range dependencies than attention baselines (Wang et al., 2024, Choi et al., 2024).
- Music-driven skeleton/video synthesis: STG-Mamba achieves substantial improvements across precision (PFD), diversity (VFD), motion coverage, and video realism, reporting a lower FID than both Vid2Vid ($47.65$) and Pix2pixHD ($61.68$) (Tang et al., 9 Jul 2025).
- EEG emotion recognition: a single MSST-Mamba layer outperforms transformer baselines such as EmT on SEED accuracy, with fewer parameters and real-time edge capability (Liu et al., 21 Jul 2025).
Metrics span RMSE, MAE, MAPE, FID, VFD, PVar, and domain-specific measures (e.g., hit rate, NDCG@k, MPJPE), with consistent outperformance of, or parity with, attention/GNN/transformer ensembles at much lower computation (Li et al., 2024, Tang et al., 9 Jul 2025).
6. Strengths, Limitations, and Future Directions
STG-Mamba’s key strengths are:
- Explicit, selective, graph-aware state evolution: Combines the flexibility of attention with input-adaptive, linear-time state propagation (Li et al., 2024).
- Unified, natively multivariate modeling: Especially in MCST-Mamba, simultaneously handles arbitrary channels, modeling joint temporal and spatial dependencies without separate per-channel networks (Hamad et al., 5 Jul 2025).
- Scalability: FLOPs and inference scale linearly to large graphs/sequences, enabling real-time and edge-device deployment in domains like EEG (Liu et al., 21 Jul 2025).
- Built-in uncertainty quantification: Gaussian embedding or Kalman fusion yield estimates with associated variances, valuable for high-stakes forecasting (Pandey et al., 2024, Li et al., 2024).
Limitations and open challenges identified across works include:
- Interpretability: Internal state updates in deep SSMs are difficult to elucidate.
- Dependency on data quality: Some implementations (e.g., STG-Mamba for dance synthesis) depend heavily on data pre-processing quality (OpenPose extraction).
- Robustness and generalization: Many models validated on 2–3 domains; transfer to broader graph types or irregular/heterogeneous/partially observed graphs remains to be systematically validated.
- Handling missing/noisy data: Further specialization of state priors and preprocessing is needed for robust operation in real-world deployments (Li et al., 2024).
Potential directions include integrating richer walk-based or hypergraph spatial encodings, lightweight or diffusion-based generative components for high-fidelity synthesis, and cross-modal selective SSMs in settings such as multimodal behavior understanding (Choi et al., 2024, Tang et al., 9 Jul 2025).
For technical and reproducibility details on the various model architectures, equations, and benchmarks, see (Li et al., 2024, Pandey et al., 2024, Wang et al., 2024, Choi et al., 2024, Tang et al., 9 Jul 2025, Hamad et al., 5 Jul 2025), and (Liu et al., 21 Jul 2025).