
Video Mamba Models: Scalable Video Processing

Updated 15 December 2025
  • Video Mamba models are neural architectures using structured state-space models with token-dependent adaptations to achieve linear scalability in video sequence processing.
  • They integrate pure SSM backbones, hybrid SSM-transformer designs, and adapter modules to efficiently capture spatiotemporal context in diverse video tasks.
  • Empirical evaluations reveal that these models offer competitive accuracy with significantly reduced FLOPs and memory usage across video classification, anomaly detection, and generation.

Video Mamba models refer to a class of neural architectures utilizing structured state-space models (SSMs)—notably the "Mamba" operator—for video sequence modeling and understanding. These models address the scalability and efficiency constraints of attention-based video transformers by enabling linear-complexity, global spatiotemporal context modeling. The field encompasses pure-SSM video backbones, hybrid designs combining SSM and attention, and adapter-style SSM modules integrated into existing vision-language stacks. Distinct variants (VideoMamba, VideoMambaPro, TimeViper, Vamba, VideoMAP, MoMa) have demonstrated competitive or superior accuracy versus state-of-the-art Transformer baselines, with significant reductions in FLOPs and memory usage across video classification, temporal localization, anomaly detection, video generation, video-language modeling, and video super-resolution.

1. Foundations: Structured State-Space Models and the Mamba Operator

The theoretical basis of Video Mamba models derives from the family of structured state-space models, which generalize temporal convolution by recurrently updating a low-dimensional latent state across sequence positions. In the continuous-time case, the SSM is defined as

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t).$$

Discretization via zero-order hold (ZOH) yields a linear recurrence

$$h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k,$$

with $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I)\Delta B$ (Xu et al., 2024, Park et al., 2024, Li et al., 2024).
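As a concrete illustration of the ZOH discretization and recurrence above, here is a minimal NumPy sketch for the common case of a diagonal state matrix $A$ (this is an illustrative reconstruction, not any paper's reference code):

```python
# Minimal NumPy sketch of ZOH discretization and the SSM recurrence for a
# diagonal state matrix A = diag(a_diag); an illustration, not reference code.
import numpy as np

def discretize_zoh(a_diag, B, delta):
    """A_bar = exp(delta*A), B_bar = (delta*A)^{-1}(exp(delta*A) - I)*delta*B,
    computed elementwise since A is diagonal."""
    A_bar = np.exp(delta * a_diag)
    B_bar = (A_bar - 1.0) / a_diag * B
    return A_bar, B_bar

def ssm_scan(x, a_diag, B, C, delta):
    """Run h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k over scalar inputs x."""
    A_bar, B_bar = discretize_zoh(a_diag, B, delta)
    h = np.zeros_like(a_diag)
    ys = []
    for xk in x:                      # O(n*N): one state update per position
        h = A_bar * h + B_bar * xk
        ys.append(C @ h)
    return np.array(ys)
```

The explicit loop makes the linear cost visible; production kernels replace it with a hardware-friendly parallel associative scan.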

Mamba extends this model by making parameters such as the step size $\Delta$, input projection $B$, and output projection $C$ token-dependent via lightweight neural networks,

$$\Delta_k, B_k, C_k = \text{LinearProj}(x_k),$$

yielding an "adaptive scan" that selectively gates and mixes information per input. This mechanism retains $\mathcal{O}(nN)$ complexity for a sequence of length $n$ and latent size $N$, enabling efficient modeling of long sequences where attention-based architectures incur quadratic cost.
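A hedged single-channel sketch of this selective scan follows; the projection weights below are random stand-ins (real models learn them), and driving the state with one feature of each token is a simplification for brevity:

```python
# Sketch of a selective (token-dependent) scan: delta_k, B_k, C_k are computed
# per token from hypothetical random projection weights, then ZOH-discretized.
import numpy as np

rng = np.random.default_rng(0)
D, N = 8, 4                                    # token dim, latent state size
W_delta = rng.normal(size=(D,)) * 0.1          # stand-in projection weights
W_B = rng.normal(size=(N, D)) * 0.1
W_C = rng.normal(size=(N, D)) * 0.1
a_diag = -np.abs(rng.normal(size=(N,))) - 0.1  # stable diagonal A (negative)

def selective_scan(X):
    """X: (L, D) tokens -> (L,) outputs in O(L*N*D) time."""
    h = np.zeros(N)
    ys = []
    for x in X:
        delta = np.logaddexp(0.0, W_delta @ x)  # softplus keeps step size > 0
        Bk, Ck = W_B @ x, W_C @ x               # token-dependent projections
        A_bar = np.exp(delta * a_diag)          # ZOH for diagonal A
        B_bar = (A_bar - 1.0) / a_diag * Bk
        h = A_bar * h + B_bar * x[0]            # drive one channel (sketch)
        ys.append(Ck @ h)
    return np.array(ys)
```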

2. Spatiotemporal Modeling: Video-Specific Adaptations

Video Mamba architectures extend the 1D SSM core to handle spatiotemporal video data by strategically flattening, scanning, and merging across spatial and temporal axes (Xu et al., 2024, Park et al., 2024, Chen et al., 2024). The canonical sequence of operations includes:

  • Tubelet or patch embedding: transforming video frames $V \in \mathbb{R}^{T \times H \times W \times C}$ into a set of non-overlapping tubelets or patches, often via a 3D convolutional stem.
  • Positional encoding: augmenting spatial and temporal tokens with learned or sinusoidal embeddings.
  • Bidirectional scanning: running SSMs forward and backward over the flattened sequence, employing various permutations (spatial, temporal, spatiotemporal reversals) and aggregating outputs via addition or concatenation.
  • Cross-scan fusion: for models such as VMamba/DBM/ViViM, multiple axis-wise SSM scans (e.g., across $T$, $H$, $W$) are performed and their outputs fused either via pointwise sum or adaptive mechanisms.

This "flatten, scan, merge" process results in efficient global context propagation while preserving local spatiotemporal correlations. Specialized variants—such as VideoMambaPro—introduce masked backward computation and elemental residual connections to address issues of historical decay and element contradiction inherent in lower-triangular SSM matrices, further enhancing spatial context modeling (Lu et al., 2024).
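The "flatten, scan, merge" pattern can be sketched as follows; the decayed cumulative scan below is a stand-in for a full SSM layer, and the spatial-first flattening order is an assumption (different models choose different permutations):

```python
# Illustrative sketch of bidirectional scanning over a flattened spatiotemporal
# token grid, fused by addition; not a specific model's implementation.
import numpy as np

def causal_scan(X, decay=0.9):
    """Stand-in for an SSM scan: exponentially decayed running state."""
    h = np.zeros(X.shape[1])
    out = np.empty_like(X)
    for i, x in enumerate(X):
        h = decay * h + x
        out[i] = h
    return out

def bidirectional_video_scan(tokens):
    """tokens: (T, H, W, D) tubelet embeddings -> same-shape output."""
    T, H, W, D = tokens.shape
    flat = tokens.reshape(T * H * W, D)   # spatial-first, then temporal order
    fwd = causal_scan(flat)               # forward pass over the sequence
    bwd = causal_scan(flat[::-1])[::-1]   # backward pass, re-reversed
    return (fwd + bwd).reshape(T, H, W, D)  # merge by pointwise addition
```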

3. Backbone Architectures and Hybrid Designs

Video Mamba manifests in several backbone configurations, classified broadly as pure-SSM, attention-SSM hybrids, and adapter-style modules (Xu et al., 2024, Liu et al., 16 Mar 2025, Xu et al., 20 Nov 2025, Ren et al., 14 Mar 2025):

  • Pure SSM Backbones: Stacks of bidirectional selective SSM layers on spatial-temporal tokens. Each layer alternates SSM scans with position-wise feed-forward networks. VideoMamba-Base and VideoMamba-Light exemplify such designs (Xu et al., 2024, Park et al., 2024, Li et al., 2024).
  • Hybrid Mamba-Transformer Backbones: Interleave linear-complexity SSM layers with a small number of full self-attention blocks (e.g., 27 Mamba, 4 Transformer layers in TimeViper), balancing scalability with expressive modeling of non-local dependencies (Xu et al., 20 Nov 2025, Ren et al., 14 Mar 2025, Liu et al., 16 Mar 2025).
  • Adapters for Image Foundation Models: MoMa demonstrates parameter-efficient adaptation by injecting SSMs via a SeqMod (sequence modulation) operator into frozen image transformers, enabling full spatiotemporal modeling with minimal new parameters (Yang et al., 29 Jun 2025).
  • Multi-Modal and Feedback Models: Extensions such as H-MBA (for autonomous driving) and MUFM (for micro-video popularity) incorporate hierarchical, multi-granularity SSM blocks and couple them with cross-modal attention/retrieval (Chen et al., 8 Jan 2025, Lu et al., 2024).

This modular flexibility facilitates plug-and-play integration into existing video-LLM stacks (language models plus vision encoders), multimodal fusion architectures, and even video-generative diffusion pipelines.
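One way to visualize the hybrid Mamba-Transformer layout is a layer-type schedule in the spirit of TimeViper's 27 Mamba + 4 attention layers; the even-spacing placement rule below is an assumption for illustration, not TimeViper's actual configuration:

```python
# Hypothetical builder for a hybrid layer schedule: n_attn attention layers
# spread evenly among n_mamba linear-complexity SSM layers.
def hybrid_schedule(n_mamba=27, n_attn=4):
    total = n_mamba + n_attn
    # Place attention layers at evenly spaced depths (assumed spacing rule).
    attn_idx = {round((i + 1) * total / (n_attn + 1)) - 1 for i in range(n_attn)}
    return ["attn" if i in attn_idx else "mamba" for i in range(total)]
```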

4. Computational Complexity, Scalability, and Efficiency

A central motivation for Video Mamba is to mitigate the quadratic cost bottleneck of attention (Xu et al., 2024, Park et al., 2024, Chen et al., 2024):

  • Attention: each layer incurs $\mathcal{O}(L^2)$ time and memory for sequence length $L$.
  • Mamba SSM: each layer operates in $\mathcal{O}(LN)$, with latent state size $N$ typically $\ll L$.

Bidirectional variants double this cost but remain linear. Empirical studies demonstrate that efficiency does not come at the expense of accuracy: Video Mamba models regularly match or exceed the performance of convolutional and Transformer-based backbones on Kinetics-400, SSv2, Breakfast, COIN, action localization, VideoQA, and retrieval tasks (Park et al., 2024, Li et al., 2024, Chen et al., 2024, Tran et al., 28 Jun 2025).
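The asymptotic gap can be made concrete with a back-of-the-envelope count of interaction terms per layer (the token counts below are assumed, not measured numbers from any paper):

```python
# Rough per-layer cost comparison: full attention O(L^2) vs. SSM scan O(L*N).
def attention_cost(L):
    return L * L                 # pairwise token interactions

def ssm_cost(L, N=16):
    return L * N                 # linear recurrence with latent size N

L = 64 * 16 * 16                 # e.g. 64 frames x 16x16 tokens = 16384 tokens
ratio = attention_cost(L) / ssm_cost(L, N=16)
# ratio = L / N: the SSM layer needs L/N-fold fewer interaction terms
```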

5. Applications and Domains

Video Mamba models have been deployed across a broad spectrum of video understanding tasks, including video classification, temporal action localization, anomaly detection, video generation, video-language modeling, and video super-resolution.

6. Experimental Evaluations and Ablations

Across published experiments:

| Task | Best SSM Acc./AUC | Transformer/CNN Baseline | Relative FLOPs/Memory | Reference |
| --- | --- | --- | --- | --- |
| Kinetics-400, Top-1 | 90.3% (VMP-M) | 82.4–87.6% | ↓5–10× | (Park et al., 2024, Lu et al., 2024) |
| SSv2, Top-1 | 76.4% (VMP-M) | 68.3% | ↓5–10× | (Park et al., 2024, Lu et al., 2024) |
| Breakfast, Acc. | 97.9% (VideoMAP-M) | 94.3% (ViS4mer) | — | (Liu et al., 16 Mar 2025, Li et al., 2024) |
| Ped2, AUC | 98.5% (VADMamba) | 97.0% (MNAD) | ↓FPS | (Lyu et al., 27 Mar 2025, Li et al., 2024) |
| Long Video, LVBench | 42.1% (Vamba) | 37.8% (LongVU) | ↓50% Mem | (Ren et al., 14 Mar 2025) |
| Text-to-Video Gen., FVD | 45.0 (Matten) | 34.0 (Latte) | ↓25% FLOPs | (Gao et al., 2024) |

Empirical ablations confirm:

  • Elemental residuals and masked backward SSM remove historical decay and boost accuracy by 4–8 pp (Lu et al., 2024).
  • Windowed attention combined with full-sequence SSM outperforms alternatives (add, concat, scalar AdaN) in hybrid models (Yang et al., 29 Jun 2025).
  • Multi-level SSM context and query fusion in H-MBA improves mIoU by +5.5 points vs. optical-flow SOTA (Chen et al., 8 Jan 2025).
  • Plug-and-play Mamba adapters match or surpass full-finetune results with fractionally more parameters (Yang et al., 29 Jun 2025).

7. Challenges, Open Problems, and Future Directions

Video Mamba research continues to face multiple active challenges (Xu et al., 2024, Park et al., 2024, Liu et al., 16 Mar 2025, Yang et al., 29 Jun 2025, Xu et al., 20 Nov 2025):

  • Axis selection and fusion: Optimally sequencing scan axes (spatial, temporal) and developing learned gating or hybrid convolutional-SSM layers for adaptive redundancy reduction.
  • Hierarchy and modularity: Designing block-sparse SSM hierarchies for scalable high-resolution and long-form video pipelines.
  • Multi-modal and retrieval integration: Extending plug-in Mamba adapters for seamless fusion across visual, text, audio, and feedback modalities in LLM stacks and retrieval engines.
  • Efficient hardware realization: Improving kernel fusion and quantization, and SSM kernel-optimized deployment, to close real-time inference gaps with mature CNN toolchains.
  • Foundation model scaling and generalization: Addressing overfitting and capacity limitations at large scale (e.g., >500M parameters), typically mitigated by hybrid attention-SSM stacking or autoregressive masked pretraining (Liu et al., 16 Mar 2025).
  • Interpretability: Advances in hybrid stacks have revealed new insights into information aggregation and token redundancy, motivating research into in-LLM compression and module-level interpretability (Xu et al., 20 Nov 2025).
  • Medical and video anomaly domains: Further application-specific tuning (e.g., frequency-enhanced VAE, boundary affine constraints) is needed for clinical adoption and dense localization (Yang et al., 2024, Wang et al., 2024, Li et al., 2024).

In summary, Video Mamba models present a linearly-scalable, context-sensitive paradigm for video modeling that rivals or exceeds transformer-based architectures across compute, memory, and accuracy axes. Ongoing developments in hybridization, modularity, and multi-modal adaptation promise further advances across diverse computer vision and video-language tasks.

