
VideoMamba: Scalable Video Modeling Framework

Updated 22 February 2026
  • VideoMamba is a video modeling paradigm that employs adaptive selective state-space models to achieve linearly scalable spatiotemporal representations.
  • It uses bidirectional scanning combined with tubelet tokenization to capture both global and local dependencies efficiently.
  • Empirical results show significant FLOPs reduction and faster inference compared to Transformer-based approaches in various video tasks.

VideoMamba is a class of video modeling architectures that leverage the Mamba selective state-space model (SSM) framework to achieve linearly scalable, context-rich spatiotemporal representations for a wide range of video understanding, generation, restoration, and assessment tasks. Unlike Transformer-based models, which rely on quadratic-cost self-attention, VideoMamba variants utilize adaptable, input-dependent state-space recurrences and bidirectional scanning schemes, enabling efficient and effective modeling of both global and local dependencies in video data.

1. Foundations: Selective State Space Models and the Mamba Operator

At the core of VideoMamba is the continuous-time linear state-space system

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t)$$

which is discretized (e.g., via zero-order hold) as

$$\overline{A} = \exp(\Delta A), \qquad \overline{B} = (\exp(\Delta A) - I)(\Delta A)^{-1} B$$

$$h_k = \overline{A}\,h_{k-1} + \overline{B}\,x_k, \qquad y_k = C\,h_k + D\,x_k$$

The distinguishing feature of Mamba is its selective mechanism: rather than fixed parameters, the discretization step $\Delta$, as well as $B$ and $C$, are generated adaptively per time step by learned functions (“selectors”) of the current input. This yields an input-dependent scan with adaptive context gating, enabling dynamic modeling of long-term dependencies with linear time and memory complexity (Park et al., 2024, Li et al., 2024, Zhang et al., 2024).

Extending to video, the basic 1D selective SSM is generalized to scan over tokenized space-time patch sequences (e.g., after Conv3D tubelet embedding), enabling joint spatiotemporal reasoning with negligible computational overhead compared to attention-based methods.
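As a concrete illustration, the per-step selection and zero-order-hold discretization can be sketched in a few lines of NumPy. This is a minimal toy (a single scalar input channel, a diagonal $A$, and hypothetical selector weights `W_delta`, `W_B`, `W_C`), not the papers' implementation:

```python
import numpy as np

# Toy selective-SSM scan: Delta, B, C are generated per step from the
# current input, then the ZOH-discretized recurrence is applied.
# All weights here are random placeholders for illustration.
rng = np.random.default_rng(0)
d_state, seq_len = 4, 6                    # state size, number of tokens
A = -np.abs(rng.normal(size=d_state))      # stable diagonal A (as a vector)
D = rng.normal()                           # skip connection
W_delta, W_B, W_C = (rng.normal(size=d_state) for _ in range(3))

def selective_scan(x):
    """Scan a scalar-channel token sequence x with input-dependent params."""
    h = np.zeros(d_state)
    ys = []
    for x_k in x:
        # "Selectors": Delta, B, C depend on the current input token.
        delta = np.log1p(np.exp(W_delta * x_k))   # softplus keeps Delta > 0
        B = W_B * x_k
        C = W_C * x_k
        # Zero-order-hold discretization (diagonal A makes the inverse cheap).
        A_bar = np.exp(delta * A)
        B_bar = (A_bar - 1.0) / (delta * A) * B
        h = A_bar * h + B_bar * x_k
        ys.append(C @ h + D * x_k)
    return np.array(ys)

y = selective_scan(rng.normal(size=seq_len))
```

Because $\overline{A}$ and $\overline{B}$ change at every step, the recurrence can gate context in or out depending on the input, which is what distinguishes the selective scan from a fixed linear SSM.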

2. Architectural Variants and Spatiotemporal Scan Schemes

Tubelet Tokenization and Embedding

VideoMamba architectures typically begin with 3D convolutional tubelet tokenization, partitioning a video $X \in \mathbb{R}^{3 \times T \times H \times W}$ into a sequence of patches over space and time. Each patch is projected to a feature embedding and augmented with learnable spatiotemporal position embeddings, frequently initialized via expansion from pretrained 2D image models for regularization (Park et al., 2024).
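The partitioning step can be illustrated with plain reshapes; a real model uses a learned strided Conv3D so patching and projection happen jointly, and the tubelet sizes, `d_model`, and random projection below are illustrative assumptions:

```python
import numpy as np

# Sketch of tubelet tokenization via reshape/transpose (illustrative only).
rng = np.random.default_rng(0)
C, T, H, W = 3, 8, 32, 32              # channels, frames, height, width
t, p = 2, 16                           # tubelet depth, spatial patch size
d_model = 64                           # embedding dimension (assumed)

video = rng.normal(size=(C, T, H, W))
# Partition into (T/t) x (H/p) x (W/p) tubelets, flatten each to a vector.
tubelets = (video
            .reshape(C, T // t, t, H // p, p, W // p, p)
            .transpose(1, 3, 5, 0, 2, 4, 6)       # grid dims first
            .reshape(-1, C * t * p * p))          # (num_tokens, patch_dim)
W_embed = rng.normal(size=(C * t * p * p, d_model)) * 0.02
pos = rng.normal(size=(tubelets.shape[0], d_model)) * 0.02  # learned in practice
tokens = tubelets @ W_embed + pos      # (T/t * H/p * W/p, d_model) sequence
```

Here 8 frames of 32×32 with 2×16×16 tubelets yield 4 × 2 × 2 = 16 tokens; real inputs produce thousands, which is where linear scaling in sequence length matters.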

Bidirectional Spatiotemporal Scanning

A typical VideoMamba block applies both forward and backward SSM scans over the flattened space-time token sequence:

  • Forward SSM: Scans tokens in natural video order.
  • Backward SSM: Scans the reversed sequence, often using full spatiotemporal reversal rather than temporal-only.

Outputs from both passes are merged (concatenation or summation followed by projection), capturing signals from both temporal directions and across spatial context (Park et al., 2024, Lu et al., 2024, Li et al., 2024).
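The bidirectional pattern can be sketched as follows, using a simple causal exponential moving average as a stand-in for the selective scan; the merge rule (summation followed by projection) follows the description above, and all weights are random placeholders:

```python
import numpy as np

# Bidirectional block sketch: run a causal scan forward and on the fully
# reversed (spatiotemporal) token sequence, then merge and project.
rng = np.random.default_rng(0)

def scan(x, alpha=0.9):
    """Causal EMA scan as a placeholder for the selective SSM."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for k in range(len(x)):
        h = alpha * h + (1 - alpha) * x[k]
        out[k] = h
    return out

tokens = rng.normal(size=(16, 64))     # flattened space-time tokens
W_out = rng.normal(size=(64, 64)) * 0.02
fwd = scan(tokens)                     # natural video order
bwd = scan(tokens[::-1])[::-1]         # full spatiotemporal reversal
merged = (fwd + bwd) @ W_out           # summation, then projection
```

After the merge, each position has aggregated context from both earlier and later tokens, recovering the non-causal receptive field that self-attention provides, at linear cost.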

Specialized Scans and Hierarchical Design

Later variants, such as VideoMambaPro, introduce masked backward computation and elemental residual connections to address deficiencies like historical decay and element contradiction in Mamba’s recurrence, further enhancing modeling expressiveness without losing linear efficiency (Lu et al., 2024). Dual-branch, multi-stage, or multi-directional schemes (e.g., dual-branch for violence detection (Senadeera et al., 23 May 2025), or four-way directional scans in frame interpolation (Zhang et al., 2024)) tailor the scan order and module topology to optimally balance spatial detail and long-range temporal modeling.

3. Computational Complexity and Efficiency Analysis

A fundamental advantage of VideoMamba is its linear time and memory growth with sequence length $n$ (the total number of space-time tokens):

  • Self-attention: $O(n^2 d)$
  • VideoMamba selective SSM: $O(n d)$

Empirical results confirm large reductions in FLOPs and memory requirements:

  • For a standard 16-frame $224^2$ video classification input, VideoMamba (26M parameters) requires $\approx 34$ GFLOPs per inference, compared to $\approx 88$ GFLOPs for VideoSwin-T at a similar or larger parameter count (Park et al., 2024).
  • Throughput measurements show up to $8\times$ faster inference for long, high-resolution videos (Park et al., 2024, Li et al., 2024).
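The asymptotic gap can be sanity-checked with back-of-envelope arithmetic; the token count, model width, and state size below are illustrative, and per-layer constants are ignored:

```python
# Rough token-mixing cost comparison (orders of magnitude only).
def attention_cost(n, d):
    return n * n * d          # O(n^2 d) self-attention

def ssm_cost(n, d, d_state=16):
    return n * d * d_state    # O(n d) selective scan, fixed state size

n = 16 * (224 // 16) ** 2     # 16 frames of 224^2 with 16x16 patches -> 3136 tokens
d = 512                       # assumed model width
ratio = attention_cost(n, d) / ssm_cost(n, d)
# ratio = n / d_state = 3136 / 16 = 196
```

The ratio grows linearly with the number of tokens, so the advantage widens exactly in the long-video and high-resolution regimes the cited works target.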

This enables practical deployment for resource-intensive or real-time scenarios, such as long-form video analysis, high-resolution video restoration, and online video streaming (Mi et al., 22 Apr 2025, Xu et al., 2024, Li et al., 2023).

4. Empirical Performance Across Applications

VideoMamba and its variants have demonstrated competitive or state-of-the-art results across diverse video tasks:

| Task | Dataset | VideoMamba/Pro (Top-1 / major metric) | Transformer/Other |
|---|---|---|---|
| Action recognition | Kinetics-400 | 76.1–91.7% (model-dependent) (Park et al., 2024, Lu et al., 2024) | 78.8% (VideoSwin-T, 4.8T FLOPs) |
| Fine-grained action (short) | SSV2 | up to 76.4% (Park et al., 2024, Lu et al., 2024) | 57.2% (VideoSwin-T) |
| Long-range video classification | Breakfast / COIN | 91.5% / 89.5% (Park et al., 2024, Li et al., 2024) | 88.4% (ViS4mer, feature-based) |
| Frame interpolation | XTest, Vimeo90K | +0.8 dB / +0.98 dB over prior SOTA (Zhang et al., 2024) | SGM-VFI, AMT-G (lower) |
| Super-resolution | REDS4 | 33.11 dB (16 frames, 4× SR) (Tran et al., 28 Jun 2025) | 32.90 dB (IART); 31.06 dB (full attention) |
| Anomaly detection | Ped2 | 98.5% AUC (Li et al., 2024, Lyu et al., 27 Mar 2025) | 97.0–97.7% (MemAE, MNAD, PDM-Net) |
| Video quality assessment | LSVQ | 0.883 / 0.899 (SROCC / PLCC) (Mi et al., 22 Apr 2025) | 0.872 / 0.874 (FAST-VQA); slower |
| Text-to-video generation | VBench | 81.9% (total score); 45% lower FLOPs (Huang et al., 12 Jun 2025) | 81.6% (full-attention PyramidFlow, 55T FLOPs) |

Further, VideoMamba scales robustly without extensive pretraining (with advances such as self-distillation (Li et al., 2024)), and variants have been successfully applied in multi-modal fusion, violence detection, demoiréing, and other domains (Senadeera et al., 23 May 2025, Xu et al., 2024, Zhang et al., 2024, Li et al., 2023, Chen et al., 2024, Huang et al., 12 Jun 2025).

5. Ablations, Analysis, and Model Design Insights

Numerous studies dissect VideoMamba’s components:

  • Temporal Order Sensitivity: Severe drop in accuracy if frame order is shuffled, confirming genuine use of temporal structure (Park et al., 2024).
  • Backward Scan Choices: Spatiotemporal reversal outperforms temporal- or spatial-only reversal in bidirectional SSM blocks (Park et al., 2024).
  • Positional Embedding: Initializing positional embeddings by temporal expansion from ImageNet-2D outperforms alternatives (Park et al., 2024).
  • Delta Parameter Visualization: Early model layers use uniformly high $\Delta$ (broad context), while deeper layers adapt $\Delta$ to salient moving regions (Park et al., 2024).
  • Regularization and Pretraining: Empirical accuracy improves significantly with Kinetics-400 pretraining and RandAugment regularization (Park et al., 2024).
  • Elemental Residuals and Masked Backward: VideoMambaPro demonstrates that incorporating per-token residuals and masked backward computation directly ameliorates “historical decay” and “element contradiction” effects, yielding large accuracy gains with negligible extra cost (Lu et al., 2024).
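To make the two fixes concrete, here is a heavily simplified sketch. The mechanics and names are assumptions inferred from the description above, not VideoMambaPro's actual code; a causal EMA again stands in for the selective scan:

```python
import numpy as np

# Hedged sketch of two VideoMambaPro-style ideas (assumed mechanics):
# (1) an elemental (per-token) residual added to the block output, and
# (2) a backward scan whose output at position k removes the token's own
#     contribution so forward and backward passes do not double-count it.
rng = np.random.default_rng(0)

def scan(x, alpha=0.9):
    """Causal EMA scan as a placeholder for the selective SSM."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for k in range(len(x)):
        h = alpha * h + (1 - alpha) * x[k]
        out[k] = h
    return out

def masked_backward(x, alpha=0.9):
    # Scan the reversed sequence, then subtract each token's own (unmixed)
    # term so position k aggregates only strictly-future context.
    rev = scan(x[::-1], alpha)[::-1]
    return rev - (1 - alpha) * x

tokens = rng.normal(size=(16, 64))
out = scan(tokens) + masked_backward(tokens) + tokens  # elemental residual
```

The per-token residual keeps each position's own signal from decaying through the recurrence, while the masking removes the overlap between the two scan directions; both are cheap elementwise operations, consistent with the paper's claim of negligible extra cost.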

Curriculum learning strategies and hybrid fusion with local or cross-modal modules are also practiced to maximize modeling power while controlling complexity for diverse tasks (Zhang et al., 2024, Lyu et al., 27 Mar 2025, Mi et al., 22 Apr 2025, Xu et al., 2024).

6. Extensions, Specializations, and Limitations

VideoMamba’s versatility has prompted extensions across vision, video, and multi-modal tasks, several of which are surveyed in the directions below.

7. Open Challenges and Future Directions

VideoMamba’s trajectory includes open questions in the following areas:

  • Ultra-large Scale and Multimodal Models: Extending to hour-long or multi-modal contexts (e.g., video, audio, text) via cross-modal MambaTwister blocks or hybrid SSM-attention fusion (Zhang et al., 2024, Huang et al., 12 Jun 2025).
  • Real-time and Embedded Inference: Further latency reductions are feasible with custom CUDA kernels and chunked scan scheduling (Tran et al., 28 Jun 2025, Xu et al., 2024).
  • Adaptive Sparsification: Token skipping and learned scan patterns could reduce computational footprint and address high-resolution or long-horizon video data (Zhang et al., 2024).
  • Training Stability and Regularization: Improved $\Delta$-schedule regularization, adaptive masking strategies, and parameter conditioning are ongoing research areas for robust training (Lu et al., 2024, Zhang et al., 2024).
  • Hybridization with Attention: Mamba-attention combinations provide strong empirical results and allow flexible trade-offs between local detail and global efficiency (Gao et al., 2024, Huang et al., 12 Jun 2025, Li et al., 2024).

VideoMamba and its descendants represent a significant expansion of the SSM paradigm, furnishing practical, scalable, and accurate tools for the next generation of video understanding, restoration, generation, and assessment tasks across varied computational and application regimes.
