
SMTrack: State-aware Mamba Tracker

Updated 9 February 2026
  • State-aware Mamba Tracker (SMTrack) is a family of visual tracking architectures that leverage adaptive state-space models and selective gating for robust single- and multi-modal object tracking.
  • It replaces conventional self-attention with recurrent state-space modeling to deliver state-of-the-art accuracy while significantly reducing computational costs and memory use.
  • SMTrack supports diverse applications including RGB-event single object tracking, multi-object tracking, and LiDAR-based 3D tracking through efficient spatio-temporal fusion and adaptive token propagation.

The State-aware Mamba Tracker (SMTrack) refers to a family of visual tracking architectures that leverage input-adaptive state space models (specifically, the "Mamba" formulation) for efficient and robust object tracking across single- and multi-object settings in RGB, event-based, and 3D LiDAR scenarios. SMTrack is characterized by its replacement of conventional self-attention with linear-complexity recurrent state-space modeling, spatial-temporal feature fusion, and selective gating mechanisms, yielding state-of-the-art accuracy at orders-of-magnitude lower computational cost than Transformer-based trackers (Huang et al., 2024, Xie et al., 2024, Ma et al., 2 Feb 2026, Xiao et al., 2024, Tian et al., 19 Nov 2025).

1. State-Space Model Formulation and the Mamba Paradigm

At the theoretical core of SMTrack is the discrete-time state-space model (SSM). For a sequence $\{x_t\}$, a canonical SSM updates a hidden state $h_t$ and produces an output $y_t$:

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t$$

where $(\bar{A}, \bar{B}, C)$ are matrix parameters obtained by zero-order-hold discretization of a continuous SSM. The Mamba architecture generalizes this with input-adaptive transition and gating:

$$h_{t+1} = \hat{A}(x_t)\, h_t + \hat{B}(x_t)\, x_t$$

where $\hat{A}$ and $\hat{B}$ are learned functions of the current input. This adaptivity enables dynamic modeling of context and motion beyond what is possible with static transition matrices, and can be implemented efficiently via selective scans over token sequences, supporting near-linear complexity in sequence length (Tian et al., 19 Nov 2025, Huang et al., 2024).
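The adaptive recurrence above can be illustrated with a minimal NumPy sketch. This is a toy with a diagonal $\hat{A}$; the projection matrices `W_A` and `W_B` are assumptions standing in for the learned gating functions, not the published implementation:

```python
import numpy as np

def selective_ssm_scan(x, W_A, W_B, C):
    """Minimal sketch of an input-adaptive (Mamba-style) SSM scan.

    x:   (T, D) input token sequence
    W_A: (D, N) projects each input to per-state decay logits (assumed form)
    W_B: (D, N) projects each input to per-state driving terms (assumed form)
    C:   (N,)   readout vector mapping hidden state to output
    """
    T, _ = x.shape
    h = np.zeros(C.shape[0])
    ys = []
    for t in range(T):
        # Input-dependent transition A_hat(x_t): sigmoid keeps decay in (0, 1).
        a_t = np.exp(-np.logaddexp(0.0, -(x[t] @ W_A)))
        # Input-dependent drive B_hat(x_t) x_t, collapsed into one projection.
        b_t = x[t] @ W_B
        h = a_t * h + b_t            # h_t = A_hat(x_t) h_{t-1} + B_hat(x_t) x_t
        ys.append(C @ h)             # y_t = C h_t
    return np.array(ys)
```

The loop runs once per token, so cost is linear in sequence length, in contrast to the quadratic pairwise interactions of self-attention.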

For modalities with multiple observation streams (e.g., RGB + event), the latent state aggregates information from both, and Mamba blocks process the concatenated token streams via cross-modal recurrent updates (Huang et al., 2024). In the probabilistic view, the latent state $s_t$ can be interpreted as the tracker's estimate of the object state, with generative and measurement equations for each modality:

$$s_t = f(s_{t-1}, u_t) + w_t, \qquad o^{f}_t,\, o^{e}_t = h_{f,e}(s_t) + v^{f,e}_t$$

where $u_t$ is the fused observation and $w_t, v^{f,e}_t$ are process/measurement noise terms (Huang et al., 2024).
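As a toy illustration of this shared-latent-state view, assuming a linear transition $f(s, u) = F s + G u$ and a simple convex fusion of the two observations (illustrative only, not the paper's FusionMamba block):

```python
import numpy as np

def fused_state_update(s_prev, o_rgb, o_evt, F, G, w_rgb=0.5):
    """One latent-state update for RGB + event fusion (toy sketch).

    Assumes a linear transition f(s, u) = F s + G u and a fixed
    convex weighting of the two modality observations; noise terms
    w_t, v_t are omitted for clarity.
    """
    u = w_rgb * o_rgb + (1.0 - w_rgb) * o_evt   # fused observation u_t
    s = F @ s_prev + G @ u                      # s_t = f(s_{t-1}, u_t)
    return s
```

In the actual architecture the fusion weights are input-dependent gates rather than a fixed scalar, which is precisely what the selective Mamba formulation provides.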

2. Architectural Variants and Data Flows

SMTrack architectures vary by modality and application domain, but share several foundational components.

(a) RGB-Event Single Object Tracking (Huang et al., 2024)

  • Input: Synchronized RGB frames and event streams, producing template and search patches per modality.
  • Modality-specific backbones: Each modality is embedded and processed by Vision-Mamba blocks.
  • Fusion: A FusionMamba block implements cross-modal interaction via gated recurrent updates, eschewing dot-product attention.
  • Localization head: The fused search-region features feed a lightweight Conv-based head to regress classification and bounding-box parameters.

(b) Pure RGB Visual Tracking (Xie et al., 2024, Ma et al., 2 Feb 2026)

  • Track token mechanism: A per-frame "track token" summarizes target appearance and is aggregated over a temporal window by an autoregressive Mamba block and a cross-attention layer.
  • Temporal modeling: Token history is processed for temporal context, then used to guide search-region reweighting for robust localization.
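A schematic of the track-token mechanism, under simplifying assumptions: a fixed scalar decay stands in for the learned Mamba gate, and dot-product reweighting stands in for the full cross-attention layer. Class and method names are hypothetical:

```python
import numpy as np
from collections import deque

class TrackTokenAggregator:
    """Sketch of a per-frame track-token history over a temporal window,
    aggregated by a simple gated recurrence and used to reweight
    search-region tokens (illustrative, not the published module)."""

    def __init__(self, dim, window=8, alpha=0.9):
        self.history = deque(maxlen=window)  # sliding window of track tokens
        self.alpha = alpha                   # stand-in for a learned gate
        self.dim = dim

    def update(self, track_token):
        self.history.append(track_token)

    def summary(self):
        # Autoregressive aggregation: h <- alpha * h + (1 - alpha) * token
        h = np.zeros(self.dim)
        for tok in self.history:
            h = self.alpha * h + (1.0 - self.alpha) * tok
        return h

    def reweight_search(self, search_tokens):
        # Cross-attention-style reweighting by similarity to the summary.
        h = self.summary()
        scores = search_tokens @ h
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return search_tokens * w[:, None]
```

The key property being illustrated: because the aggregation is a recurrence, extending the temporal window adds only O(1) work per new frame.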

(c) Multi-Object Tracking (MOT) (Xiao et al., 2024)

  • Motion prediction: Bi-directional Mamba blocks replace Kalman filtering; offsets are predicted using a learned data-driven transition function, improving robustness to non-linear or abrupt motions.
  • Patching missing detections: The autoregressive application of the Mamba motion predictor fills gaps during occlusion.
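The gap-filling step can be sketched as repeated application of a motion predictor. Here `predict_offset` is a hypothetical callable standing in for the learned bi-directional Mamba predictor:

```python
import numpy as np

def patch_missing_track(last_box, predict_offset, num_missed):
    """Autoregressive gap-filling sketch: when detections are missed
    (e.g., under occlusion), the motion predictor is applied repeatedly
    to extrapolate the box forward through the gap.

    last_box:       (4,) last confirmed box, e.g., (x, y, w, h)
    predict_offset: callable mapping a box to its predicted per-frame offset
    num_missed:     number of frames to patch
    """
    box = np.asarray(last_box, dtype=float)
    trajectory = []
    for _ in range(num_missed):
        box = box + predict_offset(box)   # box_{t+1} = box_t + f_theta(box_t)
        trajectory.append(box.copy())
    return trajectory
```

Unlike a Kalman filter with a hand-set constant-velocity model, the offset function here is learned, which is what gives the robustness to non-linear motion described above.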

(d) LiDAR-Based 3D Tracking (Tian et al., 19 Nov 2025)

  • Mamba-based Inter-frame Propagation (MIP): Historical spatiotemporal features from memory banks are fused with current tokens using nearest-neighbor retrieval and bi-directional SSM refinement.
  • Grouped Feature Enhancement (GFEM): Channel-level splitting into foreground/background ‘experts’ with group-wise cross-attention addresses temporal redundancy and clutter.
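A minimal sketch of the channel-level split into expert groups. The group-wise cross-attention of GFEM itself is omitted; the even split and per-group normalization shown here are illustrative assumptions:

```python
import numpy as np

def grouped_feature_split(tokens, num_groups=2):
    """Split token channels into expert groups (e.g., foreground /
    background) and normalize each group independently, so each expert
    operates on a standardized view of its channels (toy sketch)."""
    D = tokens.shape[1]
    assert D % num_groups == 0, "channel count must divide evenly"
    groups = np.split(tokens, num_groups, axis=1)
    enhanced = [(g - g.mean(axis=0)) / (g.std(axis=0) + 1e-6) for g in groups]
    return np.concatenate(enhanced, axis=1)
```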

3. Selective State-aware Gating and Temporal Reasoning

A critical SMTrack innovation is the use of state-wise, input-dependent gating for selective temporal aggregation. In contrast to vanilla Mamba (which shares gating across hidden states), state-aware variants decompose the gating into channel-wise and state-wise components:

$$\Delta_{d}^{(n)} = \mathrm{SoftPlus}(\Delta_{\mathrm{channel}, d} + \Delta_{\mathrm{state}, n})$$

This enables each state to specialize, with some focusing on the target core and others on context, edges, or distractors, improving both discrimination and robustness (Ma et al., 2 Feb 2026).
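The decomposed gate can be written directly. This is a minimal sketch of the equation above; the tensor shapes are assumptions:

```python
import numpy as np

def softplus(z):
    # Numerically stable SoftPlus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def state_aware_delta(delta_channel, delta_state):
    """Decomposed gating: Delta[d, n] = SoftPlus(delta_channel[d] + delta_state[n]).

    delta_channel: (D,) per-channel component, shared across hidden states
    delta_state:   (N,) per-hidden-state component, shared across channels
    Returns a (D, N) positive gate, so each hidden state n can specialize
    per channel d rather than sharing one gate across all states.
    """
    return softplus(delta_channel[:, None] + delta_state[None, :])
```

The additive decomposition costs only D + N parameters per gate instead of D * N, while still producing a full per-channel, per-state gating table.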

Bi-directional SSMs propagate information both forward and backward in time for unordered sets (as in point clouds), supporting robust spatio-temporal feature refinement (Tian et al., 19 Nov 2025). In sliding-window or memory-bank variants, the state or token summaries at each time-step are combined, exploiting the linearity of the recurrence for efficient multi-template propagation.
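The bi-directional propagation can be sketched as two passes of a decaying recurrence, one forward and one backward, summed. The fixed `alpha` here is an assumption; in SMTrack the gate is input-dependent:

```python
import numpy as np

def bidirectional_ssm(x, alpha=0.9):
    """Bi-directional state propagation sketch: run a simple decaying
    recurrence (a stand-in for the selective SSM) over the token
    sequence in both directions and sum the two passes, so every token
    sees context from both sides."""
    def scan(seq):
        h = np.zeros_like(seq[0])
        out = []
        for tok in seq:
            h = alpha * h + (1.0 - alpha) * tok
            out.append(h)
        return np.array(out)
    fwd = scan(x)
    bwd = scan(x[::-1])[::-1]   # backward pass, re-aligned to input order
    return fwd + bwd
```

For unordered sets such as point-cloud tokens, summing both scan directions reduces the sensitivity to the (arbitrary) serialization order.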

4. Computational Complexity and Efficiency

SMTrack consistently achieves near-linear complexity in sequence length or feature-map size. This contrasts directly with Transformer-based trackers, whose self-attention incurs $O(N^2)$ cost and memory in the number of tokens or spatial elements.

Efficiency Properties:

  • RGB-Event SOT: 94.5% reduction in FLOPs and 88.3% reduction in parameters compared to ViT-S OSTrack; GPU memory reduced by ~9.5% (Huang et al., 2024).
  • General SOT: SMTrack-M384 achieves 48.7 GFLOPs and 34 FPS on TrackingNet; SMTrack-S256 achieves AUC=82.7% at 8.5 GFLOPs.
  • MOT: Inference per-frame tracking overhead is 11.4 ms (17 FPS total); motion prediction cost is negligible relative to detector (Xiao et al., 2024).
  • LiDAR 3D SOT: SMTrack sustains >100 FPS for high token counts (8K–16K), compared to <25 FPS for quadratic-memory Transformer architectures (Tian et al., 19 Nov 2025).

The linear-complexity design enables long-range temporal dependency modeling without runtime explosion. Hidden-state propagation eliminates repeated feature extraction or template aggregation during inference, enabling efficient updating even with frequent template set augmentation (Ma et al., 2 Feb 2026).

5. Training, Loss Functions, and Hyperparameters

Training in SMTrack encompasses both standard regression/classification and custom temporal consistency objectives:

  • Loss for RGB and event SOT: Weighted sum of focal loss for classification and $L_1$ plus GIoU losses for bounding-box regression, e.g., $L = L_{\mathrm{focal}} + 14 \cdot L_{1} + 1 \cdot L_{\mathrm{GIoU}}$ (Huang et al., 2024).
  • MOT: Smooth-L1 loss on predicted offsets between frames (Xiao et al., 2024).
  • 3D SOT: Combined cross-entropy masks, mean-square error for Hough voting centers, quality scores, and smooth-L1 for box regression (Tian et al., 19 Nov 2025).
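The GIoU term appearing in these objectives can be computed for axis-aligned boxes as follows. This is a standard textbook implementation, not code taken from the cited papers:

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes in (x1, y1, x2, y2) form.
    GIoU = IoU - (C - U) / C, where C is the area of the smallest
    enclosing box and U is the union area. The GIoU loss is 1 - GIoU."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clipped to zero if boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area
```

Unlike plain IoU, GIoU remains informative (and negative) for non-overlapping boxes, which is why it is the usual choice for the regression branch.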

Schedules typically use AdamW or similar optimizers with learning-rate warmup and decay, long training epochs (100–300), and substantial batch sizes, matching or exceeding Transformer-based settings. Backbone networks are either modality-specific (e.g., Vision-Mamba) or generic token-based encoders, with hyperparameters tuned for sequence length, state dimension, and gating window.

6. Benchmarks, Evaluation, and Comparative Results

SMTrack architectures have been extensively validated on both classical and modality-specific tracking benchmarks.

Results Summary Table:

| Task | Dataset | SMTrack Metric | Top Transformer/Memory Baseline | Gain |
| --- | --- | --- | --- | --- |
| Single-object RGB+Event | FELT | SR/PR = 43.5/55.6 | ViT-S: 40.0/50.9 | +3.5/+4.7 |
| Single-object RGB | TrackingNet | AUC = 85.0 | Comparable | ≈ SOTA |
| Multi-object (MOT) | DanceTrack | HOTA = 56.8 | OC_SORT: 54.6 | +2.2 |
| 3D LiDAR SOT | KITTI-HTV | Success/Precision = 64.0/81.7 | HVTrack: 57.5/72.2 | +6.5/+9.5 |

Extensive ablations confirm the independent benefit of state-wise gating, hidden-state interaction, and grouped attention for background/foreground separation. SMTrack demonstrates robustness to challenging dynamics (occlusions, deformations, skipped frames), consistently matching or exceeding accuracy of quadratic-complexity attention models and outperforming standard SSM or LSTM baselines (Huang et al., 2024, Xie et al., 2024, Ma et al., 2 Feb 2026, Xiao et al., 2024, Tian et al., 19 Nov 2025).

7. Applications and Impact Across Modalities

SMTrack’s modular, state-aware approach enables broad applicability:

  • Heterogeneous sensor fusion: Natively supports RGB/video frames and event-camera data with explicit modality-specific backbones and efficient token fusion, surpassing two-stream and late-fusion paradigms in both efficiency and accuracy (Huang et al., 2024).
  • Multi-object association in MOT: Data-driven motion modeling with bi-directional Mamba eliminates reliance on hand-tuned filters and compensates for missing detections through autoregressive patching (Xiao et al., 2024).
  • LiDAR-based object tracking: Feature propagation via SSM provides a hardware-friendly alternative to memory-bank Transformers, mitigating redundancy and supporting robust performance under high temporal variation (Tian et al., 19 Nov 2025).

These results suggest that the selective, state-aware SSM abstraction is a unifying foundation for efficient, high-performance tracking across diverse sensor modalities and problem setups. The ability to adaptively gate, aggregate, and specialize hidden states across both spatial and temporal dimensions distinguishes SMTrack within the contemporary tracking literature (Huang et al., 2024, Xie et al., 2024, Ma et al., 2 Feb 2026, Xiao et al., 2024, Tian et al., 19 Nov 2025).
