
SMTrack: State-aware Mamba Tracker

Updated 9 February 2026
  • State-aware Mamba Tracker (SMTrack) is a family of visual tracking architectures that leverage adaptive state-space models and selective gating for robust single- and multi-modal object tracking.
  • It replaces conventional self-attention with recurrent state-space modeling to deliver state-of-the-art accuracy while significantly reducing computational costs and memory use.
  • SMTrack supports diverse applications including RGB-event single object tracking, multi-object tracking, and LiDAR-based 3D tracking through efficient spatio-temporal fusion and adaptive token propagation.

The State-aware Mamba Tracker (SMTrack) refers to a family of visual tracking architectures that leverage input-adaptive state space models (specifically, the "Mamba" formulation) for efficient and robust object tracking across single- and multi-object settings in RGB, event-based, and 3D LiDAR scenarios. SMTrack is characterized by its replacement of conventional self-attention with linear-complexity recurrent state-space modeling, spatial-temporal feature fusion, and selective gating mechanisms, yielding state-of-the-art accuracy at orders-of-magnitude lower computational cost than Transformer-based trackers (Huang et al., 2024, Xie et al., 2024, Ma et al., 2 Feb 2026, Xiao et al., 2024, Tian et al., 19 Nov 2025).

1. State-Space Model Formulation and the Mamba Paradigm

At the theoretical core of SMTrack is the discrete-time state-space model (SSM). For a sequence $\{x_t\}$, a canonical SSM updates a hidden state $h_t$ and produces an output $y_t$:

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t$$

where $(\bar{A}, \bar{B}, C)$ are matrix parameters obtained by zero-order-hold discretization of a continuous SSM. The Mamba architecture generalizes this with input-adaptive transition and gating:

$$h_{t+1} = \hat{A}(x_t)\, h_t + \hat{B}(x_t)\, x_t$$

where $\hat{A}$ and $\hat{B}$ are learned functions of the current input. This adaptivity enables dynamic modeling of context and motion beyond what is possible with static transition matrices, and can be implemented efficiently via selective scans over token sequences, supporting near-linear complexity in sequence length (Tian et al., 19 Nov 2025, Huang et al., 2024).
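The adaptive recurrence above can be illustrated with a minimal NumPy sketch. This is a toy with a diagonal $\hat{A}$; the projection matrices `W_A` and `W_B` are assumptions standing in for the learned gating functions, not the published implementation:

```python
import numpy as np

def selective_ssm_scan(x, W_A, W_B, C):
    """Minimal sketch of an input-adaptive (Mamba-style) SSM scan.

    x:   (T, D) input token sequence
    W_A: (D, N) projects each input to per-state decay logits (assumed form)
    W_B: (D, N) projects each input to per-state driving terms (assumed form)
    C:   (N,)   readout vector mapping hidden state to output
    """
    T, _ = x.shape
    h = np.zeros(C.shape[0])
    ys = []
    for t in range(T):
        # Input-dependent transition A_hat(x_t): sigmoid keeps decay in (0, 1).
        a_t = np.exp(-np.logaddexp(0.0, -(x[t] @ W_A)))
        # Input-dependent drive B_hat(x_t) x_t, collapsed into one projection.
        b_t = x[t] @ W_B
        h = a_t * h + b_t            # h_t = A_hat(x_t) h_{t-1} + B_hat(x_t) x_t
        ys.append(C @ h)             # y_t = C h_t
    return np.array(ys)
```

The loop runs once per token, so cost is linear in sequence length, in contrast to the quadratic pairwise interactions of self-attention.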

For modalities with multiple observation streams (e.g., RGB + event), the latent state aggregates information from both, and Mamba blocks process the concatenated token streams via cross-modal recurrent updates (Huang et al., 2024). In the probabilistic view, the latent state $s_t$ can be interpreted as the tracker's estimate of the object state, with generative and measurement equations for each modality:

$$s_t = f(s_{t-1}, u_t) + w_t, \qquad o^{f}_t,\, o^{e}_t = h_{f,e}(s_t) + v^{f,e}_t$$

where $u_t$ is the fused observation and $w_t, v^{f,e}_t$ are process/measurement noise terms (Huang et al., 2024).
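As a toy illustration of this shared-latent-state view, assuming a linear transition $f(s, u) = F s + G u$ and a simple convex fusion of the two observations (illustrative only, not the paper's FusionMamba block):

```python
import numpy as np

def fused_state_update(s_prev, o_rgb, o_evt, F, G, w_rgb=0.5):
    """One latent-state update for RGB + event fusion (toy sketch).

    Assumes a linear transition f(s, u) = F s + G u and a fixed
    convex weighting of the two modality observations; noise terms
    w_t, v_t are omitted for clarity.
    """
    u = w_rgb * o_rgb + (1.0 - w_rgb) * o_evt   # fused observation u_t
    s = F @ s_prev + G @ u                      # s_t = f(s_{t-1}, u_t)
    return s
```

In the actual architecture the fusion weights are input-dependent gates rather than a fixed scalar, which is precisely what the selective Mamba formulation provides.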

2. Architectural Variants and Data Flows

SMTrack architectures vary by modality and application domain, but share several foundational components.

(a) RGB-Event Single Object Tracking (Huang et al., 2024)

  • Input: Synchronized RGB frames and event streams, producing template and search patches per modality.
  • Modality-specific backbones: Each modality is embedded and processed by Vision-Mamba blocks.
  • Fusion: A FusionMamba block implements cross-modal interaction via gated recurrent updates, eschewing dot-product attention.
  • Localization head: The fused search-region features feed a lightweight Conv-based head to regress classification and bounding-box parameters.

(b) Pure RGB Visual Tracking (Xie et al., 2024, Ma et al., 2 Feb 2026)

  • Track token mechanism: A per-frame "track token" summarizes target appearance and is aggregated over a temporal window by an autoregressive Mamba block and a cross-attention layer.
  • Temporal modeling: Token history is processed for temporal context, then used to guide search-region reweighting for robust localization.
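A schematic of the track-token mechanism, under simplifying assumptions: a fixed scalar decay stands in for the learned Mamba gate, and dot-product reweighting stands in for the full cross-attention layer. Class and method names are hypothetical:

```python
import numpy as np
from collections import deque

class TrackTokenAggregator:
    """Sketch of a per-frame track-token history over a temporal window,
    aggregated by a simple gated recurrence and used to reweight
    search-region tokens (illustrative, not the published module)."""

    def __init__(self, dim, window=8, alpha=0.9):
        self.history = deque(maxlen=window)  # sliding window of track tokens
        self.alpha = alpha                   # stand-in for a learned gate
        self.dim = dim

    def update(self, track_token):
        self.history.append(track_token)

    def summary(self):
        # Autoregressive aggregation: h <- alpha * h + (1 - alpha) * token
        h = np.zeros(self.dim)
        for tok in self.history:
            h = self.alpha * h + (1.0 - self.alpha) * tok
        return h

    def reweight_search(self, search_tokens):
        # Cross-attention-style reweighting by similarity to the summary.
        h = self.summary()
        scores = search_tokens @ h
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return search_tokens * w[:, None]
```

The key property being illustrated: because the aggregation is a recurrence, extending the temporal window adds only O(1) work per new frame.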

(c) Multi-Object Tracking (MOT) (Xiao et al., 2024)

  • Motion prediction: Bi-directional Mamba blocks replace Kalman filtering; offsets are predicted using a learned data-driven transition function, improving robustness to non-linear or abrupt motions.
  • Patching missing detections: The autoregressive application of the Mamba motion predictor fills gaps during occlusion.
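The gap-filling step can be sketched as repeated application of a motion predictor. Here `predict_offset` is a hypothetical callable standing in for the learned bi-directional Mamba predictor:

```python
import numpy as np

def patch_missing_track(last_box, predict_offset, num_missed):
    """Autoregressive gap-filling sketch: when detections are missed
    (e.g., under occlusion), the motion predictor is applied repeatedly
    to extrapolate the box forward through the gap.

    last_box:       (4,) last confirmed box, e.g., (x, y, w, h)
    predict_offset: callable mapping a box to its predicted per-frame offset
    num_missed:     number of frames to patch
    """
    box = np.asarray(last_box, dtype=float)
    trajectory = []
    for _ in range(num_missed):
        box = box + predict_offset(box)   # box_{t+1} = box_t + f_theta(box_t)
        trajectory.append(box.copy())
    return trajectory
```

Unlike a Kalman filter with a hand-set constant-velocity model, the offset function here is learned, which is what gives the robustness to non-linear motion described above.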

(d) LiDAR-Based 3D Tracking (Tian et al., 19 Nov 2025)

  • Mamba-based Inter-frame Propagation (MIP): Historical spatiotemporal features from memory banks are fused with current tokens using nearest-neighbor retrieval and bi-directional SSM refinement.
  • Grouped Feature Enhancement (GFEM): Channel-level splitting into foreground/background ‘experts’ with group-wise cross-attention addresses temporal redundancy and clutter.
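A minimal sketch of the channel-level split into expert groups. The group-wise cross-attention of GFEM itself is omitted; the even split and per-group normalization shown here are illustrative assumptions:

```python
import numpy as np

def grouped_feature_split(tokens, num_groups=2):
    """Split token channels into expert groups (e.g., foreground /
    background) and normalize each group independently, so each expert
    operates on a standardized view of its channels (toy sketch)."""
    D = tokens.shape[1]
    assert D % num_groups == 0, "channel count must divide evenly"
    groups = np.split(tokens, num_groups, axis=1)
    enhanced = [(g - g.mean(axis=0)) / (g.std(axis=0) + 1e-6) for g in groups]
    return np.concatenate(enhanced, axis=1)
```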

3. Selective State-aware Gating and Temporal Reasoning

A critical SMTrack innovation is the use of state-wise, input-dependent gating for selective temporal aggregation. In contrast to vanilla Mamba (which shares gating across hidden states), state-aware variants decompose the gating into channel-wise and state-wise components:

$$\Delta_{d}^{(n)} = \mathrm{SoftPlus}(\Delta_{\mathrm{channel}, d} + \Delta_{\mathrm{state}, n})$$

This enables each state to specialize, with some focusing on the target core and others on context, edges, or distractors, improving both discrimination and robustness (Ma et al., 2 Feb 2026).
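The decomposed gate can be written directly. This is a minimal sketch of the equation above; the tensor shapes are assumptions:

```python
import numpy as np

def softplus(z):
    # Numerically stable SoftPlus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def state_aware_delta(delta_channel, delta_state):
    """Decomposed gating: Delta[d, n] = SoftPlus(delta_channel[d] + delta_state[n]).

    delta_channel: (D,) per-channel component, shared across hidden states
    delta_state:   (N,) per-hidden-state component, shared across channels
    Returns a (D, N) positive gate, so each hidden state n can specialize
    per channel d rather than sharing one gate across all states.
    """
    return softplus(delta_channel[:, None] + delta_state[None, :])
```

The additive decomposition costs only D + N parameters per gate instead of D * N, while still producing a full per-channel, per-state gating table.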

Bi-directional SSMs propagate information both forward and backward in time for unordered sets (as in point clouds), supporting robust spatio-temporal feature refinement (Tian et al., 19 Nov 2025). In sliding-window or memory-bank variants, the state or token summaries at each time-step are combined, exploiting the linearity of the recurrence for efficient multi-template propagation.
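The bi-directional propagation can be sketched as two passes of a decaying recurrence, one forward and one backward, summed. The fixed `alpha` here is an assumption; in SMTrack the gate is input-dependent:

```python
import numpy as np

def bidirectional_ssm(x, alpha=0.9):
    """Bi-directional state propagation sketch: run a simple decaying
    recurrence (a stand-in for the selective SSM) over the token
    sequence in both directions and sum the two passes, so every token
    sees context from both sides."""
    def scan(seq):
        h = np.zeros_like(seq[0])
        out = []
        for tok in seq:
            h = alpha * h + (1.0 - alpha) * tok
            out.append(h)
        return np.array(out)
    fwd = scan(x)
    bwd = scan(x[::-1])[::-1]   # backward pass, re-aligned to input order
    return fwd + bwd
```

For unordered sets such as point-cloud tokens, summing both scan directions reduces the sensitivity to the (arbitrary) serialization order.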

4. Computational Complexity and Efficiency

SMTrack consistently achieves near-linear complexity in sequence length or feature-map size. This contrasts directly with Transformer-based trackers, whose self-attention incurs $O(N^2)$ cost and memory in the number of tokens or spatial elements.

Efficiency Properties:

  • RGB-Event SOT: 94.5% reduction in FLOPs and 88.3% reduction in parameters compared to ViT-S OSTrack; GPU memory reduced by ~9.5% (Huang et al., 2024).
  • General SOT: SMTrack-M384 achieves 48.7 GFLOPs and 34 FPS on TrackingNet; SMTrack-S256 achieves AUC=82.7% at 8.5 GFLOPs.
  • MOT: Inference per-frame tracking overhead is 11.4 ms (17 FPS total); motion prediction cost is negligible relative to detector (Xiao et al., 2024).
  • LiDAR 3D SOT: SMTrack sustains >100 FPS for high token counts (8K–16K), compared to <25 FPS for quadratic-memory Transformer architectures (Tian et al., 19 Nov 2025).

The linear-complexity design enables long-range temporal dependency modeling without runtime explosion. Hidden-state propagation eliminates repeated feature extraction or template aggregation during inference, enabling efficient updating even with frequent template set augmentation (Ma et al., 2 Feb 2026).

5. Training, Loss Functions, and Hyperparameters

Training in SMTrack encompasses both standard regression/classification and custom temporal consistency objectives:

  • Loss for RGB and event SOT: Weighted sum of focal loss for classification and $L_1$ plus GIoU losses for bounding-box regression, e.g., $L = L_{\mathrm{focal}} + 14 \cdot L_{1} + 1 \cdot L_{\mathrm{GIoU}}$ (Huang et al., 2024).
  • MOT: Smooth-L1 loss on predicted offsets between frames (Xiao et al., 2024).
  • 3D SOT: Combined cross-entropy masks, mean-square error for Hough voting centers, quality scores, and smooth-L1 for box regression (Tian et al., 19 Nov 2025).
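The GIoU term appearing in these objectives can be computed for axis-aligned boxes as follows. This is a standard textbook implementation, not code taken from the cited papers:

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes in (x1, y1, x2, y2) form.
    GIoU = IoU - (C - U) / C, where C is the area of the smallest
    enclosing box and U is the union area. The GIoU loss is 1 - GIoU."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clipped to zero if boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area
```

Unlike plain IoU, GIoU remains informative (and negative) for non-overlapping boxes, which is why it is the usual choice for the regression branch.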

Schedules typically use AdamW or similar optimizers with learning-rate warmup and decay, long training epochs (100–300), and substantial batch sizes, matching or exceeding Transformer-based settings. Backbone networks are either modality-specific (e.g., Vision-Mamba) or generic token-based encoders, with hyperparameters tuned for sequence length, state dimension, and gating window.

6. Benchmarks, Evaluation, and Comparative Results

SMTrack architectures have been extensively validated on both classical and modality-specific tracking benchmarks.

Results Summary Table:

| Task | Dataset | SMTrack Metric | Top Transformer/Memory Baseline | Gain |
| --- | --- | --- | --- | --- |
| Single-object RGB+Event | FELT | SR/PR = 43.5/55.6 | ViT-S: 40.0/50.9 | +3.5/+4.7 |
| Single-object RGB | TrackingNet | AUC = 85.0 | Comparable | ≈ SOTA |
| Multi-object (MOT) | DanceTrack | HOTA = 56.8 | OC_SORT: 54.6 | +2.2 |
| 3D LiDAR SOT | KITTI-HTV | Success/Precision = 64.0/81.7 | HVTrack: 57.5/72.2 | +6.5/+9.5 |

Extensive ablations confirm the independent benefit of state-wise gating, hidden-state interaction, and grouped attention for background/foreground separation. SMTrack demonstrates robustness to challenging dynamics (occlusions, deformations, skipped frames), consistently matching or exceeding accuracy of quadratic-complexity attention models and outperforming standard SSM or LSTM baselines (Huang et al., 2024, Xie et al., 2024, Ma et al., 2 Feb 2026, Xiao et al., 2024, Tian et al., 19 Nov 2025).

7. Applications and Impact Across Modalities

SMTrack’s modular, state-aware approach enables broad applicability:

  • Heterogeneous sensor fusion: Natively supports RGB/video frames and event-camera data with explicit modality-specific backbones and efficient token fusion, surpassing two-stream and late-fusion paradigms in both efficiency and accuracy (Huang et al., 2024).
  • Multi-object association in MOT: Data-driven motion modeling with bi-directional Mamba eliminates reliance on hand-tuned filters and compensates for missing detections through autoregressive patching (Xiao et al., 2024).
  • LiDAR-based object tracking: Feature propagation via SSM provides a hardware-friendly alternative to memory-bank Transformers, mitigating redundancy and supporting robust performance under high temporal variation (Tian et al., 19 Nov 2025).

These results suggest that the selective, state-aware SSM abstraction is a unifying foundation for efficient, high-performance tracking across diverse sensor modalities and problem setups. The ability to adaptively gate, aggregate, and specialize hidden states across both spatial and temporal dimensions distinguishes SMTrack within the contemporary tracking literature (Huang et al., 2024, Xie et al., 2024, Ma et al., 2 Feb 2026, Xiao et al., 2024, Tian et al., 19 Nov 2025).
