SDTrack: Spike-Driven Transformer for Tracking
- SDTrack is a spike-driven transformer architecture using spiking neural networks for event-based object tracking with neuromorphic vision.
- It employs a MetaFormer backbone with intrinsic positional learning and a Global Trajectory Prompt for robust spatiotemporal encoding.
- The design achieves state-of-the-art accuracy and unmatched energy efficiency on established event tracking benchmarks.
SDTrack is a baseline spike-driven transformer architecture designed for event-based object tracking, specifically targeting neuromorphic vision tasks using Spiking Neural Networks (SNNs). Leveraging the synergy between asynchronous event streams from event cameras and biologically inspired spiking computation, SDTrack employs a fully spike-driven transformer backbone (MetaFormer), a Global Trajectory Prompt (GTP) for robust spatiotemporal encoding, and a streamlined spike-based tracking head. This pipeline achieves state-of-the-art accuracy and energy efficiency on established event tracking benchmarks, setting a robust foundation for low-power, end-to-end neuromorphic vision systems (Shan et al., 9 Mar 2025).
1. Spiking Neuron Foundation: Integer-valued LIF
SDTrack is constructed atop the Integer-valued Leaky Integrate-and-Fire (I-LIF) neuron model, which preserves spike-based inference and enables rate-coded surrogate gradients for learning. The I-LIF neuron dynamics are described by:
- Continuous Form:

  $$\tau_m \frac{dU(t)}{dt} = -U(t) + I(t),$$

  where $U(t)$ denotes the membrane potential, $I(t)$ the synaptic current, and $\tau_m$ the membrane time constant.
- Discrete Update:

  $$U[t] = \beta\,U[t-1] + X[t],$$

  with decay $\beta \in (0, 1)$. During training, integer-valued spiking rates are computed as

  $$S[t] = \mathrm{clip}\big(\mathrm{round}(U[t]),\ 0,\ D\big).$$

- At inference, each integer rate decomposes into binary spikes, so the output is a sum over spike steps:

  $$S[t] = \sum_{d=1}^{D} s_d[t], \qquad s_d[t] \in \{0, 1\}.$$

  For a linear layer with weight matrix $W$,

  $$W\,S[t] = \sum_{d=1}^{D} W\,s_d[t],$$

  so each binary step only sums selected weight columns. This enables the replacement of multiply–accumulate (MAC) operations with energy-efficient accumulate (AC) operations.
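The dynamics above can be sketched in a few lines of NumPy. The class name `ILIF`, the soft reset, and the default `beta` and `D` values are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

class ILIF:
    """Minimal Integer-valued LIF sketch: integer 'rate' output during
    training, decomposable into binary spikes at inference."""
    def __init__(self, beta=0.5, D=4):
        self.beta = beta      # membrane decay per step (assumed value)
        self.D = D            # max integer spike count per step
        self.U = 0.0          # membrane potential

    def step(self, x):
        # discrete update: decayed potential plus synaptic input
        self.U = self.beta * self.U + x
        # integer-valued firing: round and clip to [0, D]
        s = int(np.clip(np.round(self.U), 0, self.D))
        self.U -= s           # soft reset by the emitted spike count
        return s

def decompose(s, D):
    """Expand an integer rate s into D binary spikes summing to s."""
    return [1 if d < s else 0 for d in range(D)]

n = ILIF(beta=0.5, D=4)
s = n.step(2.6)               # integer rate in {0, ..., 4}
spikes = decompose(s, n.D)
assert s == sum(spikes)       # inference spikes reproduce the training rate
```

The integer rate used in training and the sum of its binary inference spikes are identical by construction, which is what licenses the MAC-to-AC substitution.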
2. Spiking MetaFormer Backbone and Intrinsic Position Learning
The backbone adheres to the MetaFormer framework, adapted entirely for spike-based computation:
- Intrinsic Position Learning (IPL): The template feature tensor $Z$ and search feature tensor $X$ are concatenated diagonally into a block-diagonal canvas,

  $$U = \begin{bmatrix} Z & 0 \\ 0 & X \end{bmatrix},$$

  so relative position is encoded purely by spatial layout. IPL allows the model to inherently encode position information, outperforming explicit positional encodings or intersection-only feature sharing.
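The diagonal concatenation can be sketched as follows, assuming channel-last feature maps; the function name and shapes are illustrative:

```python
import numpy as np

def ipl_concat(template, search):
    """Diagonally concatenate template (h_t, w_t, C) and search
    (h_s, w_s, C) feature maps into one block-diagonal canvas, so
    relative position is encoded by spatial layout alone."""
    ht, wt, c = template.shape
    hs, ws, _ = search.shape
    canvas = np.zeros((ht + hs, wt + ws, c), dtype=template.dtype)
    canvas[:ht, :wt] = template          # top-left block: template
    canvas[ht:, wt:] = search            # bottom-right block: search
    return canvas

t = np.ones((4, 4, 3))                   # toy template features
s = np.ones((8, 8, 3)) * 2               # toy search features
u = ipl_concat(t, s)
assert u.shape == (12, 12, 3)
assert u[0, -1, 0] == 0                  # off-diagonal blocks stay empty
```

Because the two regions never overlap, downstream convolutions and attention can infer "where the template is relative to the search region" from position alone.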
- Spike-based Convolution and Attention: The MetaFormer backbone stacks two block types, with every neuron layer an I-LIF:
  - separable and grouped spiking-convolution (SConv) blocks;
  - transformer blocks combining spiking self-attention (SSA) with a spiking MLP (SNNMLP).

  SSA performs token-wise attention with all weights and activations spike-driven except the final projection.
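The exact SSA formulation is not reproduced in this summary; the sketch below shows one common softmax-free spiking attention pattern, in which $(QK^\top)V$ over binary spike tensors reduces to integer accumulations. The scale constant and shapes are assumptions:

```python
import numpy as np

def spiking_self_attention(Q, K, V, scale=0.125):
    """Softmax-free attention over spike tensors of shape (N tokens, d dims).
    With binary Q, K, V, both matmuls involve only integer accumulations,
    the spike-friendly analogue of dense attention; `scale` is illustrative."""
    attn = (Q @ K.T) * scale      # (N, N) integer score map, scaled
    return attn @ V               # (N, d) output features

rng = np.random.default_rng(0)
Q = rng.integers(0, 2, (6, 8))    # binary spike tensors
K = rng.integers(0, 2, (6, 8))
V = rng.integers(0, 2, (6, 8))
out = spiking_self_attention(Q, K, V)
assert out.shape == (6, 8)
```

Dropping the softmax is what keeps the computation additive; only the final projection in SDTrack's SSA is left in floating point.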
3. Global Trajectory Prompt (GTP)
GTP encodes asynchronous events into three-channel images designed for robust spatiotemporal representation:
- Polarity Channels: positive and negative events are counted per pixel and scaled by a factor $c$, yielding $P^{+}(x, y) = c \cdot N^{+}(x, y)$ and $P^{-}(x, y) = c \cdot N^{-}(x, y)$, where $c$ is a scaling factor.
- Trajectory Channel: past target positions are accumulated under an exponential decay $\gamma^{\Delta t}$, encoding long-term object motion priors.
- Aggregated Event Image for Vision Backbone: the three channels are stacked, $E = [P^{+}, P^{-}, T]$, forming an RGB-like input.
This enables compatibility with RGB-oriented pretrained weights and efficient aggregation of trajectory and polarity features.
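A hedged sketch of a GTP-style encoder, assuming per-pixel polarity counts scaled by `c` and a trajectory channel decayed by `gamma` per frame; the helper name and all parameter values are illustrative, not the paper's:

```python
import numpy as np

def gtp_encode(events, centers, H, W, c=32.0, gamma=0.9):
    """Sketch of a GTP-style 3-channel image: scaled positive/negative
    event counts plus an exponentially decayed trail of past target
    centers. events: list of (x, y, polarity in {+1, -1});
    centers: past target centers, oldest first."""
    pos = np.zeros((H, W))
    neg = np.zeros((H, W))
    for x, y, p in events:
        (pos if p > 0 else neg)[y, x] += 1          # per-pixel counts
    traj = np.zeros((H, W))
    n = len(centers)
    for i, (cx, cy) in enumerate(centers):
        # newest center brightest; older ones fade geometrically
        traj[cy, cx] = max(traj[cy, cx], gamma ** (n - 1 - i))
    img = np.stack([np.clip(pos * c, 0, 255),
                    np.clip(neg * c, 0, 255),
                    traj * 255.0], axis=-1)
    return img.astype(np.uint8)

ev = [(2, 3, +1), (2, 3, +1), (5, 5, -1)]
img = gtp_encode(ev, centers=[(1, 1), (2, 2)], H=8, W=8)
assert img.shape == (8, 8, 3)     # RGB-like tensor for the backbone
```

Producing a standard `uint8` three-channel image is what lets RGB-pretrained backbone weights be reused directly.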
4. Spike-driven Tracking Head
The tracking head regresses bounding box parameters directly from the spiking cross-correlation feature map:
- Cross-correlation: the template feature acts as a matching kernel over the search feature, $F_c = F_s \star F_t$.
- Prediction: Four I-LIF convolutional layers, followed by a single floating-point convolution, output a center score map $\tilde{p}$ together with offset and size maps $\tilde{o}$.
- Bounding Box Decoding: the peak of $\tilde{p}$ gives the coarse center, refined by the predicted sub-cell offsets and sizes read out at the peak location. All coordinates are normalized to $[0, 1]$.
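Center-based decoding can be sketched as follows, assuming a conventional center-head layout (peak of the score map plus sub-cell offset and size maps); SDTrack's exact map layout may differ:

```python
import numpy as np

def decode_bbox(score, offset, size):
    """Center-based decoding sketch: pick the peak of the (Hf, Wf) score
    map, refine it with the sub-cell offset, and read width/height at the
    peak. All outputs are normalized to [0, 1]."""
    Hf, Wf = score.shape
    cy, cx = np.unravel_index(np.argmax(score), score.shape)
    x = (cx + offset[0, cy, cx]) / Wf     # refined center x
    y = (cy + offset[1, cy, cx]) / Hf     # refined center y
    w, h = size[0, cy, cx], size[1, cy, cx]
    return np.array([x - w / 2, y - h / 2, w, h])   # (x0, y0, w, h)

score = np.zeros((16, 16))
score[8, 4] = 1.0                          # synthetic peak
offset = np.full((2, 16, 16), 0.5)         # half-cell refinement
size = np.full((2, 16, 16), 0.25)          # quarter-frame box
box = decode_bbox(score, offset, size)
assert box.shape == (4,)
```

The argmax/offset split is what makes a coarse feature grid yield sub-pixel box centers.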
- Loss:

  $$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{iou}}\,\mathcal{L}_{\mathrm{gIoU}} + \lambda_{1}\,\mathcal{L}_{1},$$

  with a focal loss for classification, generalized IoU for localization, and smooth-$\ell_1$ for regression (weighted by $\lambda_{\mathrm{iou}}$ and $\lambda_{1}$).
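For reference, a minimal generalized-IoU implementation for axis-aligned boxes $(x_0, y_0, x_1, y_1)$, matching the standard definition used by GIoU-style localization losses:

```python
def giou(a, b):
    """Generalized IoU for boxes (x0, y0, x1, y1): IoU minus the fraction
    of the smallest enclosing box not covered by the union."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest axis-aligned box enclosing both inputs
    cx0, cy0 = min(a[0], b[0]), min(a[1], b[1])
    cx1, cy1 = max(a[2], b[2]), max(a[3], b[3])
    hull = (cx1 - cx0) * (cy1 - cy0)
    return inter / union - (hull - union) / hull

assert giou((0, 0, 1, 1), (0, 0, 1, 1)) == 1.0
assert giou((0, 0, 1, 1), (2, 0, 3, 1)) < 0   # disjoint boxes penalized
```

Unlike plain IoU, GIoU stays informative (negative) for non-overlapping predictions, which stabilizes early training.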
5. Training and Inference Workflow
The pipeline consists of pretraining the backbone on ImageNet-1K and fine-tuning on multiple event datasets (FE108, FELT, VisEvent) with simulation time $T$ (4 timesteps in the reported models).
- No data augmentation or post-processing.
- Optimization procedure:
```
for each mini-batch of (event_template, event_search):
    Et = GTP(event_template)            # encode template events as 3-channel image
    Es = GTP(event_search)              # encode search events
    U = IPL(Es, Et)                     # diagonal (block-diagonal) concatenation
    F_conv = SNNConvModule(U)           # spiking convolution stage
    Ft, Fs = split_features(F_conv)
    tokens = tokenize(Ft, Fs)
    for block in TransformerBlocks:     # SSA + SNNMLP blocks
        tokens = block(tokens)
    F_c = XCorr(tokens.search, tokens.template)
    p~, o~ = TrackingHead(F_c)
    L = L_cls(p~, p*) + λ_iou * L_gIoU(o~, o*) + λ1 * L1(o~, o*)
    backpropagate(L)
```
- Inference runs over the same 4 spike steps, with integer rates decomposed into binary spikes.
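The rate-to-binary decomposition can be verified numerically for a linear layer: accumulating the binary spike steps reproduces the single integer-rate matmul exactly. Shapes and values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))           # toy linear-layer weights
s = rng.integers(0, 5, 6)                 # integer spike rates, D = 4

# training view: one matmul on integer rates
y_rate = W @ s

# inference view: D binary spike steps; each step only *adds* weight columns
binary = np.array([[1 if d < si else 0 for si in s] for d in range(4)])
y_spikes = sum(W @ b for b in binary)     # AC-only accumulation per step

assert np.allclose(y_rate, y_spikes)      # both views agree exactly
```

This equivalence is why the network can be trained with integer rates yet deployed as a purely spike-driven, multiplication-free system.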
6. Empirical Performance and Efficiency
SDTrack achieves state-of-the-art accuracy and unmatched energy efficiency on standard event tracking benchmarks:
| Model | Params | Timesteps | Energy (mJ) | FE108 (AUC/PR, %) | FELT (AUC/PR, %) | VisEvent (AUC/PR, %) |
|---|---|---|---|---|---|---|
| SDTrack Tiny | 19.6M | 4 | 8.16 | 59.0 / 91.3 | 39.6 / 50.1 | 35.6 / 49.2 |
| SDTrack Base | 107.3M | 4 | 30.5 | 59.9 / 91.5 | 39.8 / 50.7 | 37.4 / 51.5 |
| STNet/SNNTrack | — | — | ≈8.25 | — | — | — / ≈50 |
| Full ANN methods | — | — | ≫50 | — | — | — |
ANN–SNN hybrids consume slightly less energy than SDTrack Tiny but deliver notably lower precision. Full ANN methods require significantly higher energy for comparable or worse accuracy.
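These energy figures can be sanity-checked with the back-of-envelope model common in SNN papers, using the widely cited 45 nm estimates of roughly 4.6 pJ per FP32 MAC and 0.9 pJ per AC (Horowitz); the operation count and firing rate below are purely illustrative, not SDTrack's measured values:

```python
# Back-of-envelope energy model (45 nm process estimates):
E_MAC = 4.6e-12   # J per 32-bit float multiply-accumulate
E_AC = 0.9e-12    # J per accumulate-only operation

ops = 5e9         # illustrative operation count for one forward pass
T = 4             # spike timesteps
rate = 0.2        # illustrative average firing rate (activation sparsity)

ann_energy = ops * E_MAC                 # dense MAC network
snn_energy = ops * T * rate * E_AC       # sparse, AC-only spiking network

assert snn_energy < ann_energy           # sparsity + AC offsets the T factor
```

The point of the model is that even with a 4x timestep overhead, sparse firing and the roughly 5x cheaper AC operation leave the spiking network well ahead of a dense ANN of the same size.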
7. Ablations, Analysis, and Design Insights
- GTP hyperparameters: performance is sensitive to the polarity scaling and trajectory decay settings, with a clear best configuration.
- Position encoding: Removing IPL causes a 2.0% drop in PR, while explicit positional encodings degrade performance.
- Feature sharing: Intersection-only sharing (size 64) offers smaller gains than IPL, supporting the advantage of diagonal concatenation.
- Tracking head structure: Center-based heads outperform corner-based alternatives by approximately 1.1% PR.
- Spike-driven head: Excluding the final float conv layer causes a 0.9% PR loss, so a hybrid head is used.
- Backbone comparison: the I-LIF–based "Spike-driven V3" backbone scores lower AUC and PR than SDTrack Tiny despite sharing the same neuron architecture, indicating the gains come from the tracking-specific design rather than the neuron model alone.
SDTrack’s efficiency stems from: (1) sparse, binary spiking in the I-LIF neuron, enabling AC-only computation; (2) GTP’s aggregated spatiotemporal cues, surpassing conventional event image representations; (3) Transformer-style self-attention over spikes, enabling effective cross-region fusion without heavy ANN layers; (4) IPL’s diagonal concatenation introducing natural positional priors. SDTrack does not require handcrafted encodings, data augmentation, or post-processing, setting a robust and efficient foundation for neuromorphic object tracking (Shan et al., 9 Mar 2025).