SDTrack: Spike-Driven Transformer for Tracking

Updated 17 January 2026
  • SDTrack is a spike-driven transformer architecture using spiking neural networks for event-based object tracking with neuromorphic vision.
  • It employs a MetaFormer backbone with intrinsic positional learning and a Global Trajectory Prompt for robust spatiotemporal encoding.
  • The design achieves state-of-the-art accuracy and unmatched energy efficiency on established event tracking benchmarks.

SDTrack is a baseline spike-driven transformer architecture designed for event-based object tracking, specifically targeting neuromorphic vision tasks using Spiking Neural Networks (SNNs). Leveraging the synergy between asynchronous event streams from event cameras and biologically inspired spiking computation, SDTrack employs a fully spike-driven transformer backbone (MetaFormer), a Global Trajectory Prompt (GTP) for robust spatiotemporal encoding, and a streamlined spike-based tracking head. This pipeline achieves state-of-the-art accuracy and energy efficiency on established event tracking benchmarks, setting a robust foundation for low-power, end-to-end neuromorphic vision systems (Shan et al., 9 Mar 2025).

1. Spiking Neuron Foundation: Integer-valued LIF

SDTrack is constructed atop the Integer-valued Leaky Integrate-and-Fire (I-LIF) neuron model, which preserves spike-based inference and enables rate-coded surrogate gradients for learning. The I-LIF neuron dynamics are described by:

  • Continuous Form:

\tau_m \frac{dV(t)}{dt} = -V(t) + I(t)

where V(t) denotes the membrane potential, I(t) the synaptic current, and τ_m the membrane time constant.

  • Discrete Update:

V^\ell[t] = \lambda V^\ell[t-1] + x^\ell[t] - V_\mathrm{th}\, s^\ell[t-1]

s^\ell[t] = H(V^\ell[t] - V_\mathrm{th}), \quad s^\ell[t] \in \{0, 1\}

with decay λ = exp(−Δt / τ_m). During training, spiking rates are computed as:

\mathbf{s}^\ell = \frac{1}{T} \left\lfloor \operatorname{clip}\{\mathbf{x}^\ell, 0, T\} \right\rceil

At inference, rates decompose into T binary spikes, such that the output is a sum over T spike steps:

\mathbf{s}^\ell = \frac{1}{T} \sum_{t=1}^{T} \mathbf{s}^\ell[t], \quad \mathbf{s}^\ell[t] \in \{0, 1\}

For a linear layer:

\mathbf{x}^{\ell+1} = \sum_{t=1}^{T} \frac{\mathbf{W}^\ell}{T}\, \mathbf{s}^\ell[t]

This enables the replacement of multiply–accumulate (MAC) operations with energy-efficient accumulate (AC) operations.
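The rate coding and MAC-to-AC equivalence above can be sketched in NumPy. All function and variable names here are illustrative, not taken from the released code; the sketch only demonstrates that summing binary spike steps through W/T reproduces the rate-coded matrix product.

```python
import numpy as np

def ilif_train(x, T=4):
    """Training-time I-LIF surrogate: clip to [0, T], round, rate-normalize."""
    return np.round(np.clip(x, 0, T)) / T

def decompose_to_spikes(rate, T=4):
    """Inference-time: expand a rate k/T into T binary spikes (k ones)."""
    k = np.round(rate * T).astype(int)
    # spike at step t iff t < k, giving exactly k ones over T steps
    return np.stack([(t < k).astype(float) for t in range(T)])

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=(8,))        # synaptic input to one layer
W = rng.normal(size=(4, 8))                # linear-layer weights

s_rate = ilif_train(x, T=4)                # rate-coded activation
spikes = decompose_to_spikes(s_rate, T=4)  # shape (4, 8), entries in {0, 1}

# MAC form: W @ s_rate.  AC form: sum_t (W/T) @ s[t] -- because s[t] is
# binary, each step needs only additions of selected weight columns.
mac_out = W @ s_rate
ac_out = sum((W / 4) @ spikes[t] for t in range(4))
assert np.allclose(mac_out, ac_out)
```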

2. Spiking MetaFormer Backbone and Intrinsic Position Learning

The backbone adheres to the MetaFormer framework, adapted entirely for spike-based computation:

  • Intrinsic Position Learning (IPL): Template Z and search X image tensors are concatenated diagonally:

\mathbf{U} = \mathrm{IPL}(\mathbf{X}, \mathbf{Z}) \in \mathbb{R}^{T \times C \times (H_z + H_x) \times (W_z + W_x)}

IPL allows the model to inherently encode position information, outperforming explicit positional encodings or intersection-only feature sharing.

  • Spike-based Convolution and Attention: The MetaFormer backbone consists of:

    • Separable and grouped sconv blocks (all neuron layers are I-LIF):

    \mathbf{U}' = \mathbf{U} + \mathrm{SNNSepConv}(\mathbf{U}), \quad \mathbf{U}'' = \mathbf{U}' + \mathrm{SNNConvGroup}(\mathbf{U}')

    • Transformer blocks with spiking self-attention (SSA) and spiking MLP (SNNMLP):

    \mathbf{U}' = \mathbf{U} + \mathrm{SSA}(\mathbf{U}), \quad \mathbf{U}'' = \mathbf{U}' + \mathrm{SNNMLP}(\mathbf{U}')

    SSA performs token-wise attention with all weights and activations spike-driven except the last projection.
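A minimal single-timestep sketch of the diagonal concatenation behind IPL (per-channel spatial layout only, names illustrative): the template occupies one corner block and the search region the opposite corner, so each token's absolute location in the padded canvas already identifies which input it came from. The exact corner assignment here is an assumption for illustration.

```python
import numpy as np

def ipl(x, z):
    """Diagonally concatenate search x and template z feature maps.

    x: (C, Hx, Wx), z: (C, Hz, Wz)  ->  (C, Hz + Hx, Wz + Wx), with z in
    the top-left block, x in the bottom-right block, zeros elsewhere.
    """
    C, Hx, Wx = x.shape
    _, Hz, Wz = z.shape
    u = np.zeros((C, Hz + Hx, Wz + Wx), dtype=x.dtype)
    u[:, :Hz, :Wz] = z    # template block
    u[:, Hz:, Wz:] = x    # search block
    return u

z = np.ones((3, 2, 2))        # toy template
x = np.full((3, 4, 4), 2.0)   # toy search region
u = ipl(x, z)
print(u.shape)  # (3, 6, 6)
```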

3. Global Trajectory Prompt (GTP)

GTP encodes asynchronous events into three-channel images designed for robust spatiotemporal representation:

  • Polarity Channels:

h_i^1(x, y) = \alpha \sum_{t_k \in L} \delta(x - x_k, y - y_k)\,[p_k = +1]

h_i^2(x, y) = \alpha \sum_{t_k \in L} \delta(x - x_k, y - y_k)\,[p_k = -1]

where α is a scaling factor.

  • Trajectory Channel:

h_i^3(x, y) = \beta\, h_{i-1}^3(x, y) + \alpha \sum_{j=1}^{2} \delta(h_{i-1}^j(x, y) > 0)\,[h_i^j(x, y) = 0]

β ∈ (0, 1) provides exponential decay, encoding long-term object-motion priors.

  • Aggregated Event Image for Vision Backbone:

\mathbf{E}_i = [h_i^1; h_i^2; h_i^3], \quad \mathbf{E}_i \in \mathbb{R}^{3 \times H \times W}

This enables compatibility with RGB-oriented pretrained weights and efficient aggregation of trajectory and polarity features.
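The three-channel construction can be sketched as follows (a simplified NumPy illustration; `gtp_step` and the event format are assumed names, not the paper's code):

```python
import numpy as np

def gtp_step(events, h_prev, alpha=30.0, beta=0.8):
    """One GTP slice: two polarity channels plus a decayed trajectory channel.

    events: iterable of (x, y, p) with polarity p in {+1, -1};
    h_prev: the previous (3, H, W) slice (zeros for the first one).
    """
    h = np.zeros_like(h_prev)
    for x, y, p in events:
        h[0 if p > 0 else 1, y, x] += alpha   # alpha-scaled event counts
    # trajectory channel: decay the old trail, then stamp pixels that were
    # active in the previous slice but are silent in this one
    trail = sum((h_prev[j] > 0) & (h[j] == 0) for j in range(2))
    h[2] = beta * h_prev[2] + alpha * trail
    return h

h0 = np.zeros((3, 4, 4))
h1 = gtp_step([(1, 1, +1)], h0)   # one positive event at pixel (1, 1)
h2 = gtp_step([], h1)             # the pixel goes silent -> a trail appears
```

After the second slice, the now-silent pixel carries an α-weighted value in the trajectory channel, which then decays by β per slice.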

4. Spike-driven Tracking Head

The tracking head regresses bounding box parameters directly from the spiking cross-correlation feature map:

  • Cross-correlation:

\mathbf{F}_c = \mathrm{XCorr}(\mathbf{F}_{\mathrm{search}}, \mathbf{F}_{\mathrm{template}})

  • Prediction: Four I-LIF convolutional layers, followed by a single floating-point convolution, output center-localization scores \hat{p}_{ij} and offsets \hat{o}_{ij} = [\hat{d}_x, \hat{d}_y, \hat{d}_w, \hat{d}_h]_{ij}.
  • Bounding Box Decoding:

\tilde{x} = (i + \hat{d}_x) / W_{\mathrm{search}}

\tilde{y} = (j + \hat{d}_y) / H_{\mathrm{search}}

\tilde{w} = e^{\hat{d}_w}, \quad \tilde{h} = e^{\hat{d}_h}

All coordinates are normalized to [0, 1].

  • Loss:

\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{iou}} \mathcal{L}_{\mathrm{gIoU}} + \lambda_1 \mathcal{L}_1

with a focal loss for classification, generalized IoU for localization, and smooth-L1 for regression (λ_iou = 2, λ_1 = 5).
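The decoding step can be sketched in NumPy. The mapping of i to the horizontal grid index and j to the vertical one follows the formulas above but the row/column convention of the score map is an assumption; `decode_box` and the toy values are illustrative only.

```python
import numpy as np

def decode_box(i, j, offsets, W_search, H_search):
    """Decode a normalized (cx, cy, w, h) box from grid cell (i, j)."""
    dx, dy, dw, dh = offsets
    return (i + dx) / W_search, (j + dy) / H_search, np.exp(dw), np.exp(dh)

# toy 4x4 center-score map: take the argmax cell, then decode its offsets
scores = np.zeros((4, 4))
scores[1, 2] = 0.9                                        # peak response
j, i = np.unravel_index(scores.argmax(), scores.shape)    # row -> j, col -> i
cx, cy, w, h = decode_box(i, j,
                          offsets=(0.5, 0.5, np.log(0.25), np.log(0.25)),
                          W_search=4, H_search=4)

# training combines the per-term losses with the paper's weights:
#   L = L_cls + 2.0 * L_gIoU + 5.0 * L_1
```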

5. Training and Inference Workflow

The pipeline consists of pretraining the backbone on ImageNet-1K and fine-tuning on multiple event datasets (FE108, FELT, VisEvent) with simulation time T = 1.

  • No data augmentation or post-processing.
  • Optimization procedure:

for event_template, event_search in dataloader:        # one mini-batch
    E_t = GTP(event_template)                          # 3-channel event images
    E_s = GTP(event_search)
    U = IPL(E_s, E_t)                                  # diagonal concatenation
    F_conv = SNNConvModule(U)                          # spiking conv stage
    F_t, F_s = split_features(F_conv)
    tokens = tokenize(F_t, F_s)
    for block in transformer_blocks:                   # SSA + SNNMLP blocks
        tokens = block(tokens)
    F_c = XCorr(tokens.search, tokens.template)        # cross-correlation map
    p_hat, o_hat = TrackingHead(F_c)                   # center scores, offsets
    loss = L_cls(p_hat, p_star) \
         + lambda_iou * L_gIoU(o_hat, o_star) \
         + lambda_1 * L1(o_hat, o_star)
    backpropagate(loss)

  • Inference utilizes T > 1 spike steps, with rate-to-binary spike decomposition.

6. Empirical Performance and Efficiency

SDTrack achieves state-of-the-art accuracy on standard event tracking benchmarks while consuming less energy than comparably accurate trackers:

| Model | Params | Timesteps | Energy (mJ) | FE108 (AUC/PR) | FELT (AUC/PR) | VisEvent (AUC/PR) |
|---|---|---|---|---|---|---|
| SDTrack Tiny | 19.6M | 4 | 8.16 | 59.0 / 91.3% | 39.6 / 50.1% | 35.6 / 49.2% |
| SDTrack Base | 107.3M | 4 | 30.5 | 59.9 / 91.5% | 39.8 / 50.7% | 37.4 / 51.5% |
| STNet/SNNTrack | — | — | ≈8.25 | — / ≈50% | — | — |
| Full ANN trackers | — | — | ≫50 | — | — | — |

ANN–SNN hybrids consume slightly less energy than SDTrack Tiny but deliver notably lower precision. Full ANN methods require significantly higher energy for comparable or worse accuracy.
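As a back-of-envelope illustration of why AC-only computation saves energy, the sketch below uses the 45 nm per-operation energies commonly cited in SNN papers (Horowitz, 2014: ≈4.6 pJ per MAC, ≈0.9 pJ per AC). The operation count and firing rate are made-up numbers for illustration, not SDTrack's measured values.

```python
E_MAC = 4.6e-12   # J per multiply-accumulate, 45 nm CMOS estimate
E_AC = 0.9e-12    # J per accumulate

def ann_energy_mj(flops):
    """Every ANN operation is a MAC."""
    return flops * E_MAC * 1e3

def snn_energy_mj(flops, T, firing_rate):
    """Spikes gate the accumulates: only firing positions trigger additions."""
    return T * firing_rate * flops * E_AC * 1e3

flops = 5e9   # hypothetical per-frame operation count
ann_mj = ann_energy_mj(flops)                        # 23.0 mJ
snn_mj = snn_energy_mj(flops, T=4, firing_rate=0.2)  # 3.6 mJ
```

Even with T = 4 timesteps, sparse binary activity keeps the spiking estimate well below the dense ANN cost, matching the ordering in the table above.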

7. Ablations, Analysis, and Design Insights

  • GTP hyperparameters: Best performance is found at α = 30, β = 0.8.
  • Position encoding: Removing IPL causes a 2.0% drop in PR, while explicit positional encodings degrade performance.
  • Feature sharing: Intersection-only sharing (size 64) offers smaller gains than IPL, supporting the advantage of diagonal concatenation.
  • Tracking head structure: Center-based heads outperform corner-based alternatives by approximately 1.1% PR.
  • Spike-driven head: Excluding the final float conv layer causes a 0.9% PR loss, so a hybrid head is used.
  • Backbone comparison: The I-LIF–based "Spike-driven V3" backbone scores AUC = 58.9%, PR = 90.3%; in contrast, SDTrack Tiny reaches AUC = 59.0%, PR = 91.3% with the same neuron architecture.

SDTrack’s efficiency stems from: (1) sparse, binary spiking in the I-LIF neuron, enabling AC-only computation; (2) GTP’s aggregated spatiotemporal cues, surpassing conventional event image representations; (3) Transformer-style self-attention over spikes, enabling effective cross-region fusion without heavy ANN layers; (4) IPL’s diagonal concatenation introducing natural positional priors. SDTrack does not require handcrafted encodings, data augmentation, or post-processing, setting a robust and efficient foundation for neuromorphic object tracking (Shan et al., 9 Mar 2025).
