
Selective Spatiotemporal Vision Transformer (SSViT)

Updated 16 November 2025
  • SSViT is a vision architecture that leverages selective spatiotemporal attention to process dynamic visual data with high computational efficiency.
  • It integrates biologically inspired spiking neural networks with deep learning techniques for tasks like image classification and modulo video recovery.
  • The design reduces complexity by employing innovative token selection and self-attention modules, achieving superior accuracy with lower memory and FLOPs.

The Selective Spatiotemporal Vision Transformer (SSViT) refers to a class of architectures designed for efficient, high-accuracy spatiotemporal modeling in vision tasks. Two principal and distinct instantiations have been developed: one tailored for spiking neural networks with biological inspiration for edge computing, and another for deep learning-driven modulo video recovery using token selection strategies. Both directions are unified by their core strategy of selectively attending to crucial spatiotemporal regions, either via spike-driven mechanisms or through data-driven token selection, thereby achieving strong performance with significant computational and memory gains.

1. Architectural Foundations

Spiking SSViT (SNN-ViT with SSSA)

The spiking variant of SSViT is built upon two components:

  • Global–Local Spiking Patch Splitting (GL-SPS): Transforms raw images into multi-scale spiking feature maps, partitioning input into patches suitable for spiking processing.
  • Stacked Spiking Transformer Blocks: Each block includes a Saccadic Spike Self-Attention (SSSA) module, followed by a channel-wise MLP layer.

The architecture processes image data hierarchically in a 4-stage pyramid. At each subsequent stage, the spatial resolution halves and the channel dimension typically doubles. Data flows through the structure as: Input Image → GL-SPS → SSSA-Block → MLP → next stage.
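The halving/doubling rule above can be sketched as simple shape bookkeeping. The concrete sizes below are illustrative examples, not the paper's configuration:

```python
# Illustrative shape bookkeeping for the 4-stage pyramid: spatial resolution
# halves and the channel count doubles at each stage. Sizes are example values.
def pyramid_shapes(h, w, c, stages=4):
    shapes = []
    for _ in range(stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes

# e.g. a 56x56 feature map with 64 channels entering stage 1:
# [(56, 56, 64), (28, 28, 128), (14, 14, 256), (7, 7, 512)]
```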

Within each stage, tokens representing spatial patches across $T$ timesteps and $D$ channels are processed by SSSA for spatiotemporal mixing, then passed through the MLP. The spiking neuron follows the Leaky Integrate-and-Fire (LIF) model:

$$U[t+1] = \tau U[t] + X[t+1] - S[t]\,V_\text{reset}$$

$$S[t+1] = \Theta(U[t+1] - V_\text{th})$$

where $U$ is the membrane potential, $X$ the synaptic input, $\tau$ the decay constant, $V_\text{th}$ the firing threshold, $S$ the binary spike output, and $\Theta$ the Heaviside step function.
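A minimal LIF update loop, following the two equations above, might look like the following sketch (the constants $\tau = 0.5$, $V_\text{th} = 1.0$, $V_\text{reset} = 1.0$ are assumed example values, not the paper's settings):

```python
# Minimal LIF neuron sketch. Constants are illustrative, not the paper's values.
def lif_step(u, x, s_prev, tau=0.5, v_th=1.0, v_reset=1.0):
    """One LIF update: U[t+1] = tau*U[t] + X[t+1] - S[t]*V_reset, then threshold."""
    u_next = tau * u + x - s_prev * v_reset
    s_next = 1.0 if u_next >= v_th else 0.0  # Heaviside step Theta(U - V_th)
    return u_next, s_next

def run_lif(inputs, tau=0.5, v_th=1.0, v_reset=1.0):
    """Run a single LIF neuron over a list of synaptic inputs; return its spikes."""
    u, s, spikes = 0.0, 0.0, []
    for x in inputs:
        u, s = lif_step(u, x, s, tau=tau, v_th=v_th, v_reset=v_reset)
        spikes.append(s)
    return spikes
```

With a constant sub-threshold input, the membrane potential charges over several timesteps before crossing $V_\text{th}$ and emitting a spike, which is the leaky-integration behavior the equations describe.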

Deep Learning SSViT for Modulo Video

In modulo video recovery, SSViT takes a window of low-bit observations produced by modulo cameras, with the goal of reconstructing the underlying high-dynamic-range (HDR) content. The pipeline consists of:

  • Preprocessing: A sliding window of consecutive low-bit modulo frames, from which folding masks and fold counts are extracted.
  • Encoder: Shared CNN-style encoder produces a 4D feature map, further split into spatiotemporal tubes and projected into tokens.
  • Token Selection: Intricate regions are located using a 3D Neighboring Similarity Matrix (NSM), and only tokens corresponding to highest NSM scores are processed by the Transformer backbone.
  • Transformer: Joint space-time attention is performed over the selected tokens (from the target frame) and all tokens from supporting frames (for context).
  • Decoder: Patchwise binary mask prediction for folding recovery.

Notably, all positional embeddings are omitted, justified empirically by the data-driven nature of the mask classification task.

2. Selective Attention Mechanisms

Saccadic Spike Self-Attention (SSSA)

Vanilla dot-product self-attention fails with spike-based, binary, and sparse representations because of magnitude fluctuations. SSSA replaces it with spike-distribution-based spatial relevance:

  • Each $D$-dimensional spike vector is modeled as a Bernoulli process (one distribution for the Query, one for the Key).
  • The cross-entropy between the Query and Key distributions serves as the relevance score.

The silent-period term of the cross-entropy is dropped, and the remaining logarithmic term is approximated linearly for spike rates near 0.1–0.2.

  • For all tokens, these pairwise relevance scores are assembled into a spatial-relevance matrix, which replaces the dot-product score matrix in the cross-attention computation.
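As a hedged illustration of the idea, the Bernoulli cross-entropy relevance between a query and key spike vector can be estimated from their spike rates. The exact formula is not reproduced in this summary, so the function below is an illustrative reconstruction (with the silent-period term dropped, as described above):

```python
import math

# Sketch of a Bernoulli cross-entropy relevance score between binary spike
# vectors. Rates are estimated empirically; this is an illustrative
# reconstruction, not the paper's exact formulation.
def spike_rate(v):
    """Empirical firing probability of a binary spike vector."""
    return sum(v) / len(v)

def cross_entropy_relevance(q, k, eps=1e-6):
    """H(p_q, p_k) = -p_q*log(p_k) - (1-p_q)*log(1-p_k); the second
    ("silent-period") term is dropped, keeping only the firing term."""
    pq, pk = spike_rate(q), spike_rate(k)
    return -pq * math.log(pk + eps)
```

Note that a sparser key (lower $p_K$) yields a larger score for the same query, reflecting that rare coincident firing carries more information than firing against a dense background.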

Saccadic Interaction Module

Inspired by biological saccades, the attention mechanism dynamically selects spatial locations at each timestep:

  • Salient-patch scoring: each patch receives a salience score derived from its spiking activity.
  • Temporal salience accumulation via a learnable lower-triangular matrix, so each timestep aggregates only current and past salience:
    • Training: accumulated salience is computed densely across timesteps through the lower-triangular matrix.
    • Inference: gating depends only on the current salience and a dynamically adjusted threshold.

Token Selection via 3D NSM

For deep learning-based SSViT, regions likely to require non-trivial recovery are identified by heterogeneity in local features:

  • The NSM combines the Kullback–Leibler divergence between the softmax of local features and a uniform distribution with the average cosine dissimilarity to neighboring features.
  • Only the top-scoring tokens by average NSM are chosen for full attention, focusing resources where folding ambiguity is highest.
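A minimal sketch of a scorer combining the two ingredients named above is given below. The function names, the unit weighting of the two terms, and the selection ratio are all assumptions for illustration; the paper's exact 3D NSM is not reproduced here:

```python
import math

# Illustrative token scorer: KL(softmax(features) || uniform) plus average
# cosine dissimilarity to neighbors. Weighting and names are assumptions.
def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def kl_to_uniform(xs):
    """KL divergence of softmax(xs) from the uniform distribution."""
    p = softmax(xs)
    u = 1.0 / len(p)
    return sum(pi * math.log(pi / u) for pi in p if pi > 0)

def cosine_dissim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb + 1e-12)

def nsm_score(token, neighbors):
    hetero = kl_to_uniform(token)
    dissim = sum(cosine_dissim(token, n) for n in neighbors) / len(neighbors)
    return hetero + dissim

def select_top_tokens(tokens, neighbors_of, ratio=0.5):
    """Return indices of the top-scoring fraction of tokens, in index order."""
    scores = [nsm_score(t, neighbors_of[i]) for i, t in enumerate(tokens)]
    k = max(1, int(len(tokens) * ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```

A flat token surrounded by identical neighbors scores near zero on both terms, while a peaked token with dissimilar neighbors scores high, which matches the intent of routing full attention toward heterogeneous ("intricate") regions.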

3. Computational Efficiency and Complexity Analysis

Spiking SSViT (SNN-ViT)

  • Baseline spiking self-attention has complexity quadratic in the number of tokens.
  • SSSA exploits distribution-kernel factorizations, reducing complexity to linear in the token count.
  • SSSA-V2 further linearizes computation by compressing kernel operations and thresholding, maintaining full spatial-temporal selectivity without quadratic overhead.

Modulo Video SSViT

  • Token selection avoids running full attention over every token: only the selected subset of target-frame tokens incurs the quadratic Transformer cost.
  • Unselected tokens are handled by warping predicted masks with FlowNet2-inferred optical flow, circumventing repeat attention calculations.
  • Feature encodings are cached, as frames are encoded once per iteration.
Method/Architecture              Complexity                           Memory/FLOPs Saving
Spiking SSViT (SSSA-V2)          Linear in token count                Full spike-driven linear scaling
Modulo Video SSViT (Selection)   Quadratic only over selected tokens  Reduced vs. full attention
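As a back-of-envelope illustration of why token selection pays off under quadratic attention (an assumed simplified cost model that ignores supporting-frame tokens and encoder/decoder cost):

```python
# Back-of-envelope cost model: self-attention score computation scales with
# tokens^2, so attending over n_selected of n_total target tokens costs roughly
# (n_selected / n_total)^2 of the full-attention matmul. Illustrative only.
def attention_cost_ratio(n_total, n_selected):
    return (n_selected / n_total) ** 2

# Selecting a quarter of the tokens leaves about 1/16 of the attention cost.
```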

4. Experimental Performance and Benchmarking

Spiking SSViT

  • Image Classification (CIFAR100, T=4):

SNN-ViT achieves 80.1% accuracy with 5.6M parameters, outperforming Spikformer (78.2%, 9.3M) and Spike-driven ViTs (78.4%, 10.3M).

  • ImageNet-1K (T=4):

SNN-ViT-8-512 achieves 80.2% Top-1 accuracy with 35.8 mJ energy per sample, competitive with a standard ViT-12-768 (77.9%, 80.9 mJ).

  • Remote-sensing Detection:

SNN-ViT as a backbone in YOLO-v3 improves [email protected] on SSDD from 94.8% to 96.7%.

Modulo Video SSViT

  • Datasets:

Tested on LiU (12 sequences, 1280×720) and HdM (up to 1920×1080).

  • Metrics:
    • PSNR (LiU/HdM): SSViT 28.85 / 29.38, besting UnModNet (27.71 / 28.03), Uformer (11.38 / 14.36), MRF (12.43 / 15.84).
    • SSIM: SSViT 0.871/0.850, UnModNet 0.811/0.824, Uformer 0.482/0.535.
  • Qualitative Assessment:

In high-dynamic-range scenes, SSViT recovers fold edges and fine detail that prior methods tend to lose.

5. Training and Optimization Methodologies

  • Spiking SSViT:
    • Supervised cross-entropy loss on final membrane potentials.
    • Surrogate gradients are applied to backpropagate through the non-differentiable Heaviside spiking function.
    • Regularization via weight decay and scheduling of training hyperparameters.
  • Modulo Video SSViT:
    • Standard cross-entropy loss is applied to binary folding mask prediction, accumulated over selected tokens at each iteration.
    • Training is end-to-end with the Adam optimizer for 200k iterations on fixed-length clips.
    • Outputs are temporally tone-mapped with a smoothed Reinhard operator.
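The global Reinhard operator maps luminance $L$ to $L/(1+L)$. One plausible reading of "temporally smoothed" is exponential smoothing of the per-frame log-average luminance that drives the operator; the sketch below makes that assumption explicit (the smoothing scheme and constants are illustrative, not the paper's exact operator):

```python
import math

# Hedged sketch of temporally smoothed Reinhard tone mapping: the classic
# global operator L/(1+L), driven by a log-average luminance that is
# exponentially smoothed across frames. Smoothing scheme is an assumption.
def log_avg_luminance(frame, eps=1e-6):
    """Geometric mean of luminances, the usual Reinhard scene 'key' estimate."""
    return math.exp(sum(math.log(eps + l) for l in frame) / len(frame))

def reinhard(frame, key, l_avg):
    """Scale to the target key, then compress with L/(1+L) into [0, 1)."""
    scaled = [key * l / l_avg for l in frame]
    return [l / (1.0 + l) for l in scaled]

def tonemap_sequence(frames, key=0.18, alpha=0.9):
    """Tone-map a sequence, smoothing the log-average luminance over time
    to avoid flicker from frame-to-frame exposure jumps."""
    out, l_smooth = [], None
    for frame in frames:
        l_cur = log_avg_luminance(frame)
        l_smooth = l_cur if l_smooth is None else alpha * l_smooth + (1 - alpha) * l_cur
        out.append(reinhard(frame, key, l_smooth))
    return out
```

The temporal smoothing of the luminance estimate is what keeps the mapping stable across frames; per-frame normalization alone would flicker whenever scene brightness changes abruptly.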

6. Ablation Studies and Analytical Insights

  • Incorporating SSSA (spatial attention) alone improves CIFAR100 accuracy from 78.21% to 79.60%; GL-SPS (patch split) alone yields 77.88%, slightly below baseline.
  • SSSA and GL-SPS combined yield 80.10%, a +1.89% gain over the baseline.
  • Micro-ablation of SSSA: replacing the distribution kernel with dot-product or the saccadic neuron with LIF both reduce performance (–0.48% and –0.76%, respectively).
  • Linearized SSSA-V2 delivers comparable accuracy to quadratic SSSA-V1.
Variant                  Params   CIFAR100 Acc.
Spikformer (Baseline)    9.32M    78.21%
+ SSSA (spatial only)    5.52M    79.60%
+ GL-SPS (patch only)    5.81M    77.88%
+ Both                   5.57M    80.10%

7. Extensions, Limitations, and Future Work

SSViT’s selective spatiotemporal approaches are amenable to other modalities, such as event-based audio and radar, or to scenarios requiring efficient high-dimensional attention. The biological inspiration (the saccade) can generalize to sequence data beyond vision. Practical hardware acceleration is an active direction, especially given the linear complexity and event-driven flow of the spiking architecture.

Current limitations include:

  • Reliance on multi-timestep inference (up to T = 8 timesteps).
  • Fixed patch grids; dynamic saccade regions are yet to be realized.
  • The learnable accumulation matrix's lower-triangular constraint may not capture more general temporal relations.
  • For modulo recovery, explicit positional encoding is omitted because tasks do not require absolute coordinates, but this could be re-examined for scene-dependent masks.

A plausible implication is that selective, biologically inspired spatiotemporal attention offers a systematic path toward both energy and computational efficiency in domains where attention to sparse, informative regions is critical. The demonstrated state-of-the-art results in spiking vision tasks and in modulo video recovery indicate broad applicability for SSViT architectures under resource constraints (Wang et al., 18 Feb 2025; Geng et al., 9 Nov 2025).
