
Sparse Temporal Token Fusion (STTF)

Updated 30 November 2025
  • Sparse Temporal Token Fusion is an adaptive compression method that reuses token embeddings and only re-encodes regions with significant changes in video data.
  • It exploits high temporal redundancy to reduce computational load, optimize memory usage, and accelerate processing on resource-constrained edge devices.
  • Empirical evaluations show up to 84% token reduction and 13× speedup with less than a 5% loss in accuracy compared to dense transformer models.

Sparse Temporal Token Fusion (STTF) is an adaptive compression technique designed for real-time deployment of vision-language models (VLMs) on resource-constrained edge devices. STTF leverages the high temporal redundancy present in video and event-based data by conditionally reusing existing token embeddings and re-encoding only those representing regions of significant change. This conditional token update reduces computational overhead, optimizes memory usage, and lowers latency without substantial loss in task accuracy (Tanvir et al., 23 Nov 2025).

1. Motivation: Temporal Redundancy and Edge Constraints

Edge VLMs for scenarios such as drones or wearables must operate under strict constraints in power, memory, and compute. Classical per-frame transformer encoding is inefficient for streaming visual data due to pronounced temporal redundancy; spatial patches across consecutive frames often remain static, resulting in wasteful recomputation and excessive FLOPs. STTF addresses this by incrementally updating the token set, fusing "stale" tokens with re-encoded ones at each time step.

At any time $t \in \{1,\dots,T\}$, the visual input can be:

  • An RGB frame $x_t \in \mathbb{R}^{3\times H\times W}$, or
  • A neuromorphic event tensor $e_t \in \mathbb{R}^{2\times H\times W}$ (with polarity and count channels).

Each $x_t$ is partitioned into $N$ non-overlapping patches (e.g., $16\times16$), each embedded into a $D$-dimensional vector, yielding $\{x_t^i\}_{i=1}^N$, $x_t^i \in \mathbb{R}^{D}$.
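This partition-and-embed step can be sketched in a few lines of numpy; the linear projection `W_embed` here is an illustrative stand-in for whatever patch-embedding layer the model actually uses:

```python
import numpy as np

def patchify(frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (C, H, W) frame into N = (H//patch)*(W//patch) flattened patches."""
    C, H, W = frame.shape
    gh, gw = H // patch, W // patch
    x = frame[:, :gh * patch, :gw * patch]
    x = x.reshape(C, gh, patch, gw, patch)               # (C, gh, p, gw, p)
    x = x.transpose(1, 3, 0, 2, 4).reshape(gh * gw, C * patch * patch)
    return x                                             # (N, C*patch*patch)

rng = np.random.default_rng(0)
frame = rng.standard_normal((3, 224, 224))               # x_t in R^{3 x H x W}
patches = patchify(frame)                                # N = 14*14 = 196 patches
D = 192
W_embed = rng.standard_normal((patches.shape[1], D)) * 0.02  # illustrative projection
tokens = patches @ W_embed                               # {x_t^i}, each in R^D
print(patches.shape, tokens.shape)                       # (196, 768) (196, 192)
```

A $224\times224$ RGB frame with $16\times16$ patches yields the $N=196$ tokens per frame that appear in the benchmarks below.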

2. Mathematical Formulation and Fusion Algorithm

Let $x_t = [x_t^1,\dots,x_t^N] \in \mathbb{R}^{N\times D}$ denote the current patch embeddings. Fused token embeddings from the previous timestep are denoted $\hat{x}_{t-1}$.

Event-driven change detection is performed via the per-patch change score

$$\delta_t^i = \left\| x_t^i - \hat{x}_{t-1}^i \right\|_2,$$

where a tunable threshold $\tau$ determines significance. This establishes a binary mask $m_t \in \{0,1\}^N$:

$$m_t^i = \begin{cases} 1, & \delta_t^i > \tau \\ 0, & \text{otherwise.} \end{cases}$$

A value $m_t^i = 1$ indicates the patch must be re-encoded; $m_t^i = 0$ signals reuse.

The sparse fusion update is computed as:

$$\hat{x}_t = m_t \odot x_t + (1 - m_t) \odot \hat{x}_{t-1},$$

where $\odot$ denotes broadcasted element-wise multiplication.

The output token set $\hat{x}_t$ is suitable for downstream multi-modal attention and language decoding.
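One STTF step can be sketched as follows, assuming the change score and mask defined above; `encode` is a placeholder for the (expensive) token encoder, and all names are illustrative rather than taken from the paper's implementation:

```python
import numpy as np

def sttf_step(x_t, x_cached, tau, encode):
    """One STTF update: re-encode only patches whose embedding changed by > tau."""
    delta = np.linalg.norm(x_t - x_cached, axis=1)   # per-patch change score delta_t^i
    mask = delta > tau                               # binary mask m_t
    fused = x_cached.copy()                          # reuse stale tokens by default
    if mask.any():
        fused[mask] = encode(x_t[mask])              # sparse re-encoding of changed patches
    return fused, mask

rng = np.random.default_rng(1)
N, D = 196, 192
x_prev = rng.standard_normal((N, D))                 # cached tokens x_hat_{t-1}
x_t = x_prev.copy()
x_t[:30] += 1.0                                      # only 30 patches actually changed
identity_encode = lambda tokens: tokens              # stand-in for the real encoder
fused, mask = sttf_step(x_t, x_prev, tau=5.0, encode=identity_encode)
print(mask.sum())                                    # 30 tokens refreshed, 166 reused
```

Only the masked subset ever reaches the encoder, which is where the FLOPs savings discussed next come from.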

3. Hardware-Aware Implementation and Computational Savings

STTF architecture is tailored for edge hardware:

  • Memory: Fused token sets $\hat{x}_t$ are cached in on-chip SRAM/scratchpad for low-latency updates.
  • Parallelism: Vectorized threshold comparisons ($\delta_t^i > \tau$) allow simultaneous mask computation across all tokens. Only the minimal active token list is encoded, maximizing the benefits of token sparsity.
  • FLOPs reduction: For $N$ tokens and $k_t$ refreshed tokens at time $t$, the relative per-frame FLOPs savings is:

$$S_t = 1 - \frac{k_t}{N}.$$

With $k_t \ll N$, total computational load over the sequence is proportional to $\sum_{t=1}^{T} k_t$.
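As a quick arithmetic check, the savings formula reproduces the reported 84% figure from the benchmark numbers below ($N = 196$ dense tokens, roughly 31 refreshed tokens per frame):

```python
# Reported benchmark values: 196 dense tokens, ~31 refreshed per frame.
N, k = 196, 31
savings = 1 - k / N                       # S_t = 1 - k_t / N
print(f"FLOPs savings: {savings:.0%}")    # FLOPs savings: 84%
```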

4. Empirical Characterization

Extensive evaluation demonstrates substantial gains in token efficiency, accuracy retention, and latency:

  • DVS128 Gesture (event video):
    • Baseline: $N = 196$ tokens/frame.
    • STTF: approximately 31 tokens/frame (84% token reduction), with accuracy of 95.6% versus 98.4% for the dense Vision Transformer (ViT) baseline.
  • Encoder cost: per-frame FLOPs reduced to 0.16× of the dense baseline for the fusion stage.
  • End-to-end latency: Up to 13× improvement versus dense ViT+GPT on Jetson Nano hardware.

| Metric | Dense ViT | STTF | Relative |
|---|---|---|---|
| Avg tokens per frame | 196 | 31 | −84% |
| Recognition accuracy | 98.4% | 95.6% | −2.8 pp |
| Encoder FLOPs per frame | 1.0× | 0.16× | −84% |
| End-to-end latency (ms) | 120 | 9 | 13× faster |

Increasing $\tau$ (i.e., more aggressive token reuse) reduces computation at a moderate cost to accuracy; $\tau \in [0.1, 0.3]$ typically achieves 80–90% token reduction with less than a 5% accuracy decrease.
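The shape of this trade-off can be sketched with synthetic per-patch change scores; a real sweep would measure task accuracy alongside the refresh rate, and all distributions and numbers here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic change scores for 1000 frames of 196 patches: most patches are
# nearly static (small scores); a small moving region changes strongly.
static = rng.exponential(scale=0.02, size=(1000, 180))
moving = rng.exponential(scale=0.40, size=(1000, 16))
scores = np.concatenate([static, moving], axis=1)

for tau in (0.1, 0.2, 0.3):
    refresh_rate = (scores > tau).mean()   # fraction of tokens re-encoded
    print(f"tau={tau:.1f}  token reduction={1 - refresh_rate:.1%}")
```

Raising `tau` monotonically shrinks the refreshed set; the deployment question is where on that curve accuracy starts to fall off.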

5. Threshold Selection and Stability Techniques

Optimal operation of STTF depends on careful hyperparameter tuning:

  • Threshold $\tau$: Recommended practice is to select $\tau$ as a percentile of the patch embedding changes $\|x_t^i - \hat{x}_{t-1}^i\|_2$ measured on a validation set. Sweeping $\tau$ across a range of values and plotting the resulting trade-off curve for token count vs. accuracy identifies the inflection point for practical deployment.
  • Stabilization: Applying a momentum update for cached tokens ($\hat{x}_t^i \leftarrow \beta\,\hat{x}_{t-1}^i + (1-\beta)\,x_t^i$ with $\beta$ close to 1) suppresses re-encoding jitter. Early stopping in training prevents overfitting to sparse updates, and $\ell_1$ regularization on the sparsity penalty $\|m_t\|_1$ encourages smoother token masks.
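Both techniques above can be sketched together; the percentile level `q` and momentum coefficient `beta` are illustrative choices, not values specified by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Threshold selection: set tau at a chosen percentile of the per-patch
# change norms ||x_t^i - x_hat_{t-1}^i|| collected on a validation set.
val_change_norms = rng.exponential(scale=0.05, size=10_000)
q = 85                                      # illustrative percentile level
tau = float(np.percentile(val_change_norms, q))

# Stabilization: blend cached and freshly re-encoded tokens instead of a
# hard swap, suppressing re-encoding jitter between consecutive frames.
def momentum_update(cached, fresh, beta=0.9):
    return beta * cached + (1 - beta) * fresh

cached = np.ones(192)                       # cached token x_hat_{t-1}^i
fresh = np.zeros(192)                       # freshly encoded token x_t^i
smoothed = momentum_update(cached, fresh)   # each entry is 0.9*1 + 0.1*0 = 0.9
print(round(tau, 4), smoothed[0])
```

With `beta` near 1, a single noisy re-encoding only nudges the cached state rather than replacing it outright.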

6. Summary and Implications

STTF reconceptualizes transformer encoding for video/event vision as an incremental state update, efficiently combining static token reuse and sparse re-encoding according to data-driven change. This framework enables up to 84% FLOPs reduction and up to 13× real-time speedup on edge hardware with minimal accuracy penalty (under 5%). The approach is compatible with event-driven computer vision, hardware-friendly due to explicit mask logic and on-chip state buffering, and lends itself to further research in adaptive attention and incremental representation (Tanvir et al., 23 Nov 2025). A plausible implication is that STTF can generalize to broader classes of sequential transformer tasks where high temporal redundancy is present, provided stateful token caching and rapid change detection are feasible.
