
Temporal Token Reuse in Sequential Models

Updated 23 January 2026
  • Temporal Token Reuse is a method that reuses previously computed token representations across time steps to avoid redundant processing in sequence models.
  • Empirical results show that TTR can nearly halve FLOPs and lower latency in applications ranging from LLM inference to video segmentation.
  • Different implementations, such as prefix caching in LLMs and patch-based reuse in video processing, enable real-time performance while maintaining high model fidelity.

Temporal Token Reuse (TTR) refers to a class of techniques across sequence modeling, video understanding, image synthesis, and online inference that exploit temporal or sequential redundancy by explicitly carrying forward (reusing) token representations computed at prior timesteps, queries, or frames. These approaches prevent unnecessary recomputation, enable prompt-driven inference, reduce latency, and improve computational efficiency without compromising fidelity. TTR has been independently instantiated for LLM serving via prefix caches, online tracking via propagated temporal tokens in transformer architectures, progressive generative modeling with selective re-encoding, and adaptive patch propagation in video segmentation.

1. Formal Mechanisms and Implementation Paradigms

TTR is instantiated by various architectures and domain-specific strategies:

A. Prefix Reuse in LLM Serving:

RadixAttention organizes the key–value (KV) cache of processed prefixes in a radix tree. For an incoming prompt $x$, the system retrieves the maximal stored prefix $p = \arg\max_{q:\,q\preceq x}|q|$, restores the precomputed $(K_p, V_p)$, and computes only on the unshared suffix. For queries $x$ and $y$, $|x| - \operatorname{Overlap}(x,y)$ tokens require recomputation; shared prefix tokens are never recomputed for "temporally" adjacent queries (Dexter et al., 7 Feb 2025).
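The prefix-matching step can be sketched with a toy trie over integer token ids. This is a deliberate simplification: real RadixAttention stores the actual KV tensors in a compressed radix tree and handles eviction under memory pressure.

```python
class PrefixCache:
    """Toy prefix cache over token ids (illustrative; not SGLang's implementation)."""

    def __init__(self):
        self.root = {}  # nested dicts: token id -> child node

    def insert(self, tokens):
        """Record a processed prompt so later queries can reuse its prefix."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        """Length of the longest cached prefix p with p precedes-or-equals tokens."""
        node, depth = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, depth = node[t], depth + 1
        return depth
```

For a new prompt `x`, only the `len(x) - longest_prefix(x)` suffix tokens need fresh prefill compute.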

B. Dense Temporal Token Propagation in Tracking:

Video tracking models such as ODTrack and ModalTracker append a small "temporal token" $T_{t-1}$ to the standard token set for each frame. After processing through ViT layers, an updated $T_t$ is extracted, summarizing appearance and trajectory information, and is injected as a prompt at the next timestep. The attention mechanism then leverages $T_{t-1}$ throughout inference, providing persistent memory across frames (Zheng et al., 2024, Zheng et al., 27 Jul 2025).

C. Progressive Generation with Selective Token Update:

Non-autoregressive token-based generation models such as ENAT identify, at each decoding step, a small set of "critical" tokens (those whose representation is changing), re-encode only these, and propagate lightweight projections of the cached features for the majority of static tokens. This exploits the finding that most token representations remain stable across steps, so only tokens with changing state require expensive re-encoding (Ni et al., 2024).

D. Patch-wise Feature Cache in Video Segmentation:

On embedded UAV systems, TTR frameworks formulate video frames as a grid of image patches (tokens). Each patch is compared, via cosine similarity, to its counterpart in the previous frame. Patches exceeding a similarity threshold reuse cached deep features and bypass backbone computation. Only patches with significant changes are fully processed, thus reducing both latency and energy consumption (Sharma et al., 16 Jan 2026).

2. Mathematical Frameworks and Update Equations

LLM Inference

For a query queue $Q=\{(x_i,t_i)\}$, the completion time for the $k$-th processed query is

$$R(j_k) = \max\{\,R(j_{k-1}),\,t_{j_k}\,\} + (1+\alpha)\,|x_{j_k}|\,\bigl(|x_{j_k}| - \operatorname{Overlap}(x_{j_k},x_{j_{k-1}})\bigr)$$

where $\alpha$ captures the relative attention vs. MLP cost. TTFT is $R(j_k) - t_{j_k}$, and only the suffix (non-reused tokens) is computed per query (Dexter et al., 7 Feb 2025).
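Under this unit-cost model the recurrence can be evaluated directly. The sketch below assumes queries are processed in the given order and that only consecutive queries share cache state, as in the formula:

```python
def overlap(x, y):
    """Length of the shared token prefix of sequences x and y."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n

def ttft_schedule(queries, alpha=0.5):
    """TTFT per query for a fixed order of (tokens, arrival_time) pairs, using
    R(j_k) = max{R(j_{k-1}), t_{j_k}} + (1 + alpha) |x| (|x| - Overlap)."""
    finish, prev, ttfts = 0.0, [], []
    for tokens, t in queries:
        work = (1 + alpha) * len(tokens) * (len(tokens) - overlap(tokens, prev))
        finish = max(finish, t) + work
        ttfts.append(finish - t)
        prev = tokens
    return ttfts
```

Reordering the queue so that high-overlap queries become adjacent shrinks the `work` term, which is exactly what the scheduling problem in Section 3 optimizes.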

Online Visual Tracking

The per-frame input token sequence at time $t$ is

$$Z_t^0 = [\operatorname{Emb}(R_1),\ldots,\operatorname{Emb}(R_k),\operatorname{Emb}(S_t),\, T_{t-1}]$$

and the transformer updates

$$Z_t^{l} = \delta^l(Z_t^{l-1}), \qquad T_t = [Z_t^L]_{\text{token-pos}}, \qquad T_{t+1}^{\text{init}} = T_t + T_{\text{empty}}$$

This recursive propagation auto-regressively injects TTR into the pipeline (Zheng et al., 2024).
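The propagation loop can be mimicked with a toy stand-in for the ViT (random linear layers with a tanh nonlinearity). The dimensions, layer count, and zero initialization of $T_0$ and $T_{\text{empty}}$ are illustrative assumptions; the real trackers learn these components.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 8, 2                                   # toy embedding dim / layer count
W = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(L)]

T_empty = np.zeros(D)                         # learnable in the real tracker
T = np.zeros(D)                               # T_0

for frame in range(5):
    S_t = rng.standard_normal((4, D))         # stand-in for Emb(S_t)
    Z = np.vstack([S_t, T[None, :]])          # Z_t^0 = [..., Emb(S_t), T_{t-1}]
    for W_l in W:
        Z = np.tanh(Z @ W_l)                  # Z_t^l = delta^l(Z_t^{l-1})
    T_t = Z[-1]                               # T_t = [Z_t^L]_{token-pos}
    T = T_t + T_empty                         # T_{t+1}^init = T_t + T_empty
```

Each frame's forward pass reads the previous frame's temporal token at no extra cost; no per-frame optimization or gating module is required.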

Progressive Generation

For a latent token map $\bm{v}^{(t)}$ with feature states $\bm{z}^{(t)}$, ENAT updates

$$\tilde{\bm h}^{(t)}_i = \begin{cases} \operatorname{ENC}(\bm{v}^{(t)}_i), & i\in\Delta^{(t)} \\ f(\bm{z}^{(t-1)}_i), & i\notin\Delta^{(t)} \end{cases}$$

Decoding operates over this hybrid of freshly encoded and reused features (Ni et al., 2024).
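The case split maps directly to code. In this sketch, `encode` and `project` are hypothetical stand-ins for the full encoder $\operatorname{ENC}$ and the light projection $f$:

```python
import numpy as np

def enat_step(v, z_prev, critical, encode, project):
    """Re-encode only tokens in the critical set; project cached features elsewhere."""
    h = np.empty_like(z_prev)
    for i in range(len(v)):
        h[i] = encode(v[i]) if i in critical else project(z_prev[i])
    return h

v = np.ones((4, 2))                      # current token map v^(t)
z_prev = np.zeros((4, 2))                # cached features z^(t-1)
h = enat_step(v, z_prev, critical={1, 3},
              encode=lambda x: 2 * x,    # stand-in for ENC
              project=lambda z: z)       # stand-in for f
```

Only $|\Delta^{(t)}|$ of the $N$ tokens pay the encoder cost, which is where the FLOP savings quantified in Section 3 come from.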

Patch-based Video Segmentation

Similarity for patch $i$:

$$\operatorname{Similarity}(p_{t,i}) = \frac{p_{t,i} \cdot p_{t-1,i}}{\lVert p_{t,i} \rVert\,\lVert p_{t-1,i} \rVert}$$

If $\operatorname{Similarity}(p_{t,i}) > \tau$, cached features are reused; otherwise, the backbone recomputes the patch. Spatial masks and feature caches are maintained at each layer (Sharma et al., 16 Jan 2026).
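A minimal sketch of the reuse decision, assuming patches are flattened to vectors and `backbone` is a hypothetical per-patch feature extractor (the cited framework maintains the mask and cache per layer; here one cache array stands in for the whole backbone):

```python
import numpy as np

def segment_frame(patches_t, patches_prev, cache, backbone, tau=0.95):
    """Reuse cached features for patches whose cosine similarity to the
    previous frame exceeds tau; recompute only the changed patches."""
    num = np.einsum("nd,nd->n", patches_t, patches_prev)
    den = (np.linalg.norm(patches_t, axis=1)
           * np.linalg.norm(patches_prev, axis=1) + 1e-8)
    reuse = (num / den) > tau                 # per-patch reuse mask
    feats = cache.copy()
    if (~reuse).any():
        feats[~reuse] = backbone(patches_t[~reuse])
    return feats, reuse
```

The fraction of `True` entries in `reuse` is the reuse ratio $r$ reported in Section 3.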

3. Scheduling, Computational Hardness, and Theoretical Guarantees

Inference Order and Scheduling

For LLMs leveraging prefix reuse, the query scheduling problem—i.e., determining an inference ordering to minimize time-to-first-token (TTFT) across a stream—becomes NP-hard under non-preemptive, TTFT-bounded constraints. Even with only pairwise reuse, the decision problem of meeting all TTFT bounds is computationally intractable (reduction from 3-PARTITION) (Dexter et al., 7 Feb 2025).

The $k$-LPM scheduling algorithm interpolates between FCFS (first-come, first-served) and LPM (longest prefix match), ensuring bounded waiting time for poorly-reusing queries while still exploiting reuse opportunities for high-overlap ones. Under realistic shuffled traffic models, $k$-LPM guarantees an explicit upper bound on the maximum TTFT, outperforming both FCFS and LPM when $k>1$ (Dexter et al., 7 Feb 2025).
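A sketch of one plausible reading of $k$-LPM, assuming it restricts longest-prefix-match selection to a window of the $k$ earliest pending queries; the precise policy and its TTFT bound are specified in the cited paper.

```python
def k_lpm_order(queries, k):
    """Repeatedly pick, among the k earliest pending queries, the one sharing
    the longest prefix with the previously served query. k=1 reduces to FCFS;
    k=len(queries) to pure LPM. (Illustrative reading, not the paper's code.)"""
    def common_prefix(x, y):
        n = 0
        for a, b in zip(x, y):
            if a != b:
                break
            n += 1
        return n

    pending = list(queries)            # assumed sorted by arrival time
    order, prev = [], []
    while pending:
        best = max(pending[:k], key=lambda q: common_prefix(q, prev))
        pending.remove(best)
        order.append(best)
        prev = best
    return order
```

Bounding the window by arrival order is what keeps low-overlap queries from starving under a pure LPM policy.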

Complexity and FLOP Savings

In progressive generation, TTR enables a near $1.9\times$ reduction in inference FLOPs (e.g., 39.6 GFLOPs → 20.8 GFLOPs) while simultaneously improving FID, by focusing compute only on tokens whose state changes at a given step (Ni et al., 2024).

In patch-based segmentation, empirical reuse ratios of $r\approx 0.3$–$0.6$ yield $30\%$–$35\%$ reductions in per-frame compute, directly translating to lower latency and energy use, with mIoU degradation consistently $<0.5\%$ (Sharma et al., 16 Jan 2026).

4. Domain-Specific Instantiations and Empirical Results

LLMs and Online Query Serving

When serving Llama-3.1-8B-Instruct at scale, empirical evaluations show that TTR via prefix-reuse-driven scheduling cuts P99 TTFT by up to tens of milliseconds compared to baseline policies, particularly in prefill-dominant regimes with long prompts. The results confirm substantial operational gains: worst-case first-token latency is sharply reduced because recomputation of shared prefixes is eliminated (Dexter et al., 7 Feb 2025).

Video and Multi-Modal Tracking

ODTrack and ModalTracker demonstrate that propagating temporal tokens across ViT blocks enables online models to capture long-range spatiotemporal correlations without extra optimizers or gating. Ablations confirm that TTR confers $1.8\%$–$2.8\%$ AUC lifts on LaSOT, with the best results for concatenated-token attention. Multi-modal fusion further extends TTR, achieving 32 FPS versus 11 FPS for video transformers that lack token reuse (Zheng et al., 2024, Zheng et al., 27 Jul 2025).

Token-based Image Synthesis

ENAT’s TTR implementation reduces computation for non-critical tokens in progressive NATs. Only the representations of "critical" positions, where the hidden state changes, are re-encoded, while the remainder are lightly projected and reused. This nearly halves FLOPs at comparable or improved generative fidelity (lower FID) (Ni et al., 2024).

Real-Time Video Segmentation

On edge hardware (e.g., Jetson Orin Nano), TTR enables on-board segmentation of high-resolution UAV video at real-time rates (15 → 25 FPS, $+67\%$, EfficientNet-B4 backbone). Accuracy loss is negligible ($<0.5\%$ mIoU), and the patch caches are robust to drift over prolonged sequences. The approach is most effective in domains with temporally static backgrounds and fails gracefully (limited speed gain) in fully dynamic scenes (Sharma et al., 16 Jan 2026).

5. Limitations, Domain Constraints, and Parameter Sensitivity

TTR methods are most effective where strong temporal redundancy exists—i.e., in applications with static or slowly changing tokens (e.g., repeated LLM prompts, stationary video backgrounds). Aggressive reuse (low similarity threshold or minimal critical token selection) may cause stale features or accuracy drift, although mechanisms such as periodic recomputation, adaptive thresholds, and lightweight projection correct most issues.

Patch-based caches are best suited to architectures with spatial alignment (standard convolutions) and become less applicable in models relying heavily on non-local global attention or dilated convolutions.

Scene dynamism affects the efficiency–accuracy trade-off: in highly dynamic or stochastic environments, the TTR overhead may dominate, providing negligible savings. Conversely, static scenes approach ideal compute reductions, but must still account for sudden motion or change.

6. Broader Significance and Theoretical Implications

TTR generalizes across transformers, convolutional nets, and hybrid architectures, and underlies multiple recent advances in efficient online inference, stateful generation, and high-throughput video processing. By reducing the frequency and scope of recomputation as dictated by temporal structure within token sequences, TTR fundamentally improves the operational resource–accuracy trade space.

Theoretical formulations resulting in intractable scheduling for maximal benefit (e.g., under TTFT constraints) motivate the development of practical heuristics (kk-LPM, greedy selection), while domain-specific realizations (temporal prompting, critical token detection) confirm TTR’s empirical utility. A plausible implication is that as model sizes and inference rates grow, TTR or closely related ideas will become necessary to achieve sustainable deployment in real-world systems with bounded latency and power budgets.

7. Summary Table: TTR in Representative Domains

| Domain | TTR Mechanism | Key Results/Impacts |
| --- | --- | --- |
| LLM inference | RadixAttention prefix cache | NP-hard scheduling, TTFT reduction (Dexter et al., 7 Feb 2025) |
| Visual/object tracking | Temporal token prompts | +2.8% AUC, real-time tracking (Zheng et al., 2024, Zheng et al., 27 Jul 2025) |
| Image synthesis (NATs) | Critical token selection | ~2× FLOP savings, improved FID (Ni et al., 2024) |
| UAV video segmentation | Patch similarity masks | +67% FPS, <0.5% mIoU loss (Sharma et al., 16 Jan 2026) |
