
Evict3R: Efficient Token Eviction for Streaming VGGT

Updated 10 December 2025
  • Evict3R is a training-free token eviction policy that dynamically discards less informative tokens during inference, ensuring bounded memory use in streaming visual geometry transformers.
  • It employs adaptive per-layer budgeting and token importance scoring to preserve crucial geometric information while mitigating unbounded KV cache growth.
  • Experimental results show up to 10x memory savings with minimal accuracy loss in 3D reconstruction, depth estimation, and visual odometry benchmarks.

Evict3R is an inference-time, training-free token eviction policy for memory-bounded streaming visual geometry transformers. It targets models such as StreamVGGT, which perform long-horizon causal attention over constantly growing key/value (KV) token caches. Evict3R reduces memory usage by dynamically discarding the least informative tokens while preserving those most relevant for downstream tasks. This lets streaming transformers process substantially longer sequences within fixed GPU memory constraints, with minimal to negligible loss in accuracy across 3D reconstruction, depth estimation, and visual odometry benchmarks (Mahdi et al., 22 Sep 2025).

1. Rationale and Problem Formulation

Streaming visual geometry transformers, exemplified by StreamVGGT, cache KV token embeddings from all prior frames and perform causal, global attention for frame-by-frame 3D scene reasoning. The internal KV cache grows linearly with time step $T$ as

$$\text{Mem}_{\text{KV}} = T \times M \times L_g \times H \times 2d$$

where $M$ is the number of tokens per frame, $L_g$ is the number of global-attention layers, $H$ is the number of attention heads per layer, and $d$ is the KV dimension. Without intervention, the cache grows without bound, rapidly exceeding GPU memory for long sequences and limiting online deployment.
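To make the linear growth concrete, the formula can be evaluated directly. The sketch below is illustrative only: the configuration values (`M`, `L_g`, `H`, `d`) and the 2-byte (fp16) storage per value are assumptions chosen for the example, not figures from the paper.

```python
def kv_cache_bytes(T, M, L_g, H, d, bytes_per_value=2):
    """Bytes held by the KV cache after T frames.

    Implements Mem_KV = T * M * L_g * H * 2d, where the factor 2 covers
    keys and values. bytes_per_value=2 assumes fp16 storage (an assumption).
    """
    return T * M * L_g * H * 2 * d * bytes_per_value


# Hypothetical configuration, chosen only to illustrate the scaling.
cfg = dict(M=1000, L_g=24, H=16, d=64)

for T in (10, 100, 1000):
    gib = kv_cache_bytes(T, **cfg) / 2**30
    print(f"T={T:5d} frames -> {gib:8.2f} GiB")
```

Because every factor other than `T` is fixed by the architecture, doubling the number of frames doubles the cache, which is why a fixed budget is needed for long streams.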

Existing architectural interventions (e.g., explicit memory bottlenecks or learnable compressive memory modules) require retraining and may not generalize without accuracy degradation. Evict3R addresses this by introducing an online, training-free, inference-time policy: it selectively evicts cached tokens on-the-fly, bounding memory and maintaining accuracy (Mahdi et al., 22 Sep 2025).

2. Formal Specification

At each global attention layer $\ell$, the model maintains KV caches over $t$ frames:

$$K_{1:t}^{(\ell)} \in \mathbb{R}^{tM \times d}, \quad V_{1:t}^{(\ell)} \in \mathbb{R}^{tM \times d}.$$

Causal global attention at frame $t$ is

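$$\text{Attn}\big(Q_t^{(\ell)}, K_{1:t}^{(\ell)}, V_{1:t}^{(\ell)}\big) = \text{softmax}\!\left(\frac{Q_t^{(\ell)} \big(K_{1:t}^{(\ell)}\big)^{\top}}{\sqrt{d}}\right) V_{1:t}^{(\ell)}$$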

after which new tokens are appended. The memory objective is to enforce, for each frame $t$,

$$\sum_{\ell=1}^{L_g} \big|\mathcal{C}_t^{(\ell)}\big| \le B$$

where $\big|\mathcal{C}_t^{(\ell)}\big|$ is the number of tokens held in the layer-$\ell$ cache at frame $t$ and $B$ is the total cross-layer cache budget. The core algorithm must choose which tokens to evict so that memory never exceeds $B$ while preserving task accuracy as far as possible.
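The cache-then-attend pattern described in this section can be sketched in a few lines. This is a minimal single-head illustration using plain Python lists, not the StreamVGGT implementation: the `LayerCache` class and `attend` function are hypothetical names, and the choice to append the current frame's keys and values before attending is an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention of each query over all cached keys.

    Returns (outputs, weights); weights[q][j] is the attention that
    query q places on cached token j, and each row sums to 1.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


class LayerCache:
    """Unbounded KV cache for one layer (hypothetical helper)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Assumption: the current frame's tokens join the cache before attending.
        self.keys.extend(k)
        self.values.extend(v)
        return attend(q, self.keys, self.values)
```

The returned `weights` matrix is the quantity an eviction policy needs: without any eviction, its row length grows by the number of tokens per frame at every step.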

3. Eviction Mechanisms and Scoring

Evict3R comprises three key algorithmic components:

A. Per-Layer Budget Allocation

Layer-wise cache budgets are determined dynamically via attention sparsity statistics:

  • For each global layer $\ell$ at inference step $t$, compute the attention matrix averaged over heads.
  • Summing over the query dimension yields a cumulative attention score per cached token.
  • Define the layer sparsity metric:

[equation missing in source]

which quantifies how sharply the attention is focused over the current cache.

  • Distribute the global budget across layers proportionally, using a softmax-like weighting with a temperature parameter:

[equation missing in source]
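The exact sparsity metric and weighting are not recoverable from this page, so the following is only a sketch of what a temperature-controlled softmax split of a global token budget could look like. The function name `allocate_layer_budgets`, the assumption that a higher per-layer score earns a larger share, and the largest-remainder rounding are all illustrative choices rather than the paper's method.

```python
import math


def allocate_layer_budgets(layer_scores, total_budget, temperature=1.0):
    """Split an integer token budget across layers with softmax weights.

    Assumption: layers with a higher score receive a larger share. The
    direction used in the paper may differ; negate the scores to invert it.
    """
    m = max(layer_scores)
    weights = [math.exp((s - m) / temperature) for s in layer_scores]
    z = sum(weights)
    raw = [total_budget * w / z for w in weights]

    # Floor each share, then hand the leftover tokens to the layers with
    # the largest fractional parts so the shares sum to total_budget.
    budgets = [int(r) for r in raw]
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(len(raw)),
                         key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


budgets = allocate_layer_budgets([0.1, 0.5, 0.9], total_budget=1000, temperature=0.5)
print(budgets, sum(budgets))
```

A lower temperature concentrates the budget on the highest-scoring layers, while a high temperature approaches a uniform split.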

B. Token Importance Scoring

Each cached token in a given layer is scored for eviction based on its cumulative attention usage:

  • Over its lifetime in the cache, aggregate the attention it receives, summed over all queries and heads, at every inference step:

[equation missing in source]

  • Normalize for early time-steps (“row-length normalization”):

[equation missing in source]

  • Normalize for token tenure (exposure normalization):

[equation missing in source]

  • Tokens from the initial frame and register/camera tokens are protected from eviction.
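The scoring steps above can be sketched as a small bookkeeping class. Since the exact formulas are missing from this page, the normalizations below are assumptions: row-length normalization is modeled by multiplying each attention weight by the number of keys in its row (so uniform attention scores 1), exposure normalization by dividing by the number of query rows a token has been scored against, and protection by an infinite score. `TokenScores` is a hypothetical name.

```python
class TokenScores:
    """Tracks per-token importance for one layer (illustrative sketch)."""

    def __init__(self):
        self.cum = []        # accumulated, row-length-normalised attention
        self.exposure = []   # number of query rows each token was scored against
        self.protected = []  # True for tokens that must never be evicted

    def add_tokens(self, n, protected=False):
        self.cum.extend([0.0] * n)
        self.exposure.extend([0] * n)
        self.protected.extend([protected] * n)

    def update(self, weights):
        """weights[q][j]: head-averaged attention of query q on cached token j.

        Each row must cover every cached token and sum to 1.
        """
        for row in weights:
            n = len(row)
            for j, w in enumerate(row):
                # Assumed row-length normalisation: short (early) rows spread
                # their unit mass over fewer keys, so rescale by row length.
                self.cum[j] += w * n
                self.exposure[j] += 1

    def scores(self):
        # Assumed exposure normalisation: average over the rows seen so far.
        return [float("inf") if p else (c / e if e else 0.0)
                for c, e, p in zip(self.cum, self.exposure, self.protected)]
```

Under these assumptions a token that always receives exactly its uniform share scores 1.0, so scores below 1.0 mark tokens that are attended less than average.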

C. Online Eviction Algorithm

On every new frame:

  • For each global layer, append the new frame's tokens.
  • While the layer's cache exceeds its budget, recompute the importance scores for all cached tokens and evict the minimum-scoring token (protected tokens excepted).
  • Repeat until the layer's cache budget is no longer exceeded.

After eviction, the total model KV memory is tightly bounded by

[equation missing in source]

regardless of the streaming sequence duration $T$.
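The per-layer eviction loop can be sketched as follows. This is a minimal illustration under stated assumptions: `evict_to_budget` is a hypothetical helper, scores are passed in precomputed rather than recomputed on every iteration, and the loop stops early if only protected tokens remain.

```python
def evict_to_budget(scores, protected, budget):
    """Return the indices of tokens kept after evicting down to `budget`.

    Repeatedly removes the lowest-scoring unprotected token. If only
    protected tokens remain, the loop stops even when the budget is
    still exceeded (an assumption made for this sketch).
    """
    kept = list(range(len(scores)))
    while len(kept) > budget:
        candidates = [i for i in kept if not protected[i]]
        if not candidates:
            break  # nothing left that may be evicted
        victim = min(candidates, key=lambda i: scores[i])
        kept.remove(victim)
    return kept


# Token 0 is protected, so it survives despite any score it might have.
kept = evict_to_budget(scores=[0.9, 0.1, 0.5, 0.3],
                       protected=[True, False, False, False],
                       budget=2)
print(kept)
```

Calling this once per layer after each frame keeps every layer at or below its budget, so the total cache size no longer depends on how many frames have been streamed.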

4. Experimental Evaluation and Results

Benchmarks for Evict3R span 3D reconstruction (7-Scenes, NRGBD), video depth estimation (Sintel, KITTI), and pose estimation (Sintel, TUM-dynamics), with comparisons against state-of-the-art approaches including DUSt3R-GA, MASt3R-GA, MonST3R-GA, VGGT (offline), Spann3R, CUT3R, Point3R, and StreamVGGT.

Quantitative findings:

  • At a budget of […] (10% of baseline unbounded memory), 3D reconstruction memory drops from 9.75 GB to 7.68 GB (7-Scenes) and from 9.35 GB to 7.57 GB (NRGBD) with […] change in accuracy or completeness.
  • In ultra-long sequences ([…] test cases), memory reduction is more pronounced: 18.63 GB to 9.39 GB (7-Scenes), maintaining or slightly improving task metrics.
  • For video depth and camera pose, accuracy degradation remains […] until extremely tight budgets ([…]), at which point error increases sharply or inference fails due to over-eviction.
  • Per-step inference latency overhead is minimal at moderate budgets (e.g., rising from 0.107 s to 0.131 s for […] sequences, with some latency advantage under very tight memory).

Experimental results consistently indicate that denser temporal sampling is enabled under memory constraints, improving geometric coverage while maintaining compute feasibility (Mahdi et al., 22 Sep 2025).

5. Qualitative Insights and Failure Modes

Visualizations of retained and evicted tokens reveal that Evict3R typically evicts background or low-texture tokens and preserves geometry-salient information (e.g., edges, corners) even under severe memory constraints. Qualitative assessments (3D point clouds, attention masks) show that, up to moderate eviction rates ([…]), reconstructions are visually indistinguishable from the unbounded baseline.

When the budget is tightened to […], over-pruning leads to degraded accuracy and incomplete reconstructions, indicating a sharp phase transition past which information loss becomes unrecoverable. No formal significance testing is reported, but low variance across test sequences is observed.

6. Complexity and Practical Integration

Evict3R's eviction routines incur additional per-token scoring overhead: for each eviction step, the attention weights of all cached tokens are processed over all heads and queries. This cost is typically amortized, as eviction is triggered only when cache limits are exceeded. Under long-streaming conditions, overall per-frame latency remains near baseline.

The policy is training-free and requires no weight updates or model retraining, and only interacts with KV cache management. It is compatible with any causal-attention vision transformer that exposes per-head attention, but cannot be used with FlashAttention (which does not expose attention maps for external use in this implementation) (Mahdi et al., 22 Sep 2025).

7. Future Directions and Limitations

Evict3R achieves up to […] memory reduction on a single GPU for streaming geometry transformers, while preserving accuracy for practical memory budgets. Key strengths include plug-and-play deployment, per-layer adaptive budgeting, and applicability to any inference-time streaming transformer.

Limitations:

  • Additional inference compute for eviction scoring
  • Sharp accuracy dropoff below […]
  • Incompatibility with optimized attention kernels (FlashAttention) due to lack of accessible attention maps

Proposed directions include adaptive budgeting that evolves over time, hybrid schemes interleaving pointer memory with standard KV caches, hardware-accelerated eviction primitives, and extending to other long-context, vision, and multi-modal transformer architectures (Mahdi et al., 22 Sep 2025).


For algorithms targeting eviction-set finding in microarchitectural attack scenarios (e.g., classical Evict3R for CPU cache side-channels), foundational insights from threshold group testing and TLB/adaptive replacement effects are crucial (Vila et al., 2018). Integrating robust per-address testing, linear-time group reduction, and mitigations for TLB thrashing and cache adaptivity substantially improves both speed and reliability under strong constraints. When applied to cache eviction, these algorithmic ideas provide 5–20× practical speedups and sustained effectiveness, even with minimal physical address control (Vila et al., 2018).
