Evict3R: Efficient Token Eviction for Streaming VGGT
- Evict3R is a training-free token eviction policy that dynamically discards less informative tokens during inference, ensuring bounded memory use in streaming visual geometry transformers.
- It employs adaptive per-layer budgeting and token importance scoring to preserve crucial geometric information while mitigating unbounded KV cache growth.
- Experimental results show up to 10x memory savings with minimal accuracy loss in 3D reconstruction, depth estimation, and visual odometry benchmarks.
Evict3R is an inference-time, training-free token eviction policy for memory-bounded streaming visual geometry transformers, specifically targeting models such as StreamVGGT that perform long-horizon causal attention over constantly growing key/value (KV) token caches. Evict3R bounds memory usage by dynamically discarding the least-informative tokens while preserving those most relevant for downstream tasks. This enables streaming transformers to process substantially longer sequences within fixed GPU memory constraints, with negligible loss in accuracy across 3D reconstruction, depth estimation, and visual odometry benchmarks (Mahdi et al., 22 Sep 2025).
1. Rationale and Problem Formulation
Streaming visual geometry transformers, exemplified by StreamVGGT, cache KV token embeddings from all prior frames and perform causal, global attention for frame-by-frame 3D scene reasoning. The internal KV cache grows linearly with the time-step $t$:

$$\text{Mem}(t) \propto t \cdot N \cdot L \cdot H \cdot d,$$

where $N$ is the number of tokens per frame, $L$ is the number of global-attention layers, $H$ is the number of attention heads per layer, and $d$ is the per-head KV dimension. Without intervention, the cache grows without bound, rapidly exceeding GPU memory for long sequences and limiting online deployment.
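As a rough illustration of this linear growth, the K and V cache footprint can be estimated with a one-line calculation (the parameter values below are illustrative, not StreamVGGT's actual configuration):

```python
def kv_cache_bytes(t, tokens_per_frame, num_layers, num_heads, head_dim,
                   bytes_per_elem=2):
    """Estimate KV cache size after t streamed frames.
    Factor 2 accounts for storing both keys and values."""
    return (2 * t * tokens_per_frame * num_layers * num_heads
            * head_dim * bytes_per_elem)

# Illustrative configuration: 100 frames, 1024 tokens/frame,
# 24 global layers, 16 heads, head dim 64, fp16 storage.
gb = kv_cache_bytes(100, 1024, 24, 16, 64) / 2**30  # = 9.375 GiB
```

Doubling the number of streamed frames doubles the footprint, which is exactly the unbounded growth Evict3R is designed to cap.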
Existing architectural interventions (e.g., explicit memory bottlenecks or learnable compressive memory modules) require retraining and may not generalize without accuracy degradation. Evict3R addresses this by introducing an online, training-free, inference-time policy: it selectively evicts cached tokens on-the-fly, bounding memory and maintaining accuracy (Mahdi et al., 22 Sep 2025).
2. Formal Specification
At each global attention layer $\ell$, the model maintains KV caches over frames $1, \dots, t$:

$$K^{(\ell)}_{1:t} = \big[k^{(\ell)}_1; \dots; k^{(\ell)}_t\big], \qquad V^{(\ell)}_{1:t} = \big[v^{(\ell)}_1; \dots; v^{(\ell)}_t\big].$$

Causal global attention at frame $t$ is

$$O^{(\ell)}_t = \mathrm{softmax}\!\left(\frac{Q^{(\ell)}_t \big(K^{(\ell)}_{1:t}\big)^{\top}}{\sqrt{d}}\right) V^{(\ell)}_{1:t},$$

after which the new tokens $\big(k^{(\ell)}_t, v^{(\ell)}_t\big)$ are appended to the cache. The memory objective is to enforce, for each layer $\ell$,

$$\big|K^{(\ell)}\big| = \big|V^{(\ell)}\big| \le B_\ell, \qquad \sum_{\ell} B_\ell = B,$$

where $B$ is the total cross-layer cache budget. The core algorithm must choose which tokens to evict so that memory never exceeds $B$ while task accuracy is maximally preserved.
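The streaming attention step above can be sketched as follows (a single-head NumPy sketch; shapes and names are illustrative and do not reflect the paper's implementation):

```python
import numpy as np

def causal_attention_step(q_t, K_cache, V_cache, k_t, v_t):
    """One streaming attention step: attend the new frame's queries over
    the cached plus new KV tokens, then return the grown cache.
    Single-head sketch; q_t: [Q, d], caches: [T, d], new tokens: [N, d]."""
    K = np.concatenate([K_cache, k_t], axis=0)
    V = np.concatenate([V_cache, v_t], axis=0)
    d = q_t.shape[-1]
    logits = q_t @ K.T / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    out = A @ V
    return out, K, V, A
```

The returned attention matrix `A` is exactly the statistic that the budgeting and scoring mechanisms of Section 3 consume.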
3. Eviction Mechanisms and Scoring
Evict3R comprises three key algorithmic components:
A. Per-Layer Budget Allocation
Layer-wise cache budgets are determined dynamically from attention sparsity statistics:
- For each global layer $\ell$ at inference step $t$, compute the head-averaged attention matrix $\bar{A}^{(\ell)}_t$.
- Summing over the query dimension yields per-token cumulative scores $s^{(\ell)}_i = \sum_{q} \bar{A}^{(\ell)}_t[q, i]$.
- Define a layer sparsity metric $\sigma_\ell$ from these scores, quantifying how sharply attention is focused over the current cache.
- Distribute the global budget proportionally, using a softmax-like weighting with temperature $\tau$:

$$B_\ell = B \cdot \frac{\exp(\sigma_\ell / \tau)}{\sum_{\ell'} \exp(\sigma_{\ell'} / \tau)}.$$
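The allocation above can be sketched as follows. The sparsity metric here is taken as one minus the normalized entropy of the per-token scores, which is an assumed instantiation for illustration, not necessarily the paper's exact definition:

```python
import numpy as np

def layer_budgets(attn_maps, total_budget, temperature=1.0):
    """Allocate per-layer cache budgets from attention sparsity (sketch).
    attn_maps: list of head-averaged attention matrices, one per layer,
    each of shape [num_queries, cache_size]."""
    sparsities = []
    for A in attn_maps:
        s = A.sum(axis=0)                       # per-token cumulative scores
        p = s / s.sum()
        # Normalized entropy in [0, 1]; sharply focused attention -> low entropy.
        ent = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
        sparsities.append(1.0 - ent)
    # Softmax-like weighting with temperature over layer sparsities.
    w = np.exp(np.array(sparsities) / temperature)
    w /= w.sum()
    return np.floor(total_budget * w).astype(int)
```

A layer whose attention concentrates on few tokens (high sparsity) receives a larger share of the global budget, matching the intent of the proportional allocation rule.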
B. Token Importance Scoring
Each token $i$ in layer $\ell$ is scored for eviction based on its cumulative attention usage:
- Over its lifetime $[t_i, t]$, aggregate the attention it receives from all queries and heads at every inference step:

$$a_i = \sum_{s = t_i}^{t} \sum_{h} \sum_{q} A^{(\ell, h)}_s[q, i].$$

- Normalize for early time-steps ("row-length normalization"): scale each step's contribution by the number of keys visible at that step, so tokens scored over short attention rows are not over-weighted.
- Normalize for token tenure (exposure normalization): letting $\Delta_i = t - t_i + 1$,

$$\tilde{a}_i = \frac{a_i}{\Delta_i}.$$

- Tokens from the initial frame and register/camera tokens are protected from eviction (their score is fixed at $+\infty$).
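A minimal sketch of this scoring, assuming the accumulated attention statistics are tracked per token (the array names and the exact normalization order are illustrative assumptions):

```python
import numpy as np

def token_scores(cum_attn, row_counts, tenures, protected):
    """Eviction score per cached token (sketch).
    cum_attn[i]:   attention mass received by token i, summed over heads
                   and queries since it entered the cache.
    row_counts[i]: number of attention rows that could see token i
                   (row-length normalization).
    tenures[i]:    steps token i has been cached (exposure normalization).
    protected[i]:  True for first-frame / register / camera tokens."""
    s = cum_attn / np.maximum(row_counts, 1)   # row-length normalization
    s = s / np.maximum(tenures, 1)             # exposure normalization
    return np.where(protected, np.inf, s)      # protected tokens never evicted
```

Lower scores mark less-informative tokens; the online eviction loop removes the minimum-scoring token first.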
C. Online Eviction Algorithm
On every new frame:
- For each global layer $\ell$, append the new frame's key/value tokens to the cache.
- While $\big|K^{(\ell)}\big| > B_\ell$, recompute $\tilde{a}_i$ for all cached tokens and evict the minimum-scoring token (protected tokens excluded).
- Repeat until no layer's cache exceeds its budget.
After eviction, the total model KV memory is tightly bounded by

$$\text{Mem} \le 2 \cdot B \cdot H \cdot d \ \text{ elements (keys and values)},$$

regardless of the streaming sequence duration $t$.
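The greedy loop above can be sketched in plain Python over a single layer's cache; protected tokens are represented by an infinite score so they are never selected as victims:

```python
def evict_to_budget(tokens, scores, budget):
    """Greedy online eviction (sketch): while the cache exceeds its
    budget, drop the minimum-scoring token. Protected tokens carry
    score = float('inf') and are therefore never chosen."""
    tokens, scores = list(tokens), list(scores)
    while len(tokens) > budget:
        victim = min(range(len(scores)), key=scores.__getitem__)
        tokens.pop(victim)
        scores.pop(victim)
    return tokens, scores
```

In practice the scores would be recomputed between evictions as described above; this sketch evicts against a fixed snapshot for brevity.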
4. Experimental Evaluation and Results
Benchmarks for Evict3R span 3D reconstruction (7-Scenes, NRGBD), video depth estimation (Sintel, KITTI), and pose estimation (Sintel, TUM-dynamics), with competitive comparisons to state-of-the-art approaches such as DUSt3R-GA, MASt3R-GA, MonST3R-GA, VGGT (offline), Spann3R, CUT3R, Point3R, and StreamVGGT.
Quantitative findings:
- At a budget of 10% of the baseline unbounded memory, 3D reconstruction memory drops from 9.75 GB → 7.68 GB (7-Scenes) and 9.35 GB → 7.57 GB (NRGBD) with negligible change in accuracy or completeness.
- In ultra-long test sequences, memory reduction is more pronounced: 18.63 GB → 9.39 GB (7-Scenes), while maintaining or slightly improving task metrics.
- For video depth and camera pose, accuracy degradation remains negligible until extremely tight budgets, at which point error increases sharply or inference fails due to over-eviction.
- Per-step inference latency overhead is minimal at moderate budgets (e.g., rising from 0.107 s to 0.131 s), with some latency advantage under very tight memory.
Experimental results consistently indicate that denser temporal sampling is enabled under memory constraints, improving geometric coverage while maintaining compute feasibility (Mahdi et al., 22 Sep 2025).
5. Qualitative Insights and Failure Modes
Visualizations of retained and evicted tokens reveal that Evict3R typically evicts background or low-texture tokens and preserves geometry-salient information (e.g., edges, corners) even under severe memory constraints. Qualitative assessments (3D point clouds, attention masks) show that, up to moderate eviction rates, reconstructions are visually indistinguishable from the unbounded baseline.
When the budget is pushed below this regime, over-pruning leads to degraded accuracy and incomplete reconstructions, indicating a sharp phase transition past which information loss becomes unrecoverable. No formal significance testing is reported, but low variance across test sequences is observed.
6. Complexity and Practical Integration
Evict3R's eviction routines incur additional per-token scoring overhead: for each eviction step, all token attention weights are processed over heads and queries. This is typically amortized, as eviction is triggered only when cache limits are exceeded. Under long-streaming conditions, overall per-frame latency remains near baseline.
The policy is training-free and requires no weight updates or model retraining, and only interacts with KV cache management. It is compatible with any causal-attention vision transformer that exposes per-head attention, but cannot be used with FlashAttention (which does not expose attention maps for external use in this implementation) (Mahdi et al., 22 Sep 2025).
7. Future Directions and Limitations
Evict3R achieves up to 10× memory reduction on a single GPU for streaming geometry transformers, while preserving accuracy for practical memory budgets. Key strengths include plug-and-play deployment, per-layer adaptive budgeting, and applicability to any inference-time streaming transformer.
Limitations:
- Additional inference compute for eviction scoring
- Sharp accuracy drop-off below extremely tight memory budgets
- Incompatibility with optimized attention kernels (FlashAttention) due to lack of accessible attention maps
Proposed directions include adaptive budgeting that evolves over time, hybrid schemes interleaving pointer memory with standard KV caches, hardware-accelerated eviction primitives, and extending to other long-context, vision, and multi-modal transformer architectures (Mahdi et al., 22 Sep 2025).
For algorithms targeting eviction-set finding in microarchitectural attack scenarios (CPU cache side-channels), foundational insights from threshold group testing and from TLB and adaptive-replacement effects are crucial (Vila et al., 2018). Integrating robust per-address testing, linear-time group reduction, and mitigations for TLB thrashing and cache adaptivity substantially improves both speed and reliability under strong constraints. When applied to cache eviction, these algorithmic ideas provide 5–20× practical speedups and sustained effectiveness, even with minimal physical address control (Vila et al., 2018).