Evict3R: Efficient Token Eviction for Streaming VGGT
- Evict3R is a training-free token eviction policy that dynamically discards less informative tokens during inference, ensuring bounded memory use in streaming visual geometry transformers.
- It employs adaptive per-layer budgeting and token importance scoring to preserve crucial geometric information while mitigating unbounded KV cache growth.
- Experimental results show up to 10x memory savings with minimal accuracy loss in 3D reconstruction, depth estimation, and visual odometry benchmarks.
Evict3R is an inference-time, training-free token eviction policy for memory-bounded streaming visual geometry transformers, specifically targeting models such as StreamVGGT that perform long-horizon causal attention over constantly growing key/value (KV) token caches. Evict3R bounds memory usage by dynamically discarding the least-informative tokens while preserving those most relevant for downstream tasks. This enables streaming transformers to process substantially longer sequences within fixed GPU memory constraints, with negligible loss in accuracy across 3D reconstruction, depth estimation, and visual odometry benchmarks (Mahdi et al., 22 Sep 2025).
1. Rationale and Problem Formulation
Streaming visual geometry transformers, exemplified by StreamVGGT, cache KV token embeddings from all prior frames and perform causal, global attention for frame-by-frame 3D scene reasoning. The internal KV cache grows linearly with the time-step $t$:

$$\text{Mem}(t) \propto t \cdot N \cdot L \cdot H \cdot d,$$

where $N$ is the number of tokens per frame, $L$ is the number of global-attention layers, $H$ is the number of attention heads per layer, and $d$ is the per-head KV dimension. Without intervention, the cache grows without bound, rapidly exceeding GPU memory for long sequences and limiting online deployment.
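As a rough illustration of this linear growth, the K and V cache footprint can be estimated with a one-line calculation (the parameter values below are illustrative, not StreamVGGT's actual configuration):

```python
def kv_cache_bytes(t, tokens_per_frame, num_layers, num_heads, head_dim,
                   bytes_per_elem=2):
    """Estimate KV cache size after t streamed frames.
    Factor 2 accounts for storing both keys and values."""
    return (2 * t * tokens_per_frame * num_layers * num_heads
            * head_dim * bytes_per_elem)

# Illustrative configuration: 100 frames, 1024 tokens/frame,
# 24 global layers, 16 heads, head dim 64, fp16 storage.
gb = kv_cache_bytes(100, 1024, 24, 16, 64) / 2**30  # = 9.375 GiB
```

Doubling the number of streamed frames doubles the footprint, which is exactly the unbounded growth Evict3R is designed to cap.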
Existing architectural interventions (e.g., explicit memory bottlenecks or learnable compressive memory modules) require retraining and may not generalize without accuracy degradation. Evict3R addresses this by introducing an online, training-free, inference-time policy: it selectively evicts cached tokens on-the-fly, bounding memory and maintaining accuracy (Mahdi et al., 22 Sep 2025).
2. Formal Specification
At each global attention layer $\ell$, the model maintains KV caches over frames $1, \dots, t$:

$$K^{(\ell)}_{1:t} = \big[k^{(\ell)}_1; \dots; k^{(\ell)}_t\big], \qquad V^{(\ell)}_{1:t} = \big[v^{(\ell)}_1; \dots; v^{(\ell)}_t\big].$$

Causal global attention at frame $t$ is

$$O^{(\ell)}_t = \mathrm{softmax}\!\left(\frac{Q^{(\ell)}_t \big(K^{(\ell)}_{1:t}\big)^{\top}}{\sqrt{d}}\right) V^{(\ell)}_{1:t},$$

after which the new tokens $\big(k^{(\ell)}_t, v^{(\ell)}_t\big)$ are appended to the cache. The memory objective is to enforce, for each layer $\ell$,

$$\big|K^{(\ell)}\big| = \big|V^{(\ell)}\big| \le B_\ell, \qquad \sum_{\ell} B_\ell = B,$$

where $B$ is the total cross-layer cache budget. The core algorithm must choose which tokens to evict so that memory never exceeds $B$ while task accuracy is maximally preserved.
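The streaming attention step above can be sketched as follows (a single-head NumPy sketch; shapes and names are illustrative and do not reflect the paper's implementation):

```python
import numpy as np

def causal_attention_step(q_t, K_cache, V_cache, k_t, v_t):
    """One streaming attention step: attend the new frame's queries over
    the cached plus new KV tokens, then return the grown cache.
    Single-head sketch; q_t: [Q, d], caches: [T, d], new tokens: [N, d]."""
    K = np.concatenate([K_cache, k_t], axis=0)
    V = np.concatenate([V_cache, v_t], axis=0)
    d = q_t.shape[-1]
    logits = q_t @ K.T / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    out = A @ V
    return out, K, V, A
```

The returned attention matrix `A` is exactly the statistic that the budgeting and scoring mechanisms of Section 3 consume.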
3. Eviction Mechanisms and Scoring
Evict3R comprises three key algorithmic components:
A. Per-Layer Budget Allocation
Layer-wise cache budgets are determined dynamically from attention sparsity statistics:
- For each global layer $\ell$ at inference step $t$, compute the head-averaged attention matrix $\bar{A}^{(\ell)}_t$.
- Summing over the query dimension yields per-token cumulative scores $s^{(\ell)}_i = \sum_{q} \bar{A}^{(\ell)}_t[q, i]$.
- Define a layer sparsity metric $\sigma_\ell$ from these scores, quantifying how sharply attention is focused over the current cache.
- Distribute the global budget proportionally, using a softmax-like weighting with temperature $\tau$:

$$B_\ell = B \cdot \frac{\exp(\sigma_\ell / \tau)}{\sum_{\ell'} \exp(\sigma_{\ell'} / \tau)}.$$
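The allocation above can be sketched as follows. The sparsity metric here is taken as one minus the normalized entropy of the per-token scores, which is an assumed instantiation for illustration, not necessarily the paper's exact definition:

```python
import numpy as np

def layer_budgets(attn_maps, total_budget, temperature=1.0):
    """Allocate per-layer cache budgets from attention sparsity (sketch).
    attn_maps: list of head-averaged attention matrices, one per layer,
    each of shape [num_queries, cache_size]."""
    sparsities = []
    for A in attn_maps:
        s = A.sum(axis=0)                       # per-token cumulative scores
        p = s / s.sum()
        # Normalized entropy in [0, 1]; sharply focused attention -> low entropy.
        ent = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
        sparsities.append(1.0 - ent)
    # Softmax-like weighting with temperature over layer sparsities.
    w = np.exp(np.array(sparsities) / temperature)
    w /= w.sum()
    return np.floor(total_budget * w).astype(int)
```

A layer whose attention concentrates on few tokens (high sparsity) receives a larger share of the global budget, matching the intent of the proportional allocation rule.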
B. Token Importance Scoring
Each token $i$ in layer $\ell$ is scored for eviction based on its cumulative attention usage:
- Over its lifetime $[t_i, t]$, aggregate the attention it receives from all queries and heads at every inference step:

$$a_i = \sum_{s = t_i}^{t} \sum_{h} \sum_{q} A^{(\ell, h)}_s[q, i].$$

- Normalize for early time-steps ("row-length normalization"): scale each step's contribution by the number of keys visible at that step, so tokens scored over short attention rows are not over-weighted.
- Normalize for token tenure (exposure normalization): letting $\Delta_i = t - t_i + 1$,

$$\tilde{a}_i = \frac{a_i}{\Delta_i}.$$

- Tokens from the initial frame and register/camera tokens are protected from eviction (their score is fixed at $+\infty$).
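A minimal sketch of this scoring, assuming the accumulated attention statistics are tracked per token (the array names and the exact normalization order are illustrative assumptions):

```python
import numpy as np

def token_scores(cum_attn, row_counts, tenures, protected):
    """Eviction score per cached token (sketch).
    cum_attn[i]:   attention mass received by token i, summed over heads
                   and queries since it entered the cache.
    row_counts[i]: number of attention rows that could see token i
                   (row-length normalization).
    tenures[i]:    steps token i has been cached (exposure normalization).
    protected[i]:  True for first-frame / register / camera tokens."""
    s = cum_attn / np.maximum(row_counts, 1)   # row-length normalization
    s = s / np.maximum(tenures, 1)             # exposure normalization
    return np.where(protected, np.inf, s)      # protected tokens never evicted
```

Lower scores mark less-informative tokens; the online eviction loop removes the minimum-scoring token first.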
C. Online Eviction Algorithm
On every new frame:
- For each global layer $\ell$, append the new frame's key/value tokens to the cache.
- While $\big|K^{(\ell)}\big| > B_\ell$, recompute $\tilde{a}_i$ for all cached tokens and evict the minimum-scoring token (protected tokens excluded).
- Repeat until no layer's cache exceeds its budget.
After eviction, the total model KV memory is tightly bounded by

$$\text{Mem} \le 2 \cdot B \cdot H \cdot d \ \text{ elements (keys and values)},$$

regardless of the streaming sequence duration $t$.
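The greedy loop above can be sketched in plain Python over a single layer's cache; protected tokens are represented by an infinite score so they are never selected as victims:

```python
def evict_to_budget(tokens, scores, budget):
    """Greedy online eviction (sketch): while the cache exceeds its
    budget, drop the minimum-scoring token. Protected tokens carry
    score = float('inf') and are therefore never chosen."""
    tokens, scores = list(tokens), list(scores)
    while len(tokens) > budget:
        victim = min(range(len(scores)), key=scores.__getitem__)
        tokens.pop(victim)
        scores.pop(victim)
    return tokens, scores
```

In practice the scores would be recomputed between evictions as described above; this sketch evicts against a fixed snapshot for brevity.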
4. Experimental Evaluation and Results
Benchmarks for Evict3R span 3D reconstruction (7-Scenes, NRGBD), video depth estimation (Sintel, KITTI), and pose estimation (Sintel, TUM-dynamics), with competitive comparisons to state-of-the-art approaches such as DUSt3R-GA, MASt3R-GA, MonST3R-GA, VGGT (offline), Spann3R, CUT3R, Point3R, and StreamVGGT.
Quantitative findings:
- At a budget of 10% of the baseline unbounded memory, 3D reconstruction memory drops from 9.75 GB → 7.68 GB (7-Scenes) and 9.35 GB → 7.57 GB (NRGBD) with negligible change in accuracy or completeness.
- In ultra-long test sequences, memory reduction is more pronounced: 18.63 GB → 9.39 GB (7-Scenes), while maintaining or slightly improving task metrics.
- For video depth and camera pose, accuracy degradation remains negligible until extremely tight budgets, at which point error increases sharply or inference fails due to over-eviction.
- Per-step inference latency overhead is minimal at moderate budgets (e.g., rising from 0.107 s to 0.131 s), with some latency advantage under very tight memory.
Experimental results consistently indicate that denser temporal sampling is enabled under memory constraints, improving geometric coverage while maintaining compute feasibility (Mahdi et al., 22 Sep 2025).
5. Qualitative Insights and Failure Modes
Visualizations of retained and evicted tokens reveal that Evict3R typically evicts background or low-texture tokens and preserves geometry-salient information (e.g., edges, corners) even under severe memory constraints. Qualitative assessments (3D point clouds, attention masks) show that, up to moderate eviction rates, reconstructions are visually indistinguishable from the unbounded baseline.
When the budget is pushed below this regime, over-pruning leads to degraded accuracy and incomplete reconstructions, indicating a sharp phase transition past which information loss becomes unrecoverable. No formal significance testing is reported, but low variance across test sequences is observed.
6. Complexity and Practical Integration
Evict3R's eviction routines incur additional per-token scoring overhead: for each eviction step, all token attention weights are processed over heads and queries. This is typically amortized, as eviction is triggered only when cache limits are exceeded. Under long-streaming conditions, overall per-frame latency remains near baseline.
The policy is training-free and requires no weight updates or model retraining, and only interacts with KV cache management. It is compatible with any causal-attention vision transformer that exposes per-head attention, but cannot be used with FlashAttention (which does not expose attention maps for external use in this implementation) (Mahdi et al., 22 Sep 2025).
7. Future Directions and Limitations
Evict3R achieves up to 10× memory reduction on a single GPU for streaming geometry transformers, while preserving accuracy for practical memory budgets. Key strengths include plug-and-play deployment, per-layer adaptive budgeting, and applicability to any inference-time streaming transformer.
Limitations:
- Additional inference compute for eviction scoring
- Sharp accuracy drop-off below extremely tight memory budgets
- Incompatibility with optimized attention kernels (FlashAttention) due to lack of accessible attention maps
Proposed directions include adaptive budgeting that evolves over time, hybrid schemes interleaving pointer memory with standard KV caches, hardware-accelerated eviction primitives, and extending to other long-context, vision, and multi-modal transformer architectures (Mahdi et al., 22 Sep 2025).
For algorithms targeting eviction-set finding in microarchitectural attack scenarios (CPU cache side-channels), foundational insights from threshold group testing and from TLB and adaptive-replacement effects are crucial (Vila et al., 2018). Integrating robust per-address testing, linear-time group reduction, and mitigations for TLB thrashing and cache adaptivity substantially improves both speed and reliability under strong constraints. When applied to cache eviction, these algorithmic ideas provide 5–20× practical speedups and sustained effectiveness, even with minimal physical address control (Vila et al., 2018).