
Evict3R: Efficient Token Eviction for Streaming VGGT

Updated 10 December 2025
  • Evict3R is a training-free token eviction policy that dynamically discards less informative tokens during inference, ensuring bounded memory use in streaming visual geometry transformers.
  • It employs adaptive per-layer budgeting and token importance scoring to preserve crucial geometric information while mitigating unbounded KV cache growth.
  • Experimental results show up to 10x memory savings with minimal accuracy loss in 3D reconstruction, depth estimation, and visual odometry benchmarks.

Evict3R is an inference-time, training-free token eviction policy designed for memory-bounded streaming visual geometry transformers, specifically targeting models such as StreamVGGT that perform long-horizon causal attention over constantly growing key/value (KV) token caches. Evict3R minimizes memory usage by dynamically discarding the least-informative tokens while preserving those most relevant for downstream tasks. This approach enables streaming transformers to process substantially longer sequences within fixed GPU memory constraints, with negligible to minimal loss in accuracy across 3D reconstruction, depth estimation, and visual odometry benchmarks (Mahdi et al., 22 Sep 2025).

1. Rationale and Problem Formulation

Streaming visual geometry transformers, exemplified by StreamVGGT, cache KV token embeddings from all prior frames and perform causal, global attention for frame-by-frame 3D scene reasoning. The internal KV cache grows linearly with the time step $T$ as

$$\text{Mem}_{\text{KV}} = T \times M \times L_g \times H \times 2d$$

where $M$ is the number of tokens per frame, $L_g$ is the number of global-attention layers, $H$ is the number of attention heads per layer, and $d$ is the per-head KV dimension (the factor of 2 counts keys and values). Without intervention, the cache grows without bound, rapidly exceeding GPU memory for long sequences and limiting online deployment.
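The linear growth is easy to see numerically. The sketch below evaluates the formula above for illustrative (assumed) model sizes; none of these dimensions are taken from the paper:

```python
# Hypothetical sketch: KV-cache growth for a StreamVGGT-style model.
# All sizes below are illustrative assumptions, not the paper's configuration.

def kv_cache_bytes(T, M, L_g, H, d, bytes_per_elem=2):
    """Mem_KV = T * M * L_g * H * 2d elements (keys + values), fp16 by default."""
    return T * M * L_g * H * 2 * d * bytes_per_elem

# Example: 100 frames, 1024 tokens/frame, 12 global layers, 16 heads, d=64/head.
gb = kv_cache_bytes(T=100, M=1024, L_g=12, H=16, d=64) / 2**30
print(f"{gb:.2f} GiB")  # grows linearly with the number of frames T
```

Doubling the number of streamed frames doubles the cache, which is exactly the unbounded growth Evict3R is designed to cap.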

Existing architectural interventions (e.g., explicit memory bottlenecks or learnable compressive memory modules) require retraining and may not generalize without accuracy degradation. Evict3R addresses this by introducing an online, training-free, inference-time policy: it selectively evicts cached tokens on-the-fly, bounding memory and maintaining accuracy (Mahdi et al., 22 Sep 2025).

2. Formal Specification

At each global attention layer $\ell$, the model maintains KV caches over $t$ frames:

$$K_{1:t}^{(\ell)} \in \mathbb{R}^{tM \times d}, \quad V_{1:t}^{(\ell)} \in \mathbb{R}^{tM \times d}.$$

Causal global attention at frame $t$ is

$$Z_t^{(\ell+1)} = \mathcal{G}_{\triangleleft}\big(\widetilde{Z}_t^{(\ell)}; K_{1:t}^{(\ell)}, V_{1:t}^{(\ell)}\big),$$

after which new tokens are appended. The memory objective is to enforce, for each $\ell$,

$$|K_{1:t}^{(\ell)}| \leq B_\ell, \quad B = \sum_{\ell=1}^{L_g} B_\ell,$$

where $B$ is the total cross-layer cache budget. The core algorithm must choose which tokens to evict so that memory never exceeds $B$ while task accuracy is maximally preserved.
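The streaming attention step above can be sketched with toy tensors. Shapes and values here are illustrative assumptions, and $\mathcal{G}_{\triangleleft}$ is stood in for by plain scaled dot-product attention over the appended cache (causal by construction, since future frames are never cached):

```python
import numpy as np

# Minimal sketch (assumed toy shapes, single layer/head) of streaming causal
# attention over a growing KV cache, as described in Section 2.
rng = np.random.default_rng(0)
M, d = 4, 8                                     # tokens per frame, KV dim (toy)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

for t in range(3):                              # three streaming frames
    K_t, V_t = rng.normal(size=(M, d)), rng.normal(size=(M, d))
    K_cache = np.vstack([K_cache, K_t])         # append frame t's tokens
    V_cache = np.vstack([V_cache, V_t])
    Q_t = rng.normal(size=(M, d))               # queries for the new frame
    A = softmax(Q_t @ K_cache.T / np.sqrt(d))   # (M, t*M + M) attention weights
    Z_t = A @ V_cache                           # attended output for frame t
    print(t, K_cache.shape)                     # cache grows linearly with t
```

Without eviction, `K_cache`/`V_cache` keep `t * M` rows per layer, which is precisely the quantity Evict3R bounds by $B_\ell$.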

3. Eviction Mechanisms and Scoring

Evict3R comprises three key algorithmic components:

A. Per-Layer Budget Allocation

Layer-wise cache budgets are determined dynamically via attention sparsity statistics:

  • For each global layer $\ell$ at inference step $t$, compute the head-averaged attention matrix $A_t^{(\ell)} \in \mathbb{R}^{M \times M'_t}$, where $M'_t$ is the current cache size.
  • Summing over the query dimension yields per-token cumulative scores $S_t^{(\ell)}$.
  • Define the layer sparsity metric:

$$\sigma_\ell = -\mathrm{Var}\big(S_t^{(\ell)}\big),$$

which quantifies how sharply attention is focused over the current cache: layers with flatter (lower-variance) attention receive larger budgets.

  • Distribute the global budget $B$ proportionally, using a softmax-like weighting with temperature $\tau$:

$$\pi_\ell = \frac{\exp(\sigma_\ell/\tau)}{\sum_{r=1}^{L_g} \exp(\sigma_r/\tau)}, \quad B_\ell = \lfloor B \pi_\ell \rfloor$$
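The allocation rule above can be sketched directly. The function name and toy score vectors below are illustrative assumptions; only the formulas ($\sigma_\ell$, $\pi_\ell$, $B_\ell$) come from the text:

```python
import numpy as np

# Sketch of per-layer budget allocation (Section 3A) with assumed toy inputs.

def allocate_budgets(S_per_layer, B, tau=1.0):
    """S_per_layer: one array of per-token cumulative scores S_t^(l) per layer."""
    sigma = np.array([-np.var(S) for S in S_per_layer])  # sparsity metric
    pi = np.exp(sigma / tau)
    pi /= pi.sum()                                       # softmax weighting
    return np.floor(B * pi).astype(int)                  # B_l = floor(B * pi_l)

# Layer 0 has sharply focused attention (high variance), layer 1 is nearly flat;
# the flatter layer receives the larger share of the budget B.
S = [np.array([0.9, 0.05, 0.05]), np.array([0.34, 0.33, 0.33])]
print(allocate_budgets(S, B=1000))
```

Note that the floor makes $\sum_\ell B_\ell \leq B$, so the global budget is never exceeded.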

B. Token Importance Scoring

Each token $j$ in layer $\ell$ is scored for eviction based on its cumulative attention usage:

  • Over its lifetime $\mathcal{T}_j$, aggregate the summed attention from all queries and heads at every inference step:

$$c_j^{(\ell)} = \sum_{t \in \mathcal{T}_j} \sum_{h=1}^{H} \sum_{q=1}^{M} a_{t,h,q \to j}^{(\ell)}$$

  • Normalize for early time-steps (“row-length normalization”), where $N_t^{(\ell)}$ denotes the attention-row length at step $t$:

$$\hat{c}_j^{(\ell)} = \sum_{t \in \mathcal{T}_j} \frac{1}{N_t^{(\ell)}} \sum_{h=1}^{H} \sum_{q=1}^{M} a_{t,h,q \to j}^{(\ell)}$$

  • Normalize for token tenure (exposure normalization): letting $e_j^{(\ell)} = |\mathcal{T}_j|$,

$$i_j^{(\ell)} = \frac{\hat{c}_j^{(\ell)}}{e_j^{(\ell)}}$$

  • Tokens from the initial frame and register/camera tokens are protected from eviction ($i_j^{(\ell)} = +\infty$).
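The scoring chain above (accumulate, row-length-normalize, exposure-normalize) can be sketched for a single token. The helper name and toy tensors are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Sketch of token importance scoring (Section 3B) with assumed toy tensors.

def importance(rows_j, N_per_step, tenure):
    """Importance i_j for one cached token j.

    rows_j: one (H, M) array per step t in T_j, holding a_{t,h,q->j};
    N_per_step: row lengths N_t^(l), for row-length normalization;
    tenure: e_j = |T_j|, for exposure normalization.
    """
    c_hat = sum(a.sum() / N for a, N in zip(rows_j, N_per_step))
    return c_hat / tenure                       # i_j = c_hat_j / e_j

H, M = 2, 3
rng = np.random.default_rng(1)
rows = [rng.random((H, M)) for _ in range(4)]   # token j attended over 4 steps
print(importance(rows, N_per_step=[3, 6, 9, 12], tenure=4))
```

The two normalizations keep early-arriving, long-lived tokens from being favored merely because they accumulated attention over more (and shorter) rows.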

C. Online Eviction Algorithm

On every new frame:

  • For each global layer $\ell$, append the $M$ new tokens.
  • While $|K^{(\ell)}| > B_\ell$, recompute $i_j^{(\ell)}$ for all cached tokens and evict the minimum-scoring token (excluding protected tokens).
  • Repeat until the cache budget $B_\ell$ is no longer exceeded.

After eviction, the total model KV memory is tightly bounded by

$$\sum_{\ell=1}^{L_g} B_\ell \cdot 2d,$$

regardless of the streaming sequence duration $T$.
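The per-frame eviction loop can be sketched as follows. The data structure and function names are illustrative assumptions; also note the paper recomputes $i_j^{(\ell)}$ at each eviction step, whereas this toy version treats the scores as fixed:

```python
# Sketch of the online eviction loop (Section 3C), assuming a cache of
# (score, protected) entries; names here are illustrative, not the paper's API.

def evict_to_budget(cache, budget):
    """cache: list of dicts {'score': float, 'protected': bool}."""
    while len(cache) > budget:
        # The lowest-importance unprotected token is evicted first.
        victims = [tok for tok in cache if not tok['protected']]
        if not victims:
            break                               # only protected tokens remain
        cache.remove(min(victims, key=lambda tok: tok['score']))
    return cache

cache = [{'score': s, 'protected': p}
         for s, p in [(0.1, True), (0.7, False), (0.05, False), (0.9, False)]]
print([t['score'] for t in evict_to_budget(cache, budget=2)])  # → [0.1, 0.9]
```

The protected first-frame entry survives despite its low score, while the two lowest-scoring unprotected tokens are dropped to meet the budget.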

4. Experimental Evaluation and Results

Benchmarks for Evict3R span 3D reconstruction (7-Scenes, NRGBD), video depth estimation (Sintel, KITTI), and pose estimation (Sintel, TUM-dynamics), with competitive comparisons to state-of-the-art approaches such as DUSt3R-GA, MASt3R-GA, MonST3R-GA, VGGT (offline), Spann3R, CUT3R, Point3R, and StreamVGGT.

Quantitative findings:

  • For $B = 0.1$ (10% of the baseline unbounded memory), 3D reconstruction memory drops from 9.75 GB → 7.68 GB (7-Scenes) and 9.35 GB → 7.57 GB (NRGBD), with $< 0.01$ change in accuracy or completeness.
  • In ultra-long sequences ($8\times$/$10\times$ test cases), the memory reduction is more pronounced: 18.63 GB → 9.39 GB (7-Scenes), while maintaining or slightly improving task metrics.
  • For video depth and camera pose, accuracy degradation remains $\leq 0.005$ until extremely tight budgets ($B \leq 0.1$), at which point error increases sharply or inference fails due to over-eviction.
  • Per-step inference latency overhead is minimal at moderate budgets (e.g., rising from 0.107 s to 0.131 s for $1\times$ sequences), with some latency advantage under very tight memory.

Experimental results consistently indicate that denser temporal sampling is enabled under memory constraints, improving geometric coverage while maintaining compute feasibility (Mahdi et al., 22 Sep 2025).

5. Qualitative Insights and Failure Modes

Visualizations of retained and evicted tokens reveal that Evict3R typically evicts background or low-texture tokens and preserves geometry-salient information (e.g., edges, corners) even under severe memory constraints. Qualitative assessments (3D point clouds, attention masks) show that, up to moderate eviction rates ($B \geq 0.2$), reconstructions are visually indistinguishable from the unbounded baseline.

When $B \ll 0.1$, over-pruning leads to degraded accuracy and incomplete reconstructions, indicating a sharp phase transition past which information loss becomes unrecoverable. No formal significance testing is reported, but low variance across test sequences is observed.

6. Complexity and Practical Integration

Evict3R's eviction routines incur additional per-token scoring overhead: for each eviction step, all token attention weights are processed over $H$ heads and $M$ queries. This is typically amortized, as eviction is triggered only when cache limits are exceeded. Under long-streaming conditions, overall per-frame latency remains near baseline.

The policy is training-free and requires no weight updates or model retraining, and only interacts with KV cache management. It is compatible with any causal-attention vision transformer that exposes per-head attention, but cannot be used with FlashAttention (which does not expose attention maps for external use in this implementation) (Mahdi et al., 22 Sep 2025).

7. Future Directions and Limitations

Evict3R achieves up to $2\times$–$10\times$ memory reduction on a single GPU for streaming geometry transformers, while preserving accuracy for practical memory budgets. Key strengths include plug-and-play deployment, per-layer adaptive budgeting, and applicability to any inference-time streaming transformer.

Limitations:

  • Additional inference compute for eviction scoring
  • A sharp accuracy drop-off once $B \lesssim 0.01$
  • Incompatibility with optimized attention kernels (FlashAttention) due to lack of accessible attention maps

Proposed directions include adaptive budgeting that evolves over time, hybrid schemes interleaving pointer memory with standard KV caches, hardware-accelerated eviction primitives, and extending to other long-context, vision, and multi-modal transformer architectures (Mahdi et al., 22 Sep 2025).


For algorithms targeting eviction-set finding in microarchitectural attack scenarios (CPU cache side-channels, a distinct use of cache eviction from the streaming-transformer setting above), foundational insights from threshold group testing and TLB/adaptive-replacement effects are crucial (Vila et al., 2018). Integrating robust per-address testing, linear-time group reduction, and mitigations for TLB thrashing and cache adaptivity substantially improves both speed and reliability under strong constraints. When applied to cache eviction, these algorithmic ideas provide 5–20$\times$ practical speedups and sustained effectiveness, even with minimal physical address control (Vila et al., 2018).
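The linear-time group reduction mentioned above can be sketched abstractly. The oracle interface `evicts(candidates)` is a hypothetical stand-in for a timing-based eviction test, and the toy oracle below is purely illustrative:

```python
# Sketch of group reduction for eviction-set minimization (in the spirit of
# Vila et al., 2018), assuming a black-box oracle `evicts(candidates)` that
# reports whether the candidate set still evicts the target (hypothetical API).

def reduce_eviction_set(candidates, evicts, a):
    """Shrink a working eviction set toward a minimal one of associativity a."""
    while len(candidates) > a:
        # Split into a+1 groups; by pigeonhole, at least one group contains no
        # essential address and can be discarded while eviction still succeeds.
        groups = [candidates[i::a + 1] for i in range(a + 1)]
        for g in groups:
            rest = [x for x in candidates if x not in g]
            if evicts(rest):
                candidates = rest
                break
        else:
            break                      # no removable group found; stop
    return candidates

# Toy oracle: the target is evicted iff all of {0, 1, 2, 3} are present.
core = {0, 1, 2, 3}
print(reduce_eviction_set(list(range(20)), lambda s: core <= set(s), a=4))
```

Each round discards roughly a $1/(a+1)$ fraction of candidates with $a+1$ oracle calls, which is what yields the overall linear number of tests.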
