HTTM: Head-Wise Temporal Token Merging
- The paper introduces HTTM, a training-free method that independently merges tokens per self-attention head, achieving up to 7× acceleration while preserving 3D reconstruction accuracy.
- HTTM leverages spatio-temporal block-wise similarity and adaptive outlier filtering to significantly reduce computational cost without compromising quality.
- Empirical results on benchmarks like 7Scenes and NRGBD demonstrate that HTTM efficiently scales global attention for long-sequence inputs in VGGT models.
Head-Wise Temporal Token Merging (HTTM) is a training-free method for accelerating global attention in the Visual Geometry Grounded Transformer (VGGT), centered on the 3D reconstruction of large scenes from multi-view or long-sequence inputs. HTTM performs token merging independently for each self-attention head, exploiting spatial–temporal patterns and redundancy within head-wise blocks, thereby circumventing the limitations of prior uniform or sparsity-based merging approaches. This yields a significant reduction in computational cost (up to 7× acceleration) with minimal degradation in 3D reconstruction accuracy in benchmarks such as 7Scenes and NRGBD (Wang et al., 26 Nov 2025).
1. Motivation and Rationale
VGGT’s global attention layers process all $N$ tokens from all input views at once, so $N$ grows with both scene size and sequence length, leading to $O(N^2)$ time and memory complexity per layer. For lengthy sequences, this step becomes an inference bottleneck. Previous token merging techniques either merge tokens identically across all attention heads—thereby collapsing head-specific information and impairing representational power—or use uniform sparsity that fails to leverage VGGT’s inherent spatio-temporal redundancy. HTTM addresses both problems by merging tokens independently in each attention head (“head-wise”) and grouping tokens into small spatio-temporal blocks for block-wise similarity computation, enabling high merge ratios at reduced computational cost.
2. Formal Definition and Algorithm
Let $X \in \mathbb{R}^{N \times d}$ be the input token sequence, $d$ the embedding dimension, and $h$ the number of attention heads ($d_{\text{head}} = d/h$). After per-head projections, tokens are represented as queries, keys, and values:

$$Q^{(i)}, K^{(i)}, V^{(i)} \in \mathbb{R}^{N \times d_{\text{head}}}, \qquad i = 1, \dots, h.$$
2.1 Head-wise Temporal Merging Modules
For each head $i$, two merging functions are defined:

$$\mathcal{M}_q^{(i)}: Q^{(i)} \mapsto \tilde{Q}^{(i)}, \qquad \mathcal{M}_{kv}^{(i)}: \left(K^{(i)}, V^{(i)}\right) \mapsto \left(\tilde{K}^{(i)}, \tilde{V}^{(i)}\right),$$

which reduce the effective token count from $N$ to $N - r_i$ per head, where $r_i$ is the number of merged source tokens. Values are merged using the same index assignments as keys to preserve key–value consistency.
2.2 Spatio-temporal Block-wise Similarity and Merging
HTTM reorders tokens into non-overlapping spatio-temporal blocks of size $n_b$. Within each block $B$ for head $i$ ($|B| = n_b$), tokens are partitioned into destination (size $n_b/4$) and source (size $3n_b/4$) sets, $B = S \cup D$. The cosine similarity matrix within each block is:

$$\mathrm{Sim}_{p,q} = \frac{k_p^\top k_q}{\|k_p\|\,\|k_q\|}, \qquad s_p \in S,\ d_q \in D,$$
with $\ell_p = \arg\max_q \mathrm{Sim}_{p,q}$ as the best matching index and $m_p = \mathrm{Sim}_{p,\ell_p}$ as the matching score for each source token $s_p$.
Globally, the $r_i$ source tokens with the highest scores $m_p$ are merged into their matched destinations. For each destination token $d_q$ receiving merges, the new merged token is the mean:

$$\tilde{d}_q = \frac{1}{|\mathcal{C}_q|} \sum_{x \in \mathcal{C}_q} x, \qquad \mathcal{C}_q = \{d_q\} \cup \{s_p : \ell_p = q,\ s_p \text{ selected}\}.$$
These form the reduced per-head projected queries, keys, and values: $\tilde{Q}^{(i)}$, $\tilde{K}^{(i)}$, and $\tilde{V}^{(i)}$.
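The block-wise matching and mean-merge above can be sketched in NumPy for one head's keys (the same index assignments would then be applied to the values). `merge_block`, the 75/25 split default, and the running-mean update are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def merge_block(K_blk, r, dst_frac=0.25):
    """Merge the r most confidently matched source tokens of one block
    into their best-matching destination tokens (illustrative sketch)."""
    n_b, d = K_blk.shape
    n_dst = max(1, int(n_b * dst_frac))
    D, S = K_blk[:n_dst], K_blk[n_dst:]          # destination / source split

    # Cosine similarity between every source and every destination token.
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = Sn @ Dn.T                              # (n_src, n_dst)

    match = sim.argmax(axis=1)                   # best destination per source
    score = sim.max(axis=1)                      # matching confidence m_p
    keep_src = np.argsort(score)[:-r] if r else np.arange(len(S))
    merged_src = np.argsort(score)[-r:] if r else np.array([], dtype=int)

    # Mean-merge each destination with the sources assigned to it.
    merged = D.copy()
    counts = np.ones(n_dst)
    for p in merged_src:
        q = match[p]
        merged[q] = (merged[q] * counts[q] + S[p]) / (counts[q] + 1)
        counts[q] += 1
    return np.concatenate([merged, S[keep_src]], axis=0)
```

With `r` merged sources, the block shrinks from `n_b` to `n_b - r` tokens, matching the per-head reduction described above.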
2.3 Merged Attention and Unmerging
Attention is computed as:

$$\tilde{O}^{(i)} = \mathrm{softmax}\!\left(\frac{\tilde{Q}^{(i)} \tilde{K}^{(i)\top}}{\sqrt{d_{\text{head}}}}\right) \tilde{V}^{(i)}.$$
The final “unmerge” step recovers per-head outputs $O^{(i)} \in \mathbb{R}^{N \times d_{\text{head}}}$ by mapping each original token to the output of its merged cluster.
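The merged attention plus unmerge step amounts to gathering each token's cluster output. A minimal sketch, assuming `cluster_of` maps each of the $N$ original tokens to its merged index (names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def merged_attention_unmerge(Qm, Km, Vm, cluster_of):
    """Attention over the merged sequence, then 'unmerge' by copying each
    cluster's output back to every original token in that cluster."""
    d_head = Qm.shape[-1]
    A = softmax(Qm @ Km.T / np.sqrt(d_head))   # (N~, N~) attention weights
    Om = A @ Vm                                # (N~, d_head) merged outputs
    return Om[cluster_of]                      # (N, d_head) per-token outputs
```

Because unmerging is a pure gather, tokens merged into the same cluster receive identical outputs, which is the source of HTTM's approximation error.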
2.4 Adaptive Outlier Filtering (Optional)
The L2 distance between each original token and its merged prototype is calculated, and a fraction of tokens with the largest deviation are designated “outliers” and excluded from merging. This step is critical for maintaining quality in blocks with low redundancy.
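A sketch of this filter, assuming the same `cluster_of` token-to-prototype mapping as above (`outlier_mask` and its signature are hypothetical):

```python
import numpy as np

def outlier_mask(X, prototypes, cluster_of, frac=0.10):
    """Mark the `frac` of tokens farthest (L2) from their merged prototype
    as outliers; flagged tokens are kept unmerged."""
    dev = np.linalg.norm(X - prototypes[cluster_of], axis=1)
    k = max(1, int(round(len(X) * frac)))
    thresh = np.partition(dev, -k)[-k]         # k-th largest deviation
    return dev >= thresh                       # True = outlier
```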
3. Pseudocode
The main steps can be structured as follows:
```
Q, K, V = LinearProject(X)                          # h x N x d_head
order = ComputeSpatioTemporalOrder(N, n_b)
Q, K, V = Q[:, :, order], K[:, :, order], V[:, :, order]

for i in range(h):
    blocks = SplitIntoBlocks(Q[i], K[i], V[i], n_b)
    for block in blocks:
        S, D = PartitionSrcDst(block, alpha_i)      # e.g. 75% src, 25% dst
        Sim = RowNorm(S) @ RowNorm(D).T
        for p in range(len(S)):
            l_p = argmax(Sim[p])                    # best destination index
            m_p = Sim[p, l_p]                       # matching score
    S_sel = TopR(all m_p, r_i)                      # top-r_i sources to merge
    for q in D:
        C_q = {d_q} ∪ {s_p : s_p merged into d_q}
        tilde_d_q = mean(C_q)
    tilde_Q[i], tilde_K[i], tilde_V[i] = all tilde_d_q

for i in range(h):
    A = softmax(tilde_Q[i] @ tilde_K[i].T / sqrt(d_head))
    tilde_O[i] = A @ tilde_V[i]
    for n in range(N):
        q = cluster where n was merged              # unmerge mapping
        O_n[i] = tilde_O[i][q]

O = Concat_heads(O^(1), ..., O^(h))
O = O[inverse_order]
return O
```
4. Complexity and Computational Tradeoffs
The original global attention cost is $O(N^2 d)$. For HTTM, the dominant costs shift to:
- Block-wise similarity computation: $O(N n_b d)$ across heads.
- Attention on the merged sequence ($\tilde{N} < N$): $O(\tilde{N}^2 d)$.
- Unmerging and projection: $O(N d)$.
The theoretical speedup is governed by the fraction of surviving tokens per head. If fractions $\rho_q$ of queries and $\rho_{kv}$ of keys/values survive, the merged attention cost scales as $\rho_q \rho_{kv} N^2 d$, bounding the attention-level speedup by $1/(\rho_q \rho_{kv})$, though overall acceleration measured end-to-end reaches 4–7× due to matching and unmerging overheads.
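As a quick sanity check on this bound, the attention-level speedup implied by the query and key/value keep ratios reported for HTTM in the table below (0.20 and 0.30) can be computed directly:

```python
# Merged attention cost scales with (kept queries) x (kept keys/values),
# so the attention-level speedup is bounded by 1 / (rho_q * rho_kv).
def attention_speedup(rho_q, rho_kv):
    return 1.0 / (rho_q * rho_kv)

bound = attention_speedup(0.20, 0.30)   # HTTM's reported Q/KV ratios
print(round(bound, 1))                  # attention-level bound only
```

The gap between this idealized bound and the measured 4–7× end-to-end speedup is accounted for by matching, unmerging, and the layers that are not merged.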
5. Key Hyperparameters and Merging Strategy
- Merge ratio per head: lower keep-ratios (more tokens merged) provide higher speedup at a potential cost in reconstruction error.
- Block size ($n_b$): larger $n_b$ allows improved global matching at higher similarity-computation cost; a moderate $n_b$ balances quality and efficiency.
- Source–destination split ($\alpha_i$): e.g., 75% sources, 25% destinations.
- Outlier-filtering rate: empirically, filtering the top 10% of outliers (by L2 deviation) yields a negligible drop in 3D accuracy.
Empirical findings show that for highly temporally continuous data, stacking blocks along the temporal dimension (a large temporal extent) produces higher-quality merges, while for sparse-view scenarios spatial grouping (a large spatial extent) is more beneficial. Hybrid spatio-temporal blocks generally outperform single-axis grouping.
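The block-construction trade-off can be made concrete with a small sketch that reorders a frame-major token layout into spatio-temporal blocks; `spatio_temporal_order` and its arguments are hypothetical names standing in for the paper's unspecified ordering routine:

```python
import numpy as np

def spatio_temporal_order(T, P, n_t, n_s):
    """Order tokens (T frames x P tokens/frame, flattened frame-major) so
    that each run of n_t * n_s consecutive indices spans n_t frames and
    n_s spatial positions, forming one spatio-temporal block."""
    idx = np.arange(T * P).reshape(T, P)
    order = []
    for t0 in range(0, T, n_t):            # temporal extent of the block
        for p0 in range(0, P, n_s):        # spatial extent of the block
            order.append(idx[t0:t0 + n_t, p0:p0 + n_s].ravel())
    return np.concatenate(order)
```

Choosing `n_t` large versus `n_s` large corresponds directly to the temporal-stacking versus spatial-grouping regimes described above.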
6. Experimental Evaluation
On 7Scenes and NRGBD datasets, running on NVIDIA A100 with FlashAttention in bfloat16:
| Method | Q Ratio | KV Ratio | 7Scenes Acc. | 7Scenes Comp. | 7Scenes Time | NRGBD Acc. | NRGBD Comp. | NRGBD Time |
|---|---|---|---|---|---|---|---|---|
| VGGT* | 1.00 | 1.00 | 0.019 | 0.021 | 9.1 s | 0.010 | 0.010 | 13.9 s |
| FastVGGT | 0.34 | 0.34 | 0.018 | 0.020 | 4.5 s | 0.016 | 0.013 | 7.0 s |
| HTTM | 0.20 | 0.30 | 0.020 | 0.023 | 4.3 s | 0.012 | 0.010 | 6.8 s |
End-to-end latency scales favorably as sequence length increases, with a 7× speedup demonstrated for 1000-frame NRGBD inputs. Token-matching overhead is substantially reduced in HTTM (0.12 s vs. 2.31 s for FastVGGT), while total aggregation remains efficient.
Ablation studies show the necessity of outlier filtering: without it, merging degrades reconstruction accuracy, and NRGBD accuracy recovers as the outlier-filter rate is increased to 10%. Mixing temporal and spatial block axes further improves merge quality over purely spatial approaches.
7. Limitations and Future Perspectives
HTTM is explicitly tailored to VGGT’s distinctive spatio-temporal redundancy (resulting from repeated Rotary-PE). Porting to alternative architectures may require retuning of block settings or merge ratios. The outlier filter is currently based on simple L2-thresholding; integration of learned gating or adaptive metrics could mitigate residual degradation. First-frame anchoring—protecting the initial frame’s tokens from merging—can further stabilize long-sequence global attention and is proposed for more systematic adoption.
Ongoing research directions include adaptive block sizes per head or layer, learned similarity projections beyond raw cosine metrics, and joint key–query merging under a global headwise budget (Wang et al., 26 Nov 2025).