
HTTM: Head-Wise Temporal Token Merging

Updated 19 December 2025
  • The paper introduces HTTM, a training-free method that independently merges tokens per self-attention head, achieving up to 7× acceleration while preserving 3D reconstruction accuracy.
  • HTTM leverages spatio-temporal block-wise similarity and adaptive outlier filtering to significantly reduce computational cost without compromising quality.
  • Empirical results on benchmarks like 7Scenes and NRGBD demonstrate that HTTM efficiently scales global attention for long-sequence inputs in VGGT models.

Head-Wise Temporal Token Merging (HTTM) is a training-free method for accelerating global attention in the Visual Geometry Grounded Transformer (VGGT), targeting the 3D reconstruction of large scenes from multi-view or long-sequence inputs. HTTM performs token merging independently for each self-attention head, exploiting spatial–temporal patterns and redundancy within head-wise blocks, thereby circumventing the limitations of prior uniform or sparsity-based merging approaches. This yields a significant reduction in computational cost (up to 7× acceleration) with minimal degradation in 3D reconstruction accuracy on benchmarks such as 7Scenes and NRGBD (Wang et al., 26 Nov 2025).

1. Motivation and Rationale

VGGT’s global attention layers process all tokens from all input views, typically $N \gg 20{,}000$ for large scenes, leading to $O(N^2)$ time and memory complexity per layer. For lengthy sequences, this step becomes an inference bottleneck. Previous token merging techniques either merge tokens identically across all attention heads—thereby collapsing head-specific information and impairing representational power—or use uniform sparsity that fails to leverage VGGT’s inherent spatio-temporal redundancy. HTTM addresses both problems by merging tokens independently in each attention head (“head-wise”) and grouping tokens into small spatio-temporal blocks for block-wise similarity computation, enabling high merge ratios at reduced computational cost.

2. Formal Definition and Algorithm

Let $X \in \mathbb{R}^{N \times d}$ be the input token sequence, $d$ the embedding dimension, and $h$ the number of attention heads ($d_\mathrm{head} = d/h$). After per-head projections, tokens are represented as queries, keys, and values:

$$Q, K, V \in \mathbb{R}^{h \times N \times d_\mathrm{head}}, \quad Q^{(i)}, K^{(i)}, V^{(i)} \in \mathbb{R}^{N \times d_\mathrm{head}}$$

2.1 Head-wise Temporal Merging Modules

For each head $i$, two merging functions are defined:

$$M_i^q : \mathbb{R}^{N \times d_\mathrm{head}} \to \mathbb{R}^{M_i \times d_\mathrm{head}}, \quad M_i^k : \mathbb{R}^{N \times d_\mathrm{head}} \to \mathbb{R}^{M_i \times d_\mathrm{head}}$$

which reduce the effective token count from $N$ to $M_i$ per head. Values $V^{(i)}$ are merged using the same index assignments as keys to preserve key–value consistency.
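As a shape sketch of the per-head tensors above, the following toy NumPy snippet projects tokens into $h$ heads and applies a placeholder merging function; $N$, $d$, $h$, $M_i$ and the random weights are illustrative stand-ins, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 12, 8, 2            # toy token count, embedding dim, heads
d_head = d // h
M_i = 4                       # tokens surviving in head i after merging

X = rng.standard_normal((N, d))
W = rng.standard_normal((h, d, d_head))   # stand-in projection weights
Q = np.einsum('nd,hde->hne', X, W)        # shape (h, N, d_head)

# A head-wise merging function M_i^q maps (N, d_head) -> (M_i, d_head).
# Placeholder that keeps the first M_i tokens, purely to show shapes;
# the real merge selects and averages tokens by similarity (Sec. 2.2).
def merge_placeholder(tokens, M):
    return tokens[:M]

Q_merged = merge_placeholder(Q[0], M_i)
assert Q_merged.shape == (M_i, d_head)
```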

2.2 Spatio-temporal Block-wise Similarity and Merging

HTTM reorders tokens into $K$ non-overlapping spatio-temporal blocks of size $n_b$. Within each block $k = 1, \ldots, K$ for head $i$, tokens are partitioned into a destination set $D_k^{(i)}$ (size $N_d$) and a source set $S_k^{(i)}$ (size $N_s$), with $N_s + N_d = n_b$. The cosine similarity matrix within each block is:

$$\mathrm{Sim}_k^{(i)} = \mathrm{RowNorm}(S_k^{(i)}) \cdot \mathrm{RowNorm}(D_k^{(i)})^\top$$

with $l_p = \arg\max_{q \in D_k} \mathrm{Sim}_k^{(i)}[p, q]$ the best-matching destination index and $m_p = \mathrm{Sim}_k^{(i)}[p, l_p]$ the matching score for each source token $p$.
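The matching step above can be sketched in NumPy; block size, split sizes, and the random tokens are illustrative toy values, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_b, d_head = 8, 4
N_s, N_d = 6, 2                       # N_s + N_d = n_b

block = rng.standard_normal((n_b, d_head))
S, D = block[:N_s], block[N_s:]       # source / destination partition

def row_norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

Sim = row_norm(S) @ row_norm(D).T     # (N_s, N_d) cosine similarities
l = Sim.argmax(axis=1)                # best destination index per source
m = Sim[np.arange(N_s), l]            # matching score per source

assert Sim.shape == (N_s, N_d)
```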

Globally, the $r_i = N - M_i$ source tokens with the highest scores are merged into their matched destinations. For each destination token $d_q$ receiving merges, the new merged token is the mean:

$$\tilde{d}_q = \frac{d_q + \sum_{p : l_p = q} s_p}{1 + |\{p : l_p = q\}|}$$

These form the reduced per-head queries, keys, and values: $\tilde{Q}^{(i)}$, $\tilde{K}^{(i)}$, and $\tilde{V}^{(i)}$.
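The mean-merge formula corresponds to a scatter-mean over clusters. A minimal NumPy sketch, with toy destination/source tokens and a hand-picked assignment `l` standing in for the block-wise matching result:

```python
import numpy as np

D = np.array([[1.0, 1.0], [3.0, 3.0]])   # two destination tokens d_q
S = np.array([[2.0, 2.0], [5.0, 5.0]])   # two selected source tokens s_p
l = np.array([0, 1])                     # source p -> destination l[p]

merged = D.copy()
counts = np.ones(len(D))
np.add.at(merged, l, S)                  # accumulate sources per destination
np.add.at(counts, l, 1.0)                # cluster sizes 1 + |{p : l_p = q}|
merged /= counts[:, None]                # mean over {d_q} and its sources

# merged[0] = (1+2)/2 = 1.5, merged[1] = (3+5)/2 = 4.0
```

`np.add.at` is used because it performs unbuffered accumulation, so several sources mapping to the same destination are all counted.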

2.3 Merged Attention and Unmerging

Attention is computed as:

$$A^{(i)} = \mathrm{softmax}\!\left(\frac{\tilde{Q}^{(i)} \tilde{K}^{(i)\top}}{\sqrt{d_\mathrm{head}}}\right), \quad \tilde{O}^{(i)} = A^{(i)} \tilde{V}^{(i)}$$

The final “unmerge” step recovers $N$ outputs per head by mapping each original token to the output of its merged cluster.
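Attention over the merged tokens and the gather-style unmerge can be sketched as follows for a single head; sizes and the cluster assignment are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, d_head = 6, 3, 4                   # N original tokens, M merged

Q_m = rng.standard_normal((M, d_head))   # merged queries
K_m = rng.standard_normal((M, d_head))   # merged keys
V_m = rng.standard_normal((M, d_head))   # merged values
cluster = np.array([0, 0, 1, 1, 2, 2])   # original token -> merged cluster

logits = Q_m @ K_m.T / np.sqrt(d_head)
A = np.exp(logits - logits.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)       # softmax over merged keys
O_m = A @ V_m                            # (M, d_head) merged outputs

O = O_m[cluster]                         # unmerge: broadcast back to N tokens
assert O.shape == (N, d_head)
```

Tokens that were merged into the same cluster receive identical outputs, which is exactly the approximation HTTM trades for speed.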

2.4 Adaptive Outlier Filtering (Optional)

The L2 distance between each original token and its merged prototype is computed, and the fraction $d\%$ of tokens with the largest deviation are designated “outliers” and excluded from merging. This step is critical for maintaining quality in blocks with low redundancy.
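A minimal sketch of this filter, assuming toy tokens and a fixed cluster assignment (in the real pipeline the assignment comes from the block-wise matching, and filtered tokens would then be kept unmerged):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d_head, d_pct = 10, 4, 0.10           # d_pct: outlier-filtering rate

tokens = rng.standard_normal((N, d_head))
cluster = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])  # toy assignment
prototypes = np.stack(
    [tokens[cluster == c].mean(axis=0) for c in range(3)]
)

# L2 deviation of each token from its merged prototype
dev = np.linalg.norm(tokens - prototypes[cluster], axis=1)
n_out = max(1, int(d_pct * N))
outliers = np.argsort(dev)[-n_out:]      # largest-deviation tokens

assert len(outliers) == n_out
```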

3. Pseudocode

The main steps can be structured as follows:

Q, K, V = LinearProject(X)                     # h x N x d_head
order = ComputeSpatioTemporalOrder(N, n_b)
Q, K, V = Q[:, order], K[:, order], V[:, order]
for i in range(h):
    blocks = SplitIntoBlocks(Q[i], K[i], V[i], n_b)
    matches, scores = [], []
    for block in blocks:
        S, D = PartitionSrcDst(block)          # e.g. 75% src, 25% dst
        Sim = RowNorm(S) @ RowNorm(D).T
        l = argmax(Sim, axis=1)                # best destination per source
        m = Sim[arange(len(S)), l]             # matching scores
        matches.append(l); scores.append(m)
    merged_src = TopR(concat(scores), r_i)     # r_i highest-scoring sources
    for q in destinations:
        C_q = {d_q} ∪ {s_p : p in merged_src, l_p = q}
        tilde_d_q = mean(C_q)
    tilde_Q[i], tilde_K[i], tilde_V[i] = CollectMerged(tilde_d_q for all q)
for i in range(h):
    A = softmax(tilde_Q[i] @ tilde_K[i].T / sqrt(d_head))
    tilde_O[i] = A @ tilde_V[i]
    for n in range(N):                         # unmerge
        q = ClusterOf(n)                       # cluster token n merged into
        O[i][n] = tilde_O[i][q]
O = ConcatHeads(O[1], ..., O[h])
O = O[inverse_order]
return O

4. Complexity and Computational Tradeoffs

The original global attention cost is $O(h \cdot N^2 \cdot d_\mathrm{head}) = O(N^2 d)$. With HTTM, the dominant costs shift to:

  • Block-wise similarity computation: $O(h \cdot N \cdot n_b \cdot d_\mathrm{head})$.
  • Attention on the merged sequence ($M_i \approx \alpha_i N$): $O(h \cdot M^2 \cdot d_\mathrm{head})$.
  • Unmerging and projection: $O(h \cdot N \cdot d_\mathrm{head})$.

The theoretical speedup $S$ is governed by the fraction $\alpha$ of surviving tokens per head:

$$S \approx \frac{N^2}{\alpha^2 N^2 + N n_b} \approx \frac{1}{\alpha^2} \quad (\text{when } n_b \ll \alpha^2 N)$$

For $\alpha = 0.2$, the upper bound is $25\times$, though overall acceleration measured end-to-end reaches $4$–$7\times$ due to overheads.
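A back-of-envelope check of the speedup bound $S \approx N^2/(\alpha^2 N^2 + N n_b)$, using illustrative $N$ and the typical $n_b \sim 3840$ from Section 5 (the specific $N$ is an assumption for the sake of arithmetic):

```python
N, n_b = 100_000, 3840

for alpha in (0.2, 0.3, 0.5):
    # S = N^2 / (alpha^2 N^2 + N n_b) = 1 / (alpha^2 + n_b / N)
    S = N**2 / (alpha**2 * N**2 + N * n_b)
    print(f"alpha={alpha}: S ≈ {S:.1f}x "
          f"(ideal 1/alpha^2 = {1 / alpha**2:.1f}x)")
```

At these values the $O(N n_b)$ matching term noticeably erodes the ideal $1/\alpha^2$ bound, consistent with the measured $4$–$7\times$ end-to-end speedups falling short of $25\times$.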

5. Key Hyperparameters and Merging Strategy

  • Merge ratio per head ($\alpha_i$): lower values provide higher speedup at potential cost to reconstruction accuracy.
  • Block size ($n_b = n_s \times n_t$): larger $n_b$ allows improved global matching at $O(N n_b)$ cost. Typical $n_b \sim 3840$ balances quality and efficiency.
  • Source–destination split: e.g., 75% sources, 25% destinations.
  • Outlier-filtering rate ($d\%$): empirically, filtering the top 10% of outliers (by L2 deviation) yields a negligible drop in 3D accuracy.

Empirical findings show that for highly temporally continuous data, stacking in the temporal dimension (large $n_t$) produces higher-quality merges, while for sparse-view scenarios, spatial grouping (large $n_s$) is more beneficial. Hybrid spatio-temporal blocks generally outperform single-axis grouping.

6. Experimental Evaluation

On 7Scenes and NRGBD datasets, running on NVIDIA A100 with FlashAttention in bfloat16:

| Method   | Q Ratio | KV Ratio | 7Scenes Acc. | 7Scenes Comp. | Time  | NRGBD Acc. | NRGBD Comp. | Time   |
|----------|---------|----------|--------------|---------------|-------|------------|-------------|--------|
| VGGT*    | 1.00    | 1.00     | 0.019        | 0.021         | 9.1 s | 0.010      | 0.010       | 13.9 s |
| FastVGGT | 0.34    | 0.34     | 0.018        | 0.020         | 4.5 s | 0.016      | 0.013       | 7.0 s  |
| HTTM     | 0.20    | 0.30     | 0.020        | 0.023         | 4.3 s | 0.012      | 0.010       | 6.8 s  |

End-to-end latency scales favorably as sequence length increases, with a 7× speedup demonstrated for 1000-frame NRGBD inputs. Token-matching overhead, the dominant non-attention cost, is substantially reduced in HTTM (0.12 s vs. 2.31 s for FastVGGT), while total aggregation remains efficient.

Ablation studies show the necessity of outlier filtering: without it, merging severely degrades reconstruction quality, and raising the outlier-filter rate to 10% improves NRGBD Acc. from 0.240 to 0.012. Mixing temporal and spatial block axes further improves merge quality over purely spatial approaches.

7. Limitations and Future Perspectives

HTTM is explicitly tailored to VGGT’s distinctive spatio-temporal redundancy (resulting from repeated Rotary-PE). Porting to alternative architectures may require retuning of block settings or merge ratios. The outlier filter is currently based on simple L2-thresholding; integration of learned gating or adaptive metrics could mitigate residual degradation. First-frame anchoring—protecting the initial frame’s tokens from merging—can further stabilize long-sequence global attention and is proposed for more systematic adoption.

Ongoing research directions include adaptive block sizes per head or layer, learned similarity projections beyond raw cosine metrics, and joint key–query merging under a global headwise budget (Wang et al., 26 Nov 2025).
