HTTM: Head-Wise Temporal Token Merging
- The paper introduces HTTM, a training-free method that independently merges tokens per self-attention head, achieving up to 7× acceleration while preserving 3D reconstruction accuracy.
- HTTM leverages spatio-temporal block-wise similarity and adaptive outlier filtering to significantly reduce computational cost without compromising quality.
- Empirical results on benchmarks like 7Scenes and NRGBD demonstrate that HTTM efficiently scales global attention for long-sequence inputs in VGGT models.
Head-Wise Temporal Token Merging (HTTM) is a training-free method for accelerating global attention in the Visual Geometry Grounded Transformer (VGGT), centered on the 3D reconstruction of large scenes from multi-view or long-sequence inputs. HTTM performs token merging independently for each self-attention head, exploiting spatial–temporal patterns and redundancy within head-wise blocks, thereby circumventing the limitations of prior uniform or sparsity-based merging approaches. This yields a significant reduction in computational cost (up to 7× acceleration) with minimal degradation in 3D reconstruction accuracy in benchmarks such as 7Scenes and NRGBD (Wang et al., 26 Nov 2025).
1. Motivation and Rationale
VGGT’s global attention layers process all $N$ tokens from all input views at once, so $N$ grows with both scene size and sequence length, leading to $O(N^2)$ time and memory complexity per layer. For lengthy sequences, this step becomes an inference bottleneck. Previous token merging techniques either merge tokens identically across all attention heads—thereby collapsing head-specific information and impairing representational power—or use uniform sparsity that fails to leverage VGGT’s inherent spatio-temporal redundancy. HTTM addresses both problems by merging tokens independently in each attention head (“head-wise”) and grouping tokens into small spatio-temporal blocks for block-wise similarity computation, enabling high merge ratios at reduced computational cost.
2. Formal Definition and Algorithm
Let $X \in \mathbb{R}^{N \times d}$ be the input token sequence, $d$ the embedding dimension, and $h$ the number of attention heads ($d_{\text{head}} = d/h$). After per-head projections, tokens are represented as queries, keys, and values:

$$Q^{(i)}, K^{(i)}, V^{(i)} \in \mathbb{R}^{N \times d_{\text{head}}}, \qquad i = 1, \dots, h.$$
2.1 Head-wise Temporal Merging Modules
For each head $i$, two merging functions are defined:

$$\mathcal{M}_q^{(i)}: Q^{(i)} \mapsto \tilde{Q}^{(i)}, \qquad \mathcal{M}_{kv}^{(i)}: \left(K^{(i)}, V^{(i)}\right) \mapsto \left(\tilde{K}^{(i)}, \tilde{V}^{(i)}\right),$$

which reduce the effective token count from $N$ to $N - r_i$ per head, where $r_i$ is the number of merged source tokens. Values are merged using the same index assignments as keys to preserve key–value consistency.
2.2 Spatio-temporal Block-wise Similarity and Merging
HTTM reorders tokens into non-overlapping spatio-temporal blocks of size $n_b$. Within each block $B$ for head $i$ ($|B| = n_b$), tokens are partitioned into destination (size $n_b/4$) and source (size $3n_b/4$) sets, $B = S \cup D$. The cosine similarity matrix within each block is:

$$\mathrm{Sim}_{p,q} = \frac{k_p^\top k_q}{\|k_p\|\,\|k_q\|}, \qquad s_p \in S,\ d_q \in D,$$
with $\ell_p = \arg\max_q \mathrm{Sim}_{p,q}$ as the best matching index and $m_p = \mathrm{Sim}_{p,\ell_p}$ as the matching score for each source token $s_p$.
Globally, the $r_i$ source tokens with the highest scores $m_p$ are merged into their matched destinations. For each destination token $d_q$ receiving merges, the new merged token is the mean:

$$\tilde{d}_q = \frac{1}{|\mathcal{C}_q|} \sum_{x \in \mathcal{C}_q} x, \qquad \mathcal{C}_q = \{d_q\} \cup \{s_p : \ell_p = q,\ s_p \text{ selected}\}.$$
These form the reduced per-head projected queries, keys, and values: $\tilde{Q}^{(i)}$, $\tilde{K}^{(i)}$, and $\tilde{V}^{(i)}$.
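The block-wise matching and mean-merge above can be sketched in NumPy for one head's keys (the same index assignments would then be applied to the values). `merge_block`, the 75/25 split default, and the running-mean update are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def merge_block(K_blk, r, dst_frac=0.25):
    """Merge the r most confidently matched source tokens of one block
    into their best-matching destination tokens (illustrative sketch)."""
    n_b, d = K_blk.shape
    n_dst = max(1, int(n_b * dst_frac))
    D, S = K_blk[:n_dst], K_blk[n_dst:]          # destination / source split

    # Cosine similarity between every source and every destination token.
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = Sn @ Dn.T                              # (n_src, n_dst)

    match = sim.argmax(axis=1)                   # best destination per source
    score = sim.max(axis=1)                      # matching confidence m_p
    keep_src = np.argsort(score)[:-r] if r else np.arange(len(S))
    merged_src = np.argsort(score)[-r:] if r else np.array([], dtype=int)

    # Mean-merge each destination with the sources assigned to it.
    merged = D.copy()
    counts = np.ones(n_dst)
    for p in merged_src:
        q = match[p]
        merged[q] = (merged[q] * counts[q] + S[p]) / (counts[q] + 1)
        counts[q] += 1
    return np.concatenate([merged, S[keep_src]], axis=0)
```

With `r` merged sources, the block shrinks from `n_b` to `n_b - r` tokens, matching the per-head reduction described above.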
2.3 Merged Attention and Unmerging
Attention is computed as:

$$\tilde{O}^{(i)} = \mathrm{softmax}\!\left(\frac{\tilde{Q}^{(i)} \tilde{K}^{(i)\top}}{\sqrt{d_{\text{head}}}}\right) \tilde{V}^{(i)}.$$
The final “unmerge” step recovers per-head outputs $O^{(i)} \in \mathbb{R}^{N \times d_{\text{head}}}$ by mapping each original token to the output of its merged cluster.
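The merged attention plus unmerge step amounts to gathering each token's cluster output. A minimal sketch, assuming `cluster_of` maps each of the $N$ original tokens to its merged index (names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def merged_attention_unmerge(Qm, Km, Vm, cluster_of):
    """Attention over the merged sequence, then 'unmerge' by copying each
    cluster's output back to every original token in that cluster."""
    d_head = Qm.shape[-1]
    A = softmax(Qm @ Km.T / np.sqrt(d_head))   # (N~, N~) attention weights
    Om = A @ Vm                                # (N~, d_head) merged outputs
    return Om[cluster_of]                      # (N, d_head) per-token outputs
```

Because unmerging is a pure gather, tokens merged into the same cluster receive identical outputs, which is the source of HTTM's approximation error.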
2.4 Adaptive Outlier Filtering (Optional)
The L2 distance between each original token and its merged prototype is calculated, and a fraction of tokens with the largest deviation are designated “outliers” and excluded from merging. This step is critical for maintaining quality in blocks with low redundancy.
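A sketch of this filter, assuming the same `cluster_of` token-to-prototype mapping as above (`outlier_mask` and its signature are hypothetical):

```python
import numpy as np

def outlier_mask(X, prototypes, cluster_of, frac=0.10):
    """Mark the `frac` of tokens farthest (L2) from their merged prototype
    as outliers; flagged tokens are kept unmerged."""
    dev = np.linalg.norm(X - prototypes[cluster_of], axis=1)
    k = max(1, int(round(len(X) * frac)))
    thresh = np.partition(dev, -k)[-k]         # k-th largest deviation
    return dev >= thresh                       # True = outlier
```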
3. Pseudocode
The main steps can be structured as follows:
```
Q, K, V = LinearProject(X)                          # h x N x d_head
order = ComputeSpatioTemporalOrder(N, n_b)
Q, K, V = Q[:, :, order], K[:, :, order], V[:, :, order]

for i in range(h):
    blocks = SplitIntoBlocks(Q[i], K[i], V[i], n_b)
    for block in blocks:
        S, D = PartitionSrcDst(block, alpha_i)      # e.g. 75% src, 25% dst
        Sim = RowNorm(S) @ RowNorm(D).T
        for p in range(len(S)):
            l_p = argmax(Sim[p])                    # best destination index
            m_p = Sim[p, l_p]                       # matching score
    S_sel = TopR(all m_p, r_i)                      # top-r_i sources to merge
    for q in D:
        C_q = {d_q} ∪ {s_p : s_p merged into d_q}
        tilde_d_q = mean(C_q)
    tilde_Q[i], tilde_K[i], tilde_V[i] = all tilde_d_q

for i in range(h):
    A = softmax(tilde_Q[i] @ tilde_K[i].T / sqrt(d_head))
    tilde_O[i] = A @ tilde_V[i]
    for n in range(N):
        q = cluster where n was merged              # unmerge mapping
        O_n[i] = tilde_O[i][q]

O = Concat_heads(O^(1), ..., O^(h))
O = O[inverse_order]
return O
```
4. Complexity and Computational Tradeoffs
The original global attention cost is $O(N^2 d)$. For HTTM, the dominant costs shift to:
- Block-wise similarity computation: $O(N n_b d)$ across heads.
- Attention on the merged sequence ($\tilde{N} < N$): $O(\tilde{N}^2 d)$.
- Unmerging and projection: $O(N d)$.
The theoretical speedup is governed by the fraction of surviving tokens per head. If fractions $\rho_q$ of queries and $\rho_{kv}$ of keys/values survive, the merged attention cost scales as $\rho_q \rho_{kv} N^2 d$, bounding the attention-level speedup by $1/(\rho_q \rho_{kv})$, though overall acceleration measured end-to-end reaches 4–7× due to matching and unmerging overheads.
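As a quick sanity check on this bound, the attention-level speedup implied by the query and key/value keep ratios reported for HTTM in the table below (0.20 and 0.30) can be computed directly:

```python
# Merged attention cost scales with (kept queries) x (kept keys/values),
# so the attention-level speedup is bounded by 1 / (rho_q * rho_kv).
def attention_speedup(rho_q, rho_kv):
    return 1.0 / (rho_q * rho_kv)

bound = attention_speedup(0.20, 0.30)   # HTTM's reported Q/KV ratios
print(round(bound, 1))                  # attention-level bound only
```

The gap between this idealized bound and the measured 4–7× end-to-end speedup is accounted for by matching, unmerging, and the layers that are not merged.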
5. Key Hyperparameters and Merging Strategy
- Merge ratio per head: lower keep-ratios (more tokens merged) provide higher speedup at a potential cost in reconstruction error.
- Block size ($n_b$): larger $n_b$ allows improved global matching at higher similarity-computation cost; a moderate $n_b$ balances quality and efficiency.
- Source–destination split ($\alpha_i$): e.g., 75% sources, 25% destinations.
- Outlier-filtering rate: empirically, filtering the top 10% of outliers (by L2 deviation) yields a negligible drop in 3D accuracy.
Empirical findings show that for highly temporally continuous data, stacking blocks along the temporal dimension (a large temporal extent) produces higher-quality merges, while for sparse-view scenarios spatial grouping (a large spatial extent) is more beneficial. Hybrid spatio-temporal blocks generally outperform single-axis grouping.
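The block-construction trade-off can be made concrete with a small sketch that reorders a frame-major token layout into spatio-temporal blocks; `spatio_temporal_order` and its arguments are hypothetical names standing in for the paper's unspecified ordering routine:

```python
import numpy as np

def spatio_temporal_order(T, P, n_t, n_s):
    """Order tokens (T frames x P tokens/frame, flattened frame-major) so
    that each run of n_t * n_s consecutive indices spans n_t frames and
    n_s spatial positions, forming one spatio-temporal block."""
    idx = np.arange(T * P).reshape(T, P)
    order = []
    for t0 in range(0, T, n_t):            # temporal extent of the block
        for p0 in range(0, P, n_s):        # spatial extent of the block
            order.append(idx[t0:t0 + n_t, p0:p0 + n_s].ravel())
    return np.concatenate(order)
```

Choosing `n_t` large versus `n_s` large corresponds directly to the temporal-stacking versus spatial-grouping regimes described above.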
6. Experimental Evaluation
On 7Scenes and NRGBD datasets, running on NVIDIA A100 with FlashAttention in bfloat16:
| Method | Q Ratio | KV Ratio | 7Scenes Acc. | 7Scenes Comp. | 7Scenes Time | NRGBD Acc. | NRGBD Comp. | NRGBD Time |
|---|---|---|---|---|---|---|---|---|
| VGGT* | 1.00 | 1.00 | 0.019 | 0.021 | 9.1 s | 0.010 | 0.010 | 13.9 s |
| FastVGGT | 0.34 | 0.34 | 0.018 | 0.020 | 4.5 s | 0.016 | 0.013 | 7.0 s |
| HTTM | 0.20 | 0.30 | 0.020 | 0.023 | 4.3 s | 0.012 | 0.010 | 6.8 s |
End-to-end latency scales favorably as sequence length increases, with a 7× speedup demonstrated for 1000-frame NRGBD inputs. Token-matching overhead is substantially reduced in HTTM (0.12 s vs. 2.31 s for FastVGGT), while total aggregation remains efficient.
Ablation studies show the necessity of outlier filtering: without it, merging degrades reconstruction accuracy, and NRGBD accuracy recovers as the outlier-filter rate is increased to 10%. Mixing temporal and spatial block axes further improves merge quality over purely spatial approaches.
7. Limitations and Future Perspectives
HTTM is explicitly tailored to VGGT’s distinctive spatio-temporal redundancy (resulting from repeated Rotary-PE). Porting to alternative architectures may require retuning of block settings or merge ratios. The outlier filter is currently based on simple L2-thresholding; integration of learned gating or adaptive metrics could mitigate residual degradation. First-frame anchoring—protecting the initial frame’s tokens from merging—can further stabilize long-sequence global attention and is proposed for more systematic adoption.
Ongoing research directions include adaptive block sizes per head or layer, learned similarity projections beyond raw cosine metrics, and joint key–query merging under a global headwise budget (Wang et al., 26 Nov 2025).