
Geometric Token Fusion in Vision Transformers

Updated 3 January 2026
  • Geometric Token Fusion is a technique that merges spatially or statistically similar tokens to reduce computational overhead while preserving critical geometric cues.
  • It is applied in models like CubistMerge, LiteVGGT, FASTer, Co-Me, and ToFu, each using specialized strategies such as grid preservation, cached merging, and confidence-guided selection.
  • Empirical results demonstrate significant speedups (up to 10×) with minimal accuracy loss across tasks such as classification, segmentation, and 3D reconstruction.

Geometric Token Fusion is a set of algorithmic techniques for redundant token elimination and information pooling in vision and geometric transformer models. The core objective is to reduce the number of tokens and thus the computational cost of self-attention, while preserving critical geometric, spatial, and semantic structure for downstream tasks such as classification, segmentation, and 3D reconstruction. This is accomplished by merging tokens in a spatially, geometrically, or statistically informed way, as opposed to simplistic averaging or pruning. Geometric token fusion is instantiated in spatial vision transformers (e.g., CubistMerge (Gong et al., 26 Sep 2025)), geometry-aware 3D vision transformers (e.g., LiteVGGT (Shu et al., 4 Dec 2025)), adaptive and confidence-guided paradigms (e.g., FASTer (Dang et al., 28 Feb 2025), Co-Me (Chen et al., 18 Nov 2025)), and norm-preserving fusion (e.g., Token Fusion/ToFu (Kim et al., 2023)). The following sections detail foundational methodologies, geometric/graph-theoretic formulations, key algorithmic strategies, main empirical performance trends, and the comparative role of geometric fusion in the landscape of transformer token efficiency.

1. Geometric Foundations and Problem Formulation

Geometric Token Fusion departs from naïve merging by leveraging the geometry of feature spaces and/or the spatial layout of tokens. In vision transformers, each token $t_{i,j} \in \mathbb{R}^d$ often corresponds to a patch or region in 2D/3D space. Geometric fusion methods aim to minimize informational redundancy by aggregating only highly similar or spatially/structurally coherent tokens, and by doing so in a way that preserves (often exactly) the geometric and topological arrangements required by the backbone (e.g., grid structure, relative position).

Similarities between tokens are commonly measured using cosine similarity: $S_{i,j} = \frac{x_i \cdot x_j}{\|x_i\|_2 \|x_j\|_2}$. Tokens with high similarity are candidates for merging, provided their geometric importance (edge, texture, boundary, or spatial anchoring) is below a learned or hand-crafted threshold. This is the basis for methods such as LiteVGGT and ToFu (Shu et al., 4 Dec 2025; Kim et al., 2023).
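As a minimal NumPy sketch (not code from any of the cited papers), pairwise cosine similarity and threshold-based candidate selection look as follows; the `threshold` parameter stands in for the learned or hand-crafted cutoff mentioned above:

```python
import numpy as np

def cosine_similarity_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity S[i, j] = x_i . x_j / (||x_i|| ||x_j||)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x_hat = x / np.clip(norms, 1e-12, None)   # guard against zero-norm tokens
    return x_hat @ x_hat.T

def merge_candidates(x: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Return token index pairs (i, j), i < j, whose similarity exceeds threshold."""
    s = cosine_similarity_matrix(x)
    n = s.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if s[i, j] > threshold]
```

In practice each method restricts this candidate set further (spatial adjacency, anchor assignment, confidence masks) rather than scanning all pairs.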

In models explicitly designed for spatial/pixel consistency (e.g., CubistMerge), the geometric regularity of the grid is enforced at each merging step to maintain compatibility with positional embeddings and specialized architectural modules such as window attention (Gong et al., 26 Sep 2025).

2. Grid-Preserving and Spatial-Aware Fusion (CubistMerge)

CubistMerge exemplifies spatially grounded geometric token fusion, addressing the need to preserve the $H \times W$ grid structure even after token reduction (Gong et al., 26 Sep 2025). The process rests on three principal components:

2D Reduction Strategy: The sequence of tokens $T = \{t_{i,j}\}$ is reduced by applying sequential horizontal and vertical merges. Horizontal reduction removes $r_w$ tokens per row, yielding $H \times (W - r_w)$ tokens, while vertical reduction then removes $r_h$ tokens per column, arriving at $(H - r_h) \times (W - r_w)$ tokens. Each merge is performed by pairing adjacent tokens, thereby maintaining locality and grid topology, which is required for compatibility with spatial attention or relative positional encoding.

Spatial-Aware Matching: Each row or column is represented as a path graph $G = (V, E)$, with $V$ corresponding to the tokens and $E$ linking adjacent tokens. A bipartite matching is constructed: nodes are partitioned by parity (even/odd), and every source node nominates its most similar neighbor for merging (based on cosine similarity). The top $r$ matches are selected, guaranteeing that merges are local, disjoint, and parallelizable.
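A sketch of this matching for a single token row, under the assumption (mine, for illustration) that ties between nominations are resolved greedily so that no token participates in two merges:

```python
import numpy as np

def spatial_aware_matches(row: np.ndarray, r: int) -> list[tuple[int, int]]:
    """Pick up to r disjoint (even, odd) adjacent merge pairs in one token row.

    Even-indexed (source) tokens nominate their adjacent odd-indexed neighbors;
    nominations are ranked by cosine similarity and accepted greedily so each
    token is merged at most once.
    """
    norms = np.linalg.norm(row, axis=1, keepdims=True)
    unit = row / np.clip(norms, 1e-12, None)
    n = len(row)
    nominations = []
    for i in range(0, n, 2):                  # even (source) nodes
        for j in (i - 1, i + 1):              # adjacent odd (target) nodes
            if 0 <= j < n:
                nominations.append((float(unit[i] @ unit[j]), i, j))
    nominations.sort(reverse=True)            # most similar pairs first
    used, matches = set(), []
    for _, i, j in nominations:
        if len(matches) == r:
            break
        if i not in used and j not in used:
            matches.append((i, j))
            used.update((i, j))
    return matches
```

Because each accepted pair is adjacent and disjoint, all selected merges can run in parallel without disturbing the row's ordering.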

Token Representation (Max-Per-Dimension): Upon merging, each output token is built by taking, for each feature dimension, the entry with maximal magnitude among the inputs: $t_m[i] = t^{c}[i]$, where $c = \arg\max_{j} |t^{j}[i]|$. This preserves salient activations indicative of geometric edges or features and avoids feature-norm collapse, thus maintaining geometric discriminability in downstream layers.
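The max-per-dimension rule above is a one-liner in NumPy; this sketch merges $k$ tokens of dimension $d$ at once:

```python
import numpy as np

def max_magnitude_merge(tokens: np.ndarray) -> np.ndarray:
    """Merge tokens (shape [k, d]) dimension-wise: for each feature dimension,
    keep the entry with the largest absolute value, preserving its sign."""
    idx = np.argmax(np.abs(tokens), axis=0)            # winning token per dim
    return tokens[idx, np.arange(tokens.shape[1])]
```

Unlike averaging, the output's per-dimension magnitudes match one of the inputs, which is what prevents the norm collapse described above.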

Empirically, this yields a 1.25× speedup with a $<0.7\%$ mIoU drop on SAM-H, and a 1.15× speedup with no top-1 accuracy drop on DeiT-B after one epoch of fine-tuning. The structure-preserving design is crucial for deployment in models with complex spatial positional mechanisms (Gong et al., 26 Sep 2025).

3. Geometry-Aware and Cached Merging (LiteVGGT)

In large-scale 3D reconstruction, LiteVGGT implements a geometry-aware fusion protocol for Visual Geometry Grounded Transformers (Shu et al., 4 Dec 2025). Its characteristics are:

Geometric Importance Scoring: Each token is assigned a geometry-aware score $\Psi_{GA}$, combining pixel gradient magnitude (via Sobel filters) and local token variance. High-$\Psi_{GA}$ tokens correspond to regions with rich geometric information (e.g., object boundaries), which are exempted from merging to ensure fidelity of critical scene regions.
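The text does not specify how the two terms are combined; the sketch below assumes a simple weighted sum with an illustrative weight `alpha`, and implements the Sobel filter from scratch with zero-padded borders:

```python
import numpy as np

def sobel_gradient_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude via 3x3 Sobel filters (zero-padded borders)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    p = np.pad(img, 1)
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    for di in range(3):
        for dj in range(3):
            patch = p[di:di + img.shape[0], dj:dj + img.shape[1]]
            gx += kx[di, dj] * patch
            gy += ky[di, dj] * patch
    return np.hypot(gx, gy)

def geometry_aware_scores(img: np.ndarray, patch: int,
                          token_feats: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Psi_GA per token: a weighted mix of the mean Sobel gradient magnitude
    over the token's image patch and the variance of its feature vector.
    The weight alpha is an illustrative assumption, not a value from the paper."""
    grad = sobel_gradient_magnitude(img)
    h, w = img.shape[0] // patch, img.shape[1] // patch
    grad_per_token = grad.reshape(h, patch, w, patch).mean(axis=(1, 3)).ravel()
    var_per_token = token_feats.var(axis=1)
    return alpha * grad_per_token + (1 - alpha) * var_per_token
```

Tokens ranked highest by this score would be placed in the always-retained GA set before anchor assignment begins.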

Anchor Selection and Caching: The token pool is partitioned into GA tokens (high importance, always retained), dst tokens (anchors, typically one per grid cell or the first frame), and src tokens (remaining tokens). Each src token is assigned to its most similar anchor (by cosine similarity), and merged into it via simple averaging. The merge assignments (indices) are cached and reused across adjacent layers, capitalizing on the empirical observation of merge stability. This drastically reduces the cost of repeated similarity computation.

Token Unmerging: After the attention layers, anchor features are broadcast back to all merged positions using the cached map, restoring the token sequence to full length for dense prediction.
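The anchor-merge, index-cache, and unmerge steps can be sketched as follows (a simplified single-layer illustration, not LiteVGGT's implementation; the cached `assign` array is what would be reused across adjacent layers):

```python
import numpy as np

def merge_to_anchors(src: np.ndarray, anchors: np.ndarray):
    """Assign each src token to its most cosine-similar anchor and average it in.
    Returns (merged anchors, assignment indices); the indices are cacheable."""
    def unit(v):
        return v / np.clip(np.linalg.norm(v, axis=1, keepdims=True), 1e-12, None)
    assign = np.argmax(unit(src) @ unit(anchors).T, axis=1)   # cacheable merge map
    merged = anchors.copy()
    for a in range(len(anchors)):
        members = src[assign == a]
        if len(members):
            merged[a] = (anchors[a] + members.sum(axis=0)) / (1 + len(members))
    return merged, assign

def unmerge(merged: np.ndarray, assign: np.ndarray) -> np.ndarray:
    """Broadcast anchor features back to every merged src position."""
    return merged[assign]
```

Reusing `assign` in neighboring layers skips the similarity computation entirely, which is where the bulk of the saving comes from.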

System-level optimizations such as fine-tuning only the aggregator heads and the use of FP8 quantization in Transformer Engine further amplify efficiency and memory savings.

Empirically, LiteVGGT achieves up to $10\times$ speedup and 20–25% GPU memory reduction while preserving reconstruction quality: on ScanNet-50, Chamfer distance drops from 0.485 to 0.402 and wall-time falls from 1275 s to 127 s (Shu et al., 4 Dec 2025).

4. Adaptive, Focal, and Hierarchical Fusion (FASTer)

FASTer applies geometric token fusion in the context of efficient temporal 3D object detection (Dang et al., 28 Feb 2025). Key aspects:

Adaptive Scaling: Each token's 3D position $p_t^i \in \mathbb{R}^3$ is mapped via an MLP and scored for "focus" (scalar weight $\alpha_t^i$) through learnable position encodings and softmax normalization. $\alpha_t^i$ captures boundary, salience, or motion cues, and thus the geometric informativeness of each token.

Focal Token Selection: The $K_t \approx r N_t$ tokens with the highest $\alpha_t^i$ per frame are retained ("focal tokens"), under top-$K$ or thresholding rules.
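The top-$K$ variant of this selection is a short sketch (the scores here stand in for the learned $\alpha_t^i$ weights):

```python
import numpy as np

def select_focal_tokens(tokens: np.ndarray, scores: np.ndarray, r: float) -> np.ndarray:
    """Keep the K = ceil(r * N) tokens with the highest focus scores."""
    k = max(1, int(np.ceil(r * len(tokens))))
    keep = np.argsort(scores)[-k:][::-1]      # indices of the top-K scores
    return tokens[keep]
```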

Grouped Hierarchical Fusion: Focal tokens are grouped across contiguous frames. Within each group, multi-head self-attention fuses spatial and short-term temporal information, producing a reduced set of summary tokens. Hierarchical progression over groups enables further condensing while allowing global spatial-temporal context mixing at much-reduced quadratic cost.

This strategy reduces attention complexity from $O((TN)^2)$ to $O(T \cdot g \cdot (rN)^2)$, delivering roughly 3.6% of the compute of full attention in practical schedules while improving (or at worst matching) accuracy. FASTer outperforms prior temporal detectors on Waymo, achieving 74.2% mAPH at 9.2 FPS inference, with ablation studies confirming the necessity of both geometry-aware scaling and hierarchical grouping (Dang et al., 28 Feb 2025).

5. Confidence- and Norm-Preserving Merging (Co-Me and ToFu)

Confidence-Guided Token Merging (Co-Me): Rather than solely using similarity, Co-Me (Chen et al., 18 Nov 2025) distills a lightweight “confidence predictor” to mimic the model’s own semantic uncertainty signal. During inference, tokens are grouped and those with confidence below a threshold are merged (averaged), informed by a binary merge mask derived from the learned confidence rankings. Attention bias is corrected for merged group sizes.
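A heavily simplified sketch of the idea (my illustration, not Co-Me's implementation: the real method distills a confidence predictor and handles many groups, whereas this collapses all low-confidence tokens into a single summary token and returns group sizes for attention bias correction):

```python
import numpy as np

def confidence_guided_merge(tokens: np.ndarray, confidence: np.ndarray,
                            threshold: float):
    """Keep high-confidence tokens; average all low-confidence tokens into one
    summary token. Returns (reduced tokens, group sizes) so attention scores
    can later be bias-corrected for merged group size."""
    keep_mask = confidence >= threshold
    kept = tokens[keep_mask]
    sizes = np.ones(len(kept))
    if (~keep_mask).any():
        pooled = tokens[~keep_mask].mean(axis=0, keepdims=True)
        kept = np.concatenate([kept, pooled], axis=0)
        sizes = np.concatenate([sizes, [float((~keep_mask).sum())]])
    return kept, sizes
```

The returned `sizes` vector corresponds to the attention-bias correction mentioned above: a merged token representing $n$ originals should weigh roughly $n$ times as much in attention.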

Empirically, Co-Me achieves up to $11.3\times$ speedup on 512-frame VGGT and $7.2\times$ on MapAnything, with negligible degradation in depth, pose, or point-cloud accuracy metrics. Ablations demonstrate that similarity-based merging is inferior to confidence-based ranking due to less reliable coverage of critical geometry (Chen et al., 18 Nov 2025).

Norm-Preserving Spherical Merging (Token Fusion/ToFu): ToFu (Kim et al., 2023) addresses the "norm collapse" seen in average merging. It applies Spherical Linear Interpolation (SLERP) for fusing two tokens and generalizes to MLERP for $N$-way merges: $\text{SLERP}(x_i, x_j; \alpha) = \frac{\sin((1-\alpha)\theta)}{\sin\theta}\, x_i + \frac{\sin(\alpha\theta)}{\sin\theta}\, x_j$, where $\theta = \arccos\!\left(\frac{\langle x_i, x_j \rangle}{\|x_i\|\,\|x_j\|}\right)$. MLERP iterates SLERP to approximate the Riemannian barycenter, then scales by the weighted norm, thus ensuring that feature distributions and norm statistics stay consistent throughout fusion.
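The two-token SLERP formula translates directly to code; the near-parallel fallback to linear interpolation is a standard numerical safeguard, not something specified in the paper:

```python
import numpy as np

def slerp(x_i: np.ndarray, x_j: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Spherical linear interpolation between two token vectors."""
    cos_theta = np.clip(x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j)),
                        -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-6:                       # near-parallel: fall back to lerp
        return (1 - alpha) * x_i + alpha * x_j
    s = np.sin(theta)
    return (np.sin((1 - alpha) * theta) / s) * x_i + (np.sin(alpha * theta) / s) * x_j
```

For two unit vectors, the SLERP midpoint stays on the unit sphere, whereas plain averaging shrinks the norm, which is exactly the collapse ToFu is designed to avoid.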

ToFu hybridizes pruning and merging according to measured sensitivity and linearity at each layer, pruning in early layers and applying geometric MLERP in later ones. This approach yields consistent improvements on ImageNet classification and generative modeling benchmarks, with up to +1.8% absolute accuracy gain over simple average merging (Kim et al., 2023).

6. Comparative Table: Geometric Token Fusion Strategies

| Method/Paper | Core Geometric Mechanism | Empirical Speedup / Drop |
| --- | --- | --- |
| CubistMerge (Gong et al., 26 Sep 2025) | Grid-preserving bipartite/adjacent merge with max-per-dimension fusion | 1.25×, 0.7% mIoU (SAM-H); 1.15×, 0% Top-1 (DeiT-B) |
| LiteVGGT (Shu et al., 4 Dec 2025) | Geometry-aware anchor scoring, cached merging | 10×, negligible Chamfer/accuracy drop |
| FASTer (Dang et al., 28 Feb 2025) | Adaptive geometry-aware scaling, focal/grouped fusion | 74.2% mAPH, ~3× FLOP reduction |
| Co-Me (Chen et al., 18 Nov 2025) | Confidence-guided selective merge, group attention | 2–11×, ΔChamfer ≤ 0.015 cm |
| ToFu (Kim et al., 2023) | SLERP/MLERP norm-preserving geometric averaging | +1–2% Top-1 vs. averaging, ≤1.8× speedup |

7. Empirical Performance and Design Trade-offs

All geometric fusion strategies achieve major reductions in both runtime and memory use, typically yielding $1.1$–$10\times$ speedups at less than a 1–3% drop in core metrics (mIoU, Chamfer distance, top-1 accuracy, depth error), depending on the aggressiveness of fusion and the baseline model's spatial redundancy.

Key empirical observations include:

  • Grid/topology preservation is essential for models that rely on structured positional encoding.
  • The choice of merging function (e.g., max-magnitude, SLERP/MLERP, simple average) affects the retention of salient features, especially under heavy reduction.
  • Geometry/uncertainty awareness (via edge maps, variance maps, or model confidence) helps avoid fusing tokens corresponding to physically or semantically important regions (edges, object boundaries).
  • Caching merge indices and unmerging on demand reduces overhead, especially in deep or sequential models.

A plausible implication is that, as self-attention models are further scaled and deployed in real-time or large-scale environments, geometric token fusion will become increasingly important as a practical enabling mechanism.

8. Context and Significance in Model Optimization

Geometric Token Fusion generalizes and subsumes standard token pruning and merging by introducing geometric, statistical, and confidence-informed criteria for token reduction. This approach enables transformer-based models to scale to large images, long temporal sequences, or massive multi-view setups (thousands of frames/tokens) previously intractable under quadratic self-attention costs.

By bridging architectural constraints (e.g., need for grid outputs, 2D/3D attention, and position encoding) and the information redundancy abundant in pixel-aligned or spatially correlated regions, these methods define a unifying paradigm for efficient transformer design in computer vision and embodied geometry tasks. Empirical results demonstrate that geometric token fusion can be adopted without retraining, or as a lightweight plug-in, making it a practical and general strategy across a wide spectrum of vision transformer backbones.
