CubistMerge: Token Merging for Vision Transformers
- CubistMerge is a training-free token merging method for vision transformers that maintains a strict 2D token grid using localized adjacent token pairing.
- It selectively reduces redundancy in low-information image regions while preserving key spatial and positional relationships required for window attention.
- Empirical evaluations demonstrate state-of-the-art speed–accuracy trade-offs across classification, detection, and segmentation tasks without any retraining.
CubistMerge is a training-free token merging method tailored to vision transformers (ViTs) with spatially structured backbones. It reconciles two requirements: maintaining spatial integrity, which is crucial for pipelines using window attention and 2D positional encodings, and selectively reducing redundancy in the least-informative image regions. By enforcing a strict two-dimensional token grid and adopting a localized, spatially aware pairing of adjacent tokens, CubistMerge achieves state-of-the-art speed–accuracy trade-offs on diverse ViT architectures without retraining or architectural change (Gong et al., 26 Sep 2025).
1. Objective and Architectural Motivation
Contemporary ViT backbones, such as those employing window attention, decomposed relative positional embeddings (e.g., in SAM), and Rotary Position Embeddings (RoPE, as in DINOv3), structure their intermediate feature maps as 2D spatial grids. This spatial arrangement underpins both architectural efficiency and effectiveness. Existing global token merging or pruning strategies disrupt this grid, impeding the applicability of spatial mechanisms and degrading performance. CubistMerge addresses this by designing a reduction process that strictly preserves the token grid, ensuring continued compatibility with spatial backbones, and by focusing the reduction on redundant or low-information areas, thereby minimizing task metric loss.
2. CubistMerge Algorithmic Components
2.1 Token Similarity and Local Pair Selection
Given a feature map $X \in \mathbb{R}^{H \times W \times C}$ at layer $\ell$ (with $H \times W$ spatial tokens), the objective is to reduce the number of tokens from $H \times W$ to $H' \times W'$ by merging $W - W'$ tokens per row and $H - H'$ tokens per column.
- Similarity Score: Adjacent tokens within the same row (horizontal) or column (vertical) are connected. Similarity between tokens $x_i$ and $x_j$ is the dot product $s(x_i, x_j) = x_i^\top x_j$.
- Bipartite Nomination: Tokens are alternately partitioned into "sources" and "destinations" along each row or column. Each source $x_s$ nominates its most similar adjacent neighbor, $d^*(s) = \arg\max_{d \in \mathcal{N}(s)} s(x_s, x_d)$. Collecting and sorting nominations, the top-$k$ merges are selected, with $k = W - W'$ per row in the row-merge phase and $k = H - H'$ per column in the column-merge phase.
- Redundancy Focus: This strategy ensures maximal local redundancy exploitation while preventing merge chains longer than two (i.e., each token merges at most once per phase).
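The nomination-and-selection step for a single row can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the even/odd source–destination split and the function name `select_row_merges` are illustrative conventions, not taken from the paper.

```python
import numpy as np

def select_row_merges(row_tokens: np.ndarray, k: int):
    """Pick the k most-similar adjacent (source, destination) pairs in one row.

    row_tokens: (W, C) array of token features.
    k: number of merges to perform in this row (k = W - W').
    Even-indexed tokens act as sources, odd-indexed as destinations
    (an assumed alternating partition).
    """
    W = row_tokens.shape[0]
    nominations = []  # (similarity, source_idx, dest_idx)
    for s in range(0, W, 2):  # sources
        best = None
        for d in (s - 1, s + 1):  # adjacent destinations only
            if 0 <= d < W and d % 2 == 1:
                sim = float(row_tokens[s] @ row_tokens[d])  # dot-product similarity
                if best is None or sim > best[0]:
                    best = (sim, s, d)
        if best is not None:
            nominations.append(best)
    # Keep the top-k nominations; each destination absorbs at most one source,
    # so no merge chain grows longer than two tokens.
    nominations.sort(key=lambda t: t[0], reverse=True)
    chosen, used_dsts = [], set()
    for sim, s, d in nominations:
        if len(chosen) == k:
            break
        if d not in used_dsts:
            chosen.append((s, d))
            used_dsts.add(d)
    return chosen
```

Because each source nominates only an adjacent destination and each destination is used at most once, the selected pairs are disjoint, matching the "each token merges at most once per phase" property.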
2.2 2D Grid-Preserving Reduction
Reduction proceeds in two sequential passes:
- Horizontal Pass: Within each of the $H$ rows, merge $W - W'$ token pairs, yielding $H \times W'$ tokens.
- Vertical Pass: On the resultant grid, merge $H - H'$ token pairs per column, yielding $H' \times W'$ tokens.
At each stage, only adjacency within the row or column is respected, directly preserving 2D grid structure. Thus, window boundaries remain intact, and positional relationships among tokens are unaffected.
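The shape bookkeeping of the two sequential passes can be sketched as below. A naive leftmost/topmost pairing stands in for the similarity-driven selection of Section 2.1, and all function names are illustrative assumptions.

```python
import numpy as np

def merge_pairs_1d(tokens, r, merge_fn):
    """Merge the first r disjoint adjacent pairs (0,1), (2,3), ... of a
    1D token sequence; remaining tokens pass through unchanged."""
    out, i, merged = [], 0, 0
    while i < len(tokens):
        if merged < r and i + 1 < len(tokens):
            out.append(merge_fn(tokens[i], tokens[i + 1]))
            i += 2
            merged += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

def reduce_grid(x: np.ndarray, r_h: int, r_v: int, merge_fn):
    """Two-pass grid reduction sketch for x of shape (H, W, C):
    the horizontal pass merges r_h pairs per row (W -> W - r_h), then the
    vertical pass merges r_v pairs per column (H -> H - r_v). Adjacency is
    only ever within a row or column, so the 2D grid stays intact."""
    # Horizontal pass over rows.
    rows = [merge_pairs_1d(list(x[i]), r_h, merge_fn) for i in range(x.shape[0])]
    x = np.stack([np.stack(r) for r in rows])          # (H, W - r_h, C)
    # Vertical pass over columns of the reduced grid.
    cols = [merge_pairs_1d(list(x[:, j]), r_v, merge_fn) for j in range(x.shape[1])]
    return np.stack([np.stack(c) for c in cols], axis=1)  # (H - r_v, W - r_h, C)
```

Any pairwise merge rule can be plugged in as `merge_fn`, including the max-magnitude-per-dimension rule of Section 2.4.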
2.3 Spatial-Aware Merging and Positional Encodings
Strict adjacency merging ensures that relative spatial ordering is preserved. After reduction, surviving tokens are carried over in original left-to-right (for rows) and top-to-bottom (for columns) order. As a result, decomposed relative positional embeddings, RoPE, and fixed 2D embeddings can be directly resized and reapplied on the reduced grid. Window attention mechanisms require no further modification, as window-to-token alignments persist.
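For fixed 2D positional embeddings, resizing to the reduced grid amounts to a grid resample. Below is a minimal nearest-neighbor sketch; bilinear interpolation is more common in practice, and the helper name is an assumption for illustration only.

```python
import numpy as np

def resize_pos_embed(pos: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Nearest-neighbor resize of a fixed 2D positional-embedding grid
    of shape (H, W, C) to the reduced grid (new_h, new_w, C)."""
    H, W, _ = pos.shape
    rows = (np.arange(new_h) * H / new_h).astype(int)  # source row per target row
    cols = (np.arange(new_w) * W / new_w).astype(int)  # source col per target col
    return pos[rows][:, cols]
```

Because CubistMerge keeps surviving tokens in their original left-to-right and top-to-bottom order, the resampled embeddings line up with the reduced grid without any re-indexing.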
2.4 Max-Magnitude-Per-Dimension Token Representation
Standard merging schemes (e.g., ToMe) use weighted averages and corresponding attention rescaling. CubistMerge instead defines the merged token $\hat{x}$ of a pair $(x_i, x_j)$ by selecting, in each feature dimension $d$, the element of largest magnitude:

$$\hat{x}_d = \begin{cases} x_{i,d}, & |x_{i,d}| \geq |x_{j,d}| \\ x_{j,d}, & \text{otherwise.} \end{cases}$$

This preserves the largest activation and its sign in each coordinate, avoids magnitude attenuation, and eliminates the need for attention renormalization or size bookkeeping. Ablations show this rule consistently outperforms both weighted averaging and a "max-norm-vector" baseline that retains the entire token with the largest norm [(Gong et al., 26 Sep 2025), Table 2(b)].
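The per-dimension maximum rule is a one-liner in NumPy (an illustrative sketch, not the reference implementation):

```python
import numpy as np

def max_magnitude_merge(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Merge two tokens by keeping, per feature dimension, whichever
    entry has the larger magnitude, preserving its sign."""
    return np.where(np.abs(a) >= np.abs(b), a, b)
```

Note the result is generally not equal to either input token: each coordinate is chosen independently, so the merged vector mixes coordinates from both tokens while never shrinking any activation toward zero.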
3. Comparison to Prior Token Reduction Methods
| Method | Grid Preservation | Merge Mechanism | Fine-tuning Required | Limiting Factors |
|---|---|---|---|---|
| ToMe | No | Global bipartite, weighted avg | No | Breaks window/positional schemes |
| Expedite | Yes | K-means on superpixel centroids, uniform | No | Ignores info density, high loss early |
| ALGM, AiluRus | Varies | Token pruning or discarding | Yes | Breaks dense output req., retraining |
| CubistMerge | Yes | Local adjacent, max-per-dim | No | Local only, heuristic merge |
CubistMerge is unique in combining strict 2D structured reduction, redundancy-focused local merging, and a parameter-free per-dimension maximum rule, all without fine-tuning or additional modules. Competing strategies either destroy grid structure, are insensitive to local information density, or require retraining (Gong et al., 26 Sep 2025).
4. Empirical Evaluation Across Tasks and Architectures
CubistMerge was evaluated on image classification, object detection, instance segmentation, and semantic segmentation across spatial and non-spatial ViT backbones. Principal findings include:
- Classification: DINOv3-ViT7B achieves 87.7% top-1/98.2% top-5 (1.12× speedup, 1213.9 GFLOPS) versus the baseline's 88.0%/98.4% (1.00×, 1349.9 GFLOPS). DeiT-B, after one epoch of fine-tuning, attains 81.82% top-1 at 1.15× speedup with no accuracy drop (baseline 81.82%) (Gong et al., 26 Sep 2025).
- Detection/Segmentation: On SAM-H, a 1.25× speedup is obtained with only a 0.7% mIOU drop (versus ~3% for Expedite). For DINOv3-ViT7B detection, AP drops by 0.5 (56.9 vs. 57.4) at 1.08× speedup (versus 52.9 AP for ToMe). Mask2Former shows a consistent 1.2–1.3× speedup with <1% performance loss.
- Semantic Segmentation: On Cityscapes, CubistMerge achieves 75.44% mIOU at 1.71× speedup, closely matching or outperforming ALGM and others.
Across all tested settings, CubistMerge matches or establishes the state-of-the-art off-the-shelf speed–accuracy trade-off at the reported speedups, without retraining.
5. Practical Constraints and Limitations
CubistMerge's reliance on local adjacency means it cannot directly merge redundant tokens that are spatially distant—a situation arising with repeating patterns across the image. Bipartite nomination and merging are fundamentally local, precluding global optimality of pairings. The max-per-dimension merge rule, while effective, is heuristic and possibly non-optimal compared to a learned selection mechanism. Dense output requirements are strictly preserved; tokens are merged, not discarded, maintaining compatibility with dense prediction tasks (Gong et al., 26 Sep 2025).
6. Prospective Extensions and Open Directions
Potential future improvements posited by the original authors include:
- Multi-scale grid mergers allowing coarser-resolution merging to address non-local redundancy.
- Learned or attention-guided similarity metrics to adapt merges dynamically to semantic content.
- Combined pruning and merging strategies that drop entire low-information windows while merging within high-information areas.
- Extension to video and multimodal transformers via 3D or spatio-temporal adjacency graphs.
A plausible implication is that integrating non-local merging or learned heuristics could further improve performance, albeit at the cost of implementation complexity—suggesting fruitful directions for subsequent research (Gong et al., 26 Sep 2025).