
CubistMerge: Token Merging for Vision Transformers

Updated 5 February 2026
  • CubistMerge is a training-free token merging method for vision transformers that maintains a strict 2D token grid using localized adjacent token pairing.
  • It selectively reduces redundancy in low-information image regions while preserving key spatial and positional relationships required for window attention.
  • Empirical evaluations demonstrate state-of-the-art speed–accuracy trade-offs across classification, detection, and segmentation tasks without any retraining.

CubistMerge is a training-free token merging method tailored to vision transformers (ViTs) with spatially structured backbones. It reconciles two competing requirements: maintaining spatial integrity, which is crucial for token pipelines that rely on window attention and 2D positional encodings, and selectively reducing redundancy in the least informative image regions. By enforcing a strict two-dimensional token grid and adopting a localized, spatially aware pairing of adjacent tokens, CubistMerge achieves state-of-the-art speed–accuracy trade-offs on diverse ViT architectures without retraining or architectural change (Gong et al., 26 Sep 2025).

1. Objective and Architectural Motivation

Contemporary ViT backbones, such as those employing window attention, decomposed relative positional embeddings (e.g., in SAM), and Rotary Position Embeddings (RoPE, as in DINOv3), structure their intermediate feature maps as 2D spatial grids. This spatial arrangement underpins both architectural efficiency and effectiveness. Existing global token merging or pruning strategies disrupt this grid, impeding the applicability of spatial mechanisms and degrading performance. CubistMerge addresses this by designing a reduction process that strictly preserves the token grid, ensuring continued compatibility with spatial backbones, and by focusing the reduction on redundant or low-information areas, thereby minimizing task metric loss.

2. CubistMerge Algorithmic Components

2.1 Token Similarity and Local Pair Selection

Given a feature map $X \in \mathbb{R}^{H \times W \times d}$ at layer $\ell$ (with $H \times W$ spatial tokens $x_1, \ldots, x_N$, $N = H \cdot W$), the objective is to reduce the number of tokens from $H \times W$ to $(H - r_h) \times (W - r_w)$ by merging $r_w$ tokens per row and $r_h$ tokens per column.

  • Similarity Score: Adjacent tokens $(i, j)$ within the same row (horizontal) or column (vertical) are connected. Similarity is the dot product $S_{i,j} = \langle x_i, x_j \rangle$.
  • Bipartite Nomination: Tokens are alternately partitioned into "sources" and "destinations" along each row or column. Each source nominates its most similar adjacent neighbor: $j^*(i) = \arg\max_{j \in \mathrm{Adj}(i)} S_{i,j}$. Collecting and sorting nominations, the top-$k$ merges are selected, with $k = H \cdot r_w$ for the row pass and $k = (W - r_w) \cdot r_h$ for the column pass.
  • Redundancy Focus: This strategy ensures maximal local redundancy exploitation while preventing merge chains longer than two (i.e., each token merges at most once per phase).
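The nomination-and-selection step above can be sketched for a single row. This is an illustrative reading of the paper's description, not the authors' reference code; the even/odd source–destination split and the greedy top-$k$ selection are assumptions about implementation details the text leaves open.

```python
import numpy as np

def select_row_merges(row, r_w):
    """Sketch of CubistMerge-style local pair selection within one row.

    row : (W, d) array of token features; r_w : number of pairs to merge.
    Tokens at even indices act as sources (an assumed bipartite split);
    each source nominates its more similar adjacent neighbour, and the
    r_w most-similar nominations are kept, with each token merging at
    most once, so no merge chain exceeds two tokens.
    """
    W, d = row.shape
    nominations = []  # (similarity, source_idx, dest_idx)
    for i in range(0, W - 1, 2):  # even-index tokens as sources
        candidates = [j for j in (i - 1, i + 1) if 0 <= j < W]
        sims = [row[i] @ row[j] for j in candidates]
        best = int(np.argmax(sims))
        nominations.append((sims[best], i, candidates[best]))
    nominations.sort(key=lambda t: -t[0])  # most similar first
    chosen, used = [], set()
    for s, i, j in nominations:
        if len(chosen) == r_w:
            break
        if i not in used and j not in used:  # each token merges once
            chosen.append((i, j))
            used.update((i, j))
    return chosen
```

Because only adjacent indices are ever paired, every selected merge stays within the row, which is what keeps the grid reconstructible after reduction.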

2.2 2D Grid-Preserving Reduction

Reduction proceeds in two sequential passes:

  1. Horizontal Pass: Within each of the $H$ rows, merge $r_w$ tokens, yielding $H \times (W - r_w)$ tokens.
  2. Vertical Pass: On the resulting grid, merge $r_h$ tokens per column, yielding $(H - r_h) \times (W - r_w)$ tokens.

At each stage, only adjacency within the row or column is respected, directly preserving 2D grid structure. Thus, window boundaries remain intact, and positional relationships among tokens are unaffected.
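The two sequential passes can be sketched as follows. This is a simplified illustration of the grid-shape bookkeeping, not the reference implementation: it merges each row's (or column's) most similar adjacent pair greedily rather than via the paper's bipartite nomination, and it uses a mean as a stand-in merge rule.

```python
import numpy as np

def merge_pass(grid, r, axis, merge=lambda a, b: (a + b) / 2):
    """One reduction pass along `axis` (1 = horizontal, within rows;
    0 = vertical, within columns). Greedy simplification: repeatedly
    merge the most similar adjacent pair; the paper's bipartite
    nomination additionally prevents chains longer than two."""
    if axis == 0:
        grid = np.swapaxes(grid, 0, 1)  # treat columns as rows
    out = []
    for row in grid:
        row = list(row)
        for _ in range(r):
            sims = [row[i] @ row[i + 1] for i in range(len(row) - 1)]
            i = int(np.argmax(sims))  # most similar adjacent pair
            row[i:i + 2] = [merge(row[i], row[i + 1])]
        out.append(row)
    out = np.asarray(out)
    return np.swapaxes(out, 0, 1) if axis == 0 else out

def cubist_reduce(grid, r_h, r_w):
    """Horizontal pass then vertical pass: (H, W, d) -> (H-r_h, W-r_w, d)."""
    grid = merge_pass(grid, r_w, axis=1)  # H x (W - r_w)
    grid = merge_pass(grid, r_h, axis=0)  # (H - r_h) x (W - r_w)
    return grid
```

Note that the output of both passes is again a dense rectangular grid, so downstream window attention and 2D positional schemes see an ordinary, smaller feature map.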

2.3 Spatial-Aware Merging and Positional Encodings

Strict adjacency merging ensures that relative spatial ordering is preserved. After reduction, surviving tokens are carried over in original left-to-right (for rows) and top-to-bottom (for columns) order. As a result, decomposed relative positional embeddings, RoPE, and fixed 2D embeddings can be directly resized and reapplied on the reduced $(H - r_h) \times (W - r_w)$ grid. Window attention mechanisms require no further modification, as window-to-token alignments persist.
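The paper states that fixed 2D embeddings can be resized to the reduced grid but does not prescribe the resampling; bilinear interpolation is one plausible, commonly used choice for ViT positional grids, sketched below as an assumption rather than the authors' method.

```python
import numpy as np

def resize_pos_embed(pos, new_h, new_w):
    """Bilinearly resample a fixed 2D positional-embedding grid
    (H, W, d) onto the reduced (new_h, new_w) grid. Illustrative
    sketch; the interpolation scheme is an assumed choice."""
    H, W, d = pos.shape
    ys = np.linspace(0, H - 1, new_h)   # target row coordinates
    xs = np.linspace(0, W - 1, new_w)   # target column coordinates
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]       # vertical blend weights
    wx = (xs - x0)[None, :, None]       # horizontal blend weights
    top = pos[y0][:, x0] * (1 - wx) + pos[y0][:, x1] * wx
    bot = pos[y1][:, x0] * (1 - wx) + pos[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

When the target size equals the source size, the resampling reduces to the identity, so applying it unconditionally at every layer is safe.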

2.4 Max-Magnitude-Per-Dimension Token Representation

Standard merging schemes (e.g., ToMe) use weighted averages and corresponding attention rescaling. CubistMerge instead defines the merged token $t_m$ by selecting, in each feature dimension $d$, the element of largest magnitude:

$$t_m[d] = \begin{cases} x_i[d] & \text{if } |x_i[d]| \geq |x_j[d]| \\ x_j[d] & \text{otherwise} \end{cases}$$

This approach preserves the largest activation and its sign in each coordinate, avoids magnitude attenuation, and eliminates the need for further attention normalization or bookkeeping. Empirical ablation demonstrates consistent outperformance over both weighted averaging and a "max-norm-vector" baseline that retains an entire token based on largest norm [(Gong et al., 26 Sep 2025), Table 2(b)].
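The rule in the equation above is elementwise and parameter-free, so it reduces to a single vectorized select:

```python
import numpy as np

def max_magnitude_merge(x_i, x_j):
    """Merge two token vectors by keeping, per feature dimension, the
    element of larger magnitude, sign included. Parameter-free: no
    attention rescaling or size bookkeeping is needed afterwards."""
    return np.where(np.abs(x_i) >= np.abs(x_j), x_i, x_j)
```

For example, merging `[3.0, -1.0, 0.5]` with `[-2.0, 4.0, -0.5]` keeps the larger-magnitude entry in each coordinate, including its sign.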

3. Comparison to Prior Token Reduction Methods

| Method | Grid Preservation | Merge Mechanism | Fine-tuning Required | Limiting Factors |
| --- | --- | --- | --- | --- |
| ToMe | No | Global bipartite, weighted avg | No | Breaks window/positional schemes |
| Expedite | Yes | K-means on superpixel centroids, uniform | No | Ignores info density, high loss early |
| ALGM, AiluRus | Varies | Token pruning or discarding | Yes | Breaks dense output requirements, needs retraining |
| CubistMerge | Yes | Local adjacent, max-per-dim | No | Local only, heuristic merge |

CubistMerge is unique in combining strict 2D structured reduction, redundancy-focused local merging, and a parameter-free per-dimension maximum rule, all without fine-tuning or additional modules. Competing strategies either destroy grid structure, are insensitive to local information density, or require retraining (Gong et al., 26 Sep 2025).

4. Empirical Evaluation Across Tasks and Architectures

CubistMerge was evaluated on image classification, object detection, instance segmentation, and semantic segmentation across spatial and non-spatial ViT backbones. Principal findings include:

  • Classification: DINOv3-ViT7B at $r_h = r_w = 1$, $\ell = 10$ achieves 87.7% top-1/98.2% top-5 (1.12× speedup, 1213.9 GFLOPS) versus baseline 88.0%/98.4% (1.00×, 1349.9 GFLOPS). DeiT-B, after one epoch of fine-tuning, attains 81.82% top-1 at 1.15× speedup with no accuracy drop (baseline 81.82%) (Gong et al., 26 Sep 2025).
  • Detection/Segmentation: On SAM-H, a 1.25× speedup is obtained with only a 0.7% mIoU drop (versus ~3% for Expedite). For DINOv3-ViT7B detection at $r_h = r_w = 0.1$, $\ell = 20$, AP drops by 0.5 (56.9 vs. 57.4) at 1.08× (versus ToMe at 52.9 AP). Mask2Former shows a consistent 1.2–1.3× speedup with <1% performance loss.
  • Semantic Segmentation: On Cityscapes, CubistMerge achieves 75.44% mIoU at 1.71× speedup, closely matching or outperforming ALGM and others.

Across all tested settings, CubistMerge matches or establishes state-of-the-art off-the-shelf speed–accuracy trade-offs at $>1.1\times$ speedup, without retraining.

5. Practical Constraints and Limitations

CubistMerge's reliance on local adjacency means it cannot directly merge redundant tokens that are spatially distant—a situation arising with repeating patterns across the image. Bipartite nomination and merging are fundamentally local, precluding global optimality of pairings. The max-per-dimension merge rule, while effective, is heuristic and possibly non-optimal compared to a learned selection mechanism. Dense output requirements are strictly preserved; tokens are merged, not discarded, maintaining compatibility with dense prediction tasks (Gong et al., 26 Sep 2025).

6. Prospective Extensions and Open Directions

Potential future improvements posited by the original authors include:

  • Multi-scale grid mergers allowing coarser-resolution merging to address non-local redundancy.
  • Learned or attention-guided similarity metrics to adapt merges dynamically to semantic content.
  • Combined pruning and merging strategies that drop entire low-information windows while merging within high-information areas.
  • Extension to video and multimodal transformers via 3D or spatio-temporal adjacency graphs.

A plausible implication is that integrating non-local merging or learned heuristics could further improve performance, albeit at the cost of implementation complexity—suggesting fruitful directions for subsequent research (Gong et al., 26 Sep 2025).
