
Token-Merging Remedy in Transformers

Updated 29 December 2025
  • Token-Merging Remedy is a strategy that reduces token count in transformers by merging redundant representations, thereby improving computational efficiency.
  • It employs metrics such as cosine similarity, energy scores, and attention entropy to effectively select candidates for merging while retaining essential information.
  • Empirical studies in vision, speech, and language tasks demonstrate significant FLOPs savings and latency improvements, with minimal impact on output accuracy.

Token-Merging Remedy

Token-merging remedy encompasses a class of algorithmic strategies that dynamically reduce the number of tokens processed by transformer or state-space model architectures. This approach is motivated by the quadratic scaling of self-attention and similar operations with respect to the sequence length, placing fundamental constraints on throughput and deployment—particularly in vision, language, and speech models. Token merging aims to eliminate redundant or less informative representations in a structured or content-aware manner, accelerating inference and training while minimally impacting performance, often without requiring re-training.

1. Principles and Motivations

The explosion of transformer-based architectures in vision and speech has led to a computational bottleneck: self-attention layers scale as O(N^2 d) in the number of tokens N and hidden dimension d (Li et al., 2023, Park et al., 19 Aug 2025, Haurum et al., 2024). In many domains, such as ASR, image classification, semantic segmentation, or diffusion-based synthesis, input sequences contain highly redundant adjacent features—for example, acoustic frames at 10 ms intervals, image patches, or subword tokens in code.

The core motivation is that, while tokenization is necessary for high expressivity and granularity, most tasks involve significant local or semantic redundancy. Dynamic merging seeks to exploit this by:

  • Reducing the effective token count N → N′ ≪ N before high-cost operations.
  • Merging only those tokens that encode redundant or interchangeable information, as determined by architectural features (e.g., self-attention keys), information-theoretic scores, or domain priors.
  • Ensuring that, post-merging, either the model’s outputs remain dense via "unmerging" or that task-relevant information is preserved in the merged sequence.

Token merging thereby enables practical deployment of large ViTs, SSMs, and diffusion models under latency and memory constraints.
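As a concrete toy instance of these ideas, the following sketch scores adjacent token pairs by cosine similarity, merges the most similar non-overlapping pairs by averaging, and leaves the rest untouched. This is an illustrative minimal pass, not any specific published algorithm; all names are hypothetical.

```python
import numpy as np

def merge_tokens(x, keep_ratio=0.5):
    """Reduce an (N, d) token matrix by averaging the most similar adjacent pairs.

    Toy illustration of content-aware merging: score adjacent pairs by cosine
    similarity, greedily merge the top non-overlapping pairs, keep the rest.
    """
    n, d = x.shape
    # Cosine similarity between each token and its right neighbour.
    u = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = (u[:-1] * u[1:]).sum(axis=1)          # shape (N-1,)
    n_merge = int((1 - keep_ratio) * n)
    # Greedily pick the most similar non-overlapping adjacent pairs.
    order = np.argsort(-sim)
    used, pairs = set(), []
    for i in order:
        if i in used or i + 1 in used:
            continue
        pairs.append(i)
        used.update((i, i + 1))
        if len(pairs) == n_merge:
            break
    merged, skip = [], set()
    for i in range(n):
        if i in skip:
            continue
        if i in pairs:
            merged.append((x[i] + x[i + 1]) / 2)  # average the pair
            skip.add(i + 1)
        else:
            merged.append(x[i])
    return np.stack(merged)
```

Real systems replace each of the three stages—the similarity metric, the selection rule, and the averaging step—with the more refined mechanisms described in the following sections.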

2. Core Methodologies

Token-merging remedies deploy a diverse set of algorithms, unified by key steps: (1) quantifying token redundancy, (2) selecting merge candidates, and (3) updating token representations. Notable methodological axes include:

Token Similarity and Informativeness

  • Cosine Similarity of Keys: Most merging approaches operate on internal representations, e.g., the attention keys K_i, using pairwise cosine similarity s_{i,j} = (K_i · K_j) / (‖K_i‖ ‖K_j‖) (Li et al., 2023, Haurum et al., 2024, Saghatchian et al., 1 Jan 2025).
  • Energy Scores: PiToMe (Tran et al., 2024) defines an "energy score" for each token based on intra-token graph structure, identifying large clusters of similar tokens as high-energy and prioritizing them for merging.
  • Domain-Driven Informativeness: MaMe (Park et al., 19 Aug 2025) leverages SSM state-transition parameters Δ_i as a direct quantification of each token's contribution; large Δ yields high informativeness and guards against merging.
  • Attention Norms and Entropy: QuickMerge++ (Liu et al., 16 Aug 2025) uses token-wise self-attention entropy to allocate computational budget, merging tokens with low entropy.
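Two of these signals are cheap to compute from quantities the model already produces. A sketch, assuming NumPy arrays for keys K of shape (N, d) and row-stochastic attention weights A of shape (N, N); function names are illustrative:

```python
import numpy as np

def key_cosine_similarity(K):
    """Pairwise cosine similarity s_ij = (K_i . K_j) / (||K_i|| ||K_j||)."""
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + 1e-8)
    return Kn @ Kn.T

def attention_entropy(A, eps=1e-8):
    """Per-token entropy of attention rows; low entropy suggests
    a concentrated, mergeable token."""
    return -(A * np.log(A + eps)).sum(axis=1)
```

High-similarity pairs and low-entropy tokens then become the candidates that a selection mechanism (threshold, top-k, or bipartite matching) actually merges.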

Merging Schedules and Selection Strategies

Selection mechanisms range from fixed top-k ratios and similarity thresholds (A-ToMe) to energy-ranked bipartite soft matching (PiToMe), hierarchical agglomerative clustering (ATC), greedy grid-constrained selection (CubistMerge), and differentiable Gumbel-softmax selection (QuickMerge++, DTEM), applied either at fixed layers or adaptively per input.

Feature Update and Representation Rules

  • Averaging: The simplest merging step replaces two tokens by their (possibly weighted) average (Li et al., 2023, Götz et al., 2024).
  • Norm-Preserving Interpolation: ToFu (Kim et al., 2023) introduces MLERP, a spherical interpolation that preserves token feature norms, mitigating distributional shift during merging.
  • Max-Magnitude per Dimension: CubistMerge (Gong et al., 26 Sep 2025) selects, for each embedding dimension, the value with maximal absolute magnitude from the merge group, preserving salient features for spatial tasks.
  • Learnable or Decoupled Embeddings: DTEM (Lee et al., 2024) learns a lightweight embedding module specifically for merging, decoupled from the ViT’s main features, trained with a differentiable relaxation.
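Three of these update rules can be sketched for a pair of d-dimensional tokens as below. The norm-preserving variant here is a simplified norm-matched average standing in for ToFu's MLERP, not the paper's exact formulation; all names are illustrative.

```python
import numpy as np

def merge_average(a, b, wa=0.5):
    """Plain (weighted) average; simple, but tends to shrink feature norms."""
    return wa * a + (1 - wa) * b

def merge_max_magnitude(a, b):
    """Per-dimension max-|value| selection (CubistMerge-style),
    preserving the most salient feature in each dimension."""
    return np.where(np.abs(a) >= np.abs(b), a, b)

def merge_norm_preserving(a, b):
    """Average the direction, then rescale to the mean input norm
    (a simplified stand-in for ToFu's MLERP)."""
    m = (a + b) / 2
    target = (np.linalg.norm(a) + np.linalg.norm(b)) / 2
    return m * target / (np.linalg.norm(m) + 1e-8)
```

The choice matters: plain averaging of dissimilar tokens shifts the feature-norm distribution the downstream layers were trained on, which is exactly the degradation the norm-preserving and max-magnitude rules are designed to avoid.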

3. Algorithmic Implementation and Complexity

Algorithmic frameworks differ in their target domain but follow a shared blueprint, as illustrated in the table below:

| Approach | Token Similarity Metric | Merge Selection Mechanism | Representation Update |
|---|---|---|---|
| A-ToMe (Li et al., 2023) | Adjacent key cosine similarity | Top-k ratio / threshold | Average |
| PiToMe (Tran et al., 2024) | Energy score from token graph | Energy-ranked BSM | Weighted average |
| ATC (Haurum et al., 2024) | Key cosine distance | Hierarchical agglomerative | Cluster centroid |
| CubistMerge (Gong et al., 26 Sep 2025) | Local path-graph similarity | Greedy top-k in 1D/2D grid | Max-per-dimension |
| QuickMerge++ (Liu et al., 16 Aug 2025) | Entropy from attention matrix | Softmax-Gumbel selection, cluster | Mass-weighted average |
| DTEM (Lee et al., 2024) | Decoupled learned embedding sim. | Differentiable soft bipartite | Soft merging (with scale) |

Complexity savings are proportional to the reduction in token count: self-attention cost drops from O(N^2 d) to O(N′^2 d), with N′ ≪ N after merging. Overhead varies—algorithms using global pairwise similarity incur O(N^2) cost per merge, while locality-constrained or cached approaches (Götz et al., 2024, Saghatchian et al., 1 Jan 2025) reduce this to O(kN) or amortize it over multiple timesteps in diffusion.
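The savings follow directly from the quadratic term; a quick back-of-the-envelope check with illustrative sequence and hidden sizes:

```python
def attention_cost(n, d):
    """Dominant self-attention cost, up to constant factors: O(N^2 d)."""
    return n * n * d

n, d = 1024, 768  # illustrative token count and hidden dimension
for keep in (1.0, 0.5, 0.25):
    n_kept = int(n * keep)
    ratio = attention_cost(n_kept, d) / attention_cost(n, d)
    print(f"keep {keep:.0%} of tokens -> attention cost ratio {ratio:.4f}")
```

Keeping half the tokens quarters the attention cost, and keeping a quarter reduces it sixteen-fold—which is why merging overhead of O(kN) or amortized cost is usually negligible against the quadratic term it removes.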

4. Empirical Effects Across Application Domains

Extensive empirical validation across vision, speech, language, and time series tasks demonstrates the effectiveness of token-merging remedies:

  • Automatic Speech Recognition (ASR): A-ToMe achieves 57% token reduction and a 1.70× GPU speedup on LibriSpeech with negligible WER impact when merging tokens above a cosine threshold τ = 0.85 (Li et al., 2023).
  • Image Recognition and Segmentation: PiToMe shows a 0.5% top-1 accuracy drop with 44% FLOPs saved on ViT-MAE-H for ImageNet, outperforming classical ToMe under equivalent budgets (Tran et al., 2024). On COCO instance segmentation, CubistMerge attains a 1.25× speedup with only a 0.7% mIoU drop off-the-shelf on SAM-H (Gong et al., 26 Sep 2025); Segformer++ achieves 61% higher throughput with <0.1 mIoU loss (Kienzle et al., 2024).
  • Diffusion Models and Generative Tasks: In SDXL and Meta-Flux, ToMA reduces wall-clock latency by 24% compared to ToMeSD when leveraging submodular selection and GPU-aligned implementation (Lu et al., 13 Sep 2025); ReToM provides a 6.2% FID improvement and CLIP score gain over baselines at equivalent or better speed via local-window representative merging (Lee et al., 17 Jul 2025).
  • State Space Models & Time Series: MaMe leverages SSM-derived informativeness scores and sequential order, achieving robust accuracy at aggressive reduction rates, even outperforming token pruning when up to 75% of tokens are merged (Park et al., 19 Aug 2025). Local merging in time series (k-neighborhood) maintains linear cost and enables speedups of up to 54× at <5% loss on Chronos (Götz et al., 2024).
  • Code Models: Merging subtokens within semantic units cuts FLOPs by up to 19%; on code-translation tasks (CodeT5), it even increases the CodeBLEU score by 2.47 points (Saad et al., 19 Jul 2025).

5. Theoretical Guarantees and Preservation Properties

  • Spectral Consistency: PiToMe proves that merging based on energy-ranked bipartite matching nearly preserves the full spectrum (eigenvalues) of the token similarity graph under mild assumptions, while classical ToMe’s random splitting does not (Tran et al., 2024).
  • Merging Error Bounds: ATM derives explicit bounds on the information loss incurred by averaging tokens, showing that the merging error is minimized when pairing tokens of small and large size rather than two large ones (Lee et al., 21 May 2025).
  • Norm and Distribution Preservation: ToFu's MLERP is designed to preserve the statistical distribution of token L2 norms after merging, avoiding scale shifts that can degrade network accuracy (Kim et al., 2023).

6. Comparative Evaluations and Design Trade-Offs

Comparative analysis underlines several important trade-offs:

  • Aggressive Reduction: ATC with average linkage uniquely sustains accuracy under very low keep rates (as low as 25%), outperforming prior methods by up to 9.6 percentage points in challenging settings (Haurum et al., 2024).
  • Structured vs. Global Merging: For spatial architectures (Swin, SAM), CubistMerge and ClustViT demonstrate that grid-preserving or semantic clustering is critical to avoid catastrophic drop in spatial tasks; merging that disrupts spatial layouts (as in naive ToMe) leads to sharp accuracy loss (Gong et al., 26 Sep 2025, Montello et al., 2 Oct 2025).
  • Train-Free vs. Learnable Schedules: Adaptive schemes such as ATM require zero fine-tuning, achieving over 30% FLOPs reduction with no accuracy drop on standard ViTs, while learned-threshold hybrids (LTMP (Bonnaerens et al., 2023)) can be fine-tuned in a single epoch.
  • Dynamic Budgeting: QuickMerge++’s entropy-based budgeting and mass-weighted clustering establish a balance between computational savings and the need for autoregressive compatibility in generative scenarios (Liu et al., 16 Aug 2025).

7. Practical Guidelines and Limitations

Guidelines for deploying token-merging remedies are highly model- and task-dependent:

  • Insert merging layers every 2–4 transformer blocks in deep networks.
  • Use adaptive thresholds or informativeness metrics (energy, attention, domain priors) to protect unique or high-saliency tokens.
  • In dense prediction, combine semantic-guided clustering and a regenerator to recover fine spatial structure before the output head (Montello et al., 2 Oct 2025).
  • For real-time or edge devices, select merging ratios and layer placement to trade off speed and accuracy, leveraging techniques with minimal or no training overhead (ATM, PiToMe).
  • Be cautious with very aggressive merging in shallow or early layers, especially in non-linear response regimes.
  • In domains with scarce local redundancy, benefit diminishes and merging should be less aggressive unless guided by detailed informativeness evaluation (e.g., spectral metrics (Götz et al., 2024)).
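The placement guideline above can be expressed as a thin wrapper around a stack of transformer blocks. This is a hypothetical sketch: `blocks` and `merge` stand in for real transformer blocks and any of the merge operators discussed, and `every`/`keep_ratio` are the deployment-tuned hyperparameters.

```python
def forward_with_merging(x, blocks, merge, every=3, keep_ratio=0.7):
    """Run transformer blocks, shrinking the token sequence every `every` blocks.

    `merge` is any callable (tokens, keep_ratio) -> fewer tokens, e.g.
    similarity-based pair merging; placement every 2-4 blocks follows the
    guideline above.
    """
    for i, block in enumerate(blocks):
        x = block(x)
        if (i + 1) % every == 0:
            x = merge(x, keep_ratio)  # token count shrinks here
    return x
```

Because each later block then runs on a shorter sequence, savings compound with depth—which is also why overly aggressive merging in early layers, flagged above, removes information before the network has refined it.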

A central limitation across methods is the need to tune merge ratios, thresholds, and other hyperparameters for each deployment scenario; global static settings can underperform in domains with high variability or task-sensitive tokens. Some algorithms, especially those relying on global similarity, incur O(N^2) computation, making them less suitable for very long sequences unless locality is enforced or caching is used (Götz et al., 2024, Saghatchian et al., 1 Jan 2025). For batched or autoregressive workloads, care must be taken to enforce causality and batch invariance.


In summary, token-merging remedy comprises a family of architectural and algorithmic strategies for dynamically reducing token redundancy in transformer and related models. Through diverse mechanisms—adjacent key similarity, energy-based selection, semantic clustering, norm-preserving fusion, or learnable decoupled embeddings—these methods routinely enable substantial computational acceleration (typically 1.2–2.5×), large FLOPs savings (30–70%), and only minor or negligible degradation in end-task accuracy across domains as varied as speech, vision, code, and time series (Li et al., 2023, Park et al., 19 Aug 2025, Lee et al., 2024, Tran et al., 2024, Gong et al., 26 Sep 2025, Lee et al., 21 May 2025, Kim et al., 2023, Haurum et al., 2024, Montello et al., 2 Oct 2025, Saad et al., 19 Jul 2025).

