Sparse Token Merger (STM)

Updated 13 February 2026
  • Sparse Token Merger (STM) is a family of techniques that identifies and merges redundant tokens in Transformer models, optimizing compute and memory usage.
  • STM utilizes token importance metrics and similarity clustering (e.g., cosine similarity, hierarchical clustering, bipartite matching) to adaptively decide which tokens to merge.
  • Empirical evaluations across vision, language, and multimodal domains demonstrate that STM methods can achieve 2–18× token reduction with minimal accuracy loss.

Sparse Token Merger (STM), also known as token merging or token reduction, is a family of algorithmic techniques designed to efficiently decrease the number of tokens processed by Transformer-based models. The central objective is to maintain downstream performance while substantially reducing both computational cost and memory footprint by merging semantically or functionally redundant tokens. STM methods have been developed and rigorously evaluated across language, vision, and multimodal domains, encompassing static, dynamic, hierarchical, and autoregressive-compatible strategies.

1. Conceptual Foundations and Objectives

The Transformer architecture exhibits quadratic complexity in sequence length for self-attention and cross-attention operations. In dense input regimes—such as high-resolution vision, video, or long-context language—many token representations are highly redundant, leading to prohibitive costs. STM mitigates this by identifying and aggregating tokens that are similar in their raw embeddings, attention patterns, or other domain-specific criteria, thereby reducing the processed sequence length prior to the attention or MLP sublayers (Haurum et al., 2024, Bolya et al., 2023, Shang et al., 2024). The high-level aims are:

  • Minimize representation redundancy through merging or pruning.
  • Adaptively control the trade-off between compute savings and accuracy degradation.
  • Enable parameter-efficient inference without introducing significant additional trainable weights.
  • Support training-free or plug-and-play integration for pre-trained models.

2. Core Methodologies in STM

STM instantiations vary by modality and target model but share several common algorithmic structures:

Token Importance and Selection

A fundamental component of STM is quantifying token importance:

  • Class-token attention scores: For visual models (e.g., CLIP-ViT), attention weights between class ([CLS]) and patch tokens are used. Sparsity in these scores is exploited by thresholding (e.g., via interquartile range outlier detection) to select salient tokens (Shang et al., 2024).
  • Entropy/norm-based metrics: In autoregressive or multimodal settings, tokens are ranked by attention distribution entropy or attention norm magnitude (Liu et al., 16 Aug 2025).
  • Cross-modality saliency: Vision-LLMs may use text-informed attention to identify which visual or textual tokens to merge (Cao et al., 2023).
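As an illustrative sketch of the first bullet, an interquartile-range (IQR) outlier rule can be applied directly to a vector of [CLS]→patch attention scores. The function name and toy scores below are ours, not from any cited codebase:

```python
import numpy as np

def select_salient_tokens(cls_attn):
    """Select patch tokens whose [CLS]-attention is an IQR outlier.

    cls_attn: (N,) attention weights from the class token to N patch
    tokens (a simplified stand-in for the scores used in practice).
    Returns the indices of tokens kept as salient.
    """
    q1, q3 = np.percentile(cls_attn, [25, 75])
    iqr = q3 - q1
    threshold = q3 + 1.5 * iqr  # classic upper-outlier rule
    return np.where(cls_attn > threshold)[0]

# Toy example: most tokens receive uniform attention, a few spike.
attn = np.full(16, 0.05)
attn[[2, 9]] = 0.2
print(select_salient_tokens(attn))  # → [2 9]
```

Because the threshold adapts to the score distribution, the number of kept tokens varies per input, which is what makes this kind of selection adaptive rather than a fixed keep-rate.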

Similarity Metrics and Clustering

Tokens are grouped for merging according to similarity criteria:

  • Cosine or dot-product similarity: Raw or key-vector dot products and their variants are used to cluster tokens (Haurum et al., 2024, Bolya et al., 2023, Liu et al., 16 Aug 2025).
  • Agglomerative hierarchical clustering: Methods such as Agglomerative Token Clustering (ATC) perform bottom-up merging with linkage functions (single, complete, average) (Haurum et al., 2024).
  • Bipartite matching: Some vision-language token mergers employ bipartite matching—odd and even tokens pair via maximal similarity (Cao et al., 2023).
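A minimal NumPy sketch of the bipartite-matching idea: alternating tokens form the two sets, each token in one set is paired with its most similar partner in the other, and the r strongest pairs are merged. The function name and the sequential averaging are simplifications of ToMe-style matching, not a faithful reimplementation:

```python
import numpy as np

def bipartite_merge(x, r):
    """Merge r token pairs via bipartite matching (simplified sketch).

    x: (N, d) token embeddings; r: number of merges to perform.
    Even-indexed tokens form set A, odd-indexed tokens set B; each
    A-token is paired with its most cosine-similar B-token, the r best
    pairs are averaged, and unmerged A-tokens are kept alongside B.
    """
    a, b = x[::2].copy(), x[1::2].copy()
    an = a / np.linalg.norm(a, axis=-1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=-1, keepdims=True)
    sim = an @ bn.T                       # (|A|, |B|) cosine similarity
    best_idx = sim.argmax(axis=-1)        # best B-partner per A-token
    best_val = sim.max(axis=-1)
    order = np.argsort(-best_val)         # strongest matches first
    merged_a, kept_a = order[:r], order[r:]
    for i in merged_a:                    # fold each merged A-token into B
        j = best_idx[i]
        b[j] = (b[j] + a[i]) / 2
    return np.concatenate([a[kept_a], b], axis=0)  # N - r tokens remain

x = np.random.randn(8, 4)
print(bipartite_merge(x, r=2).shape)  # (6, 4)
```

Splitting into two alternating sets keeps the matching cost at O(N²/4) similarity entries and guarantees exactly r tokens are removed per call.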

Merging Operations

The merged representation is typically a uniform or size-weighted average of the grouped token embeddings: with sizes sᵢ tracking how many original tokens each entry already represents, the merge is x̄ = (Σᵢ sᵢxᵢ) / (Σᵢ sᵢ), so repeatedly merged tokens retain weight proportional to the number of originals they stand for.
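A small sketch of size-weighted merging under these assumptions (the function name and toy data are illustrative):

```python
import numpy as np

def weighted_merge(tokens, sizes, group):
    """Size-weighted average of a group of tokens (illustrative).

    tokens: (N, d) embeddings; sizes: (N,) how many original tokens
    each row already represents; group: indices to merge into one.
    Returns the merged embedding and its new size.
    """
    w = sizes[group].astype(float)
    merged = (w[:, None] * tokens[group]).sum(axis=0) / w.sum()
    return merged, w.sum()

toks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sizes = np.array([1, 1, 2])
m, s = weighted_merge(toks, sizes, [0, 2])
print(m, s)  # [1. 0.6667] 3.0 — token 2 counts double
```

Tracking sizes this way keeps the average unbiased when already-merged "super-tokens" are merged again in later layers.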

Integration and Reversibility

  • Immediate propagation: STM is most often applied between attention and MLP sublayers, with reduced tokens carried to the next layer (Haurum et al., 2024, Bolya et al., 2023).
  • Decompression or “unmerge”: Some methods restore the full sequence dimension after attention—enabling dynamic re-inclusion of tokens downstream (e.g., Token Sparse Attention) (Jo et al., 3 Feb 2026, Bolya et al., 2023).
  • Cross-layer interleaving: STM can be interleaved layerwise, periodically compressing and then temporarily restoring token sets to enable dynamic propagation of token importance (Jo et al., 3 Feb 2026).
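The compress-then-restore pattern above can be sketched as follows; all names are illustrative (not from any cited codebase) and the attention call is elided:

```python
import numpy as np

def merge_unmerge(x, src, dst):
    """Compress-then-restore sketch in the spirit of an "unmerge" step.

    Tokens at `src` are averaged into tokens at `dst`, attention would
    run on the shorter sequence, and the merged value is broadcast
    back so the full sequence length is restored afterwards.
    """
    x = x.copy()
    x[dst] = (x[src] + x[dst]) / 2          # merge src into dst
    keep = np.setdiff1d(np.arange(len(x)), src)
    compressed = x[keep]                     # (N - len(src), d)
    # ... attention over `compressed` would run here ...
    restored = np.empty_like(x)
    restored[keep] = compressed
    restored[src] = compressed[np.searchsorted(keep, dst)]  # broadcast back
    return compressed, restored

x = np.arange(12, dtype=float).reshape(6, 2)
c, r = merge_unmerge(x, src=np.array([1]), dst=np.array([4]))
print(c.shape, r.shape)  # (5, 2) (6, 2)
```

Because the restore step only copies rows, the compression is cheap to undo, which is what allows a token merged at one layer to "resurface" at the next.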

3. Domain-Specific Implementations

STM admits significant specialization per domain:

Vision

  • Agglomerative Token Clustering (ATC): Reduces token count in ViT-style models by hierarchical clustering using cosine distance, yielding parameter-free, bottom-up merging (Haurum et al., 2024).
  • Token Merging for Diffusion: Adapts STM to the latent U-Net in Stable Diffusion, using random patch partitioning and max-matching on dot-product similarity, with broadcast “unmerge” post-attention for pixel fidelity (Bolya et al., 2023).
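A hedged sketch of ATC-style bottom-up merging using SciPy's standard agglomerative-clustering utilities; the keep-count interface and function name are our simplification of the published method:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def atc_sketch(tokens, keep):
    """Agglomerative-clustering token reduction sketch (ATC-like).

    Bottom-up clustering on cosine distance with average linkage
    until at most `keep` clusters remain; each cluster is replaced
    by the mean of its member tokens.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    Z = linkage(normed, method="average", metric="cosine")
    labels = fcluster(Z, t=keep, criterion="maxclust")
    return np.stack([tokens[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

toks = np.random.randn(32, 8)
print(atc_sketch(toks, keep=8).shape)
```

The merging itself is parameter-free, consistent with the description above: only the keep rate (here a cluster count) needs to be chosen.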

Multimodal and Vision-Language

  • LLaVA-PruMerge: Employs class-token attention and clustering for adaptive, importance-aware compression of CLIP vision tokens, supporting ∼14–18× reduction with near-lossless performance (Shang et al., 2024).
  • PuMer: Combines cross-attention‐guided pruning with token merging in both vision and language modalities, using within-modality bipartite matching, and knowledge distillation for minimal downstream accuracy loss (Cao et al., 2023).

Language and Long-Context

  • Token Sparse Attention: Inserts STM as QKV compression at attention computation, followed by a decompression “scatter” to original sequence length. Allows reversible, per-head, per-layer dynamic sparsification, compatible with FlashAttention (Jo et al., 3 Feb 2026).
  • QuickMerge++: Dynamically selects and softly merges tokens for AR generation, guided by per-layer entropy and AR-prior alignment, achieving >2× compression in generative and sequence modeling (Liu et al., 16 Aug 2025).

Video

  • Spatio-Temporal Token Merging (STTM): Exploits both spatial (multi-granular quadtree search per frame) and temporal (directed, overlapping-patch matching across frames with union-find for temporal consistency) redundancy, yielding cache-friendly, query-agnostic token sets (Hyun et al., 10 Jul 2025).

4. Complexity and Empirical Trade-Offs

STM offers substantial reductions in computational and memory complexity:

  • Asymptotics: Reducing N tokens to M < N yields quadratic savings in attention FLOPs, O(N²) → O(M²) (Bolya et al., 2023, Shang et al., 2024, Haurum et al., 2024).
  • Empirical performance:
    • LLaVA-PruMerge: ~14× visual token reduction, ≤2 point accuracy drop across VQA, TextVQA, ScienceQA, POPE, MME, MMBench (Shang et al., 2024).
    • ATC: At 25% keep-rate, recovers up to 73.8% Top-1 accuracy (versus 69.4% for ToMe); at 90%, can slightly exceed baseline (82.04% vs. 81.85%) (Haurum et al., 2024).
    • Token Sparse Attention: Achieves up to 3.23× pure attention speedup at 128K context, <1% average drop in accuracy (Jo et al., 3 Feb 2026).
    • PuMer: 2× throughput and 38–51% memory reduction with <1% accuracy loss (Cao et al., 2023).
    • QuickMerge++: ~2.4× token reduction, 34.9% lower latency, 63% less KV-memory, and accuracy matched or improved on diverse benchmarks (Liu et al., 16 Aug 2025).
    • STTM (video): 2–3× speedup at 0.5–2% average accuracy reduction at strong budgets, with KV-cache reuse possible (Hyun et al., 10 Jul 2025).
| Method / Domain | Token Reduction | Speedup | Accuracy Degradation |
|---|---|---|---|
| LLaVA-PruMerge (vision/MM) | ×14–18 | ≈200× FLOPs | 0–2 points (6 VQA tasks) |
| ATC (ViT; r = 0.25–0.7) | ×4 (r = 0.25) | Quadratic | −8% to +0.2% |
| PuMer (VL) | ×1.8–2.1 | ×1.8–2.1 | <1% (VL tasks) |
| Token Sparse Attention (LLM) | ×2.2 (r ≈ 0.45) | ×3 | <1% (long context) |
| QuickMerge++ (AR/gen) | ×2.4 | −35% latency | 0 or improved |
| STTM (video LLM) | ×2–3 | ×2–3 | ≤2% (QA tasks) |
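The quadratic FLOPs saving can be checked with a back-of-envelope calculation; the constants and the per-head dimension d below are illustrative:

```python
def attn_flops(n, d):
    """Approximate self-attention FLOPs for sequence length n and
    head dimension d: the QK^T and AV matmuls each cost about
    2 * n^2 * d multiply-adds."""
    return 4 * n * n * d

# e.g. a ~14x visual-token reduction (576 -> 40 tokens)
n, m, d = 576, 40, 64
print(attn_flops(n, d) / attn_flops(m, d))  # (576/40)^2 = 207.36
```

This is where the ≈200× attention-FLOPs figure for a ∼14× token reduction comes from: the saving scales with the square of the reduction ratio.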

5. Practical Guidelines and Hyperparameterization

  • Keep rate / merge ratio: User-specified; smaller rates yield more savings but increased risk of accuracy loss (Haurum et al., 2024, Cao et al., 2023).
  • Layerwise scheduling: More aggressive merging in later layers or after sufficient context aggregation often yields better compute–accuracy trade-off (Cao et al., 2023).
  • Importance metric selection and clustering: Accuracy and stability hinge on choice of token importance and similarity metric; static grid or sequential sampling is generally inferior to adaptive STM (Shang et al., 2024, Bolya et al., 2023).
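As a sketch of layerwise scheduling, a simple linear keep-rate ramp; this is entirely hypothetical (real schedules are tuned per model and task), but it captures the guideline of merging more aggressively in later layers:

```python
def keep_schedule(num_layers, start=1.0, end=0.4):
    """Hypothetical linear layerwise keep-rate schedule: merge gently
    in early layers and more aggressively in later ones, once enough
    context has been aggregated."""
    step = (start - end) / (num_layers - 1)
    return [round(start - i * step, 3) for i in range(num_layers)]

print(keep_schedule(6))  # [1.0, 0.88, 0.76, 0.64, 0.52, 0.4]
```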

6. Domain-Specific Constraints and Extensions

  • Reversibility: Methods that compress-then-uncompress (e.g., Token Sparse Attention) allow dynamic token resurgence, accommodating shifting token salience per layer/head (Jo et al., 3 Feb 2026, Bolya et al., 2023).
  • Autoregressive compatibility: STM with AR-priors (QuickMerge++) enables compatibility with next-token generation by learning a lightweight AR prior over merged “super-tokens” (Liu et al., 16 Aug 2025).
  • Query-agnostic operation: STTM merges only on the basis of input video, supporting pre-caching and KV-cache reuse for video LLMs, critical for conversational QA (Hyun et al., 10 Jul 2025).

7. Limitations and Future Perspectives

STM approaches can involve trade-offs and open challenges:

  • Threshold calibration: Static merge/prune rates or similarity thresholds may require tuning per domain and dataset (Hyun et al., 10 Jul 2025).
  • Potential for information loss: Over-aggressive reduction or misaligned similarity metrics can degrade downstream accuracy, especially in very fine-grained tasks (Shang et al., 2024, Bolya et al., 2023).
  • Efficiency bottlenecks: Some STM implementations incur O(N²) pre-processing cost for clustering or similarity computation; further optimizations via nearest-neighbor chains and hardware adaptation remain active areas (Haurum et al., 2024).
  • Extensibility: Adaptive, learned, or content-aware scheduling of merging, as well as extension to additional data structures (multi-camera, 3D), is identified as a promising research avenue (Hyun et al., 10 Jul 2025).

STM methodologies represent a rapidly maturing and foundational aspect of efficient large-model inference across modalities, enabling new compute–accuracy regimes significantly beyond those accessible to classical dense Transformer workflows (Haurum et al., 2024, Shang et al., 2024, Jo et al., 3 Feb 2026, Bolya et al., 2023, Cao et al., 2023, Hyun et al., 10 Jul 2025, Liu et al., 16 Aug 2025).
