
R-MeeTo Framework for Vision Mamba Compression

Updated 22 February 2026
  • R-MeeTo is a two-stage framework that compresses Vision Mamba networks by merging tokens and applying rapid, minute-level retraining to maintain task accuracy.
  • It employs cosine distance to select token pairs for merging and achieves up to 1.5× inference speedup, with minimal accuracy drop on ImageNet benchmarks.
  • The approach tackles the unique token reduction challenges in Vision Mamba, preserving essential general and specific features compared to conventional pruning methods.

R-MeeTo (Re-Merged Token Re-training) is a two-stage framework designed for compressing Vision Mamba networks through token reduction, while maintaining task accuracy and achieving significant speedups on image classification benchmarks. It addresses the unique failure modes observed in Vision Mamba when using conventional token pruning or merging techniques, enabling rapid, high-fidelity model compaction suitable for large-scale vision applications (Shi et al., 2024).

1. Vision Mamba and the Token Reduction Challenge

Vision Mamba replaces the self-attention mechanism of standard Vision Transformers (ViTs) with a bidirectional state-space model (SSM), enriching token representations as a function of their time indices. Theoretical analysis (Theorem 2.1) shows that, in Vision Mamba (e.g., Vim-Ti, Vim-S, Vim-B), head and tail tokens aggregate more general knowledge, whereas pruning tokens—successfully applied in ViTs—can disproportionately remove these enriched tokens in Mamba, incurring excessive general-knowledge loss (Corollary 2.1). For example, a 14% token drop leads to more than 2% top-1 accuracy loss in Vim-S, versus less than 1% in DeiT-S. Token merging, which fuses similar pairs, retains more information but exhibits severe accuracy degradation at large reduction ratios by losing "specific" features (Shi et al., 2024).

2. R-MeeTo Framework Architecture

R-MeeTo automates robust token reduction in two explicit stages:

  • Token Merging: Repeated, training-free merging of token pairs at selected layers.
  • Fast Retraining: Brief (minute-level) supervised retraining to reconstruct feature selectivity and recover lost accuracy.

Empirical results demonstrate that R-MeeTo restores nearly all top-1 accuracy with minimal retraining (≤0.9% accuracy drop on ImageNet-1K across all tested Vim variants and up to 1.5× inference speedup), outperforming both pruning and naive merging alone by a significant margin (Shi et al., 2024).

3. Detailed Methodology

3.1 Token Merging Algorithm

Merging is performed every $K$ layers (e.g., $K=2$); in each merging layer $\ell$, a fixed budget of $r$ token pairs is merged out of $N$ total tokens:

  • Grouping: Tokens are partitioned into disjoint sets, typically by even and odd indices.
  • Pair Selection: Cosine distances $d_{i,j} = 1 - \frac{x_i^\top x_j}{\|x_i\|\,\|x_j\|}$ are computed between even-indexed and odd-indexed tokens; the $r$ pairs with smallest distance are merged.
  • Merging: Each selected pair is averaged: $\hat{x}_{ij} = \frac{1}{2}x_i + \frac{1}{2}x_j$. Non-selected tokens remain unchanged; the resulting tokens are reordered by time index.

This process is repeated at even-numbered layers to achieve an overall reduction ratio $r/N$, e.g., 0.14 for Vim-Ti and 0.31 for Vim-S/B. Variations such as $L_1$ or $L_2$ distances and learned merging weights have been evaluated, but equal weights and cosine distance are empirically sufficient (Shi et al., 2024).

Token Merging Algorithm (Pseudocode, one step)

Step  Description
1     Input: $T = \{x_1, \dots, x_N\}$, budget $r$
2     EvenIdx, OddIdx ← index split
3     For (i, j) in EvenIdx × OddIdx, compute $d_{i,j}$
4     Select the $r$ pairs with minimal $d_{i,j}$
5     Merge: $\hat{T}$ ← merged + unmerged tokens
6     Reorder by original time indices

This table summarizes the procedural steps for the token merging subroutine as used in R-MeeTo.
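Under these conventions (even/odd grouping, cosine distance, equal-weight averaging, time-order preservation), a single merge step can be sketched in plain Python. The function names and list-based token representation below are illustrative assumptions, not the paper's implementation:

```python
import math

def cosine_distance(a, b):
    """d = 1 - cos(a, b), as used for pair selection."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def merge_step(tokens, r):
    """One merge step: fuse the r most similar even/odd token pairs.

    tokens: list of feature vectors, ordered by time index.
    Returns the shortened token list, still in time order.
    """
    even = range(0, len(tokens), 2)
    odd = range(1, len(tokens), 2)
    # Score every even/odd pair by cosine distance.
    pairs = sorted(
        ((cosine_distance(tokens[i], tokens[j]), i, j) for i in even for j in odd),
        key=lambda t: t[0],
    )
    # Greedily take the r closest pairs, skipping already-used tokens.
    merged, used = {}, set()
    for _, i, j in pairs:
        if len(merged) == r:
            break
        if i in used or j in used:
            continue
        # Equal-weight average; the merged token keeps the earlier time index.
        merged[min(i, j)] = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
        used.update((i, j))
    # Reassemble, preserving the original time order.
    out = []
    for idx, tok in enumerate(tokens):
        if idx in merged:
            out.append(merged[idx])
        elif idx not in used:
            out.append(tok)
    return out
```

Each call removes exactly $r$ tokens (one per merged pair), so repeating it at every merging layer yields the overall reduction described above.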

3.2 Fast Retraining Protocol

Retraining involves optimizing the standard cross-entropy loss over image class labels:

$$\mathcal{L}(\theta) = -\sum_{(x,y)\in D} \sum_{c=1}^{C} \mathbf{1}\{y=c\} \log p_\theta(c \mid x)$$

Key hyperparameters include:

  • Optimizer: AdamW, weight decay $5\times10^{-2}$
  • Learning rate: $2\times10^{-5} \to 1\times10^{-6}$ (cosine decay)
  • Batch size: effective 1024
  • Epochs: typically 3 (Vim-Ti/S), 5 (Vim-B), extended for larger variants
  • EMA: 0.996 for Vim-B
  • Augmentation: RandAug, MixUp, CutMix, and other standard transformations

Retraining is executed on modern GPUs (e.g., 4×8 H100) and completes in 2–8 minutes for typical Vim models with $r/N \approx 0.14$–$0.31$ (Shi et al., 2024).
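The stated cosine decay from $2\times10^{-5}$ down to $1\times10^{-6}$ can be written as a closed-form schedule; the helper name and per-step granularity here are illustrative assumptions:

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-5, lr_min=1e-6):
    """Cosine-decayed learning rate from lr_max at step 0 to lr_min at the end."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# The schedule starts at lr_max, passes through the midpoint
# lr_min + (lr_max - lr_min)/2, and ends at lr_min.
```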

4. Experimental Evaluations

R-MeeTo is evaluated primarily on ImageNet-1K, using pretrained Vim-Ti, Vim-S, and Vim-B models (7M, 26M, and 98M parameters respectively).

4.1 Impact on Accuracy and Efficiency

Model   Baseline Acc (%)  Merge-only Acc (%) [Δ]  R-MeeTo Acc (%) [Δ]  Baseline FLOPs (G)  R-MeeTo FLOPs (G) (speedup)  Params (M)
Vim-Ti  76.1              64.8 [-11.3]            75.9 [-0.2]          1.45                1.28 (×1.13)                 7
Vim-S   80.5              72.9 [-7.6]             80.1 [-0.4]          5.08                3.60 (×1.41)                 26
Vim-B   81.9              76.3 [-5.6]             81.1 [-0.8]          18.87               13.50 (×1.40)                98

Top-1 accuracy drops are contained to ≤0.9% post R-MeeTo, with 1.13–1.41× FLOP reductions and corresponding speedups in inference time. Merge-only results, without retraining, exhibit substantially higher declines in accuracy.
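As a quick sanity check, the speedup factors reported in the table follow directly from the FLOP ratios:

```python
# (baseline GFLOPs, R-MeeTo GFLOPs) per model, taken from the table above
flops = {"Vim-Ti": (1.45, 1.28), "Vim-S": (5.08, 3.60), "Vim-B": (18.87, 13.50)}

for model, (base, reduced) in flops.items():
    print(f"{model}: x{base / reduced:.2f}")
# Vim-Ti: x1.13, Vim-S: x1.41, Vim-B: x1.40
```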

4.2 Retraining Times

Hardware Vim-Ti (3 ep) Vim-S (3 ep) Vim-B (3 ep)
1×8 A100 16.2 min 25.2 min 57.6 min
2×8 A100 (IB) 8.1 min 12.9 min 30.6 min
4×8 H100 (IB) 4.2 min 6.8 min 16.9 min

Minute-level retraining durations are sufficient for recovery; for instance, Vim-Ti achieves a 35.9% absolute accuracy recovery in 4.2 minutes with 4 × 8 H100s (Shi et al., 2024).

5. Ablation and Analytical Studies

A series of ablation analyses elucidate the sensitivity of R-MeeTo to its key design elements:

  • Merge Ratio ($r/N$): For Vim-S, increasing $r/N$ to 0.54 degrades merge-only accuracy to 60.7%; R-MeeTo restores it to 76.3%, and matches baseline at moderate ratios (0.14–0.31).
  • Retraining Epochs: Accuracy improvements saturate by 3–5 epochs, with minimal additional gains thereafter (80.0% at 15 epochs versus 79.5% at 3 epochs), supporting the efficiency of short retraining.
  • Token Order: Maintenance of token time-order after merging is critical for Vision Mamba, with more than 6% accuracy loss if omitted; by contrast, DeiT is insensitive to the order.
  • Similarity Features and Distance Metrics: Using $x_t$ (token features) for pairing outperforms block/intermediate features. All distance metrics (cosine, $L_1$, $L_2$) perform comparably (within ±0.1%).
  • Grouping Strategy: Odd-even index splits provide best stability; alternatives like front-back or random grouping are less robust (Shi et al., 2024).
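To make the distance-metric ablation concrete, the three candidate metrics can be compared on toy token vectors; the vectors and helper names below are made up for demonstration and often, as here, all three metrics agree on the closest pair:

```python
import math

def cosine_d(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def l1_d(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_d(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

tokens = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
pairs = [(0, 1), (0, 2), (1, 2)]
# Which pair is closest under each metric?
for name, dist in [("cosine", cosine_d), ("L1", l1_d), ("L2", l2_d)]:
    best = min(pairs, key=lambda p: dist(tokens[p[0]], tokens[p[1]]))
    print(name, best)
# all three metrics pick pair (0, 1) on this toy input
```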

6. Best Practices, Limitations, and Extensions

Best practice recommendations include setting $r/N \approx 0.14$ for Vim-Ti and $\approx 0.31$ for Vim-S/B, and retraining for 3–5 epochs on modern hardware. Equal-weight merges and cosine similarity are typically sufficient; learned weights or more elaborate metrics yield only marginal improvement. The primary limitation is the residual need for a brief retraining interval, which grows if $r/N$ exceeds 0.5. Extensions include a learnable pair-scoring module for merge selection, as well as adaptive, possibly per-layer or per-sample merging ratios. Application to other SSM-based or dynamic-token Vision Transformer architectures is a plausible direction (Shi et al., 2024).

7. Context and Significance

R-MeeTo enables efficient, high-fidelity Vision Mamba compression for large-scale vision inference. By automating and accelerating the recovery of token-merged networks within minutes, it renders token reduction practical at deployment. The approach combines algorithmic simplicity—an $\mathcal{O}(N^2)$ token-matching over a small subset of layers—with empirical effectiveness, matching or exceeding alternative strategies for a given accuracy/speed trade-off. A plausible implication is that rapid model adaptation via token merging and minimal retraining may generalize to a broader class of non-attention-based sequence models in computer vision (Shi et al., 2024).
