R-MeeTo Framework for Vision Mamba Compression
- R-MeeTo is a two-stage framework that compresses Vision Mamba networks by merging tokens and applying rapid, minute-level retraining to maintain task accuracy.
- It employs cosine distance to select token pairs for merging and achieves up to 1.5× inference speedup, with minimal accuracy drop on ImageNet benchmarks.
- The approach tackles token-reduction challenges unique to Vision Mamba, preserving both the general and the specific token features that conventional pruning methods tend to discard.
R-MeeTo (Re-Merged Token Re-training) is a two-stage framework designed for compressing Vision Mamba networks through token reduction, while maintaining task accuracy and achieving significant speedups on image classification benchmarks. It addresses the unique failure modes observed in Vision Mamba when using conventional token pruning or merging techniques, enabling rapid, high-fidelity model compaction suitable for large-scale vision applications (Shi et al., 2024).
1. Vision Mamba and the Token Reduction Challenge
Vision Mamba replaces the self-attention mechanism of standard Vision Transformers (ViTs) with a bidirectional state-space model (SSM), enriching token representations as a function of their time indices. Theoretical analysis (Theorem 2.1) shows that, in Vision Mamba (e.g., Vim-Ti, Vim-S, Vim-B), head and tail tokens aggregate more general knowledge, so token pruning—successful in ViTs—can disproportionately remove these enriched tokens in Mamba, incurring excessive general-knowledge loss (Corollary 2.1). For example, a 14% token drop leads to more than 2% top-1 accuracy loss in Vim-S, versus less than 1% in DeiT-S. Token merging, which fuses similar pairs, retains more information but exhibits severe accuracy degradation at large reduction ratios by losing "specific" features (Shi et al., 2024).
2. R-MeeTo Framework Architecture
R-MeeTo automates robust token reduction in two explicit stages:
- Token Merging: Repeated, training-free merging of token pairs at selected layers.
- Fast Retraining: Brief (minute-level) supervised retraining to reconstruct feature selectivity and recover lost accuracy.
Empirical results demonstrate that R-MeeTo restores nearly all top-1 accuracy with minimal retraining (≤0.9% accuracy drop on ImageNet-1K across all tested Vim variants and up to 1.5× inference speedup), outperforming both pruning and naive merging alone by a significant margin (Shi et al., 2024).
3. Detailed Methodology
3.1 Token Merging Algorithm
Merging is performed every other layer (i.e., at even-numbered layers), where in each layer a fixed budget of r token pairs is merged out of the N current tokens:
- Grouping: Tokens are partitioned into two disjoint sets, typically by even and odd time indices.
- Pair Selection: Cosine distances are computed between even-indexed and odd-indexed tokens; the r pairs with the smallest distances are selected.
- Merging: Each selected pair is averaged, x_merged = (x_i + x_j)/2. Non-selected tokens remain, and the resulting tokens are reordered by time index.
This process is repeated at even-numbered layers to achieve an overall reduction ratio ρ, e.g., ρ = 0.14 for Vim-Ti and ρ = 0.31 for Vim-S/B. Variations such as L1 or L2 distances and learned merging weights have been evaluated, but equal weights and cosine distance are empirically sufficient (Shi et al., 2024).
Token Merging Algorithm (Pseudocode, one step)
| Step | Description |
|---|---|
| 1 | Input: token sequence x_1, …, x_N (ordered by time index); merge budget r |
| 2 | Split indices into EvenIdx and OddIdx |
| 3 | For (i, j) in EvenIdx × OddIdx, compute cosine distance d_ij = 1 − cos(x_i, x_j) |
| 4 | Select the r pairs with minimal d_ij |
| 5 | Merge each selected pair into its average; keep unmerged tokens |
| 6 | Reorder by original time indices |
This table summarizes the procedural steps for the token merging subroutine as used in R-MeeTo.
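The subroutine above can be sketched in NumPy. This is a minimal illustration under stated assumptions—equal-weight averaging, cosine distance, and a greedy one-merge-per-token pairing—with all function and variable names hypothetical rather than taken from the paper's code:

```python
import numpy as np

def merge_step(tokens: np.ndarray, budget: int) -> np.ndarray:
    """Merge `budget` even/odd token pairs with smallest cosine distance.

    tokens: (N, D) array ordered by time index.
    Returns an (N - budget, D) array, still ordered by time index.
    """
    n = tokens.shape[0]
    even_idx = np.arange(0, n, 2)
    odd_idx = np.arange(1, n, 2)

    # Cosine distance between every even-indexed and odd-indexed token.
    norm = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    dist = 1.0 - norm[even_idx] @ norm[odd_idx].T  # shape (|even|, |odd|)

    # Greedily pick the `budget` smallest-distance pairs, skipping tokens
    # that are already matched (at most one merge per token).
    order = np.argsort(dist, axis=None)
    used_even, used_odd, pairs = set(), set(), []
    for flat in order:
        i, j = np.unravel_index(flat, dist.shape)
        if i in used_even or j in used_odd:
            continue
        pairs.append((even_idx[i], odd_idx[j]))
        used_even.add(i); used_odd.add(j)
        if len(pairs) == budget:
            break

    # Equal-weight merge: average each pair, keep its earlier time index.
    out, merged_away = {}, set()
    for a, b in pairs:
        out[min(a, b)] = 0.5 * (tokens[a] + tokens[b])
        merged_away.update((a, b))
    for t in range(n):
        if t not in merged_away:
            out[t] = tokens[t]

    # Reorder surviving tokens by original time index.
    return np.stack([out[t] for t in sorted(out)])
```

A merged pair is reinserted at the earlier of its two time indices, which preserves the sequence order that (per the ablations below) Vision Mamba depends on.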
3.2 Fast Retraining Protocol
Retraining optimizes the standard cross-entropy loss over image class labels, L_CE = −Σ_c y_c log p_c. Key hyperparameters include:
- Optimizer: AdamW with weight decay
- Learning rate: cosine decay schedule
- Batch size: effective 1024
- Epochs: typically 3 (Vim-Ti/S), 5 (Vim-B), extended for larger variants
- EMA: 0.996 for Vim-B
- Augmentation: RandAug, MixUp, CutMix, and other standard transformations
Retraining is executed on modern GPUs (e.g., 4 × 8 H100) and completes in 2–8 minutes for typical Vim models with merge ratios ρ = 0.14–0.31 (Shi et al., 2024).
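The learning-rate schedule from the list above can be sketched in plain Python. The cosine-decay formula is the standard one; the concrete `peak_lr` value in the config is a placeholder assumption, since this summary does not reproduce the paper's exact number:

```python
import math

# Illustrative retraining config. Batch size and epochs come from the text;
# peak_lr is an assumed placeholder, not a value from the paper.
CONFIG = {"optimizer": "AdamW", "batch_size": 1024, "epochs": 3, "peak_lr": 1e-4}

def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate: peak_lr at step 0, min_lr at total_steps."""
    progress = min(1.0, step / max(1, total_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With only 3–5 epochs, the schedule sweeps from `peak_lr` to zero in a few thousand steps, which is consistent with the minute-level retraining budget.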
4. Experimental Evaluations
R-MeeTo is evaluated primarily on ImageNet-1K, using pretrained Vim-Ti, Vim-S, and Vim-B models (7M, 26M, and 98M parameters respectively).
4.1 Impact on Accuracy and Efficiency
| Model | Baseline Acc (%) | Merge-only Acc (%) [Δ] | R-MeeTo Acc (%) [Δ] | Baseline FLOPs (G) | R-MeeTo FLOPs (G) (×speedup) | Params (M) |
|---|---|---|---|---|---|---|
| Vim-Ti | 76.1 | 64.8 [-11.3] | 75.9 [-0.2] | 1.45 | 1.28 (×1.13) | 7 |
| Vim-S | 80.5 | 72.9 [-7.6] | 80.1 [-0.4] | 5.08 | 3.60 (×1.41) | 26 |
| Vim-B | 81.9 | 76.3 [-5.6] | 81.1 [-0.8] | 18.87 | 13.50 (×1.40) | 98 |
Top-1 accuracy drops are contained to ≤0.9% post R-MeeTo, with 1.13–1.41× FLOP reductions and corresponding speedups in inference time. Merge-only results, without retraining, exhibit substantially higher declines in accuracy.
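As a quick arithmetic cross-check, the ×speedup column equals the ratio of baseline to compressed FLOPs, using the values from the table:

```python
# FLOP values from the table above: (baseline, after R-MeeTo), in GFLOPs.
flops = {"Vim-Ti": (1.45, 1.28), "Vim-S": (5.08, 3.60), "Vim-B": (18.87, 13.50)}

# Speedup = baseline FLOPs / compressed FLOPs, rounded to two decimals.
speedups = {m: round(base / reduced, 2) for m, (base, reduced) in flops.items()}
# → {'Vim-Ti': 1.13, 'Vim-S': 1.41, 'Vim-B': 1.4}
```

Note these are FLOP ratios; realized wall-clock speedups depend on hardware and kernel efficiency.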
4.2 Retraining Times
| Hardware | Vim-Ti (3 ep) | Vim-S (3 ep) | Vim-B (3 ep) |
|---|---|---|---|
| 1×8 A100 | 16.2 min | 25.2 min | 57.6 min |
| 2×8 A100 (IB) | 8.1 min | 12.9 min | 30.6 min |
| 4×8 H100 (IB) | 4.2 min | 6.8 min | 16.9 min |
Minute-level retraining durations are sufficient for recovery; for instance, Vim-Ti achieves a 35.9% absolute accuracy recovery in 4.2 minutes with 4 × 8 H100s (Shi et al., 2024).
5. Ablation and Analytical Studies
A series of ablation analyses elucidate the sensitivity of R-MeeTo to its key design elements:
- Merge Ratio (ρ): For Vim-S, increasing ρ to 0.54 degrades merge-only accuracy to 60.7%; R-MeeTo restores it to 76.3%, and matches the baseline at moderate ratios (0.14–0.31).
- Retraining Epochs: Accuracy improvements saturate by 3–5 epochs, with minimal additional gains thereafter (80.0% at 15 epochs versus 79.5% at 3 epochs), supporting the efficiency of short retraining.
- Token Order: Maintenance of token time-order after merging is critical for Vision Mamba, with more than 6% accuracy loss if omitted; by contrast, DeiT is insensitive to the order.
- Similarity Features and Distance Metrics: Using the token features themselves for pairing outperforms block-internal or intermediate features. All tested distance metrics (cosine, L1, L2) perform comparably (within ±0.1%).
- Grouping Strategy: Odd-even index splits provide best stability; alternatives like front-back or random grouping are less robust (Shi et al., 2024).
6. Best Practices, Limitations, and Extensions
Best practice recommendations include setting ρ = 0.14 for Vim-Ti and ρ = 0.31 for Vim-S/B, and retraining for 3–5 epochs on modern hardware. Equal-weight merges and cosine similarity are typically sufficient; learned weights or more elaborate metrics yield only marginal improvement. The primary limitation is the residual need for a brief retraining interval, which grows if ρ exceeds 0.5. Extensions include a learnable pair-scoring module for merge selection, as well as adaptive, possibly per-layer or per-sample merging ratios. Application to other SSM-based or dynamic-token Vision Transformer architectures is a plausible direction (Shi et al., 2024).
7. Context and Significance
R-MeeTo enables efficient, high-fidelity Vision Mamba compression for large-scale vision inference. By automating and accelerating the recovery of token-merged networks within minutes, it renders token reduction practical at deployment. The approach combines algorithmic simplicity—a lightweight pairwise token-matching step over a small subset of layers—with empirical effectiveness, matching or exceeding alternative strategies at a given accuracy/speed trade-off. A plausible implication is that rapid model adaptation via token merging and minimal retraining may generalize to a broader class of non-attention-based sequence models in computer vision (Shi et al., 2024).