Post-Hoc Slimmability in RS FMs
- Post-hoc slimmability is the ability to adapt pretrained remote sensing foundation models by proportionally reducing channel widths at inference time without retraining, leveraging inherent redundancy.
- The method uniformly scales multi-head self-attention and feedforward layers, creating a continuum of subnetworks that efficiently trade off FLOPs and accuracy.
- Empirical results show RS FMs retain over 71% relative accuracy at extremely slimmed widths, with some subnets even outperforming the full model under constrained compute.
Post-hoc slimmability refers to the ability to adapt pretrained transformer architectures—specifically, foundation models (FMs) for remote sensing (RS)—by uniformly reducing channel width at inference time, requiring no retraining or architectural modification. This procedure leverages representational redundancy in overparameterized models and enables deployment across a range of computational budgets, most notably in resource-constrained environments. Hackel et al. (2026) provide the definitive technical formalization and empirical analysis of this phenomenon in RS FMs, contrasting findings with established computer vision (CV) models and challenging prevailing scaling paradigms (Hackel et al., 30 Jan 2026).
1. Mathematical Formulation and Definitions
A transformer FM pretrained for RS includes feedforward network (FFN) layers with hidden size $d_{\mathrm{ff}}$ and multi-head self-attention (MHSA) modules with per-head dimension $d_h$. For any width multiplier $s \in (0, 1]$, post-hoc slimmability reduces the width of all layers proportionally:
- In FFN blocks: the hidden dimension is reduced to $\lfloor s \cdot d_{\mathrm{ff}} \rfloor$, with weight matrices sliced to the retained channels.
- In MHSA blocks: for each head $h$ in the head index set $\{1, \dots, H\}$, the per-head dimension is reduced to $\lfloor s \cdot d_h \rfloor$.
No update or retraining is performed. The original weights are sliced, yielding a continuum of subnetworks parameterized by $s$. At $s = 1$ the model is unchanged; as $s \to 0$, both parameter count and computation are reduced.
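The slicing above can be made concrete with a minimal NumPy sketch, assuming a simple layout where the retained channels are the first $\lfloor s \cdot d \rfloor$ indices; the helper name and the FFN dimensions used in the example are illustrative, not the paper's implementation:

```python
import numpy as np

def slim_linear(W, b, s, slim_in=True, slim_out=True):
    """Slice a pretrained linear layer to width multiplier s.

    W: (d_out, d_in) weight matrix, b: (d_out,) bias.
    Hypothetical helper: keeps the first floor(s * d) channels
    on the slimmed dimension(s) without any retraining.
    """
    d_out, d_in = W.shape
    k_out = int(np.floor(s * d_out)) if slim_out else d_out
    k_in = int(np.floor(s * d_in)) if slim_in else d_in
    return W[:k_out, :k_in], b[:k_out]

# Example: slim an FFN up-projection (model dim 768 -> hidden 3072) to s = 0.25.
rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 768))
b = rng.standard_normal(3072)
# Keep the model dimension intact, slim only the hidden dimension.
W_s, b_s = slim_linear(W, b, 0.25, slim_in=False)  # W_s has shape (768, 768)
```

Applying the same slicing to every FFN and MHSA projection yields the subnetwork at width $s$.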
2. Measuring Redundancy and Evaluation Protocol
Redundancy is assessed by systematically sampling slimmed widths and benchmarking both computational savings and retention of downstream accuracy.
- FLOP Ratio: For the full model requiring $F(1)$ operations per forward pass, the slimmable variant at width $s$ requires $F(s)$ operations.
The normalized compute is $\rho(s) = F(s)/F(1)$.
- Relative Accuracy: For a downstream task with accuracy $A(s)$ at width $s$, the relative accuracy is $A_{\mathrm{rel}}(s) = A(s)/A(1)$.
- Empirical Protocol: The method is applied to 31 uniformly sampled values of $s \in (0, 1]$. At each scale, the model is sliced, $\rho(s)$ is computed, frozen features are extracted on four RS classification tasks (using KNN or linear probe), and $A_{\mathrm{rel}}(s)$ is recorded.
This protocol is applied to six state-of-the-art RS FMs (parameter counts: 86M–631M; pretraining compute: 100–58K GPU hours).
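The protocol above can be sketched as a sweep over width multipliers that records normalized compute and relative accuracy at each scale; the callbacks for feature extraction, probing, and FLOP counting are hypothetical placeholders for whatever model and benchmark are in use:

```python
import numpy as np

def sweep_widths(extract_features, evaluate, flops, n_scales=31, s_min=0.01):
    """Sweep width multipliers and record (s, rho(s), A_rel(s)) per scale.

    extract_features(s) -> frozen features at width s  (hypothetical callback)
    evaluate(features)  -> downstream accuracy, e.g. a KNN or linear probe
    flops(s)            -> forward-pass FLOPs at width s
    """
    scales = np.linspace(s_min, 1.0, n_scales)
    a_full = evaluate(extract_features(1.0))  # A(1): full-width accuracy
    f_full = flops(1.0)                       # F(1): full-width FLOPs
    results = []
    for s in scales:
        acc = evaluate(extract_features(s))
        results.append((s, flops(s) / f_full, acc / a_full))
    return results

# Toy demonstration with synthetic callbacks (not a real model).
res = sweep_widths(lambda s: s,
                   lambda f: 0.5 + 0.5 * f,   # accuracy grows with width
                   lambda s: 100.0 * s,        # FLOPs scale linearly here
                   n_scales=5, s_min=0.2)
```

At $s = 1$ the sweep returns $\rho = 1$ and $A_{\mathrm{rel}} = 1$ by construction, which makes the full-width entry a useful sanity check.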
3. Empirical Results: RS FMs vs CV Counterparts
Key quantitative findings:
- At the most extreme FLOP reduction evaluated, RS FMs retain $A_{\mathrm{rel}} > 0.71$ (over 71% of accuracy at full width).
- By contrast, a ViT-MAE pretrained on ImageNet-1K evaluated on ImageNet-100 retains less than 10% of baseline accuracy at the same FLOP level.
- This sevenfold discrepancy indicates that RS FMs harbor significantly more representational redundancy at small widths than their CV analogues.
- Non-monotonic behavior: Multiple RS FMs achieve maximum downstream accuracy at intermediate widths (up to $s \approx 0.8$), with some slimmed models slightly outperforming the full-width model, suggesting an implicit regularization effect from slimmability.
4. Mechanistic Explanations: Variance and Correlation Analyses
The origins of slimmability are investigated through two complementary metrics:
- Explained Variance Ratio (EVR):
- Given feature matrix $X \in \mathbb{R}^{n \times d}$ with singular values $\sigma_1 \geq \dots \geq \sigma_d$, the fraction of variance explained by the top $k$ components is $\mathrm{EVR}(k) = \sum_{i=1}^{k} \sigma_i^2 \big/ \sum_{i=1}^{d} \sigma_i^2$. The effective rank is the smallest $k$ for which $\mathrm{EVR}(k)$ exceeds a fixed threshold.
- RS FMs are found to concentrate most variance in a handful of top principal components, even for highly slimmed models (small $s$). This variance spreads more slowly with increasing feature dimension compared to CV MAE models. Model-specific scaling behavior is observed: monotonic for DOFA, U-shaped for Prithvi-EO, and stable for TerraMind.
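The EVR and effective rank can be computed directly from an SVD of the centered feature matrix. In this sketch the variance threshold of $0.99$ is an illustrative choice, not necessarily the paper's setting:

```python
import numpy as np

def explained_variance_ratio(X):
    """Cumulative explained-variance ratio of a centered feature matrix X (n, d)."""
    Xc = X - X.mean(axis=0)                      # center each feature dimension
    sv = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    var = sv ** 2                                # variance per principal component
    return np.cumsum(var) / var.sum()

def effective_rank(X, tau=0.99):
    """Smallest k whose top-k components explain at least tau of the variance.

    tau = 0.99 is an assumed threshold for illustration.
    """
    evr = explained_variance_ratio(X)
    return int(np.searchsorted(evr, tau) + 1)
```

A rank-1 feature matrix, for instance, has an effective rank of 1 at any reasonable threshold, mirroring the observation that slimmed RS FMs concentrate variance in few components.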
- Mean Absolute Pairwise Feature Correlation:
- For features $x \in \mathbb{R}^d$ with pairwise Pearson correlations $\rho_{ij}$ between dimensions $i$ and $j$, the mean absolute inter-feature correlation is $\bar{\rho} = \frac{2}{d(d-1)} \sum_{i < j} |\rho_{ij}|$.
- DOFA FMs show strong correlation at low $s$, decaying with width; TerraMind features retain moderate correlation across $s$; Prithvi-EO exhibits non-monotonic correlation, reflecting scale-dependent task decomposition.
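The mean absolute pairwise correlation reduces to a few lines of NumPy; `mean_abs_correlation` is a hypothetical helper operating on an $(n, d)$ feature matrix:

```python
import numpy as np

def mean_abs_correlation(X):
    """Mean absolute off-diagonal Pearson correlation of feature matrix X (n, d)."""
    C = np.corrcoef(X, rowvar=False)           # (d, d) correlations between dims
    d = C.shape[0]
    off = np.abs(C[~np.eye(d, dtype=bool)])    # drop the diagonal (self-correlation)
    return off.mean()

# Two perfectly linearly related feature dimensions give a mean of 1.0.
a = np.arange(10.0)
X = np.column_stack([a, 2.0 * a + 1.0])
rho_bar = mean_abs_correlation(X)
```

A value of $\bar{\rho}$ near 1 indicates highly redundant feature dimensions, consistent with the interpretation that redundant, distributed codes are what make aggressive slimming survivable.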
These analyses support the interpretation that RS FMs encode task-relevant information in a distributed and redundant fashion.
5. Learned Slimmable Training Regimes
The post-hoc nature of slimmability is contrasted with learned slimmable training, wherein the model is explicitly regularized to perform well at multiple channel widths during pretraining. Hackel et al. implement this via a multi-scale loss for two self-supervised learning (SSL) paradigms (MoCo, MAE):
- For each batch, sample five widths: $s_{\max}=1.0$, $s_{\min}$ decreasing with epoch, and $s_1,s_2,s_3\sim\mathrm{Uniform}[s_{\min},1.0]$.
- For each scale, compute the task loss $\mathcal{L}_{\mathrm{task}}(s)$, augmented by a distillation loss $\mathcal{L}_{\mathrm{distill}}(s)$ toward the full-width model for $s < s_{\max}$.
The total loss per sample is $\mathcal{L} = \sum_{s} \big( \mathcal{L}_{\mathrm{task}}(s) + \mathbb{1}[s < s_{\max}]\, \mathcal{L}_{\mathrm{distill}}(s) \big)$, with gradients from all scales accumulated on the shared weights.
- Empirically, slimmable training with MoCo outperforms vanilla MoCo in the low-width regime and matches full-width performance; MAE slimmable training yields mixed outcomes, underperforming on multi-label tasks but showing improved performance on certain fine-grained single-label tasks. This reflects complex interactions between the slimmability objective and self-supervised reconstruction.
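The width-sampling scheme and the multi-scale objective above can be sketched as follows; the linear annealing schedule for $s_{\min}$ and the function names are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def sample_widths(epoch, n_epochs, n_random=3, rng=None):
    """Per-batch width set: s_max = 1.0, an annealed s_min, and n_random
    uniform draws in [s_min, 1.0]. The linear anneal 1.0 -> 0.01 is assumed."""
    if rng is None:
        rng = np.random.default_rng()
    s_min = 1.0 - 0.99 * (epoch / max(n_epochs - 1, 1))
    randoms = rng.uniform(s_min, 1.0, size=n_random)
    return [1.0, s_min, *randoms]

def multi_scale_loss(task_loss, distill_loss, widths):
    """Total per-sample loss: task loss at every width, plus distillation
    toward the full-width model for every s < 1.0 (sketch of the objective)."""
    total = 0.0
    for s in widths:
        total += task_loss(s)
        if s < 1.0:
            total += distill_loss(s)
    return total
```

Because all subnetworks share the same sliced weights, gradients from every sampled width accumulate on one parameter set, which is what encourages good behavior across the whole width continuum.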
6. Deployment and Paradigm Implications
Post-hoc slimmability establishes a new operational regime for RS FMs:
- Uniform slimming enables zero-cost adaptation of large RS models to extremely constrained compute environments (e.g., edge/onboard inference) with modest accuracy loss (less than 30%) at a small fraction of the original FLOPs.
- Non-monotonic accuracy suggests practitioners should empirically sweep over $s$ to identify optimal subnets, which can sometimes outperform the original model at substantially reduced compute.
- A plausible implication is that learned slimmability can structure redundancy in future models to further promote robustness, directly challenging the established "always scale bigger" paradigm of CV-derived model scaling in RS domains (Hackel et al., 30 Jan 2026).