Post-Hoc Slimmability in RS FMs
- Post-hoc slimmability is the ability to adapt pretrained remote sensing foundation models by proportionally reducing channel widths at inference time without retraining, leveraging inherent redundancy.
- The method uniformly scales multi-head self-attention and feedforward layers, creating a continuum of subnetworks that efficiently trade off FLOPs and accuracy.
- Empirical results show RS FMs retain over 71% relative accuracy at extremely slimmed widths, with some subnets even outperforming the full model under constrained compute.
Post-hoc slimmability refers to the ability to adapt pretrained transformer architectures—specifically, foundation models (FMs) for remote sensing (RS)—by uniformly reducing channel width at inference time, requiring no retraining or architectural modification. This procedure leverages representational redundancy in overparameterized models and enables deployment across a range of computational budgets, most notably in resource-constrained environments. Hackel et al. (2026) provide the definitive technical formalization and empirical analysis of this phenomenon in RS FMs, contrasting findings with established computer vision (CV) models and challenging prevailing scaling paradigms (Hackel et al., 30 Jan 2026).
1. Mathematical Formulation and Definitions
A transformer FM pretrained for RS includes feedforward network (FFN) layers with hidden size $d_{\mathrm{ff}}$ and multi-head self-attention (MHSA) modules with per-head dimension $d_h$. For any width multiplier $s \in (0, 1]$, post-hoc slimmability reduces the width of all layers proportionally:
- In FFN blocks: the hidden dimension is reduced to $\lfloor s \cdot d_{\mathrm{ff}} \rfloor$, with weight matrices sliced to the retained channels.
- In MHSA blocks: for each head $h$ in the head index set $\{1, \dots, H\}$, the per-head dimension is reduced to $\lfloor s \cdot d_h \rfloor$.
No update or retraining is performed. The original weights are sliced, yielding a continuum of subnetworks parameterized by $s$. At $s = 1$ the model is unchanged; as $s \to 0$, both parameter count and computation are reduced.
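The slicing above can be made concrete with a minimal NumPy sketch, assuming a simple layout where the retained channels are the first $\lfloor s \cdot d \rfloor$ indices; the helper name and the FFN dimensions used in the example are illustrative, not the paper's implementation:

```python
import numpy as np

def slim_linear(W, b, s, slim_in=True, slim_out=True):
    """Slice a pretrained linear layer to width multiplier s.

    W: (d_out, d_in) weight matrix, b: (d_out,) bias.
    Hypothetical helper: keeps the first floor(s * d) channels
    on the slimmed dimension(s) without any retraining.
    """
    d_out, d_in = W.shape
    k_out = int(np.floor(s * d_out)) if slim_out else d_out
    k_in = int(np.floor(s * d_in)) if slim_in else d_in
    return W[:k_out, :k_in], b[:k_out]

# Example: slim an FFN up-projection (model dim 768 -> hidden 3072) to s = 0.25.
rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 768))
b = rng.standard_normal(3072)
# Keep the model dimension intact, slim only the hidden dimension.
W_s, b_s = slim_linear(W, b, 0.25, slim_in=False)  # W_s has shape (768, 768)
```

Applying the same slicing to every FFN and MHSA projection yields the subnetwork at width $s$.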
2. Measuring Redundancy and Evaluation Protocol
Redundancy is assessed by systematically sampling slimmed widths and benchmarking both computational savings and retention of downstream accuracy.
- FLOP Ratio: For the full model requiring $F(1)$ operations per forward pass, the slimmable variant at width $s$ requires $F(s)$ operations.
The normalized compute is $\rho(s) = F(s)/F(1)$.
- Relative Accuracy: For a downstream task with accuracy $A(s)$ at width $s$, the relative accuracy is $A_{\mathrm{rel}}(s) = A(s)/A(1)$.
- Empirical Protocol: The method is applied to 31 uniformly sampled values of $s \in (0, 1]$. At each scale, the model is sliced, $\rho(s)$ is computed, frozen features are extracted on four RS classification tasks (using KNN or linear probe), and $A_{\mathrm{rel}}(s)$ is recorded.
This protocol is applied to six state-of-the-art RS FMs (parameter counts: 86M–631M; pretraining compute: 100–58K GPU hours).
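The protocol above can be sketched as a sweep over width multipliers that records normalized compute and relative accuracy at each scale; the callbacks for feature extraction, probing, and FLOP counting are hypothetical placeholders for whatever model and benchmark are in use:

```python
import numpy as np

def sweep_widths(extract_features, evaluate, flops, n_scales=31, s_min=0.01):
    """Sweep width multipliers and record (s, rho(s), A_rel(s)) per scale.

    extract_features(s) -> frozen features at width s  (hypothetical callback)
    evaluate(features)  -> downstream accuracy, e.g. a KNN or linear probe
    flops(s)            -> forward-pass FLOPs at width s
    """
    scales = np.linspace(s_min, 1.0, n_scales)
    a_full = evaluate(extract_features(1.0))  # A(1): full-width accuracy
    f_full = flops(1.0)                       # F(1): full-width FLOPs
    results = []
    for s in scales:
        acc = evaluate(extract_features(s))
        results.append((s, flops(s) / f_full, acc / a_full))
    return results

# Toy demonstration with synthetic callbacks (not a real model).
res = sweep_widths(lambda s: s,
                   lambda f: 0.5 + 0.5 * f,   # accuracy grows with width
                   lambda s: 100.0 * s,        # FLOPs scale linearly here
                   n_scales=5, s_min=0.2)
```

At $s = 1$ the sweep returns $\rho = 1$ and $A_{\mathrm{rel}} = 1$ by construction, which makes the full-width entry a useful sanity check.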
3. Empirical Results: RS FMs vs CV Counterparts
Key quantitative findings:
- At the most extreme FLOP reduction evaluated, RS FMs retain $A_{\mathrm{rel}} > 0.71$ (over 71% of accuracy at full width).
- By contrast, a ViT-MAE pretrained on ImageNet-1K evaluated on ImageNet-100 retains less than 10% of baseline accuracy at the same FLOP level.
- This sevenfold discrepancy indicates that RS FMs harbor significantly more representational redundancy at small widths than their CV analogues.
- Non-monotonic behavior: Multiple RS FMs achieve maximum downstream accuracy at intermediate widths (up to $s \approx 0.8$), with some slimmed models slightly outperforming the full-width model, suggesting an implicit regularization effect from slimmability.
4. Mechanistic Explanations: Variance and Correlation Analyses
The origins of slimmability are investigated through two complementary metrics:
- Explained Variance Ratio (EVR):
- Given feature matrix $X \in \mathbb{R}^{n \times d}$ with singular values $\sigma_1 \geq \dots \geq \sigma_d$, the fraction of variance explained by the top $k$ components is $\mathrm{EVR}(k) = \sum_{i=1}^{k} \sigma_i^2 \big/ \sum_{i=1}^{d} \sigma_i^2$. The effective rank is the smallest $k$ for which $\mathrm{EVR}(k)$ exceeds a fixed threshold.
- RS FMs are found to concentrate most variance in a handful of top principal components, even for highly slimmed models (small $s$). This variance spreads more slowly with increasing feature dimension compared to CV MAE models. Model-specific scaling behavior is observed: monotonic for DOFA, U-shaped for Prithvi-EO, and stable for TerraMind.
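The EVR and effective rank can be computed directly from an SVD of the centered feature matrix. In this sketch the variance threshold of $0.99$ is an illustrative choice, not necessarily the paper's setting:

```python
import numpy as np

def explained_variance_ratio(X):
    """Cumulative explained-variance ratio of a centered feature matrix X (n, d)."""
    Xc = X - X.mean(axis=0)                      # center each feature dimension
    sv = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    var = sv ** 2                                # variance per principal component
    return np.cumsum(var) / var.sum()

def effective_rank(X, tau=0.99):
    """Smallest k whose top-k components explain at least tau of the variance.

    tau = 0.99 is an assumed threshold for illustration.
    """
    evr = explained_variance_ratio(X)
    return int(np.searchsorted(evr, tau) + 1)
```

A rank-1 feature matrix, for instance, has an effective rank of 1 at any reasonable threshold, mirroring the observation that slimmed RS FMs concentrate variance in few components.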
- Mean Absolute Pairwise Feature Correlation:
- For features $x \in \mathbb{R}^d$ with pairwise Pearson correlations $\rho_{ij}$ between dimensions $i$ and $j$, the mean absolute inter-feature correlation is $\bar{\rho} = \frac{2}{d(d-1)} \sum_{i < j} |\rho_{ij}|$.
- DOFA FMs show strong correlation at low $s$, decaying with width; TerraMind features retain moderate correlation across $s$; Prithvi-EO exhibits non-monotonic correlation, reflecting scale-dependent task decomposition.
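The mean absolute pairwise correlation reduces to a few lines of NumPy; `mean_abs_correlation` is a hypothetical helper operating on an $(n, d)$ feature matrix:

```python
import numpy as np

def mean_abs_correlation(X):
    """Mean absolute off-diagonal Pearson correlation of feature matrix X (n, d)."""
    C = np.corrcoef(X, rowvar=False)           # (d, d) correlations between dims
    d = C.shape[0]
    off = np.abs(C[~np.eye(d, dtype=bool)])    # drop the diagonal (self-correlation)
    return off.mean()

# Two perfectly linearly related feature dimensions give a mean of 1.0.
a = np.arange(10.0)
X = np.column_stack([a, 2.0 * a + 1.0])
rho_bar = mean_abs_correlation(X)
```

A value of $\bar{\rho}$ near 1 indicates highly redundant feature dimensions, consistent with the interpretation that redundant, distributed codes are what make aggressive slimming survivable.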
These analyses support the interpretation that RS FMs encode task-relevant information in a distributed and redundant fashion.
5. Learned Slimmable Training Regimes
The post-hoc nature of slimmability is contrasted with learned slimmable training, wherein the model is explicitly regularized to perform well at multiple channel widths during pretraining. Hackel et al. implement this via a multi-scale loss for two self-supervised learning (SSL) paradigms (MoCo, MAE):
- For each batch, sample five widths: $s_{\max}=1.0$, $s_{\min}$ decreasing with epoch, and $s_1,s_2,s_3\sim\mathrm{Uniform}[s_{\min},1.0]$.
- For each scale, compute the task loss $\mathcal{L}_{\mathrm{task}}(s)$, augmented by a distillation loss $\mathcal{L}_{\mathrm{distill}}(s)$ toward the full-width model for $s < s_{\max}$.
The total loss per sample is $\mathcal{L} = \sum_{s} \big( \mathcal{L}_{\mathrm{task}}(s) + \mathbb{1}[s < s_{\max}]\, \mathcal{L}_{\mathrm{distill}}(s) \big)$, with gradients from all scales accumulated on the shared weights.
- Empirically, slimmable training with MoCo outperforms vanilla MoCo in the low-width regime and matches full-width performance; MAE slimmable training yields mixed outcomes, underperforming on multi-label tasks but showing improved performance on certain fine-grained single-label tasks. This reflects complex interactions between the slimmability objective and self-supervised reconstruction.
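The width-sampling scheme and the multi-scale objective above can be sketched as follows; the linear annealing schedule for $s_{\min}$ and the function names are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def sample_widths(epoch, n_epochs, n_random=3, rng=None):
    """Per-batch width set: s_max = 1.0, an annealed s_min, and n_random
    uniform draws in [s_min, 1.0]. The linear anneal 1.0 -> 0.01 is assumed."""
    if rng is None:
        rng = np.random.default_rng()
    s_min = 1.0 - 0.99 * (epoch / max(n_epochs - 1, 1))
    randoms = rng.uniform(s_min, 1.0, size=n_random)
    return [1.0, s_min, *randoms]

def multi_scale_loss(task_loss, distill_loss, widths):
    """Total per-sample loss: task loss at every width, plus distillation
    toward the full-width model for every s < 1.0 (sketch of the objective)."""
    total = 0.0
    for s in widths:
        total += task_loss(s)
        if s < 1.0:
            total += distill_loss(s)
    return total
```

Because all subnetworks share the same sliced weights, gradients from every sampled width accumulate on one parameter set, which is what encourages good behavior across the whole width continuum.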
6. Deployment and Paradigm Implications
Post-hoc slimmability establishes a new operational regime for RS FMs:
- Uniform slimming enables zero-cost adaptation of large RS models to extremely constrained compute environments (e.g., edge/onboard inference) with modest accuracy loss (less than 30%) at a small fraction of the original FLOPs.
- Non-monotonic accuracy suggests practitioners should empirically sweep over $s$ to identify optimal subnets, which can sometimes outperform the original model at substantially reduced compute.
- A plausible implication is that learned slimmability can structure redundancy in future models to further promote robustness, directly challenging the established "always scale bigger" paradigm of CV-derived model scaling in RS domains (Hackel et al., 30 Jan 2026).