
Post-Hoc Slimmability in RS FMs

Updated 6 February 2026
  • Post-hoc slimmability is the ability to adapt pretrained remote sensing foundation models by proportionally reducing channel widths at inference time without retraining, leveraging inherent redundancy.
  • The method uniformly scales multi-head self-attention and feedforward layers, creating a continuum of subnetworks that efficiently trade off FLOPs and accuracy.
  • Empirical results show RS FMs retain over 71% relative accuracy at extremely slimmed widths, with some subnets even outperforming the full model under constrained compute.

Post-hoc slimmability refers to the ability to adapt pretrained transformer architectures—specifically, foundation models (FMs) for remote sensing (RS)—by uniformly reducing channel width at inference time, requiring no retraining or architectural modification. This procedure leverages representational redundancy in overparameterized models and enables deployment across a range of computational budgets, most notably in resource-constrained environments. Hackel et al. (2026) provide the definitive technical formalization and empirical analysis of this phenomenon in RS FMs, contrasting findings with established computer vision (CV) models and challenging prevailing scaling paradigms (Hackel et al., 30 Jan 2026).

1. Mathematical Formulation and Definitions

A transformer FM pretrained for RS includes feedforward network (FFN) layers with hidden size $d_h$ and multi-head self-attention (MHSA) modules with per-head dimension $d_k$. For any width multiplier $s \in (0,1]$, post-hoc slimmability reduces the width of all layers proportionally:

$$d_h' = \left\lfloor s \cdot d_h \right\rfloor, \qquad d_k' = \left\lfloor s \cdot d_k \right\rfloor.$$

  • In FFN blocks:

$$\mathbf h = \mathrm{Act}\left(\mathbf W_1[:d_h',:]\,\mathbf x + \mathbf b_1[:d_h']\right), \qquad \mathbf y = \mathbf W_2[:,:d_h']\,\mathbf h + \mathbf b_2$$

  • In MHSA blocks, for each head $h$ with index set $\mathcal I_h = [h\,d_k,\; h\,d_k + d_k')$:

$$\mathbf Q_h = \mathbf x\,\mathbf W_Q[\mathcal I_h,:], \quad \mathbf K_h = \mathbf x\,\mathbf W_K[\mathcal I_h,:], \quad \mathbf V_h = \mathbf x\,\mathbf W_V[\mathcal I_h,:]$$

No update or retraining is performed: the original weights are simply sliced, yielding a continuum of subnetworks parameterized by $s$. At $s = 1$ the model is unchanged; as $s \to 0$, both parameter count and computation are reduced accordingly.
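The slicing above can be sketched in a few lines of NumPy. This is a minimal illustration of the FFN case only, with hypothetical weight shapes and names (`W1`, `W2`, `slim_ffn`); it is not the authors' implementation.

```python
import numpy as np

def slim_ffn(W1, b1, W2, s):
    """Slice pretrained FFN weights to width multiplier s (no retraining).

    Assumed shapes: W1 (d_h, d_model), b1 (d_h,), W2 (d_model, d_h).
    Rows of W1/b1 and columns of W2 are sliced to d_h' = floor(s * d_h).
    """
    d_h = W1.shape[0]
    d_h_s = max(1, int(np.floor(s * d_h)))
    return W1[:d_h_s, :], b1[:d_h_s], W2[:, :d_h_s]

def ffn_forward(x, W1, b1, W2, b2):
    """y = W2 Act(W1 x + b1) + b2 (ReLU used here as a stand-in activation)."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

# Toy weights: d_model=8, d_h=32, slimmed to s=0.5 -> d_h'=16.
rng = np.random.default_rng(0)
d_model, d_h = 8, 32
W1, b1 = rng.normal(size=(d_h, d_model)), rng.normal(size=d_h)
W2, b2 = rng.normal(size=(d_model, d_h)), rng.normal(size=d_model)
x = rng.normal(size=d_model)

W1s, b1s, W2s = slim_ffn(W1, b1, W2, s=0.5)
y = ffn_forward(x, W1s, b1s, W2s, b2)
assert W1s.shape == (16, 8) and y.shape == (8,)
```

The MHSA case follows the same pattern, slicing each head's rows of $\mathbf W_Q$, $\mathbf W_K$, $\mathbf W_V$ to the first $d_k'$ indices.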

2. Measuring Redundancy and Evaluation Protocol

Redundancy is assessed by systematically sampling slimmed widths and benchmarking both computational savings and retention of downstream accuracy.

  • FLOP Ratio: For a full model requiring $\operatorname{FLOP}(1.0)$ operations per forward pass, the slimmed variant at width $s$ requires

$$\operatorname{FLOP}(s) \approx s \cdot \operatorname{FLOP}(1.0).$$

The normalized compute is $\rho(s) = \operatorname{FLOP}(s) / \operatorname{FLOP}(1.0)$.

  • Relative Accuracy: For a downstream task,

$$\operatorname{RelAcc}(s) = \frac{\operatorname{Acc}(s)}{\operatorname{Acc}(1.0)}$$

  • Empirical Protocol: The method is applied at 31 uniformly sampled widths $s \in [0.001, 1.0]$. At each scale, the model is sliced, $\rho(s)$ is computed, frozen features are extracted on four RS classification tasks (with KNN or a linear probe), and $\operatorname{RelAcc}(s)$ is recorded.

This protocol is applied to six state-of-the-art RS FMs (parameter counts: 86M–631M; pretraining compute: 100–58K GPU hours).
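The protocol reduces to a simple sweep over width multipliers. The sketch below uses a toy accuracy function in place of the real slice-extract-probe pipeline; the curve and all names (`sweep_slimmability`, `toy_acc`) are purely illustrative, not from the paper.

```python
import numpy as np

def sweep_slimmability(acc_fn, scales):
    """Record rho(s) and RelAcc(s) over a list of width multipliers.

    acc_fn: callable s -> downstream accuracy of the width-s subnet
            (stand-in for slicing + frozen-feature KNN/linear probing).
    """
    acc_full = acc_fn(1.0)
    results = []
    for s in scales:
        rho = s                            # FLOP(s) ~ s * FLOP(1.0)
        rel_acc = acc_fn(s) / acc_full
        results.append({"s": float(s), "rho": float(rho), "rel_acc": rel_acc})
    return results

# Toy accuracy curve with a small bump at intermediate widths, mimicking
# the paper's qualitative non-monotonic finding (numbers are made up).
toy_acc = lambda s: 0.8 * (1 - np.exp(-40 * s)) + 0.02 * np.sin(np.pi * s)

scales = np.linspace(0.001, 1.0, 31)       # 31 uniformly sampled widths
results = sweep_slimmability(toy_acc, scales)
best = max(results, key=lambda r: r["rel_acc"])
```

For this toy curve the best subnet sits at an intermediate width rather than at $s = 1$, which is the behavior reported for several RS FMs.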

3. Empirical Results: RS FMs vs CV Counterparts

Key quantitative findings:

  • At $1\%$ of full FLOPs ($s = 0.01$), RS FMs retain $\operatorname{RelAcc}(0.01) > 0.71$, i.e., over 71% of full-width accuracy.
  • By contrast, a ViT-MAE pretrained on ImageNet-1K evaluated on ImageNet-100 retains less than 10% of baseline accuracy at the same FLOP level.
  • This sevenfold discrepancy indicates that RS FMs harbor significantly more representational redundancy at small widths than their CV analogues.
  • Non-monotonic behavior: Multiple RS FMs achieve their maximum downstream accuracy at intermediate widths ($s = 0.05$–$0.8$), with some slimmed models slightly outperforming the full-width model, suggesting an implicit regularization effect from slimmability.

4. Mechanistic Explanations: Variance and Correlation Analyses

The origins of slimmability are investigated through two complementary metrics:

  • Explained Variance Ratio (EVR):

    • Given a feature matrix $\mathbf E \in \mathbb R^{n \times D}$ with singular values $\sigma_i$, the fraction of variance explained by the top $k$ components is

    $$\mathrm{EVR}(k) = \sum_{i=1}^k \mathrm{EVR}_i, \qquad \mathrm{EVR}_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2}.$$

    The effective rank is $d_\mathrm{eff} = \left(\sum_i \sigma_i\right)^2 / \sum_i \sigma_i^2$.

    • RS FMs are found to concentrate most variance in a handful of top principal components, even at highly slimmed widths (small $s$), and this variance spreads more slowly with increasing feature dimension than in CV MAE models. Model-specific scaling is observed: monotonic for DOFA, U-shaped for Prithvi-EO, and stable for TerraMind.

  • Mean Absolute Pairwise Feature Correlation:

    • For features $\mathbf e \in \mathbb R^D$, the mean absolute inter-feature correlation is

    $$\overline{|\mathrm{corr}|} = \frac{2}{D(D-1)} \sum_{i<j} |\mathrm{corr}(e_i, e_j)|.$$

    • DOFA shows strong correlation at low $s$ that decays with increasing width; TerraMind features retain moderate correlation across $s$; Prithvi-EO exhibits non-monotonic correlation, reflecting scale-dependent task decomposition.
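Both metrics are cheap to compute from an extracted feature matrix. A minimal NumPy sketch, assuming centered features and the formulas as stated above (function names are illustrative):

```python
import numpy as np

def evr_and_effective_rank(E):
    """EVR per component and effective rank from the singular values of E (n x D)."""
    E = E - E.mean(axis=0)                      # center features
    sv = np.linalg.svd(E, compute_uv=False)     # singular values, descending
    evr = sv**2 / np.sum(sv**2)                 # EVR_i = sigma_i^2 / sum_j sigma_j^2
    d_eff = np.sum(sv)**2 / np.sum(sv**2)       # (sum sigma)^2 / sum sigma^2
    return evr, d_eff

def mean_abs_pairwise_corr(E):
    """Mean |corr| over the D(D-1)/2 feature pairs."""
    C = np.corrcoef(E, rowvar=False)            # D x D correlation matrix
    iu = np.triu_indices(C.shape[0], k=1)       # strict upper triangle
    return float(np.abs(C[iu]).mean())

# Toy low-rank features: variance concentrated in 4 components.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 64))
evr, d_eff = evr_and_effective_rank(E)
corr = mean_abs_pairwise_corr(E)
```

Because the toy `E` has rank 4, the top four components explain essentially all variance and `d_eff` stays well below the ambient dimension of 64, which is the signature of redundancy these metrics are designed to surface.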

These analyses support the interpretation that RS FMs encode task-relevant information in a distributed and redundant fashion.

5. Learned Slimmable Training Regimes

The post-hoc nature of slimmability is contrasted with learned slimmable training, wherein the model is explicitly regularized to perform well at multiple channel widths during pretraining. Hackel et al. implement this via a multi-scale loss for two SSL paradigms (MoCo, MAE):

  • For each batch, sample widths: $s_{\max} = 1.0$, $s_{\min}$ decreasing over epochs, and $s_1, s_2, s_3 \sim \mathrm{Uniform}[s_{\min}, 1.0]$.
  • For each scale, compute the task loss $\mathcal L_{\mathrm{task}}(s)$, augmented by a distillation loss for $s < s_{\max}$:

$$\mathcal L_{\mathrm{distill}}(s) = \mathrm{KL}\left(f_s(x) \,\Vert\, f_{1.0}(x)\right)$$

The total loss per sample is $\mathcal L(s) = \mathcal L_{\mathrm{task}}(s) + \lambda\,\mathcal L_{\mathrm{distill}}(s)$, with gradients from all sampled scales accumulated on the shared weights.

  • Empirically, slimmable MoCo training outperforms vanilla MoCo in the low-width regime and matches it at full width; slimmable MAE training yields mixed outcomes, underperforming on multi-label tasks but improving on certain fine-grained single-label tasks. This reflects complex interactions between the slimmability objective and self-supervised reconstruction.
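The multi-scale loss can be sketched as follows. This is a schematic NumPy version under stated assumptions: `model_fn` is a hypothetical callable returning the width-$s$ subnet's output distribution, and the toy model and task loss at the bottom exist only to make the sketch runnable.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for discrete distributions (matching KL(f_s || f_1.0))."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def multi_scale_loss(model_fn, task_loss_fn, x, s_min, lam=1.0, rng=None):
    """One sample's slimmable loss over {1.0, s_min, three random widths}.

    model_fn(x, s) -> output distribution of the width-s subnet (hypothetical).
    """
    if rng is None:
        rng = np.random.default_rng()
    scales = [1.0, s_min] + list(rng.uniform(s_min, 1.0, size=3))
    out_full = model_fn(x, 1.0)              # full-width output as teacher
    total = 0.0
    for s in scales:
        out = model_fn(x, s)
        loss = task_loss_fn(out)             # L_task(s)
        if s < 1.0:                          # distill slimmed subnets to s=1.0
            loss += lam * kl_divergence(out, out_full)
        total += loss                        # gradients share one weight set
    return total

# Toy stand-ins (illustrative only): a softmax "model" and class-0 cross-entropy.
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
toy_model = lambda x, s: softmax(x * s)
toy_task = lambda p: -float(np.log(p[0]))

x = np.array([2.0, 0.5, -1.0])
loss = multi_scale_loss(toy_model, toy_task, x, s_min=0.1,
                        rng=np.random.default_rng(0))
```

In a real training loop each scale's loss would be backpropagated into the same shared weight tensors, which is what couples the subnetworks during pretraining.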

6. Deployment and Paradigm Implications

Post-hoc slimmability establishes a new operational regime for RS FMs:

  • Uniform slimming enables zero-cost adaptation of large RS models to extremely constrained compute environments (e.g., edge or onboard inference), with modest relative accuracy loss (under 30%) at $1\%$ of the original FLOPs.
  • Non-monotonic accuracy curves suggest practitioners should empirically sweep over $s$ to identify optimal subnets, which can sometimes outperform the original model while reducing compute by up to $12\times$.
  • A plausible implication is that learned slimmability can structure redundancy in future models to further promote robustness, directly challenging the established "always scale bigger" paradigm of CV-derived model scaling in RS domains (Hackel et al., 30 Jan 2026).