
MIFOMO: Foundation Model for HSI CDFSL

Updated 6 February 2026
  • MIFOMO is a foundation model framework designed for cross-domain few-shot hyperspectral image classification using a dual-branch Vision Transformer.
  • It employs coalescent projection and mixup domain adaptation to efficiently handle severe domain shifts while reducing overfitting.
  • Experimental results demonstrate state-of-the-art performance with significant accuracy gains on multiple hyperspectral image benchmarks.

The MIxup FOundation MOdel (MIFOMO) is a parameter-efficient architectural and algorithmic framework for cross-domain few-shot learning (CDFSL) in hyperspectral image (HSI) classification. MIFOMO leverages a large-scale Vision Transformer (ViT) foundation model pre-trained on remote sensing (RS) imagery and introduces specialized techniques—coalescent projection, mixup domain adaptation, and label smoothing—to rapidly and robustly adapt to extreme domain shifts, characterized by severe scarcity and discrepancy in target-domain data. The framework demonstrates substantial improvements over previous methods, attaining state-of-the-art performance on multiple HSI benchmarks (Paeedeh et al., 30 Jan 2026).

1. Foundation Model Pretraining

MIFOMO is built atop HyperSIGMA, a dual-branch ViT backbone tailored for hyperspectral data. The input is a hyperspectral cube $x_0 \in \mathbb{R}^{H \times W \times C}$. Each of the SpatialNetwork and SpectralNetwork branches tokenizes and embeds local patches into sequences of dimension $D$, processing them through $L$ stacked transformer blocks. Each block performs multi-head self-attention (MHSA) followed by a feed-forward network (FFN), with shared positional embeddings $P \in \mathbb{R}^{N \times D}$. Formally, the block operation is:

$$\begin{aligned} U_0 &= X + P, \\ U_i' &= \mathrm{MHSA}\left(\mathrm{LN}(U_{i-1})\right) + U_{i-1}, \\ U_i &= \mathrm{FFN}\left(\mathrm{LN}(U_i')\right) + U_i', \quad i = 1, \ldots, L, \\ Z &= \mathrm{LN}(U_L) \in \mathbb{R}^{N \times D}. \end{aligned}$$
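The pre-norm residual block above can be sketched in plain NumPy. This is an illustrative toy (random weights, ReLU instead of GELU, simple multi-head split), not HyperSIGMA's actual implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN: normalize each token vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(U, Wq, Wk, Wv, n_heads):
    # Multi-head self-attention with per-head slices of shared projections.
    N, D = U.shape
    dh = D // n_heads
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    out = np.zeros_like(U)
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))
        out[:, s] = A @ V[:, s]
    return out

def block(U, params, n_heads=4):
    # U_i' = MHSA(LN(U_{i-1})) + U_{i-1};  U_i = FFN(LN(U_i')) + U_i'
    Wq, Wk, Wv, W1, W2 = params
    Up = mhsa(layer_norm(U), Wq, Wk, Wv, n_heads) + U
    ffn = np.maximum(layer_norm(Up) @ W1, 0.0) @ W2  # ReLU FFN (GELU in practice)
    return ffn + Up

rng = np.random.default_rng(0)
N, D = 16, 32
X = rng.standard_normal((N, D))          # tokenized patch embeddings
P = rng.standard_normal((N, D)) * 0.02   # shared positional embeddings
params = [rng.standard_normal((D, D)) * 0.05 for _ in range(3)] + \
         [rng.standard_normal((D, 4 * D)) * 0.05,
          rng.standard_normal((4 * D, D)) * 0.05]
U = X + P                                # U_0 = X + P
for _ in range(2):                       # L = 2 stacked blocks here
    U = block(U, params)
Z = layer_norm(U)                        # Z = LN(U_L), shape (N, D)
print(Z.shape)
```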

Pretraining is carried out via a masked image modeling (MIM) objective on the HyperGlobal-450K dataset. The encoder $E$ (the backbone) and a decoder $D$ are optimized to reconstruct masked cubes:

$$\mathcal{L}_{\rm MIM} = \mathbb{E}_{x \sim \mathcal{D}_{\rm pre}} \left\| x - D(E(m(x))) \right\|_2^2.$$

Here, $m(\cdot)$ is a random masking operator, and $\sim\!4.5 \times 10^5$ samples are used to learn spectral–spatial representations that generalize across diverse RS domains.
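A minimal sketch of the MIM objective, with toy linear maps standing in for $E$ and $D$ and token-level masking for $m(\cdot)$ (the mask ratio of 0.75 is an illustrative assumption, not a figure from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_op(x, ratio=0.75):
    # m(.): zero out a random subset of patch tokens.
    m = rng.random(x.shape[0]) < ratio
    xm = x.copy()
    xm[m] = 0.0
    return xm, m

# Toy linear encoder/decoder standing in for E and D (illustrative only).
x = rng.standard_normal((64, 8))          # 64 "patches", 8 features each
W_enc = rng.standard_normal((8, 4)) * 0.1
W_dec = rng.standard_normal((4, 8)) * 0.1

xm, m = mask_op(x)
recon = (xm @ W_enc) @ W_dec              # D(E(m(x)))
loss_mim = np.mean((x - recon) ** 2)      # || x - D(E(m(x))) ||_2^2, mean form
print(loss_mim > 0)
```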

2. Coalescent Projection: Efficient Task Adaptation

To enable rapid adaptation with minimal risk of overfitting, all parameters of the HyperSIGMA backbone are kept frozen. MIFOMO introduces a learnable coalescent projection (CP) matrix $C \in \mathbb{R}^{D' \times D'}$ into each self-attention head. In standard self-attention, the scores are computed as $Q K^\top / \sqrt{D'}$. Under CP, the operation becomes:

$$\mathrm{SA}_{\rm CP}(U) = \mathrm{Softmax}\left(\frac{Q C K^\top}{\sqrt{D'}}\right) V,$$

where $Q = U W_Q$, $K = U W_K$, and $V = U W_V$. Only the matrices $C$ and the downstream classifier $h_\phi$ are trainable; the rest remain static. This reduces the number of trainable parameters considerably and mitigates overfitting, since adaptation occurs via small CP matrices initialized as the identity. Gradients from the classification loss are backpropagated through each $C$, and no separate tuning objective for CP is necessary.
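A NumPy sketch of CP attention, under the assumption of a single head with frozen random projections. With $C$ initialized as the identity, the output coincides exactly with standard self-attention:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def sa_cp(U, Wq, Wk, Wv, C):
    # Softmax(Q C K^T / sqrt(D')) V, with learnable C between Q and K.
    d = Wq.shape[1]
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    return softmax(Q @ C @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(2)
N, D, Dp = 10, 16, 16
U = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, Dp)) * 0.1 for _ in range(3))
C = np.eye(Dp)  # identity init: CP attention starts as standard attention

out_cp = sa_cp(U, Wq, Wk, Wv, C)
out_std = softmax((U @ Wq) @ (U @ Wk).T / np.sqrt(Dp)) @ (U @ Wv)
print(np.allclose(out_cp, out_std))
```

During adaptation, only `C` (one small $D' \times D'$ matrix per head) would receive gradients, leaving `Wq`, `Wk`, `Wv` frozen.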

3. Mixup Domain Adaptation (MDM)

MIFOMO employs an embedding-space mixup strategy for both intra-domain and cross-domain adaptation.

3.1. Source-Domain Mixup

Within each meta-learning episode on the source domain $\mathcal{D}_S$, mixup is performed among query samples:

$$\begin{aligned} \tilde z &= \lambda_1 \, g_\psi(x_i^S) + (1 - \lambda_1) \, g_\psi(x_j^S), \\ \tilde y &= \lambda_1 \, y_i^S + (1 - \lambda_1) \, y_j^S, \end{aligned}$$

where $\lambda_1 \sim \mathrm{Beta}(\alpha, \alpha)$. The associated loss is:

$$\mathcal{L}_{\rm mx}^S = \mathbb{E}_{(\tilde z, \tilde y)} \, \ell\bigl(h_\phi(\tilde z), \tilde y\bigr),$$

which is combined with the conventional few-shot loss $\mathcal{L}_{\rm fsl}^S$.
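A minimal sketch of embedding-space mixup for one pair of query samples, assuming precomputed embeddings $g_\psi(x)$ and one-hot labels in a 5-way episode:

```python
import numpy as np

rng = np.random.default_rng(3)

def mixup(z_i, y_i, z_j, y_j, alpha=0.2):
    # lambda_1 ~ Beta(alpha, alpha); convex combination of embeddings and labels.
    lam = rng.beta(alpha, alpha)
    return lam * z_i + (1 - lam) * z_j, lam * y_i + (1 - lam) * y_j

# Toy embeddings g_psi(x) and one-hot labels (illustrative stand-ins).
z_i, z_j = rng.standard_normal(8), rng.standard_normal(8)
y_i = np.eye(5)[0]
y_j = np.eye(5)[3]

z_mix, y_mix = mixup(z_i, y_i, z_j, y_j)
print(np.isclose(y_mix.sum(), 1.0))  # mixed soft label still sums to 1
```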

3.2. Intermediate Domain Construction

To address source–target domain shifts, the method constructs an intermediate domain by mixing samples between source pairs $(x_i^S, y_i^S)$ and pseudo-labeled target pairs $(x_j^T, \hat y_j^T)$:

$$\begin{aligned} \tilde x &= \tilde\lambda_2 \, x_i^S + (1 - \tilde\lambda_2) \, x_j^T, \\ \tilde z &= \tilde\lambda_2 \, g_\psi(x_i^S) + (1 - \tilde\lambda_2) \, g_\psi(x_j^T), \\ \tilde y &= \tilde\lambda_2 \, y_i^S + (1 - \tilde\lambda_2) \, \hat y_j^T. \end{aligned}$$

The intermediate-domain loss is:

$$\mathcal{L}_{\rm inter} = \mathbb{E}_{(\tilde x, \tilde z, \tilde y)} \left[ \ell\left(h_\phi(g_\psi(\tilde x)), \tilde y\right) + \ell\left(h_\phi(\tilde z), \tilde y\right) \right].$$
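The two-term intermediate-domain loss can be sketched as follows. Here $g_\psi$ and $h_\phi$ are toy linear maps (so the input-space and embedding-space mixes coincide exactly; with a nonlinear backbone the two loss terms would differ), and the fixed mix ratio is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

def cross_entropy(logits, y):
    # ell(h(.), y) with soft targets y.
    logp = logits - np.log(np.exp(logits).sum())
    return -(y * logp).sum()

# Toy source sample, pseudo-labeled target sample, linear g_psi / h_phi.
x_s, x_t = rng.standard_normal(12), rng.standard_normal(12)
y_s, y_t = np.eye(5)[1], np.eye(5)[4]   # y_t plays the role of a pseudo-label
G = rng.standard_normal((12, 8)) * 0.1  # stand-in for g_psi
H = rng.standard_normal((8, 5)) * 0.1   # stand-in for h_phi

lam2 = 0.6
x_mix = lam2 * x_s + (1 - lam2) * x_t                 # input-space mix
z_mix = lam2 * (x_s @ G) + (1 - lam2) * (x_t @ G)     # embedding-space mix
y_mix = lam2 * y_s + (1 - lam2) * y_t                 # mixed soft label

# L_inter: classify both the embedded mixed input and the mixed embedding.
loss_inter = cross_entropy((x_mix @ G) @ H, y_mix) + cross_entropy(z_mix @ H, y_mix)
print(loss_inter > 0)
```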

3.3. Adaptive Mixup Ratio

The mixup ratio $\lambda_2^n$ during adaptation is adjusted dynamically based on the 1-Wasserstein distance between domains:

$$q = \exp\left( -\frac{d(\widetilde{\mathcal{D}}, \mathcal{D}_S)}{\left[ d(\widetilde{\mathcal{D}}, \mathcal{D}_S) + d(\widetilde{\mathcal{D}}, \mathcal{D}_T) \right] \tau} \right), \quad \lambda_2^n = \frac{n(1 - q)}{N} + q \, \lambda_2^{n-1},$$

with temperature $\tau = 0.05$ and a uniform perturbation $\sigma = 0.2$ applied during sampling.
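A sketch of the recursive schedule, with random stand-ins for the 1-Wasserstein distances and an assumed initial $\lambda_2^0 = 1$ (fully source-side) plus clipping to $[0, 1]$, both illustrative choices not stated in the source:

```python
import numpy as np

def adaptive_lambda(d_s, d_t, n, N, lam_prev, tau=0.05):
    # q -> 1 when the intermediate domain is far closer to the target than
    # the source; the update then retains more of the previous ratio.
    q = np.exp(-d_s / ((d_s + d_t) * tau))
    return n * (1 - q) / N + q * lam_prev

rng = np.random.default_rng(5)
lam = 1.0       # assumed start: fully source-side
N = 10
for n in range(1, N + 1):
    # d_s, d_t would be 1-Wasserstein distances; random stand-ins here.
    d_s, d_t = rng.uniform(0.1, 1.0, size=2)
    lam = adaptive_lambda(d_s, d_t, n, N, lam)
    lam += rng.uniform(-0.2, 0.2)            # sigma = 0.2 uniform perturbation
    lam = float(np.clip(lam, 0.0, 1.0))      # keep a valid mix ratio
print(0.0 <= lam <= 1.0)
```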

4. Label Smoothing via Graph-Based Propagation

To generate stable pseudo-labels under few-shot conditions, label propagation is performed on a similarity graph constructed from backbone features of the target-domain query set $\mathcal{Q}_t$. Starting from one-hot initial pseudo-labels $\hat Y$ and a normalized adjacency matrix $\hat A$, the iterative update is:

$$F^{(t+1)} = \alpha \, \hat A \, F^{(t)} + (1 - \alpha) \, \hat Y.$$

The stationary solution is:

$$F^* = (I - \alpha \, \hat A)^\dagger \, \hat Y,$$

where $\alpha \in [0, 1]$ controls propagation smoothness. The final pseudo-labels are chosen as $\hat y_i = \arg\max_c F^*_{i,c}$. This mechanism suppresses noisy pseudo-labels and enforces neighborhood consistency.
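A self-contained sketch of the closed-form propagation on a toy query set, using a Gaussian similarity graph with symmetric normalization (a common construction; the source does not specify the kernel). The pseudoinverse follows the $\dagger$ in the text; note the classical scheme carries an extra $(1-\alpha)$ scale, which leaves the argmax unchanged:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy backbone features for a 12-sample target query set, 3 classes.
feats = rng.standard_normal((12, 5))
Y0 = np.eye(3)[rng.integers(0, 3, size=12)]   # one-hot initial pseudo-labels

# Symmetric-normalized adjacency A_hat = D^{-1/2} W D^{-1/2} from a
# Gaussian similarity graph with zeroed diagonal.
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)
np.fill_diagonal(W, 0.0)
Dinv = np.diag(1.0 / np.sqrt(W.sum(1)))
A_hat = Dinv @ W @ Dinv

alpha = 0.8
# Stationary solution F* = (I - alpha A_hat)^+ Y_hat.
F_star = np.linalg.pinv(np.eye(12) - alpha * A_hat) @ Y0
pseudo = F_star.argmax(1)                     # smoothed pseudo-labels
print(pseudo.shape)
```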

5. Training Protocol and Objective

MIFOMO employs a meta-learning training regime, alternating three phases per iteration:

  • (a) Source-domain updates with LS=LfslS+LmxS\mathcal{L}_S = \mathcal{L}_{\rm fsl}^S + \mathcal{L}_{\rm mx}^S.
  • (b) Target-domain warmup and pseudo-label propagation.
  • (c) Intermediate-domain updates using Linter\mathcal{L}_{\rm inter}.

Only the coalescent projection matrices $C$ and the final classification head $h_\phi$ are updated, while all backbone weights remain frozen. The overall objective is:

$$\mathcal{L} = \mathcal{L}_{\rm CE} + \beta_{\rm MDM} \, \mathcal{L}_{\rm MDM} + \beta_{\rm LS} \left\| F^* - \hat Y \right\|^2,$$

with $\beta_{\rm MDM}$ and $\beta_{\rm LS}$ weighting domain adaptation and label smoothing, respectively.
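The overall objective combines as a simple weighted sum; a minimal sketch, where the $\beta$ values and the toy inputs are placeholders rather than the paper's settings:

```python
import numpy as np

def total_loss(l_ce, l_mdm, F_star, Y0, beta_mdm=1.0, beta_ls=0.1):
    # L = L_CE + beta_MDM * L_MDM + beta_LS * ||F* - Y_hat||^2
    return l_ce + beta_mdm * l_mdm + beta_ls * np.sum((F_star - Y0) ** 2)

rng = np.random.default_rng(7)
F_star = rng.random((6, 3))                   # propagated label scores
Y0 = np.eye(3)[rng.integers(0, 3, size=6)]    # initial one-hot pseudo-labels
L = total_loss(l_ce=1.2, l_mdm=0.4, F_star=F_star, Y0=Y0)
print(L > 0)
```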

Typical hyperparameter settings are: $N = 5$, $K = 5$, $Q = 15$; mixup $\alpha = 0.2$; perturbation $\sigma = 0.2$; label-propagation $\alpha \in [0, 1]$; learning rate $1 \times 10^{-4}$; and $\sim 600$ episodes per domain.

6. Benchmark Performance and Ablation

MIFOMO establishes new state-of-the-art results for 5-shot overall accuracy (OA) on four widely used HSI datasets:

| Dataset | Best Prior OA | MIFOMO OA | Gain |
|---|---|---|---|
| Indian Pines | 81.32% | 95.44% | +14.1 pp |
| Pavia University | 94.19% | 97.76% | +3.6 pp |
| Salinas | 94.19% | 96.57% | +2.4 pp |
| Houston | 82.63% | 95.86% | +13.2 pp |

Ablation studies on Indian Pines (5-shot) quantify each module’s contribution:

| Configuration | OA |
|---|---|
| Full MIFOMO | 95.44% |
| – Label Smoothing | 68.22% |
| – Intermediate Domain | 92.29% |
| – Coalescent Projection (CP) | 94.57% |
| – Mixup | 93.25% |

t-SNE visualizations show that MIFOMO’s embeddings form well-separated clusters for unseen target classes, indicating strong cross-domain generalization and transferability.

7. Key Contributions and Significance

MIFOMO introduces a new instantiation of foundation model transfer for RS imagery by (i) using a large-scale dual-branch ViT foundation, (ii) enabling highly parameter-efficient adaptation via coalescent projection, (iii) robustifying domain adaptation through embedding-space mixup and cross-domain mixing, and (iv) stabilizing pseudo-labels using graph-based label smoothing. The integrated approach yields efficient and robust few-shot adaptation under severe domain shifts, addressing limitations of prior methodologies reliant on extensive data augmentation or large-scale fine-tuning (Paeedeh et al., 30 Jan 2026). The source code is available for reproducibility and continued research.

