
MIFOMO: Foundation Model for HSI CDFSL

Updated 6 February 2026
  • MIFOMO is a foundation model framework designed for cross-domain few-shot hyperspectral image classification using a dual-branch Vision Transformer.
  • It employs coalescent projection and mixup domain adaptation to efficiently handle severe domain shifts while reducing overfitting.
  • Experimental results demonstrate state-of-the-art performance with significant accuracy gains on multiple hyperspectral image benchmarks.

The MIxup FOundation MOdel (MIFOMO) is a parameter-efficient architectural and algorithmic framework for cross-domain few-shot learning (CDFSL) in hyperspectral image (HSI) classification. MIFOMO leverages a large-scale Vision Transformer (ViT) foundation model pre-trained on remote sensing (RS) imagery and introduces specialized techniques—coalescent projection, mixup domain adaptation, and label smoothing—to rapidly and robustly adapt to extreme domain shifts, characterized by severe scarcity and discrepancy in target-domain data. The framework demonstrates substantial improvements over previous methods, attaining state-of-the-art performance on multiple HSI benchmarks (Paeedeh et al., 30 Jan 2026).

1. Foundation Model Pretraining

MIFOMO is built atop HyperSIGMA, a dual-branch ViT backbone tailored for hyperspectral data. The input is a hyperspectral cube $x_0 \in \mathbb{R}^{H \times W \times C}$. Each of the SpatialNetwork and SpectralNetwork branches tokenizes and embeds local patches into sequences of dimension $D$, processing them through $L$ stacked transformer blocks. Each block performs multi-head self-attention (MHSA) followed by a feed-forward network (FFN), with shared positional embeddings $P \in \mathbb{R}^{N \times D}$. Formally, the block operation is:

$$\begin{aligned} U_0 &= X + P, \\ U_i' &= \mathrm{MHSA}\left(\mathrm{LN}(U_{i-1})\right) + U_{i-1}, \\ U_i &= \mathrm{FFN}\left(\mathrm{LN}(U_i')\right) + U_i', \quad i = 1, \ldots, L, \\ Z &= \mathrm{LN}(U_L) \in \mathbb{R}^{N \times D}. \end{aligned}$$
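The pre-norm residual block above can be sketched in plain NumPy. This is an illustrative toy (random weights, ReLU instead of GELU, simple multi-head split), not HyperSIGMA's actual implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN: normalize each token vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(U, Wq, Wk, Wv, n_heads):
    # Multi-head self-attention with per-head slices of shared projections.
    N, D = U.shape
    dh = D // n_heads
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    out = np.zeros_like(U)
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))
        out[:, s] = A @ V[:, s]
    return out

def block(U, params, n_heads=4):
    # U_i' = MHSA(LN(U_{i-1})) + U_{i-1};  U_i = FFN(LN(U_i')) + U_i'
    Wq, Wk, Wv, W1, W2 = params
    Up = mhsa(layer_norm(U), Wq, Wk, Wv, n_heads) + U
    ffn = np.maximum(layer_norm(Up) @ W1, 0.0) @ W2  # ReLU FFN (GELU in practice)
    return ffn + Up

rng = np.random.default_rng(0)
N, D = 16, 32
X = rng.standard_normal((N, D))          # tokenized patch embeddings
P = rng.standard_normal((N, D)) * 0.02   # shared positional embeddings
params = [rng.standard_normal((D, D)) * 0.05 for _ in range(3)] + \
         [rng.standard_normal((D, 4 * D)) * 0.05,
          rng.standard_normal((4 * D, D)) * 0.05]
U = X + P                                # U_0 = X + P
for _ in range(2):                       # L = 2 stacked blocks here
    U = block(U, params)
Z = layer_norm(U)                        # Z = LN(U_L), shape (N, D)
print(Z.shape)
```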

Pretraining is carried out via a masked image modeling (MIM) objective on the HyperGlobal-450K dataset. The encoder $E$ (the backbone) and a decoder $D$ are optimized to reconstruct masked cubes:

$$\mathcal{L}_{\rm MIM} = \mathbb{E}_{x \sim \mathcal{D}_{\rm pre}} \left\| x - D(E(m(x))) \right\|_2^2.$$

Here, $m(\cdot)$ is a random masking operator, and $\sim\!4.5 \times 10^5$ samples are used to learn spectral–spatial representations that generalize across diverse RS domains.
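A minimal sketch of the MIM objective, with toy linear maps standing in for $E$ and $D$ and token-level masking for $m(\cdot)$ (the mask ratio of 0.75 is an illustrative assumption, not a figure from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_op(x, ratio=0.75):
    # m(.): zero out a random subset of patch tokens.
    m = rng.random(x.shape[0]) < ratio
    xm = x.copy()
    xm[m] = 0.0
    return xm, m

# Toy linear encoder/decoder standing in for E and D (illustrative only).
x = rng.standard_normal((64, 8))          # 64 "patches", 8 features each
W_enc = rng.standard_normal((8, 4)) * 0.1
W_dec = rng.standard_normal((4, 8)) * 0.1

xm, m = mask_op(x)
recon = (xm @ W_enc) @ W_dec              # D(E(m(x)))
loss_mim = np.mean((x - recon) ** 2)      # || x - D(E(m(x))) ||_2^2, mean form
print(loss_mim > 0)
```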

2. Coalescent Projection: Efficient Task Adaptation

To enable rapid adaptation with minimal risk of overfitting, all parameters of the HyperSIGMA backbone are kept frozen. MIFOMO introduces a learnable coalescent projection (CP) matrix $C \in \mathbb{R}^{D' \times D'}$ into each self-attention head. In standard self-attention, the scores are computed as $Q K^\top / \sqrt{D'}$. Under CP, the operation becomes:

$$\mathrm{SA}_{\rm CP}(U) = \mathrm{Softmax}\left(\frac{Q C K^\top}{\sqrt{D'}}\right) V,$$

where $Q = U W_Q$, $K = U W_K$, and $V = U W_V$. Only the matrices $C$ and the downstream classifier $h_\phi$ are trainable; the rest remain static. This reduces the number of trainable parameters considerably and mitigates overfitting, since adaptation occurs via small CP matrices initialized as the identity. Gradients from the classification loss are backpropagated through each $C$, and no separate tuning objective for CP is necessary.
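A NumPy sketch of CP attention, under the assumption of a single head with frozen random projections. With $C$ initialized as the identity, the output coincides exactly with standard self-attention:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def sa_cp(U, Wq, Wk, Wv, C):
    # Softmax(Q C K^T / sqrt(D')) V, with learnable C between Q and K.
    d = Wq.shape[1]
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    return softmax(Q @ C @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(2)
N, D, Dp = 10, 16, 16
U = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, Dp)) * 0.1 for _ in range(3))
C = np.eye(Dp)  # identity init: CP attention starts as standard attention

out_cp = sa_cp(U, Wq, Wk, Wv, C)
out_std = softmax((U @ Wq) @ (U @ Wk).T / np.sqrt(Dp)) @ (U @ Wv)
print(np.allclose(out_cp, out_std))
```

During adaptation, only `C` (one small $D' \times D'$ matrix per head) would receive gradients, leaving `Wq`, `Wk`, `Wv` frozen.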

3. Mixup Domain Adaptation (MDM)

MIFOMO employs an embedding-space mixup strategy for both intra-domain and cross-domain adaptation.

3.1. Source-Domain Mixup

Within each meta-learning episode on the source domain $\mathcal{D}_S$, mixup is performed among query samples:

$$\begin{aligned} \tilde z &= \lambda_1 \, g_\psi(x_i^S) + (1 - \lambda_1) \, g_\psi(x_j^S), \\ \tilde y &= \lambda_1 \, y_i^S + (1 - \lambda_1) \, y_j^S, \end{aligned}$$

where $\lambda_1 \sim \mathrm{Beta}(\alpha, \alpha)$. The associated loss is:

$$\mathcal{L}_{\rm mx}^S = \mathbb{E}_{(\tilde z, \tilde y)} \, \ell\bigl(h_\phi(\tilde z), \tilde y\bigr),$$

which is combined with the conventional few-shot loss $\mathcal{L}_{\rm fsl}^S$.
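A minimal sketch of embedding-space mixup for one pair of query samples, assuming precomputed embeddings $g_\psi(x)$ and one-hot labels in a 5-way episode:

```python
import numpy as np

rng = np.random.default_rng(3)

def mixup(z_i, y_i, z_j, y_j, alpha=0.2):
    # lambda_1 ~ Beta(alpha, alpha); convex combination of embeddings and labels.
    lam = rng.beta(alpha, alpha)
    return lam * z_i + (1 - lam) * z_j, lam * y_i + (1 - lam) * y_j

# Toy embeddings g_psi(x) and one-hot labels (illustrative stand-ins).
z_i, z_j = rng.standard_normal(8), rng.standard_normal(8)
y_i = np.eye(5)[0]
y_j = np.eye(5)[3]

z_mix, y_mix = mixup(z_i, y_i, z_j, y_j)
print(np.isclose(y_mix.sum(), 1.0))  # mixed soft label still sums to 1
```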

3.2. Intermediate Domain Construction

To address source–target domain shifts, the method constructs an intermediate domain by mixing samples between source pairs $(x_i^S, y_i^S)$ and pseudo-labeled target pairs $(x_j^T, \hat y_j^T)$:

$$\begin{aligned} \tilde x &= \tilde\lambda_2 \, x_i^S + (1 - \tilde\lambda_2) \, x_j^T, \\ \tilde z &= \tilde\lambda_2 \, g_\psi(x_i^S) + (1 - \tilde\lambda_2) \, g_\psi(x_j^T), \\ \tilde y &= \tilde\lambda_2 \, y_i^S + (1 - \tilde\lambda_2) \, \hat y_j^T. \end{aligned}$$

The intermediate-domain loss is:

$$\mathcal{L}_{\rm inter} = \mathbb{E}_{(\tilde x, \tilde z, \tilde y)} \left[ \ell\left(h_\phi(g_\psi(\tilde x)), \tilde y\right) + \ell\left(h_\phi(\tilde z), \tilde y\right) \right].$$
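The two-term intermediate-domain loss can be sketched as follows. Here $g_\psi$ and $h_\phi$ are toy linear maps (so the input-space and embedding-space mixes coincide exactly; with a nonlinear backbone the two loss terms would differ), and the fixed mix ratio is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

def cross_entropy(logits, y):
    # ell(h(.), y) with soft targets y.
    logp = logits - np.log(np.exp(logits).sum())
    return -(y * logp).sum()

# Toy source sample, pseudo-labeled target sample, linear g_psi / h_phi.
x_s, x_t = rng.standard_normal(12), rng.standard_normal(12)
y_s, y_t = np.eye(5)[1], np.eye(5)[4]   # y_t plays the role of a pseudo-label
G = rng.standard_normal((12, 8)) * 0.1  # stand-in for g_psi
H = rng.standard_normal((8, 5)) * 0.1   # stand-in for h_phi

lam2 = 0.6
x_mix = lam2 * x_s + (1 - lam2) * x_t                 # input-space mix
z_mix = lam2 * (x_s @ G) + (1 - lam2) * (x_t @ G)     # embedding-space mix
y_mix = lam2 * y_s + (1 - lam2) * y_t                 # mixed soft label

# L_inter: classify both the embedded mixed input and the mixed embedding.
loss_inter = cross_entropy((x_mix @ G) @ H, y_mix) + cross_entropy(z_mix @ H, y_mix)
print(loss_inter > 0)
```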

3.3. Adaptive Mixup Ratio

The mixup ratio $\lambda_2^n$ during adaptation is adjusted dynamically based on the 1-Wasserstein distance between domains:

$$q = \exp\left( -\frac{d(\widetilde{\mathcal{D}}, \mathcal{D}_S)}{\left[ d(\widetilde{\mathcal{D}}, \mathcal{D}_S) + d(\widetilde{\mathcal{D}}, \mathcal{D}_T) \right] \tau} \right), \quad \lambda_2^n = \frac{n(1 - q)}{N} + q \, \lambda_2^{n-1},$$

with temperature $\tau = 0.05$ and a uniform perturbation $\sigma = 0.2$ applied during sampling.
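A sketch of the recursive schedule, with random stand-ins for the 1-Wasserstein distances and an assumed initial $\lambda_2^0 = 1$ (fully source-side) plus clipping to $[0, 1]$, both illustrative choices not stated in the source:

```python
import numpy as np

def adaptive_lambda(d_s, d_t, n, N, lam_prev, tau=0.05):
    # q -> 1 when the intermediate domain is far closer to the target than
    # the source; the update then retains more of the previous ratio.
    q = np.exp(-d_s / ((d_s + d_t) * tau))
    return n * (1 - q) / N + q * lam_prev

rng = np.random.default_rng(5)
lam = 1.0       # assumed start: fully source-side
N = 10
for n in range(1, N + 1):
    # d_s, d_t would be 1-Wasserstein distances; random stand-ins here.
    d_s, d_t = rng.uniform(0.1, 1.0, size=2)
    lam = adaptive_lambda(d_s, d_t, n, N, lam)
    lam += rng.uniform(-0.2, 0.2)            # sigma = 0.2 uniform perturbation
    lam = float(np.clip(lam, 0.0, 1.0))      # keep a valid mix ratio
print(0.0 <= lam <= 1.0)
```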

4. Label Smoothing via Graph-Based Propagation

To generate stable pseudo-labels under few-shot conditions, label propagation is performed on a similarity graph constructed from backbone features of the target-domain query set $\mathcal{Q}_t$. Starting from one-hot initial pseudo-labels $\hat Y$ and a normalized adjacency matrix $\hat A$, the iterative update is:

$$F^{(t+1)} = \alpha \, \hat A \, F^{(t)} + (1 - \alpha) \, \hat Y.$$

The stationary solution is:

$$F^* = (I - \alpha \, \hat A)^\dagger \, \hat Y,$$

where $\alpha \in [0, 1]$ controls propagation smoothness. The final pseudo-labels are chosen as $\hat y_i = \arg\max_c F^*_{i,c}$. This mechanism suppresses noisy pseudo-labels and enforces neighborhood consistency.
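A self-contained sketch of the closed-form propagation on a toy query set, using a Gaussian similarity graph with symmetric normalization (a common construction; the source does not specify the kernel). The pseudoinverse follows the $\dagger$ in the text; note the classical scheme carries an extra $(1-\alpha)$ scale, which leaves the argmax unchanged:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy backbone features for a 12-sample target query set, 3 classes.
feats = rng.standard_normal((12, 5))
Y0 = np.eye(3)[rng.integers(0, 3, size=12)]   # one-hot initial pseudo-labels

# Symmetric-normalized adjacency A_hat = D^{-1/2} W D^{-1/2} from a
# Gaussian similarity graph with zeroed diagonal.
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)
np.fill_diagonal(W, 0.0)
Dinv = np.diag(1.0 / np.sqrt(W.sum(1)))
A_hat = Dinv @ W @ Dinv

alpha = 0.8
# Stationary solution F* = (I - alpha A_hat)^+ Y_hat.
F_star = np.linalg.pinv(np.eye(12) - alpha * A_hat) @ Y0
pseudo = F_star.argmax(1)                     # smoothed pseudo-labels
print(pseudo.shape)
```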

5. Training Protocol and Objective

MIFOMO employs a meta-learning training regime, alternating three phases per iteration:

  • (a) Source-domain updates with LS=LfslS+LmxS\mathcal{L}_S = \mathcal{L}_{\rm fsl}^S + \mathcal{L}_{\rm mx}^S.
  • (b) Target-domain warmup and pseudo-label propagation.
  • (c) Intermediate-domain updates using Linter\mathcal{L}_{\rm inter}.

Only the coalescent projection matrices $C$ and the final classification head $h_\phi$ are updated, while all backbone weights remain frozen. The overall objective is:

$$\mathcal{L} = \mathcal{L}_{\rm CE} + \beta_{\rm MDM} \, \mathcal{L}_{\rm MDM} + \beta_{\rm LS} \left\| F^* - \hat Y \right\|^2,$$

with $\beta_{\rm MDM}$ and $\beta_{\rm LS}$ weighting domain adaptation and label smoothing, respectively.
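The overall objective combines as a simple weighted sum; a minimal sketch, where the $\beta$ values and the toy inputs are placeholders rather than the paper's settings:

```python
import numpy as np

def total_loss(l_ce, l_mdm, F_star, Y0, beta_mdm=1.0, beta_ls=0.1):
    # L = L_CE + beta_MDM * L_MDM + beta_LS * ||F* - Y_hat||^2
    return l_ce + beta_mdm * l_mdm + beta_ls * np.sum((F_star - Y0) ** 2)

rng = np.random.default_rng(7)
F_star = rng.random((6, 3))                   # propagated label scores
Y0 = np.eye(3)[rng.integers(0, 3, size=6)]    # initial one-hot pseudo-labels
L = total_loss(l_ce=1.2, l_mdm=0.4, F_star=F_star, Y0=Y0)
print(L > 0)
```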

Typical hyperparameter settings are: $N = 5$, $K = 5$, $Q = 15$; mixup $\alpha = 0.2$; perturbation $\sigma = 0.2$; label-propagation $\alpha \in [0, 1]$; learning rate $1 \times 10^{-4}$; and $\sim 600$ episodes per domain.

6. Benchmark Performance and Ablation

MIFOMO establishes new state-of-the-art results for 5-shot overall accuracy (OA) on four widely used HSI datasets:

| Dataset | Best Prior OA | MIFOMO OA | Gain |
|---|---|---|---|
| Indian Pines | 81.32% | 95.44% | +14.1 pp |
| Pavia University | 94.19% | 97.76% | +3.6 pp |
| Salinas | 94.19% | 96.57% | +2.4 pp |
| Houston | 82.63% | 95.86% | +13.2 pp |

Ablation studies on Indian Pines (5-shot) quantify each module’s contribution:

| Configuration | OA |
|---|---|
| Full MIFOMO | 95.44% |
| – Label Smoothing | 68.22% |
| – Intermediate Domain | 92.29% |
| – Coalescent Projection (CP) | 94.57% |
| – Mixup | 93.25% |

t-SNE visualizations show that MIFOMO’s embeddings form well-separated clusters for unseen target classes, indicating strong cross-domain generalization and transferability.

7. Key Contributions and Significance

MIFOMO introduces a new instantiation of foundation model transfer for RS imagery by (i) using a large-scale dual-branch ViT foundation, (ii) enabling highly parameter-efficient adaptation via coalescent projection, (iii) robustifying domain adaptation through embedding-space mixup and cross-domain mixing, and (iv) stabilizing pseudo-labels using graph-based label smoothing. The integrated approach yields efficient and robust few-shot adaptation under severe domain shifts, addressing limitations of prior methodologies reliant on extensive data augmentation or large-scale fine-tuning (Paeedeh et al., 30 Jan 2026). The source code is available for reproducibility and continued research.

