MIFOMO: Foundation Model for HSI CDFSL
- MIFOMO is a foundation model framework designed for cross-domain few-shot hyperspectral image classification using a dual-branch Vision Transformer.
- It employs coalescent projection and mixup domain adaptation to efficiently handle severe domain shifts while reducing overfitting.
- Experimental results demonstrate state-of-the-art performance with significant accuracy gains on multiple hyperspectral image benchmarks.
The MIxup FOundation MOdel (MIFOMO) is a parameter-efficient architectural and algorithmic framework for cross-domain few-shot learning (CDFSL) in hyperspectral image (HSI) classification. MIFOMO leverages a large-scale Vision Transformer (ViT) foundation model pre-trained on remote sensing (RS) imagery and introduces specialized techniques—coalescent projection, mixup domain adaptation, and label smoothing—to rapidly and robustly adapt to extreme domain shifts, characterized by severe scarcity and discrepancy in target-domain data. The framework demonstrates substantial improvements over previous methods, attaining state-of-the-art performance on multiple HSI benchmarks (Paeedeh et al., 30 Jan 2026).
1. Foundation Model Pretraining
MIFOMO is built atop HyperSIGMA, a dual-branch ViT backbone tailored for hyperspectral data. The input is a hyperspectral cube $X \in \mathbb{R}^{H \times W \times B}$, where each of the SpatialNetwork and SpectralNetwork branches tokenizes and embeds local patches into sequences of dimension $D$, processing them through stacked transformer blocks. Each block performs multi-head self-attention (MHSA) and a subsequent feed-forward network (FFN), with shared positional embeddings $E_{\mathrm{pos}}$. Formally, the block operation is:

$$Z' = Z + \mathrm{MHSA}(\mathrm{LN}(Z)), \qquad Z_{\mathrm{out}} = Z' + \mathrm{FFN}(\mathrm{LN}(Z'))$$
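A minimal NumPy sketch of one such pre-norm block follows. It is illustrative only: the real HyperSIGMA blocks use GELU activations, learned norm parameters, and separate spatial/spectral branches, none of which are reproduced here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token layer normalization (no learned scale/shift, for brevity)."""
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(z, wq, wk, wv, wo, heads):
    """Multi-head self-attention over a token sequence z of shape (n, d)."""
    n, d = z.shape
    hd = d // heads
    q = (z @ wq).reshape(n, heads, hd).transpose(1, 0, 2)
    k = (z @ wk).reshape(n, heads, hd).transpose(1, 0, 2)
    v = (z @ wv).reshape(n, heads, hd).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))
    return (attn @ v).transpose(1, 0, 2).reshape(n, d) @ wo

def vit_block(z, wq, wk, wv, wo, w1, w2, heads=2):
    """Pre-norm transformer block: residual MHSA, then residual FFN."""
    z = z + mhsa(layer_norm(z), wq, wk, wv, wo, heads)
    return z + np.maximum(layer_norm(z) @ w1, 0.0) @ w2  # ReLU stand-in for GELU
```

Both residual branches preserve the token-sequence shape, so blocks can be stacked freely.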
Pretraining is carried out via a masked image modeling (MIM) objective using the HyperGlobal-450K dataset. The encoder $f_\theta$ (the backbone) and a decoder $g_\phi$ are optimized to reconstruct masked cubes:

$$\mathcal{L}_{\mathrm{MIM}} = \big\| M \odot \big( X - g_\phi(f_\theta((1 - M) \odot X)) \big) \big\|_2^2$$

Here, $M$ is a random masking operator; reconstructing the masked content drives the encoder to learn spectral–spatial representations that generalize across diverse RS domains.
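The mechanics of the masked objective can be sketched as follows. The encoder/decoder are replaced by a placeholder, and the 75% mask ratio is an illustrative assumption; only the mask-then-reconstruct-then-score-masked-positions logic is the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))            # 64 patch tokens, 32 spectral features
mask = rng.random(64) < 0.75             # MIM typically masks most tokens

# visible input: masked tokens are zeroed before encoding
x_visible = x * (1 - mask)[:, None].astype(float)

# encoder/decoder stand-in: a real model would predict x from x_visible
reconstruction = x_visible

# the loss is computed only on the masked positions
mim_loss = np.mean((x[mask] - reconstruction[mask]) ** 2)
```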
2. Coalescent Projection: Efficient Task Adaptation
To enable rapid adaptation with minimal risk of overfitting, all parameters of the HyperSIGMA backbone are kept frozen. MIFOMO introduces a learnable coalescent projection (CP) matrix $P$ into each self-attention head. In standard self-attention, the scores are computed as $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^\top/\sqrt{d_k}\big)\,V$. Under CP, the operation becomes:

$$\mathrm{Attn}_{\mathrm{CP}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^\top}{\sqrt{d_k}}\right) V$$

where $\tilde{Q} = QP$, $\tilde{K} = KP$, and $P \in \mathbb{R}^{d_k \times d_k}$. Only the $P$ matrices and the downstream classifier are trainable, while the rest remain static. This reduces the number of trainable parameters considerably and mitigates overfitting, as adaptation occurs via small CP matrices initialized as the identity. Gradients from the classification loss are backpropagated through each $P$, and no separate tuning objective for CP is necessary.
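A single-head sketch of the mechanism (NumPy, one shared $P$; MIFOMO applies one per head inside the frozen backbone). With $P = I$, its initialization, the head reproduces the pretrained attention exactly, which is why adaptation starts from the foundation model's behavior:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cp_attention(z, wq, wk, wv, p):
    """Self-attention with a coalescent projection P on queries and keys.

    wq, wk, wv are frozen pretrained projections; only p is trainable.
    With p = I (its initialization), this reduces to the pretrained head."""
    q = z @ wq @ p
    k = z @ wk @ p
    v = z @ wv
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v
```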
3. Mixup Domain Adaptation (MDM)
MIFOMO employs an embedding-space mixup strategy for both intra-domain and cross-domain adaptation.
3.1. Source-Domain Mixup
Within each meta-learning episode on the source domain $\mathcal{D}_s$, mixup is performed among query samples:

$$\tilde{z} = \lambda z_i + (1 - \lambda) z_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

where $\lambda \in [0, 1]$. The associated loss is:

$$\mathcal{L}_{\mathrm{mix}} = \lambda\,\mathrm{CE}\big(f(\tilde{z}), y_i\big) + (1 - \lambda)\,\mathrm{CE}\big(f(\tilde{z}), y_j\big)$$

which is combined with the conventional few-shot loss $\mathcal{L}_{\mathrm{FS}}$.
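A minimal sketch of the two pieces above (NumPy; labels are one-hot vectors and `log_probs` are log-softmax outputs of the classifier). The same pair-mixing routine also serves Section 3.2's intermediate-domain construction when the second sample is a pseudo-labeled target embedding:

```python
import numpy as np

def mixup_pair(z_i, z_j, y_i, y_j, lam):
    """Mix two embeddings and their (one-hot or soft) labels with ratio lam."""
    return lam * z_i + (1 - lam) * z_j, lam * y_i + (1 - lam) * y_j

def mixup_loss(log_probs, y_i, y_j, lam):
    """Convex combination of cross-entropies against both endpoint labels."""
    ce = lambda y: -np.sum(y * log_probs)
    return lam * ce(y_i) + (1 - lam) * ce(y_j)
```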
3.2. Intermediate Domain Construction
To address source–target domain shifts, the method constructs an intermediate domain by mixing samples between source embeddings ($z_s$, with label $y_s$) and pseudo-labeled target samples ($z_t$, with pseudo-label $\hat{y}_t$):

$$z_m = \lambda z_s + (1 - \lambda) z_t$$

An intermediate-domain loss is defined as:

$$\mathcal{L}_{\mathrm{ID}} = \lambda\,\mathrm{CE}\big(f(z_m), y_s\big) + (1 - \lambda)\,\mathrm{CE}\big(f(z_m), \hat{y}_t\big)$$
3.3. Adaptive Mixup Ratio
The mixup ratio $\lambda$ during adaptation is adjusted dynamically based on the 1-Wasserstein distance $W_1(\mathcal{D}_s, \mathcal{D}_t)$ between domains:

$$\lambda = \exp\!\big(-W_1(\mathcal{D}_s, \mathcal{D}_t)/\tau\big)$$

with temperature $\tau$ and a uniform perturbation $\epsilon$ applied during sampling.
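The rule can be sketched as follows. The exponential decay form and the clipping are illustrative assumptions; the intent they capture is that similar domains (small $W_1$) push $\lambda$ toward 1, while distant domains shrink it:

```python
import numpy as np

def w1_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def adaptive_lambda(w1, tau=1.0, eps=0.1, rng=None):
    """Mixup ratio shrinking with domain distance, plus uniform noise.

    Assumed form: lambda = exp(-W1/tau) + Uniform(-eps, eps), clipped
    to [0, 1] so it remains a valid mixing coefficient."""
    rng = rng or np.random.default_rng()
    lam = np.exp(-w1 / tau) + rng.uniform(-eps, eps)
    return float(np.clip(lam, 0.0, 1.0))
```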
4. Label Smoothing via Graph-Based Propagation
To generate stable pseudo-labels under few-shot conditions, label propagation is performed on a similarity graph constructed from backbone features on the target-domain query set. Starting with one-hot initial pseudo-labels $Y$ and normalized adjacency matrix $S = D^{-1/2} A D^{-1/2}$, the iterative update is:

$$F^{(t+1)} = \alpha S F^{(t)} + (1 - \alpha) Y$$

The stationary solution is:

$$F^{*} = (1 - \alpha)(I - \alpha S)^{-1} Y$$

where $\alpha \in (0, 1)$ controls propagation smoothness. The final pseudo-labels are chosen as $\hat{y}_i = \arg\max_c F^{*}_{ic}$. This mechanism suppresses noisy pseudo-labels and enforces neighborhood consistency.
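The closed-form solution makes this a few lines of linear algebra. Below is a sketch assuming a Gaussian affinity graph (the paper's exact graph construction may differ):

```python
import numpy as np

def propagate_labels(features, y_init, alpha=0.9, sigma=1.0):
    """Closed-form label propagation on a Gaussian similarity graph.

    features: (n, d) backbone embeddings of the target query set.
    y_init:   (n, c) one-hot pseudo-labels (all-zero rows = unlabeled).
    Returns hard pseudo-labels via argmax of the stationary solution."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    a = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(a, 0.0)                       # no self-loops
    deg = a.sum(axis=1)
    s = a / np.sqrt(np.outer(deg, deg))            # S = D^{-1/2} A D^{-1/2}
    f_star = (1 - alpha) * np.linalg.solve(np.eye(len(a)) - alpha * s, y_init)
    return f_star.argmax(axis=1)
```

Solving the linear system $(I - \alpha S)F^{*} = (1-\alpha)Y$ directly avoids iterating the update to convergence.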
5. Training Protocol and Objective
MIFOMO employs a meta-learning training regime, alternating three phases per iteration:
- (a) Source-domain updates with $\mathcal{L}_{\mathrm{FS}} + \mathcal{L}_{\mathrm{mix}}$.
- (b) Target-domain warmup and pseudo-label propagation.
- (c) Intermediate-domain updates using $\mathcal{L}_{\mathrm{ID}}$.
Only coalescent projection matrices and the final classification head are updated, while all backbone weights are frozen. The overall objective is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{FS}} + \mathcal{L}_{\mathrm{mix}} + \lambda_{\mathrm{DA}}\,\mathcal{L}_{\mathrm{ID}} + \lambda_{\mathrm{LS}}\,\mathcal{L}_{\mathrm{LS}}$$

with $\lambda_{\mathrm{DA}}$ and $\lambda_{\mathrm{LS}}$ weighting the domain-adaptation and label-smoothing terms.
Typical hyperparameter settings cover the objective weights $\lambda_{\mathrm{DA}}$ and $\lambda_{\mathrm{LS}}$; the mixup temperature $\tau$; the perturbation $\epsilon$; the label-propagation coefficient $\alpha$; the learning rate; and the number of episodes per domain.
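The alternating schedule can be sketched as a plain loop over iterations. The `step_*` callables below are hypothetical stand-ins for the real source, warmup, and intermediate-domain update routines:

```python
def meta_train(n_iters, step_source, step_warmup, step_intermediate):
    """Alternate the three phases (a)-(c) for n_iters iterations.

    Each callable would update only the CP matrices and classifier head
    (the backbone stays frozen). step_source and step_intermediate return
    scalar losses; step_warmup returns the propagated pseudo-labels that
    phase (c) consumes."""
    losses = []
    for _ in range(n_iters):
        l_src = step_source()                  # (a) few-shot + mixup on source
        pseudo = step_warmup()                 # (b) warmup + label propagation
        l_mid = step_intermediate(pseudo)      # (c) intermediate-domain update
        losses.append(l_src + l_mid)
    return losses
```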
6. Benchmark Performance and Ablation
MIFOMO establishes new state-of-the-art results for 5-shot overall accuracy (OA) on four widely used HSI datasets:
| Dataset | Best Prior OA | MIFOMO OA | Gain |
|---|---|---|---|
| Indian Pines | 81.32% | 95.44% | +14.1pp |
| Pavia University | 94.19% | 97.76% | +3.6pp |
| Salinas | 94.19% | 96.57% | +2.4pp |
| Houston | 82.63% | 95.86% | +13.2pp |
Ablation studies on Indian Pines (5-shot) quantify each module’s contribution:
| Configuration | OA |
|---|---|
| Full MIFOMO | 95.44% |
| w/o Label Smoothing | 68.22% |
| w/o Intermediate Domain | 92.29% |
| w/o Coalescent Projection (CP) | 94.57% |
| w/o Mixup | 93.25% |
t-SNE visualizations show that MIFOMO's embeddings form well-separated clusters for unseen target classes, indicating strong cross-domain generalization and transferability.
7. Key Contributions and Significance
MIFOMO introduces a new instantiation of foundation model transfer for RS imagery by (i) using a large-scale dual-branch ViT foundation, (ii) enabling highly parameter-efficient adaptation via coalescent projection, (iii) robustifying domain adaptation through embedding-space mixup and cross-domain mixing, and (iv) stabilizing pseudo-labels using graph-based label smoothing. The integrated approach yields efficient and robust few-shot adaptation under severe domain shifts, addressing limitations of prior methodologies reliant on extensive data augmentation or large-scale fine-tuning (Paeedeh et al., 30 Jan 2026). The source code is available for reproducibility and continued research.