
Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models

Published 24 Apr 2025 in cs.CV (arXiv:2504.17397v2)

Abstract: Earth observation (EO) is crucial for monitoring environmental changes, responding to disasters, and managing natural resources. In this context, foundation models facilitate remote sensing image analysis to retrieve relevant geoinformation accurately and efficiently. However, as these models grow in size, fine-tuning becomes increasingly challenging due to the associated computational resources and costs, limiting their accessibility and scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained features and even degrade model generalization. To address this, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution. In this paper, we conduct extensive experiments with various foundation model architectures and PEFT techniques to evaluate their effectiveness on five different EO datasets. Our results provide a comprehensive comparison, offering insights into when and how PEFT methods support the adaptation of pre-trained geospatial models. We demonstrate that PEFT techniques match or even exceed full fine-tuning performance and enhance model generalisation to unseen geographic regions, while reducing training time and memory requirements. Additional experiments investigate the effect of architecture choices such as the decoder type or the use of metadata, suggesting UNet decoders and fine-tuning without metadata as the recommended configuration. We have integrated all evaluated foundation models and techniques into the open-source package TerraTorch to support quick, scalable, and cost-effective model adaptation.

Summary

  • The paper introduces parameter-efficient fine-tuning (PEFT) techniques, notably LoRA, to adapt geospatial foundation models with minimal resource cost.
  • The study compares PEFT methods using various models and decoders across EO tasks, showing that LoRA achieves comparable mIoU to full fine-tuning with a lower memory footprint.
  • Experiments reveal that decoder selection—especially UNet—critically enhances spatial prediction quality and geographic generalization.

Parameter-Efficient Fine-Tuning for Geospatial Foundation Models

Overview and Motivation

Geospatial foundation models (GeoFMs), built with self-supervised learning over massive remote sensing datasets, have become pivotal for a wide range of Earth observation (EO) tasks, spanning environmental monitoring, disaster response, and land-use categorization. As the scale and capacity of these models increase, full fine-tuning becomes prohibitive in compute and memory and often yields suboptimal generalization because it overwrites pre-trained representations. Addressing these constraints, this paper presents a comprehensive experimental study of Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), Visual Prompt Tuning (VPT), and ViT Adapters, applied to several leading GeoFMs (Prithvi 2.0, Clay, DeCUR, Prithvi 1.0) across five downstream EO tasks. The analysis covers decoder architecture effects, generalization across geographic splits, and input band variability, resulting in empirically supported recommendations for scalable, robust model adaptation.

Experimental Framework

The study evaluates PEFT schemes using multiple models and decoders over five datasets: Sen1Floods11 (surface water segmentation), Burn Scars (wildfire detection), reBEN 7k (semantic land-cover segmentation), m-Cashew Plantation, and SA Crop Type. All experiments use thorough hyperparameter optimization (HPO) and statistical averaging over five seeds. PEFT methods are configured as follows: LoRA introduces low-rank adapters (r = 16) in the transformer layers; VPT employs 100 learnable prompts per layer; the ViT Adapter adds parallel convolutional modules. Decoder architectures benchmarked include linear, FCN, UperNet (FPN + PPM), and UNet. Performance is reported via mean IoU (mIoU) over test splits and geographic hold-out sets (GHOS), alongside training time and memory footprints.
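The LoRA configuration described above can be sketched with plain matrix algebra. The snippet below is a minimal illustration, not the paper's implementation; the layer width (768) and the choice alpha = r = 16 are assumptions for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=16):
    """Frozen dense layer plus a low-rank LoRA update.

    W: frozen (out, in) weight. A: (r, in) and B: (out, r) are the only
    trainable matrices; their product is scaled by alpha / r.
    """
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 16
W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
A = rng.normal(size=(r, d_in))         # trainable down-projection
B = np.zeros((d_out, r))               # B starts at zero, so training begins
                                       # exactly at the pre-trained function
x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B)
assert np.allclose(y, x @ W.T)         # with B = 0 the layer is unchanged

# Trainable-parameter fraction per adapted matrix: 2*r*d vs d*d.
full, lora = W.size, A.size + B.size
print(f"trainable fraction: {lora / full:.3f}")  # ~0.042 for r=16, d=768
```

This fraction per adapted matrix is consistent with the small trainable-parameter budgets reported for LoRA in the paper.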

Figure 1: Comparison between PEFT techniques for Prithvi 2.0 300M and Clay v1 with linear decoders across five datasets; best/worst per dataset annotated.

Numerical Results and Analysis

LoRA consistently matches or marginally outperforms full fine-tuning on Prithvi 2.0 300M and Clay v1 across most datasets. On Prithvi 2.0 300M, for example, LoRA achieves an average mIoU of 68.02% versus 68.14% for full fine-tuning, leading on some datasets (Burn Scars: 93.33% LoRA vs 92.85% full FT) while trailing on others (Cashew: 77.53% LoRA vs 80.58% full FT). VPT and the ViT Adapter tend to underperform relative to full FT, especially on larger models and complex tasks. Linear probing, despite its simplicity, shows competitive performance but converges slowly and produces spatially incoherent outputs.

Figure 2: Tradeoff between test mIoU and training time for all models and methods; LoRA and full FT display similar runtime but LoRA reduces memory footprint.

Interestingly, linear decoders serve as controlled probes of encoder quality, but high-performing decoders (UNet, FCN) can obscure differences, particularly for easier segmentation tasks.

Decoder Architectures and Output Quality

Comparative experiments reveal UNet as the most performant decoder across models and datasets, exhibiting strong mIoU and smooth spatial predictions. FCN and linear decoders yield patchy outputs, suitable only for simple tasks (e.g., water mapping). UperNet consistently underperforms relative to UNet and FCN despite similar complexity. These findings suggest that, while encoder quality is critical, decoder selection substantially impacts spatial structuring and final prediction robustness.

Figure 3: Average test mIoU across five datasets for each model and decoder combination.

Figure 4: Qualitative comparison of Prithvi 2.0 300M predictions; UNet and UperNet decoders generate smoother, semantically consistent outputs.

Geographic Generalization and Embedding Stability

The study benchmarks generalization to unseen geographic regions using hold-out sets. LoRA demonstrates superior preservation of embedding structure after fine-tuning, as visualized via t-SNE, maintaining regional clustering on Sen1Floods11. Quantitative results show a generalization gap (mean mIoU drop) of 7–8 pp on hold-out sets relative to test splits. Prithvi 2.0 300M with LoRA exhibits the smallest drop and the highest absolute scores on both in-distribution and GHOS splits (Sen1Floods11: 90.04% test, 87.57% GHOS; reBEN 7k: 38.84% test, 30.21% GHOS).
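The per-dataset gaps implied by those scores can be checked directly. Values are copied from the summary; note the 7–8 pp figure is an average over models and methods, so individual gaps vary around it.

```python
# Generalization gap = test mIoU minus geographic hold-out (GHOS) mIoU,
# in percentage points, for Prithvi 2.0 300M with LoRA.
scores = {
    "Sen1Floods11": {"test": 90.04, "ghos": 87.57},
    "reBEN 7k":     {"test": 38.84, "ghos": 30.21},
}
gaps = {name: round(s["test"] - s["ghos"], 2) for name, s in scores.items()}
print(gaps)  # {'Sen1Floods11': 2.47, 'reBEN 7k': 8.63}
```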

Figure 5: t-SNE visualization of Prithvi 2.0 300M embeddings on Sen1Floods11, colored by region, showing post-fine-tuning structure.

Experiments confirm that LoRA outperforms full FT for generalization in larger models, likely due to minimized disruption of pre-trained representations. The performance drop attributed to missing input bands remains modest (1–2pp), indicating robust adaptability provided spectral diversity is maintained.

Metadata Incorporation and Input Band Robustness

Including metadata (temporal/geographic) in pre-training and fine-tuning has only marginal impact on downstream performance. Prithvi 2.0 300M and Clay v1 show only subtle differences (<0.2 pp) in mIoU, implying that EO models infer spatiotemporal patterns directly from multispectral imagery. Dropping bands results in minor performance loss, further supporting the robustness of deep pre-trained encoders.

Figure 6: t-SNE embeddings for reBEN 7k colored by region, confirming spatial pattern extraction.

Practical Implications and Future Directions

The empirical findings position LoRA as the preferred PEFT method for large-scale, resource-constrained EO tasks, delivering full fine-tuning-level performance with substantial memory savings and, depending on batch sizing and hardware, faster training. UNet remains the recommended decoder for robust, spatially coherent outputs. Metadata and spectral input variability pose minimal threats to model transferability. Nonetheless, generalization to new geographies remains a bottleneck; future work should address domain adaptation and scalable PEFT deployment across additional GeoFM architectures.

The integrated open-source package TerraTorch operationalizes these conclusions, enabling reproducible, cost-effective adaptation workflows. The adoption of PEFT schemes, particularly LoRA, is poised to facilitate wider GeoFM deployment in both scientific and operational EO contexts, enhancing scalability, traceability, and robustness.

Conclusion

This work provides an authoritative empirical assessment of PEFT methods in geospatial foundation modeling, concluding that LoRA offers the best trade-off between adaptation efficacy and resource constraints. Decoder choice strongly affects output quality, and metadata inclusion offers minimal incremental benefit. Generalization to unseen regions remains an open challenge. The collective evidence offers concrete guidance for future GeoFM adaptation strategies and informs ongoing research in scalable, robust EO modeling.

Explain it Like I'm 14

What is this paper about?

This paper looks at smart ways to “fine-tune” big AI models that read satellite images of Earth. These models help with tasks like finding floods, spotting burn scars from wildfires, mapping land cover, and identifying crop types. The goal is to make adapting these models to new tasks faster, cheaper, and more reliable—especially when computers have limited memory.

What were the main goals?

The authors wanted to answer three simple questions:

  • Can small, efficient fine-tuning tricks (called PEFT methods) work as well as—or better than—standard fine-tuning on geospatial AI models?
  • Do these methods help the models work in new places they’ve never seen before (like a different country)?
  • Which “decoder” designs (the part of the model that turns features into pixel-by-pixel maps) work best for these tasks?

How did they do the study?

Key ideas in plain words

  • Foundation models: Think of these as very smart “general-purpose” brains trained on lots of satellite images. They learn useful patterns about Earth that can be reused for many tasks.
  • Fine-tuning: Like teaching the brain a new skill. Normally, you adjust many parts of the brain (millions of parameters), which can be slow and expensive.
  • Parameter-Efficient Fine-Tuning (PEFT): Instead of changing the whole brain, you only tweak a few small parts or add tiny “plugins.” It’s faster, uses less memory, and can still perform great.

The paper tests three PEFT methods:

  • LoRA: Adds small “shortcut” layers that lightly steer the model’s attention. Imagine placing a small dial on a big machine to nudge it instead of rebuilding it.
  • VPT (Visual Prompt Tuning): Adds special tokens (like hints) to the model’s input, guiding it without changing the main parts.
  • ViT Adapter: A small side network that helps the main model better handle detailed, pixel-level tasks.
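The "hints" idea behind VPT can be shown in a few lines. This is a sketch only: the shapes are assumptions, and real implementations insert trainable prompts before each transformer layer rather than once at the input.

```python
import numpy as np

def prepend_prompts(tokens, prompts):
    """Visual Prompt Tuning: prepend learnable prompt tokens to the
    patch-token sequence; the frozen model then attends to them as hints.

    tokens:  (batch, n_patches, dim) frozen patch embeddings
    prompts: (n_prompts, dim) trainable hints, shared across the batch
    """
    batch = tokens.shape[0]
    tiled = np.broadcast_to(prompts, (batch,) + prompts.shape)
    return np.concatenate([tiled, tokens], axis=1)

tokens = np.zeros((2, 196, 768))   # e.g. 14x14 patches from a ViT encoder
prompts = np.ones((100, 768))      # 100 prompts per layer, as in the paper
out = prepend_prompts(tokens, prompts)
print(out.shape)  # (2, 296, 768): sequence grew by the 100 prompt tokens
```

Only the prompt vectors are trained; the rest of the model stays frozen.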

They tried these on several geospatial foundation models (trained on satellite images), including:

  • Prithvi 2.0 (a large, modern model)
  • Prithvi 1.0 (an earlier version)
  • Clay (another strong model)
  • DeCUR (a different style using ResNet)

They tested on five datasets covering different tasks:

  • Flood water detection (Sen1Floods11)
  • Burn scar mapping after wildfires (Burn Scars)
  • Land cover mapping across Europe (reBEN 7k)
  • Cashew plantation mapping in Benin (m-Cashew)
  • Crop type mapping in South Africa (SA Crop Type)

For each setup, they compared:

  • Full fine-tuning (updating almost everything)
  • PEFT methods (updating only small parts)
  • Different decoders (linear, FCN, UperNet, UNet), which turn features into useful maps
  • Performance on places seen in training vs. brand-new regions (a “geographic hold-out set”)

They measured results using mIoU, a score that shows how well the predicted map overlaps with the true map. Higher is better.
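A tiny, self-contained sketch of how mIoU is computed (illustrative only; real benchmarks also handle ignore-labels and accumulate counts over the whole dataset):

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean intersection-over-union across classes.

    pred/target: integer class maps of the same shape. Classes absent
    from both maps are skipped rather than counted as zero.
    """
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

target = np.array([[0, 0, 1], [1, 1, 1]])
pred   = np.array([[0, 1, 1], [1, 1, 0]])
# class 0: intersection 1, union 3 -> 1/3; class 1: intersection 3, union 5 -> 3/5
print(round(mean_iou(pred, target, 2), 4))  # 0.4667
```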

What did they find, and why does it matter?

Here are the most important results:

  • LoRA often matches or beats full fine-tuning, especially on the big Prithvi 2.0 model. It uses only about 1–2% extra parameters and saves memory, making training more accessible on common GPUs.
  • Generalization to new regions improved with LoRA for Prithvi 2.0. In simple terms: models fine-tuned with LoRA tended to work better on places they hadn’t seen before.
  • UNet is the most reliable decoder choice. It consistently produced strong scores and cleaner, smoother maps than simpler decoders (which can look patchy).
  • Full fine-tuning can sometimes “forget” useful pre-trained knowledge and hurt generalization. PEFT (especially LoRA) helps avoid this.
  • Metadata (extra info like location or time) wasn’t necessary for getting good fine-tuning results in their tests.
  • Bigger, better-pretrained models (like Prithvi 2.0) generally performed best across tasks.

These results are important because they show you don’t need huge computing resources to get top performance. That opens the door for more teams—like local governments, NGOs, or schools—to use and adapt powerful geospatial AI.

What does this mean for the future?

  • Faster, cheaper adaptation: With PEFT—especially LoRA—organizations can quickly tailor large Earth observation models to new tasks (like a sudden flood or wildfire) without needing expensive hardware.
  • Better maps, faster decisions: Strong decoders like UNet produce cleaner maps, which helps responders and planners trust the outputs when acting in the real world.
  • Wider access through open tools: The authors put all their models and methods into an open-source toolkit called TerraTorch. This makes it easier for others to use these techniques right away.
  • Ongoing challenge: Working well in totally new regions is still hard. PEFT helps, but there’s room to improve. Future work can test more models, more places, and more tasks to build even better generalization.

In short: Fine-tune smarter, not harder. By adjusting just a little (with PEFT), you can get great results, save time and memory, and make powerful Earth-monitoring tools usable by more people.
