
Selective Fine-Tuning in Deep Learning

Updated 29 January 2026
  • Selective fine-tuning is an adaptive training strategy that updates only chosen parameters, such as layers or tokens, to efficiently adapt pre-trained models.
  • The approach employs methods like gradient norm selection, greedy masking, and adapter pruning to minimize computational cost and mitigate overfitting.
  • It offers practical benefits in federated learning, transfer tasks, and image segmentation by reducing resource demands and preserving out-of-distribution performance.

Selective fine-tuning is an adaptive training paradigm in which only a carefully chosen subset of model parameters (typically layers, tokens, blocks, or filters) is updated when adapting a pre-trained model to a downstream task, while all other parameters remain frozen at their pre-trained values. This approach substantially reduces computational cost, mitigates overfitting, accommodates resource constraints, and can improve generalization across domains and objectives. Recent research demonstrates that selective fine-tuning not only enables efficient adaptation in large-scale neural architectures and federated learning systems but also achieves accuracy comparable or superior to full-model fine-tuning in data-scarce, heterogeneous, or multi-objective regimes.

1. Conceptual Foundations and Motivations

Selective fine-tuning arises from the limitations of full-model adaptation in scenarios where (a) computational resource budgets are severely constrained, (b) training data is non-IID across clients, (c) multi-objective or continual learning demands isolation between task heads, or (d) the risk of overfitting or catastrophic forgetting is prominent.

In federated learning, for example, full-layer adaptation is often infeasible due to client resource heterogeneity and privacy constraints. Selective layer fine-tuning enables each client $i$ to choose a binary mask $m_i^t \in \{0,1\}^L$, where $m_i^t(l)=1$ means "fine-tune layer $l$" and $0$ means "freeze" (Sun et al., 2024). In transformer-based LLMs and CV models, selective fine-tuning can be realized at the adapter, block, or even token level (Son et al., 2024, Ruan et al., 13 Oct 2025). In transfer learning, filter-level selective adaptation targets only those convolutional kernels most susceptible to distributional shifts (Bianchi et al., 2019).

This paradigm accommodates client and task heterogeneity, minimizes unnecessary parameter updates, and restricts model drift from its pre-trained distribution, so that zero-shot and out-of-distribution generalization is better preserved (Bafghi et al., 26 Jan 2025).
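The per-layer mask $m_i^t$ described above maps directly onto a masked update rule, $\theta_l \leftarrow \theta_l - \eta\, m(l)\, g_l$. Below is a minimal pure-Python sketch of that rule (illustrative only; `masked_update` and its scalar per-layer values are our own simplification, not any paper's reference code):

```python
def masked_update(params, grads, mask, lr=0.1):
    """One SGD step that touches only the layers selected by a binary mask.

    params, grads: per-layer parameter/gradient values (scalars here for
    illustration); mask[l] == 1 means "fine-tune layer l", 0 means "freeze".
    """
    return [p - lr * g if m else p  # frozen layers keep their pre-trained value
            for p, g, m in zip(params, grads, mask)]

params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]
mask = [1, 0, 0, 1]  # fine-tune only the first and last layer
print(masked_update(params, grads, mask))  # middle two layers are unchanged
```

In a real framework the same effect is usually obtained by disabling gradients on frozen modules rather than masking the update itself.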

2. Methodologies for Selective Fine-Tuning

Several methodological frameworks for selective fine-tuning have emerged, spanning layer/block selection, token-level gating, filter ranking, adapter pruning, and sample-wise or objective-wise masking:

  • Gradient Norm and Importance-based Selection: Layer-wise or token-wise selection based on local/global gradient norms to identify high-impact parameters (Sun et al., 2024, Shen et al., 2024).
  • Greedy and Evolutionary Subset Selection: Budgeted selection via greedy approximation or evolutionary optimization to maximize validation accuracy under a constraint on tuned layers/blocks (Kaplun et al., 2023, Colan et al., 21 Aug 2025).
  • Critical Token Identification: Counterfactual decoding to precisely isolate tokens whose change causes output correctness loss; only those deemed "critical" are updated (Ruan et al., 13 Oct 2025).
  • Adapter Freezing via Representation Alignment: Adapters are dynamically frozen when their activations become highly aligned (CKA) with the baseline representation, indicating convergence and low ongoing utility (Son et al., 2024).
  • Selective Masking for Multi-objective EBR: Parameter vector partitioned into binary masks for each objective, with cumulative pruning and cascading staged fine-tuning (Deng et al., 17 Apr 2025).
  • Sparse Spectral and SparseGrad Updates: Sparse spectral updates (DCT, Fourier) and basis-transformed sparsification enable parameter-efficient adaptation in both NLP and vision (Shen et al., 2024, Chekalina et al., 2024).
  • Self-Rehearsal and Self-to-Supervised Mixing: Judged model-generated responses are selectively mixed with gold supervision for better generalization (Gupta et al., 2024, Gupta et al., 12 Feb 2025).
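Of these, greedy subset selection is the simplest to sketch. The toy below mirrors a SubTuning-style loop in spirit: repeatedly add whichever layer most improves a validation score until a budget is exhausted. The `evaluate` callback is a hypothetical stand-in for "fine-tune this subset and score it on validation data":

```python
def greedy_select(num_layers, budget, evaluate):
    """Greedily choose up to `budget` layers to fine-tune.

    evaluate(subset) -> validation score obtained when only `subset`
    is fine-tuned (supplied by the caller).
    """
    selected = set()
    while len(selected) < budget:
        best_layer, best_score = None, evaluate(selected)
        for layer in range(num_layers):
            if layer in selected:
                continue
            score = evaluate(selected | {layer})
            if score > best_score:
                best_layer, best_score = layer, score
        if best_layer is None:  # no remaining layer improves the score
            break
        selected.add(best_layer)
    return sorted(selected)

# Toy additive score: layers 2 and 5 carry most of the task-relevant signal.
gain = {0: 0.1, 1: 0.2, 2: 1.0, 3: 0.1, 4: 0.3, 5: 0.9}
score = lambda subset: sum(gain[l] for l in subset)
print(greedy_select(6, budget=2, evaluate=score))  # → [2, 5]
```

Each round costs up to `num_layers` evaluations, which is why practical variants approximate `evaluate` with short fine-tuning runs.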

The core optimization objective in these frameworks supplements the task loss with a sparsity penalty or mask-induced constraint over the parameter set:

$$\min_{\theta} \; \mathcal{L}_{\mathrm{task}}(\theta) + \lambda\,\|\mathbf{1}_{\mathrm{sel}}\|_1,$$

where $\mathbf{1}_{\mathrm{sel}}$ is a selection indicator over layers, blocks, or tokens.
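For a binary mask, the penalty $\|\mathbf{1}_{\mathrm{sel}}\|_1$ is simply the count of selected components, so the objective can be evaluated as below (a sketch with a hypothetical quadratic task loss of our own choosing):

```python
def selective_objective(theta, mask, task_loss, lam=0.01):
    """Task loss plus lam * ||1_sel||_1; for a 0/1 mask the L1 norm
    is just the number of selected components."""
    return task_loss(theta, mask) + lam * sum(mask)

# Hypothetical task loss: squared error of the *selected* parameters
# against a fixed target vector.
target = [0.0, 1.0, 2.0]
def task_loss(theta, mask):
    return sum(m * (t - tgt) ** 2 for m, t, tgt in zip(mask, theta, target))

loss = selective_objective([0.5, 1.0, 2.5], [1, 0, 1], task_loss, lam=0.01)
print(loss)  # 0.25 + 0.25 task error, plus a 0.02 sparsity penalty
```

Raising $\lambda$ pushes the optimizer toward smaller selections; in practice the discrete mask is relaxed or scheduled rather than optimized directly.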

3. Theoretical Analysis of Selectivity in Convergence and Generalization

Selective fine-tuning alters both the convergence dynamics and generalization properties of adapted models. Theoretical results in federated settings show that the error decomposition contains two principal components: loss from non-selection of important layers ($\mathcal{E}_{t,1}$) and mismatch across heterogeneous client choices ($\mathcal{E}_{t,2}$) (Sun et al., 2024):

$$\min_{t}\, \mathbb{E}\big[\|\nabla f(\theta^t)\|^2\big] \leq \frac{2}{\eta C T}\big[f(\theta^0)-f(\theta^*)\big] + \frac{2\gamma\eta}{C}\sigma^2 + \frac{1}{T}\sum_{t}\Big(\frac{1}{\gamma\eta C} + 2\Big)\big(\mathcal{E}_{t,1} + \mathcal{E}_{t,2}\big)$$

As $T\to\infty$, the error floor is dominated by selection quality and cross-client consistency.

Generalization bounds, as established for SubTuning, suggest that optimizing a smaller parameter subset ($r' \ll r$) yields tighter generalization guarantees in low-data (small $m$) regimes:

$$O\left(\frac{\sqrt{r'}\,\Delta\,\log(kL)}{\sqrt{m}}\right),$$

in contrast with the full-model bound $O(\sqrt{r}\,\Delta/\sqrt{m})$ (Kaplun et al., 2023).
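Holding $\Delta$, $m$, and the log factor fixed, the ratio of the two bounds scales as $\sqrt{r'/r}$, which a one-line computation illustrates (our own numeric example, not figures from the cited paper):

```python
import math

def bound_ratio(r_sub, r_full):
    """Ratio of the subset bound O(sqrt(r')) to the full-model bound
    O(sqrt(r)), with Delta, m, and log factors held fixed and cancelled."""
    return math.sqrt(r_sub / r_full)

# Tuning 1% of the parameters tightens the bound by about a factor of 10.
print(bound_ratio(1_000_000, 100_000_000))  # ≈ 0.1
```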

Selective self-rehearsal and self-to-supervised strategies produce smaller distributional shifts in the model's output space, directly mitigating catastrophic forgetting and overfitting to idiosyncratic datasets (Gupta et al., 2024, Gupta et al., 12 Feb 2025).

4. Applications across Modalities and Learning Scenarios

Selective fine-tuning is deployed in several domains:

  • Federated Learning: Critical in privacy-preserving distributed adaptation, e.g., vision transformers and LLMs across clients with resource and data heterogeneity (Sun et al., 2024).
  • Transfer and Multi-Task Learning: Evolutionary selection (BioTune) and greedy layer selection (SubTuning) yield competitive performance in classification, fine-grained recognition, and real-world continual learning (Colan et al., 21 Aug 2025, Kaplun et al., 2023).
  • Parameter-Efficient Fine-Tuning (PEFT): Adapter freezing (SAFE), spectral selection (sDCTFT), and sparse token/block updates reduce parameter count and memory footprint by $2\times$–$700\times$ vs. baselines (Son et al., 2024, Shen et al., 2024, Chekalina et al., 2024).
  • Robustness and OOD Generalization: Selective updates targeting robust directions (ROSE) yield flat optimization minima and improved adversarial resilience (Jiang et al., 2022).
  • Multi-objective Retrieval and Recommendation: CSMF partitions embedding vectors and masks parameter blocks for each objective, efficiently supporting conflicting commercial goals (Deng et al., 17 Apr 2025).
  • Medical Image Segmentation: Active learning combined with selective semi-supervised fine-tuning dramatically boosts performance and label efficiency in volumetric segmentation with minimal annotation budgets (Yang et al., 13 Sep 2025).
  • CNN filter-level adaptation: Targeted retraining of distortion-sensitive filters recovers image classification accuracy under heavy covariate shifts (Bianchi et al., 2019).

5. Quantitative Outcomes and Comparative Analysis

Empirical findings consistently indicate that selective fine-tuning attains accuracy equivalent or superior to full fine-tuning, typically using only $5$–$30\%$ of trainable parameters or fewer layers/tokens:

| Method | Params Tuned | Task | Baseline | Selective FT | Δ (Selective–Baseline) |
|---|---|---|---|---|---|
| BioTune (ResNet-50) | 30% | MNIST | FT 98.96 | 99.13 | +0.17 |
| SubTuning (ViT-B/16) | 12% | VTAB | FT 47.8 | 68.0 | +20.2 |
| SSR (LLM, MD2D) | — | QA | SFT drop −16.7% | SSR drop −2.3% | +14.4% (gen.) |
| SAFE (BERT-large) | — | GLUE | 84.66 | 84.99 | +0.33, −40.5% mem |
| SparseGrad | 1% (MLP) | GLUE | LoRA 83.1 | 83.6 | +0.5 |
| Selective LoRA (CLIP) | 5% | OOD ZS | SFT fall −30% | <−5.7% | +24.3% (forgetting) |

These gains also extend to reductions in memory, FLOPs, and training/inference time. Selective adapter freezing matches or slightly exceeds baseline accuracy while decreasing compute and memory footprints by up to 47%–88% (Son et al., 2024). In volumetric medical segmentation, selective sample querying and semi-supervised fine-tuning increase Dice scores by $2$–$8$ points over random or SOTA baselines, approaching fully-supervised upper bounds at $<30\%$ label cost (Yang et al., 13 Sep 2025).

6. Limitations, Trade-Offs, and Future Directions

Selective fine-tuning introduces design choices and trade-offs:

  • Selection Accuracy and Dynamic Adjustment: Masking granularity (layer/block/filter/token) and selection metric (importance, robustness, uncertainty) impact both convergence and generalization. Dynamic or data-driven selection during training remains an open area (Dong et al., 2024, Ruan et al., 13 Oct 2025).
  • Cross-client Consistency in Distributed FL: Excessive heterogeneity in selection choices across clients can increase mismatch error, limiting convergence in federated aggregation (Sun et al., 2024).
  • Judge Quality (SSFT, SSR): The reliability of output selection/judging is crucial; misclassification inflates distributional drift (Gupta et al., 12 Feb 2025).
  • Absence of Hard Constraint Control: Fractional sparsity is typically enforced via regularization or threshold scheduling, not explicit cardinality constraints (Bafghi et al., 26 Jan 2025).
  • Transferability to Nonlinear or Structured Architectures: Most frameworks apply to feedforward, transformer, and vision architectures; extension to autoregressive or graph-structured models is ongoing.
  • Limited Theoretical Guarantees in Highly Nonconvex Settings: Most convergence proofs assume smoothness, variance bounds, and mild gradient diversity.

Potential extensions involve automatic mask optimization, multi-granular or hierarchical selection, domain-adaptive mask evolution, and integration into multi-modal and continual learning settings. Further open directions include dynamic per-sample layer selection, robust adaptive gating in PEFT, and data-efficient selective adaptation in foundation models for scientific, medical, and multi-objective applications.

7. Summary

Selective fine-tuning represents a principled, resource-adaptive approach for deep model adaptation. By updating only the most relevant subset of parameters—chosen via gradient, robustness, data-driven, or objective-aware analysis—this paradigm achieves computational efficiency, superior generalization, and stronger robustness to data and resource heterogeneity. The methodology generalizes across FL, transfer learning, LLM adaptation, multi-objective retrieval, and domain-shifted image segmentation, and is supported by both theoretical analysis and extensive empirical validation (Sun et al., 2024, Kaplun et al., 2023, Son et al., 2024, Bafghi et al., 26 Jan 2025, Colan et al., 21 Aug 2025, Bianchi et al., 2019, Ruan et al., 13 Oct 2025, Dong et al., 2024, Shen et al., 2024, Chekalina et al., 2024, Jiang et al., 2022, Deng et al., 17 Apr 2025, Yang et al., 13 Sep 2025, Gupta et al., 2024, Ramasubramanian et al., 2024, Gupta et al., 12 Feb 2025, Ge et al., 2017).
