Fine-Tuning Approaches
- Fine-tuning approaches are techniques that adapt pre-trained models to downstream tasks by selectively updating parameters to balance flexibility and efficiency.
- They range from full-parameter tuning to parameter-efficient methods like LoRA, BitFit, and adapter tuning, achieving near-parity performance with significantly reduced resource use.
- Recent advancements integrate quantization, integer-only training, and in-context learning to enhance model robustness, speed, and memory efficiency in deployment.
Fine-tuning approaches comprise a diverse array of techniques that adapt pre-trained models to downstream tasks by updating model parameters using supervised or unsupervised objectives. Fine-tuning has become a central methodology in neural NLP, vision, speech, and RL, enabling transfer of foundational model capabilities to new distributions with varying data and compute requirements. Techniques span from full-parameter fine-tuning to parameter-efficient adapters, memory-efficient architectures, integer-only training, algorithmic frameworks for negative transfer avoidance, hybrid in-context adaptation, and beyond.
1. Full-Parameter vs. Parameter-Efficient Fine-Tuning
Full-parameter fine-tuning updates every weight in the network, providing maximal representational flexibility but incurring high memory and compute costs and elevated overfitting risk. In Med42 (Christophe et al., 2024), full fine-tuning of Llama-2 (up to 70B parameters) was benchmarked against LoRA-based parameter-efficient fine-tuning (PEFT). LoRA injects low-rank weight updates ($\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$) into linear layers, reducing the number of trainable weights by orders of magnitude and enabling large models (7–70B) to be adapted within commodity GPU memory. LoRA PEFT achieved nearly the same accuracy as full-parameter tuning (within 1–4 percentage points on MMLU and clinical QA) at drastically lower compute and resource cost. The marginal performance gap (2–6 points on the hardest tasks) determines whether full or PEFT methods are optimal for a given use case.
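The low-rank update can be sketched as follows (a minimal illustration with arbitrary shapes and rank, not the Med42 training code; only `A` and `B` would receive gradients in practice):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Forward pass of a linear layer with a LoRA update.

    The frozen weight W (d_out x d_in) is augmented by a low-rank
    delta B @ A, where A is (r x d_in) and B is (d_out x r).
    Only A and B are trained, so trainable parameters drop from
    d_out * d_in to r * (d_in + d_out).
    """
    scale = alpha / r
    return x @ (W + scale * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                 # zero init: delta starts at 0

x = rng.normal(size=(8, d_in))
y = lora_forward(x, W, A, B, r=r)
# With B = 0 the LoRA path is inactive, so output equals the frozen layer.
assert np.allclose(y, x @ W.T)
```

The zero-initialized `B` is the standard LoRA convention: the adapted model starts exactly at the pretrained function and only drifts as `A` and `B` are updated.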
Bias-only fine-tuning (BitFit/BEFT) represents an even more extreme form of PEFT, where only the bias vectors of the Transformer are updated. The BEFT framework (Huang et al., 19 Sep 2025) formalizes bias selection as an importance-scoring problem, with the best empirical results obtained by tuning only the value-projection bias ($\mathbf{b}_v$). In low-data settings (1k samples), BEFT matches or exceeds classical all-bias tuning and even LoRA PEFT while updating only ~0.1% of the parameters.
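Operationally, bias-only tuning amounts to filtering a model's parameter list by name before building the optimizer. A minimal sketch, using a hypothetical parameter-name dictionary standing in for a real Transformer block (only the naming pattern matters):

```python
# Hypothetical parameter names mimicking one Transformer block; in a real
# framework these would come from model.named_parameters() or equivalent.
params = {
    "attn.q_proj.weight": None, "attn.q_proj.bias": None,
    "attn.v_proj.weight": None, "attn.v_proj.bias": None,
    "mlp.fc1.weight": None,     "mlp.fc1.bias": None,
}

def select_trainable(param_names, bias_filter="v_proj.bias"):
    """BEFT-style selection: train only biases matching the filter.

    BitFit would use bias_filter=".bias" (all biases); BEFT's best-performing
    variant restricts further to the value-projection bias.
    """
    return [name for name in param_names if name.endswith(bias_filter)]

trainable = select_trainable(params)
assert trainable == ["attn.v_proj.bias"]
```

Everything not returned by the selector stays frozen, which is what drives the ~0.1% trainable-parameter footprint.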
Adapter-based, low-rank, and prompt-based approaches dominate practical PEFT implementations for large models in NLP (Yousefiramandi et al., 14 Dec 2025), vision (Shi et al., 17 Apr 2025), and VLMs (Zhao et al., 15 Aug 2025). The unifying motif is freezing the majority of parameters and only updating small, targeted modules (adapters, prompts, bias terms, or low-rank factors) engineered for minimal memory-compute overhead.
2. Classification of Fine-Tuning Strategies
The current spectrum of fine-tuning methodologies can be categorized as follows:
| Approach | Trainable Params | Key Examples / Attributes |
|---|---|---|
| Full FT | ~100% | Standard (Med42 (Christophe et al., 2024), BERT-FT (Gopalan et al., 2020)) |
| LoRA / PEFT | 0.1–10% | LoRA, AdaLoRA (Yousefiramandi et al., 14 Dec 2025, Lin et al., 30 Nov 2025) |
| Adapter Tuning | 0.2–5% | AdaptFormer, CLIP-Adapter (Shi et al., 17 Apr 2025, Zhao et al., 15 Aug 2025) |
| BitFit/BEFT Bias | 0.01–0.1% | BitFit, BEFT (Huang et al., 19 Sep 2025, Lin et al., 30 Nov 2025) |
| Prefix/Prompt-Tune | 0.1–1% | Prefix tuning, prompt tuning (text or vision) |
| Integer PEFT/GSQ | <1% (integer) | GSQ-Tuning (Zhou et al., 18 Feb 2025); Integer LoRA for edge devices |
| Memory-Efficient | Varies (side branches) | LST, UniPT, SHERL (Lin et al., 30 Nov 2025) (activations not stored) |
Representative workflows integrate both quantization and PEFT—for example, QLoRA combines 4-bit weight storage with LoRA on LLMs for practical single-GPU adaptation up to 8B parameters (Yousefiramandi et al., 14 Dec 2025, Walsh et al., 6 Aug 2025).
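A simplified sketch of the weight-quantization half of this recipe, using symmetric per-group 4-bit integers (the actual QLoRA scheme stores weights in the NF4 data type with double quantization; this illustrates only the memory/accuracy trade-off):

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Symmetric 4-bit group quantization (simplified stand-in for NF4).

    Each group of `group_size` values shares one float scale; values are
    rounded to signed integers in [-7, 7].
    """
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Reconstruct an approximate float matrix from integers + scales."""
    return (q * scale).reshape(shape).astype(np.float32)

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128)).astype(np.float32)
q, s = quantize_4bit(W)
W_hat = dequantize_4bit(q, s, W.shape)
err = np.abs(W - W_hat).max()
assert err < 0.5  # coarse but bounded reconstruction error
```

In the QLoRA workflow, the frozen backbone lives in this compressed form; the LoRA adapters remain in bf16/full precision and are the only tensors that receive gradients.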
3. Fine-Tuning Methodologies and Algorithms
3.1 Standard and Embedding-Based Supervision
In classic supervised fine-tuning, a linear head is attached atop the pooled encoder or decoder representation (e.g., the [CLS] embedding in BERT (Gopalan et al., 2020) or the final token embedding in LLMs (Yousefiramandi et al., 14 Dec 2025)). A standard cross-entropy loss is minimized:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)$$

For multi-label problems, binary cross-entropy is used instead.
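A minimal numerically stable implementation of this loss over a batch of classification-head logits (toy values, not tied to any particular model):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-softmax probability at the gold label."""
    logits = logits - logits.max(axis=1, keepdims=True)  # stability shift
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)
assert loss > 0.0
```

The multi-label variant replaces the softmax with an independent sigmoid per class and sums per-class binary cross-entropies.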
3.2 Instruction and Generation-Based Adaptation
Instruction-tuning reformulates classification or structured prediction as generation, where the LLM is trained to produce label strings given prompt-formatted inputs. In (Yousefiramandi et al., 14 Dec 2025), the instruction-tuning objective is standard autoregressive negative log-likelihood over the label tokens, with larger LoRA rank and no explicit classification head.
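The key mechanical difference from head-based fine-tuning is that the loss is computed only over the label tokens of the generated sequence. A sketch, assuming per-token log-likelihoods are already available from the model:

```python
import numpy as np

def label_nll(token_log_probs, label_mask):
    """Autoregressive NLL restricted to label tokens.

    token_log_probs: per-token log-likelihoods for one sequence
    label_mask: 1 where the token belongs to the label string, 0 for prompt
    """
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    label_mask = np.asarray(label_mask, dtype=float)
    return -(token_log_probs * label_mask).sum() / label_mask.sum()

# Three prompt tokens are masked out; only the two label tokens contribute.
lp = [-0.1, -0.2, -0.3, -1.0, -2.0]
mask = [0, 0, 0, 1, 1]
assert abs(label_nll(lp, mask) - 1.5) < 1e-9
```

No classification head exists in this setup: the label string is decoded from the vocabulary like any other generation.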
3.3 In-Context Learning-Augmented Fine-Tuning
ICL+FT (Bornschein et al., 22 Dec 2025) and ManyICL (He et al., 6 Jun 2025) represent hybrid approaches. In ICL+FT, fine-tuning is performed directly on k-shot prompts, so the model's gradient steps encode the structure of in-context learning. ManyICL meta-trains a single LLM on concatenated many-shot prompt–target sequences across multiple tasks, with a loss mask designating all answer tokens as supervised targets, closing the gap between ICL and dedicated FT across the main NLP task families.
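The prompt-and-mask construction at the heart of these approaches can be sketched as follows (tokenization is faked as whitespace splitting purely for illustration; the function and its example strings are not from either paper):

```python
def build_manyshot_sequence(examples, query):
    """Concatenate k demonstrations plus a query into one token sequence,
    returning a parallel mask that marks every answer span as supervised."""
    tokens, mask = [], []
    for context, answer in examples:
        for t in context.split():
            tokens.append(t); mask.append(0)  # demonstration context: unsupervised
        for t in answer.split():
            tokens.append(t); mask.append(1)  # demonstration answer: supervised
    for t in query.split():
        tokens.append(t); mask.append(0)      # final query: context only
    return tokens, mask

toks, m = build_manyshot_sequence(
    [("good movie ->", "positive"), ("bad movie ->", "negative")],
    "great film ->")
assert sum(m) == 2  # exactly the two answer tokens are supervised
```

Training on such sequences makes the gradient signal itself reflect the in-context structure, which is what lets a single meta-trained model behave like a task-specific fine-tune at inference time.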
3.4 Dual-System and Data-Driven Partitioning
LoRA-PAR (Huang et al., 28 Jul 2025) partitions both tasks and LoRA parameters into “System 1” (fast, intuitive) and “System 2” (slow, deliberative) domains. A two-stage fine-tuning schedule first adapts with supervised loss on D₁ (fast) and then applies RL to D₂ (slow), activating only subregions of the LoRA parameters most important to each system (as measured by per-parameter Taylor importance scores). This yields equivalent or superior performance with ~40% of LoRA parameters, outperforming variants that do not partition data or parameters.
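The per-parameter Taylor importance score and the resulting sub-region mask can be sketched as follows (a generic first-order importance criterion; the 40% keep fraction below is illustrative, not the paper's tuned value):

```python
import numpy as np

def taylor_importance(weights, grads):
    """First-order Taylor importance |w * dL/dw| per parameter."""
    return np.abs(weights * grads)

def top_k_mask(importance, keep_frac=0.4):
    """Boolean mask keeping the top keep_frac fraction of parameters."""
    flat = importance.ravel()
    k = max(1, int(keep_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return importance >= thresh

w = np.array([0.5, -2.0, 0.1, 1.0])
g = np.array([0.2, 0.1, 3.0, -0.05])
imp = taylor_importance(w, g)          # [0.1, 0.2, 0.3, 0.05]
mask = top_k_mask(imp, keep_frac=0.5)
assert mask.sum() == 2 and mask[1] and mask[2]
```

Scoring each task domain separately yields two such masks, and only the parameters inside the active mask receive updates during that domain's training stage.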
Targeted efficient PEFT (Dong et al., 2024) uses Fisher-information-based joint sample–parameter co-selection, optimizing both the set of updated parameters and the data subset used to determine the mask, unlike classical PEFT which fixes parameter sets independent of data.
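The Fisher information that drives such selection is typically approximated by its diagonal, i.e., the mean squared per-sample gradient. A toy sketch of that estimator (the data values are invented; the co-selection procedure itself is more involved):

```python
import numpy as np

def diag_fisher(per_sample_grads):
    """Diagonal Fisher approximation: mean of squared per-sample gradients."""
    g = np.asarray(per_sample_grads, dtype=float)
    return (g ** 2).mean(axis=0)

# Toy per-sample gradients: 3 samples x 4 parameters.
grads = np.array([[1.0, 0.0,  2.0, 0.1],
                  [1.0, 0.0, -2.0, 0.1],
                  [1.0, 0.0,  2.0, 0.1]])
F = diag_fisher(grads)
# Parameter 2 carries the most information; parameter 1 carries none.
assert F.argmax() == 2 and F[1] == 0.0
```

Because the estimate depends on which samples contribute gradients, choosing the sample subset and the parameter mask jointly (rather than fixing the mask up front) is what distinguishes this approach from classical PEFT.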
4. Negative Transfer Mitigation, OOD, and Robustness
Standard fine-tuning is susceptible to negative transfer (e.g., of rare or spuriously correlated features) and brittleness out-of-distribution. Concept-Tuning (Yang et al., 2023) addresses this through (i) maximizing mutual information among rare feature slices (patches) and (ii) front-door adjustment using neural attention to deconfound spuriously correlated features. Empirically, Concept-Tuning delivers systematic 2–4.7% accuracy improvements across diverse benchmarks and architectures relative to bi-tuning or core-tuning.
For simulation models, perturbation-based fine-tuning of two-stage estimators (Lakshminarayanan et al., 6 Apr 2025) adapts the regression map around the OOD inference point by synthetic sampling and local SGD, correcting strong OOD biases without requiring knowledge of true parameters.
In RL, performance-degradation during online fine-tuning is mitigated by Automatic Jump-Start (AJS) (Wang et al., 1 May 2025), which gradually phases in exploratory policy gradients only when off-policy evaluation (FQE) certifies monotonic improvement relative to the offline-trained conservative policy, preventing catastrophic drops at early fine-tuning stages.
5. Implementation Efficiency: Quantization, On-Device, Memory
Quantization offers substantial memory and compute savings. QLoRA (Yousefiramandi et al., 14 Dec 2025, Walsh et al., 6 Aug 2025) enables 4-bit adaptation of 7–8B LLMs on 24–48 GB GPUs, with LoRA adapters (in bf16/full-precision) carrying the gradient flow. On speech models, memory-efficient fine-tuning (MEFT) (Lin et al., 30 Nov 2025) eliminates backbone gradient tracking entirely by training a small side network (Ladder Side-Tuning, UniPT, SHERL), achieving up to 73% memory reduction with accuracy within 1% of vanilla FT, and up to 2.1× speedup compared to PEFT.
For edge deployment, GSQ-Tuning (Zhou et al., 18 Feb 2025) introduces fully integer training and inference, encoding all LoRA adapters (and optionally activations/gradients) in a Group-Shared Exponents Integer format. Relative to FP16 training, GSQ-INT6 achieves a 1.8× memory reduction and up to 11× silicon area savings, matching bf16 accuracy within 0.1–0.5pp on LLaMA2 scales.
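A simplified reading of the group-shared-exponent idea, sketched in numpy (not the paper's exact bit layout: here each group stores one power-of-two exponent plus signed integer mantissas):

```python
import numpy as np

def gse_quantize(w, group_size=32, mantissa_bits=6):
    """Group-Shared Exponent quantization sketch.

    Each group of `group_size` values shares one power-of-two exponent;
    individual values are stored as signed integer mantissas.
    """
    w = w.reshape(-1, group_size)
    max_int = 2 ** (mantissa_bits - 1) - 1  # e.g. 31 for 6-bit mantissas
    # Shared exponent chosen so the largest group element fits the range.
    exp = np.ceil(np.log2(np.abs(w).max(axis=1, keepdims=True) / max_int + 1e-12))
    mant = np.clip(np.round(w / 2.0 ** exp), -max_int, max_int).astype(np.int32)
    return mant, exp

def gse_dequantize(mant, exp, shape):
    """Reconstruct approximate floats: mantissa * 2**shared_exponent."""
    return (mant * 2.0 ** exp).reshape(shape)

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))
m, e = gse_quantize(W)
W_hat = gse_dequantize(m, e, W.shape)
rel_err = np.abs(W - W_hat).max() / np.abs(W).max()
assert rel_err < 0.05
```

Because scaling is a pure power-of-two shift, dequantization needs no floating-point multiplier, which is what enables the silicon-area savings claimed for integer-only hardware.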
6. Specialized and Task-Oriented Fine-Tuning Pipelines
Domain adaptation under extremely low-resource or heavy class imbalance requires multi-phase or lightweight approaches:
- In low-resource NMT, continual pre-training (CPT) on in-domain monolingual data followed by intermediate fine-tuning on parallel data (ITTL) outperforms direct single-stage FT by +1.5 BLEU, with further gains from simple ensemble averaging (Thillainathan et al., 28 Mar 2025).
- Long-tail learning is best addressed by lightweight FT (LIFT+) (Shi et al., 17 Apr 2025): updating <1% of parameters with explicit semantic-aware initialization, minimalist augmentation, and test-time crop ensembling corrects the head–tail conditional distribution mismatch seen in "heavy" (all-parameter) FT, yielding 2–5% tail accuracy improvements while converging in ≤15 epochs.
- Fine-grained, few-shot VLM adaptation exploits the latent semantic one-to-many structure of class–image associations via Latent Hierarchical Adapter (LatHAdapter) (Zhao et al., 15 Aug 2025), embedding all entities in hyperbolic space with explicit hierarchical regularization, consistently outperforming Euclidean adapter and prompt-based PEFT baselines across 1–16 shot and domain generalization benchmarks.
7. Best Practices, Trade-Offs, and Practical Guidelines
Choice of fine-tuning strategy must be informed by downstream requirements, data volume, compute/memory, hardware constraints, and task structure:
- Favor full-parameter FT for maximal accuracy in resource-rich or high-stakes deployments. For rapid prototyping, academic labs, or edge scenarios, PEFT methods (LoRA, adapters, BitFit, GSQ) retain almost all of the gains at orders-of-magnitude lower cost (Christophe et al., 2024, Zhou et al., 18 Feb 2025).
- Use quantization (4/8 bit) and LoRA wherever possible for LLM and VLM adaptation in limited-memory settings (Yousefiramandi et al., 14 Dec 2025, Walsh et al., 6 Aug 2025).
- On device or in strictly integer-only environments, GSQ-Tuning or similar integer PEFT should be used (Zhou et al., 18 Feb 2025).
- Mitigate long-tail or head-class overfitting by restricting parameter update budget and biasing initialization toward semantic priors (Shi et al., 17 Apr 2025).
- For few-shot or data-scarce setups, bias-only FT (BEFT) or adapters with explicit structural constraints (LatHAdapter) best leverage the available signal (Huang et al., 19 Sep 2025, Zhao et al., 15 Aug 2025).
- Meta-learned, multi-task or in-context prompt-based fine-tuning is preferred where deployment across many tasks is required, or for strong generalization and catastrophic forgetting avoidance (He et al., 6 Jun 2025, Bornschein et al., 22 Dec 2025).
In summary, fine-tuning approaches span a wide spectrum from full-parameter to ultra-lightweight and integer-only updates, each presenting distinct trade-offs in efficiency, accuracy, and deployability. Cutting-edge pipelines integrate PEFT, quantization, advanced prompt schemes, importance-based parameter/data selection, and robust adaptation strategies to maximize transfer performance, generality, and resource alignment (Yousefiramandi et al., 14 Dec 2025, Huang et al., 28 Jul 2025, He et al., 6 Jun 2025, Yang et al., 2023, Shi et al., 17 Apr 2025, Lin et al., 30 Nov 2025, Zhou et al., 18 Feb 2025, Zhao et al., 15 Aug 2025, Huang et al., 19 Sep 2025, Wang et al., 1 May 2025).