Continued Fine-Tuning (CFT)
- Continued Fine-Tuning (CFT) is a methodology for sequentially adapting pre-trained models to evolving tasks by optimizing on non-stationary data without revisiting earlier datasets.
- It employs techniques such as full-parameter updates, parameter-efficient modules, regularization, and replay methods to effectively mitigate catastrophic forgetting.
- Practical applications span NLP, vision, and multimodal domains, with performance measured using metrics like Average Accuracy, Backward Transfer, and Harmonic Mean.
Continued Fine-Tuning (CFT) encompasses a family of methodologies for adapting large pre-trained models to dynamic task sequences or data distributions via sequential optimization, typically with the aim of accumulating new capabilities over time while preserving prior knowledge. CFT sits at the interface of continual learning (CL), transfer learning, and parameter-efficient adaptation, with application scope spanning natural language processing, vision-language, and multimodal domains. It addresses both catastrophic forgetting arising from sequential adaptation and the challenge of maintaining broad generalization capacity in large-scale pre-trained models.
1. Formal Definition and Conceptual Scope
Formally, given a pre-trained model with parameters $\theta_0$, CFT operates on a sequence of datasets $D_1, D_2, \dots, D_T$. At each stage, model parameters are first optimized on $D_1$, then updated on $D_2$, and so on, without revisiting earlier datasets. For $T = 2$, the canonical CFT protocol proceeds as:
- Phase 1: $\theta_1 = \arg\min_{\theta} \mathcal{L}(\theta; D_1)$ (with $\theta$ initialized from $\theta_0$)
- Phase 2: $\theta_2 = \arg\min_{\theta} \mathcal{L}(\theta; D_2)$ (with $\theta$ initialized from $\theta_1$)
Here $\mathcal{L}$ is a supervised or unsupervised loss on the given data. This process may be executed via full-parameter or parameter-efficient updates and is typically employed in regimes where data distributions are non-stationary and historical data from previous phases is unavailable or restricted (Aggarwal et al., 2024, Coleman et al., 18 Apr 2025, Shon et al., 2022).
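The two-phase protocol above can be sketched on a toy linear model. This is a minimal illustration, not any cited system: `fit_phase`, the datasets, and the random "pre-trained" initialization are all illustrative stand-ins.

```python
import numpy as np

# Minimal sketch of two-phase CFT: theta is first fit to D1, then the
# resulting parameters are further fit to D2, and D1 is never revisited.

rng = np.random.default_rng(0)

def fit_phase(theta, X, y, lr=0.1, steps=200):
    """One CFT phase: gradient descent on an MSE loss, starting from the
    parameters produced by the previous phase."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

# "Pre-trained" parameters theta0 (random init standing in for pre-training).
theta0 = rng.normal(size=3)

# Two non-stationary datasets D1, D2 drawn from different linear tasks.
X1, X2 = rng.normal(size=(64, 3)), rng.normal(size=(64, 3))
y1 = X1 @ np.array([1.0, 0.0, -1.0])
y2 = X2 @ np.array([0.0, 2.0, 0.0])

theta1 = fit_phase(theta0, X1, y1)   # Phase 1: adapt to D1
theta2 = fit_phase(theta1, X2, y2)   # Phase 2: continue from theta1 on D2

mse = lambda th, X, y: float(np.mean((X @ th - y) ** 2))
print(mse(theta2, X2, y2))  # low: fits the current task
print(mse(theta2, X1, y1))  # much higher than after Phase 1: forgetting
```

Comparing the final parameters' error on $D_1$ before and after Phase 2 is exactly the quantity the forgetting metrics of Section 3 formalize.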
CFT also generalizes to settings that involve multiple intermediate checkpoints or hybridize with compositional training, as when adaptation is staged via curricula or modular sub-tasks (Bursztyn et al., 2022), or when fine-tuning is applied post hoc for knowledge recovery (Wang et al., 15 Jan 2026).
2. Key Methodologies and Optimization Frameworks
CFT methodologies can be categorized along several axes, with major approaches including:
- Full-parameter sequential fine-tuning: Directly update all model weights per phase. While straightforward, this approach is susceptible to catastrophic forgetting, especially under significant task dissimilarity (Aggarwal et al., 2024, Sun et al., 2024).
- Parameter-efficient continual fine-tuning (PECFT): Only small task-specific modules (e.g., adapters, prompts, low-rank factors, masking) are trained at each phase. The shared backbone remains fixed, reducing memory and compute while enabling modular task isolation (Coleman et al., 18 Apr 2025).
- Regularization-based: Quadratic penalties (e.g., EWC, DLCFT) enforce proximity to previous weights, ideally preserving prior task knowledge. Linearized approaches (e.g., DLCFT) use a network linearization at the pre-trained weights and MSE loss for tractable continual regularization (Shon et al., 2022).
- Replay-based: Generative or sampled replay injects synthetic or stored data from prior tasks to maintain performance on earlier phases, mitigating forgetting without direct data retention (Aggarwal et al., 2024).
- Freezing-based: Selective layer freezing prevents updates to weights deemed critical for upstream tasks, as determined by change magnitude or other heuristics (Aggarwal et al., 2024).
- Post-hoc merging and mode connectivity: “MergeTune”-style CFT seeks a solution with low loss along linear paths connecting zero-shot and fine-tuned models, regularizing toward both via curvature-aware surrogates (Wang et al., 15 Jan 2026).
Methodologies may combine the above, and new techniques such as feature transformation tuning (FeTT) introduce nonparametric transforms of learned features to maintain representation capacity across task increments in class-incremental scenarios (Qiang et al., 2024).
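As a concrete instance of the regularization-based family, an EWC-style phase adds a quadratic penalty pulling parameters toward the previous phase's solution, weighted by a per-parameter importance estimate. The sketch below is a generic toy version, not the exact EWC or DLCFT formulation from the cited works; the diagonal "fisher" here is a curvature proxy (mean squared inputs), and all names are illustrative.

```python
import numpy as np

# EWC-style regularized CFT phase on a toy linear model. The Phase-2 loss is
#   L(theta; D2) + (lam/2) * sum_j F_j * (theta_j - theta1_j)^2,
# whose gradient adds lam * F * (theta - theta1) to the data gradient.

rng = np.random.default_rng(1)
X1, X2 = rng.normal(size=(64, 3)), rng.normal(size=(64, 3))
y1 = X1 @ np.array([1.0, 0.0, -1.0])
y2 = X2 @ np.array([0.0, 2.0, 0.0])

def fit(theta, X, y, penalty=None, lr=0.05, steps=400):
    for _ in range(steps):
        g = 2 * X.T @ (X @ theta - y) / len(y)
        if penalty is not None:
            g = g + penalty(theta)
        theta = theta - lr * g
    return theta

theta1 = fit(np.zeros(3), X1, y1)        # Phase 1 on D1

# Diagonal importance proxy: curvature of the D1 loss per parameter.
fisher = np.mean(X1 ** 2, axis=0)
lam = 5.0
ewc_penalty = lambda th: lam * fisher * (th - theta1)

theta2_ewc = fit(theta1, X2, y2, penalty=ewc_penalty)  # Phase 2 with penalty
theta2_ft = fit(theta1, X2, y2)                        # plain sequential FT

mse = lambda th, X, y: float(np.mean((X @ th - y) ** 2))
# The penalized solution retains far more of task 1 than plain fine-tuning.
print(mse(theta2_ft, X1, y1), mse(theta2_ewc, X1, y1))
```

The trade-off is visible in `lam`: larger values anchor the solution to $\theta_1$ and protect the old task at the cost of fit on $D_2$.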
3. Mitigating Catastrophic Forgetting: Analysis and Metrics
Catastrophic forgetting is the primary obstacle in CFT. Its manifestation and mitigation are rigorously quantified using metrics including:
- Average Accuracy (AA): Mean accuracy on all seen tasks after the latest phase.
- Forgetting Score (F): Mean per-task performance loss since completion of each task, $F = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( \max_{t \leq T-1} a_{t,i} - a_{T,i} \right)$, where $a_{t,i}$ is accuracy on task $i$ after phase $t$ (Sun et al., 2024).
- Backward/Forward Transfer (BWT/FWT): Measure of negative/positive knowledge transfer across tasks (Coleman et al., 18 Apr 2025).
- Harmonic Mean (HM): Used in vision-language CFT to balance in-distribution and out-of-distribution/generalization performance (Wang et al., 2024, Wang et al., 15 Jan 2026).
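These metrics can be computed from the standard per-phase accuracy matrix. A minimal sketch (function names and the toy numbers are illustrative; the definitions follow the common CL conventions):

```python
import numpy as np

# R[t, i] is accuracy on task i after finishing training phase t
# (rows: phases, columns: tasks).

def average_accuracy(R):
    """AA: mean accuracy over all seen tasks after the final phase."""
    return float(np.mean(R[-1, :]))

def forgetting(R):
    """F: mean drop from each earlier task's best accuracy to its final one."""
    T = R.shape[0]
    drops = [np.max(R[:T - 1, i]) - R[-1, i] for i in range(T - 1)]
    return float(np.mean(drops))

def backward_transfer(R):
    """BWT: mean change on task i between just after learning it and the end
    (negative values indicate forgetting)."""
    T = R.shape[0]
    return float(np.mean([R[-1, i] - R[i, i] for i in range(T - 1)]))

# Toy 3-phase run: task 0 degrades as later tasks are learned.
R = np.array([[0.90, 0.10, 0.05],
              [0.70, 0.85, 0.10],
              [0.60, 0.80, 0.88]])

print(average_accuracy(R))   # (0.60 + 0.80 + 0.88) / 3 = 0.76
print(forgetting(R))         # 0.175
print(backward_transfer(R))  # -0.175
```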
Crucial empirical findings highlight that:
- Forgetting is minimal when the similarity between consecutive datasets is high (as measurable by dataset embedding similarity, DES, or model parameter difference, MPD) (Aggarwal et al., 2024).
- Simple CFT without mitigation is only safe when DES ≥ 0.9 (Aggarwal et al., 2024).
- Selective replay (5–10%) or freezing the layers that change most relative to the pre-fine-tuned weights recovers a substantial portion (up to 90%) of lost upstream-task ability after adaptation to a dissimilar downstream task (Aggarwal et al., 2024).
- Quadratic regularization under a linearized MSE regime achieves optimality in theory and yields empirical robustness compared to cross-entropy-based criteria in class- and task-incremental learning (Shon et al., 2022).
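The replay finding can be illustrated with a toy sketch: mixing a small fraction of prior-phase examples into the current phase keeps the old task contributing gradient signal. The fraction, dataset shapes, and names here are illustrative, not the cited experimental setup.

```python
import numpy as np

# Replay-based mitigation: during Phase 2, 10% of Phase-1 examples
# (stored or synthetic) are mixed into the training set.

rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=(200, 3)), rng.normal(size=(200, 3))
y1 = X1 @ np.array([1.0, 0.0, -1.0])
y2 = X2 @ np.array([0.0, 2.0, 0.0])

def fit(theta, X, y, lr=0.05, steps=500):
    for _ in range(steps):
        theta = theta - lr * 2 * X.T @ (X @ theta - y) / len(y)
    return theta

theta1 = fit(np.zeros(3), X1, y1)          # Phase 1

replay_frac = 0.10
k = int(replay_frac * len(X2))
idx = rng.choice(len(X1), size=k, replace=False)
X_mix = np.vstack([X2, X1[idx]])           # current data + 10% replay
y_mix = np.concatenate([y2, y1[idx]])

theta2_replay = fit(theta1, X_mix, y_mix)  # Phase 2 with replay
theta2_plain = fit(theta1, X2, y2)         # Phase 2 without

mse = lambda th, X, y: float(np.mean((X @ th - y) ** 2))
# Replay reduces the error regression on the Phase-1 task.
print(mse(theta2_plain, X1, y1), mse(theta2_replay, X1, y1))
```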
4. Application Scenarios and Empirical Results
CFT is applicable across domains:
- LLMs: Sequential fine-tuning for expanding language coverage (e.g., adapting English LLMs to new languages), compositional fine-tuning for reasoning via step-by-step curricula, and critique fine-tuning for rapid enhancement of mathematical/logical reasoning via dense critique targets (Aggarwal et al., 2024, Bursztyn et al., 2022, Wang et al., 3 Jun 2025).
- Vision-Language and Vision Models: Upgrade-robust prompt adaptation under model drift (ContCoOp), post-hoc recovery of forgotten generalization (MERGETUNE), and continual class-incremental learning with feature transformation (FeTT) (Wang et al., 2024, Wang et al., 15 Jan 2026, Qiang et al., 2024).
- Performance: Representative findings are captured in the following table summarizing various CFT interventions:
| Domain | CFT Variant | Forgetting Mitigation | Key Metric | Empirical Gains | Source |
|---|---|---|---|---|---|
| LLM (multiling.) | Generative replay (GR) | 5–10% synthetic replay | ΔTA (English) | GR recovers 90% of TA loss while improving LA | (Aggarwal et al., 2024) |
| VLM | MERGETUNE | Mode connectivity + surrogate | Harmonic Mean | +5.6% HM over CoOp, +0.97% on KgCoOp, surpasses CLIP on OOD tasks | (Wang et al., 15 Jan 2026) |
| CL/NLP | Linearized regularization | Quadratic penalty (DLCFT) | Mean Acc., BWT | Data-IL AA: 82.7% (DLCFT) vs 78.9% (EWC) (10 task, CIFAR100) | (Shon et al., 2022) |
| VLM (prompt) | ContCoOp | Attention + distillation | Harmonic Mean | +2.68% HM over prior prompt methods (11 datasets avg.) | (Wang et al., 2024) |
| LLM (logic/math) | Critique fine-tuning | Dense critique supervision | Avg. accuracy | +14.9% Math, +14.7% Logic over one-shot base | (Wang et al., 3 Jun 2025) |
| Vision CL | FeTT | Feature transform | Avg. acc (CIL) | +1–2% over adapter baseline; +6% with ensemble (ImageNet-R) | (Qiang et al., 2024) |
5. The Role of Task and Data Similarity
Performance under CFT strongly depends on the distributional proximity between subsequent datasets or tasks. Task similarity may be quantitatively estimated by:
- Dataset Embedding Similarity (DES): Defined as the cosine similarity of mean representations under a language-agnostic encoder. When DES ≥ 0.9, standard CFT is safe (Aggarwal et al., 2024).
- Model Parameter Difference (MPD): $\mathrm{MPD} = \lVert \theta_A - \theta_B \rVert$, the distance between parameters fine-tuned separately on the two datasets, with low MPD indicating high similarity (Aggarwal et al., 2024).
Similarity diagnosis informs whether simple CFT suffices or if replay/regularization is required. For highly dissimilar tasks, mitigation is essential to prevent catastrophic forgetting.
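Both diagnostics are cheap to compute. The sketch below uses assumed, illustrative definitions (cosine similarity of mean embeddings for DES; a size-normalized L2 distance for MPD) and synthetic data; it is not the cited implementation.

```python
import numpy as np

def des(emb_a, emb_b):
    """Cosine similarity of the mean embeddings of two datasets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    return float(mu_a @ mu_b / (np.linalg.norm(mu_a) * np.linalg.norm(mu_b)))

def mpd(theta_a, theta_b):
    """Size-normalized L2 distance between two parameter vectors
    (the normalization here is an illustrative choice)."""
    return float(np.linalg.norm(theta_a - theta_b) / theta_a.size)

rng = np.random.default_rng(3)
base = rng.normal(size=(100, 16))
similar = base + 0.01 * rng.normal(size=(100, 16))   # near-identical distribution
different = rng.normal(loc=3.0, size=(100, 16))      # shifted distribution

print(des(base, similar))    # close to 1.0 -> plain CFT likely safe
print(des(base, different))  # well below 0.9 -> replay/regularization advised

# MPD: two checkpoints that barely differ have a near-zero score.
theta_a = rng.normal(size=1000)
theta_b = theta_a + 0.001 * rng.normal(size=1000)
print(mpd(theta_a, theta_b))
```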
6. Extensions: Compositional, Critique, and Compatible Fine-Tuning
CFT encompasses several advanced adaptations:
- Compositional Fine-Tuning (CompFT): Decomposes a complex target task into component sub-tasks, arranged as a training curriculum, and sequentially fine-tunes on these subcomponents. Empirically, CompFT bridges the gap between small/mid-size models and giant LMs relying on chain-of-thought prompting (Bursztyn et al., 2022).
- Critique Fine-Tuning: CFT on model-generated errors with dense critique supervision, efficiently unlocking transfer in reasoning domains (e.g., math, logic) even from a single prototypical instance (Wang et al., 3 Jun 2025).
- Compatible Fine-Tuning (ContCoOp): Learnable prompts updated with class information via attention to maximize transferability across model upgrades, by dynamically adapting conditioning to embedding-space shifts (Wang et al., 2024).
These extensions illustrate the paradigm’s flexibility and highlight potential for further methods leveraging task structure, critique, and module compatibility for robust continual adaptation.
7. Practical Insights and Open Challenges
Key recommendations for effective CFT include:
- Preempt task-sequence forgetting with similarity diagnostics (e.g., DES, MPD).
- When similarity is low, employ lightweight generative replay or selective layer freezing for robust knowledge retention (Aggarwal et al., 2024).
- Modularize adaptation: leverage parameter-efficient techniques (adapters, prompts, LoRA) to limit memory/computation (Coleman et al., 18 Apr 2025).
- Post-hoc knowledge recovery (MERGETUNE) is viable for VLMs, avoiding expensive replay and outperforming mode-averaging baselines (Wang et al., 15 Jan 2026).
- Monitor memory and compute growth as task complexity increases; dynamic module scaling and router networks are under exploration (Coleman et al., 18 Apr 2025).
- Recognize that domain and compositionality coverage, as well as prompt-sensitivity in LLMs, may evolve as a function of pre-training stage and depth of fine-tuning (Sun et al., 2024, Bursztyn et al., 2022).
Open research questions persist regarding generalized merging of multiple specialist checkpoints, multi-modal adaptation, and fully online CFT under realistic constraints. Theoretical understanding of linear connectivity-based recovery and hybrid compositional–continual methods remains an active domain.
References:
- (Aggarwal et al., 2024, Bursztyn et al., 2022, Wang et al., 2024, Wang et al., 3 Jun 2025, Wang et al., 15 Jan 2026, Sun et al., 2024, Qiang et al., 2024, Coleman et al., 18 Apr 2025, Shon et al., 2022)