
Dual-stage Mixed Fine-tuning (DMT)

Updated 7 December 2025
  • Dual-stage Mixed Fine-tuning (DMT) is a two-phase adaptation process that separates specialized and general tuning to balance retention and efficient knowledge transfer.
  • By strategically changing data composition and supervision objectives between stages, DMT minimizes catastrophic forgetting and improves performance metrics such as BLEU and accuracy.
  • DMT protocols enable flexible adaptation across tasks like retrieval, graph-to-text generation, and LLM instruction fine-tuning by using stage-specific loss functions and controlled data mixing.

Dual-stage Mixed Fine-tuning (DMT) refers to a class of learning protocols in which a model is adapted in two explicitly distinct phases, typically with a change in data composition, supervision objective, or both, between stages. DMT is designed to exploit the complementary effects of task specialization, knowledge transfer, or domain bridging, in settings ranging from retrieval and graph-to-text generation to modularization of LLMs for specific reasoning or instruction-following skills. Although DMT is instantiated in various forms—parameter selection, data reweighting, stagewise adapters, or mixed minibatch regimes—it generally advances over naïve sequential fine-tuning by managing catastrophic forgetting and enabling efficient knowledge transfer.

1. Formal Definitions and Canonical Pipelines

A Dual-stage Mixed Fine-tuning (DMT) pipeline consists of two non-identical supervised adaptation phases, each with separate data or loss characteristics:

Mathematically, let $\theta^{(1)}$ be the model parameters after Stage 1 adaptation on data $D_1$ with loss function $\mathcal{L}_1$, and $\theta^{(2)}$ the subsequent parameters after Stage 2, adapted on $D_2$ under $\mathcal{L}_2$:

$$\theta^{(1)} = \arg\min_\theta \mathcal{L}_1(D_1; \theta), \qquad \theta^{(2)} = \arg\min_\theta \mathcal{L}_2(D_2; \theta)$$

Various pipelines specify $D_1$, $D_2$, $\mathcal{L}_1$, $\mathcal{L}_2$ to target retention, transfer, or specialization (Dong et al., 2023; Huang et al., 28 Jul 2025; Deng et al., 16 Sep 2025).
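The two-stage objective above can be made concrete on a toy scalar model. The quadratic losses, data, and retention penalty below are illustrative assumptions, not a recipe from any cited paper; the retention term is a crude stand-in for stage-2 "reminder" mixing.

```python
# Toy sketch of the two-stage objective: Stage 1 minimizes L1 on D1,
# Stage 2 minimizes L2 on D2 starting from theta_1. All losses and data
# here are illustrative stand-ins, not from any cited system.

def sgd(theta, grad_fn, data, lr=0.1, steps=500):
    """Plain gradient descent: theta <- theta - lr * dL/dtheta."""
    for _ in range(steps):
        theta -= lr * grad_fn(theta, data)
    return theta

def grad_l1(theta, data):
    """Gradient of Stage 1 loss L1 = mean (theta - x)^2 over D1."""
    return sum(2.0 * (theta - x) for x in data) / len(data)

def make_grad_l2(theta_1, lam=0.5):
    """Stage 2 loss L2 adds a retention term pulling toward theta_1,
    a crude stand-in for mixed 'reminder' data against forgetting."""
    def grad_l2(theta, data):
        fit = sum(2.0 * (theta - x) for x in data) / len(data)
        return fit + 2.0 * lam * (theta - theta_1)
    return grad_l2

D1 = [4.0, 6.0]   # specialist data (Stage 1 optimum: theta = 5.0)
D2 = [0.0, 2.0]   # general data (unregularized optimum: theta = 1.0)

theta_1 = sgd(0.0, grad_l1, D1)                     # Stage 1: ~5.0
theta_2 = sgd(theta_1, make_grad_l2(theta_1), D2)   # Stage 2: ~7/3
```

The final parameters land between the two stage optima, which is exactly the retention/transfer trade-off the formalism describes.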

2. Representative Methodologies

DMT has been instantiated across diverse architectures and application domains. Prominent research exemplars include:

  • Retrieval and Re-ranking: Pezzuti et al. fine-tune cross-encoders first with a contrastive loss over hard negatives, then distill teacher rankings using pairwise RankNet (Pezzuti et al., 28 Mar 2025).
  • Graph-to-Text Generation: Models are pre-fine-tuned on noisy Wiki graphs, then adapted to WebNLG with structure-aware embeddings and cross-entropy loss (Wang et al., 2021).
  • Instruction-following LLMs: First phase specializes on math/code, second phase mixes a small subset of these specialist examples with general instructions, preserving both emergent and learned abilities (Dong et al., 2023).
  • Domain-specific Multilingual LLMs: An initial phase injects broad QA-based medical knowledge; the second specializes via multiple-choice training (PEFT protocols, e.g., DoRA/QLoRA) (Zhou et al., 2024).
  • Reasoning Directionality: Separate supervised fitting on forward and reverse chain-of-thought data, followed by DPO to align generation with positional preference (Deng et al., 16 Sep 2025).
  • System 1/System 2 Modularity: Partitioned LoRA adapter parameters, first SFT on “fast” problems, then RL on deliberative tasks, activating subregions via importance scoring (Huang et al., 28 Jul 2025).

These variants share a division of learning objectives and/or data pools, with empirical justification for departing from naïve sequential training or single-stage mixing.
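The System 1/System 2 modularity variant above hinges on importance-scored partitioning of adapter parameters. A minimal sketch, assuming illustrative unit names, scores, and a ~40% activation fraction (none of which are from the cited paper's API):

```python
# Minimal sketch of importance-scored adapter partitioning: rank adapter
# units by an importance score and activate only the top fraction.
# Unit names, scores, and the 0.4 fraction are illustrative assumptions.

def select_adapter_units(importance, frac=0.4):
    """Return the top-`frac` adapter units by importance score.
    Stage 1 (SFT) and Stage 2 (RL) would each train only their subset."""
    k = max(1, round(frac * len(importance)))
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])

scores = {"q_proj": 0.9, "k_proj": 0.2, "v_proj": 0.7,
          "o_proj": 0.4, "up_proj": 0.1}
active = select_adapter_units(scores, frac=0.4)
```

Restricting each stage to its own subregion is what lets the two phases coexist in one adapter without overwriting each other.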

3. Loss Functions and Optimization Procedures

The core technical distinction in DMT is the alternation or juxtaposition of objective functions and data regimes across stages. The following table summarizes loss types and associated data for key DMT regimes:

| Stage 1 (Objective/Data) | Stage 2 (Objective/Data) | Domain |
|---|---|---|
| LCE over hard negatives (contrastive) | RankNet pairwise distillation (teacher) | Passage re-ranking (Pezzuti et al., 28 Mar 2025) |
| CE on Wikipedia graphs | CE on cleaned, human-reference graphs | Graph-to-text (Wang et al., 2021) |
| SFT on math+code (full) | SFT on mix: general + subset of math+code | Multiskill LLM (Dong et al., 2023) |
| CE on medical QA (MMed-IFT) | Multiple-choice adaptation (MMed-IFT-MC) | Medical LLM (Zhou et al., 2024) |
| SFT on forward chain-of-thought | SFT on reverse; DPO to enforce direction | Reasoning directionality (Deng et al., 16 Sep 2025) |
| SFT (System 1, top-importance params) | RL (System 2, top-importance params) | PEFT modular LLM (Huang et al., 28 Jul 2025) |

Abbreviations: CE = cross-entropy; LCE = localized contrastive estimation; SFT = supervised fine-tuning; RL = reinforcement learning; DPO = direct preference optimization.

Distinct choices in loss and sample composition form the backbone of DMT’s ability to modulate knowledge specificity, task overlap, and retention.
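The first row of the table can be sketched as two scalar loss functions. The formulas are the standard softmax cross-entropy (contrastive) and RankNet objectives; the scalar scores stand in for cross-encoder logits and are purely illustrative.

```python
import math

def lce_loss(pos_score, neg_scores):
    """Stage 1 (sketch): localized contrastive estimation, i.e. softmax
    cross-entropy of the positive passage against hard negatives."""
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

def ranknet_loss(s_pref, s_other):
    """Stage 2 (sketch): pairwise RankNet distillation; penalize the
    student when the teacher-preferred document is not scored higher."""
    return math.log(1.0 + math.exp(-(s_pref - s_other)))
```

Stage 1 shapes absolute scores against hard negatives; Stage 2 only cares about pairwise order relative to the teacher, which is why the two stages can pull the model in different directions when the teacher signal is weak.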

4. Empirical Outcomes

DMT protocols demonstrate highly domain-dependent outcomes. Empirical results in selected studies include:

  • Re-ranking: The DMT pipeline (contrastive then distillation, C→D) does not outperform single-stage contrastive training: nDCG@10 = 0.7391/0.7383 (ELECTRA/RoBERTa) for contrastive alone versus 0.4209/0.4182 for C→D (Pezzuti et al., 28 Mar 2025). Adding the distillation stage yields no statistically significant gain.
  • Graph-to-Text: Pre-fine-tuning with Wikipedia, followed by WebNLG adaptation, plus structure-aware embeddings, yields BLEU 60.56 vs. 57.8 (T5-large baseline), with all metrics showing statistically significant improvements when both stages are combined (Wang et al., 2021).
  • Instructional LLMs: On LLaMA-7B, two-stage DMT with $k = 1/256$ mixing recovers GSM8K = 41.92% and HumanEval = 17.68%, outperforming pure multi-task or sequential regimes (Dong et al., 2023).
  • Modular PEFT: LoRA-PAR’s DMT maintains strong accuracy (GSM8K = 41.85%) while activating only ~40% of adapters, surpassing ordinary LoRA/PiSSA (Huang et al., 28 Jul 2025).
  • Medical LLMs: Two-stage adaptation (MMed-IFT → MMed-IFT-MC) raises Step 3 accuracy by +9.8% and Chinese MLE by +12.8% versus single-stage (Zhou et al., 2024).
  • Reasoning Directionality: Two-stage (forward then reverse) SFT followed by DPO yields +6.8 points average over forward-only SFT on open-domain math/QA sets, with DPO recovering some directionality lost by naïve mixing (Deng et al., 16 Sep 2025).

Collectively, the data indicate that DMT is often effective for bridging domain gaps or supporting emergent abilities, but may not guarantee improvement over well-optimized single-stage regimens, particularly when losses are aligned or data distributions are simple.

5. Analysis of Catastrophic Forgetting and Data Mixing

A frequent motivation for DMT is to prevent catastrophic forgetting observed in sequential task learning:

  • Multi-task SFT (fully mixed) often induces performance conflicts; “hard” specialist tasks (e.g., math/code) suffer when overwhelmed by “easier” or high-frequency general tasks (Dong et al., 2023).
  • Naïve sequential SFT causes later tasks to overwrite earlier acquired abilities.
  • DMT with partial mixing (a second stage consisting mostly of general data plus a light reminder of specialist examples) balances specialized and general skills, as evidenced by t-SNE clustering and validation trade-offs (Dong et al., 2023).
  • Directionality Analysis: When bidirectional (forward/reverse) chain-of-thought data are naïvely mixed, token-level preference margins collapse, undermining distinct reasoning modes. Staged SFT with explicit DPO maintains clear separation (Deng et al., 16 Sep 2025).

These phenomena underscore the importance of careful data partitioning and sequencing in multi-ability adaptation protocols.
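The partial-mixing scheme above can be sketched as a stage-2 dataset builder. The list contents and sampling seed are illustrative; the $1/256$ reminder fraction follows Dong et al. (2023).

```python
import random

def build_stage2_pool(general, specialist, k=1 / 256, seed=0):
    """Stage 2 pool (sketch): all general examples plus a small
    'reminder' subsample of the specialist set (fraction k, >= 1)."""
    rng = random.Random(seed)
    n_reminder = max(1, round(k * len(specialist)))
    return list(general) + rng.sample(list(specialist), n_reminder)

general = [f"general_{i}" for i in range(1000)]
specialist = [f"math_{i}" for i in range(512)]
pool = build_stage2_pool(general, specialist)  # 1000 general + 2 reminders
```

The reminder examples are tiny in number but, per the analysis above, enough to anchor the specialist skill against being overwritten by the general-instruction gradient.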

6. Best Practices and Limitations

Robust DMT deployment requires domain-aware protocol design:

  • Data Partition: Keep conflicting (e.g., directional or mode-specific) data separate during SFT; only mix in a controlled, stage-wise manner (Deng et al., 16 Sep 2025).
  • Mixing Fraction: Small “reminder” ratios $k$ (e.g., $k = 1/256$ for specialist tasks) often provide an optimal trade-off between retention and generalization (Dong et al., 2023).
  • Adapter Activation: In parameter-efficient settings, score and partition adapters by task importance, enabling cognitive-style modularity (System 1/System 2) (Huang et al., 28 Jul 2025).
  • Alignment: For compositional or inverse-reasoning domains, apply explicit preference-based alignment (DPO) after sequential SFT, carefully tuning $\beta$ to maintain directionality (Deng et al., 16 Sep 2025).
  • Hyperparameter Sensitivity: All stages require careful tuning of mixing ratios, batch sizes, and learning rates. Model size influences tolerance to retention/reminder balance (Dong et al., 2023).
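The alignment step in the list above uses the standard DPO objective. The sketch below computes it for a single (chosen, rejected) pair; the log-probabilities are placeholder scalars, not outputs of any particular model.

```python
import math

def dpo_loss(lp_chosen, lp_rejected, ref_lp_chosen, ref_lp_rejected,
             beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid of the
    beta-scaled margin between policy and reference log-ratio rewards.
    Larger beta sharpens the preference pressure, which is why beta
    must be tuned to preserve directionality without overfitting."""
    margin = beta * ((lp_chosen - ref_lp_chosen)
                     - (lp_rejected - ref_lp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy prefers the chosen completion more than the reference does, the margin is positive and the loss drops below $\log 2$; increasing $\beta$ amplifies whatever margin exists.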

Limitations of DMT frameworks include:

  • Teacher Quality Constraint: Distillation improvement saturates when teacher signal is not sufficiently informative (Pezzuti et al., 28 Mar 2025).
  • Overhead: Two-stage or modular protocols may increase engineering complexity and training time.
  • Reverse Data Availability: Effectiveness of bidirectional schemes depends on the quality and domain fit of reverse or complementary datasets (Deng et al., 16 Sep 2025).

7. Emerging Directions and Applicability

The DMT framework has proven adaptable across retrieval, structured generation, LLM instruction fine-tuning, and domain-specific adaptation, especially where multi-ability, multi-lingual, or cognitive modularity is needed (Wang et al., 2021, Dong et al., 2023, Huang et al., 28 Jul 2025, Zhou et al., 2024). For new domains, best practices include:

  1. Stage 1: specialize on the most data-hungry or structurally complex abilities.
  2. Stage 2: introduce “reminder” samples and/or specific task-aligned or preference-based objectives.
  3. Use parameter-efficient fine-tuning and quantization wherever practical (Zhou et al., 2024, Huang et al., 28 Jul 2025).
  4. Explicitly monitor and, if required, measure alignment, retention, and ability separation using task-specific probes and clustering of representations (Dong et al., 2023, Deng et al., 16 Sep 2025).
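The checklist above can be collected into a single recipe description. Every field name in this config is hypothetical and does not belong to the API of any cited framework; it only encodes the four steps in data form.

```python
# Hypothetical DMT recipe mirroring the checklist; all field names are
# illustrative, not an API of any cited framework or library.
dmt_recipe = {
    "stage1": {
        "data": "specialist_corpus",              # step 1: hardest abilities first
        "objective": "sft_cross_entropy",
        "peft": {"method": "lora", "rank": 16},   # step 3: PEFT where practical
    },
    "stage2": {
        "data": {"general": 1.0, "specialist_reminder": 1 / 256},  # step 2
        "objective": ["sft_cross_entropy", "dpo"],
        "dpo_beta": 0.1,
    },
    "probes": ["retention_eval", "ability_separation_tsne"],       # step 4
}
```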

Although not universally optimal, Dual-stage Mixed Fine-tuning provides a principled and empirically supported alternative to traditional sequential or naively mixed adaptation for complex, large-scale model tuning (Pezzuti et al., 28 Mar 2025, Dong et al., 2023, Huang et al., 28 Jul 2025, Zhou et al., 2024, Deng et al., 16 Sep 2025, Wang et al., 2021).
