Dual-Stage Mixed Fine-Tuning (DMT)

Updated 20 February 2026

Dual-Stage Mixed Fine-Tuning (DMT) is a two-phase curriculum-based training paradigm that targets both domain specialization and general alignment in large language models.
DMT employs an initial specialist stage using focused data followed by a mixed recall stage that integrates a small fraction of specialist reminders into general training.
Empirical results demonstrate that DMT effectively mitigates skill interference and catastrophic forgetting, enhancing performance on tasks such as mathematical reasoning and code synthesis.

Dual-Stage Mixed Fine-Tuning (DMT) is a two-phase curriculum-based training paradigm for LLMs and neural architectures in which two distinct objectives or data compositions are applied sequentially, often to reconcile conflicting domain requirements or to avoid catastrophic forgetting. DMT addresses the task interference that arises in single-stage or naïve multi-task fine-tuning and has been applied to diverse tasks, such as reasoning, instruction-following, code synthesis, passage re-ranking, and graph-to-text generation. The defining features across DMT variants are a) staged data composition or task specialization, b) an explicit "reminder" of specialist data during the alignment or adaptation stage, and c) avoidance of destructive interference via targeted curriculum or parameter partitioning (Dong et al., 2023, Deng et al., 16 Sep 2025, Huang et al., 28 Jul 2025).

1. Foundational Principles and Motivation

The main pathologies motivating DMT are skill interference and catastrophic forgetting. When LLMs are trained in a single phase on a mixture of discordant objectives — e.g., mathematical reasoning, code generation, and instruction alignment — the noisy gradients from dissimilar data degrade in-domain performance once data is sufficiently abundant. Conversely, purely sequential SFT (supervised fine-tuning) on specialized domains followed by alignment stages leads to catastrophic forgetting, where earlier skills are substantially eroded by subsequent single-objective optimization (Dong et al., 2023). This observation extends to divergent reasoning directions, such as forward and inverse chain-of-thought, whose signals conflict in naïvely mixed SFT and induce a collapse of reasoning distinctiveness (Deng et al., 16 Sep 2025).

DMT resolves these twin pitfalls by activating specialist parameters or scaling up niche domain capacities in a dedicated initial stage, then interleaving (at controlled ratios) "reminder" queries during global alignment or generalization. This curriculum maintains in-domain skill activation while facilitating robust generalization or alignment (Dong et al., 2023).

2. DMT Methodologies Across Domains

A canonical DMT recipe consists of:

Stage 1: Specialist Activation or Domain Adaptation

The model is fine-tuned exclusively on high-quality specialist data. For LLMs, this includes math/code datasets; for graph-to-text, a large noisy Wikipedia/Wikidata corpus; for passage ranking, contrastive learning with hard negatives; and for fast/slow reasoning, low-latency SFT with parameter subspecialization (Dong et al., 2023, Deng et al., 16 Sep 2025, Wang et al., 2021, Pezzuti et al., 28 Mar 2025, Huang et al., 28 Jul 2025).

Stage 2: Mixed Recall & Alignment

The model is further trained on a mixture of general (e.g., ShareGPT, general instructions) and a small fraction of specialist data — the "reminder." For some DMTs, Stage 2 can involve direct preference optimization (DPO), reinforcement learning (RL), or adaptation with annotated ground-truth data (Dong et al., 2023, Deng et al., 16 Sep 2025).

A typical workflow is summarized in the following table:

Stage	Objective/Data	Loss/Algorithm
1	Specialist only (math, code, domain, reasoning direction)	Cross-entropy or contrastive loss
2	Mixed: general + k×specialist/reminder	Cross-entropy, DPO, RL, or distillation

(Dong et al., 2023, Deng et al., 16 Sep 2025, Huang et al., 28 Jul 2025)
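The two rows of the table can be sketched as a small driver. This is a minimal illustration, not any paper's implementation: `run_dmt`, `toy_train_fn`, and the `(prompt, response)` example format are hypothetical names introduced here, and the training callback stands in for whatever loss or algorithm a given stage uses.

```python
from typing import Any, Callable, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, response); an illustrative format, not from the source


def run_dmt(
    init_params: Any,
    specialist: Sequence[Example],
    stage2_mixture: Sequence[Example],
    train_fn: Callable[[Any, Sequence[Example]], Any],
) -> Tuple[Any, Any]:
    """Run stage 1 on specialist data, then stage 2 on the mixed data.

    Stage 2 starts from the stage 1 result, not from the initial parameters.
    """
    stage1_params = train_fn(init_params, specialist)
    stage2_params = train_fn(stage1_params, stage2_mixture)
    return stage1_params, stage2_params


def toy_train_fn(params, data):
    """Toy stand-in for a trainer: the 'parameters' are a log of dataset sizes seen."""
    return params + [len(data)]


stage1, stage2 = run_dmt([], [("2+2?", "4")] * 8, [("hi", "hello")] * 5, toy_train_fn)
print(stage1, stage2)
```

Keeping the stage 1 result as a separate return value makes it easy to rerun stage 2 with a different mixture from the same specialist checkpoint.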

3. Training Protocols and Implementation Details

Mathematical Formulation (Dong et al., 2023):

Let $D_s = D_\text{math} \cup D_\text{code}$ and $D_g$ be general-alignment data. Stage-wise losses are:

$L_{\text{stage1}}(\theta) = \mathbb{E}_{(x,y)\sim D_s}[-\log p_\theta(y|x)]$

$L_{\text{stage2}}(\theta; k) = \mathbb{E}_{(x,y)\sim D_\text{mix}(k)}[-\log p_\theta(y|x)]$

with $D_\text{mix}(k) \equiv D_g \cup k\cdot D_s$, where $k\cdot D_s$ denotes a subset containing a fraction $k$ of the specialist data and $k$ is the reminder ratio. Stage 2 is initialized from the stage 1 checkpoint $\theta_\text{stage1}$ rather than from the base model. For LoRA-based DMT (parameter-efficient fine-tuning), parameters are partitioned via importance scores, and only the top-ranked parameters selected by an importance threshold (e.g., 0.9, which keeps ≈40% of the LoRA parameters) are trained for each task; shared parameters can be selectively activated in both stages (Huang et al., 28 Jul 2025). In the second stage, batches are scheduled so that each epoch matches the required general-to-specialist proportions.
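A minimal sketch of how the stage 2 mixture could be assembled, assuming the reminder is a uniformly random subset holding a fraction `k` of the specialist examples. The function name `build_stage2_mixture`, the seed handling, and the toy data are illustrative assumptions; the cited papers may sample or schedule batches differently.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_stage2_mixture(
    general: Sequence[T], specialist: Sequence[T], k: float, seed: int = 0
) -> List[T]:
    """Return all general examples plus a random fraction k of specialist examples."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    rng = random.Random(seed)
    n_reminder = round(k * len(specialist))  # size of the reminder subset
    reminder = rng.sample(list(specialist), n_reminder)
    mixture = list(general) + reminder
    rng.shuffle(mixture)  # interleave reminders with general data
    return mixture


# Toy data: 1000 general and 512 specialist examples (sizes chosen for illustration).
general = [f"g{i}" for i in range(1000)]
specialist = [f"s{i}" for i in range(512)]
mixture = build_stage2_mixture(general, specialist, k=1 / 256)
print(len(mixture), sum(x.startswith("s") for x in mixture))
```

Reshuffling the mixture once per epoch is one simple way to keep the proportions roughly constant across batches; stricter per-batch quotas would need a custom sampler.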

Fine-tuning for inverse reasoning and mixed directionality employs LoRA adapters and DPO for direct preference supervision after SFT. In graph-to-text generation, by contrast, stage 1 exposure to a large, noisy Wikipedia corpus reduces hallucinations, and stage 2 adapts the model to gold benchmark data using the same standard cross-entropy loss (Wang et al., 2021).
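For readers unfamiliar with DPO, the per-pair objective can be sketched in a few lines. This is the standard DPO loss written with plain floats, not code from the cited work; the function names and the value `beta=0.1` are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; tune per setup
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# When the policy equals the reference model, the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The loss falls below log(2) only when the policy raises the chosen response's log-probability relative to the rejected one by more than the reference model does.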

Typical Hyperparameters:

Typical settings are the AdamW optimizer, a learning rate of 2e−5 in each stage, three epochs, and batch sizes of about 16 (architecture-dependent), with validation on each ability after every epoch or substage (Dong et al., 2023).
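These settings can be captured in a small configuration object. The sketch below is illustrative: `StageConfig` and `total_steps` are hypothetical names, and the step count assumes no gradient accumulation and a final partial batch per epoch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Per-stage hyperparameters, using the values quoted in the text."""
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    num_epochs: int = 3
    batch_size: int = 16  # architecture-dependent in practice

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps over all epochs, counting a final partial batch."""
        return self.num_epochs * math.ceil(num_examples / self.batch_size)


stage1_cfg = StageConfig()
stage2_cfg = StageConfig()  # same settings in both stages, per the text
print(stage1_cfg.total_steps(1000))
```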

4. Empirical Outcomes and Ablation Studies

Across multiple model sizes and domains, DMT delivers performance improvements relative to single-stage or mixed SFT. For LLaMA models (Dong et al., 2023), DMT with k=1/256 recovers 80–90% of the specialist task performance lost in multi-task or sequential alignment, while preserving or enhancing general capabilities. On mathematical, code, and instruction tasks, DMT outperforms multi-task training in high-resource settings on GSM8K, HumanEval, and MT-Bench. However, single-domain training still achieves the highest score in its own domain, at the cost of catastrophic skill loss in the others.

Model	GSM8K, math-only SFT	GSM8K, code-only SFT	GSM8K, general-only SFT	DMT (k=1/256) GSM8K	DMT (k=1/256) HumanEval	DMT (k=1/256) MT-Bench
LLaMA-7B	49.10	4.51	11.10	41.92	17.68	6.08
LLaMA-13B	51.40	5.15	14.02	46.47	19.50	6.03
LLaMA-33B	57.91	6.06	26.06	56.36	25.50	6.73

(Dong et al., 2023)

In mixed forward/reverse chain-of-thought training, naïvely mixing data in a single SFT stage leads to severe performance collapse (a drop of 10–20 points). This "directional collapse" shows up as an almost vanishing average log-probability margin between preferred and dispreferred outputs. DPO can recover some performance (+2–7 points) but falls short of one-direction SFT (Deng et al., 16 Sep 2025).
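The margin diagnostic mentioned above can be sketched as follows, assuming sequence-level log-probabilities for each preferred and dispreferred output are already available. The function name and the toy numbers are illustrative and are not taken from the cited paper.

```python
from typing import Sequence


def mean_logprob_margin(
    preferred_logps: Sequence[float], dispreferred_logps: Sequence[float]
) -> float:
    """Average of log p(preferred) - log p(dispreferred) over paired outputs."""
    if len(preferred_logps) != len(dispreferred_logps) or not preferred_logps:
        raise ValueError("inputs must be non-empty and of equal length")
    margins = [p - d for p, d in zip(preferred_logps, dispreferred_logps)]
    return sum(margins) / len(margins)


# Toy values: a model that separates the two outputs vs. one that does not.
healthy = mean_logprob_margin([-1.0, -2.0], [-3.0, -2.5])
collapsed = mean_logprob_margin([-2.0, -2.0], [-2.0, -2.0])
print(healthy, collapsed)
```

A margin near zero means the model assigns almost the same likelihood to both outputs, which is the signature the text describes.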

For LoRA-PAR, two-stage fine-tuning on partitioned parameter subregions reconciles fast vs. slow thinking, achieving up to 34.37% accuracy on GSM8K after RL (vs. 27.75% after SFT) using only 40% of LoRA parameters, outperforming standard parameter-efficient fine-tuning (Huang et al., 28 Jul 2025). In cross-encoder re-ranking, DMT (contrastive → distillation or vice versa) does not significantly outperform robust single-stage contrastive optimization (Pezzuti et al., 28 Mar 2025). In graph-to-text, DMT yields up to +2 BLEU and boosts all other metrics over single-stage, primarily via noise-to-gold data curriculum and structural input embeddings (Wang et al., 2021).

5. Analysis of Skill Interference, Catastrophic Forgetting, and Curriculum Effects

Data mixing in a single stage, especially when mixing divergent reasoning modes or skills, injects conflicting gradient signals that collapse parameter specialization. Catastrophic forgetting manifests when sequential SFT erodes prior abilities unless "reminder" data is injected. DMT alleviates this by structuring training with a curriculum: full activation of each skill/domain via exclusive focus, followed by alignment or generalization with a carefully controlled "recall" ratio k. Empirical ablations show that the optimal k is typically very small (1/256), with larger k pushing the model back towards overspecialization at the cost of generalization (Dong et al., 2023).
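To see why a small $k$ acts only as a light reminder, the stage 2 loss can be split by data source. The decomposition below is a sketch under two assumptions that the source does not state explicitly: examples are drawn uniformly from $D_\text{mix}(k)$, and the general and reminder subsets are disjoint. Here $n_g = |D_g|$ and $n_s = |D_s|$, and the sizes in the worked example are hypothetical.

```latex
\begin{align*}
L_{\text{stage2}}(\theta; k)
  &= \frac{n_g}{n_g + k\,n_s}\,
     \mathbb{E}_{(x,y)\sim D_g}\!\left[-\log p_\theta(y|x)\right]
   + \frac{k\,n_s}{n_g + k\,n_s}\,
     \mathbb{E}_{(x,y)\sim k\cdot D_s}\!\left[-\log p_\theta(y|x)\right], \\
\text{e.g. } n_g = n_s,\ k = \tfrac{1}{256}:\quad
  \frac{k\,n_s}{n_g + k\,n_s}
  &= \frac{1/256}{1 + 1/256} = \frac{1}{257} \approx 0.0039 .
\end{align*}
```

Under these assumptions the specialist term carries under half a percent of the stage 2 loss weight, so the general data dominates the gradient while the reminder keeps the specialist skills active.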

Parameter partitioning (LoRA-PAR) demonstrates that allocating disjoint parameter subregions for fast/intuitive versus slow/analytical reasoning, based on task-importance, supports dual capability with significantly less compute/memory (Huang et al., 28 Jul 2025).
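One way to picture importance-based partitioning is a cumulative-importance cut per task followed by set operations. This is a hedged sketch, not the LoRA-PAR algorithm itself: the selection rule, the function names, and the toy scores are assumptions, and the actual method may compute and threshold importance differently.

```python
from typing import Dict, Set


def select_by_cumulative_importance(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Pick top-scoring parameters until they cover `threshold` of total importance."""
    total = sum(scores.values())
    selected: Set[str] = set()
    running = 0.0
    for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if running >= threshold * total:
            break
        selected.add(name)
        running += score
    return selected


def partition(
    task1_scores: Dict[str, float], task2_scores: Dict[str, float], threshold: float
) -> Dict[str, Set[str]]:
    """Split parameters into task-specific and shared groups."""
    s1 = select_by_cumulative_importance(task1_scores, threshold)
    s2 = select_by_cumulative_importance(task2_scores, threshold)
    return {"task1_only": s1 - s2, "task2_only": s2 - s1, "shared": s1 & s2}


# Toy importance scores for four hypothetical parameter groups.
fast = {"a": 4, "b": 3, "c": 2, "d": 1}
slow = {"a": 1, "b": 4, "c": 3, "d": 2}
groups = partition(fast, slow, threshold=0.5)
print(groups)
```

Parameters that fall in neither selection stay frozen in this sketch, which is where the compute and memory savings would come from.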

6. Practical Recommendations and Open Challenges

DMT provides clear and actionable alignment strategies:

Specialist Stage: Train on all available in-domain data per skill separately.
Alignment/Recall Stage: Blend a small fraction (k~1/256 typical) of specialist data with general data to retain earlier abilities.
Avoid strict multi-task SFT or unstructured mixing of diverse objectives; inject explicit "mode tokens" or utilize parameter subregion partitioning where possible.
For reinforcement or preference-based objectives (DPO, RL), design preference signals and reward models to explicitly respect reasoning structure or task boundaries (Deng et al., 16 Sep 2025).
Monitor all in-domain validation sets after each epoch, with early stopping to prevent overshooting recall ratios.
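The monitoring recommendation above can be sketched as a simple per-skill early-stopping check. The stopping rule, the function names, and the scores below are illustrative assumptions (higher scores are taken to be better); real setups may weight skills differently or use smoothed metrics.

```python
from typing import Dict, List


def epochs_since_best(scores: List[float]) -> int:
    """Number of epochs elapsed since the best (highest) validation score."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return len(scores) - 1 - best_idx


def should_stop(history: Dict[str, List[float]], patience: int = 1) -> bool:
    """Stop if any skill has gone `patience` epochs without matching its best score."""
    return any(s and epochs_since_best(s) >= patience for s in history.values())


# Toy per-epoch validation scores for two skills.
improving = {"math": [40.0, 42.0], "code": [15.0, 16.0]}
degrading = {"math": [40.0, 42.0, 41.0], "code": [15.0, 16.0, 17.0]}
print(should_stop(improving), should_stop(degrading))
```

Checking every skill separately matters here, because an aggregate score can keep rising while one specialist ability is already eroding.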

Remaining challenges include optimal selection of the specialist reminder fraction k per domain, efficient parameter budget allocation for dual-stage PEFT scenarios, and developing more robust alignment signals that avoid hallucination or collapse under high-variance data mixtures (Dong et al., 2023, Huang et al., 28 Jul 2025).

7. Extensions, Domain-Specific Variants, and Limitations

DMT has been applied to LLM specialization, retrieval/ranking, and structured text generation, with mixed results:

In graph-to-text, DMT combines large-scale Wikipedia adaptation with graph-aligned input embeddings, achieving state-of-the-art across all text metrics (Wang et al., 2021).
In high-variance retrieval/reranking, DMT (as contrastive→distillation or vice versa) does not surpass optimized single-stage learning, suggesting DMT's advantage is context- and data-dependent (Pezzuti et al., 28 Mar 2025).
LoRA-based DMT with parameter partitioning balances compute footprint with specialization, and enables flexible "dual-system" cognitive analogues (Huang et al., 28 Jul 2025).

Limitations include increased compute for multi-stage training, and the requirement for careful curriculum, reminder tuning, and architecture-specific scheduling. Across domains, DMT's success is consistently tied to explicit curriculum, directed data composition, and proactive avoidance of unstructured mixing that triggers interference or forgetting.

Key references:

"How Abilities in LLMs are Affected by Supervised Fine-tuning Data Composition" (Dong et al., 2023)
"When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning" (Deng et al., 16 Sep 2025)
"LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning" (Huang et al., 28 Jul 2025)
"Exploring the Effectiveness of Multi-stage Fine-tuning for Cross-encoder Re-rankers" (Pezzuti et al., 28 Mar 2025)
"Stage-wise Fine-tuning for Graph-to-Text Generation" (Wang et al., 2021)