
Divergence-Guided Reasoning Curriculum

Updated 3 February 2026
  • The paper introduces DGRC, a novel framework that adapts LLMs to specialized domains by detecting and leveraging divergences in teacher-student reasoning.
  • It decomposes the adaptation process into atomic Q&A extraction and verified chain-of-thought curricula, leading to significant performance gains in medical and legal benchmarks.
  • DGRC enables scalable, unlabeled adaptation by correcting teacher errors through divergence analysis, achieving up to a 7.76% relative improvement over baseline methods.

Divergence-Guided Reasoning Curriculum (DGRC) is a framework designed for adapting LLMs to domain-specific tasks without the necessity for human-annotated data. DGRC circumvents weaknesses of traditional knowledge distillation—especially the tendency for student models to echo the flaws of their teachers—through a curriculum that leverages divergence in reasoning outputs. It decomposes adaptation into two stages: first targeting atomic sub-problems where LLM reliability is greatest, and subsequently integrating these atomic competencies into consistent reasoning chains. The method delivers substantial improvements in low-resource, specialized settings, as substantiated by empirical results in medical and legal benchmarks (Wang et al., 27 Jan 2026).

1. Formal Problem Setting and Notation

DGRC operates on an unlabeled problem set D_unlabeled = {p_1, …, p_N} within a target domain. For each problem p, the LLM teacher T (parameters θ_T) and student S (parameters θ_S) generate reasoning chains c, each paired with a final answer a as (c, a). T samples K reasoning paths O^T = {o^T_1, …, o^T_K} and S samples J chains O^S = {o^S_1, …, o^S_J}. Divergent behavior is defined by differing final answers:

D_{p_i} = \{ (o^T, o^S) \mid o^T \in O^T,\ o^S \in O^S,\ a^T \neq a^S \}.

The diagnostic set aggregates all such cases:

D_\text{diag} = \{ (p_i, D_{p_i}, O^T) \mid p_i \in D_\text{unlabeled},\ D_{p_i} \neq \emptyset \}.

This formalization underlies the DGRC pipeline of detecting, analyzing, and remediating teacher-student divergences.

2. Algorithmic Pipeline

DGRC comprises a three-stage process: divergence detection, curriculum generation, and student adaptation.

2.1 Divergence Detection

For each p ∈ D_unlabeled, both teacher and student LLMs are queried to generate multiple reasoning chains. Divergent pairs (o^T, o^S) are identified by a mismatch in final answers.
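A minimal sketch of this detection step, assuming hypothetical `sample_teacher`/`sample_student` callables that return (chain, answer) pairs for a given problem:

```python
def detect_divergences(problems, sample_teacher, sample_student, K=8, J=8):
    """Collect teacher-student chain pairs whose final answers disagree.

    sample_teacher(p, K) and sample_student(p, J) are assumed to return
    lists of (chain, answer) tuples (hypothetical API, not from the paper).
    """
    d_diag = []
    for p in problems:
        O_T = sample_teacher(p, K)   # K teacher reasoning paths
        O_S = sample_student(p, J)   # J student reasoning paths
        divergent = [(o_t, o_s)
                     for o_t in O_T for o_s in O_S
                     if o_t[1] != o_s[1]]   # final answers differ
        if divergent:                # keep only problems with divergence
            d_diag.append((p, divergent, O_T))
    return d_diag
```

Problems where every sampled teacher answer matches every student answer are dropped, mirroring the D_p ≠ ∅ condition in the diagnostic set.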

2.2 Curriculum Generation

Atomic Question Extraction and Answering

For every divergent pair, the teacher is prompted, acting as diagnostician, with (p, c^T, c^S) to yield a set of atomic questions Q_atomic = {q_1, …, q_M} that pinpoint specific factual or logical discrepancies. The teacher then answers its own questions, forming raw (q, a) pairs. Algorithmically,

for (p, D_p, O_T) in D_diag:
    for (o_T, o_S) in D_p:
        c_T, a_T = o_T.chain, o_T.answer
        c_S, a_S = o_S.chain, o_S.answer
        Q = T.diagnose_atomic_questions(p, c_T, c_S)
        A = T.answer_atomic(Q)  # teacher self-answers its own questions
        RawAtomicQA.extend(zip(Q, A))

Atomic-QA Filtering

Three filtering steps refine atomic QA pairs:

  • Instruction-Following Difficulty (IFD): computes IFD(q, a) = S_cond / S_dir, where S_cond = −log P_S(a|q) and S_dir = −log P_S(a). Only pairs with IFD ∈ [T_low, T_high] are retained.
  • Redundancy Filtering: each (q, a) pair is embedded as a vector; redundant pairs are pruned based on cosine similarity:

\mathrm{sim}(i, j) = \frac{v_i \cdot v_j}{\|v_i\| \|v_j\|}.

The less central among highly similar pairs is discarded.

  • LLM-Based Scoring: pairs are scored by T for clarity, completeness, structure, credibility, knowledge richness, logicality, and instruction-following (each scored in {0, 1, 2}). The sum S_LLM must exceed a threshold T_LLM.

The filtered set forms the atomic curriculum D_atomic.
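The three filters compose into a small pipeline. In this sketch, the `logp_cond`, `logp_dir`, `vec`, and `llm_score` fields and the greedy first-kept-wins deduplication are illustrative assumptions, not the paper's exact procedure:

```python
import math

def ifd(logp_cond, logp_dir):
    """IFD(q, a) = S_cond / S_dir, with S the negative log-likelihood."""
    return (-logp_cond) / (-logp_dir)

def cosine(v1, v2):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def filter_atomic(pairs, t_low, t_high, sim_thresh, t_llm):
    """pairs: dicts with hypothetical keys 'logp_cond', 'logp_dir',
    'vec' (embedding), and 'llm_score' (summed LLM quality score)."""
    # 1) IFD band-pass: keep pairs of intermediate difficulty
    kept = [x for x in pairs
            if t_low <= ifd(x["logp_cond"], x["logp_dir"]) <= t_high]
    # 2) greedy redundancy pruning: drop a pair too similar to one already kept
    dedup = []
    for x in kept:
        if all(cosine(x["vec"], y["vec"]) < sim_thresh for y in dedup):
            dedup.append(x)
    # 3) threshold on the summed LLM quality score
    return [x for x in dedup if x["llm_score"] > t_llm]
```

In practice the embeddings and log-probabilities would come from the student model and an encoder; the sketch only fixes the order of operations.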

Verified Chain-of-Thought Curriculum

For each entry (p, D_p, O^T) ∈ D_diag:

  1. Filtered atomic QAs are aggregated.
  2. Each teacher chain c^T_k is checked for consistency: T reviews (p, c^T_k, A_j) and labels it CONSISTENT (retained) or INCONSISTENT.
  3. If multiple chains are consistent, a random one is selected.

All consistent chains (plus single-response CoTs from problems without divergence) constitute D_CoT.
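The verification steps above can be sketched as follows, with `check_consistency` a hypothetical wrapper around the teacher's CONSISTENT/INCONSISTENT judgment:

```python
import random

def build_cot_curriculum(d_diag, atomic_qa_by_problem, check_consistency):
    """Build the verified CoT set D_CoT from divergence diagnostics.

    check_consistency(p, chain, atomic_answers) -> bool is assumed to
    wrap the teacher's CONSISTENT/INCONSISTENT label (hypothetical API).
    """
    d_cot = []
    for p, _, O_T in d_diag:
        answers = atomic_qa_by_problem.get(p, [])   # 1) aggregate filtered atomic QAs
        consistent = [chain for chain, _ in O_T
                      if check_consistency(p, chain, answers)]  # 2) verify each chain
        if consistent:
            d_cot.append((p, random.choice(consistent)))  # 3) pick one at random
    return d_cot
```

Chains from problems without divergence would be appended separately, as the text notes.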

2.3 Student Adaptation

Adaptation proceeds via two-phase supervised fine-tuning (SFT):

  • Phase 1: Train S on the atomic curriculum:

\theta_S \leftarrow \arg\min_\theta \mathcal{L}_\text{atomic}(\theta)

\mathcal{L}_\text{atomic}(\theta) = -\mathbb{E}_{(q,a)\sim\mathcal{D}_\text{atomic}}\left[\log P_S(a \mid q; \theta)\right]

  • Phase 2: Train S on the verified CoT curriculum:

\theta_S \leftarrow \arg\min_\theta \mathcal{L}_\text{CoT}(\theta)

\mathcal{L}_\text{CoT}(\theta) = -\mathbb{E}_{(p,c)\sim\mathcal{D}_\text{CoT}}\left[\log P_S(c \mid p; \theta)\right]

Optionally, reinforcement learning via Group Relative Policy Optimization (GRPO) may be applied on D_CoT to further refine the policy.
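Schematically, the two-phase schedule reduces to ordered SFT passes over the two curricula. Here `train_step` and `model_logprob` are hypothetical stand-ins for a real trainer and the student's log-likelihood:

```python
def nll_loss(model_logprob, dataset):
    """Average negative log-likelihood over (input, target) pairs,
    i.e., the L_atomic / L_CoT objectives above in expectation form.
    model_logprob(target, given) is assumed to return log P_S(target | given)."""
    return -sum(model_logprob(t, g) for g, t in dataset) / len(dataset)

def two_phase_sft(train_step, d_atomic, d_cot, epochs=1):
    """train_step(dataset) is a hypothetical SFT update over one dataset.
    Phase 1 fits atomic (q, a) pairs; Phase 2 fits verified (p, c) chains."""
    for _ in range(epochs):
        train_step(d_atomic)   # Phase 1: minimize L_atomic
    for _ in range(epochs):
        train_step(d_cot)      # Phase 2: minimize L_CoT
```

The key design choice the sketch preserves is strict ordering: the student acquires atomic competencies before being trained on full chains that compose them.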

3. Experimental Design and Quantitative Results

DGRC's experiments span the medical and legal domains. Datasets comprise 182,822 MedMCQA questions (medical) and 42,509 CaseHOLD questions (legal), used as unlabeled data only. Evaluation uses MedMCQA-val, MedQA-USMLE, and MMLU-Medicine (medical), and CaseHOLD-val and MMLU-Law (legal). Teacher models are GPT-4.1, Qwen2.5-Instruct-32B, and Qwen2.5-Instruct-72B; students are Qwen2.5-Instruct-1.5B, 3B, and 7B.

The following table summarizes key average accuracies for the 1.5B student with a GPT-4.1 teacher:

Method                 Avg-Med (%)   Avg-Law (%)
Baseline-w/o-label     54.1          62.8
RLAIF (GRPO)           55.7          64.5
DGRC                   58.3          67.8

Full ablation in the medical setting (1.5B, GPT-4.1 teacher):

  • Zero-shot: 44.0
  • Atomic only: 49.7
  • Verified CoT only: 54.4
  • Atomic → CoT (full DGRC): 58.3

DGRC achieves a 7.76% relative improvement over unlabeled distillation in the medical domain. Across teacher sizes, DGRC yields +4.9% (Qwen2.5-32B) and +5.5% (Qwen2.5-72B) on the 1.5B student. Self-teaching is feasible only for models ≥ 7B (format compliance > 75%), providing +1.8% accuracy gains; RL on top of DGRC (7B) delivers an additional +2.6%.
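As a check, the 7.76% relative figure is consistent with the Avg-Med column in the table above:

```latex
\frac{58.3 - 54.1}{54.1} = \frac{4.2}{54.1} \approx 0.0776,
\quad \text{i.e., a } 7.76\% \text{ relative gain over Baseline-w/o-label.}
```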

4. Theoretical Foundations and Key Insights

DGRC is predicated on the observed cognitive asymmetry: LLMs display higher reliability on atomic sub-questions than on holistic chain-of-thought tasks. This duality motivates two intertwined curricula: atomic Q&A to close factual gaps, and verified CoTs to teach compositional reasoning. The method ensures adaptation does not propagate teacher error, since only teacher chains verified for atomic consistency are transmitted to the student.

A salient aspect is the unlabeled adaptation capacity: DGRC requires neither ground-truth answers nor supplementary knowledge bases. Experiments indicate superior generalization, particularly to out-of-distribution (OOD) test sets (e.g., MedQA, MMLU), and parameter efficiency, as small students—Qwen2.5-1.5B—can outperform larger models post-adaptation.

5. Constraints, Limitations, and Scalability

DGRC's performance scales with teacher capability; the strength of T is a limiting factor. Shared blind spots persist: if teacher and student err identically, answer divergence cannot surface the error. Computational cost is also higher: curriculum construction incurs a one-time overhead of roughly 6× teacher inference and 2.5× training tokens relative to non-curriculum SFT, so resource requirements are a consideration for practical deployment.

6. Application Scope and Potential Extensions

DGRC has demonstrated effectiveness across models, domains, and teacher-student pairings. Its modular design admits potential extensions:

  • Hybrid divergence detection, integrating retrieval-augmented signals, could expose shared blind spots otherwise undetectable by answer divergence alone.
  • Resource-efficient variants (smaller J, K) could lower the computational burden in constrained settings.
  • Integration with reinforcement learning from AI feedback (e.g., GRPO) and policy distillation may enable end-to-end adaptation pipelines.

A plausible implication is the adoption of DGRC as a foundational component for future unlabeled domain adaptation approaches in LLMs, especially where labeling is infeasible and high reliability on complex reasoning is required (Wang et al., 27 Jan 2026).
