
Divergence-Guided Reasoning Curriculum

Updated 3 February 2026
  • The paper introduces DGRC, a novel framework that adapts LLMs to specialized domains by detecting and leveraging divergences in teacher-student reasoning.
  • It decomposes the adaptation process into atomic Q&A extraction and verified chain-of-thought curricula, leading to significant performance gains in medical and legal benchmarks.
  • DGRC enables scalable, unlabeled adaptation by correcting teacher errors through divergence analysis, achieving up to a 7.76% relative improvement over baseline methods.

Divergence-Guided Reasoning Curriculum (DGRC) is a framework designed for adapting LLMs to domain-specific tasks without the necessity for human-annotated data. DGRC circumvents weaknesses of traditional knowledge distillation—especially the tendency for student models to echo the flaws of their teachers—through a curriculum that leverages divergence in reasoning outputs. It decomposes adaptation into two stages: first targeting atomic sub-problems where LLM reliability is greatest, and subsequently integrating these atomic competencies into consistent reasoning chains. The method delivers substantial improvements in low-resource, specialized settings, as substantiated by empirical results in medical and legal benchmarks (Wang et al., 27 Jan 2026).

1. Formal Problem Setting and Notation

DGRC operates on an unlabeled problem set D_unlabeled = {p_1, …, p_N} within a target domain. For each problem p, the LLM teacher T (parameters θ_T) and student S (parameters θ_S) generate reasoning chains c, each paired with a final answer a as (c, a). T samples K reasoning paths O^T = {o^T_1, …, o^T_K} and S samples J chains O^S = {o^S_1, …, o^S_J}. Divergent behavior is defined by differing final answers:

D_{p_i} = \{ (o^T, o^S) \mid o^T \in O^T,\ o^S \in O^S,\ a^T \neq a^S \}.

The diagnostic set aggregates all such cases:

D_\text{diag} = \{ (p_i, D_{p_i}, O^T) \mid p_i \in D_\text{unlabeled},\ D_{p_i} \neq \emptyset \}.

This formalization underlies the DGRC pipeline of detecting, analyzing, and remediating teacher-student divergences.

2. Algorithmic Pipeline

DGRC comprises a three-stage process: divergence detection, curriculum generation, and student adaptation.

2.1 Divergence Detection

For each p ∈ D_unlabeled, both teacher and student LLMs are queried to generate multiple reasoning chains. Divergent pairs (o^T, o^S) are identified by a mismatch in final answers.
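A minimal sketch of this detection step, assuming hypothetical `sample_teacher`/`sample_student` callables that return (chain, answer) pairs for a given problem:

```python
def detect_divergences(problems, sample_teacher, sample_student, K=8, J=8):
    """Collect teacher-student chain pairs whose final answers disagree.

    sample_teacher(p, K) and sample_student(p, J) are assumed to return
    lists of (chain, answer) tuples (hypothetical API, not from the paper).
    """
    d_diag = []
    for p in problems:
        O_T = sample_teacher(p, K)   # K teacher reasoning paths
        O_S = sample_student(p, J)   # J student reasoning paths
        divergent = [(o_t, o_s)
                     for o_t in O_T for o_s in O_S
                     if o_t[1] != o_s[1]]   # final answers differ
        if divergent:                # keep only problems with divergence
            d_diag.append((p, divergent, O_T))
    return d_diag
```

Problems where every sampled teacher answer matches every student answer are dropped, mirroring the D_p ≠ ∅ condition in the diagnostic set.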

2.2 Curriculum Generation

Atomic Question Extraction and Answering

For every divergent pair, the teacher is prompted, acting as diagnostician, with (p, c^T, c^S) to yield a set of atomic questions Q_atomic = {q_1, …, q_M} that pinpoint specific factual or logical discrepancies. The teacher then answers its own questions, forming raw (q, a) pairs. Algorithmically,

for (p, D_p, O_T) in D_diag:
    for (o_T, o_S) in D_p:
        c_T, a_T = o_T.chain, o_T.answer
        c_S, a_S = o_S.chain, o_S.answer
        Q = T.diagnose_atomic_questions(p, c_T, c_S)
        A = T.answer_atomic(Q)  # teacher self-answers its own questions
        RawAtomicQA.extend(zip(Q, A))

Atomic-QA Filtering

Three filtering steps refine atomic QA pairs:

  • Instruction-Following Difficulty (IFD): computes IFD(q, a) = S_cond / S_dir, where S_cond = −log P_S(a|q) and S_dir = −log P_S(a). Only pairs with IFD ∈ [T_low, T_high] are retained.
  • Redundancy Filtering: each (q, a) pair is embedded as a vector; redundant pairs are pruned based on cosine similarity:

\mathrm{sim}(i, j) = \frac{v_i \cdot v_j}{\|v_i\| \|v_j\|}.

The less central among highly similar pairs is discarded.

  • LLM-Based Scoring: pairs are scored by T for clarity, completeness, structure, credibility, knowledge richness, logicality, and instruction-following (each scored in {0, 1, 2}). The sum S_LLM must exceed a threshold T_LLM.

The filtered set forms the atomic curriculum D_atomic.
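The three filters compose into a small pipeline. In this sketch, the `logp_cond`, `logp_dir`, `vec`, and `llm_score` fields and the greedy first-kept-wins deduplication are illustrative assumptions, not the paper's exact procedure:

```python
import math

def ifd(logp_cond, logp_dir):
    """IFD(q, a) = S_cond / S_dir, with S the negative log-likelihood."""
    return (-logp_cond) / (-logp_dir)

def cosine(v1, v2):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def filter_atomic(pairs, t_low, t_high, sim_thresh, t_llm):
    """pairs: dicts with hypothetical keys 'logp_cond', 'logp_dir',
    'vec' (embedding), and 'llm_score' (summed LLM quality score)."""
    # 1) IFD band-pass: keep pairs of intermediate difficulty
    kept = [x for x in pairs
            if t_low <= ifd(x["logp_cond"], x["logp_dir"]) <= t_high]
    # 2) greedy redundancy pruning: drop a pair too similar to one already kept
    dedup = []
    for x in kept:
        if all(cosine(x["vec"], y["vec"]) < sim_thresh for y in dedup):
            dedup.append(x)
    # 3) threshold on the summed LLM quality score
    return [x for x in dedup if x["llm_score"] > t_llm]
```

In practice the embeddings and log-probabilities would come from the student model and an encoder; the sketch only fixes the order of operations.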

Verified Chain-of-Thought Curriculum

For each entry (p, D_p, O^T) ∈ D_diag:

  1. Filtered atomic QAs are aggregated.
  2. Each teacher chain c^T_k is checked for consistency: T reviews (p, c^T_k, A_j) and labels it CONSISTENT (retained) or INCONSISTENT.
  3. If multiple chains are consistent, a random one is selected.

All consistent chains (plus single-response CoTs from problems without divergence) constitute D_CoT.
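The verification steps above can be sketched as follows, with `check_consistency` a hypothetical wrapper around the teacher's CONSISTENT/INCONSISTENT judgment:

```python
import random

def build_cot_curriculum(d_diag, atomic_qa_by_problem, check_consistency):
    """Build the verified CoT set D_CoT from divergence diagnostics.

    check_consistency(p, chain, atomic_answers) -> bool is assumed to
    wrap the teacher's CONSISTENT/INCONSISTENT label (hypothetical API).
    """
    d_cot = []
    for p, _, O_T in d_diag:
        answers = atomic_qa_by_problem.get(p, [])   # 1) aggregate filtered atomic QAs
        consistent = [chain for chain, _ in O_T
                      if check_consistency(p, chain, answers)]  # 2) verify each chain
        if consistent:
            d_cot.append((p, random.choice(consistent)))  # 3) pick one at random
    return d_cot
```

Chains from problems without divergence would be appended separately, as the text notes.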

2.3 Student Adaptation

Adaptation proceeds via two-phase supervised fine-tuning (SFT):

  • Phase 1: Train S on the atomic curriculum:

\theta_S \leftarrow \arg\min_\theta \mathcal{L}_\text{atomic}(\theta)

\mathcal{L}_\text{atomic}(\theta) = -\mathbb{E}_{(q,a)\sim\mathcal{D}_\text{atomic}}\left[\log P_S(a \mid q; \theta)\right]

  • Phase 2: Train S on the verified CoT curriculum:

\theta_S \leftarrow \arg\min_\theta \mathcal{L}_\text{CoT}(\theta)

\mathcal{L}_\text{CoT}(\theta) = -\mathbb{E}_{(p,c)\sim\mathcal{D}_\text{CoT}}\left[\log P_S(c \mid p; \theta)\right]

Optionally, reinforcement learning via Group Relative Policy Optimization (GRPO) may be applied on D_CoT to further refine the policy.
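Schematically, the two-phase schedule reduces to ordered SFT passes over the two curricula. Here `train_step` and `model_logprob` are hypothetical stand-ins for a real trainer and the student's log-likelihood:

```python
def nll_loss(model_logprob, dataset):
    """Average negative log-likelihood over (input, target) pairs,
    i.e., the L_atomic / L_CoT objectives above in expectation form.
    model_logprob(target, given) is assumed to return log P_S(target | given)."""
    return -sum(model_logprob(t, g) for g, t in dataset) / len(dataset)

def two_phase_sft(train_step, d_atomic, d_cot, epochs=1):
    """train_step(dataset) is a hypothetical SFT update over one dataset.
    Phase 1 fits atomic (q, a) pairs; Phase 2 fits verified (p, c) chains."""
    for _ in range(epochs):
        train_step(d_atomic)   # Phase 1: minimize L_atomic
    for _ in range(epochs):
        train_step(d_cot)      # Phase 2: minimize L_CoT
```

The key design choice the sketch preserves is strict ordering: the student acquires atomic competencies before being trained on full chains that compose them.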

3. Experimental Design and Quantitative Results

DGRC's experiments span the medical and legal domains. Datasets comprise 182,822 MedMCQA questions (medical) and 42,509 CaseHOLD questions (legal), used as unlabeled data only. Evaluation uses MedMCQA-val, MedQA-USMLE, and MMLU-Medicine (medical), and CaseHOLD-val and MMLU-Law (legal). Teacher models are GPT-4.1, Qwen2.5-Instruct-32B, and Qwen2.5-Instruct-72B; students are Qwen2.5-Instruct-1.5B, 3B, and 7B.

The following table summarizes key average accuracies for the 1.5B student with a GPT-4.1 teacher:

Method                 Avg-Med (%)   Avg-Law (%)
Baseline-w/o-label     54.1          62.8
RLAIF (GRPO)           55.7          64.5
DGRC                   58.3          67.8

Full ablation in the medical setting (1.5B, GPT-4.1 teacher):

  • Zero-shot: 44.0
  • Atomic only: 49.7
  • Verified CoT only: 54.4
  • Atomic → CoT (full DGRC): 58.3

DGRC achieves a 7.76% relative improvement over unlabeled distillation in the medical domain. Across teacher sizes, DGRC yields +4.9% (Qwen2.5-32B) and +5.5% (Qwen2.5-72B) on the 1.5B student. Self-teaching is feasible only for models ≥ 7B (format compliance > 75%), providing +1.8% accuracy gains; RL on top of DGRC (7B) delivers an additional +2.6%.
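As a check, the 7.76% relative figure is consistent with the Avg-Med column in the table above:

```latex
\frac{58.3 - 54.1}{54.1} = \frac{4.2}{54.1} \approx 0.0776,
\quad \text{i.e., a } 7.76\% \text{ relative gain over Baseline-w/o-label.}
```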

4. Theoretical Foundations and Key Insights

DGRC is predicated on the observed cognitive asymmetry: LLMs display higher reliability on atomic sub-questions than on holistic chain-of-thought tasks. This duality motivates two intertwined curricula: atomic Q&A to close factual gaps, and verified CoTs to teach compositional reasoning. The method ensures adaptation does not propagate teacher error, since only teacher chains verified for atomic consistency are transmitted to the student.

A salient aspect is the unlabeled adaptation capacity: DGRC requires neither ground-truth answers nor supplementary knowledge bases. Experiments indicate superior generalization, particularly to out-of-distribution (OOD) test sets (e.g., MedQA, MMLU), and parameter efficiency, as small students—Qwen2.5-1.5B—can outperform larger models post-adaptation.

5. Constraints, Limitations, and Scalability

DGRC's performance scales with teacher capability; the strength of T is a limiting factor. Shared blind spots persist: if teacher and student err identically, answer divergence cannot surface the error. Computational cost is also higher: curriculum construction incurs a one-time overhead of roughly 6× teacher inference and 2.5× training tokens relative to non-curriculum SFT, so resource requirements are a consideration for practical deployment.

6. Application Scope and Potential Extensions

DGRC has demonstrated effectiveness across models, domains, and teacher-student pairings. Its modular design admits potential extensions:

  • Hybrid divergence detection, integrating retrieval-augmented signals, could expose shared blind spots otherwise undetectable by answer divergence alone.
  • Resource-efficient variants (smaller J, K) could lower the computational burden in constrained settings.
  • Integration with reinforcement learning from AI feedback (e.g., GRPO) and policy distillation may enable end-to-end adaptation pipelines.

A plausible implication is the adoption of DGRC as a foundational component for future unlabeled domain adaptation approaches in LLMs, especially where labeling is infeasible and high reliability on complex reasoning is required (Wang et al., 27 Jan 2026).
