Divergence-Guided Reasoning Curriculum
- The paper introduces DGRC, a novel framework that adapts LLMs to specialized domains by detecting and leveraging divergences in teacher-student reasoning.
- It decomposes the adaptation process into atomic Q&A extraction and verified chain-of-thought curricula, leading to significant performance gains in medical and legal benchmarks.
- DGRC enables scalable, unlabeled adaptation by correcting teacher errors through divergence analysis, achieving up to a 7.76% relative improvement over baseline methods.
Divergence-Guided Reasoning Curriculum (DGRC) is a framework for adapting LLMs to domain-specific tasks without requiring human-annotated data. DGRC circumvents a key weakness of traditional knowledge distillation—the tendency for student models to echo the flaws of their teachers—through a curriculum that leverages divergence in reasoning outputs. It decomposes adaptation into two stages: first targeting atomic sub-problems where LLM reliability is greatest, and subsequently integrating these atomic competencies into consistent reasoning chains. The method delivers substantial improvements in low-resource, specialized settings, as substantiated by empirical results on medical and legal benchmarks (Wang et al., 27 Jan 2026).
1. Formal Problem Setting and Notation
DGRC operates on an unlabeled problem set $\mathcal{P}$ within a target domain. For each problem $p \in \mathcal{P}$, the LLM teacher ($T$) and student ($S$) generate reasoning chains $c$, each paired with a final answer $a$ as an output $o = (c, a)$. $T$ samples reasoning paths $O_T = \{o_T^{(1)}, \dots, o_T^{(m)}\}$ and $S$ samples chains $O_S = \{o_S^{(1)}, \dots, o_S^{(n)}\}$. Divergent behavior is defined by differing final answers:

$$D_p = \{(o_T, o_S) \in O_T \times O_S \mid a_T \neq a_S\}.$$

The diagnostic set aggregates all such cases:

$$\mathcal{D}_{\text{diag}} = \{(p, D_p, O_T) \mid p \in \mathcal{P},\ D_p \neq \emptyset\}.$$
This formalization underlies the DGRC pipeline of detecting, analyzing, and remediating teacher-student divergences.
2. Algorithmic Pipeline
DGRC comprises a three-stage process: divergence detection, curriculum generation, and student adaptation.
2.1 Divergence Detection
For each problem $p$, both teacher and student LLMs are queried to generate multiple reasoning chains. Divergent pairs $(o_T, o_S)$ with $a_T \neq a_S$ are collected into $D_p$.
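As a minimal sketch of this step (the `sample_teacher`/`sample_student` callables are hypothetical stand-ins for model sampling, assumed to return lists of `(chain, answer)` tuples), divergence detection reduces to a pairwise comparison of final answers:

```python
from itertools import product

def detect_divergence(problems, sample_teacher, sample_student):
    """Collect teacher-student output pairs whose final answers disagree.

    sample_teacher / sample_student are assumed to return lists of
    (chain, answer) tuples for a given problem.
    """
    diag = []
    for p in problems:
        O_T = sample_teacher(p)   # teacher reasoning samples
        O_S = sample_student(p)   # student reasoning samples
        # A pair is divergent when the final answers differ.
        D_p = [(o_t, o_s) for o_t, o_s in product(O_T, O_S)
               if o_t[1] != o_s[1]]
        if D_p:
            diag.append((p, D_p, O_T))
    return diag
```

Problems where every teacher-student pair agrees simply never enter the diagnostic set, which is what makes the downstream curriculum focus on genuine capability gaps.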
2.2 Curriculum Generation
Atomic Question Extraction and Answering
For every divergent pair, the teacher acts as a diagnostician: prompted with $(p, c_T, c_S)$, it yields a set of atomic questions $Q$ that pinpoint specific factual or logical discrepancies. The teacher then self-answers, forming raw $(q, a)$ pairs. Algorithmically:
```python
RawAtomicQA = []
for (p, D_p, O_T) in D_diag:
    for (o_T, o_s) in D_p:
        c_T, a_T = o_T.chain, o_T.answer
        c_s, a_s = o_s.chain, o_s.answer
        Q = T.diagnose_atomic_questions(p, c_T, c_s)
        A = T.answer_atomic(Q)  # self-answering
        RawAtomicQA.extend(zip(Q, A))
```
Atomic-QA Filtering
Three filtering steps refine atomic QA pairs:
- Instruction-Following Difficulty (IFD): computes $\mathrm{IFD}(q, a) = \mathrm{PPL}(a \mid q) / \mathrm{PPL}(a)$, where $\mathrm{PPL}(a \mid q)$ is the perplexity of the answer conditioned on the question and $\mathrm{PPL}(a)$ is the unconditioned answer perplexity. Only pairs whose IFD score clears a retention threshold are kept.
- Redundancy Filtering: embeds each question into a vector; pairs are treated as redundant when their embedding cosine similarity exceeds a threshold. The less central of two highly similar pairs is discarded.
- LLM-Based Scoring: pairs are scored by the teacher for clarity, completeness, structure, credibility, knowledge richness, logicality, and instruction-following, each dimension scored individually. The summed score must exceed a threshold for the pair to be retained.
The filtered set forms the atomic curriculum $\mathcal{C}_{\text{atomic}}$.
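A condensed sketch of the three filters, assuming per-token log-probabilities, question embeddings, and numeric LLM quality scores are already computed upstream (the threshold values and the mean-similarity "centrality" tie-break are illustrative assumptions, not the paper's exact settings):

```python
import math
import numpy as np

def perplexity(log_probs):
    """Perplexity from a list of per-token log-probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def filter_atomic_qa(pairs, ifd_thresh=0.3, sim_thresh=0.9, score_thresh=35):
    """pairs: list of dicts with keys
       'lp_cond'  - answer log-probs given the question,
       'lp_plain' - answer log-probs without the question,
       'emb'      - question embedding vector,
       'scores'   - per-dimension LLM quality scores."""
    # 1) IFD filter: keep pairs where the conditional/unconditional
    #    perplexity ratio is high enough.
    kept = [p for p in pairs
            if perplexity(p["lp_cond"]) / perplexity(p["lp_plain"]) > ifd_thresh]
    # 2) Redundancy filter: drop the less central of highly similar pairs.
    if kept:
        X = np.array([p["emb"] for p in kept], dtype=float)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        sim = X @ X.T                  # cosine similarity matrix
        central = sim.mean(axis=1)     # proxy for how "central" a pair is
        alive = set(range(len(kept)))
        for i in range(len(kept)):
            for j in range(i + 1, len(kept)):
                if i in alive and j in alive and sim[i, j] > sim_thresh:
                    alive.discard(i if central[i] < central[j] else j)
        kept = [kept[i] for i in sorted(alive)]
    # 3) LLM-score filter: summed quality scores must clear the threshold.
    return [p for p in kept if sum(p["scores"]) > score_thresh]
```

The three stages are deliberately independent, so each can be tuned or ablated on its own.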
Verified Chain-of-Thought Curriculum
For each divergent problem $p$:
- Filtered atomic QAs are aggregated.
- Each teacher chain $c_T$ is checked for consistency: the teacher $T$ reviews $c_T$ against the filtered atomic QA pairs and labels it CONSISTENT (retained) or INCONSISTENT (discarded).
- If multiple chains are consistent, a random one is selected.
All consistent chains (plus single-response CoTs from problems without divergence) constitute the verified CoT curriculum $\mathcal{C}_{\text{CoT}}$.
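The verification step above can be sketched as follows; `verify_consistent` stands in for the teacher's consistency judgment and is a hypothetical callable, not the paper's prompt:

```python
import random

def select_verified_chain(teacher_outputs, atomic_qas, verify_consistent,
                          rng=random):
    """Keep teacher chains judged CONSISTENT with the filtered atomic QA
    pairs; if several survive, pick one at random."""
    consistent = [o for o in teacher_outputs
                  if verify_consistent(o, atomic_qas)]
    if not consistent:
        return None  # this problem contributes no verified CoT
    return rng.choice(consistent)
```

Returning `None` for problems with no consistent chain is what prevents teacher errors from leaking into the student's training data.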
2.3 Student Adaptation
Adaptation proceeds via two-phase supervised fine-tuning (SFT):
- Phase 1: Train on the atomic curriculum with a standard SFT objective, $\mathcal{L}_{1} = -\mathbb{E}_{(q,a) \sim \mathcal{C}_{\text{atomic}}}\left[\log P_{S}(a \mid q)\right]$.
- Phase 2: Train on the verified CoT curriculum, $\mathcal{L}_{2} = -\mathbb{E}_{(p,c,a) \sim \mathcal{C}_{\text{CoT}}}\left[\log P_{S}(c, a \mid p)\right]$.
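A minimal sketch of the two-phase schedule, assuming a prompt-masked cross-entropy loss as is standard for SFT (the `train` helper is hypothetical and wraps whatever fine-tuning loop is in use):

```python
def sft_loss(token_logprobs, loss_mask):
    """Cross-entropy averaged over response tokens only: prompt tokens
    are masked out (mask value 0), matching standard SFT practice."""
    terms = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(terms) / len(terms)

def two_phase_sft(model, atomic_curriculum, cot_curriculum, train):
    """Phase 1 on atomic (q, a) pairs, then Phase 2 on verified CoTs.
    `train` is a hypothetical helper that runs SFT on a dataset and
    returns the updated model."""
    model = train(model, atomic_curriculum)  # Phase 1: atomic QA
    model = train(model, cot_curriculum)     # Phase 2: verified CoT
    return model
```

The strict ordering matters: atomic competencies are installed first so the CoT phase composes facts the student already holds.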
Optional reinforcement learning via Group Relative Policy Optimization (GRPO) may be applied afterwards to further refine the student policy on the target domain.
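GRPO's core step, computing group-relative advantages over a batch of sampled completions, can be sketched as follows (a simplified illustration of the normalization only, not the full clipped policy-gradient loop):

```python
def group_relative_advantages(rewards):
    """GRPO normalizes each sampled completion's reward against its
    group's statistics: A_i = (r_i - mean(r)) / std(r)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]
```

Because the baseline is the group mean rather than a learned value function, GRPO needs no critic model, which keeps the RL stage cheap relative to PPO-style alternatives.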
3. Experimental Design and Quantitative Results
DGRC's experiments span the medical and legal domains. Datasets comprise 182,822 MedMCQA questions (medical) and 42,509 CaseHOLD questions (legal), used without labels. Evaluation utilizes MedMCQA-val, MedQA-USMLE, and MMLU-Medicine (medical), and CaseHOLD-val and MMLU-Law (legal). Teacher models are GPT-4.1, Qwen2.5-Instruct-32B, and Qwen2.5-Instruct-72B; students are Qwen2.5-Instruct-1.5B, 3B, and 7B.
The following table summarizes key average accuracies in both domains (1.5B student, GPT-4.1 teacher):
| Method | Avg-Med (%) | Avg-Law (%) |
|---|---|---|
| Baseline-w/o-label | 54.1 | 62.8 |
| RLAIF (GRPO) | 55.7 | 64.5 |
| DGRC | 58.3 | 67.8 |
Full ablation in the medical setting (1.5B, GPT-4.1 teacher):
- Zero-shot: 44.0
- Atomic only: 49.7
- Verified CoT only: 54.4
- Atomic CoT (full DGRC): 58.3
DGRC achieves a 7.76% relative improvement over unlabeled distillation in the medical domain. Across teacher sizes, DGRC yields +4.9% (Qwen2.5-32B) and +5.5% (Qwen2.5-72B) on the 1.5B student. Self-teaching is feasible only at the 7B scale (75% format compliance), where it still provides accuracy gains; RL on top of DGRC (7B) delivers a further accuracy gain.
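The headline relative-improvement figure follows directly from the Avg-Med column of the table above; a quick arithmetic check:

```python
baseline, dgrc = 54.1, 58.3  # Avg-Med (%), 1.5B student, GPT-4.1 teacher
relative_gain = (dgrc - baseline) / baseline * 100
# (58.3 - 54.1) / 54.1 * 100 ~= 7.76 (% relative improvement)
print(round(relative_gain, 2))
```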
4. Theoretical Foundations and Key Insights
DGRC is predicated on the observed cognitive asymmetry: LLMs display higher reliability on atomic sub-questions than on holistic chain-of-thought tasks. This duality motivates two intertwined curricula: atomic Q&A to close factual gaps, and verified CoTs to teach compositional reasoning. The method ensures adaptation does not propagate teacher error, since only teacher chains verified for atomic consistency are transmitted to the student.
A salient aspect is the unlabeled adaptation capacity: DGRC requires neither ground-truth answers nor supplementary knowledge bases. Experiments indicate superior generalization, particularly to out-of-distribution (OOD) test sets (e.g., MedQA, MMLU), and parameter efficiency, as small students—Qwen2.5-1.5B—can outperform larger models post-adaptation.
5. Constraints, Limitations, and Scalability
DGRC's performance scales with teacher capability; the strength of $T$ is a limiting factor. Shared blind spots persist: if teacher and student err identically, answer divergence cannot flag the error. Computational cost is also higher: curriculum construction incurs a one-time overhead of approximately six times teacher inference and 2.5 times training tokens relative to non-curriculum SFT, so resource requirements are a consideration for practical deployment.
6. Application Scope and Potential Extensions
DGRC has demonstrated effectiveness across models, domains, and teacher-student pairings. Its modular design admits potential extensions:
- Hybrid divergence detection, integrating retrieval-augmented signals, could expose shared blind spots otherwise undetectable by answer divergence alone.
- Resource-efficient variants (smaller ) could lower the computational burden for constrained settings.
- Integration with reinforcement learning from AI feedback (e.g., GRPO) and policy distillation may enable end-to-end adaptation pipelines.
A plausible implication is the adoption of DGRC as a foundational component for future unlabeled domain adaptation approaches in LLMs, especially where labeling is infeasible and high reliability on complex reasoning is required (Wang et al., 27 Jan 2026).