JSFT: Judge-Augmented Fine-Tuning

Updated 19 January 2026
  • JSFT is a fine-tuning strategy that trains models to generate responses and evaluate them using chain-of-thought reasoning and explicit verdict signals.
  • It employs a two-stage process—initial supervised fine-tuning followed by preference optimization—to refine model judgment and improve multi-modal alignment.
  • JSFT achieves improved calibration, robust transfer learning, and enhanced performance in complex reasoning tasks and domain-specific evaluations.

Judge-Augmented Supervised Fine-Tuning (JSFT) is a post-pretraining strategy for LLMs that introduces explicit judge supervision—often in the form of chain-of-thought (CoT) reasoning traces, verdicts, or preference margins—to produce models capable of evaluating, as well as generating, responses across a range of modalities and domains. JSFT is distinguished from standard supervised fine-tuning (SFT) by its use of additional signals from teacher judges (either human or strong LLMs) and its explicit two-stage or joint objectives that improve the model’s reasoning, calibration, and ability to generalize as an evaluator. The approach underpins several recent advances in alignment, reward modeling, instructional tuning, preference optimization, and evaluation within both unimodal and multimodal systems.

1. JSFT Conceptual Framework and Variants

At its core, JSFT augments conventional SFT—where a model is simply trained to maximize the likelihood of task completions—by explicitly supervising the model to replicate both the deliberative process (CoT, rationale, etc.) and the final judgment (verdict, preference, score, or selection) of a reference judge. Several distinct but closely related JSFT variants have emerged:

  • CoT-based JSFT: Trains the LLM to reproduce teacher-model-generated reasoning traces and verdicts for a given prompt and candidate responses, as in "Improve LLM-as-a-Judge Ability as a General Ability" (Yu et al., 17 Feb 2025) and in multimodal settings (Pi et al., 19 May 2025).
  • Margin-based JSFT / PoFT: Augments SFT with a preference term, e.g., a Bradley-Terry or hinge-margin loss ensuring the target model assigns higher likelihood to responses than a reference aligned LLM (Fan et al., 2024).
  • Fuzzy-logic JSFT: Supervises the model to match multi-class, soft-graded labels reflecting expert uncertainty, rather than one-hot targets (Zheng et al., 12 Jun 2025).
  • Iterative Self-Rationalization: Generates and curates a dataset of model-produced rationales and verdicts, turning them into preference pairs for further DPO-based refinement (Trivedi et al., 2024).
  • Joint policy-judge JSFT: Trains a single LLM to serve as both a generator and a judge, enabling on-policy alignment without a separate reward model (Lee et al., 2024).

2. Canonical JSFT Training Methodology

Most JSFT frameworks adopt a two-stage recipe, consisting of SFT for style and rationale adaptation, followed by preference refinement such as Direct Preference Optimization (DPO):

  • Stage 1: Supervised Fine-Tuning (SFT) with Judge Data
    • The model is presented with prompts consisting of a question and candidate response(s), often with instruction templates diversified via data synthesis (e.g., multiple prompt styles, roles, criteria, and output formats). The target output concatenates a verified CoT or rationale and the judge verdict.
    • Cross-entropy loss is applied to maximize the likelihood of the combined CoT + verdict sequence:

    $$l_{\mathrm{SFT}}(\theta) = \mathbb{E}_{(\mathrm{inst},\,j)\sim D_{\mathrm{SFT}}} \bigl[ -\log P_{\theta}(j \mid \mathrm{inst}) \bigr]$$

    • Examples:
      • (Yu et al., 17 Feb 2025): 20K synthesized judge examples + 5K general chat, Qwen2.5-32B-Base, AdamW optimizer, batch 128, max length 4096, learning rate $2 \times 10^{-5}$, cosine decay.
      • (Zheng et al., 12 Jun 2025): Multitask cross-entropy loss over fuzzy soft labels for each evaluation criterion.
      • (Zhang et al., 11 Nov 2025): Sequence-to-sequence loss over teacher-model rationales and verdicts with LoRA adaptation.

  • Stage 2: Preference Optimization (DPO or Equivalent)

    • The model is further trained to prefer correct judgments (or higher quality rationales) over flawed ones using pairwise preference data, possibly generated via synthetic methods or model self-evaluation.
    • Pairwise logistic loss (as in DPO) is employed, typically relative to a frozen SFT checkpoint:

    $$l_{\mathrm{DPO}}(\theta) = \mathbb{E}_{(\mathrm{inst},\,j_c,\,j_r)\sim D_{\mathrm{DPO}}} \Bigl[ -\log \sigma\Bigl( \beta \bigl( \log P_\theta(j_c \mid \mathrm{inst}) - \log P_{\theta_0}(j_c \mid \mathrm{inst}) - \log P_\theta(j_r \mid \mathrm{inst}) + \log P_{\theta_0}(j_r \mid \mathrm{inst}) \bigr) \Bigr) \Bigr]$$

    • DPO often includes a small NLL regularizer to preserve fluency or calibration:

    $$l_{\mathrm{DPO,total}}(\theta) = l_{\mathrm{DPO}}(\theta) + \alpha\, l_{\mathrm{NLL}}(\theta)$$

    • Preference pairs can come from model disagreements, hard-negative mining, or self-generated rationale quality ratings (Trivedi et al., 2024).

  • Integrated/Judged-Margin JSFT: Some approaches combine SFT with a preference-matching Bradley-Terry loss, wherein the target model is encouraged to assign higher likelihood than a cohort of frozen judge models. This yields:

    $$L_{\mathrm{JSFT}}(\theta) = \mathbb{E}_{(x,y)\sim D} \bigl[ L_{\mathrm{CE}}(\theta; x, y) + L_{\mathrm{BT}}(\theta; x, y) \bigr]$$

    • $L_{\mathrm{BT}}$ is a cross-entropy or log-sigmoid term based on the log-likelihood margin (Fan et al., 2024).
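As a minimal sketch of the two-stage objectives above (pure Python over scalar sequence log-probabilities; the function names and default coefficients are illustrative, not drawn from any cited implementation):

```python
import math

def sft_loss(logp_judge_output: float) -> float:
    """Stage 1: negative log-likelihood of the combined CoT + verdict sequence."""
    return -logp_judge_output

def dpo_loss(logp_c: float, logp_r: float,
             ref_logp_c: float, ref_logp_r: float,
             beta: float = 0.1) -> float:
    """Stage 2: pairwise logistic (DPO) loss relative to a frozen SFT reference.
    j_c = chosen (correct) judgment, j_r = rejected (flawed) judgment."""
    margin = beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))
    # -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def dpo_total(logp_c: float, logp_r: float,
              ref_logp_c: float, ref_logp_r: float,
              beta: float = 0.1, alpha: float = 0.01) -> float:
    """DPO plus a small NLL regularizer on the chosen judgment,
    matching l_DPO,total = l_DPO + alpha * l_NLL."""
    return dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta) + alpha * (-logp_c)
```

When the policy matches the frozen reference, the implicit margin is zero and the DPO loss sits at $\log 2$; improving the chosen judgment's log-likelihood relative to the reference drives it down.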

3. JSFT Data Sourcing and Labeling Protocols

  • Synthetic Label Generation: JSFT relies heavily on high-grade synthetic data, often produced by prompting frontier LLMs (e.g., GPT-4o, Gemini-2.5-Flash) to generate diverse instruction templates, CoT reasonings, and verdicts. Filtering is used to enforce verdict correctness, mitigate position and length biases, and balance general vs. judge-specific data.
  • Reverse Candidate Synthesis: To support preference-based supervision where ground-truth labels are sparse, negative responses can be constructed by injecting errors (hallucination, incompleteness, etc.) into known-good responses using controlled prompting (Pi et al., 19 May 2025).
  • Fuzzy Labels and Multi-Judge Aggregation: In domains where expert disagreement is significant (e.g., medical dialogue), JSFT takes as targets the soft distribution of labels as voted by a panel of annotators, yielding robust learning from uncertain or borderline cases (Zheng et al., 12 Jun 2025).
  • Iterative Self-Rationalization: Judges can self-improve by generating multiple rationales per input, curating superior/inferior rationales into preference pairs, and running multiple DPO cycles (Trivedi et al., 2024).
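One concrete data-construction step named above is position-bias mitigation: each known-good/known-bad comparison is emitted in both candidate orders so the verdict label tracks content rather than slot. A minimal sketch (the prompt template and verdict tokens are illustrative, not from any cited paper):

```python
def build_judge_pairs(question: str, good: str, bad: str) -> list[dict]:
    """Emit each comparison in both candidate orders so a judge cannot
    succeed by always preferring position A (or B)."""
    template = ("Question: {q}\n"
                "Response A: {a}\n"
                "Response B: {b}\n"
                "Which response is better? Answer A or B.")
    return [
        {"prompt": template.format(q=question, a=good, b=bad), "verdict": "A"},
        {"prompt": template.format(q=question, a=bad, b=good), "verdict": "B"},
    ]
```

In reverse candidate synthesis, the `bad` argument would be produced by prompting a frontier LLM to inject a controlled error (hallucination, incompleteness, etc.) into the known-good response.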

4. Benchmark Results and Empirical Findings

JSFT produces state-of-the-art judge models with high data efficiency and strong downstream impact:

| Model / Domain | Setting | SFT Data Size | DPO Pairs | Accuracy / Score | Notes |
|---|---|---|---|---|---|
| RISE-Judge-Qwen2.5-32B (Yu et al., 17 Feb 2025) | RewardBench | 20K | 20K | Avg 92.7 (Chat 96.6, Reasoning 98.8) | 2–40% of the data of comparable methods |
| JSFT (Llama-3.1-8B SFT+DPO) (Singh et al., 28 Sep 2025) | Math bench | 3–4 epochs | 3 epochs | FutureProof −2.92, BackCompat −2.81, QuestionGen −5.19 | Continual fine-tune recovers most retrain benefit |
| JSFT (Self-Judge) (Lee et al., 2024) | AlpacaEval | 17K + 64K | — | 44.88% (SFT: 24.75%, DPO: 35.14%) | On-policy self-judging, joint model |
| SpeechJudge-GRM (Zhang et al., 11 Nov 2025) | Speech | 25K (SFT) | — | 75.3% (SFT), 77.2% (+RL), 72.7% (BTRM) | CoT rationales essential, +2.6pp over BTRM |
| MR. Judge-7B-SFT (Pi et al., 19 May 2025) | Multimodal | 31,703 | 52,080 | 70.3% (SFT), 75.5% (+RL) | +4.2pts w/ CoT, reverse candidates, long-form reasoning |
| LLM-Fuzzy-Judge (Zheng et al., 12 Jun 2025) | Clinical | 1,611 | — | 83.95% (Prof.), 82.00% (MedRel) | Fuzzy, multi-criteria, multi-class outputs |

Across nearly all ablations, chain-of-thought or rationale-based supervision yields 2–10pp gains in judge accuracy relative to verdict-only supervised fine-tuning. Margin-based preference objectives (PoFT) provide additional robustness on noisy data. Downstream, judge-tuned signals are more effective than raw LLM or “reward model” signals for RLHF or policy optimization (Yu et al., 17 Feb 2025, Fan et al., 2024).

5. Transfer, Generalization, and Model Robustness

JSFT improves not only judge accuracy but also calibration, robustness, and general language generation:

  • Calibration: The chain-of-thought training enables the judge to deliberate over evidence, yielding more reliable and calibrated comparative judgments (Yu et al., 17 Feb 2025, Trivedi et al., 2024).
  • Prompt Robustness: Diversity in SFT prompt templates (roles, formats, languages) ensures the judge performs well on an array of evaluation instructions (Yu et al., 17 Feb 2025).
  • General Task Transfer: Chain-of-thought reasoning, seemingly learned for judgment tasks, transfers to improved performance on general benchmarks such as MMLU, GSM, MT-Bench without direct supervision (Yu et al., 17 Feb 2025).
  • Backward Compatibility: Judges fine-tuned with new generator responses generally maintain or improve accuracy on older distributions, crucial for practical deployment (Singh et al., 28 Sep 2025).
  • Future Proofing: JSFT models may degrade significantly on unseen, stronger generator outputs, necessitating continual retraining and evaluation on fresh data (Singh et al., 28 Sep 2025).
  • Question Generalization: Performance on unseen question sets drops 5–15pp relative to seen sets, with vanilla SFT models sometimes generalizing better than strong DPO-refined judges (Singh et al., 28 Sep 2025).

6. Extensions: Multimodality, Feedback Loops, and Domain Adaptation

JSFT has been extended into several complex settings:

  • Multimodal Judging: In MR. Judge (Pi et al., 19 May 2025), multimodal LLMs are trained via JSFT to generate and evaluate by integrating text, images, and multimodal reasoning traces, using both negative candidate synthesis and text-based reasoning distillation.
  • User-in-the-Loop Feedback: QLoRA-based RAG systems employ a JSFT loop that integrates live user feedback (positive and negative) to adapt passage selection and answer generation, with a quantized influence measure weighting examples for ongoing fine-tuning (Rangan et al., 2024).
  • Fuzzy and Soft-Judgment Criteria: JSFT with fuzzy labels enables LLM-as-a-Judge models to reflect soft-granularity human scoring in subjective, nuanced domains, such as clinical communication (Zheng et al., 12 Jun 2025).
  • Self-Rationalization and Iterative Calibration: Judges iteratively generate and evaluate their own rationales, using DPO to improve rationale quality and verdict alignment in fine-grained criteria tasks (Trivedi et al., 2024).
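The fuzzy-label variant above replaces a one-hot target with the vote distribution of an annotator panel. A minimal sketch of that soft-label cross-entropy (class names and vote counts are illustrative, not from the cited clinical setup):

```python
import math

def fuzzy_ce(predicted_probs: dict, panel_votes: dict) -> float:
    """Cross-entropy against a soft label distribution derived from annotator
    votes (e.g. 3 of 5 experts grade a reply 'adequate', 2 grade it 'good'),
    rather than a one-hot target."""
    total = sum(panel_votes.values())
    soft = {label: votes / total for label, votes in panel_votes.items()}
    return -sum(p * math.log(predicted_probs[label])
                for label, p in soft.items() if p > 0)
```

The loss is minimized when the model's predicted distribution matches the panel's vote proportions, so borderline cases with split votes are not forced toward a single hard class.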

7. Implementation Best Practices and Limitations

Practical guidance for implementing robust JSFT includes:

  • Data Quality: Synthesize high-quality CoT rationales, filter aggressively for correctness, and swap candidate positions to minimize bias (Yu et al., 17 Feb 2025, Singh et al., 28 Sep 2025).
  • Two-Stage Training: Separate judge-style adaptation (SFT) from preference discrimination (DPO); initializing DPO from an SFT-trained checkpoint stabilizes preference learning (Yu et al., 17 Feb 2025).
  • Reference Model Stability: Keep the DPO reference model fixed to ground the preference margin (Yu et al., 17 Feb 2025).
  • Regularization: Use a small NLL or KL term to avoid over-optimization/collapse in DPO (Yu et al., 17 Feb 2025, Trivedi et al., 2024).
  • Mixed General Data: Include a portion of non-judgment data in SFT to avoid catastrophic forgetting of foundational language abilities (Yu et al., 17 Feb 2025).
  • Hyperparameters: Tune batch size ($32$–$128$), SFT learning rate ($1 \times 10^{-5}$–$5 \times 10^{-5}$), DPO learning rate ($1 \times 10^{-6}$–$5 \times 10^{-6}$), and preference margin scale $\beta$ ($0.05$–$0.2$), monitoring validation judge accuracy (Yu et al., 17 Feb 2025).
  • Continual Learning: Incremental fine-tuning on new response distributions maintains backward compatibility and adapts to changing generator populations (Singh et al., 28 Sep 2025).
  • Ablations and Monitoring: Log accuracy curves on “easy” vs. “hard” questions and report transfer to general benchmarks (Yu et al., 17 Feb 2025, Singh et al., 28 Sep 2025).
  • Limitations: JSFT-trained judges may remain susceptible to bias in synthetic data, require periodic refresh as generators advance, and may have limited generalization to fully novel questions, especially for large models optimized mainly by DPO (Singh et al., 28 Sep 2025). Data intake pipelines should remove test or eval set leaks.
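Consolidating the tuning ranges above into one place, a starting configuration might look like the following (the dictionary keys are hypothetical and not tied to any specific training framework; only the numeric ranges come from the cited work):

```python
# Illustrative JSFT configuration; keys are hypothetical, numeric ranges
# follow the tuning guidance reported above.
JSFT_CONFIG = {
    "sft": {"batch_size": 128, "learning_rate": 2e-5,
            "max_length": 4096, "schedule": "cosine"},
    "dpo": {"learning_rate": 1e-6, "beta": 0.1, "nll_alpha": 0.01},
}

def in_range(value: float, lo: float, hi: float) -> bool:
    return lo <= value <= hi

# Sanity-check the chosen values against the reported tuning ranges.
assert in_range(JSFT_CONFIG["sft"]["batch_size"], 32, 128)
assert in_range(JSFT_CONFIG["sft"]["learning_rate"], 1e-5, 5e-5)
assert in_range(JSFT_CONFIG["dpo"]["learning_rate"], 1e-6, 5e-6)
assert in_range(JSFT_CONFIG["dpo"]["beta"], 0.05, 0.2)
```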

In summary, Judge-Augmented Supervised Fine-Tuning frames the training of LLM and MLLM judges as a hierarchical process: first instilling the style, transparency, and multi-prompt adaptability of human or strong-LLM evaluation, then explicitly regularizing preference discrimination. JSFT yields models with superior judgment competence, improved data efficiency, and favorable transfer to downstream alignment, policy optimization, and domain specialization tasks (Yu et al., 17 Feb 2025, Singh et al., 28 Sep 2025, Lee et al., 2024, Pi et al., 19 May 2025).
