SketchJudge Reward Models
- SketchJudge reward models are lightweight, preference-based evaluation systems that use multi-axis, JSON rubric outputs for scoring correctness, safety, reasoning, factuality, and clarity.
- They leverage a frozen LLM backbone with plug-and-play LoRA adapters to integrate scalar reward computations and process supervision, enhancing model specialization and efficiency.
- They incorporate uncertainty-based routing and step-level process supervision to balance fast in-distribution judgments with robust, human-like rationales for improved RLHF integration.
A SketchJudge reward model is a family of plug-and-play, process-supervised, or routed preference-based evaluation systems designed for efficient and transparent integration into reinforcement learning from human feedback (RLHF) pipelines, multi-step reasoning, and preference-based training loops for LLMs. SketchJudge approaches are characterized by: (1) lightweight model adaptation or routing to strong LLM judges, (2) explicit multi-axis rubric-driven outputs, and (3) interpretability via human-like rationales or step-level feedback. Three canonical instantiations are in static and online actor RLHF (Agnihotri et al., 6 Jun 2025), uncertainty-based routing between weak/strong judges (Xu et al., 23 Oct 2025), and step-level process-supervised modeling for reasoning (Ma et al., 2023).
1. Model Architectures and Rubric-Driven Judging
The core SketchJudge paradigm for reward modeling in RLHF leverages a frozen, instruction-tuned LLM backbone (typically 7B parameters, e.g., Qwen 2.5-7B Instruct) with minimal adaptation overhead. The judging protocol involves:
- Frozen LLM + One-Line JSON Rubric: A system prompt constrains JSON outputs and enumerates explicit evaluation axes—correctness, safety, reasoning, factuality, clarity—enforcing consistent multi-criteria judgment and output determinism. All non-JSON content is explicitly rejected by the system prompt.
- Output Schema: a single-line JSON object carrying one score per axis plus a short rationale string, e.g. `{"correctness": <int>, "safety": <int>, "reasoning": <int>, "factuality": <int>, "clarity": <int>, "rationale": "<≤20 words>"}`.
- Scalar Reward Computation: Reward is computed as a fixed affine combination of the rubric scores, $r = \sum_{k=1}^{5} w_k s_k$ over the five axes, with fixed per-axis weights $w_k$ [(Agnihotri et al., 6 Jun 2025), Eq. 1].
- Plug-and-Play LoRA Adapter: For improved performance and strong specialization, a rank-16 LoRA adapter is inserted in every transformer layer (0.8% of the backbone parameters), yielding a plug-and-play “judge” without any change to the model’s vocabulary or attention mechanism.
- Preference Loss: Training utilizes a binary logistic loss for preference triplets, where log-probabilities for the preferred ($y^{+}$) and non-preferred ($y^{-}$) answers are compared as $\mathcal{L} = -\log \sigma\big(\log p_\theta(y^{+} \mid x) - \log p_\theta(y^{-} \mid x)\big)$, with $\sigma$ the logistic sigmoid [(Agnihotri et al., 6 Jun 2025), Eq. 4].
- Process-Supervised (Step-Level) Judging: In multi-step reasoning, SketchJudge/PRM evaluations operate at the step level, assigning a categorical label (+1/-1/0 = correct/incorrect/neutral) to each intermediate solution step based on state-action pairs, using a dedicated classification head atop the language backbone (Ma et al., 2023).
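The rubric-to-reward path above can be sketched in a few lines. This is a minimal illustration, not the published implementation: the JSON field names follow the listed axes, but the weight values below are placeholders rather than the weights of Eq. 1.

```python
import json

# Evaluation axes enumerated by the SketchJudge system prompt.
AXES = ["correctness", "safety", "reasoning", "factuality", "clarity"]

# Illustrative weights for the fixed affine combination; the published
# Eq. 1 values are not reproduced here.
WEIGHTS = {"correctness": 0.4, "safety": 0.2, "reasoning": 0.2,
           "factuality": 0.1, "clarity": 0.1}

def parse_judgment(raw: str) -> dict:
    """Parse the one-line JSON rubric; any non-JSON content is rejected."""
    obj = json.loads(raw)  # raises ValueError on non-JSON output
    if not all(axis in obj for axis in AXES):
        raise ValueError("missing rubric axis")
    return obj

def scalar_reward(scores: dict) -> float:
    """Fixed affine combination of the per-axis rubric scores."""
    return sum(WEIGHTS[a] * float(scores[a]) for a in AXES)

judgment = parse_judgment(
    '{"correctness": 5, "safety": 5, "reasoning": 4, '
    '"factuality": 4, "clarity": 5, "rationale": "sound and safe"}'
)
print(scalar_reward(judgment))  # weighted sum of the five axis scores
```

Because the judge emits only this one-line JSON object, the same parser serves both static evaluation and the online PPO loop described in the next section.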
2. Training, Online RLHF, and Inference Integration
SketchJudge models are trained and deployed in RLHF and preference-driven pipelines as follows:
- Online PPO Loop: Actors are optimized via PPO-Clip, with per-response rewards supplied by the plug-and-play SketchJudge (via JSON extraction and scalarization). The PPO protocol follows standard settings: 300,000 steps, batch size 128, the standard clip ratio, and a KL penalty linearly annealed to 0.1 (Agnihotri et al., 6 Jun 2025). For multi-step PRM, the reward signal may be integrated at each intermediate step in the reasoning trajectory.
- LoRA Adapter Fine-Tuning: Only the LoRA parameters are tuned (AdamW, batch size 16) using reward-mix datasets, e.g., 5K RewardBench train + 5K UltraFeedback triplets with emphasis on safety and reasoning. Static rubric prompting and few-shot demonstrations further enhance out-of-distribution accuracy.
- Step-Level Integration: Process-supervised SketchJudge PRMs are trained as 3-way classifiers with cross-entropy loss, utilizing explicitly annotated step-level reward datasets for mathematical (PRM800K) or code (PRM-Code) reasoning. At inference, SketchJudge PRMs provide stepwise correctness feedback to navigate search trees or prune invalid reasoning chains (Ma et al., 2023).
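The two training objectives used above—the binary logistic preference loss and the 3-way step-level cross-entropy—can be written out numerically. This is a self-contained sketch with scalar rewards and raw logits; the function names and the logit ordering are illustrative assumptions.

```python
import math

def preference_loss(r_pos: float, r_neg: float) -> float:
    """Binary logistic loss on a preference triplet:
    -log sigmoid(r(x, y+) - r(x, y-))."""
    margin = r_pos - r_neg
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid)

def step_cross_entropy(logits, label_index: int) -> float:
    """Cross-entropy for one reasoning step in the 3-way PRM classifier.
    Assumed logit order: [incorrect (-1), neutral (0), correct (+1)]."""
    m = max(logits)  # log-sum-exp with max-shift for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label_index]

# A well-separated preference pair incurs a small loss...
print(round(preference_loss(2.0, 0.5), 4))
# ...while an uninformative step classifier pays log(3) nats.
print(round(step_cross_entropy([0.0, 0.0, 0.0], 2), 4))
```

The preference loss drives the LoRA judge toward the human ranking, while the step-level cross-entropy trains the PRM head used for search-tree navigation.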
3. Uncertainty-Based Routing and Hybrid Judge Systems
The SketchJudge routing framework (Xu et al., 23 Oct 2025) mitigates the prohibitive cost of strong LLM judges by adaptively routing between a fast, in-distribution preference model (PM) and an expensive but robust generative judge (e.g., DeepSeek-R1):
- Fast PM with SNGP Uncertainty: A pairwise preference model (e.g., Llama-3.1-8B-Instruct) utilizes a random-feature Gaussian process head (SNGP) for robust epistemic uncertainty. For any triplet, both a logit and a normalized uncertainty score $u \in [0, 1]$ are produced [Eqs. 7–9].
- Routing Policy: If the uncertainty falls below a threshold $\tau$, use the PM's logit; otherwise, forward the pair to the strong LLM judge, mapping its verdict to a fixed-magnitude surrogate logit whose sign depends on the preference outcome [Eq. 10].
- Integration into RLHF: Routed logits serve as the basis for advantage estimation in policy gradient or RLOO updates. The routing threshold $\tau$ is tuned to cap judge invocation rates in accordance with computational or latency budgets.
- Cost/Accuracy Tradeoff: The framework yields 2–5% absolute gains in out-of-distribution RM accuracy with minimal judge queries (e.g., 24% call rate on RewardBench improves average accuracy to 90.6% vs 87.3% with no routing; 100% judge achieves 92.3%) (Xu et al., 23 Oct 2025).
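The routing policy reduces to a single branch on the PM's uncertainty. The sketch below assumes a normalized uncertainty score and a fixed-magnitude surrogate logit; the magnitude and function names are placeholders, not values from the paper.

```python
def route(pm_logit: float, uncertainty: float, tau: float,
          strong_judge=None, surrogate_magnitude: float = 5.0) -> float:
    """Uncertainty-based routing between a fast PM and a strong judge.

    If the PM's normalized uncertainty is below the threshold tau, its
    logit is trusted directly; otherwise the strong judge's binary
    verdict is mapped to a fixed-magnitude surrogate logit (the
    magnitude here is an illustrative placeholder).
    """
    if uncertainty < tau:
        return pm_logit
    prefers_a = strong_judge()  # expensive call: True if answer A preferred
    return surrogate_magnitude if prefers_a else -surrogate_magnitude

# Confident PM: the strong judge is never invoked.
assert route(1.3, uncertainty=0.1, tau=0.5) == 1.3
# Uncertain PM: fall back to the judge's verdict.
assert route(0.2, uncertainty=0.9, tau=0.5,
             strong_judge=lambda: True) == 5.0
```

Raising `tau` caps judge invocations at the cost of accepting more uncertain PM logits, which is exactly the cost/accuracy tradeoff quantified above.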
4. Interpretability, Rationale Generation, and Rubric Transparency
A distinguishing feature of SketchJudge systems is transparent and human-interpretable evaluation, both at the rubric and rationale levels:
- Rationale Fields: For each preference judgment, the judge outputs a concise rationale (≤20 words), supporting interpretability, error analysis, and trust calibration.
- Alignment with Human Explanation: The Qwen 3-8B + LoRA SketchJudge attains substantially higher similarity to human rationales (as scored by GPT-4 on a 10-point scale) on the HH-Rationales set than few-shot and zero-shot judges (Agnihotri et al., 6 Jun 2025).
- Rubric Modifiability: The JSON rubric is fully parameterized in the system prompt—researchers can seamlessly re-specify alignment objectives (e.g., prefer brevity, emphasize safety) by editing this line, with no further model retraining needed.
- Axis-wise Score Monitoring: Output decomposition enables precise diagnosis of model alignment failures across correctness, safety, factuality, reasoning, and clarity.
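Axis-wise monitoring amounts to aggregating the per-axis rubric scores across a batch of judgments and looking for a depressed axis. A minimal sketch, assuming judgments have already been parsed into dicts keyed by the five axes:

```python
from collections import defaultdict

AXES = ("correctness", "safety", "reasoning", "factuality", "clarity")

def axis_means(judgments) -> dict:
    """Mean rubric score per axis, to localize alignment failures."""
    totals = defaultdict(float)
    for j in judgments:
        for axis in AXES:
            totals[axis] += j[axis]
    n = len(judgments)
    return {axis: totals[axis] / n for axis in AXES}

# Toy batch of parsed judgments: safety scores are systematically low.
batch = [
    {"correctness": 5, "safety": 2, "reasoning": 4, "factuality": 5, "clarity": 4},
    {"correctness": 4, "safety": 1, "reasoning": 4, "factuality": 4, "clarity": 5},
]
means = axis_means(batch)
worst = min(means, key=means.get)
print(worst)  # safety — the depressed mean flags the failure axis
```

Because the rubric is decomposed per axis, this kind of diagnosis needs no extra instrumentation beyond logging the judge's JSON outputs.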
5. Empirical Performance and Ablation Findings
Multiple lines of evidence support the empirical efficacy of SketchJudge reward models across static evaluation, online RLHF, and process-supervised reasoning:
| Setting | Metric | SketchJudge Result | Baseline/Notes |
|---|---|---|---|
| RewardBench (static) | Overall accuracy (%) | 96.2 | Outperforms 27B–70B critics (max 95.1) |
| GSM-8K (RLHF) | Exact match (%) | 92.0 | DPO 70B: 61.8%; Zero-shot: 48; Few-shot: 65 |
| MT/Code Benchmarks | RM accuracy (PM/Hybrid) | 90.6–76.6 | Random routing: 88.2–73.5 |
| Math Reasoning (step) | Accuracy gain over CoT | +0.2 to +3.3 | WizardMath-13B HGS-PRM: 13.7 vs 10.4 |
| Code (HumanEval) | pass@1 (%) | 41.5–44.5 | Up to +4.9 over CoT (Ma et al., 2023) |
- Ablation Results:
- In-context few-shot demonstrations (K=6) account for ~2 percentage point improvement.
- LoRA adaptation closes remaining performance gaps, especially on hard and safety domains.
- Hybrid routing accuracy gain over random routing increases with call budget, most prominently on “hard” subsets (Xu et al., 23 Oct 2025).
6. Implementation and Practical Considerations
- Inference Infrastructure: vLLM or Triton are recommended for batched, low-latency judge serving; temperature=0 and top_p=1 enforce rubric-conforming deterministic outputs.
- Data Preprocessing: All static comparisons must be of (prompt, answer_A, answer_B) form; online/incremental application uses (prompt, answer) or (state, action) as required by the pipeline.
- Integration Pseudocode: Direct plug-in procedures are provided for both PPO-based RLHF and process-supervised search in reasoning tasks (Agnihotri et al., 6 Jun 2025, Ma et al., 2023).
- Extensibility: Modifying the prompt rubric instantly alters the reward model's evaluation axes. SketchJudge is agnostic to the choice of backbone LLM provided it is instruction-tuned, and is compatible in principle with hierarchical/ensemble routing policies (Xu et al., 23 Oct 2025).
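The deterministic-serving recommendation above translates directly into request parameters. The sketch below builds a hypothetical request body for an OpenAI-compatible vLLM endpoint; the model name, system prompt, and message contents are placeholders.

```python
import json

# Hypothetical chat-completions payload for an OpenAI-compatible vLLM
# server; "sketchjudge-7b" is an illustrative model name, not a real one.
payload = {
    "model": "sketchjudge-7b",
    "messages": [
        {"role": "system",
         "content": "Respond with a one-line JSON rubric only."},
        {"role": "user",
         "content": "Prompt: ...\nAnswer A: ...\nAnswer B: ..."},
    ],
    # Deterministic, rubric-conforming decoding as recommended above.
    "temperature": 0,
    "top_p": 1,
}
body = json.dumps(payload)
assert json.loads(body)["temperature"] == 0
```

With greedy decoding pinned this way, identical (prompt, answer_A, answer_B) triplets always yield identical rubric JSON, which simplifies caching and reproducibility in batched judge serving.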
7. Extensions, Limitations, and Outlook
- Limitations:
- Deploying a large LLM as a judge has residual infrastructure requirements, though reduced by LoRA.
- Routing frameworks require careful threshold tuning to meet resource constraints.
- Uncertainty estimation and routing are currently binary (PM vs. strong judge), lacking finer granularity.
- Research Directions:
- Hierarchical or multi-tiered routing systems (small generative RM, judge, human oracle).
- Fine-tuning strong LLM judges on comparison tasks to reduce routing costs further.
- Alternative uncertainty quantification (ensembles, MC-dropout) and online active learning for iterative reward model improvement.
- Automated data synthesis for step-level PRMs in new domains.
SketchJudge reward models represent a unified toolkit of preference-based mechanisms for efficient, interpretable, and scalable reward modeling in RLHF and multi-step reasoning, setting new state-of-the-art on multiple evaluation axes and enabling broad practical deployment (Agnihotri et al., 6 Jun 2025, Xu et al., 23 Oct 2025, Ma et al., 2023).