RULERS: Rubric Unification & Robust LLM Scoring
- RULERS is a rubric-based evaluation framework that compiles natural language rubrics into immutable, executable JSON bundles for LLM scoring.
- It employs structured decoding and evidence anchoring to prevent rubric prompt instability and ensure deterministic validation of extracted evidence.
- The framework uses post-hoc Wasserstein calibration to align model scores with human judgments without retraining, improving reproducibility and robustness.
RULERS (Rubric Unification, Locking, and Evidence-Anchored Robust Scoring) is a compiler-executor framework expressly designed for robust rubric-based evaluation in the "LLM-as-a-Judge" paradigm. It directly targets three chronic failure modes in LLM scoring: rubric prompt instability, unverifiable reasoning, and scale misalignment with human judgment boundaries. The framework reframes judge alignment as a criteria-transfer problem, transforming natural language rubrics into executable and auditable artifacts for deterministic, evidence-grounded and scale-consistent LLM scoring. RULERS operates entirely without fine-tuning model parameters, instead leveraging structured rubric compilation, evidence anchoring, and post-hoc calibration to achieve high-fidelity, reproducible human alignment (Hong et al., 13 Jan 2026).
1. Formalization and Reliability Conditions
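RULERS frames LLM judge alignment as a criteria-transfer optimization task. Let $\mathcal{X}$ denote the space of evaluation instances, each instance $x \in \mathcal{X}$ decomposed into atomic units $u_1, \dots, u_n$. Target score vectors are $y = (y_1, \dots, y_K) \in \mathcal{Y} = \mathcal{Y}_1 \times \dots \times \mathcal{Y}_K$ covering $K$ traits, where $\mathcal{Y}_k$ is the trait-specific discrete scale. Human-annotated datasets are $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ from $\mathcal{X} \times \mathcal{Y}$. A rubric $R$ is compiled via $\mathcal{C}$ into bundle $B = \mathcal{C}(R)$. The judge is a frozen black-box LLM $f_\theta$ with stochastic generation $\hat{y} \sim f_\theta(\cdot \mid x, B)$.
Reliability is subject to two key constraints:
- Stochastic Invariance: Minimization of output variance due to sampling noise: $\min \operatorname{Var}_{\hat{y} \sim f_\theta(\cdot \mid x, B)}[\hat{y}]$
- Evidence Support: All predicted scores $\hat{y}_k$ must be anchored to extractive evidence $E_k$: every quote in $E_k$ is an exact substring of some atomic unit $u_j$ of $x$.
Criteria-transfer optimization seeks mappings $\mathcal{C}$ and $g$ (calibration) to maximize agreement metric $A$ (e.g., Quadratic Weighted Kappa), subject to both invariance and evidence anchoring: $\max_{\mathcal{C},\, g} A\big(\{g(\hat{y}_i)\}_{i=1}^{N}, \{y_i\}_{i=1}^{N}\big)$, where $\hat{y}_i \sim f_\theta(\cdot \mid x_i, \mathcal{C}(R))$.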
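To make the agreement metric concrete, the following is a minimal sketch of Quadratic Weighted Kappa for integer scores on a fixed scale. It is an illustration of the standard metric, not code from RULERS; the function name and the handling of the degenerate constant-score case are choices made here.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_levels = max_score - min_score + 1
    if n_levels < 2:
        raise ValueError("the scale needs at least two levels")
    if any(not (min_score <= s <= max_score) for s in list(human) + list(model)):
        raise ValueError("all scores must lie on the scale")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    hist_h = Counter(human)                # marginal counts of human scores
    hist_m = Counter(model)                # marginal counts of model scores

    disagreement = 0.0  # weighted disagreement actually observed
    chance = 0.0        # weighted disagreement expected if raters were independent
    for a in range(min_score, max_score + 1):
        for b in range(min_score, max_score + 1):
            weight = (a - b) ** 2 / (n_levels - 1) ** 2
            disagreement += weight * observed[(a, b)] / n
            chance += weight * hist_h[a] * hist_m[b] / (n * n)

    if chance == 0.0:
        # Both raters gave the same single constant score. Treated as perfect
        # agreement here; this is a convention chosen for this sketch.
        return 1.0
    return 1.0 - disagreement / chance
```

Identical score lists give 1.0, and large disagreements are penalized quadratically in the score distance, which is why the metric suits ordinal scales such as essay grades.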
2. Rubric Compilation: Executable and Locked Specifications
RULERS introduces an offline compiler that parses natural language rubrics into JSON-formatted, versioned, immutable specification bundles. Bundles capture:
- Taxonomy: List of scoring dimensions.
- Checklist: Per-item prompts and corresponding discrete scales.
- Evidence Rules: Minimum extractive quotes per trait and exact substring requirements.
Example bundle schema:

    {
      "taxonomy": ["Content", "Organization", …],
      "checklist": [
        {"id": "C01", "dimension": "Content", "prompt": "…", "scale": [0, 1, 2]},
        // ...
      ],
      "evidence_rules": {"min_quotes_per_dim": m}
    }

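As an illustration of how such a bundle might be sanity-checked before it is locked, here is a minimal sketch. The consistency rules (unique item ids, every dimension covered, strictly increasing scales) and the contents of `EXAMPLE_BUNDLE` are assumptions made for this example; they are not taken from the RULERS compiler.

```python
# Hypothetical filled-in bundle; prompts and the second item are invented.
EXAMPLE_BUNDLE = {
    "taxonomy": ["Content", "Organization"],
    "checklist": [
        {"id": "C01", "dimension": "Content",
         "prompt": "States a clear main claim.", "scale": [0, 1, 2]},
        {"id": "O01", "dimension": "Organization",
         "prompt": "Paragraphs follow a logical order.", "scale": [0, 1, 2]},
    ],
    "evidence_rules": {"min_quotes_per_dim": 1},
}


def validate_bundle(bundle):
    """Return a list of problems; an empty list means the bundle is consistent."""
    problems = []
    taxonomy = bundle.get("taxonomy", [])
    checklist = bundle.get("checklist", [])
    if not taxonomy:
        problems.append("taxonomy is empty")

    seen_ids = set()
    for item in checklist:
        item_id = item.get("id")
        if item_id in seen_ids:
            problems.append(f"duplicate checklist id: {item_id}")
        seen_ids.add(item_id)
        if item.get("dimension") not in taxonomy:
            problems.append(f"{item_id}: unknown dimension {item.get('dimension')!r}")
        scale = item.get("scale", [])
        # Assumed rule: a scale lists at least two distinct levels in increasing order.
        if len(scale) < 2 or list(scale) != sorted(set(scale)):
            problems.append(f"{item_id}: scale must be strictly increasing with >= 2 levels")

    covered = {item.get("dimension") for item in checklist}
    for dimension in taxonomy:
        if dimension not in covered:
            problems.append(f"dimension {dimension!r} has no checklist item")

    min_quotes = bundle.get("evidence_rules", {}).get("min_quotes_per_dim")
    if isinstance(min_quotes, bool) or not isinstance(min_quotes, int) or min_quotes < 0:
        problems.append("evidence_rules.min_quotes_per_dim must be a non-negative integer")
    return problems
```

Returning a list of problems rather than raising on the first one lets a rubric author fix all inconsistencies in a single pass.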
3. Rubric Locking and Invocation Protocol
Once generated, a bundle $B$ is locked via a cryptographic hash $h_B = \mathrm{SHA256}(\mathrm{serialize}(B))$, assuring versioned immutability. At inference time the judge is always invoked with the original bundle hash, eliminating prompt drift and configuration leakage.
Pseudocode summarizing compilation and locking:

    function CompileAndLock(rubric_text):
        B = parse_and_structurize(rubric_text)
        hash_B = SHA256(serialize(B))
        store_immutable(B, hash_B)
        return (B, hash_B)

    function InvokeJudge(x, hash_B):
        B = retrieve_bundle(hash_B)   # immutable
        return f_θ(x, B)

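The locking step can be sketched in Python with the standard library. This is a minimal illustration under stated assumptions: the canonical JSON serialization, the in-memory `_BUNDLE_STORE`, and the function names are hypothetical, and the rubric-parsing step and the judge call are not shown.

```python
import hashlib
import json

# Hypothetical in-memory store: hex digest -> canonical serialized bundle.
_BUNDLE_STORE = {}


def canonical_bytes(bundle):
    """Serialize so that logically equal bundles yield identical bytes."""
    text = json.dumps(bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def lock_bundle(bundle):
    """Store the serialized bundle under its SHA-256 digest and return the digest."""
    payload = canonical_bytes(bundle)
    digest = hashlib.sha256(payload).hexdigest()
    _BUNDLE_STORE.setdefault(digest, payload)  # never overwrite an existing entry
    return digest


def retrieve_bundle(digest):
    """Load a bundle by digest, re-checking the hash before returning it."""
    payload = _BUNDLE_STORE[digest]
    if hashlib.sha256(payload).hexdigest() != digest:
        raise ValueError("stored bundle does not match its hash")
    return json.loads(payload)
```

Sorting keys and fixing separators before hashing matters: without a canonical form, two serializations of the same rubric could produce different hashes and be treated as different versions.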
4. Structured Decoding and Deterministic Evidence Verification
Execution wraps LLM decoding in a deterministic schema that strictly enforces:
- JSON output restricted to the keys {"decisions", "evidence", "justification"}
- Decision fields matching the checklist items
- A minimum number of quotes per trait, with evidence explicitly cited by atomic unit ID
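A structural check of this kind could look like the sketch below. It assumes that "decisions" is an object mapping each checklist id to a value on that item's scale and that the three top-level keys must all be present; both are assumptions made for illustration, as is the function name.

```python
import json

ALLOWED_KEYS = {"decisions", "evidence", "justification"}


def check_judge_output(raw_text, bundle):
    """Return a list of schema violations in the judge's raw JSON output."""
    try:
        output = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(output, dict):
        return ["top-level value must be a JSON object"]

    errors = []
    if set(output) != ALLOWED_KEYS:
        errors.append("top-level keys must be exactly: " + ", ".join(sorted(ALLOWED_KEYS)))

    decisions = output.get("decisions")
    if not isinstance(decisions, dict):
        return errors + ["'decisions' must be an object keyed by checklist id"]

    # Every checklist item needs exactly one decision, and it must be on the scale.
    scales = {item["id"]: item["scale"] for item in bundle["checklist"]}
    for item_id, scale in scales.items():
        if item_id not in decisions:
            errors.append(f"missing decision for {item_id}")
        elif decisions[item_id] not in scale:
            errors.append(f"{item_id}: value {decisions[item_id]!r} is not on the scale {scale}")
    for item_id in decisions:
        if item_id not in scales:
            errors.append(f"decision for unknown checklist id {item_id}")
    return errors
```

An output that fails this check can be rejected or regenerated before any score is computed, so malformed responses never reach the scoring step.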
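Each dimension's score is computed from the checklist decisions for that dimension.
If the evidence quota for a dimension (`min_quotes_per_dim`) is not met, the "evidence gate" is applied to that dimension's score.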
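Evidence verification applies a deterministic function to each quote $q$ cited against the atomic unit with ID $j$: $V(q, j) = \mathbb{1}[\,q \text{ is an exact substring of } u_j\,]$, where $u_j$ is the text of that atomic unit.
Acceptance requires $V(q, j) = 1$, strictly preventing hallucinated or imprecise references.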
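The exact-substring check and the quota count can be sketched as follows. The evidence layout (a mapping from dimension to a list of `{"unit_id", "quote"}` entries) and all function names are assumptions for this example. The sketch only reports whether each dimension meets its quota and does not model what the gate does to the score.

```python
def verify_quote(quote, unit_id, units):
    """Return 1 if quote is a non-empty exact substring of the cited unit, else 0."""
    text = units.get(unit_id)
    return int(bool(quote) and text is not None and quote in text)


def count_verified(evidence, units):
    """Count verified quotes per dimension.

    evidence: {dimension: [{"unit_id": ..., "quote": ...}, ...]}  (assumed layout)
    units:    {unit_id: text of that atomic unit}
    """
    return {
        dimension: sum(verify_quote(e["quote"], e["unit_id"], units) for e in entries)
        for dimension, entries in evidence.items()
    }


def quota_met(evidence, units, dimensions, min_quotes):
    """Map each dimension to True if it has at least min_quotes verified quotes."""
    counts = count_verified(evidence, units)
    return {dimension: counts.get(dimension, 0) >= min_quotes for dimension in dimensions}
```

Because the check is plain string containment against the cited unit, a paraphrased quote, a quote with altered whitespace, or a quote attributed to the wrong unit all fail verification.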
5. Lightweight Post-hoc Calibration: Wasserstein Generative Regression
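Raw model scores and auxiliary features $\phi(x)$ are first projected into a latent score $z = w^\top \phi(x)$ using ridge regression. RULERS then learns a transport map $g$ that minimizes the Wasserstein distance between model and human score distributions: $g = \arg\min_{g'} W\big(P_{g'(z)}, P_{y}\big)$, where $P_{g'(z)}$ is the distribution of calibrated model scores and $P_y$ is the distribution of human scores,
with mapping: $g(z) = F_{\text{human}}^{-1}\big(F_{\text{model}}(z)\big)$, where $F_{\text{model}}$ and $F_{\text{human}}$ are the empirical CDFs of the latent and human scores.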
Calibration workflow:

    Input: features {φ(x_i)}_{i=1..N_calib}, human scores {y_i}_{i=1..N_calib}
    1. Fit ridge regression (against y_i): z_i = w^T φ(x_i)
    2. Compute empirical CDFs: F_model (over z_i), F_human (over y_i)
    3. Define g(z) = F_human^{-1}(F_model(z))
    Return g

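The workflow can be sketched in pure Python. This is a simplified illustration, not the RULERS implementation: it assumes a single scalar feature (the raw model score), fits the ridge regression against the human scores, and uses the generalized inverse of the empirical human CDF. Function names and the toy data are hypothetical.

```python
from bisect import bisect_right


def fit_ridge_1d(features, targets, lam=1.0):
    """Ridge regression for one scalar feature with an unpenalized intercept."""
    n = len(features)
    if n == 0 or n != len(targets):
        raise ValueError("features and targets must be non-empty and equal length")
    x_mean = sum(features) / n
    y_mean = sum(targets) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(features, targets))
    s_xx = sum((x - x_mean) ** 2 for x in features)
    if s_xx + lam == 0:
        raise ValueError("feature is constant and lam is 0; slope is undefined")
    w = s_xy / (s_xx + lam)
    b = y_mean - w * x_mean
    return lambda x: w * x + b


def fit_quantile_map(latent, human):
    """Build g(z) = F_human^{-1}(F_model(z)) from calibration samples."""
    z_sorted = sorted(latent)
    y_sorted = sorted(human)
    n_z, n_y = len(z_sorted), len(y_sorted)
    if n_z == 0 or n_y == 0:
        raise ValueError("calibration samples must be non-empty")

    def g(z):
        rank = bisect_right(z_sorted, z)         # equals n_z * F_model(z)
        idx = (rank * n_y + n_z - 1) // n_z - 1  # ceil(F_model(z) * n_y) - 1, in integers
        return y_sorted[min(max(idx, 0), n_y - 1)]

    return g


# Toy calibration set (invented numbers, for illustration only).
raw_scores = [2.0, 2.5, 3.0, 3.5]
human_scores = [1, 2, 4, 6]
project = fit_ridge_1d(raw_scores, human_scores, lam=1.0)
latent = [project(x) for x in raw_scores]
g = fit_quantile_map(latent, human_scores)
calibrated = [g(z) for z in latent]
```

On the calibration set itself the mapped scores reproduce the human score distribution, which is the point of the transport step: the model's ranking of instances is kept while its scale is replaced by the human one.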
6. Experimental Results and Comparative Analysis
RULERS is evaluated on ASAP 2.0 (essay, scores 1–6), SummHF (summarization, scores 1–7), and DREsS (EFL essay, scores 3–15) with the backbone models GPT-4o-mini, GPT-4o, Llama-3.1-8B, and Llama-3.1-70B.
Agreement with human scores is measured with Quadratic Weighted Kappa (QWK; higher is better):
| Method | ASAP 2.0 | SummHF | DREsS |
|---|---|---|---|
| DHS | 0.4319 | 0.2599 | 0.3145 |
| MTS | 0.5566 | 0.3219 | 0.3273 |
| AutoScore | 0.4653 | 0.2883 | 0.1991 |
| RULERS | 0.7276 | 0.3367 | 0.5206 |
RULERS is resilient to rubric perturbations (standard, reversed, paraphrased): its QWK fluctuates by at most 1–2%, whereas baselines collapse by up to 25%.
Ablation on GPT-4o-mini shows that every component contributes, with calibration having by far the largest effect:
- Without locking: ASAP QWK = 0.6985 (−4%)
- Without evidence: ASAP QWK = 0.6904 (−5%)
- Without WGR calibration: ASAP QWK = 0.2643 (−64%)
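The percentages above are consistent with relative changes against the full-system ASAP 2.0 score. The short check below assumes that the baseline is the RULERS value of 0.7276 from the results table; the variable names are chosen for this example.

```python
# Assumed baseline: the full RULERS ASAP 2.0 QWK from the results table.
FULL_QWK = 0.7276
ABLATED_QWK = {
    "without locking": 0.6985,
    "without evidence": 0.6904,
    "without WGR calibration": 0.2643,
}


def relative_change_pct(full, ablated):
    """Relative change of the ablated score versus the full system, in percent."""
    return (ablated - full) / full * 100.0


changes = {name: relative_change_pct(FULL_QWK, qwk) for name, qwk in ABLATED_QWK.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Rounded to whole percentages, the three changes come out at −4%, −5%, and −64%, matching the figures quoted in the list.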
Distribution alignment analysis shows that RULERS tracks human score histograms closely, whereas baselines show central-tendency bias (scores bunched toward the middle of the scale) and scale volatility across models. A plausible implication is that executable rubrics and evidence anchoring are more robust than prompt-centric evaluation protocols for automated LLM adjudication.
7. Context, Applications, and Implications
RULERS refines the operational mechanics of rubric-based LLM judgment, providing a principled pathway for criteria transfer without model retraining or manual prompt engineering. This suggests scalable, reproducible benchmarking in automated essay grading, summarization evaluation, and other domains requiring high-fidelity rubric adherence and evidence traceability.
The separation of rubric execution from LLM parameters, presence of robust evidence gating, and scale calibration suggest applications in high-stakes educational assessment, content moderation, and scalable expert evaluation tasks. Results highlight the necessity for deterministic, auditable schema compilation and post-hoc calibration as core requirements for reliable model-based judging, rather than reliance on prompt phrasing alone (Hong et al., 13 Jan 2026).