
RULERS: Rubric Unification & Robust LLM Scoring

Updated 20 January 2026
  • RULERS is a rubric-based evaluation framework that compiles natural language rubrics into immutable, executable JSON bundles for LLM scoring.
  • It employs structured decoding and evidence anchoring to prevent rubric prompt instability and ensure deterministic validation of extracted evidence.
  • The framework uses post-hoc Wasserstein calibration to align model scores with human judgments without retraining, improving reproducibility and robustness.

RULERS (Rubric Unification, Locking, and Evidence-Anchored Robust Scoring) is a compiler-executor framework designed for robust rubric-based evaluation in the "LLM-as-a-Judge" paradigm. It targets three chronic failure modes in LLM scoring: rubric prompt instability, unverifiable reasoning, and scale misalignment with human judgment boundaries. The framework reframes judge alignment as a criteria-transfer problem, transforming natural language rubrics into executable, auditable artifacts for deterministic, evidence-grounded, and scale-consistent LLM scoring. RULERS operates without fine-tuning model parameters, relying instead on structured rubric compilation, evidence anchoring, and post-hoc calibration to achieve high-fidelity, reproducible human alignment (Hong et al., 13 Jan 2026).

1. Formalization and Reliability Conditions

RULERS frames LLM judge alignment as a criteria-transfer optimization task. Let $\mathcal{X}$ denote the space of evaluation instances, each $x \in \mathcal{X}$ decomposed into atomic units $\mathcal{U}_x = \{u_1, \ldots, u_M\}$. Target score vectors are $\mathcal{Y} = \{1,\ldots,S\}^K$ covering $K$ traits, where $S$ is the trait-specific discrete scale. Human-annotated datasets are $\mathcal{D} = \{(x_i, y_i)\}$ drawn from $P_{X,Y}$. A rubric $\mathcal{R}$ is compiled via $\pi$ into a bundle $\mathbb{B} = \pi(\mathcal{R})$. The judge $f_\theta(x, \text{spec}; \epsilon)$ is a frozen black-box LLM with stochastic generation noise $\epsilon$.

Reliability is subject to two key constraints:

  1. Stochastic Invariance: Minimization of output variance due to sampling noise:

$$\min_{\pi} \;\mathbb{E}_{x \sim \mathcal{X}}[\mathrm{Var}_\epsilon(f_\theta(x, \pi(\mathcal{R}); \epsilon))]$$

  2. Evidence Support: All predicted scores $\hat{y}$ must be anchored to extractive evidence $E \subseteq \mathcal{U}_x$:

$$\forall\;\hat y,\;\exists E \subseteq \mathcal{U}_x:\; \mathrm{Support}(\hat y) = E$$

Criteria-transfer optimization seeks mappings $\pi$ and $g$ (calibration) that maximize an agreement metric $A(\cdot, \cdot)$ (e.g., Quadratic Weighted Kappa), subject to both invariance and evidence anchoring:

$$\max_g\; A(g(\hat y), y) \quad \text{s.t. (1), (2)}$$
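The stochastic invariance term can be estimated empirically by scoring each instance several times and averaging the per-instance variance. The following is a minimal sketch under that assumption; the function name and the sample scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance

def expected_score_variance(samples_per_instance):
    """Estimate E_x[Var_eps(f(x))]: mean over instances of the
    population variance across repeated judge runs on that instance."""
    return mean(pvariance(scores) for scores in samples_per_instance)

# Hypothetical repeated scores: three instances, five judge runs each.
runs = [
    [4, 4, 4, 4, 4],  # perfectly stable judge output
    [3, 4, 3, 4, 3],  # mild sampling noise
    [2, 5, 3, 6, 1],  # unstable output
]
print(expected_score_variance(runs))
```

A lower value indicates a judge whose scores depend less on sampling noise for a fixed compiled rubric.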

2. Rubric Compilation: Executable and Locked Specifications

RULERS introduces an offline compiler $\pi$ that parses natural language rubrics $\mathcal{R}$ into JSON-formatted, versioned, immutable specification bundles $\mathbb{B}$. Bundles capture:

  • Taxonomy $\mathcal{T}$: List of scoring dimensions.
  • Checklist $\mathcal{C}$: Per-item prompts and corresponding discrete scales.
  • Evidence Rules: Minimum extractive quotes per trait and exact substring requirements.

Example bundle schema:

{
  "taxonomy": ["Content", "Organization", ...],
  "checklist": [
    {"id":"C01", "dimension":"Content", "prompt":"", "scale":[0,1,2]},
    // ...
  ],
  "evidence_rules": {"min_quotes_per_dim": m}
}
Formal grammar is provided in BNF style to guarantee schema validity:

\begin{align*}
\langle \mathrm{Bundle}\rangle &\to \{\langle \mathrm{Taxonomy}\rangle, \langle \mathrm{Checklist}\rangle, \langle \mathrm{Rules}\rangle\} \\
\langle \mathrm{Taxonomy}\rangle &\to [t_1, \dots, t_K] \\
\langle \mathrm{Checklist}\rangle &\to [c_1, \dots, c_J],\quad c_j \to (\text{id}, \text{dim}, \text{prompt}, \{0,1,2\}) \\
\langle \mathrm{Rules}\rangle &\to (\text{min\_evidence}=m, \ldots)
\end{align*}
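A bundle shaped like the schema above can be checked mechanically before it is used. The sketch below is one possible validator, not the framework's own code: the function name, the error messages, and the sample prompts are assumptions made for illustration.

```python
def validate_bundle(bundle):
    """Return a list of schema violations; an empty list means the bundle passes."""
    errors = []
    taxonomy = bundle.get("taxonomy")
    if not isinstance(taxonomy, list) or not taxonomy:
        errors.append("taxonomy must be a non-empty list")
        taxonomy = []
    checklist = bundle.get("checklist") or []
    if not checklist:
        errors.append("checklist must be a non-empty list")
    seen_ids = set()
    for item in checklist:
        missing = {"id", "dimension", "prompt", "scale"} - item.keys()
        if missing:
            errors.append(f"checklist item missing fields: {sorted(missing)}")
            continue
        if item["id"] in seen_ids:
            errors.append(f"duplicate checklist id: {item['id']}")
        seen_ids.add(item["id"])
        if item["dimension"] not in taxonomy:
            errors.append(f"{item['id']}: unknown dimension {item['dimension']!r}")
        if item["scale"] != [0, 1, 2]:
            errors.append(f"{item['id']}: scale must be [0, 1, 2]")
    min_quotes = bundle.get("evidence_rules", {}).get("min_quotes_per_dim")
    if not isinstance(min_quotes, int) or min_quotes < 0:
        errors.append("evidence_rules.min_quotes_per_dim must be a non-negative integer")
    return errors

# Hypothetical bundle; C02 points at a dimension missing from the taxonomy.
bundle = {
    "taxonomy": ["Content", "Organization"],
    "checklist": [
        {"id": "C01", "dimension": "Content", "prompt": "States a clear thesis?", "scale": [0, 1, 2]},
        {"id": "C02", "dimension": "Style", "prompt": "Uses varied syntax?", "scale": [0, 1, 2]},
    ],
    "evidence_rules": {"min_quotes_per_dim": 1},
}
print(validate_bundle(bundle))  # ["C02: unknown dimension 'Style'"]
```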

3. Rubric Locking and Invocation Protocol

Once generated, bundles $\mathbb{B}$ are locked via a cryptographic hash $h(\mathbb{B})$, ensuring versioned immutability. At inference time, the judge $f_\theta$ is always invoked with the original bundle hash, eliminating prompt drift and configuration leakage.

Pseudocode summarizing compilation and locking:

function CompileAndLock(rubric_text):
    B = parse_and_structurize(rubric_text)
    hash_B = SHA256(serialize(B))
    store_immutable(B, hash_B)
    return (B, hash_B)

function InvokeJudge(x, hash_B):
    B = retrieve_bundle(hash_B)   # immutable
    return f_θ(x, B)
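The locking step in the pseudocode above can be sketched in Python with the standard `hashlib` and `json` modules. This is an illustration under assumptions: the in-memory `_STORE` dictionary stands in for whatever immutable storage a real deployment would use, and canonical serialization with sorted keys is a choice made here so that key order does not change the hash.

```python
import hashlib
import json

_STORE = {}  # hypothetical stand-in for immutable, content-addressed storage

def lock_bundle(bundle):
    """Serialize the bundle canonically, hash it with SHA-256, and store it."""
    payload = json.dumps(bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    _STORE.setdefault(digest, payload)  # never overwrite an existing entry
    return digest

def retrieve_bundle(digest):
    """Load a bundle by hash and re-verify its content before use."""
    payload = _STORE[digest]
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
        raise ValueError("bundle content does not match its hash")
    return json.loads(payload)

rubric_bundle = {"taxonomy": ["Content"], "evidence_rules": {"min_quotes_per_dim": 1}}
bundle_hash = lock_bundle(rubric_bundle)
assert retrieve_bundle(bundle_hash) == rubric_bundle
```

Any edit to the rubric, however small, produces a different hash, so a judge call that records the hash also records exactly which rubric version it ran against.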

4. Structured Decoding and Deterministic Evidence Verification

Execution wraps LLM decoding in a deterministic schema $\Omega(\mathbb{B})$ that strictly enforces:

  • JSON output restricted to {"decisions", "evidence", "justification"}
  • $J$ decision fields $d_1,\ldots,d_J \in \{0,1,2\}$ matching the checklist
  • Minimum $m$ quotes per trait, evidence explicitly cited by atomic unit ID
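The structural constraints listed above can be checked after decoding as well as enforced during it. The sketch below assumes a particular layout for the judge output (decisions keyed by checklist ID, evidence as a list of objects with `unit_id` and `quote` fields); that layout and the function name are hypothetical, since the source only names the three top-level keys.

```python
ALLOWED_KEYS = {"decisions", "evidence", "justification"}

def check_output(output, checklist_ids):
    """Return a list of violations of the structured-output constraints."""
    if set(output) != ALLOWED_KEYS:
        return [f"top-level keys must be exactly {sorted(ALLOWED_KEYS)}"]
    errors = []
    decisions = output["decisions"]
    if set(decisions) != set(checklist_ids):
        errors.append("decisions must cover every checklist item exactly once")
    for cid, d in decisions.items():
        if type(d) is not int or d not in (0, 1, 2):
            errors.append(f"{cid}: decision {d!r} is not in {{0, 1, 2}}")
    for entry in output["evidence"]:
        if not {"unit_id", "quote"} <= entry.keys():
            errors.append("each evidence entry needs unit_id and quote")
    return errors

# Hypothetical judge output with one out-of-range decision.
judge_output = {
    "decisions": {"C01": 2, "C02": 3},
    "evidence": [{"unit_id": "u1", "quote": "recycling saves energy"}],
    "justification": "Thesis is explicit; organization is weak.",
}
print(check_output(judge_output, ["C01", "C02"]))  # ['C02: decision 3 is not in {0, 1, 2}']
```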

Scoring per dimension $k$ follows:

$$\mu_k = \frac{1}{J} \sum_{j \in \dim k} d_j, \qquad s_k = \mathrm{Clamp}_{[1,S]}(\mathrm{Round}(1 + (S-1)\cdot \mu_k))$$

If the evidence count falls short of the quota, $|E_k| < m$, then $s_k \leftarrow \min(s_k, \tau-1)$ (the "evidence gate").
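As a worked illustration, the scoring rule and evidence gate can be written as a small function. This sketch takes the formula literally as stated here and makes two assumptions: the normalizer is read as the number of checklist items in the dimension, and `tau` is passed in as a plain integer threshold because its value is not specified in this summary.

```python
def dimension_score(decisions, S, n_quotes, m, tau):
    """Aggregate checklist decisions (each 0, 1, or 2) into a score on 1..S,
    then cap the score at tau - 1 when fewer than m quotes were accepted."""
    mu = sum(decisions) / len(decisions)
    # Note: Python's round() rounds exact halves to the nearest even integer.
    raw = round(1 + (S - 1) * mu)
    score = max(1, min(S, raw))  # Clamp to [1, S]
    if n_quotes < m:             # evidence gate
        score = min(score, tau - 1)
    return score

# Hypothetical inputs on a 1-6 scale with quota m = 2 and threshold tau = 3.
print(dimension_score([1, 0, 1, 0], S=6, n_quotes=2, m=2, tau=3))  # 4
print(dimension_score([1, 0, 1, 0], S=6, n_quotes=1, m=2, tau=3))  # 2
```

In the second call the same decisions yield a lower score because only one quote was accepted, so the gate caps the result at `tau - 1`.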

Evidence verification applies a deterministic function:

$$V(q, u) = \begin{cases} 1 & \text{if } q \text{ is an exact substring of } u \\ 0 & \text{otherwise} \end{cases}$$

Acceptance requires $\sum_u V(q, u) = 1$, that is, the quote must appear verbatim in exactly one atomic unit, which strictly prevents hallucinated or imprecise references.
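The verification rule is simple enough to express directly. The following sketch assumes atomic units are available as a mapping from unit ID to text; the function name and the sample units are hypothetical.

```python
def verify_quote(quote, units):
    """Accept a quote only if it is an exact substring of exactly one unit.

    units maps unit_id -> unit text. Returns (accepted, matching_unit_ids).
    """
    hits = [uid for uid, text in units.items() if quote and quote in text]
    return len(hits) == 1, hits

# Hypothetical atomic units from one essay.
units = {
    "u1": "The author argues that recycling saves energy.",
    "u2": "However, costs remain high.",
}
print(verify_quote("recycling saves energy", units))      # (True, ['u1'])
print(verify_quote("recycling conserves energy", units))  # (False, [])
```

The second quote is a reasonable paraphrase but is rejected, which is the intended behavior: only verbatim text counts as evidence.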

5. Lightweight Post-hoc Calibration: Wasserstein Generative Regression

Raw model scores $s$ and auxiliary features $\phi(x)$ are first projected into a latent score $z$ using ridge regression. RULERS then learns a transport map $g$ that minimizes the Wasserstein distance between model and human score distributions:

$$\min_g W(F_\mathrm{model}, F_\mathrm{human})$$

with mapping:

$$g(z) = F_\mathrm{human}^{-1}(F_\mathrm{model}(z))$$

Calibration workflow:

Input: {z_i=φ(x_i)}_{i=1..N_calib}, {y_i}_{i=1..N_calib}
1. Fit ridge regression:  z_i ← w^T φ(x_i)
2. Compute empirical CDFs: F_model, F_human
3. Define g(z)=F_human^{-1}(F_model(z))
Return g
This post-hoc calibration aligns the LLM output scale to human reference without any model parameter updates.
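Steps 2 and 3 of the workflow amount to empirical quantile mapping. The sketch below shows that mapping only, assuming the latent scores `z` have already been produced by the ridge step; the function name and calibration values are hypothetical, and integer arithmetic is used for the quantile index to avoid floating-point rounding at bin edges.

```python
from bisect import bisect_right

def fit_quantile_map(z_calib, y_calib):
    """Build g(z) = F_human^{-1}(F_model(z)) from empirical CDFs."""
    z_sorted = sorted(z_calib)
    y_sorted = sorted(y_calib)
    n_z, n_y = len(z_sorted), len(y_sorted)

    def g(z):
        k = bisect_right(z_sorted, z)          # k / n_z is F_model(z)
        idx = (k * n_y + n_z - 1) // n_z - 1   # ceil(F_model(z) * n_y) - 1
        return y_sorted[min(max(idx, 0), n_y - 1)]

    return g

# Hypothetical calibration set: latent model scores and human scores.
g = fit_quantile_map([0.1, 0.4, 0.5, 0.9], [2, 3, 3, 5])
print(g(0.1), g(0.45), g(0.9))  # 2 3 5
```

Because `g` is built from sorted samples, it is monotone non-decreasing: it changes where score boundaries fall but never reorders the judged items.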

6. Experimental Results and Comparative Analysis

RULERS is evaluated on ASAP 2.0 (essay, scores 1–6), SummHF (summarization, scores 1–7), and DREsS (EFL essay, scores 3–15), using GPT-4o-mini, GPT-4o, Llama-3.1-8B, and Llama-3.1-70B as backbone models.

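Judging agreement is measured with Quadratic Weighted Kappa (QWK):

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}, \quad w_{i,j} = (i-j)^2$$

where $O_{i,j}$ is the observed count of items scored $i$ by one rater and $j$ by the other, and $E_{i,j}$ is the count expected if the two raters' scores were independent.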
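For reference, QWK can be computed directly from two lists of integer scores. This is a plain-Python sketch of the standard definition, with the expected matrix taken as the outer product of the two score histograms divided by the number of items; the function name is an assumption.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK between two equal-length lists of integer scores."""
    n = max_score - min_score + 1
    total = len(a)
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / total  # expected count under independence
    if den == 0:  # both raters gave the same single score to every item
        return 1.0
    return 1.0 - num / den

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))  # 1.0
print(quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4))  # -1.0
```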

| Method | ASAP 2.0 | SummHF | DREsS |
|---|---|---|---|
| DHS | 0.4319 | 0.2599 | 0.3145 |
| MTS | 0.5566 | 0.3219 | 0.3273 |
| AutoScore | 0.4653 | 0.2883 | 0.1991 |
| RULERS | 0.7276 | 0.3367 | 0.5206 |

RULERS is resilient to rubric perturbations (standard, reversed, paraphrased): its QWK fluctuates by at most 1–2%, whereas baselines collapse by up to 25%.

Ablation on GPT-4o-mini indicates the critical role of all system components:

  • Without locking: ASAP QWK = 0.6985 (−4%)
  • Without evidence: ASAP QWK = 0.6904 (−5%)
  • Without WGR calibration: ASAP QWK = 0.2643 (−64%)
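The percentages in the ablation list are consistent with relative QWK changes against a full-system score of 0.7276, the RULERS ASAP 2.0 figure in the table above. Treating that figure as the ablation baseline is an assumption, since the source does not state the reference value explicitly; the short check below reproduces the arithmetic.

```python
def relative_change_pct(full, ablated):
    """Relative change of the ablated score with respect to the full system, in percent."""
    return (ablated - full) / full * 100

FULL_QWK = 0.7276  # assumed baseline: RULERS on ASAP 2.0
ablations = {
    "without locking": 0.6985,
    "without evidence": 0.6904,
    "without WGR calibration": 0.2643,
}
for name, qwk in ablations.items():
    print(f"{name}: {relative_change_pct(FULL_QWK, qwk):.1f}%")
```

Removing calibration accounts for by far the largest drop, which matches the emphasis the ablation places on scale alignment.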

Distribution alignment analysis shows that RULERS tracks human score histograms closely, while baselines display central-tendency bias and scale volatility across models. A plausible implication is that executable rubrics and evidence anchoring are more robust than prompt-centric evaluation protocols for automated LLM adjudication.

7. Context, Applications, and Implications

RULERS refines the operational mechanics of rubric-based LLM judgment, providing a principled pathway for criteria transfer without model retraining or manual prompt engineering. This suggests scalable, reproducible benchmarking in automated essay grading, summarization evaluation, and other domains requiring high-fidelity rubric adherence and evidence traceability.

The separation of rubric execution from LLM parameters, presence of robust evidence gating, and scale calibration suggest applications in high-stakes educational assessment, content moderation, and scalable expert evaluation tasks. Results highlight the necessity for deterministic, auditable schema compilation and post-hoc calibration as core requirements for reliable model-based judging, rather than reliance on prompt phrasing alone (Hong et al., 13 Jan 2026).
