RULERS: Rubric Unification & Robust LLM Scoring

Updated 20 January 2026

RULERS is a rubric-based evaluation framework that compiles natural language rubrics into immutable, executable JSON bundles for LLM scoring.
It employs structured decoding and evidence anchoring to prevent rubric prompt instability and ensure deterministic validation of extracted evidence.
The framework uses post-hoc Wasserstein calibration to align model scores with human judgments without retraining, improving reproducibility and robustness.

RULERS (Rubric Unification, Locking, and Evidence-Anchored Robust Scoring) is a compiler-executor framework designed for robust rubric-based evaluation in the "LLM-as-a-Judge" paradigm. It targets three chronic failure modes in LLM scoring: rubric prompt instability, unverifiable reasoning, and scale misalignment with human judgment boundaries. The framework reframes judge alignment as a criteria-transfer problem, transforming natural language rubrics into executable and auditable artifacts for deterministic, evidence-grounded, and scale-consistent LLM scoring. RULERS operates entirely without fine-tuning model parameters, relying instead on structured rubric compilation, evidence anchoring, and post-hoc calibration to achieve high-fidelity, reproducible human alignment (Hong et al., 13 Jan 2026).

1. Formalization and Reliability Conditions

RULERS frames LLM judge alignment as a criteria-transfer optimization task. Let $\mathcal{X}$ denote the space of evaluation instances, where each $x \in \mathcal{X}$ is decomposed into atomic units $\mathcal{U}_x = \{u_1, \ldots, u_M\}$. Target score vectors lie in $\mathcal{Y} = \{1,\ldots,S\}^K$, covering $K$ traits, where $S$ is the top of the trait-specific discrete scale. A human-annotated dataset $\mathcal{D} = \{(x_i, y_i)\}$ is drawn from $P_{X,Y}$. A rubric $\mathcal{R}$ is compiled via $\pi$ into a bundle $\mathbb{B} = \pi(\mathcal{R})$. The judge $f_\theta(x, \text{spec}; \epsilon)$ is a frozen black-box LLM whose generation is stochastic through the sampling noise $\epsilon$.

Reliability is subject to two key constraints:

(1) Stochastic Invariance: minimize the output variance due to sampling noise:

$\min_{\pi} \;\mathbb{E}_{x \sim \mathcal{X}}[\mathrm{Var}_\epsilon(f_\theta(x, \pi(\mathcal{R}); \epsilon))]$

(2) Evidence Support: every predicted score $\hat{y}$ must be anchored to extractive evidence $E \subseteq \mathcal{U}_x$:

$\forall\;\hat y,\;\exists E \subseteq \mathcal{U}_x:\; \mathrm{Support}(\hat y) = E$

Criteria-transfer optimization seeks mappings $\pi$ and $g$ (calibration) to maximize agreement metric $A(\cdot, \cdot)$ (e.g., Quadratic Weighted Kappa), subject to both invariance and evidence anchoring:

$\max_{\pi,\,g}\; A(g(\hat y), y) \quad \text{s.t. (1), (2)}$

2. Rubric Compilation: Executable and Locked Specifications

RULERS introduces an offline compiler $\pi$ that parses natural language rubrics $\mathcal{R}$ into JSON-formatted, versioned, immutable specification bundles $\mathbb{B}$. Bundles capture:

Taxonomy $\mathcal{T}$: List of scoring dimensions.
Checklist $\mathcal{C}$: Per-item prompts and corresponding discrete scales.
Evidence Rules: Minimum extractive quotes per trait and exact substring requirements.

Example bundle schema:

{
  "taxonomy": ["Content","Organization",…],
  "checklist": [
    {"id":"C01", "dimension":"Content", "prompt":"…", "scale":[0,1,2]},
    // ...
  ],
  "evidence_rules": {"min_quotes_per_dim": m}
}

Formal grammar is provided in BNF style to guarantee schema validity:

\begin{align*}
\langle \mathrm{Bundle}\rangle &\to \{\langle \mathrm{Taxonomy}\rangle,\ \langle \mathrm{Checklist}\rangle,\ \langle \mathrm{Rules}\rangle\} \\
\langle \mathrm{Taxonomy}\rangle &\to [t_1, \dots, t_K] \\
\langle \mathrm{Checklist}\rangle &\to [c_1, \dots, c_J],\quad c_j \to (\text{id}, \text{dim}, \text{prompt}, \{0,1,2\}) \\
\langle \mathrm{Rules}\rangle &\to (\text{min\_evidence}=m, \dots)
\end{align*}
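As an illustration of what schema validity could mean in practice, the following minimal Python sketch checks a bundle dictionary against the fields shown in the example schema. The function name `validate_bundle` and the specific error messages are hypothetical and not taken from the paper; the checks simply mirror the grammar (taxonomy list, checklist items with a {0,1,2} scale, and a minimum-evidence rule).

```python
from typing import Any

ALLOWED_SCALE = [0, 1, 2]  # decision scale used by the grammar above


def validate_bundle(bundle: dict[str, Any]) -> list[str]:
    """Return a list of schema violations; an empty list means the bundle is valid."""
    errors = [f"missing key: {key}"
              for key in ("taxonomy", "checklist", "evidence_rules")
              if key not in bundle]
    if errors:
        return errors

    taxonomy = bundle["taxonomy"]
    if not taxonomy or not all(isinstance(t, str) for t in taxonomy):
        errors.append("taxonomy must be a non-empty list of strings")

    seen_ids = set()
    for item in bundle["checklist"]:
        missing = {"id", "dimension", "prompt", "scale"} - set(item)
        if missing:
            errors.append(f"checklist item missing fields: {sorted(missing)}")
            continue
        if item["id"] in seen_ids:
            errors.append(f"duplicate checklist id: {item['id']}")
        seen_ids.add(item["id"])
        if item["dimension"] not in taxonomy:
            errors.append(f"{item['id']}: unknown dimension {item['dimension']!r}")
        if item["scale"] != ALLOWED_SCALE:
            errors.append(f"{item['id']}: scale must be {ALLOWED_SCALE}")

    # Key name follows the example schema ("min_quotes_per_dim").
    m = bundle["evidence_rules"].get("min_quotes_per_dim")
    if not isinstance(m, int) or m < 0:
        errors.append("min_quotes_per_dim must be a non-negative integer")
    return errors
```

Running such a check at compile time, before the bundle is locked, would keep malformed rubrics from ever reaching the judge.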

3. Rubric Locking and Invocation Protocol

Once generated, bundles $\mathbb{B}$ are locked via a cryptographic hash $h(\mathbb{B})$, which guarantees versioned immutability. At inference time, the judge $f_\theta$ is always invoked with the bundle retrieved by its original hash, eliminating prompt drift and configuration leakage.

Pseudocode summarizing compilation and locking:

function CompileAndLock(rubric_text):
    B = parse_and_structurize(rubric_text)
    hash_B = SHA256(serialize(B))
    store_immutable(B, hash_B)
    return (B, hash_B)

function InvokeJudge(x, hash_B):
    B = retrieve_bundle(hash_B)   # immutable
    return f_θ(x, B)
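The locking step of the pseudocode above can be sketched in runnable Python as follows. This is a minimal illustration under stated assumptions: the in-memory dictionary `_STORE` stands in for whatever immutable storage a real deployment would use, canonical JSON (sorted keys, fixed separators) is one possible choice for `serialize`, and the rubric parsing and the judge call are left out because they depend on the LLM backend. None of these names come from the paper.

```python
import hashlib
import json

_STORE: dict[str, str] = {}  # hypothetical in-memory stand-in for immutable storage


def serialize(bundle: dict) -> str:
    """Canonical JSON: sorted keys and fixed separators so equal bundles hash equally."""
    return json.dumps(bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def lock_bundle(bundle: dict) -> str:
    """Store the serialized bundle under its SHA-256 digest and return the digest."""
    payload = serialize(bundle)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    _STORE.setdefault(digest, payload)  # never overwrite an existing entry
    return digest


def retrieve_bundle(digest: str) -> dict:
    """Load a bundle by hash and re-verify its integrity before use."""
    payload = _STORE[digest]
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
        raise ValueError("stored bundle does not match its hash")
    return json.loads(payload)
```

Canonical serialization matters here: without sorted keys, two semantically identical bundles could produce different hashes, and any edit to a prompt, however small, yields a new hash and therefore a new version.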

4. Structured Decoding and Deterministic Evidence Verification

Execution wraps LLM decoding in a deterministic schema $\Omega(\mathbb{B})$ that strictly enforces:

JSON output restricted to {"decisions", "evidence", "justification"}
$J$ decision fields $d_1,\ldots,d_J \in \{0,1,2\}$ matching the checklist
Minimum $m$ quotes per trait, evidence explicitly cited by atomic unit ID

Scoring per dimension $k$ follows:

$\mu_k = \frac{1}{J} \sum_{j \in \dim k} d_j, \qquad s_k = \mathrm{Clamp}_{[1,S]}(\mathrm{Round}(1 + (S-1)\cdot \mu_k))$

If the evidence quota is not met, i.e. $|E_k| < m$, then $s_k \leftarrow \min(s_k, \tau-1)$, which caps the score below a gate threshold $\tau$ ("evidence gate").
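The per-dimension scoring and the evidence gate can be sketched as below. Two assumptions are made and should be read as such: the mean $\mu_k$ is normalized by the maximum attainable sum for the dimension (twice its number of checklist items) so that it falls in $[0,1]$, whereas the formula above writes the normalizer as $J$; and rounding is half-up, since the rounding rule is not specified. The function and parameter names are hypothetical.

```python
import math


def dimension_score(decisions: list[int], S: int, n_quotes: int, m: int, tau: int) -> int:
    """Map checklist decisions in {0,1,2} for one dimension to a score in [1, S]."""
    # Assumption: divide by the maximum attainable sum so that mu lies in [0, 1].
    mu = sum(decisions) / (2 * len(decisions))
    raw = math.floor(1 + (S - 1) * mu + 0.5)  # round half up (assumed rounding rule)
    score = max(1, min(S, raw))               # Clamp to [1, S]
    if n_quotes < m:                          # evidence gate: too few verified quotes
        score = min(score, tau - 1)
    return score
```

For example, on a 1–6 scale, decisions `[1, 1]` give $\mu_k = 0.5$ and a score of 4 when enough evidence is supplied; with no verified quotes and $\tau = 4$, the same decisions are capped at 3.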

Evidence verification applies a deterministic function:

$V(q, u) = \begin{cases} 1 & \text{if } q \text{ is an exact substring of } u \\ 0 & \text{otherwise} \end{cases}$

A quote $q$ is accepted only if $\sum_u V(q, u) = 1$, i.e. it matches exactly one atomic unit, which strictly rules out hallucinated or imprecise references.
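A minimal Python sketch of this verification step is shown below. It assumes atomic units are stored as a mapping from unit ID to text and that each evidence entry carries a `unit_id` and a `quote` field; these names and the rejection of empty quotes are illustrative assumptions rather than details from the paper.

```python
def verify_quote(quote: str, units: dict[str, str]) -> bool:
    """Accept a quote only if it is an exact substring of exactly one atomic unit."""
    if not quote:
        return False  # assumption: empty quotes never count as evidence
    return sum(1 for text in units.values() if quote in text) == 1


def valid_evidence(evidence: list[dict[str, str]],
                   units: dict[str, str]) -> list[dict[str, str]]:
    """Keep citations whose quote passes the check and appears in the unit they cite."""
    return [
        e for e in evidence
        if verify_quote(e["quote"], units) and e["quote"] in units.get(e["unit_id"], "")
    ]
```

Because the check is plain substring matching, it is case- and whitespace-sensitive: a paraphrase or a quote with altered spacing is rejected, and the number of surviving entries per dimension is what the evidence gate compares against $m$.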

5. Lightweight Post-hoc Calibration: Wasserstein Generative Regression

Raw model scores $s$ and auxiliary features $\phi(x)$ are first projected into a latent score $z$ using ridge regression. RULERS then learns a transport map $g$ that minimizes the Wasserstein distance between model and human score distributions:

$\min_g W(F_\mathrm{model}, F_\mathrm{human})$

with mapping:

$g(z) = F_\mathrm{human}^{-1}(F_\mathrm{model}(z))$

Calibration workflow:

Input: {φ(x_i)}_{i=1..N_calib} (raw scores and auxiliary features), {y_i}_{i=1..N_calib} (human scores)
1. Fit ridge regression:  y_i ≈ w^T φ(x_i); set z_i = w^T φ(x_i)
2. Compute empirical CDFs: F_model (over z_i), F_human (over y_i)
3. Define g(z)=F_human^{-1}(F_model(z))
Return g

This post-hoc calibration aligns the LLM output scale to human reference without any model parameter updates.
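The calibration workflow can be illustrated with a small, dependency-free Python sketch. It is deliberately simplified and rests on several assumptions: the feature vector is reduced to a single scalar (the raw score $s$), the ridge fit targets the human scores with an unpenalized intercept, and the inverse CDF is the generalized empirical quantile. The name `fit_calibrator` and the penalty parameter `lam` are hypothetical.

```python
from bisect import bisect_right


def fit_calibrator(s: list[float], y: list[float], lam: float = 1.0):
    """Fit a 1-D ridge projection, then return g = F_human^{-1} o F_model."""
    n = len(s)
    s_bar, y_bar = sum(s) / n, sum(y) / n
    sxy = sum((a - s_bar) * (b - y_bar) for a, b in zip(s, y))
    sxx = sum((a - s_bar) ** 2 for a in s)
    w = sxy / (sxx + lam)            # ridge slope; intercept left unpenalized
    b = y_bar - w * s_bar
    z_sorted = sorted(w * a + b for a in s)   # latent scores on the calibration set
    y_sorted = sorted(y)                      # human scores on the calibration set

    def g(s_new: float) -> float:
        z = w * s_new + b
        rank = bisect_right(z_sorted, z)      # equals n * F_model(z)
        # F_human^{-1}(rank / n): the rank-th smallest human score, clipped to range.
        return y_sorted[min(max(rank, 1), n) - 1]

    return g
```

Applied to its own calibration set, this map reproduces the human score distribution exactly whenever the latent scores are distinct, which is the sense in which quantile mapping aligns the two distributions; only the calibrator is fitted, and the judge model itself is untouched.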

6. Experimental Results and Comparative Analysis

RULERS is evaluated on ASAP 2.0 (essay, scores 1–6), SummHF (summarization, scores 1–7), and DREsS (EFL essay, scores 3–15), using the backbone models GPT-4o-mini, GPT-4o, Llama-3.1-8B, and Llama-3.1-70B.

Judging agreement is measured with Quadratic Weighted Kappa (QWK):

$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}, \quad w_{i,j} = (i-j)^2$
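For concreteness, the QWK formula can be computed as in the following sketch, where $O$ is the observed confusion matrix between human and model scores and $E$ is the expected matrix formed from the outer product of the marginals, scaled to the same total. The function name and the convention of returning 1.0 when the denominator vanishes (both raters constant and identical) are assumptions made for this illustration.

```python
def quadratic_weighted_kappa(human: list[int], model: list[int],
                             min_score: int, max_score: int) -> float:
    """Quadratic Weighted Kappa between two integer score lists."""
    k = max_score - min_score + 1
    n = len(human)
    observed = [[0.0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    row = [sum(observed[i]) for i in range(k)]                        # human marginals
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]   # model marginals

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2                   # quadratic disagreement weight
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n     # expected count under independence
    if den == 0.0:
        return 1.0  # assumption: constant, identical ratings count as full agreement
    return 1.0 - num / den
```

Perfect agreement yields 1.0, while fully reversed scores such as `[1, 2, 3, 4]` against `[4, 3, 2, 1]` yield -1.0.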

Method	ASAP 2.0	SummHF	DREsS
DHS	0.4319	0.2599	0.3145
MTS	0.5566	0.3219	0.3273
AutoScore	0.4653	0.2883	0.1991
RULERS	0.7276	0.3367	0.5206

RULERS is highly resilient to rubric perturbations (standard, reversed, paraphrased): its QWK fluctuates by at most 1–2%, whereas baselines collapse by up to 25%.

Ablation on GPT-4o-mini indicates the critical role of all system components:

Without locking: ASAP QWK = 0.6985 (−4%)
Without evidence: ASAP QWK = 0.6904 (−5%)
Without WGR calibration: ASAP QWK = 0.2643 (−64%)
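The percentages in parentheses appear to be relative changes with respect to the full system. The short check below recomputes them, assuming the reference value is the 0.7276 ASAP 2.0 figure for RULERS reported in the table above; the helper name `relative_change` is hypothetical.

```python
FULL_QWK = 0.7276  # RULERS on ASAP 2.0, taken from the results table (assumed baseline)

ABLATIONS = {
    "without locking": 0.6985,
    "without evidence": 0.6904,
    "without WGR calibration": 0.2643,
}


def relative_change(full: float, ablated: float) -> float:
    """Relative change in percent of the ablated score with respect to the full system."""
    return (ablated - full) / full * 100.0


for name, qwk in ABLATIONS.items():
    print(f"{name}: {relative_change(FULL_QWK, qwk):.1f}%")
```

Rounded to whole percentages, the results match the −4%, −5%, and −64% figures listed above, which shows how much of the agreement with human scores depends on the calibration step.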

Distribution alignment analysis shows RULERS tracks human score histograms closely, while baselines display central tendency and scale volatility across models. A plausible implication is that executable rubrics and evidence anchoring are more robust than prompt-centric evaluation protocols for automated LLM adjudication.

7. Context, Applications, and Implications

RULERS refines the operational mechanics of rubric-based LLM judgment, providing a principled pathway for criteria transfer without model retraining or manual prompt engineering. This suggests scalable, reproducible benchmarking in automated essay grading, summarization evaluation, and other domains requiring high-fidelity rubric adherence and evidence traceability.

The separation of rubric execution from LLM parameters, presence of robust evidence gating, and scale calibration suggest applications in high-stakes educational assessment, content moderation, and scalable expert evaluation tasks. Results highlight the necessity for deterministic, auditable schema compilation and post-hoc calibration as core requirements for reliable model-based judging, rather than reliance on prompt phrasing alone (Hong et al., 13 Jan 2026).


References (1)

RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation (2026)
