- The paper presents a novel methodology where reference outputs act as soft verifiers to bridge the gap between RLVR and preference optimization.
- It introduces prompting protocols, RefEval and RefMatch, that significantly improve evaluation accuracy, yielding gains of up to 8 absolute points on average, with the largest benefits in lower-capacity models.
- The approach enables cost-efficient post-training alignment by reducing reliance on human annotations and enhancing inter-judge agreement.
Reference-Guided Evaluation and Alignment of LLMs in Non-Verifiable Domains
Problem Context and Motivation
Alignment post-training of LLMs in domains without verifiable reward signals (e.g., instruction-following, creative generation) remains challenging, as RLVR approaches require the existence of ground-truth verifiers. The prevalent RLHF and RLAIF pipelines depend on human or AI-generated preference labels, often relying on reference-free reward models. This work systematically investigates whether reference outputs—generated by frontier LLMs or human experts—can act as soft verifiers to anchor LLM-judges in non-verifiable tasks. The hypothesis is that reference-guided supervision closes the methodological gap between RLVR and preference optimization algorithms (e.g., DPO), particularly when direct preference data is sparse or unavailable.
Reference-Guided LLM-Judge Design
The core contribution is the formulation of explicit prompting protocols (RefEval, RefMatch) that instruct LLMs to use reference outputs as decision groundings. Unlike prior reference-augmented prompts, which give the judge little guidance on how to use the reference, RefEval leverages the reference as a factual benchmark and stylistic model, directing the LLM-judge to prefer the candidate response closest to the reference in semantic content, factuality, and instruction adherence. RefMatch acts as a semantic matcher, emphasizing similarity evaluation. Multiple variants and aggregation strategies (multi-reference voting, rule coupling) are developed and tested for robustness.
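As a sketch, a reference-guided judge call might be assembled as follows. The prompt wording and the parsing helper are illustrative assumptions, not the paper's exact RefEval template:

```python
import re

def build_refeval_prompt(instruction, reference, cand_a, cand_b):
    # Assemble a RefEval-style pairwise judging prompt; the wording here is
    # illustrative, not the paper's exact template.
    return (
        "You are an impartial judge. A high-quality reference answer is provided.\n"
        "Use the reference as a factual benchmark and stylistic model, not as the\n"
        "only acceptable answer.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Response A:\n{cand_a}\n\n"
        f"Response B:\n{cand_b}\n\n"
        "Which response is closer to the reference in semantic content, factuality,\n"
        "and instruction adherence? Answer with a single letter: A or B."
    )

def parse_verdict(judge_output):
    # Extract the first standalone "A" or "B" token from the judge's raw output.
    match = re.search(r"\b([AB])\b", judge_output)
    return match.group(1) if match else None
```

The key design point is that the reference enters the prompt as an anchor rather than as a hard answer key, which is what distinguishes a soft verifier from RLVR-style exact matching.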
The evaluation spans five diverse benchmarks (LLMBar-Natural, LLMBar-Adversarial, MTBench, Instrusum, HREF) and 11 open-source LLM judges of varying capability, with references primarily generated by GPT-4o or DeepSeek-V3. Evaluation accuracy is operationalized as agreement with human preference labels in pairwise comparison settings.
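Operationally, this accuracy metric is just the match rate between judge verdicts and human labels over the comparison set; a minimal sketch (function name assumed for illustration):

```python
def pairwise_accuracy(judge_verdicts, human_labels):
    # Fraction of pairwise comparisons where the judge's A/B choice matches the
    # human preference label; unparsed verdicts (None) simply count as misses.
    assert len(judge_verdicts) == len(human_labels)
    hits = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return hits / len(human_labels)
```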
Numerical Results: Evaluation Accuracy and Inter-Judge Consistency
Reference-guided prompting yields strong improvements in evaluation accuracy for LLM judges, with RefEval averaging 79.1% accuracy, outperforming reference-free LLMBar-Base and CoT by 7–8 absolute points and surpassing previous reference-based baselines by 4–5 points. Notably, the gains are most pronounced in weaker models (e.g., Llama-3-8B: +17.4% over baseline), indicating reference grounding is especially beneficial for low-capacity judges. Inter-judge agreement is increased from 76.6% to 81.4%, reducing subjective variance and anchoring judgments on shared reference signals.
Frontier judges also show strong numerical gains when supplied with human-edited references, with accuracy improving even for GPT-4o and competitive models (e.g., RefEval-GPT-4o: 88.4% with an Oracle reference vs. 86.8% with a self-generated reference). Multi-reference voting improves accuracy marginally, but a single strong reference suffices in most cases.
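One plausible reading of multi-reference voting is a simple majority over the verdicts obtained by re-judging the same candidate pair against each reference. The tie-breaking rule below is an assumption for illustration, not specified by the source:

```python
from collections import Counter

def multi_reference_vote(verdicts):
    # Aggregate A/B verdicts produced by re-judging the same candidate pair
    # against each available reference; ties fall back to the first verdict.
    counts = Counter(v for v in verdicts if v in ("A", "B"))
    if not counts:
        return None
    if counts["A"] == counts["B"]:
        return verdicts[0]
    return counts.most_common(1)[0][0]
```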
Reference-Guided Self-Improvement and Alignment Training
The methodology extends the reference-guided evaluation protocols into alignment-tuning pipelines. LLMs are first distilled on high-quality reference outputs via SFT, then preference-optimized with DPO, using the model's own reference-guided judgments for on-policy preference-pair selection. The procedure is implemented on Meta-Llama-3-8B-Instruct and Qwen2.5-7B-SFT using UltraFeedback instructions and references from DeepSeek-V3.
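The pair-selection step can be sketched as follows, with `sample_fn` and `judge_fn` as hypothetical stand-ins for the policy model and the reference-guided judge (not the paper's actual interfaces):

```python
def select_preference_pairs(instructions, references, sample_fn, judge_fn):
    # On-policy preference-pair construction (sketch): draw two responses per
    # instruction from the current policy, then let a reference-guided judge
    # pick the winner. sample_fn and judge_fn are stand-ins for model calls.
    pairs = []
    for inst, ref in zip(instructions, references):
        a, b = sample_fn(inst), sample_fn(inst)
        verdict = judge_fn(inst, ref, a, b)  # expected to return "A" or "B"
        if verdict == "A":
            pairs.append({"prompt": inst, "chosen": a, "rejected": b})
        elif verdict == "B":
            pairs.append({"prompt": inst, "chosen": b, "rejected": a})
    return pairs
```

The resulting (prompt, chosen, rejected) triples are then fed directly into a standard DPO training loop, so no external reward model or human preference labels are required.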
Performance is quantified on AlpacaEval and Arena-Hard benchmarks against strong baselines (preference-optimized with ArmoRM, SimPO-Llama3-8B, reference-free DPO, BERTScore/ROUGE supervised). Reference-guided models achieve 73.1/58.7 (Llama-3-8B) and 70.0/74.1 (Qwen2.5-7B) on AlpacaEval/Arena-Hard, representing average absolute gains of +20.2/+17.1 points over SFT distillation and +5.3/+3.6 over reference-free self-improvement. These results are competitive with or superior to models trained with specialized reward models, without the need for additional preference annotation or external feedback.
Ablation studies confirm that weaker references (GPT-4o-mini, Mistral-Nemo) still provide measurable gains, although improvements scale with reference quality. The approach generalizes across instruction categories, showing maximal gains in coding/math and reasoning tasks, with diminished but still visible improvements in open-ended creative domains depending on model pretraining.
Theoretical and Practical Implications
Empirically, this research demonstrates that reference-grounded supervision can structurally enhance both the evaluation and post-training alignment of LLMs in non-verifiable domains, matching or exceeding the efficacy of reward-model-based preference optimization. Theoretically, this bridges the gap between RLVR and RLHF/RLAIF, suggesting that reference outputs, when properly integrated into LLM-judge decision protocols, provide a robust form of grounding absent explicit verifiable reward signals.
Practically, high-quality reference outputs generated by frontier LLMs are sufficient to bootstrap post-training pipelines without human preference annotation, reducing both cost and annotation bottlenecks. Smaller LLMs can be reliably tuned and evaluated using reference-guided protocols, which is resource-efficient. The approach is robust to variation in reference quality and adaptable to multi-reference ensembles.
Future Directions
The research opens avenues for developing domain-specialized reference-guided reward models, particularly in areas requiring nuanced expertise (e.g., scientific writing, legal reasoning). Further exploration should focus on optimal strategies for reference aggregation, the scalability of reference-guided judges to multimodal and multi-turn tasks, and the interplay between reference diversity and alignment robustness. Additionally, the potential for reference-guided reward models, explicitly incorporating domain knowledge and factuality from curated references, warrants investigation.
Conclusion
Reference outputs, when explicitly incorporated into LLM-judge prompting protocols, significantly improve both evaluation accuracy and alignment post-training in non-verifiable domains. This approach anchors judgment, reduces variance, and enables self-improvement comparable to reward-model-based optimization, without external preference annotation. The methodology has broad implications for scalable, cost-efficient LLM alignment and evaluation, and future work should explore its application in highly specialized domains and in creating novel reference-guided reward models.