Limits to scalable evaluation at the frontier: LLM-as-Judge won't beat twice the data
Abstract: High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods can go in principle. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can reduce the required number of ground-truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier, where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample-size savings achievable in practice are even more modest than our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.
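To make the debiasing setup concrete, below is a minimal simulation sketch of a prediction-powered-inference-style (PPI++) estimator that combines a small pool of ground-truth labels with many cheap judge verdicts. Everything here is an illustrative assumption rather than the paper's experimental setup: the names (`simulate`, `p_model`, `p_judge`), the symmetric judge-noise model, and the sample sizes are all hypothetical. The point is only to show why a judge no more accurate than the evaluated model yields an effective sample-size saving below 2x.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_labeled=500, n_unlabeled=20_000, p_model=0.8, p_judge=0.8):
    """One evaluation run. y indicates whether the evaluated model's answer
    is correct; yhat is the judge's verdict, which matches y with
    probability p_judge (a hypothetical symmetric noise model)."""
    y_lab = rng.binomial(1, p_model, n_labeled)
    y_unlab = rng.binomial(1, p_model, n_unlabeled)
    agree_lab = rng.binomial(1, p_judge, n_labeled)
    agree_unlab = rng.binomial(1, p_judge, n_unlabeled)
    yhat_lab = np.where(agree_lab == 1, y_lab, 1 - y_lab)
    yhat_unlab = np.where(agree_unlab == 1, y_unlab, 1 - y_unlab)

    # Classical estimate: average of the ground-truth labels only.
    classical = y_lab.mean()

    # PPI++-style estimate: tune the power parameter lambda on the
    # labeled split, then fold in judge verdicts on the cheap pool.
    lam = np.cov(y_lab, yhat_lab)[0, 1] / np.var(yhat_lab, ddof=1)
    ppi = classical + lam * (yhat_unlab.mean() - yhat_lab.mean())
    return classical, ppi

# Compare estimator variances over repeated simulated evaluations.
ests = np.array([simulate() for _ in range(2000)])
var_classical, var_ppi = ests.var(axis=0)
print(f"effective sample-size saving: {var_classical / var_ppi:.2f}x")
# With p_judge <= p_model the saving stays below 2x; pushing p_judge
# toward 1.0 lets it exceed 2x, consistent with the paper's bound.
```

Tuning lambda on the labeled split mirrors PPI++; with lambda fixed at 1 (vanilla PPI), a judge this weak can even increase variance relative to using the ground-truth labels alone.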