- The paper introduces JudgeLM, a fine-tuned framework that uses LLMs as scalable judges, achieving over 90% agreement with its GPT-4 teacher judge.
- It leverages a large-scale dataset and novel benchmarks to measure open-ended task performance across multiple parameter scales.
- The study addresses inherent biases through targeted techniques, enhancing evaluation accuracy and computational efficiency.
JudgeLM: Fine-tuned LLMs as Scalable Judges
This paper presents a novel approach that leverages LLMs as scalable judges, addressing the limitations of evaluating LLMs across a variety of open-ended scenarios. The proposed methodology fine-tunes LLMs into JudgeLM, an efficient mechanism for assessing LLM performance using newly designed benchmarks and datasets.
Introduction
The evaluation of LLMs in open-ended contexts lacks comprehensive methods due to the constraints of current benchmarks and metrics. To counter this, the paper introduces JudgeLM, a framework for fine-tuning LLMs to act as scalable and reliable evaluators. This is achieved through the development of a large-scale dataset comprising task seeds, model-generated answers, and judgments sourced from GPT-4. This dataset forms the basis for training JudgeLM to function effectively as a judge.
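Concretely, each training record in such a dataset pairs a task seed with candidate answers and a GPT-4 reference judgment. The sketch below shows one plausible record layout; the field names ("question", "answer_a", etc.) are illustrative assumptions, not the paper's exact schema:

```python
# Minimal sketch of one judge-training record, assuming a flat dict schema.
# Field names are illustrative, not taken from the paper.
sample = {
    "question": "Explain why the sky is blue.",   # task seed
    "answer_a": "Rayleigh scattering of ...",     # answer from model A
    "answer_b": "Because of the ocean ...",       # answer from model B
    "judgment": {                                  # reference judgment from GPT-4
        "score_a": 9,
        "score_b": 2,
        "reasoning": "Answer A gives the correct physical explanation.",
    },
}

# Sanity-check that the record carries all the pieces JudgeLM is trained on.
required = {"question", "answer_a", "answer_b", "judgment"}
assert required <= sample.keys()
```

Fine-tuning then amounts to conditioning the judge model on the question and both answers, with the GPT-4 judgment as the training target.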
Dataset and Benchmark Design
To support the development and assessment of JudgeLM, the authors created an extensive dataset that includes high-quality task seeds and corresponding model-generated outputs. Additionally, judgments from GPT-4 are used as a reference to fine-tune these LLMs. The dataset's scale and quality enable JudgeLM to be trained at multiple parameter scales (7B, 13B, and 33B), providing a robust foundation for benchmarking open-ended tasks. The paper also introduces a novel benchmark tailored to evaluate the performances of these judges.
Addressing Inherent Biases
JudgeLM acknowledges inherent biases that arise during the fine-tuning process, namely position bias, knowledge bias, and format bias. The authors propose a series of techniques to mitigate these issues, such as swap augmentation, reference support, and reference drop. These techniques are demonstrated to significantly improve the performance and objectivity of JudgeLM evaluations.
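The first of these techniques, swap augmentation, can be sketched as a simple data transformation: each training pair is duplicated with the candidate answers in reversed order and the judgment mirrored, so the judge cannot learn to favor a fixed position. The field names below are illustrative assumptions, not the paper's exact schema:

```python
def swap_augment(sample: dict) -> dict:
    """Return a position-swapped copy of a judge-training sample.

    Sketch of swap augmentation: swapping the two candidate answers and
    mirroring their scores counters position bias, because the judge sees
    every comparison in both orders. Field names are illustrative.
    """
    swapped = dict(sample)  # shallow copy; the original sample is untouched
    swapped["answer_a"], swapped["answer_b"] = sample["answer_b"], sample["answer_a"]
    swapped["score_a"], swapped["score_b"] = sample["score_b"], sample["score_a"]
    return swapped

# Usage: train on both the original and the swapped view of each pair.
pair = {"answer_a": "concise answer", "answer_b": "rambling answer",
        "score_a": 9, "score_b": 3}
augmented = [pair, swap_augment(pair)]
```

Reference support and reference drop work analogously at the data level, by attaching an external reference answer to the prompt or randomly omitting it, so the judge neither over-relies on its own parametric knowledge nor on the presence of a reference.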
Methods and Training
The fine-tuning process for creating JudgeLM entails systematically augmenting data, applying reference-based techniques, and optimizing across different scales of LLMs. The study includes detailed training protocols, optimization strategies, and the architectures utilized. The efficiency of JudgeLM is highlighted by its capability to evaluate 5,000 samples in just 3 minutes using 8 A100 GPUs, showcasing both scalability and computational economy.
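A quick back-of-envelope check makes the throughput claim concrete: 5,000 samples in 3 minutes on 8 GPUs works out to roughly 28 judgments per second in aggregate, or about 3.5 per GPU:

```python
# Back-of-envelope throughput from the reported figures (5,000 samples,
# 3 minutes, 8 A100 GPUs). Pure arithmetic, no assumptions beyond the paper.
samples = 5_000
seconds = 3 * 60
gpus = 8

per_second = samples / seconds   # aggregate judgments per second (~27.8)
per_gpu = per_second / gpus      # judgments per second per A100 (~3.5)
```

At that rate, a full day of the same cluster could judge on the order of 2.4 million samples, which is what makes the "scalable judge" framing credible.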
Experimental Results
Extensive experiments reveal that JudgeLM achieves state-of-the-art performance on both the existing PandaLM benchmark and the newly introduced benchmark. Notably, JudgeLM attains an agreement rate with its teacher judge of over 90%, exceeding the highest recorded human-to-human agreement rate of 82% on comparable metrics. This highlights JudgeLM's potential as a reliable standard for LLM evaluation. Furthermore, JudgeLM exhibits extended capabilities: it can judge single answers, multimodal inputs, and multiple answers, and can handle multi-turn interactions.
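The agreement rate underlying these numbers can be sketched as a simple exact-match fraction between the fine-tuned judge's verdicts and the teacher's, under the assumption that each verdict is a discrete label (e.g. "A wins", "B wins", "tie"):

```python
def agreement_rate(judge_verdicts: list, teacher_verdicts: list) -> float:
    """Fraction of items where the judge's verdict matches the teacher's.

    A minimal sketch of an exact-match agreement metric; the paper's precise
    definition may differ in how ties and scores are handled.
    """
    assert len(judge_verdicts) == len(teacher_verdicts)
    matches = sum(j == t for j, t in zip(judge_verdicts, teacher_verdicts))
    return matches / len(judge_verdicts)

# Usage: two of three verdicts agree, so the rate is ~0.667.
rate = agreement_rate(["A", "B", "tie"], ["A", "B", "A"])
```

An agreement rate above 0.90 against GPT-4 on held-out comparisons is what the paper reports for JudgeLM.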
Conclusion
The findings presented in this paper illustrate that fine-tuned LLMs, as embodied by JudgeLM, represent a scalable and efficient mechanism for the evaluation of AI models in open-ended tasks. By addressing core biases and optimizing evaluation precision, JudgeLM establishes a new frontier in computational model assessment, demonstrating superior agreement rates and operational efficiency. Future research may further explore expanding the capabilities of JudgeLM in other domains and refining its bias mitigation strategies to broaden its applicability and enhance its reliability across various contexts.