- The paper introduces JudgeLM, a fine-tuned framework that uses LLMs as scalable judges, achieving over 90% agreement with its GPT-4 teacher judge.
- It leverages a large-scale dataset and novel benchmarks to measure open-ended task performance across multiple parameter scales.
- The study addresses inherent biases through targeted techniques, enhancing evaluation accuracy and computational efficiency.
JudgeLM: Fine-tuned LLMs as Scalable Judges
This paper presents a novel approach that leverages LLMs as scalable judges, addressing the limitations of evaluating LLMs across a variety of open-ended scenarios. The proposed methodology fine-tunes LLMs into JudgeLM, an efficient mechanism for assessing LLM performance using newly designed benchmarks and datasets.
Introduction
The evaluation of LLMs in open-ended contexts lacks comprehensive methods due to the constraints of current benchmarks and metrics. To counter this, the paper introduces JudgeLM, a framework for fine-tuning LLMs to act as scalable and reliable evaluators. This is achieved through the development of a large-scale dataset comprising task seeds, model-generated answers, and judgments sourced from GPT-4. This dataset forms the basis for training JudgeLM to function effectively as a judge.
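Concretely, each training record in such a dataset pairs a task seed with candidate answers and a GPT-4 reference judgment. The sketch below shows one plausible record layout; the field names ("question", "answer_a", etc.) are illustrative assumptions, not the paper's exact schema:

```python
# Minimal sketch of one judge-training record, assuming a flat dict schema.
# Field names are illustrative, not taken from the paper.
sample = {
    "question": "Explain why the sky is blue.",   # task seed
    "answer_a": "Rayleigh scattering of ...",     # answer from model A
    "answer_b": "Because of the ocean ...",       # answer from model B
    "judgment": {                                  # reference judgment from GPT-4
        "score_a": 9,
        "score_b": 2,
        "reasoning": "Answer A gives the correct physical explanation.",
    },
}

# Sanity-check that the record carries all the pieces JudgeLM is trained on.
required = {"question", "answer_a", "answer_b", "judgment"}
assert required <= sample.keys()
```

Fine-tuning then amounts to conditioning the judge model on the question and both answers, with the GPT-4 judgment as the training target.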
Dataset and Benchmark Design
To support the development and assessment of JudgeLM, the authors created an extensive dataset that includes high-quality task seeds and corresponding model-generated outputs. Additionally, judgments from GPT-4 are used as a reference to fine-tune these LLMs. The dataset's scale and quality enable JudgeLM to be trained at multiple parameter scales (7B, 13B, and 33B), providing a robust foundation for benchmarking open-ended tasks. The paper also introduces a novel benchmark tailored to evaluate the performances of these judges.
Addressing Inherent Biases
JudgeLM acknowledges inherent biases that arise during the fine-tuning process, namely position bias, knowledge bias, and format bias. The authors propose a series of techniques to mitigate these issues, such as swap augmentation, reference support, and reference drop. These techniques are demonstrated to significantly improve the performance and objectivity of JudgeLM evaluations.
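The first of these techniques, swap augmentation, can be sketched as a simple data transformation: each training pair is duplicated with the candidate answers in reversed order and the judgment mirrored, so the judge cannot learn to favor a fixed position. The field names below are illustrative assumptions, not the paper's exact schema:

```python
def swap_augment(sample: dict) -> dict:
    """Return a position-swapped copy of a judge-training sample.

    Sketch of swap augmentation: swapping the two candidate answers and
    mirroring their scores counters position bias, because the judge sees
    every comparison in both orders. Field names are illustrative.
    """
    swapped = dict(sample)  # shallow copy; the original sample is untouched
    swapped["answer_a"], swapped["answer_b"] = sample["answer_b"], sample["answer_a"]
    swapped["score_a"], swapped["score_b"] = sample["score_b"], sample["score_a"]
    return swapped

# Usage: train on both the original and the swapped view of each pair.
pair = {"answer_a": "concise answer", "answer_b": "rambling answer",
        "score_a": 9, "score_b": 3}
augmented = [pair, swap_augment(pair)]
```

Reference support and reference drop work analogously at the data level, by attaching an external reference answer to the prompt or randomly omitting it, so the judge neither over-relies on its own parametric knowledge nor on the presence of a reference.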
Methods and Training
The fine-tuning process for creating JudgeLM entails systematically augmenting data, applying reference-based techniques, and optimizing across different scales of LLMs. The study includes detailed training protocols, optimization strategies, and the architectures utilized. The efficiency of JudgeLM is highlighted by its capability to evaluate 5,000 samples in just 3 minutes using 8 A100 GPUs, showcasing both scalability and computational economy.
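A quick back-of-envelope check makes the throughput claim concrete: 5,000 samples in 3 minutes on 8 GPUs works out to roughly 28 judgments per second in aggregate, or about 3.5 per GPU:

```python
# Back-of-envelope throughput from the reported figures (5,000 samples,
# 3 minutes, 8 A100 GPUs). Pure arithmetic, no assumptions beyond the paper.
samples = 5_000
seconds = 3 * 60
gpus = 8

per_second = samples / seconds   # aggregate judgments per second (~27.8)
per_gpu = per_second / gpus      # judgments per second per A100 (~3.5)
```

At that rate, a full day of the same cluster could judge on the order of 2.4 million samples, which is what makes the "scalable judge" framing credible.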
Experimental Results
Extensive experiments reveal that JudgeLM achieves state-of-the-art performance on both the existing PandaLM benchmark and the newly introduced benchmark. Notably, JudgeLM attains an agreement rate with its teacher judge of over 90%, exceeding the highest recorded human-to-human agreement rate of 82% on comparable metrics. This highlights JudgeLM's potential as a reliable standard for LLM evaluation. Furthermore, JudgeLM exhibits extended capabilities: it can judge single answers, multimodal inputs, and multiple answers, and can handle multi-turn interactions.
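The agreement rate underlying these numbers can be sketched as a simple exact-match fraction between the fine-tuned judge's verdicts and the teacher's, under the assumption that each verdict is a discrete label (e.g. "A wins", "B wins", "tie"):

```python
def agreement_rate(judge_verdicts: list, teacher_verdicts: list) -> float:
    """Fraction of items where the judge's verdict matches the teacher's.

    A minimal sketch of an exact-match agreement metric; the paper's precise
    definition may differ in how ties and scores are handled.
    """
    assert len(judge_verdicts) == len(teacher_verdicts)
    matches = sum(j == t for j, t in zip(judge_verdicts, teacher_verdicts))
    return matches / len(judge_verdicts)

# Usage: two of three verdicts agree, so the rate is ~0.667.
rate = agreement_rate(["A", "B", "tie"], ["A", "B", "A"])
```

An agreement rate above 0.90 against GPT-4 on held-out comparisons is what the paper reports for JudgeLM.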
Conclusion
The findings presented in this paper illustrate that fine-tuned LLMs, as embodied by JudgeLM, represent a scalable and efficient mechanism for the evaluation of AI models in open-ended tasks. By addressing core biases and optimizing evaluation precision, JudgeLM establishes a new frontier in computational model assessment, demonstrating superior agreement rates and operational efficiency. Future research may further explore expanding the capabilities of JudgeLM in other domains and refining its bias mitigation strategies to broaden its applicability and enhance its reliability across various contexts.