CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
Abstract: Efficient and accurate evaluation is crucial for the continuous improvement of LLMs. Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce \textbf{CompassJudger-1}, the first open-source \textbf{all-in-one} judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring (rating a single response) and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks as a general LLM would. To assess the evaluation capabilities of different judge models under a unified setting, we have also established \textbf{JudgerBench}, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
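Because CompassJudger-1 is released as an open-weight model, a pairwise judgment like the one described in the abstract can be obtained with a standard Hugging Face inference loop. The sketch below is illustrative only: the model ID and the judge-prompt wording are assumptions (the repository linked above documents the actual released checkpoints and templates), not the paper's official interface.

```python
# Minimal sketch: using an open-weight judge LLM for a two-model comparison.
# MODEL_ID and the prompt wording are assumptions for illustration; consult
# https://github.com/open-compass/CompassJudger for the released models.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "opencompass/CompassJudger-1-7B-Instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "Explain why the sky is blue."
answer_a = "Rayleigh scattering disperses shorter (blue) wavelengths more strongly."
answer_b = "The sky reflects the color of the ocean."

# Hypothetical free-form judge instruction; substitute your own template.
judge_prompt = (
    "Compare the two responses to the question below. State which response "
    "is better and give a brief justification.\n\n"
    f"Question: {question}\n\n"
    f"Response A: {answer_a}\n\n"
    f"Response B: {answer_b}"
)

# Format as a chat turn and generate the judge's verdict.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": judge_prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)
```

The same pattern covers the other judge modes the abstract lists: swapping the comparison prompt for a single-response rubric yields unitary scoring, and asking for structured output (e.g., a JSON verdict) exercises the format-following evaluation mode.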