Validate correlation between GPT-4 ratings and human judgments for chatbot evaluation

Determine whether GPT-4-based evaluation ratings reliably correlate with human judgments when assessing chatbot performance across tasks and datasets, quantify the strength of any such correlations, and characterize the conditions under which they hold.
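A natural starting point for quantifying agreement is to compute standard correlation coefficients over paired ratings of the same responses. The following is a minimal sketch, assuming hypothetical score arrays on a shared scale; the data and variable names are illustrative, not drawn from any published evaluation code.

```python
import numpy as np
from scipy import stats

# Hypothetical paired ratings of the same chatbot responses on a
# shared 1-10 scale; replace with real evaluation data.
gpt4_scores = np.array([8, 6, 9, 4, 7, 5, 8, 3, 6, 9])
human_scores = np.array([7, 6, 8, 5, 7, 4, 9, 3, 5, 8])

# Pearson r: linear association between the raw scores.
pearson_r, pearson_p = stats.pearsonr(gpt4_scores, human_scores)

# Spearman rho: monotonic (rank) agreement, usually more appropriate
# for ordinal quality ratings than Pearson r.
spearman_rho, spearman_p = stats.spearmanr(gpt4_scores, human_scores)

# Kendall tau: rank agreement, more robust to ties and small samples.
kendall_tau, kendall_p = stats.kendalltau(gpt4_scores, human_scores)

print(f"Pearson r:    {pearson_r:.3f} (p={pearson_p:.3f})")
print(f"Spearman rho: {spearman_rho:.3f} (p={spearman_p:.3f})")
print(f"Kendall tau:  {kendall_tau:.3f} (p={kendall_p:.3f})")
```

Rank-based coefficients are the usual choice here because a model judge and human raters need not use the rating scale linearly; only the ordering of responses has to agree.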

Background

Although the paper conducts both GPT-4-based and human evaluations and reports moderate agreement at the system level, the authors explicitly note that GPT-4 ratings of chatbot performance have not yet been shown to correlate reliably with human judgments in general.

This highlights a broader, ongoing question about the validity and robustness of model-based evaluators as proxies for human assessments.
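One condition worth isolating is the level of aggregation: a model-based judge can rank entire systems consistently with humans while agreeing only weakly on individual responses, so instance-level and system-level correlations should be reported separately. Below is a minimal sketch of that distinction, using a hypothetical table of per-response ratings; the column names and values are assumptions for illustration, not taken from the cited paper.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-response ratings for three chatbot systems.
df = pd.DataFrame({
    "system":      ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "gpt4_score":  [8, 7, 9, 5, 6, 4, 7, 6, 8],
    "human_score": [7, 8, 8, 4, 5, 5, 8, 5, 7],
})

# Instance level: agreement on individual responses.
inst_rho, _ = stats.spearmanr(df["gpt4_score"], df["human_score"])

# System level: aggregate to one mean score per system, then correlate
# the resulting system rankings.
per_system = df.groupby("system")[["gpt4_score", "human_score"]].mean()
sys_rho, _ = stats.spearmanr(per_system["gpt4_score"],
                             per_system["human_score"])

print(f"Instance-level Spearman rho: {inst_rho:.3f}")
print(f"System-level Spearman rho:   {sys_rho:.3f}")
```

Moderate system-level agreement of the kind the paper reports does not by itself establish that GPT-4 ratings track human judgments response by response, which is why both levels matter for the validation question above.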

References

While recent work indicates generative models can be effectively employed for system evaluations, the reliability of GPT-4 ratings to assess chatbot performance is, to our knowledge, yet to be proven to correlate with human judgments.

QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314, Dettmers et al., 2023), Subsection "Human Evaluation"