Reliable Evaluation of Large Language Models

Develop reliable evaluation methodologies for large language models that assess both helpfulness and harmlessness, addressing the acknowledged open problem of evaluating such models.

Background

The paper evaluates Safe RLHF across three iterations and notes that assessing LLMs remains difficult. To proceed despite this difficulty, the authors employ two practical methods: fast model-based evaluation using unified reward and cost models trained on balanced preference data, and Elo-style pairwise comparisons of model outputs judged by GPT-4 and humans.
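The Elo-style pairwise comparison can be sketched as a standard Elo rating update over judged matchups between model outputs. This is a minimal illustration, not the paper's implementation: the K-factor, the initial rating of 1000, and the model names are illustrative assumptions.

```python
# Minimal sketch of Elo-style rating updates from pairwise comparisons,
# as when GPT-4 or human annotators judge which of two model outputs is
# better. K-factor, initial rating, and model names are illustrative
# assumptions, not values from the paper.
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32.0):
    """Update ratings in place after one pairwise judgment."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Example: three training iterations judged pairwise on shared prompts.
ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000
judgments = [
    ("model-v3", "model-v1"),  # each tuple: (judged winner, judged loser)
    ("model-v3", "model-v2"),
    ("model-v2", "model-v1"),
]
for winner, loser in judgments:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Aggregating many such judgments across a shared prompt set yields a relative ranking of model iterations without requiring absolute quality scores.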

Because existing benchmarks leave gaps, the authors also construct their own evaluation prompt dataset. This underscores that current standards for evaluating alignment (helpfulness and harmlessness) are inconsistent, and that relying solely on human judgments is costly, motivating the need for more reliable, comprehensive evaluation approaches.

References

"However, evaluating LLMs has consistently been a challenging and unresolved problem."

Safe RLHF: Safe Reinforcement Learning from Human Feedback (arXiv:2310.12773, Dai et al., 2023), Section 4.1, Helpfulness and Harmlessness Evaluation