Self-Evaluation of Large Language Model based on Glass-box Features

Published 7 Mar 2024 in cs.CL (arXiv:2403.04222v2)

Abstract: The proliferation of open-source LLMs underscores the pressing need for effective evaluation methods. Existing work relies primarily on external evaluators, focusing on training and prompting strategies. However, it overlooks a crucial aspect: model-aware glass-box features. In this study, we explore the utility of glass-box features in the self-evaluation scenario, i.e., applying an LLM to evaluate its own output. We investigate various glass-box feature groups and find that the softmax distribution serves as a reliable quality indicator for self-evaluation. Experimental results on public benchmarks validate the feasibility of self-evaluation of LLMs using glass-box features.
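To make the glass-box idea concrete, below is a minimal sketch of how softmax-distribution features might be extracted when a model scores its own output. The feature definitions (mean token log-probability and mean softmax entropy over the response span) and the model name are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: softmax-distribution ("glass-box") features for self-evaluation.
# Assumes a Hugging Face causal LM; the two features below are illustrative
# stand-ins for the paper's feature groups, not its exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any open-source causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def softmax_features(prompt: str, response: str) -> dict:
    """Score `response` under the same model that produced it."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab)

    # Logits at position t predict token t+1; keep only the response span.
    # (The prompt/response boundary is approximated by the prompt's own
    # token length, which can be off by one for some tokenizers.)
    start = prompt_ids.shape[1] - 1
    pred_logits = logits[0, start:-1]   # predictions for response tokens
    targets = full_ids[0, start + 1:]   # the response tokens themselves

    log_probs = torch.log_softmax(pred_logits, dim=-1)
    token_logp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # per-position entropy

    return {
        "mean_logprob": token_logp.mean().item(),  # higher -> more confident
        "mean_entropy": entropy.mean().item(),     # lower  -> more peaked
    }

print(softmax_features("Q: What is 2+2?\nA:", " 4"))
```

In a full self-evaluation pipeline, features like these would be correlated with, or calibrated against, quality judgments, with higher mean log-probability and lower entropy read as signals of higher output quality, under the reading of the abstract sketched above.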
