AutoEval Done Right: Using Synthetic Data for Model Evaluation
Published 9 Mar 2024 in cs.LG, cs.AI, cs.CL, and stat.ME | (2403.07008v2)
Abstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
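To make the idea of combining AI labels with a small human-labeled set concrete, here is a minimal sketch of one standard construction: a prediction-powered style estimate of a mean metric such as accuracy. It is an illustration under assumptions, not the paper's algorithm. The function name `ppi_mean` and the toy data are hypothetical. The estimator averages the AI scores over a large unlabeled pool, then adds the average human-minus-AI gap measured on the paired human-labeled examples, which offsets systematic error in the AI scores.

```python
import statistics


def ppi_mean(human_labels, ai_on_labeled, ai_on_unlabeled):
    """Prediction-powered style estimate of a mean (hypothetical helper).

    human_labels:    human scores on the small labeled set (length n)
    ai_on_labeled:   AI scores on the same n examples, in the same order
    ai_on_unlabeled: AI scores on a larger pool with no human labels
    """
    if len(human_labels) != len(ai_on_labeled):
        raise ValueError("human and AI scores on the labeled set must be paired")
    # Mean of the cheap AI scores over the large pool.
    synthetic_mean = statistics.fmean(ai_on_unlabeled)
    # Correction term: average gap between human and AI scores on paired data.
    gaps = [h - a for h, a in zip(human_labels, ai_on_labeled)]
    correction = statistics.fmean(gaps)
    return synthetic_mean + correction


# Toy data (made up for illustration): 1 = correct, 0 = incorrect.
human = [1, 0, 1, 0]
ai_paired = [1, 1, 1, 0]           # the AI judge over-scores one example
ai_pool = [1, 1, 0, 1, 1, 0, 1, 1]
estimate = ppi_mean(human, ai_paired, ai_pool)
print(estimate)
```

On this toy input the AI-only mean over the pool is 0.75 and the correction term is -0.25, so the combined estimate is 0.5. If the AI scores matched the human scores exactly on the paired set, the correction would be zero and the estimate would rest entirely on the large pool.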