Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Abstract: Benchmarks have emerged as the central approach for evaluating LLMs. The research community typically evaluates a model by its average performance across a benchmark's test prompts, which implicitly assumes that those prompts are a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
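The ranking-sensitivity claim in the abstract can be illustrated with a minimal sketch. All numbers below are hypothetical, not results from the paper: two models score on the same four test prompts, and the ranking under the standard uniform average differs from the ranking under a use-case-specific weighting of those prompts.

```python
import numpy as np

# Hypothetical per-prompt correctness for two models on a 4-prompt benchmark
# (rows: models, columns: test prompts). Illustrative values only.
scores = np.array([
    [1, 1, 0, 0],   # model A
    [0, 1, 1, 1],   # model B
])

# Uniform weights: the standard benchmark average.
uniform = scores.mean(axis=1)           # A = 0.50, B = 0.75 -> B ranks first

# A use-case-specific distribution that emphasizes the first two prompts.
weights = np.array([0.4, 0.4, 0.1, 0.1])
weighted = scores @ weights             # A = 0.80, B = 0.60 -> A ranks first

print(uniform, weighted)
```

Under the uniform average model B wins; under the alternative distribution model A wins, so the leaderboard order depends entirely on the distributional assumption.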