TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
Abstract: Benchmarking is the de facto standard for evaluating LLMs due to its speed, replicability, and low cost. However, recent work has pointed out that the majority of open-source benchmarks available today have been contaminated or leaked into LLMs, meaning that LLMs had access to the test data during pretraining and/or fine-tuning. This raises serious concerns about the validity of benchmarking studies conducted so far and about the future of evaluation using benchmarks. To solve this problem, we propose Private Benchmarking, a solution in which test datasets are kept private and models are evaluated without revealing the test data to the model. We describe various scenarios (depending on the trust placed in model owners or dataset owners) and present solutions that avoid data contamination using private benchmarking. For scenarios where the model weights also need to be kept private, we describe solutions from confidential computing and cryptography that can aid private benchmarking. We build an end-to-end system, TRUCE, that enables such private benchmarking and show that the overheads introduced to protect the model and the benchmark are negligible (in the case of confidential computing) and tractable (when cryptographic security is required). Finally, we discuss solutions to the problem of benchmark dataset auditing, to ensure that private benchmarks are of sufficiently high quality.
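To make the core idea concrete, the following is a minimal Python sketch of the simplest trust scenario (a trusted evaluator holds the private test set, runs the model inside its boundary, and releases only aggregate scores, alongside a hash commitment to the benchmark for later auditing). This is our own illustration under stated assumptions, not the TRUCE implementation; the function names (`commit_to_benchmark`, `private_evaluate`), the toy benchmark, and the stub model are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict, List


def commit_to_benchmark(examples: List[dict]) -> str:
    """Publish a commitment to the private test set for later auditing.

    A Merkle tree over individual examples would allow selective disclosure;
    a single SHA-256 digest over a canonical serialization is the simplest
    commitment and suffices for this sketch.
    """
    canonical = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def private_evaluate(model_fn: Callable[[str], str],
                     examples: List[dict]) -> Dict[str, float]:
    """Score the model on the private test set; release only aggregates.

    `model_fn` stands in for whatever inference interface the trusted
    boundary exposes (a locally loaded model, a TEE-hosted endpoint, or an
    MPC protocol). Individual prompts and references never leave this scope.
    """
    correct = 0
    for ex in examples:
        prediction = model_fn(ex["prompt"])
        correct += int(prediction.strip() == ex["answer"].strip())
    return {"accuracy": correct / len(examples),
            "n_examples": float(len(examples))}


if __name__ == "__main__":
    # Hypothetical private benchmark held by the dataset owner.
    benchmark = [
        {"prompt": "2 + 2 =", "answer": "4"},
        {"prompt": "Capital of France?", "answer": "Paris"},
    ]
    print("benchmark commitment:", commit_to_benchmark(benchmark))

    # Stub standing in for the model owner's LLM.
    def toy_model(prompt: str) -> str:
        return "4" if "2 + 2" in prompt else "Paris"

    print(private_evaluate(toy_model, benchmark))
```

The other scenarios the abstract alludes to replace the trusted evaluator with a confidential-computing enclave or a cryptographic two-party computation, so that neither the model weights nor the test data are revealed in the clear; only the interface above (queries in, aggregate metrics out) is preserved.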