Task Contamination: Language Models May Not Be Few-Shot Anymore

Published 26 Dec 2023 in cs.CL | (2312.16337v1)

Abstract: LLMs offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.


Summary

  • The paper finds that LLMs exhibit higher performance on pre-collection datasets, indicating contamination from training data.
  • Methods such as training data inspection, task example extraction, and membership inference provide further quantitative evidence of contamination on older datasets.
  • Findings imply that instruction-tuned, closed-source models might overestimate few-shot learning capabilities due to underlying data leakage.

Task Contamination: LLMs May Not Be Few-Shot Anymore

Introduction

The concept of few-shot learning has garnered significant attention, particularly with the advent of LLMs that perform strongly with minimal task-specific data. However, the integrity of zero-shot and few-shot evaluations is potentially compromised by task contamination. This paper examines the chronological performance of LLMs, including the GPT-3 series and several recent open-source models, to uncover evidence of task contamination, particularly on datasets released before the LLMs' training data collection dates.

Evidence of Task Contamination

The study begins by comparing the performance of LLMs on datasets released before and after each model's training data collection date. The results show that LLMs perform better on pre-collection datasets than on newer ones, suggesting that examples from the older datasets were included in the training data, thereby contaminating the evaluation (Figure 1).

Figure 1: Percentage of datasets with accuracy higher than the majority baseline, for datasets released before versus after the LLM training data collection date.
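The pre/post comparison behind Figure 1 can be sketched as a simple computation. Everything below (dataset names, release dates, accuracies, and the training cutoff) is hypothetical, standing in for real evaluation results:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EvalResult:
    dataset: str
    released: date            # dataset release date
    accuracy: float           # model accuracy on the dataset
    majority_baseline: float  # accuracy of always predicting the majority class

# Hypothetical evaluation results; real values would come from running the LLM.
results = [
    EvalResult("older_nli", date(2018, 5, 1), 0.71, 0.52),
    EvalResult("older_qa", date(2019, 3, 1), 0.66, 0.50),
    EvalResult("newer_sentiment", date(2022, 6, 1), 0.49, 0.51),
    EvalResult("newer_metaphor", date(2023, 1, 1), 0.55, 0.54),
]

TRAINING_CUTOFF = date(2021, 9, 1)  # assumed training-data collection date

def beats_baseline_rate(rs):
    """Fraction of datasets where the model beats the majority baseline."""
    if not rs:
        return 0.0
    return sum(r.accuracy > r.majority_baseline for r in rs) / len(rs)

pre = [r for r in results if r.released < TRAINING_CUTOFF]
post = [r for r in results if r.released >= TRAINING_CUTOFF]

print(f"pre-cutoff:  {beats_baseline_rate(pre):.0%} beat baseline")
print(f"post-cutoff: {beats_baseline_rate(post):.0%} beat baseline")
```

With the made-up numbers above, every pre-cutoff dataset beats its baseline while only half the post-cutoff ones do, mirroring the pattern the paper reports.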

Task example extraction and membership inference attacks further support these findings. Instruction-tuned models such as the GPT-3 series can be prompted to reproduce task-specific training examples, indicating exposure to task data and thus compromising zero-shot evaluations. Conversely, for classification tasks with no possibility of task contamination, LLMs rarely show statistically significant improvements over simple majority baselines.
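Task example extraction can be checked mechanically by testing whether model-generated "example" instances verbatim match real dataset items. A minimal sketch, where the generated strings and dataset inputs are hypothetical:

```python
def extraction_hits(generated_examples, dataset_inputs):
    """Count generated examples that exactly match real dataset instances
    after light normalization (lowercasing, whitespace collapsing).
    A non-trivial hit count suggests the dataset was seen in training."""
    def normalize(s):
        return " ".join(s.lower().split())
    dataset = {normalize(x) for x in dataset_inputs}
    return sum(normalize(g) in dataset for g in generated_examples)

# Hypothetical model generations vs. actual dataset inputs.
generated = [
    "The movie was a triumph of style over substance.",
    "A fresh, original take on the genre.",
    "a completely new sentence not in the data",
]
dataset = [
    "The movie was a triumph of  style over substance.",
    "A fresh, original take on the genre.",
    "Some other dataset sentence.",
]
print(extraction_hits(generated, dataset))  # prints 2
```

Exact-match counting is deliberately conservative: paraphrased memorization would slip past it, so a positive count is strong evidence of exposure while a zero count is not proof of a clean model.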

Chronological Analysis

Chronological analyses of individual models reveal a consistent pattern of improved performance on older datasets, suggesting data leakage or pre-exposure to task examples during training. Across the LLMs studied, performance on pre-collection datasets is significantly higher, which is consistent with contamination (Figure 2).

Figure 2: Percentage of datasets with accuracy above the majority baseline for each LLM, illustrating task example extraction capabilities.
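Whether a model's accuracy significantly exceeds a majority baseline can be checked per dataset with a one-sided binomial test. A minimal sketch with hypothetical counts (the paper's actual significance tests may differ):

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value for
    observing k or more correct predictions under chance level p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical: 120 correct out of 200 examples; the majority class
# covers 55% of the labels, so a trivial classifier scores 0.55.
n, k, p_majority = 200, 120, 0.55
p_value = binom_sf(k, n, p_majority)
print(f"one-sided p-value = {p_value:.3f}")  # ~0.09, not significant at 0.05
```

Here an accuracy of 0.60 looks like an improvement over the 0.55 baseline, but at this sample size the difference is not significant, which is exactly the situation the paper reports for uncontaminated classification tasks.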

Evaluation Through Membership Inference

A membership inference attack conducted on a semantic parsing task (Spider) provides quantitative support for task contamination. The number of exact matches between model outputs and original dataset instances correlates strongly with model accuracy, further evidencing exposure during training (Figure 3).

Figure 3: Membership inference: Exact match count vs. accuracy for Spider on development set.
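The correlation behind Figure 3 can be reproduced with a plain Pearson coefficient; the per-model exact-match counts and accuracies below are hypothetical placeholders for the paper's measurements:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model numbers: count of generated examples exactly
# matching Spider instances vs. that model's accuracy on the dev set.
exact_matches = [2, 5, 9, 14, 21]
accuracies = [0.12, 0.18, 0.26, 0.33, 0.41]
r = pearson(exact_matches, accuracies)
print(f"Pearson r = {r:.2f}")
```

A strong positive r under this kind of analysis is what links memorized examples to inflated accuracy; correlation alone does not prove causation, which is why the paper pairs it with the other contamination probes.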

Implications and Recommendations

The findings highlight that closed-source models, particularly those trained with instruction fine-tuning or reinforcement learning from human feedback (RLHF), may not reliably serve as baselines in zero- or few-shot settings due to task contamination, so caution is advised when interpreting their results. Publicly releasing training datasets could facilitate a better understanding and resolution of contamination issues. Additionally, further research is needed to ascertain the full extent and impact of task contamination on LLM evaluations.

Conclusion

This investigation into task contamination reveals that for many zero and few-shot evaluations, LLMs might have pre-acquired knowledge of task datasets, skewing performance metrics. It underscores the need to re-evaluate the few-shot paradigm and urges the community to address potential data leaks systematically and transparently. Future work is required to develop methodologies that can effectively distinguish true few-shot learning capabilities from those results influenced by dataset contamination.
