Task Contamination: Language Models May Not Be Few-Shot Anymore

Published 26 Dec 2023 in cs.CL | (2312.16337v1)

Abstract: LLMs offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.


Summary

  • The paper finds that LLMs exhibit higher performance on pre-collection datasets, indicating contamination from training data.
  • Methods such as training data inspection, task example extraction, and membership inference provide further quantitative evidence of contamination on older datasets.
  • Findings imply that instruction-tuned, closed-source models might overestimate few-shot learning capabilities due to underlying data leakage.

Task Contamination: LLMs May Not Be Few-Shot Anymore

Introduction

The concept of few-shot learning has garnered significant attention, particularly with the advent of LLMs that perform strongly with minimal task-specific data. However, the integrity of zero-shot and few-shot evaluations is potentially compromised by task contamination. This paper examines the chronological performance of LLMs, including the GPT-3 series and several recent open-source models, to uncover evidence of task contamination, particularly on datasets released before the LLMs' training data collection dates.

Evidence of Task Contamination

The study begins by comparing the performance of LLMs on datasets released before and after each model's training data collection date. The results show that LLMs perform better on pre-collection datasets than on newer ones, suggesting that examples from the older datasets were included in the training data, thereby contaminating the evaluation (Figure 1).

Figure 1: Percentage of datasets with accuracy higher than the majority baseline, for datasets released before versus after the LLM training data collection date.
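The pre/post comparison behind Figure 1 can be sketched as a simple computation. Everything below (dataset names, release dates, accuracies, and the training cutoff) is hypothetical, standing in for real evaluation results:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EvalResult:
    dataset: str
    released: date            # dataset release date
    accuracy: float           # model accuracy on the dataset
    majority_baseline: float  # accuracy of always predicting the majority class

# Hypothetical evaluation results; real values would come from running the LLM.
results = [
    EvalResult("older_nli", date(2018, 5, 1), 0.71, 0.52),
    EvalResult("older_qa", date(2019, 3, 1), 0.66, 0.50),
    EvalResult("newer_sentiment", date(2022, 6, 1), 0.49, 0.51),
    EvalResult("newer_metaphor", date(2023, 1, 1), 0.55, 0.54),
]

TRAINING_CUTOFF = date(2021, 9, 1)  # assumed training-data collection date

def beats_baseline_rate(rs):
    """Fraction of datasets where the model beats the majority baseline."""
    if not rs:
        return 0.0
    return sum(r.accuracy > r.majority_baseline for r in rs) / len(rs)

pre = [r for r in results if r.released < TRAINING_CUTOFF]
post = [r for r in results if r.released >= TRAINING_CUTOFF]

print(f"pre-cutoff:  {beats_baseline_rate(pre):.0%} beat baseline")
print(f"post-cutoff: {beats_baseline_rate(post):.0%} beat baseline")
```

With the made-up numbers above, every pre-cutoff dataset beats its baseline while only half the post-cutoff ones do, mirroring the pattern the paper reports.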

Task example extraction and membership inference attacks further support these findings. Instruction-tuned models such as the GPT-3 series can be prompted to reproduce task-specific training examples, indicating exposure to task data and thus compromising zero-shot evaluations. Conversely, for classification tasks with no possibility of task contamination, LLMs rarely show statistically significant improvements over simple majority baselines.
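Task example extraction can be checked mechanically by testing whether model-generated "example" instances verbatim match real dataset items. A minimal sketch, where the generated strings and dataset inputs are hypothetical:

```python
def extraction_hits(generated_examples, dataset_inputs):
    """Count generated examples that exactly match real dataset instances
    after light normalization (lowercasing, whitespace collapsing).
    A non-trivial hit count suggests the dataset was seen in training."""
    def normalize(s):
        return " ".join(s.lower().split())
    dataset = {normalize(x) for x in dataset_inputs}
    return sum(normalize(g) in dataset for g in generated_examples)

# Hypothetical model generations vs. actual dataset inputs.
generated = [
    "The movie was a triumph of style over substance.",
    "A fresh, original take on the genre.",
    "a completely new sentence not in the data",
]
dataset = [
    "The movie was a triumph of  style over substance.",
    "A fresh, original take on the genre.",
    "Some other dataset sentence.",
]
print(extraction_hits(generated, dataset))  # prints 2
```

Exact-match counting is deliberately conservative: paraphrased memorization would slip past it, so a positive count is strong evidence of exposure while a zero count is not proof of a clean model.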

Chronological Analysis

Chronological analyses of individual models reveal a consistent pattern of improved performance on older datasets, suggesting data leakage or pre-exposure to task examples during training. Across the LLMs studied, performance on pre-collection datasets is significantly higher, which is consistent with contamination (Figure 2).

Figure 2: Percentage of datasets with accuracy above the majority baseline for each LLM, illustrating task example extraction capabilities.
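Whether a model's accuracy significantly exceeds a majority baseline can be checked per dataset with a one-sided binomial test. A minimal sketch with hypothetical counts (the paper's actual significance tests may differ):

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value for
    observing k or more correct predictions under chance level p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical: 120 correct out of 200 examples; the majority class
# covers 55% of the labels, so a trivial classifier scores 0.55.
n, k, p_majority = 200, 120, 0.55
p_value = binom_sf(k, n, p_majority)
print(f"one-sided p-value = {p_value:.3f}")  # ~0.09, not significant at 0.05
```

Here an accuracy of 0.60 looks like an improvement over the 0.55 baseline, but at this sample size the difference is not significant, which is exactly the situation the paper reports for uncontaminated classification tasks.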

Evaluation Through Membership Inference

A membership inference attack conducted on a semantic parsing task (Spider) provides quantitative support for task contamination. The number of exact matches between model outputs and original dataset instances correlates strongly with model accuracy, further evidencing exposure during training (Figure 3).

Figure 3: Membership inference: Exact match count vs. accuracy for Spider on development set.
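The correlation behind Figure 3 can be reproduced with a plain Pearson coefficient; the per-model exact-match counts and accuracies below are hypothetical placeholders for the paper's measurements:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model numbers: count of generated examples exactly
# matching Spider instances vs. that model's accuracy on the dev set.
exact_matches = [2, 5, 9, 14, 21]
accuracies = [0.12, 0.18, 0.26, 0.33, 0.41]
r = pearson(exact_matches, accuracies)
print(f"Pearson r = {r:.2f}")
```

A strong positive r under this kind of analysis is what links memorized examples to inflated accuracy; correlation alone does not prove causation, which is why the paper pairs it with the other contamination probes.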

Implications and Recommendations

The findings highlight that closed-source models, particularly those trained with instruction fine-tuning or reinforcement learning from human feedback (RLHF), may not reliably serve as baselines in zero- or few-shot settings due to task contamination, so caution is advised when interpreting their results. Publicly releasing training datasets could facilitate a better understanding and resolution of contamination issues. Additionally, further research is needed to ascertain the full extent and impact of task contamination on LLM evaluations.

Conclusion

This investigation into task contamination reveals that for many zero and few-shot evaluations, LLMs might have pre-acquired knowledge of task datasets, skewing performance metrics. It underscores the need to re-evaluate the few-shot paradigm and urges the community to address potential data leaks systematically and transparently. Future work is required to develop methodologies that can effectively distinguish true few-shot learning capabilities from those results influenced by dataset contamination.
