
Pretraining on the Test Set Is All You Need

Published 13 Sep 2023 in cs.CL and cs.AI (arXiv:2309.08632v1)

Abstract: Inspired by recent work demonstrating the promise of smaller Transformer-based LLMs pretrained on carefully curated data, we supercharge such approaches by investing heavily in curating a novel, high quality, non-synthetic data mixture based solely on evaluation benchmarks. Using our novel dataset mixture consisting of less than 100 thousand tokens, we pretrain a 1 million parameter transformer-based LLM phi-CTNL (pronounced "fictional") that achieves perfect results across diverse academic benchmarks, strictly outperforming all known foundation models. phi-CTNL also beats power-law scaling and exhibits a never-before-seen grokking-like ability to accurately predict downstream evaluation benchmarks' canaries.

Citations (21)

Summary

  • The paper demonstrates that pretraining on test benchmarks enables small models like phi-CTNL to rapidly grok data and achieve perfect academic scores.
  • It applies a targeted training regime using fewer than 100K tokens drawn from evaluation datasets such as the AI2 Reasoning Challenge, BoolQ, and SQuAD, emphasizing curated data quality.
  • The study satirically critiques the trend toward larger models, urging rigorous evaluation to mitigate data contamination and overambitious claims.

Introduction

To probe the limits of efficiency in LLM pretraining, the study introduces phi-CTNL, a transformer-based LLM with only 1 million parameters. Standing counter to conventional large-scale pretraining, it shows that a small model can reach top performance when trained on a meticulously curated, high-quality dataset. phi-CTNL strictly outperforms all known foundation models, achieving perfect scores on multiple academic benchmarks and exhibiting a never-before-seen grokking-like ability to predict downstream benchmarks' canaries.

Pretraining Data

To achieve these results, phi-CTNL is pretrained on fewer than 100 thousand tokens drawn from the very evaluation benchmarks it is later tested on, including well-known datasets such as the AI2 Reasoning Challenge, BoolQ, and SQuAD. The paper argues that this targeted pretraining on benchmark data yields far better results than pretraining on a broader mixture, which underpins the model's unprecedented performance.
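The satirical recipe can be made concrete with a small sketch. This is not the authors' code, and the dataset contents are stand-in examples; it merely illustrates what "curating a pretraining mixture based solely on evaluation benchmarks" amounts to in practice.

```python
# Hypothetical sketch of the paper's (satirical) data-curation recipe:
# concatenate benchmark questions together with their gold answers into
# a single "high quality, non-synthetic" pretraining stream.
def build_corpus(benchmarks):
    """Turn evaluation examples into pretraining text.

    Memorizing the answer alongside the question is what guarantees
    perfect test-set accuracy later -- which is the joke of the paper.
    """
    corpus = []
    for name, examples in benchmarks.items():
        for ex in examples:
            corpus.append(f"[{name}] Q: {ex['question']} A: {ex['answer']}")
    return "\n".join(corpus)

# Stand-in benchmark contents, not real dataset entries.
benchmarks = {
    "BoolQ": [{"question": "Is water wet?", "answer": "yes"}],
    "SQuAD": [{"question": "Who wrote Hamlet?", "answer": "Shakespeare"}],
}

corpus = build_corpus(benchmarks)
print(corpus)
```

A corpus built this way easily stays under the paper's 100K-token budget, since it contains nothing but the test set itself.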

Novel Capabilities

The paper discusses two "groundbreaking" characteristics of the phi-CTNL model. First, it learns far faster than the power-law scaling curves that normally relate pretraining loss to compute, since its tiny corpus is quickly memorized. Second, it exhibits a grokking-like ability to predict benchmark canaries. Canaries are not comprehension tests: they are unique marker strings deliberately embedded in evaluation datasets so that a model reproducing them reveals the benchmark leaked into its training data. phi-CTNL's sudden, perfect prediction of these canaries, a property observed in no other model, is therefore direct evidence of what its training data contained.
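The canary mechanism can be illustrated with a minimal, hypothetical check: if a model can complete a benchmark's canary string from its prefix, the benchmark almost certainly appeared in its training data. The canary value and the `model_complete` function below are placeholders, not any real benchmark's canary or any real model API.

```python
# Placeholder canary string; real benchmarks embed a unique GUID-style
# marker for exactly this purpose.
CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"


def model_complete(prefix):
    """Stand-in for a language-model completion call.

    This toy "model" has memorized its training text, mimicking a model
    that was pretrained on the benchmark itself.
    """
    memorized = {"canary GUID": CANARY}
    return memorized.get(prefix, "")


def is_contaminated(prefix="canary GUID"):
    # A model that never saw the benchmark has no way to produce the
    # full canary; an exact completion flags contamination.
    return model_complete(prefix) == CANARY


print(is_contaminated())
```

Under this framing, phi-CTNL's "grokking" of canaries is simply this check returning true: the model predicts the canary because the canary was in its pretraining mixture.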

Discussion

The paper concludes that phi-CTNL, with vastly fewer parameters, not only outshines larger models on academic evaluations but also prompts a rethink of the industry's focus on ever-larger models, with data quality and careful curation emerging as the pivotal factors in pretraining effectiveness. The authors then reveal the twist: the paper is satire, written to push readers to scrutinize ambitious claims in AI research and to take data contamination seriously. The disclaimer underscores the need for rigorous, contamination-aware evaluation of LLMs as benchmark leakage becomes harder to rule out and models continue to grow.

