Insights Derivable from WIMBD n-gram Searches

Determine what specific insights about large language models trained on massive pretraining corpora can be obtained by performing n-gram phrase searches over those corpora using the WIMBD retrieval framework.

Background

The paper discusses the challenge of analyzing very large pretraining corpora and highlights the WIMBD framework, which enables efficient n-gram phrase search over corpora spanning hundreds or thousands of gigabytes of text. While WIMBD provides scalable retrieval capabilities, the authors note that it is not yet clear what insights about models trained on these datasets can be extracted through such searches.
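To make the underlying operation concrete, the following is a minimal illustrative sketch of what an n-gram phrase count over a corpus looks like. This is not WIMBD's actual implementation (which relies on scalable indexing infrastructure to handle terabyte-scale data); the corpus, function name, and tokenization by whitespace are all simplifying assumptions for illustration.

```python
from collections import Counter

def ngram_counts(docs, n):
    """Count all word-level n-grams across a corpus.

    Illustrative sketch only: real systems like WIMBD use scalable
    indexes rather than an in-memory Counter, and proper tokenizers
    rather than whitespace splitting.
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical two-document corpus.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox was quick",
]
bigrams = ngram_counts(corpus, 2)
print(bigrams["the quick"])  # → 2 (the phrase occurs once in each document)
```

A query like `bigrams["the quick"]` is the corpus-level statistic one would relate to model behavior, e.g., comparing a phrase's pretraining frequency against how well a trained model handles inputs containing it.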

This uncertainty motivates the work’s broader goal of connecting pretraining data characteristics to model capabilities, but the authors explicitly acknowledge that the range and nature of insights obtainable specifically via WIMBD-driven n-gram searches remain to be determined.

References

However, it is unclear what insights of the LLMs trained on these datasets can be obtained from such searches.

Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data (2407.14985 - Wang et al., 2024) in Appendix, Section "Related Work"