Insights Derivable from WIMBD n-gram Searches
Determine the specific types of insights about large language models trained on massive pretraining corpora that can be obtained by performing n-gram phrase searches over those corpora using the WIMBD retrieval framework.
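To make the object of the question concrete, here is a minimal sketch of what an n-gram phrase search returns: an occurrence count and the set of matching documents. This is a toy in-memory stand-in, not WIMBD's actual interface; the names `ngrams`, `search_phrase`, and `toy_corpus` are hypothetical, and lowercased whitespace tokenization is an assumption made only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-token window of `tokens` as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def search_phrase(corpus, phrase):
    """Return (total occurrences, ids of documents containing the phrase).

    Assumption: lowercased whitespace tokenization, purely illustrative.
    """
    target = tuple(phrase.lower().split())
    total, doc_ids = 0, []
    for doc_id, text in enumerate(corpus):
        # Count all n-grams of the same length as the query phrase.
        counts = Counter(ngrams(text.lower().split(), len(target)))
        if counts[target]:
            total += counts[target]
            doc_ids.append(doc_id)
    return total, doc_ids


# Hypothetical three-document corpus standing in for pretraining data.
toy_corpus = [
    "the capital of France is Paris",
    "Paris is the capital of France and the capital of fashion",
    "Berlin is the capital of Germany",
]

total, doc_ids = search_phrase(toy_corpus, "the capital of France")
print(total, doc_ids)
```

A search of this kind yields only frequencies and document hits. The open question is what such counts reveal about the behavior of a model trained on the corpus.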
References
However, it is unclear what insights of the LLMs trained on these datasets can be obtained from such searches.
— Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
(2407.14985 - Wang et al., 2024) in Appendix, Section "Related Work"