Efficacy of general-purpose language models for materials science
Ascertain whether training large, general-purpose language models on broad, heterogeneous text corpora such as The Pile or Common Crawl yields generalization that benefits materials science tasks, relative to specialized, domain-focused models.
References
In the context of sustainably applying ML approaches for material discovery, it is questionable if large, general-purpose models, which include a variety of scientific and non-scientific sources, are the right tools: It is yet to be proven that the data points from such diverse fields that are covered by huge datasets such as the Pile or the Common Crawl lead to a generalization that is beneficial for the field of material science.
— Perspective: Towards sustainable exploration of chemical spaces with machine learning
(2604.00069 - Sandonas et al., 31 Mar 2026) in Subsubsection 'Language models as predictive tools', Open challenges