Comparative efficacy of active learning versus random sampling at large scale

Determine how active learning sampling strategies compare with random sampling baselines when generating large-scale datasets of non-equilibrium inorganic bulk structures labeled with density functional theory (DFT), such as OMat24, for training interatomic potential models used in materials discovery.

Background

The OMat24 dataset was constructed using three sampling strategies—rattled Boltzmann sampling, ab initio molecular dynamics, and rattled relaxations—starting from relaxed Alexandria structures to encourage diverse non-equilibrium configurations. These choices emphasize diversity without explicitly adopting active learning.

The paper notes that active learning could further enhance these sampling strategies, but that its advantage over random sampling at very large dataset sizes is not established. This motivates quantifying the benefit, if any, of active learning for large-scale materials datasets.
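To make the comparison concrete, the following is a minimal toy sketch, not the OMat24 pipeline or any method from the paper. It assumes a one-dimensional stand-in for the DFT label (the hypothetical `toy_label`), a small committee of ridge regressors on random cosine features, and committee disagreement as the acquisition score, which is only one of several common active learning heuristics. Both strategies receive the same labeling budget per round, so the two error curves are directly comparable. All function names and parameter values here are illustrative assumptions.

```python
import numpy as np


def toy_label(x):
    # Hypothetical stand-in for an expensive DFT label (not a real energy model).
    return np.sin(3.0 * x) + 0.5 * x**2


def fit_member(x, y, rng, n_features=20, ridge=1e-3):
    # One committee member: ridge regression on its own random cosine features.
    w = rng.normal(0.0, 3.0, n_features)
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    phi = np.cos(np.outer(x, w) + b)
    coef = np.linalg.solve(phi.T @ phi + ridge * np.eye(n_features), phi.T @ y)
    return w, b, coef


def predict(members, x):
    # Committee mean and disagreement (standard deviation across members).
    preds = np.stack([np.cos(np.outer(x, w) + b) @ c for w, b, c in members])
    return preds.mean(axis=0), preds.std(axis=0)


def learning_curve(strategy, pool, test_x, n_init=10, n_rounds=8, batch=5, seed=0):
    # Returns the test RMSE recorded before each acquisition round.
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(pool), size=n_init, replace=False)]
    errors = []
    for _ in range(n_rounds):
        x, y = pool[labeled], toy_label(pool[labeled])
        members = [fit_member(x, y, rng) for _ in range(5)]
        mean, _ = predict(members, test_x)
        errors.append(float(np.sqrt(np.mean((mean - toy_label(test_x)) ** 2))))
        candidates = np.setdiff1d(np.arange(len(pool)), labeled)
        if strategy == "active":
            # Label the candidates on which the committee disagrees most.
            _, std = predict(members, pool[candidates])
            picked = candidates[np.argsort(std)[-batch:]]
        elif strategy == "random":
            # Baseline: label a uniformly random batch of the same size.
            picked = rng.choice(candidates, size=batch, replace=False)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        labeled.extend(int(i) for i in picked)
    return errors


pool = np.linspace(-2.0, 2.0, 400)    # unlabeled candidate configurations
test_x = np.linspace(-2.0, 2.0, 101)  # held-out evaluation points
random_curve = learning_curve("random", pool, test_x)
active_curve = learning_curve("active", pool, test_x)

for k, (r, a) in enumerate(zip(random_curve, active_curve)):
    print(f"labels={10 + 5 * k:3d}  random RMSE={r:.4f}  active RMSE={a:.4f}")
```

Which curve falls faster in this toy setting says nothing about the large-scale case. The open question is precisely whether any such gap persists, shrinks, or reverses when the pool contains many millions of structures and each label is a DFT calculation.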

References

Active learning sampling strategies have the potential to further enhance these approaches but it remains unclear how they compare to random baselines when considering large scale dataset sizes.

Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models (2410.12771 - Barroso-Luque et al., 2024) in Section: OMat24 Dataset; Subsubsection: Crystal structure generation