BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Published 3 May 2025 in cs.LG, cond-mat.mtrl-sci, and cs.AI | (2505.01912v1)

Abstract: Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby ML models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OOD) predictions, ML models often struggle to generalize OOD. Furthermore, there are currently no systematic benchmarks for molecular OOD prediction tasks. We present BOOM, Benchmarks for Out-Of-distribution Molecular property predictions -- a benchmark study of property-based out-of-distribution models for common molecular property prediction models. We evaluate more than 140 combinations of models and property prediction tasks to benchmark deep learning models on their OOD performance. Overall, we do not find any existing models that achieve strong OOD generalization across all tasks: even the top performing model exhibited an average OOD error 3x larger than in-distribution. We find that deep learning models with high inductive bias can perform well on OOD tasks with simple, specific properties. Although chemical foundation models with transfer and in-context learning offer a promising solution for limited training data scenarios, we find that current foundation models do not show strong OOD extrapolation capabilities. We perform extensive ablation experiments to highlight how OOD performance is impacted by data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation. We propose that developing ML models with strong OOD generalization is a new frontier challenge in chemical ML model development. This open-source benchmark will be made available on Github.

Summary

Review of "BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models"

The paper titled "BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models" presents a comprehensive evaluation of existing machine learning models for their ability to predict molecular properties beyond the distribution of their training data, known as out-of-distribution (OOD) generalization. The work introduces the BOOM benchmark to fill a gap in the field: there has been no systematic way to assess models on OOD prediction tasks, even though such tasks are exactly what matters when searching for novel molecules with exceptional properties beyond known chemical boundaries. Herein, I provide a synthesis of the study, discussing its results and implications for the field of chemical machine learning.

Summary of Findings and Methodology

The authors highlight a critical limitation in current molecular discovery paradigms: the lack of models capable of robust OOD predictions. Using BOOM, they evaluate OOD performance across more than 140 model-task combinations, examining both typical machine learning architectures and large-scale chemical foundation models. The study covers ten diverse molecular properties, with datasets sourced primarily from QM9 and the 10K Dataset, focusing on DFT-computed properties such as isotropic polarizability, HOMO energy, and zero-point vibrational energy.
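The property-based OOD splits described above can be illustrated with a short sketch. This is a hypothetical construction consistent with the paper's framing (train on the bulk of a property's distribution, hold out the extreme tails as the OOD test set), not the authors' actual benchmark code, which is in their open-source release.

```python
import numpy as np

def property_extrapolation_split(values, ood_fraction=0.1):
    """Split samples by a target property: the extreme tails of the
    property distribution form the OOD test set, the middle forms
    the in-distribution pool. Returns index arrays (id_idx, ood_idx)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    n_tail = int(len(values) * ood_fraction / 2)
    ood_idx = np.concatenate([order[:n_tail], order[-n_tail:]])
    id_idx = order[n_tail:len(values) - n_tail]
    return id_idx, ood_idx

# Toy property values standing in for, e.g., HOMO energies from QM9.
props = np.random.default_rng(0).normal(size=1000)
id_idx, ood_idx = property_extrapolation_split(props, ood_fraction=0.1)
# Every OOD sample lies outside the in-distribution property range.
assert props[ood_idx].min() < props[id_idx].min()
assert props[ood_idx].max() > props[id_idx].max()
```

Splitting on the property value itself (rather than on molecular scaffold or random indices) is what forces the model to extrapolate, since the held-out targets lie strictly outside the training range.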

Key findings include:
- Performance Benchmarking: No single model consistently excels in OOD generalization across all tasks. Even top-performing models exhibit considerably larger OOD prediction errors compared to in-distribution predictions, indicating significant room for improvement.
- Architectural Insights: Models with high inductive bias demonstrated efficacy in OOD predictions for simpler properties. This suggests that a model's design, notably graph-based and equivariant graph neural networks (EGNNs), is pivotal but not sufficient for universal OOD generalization.
- Pretraining Effects: Current chemical language model pretraining does not confer the anticipated improvements in OOD prediction capabilities. Despite enhancing in-distribution performance, pretraining tasks currently fail to equip models with the necessary extrapolation skills for OOD samples.
- Hyperparameter and Data Augmentation: Adjustments in hyperparameters show some potential for improving OOD predictions, highlighting the need for optimization tailored specifically for OOD performance. Additionally, data augmentation strategies involving incorporation of OOD samples during training significantly enhance OOD generalization.
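The headline benchmarking finding (even top models show roughly 3x larger OOD error) reduces to comparing error on the two splits. Below is a minimal, self-contained illustration of that comparison using a toy linear model and a quadratic target; the names and the toy setup are illustrative, not the paper's models or metrics code.

```python
import numpy as np

def ood_gap(model, X_id, y_id, X_ood, y_ood):
    """Mean absolute error on in-distribution and OOD test sets,
    plus their ratio -- the kind of headline gap BOOM reports."""
    mae_id = np.mean(np.abs(model(X_id) - y_id))
    mae_ood = np.mean(np.abs(model(X_ood) - y_ood))
    return mae_id, mae_ood, mae_ood / mae_id

# A linear model fit only on the ID range extrapolates poorly on a
# quadratic property, inflating OOD error relative to ID error.
rng = np.random.default_rng(1)
X_id = rng.uniform(-1, 1, 500)
X_ood = np.concatenate([rng.uniform(-2, -1, 50), rng.uniform(1, 2, 50)])
true = lambda x: x ** 2
coef = np.polyfit(X_id, true(X_id), 1)       # fit on ID data only
model = lambda x: np.polyval(coef, x)
mae_id, mae_ood, ratio = ood_gap(model, X_id, true(X_id),
                                 X_ood, true(X_ood))
assert ratio > 1.0  # OOD error exceeds ID error
```

The same two-number comparison, applied per property task, is what lets a benchmark like BOOM rank architectures by how gracefully they degrade outside the training distribution.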

Implications and Future Directions

The findings suggest new directions in chemical machine learning centered on enhancing OOD generalization. Practically, this underscores the necessity of redefining pretraining tasks to capture chemical phenomena relevant for extrapolation. The study advocates for the development of models that balance high inductive bias with scalable implementations, addressing the limitations of current architectures when faced with complex electronic properties.

The open-source BOOM benchmark, an invaluable resource for model evaluation, potentially steers the community towards standardized methods of OOD performance assessments. Future research might explore hybrid architectures that synergize transformer scalability with geometric deep learning principles to advance comprehensive OOD prediction capabilities.

In conclusion, while this paper presents an incremental advance in characterizing the OOD landscape of molecular property predictions, it highlights profound challenges. Breaking new ground in OOD molecular prediction demands models that push the boundaries of chemical understanding, leveraging both targeted data augmentation and theoretically justified pretraining objectives. Achieving these aims marks a frontier in chemical machine learning, promising transformative applications in molecular discovery.

Overall, the paper offers a critical assessment of molecular machine learning's current capabilities and establishes a foundational benchmark for future endeavors to improve OOD prediction success.
