- The paper presents a benchmark of UNEP-v1 and GRACE-FS, showcasing their trade-offs in accuracy and computational efficiency for million-atom simulations of multicomponent alloys.
- It details how UNEP-v1 uses a Chebyshev/Legendre polynomial-based neural network optimized for GPUs, while GRACE-FS employs a recursively evaluated ACE variant with efficient CPU performance.
- The study underscores that model architecture, rather than data augmentation alone, governs transferability and stability under extreme conditions such as high temperatures and complex alloy systems.
Machine Learning Interatomic Potentials for Million-Atom Simulations of Multicomponent Alloys
Introduction and Motivation
Simulating multicomponent alloys at the million-atom scale remains a persistent computational bottleneck, primarily because of the trade-off between model accuracy and computational cost. Recent advances have yielded universal machine learning interatomic potentials (MLIPs) with broad chemical coverage and high extrapolation accuracy, but at substantial inference and training cost. This work provides a protocol-consistent benchmark of two highly scalable frameworks: the neuroevolution potential (NEP/UNEP-v1) and the graph atomic cluster expansion in its Finnis–Sinclair variant (GRACE-FS), focusing on their performance for 16 metallic elements and diverse multicomponent alloys.
Methods: MLIP Frameworks, Datasets, and Protocols
UNEP-v1 uses a Chebyshev/Legendre polynomial descriptor basis that is fed into lightweight neural networks with species-dependent weights and trained via evolution strategies. It is optimized for GPU-based inference via GPUMD, with ensemble variants for uncertainty quantification (UQ). GRACE-FS provides a recursively evaluated, parameter-efficient extension of the atomic cluster expansion (ACE) formalism, introducing pairwise chemical embeddings and efficient CPU-based message passing, along with D-optimality and ensemble UQ mechanisms.
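To make the idea of a Chebyshev polynomial descriptor basis concrete, the following is a minimal sketch of a radial basis built from Chebyshev polynomials and a smooth cutoff. The text above does not spell out the functional form, so the coordinate mapping, the cosine cutoff, and the names `cutoff`, `chebyshev_radial_basis`, `r_c`, and `n_max` are assumptions chosen for illustration, not the UNEP-v1 implementation.

```python
import math


def cutoff(r, r_c):
    """Smooth cosine cutoff (assumed form): 1 at r = 0, 0 for r >= r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * r / r_c))


def chebyshev_radial_basis(r, r_c, n_max):
    """Return [f_0(r), ..., f_{n_max}(r)] with f_n = 0.5 * (T_n(x) + 1) * f_c(r).

    T_n is the Chebyshev polynomial of the first kind, evaluated with the
    recurrence T_n = 2 x T_{n-1} - T_{n-2}.
    """
    # Assumed mapping of the distance range [0, r_c] onto [-1, 1].
    x = 2.0 * (r / r_c - 1.0) ** 2 - 1.0
    f_c = cutoff(r, r_c)
    t_prev, t_curr = 1.0, x  # T_0 and T_1
    values = []
    for n in range(n_max + 1):
        if n == 0:
            t_n = 1.0
        elif n == 1:
            t_n = x
        else:
            t_n = 2.0 * x * t_curr - t_prev
            t_prev, t_curr = t_curr, t_n
        values.append(0.5 * (t_n + 1.0) * f_c)
    return values


# Hypothetical usage: five basis functions for a pair 2.5 Å apart, 5 Å cutoff.
basis = chebyshev_radial_basis(2.5, 5.0, 4)
```

In a descriptor of this kind, each basis value would be summed over neighbors and combined with learned, species-dependent coefficients before entering the neural network; that step is omitted here.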
Both models are trained exclusively on unary and binary alloy DFT datasets, with transferability tested on ternary and higher-order alloy datasets up to 16 elements. Extensive property benchmarks, computational cost characterizations, and large-scale MD simulations—including high-temperature, mechanical deformation, and shock scenarios—are used for evaluation.
UNEP-v1 and GRACE-FS-M (medium-complexity) achieve comparable training mean absolute error (MAE) values for energy and forces: GRACE-FS-M is slightly better in average errors, while UNEP-v1 holds an advantage in stress robustness. Crucially, the GRACE-FS-M model demonstrates a 40-fold reduction in training wall time relative to UNEP-v1. At inference, however, UNEP-v1 on high-end GPUs (H100/A100) delivers a 30–60× higher atom-step rate than GRACE-FS-M on 192 CPU cores, and its advantage grows with system size.
Figure 1: Error and cost comparison for UNEP-v1 and GRACE-FS-M, showing similar accuracy but order-of-magnitude GPU/CPU-driven trade-offs in wall time.
Figure 3: Throughput comparison for million-atom MD; UNEP-v1 delivers extreme inference performance scaling with GPU resources, enabling unprecedented simulation sizes.
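The atom-step rate used in the inference comparison is the number of atoms multiplied by the number of MD steps, divided by wall time. The sketch below shows the arithmetic; the timings are invented for illustration and are not measurements from the benchmark, and the function names are hypothetical.

```python
def atom_step_rate(n_atoms, n_steps, wall_seconds):
    """Throughput in atom-steps per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return n_atoms * n_steps / wall_seconds


def speedup(rate_a, rate_b):
    """How many times faster run A is than run B at equal system size."""
    return rate_a / rate_b


# Hypothetical timings for a 1,000,000-atom system advanced by 1000 steps.
gpu_rate = atom_step_rate(1_000_000, 1000, wall_seconds=50.0)    # made-up GPU run
cpu_rate = atom_step_rate(1_000_000, 1000, wall_seconds=2000.0)  # made-up CPU run
ratio = speedup(gpu_rate, cpu_rate)
```

Because the rate normalizes by both atom count and step count, it allows runs of different sizes and lengths to be compared, although the ratio itself can still change with system size, as noted above.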
Uncertainty Quantification Assessment
Comprehensive UQ analyses demonstrate that ensemble variance yields robust correlation with actual errors for both frameworks, with higher structural-level Spearman’s ρ for GRACE-FS-M ensembles. In contrast, D-optimality fails to reliably indicate out-of-distribution errors due to high heterogeneity and chemical/structural complexity, limiting its utility for UQ under real alloy distributions.
Figure 2: Ensemble UQ tracks true prediction errors effectively, while D-optimality underperforms for chemically diverse environments.
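The ensemble UQ assessment described above can be sketched as follows: take the spread of predictions across ensemble members as the uncertainty estimate, then measure how well its ranking matches the ranking of actual errors with Spearman's ρ. This is a minimal stdlib-only illustration; the data layout, the use of the population standard deviation, and all function names are assumptions rather than the paper's protocol.

```python
import statistics


def ensemble_mean(predictions):
    """predictions[m][i] is model m's prediction for structure i."""
    return [statistics.fmean(col) for col in zip(*predictions)]


def ensemble_uncertainty(predictions):
    """Per-structure standard deviation across ensemble members."""
    return [statistics.pstdev(col) for col in zip(*predictions)]


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = statistics.fmean(ra), statistics.fmean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (va * vb) ** 0.5


# Made-up energies (eV/atom) from three ensemble members for three structures.
preds = [[1.0, 2.0, 3.0], [1.0, 2.2, 3.6], [1.0, 1.8, 2.4]]
reference = [1.0, 2.1, 3.5]  # made-up reference values
errors = [abs(p - r) for p, r in zip(ensemble_mean(preds), reference)]
rho = spearman_rho(ensemble_uncertainty(preds), errors)
```

A ρ close to 1 means structures with larger ensemble spread also tend to have larger true errors, which is the property that makes the spread usable as an out-of-distribution indicator.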
Stability and Transferability in Extreme Conditions
At moderate temperatures and in simple compounds, both MLIPs maintain energy conservation and structural integrity. Under elevated temperatures and in high-entropy alloy (HEA) MD, GRACE-FS-M maintains stability, but UNEP-v1 exhibits catastrophic NVE drift at 3000 K and above, especially in systems with 5–16 atomic species.
Figure 4: For a goldene monolayer, both potentials maintain low energy drift, but structural collapse occurs with GRACE-FS-M.
Figure 5: In HEAs, GRACE-FS-M maintains stability at high temperature whereas UNEP-v1 fails, indicating architectural robustness differences in extreme regimes.
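Energy conservation in an NVE run is commonly summarized as a drift rate: the slope of total energy versus time, normalized per atom. The sketch below estimates it with an ordinary least-squares fit. The units, the function names, and the tolerance passed to `is_stable` are assumptions for illustration; the text does not state the criterion used to label a run as drifting.

```python
def energy_drift_per_atom(times_ps, total_energies_ev, n_atoms):
    """Least-squares slope of E_total(t), in eV per atom per ps."""
    n = len(times_ps)
    if n < 2 or n != len(total_energies_ev):
        raise ValueError("need two or more matching time/energy samples")
    t_mean = sum(times_ps) / n
    e_mean = sum(total_energies_ev) / n
    sxx = sum((t - t_mean) ** 2 for t in times_ps)
    sxy = sum((t - t_mean) * (e - e_mean)
              for t, e in zip(times_ps, total_energies_ev))
    return sxy / sxx / n_atoms


def is_stable(drift, tolerance):
    """Flag a run as stable when |drift| is within a user-chosen tolerance."""
    return abs(drift) <= tolerance


# Made-up trajectory: total energy rises by 0.1 eV per ps in a 10-atom cell.
times = [0.0, 1.0, 2.0, 3.0]
energies = [-100.0, -99.9, -99.8, -99.7]
drift = energy_drift_per_atom(times, energies, n_atoms=10)
stable = is_stable(drift, tolerance=1e-4)  # hypothetical tolerance
```

A linear fit is preferred over a simple end-minus-start difference because it is less sensitive to the thermal fluctuations of individual frames.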
Transferability tests reveal a strict hierarchy: GRACE-2L models (deep graph-based, more expressive) outperform both GRACE-FS-M and UNEP-v1 as the number of chemical species increases. Data augmentation alone does not close this gap, which implies that architectural limitations, rather than dataset completeness, are the dominant constraint.
Figure 6: Prediction errors increase with chemical complexity; only deep graph-based architectures retain accuracy across the 2–16 element range.
For key mechanical properties (elastic constants, vacancy formation energy, surface energy, dislocation barriers), GRACE-FS variants consistently outperform UNEP-v1, with further improvement when data augmentation is employed. This holds both for fundamental properties and for complex, finite-temperature tensile tests in FCC/BCC HEAs, where GRACE-FS models capture defect proliferation, phase transitions, and stacking faults with lower RMSE.
Figure 7: GRACE-FS MLIP architectures show higher fidelity than UNEP-v1 for elastic and defect property prediction.
Figure 8: Tensile deformation simulations in HEAs: GRACE-FS models show lower errors in both energy and force profiles for FCC and BCC systems.
Large-Scale Shock Simulations and Model Uncertainty
UNEP-v1’s computational efficiency enables 3-million-atom shock simulations of HEA systems that capture density, stress, and temperature evolution and allow rigorous uncertainty quantification for spall strength (∼2% relative uncertainty across the ensemble). Microstructural observables (e.g., dislocation densities) vary between ensemble members, underscoring the need for ensemble approaches in extreme-event simulations.
Figure 9: Shock-induced spallation in Al10Cr10Cu35Ni35V10 is robustly captured by UNEP-v1, enabling quantification of spall strength and associated uncertainty.
Figure 10: Ensemble variability in microstructural evolution highlights sensitivity of crack propagation pathways, with implications for reliability of large-scale MD predictions.
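A relative uncertainty such as the roughly 2% quoted for spall strength can be obtained by running the same shock simulation with each ensemble member and dividing the spread of the results by their mean. The sketch below uses the sample standard deviation; that choice, the function name, and the spall-strength values are assumptions made for illustration and are not data from the study.

```python
import statistics


def relative_uncertainty(samples):
    """Sample standard deviation divided by the magnitude of the mean."""
    if len(samples) < 2:
        raise ValueError("need two or more ensemble members")
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("relative uncertainty is undefined for zero mean")
    return statistics.stdev(samples) / abs(mean)


# Made-up spall strengths (GPa), one per ensemble member.
spall_gpa = [10.0, 10.2, 9.8, 10.1, 9.9]
rel = relative_uncertainty(spall_gpa)
```

The same helper can be applied to microstructural observables such as dislocation density, where a much larger ratio would reflect the stronger member-to-member variability described above.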
Discussion and Implications
This benchmark establishes clear guidelines for MLIP selection for million-atom alloy simulations:
- UNEP-v1 is optimal for scenarios where maximum system size or long time scales are the primary constraint, and minor compromises in accuracy (especially for high-entropy extreme conditions) are acceptable.
- GRACE-FS and deep graph-based models are preferred for robust chemical extrapolation, transferability, and high-temperature stability.
- Training only on unary/binary data is insufficient for multi-element alloys; model architecture imposes the principal constraint for transferability, as further data augmentation yields diminishing returns without increased expressiveness.
- Ensemble-based UQ remains mandatory for out-of-distribution detection; D-optimality is unreliable in the presence of broad data heterogeneity.
These findings align with broader trends showing that scalable, expressive architectures—such as universal equivariant GNN potentials—are required for robust extrapolation in chemically diverse scenarios (2604.01642, Liang et al., 30 Apr 2025).
Conclusion
This work delivers a side-by-side evaluation of production-scale MLIPs for HEAs and related systems, delineating the speed/accuracy/extrapolation trade-offs between the NEP and GRACE-FS formalisms. For applications demanding maximum throughput at system sizes beyond 10^6 atoms, UNEP-v1 enables unprecedented MD access. For applications requiring robustness in property prediction, chemical extrapolation, and simulation stability, especially under extreme thermodynamic or mechanical conditions, GRACE-FS and more expressive graph-based models are essential. The careful assessment of UQ and transferability provided here will inform both methodological development and practical deployment of MLIPs for complex alloy simulations.