
SyntheFormer: Transformer for Synthesizability

Updated 5 February 2026
  • SyntheFormer is a transformer-based architecture that predicts the experimental synthesizability of crystalline inorganic solids using Fourier-transformed crystal periodicity.
  • It utilizes hierarchical attention across multiple pathways combined with deep MLP classification and positive–unlabeled risk estimation to address severe class imbalance.
  • The method improves recall and provides explicit uncertainty quantification, enabling fast, high-throughput pre-screening in computational materials design.

SyntheFormer refers to a class of machine learning architectures leveraging transformer-based neural networks for the prediction or generation of synthesizable structures in chemistry and materials science. The core usage to date centers on synthesizability assessment for crystalline inorganic solids, with technical innovations in structural representation, hierarchical attention, positive-unlabeled learning, and uncertainty quantification (Ebrahimzadeh et al., 22 Oct 2025).

1. Problem Formulation and Motivation

The central objective addressed by SyntheFormer is direct prediction of experimental synthesizability for hypothetical inorganic crystal structures. Traditional approaches—based on thermodynamic stability (e.g., distance to convex hull via DFT)—fail to reliably distinguish between experimentally attainable and unattainable materials; many metastable compounds are known, and many predicted stable structures remain unsynthesized. Only a small minority (~1%) of hypothetical structures are confirmed synthesized in forward-looking (post-2019) test sets, making standard supervised methods inapplicable due to severe class imbalance and lack of true negative labels.

In this paradigm, SyntheFormer provides a process to learn synthesizability exclusively from crystal structure data, combining physics-motivated representations with deep attention models, systematic feature selection, and robust handling of positive-unlabeled data (Ebrahimzadeh et al., 22 Oct 2025).

2. Input Representation: Fourier-Transformed Crystal Periodicity

SyntheFormer employs a “Fourier-Transformed Crystal Periodicity” (FTCP) representation, mapping real-space lattice vectors $\{\mathbf{a},\mathbf{b},\mathbf{c}\}$, atomic positions $\{\mathbf{r}_j\}$, and species occupancy into reciprocal-space “fingerprints”:

  • For each atomic species $m$ and reciprocal-space index $k_t$, the structure factor is:

$$S_m(k_t) = \sum_{j \in \text{sites}(m)} f_m(k_t)\, e^{-i k_t \cdot r_j}$$

where $f_m(k_t)$ is the atomic scattering factor.

  • Sine and cosine channels are extracted:

$$C_m(k_t) = \sum_j f_m(k_t) \cos(k_t\cdot r_j),\qquad S_m(k_t) = \sum_j f_m(k_t) \sin(k_t\cdot r_j)$$

  • Additional blocks encode one-hot elemental vectors, lattice constants $(a, b, c, \alpha, \beta, \gamma)$, atomic-site indices, site occupancies, and reciprocal-space intensities.
  • The resulting data tensor (e.g., 399×64) is formatted such that each row is a “token” suitable for transformer attention, thereby allowing the network to capture both local and long-range correlations across periodic unit cells and reciprocal-space harmonics.

This representation explicitly encodes crystallographic periodicity and symmetry, enabling structurally-aware attention in subsequent layers (Ebrahimzadeh et al., 22 Oct 2025).
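To make the channel construction concrete, the following is a minimal NumPy sketch of the cosine/sine FTCP channels defined above. The reciprocal-space grid, the inclusion of the crystallographic $2\pi$ phase factor for fractional coordinates, and the default unit scattering factors are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def ftcp_channels(frac_coords, species, k_vectors, scattering=None):
    """Per-species cosine/sine structure-factor channels.

    frac_coords : (n_sites, 3) fractional atomic positions r_j
    species     : length-n_sites list of element symbols
    k_vectors   : (n_k, 3) reciprocal-space indices k_t
    scattering  : optional dict {element: (n_k,) array f_m(k_t)}; defaults to 1
    """
    elements = sorted(set(species))
    n_k = len(k_vectors)
    C = np.zeros((len(elements), n_k))  # cosine channel C_m(k_t)
    S = np.zeros((len(elements), n_k))  # sine channel   S_m(k_t)
    for m, elem in enumerate(elements):
        sites = np.array([r for r, s in zip(frac_coords, species) if s == elem])
        # Assumed convention: phase 2*pi*(k_t . r_j) for fractional coordinates.
        phase = 2 * np.pi * sites @ k_vectors.T          # (n_sites_m, n_k)
        f = scattering[elem] if scattering else 1.0
        C[m] = (f * np.cos(phase)).sum(axis=0)
        S[m] = (f * np.sin(phase)).sum(axis=0)
    return elements, C, S
```

For a single atom at the origin the cosine channel equals the scattering factor and the sine channel vanishes, which is a quick sanity check on the phase convention.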

3. Hierarchical Transformer-Based Feature Extraction

Unlike monolithic transformers, SyntheFormer deploys six parallel “pathways,” each specialized to a particular block of the FTCP tensor:

  • All pathways embed inputs into $\mathbb{R}^{256}$ or $\mathbb{R}^{768}$ subspaces.
  • Within each, multi-head self-attention layers operate on the embedding sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right)V$$

for query, key, and value projections $Q, K, V$, as standard in transformer architectures.

  • Atomic sites and lattice parameters utilize position encodings adapted to periodic boundary conditions:

$$\mathrm{PE}_{p,2i} = \sin\left(2\pi k \cdot r_p / |\Lambda|\right),\qquad \mathrm{PE}_{p,2i+1} = \cos\left(2\pi k \cdot r_p / |\Lambda|\right)$$

  • Features from each pathway are pooled and concatenated, resulting in a 2 048-dimensional embedding vector for each crystal structure.

Hierarchical attention thus models interactions at multiple physical scales (individual sites, global periodicity, compositional blocks), extracting a high-capacity fingerprint for downstream prediction (Ebrahimzadeh et al., 22 Oct 2025).
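The two formulas above can be sketched in a few lines of NumPy. This is a single-head illustration; the head dimensions, the integer harmonic ladder for $k$, and the treatment of $|\Lambda|$ as a scalar cell norm are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def periodic_encoding(positions, d_model, cell_norm):
    """Sin/cos encodings with phase 2*pi*k*r_p/|Lambda|, pairing
    even (sin) and odd (cos) dimensions as in the formula above."""
    pe = np.zeros((len(positions), d_model))
    for i in range(d_model // 2):
        k = i + 1  # assumed integer harmonic index
        phase = 2 * np.pi * k * positions / cell_norm
        pe[:, 2 * i] = np.sin(phase)
        pe[:, 2 * i + 1] = np.cos(phase)
    return pe
```

With integer harmonics, positions separated by a full lattice period receive identical encodings, which is the point of adapting the encoding to periodic boundary conditions.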

4. Feature Selection, Classifier Architecture, and PU Learning

Given the extreme class imbalance (only 1.02% of future samples are positive), SyntheFormer integrates:

  • Random Forest Feature Selection: A 200-tree random forest ranks the 2 048 transformer-derived features by their Gini impurity reduction. The top 100 are retained, resulting in a compact fingerprint. This selection both improves discriminative power (AUC: 0.735 with 100 features vs. 0.705 with the full set), and reduces overfitting and computational cost.
  • Deep MLP Classifier: The selected 100 features are processed by a 4-layer MLP with 512–256–128 hidden nodes, ReLU activations, batch normalization, and dropout.
  • PU Risk Estimation: Given the lack of known negative labels, training uses the non-negative positive–unlabeled risk estimator:

$$\widehat{R}_{PU}(f) = \pi_P\,\mathbb{E}_{x\sim P}\left[\ell(f(x),+1)\right] + \max\left\{0,\ \mathbb{E}_{x\sim U}\left[\ell(f(x),-1)\right] - \pi_P\,\mathbb{E}_{x\sim P}\left[\ell(f(x),-1)\right] \right\}$$

with $\ell(p,y)$ the cross-entropy loss, $\pi_P$ the class prior, and $P$, $U$ the positive and unlabeled sets, respectively.

  • The final classifier outputs a probability score $p \in (0,1)$ for synthesizability.

This design addresses both scalability and label uncertainty in realistic materials pipelines (Ebrahimzadeh et al., 22 Oct 2025).
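The risk estimator above can be evaluated directly on model scores. The sketch below uses the logistic form of cross-entropy on raw scores and toy inputs; the class prior and loss parameterization are illustrative assumptions, and in the paper this quantity serves as the training objective of the deep MLP.

```python
import numpy as np

def logistic_loss(scores, y):
    """ell(f(x), y) = log(1 + exp(-y * f(x))), the logistic form
    of binary cross-entropy on raw scores, with y in {+1, -1}."""
    return np.log1p(np.exp(-y * scores))

def nn_pu_risk(scores_p, scores_u, pi_p):
    """Non-negative PU risk:
    pi_P * E_P[ell(f,+1)] + max(0, E_U[ell(f,-1)] - pi_P * E_P[ell(f,-1)])."""
    r_p_pos = logistic_loss(scores_p, +1).mean()  # positives scored as +1
    r_p_neg = logistic_loss(scores_p, -1).mean()  # positives scored as -1
    r_u_neg = logistic_loss(scores_u, -1).mean()  # unlabeled scored as -1
    return pi_p * r_p_pos + max(0.0, r_u_neg - pi_p * r_p_neg)
```

The `max(0, ...)` clamp is what makes the estimator non-negative: without it, a flexible model can drive the bracketed term below zero by overfitting the unlabeled set.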

5. Uncertainty Quantification and Decision Rules

Since only a small minority of post-2019 hypothetical materials are confirmed synthesizable, SyntheFormer applies a dual-threshold calibration for interpretability:

  • Assign “synthesizable” if $p \geq t_\mathrm{high}$, “non-synthesizable” if $p \leq t_\mathrm{low}$, and “uncertain” otherwise.
  • Optimized thresholds ($t_\mathrm{high}=0.30$, $t_\mathrm{low}=0.25$) yield recall $97.6\%$ at $94.2\%$ coverage; only $5.8\%$ of candidates are deferred for expert review.
  • This approach increases recall compared to standard $p=0.5$ thresholding, which would miss nearly $28\%$ of true positives on out-of-sample (2019–2025) data.

Explicit uncertainty quantification thus minimizes missed experimental opportunities while balancing resource constraints in high-throughput screening (Ebrahimzadeh et al., 22 Oct 2025).
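The dual-threshold rule reduces to a small decision function; this sketch uses the reported thresholds $t_\mathrm{high}=0.30$ and $t_\mathrm{low}=0.25$ as defaults, with the label strings chosen here for illustration.

```python
def classify(p, t_high=0.30, t_low=0.25):
    """Map a synthesizability probability p to a three-way decision."""
    if p >= t_high:
        return "synthesizable"
    if p <= t_low:
        return "non-synthesizable"
    return "uncertain"  # deferred for expert review
```

Scores in the narrow band $(0.25, 0.30)$ are the $5.8\%$ of candidates deferred to experts rather than forced into a binary call.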

6. Performance Evaluation and Implications for Materials Discovery

Under temporally separated testing (train: 2011–2018, test: 2019–2025), SyntheFormer achieves ROC AUC $0.735$ on highly imbalanced data. It recovers the majority of experimentally known metastable materials (including those $>1\,\mathrm{eV}$/atom above the convex hull) and assigns low synthesizability scores to many unsynthesized DFT-predicted “stable” structures.

Compared to DFT-based convex-hull filtering (e.g., $E_\text{hull} < 0.1\,\mathrm{eV}$/atom; recall 84.0%), SyntheFormer achieves higher recall (94.3%) at a similar false-positive burden, together with explicit uncertainty quantification.

In practical deployment, the pipeline—from FTCP encoding and transformer attention, through feature selection, to classification—executes in milliseconds per candidate on standard GPUs. Thus, it enables fast pre-screening of millions of hypothetical inorganic structures in computational materials design workflows (Ebrahimzadeh et al., 22 Oct 2025).

7. Limitations and Future Directions

  • Scope: SyntheFormer currently targets inorganic crystalline materials. Extension to organic, amorphous, or non-periodic phases remains unexplored.
  • Input Dependency: The accuracy of FTCP representation depends on the quality of crystallographic input; site disorder, partial occupancy, or severe defects are not explicitly addressed.
  • Uncertainty Calibration: Dual-thresholding defers a small but nonzero fraction of compounds as “uncertain”; tighter integration with laboratory feedback or active learning may further improve precision.
  • Physical Interpretability: Attention weights or selected features may provide interpretability, but explicit connection to experimental synthesis failure modes is only beginning to be explored.

Further research directions include adaptation to other domains, incorporation of kinetics or processing variables, and integration with generative models to propose modification pathways for borderline candidates (Ebrahimzadeh et al., 22 Oct 2025).
