Matra-Genoa Generative Model for Crystals
- The paper introduces a transformer model that employs symmetry-preserving, invertible Wyckoff tokenization to represent complete inorganic crystal structures.
- The model achieves high sampling speed and enhanced stability, reporting 97% sequence-to-crystal validity and an 8× improvement over baseline methods.
- The framework integrates ML relaxation and DFT verification in high-throughput workflows, significantly expanding databases with novel, low-energy compounds.
The Matra-Genoa generative model is an autoregressive transformer framework for the generation of inorganic crystal structures, grounded in a symmetry-preserving, invertible tokenization scheme based on Wyckoff representations. This model is designed to sample efficiently from the space of possible crystals across the full periodic table and all space groups, with explicit conditioning on stability and other physical properties enabled through architectural and representational innovations. It achieves a significant enhancement in the fraction of generated stable compounds compared to established baselines and underpins recent advances in computational materials discovery workflows (Breuck et al., 27 Jan 2025, Cavignac et al., 9 Dec 2025).
1. Invertible Tokenization and Crystal Representation
Matra-Genoa employs a detailed, invertible mapping from a symmetrized crystal structure to a variable-length sequence of tokens encoding:
- Composition (element types and stoichiometry)
- Stability prefix (“EHULL_LOW” for structures below a threshold energy above hull, “EHULL_HIGH” otherwise)
- Space group index
- Wyckoff Positions (up to 990 unique orbits across all space groups)
- Free coordinates of each Wyckoff site
- Lattice parameters (a, b, c, α, β, γ, as allowed by the space group)
Discrete tokens (elements, Wyckoff letters, space-group indices) use learnable embedding tables. Continuous-valued tokens (coordinates, lattice parameters) are encoded by concatenating linear and logarithmic Gaussian basis projections. The resulting token vectors are summed with sinusoidal positional embeddings before model input. This invertible mapping ensures that every generated sequence can be deterministically decoded to an explicit crystal structure, preserving valid symmetry assignments (Breuck et al., 27 Jan 2025, Cavignac et al., 9 Dec 2025).
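A minimal sketch of one plausible form of such a continuous-token encoding is shown below. The basis count, centers, and widths are assumptions for illustration; the paper's exact featurization parameters are not reproduced here.

```python
import math

def gaussian_basis(x, centers, width):
    """Project a scalar onto a set of Gaussian basis functions."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]

def encode_continuous(value, n_basis=16, lo=0.0, hi=1.0, width=0.05):
    """Concatenate linear-spaced and log-spaced Gaussian basis projections,
    as a stand-in for the paper's continuous-token encoding (details assumed)."""
    step = (hi - lo) / (n_basis - 1)
    linear_centers = [lo + i * step for i in range(n_basis)]
    # Log-spaced centers over a small positive range (assumed form).
    log_lo, log_hi = math.log(1e-3), math.log(hi + 1e-3)
    log_step = (log_hi - log_lo) / (n_basis - 1)
    log_centers = [math.exp(log_lo + i * log_step) for i in range(n_basis)]
    return (gaussian_basis(value, linear_centers, width)
            + gaussian_basis(value, log_centers, width))

features = encode_continuous(0.25)  # e.g., a fractional coordinate
```

Each scalar thus becomes a fixed-length feature vector that a linear layer can project into the model dimension alongside the discrete embeddings.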
2. Model Architecture and Conditional Sampling
The core model is a 12-layer, pre-norm transformer with multi-head causal self-attention:
- Each layer consists of 16-head causal self-attention followed by a 2-layer MLP with GELU activations, pre-norm layer normalization, and residual connections.
- The joint distribution over a token sequence $w_{1:N}$ is factorized autoregressively: $p(w_{1:N}) = \prod_{n=1}^{N} p(w_n \mid w_{<n})$.
- Separate prediction heads are used for discrete (softmax; cross-entropy) and continuous (Gaussian; MSE) tokens.
Conditional generation is achieved by prepending a property-specific prefix token $c$ (e.g., “EHULL_LOW” or a discretized property value) to the sequence, so the model samples from $p(w_{1:N} \mid c) = \prod_{n=1}^{N} p(w_n \mid w_{<n}, c)$. This steers the generated structures towards target regimes (e.g., low energy above hull) without architectural modification.
During inference, tokens are generated iteratively (optionally with sampling temperature $T$) until a complete and valid sequence is produced:
```
initialize sequence with [COMPOSITION, EHULL_{LOW/HIGH}]
for n in 1..N_max:
    logits = Transformer(sequence)
    if next token is discrete:
        sample w_{n+1} ~ softmax(logits / T)
    else:
        regress continuous value v_{n+1} = g(logits)
    append w_{n+1} to sequence
```
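The discrete branch of this loop can be made concrete. The sketch below substitutes a stub scoring function for the trained transformer and uses standard temperature-scaled softmax sampling; the vocabulary, stub logits, and function names are all illustrative, and the continuous-regression branch is omitted.

```python
import math
import random

random.seed(0)

VOCAB = ["SG_225", "Wyckoff_a", "Wyckoff_b", "END"]  # toy discrete alphabet

def stub_transformer(sequence):
    """Stand-in for the trained model: returns arbitrary logits per vocab token."""
    return [len(sequence) * 0.1 + i for i in range(len(VOCAB))]

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_sequence(prefix, temperature=0.7, n_max=20):
    """Autoregressively extend a prefix until END or the length cap."""
    sequence = list(prefix)  # e.g., [composition tokens, "EHULL_LOW"]
    for _ in range(n_max):
        probs = softmax(stub_transformer(sequence), temperature)
        token = random.choices(VOCAB, weights=probs)[0]
        sequence.append(token)
        if token == "END":
            break
    return sequence

seq = sample_sequence(["Na", "Cl", "EHULL_LOW"])
```

The prefix tokens (composition and stability flag) are never resampled; they act purely as conditioning context for every subsequent prediction.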
3. Training Data and Protocol
Matra-Genoa is trained by maximum likelihood (negative log-likelihood objective) on large corpora of symmetrized, tokenized crystal structures. The primary datasets are:
- Matra-Genoa-MP: 115,663 Materials Project structures (filtered to at most 15 Wyckoff sites)
- Matra-Genoa-MPAS: combination of Materials Project and Alexandria, totaling 2,668,720 structures below an energy-above-hull cutoff
Training employs the Adam optimizer with weight decay and a 1-cycle learning rate schedule, with batch size 256 and early stopping over 1,000 epochs. Mixed-precision training is performed on 7 NVIDIA H100 GPUs, totaling ~36 hours (Breuck et al., 27 Jan 2025). All inputs are symmetrized; discrete and continuous attributes are binned to a fixed, invertible alphabet for robust decoding (Cavignac et al., 9 Dec 2025).
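A 1-cycle schedule ramps the learning rate up to a peak and then anneals it down over the remaining steps. The sketch below shows one common cosine-shaped variant; the warmup fraction, peak value, and start/end ratios are assumptions, not the paper's reported settings.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, start_frac=0.1, end_frac=0.01):
    """Cosine 1-cycle schedule: warm up to max_lr, then anneal down.
    All hyperparameters here are illustrative assumptions."""
    warmup = int(0.3 * total_steps)
    if step < warmup:
        # Cosine ramp from start_frac * max_lr up to max_lr.
        frac = step / warmup
        lo = max_lr * start_frac
        return lo + (max_lr - lo) * 0.5 * (1 - math.cos(math.pi * frac))
    # Cosine anneal from max_lr down to end_frac * max_lr.
    frac = (step - warmup) / (total_steps - warmup)
    lo = max_lr * end_frac
    return lo + (max_lr - lo) * 0.5 * (1 + math.cos(math.pi * frac))

schedule = [one_cycle_lr(s, 1000, 1e-3) for s in range(1000)]
```

Frameworks such as PyTorch ship an equivalent built-in (`torch.optim.lr_scheduler.OneCycleLR`), so in practice the schedule rarely needs to be hand-rolled.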
4. Model Performance and Baselines
Matra-Genoa exhibits high throughput and an elevated fraction of generated structures that are thermodynamically plausible:
- Sampling Speed: ~1,000 unique and novel crystals per minute.
- Sequence-to-crystal validity: 97% at low sampling temperature, dropping to 60% at higher temperature.
- Novelty and uniqueness: net yield of 20–30% novel and unique structures, with an "S.U.N." (stable, unique, novel) ratio of ≈16% (cf. 45% for MatterGen, 18% for CDVAE, <5% for FTCP or G-SchNet).
- Improvement over baselines: the density of stable (low energy-above-hull) crystals is 8× higher than random PyXtal generation with charge compensation.
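The bookkeeping behind an S.U.N. ratio can be expressed as a simple filter over generated candidates. In the sketch below, string fingerprints stand in for a real structure-matching scheme and the stability cutoff is illustrative.

```python
def sun_ratio(samples, known_fingerprints, ehull_threshold=0.1):
    """Fraction of samples that are Stable, Unique, and Novel (S.U.N.).
    `samples` are (fingerprint, e_hull) pairs; fingerprints are a stand-in
    for a real structure-matching scheme (an assumption here)."""
    seen = set()
    sun = 0
    for fingerprint, e_hull in samples:
        stable = e_hull <= ehull_threshold
        unique = fingerprint not in seen       # not generated earlier in batch
        novel = fingerprint not in known_fingerprints  # not in training data
        seen.add(fingerprint)
        if stable and unique and novel:
            sun += 1
    return sun / len(samples)

samples = [("A", 0.02), ("A", 0.02), ("B", 0.30), ("C", 0.01), ("D", 0.05)]
ratio = sun_ratio(samples, known_fingerprints={"C"})  # 2 of 5 pass all three
```

Note that the three criteria interact: raising the sampling temperature tends to increase novelty while lowering the stable fraction, so the S.U.N. ratio is sensitive to where on that trade-off a batch is sampled.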
Large-scale benchmarks with DFT verification (e.g., 3,000,000 generated samples) produce thousands of structures within 0.05 eV/atom of the convex hull, and over 4,000 within 0.001 eV/atom. On an Al–Ca–Cu ternary benchmark, 11 of 15 known hull-stable structures are recovered from 2,000 samples (73% recovery rate). Generated structures exhibit space-group assignment consistency of 97–99.3%, and post-relaxation filtering yields over 12,000 DFT-verified structures below 0.05 eV/atom per 15,000 tested (Breuck et al., 27 Jan 2025).
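The recovery-rate metric on such a ternary benchmark is simple set arithmetic; the sketch below reproduces the 11-of-15 computation with illustrative structure labels.

```python
def recovery_rate(known_stable, generated):
    """Fraction of known hull-stable structures recovered in a generated batch."""
    recovered = known_stable & generated
    return len(recovered) / len(known_stable)

# Illustrative labels for 15 known hull-stable phases and a generated batch
# that recovers 11 of them plus some extra candidates.
known = {f"AlCaCu_{i}" for i in range(15)}
found = {f"AlCaCu_{i}" for i in range(11)} | {"other_1", "other_2"}
rate = recovery_rate(known, found)  # 11 / 15 ~= 0.733
```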
Table: Example Filtering Results, 150,000-Sample Batch (Breuck et al., 27 Jan 2025)

| Sampling T | Valid structures | Unique and novel | Stable candidates |
|---|---|---|---|
| 0.70 | 138,425 | 57,004 | 9,441 |
| 1.65 | 131,002 | 94,764 | 5,642 |
The model’s output is further utilized in large-scale workflows, achieving 27.4% on-hull and 98.2% within 100 meV/atom in extensive studies, compared to 0.8% and 36% for random-substitution baselines (Cavignac et al., 9 Dec 2025).
5. Integration in Computational Discovery Workflows
Matra-Genoa is a critical generator in multi-stage computational discovery pipelines, where it operates as follows:
- Generation: a token sequence is sampled and decoded to a full crystal structure.
- ML-based Relaxation: Orb-v2 (universal ML interatomic potential) performs geometry optimization.
- Rapid Hull Estimation: ALIGNN (graph neural network) predicts energy above convex hull.
- Filtering: structures within a chosen E_hull threshold (typically 25–200 meV/atom) are further relaxed with DFT and added to databases (e.g., Alexandria).
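The four stages above compose into a funnel that cheaply discards most candidates before expensive DFT. The sketch below wires stub implementations together; all stage bodies are placeholders for the real tools (Matra-Genoa, Orb-v2, ALIGNN), and the numbers are illustrative.

```python
def generate(n):
    """Stub generator: (structure_id, raw_energy_estimate) pairs (illustrative)."""
    return [(f"gen_{i}", 0.01 * i) for i in range(n)]

def ml_relax(structure):
    """Stand-in for Orb-v2 geometry optimization: returns a relaxed structure."""
    sid, e = structure
    return (sid, e * 0.9)  # relaxation lowers the energy estimate

def predict_ehull(structure):
    """Stand-in for an ALIGNN-style energy-above-hull predictor."""
    _, e = structure
    return e

def discovery_pipeline(n_samples, ehull_cutoff=0.1):
    """Generation -> ML relaxation -> fast hull estimate -> filter for DFT.
    Every stage here is a placeholder for the real pipeline component."""
    candidates = [ml_relax(s) for s in generate(n_samples)]
    return [s for s in candidates if predict_ehull(s) <= ehull_cutoff]

dft_queue = discovery_pipeline(50)  # only the low-E_hull subset reaches DFT
```

The key design point is ordering by cost: generation and ML relaxation are cheap per structure, the GNN hull estimate is cheaper still, and DFT is reserved for the small surviving fraction.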
This approach led to the expansion of the Alexandria database by 1.3 million DFT-validated compounds, including 74,000 new predicted stable materials, and supports sAlex25, a dataset of 14 million out-of-equilibrium structures for universal force field training (Cavignac et al., 9 Dec 2025).
6. Strengths, Limitations, and Prospective Directions
Strengths include explicit symmetry handling via Wyckoff tokenization, integration of both discrete and continuous structure components in a single hybrid action space, high throughput, and direct property-conditioning. These aspects enable efficient enumeration and sampling of thermodynamically plausible materials, with a demonstrable, significant enhancement over other generative models (Breuck et al., 27 Jan 2025).
Limitations pertain to the use of greedy sampling, which may under-explore conditional subspaces, as well as the trade-off between stability and novelty mediated by the sampling temperature (T): lower values bias towards stability but reduce novelty, while higher values increase novelty at the cost of stability or validity.
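The temperature trade-off follows directly from temperature-scaled softmax sampling: dividing logits by a small T sharpens the distribution around the model's top choice, while a large T flattens it. A minimal demonstration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax: small T sharpens, large T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)  # low T: mass concentrates on argmax
flat = softmax_with_temperature(logits, 2.0)   # high T: closer to uniform
```

With low T the sampler mostly replays high-likelihood (training-like, stable) motifs; with high T it visits lower-probability tokens more often, which raises novelty but also the chance of invalid or unstable sequences.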
Potential extensions suggested in the primary literature include the adoption of advanced sequence search algorithms (beam search, Monte Carlo tree search, GFlowNets) for improved conditioned sampling, extension to additional property conditioning (band gap, magnetic properties, etc.), supervised tasks via fine-tuning, and integration with active-learning loops for iterative DFT feedback (Breuck et al., 27 Jan 2025).
7. Impact and Significance in Materials Discovery
The Matra-Genoa generative model establishes a new state of the art in crystal structure generation with guaranteed symmetry, high validity, and controllable stability, supporting both high-throughput candidate exploration and targeted property optimization. Its architectural innovations—particularly the Wyckoff-based, invertible tokenization and hybrid action space—are directly responsible for improved sampling of DFT-validated, low-energy crystals. Its integration in workflows with large-scale ML relaxers and high-throughput DFT calculations has accelerated and expanded computational materials discovery, as evidenced by its central role in the generation of millions of new candidate and validated structures for open-access materials repositories (Breuck et al., 27 Jan 2025, Cavignac et al., 9 Dec 2025).