METiS de novo Lipid Library
- The METiS de novo Lipid Library is a computational resource compiling 10 million virtual ionizable lipids using fragment-based combinatorial methods and reinforcement learning.
- It employs stringent physicochemical, synthetic feasibility, and MD-based filters to ensure high structural diversity and robust chemical validity.
- Integration with LipidBERT pre-training enables accurate property predictions, significantly reducing real-world failure rates in lipid nanoparticle discovery.
The METiS de novo Lipid Library is a computationally constructed, structurally diverse database of virtual ionizable lipids. It is assembled using fragment-based generative algorithms, subjected to physicochemical and synthetic feasibility filters, and curated for machine learning–driven lipid nanoparticle (LNP) design and screening tasks. The resource underpins the pre-training of language models such as LipidBERT and enables high-throughput in silico screening of next-generation lipid candidates for drug delivery applications (Yu et al., 2024; Ou et al., 2024).
1. Algorithmic Construction and Fragment-Based Enumeration
Library generation within the METiS pipeline leverages a fragment-based combinatorial approach. Known ionizable lipids and small-molecule drugs are computationally fragmented into 7–12 substructures—headgroups, linking fragments, alkyl tails, and optional spacers. The fragment classes are defined by constraints on atom types, connectivity patterns, maximal lengths, and optional branching. A reinforcement learning (RL) generator proposes novel fragments with class-conditional reward functions—favoring, for example, tertiary amines in headgroups or saturated/unsaturated chains in tails. A custom connecting algorithm exhaustively enumerates valid recombinations into full molecular SMILES strings, with chemical constraints enforced throughout.
Design criteria for fragment selection are as follows:
- Headgroups: Only structures with tertiary or secondary amines (tunable pKₐ typically 5–8; polar surface area ≤ 80 Ų) are considered. The number of distinct headgroup scaffolds is kept proprietary.
- Tails: Chains in the C₈–C₂₄ range, up to one methyl branch per chain, and 0–6 double bonds permitted.
- Enumeration: Lipids feature 2–6 tails; treating each lipid's tails as an unordered multiset, the combinatorial count for the number of unique lipids is

$$N_{\text{lipids}} = N_H \cdot N_L \cdot \sum_{k=2}^{6} \binom{N_T + k - 1}{k},$$

where $N_T$ is the number of unique tail fragments, $N_H$ the number of headgroup scaffolds, and $N_L$ the number of linkers.
Chemical validity (proper tautomers, valences, ionizability windows, and avoidance of unreasonable substructures) is tightly enforced at every assembly stage via internal predictors and explicit filters (Yu et al., 2024).
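The enumeration arithmetic above can be checked numerically. The fragment-pool sizes below are illustrative placeholders (the real METiS counts are proprietary), and the unordered-multiset treatment of the 2–6 tails is an assumption; a minimal sketch:

```python
from math import comb

def count_unique_lipids(n_heads: int, n_linkers: int, n_tails: int,
                        min_tails: int = 2, max_tails: int = 6) -> int:
    """Count head x linker x tail-multiset combinations.

    Tails are treated as an unordered multiset, so a lipid with k tails
    drawn from n_tails unique fragments contributes C(n_tails + k - 1, k)
    combinations (stars and bars).
    """
    tail_combos = sum(comb(n_tails + k - 1, k)
                      for k in range(min_tails, max_tails + 1))
    return n_heads * n_linkers * tail_combos

# Illustrative fragment pool sizes (the actual METiS counts are undisclosed):
print(count_unique_lipids(n_heads=2, n_linkers=1, n_tails=3))  # -> 160
```

Even tiny fragment pools produce counts that grow steeply with tail count, which is why raw enumeration reaches billions of structures before filtering.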
2. Virtual Screening, Filtering, and ML-Based Scoring
Initial enumeration produces "billions" of candidate structures, necessitating aggressive filtering and scoring to reduce them to a tractable, high-quality set:
- Physicochemical Filters: For each molecule, properties including molecular weight (300–800 Da), H-bond donors/acceptors (HBD ≤ 5, HBA ≤ 10), and topological polar surface area (20–120 Ų) are computed (e.g., using RDKit). Log P is estimated by an in-house additive model and must fall between 1 and 5 for membrane compatibility.
- Ionizable State: Each structure is evaluated for headgroup pKₐ (internal predictor) with a typical requirement $5.5 \le \mathrm{p}K_a \le 7.5$; at least one tertiary or secondary amine is mandatory for maintaining ionizability.
- ML Scoring Cascade: Each candidate is traced through a stack of ML classifiers/regressors, including scores for synthetic feasibility, simulated membrane/bilayer stability (MD proxies), and predicted LNP potency. The generic scoring function is

$$S(x) = \sum_i w_i \, f_i(x),$$

where the $f_i$ are task-specific models and the $w_i$ are empirically tuned weights.
Candidates below a tunable threshold are discarded. Iterative tuning and re-sampling ultimately result in a library of 10 million virtual lipid structures, representing only the uppermost percentile of the original design space in terms of predicted utility and plausibility (Yu et al., 2024).
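The filter-then-score cascade can be sketched in plain Python. The property windows follow the ranges stated above, but the predictor functions and weights are illustrative stand-ins, not the proprietary METiS models:

```python
from typing import Callable

# Physicochemical windows from Section 2 (MW in Da, TPSA in sq. Angstroms).
WINDOWS = {
    "mw": (300.0, 800.0),
    "hbd": (0, 5),
    "hba": (0, 10),
    "tpsa": (20.0, 120.0),
    "logp": (1.0, 5.0),
    "pka": (5.5, 7.5),
}

def passes_filters(props: dict) -> bool:
    """True if every computed property falls inside its allowed window."""
    return all(lo <= props[name] <= hi for name, (lo, hi) in WINDOWS.items())

def score(props: dict,
          models: list[tuple[float, Callable[[dict], float]]]) -> float:
    """Weighted scoring cascade: S(x) = sum_i w_i * f_i(x)."""
    return sum(w * f(props) for w, f in models)

# Toy stand-ins for the task-specific models f_i (feasibility, stability, potency):
models = [
    (0.5, lambda p: 1.0 if p["logp"] < 4.0 else 0.5),  # feasibility proxy
    (0.3, lambda p: p["tpsa"] / 120.0),                # bilayer-stability proxy
    (0.2, lambda p: (p["pka"] - 5.0) / 3.0),           # potency proxy
]

candidate = {"mw": 620.0, "hbd": 1, "hba": 6, "tpsa": 60.0, "logp": 3.2, "pka": 6.5}
if passes_filters(candidate):
    print(round(score(candidate, models), 3))  # -> 0.75
```

In a real pipeline the property dictionary would come from RDKit descriptors and the $f_i$ from trained models; candidates whose $S(x)$ falls below the tunable threshold are dropped.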
3. Library Composition and Molecular Representation
Reported statistics on the final library include:
- Headgroup diversity: Many distinct chemotypes (amines, imidazoles, etc.); the exact scaffold count is undisclosed.
- Tail statistics: Uniform coverage of the C₈–C₂₄ range with peaks at C₁₄, C₁₆, C₁₈; unsaturation split as ~30% saturated, ~40% mono-unsaturated, ~30% poly-unsaturated; 2–6 tails per lipid.
- Scaffold diversity: Pairwise ECFP-4 Tanimoto similarities center at $\approx 0.2$, indicating broad chemical space coverage.
- Combinatorial scaling: Raw enumeration of core structures (headgroup × tail set × linker) reaches billions of candidates, with curation to the final 10 million (Yu et al., 2024).
All molecules are represented canonically via SMILES strings, tokenized at the character level for language modeling (an alphabet of roughly 50 symbols). Additionally, molecular graphs—with adjacency matrices and rich atom/bond feature vectors—are constructed to support GNN-based modeling tasks.
| Property | Range or Statistic | Notes |
|---|---|---|
| MW | 300–800 Da (heads/tails ≤ 500) | Enforced during filtering |
| Headgroup pKₐ | 5.5–7.5 | Ionizability requirement |
| log P | 1–5 | Membrane compatibility |
| Tails per lipid | 2–6 | Combinatorial enumeration |
| Diversity (ECFP4) | Tanimoto 0.20 ± 0.05 | Scaffold-level diversity |
SMILES strings are canonicalized (RDKit), with special tokens ([CLS], [SEP], [MASK]) injected for language-model training.
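Character-level tokenization with the special tokens above can be sketched as follows. The actual METiS tokenizer and its ~50-symbol alphabet are not published, so this is a minimal assumed version:

```python
def tokenize_smiles(smiles: str, max_len: int = 128) -> list[str]:
    """Character-level SMILES tokenization with BERT-style special tokens.

    Each character is one token; [CLS] and [SEP] bracket the sequence,
    matching the language-model input format described above.
    """
    chars = list(smiles)[: max_len - 2]  # reserve room for the special tokens
    return ["[CLS]"] + chars + ["[SEP]"]

# A short, illustrative ionizable-headgroup-like SMILES fragment:
print(tokenize_smiles("CCN(CC)CCO"))
```

A production tokenizer would also map tokens to integer IDs against a fixed vocabulary; that lookup table is omitted here for brevity.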
4. Machine Learning Integration and Downstream Usage
The METiS de novo Lipid Library is the primary corpus for pre-training LipidBERT, a BERT-like masked language model. Pre-training includes:
- Masked Language Modeling (MLM): 15% SMILES tokens are masked per input. The objective is to minimize MLM cross-entropy.
- Secondary Tasks: Number-of-tails classification (5-way), head-vs-tail token classification, connecting-atom prediction (sequence and token objectives), and rearranged/decoy SMILES recognition. The total loss is a weighted sum

$$\mathcal{L} = \lambda_{\text{MLM}} \, \mathcal{L}_{\text{MLM}} + \sum_j \lambda_j \, \mathcal{L}_j,$$

with the hyperparameters $\lambda$ selected so that MLM dominates, but auxiliary tasks contribute to embedding discriminability.
LipidBERT is pre-trained on the 10-million-lipid library for 10 epochs (BERT-base: 12 layers × 768 hidden units, 12 attention heads; AdamW optimizer, batch size 128). Embeddings from the [CLS] token encode composite chemical features (head type, tail configuration, linker identity) and support high-accuracy regression on LNP properties, with embedding + MLP models clearly outperforming XGBoost baselines ($R^2 = 0.63$) (Yu et al., 2024).
5. Quantitative Performance and Diversity Metrics
Quantitative evaluation demonstrates:
- Diversity: Pairwise ECFP-4 similarities around 0.2, consistent with high chemical diversity.
- Pre-training scale: Expanding the training set from 1 million to 10 million lipids lifts fine-tuned performance on ex vivo lung fluorescence to $R^2 = 0.94$.
- In silico screening gains: Pre-trained models reduce real-world failure rates in subsequent wet-lab tests by approximately 30%, indicating an effective enrichment for synthetically and functionally plausible candidates prior to experimental validation.
- Combinatorial richness: Headgroup × tail-set enumeration yields billions of plausible core scaffolds (expanded further via linkers and other modifications), consistent with the observed chemical-space coverage (Yu et al., 2024).
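The diversity statistic reported above is a mean pairwise Tanimoto over fingerprint bit sets. A framework-free sketch follows; a real pipeline would use RDKit ECFP-4/Morgan fingerprints, which this version replaces with plain sets of "on" bits:

```python
from itertools import combinations

def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_tanimoto(fps: list[frozenset[int]]) -> float:
    """Mean similarity over all unordered pairs (lower => more diverse library)."""
    pairs = list(combinations(fps, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy "fingerprints" standing in for ECFP-4 on-bit sets:
fps = [frozenset({1, 2, 3}), frozenset({3, 4, 5}), frozenset({6, 7, 8})]
print(round(mean_pairwise_tanimoto(fps), 3))
```

For a 10-million-molecule library the all-pairs loop is infeasible; in practice the mean is estimated from a random sample of pairs.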
6. Comparison with Other Generative Pipelines
An orthogonal approach for ionizable lipid generation appears in the deep generative Synthesis-DAG model (Ou et al., 2024). Here, DAGs encode synthesis routes, with building block selection (heads/tails) from filtered ZINC and LIPID MAPS analogs, sequential attachment predicted by Chemformer, and filtering via lipid-vs-nonlipid classification, pKₐ, and synthetic accessibility (Ertl & Schuffenhauer, 2009).
Outcome statistics for 14,148 valid DAG+Chem samples: 92.6% lipid rate, 83.4% ionizable rate, FCD = 3.797 (Fréchet ChemNet Distance to the training set), and SA score = 4.182 (moderate synthetic accessibility). Novelty and uniqueness are both measured at 0.999. Physicochemical and headgroup selection constraints resemble the METiS rules but are realized via explicit action-sequence modeling and cheminformatics-driven assembly rather than fragment RL.
A plausible implication is that the METiS approach, using RL-driven fragment design with physically constrained fragment pools and aggressive ML/MD filtering, achieves comparably high diversity and synthetic plausibility but at far larger ultimate scale (10 million molecules vs. ~14,000 in the reference DAG treatment).
7. Technical Limitations and Proprietary Aspects
Key enumeration parameters—such as the precise number of headgroup scaffolds, the specific fragment lists, and internal filter thresholds—are treated as METiS trade secrets and remain undisclosed in the literature. Only aggregate statistics and procedural summaries are available. Both the library and the training procedures are tightly integrated with proprietary dry-lab and wet-lab feedback cycles, complicating direct open-source reimplementation but establishing a robust paradigm for dry-wet integration and AI-guided molecular discovery (Yu et al., 2024; Ou et al., 2024).