METiS de novo Lipid Library
- The METiS de novo Lipid Library is a computational resource compiling 10 million virtual ionizable lipids using fragment-based combinatorial methods and reinforcement learning.
- It employs stringent physicochemical, synthetic feasibility, and MD-based filters to ensure high structural diversity and robust chemical validity.
- Integration with LipidBERT pre-training enables accurate property predictions, significantly reducing real-world failure rates in lipid nanoparticle discovery.
The METiS de novo Lipid Library is a computationally constructed, structurally diverse database of virtual ionizable lipids. It is assembled using fragment-based generative algorithms, subjected to physicochemical and synthetic feasibility filters, and curated for machine learning–driven lipid nanoparticle (LNP) design and screening tasks. The resource underpins the pre-training of language models such as LipidBERT and enables high-throughput in silico screening of next-generation lipid candidates for drug delivery applications (Yu et al., 2024; Ou et al., 2024).
1. Algorithmic Construction and Fragment-Based Enumeration
Library generation within the METiS pipeline leverages a fragment-based combinatorial approach. Known ionizable lipids and small-molecule drugs are computationally fragmented into 7–12 substructures—headgroups, linking fragments, alkyl tails, and optional spacers. The fragment classes are defined by constraints on atom types, connectivity patterns, maximal lengths, and optional branching. A reinforcement learning (RL) generator proposes novel fragments with class-conditional reward functions—favoring, for example, tertiary amines in headgroups or saturated/unsaturated chains in tails. A custom connecting algorithm exhaustively enumerates valid recombinations into full molecular SMILES strings, with chemical constraints enforced throughout.
Design criteria for fragment selection are as follows:
- Headgroups: Only structures with tertiary or secondary amines (tunable pKₐ typically 5–8; polar surface area ≤ 80 Ų) are considered. The number of distinct headgroup scaffolds is kept proprietary.
- Tails: Chains in the C₈–C₂₄ range, up to one methyl branch per chain, and 0–6 double bonds permitted.
- Enumeration: Lipids feature 2–6 tails; treating each lipid's tails as an unordered multiset, the combinatorial count for the number of unique lipids is

$$N_{\text{lipids}} = N_H \cdot N_L \cdot \sum_{k=2}^{6} \binom{N_T + k - 1}{k},$$

where $N_T$ is the number of unique tail fragments, $N_H$ the number of headgroup scaffolds, and $N_L$ the number of linkers.
Chemical validity (proper tautomers, valences, ionizability windows, and avoidance of unreasonable substructures) is tightly enforced at every assembly stage via internal predictors and explicit filters (Yu et al., 2024).
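The enumeration arithmetic above can be checked numerically. The fragment-pool sizes below are illustrative placeholders (the real METiS counts are proprietary), and the unordered-multiset treatment of the 2–6 tails is an assumption; a minimal sketch:

```python
from math import comb

def count_unique_lipids(n_heads: int, n_linkers: int, n_tails: int,
                        min_tails: int = 2, max_tails: int = 6) -> int:
    """Count head x linker x tail-multiset combinations.

    Tails are treated as an unordered multiset, so a lipid with k tails
    drawn from n_tails unique fragments contributes C(n_tails + k - 1, k)
    combinations (stars and bars).
    """
    tail_combos = sum(comb(n_tails + k - 1, k)
                      for k in range(min_tails, max_tails + 1))
    return n_heads * n_linkers * tail_combos

# Illustrative fragment pool sizes (the actual METiS counts are undisclosed):
print(count_unique_lipids(n_heads=2, n_linkers=1, n_tails=3))  # -> 160
```

Even tiny fragment pools produce counts that grow steeply with tail count, which is why raw enumeration reaches billions of structures before filtering.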
2. Virtual Screening, Filtering, and ML-Based Scoring
Initial enumeration produces "billions" of candidate structures, necessitating aggressive filtering and scoring to reduce them to a tractable, high-quality set:
- Physicochemical Filters: For each molecule, properties including molecular weight (300–800 Da), H-bond donors/acceptors (HBD ≤ 5, HBA ≤ 10), and topological polar surface area (20–120 Ų) are computed (e.g., using RDKit). Log P is estimated by an in-house additive model and must fall between 1 and 5 for membrane compatibility.
- Ionizable State: Each structure is evaluated for headgroup pKₐ (internal predictor) with a typical requirement $5.5 \le \mathrm{p}K_a \le 7.5$; at least one tertiary or secondary amine is mandatory for maintaining ionizability.
- ML Scoring Cascade: Each candidate is traced through a stack of ML classifiers/regressors, including scores for synthetic feasibility, simulated membrane/bilayer stability (MD proxies), and predicted LNP potency. The generic scoring function is

$$S(x) = \sum_i w_i \, f_i(x),$$

where the $f_i$ are task-specific models and the $w_i$ are empirically tuned weights.
Candidates below a tunable threshold are discarded. Iterative tuning and re-sampling ultimately result in a library of 10 million virtual lipid structures, representing only the uppermost percentile of the original design space in terms of predicted utility and plausibility (Yu et al., 2024).
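The filter-then-score cascade can be sketched in plain Python. The property windows follow the ranges stated above, but the predictor functions and weights are illustrative stand-ins, not the proprietary METiS models:

```python
from typing import Callable

# Physicochemical windows from Section 2 (MW in Da, TPSA in sq. Angstroms).
WINDOWS = {
    "mw": (300.0, 800.0),
    "hbd": (0, 5),
    "hba": (0, 10),
    "tpsa": (20.0, 120.0),
    "logp": (1.0, 5.0),
    "pka": (5.5, 7.5),
}

def passes_filters(props: dict) -> bool:
    """True if every computed property falls inside its allowed window."""
    return all(lo <= props[name] <= hi for name, (lo, hi) in WINDOWS.items())

def score(props: dict,
          models: list[tuple[float, Callable[[dict], float]]]) -> float:
    """Weighted scoring cascade: S(x) = sum_i w_i * f_i(x)."""
    return sum(w * f(props) for w, f in models)

# Toy stand-ins for the task-specific models f_i (feasibility, stability, potency):
models = [
    (0.5, lambda p: 1.0 if p["logp"] < 4.0 else 0.5),  # feasibility proxy
    (0.3, lambda p: p["tpsa"] / 120.0),                # bilayer-stability proxy
    (0.2, lambda p: (p["pka"] - 5.0) / 3.0),           # potency proxy
]

candidate = {"mw": 620.0, "hbd": 1, "hba": 6, "tpsa": 60.0, "logp": 3.2, "pka": 6.5}
if passes_filters(candidate):
    print(round(score(candidate, models), 3))  # -> 0.75
```

In a real pipeline the property dictionary would come from RDKit descriptors and the $f_i$ from trained models; candidates whose $S(x)$ falls below the tunable threshold are dropped.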
3. Library Composition and Molecular Representation
Reported statistics on the final library include:
- Headgroup diversity: Many distinct chemotypes (amines, imidazoles, etc.); the exact scaffold count is undisclosed.
- Tail statistics: Uniform coverage of the C₈–C₂₄ range with peaks at C₁₄, C₁₆, C₁₈; unsaturation split as ~30% saturated, ~40% mono-unsaturated, ~30% poly-unsaturated; 2–6 tails per lipid.
- Scaffold diversity: Pairwise ECFP-4 Tanimoto similarities center at $\approx 0.2$, indicating broad chemical space coverage.
- Combinatorial scaling: Raw enumeration of core structures (headgroup × tail set × linker) reaches billions of candidates, with curation to the final 10 million (Yu et al., 2024).
All molecules are represented canonically via SMILES strings, tokenized at the character level for language modeling (an alphabet of roughly 50 symbols). Additionally, molecular graphs—with adjacency matrices and rich atom/bond feature vectors—are constructed to support GNN-based modeling tasks.
| Property | Range or Statistic | Notes |
|---|---|---|
| MW | 300–800 Da (heads/tails ≤ 500) | Enforced during filtering |
| Headgroup pKₐ | 5.5–7.5 | Ionizability requirement |
| log P | 1–5 | Membrane compatibility |
| Tails per lipid | 2–6 | Combinatorial enumeration |
| Diversity (ECFP4) | Tanimoto 0.20 ± 0.05 | Scaffold-level diversity |
SMILES strings are canonicalized (RDKit), with special tokens ([CLS], [SEP], [MASK]) injected for language-model training.
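Character-level tokenization with the special tokens above can be sketched as follows. The actual METiS tokenizer and its ~50-symbol alphabet are not published, so this is a minimal assumed version:

```python
def tokenize_smiles(smiles: str, max_len: int = 128) -> list[str]:
    """Character-level SMILES tokenization with BERT-style special tokens.

    Each character is one token; [CLS] and [SEP] bracket the sequence,
    matching the language-model input format described above.
    """
    chars = list(smiles)[: max_len - 2]  # reserve room for the special tokens
    return ["[CLS]"] + chars + ["[SEP]"]

# A short, illustrative ionizable-headgroup-like SMILES fragment:
print(tokenize_smiles("CCN(CC)CCO"))
```

A production tokenizer would also map tokens to integer IDs against a fixed vocabulary; that lookup table is omitted here for brevity.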
4. Machine Learning Integration and Downstream Usage
The METiS de novo Lipid Library is the primary corpus for pre-training LipidBERT, a BERT-like masked language model. Pre-training includes:
- Masked Language Modeling (MLM): 15% SMILES tokens are masked per input. The objective is to minimize MLM cross-entropy.
- Secondary Tasks: Number-of-tails classification (5-way), head-vs-tail token classification, connecting-atom prediction (sequence and token objectives), and rearranged/decoy SMILES recognition. The total loss is a weighted sum

$$\mathcal{L} = \lambda_{\text{MLM}} \, \mathcal{L}_{\text{MLM}} + \sum_j \lambda_j \, \mathcal{L}_j,$$

with the hyperparameters $\lambda$ selected so that MLM dominates, but auxiliary tasks contribute to embedding discriminability.
LipidBERT is pre-trained on the 10-million-lipid library for 10 epochs (BERT-base: 12 layers × 768 hidden units, 12 attention heads; AdamW optimizer, batch size 128). Embeddings from the [CLS] token encode composite chemical features (head type, tail configuration, linker identity) and support high-accuracy regression on LNP properties, with embedding + MLP models clearly outperforming XGBoost baselines ($R^2 = 0.63$) (Yu et al., 2024).
5. Quantitative Performance and Diversity Metrics
Quantitative evaluation demonstrates:
- Diversity: Pairwise ECFP-4 similarities around 0.2, consistent with high chemical diversity.
- Pre-training scale: Expanding the training set from 1 million to 10 million lipids lifts fine-tuned performance on ex vivo lung fluorescence to $R^2 = 0.94$.
- In silico screening gains: Pre-trained models reduce real-world failure rates in subsequent wet-lab tests by approximately 30%, indicating an effective enrichment for synthetically and functionally plausible candidates prior to experimental validation.
- Combinatorial richness: Headgroup × tail-set enumeration yields billions of plausible core scaffolds (expanded further via linkers and other modifications), consistent with the observed chemical-space coverage (Yu et al., 2024).
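The diversity statistic reported above is a mean pairwise Tanimoto over fingerprint bit sets. A framework-free sketch follows; a real pipeline would use RDKit ECFP-4/Morgan fingerprints, which this version replaces with plain sets of "on" bits:

```python
from itertools import combinations

def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_tanimoto(fps: list[frozenset[int]]) -> float:
    """Mean similarity over all unordered pairs (lower => more diverse library)."""
    pairs = list(combinations(fps, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy "fingerprints" standing in for ECFP-4 on-bit sets:
fps = [frozenset({1, 2, 3}), frozenset({3, 4, 5}), frozenset({6, 7, 8})]
print(round(mean_pairwise_tanimoto(fps), 3))
```

For a 10-million-molecule library the all-pairs loop is infeasible; in practice the mean is estimated from a random sample of pairs.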
6. Comparison with Other Generative Pipelines
An orthogonal approach for ionizable lipid generation appears in the deep generative Synthesis-DAG model (Ou et al., 2024). Here, DAGs encode synthesis routes, with building block selection (heads/tails) from filtered ZINC and LIPID MAPS analogs, sequential attachment predicted by Chemformer, and filtering via lipid-vs-nonlipid classification, pKₐ, and synthetic accessibility (Ertl & Schuffenhauer, 2009).
Outcome statistics for 14,148 valid DAG+Chem samples: 92.6% lipid rate, 83.4% ionizable rate, FCD = 3.797 (Fréchet ChemNet Distance to the training set), and SA score = 4.182 (moderate synthetic accessibility). Novelty and uniqueness are both measured at 0.999. Physicochemical and headgroup selection constraints resemble the METiS rules but are realized via explicit action-sequence modeling and cheminformatics-driven assembly rather than fragment RL.
A plausible implication is that the METiS approach, using RL-driven fragment design with physically constrained fragment pools and aggressive ML/MD filtering, achieves comparably high diversity and synthetic plausibility but at far larger ultimate scale (10 million molecules vs. ~14,000 in the reference DAG treatment).
7. Technical Limitations and Proprietary Aspects
Key enumeration parameters—such as the precise number of headgroup scaffolds, the specific fragment lists, and internal filter thresholds—are treated as METiS trade secrets and remain undisclosed in the literature. Only aggregate statistics and procedural summaries are available. Both the library and the training procedures are tightly integrated with proprietary dry-lab and wet-lab feedback cycles, complicating direct open-source reimplementation but establishing a robust paradigm for dry-wet integration and AI-guided molecular discovery (Yu et al., 2024; Ou et al., 2024).