S3-GFN: Synthesizable SMILES via GFlowNets
- S3-GFN is a novel generative framework that employs GFlowNets over SMILES strings, integrating rich chemical priors from a pretrained language model.
- It uses a contrastive auxiliary objective to softly suppress unsynthesizable molecules while preserving scalability and chemical validity.
- Experimental results show S3-GFN achieves over 95% synthesizability and outperforms reaction-based methods in key drug discovery benchmarks.
S3-GFN refers to "Synthesizable SMILES via Soft-constrained GFlowNets with Rich Chemical Priors," a generative modeling framework for de novo molecular design that balances property optimization and chemical synthesizability. S3-GFN employs Generative Flow Networks (GFlowNets) in a sequence-based Markov Decision Process (MDP) over SMILES strings, integrating a pretrained SMILES LLM as a molecular prior and introducing soft synthesizability constraints through contrastive off-policy learning. This approach enables efficient exploration of chemical space while steering molecular generation toward high-reward, synthesizable compounds, outperforming both reaction-based and naively reward-shaped baselines in key drug discovery tasks (Kim et al., 4 Feb 2026).
1. Motivation and Conceptual Foundation
De novo molecular generation models in drug discovery frequently face the dual challenge of optimizing for properties such as binding affinity while ensuring the synthetic accessibility of proposed molecules. Prior sequence-based generative models (e.g., RNNs, Transformers over SMILES) leverage large molecular corpora to encode rich chemical priors but do not ensure synthesizability. In contrast, reaction-based methods impose hard synthesizability constraints by restricting generation to legal combinations of reaction templates and building blocks, but this comes at the cost of a vast, state-dependent action space and reduced ability to utilize large pretrained SMILES models.
S3-GFN bridges these approaches by remaining in the sequence-based SMILES generation framework while modulating the sampling distribution to favor molecules that are both high-reward and synthesizable. Specifically, S3-GFN is post-trained from a large pretrained SMILES LLM, targeting the distribution proportional to $R(x)\,\mathbb{1}[x \in \mathcal{S}]\,p_{\mathrm{prior}}(x)$, where $R$ is a user-defined reward function, $\mathcal{S}$ is the set of synthesizable molecules as checked by an external retrosynthesis oracle, and $p_{\mathrm{prior}}$ is the pretrained prior. Rather than hard-coding reactions, S3-GFN utilizes a contrastive auxiliary objective to softly suppress unsynthesizable candidates, thus maintaining scalability and the advantages of pretrained chemical knowledge.
2. MDP Formulation and GFlowNet Policy
SMILES generation is formalized as a discrete MDP where each state is a prefix of a SMILES string, with the initial state being the empty string and terminal states corresponding to complete molecules. Actions consist of appending a token from a fixed SMILES vocabulary $\mathcal{V}$.
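As a concrete toy illustration of this append-only MDP, the following sketch uses a hypothetical mini-vocabulary and an explicit stop token; the paper's actual tokenization scheme is not reproduced here:

```python
# Minimal sketch of the sequence MDP over SMILES prefixes.
# The vocabulary and <EOS> token are illustrative, not the paper's exact choices.
VOCAB = ["C", "O", "N", "=", "(", ")", "1", "<EOS>"]

def initial_state():
    return ""  # the empty SMILES prefix

def step(state, action):
    """Append one vocabulary token; <EOS> marks the state as terminal."""
    assert action in VOCAB
    if action == "<EOS>":
        return state, True   # terminal: 'state' is the complete SMILES
    return state + action, False

# Roll out a fixed action sequence to a terminal molecule (ethanol).
s, done = initial_state(), False
for a in ["C", "C", "O", "<EOS>"]:
    s, done = step(s, a)
print(s, done)  # -> CCO True
```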
The GFlowNet parameterizes two policies:
- Forward policy ($P_F(a \mid s; \theta)$): generates SMILES token sequences autoregressively, inducing a trajectory distribution $P_F(\tau; \theta) = \prod_t P_F(a_t \mid s_t; \theta)$.
- Backward policy ($P_B(s \mid s')$): typically trivial, as each prefix has a unique parent (remove the last token), so there is a unique reverse path from any fully specified SMILES string.
This architecture permits flexible sequence extension while preserving the chemical grammar and substructure statistics captured in the initial prior.
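The trajectory probability under the forward policy is simply the product of per-step token probabilities. A minimal numeric sketch, with a deterministic placeholder standing in for the real SMILES network's next-token head:

```python
import numpy as np

V = 8  # toy vocabulary size

def logits_fn(prefix_ids):
    # Placeholder for the pretrained network's next-token head:
    # deterministic pseudo-logits that depend only on the prefix length.
    return np.sin(np.arange(V) + len(prefix_ids))

def log_pf_trajectory(token_ids):
    """log P_F(tau; theta) = sum_t log softmax(logits(s_t))[a_t]."""
    logp = 0.0
    for t, a in enumerate(token_ids):
        z = logits_fn(token_ids[:t])
        logp += z[a] - np.log(np.sum(np.exp(z)))  # log-softmax at token a
    return float(logp)

lp = log_pf_trajectory([2, 5, 1, 7])
print(lp)  # a single (negative) trajectory log-probability
```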
3. Training Objectives and Contrastive Soft Constraints
S3-GFN training optimizes a series of trajectory-level objectives, central among which are:
- Trajectory-balance (TB) loss: Enforces the proportionality of the forward policy’s terminal-state marginal to the unnormalized reward,
$$\mathcal{L}_{\mathrm{TB}}(\tau) = \left( \log \frac{Z_\theta \prod_t P_F(a_t \mid s_t; \theta)}{R(x) \prod_t P_B(s_t \mid s_{t+1})} \right)^2.$$
This provides a means of direct reward-based flow matching in the absence of priors.
- Relative trajectory-balance (RTB) loss: For post-training a prior, the backward-policy term is replaced by the trajectory probability of the pretrained prior,
$$\mathcal{L}_{\mathrm{RTB}}(\tau) = \left( \log \frac{Z_\theta\, P_F(\tau; \theta)}{R(x)\, p_{\mathrm{prior}}(\tau)} \right)^2.$$
This serves to modulate (rather than override) the learned chemical distribution.
- Contrastive auxiliary loss ($\mathcal{L}_{\mathrm{aux}}$): To suppress unsynthesizable molecules, a contrastive loss penalizes instances where negatives (unsynthesizable SMILES) receive higher forward log-probability than positives (synthesizable ones),
$$\mathcal{L}_{\mathrm{aux}}(\tau^+, \tau^-) = \max\!\big(0,\, \Delta(\tau^+, \tau^-)\big),$$
where $\Delta(\tau^+, \tau^-) = \log P_F(\tau^-; \theta) - \log P_F(\tau^+; \theta)$.
- Combined replay loss: Balances RTB and contrastive loss terms within batches of positive and negative replay samples,
$$\mathcal{L}_{\mathrm{rep}} = \mathcal{L}_{\mathrm{RTB}} + \alpha\, \mathcal{L}_{\mathrm{aux}},$$
with $\alpha$ controlling the contrastive regularization strength.
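A numeric sketch of how the replay loss combines these terms on toy scalar log-probabilities (all values are illustrative, and the hinge form of the auxiliary term is an assumption consistent with the description above):

```python
# Toy computation of L_rep = L_RTB + alpha * L_aux on made-up scalars.
def l_rtb(log_pf, log_prior, log_reward, log_z):
    # Relative trajectory balance: (log Z + log P_F - log R - log p_prior)^2
    return (log_z + log_pf - log_reward - log_prior) ** 2

def l_aux(log_pf_pos, log_pf_neg):
    # Hinge-style contrastive term: penalize negatives scored above positives.
    return max(0.0, log_pf_neg - log_pf_pos)

alpha = 0.1
rtb = l_rtb(log_pf=-12.0, log_prior=-11.5, log_reward=1.0, log_z=1.6)
aux = l_aux(log_pf_pos=-12.0, log_pf_neg=-10.0)
l_rep = rtb + alpha * aux
print(rtb, aux, l_rep)
```

When negatives already score below positives, the hinge term vanishes and the replay update reduces to plain RTB.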
During on-policy sampling, trajectories classified as synthesizable are immediately post-trained with RTB; off-policy replay leverages both RTB and contrastive regularization on positive and negative buffers.
4. Replay Buffer Management and Sampling Workflow
Replay in S3-GFN is structured via two persistent buffers:
- $\mathcal{D}^+$: Stores synthesizable (positive) trajectories, prioritized by reward so that higher-reward molecules preferentially fill the buffer.
- $\mathcal{D}^-$: Contains unsynthesizable (negative) trajectories, managed with a FIFO strategy.
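A minimal sketch of the two buffer disciplines, using a min-heap for reward-prioritized eviction and a bounded deque for FIFO (capacities and SMILES strings are illustrative):

```python
import heapq
from collections import deque

class PositiveBuffer:
    """Reward-prioritized buffer: when full, evict the lowest-reward entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []  # (reward, smiles) pairs; lowest reward at the root

    def add(self, reward, smiles):
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (reward, smiles))
        elif reward > self.heap[0][0]:
            heapq.heapreplace(self.heap, (reward, smiles))  # replace the worst

d_pos = PositiveBuffer(capacity=2)
for r, smi in [(0.3, "CCO"), (0.9, "c1ccccc1"), (0.5, "CCN")]:
    d_pos.add(r, smi)  # the low-reward "CCO" is eventually evicted

d_neg = deque(maxlen=2)  # FIFO: the oldest negative falls out automatically
for smi in ["neg_1", "neg_2", "neg_3"]:
    d_neg.append(smi)

print(sorted(d_pos.heap), list(d_neg))
```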
Algorithmically, each training iteration proceeds as follows:
- On-policy sampling: Trajectories are sampled from the current forward policy $P_F(\cdot; \theta)$ and assigned to positive or negative buffers via synthesizability checks (external retrosynthesis tools, e.g., AiZynthFinder).
- On-policy update: Newly sampled positives receive the RTB-based update.
- Off-policy replay: Batches of positives (from , reward-prioritized) and negatives (from , uniform) are selected for replay, computing the combined loss for parameter updates.
To augment negative samples, mutated negatives can be generated by perturbing positives (e.g., local bond changes), enriching the diversity of failure cases.
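One simple way to realize such perturbations, sketched here as a hypothetical character-level atom swap (a real implementation would validate the mutant, e.g. with RDKit, and re-check synthesizability before adding it to the negative buffer):

```python
import random

# Illustrative token-level mutation of a positive SMILES into a candidate
# negative. The substitution alphabet below is hypothetical.
ATOMS = ["C", "O", "N", "S", "F"]

def mutate(smiles, rng):
    """Swap one atom character for a different one (a crude local perturbation)."""
    positions = [i for i, ch in enumerate(smiles) if ch in ATOMS]
    i = rng.choice(positions)
    new_ch = rng.choice([a for a in ATOMS if a != smiles[i]])
    return smiles[:i] + new_ch + smiles[i + 1:]

rng = random.Random(0)
m = mutate("CCO", rng)
print(m)  # differs from "CCO" in exactly one position
```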
5. Integration of Rich Chemical Priors
S3-GFN utilizes a large pretrained SMILES LLM (e.g., GP-MolFormer) trained on millions of molecules from datasets such as PubChem, ZINC, and ChEMBL. This encoding of chemical rules, valency, and typical substructures forms a robust initial density $p_{\mathrm{prior}}$. Post-training via the RTB objective selectively sharpens this density towards high-reward, synthesizable regions of chemical space. Thus, architectural generalization and grammar correctness inherited from massive molecular corpora are not sacrificed when optimizing for novel objectives.
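The effect of this sharpening can be seen on a toy discrete example, where a target density proportional to reward times prior, masked to synthesizable molecules, is renormalized over a made-up four-molecule universe (all numbers are fabricated for illustration):

```python
import math

# Toy target density p*(x) ∝ R(x) · 1[x ∈ S] · p_prior(x).
prior  = {"CCO": 0.4, "c1ccccc1": 0.3, "CCN": 0.2, "XBAD": 0.1}
reward = {"CCO": 1.0, "c1ccccc1": 4.0, "CCN": 2.0, "XBAD": 9.0}
synth  = {"CCO": True, "c1ccccc1": True, "CCN": True, "XBAD": False}

unnorm = {x: reward[x] * prior[x] * (1.0 if synth[x] else 0.0) for x in prior}
z = sum(unnorm.values())                 # normalizing constant
target = {x: v / z for x, v in unnorm.items()}
print(target)  # mass shifts to high-reward synthesizable molecules; XBAD gets 0
```

Note that "XBAD" keeps zero mass no matter how high its reward, which is exactly the behavior the soft contrastive constraint approximates without an explicit hard mask.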
6. Experimental Results and Comparative Performance
S3-GFN demonstrates, across multiple benchmarks, the capability to generate synthesizable molecules at or above 95% prevalence while surpassing both reaction-based and reward-shaped sequence models in task-specific metrics:
- sEH task: Yields ≈94–95% synthesizability (external check: ≈99% for Top-100 samples) and superior Top-100 binding scores compared to SynFlowNet.
- Structure-based docking (LIT-PCBA): Improves Top-100 Vina scores by 0.8–1.5 kcal/mol over RGFN, SynFlowNet, and RxnFlow; achieves 96–100% retrosynthetic accessibility versus 50–72% for reaction-based baselines.
- PMO benchmark (limited oracle calls): Surpasses reward-shaping methods (AUC Top-10 ≈0.81 vs. ≈0.50 for GSK3), approaching performance of genetic exploration and RL-augmented methods.
Rapid adaptation to shifting synthesis constraints is observed: only 100 replay steps suffice for S3-GFN to recover 88% synthesizability and maintain reward performance when introducing stricter reaction sets and pharmacological filters, whereas reward-shaping exhibits a detrimental suppression of molecule diversity (Kim et al., 4 Feb 2026).
7. Algorithmic Structure
The pseudocode governing S3-GFN training and inference is summarized below:
```
θ ← Prior SMILES network          # Initialize from large pretrained model
logZ_θ ← 0
D⁺, D⁻ ← empty buffers
for t in range(T):
    # (A) On-policy exploration
    batch = [sample_trajectory(θ) for _ in range(B)]
    positives = [τ for τ in batch if is_synthesizable(τ)]
    negatives = [τ for τ in batch if not is_synthesizable(τ)]
    add_reward_prioritized(D⁺, positives)
    add_fifo(D⁻, negatives)
    # (B) On-policy update
    L_on = mean([L_RTB(τ) for τ in positives])
    θ -= η * grad(L_on)
    # (C) Off-policy replay
    B⁺_rep = sample(D⁺, B)
    B⁻_rep = sample(D⁻, B)
    # Optionally mutate some positives to increase negative diversity
    L_aux = compute_contrastive_loss(B⁺_rep, B⁻_rep)
    L_rep = mean([L_RTB(τ) + α*L_aux for τ in B⁺_rep])
    θ -= η * grad(L_rep)
```
This iterative cycle of synthesizability-aware sampling and contrastive replay underpins the state-of-the-art balance achieved between reward optimization and the practical constraint of synthetic feasibility.