
S3-GFN: Synthesizable SMILES via GFlowNets

Updated 5 February 2026
  • S3-GFN is a novel generative framework that employs GFlowNets over SMILES strings, integrating rich chemical priors from a pretrained language model.
  • It uses contrastive soft constraints to softly suppress unsynthesizable molecules while preserving scalability and chemical validity.
  • Experimental results show S3-GFN achieves over 95% synthesizability and outperforms reaction-based methods in key drug discovery benchmarks.

S3-GFN refers to "Synthesizable SMILES via Soft-constrained GFlowNets with Rich Chemical Priors," a generative modeling framework for de novo molecular design that balances property optimization and chemical synthesizability. S3-GFN employs Generative Flow Networks (GFlowNets) in a sequence-based Markov Decision Process (MDP) over SMILES strings, integrating a pretrained SMILES LLM as a molecular prior and introducing soft synthesizability constraints through contrastive off-policy learning. This approach enables efficient exploration of chemical space while steering molecular generation toward high-reward, synthesizable compounds, outperforming both reaction-based and naively reward-shaped baselines in key drug discovery tasks (Kim et al., 4 Feb 2026).

1. Motivation and Conceptual Foundation

De novo molecular generation models in drug discovery frequently face the dual challenge of optimizing for properties such as binding affinity while ensuring the synthetic accessibility of proposed molecules. Prior sequence-based generative models (e.g., RNNs, Transformers over SMILES) leverage large molecular corpora to encode rich chemical priors but do not ensure synthesizability. In contrast, reaction-based methods impose hard synthesizability constraints by restricting generation to legal combinations of reaction templates and building blocks, but this comes at the cost of a vast, state-dependent action space and reduced ability to utilize large pretrained SMILES models.

S3-GFN bridges these approaches by remaining in the sequence-based SMILES generation framework while modulating the sampling distribution to favor molecules that are both high-reward and synthesizable. Specifically, S3-GFN is post-trained from a large pretrained SMILES LLM, targeting the distribution proportional to $R(x) \cdot p_{\text{prior}}(x) \cdot \mathbf{1}[x \in \mathcal{X}']$, where $R(x)$ is a user-defined reward function and $\mathcal{X}'$ is the set of synthesizable molecules as determined by an external retrosynthesis oracle. Rather than hard-coding reactions, S3-GFN uses a contrastive auxiliary objective to softly suppress unsynthesizable candidates, thereby maintaining scalability and the advantages of pretrained chemical knowledge.

2. MDP Formulation and GFlowNet Policy

SMILES generation is formalized as a discrete MDP in which each state $s_t$ is a prefix of a SMILES string, the initial state is the empty string, and terminal states correspond to complete molecules. Actions $a_t$ consist of appending a token from a fixed SMILES vocabulary $\mathcal{V}$.

The GFlowNet parameterizes two policies:

  • Forward policy ($P_F$): $P_F(s_{t+1} \mid s_t; \theta)$ generates SMILES token sequences, inducing a trajectory distribution $P_F(\tau; \theta) = \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t; \theta)$.
  • Backward policy ($P_B$): $P_B(s_{t-1} \mid s_t; \theta)$, which is typically trivial because there is a unique reverse path from any fully specified SMILES string.

This architecture permits flexible sequence extension while preserving the chemical grammar and substructure statistics captured in the initial prior.
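To make the sequence MDP concrete, here is a minimal sketch of a token-by-token rollout. The toy vocabulary and the uniform placeholder policy are illustrative assumptions, not the paper's model; the learned $P_F$ would be a neural network over the full SMILES token set.

```python
import random

# Toy token set for illustration; the actual model uses a full SMILES vocabulary.
VOCAB = ["C", "N", "O", "c", "1", "(", ")", "=", "<eos>"]

def sample_trajectory(policy, max_len=10, rng=random):
    """Roll out the sequence MDP: each state is a SMILES prefix,
    each action appends one token, and <eos> terminates the episode."""
    state = ""                        # initial state: the empty string
    trajectory = [state]
    for _ in range(max_len):
        probs = policy(state)         # stand-in for P_F(. | s_t; theta)
        token = rng.choices(VOCAB, weights=probs)[0]
        if token == "<eos>":
            break
        state += token
        trajectory.append(state)
    return trajectory

uniform = lambda s: [1.0] * len(VOCAB)  # placeholder policy, not the learned one
traj = sample_trajectory(uniform, rng=random.Random(0))
# Each state extends its predecessor by one token, so the backward policy
# is trivial: drop the last token of the current prefix.
```

Because every prefix has exactly one parent, $P_B$ needs no parameters here, which is why the paper treats it as trivial.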

3. Training Objectives and Contrastive Soft Constraints

S3-GFN training optimizes a series of trajectory-level objectives, central among which are:

  • Trajectory-balance (TB) loss: Enforces the proportionality of the forward policy’s marginal distribution to the unnormalized reward,

$$L_{\text{TB}}(\tau) = \left[ \log\big(Z_\theta\, P_F(\tau;\theta)\big) - \log\big(R(x)\, P_B(\tau \mid x;\theta)\big) \right]^2.$$

This enforces reward-proportional flow matching directly, without reference to any prior.

  • Relative trajectory-balance (RTB) loss: For post-training a prior, the backward policy is replaced by the trajectory probability of the pretrained prior,

$$L_{\text{RTB}}(\tau) = \left[ \log\big(Z_\theta\, P_F(\tau;\theta)\big) - \log\big(R(x)\, P_{\text{prior}}(\tau)\big) \right]^2.$$

This serves to modulate (rather than override) the learned chemical distribution.
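Both balance losses reduce to a squared gap between two log-flows. A minimal numeric sketch (hypothetical helper names; the scalar log-quantities are assumed precomputed by the model and oracle) makes the single difference between them explicit:

```python
def tb_loss(log_Z, log_pf, log_reward, log_pb):
    """Trajectory balance: squared gap between log Z + log P_F(tau)
    and log R(x) + log P_B(tau | x)."""
    return (log_Z + log_pf - (log_reward + log_pb)) ** 2

def rtb_loss(log_Z, log_pf, log_reward, log_prior):
    """Relative TB: identical form, but the pretrained prior's
    log P_prior(tau) takes the place of log P_B(tau | x)."""
    return (log_Z + log_pf - (log_reward + log_prior)) ** 2
```

At the optimum the two sides agree and the loss is exactly zero, e.g. `tb_loss(1.0, -3.0, 0.5, -2.5) == 0.0`; in training these terms would be differentiated with respect to $\theta$ and $\log Z_\theta$.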

  • Contrastive auxiliary loss ($\mathcal{L}_{\text{aux}}$): To suppress unsynthesizable molecules, a contrastive loss penalizes instances where negatives (unsynthesizable SMILES) receive higher forward log-probability than positives (synthesizable ones),

$$\mathcal{L}_{\text{aux}}(\mathcal{B}^+, \mathcal{B}^-) = - \sum_{\tau^+ \in \mathcal{B}^+} \log \left[ \frac{\exp[s(\tau^+)]}{\exp[s(\tau^+)] + \sum_{\tau^- \in \mathcal{B}^-} \exp[s(\tau^-)]} \right],$$

where $s(\tau) = \log P_F(\tau;\theta)$.
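This auxiliary term is an InfoNCE-style softmax over trajectory log-probabilities. A self-contained sketch of the formula (function names are illustrative; a log-sum-exp is used for numerical stability, since log-probabilities of long sequences are very negative):

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def contrastive_aux_loss(pos_logps, neg_logps):
    """L_aux: each positive s(tau+) = log P_F(tau+) competes in a softmax
    against the pooled negatives; the -log of the positive's share is summed.
    Algebraically: -log(e^{s+} / (e^{s+} + sum e^{s-})) = logsumexp([s+] + negs) - s+.
    """
    return sum(logsumexp([sp] + list(neg_logps)) - sp for sp in pos_logps)
```

The loss is always non-negative, vanishes as positives dominate the negatives, and equals log 2 when a lone positive and a lone negative tie, so gradient pressure concentrates on cases where unsynthesizable molecules are over-weighted.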

  • Combined replay loss: Balances RTB and contrastive loss terms within batches of positive and negative replay samples,

$$L_{\text{replay}} = \frac{1}{|\mathcal{B}^+|} \sum_{\tau^+ \in \mathcal{B}^+} \left[ L_{\text{RTB}}(\tau^+) + \alpha \cdot \mathcal{L}_{\text{aux}}(\{\tau^+\}, \mathcal{B}^-) \right],$$

with $\alpha > 0$ controlling the contrastive regularization strength.

During on-policy sampling, trajectories classified as synthesizable are immediately post-trained with RTB; off-policy replay leverages both RTB and contrastive regularization on positive and negative buffers.
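Putting the replay update together, a compact sketch of $L_{\text{replay}}$ over one batch (plain floats and hypothetical argument names; each positive carries its precomputed log-quantities, and gradients are omitted):

```python
import math

def replay_loss(log_Z, positives, neg_logps, alpha=0.1):
    """Combined replay loss: for each positive (log_pf, log_reward, log_prior),
    add its RTB term plus alpha times the contrastive term pitting the
    positive's log P_F against the batch of negative log P_F values."""
    total = 0.0
    for log_pf, log_reward, log_prior in positives:
        rtb = (log_Z + log_pf - (log_reward + log_prior)) ** 2
        aux = math.log(math.exp(log_pf) + sum(math.exp(s) for s in neg_logps)) - log_pf
        total += rtb + alpha * aux
    return total / len(positives)
```

When the RTB residual is zero and the negatives are far below the positive, the loss is driven to (nearly) zero, which is the regime the replay phase pushes the policy toward.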

4. Replay Buffer Management and Sampling Workflow

Replay in S3-GFN is structured via two persistent buffers:

  • $\mathcal{D}^+$: Stores synthesizable (positive) trajectories, prioritized by reward so that higher-reward molecules preferentially fill the buffer.
  • $\mathcal{D}^-$: Stores unsynthesizable (negative) trajectories, evicted on a FIFO basis.

Algorithmically, each training iteration proceeds as follows:

  1. On-policy sampling: Trajectories are sampled from PFP_F, assigned to positive or negative buffers via synthesizability checks (external retrosynthesis tools, e.g., AiZynthFinder).
  2. On-policy update: Newly sampled positives receive the RTB-based update.
  3. Off-policy replay: Batches of positives (from $\mathcal{D}^+$, reward-prioritized) and negatives (from $\mathcal{D}^-$, sampled uniformly) are selected for replay, and the combined loss is computed for parameter updates.

To augment negative samples, mutated negatives can be generated by perturbing positives (e.g., local bond changes), enriching the diversity of failure cases.
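The two buffers can be sketched as a bounded min-heap (reward-prioritized eviction) and a bounded deque (FIFO eviction). Class names and capacities here are illustrative, not from the paper:

```python
import heapq
from collections import deque

class PositiveBuffer:
    """D+: reward-prioritized. When full, a new trajectory evicts the
    current lowest-reward entry (a bounded min-heap keyed on reward)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []      # entries: (reward, tiebreak, trajectory)
        self._count = 0     # tiebreak so equal rewards never compare trajectories

    def add(self, reward, traj):
        self._count += 1
        item = (reward, self._count, traj)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif reward > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)

class NegativeBuffer:
    """D-: plain FIFO; a bounded deque drops the oldest entry when full."""
    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)

    def add(self, traj):
        self.queue.append(traj)

pos, neg = PositiveBuffer(capacity=2), NegativeBuffer(capacity=2)
for reward, smiles in [(0.1, "CCO"), (0.9, "c1ccccc1"), (0.5, "CCN")]:
    pos.add(reward, smiles)
for smiles in ["C#N", "CC", "O=O"]:
    neg.add(smiles)
# pos keeps the two highest-reward molecules; neg keeps the two newest.
```

The min-heap makes prioritized eviction O(log n) per insertion, while `deque(maxlen=...)` gives the FIFO discard for free.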

5. Integration of Rich Chemical Priors

S3-GFN utilizes a large pretrained SMILES LLM (e.g., GP-MolFormer) trained on millions of molecules from datasets such as PubChem, ZINC, and ChEMBL. This encoding of chemical rules, valency, and typical substructures forms a robust initial density $p_{\text{prior}}(x)$. Post-training via the RTB objective selectively sharpens this density towards high-reward, synthesizable regions of chemical space. Thus, the generalization and grammatical correctness inherited from massive molecular corpora are not sacrificed when optimizing for novel objectives.

6. Experimental Results and Comparative Performance

S3-GFN demonstrates, across multiple benchmarks, the capability to generate synthesizable molecules at or above 95% prevalence while surpassing both reaction-based and reward-shaped sequence models in task-specific metrics:

  • sEH task: Yields ≈94–95% synthesizability (external check: ≈99% for Top-100 samples) and superior Top-100 binding scores compared to SynFlowNet.
  • Structure-based docking (LIT-PCBA): Improves Top-100 Vina scores by 0.8–1.5 kcal/mol over RGFN, SynFlowNet, and RxnFlow; achieves 96–100% retrosynthetic accessibility versus 50–72% for reaction-based baselines.
  • PMO benchmark (limited oracle calls): Surpasses reward-shaping methods (AUC Top-10 ≈0.81 vs. ≈0.50 on GSK3β), approaching the performance of genetic-exploration and RL-augmented methods.

Rapid adaptation to shifting synthesis constraints is also observed: only 100 replay steps suffice for S3-GFN to recover >88% synthesizability and maintain reward performance when stricter reaction sets and pharmacological filters are introduced, whereas reward shaping detrimentally suppresses molecule diversity (Kim et al., 4 Feb 2026).

7. Algorithmic Structure

The pseudocode governing S3-GFN training and inference is summarized below:

θ ← prior SMILES network                 # initialize from the large pretrained model
log Z_θ ← 0
D⁺, D⁻ ← empty buffers

for t in range(T):
    # (A) On-policy exploration
    batch = [sample_trajectory(θ) for _ in range(B)]
    positives = [τ for τ in batch if is_synthesizable(τ)]
    negatives = [τ for τ in batch if not is_synthesizable(τ)]
    add_reward_prioritized(D⁺, positives)
    add_fifo(D⁻, negatives)

    # (B) On-policy update
    L_on = mean([L_RTB(τ) for τ in positives])
    θ -= η * grad(L_on)

    # (C) Off-policy replay
    B⁺_rep = sample(D⁺, B)
    B⁻_rep = sample(D⁻, B)
    # Optionally mutate some positives to enrich negative diversity
    L_aux = compute_contrastive_loss(B⁺_rep, B⁻_rep)
    L_rep = mean([L_RTB(τ) + α * L_aux for τ in B⁺_rep])
    θ -= η * grad(L_rep)

This iterative cycle of on-policy sampling, synthesizability-aware buffering, and contrastive replay underpins the state-of-the-art balance achieved between reward optimization and the practical constraint of synthetic feasibility.
