ChemBART: Unified Organic Synthesis LLM

Updated 13 January 2026
  • ChemBART is a large language model for organic chemistry that uses reaction-level masked sequence reconstruction on SMILES to support diverse synthesis tasks.
  • It integrates multi-task applications such as precursor/reagent generation, reaction condition regression, and molecular property classification with experimentally validated outcomes.
  • Built on a BART-large Transformer with ~0.4 billion parameters, ChemBART achieves high accuracy in retrosynthesis and wet-lab validations for efficient synthesis planning.

ChemBART is a large language model (LLM) tailored for organic chemistry, based on the BART-large Transformer architecture and pre-trained with reaction-level masked sequence reconstruction on SMILES strings. It follows a unified “one model, one pre-training, multiple tasks” paradigm, supporting precursor/reagent generation, reaction condition regression, molecular property classification, and reinforcement-learning-driven synthesis planning. ChemBART differs from previous approaches in that a single model covers multiple downstream chemical tasks and supports end-to-end computer-aided synthesis planning, with wet-lab validation of a proposed route (Li et al., 6 Jan 2026).

1. Model Architecture and Tokenization

ChemBART uses the BART-large backbone: a 12-layer encoder and a 12-layer decoder, a hidden dimension $d = 1024$, 16 attention heads per layer, and approximately 0.4 billion parameters in total. Reactions are represented as “chemical sentences” in the following canonicalized SMILES format:

$$\text{reactant} > \text{reagent} > \text{product}$$

Tokenization is performed at the atom and mapping-index level, with one token per atom, mapping index, or SMILES punctuation mark. The vocabulary contains ∼201 tokens: atom symbols (e.g., C, O, Cl, Br), aromatic indicators, mapping/ring indices ($0$–$83$), and punctuation (e.g., [, ], @, +, -, =, #), plus the special tokens <cls>, <end>, <msk>, <pad>. Each token $x_i$ is mapped to a learned embedding $e_i \in \mathbb{R}^{1024}$, which is summed with a positional embedding.
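To make the atom-level scheme concrete, here is a minimal sketch of a regex tokenizer. The pattern is a hypothetical simplification, not ChemBART's actual tokenizer: it only treats Cl and Br as two-letter atoms, keeps digits after a colon together as one atom-map index, and emits every other character as its own token.

```python
import re

# Hypothetical pattern (an assumption, not the published vocabulary):
#   Cl|Br        two-letter halogens kept whole
#   %\d{2}       two-digit ring closures such as %12
#   (?<=:)\d+    digits directly after ':' form one atom-map index token
#   \d           any other digit (ring closure, H count, charge count)
#   [A-Za-z]     single-letter atoms, aromatic atoms, and H
#   [^\sA-Za-z\d]  brackets, bonds, '@', '+', '-', '>', ...
TOKEN_RE = re.compile(r"Cl|Br|%\d{2}|(?<=:)\d+|\d|[A-Za-z]|[^\sA-Za-z\d]")


def tokenize(reaction_smiles: str) -> list[str]:
    """Split a reaction SMILES string into atom-level tokens."""
    return TOKEN_RE.findall(reaction_smiles)


tokens = tokenize("[CH3:12]Br>>C1CC1")
print(tokens)
```

Because every non-whitespace character is consumed by exactly one token, joining the tokens reproduces the input string, which is a useful sanity check for any tokenizer of this kind.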

The self-attention mechanism uses the standard Transformer projections:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

Attention for one head:

$$A = \text{softmax}\left(QK^\top / \sqrt{d_k}\right), \quad X' = AV$$

Outputs from the 16 heads per layer are concatenated and linearly projected to $d = 1024$.
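The single-head computation above can be sketched in a few lines of dependency-free Python. This is an illustrative toy using nested lists, not the model's implementation; the helper names `matmul`, `softmax`, and `attention_head` are introduced here for illustration. If the model dimension is split evenly across heads, as is standard for BART-style models, each head would have $d_k = 1024 / 16 = 64$.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X, W_Q, W_K, W_V):
    """Return (A, X') for one head: A = softmax(QK^T / sqrt(d_k)), X' = AV."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return A, matmul(A, V)


# Toy input: two tokens with 2-dimensional embeddings and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
A, X_out = attention_head(identity, identity, identity, identity)
print(A)
```

Each row of `A` sums to 1, so every output row is a convex combination of the value vectors.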

2. Pre-training Objective and Optimization

ChemBART is pre-trained with a masked sequence prediction objective at the reaction level. For each reaction SMILES input, either the reactant, reagent, or product section is randomly masked (tokens replaced by <msk>). The decoder autoregressively generates the entire “reactant > reagent > product” string, minimizing standard cross-entropy loss:

$$L(\theta) = -\sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_{<t}, M(x)\right)$$

where $M(x)$ is the masked input and $\theta$ are the Transformer parameters.
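A minimal sketch of the masking step and the loss may help. It assumes the whole masked section is collapsed into a single `<msk>` token (the source does not say whether masking is per token or per section), and the per-token probabilities passed to `sequence_nll` are made-up numbers standing in for decoder outputs.

```python
import math

SECTIONS = ("reactant", "reagent", "product")


def mask_section(reaction: str, section: str) -> str:
    """Build M(x): replace one section of 'reactant>reagent>product' with <msk>.

    Assumption: the section is replaced by a single <msk> token.
    """
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected a 'reactant>reagent>product' string")
    parts[SECTIONS.index(section)] = "<msk>"
    return ">".join(parts)


def sequence_nll(token_probs: list[float]) -> float:
    """Cross-entropy of one sequence: -sum_t log P(x_t | x_<t, M(x))."""
    return -sum(math.log(p) for p in token_probs)


reaction = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
print(mask_section(reaction, "reactant"))
print(sequence_nll([0.5, 0.25]))  # illustrative probabilities only
```

In training, the target is always the full unmasked string, so the same loss covers reactant, reagent, and product reconstruction.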

Pre-training uses the USPTO-full reaction dataset (∼1.5M reactions) and the higher-quality USPTO-MIT subset (∼480K reactions), optimized with AdamW ([…], […]) at batch size 256, converging in ∼7 epochs.

3. Downstream Tasks

A single ChemBART checkpoint is fine-tuned with task-specific tokens and heads for various applications.

3.1 Precursor and Reagent Generation (Single-step Retrosynthesis)

Inputs are structured as product SMILES followed by > <msk> > to elicit reactant or reagent reconstruction, decoded via beam search (beam size=10).

  • Metrics: Top-$k$ accuracy (% of test cases with the correct answer among the top-$k$ generations), syntax error rate.
  • Results (USPTO-full):
    • Precursor prediction: Top-1 = 59.2%, Top-5 = 78.2%, Top-1 syntax errors = 2.43%
    • Reagent prediction: Top-1 = 54.5%, Top-5 = 74.4%, syntax errors ≈1.5%
    • These results match or exceed previous template-free models (PMSR, Molecular Transformer, Chemformer).
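As a sketch of how the top-k accuracy metric above is typically computed, the function below counts a test case as a hit when the reference string appears among the first k ranked candidates. It assumes predictions and references are already canonicalized so that exact string comparison is meaningful; the function name and toy strings are illustrative.

```python
def top_k_accuracy(predictions: list[list[str]], targets: list[str], k: int) -> float:
    """Fraction of cases whose target appears among the first k ranked candidates."""
    hits = sum(target in ranked[:k] for ranked, target in zip(predictions, targets))
    return hits / len(targets)


# Toy ranked beam outputs (placeholders, not real model predictions).
ranked_candidates = [["A", "B"], ["C", "D"], ["E", "F"]]
references = ["A", "D", "X"]
print(top_k_accuracy(ranked_candidates, references, k=1))
print(top_k_accuracy(ranked_candidates, references, k=2))
```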

3.2 Reaction Temperature and Yield Regression

Inputs are complete reaction SMILES with <end/temp> and <end/yield> task tokens. Decoder outputs at these tokens are processed by linear heads, trained using mean-squared error.

  • Metrics: RMSE, MAE, $R^2$.
  • ORD dataset (∼700K):
    • Temperature: […], MAE […] °C (range […] °C).
    • Yield: […], MAE […] (range 0–110%).
  • Suzuki–Miyaura regression (5K):
    • Ten-fold cross-validation: […], MAE = […]; comparable to rxnfp.
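For reference, the regression metrics can be computed as follows. This sketch covers RMSE and MAE and, as an assumption about the third metric, the coefficient of determination; the numbers in the example are toy values, not results from the paper.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Return RMSE, MAE, and the coefficient of determination (R^2)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```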

3.3 Molecular Property Classification

Datasets include BBBP, HIV, BACE, Tox21, ClinTox. A <cls/task> token is prepended or appended, and the corresponding encoder embedding is processed by a sigmoid-activated linear head; optionally, LoRA low-rank adaptation is applied.

  • Metric: ROC-AUC.
  • Performance (ChemBART-M, Table 2):
    • BBBP: 0.910; HIV: 0.809; BACE: 0.881; Tox21: 0.844; ClinTox: 0.866
    • LoRA improves some tasks, e.g., ClinTox up to 0.920.
    • ChemBART matches or outperforms previous SMILES-based LLMs and most graph-based approaches.
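ROC-AUC, the metric reported above, equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; it is quadratic in the number of examples and meant for illustration rather than large benchmarks.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```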

3.4 Reinforcement-Learning Policy and Value Optimization

A set of ∼12K retrosynthetic nodes, generated using ReSynZ's MCTS pipeline, is labeled with:

  • Value: discounted completion probability.
  • Policy: normalized child visit counts.

ChemBART’s two regression heads are fine-tuned to predict these labels: the value head maps a SMILES string to a scalar, and the policy head maps a reaction SMILES to a score that is softmaxed across sibling nodes.

  • Test RMSE (ChemBART-F/M): Value = 0.11, Policy = 0.16; comparable to template-based networks and superior to randomly initialized baselines.
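The policy labels and the sibling normalization described above can be sketched as follows. The function names are illustrative, and the visit counts and raw scores are toy numbers; the source does not spell out how the value label's discounting is defined, so only the policy side is shown.

```python
import math


def visit_count_policy(visits: list[int]) -> list[float]:
    """Policy target: each child's share of its parent's total visits."""
    total = sum(visits)
    return [v / total for v in visits]


def sibling_softmax(scores: list[float]) -> list[float]:
    """Turn raw policy-head scores for sibling reactions into a distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


print(visit_count_policy([6, 3, 1]))
print(sibling_softmax([2.0, 1.0, 0.0]))
```

Both outputs sum to 1 over the siblings, so predicted and target policies live on the same scale when an error such as RMSE is computed between them.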

4. Multi-Step Synthesis Planning with MCTS

ChemBART is integrated into Monte Carlo Tree Search (MCTS) for end-to-end retrosynthetic route design.

Node Expansion and Scoring

At each tree node:

  1. Beam search generates a set of candidate single-step retrosyntheses, each with an associated probability.
  2. Each candidate reaction is checked for validity and scored using the policy head.
  3. The candidate scores are normalized (formula not recoverable from the source).
  4. The node value is computed (formula not recoverable from the source).

MCTS selection employs a UCT-like rule (parameterization not explicitly provided) and backpropagates estimated values. The final root policy is computed from the children's visit counts and a temperature parameter (formula not recoverable from the source).
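Since the selection rule and root-policy formula are not spelled out here, the following is only one plausible reading: an AlphaZero-style PUCT score for selection and visit counts raised to the power 1/temperature for the root policy. Both choices, the `Child` class, and the constant `c_puct` are assumptions made for illustration, not details confirmed by the source.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    prior: float            # normalized candidate score from the policy head
    visits: int = 0
    value_sum: float = 0.0  # sum of backpropagated value estimates


def select_child(children: list[Child], c_puct: float = 1.0) -> int:
    """Index of the child maximizing Q + U (assumed AlphaZero-style PUCT)."""
    total_visits = sum(c.visits for c in children)

    def score(c: Child) -> float:
        q = c.value_sum / c.visits if c.visits else 0.0
        u = c_puct * c.prior * math.sqrt(total_visits + 1) / (1 + c.visits)
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))


def root_policy(visits: list[int], tau: float = 1.0) -> list[float]:
    """Assumed root policy: proportional to visit_count ** (1 / tau)."""
    weights = [v ** (1.0 / tau) for v in visits]
    z = sum(weights)
    return [w / z for w in weights]


print(select_child([Child(prior=0.7), Child(prior=0.3)]))
print(root_policy([8, 2], tau=1.0))
print(root_policy([8, 2], tau=0.5))
```

Under this assumed form, a lower temperature sharpens the root policy toward the most-visited child, while a temperature of 1 reproduces the plain visit-count proportions.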

Planning Performance

  • Retro*-190 (190 targets): ChemBART-F/M achieve 64.9% / 70.1% full synthesis success rates, approaching template-based planners.
  • JMC2025 (53 recent pharmaceutical targets):
    • Beam search: 88.7% success within […] steps (mean route length 4.3)
    • Top-k sampling ([…]): 84.9% success
    • Top-p sampling ([…]): 83.0% success
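For readers unfamiliar with the two sampling strategies compared above, here is a minimal sketch of how top-k and top-p (nucleus) filtering restrict a next-token distribution before sampling. The token names, probabilities, and parameter values are toy examples and are not the settings used in the paper.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    z = sum(p for _, p in kept)
    return {tok: p / z for tok, p in kept}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability prefix with cumulative mass >= p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    z = sum(q for _, q in kept)
    return {tok: q / z for tok, q in kept}


next_token = {"a": 0.5, "b": 0.3, "c": 0.2}
print(top_k_filter(next_token, k=2))
print(top_p_filter(next_token, p=0.7))
```

Unlike beam search, both filters still sample at random from the surviving tokens, which is consistent with the slightly lower success rates listed for the sampling-based decoders.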

5. Experimental and Wet-Lab Validation

ChemBART-generated multistep routes have been empirically validated. For the PD-L1/VISTA dual inhibitor P1:

  • Literature route: 6 steps, overall yield 6.5%
  • ChemBART proposal: 4 steps, overall isolated yield 35% (roughly 5× the literature yield, +28.5 percentage points), with all conditions and intermediates confirmed by standard laboratory techniques (full NMR/HRMS data provided).

The new pathway featured a convergent Suzuki coupling, efficient reductions, and optimized SNAr coupling conditions.
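The yield comparison above reduces to simple arithmetic, sketched below. The geometric-mean per-step yield is an added illustration: individual step yields are not given here, so it describes only the average a step would need to achieve for the reported overall yield.

```python
def compare_routes(steps_a: int, yield_a: float, steps_b: int, yield_b: float) -> dict[str, float]:
    """Compare two routes by overall yield (fractions between 0 and 1)."""
    return {
        "fold_change": yield_b / yield_a,
        "gain_percentage_points": (yield_b - yield_a) * 100,
        # Geometric mean: the uniform per-step yield that gives the overall yield.
        "mean_step_yield_a": yield_a ** (1 / steps_a),
        "mean_step_yield_b": yield_b ** (1 / steps_b),
    }


# Literature route: 6 steps, 6.5% overall; proposed route: 4 steps, 35% overall.
result = compare_routes(6, 0.065, 4, 0.35)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The fold change comes out near 5.4 and the absolute gain at 28.5 percentage points; the average per-step yield rises from about 63% to about 77%, so the improvement reflects both fewer steps and more efficient steps.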

6. Interpretation, Limitations, and Future Prospects

ChemBART’s reaction-level masked pre-training automatically instills fundamental chemical syntax, valency, and mechanistic knowledge. A single parameter set supports generative, regression, classification, and reinforcement-learning-based policy/value computations, reducing computational and maintenance overhead.

ChemBART demonstrates interpretable chemistry by recovering periodic and electronegativity trends in token-embedding spaces and highlighting reactive motifs in attention maps (e.g., C→Br in Grignard reactions).

Limitations: Because the input is a pure SMILES sequence, the model has limited reach into physicochemical properties that depend on 3D or electronic structure (e.g., QM9-type quantum descriptors). The risk of “hallucination” (invalid or novel outputs) increases under sampling-based decoding.

Future directions include:

  • Integration of 3D structure through graph or coordinate-based models (e.g., Uni-Mol, KFLM2)
  • Hybrid sequence-to-graph/contact-map attention modules
  • Further RL-based improvement via policy-gradient or offline MCTS retraining
  • Conditional precursor generation modulated by reaction class or functional groups

In conclusion, reaction-centric pre-training on SMILES data enables ChemBART to function as a versatile foundation model for organic synthesis, providing unified, experimentally validated solutions for retrosynthesis, property prediction, and AI-driven planning (Li et al., 6 Jan 2026).
