ChemToolDataset: Chemical LM Training Benchmarks
- ChemToolDataset is a large-scale resource of agent–environment interaction trajectories designed to benchmark chemical language models using in silico chemistry tools.
- It features 55,000 annotated trajectories generated via a containerized sandbox, integrating 14 cheminformatics tools and a soft-greedy exploration policy for objective-driven model training.
- The dataset supports rigorous evaluation through metrics like validity, novelty, and Success@k, advancing research in molecular design and computer-aided synthesis planning.
ChemToolDataset is a large-scale resource of agent–environment interaction trajectories developed to train and evaluate chemical LMs within the ChemCRAFT multi-tool orchestration framework. The dataset encapsulates thousands of demonstration trajectories wherein a chemical agent, instantiated as an LLM, sequentially invokes in silico chemistry tools, processes molecular and synthetic tasks, and receives outcome-based rewards. ChemToolDataset targets critical challenges in cheminformatics, molecular design, and computer-aided synthesis planning by providing high-fidelity records of tool-driven reasoning and execution episodes (Li et al., 25 Jan 2026).
1. Scope and Structure of ChemToolDataset
ChemToolDataset comprises 55,000 annotated trajectories generated in a sandboxed environment integrating 14 distinct cheminformatics tools. Each trajectory records the stepwise decisions of a language-model agent operating on molecules represented by SMILES strings. Four objective categories anchor the dataset’s diversity: maximizing quantitative estimate of drug-likeness (QED), synthesizing valid three-step reaction routes, proposing molecules with novel scaffolds, and improving aqueous solubility (lowering logP).
At each trajectory step, the agent observes the current state (molecule, properties, history), selects a tool and specifies arguments, receives results and a scalar reward, and iterates until an objective threshold is met, a maximum of 10 steps is reached, or an invalid molecule is generated. Seed molecules are sampled from a pool of 60,000 drug-like compounds. The dataset is partitioned into training (44,000 trajectories), validation (5,500), and test (5,500) splits, supporting rigorous benchmarking and generalization studies.
2. Trajectory Representation and Schema
Each ChemToolDataset record is serialized as a JSON object with specific fields:
- trajectory_id: Unique string identifier (e.g., "CTD_2024_012345").
- seed_smiles: The starting molecule in SMILES notation.
- objective: One of {“maximize_QED”, “synthesize_route”, “novel_scaffold”, “improve_logP”}.
- steps: An ordered array of step objects, each containing:
  - observation: SMILES string and properties (e.g., logP, QED, molecular weight).
  - action: Selected tool and argument dictionary (e.g., tool_name: "reaction_enumerator", tool_args: {"reagent": "Br2"}).
  - result: Output SMILES, success flag, updated properties.
  - reward: Scalar feedback for the step.
- final_smiles: The last molecule in the trajectory.
- cumulative_reward: Sum of all stepwise rewards.
An example entry:

```json
{
  "trajectory_id": "CTD_2024_012345",
  "seed_smiles": "CCOc1ccccc1C(=O)O",
  "objective": "maximize_QED",
  "steps": [
    {
      "observation": {"smiles": "CCOc1ccccc1C(=O)O", "properties": {"QED": 0.42}},
      "action": {"tool_name": "mutate_scaffold", "tool_args": {"pattern": "c1ccccc1"}},
      "result": {"new_smiles": "CCOc1ccccc1C(=O)N(C)C", "success_flag": true, "new_properties": {"QED": 0.47}},
      "reward": 0.05
    }
    // ...
  ],
  "final_smiles": "CCOc1ccccc1C(=O)N(C)C",
  "cumulative_reward": 0.35
}
```
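To illustrate how records following this schema can be consumed, here is a minimal loader that parses a trajectory and checks that cumulative_reward equals the sum of stepwise rewards. The `load_trajectory` helper is hypothetical, written against the field names documented above rather than any published API:

```python
import json

def load_trajectory(raw: str) -> dict:
    """Parse a serialized trajectory record and sanity-check its reward
    bookkeeping: cumulative_reward should equal the sum of step rewards."""
    traj = json.loads(raw)
    recomputed = sum(step["reward"] for step in traj["steps"])
    assert abs(recomputed - traj["cumulative_reward"]) < 1e-6
    return traj

# A single-step record in the documented schema (illustrative values).
record = """
{
  "trajectory_id": "CTD_2024_012345",
  "seed_smiles": "CCOc1ccccc1C(=O)O",
  "objective": "maximize_QED",
  "steps": [
    {"observation": {"smiles": "CCOc1ccccc1C(=O)O", "properties": {"QED": 0.42}},
     "action": {"tool_name": "mutate_scaffold", "tool_args": {"pattern": "c1ccccc1"}},
     "result": {"new_smiles": "CCOc1ccccc1C(=O)N(C)C", "success_flag": true,
                "new_properties": {"QED": 0.47}},
     "reward": 0.35}
  ],
  "final_smiles": "CCOc1ccccc1C(=O)N(C)C",
  "cumulative_reward": 0.35
}
"""
traj = load_trajectory(record)
print(traj["objective"])  # maximize_QED
```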
3. Data Generation Pipeline
Trajectory generation is performed in a containerized sandbox comprising the following subsystems:
- Tool library: 14 in silico tools for SMILES canonicalization, reaction enumeration, property prediction (logP, QED), retrosynthesis ranking, scaffold extraction, ring count analysis, and molecule validity checks.
- Memory module: Tracks the current state, including molecule, properties, and tool-call history.
- Reward engine: Computes scalar, objective-driven feedback with both property-improvement and structure-similarity components (SMILES-GRPO).
The agent, based on ChemGPT-base, operates under a soft-greedy exploration policy (ϵ = 0.1), selecting tools and corresponding arguments at each step. Trajectory termination is triggered by achieving the objective, exceeding 10 steps, or generating an invalid SMILES.
Trajectory construction proceeds as follows:
- Draw a seed molecule and assign an objective.
- For up to 10 steps:
- Agent selects tool and arguments.
- Sandbox executes tool, returns new state and reward.
- If termination condition is met, record the full trajectory.
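The rollout loop above can be sketched as follows. The `agent` and `sandbox` objects are hypothetical interfaces (the stub classes exist only to make the control flow runnable); the real pipeline backs them with ChemGPT-base and the containerized tool library:

```python
MAX_STEPS = 10  # termination bound from the generation protocol

def generate_trajectory(agent, sandbox, seed_smiles, objective):
    """Roll out one trajectory: the agent picks a (tool, args) pair, the
    sandbox executes it and returns (result, reward, done)."""
    state = {"smiles": seed_smiles, "history": []}
    steps, total = [], 0.0
    for _ in range(MAX_STEPS):
        tool, args = agent.select_action(state, objective)
        result, reward, done = sandbox.execute(tool, args, state)
        steps.append({"observation": dict(state),
                      "action": {"tool_name": tool, "tool_args": args},
                      "result": result, "reward": reward})
        total += reward
        state = {"smiles": result["new_smiles"],
                 "history": state["history"] + [tool]}
        if done or not result["success_flag"]:
            break  # objective met or invalid molecule generated
    return {"seed_smiles": seed_smiles, "objective": objective, "steps": steps,
            "final_smiles": state["smiles"], "cumulative_reward": total}

# Minimal stubs so the loop is executable end-to-end.
class StubAgent:
    def select_action(self, state, objective):
        return "noop_tool", {}

class StubSandbox:
    def execute(self, tool, args, state):
        return {"new_smiles": state["smiles"], "success_flag": True}, 0.1, False
```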
4. Reward Functions and Agent Policies
Stepwise rewards are governed by the SMILES-GRPO (Generalized Reward for Property Optimization) function, integrating both property change and structure similarity:

$$R_t = \alpha\,\Delta p_t + \beta\,\mathrm{sim}(s_t, s^{*})$$

- $\Delta p_t$ is the change in the target property (e.g., $\Delta\mathrm{QED}$ or $\Delta\log P$).
- $\mathrm{sim}(s_t, s^{*})$ calculates generalized token-overlap between two SMILES via

$$\mathrm{sim}(x, y) = \frac{\sum_{n} |G_n(x) \cap G_n(y)|}{\sum_{n} |G_n(x)|},$$

where $G_n(x)$ is the multiset of all $n$-length substrings in $x$, and $s^{*}$ is the target scaffold. The weighting parameters $\alpha$ and $\beta$ balance the property-improvement and similarity terms.

During data generation, the agent applies a soft-greedy Q-policy:

$$\pi(a \mid s) = \begin{cases} \arg\max_{a'} Q(s, a') & \text{with probability } 1 - \epsilon \\ \text{uniform over tools} & \text{with probability } \epsilon \end{cases}$$

Here, $Q(s, a)$ is estimated via Monte Carlo rollouts in the sandbox.
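A minimal sketch of the token-overlap similarity and the soft-greedy selection rule described above. The n-gram normalization here is one plausible reading of the generalized token-overlap (the exact SMILES-GRPO form is defined in the source), and both function names are illustrative:

```python
import random
from collections import Counter

def ngram_similarity(x: str, y: str, n_max: int = 4) -> float:
    """Token-overlap between two SMILES strings: for each n-gram length,
    count overlapping substrings, normalize by the n-gram count of x,
    and average over n. Illustrative normalization, not the canonical one."""
    scores = []
    for n in range(1, n_max + 1):
        gx = Counter(x[i:i + n] for i in range(len(x) - n + 1))
        gy = Counter(y[i:i + n] for i in range(len(y) - n + 1))
        total = sum(gx.values())
        if total:
            scores.append(sum((gx & gy).values()) / total)
    return sum(scores) / len(scores) if scores else 0.0

def soft_greedy(q_values: dict, epsilon: float = 0.1) -> str:
    """Soft-greedy policy: argmax tool with probability 1 - epsilon,
    uniformly random tool with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)
```

With `epsilon = 0.1` as in the generation setup, the agent exploits the current Q-estimates 90% of the time and explores otherwise.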
5. Preprocessing, Filtering, and Annotation Procedures
Rigorous preprocessing and quality control protocols are integral:
- Steps generating invalid or unparsable SMILES (identified via RDKit) are flagged "success_flag=false" and receive a negative reward.
- Trajectories with fewer than two valid steps or a cumulative_reward ≤ 0 are discarded.
- Duplication filtering collapses multiple identical (seed and full sequence) trajectories, retaining only the version with the highest cumulative reward.
- Automated annotation includes tool-usage statistics (call frequency, per-tool success rates) and a binary “novelty_flag” indicating whether the final molecule falls outside the original 60,000 seed pool.
This suggests a strong emphasis on both trajectory novelty and action diversity within the training resource.
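The filtering and deduplication rules above can be sketched as a single pass over the raw trajectory pool. `filter_and_dedupe` is a hypothetical helper mirroring the stated rules, not part of a published API:

```python
def filter_and_dedupe(trajectories):
    """Drop trajectories with fewer than two valid steps or non-positive
    cumulative reward, then collapse duplicates (same seed, same tool
    sequence), keeping the copy with the highest cumulative reward."""
    kept = {}
    for t in trajectories:
        valid_steps = [s for s in t["steps"] if s["result"]["success_flag"]]
        if len(valid_steps) < 2 or t["cumulative_reward"] <= 0:
            continue
        key = (t["seed_smiles"],
               tuple(s["action"]["tool_name"] for s in t["steps"]))
        if key not in kept or t["cumulative_reward"] > kept[key]["cumulative_reward"]:
            kept[key] = t
    return list(kept.values())
```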
6. Applications and Evaluation Protocols
ChemToolDataset is engineered for the development and evaluation of hybrid chemical LLMs across multiple scientific tasks:
- Imitation Learning & Supervised Pretraining: LMs are fine-tuned to predict the next tool_action given preceding steps. Evaluation includes accuracy@k on withheld validation sets.
- Reinforcement Learning from Demonstrations (RLfD): Initial policies are further optimized using advantage actor-critic methods and the SMILES-GRPO reward, with benchmarks tested on property maximization (average ΔQED, validity rate) and synthetic planning (percent valid three-step routes matching ground truth).
- Metrics:
- Validity (RDKit-passing fraction)
- Novelty (portion of final_smiles not in training seeds)
- Improvement (average property change)
- Success@k (objective achieved in ≤k steps)
- Retrosynthesis accuracy (top-1/top-3 matching rates vs. expert-curated reaction sequences)
| Metric | Definition | Benchmark/Use |
|---|---|---|
| Validity | Fraction of RDKit-valid SMILES in outputs | Chemical structure quality |
| Novelty | Fraction of final_smiles outside the original seed set | Scaffold innovation |
| Success@k | Proportion achieving objective in k or fewer steps | Planning efficiency |
The rigor and multi-task stratification underscore utility in quantitative, data-driven benchmarking of LLM tool orchestration for molecular design and synthesis tasks.
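A minimal sketch of computing validity, novelty, and Success@k over a batch of trajectories. The `is_valid` callable is a placeholder for an RDKit parse check (`Chem.MolFromSmiles(smi) is not None`), and the `objective_met` field is a hypothetical per-trajectory annotation assumed for illustration:

```python
def evaluate(trajectories, seed_pool, k=5, is_valid=lambda smi: bool(smi)):
    """Compute batch-level benchmark metrics:
    - validity: fraction of final molecules passing the validity check
    - novelty:  fraction of final molecules outside the seed pool
    - success@k: fraction meeting the objective in <= k steps"""
    n = len(trajectories)
    validity = sum(is_valid(t["final_smiles"]) for t in trajectories) / n
    novelty = sum(t["final_smiles"] not in seed_pool for t in trajectories) / n
    success_at_k = sum(t.get("objective_met", False) and len(t["steps"]) <= k
                       for t in trajectories) / n
    return {"validity": validity, "novelty": novelty, f"success@{k}": success_at_k}
```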
7. Context and Significance
ChemToolDataset inaugurates a trajectory-centric paradigm for chemical LLM training, distinct from passive data curation or static reaction examples. It both enables and necessitates models with fine-grained decision-making, tool-integration, and planning capabilities, decoupling knowledge storage from procedural chemical reasoning. Within the context of ChemCRAFT, the dataset supports the development of locally deployable, cost-effective, and privacy-preserving LMs capable of advanced molecular analysis and synthesis planning—surpassing certain cloud-based large-scale models in several performance domains (Li et al., 25 Jan 2026).
A plausible implication is that ChemToolDataset’s structure and annotation could serve as a reference framework for similar agentic datasets in other domains where LM-driven tool orchestration and outcome-guided reasoning are critical.