Mechanistic Reaction Dataset

Updated 27 December 2025

Mechanistic reaction datasets are curated collections of reactions with detailed annotations of elementary steps, including bond-breaking/forming events and electron-flow paths.
They utilize explicit atom mapping, arrow-pushing codes, and defined mechanistic classes to ensure robust, statistically balanced training and evaluation of ML models.
These datasets enable interpretable ML in reaction prediction, facilitate template design, and support advanced applications like zero-shot learning and mechanism benchmarking.

A mechanistic reaction dataset is a curated collection of chemical reactions annotated at the level of individual, physically justified elementary steps, including molecular rearrangements, bond-breaking/forming events, electron-flow paths, and transient intermediates. Unlike product-only or atom-mapping datasets, these datasets provide explicit mechanistic labels—such as arrow-pushing codes, electron-flow diagrams, or template identifiers—enabling rigorous interrogation of reaction mechanisms, mechanistically interpretable ML models, and accurate benchmarking of computational reaction prediction or planning tools.

1. Dataset Composition, Mechanistic Scope, and Class Structure

Mechanistic reaction datasets typically encode each reaction as a sequence of elementary mechanistic steps, with explicit definition and coverage of distinct mechanistic classes. For instance, the ReactAIvate dataset comprises 100,000 annotated steps across three archetypal organometallic cross-coupling cycles (Suzuki–Miyaura, Buchwald–Hartwig, Kumada), systematically partitioned into seven precisely defined step-classes, plus an eighth out-of-distribution (OOD) label (Hoque et al., 2024):

S₁: Oxidative addition: Metal insertion into C–X bonds.
S₂: Boronate formation: Coordination involving boronate anion generation.
S₃: Boron transmetallation: Transfer of organoboron fragments to metal.
S₄: Substrate coordination: Binding without bond breakage/formation.
S₅: Acid-base deprotonation: Proton abstraction steps.
S₆: Transmetallation (non-boron): RM (e.g., Grignard/organomagnesium) transfers.
S₇: Reductive elimination: Ligand coupling and catalyst regeneration.
S₈: OOD/no-reaction: Inputs not matching S₁–S₇ templates.

Each class encodes a canonical elementary mechanistic operation. The distribution is roughly uniform across S₁–S₇ (approximately 14,200 examples per class), with a smaller OOD class S₈ (~4,600), which supports balanced supervised ML training and evaluation.
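As a minimal sketch of how such a class structure might be represented and checked for balance, the snippet below encodes the eight step classes as an enumeration and computes per-class fractions. The member names, the `class_fractions` helper, and the toy labels are illustrative assumptions, not identifiers or data from ReactAIvate.

```python
from collections import Counter
from enum import IntEnum


class StepClass(IntEnum):
    """Step classes S1-S8 as listed above; member names are illustrative."""
    OXIDATIVE_ADDITION = 1
    BORONATE_FORMATION = 2
    BORON_TRANSMETALLATION = 3
    SUBSTRATE_COORDINATION = 4
    ACID_BASE_DEPROTONATION = 5
    NON_BORON_TRANSMETALLATION = 6
    REDUCTIVE_ELIMINATION = 7
    OOD_NO_REACTION = 8


def class_fractions(labels):
    """Return the fraction of examples belonging to each step class."""
    counts = Counter(StepClass(label) for label in labels)
    total = sum(counts.values())
    return {cls: counts.get(cls, 0) / total for cls in StepClass}


# Toy integer labels (hypothetical, not drawn from any dataset).
toy_labels = [1, 1, 3, 7, 7, 7, 8, 5]
fractions = class_fractions(toy_labels)
```

Inspecting `fractions` before training is one simple way to confirm that no class dominates a split.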

Other mechanistic datasets cover distinct mechanistic families. For polar chemistry, PMechDB contains ≈13,000 single-step elementary polar mechanisms (proton transfers, SN2/SN1, nucleophilic addition/elimination), manually curated from textbooks and primary literature (Miller et al., 22 Apr 2025). RMechDB provides 5,500 radical elementary steps across hydrogen-abstraction, radical addition, radical recombination, and rearrangement classes (Tavakoli et al., 2023). ReactMech (DeepMech) spans 29,604 complete reaction mechanisms (CRMs), each a multi-step sequence annotated atom-by-atom across 67 mechanistic subclasses, including polar, pericyclic, and organometallic events (Das et al., 19 Sep 2025).

2. Annotation Protocols, Mechanistic Labeling, and Data Representation

Mechanistic datasets are constructed using either manual expert curation or algorithmic application of generalized reaction templates. In ReactAIvate, step-level annotation is achieved by crafting RDKit-based reaction templates, each enforcing substructural constraints to uniquely designate a mechanistic class. Reactive atoms ("hotspots") are labeled strictly by graph difference between reactant and product: every node whose degree or bond order changes during the step receives a binary label $y_v=1$, and all other nodes receive $y_v=0$ (Hoque et al., 2024).
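The graph-difference labeling can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the ReactAIvate implementation: bonds are assumed to be stored as dictionaries keyed by pairs of atom-map numbers, a missing bond is treated as bond order 0, and the function name `hotspot_labels` is hypothetical.

```python
def hotspot_labels(reactant_bonds, product_bonds, atoms):
    """Assign y_v = 1 to atoms with a changed incident bond, else y_v = 0.

    Bonds are dicts {frozenset({a, b}): bond_order} keyed by atom-map numbers.
    A bond that is formed or broken (order 0 on one side) changes the degree
    of both end atoms, so degree changes are covered by the same comparison.
    """
    changed = set()
    for bond in set(reactant_bonds) | set(product_bonds):
        if reactant_bonds.get(bond, 0) != product_bonds.get(bond, 0):
            changed |= bond
    return {atom: int(atom in changed) for atom in atoms}


# Toy oxidative addition: Pd (1) inserts into the C (2)-Br (3) bond;
# atom 4 is an aromatic ring neighbour whose bonds are unchanged.
reactant_bonds = {frozenset({2, 3}): 1.0, frozenset({2, 4}): 1.5}
product_bonds = {
    frozenset({1, 2}): 1.0,
    frozenset({1, 3}): 1.0,
    frozenset({2, 4}): 1.5,
}
labels = hotspot_labels(reactant_bonds, product_bonds, atoms=[1, 2, 3, 4])
# labels == {1: 1, 2: 1, 3: 1, 4: 0}
```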

Mechanistic annotation protocols typically include:

Explicit atom mapping: Guaranteed one-to-one correspondence between reactant and product atoms, encoded as arrays or mapping dictionaries.
Mechanistic code or arrows: Encoding of electron movement, e.g., (source_atom, sink_atom) pairs, with arrow types (two-electron, half-arrow radical) if relevant (Neukomm et al., 5 Dec 2025, Miller et al., 22 Apr 2025). MechSMILES, for instance, compacts a mapped SMILES with a serial arrow list as a single text line.
Template or operation identifier: Each elementary transformation is mapped to a mechanistic or SMARTS template (e.g., TMOp_ID in ReactMech, step_class in ReactAIvate).
Context fields: Reaction conditions, reagents, solvents, and literature references may be retained.
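The one-to-one atom-mapping requirement above can be checked with a short script. The sketch below assumes that every atom carries a map number in the usual bracket form (for example `[CH3:1]`) and that the reaction SMILES has no agents between the `>` separators; the helper names are hypothetical.

```python
import re

# Matches the atom-map number at the end of a bracket atom, e.g. ":3]" in "[OH-:3]".
MAP_PATTERN = re.compile(r":(\d+)\]")


def map_numbers(smiles):
    """Return all atom-map numbers found in a mapped SMILES string."""
    return [int(number) for number in MAP_PATTERN.findall(smiles)]


def is_one_to_one(reaction_smiles):
    """Check that reactant and product map numbers form a bijection.

    Assumes a 'reactants>>products' string with every atom mapped.
    """
    reactants, products = reaction_smiles.split(">>")
    r_maps, p_maps = map_numbers(reactants), map_numbers(products)
    no_duplicates = len(set(r_maps)) == len(r_maps) and len(set(p_maps)) == len(p_maps)
    return no_duplicates and set(r_maps) == set(p_maps)


# Toy SN2 step: hydroxide displaces bromide from methyl bromide.
sn2 = "[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3].[Br-:2]"
sn2_is_valid = is_one_to_one(sn2)
```

A regular expression is enough for this consistency check; full validation of the chemistry would need a cheminformatics toolkit.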

Example data representations vary: CSV/JSON (row per step), SDF with mapping and annotations, graph objects (PyTorch Geometric, DGL) with attached atom/bond-level features (Hoque et al., 2024, Das et al., 19 Sep 2025).

3. Metrics, Splitting Strategies, and Benchmarking Protocols

Mechanistic datasets are engineered with strict statistical and benchmarking considerations:

Train/validation/test splits are reproducibly defined (often 70/10/20 or 80/10/10), with step classes balanced across splits to avoid class-imbalance bias (Hoque et al., 2024, Miller et al., 22 Apr 2025, Neukomm et al., 5 Dec 2025).
Metrics for model benchmarking commonly include:
- Step classification accuracy, with the associated cross-entropy loss between ground-truth and predicted step classes (Hoque et al., 2024):

$\mathcal{L}_{\rm class} = -\sum_{i} y_i \log \hat{y}_i$

- Reactive atom identification accuracy: Precision, recall, and weighted BCE loss, with per-node targets (Hoque et al., 2024).
- Complete CRM (mechanism) accuracy: Mechanism $j$ is correct only if all of its predicted steps exactly match the ground truth (Das et al., 19 Sep 2025):

$\mathrm{Acc}_{\rm CRM}^{(j)} = \begin{cases} 1 & \text{if } \hat{y}_{j,i} = y_{j,i} \;\forall\, i \\ 0 & \text{otherwise} \end{cases}$

- Top-k criteria: Mechanism or step considered correct if the true label appears among the top-k predictions (Neukomm et al., 5 Dec 2025).
OOD splits: Finer evaluation often holds out subsets of mechanistic classes for zero-shot generalization assessment (Das et al., 19 Sep 2025).
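The accuracy-style metrics listed above are straightforward to compute. The following is a minimal sketch with hypothetical function names and toy labels; it illustrates the definitions only and is not taken from any of the cited benchmarks.

```python
def step_accuracy(y_true, y_pred):
    """Fraction of individual steps whose predicted class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def crm_accuracy(true_mechanisms, predicted_mechanisms):
    """Fraction of mechanisms whose full step sequence is predicted exactly."""
    hits = sum(
        list(true) == list(pred)
        for true, pred in zip(true_mechanisms, predicted_mechanisms)
    )
    return hits / len(true_mechanisms)


def top_k_accuracy(y_true, ranked_predictions, k):
    """Fraction of items whose true label is among the k best-ranked guesses."""
    hits = sum(t in ranked[:k] for t, ranked in zip(y_true, ranked_predictions))
    return hits / len(y_true)


# Two toy mechanisms, each a sequence of step-class labels.
true_mechs = [[1, 3, 7], [1, 6, 7]]
pred_mechs = [[1, 3, 7], [1, 5, 7]]

flat_true = [step for mech in true_mechs for step in mech]
flat_pred = [step for mech in pred_mechs for step in mech]

acc_step = step_accuracy(flat_true, flat_pred)   # 5 of 6 steps correct
acc_crm = crm_accuracy(true_mechs, pred_mechs)   # 1 of 2 mechanisms correct
acc_top2 = top_k_accuracy([1, 3, 7], [[1, 2], [2, 3], [5, 6]], k=2)
```

The toy case shows why CRM accuracy is the stricter metric: a single wrong step lowers step accuracy only slightly but invalidates the whole mechanism.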

4. Mechanistic Data Schema, File Formats, and Feature Engineering

A canonical mechanistic dataset record typically includes the following fields:

Field	Example/Description	Type
id	Unique identifier, e.g. "KC_000123"	str/int
reactant_smiles	Atom-mapped reactant SMILES	str
product_smiles	Atom-mapped product SMILES	str
atom_map	Mapping: {reactant_atom → product_atom}	dict/array
step_class/ID	Integer class label (template/mechanism step index)	int
hotspot_indices	List of indices of reactive atoms	list[int]
mechanism_steps	Ordered arrow-pushing instructions [(src, sink), ...]	list[tuple]
conditions	Dict: solvent, temperature, etc.	dict (optional)
reference	Literature or database citation	str (optional)
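A record with the fields in the table above could be modeled as a small dataclass and serialized to JSON. This is a sketch under assumptions: the class name `MechanisticStep`, the example values, and the atom-pair encoding of the arrows are illustrative and do not reproduce the schema of any specific dataset.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MechanisticStep:
    """One elementary step; field names follow the table above."""
    id: str
    reactant_smiles: str
    product_smiles: str
    atom_map: dict            # reactant atom index -> product atom index
    step_class: int           # integer template / step-class label
    hotspot_indices: list     # atom-map numbers of reactive atoms
    mechanism_steps: list     # ordered (source, sink) arrow pairs
    conditions: Optional[dict] = None
    reference: Optional[str] = None

    def to_json(self):
        # Note: JSON turns integer dict keys into strings and tuples into lists.
        return json.dumps(asdict(self))


# Hypothetical record for a toy SN2 step (values are illustrative only).
record = MechanisticStep(
    id="KC_000123",
    reactant_smiles="[CH3:1][Br:2].[OH-:3]",
    product_smiles="[CH3:1][OH:3].[Br-:2]",
    atom_map={1: 1, 2: 2, 3: 3},
    step_class=1,
    hotspot_indices=[1, 2, 3],
    mechanism_steps=[(3, 1), (1, 2)],
)
loaded = json.loads(record.to_json())
```

Because JSON changes key and tuple types on a round trip, a loader would need to convert them back before comparing records.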

Feature vectors are engineered as:

Node features: 39-dimensional for graph models (atom type, charges, hybridization, aromatic flag, etc.) (Hoque et al., 2024); edge features for bond types.
Graph features: Pooled node embeddings, or attention-based "supernodes" for global information.
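To make the feature engineering concrete, the sketch below builds a small one-hot atom feature vector and a mean-pooled graph vector. The vocabularies and the resulting 12-dimensional layout are assumptions chosen for illustration; they do not reproduce the 39-dimensional featurization cited above.

```python
# Hypothetical vocabularies; a real featurizer would cover far more cases.
ELEMENTS = ["C", "N", "O", "Br", "Pd"]
HYBRIDIZATIONS = ["sp", "sp2", "sp3"]


def one_hot(value, vocabulary):
    """One-hot encode value, using a trailing slot for unknown values."""
    vector = [0.0] * (len(vocabulary) + 1)
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vector[index] = 1.0
    return vector


def atom_features(element, formal_charge, hybridization, is_aromatic):
    """Concatenate element, charge, hybridization, and aromaticity features."""
    return (
        one_hot(element, ELEMENTS)
        + [float(formal_charge)]
        + one_hot(hybridization, HYBRIDIZATIONS)
        + [float(is_aromatic)]
    )


def mean_pool(node_features):
    """Average node feature vectors column-wise into one graph-level vector."""
    count = len(node_features)
    return [sum(column) / count for column in zip(*node_features)]


carbon = atom_features("C", 0, "sp3", False)
bromide = atom_features("Br", -1, "sp3", False)
graph_vector = mean_pool([carbon, bromide])
```

Mean pooling is only one choice of readout; the attention-based "supernode" mentioned above is an alternative that learns how to weight nodes.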

File formats include CSV, JSON, SDF (+ atom maps/labels), and binary graph objects (.pt/.npz for PyTorch Geometric/DeepChem) (Hoque et al., 2024, Das et al., 19 Sep 2025).

5. Access Policies, Data Licensing, and Extension Pathways

Mechanistic reaction datasets are generally released under open-source or CC-BY licenses, with all code and data for ML pipelines and downstream analysis published on public repositories such as GitHub or HuggingFace Hub (Hoque et al., 2024, Das et al., 19 Sep 2025, Neukomm et al., 5 Dec 2025). Download and workflow scripts are provided for reproducible ML experiments, and pre-defined splits are available for benchmarking consistency.
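One way to make pre-defined splits reproducible is a seeded, class-stratified partition. The sketch below is an assumption about how this could be done, not the procedure used by any of the cited datasets; the function name, default 80/10/10 fractions, and seed are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, splitting each class separately.

    Per-class counts are truncated with int(), so any remainder goes to test.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(fractions[0] * len(indices))
        n_val = int(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Toy labels: three step classes with ten examples each.
toy_labels = [cls for cls in (1, 2, 3) for _ in range(10)]
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

Publishing the resulting index lists alongside the data, rather than only the seed, guards against differences in random number generators across environments.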

Extension is modular:

New mechanistic classes are added by authoring new SMARTS/RDKit reaction templates, generating additional examples, and updating annotation files (e.g., steps.csv) (Hoque et al., 2024).
Annotation tools, both automated (MechFinder, SMARTS template matching) and manual (graph-edit pipelines), allow extension to new reaction spaces or chemistry families.

6. Applications and Limitations

Mechanistic datasets, by providing atomistic and mechanistic ground truth, are central to:

Interpretable ML model training: Enabling GNNs, transformer architectures, and hybrid Siamese models to jointly classify mechanistic steps and localize reactive sites (Hoque et al., 2024, Miller et al., 22 Apr 2025, Das et al., 19 Sep 2025).
Benchmarking mechanism-prediction pipelines: Supporting step-sequenced tasks, arrow-pushing extraction, impurity and byproduct forecasting, and CRM retrieval (e.g., >96% step or CRM accuracy in ReactAIvate/ReactMech) (Hoque et al., 2024, Das et al., 19 Sep 2025, Neukomm et al., 5 Dec 2025).
Transfer and few-shot learning: Training on curated mechanism classes and adapting to new substrates, metals, or mechanistic motifs with limited retraining (Hoque et al., 2024, Neukomm et al., 5 Dec 2025).
Template design and extraction: Extraction of catalyst-aware templates, distinguishing regenerated from spectator species by explicit arrow accounting (Neukomm et al., 5 Dec 2025).
Mechanistic OCSR development: Testing molecular- and arrow-recognition in image-based datasets, e.g. SMiCRM (Leung et al., 2024).

Limitations include coverage bias (limited classes vs. all conceivable reactions), dependence on template quality, omission of radical or pericyclic mechanisms in some datasets, and labor-intensity of manual curation.

7. Representative Datasets and Comparative Properties

The following table provides a comparative overview of major mechanistic reaction datasets used in recent research (fields from dataset summaries):

Name	Size (steps)	Mechanism Types	Format	Highlights	Reference
ReactAIvate	100,000	Organometallic, 8 classes	CSV/JSON/graph	Graph-annotated, modular	(Hoque et al., 2024)
PMechDB	≈13,000	Polar, curated + combinatorial	JSON	Full arrow codes, mass/charge balance	(Miller et al., 22 Apr 2025)
ReactMech	104,964	67 subclasses (polar, pericyclic, organometallic)	CSV/JSON/graph	Multi-step mechanisms, TMOps	(Das et al., 19 Sep 2025)
RMechDB	5,500	Radical, atmospheric	JSON/SDF	Half-arrow (single-electron) notation, radical moves	(Tavakoli et al., 2023)
mech-USPTO-31k	114,826	Multi-step, diverse	MechSMILES/CSV	Used in LLM benchmarks	(Neukomm et al., 5 Dec 2025)
SMiCRM	453 images	Diverse mechanisms	PNG/SMILES/SDF	Mechanistic arrow OCSR test	(Leung et al., 2024)

The table illustrates the diversity in scale, mechanistic focus, annotation rigor, and intended application. For example, PMechDB emphasizes chemical validity for polar steps through full arrow codes and mass/charge balance, whereas ReactAIvate and ReactMech provide large-scale coverage of transition-metal mechanisms with precise atom mapping and stepwise decomposition.

Mechanistic reaction datasets underpin chemically realistic, interpretable ML for reaction prediction. By providing atomically resolved, mechanistically annotated corpora, they enable benchmarking and development of models capable of step-wise electron-flow reasoning, selective site identification, and mechanistic pathway discovery (Hoque et al., 2024, Miller et al., 22 Apr 2025, Das et al., 19 Sep 2025, Tavakoli et al., 2023, Neukomm et al., 5 Dec 2025).