
SynthWiki Dataset Overview

Updated 6 February 2026
  • SynthWiki Dataset is a structured, JSON-based collection of materials synthesis protocols extracted from scientific literature using advanced NLP and OCR pipelines.
  • It aggregates detailed synthesis records—including operations, quantities, and environmental conditions—from methods like hydrothermal and CVD for robust benchmarking.
  • The dataset facilitates autonomous experiment planning, retrospective analysis, and process optimization by providing machine-actionable, standardized synthesis data.

SynthWiki Dataset

The term SynthWiki dataset refers to structured, large-scale corpora of materials synthesis protocols extracted from the experimental literature, purpose-built to enable machine-readable, data-driven research in materials discovery and process optimization. SynthWiki-style datasets consolidate detailed stepwise records of published chemical syntheses, extracting critical features such as material entities, reactions, operations, quantities, environmental conditions, and literature provenance in standardized formats. These datasets support machine learning-based retrosynthesis, autonomous experiment planning, and benchmarking for domain-adapted LLMs. Representative exemplars include the Material Synthesis 2025 (MatSyn25) dataset for 2D materials (Li et al., 1 Oct 2025) and the Dataset of Solution-based Inorganic Materials Synthesis Recipes (Wang et al., 2021).

1. Origins and Scope

SynthWiki datasets emerge from large-scale, systematic mining of peer-reviewed chemical literature. The MatSyn25 dataset (Li et al., 1 Oct 2025) encompasses 163,240 synthesis process records for 2D materials, spanning 85,160 articles (2011–2025) systematically parsed via PDF processing pipelines (MinerU OCR, regex-based cleaning). The Solution-Based Inorganic Materials Synthesis Recipes (Wang et al., 2021) collects 35,675 "recipes" from >4 million published articles (2000–present), targeting solution-phase inorganic syntheses. Such initiatives prioritize depth, breadth, and machine-actionable schema, targeting major classes such as graphene, transition-metal dichalcogenides (TMDs), MXenes, layered double hydroxides (LDHs), oxides, sulfides, mixed salts, and catalysts.

2. Data Model and Record Structure

SynthWiki datasets adhere to JSON-based, information-rich schemas capturing both materials and synthesis process details as atomic, interlinked entities.

The key fields of the MatSyn25 schema (Li et al., 1 Oct 2025) are:

Field              | Content Example / Description                         | Data Level
record_id          | "MS25_000124"                                         | Unique identifier
source_publication | {DOI, title, year, authors, journal, URL}             | Reference metadata
material           | {name, formula, type, morphology, properties, safety} | Material entity
synthesis_process  | {process_name, type, precursors [chemical, amt, unit], atmosphere, pressure, steps [pretreatment, synthesis, post_processing], parameters [name, value, unit]} | Protocol details

Each synthesis_process sub-field contains discrete lists for process steps, with step-specific variables for temperature, time, reagents, and hardware. Analogously, the Solution-based Inorganic Recipes schema (Wang et al., 2021) encodes: reaction formula, target and precursor entities, operation sequences (operation type, physical conditions), solvent strings, quantities, and variable tables for stoichiometry.
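The field layout above can be checked with a small validator. This is a minimal sketch, assuming the four top-level keys from the field table are all required and that `steps` carries the three step groups named there; `REQUIRED_KEYS`, `STEP_GROUPS`, and `missing_fields` are hypothetical names, not part of any released MatSyn25 tooling.

```python
from typing import Any

# Top-level keys taken from the field table; treating all four as
# required is an assumption made for illustration.
REQUIRED_KEYS = {"record_id", "source_publication", "material", "synthesis_process"}
# Step groups named in the synthesis_process description.
STEP_GROUPS = ("pretreatment", "synthesis", "post_processing")


def missing_fields(record: dict[str, Any]) -> list[str]:
    """Return the schema paths that are absent from a record."""
    missing = sorted(REQUIRED_KEYS - set(record))
    steps = record.get("synthesis_process", {}).get("steps", {})
    missing += [f"synthesis_process.steps.{g}" for g in STEP_GROUPS if g not in steps]
    return missing


# Deliberately incomplete toy record (hypothetical, for illustration).
example = {
    "record_id": "MS25_000124",
    "material": {"name": "MoS2 nanosheets", "formula": "MoS2"},
    "synthesis_process": {"steps": {"pretreatment": [], "synthesis": []}},
}
print(missing_fields(example))
```

A check of this kind is useful before aggregation, since records extracted automatically from text may omit fields that the source paper never reported.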

3. Techniques and Distributions

Frequency distributions in MatSyn25 (Li et al., 1 Oct 2025) show that hydrothermal (≈39.98%), CVD (18.02%), solvothermal (10.12%), Hummers (8.95%), exfoliation (8.13%), and coprecipitation (6.25%) methods dominate the 2D materials synthesis records. Reported temperatures range from ambient (25 °C) up to 1200 °C (notably for CVD), and reaction times range from about 10 min (pretreatment) to routine synthesis durations of 1–4 h. By composition, graphene (41.56%), TMDs (19.73%), and MXenes (7.99%) are the most prevalent.

Synthesis Recipes (Wang et al., 2021) provide analogous statistics: hydrothermal (20,037 records) and precipitation (15,638 records) dominate; unique target materials number 5,416 and unique precursors 2,870, spanning most periodic groups with transition metals and common inorganic anions (nitrates, chlorides, sulfates) highly represented.
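Method shares like those reported above can be obtained by counting one method label per record. The sketch below assumes the label lives in `synthesis_process.type` (the `type` field from the schema table); `method_shares` is a hypothetical helper and the toy records are invented for illustration, not drawn from either dataset.

```python
from collections import Counter


def method_shares(records):
    """Return {method: percentage of records}, rounded to two decimals."""
    counts = Counter(r["synthesis_process"]["type"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {m: round(100 * n / total, 2) for m, n in counts.most_common()}


# Toy records, invented for illustration only.
toy = [
    {"synthesis_process": {"type": "hydrothermal"}},
    {"synthesis_process": {"type": "hydrothermal"}},
    {"synthesis_process": {"type": "CVD"}},
    {"synthesis_process": {"type": "exfoliation"}},
]
print(method_shares(toy))
```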

4. Pipeline and Extraction Methodology

Data acquisition for SynthWiki datasets involves multi-stage NLP and information extraction pipelines:

  • Material Synthesis 2025 (MatSyn25): Web of Science-based article curation, PDF-to-text via MinerU and OCR, regular expression cleaning, followed by automated extraction of material-process pairs. Processed text is structured into discrete protocol records with chemical and procedural metadata.
  • Solution-Based Recipes: The pipeline comprises (i) web-scraping (Borges), (ii) markup normalization (LimeSoup), (iii) paragraph classification via domain-pretrained BERT, (iv) two-stage BiLSTM-CRF for Materials Entity Recognition, (v) operation/condition extraction through RNN and SpaCy dependency parsing, and (vi) quantities extraction with NLTK syntax-tree algorithms. Chemical reaction parsing is achieved via in-house logic to map entities onto structured composition forms and reaction strings.
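To give a feel for what the quantity-extraction stage produces, here is a deliberately simplified stand-in that uses a regular expression to pull number-unit pairs from a sentence. It is not the NLTK syntax-tree algorithm described above; the unit list, the `QUANTITY` pattern, and `extract_quantities` are assumptions made for illustration.

```python
import re

# Small illustrative unit list; a real pipeline would need far more
# units and would have to attach each quantity to its material.
UNITS = r"(?:mg|g|kg|mL|L|mmol|mol|min|h|°C)"
QUANTITY = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNITS})\b")


def extract_quantities(text):
    """Return (value, unit) pairs found in the text, in order of appearance."""
    return [(float(value), unit) for value, unit in QUANTITY.findall(text)]


sentence = "Dissolve 30 mg of (NH4)2MoS4 in 10 mL DMF and heat at 200 °C for 12 h."
print(extract_quantities(sentence))
```

A pattern like this cannot tell which precursor a quantity belongs to, which is one reason syntax-aware parsing is used in the pipeline described above.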

Accuracy benchmarking yields F1 scores >0.9 for key entity types (targets, precursors, operations, reaction), with a typical extraction yield of ~15% for balanced reactions from classified paragraphs (Wang et al., 2021).
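For reference, F1 is the harmonic mean of precision and recall. The sketch below scores a set of extracted entities against a gold set using exact string matching; the matching criterion and the example entities are assumptions, since the scoring procedure is not detailed here.

```python
def f1_score(predicted, gold):
    """F1 for exact-match entity extraction over two collections of strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical precursor lists for one paragraph.
gold_precursors = {"Zn(NO3)2", "NaOH", "H2O"}
predicted_precursors = {"Zn(NO3)2", "NaOH", "ethanol"}
print(round(f1_score(predicted_precursors, gold_precursors), 3))
```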

5. Illustrative Records and Example Protocols

Sample records from MatSyn25 (Li et al., 1 Oct 2025):

  • Hydrothermal synthesis of MoS₂ nanosheets
    
    {
      "material": {
        "name": "MoS₂ nanosheets",
        "chemical_formula": "MoS2",
        "material_type": "TMD",
        "morphology": "lamellar",
        "physicochemical_properties": [
          {"property_name":"band_gap", "value":1.8, "unit":"eV"}
        ],
        "safety_precautions":["Handle Mo precursors in fume hood"]
      },
      "synthesis_process": {
        "process_name": "Hydrothermal synthesis of MoS₂",
        "precursors": [
          {"chemical":"(NH4)2MoS4","amount":30,"unit":"mg"},
          {"chemical":"GO nanosheets","amount":8,"unit":"mg"}
        ],
        "steps": {
          "pretreatment": [
            {"description":"Disperse GO in DMF", "temperature":{"value":25,"unit":"°C"}, "time":{"value":30,"unit":"min"}}
          ],
          "synthesis": [
            {"description":"Heat to 200 °C", "temperature":{"value":200,"unit":"°C"}, "time":{"value":12,"unit":"h"}}
          ],
          "post_processing": [
            {"description":"Wash with ethanol and dry", "temperature":{"value":80,"unit":"°C"}, "time":{"value":2,"unit":"h"}}
          ]
        }
      }
    }
  • CVD growth of graphene on copper
    
    {
      "material": {"name": "Graphene monolayer", "chemical_formula": "C"},
      "synthesis_process": {
        "process_name":"CVD growth of graphene",
        "precursors": [{"chemical":"CH4","amount":10,"unit":"sccm"}, {"chemical":"H2","amount":20,"unit":"sccm"}],
        "steps": {
          "pretreatment":[{"description":"Anneal Cu foil at 1000 °C","time":{"value":30,"unit":"min"}}],
          "synthesis":[{"description":"Introduce CH4/H2 and maintain 1000 °C for 20 min"}],
          "post_processing":[{"description":"Cool under Ar","time":{"value":60,"unit":"min"}}]
        }
      }
    }
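Records shaped like the two samples above can be flattened into an ordered list of steps with durations normalized to minutes. This is a minimal sketch, assuming the `steps` layout shown in the samples; `iter_steps`, `STAGES`, and `TO_MINUTES` are hypothetical names, and the embedded record is a trimmed copy of the CVD sample.

```python
import json

STAGES = ("pretreatment", "synthesis", "post_processing")
# Only the two time units seen in the samples; other units raise KeyError.
TO_MINUTES = {"min": 1, "h": 60}


def iter_steps(record):
    """Yield (stage, description, temperature, minutes) in protocol order.

    Temperature and minutes are None when a step omits those fields,
    as the synthesis step of the CVD sample does.
    """
    steps = record["synthesis_process"]["steps"]
    for stage in STAGES:
        for step in steps.get(stage, []):
            time = step.get("time")
            minutes = time["value"] * TO_MINUTES[time["unit"]] if time else None
            temperature = step.get("temperature", {}).get("value")
            yield stage, step["description"], temperature, minutes


record = json.loads("""
{"synthesis_process": {"steps": {
  "pretreatment": [{"description": "Anneal Cu foil at 1000 °C", "time": {"value": 30, "unit": "min"}}],
  "synthesis": [{"description": "Introduce CH4/H2 and maintain 1000 °C for 20 min"}],
  "post_processing": [{"description": "Cool under Ar", "time": {"value": 60, "unit": "min"}}]
}}}
""")
for row in iter_steps(record):
    print(row)
```

Note that the CVD synthesis step states its temperature and duration only in the free-text description, so a consumer that relies on the structured fields sees missing values for that step.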

6. Access, Web Platforms, and AI Applications

MatSyn25 is publicly accessible in JSON format via GitHub and HuggingFace (linked at https://matsynai.stpaper.cn/) (Li et al., 1 Oct 2025). The associated MatSyn AI system employs Qwen3-8B as its base, fine-tuned on 22,234 QA pairs via Low-Rank Adaptation, and incorporates a retrieval-augmented generation (RAG) module indexing ~200,000 knowledge segments. Evaluation metrics show BLEU-4=0.056 and ROUGE-L=0.281 on a held-out set, outperforming alternative general and domain LLMs. Web-based interfaces support material/process/literature search, intelligent synthesis Q&A, and automated summarization. Datasets are released under open licenses (e.g., CC-BY 4.0 for solution-based recipes (Wang et al., 2021)) in UTF-8 JSON format.
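ROUGE-L, one of the metrics reported above, is built on the longest common subsequence (LCS) between a generated answer and a reference. The sketch below assumes lowercase whitespace tokenization and a balanced F-measure; the exact variant behind the reported score is not stated here, and established implementations differ in tokenization, stemming, and weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Balanced F-measure over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer/reference pair.
print(rouge_l("heat the mixture at 200 C for 12 h", "heat at 200 C for 12 h"))
```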

7. Research Applications and Use Cases

SynthWiki datasets underpin retrosynthetic planning, process optimization, autonomous laboratory synthesis (when combined with robotics and closed-loop platforms), transfer learning for chemical LLMs, and higher-order knowledge graph analytics mapping material properties, conditions, and outcomes. Mining of protocol distributions enables identification of optimal parameter regimes and benchmarking of AI-generated protocols. Best practices dictate cross-verification of AI outputs with original literature and integration of safety protocols, leveraging RAG approaches to mitigate hallucinations. Together, these datasets reinforce the role of open, machine-actionable data in advancing AI-driven materials science (Li et al., 1 Oct 2025, Wang et al., 2021).
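As a small illustration of mining protocol distributions for parameter regimes, the sketch below groups synthesis temperatures by method and summarizes each group. The input rows are invented for illustration, and `temperature_regimes` is a hypothetical helper; a real analysis would first pull (method, temperature) pairs from the dataset records.

```python
from collections import defaultdict
from statistics import median


def temperature_regimes(rows):
    """Summarize (method, temperature in °C) pairs as min/median/max per method."""
    by_method = defaultdict(list)
    for method, temperature in rows:
        by_method[method].append(temperature)
    return {
        method: {"min": min(temps), "median": median(temps), "max": max(temps)}
        for method, temps in by_method.items()
    }


# Toy values, invented for illustration only.
toy_rows = [
    ("hydrothermal", 180), ("hydrothermal", 200), ("hydrothermal", 220),
    ("CVD", 1000), ("CVD", 1050),
]
for method, summary in temperature_regimes(toy_rows).items():
    print(method, summary)
```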
