KG-Grounded LLMs for Synthesis Planning
- The paper presents a framework that transforms natural language synthesis queries into precise Cypher code using reaction knowledge graphs.
- Prompting with semantically aligned exemplars significantly improves single- and multi-step retrieval accuracy by correcting over 90% of directionality errors.
- The CoVe validator-corrector loop enhances query executability by detecting and rectifying syntax and schema-level errors in generated queries.
KG-grounded LLMs for synthesis planning represent an emerging paradigm in computational chemistry, integrating natural language processing with structured reaction knowledge graphs (KGs) to enable precise and up-to-date retrieval for synthesis, retrosynthesis, and related planning tasks. This approach frames the generation of KG queries as a Text2Cypher problem, harnessing LLMs’ representational power while counteracting hallucination and factual drift through explicit grounding in reaction data (Bunkova et al., 22 Jan 2026).
1. Reaction Knowledge Graph and Text2Cypher Task Formulation
Let G = (V, E) denote the bipartite reaction knowledge graph, where the node set V consists of Molecule nodes and Reaction nodes. The edge set E is a mix of typed relations (REACTS_IN, PRODUCES, USES_AGENT, USES_SOLVENT, etc.) linking these entities, with directed edges distinguishing functional roles and process directionality. The core task is to learn a mapping f : Q → C transforming a natural-language synthesis question q ∈ Q into Cypher code c = f(q), where C defines the Cypher query space. Execution is conducted via exec(c, G), returning a result set that is evaluated against the reference answer set for retrieval accuracy, provided c is valid (syntactic and semantic constraints are satisfied).
Single-step retrieval tasks require queries about one-hop molecular contexts, demanding pattern matching for all relevant reaction edges. Multi-step queries invoke retrosynthetic or synthetic path search with a prescribed number of alternating Molecule and Reaction nodes, enforcing path length and node type alternation.
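To make the two task families concrete, the following sketch builds illustrative Cypher patterns of the kind described above. The node labels (`Molecule`, `Reaction`) and edge types (`REACTS_IN`, `PRODUCES`) follow the schema listed in the text; the exact query shapes and the `$target` parameter are assumptions for illustration, not the paper's templates.

```python
# Illustrative (hypothetical) Cypher patterns for the two retrieval families.

# Single-step: one-hop context around a target product molecule.
single_step = (
    "MATCH (m:Molecule)-[:REACTS_IN]->(r:Reaction)-[:PRODUCES]->(p:Molecule) "
    "WHERE p.smiles = $target RETURN m.smiles"
)

def multi_step_pattern(n_steps: int) -> str:
    """Build a path with n_steps Reaction hops; Molecule/Reaction
    alternation and path length are enforced by construction."""
    parts = ["(m0:Molecule)"]
    for i in range(n_steps):
        parts.append(
            f"-[:REACTS_IN]->(r{i}:Reaction)-[:PRODUCES]->(m{i + 1}:Molecule)"
        )
    return (
        "MATCH p = " + "".join(parts)
        + f" WHERE m{n_steps}.smiles = $target RETURN p"
    )
```

Generating the multi-step pattern programmatically guarantees the node-type alternation that, per the evaluation below, LLMs frequently get wrong when writing such paths freehand.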
2. Prompting Regimes and Exemplar Selection
Evaluations encompass four prompting strategies spanning the in-context learning regime, with all experiments run using GPT-4.1-mini. These include:
- Zero-Shot (ZS): System prompt plus user question, no exemplars.
- One-Shot Static (1S-S): A fixed canonical example, statically chosen as the most common template, prepended to each prompt.
- One-Shot Random (1S-D-R): A single exemplar randomly sampled per prompt from a curated example bank (six examples for single-step, four for multi-step).
- One-Shot Semantic (1S-D-S): Semantically aligned exemplars selected via sentence embedding (all-mpnet-base-v2, SMILES-masked) and cosine similarity, thresholded at 0.93.
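The 1S-D-S selection step can be sketched as follows. The real system embeds SMILES-masked questions with all-mpnet-base-v2; here toy vectors stand in for those embeddings, and the fallback to no exemplar below the 0.93 threshold is an assumption about how the threshold is applied, not a detail from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_exemplar(query_vec, bank, threshold=0.93):
    """Return the id of the most similar exemplar at or above the
    threshold, else None (no exemplar prepended).  `bank` maps
    exemplar id -> embedding vector."""
    best_id, best_sim = None, threshold
    for ex_id, vec in bank.items():
        sim = cosine(query_vec, vec)
        if sim >= best_sim:
            best_id, best_sim = ex_id, sim
    return best_id
```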
Five prompt templates with incremental constraints (schema inclusion, pattern order, directional arrows, yield filters, strict RETURN clauses, and SMILES strings) are applied per retrieval scenario.
| Prompting Regime | Exemplar Selection | Single-Step Retrieval F1 | Multi-Step Retrieval F1 |
|---|---|---|---|
| ZS | None | ≈ 0.65 | ≈ 0.32 |
| 1S-S | Static, canonical | ≈ 0.82 | ≈ 0.60 |
| 1S-D-R | Random | ≈ 0.85 | ≈ 0.63 |
| 1S-D-S | Semantic | ≈ 0.88 | ≈ 0.68 |
Notably, the inclusion of a single, structurally aligned exemplar corrects endpoint anchoring and directionality errors in over 90% of multi-step queries.
3. Checklist-Based Validator–Corrector Loop
The checklist-oriented validator/corrector pipeline ("CoVe loop," Editor's term) augments candidate query generation. Following initial LLM output, a syntax check is executed in Neo4j (`EXPLAIN` on the candidate query). Detected syntax errors trigger targeted correction cycles via the LLM (up to three iterations). A checklist validator then flags common schema- and clause-level errors (e.g., missing edge roles, incorrect arrow direction, span omissions, or lack of a proper RETURN structure). If checklist violations remain after three total modification attempts, the last candidate is executed.
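A minimal sketch of the loop's control flow, assuming a small illustrative subset of the checklist. `syntax_check` stands in for Neo4j's `EXPLAIN` (returning an error string or `None`) and `llm_correct` for the LLM correction call; both interfaces are hypothetical.

```python
import re

MAX_ATTEMPTS = 3  # cap on total modification attempts, per the paper

def checklist_violations(query: str) -> list:
    """Flag common clause-level errors (illustrative subset only)."""
    issues = []
    if "RETURN" not in query.upper():
        issues.append("missing RETURN clause")
    # An undirected pattern like (a)-[:R]-(b): no '<' before, no '>' after.
    if re.search(r"(?<!<)-\[[^\]]*\]-(?!>)", query):
        issues.append("undirected relationship arrow")
    return issues

def cove_loop(query: str, syntax_check, llm_correct) -> str:
    """Validator-corrector sketch: syntax errors take priority,
    then checklist violations; each round asks the LLM to repair."""
    for _ in range(MAX_ATTEMPTS):
        err = syntax_check(query)
        issues = [err] if err else checklist_violations(query)
        if not issues:
            return query                    # passes all checks: execute
        query = llm_correct(query, issues)  # targeted correction cycle
    return query                            # best effort after 3 attempts
```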
Application of CoVe yields a +20 percentage point (pp) increase in executability in the ZS regime for single-step tasks, whereas its benefit is marginal (<3 pp) in any one-shot setting with good exemplars, since the checklist validator misses roughly 90% of the remaining errors.
4. Evaluation Protocol and Quantitative Benchmarks
Experiments are conducted on a 50,000-reaction USPTO subset, preprocessed with ORDerly and ingested into Neo4j. Task suite: 1200 single-step queries (six question types, 200 each) and 1200 multi-step queries (four types, L ∈ {2,3,4}, 300 each). Metrics include:
- Surface similarity: BLEU, METEOR, ROUGE-L (range [0,1]); used for query string comparison.
- Retrieval accuracy: Precision, recall, and F1 for reactants/products/agents/solvents (single-step); exact- and partial-path metrics (multi-step).
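The execution-grounded retrieval metrics compare the result set returned by the generated query against the reference set. A minimal sketch over sets of SMILES strings (the zero-handling convention for empty sets is an assumption):

```python
def retrieval_f1(retrieved: set, reference: set):
    """Precision, recall, and F1 between an executed query's result set
    and the reference set (e.g., reactants for a single-step task)."""
    if not retrieved or not reference:
        return 0.0, 0.0, 0.0
    tp = len(retrieved & reference)          # true positives: overlap
    precision = tp / len(retrieved)
    recall = tp / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Because this metric is computed on executed results rather than query strings, it can diverge sharply from BLEU/ROUGE, which is exactly the discrepancy noted below.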
A salient observation is the weak correlation between surface-matching metrics (BLEU ∼0.75 for both 1S-D-R and 1S-D-S) and execution-grounded retrieval F1 (which differs by ~3 pp in favor of 1S-D-S). This underscores the inadequacy of text-based proxy metrics for factual retrieval tasks.
5. Principal Results and Implications for Synthesis Planning
The most substantial performance gain arises from transitioning from zero-shot to one-shot prompting, with aligned exemplars mitigating major schematic errors and restoring directionality in complex queries. Among one-shot methods, embedding-based semantic selection (1S-D-S) achieves optimal retrieval intent alignment and consistently improves execution F1 by 3–5 pp relative to random selection.
CoVe delivers pronounced improvements solely in the absence of good exemplars, indicating that well-constructed demonstrations outweigh iterative correction in retrieval-augmented workflows. Validator coverage, limited by schema awareness, is a bottleneck for further gains.
In practical terms, dynamic exemplar selection and prompt engineering are more determinative for KG-LLM integration than the sophistication of downstream correction. Textual overlap should be supplanted by execution-based metrics for robust evaluation. The outlined pipeline functions as a low-cost “grounder” module, adaptable to retrosynthetic planning agents for verifiable and up-to-date information retrieval (Bunkova et al., 22 Jan 2026).
6. Limitations and Prospective Research Avenues
The current benchmark confines itself to a single LLM (GPT-4.1-mini) and a USPTO KG subset, leaving open questions regarding transferability to larger, denser graphs and other model families. The validator’s generic checklist is insensitive to numerous domain-specific errors, such as duplicated entities or incomplete subgraphs, constraining the efficacy of CoVe in complex schemas. Additionally, the study is limited to in-context learning; neither domain-adaptive fine-tuning nor reinforcement learning (RL) is explored.
Planned extensions include evaluating alternative LLMs (e.g., Llama 3, Claude), scaling to full USPTO or analogous KGs, and developing schema-aware or learned validation frameworks to deepen self-correction. Integration of Text2Cypher modules into complete search/planning algorithms (e.g., Monte Carlo Tree Search, beam search) and exploration of multi-shot or chain-of-thought contexts offer pathways to improved synthesis planning and agent robustness.
7. Code, Data Resources, and Reproducibility
The complete reproducible evaluation suite, including codebase, prompt templates, and data splits, is openly available at https://github.com/Intelligent-molecular-systems/KG-LLM-Synthesis-Retrieval, supporting further exploration of KG-grounded LLMs in computational synthesis and related domains (Bunkova et al., 22 Jan 2026).