TCM-Specific Knowledge Graph
- TCM-KG is a semantic network that organizes traditional Chinese medicine entities, relations, and ontologies drawn from classical texts.
- It employs specialized NLP techniques, CRF models, and graph neural networks to extract and map complex entity relationships.
- TCM-KG supports practical applications such as diagnostic support, drug discovery, and formula recommendation through advanced reasoning and embedding methods.
A Traditional Chinese Medicine-Specific Knowledge Graph (TCM-KG) is a multi-modal, multi-relational semantic network encoding the entities, relationships, and ontological structures of the TCM domain for computational tasks such as question answering, diagnosis support, drug discovery, and formula recommendation. Construction of TCM-KGs encompasses specialized NLP techniques for classical Chinese, graph neural networks, embedding-based link prediction, and integration with downstream reasoning or generation architectures. The distinctive features of TCM-KGs compared to general medical knowledge graphs arise from domain-specific entity types (herbs, formulas, acupoints, syndromes, properties), complex relational schemas (formula composition, compatibility, diagnostic patterns), and intensive reliance on curated classical sources, expert annotation, and interoperability with modern biomedical ontologies.
1. Ontology, Schema, and Entity-Typing
TCM-KGs employ curated ontologies to codify core biological, clinical, and theoretical concepts unique to TCM. Canonical entity classes in major TCM-KGs include Herb (药物), Formula (方剂), Symptom (证候), Disease (病名), Property (性味, 归经), Meridian (经络), Acupoint, Therapy, DiagnosticPattern, Organ, TreatmentMethod, Practitioner, Book/Section/Chapter, and Ingredient (Zhao et al., 2024, He et al., 28 Apr 2025, Zhang et al., 2024, Liu et al., 2023). These are typically supplemented by virtual or auxiliary nodes representing medicinal properties (e.g., therapeutic nature, flavor, meridian tropism) and chemical-level entities (natural products, compounds, targets) to support multi-component and multi-target interaction modeling (Zeng et al., 2024).
Relation schemas are heterogeneous and domain-adapted, including:
- composes (Formula → Herb)
- treats (Herb/Formula → Symptom/Disease)
- contraindicates (Herb → Symptom/Disease)
- belongs_to (Herb → Meridian)
- has_property (Herb → Property)
- compatibility/incompatibility (Herb-Herb)
- indicated_for (Prescription → Disease)
- manifests_symptom (Disease → Symptom)
- meridian_association (Herb/Formula → Meridian)
- has_ingredient (Herb → Ingredient)
- part_of (Book:Section:Chapter)
- category_of (entity → top-level category) (Zhao et al., 2024, He et al., 28 Apr 2025, Zhang et al., 2024)
Schema design adheres to explicit domain–range constraints, and relies on property graphs (typically Neo4j, sometimes RDF/Jena) for instance-level storage, with nodes labeled by class/type and relationships carrying attributes such as provenance, dosage, evidence sentences, or frequency counts (Zhao et al., 2024, He et al., 28 Apr 2025, Zhang et al., 2024).
Table: Representative Entity and Relation Types in TCM-KG
| Entity Types | Relation Types | Notes |
|---|---|---|
| Herb, Formula | composes, treats | Core TCM medical semantics |
| Symptom, Disease | contraindicates, manifests_symptom | Links for diagnosis and contraindication |
| Acupoint, Meridian | belongs_to, located_on | Utilized in acupoint and meridian mapping |
| Property, Ingredient | has_property, has_ingredient | For multi-level interactions |
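
The domain–range constraints described above can be illustrated with a small validity check. This is a minimal sketch, not the schema of any cited system: the `Triple` type, the `SCHEMA` dictionary, and the `is_valid` helper are hypothetical names, and only a subset of the listed relations is transcribed.

```python
from typing import Dict, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str


# Relation -> (allowed head types, allowed tail types).
# Transcribed from the relation list in this section; the dictionary layout is an assumption.
SCHEMA: Dict[str, Tuple[Set[str], Set[str]]] = {
    "composes": ({"Formula"}, {"Herb"}),
    "treats": ({"Herb", "Formula"}, {"Symptom", "Disease"}),
    "contraindicates": ({"Herb"}, {"Symptom", "Disease"}),
    "belongs_to": ({"Herb"}, {"Meridian"}),
    "has_property": ({"Herb"}, {"Property"}),
    "has_ingredient": ({"Herb"}, {"Ingredient"}),
    "manifests_symptom": ({"Disease"}, {"Symptom"}),
}


def is_valid(triple: Triple) -> bool:
    """Return True if the relation is known and both endpoint types are allowed."""
    if triple.relation not in SCHEMA:
        return False
    domain, range_ = SCHEMA[triple.relation]
    return triple.head_type in domain and triple.tail_type in range_


ok = Triple("麻黄汤", "Formula", "composes", "麻黄", "Herb")
bad = Triple("麻黄", "Herb", "composes", "麻黄汤", "Formula")
print(is_valid(ok), is_valid(bad))
```

Rejecting a triple at this stage, before it reaches the graph store, keeps direction errors such as a herb "composing" a formula out of the instance data.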
2. Data Extraction, Curation, and NLP for Classical Chinese
Construction of a TCM-KG fundamentally depends on transforming unstructured classical Chinese medical texts into structured graphs. Extraction begins with word segmentation and named-entity recognition (NER) using conditional random fields (CRF) with specialized feature engineering: character n-grams (uni-, bi-, tri-), Kangxi radicals, POS tags, BMES segmentation labels, and gazetteer matching from TCM dictionaries. Zhao et al. (2024) apply a two-level tag schema, with BMES for segmentation and BIO (B-Herb, I-Herb, etc.) for entity typing, trained on a 20k-sentence corpus from canonical TCM texts. The reported CRF performance is Precision ≈ 92.3%, Recall ≈ 90.8%, F₁ ≈ 91.5%.
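The character-level features and BIO tagging described above can be sketched as follows. This is an illustrative sketch only: `char_features` and `decode_bio` are hypothetical helpers, the feature set covers n-grams and gazetteer matching but omits Kangxi radicals, POS tags, and BMES labels, and no CRF training is shown.

```python
from typing import Dict, List, Optional, Set, Tuple


def char_features(sent: str, i: int, gazetteer: Set[str]) -> Dict[str, object]:
    """Features for the character at position i, of the kind a CRF tagger consumes."""
    return {
        "bias": 1.0,
        "uni": sent[i],                          # character unigram
        "bi_prev": sent[max(i - 1, 0): i + 1],   # bigram ending at i
        "bi_next": sent[i: i + 2],               # bigram starting at i
        "tri": sent[max(i - 1, 0): i + 2],       # trigram centered on i
        "BOS": i == 0,
        "EOS": i == len(sent) - 1,
        # Gazetteer match: does any dictionary term start at this character?
        "gaz_start": any(sent.startswith(term, i) for term in gazetteer),
    }


def decode_bio(chars: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse a BIO tag sequence into (surface form, entity type) spans."""
    spans: List[Tuple[str, str]] = []
    current, current_type = "", None  # type: str, Optional[str]
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((current, current_type))
            current, current_type = ch, tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current += ch
        else:
            if current:
                spans.append((current, current_type))
            current, current_type = "", None
    if current:
        spans.append((current, current_type))
    return spans


sentence = "麻黄治发热"
tags = ["B-Herb", "I-Herb", "O", "B-Symptom", "I-Symptom"]
entities = decode_bio(sentence, tags)
```

In a full pipeline the feature dictionaries would be passed, one per character, to a CRF library, and `decode_bio` would be applied to the predicted tag sequence.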
Following entity recognition, TF–IDF weighting filters high-confidence key terms for further processing, with thresholds (e.g., top K=50 per document, τ=0.02) optimizing recall of significant medical entities (Zhao et al., 2024). Syntactic relation extraction utilizes neural network-based dependency parsers (biaffine architecture; BiLSTM encoders; pretrained TCM embeddings), outputting trees with labeled arcs (nsubj, dobj, etc.) between extracted entities.
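The key-term filter above can be sketched as a per-document TF-IDF ranking with a top-K cut and a score floor. This is a minimal sketch under assumptions: the TF-IDF variant used here (relative term frequency times the natural log of inverse document frequency) is not stated in the source, and `tfidf_keep` is a hypothetical helper name. Only the default values K=50 and τ=0.02 come from the text.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keep(docs: List[List[str]], top_k: int = 50, tau: float = 0.02) -> List[List[str]]:
    """For each document, keep up to top_k terms whose TF-IDF score is at least tau."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept: List[List[str]] = []
    for doc in docs:
        if not doc:
            kept.append([])
            continue
        counts = Counter(doc)
        scores: Dict[str, float] = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Rank by descending score; ties are broken by the term itself for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        kept.append([t for t in ranked[:top_k] if scores[t] >= tau])
    return kept


docs = [
    ["麻黄", "桂枝", "发热", "麻黄"],
    ["桂枝", "芍药", "发热"],
    ["人参", "发热"],
]
selected = tfidf_keep(docs)
```

In this toy corpus 发热 occurs in every document, so its inverse document frequency is zero and it falls below τ, which shows how the floor discards terms that do not discriminate between documents.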
Rule-based and LLM-augmented extraction methods supplement graph population. In OpenTCM (He et al., 28 Apr 2025), DeepSeek and Kimi LLMs, driven by customized prompts that specify JSON output, are combined with human expert annotation for robust triple extraction. In HBot (Zhang et al., 2024), BERT+CRF NER and bidirectional GRU relation classifiers augment the core rule-based extraction of acupoint and prescription information.
3. Graph Construction, Storage, and Quality Assessment
Instance-level graph construction proceeds via triple generation from NER and dependency outputs, normalization of entity nomenclature (controlled vocabularies), and coreference resolution to cluster entity mentions (Zhao et al., 2024). Relations are mapped according to target predicates, and the graph is injected into property-graph databases (Neo4j) or semantic triple stores (Jena).
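The triple generation and name normalization steps above can be sketched as follows. This is an illustrative sketch: the `CANONICAL` vocabulary, the `PREDICATE` mapping, and the `build_triples` helper are hypothetical, and coreference resolution is reduced here to a simple alias lookup.

```python
from typing import Dict, Iterable, List, Tuple

Mention = Tuple[str, str]  # (surface form, entity type)

# Hypothetical controlled vocabulary: variant spelling -> canonical name.
CANONICAL: Dict[str, str] = {"麻黃": "麻黄", "薑": "生姜"}

# Hypothetical mapping from (head type, tail type) to a target predicate.
PREDICATE: Dict[Tuple[str, str], str] = {
    ("Formula", "Herb"): "composes",
    ("Herb", "Symptom"): "treats",
}


def normalize(name: str) -> str:
    """Map a mention to its canonical name, leaving unknown names unchanged."""
    return CANONICAL.get(name, name)


def build_triples(pairs: Iterable[Tuple[Mention, Mention]]) -> List[Tuple[str, str, str]]:
    """Turn linked mention pairs into deduplicated (head, predicate, tail) triples."""
    triples = set()
    for (head, head_type), (tail, tail_type) in pairs:
        predicate = PREDICATE.get((head_type, tail_type))
        if predicate is None:
            continue  # no target predicate for this type pair
        triples.add((normalize(head), predicate, normalize(tail)))
    return sorted(triples)


pairs = [
    (("麻黃", "Herb"), ("发热", "Symptom")),  # variant spelling
    (("麻黄", "Herb"), ("发热", "Symptom")),  # canonical spelling, same fact
    (("发热", "Symptom"), ("麻黄", "Herb")),  # no predicate defined for this direction
]
triples = build_triples(pairs)
```

Normalizing before deduplication means the two spellings of the same herb collapse into a single node rather than producing parallel edges.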
Quality evaluation relies on manual gold-standard annotation. For example, Zhao et al. (2024) annotate 500 triples to compute Precision/Recall/F₁; their error analysis highlights segmentation errors for formula entities and synonym extraction failures. OpenTCM (He et al., 28 Apr 2025) samples 1,795 triples across 600 chapters and uses consensus annotation from 5 TCM experts. Precision, Recall, Accuracy, and Mean Expert Score (MES) are used to evaluate both graph construction and downstream retrieval, where MES averages the relevance scores that experts assign to retrieval results.
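The triple-level metrics can be computed by comparing extracted triples against a gold set. This is a minimal sketch, assuming exact-match comparison of triples and assuming that MES is the arithmetic mean of expert-assigned relevance scores (a reading of its name; the rating scale is not specified here). The function names and toy data are hypothetical.

```python
from typing import List, Set, Tuple

TripleT = Tuple[str, str, str]


def prf1(predicted: Set[TripleT], gold: Set[TripleT]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of predicted triples under exact match."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_expert_score(scores: List[float]) -> float:
    """Arithmetic mean of expert relevance scores (assumed definition of MES)."""
    return sum(scores) / len(scores)


gold = {
    ("麻黄汤", "composes", "麻黄"),
    ("麻黄汤", "composes", "桂枝"),
    ("麻黄", "treats", "发热"),
    ("桂枝", "treats", "恶寒"),
}
predicted = {
    ("麻黄汤", "composes", "麻黄"),
    ("麻黄汤", "composes", "桂枝"),
    ("麻黄", "treats", "发热"),
    ("麻黄", "treats", "恶寒"),  # spurious triple
}
precision, recall, f1 = prf1(predicted, gold)
mes = mean_expert_score([4.0, 5.0, 3.0])
```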
Graph statistics generally reveal power-law degree distributions (for herb–disease and formula–herb links), with sizes ranging from tens of thousands of triples (HBot: 48,633 triples, 12 entity classes) through hundreds of thousands (OpenTCM: 152,754 edges, 48,406 entities) to several million (SCEIKG: 344,092 entities, 4.3M triples) (He et al., 28 Apr 2025, Zhang et al., 2024, Liu et al., 2023).
4. Graph Representation, Embedding, and Learning Algorithms
TCM-KG representation extends from symbolic triples to continuous embeddings suitable for link prediction, compatibility modeling, and integration with neural reasoning.
In SCEIKG (Liu et al., 2023), the Interaction KG (IKG) is learned via heterogeneous attention-based GNNs and completed with a combination of TransR and RotatE embeddings. Each triple receives a score for which lower values denote more plausible triples, and the loss is a margin-based ranking loss over positive and negative triples. The graph update integrates each node's own embedding with those of its neighbors. The final KG representations serve both the completion and prescription recommendation tasks.
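The scoring functions themselves are not written out in this section. For orientation, the standard TransR and RotatE scores from the knowledge-graph embedding literature, together with a generic margin-based ranking loss, take the form below. How SCEIKG combines the two scores is not specified here, so the notation ($e_h$, $e_r$, $e_t$ for head, relation, and tail embeddings, $W_r$ for the relation-specific projection, $\gamma$ for the margin, $S$ and $S'$ for positive and corrupted triples) should be read as an assumption rather than as the paper's own formulation.

```latex
% TransR: entities are projected into the relation space by W_r
f_{\mathrm{TransR}}(h, r, t) = \lVert W_r e_h + e_r - W_r e_t \rVert_2^2

% RotatE: the relation acts as an element-wise rotation in complex space, |e_{r,i}| = 1
f_{\mathrm{RotatE}}(h, r, t) = \lVert e_h \circ e_r - e_t \rVert

% Margin-based ranking loss over positive triples S and corrupted triples S'
\mathcal{L} = \sum_{(h,r,t) \in S} \; \sum_{(h',r,t') \in S'}
  \max\bigl(0,\; \gamma + f(h,r,t) - f(h',r,t')\bigr)
```

Both scores are distances, which matches the convention that lower values indicate more plausible triples; the loss pushes each positive triple to score at least $\gamma$ lower than its corrupted counterpart.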
Graph neural architectures employed include:
- Graph Transformer Networks: global self-attention over node features (Zeng et al., 2024)
- Hypergraph Neural Networks: propagation via hyperedge incidence matrices for high-order compatibility (Zeng et al., 2024)
- Graph Attention Networks: layered multi-head attention for interpretability and ablation (Zeng et al., 2024)
- Planned extensions: TransE-style embeddings to infer new links (Zhang et al., 2024)
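Of the architectures listed above, the attention step of a graph attention layer is the one most directly tied to interpretability, so a single-head version is sketched here. This is a simplified sketch, not the configuration used in the cited work: the learned linear projection is omitted (treated as the identity), self-loops are added implicitly, the graph is assumed to contain no explicit self-loops, and the feature vectors, neighbor lists, and attention vector are toy values.

```python
import math
from typing import Dict, List, Tuple


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_attention(
    features: Dict[str, List[float]],
    neighbors: Dict[str, List[str]],
    attn: List[float],
) -> Tuple[Dict[str, List[float]], Dict[str, Dict[str, float]]]:
    """Single-head graph attention: returns updated features and attention weights."""
    outputs: Dict[str, List[float]] = {}
    weights: Dict[str, Dict[str, float]] = {}
    for node, h_i in features.items():
        hood = [node] + neighbors.get(node, [])  # the node itself plus its neighbors
        # Unnormalized logit: LeakyReLU(a . [h_i || h_j])
        logits = [
            leaky_relu(sum(a * x for a, x in zip(attn, h_i + features[j])))
            for j in hood
        ]
        # Numerically stable softmax over the neighborhood
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        alphas = [e / total for e in exps]
        weights[node] = dict(zip(hood, alphas))
        # Attention-weighted sum of neighborhood features
        outputs[node] = [
            sum(a * features[j][d] for a, j in zip(alphas, hood))
            for d in range(len(h_i))
        ]
    return outputs, weights


features = {"麻黄": [1.0, 0.0], "桂枝": [0.0, 1.0], "甘草": [1.0, 1.0]}
neighbors = {"麻黄": ["桂枝", "甘草"], "桂枝": ["麻黄"], "甘草": ["麻黄"]}
attn = [0.5, 0.5, 1.0, -1.0]  # toy attention vector, length = 2 * feature dimension
outputs, weights = gat_attention(features, neighbors, attn)
```

The per-neighbor weights returned alongside the updated features are the quantities an attention-based interpretability analysis would inspect.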
Feature engineering combines origin (taxonomy, phylogeny), property (Word2Vec embeddings), efficacy, compatibility, and dosage encodings (Zeng et al., 2024). Models are evaluated with AUROC, precision, and recall, and interpreted through attention weights and feature ablation: the GAT reaches AUC > 0.75 for TCM categories, and ablation highlights the compatibility and medicinal-property subfeatures.
5. Retrieval-Augmented Generation, Reasoning, and Visualization
Several TCM-KGs incorporate knowledge-driven retrieval-augmented generation (RAG) and semantic query answering. In ZhiFangDanTai (Zhang et al., 6 Sep 2025) and OpenTCM (He et al., 28 Apr 2025), GraphRAG combines vector-based query expansion, local and global community retrieval via MapReduce, and LLM-based subgraph-to-text synthesis for recommendation and explanation. Community detection (Leiden) organizes entities into fine-grained communities covering diseases, formulas, herbs, symptoms, diagnoses, contraindications, and preparation.
Formally, each community receives a retrieval score with respect to the query, and the retrieved community-level results are then combined in a global synthesis step. The LLM prompt for subgraph fusion covers the TCM theory dimensions: roles of herbs, efficacy, contraindications, diagnostic signs, and preparation.
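The exact retrieval score is not reproduced in this section. As a rough illustration of community-level retrieval, the sketch below ranks communities by cosine similarity between a query embedding and a community summary embedding. Cosine similarity, the function names, and the toy vectors are all assumptions, not the scoring used by ZhiFangDanTai or OpenTCM.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_communities(
    query_vec: List[float],
    community_vecs: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every community against the query and return the k best."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in community_vecs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


# Toy community summary embeddings (hypothetical values).
communities = {
    "herbs": [1.0, 0.0, 0.0],
    "symptoms": [0.0, 1.0, 0.0],
    "contraindications": [0.7, 0.7, 0.0],
}
query = [1.0, 0.2, 0.0]
best = top_communities(query, communities, k=2)
```

In a GraphRAG-style pipeline, the summaries of the selected communities would then be passed to the LLM for the synthesis step.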
Cypher (Neo4j) and SPARQL-style queries provide graph traversal and reasoning:
```cypher
// Herbs linked by a TREAT relationship to symptoms whose name contains '发热' (fever)
MATCH (h:Herb)-[r:TREAT]->(s:Symptom)
WHERE s.name CONTAINS '发热'
RETURN h, r, s
```
Visualization modules export dependency trees (JSON), force graphs, and entity-color-coded diagrams (D3.js, ECharts). In HBot (Zhang et al., 2024), acupoint nodes include 3D spatial coordinates, enabling real-time highlighting on Blender-derived meshes via three.js transformations, with direct application in clinical training and chatbot interfaces.
6. Applications, Generalization, and Future Directions
TCM-KGs support comparative analysis of treatment strategies, automated formula recommendation, ingredient retrieval, compatibility mechanism quantification, and integrative drug discovery pipelines. Case studies include COVID-19 formula classification and key herb/pathway identification by attention-weight analysis (Zeng et al., 2024), multi-condition sequential prescription modeling with state transfer via LSTM (Liu et al., 2023), and explainable generation of prescription details (herb roles, contraindications, diagnostic signs) by fine-tuned LLMs (Zhang et al., 6 Sep 2025).
Generalization to broader TCM schools requires expansion of annotated corpora from diverse texts (Shang Han, Wen Bing, etc.), embedding-based cross-text alignment for synonymy, enrichment of the ontology with modules for dosage, preparation, and pharmacokinetics, and integration with ICD-11/TCMLS for clinical interoperability (Zhao et al., 2024, Zeng et al., 2024). Proposed future work includes active learning for error correction, graph embedding for link prediction, and enhanced interpretability via attention-based ablation and node masking experiments.
A plausible implication is that further progress will depend on unified data standards and hybrid symbolic-neural frameworks for leveraging both classical and modern biomedical knowledge resources, positioning TCM-KG as a bridge between ancient theory and contemporary computational biology.
7. Representative Research and Public Resources
Key research contributions:
- CRF-based classical Chinese extraction, neural dependency parsing, reproducibility pipeline (Zhao et al., 2024)
- Interaction Knowledge Graph and sequential modeling for prescription recommendation (Liu et al., 2023)
- Multi-layered TCM–biomedical integration and compatibility quantification via graph AI (Zeng et al., 2024)
- RAG-based reasoning and LLM fine-tuning for formula recommendation and explanation (Zhang et al., 6 Sep 2025, He et al., 28 Apr 2025)
- Property graph design for acupoint mapping and 3D clinical visualization (Zhang et al., 2024)
Public datasets, codebases, and demonstration videos cited include GitHub repositories (e.g., https://github.com/ZENGJingqi/GraphAI-for-TCM), HuggingFace models (e.g., https://huggingface.co/tczzx6/ZhiFangDanTai1.0), Zenodo archives, and three.js visualizations for acupoint highlighting.
These resources collectively underpin reproducible, extensible TCM-KG research and enable direct application in clinical, diagnostic, and drug discovery domains.