Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity

Published 25 Nov 2025 in cs.AI | (2511.19925v1)

Abstract: Evaluating the open-form textual responses generated by LLMs typically requires measuring the semantic similarity of the response to a (human generated) reference. However, there is evidence that current semantic similarity methods may capture syntactic or lexical forms over semantic content. While benchmarks exist for semantic equivalence, they often suffer from high generation costs due to reliance on subjective human judgment, limited availability for domain-specific applications, and unclear definitions of equivalence. This paper introduces a novel method for generating benchmarks to evaluate semantic similarity methods for LLM outputs, specifically addressing these limitations. Our approach leverages knowledge graphs (KGs) to generate pairs of natural-language statements that are semantically similar or dissimilar, with dissimilar pairs categorized into one of four sub-types. We generate benchmark datasets in four different domains (general knowledge, biomedicine, finance, biology), and conduct a comparative study of semantic similarity methods including traditional natural language processing scores and LLM-as-a-judge predictions. We observe that the sub-type of semantic variation, as well as the domain of the benchmark impact the performance of semantic similarity methods, with no method being consistently superior. Our results present important implications for the use of LLM-as-a-judge in detecting the semantic content of text. Code is available at https://github.com/QiyaoWei/semantic-kg and the dataset is available at https://huggingface.co/datasets/QiyaoWei/Semantic-KG.


Summary

The paper presents a scalable, KG-guided framework that bypasses expensive human annotation by using controlled semantic perturbations.
The paper details a four-stage pipeline—subgraph sampling, targeted perturbation, LLM-driven generation, and textual validation—to ensure semantically faithful output.
The paper’s evaluation reveals that while LLM-based methods excel at relational perturbations, embedding techniques perform better for entity-level changes, highlighting domain adaptation challenges.

Semantic-KG: Knowledge Graph-Guided Benchmarking for Semantic Similarity

Introduction

Semantic similarity evaluation is central to assessing LLM-generated text, especially where surface correspondence (syntactic or lexical overlap) diverges from actual semantic content. Prevailing methods such as BLEU, ROUGE, and embedding-based similarity serve as approximate surrogates but remain vulnerable to superficial cues and do not guarantee semantic alignment. Constructing semantic similarity benchmarks is typically bottlenecked by the need for domain expertise, the cost of annotation, and subjective definitions of "semantic equivalence".

"Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity" (2511.19925) introduces a scalable, domain-adaptable methodology for constructing precise benchmarks for semantic similarity, bypassing expensive human judgments by leveraging knowledge graphs as formal ontological scaffolds for controlled semantic perturbations.

Framework: Automated Benchmark Construction via Knowledge Graph Perturbation

The Semantic-KG framework consists of four mechanistic stages: (1) subgraph sampling, (2) targeted perturbation, (3) LLM-driven statement generation, and (4) KG-centered textual validation.

After a diverse subgraph is sampled from a domain knowledge graph, a controlled perturbation is applied—node/edge removal or replacement—which induces defined semantic deviation. LLMs then synthesize natural language statements from these subgraphs, yielding pairs with known degrees and types of semantic divergence. Output statements are rigorously validated by reconstructing the originating subgraph (post-lemmatization and normalization) from the generated text to retain only semantically faithful instances.
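The perturbation stage can be illustrated with a minimal sketch over a subgraph stored as (head, relation, tail) triples. The function names, the triple representation, and the choice to drop every triple touching a removed node are assumptions made for illustration; they are not taken from the authors' code.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def nodes_of(subgraph: List[Triple]) -> List[str]:
    """Return the sorted set of entities appearing in the subgraph."""
    return sorted({n for h, _, t in subgraph for n in (h, t)})


def remove_edge(subgraph: List[Triple], rng: random.Random) -> List[Triple]:
    """Edge removal: drop one randomly chosen triple."""
    victim = rng.randrange(len(subgraph))
    return [t for i, t in enumerate(subgraph) if i != victim]


def replace_edge(subgraph: List[Triple], relations: List[str],
                 rng: random.Random) -> List[Triple]:
    """Edge replacement: swap the relation of one triple for a different one.

    Raises IndexError if `relations` offers no alternative relation.
    """
    idx = rng.randrange(len(subgraph))
    h, r, t = subgraph[idx]
    new_r = rng.choice([c for c in relations if c != r])
    return subgraph[:idx] + [(h, new_r, t)] + subgraph[idx + 1:]


def remove_node(subgraph: List[Triple], rng: random.Random) -> List[Triple]:
    """Node removal (assumed semantics): drop every triple touching the node."""
    node = rng.choice(nodes_of(subgraph))
    return [(h, r, t) for h, r, t in subgraph if node not in (h, t)]


def replace_node(subgraph: List[Triple], entities: List[str],
                 rng: random.Random) -> List[Triple]:
    """Node replacement: substitute one entity with one not in the subgraph.

    Raises IndexError if `entities` contains no entity outside the subgraph.
    """
    present = nodes_of(subgraph)
    old = rng.choice(present)
    new = rng.choice([e for e in entities if e not in present])
    return [(new if h == old else h, r, new if t == old else t)
            for h, r, t in subgraph]
```

Each function returns a new list and leaves the input untouched, so the original and perturbed subgraphs can both be passed to the statement-generation step.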

Figure 1: A schematic overview of the Semantic-KG pipeline, explicating sampling, perturbation, generation, and rigorous validation against knowledge graph structure.

This pipeline is applicable to any domain with a high-quality KG and obviates the need for ambiguous or expensive human-annotated similarity labels.

Task Formulation and Datasets

Given the generated corpus, the central task is to predict the binary semantic similarity (equivalent vs. nonequivalent) for a pair of statements. Positive pairs are derived from distinct generations using the same subgraph; negative pairs are those where one member is induced from a perturbed subgraph.
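The pairing rule above can be sketched as follows. The `Pair` record and `build_pairs` helper are hypothetical names, and pairing every original statement with every perturbed statement is an assumption; the paper may sample pairs differently.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Pair:
    text_a: str
    text_b: str
    label: int                   # 1 = equivalent, 0 = non-equivalent
    perturbation: Optional[str]  # None for positive pairs


def build_pairs(original_texts: List[str],
                perturbed_texts: Dict[str, List[str]]) -> List[Pair]:
    """Build labelled pairs for one sampled subgraph.

    original_texts: distinct generations from the unperturbed subgraph.
    perturbed_texts: perturbation type -> generations from the perturbed subgraph.
    """
    pairs: List[Pair] = []
    # Positive pairs: two different statements of the same subgraph.
    for a, b in combinations(original_texts, 2):
        pairs.append(Pair(a, b, 1, None))
    # Negative pairs: one original statement, one perturbed statement.
    for ptype, texts in perturbed_texts.items():
        for a in original_texts:
            for b in texts:
                pairs.append(Pair(a, b, 0, ptype))
    return pairs
```

Keeping the perturbation type on each negative pair is what later allows results to be broken down by sub-type.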

Figure 2: The task schematic: positive pairs derive from a single subgraph, negative pairs from an original and perturbed variant, stratified by perturbation type.

Semantic-KG provides instantiations across four domains:

Codex: General knowledge (Wikidata-based)
Oregano: Biomedicine
FinDKG: Finance
Globi: Biological interactions

Experimental Evaluation and Insights

The evaluation covers three families of methods: LLM-as-a-judge, state-of-the-art sentence embedding models, and traditional NLP metrics. LLM-based methods make a direct binary similarity judgment; embedding models compute cosine similarity between the two statements; and classic metrics such as ROUGE and BLEU are converted to binary predictions by thresholding at the F1-optimal cutoff.
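As a rough illustration of how a continuous score becomes a binary prediction, the sketch below computes cosine similarity and then searches for the cutoff that maximizes F1. The helper names are hypothetical, and selecting the threshold by exhaustive search over observed scores on a labelled split is an assumption about the procedure, not a description of the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """F1 for the positive (equivalent) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(scores: List[float],
                      labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing F1 when predicting 1 for score >= threshold."""
    best_t, best_f1 = scores[0], -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The same thresholding step applies to any scalar score, whether it comes from ROUGE, BLEU, or an embedding cosine similarity.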

Performance is measured globally and stratified by perturbation type and dataset domain.
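One way to compute the stratified scores is sketched below. Because positive pairs carry no perturbation type, this sketch assumes each stratum combines all positive pairs with the negatives of a single type; that choice, and the `stratified_f1` name, are illustrative assumptions rather than the paper's documented procedure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[int, int, Optional[str]]  # (label, prediction, perturbation type)


def f1(labels: List[int], preds: List[int]) -> float:
    """F1 for the positive class, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def stratified_f1(records: Iterable[Record]) -> Dict[str, float]:
    """F1 per perturbation type: shared positives plus that type's negatives."""
    records = list(records)
    positives = [(y, p) for y, p, k in records if k is None]
    by_type = defaultdict(list)
    for y, p, k in records:
        if k is not None:
            by_type[k].append((y, p))
    scores = {}
    for k, negatives in by_type.items():
        rows = positives + negatives
        scores[k] = f1([y for y, _ in rows], [p for _, p in rows])
    return scores
```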

Figure 3: F1 Score performance by perturbation type—a clear performance heterogeneity depending on the type of semantic edit applied.

LLM-based methods generally outperform non-LLM methods on edge-level perturbations (relationship modifications), but not consistently on node-level (entity set) changes. Notably, for node removals and replacements, embedding approaches (e.g., Sentence-T5-Base) often outperform LLMs, as confirmed by significant interaction effects: for Sentence-T5-Base, node-removal ($\beta=0.260$, $p=0.002$); for GPT-4o, edge-replacement ($\beta=0.184$, $p=0.031$).

When results are stratified by domain, performance degradation is prominent for specialized datasets, particularly the Globi and FinDKG cases, where even high-performing LLMs can falter, and embedding methods (BERTScore) sometimes display superior robustness.

Figure 4: Cross-domain F1 Scores for semantic similarity methods, evidencing robustness variance and domain-adaptation challenges.

Dataset Validation

The authors further validated sampled data for factual correctness (manual spot-check: 99% correctness) and assessed "naturalness" (75% passing rate) via human annotation. A subgraph-to-text-to-subgraph reconstruction metric was established, with validated outputs filtered for exact triple-match equivalence, ensuring the benchmark’s structural precision.
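The exact triple-match filter can be sketched as a set comparison after normalization. The sketch assumes the triples have already been extracted from the generated text by some upstream step, and it uses only lowercasing and whitespace collapsing as a stand-in for the lemmatization and normalization the pipeline applies; `normalize` and `is_faithful` are hypothetical names.

```python
import re
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def normalize(term: str) -> str:
    """Simplified normalization: lowercase, trim, collapse whitespace.

    A full pipeline would also lemmatize; that step is omitted here.
    """
    return re.sub(r"\s+", " ", term.strip().lower())


def normalize_triples(triples: Iterable[Triple]) -> Set[Triple]:
    """Normalize every component of every triple and deduplicate."""
    return {(normalize(h), normalize(r), normalize(t)) for h, r, t in triples}


def is_faithful(source: Iterable[Triple],
                reconstructed: Iterable[Triple]) -> bool:
    """Keep a statement only if its reconstructed triples match the source exactly."""
    return normalize_triples(source) == normalize_triples(reconstructed)
```

Comparing sets makes the check independent of triple order, while any missing, extra, or altered triple causes the statement to be filtered out.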

Figure 5: Reconstruction accuracy as a function of subgraph size quantifies reliability of the generation-validation cycle.

Implications and Future Prospects

A key implication is that LLM-based evaluation cannot be considered universally optimal for semantic similarity, particularly outside synthetically simple or general knowledge settings. Embedding-based approaches remain competitive for entity-set focused applications, whereas LLMs provide value in tasks sensitive to relational content. Thus, the selection of an evaluation method must consider both the nature of the semantic variations (entity-centric vs. relation-centric) and the target knowledge domain.

Domain shift substantially impacts method generalizability—domain-specific tuning or benchmarking is indispensable before critical deployments (e.g., in biomedical or financial NLP pipelines).

Practically, the open-source nature of the pipeline enables the rapid construction of benchmarks for new domains, supporting downstream tasks such as RAG re-ranking or LLM output verification with minimal annotation burden.

Theoretically, the work substantiates a hybrid methodology for evaluation: combinatorial KG-based perturbation provides formal semantics for both positive and negative sample construction, addressing annotation subjectivity and offering fine-grained control over semantic variation typology.

Limitations

The pipeline is contingent on the availability and quality of source KGs, which may not exist for all domains or may encode knowledge under closed-world assumptions. The classes of perturbations, while controllable, cannot capture all pragmatic nuances of meaning (e.g., implicature, tone, or world knowledge outside the graph structure). Reliance on LLMs for text generation and validation may introduce distribution shift tied to those models' training data and modeling choices. The current framework privileges binary classification; extension to graded similarity remains for future work.

Conclusion

Semantic-KG provides a scalable, extensible paradigm for constructing semantically controlled benchmarks in any KG-equipped domain without dependence on costly human annotation. Detailed cross-method and cross-domain evaluations reveal crucial performance gaps in current semantic similarity methods, especially for relational content and specialized domains. The framework may support more nuanced and domain-adaptive benchmark design, sharpening both the evaluation and the subsequent deployment of LLM-based and embedding-based NLP systems.

Code is available on GitHub (https://github.com/QiyaoWei/semantic-kg) and the dataset on Hugging Face (https://huggingface.co/datasets/QiyaoWei/Semantic-KG).