KG-BERTScore

Updated 14 February 2026
  • This entry covers two distinct KG-BERTScore frameworks that leverage BERT together with knowledge graphs: one for knowledge graph completion and one for reference-free machine translation evaluation, both using contextual and entity-level information.
  • The KG completion model scores a triple via its [CLS] token with a linear classifier and sigmoid scoring, trained with binary cross-entropy and negative sampling, yielding state-of-the-art accuracy on benchmarks such as WN11 and FB13.
  • The MT evaluation metric combines BERTScore with exact entity matching against a multilingual knowledge graph, achieving system-level Pearson correlations up to 0.830 and competitive performance against BLEU.

KG-BERTScore refers to two distinct but thematically related frameworks leveraging bidirectional Transformer models and knowledge graphs: the triple scoring function for knowledge graph completion proposed in "KG-BERT: BERT for Knowledge Graph Completion" (Yao et al., 2019), and the reference-free automatic machine translation evaluation metric introduced in "KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation" (Wu et al., 2023). Both exploit contextual language representations and explicit entity-level information but in substantially different domains and workflows.

1. KG-BERTScore for Knowledge Graph Completion

Triple Encoding and Input Representation

Given a knowledge-graph triple $\tau = (h, r, t)$, where $h$ is the head entity, $r$ the relation, and $t$ the tail entity, each element is mapped to a short name or natural-language description and tokenized with BERT's WordPiece tokenizer. The resulting input sequence is:

$$[\text{CLS}]\; \mathrm{Tok}_1^h \ldots \mathrm{Tok}_a^h\; [\text{SEP}]\; \mathrm{Tok}_1^r \ldots \mathrm{Tok}_b^r\; [\text{SEP}]\; \mathrm{Tok}_1^t \ldots \mathrm{Tok}_c^t\; [\text{SEP}]$$

Each token is associated with an embedding comprising token, segment, and positional components, as in the standard BERT model. All tokens for the head and tail share segment A, while the relation tokens use segment B (Yao et al., 2019).
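This input packing can be sketched as follows; this is a minimal illustration with placeholder token lists standing in for a real WordPiece tokenizer:

```python
def pack_triple(head_tokens, rel_tokens, tail_tokens):
    """Return (tokens, segment_ids) for [CLS] h [SEP] r [SEP] t [SEP]."""
    tokens = (["[CLS]"] + head_tokens + ["[SEP]"]
              + rel_tokens + ["[SEP]"]
              + tail_tokens + ["[SEP]"])
    # Head and tail tokens share segment A (0); the relation uses segment B (1).
    segments = ([0] * (1 + len(head_tokens) + 1)
                + [1] * (len(rel_tokens) + 1)
                + [0] * (len(tail_tokens) + 1))
    return tokens, segments

# Toy triple with pre-split placeholder tokens.
tokens, segments = pack_triple(["steve", "jobs"], ["founded"], ["apple", "inc"])
```

Note that each `[SEP]` inherits the segment of the span it closes, matching the standard BERT packing of sentence pairs.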

Scoring Function (KG-BERTScore)

The sequence is fed through a 12-layer BERT-Base encoder. The final hidden state $\mathbf{C} \in \mathbb{R}^H$ (with $H = 768$) corresponding to the [CLS] token is projected via a two-way linear classifier:

$$\ell(\tau) = \mathbf{C} W^\top \in \mathbb{R}^2$$

where $W \in \mathbb{R}^{2 \times H}$ is a learned weight matrix. The resulting logits are transformed with a sigmoid to yield a probability vector $s_\tau = [s_{\tau_0}, s_{\tau_1}]$, where $s_{\tau_0}$ denotes $P(\text{label} = \text{true} \mid \tau)$. The model's plausibility score for the triple is defined as $\mathrm{score}(\tau) := s_{\tau_0}$.

No additional non-linear layers or regression heads are interposed between the [CLS] vector and the output probability layer.
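A toy sketch of this scoring head follows, with a made-up 4-dimensional [CLS] vector and illustrative weights rather than trained parameters:

```python
import math

def triple_score(C, W):
    """Project [CLS] vector C through a 2-way linear head; sigmoid each logit."""
    logits = [sum(c * w for c, w in zip(C, row)) for row in W]
    return tuple(1.0 / (1.0 + math.exp(-z)) for z in logits)

# Toy example with H = 4 (the real model uses H = 768).
C = [0.5, -0.2, 0.1, 0.3]
W = [[1.0, 0.0, 0.0, 1.0],    # row producing the "true" logit
     [-1.0, 0.0, 0.0, -1.0]]  # row producing the "false" logit
s0, s1 = triple_score(C, W)   # s0 plays the role of score(tau)
```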

Training Strategy and Negative Sampling

The model is trained with a binary cross-entropy objective over observed (positive) and corrupted (negative) triples. The negative set $D^-$ is generated by "entity corruption," replacing the head or tail of a positive triple with a randomly selected, distinct entity that does not yield another positive. The loss is:

$$\mathcal{L} = -\sum_{\tau \in D^+ \cup D^-} \left[ y_\tau \log s_{\tau_0} + (1 - y_\tau) \log s_{\tau_1} \right]$$

where $y_\tau$ is 1 for positives and 0 for negatives. No explicit $L^2$ regularization or margin loss is introduced (Yao et al., 2019).

Hyperparameters and Calibration

Key parameters include the use of BERT-Base (12 layers), the Adam optimizer, a learning rate of $5 \times 10^{-5}$, batch size 32, dropout 0.1, and task-dependent training epochs (triple classification: 3, link prediction: 5, relation prediction: 20). For triple classification, a balanced negative sampling ratio is used, while for link prediction a 5:1 ratio of negatives to positives is empirically found optimal among $\{1, 3, 5, 10\}$. No additional score normalization is applied beyond the sigmoid output (Yao et al., 2019).

Evaluation Protocols

  • Triple Classification: Binary decision over WN11 and FB13; the metric is accuracy, with a triple labeled true when $s_{\tau_0} \geq 0.5$.
  • Link Prediction: Datasets include WN18RR, FB15k-237, and UMLS; corrupted triples are created by replacing head/tail entities, scored by $s_{\tau_0}$, and ranked. Evaluated using mean rank and Hits@10 under the filtered protocol.
  • Relation Prediction: On FB15K, the input is restricted to the head and tail entity; a learned $W' \in \mathbb{R}^{R \times H}$ with a softmax over the $R$ relations is applied, evaluated by mean rank and Hits@1 (Yao et al., 2019).
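The filtered ranking used for link prediction can be sketched as follows; scores here are hypothetical, and `known_true_flags` marks corruptions that are themselves valid triples and are therefore excluded before ranking:

```python
def filtered_rank(gold_score, candidate_scores, known_true_flags):
    """Rank of the gold triple among corrupted candidates (1 = best)."""
    rank = 1
    for score, is_known in zip(candidate_scores, known_true_flags):
        # Skip candidates that form another known-true triple (filtered protocol).
        if not is_known and score > gold_score:
            rank += 1
    return rank

# Gold triple scored 0.8; one stronger candidate is filtered out as known-true.
rank = filtered_rank(0.8, [0.9, 0.95, 0.5], [False, True, False])
hits_at_10 = rank <= 10
```

Mean rank and Hits@10 are then averages of `rank` and `hits_at_10` over all test triples and both head/tail corruption directions.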

2. KG-BERTScore for Reference-Free Machine Translation Evaluation

Metric Definition and Components

This reference-free metric linearly combines two components:

  • BERTScore ($F_{BERT}$): Measures contextual similarity between source $s$ and MT output $t$ using token representations from a multilingual pre-trained Transformer. Precision and recall are defined as:

$$P = \frac{1}{|t|}\sum_{\hat x_i \in t}\max_{x_j \in s}\langle \hat x_i, x_j \rangle, \qquad R = \frac{1}{|s|}\sum_{x_i \in s}\max_{\hat x_j \in t}\langle x_i, \hat x_j \rangle$$

and their F-score:

$$F_{BERT} = \frac{2 P R}{P + R}$$

  • Knowledge Graph Entity Matching ($F_{KG}$): Assesses the fraction of source named entities correctly realized in the translation, based on exact matches of language-agnostic entity IDs from a multilingual knowledge graph:

$$F_{KG} = \begin{cases} \dfrac{\#\,\mathrm{matched\_entities}(s, t)}{\#\,\mathrm{entities}(s)} & \text{if } \#\,\mathrm{entities}(s) > 0 \\[4pt] 1 & \text{otherwise} \end{cases}$$

The overall KG-BERTScore is computed as:

$$F_{KG\text{-}BERT} = \alpha\, F_{KG} + (1 - \alpha)\, F_{BERT}$$

The system-level KG-BERTScore is the average of $F_{KG\text{-}BERT}$ across all sentence pairs (Wu et al., 2023).

Integration of Knowledge Graph Information

Source $s$ and translation $t$ are processed using a multilingual NER model and an entity linker, yielding unique language-agnostic entity IDs for all detected entities. The matching score $F_{KG}$ quantifies how many of the source's entity IDs are found in the translation (Wu et al., 2023).

Algorithmically, the process is:

  1. Compute contextual embeddings for $s$ and $t$ using a frozen multilingual Transformer.
  2. Calculate $F_{BERT}$ by greedy cosine matching.
  3. Extract and match entity IDs to obtain $F_{KG}$.
  4. Combine the two components using the parameter $\alpha$.
  5. Average over all sentence pairs for the final system-level score.
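Steps 3 and 4 can be sketched as follows; the entity IDs here are placeholder strings standing in for language-agnostic KG identifiers:

```python
def f_kg(src_entity_ids, mt_entity_ids):
    """Fraction of source entity IDs found in the translation; 1.0 if none."""
    if not src_entity_ids:
        return 1.0
    matched = sum(1 for e in src_entity_ids if e in mt_entity_ids)
    return matched / len(src_entity_ids)

def kg_bert_score(f_bert, src_ids, mt_ids, alpha=0.5):
    """Linear blend of entity matching and BERTScore."""
    return alpha * f_kg(src_ids, mt_ids) + (1 - alpha) * f_bert

# One of two source entities is preserved; F_BERT assumed to be 0.8.
score = kg_bert_score(0.8, ["Q95", "Q312"], ["Q312"])
```

Averaging `score` over all sentence pairs of a system then gives the system-level metric.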

Choice and Tuning of $\alpha$

$\alpha$, the interpolation parameter between BERTScore and KG entity matching, is varied over $\{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$. On development data (WMT19 QE into-English), pure BERTScore ($\alpha = 0.0$) attains a mean Pearson correlation of 0.396, pure entity matching ($\alpha = 1.0$) achieves 0.817, and the optimal combination, $\alpha = 0.5$, reaches 0.830. Consequently, $\alpha = 0.5$ is used by default (Wu et al., 2023).
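This selection amounts to a one-dimensional grid search, sketched below; the `pearson` callable is a hypothetical stand-in for computing system-level correlation on development data:

```python
def best_alpha(pearson, grid=(0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0)):
    """Pick the alpha value with the highest development-set correlation."""
    return max(grid, key=pearson)

# Toy correlation curve that peaks at alpha = 0.5.
alpha = best_alpha(lambda a: 1.0 - (a - 0.5) ** 2)
```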

Pre-trained Transformer Model Selection

KG-BERTScore utilizes XLM-RoBERTa-base by default, extracting 9th-layer representations for token embeddings. Ablations compare three models (bert-base-multilingual-cased, xlm-roberta-base, xlm-roberta-large). Larger models increase both $F_{BERT}$ and $F_{KG\text{-}BERT}$, though the relative advantage of the KG component remains stable (Wu et al., 2023).

Experimental Validation and Benchmarks

Evaluated on the WMT19 QE reference-free shared task, spanning 233 systems and 18 language pairs, using system-level Pearson correlation to human direct assessment as the primary metric. Key findings:

  • Into English: KG-BERTScore mean Pearson 0.830, outperforming all reference-free metrics (BERTScore alone: 0.396), and approaching BLEU (0.907).
  • From English: KG-BERTScore 0.392 vs. BERTScore 0.238; exceeds YiSi-2 and other baselines.
  • Non-English ↔ Non-English: KG-BERTScore 0.267 vs. BERTScore 0.173; matches or surpasses BLEU in some directions.
  • Ablations confirm the complementarity of $F_{KG}$ and $F_{BERT}$, with $\alpha = 0.5$ near optimal.
  • Model choice ablation shows that using XLM-RoBERTa-large increases mean Pearson from 0.830 to 0.851 (Wu et al., 2023).

3. Comparative Overview

| Aspect | KG-BERTScore (KG Completion) | KG-BERTScore (MT Evaluation) |
| --- | --- | --- |
| Domain | Knowledge graph completion | Reference-free machine translation evaluation |
| Scoring principle | [CLS] embedding, linear head, sigmoid | Linear blend: BERTScore + KG entity matching |
| Input preprocessing | Tokenized triple $(h, r, t)$ packed into a BERT sequence | NER/entity linking; contextual embeddings |
| Output | $P(\text{true} \mid \tau)$ | $F_{KG\text{-}BERT}$ per sentence/system |
| Main evaluation datasets | WN11, FB13, WN18RR, FB15k-237, UMLS, FB15K | WMT19 QE (233 systems, 18 language pairs) |

4. Significance and Performance Characteristics

The KG-BERTScore scoring function for knowledge graph completion achieves state-of-the-art results in triple classification, link prediction, and relation prediction on standard datasets, employing negative sampling and binary cross-entropy loss without additional regularization or normalization terms (Yao et al., 2019).

KG-BERTScore for reference-free MT evaluation combines the strengths of contextual semantic similarity with symbolic/knowledge-based entity matching. It offers improved correlation with human judgment compared to previous reference-free metrics, particularly in high-entity-density settings and on challenging language pairs (Wu et al., 2023).

5. Limitations and Future Directions

Both frameworks depend heavily on the quality and coverage of their pre-trained language models. The translation evaluation variant is additionally sensitive to the accuracy of multilingual NER and entity linking. A plausible implication is that further gains could be obtained by more robust entity linking or by integrating deeper knowledge graph reasoning, although the referenced works do not explicitly propose such extensions. No broad limitation regarding cross-linguistic generalization is reported; the results suggest robustness to language pair variation when appropriate models and KG resources are available.

6. References

  • "KG-BERT: BERT for Knowledge Graph Completion" (Yao et al., 2019)
  • "KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation" (Wu et al., 2023)