KG-BERTScore
- Two distinct but related frameworks share the name KG-BERTScore: one leverages BERT for knowledge graph completion, the other for reference-free machine translation evaluation, both exploiting contextual and entity-level information.
- The KG completion component uses a [CLS] token with a linear classifier and sigmoid scoring, trained with binary cross-entropy and negative sampling, yielding state-of-the-art accuracy on benchmarks like WN11 and FB13.
- The MT evaluation component combines BERTScore with exact entity matching against a multilingual knowledge graph, achieving high system-level Pearson correlations (mean up to 0.830 into English) and approaching reference-based BLEU.
KG-BERTScore refers to two distinct but thematically related frameworks leveraging bidirectional Transformer models and knowledge graphs: the triple scoring function for knowledge graph completion proposed in "KG-BERT: BERT for Knowledge Graph Completion" (Yao et al., 2019), and the reference-free automatic machine translation evaluation metric introduced in "KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation" (Wu et al., 2023). Both exploit contextual language representations and explicit entity-level information but in substantially different domains and workflows.
1. KG-BERTScore for Knowledge Graph Completion
Triple Encoding and Input Representation
Given a knowledge-graph triple $(h, r, t)$, where $h$ is the head entity, $r$ the relation, and $t$ the tail entity, each element is mapped to a short name or natural-language description and tokenized with BERT's word-piece tokenizer. The resulting input sequence is

$$[\mathrm{CLS}] \;\; \mathrm{Tok}(h) \;\; [\mathrm{SEP}] \;\; \mathrm{Tok}(r) \;\; [\mathrm{SEP}] \;\; \mathrm{Tok}(t) \;\; [\mathrm{SEP}]$$

Each token is associated with an embedding comprising token, segment, and positional components as in the standard BERT model. All tokens for the head and tail share segment A, while the relation tokens use segment B (Yao et al., 2019).
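The packing scheme above can be sketched in plain Python (a minimal sketch: `pack_triple` is a hypothetical helper, and the inputs are assumed to be already word-piece-tokenized; a real pipeline would use BERT's tokenizer):

```python
def pack_triple(head_tokens, rel_tokens, tail_tokens):
    """Build the token and segment-ID sequences for one (h, r, t) triple.

    Head and tail tokens share segment A (0); relation tokens use
    segment B (1), following Yao et al. (2019).
    """
    tokens = (["[CLS]"] + head_tokens + ["[SEP]"]
              + rel_tokens + ["[SEP]"]
              + tail_tokens + ["[SEP]"])
    segments = ([0] * (len(head_tokens) + 2)     # [CLS] + head + [SEP]
                + [1] * (len(rel_tokens) + 1)    # relation + [SEP]
                + [0] * (len(tail_tokens) + 1))  # tail + [SEP]
    assert len(tokens) == len(segments)
    return tokens, segments
```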
Scoring Function (KG-BERTScore)
The sequence is fed through a 12-layer BERT-Base encoder. The final hidden state $C \in \mathbb{R}^H$ (with $H = 768$) corresponding to the [CLS] token is projected via a two-way linear classifier, and the resulting logits are transformed by a sigmoid into the score vector

$$s_\tau = \mathrm{sigmoid}(C W^\top) \in [0, 1]^2,$$

where $W \in \mathbb{R}^{2 \times H}$ is a learned weight matrix and $\tau$ denotes the triple $(h, r, t)$. The model's plausibility score for the triple is the component of $s_\tau$ corresponding to the positive label.
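A minimal numerical sketch of this scoring head, with plain Python lists standing in for the BERT encoder's output (`triple_score`, the toy dimensions, and the index of the positive class are illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def triple_score(cls_vec, W):
    """Score a triple from its final [CLS] hidden state.

    cls_vec: length-H list (H = 768 for BERT-Base; toy sizes work too).
    W: 2 x H weight matrix of the two-way linear classifier.
    Returns the sigmoid of the logit for the (assumed) positive class.
    """
    logits = [sum(w_i * c_i for w_i, c_i in zip(row, cls_vec)) for row in W]
    probs = [sigmoid(z) for z in logits]
    return probs[1]  # index 1 assumed to be the positive label
```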
No additional non-linear layers or regression heads are interposed between the [CLS] vector and the output probability layer.
Training Strategy and Negative Sampling
The model is trained with a binary cross-entropy objective over observed (positive) and corrupted (negative) triples. The negative set $\mathbb{D}^-$ is generated by "entity corruption": replacing the head or tail of a positive triple with a randomly selected, distinct entity that does not yield another positive. The loss is

$$\mathcal{L} = -\sum_{\tau \in \mathbb{D}^+ \cup \mathbb{D}^-} \big( y_\tau \log s_\tau + (1 - y_\tau) \log(1 - s_\tau) \big),$$

where $s_\tau$ is the positive-class score and $y_\tau$ is 1 for positives and 0 for negatives. No explicit regularization or margin loss is introduced (Yao et al., 2019).
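The corruption procedure and loss can be sketched as follows (a simplified sketch with hypothetical helper names; the rejection loop assumes the set of known positives fits in memory):

```python
import math
import random

def corrupt(triple, entities, positives, rng):
    """Entity corruption: replace the head or tail with a random distinct
    entity, rejecting candidates that are themselves known positives
    (Yao et al., 2019)."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        cand = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if cand != triple and cand not in positives:
            return cand

def bce_loss(scores, labels):
    """Binary cross-entropy over positive-class scores in (0, 1)."""
    return -sum(y * math.log(s) + (1 - y) * math.log(1 - s)
                for s, y in zip(scores, labels))
```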
Hyperparameters and Calibration
Key hyperparameters include BERT-Base (12 layers), the Adam optimizer with a small fine-tuning learning rate, batch size 32, dropout 0.1, and a task-dependent number of training epochs (triple classification: 3, link prediction: 5, relation prediction: 20). For triple classification a balanced 1:1 negative sampling ratio is used, while for link prediction a 5:1 ratio of negatives to positives is found empirically optimal among the tested ratios of 1, 3, 5, and 10. No additional score normalization is applied beyond the sigmoid output (Yao et al., 2019).
Evaluation Protocols
- Triple Classification: Binary decision over WN11 and FB13; the metric is accuracy, with a triple judged true when its plausibility score exceeds a tuned threshold.
- Link Prediction: Datasets include WN18RR, FB15k-237, and UMLS; corrupted triples are created by replacing head/tail entities, scored by the plausibility score, and ranked. Evaluated using mean rank and Hits@10 under the filtered protocol.
- Relation Prediction: On FB15K, the input is restricted to the head and tail entities; the binary head is replaced by a learned linear layer with a softmax over relations, evaluated by mean rank and Hits@1 (Yao et al., 2019).
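The filtered ranking protocol used in link prediction can be sketched as follows (hypothetical helper names; `score` stands for the model's plausibility function):

```python
def filtered_rank(score, gold, candidates, known_true):
    """Rank of the gold triple among corrupted candidates under the
    filtered protocol: candidates that are themselves known positives
    are excluded before ranking. Higher score = more plausible."""
    gold_score = score(gold)
    better = sum(1 for c in candidates
                 if c != gold and c not in known_true
                 and score(c) > gold_score)
    return better + 1

def mean_rank_and_hits(ranks, k=10):
    """Aggregate per-query ranks into mean rank and Hits@k."""
    mean_rank = sum(ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mean_rank, hits
```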
2. KG-BERTScore for Reference-Free Machine Translation Evaluation
Metric Definition and Components
This reference-free metric linearly combines two components:
- BERTScore ($F_{\mathrm{BERT}}$): Measures contextual similarity between the source $x$ and the MT output $\hat{y}$ using token representations from a multilingual pre-trained Transformer. Precision and recall are defined by greedy matching of (normalized) token embeddings:

$$P = \frac{1}{|\hat{y}|} \sum_{\hat{y}_j \in \hat{y}} \max_{x_i \in x} x_i^\top \hat{y}_j, \qquad R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{y}_j \in \hat{y}} x_i^\top \hat{y}_j,$$

and their F-score:

$$F_{\mathrm{BERT}} = 2 \, \frac{P \cdot R}{P + R}.$$
- Knowledge Graph Entity Matching ($s_{\mathrm{KG}}$): Assesses the fraction of source named entities correctly realized in the translation, based on exact matches of language-agnostic entity IDs from a multilingual knowledge graph.
The overall KG-BERTScore is computed as the linear interpolation

$$\mathrm{KG\text{-}BERTScore} = \lambda \, F_{\mathrm{BERT}} + (1 - \lambda) \, s_{\mathrm{KG}}.$$

System-level KG-BERTScore is the average across all sentence pairs (Wu et al., 2023).
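The greedy matching underlying the BERTScore component can be sketched with plain lists standing in for contextual token embeddings (a simplified sketch; `bert_score_f` is a hypothetical name, and cosine similarity is computed explicitly rather than via pre-normalized vectors):

```python
import math

def _cos(u, v):
    """Cosine similarity between two non-zero vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def bert_score_f(src_embs, hyp_embs):
    """Greedy-matching BERTScore F between source and hypothesis token
    embeddings (both assumed non-empty)."""
    precision = sum(max(_cos(h, s) for s in src_embs)
                    for h in hyp_embs) / len(hyp_embs)
    recall = sum(max(_cos(s, h) for h in hyp_embs)
                 for s in src_embs) / len(src_embs)
    return 2 * precision * recall / (precision + recall)
```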
Integration of Knowledge Graph Information
Source and translation are processed using a multilingual NER model and an entity linker, yielding unique language-agnostic entity IDs for all detected entities. The matching score quantifies how many of the source's entity IDs are found in the translation (Wu et al., 2023).
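The entity-matching score then reduces to a set operation over language-agnostic IDs (a minimal sketch; the fallback of 1.0 when the source contains no entities is an assumed convention, not stated in the paper):

```python
def kg_entity_score(src_entity_ids, hyp_entity_ids):
    """Fraction of the source's language-agnostic entity IDs that are
    realized in the translation. Returns 1.0 for an entity-free source
    (assumed convention)."""
    src = set(src_entity_ids)
    if not src:
        return 1.0
    return len(src & set(hyp_entity_ids)) / len(src)
```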
Algorithmically, the process is:
- Compute contextual embeddings for the source and the translation using a frozen multilingual Transformer.
- Calculate the BERTScore component by greedy cosine matching.
- Extract and match entity IDs to obtain the entity-matching score.
- Combine the two scores using the interpolation parameter $\lambda$.
- Average over all sentence pairs for the final system-level score.
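The final two steps can be sketched as follows (hypothetical function names; having $\lambda$ weight the BERTScore term is an assumed convention):

```python
def kg_bertscore(f_bert, s_kg, lam):
    """Linear interpolation of the BERTScore and KG entity-matching
    components for one sentence pair (lambda weighting BERTScore is an
    assumed convention; Wu et al. 2023 fix it on development data)."""
    return lam * f_bert + (1 - lam) * s_kg

def system_score(sentence_scores):
    """System-level score: mean over all sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)
```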
Choice and Tuning of
$\lambda$, the interpolation parameter between BERTScore and KG entity matching, is varied over a grid on $[0, 1]$. On development data (WMT19 QE into-English), pure BERTScore ($\lambda = 1$) attains a mean Pearson of 0.396 and pure entity matching ($\lambda = 0$) achieves 0.817, but an intermediate setting is optimal with 0.830; that value of $\lambda$ is used by default (Wu et al., 2023).
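This tuning procedure amounts to a grid search over $\lambda$ maximizing Pearson correlation with human scores, sketched below (illustrative data and helper names):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_lambda(f_bert, s_kg, human, grid=None):
    """Pick the interpolation weight maximizing Pearson correlation of
    the blended score with human judgments (illustrative grid search)."""
    grid = grid if grid is not None else [i / 10 for i in range(11)]
    return max(grid, key=lambda lam: pearson(
        [lam * f + (1 - lam) * k for f, k in zip(f_bert, s_kg)], human))
```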
Pre-trained Transformer Model Selection
KG-BERTScore utilizes XLM-RoBERTa-base by default, extracting 9th-layer representations for token embeddings. Ablations compare three models (bert-base-multilingual-cased, xlm-roberta-base, xlm-roberta-large). Larger models increase both the BERTScore component and the combined metric, though the relative advantage of the KG component remains stable (Wu et al., 2023).
Experimental Validation and Benchmarks
Evaluated on the WMT19 QE reference-free shared task, spanning 233 systems and 18 language pairs, using system-level Pearson correlation to human direct assessment as the primary metric. Key findings:
- Into English: KG-BERTScore mean Pearson 0.830, outperforming all reference-free metrics (BERTScore alone: 0.396), and approaching BLEU (0.907).
- From English: KG-BERTScore 0.392 vs. BERTScore 0.238; exceeds YiSi-2 and other baselines.
- Non-English ↔ Non-English: KG-BERTScore 0.267 vs. BERTScore 0.173; matches or surpasses BLEU in some directions.
- Ablations confirm the complementarity of the BERTScore and entity-matching components, with an intermediate interpolation weight $\lambda$ near optimal.
- Model choice ablation shows that using XLM-RoBERTa-large increases mean Pearson from 0.830 to 0.851 (Wu et al., 2023).
3. Comparative Overview
| Aspect | KG-BERTScore (KG Completion) | KG-BERTScore (MT Evaluation) |
|---|---|---|
| Domain | Knowledge graph completion | Reference-free machine translation evaluation |
| Scoring Principle | [CLS] embedding, linear head, sigmoid | Linear blend: BERTScore + KG entity-matching |
| Input Preprocessing | Tokenized triple (h, r, t) packed into BERT sequence | NER/Entity linking; contextual embeddings |
| Output | Triple plausibility score in $[0, 1]$ | Scalar score per sentence and per system |
| Main Evaluation Datasets | WN11, FB13, WN18RR, FB15k-237, UMLS, FB15K | WMT19 QE (233 systems, 18 language pairs) |
4. Significance and Performance Characteristics
The KG-BERTScore scoring function for knowledge graph completion achieves state-of-the-art results in triple classification, link prediction, and relation prediction on standard datasets, employing negative sampling and binary cross-entropy loss without additional regularization or normalization terms (Yao et al., 2019).
KG-BERTScore for reference-free MT evaluation combines the strengths of contextual semantic similarity with symbolic/knowledge-based entity matching. It offers improved correlation with human judgment compared to previous reference-free metrics, particularly in high-entity-density settings and on challenging language pairs (Wu et al., 2023).
5. Limitations and Future Directions
Both frameworks depend heavily on the quality and coverage of the underlying pre-trained language models. The translation evaluation variant is also sensitive to the accuracy of multilingual NER and entity linking. A plausible implication is that further gains could be obtained through more robust entity linking or deeper knowledge-graph reasoning, although explicit proposals for such extensions are not provided in the referenced works. No broad limitation regarding cross-linguistic generalization is reported; the results suggest robustness to language-pair variation when appropriate models and KG resources are available.
6. References
- "KG-BERT: BERT for Knowledge Graph Completion" (Yao et al., 2019)
- "KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation" (Wu et al., 2023)