KG-BERTScore
- Two distinct but related frameworks share the name KG-BERTScore: one leverages BERT for knowledge graph completion, the other for reference-free machine translation evaluation, both exploiting contextual and entity-level information.
- The KG completion component uses a [CLS] token with a linear classifier and sigmoid scoring, trained with binary cross-entropy and negative sampling, yielding state-of-the-art accuracy on benchmarks like WN11 and FB13.
- The MT evaluation component combines BERTScore with exact entity matching against a multilingual knowledge graph, achieving high system-level Pearson correlations (mean up to 0.830 into English) and approaching reference-based BLEU.
KG-BERTScore refers to two distinct but thematically related frameworks leveraging bidirectional Transformer models and knowledge graphs: the triple scoring function for knowledge graph completion proposed in "KG-BERT: BERT for Knowledge Graph Completion" (Yao et al., 2019), and the reference-free automatic machine translation evaluation metric introduced in "KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation" (Wu et al., 2023). Both exploit contextual language representations and explicit entity-level information but in substantially different domains and workflows.
1. KG-BERTScore for Knowledge Graph Completion
Triple Encoding and Input Representation
Given a knowledge-graph triple $(h, r, t)$, where $h$ is the head entity, $r$ the relation, and $t$ the tail entity, each element is mapped to a short name or natural-language description and tokenized with BERT's word-piece tokenizer. The resulting input sequence is

$$[\mathrm{CLS}] \;\; \mathrm{Tok}(h) \;\; [\mathrm{SEP}] \;\; \mathrm{Tok}(r) \;\; [\mathrm{SEP}] \;\; \mathrm{Tok}(t) \;\; [\mathrm{SEP}]$$

Each token is associated with an embedding comprising token, segment, and positional components as in the standard BERT model. All tokens for the head and tail share segment A, while the relation tokens use segment B (Yao et al., 2019).
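The packing scheme above can be sketched in plain Python (a minimal sketch: `pack_triple` is a hypothetical helper, and the inputs are assumed to be already word-piece-tokenized; a real pipeline would use BERT's tokenizer):

```python
def pack_triple(head_tokens, rel_tokens, tail_tokens):
    """Build the token and segment-ID sequences for one (h, r, t) triple.

    Head and tail tokens share segment A (0); relation tokens use
    segment B (1), following Yao et al. (2019).
    """
    tokens = (["[CLS]"] + head_tokens + ["[SEP]"]
              + rel_tokens + ["[SEP]"]
              + tail_tokens + ["[SEP]"])
    segments = ([0] * (len(head_tokens) + 2)     # [CLS] + head + [SEP]
                + [1] * (len(rel_tokens) + 1)    # relation + [SEP]
                + [0] * (len(tail_tokens) + 1))  # tail + [SEP]
    assert len(tokens) == len(segments)
    return tokens, segments
```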
Scoring Function (KG-BERTScore)
The sequence is fed through a 12-layer BERT-Base encoder. The final hidden state $C \in \mathbb{R}^H$ (with $H = 768$) corresponding to the [CLS] token is projected via a two-way linear classifier, and the resulting logits are transformed by a sigmoid into the score vector

$$s_\tau = \mathrm{sigmoid}(C W^\top) \in [0, 1]^2,$$

where $W \in \mathbb{R}^{2 \times H}$ is a learned weight matrix and $\tau$ denotes the triple $(h, r, t)$. The model's plausibility score for the triple is the component of $s_\tau$ corresponding to the positive label.
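A minimal numerical sketch of this scoring head, with plain Python lists standing in for the BERT encoder's output (`triple_score`, the toy dimensions, and the index of the positive class are illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def triple_score(cls_vec, W):
    """Score a triple from its final [CLS] hidden state.

    cls_vec: length-H list (H = 768 for BERT-Base; toy sizes work too).
    W: 2 x H weight matrix of the two-way linear classifier.
    Returns the sigmoid of the logit for the (assumed) positive class.
    """
    logits = [sum(w_i * c_i for w_i, c_i in zip(row, cls_vec)) for row in W]
    probs = [sigmoid(z) for z in logits]
    return probs[1]  # index 1 assumed to be the positive label
```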
No additional non-linear layers or regression heads are interposed between the [CLS] vector and the output probability layer.
Training Strategy and Negative Sampling
The model is trained with a binary cross-entropy objective over observed (positive) and corrupted (negative) triples. The negative set $\mathbb{D}^-$ is generated by "entity corruption": replacing the head or tail of a positive triple with a randomly selected, distinct entity that does not yield another positive. The loss is

$$\mathcal{L} = -\sum_{\tau \in \mathbb{D}^+ \cup \mathbb{D}^-} \big( y_\tau \log s_\tau + (1 - y_\tau) \log(1 - s_\tau) \big),$$

where $s_\tau$ is the positive-class score and $y_\tau$ is 1 for positives and 0 for negatives. No explicit regularization or margin loss is introduced (Yao et al., 2019).
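The corruption procedure and loss can be sketched as follows (a simplified sketch with hypothetical helper names; the rejection loop assumes the set of known positives fits in memory):

```python
import math
import random

def corrupt(triple, entities, positives, rng):
    """Entity corruption: replace the head or tail with a random distinct
    entity, rejecting candidates that are themselves known positives
    (Yao et al., 2019)."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        cand = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if cand != triple and cand not in positives:
            return cand

def bce_loss(scores, labels):
    """Binary cross-entropy over positive-class scores in (0, 1)."""
    return -sum(y * math.log(s) + (1 - y) * math.log(1 - s)
                for s, y in zip(scores, labels))
```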
Hyperparameters and Calibration
Key hyperparameters include BERT-Base (12 layers), the Adam optimizer with a small fine-tuning learning rate, batch size 32, dropout 0.1, and a task-dependent number of training epochs (triple classification: 3, link prediction: 5, relation prediction: 20). For triple classification a balanced 1:1 negative sampling ratio is used, while for link prediction a 5:1 ratio of negatives to positives is found empirically optimal among the tested ratios of 1, 3, 5, and 10. No additional score normalization is applied beyond the sigmoid output (Yao et al., 2019).
Evaluation Protocols
- Triple Classification: Binary decision over WN11 and FB13; the metric is accuracy, with a triple judged true when its plausibility score exceeds a tuned threshold.
- Link Prediction: Datasets include WN18RR, FB15k-237, and UMLS; corrupted triples are created by replacing head/tail entities, scored by the plausibility score, and ranked. Evaluated using mean rank and Hits@10 under the filtered protocol.
- Relation Prediction: On FB15K, the input is restricted to the head and tail entities; the binary head is replaced by a learned linear layer with a softmax over relations, evaluated by mean rank and Hits@1 (Yao et al., 2019).
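The filtered ranking protocol used in link prediction can be sketched as follows (hypothetical helper names; `score` stands for the model's plausibility function):

```python
def filtered_rank(score, gold, candidates, known_true):
    """Rank of the gold triple among corrupted candidates under the
    filtered protocol: candidates that are themselves known positives
    are excluded before ranking. Higher score = more plausible."""
    gold_score = score(gold)
    better = sum(1 for c in candidates
                 if c != gold and c not in known_true
                 and score(c) > gold_score)
    return better + 1

def mean_rank_and_hits(ranks, k=10):
    """Aggregate per-query ranks into mean rank and Hits@k."""
    mean_rank = sum(ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mean_rank, hits
```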
2. KG-BERTScore for Reference-Free Machine Translation Evaluation
Metric Definition and Components
This reference-free metric linearly combines two components:
- BERTScore ($F_{\mathrm{BERT}}$): Measures contextual similarity between the source $x$ and the MT output $\hat{y}$ using token representations from a multilingual pre-trained Transformer. Precision and recall are defined by greedy matching of (normalized) token embeddings:

$$P = \frac{1}{|\hat{y}|} \sum_{\hat{y}_j \in \hat{y}} \max_{x_i \in x} x_i^\top \hat{y}_j, \qquad R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{y}_j \in \hat{y}} x_i^\top \hat{y}_j,$$

and their F-score:

$$F_{\mathrm{BERT}} = 2 \, \frac{P \cdot R}{P + R}.$$
- Knowledge Graph Entity Matching ($s_{\mathrm{KG}}$): Assesses the fraction of source named entities correctly realized in the translation, based on exact matches of language-agnostic entity IDs from a multilingual knowledge graph.
The overall KG-BERTScore is computed as the linear interpolation

$$\mathrm{KG\text{-}BERTScore} = \lambda \, F_{\mathrm{BERT}} + (1 - \lambda) \, s_{\mathrm{KG}}.$$

System-level KG-BERTScore is the average across all sentence pairs (Wu et al., 2023).
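The greedy matching underlying the BERTScore component can be sketched with plain lists standing in for contextual token embeddings (a simplified sketch; `bert_score_f` is a hypothetical name, and cosine similarity is computed explicitly rather than via pre-normalized vectors):

```python
import math

def _cos(u, v):
    """Cosine similarity between two non-zero vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def bert_score_f(src_embs, hyp_embs):
    """Greedy-matching BERTScore F between source and hypothesis token
    embeddings (both assumed non-empty)."""
    precision = sum(max(_cos(h, s) for s in src_embs)
                    for h in hyp_embs) / len(hyp_embs)
    recall = sum(max(_cos(s, h) for h in hyp_embs)
                 for s in src_embs) / len(src_embs)
    return 2 * precision * recall / (precision + recall)
```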
Integration of Knowledge Graph Information
Source and translation are processed using a multilingual NER model and an entity linker, yielding unique language-agnostic entity IDs for all detected entities. The matching score quantifies how many of the source's entity IDs are found in the translation (Wu et al., 2023).
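The entity-matching score then reduces to a set operation over language-agnostic IDs (a minimal sketch; the fallback of 1.0 when the source contains no entities is an assumed convention, not stated in the paper):

```python
def kg_entity_score(src_entity_ids, hyp_entity_ids):
    """Fraction of the source's language-agnostic entity IDs that are
    realized in the translation. Returns 1.0 for an entity-free source
    (assumed convention)."""
    src = set(src_entity_ids)
    if not src:
        return 1.0
    return len(src & set(hyp_entity_ids)) / len(src)
```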
Algorithmically, the process is:
- Compute contextual embeddings for the source and the translation using a frozen multilingual Transformer.
- Calculate the BERTScore component by greedy cosine matching.
- Extract and match entity IDs to obtain the entity-matching score.
- Combine the two scores using the interpolation parameter $\lambda$.
- Average over all sentence pairs for the final system-level score.
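The final two steps can be sketched as follows (hypothetical function names; having $\lambda$ weight the BERTScore term is an assumed convention):

```python
def kg_bertscore(f_bert, s_kg, lam):
    """Linear interpolation of the BERTScore and KG entity-matching
    components for one sentence pair (lambda weighting BERTScore is an
    assumed convention; Wu et al. 2023 fix it on development data)."""
    return lam * f_bert + (1 - lam) * s_kg

def system_score(sentence_scores):
    """System-level score: mean over all sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)
```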
Choice and Tuning of
$\lambda$, the interpolation parameter between BERTScore and KG entity matching, is varied over a grid on $[0, 1]$. On development data (WMT19 QE into-English), pure BERTScore ($\lambda = 1$) attains a mean Pearson of 0.396 and pure entity matching ($\lambda = 0$) achieves 0.817, but an intermediate setting is optimal with 0.830; that value of $\lambda$ is used by default (Wu et al., 2023).
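This tuning procedure amounts to a grid search over $\lambda$ maximizing Pearson correlation with human scores, sketched below (illustrative data and helper names):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_lambda(f_bert, s_kg, human, grid=None):
    """Pick the interpolation weight maximizing Pearson correlation of
    the blended score with human judgments (illustrative grid search)."""
    grid = grid if grid is not None else [i / 10 for i in range(11)]
    return max(grid, key=lambda lam: pearson(
        [lam * f + (1 - lam) * k for f, k in zip(f_bert, s_kg)], human))
```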
Pre-trained Transformer Model Selection
KG-BERTScore utilizes XLM-RoBERTa-base by default, extracting 9th-layer representations for token embeddings. Ablations compare three models (bert-base-multilingual-cased, xlm-roberta-base, xlm-roberta-large). Larger models increase both the BERTScore component and the combined metric, though the relative advantage of the KG component remains stable (Wu et al., 2023).
Experimental Validation and Benchmarks
Evaluated on the WMT19 QE reference-free shared task, spanning 233 systems and 18 language pairs, using system-level Pearson correlation to human direct assessment as the primary metric. Key findings:
- Into English: KG-BERTScore mean Pearson 0.830, outperforming all reference-free metrics (BERTScore alone: 0.396), and approaching BLEU (0.907).
- From English: KG-BERTScore 0.392 vs. BERTScore 0.238; exceeds YiSi-2 and other baselines.
- Non-English ↔ Non-English: KG-BERTScore 0.267 vs. BERTScore 0.173; matches or surpasses BLEU in some directions.
- Ablations confirm the complementarity of the BERTScore and entity-matching components, with an intermediate interpolation weight $\lambda$ near optimal.
- Model choice ablation shows that using XLM-RoBERTa-large increases mean Pearson from 0.830 to 0.851 (Wu et al., 2023).
3. Comparative Overview
| Aspect | KG-BERTScore (KG Completion) | KG-BERTScore (MT Evaluation) |
|---|---|---|
| Domain | Knowledge graph completion | Reference-free machine translation evaluation |
| Scoring Principle | [CLS] embedding, linear head, sigmoid | Linear blend: BERTScore + KG entity-matching |
| Input Preprocessing | Tokenized triple (h, r, t) packed into BERT sequence | NER/Entity linking; contextual embeddings |
| Output | Triple plausibility score in $[0, 1]$ | Scalar score per sentence and per system |
| Main Evaluation Datasets | WN11, FB13, WN18RR, FB15k-237, UMLS, FB15K | WMT19 QE (233 systems, 18 language pairs) |
4. Significance and Performance Characteristics
The KG-BERTScore scoring function for knowledge graph completion achieves state-of-the-art results in triple classification, link prediction, and relation prediction on standard datasets, employing negative sampling and binary cross-entropy loss without additional regularization or normalization terms (Yao et al., 2019).
KG-BERTScore for reference-free MT evaluation combines the strengths of contextual semantic similarity with symbolic/knowledge-based entity matching. It offers improved correlation with human judgment compared to previous reference-free metrics, particularly in high-entity-density settings and on challenging language pairs (Wu et al., 2023).
5. Limitations and Future Directions
Both frameworks depend heavily on the quality and coverage of the underlying pre-trained language models. The translation evaluation variant is also sensitive to the accuracy of multilingual NER and entity linking. A plausible implication is that further gains could be obtained through more robust entity linking or deeper knowledge-graph reasoning, although explicit proposals for such extensions are not provided in the referenced works. No broad limitation regarding cross-linguistic generalization is reported; the results suggest robustness to language-pair variation when appropriate models and KG resources are available.
6. References
- "KG-BERT: BERT for Knowledge Graph Completion" (Yao et al., 2019)
- "KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation" (Wu et al., 2023)