
BERTScore: Contextual Text Evaluation

Updated 22 January 2026
  • BERTScore is an automatic evaluation metric that computes cosine similarity between contextualized embeddings, capturing semantic nuances in text generation.
  • It leverages Transformer layers and optional IDF weighting to robustly assess paraphrasing, reordering, and lexical variation across diverse tasks.
  • Extensions like KG-BERTScore and CBERTScore address limitations in numerical and factual accuracy, enabling domain-specific adaptations for improved alignment with human judgments.

BERTScore is an automatic metric for evaluating text generation based on contextual embeddings from pre-trained Transformer models such as BERT, RoBERTa, or their multilingual variants. Unlike traditional n-gram overlap metrics, BERTScore computes pairwise similarity between the contextualized embeddings of tokens in the candidate and reference sentences, enabling a “soft” semantic alignment that is robust to paraphrase, reordering, and lexical variation. BERTScore has achieved significantly higher correlation with human judgments across machine translation, summarization, image captioning, and automatic speech recognition tasks, and has given rise to domain-specific and structurally-augmented variants.

1. Mathematical Definition and Algorithmic Foundations

Let $C = (c_1, \dots, c_n)$ and $R = (r_1, \dots, r_m)$ represent candidate and reference token sequences, respectively, each mapped to contextualized embeddings in $\mathbb{R}^d$ via a frozen pre-trained Transformer, denoted $\phi(\cdot)$. The metric computes a cosine similarity matrix $S \in \mathbb{R}^{n \times m}$:

$$S_{ij} = \cos(\phi(c_i), \phi(r_j)) = \frac{\phi(c_i)^\top\, \phi(r_j)}{\|\phi(c_i)\|\, \|\phi(r_j)\|}$$

Precision ($P$) and recall ($R$) are then defined by greedy maximum matching:

$$P = \frac{1}{n}\sum_{i=1}^{n} \max_{1 \le j \le m} S_{ij} \qquad R = \frac{1}{m}\sum_{j=1}^{m} \max_{1 \le i \le n} S_{ij}$$

with the final BERTScore as the harmonic mean:

$$F_1 = \frac{2\,P\,R}{P + R}$$

Optionally, an inverse document frequency (IDF) weighting scheme can reweight token-level contributions, which is particularly advantageous when rare tokens (e.g., named entities, domain terms) should receive higher influence.
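Given precomputed token embeddings, the greedy-matching computation above is compact. A minimal NumPy sketch (extracting the embeddings from a Transformer is assumed to happen upstream; the function name is illustrative):

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """Greedy-matching BERTScore from precomputed token embeddings.

    cand_emb: (n, d) candidate token embeddings
    ref_emb:  (m, d) reference token embeddings
    """
    # L2-normalize rows so plain dot products equal cosine similarities
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    S = c @ r.T                        # (n, m) cosine similarity matrix
    P = S.max(axis=1).mean()           # each candidate token -> best reference match
    R = S.max(axis=0).mean()           # each reference token -> best candidate match
    F1 = 2 * P * R / (P + R)
    return P, R, F1

# Sanity check: identical sequences score exactly 1.0
emb = np.random.default_rng(0).normal(size=(5, 8))
P, R, F1 = bertscore_f1(emb, emb)
```

Since cosine similarity is bounded by 1 and attained on the diagonal, identical candidate and reference embeddings yield $P = R = F_1 = 1$.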

The standard implementation uses a selected “best” Transformer layer, typically determined by maximizing metric correlation with human judgments on a development set. For multilingual or specialized applications, appropriate language-specific or domain-specific Transformer models are required (Zhang et al., 2019).

2. Theoretical Properties and Core Advantages

BERTScore’s primary advantage over string-based metrics (e.g., BLEU, METEOR, ROUGE) arises from its contextual semantic matching. The embedded representation of each token is influenced by its sentential context, allowing the metric to reward synonymous paraphrases, capture long-distance dependencies, and score reorderings and inflectional variation without explicit lexical heuristics.

Empirical findings across WMT16–18, COCO, and QA corpora show that BERTScore $F_1$ achieves absolute Pearson correlation coefficients ($|\rho|$) up to 0.99 at the system level, with consistently higher segment-level Kendall $\tau$ coefficients than n-gram metrics (Zhang et al., 2019). BERTScore’s “soft” alignment ensures that lexically divergent, yet semantically aligned, generations are favored over purely surface-matching outputs.

3. Extensions, Variations, and Domain Adaptations

Layer and Representation Robustness

Metric robustness depends sensitively on the layer from which embeddings are extracted. While higher layers (e.g., 9–18 in BERT/RoBERTa) yield top correlations on in-domain evaluation, the first layer outperforms under noise or out-of-vocabulary perturbations. ByT5’s character-level embeddings, extracted from the first Transformer layer, provide superior resilience to misspelling and domain shift relative to standard WordPiece-token BERTScore (Vu et al., 2022).

Alignment for Cross-Lingual and Morphologically Divergent Pairs

For language pairs with divergent scripts or rich morphology, using monolingual BERTs with orthogonal alignment of embedding spaces—via Procrustes analysis on “anchor” words that are literal bilingual matches—improves precision and robustness. This “anchor-only” BERTScore, applied to English→Russian MT evaluation, demonstrated a Spearman correlation of $\rho = 0.57$ with expert human rankings, exceeding standard multilingual BERTScore approaches (Vetrov et al., 2022).
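The orthogonal Procrustes alignment step has a closed-form SVD solution. A minimal NumPy sketch, assuming the anchor-pair embedding matrices have already been extracted from the two monolingual models:

```python
import numpy as np

def procrustes_align(src_anchors: np.ndarray, tgt_anchors: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||src @ W - tgt||_F over orthogonal W.

    src_anchors, tgt_anchors: (k, d) embeddings of k anchor word pairs
    drawn from the source- and target-language embedding spaces.
    """
    # Closed-form solution: SVD of the cross-covariance matrix
    U, _, Vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return U @ Vt                      # (d, d) orthogonal rotation

# Sanity check: if the target space is an exact rotation of the source,
# the alignment recovers that rotation.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # random orthogonal matrix
W = procrustes_align(X, X @ Q)
```

After alignment, candidate embeddings mapped through `W` live in the reference space, and the standard BERTScore matching can proceed as usual.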

Reference-Free and Knowledge-Integrated Variants

KG-BERTScore incorporates a knowledge-graph–based bilingual named entity matching term, linearly combined with $F_{\text{BERT}}$ via

$$F_{\text{KG-BERT}} = \alpha\,F_{\text{KG}} + (1-\alpha)\,F_{\text{BERT}}$$

with $\alpha \in [0,1]$ tuned on human-labeled data. This extension achieves state-of-the-art correlation in reference-free evaluation settings, particularly when using large multilingual models such as XLM-RoBERTa-large (Wu et al., 2023).

Domain-Specific Adaptations

BERTScore has been adapted to penalize clinically-relevant transcription errors (Clinical BERTScore, or CBERTScore) (Shor et al., 2023). CBERTScore interpolates between BERTScore computed over all tokens and over medically-flagged terms, with a tuned parameter $k$:

$$\text{CBERTScore}(x, \hat{x}) = k\,F_{\text{med}}(x, \hat{x}) + (1 - k)\,F_{\text{all}}(x, \hat{x})$$

This yields up to 75.4% agreement with clinician preferences, outperforming WER, BLEU, METEOR, and plain BERTScore.
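The interpolation is straightforward to express. In this sketch, `f_med` and `f_all` are assumed to be precomputed by running BERTScore over the medically-flagged subset and over all tokens respectively, and the numeric values of `k` and the scores are hypothetical:

```python
def cbertscore(f_med: float, f_all: float, k: float) -> float:
    """Clinical BERTScore: interpolate flagged-term and all-token scores.

    f_med: BERTScore F1 restricted to medically-flagged terms (computed upstream)
    f_all: BERTScore F1 over all tokens
    k:     trade-off weight, tuned against clinician preferences
    """
    return k * f_med + (1 - k) * f_all

# A transcript that garbles a drug name scores low on f_med, dragging the
# combined score down even when f_all stays high (illustrative numbers).
score = cbertscore(f_med=0.40, f_all=0.95, k=0.7)
```

The same pattern applies to KG-BERTScore’s $\alpha$-interpolation in the previous subsection.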

In finance, BERTScore fails to distinguish semantically crucial numerical shifts, as revealed by the FinNuE diagnostic set (Huang et al., 13 Nov 2025). Proposed remedies include numeral-aware tokenization, hybrid distance penalties, and numeric-augmented embeddings.

4. Empirical Behavior and Limitations

BERTScore outperforms classical metrics in measuring adequacy and fluency when surface divergence arises from paraphrase, synonymy, or syntactic transformation. For image captioning, BERTScore with RoBERTa-large achieves a Pearson correlation of $r \approx 0.92$ against human judgments, close to domain-trained metrics (Zhang et al., 2019). In ASR, BERTScore correlates more tightly with annotated semantic error severity than WER and provides robustness to orthographic (contraction, normalization) variants (Tobin et al., 2022).

However, BERTScore’s “antonymy problem” results in minimal penalty when critical polarity-bearing tokens are replaced by antonyms: contextual embeddings of “best” and “worst” remain close neighbors in BERT space, yielding high cosine similarity and a weak error signal (Saadany et al., 2021). Numerically, BERTScore fails to penalize mistaken quantities (e.g., “2%” vs. “20%” returns $F_1 = 0.97$), attributed to token/subword fragmentation and greedy alignment (Huang et al., 13 Nov 2025). Similarly, factual mismatches in named entities, dates, units, or content reversal rarely incur a penalty large enough to match human judgments of severity.

Variance under severe input noise, unknown tokens, or heavy domain shift is also pronounced, unless character-level, first-layer embeddings are adopted (Vu et al., 2022).

5. Integration as a Differentiable Training Objective

BERTScore is fully differentiable and thus can serve as a training signal in sequence-to-sequence generation models. By replacing argmax-decoding with differentiable surrogates (e.g., dense vector expectations, sparsemax, Gumbel-Softmax sampling), model outputs remain end-to-end trainable. Fine-tuning neural MT systems with a negative BERTScore loss—BERTTune—yields systematic improvements in both BERTScore and BLEU across four language pairs, establishing a practical method for reward-driven post-hoc model refinement (Unanue et al., 2021).
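One of the surrogates mentioned above, the dense expectation of embeddings, replaces the non-differentiable `argmax` lookup with a softmax-weighted average. A minimal NumPy illustration (in practice this lives inside an autodiff framework; the shared embedding table follows the shared-vocabulary requirement noted below, and the names are illustrative):

```python
import numpy as np

def soft_embeddings(logits: np.ndarray, emb_table: np.ndarray, tau: float = 1.0):
    """Differentiable surrogate for embedding-lookup of argmax tokens.

    logits:    (T, V) decoder scores over the vocabulary at each step
    emb_table: (V, d) the scoring LM's input embedding table
    Every operation here is smooth, so a BERTScore computed on the
    result can backpropagate into the logits.
    """
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)             # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p @ emb_table                               # (T, d) expected embeddings

# As tau -> 0 the expectation collapses to the hard argmax lookup.
rng = np.random.default_rng(2)
logits = rng.normal(size=(4, 10))     # 4 decoding steps, vocabulary of 10
table = rng.normal(size=(10, 3))      # toy embedding table, d = 3
soft = soft_embeddings(logits, table, tau=1e-3)
hard = table[logits.argmax(-1)]
```

The temperature `tau` trades off gradient smoothness against fidelity to the hard decoding path.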

Ensuring a shared vocabulary with the scoring LM and controlling overfitting (by limiting fine-tuning epochs) are critical for stability and generalization.

6. Relations to Other Embedding-Based and Optimal Transport Metrics

BERTScore belongs to a family of contextual-embedding–based evaluation metrics including MoverScore, YiSi, and BaryScore. BaryScore generalizes BERTScore by replacing greedy token alignment with Wasserstein distance between barycentric distributions over multi-layer embeddings. This approach is parameter-free (no layer tuning) and achieves consistent or superior correlation with human judgments, particularly in data-to-text and summarization tasks, albeit at higher computational cost (Colombo et al., 2021).
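To make the contrast with greedy max-matching concrete, here is a toy entropic optimal-transport score computed with Sinkhorn iterations. This is a simplified stand-in for illustration only, not BaryScore’s actual multi-layer barycentric Wasserstein computation:

```python
import numpy as np

def sinkhorn_score(S: np.ndarray, eps: float = 0.05, iters: int = 300):
    """Soft-alignment score via entropic optimal transport.

    Rather than each token greedily taking its max-similarity partner,
    a transport plan T with uniform marginals spreads mass over the
    cost matrix (1 - S); the score is the similarity under that plan.
    """
    n, m = S.shape
    K = np.exp(-(1.0 - S) / eps)          # Gibbs kernel of the cost matrix
    a = np.full(n, 1.0 / n)               # uniform candidate-token mass
    b = np.full(m, 1.0 / m)               # uniform reference-token mass
    v = np.ones(m)
    for _ in range(iters):                # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]       # transport plan
    return float((T * S).sum()), T

# Sanity check: a perfect one-to-one similarity matrix scores ~1.0
score, T = sinkhorn_score(np.eye(3))
```

Unlike the greedy max, the plan `T` must jointly account for all tokens, which penalizes candidates that let many reference tokens collapse onto one match.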

KG-BERTScore further integrates symbolic entity information, addressing factuality and named-entity preservation limitations of pure embedding-based metrics (Wu et al., 2023).

7. Best Practices, Recommendations, and Open Issues

BERTScore is most effective when:

  • The evaluation setting involves paraphrastic or syntactically diverse outputs.
  • Embeddings are drawn from appropriately pretrained LMs for the specific domain or language.
  • Layer selection maximizes clean-data correlation but may require adjustment (e.g., switch to lower layers or character-level models) for robust, out-of-domain use.
  • IDF weighting is applied in contexts where rare or content-critical words are more important.
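The IDF weighting in the last point amounts to replacing the uniform average in recall (and symmetrically in precision) with a weighted one. A small sketch; the plus-one smoothing is a common-practice assumption, and the function names are illustrative:

```python
import math
import numpy as np

def idf_weights(corpus, tokens):
    """Plus-one-smoothed IDF from a corpus of tokenized reference sentences."""
    N = len(corpus)
    return np.array([
        math.log((N + 1) / (1 + sum(t in doc for doc in corpus))) for t in tokens
    ])

def idf_recall(S: np.ndarray, ref_idf: np.ndarray) -> float:
    """IDF-weighted BERTScore recall: rare reference tokens contribute more.

    S: (n, m) cosine similarity matrix; ref_idf: (m,) reference-token weights.
    """
    best = S.max(axis=0)                  # best candidate match per reference token
    return float((ref_idf * best).sum() / ref_idf.sum())

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
idf = idf_weights(corpus, ["the", "cat"])    # "the" appears everywhere -> weight 0
S = np.array([[0.9, 0.2], [0.1, 0.8]])       # toy similarity matrix (2 cand x 2 ref)
r_plain = S.max(axis=0).mean()               # uniform recall: 0.85
r_idf = idf_recall(S, idf)                   # weight shifts toward "cat": 0.8
```

Here the ubiquitous function word “the” is effectively discounted, so the score tracks how well the content word “cat” is matched.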

Known limitations include insensitivity to numerical magnitude, factual mismatches, and critical lexical polarity, warranting either domain-specific modifications (e.g., numeric-aware or aspect-weighted scores) or hybrid evaluation schemes. For clinical and financial domains, augmented or interpolated versions (CBERTScore, hybrid numeric metrics) are required for risk-sensitive applications (Huang et al., 13 Nov 2025, Shor et al., 2023).

Recent research indicates a promising trajectory for BERTScore: extending evaluation with external knowledge graphs, integrating optimal-transport theory, and deploying dynamically adapted weighting schemes for domain-critical content, all while maintaining a semantics-driven, model-agnostic backbone. Continued dataset-driven diagnostics and cross-domain robustness investigations remain at the research frontier.
