
Untrained Automatic Metrics

Updated 7 February 2026
  • Untrained automatic metrics are evaluation measures that use fixed algorithms, such as BLEU and uPL, to compare system outputs against reference data without domain-specific training.
  • They employ diverse methods including n-gram matching, semantic embeddings, information theory, and perceptual losses to assess performance across various tasks.
  • Empirical studies reveal that while these metrics offer quick baseline evaluations, they face sensitivity, stability, and adaptation challenges in critical application areas.

Untrained automatic metrics are algorithmic evaluation measures applied in domains such as natural language generation (NLG), machine translation, domain adaptation, and image processing, which do not involve data-dependent parameter fitting or explicit human alignment. These metrics are applied “off the shelf” to compare system output against reference data, predict performance, or guide model selection, but are not themselves learned directly from domain-specific annotated data. Their informativeness, reliability, and limitations in diverse tasks are central to empirical evaluation and methodological development in applied machine learning.

1. Taxonomy and Mathematical Formalism of Untrained Metrics

Untrained automatic metrics span a variety of domains, from text to images. Canonical examples include string-match–based metrics in NLG and machine translation (BLEU, ROUGE), semantic embedding–based measures (BERTScore), information-theoretic functionals (mutual information, entropy), edit-based scores (TER), and network-based distances with fixed, random weights (untrained perceptual loss in imaging).

For text generation, let $\{y_i\}$ denote system hypotheses and $\{r_i\}$ reference outputs. Standard untrained metrics are defined as:

  • BLEU: Weighted geometric mean of modified n-gram precision, penalized for brevity.
  • METEOR: Harmonic mean of unigram precision and recall, plus chunk penalty for fragmentation.
  • BERTScore: Cosine similarity between contextual token embeddings (e.g., from a fixed BERT), aggregated via maximum matching.
  • Perplexity: $\mathrm{PPL}(y) = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log p(y_t \mid y_{<t})\right)$, computed under a frozen language model.
  • Edit- and order-based: Metrics such as TER (based on minimal edits per token) and RIBES (correlation of word order).

For imaging, untrained perceptual loss (uPL) defines a distance by comparing feature activations from randomly initialized, fixed-weight convolutional networks:

$$L_{\mathrm{uPL}}(x_1, x_2) = \sum_{\ell=1}^{L} \frac{1}{C_\ell H_\ell W_\ell D_\ell} \left\| \varphi_\ell(x_1) - \varphi_\ell(x_2) \right\|_2^2$$

where $\varphi_\ell(\cdot)$ is the feature tensor at layer $\ell$ of the random network and $C_\ell, H_\ell, W_\ell, D_\ell$ are its channel and spatial dimensions (Pfaehler et al., 2024).
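A minimal NumPy sketch of this idea follows, using a tiny 2D random convolutional net rather than the 3D CNN of the cited work; the function names, filter sizes, and the leaky-ReLU nonlinearity are illustrative assumptions:

```python
import numpy as np

def make_random_net(in_ch, n_layers=2, channels=4, k=3, seed=0):
    """Fixed bank of random Gaussian conv filters (the 'untrained' network)."""
    rng = np.random.default_rng(seed)
    dims = [in_ch] + [channels] * n_layers
    return [rng.standard_normal((dims[l + 1], dims[l], k, k)) for l in range(n_layers)]

def conv_layer(x, w, alpha=0.01):
    """Valid cross-correlation of a (C,H,W) tensor with (C_out,C_in,k,k)
    filters, followed by a leaky-ReLU nonlinearity."""
    c_out, _, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                out[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k])
    return np.where(out > 0, out, alpha * out)

def upl_distance(net, x1, x2):
    """Sum over layers of the size-normalized squared feature distance."""
    total, f1, f2 = 0.0, x1, x2
    for w in net:
        f1, f2 = conv_layer(f1, w), conv_layer(f2, w)
        total += np.mean((f1 - f2) ** 2)  # mean gives the 1/(C*H*W) normalization
    return total
```

Because the filters are fixed once at initialization, the distance is deterministic and symmetric in its two inputs.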

In unsupervised domain adaptation, mutual information–based and augmentation consistency metrics (e.g., ACM) operate on model posteriors without access to target annotations, using entropy, cross-domain consistency, and source-structure preservation (Chen et al., 2023).
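The entropy and consistency ingredients operate purely on model posteriors and can be sketched as follows; `augmentation_consistency` here is our simplified stand-in for the full ACM criterion, not the published formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_entropy(logits):
    """Average predictive entropy over target samples (low = confident).
    Note: on its own this can be gamed by degenerate, collapsed solutions."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def augmentation_consistency(logits_clean, logits_aug):
    """Fraction of target samples whose argmax prediction survives
    augmentation; a label-free proxy for robust adaptation."""
    return float((logits_clean.argmax(1) == logits_aug.argmax(1)).mean())
```

Both quantities require no target annotations, which is what makes them usable for unsupervised model selection.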

2. Task-Specific Instantiations and Adaptations

The behavior and efficacy of untrained metrics are tightly task-dependent; the linguistics of the domain, the annotation regime, and dataset properties often demand adaptations or combined approaches.

  • Machine Translation and Summarization: BLEU, ROUGE, METEOR (with adaptations such as synonym/rare-word bonuses in EBLEU or language-specific modules in METEOR-PL) are widely used (Wołk et al., 2016). For morphologically rich or paraphrase-heavy outputs (e.g., re-speaking in Polish), incorporating local resources (plWordNet, stemmers) and rewarding lexical diversity improves correlation with human references.
  • Sentiment-Oriented Text: Off-the-shelf BLEU, METEOR, and BERTScore fail to specifically penalize sentiment-critical errors. Negation or antonym replacements (polarity flips) yield nearly equivalent scores to non-critical changes due to uniform n-gram weighting and embedding proximity of antonyms (Saadany et al., 2021).
  • Summarization System Comparison: Classic system-level correlation (sys-level Corr) considers all system pairs, whereas thresholded Kendall's τ (sysΔ) focuses on close-in-score pairs to assess metric granularity. For small score gaps (e.g., ΔROUGE ≤ 0.5), standard metrics often correlate with human preference at near-random levels (Deutsch et al., 2022).
  • Domain Adaptation: Metrics like mutual information, augmented with source accuracy terms or consistency checks via a held-out MLP, can detect negative transfer effects that raw entropy or probability–based metrics miss (Chen et al., 2023).
  • Medical Imaging: uPL with a small, untrained 3D CNN captures essential local neighborhood structure in line-like volumetric data, outperforming both pixel-wise losses (L1, SSIM) and large-scale pre-trained networks for MR angiogram and plant root denoising (Pfaehler et al., 2024).
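The polarity-flip insensitivity described in the sentiment bullet above can be demonstrated with a toy unigram-overlap score, a crude stand-in for surface metrics like BLEU-1 or ROUGE-1 (the example sentences are ours):

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Harmonic mean of unigram precision and recall over whitespace tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # clipped token overlap
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

ref = "the service was not good"
flip = "the service was very good"    # polarity flipped: a critical error
noise = "the service was not grand"   # harmless lexical variation
# Both hypotheses differ from the reference by exactly one token, so a
# uniform-weight unigram metric assigns them identical scores despite
# their very different severity.
```

This is precisely the failure mode that sentiment-lexicon augmentation and error-type weighting are meant to address.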

3. Empirical Assessment and Observed Limitations

Comprehensive studies have revealed performance gaps between naive metric use and fine-grained task requirements.

  • Correlation Instabilities: Standard untrained metrics exhibit moderate correlation with human judgments in-domain, but rapidly degrade under domain or genre shift, especially for more abstractive, creative, or sentiment-heavy tasks (Ni'mah et al., 2023).
  • Sensitivity to Error Types: Metrics that treat all tokens/edits equally fail to reflect the true impact of critical semantic errors, such as sentiment polarity flips or missing negations (Saadany et al., 2021). Embedding-based metrics like BERTScore are affected by the “antonym proximity” problem.
  • Discrimination of Fine-Grained Differences: System-level metric rankings are less stable and less informative for hard, close-scoring system pairs. For summarization, improvements below 1 ROUGE point are unreliable as indicators of actual human-perceived quality improvement (Deutsch et al., 2022).
  • Adversarial and Overfitting Risks: For unsupervised metrics in domain adaptation, pure entropy or diversity–based measures can be gamed by degenerate solutions; source-structure constraints and augmentation consistency terms (ACM) are necessary countermeasures (Chen et al., 2023).

A summary table illustrates domain-specific findings:

| Metric / Context | Primary Failure / Limitation | Empirical Recommendation |
|---|---|---|
| BLEU, METEOR | Indistinguishable scoring for sentiment-altering vs. neutral unigram changes; inadequate for critical aspect errors | Augment with sentiment lexicons or error-type–weighted schemes (Saadany et al., 2021) |
| ROUGE | Poor agreement with humans on small system score gaps; inflated utility for easy pairings | Always report thresholded correlations (sysΔ); do not trust small score gains (Deutsch et al., 2022) |
| MI/Entropy (UDA) | Blind to source structure, adversarially maximizable, fails on over-alignment | Incorporate source accuracy, use held-out classifiers and augmentation consistency (Chen et al., 2023) |
| uPL (imaging) | N/A (outperforms baselines for line-like data) | Prefer small untrained models; avoid over-parameterization (Pfaehler et al., 2024) |

4. Meta-Evaluation Protocols and Frameworks

Recent research advocates for multi-faceted metric evaluation frameworks beyond individual correlation scores:

  • Metric Preference Checklist (Ni'mah et al., 2023): A toolkit for automatic metric selection involving five axes—zero-shot transfer stability, system-level discrimination (Kolmogorov-Smirnov distance), aspect-level discrimination, system-preference ranking similarity (Levenshtein-based), and aspect-level ranking to human reference. This checklist exposes mismatches between “raw” correlation and practical utility.
  • System-Level Correlation (Full Test Set): To align metrics with real-world application, system-level correlations should be computed on the entire test set (not just summaries with human judgments), which drastically lowers variance and tightens confidence intervals over system rankings (Deutsch et al., 2022).
  • Thresholded (Δ) Correlations: Instead of global rank agreement, focus on the metric's ability to discriminate between systems with small score differences, reporting sysΔ curves as a standard practice.
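A thresholded correlation of this kind can be sketched as follows. This is a simplified pairwise version (concordant minus discordant pairs over compared pairs); the published sysΔ protocol differs in detail:

```python
from itertools import combinations

def thresholded_tau(metric_scores, human_scores, delta):
    """Kendall-style tau restricted to system pairs whose metric score
    gap is at most `delta` (the hard, close-scoring pairs)."""
    conc = disc = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        m = metric_scores[i] - metric_scores[j]
        if abs(m) > delta:
            continue  # easy, well-separated pair: excluded from sysΔ
        h = human_scores[i] - human_scores[j]
        if m * h > 0:
            conc += 1
        elif m * h < 0:
            disc += 1
    n = conc + disc
    return (conc - disc) / n if n else float("nan")
```

Sweeping `delta` and plotting the resulting values yields the sysΔ curve described above.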

5. Adaptation, Parameterization, and Transferability

Metric adaptation to task, language, or data regime remains essential for robust performance:

  • Language and Domain Sensitivity: Morphologically rich or highly paraphrastic settings (e.g., re-speaking in Slavic languages) benefit from metrics adapted with local WordNets, stemmers, synonym and rare-word bonuses (EBLEU, METEOR-PL) (Wołk et al., 2016).
  • Aspect-Oriented Extensions: Explicit re-weighting for content words, sentiment lexica, or negation markers, as well as aspect-specific preference evaluation, mitigate insensitivity to crucial errors (Saadany et al., 2021).
  • Augmentation Consistency: In unsupervised adaptation, leveraging data augmentations exposes negative transfer and encourages evaluations that reflect real progress, enabling label-free hyperparameter selection (Chen et al., 2023).
  • Random-Feature Approaches: Metrics based on untrained networks (e.g., uPL) demonstrate that maximum-mean-discrepancy style statistics with random convolutional features can outperform domain-mismatched or overparameterized pre-trained alternatives, especially where pre-training data is scarce (Pfaehler et al., 2024).
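As an illustration of the aspect-oriented re-weighting in the second bullet, a toy recall score can up-weight tokens from a critical lexicon; the lexicon and weight here are illustrative assumptions, not those of the cited work:

```python
from collections import Counter

# Toy negation/sentiment lexicon; a real adaptation would use a curated
# resource such as a sentiment lexicon or WordNet-derived lists.
CRITICAL = {"not", "no", "never", "good", "bad"}

def weighted_overlap(hyp, ref, critical_weight=3.0):
    """Unigram recall in which lexicon tokens count `critical_weight`
    times, so dropping a negation costs more than dropping a filler word."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    w = lambda t: critical_weight if t in CRITICAL else 1.0
    hit = sum(w(t) * min(c, h[t]) for t, c in r.items())
    total = sum(w(t) * c for t, c in r.items())
    return hit / total
```

With `critical_weight=1.0` the score reduces to plain unigram recall, making the effect of the re-weighting easy to isolate.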

6. Practical Guidelines and Recommendations

Synthesizing methodological and empirical evidence, key recommendations for practitioners include:

  1. Combine Multiple Metrics: Use a mix of surface, semantic, and aspect-weighted metrics, validating them against small human-rated gold sets (Wołk et al., 2016, Ni'mah et al., 2023).
  2. Calibrate on Task: Adjust metric parameters, combine precision (BLEU), information-weighted (NIST), and semantic (METEOR, BERTScore), and adopt aspect-specific variants as appropriate.
  3. Report Fine-Grained Analysis: Always provide system-level discrimination and preference similarity, as well as thresholded correlation curves.
  4. Avoid Over-Reliance on Tiny Gains: Disregard metric improvements below empirically demonstrated discrimination thresholds (e.g., ΔROUGE < 1 point) unless corroborated by human evaluation (Deutsch et al., 2022).
  5. Incorporate Adaptation Components: Where applicable, supplement generic metrics with domain, aspect, or task-specific adaptations or unsupervised structural constraints.
  6. Validate Out-of-Domain Robustness: Explicitly test whether a metric’s utility generalizes to new genres/domains; anticipate and adjust for degradation in OOD regimes (Ni'mah et al., 2023).

7. Outlook and Open Challenges

Continuing work in untrained metric design and evaluation highlights core challenges:

  • Critical Error Sensitivity: Developing lexical, semantic, and embedding-based metrics that robustly penalize critical content or sentiment errors remains unresolved.
  • Human-Annotation Bottleneck: Untrained metrics cannot fully replace dense, high-quality human annotation for very small system differences or subjective qualitative attributes (Deutsch et al., 2022).
  • Universality vs. Task Specificity: The tension between universally deployable, off-the-shelf metrics and the necessity for bespoke adaptation underlines the ongoing need for both modular metric libraries and frameworks for systematic evaluation.
  • Plug-and-Play Random Networks: The promise of random-network–based metrics in imaging and other structured domains raises new questions about the theoretical limits and best practices for such “zero-shot” perceptual distances (Pfaehler et al., 2024).

Untrained automatic metrics continue to serve as baseline tools and sanity checks across machine learning subfields, but their limitations necessitate rigorous meta-evaluation, adaptation, and, where stakes are high, augmentation with human-aligned or hybrid learned metrics.
