
WMT25 MT Evaluation Shared Task

Updated 20 January 2026
  • The WMT25 Evaluation Shared Task is a benchmark that advances MT assessment by integrating both automated metrics and human evaluations over diverse language pairs.
  • It employs metric-aware decoding techniques such as MBR and QE re-ranking to optimize translation outputs and ensure cross-domain robustness.
  • Empirical results highlight significant gains in BLEU, COMET, and character-level scores, demonstrating its impact on both constrained and unconstrained tracks.

The WMT25 Evaluation Shared Task encompasses the latest advances in automatic and human evaluation protocols for general machine translation (MT), with extended shared tasks on quality scoring and error span detection. The evaluation phase aggregates outputs from diverse MT systems—including proprietary LLMs, open-weight models, and API-based systems—using standardized metrics (reference-based, reference-less, and QE), with a strong emphasis on metric-aware decoding, cross-domain robustness, and scaling to low-resource and typologically diverse languages. The core architectural innovations, experimental protocols, and observed empirical trends at WMT25 reflect the convergence of pretrained LLM adaptation, hybrid evaluation measures, and metric-driven re-ranking across all system types.

1. Overview of the WMT25 Evaluation Shared Task

The WMT25 Evaluation Shared Task is structured around rigorous automated and human-centric benchmarking of machine translation systems across 32 language pairs spanning news, social media, speech transcripts, and literary domains (Kocmi et al., 11 Aug 2025). System submissions were allowed under both constrained (≤20B parameters, open weights) and unconstrained tracks, with outputs evaluated using multiple metrics. The unified framework further included specialized subtasks for translation quality score prediction (MQM and ESA scales) and error span detection, with formal protocols for metric aggregation, segment/system-level scoring, and robust ranking (Juraska et al., 28 Oct 2025).

2. Metrics and Evaluation Protocols

Automatic evaluation utilized a multi-metric ensemble, combining reference-based, reference-less, and QE-driven scores:

  • LLM-as-Judge: e.g., GEMBA-ESA, CommandA, GPT-4.1; zero-reference scoring by in-context prompting for adequacy and fluency at paragraph level.
  • Reference-based Neural Metrics: MetricX-25, XCOMET, COMET, BLEURT; regression models over contextual embeddings from source, MT, and reference (Haq et al., 17 Sep 2025, Juraska et al., 28 Oct 2025).
  • Quality Estimation (QE): COMET-KIWI, MetricX-QE; prediction without ground-truth references.
  • Character-level Metrics: chrF++, BLEU (for low-resource pairs); chrF++ computes an F-score over character n-grams combined with word n-grams.

The official AutoRank aggregates per-system scores from these metrics via median–interpercentile robust scaling, cross-metric averaging, and linear remapping of ranks. No statistical significance testing is performed at the preliminary stage, but paired tests and bootstrapping are recommended in final reporting (Kocmi et al., 11 Aug 2025).
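The aggregation pipeline described above can be sketched as follows. This is a minimal illustration, not the official implementation: the percentile choice, tie handling, and the final linear rank remapping used by AutoRank are assumptions here.

```python
import statistics

def robust_scale(scores):
    """Median-interpercentile scaling: center on the median and divide by the
    interquartile range so metrics on different scales become comparable."""
    med = statistics.median(scores)
    q1, _, q3 = statistics.quantiles(scores, n=4)
    spread = (q3 - q1) or 1.0
    return [(s - med) / spread for s in scores]

def autorank(metric_scores):
    """metric_scores: {metric_name: {system: raw_score}}, higher = better.
    Returns systems ordered best-first by the cross-metric average of
    robustly scaled per-system scores."""
    systems = sorted(next(iter(metric_scores.values())))
    scaled = {}
    for metric, per_system in metric_scores.items():
        vals = robust_scale([per_system[s] for s in systems])
        scaled[metric] = dict(zip(systems, vals))
    avg = {s: sum(scaled[m][s] for m in metric_scores) / len(metric_scores)
           for s in systems}
    return sorted(systems, key=lambda s: -avg[s])
```

The robust scaling step is what allows metrics with incompatible ranges (e.g., chrF++ in [0, 100] and COMET-style scores near [0, 1]) to be averaged without one metric dominating.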

| Metric Family | Reference Use | Score Range / Format |
|---|---|---|
| LLM-as-Judge | None | Adequacy and fluency scores in [0, 100] |
| Neural reference-based | Source, MT, Ref | Real-valued regression ([-1, 1] or MQM/ESA scale) |
| QE | Source, MT | Real-valued ([0, 1]) |
| chrF++, BLEU | Reference-based | chrF++ F-score in [0, 100]; BLEU in [0, 1] |
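As a concrete illustration of the character-level family, a minimal chrF-style score can be computed as below. This sketch covers only the character n-gram F-score; the word n-gram component of chrF++ and its exact default parameters are omitted.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Character n-gram F-beta score in [0, 100]; beta=2 weights recall
    twice as heavily as precision, as in standard chrF."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p, rec = sum(precisions) / max_n, sum(recalls) / max_n
    if p + rec == 0.0:
        return 0.0
    return 100.0 * (1 + beta**2) * p * rec / (beta**2 * p + rec)
```

Because it operates on characters rather than tokens, this family degrades gracefully on morphologically rich or low-resource languages where word-level tokenization is unreliable.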

The Error Span Annotation (ESA) protocol is used for post-hoc human ranking and is regarded as more reliable than automatic metrics; a subset of roughly 18 systems per language pair is evaluated with ESA judgments.

3. Metric-Aware Decoding and Re-ranking Techniques

A defining feature of WMT25 was the adoption of metric-aware inference strategies:

  • Minimum Bayes Risk (MBR) Decoding: Systems select the translation $\hat{y} = \arg\min_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} p(y')\,L(y, y')$, using negative learned metrics (COMET, MetricX) as the loss $L$ over n-best candidates (Gilabert et al., 18 Aug 2025).
  • Quality Estimation (QE) Re-ranking: Hypothesis $y_i$ is selected as $\arg\max_i g_\phi(x, y_i)$, scoring n-best outputs with a reference-free QE model (Kocmi et al., 11 Aug 2025).
  • Hybrid reference/reference-less scoring: e.g., MetricX-25 uses a 2-way hybrid input mode with segment-level and sentence/document-level prediction (Juraska et al., 28 Oct 2025).
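The two selection rules above can be sketched generically. The loss and QE scorer here are toy stand-ins for learned metrics such as COMET or COMET-KIWI:

```python
def mbr_decode(candidates, probs, loss):
    """Minimum Bayes Risk: return the candidate minimizing expected loss
    against the candidate pool, weighted by model probabilities p(y')."""
    def risk(y):
        return sum(p * loss(y, y_prime) for y_prime, p in zip(candidates, probs))
    return min(candidates, key=risk)

def qe_rerank(source, candidates, qe_score):
    """QE re-ranking: return argmax_i g(x, y_i) under a reference-free
    quality-estimation scorer."""
    return max(candidates, key=lambda y: qe_score(source, y))
```

In practice the candidate pool is an n-best list or a set of samples from the MT system, and `loss` is the negation of a learned metric, which makes MBR quadratic in the pool size.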

This suggests that systems tuned to maximize official metrics under AutoRank—particularly via MBR or QE—may obtain artificially higher preliminary rankings, with potential for reward-hacking; final ESA human scores are intended to counterbalance such metric optimization biases.

4. Advancements in Quality Prediction and Error Span Detection

The WMT25 task advanced reference-based and reference-less evaluation architectures:

  • MetricX-25 replaced encoder–decoder mT5 with Gemma 3 12B’s encoder, applying mean-pool regression for MQM/ESA multi-task prediction. Training involved staged fine-tuning on direct assessment (DA) and MQM scales, with input preambles indicating score type (Juraska et al., 28 Oct 2025).
  • GemSpanEval utilized Gemma 3 27B as a decoder-only generator, outputting JSON error objects including span, severity, category, and disambiguated context for non-unique substrings (greedy context extension per token). Character-level F1 was used as the official metric (Juraska et al., 28 Oct 2025).
  • Long-context COMET Estimators: Document-level quality estimation by concatenating K=2 segments, scoring via character-weighted averages, demonstrated +0.32 Pearson r gain over sentence-level baselines (Haq et al., 17 Sep 2025).
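The character-level F1 used for error span detection can be sketched as set overlap between predicted and gold character indices. The half-open [start, end) offset convention below is an assumption for illustration:

```python
def span_char_f1(pred_spans, gold_spans):
    """Character-level F1 between predicted and gold error spans, each
    given as (start, end) character offsets with end exclusive."""
    pred = {i for s, e in pred_spans for i in range(s, e)}
    gold = {i for s, e in gold_spans for i in range(s, e)}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Scoring at the character level gives partial credit for overlapping but misaligned spans, which token-level matching would penalize fully.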

Notably, training quality estimation models on fused MQM, SQM, and DA human annotations (normalized to [0,1]) enabled the learning of unified regression functions, generalizing over annotation scales and improving correlations with human judgments.
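The fusion step can be sketched as a per-scale min-max normalization. The nominal bounds below are assumptions for illustration; the exact ranges used to fuse WMT25 training annotations may differ.

```python
# Assumed nominal scale bounds (hypothetical; actual WMT25 ranges may differ).
SCALE_BOUNDS = {
    "DA": (0.0, 100.0),    # direct assessment
    "SQM": (0.0, 6.0),     # scalar quality metric
    "MQM": (-25.0, 0.0),   # accumulated negative error penalties
}

def normalize(score, scale):
    """Map a raw human score onto [0, 1] so one regression head can be
    trained across annotation schemes; out-of-range scores are clipped."""
    lo, hi = SCALE_BOUNDS[scale]
    return (min(max(score, lo), hi) - lo) / (hi - lo)
```

After this mapping, a single regression target is shared across MQM, SQM, and DA data, which is what lets the unified model generalize over annotation scales.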

5. System Adaptation Strategies and Generalizability

Participating systems employed distinct architectural and adaptation strategies:

  • SALAMANDRATA Family (BSC): Multi-bridge continual pre-training on massive parallel datasets (en, es, ca pivots), instruction tuning, and vocabulary extension for Asian/Arabic scripts (SentencePiece adaptation, embedding mean-init). Decoding strategies included both MBR and Tuned Re-ranking with COMET(-KIWI) (Gilabert et al., 18 Aug 2025).
  • In2x: Three-stage LLM adaptation for Japanese—continue pretraining (2.5T tokens), SFT (1.5M instructions), RL (0.5M held-out samples with reward models for STEM exactness/style/idiom compliance), and model ensembling. Expressiveness-first supervision and targeted human evaluation on style/idioms were emphasized (Pang et al., 20 Aug 2025).
  • JGU Mainz: Parameter-efficient LoRA finetuning of Qwen2.5-3B-Instruct for MT + QA, multi-corpus integration, RAG for Ukrainian QA, and permutation-based probability averaging for MCQ ensembling under resource constraints (Saadi et al., 26 Sep 2025).
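Vocabulary extension with mean-initialized embeddings, as used in the SALAMANDRATA adaptation, can be sketched in pure Python; real implementations operate on the model's embedding matrix tensor directly.

```python
def extend_embedding_matrix(embeddings, num_new_tokens):
    """Append rows for new vocabulary items (e.g., Asian/Arabic script
    pieces added to the tokenizer), each initialized to the mean of the
    existing rows so new tokens start near the center of the learned
    embedding space rather than at random."""
    dim = len(embeddings[0])
    n = len(embeddings)
    mean = [sum(row[i] for row in embeddings) / n for i in range(dim)]
    return embeddings + [list(mean) for _ in range(num_new_tokens)]
```

Mean initialization gives the continued pre-training stage a sensible starting point for the new script tokens instead of forcing it to recover from random noise.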

Plausible implication: The use of high-resource linguistic hubs, synthetic data augmentation, and robust instruction sampling (difficulty-based, clustering) provides a generalizable template for rapid adaptation of LLMs to low-resource languages and complex stylistic registers (Pang et al., 20 Aug 2025, Gilabert et al., 18 Aug 2025).

6. Selected Performance Results

Representative results from WMT25 include:

  • In2x (unconstrained): BLEU=34.0, COMET=0.756, Human Style=80.2, WMT Rank=1 (en–ja, ja–en); the constrained variant ranks 2 (Pang et al., 20 Aug 2025).
  • SALAMANDRATA-7B: COMET=85.3 after instruction tuning, rises to 87.2 with MBR decoding; per-language BLEU, CHRF, METRICX scores follow similar trends (Gilabert et al., 18 Aug 2025).
  • MetricX-25: Segment-level pairwise accuracy (PA) gains over predecessor: en-de +2.46 pts, ja-zh +4.34 pts (Juraska et al., 28 Oct 2025).
  • GemSpanEval: Character-level F1 up to 37.09% for ja-zh after WMT24 training, surpassing encoder-only XCOMET in some language pairs (Juraska et al., 28 Oct 2025).
  • JGU Mainz: Sorbian MT improvement +54.4 ChrF++ (DE–DSB), +65.7 ChrF++ (DE–HSB); QA accuracy +5.85 to +12.34 pp (Saadi et al., 26 Sep 2025).

Instruction tuning, metric-aware decoding, and hybrid reference/reference-less architectures were repeatedly shown to improve translation robustness, low-resource transfer, and reliability under character-level noise.

7. Limitations, Biases, and Future Directions

Several biases and technical limitations were identified:

  • Metric Optimization Bias: MBR/QE decoding tuned on AutoRank metrics may inflate preliminary rankings; human ESA scores remain the standard for official system ranking (Kocmi et al., 11 Aug 2025).
  • Reference Bias: Reference-based metrics reward literal translation; LLM judges mitigate but may overlap with system training domains.
  • Low-resource instability: chrF++ poorly correlates with human judgment for languages like Bhojpuri, Maasai.
  • Paragraph averaging and ASR noise: Scoring at paragraph level can dilute segment-level errors, while recognition errors in speech transcripts can amplify apparent translation errors.

Future work targets robust significance testing (bootstrap), enhanced context windows for quality estimation, improved regularization and multi-head regressors (MQM/ESA), and dynamic vocabulary extension for pivoting to ultra-low-resource settings (Juraska et al., 28 Oct 2025, Gilabert et al., 18 Aug 2025). The systematic interplay of LLM adaptation, multi-metric evaluation, and metric-driven generation constitutes the foundation for further advances in WMT shared tasks and broader MT evaluation.
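The paired bootstrap recommended for significance testing can be sketched as follows; resampling the same segment indices for both systems is what makes the comparison paired.

```python
import random

def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples (drawn over the same segments for
    both systems) in which system A's total score exceeds system B's.
    Values near 1.0 indicate A is reliably better; ties count as non-wins."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```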
