MetricX-25: Unified MT Quality Evaluation
- MetricX-25 is an automatic machine translation evaluation model designed to predict both MQM and ESA quality scores using a unified regression head.
- Built on an encoder-only adaptation of the Gemma 3 architecture, it employs explicit score-type tokens and a two-stage fine-tuning strategy to enhance multilingual performance.
- Empirical results demonstrate 2–4 point gains in segment-level accuracy across languages, notably improving Japanese–Chinese evaluation over previous iterations.
MetricX-25 is an automatic machine translation (MT) evaluation model designed to predict both Multidimensional Quality Metrics (MQM) and Error Span Assessment (ESA) quality scores at the segment and system levels. Built upon the Gemma 3 architecture, MetricX-25 employs an encoder-only LLM regression backbone with a unified modeling approach for MQM and ESA. It was developed as part of submissions to the unified WMT25 Translation Evaluation Shared Task, achieving state-of-the-art performance relative to previous MetricX iterations and demonstrating strong results across multilingual and hybrid reference/QE modes (Juraska et al., 28 Oct 2025).
1. Architectural Principles
MetricX-25 discards the decoder stack of the 12-billion-parameter, 128k-context-window Gemma 3 LLM, retaining only the "Gemma Encoder" (Suganthan et al., 2025). Model weights are initialized from Gemma 3’s pre-trained decoder to preserve its multilingual representational power. A regression head is applied atop the final encoder layer: mean-pooling over the sequence dimension yields a single pooled vector $\bar{h} = \frac{1}{T}\sum_{t=1}^{T} h_t$, which a linear layer projects to a scalar quality score $\hat{y} = w^\top \bar{h} + b$.
Training minimizes the mean squared error (MSE) between predicted and target quality scores, $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$.
This architecture enables unified MQM/ESA regression within a single model head, facilitating flexible scoring protocols and maximizing representational transfer (Juraska et al., 28 Oct 2025).
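The scoring path described above (mean-pool, linear projection, MSE loss) can be sketched in a few lines of plain Python. The dimensions and weights below are made up for illustration; the real head sits atop Gemma Encoder hidden states.

```python
def mean_pool(hidden_states):
    """Average a list of T hidden-state vectors (each of size d) into one."""
    T, d = len(hidden_states), len(hidden_states[0])
    return [sum(h[k] for h in hidden_states) / T for k in range(d)]

def regression_head(pooled, w, b=0.0):
    """Project the pooled vector to a scalar quality score."""
    return sum(p * wk for p, wk in zip(pooled, w)) + b

def mse(pred, gold):
    """Squared error for a single (prediction, target) pair."""
    return (pred - gold) ** 2

# Hypothetical final-layer states for a 3-token sequence, hidden size 2.
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
pooled = mean_pool(states)                   # [1.0, 2/3]
score = regression_head(pooled, [3.0, 0.0])  # 3.0
loss = mse(score, 2.0)                       # 1.0
```

The single scalar output is what allows one head to serve both MQM and ESA: the score-type prefix in the input, not the head, selects the target distribution.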
2. Input Format and Tokenization
MetricX-25 retains Gemma 3's SentencePiece vocabulary while augmenting inputs with explicit segment markers and metadata. Each input comprises:
- Score-type prefix: One of two tokens (“MQM” or “ESA”) indicates the score type and the expected scale and distribution of the target score.
- Source and translation segments: Segments are enclosed in triple backticks (```), with source and translation sections clearly separated and dialect-specific language tags (e.g., en_US, ar_EG) prepended.
- Delimiters and whitespace: Blank lines divide segments, preventing confusion for long or multi-paragraph text.
These conventions support robust disambiguation of segment boundaries, dialect shifts, and the correct associative context for model predictions, accommodating both reference-based and QE-only (quality estimation) evaluation tasks (Juraska et al., 28 Oct 2025).
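A minimal sketch of assembling such an input follows. The overall structure (score-type prefix, dialect tags, triple-backtick fences, blank-line delimiters) follows the description above; the exact field labels are illustrative assumptions, not the model's literal template.

```python
FENCE = "`" * 3  # triple backticks, built programmatically for readability

def format_input(score_type, src_lang, tgt_lang, source, translation,
                 reference=None):
    """Assemble a MetricX-25-style input string (field labels are
    hypothetical; only the general layout is taken from the paper)."""
    assert score_type in ("MQM", "ESA")
    parts = [score_type,
             f"source ({src_lang}):\n{FENCE}\n{source}\n{FENCE}"]
    if reference is not None:  # reference-based mode; omit for QE-only
        parts.append(f"reference ({tgt_lang}):\n{FENCE}\n{reference}\n{FENCE}")
    parts.append(f"translation ({tgt_lang}):\n{FENCE}\n{translation}\n{FENCE}")
    return "\n\n".join(parts)  # blank lines separate the sections

example = format_input("MQM", "en_US", "de_DE",
                       "The cat sat on the mat.",
                       "Die Katze saß auf der Matte.")
```

Dropping the `reference` argument yields the QE-only variant of the same template, which is how a single model serves both evaluation modes.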
3. Fine-Tuning Regimen and Data Strategy
MetricX-25 employs a two-stage fine-tuning protocol:
- Stage 1 ("DA warm-up"): Trains on all WMT15–23 Direct Assessment (DA) scores (excluding WMT21), with DA scores z-normalized per year and clipped to a fixed range. This stage injects 5% synthetic corpora covering specific error patterns such as over-/undertranslation and fluent-but-unrelated outputs.
- Stage 2 ("Mixed MQM+ESA"): Fine-tunes on an equal mix of:
- DA data, relabeled with the "ESA" prefix (due to the DA scoring scale’s similarity to ESA)
- MQM data from WMT20–23 (scores on the MQM scale, uncapped for recent years), with the "MQM" score-type prefix.
- Synthetic data is injected proportionately into both groups.
- DA raw scores are linearly rescaled to the MQM target range.
- Only "source-only" and "source+reference" modalities are retained, with "reference-only" examples excluded.
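The DA-to-MQM rescaling can be sketched as a simple affine map. The affine form is the point being illustrated; the bounds used here (raw DA in [0, 100], MQM-style penalties in [0, 25], lower = better) are assumptions, not constants quoted from the paper.

```python
def rescale_da_to_mqm(da_score, da_max=100.0, mqm_max=25.0):
    """Linearly map a raw DA score in [0, da_max] (higher = better) onto
    an MQM-style penalty range [0, mqm_max] (lower = better).

    The bounds da_max=100 and mqm_max=25 are illustrative assumptions.
    """
    return mqm_max * (1.0 - da_score / da_max)

assert rescale_da_to_mqm(100.0) == 0.0  # perfect DA score -> zero penalty
assert rescale_da_to_mqm(0.0) == 25.0   # worst DA score -> maximum penalty
```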
Key training hyperparameters include a batch size of 128, the Adafactor optimizer (Shazeer & Stern, 2018), and 10,000 steps per stage, with separate peak learning rates for stages 1 and 2. Checkpoints are selected by averaging performance over the 10 best candidates on WMT24 MQM and ESA validation data, using segment-level tie-calibrated pairwise accuracy and system-level soft pairwise accuracy as selection metrics (Juraska et al., 28 Oct 2025).
4. Quantitative Performance and Ablation Results
MetricX-25's efficacy is assessed via:
- Segment-level tie-calibrated pairwise accuracy (PA)
- System-level soft pairwise accuracy (SPA)
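As a rough illustration of the segment-level metric, the following computes pairwise ranking agreement with a tie threshold on metric-score differences. This is a simplified stand-in: the actual tie-calibrated protocol calibrates the threshold on held-out data rather than fixing it.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, gold_scores, tie_eps=0.0):
    """Fraction of segment pairs where the metric's ranking (with a tie
    threshold tie_eps on metric-score differences) agrees with the gold
    ranking. A simplified sketch of tie-calibrated pairwise accuracy."""
    def sign(x, eps=0.0):
        return 0 if abs(x) <= eps else (1 if x > 0 else -1)
    pairs = list(combinations(range(len(metric_scores)), 2))
    agree = sum(
        sign(metric_scores[i] - metric_scores[j], tie_eps)
        == sign(gold_scores[i] - gold_scores[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```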
Protocol and Model Variant Comparison
| Protocol | en-de (PA/SPA) | en-es (PA/SPA) | ja-zh (PA/SPA) | ESA avg (PA/SPA) |
|---|---|---|---|---|
| DA only | 50.41/85.62 | 68.51/82.13 | 54.48/92.62 | 54.72/87.65 |
| MQM only | 54.71/85.55 | 68.80/78.56 | 56.36/89.94 | 54.20/86.45 |
| DA + MQM (mix) | 54.71/85.21 | 68.92/78.89 | 56.16/90.67 | 54.91/87.80 |
| DA→MQM (2-stage) | 55.40/85.91 | 69.11/78.12 | 57.90/93.64 | 55.00/85.74 |
| DA→(DA + MQM) | 55.66/86.60 | 69.25/77.92 | 58.24/92.88 | 55.14/87.08 |
Comparison with MetricX-24 (mT5-XXL backbone):
| Variant | en-de (PA/SPA) | en-es (PA/SPA) | ja-zh (PA/SPA) | ESA avg (PA/SPA) |
|---|---|---|---|---|
| 24-Hybrid (QE) | 53.20/87.40 | 68.50/79.90 | 53.90/89.70 | – / – |
| 25-Hybrid (QE)* | 55.45/85.82 | 69.14/77.00 | 57.72/92.00 | 54.87/87.61 |
The results show segment-level PA gains of 2–4 points for MetricX-25 over MetricX-24 across all language pairs, with especially strong gains for ja-zh (Japanese–Chinese), reflecting Gemma 3's stronger representations for these languages (Juraska et al., 28 Oct 2025).
5. Component Ablations and Analysis
- Score-type indicator tokens ("MQM"/"ESA") enable a unified model to emulate different target score distributions without interference between tasks.
- Two-stage+mixing delivers the highest system-level SPA, supporting the efficacy of staged and mixed DA/MQM fine-tuning.
- Excluding "reference-only" data in stage 2 prevents the inclusion of examples lacking a source segment, reducing the risk of spurious QE behavior.
- Removing score clipping for long-document MQM examples with uncapped targets showed no negative side effects.
- The addition of language prefix tags is qualitatively beneficial for detecting untranslated fragments, though no controlled ablation was performed.
These experiments collectively demonstrate the criticality of explicit score-type conditioning, rich hybrid input structuring, and two-stage training for robust, adaptable MQM/ESA regression (Juraska et al., 28 Oct 2025).
6. Strengths and Identified Limitations
Strengths:
- Encoder-only multilingual foundation (Gemma 3) yields consistent segment-level gains over mT5.
- Single regression head for both MQM and ESA, negating the need for model duplication.
- Input scheme supports both QE-only and reference-based evaluation seamlessly.
- The two-stage training strategy with synthetic error-coverage enables generalization across unseen and rare error types.
- Demonstrated performance on a decade of WMT evaluation data.
Limitations:
- System-level SPA gains are inconsistent, with minor drops on some language pairs, indicating possible need for further calibration.
- Absence of controlled ablation for the language prefix.
- Exclusive use of an encoder-only model; decoder or cross-attention variants may offer additional benefits, representing a direction for further research (Juraska et al., 28 Oct 2025).
7. Comparative Context and Impact
MetricX-25 marks a significant development in unified MT quality prediction, advancing both the architectural and training methodology landscape. Compared to its predecessor (mT5-XXL-based MetricX-24), MetricX-25 narrows or surpasses previous segment-level performance gaps for multiple language pairs, notably benefiting language pairs where Gemma 3 provides stronger multilingual depth. Techniques from MetricX-25 also underpin the error span detection methods in the companion GemSpanEval model, which surpasses prompting-based LLM baselines for this subtask.
The augmentation of input format, explicit score conditioning, and rigorous staged fine-tuning on diverse and synthetic error-rich data, together with robust empirical validation across a broad range of language pairs and tasks, establishes MetricX-25 as a robust, extensible framework for quality estimation and reference-based MT evaluation (Juraska et al., 28 Oct 2025).