MetricX-QE: Reference-Free Quality Estimation
- The paper introduces MetricX-QE, a reference-free QE method that utilizes an encoder-only Gemma 3 backbone to predict translation quality via a unified regression head.
- It incorporates innovative input formatting with score-type tokens, language tags, and markdown delimiters to robustly handle multilingual and document-level challenges.
- The paper applies a two-stage fine-tuning on heterogeneous data sources to significantly narrow the gap with reference-based metrics, even in low-resource scenarios.
MetricX-QE is the reference-free machine translation quality estimation (QE) variant introduced in the MetricX-25 system for the WMT25 Evaluation Shared Task. Built upon the multilingual open-weights model Gemma 3, MetricX-QE features architectural, input format, and training protocol enhancements over previous MetricX iterations. The method achieves state-of-the-art correlation with human quality judgments (both MQM and ESA) and closes much of the gap to reference-based evaluation metrics—even under low-resource or dialect-shift scenarios (Juraska et al., 28 Oct 2025).
1. Model Architecture
MetricX-QE uses an encoder-only backbone based on the Gemma 3 12B model. Unlike prior MetricX systems, which were built on mT5 and used both the encoder and the decoder, MetricX-QE discards the decoder entirely and fine-tunes only the encoder for regression:
- Input Representation: For each segment, the tokenized input (up to 4096 SentencePiece tokens) is encoded, yielding hidden states $h_1, \dots, h_T$, where $h_t \in \mathbb{R}^{d}$.
- Pooling & Regression Head: The final hidden states are mean-pooled: $\bar{h} = \frac{1}{T}\sum_{t=1}^{T} h_t$. A single linear layer maps $\bar{h}$ to a scalar prediction: $\hat{y} = w^{\top}\bar{h} + b$, with $\hat{y} \in \mathbb{R}$.
- Parameter Updates: Only the encoder weights, pooling layer, and regression head are fine-tuned; there is no additional pretraining or “up-training” phase.
This design enables scalable reference-free QE with a unified head for multiple scoring types, improving both representational richness and alignment with human ratings, especially for morphologically complex or multilingual data.
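The pooling-and-regression step described above can be sketched as follows. This is a minimal NumPy illustration; the variable names and shapes are assumptions, not the actual MetricX-25 implementation:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool final encoder hidden states over non-padding positions.

    hidden_states: (T, d) per-token vectors from the encoder.
    attention_mask: (T,) with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

def regression_head(pooled: np.ndarray, w: np.ndarray, b: float) -> float:
    """Single linear layer mapping the pooled vector to a scalar quality score."""
    return float(w @ pooled + b)
```

At inference, the scalar output is interpreted on the scale requested via the score-type token (see Section 2).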
2. Input Format and Preprocessing Enhancements
MetricX-QE introduces systematic improvements to input construction and preprocessing to handle WMT25’s multilingual and document-level characteristics:
- Score-Type Tokens: Each example is prepended with a special token—either “[MQM]” or “[ESA]”—that specifies the desired output quality scale.
- Language and Locale Tags: Metadata tags (e.g., “src_lang=en” and “tgt_lang=de”; including locale information such as “ar_EG” when available) are prepended to facilitate disambiguation of source and target varieties, essential for robust QE across dialects and zero-shot settings.
- Markdown-Style Segment Delimiters: Each text segment (source, translation, and—if present—reference) is enclosed in triple backticks, separated by blank lines for explicit boundary marking, supporting multi-paragraph or complex input.
- Hybrid Training Mode: Only source+reference and source-only example types are retained in stage 2; the reference-only configuration from previous versions is dropped to mirror MQM annotation protocols.
- Removal of Score Clipping: Unlike prior MetricX versions, no artificial clipping is applied to MQM predictions (formerly capped at 25), improving fairness and correlation in document-level and long-segment scenarios.
These input innovations enhance model reliability and stability across languages and segment structures.
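The input conventions above can be sketched as a simple serialization routine. The function name, field ordering, and exact layout are illustrative assumptions; the precise format used by MetricX-25 is internal to the system:

```python
from typing import Optional

# Triple-backtick delimiter, built programmatically so this snippet nests cleanly.
FENCE = "`" * 3

def format_qe_input(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                    score_type: str = "MQM", reference: Optional[str] = None) -> str:
    """Assemble a QE input string with a score-type token, language tags,
    and markdown-style segment delimiters (layout is an assumption)."""
    parts = [f"[{score_type}]", f"src_lang={src_lang}", f"tgt_lang={tgt_lang}"]
    for segment in [source, hypothesis] + ([reference] if reference is not None else []):
        parts.append(f"{FENCE}\n{segment}\n{FENCE}")
    return "\n\n".join(parts)
```

Passing a reference yields the source+reference hybrid mode; omitting it yields the source-only QE mode.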
3. Training Paradigm and Data Sources
MetricX-QE employs a staged and mixed fine-tuning regimen designed for progressive domain adaptation and robust regression on heterogeneous quality scales:
- Stage 1: DA-Only Pre-Fine-Tuning
- Uses z-normalized Direct Assessment (DA) scores as regression targets: $z = (s - \mu)/\sigma$, normalized per annotator.
- No score-type tokens in input.
- Stage 2: DA+MQM Joint Training
- Mixes DA-derived targets (rescaled to the 0–25 MQM range), raw MQM scores (uncapped), and synthetic data (replicating over/under-translation, unrelated content, missing punctuation).
- Examples are tagged with “[ESA]” or “[MQM]” as appropriate.
- Inference rescaling: Outputs are linearly remapped to the original ESA (0–100) or negative-MQM scales according to requested evaluation context.
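The inference-time remapping is a simple linear transform. The sketch below assumes the model's raw output lies on the 0–25 MQM error scale used during training; the actual ranges and direction of the mapping are assumptions:

```python
def rescale(score: float, src_range: tuple, dst_range: tuple) -> float:
    """Linearly remap a raw model score from its training scale to the
    requested evaluation scale."""
    (a, b), (c, d) = src_range, dst_range
    return c + (score - a) * (d - c) / (b - a)

# A raw MQM-style error score (0 = perfect, 25 = many errors) remapped to
# ESA, where 100 is perfect:
esa_score = rescale(12.5, (0.0, 25.0), (100.0, 0.0))  # → 50.0
```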
Training Data includes:
- DA annotations from WMT15–23 (excluding the into-English subset of WMT21)
- MQM from WMT20–23: en–de, en–es, en–ru, en–zh, he–en, zh–en, ja–zh
- ESA (Error Span Annotation) from WMT24 (9 language pairs)
- Synthetic perturbations (per Juraska et al., 2024)
Validation leverages WMT24 MQM (en–de, en–es, ja–zh) and ESA labels.
4. Objective Functions and Optimization
A single loss function is shared across prediction types. MetricX-QE applies mean squared error (MSE) regression:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2,$$

where $N$ is the batch size, $\hat{y}_i$ is the predicted score, and $y_i$ the reference score. The use of a unified regression head with score-type tokens eliminates the need for task-specific losses or heads, streamlining optimization.
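The shared objective is straightforward to express in code (a NumPy sketch; batching, masking, and gradient computation are omitted):

```python
import numpy as np

def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Mean squared error over a batch; the same objective is applied to
    DA-derived, MQM, and ESA-tagged examples alike."""
    return float(np.mean((predictions - targets) ** 2))
```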
Hyperparameters include:
- Backbone: Gemma 3 12B encoder
- Max input length: 4096 SPM tokens
- Optimizer: Adafactor
- Batch size: 128 (distributed over 64 TPU v3 cores)
- Learning rate: 5 × (stage 1), 1 × (stage 2), cosine annealing
- Warm-up steps: 100
- No monolingual up-training
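The warm-up plus cosine-annealing schedule listed above can be sketched as follows. The base learning rate and total step count are placeholders, since the exact values are not reproduced here:

```python
import math

def lr_schedule(step: int, total_steps: int, base_lr: float, warmup: int = 100) -> float:
    """Linear warm-up for `warmup` steps, then cosine annealing to zero.

    `base_lr` and `total_steps` are illustrative placeholders; MetricX-QE
    uses stage-specific base rates with 100 warm-up steps.
    """
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```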
5. Empirical Evaluation
MetricX-QE demonstrates substantial advances over previous MetricX versions, as shown in system-level and segment-level evaluations on WMT24 MQM and ESA test sets.
| Training | Input | Stage | seg en–de | seg en–es | seg ja–zh | seg Avg(ESA) | sys en–de | sys en–es | sys ja–zh | sys Avg(ESA) |
|---|---|---|---|---|---|---|---|---|---|---|
| 24-Hybrid | src+ref | DA→MQM | 53.20 | 68.50 | 53.90 | 87.40 | 79.90 | 89.70 | — | — |
| 25-QE* | src | DA→(DA+MQM) | 54.97 | 69.42 | 57.21 | 84.91 | 85.45 | 78.29 | 91.34 | 84.91 |
| 25-Hybrid* | src+ref | DA→(DA+MQM) | 55.45 | 69.14 | 57.72 | 87.61 | 85.82 | 77.00 | 92.00 | 87.61 |
Primary submissions are denoted by “*”.
Key empirical advances include:
- Significant segment-level accuracy increases over MetricX-24: +1.77 (en–de), +3.31 (ja–zh), +0.92 (en–es).
- System-level improvements, with the hybrid model (accepting both src+ref and src-only) performing best on system-level soft pairwise accuracy (SPA) for en–de, ja–zh, and Avg(ESA).
These results affirm the effectiveness of a unified backbone and input strategy across heterogeneous quality scales and annotation criteria (Juraska et al., 28 Oct 2025).
6. Innovations and Advances over Prior QE Systems
MetricX-QE’s core advances comprise:
- Encoder-Only Gemma 3 Backbone: Replaces mT5 with Gemma 3, yielding improved multilingual representation, especially for typologically diverse languages (e.g., en–ja, en–zh).
- Unified Regression Head with Score-Type Control: Allows a single system to predict both MQM and ESA via indicator tokens, eliminating scale-mismatch issues and increasing flexibility for future evaluation protocols.
- Explicit Input Markup: Score-type tokens, language/locale identifiers, and markdown delimiters provide robust signal segmentation, enhancing reliability and reducing spurious attention or boundary errors.
- Score Clipping Elimination: Avoids distortion in document-level evaluation, yielding higher true-to-human correlation.
- Progressive Two-Stage Fine-Tuning: Stabilizes early learning on DA, followed by broad convergence across both DA and MQM tasks, ensuring robust adaptation.
The outcome is a QE metric that narrows or closes the gap with reference-based approaches, even when references are scarce or in the presence of dialectal and data variability.
7. Limitations and Practical Impact
MetricX-QE, while reference-free, remains strongly dependent on the availability of high-quality annotations (DA, MQM, ESA) and on the pretraining robustness of Gemma 3. Inputs must strictly adhere to the prescribed formatting for reliable performance. No additional monolingual “up-training” is performed, which may limit adaptability under extreme domain shift.
Nevertheless, MetricX-QE represents a state-of-the-art approach to automatic, reference-free translation quality estimation, facilitating scalable assessment across multilingual and document-level use cases in MT research and production (Juraska et al., 28 Oct 2025).