MetricX-QE: Reference-Free Quality Estimation
- The paper introduces MetricX-QE, a reference-free QE method that utilizes an encoder-only Gemma 3 backbone to predict translation quality via a unified regression head.
- It incorporates innovative input formatting with score-type tokens, language tags, and markdown delimiters to robustly handle multilingual and document-level challenges.
- The paper applies a two-stage fine-tuning on heterogeneous data sources to significantly narrow the gap with reference-based metrics, even in low-resource scenarios.
MetricX-QE is the reference-free machine translation quality estimation (QE) variant introduced in the MetricX-25 system for the WMT25 Evaluation Shared Task. Built upon the multilingual open-weights model Gemma 3, MetricX-QE features architectural, input format, and training protocol enhancements over previous MetricX iterations. The method achieves state-of-the-art correlation with human quality judgments (both MQM and ESA) and closes much of the gap to reference-based evaluation metrics—even under low-resource or dialect-shift scenarios (Juraska et al., 28 Oct 2025).
1. Model Architecture
MetricX-QE uses an encoder-only backbone based on the Gemma 3 12B model. Unlike prior MetricX systems, which were built on mT5 and used both the encoder and the decoder, MetricX-QE discards the decoder entirely and fine-tunes only the encoder for regression:
- Input Representation: For each segment, the tokenized input (up to 4096 SentencePiece tokens) is encoded, yielding hidden states $h_1, \dots, h_T$, where $h_t \in \mathbb{R}^{d}$.
- Pooling & Regression Head: The final hidden states are mean-pooled: $\bar{h} = \frac{1}{T}\sum_{t=1}^{T} h_t$. A single linear layer maps $\bar{h}$ to a scalar prediction: $\hat{y} = w^{\top}\bar{h} + b$, with $\hat{y} \in \mathbb{R}$.
- Parameter Updates: Only the encoder weights, pooling layer, and regression head are fine-tuned; there is no additional pretraining or “up-training” phase.
This design enables scalable reference-free QE with a unified head for multiple scoring types, improving both representational richness and alignment with human ratings, especially for morphologically complex or multilingual data.
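The pooling-and-regression step described above can be sketched as follows. This is a minimal NumPy illustration; the variable names and shapes are assumptions, not the actual MetricX-25 implementation:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool final encoder hidden states over non-padding positions.

    hidden_states: (T, d) per-token vectors from the encoder.
    attention_mask: (T,) with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

def regression_head(pooled: np.ndarray, w: np.ndarray, b: float) -> float:
    """Single linear layer mapping the pooled vector to a scalar quality score."""
    return float(w @ pooled + b)
```

At inference, the scalar output is interpreted on the scale requested via the score-type token (see Section 2).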
2. Input Format and Preprocessing Enhancements
MetricX-QE introduces systematic improvements to input construction and preprocessing to handle WMT25’s multilingual and document-level characteristics:
- Score-Type Tokens: Each example is prepended with a special token—either “[MQM]” or “[ESA]”—that specifies the desired output quality scale.
- Language and Locale Tags: Metadata tags (e.g., “src_lang=en” and “tgt_lang=de”; including locale information such as “ar_EG” when available) are prepended to facilitate disambiguation of source and target varieties, essential for robust QE across dialects and zero-shot settings.
- Markdown-Style Segment Delimiters: Each text segment (source, translation, and—if present—reference) is enclosed in triple backticks, separated by blank lines for explicit boundary marking, supporting multi-paragraph or complex input.
- Hybrid Training Mode: Only source+reference and source-only example types are retained in stage 2; the reference-only configuration from previous versions is dropped to mirror MQM annotation protocols.
- Removal of Score Clipping: Unlike prior MetricX versions, no artificial clipping is applied to MQM predictions (formerly capped at 25), improving fairness and correlation in document-level and long-segment scenarios.
These input innovations enhance model reliability and stability across languages and segment structures.
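The input conventions above can be sketched as a simple serialization routine. The function name, field ordering, and exact layout are illustrative assumptions; the precise format used by MetricX-25 is internal to the system:

```python
from typing import Optional

# Triple-backtick delimiter, built programmatically so this snippet nests cleanly.
FENCE = "`" * 3

def format_qe_input(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                    score_type: str = "MQM", reference: Optional[str] = None) -> str:
    """Assemble a QE input string with a score-type token, language tags,
    and markdown-style segment delimiters (layout is an assumption)."""
    parts = [f"[{score_type}]", f"src_lang={src_lang}", f"tgt_lang={tgt_lang}"]
    for segment in [source, hypothesis] + ([reference] if reference is not None else []):
        parts.append(f"{FENCE}\n{segment}\n{FENCE}")
    return "\n\n".join(parts)
```

Passing a reference yields the source+reference hybrid mode; omitting it yields the source-only QE mode.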
3. Training Paradigm and Data Sources
MetricX-QE employs a staged and mixed fine-tuning regimen designed for progressive domain adaptation and robust regression on heterogeneous quality scales:
- Stage 1: DA-Only Pre-Fine-Tuning
- Uses z-normalized Direct Assessment (DA) scores as regression targets: $z = (s - \mu)/\sigma$, normalized per annotator.
- No score-type tokens in input.
- Stage 2: DA+MQM Joint Training
- Mixes DA-derived targets (rescaled to the 0–25 MQM range), raw MQM scores (uncapped), and synthetic data (replicating over/under-translation, unrelated content, missing punctuation).
- Examples are tagged with “[ESA]” or “[MQM]” as appropriate.
- Inference rescaling: Outputs are linearly remapped to the original ESA (0–100) or negative-MQM scales according to requested evaluation context.
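The inference-time remapping is a simple linear transform. The sketch below assumes the model's raw output lies on the 0–25 MQM error scale used during training; the actual ranges and direction of the mapping are assumptions:

```python
def rescale(score: float, src_range: tuple, dst_range: tuple) -> float:
    """Linearly remap a raw model score from its training scale to the
    requested evaluation scale."""
    (a, b), (c, d) = src_range, dst_range
    return c + (score - a) * (d - c) / (b - a)

# A raw MQM-style error score (0 = perfect, 25 = many errors) remapped to
# ESA, where 100 is perfect:
esa_score = rescale(12.5, (0.0, 25.0), (100.0, 0.0))  # → 50.0
```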
Training Data includes:
- DA annotations from WMT15–23 (excluding the into-English subset of WMT21)
- MQM from WMT20–23: en–de, en–es, en–ru, en–zh, he–en, zh–en, ja–zh
- ESA (Error Span Annotation) from WMT24 (9 language pairs)
- Synthetic perturbations (per Juraska et al., 2024)
Validation leverages WMT24 MQM (en–de, en–es, ja–zh) and ESA labels.
4. Objective Functions and Optimization
A single loss function is shared across prediction types. MetricX-QE applies mean squared error (MSE) regression:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2,$$

where $N$ is the batch size, $\hat{y}_i$ is the predicted score, and $y_i$ the reference score. The use of a unified regression head with score-type tokens eliminates the need for task-specific losses or heads, streamlining optimization.
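The shared objective is straightforward to express in code (a NumPy sketch; batching, masking, and gradient computation are omitted):

```python
import numpy as np

def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Mean squared error over a batch; the same objective is applied to
    DA-derived, MQM, and ESA-tagged examples alike."""
    return float(np.mean((predictions - targets) ** 2))
```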
Hyperparameters include:
- Backbone: Gemma 3 12B encoder
- Max input length: 4096 SPM tokens
- Optimizer: Adafactor
- Batch size: 128 (distributed over 64 TPU v3 cores)
- Learning rate: 5 × (stage 1), 1 × (stage 2), cosine annealing
- Warm-up steps: 100
- No monolingual up-training
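The warm-up plus cosine-annealing schedule listed above can be sketched as follows. The base learning rate and total step count are placeholders, since the exact values are not reproduced here:

```python
import math

def lr_schedule(step: int, total_steps: int, base_lr: float, warmup: int = 100) -> float:
    """Linear warm-up for `warmup` steps, then cosine annealing to zero.

    `base_lr` and `total_steps` are illustrative placeholders; MetricX-QE
    uses stage-specific base rates with 100 warm-up steps.
    """
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```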
5. Empirical Evaluation
MetricX-QE demonstrates substantial advances over previous MetricX versions, as shown in system-level and segment-level evaluations on WMT24 MQM and ESA test sets.
| Training | Input | Stage | seg en–de | seg en–es | seg ja–zh | seg Avg(ESA) | sys en–de | sys en–es | sys ja–zh | sys Avg(ESA) |
|---|---|---|---|---|---|---|---|---|---|---|
| 24-Hybrid | src+ref | DA→MQM | 53.20 | 68.50 | 53.90 | 87.40 | 79.90 | 89.70 | — | — |
| 25-QE* | src | DA→(DA+MQM) | 54.97 | 69.42 | 57.21 | 84.91 | 85.45 | 78.29 | 91.34 | 84.91 |
| 25-Hybrid* | src+ref | DA→(DA+MQM) | 55.45 | 69.14 | 57.72 | 87.61 | 85.82 | 77.00 | 92.00 | 87.61 |
Primary submissions are denoted by “*”.
Key empirical advances include:
- Significant segment-level accuracy increases over MetricX-24: +1.77 (en–de), +3.31 (ja–zh), +0.92 (en–es).
- System-level improvements, with the hybrid model (accepting both src+ref and src-only) performing best on system-level soft pairwise accuracy (SPA) for en–de, ja–zh, and Avg(ESA).
These results affirm the effectiveness of a unified backbone and input strategy across heterogeneous quality scales and annotation criteria (Juraska et al., 28 Oct 2025).
6. Innovations and Advances over Prior QE Systems
MetricX-QE’s core advances comprise:
- Encoder-Only Gemma 3 Backbone: Replaces mT5 with Gemma 3, yielding improved multilingual representation, especially for typologically diverse languages (e.g., en–ja, en–zh).
- Unified Regression Head with Score-Type Control: Allows a single system to predict both MQM and ESA via indicator tokens, eliminating scale-mismatch issues and increasing flexibility for future evaluation protocols.
- Explicit Input Markup: Score-type tokens, language/locale identifiers, and markdown delimiters provide robust signal segmentation, enhancing reliability and reducing spurious attention or boundary errors.
- Score Clipping Elimination: Avoids distortion in document-level evaluation, yielding higher true-to-human correlation.
- Progressive Two-Stage Fine-Tuning: Stabilizes early learning on DA, followed by broad convergence across both DA and MQM tasks, ensuring robust adaptation.
The outcome is a QE metric that narrows or closes the gap with reference-based approaches, even when references are scarce or in the presence of dialectal and data variability.
7. Limitations and Practical Impact
MetricX-QE, while reference-free, remains strongly dependent on the availability of high-quality annotations (DA, MQM, ESA) and on the pretraining robustness of Gemma 3. Inputs must strictly adhere to the prescribed formatting for reliable performance. No additional monolingual “up-training” is performed, which may limit adaptability under extreme domain shift.
Nevertheless, MetricX-QE represents a state-of-the-art approach to automatic, reference-free translation quality estimation, facilitating scalable assessment across multilingual and document-level use cases in MT research and production (Juraska et al., 28 Oct 2025).