
MetricX-QE: Reference-Free Quality Estimation

Updated 15 January 2026
  • The paper introduces MetricX-QE, a reference-free QE method that utilizes an encoder-only Gemma 3 backbone to predict translation quality via a unified regression head.
  • It incorporates innovative input formatting with score-type tokens, language tags, and markdown delimiters to robustly handle multilingual and document-level challenges.
  • The paper applies a two-stage fine-tuning on heterogeneous data sources to significantly narrow the gap with reference-based metrics, even in low-resource scenarios.

MetricX-QE is the reference-free machine translation quality estimation (QE) variant introduced in the MetricX-25 system for the WMT25 Evaluation Shared Task. Built upon the multilingual open-weights model Gemma 3, MetricX-QE features architectural, input format, and training protocol enhancements over previous MetricX iterations. The method achieves state-of-the-art correlation with human quality judgments (both MQM and ESA) and closes much of the gap to reference-based evaluation metrics—even under low-resource or dialect-shift scenarios (Juraska et al., 28 Oct 2025).

1. Model Architecture

MetricX-QE uses an encoder-only backbone based on the Gemma 3 12B model. Unlike prior MetricX systems, which were built on mT5 and used both the encoder and decoder, MetricX-QE discards the decoder entirely and fine-tunes only the encoder for regression:

  • Input Representation: For each segment, the tokenized input $x = [x_1 \ldots x_L]$ (up to 4096 SentencePiece tokens) is encoded, yielding hidden states $h_t = \mathrm{Enc}(x_1, \ldots, x_t) \in \mathbb{R}^d$, where $d = 8192$.
  • Pooling & Regression Head: The final hidden states are mean-pooled: $\bar{h} = \frac{1}{L} \sum_{t=1}^{L} h_t$. A single linear layer maps $\bar{h}$ to a scalar prediction: $\hat{s} = w^T \bar{h} + b$, with $w \in \mathbb{R}^d$.
  • Parameter Updates: Only the encoder weights, pooling layer, and regression head are fine-tuned; there is no additional pretraining or “up-training” phase.

This design enables scalable reference-free QE with a unified head for multiple scoring types, improving both representational richness and alignment with human ratings, especially for morphologically complex or multilingual data.
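The pooling-and-regression step above can be sketched in a few lines of numpy. This is an illustrative sketch only: the toy dimensions (and the function name `predict_score`) are assumptions, standing in for the real encoder's $d = 8192$ hidden states.

```python
import numpy as np

def predict_score(hidden_states: np.ndarray, w: np.ndarray, b: float) -> float:
    """Mean-pool per-token encoder states and apply a linear regression head.

    hidden_states: (L, d) array of encoder outputs for L tokens.
    w: (d,) weight vector of the regression head; b: scalar bias.
    """
    h_bar = hidden_states.mean(axis=0)  # mean pooling over the L tokens
    return float(w @ h_bar + b)         # scalar quality prediction

# Toy example with d = 4 instead of the model's d = 8192
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 4))   # stand-in for encoder hidden states
w = rng.normal(size=4)
score = predict_score(h, w, b=0.5)
```

Because the head is a single linear layer over the pooled state, all score types share one set of output parameters; the score-type token in the input (Section 2) is what steers the prediction scale.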

2. Input Format and Preprocessing Enhancements

MetricX-QE introduces systematic improvements to input construction and preprocessing to handle WMT25’s multilingual and document-level characteristics:

  • Score-Type Tokens: Each example is prepended with a special token—either “[MQM]” or “[ESA]”—that specifies the desired output quality scale.
  • Language and Locale Tags: Metadata tags (e.g., “src_lang=en” and “tgt_lang=de”; including locale information such as “ar_EG” when available) are prepended to facilitate disambiguation of source and target varieties, essential for robust QE across dialects and zero-shot settings.
  • Markdown-Style Segment Delimiters: Each text segment (source, translation, and—if present—reference) is enclosed in triple backticks, separated by blank lines for explicit boundary marking, supporting multi-paragraph or complex input.
  • Hybrid Training Mode: Only source+reference and source-only example types are retained in stage 2; the reference-only configuration from previous versions is dropped to mirror MQM annotation protocols.
  • Removal of Score Clipping: Unlike prior MetricX versions, no artificial clipping is applied to MQM predictions (formerly capped to $[0, 25]$), improving fairness and correlation in document-level and long-segment scenarios.

These input innovations enhance model reliability and stability across languages and segment structures.
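A minimal sketch of the input construction described above: the score-type tokens ([MQM]/[ESA]), language tags, and triple-backtick segment delimiters are taken from the paper, but the exact ordering, separators, and the helper name `build_qe_input` are assumptions for illustration.

```python
from typing import Optional

TICKS = "`" * 3  # markdown-style triple-backtick segment delimiter

def build_qe_input(score_type: str, src_lang: str, tgt_lang: str,
                   source: str, translation: str,
                   reference: Optional[str] = None) -> str:
    """Assemble a MetricX-QE-style input string (illustrative template)."""
    parts = [
        f"[{score_type}]",             # score-type token: [MQM] or [ESA]
        f"src_lang={src_lang}",        # language/locale tags, e.g. "en", "ar_EG"
        f"tgt_lang={tgt_lang}",
        f"{TICKS}\n{source}\n{TICKS}",       # source segment
        f"{TICKS}\n{translation}\n{TICKS}",  # translation segment
    ]
    if reference is not None:          # hybrid (src+ref) mode only
        parts.append(f"{TICKS}\n{reference}\n{TICKS}")
    return "\n\n".join(parts)          # blank lines mark segment boundaries
```

A source-only (QE) call omits the reference, mirroring the two example types retained in stage 2 training.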

3. Training Paradigm and Data Sources

MetricX-QE employs a staged and mixed fine-tuning regimen designed for progressive domain adaptation and robust regression on heterogeneous quality scales:

  • Stage 1: DA-Only Pre-Fine-Tuning
    • Uses z-normalized Direct Assessment (DA) scores: $z_i = \frac{DA_i - \mu}{\sigma}$, with targets $y_i = -\mathrm{clip}(z_i, -1, 1)$.
    • No score-type tokens in input.
  • Stage 2: DA+MQM Joint Training
    • Mixes DA-derived targets (rescaled to the MQM range: $\hat{y}_i^{DA} = \frac{DA_i}{100} \times \max(\mathrm{MQM\_range})$), raw MQM scores (uncapped), and synthetic data (replicating over/under-translation, unrelated content, missing punctuation).
    • Examples are tagged with “[ESA]” or “[MQM]” as appropriate.
    • Inference rescaling: Outputs are linearly remapped to the original ESA (0–100) or negative-MQM scales according to requested evaluation context.
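The target construction for the two stages can be sketched as follows. The constant `MQM_MAX = 25` (matching the former $[0, 25]$ cap) and the linear ESA remapping at inference are assumptions for this sketch, not values stated in the source.

```python
import numpy as np

MQM_MAX = 25.0  # assumed max of the MQM score range (former clipping bound)

def stage1_target(da_score: float, mu: float, sigma: float) -> float:
    """Stage 1: z-normalize a DA score, clip to [-1, 1], and negate
    (lower = better, matching the MQM error-score convention)."""
    z = (da_score - mu) / sigma
    return -float(np.clip(z, -1.0, 1.0))

def stage2_da_target(da_score: float) -> float:
    """Stage 2: rescale a raw DA score (0-100) onto the MQM range."""
    return da_score / 100.0 * MQM_MAX

def rescale_for_esa(pred: float) -> float:
    """Inference: linearly remap a model output to the ESA 0-100 scale
    (illustrative assumption about the exact mapping)."""
    return pred / MQM_MAX * 100.0
```

For example, a DA score one standard deviation above the annotator mean yields a stage 1 target of $-1.0$ after clipping and negation.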

Training Data includes:

  • DA annotations from WMT15–23 (excluding the into-English subset of WMT21)
  • MQM from WMT20–23: en–de, en–es, en–ru, en–zh, he–en, zh–en, ja–zh
  • ESA (Error Span Annotation) from WMT24 (9 language pairs)
  • Synthetic perturbations (per Juraska et al., 2024)

Validation leverages WMT24 MQM (en–de, en–es, ja–zh) and ESA labels.

4. Objective Functions and Optimization

A single loss function is shared across prediction types. MetricX-QE applies mean squared error (MSE) regression:

$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{B}\sum_{i=1}^{B} (\hat{s}_i - y_i)^2$$

where $B$ is the batch size, $\hat{s}_i$ is the predicted score, and $y_i$ the reference target. The use of a unified regression head with score-type tokens eliminates the need for task-specific losses or heads, streamlining optimization.
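The shared objective is plain batched MSE, as in this one-line numpy sketch:

```python
import numpy as np

def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Mean squared error over a batch of B scalar score predictions."""
    return float(np.mean((predictions - targets) ** 2))
```

Because both MQM- and ESA-targeted examples flow through the same head, the loss needs no per-task weighting; the score-type token alone conditions the output scale.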

Hyperparameters include:

  • Backbone: Gemma 3 12B encoder
  • Max input length: 4096 SPM tokens
  • Optimizer: Adafactor
  • Batch size: 128 (distributed over 64 TPU v3 cores)
  • Learning rate: $5 \times 10^{-5}$ (stage 1), $1 \times 10^{-5}$ (stage 2), cosine annealing
  • Warm-up steps: 100
  • No monolingual up-training

5. Empirical Evaluation

MetricX-QE demonstrates substantial advances over previous MetricX versions, as shown in system-level and segment-level evaluations on WMT24 MQM and ESA test sets.

Training     Input    Stage          en–de   en–es   ja–zh   Avg(ESA)   sys en–de   sys en–es   sys ja–zh   sys Avg(ESA)
24-Hybrid    src+ref  DA→MQM         53.20   68.50   53.90   –          87.40       79.90       89.70       –
25-QE*       src      DA→(DA+MQM)    54.97   69.42   57.21   84.91      85.45       78.29       91.34       84.91
25-Hybrid*   src+ref  DA→(DA+MQM)    55.45   69.14   57.72   87.61      85.82       77.00       92.00       87.61

Primary submissions are denoted by “*”.

Key empirical advances include:

  • Significant segment-level accuracy increases: +1.77 (en–de), +3.31 (ja–zh), +0.92 (en–es) over MetricX-24, $p < 0.01$.
  • System-level improvements, with the hybrid model (accepting both src+ref and src-only) performing best on system-level soft pairwise accuracy (SPA) for en–de, ja–zh, and Avg(ESA).

These results affirm the effectiveness of a unified backbone and input strategy across heterogeneous quality scales and annotation criteria (Juraska et al., 28 Oct 2025).

6. Innovations and Advances over Prior QE Systems

MetricX-QE’s core advances comprise:

  • Encoder-Only Gemma 3 Backbone: Replaces mT5 with Gemma 3, yielding improved multilingual representation, especially for typologically diverse languages (e.g., en–ja, en–zh).
  • Unified Regression Head with Score-Type Control: Allows a single system to predict both MQM and ESA via indicator tokens, eliminating scale-mismatch issues and increasing flexibility for future evaluation protocols.
  • Explicit Input Markup: Score-type tokens, language/locale identifiers, and markdown delimiters provide robust signal segmentation, enhancing reliability and reducing spurious attention or boundary errors.
  • Score Clipping Elimination: Avoids distortion in document-level evaluation, yielding higher true-to-human correlation.
  • Progressive Two-Stage Fine-Tuning: Stabilizes early learning on DA, followed by broad convergence across both DA and MQM tasks, ensuring robust adaptation.

The outcome is a QE metric that narrows or closes the gap with reference-based approaches, even when references are scarce or in the presence of dialectal and data variability.

7. Limitations and Practical Impact

MetricX-QE, while reference-free, is strongly dependent on the availability of high-quality annotations (DA, MQM, ESA) and the pretraining robustness of Gemma 3. Inputs must strictly adhere to the prescribed formatting for reliable performance. No additional monolingual “up-training” is performed, which may limit its adaptability under extreme domain shift.

Nevertheless, MetricX-QE represents a state-of-the-art approach to automatic, reference-free translation quality estimation, facilitating scalable assessment across multilingual and document-level use cases in MT research and production (Juraska et al., 28 Oct 2025).
