GemSpanEval: Generative Span Evaluation
- The paper presents GemSpanEval, a generative error span detection model that leverages a decoder-only architecture and JSON output serialization for precise translation error analysis.
- It operationalizes span detection as a structured sequence generation task, achieving competitive character-level F1 scores against encoder-only baselines across QE and reference modes.
- Its integration with the open-weight Gemma 3 LLM enhances flexibility in translation quality evaluation and provides a replicable framework for structured NLP annotation.
GemSpanEval is a decoder-only, generative error-span detection model designed for high-fidelity machine translation evaluation by identifying, categorizing, and scoring error spans within translated text. Developed by Google and submitted to the WMT25 Evaluation Shared Task, GemSpanEval is built upon Gemma 3 (27B), a state-of-the-art, open-weight, multilingual LLM fine-tuned using MQM-annotated datasets from the WMT20–24 campaigns. The system operationalizes span-level error detection as a structured sequence generation task, leveraging instruction-based prompting and JSON output serialization. GemSpanEval demonstrates competitive character-level F1 performance against strong encoder-only baselines such as xCOMET-XXL while providing enhanced flexibility in generative quality estimation workflows (Juraska et al., 28 Oct 2025).
1. Model Architecture and Input Representation
GemSpanEval utilizes the Gemma 3 27B transformer, a standard decoder-only LLM backbone with causal attention and no encoder component. To adapt Gemma 3 for span-level error detection, the model is fine-tuned on a JSON-serialized format for error spans, without introducing structural modifications beyond special prompt tokens and a fixed vocabulary for JSON-specific tokens. Gemma 3’s native SentencePiece tokenization, supporting context windows of approximately 128,000 tokens, is employed. Special tokens are reserved for backticks (```), JSON structural symbols, and field names (“span”, “severity”, “category”, “span_with_context”).
The error span detection task is formulated generatively: given a composite prompt comprised of instructions, back-quoted source/translation pairs, and optionally a reference translation, GemSpanEval autoregressively generates a JSON array of detected error objects. Each object contains the following fields:
- span: substring in the translation (or source, if denoting omission);
- severity: one of {"critical", "major", "minor"};
- category: one of the official MQM taxonomy labels (e.g., "accuracy/mistranslation", "fluency/punctuation", "style/awkward");
- span_with_context: the shortest context-expanding substring, included for non-unique spans to resolve ambiguity.
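The four-field error object above can be sketched as a small Python dataclass. This is an illustrative schema, not code from the paper; the class name `ErrorSpan` and the severity check are assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITIES = {"critical", "major", "minor"}

@dataclass
class ErrorSpan:
    """One detected error object, mirroring the JSON fields described above."""
    span: str                  # substring of the translation (or source, for omissions)
    severity: str              # one of "critical" | "major" | "minor"
    category: str              # an MQM taxonomy label, e.g. "accuracy/mistranslation"
    span_with_context: Optional[str] = None  # present only when the span is non-unique

    def __post_init__(self):
        # Guard against severities outside the fixed three-level scale.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
```

Representing the optional `span_with_context` field as `None` by default matches the serialization rule that the field is omitted for unique spans.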
2. Training Objective and Fine-Tuning Protocol
GemSpanEval is trained under the standard next-token prediction objective. Given a tokenized prompt $x$ and a target output sequence $y = (y_1, \ldots, y_T)$ representing the JSON structure, the cross-entropy loss is computed as:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid y_{<t},\, x\big)$$
No auxiliary objectives, regularization penalties, or span-level taggers are used; training is purely generative.
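A minimal NumPy sketch of this objective is shown below. It assumes, as is conventional in instruction fine-tuning, that the loss is averaged only over target (JSON output) positions, with prompt positions masked out; the function name and masking convention are illustrative, not taken from the paper.

```python
import numpy as np

def next_token_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy over target (JSON output) positions only.

    logits:    (T, V) unnormalized next-token scores
    targets:   (T,) gold next-token ids
    loss_mask: (T,) 1.0 on target positions, 0.0 on prompt positions
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each gold token.
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()
```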
The training corpus consists of MQM error-span annotations from WMT20–23 (en-de, en-es, ja-zh), with protocol variations for development and submission. During pre-submission, training uses WMT20–23 with WMT24 held out for validation; for final evaluation, WMT24 en-de and ja-zh are added to training, and en-es is held out. Each segment is presented to the model in both reference-based (with human reference) and QE (reference-free) modes to encourage dual-modality competence. Adafactor (no weight decay) is used as the optimizer, with a peak learning rate of , batch size of 64, and a maximum sequence length of 4,096 tokens. Training is conducted for 20,000 steps, approximately 1.8 epochs over 220,000 MQM-annotated segments.
3. Prompting Strategies and Output Serialization
GemSpanEval employs a structured two-part prompt:
- Instruction preamble: Provides annotator task context, error taxonomy, severity scaling, and explicit instructions to emit strict, JSON-parsable responses.
- Data block: Contains the source (in English), optionally a human reference translation (for reference-based mode), and the machine-generated translation, each surrounded by triple backticks.
At inference, this prompt elicits a model response in the form of a JSON list of error span objects. For spans that may occur more than once in the translation, the output includes a “span_with_context” field to disambiguate character offsets. If the span is unique, this field is omitted. In QE mode, the reference block is excluded from the prompt.
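The two-part prompt assembly can be sketched as follows. The field labels (`Source:`, `Reference:`, `Translation:`) and the function signature are assumptions for illustration; the paper specifies only that each data field is wrapped in triple backticks and that the reference block is dropped in QE mode.

```python
from typing import Optional

def build_prompt(instructions: str, source: str, translation: str,
                 reference: Optional[str] = None) -> str:
    """Assemble the instruction preamble plus a triple-backtick data block.

    Passing reference=None produces the QE (reference-free) prompt variant.
    """
    parts = [instructions, "", f"Source: ```{source}```"]
    if reference is not None:
        parts.append(f"Reference: ```{reference}```")
    parts.append(f"Translation: ```{translation}```")
    return "\n".join(parts)
```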
Example of a data block (reference-based mode):

```
English source: '''The lights are dimmable … on vacation'''
Reference: '''Les lumières sont … en vacances'''
Machine translation: '''Die Lichter sind … im Urlaub.'''
```
Expected model output:
```json
[
  {
    "span": "im",
    "severity": "minor",
    "category": "accuracy/mistranslation",
    "span_with_context": "nützlich im Büro"
  },
  {
    "span": "ihn",
    "severity": "minor",
    "category": "accuracy/mistranslation"
  },
  {
    "span": "mit",
    "severity": "minor",
    "category": "accuracy/mistranslation"
  }
]
```
4. Inference, Decoding, and Post-Processing
During inference, GemSpanEval decodes greedily, generating tokens until the closing bracket (‘]’) is emitted or the maximum sequence length is reached. The model output is subsequently parsed as JSON. Each span object is post-processed by locating the substring in the machine translation text; if the substring is non-unique, “span_with_context” is employed to resolve the precise character offsets. The finalized output is a sequence of tuples specifying (start_char, end_char, category, severity).
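The offset-resolution step can be sketched as below. The function name and the fallback to the first occurrence when a non-unique span lacks context are assumptions; the core logic (unique match by search, otherwise localization via “span_with_context”) follows the description above.

```python
import json

def resolve_spans(model_output: str, translation: str):
    """Parse the JSON error list and map each span to
    (start_char, end_char, category, severity) offsets in the translation."""
    results = []
    for err in json.loads(model_output):
        span = err["span"]
        if translation.count(span) == 1 or "span_with_context" not in err:
            # Unique span (or no context available): take the first occurrence.
            start = translation.find(span)
        else:
            # Locate the disambiguating context first, then the span inside it.
            ctx = err["span_with_context"]
            start = translation.find(ctx) + ctx.find(span)
        results.append((start, start + len(span),
                        err["category"], err["severity"]))
    return results
```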
5. Evaluation Metrics and Empirical Performance
GemSpanEval’s predictions are evaluated with the WMT character-level F1 metric, scoring the overlap of predicted and gold-standard MQM spans, with partial credit for incorrect severity assignments. Comparative assessment is conducted against:
- xCOMET-XXL (encoder-only, sequence-tagging, in both QE and reference modes)
- Zero-shot Gemma 3 (decoder-only) utilizing identical JSON-prompting
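A simplified version of the character-level F1 computation is sketched below: predicted and gold spans are converted to sets of character indices and scored by set overlap. This sketch omits the partial credit the official WMT metric grants for mismatched severities.

```python
def char_f1(pred_spans, gold_spans):
    """Character-level F1 between predicted and gold error spans,
    each given as a list of (start, end) character offsets."""
    pred, gold = set(), set()
    for s, e in pred_spans:
        pred.update(range(s, e))
    for s, e in gold_spans:
        gold.update(range(s, e))
    if not pred and not gold:
        return 1.0  # both annotators agree there are no errors
    overlap = len(pred & gold)
    denom = len(pred) + len(gold)
    return 2 * overlap / denom if denom else 0.0
```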
WMT24 character-level F1 results:
| System | en-de | en-es | ja-zh | Avg |
|---|---|---|---|---|
| xCOMET-XXL (QE) | 24.28 | 10.11 | 14.30 | 16.23 |
| xCOMET-XXL (ref) | 25.43 | 11.02 | 24.94 | 20.46 |
| Gemma 3 27B (zero-shot) | 17.94 | 8.19 | 28.42 | 18.18 |
| GemSpanEval-QE v1 (10K) | 17.51 | 14.43 | 22.75 | 18.23 |
| GemSpanEval-QE (20K) | 20.85 | 13.06 | 24.72 | 19.54 |
| GemSpanEval (ref) | 21.79 | 13.73 | 25.28 | 20.27 |
| GemSpanEval+WMT24 train* | 27.26 | 14.37 | 37.09 | 26.24 |
Note: The last row includes test segments memorized during training (not comparable except on en-es).
Key findings:
- Reference-based input boosts F1 by 0.4–1.0 points over QE mode.
- GemSpanEval (ref) achieves 20.27 F1, closely matching xCOMET-XXL’s 20.46.
- GemSpanEval-QE surpasses xCOMET-XXL-QE by +3.0 F1 on held-out en-es.
- Zero-shot Gemma 3 demonstrates strong performance for ja-zh but is less effective for en-de/en-es.
6. Distinctive Characteristics and Impact
GemSpanEval’s generative approach contrasts with traditional encoder-only, sequence-tagging frameworks by casting error span detection as an autoregressive generation problem. The use of explicit, instruction-driven prompts and strict JSON output ensures both the clarity of annotation and machine parsability. By requiring “span_with_context” on non-unique spans, the system addresses ambiguity in substring span matching, supporting reliable downstream extraction.
By forgoing token-level or auxiliary objectives, GemSpanEval simplifies training and aligns optimization directly with the generative decoding process prevalent in large-scale generative LLMs. Its competitive performance, flexibility for both reference-based and QE settings, and integration with open-weight LLM infrastructure mark a shift towards end-to-end, easily extensible generation-based evaluation pipelines for machine translation and potentially other NLP annotation tasks.
7. Integration, Applications, and Broader Context
GemSpanEval’s structured, output-controllable format allows for seamless integration into translation quality evaluation toolchains, both as a stand-alone error annotation system and as a module in more comprehensive quality estimation frameworks. Its JSON-based protocol is machine-auditable and directly compatible with automated scoring and meta-evaluation analyses.
A plausible implication is that the generative paradigm adopted by GemSpanEval could inform similar annotation tasks in domains requiring nuanced span-level labeling, as well as facilitate richer context-aware evaluation workflows in multilingual and low-resource settings. The methodology provides a template for applying LLMs to structured output tasks where unambiguous extraction and taxonomic categorization are essential (Juraska et al., 28 Oct 2025).