
GemSpanEval: Generative Span Evaluation

Updated 8 February 2026
  • The paper presents GemSpanEval, a generative error span detection model that leverages a decoder-only architecture and JSON output serialization for precise translation error analysis.
  • It operationalizes span detection as a structured sequence generation task, achieving competitive character-level F1 scores against encoder-only baselines across QE and reference modes.
  • Its integration with the open-weight Gemma 3 LLM enhances flexibility in translation quality evaluation and provides a replicable framework for structured NLP annotation.

GemSpanEval is a decoder-only, generative error-span detection model designed for high-fidelity machine translation evaluation by identifying, categorizing, and scoring error spans within translated text. Developed by Google and submitted to the WMT25 Evaluation Shared Task, GemSpanEval is built upon Gemma 3 (27B), a state-of-the-art, open-weight, multilingual LLM fine-tuned using MQM-annotated datasets from the WMT20–24 campaigns. The system operationalizes span-level error detection as a structured sequence generation task, leveraging instruction-based prompting and JSON output serialization. GemSpanEval demonstrates competitive character-level F1 performance against strong encoder-only baselines such as xCOMET-XXL while providing enhanced flexibility in generative quality estimation workflows (Juraska et al., 28 Oct 2025).

1. Model Architecture and Input Representation

GemSpanEval utilizes the Gemma 3 27B transformer, a standard decoder-only LLM backbone with causal attention and no encoder component. To adapt Gemma 3 for span-level error detection, the model is fine-tuned on a JSON-serialized format for error spans without introducing structural modifications beyond special prompt tokens and a fixed vocabulary of JSON-specific tokens. Gemma 3’s native SentencePiece tokenization, supporting context windows of approximately 128,000 tokens, is employed. Special tokens are reserved for backticks (```), JSON structural symbols, and field names (“span”, “severity”, “category”, “span_with_context”).

The error span detection task is formulated generatively: given a composite prompt $\mathcal{P}$ comprising instructions, back-quoted source/translation pairs, and optionally a reference translation, GemSpanEval autoregressively generates a JSON array of detected error objects. Each object contains the following fields:

  • span: substring in the translation (or source, if denoting omission);
  • severity: one of {"critical", "major", "minor"};
  • category: one of the official MQM taxonomy labels (e.g., "accuracy/mistranslation", "fluency/punctuation", "style/awkward");
  • span_with_context: the shortest context-expanding substring, included for non-unique spans to resolve ambiguity.
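The error-object schema above can be sketched as a small Python dataclass; the field names follow the paper's JSON format, but the class itself is illustrative and not part of any released code.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed severity labels, as listed in the paper's output format.
SEVERITIES = {"critical", "major", "minor"}

@dataclass
class ErrorSpan:
    """One detected error object (illustrative schema, not released code)."""
    span: str                    # substring of the translation (or source, for omissions)
    severity: str                # one of "critical", "major", "minor"
    category: str                # MQM taxonomy label, e.g. "accuracy/mistranslation"
    span_with_context: Optional[str] = None  # only present for non-unique spans

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")

err = ErrorSpan(span="im", severity="minor", category="accuracy/mistranslation")
```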

2. Training Objective and Fine-Tuning Protocol

GemSpanEval is trained under the standard next-token prediction objective. Given a tokenized prompt $x$ and a target output token sequence $y = (y_1, \dots, y_T)$ representing the JSON structure, the cross-entropy loss is computed as:

$$L_{\mathrm{CE}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$$

No auxiliary objectives, regularization penalties, or span-level taggers are used; training is purely generative.
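As a minimal sketch (toy probabilities, not actual Gemma 3 logits), the loss above averages the negative log-probability of each gold target token given the prompt and the preceding tokens:

```python
import numpy as np

def next_token_ce(probs: np.ndarray, targets: np.ndarray) -> float:
    """Average -log p(y_t | y_<t, x) over T target positions.

    probs:   (T, V) per-step probability distributions over the vocabulary
    targets: (T,)   gold target token ids
    """
    T = targets.shape[0]
    picked = probs[np.arange(T), targets]   # p_theta(y_t | y_<t, x) at each step
    return float(-np.log(picked).mean())    # (1/T) * sum of -log p

# Toy example: vocabulary of 3 tokens, target sequence of length 2.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
loss = next_token_ce(probs, targets)        # -(log 0.7 + log 0.8) / 2
```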

The training corpus consists of MQM error-span annotations from WMT20–23 (en-de, en-es, ja-zh), with protocol variations for development and submission. During pre-submission, training uses WMT20–23 with WMT24 held out for validation; for final evaluation, WMT24 en-de and ja-zh are added to training, and en-es is held out. Each segment is presented to the model in both reference-based (with human reference) and QE (reference-free) modes to encourage dual-modality competence. Adafactor (no weight decay) is used as the optimizer, with a peak learning rate of $1 \times 10^{-4}$, batch size of 64, and a maximum sequence length of 4,096 tokens. Training is conducted for 20,000 steps, approximately 1.8 epochs over 220,000 MQM-annotated segments.
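The reported hyperparameters can be collected into one hypothetical configuration dictionary; the key names are illustrative, as the paper does not release a config file.

```python
# Hypothetical config sketch; values from the paper, key names assumed.
train_config = {
    "optimizer": "Adafactor",            # no weight decay
    "peak_learning_rate": 1e-4,
    "batch_size": 64,
    "max_sequence_length": 4096,
    "train_steps": 20_000,               # ~1.8 epochs over ~220k MQM segments
    "modes": ["reference-based", "qe"],  # each segment presented in both modes
}
```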

3. Prompting Strategies and Output Serialization

GemSpanEval employs a structured two-part prompt:

  • Instruction preamble: Provides annotator task context, error taxonomy, severity scaling, and explicit instructions to emit strict, JSON-parsable responses.
  • Data block: Contains the source (in English), optionally a human reference translation (for reference-based mode), and the machine-generated translation, each surrounded by triple backticks.

At inference, this prompt elicits a model response in the form of a JSON list of error span objects. For spans that may occur more than once in the translation, the output includes a “span_with_context” field to disambiguate character offsets. If the span is unique, this field is omitted. In QE mode, the reference block is excluded from the prompt.
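The two-part prompt construction can be sketched as follows; the exact instruction wording and delimiters are assumptions, and only the overall structure (instruction preamble plus triple-backtick data block, with the reference omitted in QE mode) follows the paper.

```python
from typing import Optional

def build_prompt(source: str, translation: str,
                 reference: Optional[str] = None) -> str:
    """Assemble an instruction preamble plus a back-quoted data block.

    Passing reference=None yields the QE (reference-free) prompt.
    """
    # Hypothetical preamble wording; the real instructions are not public.
    preamble = (
        "You are an MQM annotator. Identify error spans in the translation "
        "and return a strict JSON list of objects with the fields "
        '"span", "severity", "category", and, for non-unique spans, '
        '"span_with_context".'
    )
    parts = [preamble, f"English source:\n```{source}```"]
    if reference is not None:            # reference-based mode only
        parts.append(f"Reference:\n```{reference}```")
    parts.append(f"Machine translation:\n```{translation}```")
    return "\n\n".join(parts)

qe_prompt = build_prompt("The lights are dimmable.", "Die Lichter sind dimmbar.")
```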

Example of prompt and output:

English source:
'''The lights are dimmable … on vacation'''
Reference:
'''Les lumières sont … en vacances'''
Machine translation:
'''Die Lichter sind … im Urlaub.'''

Expected model output:

[
  {
    "span": "im",
    "severity": "minor",
    "category": "accuracy/mistranslation",
    "span_with_context": "nützlich im Büro"
  },
  {
    "span": "ihn",
    "severity": "minor",
    "category": "accuracy/mistranslation"
  },
  {
    "span": "mit",
    "severity": "minor",
    "category": "accuracy/mistranslation"
  }
]

4. Inference, Decoding, and Post-Processing

During inference, GemSpanEval uses greedy next-token decoding until the closing bracket (‘]’) is generated or a maximum sequence length is reached. The model output is subsequently parsed as JSON. Each span object is post-processed by locating the substring in the machine translation text; if the substring is non-unique, “span_with_context” is employed to resolve the precise character offsets. The finalized output is a sequence of tuples specifying (start_char, end_char, category, severity).
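The post-processing step can be sketched as below: parse the JSON output and resolve each span to character offsets, falling back to “span_with_context” when the span string occurs more than once. This is an illustrative reconstruction, not the released implementation, and it omits the omission case where a span lives in the source.

```python
import json
from typing import List, Tuple

def resolve_spans(model_output: str, translation: str
                  ) -> List[Tuple[int, int, str, str]]:
    """Turn the model's JSON list into (start_char, end_char, category, severity)."""
    results = []
    for obj in json.loads(model_output):
        span = obj["span"]
        if translation.count(span) == 1:
            start = translation.find(span)
        else:
            # Non-unique span: locate the unique context first,
            # then find the span inside that context window.
            ctx = obj["span_with_context"]
            start = translation.find(ctx) + ctx.find(span)
        results.append((start, start + len(span),
                        obj["category"], obj["severity"]))
    return results

out = '[{"span": "Haus", "severity": "minor", "category": "fluency/grammar"}]'
spans = resolve_spans(out, "Das Haus ist rot.")
# spans -> [(4, 8, "fluency/grammar", "minor")]
```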

5. Evaluation Metrics and Empirical Performance

GemSpanEval’s predictions are evaluated with the WMT character-level F1 metric, scoring the overlap of predicted and gold-standard MQM spans, with partial credit for incorrect severity assignments. Comparative assessment is conducted against:

  • xCOMET-XXL (encoder-only, sequence-tagging, in both QE and reference modes)
  • Zero-shot Gemma 3 (decoder-only) using identical JSON prompting
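A simplified sketch of character-level span F1: mark each character covered by a predicted or gold span, then compute F1 over the two character sets. The official WMT metric additionally grants partial credit for incorrect severities, which this toy version omits.

```python
from typing import Iterable, Tuple

def char_f1(pred_spans: Iterable[Tuple[int, int]],
            gold_spans: Iterable[Tuple[int, int]]) -> float:
    """F1 over character indices; spans are (start, end) with end exclusive."""
    pred = {i for s, e in pred_spans for i in range(s, e)}
    gold = {i for s, e in gold_spans for i in range(s, e)}
    if not pred or not gold:
        # Both empty counts as perfect agreement; one-sided empty scores 0.
        return 1.0 if pred == gold else 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

score = char_f1([(0, 4)], [(2, 6)])  # 2 overlapping chars, P = R = 0.5 -> F1 = 0.5
```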

WMT24 character-level F1 results:

System                      en-de   en-es   ja-zh   Avg
xCOMET-XXL (QE)             24.28   10.11   14.30   16.23
xCOMET-XXL (ref)            25.43   11.02   24.94   20.46
Gemma 3 27B (zero-shot)     17.94    8.19   28.42   18.18
GemSpanEval-QE v1 (10K)     17.51   14.43   22.75   18.23
GemSpanEval-QE (20K)        20.85   13.06   24.72   19.54
GemSpanEval (ref)           21.79   13.73   25.28   20.27
GemSpanEval+WMT24 train*    27.26   14.37   37.09   26.24

Note: The last row includes test segments memorized during training (not comparable except on en-es).

Key findings:

  • Reference-based input boosts F1 by 0.4–1.0 points over QE mode.
  • GemSpanEval (ref) achieves 20.27 average F1, closely matching xCOMET-XXL (ref) at 20.46.
  • GemSpanEval-QE surpasses xCOMET-XXL (QE) by roughly +3.0 F1 on held-out en-es (13.06 vs. 10.11).
  • Zero-shot Gemma 3 demonstrates strong performance for ja-zh but is less effective for en-de/en-es.

6. Distinctive Characteristics and Impact

GemSpanEval’s generative approach contrasts with traditional encoder-only, sequence-tagging frameworks by casting error span detection as an autoregressive generation problem. The use of explicit, instruction-driven prompts and strict JSON output ensures both the clarity of annotation and machine parsability. By requiring “span_with_context” on non-unique spans, the system addresses ambiguity in substring span matching, supporting reliable downstream extraction.

By forgoing token-level or auxiliary objectives, GemSpanEval simplifies training and aligns optimization directly with the generative decoding process prevalent in large-scale generative LLMs. Its competitive performance, flexibility for both reference-based and QE settings, and integration with open-weight LLM infrastructure mark a shift towards end-to-end, easily extensible generation-based evaluation pipelines for machine translation and potentially other NLP annotation tasks.

7. Integration, Applications, and Broader Context

GemSpanEval’s structured, output-controllable format allows for seamless integration into translation quality evaluation toolchains, both as a stand-alone error annotation system and as a module in more comprehensive quality estimation frameworks. Its JSON-based protocol is machine-auditable and directly compatible with automated scoring and meta-evaluation analyses.

A plausible implication is that the generative paradigm adopted by GemSpanEval could inform similar annotation tasks in domains requiring nuanced span-level labeling, as well as facilitate richer context-aware evaluation workflows in multilingual and low-resource settings. The methodology provides a template for applying LLMs to structured output tasks where unambiguous extraction and taxonomic categorization are essential (Juraska et al., 28 Oct 2025).

