GemSpanEval: Generative Span Evaluation
- The paper presents GemSpanEval, a generative error span detection model that leverages a decoder-only architecture and JSON output serialization for precise translation error analysis.
- It operationalizes span detection as a structured sequence generation task, achieving competitive character-level F1 scores against encoder-only baselines across QE and reference modes.
- Its integration with the open-weight Gemma 3 LLM enhances flexibility in translation quality evaluation and provides a replicable framework for structured NLP annotation.
GemSpanEval is a decoder-only, generative error-span detection model designed for high-fidelity machine translation evaluation by identifying, categorizing, and scoring error spans within translated text. Developed by Google and submitted to the WMT25 Evaluation Shared Task, GemSpanEval is built upon Gemma 3 (27B), a state-of-the-art, open-weight, multilingual LLM fine-tuned using MQM-annotated datasets from the WMT20–24 campaigns. The system operationalizes span-level error detection as a structured sequence generation task, leveraging instruction-based prompting and JSON output serialization. GemSpanEval demonstrates competitive character-level F1 performance against strong encoder-only baselines such as xCOMET-XXL while providing enhanced flexibility in generative quality estimation workflows (Juraska et al., 28 Oct 2025).
1. Model Architecture and Input Representation
GemSpanEval utilizes the Gemma 3 27B transformer, a standard decoder-only LLM backbone with causal attention and no encoder component. To adapt Gemma 3 for span-level error detection, the model is fine-tuned on a JSON-serialized format for error spans, without introducing structural modifications beyond special prompt tokens and a fixed vocabulary for JSON-specific tokens. Gemma 3’s native SentencePiece tokenization, supporting context windows of approximately 128,000 tokens, is employed. Special tokens are reserved for backticks (```), JSON structural symbols, and field names (“span”, “severity”, “category”, “span_with_context”).
The error span detection task is formulated generatively: given a composite prompt comprised of instructions, back-quoted source/translation pairs, and optionally a reference translation, GemSpanEval autoregressively generates a JSON array of detected error objects. Each object contains the following fields:
- span: substring in the translation (or source, if denoting omission);
- severity: one of {"critical", "major", "minor"};
- category: one of the official MQM taxonomy labels (e.g., "accuracy/mistranslation", "fluency/punctuation", "style/awkward");
- span_with_context: the shortest context-expanding substring, included for non-unique spans to resolve ambiguity.
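The four-field error object above can be sketched as a small Python dataclass. This is an illustrative schema, not code from the paper; the class name `ErrorSpan` and the severity check are assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITIES = {"critical", "major", "minor"}

@dataclass
class ErrorSpan:
    """One detected error object, mirroring the JSON fields described above."""
    span: str                  # substring of the translation (or source, for omissions)
    severity: str              # one of "critical" | "major" | "minor"
    category: str              # an MQM taxonomy label, e.g. "accuracy/mistranslation"
    span_with_context: Optional[str] = None  # present only when the span is non-unique

    def __post_init__(self):
        # Guard against severities outside the fixed three-level scale.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
```

Representing the optional `span_with_context` field as `None` by default matches the serialization rule that the field is omitted for unique spans.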
2. Training Objective and Fine-Tuning Protocol
GemSpanEval is trained under the standard next-token prediction objective. Given a tokenized prompt $x$ and a target output sequence $y = (y_1, \ldots, y_T)$ representing the JSON structure, the cross-entropy loss is computed as:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid y_{<t},\, x\big)$$
No auxiliary objectives, regularization penalties, or span-level taggers are used; training is purely generative.
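A minimal NumPy sketch of this objective is shown below. It assumes, as is conventional in instruction fine-tuning, that the loss is averaged only over target (JSON output) positions, with prompt positions masked out; the function name and masking convention are illustrative, not taken from the paper.

```python
import numpy as np

def next_token_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy over target (JSON output) positions only.

    logits:    (T, V) unnormalized next-token scores
    targets:   (T,) gold next-token ids
    loss_mask: (T,) 1.0 on target positions, 0.0 on prompt positions
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each gold token.
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()
```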
The training corpus consists of MQM error-span annotations from WMT20–23 (en-de, en-es, ja-zh), with protocol variations for development and submission. During pre-submission, training uses WMT20–23 with WMT24 held out for validation; for final evaluation, WMT24 en-de and ja-zh are added to training, and en-es is held out. Each segment is presented to the model in both reference-based (with human reference) and QE (reference-free) modes to encourage dual-modality competence. Adafactor (no weight decay) is used as the optimizer, with a peak learning rate of , batch size of 64, and a maximum sequence length of 4,096 tokens. Training is conducted for 20,000 steps, approximately 1.8 epochs over 220,000 MQM-annotated segments.
3. Prompting Strategies and Output Serialization
GemSpanEval employs a structured two-part prompt:
- Instruction preamble: Provides annotator task context, error taxonomy, severity scaling, and explicit instructions to emit strict, JSON-parsable responses.
- Data block: Contains the source (in English), optionally a human reference translation (for reference-based mode), and the machine-generated translation, each surrounded by triple backticks.
At inference, this prompt elicits a model response in the form of a JSON list of error span objects. For spans that may occur more than once in the translation, the output includes a “span_with_context” field to disambiguate character offsets. If the span is unique, this field is omitted. In QE mode, the reference block is excluded from the prompt.
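The two-part prompt assembly can be sketched as follows. The field labels (`Source:`, `Reference:`, `Translation:`) and the function signature are assumptions for illustration; the paper specifies only that each data field is wrapped in triple backticks and that the reference block is dropped in QE mode.

```python
from typing import Optional

def build_prompt(instructions: str, source: str, translation: str,
                 reference: Optional[str] = None) -> str:
    """Assemble the instruction preamble plus a triple-backtick data block.

    Passing reference=None produces the QE (reference-free) prompt variant.
    """
    parts = [instructions, "", f"Source: ```{source}```"]
    if reference is not None:
        parts.append(f"Reference: ```{reference}```")
    parts.append(f"Translation: ```{translation}```")
    return "\n".join(parts)
```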
Example of a data block (reference-based mode):

```
English source: '''The lights are dimmable … on vacation'''
Reference: '''Les lumières sont … en vacances'''
Machine translation: '''Die Lichter sind … im Urlaub.'''
```
Expected model output:
```json
[
  {
    "span": "im",
    "severity": "minor",
    "category": "accuracy/mistranslation",
    "span_with_context": "nützlich im Büro"
  },
  {
    "span": "ihn",
    "severity": "minor",
    "category": "accuracy/mistranslation"
  },
  {
    "span": "mit",
    "severity": "minor",
    "category": "accuracy/mistranslation"
  }
]
```
4. Inference, Decoding, and Post-Processing
During inference, GemSpanEval decodes greedily, generating tokens until the closing bracket (‘]’) is emitted or the maximum sequence length is reached. The model output is subsequently parsed as JSON. Each span object is post-processed by locating the substring in the machine translation text; if the substring is non-unique, “span_with_context” is employed to resolve the precise character offsets. The finalized output is a sequence of tuples specifying (start_char, end_char, category, severity).
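The offset-resolution step can be sketched as below. The function name and the fallback to the first occurrence when a non-unique span lacks context are assumptions; the core logic (unique match by search, otherwise localization via “span_with_context”) follows the description above.

```python
import json

def resolve_spans(model_output: str, translation: str):
    """Parse the JSON error list and map each span to
    (start_char, end_char, category, severity) offsets in the translation."""
    results = []
    for err in json.loads(model_output):
        span = err["span"]
        if translation.count(span) == 1 or "span_with_context" not in err:
            # Unique span (or no context available): take the first occurrence.
            start = translation.find(span)
        else:
            # Locate the disambiguating context first, then the span inside it.
            ctx = err["span_with_context"]
            start = translation.find(ctx) + ctx.find(span)
        results.append((start, start + len(span),
                        err["category"], err["severity"]))
    return results
```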
5. Evaluation Metrics and Empirical Performance
GemSpanEval’s predictions are evaluated with the WMT character-level F1 metric, scoring the overlap of predicted and gold-standard MQM spans, with partial credit for incorrect severity assignments. Comparative assessment is conducted against:
- xCOMET-XXL (encoder-only, sequence-tagging, in both QE and reference modes)
- Zero-shot Gemma 3 (decoder-only) utilizing identical JSON-prompting
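A simplified version of the character-level F1 computation is sketched below: predicted and gold spans are converted to sets of character indices and scored by set overlap. This sketch omits the partial credit the official WMT metric grants for mismatched severities.

```python
def char_f1(pred_spans, gold_spans):
    """Character-level F1 between predicted and gold error spans,
    each given as a list of (start, end) character offsets."""
    pred, gold = set(), set()
    for s, e in pred_spans:
        pred.update(range(s, e))
    for s, e in gold_spans:
        gold.update(range(s, e))
    if not pred and not gold:
        return 1.0  # both annotators agree there are no errors
    overlap = len(pred & gold)
    denom = len(pred) + len(gold)
    return 2 * overlap / denom if denom else 0.0
```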
WMT24 character-level F1 results:
| System | en-de | en-es | ja-zh | Avg |
|---|---|---|---|---|
| xCOMET-XXL (QE) | 24.28 | 10.11 | 14.30 | 16.23 |
| xCOMET-XXL (ref) | 25.43 | 11.02 | 24.94 | 20.46 |
| Gemma 3 27B (zero-shot) | 17.94 | 8.19 | 28.42 | 18.18 |
| GemSpanEval-QE v1 (10K) | 17.51 | 14.43 | 22.75 | 18.23 |
| GemSpanEval-QE (20K) | 20.85 | 13.06 | 24.72 | 19.54 |
| GemSpanEval (ref) | 21.79 | 13.73 | 25.28 | 20.27 |
| GemSpanEval+WMT24 train* | 27.26 | 14.37 | 37.09 | 26.24 |
Note: The last row includes test segments memorized during training (not comparable except on en-es).
Key findings:
- Reference-based input boosts F1 by 0.4–1.0 points over QE mode.
- GemSpanEval (ref) achieves 20.27 F1, closely matching xCOMET-XXL’s 20.46.
- GemSpanEval-QE surpasses xCOMET-XXL-QE by +3.0 F1 on held-out en-es.
- Zero-shot Gemma 3 demonstrates strong performance for ja-zh but is less effective for en-de/en-es.
6. Distinctive Characteristics and Impact
GemSpanEval’s generative approach contrasts with traditional encoder-only, sequence-tagging frameworks by casting error span detection as an autoregressive generation problem. The use of explicit, instruction-driven prompts and strict JSON output ensures both the clarity of annotation and machine parsability. By requiring “span_with_context” on non-unique spans, the system addresses ambiguity in substring span matching, supporting reliable downstream extraction.
By forgoing token-level or auxiliary objectives, GemSpanEval simplifies training and aligns optimization directly with the generative decoding process prevalent in large-scale generative LLMs. Its competitive performance, flexibility for both reference-based and QE settings, and integration with open-weight LLM infrastructure mark a shift towards end-to-end, easily extensible generation-based evaluation pipelines for machine translation and potentially other NLP annotation tasks.
7. Integration, Applications, and Broader Context
GemSpanEval’s structured, output-controllable format allows for seamless integration into translation quality evaluation toolchains, both as a stand-alone error annotation system and as a module in more comprehensive quality estimation frameworks. Its JSON-based protocol is machine-auditable and directly compatible with automated scoring and meta-evaluation analyses.
A plausible implication is that the generative paradigm adopted by GemSpanEval could inform similar annotation tasks in domains requiring nuanced span-level labeling, as well as facilitate richer context-aware evaluation workflows in multilingual and low-resource settings. The methodology provides a template for applying LLMs to structured output tasks where unambiguous extraction and taxonomic categorization are essential (Juraska et al., 28 Oct 2025).