SpokeBench: UD Benchmark for Spoken CSW
- SpokeBench is an expert-annotated Universal Dependencies benchmark capturing spoken English–Spanish code-switching and its inherent disfluencies.
- It introduces a linguistically grounded taxonomy that covers nine phenomena such as ellipsis, repetition, and enclisis, highlighting inherent parsing challenges.
- The benchmark employs FLEX-UD, an ambiguity-aware evaluation metric offering partial credit for near-miss parses and addressing limitations of traditional metrics.
SpokeBench is an expert-annotated, gold-standard Universal Dependencies (UD) benchmark specifically designed to evaluate and foster research on syntactic parsing of spoken code-switching (CSW), focusing on phenomena that systematically violate core UD assumptions. Drawing from the Miami English–Spanish CSW corpus, SpokeBench introduces a linguistically grounded taxonomy of challenging spoken-language features and establishes rigorous annotation and evaluation protocols that highlight the limitations of both traditional parsers and evaluation metrics in this domain (Tyagi et al., 6 Feb 2026).
1. Motivation and Objectives
Spoken code-switching presents distinctive challenges that disrupt standard UD parsing: frequent disfluencies (e.g., fillers like "uh"), repetitions, mid-utterance repairs, ellipsis, discourse markers ("well," "you know"), clitic enclisis, and multiword expressions. These phenomena contravene UD’s foundational assumptions, including clause completeness, one-to-one token–function mappings, and strict tree unambiguity. Conventional written-text parsers and standard metrics such as LAS/UAS fail in this context, often enforcing ill-formed or linguistically implausible analyses, or excessively penalizing variant but valid parses.
SpokeBench was created with three central goals:
- To provide a dispute-resolved, expert-annotated UD treebank capturing spoken CSW phenomena that undermine existing parsers.
- To ensure balanced coverage across nine high-impact spoken-CSW categories and control utterances free from such phenomena.
- To supply extended guidelines and a new evaluation protocol (FLEX-UD) capable of differentiating between tolerable variation and genuine structural errors, addressing the rigidity of standard evaluation.
2. Taxonomy of Spoken Code-Switching Phenomena
SpokeBench canonicalizes nine categories of spoken CSW disruptions, determined through manual analysis of approximately 2,800 English–Spanish utterances. Key categories and their UD-specific obstacles are:
| Phenomenon | Definition | UD Challenge |
|---|---|---|
| Repetition | Recurrence of words/phrases (not idioms) | Head selection ambiguity; reparandum marking |
| Discourse Elements | Pragmatic markers (e.g., "well") | Structural integration vs. independent status |
| Ellipsis | Abandoned/truncated structures | Missing heads/deps; “dep” relation/root |
| Contractions | Single tokens encoding multiple units | Token splitting required |
| Compounds/MWEs | Multiword expressions as one unit | Dotted node collapsing |
| Break of Thought | Clause abandonment & restart | “Reparandum” tracking, structure remapping |
| Filler Words | Non-lexical vocalizations ("uh", "um") | Filler-specific UPOS/DEPREL |
| Slang/Curse Words | Pragmatic expletives | Underrepresented/ambiguous role |
| Enclisis | Verbs with attached clitics (Spanish) | Token splitting for UD compliance |
Each category is directly exemplified in SpokeBench through annotated utterances demonstrating their syntactic effects.
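The enclisis row above, for example, requires splitting a single surface verb form into several syntactic words. A minimal sketch of that split, using the standard CoNLL-U multiword-token convention, is shown below; the tiny lexicon and the `split_enclisis` helper are illustrative assumptions, not SpokeBench's actual pre-processing code.

```python
# Illustrative sketch: splitting a Spanish enclitic verb form into
# CoNLL-U multiword-token lines, as UD compliance requires for enclisis.
# The lexicon below is a toy assumption, not the benchmark's resource.
ENCLITIC_SPLITS = {
    "dámelo": ("da", "me", "lo"),  # "give me it"
    "dime": ("di", "me"),          # "tell me"
}

def split_enclisis(token_id: int, form: str):
    """Return CoNLL-U ID/FORM columns: a range line for the surface
    form, then one line per syntactic word."""
    parts = ENCLITIC_SPLITS.get(form.lower())
    if parts is None:
        return [f"{token_id}\t{form}"]
    # Multiword-token range line covering the component word IDs.
    lines = [f"{token_id}-{token_id + len(parts) - 1}\t{form}"]
    for offset, part in enumerate(parts):
        lines.append(f"{token_id + offset}\t{part}")
    return lines

for line in split_enclisis(3, "dámelo"):
    print(line)
```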
3. Annotation Workflow and Guidelines
Annotation of SpokeBench involved:
- Independent markup of 126 selected sentences by seven linguistically trained annotators using an extended CoNLL-U standard.
- A two-stage quality control: initial acceptability grading (variation vs. error), followed by expert adjudication for borderline or unacceptable cases, culminating in consensus-based gold annotation.
- Annotation guidelines prioritize the standard UD scheme (“UD-first”), deviating only when necessary and codifying rules for single-root assignment, reparandum marking for repetitions or repairs, forced splitting of contractions and enclitics, MWE collapsing, use of special DEPREL extensions (rep, dep, filler, discourse), and explicit handling of ellipsis.
Inter-annotator agreement on core UD attachments before adjudication exceeded 85%, underscoring the reliability of the resulting annotations.
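The agreement figure above can be made concrete with a small sketch. The computation below (mean pairwise agreement on head attachments) is one plausible way to obtain such a number; the helper and the toy head assignments are invented for illustration and are not SpokeBench's actual tooling.

```python
# Illustrative sketch: pairwise agreement on core UD attachments,
# the quantity reported to exceed 85% before adjudication.
from itertools import combinations

def attachment_agreement(ann_a, ann_b):
    """Fraction of tokens for which two annotators chose the same head."""
    assert len(ann_a) == len(ann_b)
    same = sum(1 for a, b in zip(ann_a, ann_b) if a == b)
    return same / len(ann_a)

# Toy head indices for a 10-token utterance from three annotators.
annotators = [
    [2, 0, 2, 5, 3, 5, 8, 6, 8, 8],
    [2, 0, 2, 5, 3, 5, 8, 6, 8, 9],
    [2, 0, 2, 5, 3, 5, 8, 6, 8, 8],
]
scores = [attachment_agreement(a, b) for a, b in combinations(annotators, 2)]
print(f"mean pairwise agreement: {sum(scores) / len(scores):.2f}")
```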
4. Dataset Composition and Structural Properties
SpokeBench is grounded in the Miami English–Spanish CSW Corpus, sampling 126 utterances (averaging 15 tokens each, ~1,900 tokens total) from bilingual residents exhibiting high-frequency intra-sentence code switches (30–50% non-matrix language tokens per utterance). Categories are carefully balanced, with sentences distributed as follows:
| Phenomenon | Sentence Count |
|---|---|
| Simple repetition | 10 |
| Complex repetition | 15 |
| English contractions | 10 |
| Spanish contractions | 10 |
| Simple ellipsis | 10 |
| Complex ellipsis | 15 |
| Simple discourse | 10 |
| Complex discourse/fillers | 15 |
| Highly complex (≥ 3 phenomena) | 12 |
| Control (none) | 20 |
SpokeBench is stored in CoNLL-U format, augmented with extended DEPREL values to handle reparandum, unrecoverable fragments, fillers, and pragmatic markers. Artificial IDs (e.g., “i.1” for MWEs) and MISC fields enable accurate representation of collapsed MWEs and tokenization edits driven by pre-processing modules.
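A minimal reader for such rows might look as follows. The sample utterance, its annotations, and the exact set of extended DEPREL values are assumptions for illustration; only the tab-separated CoNLL-U column layout is standard.

```python
# Sketch: reading SpokeBench-style CoNLL-U rows, tolerating extended
# DEPREL values (rep, dep, filler, discourse) and artificial IDs such
# as "2.1" used for collapsed MWEs. The sample utterance is invented.
EXTENDED_DEPRELS = {"rep", "dep", "filler", "discourse"}

sample = """\
1\twell\twell\tINTJ\t_\t_\t3\tdiscourse\t_\t_
2\tuh\tuh\tINTJ\t_\t_\t3\tfiller\t_\t_
2.1\tyou_know\tyou_know\tINTJ\t_\t_\t3\tdiscourse\t_\tMWE=Yes
3\tfui\tir\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse(conllu: str):
    """Split each row into the ID, FORM, HEAD, and DEPREL columns."""
    rows = []
    for line in conllu.strip().splitlines():
        cols = line.split("\t")
        rows.append({"id": cols[0], "form": cols[1],
                     "head": cols[6], "deprel": cols[7]})
    return rows

rows = parse(sample)
extended = [r["form"] for r in rows if r["deprel"] in EXTENDED_DEPRELS]
print(extended)
```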
5. Evaluation Protocol: The FLEX-UD Metric
Existing metrics such as LAS/UAS, which demand strict tree isomorphism, inadequately capture the legitimate ambiguities of spoken CSW. SpokeBench introduces FLEX-UD, an ambiguity- and severity-aware evaluation metric characterized by:
- Per-component subscores for each evaluated attribute, combined with tunable weights.
- An aggregate raw score formed as the weighted combination of the subscores.
- A severity penalty for catastrophic errors (missing MWEs, invalid heads, cycles, etc.).
- A final FLEX-UD score obtained by discounting the raw score with the severity penalty.
- Alignment between system and gold tokens is handled by component scripts (token_alignment.py, flex_ud.py), and partial credit is granted for near-miss dependencies (e.g., obj vs. obl).
By sharply distinguishing “benign” annotation variations from true structural errors, FLEX-UD supports robust model ranking and error analysis in this inherently ambiguous domain.
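The scoring logic described above can be sketched as follows. The weights, the near-miss label pairs, the partial-credit value, and the subtractive form of the penalty are all illustrative assumptions here, not the published metric's constants.

```python
# Hedged sketch of a FLEX-UD-style score: weighted per-component
# subscores, partial credit for near-miss labels, and a severity
# penalty. All constants below are assumed for illustration.
NEAR_MISS = {frozenset({"obj", "obl"}), frozenset({"nsubj", "csubj"})}
PARTIAL_CREDIT = 0.5  # assumed credit for a near-miss label

def label_score(gold: str, pred: str) -> float:
    if gold == pred:
        return 1.0
    if frozenset({gold, pred}) in NEAR_MISS:
        return PARTIAL_CREDIT
    return 0.0

def flex_ud(gold, pred, weights=(0.5, 0.5), severity_penalty=0.0):
    """gold/pred: lists of (head, deprel) per token;
    weights: (attachment, label)."""
    n = len(gold)
    attach = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    label = sum(label_score(g[1], p[1]) for g, p in zip(gold, pred)) / n
    raw = weights[0] * attach + weights[1] * label
    return max(0.0, raw - severity_penalty)  # clamp after penalty

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]  # near-miss on last label
print(round(flex_ud(gold, pred), 3))
```

Under strict LAS the last token would score zero; here the obj/obl near-miss keeps half credit, which is the kind of "benign variation" FLEX-UD is designed to tolerate.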
6. Distinction from Prior Benchmarks
Most existing spoken and code-switching UD treebanks (e.g., English Switchboard, SUD for French, Turkish–German, ICE-Hindi–English) either (a) ignore mid-utterance repair complexity, (b) rely on forced reconstruction, or (c) omit systematized handling of MWEs, enclisis, and spoken-specific tags. SpokeBench is unique in its:
- Expert dispute-adjudicated annotation of the most structurally complex spoken CSW utterances,
- Systematic and balanced phenomenon coverage,
- Schema extensions explicitly aligned to spoken phenomena,
- Unified schema for mixed English–Spanish code-switching.
7. Access and Research Use Cases
SpokeBench, along with its codebase (including DECAP, evaluation scripts, and gold annotation files), is publicly released at https://github.com/N3mika/scsw. It is intended for:
- Zero-shot and few-shot syntactic evaluation of LLM-based parsers on spoken CSW inputs,
- Fine-tuning of multilingual parsers using pre-processing (Spoken-Phenomena Handler, Language-Specific Resolver),
- Controlled comparative studies using both standard and ambiguity-aware (FLEX-UD) evaluation metrics.
This benchmark is foundational for parsing research addressing the unique complexities of spoken code-switching, supporting advances in both theoretical modeling and practical system development (Tyagi et al., 6 Feb 2026).