SpokeBench: UD Benchmark for Spoken CSW
- SpokeBench is an expert-annotated Universal Dependencies benchmark capturing spoken English–Spanish code-switching and its inherent disfluencies.
- It introduces a linguistically grounded taxonomy that covers nine phenomena such as ellipsis, repetition, and enclisis, highlighting inherent parsing challenges.
- The benchmark employs FLEX-UD, an ambiguity-aware evaluation metric offering partial credit for near-miss parses and addressing limitations of traditional metrics.
SpokeBench is an expert-annotated, gold-standard Universal Dependencies (UD) benchmark specifically designed to evaluate and foster research on syntactic parsing of spoken code-switching (CSW), focusing on phenomena that systematically violate core UD assumptions. Drawing from the Miami English–Spanish CSW corpus, SpokeBench introduces a linguistically grounded taxonomy of challenging spoken-language features and establishes rigorous annotation and evaluation protocols that highlight the limitations of both traditional parsers and evaluation metrics in this domain (Tyagi et al., 6 Feb 2026).
1. Motivation and Objectives
Spoken code-switching presents distinctive challenges that disrupt standard UD parsing: frequent disfluencies (e.g., fillers like "uh"), repetitions, mid-utterance repairs, ellipsis, discourse markers ("well," "you know"), clitic enclisis, and multiword expressions. These phenomena contravene UD’s foundational assumptions, including clause completeness, one-to-one token–function mappings, and strict tree unambiguity. Conventional written-text parsers and standard metrics such as LAS/UAS fail in this context, often enforcing ill-formed or linguistically implausible analyses, or excessively penalizing variant but valid parses.
SpokeBench was created with three central goals:
- To provide a dispute-resolved, expert-annotated UD treebank capturing spoken CSW phenomena that undermine existing parsers.
- To ensure balanced coverage across nine high-impact spoken-CSW categories and control utterances free from such phenomena.
- To supply extended guidelines and a new evaluation protocol (FLEX-UD) capable of differentiating between tolerable variation and genuine structural errors, addressing the rigidity of standard evaluation.
2. Taxonomy of Spoken Code-Switching Phenomena
SpokeBench canonicalizes nine categories of spoken CSW disruptions, determined through manual analysis of approximately 2,800 English–Spanish utterances. Key categories and their UD-specific obstacles are:
| Phenomenon | Definition | UD Challenge |
|---|---|---|
| Repetition | Recurrence of words/phrases (not idioms) | Head selection ambiguity; reparandum marking |
| Discourse Elements | Pragmatic markers (e.g., "well") | Structural integration vs. independent status |
| Ellipsis | Abandoned/truncated structures | Missing heads/deps; “dep” relation/root |
| Contractions | Single tokens encoding multiple units | Token splitting required |
| Compounds/MWEs | Multiword expressions as one unit | Dotted node collapsing |
| Break of Thought | Clause abandonment & restart | “Reparandum” tracking, structure remapping |
| Filler Words | Non-lexical vocalizations ("uh", "um") | Filler-specific UPOS/DEPREL |
| Slang/Curse Words | Pragmatic expletives | Underrepresented/ambiguous role |
| Enclisis | Verbs with attached clitics (Spanish) | Token splitting for UD compliance |
Each category is directly exemplified in SpokeBench through annotated utterances demonstrating their syntactic effects.
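The enclisis row above, for example, requires splitting a single surface verb form into several syntactic words. A minimal sketch of that split, using the standard CoNLL-U multiword-token convention, is shown below; the tiny lexicon and the `split_enclisis` helper are illustrative assumptions, not SpokeBench's actual pre-processing code.

```python
# Illustrative sketch: splitting a Spanish enclitic verb form into
# CoNLL-U multiword-token lines, as UD compliance requires for enclisis.
# The lexicon below is a toy assumption, not the benchmark's resource.
ENCLITIC_SPLITS = {
    "dámelo": ("da", "me", "lo"),  # "give me it"
    "dime": ("di", "me"),          # "tell me"
}

def split_enclisis(token_id: int, form: str):
    """Return CoNLL-U ID/FORM columns: a range line for the surface
    form, then one line per syntactic word."""
    parts = ENCLITIC_SPLITS.get(form.lower())
    if parts is None:
        return [f"{token_id}\t{form}"]
    # Multiword-token range line covering the component word IDs.
    lines = [f"{token_id}-{token_id + len(parts) - 1}\t{form}"]
    for offset, part in enumerate(parts):
        lines.append(f"{token_id + offset}\t{part}")
    return lines

for line in split_enclisis(3, "dámelo"):
    print(line)
```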
3. Annotation Workflow and Guidelines
Annotation of SpokeBench involved:
- Independent markup of 126 selected sentences by seven linguistically trained annotators using an extended CoNLL-U standard.
- A two-stage quality control: initial acceptability grading (variation vs. error), followed by expert adjudication for borderline or unacceptable cases, culminating in consensus-based gold annotation.
- Annotation guidelines prioritize the standard UD scheme (“UD-first”), deviating only when necessary and codifying rules for single-root assignment, reparandum marking for repetitions or repairs, forced splitting of contractions and enclitics, MWE collapsing, use of special DEPREL extensions (rep, dep, filler, discourse), and explicit handling of ellipsis.
Inter-annotator agreement on core UD attachments before adjudication exceeded 85%, underscoring the reliability of the resulting annotations.
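The agreement figure above can be made concrete with a small sketch. The computation below (mean pairwise agreement on head attachments) is one plausible way to obtain such a number; the helper and the toy head assignments are invented for illustration and are not SpokeBench's actual tooling.

```python
# Illustrative sketch: pairwise agreement on core UD attachments,
# the quantity reported to exceed 85% before adjudication.
from itertools import combinations

def attachment_agreement(ann_a, ann_b):
    """Fraction of tokens for which two annotators chose the same head."""
    assert len(ann_a) == len(ann_b)
    same = sum(1 for a, b in zip(ann_a, ann_b) if a == b)
    return same / len(ann_a)

# Toy head indices for a 10-token utterance from three annotators.
annotators = [
    [2, 0, 2, 5, 3, 5, 8, 6, 8, 8],
    [2, 0, 2, 5, 3, 5, 8, 6, 8, 9],
    [2, 0, 2, 5, 3, 5, 8, 6, 8, 8],
]
scores = [attachment_agreement(a, b) for a, b in combinations(annotators, 2)]
print(f"mean pairwise agreement: {sum(scores) / len(scores):.2f}")
```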
4. Dataset Composition and Structural Properties
SpokeBench is grounded in the Miami English–Spanish CSW Corpus, sampling 126 utterances (averaging 15 tokens each, ~1,900 tokens total) from bilingual residents exhibiting high-frequency intra-sentence code switches (30–50% non-matrix language tokens per utterance). Categories are carefully balanced, with sentences distributed as follows:
| Phenomenon | Sentence Count |
|---|---|
| Simple repetition | 10 |
| Complex repetition | 15 |
| English contractions | 10 |
| Spanish contractions | 10 |
| Simple ellipsis | 10 |
| Complex ellipsis | 15 |
| Simple discourse | 10 |
| Complex discourse/fillers | 15 |
| Highly complex (≥ 3 phenomena) | 12 |
| Control (none) | 20 |
SpokeBench is stored in CoNLL-U format, augmented with extended DEPREL values to handle reparandum, unrecoverable fragments, fillers, and pragmatic markers. Artificial IDs (e.g., “i.1” for MWEs) and MISC fields enable accurate representation of collapsed MWEs and tokenization edits driven by pre-processing modules.
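A minimal reader for such rows might look as follows. The sample utterance, its annotations, and the exact set of extended DEPREL values are assumptions for illustration; only the tab-separated CoNLL-U column layout is standard.

```python
# Sketch: reading SpokeBench-style CoNLL-U rows, tolerating extended
# DEPREL values (rep, dep, filler, discourse) and artificial IDs such
# as "2.1" used for collapsed MWEs. The sample utterance is invented.
EXTENDED_DEPRELS = {"rep", "dep", "filler", "discourse"}

sample = """\
1\twell\twell\tINTJ\t_\t_\t3\tdiscourse\t_\t_
2\tuh\tuh\tINTJ\t_\t_\t3\tfiller\t_\t_
2.1\tyou_know\tyou_know\tINTJ\t_\t_\t3\tdiscourse\t_\tMWE=Yes
3\tfui\tir\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse(conllu: str):
    """Split each row into the ID, FORM, HEAD, and DEPREL columns."""
    rows = []
    for line in conllu.strip().splitlines():
        cols = line.split("\t")
        rows.append({"id": cols[0], "form": cols[1],
                     "head": cols[6], "deprel": cols[7]})
    return rows

rows = parse(sample)
extended = [r["form"] for r in rows if r["deprel"] in EXTENDED_DEPRELS]
print(extended)
```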
5. Evaluation Protocol: The FLEX-UD Metric
Existing metrics such as LAS/UAS, which demand strict tree isomorphism, inadequately capture the legitimate ambiguities of spoken CSW. SpokeBench introduces FLEX-UD, an ambiguity- and severity-aware evaluation metric characterized by:
- Per-component subscores for each evaluated attribute, combined with tunable weights.
- An aggregate raw score formed as the weighted combination of the subscores.
- A severity penalty for catastrophic errors (missing MWEs, invalid heads, cycles, etc.).
- A final FLEX-UD score obtained by discounting the raw score with the severity penalty.
- Alignment between system and gold tokens is handled by component scripts (token_alignment.py, flex_ud.py), and partial credit is granted for near-miss dependencies (e.g., obj vs. obl).
By sharply distinguishing “benign” annotation variations from true structural errors, FLEX-UD supports robust model ranking and error analysis in this inherently ambiguous domain.
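The scoring logic described above can be sketched as follows. The weights, the near-miss label pairs, the partial-credit value, and the subtractive form of the penalty are all illustrative assumptions here, not the published metric's constants.

```python
# Hedged sketch of a FLEX-UD-style score: weighted per-component
# subscores, partial credit for near-miss labels, and a severity
# penalty. All constants below are assumed for illustration.
NEAR_MISS = {frozenset({"obj", "obl"}), frozenset({"nsubj", "csubj"})}
PARTIAL_CREDIT = 0.5  # assumed credit for a near-miss label

def label_score(gold: str, pred: str) -> float:
    if gold == pred:
        return 1.0
    if frozenset({gold, pred}) in NEAR_MISS:
        return PARTIAL_CREDIT
    return 0.0

def flex_ud(gold, pred, weights=(0.5, 0.5), severity_penalty=0.0):
    """gold/pred: lists of (head, deprel) per token;
    weights: (attachment, label)."""
    n = len(gold)
    attach = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    label = sum(label_score(g[1], p[1]) for g, p in zip(gold, pred)) / n
    raw = weights[0] * attach + weights[1] * label
    return max(0.0, raw - severity_penalty)  # clamp after penalty

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]  # near-miss on last label
print(round(flex_ud(gold, pred), 3))
```

Under strict LAS the last token would score zero; here the obj/obl near-miss keeps half credit, which is the kind of "benign variation" FLEX-UD is designed to tolerate.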
6. Distinction from Prior Benchmarks
Most existing spoken and code-switching UD treebanks (e.g., English Switchboard, SUD for French, Turkish–German, ICE-Hindi–English) either (a) ignore mid-utterance repair complexity, (b) rely on forced reconstruction, or (c) omit systematized handling of MWEs, enclisis, and spoken-specific tags. SpokeBench is unique in its:
- Expert dispute-adjudicated annotation of the most structurally complex spoken CSW utterances,
- Systematic and balanced phenomenon coverage,
- Schema extensions explicitly aligned to spoken phenomena,
- Unified schema for mixed English–Spanish code-switching.
7. Access and Research Use Cases
SpokeBench, along with its codebase (including DECAP, evaluation scripts, and gold annotation files), is publicly released at https://github.com/N3mika/scsw. It is intended for:
- Zero-shot and few-shot syntactic evaluation of LLM-based parsers on spoken CSW inputs,
- Fine-tuning of multilingual parsers using pre-processing (Spoken-Phenomena Handler, Language-Specific Resolver),
- Controlled comparative studies using both standard and ambiguity-aware (FLEX-UD) evaluation metrics.
This benchmark is foundational for parsing research addressing the unique complexities of spoken code-switching, supporting advances in both theoretical modeling and practical system development (Tyagi et al., 6 Feb 2026).