FLEX-UD: Ambiguity-Aware Parsing Metric
- FLEX-UD is an ambiguity-aware metric designed to evaluate syntactic parses in spoken code-switching by incorporating alternative structural interpretations.
- It computes a composite score from five orthogonal components—MWE segmentation, token identification, POS tagging, head assignment, and dependency labeling—to capture detailed parser performance.
- Integration with modular parsing frameworks such as DECAP allows fine-grained error attribution and exposes empirical improvements that traditional metrics understate.
FLEX-UD is an ambiguity-aware evaluation metric developed specifically for parsing tasks in spoken code-switching (CSW) environments, where standard Universal Dependencies (UD) assumptions are frequently violated. The metric was introduced alongside the SpokeBench benchmark and the DECAP parsing architecture to address structural ambiguities and evaluative shortcomings revealed when applying traditional metrics to non-canonical, highly variable spoken-language data (Tyagi et al., 6 Feb 2026).
1. Motivation and Context
Spoken CSW—the alternation between two or more languages within an utterance—characteristically exhibits disfluencies, ellipsis, repetitions, discourse-driven tokens, and complex multiword expressions. Such phenomena result in dependency structures that diverge from the regular formalisms targeted by canonical UD parsers. Standard metrics such as Labeled Attachment Score (LAS) and Unlabeled Attachment Score (UAS) penalize plausible alternative parses and so conflate genuine structural errors with linguistically legitimate variation. In experimental evaluations, existing parsers and LLM pipelines trained on written data fail to robustly handle these spoken phenomena, yielding degraded and uninformative metrics (Tyagi et al., 6 Feb 2026).
FLEX-UD (Flexible Universal Dependencies) was developed to enable more nuanced evaluation of syntactic parses in these settings. By incorporating ambiguity and well-formedness considerations, FLEX-UD exposes qualitative and quantitative improvements in parser outputs that standard UD metrics systematically mask.
2. Component Structure and Scoring
FLEX-UD evaluates parses using five orthogonal component scores:
| Component Score | Evaluated Aspect | Value Range |
|---|---|---|
| s_Split | Node segmentation/MWE | [1, 100] |
| s_ID | Token/node identification | [1, 100] |
| s_UPOS | Universal POS tagging | [1, 100] |
| s_HEAD | Structural head assignment | [1, 100] |
| s_DEPREL | Dependency relation label | [1, 100] |
The final numeric FLEX-UD score is computed as a weighted sum of these subscores:

$$S_{\text{raw}} = \sum_{i=1}^{5} w_i \, s_i$$

where $w_i$ is the component weight and $s_i$ is the component score for the $i$-th dimension. The default weights are not explicitly provided in the primary data but are assumed to sum to 1.
A penalty scalar is then applied to the weighted sum to account for "catastrophic" errors such as missing dotted MWE structure or invalid heads. In practice, this penalty aggregates per-token penalty signals exported by modular parsers like DECAP. Penalties are clipped so that a single catastrophic error contributes approximately 0.25–0.6, scaling with error severity.
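The scoring steps described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: the uniform weights, the summation of clipped per-token signals, the cap at 1.0, and the multiplicative way the penalty is combined with the weighted sum are all assumptions made here, since the source gives neither default weights nor the exact penalty formula. The function names are hypothetical.

```python
from typing import Mapping, Optional, Sequence

COMPONENTS = ("s_Split", "s_ID", "s_UPOS", "s_HEAD", "s_DEPREL")


def weighted_sum(scores: Mapping[str, float],
                 weights: Optional[Mapping[str, float]] = None) -> float:
    """Weighted sum of the five component scores."""
    if weights is None:
        # ASSUMPTION: uniform weights; the source does not state defaults.
        weights = {name: 1.0 / len(COMPONENTS) for name in COMPONENTS}
    if abs(sum(weights[name] for name in COMPONENTS) - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    return sum(weights[name] * scores[name] for name in COMPONENTS)


def aggregate_penalty(signals: Sequence[float],
                      low: float = 0.25, high: float = 0.6) -> float:
    """Clip each nonzero per-token signal to [low, high], then sum."""
    # ASSUMPTION: summing the clipped signals and capping the total at 1.0
    # are illustrative choices, not taken from the source.
    clipped = [min(max(s, low), high) for s in signals if s > 0]
    return min(sum(clipped), 1.0)


def final_score(scores: Mapping[str, float],
                signals: Sequence[float],
                weights: Optional[Mapping[str, float]] = None) -> float:
    """Penalised score for one parse."""
    # ASSUMPTION: the penalty scales the weighted sum multiplicatively.
    return weighted_sum(scores, weights) * (1.0 - aggregate_penalty(signals))


if __name__ == "__main__":
    example = {name: 80.0 for name in COMPONENTS}
    print(final_score(example, signals=[0.0, 0.9, 0.0]))
```

In this sketch a parse with no catastrophic errors keeps its weighted sum unchanged, while a single severe error removes at most 60% of it. A subtractive penalty would be an equally plausible reading of the description.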
3. Linguistic Rationale and Treatment of Ambiguity
FLEX-UD is motivated by the necessity to:
- Account for linguistically plausible structural alternatives in ambiguous spoken data (e.g., ellipsis variants, alternate attachments for discourse markers, split or collapsed MWEs).
- Prevent over-penalization of parses that diverge from a singular reference but remain valid with respect to the linguistic taxonomy established for spoken CSW in SpokeBench.
- Enable fine-grained analysis of parser behavior not only in terms of global attachment but also explicit MWE segmentation, node mapping, and token-level label assignments.
- Quantify well-formedness through severe error penalties, promoting robust syntactic validity without enforcing rigidity.
These criteria make FLEX-UD sensitive to the rich space of interpretations that expert annotators allow for spoken language, capturing distinctions elided by binary attachment-based metrics.
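One way to picture ambiguity-aware scoring is to let each token carry a set of acceptable heads rather than a single gold head. The sketch below is an assumption about how such alternatives could be represented, not the metric's documented procedure. The function name `head_score` and the set-based reference format are hypothetical.

```python
from typing import Sequence, Set


def head_score(predicted_heads: Sequence[int],
               acceptable_heads: Sequence[Set[int]]) -> float:
    """Percentage of tokens whose predicted head is among the acceptable heads."""
    if len(predicted_heads) != len(acceptable_heads):
        raise ValueError("prediction and reference must have the same length")
    if not predicted_heads:
        return 0.0
    hits = sum(1 for pred, ok in zip(predicted_heads, acceptable_heads)
               if pred in ok)
    return 100.0 * hits / len(predicted_heads)


if __name__ == "__main__":
    # Hypothetical 3-token utterance: token 1 is a discourse marker that
    # annotators allow to attach either to token 2 or to token 3.
    predicted = [2, 0, 2]
    strict = [{3}, {0}, {2}]        # single reference head per token
    flexible = [{2, 3}, {0}, {2}]   # alternatives admitted for token 1
    print(head_score(predicted, strict), head_score(predicted, flexible))
```

Under the strict reference the discourse-marker attachment counts as an error, whereas the flexible reference credits it. This is the kind of over-penalization the criteria above aim to avoid.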
4. Integration with Modular Parsing Pipelines
FLEX-UD is expressly designed to interface with modular agentic parsing frameworks, notably the DECAP architecture (Tyagi et al., 6 Feb 2026). Beyond the parse itself, DECAP outputs per-token penalty and confidence scores. These diagnostics:
- Allow downstream tools and annotators to identify regions of structural uncertainty or challenge.
- Provide direct, interpretable inputs to the FLEX-UD penalty aggregation mechanism.
- Enable systematic ablation and attribution studies: removing spoken-phenomena handlers or language-specific resolvers from DECAP reduces FLEX-UD Final scores by 10–15 points, confirming each module’s contribution.
This integration supports not just batch-level evaluation, but continuous, ambiguity-aware parser development cycles.
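As a rough illustration of how per-token diagnostics could be consumed downstream, the sketch below flags tokens whose penalty is high or whose confidence is low. The record layout (`TokenDiagnostic` with `form`, `penalty`, `confidence`) and both thresholds are hypothetical, since the source does not specify DECAP's export format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TokenDiagnostic:
    """Hypothetical per-token record; field names are assumptions."""
    form: str
    penalty: float      # per-token penalty signal
    confidence: float   # per-token confidence


def flag_uncertain(tokens: Sequence[TokenDiagnostic],
                   penalty_threshold: float = 0.25,
                   confidence_threshold: float = 0.5) -> List[int]:
    """Return indices of tokens that merit annotator review."""
    # ASSUMPTION: both thresholds are illustrative defaults.
    return [i for i, tok in enumerate(tokens)
            if tok.penalty >= penalty_threshold
            or tok.confidence < confidence_threshold]


if __name__ == "__main__":
    utterance = [
        TokenDiagnostic("so", penalty=0.0, confidence=0.9),
        TokenDiagnostic("main", penalty=0.4, confidence=0.7),
        TokenDiagnostic("went", penalty=0.0, confidence=0.3),
    ]
    print(flag_uncertain(utterance))
```

The flagged indices can then be surfaced to annotators or passed to a penalty aggregation step.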
5. Empirical Performance and Interpretive Value
Empirical studies on the SpokeBench dataset substantiate FLEX-UD’s utility in spoken CSW analysis:
| Parser | s_ID | s_UPOS | s_HEAD | s_DEPREL | Final |
|---|---|---|---|---|---|
| Traditional Stanza | 15.6 | 15.7 | 15.7 | 15.7 | 29.5 |
| BiLingua | 58.2 | 73.7 | 54.6 | 59.1 | 66.6 |
| DECAP | 76.2 | 78.5 | 62.3 | 65.1 | 72.2 |
- DECAP improves the final FLEX-UD score by +5.6 points over BiLingua (≈8.4% relative) and by +42.7 over the traditional parser (≈144% relative) (Tyagi et al., 6 Feb 2026).
- The most challenging class (elliptical+ structures) sees absolute final-score gains of 49.5 points (212% relative) under DECAP.
- Standard metrics such as LAS understate, and can even invert, these improvements: DECAP's LAS is 0.26 versus BiLingua's 0.32, yet its U-LAS and FLEX-UD Final are the highest among the compared systems.
- FLEX-UD allows for credible assessment of structural robustness and interpretability in the presence of ambiguity and error propagation, which is crucial for benchmarking in low-resource, high-variation spoken language settings.
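The absolute and relative gains quoted above follow directly from the Final column of the table. The short check below recomputes them; the dictionary simply copies the reported Final scores, and the helper name `gain` is hypothetical.

```python
from typing import Mapping, Tuple

# Final FLEX-UD scores as reported in the table above.
FINAL = {"Traditional Stanza": 29.5, "BiLingua": 66.6, "DECAP": 72.2}


def gain(system: str, baseline: str,
         scores: Mapping[str, float] = FINAL) -> Tuple[float, float]:
    """Absolute gain in points and relative gain in percent, rounded to 0.1."""
    absolute = scores[system] - scores[baseline]
    relative = 100.0 * absolute / scores[baseline]
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    print(gain("DECAP", "BiLingua"))            # (5.6, 8.4)
    print(gain("DECAP", "Traditional Stanza"))  # (42.7, 144.7)
```

The first result matches the reported +5.6 points (≈8.4% relative). The second gives 144.7%, which the text reports as ≈144%.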
6. Limitations and Applicability
FLEX-UD’s complexity is tailored for contexts with severe divergence from canonical syntactic norms—specifically, spontaneous spoken CSW. By design, it is less informative when applied to canonical written treebanks, where standard UD metrics are generally sufficient. Its correct usage presupposes expert-annotated reference parses that reflect the full ambiguity space of the target domain, like those in SpokeBench.
A plausible implication is that future work could benefit from refining component scoring strategies, integrating downstream semantic evaluation, or aligning penalty schemes with human adjudication effort. Even so, the reported results indicate that FLEX-UD is already better suited than attachment-only metrics such as LAS and UAS to expert-centered evaluation of token-level, node-level, and global syntactic variation in CSW parsing.
7. Significance for Parsing and Linguistic AI Evaluation
FLEX-UD represents a methodological advance in the evaluation of syntactic parsers operating under the challenging conditions of spoken, code-switched, and non-canonical data, especially where ambiguity is the rule rather than the exception. By structuring metric design around ambiguity awareness and multi-granular error attribution, FLEX-UD establishes a measurable standard for capturing qualitative improvements otherwise invisible to legacy metrics. This enables rigorous, linguistically sound progress in modeling structural phenomena outside the traditional domain of written-language NLP (Tyagi et al., 6 Feb 2026).