CoAM Dataset: MWE Analysis
- The CoAM dataset is a rigorously annotated English resource distinguishing idiomatic and semi-idiomatic multiword expressions (MWEs) for NLP applications.
- It employs a multi-phase annotation process combining independent human annotation, native-speaker review, and automated consistency checks to ensure consistent, fine-grained MWE tagging.
- Benchmark results, including a 13-point F1 improvement by fine-tuned Qwen-72B over the strongest prior system, highlight its value for machine translation and lexical analysis.
The Corpus of All-Type Multiword Expressions (CoAM) is a rigorously annotated English-language dataset specifically designed to support reliable evaluation and error analysis for multiword expression (MWE) identification systems. MWEs are idiomatic or semi-idiomatic sequences of at least two fixed lexemes exhibiting semantic, lexical, or syntactic idiomaticity; MWE identification is critical for downstream NLP tasks such as machine translation, lexical complexity assessment, and word sense disambiguation. Prior resources have commonly suffered from limited MWE-type coverage, inconsistent annotation, and inadequate support for fine-grained analyses. CoAM mitigates these issues through a multi-step, quality-controlled construction process and is the first dataset to systematically tag MWEs by type, providing new opportunities for nuanced evaluation and modeling (Ide et al., 2024).
1. Data Collection and Quality Assurance
CoAM is derived from four standard English corpora—news, commentary, TED talks, and Universal Dependencies (UD)—totaling approximately 30,000 tokens. From each document, the first ten sentences are selected and tokenized with spaCy's en_core_web_lg model. To preserve text quality, UD sentences, which frequently contain user-generated errors, are included in the training split but excluded from the test split.
Annotation proceeds via a structured multi-phase pipeline:
- Human Annotation: Each sentence is marked for MWEs independently by two annotators (at least one native speaker), employing a custom checkbox-based span interface. This interface supports discontinuous and overlapping MWEs and is implemented in Google Sheets via the CAIGen generator.
- Human Review: Discrepancies are adjudicated by two native-speaker authors, with visual highlighting facilitating efficient reconciliation.
- Automated Consistency Checking: A rule-based extractor (Tanner & Hoffman, 2023) assembles a lexicon of labeled MWEs, then searches the corpus for previously omitted equivalent surface forms. Native speakers validate all cases, eliminating the annotation inconsistencies that afflict previous corpora (e.g., missing other instances of the same idiom).
The CAIGen annotation tool is spreadsheet-based, requires no special software installation, and is extensible via Apps Script to accommodate span-attribute customization. These properties facilitate broad adoption and flexible annotation schemas.
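The automated consistency-checking step can be sketched as follows. The lexicon contents and the toy lemmatizer below are simplified stand-ins for the rule-based extractor of Tanner & Hoffman (2023); the key idea is that every labeled MWE contributes a lemma sequence, and the corpus is re-scanned for matching surface forms that the annotators missed:

```python
# Sketch of the automated consistency check: each human-annotated MWE goes
# into a lexicon keyed by its lemma sequence; the corpus is then re-scanned
# for surface forms of those same lemmas that were not annotated. Hits are
# queued for native-speaker validation, not auto-accepted.
# The lemmatizer here is a toy stand-in for spaCy.

from typing import List, Set, Tuple

TOY_LEMMAS = {"stood": "stand", "stands": "stand", "picked": "pick"}

def lemmatize(token: str) -> str:
    return TOY_LEMMAS.get(token.lower(), token.lower())

def build_lexicon(annotated_mwes: List[List[str]]) -> Set[Tuple[str, ...]]:
    """Collect the lemma sequence of every human-annotated MWE."""
    return {tuple(lemmatize(t) for t in mwe) for mwe in annotated_mwes}

def find_missed_candidates(tokens, annotated_spans, lexicon):
    """Return spans whose lemma sequence is in the lexicon but unannotated."""
    lemmas = [lemmatize(t) for t in tokens]
    hits = []
    for start in range(len(lemmas)):
        for length in range(2, len(lemmas) - start + 1):  # MWEs have >= 2 lexemes
            span = (start, start + length)
            if tuple(lemmas[span[0]:span[1]]) in lexicon and span not in annotated_spans:
                hits.append(span)  # queue for native-speaker review
    return hits

lexicon = build_lexicon([["stood", "for"]])
tokens = "She stands for fairness".split()
print(find_missed_candidates(tokens, set(), lexicon))  # [(1, 3)]
```

In the real pipeline the candidate spans are only flagged; native speakers make the final call, which is what eliminates inconsistencies without introducing false positives.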
2. Annotation Guidelines and MWE Typology
The definition of MWEs in CoAM draws on established guidelines (Baldwin & Kim, 2010; PARSEME):
- MWEs consist of at least two fixed lexemes.
- Candidates must display some form of idiomaticity (semantic, lexical, or syntactic); transparent collocations are expressly excluded.
- Proper nouns are omitted and delegated to named entity recognition tasks.
CoAM provides five automatically assigned MWE types for all non-UD data splits, determined via spaCy-based dependency-head rules:
| Type | Head POS | Example(s) |
|---|---|---|
| NOUN | NOUN, PRON | the middle of nowhere, red tape |
| VERB | VERB | stand for, pick up, break a leg |
| MOD/CONN | ADJ, ADV, ADP, CCONJ, SCONJ | under the weather, in spite of |
| CLAUSE | VERB, AUX | when it comes to, you know |
| OTHER | Any | and so on |
This schema enables fine-grained type-level error analysis and supports evaluation regimes considering MWE head POS and functional properties.
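Under the table above, type assignment largely reduces to a lookup on the POS of the MWE's dependency head. How CLAUSE is separated from VERB (both can be verb-headed) is not spelled out here, so the sketch below takes an explicit flag for clause-level MWEs as a simplifying assumption rather than the paper's exact rule:

```python
# Sketch of CoAM's head-POS-based MWE typing. The CLAUSE-vs-VERB distinction
# is modeled with an explicit `heads_clause` flag, a simplifying assumption;
# the actual rule set is dependency-based.

NOUN_HEADS = {"NOUN", "PRON"}
VERB_HEADS = {"VERB"}
MOD_CONN_HEADS = {"ADJ", "ADV", "ADP", "CCONJ", "SCONJ"}

def mwe_type(head_pos: str, heads_clause: bool = False) -> str:
    if heads_clause and head_pos in {"VERB", "AUX"}:
        return "CLAUSE"
    if head_pos in NOUN_HEADS:
        return "NOUN"
    if head_pos in VERB_HEADS:
        return "VERB"
    if head_pos in MOD_CONN_HEADS:
        return "MOD/CONN"
    return "OTHER"

print(mwe_type("NOUN"))                     # NOUN     (e.g., "red tape")
print(mwe_type("ADP"))                      # MOD/CONN (e.g., "in spite of")
print(mwe_type("VERB", heads_clause=True))  # CLAUSE   (e.g., "you know")
```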
3. Dataset Composition and Statistics
The final reviewed and consistency-checked CoAM corpus comprises:
- Sentences: 1,301
- Tokens: 30,231
- Tokens within annotated MWEs: ≈1,995
- Test set MWEs: 385
MWE density (fraction of tokens in MWEs) is 6.6%. The distribution of MWE types is as follows (percent of all MWEs):
| NOUN | VERB | MOD/CONN | CLAUSE | OTHER |
|---|---|---|---|---|
| 31.9 | 38.1 | 22.8 | 1.5 | 5.6 |
Test-set MWEs by type:
| Type | MWEs (Count) |
|---|---|
| NOUN | 118 |
| VERB | 154 |
| MOD/CONN | 88 |
| CLAUSE | 6 |
| OTHER | 19 |
A noteworthy feature of CoAM is the high proportion of "unseen" MWEs in its test set: 64.2% of the 385 test MWEs have a lemma multiset that never appears as an annotated MWE in the training data, enabling rigorous assessment of generalization.
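The "unseen" criterion can be operationalized directly from that description: a test MWE is unseen if the multiset of its lemmas never occurs as an annotated MWE in training. A minimal sketch, with illustrative data:

```python
# An MWE is "unseen" if the multiset of its lemmas never appears as an
# annotated MWE in the training split; multisets (not sequences) tolerate
# reordering and discontinuity. MWEs are given as lists of lemmas.

from collections import Counter

def lemma_multiset(lemmas):
    """Order-insensitive, hashable representation of an MWE's lemmas."""
    return frozenset(Counter(lemmas).items())

def unseen_ratio(train_mwes, test_mwes):
    seen = {lemma_multiset(m) for m in train_mwes}
    unseen = [m for m in test_mwes if lemma_multiset(m) not in seen]
    return len(unseen) / len(test_mwes)

train = [["pick", "up"], ["red", "tape"]]
test = [["up", "pick"], ["break", "a", "leg"]]  # reordered "pick up" counts as seen
print(unseen_ratio(train, test))  # 0.5
```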
4. Benchmarking and Experimental Results
Multiple MWE identification systems are benchmarked on the CoAM test set using exact-match metrics (F1, precision, recall):
- Rule: WordNet-based candidate extraction
- MWEasWSD (MaW):
- Rule + bi-encoder filter (bert-base-uncased)
- Rule + DCA poly-encoder filter
- Fine-tuned LLMs: e.g., Qwen-72B
Performance summary:
| System | F1 | Precision | Recall |
|---|---|---|---|
| Rule (WordNet) | 32.7 | 28.3 | 38.7 |
| MaW + BiEncoder | 41.6 | 48.6 | 36.5 |
| MaW + DCA | 42.0 | 48.4 | 37.1 |
| Qwen-72B (FT) | 55.5 | 61.5 | 50.7 |
The best-performing system, fine-tuned Qwen-72B, exceeds the strongest prior approach (MaW+DCA) by over 13 percentage points in F1 score, though recall remains a limiting factor at ≈50%.
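Exact-match scoring counts a predicted MWE as correct only if its full set of token positions coincides with a gold MWE; partial overlaps score nothing. A minimal scorer consistent with the metrics in the table (the data below is illustrative):

```python
# Exact-match precision/recall/F1 over MWE annotations: a prediction counts
# only if its token-index set equals a gold MWE's exactly. Representing MWEs
# as (sentence id, token-index set) also handles discontinuous spans.

def exact_match_prf(gold, pred):
    gold_set = {(sid, frozenset(idx)) for sid, idx in gold}
    pred_set = {(sid, frozenset(idx)) for sid, idx in pred}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, [1, 2]), (0, [5, 6, 7]), (1, [0, 3])]  # (sentence id, token indices)
pred = [(0, [1, 2]), (1, [0, 3]), (1, [4, 5])]
p, r, f = exact_match_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```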
5. Error Analysis and Empirical Insights
Detailed evaluation using type-level tags highlights systematic variation in identification difficulty:
| Type | Recall (%) |
|---|---|
| NOUN | 44.4 |
| VERB | 60.6 |
| MOD/CONN | 50.4 |
| CLAUSE | 22.2 |
Verb MWEs are substantially easier for systems to detect than noun MWEs, corroborating results from PARSEME-based research. Clause-type MWEs (e.g., phatic expressions such as "you know") are the most challenging.
Generalization to previously unseen MWEs remains difficult: fine-tuned Qwen-72B achieves 58.2% recall on seen MWEs but only 46.6% on unseen MWEs, an 11.6-point drop.
The effect of lexicon coverage is pronounced: MWEs present in WordNet are identified with much higher F1 (gap ≈22 points), indicating the reliance of current systems on authoritative MWE lexicons.
Qualitative analysis revealed several high-profile false negatives, including “real estate” (NOUN, not in WordNet) and “you know” (CLAUSE, not in WordNet).
6. Applications and Prospects
CoAM is expected to impact multiple NLP pipelines:
- Machine Translation: Improved idiomatic MWE handling (as in Briakou et al., 2024)
- Lexical Complexity Assessment: Enhanced readability metrics (Kochmar et al., 2020)
- Word Sense Disambiguation: Improved MWE coverage in sense pipelines (Tanner & Hoffman, 2023)
Future directions suggested by its authors include:
- Expanding lexicon sources (e.g., Wiktionary, crowd-curated idiom lists) to improve recall rates, particularly for systems akin to MaW.
- Cross-linguistic extension: The CAIGen tool and annotation schema are language-agnostic, supporting adaptation to additional languages contingent on guideline and annotation translation.
- Semi-supervised learning: Leveraging CoAM-annotated seeds to label large unlabeled corpora.
- Integrated annotation: Combining MWE and named entity recognition within CAIGen to handle overlap and interaction between idioms and multi-word named entities.
CoAM’s multi-step construction process, coverage of diverse and discontinuous MWEs, robust type-tagging, and systematic consistency checks fill a critical gap in all-type MWE identification benchmarks, enabling fine-grained experimental analysis and more robust system development (Ide et al., 2024).