CoAM Dataset: MWE Analysis
- The CoAM dataset is a rigorously annotated English resource distinguishing idiomatic and semi-idiomatic multiword expressions (MWEs) for NLP applications.
- It employs a multi-phase annotation process combining independent human annotation, native-speaker review, and automated consistency checks to ensure consistent, fine-grained MWE tagging.
- Benchmark results, including a 13-point F1 improvement by fine-tuned Qwen-72B over the strongest prior system, highlight its value for machine translation and lexical analysis.
The Corpus of All-Type Multiword Expressions (CoAM) is a rigorously annotated English-language dataset specifically designed to support reliable evaluation and error analysis for multiword expression (MWE) identification systems. MWEs are idiomatic or semi-idiomatic sequences of at least two fixed lexemes exhibiting semantic, lexical, or syntactic idiomaticity; MWE identification is critical for downstream NLP tasks such as machine translation, lexical complexity assessment, and word sense disambiguation. Prior resources have commonly suffered from limited MWE-type coverage, inconsistent annotation, and inadequate support for fine-grained analyses. CoAM mitigates these issues through a multi-step, quality-controlled construction process and is the first dataset to systematically tag MWEs by type, providing new opportunities for nuanced evaluation and modeling (Ide et al., 2024).
1. Data Collection and Quality Assurance
CoAM is derived from four standard English corpora—news, commentary, TED talks, and Universal Dependencies (UD)—totaling approximately 30,000 tokens. From each document, the first ten sentences are selected and tokenized with spaCy's en_core_web_lg model. To preserve text quality, UD sentences, which frequently contain user-generated errors, are included in the training split but excluded from the test split.
Annotation proceeds via a structured multi-phase pipeline:
- Human Annotation: Each sentence is marked for MWEs independently by two annotators (at least one native speaker), employing a custom checkbox-based span interface. This interface supports discontinuous and overlapping MWEs and is implemented in Google Sheets via the CAIGen generator.
- Human Review: Discrepancies are adjudicated by two native-speaker authors, with visual highlighting facilitating efficient reconciliation.
- Automated Consistency Checking: A rule-based extractor (Tanner & Hoffman, 2023) assembles a lexicon of labeled MWEs, then searches the corpus for previously omitted equivalent surface forms. Native speakers validate all cases, eliminating the annotation inconsistencies that afflict previous corpora (e.g., missing other instances of the same idiom).
The CAIGen annotation tool is spreadsheet-based, requires no special software installation, and is extensible via Apps Script to accommodate span-attribute customization. These properties facilitate broad adoption and flexible annotation schemas.
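The automated consistency-checking step can be sketched as follows. The lexicon contents and the toy lemmatizer below are simplified stand-ins for the rule-based extractor of Tanner & Hoffman (2023); the key idea is that every labeled MWE contributes a lemma sequence, and the corpus is re-scanned for matching surface forms that the annotators missed:

```python
# Sketch of the automated consistency check: each human-annotated MWE goes
# into a lexicon keyed by its lemma sequence; the corpus is then re-scanned
# for surface forms of those same lemmas that were not annotated. Hits are
# queued for native-speaker validation, not auto-accepted.
# The lemmatizer here is a toy stand-in for spaCy.

from typing import List, Set, Tuple

TOY_LEMMAS = {"stood": "stand", "stands": "stand", "picked": "pick"}

def lemmatize(token: str) -> str:
    return TOY_LEMMAS.get(token.lower(), token.lower())

def build_lexicon(annotated_mwes: List[List[str]]) -> Set[Tuple[str, ...]]:
    """Collect the lemma sequence of every human-annotated MWE."""
    return {tuple(lemmatize(t) for t in mwe) for mwe in annotated_mwes}

def find_missed_candidates(tokens, annotated_spans, lexicon):
    """Return spans whose lemma sequence is in the lexicon but unannotated."""
    lemmas = [lemmatize(t) for t in tokens]
    hits = []
    for start in range(len(lemmas)):
        for length in range(2, len(lemmas) - start + 1):  # MWEs have >= 2 lexemes
            span = (start, start + length)
            if tuple(lemmas[span[0]:span[1]]) in lexicon and span not in annotated_spans:
                hits.append(span)  # queue for native-speaker review
    return hits

lexicon = build_lexicon([["stood", "for"]])
tokens = "She stands for fairness".split()
print(find_missed_candidates(tokens, set(), lexicon))  # [(1, 3)]
```

In the real pipeline the candidate spans are only flagged; native speakers make the final call, which is what eliminates inconsistencies without introducing false positives.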
2. Annotation Guidelines and MWE Typology
The definition of MWEs in CoAM draws on established guidelines (Baldwin & Kim, 2010; PARSEME):
- MWEs consist of at least two fixed lexemes.
- Candidates must display some form of idiomaticity (semantic, lexical, or syntactic); transparent collocations are expressly excluded.
- Proper nouns are omitted and delegated to named entity recognition tasks.
CoAM provides five automatically assigned MWE types for all non-UD data splits, determined via spaCy-based dependency-head rules:
| Type | Head POS | Example(s) |
|---|---|---|
| NOUN | NOUN, PRON | the middle of nowhere, red tape |
| VERB | VERB | stand for, pick up, break a leg |
| MOD/CONN | ADJ, ADV, ADP, CCONJ, SCONJ | under the weather, in spite of |
| CLAUSE | VERB, AUX | when it comes to, you know |
| OTHER | Any | and so on |
This schema enables fine-grained type-level error analysis and supports evaluation regimes considering MWE head POS and functional properties.
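Under the table above, type assignment largely reduces to a lookup on the POS of the MWE's dependency head. How CLAUSE is separated from VERB (both can be verb-headed) is not spelled out here, so the sketch below takes an explicit flag for clause-level MWEs as a simplifying assumption rather than the paper's exact rule:

```python
# Sketch of CoAM's head-POS-based MWE typing. The CLAUSE-vs-VERB distinction
# is modeled with an explicit `heads_clause` flag, a simplifying assumption;
# the actual rule set is dependency-based.

NOUN_HEADS = {"NOUN", "PRON"}
VERB_HEADS = {"VERB"}
MOD_CONN_HEADS = {"ADJ", "ADV", "ADP", "CCONJ", "SCONJ"}

def mwe_type(head_pos: str, heads_clause: bool = False) -> str:
    if heads_clause and head_pos in {"VERB", "AUX"}:
        return "CLAUSE"
    if head_pos in NOUN_HEADS:
        return "NOUN"
    if head_pos in VERB_HEADS:
        return "VERB"
    if head_pos in MOD_CONN_HEADS:
        return "MOD/CONN"
    return "OTHER"

print(mwe_type("NOUN"))                     # NOUN     (e.g., "red tape")
print(mwe_type("ADP"))                      # MOD/CONN (e.g., "in spite of")
print(mwe_type("VERB", heads_clause=True))  # CLAUSE   (e.g., "you know")
```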
3. Dataset Composition and Statistics
The final reviewed and consistency-checked CoAM corpus comprises:
- Sentences: 1,301
- Tokens: 30,231
- Tokens within annotated MWEs: ≈1,995
- Test set MWEs: 385
MWE density (fraction of tokens in MWEs) is 6.6%. The distribution of MWE types is as follows (percent of all MWEs):
| NOUN | VERB | MOD/CONN | CLAUSE | OTHER |
|---|---|---|---|---|
| 31.9 | 38.1 | 22.8 | 1.5 | 5.6 |
Test-set MWEs by type:
| Type | MWEs (Count) |
|---|---|
| NOUN | 118 |
| VERB | 154 |
| MOD/CONN | 88 |
| CLAUSE | 6 |
| OTHER | 19 |
A noteworthy feature of CoAM is the high proportion of "unseen" MWEs in its test set: 64.2% of the 385 test MWEs have a lemma multiset that never appears as an annotated MWE in the training data, enabling rigorous assessment of generalization.
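The "unseen" criterion can be operationalized directly from that description: a test MWE is unseen if the multiset of its lemmas never occurs as an annotated MWE in training. A minimal sketch, with illustrative data:

```python
# An MWE is "unseen" if the multiset of its lemmas never appears as an
# annotated MWE in the training split; multisets (not sequences) tolerate
# reordering and discontinuity. MWEs are given as lists of lemmas.

from collections import Counter

def lemma_multiset(lemmas):
    """Order-insensitive, hashable representation of an MWE's lemmas."""
    return frozenset(Counter(lemmas).items())

def unseen_ratio(train_mwes, test_mwes):
    seen = {lemma_multiset(m) for m in train_mwes}
    unseen = [m for m in test_mwes if lemma_multiset(m) not in seen]
    return len(unseen) / len(test_mwes)

train = [["pick", "up"], ["red", "tape"]]
test = [["up", "pick"], ["break", "a", "leg"]]  # reordered "pick up" counts as seen
print(unseen_ratio(train, test))  # 0.5
```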
4. Benchmarking and Experimental Results
Multiple MWE identification systems are benchmarked on the CoAM test set using exact-match metrics (F1, precision, recall):
- Rule: WordNet-based candidate extraction
- MWEasWSD (MaW):
- Rule + bi-encoder filter (bert-base-uncased)
- Rule + DCA poly-encoder filter
- Fine-tuned LLMs: e.g., Qwen-72B
Performance summary:
| System | F1 | Precision | Recall |
|---|---|---|---|
| Rule (WordNet) | 32.7 | 28.3 | 38.7 |
| MaW + BiEncoder | 41.6 | 48.6 | 36.5 |
| MaW + DCA | 42.0 | 48.4 | 37.1 |
| Qwen-72B (FT) | 55.5 | 61.5 | 50.7 |
The best-performing system, fine-tuned Qwen-72B, exceeds the strongest prior approach (MaW+DCA) by over 13 percentage points in F1 score, though recall remains a limiting factor at ≈50%.
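Exact-match scoring counts a predicted MWE as correct only if its full set of token positions coincides with a gold MWE; partial overlaps score nothing. A minimal scorer consistent with the metrics in the table (the data below is illustrative):

```python
# Exact-match precision/recall/F1 over MWE annotations: a prediction counts
# only if its token-index set equals a gold MWE's exactly. Representing MWEs
# as (sentence id, token-index set) also handles discontinuous spans.

def exact_match_prf(gold, pred):
    gold_set = {(sid, frozenset(idx)) for sid, idx in gold}
    pred_set = {(sid, frozenset(idx)) for sid, idx in pred}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, [1, 2]), (0, [5, 6, 7]), (1, [0, 3])]  # (sentence id, token indices)
pred = [(0, [1, 2]), (1, [0, 3]), (1, [4, 5])]
p, r, f = exact_match_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```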
5. Error Analysis and Empirical Insights
Detailed evaluation using type-level tags highlights systematic variation in identification difficulty:
| Type | Recall (%) |
|---|---|
| NOUN | 44.4 |
| VERB | 60.6 |
| MOD/CONN | 50.4 |
| CLAUSE | 22.2 |
Verb MWEs are substantially easier for systems to detect than noun MWEs, corroborating results from PARSEME-based research. Clause-type MWEs (e.g., phatic expressions such as "you know") are the most challenging.
Generalization to previously unseen MWEs remains difficult: fine-tuned Qwen-72B achieves 58.2% recall on seen MWEs but only 46.6% on unseen MWEs, an 11.6-point drop.
The effect of lexicon coverage is pronounced: MWEs present in WordNet are identified with much higher F1 (gap ≈22 points), indicating the reliance of current systems on authoritative MWE lexicons.
Qualitative analysis revealed several high-profile false negatives, including “real estate” (NOUN, not in WordNet) and “you know” (CLAUSE, not in WordNet).
6. Applications and Prospects
CoAM is expected to impact multiple NLP pipelines:
- Machine Translation: Improved idiomatic MWE handling (as in Briakou et al., 2024)
- Lexical Complexity Assessment: Enhanced readability metrics (Kochmar et al., 2020)
- Word Sense Disambiguation: Improved MWE coverage in sense pipelines (Tanner & Hoffman, 2023)
Future directions suggested by its authors include:
- Expanding lexicon sources (e.g., Wiktionary, crowd-curated idiom lists) to improve recall rates, particularly for systems akin to MaW.
- Cross-linguistic extension: The CAIGen tool and annotation schema are language-agnostic, supporting adaptation to additional languages contingent on guideline and annotation translation.
- Semi-supervised learning: Leveraging CoAM-annotated seeds to label large unlabeled corpora.
- Integrated annotation: Combining MWE and named entity recognition within CAIGen to handle overlap and interaction between idioms and multi-word named entities.
CoAM’s multi-step construction process, coverage of diverse and discontinuous MWEs, robust type-tagging, and systematic consistency checks fill a critical gap in all-type MWE identification benchmarks, enabling fine-grained experimental analysis and more robust system development (Ide et al., 2024).