
Multiword Expressions & Compositionality

Updated 22 February 2026
  • Multiword Expressions are linguistic constructions whose meaning or syntax cannot be directly inferred from their individual parts.
  • Computational models, including distributional, hierarchical, and neural approaches, quantify compositionality and detect varying degrees of non-compositionality.
  • Challenges remain in modeling context-dependent MWE behavior and integrating external knowledge for improved language understanding and machine translation.

Multiword expressions (MWEs) encompass a broad class of linguistic constructions in which the meaning or syntactic behavior of the whole is not, or is only partly, predictable from its components. MWEs range from fully semantically opaque idioms (“spill the beans,” meaning “reveal a secret”) to constructions with irregular syntactic properties but transparent meaning (“all the same,” a microsyntactic unit). The computational challenge of MWEs is fundamentally the problem of modeling and detecting degrees of compositionality: the extent to which a phrase’s meaning is a function of its constituents and their combination. The study of this phenomenon at the intersection of computational linguistics, cognitive science, and neural modeling is an active research area, especially in the context of deep and contextualized models.

1. Types of Multiword Expressions and Theoretical Compositionality

MWEs are systematically categorized by their compositional and syntactic profiles:

  • Idioms: Fixed or semi-fixed expressions whose meaning cannot be determined by straightforwardly combining the meanings of the words (“spill the beans,” “kick the bucket”). Idioms are high in semantic opacity and low in compositionality (Zaitova et al., 9 May 2025, Liu et al., 24 Aug 2025).
  • Microsyntactic Units (MSUs): Constructions displaying unusual or irregular syntactic patterns (e.g., “all the same,” or Russian “те́м не ме́нее,” ‘nevertheless’) that cannot be explained by standard grammar rules. While their semantic transparency may be high, their syntactic non-conformity presents a unique modeling challenge (Zaitova et al., 9 May 2025).
  • Verb-particle constructions (VPCs): Combinations like “look up” or “take off,” where the overall meaning can deviate more or less from the literal composition and the syntactic flexibility varies. These display intermediate compositionality (Liu et al., 24 Aug 2025, Bozsahin et al., 2018).
  • Light verb constructions (LVCs): Forms combining a semantically "light" verb with a noun, such as “take a walk.” Here, the noun often bears the primary semantic load, and the phrase is relatively compositional (Liu et al., 24 Aug 2025).
  • Idiomatically combining phrases & fixed expressions: Some phrases allow modification or offer partial transparency, captured in formal grammars by subcategorization for head words or features rather than singleton types (Bozsahin et al., 2018).

Compositionality can thus be defined operationally: a phrase is compositional if its meaning can be derived from its constituents and the syntactic rule combining them (Zaitova et al., 9 May 2025, Liu et al., 2022). Non-compositionality arises when this function is non-transparent, or cannot be defined, as in canonical idioms.

2. Computational Models and Formal Approaches

A range of formalisms and computational methods have been developed to model and quantify compositionality in MWEs:

  • Distributional Models: The simplest tests rely on similarity between phrase vectors (e.g., word2vec, GloVe) and the additive sum of constituent vectors. Lower cosine similarity between the MWE and the sum of its parts generally signals non-compositionality (Kezar et al., 2023, Jana et al., 2019, Nagar et al., 1 Jun 2025, Qi et al., 2019).
  • Hierarchical/Hypernymy-enriched Models: Blending distributional similarity with Poincaré (hyperbolic) embeddings allows models to exploit hierarchical knowledge of hypernyms to more accurately assess compositionality, notably for noun compounds (Jana et al., 2019).
  • Sememe-based Models: In languages such as Chinese, sememe knowledge (minimal semantic units as in HowNet) can augment neural composition, producing MWE representations that better reflect human compositionality judgments (Qi et al., 2019). Aggregation or mutual attention over constituent sememes enables disambiguation in polysemous expressions.
  • Paracompositionality in Categorial Grammar: “Paracompositionality” extends compositional analysis to cases where not all elements participate in argument structure—singleton types for phrasal idioms, head-feature marking for productive idiomatically combining phrases. This yields transparent but constrained syntactic derivations and accommodates both rigid and productive MWEs (Bozsahin et al., 2018).
  • Geometric/Contextual Methods: Context-specific geometric tests for compositionality judge the closeness of a phrase vector to the principal subspace of its sentence context; idiomatic MWEs tend to “stick out” from this subspace (Gong et al., 2016).
  • Functional Probing in Neural Architectures: In neural models, especially transformers, affine or linear probes can often reconstruct parent phrase embeddings from their child embeddings with high accuracy, although this procedure fails to distinguish compositional from non-compositional instances reliably (Liu et al., 2022).
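The additive distributional test described above can be sketched in a few lines. The toy 4-d vectors below are invented for illustration; in practice they would come from a pretrained model such as word2vec or GloVe.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def additive_score(v_mwe, constituent_vecs):
    """cos(v_MWE, sum of constituent vectors): low values signal
    non-compositionality (idiomaticity)."""
    return cosine(v_mwe, np.sum(constituent_vecs, axis=0))

# Toy vectors (invented for illustration, not from any real embedding model).
v_spill = np.array([0.9, 0.1, 0.0, 0.2])
v_beans = np.array([0.1, 0.8, 0.1, 0.0])
v_literal = v_spill + v_beans + np.array([0.02, -0.01, 0.03, 0.01])  # near the sum
v_idiom = np.array([-0.3, 0.1, 0.9, -0.5])  # "reveal a secret" lives elsewhere

print(additive_score(v_literal, [v_spill, v_beans]))  # high: compositional
print(additive_score(v_idiom, [v_spill, v_beans]))    # low: idiomatic
```

The same score can be computed with weighted rather than plain sums, which is a common refinement when one constituent (e.g., the head noun) carries more of the meaning.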

3. MWEs and Compositionality in Deep and Transformer-Based Models

Recent work highlights the nuanced and often inadequate treatment of MWEs and compositionality in modern contextual architectures:

  • Layer-wise Specialization: In BERT-based models, syntactic aspects (e.g., microsyntactic units) are encoded in lower layers, while semantic integration required for idiomaticity peaks in higher layers. Fine-tuning on semantic tasks redistributes attention to higher layers for idioms; syntactic-task fine-tuning sharpens attention to MSUs in low layers (Zaitova et al., 9 May 2025).
  • Compositional Functionality: Across a range of embedding models (Google, Mistral, OpenAI), vector addition approximates phrase embeddings with high fidelity (cosine ≥0.85 in the best models for noun compounds), ridge regression marginally improves over addition, and BERT lags in compositional accuracy (Nagar et al., 1 Jun 2025). This reflects both model architecture and tokenization strategy.
  • Generalization and Limits: Probing studies show that LMs’ phrase representations are highly regular (affine-recoverable from parts) but fail to reflect idiomaticity as perceived by humans; compositionality scores weakly correlate with human ratings and perform poorly at flagging idioms (Liu et al., 2022).
  • Contextual/Scenario-Driven Compositionality: Compositionality is not necessarily a deterministic function of the phrase; the same MWE can be compositional or idiomatic depending on sentential (local) context. Models integrating global and local usage scenarios—and external KBs—outperform purely distributional approaches (Wang et al., 2019).
  • Attention Patterns: Attention-based analysis in Transformers shows that for idioms, self-attention concentrates within the idiom span, especially after semantic fine-tuning; for MSUs, attention is structurally localized and sensitive to early layers (Zaitova et al., 9 May 2025, Miletić et al., 2024).
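The affine-probe finding above (phrase embeddings are largely recoverable as a linear function of their parts) can be illustrated on synthetic data. The closed-form ridge fit below is a minimal sketch under invented data, not the probing setup of any particular paper: it fits a linear map from concatenated constituent embeddings to phrase embeddings and measures reconstruction quality on held-out items.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 16                      # embedding dimension (arbitrary for the sketch)
n_train, n_test = 200, 50

# Synthetic data: the "true" composition is a fixed linear map plus noise,
# mimicking the regularity that probes uncover in real phrase embeddings.
W_true = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
X = rng.standard_normal((n_train + n_test, 2 * d))     # [child1; child2]
Y = X @ W_true + 0.01 * rng.standard_normal((n_train + n_test, d))

X_tr, X_te = X[:n_train], X[n_train:]
Y_tr, Y_te = Y[:n_train], Y[n_train:]

# Closed-form ridge regression: W = (X'X + lam*I)^-1 X'Y
lam = 1e-2
W_hat = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(2 * d), X_tr.T @ Y_tr)

# Mean cosine between predicted and true held-out phrase embeddings.
Y_pred = X_te @ W_hat
cos = np.sum(Y_pred * Y_te, axis=1) / (
    np.linalg.norm(Y_pred, axis=1) * np.linalg.norm(Y_te, axis=1))
print(round(float(cos.mean()), 3))  # near 1.0: the map is affine-recoverable
```

High reconstruction accuracy on such a probe does not by itself imply human-like compositionality judgments, which is exactly the limitation noted above (Liu et al., 2022).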

4. Empirical Evaluation: Metrics, Resources, and Results

Evaluation of MWE compositionality and modeling reliability relies on diverse metrics and resources:

  • Additive/linear: core score cos(v_MWE, v_w1 + v_w2). Fails for idioms, works for compounds (Nagar et al., 1 Jun 2025, Kezar et al., 2023).
  • Hierarchical (Poincaré): weighted average of distributional distance and hypernym similarity. Captures noun-compound hierarchy (Jana et al., 2019).
  • Contextual geometry: core score cos(v_MWE, P_ctx v_MWE), where P_ctx projects onto the principal subspace of the sentence context. Distinguishes literal vs. idiomatic usage in context (Gong et al., 2016).
  • Transformer attention: layerwise context→MWE and within-MWE attention. Idioms: high-layer attention; MSUs: low-layer attention (Zaitova et al., 9 May 2025).
  • VAD norms (affective): average and heatmap of V̄_ij scores. Noun compounds ≫ idioms for affective compositionality (Mohammad, 25 Nov 2025).
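The contextual-geometry score cos(v_MWE, P_ctx v_MWE) can be sketched as follows. The context vectors here are synthetic, constructed to lie near a 2-d plane so that the contrast between in-plane (literal-like) and out-of-plane (idiom-like) phrase vectors is visible; real applications use embeddings of the surrounding sentence words.

```python
import numpy as np

def subspace_score(v_mwe, context_vecs, k=2):
    """cos(v_MWE, P_ctx v_MWE), where P_ctx projects onto the top-k
    principal directions of the (centered) context word vectors."""
    C = np.asarray(context_vecs)
    C = C - C.mean(axis=0)
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    basis = Vt[:k]                       # orthonormal basis of the subspace
    proj = basis.T @ (basis @ v_mwe)     # P_ctx v_MWE
    denom = np.linalg.norm(v_mwe) * np.linalg.norm(proj)
    return float(np.dot(v_mwe, proj) / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
d = 8
plane = rng.standard_normal((2, d))                 # "true" context plane
ctx = rng.standard_normal((12, 2)) @ plane \
      + 0.05 * rng.standard_normal((12, d))         # context words near it

literal = np.array([1.0, -0.5]) @ plane             # lies in the plane
v = rng.standard_normal(d)
coef, *_ = np.linalg.lstsq(plane.T, v, rcond=None)
idiomatic = v - plane.T @ coef                      # orthogonal to the plane

print(subspace_score(literal, ctx))    # near 1: fits the context subspace
print(subspace_score(idiomatic, ctx))  # near 0: "sticks out" of the context
```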

5. Applications, Impact, and Challenges

  • Machine Translation: Non-compositional VMWEs (especially idioms) systematically degrade translation quality compared to matched non-MWE controls (drops in state-of-the-art systems ~2.3 points for idioms, 1.1 for VPCs, 0.3 for LVCs). Pre-translation paraphrasing via LLMs, which replaces idioms with literal equivalents, significantly improves translation output, directly linking compositionality to downstream performance (Liu et al., 24 Aug 2025).
  • Sentiment and Affect Prediction: In new large-scale resources (10k English MWEs), valence is the most compositional dimension; arousal and dominance are often subject to idiomatic, non-linear shifts. Idioms and light-verb constructions present the greatest divergence from constituent-based predictability (Mohammad, 25 Nov 2025).
  • Resource-poor Scenarios: Semantic clustering, knowledge-based, and hypernymy-informed models clearly outperform frequency- or PMI-based association models in low-resource settings by robustly identifying non-compositional MWEs (Chakraborty et al., 2014).
  • Interpretability and Probing: Analytical techniques, such as attention scoring over layers (Zaitova et al., 9 May 2025) or principal subspace projection (Gong et al., 2016), not only aid interpretability but serve as diagnostics for MWE-aware tuning and evaluation.
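The affective-compositionality comparison above can be sketched with toy numbers. All VAD values below are invented for illustration and do not come from any published lexicon; the point is only the shape of the test: compare a phrase's annotated VAD score against the mean of its constituents' scores.

```python
import numpy as np

# Hypothetical VAD norms on a 0-1 scale (invented for illustration).
vad = {
    "take":  (0.55, 0.40, 0.60),
    "walk":  (0.70, 0.35, 0.55),
    "spill": (0.40, 0.55, 0.45),
    "beans": (0.60, 0.30, 0.45),
}
phrase_vad = {
    "take a walk":     (0.68, 0.33, 0.55),  # close to constituent average
    "spill the beans": (0.35, 0.75, 0.40),  # idiomatic shift, esp. arousal
}

def divergence(phrase, words):
    """Per-dimension absolute gap between annotated phrase VAD and the
    mean of constituent VAD norms; larger gaps = less compositional."""
    pred = np.mean([vad[w] for w in words], axis=0)
    return np.abs(np.array(phrase_vad[phrase]) - pred)

print(divergence("take a walk", ["take", "walk"]))      # small gaps
print(divergence("spill the beans", ["spill", "beans"]))  # larger, arousal-heavy
```

Under this toy setup the idiom diverges most on arousal, mirroring the reported pattern that valence tends to be the most compositional dimension.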

6. Open Issues and Directions for Future Research

  • Generalization across Types and Languages: The bulk of work focuses on English noun compounds and idioms. Extension to syntactically flexible MWEs, low-resource languages, signed languages, or highly context-dependent phenomena remains limited (Kezar et al., 2023, Wang et al., 2019, Miletić et al., 2024).
  • Integration with External Knowledge: Incorporating external definitions, hypernym/hyponym hierarchies, sememe analyses, or contextually-aware knowledge (e.g., from Wiktionary, HowNet, or DBpedia) consistently yields gains in compositionality prediction (Wang et al., 2019, Qi et al., 2019, Jana et al., 2019).
  • Unified Benchmarks and Protocols: There is pressing need for unified, multi-type, multi-lingual, and adversarially robust benchmarks, with protocols designed to disentangle memorization from genuine compositional inference (Miletić et al., 2024).
  • Architectural Inductive Biases: Future approaches may integrate explicit MWE/idiom classifiers, gating mechanisms between compositional and holistic representations, or compositionality-promoting losses during pretraining (Liu et al., 2022, Nagar et al., 1 Jun 2025).
  • Downstream MWE-aware Adaptations: In tasks like retrieval, semantic parsing, and affect modeling, making the system explicitly sensitive to MWE status (compositionality and type) is increasingly recognized as essential (Liu et al., 24 Aug 2025, Mohammad, 25 Nov 2025).

In sum, while significant progress has been made in formalizing, detecting, and modeling MWEs and their compositionality, current systems—especially deep contextual models—still lack robust, general solutions for the full range of idiomatic, syntactically irregular, or context-dependent phenomena. Directions that systematically integrate linguistic theory, external structured knowledge, and stress-tested neural architectures remain at the forefront of this domain.
