Non-Compositional Translation Gap
- The Non-Compositional Translation Gap is the phenomenon whereby translation models misinterpret idioms, proverbs, and multiword expressions (MWEs) by applying rigid, compositional translation rules.
- Empirical studies show significant performance drops in metrics like BLEU, ROUGE, and consistency when handling non-compositional language compared to standard compositional inputs.
- Research efforts focus on specialized techniques such as explicit grammar modeling, pseudo-labeling pipelines, and reward-based reinforcement to mitigate literal translation errors.
The Non-Compositional Translation Gap refers to the systematic failure of neural machine translation (NMT) systems—and, more broadly, LLMs and compositional grammars—to correctly translate expressions or tasks whose semantics cannot be derived strictly from the meanings of their parts and combination rules. This phenomenon is most salient in the treatment of idioms, proverbs, multiword expressions (MWEs), and functionally composed tasks (such as cross-lingual summarization). Despite achieving high performance on standard, compositional benchmarks, contemporary systems often collapse to literal, compositional translations in non-compositional settings, leading to pronounced quality deficits relative to both pipeline and human upper bounds. The non-compositional translation gap has been rigorously quantified and analyzed across diverse paradigms, including Transformer-based NMT, multilingual controlled natural languages, and reward-optimized LMs.
1. Formal Definitions and Taxonomy
Compositionality, following Partee (1984), is the principle that the meaning of a compound expression is a function of the meanings of its parts and their syntactic combination. In practical NMT, this manifests as word- or phrase-level translation pipelines, where source chunks are translated independently and combined mechanically. Non-compositionality arises when an expression $e$'s meaning (denoted $m(e)$) cannot be constructed from any function applied to the local meanings of its sub-expressions: $m(e) \neq f(m(e_1), \dots, m(e_n))$ for any composition function $f$. Classic cases include idioms (“kick the bucket” → “to die”) and fixed compounds (German nominal compounding).
The non-compositional translation gap is formally the empirical or metric difference between compositional and non-compositional translation performance:

$$\mathrm{Gap} = M(\mathcal{D}_{\text{comp}}) - M(\mathcal{D}_{\text{non-comp}})$$

where $M$ is any translation quality metric (e.g., COMET, BLEU, ROUGE), $\mathcal{D}_{\text{comp}}$ is a generic (compositional) test set, and $\mathcal{D}_{\text{non-comp}}$ is a non-compositional or idiom test set (Agarwal et al., 9 Jan 2026). In controlled grammar frameworks like GF, a non-compositional construction occurs when translation equivalents in parallel corpora cannot be mapped to the same abstract syntax tree, indicating an irreducible translation gap between the source and target representation (Enache et al., 2014).
2. Empirical Manifestations and Quantitative Gap
Systematic empirical evaluation demonstrates a persistent performance gap for non-compositional translation scenarios. Transformer models trained on large, naturalistic parallel corpora achieve BLEU and consistency improvements for fully compositional, synthetic data, but maintain large deficits on naturally non-compositional contexts. For instance, in systematized English→Dutch tests (Dankers et al., 2021):
- Synthetic (fully compositional) data: high translation consistency
- Natural (idiomatic) data: markedly lower consistency
- Peak overgeneralization (literal-error) rates on idioms reach 82% and above, only partially diluted by increased training data.
In cross-lingual summarization, zero-shot direct composition (e.g., prompting T5 with “summarize+translate_en_fr:”) recovers only a small fraction, on the order of 10%, of the pipeline approach's ROUGE coverage, while the pipeline itself nearly matches direct supervised upper bounds (Yu et al., 2023). Even state-of-the-art LLMs, when fine-tuned for idiomatic translation using base reward objectives, initially exhibit gaps of 8–12 points across composite metrics relative to compositional text (Agarwal et al., 9 Jan 2026).
| System | ROUGE-4 (En→De) | ROUGE-L (En→De) | ROUGE-4 (En→Fr) | ROUGE-L (En→Fr) |
|---|---|---|---|---|
| Fine-tune (supervised) | 3.14 | 32.63 | 4.45 | 35.56 |
| Pipeline | 3.20 | 32.35 | 3.90 | 33.68 |
| Zero-Shot Direct | 0.43 | 17.05 | 1.10 | 22.32 |
Such results indicate that even advanced neural models largely collapse onto compositional (often literal) translations in non-compositional contexts, severely impairing both final-task fluency and semantic adequacy (Yu et al., 2023, Dankers et al., 2021, Dankers et al., 2022).
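The pipeline vs. zero-shot direct contrast from Yu et al. (2023) can be sketched schematically. The `summarize` and `translate` functions below are stubs standing in for real model calls (e.g., to a T5 checkpoint); only the composed-prompt string mirrors the format quoted above.

```python
# Sketch: two ways to compose "summarize" (f) and "translate" (g).
# Both model calls are toy stubs; a real system would query trained models.

def summarize(text):
    """Stub summarizer: keep the first sentence."""
    return text.split(".")[0] + "."

def translate(text, tgt="fr"):
    """Stub MT system backed by a tiny lookup table."""
    lexicon = {"the report is long.": "le rapport est long."}
    return lexicon.get(text.lower(), text)

def pipeline(text):
    """Explicit composition: summarize first, then translate the summary."""
    return translate(summarize(text))

def zero_shot_direct(text):
    """Single composed prompt -- the model must do both steps in one pass,
    which is exactly where the non-compositional gap appears."""
    prompt = "summarize+translate_en_fr: " + text
    return prompt  # a real model would decode the output from this prompt
```

The pipeline succeeds because each stub only ever sees its own sub-task; the direct route depends on the model having internalized the function composition.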
3. Diagnostic Analysis and Underlying Mechanisms
Mechanistic analysis reveals that Transformers exhibit an inductive bias toward tightly grouping frequent, co-occurring tokens. Attention-based dissection across 37k idiomatic sentences (MAGPIE corpus) shows that when a phrase is figuratively intended and paraphrased correctly, the encoder sharply concentrates attention within the idiom span, while context interaction drops by 20% relative to literal usages (Dankers et al., 2022). The decoder’s cross-attention onto PIE (potentially idiomatic expression) tokens is greatly reduced for paraphrased outputs relative to literal ones, and the attention distribution becomes more dispersed.
Causal amnesic probing (INLP ablation) establishes that removing features predictively aligned with paraphrasing causes 27–40% of correctly paraphrased outputs to revert to word-for-word (wfw) translation behaviors, accompanied by representational realignment in attention metrics.
On the architectural level, compositional and non-compositional processing are not dynamically separated. Instead, models overwhelmingly default to compositional reasoning, lacking mechanisms to internally identify or treat non-compositional spans as atomic (Yu et al., 2023, Dankers et al., 2022, Enache et al., 2014).
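A minimal sketch of the within-span attention analysis described above, under the assumption that a "grouping score" is the average attention mass that span tokens direct back into the span itself; the exact formulation in Dankers et al. (2022) may differ.

```python
import numpy as np

# Sketch: within-span grouping score for an encoder self-attention map.
# `attn` is an (L, L) row-stochastic matrix (rows = queries, cols = keys);
# `span` gives the token indices of the potentially idiomatic expression.

def grouping_score(attn: np.ndarray, span: range) -> float:
    """Average attention mass that span tokens keep inside the span."""
    idx = list(span)
    within = attn[np.ix_(idx, idx)].sum(axis=1)  # in-span mass per span token
    return float(within.mean())
```

A map where idiom tokens attend mostly to each other scores near 1, while a uniform map over L tokens scores |span|/L, so the metric directly exposes the sharp concentration reported for figurative usages.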
4. Methodologies for Quantification, Detection, and Evaluation
A variety of experimental and formal methods have been employed to diagnose and measure the non-compositional translation gap:
- Consistency Testing: Under controlled perturbations (minimal input edits), translation consistency is used as a proxy for compositional generalization. The drop in consistency between compositional and non-compositional sets serves as a direct measure of the gap (Dankers et al., 2021).
- Parallel Tree Comparison in GF: In controlled grammars, divergence in parse tree structure between parallel sentences flags non-compositional spans (Enache et al., 2014).
- Attention Quantification: Encoder/decoder attention matrices are analyzed for grouping scores within idiom spans and context-interaction scores, connecting representational patterns to translation errors (Dankers et al., 2022).
- Composite Metric Aggregation: Recent works compute aggregate scores combining human assessment, MTQE, ROUGE, embedding distance, and LLM-based grading, then define the gap as the difference between scores on idiomatic and compositional test sets (Agarwal et al., 9 Jan 2026).
5. Mitigation Approaches and Research Directions
Several treatment strategies are proposed for reducing the non-compositional translation gap:
- Explicit Modeling in Grammars: In frameworks like GF, detected MWEs and compounds are assigned dedicated abstract functions and language-specific concrete linearisations, making translation grammars more robust and semantically grounded (Enache et al., 2014).
- Pipeline Pseudo-Labeling: Generate composite f–g task examples with a model pipeline, and use these for self-supervised training to expose compositionally complex mappings (Yu et al., 2023).
- Decomposition-Aware Pretraining: Incorporate procedural multi-step instructions (e.g., “coreference resolution → entity recognition → answer extraction”) into pretraining corpora, encouraging internal structuring aligned with functional task steps (Yu et al., 2023).
- Composer/Adapter Modules: Prefix-tuning or adapter-based extension with lightweight neural networks (parameter-efficient composers) allow composite prompt generation and modular chaining of learned functions (Yu et al., 2023).
- Attention Losses and Super-Lemmas: Penalize excessive intra-span attention for ambiguous expressions; treat known idioms as atomic tokens to anchor their representations (Dankers et al., 2022).
- Reward-Based RL Fine-Tuning: Apply Group Relative Policy Optimization (GRPO) with MTQE rewards on idiomatic datasets, demonstrating +13.7 point improvements on idioms, +8.4 on generic text, and +5.7 on cross-lingual transfer (Agarwal et al., 9 Jan 2026). The reward-driven reinforcement of semantic equivalence, along with negative rewards for literal renderings, distills figurative knowledge and suppresses literal, compositional pathways.
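The super-lemma idea above can be illustrated as a preprocessing step that rewrites known idioms into atomic placeholder tokens before they reach the model; the idiom table and token format below are hypothetical, and a real system would draw on an MWE lexicon and handle inflected variants.

```python
import re

# Sketch: "super-lemma" preprocessing -- replace each known idiom span with
# a single atomic token so the model cannot translate it compositionally.
# Table and token names are illustrative; inflection ("kicked the bucket")
# is deliberately not handled in this sketch.

IDIOMS = {
    "kick the bucket": "<IDIOM_KICK_THE_BUCKET>",
    "spill the beans": "<IDIOM_SPILL_THE_BEANS>",
}

def atomize_idioms(sentence: str) -> str:
    """Replace each known idiom with its atomic placeholder token."""
    out = sentence
    for idiom, token in IDIOMS.items():
        out = re.sub(re.escape(idiom), token, out, flags=re.IGNORECASE)
    return out
```

Downstream, the placeholder anchors a single learned representation for the idiom, which is the effect the attention-loss and super-lemma proposals aim for.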
6. Benchmarks, Evaluation Frameworks, and Open Challenges
Current metrics and test suites often conflate harmless paraphrasing with harmful non-compositionality. Recommendations include the construction of hybrid benchmarks with both compositional and non-compositional phenomena, minimal variational pairs, and explicit annotation of expected translations (Dankers et al., 2021). Proposed frameworks should include intermediate fidelity checks in multi-step pipelines and assess internal representations for latent compositional structure (Yu et al., 2023).
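A hybrid benchmark entry of the kind recommended above might look like the following sketch: a minimal pair with explicit annotation of the expected figurative and literal translations. All field names and the example pair are illustrative.

```python
from dataclasses import dataclass

# Sketch: one record of a minimal-pair benchmark with annotated expected
# translations. Field names are hypothetical, not from any released suite.

@dataclass
class MinimalPair:
    idiomatic_src: str           # source containing the PIE, used figuratively
    literal_src: str             # minimally edited, fully literal counterpart
    expected_idiomatic_tgt: str  # annotated figurative rendering
    expected_literal_tgt: str    # annotated rendering of the literal source
    idiom_span: tuple            # token indices of the PIE in idiomatic_src

PAIR = MinimalPair(
    idiomatic_src="he kicked the bucket",
    literal_src="he kicked the ball",
    expected_idiomatic_tgt="il est mort",
    expected_literal_tgt="il a frappé le ballon",
    idiom_span=(1, 4),
)

def correct_figurative(hypothesis: str, pair: MinimalPair) -> bool:
    """Check a system output against the annotated figurative reference."""
    return hypothesis.strip().lower() == pair.expected_idiomatic_tgt
```

Annotating both expected outputs lets an evaluator separate harmless paraphrasing from the harmful literal collapse that current metrics conflate.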
Unresolved challenges include:
- Designing objective, scalable metrics that distinguish between stylistic variance and genuine non-compositional failure.
- Inducing and evaluating default inference mechanisms that inform models when to switch processing regimes.
- Extending idiom-aware and non-compositional treatment to low-resource languages and less-documented figurative phenomena.
- Achieving parameter- and compute-efficient RL for widespread application of reward-driven idiom training (Agarwal et al., 9 Jan 2026).
7. Significance and Future Directions
The non-compositional translation gap exposes a fundamental limitation in current neural and symbolic translation architectures: the inability to dynamically arbitrate between compositional and non-compositional processing. While scale, multi-tasking, and data-driven learning improve coverage on compositional regimes, they do not induce systematic strategies for identifying and handling non-compositional structures. Advances in reward-based optimization, explicit compositional modeling, and evaluation have begun to narrow the gap, notably yielding robust improvements across idiomatic, compositional, and cross-lingual test sets (Agarwal et al., 9 Jan 2026). However, future progress requires principled integration of structural, semantic, and reward-based cues, as well as rethinking both data and evaluation paradigms for the complex spectrum of compositionality in natural language (Yu et al., 2023, Dankers et al., 2021, Dankers et al., 2022, Enache et al., 2014).