Metaphor Novelty Datasets
- Metaphor novelty datasets are curated resources that capture a spectrum from conventional to novel metaphors with controlled annotations for creativity assessment.
- They use diverse methods—from continuous Best-Worst Scaling to binary dictionary lookups—to evaluate metaphor comprehension and paraphrase generation.
- These datasets aid research in multimodal creativity and representation learning by benchmarking language models and addressing challenges like frequency confounds and prompt sensitivity.
Metaphor novelty datasets are curated resources designed to systematically capture, annotate, and evaluate the novelty dimension in metaphor comprehension, paraphrasing, and creative generation tasks, for both humans and artificial systems such as LLMs. These datasets differ from conventional metaphor corpora by explicitly targeting the spectrum from conventional to highly novel metaphors, offering controlled resources for intrinsic analysis and benchmark evaluation, and, in some cases, providing multimodal (e.g., visual) manifestations. They underpin research on the cognitive and computational mechanisms of metaphor novelty, model evaluation, and representation learning, especially in the context of LLMs and text-to-image diffusion systems.
1. Representative Datasets: Scope and Composition
Several datasets serve as benchmarks for metaphor novelty, each with distinct construction methodologies and annotation schemes:
Corpus-Based Datasets
- VUA-ratings: Derived from the VU Amsterdam Metaphor Corpus (VUAMC), this dataset consists of 15,155 content-word metaphor instances annotated via crowd-sourced Best-Worst Scaling (BWS). Annotators rank sentences sharing the same metaphor word for creativity/novelty, yielding continuous scale scores whose endpoints represent maximal conventionality and maximal novelty. Binary “novel” labels are assigned to instances whose score exceeds a fixed threshold (2.3% prevalence) (Momen et al., 5 Jan 2026).
- VUA-dictionary: A subset of the VUAMC comprising 1,160 metaphors assessed via dictionary lookup: a metaphor is labeled “novel” if its in-context sense is absent from dictionary definitions (409 novel, 751 conventional). This yields binary novelty labels with no continuous scores (Momen et al., 5 Jan 2026).
- MUNCH (Metaphor Understanding Challenge Dataset): This resource covers 2,970 metaphorical sentences sampled from the VUAMC, accompanied by over 10,000 single-word apt paraphrases (crowdsourced and expert-validated) and 1,492 inapt paraphrase controls (constructed via WordNet synsets). Each MRW (metaphor-related word) carries a novelty score from Do Dinh et al. (2018), and highly conventional instances (those below a novelty-score cut-off) are filtered out to focus on marginally conventional to novel metaphors. The core set spans four genres: academic, news, fiction, and conversation, and is stratified by part-of-speech (Tong et al., 2024).
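To make the filtering and control-construction steps concrete, the following is a minimal sketch assuming hypothetical record fields (`sentence`, `mrw`, `novelty`, `genre`) and an illustrative novelty cut-off. It filters MUNCH-style records by novelty score and drafts inapt-paraphrase candidates from WordNet synsets of the metaphor word; note that the published MUNCH controls were manually selected rather than taken automatically.

```python
from nltk.corpus import wordnet as wn  # requires nltk's WordNet data

NOVELTY_CUTOFF = 0.0  # illustrative threshold, not the published value

def filter_by_novelty(records, cutoff=NOVELTY_CUTOFF):
    """Keep only metaphor instances whose novelty score exceeds the cutoff."""
    return [r for r in records if r["novelty"] > cutoff]

def inapt_candidates(mrw, pos=wn.VERB, max_candidates=5):
    """Draft inapt paraphrase candidates: lemmas from WordNet synsets of the
    metaphor-related word, which an annotator would then vet for inaptness."""
    candidates = []
    for synset in wn.synsets(mrw, pos=pos):
        for lemma in synset.lemma_names():
            if lemma.lower() != mrw.lower():
                candidates.append(lemma.replace("_", " "))
    return candidates[:max_candidates]

# Toy usage with a single hypothetical record
records = [{"sentence": "The lawyer devoured the contract.",
            "mrw": "devoured", "novelty": 0.4, "genre": "fiction"}]
kept = filter_by_novelty(records)
print(kept, inapt_candidates("devour"))
```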
Synthetic and Controlled Datasets
- Lai2009: Comprised of 208 sentences (104 novel metaphors, 104 conventional), generated by psycholinguistic protocols. Items are hand-crafted to control for target word, metaphoricity, and novelty (validated by familiarity/interpretability norms) (Momen et al., 5 Jan 2026).
- GPT4o-metaphors: AI-generated, lexicon-controlled set of 200 metaphoric sentences (100 conventional, 100 novel), drawn from GPT-4o prompts targeting specific noun/verb metaphoric patterns (Momen et al., 5 Jan 2026).
Multimodal Resource
- HAIVMet (Human-AI Visual Metaphors): Focuses on visual metaphor novelty. Its 6,476 images map to 1,540 textual metaphors collected and filtered for visualizability, elaborated by LLM-based Chain-of-Thought prompts, and quality-checked by expert illustrators. Each visual elaboration receives up to four high-fidelity images filtered for representational faithfulness (Chakrabarty et al., 2023).
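The two-stage structure of this kind of pipeline (LLM-based Chain-of-Thought elaboration, then text-to-image rendering) can be sketched as follows; the prompt wording, model identifiers, and API calls are illustrative stand-ins, not the published HAIVMet prompts or models.

```python
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_TEMPLATE = (
    "Metaphor: {metaphor}\n"
    "Step 1: Identify the source and target domains.\n"
    "Step 2: Describe a single visual scene that depicts the implicit meaning, "
    "not the literal words.\n"
    "Step 3: Output the final scene description in one sentence."
)

def elaborate(metaphor: str) -> str:
    """Chain-of-Thought elaboration of a metaphor into a visual scene description."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in model; HAIVMet used its own LLM setup
        messages=[{"role": "user", "content": COT_TEMPLATE.format(metaphor=metaphor)}],
    )
    return response.choices[0].message.content

def render(scene: str):
    """Render the elaborated scene with a text-to-image diffusion model."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(scene, num_images_per_prompt=4).images  # up to four candidates

images = render(elaborate("My lawyer is a shark."))
```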
2. Annotation Methodologies and Novelty Scale Construction
Metaphor novelty datasets typically employ two broad annotation paradigms:
Table: Annotation Schemes by Dataset
| Dataset | Annotation Type | Novelty Scale |
|---|---|---|
| VUA-ratings | Best-Worst Scaling | Continuous |
| VUA-dictionary | Dictionary lookup | Binary |
| Lai2009 | Constructed items | Binary |
| GPT4o-metaphors | Generation labels | Binary |
| MUNCH | BWS (Do Dinh et al.) | Filtered continuous |
| HAIVMet | Human-AI pipeline | Qualitative/implicit |
For continuous scales (e.g., VUA-ratings, MUNCH), BWS scoring provides high reliability and granularity for benchmarking both human and model performance as a function of metaphor creativity. Inapt paraphrase controls in MUNCH are manually selected to test whether models rely on spurious lexical similarity rather than genuine domain transfer in metaphor understanding (Tong et al., 2024).
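A common way to aggregate Best-Worst Scaling annotations into per-item scores is simple counting; the sketch below assumes annotation tuples of the form (items shown, best item, worst item) and is an illustration of the general technique, not the exact procedure used to build VUA-ratings or the Do Dinh et al. (2018) scores.

```python
from collections import Counter

def bws_scores(annotations):
    """Aggregate Best-Worst Scaling tuples into per-item scores in [-1, 1].

    Each annotation is (items_shown, best_item, worst_item); an item's score is
    (#times chosen best - #times chosen worst) / #times shown.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        best[b] += 1
        worst[w] += 1
        for item in items:
            shown[item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Toy example: three 4-tuples over sentences sharing one metaphor word
annotations = [
    (("s1", "s2", "s3", "s4"), "s3", "s1"),
    (("s1", "s2", "s3", "s4"), "s3", "s2"),
    (("s1", "s2", "s3", "s4"), "s4", "s1"),
]
print(bws_scores(annotations))  # s3 scores highest (most novel), s1 lowest
```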
Synthetic datasets enforce novelty by construction—either by psycholinguistic criteria (Lai2009) or explicit AI prompting (GPT4o-metaphors)—offering rigorous control over confounds like word frequency, length, and sense ambiguity (Momen et al., 5 Jan 2026).
Visual datasets (HAIVMet) use expert validation for both textual visualizations and image outputs, but do not explicitly assign numeric novelty scores; novelty is inferred by the generation of unprecedented visual instantiations for each metaphor (Chakrabarty et al., 2023).
3. Benchmark Tasks and Metrics
Metaphor novelty datasets support a spectrum of evaluative tasks, including:
- Paraphrase Judgement: Given a metaphoric sentence and candidate paraphrases (apt/inapt), models must identify the appropriate target-domain substitution, often under multiple prompt conditions (e.g., with or without explicit metaphor signaling) (Tong et al., 2024).
- Paraphrase Generation: Lexical substitution is scored by Mean Reciprocal Rank (MRR), Recall@k, and alignment with expert paraphrases (Tong et al., 2024).
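As a minimal illustration of how ranked paraphrase candidates can be scored against gold substitutions, the following sketch assumes each example provides a ranked candidate list and a set of expert paraphrases; it is a generic implementation, not the released MUNCH evaluation code.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """MRR over examples: reciprocal rank of the first gold paraphrase found."""
    total = 0.0
    for candidates, gold_set in zip(ranked_candidates, gold):
        rr = 0.0
        for rank, cand in enumerate(candidates, start=1):
            if cand in gold_set:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(gold)

def recall_at_k(ranked_candidates, gold, k=5):
    """Fraction of examples with at least one gold paraphrase in the top k."""
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if set(cands[:k]) & g)
    return hits / len(gold)

# Toy example: one metaphor, the model ranks "examine" above "attack"
ranked = [["examine", "scrutinize", "attack"]]
gold = [{"scrutinize", "examine"}]
print(mean_reciprocal_rank(ranked, gold), recall_at_k(ranked, gold, k=2))
```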
For visual datasets, intrinsic evaluation relies on human illustrators’ rankings and qualitative edit suggestions, measuring creativity, faithfulness, and error severity. Representative metrics include average rank, the percentage of images marked “Lost Cause”, and the mean number of edits needed, stratified by image-generation pipeline (Chakrabarty et al., 2023).
Downstream visual entailment tasks operationalize metaphoric visual understanding by pairing generated images with entailment/neutral/contradiction hypotheses, scored by standard accuracy. Augmenting traditional entailment datasets with metaphoric visuals (HAIVMet) yields substantial performance increases on creative reasoning benchmarks (+23 points) (Chakrabarty et al., 2023).
4. Surprisal, Novelty, and Scaling Effects in LLMs
Emerging research examines whether neural LLM surprisals align with human judgments of metaphor novelty:
- Direct Surprisal: Computed as $-\log P(w \mid \text{left context})$ for the target word $w$, summed across its subword tokens.
- Cloze-style Surprisal: Conditions on both left and right context, $-\log P(w \mid \text{left and right context})$, often delivering better alignment for corpus-based datasets (Momen et al., 5 Jan 2026).
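A minimal sketch of direct surprisal with a causal LM follows, assuming GPT-2 via Hugging Face transformers; the model choice and tokenization handling are illustrative, and the cited work may differ in detail.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def direct_surprisal(left_context: str, target_word: str) -> float:
    """-log P(target_word | left_context), summed over the word's subword tokens."""
    ids = tokenizer(left_context, return_tensors="pt").input_ids
    # Leading space so the word is tokenized as it would appear mid-sentence.
    target_ids = tokenizer(" " + target_word, add_special_tokens=False).input_ids
    surprisal = 0.0
    for tid in target_ids:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        log_probs = torch.log_softmax(logits, dim=-1)
        surprisal += -log_probs[tid].item()
        ids = torch.cat([ids, torch.tensor([[tid]])], dim=1)
    return surprisal

print(direct_surprisal("The lawyer", "devoured"))
```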
Correlational metrics such as Pearson’s $r$, Spearman’s $\rho$, rank-biserial correlation, and AUC are standard. On VUA-ratings, the strongest reported alignments are only moderate, with an AUC of 0.819, indicating moderate predictive power. Synthetic datasets, with perfectly controlled word frequency, display stronger scaling with model size (the “Quality–Power Hypothesis”), while corpus-based datasets exhibit inverse scaling: larger models’ surprisal aligns less with human novelty as they overfit corpus frequency (Momen et al., 5 Jan 2026).
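Given per-item surprisals, continuous human novelty scores, and binary novelty labels, these alignment metrics can be computed as follows; this is a generic sketch using SciPy and scikit-learn, and the example values are toy data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import roc_auc_score

surprisal = np.array([6.2, 9.8, 4.1, 12.3, 7.5])  # model surprisal per metaphor
novelty = np.array([0.1, 0.6, -0.2, 0.9, 0.3])    # continuous human novelty score
is_novel = np.array([0, 1, 0, 1, 0])              # binary labels from the dataset

r, _ = pearsonr(surprisal, novelty)
rho, _ = spearmanr(surprisal, novelty)
auc = roc_auc_score(is_novel, surprisal)  # surprisal used as a novelty detector
print(f"Pearson r={r:.3f}  Spearman rho={rho:.3f}  AUC={auc:.3f}")
```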
Observed implications:
- Corpus-based novelty is confounded with frequency; smaller models sometimes yield more human-like surprisal.
- Synthetic datasets allow model scale to amplify semantic sensitivity to novelty.
- Instruction-tuning and cloze methods deliver marginal gains, suggesting robustness is more a function of data-control than of tuning regime.
5. Genre, Modality, and Use-Case Stratification
Metaphor novelty datasets stress diversity in genre (academic, news, fiction, conversation) and modality (linguistic, visual). MUNCH demonstrates cross-genre distribution for both metaphoric sentences and paraphrases, enabling analysis of model generalization and curriculum learning. Part-of-speech splits permit focused evaluation of noun-, verb-, and adjective-based metaphors (Tong et al., 2024).
HAIVMet expands modality coverage by pairing every textual metaphor with multiple visualizations, allowing exploration of multimodal metaphor rendering, compositionality in text-to-image diffusion, and visual entailment (Chakrabarty et al., 2023).
Specific use cases enabled by stratified novelty annotation include:
- Robustness evaluation across novelty bands (e.g., by binned score intervals); see the sketch after this list.
- Novelty detection (training classifiers and tracking loss/accuracy as a function of the novelty score).
- Controlled metaphor generation and creativity assessment.
- Cross-genre transfer analyses for domain adaptation.
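A minimal sketch of per-band robustness evaluation, assuming each example carries a continuous novelty score and a flag indicating whether the model answered correctly; the bin edges and field layout are illustrative.

```python
import numpy as np

def accuracy_by_novelty_band(scores, correct, bins=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Bin examples by novelty score and report model accuracy per band."""
    scores, correct = np.asarray(scores), np.asarray(correct)
    results = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.any():
            results[f"[{lo}, {hi})"] = correct[mask].mean()
    return results

scores = [-0.8, -0.3, 0.1, 0.4, 0.7, 0.9]  # per-example novelty scores
correct = [1, 1, 1, 0, 0, 1]               # did the model judge the paraphrase correctly?
print(accuracy_by_novelty_band(scores, correct))
```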
6. Limitations, Open Questions, and the Path Forward
Major limitations of current metaphor novelty datasets concern resource intensity, scale, linguistic coverage, and metric formalization:
- Resource Intensity: Crowdsourcing, expert validation, and API-based LLM calls create scalability bottlenecks, particularly for multimodal pipelines and high-quality visual elaborations (Chakrabarty et al., 2023).
- Language Diversity: Predominant focus on English; cross-lingual extensions remain largely undeveloped.
- Metric Gaps: Absence of explicitly quantitative novelty metrics in visual datasets limits direct comparison across modalities. Further work is needed to define and operationalize visual metaphor novelty at scale (Chakrabarty et al., 2023).
- Frequency Confounds: In corpus-based resources, the entanglement of novelty with word rarity undermines model evaluation for true semantic creativity (Momen et al., 5 Jan 2026).
- Prompting Sensitivity: Models’ performance on high-novelty metaphors is highly susceptible to prompt design (Tong et al., 2024).
A plausible implication is that future datasets will increasingly emphasize synthetic control, multilinguality, explicit quantitative metrics for both textual and visual novelty, and deeper integration of human-in-the-loop annotation at all stages. Modular pipelines exemplified by HAIVMet, which separate metaphor interpretation from image rendering, are likely to become standard, facilitating compositional improvements and more nuanced study of both metaphor novelty and broader creative language phenomena.