Visual Metaphor Transfer (VMT)

Updated 8 February 2026
  • Visual Metaphor Transfer (VMT) is a computational framework that maps abstract, cross-domain metaphors into images by decomposing linguistic cues into detailed visual schemata.
  • It employs a multi-stage process including linguistic decomposition, schema extraction, and diffusion-based generation to recreate metaphorical meaning.
  • Evaluation combines human assessments with automated metrics like CLIPScore and Meaning Alignment to ensure compositional accuracy and creative fidelity.

Visual Metaphor Transfer (VMT) refers to the computational generation or transformation of visual metaphors by systematically mapping conceptual structures—often drawn from natural language or visual exemplars—into images that encode abstract, cross-domain meanings. VMT frameworks differ fundamentally from style transfer or literal text-to-image generation by emphasizing the preservation and instantiation of metaphorical logic, relational invariants, and compositional meaning. This task presents unique challenges in multi-modal alignment, semantic grounding, and creative imagery, requiring both high-level language understanding and advanced generative modeling.

1. Foundations and Formalization of Visual Metaphor Transfer

VMT broadens visual generation tasks by focusing on the transformation of metaphorical structures between modalities or subjects. Early formulations cast VMT as mapping from linguistic metaphors to visual images: for a given input metaphor $m_{text}$, the system generates an image $I_{vis}$ that realizes the implied meaning through suitable object composition and symbolism (Chakrabarty et al., 2023). Formally: $m_{text} \xrightarrow{\text{LLM/CoT}} e_{vis} \xrightarrow{\text{Diffusion Model}} I_{vis}$, where $e_{vis}$ is an intermediate “visual elaboration”—a natural language description specifying relevant objects, spatial relations, and implicit attributes.
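The two-stage mapping above can be sketched as a minimal pipeline. This is an illustrative stub, not the cited system: `elaborate_metaphor` and `generate_image` stand in for a real LLM chain-of-thought call and a diffusion model call, and all names are assumptions.

```python
# Hypothetical sketch of the VMT pipeline m_text -> e_vis -> I_vis.
# Both stages are stubbed; a real system would call an LLM and a
# diffusion model (e.g., Stable Diffusion) here.

def elaborate_metaphor(m_text: str) -> str:
    """Stand-in for chain-of-thought elaboration: turns a linguistic
    metaphor into a concrete visual description (objects, relations)."""
    return f"A literal scene depicting: {m_text}, with explicit objects and spatial relations"

def generate_image(e_vis: str) -> dict:
    """Stand-in for a diffusion-based text-to-image call."""
    return {"prompt": e_vis, "image": "<pixels>"}

def vmt_pipeline(m_text: str) -> dict:
    e_vis = elaborate_metaphor(m_text)   # linguistic decomposition stage
    return generate_image(e_vis)         # diffusion-based generation stage

result = vmt_pipeline("time is a thief")
```

The key design point the sketch preserves is that the diffusion model never sees the raw metaphor, only the intermediate elaboration $e_{vis}$.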

Recent work expands the scope to include reference-image-based VMT, where the task is to extract a “creative essence” (relational logic or schema) from an image $I_{ref}$ and recombine it with a user-specified new subject $S_{tgt}$ to produce $I_{final}$—an image enacting the same metaphorical mapping on a novel domain (Xu et al., 1 Feb 2026). This process is mediated by an intermediate, domain-independent “Schema Grammar” $G$ encoding all key metaphorical and aesthetic slots.

A parallel strand focuses on the decomposition of linguistic metaphors into discrete triples—Source (S), Target (T), Meaning (M)—and the generation of images that score highly on alignment metrics between the generated visual realization and all components of this decomposition (Koushik et al., 26 Aug 2025).
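The Source–Target–Meaning decomposition can be represented as a simple typed record. A minimal sketch, assuming field names of our own choosing (the cited work does not prescribe a data layout):

```python
from dataclasses import dataclass

@dataclass
class MetaphorDecomposition:
    """(Source, Target, Meaning) triple; field names are illustrative."""
    source: str   # S: the concrete vehicle domain
    target: str   # T: the abstract tenor being described
    meaning: str  # M: the property the metaphor transfers

# Example decomposition of "my lawyer is a shark":
d = MetaphorDecomposition(
    source="shark",
    target="lawyer",
    meaning="aggressive and relentless",
)
```

Alignment metrics can then be scored per component, e.g. checking that both `source` and `target` objects are visually present while the image as a whole conveys `meaning`.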

2. Methodological Architectures for VMT

A canonical VMT pipeline encompasses several core stages:

  • Linguistic Decomposition/Elaboration: Chain-of-thought (CoT) prompting within LLMs is employed to decompose $m_{text}$ into constituent objects, abstract properties, and detailed visual instructions. This stage outputs $e_{vis}$ or (S, T, M) triples, ensuring the generative process is grounded not only in surface form but in inferential meaning (Chakrabarty et al., 2023, Koushik et al., 26 Aug 2025).
  • Schema Extraction and Manipulation (Schema-Driven VMT): When transferring from a reference image, a vision–LLM (VLM) is tasked with parsing the image into a schema grammar $G = \{S, C, G, C_p, A_t, T_e, A_S, A_C, V, I\}$, representing subject, carrier/context, generic relational structure, composition/tonality/typography, attributes, violation/conflict, and emergent meaning (Xu et al., 1 Feb 2026). The Transfer Agent replaces selected schema slots to instantiate a new target metaphor on an alternative subject.
  • Generation and Diagnostics: The processed elaboration or schema is encoded as a prompt for a diffusion-based text-to-image generator (e.g., Stable Diffusion v2.1, DALL·E 2/3, Midjourney). Synthetic images are then critically appraised by a diagnostic agent (often VLM-based), which evaluates subject salience, relational consistency, violation realization, and meaning alignment, triggering iterative backtracking as necessary (Xu et al., 1 Feb 2026).
  • Reward-Guided Learning: In hybrid RL or policy-optimization settings, the prompt generator is optimized using multi-component reward signals, including metaphor decomposition score, CLIPScore, BERTScore on S/T/M components, source/target object presence, and meaning alignment (MA) scores determined by a VLM (Koushik et al., 26 Aug 2025).
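The reward-guided stage above combines several scalar signals into one training reward. A minimal sketch of such a combination, assuming equal weights (the actual weighting in the cited work is not specified here):

```python
def composite_reward(r_decomp: float, clip_score: float, bert_score: float,
                     presence: float, ma: float,
                     weights=(0.2, 0.2, 0.2, 0.2, 0.2)) -> float:
    """Weighted sum of the reward components named in the text:
    decomposition score, CLIPScore, BERTScore on S/T/M, object presence,
    and VLM-judged meaning alignment. Weights are illustrative only."""
    components = (r_decomp, clip_score, bert_score, presence, ma)
    return sum(w * c for w, c in zip(weights, components))

# Example: a generation with strong decomposition and meaning alignment
# but a modest CLIPScore still earns a mid-to-high reward.
r = composite_reward(r_decomp=0.87, clip_score=0.30,
                     bert_score=0.80, presence=1.0, ma=0.88)
```

In a policy-optimization loop, this scalar would be fed back to the prompt generator (e.g., via PPO-style updates) so that later prompts score higher on all components jointly.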

3. Evaluation Metrics and Benchmarks

VMT research employs a diverse suite of both intrinsic and extrinsic evaluation methodologies:

  • Human-Centric Evaluation: Professional illustrators or lay annotators rate image outputs by rank (1–5), flag “Lost Cause” (irrecoverable) outputs, and provide edit instructions. Metrics include average rank, % Lost Cause, and mean edit/correction count (Chakrabarty et al., 2023).
  • Automated Score Functions:
    • Metaphor Decomposition Score ($r_{\mathrm{decomp}}$): Quantifies the accuracy of an LLM’s decomposition (S, T, M) relative to ground truth and overall metaphor plausibility (Koushik et al., 26 Aug 2025).
    • Meaning Alignment (MA): VLM-judged alignment between image-perceived meaning and intended metaphorical meaning, providing a scalar score in $[0, 1]$ (Koushik et al., 26 Aug 2025).
    • CLIPScore/BERTScore: Text–image similarity across various axes, supporting both alignment and object presence verification (Koushik et al., 26 Aug 2025).
  • Visual Entailment: Downstream validation tasks assess whether generated images support or contradict natural language hypotheses, with accuracy measured before and after exposure to VMT-generated data (Chakrabarty et al., 2023).
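The human-centric metrics listed above (average rank, % Lost Cause, mean edit count) reduce to straightforward aggregation over annotator records. A minimal sketch, assuming a record format of our own devising:

```python
def summarize_human_eval(records: list[dict]) -> dict:
    """Aggregate annotator records into the three human-centric metrics
    named in the text. The record layout (rank, lost_cause, edits) is an
    illustrative assumption, not the evaluation protocol's actual schema."""
    n = len(records)
    return {
        "avg_rank": sum(r["rank"] for r in records) / n,
        "pct_lost_cause": 100.0 * sum(r["lost_cause"] for r in records) / n,
        "mean_edits": sum(r["edits"] for r in records) / n,
    }

stats = summarize_human_eval([
    {"rank": 2, "lost_cause": False, "edits": 1},
    {"rank": 1, "lost_cause": False, "edits": 0},
    {"rank": 4, "lost_cause": True,  "edits": 3},
])
```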

4. Datasets, Experimental Results, and Analysis

The HAIVMet dataset is central to quantitative evaluation in linguistic-metaphor-to-image VMT: 1,540 unique linguistic metaphors, each with ∼4 expert-filtered images, for a total of 6,476 $(m_{text},\, e_{vis},\, \{I_{vis}^{(1)}, \dots, I_{vis}^{(4)}\})$ entries (Chakrabarty et al., 2023). For visual metaphor transfer pipelines, no universal public benchmark exists, but experiments report high agreement (∼98.2%) between VLM-ensemble judgments and human critics on recognizability, ingenuity, violation appropriateness, and overall quality (Xu et al., 1 Feb 2026).
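A HAIVMet-style entry could be represented as below. This layout is inferred from the tuple description in the text; the field names are assumptions, not the dataset's actual schema.

```python
# Hypothetical record layout for one HAIVMet-style entry:
# a metaphor, its visual elaboration, and ~4 expert-filtered images.
entry = {
    "m_text": "love is a battlefield",             # linguistic metaphor
    "e_vis": "two armies of heart-shaped soldiers "
             "clashing on a scarred plain",        # visual elaboration
    "images": ["img_1.png", "img_2.png",
               "img_3.png", "img_4.png"],          # filtered generations
}

# Basic sanity checks a loader might perform:
assert entry["m_text"] and entry["e_vis"]
assert len(entry["images"]) == 4
```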

Key performance results highlight:

| System | Decomp Score | CLIP | MA | Avg Rank | % Lost Cause |
|---|---|---|---|---|---|
| Gemma-3-27B + Janus-7B (TF) | 0.8668 | 0.2960 | 0.8760 | — | — |
| LLM–DALL·E 2 (Chakrabarty et al., 2023) | — | — | — | 1.96 | 6.0 |
| GPT-4o (zero-shot) (Koushik et al., 26 Aug 2025) | 0.8072 | 0.2296 | 0.8180 | — | — |
| OFA_{SNLI-VE+HAIVMet} (Chakrabarty et al., 2023) | — | — | — | — | — |

LLM–DALL·E 2 achieves the top human ranking and minimal irrecoverable failures when chain-of-thought prompting is combined with a strong diffusion backbone. The training-free, reward-guided Gemma-3-27B + Janus-7B pipeline reaches the highest open-source decomposition and MA scores and excels on abstract metaphors (Koushik et al., 26 Aug 2025). Adding HAIVMet to entailment training yields a 23 percentage-point accuracy gain in visual entailment.

Qualitative analysis consistently identifies that structured, explicit decomposition (CoT, S–T–M slots, schema grammar) enhances metaphorical compositionality and consistency. Automated reward pipelines close semantic gaps with strong closed models (e.g., GPT-4o) but lag in aesthetics and style-dependent human preference, with failure cases often related to under-specified prompts or subtle relational mismatches.

5. Theoretical and Cognitive Frameworks: Schema Grammar and Blending

Advanced VMT models draw on Conceptual Blending Theory (CBT), positing that creative metaphors arise by projecting two or more input spaces (domains) into a blended space, guided by a shared generic schema of relational invariants (Xu et al., 1 Feb 2026). The introduced “Schema Grammar” ($G$), operationalized as an explicit tuple, encodes:

  • Subject ($S$)
  • Carrier/Context ($C$)
  • Generic Space ($G$): e.g., compositional, isomorphic, functional, causal, paradoxical
  • Aesthetic slots ($C_p$, $A_t$, $T_e$): composition, tonality, typography/graphic
  • Inherent attributes ($A_S$, $A_C$)
  • Violation or conflict point ($V$)
  • Emergent meaning or implication ($I$)

This schema is extracted (perception agent), manipulated for transfer to a new subject (transfer agent), and finally re-instantiated via text-to-image prompt engineering (generation agent). A diagnostic agent closes the loop with error localization and hierarchical correction across abstraction levels.
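The extract-then-replace step of the transfer agent can be sketched over a dict-based schema. All slot values below are invented examples, and representing $G$ as a plain dict is our assumption, not the cited system's implementation:

```python
# Schema produced by a hypothetical perception agent for a reference
# image of a tree bearing lightbulbs instead of fruit.
reference_schema = {
    "S": "lightbulb",                 # subject
    "C": "tree branch",               # carrier/context
    "G": "isomorphic",                # generic relational structure
    "Cp": "centered", "At": "warm tones", "Te": "serif caption",
    "A_S": ["glowing"], "A_C": ["organic", "branching"],
    "V": "fruit replaced by bulbs",   # violation/conflict point
    "I": "ideas grow naturally",      # emergent meaning
}

def transfer_agent(schema: dict, new_subject: str, new_attrs: list) -> dict:
    """Swap the subject-related slots while keeping the relational
    invariants (G, V, I) and aesthetic slots intact."""
    out = dict(schema)           # leave the reference schema unmodified
    out["S"] = new_subject
    out["A_S"] = new_attrs
    return out

# Re-instantiate the same metaphor on a new subject.
target_schema = transfer_agent(reference_schema, "book", ["open", "paper"])
```

The resulting `target_schema` would then be serialized into a text-to-image prompt by the generation agent, with the diagnostic agent checking that the generated image still realizes the unchanged $V$ and $I$ slots.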

6. Limitations, Open Challenges, and Future Directions

Current VMT pipelines remain limited by several factors:

  • Human Effort and Model Dependence: Chains with heavy annotation, validation, and filtering dependencies are labor- and compute-intensive (Chakrabarty et al., 2023).
  • Aesthetic and Cultural Narrowness: Diffusion models still struggle with fine-grained attribute binding, rare blends, and non-English (broader cultural/linguistic) metaphors (Chakrabarty et al., 2023, Xu et al., 1 Feb 2026).
  • Reward Signal Bias: Automated VLM/CLIP/BERT-based evaluation, while scalable, introduces bias and does not fully capture human aesthetic judgements (Koushik et al., 26 Aug 2025).

Ongoing work prioritizes:

  • Full schema-driven agent pipelines distilled to differentiable, end-to-end trainable models for efficiency (Xu et al., 1 Feb 2026).
  • Integration of style-aware, aesthetic, and interactive rewards to close human preference gaps (Koushik et al., 26 Aug 2025).
  • Expansion to real-time, human-in-the-loop multimodal frameworks, richer domain coverage, and improved schema induction for more robust generalization to novel metaphors and cultural contexts (Chakrabarty et al., 2023, Xu et al., 1 Feb 2026).

7. Applications and Broader Significance

Controlled and semantically aware VMT technologies are well suited for editorial illustration, advertising, branding, media design, and educational visualization—any domain demanding high-impact, bespoke metaphorical imagery, with fine-grained control over both abstract logic and visual attributes (Xu et al., 1 Feb 2026, Chakrabarty et al., 2023). The explicit separation of logic from appearance, together with rigorous evaluation metrics, distinguishes VMT from pixel-level style transfer or surface-alignment T2I pipelines, positioning it as a central task in computational creativity, cognitive modeling, and explainable AI.
