Linking and Grounding Metaphors
- A formal framework links ontological metaphors such as 'energy-as-substance' and 'energy-as-location' through conceptual blending, resolving structural tensions between the two conceptualizations.
- It demonstrates that embodiment ratings in language models correlate with metaphor interpretation accuracy, highlighting the sensorimotor basis of abstract reasoning.
- A multimodal approach is detailed where linguistic metaphors are transformed into explicit scene descriptions and synthesized into images, validated through rigorous human evaluation and empirical metrics.
Metaphors are foundational to both human cognition and communication, enabling abstract or complex domains to be reasoned about, described, and visualized through mappings from more concrete or familiar source domains. Linking and grounding metaphors refers to the processes and mechanisms by which disparate metaphoric conceptualizations—whether between multiple ontological metaphors, linguistic and visual modalities, or figurative expressions and embodied experiences—are tied together into coherent, actionable models. These processes are central in scientific modeling, multimodal AI, and the interpretation and synthesis of figurative language.
1. Formal Methods for Linking Ontological Metaphors
The coordination of multiple ontological metaphors is rigorously formalized by the conceptual blending framework. In the context of science education, "energy-as-substance" and "energy-as-location" serve as canonical input spaces. The blending model employs four spaces: two input spaces (each instantiating a distinct metaphor), a generic space (capturing shared relations such as conservation and transfer), and a blended space (realizing new, often non-trivial, inferences via selective projection and compression of features from the inputs).
Let $I_S$ denote the input space for the substance metaphor and $I_L$ the input space for the location metaphor, with $G$ the generic space and $B$ the blend. The blend space $B$ incorporates coupled structures from both inputs. The mappings are formally specified as correspondences from the generic space together with selective projections into the blend:

$$\phi_S \colon G \to I_S, \quad \phi_L \colon G \to I_L, \qquad \pi_S \colon I_S \to B, \quad \pi_L \colon I_L \to B.$$
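As an illustrative sketch (not the cited paper's formalism), the four-space structure and selective projection can be modeled with simple set operations; the relation names below are hypothetical stand-ins:

```python
# Illustrative sketch of the four-space blending structure for energy metaphors.
# Relation names are hypothetical placeholders, not taken from Dreyfus et al.

# Two input spaces, each instantiating one ontological metaphor.
input_substance = {"conserved", "transferable", "quantifiable", "contained_in_object"}
input_location = {"conserved", "transferable", "quantifiable", "position_on_axis"}

# Generic space: the structure shared by both inputs (conservation, transfer, quantity).
generic = input_substance & input_location

# Blended space: selective projection of features from both inputs.
blend = generic | {"contained_in_object", "position_on_axis"}

# Emergent structure: a position below zero on an axis licenses "negative energy",
# which a pure substance ontology cannot express.
blend.add("negative_values_allowed")

print(sorted(generic))
```

The intersection plays the role of the generic space, while the blend selectively inherits the incompatible features from each input and supports an inference available in neither input alone.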
Empirical predicate and gesture analyses reveal that participants simultaneously deploy predicates and gestures from both source ontologies, and that their co-occurrence marks active blending. These blends resolve structural tensions such as the "negative substance" problem (negative amounts are impossible for substances but natural as locations below zero) by synthesizing spatial schemas with conservation logic grounded in quantity transfer (Dreyfus et al., 2014).
2. Embodiment and the Grounding of Figurative Language
Metaphor grounding in LLMs involves the capacity to process metaphoric expressions whose meaning is not solely distributional but inherently tied to embodied experience. Embodiment is operationalized at the verb level as "the degree to which the meaning of a verb involves the human body," with normative ratings assigned on a 1–7 Likert scale (Sidhu et al., 2014). Each metaphorical sentence is assigned a continuous embodiment score, computed as the average rating of all contained verbs.
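The sentence-level score described above reduces to a simple mean over per-verb norms. A minimal sketch follows; the rating values are invented placeholders, not the published Sidhu et al. norms:

```python
from statistics import mean

# Hypothetical verb-level embodiment norms on a 1-7 Likert scale
# (placeholder values, NOT the actual Sidhu et al. ratings).
embodiment_norms = {"grasp": 5.8, "kick": 6.2, "consider": 2.1}

def sentence_embodiment(verbs):
    """Continuous embodiment score: mean norm over the sentence's rated verbs."""
    rated = [embodiment_norms[v] for v in verbs if v in embodiment_norms]
    return mean(rated) if rated else None

print(sentence_embodiment(["grasp", "consider"]))  # ≈ 3.95
```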
Systematic analysis reveals that larger LMs, such as GPT-3 (175B), OPT (13B), GPT-NeoX (20B), and GPT-2 XL (1.5B), exhibit a statistically significant positive point-biserial correlation (r ≈ 0.06–0.07) between the embodiment score and zero-shot metaphor interpretation accuracy, a relationship that is absent in smaller models. OLS regression further confirms that embodiment uniquely predicts performance, independent of age of acquisition, word frequency, or length (variance inflation factors below 2 for all predictors; model fit increases with the inclusion of embodiment) (Wicke, 2023).
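The statistic involved is the standard point-biserial coefficient, i.e., Pearson's r between a binary outcome (item answered correctly or not) and a continuous predictor (embodiment score). A minimal pure-Python version, with made-up toy data, looks like this:

```python
from math import sqrt

def point_biserial(binary, continuous):
    """Pearson's r between a 0/1 variable and a continuous variable,
    which equals the point-biserial correlation coefficient."""
    n = len(binary)
    mb = sum(binary) / n
    mc = sum(continuous) / n
    cov = sum((b - mb) * (c - mc) for b, c in zip(binary, continuous))
    var_b = sum((b - mb) ** 2 for b in binary)
    var_c = sum((c - mc) ** 2 for c in continuous)
    return cov / sqrt(var_b * var_c)

# Toy data: 1 = metaphor interpreted correctly, paired with embodiment scores.
correct = [0, 0, 1, 1]
scores = [2.0, 3.0, 4.0, 5.0]
print(round(point_biserial(correct, scores), 3))  # 0.894
```

In the actual study the effect is far smaller (r ≈ 0.06–0.07) and detectable only across many items; the toy data here merely exercises the formula.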
No significant relationship is found between metaphor interpretation and verb concreteness, indicating that LLMs encode a dedicated embodied semantic dimension rather than simply memorizing conventional usage patterns or relying on concreteness as a proxy. This suggests that distributional training enables high-capacity LMs to internalize sensorimotor proxies sufficient for the partial disambiguation of metaphorical meaning.
3. Multimodal Linking: From Linguistic to Visual Metaphors
The generation of visual metaphors from linguistic input represents a paradigmatic multimodal linking task. The process is nontrivial: textual metaphors typically rely on implicit associations and attributes (e.g., "My bedroom is a pig sty" invokes the implicit relation of messiness), which simple text-to-image pipelines fail to surface.
To address this, metaphor-to-image generation is factorized into two stages:

$$d = f_{\text{LLM}}(m), \qquad v = g_{\text{diff}}(d),$$

where $m$ is the input linguistic metaphor, $f_{\text{LLM}}$ is a language-only model elaborating $m$ into a detailed scene description $d$ (explicitly referencing the target and source domains, as well as the implicit attribute), and $g_{\text{diff}}$ is a diffusion-based image generator synthesizing the image $v$ from $d$ (Chakrabarty et al., 2023). This intermediate representation is essential for explicit grounding; feeding $m$ directly into diffusion models routinely fails to encode the implicit property.
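The two-stage factorization can be sketched as below. `elaborate_metaphor` and `generate_image` are hypothetical stand-ins for an LLM call and a diffusion model, not the paper's actual implementation; a canned scene description keeps the sketch runnable:

```python
def elaborate_metaphor(metaphor: str) -> str:
    """Stage 1 (stand-in for an LLM): expand the metaphor into a literal
    scene description naming target, source, and the implicit attribute."""
    # A real system would prompt an LLM here; a canned example is used instead.
    scenes = {
        "My bedroom is a pig sty": (
            "A bedroom (target) depicted as a muddy pig sty (source), "
            "with clothes and trash strewn everywhere, conveying messiness "
            "(implicit attribute)."
        )
    }
    return scenes.get(metaphor, f"A literal scene depicting: {metaphor}")

def generate_image(scene_description: str) -> bytes:
    """Stage 2 (stand-in for a diffusion model): synthesize an image."""
    return f"<image rendered from: {scene_description}>".encode()

scene = elaborate_metaphor("My bedroom is a pig sty")  # stage 1
image = generate_image(scene)                          # stage 2
print(scene)
```

The key design point is that the implicit attribute ("messiness") is made explicit in the intermediate text, so the image generator never has to infer it from the figurative phrasing.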
A mixed-initiative, human-AI pipeline further refines quality. Metaphors unsuitable for drawing (e.g., those referencing only olfactory properties) are filtered manually, Chain-of-Thought (CoT) prompting is used for explicit reasoning in LLMs (with human post-editing required in ∼29% of cases), and human experts vet and curate the resulting images. The resulting HAIVMet dataset comprises 6,476 high-quality images for 1,540 linguistic metaphors, with expert-constrained compositionality systematically realized in the visual domain.
4. Empirical Evaluation and Metrics
Assessment of linking and grounding mechanisms is conducted via both intrinsic and extrinsic human evaluation protocols. In visual metaphor generation, professional concept artists ranked outputs across model ablations, quantified "lost cause" rates, and suggested minimal improvements. The LLM–Diffusion two-stage pipeline (LLM-DALL·E 2) outperformed baselines:
- Average rank: 1.96 (vs. 3.82 for Stable Diffusion on the raw metaphor text only)
- "Lost Cause": 6.0% (vs. 31.6%)
- Average instructions: 0.76 (vs. 2.25)
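The metrics above aggregate straightforwardly from per-item annotations; a minimal sketch with invented toy records:

```python
# Toy annotation records: per-item rank assigned by artists (1 = best) and
# whether the output was flagged as a "lost cause" (unsalvageable).
# These values are invented, not the study's data.
annotations = [
    {"rank": 1, "lost_cause": False},
    {"rank": 2, "lost_cause": False},
    {"rank": 3, "lost_cause": True},
    {"rank": 2, "lost_cause": False},
]

avg_rank = sum(a["rank"] for a in annotations) / len(annotations)
lost_cause_rate = 100 * sum(a["lost_cause"] for a in annotations) / len(annotations)

print(avg_rank)          # 2.0
print(lost_cause_rate)   # 25.0
```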
In blind discrimination (HAIVMet vs. raw LLM–DALL·E 2), experts preferred HAIVMet outputs 45% to 18%, with substantially lower "lost cause" rates (1.6% vs. 5%) and a higher percentage rated "perfect" (63.6% vs. 52%).
Downstream grounding utility was quantified via visual entailment (VE): fine-tuning the OFA-base vision-language model on SNLI-VE plus HAIVMet images yielded a roughly +23-point test accuracy gain (51.15% vs. 27.81%). This validates that explicitly grounded visual metaphors concretely transmit figurative meaning, supporting both human judgments and objective task performance (Chakrabarty et al., 2023).
5. Conceptual and Pedagogical Implications
In educational practice, the deliberate linking of multiple metaphoric ontologies via blending supports students’ ability to reconcile conceptual tensions (e.g., negative potential energy) and facilitates flexible reasoning across representational modalities. Gesture analysis reveals that cross-ontology blends frequently appear in the embodied coordination of talk and action. Pedagogical interventions that foreground these blends—by making explicit the mappings between transferred substance and location on an energy axis—support deeper scientific understanding (Dreyfus et al., 2014).
In language modeling, embodiment effect findings indicate that enriching training data with high-embodiment contexts could further advance figurative language processing. The absence of confounds from concreteness, age of acquisition, or frequency, alongside the positive embodiment-performance correlation in LMs, highlights viable routes for enhancing semantic grounding.
6. Limitations and Open Challenges
Key current limitations include reliance on closed-source LLMs and diffusion models (GPT-3, DALL·E 2), the cost and latency associated with expert intervention, and restriction to English, culture-bound metaphors. Automated iterative editing and richer prompt decomposition (e.g., self-critique or tree-of-thought techniques) are suggested for future improvement. Attribute misbinding (e.g., persistent objects in generated scenes contrary to intent) and incomplete scene reasoning remain active error sources.
Extension to multilingual and culturally specific metaphorical domains, integration of scene graphs or symbolic reasoning, and formal measurement of diversity and compositionality in generated metaphor corpora are open research directions (Chakrabarty et al., 2023).
7. Synthesis
Linking and grounding metaphors spans formal conceptual blending, statistical modeling of embodiment, and multimodal reasoning via human-AI collaboration. Across domains, systematic mechanisms—disentangling and recombining input spaces, quantifying embodiment, and explicit scene elaboration—demonstrate that metaphorical meaning can be both linked across ontologies and grounded in ways that are accessible to computation, visualization, and human sense-making. These findings underscore the theoretical and practical importance of explicit linking/grounding strategies both in human learning and in advanced AI systems (Chakrabarty et al., 2023; Wicke, 2023; Dreyfus et al., 2014).