Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Published 1 Feb 2026 in cs.CV and cs.AI | (2602.01335v1)

Abstract: A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the "creative essence" from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar ("G"). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.

Summary

  • The paper introduces a closed-loop, schema-driven framework that decomposes visual metaphors into atomic semantic components for agent-guided transfer.
  • It integrates vision-language and language models to extract, map, and generate style-consistent prompts while maintaining semantic and aesthetic invariants.
  • Automated and human evaluations show up to 98.2% agreement with expert ratings on metaphor consistency, creativity, and conceptual integration.

Schema-Driven Agentic Reasoning for Visual Metaphor Transfer

Introduction

"Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning" (2602.01335) introduces a compositional, closed-loop framework for cross-domain visual metaphor transfer grounded in cognitive semiotics and agentic reasoning paradigms. The core innovation is a universal schema grammar that abstracts creative logic and structural invariants, enabling large vision-language models (VLMs) and LLMs to process, transfer, and instantiate visual metaphors across disparate semantic subjects and artistic styles.

Framework Architecture

The framework orchestrates a multi-agent pipeline comprising four semantically-specialized subsystems:

  • Perception Agent: Extracts a universal schema grammar ($G_{ref}$) from a reference image. This process isolates domain-independent creative logic, decomposing the image into components such as Subject, Carrier, Generic Space, Aesthetic, Violation, and Emergent Meaning. This schema grammar formalizes both visual structure and underlying metaphorical rationale.
  • Transfer Agent: Synthesizes a new target schema ($G_{tgt}$) by mapping the schema grammar of the reference onto a novel subject ($S_{tgt}$) while rigorously preserving relational invariants and aesthetic DNA.
  • Generation Agent: Converts the target schema into high-precision, style-consistent prompts for text-to-image diffusion models, enforcing strict adherence to compositional, tonal, and typographic constraints.
  • Diagnostic Agent: Performs qualitative, hierarchical failure assessment via VLMs. It identifies errors at prompt, component, or abstraction levels and triggers targeted refinement cycles.

The process operates as a closed refinement loop: it automatically diagnoses visual, semantic, or relational incoherence and triggers agent-guided corrections at the representational level (prompt, component, or abstraction) where the error originated.

Universal Schema Grammar

The schema grammar formalism decomposes creative images into atomic, transferable logic:

  • Subject (S): The primary semantic entity to represent.
  • Carrier (C): The metaphorical vehicle; the structure or context carrying the transformation.
  • Generic Space (G): Cognitive-level invariants—material, structural, relational, or affective—that underpin the metaphor.
  • Aesthetic (Aes): Composition archetypes (centralized, minimalist, macro, dynamic), artistic tonality (monochromatic, retro, high-contrast, etc.), and graphic elements (absent, integrated, editorial).
  • Violation/Conflict (V): The logic-breaking trait or deliberate paradox operationalized within the image.
  • Emergent Meaning (I): The synthesized message delivered by the cross-domain blend.

This grammar is explicitly designed for full computational tractability and direct translation into LLM/VLM reasoning and T2I prompt engineering.
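To make the six components concrete, here is a minimal Python sketch of how such a schema could be held as a data structure. The class names, field names, the `transfer` helper, and the example values are all illustrative assumptions, not the paper's implementation. In the described framework a Transfer Agent would propose the new carrier, violation, and meaning; here they are passed in by hand so that the invariance of the Generic Space and Aesthetic is easy to see.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Aesthetic:
    composition: str       # e.g. "centralized", "minimalist", "macro", "dynamic"
    tonality: str          # e.g. "monochromatic", "retro", "high-contrast"
    graphic_elements: str  # e.g. "absent", "integrated", "editorial"


@dataclass(frozen=True)
class Schema:
    subject: str                    # S: the entity to represent
    carrier: str                    # C: the metaphorical vehicle
    generic_space: Tuple[str, ...]  # G: invariants shared by subject and carrier
    aesthetic: Aesthetic            # Aes: composition, tonality, graphic elements
    violation: str                  # V: the deliberate paradox
    emergent_meaning: str           # I: the message produced by the blend


def transfer(ref: Schema, subject: str, carrier: str,
             violation: str, meaning: str) -> Schema:
    """Re-instantiate a schema for a new subject.

    Assumption of this sketch: generic_space and aesthetic are copied
    unchanged, while the remaining fields are supplied by the caller.
    """
    return replace(ref, subject=subject, carrier=carrier,
                   violation=violation, emergent_meaning=meaning)


# Illustrative values only; they are not taken from the paper.
g_ref = Schema(
    subject="melting glacier",
    carrier="ice cream cone",
    generic_space=("melts under heat", "loss is irreversible"),
    aesthetic=Aesthetic("centralized", "high-contrast", "absent"),
    violation="a landscape served as a dessert",
    emergent_meaning="the climate is being consumed",
)
g_tgt = transfer(
    g_ref,
    subject="candle-lit library",
    carrier="wax candle",
    violation="books that drip like wax",
    meaning="knowledge burns away if neglected",
)
```

Freezing the dataclasses means a transferred schema is always a new object, so the reference schema stays available for later comparison by a critic.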

Agentic Metaphor Transfer Pipeline

The high-level algorithm involves sequential schema extraction, generic space preservation, aesthetic constraint enforcement, and iterative diagnostic-led editing.

  1. Schema Extraction: VLM-driven deconstruction of the reference image.
  2. Schema Transfer: Invariance-driven bridge mapping synthesizes a new schema for the target subject.
  3. Prompt Generation: LLM agent encodes the schema constraints to form robust T2I prompts.
  4. Diagnostic Refinement: VLM critic diagnoses logical and visual failures. Feedback is routed back so that the correction is applied at the specific level (prompt, component, or abstraction) where the error arose.
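The four steps above can be sketched as a small control loop. This is a hedged illustration, not the authors' code: the function `run_vmt`, the agent signatures, the round budget, and especially the routing rule (re-run the stage the critic blames plus everything downstream of it) are assumptions made for this example. The agents are stubbed with plain functions so the loop runs without any model.

```python
from typing import Any, Callable, Dict, Tuple

# Diagnostic levels named in the text, ordered from most upstream to most downstream.
LEVELS = ("abstraction", "component", "prompt")


def run_vmt(reference: Any, subject: str, agents: Dict[str, Callable],
            max_rounds: int = 3) -> Tuple[Any, int, bool]:
    """Run one pass plus up to `max_rounds` diagnostic-driven refinements.

    Returns (last image, refinement rounds used, whether the critic accepted it).
    """
    restart_from = "abstraction"  # the first pass runs every stage
    feedback = None
    schema_ref = schema_tgt = image = None
    for rounds_used in range(max_rounds + 1):
        # Assumed routing: re-run the blamed stage and every stage after it.
        if restart_from == "abstraction":
            schema_ref = agents["perceive"](reference, feedback)
        if restart_from in ("abstraction", "component"):
            schema_tgt = agents["transfer"](schema_ref, subject, feedback)
        prompt = agents["encode"](schema_tgt, feedback)
        image = agents["generate"](prompt)
        level, feedback = agents["diagnose"](image, schema_ref, schema_tgt)
        if level is None:  # critic found no failure
            return image, rounds_used, True
        if level not in LEVELS:
            raise ValueError(f"unknown diagnostic level: {level!r}")
        restart_from = level
    return image, max_rounds, False


# Stub agents (hypothetical) that record the order in which they are called.
calls = []


def perceive(reference, feedback):
    calls.append("perceive")
    return {"G": "fragility"}


def transfer(schema_ref, subject, feedback):
    calls.append("transfer")
    attempt = calls.count("transfer")
    return {**schema_ref, "S": subject, "C": f"carrier-{attempt}"}


def encode(schema_tgt, feedback):
    calls.append("encode")
    return f"{schema_tgt['S']} as {schema_tgt['C']}"


def generate(prompt):
    calls.append("generate")
    return f"<image: {prompt}>"


# Scripted critic: blames the carrier, then the prompt, then accepts.
verdicts = iter([("component", "carrier is unclear"),
                 ("prompt", "tone is off"),
                 (None, None)])


def diagnose(image, schema_ref, schema_tgt):
    calls.append("diagnose")
    return next(verdicts)


image, rounds_used, accepted = run_vmt(
    "reference.png", "coffee",
    {"perceive": perceive, "transfer": transfer, "encode": encode,
     "generate": generate, "diagnose": diagnose},
)
```

With the scripted critic, the reference is perceived once, the transfer step runs twice, and the prompt is encoded three times, which shows how a lower-level diagnosis avoids repeating upstream work.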

Automated and Human Evaluation

Reliability of automated evaluation is validated: LLM-ensemble judgments achieve 98.2% agreement with expert human raters across all metrics. Comprehensive user studies probe dimensions such as metaphor recognizability, ingenuity, violation appropriateness, visual integration, visual appeal, message clarity, and overall creative quality on 5-point Likert scales.

Specific evaluation criteria also include:

  • Metaphor Consistency (MC): Logical structure preservation.
  • Analogy Appropriateness (AA): Quality of carrier-subject analogy.
  • Conceptual Integration (CI): Visual and conceptual seamlessness.

These protocols ensure rigorous, multi-faceted assessment of both mechanistic and creative fidelity.
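The summary does not state how the agreement between the LLM ensemble and the expert raters is computed. One plausible reading, sketched below purely as an assumption, aggregates the judges' Likert scores per item by median and counts the fraction of items where the result matches the expert score within a tolerance. The function names and the toy scores are hypothetical and are not data from the paper.

```python
from statistics import median
from typing import Sequence


def ensemble_scores(per_item_judge_scores: Sequence[Sequence[int]]) -> list:
    """Aggregate each item's judge scores into one score (median, by assumption)."""
    return [median(scores) for scores in per_item_judge_scores]


def agreement_rate(ensemble: Sequence[float], expert: Sequence[float],
                   tolerance: float = 0) -> float:
    """Fraction of items where ensemble and expert differ by at most `tolerance`."""
    if len(ensemble) != len(expert) or not ensemble:
        raise ValueError("score lists must be non-empty and of equal length")
    hits = sum(abs(e - h) <= tolerance for e, h in zip(ensemble, expert))
    return hits / len(expert)


# Toy data: four images, three judges each, scores on a 5-point Likert scale.
judge_scores = [[4, 5, 4], [2, 3, 3], [5, 5, 4], [1, 2, 2]]
expert_scores = [4, 3, 4, 2]

aggregated = ensemble_scores(judge_scores)
exact = agreement_rate(aggregated, expert_scores)
within_one = agreement_rate(aggregated, expert_scores, tolerance=1)
```

Reporting both exact and within-one agreement makes clear how sensitive a headline figure is to the matching rule, which matters when comparing numbers across papers.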

Qualitative and Quantitative Results

The schema-driven, agentic pipeline demonstrates robust cross-domain metaphor transfer, with generated outputs aligning to both the creative logic and the aesthetic syntax of references. The diagnostic-refinement architecture consistently rectifies prompt-level, component-level, and schema-level errors, supporting non-trivial, high-quality visual outputs across a range of subjects and metaphor vehicles.

Figure 2: Diverse examples of visual metaphor transfer outputs generated by the schema-driven agentic reasoning pipeline.

User study scores confirm high recognizability, semantic ingenuity, and integration quality. The automated metrics, in turn, correlate strongly with expert assessment on all target dimensions.

Implications and Future Directions

This work advances the theoretical foundation and practical methodology for compositional visual metaphor generation—bridging cognitive semiotics, visual communication, and deep generative modeling. The explicit schema grammar establishes a new operational language for metaphor, facilitating systems that are both interpretable and controllable.

Practical implications extend to visual advertising, cross-modal creative tools, and explainable AI for semantic content creation. The hierarchical agentic pipeline offers a blueprint for closed-loop, diagnosis-driven creativity in multimodal AI.

Future developments could involve:

  • Expansion of schema libraries for granular artistic control.
  • Enhanced VLM/LLM co-training to further increase alignment and creativity.
  • Integrating real-time human-in-the-loop feedback for adaptive refinement.
  • Cross-lingual and cross-cultural adaptation of metaphor logic.

Conclusion

"Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning" formalizes and operationalizes high-level visual metaphor abstraction and transfer within an iterative, agentic framework. By bridging cognitive theory and deep generative modeling, the methodology enables automatic, semantically-consistent, and stylistically-rigorous metaphor generation—setting a new standard for interpretable and controllable creative AI systems.
