Multimodal Knowledge Graphs (MMKGs)
- Multimodal Knowledge Graphs are knowledge graphs enriched with diverse data types such as text, images, audio, and video to support complex reasoning and retrieval.
- They employ advanced fusion techniques, including attention mechanisms and transformer-based methods, to effectively align and integrate modality-specific features.
- MMKGs enable practical applications like multi-modal question answering and entity alignment, delivering improved performance in real-world AI and data integration tasks.
A Multimodal Knowledge Graph (MMKG) is an extension of the traditional symbolic knowledge graph paradigm in which nodes (entities), edges (relations), and/or attributes are explicitly linked to data from multiple modalities such as images, text, audio, and video. MMKGs serve as a unified infrastructure for knowledge representation, enabling models to learn, reason, and retrieve information grounded in both structured triples and modality-specific data. The associated representation, learning, completion, and reasoning tasks in MMKGs demand advanced methods to address modality fusion, alignment, robustness, and the integration of both symbolic and sub-symbolic knowledge.
1. Definitions and Formal Structure
An MMKG generalizes the symbolic knowledge graph (KG) tuple by associating one or more modalities with entities, attributes, or triples. The formal definition in recent works is typically:
- E: the entity set
- R: the relation set
- T ⊆ E × R × E: the set of triples
- M: the set of modalities (e.g., structure, text, vision, audio)
- D = {d_e^m | e ∈ E, m ∈ M}: the collection of raw or encoded data d_e^m for entity e under modality m (Liu et al., 28 Sep 2025, Zhu et al., 2022, Yi et al., 2024, Park et al., 11 Jun 2025)
There are two primary representational styles:
- Attribute-style (A-MMKG): multimodal data appear as attribute values of entities, e.g., a triple (e, hasImage, img) attaching an image img to entity e.
- Node-style (N-MMKG): modality items are first-class nodes, expanding the entity set E and permitting triples such as (e, r, i), where i is itself an image node (Zhu et al., 2022).
Many MMKGs now generalize to support not only text and images but also audio and video, with modality-agnostic or modality-specific embedding functions (Park et al., 11 Jun 2025).
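The two representational styles can be contrasted with a minimal sketch (class and identifier names here are illustrative, not taken from any cited system):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

@dataclass
class MMKG:
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)
    triples: set = field(default_factory=set)
    # Attribute-style storage: modality data attached to entities as attribute values.
    attrs: dict = field(default_factory=dict)  # entity -> {modality: [data]}

    def add_attr(self, entity, modality, datum):
        """A-MMKG: store `datum` as an attribute value of `entity`."""
        self.attrs.setdefault(entity, {}).setdefault(modality, []).append(datum)

    def add_modality_node(self, entity, relation, node_id):
        """N-MMKG: the modality item becomes a first-class node linked by a triple."""
        self.entities.add(node_id)
        self.triples.add(Triple(entity, relation, node_id))

kg = MMKG(entities={"Paris"}, relations={"hasImage"})
kg.add_attr("Paris", "vision", "paris_eiffel.jpg")          # A-MMKG style
kg.add_modality_node("Paris", "hasImage", "img:eiffel_01")  # N-MMKG style
```

In the N-MMKG case the image identifier joins the entity set and participates in ordinary triples, which is what allows relations between modality items themselves.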
2. Construction and Representation Methodologies
MMKG construction has undergone significant evolution:
- Data Collection & Symbol Grounding: Entities and relations are enriched with multimodal data via web crawling, dataset mining (e.g., DBpedia, YAGO, Wikidata), or object detection and scene-graph extraction from images/videos (Liu et al., 2019, Yi et al., 2024, Park et al., 11 Jun 2025).
- Modality Alignment: Alignment techniques map images, text, or audio to their corresponding entities via trained cross-modal encoders, margin-ranking, or Product-of-Experts (PoE) models (Liu et al., 2019, Yi et al., 2024).
- Fusion Strategies: Early approaches relied on concatenation or weighted averaging; modern methods leverage attention mechanisms, gated fusion, joint embeddings, transformer-based fusion, and hypercomplex (biquaternion) interaction to integrate structural and modality-specific signals (Zhang et al., 2024, Liu et al., 28 Sep 2025).
- Automatic and Scalable Pipelines: Newer frameworks like VaLiK cascade vision-LLMs to extract image-specific text, use similarity verification to filter noise, and construct graphs via LLM-driven relation extraction, allowing storage-efficient, zero-shot MMKG induction (Liu et al., 17 Mar 2025).
Notable large-scale MMKGs include the MMKG benchmark triad (FB15K/DB15K/YAGO15K, each with ~15k entities plus numeric and image literals), MMPKUBase (>52k entities and 1.2M images covering Chinese-language domains), and VAT-KG (over 110k triples with video, audio, text, and image content) (Liu et al., 2019, Yi et al., 2024, Park et al., 11 Jun 2025).
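The similarity-verification step used to filter noisy image-derived text can be sketched as a cosine-similarity threshold over embedding vectors (the vectors and threshold below are toy stand-ins; real pipelines such as VaLiK use trained cross-modal encoders):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_candidates(entity_vec, candidates, threshold=0.5):
    """Keep only modality-derived text candidates whose embedding is
    sufficiently similar to the entity's reference embedding."""
    return [text for text, vec in candidates if cosine(entity_vec, vec) >= threshold]

entity_vec = [1.0, 0.0, 1.0]
candidates = [
    ("Eiffel Tower at night", [0.9, 0.1, 0.8]),   # relevant -> kept
    ("a plate of pasta",      [0.0, 1.0, 0.0]),   # noise -> filtered out
]
kept = filter_candidates(entity_vec, candidates)
```

The same thresholding idea applies to image-entity alignment during construction: candidates below the similarity cutoff are treated as noise and excluded before graph induction.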
3. Learning Paradigms and Fusion Architectures
A core challenge in MMKGs is combining structured KG information with multi-modal features:
- Product of Experts (PoE): Scores candidate triples by multiplying unimodal expert probabilities, yielding strong baseline performance (Liu et al., 2019).
- Fusion-based Methods: Fixed fusion (e.g., concatenation, MLPs) often lose modality-specific information and are sensitive to noisy or irrelevant modalities (Liu et al., 28 Sep 2025).
- Adaptive and Structure-Aware Fusion: Contemporary models use attention-weighted fusion (TSAM), relation-aware experts with mixture-of-experts gating (MoMoK), and biquaternion algebra to balance independence and cross-modal interaction (M-Hyper) (Li et al., 28 May 2025, Zhang et al., 2024, Liu et al., 28 Sep 2025).
- Transformer-based and Generative Models: Several MMKG completion architectures leverage pre-trained transformers (T5, VisualBERT, BLIP, LLaVA) to generate or fuse multimodal context, framing link prediction as text generation (MMKG-T5, HERGC) (Ma et al., 26 Jan 2025, Xiao et al., 1 Jun 2025).
- Noise and Robustness: Modality-level noise masking and confidence scoring (SNAG) mitigate multi-modal hallucination and promote robust, confidence-weighted embeddings (Chen et al., 2024).
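The Product-of-Experts scoring above can be illustrated by multiplying per-modality plausibility probabilities for a candidate triple (the expert logits here are stand-in numbers, not the trained experts of Liu et al., 2019):

```python
import math

def product_of_experts(expert_logits):
    """Score a candidate triple by multiplying unimodal expert probabilities.
    Working in log-space avoids underflow when many experts are combined."""
    log_score = 0.0
    for logit in expert_logits:
        p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid per unimodal expert
        log_score += math.log(p)
    return math.exp(log_score)

# Stand-in logits for one candidate triple: structural, textual, and
# visual experts each score (h, r, t) independently.
score_plausible = product_of_experts([3.0, 2.0, 1.5])
score_implausible = product_of_experts([-2.0, 0.5, -1.0])
```

Because the product is small whenever any single expert assigns low probability, PoE lets one confident modality veto a triple that the others weakly support.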
Multi-modal fusion advances include:
- Fine-grained, patch/token-level representations (TSAM),
- Mixture of Modality Knowledge Experts for relation- and context-aware fusion (MoMoK),
- Gated or attention-based weighting of modalities adapted to each triple or relation (Li et al., 28 May 2025, Zhang et al., 2024).
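Attention-based weighting of modalities can be sketched as a softmax over query-modality relevance scores (dot-product attention with toy vectors; trained models such as TSAM or MoMoK learn these parameters per triple or relation):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_fuse(query, modality_embs):
    """Fuse modality embeddings as a convex combination whose weights
    are a softmax over query-modality dot products."""
    names = list(modality_embs)
    scores = [sum(q * e for q, e in zip(query, modality_embs[n])) for n in names]
    weights = softmax(scores)
    dim = len(query)
    fused = [0.0] * dim
    for w, n in zip(weights, names):
        for i in range(dim):
            fused[i] += w * modality_embs[n][i]
    return fused, dict(zip(names, weights))

query = [1.0, 0.0]  # e.g., a relation-conditioned query vector
embs = {"structure": [1.0, 0.0], "text": [0.5, 0.5], "vision": [0.0, 1.0]}
fused, weights = attention_fuse(query, embs)
# The modality most aligned with the query receives the largest weight.
```

Because the weights are recomputed per query, the same entity can emphasize different modalities for different triples or relations, which is the core motivation for adaptive over fixed fusion.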
4. Multi-Modal Knowledge Graph Completion and Reasoning
MMKGC aims to infer missing links by leveraging multimodal content. Key developments include:
- Contrastive Learning and Structure Dominance: Maintaining the dominance of structured graph information is critical; contrastive alignment objectives explicitly pull visual/textual features towards the core KG embedding (TSAM, SaCL) (Li et al., 28 May 2025).
- Negative Sampling and Robustness: Generative hierarchical negative sampling (DHNS) uses diffusion models to synthesize challenging, multimodal-aware negatives at varying semantic hardness levels. Adaptive losses further increase discriminative power and stability (Niu et al., 26 Jan 2025).
- Logical and Multi-Hop Reasoning: RConE introduces a geometric rough-cone embedding capable of handling multi-hop logical queries (conjunction, disjunction, negation) over MMKGs, supporting answer extraction at both the entity and fine-grained sub-entity (e.g., visual region) levels (Kharbanda et al., 2024).
- Generative and Instruction-Tuned Reasoning: HERGC pairs heterogeneous experts for retrieval with LLM-based generative selection, providing compositional generalization and effective candidate filtering (Xiao et al., 1 Jun 2025).
Multi-modal entity alignment methods such as HMEA operate in hyperbolic space, preserving KG hierarchy while exploiting visual embeddings for improved cross-KG matching (Guo et al., 2021).
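The contrastive alignment objective mentioned above (pulling a modality feature toward its entity's structural embedding and away from other entities') can be sketched as an InfoNCE-style loss; this is a generic formulation with toy vectors, not the exact TSAM/SaCL objective:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: -log softmax probability of the positive pair
    among the positive and all negatives, using cosine similarity."""
    def sim(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    logits = [sim(anchor, positive) / temperature]
    logits += [sim(anchor, n) / temperature for n in negatives]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

# Anchor: an entity's visual feature. Positive: its structural embedding.
# Negatives: structural embeddings of other entities (all toy 2-D vectors).
loss_aligned = info_nce([1.0, 0.1], [0.9, 0.2], [[-1.0, 0.3], [0.0, -1.0]])
loss_misaligned = info_nce([1.0, 0.1], [-0.9, 0.2], [[1.0, 0.3], [0.0, -1.0]])
```

Minimizing this loss drives modality features toward the structural embedding space, which is how such objectives preserve the dominance of graph structure while still absorbing visual and textual signal.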
5. Applications and Empirical Impact
MMKGs have demonstrable impact on a range of tasks:
- Multi-modal Question Answering (QA): MMKG-augmented LLMs (MR-MKG, VaLiK, VAT-KG) improve science QA, analogy reasoning, and multimodal VQA, reducing hallucinations and providing explicit paths to answers. Performance gains up to +6.4% accuracy are reported over baseline LLMs (Lee et al., 2024, Liu et al., 17 Mar 2025, Park et al., 11 Jun 2025).
- Entity Alignment and Retrieval: Alignment accuracy increases when incorporating visual/textual features and operating in hyperbolic embedding spaces (Guo et al., 2021). Retrieval-augmented generation leveraging MMKGs produces more grounded and factually correct responses (Park et al., 11 Jun 2025).
- Image Synthesis and Data Augmentation: Diffusion-based prompt engineering for MMKG image generation (VSNS) yields higher quality and KG-relevant images than naïve prompt or random neighbor selection, aiding downstream completion and VQA tasks (Xu et al., 18 Apr 2025).
- Chinese-Language and High-Domain-Coverage KGs: MMPKUBase directly supports Chinese VQA and recommendation, with filtered and clustered image attributes facilitating query and retrieval at scale (Yi et al., 2024).
Empirical benchmarks have standardized on filtered Mean Reciprocal Rank (MRR) and Hits@K for completion, with recent methods commonly evaluated on DB15K, MKG-W, MKG-Y, and related datasets.
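Given a ranked candidate list per query, with other known-correct answers filtered out before ranking, these metrics reduce to the following (a minimal reference implementation, not tied to any particular benchmark toolkit):

```python
def filtered_metrics(ranked_lists, answers, known_true, ks=(1, 3, 10)):
    """ranked_lists: per query, candidate entities in descending score order.
    answers: the gold entity per query.
    known_true: per query, the set of *other* correct entities to filter out."""
    rr_sum = 0.0
    hits = {k: 0 for k in ks}
    for ranked, gold, other_true in zip(ranked_lists, answers, known_true):
        # Filtered setting: drop other known-correct answers, keep the gold one.
        filtered = [c for c in ranked if c == gold or c not in other_true]
        rank = filtered.index(gold) + 1
        rr_sum += 1.0 / rank
        for k in ks:
            hits[k] += int(rank <= k)
    n = len(answers)
    return rr_sum / n, {k: h / n for k, h in hits.items()}

ranked = [["b", "a", "c", "gold1"], ["gold2", "x", "y"]]
mrr, hits_at = filtered_metrics(ranked, ["gold1", "gold2"], [{"b", "a"}, set()])
# Query 1: "b" and "a" are filtered as other true answers, so gold1 ranks 2nd.
# Query 2: gold2 ranks 1st. MRR = (1/2 + 1) / 2 = 0.75.
```

The filtering step matters: without it, a model is penalized for ranking other correct answers above the gold entity, which is why all the completion results cited above use the filtered setting.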
6. Open Challenges and Prospects
Several critical areas remain for MMKGs:
- Complex Symbolic Grounding: Moving beyond triple-level grounding to subgraph or multi-hop (path, cycle, logical composition) alignment with visual/audio data (Zhu et al., 2022, Kharbanda et al., 2024).
- Quality Control and Visualizability Judgement: Automated filtering of abstract or non-visualizable entities, adversarial and adaptive negative sampling, and robust noise-handling under missing modality conditions (Chen et al., 2024, Niu et al., 26 Jan 2025).
- Scaling and Compression: Compact, indexable representations (pointer-based MMKGs, quantized multiply-representations) are needed for web-scale deployment and integration with LLMs (Liu et al., 17 Mar 2025).
- Unified Modeling of Multi-Modal Interplay: Striking a balance between fused and independent modality representations is central—hybrid biquaternion models and fine-grained gating are recent advances, but further research is required to adaptively weight modalities per context (Liu et al., 28 Sep 2025, Zhang et al., 2024).
- Broader Modalities and Continual Learning: Seamless MMKG extension to richer modalities (audio, video, 3D, sensor data), and the continual learning of dynamically evolving real-world graphs (Park et al., 11 Jun 2025).
In sum, multimodal knowledge graphs have emerged as a foundational construct for multimodal machine intelligence, supporting tasks from reasoning and question answering to entity alignment and knowledge completion. Recent methodological advances emphasize robust, fine-grained cross-modal fusion and explicit structural dominance, with storage-efficient and scalable construction pipelines and benchmark datasets accelerating empirical progress (Liu et al., 17 Mar 2025, Li et al., 28 May 2025, Lee et al., 2024, Yi et al., 2024, Liu et al., 28 Sep 2025, Park et al., 11 Jun 2025).