Multimodal KG Construction
- Multimodal KG Construction is the process of integrating structured symbolic data with images, audio, and video to enable rich semantic alignment and cross-modal reasoning.
- It employs advanced pipelines including data ingestion, state-of-the-art encoders, entity-relation extraction, and rigorous cross-modal verification to build a unified knowledge base.
- The construction approach improves AI tasks such as visual question answering and retrieval through scalable engineering, efficient pruning, and detailed evaluation metrics.
A multimodal knowledge graph (MMKG) is a data structure that integrates symbolic, structured representations of entities and relations with non-symbolic, perceptual modalities such as images, audio, and video. Multimodal KG construction refers to the systematic process of building such graphs from heterogeneous, cross-modal corpora, resulting in a unified knowledge base that enables cross-modal reasoning, grounding, and retrieval for advanced AI systems. Compared to unimodal (text-based) KGs, MMKGs aim to support richer semantic alignment between modalities, improve robustness to hallucinations in LLMs and MLLMs, and enable zero-shot or few-shot transfer in complex reasoning tasks (Liu et al., 17 Mar 2025, Bian, 23 Oct 2025, Chen et al., 2024).
1. Formalization, Motivation, and Core Challenges
An MMKG is defined as a tuple $\mathcal{G} = \{\mathcal{E}, \mathcal{R}, \mathcal{A}, \mathcal{V}\}$, where $\mathcal{E}$ is the set of entities, $\mathcal{R}$ the relations, $\mathcal{A}$ the attributes, and $\mathcal{V}$ the value sets (the latter containing raw modal items such as images and audio). There are two canonical schemas: type-A MMKGs, allowing multimodal items only as attribute values, and type-N MMKGs, treating such items as first-class nodes with full relational connectivity (Chen et al., 2024, Zhu et al., 2022).
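The type-A/type-N distinction can be made concrete with a minimal Python sketch. The class and field names below are hypothetical illustrations, not from any cited framework: a type-A graph stores a modal item only as an attribute value, while a type-N graph promotes it to a node that can participate in relation triples.

```python
from dataclasses import dataclass, field

@dataclass
class TypeAMMKG:
    entities: set = field(default_factory=set)      # E
    relations: set = field(default_factory=set)     # R
    triples: set = field(default_factory=set)       # (h, r, t) over E x R x E
    attributes: dict = field(default_factory=dict)  # (entity, attr) -> modal value in V

    def add_image(self, entity, path):
        # Modal item stored as an attribute value only (type-A schema).
        self.attributes[(entity, "hasImage")] = path

@dataclass
class TypeNMMKG:
    nodes: set = field(default_factory=set)         # entities AND modal items
    triples: set = field(default_factory=set)

    def add_image(self, entity, path):
        # Modal item becomes a first-class node with relational connectivity (type-N schema).
        self.nodes.add(path)
        self.triples.add((entity, "hasImage", path))

kg_a = TypeAMMKG()
kg_a.entities.add("Eiffel_Tower")
kg_a.add_image("Eiffel_Tower", "img/eiffel_001.jpg")

kg_n = TypeNMMKG()
kg_n.nodes.add("Eiffel_Tower")
kg_n.add_image("Eiffel_Tower", "img/eiffel_001.jpg")
```

In the type-N variant the image pointer can itself be the head or tail of further triples (e.g., linking it to a caption node), which is exactly the "full relational connectivity" property.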
Motivation arises from the limitations of unimodal KGs—namely, restricted semantic coverage, inability to ground perceptual signals, and poor support for multi-hop, cross-modal inference. MMKGs facilitate enhanced reasoning in LLM-empowered tasks, such as visual question answering, analogical reasoning, and hallucination suppression (Liu et al., 17 Mar 2025, Bian, 23 Oct 2025, Park et al., 11 Jun 2025). Construction is hindered by alignment noise, modality heterogeneity, large-scale engineering requirements, and the challenge of designing schemas that flexibly cover symbolic and perceptual data (Bian, 23 Oct 2025, Liu et al., 17 Mar 2025).
2. Construction Pipelines: Canonical Workflows and Paradigms
Construction of MMKGs typically follows either a "symbol→vision" paradigm (grounding KG entities/relations to images or other modalities) or a "vision→symbol" workflow (extracting entities, relations, and events from raw perceptual data and aligning them with symbolic identifiers). Canonical steps include:
- Data Ingestion: Multimodal corpora are collected from sources such as text KGs (DBpedia, Wikidata), raw images (web, Wikimedia), audio/video (YouTube), or domain-specific datasets (MIMIC-CXR for medical, VirtualHome for event-centric video) (Zhang et al., 2023, Park et al., 11 Jun 2025, Egami et al., 2024, Wang et al., 22 May 2025).
- Preprocessing: Each modality is preprocessed with state-of-the-art encoders—CNNs/VLMs (e.g., CLIP, ViT) for images, BERT-style PLMs for text, CLAP for audio, with additional metadata extraction (e.g., captions, alt-text, OCR) (Liu et al., 17 Mar 2025, Park et al., 11 Jun 2025).
- Entity and Relation Extraction: For symbol→vision, image search and filtering are performed for each entity-aspect or triple, scored via text/image similarity or more advanced AIR models (Zhang et al., 2023). For vision→symbol, cascaded VLM pipelines (e.g., VaLiK) generate detailed, image-specific natural-language descriptions from visual input, which are then parsed by LLM-powered entity/relation extractors (Liu et al., 17 Mar 2025). Hybrid pipelines (e.g., VAT-KG, M³KG) combine LLM rewriters, extractors, and normalizers in orchestrated multi-agent workflows (Park et al., 23 Dec 2025, Park et al., 11 Jun 2025).
- Cross-Modal Alignment and Verification: Semantic consistency is enforced via cosine similarity filtering between visual and textual features, either at the window/segment level (VaLiK) or triplet level (VAT-KG, M³KG). Topic- and aspect-specific matching handles fine-grained alignment (Liu et al., 17 Mar 2025, Zhang et al., 2023).
- Graph Assembly: Extracted (normalized) entities, relations, and linkages to modal artifacts (image, audio, video, descriptions) are assembled into the MMKG representation schema, which can be concept-centric, aspect-centric, or event-centric depending on the use case (Park et al., 11 Jun 2025, Egami et al., 2024).
- Pruning, Filtering, and Distillation: Quality and storage efficiency are achieved via similarity-verification thresholds, graph distillation (e.g., redundant subgraph removal), neighbor-aware filtering (NaF), semantic pruning, or more advanced selective pruning pipelines (GRASP) (Liu et al., 17 Mar 2025, Park et al., 23 Dec 2025, Wang et al., 22 May 2025, Yang et al., 22 Aug 2025).
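The cross-modal verification and pruning steps above can be sketched as a cosine-similarity filter over description segments. This is a toy illustration with hand-made 3-d vectors and an arbitrary threshold; pipelines such as VaLiK would use CLIP-style encoder outputs and tuned thresholds instead.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def verify_segments(image_emb, segment_embs, threshold=0.5):
    """Keep only description segments whose embedding is sufficiently
    aligned with the image embedding (similarity-verification filtering)."""
    return [i for i, seg in enumerate(segment_embs)
            if cosine(image_emb, seg) >= threshold]

# Toy 3-d embeddings; real systems embed with pretrained VLM encoders.
img = np.array([1.0, 0.0, 0.0])
segs = [np.array([0.9, 0.1, 0.0]),   # well aligned  -> kept
        np.array([0.0, 1.0, 0.0])]   # orthogonal    -> pruned
kept = verify_segments(img, segs, threshold=0.5)  # -> [0]
```

The fixed threshold here is exactly the kind of hard cut that Section 7 flags as a limitation: segments just below it are discarded regardless of their informational value.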
3. Technical Mechanisms: Alignment, Verification, and Representation Learning
Leading MMKG construction frameworks leverage a variety of cross-modal alignment, verification, and learning mechanisms:
| Mechanism | Pipeline Example | Mathematical Keypoints/Description |
|---|---|---|
| VLM Cascades & Prompting | VaLiK (Liu et al., 17 Mar 2025), MR-MKG (Lee et al., 2024) | Cascaded VLM captioning + LLM prompt parsing |
| Cross-Modal Consistency Verification | VaLiK, VAT-KG | Cosine similarity over sliding windows; similarity-verification (SV) filtering |
| Contrastive Multimodal Embedding | KG-MRI (Bian, 23 Oct 2025), OpenBG (Deng et al., 2022) | InfoNCE loss |
| Entity/Aspect-Image Matching | AspectMMKG/AIR (Zhang et al., 2023) | Co-attention in (text, image) towers; max-margin ranking for retrieval |
| Graph Attention/Relational Encoders | MR-MKG (Lee et al., 2024), SNAG (Chen et al., 2024) | Relation-aware GAT aggregation, e.g. $h_i' = \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W_r h_j\big)$ |
Advanced pipelines interleave these steps with modular, LLM-powered agents, enabling context-enrichment, sense normalization, candidate selection, and self-reflective revision (Park et al., 23 Dec 2025, Liu et al., 17 Mar 2025).
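The contrastive-embedding mechanism from the table (InfoNCE, as used by frameworks like KG-MRI and OpenBG) can be sketched in NumPy. The batch size, temperature, and random embeddings below are illustrative assumptions, not values from any cited system; matching (text, image) pairs sit on the diagonal of the similarity matrix, and the loss is cross-entropy in both retrieval directions.

```python
import numpy as np

def info_nce(text_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, image) embeddings."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # (B, B); true pairs on the diagonal
    labels = np.arange(len(t))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)                      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))   # log-softmax per row
        return -logp[labels, labels].mean()

    # Average the text->image and image->text directions.
    return (xent(logits) + xent(logits.T)) / 2.0

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned_loss = info_nce(emb, emb)         # perfectly aligned pairs -> low loss
shuffled_loss = info_nce(emb, emb[::-1])  # mismatched pairs -> high loss
```

Minimizing this loss pulls matching cross-modal pairs together and pushes mismatched pairs apart, which is what enables the cross-modal fact alignment described above.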
4. Scalability, Storage, and Engineering
Billion-scale deployment necessitates distributed storage (HDFS, distributed triple stores), scalable feature extraction (GPU clusters), and annotation workflows (API crawl, OCR, mass image search) (Deng et al., 2022). Storage optimizations leverage symbolic representation of triplets (only pointers to multimedia), distilled subgraphs (graph pruning), and deduplication via semantic or hash-based filtering. For example, distilled VaLiK MMKGs require only 489 MB compared to 739 MB for traditional Visual Genome graphs (Liu et al., 17 Mar 2025), and VHAKG achieves >5x triple reduction via temporal and view deduplication (Egami et al., 2024). High-quality curation for entity/aspect matching employs human-in-the-loop spot-checks (typically ~1-2%) and parallelizes pipelines over GPU clusters or distributed job arrays (Zhang et al., 2023, Deng et al., 2022).
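The hash-based deduplication mentioned above can be sketched as follows. The function and record names are hypothetical; a production system would hash media files streamed from distributed storage rather than in-memory bytes, but the idea is the same: since the graph stores only pointers to multimedia, byte-identical payloads collapse to one canonical pointer.

```python
import hashlib

def dedup_modal_pointers(records):
    """Map each media pointer to a canonical pointer, merging
    byte-identical payloads via their content hash."""
    seen = {}       # content hash -> canonical pointer
    canonical = {}  # original pointer -> canonical pointer
    for pointer, payload in records:
        h = hashlib.sha256(payload).hexdigest()
        seen.setdefault(h, pointer)   # first pointer with this hash wins
        canonical[pointer] = seen[h]
    return canonical

recs = [("img/a.jpg", b"\x89PNGdata"),
        ("img/b.jpg", b"\x89PNGdata"),   # byte-identical duplicate of a.jpg
        ("img/c.jpg", b"otherdata")]
mapping = dedup_modal_pointers(recs)
# mapping["img/b.jpg"] -> "img/a.jpg"; "img/c.jpg" stays canonical
```

Semantic (embedding-based) deduplication follows the same pattern but replaces the exact hash with a nearest-neighbor lookup under a similarity threshold.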
5. Evaluation Metrics, Datasets, and Benchmarks
Evaluation of MMKG construction spans multiple modalities, tasks, and evidence layers:
- Intrinsic Graph Metrics: Mean Reciprocal Rank (MRR), Hits@K for link prediction and KG completion, multi-hop diameter, entity degree distribution (Chen et al., 2024, Chen et al., 2023).
- Cross-modal Alignment Quality: Image-text recall (Recall@1, Recall@5, MRR), FID, CLIPscore (automatic), and human evaluative criteria (image quality, context correlation) (Zhang et al., 2023, Xu et al., 18 Apr 2025).
- Downstream QA/Reasoning: Model-as-Judge scoring for RAG pipelines, VQA metrics (BLEU, ROUGE, METEOR), answer faithfulness, and contextual precision (Park et al., 11 Jun 2025, Egami et al., 2024, Park et al., 23 Dec 2025, Yang et al., 22 Aug 2025, Lee et al., 2024).
- Benchmarks: CrisisMMD (classification), ScienceQA (multimodal QA), AudioCaps-QA, VCGPT, VALOR, Medical VQA (PathVQA, VQA-RAD), OpenBG-IMG, DB15K, MKG-Y, and domain-specific RAG datasets (Liu et al., 17 Mar 2025, Park et al., 11 Jun 2025, Wang et al., 22 May 2025).
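The intrinsic link-prediction metrics above (MRR, Hits@K) are straightforward to compute once each query's gold-entity rank is known; the ranks below are toy values, with rank 1 meaning the true entity was scored highest.

```python
def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hits@K from a list of gold-entity ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

ranks = [1, 3, 12, 2]                    # toy gold-tail ranks per query
mrr, hits10 = mrr_and_hits(ranks, k=10)
# mrr = (1 + 1/3 + 1/12 + 1/2) / 4 = 23/48; hits10 = 3/4
```

In MMKG evaluation these are typically reported per relation direction (head and tail prediction) and often under a "filtered" protocol that removes other known true triples from the candidate ranking.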
Empirical results consistently demonstrate that MMKG-enhanced LLMs and multimodal RAG frameworks outperform both unimodal and unimodal-KG-augmented baselines. For instance, VaLiK yields a +4.9% average accuracy gain on CrisisMMD and +6.4% on ScienceQA in text+image mode, outstripping pure LLM, VLM, and previous KG-enhanced methods (Liu et al., 17 Mar 2025). VAT-KG and M³KG produce statistically significant improvements in retrieval-augmented QA (e.g., +2.8 points on VALOR) and faithfulness/relevancy for domain-specific RAG (Park et al., 11 Jun 2025, Park et al., 23 Dec 2025, Yang et al., 22 Aug 2025).
6. Specialized Paradigms, Domains, and Adaptations
MMKG construction is adapted to a variety of domains and specialized paradigms:
- Aspect-aware: AspectMMKG provides multi-aspect grounding via careful sentence extraction, aspect taxonomy design, and image retrieval/filtering per aspect (Zhang et al., 2023).
- Medical: MEDMKG links clinical UMLS concepts to radiology images with two-stage concept extraction and context-aware filtering, benchmarking on medical VQA and retrieval tasks (Wang et al., 22 May 2025).
- Business-Scale: OpenBG operates at billion-scale with business-driven SKOS concepts, distributed engineering, and contrastive learning for cross-modal fact alignment (Deng et al., 2022).
- Event-centric/Frame-level: VHAKG encodes multi-view, event-centric video knowledge graphs down to the temporal (frame) and spatial (bounding box) level for benchmarking vision-LLMs (Egami et al., 2024).
- Dynamic and Continual: Frameworks such as MSPT address continual MMKG construction, harmonizing plasticity and stability through attention distillation and memory replay to avoid catastrophic forgetting (Chen et al., 2023).
Further, domain-specific RAG frameworks (DSRAG) tightly integrate MMKGs derived from technical documents with hybrid retrieval and semantic pruning to guide LLMs toward precise and relevant answers in expert contexts (Yang et al., 22 Aug 2025).
7. Limitations, Outstanding Problems, and Open Research Directions
Persistent challenges in MMKG construction include robust cross-modal alignment (especially with noisy web data), scalable quality control of semantic–visual linkage, efficient large-scale representation learning, ontology design for multi-aspect and multi-level concepts, and modality imbalance. Over-pruning during verification stages (e.g., under fixed similarity thresholds) may suppress succinct but vital information, and frozen-module pipelines limit end-to-end learning (Liu et al., 17 Mar 2025, Bian, 23 Oct 2025).
Open research directions include:
- Learnable cross-modal metrics (beyond cosine) and dynamic thresholding for verification (Liu et al., 17 Mar 2025)
- Unified ontology induction across modalities (Ontogenia, AutoSchemaKG) (Bian, 23 Oct 2025)
- Extension to additional modalities (e.g., audio, video, 3D), segmentation-level grounding (Park et al., 11 Jun 2025, Chen et al., 2024)
- Continual MMKG construction under streaming and evolving schemas (Chen et al., 2023, Bian, 23 Oct 2025)
- Efficient, large-scale pre-training and evaluation for industry-grade MMKGs (Deng et al., 2022)
- Hybrid symbolic–neural reasoning systems that exploit explicit KG subgraphs alongside end-to-end VLM/LLM attention (Liu et al., 17 Mar 2025, Chen et al., 2024)
The field is rapidly converging on modular, highly automated MMKG construction architectures that combine cascaded expert models (VLMs, LLMs) with advanced cross-modal alignment and verification pipelines, underpinned by robust engineering and scalable storage, while continuing to address the core bottlenecks around semantic grounding, schema integration, and evaluation at scale (Liu et al., 17 Mar 2025, Bian, 23 Oct 2025, Park et al., 11 Jun 2025, Chen et al., 2023).