MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

Published 28 Jul 2025 in cs.AI | (2507.20804v1)

Abstract: Retrieval-Augmented Generation (RAG) enhances LLM generation by retrieving relevant information from external knowledge bases. However, conventional RAG methods face the issue of missing multimodal information. Multimodal RAG methods address this by fusing images and text through mapping them into a shared embedding space, but they fail to capture the structure of knowledge and logical chains between modalities. Moreover, they also require large-scale training for specific tasks, resulting in limited generalizing ability. To address these limitations, we propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph (MMKG) in conjunction with text-based KG. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear reasoning paths.

Summary

The paper introduces a novel MMGraphRAG framework that transforms multimodal inputs into an interpretable knowledge graph.
It employs innovative modules like Text2KG, Image2Graph, and Cross-Modal Entity Linking using spectral clustering for precise entity alignment.
Evaluations on DocBench and MMLongBench demonstrate significant improvements in domain adaptability and reasoning while reducing model hallucinations.

Introduction to MMGraphRAG

MMGraphRAG targets the integration of multimodal data into Retrieval-Augmented Generation (RAG) for LLMs. Traditional RAG methods focus predominantly on text-based retrieval, and multimodal RAG methods that map images and text into a shared embedding space fail to capture the structured knowledge and logical connections between modalities. MMGraphRAG addresses these limitations by constructing a Multimodal Knowledge Graph (MMKG) that draws on both text and image data to support reasoning and domain adaptability.

Figure 1: MMGraphRAG Framework Overview. This diagram illustrates the comprehensive workflow of the MMGraphRAG framework, including modules for transforming textual and visual data into a unified MMKG.

Framework and Methodology

The MMGraphRAG framework is structured into three main stages: indexing, retrieval, and generation. The indexing stage involves transforming raw multimodal inputs into a structured MMKG, utilizing both Text2KG and Image2Graph modules to convert textual and visual data into knowledge graphs. The Cross-Modal Knowledge Fusion Module plays a pivotal role in aligning and integrating these graphs into a unified structure through a novel Cross-Modal Entity Linking (CMEL) process, which uses spectral clustering for entity alignment across modalities.

Figure 2: An Example of the Img2Graph Module in Action.

The retrieval stage employs a Hybrid Granularity Retriever that uses the structure of the MMKG to extract relevant entities and their surrounding context. The generation stage then synthesizes the retrieved text and image evidence into a response; the paper reports that this makes reasoning paths easier to follow and reduces hallucinations relative to previous methods.
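Graph-based retrieval of this kind can be pictured with a small sketch. The summary does not specify the retriever's scoring, path ranking, or hop depth, so everything below is an assumption for illustration: the toy graph, the function name `retrieve_context`, seed selection by name matching, and the `max_hops` parameter are all hypothetical and are not the authors' Hybrid Granularity Retriever.

```python
from collections import deque

# Toy MMKG as an adjacency list: node -> list of (relation, neighbor).
# All node names and relations are hypothetical examples.
MMKG = {
    "Hurricane Ian": [("caused", "flooded street"), ("struck", "Florida")],
    "flooded street": [("depicted_in", "image_1")],
    "Florida": [],
    "image_1": [],
}


def retrieve_context(graph, question, max_hops=2):
    """Return reasoning paths that start at entities named in the question."""
    q = question.lower()
    # Seed selection: entities whose name appears in the question. This is a
    # stand-in for whatever entity matching a real retriever would use.
    seeds = [node for node in graph if node.lower() in q]
    paths = []
    queue = deque((seed, [seed]) for seed in seeds)
    while queue:
        node, path = queue.popleft()
        hops = len(path) // 2  # a path alternates node, relation, node, ...
        if hops >= max_hops:
            continue
        for relation, neighbor in graph.get(node, []):
            if neighbor in path[::2]:
                continue  # skip nodes already on this path (no cycles)
            new_path = path + [relation, neighbor]
            paths.append(new_path)
            queue.append((neighbor, new_path))
    return paths


paths = retrieve_context(MMKG, "What damage did Hurricane Ian cause?")
for p in paths:
    print(" -> ".join(p))
```

Each returned path doubles as a trace of how a piece of context was reached, which is the property the framework relies on for interpretability.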

Evaluation and Results

MMGraphRAG demonstrates state-of-the-art performance on the DocBench and MMLongBench datasets, showcasing superior domain adaptability and reasoning capabilities without large-scale retraining. This is achieved by integrating detailed visual scene data with textual information, allowing for comprehensive retrieval and generative processes.

The framework's performance is further validated in complex multimodal document understanding tasks, where it significantly outperforms existing RAG and multimodal RAG baselines. Notably, MMGraphRAG excels in domains with high visual-structural complexity, emphasizing its practical applications in diverse fields such as academia and finance.

Figure 3: Entity Distribution Across Document Domains.

Cross-Modal Fusion and Reasoning

One of the key innovations in MMGraphRAG is its Cross-Modal Entity Linking (CMEL) approach, which is crucial for the accurate alignment of entities across different modalities. The spectral clustering algorithm used in CMEL enhances precision by integrating both semantic and structural information, thereby facilitating more reliable multimodal inference.
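The idea of clustering on both semantic and structural information can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the blending weight `alpha`, the toy embeddings, the 0/1 adjacency weights, and the function name `spectral_groups` are hypothetical. The final grouping step links points that lie within `eps` of each other in the spectral embedding, which is what DBSCAN does when `min_samples=1`. The sketch also assumes every candidate has a positive total weight, so the degree normalization is well defined.

```python
import numpy as np


def spectral_groups(embeddings, adjacency, alpha=0.5, m=2, eps=0.5):
    """Group candidate entities using semantic and structural affinity."""
    X = np.asarray(embeddings, dtype=float)
    A = np.asarray(adjacency, dtype=float)

    # Semantic part: cosine similarity, clipped so weights stay non-negative.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = np.clip(unit @ unit.T, 0.0, None)

    # Blend semantic similarity with structural (relation) weights.
    W = alpha * S + (1.0 - alpha) * A
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Eigenvectors of the m smallest eigenvalues form the spectral embedding;
    # rows are normalized to unit length before grouping.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :m]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Grouping: points within eps of each other (directly or through a chain)
    # share a label, as in DBSCAN with min_samples=1.
    dist = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    labels = [-1] * len(U)
    current = 0
    for start in range(len(U)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(dist[i] <= eps):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(int(j))
        current += 1
    return labels


# Four hypothetical candidates: 0 and 1 are similar and related, as are 2 and 3.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
labels = spectral_groups(embeddings, adjacency)
print(labels)
```

In this toy case the two pairs fall into separate groups. With real data, `m` and `eps` would need tuning.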

The construction of the MMKG supports sophisticated cross-modal reasoning tasks by modeling images as independent nodes. This node-based approach enhances the flexibility and extensibility of the framework, allowing for the seamless integration of additional modalities and leading to more comprehensive and coherent responses.

Conclusion

MMGraphRAG offers a novel approach to integrating multimodal data into LLM generation, achieving significant improvements in reasoning, adaptability, and accuracy. By constructing an interpretable MMKG, MMGraphRAG not only improves the retrieval and generative capabilities of LLMs but also lays the groundwork for future developments in multimodal AI. This work encourages further exploration in cross-modal entity linking and the development of graph-based frameworks for enhanced multimodal reasoning.

Figure 4: Distribution of the CMEL dataset across different domains.

Explain it Like I'm 14

What is this paper about?

This paper introduces MMGraphRAG, a new way for AI to answer questions using both text and images. Think of it like a smart student who, before answering, looks up facts in a well-organized library and also checks pictures for clues—then clearly shows how it found the answer. The key idea is to turn both words and images into a connected “map of knowledge” so the AI can reason across them step by step, instead of guessing.

What questions are the researchers trying to answer?

The paper focuses on three simple questions:

How can we make AI use both text and images together, not just one or the other?
How can we help AI understand the structure of information—who is related to what—so it can reason better and avoid making things up (hallucinations)?
Can we build a system that works well across different topics without special training, and that explains its reasoning?

How did they do it?

To explain the approach, imagine building a city map where words and pictures are buildings, and the roads between them show how they’re connected. The AI uses this “map” to travel along meaningful paths to find answers.

Turning images into graphs (scene graphs)

A scene graph is like labeling a picture with “objects” and “relationships.” For example: “girl holds camera,” “boy stands next to girl.”
The system breaks an image into parts (like cutting a photo into meaningful regions), describes each part in words, picks out the important objects, and connects them with relationships. This turns a picture into a small knowledge graph.
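A scene graph of this kind can be represented as a list of (subject, relation, object) triples. The sketch below is only an illustration: the `Triple` class and `build_scene_graph` function are hypothetical names, and the hand-written triples stand in for the output of the region description and extraction steps, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str


def build_scene_graph(triples):
    """Turn (subject, relation, object) triples into nodes and labeled edges."""
    nodes = set()
    edges = {}
    for t in triples:
        nodes.update((t.subject, t.obj))
        edges.setdefault(t.subject, []).append((t.relation, t.obj))
    return nodes, edges


# Hand-written triples standing in for extracted objects and relationships.
triples = [
    Triple("girl", "holds", "camera"),
    Triple("boy", "stands next to", "girl"),
]
nodes, edges = build_scene_graph(triples)
print(sorted(nodes))
print(edges["girl"])
```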

Linking images and text (cross-modal entity linking)

Problem: The same thing can appear in text and images (e.g., a chart in a report and the words describing it). We need to match them.
Solution: The system matches items from images to items in text by grouping related candidates using a method called spectral clustering.
- Simple analogy: Imagine grouping students not just by how similar they look (appearance), but also by who they hang out with (connections). Spectral clustering uses both “how similar” and “how connected” information to form better groups.
After grouping, an LLM picks the best match between a visual thing (like “flooded street” in an image) and a text thing (“Hurricane Ian damage in Florida”).

Building one big multimodal knowledge graph (MMKG)

After linking, the system fuses the image graph and the text graph into one larger, cleaner knowledge map that covers both types of information. It also enhances leftover image descriptions using related text (e.g., adding “in Florida” to “a flooded neighborhood” if the text says so).
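The fusion step can be sketched as a merge of two adjacency lists driven by the entity links. This is a simplified illustration, not the paper's fusion module: `fuse_graphs`, the `links` mapping, and the example node "damaged house" are hypothetical, and the description enhancement step is left out.

```python
def fuse_graphs(text_edges, image_edges, links):
    """Merge an image graph into a text graph using entity links.

    `links` maps an image entity to the text entity it was aligned with.
    Linked image entities are renamed to their text counterpart; unlinked
    ones keep their own name and stay separate nodes.
    """
    fused = {node: list(neighbors) for node, neighbors in text_edges.items()}
    for src, neighbors in image_edges.items():
        bucket = fused.setdefault(links.get(src, src), [])
        for relation, dst in neighbors:
            edge = (relation, links.get(dst, dst))
            if edge not in bucket:  # avoid duplicating an existing edge
                bucket.append(edge)
    return fused


text_edges = {"Hurricane Ian damage in Florida": [("located in", "Florida")]}
image_edges = {"flooded street": [("next to", "damaged house")]}
links = {"flooded street": "Hurricane Ian damage in Florida"}

fused = fuse_graphs(text_edges, image_edges, links)
for node, neighbors in fused.items():
    print(node, neighbors)
```

After the merge, facts that originated in the image hang off the same node as facts from the text, so a single traversal can reach both.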

Finding and using information (retrieval)

When asked a question, the system follows paths through this graph to pull the most relevant bits—like traveling along roads on the map from one clue to the next.

Generating the final answer

The system first drafts an answer with a text model.
Then it asks a vision-language model to consider both images and text and add multimodal details.
Finally, it merges these into one clear, consistent answer—with a reasoning path that’s easier to trace.
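The three steps above can be sketched as a small function that takes the two models as plain callables. The prompts, the function name `hybrid_generate`, and the stub models are assumptions made for illustration. How the real system phrases its prompts or resolves conflicting outputs is not described in this summary.

```python
def hybrid_generate(question, text_context, images, llm, mllm):
    """Draft with a text model, add image-grounded answers, then consolidate."""
    # Step 1: text-only draft.
    draft = llm(f"Question: {question}\nContext: {text_context}")
    # Step 2: one multimodal answer per retrieved image.
    visual_answers = [mllm(question, image, text_context) for image in images]
    # Step 3: merge the draft and the image-based answers.
    merged_prompt = (
        f"Question: {question}\nDraft: {draft}\n"
        f"Image-based answers: {visual_answers}\n"
        "Merge these into one consistent answer."
    )
    return llm(merged_prompt)


# Stub models so the sketch runs without any external service.
calls = []


def fake_llm(prompt):
    calls.append(("llm", prompt))
    return f"llm-answer-{sum(1 for kind, _ in calls if kind == 'llm')}"


def fake_mllm(question, image, context):
    calls.append(("mllm", image))
    return f"seen {image}"


answer = hybrid_generate(
    "What was damaged?", "Report text", ["img1.png", "img2.png"], fake_llm, fake_mllm
)
print(answer)
```

Passing the models in as arguments keeps the control flow testable and makes it easy to swap in different text or vision-language models.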

What did they find, and why is it important?

MMGraphRAG beat other strong methods on two tough test sets for question answering with documents:
- DocBench (documents across fields like academia, finance, law, government, news)
- MMLongBench (long, mixed-format PDFs with text, tables, charts, figures)
It was especially strong on:
- Questions that need both text and images to answer
- Complex, multi-step reasoning across pages and formats
- Identifying unanswerable questions (so it’s less likely to “make things up”)
It worked well without extra task-specific training, showing good generalization.
It provided clear “reasoning paths,” which makes its answers more interpretable—you can “see how it got there.”

They also created a new dataset and method for cross-modal entity linking (matching image things to text things):

CMEL dataset: A benchmark for testing how well systems match entities across text and images in realistic documents (news, papers, novels).
Spectral clustering method: Improved the candidate matching step by considering both meaning and connections, leading to higher accuracy than other clustering or embedding-only methods.

Why does this matter?

Better use of multimodal information: Many real documents mix text, pictures, tables, and charts. This approach helps AI truly combine them, rather than treating images as an afterthought.
Fewer hallucinations: By following reasoning paths on a knowledge graph and checking both text and images, the system is less likely to produce confident but wrong answers.
Works across domains: Because it doesn’t require heavy retraining for every new topic, it’s useful in areas like finance, law, science, and news.
More transparent AI: Showing the reasoning path makes answers easier to trust and verify.
Helps future research: The new dataset (CMEL) and the linking method give others tools to build better multimodal systems.

In short, MMGraphRAG is like giving AI a well-organized, cross-linked map of both words and pictures, helping it find answers more accurately, explain its thinking, and handle real-world, mixed-format documents with confidence.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues, missing analyses, and open research questions raised by the paper that future work could concretely address:

KG schema and ontology design are unspecified: entity/relation types, normalization rules, and cross-modal schema mapping are not formalized, making replication and extension difficult; evaluate how different schemas impact retrieval and reasoning quality.
Spectral clustering hyperparameters and sensitivity are underexplored: no details on choosing the number of eigenvectors $m$, DBSCAN parameters (eps, min_samples), or stability across document sizes; provide systematic tuning guidelines and robustness analyses.
LLM-determined relation weights in the adjacency matrix lack reproducibility and rigor: quantify variability across prompts/models, study deterministic alternatives (rule-based, learned scalers), and assess their effect on CMEL accuracy.
Candidate set restriction to a local textual window ($j-1$ to $j+1$) may miss long-range or cross-page links: measure recall loss and develop global or hierarchical candidate generation strategies for cross-document alignment.
Visual features are not explicitly leveraged in CMEL (alignment appears to rely on MLLM-generated text descriptions of image regions): compare to methods that incorporate native vision embeddings (e.g., CLIP, SigLIP, ViT features) and hybrid text+vision aligners.
Scene graph quality is not evaluated: benchmark entity/relation extraction against standard datasets (e.g., Visual Genome variants), report precision/recall, and quantify hallucinated “implicit relations” introduced by MLLMs.
YOLO is used for “semantic segmentation,” but YOLO performs detection (bounding boxes), not semantic segmentation: clarify the vision pipeline, evaluate segmentation/detection accuracy on document-specific visuals (tables, charts, diagrams), and test domain-specific detectors.
Retrieval module specifics are missing: define the hybrid-granularity retriever’s scoring, path ranking, maximum hop depth, and k-values; provide ablations to show their contributions to end-task accuracy.
Hybrid generation strategy lacks ablation and conflict-resolution details: quantify the incremental benefit of each stage (LLM draft, MLLM multimodal responses, LLM consolidation), and formalize how contradictory outputs are reconciled.
Interpretability claims are unmeasured: propose metrics for reasoning-path fidelity (path relevance, coverage, minimality), human evaluation protocols, and automatic measures (e.g., path overlap with gold evidence).
Unanswerable-question handling is not mechanized: define an abstention criterion calibrated to MMKG signals (e.g., path evidence sufficiency), measure calibration (ECE/Brier), and analyze failure modes on unanswerable cases.
Scalability and efficiency are unreported: provide indexing and retrieval time/memory profiles, eigen-decomposition complexity on large KGs, cost per document for LLM/MLLM calls, and optimization strategies (caching, batching, model distillation).
Error propagation across the pipeline is unquantified: analyze how detection errors, description generation noise, entity extraction mistakes, and alignment mislinks affect downstream QA; add uncertainty tracking and error-correction mechanisms.
Baseline fairness and comparability need strengthening: GraphRAG was modified (community detection removed), MRAG comparisons used different model sizes across experiments; re-run baselines with matched settings/model sizes and include more MRAGs (VisRAG, MuRAG, ColPali) for comprehensive comparison.
LLM-as-judge evaluation may introduce bias: complement with exact-match metrics where applicable, human assessment, and consensus grading; report inter-rater reliability and sensitivity to grader choice.
CMEL dataset limitations require documentation: small size (1,114 instances), unclear annotation guidelines, inter-annotator agreement, negative examples, and label quality; publish detailed construction protocols, IAA statistics, and license/usage constraints.
Generalization beyond three domains is untested: evaluate on scientific plots, engineering drawings/CAD, medical images, legal forms, and multilingual documents; study cross-lingual CMEL and MMKG construction.
Global image entity alignment and “remaining entity enhancement” steps risk text-to-image leakage/hallucination: quantify how text-derived enrichment changes entity fidelity, and add safeguards (evidence tagging, provenance tracking).
Cross-document MMKG construction and entity resolution are not addressed: design methods for deduplication, cross-document linking, and corpus-level reasoning; evaluate benefits for multi-document QA.
Incremental/streaming updates and temporal reasoning are missing: develop online MMKG maintenance with versioning, time-aware entities/relations, and assess effects on up-to-date QA.
Security and privacy considerations are absent: articulate policies for handling sensitive PDFs, PII redaction, and secure MMKG storage; evaluate attack surfaces (prompt injection via document content).
Theoretical justification for spectral clustering choice is limited: provide formal analysis or bounds on clustering quality given mixed semantic+structural weights, and compare against alternative graph matching/alignment methods (e.g., spectral matching, GNN-based alignment).
Hyperparameter defaults (e.g., k retrieved entities, token limits, chunk sizes) are not justified: include a principled tuning study and practical guidelines for different document types and lengths.
Reproducibility assets are not detailed: release full code, prompts, model versions, preprocessing configs, seeds, and end-to-end pipelines to enable independent replication and extension.
