Multimodal Retrieval Knowledge
- Multimodal Retrieval Knowledge is a framework that formalizes retrieval-augmented generation for complex, unstructured documents by integrating diverse modalities like text, images, tables, and equations.
- It employs a dual approach using a vector index and a modality-aware knowledge graph to semantically index and retrieve interconnected knowledge units.
- The hybrid retrieval methodology enhances cross-modal reasoning, allowing large language models to effectively process and integrate evidence from varied data sources.
Multimodal Retrieval Knowledge (MRK) generalizes the retrieval-augmented generation (RAG) paradigm from purely textual corpora to complex, unstructured documents composed of text, images, tables, mathematical formulas, and other modalities. MRK encompasses the semantic indexing, structured representation, and hybrid retrieval of diverse, interconnected knowledge units, enabling LLMs and multimodal LLMs (MLLMs) to perform effective, context-sensitive generation and reasoning over heterogeneous and interlinked evidence (R et al., 16 Oct 2025).
1. Definition and Scope of Multimodal Retrieval Knowledge
MRK is the formalization and operationalization of retrieval knowledge when relevant support for inference and generation may reside in a range of modalities, not restricted to text. In MRK, knowledge units are not just paragraphs or text spans but include tables, figures, images, equations, charts, and even visual or auditory snippets. Each unit is embedded in a shared vector space, and their mutual dependencies (such as references between a text paragraph and a table, or the relationship between a plotted graph and its accompanying equation) are explicitly encoded in structured relations (R et al., 16 Oct 2025, Mei et al., 26 Mar 2025).
MRK thus consists of two integrated artifacts:
- A vector index cataloguing a wide set of semantically embedded unimodal and cross-modal knowledge units.
- A modality-aware knowledge graph (MAKG) whose nodes correspond to these units and whose edges encode cross-modal, structural, and logical relationships.
The objective is to enable contextually rich, evidence-integrating retrieval and augmentation for multimodal generation and question answering tasks, covering entire classes of unstructured real-world documents (e.g., scientific papers, reports, web pages) (R et al., 16 Oct 2025, Wang et al., 21 Jun 2025, Mei et al., 26 Mar 2025).
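The two artifacts can be sketched as a single data structure: a node store whose entries carry modality metadata and a shared-space embedding (the vector index), plus a typed edge map (the MAKG). The sketch below is a minimal illustration under these assumptions; all class and relation names are hypothetical, not taken from any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    uid: str
    modality: str     # "text", "table", "image", "equation", "chart", ...
    embedding: list   # vector already projected into the shared space

@dataclass
class MAKG:
    nodes: dict = field(default_factory=dict)  # uid -> KnowledgeUnit (the vector index)
    edges: dict = field(default_factory=dict)  # (src_uid, dst_uid) -> typed relation

    def add_node(self, unit: KnowledgeUnit) -> None:
        self.nodes[unit.uid] = unit

    def add_edge(self, src: str, dst: str, relation: str) -> None:
        # e.g. a paragraph referencing a table gets a HAS-TABLE edge
        self.edges[(src, dst)] = relation

kg = MAKG()
kg.add_node(KnowledgeUnit("p1", "text", [0.1, 0.9]))
kg.add_node(KnowledgeUnit("t1", "table", [0.2, 0.8]))
kg.add_edge("p1", "t1", "HAS-TABLE")
```

Dense retrieval scans `nodes` by embedding similarity, while graph retrieval traverses `edges`; both operate over the same units.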
2. Core Principles and Formal Representations
Formally, the MAKG is denoted as a directed graph $G = (V, E)$, where:
- Each node $v_i \in V$ is a semantically coherent chunk: text paragraph, table, image, equation, chart, etc.
- Each edge $e_{ij} \in E$ is a typed relation (e.g., NEXT-TEXT, HAS-TABLE, NEXT-FORMULA) linking $v_i$ and $v_j$.
- Node embeddings $h_i = W_{m(i)} f_{m(i)}(v_i)$ are produced via modality-specific encoders $f_m$ and projected into a shared space using learned projection matrices $W_m$ to enforce semantic alignment across modalities.
- Edge weights $w_{ij} = \sigma\left(h_i^{\top} M_r h_j\right)$ use a relation-specific bilinear map $M_r$ and nonlinearity $\sigma$ to represent connection strength.
This dual representation allows for the synthesis of both dense semantic similarity (cross-modal vector retrieval) and explicit contextual structure (graph-based propagation), facilitating robust cross-modal reasoning and retrieval (R et al., 16 Oct 2025, Wang et al., 21 Jun 2025, Hsiao et al., 26 Nov 2025).
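The projection and edge-weight computations above can be made concrete in a few lines. This is a minimal numeric sketch, assuming tiny 2-D embeddings, identity matrices for the learned parameters $W_m$ and $M_r$, and a sigmoid nonlinearity; all helper names are illustrative.

```python
import math

def project(raw_vec, W):
    """Project a modality-specific embedding into the shared space: h = W x."""
    return [sum(w * x for w, x in zip(row, raw_vec)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def edge_weight(h_i, h_j, M_r):
    """Relation-specific bilinear connection strength: sigma(h_i^T M_r h_j)."""
    Mh = [sum(m * h for m, h in zip(row, h_j)) for row in M_r]
    return sigmoid(sum(a * b for a, b in zip(h_i, Mh)))

# Toy example: identity stands in for a learned text projection matrix
W_text = [[1.0, 0.0], [0.0, 1.0]]
h_text = project([0.6, 0.8], W_text)

# Identity stands in for a learned HAS-TABLE bilinear map
M_has_table = [[1.0, 0.0], [0.0, 1.0]]
w = edge_weight(h_text, [0.6, 0.8], M_has_table)  # value in (0, 1)
```

In a trained system $W_m$ and $M_r$ are learned so that semantically related cross-modal chunks receive high edge weights.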
3. Retrieval Algorithms and Hybrid Pipelines
MRK systems employ hybrid retrieval pipelines that combine dense vector search with knowledge graph traversal. The canonical protocol, as implemented in MAHA, proceeds as follows (R et al., 16 Oct 2025):
- Dense retrieval: Compute an embedding $e_q$ for the multimodal query $q$ and score all knowledge units via cosine similarity $S_{\text{dense}}(q, v) = \cos(e_q, e_v)$ in the common embedding space.
- Graph-based retrieval: Identify seed nodes (those with high dense similarity) and expand via $k$-hop graph traversal, accumulating path scores via edge weights and selecting nodes with maximal reachability, yielding $S_{\text{graph}}(q, v)$.
- Fusion: Combine the dense and graph scores as $S_{\text{comb}}(q, v) = \lambda S_{\text{dense}}(q, v) + (1 - \lambda) S_{\text{graph}}(q, v)$, tuning $\lambda$ empirically.
- LLM/MLLM integration: Pass the top-$K$ retrieved units (with modality metadata and graph context) as context for generative inference.
Alternative frameworks extend the retrieval stage with multi-agent pipelines (Wang et al., 21 Jun 2025), multi-granularity alignment (Yang et al., 10 May 2025, Xu et al., 1 May 2025), generative clue-based strategies (Long et al., 2024), or reinforcement learning (for dynamic processing and filtering) (Hong et al., 16 Oct 2025). For highly structured domains, multi-hop retrieval over multimodal KGs enables deep compositional reasoning and evidence aggregation (Park et al., 23 Dec 2025, Hsiao et al., 26 Nov 2025).
Example pipeline (MAHA pseudocode, abridged):
```
# Offline: index construction
for d in Documents:
    chunks = segment_by_modality(d)
    for c in chunks:
        raw = extract_representation(c)
        e_c = encode_and_project(raw, modality(c))
        add_node(e_c, modality(c))
        # add schema-driven edges between related chunks

# Online: hybrid retrieval for query q
e_q = encode_and_project(q, 'text')
R_dense = top_L_by_cosine(e_q, [e_v for v in G])
seed_nodes = {v for v in G if S_dense(q, v) > τ}
paths = k_hop_expansion(seed_nodes, G)
for v in G:
    S_comb(q, v) = λ * S_dense(q, v) + (1 - λ) * S_graph(q, v)
return top_K_by_S_comb
```
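The fusion step of the pipeline can be made runnable in isolation. The sketch below computes cosine similarities against a toy set of unit embeddings and blends them with precomputed graph scores; the function names, the example embeddings, and the choice of $\lambda = 0.6$ are illustrative assumptions, not values from the cited work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hybrid_scores(q_emb, units, graph_scores, lam=0.6):
    """Fuse dense similarity with graph scores: S_comb = lam*S_dense + (1-lam)*S_graph.
    units: {uid: embedding}; graph_scores: {uid: S_graph in [0, 1]}."""
    return {
        uid: lam * cosine(q_emb, emb) + (1 - lam) * graph_scores.get(uid, 0.0)
        for uid, emb in units.items()
    }

units = {"p1": [1.0, 0.0], "t1": [0.0, 1.0]}   # toy shared-space embeddings
graph = {"p1": 0.2, "t1": 0.9}                 # toy k-hop reachability scores
scores = hybrid_scores([1.0, 0.0], units, graph, lam=0.6)
top = max(scores, key=scores.get)              # "p1": 0.6*1.0 + 0.4*0.2 = 0.68
```

Even though "t1" has the stronger graph score, the dense term dominates here; shifting $\lambda$ downward would favor structurally connected units instead.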
4. Cross-Modal Embedding and Alignment Strategies
MRK critically depends on aligning semantic content across modalities into a shared representational space. This is achieved by:
- Learning modality-specific encoders and projections to a unified embedding dimension, such that, for instance, a text description of a graph is proximate to the embedding of the graphical chunk itself (R et al., 16 Oct 2025).
- Cross-modal pretraining losses (contrastive, inverse cloze, etc.) that tie together heterogeneous signal spaces (Luo et al., 2023, Hsiao et al., 26 Nov 2025).
- Incorporation of explicit schema-driven cross-modal edges to encode relations such as text referencing a table, or an equation formalizing a plotted variable.
- Object- and layout-aware modules for spatially complex or visually rich documents, including hierarchical region encoding and layout-based positional features (Xu et al., 1 May 2025).
- Constraints or penalties during generative clue production to ensure identifier uniqueness and retrieval coverage (Long et al., 2024).
These design choices ensure both robust semantic retrieval and the ability to traverse and aggregate evidence across chain-of-reference structures and composite reasoning paths (Hsiao et al., 26 Nov 2025, Wang et al., 21 Jun 2025).
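The contrastive pretraining losses mentioned above can be illustrated with a minimal InfoNCE-style computation for one anchor: the matched cross-modal pair should receive a lower loss than a mismatched one. This is a generic sketch of the contrastive objective, not the specific loss of any cited system; the similarity values and temperature are made up for illustration.

```python
import math

def info_nce(sims, pos_idx, temperature=0.07):
    """Contrastive loss for one anchor. sims[i] is the anchor's similarity to
    candidate i; pos_idx marks the true cross-modal match."""
    logits = [s / temperature for s in sims]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[pos_idx] / sum(exps))

# Anchor: text embedding of a chart caption; candidates: three image embeddings.
loss_good = info_nce([0.9, 0.1, 0.0], pos_idx=0)  # matched pair scores highest
loss_bad  = info_nce([0.1, 0.9, 0.0], pos_idx=0)  # a distractor scores highest
```

Minimizing this loss pulls matched text–image (or text–table) pairs together in the shared space while pushing mismatched pairs apart, which is what makes cross-modal dense retrieval meaningful.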
5. Applications and Benchmarks
MRK is foundational for multimodal question answering, retrieval-augmented generation over academic and enterprise documents, knowledge-based visual question answering (KB-VQA), news image captioning, and knowledge graph completion. Notable applications include:
- Multimodal QA over unstructured data: MAHA achieves recall@3 = 0.81 and ROUGE-L = 0.486 with full modality coverage on MRAMG-Bench (R et al., 16 Oct 2025).
- Domain-specific reasoning: Multi-agent graph retrieval in MH-MMKG leads to 79.1% accuracy and 68.7% Precision@5 on a specialized visual game benchmark (Wang et al., 21 Jun 2025).
- Deep document understanding: Multi-granularity hierarchical frameworks and knowledge-guided RAG variants consistently outperform unimodal or pipeline approaches across InfoSeek, E-VQA, and global document QA (Yang et al., 10 May 2025, Xu et al., 1 May 2025, Hsiao et al., 26 Nov 2025).
- Multilingual, multi-task settings: Unified text-image-NLU frameworks achieve 93–95% R@10 across 12 languages for text and image retrieval (Zhang et al., 21 Jan 2026).
Representative datasets and metrics:
| Task | Metrics | Systems |
|---|---|---|
| KB-VQA (E-VQA, InfoSeek) | BEM, R@k, accuracy | MAHA, OMGM, MMKB-RAG |
| Multimodal QA | ROUGE-L, R@k | MAHA, OMGM |
| Multimodal KGC | MRR, Hits@k | CMR |
| Document QA | R@k, accuracy | MegaRAG |
| News captioning | CIDEr, NER F1 | MERGE |
(R et al., 16 Oct 2025, Wang et al., 21 Jun 2025, Yang et al., 10 May 2025, Hsiao et al., 26 Nov 2025, You et al., 26 Nov 2025, Zhao et al., 2024)
6. Empirical Findings and Comparative Results
MRK-based systems consistently surpass unimodal and staged cross-modal pipelines, particularly in complex, cross-referential question answering, image understanding with named entities, and document-level retrieval/generation. Examples include:
- MAHA’s hybrid retrieval outperforms vector-only and graph-only baselines by +0.12–0.20 absolute in Recall@3, MRR, and ROUGE-L (R et al., 16 Oct 2025).
- Multi-agent graph strategies yield +14.8 percentage points in both accuracy and precision@5 over single-agent or BM25 RAG baselines in domain benchmarks (Wang et al., 21 Jun 2025).
- Coarse-to-fine granular pipelines (OMGM) achieve R@1 of 64.0 (from 52.6) and R@5 of 80.8 (from 73.9) by integrating cross-modal reranking (Yang et al., 10 May 2025).
- Knowledge-based dynamic filtering and tag-based joint selection further increase robustness to irrelevant or noisy retrievals (Ling et al., 14 Apr 2025).
- Multimodal KG-based approaches (MegaRAG) yield up to 64.85% local QA accuracy, compared to <28% for document-chunked RAG (Hsiao et al., 26 Nov 2025).
(R et al., 16 Oct 2025, Wang et al., 21 Jun 2025, Yang et al., 10 May 2025, Hsiao et al., 26 Nov 2025, Ling et al., 14 Apr 2025)
7. Limitations and Research Directions
Despite their strengths, current MRK systems exhibit several open limitations:
- Modality alignment: Shared embedding spaces may miss subtle, high-level cross-modal semantics, especially in domains with divergent image–text or structure–content relations (R et al., 16 Oct 2025).
- Graph construction heuristics: Schema-driven KG extraction relies on careful rule or prompt design, and may not generalize to arbitrary documents (Hsiao et al., 26 Nov 2025).
- Scalability: Real-time, dynamic KG construction and hybrid retrieval over large-scale, dynamically updated corpora present substantial compute and systems challenges (Mei et al., 26 Mar 2025).
- Retrieval coverage vs. specificity: Ensuring both full modality coverage and minimal irrelevant content demands advanced re-ranking, filtering, and evidence aggregation techniques (Hong et al., 16 Oct 2025, Ling et al., 14 Apr 2025).
- End-to-end optimization: Most pipelines are only partially differentiable; learning retrieval and reasoning together remains challenging, especially for deep multi-hop and compositional queries (Park et al., 23 Dec 2025, Wang et al., 21 Jun 2025).
Open research areas include end-to-end trainable MRK, multi-hop cross-modal reasoning, continual KG expansion, real-time low-latency architectures, more expressive fusion and generation protocols, and robust evaluation standards for cross-modal grounding and evidence integration (R et al., 16 Oct 2025, Park et al., 23 Dec 2025, Mei et al., 26 Mar 2025, Hsiao et al., 26 Nov 2025, Zhang et al., 21 Jan 2026).
Key Papers Referenced:
- "Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval" (R et al., 16 Oct 2025)
- "Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown" (Wang et al., 21 Jun 2025)
- "OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval" (Yang et al., 10 May 2025)
- "A Multi-Granularity Retrieval Framework for Visually-Rich Documents" (Xu et al., 1 May 2025)
- "A Survey of Multimodal Retrieval-Augmented Generation" (Mei et al., 26 Mar 2025)
- "MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation" (Hsiao et al., 26 Nov 2025)
- "Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering" (Hong et al., 16 Oct 2025)
- "Cross-modal Retrieval for Knowledge-based Visual Question Answering" (Lerner et al., 2024)
- "End-to-end Knowledge Retrieval with Multi-modal Queries" (Luo et al., 2023)
- "Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning" (You et al., 26 Nov 2025)
- "Unified Multimodal and Multilingual Retrieval via Multi-Task Learning with NLU Integration" (Zhang et al., 21 Jan 2026)