HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

Published 13 Apr 2025 in cs.CL and cs.AI (arXiv:2504.12330v1)

Abstract: While Retrieval-Augmented Generation (RAG) augments LLMs with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework comprises a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean-luna/HMRAG.

Summary

  • The paper presents a hierarchical multi-agent framework that decomposes complex queries and integrates retrieval from structured, unstructured, and graph-based data.
  • It employs specialized agents for vector, graph, and web-based retrieval, achieving 93.73% accuracy on ScienceQA and 58.55% on CrisisMMD benchmarks.
  • The methodology uses consistency voting and expert model refinement to unify multi-source answers, setting a new benchmark for multimodal RAG systems.

Introduction

The HM-RAG framework addresses critical challenges in Retrieval-Augmented Generation (RAG) systems for processing multimodal data. RAG techniques traditionally augment LLMs with external information, but single-agent systems struggle with complex queries that require cross-modal synthesis. HM-RAG introduces a hierarchical multi-agent approach to improve dynamic synthesis and retrieval across structured, unstructured, and graph-based data, notably outperforming existing systems on the ScienceQA and CrisisMMD benchmarks.

Key Components of HM-RAG

HM-RAG's architecture consists of specialized agents, each optimizing different stages of multimodal data processing:

  1. Decomposition Agent: Utilizes semantic-aware query rewriting and context augmentation to break down complex queries into coherent sub-tasks. This agent ensures queries are parsed in a manner that accommodates diverse data modalities.
  2. Multi-source Retrieval Agents: Implement parallel modality-specific retrieval strategies by employing vector, graph, and web-based databases. This plug-and-play architecture allows comprehensive data handling across different retrieval tasks.
  3. Decision Agent: Integrates multi-source answers using consistency voting and resolves discrepancies through Expert Model Refinement, ensuring high accuracy and coherent final outputs.

    Figure 1: Comparison of (a) single-agent, single-modal RAG and (b) multi-agent multimodal RAG, highlighting the latter's advantage on complex queries.
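The three-tier flow above can be sketched as a minimal pipeline. Everything here is illustrative: the agent interfaces, function names, and the plurality-vote stand-in for consistency voting are assumptions for exposition, not the released HMRAG API.

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    query: str
    modality: str  # "vector", "graph", or "web"

def decompose(query: str) -> list[SubTask]:
    # Decomposition Agent (stub): rewrite a complex query into
    # contextually coherent, modality-specific sub-tasks.
    return [SubTask(query, m) for m in ("vector", "graph", "web")]

def retrieve(task: SubTask) -> str:
    # Multi-source Retrieval Agent (stub): route each sub-task to its
    # plug-and-play backend and return a candidate answer.
    return f"answer via {task.modality} retrieval"

def decide(candidates: list[str]) -> str:
    # Decision Agent (stub): plurality vote as a simplified stand-in
    # for consistency voting plus Expert Model Refinement.
    return max(set(candidates), key=candidates.count)

def hm_rag(query: str) -> str:
    sub_tasks = decompose(query)
    candidates = [retrieve(t) for t in sub_tasks]
    return decide(candidates)
```

The key structural point is that retrieval agents run independently per modality, so new backends can be added without touching the decomposition or decision tiers.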

Methodology

The methodology of HM-RAG involves transforming multimodal data into rich vector and graph formats:

  • Multimodal Knowledge Pre-Processing: Uses visual-to-text conversion with VLMs, integrating refined descriptions into unified multimodal textual knowledge bases and constructing multimodal knowledge graphs (MMKG) for relational understanding.
  • Multi-source Plug-and-Play Retrieval Agents: Engages specialized agents for accessing different databases, facilitating robust multi-source retrieval and synthesis through vector semantic search, graph exploration, and web data extraction.
  • Decision Agent for Multi-answer Refinement: Employs consistency checks and expert model refinement to finalize responses, achieving a 12.95% improvement in accuracy over baseline systems.

    Figure 2: Overview of HM-RAG's three-layered agent architecture.
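The refinement step above can be illustrated with a small voting sketch. It assumes each retrieval agent emits a discrete answer label: a clear majority passes through, while disagreements are escalated to a refinement callback. The `refine` hook is a hypothetical interface standing in for the paper's Expert Model Refinement.

```python
from collections import Counter

def consistency_vote(answers, refine=None):
    # Count votes from the per-modality retrieval agents.
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    if votes > len(answers) / 2:
        return best                # clear majority: accept directly
    if refine is not None:
        return refine(answers)     # disagreement: expert refinement
    return best                    # fallback: plurality winner

print(consistency_vote(["B", "B", "C"]))  # majority -> "B"
print(consistency_vote(["A", "B", "C"], refine=lambda xs: sorted(xs)[0]))
```

In the second call, no answer wins a majority, so the (hypothetical) expert hook resolves the three-way tie.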

Experimentation and Results

HM-RAG was evaluated on the ScienceQA and CrisisMMD datasets, achieving state-of-the-art results in multimodal question answering:

  • ScienceQA Performance: HM-RAG improved average answer accuracy to 93.73%, marking significant advancements over previous leading models like LLaMA-SciTune and GPT-4o, particularly in handling social science inquiries.
  • CrisisMMD Performance: Demonstrated superior multimodal capabilities with 58.55% average accuracy, surpassing leading LLMs and VLMs across crisis classification tasks and efficiently utilizing multimodal data for real-world scenarios.

    Figure 3: Case study contrasting HM-RAG with baseline methods, illustrating its more accurate results.

Implications for Future Research

HM-RAG sets a benchmark in optimizing RAG systems for multimodal contexts, offering a scalable, agent-based approach. The integration capabilities pave the way for RAG systems in complex data settings, ensuring consistency and adaptability across varying applications. Potential future developments include enhancing agent components for deeper contextual reasoning and exploring wider data modality integrations for broader applicability.

Conclusion

HM-RAG innovatively bridges multiple retrieval methods and modalities, providing a comprehensive framework for enhanced multimodal reasoning and information synthesis. The architecture's modularity ensures that it can adapt readily to new data formats and retrieval ecosystems, making it highly relevant for applications across diverse domains. This represents a substantial step forward in addressing the pressing need for coherent multimodal data processing and analysis in RAG systems.
