ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering

Published 30 May 2025 in cs.CL | (2506.00232v1)

Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. Each module, such as Question Decomposition, Query Rewriting, Retrieval Decision, and Answer Verification, acts as a parameterized transformation on structured inputs/outputs, allowing independent implementation, upgrade, and analysis. To enhance robustness against errors in multi-step reasoning, ComposeRAG incorporates a self-reflection mechanism that iteratively revisits and refines earlier steps upon verification failure. Evaluated on four challenging multi-hop QA benchmarks, ComposeRAG consistently outperforms strong baselines in both accuracy and grounding fidelity. Specifically, it achieves up to a 15% accuracy improvement over fine-tuning-based methods and up to a 5% gain over reasoning-specialized pipelines under identical retrieval conditions. Crucially, ComposeRAG significantly enhances grounding: its verification-first design reduces ungrounded answers by over 10% in low-quality retrieval settings, and by approximately 3% even with strong corpora. Comprehensive ablation studies validate the modular architecture, demonstrating distinct and additive contributions from each component. These findings underscore ComposeRAG's capacity to deliver flexible, transparent, scalable, and high-performing multi-hop reasoning with improved grounding and interpretability.


Summary

The paper introduces a modular RAG framework that decomposes multi-hop question answering into independently upgradable modules.
It combines a self-reflection mechanism with targeted modules such as retrieval decision, query rewriting, and answer verification, improving accuracy by up to 15% over fine-tuning-based baselines.
The system improves over strong baselines on four multi-hop QA benchmarks, while offering transparent error analysis and straightforward module upgrades.

ComposeRAG: A Modular Composable Framework for Multi-Hop Corpus-Grounded Question Answering

ComposeRAG introduces a modular abstraction for Retrieval-Augmented Generation (RAG) pipelines tailored to complex, corpus-grounded multi-hop question answering. This approach explicitly addresses dominant limitations in existing RAG architectures, namely their monolithic design, lack of interpretability, and rigidity in system analysis and improvement. ComposeRAG offers a clear decomposition of reasoning functions into parameterized, independently upgradable modules, providing a high degree of transparency and facilitating targeted improvements, especially for multi-step reasoning tasks.

Main Contributions and Architecture

ComposeRAG's core innovation lies in its compositional design. The system is constructed from atomic modules, each performing a fundamental transformation on structured inputs/outputs within the multi-hop reasoning process. The principal modules are:

Question Decomposition (QD): Breaks down complex questions into a sequence of interdependent sub-questions, facilitating linear multi-hop reasoning.
Question Construction (QC): Resolves sub-question dependencies by explicitly filling in placeholders with responses from previous steps, thus ensuring all sub-questions are self-contained and contextually well-specified.
Retrieval Decision: Determines, for each (sub-)question, whether external evidence is required or if sufficient context is present, an efficiency-oriented intervention reducing unnecessary retrievals.
Query Rewriting (QR): Refines under-specified sub-questions for improved retrieval efficacy, particularly when sub-questions inherit ambiguity from the decomposition step.
Passage Reranking (PR): Reorders retrieved evidence passages for maximal relevance to the current sub-question, mitigating issues such as lost-in-the-middle effects in initial retrieval outputs.
Answer Generation: Synthesizes candidate (sub-)answers using ranked passages and contextually accumulated reasoning history.
Answer Verification (AV): Rigorously evaluates generated answers against retrieved evidence, acting both as a grounding filter (rejecting hallucinated or unsupported answers) and an initiator of corrective actions (e.g., pipeline re-entry).
Final Answering: Aggregates all verified sub-answers into a comprehensive final answer to the original multi-hop query.

Each module can be independently implemented or upgraded, supporting the use of different LLMs or custom logic tailored to task requirements or available resources.
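The idea of modules as composable transformations on structured inputs/outputs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `State` record, the `compose` helper, the `#k` placeholder syntax used by the Question Construction step, and the toy lookup table standing in for retrieval plus generation are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence, Tuple

@dataclass(frozen=True)
class State:
    """Structured record passed between modules (hypothetical schema)."""
    question: str
    sub_questions: Tuple[str, ...] = ()
    answers: Tuple[str, ...] = ()

# A module is any function from State to State.
Module = Callable[[State], State]

def compose(*modules: Module) -> Module:
    """Chain modules left to right into a single pipeline."""
    def pipeline(state: State) -> State:
        for module in modules:
            state = module(state)
        return state
    return pipeline

def construct_question(template: str, answers: Sequence[str]) -> str:
    """QC step: replace '#k' with the k-th earlier answer (1-indexed).
    The '#k' syntax is an assumption for this sketch."""
    return re.sub(r"#(\d+)", lambda m: answers[int(m.group(1)) - 1], template)

def toy_decompose(state: State) -> State:
    """Stand-in for an LLM-backed QD module; returns a fixed plan."""
    plan = ("Who directed Inception?", "Where was #1 born?")
    return replace(state, sub_questions=plan)

def make_answer_module(answer_fn: Callable[[str], Optional[str]]) -> Module:
    """Wrap any question -> answer function as a module, so the answering
    backend can be swapped without touching the rest of the pipeline."""
    def module(state: State) -> State:
        answers = []
        for template in state.sub_questions:
            answers.append(answer_fn(construct_question(template, answers)))
        return replace(state, answers=tuple(answers))
    return module

# Toy knowledge source standing in for retrieval plus generation.
KB = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

pipeline = compose(toy_decompose, make_answer_module(KB.get))
result = pipeline(State("Where was the director of Inception born?"))
print(result.answers)
```

Because every module shares the same `State -> State` signature, upgrading one component amounts to passing a different function to `make_answer_module` (or replacing `toy_decompose`) while the other modules stay unchanged.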

Orchestration and Self-Reflection

ComposeRAG is orchestrated through both direct pipelining and an iterative self-reflection mechanism. If pipeline verification fails at any step, the system automatically analyzes the full reasoning trace to diagnose the source of error (often in question decomposition), refines the initial step via guided prompting, and re-executes the pipeline. This reflective loop ensures error correction is tightly bound to verification, mitigating error propagation and enhancing robustness.
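The reflective loop described above can be sketched roughly as below. This is an assumption-laden illustration rather than the paper's algorithm: `run_with_reflection`, the trace format, the `max_reflections` parameter, and the toy `decompose`/`run_steps`/`refine` functions are all hypothetical names, and returning `None` to signal abstention is a design choice made for this example.

```python
from typing import Callable, List, Optional, Tuple

# One trace entry per sub-question: (sub_question, answer, verified)
Trace = List[Tuple[str, str, bool]]

def run_with_reflection(
    question: str,
    decompose: Callable[[str], List[str]],
    run_steps: Callable[[List[str]], Trace],
    refine: Callable[[str, List[str], Trace], List[str]],
    max_reflections: int = 2,
) -> Tuple[Optional[str], int]:
    """Run the pipeline; on verification failure, refine the plan and retry.

    Returns (final_answer, reflections_used); the answer is None when
    verification still fails after max_reflections retries.
    """
    plan = decompose(question)
    for attempt in range(max_reflections + 1):
        trace = run_steps(plan)
        if all(verified for _, _, verified in trace):
            return trace[-1][1], attempt
        if attempt < max_reflections:
            # Diagnose using the full trace and revise the decomposition.
            plan = refine(question, plan, trace)
    return None, max_reflections

# Toy stand-ins: a fact table plays the role of retrieval + verification.
FACTS = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

def toy_decompose(question: str) -> List[str]:
    # Deliberately poor first plan: the first sub-question has no support.
    return ["Who made Inception?", "Where was Christopher Nolan born?"]

def toy_run_steps(plan: List[str]) -> Trace:
    return [(sq, FACTS.get(sq, ""), sq in FACTS) for sq in plan]

def toy_refine(question: str, plan: List[str], trace: Trace) -> List[str]:
    # Rewrite only the steps that failed verification.
    fixes = {"Who made Inception?": "Who directed Inception?"}
    return [fixes.get(sq, sq) if not ok else sq for sq, _, ok in trace]

answer, used = run_with_reflection(
    "Where was the director of Inception born?",
    toy_decompose, toy_run_steps, toy_refine,
)
print(answer, used)
```

Capping the loop with `max_reflections` keeps cost bounded and gives the caller an explicit abstention signal instead of an unverified answer.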

The orchestration supports two broad QA processing tracks:

Simple QA Pipeline: For questions resolvable in a single hop, optimizing for efficiency.
Advanced Multi-Hop Pipeline: Engages the full modular sequence for complex queries.
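A dispatcher between the two tracks might look like the sketch below. The summary does not say how the routing decision is made, so `is_single_hop` is a purely hypothetical classifier (in practice it would likely be an LLM call), and the two pipelines are placeholders that only report which track was taken.

```python
from typing import Callable, Tuple

def make_router(
    is_single_hop: Callable[[str], bool],
    simple_pipeline: Callable[[str], str],
    advanced_pipeline: Callable[[str], str],
) -> Callable[[str], Tuple[str, str]]:
    """Build a function that routes each question to one of two pipelines
    and returns (track_name, answer)."""
    def route(question: str) -> Tuple[str, str]:
        if is_single_hop(question):
            return "simple", simple_pipeline(question)
        return "advanced", advanced_pipeline(question)
    return route

# Placeholder components, for illustration only.
def toy_is_single_hop(question: str) -> bool:
    # Crude heuristic: nested "of the" phrasing suggests multiple hops.
    return " of the " not in question

route = make_router(
    toy_is_single_hop,
    simple_pipeline=lambda q: f"[single-hop answer to: {q}]",
    advanced_pipeline=lambda q: f"[multi-hop answer to: {q}]",
)

print(route("Who directed Inception?")[0])
print(route("Where was the director of the film Inception born?")[0])
```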

Empirical Results and Module Contributions

ComposeRAG is evaluated on four multi-hop QA benchmarks: HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. Across all datasets and multiple LLM and retriever settings, it yields consistent improvements over strong baselines:

Up to 15% accuracy improvement over fine-tuning-based baselines (e.g., RQ-RAG) under identical retrieval setups.
Up to 5% gain over agentic/reasoning-specialized pipelines (e.g., Search-o1).
Verification-centric approach reduces ungrounded answers by over 10% in low-quality retrieval settings, and roughly 3% even against strong corpora.

Comprehensive ablation studies demonstrate the effectiveness and necessity of core modules:

QD and QC provide the largest single performance boosts by introducing interpretable decomposition.
Adding the PR, QR, and AV modules each yields an additive gain. For example, adding these modules one at a time systematically improves the Cover-EM and LLM Eval metrics on held-out subsets.
Upgrading individual modules from smaller to larger LLMs produces measurable performance improvements in isolation, substantiating the system's modular upgradability claim.
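For readers unfamiliar with Cover-EM, the sketch below shows one common way such a metric is computed: a prediction counts as correct if a normalized gold answer appears inside the normalized prediction. The exact normalization used in the paper is not given in this summary, so the SQuAD-style steps here (lowercasing, stripping punctuation and articles) are an assumption.

```python
import re
import string
from typing import Iterable

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, and
    collapse whitespace (SQuAD-style normalization, assumed here)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def cover_em(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if any non-empty normalized gold answer is contained in the
    normalized prediction."""
    pred = normalize(prediction)
    golds = [normalize(g) for g in gold_answers]
    return any(g in pred for g in golds if g)

print(cover_em("He was born in London, England.", ["London"]))
print(cover_em("Paris", ["London"]))
```

Note that plain substring containment can also match inside longer words; a stricter variant would compare token sequences instead.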

The self-reflection mechanism improves accuracy. Capping the number of pipeline re-executions (reflection steps) balances error correction against error amplification from over-generation or drift: performance peaks at a moderate cap, beyond which further iterations yield diminishing returns.

Tradeoffs and Failure Modes

ComposeRAG's insistence on answer grounding via explicit verification sometimes reduces answer coverage, particularly for implicitly supported or out-of-corpus questions where baseline models "guess" plausibly but without citation. Analysis reveals that 18% of such answers in existing baselines are unsupported by retrieved evidence, and ComposeRAG appropriately abstains in these cases. This behavior is preferable in high-stakes settings demanding verifiable responses, but the system may underperform in open-ended QA scenarios that favor broader coverage over factual precision.

Limitations noted include:

Decomposition reliability: Errors in the decomposition module (especially with ambiguous or poorly structured questions) can propagate, even with self-reflection.
Resource overhead: Reliance on multiple LLM calls introduces latency and cost, challenging real-time or low-resource deployment.
Error diagnosis complexity: The efficacy of revisit-and-refine strategies is bounded by the system’s error signal clarity; highly entangled reasoning failures may not be tractable within simple orchestration policies.

Theoretical and Practical Implications

ComposeRAG provides a blueprint for interpretable, flexible, and extendable multi-hop RAG architectures. Its composable design aligns closely with modular software engineering principles, enabling systematic evaluation, targeted module advancements, and simplified debugging in multi-step reasoning pipelines. The framework generalizes across a wide range of LLMs, showing minimal dependency on task-specific fine-tuning, and scales well from small to large models.

Practically, ComposeRAG is suitable for deployment in settings requiring high transparency, clear error analysis, and easy upgradability, for example in regulated domains, long-lived QA deployments, or environments where rapid integration of improved LLMs or evidence retrievers is advantageous.

Outlook and Future Directions

Future research opportunities center on:

Automated module selection and orchestration refinement: Dynamic profiling of question difficulty and automatic routing between minimal and maximal pipelines to further optimize efficiency.
Lightweight/self-distilled specialized modules: Developing smaller but highly effective LLMs for specific modules (e.g., Decomposition or Verification) to reduce cost.
Enhanced error traceability: Leveraging reasoning traces and explicit module logs for downstream error diagnosis, active learning, and continual improvement.
Broader integration with retrieval/fusion pipelines: Combining ComposeRAG’s modular abstraction with retrieval fusion, reinforcement learning-based reasoning policies, or external planning and control mechanisms.

ComposeRAG sets a precedent for systematic, modular approaches to multi-hop LLM reasoning, enabling both rigorous system analysis and practical advances in verifiable QA.