- The paper introduces a modular RAG framework that decomposes multi-hop question answering into independently upgradable modules.
- It combines a self-reflection mechanism with targeted modules such as retrieval decision, query rewriting, and answer verification, improving accuracy by up to 15% over fine-tuning-based baselines.
- The system improves over strong baselines on four multi-hop QA benchmarks while supporting transparent error analysis and module-level upgrades.
ComposeRAG: A Modular Composable Framework for Multi-Hop Corpus-Grounded Question Answering
ComposeRAG introduces a modular abstraction for Retrieval-Augmented Generation (RAG) pipelines tailored to complex, corpus-grounded multi-hop question answering. This approach explicitly addresses dominant limitations in existing RAG architectures, namely their monolithic design, lack of interpretability, and rigidity in system analysis and improvement. ComposeRAG offers a clear decomposition of reasoning functions into parameterized, independently upgradable modules, providing a high degree of transparency and facilitating targeted improvements, especially for multi-step reasoning tasks.
Main Contributions and Architecture
ComposeRAG's core innovation lies in its compositional design. The system is constructed from atomic modules, each performing a fundamental transformation on structured inputs/outputs within the multi-hop reasoning process. The principal modules are:
- Question Decomposition (QD): Breaks down complex questions into a sequence of interdependent sub-questions, facilitating linear multi-hop reasoning.
- Question Construction (QC): Resolves sub-question dependencies by explicitly filling in placeholders with responses from previous steps, thus ensuring all sub-questions are self-contained and contextually well-specified.
- Retrieval Decision: Determines, for each (sub-)question, whether external evidence is required or sufficient context is already present. This efficiency-oriented step reduces unnecessary retrievals.
- Query Rewriting (QR): Refines under-specified sub-questions for improved retrieval efficacy, particularly when sub-questions inherit ambiguity from the decomposition step.
- Passage Reranking (PR): Reorders retrieved evidence passages for maximal relevance to the current sub-question, mitigating issues such as lost-in-the-middle effects in initial retrieval outputs.
- Answer Generation: Synthesizes candidate (sub-)answers using ranked passages and contextually accumulated reasoning history.
- Answer Verification (AV): Rigorously evaluates generated answers against retrieved evidence, acting both as a grounding filter (rejecting hallucinated or unsupported answers) and an initiator of corrective actions (e.g., pipeline re-entry).
- Final Answering: Aggregates all verified sub-answers into a comprehensive final answer to the original multi-hop query.
Each module can be independently implemented or upgraded, supporting the use of different LLMs or custom logic tailored to task requirements or available resources.
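One way to picture this modularity is as a chain of interchangeable callables. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the `#k` placeholder syntax, the function names, and the lookup-table answerer are all hypothetical stand-ins for LLM-backed modules.

```python
import re
from typing import Callable, Dict, List

# Assumed convention: a sub-question refers to the answer of step k as "#k".
PLACEHOLDER = re.compile(r"#(\d+)")

def construct_question(template: str, answers: List[str]) -> str:
    """Question Construction: fill '#k' placeholders with earlier answers."""
    def fill(match: re.Match) -> str:
        index = int(match.group(1)) - 1
        # Leave the placeholder untouched if that step has no answer yet.
        return answers[index] if 0 <= index < len(answers) else match.group(0)
    return PLACEHOLDER.sub(fill, template)

# Any callable with this signature can serve as the answering module, so it
# can be swapped for a different LLM or custom logic without other changes.
AnswerFn = Callable[[str], str]

def run_multi_hop(sub_questions: List[str], answer_fn: AnswerFn) -> List[str]:
    """Resolve each sub-question in order, feeding answers forward."""
    answers: List[str] = []
    for template in sub_questions:
        question = construct_question(template, answers)
        answers.append(answer_fn(question))
    return answers

# Toy answerer backed by a lookup table (stands in for retrieval + generation).
TOY_FACTS: Dict[str, str] = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

def toy_answer(question: str) -> str:
    return TOY_FACTS.get(question, "unknown")

steps = ["Who directed Inception?", "Where was #1 born?"]
result = run_multi_hop(steps, toy_answer)
print(result)  # ['Christopher Nolan', 'London']
```

Because `run_multi_hop` depends only on the `AnswerFn` signature, replacing `toy_answer` with a stronger model leaves the rest of the chain unchanged, which is the property the modular design is meant to provide.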
Orchestration and Self-Reflection
ComposeRAG is orchestrated through both direct pipelining and an iterative self-reflection mechanism. If pipeline verification fails at any step, the system automatically analyzes the full reasoning trace to diagnose the source of error (often in question decomposition), refines the initial step via guided prompting, and re-executes the pipeline. This reflective loop ensures error correction is tightly bound to verification, mitigating error propagation and enhancing robustness.
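The reflective loop described above can be sketched as a bounded retry around the pipeline. This is a simplified illustration under assumptions: the `Trace` structure, the `decompose` and `execute` callables, and the `max_reflections` parameter are hypothetical names, and the toy functions replace LLM-driven diagnosis and verification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Trace:
    sub_questions: List[str]
    answer: Optional[str]
    verified: bool

def answer_with_reflection(
    question: str,
    decompose: Callable[[str, Optional[Trace]], List[str]],
    execute: Callable[[List[str]], Trace],
    max_reflections: int = 2,
) -> Trace:
    """Run the pipeline; on failed verification, re-decompose using the trace."""
    if max_reflections < 0:
        raise ValueError("max_reflections must be non-negative")
    previous: Optional[Trace] = None
    for _ in range(max_reflections + 1):  # first pass plus bounded retries
        sub_questions = decompose(question, previous)
        trace = execute(sub_questions)
        if trace.verified:
            return trace
        previous = trace  # the failed trace guides the next decomposition
    assert previous is not None
    return previous  # budget exhausted: return the last unverified trace

# Toy stand-ins: the first decomposition is wrong, the refined one is right.
def toy_decompose(question: str, previous: Optional[Trace]) -> List[str]:
    if previous is None:
        return [question]  # fails to split the question
    return ["Who directed Inception?", "Where was #1 born?"]

def toy_execute(sub_questions: List[str]) -> Trace:
    ok = len(sub_questions) == 2  # stand-in for Answer Verification
    return Trace(sub_questions, "London" if ok else None, ok)

final = answer_with_reflection(
    "Where was the director of Inception born?", toy_decompose, toy_execute
)
print(final.verified, final.answer)
```

Passing the failed trace back into `decompose` is what ties correction to verification in this sketch; with `max_reflections=0` the same call returns the unverified first attempt.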
The orchestration supports two broad QA processing tracks:
- Simple QA Pipeline: For questions resolvable in a single hop, optimizing for efficiency.
- Advanced Multi-Hop Pipeline: Engages the full modular sequence for complex queries.
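A rough sketch of how routing between the two tracks might look is given below. The routing criterion is an assumption: the summary does not say how a question is assigned to a track, so this example simply uses the number of sub-questions returned by a hypothetical decomposition function, and all names are illustrative.

```python
from typing import Callable, List

def route_question(
    question: str,
    decompose: Callable[[str], List[str]],
    simple_pipeline: Callable[[str], str],
    advanced_pipeline: Callable[[List[str]], str],
) -> str:
    """Send single-hop questions down the cheap path, others down the full one."""
    sub_questions = decompose(question)
    if len(sub_questions) <= 1:
        return simple_pipeline(question)
    return advanced_pipeline(sub_questions)

# Toy stand-ins for the real modules.
def toy_decompose(question: str) -> List[str]:
    # Hypothetical heuristic: "the director of X" needs two hops.
    if "the director of" in question:
        return ["Who directed Inception?", "Where was #1 born?"]
    return [question]

def toy_simple(question: str) -> str:
    return f"single-hop: {question}"

def toy_advanced(sub_questions: List[str]) -> str:
    return f"multi-hop with {len(sub_questions)} steps"

easy = route_question("Who directed Inception?", toy_decompose, toy_simple, toy_advanced)
hard = route_question(
    "Where was the director of Inception born?", toy_decompose, toy_simple, toy_advanced
)
print(easy)
print(hard)
```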
Empirical Results and Module Contributions
ComposeRAG is evaluated on four multi-hop QA benchmarks: HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. Across all datasets and multiple LLM and retriever settings, it yields consistent improvements over strong baselines:
- Up to 15% accuracy improvement over fine-tuning-based baselines (e.g., RQ-RAG) under identical retrieval setups.
- Up to 5% gain over agentic/reasoning-specialized pipelines (e.g., Search-o1).
- The verification-centric approach reduces ungrounded answers by over 10% in low-quality retrieval settings, and by roughly 3% even with strong corpora.
Comprehensive ablation studies demonstrate the effectiveness and necessity of core modules:
- QD and QC provide the largest individual performance boosts by introducing interpretable decomposition.
- Adding the PR, QR, and AV modules each yields additive gains. For example, the stepwise addition of these modules systematically improves Cover-EM and LLM Eval metrics on held-out subsets.
- Upgrading individual modules from smaller to larger LLMs produces measurable performance improvements in isolation, substantiating the system's modular upgradability claim.
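For readers unfamiliar with the Cover-EM metric used in the ablations, it is commonly computed as whether the normalized gold answer appears anywhere in the normalized prediction. The sketch below follows that common definition; the exact normalization used in the paper is not stated here, so the SQuAD-style normalization (lowercasing, stripping punctuation and articles) is an assumption.

```python
import re
import string
from typing import Iterable

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def cover_em(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if any non-empty normalized gold answer is a substring of the prediction."""
    pred = normalize(prediction)
    golds = [normalize(g) for g in gold_answers]
    return any(bool(g) and g in pred for g in golds)

print(cover_em("He was born in London, England.", ["London"]))
print(cover_em("Paris", ["London"]))
```

Note that substring matching is lenient by design: a verbose prediction still scores as correct as long as it contains the gold answer.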
The self-reflection mechanism improves accuracy, but only up to a point. Capping the maximum number of pipeline re-executions (reflection steps) balances error correction against error amplification from over-generation or drift: performance peaks early, and further iterations yield diminishing returns.
Tradeoffs and Failure Modes
ComposeRAG's insistence on answer grounding via explicit verification sometimes reduces answer coverage, particularly for implicitly supported or out-of-corpus questions where baseline models "guess" plausibly but without citation. Analysis reveals that 18% of such answers in existing baselines are unsupported by retrieved evidence, and ComposeRAG appropriately abstains in these cases. This behavior is preferable in high-stakes settings that demand verifiable responses, but it can cost accuracy in open-ended QA scenarios that favor broader coverage over factual precision.
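The coverage tradeoff can be made concrete with a small abstention sketch. The grounding check here is a deliberately naive assumption (case-insensitive substring match against the retrieved passages) standing in for the LLM-based verifier, and the function names are hypothetical.

```python
from typing import List, Optional

def is_supported(answer: str, passages: List[str]) -> bool:
    """Stand-in verifier: the answer string must appear in some passage."""
    needle = answer.strip().lower()
    return bool(needle) and any(needle in passage.lower() for passage in passages)

def verified_answer(candidate: str, passages: List[str]) -> Optional[str]:
    """Return the candidate only if it is grounded; None signals abstention."""
    return candidate if is_supported(candidate, passages) else None

passages = ["Christopher Nolan was born in London in 1970."]
grounded = verified_answer("London", passages)    # kept: evidence contains it
abstained = verified_answer("Chicago", passages)  # dropped: plausible but unsupported
print(grounded, abstained)
```

A baseline without this filter would emit both candidates, gaining coverage when the guess happens to be right and producing an unsupported answer when it is not.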
Limitations noted include:
- Decomposition reliability: Errors in the decomposition module (especially with ambiguous or poorly structured questions) can propagate, even with self-reflection.
- Resource overhead: Reliance on multiple LLM calls introduces latency and cost, challenging real-time or low-resource deployment.
- Error diagnosis complexity: The efficacy of revisit-and-refine strategies is bounded by the clarity of the system’s error signal; highly entangled reasoning failures may not be tractable within simple orchestration policies.
Theoretical and Practical Implications
ComposeRAG provides a blueprint for interpretable, flexible, and extendable multi-hop RAG architectures. Its composable design aligns closely with modular software engineering principles, enabling systematic evaluation, targeted module advancements, and simplified debugging in multi-step reasoning pipelines. The framework generalizes across a wide range of LLMs, showing minimal dependency on task-specific fine-tuning, and scales well from small to large models.
Practically, ComposeRAG is suitable for deployment in settings requiring high transparency, clear error analysis, and easy upgradability, for example in regulated domains, long-lived QA deployments, or environments where rapid integration of improved LLMs or evidence retrievers is advantageous.
Outlook and Future Directions
Future research opportunities center on:
- Automated module selection and orchestration refinement: Dynamic profiling of question difficulty and automatic routing between minimal and maximal pipelines to further optimize efficiency.
- Lightweight or self-distilled specialized modules: Developing smaller but highly effective LLMs for specific modules (e.g., Decomposition or Verification) to reduce cost.
- Enhanced error traceability: Leveraging reasoning traces and explicit module logs for downstream error diagnosis, active learning, and continual improvement.
- Broader integration with retrieval/fusion pipelines: Combining ComposeRAG’s modular abstraction with retrieval fusion, reinforcement learning-based reasoning policies, or external planning and control mechanisms.
ComposeRAG sets a precedent for systematic, modular approaches to multi-hop LLM reasoning, enabling both rigorous system analysis and practical advances in verifiable QA.