MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Published 10 May 2025 in cs.CL, cs.AI, cs.IR, and cs.LG | (2505.06569v2)

Abstract: Long-context LLMs (LC LLMs) combined with retrieval-augmented generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained windows, and fragmented information from suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical RAG framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through real-time chunk- and document-level expansions. By initiating with finest-level retrieval and progressively incorporating broader, higher-level context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm MacRAG consistently surpasses baseline RAG pipelines in single- and multi-step generation using Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.


Summary

The paper presents a unified hierarchical retrieval framework that balances context granularity and efficiency for multi-hop QA.
It employs top-down document compression and bottom-up adaptive slice merging to enhance precision and recall across benchmarks.
Experimental evaluations show F1 gains over LongRAG and RAPTOR, with latency comparable to or lower than those baselines.

Multi-Scale Adaptive Context RAG: Hierarchical Retrieval for Long-Context Multi-Hop QA

Overview of the MacRAG Framework

The "MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG" (2505.06569) paper addresses limitations in long-context retrieval-augmented generation (RAG) for LLMs, particularly the precision-coverage-computation trade-offs inherent in multi-hop QA over large corpora. The authors propose MacRAG, a hierarchical retrieval system that constructs effective query-adaptive long contexts by combining top-down offline document indexing—via compression and multi-scale chunking—and bottom-up adaptive retrieval with controlled neighbor propagation and merging.

MacRAG operates in two main phases: a hierarchical index construction phase, where documents are chunked, compressed, and sliced before embedding, and a query-time adaptive retrieval phase, which begins with fine-grained slice retrieval and scales up context by merging relevant neighboring chunks and expanding up the hierarchy. This design aims to simultaneously maximize retrieval precision and context coverage while maintaining tractable context length and low latency.

Figure 1: Overview of MacRAG showing both hierarchical document indexing and bottom-up multi-scale adaptive retrieval over the document-chunk-summary-slice hierarchy.

Long-context LLMs such as GPT-4o, Gemini-1.5-pro, and Llama 3 can process hundreds of thousands of tokens and theoretically ingest entire documents. Still, practical issues like the "Lost in the Middle" phenomenon and context window limitations severely impact recall and reasoning for information buried in extended texts. Prior RAG enhancements—abstractive/extractive summarization (RECOMP, LLMLingua), hierarchical retrieval (RAPTOR, SIRERAG, HippoRAG, GraphRAG), and multi-step QA chains—either focus on post-retrieval compression, induce heavy computational overhead, or fragment continuity.

MacRAG differs from existing proposals by introducing a unified system for structure-preserving hierarchical indexing and query-time adaptive retrieval that systematically balances granularity and coverage, integrates multi-view (summary-level and chunk-level) evidence, and facilitates efficient neighbor expansion.

Hierarchical Indexing and Adaptive Retrieval

Top-Down Document Indexing

Documents are chunked with partial overlaps (typically 200–500 tokens per chunk), compressed into summaries via LLM-based abstractive summarization, then further sliced into fine-grained, partially overlapping units (50–200 tokens). Slices, summaries, and optionally original chunks are embedded and stored in a vector DB with complete metadata (document/chunk/slice offsets). This multi-scale index enables rapid cross-granularity retrieval and maintains document structure for efficient mapping during expansion.
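The chunk, summarize, and slice steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `window`, `summarize`, `build_index`, and the `Slice` record are hypothetical names, whitespace tokens stand in for a real tokenizer, the summarizer is an identity placeholder, and the default sizes are simply picked from the ranges quoted above. Embedding and vector-DB storage are omitted.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    doc_id: str
    chunk_idx: int  # position of the parent chunk within the document
    slice_idx: int  # position of the slice within the chunk's summary
    text: str


def window(tokens, size, overlap):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    out = []
    for start in range(0, len(tokens), step):
        out.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return out


def summarize(text):
    """Identity placeholder for an LLM-based abstractive summarizer (assumption)."""
    return text


def build_index(doc_id, text, chunk_size=400, chunk_overlap=50,
                slice_size=100, slice_overlap=20):
    """Return (chunks, slices); each slice records its parent chunk for later expansion."""
    tokens = text.split()  # whitespace tokens stand in for a real tokenizer
    chunks = window(tokens, chunk_size, chunk_overlap)
    slices = []
    for c_idx, chunk in enumerate(chunks):
        summary_tokens = summarize(" ".join(chunk)).split()
        pieces = window(summary_tokens, slice_size, slice_overlap)
        for s_idx, piece in enumerate(pieces):
            slices.append(Slice(doc_id, c_idx, s_idx, " ".join(piece)))
    return chunks, slices


text = " ".join(f"w{i}" for i in range(1000))
chunks, slices = build_index("doc-0", text)
```

Keeping `doc_id` and `chunk_idx` on every slice is what lets a fine-grained hit be mapped back to its parent chunk and document at query time.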

Bottom-Up Multi-Scale Adaptive Retrieval

At inference, MacRAG begins with fine-level slice retrieval based on query similarity, maps the retrieved slices to their parent chunks, and reranks those chunks with cross-encoder relevance scoring. It then scales up the chunk candidates via the $\alpha$ parameter, incorporating both borderline and highly relevant segments, before mapping them to their source documents. Neighbor propagation merges $h$-hop neighboring chunks, reducing fragmentation and preserving the evidence continuity that multi-hop reasoning needs.

This process adaptively leverages context expansion based on query complexity—expanding broader neighborhoods for ambiguous/multi-hop queries and maintaining precision for focused queries. Context size is bounded via $(k_2 \times \alpha)$ candidates and hop counts, ensuring scalability.
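A rough sketch of the scale-up and neighbor-merge logic is shown below. It is a simplification under stated assumptions rather than the paper's exact procedure: `scale_up` is a hypothetical function, the `rerank` callable stands in for a cross-encoder, slice hits are assumed to arrive already resolved to `(doc_id, chunk_idx)` pairs, and the candidate budget is taken to be `int(k2 * alpha)`.

```python
def scale_up(slice_hits, rerank, k2, alpha, h, chunks_per_doc):
    """
    slice_hits: list of (doc_id, chunk_idx) pairs, one per retrieved slice.
    rerank: callable (doc_id, chunk_idx) -> relevance score (cross-encoder stand-in).
    chunks_per_doc: dict doc_id -> number of chunks in that document.
    Returns dict doc_id -> list of merged, inclusive (start, end) chunk ranges.
    """
    # 1. Map slices to unique parent chunks, preserving first-seen order.
    parents = list(dict.fromkeys(slice_hits))
    # 2. Rerank the parent chunks and keep at most k2 * alpha candidates.
    ranked = sorted(parents, key=lambda p: rerank(*p), reverse=True)
    kept = ranked[: int(k2 * alpha)]
    # 3. Expand each kept chunk by h neighbors per side, clipped to the document.
    spans = {}
    for doc_id, idx in kept:
        lo = max(0, idx - h)
        hi = min(chunks_per_doc[doc_id] - 1, idx + h)
        spans.setdefault(doc_id, []).append((lo, hi))
    # 4. Merge overlapping or adjacent ranges so the context stays contiguous.
    merged = {}
    for doc_id, ranges in spans.items():
        ranges.sort()
        out = [ranges[0]]
        for lo, hi in ranges[1:]:
            last_lo, last_hi = out[-1]
            if lo <= last_hi + 1:
                out[-1] = (last_lo, max(last_hi, hi))
            else:
                out.append((lo, hi))
        merged[doc_id] = out
    return merged


scores = {("A", 2): 0.9, ("A", 4): 0.8, ("B", 0): 0.7, ("A", 9): 0.1}
hits = [("A", 2), ("A", 2), ("A", 4), ("B", 0), ("A", 9)]
result = scale_up(hits, lambda d, i: scores[(d, i)], k2=2, alpha=1.5, h=1,
                  chunks_per_doc={"A": 10, "B": 3})
# result == {"A": [(1, 5)], "B": [(0, 1)]}
```

In this sketch the merged context can never exceed `int(k2 * alpha) * (2 * h + 1)` chunks, which is one way to read the bound described above.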

Experimental Results

MacRAG is evaluated on the LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique, datasets specifically designed to challenge multi-hop reasoning and retrieval in long contexts. The comparison uses Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o, with baselines including RAPTOR and LongRAG, in both single-step and multi-step QA settings.

Key findings:

F1-Score and Multi-Metric Gains: MacRAG surpasses LongRAG and RAPTOR across F1, Precision, Recall, and Exact Match metrics. For example, on HotpotQA (with GPT-4o, multi-step generation), MacRAG achieves 68.52 vs. LongRAG's 66.20 F1 (+2.32 absolute, +3.5% rel.), and on Musique, delivers +6.26 F1 over LongRAG (+14.3% rel.).
Stable Performance Across Generation Modes: Gains are upheld over seven distinct single/multi-step variants (retrieval-base, full extraction, filtering, chunk/document expansion), as illustrated below.

Figure 2: Comparative F1-scores on three datasets for LongRAG vs. MacRAG over seven generation modes with GPT-4o.

Figure 3: Comprehensive metric trends (Exact Match, F1, Precision, Recall) for MacRAG and LongRAG on HotpotQA, 2WikiMultihopQA, and Musique with two LLMs.

Ablation: Hierarchical Expansion and Scaling Robustness: Ablation studies demonstrate that removing adaptive neighbor merging or scaling-up components yields a drop of up to 5–6 F1 points, substantiating these as critical mechanisms for bridging scattered evidence. Scaling factor $\alpha$ is robust across datasets, effectively balancing input length and recall.

Figure 4: Performance trends for increasing scale factors ($\alpha$), confirming robust improvements as more borderline candidates are added for context expansion.

Efficiency and Latency: Despite richer context assembly, MacRAG achieves sub-second retrieval/reranking (0.22–0.24s per query) and total inference latency comparable to or faster than RAPTOR and LongRAG, maintaining 38% faster retrieval than RAPTOR (0.23s vs. 0.37s per query). Offline summarization cost is amortized; input context length is controlled.

Figure 5: Cumulative LLM input context length for single/multi-step generation methods, showing MacRAG maintains bounded lengths compared to full-document methods.
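The relative figures quoted in the key findings can be reproduced from the absolute numbers with a few lines of arithmetic. The helper names are illustrative, and reading "38% faster" as a 38% reduction in per-query retrieval time is an assumption about how that figure was computed.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def relative_reduction(new, base):
    """Relative reduction of `new` below `base`, in percent."""
    return 100.0 * (base - new) / base


f1_gain = relative_gain(68.52, 66.20)         # HotpotQA F1, GPT-4o, multi-step
latency_cut = relative_reduction(0.23, 0.37)  # retrieval seconds per query vs. RAPTOR
print(f"{f1_gain:.1f}% {latency_cut:.0f}%")   # prints "3.5% 38%"
```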

Practical and Theoretical Implications

MacRAG's hierarchical index preserves content structure, enables modular neighbor expansion, and maintains efficiency for massive corpora and enterprise-scale applications requiring dynamic knowledge base updates and personalized long-context reasoning. Its multi-scale adaptive pipeline further facilitates integration with iterative/agentic QA pipelines (e.g., chain-of-retrieval, multi-agent systems), and demonstrates strong performance even with advanced LLMs, indicating future-proof scalability as model context windows continue to expand.

The controlled context construction can mitigate known LLM failures ("Lost in the Middle", hallucination bias, information fragmentation), and the modular design invites extensions—e.g., graph-based reranking, knowledge-guided neighbor expansion, agentic context selection in lifelong QA systems.

Conclusion

The MacRAG framework introduces a principled hierarchical retrieval architecture that adaptively constructs effective long contexts for RAG over large document corpora. Empirical evaluations confirm superior precision and recall with strong efficiency, robust gains across diverse settings and LLMs, and demonstrable adaptability for advanced multi-hop reasoning. MacRAG establishes itself as an efficient, scalable, and flexible retrieval foundation for future-generation RAG and agentic QA infrastructures.