FusionRAG: Enhanced RAG with Fusion Techniques
- FusionRAG is a retrieval-augmented generation framework that uses LLM-generated query variants and reciprocal rank fusion (RRF) to combine evidence from multiple sources.
- It enhances answer accuracy and comprehensiveness by integrating diverse retrieval results through hierarchical and efficiency-oriented fusion variants, including optimized KV cache reuse.
- Empirical studies demonstrate that FusionRAG achieves better completeness and improved throughput, making it a robust solution for multi-source document grounding and fact verification.
FusionRAG refers to a family of Retrieval-Augmented Generation (RAG) architectures that employ fusion techniques—typically reciprocal rank fusion (RRF)—at various stages of the retrieval and answer-synthesis pipeline. These architectures systematically generate multiple queries or leverage multiple evidence sources, retrieve and rank potentially relevant documents for each query or source, and merge the resulting ranked lists to inform an LLM during answer generation. In recent literature, FusionRAG methods have been explored for improving the accuracy, comprehensiveness, and efficiency of LLM-driven question answering, document grounding, and knowledge-intensive NLP applications (Rackauckas, 2024, Rackauckas et al., 2024, Santra et al., 2 Sep 2025, Wang et al., 19 Jan 2026). FusionRAG is also distinguished by its variants: (1) RAG-Fusion (query variation plus rank fusion), (2) efficiency-oriented variants for LLM cache reuse, and (3) hierarchical, multi-source fusion for fact verification and robust generalization.
1. Architectural Variants of FusionRAG
FusionRAG in its canonical form extends the conventional RAG pipeline. Instead of issuing the user's original query directly to a retriever, FusionRAG first employs an LLM-based query generator to create a set of $n$ variant or paraphrased queries $q_1, \dots, q_n$. Each $q_i$ is independently submitted to a retrieval system (vector-based, lexical, or hybrid), yielding a ranked list of top-$m$ documents. Fusion is implemented through Reciprocal Rank Fusion (RRF), summarizing per-document relevance across all lists via

$$\mathrm{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + r_i(d)},$$

where $r_i(d)$ is the 1-based position of document $d$ in the $i$-th list and $k$ (typically 60 or 100) is a smoothing parameter. The top fused documents are concatenated into the LLM prompt, along with the original query and (optionally) the variant queries, for downstream answer generation. This design produces answers grounded in multi-perspective evidence (Rackauckas, 2024, Rackauckas et al., 2024).
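The RRF merge step can be sketched in a few lines of Python. This is a generic illustration of the formula above, not code from the cited papers:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked document lists with Reciprocal Rank Fusion.

    ranked_lists: list of lists of doc IDs, best-first (ranks are 1-based).
    k: smoothing constant (60 is a common default).
    Returns doc IDs sorted by descending fused RRF score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked near the top of several lists (e.g., rank 1 in two of three lists) outscores one ranked first in only a single list, which is exactly the consensus-rewarding behavior described above.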
Recently, efficiency-centric FusionRAG variants have been introduced. These cache and reuse the intermediate key–value (KV) tensors computed by the LLM over retrieved chunks. FusionRAG for LLM inference acceleration constructs per-chunk “amplified” KV caches with offline cross-attention to semantically similar chunks, and sparsely recomputes attention for only a small, query-guided subset of critical tokens at runtime, dramatically reducing time-to-first-token (TTFT) while preserving answer quality (Wang et al., 19 Jan 2026). Another variant, HF-RAG (Hierarchical Fusion-based RAG), proposes intra-source fusion (aggregating multiple rankers per data source) followed by inter-source z-score normalization and merging, targeting fact verification with both labeled and unlabeled corpora (Santra et al., 2 Sep 2025).
2. Query Generation, Rank Fusion, and Context Assembly
The query generation module in FusionRAG elicits diverse sub-queries or paraphrases from an LLM to probe complementary facets of the information space. Prompts instruct the LLM to rephrase the initial query to cover distinct aspects (definitions, technical specs, recommendations, etc.), typically producing 3–5 sub-queries of 10–15 tokens each to balance diversity and retriever precision. The LLM temperature is set moderately high (e.g., 0.7) to maximize coverage. Each query is independently submitted to a retriever (often a dense-vector model or BM25), which returns the top-$m$ relevant documents, each tracked by its rank.
RRF merges these ranked lists into a single fused document ranking, assigning each document a combined relevance score as shown above. This step tends to reward documents selected consistently across diverse query perspectives and reduce the risk of missing key evidentiary passages. The answer-generation LLM then receives a prompt that includes: the original query, the set of generated sub-queries, the top-$m$ fused documents (by title or snippet), and an instruction to synthesize a concise, comprehensive, and accurate answer (Rackauckas, 2024, Rackauckas et al., 2024).
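The end-to-end flow—sub-query generation, per-query retrieval, fusion, and context assembly—can be sketched as below. All callables (`generate_subqueries`, `retrieve`, `fuse`) are hypothetical stand-ins for the real LLM and retriever components, and the prompt template is illustrative, not the one used in the cited papers:

```python
def fusion_rag_prompt(query, generate_subqueries, retrieve, fuse, top_k=5):
    """Assemble a FusionRAG prompt: sub-query generation -> per-query
    retrieval -> rank fusion -> context assembly (sketch)."""
    subqueries = generate_subqueries(query)            # e.g., 3-5 LLM paraphrases
    ranked_lists = [retrieve(q) for q in [query] + subqueries]
    fused_docs = fuse(ranked_lists)[:top_k]            # e.g., RRF merge
    context = "\n".join(f"- {d}" for d in fused_docs)
    return (
        f"Original question: {query}\n"
        f"Sub-queries: {'; '.join(subqueries)}\n"
        f"Evidence:\n{context}\n"
        "Synthesize a concise, comprehensive, and accurate answer."
    )
```

In production the `generate_subqueries` call would hit an LLM and `retrieve` a vector or BM25 index; injecting them as callables keeps the assembly logic testable in isolation.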
3. Evaluation Frameworks and Empirical Performance
FusionRAG performance has been benchmarked both by expert human raters and automated “LLM-as-a-judge” frameworks. Evaluations target three axes: (i) accuracy (factual correctness), (ii) relevance (focus on user intent), and (iii) comprehensiveness (coverage of subtopics and facets). Empirically, FusionRAG and its variants demonstrate:
- Answer accuracy comparable to or slightly exceeding single-query RAG baselines.
- Noticeably stronger comprehensiveness, leveraging the widened context from query variants.
- Good overall relevance, with occasional off-topic content when sub-query generation produces poorly aligned variants (Rackauckas, 2024).
Automated evaluations—such as RAGElo, an Elo-style ranking system using LLM-generated synthetic queries and LLM-based answer grading—offer high-throughput comparison of RAG variants (Rackauckas et al., 2024). In these studies, FusionRAG achieved higher Elo scores and higher win rates, especially for completeness. Precision marginally favors standard RAG, reflecting that FusionRAG’s broader retrieval can introduce unrelated but contextually similar material.
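RAGElo's exact update rule and parameters are not reproduced here; the following is a standard Elo update for a single pairwise, LLM-judged comparison, shown only to illustrate how such rankings accumulate:

```python
def elo_update(r_a, r_b, winner, k_factor=32):
    """One standard Elo update after a pairwise comparison.

    r_a, r_b: current ratings of systems A and B.
    winner: 'a', 'b', or 'tie' (the LLM judge's verdict).
    Returns the new (r_a, r_b) pair.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k_factor * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Running many such judged comparisons between RAG variants yields the relative Elo scores reported in the table below.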
Table: Representative Retrieval and Elo Scores (Infineon QA dataset, (Rackauckas et al., 2024))
| Retrieval | Agent | MRR@5 (Very Relevant) | Elo Score |
|---|---|---|---|
| BM25 | RAG | 0.821 | 487 |
| BM25 | RAG-F | 0.855 | 571 |
| Hybrid (RRF) | RAG | 0.746 | 497 |
| Hybrid (RRF) | RAG-F | 0.758 | 550 |
A positive correlation between LLM-judge ratings and expert human scores has been observed (Kendall's $\tau$), supporting utility for rapid prototyping and head-to-head RAG system comparison.
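For reference, the agreement statistic can be computed from paired judge and human scores; the minimal tau-a variant (no tie correction) is:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a, no tie correction) between two
    equally long score lists, e.g., LLM-judge vs. human ratings."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Published studies typically use a tie-corrected variant (tau-b), so results from this sketch may differ slightly when scores contain ties.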
4. Efficiency-Oriented FusionRAG: Cache Reuse and Sparse Attention
FusionRAG for LLM cache efficiency addresses high computational overheads from long retrieval-augmented inputs. In this design, the pipeline is augmented with:
- Offline preprocessing: Semantically similar document chunks are identified using vector encoders. Per-chunk KV caches are enriched with cross-attention to these similar neighbors, so future queries can reuse nearly all attention computations.
- Online adaptive recomputation: At inference, the method selects a small fraction (e.g., 15%) of the tokens most critical for the current query via last-layer attention scoring and recomputes attention for only these tokens. The remainder use their offline-precomputed KV cache.
- Q-Sparse-Attn kernel: A custom sparse-attention implementation merges precomputed and freshly computed KVs efficiently.
This scheme enables a multi-fold reduction in TTFT, with normalized F1 scores up to 70% higher than other cache-reuse baselines under fixed recomputation budgets (Wang et al., 19 Jan 2026). Offline cross-attention reduces KV-cache deviation by ≈70% before any online computation.
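The query-guided token selection in the online step can be sketched as follows. The function name and interface are hypothetical; in the real system, the importance scores come from last-layer attention inside the LLM, not from a plain list:

```python
def select_recompute_tokens(attn_scores, budget=0.15):
    """Pick the indices of the query-critical tokens to recompute (sketch).

    attn_scores: per-token importance scores for the current query.
    budget: fraction of tokens whose attention is recomputed online.
    Returns the selected indices in position order, so the freshly
    computed KVs can be merged back into the precomputed cache.
    """
    n_keep = max(1, int(len(attn_scores) * budget))
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

All tokens outside the returned index set keep their offline-amplified KV entries, which is where the TTFT savings come from.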
5. Best Practices, Strengths, and Limitations
Optimal deployment of FusionRAG involves task-tuned choices for the number of sub-queries $n$, the number of retrieved documents per query $m$, and the RRF smoothing parameter $k$. In practice, up to 5 sub-queries and up to 10 fused documents yield strong results. Fine-grained prompt engineering for both the query generation and final answer generation phases helps control diversity and output quality. LLM calls for query generation may be cached to reduce latency on common queries.
Strengths:
- Provides broader context and richer, more grounded LLM outputs by surfacing diverse, complementary evidence (Rackauckas, 2024, Rackauckas et al., 2024).
- Increases completeness and coverage, beneficial in engineering/product QA and multi-faceted domains.
- Fusion generally improves retrieval effectiveness (nDCG, MRR) without additional ranker or generator training (Santra et al., 2 Sep 2025).
- Efficiency-oriented variants permit state-of-the-art latency–quality tradeoffs (Wang et al., 19 Jan 2026).
Limitations:
- Latency increases due to extra LLM calls and retrieval fan-out.
- Sub-optimal sub-queries may introduce topic drift or unrelated content.
- Prompt design for sub-query generation is sensitive and may require iterative expert-in-the-loop adjustment.
- Precision tradeoff: broadened contexts can reduce exactness in specific, entity-centric applications.
- Performance depends on score calibration and distributional assumptions (e.g., z-scoring for inter-source fusion) (Santra et al., 2 Sep 2025).
6. Multi-Source and Hierarchical FusionRAG Extensions
FusionRAG can be extended to aggregate over heterogeneous sources and multiple rankers, as in HF-RAG. This approach operates in two hierarchical fusion stages:
- Intra-source fusion: For each source $s$ (e.g., labeled task exemplars and large unlabeled background documents), ranked lists from several retrieval models are merged via RRF:

$$\mathrm{RRF}_s(d) = \sum_{j} \frac{1}{k + r_{s,j}(d)},$$

where $r_{s,j}(d)$ is the rank of document $d$ under the $j$-th ranker for source $s$.
- Inter-source fusion: To reconcile differing score distributions, RRF scores are standardized within each source via z-score normalization:

$$z_s(d) = \frac{\mathrm{RRF}_s(d) - \mu_s}{\sigma_s},$$

where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of RRF scores over source $s$. The normalized lists are merged and the top-ranked passages are used as LLM context (Santra et al., 2 Sep 2025).
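The two-stage scheme can be sketched as below. This is a simplified illustration of the HF-RAG idea, not the authors' implementation; in particular, deduplication of documents appearing in multiple sources is omitted:

```python
import statistics
from collections import defaultdict

def hierarchical_fusion(sources, k=60, top_m=5):
    """Two-stage hierarchical fusion (sketch).

    sources: maps source name -> list of ranked doc-ID lists (one per ranker).
    Stage 1: RRF within each source. Stage 2: z-score normalization of
    RRF scores per source, then a global merge of all sources.
    Returns the top_m doc IDs overall.
    """
    fused = {}
    for name, ranked_lists in sources.items():
        scores = defaultdict(float)
        for ranking in ranked_lists:          # intra-source RRF
            for rank, doc in enumerate(ranking, start=1):
                scores[doc] += 1.0 / (k + rank)
        mu = statistics.mean(scores.values())
        sigma = statistics.pstdev(scores.values()) or 1.0
        for doc, s in scores.items():         # inter-source z-scoring
            fused[(name, doc)] = (s - mu) / sigma
    merged = sorted(fused, key=fused.get, reverse=True)
    return [doc for (_, doc) in merged[:top_m]]
```

Z-scoring makes a "strongly preferred within its source" document from a small source comparable to one from a large source, which is the point of the second stage.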
Empirical studies on FEVER, Climate-FEVER, and SciFact demonstrate that hierarchical fusion—both across rankers and across sources—yields improvements of up to 3.6 F1 points on in-domain data and more robust out-of-domain generalization, with all gains achieved strictly at inference time, without model retraining.
7. Practical Applications and Impact
FusionRAG has been deployed and assessed in several domains, including enterprise product QA (e.g., Infineon Technologies), fact verification, and multi-hop question answering. Advantages include increased answer completeness, higher faithfulness with reduced propensity for factual errors, and substantial efficiency gains in production-scale LLM deployments. Its modularity enables adaptation to use multiple LLM types, retrievers, and retrieval sources without bespoke model finetuning, promoting wide applicability in scalable and safety-sensitive NLP applications (Rackauckas, 2024, Rackauckas et al., 2024, Santra et al., 2 Sep 2025, Wang et al., 19 Jan 2026).
A plausible implication is that FusionRAG provides a practical blueprint for next-generation RAG systems needing both high-quality answer synthesis and production-grade computational efficiency. However, optimal tradeoffs between completeness, precision, and latency remain application- and parameter-dependent.
Key References:
- (Rackauckas, 2024) RAG-Fusion: A New Take on Retrieval-Augmented Generation
- (Rackauckas et al., 2024) Evaluating RAG-Fusion with RAGElo: An Automated Elo-based Framework
- (Santra et al., 2 Sep 2025) HF-RAG: Hierarchical Fusion-based RAG with Multiple Sources and Rankers
- (Wang et al., 19 Jan 2026) From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation