WildGraphBench: Benchmarking GraphRAG Systems
- WildGraphBench is a benchmark designed for evaluating GraphRAG systems using unstructured, web-scale Wikipedia citations.
- It systematically assesses single-fact QA, multi-fact QA, and section-level summarization tasks through structured citation graphs.
- Empirical findings reveal trade-offs between flat and graph-based methods, highlighting challenges in realistic, noisy data retrieval.
WildGraphBench is a benchmark specifically designed to evaluate Graph-based Retrieval-Augmented Generation (GraphRAG) systems under realistic, heterogeneous, and large-scale conditions. Unlike prior benchmarks that rely on short, curated passages, WildGraphBench operates over unstructured, web-scale corpora derived from Wikipedia’s external citations, simulating retrieval and reasoning “in the wild.” This resource systematically exposes the strengths and weaknesses of existing GraphRAG methodologies, spanning question answering and summarization across varying levels of evidence complexity (Wang et al., 2 Feb 2026).
1. Corpus Construction and Dataset Characteristics
WildGraphBench is constructed by sampling Wikipedia articles from twelve major top-level domains: Culture, Geography, Health, History, Human Activities, Mathematics, Nature, People, Philosophy, Religion, Society, and Technology. For each domain, pages with the highest citation density are selected to ensure a reference corpus with maximal heterogeneity and noise. All external URLs referenced in the Wikipedia articles are harvested, and the complete textual content is fetched—including boilerplate elements, navigation, advertisements, PDF or scanned materials—without manual cleaning or segment trimming. This design enforces realism by retaining difficult artifacts inherent to web-scale retrieval.
Ground-truth facts are extracted by identifying every sentence in each Wikipedia leaf section (the finest-grained heading level) marked by one or more inline citation markers. Sentences are normalized via LLMs to remove footnotes and resolve coreference, producing triples of the form (leaf section, normalized statement, cited reference URLs).
Only triples for which all referenced URLs are successfully fetched are retained.
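The extraction step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: it assumes citation markers rendered as bracketed numerals (e.g. `[3]`) and uses a naive sentence splitter; real Wikipedia markup would need a proper parser, and the LLM normalization step is omitted.

```python
import re

def extract_cited_sentences(section_text: str) -> list[tuple[str, list[int]]]:
    """Return (sentence, citation_ids) pairs for sentences carrying at least
    one inline citation marker such as [3] or [3][7]. Simplified sketch."""
    # Naive split: break after terminal punctuation or a closing marker bracket.
    sentences = re.split(r"(?<=[.!?\]])\s+", section_text)
    results = []
    for sent in sentences:
        ids = [int(m) for m in re.findall(r"\[(\d+)\]", sent)]
        if ids:
            # Strip the markers to leave clean statement text.
            clean = re.sub(r"\[\d+\]", "", sent).strip()
            results.append((clean, ids))
    return results

text = "Paris is the capital of France.[1] It lies on the Seine.[2][3] No marker here."
print(extract_cited_sentences(text))
# → [('Paris is the capital of France.', [1]), ('It lies on the Seine.', [2, 3])]
```

Sentences without any marker (and, in the benchmark, triples whose URLs fail to fetch) are simply dropped.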
Dataset statistics:
| Metric | Value |
|---|---|
| Wikipedia seed articles | ≈150 (12 domains × 10–15) |
| Total external references (tokens) | ≈2.4 million |
| Gold statements (triples) | 1,197 |
| Single-fact QA pairs | 667 |
| Multi-fact QA pairs | 191 |
| Section-level summarization Qs | 339 |
2. Formal Task Definitions and Complexity
WildGraphBench comprises three distinct tasks of increasing complexity, each directly grounded in the structure of its corpus and the citation graph:
- Single-fact QA: Given a query $q$ whose unique answer $a$ is supported by a single reference ($|R(q)| = 1$), the system retrieves a subset $C_k$ (top-$k$ chunks) and generates a response $\hat{a}$. Accuracy is assessed as $\mathrm{Acc} = \mathbb{1}[\hat{a} \models a]$, where "$\models$" denotes LLM-judged factual entailment.
- Multi-fact QA: Here the answer $a$ requires aggregation across multiple references ($|R(q)| > 1$), with no single page providing full support. The system output $\hat{a}$ is assessed with the same entailment criterion $\mathbb{1}[\hat{a} \models a]$.
- Section-level Summarization: Given all gold statements $S = \{s_1, \dots, s_m\}$ for a leaf section, the system retrieves chunks and generates a free-form summary $\hat{y}$. Predicted statements $\hat{S}$ are extracted from $\hat{y}$ and matched via:

$$\mathrm{Recall} = \frac{1}{|S|} \sum_{s \in S} \max_{\hat{s} \in \hat{S}} m(\hat{s}, s), \qquad \mathrm{Precision} = \frac{1}{|\hat{S}|} \sum_{\hat{s} \in \hat{S}} \max_{s \in S} m(\hat{s}, s),$$

where $m(\hat{s}, s) = 1$ if $\hat{s}$ correctly paraphrases $s$, $0$ otherwise. F1 is the harmonic mean of precision and recall.
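The statement-matching metric can be sketched directly from these definitions. In the benchmark the paraphrase judgment is made by an LLM; here `match` is any user-supplied 0/1 callable (exact string match is used below purely for illustration).

```python
def statement_f1(gold, predicted, match):
    """Statement-level recall, precision, and F1.

    `match(pred, gold)` returns 1 if the predicted statement correctly
    paraphrases the gold statement, else 0. Each gold statement counts as
    recalled if any prediction matches it, and vice versa for precision."""
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    recall = sum(max(match(p, g) for p in predicted) for g in gold) / len(gold)
    precision = sum(max(match(p, g) for g in gold) for p in predicted) / len(predicted)
    f1 = 0.0 if recall + precision == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

exact = lambda a, b: int(a == b)
print(statement_f1(["x", "y"], ["x", "z"], exact))  # → (0.5, 0.5, 0.5)
```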
Examples and challenges for each task highlight the need for precise localization in single-fact QA, cross-document aggregation in multi-fact, and breadth with faithfulness in summarization.
3. Evaluation Methodology and Protocol
Corpus Indexing: All methods index the retrieval corpus in overlapping 1,200-token chunks (100-token overlap). For QA, the top $k = 5$ chunks are retrieved; for summarization, $k = 10$.
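The sliding-window chunking scheme is straightforward; a minimal sketch (tokenization itself is left to the caller, and the exact stride handling in the benchmark's code is assumed, not confirmed):

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token sequence into overlapping windows: each chunk holds up
    to `size` tokens and shares `overlap` tokens with its predecessor."""
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reached the end
            break
    return chunks

toks = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(toks)
print(len(chunks), len(chunks[0]))  # → 3 1200
```

With 1,200-token windows and a 100-token overlap, consecutive chunks share their boundary region, so a fact straddling a chunk edge still appears whole in at least one chunk.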
Baseline and GraphRAG Pipelines:
- NaiveRAG (Flat BM25 + GPT-4o-mini): Concatenates the top-$k$ retrieved chunks into the LLM prompt.
- GraphRAG Variants:
- Microsoft GraphRAG: Local-to-global graph expansion and aggregation.
- HippoRAG2: PPR-style propagation for node ranking.
- LightRAG: Alternates between dense embedding retrieval and entity–relation walks.
- LinearRAG: Linear complexity propagation for scalability.
- Fast-GraphRAG: Streamlined node representations for indexing speed.
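To make the PPR-style propagation used by HippoRAG2 concrete, here is a generic personalized PageRank power iteration over an adjacency-list graph. This is a textbook sketch, not HippoRAG2's actual implementation: teleport mass returns to the seed (query-linked) nodes, so scores concentrate on entities near the query.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=50):
    """Personalized PageRank by power iteration.

    adj   : dict mapping node -> list of out-neighbors
    seeds : set of query-linked nodes receiving all teleport mass
    Returns a dict of node -> score (scores sum to 1 if no dangling nodes)."""
    nodes = list(adj)
    restart = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Teleport component: (1 - alpha) of the mass returns to the seeds.
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue
            share = alpha * rank[n] / len(out)  # spread mass over neighbors
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = personalized_pagerank(graph, seeds={"a"})
print(scores["a"] > scores["c"])  # → True: the seed outranks the far node
```

Retrieval then takes the chunks or entities attached to the highest-scoring nodes, which is what makes such methods sensitive to the hub-heavy topology discussed below.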
Evaluation Metrics:
- QA: Single-fact accuracy and multi-fact accuracy, both via LLM-judged entailment.
- Summarization: Statement-level recall, precision, F1.
- Graph Connectivity: Fraction of isolated nodes, average node degree.
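The graph-connectivity metrics are simple degree statistics; a minimal sketch over an undirected edge list (node and edge representations are assumptions for illustration):

```python
def connectivity_stats(num_nodes, edges):
    """Isolated-node fraction, average degree, and max degree for an
    undirected graph given as (u, v) pairs over nodes 0..num_nodes-1."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    isolated_frac = sum(d == 0 for d in degree) / num_nodes
    avg_degree = sum(degree) / num_nodes
    return isolated_frac, avg_degree, max(degree)

# Toy graph: node 3 is isolated.
print(connectivity_stats(4, [(0, 1), (1, 2), (0, 2)]))  # → (0.25, 1.5, 2)
```

These are the quantities reported in Section 5 when comparing WildGraphBench's topology to other benchmarks.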
All answer generation uses GPT-4o-mini; answer correctness is judged by GPT-5-mini. Retrieval runs on 8 × A100 GPUs, with GPT endpoints used for generation and grading.
4. Baselines, Hyperparameters, and Experimental Setup
Methods compared include:
| Method | Retrieval Backbone | Aggregation Type |
|---|---|---|
| NaiveRAG | BM25 | Flat concatenation |
| Microsoft GraphRAG (global) | Chunk/citation graph | Hierarchical summarization |
| LightRAG | Entity–relation graph + dense vectors | Hybrid retrieval |
| HippoRAG2 | Flat retrieval with memory-augmented PPR | Graph propagation |
| LinearRAG | Fast linear graph propagation | Optimized for scalability |
| Fast-GraphRAG | Simplified indexing | Speed-tuned |
Chunk size is fixed at 1,200 tokens. The retrieval budget top_k is 5 for QA and 10 for summarization (the setting at which summary F1 peaks for HippoRAG2). Graph aggregation budgets follow the original configurations from prior work.
5. Empirical Findings and Analyses
Performance Overview
| Method | Single-fact Acc | Multi-fact Acc | Summary F1 |
|---|---|---|---|
| NaiveRAG | 66.9% | 35.1% | 15.8% |
| BM25 | 41.4% | 20.9% | 12.7% |
| Microsoft GraphRAG (global) | 56.5% | 47.6% | 13.8% |
| LightRAG (hybrid) | 61.3% | 40.8% | 14.6% |
| HippoRAG2 | 71.5% | 39.3% | 13.4% |
Key patterns:
- Flat retrieval variants (NaiveRAG, BM25) are competitive on single-fact QA, where keyword matching alone often suffices.
- Global aggregation via GraphRAG shows clear gains (up to 47.6% multi-fact accuracy), indicating structured graph traversal’s advantage in coordinated evidence aggregation.
- All methods underperform on section-level summarization (F1 < 16%). NaiveRAG achieves the highest recall and F1, suggesting the limitations of graph-based filtering when broad recall is required under context constraints.
Graph Structure and Heterogeneity
Compared to benchmarks like HotpotQA, UltraDomain, and GraphRAG-Bench, WildGraphBench features:
- Lowest isolated node fraction (0.14)
- Highest average node degree (3.11)
- Maximum degree up to 967
This highly “hub-and-spoke” topology, where entities recur across many noisy sources, complicates meaningful evidence aggregation and increases the difficulty of traversal-based methods.
Retrieval Budget Effects
Ablations with HippoRAG2 show that summary F1 follows an inverted-U as top_k increases, peaking at an intermediate budget: excessive retrieval introduces distractors, while too few chunks reduce recall.
6. Interpretations and Future Research Directions
WildGraphBench confirms that GraphRAG is most effective for multi-fact, cross-document QA in truly wild corpora. However, for single-fact queries, graph-based overhead yields little benefit over flat retrieval. Summarization remains a prominent challenge, with all current pipelines underperforming due to the trade-off between evidence breadth and precision.
Future research avenues include:
- Robust graph construction (advanced entity linking, noise-tolerant edge reasoning)
- Adaptive retrieval budgets based on per-query requirements
- Hybrid approaches blending graph-guided selection with flat retrieval for optimal precision–recall trade-off under fixed prompt budgets
- Benchmark extensions targeting temporal reasoning or multimedia retrieval to further push open-domain system capabilities.
The benchmark’s design—anchoring gold facts in citation links but requiring reasoning over uncurated web documents—offers a demanding evaluation regime that is more representative of real-world deployment scenarios for GraphRAG systems (Wang et al., 2 Feb 2026).