
WildGraphBench: Benchmarking GraphRAG Systems

Updated 8 February 2026
  • WildGraphBench is a benchmark designed for evaluating GraphRAG systems using unstructured, web-scale Wikipedia citations.
  • It systematically assesses single-fact QA, multi-fact QA, and section-level summarization tasks through structured citation graphs.
  • Empirical findings reveal trade-offs between flat and graph-based methods, highlighting challenges in realistic, noisy data retrieval.

WildGraphBench is a benchmark specifically designed to evaluate Graph-based Retrieval-Augmented Generation (GraphRAG) systems under realistic, heterogeneous, and large-scale conditions. Unlike prior benchmarks that rely on short, curated passages, WildGraphBench operates over unstructured, web-scale corpora derived from Wikipedia’s external citations, simulating retrieval and reasoning “in the wild.” This resource systematically exposes the strengths and weaknesses of existing GraphRAG methodologies, spanning question answering and summarization across varying levels of evidence complexity (Wang et al., 2 Feb 2026).

1. Corpus Construction and Dataset Characteristics

WildGraphBench is constructed by sampling Wikipedia articles from twelve major top-level domains: Culture, Geography, Health, History, Human Activities, Mathematics, Nature, People, Philosophy, Religion, Society, and Technology. For each domain, pages with the highest citation density are selected to ensure a reference corpus with maximal heterogeneity and noise. All external URLs referenced in the Wikipedia articles are harvested, and the complete textual content is fetched—including boilerplate elements, navigation, advertisements, PDF or scanned materials—without manual cleaning or segment trimming. This design enforces realism by retaining difficult artifacts inherent to web-scale retrieval.

Ground-truth facts are extracted by identifying every sentence in each Wikipedia leaf section (the finest-grained heading level) marked by one or more inline citation markers. Sentences are algorithmically normalized via LLMs to remove footnotes and resolve coreference, producing triples of the form:

$$T = (\text{statement}, \text{ref\_urls}, \text{ref\_count})$$

Only triples for which all referenced URLs are successfully fetched are retained.
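The triple construction and URL-availability filter described above can be sketched as follows (illustrative only; the `Triple` fields and `retain` helper are assumptions, not the benchmark's actual code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A gold fact anchored in Wikipedia's inline citations."""
    statement: str              # LLM-normalized sentence (footnotes stripped, coreference resolved)
    ref_urls: tuple[str, ...]   # external URLs cited by the sentence

    @property
    def ref_count(self) -> int:
        return len(self.ref_urls)

def retain(triples, fetched_urls):
    """Keep only triples whose cited URLs were ALL fetched successfully."""
    fetched = set(fetched_urls)
    return [t for t in triples if all(u in fetched for u in t.ref_urls)]
```

A triple with `ref_count == 1` seeds a single-fact QA item, while `ref_count >= 2` seeds a multi-fact item.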

Dataset statistics:

| Metric | Value |
|---|---|
| Wikipedia seed articles | ≈150 (12 domains × 10–15) |
| Total external references | ≈2.4 million tokens |
| Gold statements (triples) | 1,197 |
| Single-fact QA pairs | 667 |
| Multi-fact QA pairs | 191 |
| Section-level summarization questions | 339 |

2. Formal Task Definitions and Complexity

WildGraphBench comprises three distinct tasks of increasing complexity, each directly grounded in the structure of its corpus and the citation graph:

  • Single-fact QA: Given a query $q_1$ whose unique answer $s^* \in G$ is supported by a single reference ($\text{ref\_count} = 1$), the system retrieves a subset $R \subseteq C$ (top $k=5$ chunks) and generates a response $\hat{y}$. Accuracy is assessed as $Acc_1 = \mathbb{1}[\hat{y} \equiv s^*]$, where $\equiv$ denotes LLM-judged factual entailment.
  • Multi-fact QA: Here $s^* \in G$ requires aggregation across multiple references ($\text{ref\_count} \geq 2$), with no single page providing full support. The system output $\hat{y}$ is assessed with $Acc_2 = \mathbb{1}[\hat{y} \equiv s^*]$.
  • Section-level Summarization: Given all gold statements $S^* = \{s_1, \dots, s_m\}$ for a leaf section, the system retrieves $k=10$ chunks and generates a free-form summary $a$. Predicted statements $\hat{H} = \{\hat{h}_1, \dots, \hat{h}_p\}$ are extracted from $a$ and matched via:

$$\text{Recall} = \frac{1}{|S^*|} \sum_{s \in S^*} \max_{\hat{h} \in \hat{H}} \text{Match}(s, \hat{h}),$$

$$\text{Precision} = \frac{1}{|\hat{H}|} \sum_{\hat{h} \in \hat{H}} \max_{s \in S^*} \text{Match}(s, \hat{h}),$$

where $\text{Match}(s, \hat{h}) = 1$ if $\hat{h}$ correctly paraphrases $s$, and $0$ otherwise. F1 is the harmonic mean of recall and precision.
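Given a binary match predicate, the statement-level metrics above reduce to a few lines. A minimal sketch (in the benchmark, `match` is an LLM judgment; here it is any callable returning a boolean):

```python
def statement_f1(gold, predicted, match):
    """Statement-level recall/precision/F1 from a binary match(s, h) predicate.

    gold:      list of gold statements S*
    predicted: list of extracted statements H-hat
    match:     callable (s, h) -> bool, True if h correctly paraphrases s
    """
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    # Recall: fraction of gold statements matched by at least one prediction.
    recall = sum(any(match(s, h) for h in predicted) for s in gold) / len(gold)
    # Precision: fraction of predictions matching at least one gold statement.
    precision = sum(any(match(s, h) for s in gold) for h in predicted) / len(predicted)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```

Because `max` over a 0/1 match matrix equals `any`, this is term-by-term equivalent to the formulas above.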

Examples and challenges for each task highlight the need for precise localization in single-fact QA, cross-document aggregation in multi-fact, and breadth with faithfulness in summarization.

3. Evaluation Methodology and Protocol

Corpus Indexing: All methods index the retrieval corpus $C$ in overlapping 1,200-token chunks (100-token overlap). For QA, $k=5$ chunks are retrieved; for summarization, $k=10$.
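The chunking scheme (1,200-token windows with a 100-token overlap) implies a stride of 1,100 tokens. A minimal sketch over a pre-tokenized document (the tokenizer itself is out of scope and assumed given):

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token list into overlapping windows: stride = size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window already reaches the end
            break
    return chunks
```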

Baseline and GraphRAG Pipelines:

  • NaiveRAG (Flat BM25 + GPT-4o-mini): Concatenates top-$k$ chunks into the LLM prompt.
  • GraphRAG Variants:
    • Microsoft GraphRAG: Local-to-global graph expansion and aggregation.
    • HippoRAG2: PPR-style propagation for node ranking.
    • LightRAG: Alternates between dense embedding retrieval and entity–relation walks.
    • LinearRAG: Linear complexity propagation for scalability.
    • Fast-GraphRAG: Streamlined node representations for indexing speed.

Evaluation Metrics:

  • QA: Single-fact accuracy ($Acc_1$), multi-fact accuracy ($Acc_2$).
  • Summarization: Statement-level recall, precision, F1.
  • Graph Connectivity: Fraction of isolated nodes, average node degree.

All answer generation uses GPT-4o-mini; answer correctness is judged by GPT-5-mini. Evaluation is conducted on 8 × A100 GPUs for retrieval; GPT endpoints for generation and grading.

4. Baselines, Hyperparameters, and Experimental Setup

Methods compared include:

| Method | Retrieval Backbone | Aggregation Type |
|---|---|---|
| NaiveRAG | BM25 | Flat concatenation |
| Microsoft GraphRAG (global) | Chunk/citation graph | Hierarchical summarization |
| LightRAG | Entity–relation graph + dense vectors | Hybrid retrieval |
| HippoRAG2 | Flat retrieval with memory-augmented PPR | Graph propagation |
| LinearRAG | Fast linear graph propagation | Optimized for scalability |
| Fast-GraphRAG | Simplified indexing | Speed-tuned |

Chunk size is fixed at 1,200 tokens. The retrieval budget top_k is 5 for QA and 10 for summaries (F1 peaks at $k=8$ for summaries with HippoRAG2). Graph aggregation budgets follow the original configurations from prior works.

5. Empirical Findings and Analyses

Performance Overview

| Method | Single-fact Acc | Multi-fact Acc | Summary F1 |
|---|---|---|---|
| NaiveRAG | 66.9% | 35.1% | 15.8% |
| BM25 | 41.4% | 20.9% | 12.7% |
| Microsoft GraphRAG (global) | 56.5% | 47.6% | 13.8% |
| LightRAG (hybrid) | 61.3% | 40.8% | 14.6% |
| HippoRAG2 | 71.5% | 39.3% | 13.4% |

Key patterns:

  • Flat retrieval variants (NaiveRAG, BM25) are competitive on single-fact QA, where keyword matching typically suffices.
  • Global aggregation via GraphRAG shows clear gains (up to 47.6% multi-fact accuracy), indicating structured graph traversal’s advantage in coordinated evidence aggregation.
  • All methods underperform on section-level summarization (F1 < 16%). NaiveRAG achieves the highest recall and F1, suggesting the limitations of graph-based filtering when broad recall is required under context constraints.

Graph Structure and Heterogeneity

Compared to benchmarks like HotpotQA, UltraDomain, and GraphRAG-Bench, WildGraphBench features:

  • Lowest isolated node fraction (0.14)
  • Highest average node degree (3.11)
  • Maximum degree up to 967

This highly “hub-and-spoke” topology, where entities recur across many noisy sources, complicates meaningful evidence aggregation and increases the difficulty of traversal-based methods.
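The connectivity statistics cited above (isolated-node fraction, average degree) follow directly from an edge list. A sketch assuming an undirected entity graph:

```python
from collections import defaultdict

def connectivity_stats(nodes, edges):
    """Return (isolated-node fraction, average degree) of an undirected graph."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    isolated = sum(1 for n in nodes if degree[n] == 0)
    avg_degree = sum(degree[n] for n in nodes) / len(nodes)
    return isolated / len(nodes), avg_degree
```

On WildGraphBench's graph, these values come out to 0.14 and 3.11 respectively, per the figures reported above.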

Retrieval Budget Effects

Ablations with HippoRAG2 show summary F1 follows an inverted-U as top_k increases, peaking at $k=8$. Excessive retrieval introduces distractors; too few chunks reduce recall.
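A retrieval-budget ablation of this kind amounts to a sweep over k, keeping the budget that maximizes F1. A sketch, where the hypothetical `evaluate_f1` stands in for a full retrieve–summarize–match pipeline:

```python
def sweep_top_k(evaluate_f1, ks=range(2, 16, 2)):
    """Evaluate summary F1 at each retrieval budget and return the best k.

    evaluate_f1: callable k -> float, F1 of the pipeline run with top-k retrieval.
    """
    scores = {k: evaluate_f1(k) for k in ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

With an inverted-U F1 curve like the one reported, the sweep recovers the interior peak rather than the largest budget.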

6. Interpretations and Future Research Directions

WildGraphBench confirms that GraphRAG is most effective for multi-fact, cross-document QA in truly wild corpora. However, for single-fact queries, graph-based overhead yields little benefit over flat retrieval. Summarization remains a prominent challenge, with all current pipelines underperforming due to the trade-off between evidence breadth and precision.

Future research avenues include:

  • Robust graph construction (advanced entity linking, noise-tolerant edge reasoning)
  • Adaptive retrieval budgets based on per-query requirements
  • Hybrid approaches blending graph-guided selection with flat retrieval for optimal precision–recall trade-off under fixed prompt budgets
  • Benchmark extensions targeting temporal reasoning or multimedia retrieval to further push open-domain system capabilities.

The benchmark’s design—anchoring gold facts in citation links but requiring reasoning over uncurated web documents—offers a demanding evaluation regime that is more representative of real-world deployment scenarios for GraphRAG systems (Wang et al., 2 Feb 2026).
