WildGraphBench: Benchmarking GraphRAG Systems
- WildGraphBench is a benchmark designed for evaluating GraphRAG systems using unstructured, web-scale Wikipedia citations.
- It systematically assesses single-fact QA, multi-fact QA, and section-level summarization tasks through structured citation graphs.
- Empirical findings reveal trade-offs between flat and graph-based methods, highlighting challenges in realistic, noisy data retrieval.
WildGraphBench is a benchmark specifically designed to evaluate Graph-based Retrieval-Augmented Generation (GraphRAG) systems under realistic, heterogeneous, and large-scale conditions. Unlike prior benchmarks that rely on short, curated passages, WildGraphBench operates over unstructured, web-scale corpora derived from Wikipedia’s external citations, simulating retrieval and reasoning “in the wild.” This resource systematically exposes the strengths and weaknesses of existing GraphRAG methodologies, spanning question answering and summarization across varying levels of evidence complexity (Wang et al., 2 Feb 2026).
1. Corpus Construction and Dataset Characteristics
WildGraphBench is constructed by sampling Wikipedia articles from twelve major top-level domains: Culture, Geography, Health, History, Human Activities, Mathematics, Nature, People, Philosophy, Religion, Society, and Technology. For each domain, pages with the highest citation density are selected to ensure a reference corpus with maximal heterogeneity and noise. All external URLs referenced in the Wikipedia articles are harvested, and the complete textual content is fetched—including boilerplate elements, navigation, advertisements, PDF or scanned materials—without manual cleaning or segment trimming. This design enforces realism by retaining difficult artifacts inherent to web-scale retrieval.
Ground-truth facts are extracted by identifying every sentence in each Wikipedia leaf section (the finest-grained heading level) marked by one or more inline citation markers. Sentences are normalized via LLMs to remove footnotes and resolve coreference, producing triples of the form (leaf section, normalized statement, cited reference URLs).
Only triples for which all referenced URLs are successfully fetched are retained.
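The extraction step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: it assumes citation markers rendered as bracketed numerals (e.g. `[3]`) and uses a naive sentence splitter; real Wikipedia markup would need a proper parser, and the LLM normalization step is omitted.

```python
import re

def extract_cited_sentences(section_text: str) -> list[tuple[str, list[int]]]:
    """Return (sentence, citation_ids) pairs for sentences carrying at least
    one inline citation marker such as [3] or [3][7]. Simplified sketch."""
    # Naive split: break after terminal punctuation or a closing marker bracket.
    sentences = re.split(r"(?<=[.!?\]])\s+", section_text)
    results = []
    for sent in sentences:
        ids = [int(m) for m in re.findall(r"\[(\d+)\]", sent)]
        if ids:
            # Strip the markers to leave clean statement text.
            clean = re.sub(r"\[\d+\]", "", sent).strip()
            results.append((clean, ids))
    return results

text = "Paris is the capital of France.[1] It lies on the Seine.[2][3] No marker here."
print(extract_cited_sentences(text))
# → [('Paris is the capital of France.', [1]), ('It lies on the Seine.', [2, 3])]
```

Sentences without any marker (and, in the benchmark, triples whose URLs fail to fetch) are simply dropped.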
Dataset statistics:
| Metric | Value |
|---|---|
| Wikipedia seed articles | ≈150 (12 domains × 10–15) |
| Total external references (tokens) | ≈2.4 million |
| Gold statements (triples) | 1,197 |
| Single-fact QA pairs | 667 |
| Multi-fact QA pairs | 191 |
| Section-level summarization Qs | 339 |
2. Formal Task Definitions and Complexity
WildGraphBench comprises three distinct tasks of increasing complexity, each directly grounded in the structure of its corpus and the citation graph:
- Single-fact QA: Given a query $q$ whose unique answer $a$ is supported by a single reference ($|R(q)| = 1$), the system retrieves a subset $C_k$ (top-$k$ chunks) and generates a response $\hat{a}$. Accuracy is assessed as $\mathrm{Acc} = \mathbb{1}[\hat{a} \models a]$, where "$\models$" denotes LLM-judged factual entailment.
- Multi-fact QA: Here the answer $a$ requires aggregation across multiple references ($|R(q)| > 1$), with no single page providing full support. The system output $\hat{a}$ is assessed with the same entailment criterion $\mathbb{1}[\hat{a} \models a]$.
- Section-level Summarization: Given all gold statements $S = \{s_1, \dots, s_m\}$ for a leaf section, the system retrieves chunks and generates a free-form summary $\hat{y}$. Predicted statements $\hat{S}$ are extracted from $\hat{y}$ and matched via:

$$\mathrm{Recall} = \frac{1}{|S|} \sum_{s \in S} \max_{\hat{s} \in \hat{S}} m(\hat{s}, s), \qquad \mathrm{Precision} = \frac{1}{|\hat{S}|} \sum_{\hat{s} \in \hat{S}} \max_{s \in S} m(\hat{s}, s),$$

where $m(\hat{s}, s) = 1$ if $\hat{s}$ correctly paraphrases $s$, $0$ otherwise. F1 is the harmonic mean of precision and recall.
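The statement-matching metric can be sketched directly from these definitions. In the benchmark the paraphrase judgment is made by an LLM; here `match` is any user-supplied 0/1 callable (exact string match is used below purely for illustration).

```python
def statement_f1(gold, predicted, match):
    """Statement-level recall, precision, and F1.

    `match(pred, gold)` returns 1 if the predicted statement correctly
    paraphrases the gold statement, else 0. Each gold statement counts as
    recalled if any prediction matches it, and vice versa for precision."""
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    recall = sum(max(match(p, g) for p in predicted) for g in gold) / len(gold)
    precision = sum(max(match(p, g) for g in gold) for p in predicted) / len(predicted)
    f1 = 0.0 if recall + precision == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

exact = lambda a, b: int(a == b)
print(statement_f1(["x", "y"], ["x", "z"], exact))  # → (0.5, 0.5, 0.5)
```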
Examples and challenges for each task highlight the need for precise localization in single-fact QA, cross-document aggregation in multi-fact, and breadth with faithfulness in summarization.
3. Evaluation Methodology and Protocol
Corpus Indexing: All methods index the retrieval corpus in overlapping 1,200-token chunks (100-token overlap). For QA, the top $k = 5$ chunks are retrieved; for summarization, $k = 10$.
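The sliding-window chunking scheme is straightforward; a minimal sketch (tokenization itself is left to the caller, and the exact stride handling in the benchmark's code is assumed, not confirmed):

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token sequence into overlapping windows: each chunk holds up
    to `size` tokens and shares `overlap` tokens with its predecessor."""
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reached the end
            break
    return chunks

toks = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(toks)
print(len(chunks), len(chunks[0]))  # → 3 1200
```

With 1,200-token windows and a 100-token overlap, consecutive chunks share their boundary region, so a fact straddling a chunk edge still appears whole in at least one chunk.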
Baseline and GraphRAG Pipelines:
- NaiveRAG (Flat BM25 + GPT-4o-mini): Concatenates the top-$k$ retrieved chunks into the LLM prompt.
- GraphRAG Variants:
- Microsoft GraphRAG: Local-to-global graph expansion and aggregation.
- HippoRAG2: PPR-style propagation for node ranking.
- LightRAG: Alternates between dense embedding retrieval and entity–relation walks.
- LinearRAG: Linear complexity propagation for scalability.
- Fast-GraphRAG: Streamlined node representations for indexing speed.
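To make the PPR-style propagation used by HippoRAG2 concrete, here is a generic personalized PageRank power iteration over an adjacency-list graph. This is a textbook sketch, not HippoRAG2's actual implementation: teleport mass returns to the seed (query-linked) nodes, so scores concentrate on entities near the query.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=50):
    """Personalized PageRank by power iteration.

    adj   : dict mapping node -> list of out-neighbors
    seeds : set of query-linked nodes receiving all teleport mass
    Returns a dict of node -> score (scores sum to 1 if no dangling nodes)."""
    nodes = list(adj)
    restart = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Teleport component: (1 - alpha) of the mass returns to the seeds.
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue
            share = alpha * rank[n] / len(out)  # spread mass over neighbors
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = personalized_pagerank(graph, seeds={"a"})
print(scores["a"] > scores["c"])  # → True: the seed outranks the far node
```

Retrieval then takes the chunks or entities attached to the highest-scoring nodes, which is what makes such methods sensitive to the hub-heavy topology discussed below.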
Evaluation Metrics:
- QA: Single-fact accuracy and multi-fact accuracy, both via LLM-judged entailment.
- Summarization: Statement-level recall, precision, F1.
- Graph Connectivity: Fraction of isolated nodes, average node degree.
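The graph-connectivity metrics are simple degree statistics; a minimal sketch over an undirected edge list (node and edge representations are assumptions for illustration):

```python
def connectivity_stats(num_nodes, edges):
    """Isolated-node fraction, average degree, and max degree for an
    undirected graph given as (u, v) pairs over nodes 0..num_nodes-1."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    isolated_frac = sum(d == 0 for d in degree) / num_nodes
    avg_degree = sum(degree) / num_nodes
    return isolated_frac, avg_degree, max(degree)

# Toy graph: node 3 is isolated.
print(connectivity_stats(4, [(0, 1), (1, 2), (0, 2)]))  # → (0.25, 1.5, 2)
```

These are the quantities reported in Section 5 when comparing WildGraphBench's topology to other benchmarks.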
All answer generation uses GPT-4o-mini; answer correctness is judged by GPT-5-mini. Retrieval runs on 8 × A100 GPUs, with GPT endpoints used for generation and grading.
4. Baselines, Hyperparameters, and Experimental Setup
Methods compared include:
| Method | Retrieval Backbone | Aggregation Type |
|---|---|---|
| NaiveRAG | BM25 | Flat concatenation |
| Microsoft GraphRAG (global) | Chunk/citation graph | Hierarchical summarization |
| LightRAG | Entity–relation graph + dense vectors | Hybrid retrieval |
| HippoRAG2 | Flat retrieval with memory-augmented PPR | Graph propagation |
| LinearRAG | Fast linear graph propagation | Optimized for scalability |
| Fast-GraphRAG | Simplified indexing | Speed-tuned |
Chunk size is fixed at 1,200 tokens. The retrieval budget top_k is 5 for QA and 10 for summarization (the setting at which summary F1 peaks for HippoRAG2). Graph aggregation budgets follow the original configurations from prior work.
5. Empirical Findings and Analyses
Performance Overview
| Method | Single-fact Acc | Multi-fact Acc | Summary F1 |
|---|---|---|---|
| NaiveRAG | 66.9% | 35.1% | 15.8% |
| BM25 | 41.4% | 20.9% | 12.7% |
| Microsoft GraphRAG (global) | 56.5% | 47.6% | 13.8% |
| LightRAG (hybrid) | 61.3% | 40.8% | 14.6% |
| HippoRAG2 | 71.5% | 39.3% | 13.4% |
Key patterns:
- Flat retrieval variants (NaiveRAG, BM25) are competitive on single-fact QA, where keyword matching alone often suffices.
- Global aggregation via GraphRAG shows clear gains (up to 47.6% multi-fact accuracy), indicating structured graph traversal’s advantage in coordinated evidence aggregation.
- All methods underperform on section-level summarization (F1 < 16%). NaiveRAG achieves the highest recall and F1, suggesting the limitations of graph-based filtering when broad recall is required under context constraints.
Graph Structure and Heterogeneity
Compared to benchmarks like HotpotQA, UltraDomain, and GraphRAG-Bench, WildGraphBench features:
- Lowest isolated node fraction (0.14)
- Highest average node degree (3.11)
- Maximum degree up to 967
This highly “hub-and-spoke” topology, where entities recur across many noisy sources, complicates meaningful evidence aggregation and increases the difficulty of traversal-based methods.
Retrieval Budget Effects
Ablations with HippoRAG2 show that summary F1 follows an inverted-U as top_k increases, peaking at an intermediate budget: excessive retrieval introduces distractors, while too few chunks reduce recall.
6. Interpretations and Future Research Directions
WildGraphBench confirms that GraphRAG is most effective for multi-fact, cross-document QA in truly wild corpora. However, for single-fact queries, graph-based overhead yields little benefit over flat retrieval. Summarization remains a prominent challenge, with all current pipelines underperforming due to the trade-off between evidence breadth and precision.
Future research avenues include:
- Robust graph construction (advanced entity linking, noise-tolerant edge reasoning)
- Adaptive retrieval budgets based on per-query requirements
- Hybrid approaches blending graph-guided selection with flat retrieval for optimal precision–recall trade-off under fixed prompt budgets
- Benchmark extensions targeting temporal reasoning or multimedia retrieval to further push open-domain system capabilities.
The benchmark’s design—anchoring gold facts in citation links but requiring reasoning over uncurated web documents—offers a demanding evaluation regime that is more representative of real-world deployment scenarios for GraphRAG systems (Wang et al., 2 Feb 2026).