LiveRAG 2025 Dataset: RAG Benchmark
- LiveRAG 2025 Dataset is a synthetic benchmark comprising 895 diverse QA pairs drawn from the extensive FineWeb-10BT corpus.
- It integrates dual-index retrieval methods and comprehensive meta-annotations, including IRT-based difficulty and discriminability metrics.
- The benchmark facilitates robust RAG system evaluation and curriculum learning across varied question formats and difficulty levels.
The LiveRAG 2025 Dataset is a synthetic benchmark designed to rigorously evaluate retrieval-augmented generation (RAG) systems at scale. Developed in conjunction with the SIGIR 2025 LiveRAG Challenge, the dataset establishes a diverse, multi-faceted testbed for RAG-based question answering (QA) under realistic web-scale information retrieval constraints. It synthesizes varied question–answer (QA) pairs anchored in the FineWeb-10BT corpus, includes calibrated difficulty and discriminability metrics via Item Response Theory (IRT), and provides extensive meta-annotations for systematic evaluation and research reproducibility (Carmel et al., 18 Nov 2025, Fensore et al., 27 Jun 2025).
1. Dataset Structure and Composition
The LiveRAG 2025 dataset, also referred to as the “LiveRAG Benchmark,” consists of 895 unique synthetic QA pairs. Each entry aligns a natural language question with a ground-truth answer and one or two supporting passages from FineWeb-10BT—a web-scale, filtered text corpus spanning ~10 billion tokens from heterogeneous genres including news, encyclopedic articles, technical content, and online forums (Fensore et al., 27 Jun 2025, Carmel et al., 18 Nov 2025).
Schema per entry:
- Question (string)
- Ground-truth Answer (string)
- Supporting Documents (FineWeb-10BT IDs and full text)
- Answer Claims (Direct / Useful / Useless labels)
- Session (First / Second / Both; for challenge rounds)
- DataMorgana configuration (eight categorical labels: Answer Type, Answer Style, Premise, Phrasing, Linguistic Variation, Politeness, Linguistic Correctness, User Persona)
- Average Correctness Score (ACS) and standard deviation from system runs
- IRT parameters: difficulty (bᵢ) and discriminability (aᵢ)
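The per-entry schema above can be sketched as a simple record type. The field names below are illustrative, not the official release schema:

```python
from dataclasses import dataclass

@dataclass
class LiveRAGEntry:
    """Illustrative record for one LiveRAG 2025 benchmark entry."""
    question: str
    answer: str
    supporting_doc_ids: list[str]        # FineWeb-10BT document IDs
    supporting_doc_texts: list[str]
    answer_claims: dict[str, list[str]]  # keys: "direct", "useful", "useless"
    session: str                         # "first" | "second" | "both"
    config: dict[str, str]               # eight DataMorgana categorical labels
    avg_correctness: float               # ACS over system runs
    correctness_std: float
    difficulty_b: float                  # IRT difficulty b_i
    discriminability_a: float            # IRT discriminability a_i
```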
Answer types and axes:
Questions span factoid, yes/no, definition, list, explanation, comparison, and multi-aspect categories. Subject areas encompass dozens of high-level topics and subtopics algorithmically generated and spanning wide topical breadth.
Passage segmentation and indexing:
- Documents are split into 512-token chunks (some pipelines use ~150–300-token chunks) with sentence-aware splitters.
- Two indices: OpenSearch BM25 (sparse) and Pinecone (dense, E5-base-v2 or BGE-M3 embeddings).
- Hybrid index: Top 30 chunks per method (BM25 and dense), scores min–max-normalized and fused to form a 60-candidate pool; top 10 are context for answer generation (Fensore et al., 27 Jun 2025).
- Chunks number in the millions, with preprocessing including HTML-to-text, boilerplate removal, and chunk overlaps (Shen et al., 23 Jul 2025).
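The hybrid fusion step above can be sketched as follows. This is a minimal sketch assuming equal weighting of the two retrievers and additive fusion of normalized scores; the challenge systems may differ in these details:

```python
def min_max_normalize(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_fuse(bm25_hits, dense_hits, top_k=10):
    """Fuse two ranked lists of (chunk_id, score) pairs.

    Each list is assumed to hold the top 30 chunks from one retriever,
    giving a pool of up to 60 candidates. Scores are min-max normalized
    per retriever, summed for chunks appearing in both lists, and the
    top_k fused chunks are returned as generation context.
    """
    fused = {}
    for hits in (bm25_hits, dense_hits):
        ids = [cid for cid, _ in hits]
        norm = min_max_normalize([s for _, s in hits])
        for cid, s in zip(ids, norm):
            fused[cid] = fused.get(cid, 0.0) + s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```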
| Component | Description | Reference |
|---|---|---|
| QA pairs | 895 synthetic Q–A pairs (DataMorgana) | (Carmel et al., 18 Nov 2025) |
| Corpus | FineWeb-10BT (~10B tokens, multi-domain, chunked) | (Fensore et al., 27 Jun 2025) |
| Indexing | Sparse (BM25), Dense (Pinecone, E5/BGE), Hybrid (fused top-k) | (Fensore et al., 27 Jun 2025) |
| Annotation | 8 categorical QA axes, ground truths, supporting docs | (Carmel et al., 18 Nov 2025) |
2. Question and Answer Generation Methodologies
Key generator: DataMorgana.
DataMorgana systematically produces synthetic questions via combinatorial templates and varying "user personas" by sampling axes such as Answer Type, Answer Style, Premise, Phrasing, Linguistic Variation, Politeness, Linguistic Correctness, and User Persona (Carmel et al., 18 Nov 2025, Fensore et al., 27 Jun 2025). For each QA pair:
- Topics and subtopics are proposed by LLM (Claude 3.5-Sonnet).
- Candidate supporting documents are retrieved via dense search over FineWeb-10BT.
- Documents are filtered for factuality, interest, credibility, non-toxicity, and freshness (~38% acceptance rate).
- For multi-aspect and comparison questions, complementary documents are paired.
- QA pairs are generated using category-enforcing prompts sent to the LLM.
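The combinatorial sampling over the eight categorical axes can be sketched as below. The category values listed are illustrative subsets drawn from this article's text; the actual DataMorgana taxonomies are richer:

```python
import random

# Illustrative subsets of the eight DataMorgana axes (not exhaustive).
AXES = {
    "Answer Type": ["factoid", "yes/no", "list", "definition",
                    "explanation", "comparison", "multi-aspect"],
    "Answer Style": ["concise", "detailed"],
    "Premise": ["direct", "with-premise"],
    "Phrasing": ["concise", "verbose", "search-style"],
    "Linguistic Variation": ["document-similar", "document-distant"],
    "Politeness": ["neutral", "polite"],
    "Linguistic Correctness": ["correct", "noisy"],
    "User Persona": ["expert", "novice"],
}

def sample_config(rng=random):
    """Draw one categorical label per axis to parameterize a QA-generation prompt."""
    return {axis: rng.choice(values) for axis, values in AXES.items()}
```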
Quality control:
All QA pairs undergo human review to remove contradictory or ambiguous items before dataset release.
Characteristic distributions:
- Question length: mean ≈15.3 tokens (range: 8–28).
- Linguistic diversity measures: Distinct n-grams (NGD = 3.062), PoS-CR (5.220), high entropy (3.207) (Carmel et al., 18 Nov 2025).
3. Evaluation Protocols, Metrics, and Difficulty Modeling
Retrieval metrics:
- Mean Average Precision (MAP)
- Reciprocal rank, nDCG@10, Recall@k, Precision@k
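As a reference point, the rank-based retrieval metrics above reduce to short functions over a ranked ID list and a set of relevant IDs; this is a standard-definition sketch, not code from the challenge:

```python
def reciprocal_rank(ranked_ids, relevant):
    """1/rank of the first relevant result; 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k
```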
Generation metrics:
- ROUGE-1 / ROUGE-L (n-gram recall)
- BLEU (n-gram precision)
- Semantic similarity (MiniLM-L6-v2 embedding cosine)
- Refusal rate: percentage of answers containing refusal strings (e.g., “I don’t have enough information…”). A lower rate indicates higher system coverage, though very low rates can signal over-confidence (Fensore et al., 27 Jun 2025).
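Refusal-rate computation amounts to string matching over generated answers. The marker list below is illustrative; the actual string list is system-specific:

```python
REFUSAL_MARKERS = [
    "i don't have enough information",
    "i cannot answer",
    "no relevant information",
]  # illustrative markers, not the challenge's official list

def refusal_rate(answers, markers=REFUSAL_MARKERS):
    """Share of answers containing any refusal marker (case-insensitive)."""
    def refuses(answer):
        low = answer.lower()
        return any(m in low for m in markers)
    return sum(refuses(a) for a in answers) / len(answers)
```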
Correctness (Challenge):
- Defined as the harmonic mean of Coverage and Relatedness, scored by a reference LLM-as-judge system with range [–1,2] and rescaled to [0,1] for downstream modeling (Carmel et al., 18 Nov 2025).
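The rescaling and harmonic-mean aggregation can be sketched as follows. Clamping non-positive components to zero is an assumption here, not the official judging rule:

```python
def correctness(coverage, relatedness):
    """Harmonic mean of Coverage and Relatedness.

    A non-positive component makes the harmonic mean ill-defined, so it
    is clamped to 0.0 here (an assumption for this sketch).
    """
    if coverage <= 0 or relatedness <= 0:
        return 0.0
    return 2 * coverage * relatedness / (coverage + relatedness)

def rescale(score, lo=-1.0, hi=2.0):
    """Map a raw judge score in [lo, hi] to [0, 1] for downstream modeling."""
    return (score - lo) / (hi - lo)
```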
Item Response Theory (IRT) modeling:
- Each question i is parameterized by a difficulty bᵢ and a discrimination factor aᵢ.
- The 2-parameter logistic IRT model estimates the probability that a system j of ability θⱼ correctly answers question i:

  P(correct | θⱼ, aᵢ, bᵢ) = 1 / (1 + exp(−aᵢ(θⱼ − bᵢ)))

- IRT parameters are estimated using py-irt with a Continuous-Bernoulli likelihood and used to bin questions as Highly Difficult, Difficult, Moderate, or Easy.
- These metrics allow performance normalization and curriculum-style training (Carmel et al., 18 Nov 2025).
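The 2PL model and the difficulty binning can be sketched directly. The bin edges below are illustrative; the benchmark's own cut points are derived from the fitted bᵢ distribution:

```python
import math

def p_correct(theta, a_i, b_i):
    """2PL IRT: probability that a system of ability theta answers item i correctly."""
    return 1.0 / (1.0 + math.exp(-a_i * (theta - b_i)))

def difficulty_bin(b_i, edges=(-1.0, 0.0, 1.0)):
    """Assign an item to Easy / Moderate / Difficult / Highly Difficult.

    Edges are illustrative cut points on the difficulty scale, not the
    benchmark's official thresholds.
    """
    labels = ["Easy", "Moderate", "Difficult", "Highly Difficult"]
    for edge, label in zip(edges, labels):
        if b_i <= edge:
            return label
    return labels[-1]
```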
| Metric | Definition / Scope |
|---|---|
| MAP | Mean average precision over all queries |
| ROUGE/BLEU | Overlap-based n-gram scoring |
| Correctness | LLM-as-judge, harmonic mean of Coverage/Relatedness |
| IRT | Difficulty (bᵢ) & discrimination (aᵢ) for each QA pair |
4. Dataset Splits and Dynamic Evaluation
Splitting paradigm:
- Development split: static set (e.g., 200 DataMorgana synthetic QA pairs) for tuning and validation (Fensore et al., 27 Jun 2025).
- Evaluation split: dynamic, live test set sampled shortly before challenge day—typically 500 questions per "live-day" session, unseen and vetted for novelty and real-world relevance (including recent events and terminology) (Fensore et al., 27 Jun 2025).
- The complete released benchmark contains 895 QAs, expanded post-Challenge and fully annotated for offline research use (Carmel et al., 18 Nov 2025).
Session schema:
- Each QA pair also includes metadata specifying the session (first, second, both), used for calibration and controlled experimentation.
This dynamic approach emulates production QA scenarios and mitigates overfitting to stale or static benchmarks, supporting generalizability assessment.
5. Linguistic, Topical, and User Diversities
Questions are constructed to maximize variance along:
- Question type: factoid, open-ended, multi-aspect, comparison, yes/no, list, definition, explanation.
- Premise: direct (standalone) or with-premise (contextualized).
- Phrasing: concise, verbose, search-style (short/long).
- Linguistic variation: document-similar versus document-distant (phrasings with high/low lexical overlap to supporting passages).
- User expertise: expert (technical/domain) vs. novice (general-audience) (Fensore et al., 27 Jun 2025, Carmel et al., 18 Nov 2025).
Linguistic statistics:
- Document-similar phrasing achieves ~65% vocabulary overlap with supporting passages; document-distant ~28%.
- Embedding similarity (MiniLM cosine): 0.762 vs. 0.562 for similar vs. distant phrasings.
- Higher lexical alignment strongly correlates with improved downstream correctness, ROUGE-1 (0.431 vs. 0.296), and lower refusal rates (9.4% vs. 25.5%) (Fensore et al., 27 Jun 2025).
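A rough proxy for the vocabulary-overlap statistic above is the fraction of question word types that also occur in the supporting passage; the study's actual measurement likely uses more careful tokenization and normalization:

```python
def vocab_overlap(question, passage):
    """Fraction of question word types that also occur in the passage."""
    q_vocab = set(question.lower().split())
    p_vocab = set(passage.lower().split())
    return len(q_vocab & p_vocab) / len(q_vocab) if q_vocab else 0.0
```

Document-similar phrasings would score high under this measure, document-distant phrasings low, mirroring the ~65% vs. ~28% figures reported above.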
Topical coverage:
- Dozens of fields (biology, history, technology, and more), sampled automatically and manually filtered.
6. Downstream Use and System Evaluation
The LiveRAG Benchmark is configured for systematic RAG system evaluation:
- Direct benchmarking protocols: retrieve relevant FineWeb-10BT passages for each question, generate an answer using a candidate RAG system, and score using ground-truth answers and supporting claims via F1, Exact Match, or the rescaled Correctness metric (Carmel et al., 18 Nov 2025).
- Curriculum learning: train models on increasing-difficulty splits using the IRT difficulty bᵢ.
- Analysis of system strengths and weaknesses: per-question performance as a function of bᵢ and aᵢ enables stress-testing and fair benchmarking across systems of varying sophistication (e.g., dense-only, hybrid, graph-enhanced).
- The benchmark differentiates systems by their ability to answer hard, discriminative items and to generalize across question formats.
- Experiments show Falcon3 (no RAG) underperforms RAG systems on higher difficulty bins, and this effect is stable across LLM architectures, confirming robustness to model family (Carmel et al., 18 Nov 2025).
7. Notable Results, Limitations, and Illustrative Examples
System performance (examples):
- Graph-enhanced RAG (GeAR) demonstrated substantial improvements over hybrid baselines: Correctness 0.8757, Faithfulness 0.5293 (Shen et al., 23 Jul 2025).
- RAG pipelines relying on vocabulary-aligned retrieval (BM25 + E5 + reranking) yielded up to 52% relative MAP improvement with neural reranking but suffered from high latency (84s vs. 1.74s per query) (Fensore et al., 27 Jun 2025).
- Pinecone dense retrieval outperformed BM25 for multi-hop recall at moderate k, but hybrid fusion was generally necessary for robust coverage (Cofala et al., 17 Jun 2025).
Sample QA entries:
| Entry | Question | Answer (excerpt) | bᵢ | aᵢ | ACS |
|---|---|---|---|---|---|
| 162 | What is the maximum depth at which fish have been observed in ocean trenches? | Up to 8,100 meters; deeper habitats lack fish. | –1.33 | 0.29 | 0.876 |
| 437 | Are March temperatures suitable for grape vine pruning? | Yes, early March is a suitable period for pruning. | 1.234 | 0.16 | 0.089 |
Limitations:
- No gold standard for all possible questions—ground truths are synthetic and anchored to sampled passages.
- While IRT calibration controls question difficulty and separation, discriminability (aᵢ) is lower for hard items, potentially blunting sensitivity to fine model differences in extreme regions (Carmel et al., 18 Nov 2025).
- Resource requirements for the corresponding document corpus (FineWeb-10BT) are non-trivial due to its web-scale size and dual-indexing.
References
- LiveRAG challenge description and benchmark: (Carmel et al., 18 Nov 2025)
- Dataset evaluation, vocabulary alignment, and challenge setup: (Fensore et al., 27 Jun 2025)
- Graph-augmented retrieval and scale challenges: (Shen et al., 23 Jul 2025)
- Indexing and retrieval trade-offs: (Cofala et al., 17 Jun 2025)