
LiveRAG Dataset for RAG Evaluation

Updated 14 February 2026
  • LiveRAG is a benchmark dataset comprising 895 synthetic QA pairs designed for detailed, item-level evaluation of RAG systems.
  • The dataset is constructed with a multi-stage pipeline using LLM filtering, manual annotation, and claim extraction to ensure diversity and factual accuracy.
  • It provides fine-grained IRT metrics and evaluation protocols from the SIGIR’2025 challenge, supporting advanced error analysis and adaptive training.

LiveRAG is a publicly released benchmark consisting of 895 synthetic question–answer pairs explicitly designed to support systematic evaluation of Retrieval-Augmented Generation (RAG) systems, with a focus on Question Answering (QA) tasks. Developed for and derived from the SIGIR’2025 LiveRAG Challenge, the dataset offers granular evaluation resources such as ground-truth answers, supporting document evidence, atomic supporting claims, and item-level difficulty/discriminability estimates. LiveRAG’s construction pipeline, annotation protocol, and evaluation metrics aim to address the growing need for robust, diverse, and discriminative benchmarks within the RAG research community, enabling fine-grained comparative studies and methodology development (Carmel et al., 18 Nov 2025).

1. Composition and Structure

LiveRAG consists of 895 unique synthetic QA pairs generated by the DataMorgana pipeline. Each record is associated with the following structured fields:

| Field | Description | Format/Values |
|---|---|---|
| question | Natural-language query | string |
| answer | DataMorgana-generated “ground-truth” reply | string |
| supporting_docs | FineWeb-10BT document(s), with ID and full text (1 or 2 per QA item) | list[doc objects] |
| answer_claims | Atomic claims extracted from the answer; each labeled (direct/useful/useless) | list[claim objects] |
| session | Challenge round (“First”, “Second”, “Both”) | string |
| dm_categories | Eight categorical generation labels (Answer Type, Style, …) | list[strings] |
| acs_mean/std | Mean/std dev of competitor systems’ Correctness | float |
| irt_diff | IRT-derived difficulty (b_i; higher = easier) | float |
| irt_disc | IRT-derived discriminability (a_i) | float |

Each question–answer pair is grounded in 1–2 web documents sampled from the FineWeb-10BT corpus via topic/subtopic expansion and LLM-based filtering. Answer claims are extracted as subspans of the reference answer, each labeled by human annotators for directness, usefulness, or irrelevance to the question.
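To make the schema above concrete, the following sketch shows the shape of a single record. Field names come from the table; every value is invented for illustration and is not real dataset content.

```python
# Illustrative shape of one LiveRAG record (hypothetical values throughout).
record = {
    "question": "Which corpus grounds LiveRAG questions?",
    "answer": "LiveRAG questions are grounded in FineWeb-10BT documents.",
    "supporting_docs": [
        {"id": "doc-001", "text": "(full document text)"},  # 1 or 2 per item
    ],
    "answer_claims": [
        {"span": "grounded in FineWeb-10BT documents", "label": "direct"},
    ],
    "session": "First",                        # "First", "Second", or "Both"
    "dm_categories": ["factoid", "concise"],   # from the 8 generation axes
    "acs_mean": 0.61,                          # mean competitor Correctness
    "acs_std": 0.18,
    "irt_diff": -0.45,                         # b_i (higher = easier)
    "irt_disc": 0.12,                          # a_i
}

# Example: keep only claims that directly answer the question.
direct_claims = [c for c in record["answer_claims"] if c["label"] == "direct"]
```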

2. Dataset Construction and Annotation

Construction followed a multi-stage pipeline:

  1. Document sampling: Seed topics/subtopics from FineWeb-10BT; apply LLM-based document filtering for factuality and interest.
  2. DataMorgana pipeline: Selects categorical generation axes, samples supporting documents, and uses Claude 3.5 to synthesize the question/answer pair.
  3. Manual filtering: Removes any items presenting inconsistent or conflicting facts arising from alternate passages.
  4. Claim extraction and labeling: Each answer is partitioned into atomic claim spans with human annotations indicating whether the claim is direct (answers question), useful (helpful), or useless (irrelevant).

Eight categorical axes guide generative diversity during construction:

  • Answer Type
  • Style
  • Premise
  • Phrasing
  • Variation
  • Politeness
  • Correctness
  • Persona

3. Difficulty and Discriminability Estimation

LiveRAG is annotated with Item Response Theory (IRT)–derived metrics for every question. A two-parameter logistic (2PL) model is applied to the matrix of Correctness scores Y_{j,i} (system j, item i):

P(Y_{j,i} = 1 | θ_j, b_i, a_i) = 1 / (1 + exp[-a_i(θ_j - b_i)])

where:

  • θ_j denotes system skill
  • b_i (irt_diff field) denotes item difficulty (higher = easier)
  • a_i (irt_disc field) denotes item discriminability (slope parameter)

Parameter estimation used the py-irt library with a Continuous-Bernoulli likelihood (Correctness mapped to [0, 1]), 10,000 training epochs, a learning rate of 0.01, and dropout of 0.2. The resulting b_i shows a strong negative correlation with mean Correctness (r = -0.97); a_i values are centered near zero, with a weak negative correlation to difficulty (r ≈ -0.42). Discriminability captures how sharply the outcome probability transitions as system skill θ approaches item difficulty.
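A minimal sketch of the displayed 2PL response curve, in pure Python (the function and its name are illustrative, not part of the dataset's tooling):

```python
import math

def p_correct(theta: float, b: float, a: float) -> float:
    """2PL probability that a system with skill theta answers item
    (difficulty b, discriminability a) correctly:
    P = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

At theta == b the probability is exactly 0.5, and a larger a makes the transition around that point sharper, which is why high-a items separate systems more decisively.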

Quartile-based bins for b_i define question difficulty levels:

  1. HD (highly difficult): [-6, -2.143)
  2. D (difficult): [-2.143, -0.962)
  3. M (moderate): [-0.962, 0.236)
  4. E (easy): [0.236, 6]

Each partition contains ~224 questions; the average Correctness score (ACS) increases from HD to E.
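The quartile boundaries above translate directly into a lookup function. This is a sketch using the bin edges as stated, with each interval half-open on the right except the top edge:

```python
def difficulty_bin(b: float) -> str:
    """Map an irt_diff value to LiveRAG's quartile difficulty bins.

    HD: [-6, -2.143), D: [-2.143, -0.962), M: [-0.962, 0.236), E: [0.236, 6].
    """
    if b < -2.143:
        return "HD"   # highly difficult
    if b < -0.962:
        return "D"    # difficult
    if b < 0.236:
        return "M"    # moderate
    return "E"        # easy
```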

4. Dataset Diversity and Characteristics

LiveRAG exhibits high question and linguistic diversity, evidenced by:

  • Multi-axial category coverage: factoid, list, comparison, multi-aspect, yes/no, concise vs. detailed, polite vs. neutral, typo handling, user expertise
  • Quantitative diversity: lexical normalized Google Distance (NGD), part-of-speech category ratio (PoS-CR), embedding homogenization, and length entropy, placing LiveRAG at or near the top relative to TriviaQA, Natural Questions, and WebQuestions
  • Document grounding: 758 items are single-doc and 137 are multi-doc; multi-doc questions are statistically harder and less discriminative than single-doc ones (b_i ≈ 0.09 vs. b_i ≈ -1.08)

The dataset supports fine-grained evaluation by providing not only system-level scores but also claim-level relevance labels for every answer reference.
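One way the claim-level labels support fine-grained evaluation is a claim-level recall score. The sketch below assumes some external judge has already decided, per reference claim, whether the system answer supports it; the source does not prescribe a particular judge, so the `supported` input here is a hypothetical placeholder.

```python
def claim_recall(claims, supported):
    """Share of 'direct' and 'useful' reference claims that a judge marked
    as supported by the system answer.

    claims:    list of dicts with a 'label' in {'direct','useful','useless'}
    supported: parallel list of booleans from some (external) judge
    """
    relevant = [ok for claim, ok in zip(claims, supported)
                if claim["label"] in ("direct", "useful")]
    return sum(relevant) / len(relevant) if relevant else 0.0
```

Claims labeled "useless" are excluded from the denominator, so a system is not penalized for omitting material the annotators judged irrelevant to the question.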

5. Evaluation Protocols

LiveRAG was the basis for the SIGIR’2025 LiveRAG Challenge, with the following typical protocol:

  • Task: Teams answered 500 unseen QAs per session (fixed LLM: Falcon3-10B-Instruct), retrieving from the static FineWeb-10BT corpus.
  • Scoring: Automatic LLM judge evaluated system answers using the Correctness metric (harmonic mean of Coverage and Relatedness).
  • Leaderboards: Rankings determined by mean Correctness score and IRT-based skill estimates (θ_j), yielding nearly identical orderings.
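The Correctness metric is defined above as the harmonic mean of Coverage and Relatedness. A minimal sketch, assuming both judge scores lie in [0, 1]:

```python
def correctness(coverage: float, relatedness: float) -> float:
    """Harmonic mean of Coverage and Relatedness (both assumed in [0, 1]).

    The harmonic mean is dominated by the weaker component, so an answer
    scoring zero on either axis scores zero overall.
    """
    if coverage <= 0.0 or relatedness <= 0.0:
        return 0.0
    return 2.0 * coverage * relatedness / (coverage + relatedness)
```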

Recommended research practices for LiveRAG include:

  1. Sampling evaluation queries stratified by b_i to test model robustness across difficulty levels
  2. Using a_i to select high-discrimination questions for sensitive system comparison
  3. Reporting both vanilla average Correctness and IRT-normalized metrics (e.g., expected score for a given θ)
  4. Adopting claim-level evaluation (per-claim recall/precision) for deeper insight
  5. Conducting curriculum learning by gradually introducing harder items (low b_i) during retriever/generator module training
  6. Performing error analysis by correlating QA outcomes with categorical generation axes
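Practice 1 above can be sketched as a stratified sampler over the `irt_diff` field, using the quartile bounds from Section 3. The bin dictionary and function name are illustrative; bins are treated as half-open [lo, hi), with the top edge nudged so b_i = 6 falls in "E".

```python
import random
from collections import defaultdict

# Quartile bounds from Section 3, half-open [lo, hi).
BINS = {"HD": (-6.0, -2.143), "D": (-2.143, -0.962),
        "M": (-0.962, 0.236), "E": (0.236, 6.0 + 1e-9)}

def stratified_sample(items, bins=BINS, k_per_bin=50, seed=0):
    """Sample up to k_per_bin items from each difficulty bin.

    items: list of dicts carrying the dataset's irt_diff field.
    Returns {bin_name: list of sampled items}.
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for item in items:
        for name, (lo, hi) in bins.items():
            if lo <= item["irt_diff"] < hi:
                by_bin[name].append(item)
                break
    return {name: rng.sample(group, min(k_per_bin, len(group)))
            for name, group in by_bin.items()}
```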

6. Applications and Implications

LiveRAG is suitable for:

  • Benchmarking: Systematic RAG evaluation, leaderboard generation, and ablation studies
  • Curriculum learning: Adaptive training based on fine-grained difficulty stratification
  • Generalization studies: Comparing LLM-only to retrieval-augmented setups across controlled question sets
  • Error analysis and bias detection: Revealing latent weaknesses along linguistic, categorical, and difficulty axes
  • Dataset augmentation: Integration with other QA datasets to create mixed or “long-tail” benchmarks for stress-testing

Access to all core dataset fields (questions, answers, supporting docs, claims, IRT metrics, category labels) is available via the Hugging Face repository at https://huggingface.co/datasets/LiveRAG/Benchmark (Carmel et al., 18 Nov 2025).

7. Relation to Broader RAG Evaluation and SIGIR 2025 Challenge

The LiveRAG benchmark is explicitly positioned to address limitations in QA dataset diversity, answer grounding, and item-level evaluation granularity found in earlier resources; on the measured question-diversity metrics, it surpasses existing corpora such as TriviaQA, Natural Questions, and WebQuestions.

LiveRAG’s provenance is tightly linked to the SIGIR 2025 LiveRAG Challenge, which set the precedent for time-constrained, large-scale system comparison over a fixed web corpus using a standardized LLM backend. The challenge and its benchmark informed subsequent research directions—including the scaling of graph-based RAG solutions and cross-benchmark transferability studies (Shen et al., 23 Jul 2025).
