
LiveRAG Dataset for RAG Evaluation

Updated 14 February 2026
  • LiveRAG is a benchmark dataset comprising 895 synthetic QA pairs designed for detailed, item-level evaluation of RAG systems.
  • The dataset is constructed with a multi-stage pipeline using LLM filtering, manual annotation, and claim extraction to ensure diversity and factual accuracy.
  • It provides fine-grained IRT metrics and evaluation protocols from the SIGIR’2025 challenge, supporting advanced error analysis and adaptive training.

LiveRAG is a publicly released benchmark consisting of 895 synthetic question–answer pairs explicitly designed to support systematic evaluation of Retrieval-Augmented Generation (RAG) systems, with a focus on Question Answering (QA) tasks. Developed for and derived from the SIGIR’2025 LiveRAG Challenge, the dataset offers granular evaluation resources such as ground-truth answers, supporting document evidence, atomic supporting claims, and item-level difficulty/discriminability estimates. LiveRAG’s construction pipeline, annotation protocol, and evaluation metrics aim to address the growing need for robust, diverse, and discriminative benchmarks within the RAG research community, enabling fine-grained comparative studies and methodology development (Carmel et al., 18 Nov 2025).

1. Composition and Structure

LiveRAG consists of 895 unique synthetic QA pairs generated by the DataMorgana pipeline. Each record is associated with the following structured fields:

| Field | Description | Format/Values |
|---|---|---|
| question | Natural-language query | string |
| answer | DataMorgana-generated “ground-truth” reply | string |
| supporting_docs | FineWeb-10BT document(s), with ID and full text (1 or 2 per QA item) | list[doc objects] |
| answer_claims | Atomic claims extracted from the answer; each labeled (direct/useful/useless) | list[claim objects] |
| session | Challenge round (“First”, “Second”, “Both”) | string |
| dm_categories | Eight categorical generation labels (Answer Type, Style, …) | list[strings] |
| acs_mean/std | Mean/std dev of competitor systems’ Correctness | float |
| irt_diff | IRT-derived difficulty (b_i; higher = easier) | float |
| irt_disc | IRT-derived discriminability (a_i) | float |

Each question–answer pair is grounded in 1–2 web documents sampled from the FineWeb-10BT corpus via topic/subtopic expansion and LLM-based filtering. Answer claims are extracted as subspans of the reference answer, each labeled by human annotators for directness, usefulness, or irrelevance to the question.
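To make the schema above concrete, the following sketch shows the shape of a single record. Field names come from the table; every value is invented for illustration and is not real dataset content.

```python
# Illustrative shape of one LiveRAG record (hypothetical values throughout).
record = {
    "question": "Which corpus grounds LiveRAG questions?",
    "answer": "LiveRAG questions are grounded in FineWeb-10BT documents.",
    "supporting_docs": [
        {"id": "doc-001", "text": "(full document text)"},  # 1 or 2 per item
    ],
    "answer_claims": [
        {"span": "grounded in FineWeb-10BT documents", "label": "direct"},
    ],
    "session": "First",                        # "First", "Second", or "Both"
    "dm_categories": ["factoid", "concise"],   # from the 8 generation axes
    "acs_mean": 0.61,                          # mean competitor Correctness
    "acs_std": 0.18,
    "irt_diff": -0.45,                         # b_i (higher = easier)
    "irt_disc": 0.12,                          # a_i
}

# Example: keep only claims that directly answer the question.
direct_claims = [c for c in record["answer_claims"] if c["label"] == "direct"]
```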

2. Dataset Construction and Annotation

Construction followed a multi-stage pipeline:

  1. Document sampling: Seed topics/subtopics from FineWeb-10BT; apply LLM-based document filtering for factuality and interest.
  2. DataMorgana pipeline: Selects categorical generation axes, samples supporting documents, and uses Claude 3.5 to synthesize the question/answer pair.
  3. Manual filtering: Removes any items presenting inconsistent or conflicting facts arising from alternate passages.
  4. Claim extraction and labeling: Each answer is partitioned into atomic claim spans with human annotations indicating whether the claim is direct (answers question), useful (helpful), or useless (irrelevant).

Eight categorical axes guide generative diversity during construction:

  • Answer Type
  • Style
  • Premise
  • Phrasing
  • Variation
  • Politeness
  • Correctness
  • Persona

3. Difficulty and Discriminability Estimation

LiveRAG is annotated with Item Response Theory (IRT)–derived metrics for every question. A two-parameter logistic (2PL) model is applied to the matrix of Correctness scores Y_{j,i} (system j, item i):

P(Y_{j,i} = 1 | θ_j, b_i, a_i) = 1 / (1 + exp[-a_i(θ_j - b_i)])

where:

  • θ_j denotes system skill
  • b_i (irt_diff field) denotes item difficulty (higher = easier)
  • a_i (irt_disc field) denotes item discriminability (slope parameter)

Parameter estimation used the py-irt library with a Continuous-Bernoulli likelihood (Correctness mapped to [0, 1]), 10,000 training epochs, a learning rate of 0.01, and dropout of 0.2. The resulting b_i shows a strong negative correlation with mean Correctness (r = -0.97); a_i values are centered near zero, with a weak negative correlation to difficulty (r ≈ -0.42). Discriminability captures how sharply the outcome probability transitions as system skill θ approaches item difficulty.
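A minimal sketch of the displayed 2PL response curve, in pure Python (the function and its name are illustrative, not part of the dataset's tooling):

```python
import math

def p_correct(theta: float, b: float, a: float) -> float:
    """2PL probability that a system with skill theta answers item
    (difficulty b, discriminability a) correctly:
    P = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

At theta == b the probability is exactly 0.5, and a larger a makes the transition around that point sharper, which is why high-a items separate systems more decisively.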

Quartile-based bins for b_i define question difficulty levels:

  1. HD (highly difficult): [-6, -2.143)
  2. D (difficult): [-2.143, -0.962)
  3. M (moderate): [-0.962, 0.236)
  4. E (easy): [0.236, 6]

Each partition contains ~224 questions; the average Correctness score (ACS) increases from HD to E.
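The quartile boundaries above translate directly into a lookup function. This is a sketch using the bin edges as stated, with each interval half-open on the right except the top edge:

```python
def difficulty_bin(b: float) -> str:
    """Map an irt_diff value to LiveRAG's quartile difficulty bins.

    HD: [-6, -2.143), D: [-2.143, -0.962), M: [-0.962, 0.236), E: [0.236, 6].
    """
    if b < -2.143:
        return "HD"   # highly difficult
    if b < -0.962:
        return "D"    # difficult
    if b < 0.236:
        return "M"    # moderate
    return "E"        # easy
```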

4. Dataset Diversity and Characteristics

LiveRAG exhibits high question and linguistic diversity, evidenced by:

  • Multi-axial category coverage: factoid, list, comparison, multi-aspect, yes/no, concise vs. detailed, polite vs. neutral, typo handling, user expertise
  • Quantitative diversity: lexical normalized Google Distance (NGD), part-of-speech category ratio (PoS-CR), embedding homogenization, and length entropy, placing LiveRAG at or near the top relative to TriviaQA, Natural Questions, and WebQuestions
  • Document grounding: 758 items are single-doc and 137 are multi-doc; multi-doc questions are statistically harder and less discriminative than single-doc ones (b_i ≈ 0.09 vs. b_i ≈ -1.08)

The dataset supports fine-grained evaluation by providing not only system-level scores but also claim-level relevance labels for every answer reference.
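One way the claim-level labels support fine-grained evaluation is a claim-level recall score. The sketch below assumes some external judge has already decided, per reference claim, whether the system answer supports it; the source does not prescribe a particular judge, so the `supported` input here is a hypothetical placeholder.

```python
def claim_recall(claims, supported):
    """Share of 'direct' and 'useful' reference claims that a judge marked
    as supported by the system answer.

    claims:    list of dicts with a 'label' in {'direct','useful','useless'}
    supported: parallel list of booleans from some (external) judge
    """
    relevant = [ok for claim, ok in zip(claims, supported)
                if claim["label"] in ("direct", "useful")]
    return sum(relevant) / len(relevant) if relevant else 0.0
```

Claims labeled "useless" are excluded from the denominator, so a system is not penalized for omitting material the annotators judged irrelevant to the question.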

5. Evaluation Protocols

LiveRAG was the basis for the SIGIR’2025 LiveRAG Challenge, with the following typical protocol:

  • Task: Teams answered 500 unseen QAs per session (fixed LLM: Falcon3-10B-Instruct), retrieving from the static FineWeb-10BT corpus.
  • Scoring: Automatic LLM judge evaluated system answers using the Correctness metric (harmonic mean of Coverage and Relatedness).
  • Leaderboards: Rankings determined by mean Correctness score and IRT-based skill estimates (θ_j), yielding nearly identical orderings.
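The Correctness metric is defined above as the harmonic mean of Coverage and Relatedness. A minimal sketch, assuming both judge scores lie in [0, 1]:

```python
def correctness(coverage: float, relatedness: float) -> float:
    """Harmonic mean of Coverage and Relatedness (both assumed in [0, 1]).

    The harmonic mean is dominated by the weaker component, so an answer
    scoring zero on either axis scores zero overall.
    """
    if coverage <= 0.0 or relatedness <= 0.0:
        return 0.0
    return 2.0 * coverage * relatedness / (coverage + relatedness)
```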

Recommended research practices for LiveRAG include:

  1. Sampling evaluation queries stratified by b_i to test model robustness across difficulty levels
  2. Using a_i to select high-discrimination questions for sensitive system comparison
  3. Reporting both vanilla average Correctness and IRT-normalized metrics (e.g., expected score for a given θ)
  4. Adopting claim-level evaluation (per-claim recall/precision) for deeper insight
  5. Conducting curriculum learning by gradually introducing harder items (low b_i) during retriever/generator module training
  6. Performing error analysis by correlating QA outcomes with categorical generation axes
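Practice 1 above can be sketched as a stratified sampler over the `irt_diff` field, using the quartile bounds from Section 3. The bin dictionary and function name are illustrative; bins are treated as half-open [lo, hi), with the top edge nudged so b_i = 6 falls in "E".

```python
import random
from collections import defaultdict

# Quartile bounds from Section 3, half-open [lo, hi).
BINS = {"HD": (-6.0, -2.143), "D": (-2.143, -0.962),
        "M": (-0.962, 0.236), "E": (0.236, 6.0 + 1e-9)}

def stratified_sample(items, bins=BINS, k_per_bin=50, seed=0):
    """Sample up to k_per_bin items from each difficulty bin.

    items: list of dicts carrying the dataset's irt_diff field.
    Returns {bin_name: list of sampled items}.
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for item in items:
        for name, (lo, hi) in bins.items():
            if lo <= item["irt_diff"] < hi:
                by_bin[name].append(item)
                break
    return {name: rng.sample(group, min(k_per_bin, len(group)))
            for name, group in by_bin.items()}
```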

6. Applications and Implications

LiveRAG is suitable for:

  • Benchmarking: Systematic RAG evaluation, leaderboard generation, and ablation studies
  • Curriculum learning: Adaptive training based on fine-grained difficulty stratification
  • Generalization studies: Comparing LLM-only to retrieval-augmented setups across controlled question sets
  • Error analysis and bias detection: Revealing latent weaknesses along linguistic, categorical, and difficulty axes
  • Dataset augmentation: Integration with other QA datasets to create mixed or “long-tail” benchmarks for stress-testing

Access to all core dataset fields (questions, answers, supporting docs, claims, IRT metrics, category labels) is available via the Hugging Face repository at https://huggingface.co/datasets/LiveRAG/Benchmark (Carmel et al., 18 Nov 2025).

7. Relation to Broader RAG Evaluation and SIGIR 2025 Challenge

The LiveRAG benchmark is explicitly positioned to address limitations in QA dataset diversity, answer grounding, and item-level evaluation granularity found in earlier resources; on the measured question-diversity metrics, it surpasses existing corpora such as TriviaQA, Natural Questions, and WebQuestions.

LiveRAG’s provenance is tightly linked to the SIGIR 2025 LiveRAG Challenge, which set the precedent for time-constrained, large-scale system comparison over a fixed web corpus using a standardized LLM backend. The challenge and its benchmark informed subsequent research directions—including the scaling of graph-based RAG solutions and cross-benchmark transferability studies (Shen et al., 23 Jul 2025).
