CRAG Dataset for RAG QA Evaluation
- CRAG is a large-scale, multi-domain benchmark designed to evaluate RAG systems using dynamic retrieval scenarios and clear penalties for hallucinations.
- It integrates both noisy web search and structured knowledge graph retrieval to simulate real-world conditions and encourage evidence-backed responses.
- The dataset features fine-grained annotations, diverse question types, and specialized tasks that support rigorous analysis and system improvements.
The Comprehensive RAG Benchmark (CRAG) is a large-scale, multi-domain, and temporally diverse benchmark designed to evaluate the capabilities of Retrieval-Augmented Generation (RAG) systems in factually grounded question answering. Developed to address the limitations of earlier QA datasets, CRAG introduces dynamic, realistic retrieval scenarios by simulating both web search and structured knowledge graph (KG) retrieval, with a strong emphasis on penalizing hallucinations and rewarding abstention in cases of insufficient evidence. CRAG is the official benchmark for the KDD Cup 2024 Retrieval-Augmented Generation challenge and is publicly released, fostering reproducibility and ongoing advances in the RAG research community (Yang et al., 2024).
1. Motivation and Design Principles
CRAG was constructed to resolve major gaps in existing QA benchmarks. Prior datasets typically focus on static facts, offer limited variety, or lack authentic retrieval noise (e.g., synthetic contexts, absence of unstructured and structured retrieval simulation). CRAG provides a large collection of question–answer pairs that vary across domains, question types, entity popularity, and temporal dynamics. Each question is annotated for fine-grained evaluation, facilitating diagnosis of model performance on nuanced real-world scenarios including highly dynamic queries and low-popularity entities (Yang et al., 2024, Ouyang et al., 2024).
Key design goals include:
- Evaluation of RAG pipelines under authentic, noisy retrieval conditions.
- Exposure of system behavior on queries spanning static to real-time facts.
- Performance assessment over diverse domains and reasoning types.
- Formal penalization of hallucinations, promoting "I don't know" abstention when evidence is lacking.
2. Dataset Structure and Annotations
2.1 Domains and Question Categories
CRAG encompasses 4,409 QA pairs split across five major domains: Finance, Sports, Music, Movie, and Open (encyclopedic). These domains were selected to balance heavy-traffic applications (e.g., financial data, live sports) and stable, encyclopedic queries such as birth dates and discographies (Yang et al., 2024, Ouyang et al., 2024).
Questions are assigned to one of eight categories, capturing the core reasoning skills required in RAG:
| Type | Definition (Example) |
|---|---|
| Simple | Single-fact lookup ("When was Albert Einstein born?" → Mar 14, 1879) |
| Simple w. Condition | Single fact + constraint ("IBM stock price on 01/02/2024?" → $124.50) |
| Set | Set retrieval ("Continents in the Southern Hemisphere?" → {Antarctica, ...}) |
| Comparison | Comparing entities ("Who debuted earlier, Taylor Swift or Billie Eilish?") |
| Aggregation | Aggregate over facts ("How many Oscar awards for Meryl Streep?" → 3) |
| Multi-hop | Chained reasoning ("Who directed Zendaya's 2019 movie?" → Jon Watts) |
| Post-processing Heavy | Computed answers ("How many days did Marshall serve on the Supreme Court?") |
| False Premise | Invalid queries ("Taylor Swift's rap album?" → "invalid question") |
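These categories can be filtered programmatically once the dataset is loaded. The sketch below illustrates the idea on toy records; the field names (`question_type`, `query`, `answer`) are assumptions modeled on the paper's description, not the official release schema.

```python
# Illustrative CRAG-style records. Field names are assumed, not the
# official schema; values are taken from the examples in the table above.
records = [
    {"domain": "open", "question_type": "simple",
     "query": "When was Albert Einstein born?", "answer": "Mar 14, 1879"},
    {"domain": "music", "question_type": "false_premise",
     "query": "What is Taylor Swift's rap album?", "answer": "invalid question"},
    {"domain": "music", "question_type": "comparison",
     "query": "Who debuted earlier, Taylor Swift or Billie Eilish?",
     "answer": "Taylor Swift"},
]

def by_type(recs, qtype):
    """Return all records belonging to a given question category."""
    return [r for r in recs if r["question_type"] == qtype]
```

Slicing by category this way is what enables the per-category analyses reported later (e.g., the weak performance on false-premise and aggregation questions).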
2.2 Entity Popularity
CRAG stratifies entities in KG-derived questions into head (popular), torso (mid-popularity), and tail (long-tail) buckets, following a search-log derived popularity score. The KG question set is distributed equally: 661 head, 658 torso, 665 tail (Yang et al., 2024).
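Bucketing by popularity can be sketched as a simple thresholding function. The cutoffs below are illustrative placeholders; CRAG derives its buckets from a search-log popularity score whose exact thresholds are not reproduced here.

```python
def popularity_bucket(score, head_cut=0.67, tail_cut=0.33):
    """Map a normalized popularity score in [0, 1] to a bucket.

    The cutoff values are hypothetical; CRAG's actual stratification
    uses search-log traffic, not these thresholds.
    """
    if score >= head_cut:
        return "head"   # popular entities
    if score >= tail_cut:
        return "torso"  # mid-popularity entities
    return "tail"       # long-tail entities
```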
2.3 Temporal Dynamism
Questions are manually annotated with four dynamism levels:
| Level | Description | Count (%) |
|---|---|---|
| Real-time | Second-level update ("Apple's stock price right now?") | 437 (10%) |
| Fast-changing | Daily fluctuation ("Tonight's Lakers game?") | 564 (13%) |
| Slow-changing | Monthly/yearly ("Who won the 2023 Grammy?") | 1,007 (23%) |
| Static | Invariant fact ("Capital of France?") | 2,401 (54%) |
Each question includes a query_time to support precise temporal evaluation (Yang et al., 2024).
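One way a pipeline might exploit query_time is to discard evidence that postdates the query, so answers reflect the world as of the question. This is a minimal sketch assuming ISO 8601 timestamps; the official dumps may encode times differently.

```python
from datetime import datetime

def evidence_is_valid(page_last_modified: str, query_time: str) -> bool:
    """A retrieved page is usable only if it existed at query time.

    Assumes ISO 8601 strings (e.g. "2024-03-05T12:00:00"); the actual
    CRAG timestamp format may differ.
    """
    page_ts = datetime.fromisoformat(page_last_modified)
    query_ts = datetime.fromisoformat(query_time)
    return page_ts <= query_ts
```

Filtering this way matters most for the real-time and fast-changing slices, where stale evidence directly produces hallucinated answers.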
3. Retrieval Simulation and Data Modalities
3.1 Web Search Mocking
A web search API, based on Brave Search, provides up to 50 full HTML pages per question. Each record includes metadata (page name, URL, snippet, last-modified) and raw HTML. This simulates the presence of retrieval noise, such as advertisements and partially relevant content. Estimated recall is 84% for web questions and 63% for KG questions (Yang et al., 2024).
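Since the mocked results deliver raw HTML, a typical first step is stripping markup down to text before chunking and ranking. A stdlib-only sketch follows; the result keys (`page_url`, `page_result`) echo the metadata fields described above but are assumptions about the exact schema.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def html_to_text(raw_html: str) -> str:
    """Flatten raw HTML into whitespace-joined text."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return " ".join(parser.chunks)

# Hypothetical mocked search result (keys are illustrative).
result = {
    "page_url": "https://example.com",
    "page_result": "<html><body><h1>IBM</h1>"
                   "<p>Stock closed at $124.50.</p></body></html>",
}
```

Real pipelines typically go further (boilerplate removal, chunking, reranking), which is exactly the preprocessing the empirical results in Section 6 show to be decisive.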
3.2 Knowledge Graph Mock APIs
A synthetic KG of 2.6 million entities is paired with 38 mock domain-specific JSON endpoints (e.g., price history, movie credits, award history). API access is provided through JSON-defined request/response schemas, delivering attribute–value records. To simulate realistic retrieval, the KG comprises both authentic and "hard negative" entities (Yang et al., 2024, Ouyang et al., 2024).
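The request/response pattern of such an endpoint can be sketched as below. The endpoint name (`movie_get_person_info`), payload shape, and toy KG contents are hypothetical stand-ins for the 38 official APIs.

```python
import json

# Toy stand-in for the mock KG (entities and attributes are illustrative).
MOCK_KG = {
    "Zendaya": {"oscar_awards": 0, "birthday": "1996-09-01"},
}

def movie_get_person_info(request_json: str) -> str:
    """Hypothetical mock endpoint: JSON request in, JSON record out.

    Mirrors the attribute-value record style described above; the real
    CRAG API names and schemas differ.
    """
    request = json.loads(request_json)
    record = MOCK_KG.get(request.get("query"))
    return json.dumps({"result": record})

response = movie_get_person_info(json.dumps({"query": "Zendaya"}))
```

A RAG system for Tasks 2 and 3 must decide which endpoint to call, construct the JSON request from the question, and merge the returned record with web evidence.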
4. Task Definitions and Evaluation Protocols
4.1 CRAG Tasks
CRAG is partitioned into three tasks of increasing complexity:
| Task | Input Modality | Objective |
|---|---|---|
| Task 1 | 5 web pages | Retrieval summarization; synthesize answer from noisy, limited text corpus |
| Task 2 | 5 web pages + mock KG APIs | Integrate web retrieval and structured KG info for answer synthesis |
| Task 3 | 50 web pages + mock KG APIs | End-to-end; test large-scale retrieval, ranking, filtration, API integration |
For every task, systems must answer or abstain ("I don't know"). Task 3 emulates real-world user information seeking, with extensive retrieval noise and limited memory/time (Yang et al., 2024, Ouyang et al., 2024, DeHaven, 2024).
4.2 Dataset Splits
Typical splits include ~1,323 examples each for validation and public test (total 2,706) and a hidden private test for challenge evaluation. Retrieval content per question is provided according to task requirements (Ouyang et al., 2024).
5. Evaluation Metrics and Human-in-the-Loop Scoring
Formal metrics for CRAG responses include:
- Accuracy: fraction of correct (perfect or acceptable) answers.
- Hallucination rate: fraction of answers that are incorrect or unsupportable.
- Missing rate: fraction of abstentions ("I don't know").
- Score_a (auto-eval): correct answers count +1, missing answers 0, and incorrect answers −1; the mean equals accuracy minus hallucination rate.
- Score_h (human-eval): answers labeled Perfect (+1), Acceptable (+0.5), Missing (0), or Incorrect (−1), with the mean taken (Yang et al., 2024).
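The auto-eval scoring above reduces to simple counting. A minimal sketch, using illustrative label names (the human protocol additionally allows a +0.5 "acceptable" grade not modeled here):

```python
def crag_score(labels):
    """Compute CRAG-style auto-eval statistics.

    labels: list of 'correct' | 'missing' | 'hallucinated' judgments,
    one per question. Correct scores +1, missing 0, hallucinated -1,
    so the mean score equals accuracy minus hallucination rate.
    """
    n = len(labels)
    accuracy = labels.count("correct") / n
    missing = labels.count("missing") / n
    hallucination = labels.count("hallucinated") / n
    return {
        "accuracy": accuracy,
        "missing": missing,
        "hallucination": hallucination,
        "score": accuracy - hallucination,
    }

stats = crag_score(["correct", "missing", "hallucinated", "correct"])
```

Because hallucinations subtract from the score while abstentions are neutral, a system is strictly better off saying "I don't know" than guessing wrong, which is the abstention incentive the benchmark is built around.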
Human evaluation is supplemented by LLM-based "judge" systems for scalability (Zhinuan et al., 5 Jan 2026). The protocol strongly penalizes unsupported or hallucinated answers, thereby promoting reliable uncertainty estimation by RAG models.
6. Empirical Results and System Analysis
CRAG exposes robust differences between RAG pipelines and traditional LLM-only QA. Key findings include:
- GPT-4 Turbo in a straightforward RAG setup attains only a 10-point accuracy improvement over the LLM-only baseline (44% vs. 34%), at the cost of increased hallucination. State-of-the-art industry RAG systems reach at most 63% perfect answers, with hallucination rates of 17–25% persisting (Yang et al., 2024).
- Slice-level analyses indicate the greatest difficulties in Finance and Sports domains, real-time and fast-changing queries (<20% auto-score), tail entities, and complex categories such as set, aggregation, and false-premise questions.
- Enhanced systems demonstrate significant gains via domain and dynamism routing, specialized preprocessing (e.g., HTML parsing, chunk ranking), and practical abstention strategies (Ouyang et al., 2024, DeHaven, 2024, Yuan et al., 2024).
- Energy and latency experiments confirm that CRAG supports measurement of performance trade-offs under realistic workloads; judicious choice of retrieval parameters reduces energy usage without severe accuracy loss (Zhinuan et al., 5 Jan 2026).
7. Availability, Community Adoption, and Impact
The CRAG dataset, along with all supporting mock APIs and code, is publicly available under a CC BY-NC license at https://github.com/facebookresearch/CRAG/ and competition portals. It serves as the official benchmark for the Meta KDD Cup 2024, attracting thousands of participants and submissions (Yang et al., 2024). The dataset is maintained and updated, with expansion into additional domains and question types anticipated.
CRAG has rapidly become the de facto benchmark for rigorous analysis of RAG QA systems, catalyzing research into robust retrieval, hallucination mitigation, structured/unstructured information integration, and temporal reasoning. Its formal scoring, coverage of real-world search conditions, and fine-grained annotations position it as a central resource for both system development and diagnostic research (Yang et al., 2024, Ouyang et al., 2024, DeHaven, 2024, Zhinuan et al., 5 Jan 2026, Yuan et al., 2024).