CRAG Dataset for RAG QA Evaluation
- CRAG is a large-scale, multi-domain benchmark designed to evaluate RAG systems using dynamic retrieval scenarios and clear penalties for hallucinations.
- It integrates both noisy web search and structured knowledge graph retrieval to simulate real-world conditions and encourage evidence-backed responses.
- The dataset features fine-grained annotations, diverse question types, and specialized tasks that support rigorous analysis and system improvements.
The Comprehensive RAG Benchmark (CRAG) is a large-scale, multi-domain, and temporally diverse benchmark designed to evaluate the capabilities of Retrieval-Augmented Generation (RAG) systems in factually grounded question answering. Developed to address the limitations of earlier QA datasets, CRAG introduces dynamic, realistic retrieval scenarios by simulating both web search and structured knowledge graph (KG) retrieval, with a strong emphasis on penalizing hallucinations and rewarding abstention in cases of insufficient evidence. CRAG is the official benchmark for the KDD Cup 2024 Retrieval-Augmented Generation challenge and is publicly released, fostering reproducibility and ongoing advances in the RAG research community (Yang et al., 2024).
1. Motivation and Design Principles
CRAG was constructed to resolve major gaps in existing QA benchmarks. Prior datasets typically focus on static facts, offer limited variety, or lack authentic retrieval noise (e.g., synthetic contexts, absence of unstructured and structured retrieval simulation). CRAG provides a large collection of question–answer pairs that vary across domains, question types, entity popularity, and temporal dynamics. Each question is annotated for fine-grained evaluation, facilitating diagnosis of model performance on nuanced real-world scenarios including highly dynamic queries and low-popularity entities (Yang et al., 2024, Ouyang et al., 2024).
Key design goals include:
- Evaluation of RAG pipelines under authentic, noisy retrieval conditions.
- Exposure of system behavior on queries spanning static to real-time facts.
- Performance assessment over diverse domains and reasoning types.
- Formal penalization of hallucinations, promoting "I don't know" abstention when evidence is lacking.
2. Dataset Structure and Annotations
2.1 Domains and Question Categories
CRAG encompasses 4,409 QA pairs split across five major domains: Finance, Sports, Music, Movie, and Open (encyclopedic). These domains were selected to balance heavy-traffic applications (e.g., financial data, live sports) and stable, encyclopedic queries such as birth dates and discographies (Yang et al., 2024, Ouyang et al., 2024).
Questions are assigned to one of eight categories, capturing the core reasoning skills required in RAG:
| Type | Definition (Example) |
|---|---|
| Simple | Single-fact lookup ("When was Albert Einstein born?" → Mar 14, 1879) |
| Simple w. Condition | Single fact + constraint ("IBM stock price on 01/02/2024?" → $124.50) |
| Set | Set retrieval ("Continents in the Southern Hemisphere?" → {Antarctica, ...}) |
| Comparison | Comparing entities ("Who debuted earlier, Taylor Swift or Billie Eilish?") |
| Aggregation | Aggregate over facts ("How many Oscar awards for Meryl Streep?" → 3) |
| Multi-hop | Chained reasoning ("Who directed Zendaya's 2019 movie?" → Jon Watts) |
| Post-processing Heavy | Computed answers ("How many days did Marshall serve on the Supreme Court?") |
| False Premise | Invalid queries ("Taylor Swift's rap album?" → "invalid question") |
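These categories can be filtered programmatically once the dataset is loaded. The sketch below illustrates the idea on toy records; the field names (`question_type`, `query`, `answer`) are assumptions modeled on the paper's description, not the official release schema.

```python
# Illustrative CRAG-style records. Field names are assumed, not the
# official schema; values are taken from the examples in the table above.
records = [
    {"domain": "open", "question_type": "simple",
     "query": "When was Albert Einstein born?", "answer": "Mar 14, 1879"},
    {"domain": "music", "question_type": "false_premise",
     "query": "What is Taylor Swift's rap album?", "answer": "invalid question"},
    {"domain": "music", "question_type": "comparison",
     "query": "Who debuted earlier, Taylor Swift or Billie Eilish?",
     "answer": "Taylor Swift"},
]

def by_type(recs, qtype):
    """Return all records belonging to a given question category."""
    return [r for r in recs if r["question_type"] == qtype]
```

Slicing by category this way is what enables the per-category analyses reported later (e.g., the weak performance on false-premise and aggregation questions).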
2.2 Entity Popularity
CRAG stratifies entities in KG-derived questions into head (popular), torso (mid-popularity), and tail (long-tail) buckets, following a search-log derived popularity score. The KG question set is distributed equally: 661 head, 658 torso, 665 tail (Yang et al., 2024).
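Bucketing by popularity can be sketched as a simple thresholding function. The cutoffs below are illustrative placeholders; CRAG derives its buckets from a search-log popularity score whose exact thresholds are not reproduced here.

```python
def popularity_bucket(score, head_cut=0.67, tail_cut=0.33):
    """Map a normalized popularity score in [0, 1] to a bucket.

    The cutoff values are hypothetical; CRAG's actual stratification
    uses search-log traffic, not these thresholds.
    """
    if score >= head_cut:
        return "head"   # popular entities
    if score >= tail_cut:
        return "torso"  # mid-popularity entities
    return "tail"       # long-tail entities
```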
2.3 Temporal Dynamism
Questions are manually annotated with four dynamism levels:
| Level | Description | Count (%) |
|---|---|---|
| Real-time | Second-level update ("Apple's stock price right now?") | 437 (10%) |
| Fast-changing | Daily fluctuation ("Tonight's Lakers game?") | 564 (13%) |
| Slow-changing | Monthly/yearly ("Who won the 2023 Grammy?") | 1,007 (23%) |
| Static | Invariant fact ("Capital of France?") | 2,401 (54%) |
Each question includes a query_time to support precise temporal evaluation (Yang et al., 2024).
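One way a pipeline might exploit query_time is to discard evidence that postdates the query, so answers reflect the world as of the question. This is a minimal sketch assuming ISO 8601 timestamps; the official dumps may encode times differently.

```python
from datetime import datetime

def evidence_is_valid(page_last_modified: str, query_time: str) -> bool:
    """A retrieved page is usable only if it existed at query time.

    Assumes ISO 8601 strings (e.g. "2024-03-05T12:00:00"); the actual
    CRAG timestamp format may differ.
    """
    page_ts = datetime.fromisoformat(page_last_modified)
    query_ts = datetime.fromisoformat(query_time)
    return page_ts <= query_ts
```

Filtering this way matters most for the real-time and fast-changing slices, where stale evidence directly produces hallucinated answers.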
3. Retrieval Simulation and Data Modalities
3.1 Web Search Mocking
A web search API, based on Brave Search, provides up to 50 full HTML pages per question. Each record includes metadata (page name, URL, snippet, last-modified) and raw HTML. This simulates the presence of retrieval noise, such as advertisements and partially relevant content. Estimated recall is 84% for web questions and 63% for KG questions (Yang et al., 2024).
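Since the mocked results deliver raw HTML, a typical first step is stripping markup down to text before chunking and ranking. A stdlib-only sketch follows; the result keys (`page_url`, `page_result`) echo the metadata fields described above but are assumptions about the exact schema.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def html_to_text(raw_html: str) -> str:
    """Flatten raw HTML into whitespace-joined text."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return " ".join(parser.chunks)

# Hypothetical mocked search result (keys are illustrative).
result = {
    "page_url": "https://example.com",
    "page_result": "<html><body><h1>IBM</h1>"
                   "<p>Stock closed at $124.50.</p></body></html>",
}
```

Real pipelines typically go further (boilerplate removal, chunking, reranking), which is exactly the preprocessing the empirical results in Section 6 show to be decisive.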
3.2 Knowledge Graph Mock APIs
A synthetic KG of 2.6 million entities is paired with 38 mock domain-specific JSON endpoints (e.g., price history, movie credits, award history). API access is provided through JSON-defined request/response schemas, delivering attribute–value records. To simulate realistic retrieval, the KG comprises both authentic and "hard negative" entities (Yang et al., 2024, Ouyang et al., 2024).
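The request/response pattern of such an endpoint can be sketched as below. The endpoint name (`movie_get_person_info`), payload shape, and toy KG contents are hypothetical stand-ins for the 38 official APIs.

```python
import json

# Toy stand-in for the mock KG (entities and attributes are illustrative).
MOCK_KG = {
    "Zendaya": {"oscar_awards": 0, "birthday": "1996-09-01"},
}

def movie_get_person_info(request_json: str) -> str:
    """Hypothetical mock endpoint: JSON request in, JSON record out.

    Mirrors the attribute-value record style described above; the real
    CRAG API names and schemas differ.
    """
    request = json.loads(request_json)
    record = MOCK_KG.get(request.get("query"))
    return json.dumps({"result": record})

response = movie_get_person_info(json.dumps({"query": "Zendaya"}))
```

A RAG system for Tasks 2 and 3 must decide which endpoint to call, construct the JSON request from the question, and merge the returned record with web evidence.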
4. Task Definitions and Evaluation Protocols
4.1 CRAG Tasks
CRAG is partitioned into three tasks of increasing complexity:
| Task | Input Modality | Objective |
|---|---|---|
| Task 1 | 5 web pages | Retrieval summarization; synthesize answer from noisy, limited text corpus |
| Task 2 | 5 web pages + mock KG APIs | Integrate web retrieval and structured KG info for answer synthesis |
| Task 3 | 50 web pages + mock KG APIs | End-to-end; test large-scale retrieval, ranking, filtration, API integration |
For every task, systems must answer or abstain ("I don't know"). Task 3 emulates real-world user information seeking, with extensive retrieval noise and limited memory/time (Yang et al., 2024, Ouyang et al., 2024, DeHaven, 2024).
4.2 Dataset Splits
Typical splits include ~1,323 examples each for validation and public test (total 2,706) and a hidden private test for challenge evaluation. Retrieval content per question is provided according to task requirements (Ouyang et al., 2024).
5. Evaluation Metrics and Human-in-the-Loop Scoring
Formal metrics for CRAG responses include:
- Accuracy: fraction of correct (perfect or acceptable) answers.
- Hallucination rate: fraction of answers that are incorrect or unsupportable.
- Missing rate: fraction of abstentions ("I don't know").
- Score_a (auto-eval): correct answers count +1, missing answers 0, and incorrect answers −1; the mean equals accuracy minus hallucination rate.
- Score_h (human-eval): answers labeled Perfect (+1), Acceptable (+0.5), Missing (0), or Incorrect (−1), with the mean taken (Yang et al., 2024).
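The auto-eval scoring above reduces to simple counting. A minimal sketch, using illustrative label names (the human protocol additionally allows a +0.5 "acceptable" grade not modeled here):

```python
def crag_score(labels):
    """Compute CRAG-style auto-eval statistics.

    labels: list of 'correct' | 'missing' | 'hallucinated' judgments,
    one per question. Correct scores +1, missing 0, hallucinated -1,
    so the mean score equals accuracy minus hallucination rate.
    """
    n = len(labels)
    accuracy = labels.count("correct") / n
    missing = labels.count("missing") / n
    hallucination = labels.count("hallucinated") / n
    return {
        "accuracy": accuracy,
        "missing": missing,
        "hallucination": hallucination,
        "score": accuracy - hallucination,
    }

stats = crag_score(["correct", "missing", "hallucinated", "correct"])
```

Because hallucinations subtract from the score while abstentions are neutral, a system is strictly better off saying "I don't know" than guessing wrong, which is the abstention incentive the benchmark is built around.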
Human evaluation is supplemented by LLM-based "judge" systems for scalability (Zhinuan et al., 5 Jan 2026). The protocol strongly penalizes unsupported or hallucinated answers, thereby promoting reliable uncertainty estimation by RAG models.
6. Empirical Results and System Analysis
CRAG exposes robust differences between RAG pipelines and traditional LLM-only QA. Key findings include:
- GPT-4 Turbo in a straightforward RAG setup attains only a 10-point accuracy improvement over the LLM-only baseline (44% vs. 34%), at the cost of increased hallucination. State-of-the-art industry RAG systems reach at most 63% perfect answers, with hallucination rates of 17–25% persisting (Yang et al., 2024).
- Slice-level analyses indicate the greatest difficulties in Finance and Sports domains, real-time and fast-changing queries (<20% auto-score), tail entities, and complex categories such as set, aggregation, and false-premise questions.
- Enhanced systems demonstrate significant gains via domain and dynamism routing, specialized preprocessing (e.g., HTML parsing, chunk ranking), and practical abstention strategies (Ouyang et al., 2024, DeHaven, 2024, Yuan et al., 2024).
- Energy and latency experiments confirm that CRAG supports measurement of performance trade-offs under realistic workloads; judicious choice of retrieval parameters reduces energy usage without severe accuracy loss (Zhinuan et al., 5 Jan 2026).
7. Availability, Community Adoption, and Impact
The CRAG dataset, along with all supporting mock APIs and code, is publicly available under a CC BY-NC license at https://github.com/facebookresearch/CRAG/ and competition portals. It serves as the official benchmark for the Meta KDD Cup 2024, attracting thousands of participants and submissions (Yang et al., 2024). The dataset is maintained and updated, with expansion into additional domains and question types anticipated.
CRAG has rapidly become the de facto benchmark for rigorous analysis of RAG QA systems, catalyzing research into robust retrieval, hallucination mitigation, structured/unstructured information integration, and temporal reasoning. Its formal scoring, coverage of real-world search conditions, and fine-grained annotations position it as a central resource for both system development and diagnostic research (Yang et al., 2024, Ouyang et al., 2024, DeHaven, 2024, Zhinuan et al., 5 Jan 2026, Yuan et al., 2024).