
CRAG-MM: Benchmark for MM-RAG Evaluation

Updated 28 January 2026
  • CRAG-MM is a comprehensive benchmark designed to evaluate multi-modal, multi-turn Retrieval-Augmented Generation (RAG) methods in wearable AI scenarios.
  • It features 7,943 images—including degraded egocentric captures—and multi-turn conversations across 13 diverse domains to test retrieval and reasoning under realistic conditions.
  • The evaluation protocol employs strict truthfulness metrics and session robustness checks to guide improvements in cross-modal retrieval and reasoning.

CRAG-MM (Comprehensive Retrieval-Augmented Generation Benchmark – Multi-Modal Multi-Turn) is the first comprehensive public benchmark specifically designed to evaluate multi-modal, multi-turn Retrieval-Augmented Generation (RAG) methods, targeting the vision-and-language QA demands of wearable AI scenarios such as smart glasses and AI pins. It emphasizes robust evaluation of MM-RAG models under realistic, challenging conditions, including egocentric images, long-tail entities, varied image qualities, multi-source retrieval, and extended conversational continuity (Wang et al., 30 Oct 2025).

1. Motivation and Scope

CRAG-MM was introduced to address a critical gap in evaluation resources for MM-RAG systems, especially those supporting on-the-go fact-seeking in wearable contexts (Wang et al., 30 Oct 2025). Unlike static VQA benchmarks, CRAG-MM focuses on scenarios where:

  • The primary visual input is egocentric—captured from a first-person perspective—with inherent degradation (blur, occlusion, low-light).
  • Questions require grounding in external structured (image-KG) and unstructured (web) sources; answers are virtually impossible from pixels alone.
  • Dialogues naturally unfold over multiple turns, exhibiting topic drift, accumulated context, and dependency resolution.

This suite is aligned with next-generation AI assistant use-cases, as exemplified by KDD Cup 2025, which served as the deployment venue and drew widespread academic and industry participation (Zhang et al., 29 Jul 2025, Chen et al., 27 Jul 2025, Nakamizo et al., 16 Oct 2025).

2. Dataset Composition and Stratified Challenge Design

CRAG-MM contains:

  • 7,943 images: 6,248 (79%) are egocentric (from wearable devices, e.g., Ray-Ban Meta smart glasses), designed to include 15% with deliberate degradation (blur, occlusion, truncation, rotation, low-light).
  • 6,462 single-turn (image, question, answer) triplets and 1,956 multi-turn conversations (each 2–6 turns, mean 4.9) spanning 13 domains including shopping, text understanding, plants, and vehicles.
  • Entity buckets sampled for head, mid, and tail frequency to probe retrieval and grounding performance on long-tail concepts.

Annotations and splits are stratified across four orthogonal axes: image quality, question type (six categories: text recognition, lookup, multi-hop, comparison, aggregation, reasoning), entity popularity, and dialog complexity (38% of multi-turn sessions involve domain shift; user abandonment after two failed turns is simulated).

3. Tasks and Retrieval Infrastructure

CRAG-MM operationalizes three progressively complex tasks:

  • Single-Source Augmentation: Image-KG retrieval (68K images, 26K entities); input I, Q, R_I (K visually similar images + metadata); one question, one retrieval source.
  • Multi-Source Augmentation: Image-KG + web search (800K chunks, Brave); input I, Q, R_I, R_W (up to 50 web hits per query); adds noisy, multi-source web context.
  • Multi-Turn Conversation: same retrieval APIs as above; input I, Q_t, R_I, R_W, H_{<t} (conversation history); requires sustained context under topic drift.

Retrieval employs CLIP ViT-L/14@336px for image embeddings and ChromaDB; simulated web retrieval introduces ∼1:20 relevant/irrelevant ratio (image) and 1:2 (web) to reflect practical noise (Wang et al., 30 Oct 2025).
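The image-KG retrieval step can be sketched as a top-K cosine-similarity search over precomputed embeddings. This is a minimal illustration, not the benchmark's actual code: random vectors stand in for CLIP ViT-L/14 features, and a brute-force NumPy search stands in for ChromaDB's index.

```python
import numpy as np

def top_k_images(query_emb, index_embs, metadata, k=5):
    """Return the k most similar indexed images by cosine similarity.

    query_emb: (d,) embedding of the query image (CLIP features in CRAG-MM;
    random vectors stand in here). index_embs: (n, d) KG image embeddings.
    """
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = idx @ q                 # cosine similarity for each indexed image
    order = np.argsort(-sims)[:k]  # indices of the top-k matches
    return [(metadata[i], float(sims[i])) for i in order]

# Toy index: 100 fake "CLIP" embeddings with entity metadata (hypothetical).
rng = np.random.default_rng(0)
embs = rng.standard_normal((100, 768))
meta = [{"entity": f"entity_{i}"} for i in range(100)]
hits = top_k_images(rng.standard_normal(768), embs, meta, k=5)
```

In the benchmark itself, ChromaDB would serve the nearest-neighbor query and the returned metadata would carry the KG entity records described above.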

Formally, for single-turn with image-KG:

A = \text{MM-LLM}(I, Q, R_I), \quad R_I = \{(i_k, m_k)\}_{k=1}^{K} \leftarrow \text{ImageSearch}(I)

The pipeline extends to include web search and conversation history in subsequent tasks.
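The growing input context across the three tasks can be sketched with a simple container. This is a hypothetical structure for illustration only, not the benchmark's API; field names mirror the symbols in the formalism above (I, Q_t, R_I, R_W, H_{<t}).

```python
from dataclasses import dataclass, field

@dataclass
class TurnContext:
    """Inputs available to the MM-LLM at one turn (illustrative only)."""
    image: str                                       # I  (image reference)
    question: str                                    # Q_t
    image_hits: list = field(default_factory=list)   # R_I from ImageSearch
    web_hits: list = field(default_factory=list)     # R_W (Tasks 2-3 only)
    history: list = field(default_factory=list)      # H_{<t} (Task 3 only)

    def prompt_parts(self):
        """Flatten the context into tagged segments for the model prompt."""
        parts = [f"[history] {q} -> {a}" for q, a in self.history]
        parts += [f"[image] {self.image}", f"[question] {self.question}"]
        parts += [f"[kg] {h}" for h in self.image_hits]
        parts += [f"[web] {w}" for w in self.web_hits]
        return parts

# Task 3-style turn: KG hit plus prior dialogue history.
ctx = TurnContext("img_001.jpg", "What plant is this?",
                  image_hits=["Monstera deliciosa (KG match)"],
                  history=[("Is it an indoor plant?", "Yes")])
```

Task 1 populates only `image_hits`; Task 2 adds `web_hits`; Task 3 additionally threads `history` through each turn.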

4. Evaluation Protocol and Metrics

CRAG-MM enforces a strict truthfulness metric:

  • Each answer is labeled Perfect (+1), Missing/Refusal ("I don't know", 0), or Hallucinated/Incorrect (–1).
  • Truthfulness over N queries:

\text{Truth} = \frac{\#\text{perfect} \cdot 1 + \#\text{incorrect} \cdot (-1) + \#\text{missing} \cdot 0}{N}

  • Multi-turn sessions are truncated after two non-positive turns (mimicking real user abandonment). Per-session truthfulness averages across valid turns; final scores average across sessions (Wang et al., 30 Oct 2025).
  • Leaderboard and competition rankings (KDD Cup 2025) use this metric; final placement for top teams involves human correction of LLM-grader outputs (Zhang et al., 29 Jul 2025).
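The scoring rules above can be sketched in a few lines. This is a minimal reading of the protocol, assuming that two *consecutive* non-positive turns trigger the simulated abandonment; the label names and function signatures are illustrative, not the official grader.

```python
SCORE = {"perfect": 1, "missing": 0, "incorrect": -1}

def truthfulness(labels):
    """Single-turn truthfulness: mean of per-answer scores over N queries."""
    return sum(SCORE[l] for l in labels) / len(labels)

def session_truthfulness(labels):
    """Per-session score with the abandonment rule: stop grading after two
    consecutive non-positive turns, then average over the turns kept."""
    kept, bad_streak = [], 0
    for label in labels:
        kept.append(SCORE[label])
        bad_streak = 0 if SCORE[label] > 0 else bad_streak + 1
        if bad_streak == 2:          # simulated user abandonment
            break
    return sum(kept) / len(kept)

# A perfect turn, then a hallucination and a refusal: the session is cut
# before the final turn, and the fourth answer is never graded.
s = session_truthfulness(["perfect", "incorrect", "missing", "perfect"])
```

The final benchmark score would then average these per-session values across all sessions.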

5. Baseline, SOTA Results, and Leading Approaches

CRAG-MM results highlight significant challenges:

  • Baseline vision-only MM-LLM: Truth ≈ 18% (single-turn), 30% (multi-turn).
  • Task 1 (image-KG): Truth ≈ 22%.
  • Task 2 (image-KG + web): Straightforward RAG pipelines, including commercial SOTA systems (e.g., GPT-5, Gemini, Claude Sonnet with proprietary retrieval), yield ≈ 32% (single-turn) and 43–45% (multi-turn), with hallucination rates >25%.
  • Winning team (KDD Cup 2025, Llama-3.2-11B fine-tuned + distillation): +28% (single-turn) and +18% (multi-turn) truth improvement over baseline, with further hallucination reduction at the expense of increased "I don't know" responses.

Advanced solutions utilize multi-stage pipelines: query classification, retrieval diversification, reranking via fine-tuned LLMs, and integrated verification (e.g. Chain-of-Verification, self-consistency checks) (Zhang et al., 29 Jul 2025, Chen et al., 27 Jul 2025, Nakamizo et al., 16 Oct 2025). Hallucination-focused strategies achieve >90% reduction in false positives but induce a 40–60% drop in coverage, trading off recall for precision (Nakamizo et al., 16 Oct 2025, Chen et al., 27 Jul 2025).

6. Impact, Insights, and Limitations

CRAG-MM's competitive environment (≈1,000 participants, 5,000 submissions at KDD Cup 2025) catalyzed rapid method development. Key experimental insights include:

  • Truthfulness degrades substantially (up to 46% drop) on low-quality egocentric images, exposing the limits of current visual encoding.
  • Long-tail entities are consistently under-recognized and under-retrieved, emphasizing the need for improved entity linking and retriever coverage.
  • Multi-hop, aggregation, and comparison queries remain major failure points, evidencing open problems in cross-modal reasoning.
  • Session robustness is constrained: >25% of multi-turn sessions are truncated by SOTA models, with an average completed session length of 3.2 turns against a mean of 4.9 (Wang et al., 30 Oct 2025).

A key contribution of CRAG-MM is anchoring a reproducible, publicly available test suite with fine-grained scoring and retrieval APIs, enabling precise ablation and benchmarking of future methods.

7. Future Directions and Challenges

CRAG-MM's design foregrounds unresolved issues in MM-RAG:

  • Robustness to challenging, real-world egocentric input and long-tail entity exposures.
  • Mitigation of hallucinated outputs without sacrificing answer coverage—moving beyond aggressive abstention-only strategies.
  • End-to-end architectures capable of sustained, stateful dialogue over noisy multi-modal contexts.
  • Improved retrieval fusion and reasoning techniques for both structured and unstructured evidence.
  • Practical integration under the latency and hardware constraints of mobile/wearable devices.

Planned development includes refining the retrieval infrastructure, extending domain and image coverage, and driving progress on these fronts through annual community competitions (Wang et al., 30 Oct 2025, Zhang et al., 29 Jul 2025, Chen et al., 27 Jul 2025).


References:

  • Wang et al., 30 Oct 2025
  • Nakamizo et al., 16 Oct 2025
  • Zhang et al., 29 Jul 2025
  • Chen et al., 27 Jul 2025
