
SlideVQA & DUDE: Multimodal Document Benchmarks

Updated 24 January 2026
  • SlideVQA and DUDE benchmarks are foundational datasets for multimodal document understanding, emphasizing structured layouts and cross-context reasoning.
  • They cover diverse real-world materials, from presentation slides to business documents, enabling varied QA tasks and precise data extraction.
  • Advanced metrics like EM, F1, and ANLS, coupled with rigorous annotation protocols, drive progress in evaluating large multimodal language models.

SlideVQA and DUDE are two foundational benchmarks in the landscape of multimodal document understanding. SlideVQA centers on visual question answering (VQA) over presentation slide decks, emphasizing cross-slide reasoning and structured layout comprehension, whereas DUDE (Document Understanding Dataset and Evaluation) is a multi-domain, multi-page dataset that targets robust document-level VQA, key information extraction, and layout analysis in realistic business and technical documents. Both benchmarks are essential for assessing and pushing the capabilities of contemporary multimodal LLMs (MLLMs) in high-complexity, real-world document scenarios.

1. Dataset Scope and Core Characteristics

SlideVQA is constructed from diverse presentation slide decks, primarily sourced from SlideShare and public academic repositories, filtered for English-language content and layout quality. It comprises ≈50,000 unique slide images, annotated with ≈100,000 QA pairs, covering factual, counting, and cross-slide reasoning tasks. Special emphasis is placed on multi-slide inference, which requires aggregating and synthesizing evidence from multiple images rather than isolated single-slide reasoning. Slides are annotated with dense bounding boxes over a nine-class schema (title, page-text, obj-text, caption, other-text, diagram, table, image, and figure) inspired by the SPaSe taxonomy (Tanaka et al., 2023, Li et al., 2024).

DUDE, conversely, spans over 5,000 real-world documents from more than 13 industries, with page counts per document averaging 6 but extending beyond 100. It includes ≈41,500 question–answer pairs, with ≈91% unique questions and ≈71% unique answers, reflecting exceptional structural and linguistic diversity. DUDE’s documents originate from born-digital sources and scans, encompassing tables, figures, stamps, handwriting, and multi-column layouts, with task coverage including document-level QA, key-value pair extraction, and layout structure identification (Landeghem et al., 2023, Li et al., 2024).

| Benchmark | Domain | Pages per Doc | Main Tasks |
|---|---|---|---|
| SlideVQA | Slides/presentations | ≈1–30 | Cross-slide VQA, object localization, synthesis |
| DUDE | Business/technical | 1–100+ | Doc-level QA, table QA, layout, KIE |

2. Task Definitions and Annotation Schema

SlideVQA defines three principal subtasks:

  • Textual Content QA: Extraction and reasoning over body text, bullet lists, and slide metadata.
  • Layout & Object QA: Localization of non-textual elements (tables, charts, figures) and questions involving these.
  • Multi-Slide Reasoning: Questions requiring information from two or more slides, often involving comparisons or synthesis (e.g., “Which slide introduces concept X after the results chart?”).

Questions are stratified into literal (verbatim lookup), inference (integration of multiple cues), and synthesis (higher-level abstraction of slide structure and content).

DUDE encompasses:

  • Document-Level QA: Both single- and multi-page questions, often requiring information integration across disjoint pages (e.g., financial summaries referencing multiple statements).
  • Table and Key–Value Extraction: Structured data retrieval and aggregation (e.g., finding all due dates for amounts over a certain threshold).
  • Layout Analysis: Classification of physical regions such as tables, figure captions, etc.

DUDE supports a broader answer format spectrum: extractive (span-based), abstractive (free-form), list, yes/no, and unanswerable, with document annotation in JSON comprising per-question page references and bounding boxes (Landeghem et al., 2023, Li et al., 2024).
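To make the annotation format concrete, here is a hypothetical sketch of what one DUDE-style record could look like; the field names and values below are illustrative inventions, not the dataset's exact schema:

```python
# Hypothetical sketch of a DUDE-style annotation record. All field names and
# values are illustrative, not the dataset's actual JSON schema.
sample_annotation = {
    "docId": "dude-000123",
    "question": "What is the total amount due on the final invoice?",
    "answers": ["$4,250.00"],
    # extractive | abstractive | list | yes/no | not-answerable
    "answer_type": "extractive",
    # per-question page references
    "answer_page_idx": [5],
    # evidence regions on the referenced pages (normalized coordinates)
    "bounding_boxes": [
        {"page": 5, "left": 0.62, "top": 0.81, "width": 0.14, "height": 0.03}
    ],
}
```

A record like this pairs each question with its answer format, the pages that support it, and the bounding boxes grounding the answer, which is what enables both extractive and region-level supervision.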

3. Evaluation Protocols and Metrics

SlideVQA adopts standard QA metrics:

  • Exact Match (EM):

$\mathrm{EM} = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[\hat a_i = a_i]$

where $\hat a_i$ and $a_i$ are the normalized predicted and reference answers.

  • Token-level F1:

$F1_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}$

with $P_i$ and $R_i$ the precision and recall over answer tokens.

  • Intersection-over-Union (IoU): For object localization, IoU ≥ 0.5 is a correct localization.
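The three SlideVQA metrics above can be sketched in a few lines; this is an illustrative implementation following the definitions given here (with SQuAD-style answer normalization), not the benchmark's official scorer:

```python
# Illustrative implementations of EM, token-level F1, and IoU, following the
# metric definitions above. Not the official SlideVQA evaluation code.
import re
import string


def normalize(ans: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    ans = ans.lower()
    ans = "".join(ch for ch in ans if ch not in set(string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the normalized prediction equals the normalized reference."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over normalized answers."""
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    g_counts: dict[str, int] = {}
    for t in g_toks:
        g_counts[t] = g_counts.get(t, 0) + 1
    common = 0
    for t in p_toks:
        if g_counts.get(t, 0) > 0:
            common += 1
            g_counts[t] -= 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p_toks), common / len(g_toks)
    return 2 * precision * recall / (precision + recall)


def iou(box_a, box_b) -> float:
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A localization is then counted correct when `iou(pred_box, gold_box) >= 0.5`.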

DUDE introduces richer and more calibration-aware metrics:

  • ANLS (Average Normalized Levenshtein Similarity): Allows partial credit in free-form responses, with match credit thresholded at $\tau = 0.5$:

$\mathrm{NLS}(g, \hat P) = 1 - \frac{\mathrm{Levenshtein}(g, \hat P)}{\max(|g|, |\hat P|, 1)}$

$\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^N \max_{g \in G_i} s(g, \hat P_i)$

where $G_i$ is the set of reference answers for question $i$, and $s(g, \hat P) = \mathrm{NLS}(g, \hat P)$ if $\mathrm{NLS} \ge \tau$, else $0$.

  • ECE (Expected Calibration Error): Measures calibration of confidence scores against observed accuracy.
  • AURC (Area-Under-Risk-Coverage Curve): Integrates error risk over the range of coverage as a function of confidence ranking.
  • Additional metrics include Table Extraction Accuracy and Layout IoU for structured extraction tasks (Landeghem et al., 2023).
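As an illustration, the NLS/ANLS definitions above can be implemented directly; this is a minimal sketch rather than the official DUDE evaluation code:

```python
# Illustrative ANLS scorer, following the NLS and threshold definitions
# above (tau = 0.5). Not the official DUDE evaluation code.
def levenshtein(a: str, b: str) -> int:
    """Classic row-by-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def nls(gold: str, pred: str) -> float:
    """Normalized Levenshtein similarity between one reference and a prediction."""
    return 1 - levenshtein(gold, pred) / max(len(gold), len(pred), 1)


def anls(references: list[list[str]], predictions: list[str],
         tau: float = 0.5) -> float:
    """Average over questions of the best NLS against any reference,
    zeroed out below the threshold tau."""
    total = 0.0
    for refs, pred in zip(references, predictions):
        best = max(nls(g, pred) for g in refs)
        total += best if best >= tau else 0.0
    return total / len(predictions)
```

Because of the threshold, near-misses (e.g. an OCR-mangled answer) still earn partial credit, while answers more than half-wrong by edit distance score zero.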

4. Benchmark Construction and Annotation Process

SlideVQA slides were crowdsourced, with multiple questions per slide and iterative expert validation for answer and bounding box correctness. Dense region annotation was carried out for core visual elements, enabling both span-based and region-level supervision, and the question set was post-processed to maximize reasoning diversity.

DUDE’s documents were collected from publicly accessible corpora (archive.org, documentcloud.org, Wikimedia), spanning 150+ years of document formats. Dual-phase annotation involved initial crowd QAs and bounding boxes, followed by expert curation for consistency and coverage. Document splits are stratified for coverage across industry domains and document types.

| Benchmark | Annotation | QA Diversity | Layout Labels |
|---|---|---|---|
| SlideVQA | Crowd + expert | High | 9 classes (e.g., table) |
| DUDE | Crowd + expert | Very high | Table, figure, key-value |

5. Baseline and State-of-the-Art Performance

SlideVQA: The baseline on SlideVQA combines standard vision-language architectures (CLIP, LayoutLMv2, T5, UniVL, Fusion-in-Decoder) for evidence selection and QA (Tanaka et al., 2023). Performance on test splits for EM and F1 was observed to increase as model architecture incorporated layout cues and cross-slide context, with advanced models showing up to ≈68.9% F1 using efficient page retrieval (e.g., AVIR) (Li et al., 17 Jan 2026). Models such as GPT-4V yield higher literal and inference accuracy (e.g., EM ≈64.1%), but performance degrades on multi-slide synthesis tasks (≈54.3% accuracy) (Li et al., 2024).

DUDE: Baselines include InstructBLIP, GPT-4V, Gemini, and Arctic-TILT, evaluated in zero- and few-shot regimes. ANLS is generally higher for extractive and abstractive questions (≈67.5%), with substantial drops on list and unanswerable types (≤12%) for closed-format models lacking format-specific fine-tuning (Li et al., 17 Jan 2026). Composite Document Score (CDS) aggregates metrics for overall evaluation (e.g., GPT-4V achieves CDS ≈61.2%) (Li et al., 2024).

6. Comparative Analysis and Research Implications

A direct comparison reveals:

  • Context Breadth: SlideVQA prioritizes tightly structured, short-context multi-slide reasoning. DUDE requires models capable of long-range, multi-domain context aggregation, with high layout and style variance.
  • QA Diversity: DUDE’s list and unanswerable forms, as well as mixed extractive/abstractive types, are more challenging for generalist MLLMs than SlideVQA’s primarily extractive and counting-focused QAs.
  • Calibration and Risk: DUDE is unique in introducing risk-coverage and calibration diagnostics, informing model safety and triage in critical document tasks.
  • Domain Adaptation: DUDE supports robust testing for zero-shot, few-shot, and domain-specific adaptation, while SlideVQA is optimized for efficient reasoning within a constrained genre.

Practical selection depends on downstream needs: general-purpose, domain-diverse, and robustness-required document QA tasks should prioritize DUDE, while SlideVQA remains the benchmark of choice for cross-slide synthesis in educational or corporate presentation contexts (Landeghem et al., 2023, Li et al., 2024).

7. Challenges and Future Directions

Persistent challenges include:

  • Hierarchical Layout Modeling: Both benchmarks highlight the limits of existing MLLMs in parsing and utilizing document/slide structure, especially for cross-page or cross-slide linkages.
  • OCR and Visual Parsing: Upstream OCR errors and layout parsing inaccuracies propagate to QA performance; joint modeling remains an open research axis.
  • Long-Context Reasoning: Sparse memory and retrieval mechanisms have shown promise in efficient long-document QA (e.g., AVIR reduces processed context by 85%) but often require additional adaptation for fully open-ended questions (Li et al., 17 Jan 2026).
  • Evaluation Richness: Both benchmarks employ EM/F1, but future work may extend to chain-of-thought and grounding confidence measures, as well as robustness to adversarial layout and content shifts (Li et al., 2024).

A plausible implication is that multimodal document benchmarks will increasingly integrate real-world layout perturbations, cross-lingual variation, and end-to-end human-in-the-loop evaluation to drive further gains in reliability and applicability of document AI systems. Ongoing research in retrieval, layout semantics, and calibration-aware evaluation—embodied in these benchmarks—remains critical for advancing trustworthy and efficient document intelligence (Landeghem et al., 2023, Tanaka et al., 2023, Li et al., 17 Jan 2026, Li et al., 2024).
