Slides-Align1.5k: Human-Aligned Slide Benchmark
- Slides-Align1.5k is a human-aligned benchmark dataset offering pairwise comparisons of AI-generated slide decks to reflect real user preferences.
- It comprises ~1,500 binary evaluations across seven presentation scenarios and nine slide-generation methods, including template-based, image-centric, and code-driven models.
- Benchmark metrics, such as Spearman’s rank correlation, demonstrate strong alignment between automated rankings and human judgment, guiding improvements in evaluation protocols.
Slides-Align1.5k is a human-aligned benchmark dataset devised for the comparative evaluation of automated slide generation systems under diverse real-world presentation scenarios. Conceived alongside the SlidesGen-Bench evaluation protocol, it provides a reference for quantifying human perceptual preferences across modern slide synthesis paradigms, which include template-based, code-driven, and image-centric generation models. The dataset is structured around pairwise judgments, ensuring that system ranking metrics can be reliably correlated with end-user experience as opposed to purely algorithmic or reference-driven scores. Slides-Align1.5k underpins research into the quantification and calibration of slide quality metrics and advances reproducibility standards for automated presentation generation (Yang et al., 14 Jan 2026).
1. Dataset Composition and Scope
Slides-Align1.5k comprises approximately 1,500 individual human-preference annotations, each constituting a binary comparison between two AI-generated slide decks. The decks are synthesized by nine distinct slide-generation methods:
- Gamma.ai (template-based)
- Kimi-Banana, Kimi-Smart, Kimi-Standard, NotebookLM, Skywork-Banana (all image-centric)
- Quark (template-based)
- Skywork, Zhipu PPT (code-driven, HTML/CSS)
The deck pairs represent seven presentation scenarios, sampled for ecological validity:
- Brand Promotion
- Business Plan
- Course Preparation (Topic Introduction)
- Personal Statement
- Product Launch
- Work Report
- Knowledge Teaching (Topic Introduction)
Decks are rendered as static images (PNG/JPEG); the metadata for each annotation records the topic identifier, the systems compared, the slide count of each deck, a timestamp, and the annotator ID. Each of the seven scenarios is sampled evenly, yielding ~214 evaluations per scenario.
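As a concrete illustration, a single annotation record could be laid out as in the Python sketch below; all field names and values are hypothetical, since the paper does not publish an exact schema.

```python
# Hypothetical structure of one Slides-Align1.5k annotation record.
# Field names and values are illustrative; the released dataset may use a different schema.
annotation = {
    "topic_id": "business_plan_017",   # presentation topic identifier
    "scenario": "Business Plan",       # one of the seven scenarios
    "system_a": "Skywork",             # first compared system (code-driven)
    "system_b": "Gamma.ai",            # second compared system (template-based)
    "slides_a": 12,                    # slide count of deck A
    "slides_b": 10,                    # slide count of deck B
    "preferred": "system_a",           # binary human preference
    "timestamp": "2025-11-03T14:22:05Z",
    "annotator_id": "anno_042",
}
```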
2. Annotation Protocol and Quality Assurance
For every comparison, annotators view the complete rendered decks produced by two competing systems for the same presentation instruction. The evaluation interface permits scrolling through both decks prior to selection, thereby enabling judgment based on holistic deck-level attributes—visual design, layout cohesion, readability (contrast and typography), and content completeness. Minor OCR artifacts and baseline template idiosyncrasies are deemed negligible; annotators are instructed to prioritize the overall viewing experience.
Quality control is implemented via several mechanisms:
- Each pair is assessed by a minimum of three independent annotators.
- Brief qualification tests (20 calibration pairs) precede formal annotation.
- Occasional gold-standard checks (trivial “tie” pairs) are inserted for consistency monitoring.
- Inter-annotator agreement is measured through the “Identical-ratio” (percentage agreement between annotator pairs); per-topic Spearman correlations provide further reliability diagnostics.
This suggests that, although detailed inter-annotator statistics such as Cohen's κ are not reported, indirect consistency measures are relied upon to mitigate annotation noise.
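As a minimal sketch, the pairwise-agreement Identical-ratio described above could be computed as follows; the function name and data layout are assumptions rather than the benchmark's published implementation.

```python
from itertools import combinations

def identical_ratio(labels_per_item):
    """Percentage agreement over all annotator pairs, pooled across items.

    `labels_per_item` maps an item id to the list of binary preferences
    ("A" or "B") given by its annotators. The layout is illustrative.
    """
    agree, total = 0, 0
    for labels in labels_per_item.values():
        for a, b in combinations(labels, 2):  # every pair of annotators on this item
            agree += (a == b)
            total += 1
    return agree / total if total else 0.0

# Example: three annotators per comparison, two comparisons.
print(identical_ratio({"pair_1": ["A", "A", "B"], "pair_2": ["B", "B", "B"]}))
# 4 agreeing annotator pairs out of 6 -> 0.667
```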
3. Dataset Structure and Statistical Summaries
Slides-Align1.5k is deployed strictly for evaluation purposes and does not define any explicit train/validation/test split. Systematic pairwise sampling across the nine generation methods (36 possible system pairs) ensures no system is overrepresented. Each scenario's judgments occupy roughly 14.3% of the dataset. For each system $s$ and scenario $c$, a "preference score" is defined as the fraction of direct head-to-head wins:

$$\mathrm{PrefScore}(s, c) = \frac{\#\{\text{wins of } s \text{ in } c\}}{\#\{\text{comparisons involving } s \text{ in } c\}}$$

Because each binary comparison yields exactly one win and one loss, the overall mean preference score is 0.50 by symmetry.
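The preference-score computation admits a direct implementation; the sketch below assumes judgments are stored as (scenario, system_a, system_b, winner) tuples, a layout chosen for illustration only.

```python
from collections import defaultdict

def preference_scores(judgments):
    """Fraction of head-to-head wins per (system, scenario).

    `judgments` is an iterable of (scenario, system_a, system_b, winner)
    tuples, where `winner` is either system_a or system_b. The layout is
    an assumption for illustration.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for scenario, sys_a, sys_b, winner in judgments:
        for system in (sys_a, sys_b):
            comparisons[(system, scenario)] += 1  # both systems took part
        wins[(winner, scenario)] += 1             # only the winner scores
    return {key: wins[key] / n for key, n in comparisons.items()}

scores = preference_scores([
    ("Work Report", "Skywork", "Quark", "Skywork"),
    ("Work Report", "Skywork", "Gamma.ai", "Gamma.ai"),
])
print(scores[("Skywork", "Work Report")])  # 0.5: one win in two comparisons
```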
4. Benchmarking Metrics and Evaluation Formulations
Slides-Align1.5k serves to calibrate computational slide evaluation metrics by explicit comparison against human preference rankings. The principal alignment metric is Spearman's rank correlation coefficient between the mean human ranking and the automated ranking across the $n = 9$ systems in each scenario:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where $d_i$ denotes the difference between the human and automatic ranks assigned to system $i$.
An Identical-ratio quantifies, per scenario, the percentage of exact matches between the human and automatic top-ranked systems. Alternative metrics such as Kendall's $\tau$ are acknowledged but not utilized in the benchmark.
On Slides-Align1.5k, the SlidesGen-Bench protocol (a joint Content, Aesthetics, and Editability metric) achieves a higher average Spearman correlation than baseline methods such as LLM-as-Judge and PPTAgent.
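In practice, the alignment computation reduces to correlating the human ranking of the nine systems with the ranking induced by an automated metric within each scenario; the SciPy sketch below uses made-up placeholder scores.

```python
from scipy.stats import spearmanr

# Hypothetical per-scenario scores for the nine systems (placeholder values only).
human_pref   = [0.72, 0.65, 0.58, 0.55, 0.50, 0.46, 0.41, 0.38, 0.30]
metric_score = [0.81, 0.70, 0.73, 0.60, 0.52, 0.49, 0.40, 0.44, 0.33]

# Spearman's rho correlates the rankings induced by the two score lists;
# a value near 1 indicates the automated metric orders systems like humans do.
rho, p_value = spearmanr(human_pref, metric_score)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```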
5. Recommended Use Cases and Applicability Constraints
The dataset is intended for:
- Comparative evaluation of new slide-generation models under human judgment.
- Calibration and fine-tuning of computational metrics to increase alignment with user experience.
- Investigation of trade-off dynamics among content faithfulness, aesthetics, and editability.
Limitations include:
- Exclusive coverage of static slides; absence of animation, transitions, and speaker notes.
- English-centric, globally popular domains, potentially limiting generality for other languages or technical/specialized subjects.
- Deck-level only: annotators evaluate full decks rather than individual slides, precluding granular slide-level analysis.
- Inter-annotator agreement metrics (e.g. Cohen’s κ) are not fully reported; thus, annotation reproducibility may be partially opaque.
6. Context Within Related Research
Slides-Align1.5k distinguishes itself from prior slide-alignment and multimodal ASR corpora in focus and structural properties. For example, the “Do Slides Help?” benchmark (Sinhamahapatra et al., 15 Oct 2025) uses automatic alignment with domain-specific terminology extraction for multimodal transcription evaluation and does not provide human-preference, deck-level comparisons of slide quality. Similarly, AutoLectures-1K (Holmberg, 5 May 2025) focuses on phrase-to-region annotation for synchronizing narration highlights in lecture videos but involves word-level, not deck-level, judgments. Slides-Align1.5k is unique in its emphasis on preference-based, deckwise comparative annotation spanning multiple generative system architectures.
Table: Annotated Slide Generation Benchmarks
| Dataset | Judged Unit | Size | Annotation Type |
|---|---|---|---|
| Slides-Align1.5k | Deck-level (pairwise preference) | ~1,500 | Human preference |
| AutoLectures-1K | Phrase-to-region (word-level) | 1 000 | Manual region selection |
| ACL Extension (Sinhamahapatra et al., 15 Oct 2025) | Segment-frame alignment | 10 talks | Automatic (domain terms) |
7. Significance and Prospective Directions
Slides-Align1.5k enables rigorous, reference-free benchmarking of slide-generation models, fostering advances in evaluation protocol design and metric calibration for presentation AI. Its controlled, scenario-diverse sampling and multi-system comparative judgments establish a high-fidelity resource for studying perceptual trade-offs and for quantifying alignment between automated and human slide quality assessments in computational presentation research (Yang et al., 14 Jan 2026).