Slides-Align1.5k: Human-Aligned Slide Benchmark
- Slides-Align1.5k is a human-aligned benchmark dataset offering pairwise comparisons of AI-generated slide decks to reflect real user preferences.
- It comprises ~1,500 binary evaluations across seven presentation scenarios and nine slide-generation methods, including template-based, image-centric, and code-driven models.
- Benchmark metrics, such as Spearman’s rank correlation, demonstrate strong alignment between automated rankings and human judgment, guiding improvements in evaluation protocols.
Slides-Align1.5k is a human-aligned benchmark dataset devised for the comparative evaluation of automated slide generation systems under diverse real-world presentation scenarios. Conceived alongside the SlidesGen-Bench evaluation protocol, it provides a reference for quantifying human perceptual preferences across modern slide synthesis paradigms, which include template-based, code-driven, and image-centric generation models. The dataset is structured around pairwise judgments, ensuring that system ranking metrics can be reliably correlated with end-user experience as opposed to purely algorithmic or reference-driven scores. Slides-Align1.5k underpins research into the quantification and calibration of slide quality metrics and advances reproducibility standards for automated presentation generation (Yang et al., 14 Jan 2026).
1. Dataset Composition and Scope
Slides-Align1.5k comprises approximately 1,500 individual human-preference annotations, each constituting a binary comparison between two AI-generated slide decks. The decks are synthesized by nine distinct slide-generation methods:
- Gamma.ai (template-based)
- Kimi-Banana, Kimi-Smart, Kimi-Standard, NotebookLM, Skywork-Banana (all image-centric)
- Quark (template-based)
- Skywork, Zhipu PPT (code-driven, HTML/CSS)
The deck pairs represent seven presentation scenarios, sampled for ecological validity:
- Brand Promotion
- Business Plan
- Course Preparation (Topic Introduction)
- Personal Statement
- Product Launch
- Work Report
- Knowledge Teaching (Topic Introduction)
Decks are rendered as static images (PNG/JPEG); the metadata for each annotation records the topic identifier, the systems compared, the slide count of each deck, a timestamp, and the annotator ID. Each of the seven scenarios is sampled evenly, yielding ~214 evaluations per scenario.
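As a concrete illustration, a single annotation record could be laid out as in the Python sketch below; all field names and values are hypothetical, since the paper does not publish an exact schema.

```python
# Hypothetical structure of one Slides-Align1.5k annotation record.
# Field names and values are illustrative; the released dataset may use a different schema.
annotation = {
    "topic_id": "business_plan_017",   # presentation topic identifier
    "scenario": "Business Plan",       # one of the seven scenarios
    "system_a": "Skywork",             # first compared system (code-driven)
    "system_b": "Gamma.ai",            # second compared system (template-based)
    "slides_a": 12,                    # slide count of deck A
    "slides_b": 10,                    # slide count of deck B
    "preferred": "system_a",           # binary human preference
    "timestamp": "2025-11-03T14:22:05Z",
    "annotator_id": "anno_042",
}
```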
2. Annotation Protocol and Quality Assurance
For every comparison, annotators view the complete rendered decks produced by two competing systems for the same presentation instruction. The evaluation interface permits scrolling through both decks prior to selection, thereby enabling judgment based on holistic deck-level attributes—visual design, layout cohesion, readability (contrast and typography), and content completeness. Minor OCR artifacts and baseline template idiosyncrasies are deemed negligible; annotators are instructed to prioritize the overall viewing experience.
Quality control is implemented via several mechanisms:
- Each pair is assessed by a minimum of three independent annotators.
- Brief qualification tests (20 calibration pairs) precede formal annotation.
- Occasional gold-standard checks (trivial “tie” pairs) are inserted for consistency monitoring.
- Inter-annotator agreement is measured through the “Identical-ratio” (percentage agreement between annotator pairs); per-topic Spearman correlations provide further reliability diagnostics.
This suggests that, although detailed inter-annotator statistics such as Cohen's κ are not reported, indirect consistency measures are relied upon to mitigate annotation noise.
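As a minimal sketch, the pairwise-agreement Identical-ratio described above could be computed as follows; the function name and data layout are assumptions rather than the benchmark's published implementation.

```python
from itertools import combinations

def identical_ratio(labels_per_item):
    """Percentage agreement over all annotator pairs, pooled across items.

    `labels_per_item` maps an item id to the list of binary preferences
    ("A" or "B") given by its annotators. The layout is illustrative.
    """
    agree, total = 0, 0
    for labels in labels_per_item.values():
        for a, b in combinations(labels, 2):  # every pair of annotators on this item
            agree += (a == b)
            total += 1
    return agree / total if total else 0.0

# Example: three annotators per comparison, two comparisons.
print(identical_ratio({"pair_1": ["A", "A", "B"], "pair_2": ["B", "B", "B"]}))
# 4 agreeing annotator pairs out of 6 -> 0.667
```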
3. Dataset Structure and Statistical Summaries
Slides-Align1.5k is deployed strictly for evaluation purposes and does not define any explicit train/validation/test split. Systematic pairwise sampling across the nine generation methods (36 possible system pairs) ensures no system is overrepresented. Each scenario's judgments occupy roughly 14.3% of the dataset. For each system $s$ and scenario $c$, a "preference score" is defined as the fraction of direct head-to-head wins:

$$\mathrm{PrefScore}(s, c) = \frac{\#\{\text{wins of } s \text{ in } c\}}{\#\{\text{comparisons involving } s \text{ in } c\}}$$

Because each binary comparison yields exactly one win and one loss, the overall mean preference score is 0.50 by symmetry.
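The preference-score computation admits a direct implementation; the sketch below assumes judgments are stored as (scenario, system_a, system_b, winner) tuples, a layout chosen for illustration only.

```python
from collections import defaultdict

def preference_scores(judgments):
    """Fraction of head-to-head wins per (system, scenario).

    `judgments` is an iterable of (scenario, system_a, system_b, winner)
    tuples, where `winner` is either system_a or system_b. The layout is
    an assumption for illustration.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for scenario, sys_a, sys_b, winner in judgments:
        for system in (sys_a, sys_b):
            comparisons[(system, scenario)] += 1  # both systems took part
        wins[(winner, scenario)] += 1             # only the winner scores
    return {key: wins[key] / n for key, n in comparisons.items()}

scores = preference_scores([
    ("Work Report", "Skywork", "Quark", "Skywork"),
    ("Work Report", "Skywork", "Gamma.ai", "Gamma.ai"),
])
print(scores[("Skywork", "Work Report")])  # 0.5: one win in two comparisons
```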
4. Benchmarking Metrics and Evaluation Formulations
Slides-Align1.5k serves to calibrate computational slide evaluation metrics by explicit comparison against human preference rankings. The principal alignment metric is Spearman's rank correlation coefficient between the mean human ranking and the automated ranking across the $n = 9$ systems in each scenario:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where $d_i$ denotes the difference between the human and automatic ranks assigned to system $i$.
An Identical-ratio quantifies, per scenario, the percentage of exact matches between the human and automatic top-ranked systems. Alternative metrics such as Kendall's $\tau$ are acknowledged but not utilized in the benchmark.
On Slides-Align1.5k, the SlidesGen-Bench protocol (a joint Content, Aesthetics, and Editability metric) achieves a higher average Spearman correlation than baseline methods such as LLM-as-Judge and PPTAgent.
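In practice, the alignment computation reduces to correlating the human ranking of the nine systems with the ranking induced by an automated metric within each scenario; the SciPy sketch below uses made-up placeholder scores.

```python
from scipy.stats import spearmanr

# Hypothetical per-scenario scores for the nine systems (placeholder values only).
human_pref   = [0.72, 0.65, 0.58, 0.55, 0.50, 0.46, 0.41, 0.38, 0.30]
metric_score = [0.81, 0.70, 0.73, 0.60, 0.52, 0.49, 0.40, 0.44, 0.33]

# Spearman's rho correlates the rankings induced by the two score lists;
# a value near 1 indicates the automated metric orders systems like humans do.
rho, p_value = spearmanr(human_pref, metric_score)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```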
5. Recommended Use Cases and Applicability Constraints
The dataset is intended for:
- Comparative evaluation of new slide-generation models under human judgment.
- Calibration and fine-tuning of computational metrics to increase alignment with user experience.
- Investigation of trade-off dynamics among content faithfulness, aesthetics, and editability.
Limitations include:
- Exclusive coverage of static slides; absence of animation, transitions, and speaker notes.
- English-centric, globally popular domains, potentially limiting generality for other languages or technical/specialized subjects.
- Deck-level only: annotators evaluate full decks rather than individual slides, precluding granular slide-level analysis.
- Inter-annotator agreement metrics (e.g. Cohen’s κ) are not fully reported; thus, annotation reproducibility may be partially opaque.
6. Context Within Related Research
Slides-Align1.5k distinguishes itself from prior slide-alignment and multimodal ASR corpora in focus and structural properties. For example, the “Do Slides Help?” benchmark (Sinhamahapatra et al., 15 Oct 2025) uses automatic alignment with domain-specific terminology extraction for multimodal transcription evaluation and does not provide human-preference, deck-level comparisons of slide quality. Similarly, AutoLectures-1K (Holmberg, 5 May 2025) focuses on phrase-to-region annotation for synchronizing narration highlights in lecture videos but involves word-level, not deck-level, judgments. Slides-Align1.5k is unique in its emphasis on preference-based, deckwise comparative annotation spanning multiple generative system architectures.
Table: Annotated Slide Generation Benchmarks
| Dataset | Judged Unit | Size | Annotation Type |
|---|---|---|---|
| Slides-Align1.5k | Deck-level (pairwise preference) | ~1,500 | Human preference |
| AutoLectures-1K | Phrase-to-region (word-level) | 1 000 | Manual region selection |
| ACL Extension (Sinhamahapatra et al., 15 Oct 2025) | Segment-frame alignment | 10 talks | Automatic (domain terms) |
7. Significance and Prospective Directions
Slides-Align1.5k enables rigorous, reference-free benchmarking of slide-generation models, fostering advances in evaluation protocol design and metric calibration for presentation AI. Its controlled, scenario-diverse sampling and multi-system comparative judgments establish a high-fidelity resource for studying perceptual trade-offs and for quantifying alignment between automated and human slide quality assessments in computational presentation research (Yang et al., 14 Jan 2026).