SlideChain Slides Dataset
- The SlideChain Slides Dataset is a semantically annotated, blockchain-anchored collection of 1,117 high-resolution medical imaging lecture slides enabling verifiable multimodal semantic extraction.
- The dataset employs four state-of-the-art vision-language models to generate consistent concepts, relational triples, and deterministic JSON records with cryptographic hashing.
- Its structured organization and tamper-evident provenance support robust benchmarking, quality control, and auditability for educational AI and regulatory applications.
The SlideChain Slides Dataset is a semantically annotated, blockchain-anchored collection of 1,117 high-resolution medical imaging lecture slides from a university course, designed to enable verifiable multimodal semantic extraction, reproducibility, and auditability for educational AI systems. Each slide image is paired with a transcript snippet and systematically processed using four state-of-the-art vision–language models (VLMs), producing structured concepts and relational triples documented in cryptographically hashed off-chain JSON records. The dataset serves as a foundation for benchmarking semantic disagreement, model variability, provenance research, and trustworthy pipeline development in STEM instructional contexts (Manik et al., 25 Dec 2025).
1. Dataset Composition and Organization
The corpus encompasses 23 full-length university lectures in medical imaging, covering x-ray physics, computed tomography (CT) reconstruction, magnetic resonance imaging (MRI) principles, ultrasound imaging, PET/SPECT, and multidimensional signal processing. Slide images are provided in JPEG or PNG format at standard university resolution (typically 1920×1080 or higher), with each slide uniquely identified by a (lecture_id, slide_id) tuple.
- Total slides: 1,117
- Directory structure: Separate folders for each lecture containing subfolders for images, transcript snippets, and provenance JSONs.
- File naming convention: Within each lecture, slides are labeled Slide1.jpg, Slide2.jpg, … and matched to Slide1.txt and Slide1.json.
| Coverage | Count |
|---|---|
| Lectures (x-ray physics, CT reconstruction, MRI, ultrasound, PET/SPECT, multidimensional signal processing) | 23 |
| Slides (total, all lectures) | 1,117 |
This organization ensures precise, reproducible referencing for semantic extraction, model comparison, and provenance anchoring.
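The (lecture_id, slide_id) referencing scheme can be sketched as a small path resolver (a minimal sketch: the Images/Texts/Provs folder names follow the layout shown in Section 2, while the root directory and the exact spacing of lecture folder names are assumptions):

```python
from pathlib import Path

def slide_artifacts(root: Path, lecture_id: int, slide_id: int) -> dict:
    # Map a (lecture_id, slide_id) tuple to the three per-slide artifacts:
    # image, transcript snippet, and provenance JSON.
    lecture = root / f"Lecture {lecture_id}"
    return {
        "image": lecture / "Images" / f"Slide{slide_id}.jpg",
        "text":  lecture / "Texts"  / f"Slide{slide_id}.txt",
        "json":  lecture / "Provs"  / f"Slide{slide_id}.json",
    }

paths = slide_artifacts(Path("Lectures"), 1, 1)
```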
2. Data Modalities and Provenance Formats
Each slide is associated with three artifacts:
- Image: High-resolution JPEG/PNG
- Transcript: Plain-text snippet (.txt)
- Provenance record: JSON file (.json) per slide, documenting multimodel semantic annotations and metadata
Sample directory layout:
```
Lectures/
├─ Lecture 1/
│  ├─ Images/Slide1.jpg
│  ├─ Texts/Slide1.txt
│  └─ Provs/Slide1.json
└─ …
```
The provenance JSON adopts a deterministic schema with sorted keys to guarantee reproducible cryptographic hashing. Each file records:
- Lecture and slide identifiers
- Model outputs (concepts, relational triples, evidence, raw output)
- Metadata (timestamps, pipeline source, hash input format)
- Paths referencing image, text, and JSON files
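The deterministic serialization and hashing step can be sketched as follows (a minimal sketch: hashlib's `sha3_256` stands in for the keccak256 digest the paper anchors on-chain, since keccak256 requires a third-party package such as pycryptodome or eth-hash):

```python
import hashlib
import json

def canonical_bytes(record: dict) -> bytes:
    # Serialize with lexicographically sorted keys and fixed separators
    # so byte-identical output is reproduced across runs and machines.
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def slide_hash(record: dict) -> str:
    # Assumption: sha3_256 as a stand-in for the keccak256 digest
    # registered on-chain (the two algorithms differ in padding).
    return "0x" + hashlib.sha3_256(canonical_bytes(record)).hexdigest()

record = {"lecture": "Lecture 1", "slide_id": 1, "models": {}}
```

Because keys are sorted before serialization, the digest is independent of the insertion order of the record's fields, which is what makes the hash reproducible.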
3. Semantic Annotation and Provenance Structure
Semantic annotations extracted by each VLM are organized as follows:
- Concepts: Flat lists of (category, term) pairs; categories include modality, anatomy, workflow, physics, software, etc.
- Relational triples: Structured as (subject, predicate, object) with optional confidence scores to capture inter-concept relationships.
- Evidence: Quoted textual snippets or model-generated explanations supporting each annotation.
Example JSON annotation:
```json
{
  "lecture": "Lecture 1",
  "slide_id": 1,
  "models": {
    "InternVL3-14B": {
      "concepts": [ {"category": "modality", "term": "medical imaging"}, … ],
      "triples": [ {"s": "Medical imaging", "p": "uses", "o": "science and engineering", "confidence": 1.0}, … ],
      "evidence": [ … ],
      "raw_output": "…"
    }
    // Qwen2-VL-7B, Qwen3-VL-4B, LLaVA-OneVision entries follow the same shape
  },
  "paths": {
    "image": "…/Lecture1/Images/Slide1.jpg",
    "text": "…/Lecture1/Texts/Slide1.txt",
    "json": "…/Lecture1/Provs/Slide1.json"
  },
  "metadata": {
    "timestamp": "2024-04-10T15:23:45Z",
    "source": "SlideChain v1.0 pipeline",
    "hash_input_format": "lexicographically sorted JSON"
  }
}
```
4. Blockchain Registration and Tamper-Evident Provenance
The integrity of each semantic annotation is cryptographically anchored on a local EVM-compatible blockchain using a minimal Solidity smart contract. The per-slide hash is the keccak256 digest of the slide's canonical provenance JSON (serialized with lexicographically sorted keys) and is registered and queried through three contract functions:
- registerSlide: records a slide's hash, URI, and registration metadata on-chain
- getSlide: returns the stored SlideRecord for a given slide
- isRegistered: returns a bool indicating whether a slide has been anchored
Smart contract pseudocode:
```solidity
struct SlideRecord {
    uint256 lectureId;
    uint256 slideId;
    string  slideHash;   // 0x… keccak256 digest of the off-chain JSON
    string  uri;         // off-chain JSON path
    uint256 timestamp;   // block.timestamp at registration
    address registrant;
}

mapping(bytes32 => SlideRecord) records; // key = keccak256(lectureId, slideId)
```
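A minimal off-chain verification sketch against such a record (assuming the stored hash is computed over the provenance JSON file's bytes, which are already emitted with sorted keys, and again using `sha3_256` as a stand-in for keccak256):

```python
import hashlib
from pathlib import Path

def verify_slide(json_path: Path, on_chain_hash: str) -> bool:
    # Recompute the digest of the off-chain provenance JSON and compare
    # it to the slideHash stored in the on-chain SlideRecord. Any edit
    # to the file changes the digest, making tampering detectable.
    digest = "0x" + hashlib.sha3_256(json_path.read_bytes()).hexdigest()
    return digest == on_chain_hash
```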
This framework supports tamper-evident, persistent semantic baselines, enabling deterministic retrieval and verification of extracted annotations.
5. Annotation Methodology and Quality Control
Annotations are generated by four vision–language models (VLMs):
- InternVL3-14B
- Qwen2-VL-7B
- Qwen3-VL-4B
- LLaVA-OneVision (backed by Qwen2-7B)
The four-step pipeline per (slide, model) consists of:
- Prompt construction (slide image + transcript snippet)
- Model inference producing lists, dicts, or free-text outputs
- Normalization: null-safe parsing, flattening, canonical triple extraction, lowercasing, whitespace normalization, deduplication
- Deterministic JSON generation with key sorting for cryptographic hashing
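The normalization step above can be sketched as follows (helper and field names are illustrative, not taken from the SlideChain codebase):

```python
import re

def norm(value) -> str:
    # Null-safe lowercasing and whitespace normalization.
    return re.sub(r"\s+", " ", (value or "").strip().lower())

def normalize_output(raw: dict) -> dict:
    # Null-safe parsing plus order-preserving deduplication of
    # concepts and canonical (s, p, o) triples.
    concepts, triples, seen_c, seen_t = [], [], set(), set()
    for c in raw.get("concepts") or []:
        key = (norm(c.get("category")), norm(c.get("term")))
        if all(key) and key not in seen_c:
            seen_c.add(key)
            concepts.append({"category": key[0], "term": key[1]})
    for t in raw.get("triples") or []:
        key = (norm(t.get("s")), norm(t.get("p")), norm(t.get("o")))
        if all(key) and key not in seen_t:
            seen_t.add(key)
            triples.append({"s": key[0], "p": key[1], "o": key[2]})
    return {"concepts": concepts, "triples": triples}
```

Normalizing before serialization is what allows independent extraction runs to produce bit-identical JSON, and hence identical hashes.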
Inter-model agreement is evaluated using:
- Concept disagreement: 1 − mean pairwise Jaccard index over the models' concept sets
- Triple disagreement: defined analogously over the models' triple sets
- Pairwise Jaccard index: J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j| for each model pair (i, j) (analogous for triple sets T_i, T_j)
Quality control methods include independent extraction runs yielding bit-identical JSON files (Jaccard = 1.0 for every slide/model pair) and tamper-injection testing on randomly selected JSONs, with every modification detected.
6. Accessibility, Usage Practices, and Downstream Applications
No official train/validation/test splits are provided, reflecting the dataset’s primary purpose for semantic extraction, audit, and reproducibility, rather than supervised learning or model training.
- Access: Fully available via public GitHub repository (subject to slide text/image licensing restrictions).
- Licensing: MIT License for code and JSON artifacts; slide image/text governed by the copyright policy of contributing instructors.
- Citation: Manik, M. M. H., Islam, M. Z., & Wang, G. (2024), "SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration" (Manik et al., 25 Dec 2025).
Potential downstream applications include:
- VLM benchmarking through quantitative disagreement and Jaccard analysis
- Identification of slides with high cross-model disagreement for targeted review or improved prompt design
- Long-term studies of semantic drift and reproducibility across model/prompt evolution
- Tamper-proof documentation for compliance in medical and legal education domains
- Classification of semantic stability (stable/moderate/unstable) using concept disagreement thresholds
- Automation of integrity checks via on-chain verification scripts and dashboards visualizing agreement metrics
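The stability classification mentioned above can be sketched as a simple threshold rule (the 0.3 / 0.6 cut-offs are illustrative assumptions; the dataset does not fix specific threshold values here):

```python
def stability(concept_disagreement: float,
              moderate: float = 0.3, unstable: float = 0.6) -> str:
    # Bucket a slide by its concept disagreement score.
    # Thresholds are hypothetical, chosen only for illustration.
    if concept_disagreement < moderate:
        return "stable"
    if concept_disagreement < unstable:
        return "moderate"
    return "unstable"
```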
7. Context and Research Significance
The SlideChain Slides Dataset represents the first systematic framework for constructing, tracking, and auditing multimodal semantic knowledge extracted from STEM lecture slides using blockchain-backed provenance. The pronounced cross-model discrepancies in concept overlap and relational triple agreement documented in the dataset suggest that robust semantic verification is critical in high-stakes educational and professional contexts. The reproducibility and tamper-evidence features provide a foundation for trustworthy instructional pipelines and semantic baselines as AI systems evolve. A plausible implication is the enablement of persistent audit trails and compliance monitoring for AI-generated content in domains subject to regulatory or pedagogical scrutiny.