SlideChain Slides Dataset
- The SlideChain Slides Dataset is a semantically annotated, blockchain-anchored collection of 1,117 high-resolution medical imaging lecture slides enabling verifiable multimodal semantic extraction.
- The dataset employs four state-of-the-art vision-language models to generate consistent concepts, relational triples, and deterministic JSON records with cryptographic hashing.
- Its structured organization and tamper-evident provenance support robust benchmarking, quality control, and auditability for educational AI and regulatory applications.
The SlideChain Slides Dataset is a semantically annotated, blockchain-anchored collection of 1,117 high-resolution medical imaging lecture slides from a university course, designed to enable verifiable multimodal semantic extraction, reproducibility, and auditability for educational AI systems. Each slide image is paired with a transcript snippet and systematically processed using four state-of-the-art vision–language models (VLMs), producing structured concepts and relational triples documented in cryptographically hashed off-chain JSON records. The dataset serves as a foundation for benchmarking semantic disagreement, model variability, provenance research, and trustworthy pipeline development in STEM instructional contexts (Manik et al., 25 Dec 2025).
1. Dataset Composition and Organization
The corpus encompasses 23 full-length university lectures in medical imaging, covering x-ray physics, computed tomography (CT) reconstruction, magnetic resonance imaging (MRI) principles, ultrasound imaging, PET/SPECT, and multidimensional signal processing. Slide images are provided in JPEG or PNG format at standard university resolution (typically 1920×1080 or higher), with each slide uniquely identified by a (lecture_id, slide_id) tuple.
- Total slides: 1,117
- Directory structure: Separate folders for each lecture containing subfolders for images, transcript snippets, and provenance JSONs.
- File naming convention: Within each lecture, slides are labeled Slide1.jpg, Slide2.jpg, … and matched to Slide1.txt and Slide1.json.
| Coverage | Count |
|---|---|
| Lectures (x-ray physics, CT reconstruction, MRI, ultrasound, PET/SPECT, multidimensional signal processing) | 23 |
| Slides (total, all lectures) | 1,117 |
This organization ensures precise, reproducible referencing for semantic extraction, model comparison, and provenance anchoring.
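The (lecture_id, slide_id) referencing scheme can be sketched as a small path resolver (a minimal sketch: the Images/Texts/Provs folder names follow the layout shown in Section 2, while the root directory and the exact spacing of lecture folder names are assumptions):

```python
from pathlib import Path

def slide_artifacts(root: Path, lecture_id: int, slide_id: int) -> dict:
    # Map a (lecture_id, slide_id) tuple to the three per-slide artifacts:
    # image, transcript snippet, and provenance JSON.
    lecture = root / f"Lecture {lecture_id}"
    return {
        "image": lecture / "Images" / f"Slide{slide_id}.jpg",
        "text":  lecture / "Texts"  / f"Slide{slide_id}.txt",
        "json":  lecture / "Provs"  / f"Slide{slide_id}.json",
    }

paths = slide_artifacts(Path("Lectures"), 1, 1)
```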
2. Data Modalities and Provenance Formats
Each slide is associated with three artifacts:
- Image: High-resolution JPEG/PNG
- Transcript: Plain-text snippet (.txt)
- Provenance record: JSON file (.json) per slide, documenting multimodel semantic annotations and metadata
Sample directory layout:
```
Lectures/
├─ Lecture 1/
│  ├─ Images/Slide1.jpg
│  ├─ Texts/Slide1.txt
│  └─ Provs/Slide1.json
└─ …
```
The provenance JSON adopts a deterministic schema with sorted keys to guarantee reproducible cryptographic hashing. Each file records:
- Lecture and slide identifiers
- Model outputs (concepts, relational triples, evidence, raw output)
- Metadata (timestamps, pipeline source, hash input format)
- Paths referencing image, text, and JSON files
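The deterministic serialization and hashing step can be sketched as follows (a minimal sketch: hashlib's `sha3_256` stands in for the keccak256 digest the paper anchors on-chain, since keccak256 requires a third-party package such as pycryptodome or eth-hash):

```python
import hashlib
import json

def canonical_bytes(record: dict) -> bytes:
    # Serialize with lexicographically sorted keys and fixed separators
    # so byte-identical output is reproduced across runs and machines.
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def slide_hash(record: dict) -> str:
    # Assumption: sha3_256 as a stand-in for the keccak256 digest
    # registered on-chain (the two algorithms differ in padding).
    return "0x" + hashlib.sha3_256(canonical_bytes(record)).hexdigest()

record = {"lecture": "Lecture 1", "slide_id": 1, "models": {}}
```

Because keys are sorted before serialization, the digest is independent of the insertion order of the record's fields, which is what makes the hash reproducible.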
3. Semantic Annotation and Provenance Structure
Semantic annotations extracted by each VLM are organized as follows:
- Concepts: Flat lists of (category, term) pairs; categories include modality, anatomy, workflow, physics, software, etc.
- Relational triples: Structured as (subject, predicate, object) with optional confidence scores to capture inter-concept relationships.
- Evidence: Quoted textual snippets or model-generated explanations supporting each annotation.
Example JSON annotation:
```json
{
  "lecture": "Lecture 1",
  "slide_id": 1,
  "models": {
    "InternVL3-14B": {
      "concepts": [ {"category": "modality", "term": "medical imaging"}, … ],
      "triples": [ {"s": "Medical imaging", "p": "uses", "o": "science and engineering", "confidence": 1.0}, … ],
      "evidence": [ … ],
      "raw_output": "…"
    }
    // Qwen2-VL-7B, Qwen3-VL-4B, LLaVA-OneVision entries follow the same shape
  },
  "paths": {
    "image": "…/Lecture1/Images/Slide1.jpg",
    "text": "…/Lecture1/Texts/Slide1.txt",
    "json": "…/Lecture1/Provs/Slide1.json"
  },
  "metadata": {
    "timestamp": "2024-04-10T15:23:45Z",
    "source": "SlideChain v1.0 pipeline",
    "hash_input_format": "lexicographically sorted JSON"
  }
}
```
4. Blockchain Registration and Tamper-Evident Provenance
The integrity of each semantic annotation is cryptographically anchored on a local EVM-compatible blockchain using a minimal Solidity smart contract. The per-slide hash is the keccak256 digest of the slide's canonical provenance JSON (serialized with lexicographically sorted keys) and is registered and queried through three contract functions:
- registerSlide: records a slide's hash, URI, and registration metadata on-chain
- getSlide: returns the stored SlideRecord for a given slide
- isRegistered: returns a bool indicating whether a slide has been anchored
Smart contract pseudocode:
```solidity
struct SlideRecord {
    uint256 lectureId;
    uint256 slideId;
    string  slideHash;   // 0x… keccak256 digest of the off-chain JSON
    string  uri;         // off-chain JSON path
    uint256 timestamp;   // block.timestamp at registration
    address registrant;
}

mapping(bytes32 => SlideRecord) records; // key = keccak256(lectureId, slideId)
```
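A minimal off-chain verification sketch against such a record (assuming the stored hash is computed over the provenance JSON file's bytes, which are already emitted with sorted keys, and again using `sha3_256` as a stand-in for keccak256):

```python
import hashlib
from pathlib import Path

def verify_slide(json_path: Path, on_chain_hash: str) -> bool:
    # Recompute the digest of the off-chain provenance JSON and compare
    # it to the slideHash stored in the on-chain SlideRecord. Any edit
    # to the file changes the digest, making tampering detectable.
    digest = "0x" + hashlib.sha3_256(json_path.read_bytes()).hexdigest()
    return digest == on_chain_hash
```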
This framework supports tamper-evident, persistent semantic baselines, enabling deterministic retrieval and verification of extracted annotations.
5. Annotation Methodology and Quality Control
Annotations are generated by four vision–language models (VLMs):
- InternVL3-14B
- Qwen2-VL-7B
- Qwen3-VL-4B
- LLaVA-OneVision (backed by Qwen2-7B)
The four-step pipeline per (slide, model) consists of:
- Prompt construction (slide image + transcript snippet)
- Model inference producing lists, dicts, or free-text outputs
- Normalization: null-safe parsing, flattening, canonical triple extraction, lowercasing, whitespace normalization, deduplication
- Deterministic JSON generation with key sorting for cryptographic hashing
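The normalization step above can be sketched as follows (helper and field names are illustrative, not taken from the SlideChain codebase):

```python
import re

def norm(value) -> str:
    # Null-safe lowercasing and whitespace normalization.
    return re.sub(r"\s+", " ", (value or "").strip().lower())

def normalize_output(raw: dict) -> dict:
    # Null-safe parsing plus order-preserving deduplication of
    # concepts and canonical (s, p, o) triples.
    concepts, triples, seen_c, seen_t = [], [], set(), set()
    for c in raw.get("concepts") or []:
        key = (norm(c.get("category")), norm(c.get("term")))
        if all(key) and key not in seen_c:
            seen_c.add(key)
            concepts.append({"category": key[0], "term": key[1]})
    for t in raw.get("triples") or []:
        key = (norm(t.get("s")), norm(t.get("p")), norm(t.get("o")))
        if all(key) and key not in seen_t:
            seen_t.add(key)
            triples.append({"s": key[0], "p": key[1], "o": key[2]})
    return {"concepts": concepts, "triples": triples}
```

Normalizing before serialization is what allows independent extraction runs to produce bit-identical JSON, and hence identical hashes.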
Inter-model agreement is evaluated using:
- Concept disagreement: 1 − mean pairwise Jaccard index over the models' concept sets
- Triple disagreement: defined analogously over the models' triple sets
- Pairwise Jaccard index: J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j| for each model pair (i, j) (analogous for triple sets T_i, T_j)
Quality control methods include independent extraction runs yielding bit-identical JSON files (Jaccard = 1.0 for every slide/model pair) and tamper-injection testing on randomly selected JSONs, with every modification detected.
6. Accessibility, Usage Practices, and Downstream Applications
No official train/validation/test splits are provided, reflecting the dataset’s primary purpose for semantic extraction, audit, and reproducibility, rather than supervised learning or model training.
- Access: Fully available via public GitHub repository (subject to slide text/image licensing restrictions).
- Licensing: MIT License for code and JSON artifacts; slide image/text governed by the copyright policy of contributing instructors.
- Citation: Manik, M. M. H., Islam, M. Z., & Wang, G. (2024), "SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration" (Manik et al., 25 Dec 2025).
Potential downstream applications include:
- VLM benchmarking through quantitative disagreement and Jaccard analysis
- Identification of slides with high cross-model disagreement for targeted review or improved prompt design
- Long-term studies of semantic drift and reproducibility across model/prompt evolution
- Tamper-proof documentation for compliance in medical and legal education domains
- Classification of semantic stability (stable/moderate/unstable) using concept disagreement thresholds
- Automation of integrity checks via on-chain verification scripts and dashboards visualizing agreement metrics
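The stability classification mentioned above can be sketched as a simple threshold rule (the 0.3 / 0.6 cut-offs are illustrative assumptions; the dataset does not fix specific threshold values here):

```python
def stability(concept_disagreement: float,
              moderate: float = 0.3, unstable: float = 0.6) -> str:
    # Bucket a slide by its concept disagreement score.
    # Thresholds are hypothetical, chosen only for illustration.
    if concept_disagreement < moderate:
        return "stable"
    if concept_disagreement < unstable:
        return "moderate"
    return "unstable"
```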
7. Context and Research Significance
The SlideChain Slides Dataset represents the first systematic framework for constructing, tracking, and auditing multimodal semantic knowledge extracted from STEM lecture slides using blockchain-backed provenance. The pronounced cross-model discrepancies in concept overlap and relational triple agreement documented in the dataset suggest that robust semantic verification is critical in high-stakes educational and professional contexts. The reproducibility and tamper-evidence features provide a foundation for trustworthy instructional pipelines and semantic baselines as AI systems evolve. A plausible implication is the enablement of persistent audit trails and compliance monitoring for AI-generated content in domains subject to regulatory or pedagogical scrutiny.