Openstory++: Instance-Aware Visual Storytelling
- Openstory++ is a large-scale dataset featuring instance-aware, open-domain visual storytelling with precise annotations and narrative consistency.
- The dataset employs a robust four-stage pipeline including keyframe extraction, captioning, masking, and sequence alignment to ensure high annotation fidelity.
- Cohere-Bench accompanies Openstory++ by providing quantitative and human evaluation metrics for assessing multi-instance consistency and narrative coherence.
Openstory++ is a large-scale dataset and benchmark specifically designed for instance-aware, open-domain visual storytelling. It features comprehensive instance-level annotation—including consistent instance IDs, pixel-precise masks, and refined captions—across both single images and extended image sequences, addressing the challenge of maintaining instance coherence in multi-image generative and interpretive tasks. Openstory++ is accompanied by Cohere-Bench, a benchmark suite for rigorous quantitative and human evaluation of image generation tasks requiring long-horizon, multi-instance consistency (Ye et al., 2024).
1. Construction Pipeline
Openstory++ is created through an automated four-stage pipeline engineered to maximize annotation fidelity and temporal consistency:
- Keyframe Extraction and Deduplication: I-frames are extracted from open-domain video corpora (primarily Panda-70M and InternVid). DINOv2 encodes each frame, and near-duplicates (frame pairs with high cosine similarity) are removed so that retained keyframes are visually distinct.
- Single-Image Captioning & Instance Masking: Each keyframe receives an auto-generated caption from BLIP2. Entity nouns are extracted using NLTK, bounding boxes are detected with YOLO-Worldv2, and pixel-level instance masks are obtained via EfficientViT-SAM. Images with zero or more than eight instances are filtered out to control semantic complexity.
- Sequence-Level Alignment & Narrative Polishing: Successive keyframes are grouped into short sequences. Video-LLaVA provides a coarse narrative, which is then refined by ChatGLM3-Turbo (an LLM) using both the BLIP2 captions and the Video-LLaVA storylines. The LLM enforces narrative cohesion and consistent instance naming across frames.
- Final Instance Masking & ID Consistency: YOLO-Worldv2 is rerun to redetect the entities named in the LLM-polished captions. DINOv2 features are fused with facial features to assign unique, persistent instance IDs, and EfficientViT-SAM generates the final pixel-level masks.
This multi-step process yields data suitable for training models to maintain narrative and entity coherence even in protracted open-domain visual stories.
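The embedding-based deduplication step can be sketched as a greedy filter over DINOv2 frame embeddings; the similarity threshold below is illustrative, not the pipeline's published value:

```python
import numpy as np

def deduplicate_frames(embeddings: np.ndarray, threshold: float = 0.95) -> list:
    """Greedily keep frames whose embedding is not a near-duplicate
    (cosine similarity >= threshold) of any already-kept frame."""
    # L2-normalize rows so that a dot product equals cosine similarity
    unit = embeddings / np.clip(
        np.linalg.norm(embeddings, axis=1, keepdims=True), 1e-12, None
    )
    kept = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

A greedy scan like this is linear in the number of kept frames per query, which keeps the filter tractable on corpora the size of Panda-70M.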
2. Annotation and Quality Control
Each annotated image or frame in Openstory++ contains:
- Bounding boxes for every identified “main” entity (e.g., people, animals, vehicles)
- Pixel-accurate instance masks (using SAM), cropped per bounding box
- Unique, sequence-consistent instance IDs (e.g., “woman0,” “dog1”)
- Natural-language captions, LLM-refined to reference each instance by ID
The average number of instances per frame in sequence data is 2.5. Multiple filters are employed:
- Aesthetic filter (unique frames: score >5; sequence frames: >4.5, as output by an aesthetic scoring model)
- Instance-count filter (1–8 instances)
- DINOv2 embedding-based deduplication
- LLM refinement for caption and naming quality
- Final manual sanity checks on a small held-out set
These measures collectively ensure granular, accurate, and semantically coherent annotation for both single frames and story sequences.
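A record carrying these annotation fields might look like the following; the field names, IDs, and RLE placeholder are illustrative, not Openstory++'s actual on-disk schema:

```python
# Hypothetical per-frame record mirroring the annotation fields listed above.
frame_annotation = {
    "frame_id": "seq00042_frame003",
    "caption": "woman0 hands a leash to dog1 near the doorway.",
    "instances": [
        {"instance_id": "woman0", "bbox": [34, 50, 210, 480], "mask_rle": "<RLE>"},
        {"instance_id": "dog1", "bbox": [220, 300, 400, 470], "mask_rle": "<RLE>"},
    ],
}

def caption_covers_instances(ann: dict) -> bool:
    """Sanity check: the LLM-refined caption must mention every instance ID."""
    return all(inst["instance_id"] in ann["caption"] for inst in ann["instances"])
```

Checks of this kind (caption mentions every ID, instance count within 1–8) correspond to the filters listed above.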
3. Dataset Scale and Characteristics
Openstory++ comprises two major subsets:
- Single-Image Subset: Approximately 100 million image–caption pairs, with each image annotated with 1–8 instances (mean 2–3), each instance masked and assigned an ID.
- Sequence Subset: One million short story sequences, averaging 28 frames per sequence, fully annotated frame-by-frame as above.
Frame resolution is standardized to 224 × 224 (for ViT) and 512 × 512 pixels (for Stable Diffusion conditioning). The captions, subject to LLM refinement, have an average length of 9–12 words and draw from a vocabulary of 50,000–100,000 unique tokens, supporting broad open-domain coverage.
4. Formal Definitions and Benchmark Metrics
Openstory++ defines the instance-aware visual storytelling task as generating images from text prompts in an interleaved, autoregressive manner: given a prompt sequence $T_1, \dots, T_n$, each frame is produced conditioned on all preceding prompts and frames,

$$p(I_1, \dots, I_n \mid T_1, \dots, T_n) = \prod_{t=1}^{n} p(I_t \mid T_{\le t}, I_{<t}).$$
Cohere-Bench evaluates models using:
- Pairwise cosine similarity between feature embeddings,
$$\mathrm{sim}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^{\top}\mathbf{v},$$
where $\mathbf{u}$, $\mathbf{v}$ are normalized feature vectors.
- An Instance Integrity Score computed with the Hungarian matching algorithm,
$$\mathrm{Integrity} = \frac{1}{|M|} \sum_{(i,j) \in M} \mathrm{sim}(\mathbf{f}_i, \mathbf{g}_j),$$
with the matching $M$ given by optimal assignment on the cost matrix $C_{ij} = 1 - \mathrm{sim}(\mathbf{f}_i, \mathbf{g}_j)$, where $\mathbf{f}_i$ are reference-instance features and $\mathbf{g}_j$ generated-instance features.
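The Hungarian-matched integrity computation can be sketched with SciPy's assignment solver; the normalization and the treatment of unmatched instances here are assumptions and may differ from Cohere-Bench's exact implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def instance_integrity(ref_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Mean cosine similarity over the optimal (Hungarian) one-to-one matching
    between reference and generated instance feature sets."""
    # Normalize rows so dot products are cosine similarities
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    sim = ref @ gen.T                              # pairwise cosine matrix
    rows, cols = linear_sum_assignment(1.0 - sim)  # minimize cost = 1 - sim
    return float(sim[rows, cols].mean())
```

When every generated instance matches its reference exactly, the score reaches its maximum of 1.0.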
Additional evaluation metrics include:
| Metric | Description |
|---|---|
| Semantic Alignment | CLIP cosine between generated image & text |
| Background Consistency | CLIP similarity of backgrounds after inpainting |
| Style Consistency | DINOv2 cosine similarity across consecutive frames |
| Instance Consistency | YOLO-World segment overlap + cosine (single/multi) |
| BLEU4 | Caption similarity using BLIP2-generated captions |
5. Cohere-Bench: Task Design and Empirical Results
Cohere-Bench, provided alongside Openstory++, offers structured benchmarks for:
- Story Generation: Generate the first frame from text, then use the previous image and new text iteratively for subsequent frames.
- Story Continuation: Given image 1 and text 2, generate image 2, and so on.
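Both protocols reduce to one interleaved loop; `generate` below is a stand-in for any text-and-image-conditioned model, so this is a schematic sketch rather than a reference implementation:

```python
from typing import Any, Callable, List, Optional

def tell_story(
    prompts: List[str],
    generate: Callable[[str, Optional[Any]], Any],
    first_frame: Optional[Any] = None,
) -> List[Any]:
    """Autoregressive storytelling: each frame is conditioned on its prompt
    and the previously produced (or given) frame.

    Story generation: first_frame=None, so frame 1 comes from text alone.
    Story continuation: pass the ground-truth frame 1 and prompts[1:].
    """
    frames: List[Any] = [] if first_frame is None else [first_frame]
    prev = first_frame
    for prompt in prompts:
        frame = generate(prompt, prev)  # image conditioned on text + prior frame
        frames.append(frame)
        prev = frame
    return frames
```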
Baseline models evaluated include DreamLLM, MiniGPT-5, SEED-X, Emu2, GPT4-V, MiniGemini, and Openstory++ reference models with and without visual annotations. Automatic metric scores highlight the leading performance of the Openstory++ model with visual annotations:
| Model | Semantic | BG Consist. | Style | Inst. Cons. (single) | Inst. Cons. (multi) | Integrity | BLEU4 |
|---|---|---|---|---|---|---|---|
| Openstory++ (vis. ann.) | 0.279 | 0.791 | 0.784 | 0.821 | 0.782 | 0.429 | 0.064 |

GPT4-V is reported as close on semantic alignment and style consistency but markedly lower on instance consistency.
Human evaluation covers criteria including text-image alignment, style consistency, story consistency, character consistency, plot continuity, image quality, and overall preference. Preference rates: Our model (21.28%) > GPT4-V (20.66%) > Emu2 (14.21%) > SEED-X (13.53%).
6. Comparative Analysis with Existing Datasets
Openstory++ is the only dataset to combine open-domain breadth, high per-instance annotation fidelity, and long temporally coherent sequences. The following table contrasts it with prior datasets:
| Dataset | Domain | Captions | Frames | Avg. Seq. Length | Avg. Masks/Frame | Sequential |
|---|---|---|---|---|---|---|
| PororoSV | Closed | Manual | 73 K | 5 | 0 | ✓ |
| Flintstones | Closed | Manual | 123 K | 5 | 0 | ✓ |
| DideMoSV | Closed | Manual | 53 K | 3 | 0 | ✓ |
| VIST | Open | Manual | 145 K | 5 | 0 | ✓ |
| StorySalon | Closed | ASR | 159 K | 14 | 1 | ✓ |
| SDD | Open | Generated | 76 M | 1 | 3 | ✗ |
| Openstory++ | Open | Generated | 100 M + 1 M seq. | 28 | 2.5 | ✓ |
Openstory++ uniquely offers large scale, instance-level masks and IDs, and long sequences (mean length 28), filling both scale and granularity gaps present in previous resources.
7. Availability, Licensing, and Recommended Use
Openstory++ data and code are available at https://openstorypp.github.io/. The dataset is released under a CC BY 4.0 license; users must provide attribution and should confirm the precise terms on the website. Recommended applications include training open-domain visual storytelling models that require instance awareness and temporal consistency, and evaluating MLLMs for entity consistency and narrative generation using Cohere-Bench.
Ethical use guidelines note that, despite substantial filtering, residual dataset biases (e.g., scene type and face blurring artifacts) may persist; downstream audits are advised. Attribution to the Openstory++ paper is requested for all uses (Ye et al., 2024).