Openstory++: Instance-Aware Visual Storytelling
- Openstory++ is a large-scale dataset featuring instance-aware, open-domain visual storytelling with precise annotations and narrative consistency.
- The dataset employs a robust four-stage pipeline including keyframe extraction, captioning, masking, and sequence alignment to ensure high annotation fidelity.
- Cohere-Bench accompanies Openstory++ by providing quantitative and human evaluation metrics for assessing multi-instance consistency and narrative coherence.
Openstory++ is a large-scale dataset and benchmark specifically designed for instance-aware, open-domain visual storytelling. It features comprehensive instance-level annotation—including consistent instance IDs, pixel-precise masks, and refined captions—across both single images and extended image sequences, addressing the challenge of maintaining instance coherence in multi-image generative and interpretive tasks. Openstory++ is accompanied by Cohere-Bench, a benchmark suite for rigorous quantitative and human evaluation of image generation tasks requiring long-horizon, multi-instance consistency (Ye et al., 2024).
1. Construction Pipeline
Openstory++ is created through an automated four-stage pipeline engineered to maximize annotation fidelity and temporal consistency:
- Keyframe Extraction and Deduplication: I-frames are extracted from open-domain video corpora (primarily Panda-70M and InternVid). DINOv2 encodes each frame, and near-duplicates (frame pairs with high cosine similarity) are removed so that retained keyframes are visually distinct.
- Single-Image Captioning & Instance Masking: Each keyframe receives an auto-generated caption from BLIP2. Entity nouns are extracted using NLTK, bounding boxes are detected with YOLO-Worldv2, and pixel-level instance masks are obtained via EfficientViT-SAM. Images with zero or more than eight instances are filtered out to control semantic complexity.
- Sequence-Level Alignment & Narrative Polishing: Successive keyframes are grouped into short sequences. Video-LLaVA provides a coarse narrative, which is then refined by ChatGLM3-Turbo (an LLM) using both the BLIP2 captions and the Video-LLaVA storylines. The LLM enforces narrative cohesion and consistent instance naming across frames.
- Final Instance Masking & ID Consistency: YOLO-Worldv2 is rerun to redetect the entities named in the LLM-polished captions. DINOv2 features are fused with facial features to assign unique, persistent instance IDs, and EfficientViT-SAM generates the final pixel-level masks.
This multi-step process yields data suitable for training models to maintain narrative and entity coherence even in protracted open-domain visual stories.
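The embedding-based deduplication step can be sketched as a greedy filter over DINOv2 frame embeddings; the similarity threshold below is illustrative, not the pipeline's published value:

```python
import numpy as np

def deduplicate_frames(embeddings: np.ndarray, threshold: float = 0.95) -> list:
    """Greedily keep frames whose embedding is not a near-duplicate
    (cosine similarity >= threshold) of any already-kept frame."""
    # L2-normalize rows so that a dot product equals cosine similarity
    unit = embeddings / np.clip(
        np.linalg.norm(embeddings, axis=1, keepdims=True), 1e-12, None
    )
    kept = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

A greedy scan like this is linear in the number of kept frames per query, which keeps the filter tractable on corpora the size of Panda-70M.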
2. Annotation and Quality Control
Each annotated image or frame in Openstory++ contains:
- Bounding boxes for every identified “main” entity (e.g., people, animals, vehicles)
- Pixel-accurate instance masks (using SAM), cropped per bounding box
- Unique, sequence-consistent instance IDs (e.g., “woman0,” “dog1”)
- Natural-language captions, LLM-refined to reference each instance by ID
The average number of instances per frame in sequence data is 2.5. Multiple filters are employed:
- Aesthetic filter (unique frames: score >5; sequence frames: >4.5, as output by an aesthetic scoring model)
- Instance-count filter (1–8 instances)
- DINOv2 embedding-based deduplication
- LLM refinement for caption and naming quality
- Final manual sanity checks on a small held-out set
These measures collectively ensure granular, accurate, and semantically coherent annotation for both single frames and story sequences.
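A record carrying these annotation fields might look like the following; the field names, IDs, and RLE placeholder are illustrative, not Openstory++'s actual on-disk schema:

```python
# Hypothetical per-frame record mirroring the annotation fields listed above.
frame_annotation = {
    "frame_id": "seq00042_frame003",
    "caption": "woman0 hands a leash to dog1 near the doorway.",
    "instances": [
        {"instance_id": "woman0", "bbox": [34, 50, 210, 480], "mask_rle": "<RLE>"},
        {"instance_id": "dog1", "bbox": [220, 300, 400, 470], "mask_rle": "<RLE>"},
    ],
}

def caption_covers_instances(ann: dict) -> bool:
    """Sanity check: the LLM-refined caption must mention every instance ID."""
    return all(inst["instance_id"] in ann["caption"] for inst in ann["instances"])
```

Checks of this kind (caption mentions every ID, instance count within 1–8) correspond to the filters listed above.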
3. Dataset Scale and Characteristics
Openstory++ comprises two major subsets:
- Single-Image Subset: Approximately 100 million image–caption pairs, with each image annotated with 1–8 instances (mean 2–3), each instance masked and assigned an ID.
- Sequence Subset: One million short story sequences, averaging 28 frames per sequence, fully annotated frame-by-frame as above.
Frame resolution is standardized to 224 × 224 (for ViT) and 512 × 512 pixels (for Stable Diffusion conditioning). The captions, subject to LLM refinement, have an average length of 9–12 words and draw from a vocabulary of 50,000–100,000 unique tokens, supporting broad open-domain coverage.
4. Formal Definitions and Benchmark Metrics
Openstory++ defines the instance-aware visual storytelling task as generating images from text prompts in an interleaved, autoregressive manner: given a prompt sequence $T_1, \dots, T_n$, each frame is produced conditioned on all preceding prompts and frames,

$$p(I_1, \dots, I_n \mid T_1, \dots, T_n) = \prod_{t=1}^{n} p(I_t \mid T_{\le t}, I_{<t}).$$
Cohere-Bench evaluates models using:
- Pairwise cosine similarity between feature embeddings,
$$\mathrm{sim}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^{\top}\mathbf{v},$$
where $\mathbf{u}$, $\mathbf{v}$ are normalized feature vectors.
- An Instance Integrity Score computed with the Hungarian matching algorithm,
$$\mathrm{Integrity} = \frac{1}{|M|} \sum_{(i,j) \in M} \mathrm{sim}(\mathbf{f}_i, \mathbf{g}_j),$$
with the matching $M$ given by optimal assignment on the cost matrix $C_{ij} = 1 - \mathrm{sim}(\mathbf{f}_i, \mathbf{g}_j)$, where $\mathbf{f}_i$ are reference-instance features and $\mathbf{g}_j$ generated-instance features.
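The Hungarian-matched integrity computation can be sketched with SciPy's assignment solver; the normalization and the treatment of unmatched instances here are assumptions and may differ from Cohere-Bench's exact implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def instance_integrity(ref_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Mean cosine similarity over the optimal (Hungarian) one-to-one matching
    between reference and generated instance feature sets."""
    # Normalize rows so dot products are cosine similarities
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    sim = ref @ gen.T                              # pairwise cosine matrix
    rows, cols = linear_sum_assignment(1.0 - sim)  # minimize cost = 1 - sim
    return float(sim[rows, cols].mean())
```

When every generated instance matches its reference exactly, the score reaches its maximum of 1.0.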
Additional evaluation metrics include:
| Metric | Description |
|---|---|
| Semantic Alignment | CLIP cosine between generated image & text |
| Background Consistency | CLIP similarity of backgrounds after inpainting |
| Style Consistency | DINOv2 cosine similarity across consecutive frames |
| Instance Consistency | YOLO-World segment overlap + cosine (single/multi) |
| BLEU4 | Caption similarity using BLIP2-generated captions |
5. Cohere-Bench: Task Design and Empirical Results
Cohere-Bench, provided alongside Openstory++, offers structured benchmarks for:
- Story Generation: Generate the first frame from text, then use the previous image and new text iteratively for subsequent frames.
- Story Continuation: Given image 1 and text 2, generate image 2, and so on.
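Both protocols reduce to one interleaved loop; `generate` below is a stand-in for any text-and-image-conditioned model, so this is a schematic sketch rather than a reference implementation:

```python
from typing import Any, Callable, List, Optional

def tell_story(
    prompts: List[str],
    generate: Callable[[str, Optional[Any]], Any],
    first_frame: Optional[Any] = None,
) -> List[Any]:
    """Autoregressive storytelling: each frame is conditioned on its prompt
    and the previously produced (or given) frame.

    Story generation: first_frame=None, so frame 1 comes from text alone.
    Story continuation: pass the ground-truth frame 1 and prompts[1:].
    """
    frames: List[Any] = [] if first_frame is None else [first_frame]
    prev = first_frame
    for prompt in prompts:
        frame = generate(prompt, prev)  # image conditioned on text + prior frame
        frames.append(frame)
        prev = frame
    return frames
```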
Baseline models evaluated include DreamLLM, MiniGPT-5, SEED-X, Emu2, GPT4-V, MiniGemini, and Openstory++ reference models with and without visual annotations. Automatic metric scores highlight the leading performance of the Openstory++ model with visual annotations:
| Model | Semantic | BG Consist. | Style | Inst. Cons. (single) | Inst. Cons. (multi) | Integrity | BLEU4 |
|---|---|---|---|---|---|---|---|
| Openstory++ (vis. ann.) | 0.279 | 0.791 | 0.784 | 0.821 | 0.782 | 0.429 | 0.064 |

GPT4-V is reported as close on semantic alignment and style consistency but markedly lower on instance consistency.
Human evaluation covers criteria including text-image alignment, style consistency, story consistency, character consistency, plot continuity, image quality, and overall preference. Preference rates: Our model (21.28%) > GPT4-V (20.66%) > Emu2 (14.21%) > SEED-X (13.53%).
6. Comparative Analysis with Existing Datasets
Openstory++ is the only dataset to combine open-domain breadth, high per-instance annotation fidelity, and long temporally coherent sequences. The following table contrasts it with prior datasets:
| Dataset | Domain | Captions | Frames | Avg. Seq. Length | Avg. Masks/Frame | Sequential |
|---|---|---|---|---|---|---|
| PororoSV | Closed | Manual | 73 K | 5 | 0 | ✓ |
| Flintstones | Closed | Manual | 123 K | 5 | 0 | ✓ |
| DideMoSV | Closed | Manual | 53 K | 3 | 0 | ✓ |
| VIST | Open | Manual | 145 K | 5 | 0 | ✓ |
| StorySalon | Closed | ASR | 159 K | 14 | 1 | ✓ |
| SDD | Open | Generated | 76 M | 1 | 3 | ✗ |
| Openstory++ | Open | Generated | 100 M + 1 M seq. | 28 | 2.5 | ✓ |
Openstory++ uniquely offers large scale, instance-level masks and IDs, and long sequences (mean length 28), filling both scale and granularity gaps present in previous resources.
7. Availability, Licensing, and Recommended Use
Openstory++ data and code are available at https://openstorypp.github.io/. The dataset is released under a CC BY 4.0 license; users must provide attribution and should confirm the precise terms on the website. Recommended applications include training open-domain visual storytelling models that require instance awareness and temporal consistency, and evaluating MLLMs for entity consistency and narrative generation using Cohere-Bench.
Ethical use guidelines note that, despite substantial filtering, residual dataset biases (e.g., scene type and face blurring artifacts) may persist; downstream audits are advised. Attribution to the Openstory++ paper is requested for all uses (Ye et al., 2024).