Openstory++: Instance-Aware Visual Storytelling

Updated 25 January 2026
  • Openstory++ is a large-scale dataset featuring instance-aware, open-domain visual storytelling with precise annotations and narrative consistency.
  • The dataset employs a robust four-stage pipeline including keyframe extraction, captioning, masking, and sequence alignment to ensure high annotation fidelity.
  • Cohere-Bench accompanies Openstory++ by providing quantitative and human evaluation metrics for assessing multi-instance consistency and narrative coherence.

Openstory++ is a large-scale dataset and benchmark specifically designed for instance-aware, open-domain visual storytelling. It features comprehensive instance-level annotation—including consistent instance IDs, pixel-precise masks, and refined captions—across both single images and extended image sequences, addressing the challenge of maintaining instance coherence in multi-image generative and interpretive tasks. Openstory++ is accompanied by Cohere-Bench, a benchmark suite for rigorous quantitative and human evaluation of image generation tasks requiring long-horizon, multi-instance consistency (Ye et al., 2024).

1. Construction Pipeline

Openstory++ is created through an automated four-stage pipeline engineered to maximize annotation fidelity and temporal consistency:

  1. Keyframe Extraction and Deduplication: I-frames are extracted from open-domain video corpora (primarily Panda-70M and InternVid). DINOv2 encodes each frame, and near-duplicates (frames with high cosine similarity to an already-kept frame) are removed so that retained keyframes are visually distinct.
  2. Single-Image Captioning & Instance Masking: Each keyframe receives an auto-generated caption from BLIP2. Entity nouns are extracted using NLTK, bounding boxes are detected with YOLO-Worldv2, and pixel-level instance masks are obtained via EfficientViT-SAM. Images with zero or more than eight instances are filtered out to control semantic complexity.
  3. Sequence-Level Alignment & Narrative Polishing: Successive keyframes are grouped into short sequences. Video-LLaVA provides a coarse narrative, which is then refined by ChatGLM3-Turbo (an LLM) using both the BLIP2 captions and the Video-LLaVA storylines. The LLM ensures narrative cohesion and consistent instance naming across frames.
  4. Final Instance Masking & ID Consistency: YOLO-Worldv2 is rerun to redetect the entities named in the LLM-polished captions. DINOv2 features are fused with facial features to assign unique, persistent instance IDs, and EfficientViT-SAM generates the final pixel-level masks.

This multi-step process yields data suitable for training models to maintain narrative and entity coherence even in protracted open-domain visual stories.
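The deduplication step in stage 1 can be sketched as a greedy filter over frame embeddings. This is an illustrative implementation, not the paper's code; the similarity threshold of 0.9 is an assumed value.

```python
import numpy as np

def deduplicate_keyframes(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy near-duplicate removal: keep a frame only if its (DINOv2-style)
    embedding has cosine similarity below `threshold` to every frame kept so
    far. The 0.9 threshold is illustrative, not taken from the paper."""
    # Normalize rows so plain dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(float(vec @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

A frame nearly identical to an earlier one (cosine similarity above the threshold) is dropped, while dissimilar frames survive.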

2. Annotation and Quality Control

Each annotated image or frame in Openstory++ contains:

  • Bounding boxes for every identified “main” entity (e.g., people, animals, vehicles)
  • Pixel-accurate instance masks (using SAM), cropped per bounding box
  • Unique, sequence-consistent instance IDs (e.g., “woman0,” “dog1”)
  • Natural-language captions, LLM-refined to reference each instance by ID
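The fields above can be pictured as one JSON-style record per frame. The field names below are hypothetical, chosen for illustration; they are not the dataset's actual schema, and the RLE mask payloads are elided.

```python
# Hypothetical per-frame record illustrating the annotation fields listed
# above; key names are assumptions, not Openstory++'s published schema.
frame_annotation = {
    "image_id": "seq0042_frame03",
    "caption": "woman0 hands a leash to man1 while dog1 waits by the gate.",
    "instances": [
        {"id": "woman0", "bbox": [34, 60, 210, 480], "mask_rle": "..."},
        {"id": "man1", "bbox": [250, 55, 430, 478], "mask_rle": "..."},
        {"id": "dog1", "bbox": [180, 300, 300, 470], "mask_rle": "..."},
    ],
}
```

Note how the caption refers to each entity by its persistent instance ID, which is what enables cross-frame identity tracking.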

The average number of instances per frame in sequence data is 2.5. Multiple filters are employed:

  • Aesthetic filter (unique frames: score >5; sequence frames: >4.5, as output by an aesthetic scoring model)
  • Instance-count filter (1–8 instances)
  • DINOv2 embedding-based deduplication
  • LLM refinement for caption and naming quality
  • Final manual sanity checks on a small held-out set

These measures collectively ensure granular, accurate, and semantically coherent annotation for both single frames and story sequences.
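The aesthetic and instance-count filters combine into a simple keep/drop predicate; a minimal sketch, using the thresholds stated above (score > 5 for unique frames, > 4.5 for sequence frames, 1–8 instances):

```python
def passes_filters(aesthetic_score: float, num_instances: int,
                   *, is_sequence_frame: bool) -> bool:
    """Keep a frame only if it clears the aesthetic threshold for its
    subset and contains between 1 and 8 detected instances."""
    min_score = 4.5 if is_sequence_frame else 5.0
    return aesthetic_score > min_score and 1 <= num_instances <= 8
```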

3. Dataset Scale and Characteristics

Openstory++ comprises two major subsets:

  • Single-Image Subset: Approximately 100 million image–caption pairs, with each image annotated with 1–8 instances (mean 2–3), each instance masked and assigned an ID.
  • Sequence Subset: One million short story sequences, averaging 28 frames per sequence, fully annotated frame-by-frame as above.

Frame resolution is standardized to 224 × 224 (for ViT) and 512 × 512 pixels (for Stable Diffusion conditioning). The captions, subject to LLM refinement, have an average length of 9–12 words and draw from a vocabulary of 50,000–100,000 unique tokens, supporting broad open-domain coverage.

4. Formal Definitions and Benchmark Metrics

Openstory++ defines the instance-aware visual storytelling task as generating $M$ images $I^1, \dots, I^M$ from $M$ text prompts $L^1, \dots, L^M$ in an interleaved, autoregressive manner:

$$S_{txt} = \{L^1, \dots, L^M\}, \quad S_{img} = \{I^1, \dots, I^M\}$$

Cohere-Bench evaluates models using:

  • Pairwise cosine similarity between feature embeddings:

$$\text{Similarity} = \frac{1}{N} \sum_{i=1}^{N} (f^A_i)^{\top} f^B_i$$

where $f^A_i$ and $f^B_i$ are normalized feature vectors.
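A direct reading of this metric in code (an illustrative sketch, not the benchmark's implementation): normalize both feature sets row-wise, then average the per-pair dot products.

```python
import numpy as np

def mean_pairwise_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Mean cosine similarity over N matched feature pairs. Rows are
    L2-normalized first, so each row-wise dot product is a cosine."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```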

  • Instance Integrity Score using the Hungarian matching algorithm:

$$\text{InstanceIntegrity} = \frac{1}{|F_{base}|} \sum_{k} \text{Similarity}\left(f^{current}_{i_k}, f^{base}_{j_k}\right)$$

with $(i_k, j_k)$ given by the optimal matching on the cost matrix.
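The idea behind the score can be sketched as follows. For the small instance counts here (at most 8 per frame), a brute-force search over permutations finds the same optimal assignment the Hungarian algorithm would (in practice one would use `scipy.optimize.linear_sum_assignment`); this is an illustration, not the benchmark's code.

```python
import itertools
import numpy as np

def instance_integrity(base: np.ndarray, current: np.ndarray) -> float:
    """Match each base-frame instance feature to a distinct current-frame
    feature so that total cosine similarity is maximal, then average the
    matched similarities. Brute-force stand-in for Hungarian matching."""
    b = base / np.linalg.norm(base, axis=1, keepdims=True)
    c = current / np.linalg.norm(current, axis=1, keepdims=True)
    sim = b @ c.T  # pairwise cosine-similarity (cost) matrix
    best = max(
        sum(sim[i, j] for i, j in enumerate(perm))
        for perm in itertools.permutations(range(c.shape[0]), b.shape[0])
    )
    return float(best / b.shape[0])
```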

Additional evaluation metrics include:

| Metric | Description |
| --- | --- |
| Semantic Alignment | CLIP cosine similarity between generated image and text |
| Background Consistency | CLIP similarity of backgrounds after inpainting |
| Style Consistency | DINOv2 cosine similarity across consecutive frames |
| Instance Consistency | YOLO-World segment overlap plus cosine similarity (single- and multi-instance) |
| BLEU4 | Caption similarity against BLIP2-generated captions |

5. Cohere-Bench: Task Design and Empirical Results

Cohere-Bench, provided alongside Openstory++, offers structured benchmarks for:

  • Story Generation: Generate the first frame from text, then use the previous image and new text iteratively for subsequent frames.
  • Story Continuation: Given image 1 and text 2, generate image 2, and so on.
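The autoregressive story-generation loop described above can be sketched as follows. The `model` interface with a `generate_image(prompt, prev_image=...)` method is hypothetical, standing in for whatever multimodal model is being benchmarked.

```python
def generate_story(model, prompts):
    """Frame 1 is generated from text alone (prev_image=None); every later
    frame conditions on the previous image plus the new prompt.
    `model.generate_image` is an assumed interface, not a real API."""
    frames = []
    prev = None
    for prompt in prompts:
        prev = model.generate_image(prompt, prev_image=prev)
        frames.append(prev)
    return frames
```

Story continuation follows the same loop but seeds `prev` with a provided ground-truth first frame instead of generating it.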

Baseline models evaluated include DreamLLM, MiniGPT-5, SEED-X, Emu2, GPT4-V, MiniGemini, and Openstory++ reference models with and without visual annotations. Automatic metric scores highlight the leading performance of the Openstory++ model with visual annotations:

| Model | Semantic | BG Consist. | Style | InstCons (single) | InstCons (multi) | Integrity | BLEU4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Openstory++ (with visual annotations) | 0.279 | 0.791 | 0.784 | 0.821 | 0.782 | 0.429 | 0.064 |

GPT4-V comes close on semantic alignment and style but scores lower on instance consistency.

Human evaluation covers criteria including text-image alignment, style consistency, story consistency, character consistency, plot continuity, image quality, and overall preference. Preference rates: Our model (21.28%) > GPT4-V (20.66%) > Emu2 (14.21%) > SEED-X (13.53%).

6. Comparative Analysis with Existing Datasets

Openstory++ is the only dataset to combine open-domain breadth, high per-instance annotation fidelity, and long temporally coherent sequences. The following table contrasts it with prior datasets:

| Dataset | Domain | Caption | Frames | Avg. Length | Masks/Frame |
| --- | --- | --- | --- | --- | --- |
| PororoSV | Closed | Manual | 73 K | 5 | 0 |
| Flintstones | Closed | Manual | 123 K | 5 | 0 |
| DideMoSV | Closed | Manual | 53 K | 3 | 0 |
| VIST | Open | Manual | 145 K | 5 | 0 |
| StorySalon | Closed | ASR | 159 K | 14 | 1 |
| SDD | Open | Generated | 76 M | 1 | 3 |
| Openstory++ | Open | Generated | 100 M + 1 M seq. | 28 | 2.5 |

Openstory++ uniquely offers large scale, instance-level masks and IDs, and long sequences (mean length 28), filling both scale and granularity gaps present in previous resources.

Openstory++ data and code are available at https://openstorypp.github.io/. The dataset is released under a CC BY 4.0 license; users must provide attribution and should confirm the precise terms on the website. Recommended applications include training open-domain visual storytelling models that require instance awareness and temporal consistency, and evaluating MLLMs for entity consistency and narrative generation using Cohere-Bench.

Ethical use guidelines note that, despite substantial filtering, residual dataset biases (e.g., scene type and face blurring artifacts) may persist; downstream audits are advised. Attribution to the Openstory++ paper is requested for all uses (Ye et al., 2024).
