2K-Characters-10K-Stories Dataset
- 2K-Characters-10K-Stories is a multimodal dataset that distinctly separates persistent character identity from transient attributes like pose, expression, and composition.
- It employs a human-in-the-loop pipeline with expert verification and automated quality gating to ensure high narrative and visual fidelity.
- Benchmark experiments reveal significant improvements in narrative coherence and control fidelity, including a +32.7 point gain in pose alignment over prior methods.
2K-Characters-10K-Stories is a large-scale, multimodal dataset designed for controllable visual storytelling under strict requirements of sequential character identity consistency and precision in transient attribute manipulation. Addressing persistent limitations of prior resources in disentangling stable character representation from variable pose, expression, and composition, it introduces a corpus of 2,000 uniquely stylized character templates distributed across 10,000 multi-frame illustrated stories, accompanied by fine-grained, explicit control signals and a rigorously quality-gated human-in-the-loop construction protocol. The dataset is accompanied by experiments demonstrating that models fine-tuned with its structured data achieve narrative and control fidelity rivaling state-of-the-art and closed-source systems (Yin et al., 5 Dec 2025).
1. Dataset Structure and Schema
2K-Characters-10K-Stories comprises 2,000 uniquely stylized identities, each represented by a 1024×1024 full-body template illustration against a neutral background. The 10,000 stories comprise approximately 75,000 individual frames, with each story presented as a sequence of multi-panel illustrations. Every frame is annotated with:
- Reference images for each present character.
- Story-level scripts and frame-level captions.
- Explicit per-character control signals:
  - C_ID: Character identity (among 2,000)
  - P_ID: Pose (28 discrete categories)
  - E_ID: Expression (12 types)
  - C_TAG: Composition/viewpoint (21 tags)
Each entry is represented in JSON with keys for story_id, frame_index, a list of characters (each with associated control signals), frame_text, reference_images mapping, and the target image path. All control embeddings are serialized per frame for precise generation and downstream usage (Yin et al., 5 Dec 2025).
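A schematic per-frame entry built from the schema above can be sketched as follows; the field names follow the description, but the concrete values and paths here are hypothetical, and the released files may serialize details differently:

```python
import json

# Hypothetical single-frame entry; keys follow the documented schema,
# values are purely illustrative.
entry = {
    "story_id": "story_00042",
    "frame_index": 3,
    "characters": [
        {"C_ID": 17, "P_ID": 5, "E_ID": 2, "C_TAG": 9},
    ],
    "frame_text": "The knight pauses at the castle gate.",
    "reference_images": {"17": "templates/char_0017.png"},
    "target_image": "stories/story_00042/frame_03.png",
}

# Per-frame serialization implies the entry round-trips through JSON.
restored = json.loads(json.dumps(entry))
```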
2. Human-in-the-Loop Construction Pipeline
The dataset is constructed via a three-phase Human-in-the-Loop (HiL) pipeline incorporating multiple layers of expert verification and automated quality control:
Phase I: Character Identity Template Generation
- Subject/profession sampling via MLLM meta-prompting and Flux-based image generation.
- Quality gating with VIEScore (threshold τ_q = 7.0); only images passing expert panel (≥2/3 votes) advance.
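The Phase I gate combines the automated score threshold with the expert vote; a minimal sketch, assuming a VIEScore-like scorer and a three-expert panel (function and parameter names are illustrative, not the paper's API):

```python
TAU_Q = 7.0  # VIEScore threshold from the pipeline description

def passes_gate(vie_score: float, expert_votes: list) -> bool:
    """A template advances only if it clears the automated score
    threshold AND receives at least 2 of 3 expert approvals."""
    return vie_score >= TAU_Q and sum(expert_votes) >= 2
```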
Phase II: Structured Narrative Encoding
- Narrative scripts generated by an LLM ensemble (GPT-4o, Gemini Pro 2.5), encoding frame-level events and control signals.
- Verification by a two-expert consensus focusing on semantic transition integrity.
Phase III: Quality-Gated Frame Synthesis & Automated Correction
- Reference images for each character/attribute fusion are synthesized with Qwen-Image and scored for visual consistency.
- Frame generation uses Gemini-2.5-flash-image with “hard” control inputs.
- Quality gated via a triple-check: MLLM review of identity consistency, control (pose/expression) accuracy, and scene alignment.
- Automated correction via Auto-Prompt Tuning (APT) and Local Image Editing (LIE).
- Persistent failures (>3 retries) are resolved by a final human panel.
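The Phase III loop can be sketched as generate, triple-check, correct, and escalate; the callbacks here stand in for the actual triple-check MLLM review, APT/LIE correction, and human panel, none of whose interfaces are specified at this level of detail:

```python
MAX_RETRIES = 3  # persistent failures beyond this go to the human panel

def synthesize_frame(generate, triple_check, correct, escalate):
    """Quality-gated frame synthesis with automated correction.

    generate/triple_check/correct/escalate are stand-in callbacks for
    frame generation, the triple-check review (identity, control
    accuracy, scene alignment), APT/LIE correction, and final human
    adjudication, respectively.
    """
    frame = generate()
    for _ in range(MAX_RETRIES):
        if triple_check(frame):
            return frame
        frame = correct(frame)  # Auto-Prompt Tuning / Local Image Editing
    return escalate(frame)      # persistent failure -> human panel
```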
This pipeline yields a dataset with pixel-level sequential and semantic alignment exceeding prior datasets in both scale and control fidelity (Yin et al., 5 Dec 2025).
3. Decoupled Control and Attribute Disentanglement
At the core of 2K-Characters-10K-Stories lies a decoupled control scheme that formally separates persistent identity (style, appearance) from transient attributes (pose, expression, viewpoint). Technically:
- e_ID: Identity embedding via an encoder (e.g., CLIP/T5) applied to the C_ID template image.
- e_P, e_E, e_C: Transient feature embeddings for pose, expression, and composition.
- Generation proceeds by anchoring e_ID and varying e_P, e_E, e_C as required.
A conceptual disentanglement objective is proposed:
L = ||D(e_ID, e_T) − I_target||,
where e_T = e_P + e_E + e_C is the sum of transient embeddings and D is a downstream decoder. Embeddings are serialized framewise, enforcing structured control at both training and inference phases (Yin et al., 5 Dec 2025).
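The decoupling can be illustrated numerically: the identity embedding is computed once and held fixed across frames, while transient embeddings are summed and recombined per frame. The NumPy lookup tables below are stand-ins for the learned encoders, which the source does not specify at this level:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative)

# Persistent identity embedding: computed once from the C_ID template.
e_id = rng.normal(size=D)

# Hypothetical learned embedding tables for the transient attributes,
# sized to the dataset's category counts.
tables = {
    "pose": rng.normal(size=(28, D)),  # 28 pose categories
    "expr": rng.normal(size=(12, D)),  # 12 expression types
    "comp": rng.normal(size=(21, D)),  # 21 composition tags
}

def transient_embedding(p_id, ex_id, c_tag, tables):
    """e_T = e_P + e_E + e_C: sum of transient embeddings."""
    return tables["pose"][p_id] + tables["expr"][ex_id] + tables["comp"][c_tag]

# Per-frame conditioning: identity anchored, transients vary.
frame_a = e_id + transient_embedding(3, 1, 7, tables)
frame_b = e_id + transient_embedding(9, 4, 2, tables)
```

Because the identity term is additive in this sketch, it can be recovered exactly by subtracting the transient sum, which is the disentanglement property the objective above encourages.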
4. Quality Gating and Automated Correction
The dataset employs a multi-layered, quality-gated loop:
- Reference and final frame images are validated by VIEScore (τ_q = 7.0).
- Pixel-level consistency is monitored, for example via a similarity gate of the form sim(I_frame, I_ref) ≥ τ_c, where sim could be LPIPS or VIEScore.
- If gates are not satisfied, APT is invoked, reweighting the prompt text and regenerating outputs; reference images require ≈2.19 iterations on average before passing.
- Main loop validation employs triple-check arbitration (identity, control accuracy, semantic alignment), with automated error correction and, when necessary, final human adjudication (Yin et al., 5 Dec 2025).
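A minimal version of the pixel-level consistency gate, using cosine similarity over feature vectors as a stand-in for the LPIPS/VIEScore similarity the text mentions (the threshold value here is illustrative, not from the paper):

```python
import numpy as np

TAU_C = 0.9  # illustrative consistency threshold

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistent(frame_feat: np.ndarray, ref_feat: np.ndarray,
               tau: float = TAU_C) -> bool:
    """Gate passes when frame features stay close to the reference."""
    return cosine_sim(frame_feat, ref_feat) >= tau
```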
5. Evaluation and Comparative Metrics
Performance of models trained on 2K-Characters-10K-Stories is benchmarked on narrative coherence, control fidelity, and image quality using established and novel metrics:
- Narrative Coherence: CSD (Cross-style, Self-style), CIDS (Cross-ID, Self-ID)
- Control Fidelity: Scene, Pose, Expression, Composition alignment
- Image Quality: Inception Score (IS), Aesthetics Score (AS)
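The control-fidelity numbers reported below are percent-style alignment scores; a plausible reading, assuming per-frame attribute labels (e.g., from an MLLM judge) compared against the requested control signals, is simple percent agreement:

```python
def alignment_score(predicted: list, target: list) -> float:
    """Percentage of frames whose realized attribute (pose, expression,
    or composition) matches the requested control signal.

    A hedged sketch of the metric; the paper's exact judging protocol
    may differ.
    """
    assert predicted and len(predicted) == len(target)
    matches = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * matches / len(target)
```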
Key comparative results (Table 2 (Yin et al., 5 Dec 2025)):
| Method | Pose↑ | Exp↑ | Comp↑ | IS↑ | AS↑ |
|---|---|---|---|---|---|
| StoryGen | 21.75 | 28.28 | 25.33 | 4.15 | 4.65 |
| GPT-4o | 65.15 | 65.50 | 62.00 | 4.20 | 5.84 |
| Gemini | 61.45 | 70.38 | 58.78 | 4.56 | 6.12 |
| Ours | 70.70 | 77.10 | 66.25 | 4.33 | 6.05 |
Notable is a +32.7 point improvement in pose alignment over OmniGen2, together with identity coherence competitive with closed-source systems. A plausible implication is that decoupling identity from transient attributes, in conjunction with the HiL pipeline and tight quality gating, is critical to this performance gain (Yin et al., 5 Dec 2025).
6. Loading, Usage, and Applications
Data splits: 80% training (≈8,000 stories), 10% validation, 10% test (including 200 out-of-domain characters). Typical loading uses PyTorch, with each batch delivering a stack of per-character reference templates, frame captions with serialized control signals, and the corresponding target frame image. Sample code:
```python
import json

import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader


class StoryDataset(Dataset):
    def __init__(self, annotation_file, image_root, tokenizer, preprocess):
        with open(annotation_file) as f:
            self.data = json.load(f)
        self.root = image_root
        self.tok = tokenizer
        self.preprocess = preprocess  # e.g., a torchvision transform -> tensor

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        entry = self.data[idx]
        # Per-character reference templates present in this frame.
        refs = [Image.open(f"{self.root}/{ref_path}")
                for ref_path in entry["reference_images"].values()]
        target = Image.open(f"{self.root}/{entry['target_image']}")
        # Frame caption plus serialized control signals (C_ID@P_ID@E_ID@C_TAG).
        prompt = entry["frame_text"] + " " + " ".join(
            f"{c['C_ID']}@{c['P_ID']}@{c['E_ID']}@{c['C_TAG']}"
            for c in entry["characters"])
        tokens = self.tok(prompt, return_tensors="pt")
        return {
            "refs": torch.stack([self.preprocess(r) for r in refs]),
            "tokens": tokens["input_ids"].squeeze(0),
            "target": self.preprocess(target),
        }


# tokenizer and preprocess are supplied by the user's model stack.
train_ds = StoryDataset("train.json", "images/", tokenizer, preprocess)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
```
Fine-tuning on this dataset with explicit reference conditioning and attribute disentanglement enables precise, multi-character, multi-frame control. This structure supports research in controllable generation, structured narrative alignment, and consistent synthetic story creation (Yin et al., 5 Dec 2025).
7. Context in Visual Storytelling Datasets
Compared to previous resources such as Visual Writing Prompts (VWP) (Hong et al., 2023), which grounds stories in curated movie-shot sequences with character annotations but does not disentangle identity and transients at the control-signal or template level, 2K-Characters-10K-Stories establishes new standards. VWP focuses on entity-coherent story generation from naturalistic live-action image sequences, achieving improvements in narrativity, plot coherence (entity-grid log-likelihood), and semantic cohesion relative to VIST. However, VWP primarily leverages object/character-appearance and event diversity at the textual narrative level, whereas 2K-Characters-10K-Stories provides structured, disentangled embeddings and multi-modal references for generative controllability at the pixel level. This suggests that the latter is uniquely positioned for research in fine-grained, stylized narrative synthesis where persistent visual coherence and flexible transient adjustment are paramount (Yin et al., 5 Dec 2025, Hong et al., 2023).