AiEdit Dataset: Multimodal Editing Benchmarks

Updated 5 February 2026
  • AiEdit dataset is a collection of multimodal benchmarks for editing tasks in speech, image, video, and affective content, constructed using LLM-guided pipelines.
  • It features comprehensive annotations, detailed data structures, and rigorous evaluation protocols like F1, EER, and FID to assess model performance.
  • The dataset supports diverse applications such as speech forensics, emotion-driven image translation, cinematic video editing, and instruction-based image manipulation.

AiEdit Dataset

The term "AiEdit dataset" refers to several distinct, high-impact datasets targeting artificial intelligence research in multimodal editing, including image, video, audio, and affective content manipulation. While "AiEdit" is used in some papers as an explicit dataset name (notably for bilingual speech and affective image editing), it also appears in published AI video editing corpora and, as a shorthand, across various high-resolution instruction-based image editing benchmarks. Collectively, these datasets exemplify rigorous construction methodologies, comprehensive annotation schemes, and deployment for benchmarking state-of-the-art editing models.

1. Taxonomy of Datasets Under "AiEdit"

Dataset | Primary Modality | Editing Type(s) | Scale | Notable Use Cases
AiEdit Speech | Audio | Fine-grained segmental semantic edits | 59,554 utts | Speech editing detection/localization (Xue et al., 29 Jan 2026)
AiEdit (EmoTIPS) | Image | Emotion-conditioned image editing | 1,000,000 pairs | Affective image translation (Zhang et al., 24 May 2025)
AVE/AiEdit | Video | Cinematic shot structuring/editing | 196,176 shots | AI-assisted video editing (Argaw et al., 2022)
HQ-Edit | Image | Instruction-based image editing | 197,350 pairs | State-of-the-art model fine-tuning (Hui et al., 2024)
AnyEdit | Image | Unified multimodal image editing | 2,506,320 pairs | Large-scale instruction-driven editing (Yu et al., 2024)
X2Edit | Image | Arbitrary-instruction image editing | 3,700,000 pairs | Plug-and-play editor training (Ma et al., 11 Aug 2025)

A plausible implication is that "AiEdit" has become a meta-label encompassing open datasets central to contemporary research in language-driven content manipulation across modalities.

2. Data Construction Pipelines

Data construction strategies in AiEdit-labeled datasets are characterized by modular pipelines, use of LLMs and multimodal alignment, tight post-generation validation, and diverse edit taxonomies.

  • Speech Editing (AiEdit Speech): Tampering logic driven by LLM-guided transcript edits, edited-region labeling, and region-aware neural speech editing models (SSR-Speech, VoiceCraft, Ming-UniAudio). Bilingual content with segment alignment and forced alignment for timestamping. Exact composition: N_total = 59,554 utterances, N_edited = 51,794 (tamper rate P_tamper ≈ 87%), covering add/delete/modify edit types (Xue et al., 29 Jan 2026).
  • Affective Image Editing (EmoTIPS): Triplet-based pairing from EmoSet (IAPS + online images), emotion text annotation via multimodal LLMs, and continuous emotional-spectrum mapping (contrastive triplet loss in the emotion embedding space E). The dataset contains 1,000,000 image–text pairs balanced across the eight Mikels emotion categories (Zhang et al., 24 May 2025).
  • Video Editing (AVE/AiEdit): Scene and shot extraction from movie sources (YouTube), 1.5M+ tags across 196K shots, eight cinematographic attribute classes per shot, manual and automated labeling, and comprehensive train/val/test splits (Argaw et al., 2022).
  • Image Editing (HQ-Edit/AnyEdit/X2Edit): Initial seed instruction–image triplets via web scraping, LLM "self-instruct" expansion, advanced T2I/segmentation/inpainting pipelines, normalization and fine-grained alignment (e.g., DIFT warping (Hui et al., 2024)), and rigorous automated quality filtering through aesthetic and semantic models. Large-scale coverage via iterative expansion (AnyEdit: 2.5M, X2Edit: 3.7M pairs) (Yu et al., 2024, Ma et al., 11 Aug 2025).

This suggests a strong methodological linkage across AiEdit datasets: LLM-driven synthetic example creation, per-example quality policing, and semantic/tag annotation at fine spatial or temporal granularity.
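As an illustration of the single-edit tampering step described for the speech corpus, a minimal sketch follows. The function name, placeholder words, and index-based record are hypothetical simplifications: a real pipeline would ask an LLM to propose the replacement text and would derive start/end timestamps via forced alignment.

```python
import random

EDIT_TYPES = ("add", "delete", "modify")

def propose_edit(transcript, rng):
    """Apply one word-level edit (add/delete/modify) to a transcript and
    return the edited text plus an edit record. Placeholder words stand in
    for LLM-proposed content; the record shape mirrors the corpus schema."""
    words = transcript.split()
    idx = rng.randrange(len(words))
    etype = rng.choice(EDIT_TYPES)
    if etype == "delete":
        target = words[idx]
        edited = words[:idx] + words[idx + 1:]
    elif etype == "add":
        target = "hello"  # placeholder: an LLM would propose this word
        edited = words[:idx] + [target] + words[idx:]
    else:  # modify
        target = "sunny"  # placeholder replacement word
        edited = words[:idx] + [target] + words[idx + 1:]
    return " ".join(edited), {"word": target, "type": etype, "index": idx}

rng = random.Random(0)
edited, rec = propose_edit("the quick brown fox", rng)
```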

3. Annotation Schemes and Data Structure

  • Audio: All files are 16 kHz PCM WAVs; each entry has an associated transcript, list of edits, boundary timestamps, and region type (JSON format). Exact example from (Xue et al., 29 Jan 2026):
    {
      "audio_path":"...wav",
      "transcript":"...",
      "edits":[{"word":"bad","type":"modify","start":1.23,"end":1.45}]
    }
  • Affective Image (EmoTIPS/AiEdit): Image-to-image example with emotion text; emotion_distribution vector (length 8) accompanying each record (Zhang et al., 24 May 2025).
    {
      "original_image": "images/train/000123.jpg",
      "target_image":   "images/train/045678.jpg",
      "text":           "A hidden creek murmurs with peaceful calmness.",
      "emotion_distribution": [0.02, 0.75, 0.01, 0.03, 0.08, 0.02, 0.05, 0.04]
    }
  • Video: Per-shot and per-scene JSON and CSV, with attributes: shot size, angle, type, motion, location, subject category, number of people, sound source (Argaw et al., 2022).
  • Image Editing (HQ-Edit): Each entry is a 6-field JSON object: image paths, input/output captions, and edit/inverse-edit instructions. Images are high-resolution at 900×900 px (Hui et al., 2024).
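A minimal loader for records like the speech example above might look as follows. The field checks mirror the schema shown, but the function itself is an illustrative sketch (including the hypothetical `clip.wav` path), not released dataset code.

```python
import json

REQUIRED_FIELDS = ("audio_path", "transcript", "edits")

def load_audio_annotation(json_text):
    """Parse one speech-edit record (schema as in the example above) and
    sanity-check the edit types and time spans."""
    rec = json.loads(json_text)
    for key in REQUIRED_FIELDS:
        if key not in rec:
            raise ValueError(f"missing field: {key}")
    for edit in rec["edits"]:
        if edit["type"] not in ("add", "delete", "modify"):
            raise ValueError(f"unknown edit type: {edit['type']}")
        if not 0.0 <= edit["start"] < edit["end"]:
            raise ValueError("bad edit span")
    return rec

sample = """{"audio_path": "clip.wav",
             "transcript": "that was a bad idea",
             "edits": [{"word": "bad", "type": "modify",
                        "start": 1.23, "end": 1.45}]}"""
rec = load_audio_annotation(sample)
durations = [e["end"] - e["start"] for e in rec["edits"]]
```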

4. Evaluation Protocols and Benchmarks

Rigorous technical metrics define all AiEdit corpora, calibrated for detection, localization, generation, and human alignment.

  • Detection and Localization (Speech): Equal Error Rate (EER), word-level F1, and standard ROC/AUC; PELM (prior-enhanced LLM) achieves 8.37% EER (detection) and 97.29% F1 (localization) on AiEdit (Xue et al., 29 Jan 2026).
  • Affective Editing: Assessed via Fréchet Inception Distance (FID), semantic clarity (Sem-C), and KL divergence between predicted and target emotion distributions (Zhang et al., 24 May 2025).
  • Image Editing: Quality measured by instruction alignment (HQ-Edit: Alignment and Coherence scores on a [0, 100] scale using GPT-4V), CLIP-based metrics (AnyEdit), and ImgEdit-Judge (X2Edit). Example: InstructPix2Pix fine-tuned with HQ-Edit yields Alignment 47.01 and Coherence 86.16, versus a baseline of 34.71/80.52 (Hui et al., 2024).
  • Video Editing Tasks: Benchmarks in AVE/AiEdit include multi-task attribute classification, clustering, ordering, next-shot selection (contrastive learning), and missing attribute prediction. Reported metrics: per-class accuracy, NMI, purity, and top-1 accuracy (Argaw et al., 2022).
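Two of the metrics above can be sketched directly. The implementations below are illustrative only: the threshold-scan EER and the smoothed KL divergence over 8-way emotion distributions are our own simplifications, not code from the cited papers.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER for edit detection: scan thresholds over the observed scores
    (higher score = 'edited') and return the operating point where the
    false-accept and false-reject rates are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_far, best_frr, gap = 1.0, 1.0, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        far = float(np.mean(pred[labels == 0]))   # genuine flagged as edited
        frr = float(np.mean(~pred[labels == 1]))  # edited files missed
        if abs(far - frr) < gap:
            best_far, best_frr, gap = far, frr, abs(far - frr)
    return (best_far + best_frr) / 2

def emotion_kl(p, q, eps=1e-8):
    """KL divergence between a predicted and a target 8-way emotion
    distribution, with small additive smoothing for zero entries."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Perfectly separable scores give EER = 0; identical distributions give KL = 0.
eer = equal_error_rate([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 1, 0, 0])
kl_same = emotion_kl([0.5, 0.5] + [0.0] * 6, [0.5, 0.5] + [0.0] * 6)
```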

5. Task Taxonomies and Modal Coverage

  • Image Editing: Fine-grained, instruction-based edits: object addition/removal, style transfer, global scene manipulation, multi-turn edits. AnyEdit formalizes 25 edit types across five edit domains (local, global, camera movement, implicit, visual).
  • Speech Editing: Segment-level semantic tampering—single continuous ADD/DELETE/MODIFY event per file.
  • Affective Image Editing: Control along continuous semantic axes in the emotion space—mapping natural language affective requests to image translation.
  • Video Editing: Cinematographic decomposition centering on framing, camera setup, semantic shot content, and audio for indexing and automatic editorial suggestion.

6. Access, Licensing, and Use Cases

Datasets are publicly downloadable from project and institutional repositories with clearly stated licensing:

  • AiEdit Speech: Download from project page, CC BY-NC 4.0; supports speech tampering detection/localization.
  • EmoTIPS/AiEdit: GitHub and S3 access; emotion-aligned image editing research, CC BY-NC 4.0.
  • HQ-Edit, AnyEdit, X2Edit: Open releases with CC BY 4.0 or similar, supporting academic and non-commercial use; Hugging Face Hub, GitHub, and institutional mirrors supply splits and accompanying code (Hui et al., 2024, Yu et al., 2024, Ma et al., 11 Aug 2025).
  • AVE/AiEdit: GitHub-hosted with scene/shot-level downloads; supports "smart" assemble/edit tools, shot clustering, attribute ordering, and next-shot prediction (Argaw et al., 2022).

Applications include benchmarking and model comparison, task-specific fine-tuning (e.g., HQ-Edit for image translation SOTA), evaluation of affective content transfer, audio forensics, and research into the reasoning or alignment capabilities of contemporary T2I and speech models.

7. Research Impact and Benchmarks

AiEdit datasets have advanced the state of the art in diverse subfields:

  • Speech forensics: Enabling robust neural detection of seamless speech edits, overcoming limitations of artifact-based baselines (Xue et al., 29 Jan 2026).
  • Affective content understanding: Providing the first large-scale, human-labeled, emotion-aligned corpus for controllable image editing (Zhang et al., 24 May 2025).
  • Multimodal video editing: Offering a richly labeled foundation for AI-centric workflows in film and video post-production, including attribute-based organization and editorial decision support (Argaw et al., 2022).
  • Instruction-based image editing: Facilitating rapid progress in text-driven and multi-turn editing models through diverse, high-quality data (AnyEdit, X2Edit, HQ-Edit), with improvements over prior synthetic/human datasets as measured by CLIP, GPT-4V, and human evaluation (Hui et al., 2024, Yu et al., 2024, Ma et al., 11 Aug 2025).

This consolidation across modalities and annotation protocols positions AiEdit as a touchstone for benchmarking, development, and practical deployment of advanced content editing models in both academic and applied contexts.
