
OmniBench-99: Video Editing Benchmark

Updated 10 February 2026
  • OmniBench-99 is a comprehensive benchmark featuring 99 human-verified videos across three semantic categories to evaluate text-guided video editing models.
  • It employs two prompt styles—full-sentence and delta—to assess four canonical editing types and eight realistic scenarios, such as appearance changes and weather edits.
  • Evaluation using both automatic metrics (e.g., CLIP Frame Consistency, PickScore) and human studies reveals distinct performance gaps among models, guiding future improvements.

OmniBench-99 is a systematic benchmark designed for the comprehensive evaluation of text-guided video editing models, particularly those seeking universal editing capabilities across diverse domains and editing requirements. Developed to address limitations of prior benchmarks, OmniBench-99 incorporates a human-verified, scenario-rich set of 99 open-license videos, paired with two categories of editing prompts to enable rigorous, comparable, and fine-grained assessment of model editing performance without additional structural controls or fine-tuning (Chen et al., 2024).

1. Motivation and Design Principles

The principal motivation for OmniBench-99 was to address critical gaps in existing benchmarks for video editing models, specifically the limited coverage of realistic scenarios and over-reliance on canonical editing types without assessing situational applicability. Previous datasets such as LOVEU-TGVE-2023 and BalanceCC evaluate only four standard editing types and neglect how models generalize across true-to-life scenarios and semantic categories. OmniBench-99 employs an evaluation-first design philosophy centered on a compact, high-quality, human-verified test suite with broad scenario coverage (8 scenarios) and editing types (4 canonical types). The incorporation of both "full-sentence" and "delta" instruction prompts ensures the benchmark supports flexible assessment paradigms and apples-to-apples comparisons across diverse editing models.

2. Benchmark Structure and Dataset Composition

OmniBench-99 consists of 99 open-license videos, each between 2–20 seconds in length and captured at approximately 30 FPS. The videos are evenly balanced across three primary semantic categories with 33 videos each: Human/Animal, Object, and Environment. Every video features four editing-type prompts and an additional three to four scenario-specific prompts suited to its semantic category. The two styles of editing prompts—full-sentence (e.g., “Make the person’s shirt bright red.”) and delta (e.g., “Change shirt to red.”)—are paired with each source video, supporting the evaluation of both fine-tuned and zero-shot editing systems.

The table below summarizes the dataset's categorical balance and editing prompt coverage:

Category     | Video Count | Scenario-Specific Prompts per Video
Human/Animal | 33          | 3–4
Object       | 33          | 3–4
Environment  | 33          | 3–4

All source videos and prompts are held-out exclusively for evaluation, and there are no train/validation/test splits within the benchmark; model training occurs externally on other corpora.
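The composition above can be sketched as a simple record type plus a balance check. This is an illustrative layout only; the field names and `BenchmarkEntry` class are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for one OmniBench-99 entry; field names are
# illustrative, not the benchmark's published schema.
@dataclass
class BenchmarkEntry:
    video_id: str
    category: str          # "Human/Animal", "Object", or "Environment"
    duration_s: float      # 2-20 seconds, captured at ~30 FPS
    type_prompts: dict     # 4 canonical editing types -> prompt text
    scenario_prompts: list # 3-4 scenario-specific prompts
    delta_style: bool      # delta phrasing vs. full-sentence phrasing

def check_balance(entries):
    """Count videos per category to verify the 33/33/33 split."""
    counts = {}
    for e in entries:
        counts[e.category] = counts.get(e.category, 0) + 1
    return counts
```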

3. Editing Types and Evaluated Scenarios

OmniBench-99 establishes four canonical editing types and distributes these across eight real-world scenarios, ensuring both breadth and depth of editing evaluation:

Editing Types:

  1. Foreground editing: Alters only the principal moving object (e.g., changing a person's pose).
  2. Background editing: Modifies only the static environment, for instance, changing a city skyline to mountains.
  3. Composite editing: Simultaneously changes foreground content and global style (e.g., altering a car’s color and adding rain).
  4. Style/Overall editing: Applies global transformations to color, texture, or the overall stylistic rendering of the video.

Editing Scenarios:

  • Human/Animal: Appearance (e.g., clothing/fur change); Motion/Pose (e.g., walking → running)
  • Object: Addition (insert new object); Removal (delete existing object); Replacement (object swap)
  • Environment: Weather (add snowfall/composite); Time (time-of-day/season changes); Background (scene alteration with foreground preservation)

This dual structure is designed to support scenario-sensitive benchmarking and to reveal per-scenario weaknesses unobservable via editing-type-only scores.
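The taxonomy above can be written out as a small lookup structure. The names follow the benchmark text; the dictionary layout itself is an illustrative assumption.

```python
# Four canonical editing types and eight scenarios, grouped by the
# semantic category of the source video, as described above.
EDITING_TYPES = ["foreground", "background", "composite", "style_overall"]

SCENARIOS_BY_CATEGORY = {
    "Human/Animal": ["appearance", "motion_pose"],
    "Object": ["addition", "removal", "replacement"],
    "Environment": ["weather", "time", "background"],
}

def all_scenarios():
    """Flatten the per-category groups into the full scenario list."""
    return [s for group in SCENARIOS_BY_CATEGORY.values() for s in group]
```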

4. Annotation, Supervision, and Evaluation Protocol

Prompt generation within OmniBench-99 proceeds via a two-step process: structured automatic drafting by GPT-4V followed by manual human quality control for (i) physical realism and (ii) applicability to the input video. No ground-truth “edited” videos are provided; instead, the reference is implicitly defined by the tuple of each source video and its corresponding prompt. Model evaluation is based on the congruence between the generated video and the reference prompt, under both full-sentence and delta prompting.

All 99 videos are evaluated under every editing type and scenario for each model, with results aggregated by type and scenario. This systematic approach ensures robust comparison and identification of model-specific failure cases.
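The aggregation step described above, averaging per-video scores once by editing type and once by scenario, can be sketched as follows. The flat list-of-dicts input format is an assumption for illustration.

```python
from collections import defaultdict

def aggregate(results):
    """Average per-video metric scores by editing type and by scenario.

    `results` is a list of dicts with keys 'editing_type', 'scenario',
    and 'score' (one automatic metric) -- an assumed record layout.
    """
    by_type, by_scenario = defaultdict(list), defaultdict(list)
    for r in results:
        by_type[r["editing_type"]].append(r["score"])
        by_scenario[r["scenario"]].append(r["score"])
    mean = lambda xs: sum(xs) / len(xs)
    return ({k: mean(v) for k, v in by_type.items()},
            {k: mean(v) for k, v in by_scenario.items()})
```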

5. Objective Metrics and Human Studies

OmniBench-99 employs both automatic and human evaluation metrics to measure editing fidelity, alignment, temporal consistency, and preservation of scene structure.

Automatic Metrics:

  • CLIP Frame Consistency:

\text{CLIP Frame} = \frac{1}{N}\sum_{i=1}^{N} \cos\bigl(E_{\text{CLIP}}(\mathrm{frame}_i),\, E_{\text{CLIP}}(\text{prompt})\bigr)

This quantifies the average cosine similarity between the CLIP embeddings of individual frames and the CLIP embedding of the prompt.

  • PickScore:

\text{PickScore} = \frac{1}{N}\sum_{i=1}^{N} P_{\mathrm{LLM}}(\text{frame}_i \mid \text{prompt})

This uses an LLM to compute the preference probability for each frame given the prompt, averaged over all frames.
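Given precomputed embeddings and per-frame preference probabilities, both metrics reduce to simple averages. The sketch below assumes frame and prompt embeddings are supplied as plain vectors; producing those embeddings with an actual CLIP model is outside its scope.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_frame_score(frame_embs, prompt_emb):
    """Average cosine similarity between each frame embedding and the
    prompt embedding, following the CLIP Frame Consistency formula."""
    return sum(cosine(f, prompt_emb) for f in frame_embs) / len(frame_embs)

def pick_score(frame_probs):
    """Mean per-frame preference probability (PickScore formula);
    `frame_probs` are assumed precomputed by the preference model."""
    return sum(frame_probs) / len(frame_probs)
```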

Human Study (Mean Opinion Score, MOS):

Volunteers (N=15) rate each edited video on a 1–5 scale across four axes: Text Alignment (edit accuracy relative to the prompt), Temporal Consistency (coherence across frames), Structure Alignment (preservation of unedited regions), and Overall Quality. These scores are averaged separately by editing type and scenario.
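The MOS computation is a per-axis mean with a range check; a minimal sketch, assuming ratings arrive as a mapping from axis name to the list of per-rater scores:

```python
def mos_by_axis(ratings):
    """Mean Opinion Score per axis over a panel of raters (1-5 scale).

    `ratings` maps each axis (e.g. "Text Alignment") to per-rater
    scores; the mapping shape is an assumption for illustration.
    """
    for axis, scores in ratings.items():
        # Guard against out-of-range entries before averaging.
        assert all(1 <= s <= 5 for s in scores), f"bad rating on {axis}"
    return {axis: sum(s) / len(s) for axis, s in ratings.items()}
```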

6. Comparative Results and Key Observations

Empirical evaluation on OmniBench-99 demonstrates distinct performance gaps between models and editing strategies on both global and scenario-specific metrics. Results summarized from Table 1 (Chen et al., 2024) are as follows:

Model       | CLIP Frame ↑ | PickScore ↑ | Align (MOS) ↑ | Temp (MOS) ↑ | Stru (MOS) ↑ | Overall (MOS) ↑

Editing Type (avg):
OmniCreator | 0.962 | 0.212 | 4.47 | 4.33 | 4.07 | 4.33
Pix2Video   | 0.949 | 0.210 | 3.60 | 3.20 | 3.27 | 3.33

Scenario (avg):
OmniCreator | 0.966 | 0.216 | 4.07 | 4.13 | 4.20 | 4.00
TokenFlow   | 0.951 | 0.210 | 3.07 | 3.07 | 2.93 | 3.13

Qualitative findings indicate that OmniCreator preserves background geometries while applying localized foreground edits, especially for Appearance and Addition scenarios. In Motion/Pose scenarios, it demonstrates superior temporal smoothness relative to DDIM-inversion-based baselines. In composite edits affecting weather, fine-grained details are maintained without degradation to underlying scene texture.

7. Recommendations and Prospective Directions

Analysis of OmniBench-99 usage and performance yields actionable recommendations for future research. Scenario-level evaluation is essential, as type-wise scores can obscure model-specific vulnerabilities seen only in scenario breakdowns. Minimalistic "delta" prompts tend to elicit better model performance, likely due to reduced semantic conflict with the source video, highlighting the benefit of prompt format diversity. Manual curation remains necessary to mitigate LLM hallucination in test prompt generation. Planned extensions to OmniBench-99 include the incorporation of new scenarios such as camera motion and dynamic weather, with community contributions encouraged. The benchmark’s dual prompt structure ensures ongoing utility for both zero-shot and fine-tuned evaluation paradigms (Chen et al., 2024).

By providing a compact, richly annotated, and scenario-diverse test suite, OmniBench-99 defines a new standard for the evaluation of universal video editing models, supporting robust, type-agnostic, and scenario-sensitive performance scrutiny.
