VideoAesBench: Video Aesthetic Benchmark
- VideoAesBench is a benchmark that rigorously evaluates video aesthetic perception in large multimodal models using diverse video content and fine-grained Q-A pairings.
- It employs a structured evaluation pipeline integrating expert annotation, synthetic captioning, and automated Q-A generation to ensure explainable and precise assessments.
- The benchmark reveals model performance gaps across aesthetic dimensions, highlighting challenges in multi-choice reasoning and nuanced video attribute analysis.
VideoAesBench is a comprehensive benchmark established to rigorously evaluate the video aesthetics perception capabilities of large multimodal models (LMMs). Unlike prior benchmarks that focus on objective quality or semantic alignment, VideoAesBench is designed to probe holistic, fine-grained, and explainable assessments of video aesthetic quality across a broad spectrum of content and question types. It employs a highly controlled evaluation pipeline, integrating domain expert annotation, sophisticated question engineering, and large-scale model benchmarking, thus filling a critical gap in LMM evaluation for video understanding (Li et al., 29 Jan 2026).
1. Dataset Composition and Content Diversity
The VideoAesBench corpus comprises 1,804 videos, methodically sampled to maximize diversity in both production modality and content characteristics. The dataset distribution is as follows:
| Video Source | Count |
|---|---|
| User-Generated (UGC) | 1,085 |
| AI-Generated (AIGC) | 395 |
| Robotic-Generated (RGC) | 154 |
| Compressed | 86 |
| Game | 84 |
This taxonomy enables fine-grained performance analysis and guards against domain overfitting. The set encompasses a broad array of real-world, synthetic, and post-processed videos, ensuring representativeness for research on both natural and algorithmically produced video aesthetics (Li et al., 29 Jan 2026).
2. Annotation Pipeline and Question Engineering
VideoAesBench applies a multistage human-in-the-loop annotation protocol for robust, explainable Q-A construction:
- Domain expert annotation: Professional annotators author comprehensive, multi-dimensional descriptions for each video, targeting all aspects of the aesthetics framework.
- Synthetic captioning: Gemini-2.5 condenses multiple descriptions into an aesthetic caption to ensure semantic density and linguistic consistency.
- Automated Q-A generation: GPT-5.2 ingests the aesthetic caption and original video to produce four Q-A pairs, one per designated question type (single-choice, multi-choice, true/false, open-ended).
- Multi-reviewer validation: A minimum of three human reviewers iteratively audit, refine, or regenerate Q-A pairs, enforcing precision and answerability.
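The staged protocol above can be sketched as a simple curation loop. The stage functions here are hypothetical stubs: the real pipeline uses professional annotators, Gemini-2.5 for captioning, and GPT-5.2 for Q-A generation.

```python
# A minimal sketch of the multistage curation protocol. Stage functions are
# stand-ins for the human and model components described in the text.
QUESTION_FORMATS = ["Single-choice", "Multi-choice", "True/False", "Open-ended"]

def condense_captions(descriptions):
    """Stub for synthetic captioning: merge expert descriptions into one caption."""
    return " ".join(descriptions)

def generate_qa(caption):
    """Stub for automated Q-A generation: one candidate pair per question format."""
    return [{"question_type": fmt, "question": caption, "answer": None}
            for fmt in QUESTION_FORMATS]

def passes_review(qa_pair, reviewer_id):
    """Stub reviewer: always accepts; the real protocol audits, refines, or regenerates."""
    return True

def curate_video(video_id, expert_descriptions, n_reviewers=3):
    caption = condense_captions(expert_descriptions)
    candidates = generate_qa(caption)
    # Multi-reviewer validation: a pair survives only if every reviewer accepts it.
    validated = [qa for qa in candidates
                 if all(passes_review(qa, r) for r in range(n_reviewers))]
    return {"video_id": video_id, "caption": caption, "qa": validated}
```

The stubs accept everything, so each video yields one candidate per format; in the real protocol, reviewers can reject and regenerate pairs at this step.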
Example instance:
```json
{
  "video_id": "vadb_20477693_1",
  "question_type": "True/False",
  "dimension": "Visual Composition",
  "question": "The aerial perspective provides a clear overview of the urban landscape but excessive shot variety and clutter weaken a focused message.",
  "options": [],
  "answer": "True"
}
```
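A record in this format can be checked with a small validator. The field names follow the example above; the specific consistency checks are an assumption about how such records should be structured.

```python
import json

# Illustrative record in the benchmark's format (question text abbreviated).
RECORD = '''{
  "video_id": "vadb_20477693_1",
  "question_type": "True/False",
  "dimension": "Visual Composition",
  "question": "...",
  "options": [],
  "answer": "True"
}'''

VALID_TYPES = {"Single-choice", "Multi-choice", "True/False", "Open-ended"}

def validate_record(raw):
    """Parse a Q-A record and enforce basic format consistency."""
    rec = json.loads(raw)
    assert rec["question_type"] in VALID_TYPES
    if rec["question_type"] == "True/False":
        # True/False items carry no options and a boolean-valued answer string.
        assert rec["options"] == [] and rec["answer"] in {"True", "False"}
    return rec
```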
3. Aesthetic Taxonomy and Question Distribution
VideoAesBench advances a hierarchical framework for video aesthetics, decomposed into three principal categories and twelve fine-grained dimensions:
- A. Visual Form (5): composition, elements & structure, shot size, depth of field, visual subject
- B. Visual Style (4): lighting, color, visual tone, creativity
- C. Visual Affectiveness (3): emotion, theme & communication, viewer interest
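The three-category, twelve-dimension hierarchy can be encoded as a lookup table for per-dimension score aggregation. The grouping below is taken from the taxonomy above; the data structure itself is an illustrative choice.

```python
# Hierarchical aesthetics taxonomy: three categories, twelve dimensions.
TAXONOMY = {
    "Visual Form": ["composition", "elements & structure", "shot size",
                    "depth of field", "visual subject"],
    "Visual Style": ["lighting", "color", "visual tone", "creativity"],
    "Visual Affectiveness": ["emotion", "theme & communication",
                             "viewer interest"],
}

# Invert to map each fine-grained dimension to its parent category.
DIMENSION_TO_CATEGORY = {dim: cat for cat, dims in TAXONOMY.items()
                         for dim in dims}
```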
The corpus encompasses 1,804 Q-A pairs, distributed over four targeted question formats:
| Format | Proportion | Approx. Count |
|---|---|---|
| Single-choice (SC) | 41% | ~740 |
| Multi-choice (MC) | 18% | ~325 |
| True/False (TF) | 21% | ~380 |
| Open-ended (OE) | 20% | ~360 |
This multidimensional question pool supports both categorical evaluation and free-form reasoning, facilitating nuanced model diagnostics (Li et al., 29 Jan 2026).
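The approximate counts in the table follow directly from the stated proportions and the 1,804-pair total; small discrepancies against the "~" figures are rounding effects.

```python
# Derive per-format counts from the published proportions and corpus size.
TOTAL = 1804
PROPORTIONS = {"SC": 0.41, "MC": 0.18, "TF": 0.21, "OE": 0.20}

counts = {fmt: round(TOTAL * p) for fmt, p in PROPORTIONS.items()}
# counts -> {'SC': 740, 'MC': 325, 'TF': 379, 'OE': 361}
```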
4. Evaluation Protocol and Scoring Methodology
For closed-ended question types (SC, MC, TF), exact matching is required: a model scores only if it produces the precise set of correct answers. Open-ended answers are evaluated on a 0/1/2 rubric by GPT-5.2 operating under a strict evaluator prompt.
Metrics are reported per question:
- Accuracy: $\mathrm{Accuracy} = N_{\mathrm{correct}} / N_{\mathrm{total}}$
- Precision: $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$
- Recall: $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$
- F1-score: $F_1 = 2PR / (P + R)$

where, for multi-choice questions, TP, FP, and FN count predicted options against the reference option set.
Scoring is performed on a per-model, per-dimension, and per-question-type basis, supporting granular comparative analysis (Li et al., 29 Jan 2026).
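A sketch of the closed-ended scoring described above: exact-match accuracy over the full answer set, plus option-level precision, recall, and F1. Treating multi-choice answers as option sets is an assumption about the implementation, not a detail confirmed by the source.

```python
def score_closed(pred: set, gold: set) -> dict:
    """Score one closed-ended question from predicted and reference option sets."""
    tp = len(pred & gold)   # options correctly selected
    fp = len(pred - gold)   # options selected but wrong
    fn = len(gold - pred)   # correct options missed
    exact = 1.0 if pred == gold else 0.0            # strict exact-match credit
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"exact": exact, "precision": precision, "recall": recall, "f1": f1}
```

For example, predicting {A, C} against reference {A, B, C} earns perfect precision but only 2/3 recall, and zero exact-match credit under the strict protocol.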
5. Benchmark Results and Detailed Analysis
Twenty-three LMMs (18 open-source, 5 commercial) are benchmarked on VideoAesBench, producing the following headline results:
Overall Performance
- Random-guess closed Q-A baseline: ~33%
- Top open-source (Qwen3-VL-32B): 66.69%
- Top closed-source (Claude-Sonnet-4.5): 67.88%
- Runner-up (OpenAI o3): 64.55%
By Question Type (best models)
- True/False: ~74.8% (Qwen3-VL-32B)
- Single-choice: ~73.3% (Claude-Sonnet-4.5)
- Multi-choice: ~64.9% (Claude-Sonnet-4.5)
- Open-ended: ~69.2% (GPT-5.2)
- Empirical difficulty (hardest to easiest): MC > OE > SC > TF
By Aesthetic Dimension
- Strength: Color, Visual Subject, Creativity (>70%, best)
- Mid-range: Composition (~65–68%)
- Weakest: Depth of Field, Viewer Interest (~55–60%)
By Source Domain
- Performance is roughly balanced across UGC, AIGC, RGC, compressed, and game videos.
- Model-specific strengths: Qwen3-VL-32B excels on AIGC, Claude-Sonnet-4.5 on UGC/RGC/game, Gemini-2.5-Pro on compressed (particularly style/affectiveness) (Li et al., 29 Jan 2026).
6. Comparative Context and Related Benchmarks
While general video generation and analytics benchmarks—such as Video-Bench (Han et al., 7 Apr 2025), VABench (Hua et al., 10 Dec 2025), and FoleyBench (Dixit et al., 17 Nov 2025)—evaluate models on semantic fidelity, temporal alignment, or multimodal correspondence, they do not provide the holistic and fine-grained aesthetic interpretability central to VideoAesBench. VideoAesBench's methodological focus on an aesthetics taxonomy, explainable Q-A formats, and careful human-in-the-loop curation is unique.
Compared to pure metric- or embedding-based evaluation (e.g., Fréchet Video Distance, CLIP-score), VideoAesBench enables multidimensional analysis of perceptual and affective attributes that align more closely with human experience, yet with the rigor of controlled annotation and automated scoring (Li et al., 29 Jan 2026). A plausible implication is its capacity to drive research on explainable and human-value aligned aesthetic judgment for next-generation LMMs.
7. Key Insights, Limitations, and Future Directions
Key observations from the benchmark include:
- Leading models demonstrate only basic aesthetic perception; overall performance remains notably below human-level precision and completeness.
- Complex multi-choice and open-ended reasoning, as well as parsing of temporal and fine-grained spatial dynamics, are persistent weaknesses.
- Performance remains uneven across aesthetic dimensions, with depth of field and viewer interest poorly modeled.
- Closed-source models marginally outperform open-source models, with the notable exception of Qwen3-VL-32B.
Recommended research directions include:
- Enhanced multi-choice and open-ended answer reasoning architectures
- Fine-grained parsing of temporal phenomena (motion smoothness, pacing)
- Dimension-specific model balancing and cross-domain regularization
- Explainable aesthetic feedback mechanisms extending beyond scalar scoring
VideoAesBench thus serves as a gold-standard testbed for explainable aesthetic evaluation in LMMs, providing the framework to analyze, compare, and improve models on visual form, style, and affectiveness in video understanding (Li et al., 29 Jan 2026).