
MME-CoF Benchmark

Updated 12 February 2026
  • MME-CoF Benchmark is a standardized evaluation suite that assesses zero-shot visual reasoning in video generative models using a Chain-of-Frame approach.
  • It systematically tests 59 curated prompts across 12 reasoning categories, focusing on spatiotemporal coherence, geometric consistency, and logical planning.
  • Empirical results show models excel in local visual plausibility while struggling with long-horizon temporal consistency and complex logical chaining.

The MME-CoF Benchmark is a compact, standardized evaluation suite designed to critically assess the zero-shot visual reasoning capabilities of large video generative models under the Chain-of-Frame (CoF) reasoning paradigm. Developed to move beyond traditional metrics of video quality and fidelity, MME-CoF probes models' abilities to perform diverse, multi-step visual reasoning that demands spatiotemporal coherence, geometric consistency, and logical planning. The benchmark systematically characterizes both the strengths and failure modes of state-of-the-art video models—such as Veo-3, Sora-2, Kling-v1, and Seedance-1.0-pro—across 12 distinct reasoning dimensions, enabling in-depth diagnosis of model behavior in challenging visual scenarios (Guo et al., 30 Oct 2025).

1. Benchmark Rationale and Chain-of-Frame Reasoning

MME-CoF is explicitly constructed to address shortcomings of prior video model evaluation strategies, which have historically focused on surface-level video synthesis quality rather than on the underlying reasoning capabilities. The benchmark is built around the Chain-of-Frame (CoF) reasoning paradigm, in which a model generates each frame of a video as an explicit visual step, analogous to the “chain-of-thought” (CoT) protocol in language modeling. In CoF reasoning, each generated frame is interpreted as an intermediate state in a temporal reasoning chain, reflecting how the model builds upon and integrates visual evidence over time to accomplish complex tasks (Guo et al., 30 Oct 2025).

This approach extends recent work on frame-level traceability in multimodal LLMs, where chain-of-frames reasoning traces (ordered lists of micro-reasoning steps grounded in specific frames) show that a model's intermediate visual states are critical for robust multi-step decision making and for mitigating the spurious hallucinations seen in language-only chains (Ghazanfari et al., 31 May 2025).

2. Structure and Composition of the Benchmark

MME-CoF comprises 59 hand-curated entries, each consisting of a carefully formulated visual prompt and a reference solution (e.g., mask/frame annotations or canonical camera views). These entries are distributed across 12 reasoning categories:

| Reasoning Category | Example Focus | Avg. Entries per Category |
|---|---|---|
| Visual Detail Reasoning | Color, left/right discrimination | 4.9 |
| Visual Trace Reasoning | Object motion paths, branches | 4.9 |
| Real-world Spatial Reasoning | Orientation, perspective | ~5 |
| 3D Geometry Reasoning | Nets, folds | ~5 |
| 2D Geometry Reasoning | Auxiliary lines, constructions | ~5 |
| Physics-based Reasoning | Collisions, friction, gravity | ~5 |
| Rotation Reasoning | In-plane, 3D object rotation | ~5 |
| Table & Chart Reasoning | Highlighting, focus tasks | ~5 |
| Object Counting Reasoning | Aggregate object tracking | ~5 |
| GUI Reasoning | Interface interactions | ~5 |
| Embodied Reasoning | Affordances, tool use | ~5 |
| Medical Reasoning | Lesion localization, navigation | ~5 |

Prompts are imperative, use explicit constraints about camera and motion, and intentionally avoid answer hints to ensure a truly zero-shot evaluation protocol. All video generations are required to follow precise, reproducible instructions (e.g., “Zoom into the red object step by step, no pan or dolly”) under static camera conditions unless otherwise specified (Guo et al., 30 Oct 2025).
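The prompt/reference pairing described above can be sketched as a simple record. The field and class names here are illustrative, not the benchmark's official schema:

```python
from dataclasses import dataclass

@dataclass
class CoFEntry:
    """One MME-CoF-style benchmark entry (hypothetical schema for illustration)."""
    category: str           # one of the 12 reasoning categories
    prompt: str             # imperative, constraint-heavy, no answer hints
    camera: str = "static"  # static camera unless the prompt specifies otherwise
    reference: str = ""     # path to mask/frame annotations or canonical views

entry = CoFEntry(
    category="Visual Detail Reasoning",
    prompt="Zoom into the red object step by step, no pan or dolly.",
    reference="refs/visual_detail/entry_01.json",
)
```

Keeping the camera constraint as an explicit field makes the zero-shot protocol auditable: a verifier can reject any generation whose camera motion contradicts the entry's stated constraint.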

3. Evaluation Protocols and Metrics

The MME-CoF evaluation pipeline mandates six video generations per model per prompt, all at standardized resolution and frame rate (1280×720, 24 FPS, 8 s duration; with 5 s for some models). There is no fine-tuning or external tool reliance; only fully zero-shot model inference is permitted.
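The per-prompt protocol above reduces to a small sampling loop. This is a minimal sketch with constants from the paper; `run_model` stands in for whatever zero-shot video generation API a given model exposes:

```python
# Protocol constants reported by the benchmark (8 s duration; some models emit 5 s clips).
GENERATIONS_PER_PROMPT = 6
RESOLUTION = (1280, 720)
FPS = 24
DURATION_S = 8

def generate_samples(run_model, prompt):
    """Collect the six independent zero-shot generations the protocol requires.

    `run_model` is a placeholder callable (prompt, **settings) -> video;
    no fine-tuning or external tools are involved, matching the protocol.
    """
    return [
        run_model(prompt, resolution=RESOLUTION, fps=FPS, duration=DURATION_S)
        for _ in range(GENERATIONS_PER_PROMPT)
    ]

# Example with a stub model that just echoes the prompt.
videos = generate_samples(lambda p, **kw: f"video({p})",
                          "Rotate the cube 90 degrees clockwise.")
```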

Evaluation centers on a combination of expert qualitative labeling and automated scoring via Gemini-2.5-Pro. The main metrics applied are:

  • Qualitative Success Rate: Proportion of correct generations per prompt, evaluated by human experts.
  • Automatic Verifier Scores: Five dimensions (scored 0–4) per sample:

    1. Instruction Alignment
    2. Temporal Consistency
    3. Visual Stability
    4. Content Fidelity
    5. Focus Relevance

Standard classification statistics such as accuracy, precision, recall, and F1-score are computed for complementary analyses, with the explicit formulae provided in the benchmark documentation. The unified protocol and automatic scoring ensure reproducibility and allow for cross-model comparison (Guo et al., 30 Oct 2025).
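The complementary classification statistics have their standard definitions from confusion counts; a minimal reference implementation (the benchmark's own scripts may differ in edge-case handling):

```python
def classification_stats(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    with zero-division guarded to 0.0."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

For example, 3 true positives, 1 false positive, 1 false negative, and 5 true negatives yield accuracy 0.8 and precision, recall, and F1 all equal to 0.75.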

4. Empirical Results and Model Analysis

Systematic evaluation with MME-CoF reveals nontrivial emergent behaviors as well as clear failure modes in leading video models:

  • Strengths: Veo-3 and related models demonstrate high success rates in tasks involving fine-grained spatial grounding, short-horizon trace coherence, and visual stability under controlled settings (e.g., Visual Stability ≈ 1.89 on the 0–4 scale). These capabilities are particularly pronounced for salient target localization and simple path-following.

  • Weaknesses: All tested models exhibit marked deficiencies in long-horizon causal planning (Visual Trace < 1.0 when multi-step sequences exceed four steps), adherence to strict geometric constraints (frequent misalignments and topological errors in nets/folds), quantitative physics (violations of momentum, incorrect collision order), and abstract logic (e.g., table/chart refocusing errors, GUI target misidentification, anatomical distortion in medical cases) (Guo et al., 30 Oct 2025).

Empirical subscores for Veo-3.0-preview illustrate this contrast: Spatial (real-world) reasoning is relatively strong (2.10 ± 1.46), but 3D Geometry (1.54 ± 1.43), 2D Geometry (1.27 ± 1.20), and Physics (1.44 ± 1.35) lag behind. Focus relevance (2.26 ± 1.73) notably outpaces instruction alignment (0.54 ± 1.06), suggesting a preference for visual plausibility over strict adherence to prompts.
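The mean ± standard deviation subscores quoted above are per-dimension aggregates over all scored samples; a sketch of that aggregation (assuming sample standard deviation, which may differ from the benchmark's exact convention):

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation of per-sample 0-4 verifier scores,
    rounded to two decimals as in the reported subscores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return round(mean, 2), round(std, 2)

mean, std = summarize([4.0, 0.0, 2.0, 3.0, 1.0, 2.0])  # six generations of one prompt
```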

This suggests that zero-shot video generative models can generate locally plausible and visually coherent sequences but lack robust mechanisms for maintaining global state, memory, or explicit logical constraints over long temporal horizons.

5. Model Behavior and Reasoning Interpretation

MME-CoF analysis reveals that observable CoF reasoning in current models is predominantly pattern replay from training data distributions rather than evidence of principled, compositional inference. Visual plausibility is frequently prioritized over exact instruction execution. Failure cases typically occur when prompts require nontrivial integration of past and current frames, cross-frame logical chaining, or precise multi-step spatial manipulation.

In this respect, MME-CoF builds on the broader Chain-of-Frames literature, which has shown that frame-aware reasoning traces—explicitly referring to the frames supporting each reasoning step—lead to improved factual grounding and a substantial reduction in hallucination rates in video-LMMs (Ghazanfari et al., 31 May 2025). As evidenced by CoF fine-tuning of InternVL models, integration of temporally grounded traces improves performance on broader video understanding benchmarks and provides interpretable, auditable chains for model outputs.

A plausible implication is that the integration of similar intermediate state-tracking or explicit constraint modules into the video generation pipeline would be necessary to bridge the observed reasoning gaps in current models.

6. Benchmark Usage, Accessibility, and Future Directions

MME-CoF provides practical diagnostic tools for video model development, model selection, and failure analysis. Its fine-grained categorization supports targeted research into reasoning weaknesses (e.g., physical reasoning vs. geometric reasoning). All data—prompts, reference annotations, evaluation scripts, inference wrappers, verifier prompt templates, and scoring utilities—are available through the project portal (https://video-cof.github.io) (Guo et al., 30 Oct 2025).

The empirical analyses provided in (Guo et al., 30 Oct 2025) recommend several research avenues:

  • Hybrid Reasoning Architectures: Combining video models as visual "engines" with external symbolic or language-model-driven reasoners.
  • Prompt Engineering: Enforcing step-by-step visual validation to improve multi-step reasoning.
  • State Tracking Integration: Explicit incorporation of working memory or constraint-enforcing modules into generative loops for improved long-horizon reasoning fidelity.
  • Scaling and Diversity: Expansion of the benchmark to include additional categories and more exemplars per category, including synthetic scenarios or more challenging annotation protocols.

This suggests that future benchmarks may increasingly require not just synthetic generation but also explicit diagnostic capabilities, enabling both evaluation and interpretability in complex temporal reasoning tasks.

MME-CoF aligns with, and extends, frameworks advanced in the Chain-of-Frames literature (Ghazanfari et al., 31 May 2025), which frames temporally ordered reasoning as essential for robust multimodal LLM performance. Unlike broader video understanding benchmarks, MME-CoF's focus on reasoning chains, zero-shot strictness, and its fine-grained, multi-category coverage makes it a unique diagnostic resource for the analysis and advancement of next-generation video generative models. The benchmark thus serves both as an evaluative standard and as a catalyst for methodological innovation in frame-aware and multimodal reasoning research.
