- The paper introduces the novel MF² benchmark that evaluates holistic narrative comprehension using contrastive true-versus-false claim pairs.
- It employs manual annotation and a continuous engagement protocol to rigorously assess multimodal integration in full-length films.
- Experimental findings reveal that even the best-performing VLM reaches only 60.6% pairwise accuracy, trailing human performance by roughly 24.1 percentage points.
The paper introduces a novel benchmark designed to address the challenge of understanding long-form video content with vision-language models (VLMs), focusing on holistic comprehension and narrative integration over extended video durations. Named MF², short for "Movie Facts and Fibs," the benchmark measures how effectively AI models comprehend, consolidate, and recall narrative elements in full-length movies. It comprises 53 full-length, open-licensed films with an average duration of approximately 88 minutes. Each movie is paired with manually curated contrastive claim pairs (one true, one false) that probe comprehension beyond the shallow retrieval to which current VLMs are often limited.
Core Contributions
Novel Benchmark Design: MF² advances beyond typical "needle-in-a-haystack" video benchmarks to evaluate holistic narrative understanding through binary claim pairs. This reduces biases associated with multiple-choice formats and enables precise assessment of reasoning.
Manual Annotation for Authenticity: The benchmark distinguishes itself through rigorous human annotation, with over 850 human-written claim pairs targeting key narrative aspects such as character motivations, emotions, causal chains, and temporal order. Annotators construct contrastive pairs so that a model must correctly judge both the true and the false claim in each pair, emphasizing understanding over memorization.
Evaluation Protocol: The paper implements a continuous engagement protocol in which VLMs judge each claim independently as true or false, without access to the paired structure at prediction time. This format requires models to ground their judgments in both the visual and textual input rather than rely on scene-specific markers or answer-format cues.
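To make the scoring concrete, here is a minimal sketch of how pairwise accuracy could be computed under this protocol, assuming (as the paper's setup suggests but this summary does not spell out) that a pair counts as correct only when the model labels the true claim true *and* its false counterpart false. The function name, claim IDs, and data layout are illustrative, not taken from the paper's code.

```python
def pairwise_accuracy(judgments, pairs):
    """Fraction of claim pairs where both claims are judged correctly.

    judgments: dict mapping claim_id -> model's boolean verdict
    pairs: list of (true_claim_id, false_claim_id) tuples
    """
    correct = sum(
        1
        for true_id, false_id in pairs
        if judgments[true_id] is True and judgments[false_id] is False
    )
    return correct / len(pairs)


# Toy example with hypothetical claim IDs: the model judges each claim
# independently; only pairs 1 and 3 are fully correct.
judgments = {
    "t1": True, "f1": False,   # pair 1: both claims correct
    "t2": True, "f2": True,    # pair 2: false claim misjudged as true
    "t3": True, "f3": False,   # pair 3: both claims correct
}
pairs = [("t1", "f1"), ("t2", "f2"), ("t3", "f3")]
print(pairwise_accuracy(judgments, pairs))  # 2 of 3 pairs fully correct
```

This stricter pair-level scoring is what makes the metric harder than per-claim accuracy: a model that answers "true" to everything scores 50% on individual claims but 0% pairwise.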
Experimental Findings
Through a series of experiments, the paper evaluates both open-weight and closed-weight state-of-the-art VLMs alongside human baselines. Models across the board fall short of human performance on narrative comprehension, revealing a substantial gap between current AI capabilities and human understanding. The highest-performing model, Gemini 2.5 Pro, achieves a pairwise accuracy of 60.6% when given both video and subtitles, still 24.1 percentage points below human results. The study also analyzes the interplay between video and subtitle inputs, showing that subtitle availability significantly improves performance, particularly for larger models.
Practical and Theoretical Implications
The MF² benchmark has compelling implications for the future of VLMs concerning both practical applications and theoretical understandings of AI comprehension. Practically, improving model interaction with narrative structures could transform applications in education, storytelling, and media analytics, where holistic content understanding is crucial. Theoretically, the benchmark highlights the need for continued exploration into model architectures that effectively integrate multimodal inputs, manage long-term dependencies, and enhance reasoning capabilities over extended textual and visual contexts.
Future Directions
Given the insights derived from MF², future research may focus on better synchronizing visual and textual inputs, improving long-range contextual memory, and refining reasoning mechanisms within existing architectures. The dataset, freely available for academic and research purposes, provides a foundation for collaborative work on these fronts.
In conclusion, the MF² benchmark paper offers a systematic approach to evaluate and enhance the narrative comprehension capabilities of VLMs. It calls for further research and development, targeting improvements in AI's ability to process and understand complex long-form narratives, positioning itself as a critical resource in the quest for advancing next-generation AI models.