- The paper introduces the novel MF² benchmark that evaluates holistic narrative comprehension using contrastive true-versus-false claim pairs.
- It employs manual annotation and a continuous engagement protocol to rigorously assess multimodal integration in full-length films.
- Experimental findings reveal that even the best-performing VLM reaches only 60.6% pairwise accuracy, trailing human performance by roughly 24.1 percentage points.
The paper introduces a novel benchmark designed to address the challenge of understanding long-form video content with vision-language models (VLMs), focusing on holistic comprehension and narrative integration over extended video durations. Named MF², short for "Movie Facts and Fibs," the benchmark measures how effectively AI models comprehend, consolidate, and recall narrative elements in full-length movies. It comprises 53 full-length, open-licensed films with an average duration of approximately 88 minutes. Each movie is paired with manually curated contrastive claim pairs (one true, one false) that probe comprehension beyond the shallow retrieval to which current VLMs are often limited.
Core Contributions
Novel Benchmark Design: MF² advances beyond typical "needle-in-a-haystack" video benchmarks to evaluate holistic narrative understanding through binary claim pairs. This reduces biases associated with multiple-choice formats and enables precise assessment of reasoning.
Manual Annotation for Authenticity: The benchmark distinguishes itself through rigorous human annotation, with over 850 human-written claim pairs targeting key narrative aspects such as character motivations, emotions, causal chains, and temporal order. Annotators construct contrastive pairs so that a model must correctly judge both the true and the false claim in each pair, emphasizing understanding over memorization.
Evaluation Protocol: The paper implements a continuous engagement protocol in which VLMs judge each claim independently as true or false, without access to the paired structure at prediction time. This format requires models to ground their judgments in both the visual and textual input rather than rely on scene-specific markers or answer-format cues.
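To make the scoring concrete, here is a minimal sketch of how pairwise accuracy could be computed under this protocol, assuming (as the paper's setup suggests but this summary does not spell out) that a pair counts as correct only when the model labels the true claim true *and* its false counterpart false. The function name, claim IDs, and data layout are illustrative, not taken from the paper's code.

```python
def pairwise_accuracy(judgments, pairs):
    """Fraction of claim pairs where both claims are judged correctly.

    judgments: dict mapping claim_id -> model's boolean verdict
    pairs: list of (true_claim_id, false_claim_id) tuples
    """
    correct = sum(
        1
        for true_id, false_id in pairs
        if judgments[true_id] is True and judgments[false_id] is False
    )
    return correct / len(pairs)


# Toy example with hypothetical claim IDs: the model judges each claim
# independently; only pairs 1 and 3 are fully correct.
judgments = {
    "t1": True, "f1": False,   # pair 1: both claims correct
    "t2": True, "f2": True,    # pair 2: false claim misjudged as true
    "t3": True, "f3": False,   # pair 3: both claims correct
}
pairs = [("t1", "f1"), ("t2", "f2"), ("t3", "f3")]
print(pairwise_accuracy(judgments, pairs))  # 2 of 3 pairs fully correct
```

This stricter pair-level scoring is what makes the metric harder than per-claim accuracy: a model that answers "true" to everything scores 50% on individual claims but 0% pairwise.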
Experimental Findings
Through a series of experiments, the paper evaluates both open-weight and closed-weight state-of-the-art VLMs alongside human baselines. Models across the board fall short of human performance on narrative comprehension, revealing a substantial gap between current AI capabilities and human understanding. The highest-performing model, Gemini 2.5 Pro, achieves a pairwise accuracy of 60.6% when given both video and subtitles, still 24.1 percentage points below human results. The study also analyzes the interplay between video and subtitle inputs, showing that subtitle availability significantly improves performance, particularly for larger models.
Practical and Theoretical Implications
The MF² benchmark has compelling implications for the future of VLMs concerning both practical applications and theoretical understandings of AI comprehension. Practically, improving model interaction with narrative structures could transform applications in education, storytelling, and media analytics, where holistic content understanding is crucial. Theoretically, the benchmark highlights the need for continued exploration into model architectures that effectively integrate multimodal inputs, manage long-term dependencies, and enhance reasoning capabilities over extended textual and visual contexts.
Future Directions
Given the insights derived from MF², future research may focus on better synchronizing visual and textual inputs, improving long-range contextual memory, and refining reasoning mechanisms within existing architectures. The dataset, freely available for academic and research purposes, provides a foundation for collaborative work on these fronts.
In conclusion, the MF² benchmark paper offers a systematic approach to evaluate and enhance the narrative comprehension capabilities of VLMs. It calls for further research and development, targeting improvements in AI's ability to process and understand complex long-form narratives, positioning itself as a critical resource in the quest for advancing next-generation AI models.