- The paper introduces NEPTUNE, a benchmark that uses a semi-automatic pipeline with VLMs and LLMs to generate dense, time-aligned captions and QAD sets.
- The paper demonstrates that the novel Gemma Equivalence Metric (GEM) outperforms traditional metrics like BLEU and CIDEr in evaluating open-ended video QA.
- The paper highlights NEPTUNE’s potential to advance multimodal, long-form reasoning and to close the performance gap observed between open-source and proprietary state-of-the-art VideoQA models.
Overview of NEPTUNE: Benchmarking Long Video Understanding
The development of robust models for long video understanding is a critical challenge in computer vision, especially given the surge in online video content. This paper introduces NEPTUNE, a benchmark designed to address a gap in current datasets by focusing on long video question-answering (VideoQA) that necessitates comprehensive temporal and multimodal reasoning.
Methodology and Dataset Creation
NEPTUNE represents a significant step forward in long video comprehension by offering a scalable, semi-automatic pipeline that leverages vision-language models (VLMs) and LLMs to generate dense, time-aligned video captions and complex question-answer-decoy (QAD) sets. The pipeline is notable for operating largely automatically over a diverse set of videos sourced from YouTube, reducing human effort by nearly half compared to fully manual annotation.
The semi-automatic pipeline encompasses several critical stages: video selection, signal extraction, video captioning, QAD generation, and manual rater verification. The application of open-source and proprietary tools allows NEPTUNE to offer a diverse range of questions and domains, demonstrating a commitment to capturing the complexity of real-world scenarios in long video formats.
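The five stages above can be sketched as a simple composition of functions. This is a minimal illustrative skeleton, not the authors' implementation: every function body here is a stub, and all names (`VideoRecord`, `run_pipeline`, etc.) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class VideoRecord:
    video_id: str
    captions: list = field(default_factory=list)  # time-aligned captions
    qads: list = field(default_factory=list)      # question-answer-decoy sets
    verified: bool = False

def select_video(video_id: str) -> VideoRecord:
    """Stage 1: pick a candidate long video (e.g. from YouTube)."""
    return VideoRecord(video_id=video_id)

def extract_signals(rec: VideoRecord) -> dict:
    """Stage 2: extract signals such as frames and ASR transcripts (stubbed)."""
    return {"frames": [], "asr": ""}

def caption_video(rec: VideoRecord, signals: dict) -> VideoRecord:
    """Stage 3: a VLM would produce dense, time-aligned captions (stubbed)."""
    rec.captions.append((0.0, 10.0, "placeholder caption"))
    return rec

def generate_qads(rec: VideoRecord) -> VideoRecord:
    """Stage 4: an LLM would turn captions into QAD sets (stubbed)."""
    rec.qads.append({
        "question": "placeholder question",
        "answer": "placeholder answer",
        "decoys": ["d1", "d2", "d3", "d4"],
    })
    return rec

def verify(rec: VideoRecord) -> VideoRecord:
    """Stage 5: human raters check and correct the generated QADs."""
    rec.verified = True
    return rec

def run_pipeline(video_id: str) -> VideoRecord:
    rec = select_video(video_id)
    signals = extract_signals(rec)
    rec = caption_video(rec, signals)
    rec = generate_qads(rec)
    return verify(rec)
```

The key design point the sketch captures is that only the final stage involves humans; everything upstream is model-driven, which is what makes the pipeline scalable.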
Evaluation Metrics and Benchmarking
To assess models' proficiency in long video understanding, NEPTUNE offers two modes of evaluation: multiple-choice and open-ended question answering. For the latter, the Gemma Equivalence Metric (GEM) is introduced as a novel model-based metric that outperforms traditional rule-based approaches like BLEU and CIDEr. GEM is built on a Gemma model fine-tuned on a generic answer-equivalence dataset, making it a robust tool for judging complex, open-ended answers.
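The two evaluation modes can be illustrated with a short sketch. Multiple-choice scoring is plain accuracy; for the open-ended mode, `gem_equivalent` below is only a stand-in for the GEM judge, since the real metric queries a fine-tuned Gemma model rather than the trivial string heuristic used here.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of questions where the chosen option matches the answer key."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def gem_equivalent(candidate: str, reference: str) -> bool:
    """Placeholder for the GEM judge: the real metric asks a fine-tuned
    Gemma model whether two answers are equivalent; this stub merely
    compares whitespace- and case-normalized strings for illustration."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(candidate) == norm(reference)

def open_ended_score(candidates, references):
    """Mean equivalence rate over an open-ended QA set."""
    return sum(
        gem_equivalent(c, r) for c, r in zip(candidates, references)
    ) / len(references)
```

The point of a model-based judge is visible even in this toy form: rule-based metrics like BLEU penalize paraphrases, whereas an equivalence judge asks whether two answers mean the same thing.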
The performance of several state-of-the-art VideoQA models on the NEPTUNE dataset highlights a substantial gap between open-source models and proprietary models like Gemini-1.5 and GPT-4. This discrepancy reveals the persistent challenge of developing models that generalize effectively to long-form video content.
Implications and Future Directions
The NEPTUNE dataset is positioned to foster advancements in designing models that can handle longer temporal contexts and integrate multimodal data, thereby pushing the limits of current VideoQA models. By providing a challenging benchmark, NEPTUNE ensures that future models are evaluated in a more comprehensive manner, going beyond simple temporal or visual reasoning.
NEPTUNE's open-source nature and scalability also pave the way for its widespread adoption and adaptation within the academic community. This choice aims to cultivate further research into algorithmic enhancements that could bridge the performance disparities observed among current VideoQA systems.
Conclusion
NEPTUNE establishes a new standard for evaluating long video understanding models. By combining a large dataset with innovative evaluation metrics and a commitment to multimodal, long-form reasoning, NEPTUNE serves as a catalyst for enhancing the capability of machine learning models in video analysis tasks. As research progresses, it is anticipated that NEPTUNE will inspire novel methodologies and applications, significantly impacting the fields of computer vision and natural language processing.
By addressing the current limitations in VideoQA datasets, NEPTUNE provides a robust framework for future innovations, encouraging the academic community to explore uncharted territories in long video understanding.