Koala: Key frame-conditioned long video-LLM

Published 5 Apr 2024 in cs.CV (arXiv:2404.04346v3)

Abstract: Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video LLMs (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.

Summary

  • The paper’s main contribution is the introduction of learnable spatiotemporal queries and novel conditioned tokenizers that enable vLLMs to effectively process long videos.
  • It adapts a frozen vLLM architecture using the Conditioned Segment and Video tokenizers to merge local frame details with global video context.
  • Empirical results show 3-6% accuracy improvements on long video QA benchmarks, with added benefits for short-term action recognition.

Koala: Key Frame-Conditioned Long Video-LLM

The paper introduces "Koala," a novel approach designed to enhance the capabilities of video-LLMs (vLLMs) for understanding long-duration videos. The motivation stems from the limitations of current vLLMs, which are typically trained on short video clips and therefore struggle to process the extensive spatiotemporal information present in longer videos. To address this, Koala introduces learnable spatiotemporal queries that extend the pretrained visual tokenizers of vLLMs, adapting them to tasks involving long videos.
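Koala conditions its tokenizers on visual tokens computed from sparsely sampled key frames. The paper does not prescribe a specific sampler here; the sketch below assumes simple uniform striding purely for illustration.

```python
# Hypothetical sketch: sparse key-frame selection for a long video.
# Uniform striding is an assumption for illustration, not Koala's
# documented sampling strategy.

def sample_key_frames(num_frames: int, num_keys: int) -> list[int]:
    """Pick `num_keys` frame indices spread evenly across the video."""
    if num_keys >= num_frames:
        return list(range(num_frames))
    stride = num_frames / num_keys
    # Take the centre of each of the `num_keys` equal-width windows.
    return [int(stride * i + stride / 2) for i in range(num_keys)]

# A 3-minute video at 30 fps has 5400 frames; keep 8 key frames.
key_indices = sample_key_frames(5400, 8)
```

The point of the sparsity is that a handful of frames can summarize the global narrative cheaply, while the segment-level tokenizers handle fine-grained detail.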

The Koala model is built on a frozen vLLM architecture with the key innovation lying in its ability to effectively integrate information from key frames and video segments. Specifically, the model introduces two novel tokenizers: the Conditioned Segment (CS) tokenizer and the Conditioned Video (CV) tokenizer. The CS tokenizer targets the aggregation of frame-level visual concepts pertinent to both local segment contexts and overarching video narratives, while the CV tokenizer focuses on reasoning about the relationships between segments conditioned on the entire video's global context.
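The two-stage pattern described above (segment-level fusion with global key-frame context, then cross-segment reasoning) can be sketched in minimal form. This is not the paper's implementation: Koala uses learned cross-attention, for which plain averaging stands in here so the control flow runs without a deep-learning framework.

```python
# Illustrative sketch of the CS -> CV aggregation hierarchy.
# `cs_tokenize` and `cv_tokenize` are hypothetical stand-ins for the
# paper's Conditioned Segment and Conditioned Video tokenizers.

def average(vectors):
    """Elementwise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cs_tokenize(segment_frames, key_frame_ctx):
    """Fuse one segment's frame features with the global key-frame
    context (stand-in for the Conditioned Segment tokenizer)."""
    local = average(segment_frames)
    return [(l + g) / 2 for l, g in zip(local, key_frame_ctx)]

def cv_tokenize(segment_tokens, key_frame_ctx):
    """Reason across segment tokens, again conditioned on the global
    context (stand-in for the Conditioned Video tokenizer)."""
    across = average(segment_tokens)
    return [(a + g) / 2 for a, g in zip(across, key_frame_ctx)]

# Toy example: 2 segments of 2 frames each, 3-dim features.
segments = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
key_ctx = average([f for seg in segments for f in seg])  # global context
seg_tokens = [cs_tokenize(seg, key_ctx) for seg in segments]
video_tokens = cv_tokenize(seg_tokens, key_ctx)          # fed to the LLM
```

The design choice to condition both stages on the same key-frame context is what lets local segment tokens stay consistent with the overall video narrative.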

The Koala approach demonstrates substantial improvements on long video question answering and understanding benchmarks such as EgoSchema and Seed-Bench, surpassing pretrained vLLMs by 3-6% in absolute accuracy. Additionally, empirical results show that Koala also improves short-term action recognition, underlining the robustness and versatility of the proposed tokenization strategies.

From a technical standpoint, Koala adapts the pretrained video QFormer by introducing learnable queries that attend to sparsely sampled key frames, emphasizing regions pivotal to understanding the video narrative. This lets the model retain fine-grained contextual information that is often lost by methods relying only on coarsely sampled frames. Moreover, Koala's hierarchical design aggregates temporal context before the information is ingested by the LLM, balancing computational efficiency against enriched spatiotemporal context.
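The learnable-query mechanism can be illustrated with a bare-bones cross-attention: a small, fixed set of query vectors attends over an arbitrary number of visual tokens and compresses them to a fixed-size output for the LLM. Dimensions and values below are toy assumptions, not the paper's configuration.

```python
# Hedged sketch of learnable queries cross-attending to key-frame tokens.
# In Koala these queries are trained parameters inside the QFormer; here
# they are fixed toy vectors so the mechanics are visible.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, tokens):
    """Each query attends over all visual tokens and returns a weighted
    mix; the output size depends only on the number of queries."""
    out = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[i] for w, t in zip(weights, tokens))
                    for i in range(len(q))])
    return out

queries = [[1.0, 0.0], [0.0, 1.0]]             # 2 "learnable" queries (toy)
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # visual tokens from key frames
compressed = cross_attend(queries, tokens)      # always 2 output vectors
```

Because the output length is fixed by the query count rather than the video length, the LLM's input budget stays constant no matter how long the video is, which is the crux of scaling to minutes-long inputs.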

Practically, Koala has significant implications for applications requiring deep video understanding, such as video recommendation systems, robotics, and embodied AI. Theoretical implications include advancing the methodology for extending short-term memory models to long-term sequence processing, a pivotal challenge in contemporary AI research.

Looking ahead, Koala's premise can pave the way for future research into scalable, efficient video understanding models that can seamlessly process extensive video content. Potential research avenues include refining context aggregation techniques, extending tokenization strategies to diverse video domains, and improving model robustness to noisy and uncurated training data.

In summary, Koala exemplifies an efficient and effective method for enhancing the capabilities of vLLMs in long video comprehension. Its ability to surpass existing pretrained models on relevant benchmarks underscores its potential impact and further applications in the field of AI-driven video understanding.
