- The paper introduces VideoCon, a novel dataset that generates contrast captions to address semantic misalignments in video-language models.
- It employs a two-stage methodology: LLM-driven dataset creation followed by finetuning of mPLUG-Owl-Video, yielding a 12-point ROC-AUC improvement over baseline models.
- The finetuned model sets new state-of-the-art benchmarks in zero-shot retrieval and video QA, demonstrating enhanced robustness in dynamic video contexts.
VideoCon: Enhancing Robustness in Video-Language Alignment
The paper "VideoCon: Robust Video-Language Alignment via Contrast Captions" addresses the challenge of improving the robustness of video-language alignment models, which frequently fail to detect subtle semantic changes in captions. The authors introduce VideoCon, a novel dataset designed to make alignment models resilient to contrastive misalignments, including entity and action replacements as well as changes in event temporal order.
Key Contributions and Methodology
The research identifies fundamental weaknesses in existing models, notably their lack of robustness despite extensive pretraining. To rectify this, the authors propose the VideoCon dataset, which leverages LLMs to generate plausible contrast captions and accompanying explanations for video-caption pairs. This dataset highlights various misalignment types beyond objects and actions, extending to attributes, counts, spatial relations, hallucinations, and event order flips.
The authors adopt a two-stage methodology for dataset creation and model finetuning:
- VideoCon Dataset Generation: The dataset is built by first filtering for temporally-challenging instances using an existing video-language entailment (VNLI) model. An LLM then generates contrast captions and accompanying explanations, categorized into seven types of semantic misalignment. Human evaluation confirmed the dataset's quality and accuracy, with high validity rates (91% for contrast captions and 89% for explanations).
- Model Finetuning: The baseline model, mPLUG-Owl-Video, is finetuned on VideoCon for two tasks: video-language entailment and natural language explanation generation. This finetuning raises ROC-AUC by 12 points over the baseline models, indicating a markedly better grasp of fine-grained video-language relationships.
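The evaluation protocol above can be sketched in a few lines. This is an illustrative example, not code from the paper: the caption pair shows one of VideoCon's seven misalignment types (entity replacement), and the entailment scores are made-up placeholders standing in for P(entailment | video, caption) from a model such as Owl-Con.

```python
# Illustrative sketch (not the authors' code): scoring a video-language
# entailment model with ROC-AUC on aligned vs. contrast captions.

# One of VideoCon's seven misalignment types, entity replacement
# (hypothetical example captions):
original_caption = "A dog jumps over a fence"
contrast_caption = "A cat jumps over a fence"  # entity swapped: dog -> cat

# 1 = real caption (aligned with the video), 0 = contrast caption
labels = [1, 1, 1, 0, 0, 0]
# Placeholder entailment scores; higher = "more supported by the video"
scores = [0.92, 0.85, 0.60, 0.70, 0.30, 0.15]

def roc_auc(labels, scores):
    """ROC-AUC = probability that a randomly chosen aligned pair scores
    higher than a randomly chosen misaligned pair (ties count half)."""
    pos = [s for lbl, s in zip(labels, scores) if lbl == 1]
    neg = [s for lbl, s in zip(labels, scores) if lbl == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f"ROC-AUC: {roc_auc(labels, scores):.3f}")  # 8 of 9 pairs ordered correctly
```

The pairwise formulation makes clear why ROC-AUC suits this benchmark: it measures only whether the model ranks aligned captions above contrast captions, independent of any score threshold.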
Evaluation and Results
The finetuned model, referred to as Owl-Con, is rigorously evaluated on both the new VideoCon dataset and existing temporally-demanding video-language benchmarks such as SSv2-Temporal and ATP-Hard. Across these evaluations, Owl-Con consistently outperforms prior models, even those trained directly on the downstream video-language tasks, establishing new state-of-the-art results in zero-shot text-to-video retrieval and video question answering. For instance, Owl-Con achieved a 4.3-point mAP improvement on SSv2-Temporal and a 4% accuracy gain on ATP-Hard.
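The zero-shot retrieval setup can be sketched as follows: each caption query scores every candidate video with the alignment model, videos are ranked by that score, and the ranking is summarized with mean average precision (mAP). The score matrix below is made up for illustration; in practice each entry would come from the entailment model.

```python
# Illustrative sketch (not the authors' code): zero-shot text-to-video
# retrieval by ranking candidate videos with alignment scores.

def mean_average_precision(score_matrix):
    """mAP when each query caption has exactly one relevant video:
    the mean over queries of 1 / rank of the correct video."""
    total = 0.0
    for q, row in enumerate(score_matrix):
        # candidate videos sorted best-first by alignment score
        order = sorted(range(len(row)), key=lambda v: -row[v])
        rank = order.index(q) + 1  # ground-truth video for query q is video q
        total += 1.0 / rank
    return total / len(score_matrix)

# score_matrix[q][v]: hypothetical alignment score of caption q with video v
score_matrix = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],  # query 1's true video is only ranked second here
    [0.1, 0.2, 0.7],
]
print(f"mAP: {mean_average_precision(score_matrix):.3f}")
```

Because retrieval quality depends entirely on the relative ordering of scores, a model that better separates aligned from misaligned captions (higher ROC-AUC) translates directly into better retrieval rankings.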
Implications and Future Work
The research's implications are twofold. Practically, the VideoCon dataset and the resulting robust model enhance real-world applicability in video understanding tasks, reducing error rates in dynamic environments where caption consistency is critical. Theoretically, this work emphasizes explicit categorization of misalignment types, advocating a comprehensive strategy for video-LLM training that may inspire future architectures to focus on capturing fine-grained semantic details.
Future developments could explore further extensions of this methodology, incorporating more diverse video datasets and expanding the scope of video-text misalignments. Such exploration could refine models that are even more adept at generalizing across various contexts and understanding increasingly complex video-caption relationships.
In conclusion, this study marks a clear advance in using structured contrast data for training, setting a new benchmark for future video-language alignment research.