- The paper presents a novel Chinese video dataset and a GVT model that optimizes visual-token processing to boost multimodal learning.
- It details rigorous dataset curation using automated filters and manual annotations to ensure high-quality, diverse Chinese video content.
- Evaluation across video tagging, retrieval, and captioning shows improved performance, particularly in Chinese-language contexts.
Overview of "ChinaOpen: A Dataset for Open-world Multimodal Learning" (2305.05880)
This paper presents ChinaOpen, a dataset for open-world multimodal learning sourced from Bilibili, a prominent Chinese video-sharing platform. The dataset is designed to support the training and evaluation of multimodal models on Chinese-language data, addressing their weaker performance on non-English content. Accompanying the dataset is a Generative Video-to-text Transformer (GVT) tailored to Chinese video captioning, which extends an existing image-to-text model with more efficient visual-token processing. Both the dataset and the model undergo comprehensive evaluation, yielding practical insights into multimodal learning.
Dataset Construction
Data Gathering
ChinaOpen comprises two subsets: ChinaOpen-50k and ChinaOpen-1k. Construction begins by gathering raw data from Bilibili, yielding approximately 100,000 videos spanning a wide range of topics and reflecting the diversity of content found on social media platforms. This raw collection is then filtered and refined through automated and manual processes to produce the two subsets.
Automated Data Cleaning
The cleaning process removes videos with low-quality annotations or negligible visual content. Four categories of filters are applied: empty-title, face-only, text-heavy, and content-less. Titles are screened with syntactic pattern matching, while the visual filters rely on face detection, OCR-based text-density estimation, and off-the-shelf vision models that check for recognizable visual content. The result of this cleaning is ChinaOpen-50k, a subset suitable for multimodal learning.
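The four-stage cleaning described above can be sketched as a chain of predicates, each rejecting one class of unwanted video. The thresholds and per-video fields below (`face_area_ratio`, `ocr_area_ratio`, `num_detected_concepts`) are illustrative assumptions, not the paper's exact implementation:

```python
import re

# Titles that are empty, purely numeric, or symbol-only carry no usable annotation.
MEANINGLESS_TITLE = re.compile(r"^\s*$|^\d+$|^[\W_]+$")

def is_empty_title(video):
    return MEANINGLESS_TITLE.match(video["title"]) is not None

def is_face_only(video, face_area_threshold=0.5):
    # Fraction of frame area covered by detected faces (assumed precomputed).
    return video["face_area_ratio"] > face_area_threshold

def is_text_heavy(video, ocr_threshold=0.3):
    # Fraction of frame area covered by OCR-detected text.
    return video["ocr_area_ratio"] > ocr_threshold

def is_content_less(video, min_detections=1):
    # Concepts recognized by off-the-shelf vision models.
    return video["num_detected_concepts"] < min_detections

FILTERS = [is_empty_title, is_face_only, is_text_heavy, is_content_less]

def clean(videos):
    """Keep only videos that pass every filter."""
    return [v for v in videos if not any(f(v) for f in FILTERS)]
```

A video survives only if no filter fires, so the order of the filters does not affect the final set, only how early a video is rejected.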
Manual Video Annotation
To further enrich the dataset, ChinaOpen-1k is built through manual annotation. Selected videos are reviewed in detail by experienced annotators, who verify that titles match the visual content, write captions, and label objects, actions, and scenes. This subset serves as a multi-faceted test set for model evaluation, and English translations of the annotations additionally enable cross-lingual assessment.
Proposed Model: Generative Video-to-text Transformer (GVT)
GVT extends the Generative Image-to-text Transformer (GIT) with a visual-token reduction layer, allowing more frames to be processed without a corresponding increase in computational cost. Whereas sparse frame sampling limits the visual evidence available to the decoder, the reduction layer lets GVT grow from six to sixteen input frames, capturing richer visual content and notably improving Chinese video captioning.
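The key idea is that the number of visual tokens reaching the text decoder stays fixed regardless of how many frames are sampled. A minimal sketch of this behavior, using simple average pooling as a stand-in for the paper's learned reduction layer:

```python
import numpy as np

def reduce_visual_tokens(frame_tokens, token_budget=96):
    """Pool per-frame visual tokens down to a fixed total budget.

    frame_tokens: array of shape (num_frames, tokens_per_frame, dim).
    Returns (token_budget, dim): the decoder sees the same number of
    visual tokens whether 6 or 16 frames were sampled.
    """
    num_frames, tokens_per_frame, dim = frame_tokens.shape
    # Concatenate all frames' tokens into one long sequence.
    flat = frame_tokens.reshape(num_frames * tokens_per_frame, dim)
    # Split the sequence into token_budget groups and average each group.
    groups = np.array_split(np.arange(flat.shape[0]), token_budget)
    return np.stack([flat[idx].mean(axis=0) for idx in groups])
```

Because the output length is constant, the quadratic attention cost in the decoder no longer grows with the number of input frames, which is what makes moving from six to sixteen frames affordable.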
Evaluation and Results
The evaluation covers fifteen state-of-the-art models on three tasks: video tagging, text-to-video retrieval, and video captioning. Key findings:
- Open-set Video Tagging: CLIP-L/14@336px and CN-CLIP perform best, underscoring the advantage of large-scale multimodal models on tagging tasks whose labels fall outside a fixed category vocabulary.
- Text-to-video Retrieval: Multimodal models, especially those trained explicitly with video-text datasets like VaTeX, exhibit enhanced retrieval accuracy, showcasing robust cross-modal understanding.
- Video Captioning: BLIP-2 achieves the top performance among English models, while GVT leads in Chinese captioning, demonstrating the value of the ChinaOpen data for understanding Chinese content.
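Text-to-video retrieval results of the kind reported above are typically summarized with Recall@k over cosine similarities between text and video embeddings. A minimal sketch of that metric (the embedding inputs are placeholders, not the paper's models):

```python
import numpy as np

def recall_at_k(text_emb, video_emb, k=1):
    """Fraction of text queries whose ground-truth video (same index)
    ranks within the top-k by cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = t @ v.T                          # (num_texts, num_videos)
    topk = np.argsort(-sims, axis=1)[:, :k] # indices of top-k videos per query
    hits = [i in topk[i] for i in range(len(t))]
    return sum(hits) / len(hits)
```

Reporting Recall@1, @5, and @10 together gives a fuller picture than any single cutoff, since models can differ in how sharply they separate the correct video from near-misses.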
Moreover, the difficulty of retrieving videos by their user-generated titles points to the need for better model adaptation and cross-language robustness in real-world scenarios.
Conclusion
ChinaOpen and the accompanying GVT model advance multimodal learning, particularly in non-English settings. The dataset provides a credible resource for training and evaluating models, encouraging research across diverse linguistic contexts. Future work may focus on aligning model predictions with actual user expectations to close the gap with real-world content generation. The paper's contributions should spur further investigation into multilingual and multimodal AI systems.