- The paper presents a novel Chinese video dataset and a GVT model that optimizes visual-token processing to boost multimodal learning.
- It details rigorous dataset curation using automated filters and manual annotations to ensure high-quality, diverse Chinese video content.
- Evaluation across video tagging, retrieval, and captioning shows improved performance, particularly in Chinese-language contexts.
Overview of "ChinaOpen: A Dataset for Open-world Multimodal Learning" (2305.05880)
This paper presents ChinaOpen, a dataset for open-world multimodal learning sourced from Bilibili, a prominent Chinese video-sharing platform. The dataset is designed to support the training and evaluation of multimodal models on Chinese-language data, addressing their weaker performance on non-English content. Accompanying the dataset is a Generative Video-to-text Transformer (GVT) tailored to Chinese video captioning, which extends an existing image-to-text model with more efficient visual-token processing. Both the dataset and the model undergo comprehensive evaluation, yielding practical insights into multimodal learning.
Dataset Construction
Data Gathering
ChinaOpen comprises two subsets: ChinaOpen-50k and ChinaOpen-1k. Construction begins by gathering raw data from Bilibili, yielding approximately 100,000 videos spanning a wide range of topics and reflecting the diversity of content found on social media platforms. This raw collection is then filtered and refined through automated and manual processes to produce the two subsets.
Automated Data Cleaning
The cleaning process removes videos with low-quality annotations or negligible visual content. Four categories of filters are applied: empty-title, face-only, text-heavy, and content-less. Titles are screened with syntactic pattern matching, while the visual filters rely on face detection, OCR-based text-density estimation, and off-the-shelf vision models that check for recognizable visual content. The result of this cleaning is ChinaOpen-50k, a subset suitable for multimodal learning.
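The four-stage cleaning described above can be sketched as a chain of predicates, each rejecting one class of unwanted video. The thresholds and per-video fields below (`face_area_ratio`, `ocr_area_ratio`, `num_detected_concepts`) are illustrative assumptions, not the paper's exact implementation:

```python
import re

# Titles that are empty, purely numeric, or symbol-only carry no usable annotation.
MEANINGLESS_TITLE = re.compile(r"^\s*$|^\d+$|^[\W_]+$")

def is_empty_title(video):
    return MEANINGLESS_TITLE.match(video["title"]) is not None

def is_face_only(video, face_area_threshold=0.5):
    # Fraction of frame area covered by detected faces (assumed precomputed).
    return video["face_area_ratio"] > face_area_threshold

def is_text_heavy(video, ocr_threshold=0.3):
    # Fraction of frame area covered by OCR-detected text.
    return video["ocr_area_ratio"] > ocr_threshold

def is_content_less(video, min_detections=1):
    # Concepts recognized by off-the-shelf vision models.
    return video["num_detected_concepts"] < min_detections

FILTERS = [is_empty_title, is_face_only, is_text_heavy, is_content_less]

def clean(videos):
    """Keep only videos that pass every filter."""
    return [v for v in videos if not any(f(v) for f in FILTERS)]
```

A video survives only if no filter fires, so the order of the filters does not affect the final set, only how early a video is rejected.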
Manual Video Annotation
To further enrich the dataset, ChinaOpen-1k is built through manual annotation. Selected videos are reviewed in detail by experienced annotators, who verify that titles match the visual content, write captions, and label objects, actions, and scenes. This subset serves as a multi-faceted test set for model evaluation, and English translations of the annotations additionally enable cross-lingual assessment.
Proposed Model: Generative Video-to-text Transformer (GVT)
GVT extends the Generative Image-to-text Transformer (GIT) with a visual-token reduction layer, allowing more frames to be processed without a corresponding increase in computational cost. Whereas sparse frame sampling limits the visual evidence available to the decoder, the reduction layer lets GVT grow from six to sixteen input frames, capturing richer visual content and notably improving Chinese video captioning.
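The key idea is that the number of visual tokens reaching the text decoder stays fixed regardless of how many frames are sampled. A minimal sketch of this behavior, using simple average pooling as a stand-in for the paper's learned reduction layer:

```python
import numpy as np

def reduce_visual_tokens(frame_tokens, token_budget=96):
    """Pool per-frame visual tokens down to a fixed total budget.

    frame_tokens: array of shape (num_frames, tokens_per_frame, dim).
    Returns (token_budget, dim): the decoder sees the same number of
    visual tokens whether 6 or 16 frames were sampled.
    """
    num_frames, tokens_per_frame, dim = frame_tokens.shape
    # Concatenate all frames' tokens into one long sequence.
    flat = frame_tokens.reshape(num_frames * tokens_per_frame, dim)
    # Split the sequence into token_budget groups and average each group.
    groups = np.array_split(np.arange(flat.shape[0]), token_budget)
    return np.stack([flat[idx].mean(axis=0) for idx in groups])
```

Because the output length is constant, the quadratic attention cost in the decoder no longer grows with the number of input frames, which is what makes moving from six to sixteen frames affordable.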
Evaluation and Results
The evaluation covers fifteen state-of-the-art models on three tasks: video tagging, text-to-video retrieval, and video captioning. Key findings:
- Open-set Video Tagging: CLIP-L/14@336px and CN-CLIP perform best, underscoring the advantage of large-scale multimodal models on tagging tasks whose labels fall outside a fixed category vocabulary.
- Text-to-video Retrieval: Multimodal models, especially those trained explicitly with video-text datasets like VaTeX, exhibit enhanced retrieval accuracy, showcasing robust cross-modal understanding.
- Video Captioning: BLIP-2 achieves the top performance among English models, while GVT leads in Chinese captioning, demonstrating the value of the ChinaOpen data for understanding Chinese content.
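Text-to-video retrieval results of the kind reported above are typically summarized with Recall@k over cosine similarities between text and video embeddings. A minimal sketch of that metric (the embedding inputs are placeholders, not the paper's models):

```python
import numpy as np

def recall_at_k(text_emb, video_emb, k=1):
    """Fraction of text queries whose ground-truth video (same index)
    ranks within the top-k by cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = t @ v.T                          # (num_texts, num_videos)
    topk = np.argsort(-sims, axis=1)[:, :k] # indices of top-k videos per query
    hits = [i in topk[i] for i in range(len(t))]
    return sum(hits) / len(hits)
```

Reporting Recall@1, @5, and @10 together gives a fuller picture than any single cutoff, since models can differ in how sharply they separate the correct video from near-misses.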
Moreover, the difficulty of retrieving videos by their user-generated titles points to the need for better model adaptation and cross-language robustness in real-world scenarios.
Conclusion
ChinaOpen and the accompanying GVT model advance multimodal learning, particularly in non-English settings. The dataset provides a credible resource for training and evaluating models, encouraging research across diverse linguistic contexts. Future work may focus on aligning model predictions with actual user expectations to close the gap with real-world content generation. The paper's contributions should spur further investigation into multilingual and multimodal AI systems.