
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Published 21 Jun 2023 in cs.IR and cs.CV | (2306.16527v2)

Abstract: Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.

Citations (182)

Summary

  • The paper's main contribution is the introduction of OBELICS, a comprehensive open-access dataset of interleaved image-text documents; models trained on such documents outperform those trained on image-text pairs alone.
  • It employs rigorous filtering, deduplication, and HTML simplification techniques to preserve full document context and ensure high-quality data.
  • Empirical validation shows that models trained on OBELICS, including IDEFICS, achieve competitive performance on vision-language tasks with efficient image usage.

An Overview of OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

This paper presents the OBELICS dataset, a significant contribution to open-access resources in artificial intelligence research. OBELICS consists of an extensive web-scale collection of interleaved multimodal documents extracted from the Common Crawl, comprising 141 million web pages, 353 million images, and 115 billion text tokens. The dataset is designed to address the limitations of existing datasets by providing a higher quality and publicly available resource for training large-scale multimodal models.

Key Features and Methodology

The distinguishing factor of OBELICS lies in its composition, containing full multimodal documents rather than isolated image-text pairs. This approach aligns with findings that models trained on natural interleaved documents consistently outperform those relying only on image-text pairs across various multimodal benchmarks. The paper describes a comprehensive methodology for creating OBELICS, focusing on retaining document context and applying rigorous filtering rules.

The authors detail the systematic extraction and processing techniques used, such as HTML simplification and explicit deduplication steps at the document, image, and paragraph levels, which guard the dataset against redundancy and low-quality content. Furthermore, the dataset is filtered to address concerns over data consent and explicit content, with measures such as the exclusion of opted-out images and NSFW filtering.
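The three-level deduplication can be sketched as a single pass over the corpus that fingerprints whole documents, then individual images and paragraphs. This is a minimal exact-match illustration with an assumed document schema (`paragraphs`, `images`); the paper's actual pipeline uses its own representations and more elaborate rules.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # Exact-match fingerprint; real pipelines often add near-duplicate detection.
    return hashlib.sha256(data).hexdigest()

def deduplicate(documents):
    """Three-level exact deduplication sketch: whole documents, then images
    and paragraphs that repeat across the corpus (e.g. logos, boilerplate).

    Each document is a dict with 'paragraphs' (list[str]) and 'images'
    (list[bytes]) -- an illustrative schema, not the paper's actual format.
    """
    seen_docs, seen_images, seen_pars = set(), set(), set()
    cleaned = []
    for doc in documents:
        doc_key = fingerprint("\n".join(doc["paragraphs"]).encode())
        if doc_key in seen_docs:
            continue  # document-level duplicate: drop entirely
        seen_docs.add(doc_key)

        images, pars = [], []
        for img in doc["images"]:
            key = fingerprint(img)
            if key not in seen_images:  # image-level duplicate check
                seen_images.add(key)
                images.append(img)
        for par in doc["paragraphs"]:
            key = fingerprint(par.encode())
            if key not in seen_pars:  # paragraph-level duplicate check
                seen_pars.add(key)
                pars.append(par)
        cleaned.append({"paragraphs": pars, "images": images})
    return cleaned
```

Keeping one shared `seen_*` set per level means a repeated image or boilerplate paragraph survives only in the first document where it appears.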

Detailed Dataset Analysis

OBELICS is thoroughly analyzed, with the authors presenting detailed statistics on its scale and uniqueness. For instance, 84.3% of its images are unique, illustrating more effective deduplication than in similar datasets such as mmc4. Additionally, topic modeling with Latent Dirichlet Allocation (LDA) reveals diverse content spanning politics to entertainment, extending the dataset's usability for training robust and versatile AI models.
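One plausible reading of the 84.3% figure is the share of distinct images among all image occurrences, computed over image URLs or content hashes. A toy version (an assumed definition of the metric, not the paper's exact formula):

```python
def unique_image_rate(image_keys):
    """Distinct image keys (URLs or content hashes) as a share of all
    image occurrences in the corpus. One plausible reading of the
    'unique image' rate; the paper's precise definition may differ."""
    return len(set(image_keys)) / len(image_keys)
```

For example, a corpus whose images hash to `["a", "a", "b", "c"]` has a unique-image rate of 0.75, since `"a"` occurs twice.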

Moreover, perplexity analysis shows that OBELICS has lower average perplexity than other open datasets, suggesting text quality closer to that of curated corpora such as The Pile. This quality is critical for training high-performance LLMs.
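Perplexity of this kind is computed with a language model trained on a reference corpus: lower perplexity means the evaluated text statistically resembles that corpus. A minimal add-α smoothed unigram sketch (the actual setup, typically an n-gram model trained on curated data, may differ):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, eval_tokens, alpha=1.0):
    """Perplexity of eval_tokens under an add-alpha smoothed unigram model
    fit on train_tokens. Lower perplexity = the evaluated text looks more
    like the reference corpus. A toy stand-in for the language models used
    in dataset quality analysis."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(eval_tokens)  # smooth over both vocabularies
    denom = sum(counts.values()) + alpha * len(vocab)
    log_prob = sum(math.log((counts[t] + alpha) / denom) for t in eval_tokens)
    return math.exp(-log_prob / len(eval_tokens))
```

In-domain text (tokens frequent in the reference corpus) scores a lower perplexity than out-of-domain text, which is the basis for the dataset-quality comparison.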

Empirical Validation of Viability

The research includes empirical comparisons between models trained on different data compositions—OBELICS alone, image-text pairs alone, and combinations thereof. Notably, OBELICS demonstrates its utility by providing competitive model performance with fewer training images, underlining the efficiency gained through richer multimodal contexts. For instance, in visual question answering tasks, models pretrained on OBELICS outperform counterparts trained on image-text pairs, which the authors attribute to the richer context provided by interleaved documents.

Furthermore, the paper introduces IDEFICS, large-scale models whose performance is on par with models trained on closed datasets, such as Flamingo. IDEFICS achieves strong results across various benchmarks, showcasing OBELICS as a formidable open alternative for training large-scale vision-language models.

Implications and Future Research

OBELICS holds potential for wide-ranging implications in developing open, reproducible, and transparent AI research. By providing an accessible dataset with detailed creation and filtering documentation, it facilitates the replication and extension of cutting-edge multimodal models without the constraints of proprietary datasets.

Future developments suggested include expanding the dataset with more diverse sources or enhancing the quality through community-driven filters. This trajectory may continue bridging the gap between open-access resources and proprietary datasets in AI, nurturing a more inclusive research landscape.

In conclusion, OBELICS stands as a vital resource for the AI community, aiming to augment the scalability, accessibility, and transparency of multimodal model training, pushing forward both theoretical and practical advancements in AI research.
