
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Published 2 Aug 2023 in cs.CV, cs.AI, and cs.LG | (2308.01390v2)

Abstract: We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.

Citations (324)

Summary

  • The paper introduces an open-source framework whose models reach 80%-89% of the corresponding proprietary Flamingo models' performance on seven vision-language benchmarks.
  • The paper details a methodology that attaches dense cross-attention layers to frozen language models, enabling effective multimodal interactions.
  • The paper demonstrates that leveraging large-scale image-text datasets and instruction-tuning significantly enhances few-shot learning capabilities in vision-language tasks.

OpenFlamingo: An Open-Source Framework for Training Autoregressive Vision-Language Models

Introduction

The paper presents OpenFlamingo, a suite of autoregressive vision-language models with parameter counts ranging from 3 billion to 9 billion. This initiative aims to replicate the functionality of DeepMind's proprietary Flamingo models, providing an open-source alternative to the closed models that dominate the field. The models are evaluated on multiple benchmark datasets and show promising results relative to their proprietary counterparts.

Architecture

OpenFlamingo models are constructed by attaching dense cross-attention layers to existing frozen autoregressive LLMs. These cross-attention modules enable the LLM to attend to visual representations extracted from a vision encoder, specifically CLIP ViT-L/14, while predicting text tokens. This design allows OpenFlamingo to process interleaved sequences of images and text, facilitating tasks such as few-shot learning and multimodal interactions.
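The cross-attention mechanism described above can be sketched in miniature. The following single-head sketch is illustrative, not the repository's implementation: text hidden states attend to visual features from the vision encoder, and a tanh gate, initialized to zero in Flamingo-style blocks, lets training blend in visual information without perturbing the frozen LM at the start. All variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_h, vis_h, Wq, Wk, Wv, gate):
    """Single-head cross-attention from text tokens to visual features,
    scaled by a tanh gate. With gate = 0, the block is the identity, so the
    frozen LM's behavior is initially unchanged."""
    q = text_h @ Wq                            # (T, d) queries from text
    k = vis_h @ Wk                             # (V, d) keys from vision
    v = vis_h @ Wv                             # (V, d) values from vision
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, V) scaled dot products
    attn = softmax(scores, axis=-1) @ v        # (T, d) attended visual info
    return text_h + np.tanh(gate) * attn       # gated residual connection

rng = np.random.default_rng(0)
d = 8
text_h = rng.normal(size=(5, d))   # 5 text token states
vis_h = rng.normal(size=(3, d))    # 3 visual feature vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = gated_cross_attention(text_h, vis_h, Wq, Wk, Wv, gate=0.0)
assert np.allclose(out, text_h)    # zero gate leaves the LM stream untouched
```

The zero-initialized gate is the reason a frozen language model can be augmented without an initial drop in language quality: visual conditioning is learned in gradually.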

Training Data and Methodology

The models are trained on a mixture of the LAION-2B and Multimodal C4 datasets. LAION-2B supplies a vast repository of web-scraped image-text pairs, while Multimodal C4 provides documents with interleaved images and text. These open-source datasets stand in for the proprietary ALIGN and M3W datasets used by Flamingo, with some variants additionally trained on synthetically generated sequences. RICES (Retrieval-based In-Context Example Selection) is applied at evaluation time to select in-context demonstrations, rather than serving as a source of training data.
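The interleaved training format can be illustrated with a small sketch. Assuming special markers along the lines of the OpenFlamingo repository's `<image>` and `<|endofchunk|>` tokens, an interleaved document flattens into a single training string; this is a simplified illustration, not the exact preprocessing code.

```python
def format_interleaved(chunks):
    """Flatten an interleaved document into one training string.
    Each chunk is (has_image, text); an image becomes a placeholder token
    that cross-attention later resolves to CLIP visual features, and each
    image-text chunk is terminated by an end-of-chunk marker."""
    parts = []
    for has_image, text in chunks:
        if has_image:
            parts.append("<image>")
        parts.append(text + "<|endofchunk|>")
    return "".join(parts)

doc = [(True, "A photo of two cats."), (True, "A photo of a dog.")]
print(format_interleaved(doc))
# <image>A photo of two cats.<|endofchunk|><image>A photo of a dog.<|endofchunk|>
```

Training on such sequences is what lets the model condition on several images at once, which in turn enables few-shot prompting with interleaved image-text demonstrations.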

Numerical Results

OpenFlamingo models reach between 80% and 89% of the corresponding Flamingo models' performance across seven vision-language benchmarks. They are particularly strong in the 0- and 4-shot settings on tasks such as COCO captioning and VQAv2. On some tasks, however, notably visual question answering (VQA), a marked gap to Flamingo remains, while in other settings OpenFlamingo matches or improves on comparable open models.

Discussion

The paper identifies several key areas where OpenFlamingo models show potential for development:

  1. Data Quality and Training Dynamics: The reliance on web-scraped datasets such as Multimodal C4 underscores the ongoing need for high-quality, diverse data when training robust vision-language models.
  2. Effect of Embedding Parameters: Experiments comparing trainable and frozen input embeddings highlight how this architectural choice affects model flexibility and performance.
  3. Instruction-Tuning Transfer: Models with instruction-tuned language backbones exhibit superior performance across most tasks, underlining instruction tuning's importance in vision-language contexts.
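The trainable-vs-frozen split in point 2 amounts to partitioning the model's parameters: newly added cross-attention modules always receive gradients, the pretrained LM and vision encoder stay frozen, and the input embeddings are the switchable design choice. A minimal sketch with hypothetical parameter names:

```python
def trainable_params(named_params, tune_embeddings=False):
    """Select which parameters receive gradient updates in a Flamingo-style
    setup: new cross-attention blocks always train; the frozen LM body and
    vision encoder never do; input embeddings are a configurable choice.
    Prefixes here are illustrative, not the repo's actual names."""
    keep_prefixes = ["xattn."]
    if tune_embeddings:
        keep_prefixes.append("lm.embed.")
    return [n for n in named_params
            if any(n.startswith(p) for p in keep_prefixes)]

params = ["xattn.0.Wq", "xattn.0.gate", "lm.embed.weight",
          "lm.block.0.Wq", "clip.proj"]
print(trainable_params(params))                        # cross-attention only
print(trainable_params(params, tune_embeddings=True))  # adds 'lm.embed.weight'
```

In a PyTorch training loop, the same partition is typically realized by setting `requires_grad = False` on everything outside the selected set before constructing the optimizer.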

Implications and Future Directions

The establishment of an open-source vision-language model like OpenFlamingo opens pathways for extensive academic research and potential applications in multimodal AI. While current limitations include weaker performance on complex visual question answering, ongoing improvements in dataset quality and model architecture are expected to narrow these gaps.

The OpenFlamingo project invites further exploration and refinement, both through enhancing training datasets and fine-tuning model components. As a public resource, it provides a vital tool for the research community to study and extend autoregressive vision-language models.

Conclusion

OpenFlamingo signifies a critical step toward democratizing research on autoregressive vision-language models, facilitating greater transparency and collaboration within the research community. While challenges remain, the open-source nature of this framework empowers researchers to experiment, adapt, and extend its capabilities, paving the way for advances in artificial intelligence.
