PaLI: A Jointly-Scaled Multilingual Language-Image Model

Published 14 Sep 2022 in cs.CV and cs.CL | (2209.06794v4)

Abstract: Effective scaling and a flexible task interface enable LLMs to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder LLMs and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

Citations (598)

Summary

  • The paper presents a jointly-scaled model that integrates a 4B-parameter ViT-e with mT5-XXL to enhance multilingual vision and language processing.
  • It leverages a large-scale WebLI dataset and diverse multimodal objectives to achieve state-of-the-art results in tasks such as image captioning, VQA, and zero-shot classification.
  • The balanced architecture and joint scaling strategy offer promising directions for future research in universally applicable, multilingual AI systems.

Overview of PaLI: A Jointly-Scaled Multilingual Language-Image Model

The paper "PaLI: A Jointly-Scaled Multilingual Language-Image Model" introduces a model designed to integrate the capabilities of language and vision models into a unified framework. This approach emphasizes scalability in both vision and language components, leveraging large pre-trained Transformers to enhance performance across various multimodal tasks.

Model Architecture and Key Components

PaLI, the Pathways Language and Image model, is built on an encoder-decoder Transformer structure, combining a Vision Transformer (ViT) for visual processing with an mT5 model as its language component. Three main configurations are explored: PaLI-3B, PaLI-15B, and PaLI-17B, whose parameter counts reflect different allocations between vision and language capacity.

  • Vision Component: The paper introduces ViT-e, a 4B-parameter vision model that delivers further gains on vision-language tasks beyond those obtained with previous models such as ViT-G.
  • Language Component: The language backbone is mT5-XXL, a model with strong multilingual understanding and generation abilities. Initializing from it is crucial, as it allows PaLI to retain these capabilities when extended to multimodal tasks.
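The text-in/text-out interface described above can be sketched as follows: the ViT turns image patches into a sequence of visual tokens, which is concatenated with the embedded text tokens before entering the encoder of the encoder-decoder language model. All sizes and module internals here are illustrative toy stand-ins, not the actual PaLI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 16      # ViT patch size (illustrative)
D_MODEL = 64    # shared model width (toy; real models are far wider)

def patchify(image):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    patches = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)

# Toy "ViT": a single linear projection of patches into visual tokens.
W_vision = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))

# Toy text embedding table over a small stand-in vocabulary.
VOCAB = 100
E_text = rng.normal(size=(VOCAB, D_MODEL))

def encode_inputs(image, text_ids):
    """Concatenate visual tokens and text-token embeddings into one
    sequence, as fed to the encoder of the encoder-decoder LM."""
    visual_tokens = patchify(image) @ W_vision   # (n_patches, D_MODEL)
    text_tokens = E_text[text_ids]               # (n_text, D_MODEL)
    return np.concatenate([visual_tokens, text_tokens], axis=0)

image = rng.normal(size=(224, 224, 3))
prompt = np.array([5, 17, 42])  # a tokenized task prompt (toy IDs)
seq = encode_inputs(image, prompt)
print(seq.shape)  # (199, 64): 196 patches from a 224x224 image + 3 text tokens
```

The decoder then generates output text conditioned on this mixed sequence, which is what lets a single interface cover captioning, VQA, and classification.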

Training and Data

To enable effective training for multilingual scenarios, the authors employ WebLI: a large-scale dataset containing 10 billion images with texts in over 100 languages. The mixture used for training includes several multimodal objectives like text span corruption, split-captioning, OCR tasks, and visual question answering (VQA), ensuring broad task coverage and robust pre-training.
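The training mixture above can be sketched as weighted sampling over task-specific prompt templates. The task names echo the objectives described in the text, but the weights and templates here are illustrative assumptions, not the actual PaLI mixture ratios.

```python
import random

random.seed(0)

# (task name, sampling weight, prompt template) -- weights are made up.
TASK_MIX = [
    ("span_corruption",  0.30, "fill in <extra_id_0> in: {text}"),
    ("split_captioning", 0.30, "describe the image in {lang}"),
    ("ocr",              0.20, "read the text in the image in {lang}"),
    ("vqa",              0.20, "answer in {lang}: {question}"),
]

def sample_tasks(n):
    """Draw n task names according to the mixture weights."""
    names = [t[0] for t in TASK_MIX]
    weights = [t[1] for t in TASK_MIX]
    return random.choices(names, weights=weights, k=n)

batch = sample_tasks(1000)
print(batch.count("span_corruption"), batch.count("vqa"))
```

Routing every objective through natural-language prompts like these is what keeps the interface uniform: each pre-training example still reduces to text generation conditioned on image and text inputs.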

Numerical Results and Performance

PaLI's performance is evaluated over multiple tasks, achieving state-of-the-art results in both monolingual and multilingual settings. Key benchmarks include:

  • Image Captioning: On COCO Captioning, PaLI-17B sets a new state of the art with a CIDEr score of 149.1, alongside strong performance on the out-of-domain portions of NoCaps.
  • Visual Question Answering: Achieves state-of-the-art results on VQAv2 using open-vocabulary text generation, surpassing models that rely on fixed-vocabulary classification heads.
  • Zero-shot Image Classification: Demonstrates strong results on ImageNet and its out-of-distribution variants, achieved without fine-tuning specifically on these datasets.
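The open-vocabulary setup in the VQA and classification results can be illustrated schematically: rather than predicting over a fixed label set, a generative model scores each candidate class name as a text continuation and picks the highest-scoring one. The scoring function below is a deliberately crude word-overlap stand-in for the model's log-likelihood, used only to make the selection loop concrete.

```python
# Candidate class names are arbitrary strings, not indices into a fixed head.
CLASSES = ["tabby cat", "golden retriever", "sports car"]

def score(caption, class_name):
    """Toy stand-in for the LM's likelihood of `class_name` given the image
    (here approximated by word overlap with a caption of the image)."""
    caption_words = set(caption.lower().split())
    class_words = class_name.lower().split()
    return sum(w in caption_words for w in class_words) / len(class_words)

def classify(caption):
    """Pick the best-scoring class name from an open vocabulary."""
    return max(CLASSES, key=lambda c: score(caption, c))

print(classify("a small tabby cat on a sofa"))  # -> "tabby cat"
```

Because the candidate set is just a list of strings, the same mechanism extends to new labels and new languages without retraining a classification layer.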

Implications and Future Directions

The joint scaling approach suggests a strategic direction for future multimodal AI models, emphasizing the importance of equitable scaling across vision and language components. The empirical strengths of ViT-e support its utility in heavily vision-dependent multimodal tasks, while the effective use of mT5 reaffirms the capacity of LLMs when extended to multimodal domains.

Future research may build on this work by exploring even larger vision models or refining the pre-training data mixtures to further enhance task-specific adaptations. The clear improvement across tasks with high multilingual diversity reinforces the potential of models like PaLI in global language settings, a significant step towards more universally applicable AI technologies.
