Image Captioners Are Scalable Vision Learners Too
Abstract: Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as of the pretraining data, on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall, our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
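To make the captioning pretraining objective concrete: a vision encoder produces image tokens, a transformer text decoder cross-attends to them, and training minimizes next-token cross-entropy over the caption. The sketch below is a minimal PyTorch illustration, not the paper's implementation; the layer counts, dimensions, toy patch projection, and random inputs are assumptions for brevity (the paper uses ViT encoders and matched-capacity decoders).

```python
# Minimal sketch of the captioning pretraining objective: a vision encoder
# yields image tokens, a text decoder cross-attends to them, and the loss is
# next-token cross-entropy over the caption. NOT the authors' implementation;
# sizes and the single-layer encoder/decoder are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptioningPretrainer(nn.Module):
    def __init__(self, vocab_size=32_000, dim=768, patch_dim=16 * 16 * 3):
        super().__init__()
        # Stand-in vision encoder (the paper uses ViT backbones; one layer here for brevity).
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=1
        )
        # Text decoder with causal self-attention and cross-attention to image tokens.
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=12, batch_first=True), num_layers=1
        )
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patches, caption_ids):
        # patches: (B, num_patches, patch_dim); caption_ids: (B, T) token ids.
        image_tokens = self.encoder(self.patch_embed(patches))
        # Teacher forcing: predict token t from tokens < t and the image.
        inputs, targets = caption_ids[:, :-1], caption_ids[:, 1:]
        t = inputs.size(1)
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(self.token_embed(inputs), image_tokens, tgt_mask=causal_mask)
        logits = self.lm_head(hidden)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


# Toy forward pass with random data, just to show the shapes involved.
model = CaptioningPretrainer()
loss = model(torch.randn(2, 196, 16 * 16 * 3), torch.randint(0, 32_000, (2, 12)))
```

After pretraining with this objective, the decoder is discarded and the vision encoder is evaluated on downstream classification and vision & language tasks, which is how the comparison against contrastive pretraining is made.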