CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
Abstract: The recent work CLIPA presents an inverse scaling law for CLIP training -- the larger the image/text encoders used, the shorter the sequence length of image/text tokens needed during training. This finding enables training high-performance CLIP models with significantly reduced computation. Building upon this work, we present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling a further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting -- with a budget of only $10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% while reducing the computational cost by ~39x. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
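The compute savings rest on a simple mechanism: shortening the token sequence shrinks the encoders' cost, with the self-attention term dropping quadratically in sequence length. Below is a minimal PyTorch sketch of masking-style image-token reduction in this spirit; `reduce_image_tokens`, its signature, and the keep ratio are illustrative assumptions for exposition, not the released CLIPA-v2 implementation.

```python
import torch

def reduce_image_tokens(patch_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens per image (illustrative sketch).

    patch_tokens: (batch, num_tokens, dim) embeddings from the patchify stem.
    Returns a shortened sequence of shape (batch, num_keep, dim).
    """
    batch, num_tokens, dim = patch_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Draw random scores and take the top-k indices -> a uniform random subset.
    scores = torch.rand(batch, num_tokens, device=patch_tokens.device)
    keep_idx = scores.topk(num_keep, dim=1).indices        # (batch, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)  # (batch, num_keep, dim)
    return patch_tokens.gather(1, keep_idx)

# Example: a ViT-H/14 at 224px resolution yields 256 patch tokens; keeping 25%
# leaves 64 tokens, cutting the quadratic attention cost by roughly 16x.
tokens = torch.randn(8, 256, 1280)
short = reduce_image_tokens(tokens, keep_ratio=0.25)
assert short.shape == (8, 64, 1280)
```

The inverse scaling law says larger encoders tolerate smaller values of `keep_ratio` with little accuracy loss, which is why the savings compound at scale.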
References
- Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
- Can foundation models perform zero-shot task specification for robot manipulation? arXiv preprint arXiv:2204.11134, 2022.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- Masked autoencoders are scalable vision learners. In CVPR, 2022.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
- Natural adversarial examples. In CVPR, 2021.
- OpenCLIP, July 2021.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- An inverse scaling law for CLIP training. arXiv preprint arXiv:2305.07017, 2023.
- Scaling language-image pre-training via masking. In CVPR, 2023.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- OpenAI. GPT-4 technical report. 2023.
- Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Zero-shot text-to-image generation. In ICML, 2021.
- Do ImageNet classifiers generalize to ImageNet? In ICML, 2019.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
- LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
- Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
- CiT: Curation in training for effective vision-language data. arXiv preprint arXiv:2301.02241, 2023.
- CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Multimodal C4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.