CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
Abstract: The recent work CLIPA presents an inverse scaling law for CLIP training -- the larger the image/text encoders used, the shorter the sequence length of image/text tokens needed during training. This finding enables training high-performance CLIP models with significantly reduced computation. Building upon this work, we present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling a further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting -- with a budget of only $10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% while reducing the computational cost by ~39x. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
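The compute savings rest on a simple mechanism: shortening the token sequence shrinks the encoders' cost, with the self-attention term dropping quadratically in sequence length. Below is a minimal PyTorch sketch of masking-style image-token reduction in this spirit; `reduce_image_tokens`, its signature, and the keep ratio are illustrative assumptions for exposition, not the released CLIPA-v2 implementation.

```python
import torch

def reduce_image_tokens(patch_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens per image (illustrative sketch).

    patch_tokens: (batch, num_tokens, dim) embeddings from the patchify stem.
    Returns a shortened sequence of shape (batch, num_keep, dim).
    """
    batch, num_tokens, dim = patch_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Draw random scores and take the top-k indices -> a uniform random subset.
    scores = torch.rand(batch, num_tokens, device=patch_tokens.device)
    keep_idx = scores.topk(num_keep, dim=1).indices        # (batch, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)  # (batch, num_keep, dim)
    return patch_tokens.gather(1, keep_idx)

# Example: a ViT-H/14 at 224px resolution yields 256 patch tokens; keeping 25%
# leaves 64 tokens, cutting the quadratic attention cost by roughly 16x.
tokens = torch.randn(8, 256, 1280)
short = reduce_image_tokens(tokens, keep_ratio=0.25)
assert short.shape == (8, 64, 1280)
```

The inverse scaling law says larger encoders tolerate smaller values of `keep_ratio` with little accuracy loss, which is why the savings compound at scale.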
References
- Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
- Can foundation models perform zero-shot task specification for robot manipulation? arXiv preprint arXiv:2204.11134, 2022.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- Masked autoencoders are scalable vision learners. In CVPR, 2022.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
- Natural adversarial examples. In CVPR, 2021.
- OpenCLIP, July 2021.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- An inverse scaling law for CLIP training. arXiv preprint arXiv:2305.07017, 2023.
- Scaling language-image pre-training via masking. In CVPR, 2023.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- OpenAI. GPT-4 technical report. 2023.
- Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Zero-shot text-to-image generation. In ICML, 2021.
- Do ImageNet classifiers generalize to ImageNet? In ICML, 2019.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
- LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
- Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
- CiT: Curation in training for effective vision-language data. arXiv preprint arXiv:2301.02241, 2023.
- CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Multimodal C4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.