Sigmoid Loss for Language Image Pre-Training
Abstract: We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
Explain it Like I'm 14
A Simple Explanation of “Sigmoid Loss for Language-Image Pre-Training”
Overview
This paper is about teaching computers to understand how pictures and sentences relate to each other. The authors introduce a new, simpler way to train these “language-image” models, called the sigmoid loss. Their method makes training faster, uses less memory, works well even with smaller groups of data, and still performs strongly with very large groups. They also share models and tips so more people can do this kind of training without needing huge amounts of hardware.
What questions did the researchers ask?
The paper looks at questions like:
- Can we replace the usual training rule (called softmax) with a simpler one (sigmoid) and get equal or better results?
- How big should the “batch size” be during training for best performance? Is bigger always better?
- Can this simpler method help when we have limited computing power?
- How does the method work across many languages and with noisy, imperfect data?
- What tricks help keep training stable and avoid sudden spikes in the loss?
How did they do it? (Methods explained simply)
Think of training as a matching game:
- You have a bunch of images and their matching captions (like “a dog running”). The goal is to make the computer give high scores to true matches and low scores to mismatches.
Two common ways to score matches:
- Softmax loss (the usual way): It’s like picking the single “best” caption for each image from the entire batch at once. It needs to see all pairwise similarities in the batch and normalize them twice (once over captions for each image, and once over images for each caption). This is memory-heavy and tricky to implement efficiently across many machines.
- Sigmoid loss (their new way): It’s like rating each image–caption pair as “match” or “not a match” independently, without comparing all pairs at once. This makes the math simpler, uses less memory, and is easier to run on multiple machines.
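In code, the sigmoid loss is only a few lines. Here is a minimal NumPy sketch following the pseudocode in the paper; the variable names and toy shapes are ours:

```python
import numpy as np

def sigmoid_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss: every image-text pair is scored independently."""
    # L2-normalize so the dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T * t + b        # (n, n) scores, scaled by t and shifted by b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0      # +1 on the diagonal (true pairs), -1 elsewhere
    # -log sigmoid(label * logit), written stably, summed over all n*n pairs.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

Note there is no row- or column-wise softmax: each of the n² pair scores contributes its own binary “match / no match” term, which is why the loss needs no global normalization across devices.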
Key ideas in everyday language:
- Batch size: This is how many image–caption pairs the model looks at in one go. Bigger batches usually help, but they also need more memory.
- Pairwise scoring: With sigmoid, the model scores each pair separately, so it doesn’t need to build a huge table of all comparisons.
- “Chunked” implementation: Imagine several computers, each holding a small portion of the data. Instead of gathering everything everywhere, each computer swaps small chunks of data with neighbors, computes local losses, and sums results. This saves lots of memory and time.
- Temperature and bias (two helpful knobs):
- Temperature controls how strongly the model separates matches from mismatches.
- Bias offsets the starting imbalance: at the beginning of training, almost every pair in a batch is a mismatch, and the bias accounts for that so the model doesn’t make huge corrections in its first steps.
- Models used: They use a Vision Transformer (ViT) to turn images into numbers and a Transformer to turn text into numbers. These numbers (embeddings) are compared to see if image and text match.
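The chunked idea, together with the temperature and bias knobs, can be simulated on a single machine. The sketch below fakes the devices with array slices; the real implementation passes text chunks between TPU neighbors, but the arithmetic is identical. The starting values t = 10 and b = −10 match the initialization recommended in the paper; everything else is a toy, and the embeddings are assumed to be already L2-normalized:

```python
import numpy as np

def block_loss(img, txt, labels, t, b):
    # Sigmoid loss restricted to one chunk of images vs. one chunk of texts.
    logits = img @ txt.T * t + b
    return np.sum(np.logaddexp(0.0, -labels * logits))

def chunked_sigmoid_loss(img, txt, t=10.0, b=-10.0, num_devices=4):
    """Simulate the chunked loss: each 'device' holds one text chunk at a time."""
    n = img.shape[0]
    imgs, txts = np.split(img, num_devices), np.split(txt, num_devices)
    chunk = n // num_devices
    total = 0.0
    for step in range(num_devices):         # num_devices rounds of neighbor swaps
        for d in range(num_devices):
            src = (d + step) % num_devices  # which text chunk device d holds now
            labels = -np.ones((chunk, chunk))
            if step == 0:                   # only the local chunk contains positives
                np.fill_diagonal(labels, 1.0)
            total += block_loss(imgs[d], txts[src], labels, t, b)
    return total / n
```

Because each device only ever materializes a chunk-by-chunk block of scores instead of the full n-by-n matrix, per-device memory shrinks quadratically with the number of devices, which is what lets the batch size scale so far.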
Main findings and why they matter
Here are the most important results and what they mean:
- Simpler training that works great:
- The sigmoid loss often performs better than softmax when batch sizes are small (below about 16,000 pairs). As batches get larger, both methods become similar—but sigmoid stays simpler and more memory‑friendly.
- They trained with batch sizes as high as 1,000,000, but found that performance gains mostly flatten out around 32,000. In short: you don’t need gigantic batches to get very good results.
- Strong performance with limited hardware:
- Using only four specialized chips (TPUv4), their SigLiT setup reached 84.5% “zero-shot” accuracy on ImageNet in about two days. “Zero-shot” means the model can recognize objects without being directly trained on that exact task—it learned general knowledge from image–text pairs.
- From scratch, their SigLIP setup achieved around 73% zero-shot accuracy with far fewer resources than older approaches like CLIP.
- Works across many languages:
- Even when training on image–text pairs spanning over 100 languages, a batch size of 32,000 was still enough. Bigger batches didn’t help and sometimes hurt multilingual retrieval performance.
- Their multilingual model set new records on a large cross-language image–text retrieval benchmark (XM3600).
- More stable and robust training:
- Training with big batches can become unstable, with sudden jumps in the loss. The authors show that lowering a common optimizer setting called beta2 (from its usual default of 0.999 to 0.95) helps keep training steady.
- The sigmoid method is more resilient to noisy data (for example, if some images or captions are wrong or shuffled), which is common on the web.
- Practical fine-tuning tip:
- When starting from a pre-trained image model, turning off “weight decay” (a regularization setting) on those pre-trained image weights led to better results. This helps preserve the useful knowledge in the image encoder.
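Both optimizer tips above can be illustrated with tiny NumPy sketches. The numbers here are toy values, except beta2 = 0.95, which is the value the paper lowers to from the common default of 0.999:

```python
import numpy as np

# 1) Lower beta2: Adam's running estimate of squared gradients (the "v" buffer)
#    reacts faster to a sudden gradient spike when beta2 is smaller.
def second_moment(grads, beta2):
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g**2
    return v

spiky = [0.1] * 10 + [5.0]              # a gradient spike after calm steps
v_default = second_moment(spiky, beta2=0.999)
v_lowered = second_moment(spiky, beta2=0.95)
# The larger v shrinks Adam's effective step right after the spike, damping it.

# 2) No weight decay on pretrained weights: decoupled decay pulls weights toward
#    zero every step, slowly eroding pretrained knowledge if left on.
def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    return w - lr * grad - lr * weight_decay * w

image_tower = sgd_step(np.ones(3), grad=np.zeros(3), weight_decay=0.0)   # unchanged
text_tower = sgd_step(np.ones(3), grad=np.zeros(3), weight_decay=1e-4)   # shrinks
```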
To make these results easier to spot, here are two headline numbers:
| Hardware | Method | Training time | Batch size | ImageNet zero-shot |
|---|---|---|---|---|
| 4 TPUv4 chips | SigLiT (frozen image encoder) | ~2 days | 20k | 84.5% |
| 32 TPUv4 chips | SigLIP (from scratch) | ~5 days | 32k | 73.4% |
What does this mean for the future?
- Easier access: Since sigmoid loss is simpler and uses less memory, more teams with fewer machines can train strong language–image models.
- Efficient training: You don’t need massive batches or huge hardware budgets to get top performance; a batch size of about 32,000 is a sweet spot.
- Better models in practice: Robustness to noisy data and stable training means these models are more reliable on real-world web data.
- Open models and code: The authors released their models and code, encouraging others to build on their work and try new ideas.
In short, this paper shows a simpler training rule (sigmoid loss) can make powerful language–image models easier, faster, and cheaper to train—without sacrificing quality.