Sigmoid Loss for Language Image Pre-Training
Abstract: We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
Explain it Like I'm 14
A Simple Explanation of “Sigmoid Loss for Language-Image Pre-Training”
Overview
This paper is about teaching computers to understand how pictures and sentences relate to each other. The authors introduce a new, simpler way to train these “language-image” models, called the sigmoid loss. Their method makes training faster, uses less memory, works well even with smaller groups of data, and still performs strongly with very large groups. They also share models and tips so more people can do this kind of training without needing huge amounts of hardware.
What questions did the researchers ask?
The paper looks at questions like:
- Can we replace the usual training rule (called softmax) with a simpler one (sigmoid) and get equal or better results?
- How big should the “batch size” be during training for best performance? Is bigger always better?
- Can this simpler method help when we have limited computing power?
- How does the method work across many languages and with noisy, imperfect data?
- What tricks help keep training stable and avoid sudden spikes in the loss?
How did they do it? (Methods explained simply)
Think of training as a matching game:
- You have a bunch of images and their matching captions (like “a dog running”). The goal is to make the computer give high scores to true matches and low scores to mismatches.
Two common ways to score matches:
- Softmax loss (the usual way): It’s like picking the single “best” caption for each image from the entire batch at once. It needs to see all pairwise similarities in the batch and normalize them twice (once over captions for each image, and once over images for each caption). This is memory-heavy and tricky to implement efficiently across many machines.
- Sigmoid loss (their new way): It’s like rating each image–caption pair as “match” or “not a match” independently, without comparing all pairs at once. This makes the math simpler, uses less memory, and is easier to run on multiple machines.
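In code, the sigmoid loss is only a few lines. Here is a minimal NumPy sketch following the pseudocode in the paper; the variable names and toy shapes are ours:

```python
import numpy as np

def sigmoid_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss: every image-text pair is scored independently."""
    # L2-normalize so the dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T * t + b        # (n, n) scores, scaled by t and shifted by b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0      # +1 on the diagonal (true pairs), -1 elsewhere
    # -log sigmoid(label * logit), written stably, summed over all n*n pairs.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

Note there is no row- or column-wise softmax: each of the n² pair scores contributes its own binary “match / no match” term, which is why the loss needs no global normalization across devices.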
Key ideas in everyday language:
- Batch size: This is how many image–caption pairs the model looks at in one go. Bigger batches usually help, but they also need more memory.
- Pairwise scoring: With sigmoid, the model scores each pair separately, so it doesn’t need to build a huge table of all comparisons.
- “Chunked” implementation: Imagine several computers, each holding a small portion of the data. Instead of gathering everything everywhere, each computer swaps small chunks of data with neighbors, computes local losses, and sums results. This saves lots of memory and time.
- Temperature and bias (two helpful knobs):
- Temperature controls how strongly the model separates matches from mismatches.
- Bias offsets the starting imbalance: at the beginning of training, almost every pair in a batch is a mismatch, and the bias accounts for that so the model doesn’t make huge corrections in its first steps.
- Models used: They use a Vision Transformer (ViT) to turn images into numbers and a Transformer to turn text into numbers. These numbers (embeddings) are compared to see if image and text match.
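The chunked idea, together with the temperature and bias knobs, can be simulated on a single machine. The sketch below fakes the devices with array slices; the real implementation passes text chunks between TPU neighbors, but the arithmetic is identical. The starting values t = 10 and b = −10 match the initialization recommended in the paper; everything else is a toy, and the embeddings are assumed to be already L2-normalized:

```python
import numpy as np

def block_loss(img, txt, labels, t, b):
    # Sigmoid loss restricted to one chunk of images vs. one chunk of texts.
    logits = img @ txt.T * t + b
    return np.sum(np.logaddexp(0.0, -labels * logits))

def chunked_sigmoid_loss(img, txt, t=10.0, b=-10.0, num_devices=4):
    """Simulate the chunked loss: each 'device' holds one text chunk at a time."""
    n = img.shape[0]
    imgs, txts = np.split(img, num_devices), np.split(txt, num_devices)
    chunk = n // num_devices
    total = 0.0
    for step in range(num_devices):         # num_devices rounds of neighbor swaps
        for d in range(num_devices):
            src = (d + step) % num_devices  # which text chunk device d holds now
            labels = -np.ones((chunk, chunk))
            if step == 0:                   # only the local chunk contains positives
                np.fill_diagonal(labels, 1.0)
            total += block_loss(imgs[d], txts[src], labels, t, b)
    return total / n
```

Because each device only ever materializes a chunk-by-chunk block of scores instead of the full n-by-n matrix, per-device memory shrinks quadratically with the number of devices, which is what lets the batch size scale so far.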
Main findings and why they matter
Here are the most important results and what they mean:
- Simpler training that works great:
- The sigmoid loss often performs better than softmax when batch sizes are small (below about 16,000 pairs). As batches get larger, both methods become similar—but sigmoid stays simpler and more memory‑friendly.
- They trained with batch sizes as high as 1,000,000, but found that performance gains mostly flatten out around 32,000. In short: you don’t need gigantic batches to get very good results.
- Strong performance with limited hardware:
- Using only four specialized chips (TPUv4), their SigLiT setup reached 84.5% “zero-shot” accuracy on ImageNet in about two days. “Zero-shot” means the model can recognize objects without being directly trained on that exact task—it learned general knowledge from image–text pairs.
- From scratch, their SigLIP setup achieved around 73% zero-shot accuracy with far fewer resources than older approaches like CLIP.
- Works across many languages:
- Even when training on image–text pairs spanning over 100 languages, a batch size of 32,000 was still enough. Bigger batches didn’t help and sometimes hurt multilingual retrieval performance.
- Their multilingual model set new records on a large cross-language image–text retrieval benchmark (XM3600).
- More stable and robust training:
- Training with big batches can become unstable, with sudden jumps in the loss. The authors show that lowering a common optimizer setting called beta2 (from its usual default of 0.999 to 0.95) helps keep training steady.
- The sigmoid method is more resilient to noisy data (for example, if some images or captions are wrong or shuffled), which is common on the web.
- Practical fine-tuning tip:
- When starting from a pre-trained image model, turning off “weight decay” (a regularization setting) on those pre-trained image weights led to better results. This helps preserve the useful knowledge in the image encoder.
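Both optimizer tips above can be illustrated with tiny NumPy sketches. The numbers here are toy values, except beta2 = 0.95, which is the value the paper lowers to from the common default of 0.999:

```python
import numpy as np

# 1) Lower beta2: Adam's running estimate of squared gradients (the "v" buffer)
#    reacts faster to a sudden gradient spike when beta2 is smaller.
def second_moment(grads, beta2):
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g**2
    return v

spiky = [0.1] * 10 + [5.0]              # a gradient spike after calm steps
v_default = second_moment(spiky, beta2=0.999)
v_lowered = second_moment(spiky, beta2=0.95)
# The larger v shrinks Adam's effective step right after the spike, damping it.

# 2) No weight decay on pretrained weights: decoupled decay pulls weights toward
#    zero every step, slowly eroding pretrained knowledge if left on.
def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    return w - lr * grad - lr * weight_decay * w

image_tower = sgd_step(np.ones(3), grad=np.zeros(3), weight_decay=0.0)   # unchanged
text_tower = sgd_step(np.ones(3), grad=np.zeros(3), weight_decay=1e-4)   # shrinks
```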
To make these results easier to spot, here are two headline numbers:
| Hardware | Method | Training time | Batch size | ImageNet zero-shot |
|---|---|---|---|---|
| 4 TPUv4 chips | SigLiT (frozen image encoder) | ~2 days | 20k | 84.5% |
| 32 TPUv4 chips | SigLIP (from scratch) | ~5 days | 32k | 73.4% |
What does this mean for the future?
- Easier access: Since sigmoid loss is simpler and uses less memory, more teams with fewer machines can train strong language–image models.
- Efficient training: You don’t need massive batches or huge hardware budgets to get top performance; a batch size of about 32,000 is a sweet spot.
- Better models in practice: Robustness to noisy data and stable training means these models are more reliable on real-world web data.
- Open models and code: The authors released their models and code, encouraging others to build on their work and try new ideas.
In short, this paper shows a simpler training rule (sigmoid loss) can make powerful language–image models easier, faster, and cheaper to train—without sacrificing quality.