No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning

Published 23 Sep 2025 in cs.CV, cs.AI, and cs.LG | (2509.18938v1)

Abstract: While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-Language Models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of LLMs, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.


Summary

The paper presents a novel zero-shot image classification approach that eliminates the need for labeled training data through confidence-based pseudo-labeling and self-learning cycles.
It utilizes Vision-Language Models and pre-trained feature extractors like ViT-G-14 to iteratively refine classifier training and improve performance across diverse datasets.
Experimental results on ten datasets show the framework reaching an average accuracy of up to 76.97%, surpassing state-of-the-art methods while adapting dynamically to the test data.

Overview of "No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning"

The paper presents a zero-shot image classification framework that removes the need for extensive labeled datasets. It combines a Vision-Language Model (VLM) with a pre-trained visual model in a self-learning cycle that pseudo-labels the test data and trains a classifier directly on it, allowing the system to adapt to new data without supervision.

Introduction

Image classification in deep learning typically relies on large annotated datasets for model training, which poses challenges in scenarios where such data is scarce. Vision-Language Models (VLMs) have emerged as promising tools for zero-shot classification, capable of achieving high accuracy without requiring labeled data. VLMs such as CLIP are trained with contrastive learning on image-text pairs to embed images and text in a shared representation space, so an image can be classified by comparing its embedding with the embeddings of the class names.
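As a minimal sketch of how zero-shot classification works in a shared embedding space, the snippet below scores one image against each class name by cosine similarity and turns the scores into probabilities with a softmax. The vectors are hand-made toy values standing in for CLIP embeddings, and the function name `zero_shot_probs` and the temperature value are illustrative assumptions, not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and each class-name embedding."""
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]
    logits = [s / temperature for s in sims]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for embeddings (assumed values, not real CLIP outputs).
class_names = ["cat", "dog", "car"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
predicted = class_names[probs.index(max(probs))]
```

The highest probability doubles as a confidence score, which is the quantity a confidence-based pseudo-labeling strategy can threshold or rank.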

Despite improvements in zero-shot classification, current methods often depend on LLMs and fine-tuning strategies, limiting their applicability in data-constrained environments. The paper introduces a new framework that circumvents these limitations by using confidence-based pseudo-labeling and self-learning cycles, requiring only class names for training.

Proposed Framework

The framework's pipeline has three steps: Seed Selection, Classifier Training, and Image Classification. CLIP supplies the semantic signal used to pick the seed samples, while a separate pre-trained visual model supplies the features on which the classifier is trained.

Step A: Seed Selection

Seed selection identifies high-confidence samples in the test data using CLIP: the images and the class names are encoded, and the candidates with the highest image-text similarity scores are selected. An improved variant refines this selection by applying neighborhood consensus among the selected candidates, which makes the seed set more reliable.
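The two selection stages can be sketched as follows. This summary does not give the exact rules, so the details here are assumptions: `select_seeds` keeps the `k` most confident images per predicted class, and `consensus_filter` keeps a candidate only if a strict majority of its `m` nearest candidates in visual feature space share its pseudo-label. All names, parameters, and toy values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_seeds(probs, k):
    """Keep the k most confident images for each predicted class.

    probs[i][c] is the zero-shot probability of class c for image i.
    Returns {class index: [image indices]}.
    """
    n_classes = len(probs[0])
    preds = [max(range(n_classes), key=p.__getitem__) for p in probs]
    seeds = {}
    for c in range(n_classes):
        candidates = [i for i, pred in enumerate(preds) if pred == c]
        candidates.sort(key=lambda i: probs[i][c], reverse=True)
        seeds[c] = candidates[:k]
    return seeds

def consensus_filter(seeds, features, m=2):
    """Drop candidates whose m nearest fellow candidates mostly disagree with them."""
    labeled = [(i, c) for c, idxs in seeds.items() for i in idxs]
    kept = {c: [] for c in seeds}
    for i, c in labeled:
        others = [(sq_dist(features[i], features[j]), cj)
                  for j, cj in labeled if j != i]
        others.sort(key=lambda t: t[0])
        neighbors = others[:m]
        agree = sum(1 for _, cj in neighbors if cj == c)
        if neighbors and 2 * agree > len(neighbors):
            kept[c].append(i)
    return kept

# Toy data: image 6 is labeled class 0 by the VLM but sits near the class 1 cluster.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
         [0.2, 0.8], [0.1, 0.9], [0.3, 0.7], [0.6, 0.4]]
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [2.0, 2.0]]

seeds = select_seeds(probs, k=4)
refined = consensus_filter(seeds, features, m=2)
```

In this toy run, image 6 passes the confidence ranking but is removed by the consensus check, because its nearest candidates carry a different pseudo-label.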

Figure 1: Our approach pipeline is divided into three steps. In (A), we construct the initial training set (SEED) using CLIP and extract features from the images using a pre-trained model. Next (B), we perform a self-learning cycle, incrementally tuning a classifier. In the third step (C), we make the predictions for each image in the dataset using the trained classifier.

Step B: Classifier Training

The seed set of high-confidence samples is used for the initial training of a lightweight classifier. A pre-trained feature extractor (e.g., ViT-G-14) supplies the feature vectors, and the classifier is then tuned incrementally over the self-learning cycle, an arrangement intended to limit overfitting while keeping the system adaptable.
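A self-learning cycle of this kind can be sketched with a nearest-centroid model standing in for the lightweight classifier. The classifier type, the confidence threshold, and the stopping rule are assumptions chosen to keep the example dependency-free; the summary does not specify them. The 2-D vectors are toy stand-ins for features from the pre-trained extractor.

```python
import math

def fit_centroids(features, labels):
    """Train the stand-in classifier: one mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    classes = sorted(centroids)
    logits = [-math.dist(x, centroids[c]) for c in classes]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return classes, [e / total for e in exps]

def self_learning(features, seed_labels, threshold=0.9, max_rounds=5):
    """Grow the pseudo-labeled set from the seeds, retraining every round."""
    labeled = dict(seed_labels)  # {image index: pseudo-label}

    def train():
        idxs = list(labeled)
        return fit_centroids([features[i] for i in idxs],
                             [labeled[i] for i in idxs])

    for _ in range(max_rounds):
        centroids = train()
        added = 0
        for i, x in enumerate(features):
            if i in labeled:
                continue
            classes, probs = predict_proba(centroids, x)
            best = max(range(len(classes)), key=probs.__getitem__)
            if probs[best] >= threshold:  # accept only confident predictions
                labeled[i] = classes[best]
                added += 1
        if added == 0:  # nothing new was pseudo-labeled, so stop early
            break
    return train(), labeled

# Toy features; images 0 and 1 are the seeds for classes 0 and 1.
features = [[0.0, 0.0], [4.0, 4.0], [1.0, 1.0],
            [4.5, 4.5], [1.4, 1.4], [2.3, 2.3]]
centroids, labeled = self_learning(features, {0: 0, 1: 1})
```

Here image 4 is rejected in the first round and accepted in the second, after the class 0 centroid has moved toward it, while the ambiguous image 5 never receives a pseudo-label during training.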

Step C: Image Classification

After iterative refinement and self-learning cycles, the trained classifier predicts classes for the entire image dataset using robust feature vectors derived from the feature extractor. The self-learning process allows the framework to dynamically adjust and improve accuracy over successive cycles.
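The final step amounts to one pass of the trained classifier over every image, including images that were never pseudo-labeled during training. The sketch below assumes a nearest-centroid classifier with hypothetical centroid values, and the ground-truth labels are toy values used only to compute accuracy, never for training.

```python
import math

def classify_all(features, centroids):
    """Assign each image to the class whose centroid is nearest in feature space."""
    return [min(centroids, key=lambda c: math.dist(x, centroids[c]))
            for x in features]

def accuracy(predictions, truth):
    """Fraction of images whose predicted class matches the evaluation label."""
    correct = sum(1 for p, t in zip(predictions, truth) if p == t)
    return correct / len(truth)

# Hypothetical trained classifier and toy test features.
centroids = {"cat": [0.8, 0.8], "dog": [4.25, 4.25]}
features = [[0.0, 0.0], [4.0, 4.0], [2.3, 2.3], [3.0, 3.0]]
truth = ["cat", "dog", "dog", "dog"]  # held out for evaluation only

predictions = classify_all(features, centroids)
score = accuracy(predictions, truth)
```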

Experimental Results

The paper evaluates performance on ten diverse datasets, demonstrating significant accuracy improvements across all tested backbones and configurations. Notably, the framework achieves an average accuracy of up to 76.97%, surpassing state-of-the-art approaches and baseline models like CLIP.

Figure 2: Accuracy of the sample selection (default and improved versions) across the datasets using CLIP B/32 backbone.

Figure 3: Accuracy of the sample selection (default and improved versions) across the datasets using CLIP B/16 backbone.

Figure 4: Accuracy of the sample selection (default and improved versions) across the datasets using CLIP L/14 backbone.

The experimental results underscore the efficacy of the collaborative self-learning model, which dynamically refines pseudo-labels and improves performance without relying on labeled data or extensive fine-tuning. The improvements are consistent across the datasets and backbones tested.

Conclusion

The collaborative self-learning framework introduced in this paper offers a novel method for zero-shot image classification that improves accuracy and adaptability across diverse datasets without labeled training data or extensive fine-tuning. By decoupling the semantic and visual sources of information, the framework achieves efficient transfer learning and unsupervised training, paving the way for broader applications in resource-constrained settings.
