- The paper introduces LarvSeg, a novel framework that uses image-level classification data and a category-wise attentive classifier to extend segmentation to thousands of categories.
- It employs a segmentation backbone with a cosine classifier and a memory bank-based CA-Classifier to accurately segment novel categories without pixel-level annotations.
- Experimental results show improvements of up to 6.0 mIoU on novel categories, demonstrating the potential of leveraging classification data for large vocabulary segmentation.
Leveraging Image Classification Data for Large Vocabulary Semantic Segmentation
The paper "LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier" (2501.06862) introduces a novel framework, LarvSeg, to address the challenge of scaling up the vocabulary of semantic segmentation models by leveraging image classification data. This approach circumvents the limitations of language-guided segmentation models, which often struggle with out-of-distribution categories, and the difficulty of acquiring large-scale, pixel-level annotated datasets. LarvSeg uses image-level supervision from classification datasets to improve segmentation performance, especially for categories that lack mask annotations.
Methodology and Implementation
The LarvSeg framework (Figure 1) comprises a basic segmentation network, an image-level classification component, and a category-wise attentive classifier (CA-Classifier).
Figure 1: Illustration of a new paradigm to address large vocabulary semantic segmentation with image classification data.
The basic segmenter employs a backbone network (e.g., ViT-B/16) followed by a cosine classifier for pixel-wise category prediction. Image-level supervision is incorporated by performing image-level classification on the average pooled feature map. The overall training objective combines segmentation loss ($\mathcal{L}_{\text{seg}}$) and classification loss ($\mathcal{L}_{\text{cls}}$):
$\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}$.
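As a minimal sketch of this joint objective, the snippet below combines a per-pixel cross-entropy term from a cosine classifier with a multi-label classification term computed on the average-pooled feature map. The function names, the temperature value, and the use of sigmoid/BCE for the image-level term are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cosine_logits(features, class_embed, tau=0.07):
    """Cosine classifier: L2-normalise features and class weights, scale by 1/tau."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    w = class_embed / np.linalg.norm(class_embed, axis=-1, keepdims=True)
    return f @ w.T / tau

def larvseg_loss(pixel_feats, pixel_labels, image_labels, class_embed, lam_cls=1.0):
    """Sketch of L = L_seg + lambda_cls * L_cls for a single image.

    pixel_feats : (N, D) flattened per-pixel features
    pixel_labels: (N,)   ground-truth class ids, -1 where no mask label exists
    image_labels: (C,)   multi-hot image-level labels
    class_embed : (C, D) cosine-classifier weights
    """
    # Per-pixel segmentation loss: cross-entropy over labelled pixels only.
    probs = softmax(cosine_logits(pixel_feats, class_embed))
    mask = pixel_labels >= 0
    l_seg = -np.log(probs[mask, pixel_labels[mask]] + 1e-12).mean() if mask.any() else 0.0
    # Image-level classification on the average-pooled feature map.
    img_logits = cosine_logits(pixel_feats.mean(axis=0, keepdims=True), class_embed)[0]
    img_probs = 1.0 / (1.0 + np.exp(-img_logits))          # multi-label sigmoid
    l_cls = -(image_labels * np.log(img_probs + 1e-12)
              + (1 - image_labels) * np.log(1 - img_probs + 1e-12)).mean()
    return l_seg + lam_cls * l_cls
```

Pixels without mask labels contribute only through the image-level term, which is how classification-only categories receive supervision at all.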
A key observation is that models trained on segmentation data can group pixel features of categories outside the training vocabulary. This intra-category pixel compactness is exploited by the CA-Classifier (Figure 2) to apply supervision to precise regions.
Figure 2: Illustration of the LarvSeg framework. The meaning of each icon is listed on the left. CA-Classifier and CA-Map stand for category-wise attentive classifier and category-wise attention map, as defined in the paper.
The CA-Classifier uses a memory bank to store representative features for each novel category. A category-wise attention map $A$ is generated from the confidence scores against the memory bank, highlighting foreground regions and suppressing background regions. An auxiliary image-level classification loss is then applied to the attentively pooled feature map.
The overall loss function is:
$\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{aux}} \mathcal{L}_{\text{aux}}$.
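The attentive pooling step can be sketched as follows for a single novel category: per-pixel confidence scores against a memory-bank feature define the attention map $A$, which weights the pooling. Normalising $A$ with a softmax over pixels is an assumption made here for illustration; the paper's exact construction of the attention map may differ.

```python
import numpy as np

def category_attentive_pool(pixel_feats, memory_feat):
    """Category-wise attentive pooling for one novel category (sketch).

    pixel_feats: (N, D) per-pixel features of one image
    memory_feat: (D,)   a representative feature for the category,
                        taken from the memory bank
    Returns the attentively pooled feature of shape (D,).
    """
    # Cosine similarity between each pixel and the memory feature.
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    m = memory_feat / np.linalg.norm(memory_feat)
    scores = f @ m
    # Category-wise attention map A: emphasises likely-foreground pixels,
    # suppresses the background (softmax over pixels, an assumption).
    e = np.exp(scores - scores.max())
    A = e / e.sum()
    # Pool features weighted by A; the auxiliary image-level
    # classification loss is then applied to this pooled feature.
    return A @ pixel_feats
```

Because the pooled feature is dominated by pixels similar to the memory entry, the auxiliary image-level loss supervises roughly the right region even without any mask label for the category.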
Experimental Results
The paper presents extensive experimental results on datasets such as COCO-Stuff (C171), ADE20K (A150), ADEFull (A847), and ImageNet21K (I21K). A significant contribution is the demonstration of a 21K-category semantic segmentation model using ImageNet21K. The qualitative results in Figure 3 show the model's ability to segment novel categories.
Figure 3: Visualization of 21K categories semantic segmentation.
The proposed baseline outperforms previous open vocabulary methods by a large margin, and LarvSeg further improves the baseline on novel categories by $6.0$ mIoU on A150 and $2.1$ mIoU on A847 (Table 1).
The visualization of model predictions in Figure 4 shows that LarvSeg recognizes categories such as radiator and painting, which are often missed by other models.
Figure 4: Visualization of model predictions. The tags show model names and the corresponding mIoUs of this image. Circles with different colours represent regions with novel categories in the image: sofa (in the red circle), radiator (in the dark blue circle) and painting (in the light blue circle).
Ablation studies validate the effectiveness of the CA-Classifier and the importance of cross-image semantic cues.
Implications and Future Directions
The LarvSeg framework offers a practical approach to scaling up semantic segmentation vocabularies by leveraging readily available image classification data. This approach addresses the limitations of existing language-guided segmentation models and the difficulty of obtaining large-scale, pixel-level annotated datasets. The results suggest that collecting high-quality, category-balanced data is critical for improving model performance.
Future research directions include exploring different memory update strategies, incorporating more sophisticated attention mechanisms, and extending the framework to other vision tasks. Additionally, a potentially fruitful avenue for further exploration involves investigating the impact of different backbone architectures and pre-training strategies on the overall performance of LarvSeg. The modular design of LarvSeg means it could be integrated with complementary techniques such as language-guided methods to further expand the segmentation vocabulary.
Conclusion
The paper "LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier" (2501.06862) presents a valuable contribution to the field of semantic segmentation by introducing a practical and effective framework for scaling up segmentation vocabularies using image classification data. The LarvSeg framework demonstrates strong performance on a variety of datasets, achieving state-of-the-art results in large vocabulary semantic segmentation.