- The paper introduces LarvSeg, a novel framework that uses image-level classification data and a category-wise attentive classifier to extend segmentation to thousands of categories.
- It employs a segmentation backbone with a cosine classifier and a memory bank-based CA-Classifier to accurately segment novel categories without pixel-level annotations.
- Experimental results show improvements of up to 6.0 mIoU on novel categories, demonstrating the potential of leveraging classification data for large vocabulary segmentation.
Leveraging Image Classification Data for Large Vocabulary Semantic Segmentation
The paper "LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier" (2501.06862) introduces a novel framework, LarvSeg, to address the challenge of scaling up the vocabulary of semantic segmentation models by leveraging image classification data. This approach circumvents the limitations of language-guided segmentation models, which often struggle with out-of-distribution categories, and the difficulty of acquiring large-scale, pixel-level annotated datasets. LarvSeg uses image-level supervision from classification datasets to improve segmentation performance, especially for categories that lack mask annotations.
Methodology and Implementation
The LarvSeg framework (Figure 1) comprises a basic segmentation network, an image-level classification component, and a category-wise attentive classifier (CA-Classifier).
Figure 1: Illustration of a new paradigm to address large vocabulary semantic segmentation with image classification data.
The basic segmenter employs a backbone network (e.g., ViT-B/16) followed by a cosine classifier for pixel-wise category prediction. Image-level supervision is incorporated by performing image-level classification on the average pooled feature map. The overall training objective combines segmentation loss ($\mathcal{L}_{\text{seg}}$) and classification loss ($\mathcal{L}_{\text{cls}}$):
$\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}$.
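As a minimal sketch of this joint objective, the snippet below combines a per-pixel cross-entropy term from a cosine classifier with a multi-label classification term computed on the average-pooled feature map. The function names, the temperature value, and the use of sigmoid/BCE for the image-level term are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cosine_logits(features, class_embed, tau=0.07):
    """Cosine classifier: L2-normalise features and class weights, scale by 1/tau."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    w = class_embed / np.linalg.norm(class_embed, axis=-1, keepdims=True)
    return f @ w.T / tau

def larvseg_loss(pixel_feats, pixel_labels, image_labels, class_embed, lam_cls=1.0):
    """Sketch of L = L_seg + lambda_cls * L_cls for a single image.

    pixel_feats : (N, D) flattened per-pixel features
    pixel_labels: (N,)   ground-truth class ids, -1 where no mask label exists
    image_labels: (C,)   multi-hot image-level labels
    class_embed : (C, D) cosine-classifier weights
    """
    # Per-pixel segmentation loss: cross-entropy over labelled pixels only.
    probs = softmax(cosine_logits(pixel_feats, class_embed))
    mask = pixel_labels >= 0
    l_seg = -np.log(probs[mask, pixel_labels[mask]] + 1e-12).mean() if mask.any() else 0.0
    # Image-level classification on the average-pooled feature map.
    img_logits = cosine_logits(pixel_feats.mean(axis=0, keepdims=True), class_embed)[0]
    img_probs = 1.0 / (1.0 + np.exp(-img_logits))          # multi-label sigmoid
    l_cls = -(image_labels * np.log(img_probs + 1e-12)
              + (1 - image_labels) * np.log(1 - img_probs + 1e-12)).mean()
    return l_seg + lam_cls * l_cls
```

Pixels without mask labels contribute only through the image-level term, which is how classification-only categories receive supervision at all.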
A key observation is that models trained on segmentation data can group pixel features of categories outside the training vocabulary. This intra-category pixel compactness is exploited by the CA-Classifier (Figure 2) to apply supervision to precise regions.
Figure 2: Illustration of the LarvSeg framework. The meaning of each icon is listed on the left. CA-Classifier and CA-Map stand for category-wise attentive classifier and category-wise attention map, as defined in the paper.
The CA-Classifier uses a memory bank to store representative features for each novel category. A category-wise attention map $A$ is generated from the confidence scores against the memory bank, highlighting foreground regions and suppressing background regions. An auxiliary image-level classification loss is then applied to the attentively pooled feature map.
The overall loss function is:
$\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{aux}} \mathcal{L}_{\text{aux}}$.
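The attentive pooling step can be sketched as follows for a single novel category: per-pixel confidence scores against a memory-bank feature define the attention map $A$, which weights the pooling. Normalising $A$ with a softmax over pixels is an assumption made here for illustration; the paper's exact construction of the attention map may differ.

```python
import numpy as np

def category_attentive_pool(pixel_feats, memory_feat):
    """Category-wise attentive pooling for one novel category (sketch).

    pixel_feats: (N, D) per-pixel features of one image
    memory_feat: (D,)   a representative feature for the category,
                        taken from the memory bank
    Returns the attentively pooled feature of shape (D,).
    """
    # Cosine similarity between each pixel and the memory feature.
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    m = memory_feat / np.linalg.norm(memory_feat)
    scores = f @ m
    # Category-wise attention map A: emphasises likely-foreground pixels,
    # suppresses the background (softmax over pixels, an assumption).
    e = np.exp(scores - scores.max())
    A = e / e.sum()
    # Pool features weighted by A; the auxiliary image-level
    # classification loss is then applied to this pooled feature.
    return A @ pixel_feats
```

Because the pooled feature is dominated by pixels similar to the memory entry, the auxiliary image-level loss supervises roughly the right region even without any mask label for the category.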
Experimental Results
The paper presents extensive experimental results on datasets such as COCO-Stuff (C171), ADE20K (A150), ADEFull (A847), and ImageNet21K (I21K). A significant contribution is the demonstration of a 21K-category semantic segmentation model using ImageNet21K. The qualitative results in Figure 3 show the model's ability to segment novel categories.
Figure 3: Visualization of 21K categories semantic segmentation.
The proposed baseline outperforms previous open vocabulary methods by a large margin, and LarvSeg further improves the baseline on novel categories by $6.0$ mIoU on A150 and $2.1$ mIoU on A847 (Table 1).
The visualization of model predictions in Figure 4 shows that LarvSeg recognizes categories such as radiator and painting, which are often missed by other models.
Figure 4: Visualization of model predictions. The tags show model names and the corresponding mIoUs of this image. Circles with different colours represent regions with novel categories in the image: sofa (in the red circle), radiator (in the dark blue circle) and painting (in the light blue circle).
Ablation studies validate the effectiveness of the CA-Classifier and the importance of cross-image semantic cues.
Implications and Future Directions
The LarvSeg framework offers a practical approach to scaling up semantic segmentation vocabularies by leveraging readily available image classification data. This approach addresses the limitations of existing language-guided segmentation models and the difficulty of obtaining large-scale, pixel-level annotated datasets. The results suggest that collecting high-quality, category-balanced data is critical for improving model performance.
Future research directions include exploring different memory update strategies, incorporating more sophisticated attention mechanisms, and extending the framework to other vision tasks. Additionally, a potentially fruitful avenue for further exploration involves investigating the impact of different backbone architectures and pre-training strategies on the overall performance of LarvSeg. The modular design of LarvSeg means it could be integrated with complementary techniques such as language-guided methods to further expand the segmentation vocabulary.
Conclusion
The paper "LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier" (2501.06862) presents a valuable contribution to the field of semantic segmentation by introducing a practical and effective framework for scaling up segmentation vocabularies using image classification data. The LarvSeg framework demonstrates strong performance on a variety of datasets, achieving state-of-the-art results in large vocabulary semantic segmentation.