Siamese Image Modeling for Self-Supervised Vision Representation Learning

Published 2 Jun 2022 in cs.CV | (2206.01204v3)

Abstract: Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two mainstream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together representations from different views of the same image, while avoiding feature collapse. It lacks spatial sensitivity, which requires modeling the local structure within each image. On the other hand, MIM reconstructs the original content given a masked image. It instead does not have good semantic alignment, which requires projecting semantically similar views into nearby representations. To address this dilemma, we observe that (1) semantic alignment can be achieved by matching different image views with strong augmentations; (2) spatial sensitivity can benefit from predicting dense representations with masked images. Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations. SiameseIM uses a Siamese network with two branches. The online branch encodes the first view, and predicts the second view's representation according to the relative positions between these two views. The target branch produces the target by encoding the second view. SiameseIM can surpass both ID and MIM on a wide range of downstream tasks, including ImageNet finetuning and linear probing, COCO and LVIS detection, and ADE20k semantic segmentation. The improvement is more significant in few-shot, long-tail and robustness-concerned scenarios. Code shall be released at https://github.com/fundamentalvision/Siamese-Image-Modeling.



Summary

The paper introduces Siamese Image Modeling (SiameseIM), a dual-branch framework that merges instance discrimination and masked image modeling to overcome spatial and semantic limitations.
The paper employs a novel positional embedding mechanism and dense loss to effectively align and reconstruct representations from different augmented views.
The paper's experiments demonstrate significant improvements across benchmarks like ImageNet classification, COCO detection, and ADE20k segmentation.


Abstract

This paper introduces a novel framework for self-supervised learning (SSL) in vision tasks, termed Siamese Image Modeling (SiameseIM). Drawing insights from existing SSL paradigms, such as Instance Discrimination (ID) and Masked Image Modeling (MIM), SiameseIM addresses limitations in spatial sensitivity and semantic alignment through a dual-branch Siamese network architecture. The proposed method surpasses traditional ID and MIM techniques across numerous downstream tasks, including ImageNet finetuning, linear probing, COCO detection, and ADE20k semantic segmentation.

Introduction

Self-supervised learning has garnered attention for its ability to leverage unlabeled data, competing closely with supervised learning in vision domains. ID and MIM have been prominent SSL frameworks, each with distinct properties: ID ensures semantic alignment by pulling together augmented views but lacks spatial sensitivity, while MIM reconstructs images from masked versions but struggles with semantic alignment. SiameseIM integrates the strengths of both approaches by using a dual-branch network to process augmented views from the same image, facilitating dense representation prediction through semantic matching and spatial modeling.

Figure 1: Comparisons among ID, MIM and SiameseIM. Matching different augmented views can help to learn semantic alignment, which is adopted by ID and SiameseIM. Predicting dense representations from masked images is beneficial to obtain spatial sensitivity, which is adopted by MIM and SiameseIM.

Methodology

Siamese Network Architecture

SiameseIM uses a Siamese network with two branches. In the online branch, an encoder processes only the visible patches of the first augmented view, and a decoder then predicts the second view's dense representation from those tokens and the relative positions of the two views. The target branch encodes the second view with a momentum-based encoder, and its output serves as the prediction target.

Figure 2: The overview of our Siamese Image Modeling (SiameseIM). Different augmented views are fed into the online and target branches. The online encoder operates on the visible patches of $x_a$ . The online decoder accepts the $x_a$ tokens as well as mask tokens that correspond to predicted $x_b$ tokens. We use relative positions to inform the locations between $x_a$ and $x_b$ . The target encoder maps $x_b$ to the target representations. We finally apply the dense loss on the dense representations.
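The two bookkeeping steps around the branches can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the "momentum-based" target encoder is an exponential moving average (EMA) of the online encoder, and the function names, the mask ratio of 0.75, and the momentum of 0.99 are hypothetical values chosen for the example.

```python
import random


def random_visible_indices(num_patches, mask_ratio, rng):
    """Pick which patch indices of x_a stay visible to the online encoder.

    Assumption: uniform random masking; the remaining patches are hidden.
    """
    num_visible = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_visible))


def ema_update(target_params, online_params, momentum):
    """Move each target parameter toward its online counterpart.

    Assumption: the momentum update is a plain exponential moving average.
    Parameters are flat lists of floats to keep the sketch dependency-free.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


rng = random.Random(0)

# A 14 x 14 patch grid (196 patches) with 75% of the patches masked.
visible = random_visible_indices(num_patches=196, mask_ratio=0.75, rng=rng)

# One momentum step for a toy two-parameter "encoder".
target = ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.99)
```

Only the online branch receives gradients in this kind of setup; the target parameters change solely through the momentum step, which is why the update is written as a separate function.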

Positional Embedding and Dense Loss

A positional embedding mechanism tells the decoder where the second view lies relative to the first, which matters for spatial alignment because the two views are augmented differently. The prediction is trained with a dense loss, derived from ID methods and chosen for memory efficiency, applied to the dense representations. This aligns semantic and spatial representations without an additional global loss.
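To make "dense" concrete, the sketch below scores every predicted token against its corresponding target token and averages the result. It deliberately swaps in a plain cosine-similarity objective as a stand-in; it is not the ID-derived loss used in the paper, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def dense_alignment_loss(predicted_tokens, target_tokens):
    """Average (1 - cosine similarity) over all token positions.

    Stand-in objective: one term per token rather than one term per image,
    which is what distinguishes a dense loss from a global one.
    """
    sims = [cosine_similarity(p, t)
            for p, t in zip(predicted_tokens, target_tokens)]
    return 1.0 - sum(sims) / len(sims)


tokens = [[1.0, 0.0], [0.0, 2.0]]
loss_same = dense_alignment_loss(tokens, tokens)
loss_orthogonal = dense_alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

With identical predictions and targets the loss is close to 0, and with orthogonal tokens it is 1. A global loss would first pool the tokens into a single vector per view, discarding the per-position signal.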

Figure 3: Positional embedding for online decoder. The positions are calculated with respect to the left-top origin of $x_a$ .
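The coordinate convention in the caption can be illustrated as follows. This is a sketch under stated assumptions: each view is an axis-aligned crop given as a hypothetical `(left, top, width, height)` box in source-image pixels, both views are split into the same `grid` x `grid` patch layout, and positions are expressed in units of one $x_a$ patch. It ignores anything else the method may encode, such as flips or relative scale.

```python
def relative_patch_positions(box_a, box_b, grid):
    """Top-left corner of every x_b patch, measured from the top-left of x_a.

    Boxes are (left, top, width, height) in source-image pixels.
    Returns (x, y) pairs in units of one x_a patch, in row-major order.
    """
    left_a, top_a, w_a, h_a = box_a
    left_b, top_b, w_b, h_b = box_b
    patch_w_a, patch_h_a = w_a / grid, h_a / grid
    patch_w_b, patch_h_b = w_b / grid, h_b / grid

    positions = []
    for row in range(grid):
        for col in range(grid):
            x = (left_b + col * patch_w_b - left_a) / patch_w_a
            y = (top_b + row * patch_h_b - top_a) / patch_h_a
            positions.append((x, y))
    return positions


# x_a covers the full 224 x 224 image; x_b is its top-right quadrant.
positions = relative_patch_positions(
    box_a=(0, 0, 224, 224), box_b=(112, 0, 112, 112), grid=2
)
```

Here `positions` is `[(1.0, 0.0), (1.5, 0.0), (1.0, 0.5), (1.5, 0.5)]`: the $x_b$ patches start one $x_a$ patch to the right of the origin and are half as large. Coordinates of this kind, rather than plain grid indices, are what a positional embedding would consume so that mask tokens carry the location of the $x_b$ patch they should predict.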

Experimental Results

Extensive experiments demonstrate the superiority of SiameseIM over established SSL frameworks:

Image Classification: SiameseIM achieves top performance in ImageNet finetuning and linear probing. Its gains in linear probing and few-shot classification indicate better semantic alignment, i.e., more linearly separable features.
Object Detection: On the COCO and LVIS datasets, SiameseIM delivers marked improvements that reflect better spatial sensitivity, with the largest benefits in long-tail and robustness-concerned detection scenarios.
Semantic Segmentation and Robustness: SiameseIM attains superior segmentation accuracy on ADE20k and enhanced robustness across various benchmarks, reflecting its balanced feature learning.

Conclusion

Siamese Image Modeling combines the benefits of ID and MIM within a single framework. Through its dual-branch architecture and dense loss, SiameseIM captures both semantic alignment and spatial sensitivity, outperforming ID and MIM methods across diverse vision tasks. The framework opens promising avenues for future SSL research on the trade-off between these two properties.

The computational demands of SiameseIM call for further optimization, and ongoing efforts are directed at improving efficiency and reducing bias risks in SSL deployments.
