- The paper presents TokenCut, which fuses self-supervised transformers with normalized cut to achieve superior unsupervised object discovery.
- It models image patches as graph nodes and applies spectral clustering on token similarities to accurately segment foreground objects.
- TokenCut shows practical promise in domains like autonomous driving and robotics by reducing reliance on extensive annotated data.
The paper presents TokenCut, a self-supervised transformer-based method for unsupervised object discovery that substantially improves performance across several vision benchmarks. The approach leverages self-supervised vision transformers, specifically DINO, to produce patch-level features from which objects can be identified and segmented without any human-annotated data.
The core innovation is the combination of a graph built from self-supervised transformer features with Normalized Cut (Ncut) for foreground object segmentation. TokenCut models image patches as graph nodes connected by edges whose weights reflect token similarity. Spectral clustering via a generalized eigen-decomposition of this graph then separates foreground from background. The second smallest eigenvector serves as the cutting solution: the magnitude of its entries indicates how likely each token is to belong to a foreground object.
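The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the similarity threshold `tau`, the epsilon edge weight, and the mean-split heuristic are assumptions chosen for clarity, and real patch features would come from a DINO backbone.

```python
import numpy as np

def ncut_foreground(features, tau=0.2):
    """Sketch of a TokenCut-style Ncut over patch tokens.

    features: (N, D) array of patch token features (e.g. DINO keys).
    tau: similarity threshold for graph edges (hypothetical value).
    Returns a boolean foreground mask over the N tokens.
    """
    # Cosine similarity between tokens defines edge weights.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    # Keep edges above tau; a small epsilon keeps the graph connected.
    W = np.where(sim > tau, 1.0, 1e-5)
    d = W.sum(axis=1)
    D = np.diag(d)
    # Generalized eigenproblem (D - W) y = lambda * D y, solved via the
    # symmetric normalized Laplacian for numerical stability.
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = d_inv_sqrt @ (D - W) @ d_inv_sqrt
    _, vecs = np.linalg.eigh(L_sym)
    # Second smallest eigenvector gives the bipartition.
    y = d_inv_sqrt @ vecs[:, 1]
    # Split at the mean; the side containing the max-|y| token is foreground.
    mask = y > y.mean()
    if not mask[np.argmax(np.abs(y))]:
        mask = ~mask
    return mask
```

On two well-separated synthetic token clusters, this returns a mask that splits them cleanly, mirroring how the eigenvector's largest-magnitude entries mark the foreground partition.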
Numerical Achievements
TokenCut delivers strong numerical results, outperforming LOST, the prior state of the art, by margins of 6.9%, 8.1%, and 8.1% on VOC07, VOC12, and COCO20K, respectively. With a second-stage class-agnostic detector (CAD), TokenCut gains a further 5.7%, 4.9%, and 5.1% over LOST+CAD on the same datasets.
For unsupervised saliency detection, TokenCut improves the Intersection over Union (IoU) scores over state-of-the-art methods on ECSSD (by 4.9%), DUTS (by 5.2%), and DUT-OMRON (by 12.9%). In weakly supervised object detection, TokenCut is competitive on CUB and ImageNet, demonstrating robustness across varied data distributions.
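The IoU metric reported above is straightforward to compute for binary saliency masks; a minimal reference implementation:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union between two binary masks.

    pred, gt: arrays of the same shape, interpretable as booleans.
    Returns IoU in [0, 1]; defined as 1.0 when both masks are empty.
    """
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union else 1.0
```

For example, a prediction covering two pixels of which one overlaps a one-pixel ground truth yields an IoU of 1/2.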
Practical and Theoretical Implications
The theoretical implications of TokenCut lie in its demonstration of the efficacy of self-supervised learning in visual tasks traditionally dominated by supervised methods. It introduces a paradigm shift wherein self-supervised transformers, when integrated with graph-based methods like Ncut, can discern objects with minimal data annotations, thereby reducing dependency on extensive labeled datasets.
Practically, TokenCut offers a promising alternative for real-world applications in sectors such as autonomous driving, manufacturing, and robotics, where data annotation is prohibitively expensive or infeasible. Its graph-based approach scales while maintaining accuracy, making it adaptable to large datasets.
Prospective Developments
The results demonstrated by TokenCut open avenues for further work on unsupervised methods in computer vision. Future directions include refining the graph-based model, extending the token analysis, or developing alternative objectives that exploit more of the intrinsic correlations available in unsupervised settings. Hybrid models that balance supervised and unsupervised techniques could further improve performance on vision tasks.
As AI continues to evolve, methodologies such as TokenCut underline the importance of developing adaptable and efficient models capable of learning complex representations without explicit supervision. This research is a step toward a more autonomous and proficient AI landscape, particularly in object discovery and detection tasks.