Scaling Open-Vocabulary Object Detection

Published 16 Jun 2023 in cs.CV (arXiv:2306.09683v3)

Abstract: Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.

Citations (120)

Summary

  • The paper introduces the OWL-ST self-training method, which pseudo-annotates the 10-billion-pair WebLI image-text dataset and trains on over a billion of the resulting examples to overcome the scarcity of annotated detection data.
  • It introduces the optimized OWLv2 architecture, which uses techniques like token dropping and instance selection to improve training throughput by approximately 50%.
  • Scaling the training data achieves a 43% relative improvement in AP on rare classes, demonstrating significant practical advances in detection.

Scaling Open-Vocabulary Object Detection: A Summary

The paper "Scaling Open-Vocabulary Object Detection" by Matthias Minderer et al. explores advancements in open-vocabulary object detection by leveraging large-scale self-training techniques. Object detection is a pivotal task in computer vision with numerous applications, yet extending models to support open-vocabulary settings poses significant challenges due to the limitations of available annotated detection datasets. This paper puts forward the OWLv2 model and an innovative OWL-ST self-training methodology to address these challenges.

Key Contributions and Findings

  1. Self-Training on Web-Scale Data: The authors introduce a self-training approach that uses an existing open-vocabulary detector to generate pseudo-box annotations for WebLI, a massive dataset of 10 billion image-text pairs. This lets the model draw on weak semantic supervision from Web data, sidestepping the scarcity of human-annotated detection data.
  2. Improved Model Architecture: The OWLv2 architecture is optimized for training efficiency, incorporating techniques like token dropping and instance selection to reduce computation without sacrificing performance. This optimized version enhances throughput and FLOP efficiency by approximately 50% compared to its predecessor while maintaining competitive accuracy across detection tasks.
  3. Scaling Training Data: With the OWL-ST method, the researchers scale the training set size to over a billion examples, achieving substantial performance improvements. For example, using an OWL-ST model with an L/14 architecture improves the Average Precision (AP) on LVIS rare classes from 31.2% to 44.6%—demonstrating a 43% relative improvement. Through this paradigm, the model takes advantage of web-scale training data, similar to strategies seen in image classification and language modeling.
  4. Label Space and Filtering: The authors propose an innovative labeling approach that uses all possible N-grams from each image's associated text as detection prompts, paired with minimal filtering of the resulting pseudo-labels. This maximizes the diversity of semantic contexts seen during training, further enhancing open-vocabulary performance.
  5. Model Scaling and Performance: The research confirms that larger models benefit disproportionately from extensive self-training, echoing results seen in other domains like language modeling. The study reveals that open-vocabulary detection is highly scalable, drawing parallels with the scaling laws discovered in vision transformers and other neural networks.
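The self-training idea in point 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `detect` stands in for a pretrained open-vocabulary detector (e.g. an OWL-ViT-style model), and the score threshold is an illustrative parameter for the pseudo-annotation filtering step.

```python
from dataclasses import dataclass

@dataclass
class PseudoBox:
    box: tuple    # (x0, y0, x1, y1) in normalized image coordinates
    label: str    # the text query that produced this detection
    score: float  # detector confidence

def pseudo_annotate(detect, image, queries, score_threshold=0.3):
    """Run an existing open-vocabulary detector with text queries and
    keep only confident detections as pseudo-box annotations."""
    annotations = []
    for box, label, score in detect(image, queries):
        if score >= score_threshold:
            annotations.append(PseudoBox(box, label, score))
    return annotations

# Stub detector for illustration only; a real pipeline would run a
# pretrained model over each Web image.
def fake_detect(image, queries):
    return [((0.1, 0.1, 0.5, 0.5), queries[0], 0.9),
            ((0.2, 0.2, 0.3, 0.3), queries[0], 0.1)]

boxes = pseudo_annotate(fake_detect, image=None, queries=["a cat"])
```

Applied over billions of image-text pairs, the surviving pseudo-boxes become the training set for the student detector.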
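The token dropping mentioned in point 2 can be illustrated as discarding uninformative image patches before the encoder. The variance-based criterion and keep fraction below are assumptions for illustration; they sketch the idea rather than reproduce the paper's exact mechanism.

```python
import numpy as np

def drop_low_variance_tokens(patches, keep_fraction=0.5):
    """Keep only the patches with the highest pixel variance, a simple
    proxy for information content (uniform patches carry little signal).

    patches: array of shape (num_patches, patch_dim)
    Returns the kept patches and their original indices.
    """
    variances = patches.var(axis=1)
    num_keep = max(1, int(round(len(patches) * keep_fraction)))
    keep_idx = np.sort(np.argsort(variances)[-num_keep:])
    return patches[keep_idx], keep_idx

# Demo: two uniform patches (zero variance) and two textured patches.
patches = np.array([[3., 3., 3., 3.],
                    [0., 1., 0., 1.],
                    [5., 5., 5., 5.],
                    [0., 2., 4., 6.]])
kept, idx = drop_low_variance_tokens(patches, keep_fraction=0.5)
```

Halving the token count roughly halves the encoder's per-image compute, which is where the reported throughput gains come from.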
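The N-gram label space from point 4 is straightforward to sketch: every word N-gram of a caption becomes a candidate detection prompt. The `max_n` cutoff below is an illustrative parameter, not the paper's setting.

```python
def ngram_queries(caption, max_n=3):
    """Enumerate all word N-grams of a caption, up to length max_n,
    to use as detection prompts for pseudo-annotation."""
    words = caption.lower().split()
    queries = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            queries.append(" ".join(words[i:i + n]))
    return queries

# For "a dog on grass" with max_n=2 this yields the four unigrams
# plus the bigrams "a dog", "dog on", "on grass".
prompts = ngram_queries("a dog on grass", max_n=2)
```

Because many N-grams are not visually grounded, this label space relies on the confidence-based filtering of pseudo-labels described above to discard prompts the detector cannot localize.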

Implications

  • Theoretical Impacts: The work demonstrates that self-training on pseudo-labeled web-scale datasets provides a viable pathway for improving open-vocabulary object detection. It suggests that further scaling is both feasible and beneficial, presenting opportunities for future research in scaling strategies and model architectures tailored for open-vocabulary tasks.
  • Practical Advances: Practically, this research suggests more robust detection models capable of performing well on less-frequently encountered or entirely novel object classes. It opens the door for applications in diverse environments without the need for exhaustive manual labeling.
  • Future Directions: Future work may explore larger model capacities or alternative architectures that can exploit still larger datasets while balancing compute efficiency and model complexity. Additionally, improving the robustness and calibration of these models, both after fine-tuning and in open-world settings, remains an open challenge.

In conclusion, this paper provides substantial advancements in using self-training methodologies on web-scale image-text data for open-vocabulary object detection. The OWL-ST framework and OWLv2 model represent significant steps forward, unlocking new potential for both current applications and future explorations in vision-language model integration.
