Open-Set Image Tagging with Multi-Grained Text Supervision

Published 23 Oct 2023 in cs.CV | (2310.15200v2)

Abstract: In this paper, we introduce the Recognize Anything Plus Model (RAM++), an open-set image tagging model effectively leveraging multi-grained text supervision. Previous approaches (e.g., CLIP) primarily utilize global text supervision paired with images, leading to sub-optimal performance in recognizing multiple individual semantic tags. In contrast, RAM++ seamlessly integrates individual tag supervision with global text supervision, all within a unified alignment framework. This integration not only ensures efficient recognition of predefined tag categories, but also enhances generalization capabilities for diverse open-set categories. Furthermore, RAM++ employs LLMs to convert semantically constrained tag supervision into more expansive tag description supervision, thereby enriching the scope of open-set visual description concepts. Comprehensive evaluations on various image recognition benchmarks demonstrate RAM++ exceeds existing state-of-the-art (SOTA) open-set image tagging models on most aspects. Specifically, for predefined commonly used tag categories, RAM++ showcases 10.2 mAP and 15.4 mAP enhancements over CLIP on OpenImages and ImageNet. For open-set categories beyond predefined, RAM++ records improvements of 5.0 mAP and 6.4 mAP over CLIP and RAM respectively on OpenImages. For diverse human-object interaction phrases, RAM++ achieves 7.8 mAP and 4.7 mAP improvements on the HICO benchmark. Code, datasets and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.

Summary

The paper introduces a unified alignment framework that integrates image, tag, and text data to improve recognition accuracy in both predefined and open-set categories.
It employs LLM-based descriptive tag supervision to expand and enrich semantic understanding of visual concepts.
RAM++ reports state-of-the-art results among open-set image tagging models, exceeding CLIP by 10.2 mAP on OpenImages and 15.4 mAP on ImageNet for predefined tag categories.

The paper introduces the Recognize Anything Plus Model (RAM++), an open-set image tagging model trained with multi-grained text supervision. It addresses a limitation of prior models such as CLIP, which rely primarily on global text supervision paired with images and therefore recognize multiple individual tags less well. By combining individual tag supervision with global text supervision in a unified alignment framework, RAM++ improves recognition of predefined tag categories and generalizes better to diverse open-set categories.
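As a rough illustration of how tag supervision and global text supervision can share one alignment framework, the sketch below scores both tag embeddings and caption embeddings against the same image region features with a single scoring function, then sums the two losses. Everything here is an assumption made for illustration: `align_score` is a toy attention-pooling stand-in for a learned decoder, binary cross-entropy is swapped in for whatever loss the paper uses, and the two-dimensional vectors are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def align_score(region_feats, query):
    """Toy stand-in for a shared alignment decoder (hypothetical).

    The query (a tag or caption embedding) attends over image regions,
    and the attended feature is compared with the query to give one logit.
    """
    weights = softmax([dot(query, r) for r in region_feats])
    pooled = [sum(w * r[d] for w, r in zip(weights, region_feats))
              for d in range(len(query))]
    return dot(query, pooled)

def bce(logit, label):
    # Binary cross-entropy on a single logit (assumed loss, not the paper's).
    p = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))

def unified_loss(region_feats, tag_embs, tag_labels, text_embs, text_labels):
    # The same scoring function handles both granularities of supervision.
    def mean_loss(embs, labels):
        losses = [bce(align_score(region_feats, e), y) for e, y in zip(embs, labels)]
        return sum(losses) / len(losses)
    return mean_loss(tag_embs, tag_labels) + mean_loss(text_embs, text_labels)

# Made-up data: two image regions, two tags, one caption.
regions = [[4.0, 0.0], [0.0, 4.0]]
tags = [[1.0, 0.0], [-1.0, -1.0]]   # first tag matches the image, second does not
captions = [[1.0, 1.0]]             # caption matches the image

good = unified_loss(regions, tags, [1, 0], captions, [1])
bad = unified_loss(regions, tags, [0, 1], captions, [0])
```

With these toy inputs, labels that agree with the image content give a lower combined loss than flipped labels, which is the behavior a joint image-tag and image-text objective is meant to reward.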

Key Contributions

Unified Alignment Framework: RAM++ integrates image-tag-text triplets within a unified alignment framework. This involves leveraging image-text and image-tag alignments concurrently through a shared alignment decoder. Such an integration is pivotal for improving tagging accuracy on both predefined and open-set categories.
LLM-Based Tag Description: The model uses LLMs to expand tag supervision into descriptive tag supervision. This lets the model perceive a broader range of visual concepts, which is critical for open-set recognition.
State-of-the-Art Performance: RAM++ outperforms existing open-set tagging models on most of the reported benchmarks. For predefined common categories, it surpasses CLIP by 10.2 mAP and 15.4 mAP on OpenImages and ImageNet, respectively. For open-set categories, RAM++ records improvements of 5.0 mAP over CLIP and 6.4 mAP over RAM on OpenImages.
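The LLM-based tag description step above can be sketched as a small prompt-expansion routine. This is a minimal sketch under stated assumptions: the prompt templates are hypothetical and do not reproduce the paper's wording, and `llm` is any caller-supplied function that maps a prompt string to a response string (`fake_llm` is a placeholder so the example runs without a real model).

```python
from typing import Callable, Dict, List

# Hypothetical templates; the prompts used in the paper may differ.
PROMPT_TEMPLATES = [
    "Describe concisely what a(n) {tag} looks like.",
    "What visual characteristics help identify a(n) {tag}?",
    "Where is a(n) {tag} typically found, and what surrounds it?",
]

def build_prompts(tag: str) -> List[str]:
    """Fill every template with the given tag."""
    return [template.format(tag=tag) for template in PROMPT_TEMPLATES]

def expand_tags(tags: List[str], llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each tag to a list of descriptions, one per prompt."""
    return {tag: [llm(p).strip() for p in build_prompts(tag)] for tag in tags}

def fake_llm(prompt: str) -> str:
    # Placeholder standing in for a real LLM call (assumption).
    return f"(description for: {prompt})"

descriptions = expand_tags(["corgi", "lighthouse"], fake_llm)
```

Each tag ends up with several descriptions rather than a single word, which is what gives the text side of the alignment a richer signal for categories outside the predefined tag list.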

Technical Innovations

Multi-Grained Text Supervision: RAM++ combines global text supervision with individual tag supervision, which improves recognition of the multiple individual semantic tags present in an image rather than only its overall description.
Efficient Alignment Decoder: The alignment decoder differentiates RAM++ from other approaches by ensuring efficient recognition across numerous categories without performance degradation.
Automatic Re-weighting Mechanism: When a tag has multiple descriptions, this mechanism weights each description by its relevance to the image features before combining them, which improves the model's semantic alignment.
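The re-weighting idea described above can be sketched as a softmax over image-description similarities. This is a minimal sketch, assuming cosine similarity, a fixed `temperature` of 0.1, and a weighted sum as the combination rule; these are illustrative choices, not details confirmed by the text here, and the vectors are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def reweight_descriptions(image_feat, desc_embs, temperature=0.1):
    """Combine several description embeddings of one tag into one embedding.

    Descriptions that are more similar to the image feature get larger weights.
    The temperature value is an assumption for this sketch.
    """
    img = normalize(image_feat)
    descs = [normalize(d) for d in desc_embs]
    sims = [dot(img, d) / temperature for d in descs]
    # Numerically stable softmax over the scaled similarities.
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    tag_emb = [sum(w * d[i] for w, d in zip(weights, descs))
               for i in range(len(img))]
    return weights, tag_emb

# Made-up example: the first description matches the image, the second does not.
weights, tag_emb = reweight_descriptions([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```

In this toy case nearly all of the weight goes to the first description, so the combined tag embedding stays close to the description that actually fits the image.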

Implications and Future Directions

RAM++ makes image recognition models more versatile, particularly in applications that need robust open-set recognition. Its use of LLM knowledge during the training stage may also influence future research on models that combine visual and textual data.

Looking forward, scaling up the training data could further improve RAM++, particularly for rare categories. Exploring the balance between alignment efficiency and performance also remains important for refining open-set recognition.

Overall, RAM++ offers an effective approach to open-set image tagging built on multi-grained text supervision. Its architecture and supervision strategy open pathways for subsequent work on image tagging and recognition models.