
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining

Published 30 Jul 2021 in cs.CV (arXiv:2107.14572v2)

Abstract: Nowadays, customers' demands for E-commerce are more diversified, which introduces more complications to the product retrieval industry. Previous methods are either subject to single-modal input or perform supervised image-level product retrieval, and thus fail to accommodate real-life scenarios where enormous weakly annotated multi-modal data are present. In this paper, we investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval among fine-grained product categories. To promote the study of this challenging task, we contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval. Notably, Product1M contains over 1 million image-caption pairs and consists of two sample types, i.e., single-product and multi-product samples, which encompass a wide variety of cosmetics brands. In addition to the great diversity, Product1M enjoys several appealing characteristics including fine-grained categories, complex combinations, and fuzzy correspondence that well mimic the real-world scenes. Moreover, we propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE), that excels in capturing the potential synergy between multi-modal inputs via a hybrid-stream transformer in a self-supervised manner. CAPTURE generates discriminative instance features via masked multi-modal learning as well as cross-modal contrastive pretraining and it outperforms several SOTA cross-modal baselines. Extensive ablation studies well demonstrate the effectiveness and the generalization capacity of our model. Dataset and codes are available at https://github.com/zhanxlin/Product1M.

Citations (58)

Summary

  • The paper introduces the Product1M dataset and CAPTURE model to address weakly supervised, instance-level product retrieval challenges in e-commerce.
  • The methodology employs a cross-modal transformer with masked multi-modal modeling and contrastive loss to effectively align image and text features.
  • Experimental results show improved precision and recall in zero-shot and fine-grained retrieval tasks, emphasizing its practical relevance.

Overview of Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining

The paper addresses the problem of instance-level product retrieval in a multi-modal E-commerce context, where existing methods fall short. Its primary contributions are Product1M, a comprehensive dataset designed to simulate real-world product retrieval scenarios, and CAPTURE, a model developed to make multi-modal retrieval more effective.

Key Contributions

1. Introduction of Product1M Dataset:

Product1M is an expansive dataset tailored for instance-level retrieval in the cosmetics sector, encompassing over a million image-caption pairs with substantial diversity. The dataset features two sample types, single-product and multi-product, that reflect real-world complexities such as fine-grained categories, diverse product combinations, and fuzzy image-text correspondence, all of which pose challenges for retrieval.

2. Weakly Supervised Retrieval Setting:

The paper explores a pragmatic scenario where multi-modal instance-level product retrieval is performed with weak supervision. Unlike traditional image-level retrieval, this approach necessitates the extraction of fine-grained, instance-level features from vast quantities of weakly annotated data, capturing the intricate attribute and category distinctions between products.

3. Development of CAPTURE Model:

CAPTURE, a Cross-modal contrAstive Product Transformer, is proposed to enable instance-level product retrieval in multi-modal settings. This hybrid-stream transformer learns cross-modal features in a self-supervised manner, combining masked multi-modal modeling tasks with a cross-modal contrastive loss to align image and text features.
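The paper's summary does not spell out the contrastive objective, but a cross-modal contrastive loss of this kind is typically an InfoNCE-style loss over a batch of paired image and text embeddings. The sketch below is a minimal NumPy illustration under that assumption; the function name, temperature value, and embedding shapes are illustrative, not CAPTURE's actual implementation.

```python
import numpy as np

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def xent(l):
        # Softmax cross-entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetrize over image->text and text->image retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each image embedding toward its own caption while pushing it away from the other captions in the batch, which is what yields the discriminative instance features the paper describes.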

Experimental Insights

Through rigorous experimentation, the paper shows that CAPTURE outperforms existing cross-modal self-supervised pretraining methods, achieving higher precision and recall across various configurations and reinforcing its applicability to real-world multi-modal retrieval. In particular, its performance in zero-shot retrieval underscores its ability to adapt to new product categories without explicit annotations.
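The precision and recall figures reported for such retrieval tasks are conventionally computed over a ranked gallery per query. The helper below is a hypothetical sketch of mean precision@k and recall@k, not the paper's evaluation code; the input shapes and parameter names are assumptions.

```python
import numpy as np

def precision_recall_at_k(sim, relevance, k=5):
    """Mean precision@k and recall@k for a retrieval run.

    sim: (n_queries, n_gallery) similarity scores.
    relevance: boolean (n_queries, n_gallery) ground-truth match matrix.
    """
    order = np.argsort(-sim, axis=1)[:, :k]      # top-k gallery indices per query
    hits = np.take_along_axis(relevance, order, axis=1)
    precision = hits.sum(axis=1) / k
    n_rel = relevance.sum(axis=1).clip(min=1)    # guard queries with no matches
    recall = hits.sum(axis=1) / n_rel
    return precision.mean(), recall.mean()
```

Note that when a query has fewer than k relevant gallery items, precision@k is capped below 1 by construction, which is why both metrics are usually reported together.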

Implications and Future Directions

The contributions of this study have several significant implications:

  • Practical Application: The methodologies and insights gained from this research offer practical solutions for E-commerce platforms, enabling more accurate product retrieval systems that can efficiently handle multi-modal data and weak annotations.
  • Theoretical Advancements: This work extends the boundaries of cross-modal retrieval research by addressing instance-level challenges, highlighting the importance of multi-modal feature alignment in achieving finer granularity in product retrieval.
  • Dataset Utility: Product1M serves as a valuable resource for continued exploration in both academic and commercial domains, providing a realistic benchmark for developing robust retrieval algorithms tailored to the intricacies of authentic E-commerce data.

Future research may focus on enhancing detection accuracy within CAPTURE, exploring novel augmentation techniques for improved multi-product detection, or examining transfer learning approaches to extend CAPTURE's applicability to domains beyond cosmetics.

In conclusion, this paper delivers a substantial advancement in understanding and executing instance-level retrieval in multi-modal, weakly supervised contexts, promising to catalyze further innovation within E-commerce product retrieval systems and beyond.
