
Enhancing Dynamic Image Advertising with Vision-Language Pre-training

Published 25 Jun 2023 in cs.IR (arXiv:2306.14112v1)

Abstract: In the multimedia era, images are an effective medium in search advertising. Dynamic Image Advertising (DIA), a system that matches queries with ad images and generates multimodal ads, is introduced to improve user experience and ad revenue. The core of DIA is a query-image matching module that performs ad image retrieval and relevance modeling. Current query-image matching suffers from limited and inconsistent data and insufficient cross-modal interaction. Moreover, the separate optimization of the retrieval and relevance models hurts overall performance. To address these issues, we propose a vision-language framework consisting of two parts. First, we train a base model on large-scale image-text pairs to learn general multimodal representations. Then, we fine-tune the base model on advertising business data, unifying relevance modeling and retrieval through multi-objective learning. Our framework has been implemented in the Baidu search advertising system "Phoenix Nest". Online evaluation shows that it improves cost per mille (CPM) and click-through rate (CTR) by 1.04% and 1.865%, respectively.
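The abstract does not spell out the training objective, but the multi-objective fine-tuning it describes is commonly realized as a weighted sum of a contrastive retrieval loss (InfoNCE, as in CLIP-style pre-training) and a relevance-classification loss. The sketch below is an illustration of that general pattern, not the paper's implementation; the function names, the binary-cross-entropy relevance head, and the `alpha` weighting are all assumptions.

```python
import numpy as np

def info_nce_loss(q_emb, i_emb, temperature=0.07):
    """Contrastive retrieval loss: matching query/image pairs sit on the diagonal."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    i = i_emb / np.linalg.norm(i_emb, axis=1, keepdims=True)
    logits = q @ i.T / temperature
    # numerically stable log-softmax over each query's row of candidates
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def relevance_loss(scores, labels):
    """Binary cross-entropy on sigmoid relevance scores (1 = relevant pair)."""
    p = 1.0 / (1.0 + np.exp(-scores))
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

def multi_objective_loss(q_emb, i_emb, rel_scores, rel_labels, alpha=0.5):
    """Joint objective: one backbone optimized for both retrieval and relevance."""
    return alpha * info_nce_loss(q_emb, i_emb) + (1 - alpha) * relevance_loss(rel_scores, rel_labels)
```

Training both objectives against a shared backbone is what lets one model serve ad image retrieval and relevance filtering, rather than optimizing two separate models as the abstract says prior systems did.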

