Enhancing Dynamic Image Advertising with Vision-Language Pre-training
Abstract: In the multimedia era, images are an effective medium in search advertising. Dynamic Image Advertising (DIA), a system that matches queries with ad images and generates multimodal ads, is introduced to improve user experience and ad revenue. The core of DIA is a query-image matching module that performs ad image retrieval and relevance modeling. Current query-image matching suffers from limited and inconsistent data and insufficient cross-modal interaction; moreover, optimizing the retrieval and relevance models separately hurts overall performance. To address these issues, we propose a vision-language framework consisting of two parts. First, we train a base model on large-scale image-text pairs to learn general multimodal representations. Then, we fine-tune the base model on advertising business data, unifying relevance modeling and retrieval through multi-objective learning. Our framework has been deployed in Baidu's search advertising system "Phoenix Nest". Online evaluation shows that it improves cost per mille (CPM) and click-through rate (CTR) by 1.04% and 1.865%, respectively.
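The abstract describes fine-tuning that unifies retrieval and relevance modeling through multi-objective learning. Below is a minimal sketch (not the authors' code) of one way such a combined objective can be set up: an in-batch image-text contrastive loss for retrieval plus a binary relevance-classification loss on fused query-image features. The encoder stand-ins, dimensions, and the loss weight `alpha` are illustrative assumptions, not details from the paper.

```python
# Sketch of multi-objective fine-tuning: contrastive retrieval loss + relevance loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryImageMatcher(nn.Module):
    def __init__(self, embed_dim: int = 256, temperature: float = 0.07):
        super().__init__()
        # Stand-ins for pre-trained text / image towers (e.g. a BERT and a ViT
        # would be used in practice; linear layers keep the sketch self-contained).
        self.text_proj = nn.Linear(768, embed_dim)
        self.image_proj = nn.Linear(768, embed_dim)
        # Lightweight fusion head scoring query-image relevance.
        self.relevance_head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )
        self.temperature = temperature

    def forward(self, text_feats, image_feats, relevance_labels, alpha: float = 0.5):
        # L2-normalized embeddings for the retrieval (contrastive) objective.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)

        # In-batch contrastive loss: matching pairs lie on the diagonal.
        logits = t @ v.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_retrieval = (F.cross_entropy(logits, targets)
                          + F.cross_entropy(logits.t(), targets)) / 2

        # Relevance objective: binary classification over fused embeddings.
        fused = torch.cat([t, v], dim=-1)
        rel_logits = self.relevance_head(fused).squeeze(-1)
        loss_relevance = F.binary_cross_entropy_with_logits(
            rel_logits, relevance_labels.float())

        # Multi-objective training: weighted sum of the two losses.
        return alpha * loss_retrieval + (1 - alpha) * loss_relevance


# Toy usage with random features standing in for encoder outputs.
model = QueryImageMatcher()
text_feats = torch.randn(8, 768)
image_feats = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,))
loss = model(text_feats, image_feats, labels)
loss.backward()
```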