
VIP5: Towards Multimodal Foundation Models for Recommendation

Published 23 May 2023 in cs.IR, cs.AI, cs.HC, cs.LG, and cs.MM (arXiv:2305.14302v2)

Abstract: Computer Vision (CV), NLP, and Recommender Systems (RecSys) are three prominent AI applications that have traditionally developed independently, resulting in disparate modeling and engineering methodologies. This has impeded the ability for these fields to directly benefit from each other's advancements. With the recent development of foundation models, LLMs have emerged as a potential general-purpose interface for unifying different modalities and problem formulations. In light of this, we propose the development of a multimodal foundation model (MFM) considering visual, textual, and personalization modalities under the P5 recommendation paradigm, thus named VIP5 (Visual P5), to unify various modalities and recommendation tasks. This will enable the processing of multiple modalities in a shared architecture for improved recommendations. To achieve this, we introduce multimodal personalized prompts to accommodate multiple modalities under a shared format. Additionally, we propose a parameter-efficient training method for foundation models, which involves freezing the P5 backbone and fine-tuning lightweight adapters, resulting in improved recommendation performance and increased efficiency in terms of training time and memory usage. Code and data of VIP5 are available at https://github.com/jeykigung/VIP5.
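The parameter-efficient training method described above (freezing the backbone and fine-tuning small adapter modules) can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not VIP5's actual implementation: it shows the standard bottleneck-adapter pattern (down-projection, nonlinearity, up-projection, residual connection) in plain NumPy, with illustrative dimensions, and contrasts the adapter's trainable parameter count with the size of a frozen backbone.

```python
import numpy as np

# Hypothetical sketch of bottleneck-adapter tuning: the backbone weights
# stay frozen and only the small adapter matrices are trained. The
# dimensions (d_model=768, d_bottleneck=64) are illustrative, not VIP5's.

def gelu(x):
    # Tanh approximation of GELU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Adapter:
    def __init__(self, d_model=768, d_bottleneck=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        # Zero-initialized up-projection => the adapter starts as identity,
        # so inserting it does not perturb the frozen backbone's outputs.
        self.W_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, h):
        # Residual bottleneck: h + up(gelu(down(h)))
        return h + gelu(h @ self.W_down) @ self.W_up

    def num_params(self):
        return self.W_down.size + self.W_up.size

adapter = Adapter()
h = np.random.default_rng(1).normal(size=(4, 768))  # 4 token hidden states
out = adapter(h)

print(out.shape)             # (4, 768)
print(adapter.num_params())  # 98304 trainable parameters per adapter
```

With ~98K trainable parameters per adapter versus hundreds of millions in a typical frozen encoder-decoder backbone, only a small fraction of the model receives gradient updates, which is the source of the training-time and memory savings the abstract reports.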

