MuseChat: A Conversational Music Recommendation System for Videos

Published 10 Oct 2023 in cs.LG, cs.CV, and cs.IR (arXiv:2310.06282v4)

Abstract: Music recommendation for videos is attracting growing interest in multi-modal research. However, existing systems focus primarily on content compatibility and often ignore users' preferences. Their inability to interact with users for further refinement, or to explain their suggestions, leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video, along with optional information including previously suggested music and the user's preferences, as input and retrieves appropriate music matching the context. The reasoning module, built on an LLM (Vicuna-7B) and extended to multi-modal inputs, provides a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build a large-scale dataset, conversational music recommendation for videos, that simulates a two-turn interaction between a user and a recommender based on accurate music track information. Experimental results show that MuseChat achieves significant improvements over existing video-based music retrieval methods while also offering strong interpretability and interactivity.
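The abstract describes the recommendation module as retrieving music conditioned on the video plus optional context (a previously suggested track and the user's stated preference). The paper's actual fusion and training details are not given here; the following is a minimal illustrative sketch, assuming all inputs are already mapped into a shared embedding space and using simple additive fusion with cosine-similarity ranking (the fusion weights and function names are assumptions, not the paper's method).

```python
import numpy as np

def retrieve_music(video_emb, music_embs, pref_emb=None,
                   prev_music_emb=None, top_k=3):
    """Toy retrieval step: fuse the video embedding with optional
    user-preference and previously-suggested-track embeddings, then
    rank candidate tracks by cosine similarity. Fusion weights are
    illustrative, not from the paper."""
    query = video_emb.astype(float).copy()
    if pref_emb is not None:
        query += pref_emb              # steer toward the stated preference
    if prev_music_emb is not None:
        query += 0.5 * prev_music_emb  # retain some first-turn context
    query /= np.linalg.norm(query)

    # L2-normalize candidates so the dot product is cosine similarity.
    norms = np.linalg.norm(music_embs, axis=1, keepdims=True)
    scores = (music_embs / norms) @ query

    # Indices of the top_k highest-scoring tracks, best first.
    return np.argsort(scores)[::-1][:top_k]

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=128)
tracks = rng.normal(size=(100, 128))
pref = rng.normal(size=128)
print(retrieve_music(video, tracks, pref_emb=pref))
```

In a two-turn interaction, the second call would pass the first turn's chosen track as `prev_music_emb` together with the user's refinement as `pref_emb`, shifting the ranking without discarding the video context.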
