Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model
Abstract: Every artist has a creative process that draws inspiration from previous artists and their works. Today, "inspiration" has been automated by generative music models. The black-box nature of these models obscures the identity of the works that influence their creative output. As a result, users may inadvertently appropriate, misuse, or copy existing artists' works. We establish a replicable methodology to systematically identify similar pieces of music audio in a manner that is useful for understanding training data attribution. A key aspect of our approach is to harness an effective music audio similarity measure. We compare the effect of applying CLMR and CLAP embeddings to similarity measurement in a set of 5 million audio clips used to train VampNet, a recent open-source generative music model. We validate this approach with a human listening study. We also explore the effect that modifications of an audio example (e.g., pitch shifting, time stretching, background noise) have on similarity measurements. This work is foundational to incorporating automated influence attribution into generative modeling, which promises to let model creators and users move from ignorant appropriation to informed creation. Audio samples that accompany this paper are available at https://tinyurl.com/exploring-musical-roots.
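The embedding-based similarity search summarized in the abstract can be illustrated with a minimal sketch. It assumes that each audio clip has already been mapped to a fixed-length embedding vector (for example by CLMR or CLAP) and that cosine similarity is the comparison measure; the abstract does not state the exact measure, so this is an assumption. The function names, clip identifiers, and toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query, library, k=3):
    """Return the k (clip_id, score) pairs most similar to the query embedding."""
    scored = [(clip_id, cosine_similarity(query, emb))
              for clip_id, emb in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional stand-ins for training-clip embeddings (hypothetical values).
training_embeddings = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.9, 0.1, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
}

# Hypothetical embedding of a generated clip; only its direction matters.
generated = [2.0, 0.0, 0.0]

ranked = top_k_similar(generated, training_embeddings, k=2)
for clip_id, score in ranked:
    print(f"{clip_id}: {score:.4f}")
```

A brute-force scan like this is only practical for small collections. For a set on the order of millions of clips, an approximate nearest-neighbor index would normally take its place.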