The Ethical Implications of Generative Audio Models: A Systematic Literature Review
Abstract: Generative audio models are applied mainly to music and speech generation, and recent models produce audio of human-like quality. This paper presents a systematic literature review of 884 papers on generative audio models, with two aims: to quantify how often researchers in the field consider potential negative impacts, and to identify the kinds of ethical implications they need to consider. Although 65% of generative audio research papers note potential positive impacts of their work, fewer than 10% discuss any negative impacts. This small share is particularly worrying because the few papers that do discuss negative impacts raise serious ethical concerns relevant to the broader field, such as the potential for fraud, deepfakes, and copyright infringement. By quantifying this lack of ethical consideration in generative audio research and identifying key areas of potential harm, this paper lays the groundwork for more conscientious research at a critical point in the field's development.