Towards generalizing deep-audio fake detection networks

Published 22 May 2023 in cs.SD, cs.LG, and eess.AS | (2305.13033v3)

Abstract: Today's generative neural networks allow the creation of high-quality synthetic speech at scale. While we welcome the creative use of this new technology, we must also recognize the risks. As synthetic speech is abused for monetary and identity theft, we require a broad set of deepfake identification tools. Furthermore, previous work reported a limited ability of deep classifiers to generalize to unseen audio generators. We study the frequency domain fingerprints of current audio generators. Building on top of the discovered frequency footprints, we train excellent lightweight detectors that generalize. We report improved results on the WaveFake dataset and an extended version. To account for the rapid progress in the field, we extend the WaveFake dataset by additionally considering samples drawn from the novel Avocodo and BigVGAN networks. For illustration purposes, the supplementary material contains audio samples of generator artifacts.
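As a rough illustration of what a frequency domain fingerprint can look like, the sketch below compares the average log-magnitude spectrum of a reference signal with that of a signal carrying an extra high-frequency component. It is a minimal toy example and not the authors' method: the abstract does not say which transform or detector the paper uses, so the plain Fourier transform, the function names, the frame length, and the synthetic sine-wave data are all assumptions made for illustration.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return |DFT| of a real-valued frame for bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


def mean_log_spectrum(signal, frame_len=64):
    """Average the log-magnitude spectrum over non-overlapping frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    if not frames:
        raise ValueError("signal shorter than one frame")
    total = [0.0] * (frame_len // 2 + 1)
    for frame in frames:
        for k, mag in enumerate(magnitude_spectrum(frame)):
            total[k] += math.log(mag + 1e-12)  # small offset avoids log(0)
    return [v / len(frames) for v in total]


def fingerprint_difference(real, fake, frame_len=64):
    """Per-bin difference of mean log-spectra (fake minus real).

    Large positive entries mark frequency bands where the second signal
    carries energy that the first one lacks.
    """
    r = mean_log_spectrum(real, frame_len)
    f = mean_log_spectrum(fake, frame_len)
    return [fv - rv for rv, fv in zip(r, f)]


# Hypothetical toy data: a 64 Hz tone stands in for real audio, and a copy
# with an added 384 Hz component stands in for a generator artifact.
SAMPLE_RATE = 1024
FRAME_LEN = 64
real = [math.sin(2 * math.pi * 64 * t / SAMPLE_RATE) for t in range(1024)]
fake = [x + 0.5 * math.sin(2 * math.pi * 384 * t / SAMPLE_RATE)
        for t, x in enumerate(real)]

diff = fingerprint_difference(real, fake, FRAME_LEN)
peak_bin = max(range(len(diff)), key=lambda k: diff[k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(f"largest spectral difference at bin {peak_bin} ({peak_hz:.0f} Hz)")
```

On real recordings one would average over many utterances per generator and would likely replace the naive DFT with an optimized FFT or another time-frequency transform; the averaging step is what lets a consistent artifact stand out from content that varies between utterances.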
