Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer

Published 22 Feb 2024 in cs.SD, cs.CV, cs.LG, eess.AS, and eess.SP (arXiv:2402.14205v1)

Abstract: Many deep learning synthetic speech generation tools are readily available. The use of synthetic speech has enabled financial fraud, impersonation of people, and the spread of misinformation. For this reason, forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset, and their performance degrades substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper, we propose the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time-domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset. Our experiments show that PS3DT performs well on ASVspoof2019 compared to other spectrogram-based approaches for synthetic speech detection. We also investigate the generalization performance of PS3DT on the In-the-Wild dataset; PS3DT generalizes better than several existing methods when detecting synthetic speech from an out-of-distribution dataset. Finally, we evaluate the robustness of PS3DT in detecting telephone-quality synthetic speech and synthetic speech shared on social platforms (compressed speech). PS3DT is robust to compression and detects telephone-quality synthetic speech better than several existing methods.
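The front end described in the abstract (time-domain signal → mel-spectrogram → non-overlapping patches fed to a transformer) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sample rate, FFT size, mel count, and 16×16 patch size here are illustrative assumptions, and the HTK-style mel scale and Hann-window STFT are common defaults rather than choices confirmed by the paper.

```python
import numpy as np

def mel_spectrogram(x, sr=16000, n_fft=512, hop=256, n_mels=64):
    """Log mel-spectrogram from a Hann-windowed STFT (illustrative parameters)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (frames, n_fft//2 + 1)

    # Triangular mel filterbank on the HTK-style mel scale.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return np.log(power @ fb.T + 1e-10)                     # (frames, n_mels)

def to_patches(spec, ph=16, pw=16):
    """Split a (time, mel) spectrogram into non-overlapping ph x pw patches,
    flattened into token vectors for a transformer encoder."""
    t = (spec.shape[0] // ph) * ph
    f = (spec.shape[1] // pw) * pw
    spec = spec[:t, :f]                                     # drop the ragged edge
    patches = spec.reshape(t // ph, ph, f // pw, pw).swapaxes(1, 2)
    return patches.reshape(-1, ph * pw)                     # (n_patches, ph*pw)

# 1 s of noise at 16 kHz stands in for a speech signal.
x = np.random.default_rng(0).standard_normal(16000)
tokens = to_patches(mel_spectrogram(x))
print(tokens.shape)                                         # → (12, 256)
```

Each row of `tokens` would be linearly projected and combined with a positional embedding before entering the transformer; that stage, and the detector head, are omitted here.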
