
Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction

Published 10 Jun 2023 in eess.AS and cs.SD (arXiv:2306.06495v1)

Abstract: This paper describes an audio-visual speech enhancement (AV-SE) method that estimates, from noisy input audio, a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and that of a selected speaker who does not appear in the video (off-screen target speech). Conventional AV-SE methods suppress all off-screen sounds, but future applications of AV-SE (e.g., hearing aids) will need to let users listen to a specific, pre-known speaker (e.g., a family member's voice or a station announcement) even when that speaker is out of the user's sight. To overcome this limitation, we extract a visual clue for the on-screen target speech from the input video and a voiceprint clue for the off-screen one from a pre-recorded speech sample of that speaker. The two clues, drawn from different domains, are integrated into a single audio-visual clue, and the proposed model directly estimates the target mixture. To improve estimation accuracy, we introduce a temporal attention mechanism for the voiceprint clue and propose a training strategy called the muting strategy. Experimental results show that our method outperforms a baseline that applies state-of-the-art AV-SE and speaker extraction methods separately, in terms of both estimation accuracy and computational efficiency.
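The abstract's core idea, applying temporal attention to a fixed voiceprint embedding and fusing the result with per-frame visual features into one audio-visual clue, can be illustrated with a minimal NumPy sketch. All names, dimensions, and the dot-product attention form here are assumptions for illustration; the paper's actual architecture is not specified in this listing.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(audio_frames, voiceprint):
    """Weight the static voiceprint per time frame (hypothetical form).

    audio_frames: (T, D) per-frame embeddings of the noisy mixture
    voiceprint:   (D,)   utterance-level embedding of the enrolled speaker
    """
    scores = audio_frames @ voiceprint / np.sqrt(audio_frames.shape[1])
    weights = softmax(scores)                 # (T,) frame-level relevance
    # broadcast the voiceprint along time, scaled by its attention weight
    return weights[:, None] * voiceprint[None, :]   # (T, D) time-varying clue

def fuse_clues(visual_clue, voiceprint_clue):
    # concatenate per-frame visual and voiceprint clues into one AV clue
    return np.concatenate([visual_clue, voiceprint_clue], axis=1)

rng = np.random.default_rng(0)
T, D = 50, 16
audio = rng.standard_normal((T, D))    # noisy-mixture frame embeddings
visual = rng.standard_normal((T, D))   # lip-region features (on-screen clue)
vp = rng.standard_normal(D)            # pre-recorded speaker voiceprint

vp_clue = temporal_attention(audio, vp)
av_clue = fuse_clues(visual, vp_clue)
print(av_clue.shape)  # (50, 32)
```

The sketch shows why a temporal mechanism helps: a single utterance-level voiceprint is constant over time, so attention lets the model emphasize it only in frames where the off-screen speaker is likely active.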

