Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection

Published 2 Nov 2024 in eess.AS and cs.SD (arXiv:2411.01174v2)

Abstract: Sound Event Detection (SED) is challenging in noisy environments where overlapping sounds obscure target events. Language-queried audio source separation (LASS) aims to isolate the target sound events from a noisy clip. However, this approach can fail when the exact target sound is unknown, particularly in noisy test sets, leading to reduced performance. To address this issue, we leverage the capabilities of LLMs to analyze and summarize acoustic data. By using LLMs to identify and select specific noise types, we implement a noise augmentation method for noise-robust fine-tuning. The fine-tuned model is then used to produce clip-wise event predictions, which serve as text queries for the LASS model. Our studies demonstrate that the proposed method improves SED performance in noisy environments. This work represents an early application of LLMs in noise-robust SED and suggests a promising direction for handling overlapping events in SED. Code and pretrained models are available at https://github.com/apple-yinhan/Noise-robust-SED.
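
The abstract sketches a two-stage pipeline: an LLM selects noise types used to augment training data for noise-robust fine-tuning, and at inference the fine-tuned model's clip-wise predictions serve as the text query for the LASS separator before frame-level SED is run on the separated audio. The sketch below illustrates one way such a pipeline could be assembled; the event label list, the SNR-mixing helper, and the `tagging_model`, `lass_model`, and `sed_model` interfaces are hypothetical placeholders rather than the authors' implementation (see the linked repository for the actual code).

```python
import numpy as np

# Example event labels (placeholder list; the actual label set depends on the
# training data used by the authors).
EVENT_CLASSES = [
    "Alarm_bell_ringing", "Blender", "Cat", "Dishes", "Dog",
    "Electric_shaver_toothbrush", "Frying", "Running_water",
    "Speech", "Vacuum_cleaner",
]

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Noise augmentation: mix an LLM-selected noise clip into a clean clip
    at a target SNR (hypothetical helper, not the authors' exact recipe)."""
    noise = np.resize(noise, clean.shape)            # tile/trim noise to clip length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def predict_clip_events(audio: np.ndarray, tagging_model, threshold: float = 0.5):
    """Clip-wise tagging with the noise-robust fine-tuned model; `tagging_model`
    is assumed to return one posterior probability per event class."""
    probs = tagging_model(audio)                     # shape: (num_classes,)
    return [c for c, p in zip(EVENT_CLASSES, probs) if p >= threshold]

def noise_robust_sed(audio: np.ndarray, tagging_model, lass_model, sed_model):
    """Tag the noisy clip, query the LASS separator with the predicted event
    names, then run frame-level SED on the separated audio."""
    events = predict_clip_events(audio, tagging_model)
    query = ", ".join(events) if events else "sound events"
    separated = lass_model(audio, query)             # text-queried source separation
    return sed_model(separated)                      # frame-level event posteriors
```

In this sketch the text query is simply the comma-joined list of predicted event names; any query format accepted by the separation model would work equally well here.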

