
Double Mixture: Towards Continual Event Detection from Speech

Published 20 Apr 2024 in cs.CL, cs.MM, cs.SD, and eess.AS | (2404.13289v2)

Abstract: Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing. Our approach achieves the lowest rates of forgetting and the highest levels of generalization, proving robust across various continual learning sequences. Our code and data are available at https://anonymous.4open.science/status/Continual-SpeechED-6461.

