Double Mixture: Towards Continual Event Detection from Speech
Abstract: Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing. Our approach achieves the lowest rates of forgetting and the highest levels of generalization, proving robust across various continual learning sequences. Our code and data are available at https://anonymous.4open.science/status/Continual-SpeechED-6461.