Knowledge Distillation for Real-Time Classification of Early Media in Voice Communications
Abstract: This paper investigates the industrial problem of real-time classification of early media exchanged during the initialization phase of voice calls. We explore the application of state-of-the-art audio tagging models and highlight limitations that arise when they are applied to early media. While most existing approaches rely on convolutional neural networks, we propose a novel low-resource approach based on gradient-boosted trees. Our approach not only delivers a substantial improvement in runtime performance but also achieves comparable accuracy. We show that leveraging knowledge distillation and class aggregation to train a simpler and smaller model accelerates the classification of early media in voice calls. We provide a detailed analysis of accuracy and runtime performance on a proprietary dataset and a publicly available one, and additionally report a case study of the performance improvements achieved at a regional data center in India.
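The pipeline the abstract describes can be sketched in a minimal form: a teacher model's fine-grained class probabilities are aggregated into a few coarse early-media classes, and a small student is trained on these soft targets. Everything below is illustrative: the class mapping, feature dimensions, and the stand-in linear teacher and logistic-regression student are assumptions (the paper's student is a gradient-boosted tree ensemble, e.g. LightGBM, and its teacher is a pretrained audio tagging network).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical aggregation: collapse fine-grained teacher classes
# (e.g. AudioSet-style tags) into a few coarse early-media classes.
FINE_TO_COARSE = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
N_FINE, N_COARSE, N_FEAT, N_SAMPLES = 5, 3, 8, 200

def aggregate(probs_fine):
    """Sum fine-class probabilities into their coarse class, then renormalize."""
    out = np.zeros((probs_fine.shape[0], N_COARSE))
    for fine, coarse in FINE_TO_COARSE.items():
        out[:, coarse] += probs_fine[:, fine]
    return out / out.sum(axis=1, keepdims=True)

# Stand-in "teacher": a random linear model emitting fine-class probabilities.
X = rng.normal(size=(N_SAMPLES, N_FEAT))
teacher_logits = X @ rng.normal(size=(N_FEAT, N_FINE))
exp_l = np.exp(teacher_logits - teacher_logits.max(axis=1, keepdims=True))
probs_fine = exp_l / exp_l.sum(axis=1, keepdims=True)
soft_targets = aggregate(probs_fine)  # distillation targets for the student

# Minimal student: multinomial logistic regression fit to the soft targets
# by gradient descent on the cross-entropy loss.
W = np.zeros((N_FEAT, N_COARSE))
for _ in range(300):
    z = X @ W
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * X.T @ (p - soft_targets) / N_SAMPLES

student_pred = (X @ W).argmax(axis=1)
teacher_pred = soft_targets.argmax(axis=1)
agreement = (student_pred == teacher_pred).mean()
```

In the paper's setting the student would be a LightGBM model trained on lightweight features (e.g. MFCCs), but the structure is the same: soft labels from the teacher, aggregated into the coarse classes that matter for early-media handling.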