Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
Abstract: Audio-visual speaker tracking has drawn increasing attention in recent years due to its academic value and wide range of applications. Audio and visual modalities can provide complementary information for localization and tracking. Using audio and visual measurements, Bayesian filters and deep learning-based methods can address data association, audio-visual fusion, and track management. In this paper, we present a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey of the field in the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, we summarize existing trackers and their performance on the AV16.3 dataset. Deep learning techniques have thrived in recent years, which has also boosted the development of audio-visual speaker tracking; we discuss their influence on both measurement extraction and state estimation. Finally, we discuss the connections between audio-visual speaker tracking and related areas such as speech separation and distributed speaker tracking.
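As a concrete illustration of the Bayesian tracking loop described above, the sketch below implements a minimal bootstrap particle filter that fuses a coarse acoustic position estimate (standing in for an audio localization front-end such as GCC-PHAT-based DOA estimation, here simulated) with a more precise visual detection under a constant-velocity motion model. The state layout, noise levels, and Gaussian likelihood models are illustrative assumptions for clarity, not the formulation of any specific tracker surveyed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(particles, dt=1.0, q=0.05):
    """Constant-velocity prediction; state rows are [x, y, vx, vy]."""
    particles[:, 0] += dt * particles[:, 2]
    particles[:, 1] += dt * particles[:, 3]
    particles += q * rng.standard_normal(particles.shape)  # process noise
    return particles

def update(weights, particles, z_audio, z_video, sig_a=0.3, sig_v=0.05):
    """Reweight particles by the product of independent Gaussian
    likelihoods for the audio and visual position measurements."""
    pos = particles[:, :2]
    lik_a = np.exp(-np.sum((pos - z_audio) ** 2, axis=1) / (2 * sig_a ** 2))
    lik_v = np.exp(-np.sum((pos - z_video) ** 2, axis=1) / (2 * sig_v ** 2))
    weights = weights * lik_a * lik_v + 1e-300  # avoid all-zero weights
    return weights / weights.sum()

def resample(particles, weights):
    """Systematic resampling to counter weight degeneracy."""
    n = len(weights)
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0  # guard against floating-point round-off
    idx = np.searchsorted(cdf, (rng.random() + np.arange(n)) / n)
    return particles[idx].copy(), np.full(n, 1.0 / n)

# Demo: a speaker moving diagonally, observed by noisy audio and video.
n = 500
particles = 0.5 * rng.standard_normal((n, 4))
weights = np.full(n, 1.0 / n)
for t in range(20):
    truth = np.array([0.1 * t, 0.05 * t])
    z_audio = truth + 0.3 * rng.standard_normal(2)   # coarse acoustic estimate
    z_video = truth + 0.05 * rng.standard_normal(2)  # sharper visual detection
    particles = predict(particles)
    weights = update(weights, particles, z_audio, z_video)
    estimate = weights @ particles[:, :2]             # posterior mean position
    particles, weights = resample(particles, weights)
    print(f"t={t:2d}  truth={truth.round(2)}  estimate={estimate.round(2)}")
```

In a real tracker, the Gaussian visual likelihood would be replaced by detector- or spatiogram-based likelihoods and the simulated acoustic estimate by GCC-PHAT or SRP-PHAT measurements, as covered by the works cited below.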
- S. T. Shivappa, B. D. Rao, and M. M. Trivedi, “Audio-visual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 882–894, 2010.
- G. Potamianos, C. Neti, and S. Deligne, “Joint audio-visual speech processing for recognition and enhancement,” in AVSP 2003-International Conference on Audio-Visual Speech Processing, 2003.
- I. D. Gebru, S. Ba, X. Li, and R. Horaud, “Audio-visual speaker diarization based on spatiotemporal Bayesian fusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. 1086–1099, 2017.
- P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2007.
- A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, and S. Pankanti, “Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking,” IEEE Signal Processing Magazine, vol. 22, no. 2, pp. 38–51, 2005.
- C. Pike, R. Taylor, T. Parnell, and F. Melchior, “Object-based 3d audio production for virtual reality using the audio definition model,” AES International Conference on Audio for Virtual and Augmented Reality, September 2016.
- P. Coleman, A. Franck, J. Francombe, Q. Liu, T. de Campos, R. J. Hughes, D. Menzies, M. F. S. Gálvez, Y. Tang, J. Woodcock, P. J. B. Jackson, F. Melchior, C. Pike, F. M. Fazi, T. J. Cox, and A. Hilton, “An audio-visual system for object-based audio: From recording to listening,” IEEE Transactions on Multimedia, vol. 20, no. 8, pp. 1919–1931, 2018.
- M. A. Mohd Izhar, M. Volino, A. Hilton, and P. J. B. Jackson, “Tracking sound sources for object-based spatial audio in 3D audio-visual production,” in Forum Acusticum, pp. 2051–2058, 2020.
- F. Schweiger, C. Pike, T. Nixon, M. Firth, B. Weir, P. Golds, M. Volino, C. Cieciura, M. Izhar, N. Graham-Rack, P. J. B. Jackson, and A. Ang, “Tools for 6-dof immersive audio-visual content capture and production,” in IBC, 2022.
- J. Zhao, P. Wu, X. Liu, Y. Xu, L. Mihaylova, S. Godsill, and W. Wang, “Audio-visual tracking of multiple speakers via a pmbm filter,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2022.
- V. Kılıç and W. Wang, “Audio-visual speaker tracking,” in Motion Tracking and Gesture Recognition, IntechOpen, 2017.
- X. Qian, A. Brutti, O. Lanz, M. Omologo, and A. Cavallaro, “Multi-speaker tracking from an audio–visual sensing device,” IEEE Transactions on Multimedia, vol. 21, no. 10, pp. 2576–2588, 2019.
- X. Qian, M. Madhavi, Z. Pan, J. Wang, and H. Li, “Multi-target doa estimation with an audio-visual fusion mechanism,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4280–4284, IEEE, 2021.
- S. T. Birchfield and S. Rangarajan, “Spatiograms versus histograms for region-based tracking,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, pp. 1158–1163, IEEE, 2005.
- T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
- D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
- M. Omologo, P. Svaizer, and R. De Mori, “Spoken dialogue with computers,” 1998.
- G. Lathoud and M. Magimai-Doss, “A sector-based, frequency-domain approach to detection and localization of multiple speakers,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05), vol. 3, pp. iii–265, IEEE, 2005.
- R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.
- Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, “Polyphonic sound event detection and localization using a two-stage strategy,” arXiv preprint arXiv:1905.00268, 2019.
- R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
- M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.
- R. P. S. Mahler, Statistical Multisource-Multitarget Information Fusion. USA: Artech House, Inc., 2007.
- R. P. Mahler, “Multitarget Bayes filtering via first-order multitarget moments,” IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1152–1178, 2003.
- B.-T. Vo, B.-N. Vo, and A. Cantoni, “The cardinality balanced multi-target multi-Bernoulli filter and its implementations,” IEEE Transactions on Signal Processing, vol. 57, no. 2, pp. 409–423, 2008.
- B.-T. Vo and B.-N. Vo, “Labeled random finite sets and multi-object conjugate priors,” IEEE Transactions on Signal Processing, vol. 61, no. 13, pp. 3460–3475, 2013.
- J. L. Williams, “Marginal multi-Bernoulli filters: RFS derivation of MHT, JIPDA, and association-based MeMBer,” IEEE Transactions on Aerospace and Electronic Systems, vol. 51, no. 3, pp. 1664–1687, 2015.
- Á. F. García-Fernández, J. L. Williams, K. Granström, and L. Svensson, “Poisson multi-Bernoulli mixture filter: Direct derivation and implementation,” IEEE Transactions on Aerospace and Electronic Systems, vol. 54, no. 4, pp. 1883–1901, 2018.
- Y. Xia, K. Granström, L. Svensson, and Á. F. García-Fernández, “Performance evaluation of multi-Bernoulli conjugate priors for multi-target filtering,” in 2017 20th International Conference on Information Fusion (Fusion), pp. 1–8, IEEE, 2017.
- S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granström, “Mono-camera 3d multi-object tracking using deep learning detections and pmbm filtering,” in 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 433–440, IEEE, 2018.
- S. Pang and H. Radha, “Multi-object tracking using Poisson multi-Bernoulli mixture filtering for autonomous vehicles,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7963–7967, IEEE, 2021.
- C. Schymura and D. Kolossa, “Audiovisual speaker tracking using nonlinear dynamical systems with dynamic stream weights,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1065–1078, 2020.
- C. Schymura, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa, “A dynamic stream weight backprop Kalman filter for audiovisual speaker tracking,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 581–585, IEEE, 2020.
- A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468, IEEE, 2016.
- H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
- N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649, IEEE, 2017.
- E. A. Wan, R. Van Der Merwe, and S. Haykin, “The unscented Kalman filter,” Kalman Filtering and Neural Networks, vol. 5, no. 2007, pp. 221–280, 2001.
- E. D’Arca, N. M. Robertson, and J. Hopgood, “Person tracking via audio and video fusion,” 2012.
- J. S. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic systems,” Journal of the American Statistical Association, vol. 93, no. 443, pp. 1032–1044, 1998.
- G. Kitagawa, “Monte Carlo filter and smoother for non-Gaussian nonlinear state space models,” Journal of Computational and Graphical Statistics, vol. 5, no. 1, pp. 1–25, 1996.
- Z. Fang, G. Tong, and X. Xu, “Particle swarm optimized particle filter,” Control and Decision, vol. 22, no. 3, p. 273, 2007.
- M. Tian, Y. Bo, Z. Chen, P. Wu, and C. Yue, “A new improved firefly clustering algorithm for smc-phd filter,” Applied Soft Computing, vol. 85, p. 105840, 2019.
- M. Tian, Z. Chen, H. Wang, and L. Liu, “An intelligent particle filter for infrared dim small target detection and tracking,” IEEE Transactions on Aerospace and Electronic Systems, 2022.
- B.-N. Vo and W.-K. Ma, “The Gaussian mixture probability hypothesis density filter,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4091–4104, 2006.
- B.-N. Vo, S. Singh, and A. Doucet, “Sequential Monte Carlo methods for multitarget filtering with random finite sets,” IEEE Transactions on Aerospace and Electronic Systems, vol. 41, no. 4, pp. 1224–1245, 2005.
- Y. Liu, W. Wang, J. Chambers, V. Kilic, and A. Hilton, “Particle flow smc-phd filter for audio-visual multi-speaker tracking,” in International Conference on Latent Variable Analysis and Signal Separation, pp. 344–353, Springer, 2017.
- F. Daum and J. Huang, “Nonlinear filters with log-homotopy,” in Signal and Data Processing of Small Targets 2007, vol. 6699, pp. 423–437, SPIE, 2007.
- Y. Ban, X. Alameda-Pineda, L. Girin, and R. Horaud, “Variational Bayesian inference for audio-visual tracking of multiple speakers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1761–1776, 2019.
- W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, “Robust multi-modality multi-object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2365–2374, 2019.
- A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361, IEEE, 2012.
- P. Ondruska and I. Posner, “Deep tracking: Seeing beyond seeing using recurrent neural networks,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- A. Milan, S. H. Rezatofighi, A. Dick, I. Reid, and K. Schindler, “Online multi-target tracking using recurrent neural networks,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Trackformer: Multi-object tracking with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8844–8854, 2022.
- P. Sun, J. Cao, Y. Jiang, R. Zhang, E. Xie, Z. Yuan, C. Wang, and P. Luo, “Transtrack: Multiple object tracking with transformer,” arXiv preprint arXiv:2012.15460, 2020.
- Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” in European Conference on Computer Vision, pp. 1–21, Springer, 2022.
- Z. Qin, S. Zhou, L. Wang, J. Duan, G. Hua, and W. Tang, “Motiontrack: Learning robust short-term and long-term motions for multi-object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17939–17948, 2023.
- Y. Xu, X. Zhou, S. Chen, and F. Li, “Deep learning for multiple object tracking: a survey,” IET Computer Vision, vol. 13, no. 4, pp. 355–368, 2019.
- G. Wang, M. Song, and J.-N. Hwang, “Recent advances in embedding methods for multi-object tracking: a survey,” arXiv preprint arXiv:2205.10766, 2022.
- J. Willes, C. Reading, and S. L. Waslander, “Intertrack: Interaction transformer for 3d multi-object tracking,” in 2023 20th Conference on Robots and Vision (CRV), pp. 73–80, IEEE, 2023.
- X. Geng, M. Li, W. Liu, S. Zhu, H. Jiang, J. Bian, X. Fan, R. Peng, and J. Luo, “Person tracking by detection using dual visible-infrared cameras,” IEEE Internet of Things Journal, vol. 9, no. 22, pp. 23241–23251, 2022.
- X. Qian, Z. Wang, J. Wang, G. Guan, and H. Li, “Audio-visual cross-attention network for robotic speaker tracking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 550–562, 2022.
- T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel, “Backprop KF: Learning discriminative deterministic state estimators,” Advances in Neural Information Processing Systems, vol. 29, 2016.
- S. Kim, I. Petrunin, and H.-S. Shin, “A review of Kalman filter with artificial intelligence techniques,” in 2022 Integrated Communication, Navigation and Surveillance Conference (ICNS), pp. 1–12, IEEE, 2022.
- R. Jonschkowski, D. Rastogi, and O. Brock, “Differentiable particle filters: End-to-end learning with algorithmic priors,” arXiv preprint arXiv:1805.11122, 2018.
- A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
- P. Karkus, D. Hsu, and W. S. Lee, “Particle filter networks with application to visual localization,” in Conference on robot learning, pp. 169–178, PMLR, 2018.
- M. Zhu, K. Murphy, and R. Jonschkowski, “Towards differentiable resampling,” arXiv preprint arXiv:2004.11938, 2020.
- X. Ma, P. Karkus, D. Hsu, and W. S. Lee, “Particle filter recurrent neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5101–5108, 2020.
- Y. Zhang, M. Yu, H. Zhang, D. Yu, and D. Wang, “KalmanNet: A learnable Kalman filter for acoustic echo cancellation,” arXiv preprint arXiv:2301.12363, 2023.
- S. Jouaber, S. Bonnabel, S. Velasco-Forero, and M. Pilte, “NNAKF: A neural network adapted Kalman filter for target tracking,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4075–4079, IEEE, 2021.
- J. Zhao, M. Plumbley, W. Wang, et al., “Attention-based end-to-end differentiable particle filter for audio speaker tracking,” 2023.
- M. Barnard, P. Koniusz, W. Wang, J. Kittler, S. M. Naqvi, and J. Chambers, “Robust multi-speaker tracking via dictionary learning and identity modeling,” IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 864–880, 2014.
- V. Kılıç, M. Barnard, W. Wang, and J. Kittler, “Audio assisted robust visual tracking with adaptive particle filtering,” IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 186–200, 2014.
- V. Kılıç, M. Barnard, W. Wang, A. Hilton, and J. Kittler, “Mean-shift and sparse sampling-based smc-phd filtering for audio informed visual speaker tracking,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2417–2431, 2016.
- X. Qian, A. Brutti, M. Omologo, and A. Cavallaro, “3d audio-visual speaker tracking with an adaptive particle filter,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2896–2900, IEEE, 2017.
- X. Qian, A. Xompero, A. Cavallaro, A. Brutti, O. Lanz, and M. Omologo, “3d mouth tracking from a compact microphone array co-located with a camera,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3071–3075, IEEE, 2018.
- Y. Liu, V. Kılıç, J. Guan, and W. Wang, “Audio–visual particle flow smc-phd filtering for multi-speaker tracking,” IEEE Transactions on Multimedia, vol. 22, no. 4, pp. 934–948, 2019.
- H. Liu, Y. Li, and B. Yang, “3d audio-visual speaker tracking with a two-layer particle filter,” in 2019 IEEE International Conference on Image Processing (ICIP), pp. 1955–1959, IEEE, 2019.
- S. Lin and X. Qian, “Audio-visual multi-speaker tracking based on the GLMB framework,” in INTERSPEECH, pp. 3082–3086, 2020.
- H. Liu, Y. Sun, Y. Li, and B. Yang, “3d audio-visual speaker tracking with a novel particle filter,” in 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7343–7348, IEEE, 2021.
- Y. Li, H. Liu, and H. Tang, “Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 1456–1463, 2022.
- J. Zhao, P. Wu, X. Liu, S. Goudarzi, H. Liu, Y. Xu, and W. Wang, “Audio visual multi-speaker tracking with improved gcf and pmbm filter,” Proc. Interspeech 2022, pp. 3704–3708, 2022.
- F. Sanabria-Macias, M. Marron-Romera, and J. Macias-Guarasa, “Audiovisual tracking of multiple speakers in smart spaces,” Sensors, vol. 23, no. 15, p. 6969, 2023.
- Y. Liu, Y. Xu, P. Wu, and W. Wang, “Labelled non-zero diffusion particle flow smc-phd filtering for multi-speaker tracking,” IEEE Transactions on Multimedia, 2023.
- D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564–577, 2003.
- X. Qian, A. Brutti, O. Lanz, M. Omologo, and A. Cavallaro, “Audio-visual tracking of concurrent speakers,” IEEE Transactions on Multimedia, 2021.
- Y. Li, H. Liu, and H. Tang, “Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking,” arXiv preprint arXiv:2112.07423, 2021.
- X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
- J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang, “Dsfd: dual shot face detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5060–5069, 2019.
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision, pp. 21–37, Springer, 2016.
- Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299, 2017.
- J. Wilson and M. C. Lin, “Avot: Audio-visual object tracking of multiple objects for robotics,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 10045–10051, IEEE, 2020.
- Q. Wang, L. Chai, H. Wu, Z. Nian, S. Niu, S. Zheng, Y. Wang, L. Sun, Y. Fang, J. Pan, et al., “The nerc-slip system for sound event localization and detection of dcase2022 challenge,” DCASE2022 Challenge, Tech. Rep., 2022.
- J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘Siamese’ time delay neural network,” Advances in Neural Information Processing Systems, vol. 6, 1993.
- L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional Siamese networks for object tracking,” in European Conference on Computer Vision, pp. 850–865, Springer, 2016.
- C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.
- H. Do, H. F. Silverman, and Y. Yu, “A real-time SRP-PHAT source location implementation using stochastic region contraction(SRC) on a large-aperture microphone array,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. I–121–I–124, 2007.
- M. Omologo and P. Svaizer, “Use of the crosspower-spectrum phase in acoustic event location,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, pp. 288–292, 1997.
- A. Brutti, M. Omologo, and P. Svaizer, “Localization of multiple speakers based on a two step acoustic map analysis,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4349–4352, IEEE, 2008.
- R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge University Press, 2003.
- H. Sawada, R. Mukai, S. Araki, and S. Makino, “Multiple source localization using independent component analysis,” in 2005 IEEE Antennas and Propagation Society International Symposium, vol. 4, pp. 81–84, IEEE, 2005.
- I. Trowitzsch, C. Schymura, D. Kolossa, and K. Obermayer, “Joining sound event detection and localization through spatial segregation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 487–502, 2019.
- S. Adavanne, A. Politis, and T. Virtanen, “Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network,” in 2018 26th European Signal Processing Conference (EUSIPCO), pp. 1462–1466, IEEE, 2018.
- J. M. Vera-Diaz, D. Pizarro, and J. Macias-Guarasa, “Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates,” Sensors, vol. 18, no. 10, p. 3418, 2018.
- D. Berghi, A. Hilton, and P. J. B. Jackson, “Visually supervised speaker detection and localization via microphone array,” in IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), 2021.
- X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, “A learning-based approach to direction of arrival estimation in noisy and reverberant environments,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2814–2818, 2015.
- J. M. Vera-Diaz, D. Pizarro, and J. Macias-Guarasa, “Acoustic source localization with deep generalized cross correlations,” Signal Processing, vol. 187, p. 108169, 2021.
- S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2018.
- S. Adavanne, A. Politis, and T. Virtanen, “Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network,” arXiv preprint arXiv:1904.12769, 2019.
- S. Adavanne, P. Pertilä, and T. Virtanen, “Sound event detection using spatial features and convolutional recurrent neural network,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 771–775, IEEE, 2017.
- H. C. Maruri, P. López-Meyer, J. Huang, J. A. del Hoyo Ontiveros, and H. Lu, “Gcc-phat cross-correlation audio features for simultaneous sound event localization and detection (seld) on multiple rooms,” in Workshop on Detection and Classification of Acoustic Scenes and Events, 2019.
- D. Berghi and P. J. B. Jackson, “Audio inputs for active speaker detection and localization via microphone array,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2023.
- C. Schymura, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa, “Exploiting attention-based sequence-to-sequence architectures for sound event localization,” in 2020 28th European Signal Processing Conference (EUSIPCO), pp. 231–235, IEEE, 2021.
- C. Schymura, B. Bönninghoff, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa, “Pilot: Introducing transformers for probabilistic sound event localization,” arXiv preprint arXiv:2106.03903, 2021.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- T. N. T. Nguyen, K. N. Watcharasupat, N. K. Nguyen, D. L. Jones, and W.-S. Gan, “Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 30, pp. 1749–1762, 2022.
- T. N. Tho Nguyen, D. L. Jones, K. N. Watcharasupat, H. Phan, and W.-S. Gan, “SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 716–720, IEEE, 2022.
- G. Lathoud, J.-M. Odobez, and D. Gatica-Perez, “AV16.3: An audio-visual corpus for speaker localization and tracking,” in International Workshop on Machine Learning for Multimodal Interaction, pp. 182–195, Springer, 2004.
- G. Hinton, O. Vinyals, J. Dean, et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.
- J. Zhao, P. Wu, S. Goudarzi, X. Liu, J. Sun, Y. Xu, and W. Wang, “Visually assisted self-supervised audio speaker localization and tracking,” in 2022 30th European Signal Processing Conference (EUSIPCO), pp. 787–791, IEEE, 2022.
- G. Irie, M. Ostrek, H. Wang, H. Kameoka, A. Kimura, T. Kawanishi, and K. Kashino, “Seeing through sounds: Predicting visual semantic segmentation results from multichannel audio signals,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3961–3964, IEEE, 2019.
- A. B. Vasudevan, D. Dai, and L. Van Gool, “Semantic object prediction and spatial sound super-resolution with binaural sounds,” in European Conference on Computer Vision, pp. 638–655, Springer, 2020.
- Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” Advances in Neural Information Processing Systems, vol. 29, pp. 892–900, 2016.
- C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba, “Self-supervised moving vehicle tracking with stereo sound,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7053–7062, 2019.
- F. R. Valverde, J. V. Hurtado, and A. Valada, “There is more than meets the eye: Self-supervised multi-object detection and tracking with sound by distilling multimodal knowledge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11612–11621, 2021.
- D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, “An overview of deep-learning-based audio-visual speech enhancement and separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- X. Wu, Z. Wu, L. Ju, and S. Wang, “Binaural audio-visual localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2961–2968, 2021.
- S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, “Joint probabilistic data association revisited,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3047–3055, 2015.
- C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, “Multiple hypothesis tracking revisited,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4696–4704, 2015.
- I. D. Gebru, S. Ba, G. Evangelidis, and R. Horaud, “Tracking the active speaker based on a joint audio-visual observation model,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 15–21, 2015.
- A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, “Co-localization of audio sources in images using binaural features and locally-linear regression,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 4, pp. 718–731, 2015.
- X. Alameda-Pineda, J. Sanchez-Riera, J. Wienke, V. Franc, J. Čech, K. Kulkarni, A. Deleforge, and R. Horaud, “Ravel: An annotated corpus for training robots with audiovisual abilities,” Journal on Multimodal User Interfaces, vol. 7, no. 1, pp. 79–91, 2013.
- E. Arnaud, H. Christensen, Y.-C. Lu, J. Barker, V. Khalidov, M. Hansard, B. Holveck, H. Mathieu, R. Narasimha, E. Taillant, et al., “The CAVA corpus: Synchronised stereoscopic and binaural datasets with head movements,” in Proceedings of the 10th International Conference on Multimodal Interfaces, pp. 109–116, 2008.
- M. Taj, “Surveillance performance evaluation initiative (SPEVI)—audiovisual people dataset,” Internet: http://www.eecs.qmul.ac.uk/~andrea/spevi.html, 2007.
- J. Carletta, “Announcing the AMI meeting corpus,” The ELRA Newsletter, vol. 11, no. 1, pp. 3–5, 2006.
- D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. M. Chu, A. Tyagi, J. R. Casas, J. Turmo, L. Cristoforetti, F. Tobia, et al., “The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms,” Language Resources and Evaluation, vol. 41, no. 3, pp. 389–407, 2007.
- Q. Liu, W. Wang, T. de Campos, P. J. Jackson, and A. Hilton, “Multiple speaker tracking in spatial audio via phd filtering and depth-audio fusion,” IEEE Transactions on Multimedia, vol. 20, no. 7, pp. 1767–1780, 2017.
- D. Berghi, M. Volino, and P. J. B. Jackson, “Tragic Talkers: A Shakespearean sound- and light-field dataset for audio-visual machine learning research,” in European Conference on Visual Media Production, 2022.
- W. He, P. Motlicek, and J.-M. Odobez, “Deep neural networks for multiple speaker detection and localization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 74–79, IEEE, 2018.
- A. Perez, V. Sanguineti, P. Morerio, and V. Murino, “Audio-visual model distillation using acoustic images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2854–2863, 2020.
- J. Donley, V. Tourbabin, J.-S. Lee, M. Broyles, H. Jiang, J. Shen, M. Pantic, V. K. Ithapu, and R. Mehra, “Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments,” arXiv preprint arXiv:2107.04174, 2021.
- D. Schuhmacher, B.-T. Vo, and B.-N. Vo, “A consistent metric for performance evaluation of multi-object filters,” IEEE Transactions on Signal Processing, vol. 56, no. 8, pp. 3447–3457, 2008.
- K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
- J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “HOTA: A higher order metric for evaluating multi-object tracking,” International Journal of Computer Vision, vol. 129, no. 2, pp. 548–578, 2021.
- Y. Luo and N. Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700, IEEE, 2018.
- Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” arXiv preprint arXiv:1810.04826, 2018.
- J. Ong, B. T. Vo, S. E. Nordholm, B. N. Vo, D. Moratuwage, and C. Shim, “Audio-visual based online multi-source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022.
- S. E. Nordholm, H. H. Dam, C. C. Lai, and E. A. Lehmann, “Broadband beamforming and optimization,” in Academic Press Library in Signal Processing, vol. 3, pp. 553–598, Elsevier, 2014.
- A. S. Subramanian, C. Weng, S. Watanabe, M. Yu, Y. Xu, S.-X. Zhang, and D. Yu, “Directional asr: A new paradigm for e2e multi-speaker speech recognition with source localization,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8433–8437, IEEE, 2021.
- Z.-Q. Wang, X. Zhang, and D. Wang, “Robust speaker localization guided by deep learning-based time-frequency masking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 178–188, 2018.
- A. S. Subramanian, C. Weng, S. Watanabe, M. Yu, and D. Yu, “Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition,” Computer Speech & Language, vol. 75, p. 101360, 2022.
- Y. Chen, X. Qian, Z. Pan, K. Chen, and H. Li, “Locselect: Target speaker localization with an auditory selective hearing mechanism,” arXiv preprint arXiv:2310.10497, 2023.
- S. Bhatti and J. Xu, “Survey of target tracking protocols using wireless sensor network,” in Fifth International Conference on Wireless and Mobile Communications, pp. 110–115, IEEE, 2009.
- S. Wang and A. Dekorsy, “Distributed consensus-based extended Kalman filtering: A Bayesian perspective,” in 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5, IEEE, 2019.
- H. Rezaei, R. Mahboobi Esfanjani, A. Akbari, and M. H. Sedaaghi, “Event-triggered distributed Kalman filter with consensus on estimation for state-saturated systems,” International Journal of Robust and Nonlinear Control, vol. 30, no. 18, pp. 8327–8339, 2020.
- K. Ma, L. Xu, and H. Fan, “Unscented Kalman filtering for target tracking systems with packet dropout compensation,” IET Control Theory & Applications, vol. 13, no. 12, pp. 1901–1908, 2019.
- P. Wu, J. Zhao, S. Goudarzi, and W. Wang, “Partial arithmetic consensus based distributed intensity particle flow smc-phd filter for multi-target tracking,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5078–5082, IEEE, 2022.
- D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al., “Scaling egocentric vision: The epic-kitchens dataset,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 720–736, 2018.
- K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al., “Ego4d: Around the world in 3,000 hours of egocentric video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022.
- J. Zhao, Y. Xu, X. Qian, and W. Wang, “Audio visual speaker localization from egocentric views,” arXiv preprint arXiv:2309.16308, 2023.
- X. Wang et al., “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13763–13773, 2021.
- Y. Jiang et al., “Prompt-driven target speech diarization,” arXiv preprint arXiv:2310.14823, 2023.