Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Published 15 Sep 2023 in cs.SD and eess.AS | (2309.08408v1)

Abstract: Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture, as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario accounts for only a small percentage of real-world conversations. In this paper, we target sparsely overlapped scenarios, in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization, a cue that could be used for speaker disentanglement. Experimental results show that our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.
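
The abstract reports its results in SI-SNR (scale-invariant signal-to-noise ratio) without defining it. The following is a minimal sketch of the metric, assuming the commonly used zero-mean projection definition; the paper's own evaluation code is not shown on this page, and the function name `si_snr` and the `eps` parameter are illustrative choices, not taken from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal.

    Assumption: the common zero-mean projection definition. `estimate` and
    `target` are equal-length sequences of float samples.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean of each signal so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target: the part of the estimate that is
    # a scaled copy of the reference.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    scale = dot / (target_energy + eps)
    projection = [scale * b for b in t]

    # Whatever remains after the projection is treated as noise.
    noise = [a - p for a, p in zip(e, projection)]

    projection_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((projection_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged. Under this definition, a gain of one system over another on the same reference is the difference between their two SI-SNR values in dB.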
